CorpuScript: an automated text-cleaning tool for corpus linguistics

Description

Type

Conference paper

Abstract

Corpus compilation remains a significant challenge in corpus linguistics. This paper introduces CorpuScript, a text-cleaning tool designed to help researchers prepare corpora. By combining software engineering with corpus linguistics methods, the tool can significantly improve the corpus compilation workflow, specifically the task of corpus cleaning. The need for CorpuScript emerged from recurring challenges faced by our research team, particularly in our current corpus research project, in which a considerably large number of texts had to be cleaned before they could be used for data analysis. Given the pressing need for an automated solution to improve text cleaning in that project, CorpuScript was developed to accelerate corpus compilation while meeting the requirements outlined in our corpus design.

Subject

Corpus linguistics, Computational linguistics, Software engineering

External URL

https://www.elc-ebralc.net.br/_files/ugd/75f182_29b728735a1a48b99531566f48678cdc.pdf
