CorpuScript: an automated text-cleaning tool for corpus linguistics
| dc.creator | Jhonatan Henrique Lopes Alves | |
| dc.creator | Ana Eliza Pereira Bocorny | |
| dc.creator | Deise Prina Dutra | |
| dc.creator | Carolina Godoi de Faria Marques | |
| dc.creator | Gustavo Leal Teixeira | |
| dc.creator | Danilo Duarte Costa | |
| dc.date.accessioned | 2025-07-17T13:33:40Z | |
| dc.date.accessioned | 2025-09-09T00:17:00Z | |
| dc.date.available | 2025-07-17T13:33:40Z | |
| dc.date.issued | 2024-10 | |
| dc.description.sponsorship | FAPEMIG - Fundação de Amparo à Pesquisa do Estado de Minas Gerais | |
| dc.identifier.isbn | 9786501233710 | |
| dc.identifier.uri | https://hdl.handle.net/1843/83603 | |
| dc.language | eng | |
| dc.publisher | Universidade Federal de Minas Gerais | |
| dc.relation.ispartof | XVI Encontro de Linguística de Corpus e da XII Escola Brasileira de Linguística Computacional | |
| dc.rights | Open Access | |
| dc.subject | Corpus linguistics | |
| dc.subject | Computational linguistics | |
| dc.subject | Software engineering | |
| dc.title | CorpuScript: an automated text-cleaning tool for corpus linguistics | |
| dc.type | Conference paper | |
| local.citation.epage | 167 | |
| local.citation.issue | 16; 12 | |
| local.citation.spage | 163 | |
| local.description.resumo | Corpus compilation remains a significant challenge in corpus linguistics. This paper introduces CorpuScript, a text-cleaning tool designed to help researchers prepare corpora. By combining software engineering with corpus linguistics methods, the tool can significantly improve the corpus compilation workflow, specifically the task of corpus cleaning. The need for CorpuScript emerged from recurring challenges faced by our research team, particularly in our current corpus research project, in which a considerably large number of texts had to be cleaned before they could be used for data analysis. Given the pressing need for an automated solution to improve text cleaning in that project, CorpuScript was developed to accelerate corpus compilation while meeting the requirements outlined in our corpus design. | |
| local.publisher.country | Brazil | |
| local.publisher.department | ICA - INSTITUTO DE CIÊNCIAS AGRÁRIAS | |
| local.publisher.initials | UFMG | |
| local.url.externa | https://www.elc-ebralc.net.br/_files/ugd/75f182_29b728735a1a48b99531566f48678cdc.pdf |