CorpuScript: an automated text-cleaning tool for corpus linguistics

dc.creator: Jhonatan Henrique Lopes Alves
dc.creator: Ana Eliza Pereira Bocorny
dc.creator: Deise Prina Dutra
dc.creator: Carolina Godoi de Faria Marques
dc.creator: Gustavo Leal Teixeira
dc.creator: Danilo Duarte Costa
dc.date.accessioned: 2025-07-17T13:33:40Z
dc.date.accessioned: 2025-09-09T00:17:00Z
dc.date.available: 2025-07-17T13:33:40Z
dc.date.issued: 2024-10
dc.description.sponsorship: FAPEMIG - Fundação de Amparo à Pesquisa do Estado de Minas Gerais
dc.identifier.isbn: 9786501233710
dc.identifier.uri: https://hdl.handle.net/1843/83603
dc.language: eng
dc.publisher: Universidade Federal de Minas Gerais
dc.relation.ispartof: XVI Encontro de Linguística de Corpus e da XII Escola Brasileira de Linguística Computacional
dc.rights: Open Access
dc.subject: Corpus linguistics
dc.subject: Computational linguistics
dc.subject: Software engineering
dc.title: CorpuScript: an automated text-cleaning tool for corpus linguistics
dc.type: Conference paper
local.citation.epage: 167
local.citation.issue: 16; 12
local.citation.spage: 163
local.description.resumo: Corpus compilation remains a significant challenge in the field of corpus linguistics. This paper introduces CorpuScript, an innovative text-cleaning tool aimed at aiding researchers in corpus preparation. By combining software engineering with corpus linguistics methods, the tool can significantly improve the workflow for corpus compilation, specifically the task of corpus cleaning. The need for CorpuScript emerged from recurring challenges experienced by our research team, particularly during our current corpus research project, in which a considerably large number of texts needed to be cleaned before being used for data analysis. Given the pressing need for an automated solution to improve text cleaning in that project, CorpuScript was developed to help us accelerate corpus compilation while meeting the requirements outlined in our corpus design.
local.publisher.country: Brazil
local.publisher.department: ICA - INSTITUTO DE CIÊNCIAS AGRÁRIAS
local.publisher.initials: UFMG
local.url.externa: https://www.elc-ebralc.net.br/_files/ugd/75f182_29b728735a1a48b99531566f48678cdc.pdf

Files

Original bundle

Name:
Corpuscript an automated text-cleaning tool for corpus linguistics.pdf
Size:
133.89 KB
Format:
Adobe Portable Document Format

License bundle

Name:
License.txt
Size:
1.99 KB
Format:
Plain Text
Description: