Agrupamento automático de notícias de jornais on-line usando técnicas de machine learning para clustering de textos no idioma português

Lúcia Helena de Magalhães

Please use this identifier to cite or link to this item: http://hdl.handle.net/1843/37525

Type:	Thesis
Title:	Agrupamento automático de notícias de jornais on-line usando técnicas de machine learning para clustering de textos no idioma português
Authors:	Lúcia Helena de Magalhães
First Advisor:	Renato Rocha Souza
First Referee:	Emerson Augusto Priamo Moraes
Second Referee:	Luiz Cláudio Gomes Maia
Third Referee:	Maurício Barcellos Almeida
Fourth Referee:	Renata Maria Abrantes Baracho Porto
Abstract:	Clusterização é uma técnica de organizar dados em grupos cujos membros apresentam alguma semelhança. Assim, a proposta desta pesquisa é utilizar as tecnologias de Mineração de Textos, Processamento de Linguagem Natural, Machine Learning e Clustering, para criar grupos de informes semelhantes a partir de uma amostra recuperada dos principais jornais on-line, uma vez que existem poucos estudos relacionados ao tema clustering de notícias publicadas no idioma português. Dessa forma, a lacuna de pesquisas nessa área acaba por reforçar e aprofundar a escassez de informação relacionada ao desenvolvimento de soluções automatizadas, capazes de recuperar e comparar as matérias em destaque na mídia, publicadas na língua brasileira, e agrupá-las por similaridade. Assim, este estudo tem como objetivo utilizar uma metodologia de aprendizado não supervisionado, que seja capaz de agrupar, automaticamente, notícias publicadas no idioma do Brasil, postadas na grande mídia. Além disso, busca identificar quais são os principais métodos utilizados no processo de clustering de textos; aplicar essas técnicas em uma coleção de notícias publicadas na língua portuguesa e verificar o desempenho dos algoritmos de clusterização ao serem alimentados por um corpus de textos; aplicar a metodologia em diferentes corpora e discutir o sucesso da técnica em cada caso; averiguar a possibilidade efetiva de clusterização dos documentos e analisar as dificuldades encontradas para diferentes amostras. Para tanto, são apresentados os conceitos e as áreas relacionadas com o tema, bem como a revisão bibliográfica dos trabalhos correlatos, a metodologia proposta e alguns experimentos que permitem desenvolver determinados argumentos e comprovar algumas hipóteses. Para as experimentações, primeiramente, coletaram-se as notícias e, em seguida, realizou-se o pré-processamento dos informes, etapa em que as stop words foram removidas e as técnicas de tokenização e stemming foram aplicadas. 
Assim, com o corpus preparado, extraíram-se as principais características dos textos e os documentos foram representados em um modelo de espaço vetorial. A semelhança entre as matérias foi encontrada através do cálculo da similaridade, imediatamente a técnica de clustering foi aplicada e consequentemente os grupos foram formados. Para melhor visualização, validação e interpretação dos resultados, apresentaram-se os clusters em dendrogramas e em diagramas de dispersão. As conclusões principais desta pesquisa indicaram que a etapa de pré-processamento exige um esforço especial para garantir a qualidade dos dados. Assim como a complexidade da língua portuguesa, a necessidade de atualização da lista de stop words, a detecção de quais características são mais importantes e, em geral, a complexidade dos problemas relacionados à alta dimensionalidade dos dados foram evidenciados durante todo o processo deste estudo. As medidas de distância também desempenharam um papel importante na análise de clustering, porém não existe uma que melhor se adapte a todos os problemas de agrupamento. O algoritmo k-means obteve os melhores resultados para esse tipo de informação e o Hierarchical Clustering apresentou dificuldades para corpus grande, visto que documentos semelhantes foram alocados em grupos diferentes. Já o algoritmo Affinity Propagation apresentou divergência quanto ao número ideal de clusters, mas conseguiu bom desempenho ao agrupar por similaridade.
Abstract:	Clustering is a technique for organizing data into groups whose members share some similarity. This research uses Text Mining, Natural Language Processing, Machine Learning and Clustering techniques to create groups of similar news reports from a sample retrieved from the main online newspapers, given that few studies address the clustering of news published in Portuguese. This lack of research reinforces and deepens the scarcity of information on the development of automated solutions capable of retrieving and comparing the stories featured in the media, published in Brazilian Portuguese, and grouping them by similarity. The study therefore aims to use an unsupervised learning methodology capable of automatically grouping news published in Brazilian Portuguese in the mainstream media. It also seeks to identify the main methods used in text clustering; to apply these techniques to a collection of news published in Portuguese and assess the performance of the clustering algorithms when fed a corpus of texts; to apply the methodology to different corpora and discuss the success of the technique in each case; and to investigate whether the documents can be clustered effectively and analyze the difficulties encountered with different samples. To that end, the thesis presents the concepts and areas related to the topic, a literature review of related work, the proposed methodology, and experiments that support certain arguments and confirm some hypotheses. For the experiments, the news articles were first collected and then pre-processed, a stage in which stop words were removed and tokenization and stemming were applied. With the corpus prepared, the main features of the texts were extracted and the documents were represented in a vector space model.
The similarity between articles was then computed, the clustering technique was applied, and the groups were formed. For better visualization, validation and interpretation of the results, the clusters were presented in dendrograms and scatter plots. The main conclusions indicate that the pre-processing stage requires special effort to guarantee data quality. The complexity of the Portuguese language, the need to keep the stop-word list up to date, the identification of the most important features and, more generally, the problems related to the high dimensionality of the data were evident throughout the study. Distance measures also played an important role in the clustering analysis, but no single measure best suits all clustering problems. The k-means algorithm obtained the best results for this type of data, while Hierarchical Clustering had difficulties with large corpora, since similar documents were allocated to different groups. The Affinity Propagation algorithm, in turn, diverged as to the ideal number of clusters but performed well when grouping by similarity.
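The pipeline outlined in the abstract (stop-word removal and tokenization, a vector space representation, a similarity computation, then clustering) can be illustrated with a minimal sketch. This is not the thesis's implementation: the stop-word list, the sample headlines and every function name below are hypothetical, stemming is omitted for brevity, and the clustering step is a small spherical k-means variant (cosine similarity, deterministic farthest-first seeding) chosen only to keep the example dependency-free.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real Portuguese list is far longer.
STOP_WORDS = {"a", "o", "as", "os", "de", "do", "da", "em", "e",
              "com", "no", "na", "que", "um", "uma", "para"}

def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def tfidf_vectors(docs):
    """Represent each document as a unit-length TF-IDF vector."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vocab = sorted(doc_freq)
    idf = {t: math.log(n_docs / doc_freq[t]) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors

def cosine(u, v):
    """Cosine similarity; inputs are already unit length, so a dot product suffices."""
    return sum(a * b for a, b in zip(u, v))

def kmeans(vectors, k, iterations=20):
    """Spherical k-means: assign by highest cosine similarity to a centroid."""
    # Farthest-first seeding: each new center is the vector least similar
    # to the centers chosen so far (deterministic, no random state).
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(min(vectors, key=lambda v: max(cosine(v, c) for c in centers)))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda j: cosine(v, centers[j])) for v in vectors]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                norm = math.sqrt(sum(x * x for x in mean)) or 1.0
                centers[j] = [x / norm for x in mean]
    return labels

# Hypothetical headlines: two about football, two about interest rates.
docs = [
    "O time venceu o jogo de futebol no estádio",
    "Jogo de futebol termina com vitória do time da casa",
    "Banco central aumenta a taxa de juros",
    "Taxa de juros sobe e banco revisa previsão",
]
vocab, vectors = tfidf_vectors(docs)
labels = kmeans(vectors, k=2)
print(labels)  # prints [0, 0, 1, 1]
```

On this toy input the two football headlines share no terms with the two interest-rate headlines, so their cosine similarity is zero and the two groups separate cleanly. Real news corpora are much noisier, which is where the pre-processing effort and the choice of distance measure discussed in the abstract become important.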
Subject:	Ciência da informação; Organização da informação; Aprendizado do computador; Processamento da linguagem natural (Computação)
language:	por
Publisher Country:	Brazil
Publisher:	Universidade Federal de Minas Gerais
Publisher Initials:	UFMG
Publisher Department:	ECI - ESCOLA DE CIÊNCIA DA INFORMAÇÃO
Publisher Program:	Programa de Pós-Graduação em Gestão e Organização do Conhecimento
Rights:	Open Access
URI:	http://hdl.handle.net/1843/37525
Issue Date:	13-Feb-2020
Appears in Collections:	Doctoral Theses (Teses de Doutorado)

Files in This Item:

File	Description	Size	Format
TeseLuciaHelenaUFMGversaoFinal_CorrecaoPosDefesa.pdf		29.39 MB	Adobe PDF