Credibilidade de exemplos em classificação automática

Joao Rafael de Moura Palotti

Please use this identifier to cite or link to this item: http://hdl.handle.net/1843/SLSS-8M3MZS

Type:	Master's Dissertation
Title:	Credibilidade de exemplos em classificação automática
Authors:	Joao Rafael de Moura Palotti
First Advisor:	Gisele Lobo Pappa
First Referee:	Adriano Alonso Veloso
Second Referee:	Marcos Andre Goncalves
Third Referee:	Aurora Trinidad Ramirez Pozo
Abstract:	Organizing and retrieving large amounts of information have become extremely important tasks, especially in the areas of Data Mining and Information Retrieval, which are responsible for studying ways to deal with this explosion of data. Among the various tasks studied by these two areas, we highlight Automatic Classification of data. In this dissertation, we address the problem of automatically classifying the available information. In particular, this work builds on the idea that not all examples in a training set should contribute equally to the construction of the classification model and that, therefore, treating some examples as more trustworthy than others can increase the effectiveness of the classifier. To deal with this problem, we propose estimating and employing credibility functions capable of capturing how much a classifier can trust an example when generating the model. Credibility is regarded in the literature as dependent on the context in which it is embedded, and also on who estimates it. To make its evaluation more objective, it is recommended that the factors influencing its calculation be defined. We define that, from the point of view of a classifier, two factors are crucial: attribute/class relations and relationships among examples. Attribute/class relations can be easily extracted using a large set of metrics previously proposed in the literature, mainly for the task of attribute selection. Relationships among examples can be created from a characteristic present in the dataset. For example, in the context of document classification, it has been shown that citation and authorship networks (which relate two documents according to their authors or cited articles) provide a great source of information for classification. Several metrics from the complex networks literature can be used to quantify these relationships. Based on these two factors, we selected 30 metrics to explore the credibility of attributes and 16 for relationships. They were inspired by metrics in the literature that indicate the separation between classes and investigate the characteristics of the relationships among examples. However, it is hard to say which of these metrics would be most appropriate for estimating the credibility of an example. Thus, because we have a large number of metrics for each factor, after experiments with isolated metrics, we created a Genetic Programming algorithm to better explore this space of metrics, generating credibility functions capable of improving the effectiveness of classifiers when associated with them. Genetic programming is an algorithm based on Darwin's principles of evolution, capable of traversing, in a robust and effective way, the large search space we are working with. The evolved functions were then incorporated into two classification algorithms: Naive Bayes and KNN. Experiments were carried out with three types of datasets: document collections, UCI datasets with exclusively categorical attributes, and a large dataset of protein signatures. The results show considerable gains in all scenarios, culminating in improvements of up to 17.51% in MacroF1 on the Ohsumed dataset and of 26.58% and 50.78% in MicroF1 and MacroF1 on the dataset of protein structural signatures.
Abstract:	Organizing and retrieving large amounts of information have become tasks of extreme importance, especially in the areas of Data Mining and Information Retrieval, which are responsible for finding ways to deal with this data explosion. Among the topics studied in these two areas is the Automatic Classification of data. In this thesis, we treat the problem of automatically classifying the available information. In particular, this work was developed on the premise that not all examples in a training set should contribute equally to the construction of a classification model, so assuming that some examples are more trustworthy than others can increase the effectiveness of the classifier. To deal with this problem, we propose the estimation and use of credibility functions capable of capturing how much a classifier should trust an example while generating the model. Credibility is considered in the literature to be context dependent and also dependent on who is estimating it. To make its evaluation more objective, it is recommended that the factors used for its calculation be defined. We define that, from the classifier's point of view, there are two crucial factors: attribute/class relations and relationships among examples. Attribute/class relations can be easily extracted using the many metrics already proposed in the literature, especially for the task of attribute selection. Relationships among examples can be derived from a feature that appears in the database. For example, in the context of document classification, it has been shown that citation and authorship networks (which relate two documents based on their authors or citations) are a rich source of information for classification. Several metrics from the complex networks literature can be used to quantify these relationships. Given these two factors, we selected 30 and 16 metrics to explore the credibility of attributes and relationships, respectively. They were inspired by metrics in the literature that indicate the separation among the classes and investigate characteristics of the relationships between examples. Nevertheless, it is hard to tell which of these metrics is most appropriate for estimating the credibility of an example. So, since there is a large number of metrics for each factor, after some experiments with isolated metrics, we developed a Genetic Programming algorithm to better explore this search space, generating credibility functions capable of improving the effectiveness of the classifiers they are associated with. Genetic programming is an algorithm based on Darwin's theory of evolution, capable of traversing the search space of functions in a robust and effective way. The evolved functions were then incorporated into two classification algorithms: Naive Bayes and KNN. Experiments were run using three different kinds of databases: document databases, UCI databases with exclusively categorical attributes, and a large protein signature database. The results show considerable improvement in classification in all cases. In particular, for the Ohsumed database, MacroF1 was improved by up to 17.51%, and for the protein structural signature database, MicroF1 and MacroF1 were improved by 26.58% and 50.78%, respectively.
Subject:	Computing; Data mining (Computing); Information retrieval systems
Language:	Portuguese
Publisher:	Universidade Federal de Minas Gerais
Publisher Initials:	UFMG
Rights:	Open Access
URI:	http://hdl.handle.net/1843/SLSS-8M3MZS
Issue Date:	23-Sep-2011
Appears in Collections:	Master's Dissertations

Files in This Item:

File	Description	Size	Format
joaorafaelmourapalotti.pdf		4.04 MB	Adobe PDF
