Understanding software defects with machine learning

Geanderson Esteves dos Santos

Please use this identifier to cite or link to this item: http://hdl.handle.net/1843/52751

Type:	Thesis
Title:	Understanding software defects with machine learning
Other Titles:	Entendendo defeitos de software com aprendizado de máquina
Authors:	Geanderson Esteves dos Santos
First Advisor:	Eduardo Magno Lages Figueiredo
First Co-advisor:	Adriano Alonso Veloso
First Referee:	Ivan do Carmo Machado
Second Referee:	Valter Vieira de Camargo
Third Referee:	Marco Túlio de Oliveira Valente
Fourth Referee:	Wagner Meira Júnior
Abstract:	Software defect prediction is an area of interest in both academia and industry. Defects are prevalent in software development and can create numerous difficulties for project managers, users, stakeholders, and developers. Recent studies report that approximately 42% of the software development budget goes to fixing defects. Although the literature offers many approaches to predicting the likelihood of defects, there is little understanding of which features contribute to the defects of a software project. Furthermore, most of the literature concentrates on predicting defects from a broad set of features, while the individual discriminating power of software features remains unknown, since some perform well only on specific projects. For this reason, in this thesis we aim to understand the features that affect the defectiveness of software projects. To do so, we apply machine learning techniques to popular datasets, conducting an exploratory investigation that produced thousands of models from a diverse collection of software features. These models are random in the sense that each one selects its features at random from the entire pool of software features. Even though the vast majority of the models are ineffective, several yield accurate predictions, distinguishing defect-prone classes from clean ones. We focus our investigation on models that rank a randomly chosen defective software class higher than a randomly chosen non-defective class more than 85% of the time. More importantly, we use these results to discuss a set of features that contribute to the understandability of model decisions. We observe that the best-performing models are simple to understand, as they rely on a small set of features. We therefore present the features that contribute to the defects of twelve projects and compare the thresholds of these features.
To validate the results, we survey 40 developers to measure their perceptions of the models and conclude that the models are fairly understandable. In addition, we evaluate perceptions of the quality attributes among active GitHub developers, 54 of whom participated in the investigation. We conclude that developers’ perceptions differ significantly from the machine learning models in terms of quality attributes. Finally, we compare the redundancies and similarities between defect models and code smells, as they share several features. Overall, this thesis promotes reasoning about which software features influence the defects of these projects.
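The selection criterion described in the abstract (ranking a randomly chosen defective class above a randomly chosen clean one) matches the usual probabilistic reading of the AUC statistic. The following is a minimal sketch of the random-model search under that reading, not the thesis's implementation: the toy data, the feature names, the `evaluate` and `random_search` helpers, and the "sum of normalized metrics" scoring rule are all assumptions invented for illustration.

```python
import random


def auc(scores, labels):
    """Fraction of (defective, clean) pairs where the defective class
    gets the higher score; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: one row per software class.
FEATURES = ["loc", "coupling", "comment_ratio", "depth"]
ROWS = [
    # loc, coupling, comment_ratio, depth, defective?
    (900, 14, 0.10, 3, 1),
    (750, 11, 0.30, 5, 1),
    (620, 9, 0.05, 2, 1),
    (300, 4, 0.25, 4, 0),
    (150, 2, 0.40, 1, 0),
    (410, 10, 0.15, 6, 0),
]
LABELS = [row[-1] for row in ROWS]


def evaluate(subset):
    """Assumed stand-in for a trained model: score each class by the sum
    of its min-max normalized values over the selected features."""
    idx = [FEATURES.index(name) for name in subset]
    lo = {i: min(r[i] for r in ROWS) for i in idx}
    hi = {i: max(r[i] for r in ROWS) for i in idx}
    # The toy columns are never constant, so hi[i] - lo[i] is non-zero.
    scores = [sum((r[i] - lo[i]) / (hi[i] - lo[i]) for i in idx)
              for r in ROWS]
    return auc(scores, LABELS)


def random_search(n_models, threshold=0.85, seed=0):
    """Draw random feature subsets and keep those whose AUC exceeds
    the threshold."""
    rng = random.Random(seed)
    kept = {}
    for _ in range(n_models):
        k = rng.randint(1, len(FEATURES))
        subset = tuple(sorted(rng.sample(FEATURES, k)))
        score = evaluate(subset)
        if score > threshold:
            kept[subset] = score
    return kept
```

Calling `random_search(200)` returns a dictionary that maps each retained feature subset to its AUC; inspecting which features recur across the retained subsets mirrors, in miniature, the kind of reasoning the abstract describes.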
Abstract (translated from the Portuguese):	Defect prediction is an area of interest in both academia and industry. Defects are common in software development and can create many difficulties for project managers, users, and developers. Recent studies reveal that about 42% of the development budget is spent fixing defects. Although the current literature offers multiple approaches to predicting the likelihood of defects, there is still a lack of understanding of the features that contribute to defects. Moreover, most of these studies concentrate on predicting defects from a broad set of features. However, the individual discriminating power of the features is still unknown, since some perform well only on specific projects. For this reason, in this thesis, our goal is to understand the features that affect defects in software projects. To that end, we applied machine learning techniques to popular datasets. We thus carried out an exploratory investigation that produced thousands of models from a diverse collection of features. These models are random because they select their features from the entire feature set. Although the vast majority of the models are ineffective, we were able to produce several models that give accurate predictions. The models thus distinguish defect-prone classes from classes without defects. We focus our investigation on models that classify a defective class with more than 85% precision. We then use these results to discuss a set of features that contribute to the explainability of the model. As a result, we note that the most effective models are easy to understand, since they depend on a small set of features. In addition, we compare the thresholds of these features.
To validate the results, we surveyed 40 developers to measure their perceptions of the models and concluded that the models are quite explainable. In addition, we evaluated developers’ perceptions of the quality attributes with active GitHub developers, obtaining 54 participants. We thus conclude that developers’ perceptions differ significantly from the models. Finally, we compare the similarities between defect prediction models and code smells. In the end, this thesis promotes reasoning about which software features influence the defects of these projects.
Subject:	Computing – Theses; Machine learning – Theses; Fault prediction – Theses; Source code (Computing) – Theses
Language:	eng
Publisher Country:	Brazil
Publisher:	Universidade Federal de Minas Gerais
Publisher Initials:	UFMG
Publisher Department:	ICX - DEPARTAMENTO DE CIÊNCIA DA COMPUTAÇÃO
Publisher Program:	Programa de Pós-Graduação em Ciência da Computação
Rights:	Open Access
Rights URI:	http://creativecommons.org/licenses/by-nc-nd/3.0/pt/
URI:	http://hdl.handle.net/1843/52751
Issue Date:	13-Feb-2023
Appears in Collections:	Teses de Doutorado

Files in This Item:

File	Description	Size	Format
thesis.pdf		2.1 MB	Adobe PDF

This item is licensed under a Creative Commons License