Exploring imbalanced data challenges: oversampling efficacy and sample size estimation
| dc.creator | Gabriel Oliveira Assunção | |
| dc.date.accessioned | 2025-04-25T16:45:04Z | |
| dc.date.accessioned | 2025-09-09T00:43:27Z | |
| dc.date.available | 2025-04-25T16:45:04Z | |
| dc.date.issued | 2025-02-27 | |
| dc.description.abstract | Class imbalance affects the accuracy and generalization of predictive models, making it essential to explore efficient strategies to mitigate this issue. In this thesis, we investigate the effectiveness of oversampling techniques and propose a method for estimating the optimal sample size in imbalanced classification problems. The results indicate that optimizing the decision threshold can replace the need for synthetic data generation, reducing reliance on oversampling. Additionally, the proposed methodology allows for the estimation of the necessary sample size to ensure more stable classifications, avoiding excessive data collection. Thus, this research contributes to understanding the impact of balancing techniques and provides more efficient alternatives for improving model performance. The proposed approach enables more informed decisions regarding sampling and preprocessing, minimizing the need for artificial data manipulation. | |
| dc.description.sponsorship | CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior | |
| dc.identifier.uri | https://hdl.handle.net/1843/81860 | |
| dc.language | eng | |
| dc.publisher | Universidade Federal de Minas Gerais | |
| dc.relation | Programa Institucional de Internacionalização – CAPES - PrInt | |
| dc.rights | Restricted access | |
| dc.subject | Statistics – Theses | |
| dc.subject | Probabilities – Theses | |
| dc.subject | Machine learning – Theses | |
| dc.subject | Sampling (Statistics) – Theses | |
| dc.subject.other | Machine learning | |
| dc.subject.other | Data augmentation | |
| dc.subject.other | Sample size | |
| dc.title | Exploring imbalanced data challenges: oversampling efficacy and sample size estimation | |
| dc.title.alternative | Explorando desafios de dados desequilibrados: eficácia da sobreamostragem e estimativa do tamanho da amostra | |
| dc.type | Doctoral thesis | |
| local.contributor.advisor-co1 | Rafael Izbicki | |
| local.contributor.advisor1 | Marcos Oliveira Prates | |
| local.contributor.advisor1Lattes | https://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4732895E1 | |
| local.contributor.referee1 | Uriel Moreira Silva | |
| local.contributor.referee1 | Anderson Luiz Ara Souza | |
| local.contributor.referee1 | Paulo Henrique Ferreira da Silva | |
| local.contributor.referee1 | Rafael Bassi Stern | |
| local.creator.Lattes | https://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K8189832Y6 | |
| local.description.embargo | 2027-02-27 | |
| local.description.resumo | Class imbalance affects the accuracy and generalization of predictive models, making it essential to seek efficient strategies to mitigate the problem. In this thesis, we investigate the effectiveness of oversampling techniques and propose a method for estimating the ideal sample size in imbalanced classification. The results indicate that optimizing the decision threshold can replace the need to generate synthetic data, reducing dependence on oversampling. In addition, the methodology developed makes it possible to estimate the sample size needed to ensure more stable classifications, avoiding excessive data collection. This research thus contributes to understanding the impact of balancing techniques and provides more efficient alternatives for improving model performance. The proposed approach supports better-informed decisions about sampling and preprocessing, minimizing the use of artificial manipulation of the data. | |
| local.publisher.country | Brazil | |
| local.publisher.department | ICX - Department of Statistics | |
| local.publisher.initials | UFMG | |
| local.publisher.program | Programa de Pós-Graduação em Estatística |
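
The abstract's central claim, that tuning the decision threshold can stand in for synthetic oversampling, can be illustrated with a small sketch. This is not the thesis's method: the simulated score distributions, the 5% minority rate, the F1 criterion, and the helper names `f1_score`, `best_threshold`, and `simulate_scores` are all assumptions made for illustration, and the record does not state which metric or classifier the author used.

```python
import random


def f1_score(labels, preds):
    """F1 for the positive (minority) class; returns 0.0 when there are no true positives."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(labels, scores, grid_size=99):
    """Scan an evenly spaced grid in (0, 1) and keep the threshold with the highest F1."""
    best_t, best_f1 = 0.5, -1.0
    for i in range(1, grid_size + 1):
        t = i / (grid_size + 1)
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(labels, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


def simulate_scores(n, minority_rate, rng):
    """Hypothetical classifier scores: a model fit on imbalanced data tends to
    assign low probabilities overall, so positives are also skewed below 0.5."""
    labels, scores = [], []
    for _ in range(n):
        y = 1 if rng.random() < minority_rate else 0
        s = rng.betavariate(2, 5) if y == 1 else rng.betavariate(1, 12)
        labels.append(y)
        scores.append(s)
    return labels, scores


rng = random.Random(0)
labels, scores = simulate_scores(5000, 0.05, rng)

# Baseline: the conventional 0.5 cutoff, with no resampling of the data.
default_f1 = f1_score(labels, [1 if s >= 0.5 else 0 for s in scores])

# Alternative: keep the data as is and move the cutoff instead.
tuned_t, tuned_f1 = best_threshold(labels, scores)

print(f"F1 at threshold 0.50: {default_f1:.3f}")
print(f"F1 at threshold {tuned_t:.2f}: {tuned_f1:.3f}")
```

Because the grid contains 0.5, the tuned F1 can never fall below the default on the data it was tuned on. In practice the threshold would be chosen on a validation split and reported on a separate test split, so that the gain is not an artifact of selection.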