Classification from imbalanced datasets. A framework for improving the application of sampling strategies

Author:
  1. Kraiem, Mohamed S.
Supervised by:
  1. María Navelonga Moreno García

Defending university: Universidad de Salamanca

Date of defense: 27 November 2020

Examination committee:
  1. Juan Francisco de Paz Santana (Chair)
  2. Daniel Hernández de la Iglesia (Secretary)
  3. Ana Maria Neves de Almeida Baptista de Figueiredo (Member)
Department:
  1. INFORMÁTICA Y AUTOMÁTICA

Type: Thesis

Teseo: 643678

Abstract

In recent years, the class imbalance problem has emerged as one of the most active research topics in supervised learning, where finding a suitable solution is still a challenge. Imbalanced data yield low classification performance because the predictive model is biased toward the majority class, while the minority class, represented by very few instances, is largely ignored or treated as noise. Datasets used by classification models often have a skewed distribution of class instances. This situation, known as imbalanced dataset classification, produces low predictive performance for the minority-class samples. Consequently, the prediction model is usually not valid, even though its global precision can be acceptable, since that precision is mainly obtained from the correct classification of the majority-class examples.

Some strategies and techniques are commonly used to address this problem, such as oversampling and undersampling, which are well-established procedures that balance the number of examples of each class. However, the efficiency of these strategies is affected by factors such as overlapping between classes, dataset size, borderline examples, imbalance ratio, intrinsic data characteristics and noisy data, among others.

This research is divided into two parts. The first part is a preliminary study of the effect of several sampling methods, applied to datasets with different imbalance ratios, on the classification performance of several learning algorithms. The main purpose of these initial experiments is to provide a reference for selecting the best-behaved algorithm to carry out the subsequent study, which constitutes the second and most important part of this research.

In the second part, different factors related to dataset characteristics are examined to determine which re-sampling method is most appropriate depending on the characteristics of the dataset, as well as the advantages and drawbacks of basic and advanced re-sampling techniques. The factors analyzed in the study are the imbalance ratio, overlapping between classes, borderline examples, the small disjuncts problem, data shift, small sample size, number of instances and number of attributes. Various evaluation measures were used to compare the models induced from imbalanced datasets before and after processing them with basic and advanced sampling strategies, including general metrics such as accuracy and measures specific to imbalanced data classification, such as optimized precision and the generalized index of balanced accuracy. Experiments were conducted on datasets covering a wide range of characteristics, which were treated with seven re-sampling techniques. Random Forest was selected as the classification algorithm for this study, since the preliminary experiments showed its better behavior in imbalanced data contexts compared with the other algorithms.
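As an illustration of the basic workflow described in the abstract (balancing class counts before training and judging the classifier with an imbalance-aware measure), the following Python sketch compares a Random Forest trained with no re-sampling, with SMOTE oversampling, and with random undersampling, using scikit-learn and imbalanced-learn. The synthetic dataset, the 1:9 class ratio, SMOTE as the oversampler and balanced accuracy as the metric are illustrative assumptions, not details taken from the thesis, which evaluates seven re-sampling techniques with measures such as optimized precision and the generalized index of balanced accuracy.

```python
# Minimal sketch (not the dissertation's exact pipeline): train a Random Forest
# on an imbalanced dataset with and without basic re-sampling and compare an
# imbalance-aware metric. Data, class ratio and metric are assumptions.
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Synthetic binary problem with roughly a 1:9 imbalance ratio (assumption).
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
print("training class counts:", Counter(y_tr))

def evaluate(X_train, y_train, label):
    """Fit a Random Forest and report balanced accuracy on the held-out set."""
    clf = RandomForestClassifier(n_estimators=200, random_state=42)
    clf.fit(X_train, y_train)
    score = balanced_accuracy_score(y_te, clf.predict(X_te))
    print(f"{label}: balanced accuracy = {score:.3f}")

evaluate(X_tr, y_tr, "no re-sampling")

# Oversampling: synthesize minority-class examples until classes are balanced.
X_os, y_os = SMOTE(random_state=42).fit_resample(X_tr, y_tr)
evaluate(X_os, y_os, "SMOTE oversampling")

# Undersampling: randomly discard majority-class examples instead.
X_us, y_us = RandomUnderSampler(random_state=42).fit_resample(X_tr, y_tr)
evaluate(X_us, y_us, "random undersampling")
```

Note that re-sampling is applied only to the training split; the held-out set keeps its original class distribution so that the evaluation reflects the imbalance the model would face in practice.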