DEVELOPMENT OF MULTICLASS DATASET RESAMPLING TECHNIQUE BASED ON DATA SIMILARITY DEGREE AND DATA DIFFICULTY FACTORS

DAKO, Dickson Apaleokhai

Please use this identifier to cite or link to this item: http://irepo.futminna.edu.ng:8080/jspui/handle/123456789/19715

Title:	DEVELOPMENT OF MULTICLASS DATASET RESAMPLING TECHNIQUE BASED ON DATA SIMILARITY DEGREE AND DATA DIFFICULTY FACTORS
Authors:	DAKO, Dickson Apaleokhai
Issue Date:	Nov-2021
Abstract:	Learning from skewed or imbalanced datasets is one of the most complex problems in machine learning and data classification. Preprocessing approaches for imbalanced data have received increasing research attention over the years, making it necessary to assess the scope of what has been achieved and what needs to be improved. Although numerous techniques for improving classifier performance have been introduced, most of them target binary problems, and identifying the conditions under which these techniques can be used efficiently remains an open research problem. This research work developed a multiclass resampling technique using class similarity degree and data difficulty factors. A nearest-neighbours technique was adopted to identify the neighbours of each example in the dataset and to measure the distance between each example x and its neighbours. This neighbourhood information was then used to derive the difficulty type of each example (safe or unsafe). Twenty samples were selected from each class in the imbalanced dataset and evaluated using the proposed method to derive the similarity degree between classes. Finally, the similarity degree and the difficulty type of each example were used to evaluate the safe level of the examples, which served as the criterion for selecting the examples to oversample and undersample. The new resampling technique, MIRT, was tested on five standard imbalanced datasets, selected for their differing levels of difficulty. After resampling, the datasets were classified using the KNN, SVM and CART classifiers. With the CART classifier, MIRT achieved 100 percent on four of the five datasets and outperformed the SOUP, SOUPBag and MRBB resampling techniques when compared using G-mean values. Among the classifiers used, CART performed considerably better than KNN and SVM. Finally, the similarity degree derived in this work can be further applied to datasets with more than four classes, which are more complex to classify.
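The neighbour-based difficulty typing and the G-mean comparison mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's MIRT implementation: the function names, the choice of Euclidean distance, the value of k and the 0.6 same-class threshold are all assumptions made for illustration, and the class similarity degree is not modelled because the abstract does not define it. The sketch assumes at least two examples and k of at least 1.

```python
import math


def knn_labels(points, labels, i, k):
    """Return the labels of the k nearest neighbours of example i (Euclidean)."""
    dists = [(math.dist(points[i], p), j) for j, p in enumerate(points) if j != i]
    dists.sort()  # ties are broken by example index
    return [labels[j] for _, j in dists[:k]]


def difficulty_type(points, labels, k=5, safe_ratio=0.6):
    """Tag each example 'safe' or 'unsafe' by its share of same-class neighbours.

    safe_ratio is a hypothetical threshold, not a value taken from the thesis.
    """
    types = []
    for i in range(len(points)):
        neigh = knn_labels(points, labels, i, k)
        same = sum(1 for lab in neigh if lab == labels[i])
        types.append("safe" if same / len(neigh) >= safe_ratio else "unsafe")
    return types


def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recalls (multiclass G-mean)."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return math.prod(recalls) ** (1 / len(recalls))


# Toy data: two well-separated clusters plus one class-"A" example
# sitting inside the class-"B" cluster.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (10.5, 10.5)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B", "A"]
types = difficulty_type(points, labels, k=3)
```

In this toy data only the last example, whose three nearest neighbours all belong to the other class, is tagged unsafe. A resampling scheme in the spirit of the abstract would then use such tags, together with a class similarity measure, to decide which examples to oversample or undersample.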
URI:	http://repository.futminna.edu.ng:8080/jspui/handle/123456789/19715
Appears in Collections:	Masters theses and dissertations

Files in This Item:

File	Description	Size	Format
DAKO DICKSON APALEOKHAI DEVELOPMENT OF MULTICLASS DATASET RESAMPLING TECHNIQUE BASED ON DATA SIMILARITY DEGREE AND DATA DIFFICULTY FACTORS.pdf		1.01 MB	Adobe PDF
