Please use this identifier to cite or link to this item:
http://irepo.futminna.edu.ng:8080/jspui/handle/123456789/19715
Title: | DEVELOPMENT OF MULTICLASS DATASET RESAMPLING TECHNIQUE BASED ON DATA SIMILARITY DEGREE AND DATA DIFFICULTY FACTORS |
Authors: | DAKO, Dickson Apaleokhai |
Issue Date: | Nov-2021 |
Abstract: | One of the most complex problems in machine learning and data classification is learning from skewed or imbalanced datasets. Preprocessing approaches for imbalanced data have received increasing research attention over the years, which makes it necessary to assess the scope of what has been achieved and what needs to be improved. Although numerous techniques for improving classifier performance have been introduced, most of them target binary problems; identifying the conditions for their efficient use is still an open research problem. This research work developed a multiclass resampling technique using class similarity degree and data difficulty factors. The Nearest Neighbours technique was adopted to find the neighbours of each example in the dataset and the distance between each example x and its neighbours. This neighbourhood information was then used to derive the difficulty type of each example (safe or unsafe). Twenty samples were selected from each class in the imbalanced dataset and evaluated using the proposed method to derive the similarity degree between classes. Finally, the similarity degree and difficulty type of each example were used to evaluate the safe level of examples, which then served as the criterion for selecting the examples to oversample and undersample. The new resampling technique, MIRT, was tested on five standard imbalanced datasets, selected for their differing levels of difficulty. After resampling, the datasets were classified using the KNN, SVM and CART classifiers. With the CART classifier, MIRT achieved 100 percent in 4 of the 5 datasets used and outperformed the SOUP, SOUPBag and MRBB resampling techniques, which were compared using G-mean values. Also, among the classifiers used, CART performed considerably better than KNN and SVM.
Finally, the similarity degree derived in this work can be further applied to datasets with more than four classes, which are more complex to classify. |
URI: | http://repository.futminna.edu.ng:8080/jspui/handle/123456789/19715 |
Appears in Collections: | Masters theses and dissertations |
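The abstract mentions two building blocks that a short sketch may make more concrete: deriving a safe/unsafe difficulty type from each example's nearest neighbours, and scoring a classifier with the G-mean. The Python below is only an illustration under stated assumptions, not the thesis's MIRT implementation. The function names, the neighbourhood size `k=5`, the `safe_min=4` threshold and the use of Euclidean distance are all hypothetical choices that do not come from the record, and the similarity-degree and resampling steps are not reproduced.

```python
from collections import Counter
from math import dist, prod


def difficulty_types(X, y, k=5, safe_min=4):
    """Label each example 'safe' or 'unsafe' from its k nearest neighbours.

    ASSUMPTION: k, safe_min and Euclidean distance are illustrative
    choices; the record does not state the values used in the thesis.
    """
    labels = []
    for i, xi in enumerate(X):
        # Distance from example i to every other example, paired with that
        # example's class, sorted so the k closest come first.
        neighbours = sorted(
            (dist(xi, xj), y[j]) for j, xj in enumerate(X) if j != i
        )[:k]
        # Count how many of the k nearest neighbours share the example's class.
        same_class = sum(1 for _, label in neighbours if label == y[i])
        labels.append("safe" if same_class >= safe_min else "unsafe")
    return labels


def g_mean(y_true, y_pred):
    """Multiclass G-mean: geometric mean of the per-class recalls."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return prod(recalls) ** (1 / len(recalls))


if __name__ == "__main__":
    # Toy data: six majority-class points near the origin, plus two
    # minority-class points (one inside the majority cluster, one far away).
    X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0, 0.5),
         (0.4, 0.6), (10, 10)]
    y = [0, 0, 0, 0, 0, 0, 1, 1]
    print(difficulty_types(X, y))
    print(g_mean([0, 0, 1, 1, 2, 2], [0, 0, 1, 0, 2, 2]))
```

Because the G-mean multiplies the per-class recalls, a classifier that ignores one class entirely scores zero. That property is presumably why it suits comparisons on imbalanced data like the one reported in the abstract.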
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
DAKO DICKSON APALEOKHAI DEVELOPMENT OF MULTICLASS DATASET RESAMPLING TECHNIQUE BASED ON DATA SIMILARITY DEGREE AND DATA DIFFICULTY FACTORS.pdf | | 1.01 MB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.