Imbalanced dataset classification performs poorly

Imbalanced datasets can have a tremendous negative impact on training a classification model. Is there a way in Azure ML to deal with this kind of challenge?
1 answer

SMOTE as a strategy for balancing a dataset

This article solves the following challenge: 

Imbalanced dataset classification performs poorly

If you cannot obtain more instances of the minority classes to balance the dataset, Azure ML offers a strategy for creating new minority-class instances by synthetic oversampling.
This method is called SMOTE (Synthetic Minority Oversampling Technique), a statistical method for increasing the number of instances of the smaller classes.
SMOTE expects a dataset with 2 classes as input.
If you have more than 2 classes, you have to split the dataset into chunks, each consisting of the majority class and one of the minority classes.
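Outside the Azure ML designer, the same per-class split can be sketched in plain Python. This is only an illustration under assumptions: the rows, the `x` feature, and the helper `two_class_subset` are hypothetical, and the label column is assumed to be named `Class` with values `Majority`, `Minority1`, and `Minority2`.

```python
import re

# Hypothetical rows; the "Class" column and its values are assumptions.
rows = [
    {"x": 1.0, "Class": "Majority"},
    {"x": 1.2, "Class": "Minority1"},
    {"x": 3.4, "Class": "Minority2"},
    {"x": 0.9, "Class": "Majority"},
]

def two_class_subset(rows, majority, minority, label="Class"):
    """Keep only rows labelled with the majority class or the given minority class."""
    # The trailing "$" prevents a label such as "Minority10" from matching "Minority1".
    pattern = re.compile(rf"^({re.escape(majority)}|{re.escape(minority)})$")
    return [row for row in rows if pattern.match(row[label])]

subset = two_class_subset(rows, "Majority", "Minority1")
print([row["Class"] for row in subset])
```

Calling the helper once per minority class yields one two-class chunk for each of them.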

  1. Create a Split Data module
  2. Choose Splitting mode: Regular Expression
  3. In the regular expression insert: \"Class" ^(Majority|Minority1)
  4. This expression splits the original dataset into a smaller one consisting only of instances of the Majority class and the Minority1 class.
    You can repeat this step for the other minority classes.

  5. The resulting dataset now has only 2 classes, which is what SMOTE expects as input
  6. Create a SMOTE module
  7. Choose the percentage of instances to be created, relative to the current size of the minority class, to balance out the class distribution.
  8. For example:
    Majority class has 1000 instances
    Minority class has 250 instances
    Choose a SMOTE percentage of 300 to add about 750 instances (3 × 250) to the minority class, bringing it to about 1000.

  9. Choose the number of nearest neighbours, which controls how similar the created instances are to the original instances. The higher this number, the more the created instances vary from the originals. For very similar instances, choose 1.
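To make the percentage and nearest-neighbour settings more concrete, here is a simplified sketch of the general SMOTE idea in plain Python. It is not the Azure ML implementation; the function `smote` and its parameters `percentage`, `k`, and `seed` are hypothetical names, and the minority data is randomly generated for illustration.

```python
import math
import random

def smote(minority, percentage=300, k=1, seed=0):
    """Return synthetic points for a list of numeric feature tuples.

    percentage: multiple of 100; 300 creates 3 synthetic points per original.
    k: number of nearest minority neighbours to interpolate towards.
    """
    rng = random.Random(seed)
    per_point = percentage // 100
    synthetic = []
    for i, point in enumerate(minority):
        # Rank the other minority points by Euclidean distance to this one.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(point, p))[:k]
        for _ in range(per_point):
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # position on the segment from point to neighbour
            synthetic.append(
                tuple(a + gap * (b - a) for a, b in zip(point, neighbour))
            )
    return synthetic

# Hypothetical minority class: 250 random 2-D points.
data_rng = random.Random(42)
minority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(250)]
new_points = smote(minority, percentage=300, k=5)
print(len(minority), len(new_points))  # 250 originals, 750 synthetic
```

With `k=1` every synthetic point lies between an original and its single closest neighbour, so the new instances stay very close to the originals; a larger `k` lets them spread further.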

Attention: If you have more than one minority class, you have to merge the SMOTE results, but include the majority class only once in the merged dataset.
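The merge rule can be sketched as follows, assuming each SMOTE result is a list of rows with a `Class` column and that every result contains the same majority rows. The helper `merge_smote_results` and the sample rows are hypothetical.

```python
def merge_smote_results(results, majority_label="Majority", label="Class"):
    """Combine per-minority SMOTE outputs, keeping majority rows from the first result only."""
    merged = []
    for index, rows in enumerate(results):
        for row in rows:
            if row[label] == majority_label and index > 0:
                continue  # majority rows were already added from the first result
            merged.append(row)
    return merged

# Hypothetical SMOTE outputs, each holding the same two majority rows.
majority = [{"x": 1.0, "Class": "Majority"}, {"x": 0.9, "Class": "Majority"}]
result_1 = majority + [{"x": 1.2, "Class": "Minority1"}, {"x": 1.3, "Class": "Minority1"}]
result_2 = majority + [{"x": 3.4, "Class": "Minority2"}, {"x": 3.5, "Class": "Minority2"}]

merged = merge_smote_results([result_1, result_2])
print(len(merged))  # 2 majority + 2 Minority1 + 2 Minority2 = 6
```

Without the check, the majority class would be duplicated once per minority class and the merged dataset would be imbalanced again.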
Now that the dataset is balanced, you can train your model again and see if the performance improved.
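When comparing the model before and after balancing, overall accuracy can hide poor minority-class performance, so a per-class metric such as recall is usually more telling. The sketch below uses hypothetical labels and a hypothetical helper `recall_per_class`; it only illustrates the comparison and is not an Azure ML feature.

```python
from collections import Counter

def recall_per_class(y_true, y_pred):
    """Fraction of instances of each true class that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Hypothetical labels: a model that always predicts the majority class.
y_true = ["Majority"] * 8 + ["Minority1"] * 2
y_pred = ["Majority"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                          # 0.8
print(recall_per_class(y_true, y_pred))  # {'Majority': 1.0, 'Minority1': 0.0}
```

Here the accuracy looks acceptable although no minority instance is recognised, which is the kind of result to check for after retraining.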
