
Manipulating machine learning datasets in VS .net

This article solves the following challenge: 

Manipulating machine learning datasets in VS .net

A web search turned up a NuGet package called "ArffTools" [1]. It provides a clean API to read .arff files into comprehensive C# classes (ArffReader) and to write the corresponding objects back to a new file (ArffWriter). With classes for the ARFF header, attributes, and attribute types, plus an array for the data instances, it makes working with ARFF files straightforward [2].
References:
[1] https://www.nuget.org/packages/ArffTools/
[2] https://github.com/chausner/ArffTools
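
For reference, the ARFF format that the package parses pairs a header (relation name and typed attributes) with a data section. A minimal, made-up example:

```
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny, 85, no
overcast, 64, yes
```

ArffTools maps the header lines to its attribute classes and each @data row to one instance array.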


SMOTE as a strategy for balancing dataset

This article solves the following challenge: 

Imbalanced dataset classification performs poorly

If you cannot obtain more instances of the minority classes to balance the dataset, Azure ML offers a strategy for creating new instances of the minority classes by synthetic oversampling.
This method is called SMOTE (Synthetic Minority Oversampling Technique), a statistical method for increasing the number of instances of the smaller classes.
SMOTE expects a dataset with exactly 2 classes as input.
If you have more than 2 classes, you have to split the dataset into chunks, each containing the majority class and one of the minority classes.

  1. Create a Split Data module
  2. Choose splitting mode: Regular expression
  3. As the regular expression insert: \"Class" ^(Majority|Minority1)
  4. This expression splits the original dataset into a smaller one consisting only of instances of the majority class and the Minority1 class.
    You can repeat this step for the other minority classes.

  5. The resulting dataset now has only 2 classes, which is what SMOTE expects as input
  6. Create a SMOTE module
  7. Choose the percentage of synthetic instances to create in order to balance out the class distribution.
  8. For example:
    Majority class has 1000 instances
    Minority class has 250 instances
    Choose a SMOTE percentage of 300 percent to add 750 synthetic instances (300 % of 250), bringing the minority class to 1000.

  9. Choose the number of nearest neighbours to control how similar the created instances are to the originals. The higher this number, the more the created instances vary from the originals. For very similar instances, choose 1.

Attention: if you have more than one minority class, you have to merge the SMOTE results, including the majority class only once.
Now that the dataset is balanced, you can train your model again and see if the performance improved.
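
The oversampling in steps 6-9 can be sketched in plain Python. This is a simplified illustration of the SMOTE idea (interpolate between a minority instance and one of its k nearest neighbours), not Azure ML's actual implementation; all names and the sample data are made up:

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a random
    minority instance and one of its k nearest neighbours (the SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x (squared Euclidean distance, excluding x itself)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2)]
# 300 % SMOTE: create 3 synthetic points per original minority instance
new_points = smote(minority, n_new=3 * len(minority))
print(len(new_points))  # 12
```

With k=1 every synthetic point lies on the segment to the single closest neighbour, which matches step 9's remark that a low neighbour count yields very similar instances.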


WEKA crash after exceeding the available memory

In the lecture "Machine Learning" I have to process different kinds of datasets and apply several machine learning techniques to them. A few classifiers, like tree-based ones, work very well on small datasets with few attributes but not on big datasets with many attributes. In the worst case, WEKA crashes because it exceeds its reserved memory space.

OutOfMemoryException when using WEKA 3-6 under Windows 7

Weka is a collection of machine learning algorithms for data mining tasks. The software is used by students who take the class "Machine Learning" at TU Wien. When the Weka user executes memory-intensive tasks (e.g. visualizing the results of a machine learning algorithm), Weka is very likely to throw an OutOfMemoryException, after which the software crashes. The exception is absolutely repeatable: the user gets it every time they execute the sequence of operations that previously led to it. There is therefore no way to work around this error other than giving up Weka. The OutOfMemoryException is not limited to Weka 3-6 on Windows 7.
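
A common mitigation (not mentioned in the original post, but standard Weka practice) is to enlarge the Java heap Weka runs in, since the default is often too small for visualization tasks. On Windows this is set via the maxheap entry in RunWeka.ini in the Weka installation directory, or by starting Weka manually with a larger -Xmx value:

```
# RunWeka.ini (Weka installation directory, Windows):
# maxheap=2048m

# Or launch Weka manually with a 2 GB heap:
java -Xmx2048m -jar weka.jar
```

This does not fix the underlying memory hunger of some algorithms, but it raises the ceiling at which the OutOfMemoryException occurs.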

Creating a database for text classification

The Internet seems the best choice because we are interested in collecting different types of data. The restriction for my test is that the positive data should contain only one-liners (short jokes). For a good classification, the negative data has to have the same structure (short sentences).
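
The structural restriction above can be enforced with a simple length filter when assembling the labeled dataset. A minimal sketch in Python; the texts and the 15-word cutoff are made-up assumptions:

```python
def is_short(text, max_words=15):
    """Keep only short texts so both classes share the same structure."""
    return len(text.split()) <= max_words

# Hypothetical raw collections gathered from the web
jokes = [
    "I used to be a banker but I lost interest.",
    "Why do programmers prefer dark mode? Because light attracts bugs, "
    "which is a long story involving many meetings and a whiteboard.",
]
sentences = [
    "The museum opens at nine every morning.",
    "Stock markets closed slightly lower on Friday.",
]

# Label 1 = one-liner joke (positive), 0 = plain short sentence (negative)
dataset = [(t, 1) for t in jokes if is_short(t)] + \
          [(t, 0) for t in sentences if is_short(t)]
print(len(dataset))  # 3 -- the long "joke" was filtered out
```

Filtering both classes with the same cutoff keeps the classifier from learning text length instead of joke content.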