Imbalanced vs. Balanced Datasets: Problems!
What is a Balanced Dataset?
A balanced dataset is one that contains an equal or almost equal number of samples from the positive and negative classes.
If the samples from one of the classes outnumber the other, the data is skewed in favor of one of the classes.
Let's assume we have two classes: Positive Class and Negative Class. If the number of positive samples is similar to the number of negative samples, the dataset is balanced.
The ratio of 50/50 or 60/40 or similar between the two classes is considered to be an example of a balanced dataset.
Figure 1 is one such representation of a balanced dataset.
What is an imbalanced dataset?
Data imbalance usually reflects an unequal distribution of classes within a dataset, where the number of instances of one class is much lower than the instances of the other classes. A classification data set with skewed class proportions is called imbalanced.
Classes that make up a large proportion of the data set are called majority classes. Those that make up a smaller proportion are minority classes.
Let’s assume we have two classes: Positive Class and Negative Class. If the number of Positive Class samples is much higher than the number of Negative Class samples, then the Positive Class is the majority class and the Negative Class is the minority class.
The ratio of 80/20 or 90/10 or similar between the two classes is considered to be an example of an imbalanced dataset.
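A quick way to see whether a dataset is balanced is to count how often each label occurs. The snippet below is a minimal sketch using only the Python standard library; the label list and the function name `class_distribution` are made up for illustration.

```python
from collections import Counter


def class_distribution(labels):
    """Return each class's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: count / total for cls, count in counts.items()}


# Hypothetical labels: 90 positive samples and 10 negative samples.
labels = ["positive"] * 90 + ["negative"] * 10

distribution = class_distribution(labels)
majority = max(distribution, key=distribution.get)
minority = min(distribution, key=distribution.get)

print(distribution)  # {'positive': 0.9, 'negative': 0.1}
print(majority, minority)  # positive negative
```

A 90/10 split like this one falls into the imbalanced range described above.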
Figure 2 is one such representation of an imbalanced dataset.
Classification Problems with Imbalanced Datasets!
Many of the classification predictive modeling problems that we are interested in solving in practice are imbalanced.
As such, it is surprising that imbalanced classification does not get more attention than it does.
Below is a list of eight examples of problem domains where the class distribution of examples is inherently imbalanced.
- Fraud Detection
- Claim Prediction
- Churn Prediction
- Spam Detection
- Anomaly Detection
- Outlier Detection
- Intrusion Detection
- Conversion Prediction
What’s WRONG with Accuracy Score in Imbalanced Dataset Classification?
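Suppose someone says: "My model has an accuracy score of 90% on the classification problem." Whether that is a good result depends on the class distribution, so let us first look at how accuracy is defined.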
Classification Metrics
Accuracy
Accuracy is probably the most straightforward and intuitive metric for classifier performance. It is simply the number of times we predicted the right class divided by the total number of predictions: accuracy = (TP + TN) / (TP + TN + FP + FN), where the four outcomes are defined as follows.
True Positive: A true positive (TP) is an outcome where the model correctly predicts the positive class.
True Negative: A true negative (TN) is an outcome where the model correctly predicts the negative class.
False Positive: A false positive (FP) is an outcome where the model incorrectly predicts the positive class.
False Negative: A false negative (FN) is an outcome where the model incorrectly predicts the negative class.
Let’s assume we have two classes: Positive Class and Negative Class. Positive Class is 90% of your data and Negative Class is the remaining 10%. Even if the model labels every data point as Positive Class, it classifies all Positive Class samples correctly and therefore reaches an accuracy of 90% without learning anything. This looks like a good score, but it is not: every Negative Class sample is misclassified, and the high accuracy only reflects the imbalance in the dataset.
Metrics that are better suited to an imbalanced dataset are precision, recall, and the F1-score.
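To make this concrete, here is a small sketch in plain Python that counts TP, TN, FP and FN for a classifier that always predicts the positive class on a 90/10 dataset. The helper names `confusion_counts` and `accuracy` and the 100-sample dataset are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, true negatives, false positives and false negatives."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical dataset: 90 positive samples (1) and 10 negative samples (0).
y_true = [1] * 90 + [0] * 10
# A "model" that ignores its input and always predicts the positive class.
y_pred = [1] * 100

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # 90 0 10 0
print(accuracy(tp, tn, fp, fn))  # 0.9

# Share of the negative samples that were identified correctly.
negative_hit_rate = tn / (tn + fp)
print(negative_hit_rate)  # 0.0
```

The accuracy is 0.9 even though not a single negative sample was recognized, which is why accuracy alone is misleading here.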
Precision & Recall
Precision tells you, out of all the samples you predicted as positive, how many were actually positive.
In other words, how accurate are the positive predictions?
Recall is the percentage of all relevant (actually positive) results that your algorithm correctly identified.
In simple words, recall is the coverage of the actual positive samples.
Let’s say I searched on Google for “what is precision and recall?” and in less than a minute I have about 15,600,000 results.
Let’s say out of these 15.6 million results, the relevant links to my question were about 2 million. Assuming there were also about 6 million more results that were relevant but weren’t returned by Google, for such a system we would say that it has a precision of 2M/15.6M and a recall of 2M/8M.
This means that Google’s algorithm retrieved a fraction of 0.25 of all the relevant links (recall), and that a fraction of about 0.13 of the retrieved links were relevant (precision).
A model that produces no false positives has a precision of 1.0, and a model that produces no false negatives has a recall of 1.0.
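The search example can be checked with a few lines of Python. This is a sketch, not a reference implementation: the function names are illustrative, and returning 0.0 when a denominator is zero is a convention chosen here, not something the article prescribes.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of actual positives that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Numbers from the search example in the text.
retrieved = 15_600_000          # results returned by the search
relevant_retrieved = 2_000_000  # returned results that were relevant
relevant_missed = 6_000_000     # relevant results that were not returned

tp = relevant_retrieved              # relevant and returned
fp = retrieved - relevant_retrieved  # returned but not relevant
fn = relevant_missed                 # relevant but not returned

print(round(precision(tp, fp), 2))  # 0.13
print(round(recall(tp, fn), 2))     # 0.25
```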
F1-score
The F1 score is the harmonic mean of precision and recall, F1 = 2 * (precision * recall) / (precision + recall), so it conveys the balance between the two.
That is, a good F1 score means that you have low false positives and low false negatives, so you’re correctly identifying real threats and you are not disturbed by false alarms. An F1 score is considered perfect when it’s 1, while the model is a total failure when it’s 0.
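As a rough sketch, the F1 score can be computed from precision and recall as their harmonic mean, 2 * P * R / (P + R). The function name `f1_score_from` is hypothetical, and treating the 0/0 case as 0.0 is an assumption made here for convenience.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(f1_score_from(1.0, 1.0))  # 1.0  (perfect precision and recall)
print(f1_score_from(0.0, 0.0))  # 0.0  (total failure)

# Precision and recall from the search example: 2M/15.6M and 2M/8M.
search_f1 = f1_score_from(2 / 15.6, 2 / 8)
print(round(search_f1, 2))  # 0.17
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 score by doing well on only one of precision or recall.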
Note on Errors: Type I & II Errors
A Type I Error leads to a False Positive (FP): for instance, when a classifier predicts a data point as “Positive Class” and it turns out to be “Negative Class”.
An example of a false positive is when a particular test designed to detect melanoma, a type of skin cancer, tests positive for the disease, even though the person does not have cancer.
A Type II Error leads to a False Negative (FN): for instance, when a classifier predicts a data point as “Negative Class” and it turns out to be “Positive Class”.
A false negative is a test result that indicates a person does not have a disease or condition when the person actually does have it.
Summary of Type I vs Type II
How to Handle Imbalanced Classes in Machine Learning?
Sampling is one of the strategies used to resolve an imbalanced dataset classification problem.
Data sampling provides a collection of techniques that transform a training dataset in order to balance or better balance the class distribution.
Once balanced, standard machine learning algorithms can be trained directly on the transformed dataset without any modification.
This allows the challenge of imbalanced classification, even with severely imbalanced class distributions, to be addressed with a data preparation method.
There are many different types of data sampling methods that can be used, and there is no single best method to use on all classification problems and with all classification models.
Balance the training set in some way:
- Oversample the minority class.
- Undersample the majority class.
- Synthesize new minority class samples.
- Assign class weights to the majority class and the minority class.
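The last option in the list, class weights, leaves the data unchanged and instead makes mistakes on the minority class count more during training. One common heuristic is to weight each class inversely to its frequency, n_samples / (n_classes * class_count), which to my knowledge is also the formula scikit-learn documents for its "balanced" class weight mode. The sketch below computes these weights in plain Python; the function name and labels are illustrative assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical 90/10 dataset.
labels = ["positive"] * 90 + ["negative"] * 10
weights = balanced_class_weights(labels)

print(round(weights["positive"], 3))  # 0.556
print(weights["negative"])            # 5.0
```

With these weights, an error on a negative sample counts nine times as much as an error on a positive sample, which offsets the 9:1 imbalance.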
Let’s assume we have two classes: Positive Class And Negative Class: Positive Class is 90% of your data and Negative Class is the remaining 10%.
Undersampling:
With undersampling, we start from the same split: Positive Class is 90% of the data and Negative Class is the remaining 10%.
To balance the dataset, we randomly keep only as many Positive Class samples as there are Negative Class samples, so each class contributes a share equal to 10% of the original dataset.
A disadvantage of undersampling: the balanced dataset is only 20% of the original size, so we throw away 80% of the data, which is a huge loss. For this reason we will also consider oversampling.
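A minimal sketch of random undersampling in plain Python is shown below. The function `random_undersample`, the fixed seed, and the toy dataset are assumptions for illustration; in practice a library such as imbalanced-learn (linked in the references) offers ready-made samplers.

```python
import random
from collections import Counter


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples until every class is as small as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = min(len(group) for group in by_class.values())

    out_samples, out_labels = [], []
    for label, group in by_class.items():
        kept = rng.sample(group, target)  # sampling without replacement
        out_samples.extend(kept)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Hypothetical dataset: 100 samples, 90 positive (1) and 10 negative (0).
samples = list(range(100))
labels = [1] * 90 + [0] * 10

under_samples, under_labels = random_undersample(samples, labels)
under_counts = Counter(under_labels)
print(under_counts[1], under_counts[0])  # 10 10
print(len(under_samples))                # 20
```

Only 20 of the 100 original samples survive, which is the 80% data loss described above.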
Oversampling:
With oversampling, we again start from the same split: Positive Class is 90% of the data and Negative Class is the remaining 10%.
To balance the dataset, we keep all of the Positive Class samples (90%) and duplicate the Negative Class samples until each appears 9 times, so the original 10% grows to match the 90%.
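Random oversampling can be sketched in a similar way. Instead of repeating every minority sample exactly nine times, the version below draws the extra minority samples at random with replacement, which is a common variant and gives the same class counts. The function name, seed, and toy dataset are illustrative assumptions.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = max(len(group) for group in by_class.values())

    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Keep every original sample, then add random duplicates up to the target.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Hypothetical dataset: 100 samples, 90 positive (1) and 10 negative (0).
samples = list(range(100))
labels = [1] * 90 + [0] * 10

over_samples, over_labels = random_oversample(samples, labels)
over_counts = Counter(over_labels)
print(over_counts[1], over_counts[0])  # 90 90
print(len(over_samples))               # 180
```

No data is lost, but the minority class now consists largely of exact copies. Like the other balancing steps, this should be applied to the training set only, so that evaluation still reflects the real class distribution.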
Summary:
In this article we built a general intuition for what balanced and imbalanced datasets are, which classification problems are typically imbalanced, why the accuracy score is misleading on imbalanced data, and how to handle imbalanced classes with oversampling and undersampling.
Reference link for detailed examples on imbalanced datasets: https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/index.html
Other References: https://towardsdatascience.com/model-evaluation-i-precision-and-recall-166ddb257c7b