Techniques for handling Class Imbalance in Datasets

What is Class Imbalance?

While working on a classification problem where you don’t control sampling or capture of the data, you are likely to run into a situation where your data is class-imbalanced. Typically you will encounter this type of data where one of the classes represents an event that occurs rarely, relative to the larger population or sample. Domains where this is common include fraud and anomaly detection, catastrophic events, machine failure, etc. From a representative statistics and data profiling perspective it makes complete sense, but from a machine learning perspective it poses some unique challenges.

Taking the Titanic dataset as an example, you have a binary classification problem in which the observations of one class noticeably outnumber those of the other. In this case, the classes record whether or not a passenger survived.

Supporting notebook for this post available here

import pandas as pd
import seaborn as sns

titanic_train = pd.read_csv('train.csv')

# countplot draws one bar per class; distplot is deprecated in recent seaborn releases
sns.countplot(x='Survived', data=titanic_train)

print(titanic_train['Survived'].value_counts())
#0 549
#1 342

Why is using imbalanced data bad for Classification?

Nothing stops you from training, testing, or validating an algorithm on imbalanced data. The model may even appear to do well on all of these tasks, and immediately after deployment in a controlled release. In production, however, once it is exposed to a larger population, it can perform poorly and fail to generalize.

A general best practice with all classification problems is to use multiple evaluation metrics and, where applicable, a confusion matrix. A “rollup”, or single evaluation metric such as an F1 score, can make it look like your algorithm is doing well while hiding issues such as poor performance with False Positives or False Negatives. We can see an example of this by continuing with the Titanic dataset.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# X (features) and y ('Survived' labels) are assumed to be prepared from titanic_train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

log_regress = LogisticRegression(penalty='l2', dual=False, solver='liblinear').fit(X_train, y_train)

log_results = log_regress.predict(X_test)

# Rows are true classes, columns are predicted classes
sns.heatmap(confusion_matrix(y_test, log_results), annot=True, fmt="d")

print(f1_score(y_test, log_results))
# 0.7076923076923077

Here you can see that while the F1 score is neither amazing nor poor, you only get the full picture of how each class performs by plotting a confusion matrix. In it, the False Positive (upper right, 21) and False Negative (lower left, 34) counts are low but not great. The True Positive count is good, but the True Negative count has room for improvement.

As a general rule for classification problems, report model performance with a Confusion Matrix, or with Precision and Recall (both derived from the quadrants of a Confusion Matrix), so that you can more accurately identify how your model is performing on a per-class basis.
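
As a minimal sketch of how these per-class figures fall out of the four quadrants, the snippet below uses made-up counts (not the Titanic results) arranged in scikit-learn's layout, where rows are true classes and columns are predicted classes. The variable names are illustrative only.

```python
import numpy as np

# Hypothetical binary confusion matrix (illustrative counts only).
cm = np.array([[130, 20],   # true 0: TN, FP
               [35, 65]])   # true 1: FN, TP

tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)    # of the predicted positives, how many were right
recall = tp / (tp + fn)       # of the actual positives, how many were found
specificity = tn / (tn + fp)  # of the actual negatives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"specificity={specificity:.3f} f1={f1:.3f}")
```

Two models with a similar F1 score can differ widely in recall and specificity, which is why the individual quadrants are worth reporting.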

When do I balance my data?

It depends on how much imbalance you detect, and on the relative size of the dataset and the other classes. In a situation where you have millions of observations and notice an imbalance of less than, say, ten percent, you likely don’t need to perform any balancing (especially if you’re training a Neural Network). If you’re working with a dataset of fewer than five hundred observations, the same small percentage difference can be more dramatic and have more impact on training a model.

From my own experience, I will generally do some form of balancing if the dataset is in the “thousands of observations” size, where the imbalance is greater than a standard deviation of the number of observations of the next largest class (~32%). If it’s a dataset in the “hundreds of observations” size, I will perform balancing using a percentage threshold with some caution. The exception for this would be using a Neural Network architecture with pre-trained layers, or leveraging transfer learning from an existing network in a complementary domain.

Class Rebalancing Techniques

Like most techniques in Data Science, there is no one right way to do things and each offers its own scenario and tradeoff.

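One of the most important considerations with these techniques is the order in which they are applied relative to dataset splitting or a cross-validation run. Split first, then up-sample or apply any synthetic method to the training portion only (for cross-validation, inside each training fold). If you resample before splitting, duplicated or synthetic observations derived from test rows end up in the training data, so evaluation scores look better than they should and the model will fail to generalize when deployed.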
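
One way to keep the evaluation honest is to hold out the test rows before any resampling, so that no duplicated or synthetic row derived from a test observation reaches training. The following is a minimal sketch on made-up data, using plain random up-sampling from scikit-learn rather than a synthetic method; the array names and sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows, 3 features, 10% positives.
X = rng.normal(size=(200, 3))
y = np.array([1] * 20 + [0] * 180)

# 1. Hold out the test set first, preserving the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# 2. Up-sample the minority class inside the training portion only.
minority = y_train == 1
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    replace=True,
    n_samples=int((~minority).sum()),  # match the majority count
    random_state=0,
)
X_train_bal = np.vstack([X_train[~minority], X_min_up])
y_train_bal = np.concatenate([y_train[~minority], y_min_up])

print(np.bincount(y_train_bal))  # training classes are now balanced
print(np.bincount(y_test))       # test set keeps the original imbalance
```

The model is then fit on `X_train_bal` and scored on the untouched `X_test`, which still reflects the class ratio it will meet after deployment.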

Down sample majority class

Down sampling, or under sampling, is the most straightforward technique with the least amount of impact to your pipeline. It requires that you remove observations to bring the majority class count down to a level that is more in balance with, though not necessarily equal to, the minority or other classes.

Downsampling works well if you have enough data (thousands or millions of observations) that losing some observations from the majority class should not pose any problems for training. It also requires that the observations in the data are independent, which is a likely prerequisite for using this data for training anyway.

There are a few approaches to downsampling. The simplest is to take a random subset of the majority class and combine it with the existing minority class samples. The following is a simple example that resamples the majority class down to the size of the minority class.

from sklearn.utils import resample

major_class = titanic_train[titanic_train['Survived']==0]
minor_class = titanic_train[titanic_train['Survived']==1]

major_class.shape
# (549, 11)

minor_class.shape
# (342, 11)

# Sample without replacement down to the minority class size (342 rows)
downsample_major = resample(major_class, replace=False, n_samples=len(minor_class))
downsample_major.shape
# (342, 11)

balanced_train_data = pd.concat([downsample_major, minor_class])
balanced_train_data.shape
# (684, 11)

# Keeping data split and model training the same as before;
# log_results_downsamp holds the resulting predictions on the test split
print(f1_score(y_test, log_results_downsamp))
# 0.7263681592039802

sns.heatmap(confusion_matrix(y_test, log_results_downsamp), annot=True, fmt="d")

It’s not a dramatic increase in this example, but there is noticeable improvement in the True Negative and False Negative regions of the confusion matrix, as well as in the F1 score.

Up sample minority class

Up sampling, specifically using a synthetic method, is a little more involved. It requires that you create enough observations to bring the minority class up to a level that is more in balance with the majority or other classes.

In practice, up sampling works best if your dataset is not “wide” and contains only a few features. In a categorical dataset, you would have to create new observations for each feature using an imputation or frequency sampling approach. With each additional feature you run the risk of creating observations that are not realistic within the feature space, and may also introduce issues with feature importance when trying to explain the model later on.

SMOTE

The SMOTE (Synthetic Minority Oversampling Technique) family of algorithms is a popular approach to up sampling. It works by using existing data from the minority class and generating synthetic observations using a k nearest-neighbors approach.

At an abstract level, the algorithm looks at the feature space between observations in the minority class dataset. It takes the difference between an observation and one of its nearest neighbors, multiplies that difference by a random value between 0 and 1, and uses the result to place a synthetic observation along the feature vector between the two. It then randomly adds some of these synthetic observations (depending on how many new samples are needed) to the feature space.

Applying this to a real dataset, we can look at Boston Housing Prices. Taking a look at two observations and the features Age and Median Value, we can apply SMOTE to generate a synthetic observation. The first observation has 65.2 and 24 for Age and Median Value respectively, and the second has 78.9 and 21.6. Applying the algorithm with a random value of 0.6, we get:
Age: 65.2 + 0.6 × (78.9 - 65.2) = 73.42
Median Value: 24 + 0.6 × (21.6 - 24) = 22.56
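
The same interpolation can be sketched in a few lines of NumPy. This only illustrates the single interpolation step for one pair of observations, not the full SMOTE algorithm (there is no neighbor search here), and the variable names are assumptions.

```python
import numpy as np

# The two observations from the text: (Age, Median Value).
x_i = np.array([65.2, 24.0])         # starting minority observation
x_neighbor = np.array([78.9, 21.6])  # one of its nearest neighbors

gap = 0.6  # fixed to match the text; SMOTE draws this uniformly from [0, 1]

# The synthetic point lies on the line segment between the two observations.
x_synthetic = x_i + gap * (x_neighbor - x_i)

print(np.round(x_synthetic, 2))  # Age 73.42, Median Value 22.56
```

A gap of 0 reproduces the starting observation and a gap of 1 reproduces the neighbor, so every synthetic point stays between two real minority observations.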

As with most popular algorithms in Data Science, there is a package available, imbalanced-learn, that supports SMOTE as well as SMOTE-NC, the variant used with datasets that mix categorical and continuous features. An example of how this can be implemented with a real dataset is as follows:

from imblearn.over_sampling import SMOTENC

# Using SMOTENC for categorical features 'Sex' and 'Embarked' (column indices 2 and 7 of X)
X_smote, y_smote = SMOTENC(categorical_features=[2,7]).fit_resample(X.values, y.values)

# convenience functions to ensure common split, training, and metrics
X_train, X_test, y_train, y_test = prepandsplitdata(X_smote, y_smote) 
smote_model = trainmodel(X_train, y_train)
smote_f1, smote_cm = getmetrics(smote_model, X_test, y_test)

print(smote_f1)
# 0.7914110429447853

sns.heatmap(smote_cm, annot=True, fmt="d")

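Looking at the confusion matrix, we see a notable reduction in the number and percentage of False Positive and False Negative results compared with no rebalancing and with downsampling. Because the resampling in this example is applied before the split, the test set also contains synthetic observations, so part of this gain may be optimistic.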

ADASYN

ADASYN (Adaptive Synthetic Sampling Approach) is closely related to SMOTE. The major differentiator is that ADASYN determines how many synthetic observations are needed for each existing minority class observation, which in turn leads to a well-balanced dataset overall. An additional benefit is that it forces the learning algorithm to focus on difficult-to-learn, or heavily unbalanced, regions of the feature space. Taking a look at the algorithm in some detail will shed some additional light on how this works.

Procedurally, the algorithm takes some of the same steps as SMOTE, with some additional steps that provide the benefits mentioned above:

First, the degree of class imbalance is determined by dividing the count of the minority class, $m_s$, by the count of the majority class, $m_l$
$d = m_s/m_l$
If $d$ is below a preset threshold for the maximum tolerated degree of imbalance, the algorithm determines the number of synthetic observations required, using a parameter $\beta$ that specifies the desired balance level ($\beta = 1$ asks for a fully balanced dataset)
$G = (m_l - m_s) \times \beta$
Next, for each sample in the minority class, find its $K$ nearest neighbors using Euclidean distance, and calculate the ratio $r_i$, where $\Delta_i$ is the number of those neighbors that belong to the majority class
$r_i = \Delta_i/K , \quad i=1, ..., m_s$
Normalize the ratio so that it can be expressed as a density distribution, essentially a distribution of weights for the different minority class observations. This is used as the criterion to automatically decide the number of synthetic observations that need to be generated for each existing minority observation
$\hat{r}_i = r_i/\sum_{j=1}^{m_s}r_j$
From here on, the algorithm functions similarly to SMOTE: it generates synthetic observations using the same interpolation formula, but each minority observation receives $g_i = \hat{r}_i \times G$ of them, as dictated by the weights from the previous step.
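
The weighting steps can be traced on a small example. This is a sketch on hand-made 2-D points, not the imbalanced-learn implementation; the data, `K = 3`, and `beta = 1.0` are assumptions chosen so the arithmetic is easy to follow, and the threshold check on $d$ is omitted.

```python
import numpy as np

# Hypothetical majority points (21 in total).
X_major = np.array(
    [[0, 1], [1, 0], [0, -1], [-1, 0],                 # ring around the origin
     [4, 0], [4, 1], [5, 0], [5, 1], [6, 0], [6, 1]]   # a second cluster
    + [[x, -5] for x in range(11)],                    # a distant row
    dtype=float,
)
# Hypothetical minority points: one surrounded by majority points,
# two on a class border, and four in a clean minority cluster.
X_minor = np.array(
    [[0, 0], [3, 0], [3, 0.5], [10, 10], [10, 11], [11, 10], [11, 11]],
    dtype=float,
)

m_l, m_s = len(X_major), len(X_minor)
beta, K = 1.0, 3

d = m_s / m_l            # degree of class imbalance
G = (m_l - m_s) * beta   # total synthetic observations to create

X_all = np.vstack([X_major, X_minor])
is_major = np.array([True] * m_l + [False] * m_s)

r = np.empty(m_s)
for i, x in enumerate(X_minor):
    dist = np.linalg.norm(X_all - x, axis=1)  # Euclidean distances
    dist[m_l + i] = np.inf                    # exclude the point itself
    nearest = np.argsort(dist)[:K]
    r[i] = is_major[nearest].sum() / K        # r_i = Delta_i / K

r_hat = r / r.sum()                  # normalized weights, summing to 1
g = np.rint(r_hat * G).astype(int)   # synthetic observations per minority point

print(f"d={d:.2f}, G={G:.0f}")
print("r_hat:", np.round(r_hat, 3))
print("g:", g)
```

The minority point surrounded by majority neighbors receives the largest share of the 14 synthetic observations, the two border points receive a smaller share each, and the points inside the clean minority cluster receive none.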

As with SMOTE, ADASYN is also supported in the imbalanced-learn package. A short example follows:

from imblearn.over_sampling import ADASYN

X_adasyn, y_adasyn = ADASYN().fit_resample(X.values, y.values)

X_train, X_test, y_train, y_test = prepandsplitdata(X_adasyn, y_adasyn)
adasyn_model = trainmodel(X_train, y_train)
adasyn_f1, adasyn_cm = getmetrics(adasyn_model, X_test, y_test)

print(adasyn_f1)
# 0.7912772585669782

sns.heatmap(adasyn_cm, annot=True, fmt="d")

Note the similarity of the ADASYN and SMOTE-NC results. In this instance SMOTE-NC has a small edge over ADASYN with False Negatives. However, on datasets where the imbalance is more significant, ADASYN tends to perform slightly better than SMOTE.

Stratified Sampling

Stratified Sampling is a technique that ensures class proportions are maintained when the data is split into Training and Test datasets. The class balance seen during model training is therefore the same as the one used when evaluating model performance.

The advantage of this approach is that the class imbalance of the dataset as a whole is taken into consideration and reproduced equally in Training and Testing. The disadvantage arises when the proportions of the classes in the larger population of inferencing data are either not well known or not clearly represented in your dataset. E.g.:
Sample: Class 0: 25%, Class 1: 75%
Population: Class 0: 5%, Class 1: 95%

Scikit-learn provides a few different options for stratification using train_test_split(), as well as an option for a cross-validation K-fold approach.
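
A minimal sketch of both options follows. The labels are generated to mirror the Titanic class counts shown earlier, and the single placeholder feature column is an assumption made purely so the example runs on its own.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hypothetical labels with the Titanic class counts: 549 zeros, 342 ones.
y = np.array([0] * 549 + [1] * 342)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# stratify=y keeps the class ratio (about 38% positives) in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)
print(f"train positives: {y_train.mean():.3f}, test positives: {y_test.mean():.3f}")

# StratifiedKFold applies the same idea to every cross-validation fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_rates = [y[val_idx].mean() for _, val_idx in skf.split(X, y)]
print("validation positives per fold:", np.round(fold_rates, 3))
```

Without `stratify`, a random split of a small dataset can drift several percentage points away from the overall class ratio.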

Algorithm Class Weighting

You can additionally attempt to handle the class imbalance in your dataset by applying different class weights during training. This approach depends on whether the implementation of the algorithm that you are using supports something like a “class_weight” parameter, as many classification algorithms within Scikit-learn do.

The implementation allows you to specify a “balanced” flag, which uses the label values found in y to automatically set class weights inversely proportional to the class frequencies, using the following:

n_samples/(n_classes * np.bincount(y))
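
To make the formula concrete, here is a small sketch that evaluates it for the Titanic class counts shown earlier (549 and 342). The label array is reconstructed from those counts for illustration rather than loaded from the file.

```python
import numpy as np

# Labels with the Titanic class counts: 549 zeros, 342 ones.
y = np.array([0] * 549 + [1] * 342)

n_samples = len(y)
n_classes = len(np.unique(y))

# One weight per class, inversely proportional to its frequency.
class_weights = n_samples / (n_classes * np.bincount(y))

print(np.round(class_weights, 3))  # majority class below 1, minority class above 1
```

Errors on the minority class therefore cost more during training than errors on the majority class, without adding or removing any rows.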

Metric Adjustments

Yet another technique is to adjust your reporting and performance metrics to compensate for a known class imbalance. An example of this in Scikit-learn is the Balanced Accuracy Score metric, which averages the recall obtained on each class. This is equivalent to weighting each sample’s contribution to accuracy by the inverse prevalence of its true class.

Additionally, Sensitivity (True Positive Rate, or Recall) and Specificity (True Negative Rate), together with Precision, can help with identifying how well a classifier is performing. Some metrics in Scikit-learn also allow for an optional “weighted” averaging setting to help accurately report on model performance given a dataset class imbalance.
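
The difference between plain and balanced accuracy shows up clearly on a degenerate classifier. The sketch below uses made-up labels and a "model" that always predicts the majority class; it is meant only to illustrate the metrics.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical labels: 90 negatives and 10 positives.
y_true = np.array([0] * 90 + [1] * 10)
# A degenerate "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
balanced = balanced_accuracy_score(y_true, y_pred)
sensitivity = recall_score(y_true, y_pred, pos_label=1)  # recall of class 1
specificity = recall_score(y_true, y_pred, pos_label=0)  # recall of class 0

print(f"accuracy={accuracy:.2f} balanced={balanced:.2f}")
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

Plain accuracy rewards this model with 0.90, while balanced accuracy, the mean of sensitivity and specificity, drops to 0.50 and exposes that the minority class is never found.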

Conclusions

We have seen a number of different techniques that can be employed to rebalance a dataset. Some may be more appropriate than others given the size of the dataset, the type of data, the sampling method, and the choice of algorithm. As with most approaches in Data Science, no single approach stands above the others, but there are a few best practices worth mentioning:

Profile your data during Exploratory Data Analysis. This will assist with identifying class imbalance well ahead of Training and Evaluation.
Leverage more than one evaluation metric for your model. Having alternative metrics will provide additional insight into how the model is actually performing, and help to identify issues that may spring up post-Deployment.
Use more than one technique if applicable. In the SMOTE paper, the researchers indicate that a combination of downsampling the majority class and creating synthetic samples using SMOTE had a positive effect in some experiments.
Stratified sampling is almost always a good idea if you can be sure that the proportions of the classes in your sample match, or are close to, the proportions in the inferencing population.