
Working with imbalanced data

In machine learning it is quite common to have to deal with imbalanced datasets. This is particularly true in online learning for tasks such as fraud detection and spam classification. In these two cases, which are binary classification problems, there are usually many more 0s than 1s, which generally hinders the performance of the classifiers we throw at them.

As an example we'll use the credit card dataset available in river. We'll first use a collections.Counter to count the number of 0s and 1s in order to get an idea of the class balance.

import collections
from river import datasets

X_y = datasets.CreditCard()

counts = collections.Counter(y for _, y in X_y)

for c, count in counts.items():
    print(f'{c}: {count} ({count / sum(counts.values()):.5%})')
Downloading https://maxhalford.github.io/files/datasets/creditcardfraud.zip (65.95 MB)
Uncompressing into /home/runner/river_data/CreditCard
0: 284315 (99.82725%)
1: 492 (0.17275%)
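
To make the imbalance concrete, the short sketch below redoes the arithmetic on the two counts printed above. The variable names are illustrative only. It also shows why plain accuracy is a poor yardstick here: a hypothetical classifier that always predicts 0 would already look excellent.

```python
# Class counts copied from the output above.
counts = {0: 284315, 1: 492}

total = sum(counts.values())

# Share of the minority (positive) class in the whole stream.
minority_share = counts[1] / total

# Number of negatives for every positive.
imbalance_ratio = counts[0] / counts[1]

# Accuracy of a trivial classifier that always answers 0.
always_zero_accuracy = counts[0] / total

print(f"minority share: {minority_share:.5%}")
print(f"negatives per positive: {imbalance_ratio:.0f}")
print(f"accuracy of always predicting 0: {always_zero_accuracy:.5%}")
```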

Baseline

The dataset is quite unbalanced. For each 1 there are about 578 0s. Let's now train a logistic regression with default parameters and see how well it does. We'll measure the ROC AUC score.

from river import linear_model
from river import metrics
from river import evaluate
from river import preprocessing


X_y = datasets.CreditCard()

model = (
    preprocessing.StandardScaler() |
    linear_model.LogisticRegression()
)

metric = metrics.ROCAUC()

evaluate.progressive_val_score(X_y, model, metric)
ROCAUC: 89.11%
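
For readers unfamiliar with ROC AUC, here is a minimal sketch of what the score measures: the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The function `roc_auc` and the toy data are made up for illustration. This is the exact pairwise definition on a small batch, not the streaming computation used by `metrics.ROCAUC`.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: four negatives, two positives.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]

auc = roc_auc(y_true, scores)
print(f"ROC AUC: {auc:.3f}")
```

Because the score depends only on how positives rank relative to negatives, it is not inflated by the sheer number of 0s, which makes it a reasonable choice for imbalanced problems.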

Importance weighting

The performance is already quite acceptable, but as we will now see we can do even better. The first thing we can do is to add weight to the 1s by using the weight_pos argument of the Log loss function.

from river import optim

model = (
    preprocessing.StandardScaler() |
    linear_model.LogisticRegression(
        loss=optim.losses.Log(weight_pos=5)
    )
)

metric = metrics.ROCAUC()

evaluate.progressive_val_score(X_y, model, metric)
ROCAUC: 91.43%
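
The effect of a positive-class weight can be sketched in a few lines. The functions below are a conceptual illustration and not River's implementation: they assume the weight simply multiplies the per-sample log loss, so the gradient with respect to the logit becomes `weight * (p - y)`.

```python
import math


def weighted_log_loss(y_true, p, weight_pos=1.0, weight_neg=1.0):
    """Log loss of one sample, scaled by a class-dependent weight."""
    weight = weight_pos if y_true == 1 else weight_neg
    p = min(max(p, 1e-15), 1 - 1e-15)  # avoid log(0)
    loss = -math.log(p) if y_true == 1 else -math.log(1 - p)
    return weight * loss


def weighted_gradient(y_true, p, weight_pos=1.0, weight_neg=1.0):
    """Gradient of the weighted log loss with respect to the logit."""
    weight = weight_pos if y_true == 1 else weight_neg
    return weight * (p - y_true)


# A positive sample the model is unsure about (predicted probability 0.1).
plain = weighted_gradient(1, 0.1)
boosted = weighted_gradient(1, 0.1, weight_pos=5)

# A negative sample is unaffected by weight_pos.
negative = weighted_gradient(0, 0.1, weight_pos=5)

print(plain, boosted, negative)
```

Under this assumption, each 1 moves the model five times as far as it would with the default weight, while updates for 0s are unchanged.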

Focal loss

The deep learning for object detection community has produced a special loss function for imbalanced learning called focal loss. We are doing binary classification, so we can plug the binary version of focal loss into our logistic regression and see how well it fares.

model = (
    preprocessing.StandardScaler() |
    linear_model.LogisticRegression(loss=optim.losses.BinaryFocalLoss(2, 1))
)

metric = metrics.ROCAUC()

evaluate.progressive_val_score(X_y, model, metric)
ROCAUC: 91.31%
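
The idea behind focal loss can be sketched with its textbook form, which multiplies the log loss by `(1 - p_t) ** gamma`, where `p_t` is the probability assigned to the true class. This is an illustration of the general technique only. How the two positional arguments of `BinaryFocalLoss(2, 1)` map onto this formula is not stated in the text and is not assumed here.

```python
import math


def binary_focal_loss(y_true, p, gamma=2.0):
    """Textbook binary focal loss for one sample.

    p is the predicted probability of class 1. With gamma=0 this reduces
    to the ordinary log loss.
    """
    p_t = p if y_true == 1 else 1 - p
    p_t = min(max(p_t, 1e-15), 1.0)  # avoid log(0)
    return -((1 - p_t) ** gamma) * math.log(p_t)


# An easy positive (well classified) and a hard positive (badly classified).
easy_focal = binary_focal_loss(1, 0.9)
hard_focal = binary_focal_loss(1, 0.1)

easy_log = binary_focal_loss(1, 0.9, gamma=0)
hard_log = binary_focal_loss(1, 0.1, gamma=0)

print(f"log loss   hard/easy ratio: {hard_log / easy_log:.1f}")
print(f"focal loss hard/easy ratio: {hard_focal / easy_focal:.1f}")
```

Easy samples, which in an imbalanced stream are mostly confident 0s, contribute very little to the loss, so the rare hard samples dominate the updates.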

Under-sampling the majority class

Adding importance weights only works with gradient-based models (which includes neural networks). A more generic, and potentially more effective, approach is to use under-sampling and over-sampling. As an example, we'll under-sample the stream so that our logistic regression encounters 20% of 1s and 80% of 0s. Under-sampling has the additional benefit of requiring fewer training steps, and thus reduces the total training time.

from river import imblearn

model = (
    preprocessing.StandardScaler() |
    imblearn.RandomUnderSampler(
        classifier=linear_model.LogisticRegression(),
        desired_dist={0: .8, 1: .2},
        seed=42
    )
)

metric = metrics.ROCAUC()

evaluate.progressive_val_score(X_y, model, metric)
ROCAUC: 94.75%
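
One way to picture under-sampling is as rejection sampling: every sample of the rarest class (relative to the target) is kept, and the other classes are kept with a probability below 1. The sketch below is a simplified illustration, not River's code. The helper `keep_probabilities` is hypothetical, and it uses the final class counts, whereas a streaming sampler only knows the counts seen so far, so the number of samples it actually trains on will differ.

```python
import random


def keep_probabilities(actual_counts, desired_dist):
    """Probability of keeping a sample of each class when under-sampling."""
    # How far each class is from its target share, per observed sample.
    ratios = {c: desired_dist[c] / actual_counts[c] for c in desired_dist}
    # The class with the largest ratio is the scarcest one: always keep it.
    pivot = max(ratios, key=ratios.get)
    return {c: ratios[c] / ratios[pivot] for c in ratios}


actual_counts = {0: 284315, 1: 492}
desired_dist = {0: 0.8, 1: 0.2}

keep = keep_probabilities(actual_counts, desired_dist)

rng = random.Random(42)


def should_train(y):
    """Decide at random whether a sample with label y is used for training."""
    return rng.random() < keep[y]


expected_train_size = sum(actual_counts[c] * keep[c] for c in keep)

print(f"keep probability for 0s: {keep[0]:.5f}")
print(f"keep probability for 1s: {keep[1]:.5f}")
print(f"expected training samples: {expected_train_size:.0f}")
```

With these numbers, fewer than 1% of the 0s are kept, which is where the reduction in training time comes from.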

The RandomUnderSampler class is a wrapper for classifiers. This is represented by a rectangle around the logistic regression bubble when we visualize the model.

model
StandardScaler
    with_std: True
RandomUnderSampler
    desired_dist: {0: 0.8, 1: 0.2}
    seed: 42
    LogisticRegression
        optimizer: SGD (learning_rate=0.01, n_iterations=3633)
        loss: Log (weight_pos=1.0, weight_neg=1.0)
        l2: 0.0
        intercept_init: 0.0
        intercept_lr: Constant (learning_rate=0.01)
        clip_gradient: 1e+12
        initializer: Zeros ()

Over-sampling the minority class

We can also attain the same class distribution by over-sampling the minority class. This comes at the cost of having to train with more samples.

model = (
    preprocessing.StandardScaler() |
    imblearn.RandomOverSampler(
        classifier=linear_model.LogisticRegression(),
        desired_dist={0: .8, 1: .2},
        seed=42
    )
)

metric = metrics.ROCAUC()

evaluate.progressive_val_score(X_y, model, metric)
ROCAUC: 91.71%
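
The extra cost of over-sampling can be estimated with the same kind of arithmetic. In the sketch below, which is an illustration and not River's code, the most abundant class (relative to the target) is used once per sample and every other class is repeated at an average rate above 1. The helper `replication_rates` is hypothetical, and the final class counts are used for simplicity. One common way to realize a fractional rate is to draw the number of repetitions at random with that mean.

```python
def replication_rates(actual_counts, desired_dist):
    """Average number of times a sample of each class is used for training."""
    ratios = {c: desired_dist[c] / actual_counts[c] for c in desired_dist}
    # The class with the smallest ratio is the most abundant: use it once.
    pivot = min(ratios, key=ratios.get)
    return {c: ratios[c] / ratios[pivot] for c in ratios}


actual_counts = {0: 284315, 1: 492}
desired_dist = {0: 0.8, 1: 0.2}

rates = replication_rates(actual_counts, desired_dist)
expected_train_size = sum(actual_counts[c] * rates[c] for c in rates)

print(f"average repetitions of a 0: {rates[0]:.2f}")
print(f"average repetitions of a 1: {rates[1]:.2f}")
print(f"expected training samples: {expected_train_size:.0f}")
```

Each 1 is replayed roughly 144 times on average, so the classifier sees about 25% more samples than the stream contains.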

Sampling with a desired sample size

The downside of both RandomUnderSampler and RandomOverSampler is that you have no control over the amount of data the classifier trains on. The number of samples is adjusted so that the target distribution can be attained, either by under-sampling or by over-sampling. However, you can do both at the same time and choose how much data the classifier will see by using the RandomSampler class. In addition to the desired class distribution, you specify how much data to train on, and the samples are under-sampled and over-sampled as needed to fit these constraints. This is powerful because it lets you control both the class distribution and the size of the training data (and thus the training time). In the following example the model trains on 1 percent of the data.

model = (
    preprocessing.StandardScaler() |
    imblearn.RandomSampler(
        classifier=linear_model.LogisticRegression(),
        desired_dist={0: .8, 1: .2},
        sampling_rate=.01,
        seed=42
    )
)

metric = metrics.ROCAUC()

evaluate.progressive_val_score(X_y, model, metric)
ROCAUC: 94.71%
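
To see how a single sampling rate can under-sample one class and over-sample the other, consider the following sketch. It is a conceptual illustration with a hypothetical helper `sampling_rates`, not River's code, and it assumes that each class is used at an average rate of `sampling_rate * desired_share / actual_share`.

```python
def sampling_rates(actual_counts, desired_dist, sampling_rate):
    """Average number of times a sample of each class is used for training."""
    n = sum(actual_counts.values())
    return {
        c: sampling_rate * desired_dist[c] / (actual_counts[c] / n)
        for c in desired_dist
    }


actual_counts = {0: 284315, 1: 492}
desired_dist = {0: 0.8, 1: 0.2}

rates = sampling_rates(actual_counts, desired_dist, sampling_rate=0.01)
expected_train_size = sum(actual_counts[c] * rates[c] for c in rates)

print(f"rate for 0s: {rates[0]:.5f}")  # below 1: under-sampled
print(f"rate for 1s: {rates[1]:.5f}")  # above 1: over-sampled
print(f"expected training samples: {expected_train_size:.0f}")
```

Under this assumption the expected training size is the stream length multiplied by the sampling rate, whatever the class balance, which is what makes the training budget predictable.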

Hybrid approach

As you might have guessed by now, nothing is stopping you from mixing imbalanced learning methods together. As an example, let's combine imblearn.RandomUnderSampler and the weight_pos parameter from the optim.losses.Log loss function.

model = (
    preprocessing.StandardScaler() |
    imblearn.RandomUnderSampler(
        classifier=linear_model.LogisticRegression(
            loss=optim.losses.Log(weight_pos=5)
        ),
        desired_dist={0: .8, 1: .2},
        seed=42
    )
)

metric = metrics.ROCAUC()

evaluate.progressive_val_score(X_y, model, metric)
ROCAUC: 96.52%