Working with imbalanced data¶
In machine learning it is quite common to have to deal with imbalanced datasets. This is particularly true in online learning for tasks such as fraud detection and spam classification. In these two cases, which are binary classification problems, there are usually many more 0s than 1s, which generally hinders the performance of the classifiers we throw at them.
As an example we'll use the credit card dataset available in river. We'll first use a collections.Counter to count the number of 0s and 1s in order to get an idea of the class balance.
import collections
from river import datasets
X_y = datasets.CreditCard()
counts = collections.Counter(y for _, y in X_y)
for c, count in counts.items():
    print(f'{c}: {count} ({count / sum(counts.values()):.5%})')
Downloading https://maxhalford.github.io/files/datasets/creditcardfraud.zip (65.95 MB)
Uncompressing into /home/runner/river_data/CreditCard
0: 284315 (99.82725%)
1: 492 (0.17275%)
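To put these counts in perspective, here is a small sketch that derives two numbers from them: how many 0s there are for each 1, and the accuracy of a hypothetical model that always predicts 0. The counts are copied from the output above; the "always predict 0" model is only an illustration, not something used elsewhere on this page. It shows why plain accuracy tells us very little on this dataset.

```python
# Class counts copied from the output above.
counts = {0: 284_315, 1: 492}
total = sum(counts.values())

# Number of negatives (0s) for every positive (1).
imbalance_ratio = counts[0] / counts[1]

# Accuracy of a hypothetical model that always predicts the majority class.
majority_accuracy = counts[0] / total

print(f"negatives per positive: {imbalance_ratio:.1f}")
print(f"always-predict-0 accuracy: {majority_accuracy:.5%}")
```

Such a model would never flag a single fraud, yet it would be right almost all of the time.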
Baseline¶
The dataset is quite unbalanced. For each 1 there are about 578 0s. Let's now train a logistic regression with default parameters and see how well it does. We'll measure the ROC AUC score.
from river import linear_model
from river import metrics
from river import evaluate
from river import preprocessing
X_y = datasets.CreditCard()
model = (
    preprocessing.StandardScaler() |
    linear_model.LogisticRegression()
)
metric = metrics.ROCAUC()
evaluate.progressive_val_score(X_y, model, metric)
ROCAUC: 89.11%
Importance weighting¶
The performance is already quite acceptable but, as we will now see, we can do even better. The first thing we can do is give more weight to the 1s by using the weight_pos argument of the Log loss function.
from river import optim
model = (
    preprocessing.StandardScaler() |
    linear_model.LogisticRegression(
        loss=optim.losses.Log(weight_pos=5)
    )
)
metric = metrics.ROCAUC()
evaluate.progressive_val_score(X_y, model, metric)
ROCAUC: 91.43%
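To get an intuition for what importance weighting does, here is a minimal sketch of a class-weighted log loss in plain Python. It assumes the weight simply multiplies the loss (and therefore the gradient) of every positive example; the function names are made up for this illustration and are not River's API.

```python
import math


def weighted_log_loss(y_true, p, weight_pos=1.0, weight_neg=1.0):
    """Log loss for one example, scaled by a per-class weight.

    y_true is 0 or 1 and p is the predicted probability of class 1.
    """
    if y_true == 1:
        return -weight_pos * math.log(p)
    return -weight_neg * math.log(1.0 - p)


def weighted_gradient(y_true, p, weight_pos=1.0, weight_neg=1.0):
    """Gradient of the weighted loss with respect to the logit."""
    weight = weight_pos if y_true == 1 else weight_neg
    return weight * (p - y_true)


# A missed positive: the model gives only a 10% probability to a true 1.
plain = weighted_log_loss(1, 0.1)
weighted = weighted_log_loss(1, 0.1, weight_pos=5)
step = weighted_gradient(1, 0.1, weight_pos=5)
```

With weight_pos=5 the missed positive costs five times as much and produces a gradient five times larger, while the 0s are left untouched. Under this assumption, each rare 1 pulls on the weights as much as five ordinary samples would.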
Focal loss¶
The deep learning community working on object detection has produced a special loss function for imbalanced learning called focal loss. We are doing binary classification, so we can plug the binary version of focal loss into our logistic regression and see how well it fares.
model = (
    preprocessing.StandardScaler() |
    linear_model.LogisticRegression(loss=optim.losses.BinaryFocalLoss(2, 1))
)
metric = metrics.ROCAUC()
evaluate.progressive_val_score(X_y, model, metric)
ROCAUC: 91.31%
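The idea behind focal loss can be sketched in a few lines. The snippet below uses the common textbook form, which scales the log loss by (1 - p_t) ** gamma, where p_t is the probability given to the true class. This is an assumption made for illustration: the parametrization of BinaryFocalLoss in River may well differ, so treat this as a sketch of the principle rather than of the class used above.

```python
import math


def focal_loss(y_true, p, gamma=2.0):
    """Textbook binary focal loss for one example (no class-balancing term)."""
    # Probability the model assigns to the true class.
    p_t = p if y_true == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def log_loss(y_true, p):
    """Plain log loss, which is focal loss with gamma set to 0."""
    return focal_loss(y_true, p, gamma=0.0)


# How much of the plain log loss survives the focal scaling?
easy = focal_loss(0, 0.05) / log_loss(0, 0.05)  # confident, correct 0
hard = focal_loss(1, 0.05) / log_loss(1, 0.05)  # badly missed 1
```

With gamma=2, the confidently correct 0 keeps only a tiny fraction of its loss while the missed 1 keeps most of it. The many easy 0s therefore stop dominating the updates.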
Under-sampling the majority class¶
Adding importance weights only works with gradient-based models (which includes neural networks). A more generic, and potentially more effective, approach is to use under-sampling and over-sampling. As an example, we'll under-sample the stream so that our logistic regression encounters 20% of 1s and 80% of 0s. Under-sampling has the additional benefit of requiring fewer training steps, and thus reduces the total training time.
from river import imblearn
model = (
    preprocessing.StandardScaler() |
    imblearn.RandomUnderSampler(
        classifier=linear_model.LogisticRegression(),
        desired_dist={0: .8, 1: .2},
        seed=42
    )
)
metric = metrics.ROCAUC()
evaluate.progressive_val_score(X_y, model, metric)
ROCAUC: 94.75%
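To make the mechanism concrete, here is a rough sketch of how a stream can be under-sampled on the fly. It is a simplified illustration and not necessarily how RandomUnderSampler is implemented: the under_sample function, the synthetic stream and its 1% positive rate are all assumptions. The class that is scarcest relative to its desired share is always kept, and every other class is kept with a probability that thins it down to match.

```python
import collections
import random


def under_sample(labels, desired_dist, seed=42):
    """Yield the labels kept by a simple streaming under-sampler."""
    rng = random.Random(seed)
    seen = collections.Counter()
    for y in labels:
        seen[y] += 1
        # Pivot: the class seen so far that is scarcest relative to its
        # desired share. Its keep probability works out to exactly 1.
        pivot = max(seen, key=lambda c: desired_dist[c] / seen[c])
        keep_prob = (desired_dist[y] / seen[y]) / (desired_dist[pivot] / seen[pivot])
        if rng.random() < keep_prob:
            yield y


# Synthetic stream with roughly 1% of 1s (made-up data, not the credit card set).
stream_rng = random.Random(0)
labels = [int(stream_rng.random() < 0.01) for _ in range(100_000)]

kept = collections.Counter(under_sample(labels, {0: 0.8, 1: 0.2}))
share_of_ones = kept[1] / sum(kept.values())
print(kept, f"{share_of_ones:.1%}")
```

Only a small fraction of the stream reaches the model, which is where the saving in training time comes from.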
The RandomUnderSampler class is a wrapper for classifiers. This is represented by a rectangle around the logistic regression bubble when we visualize the model.
model
StandardScaler
{'counts': Counter({'Time': 284807,
'V1': 284807,
'V2': 284807,
'V3': 284807,
'V4': 284807,
'V5': 284807,
'V6': 284807,
'V7': 284807,
'V8': 284807,
'V9': 284807,
'V10': 284807,
'V11': 284807,
'V12': 284807,
'V13': 284807,
'V14': 284807,
'V15': 284807,
'V16': 284807,
'V17': 284807,
'V18': 284807,
'V19': 284807,
'V20': 284807,
'V21': 284807,
'V22': 284807,
'V23': 284807,
'V24': 284807,
'V25': 284807,
'V26': 284807,
'V27': 284807,
'V28': 284807,
'Amount': 284807}),
'means': defaultdict(<class 'float'>,
{'Amount': 88.34961925093155,
'Time': 94813.8595750808,
'V1': 2.9277520180090704e-15,
'V10': 2.419775348112352e-15,
'V11': 2.6777824308789593e-15,
'V12': -2.2140916080800113e-15,
'V13': 8.342900777166882e-16,
'V14': 1.903846574088133e-15,
'V15': 8.581815631423259e-15,
'V16': 1.4766213137618707e-15,
'V17': -1.6801787893383664e-16,
'V18': 5.854597499006342e-16,
'V19': 1.0841438330912623e-15,
'V2': 5.886023480140661e-16,
'V20': 7.744049542249276e-16,
'V21': 2.332071227413037e-16,
'V22': 4.956273530422241e-16,
'V23': -2.4249219202998693e-16,
'V24': 4.437131669261511e-15,
'V25': -6.981503896318856e-16,
'V26': 1.6805599541646309e-15,
'V27': -3.266881107112892e-16,
'V28': -1.173670292237036e-16,
'V3': -1.2140654523102711e-15,
'V4': 3.4083746059071583e-15,
'V5': 3.0974740213536643e-15,
'V6': 1.6259034591771526e-15,
'V7': -1.293283785185756e-16,
'V8': 3.1643541546820877e-16,
'V9': -1.6996522885539796e-15}),
'vars': defaultdict(<class 'float'>,
{'Amount': 62559.84938856013,
'Time': 2255116088.124347,
'V1': 3.8364757815609964,
'V10': 1.1855896488198305,
'V11': 1.041851426830977,
'V12': 0.9983999112951535,
'V13': 0.9905673151089326,
'V14': 0.9189023195064231,
'V15': 0.837800459457307,
'V16': 0.7678164267285925,
'V17': 0.721370914880897,
'V18': 0.7025368914993138,
'V19': 0.6626596101863256,
'V2': 2.726810450381156,
'V20': 0.594323307231822,
'V21': 0.5395236333332668,
'V22': 0.5266409057048476,
'V23': 0.3899492915994535,
'V24': 0.3668070828485584,
'V25': 0.27172987273928645,
'V26': 0.23254207582578096,
'V27': 0.16291861895803472,
'V28': 0.10895457872151114,
'V3': 2.2990211684909436,
'V4': 2.004676782760293,
'V5': 1.905074357779823,
'V6': 1.7749400245019011,
'V7': 1.5303951971990823,
'V8': 1.426473847533605,
'V9': 1.2069882295421888}),
'with_std': True}
RandomUnderSampler
{'_actual_dist': Counter({0: 284315, 1: 492}),
'_pivot': 1,
'_rng': <random.Random object at 0x565497299640>,
'classifier': LogisticRegression (
optimizer=SGD (
lr=Constant (
learning_rate=0.01
)
)
loss=Log (
weight_pos=1.
weight_neg=1.
)
l2=0.
l1=0.
intercept_init=0.
intercept_lr=Constant (
learning_rate=0.01
)
clip_gradient=1e+12
initializer=Zeros ()
),
'desired_dist': {0: 0.8, 1: 0.2},
'seed': 42}
LogisticRegression
{'_weights': {'Time': -1.643530810799457, 'V1': -0.07117984574594215, 'V2': 0.08689788816561651, 'V3': -0.2262482818395249, 'V4': 0.6827332420162907, 'V5': 0.1883362004546642, 'V6': -0.1169509365867438, 'V7': -0.13413474923347005, 'V8': -0.2575447733912746, 'V9': -0.028884393000813757, 'V10': -0.24916880001207672, 'V11': 0.32422036210718164, 'V12': -0.6194078910255971, 'V13': -0.03024537378274561, 'V14': -0.5855987715566455, 'V15': -0.09972202536223847, 'V16': -0.24026703465261673, 'V17': -0.05536505790548094, 'V18': 0.03247486327614661, 'V19': -0.0849483897575928, 'V20': -0.12459547198256604, 'V21': 0.04276103699144129, 'V22': 0.10363988666872358, 'V23': -0.08712048453858094, 'V24': 0.043970621647022, 'V25': -0.050376004211653315, 'V26': -0.02767069610819978, 'V27': 0.12223298288462735, 'V28': -0.019825032606385413, 'Amount': 0.027224523831184427},
'_y_name': None,
'clip_gradient': 1000000000000.0,
'initializer': Zeros (),
'intercept': -1.0699242219644576,
'intercept_init': 0.0,
'intercept_lr': Constant({'learning_rate': 0.01}),
'l1': 0.0,
'l2': 0.0,
'loss': Log({'weight_pos': 1.0, 'weight_neg': 1.0}),
'optimizer': SGD({'lr': Constant({'learning_rate': 0.01}), 'n_iterations': 3633})}
Over-sampling the minority class¶
We can also attain the same class distribution by over-sampling the minority class. This comes at the cost of having to train with more samples.
model = (
    preprocessing.StandardScaler() |
    imblearn.RandomOverSampler(
        classifier=linear_model.LogisticRegression(),
        desired_dist={0: .8, 1: .2},
        seed=42
    )
)
metric = metrics.ROCAUC()
evaluate.progressive_val_score(X_y, model, metric)
ROCAUC: 91.71%
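Over-sampling can be sketched as the mirror image of under-sampling: instead of dropping majority samples, minority samples are repeated. The snippet below is an illustrative simplification and not River's implementation. The over_sample function, the randomised rounding used to pick the number of copies, and the synthetic stream are all assumptions.

```python
import collections
import random


def over_sample(labels, desired_dist, seed=42):
    """Yield each label one or more times to approach desired_dist."""
    rng = random.Random(seed)
    seen = collections.Counter()
    for y in labels:
        seen[y] += 1
        # Pivot: the class seen so far that is most abundant relative to its
        # desired share. It is emitted exactly once (its rate is 1.0).
        pivot = max(seen, key=lambda c: seen[c] / desired_dist[c])
        rate = (seen[pivot] / desired_dist[pivot]) / (seen[y] / desired_dist[y])
        # Randomised rounding: floor(rate) copies, plus one more with a
        # probability equal to the fractional part, so the mean is `rate`.
        copies = int(rate) + (rng.random() < rate - int(rate))
        for _ in range(copies):
            yield y


# Synthetic stream with roughly 1% of 1s (made-up data, not the credit card set).
stream_rng = random.Random(0)
labels = [int(stream_rng.random() < 0.01) for _ in range(100_000)]

emitted = collections.Counter(over_sample(labels, {0: 0.8, 1: 0.2}))
share_of_ones = emitted[1] / sum(emitted.values())
print(emitted, f"{share_of_ones:.1%}")
```

Every sample is emitted at least once, so the model sees more data than the original stream contains. This is the extra training cost mentioned above.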
Sampling with a desired sample size¶
The downside of both RandomUnderSampler and RandomOverSampler is that you don't have any control over the amount of data the classifier trains on. The number of samples is adjusted so that the target distribution can be attained, either by under-sampling or over-sampling. However, you can do both at the same time and choose how much data the classifier will see. To do so, we can use the RandomSampler class. In addition to the desired class distribution, we can specify how much data to train on. The samples will be both under-sampled and over-sampled in order to fit your constraints. This is powerful because it allows you to control both the class distribution and the size of the training data (and thus the training time). In the following example we'll set it so that the model will train with 1 percent of the data.
model = (
    preprocessing.StandardScaler() |
    imblearn.RandomSampler(
        classifier=linear_model.LogisticRegression(),
        desired_dist={0: .8, 1: .2},
        sampling_rate=.01,
        seed=42
    )
)
metric = metrics.ROCAUC()
evaluate.progressive_val_score(X_y, model, metric)
ROCAUC: 94.71%
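A bit of arithmetic shows why these settings need both under-sampling and over-sampling at once. The sketch below assumes that a sample of class c is used, on average, sampling_rate * desired share / actual share times. That rule is an assumption made for illustration, and it uses the final class counts rather than running estimates, so it only approximates what a streaming sampler would do.

```python
# Class counts from the start of this page.
counts = {0: 284_315, 1: 492}
n = sum(counts.values())

desired_dist = {0: 0.8, 1: 0.2}
sampling_rate = 0.01

# Assumed rule: expected number of times one sample of class c is trained on.
rates = {c: sampling_rate * desired_dist[c] / (counts[c] / n) for c in counts}

# Expected number of training steps contributed by each class.
expected_used = {c: counts[c] * rates[c] for c in counts}
expected_total = sum(expected_used.values())

for c in counts:
    print(f"class {c}: rate {rates[c]:.4f}, about {expected_used[c]:.0f} training steps")
print(f"total: about {expected_total:.0f} of {n} samples")
```

Under this assumption the rate for the 1s is slightly above one, so they are mildly over-sampled, while the rate for the 0s is far below one, so they are heavily under-sampled. The expected total stays at 1% of the stream.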
Hybrid approach¶
As you might have guessed by now, nothing is stopping you from mixing imbalanced learning methods together. As an example, let's combine imblearn.RandomUnderSampler with the weight_pos parameter of the optim.losses.Log loss function.
model = (
    preprocessing.StandardScaler() |
    imblearn.RandomUnderSampler(
        classifier=linear_model.LogisticRegression(
            loss=optim.losses.Log(weight_pos=5)
        ),
        desired_dist={0: .8, 1: .2},
        seed=42
    )
)
metric = metrics.ROCAUC()
evaluate.progressive_val_score(X_y, model, metric)
ROCAUC: 96.52%