Mini-batching
In its purest form, online machine learning encompasses models which learn with one sample at a time. This is the design which is used in River.
The main downside of single-instance processing is that it doesn't scale to big data, at least not in the sense of traditional batch learning. Indeed, processing one sample at a time means that we are unable to take full advantage of vectorisation and other computational tools that are taken for granted in batch learning. On top of this, processing a large dataset in River essentially involves a Python for loop, which might be too slow for some use cases. However, this doesn't mean that River is slow. In fact, for processing a single instance, River is actually a couple of orders of magnitude faster than libraries such as scikit-learn, PyTorch, and TensorFlow. This is because River is designed from the ground up to process a single instance, whereas the majority of other libraries are built around batches of data. The two approaches offer different compromises, and the best choice depends on your use case.
To offer the best of both worlds, River provides some limited support for mini-batch learning. Some of River's estimators implement *_many methods on top of their *_one counterparts. For instance, preprocessing.StandardScaler has a learn_many method as well as a transform_many method, in addition to learn_one and transform_one. Each mini-batch method takes as input a pandas.DataFrame. Supervised estimators also take as input a pandas.Series of target values. We choose to use pandas.DataFrames over numpy.ndarrays because the former allows us to name each feature. This in turn allows us to offer a uniform interface for both single instance and mini-batch learning.
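To make the relationship between the *_one and *_many methods more concrete, here is a minimal pure-Python sketch of running statistics that can be updated either one value at a time or one batch at a time. The RunningStats class is hypothetical and is not River's code; it assumes Welford's update for single values and the standard pairwise merge formula for batches, and River's own StandardScaler may well be implemented differently.

```python
import math


class RunningStats:
    """Hypothetical stand-in for the statistics a scaler keeps (not River's code)."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def learn_one(self, x):
        # Welford's update: fold a single value into the running statistics.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def learn_many(self, xs):
        # Summarise the whole batch first, then merge the summary in one step.
        k = len(xs)
        if k == 0:
            return
        batch_mean = sum(xs) / k
        batch_m2 = sum((x - batch_mean) ** 2 for x in xs)
        delta = batch_mean - self.mean
        total = self.n + k
        self.mean += delta * k / total
        self.m2 += batch_m2 + delta ** 2 * self.n * k / total
        self.n = total

    @property
    def variance(self):
        # Population variance of everything seen so far.
        return self.m2 / self.n if self.n else 0.0


data = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]

one_by_one = RunningStats()
for value in data:
    one_by_one.learn_one(value)

by_batches = RunningStats()
by_batches.learn_many(data[:2])
by_batches.learn_many(data[2:])

# Both update paths should agree up to floating point error.
print(math.isclose(one_by_one.mean, by_batches.mean))
print(math.isclose(one_by_one.variance, by_batches.variance))
```

The point of the sketch is that both paths maintain the same state, so a model can be fed single samples and mini-batches interchangeably.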
As an example, we will build a simple pipeline that scales the data and trains a logistic regression. The compose.Pipeline class can be applied to mini-batches, as long as each of its steps supports them.
from river import compose
from river import linear_model
from river import preprocessing
model = compose.Pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression()
)
For this example, we will use datasets.Higgs.
from river import datasets
dataset = datasets.Higgs()
if not dataset.is_downloaded:
    dataset.download()
dataset
Downloading https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz (2.62 GB)
Higgs dataset.
The data has been produced using Monte Carlo simulations. The first 21 features (columns 2-22)
are kinematic properties measured by the particle detectors in the accelerator. The last seven
features are functions of the first 21 features; these are high-level features derived by
physicists to help discriminate between the two classes.
Name        Higgs
Task        Binary classification
Samples     11,000,000
Features    28
Sparse      False
Path        /home/runner/river_data/Higgs/HIGGS.csv.gz
URL         https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
Size        2.62 GB
Downloaded  True
The easiest way to read the data in mini-batches is to use the read_csv function from pandas.
import pandas as pd
names = [
    'target', 'lepton pT', 'lepton eta', 'lepton phi',
    'missing energy magnitude', 'missing energy phi',
    'jet 1 pt', 'jet 1 eta', 'jet 1 phi', 'jet 1 b-tag',
    'jet 2 pt', 'jet 2 eta', 'jet 2 phi', 'jet 2 b-tag',
    'jet 3 pt', 'jet 3 eta', 'jet 3 phi', 'jet 3 b-tag',
    'jet 4 pt', 'jet 4 eta', 'jet 4 phi', 'jet 4 b-tag',
    'm_jj', 'm_jjj', 'm_lv', 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb'
]
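# Read the first 300,000 rows in chunks of at most 8,096 rows each
for x in pd.read_csv(dataset.path, names=names, chunksize=8096, nrows=300_000):
    y = x.pop('target')  # separate the target column from the features
    y_pred = model.predict_proba_many(x)  # predict before learning from the chunk
    model.learn_many(x, y)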
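Conceptually, the chunksize argument turns one long stream of rows into a sequence of smaller groups of at most chunksize rows. For data that does not come from a CSV file, the same idea can be sketched with the standard library alone. The iter_batches helper below is hypothetical and is not part of River or pandas; it only illustrates the chunking pattern.

```python
from itertools import islice


def iter_batches(stream, batch_size):
    """Yield lists of at most batch_size items from any iterable (hypothetical helper)."""
    iterator = iter(stream)
    while True:
        # Pull up to batch_size items; the final batch may be shorter.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


for batch in iter_batches(range(7), batch_size=3):
    print(batch)
```

Each yielded list could then be converted to a pandas.DataFrame before being handed to a *_many method.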
If you are familiar with scikit-learn, you might be aware that some of its estimators have a partial_fit method, which is similar to River's learn_many method. Here are some advantages that River has over scikit-learn:
- We guarantee that River is just as fast as scikit-learn, if not faster. The differences are negligible, but slightly in favor of River.
- We take dataframes as input, which allows us to name each feature. The benefit is that you can add, remove, or permute features between batches and everything will keep working.
- Estimators that support mini-batches also support single instance learning. This means that you can enjoy the best of both worlds. For instance, you can train with mini-batches and use predict_one to make predictions.
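The following toy model is a rough sketch of why naming features makes the order and presence of columns irrelevant, and of how batch training can coexist with single-instance prediction. NamedLinearModel is hypothetical and is not how River implements its linear models; it assumes weights stored in a dictionary keyed by feature name, and its learn_many is a plain loop rather than a vectorised update.

```python
from collections import defaultdict


class NamedLinearModel:
    """Hypothetical toy regressor with weights keyed by feature name (not River's code)."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.weights = defaultdict(float)  # unseen features start at weight 0

    def predict_one(self, x):
        # x is a dict mapping feature names to values, in any order.
        return sum(self.weights[name] * value for name, value in x.items())

    def learn_one(self, x, y):
        # One gradient step on the squared error for a single sample.
        error = self.predict_one(x) - y
        for name, value in x.items():
            self.weights[name] -= self.lr * error * value

    def learn_many(self, rows, ys):
        # Toy batch method: a simple loop, only meant to show the interface.
        for x, y in zip(rows, ys):
            self.learn_one(x, y)


model = NamedLinearModel()

# First batch uses features "a" and "b".
model.learn_many([{"a": 1.0, "b": 2.0}, {"a": 0.5, "b": 1.0}], [1.0, 0.5])

# Second batch permutes the features and adds a new one, "c".
model.learn_many([{"b": 2.0, "c": 3.0, "a": 1.0}], [1.0])

# Single-instance prediction after mini-batch training.
print(model.predict_one({"b": 2.0, "a": 1.0}))
```

Because every lookup goes through the feature name, a permuted or extended batch updates the right weights without any bookkeeping on column positions.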
Note that you can check which estimators can process mini-batches programmatically:
import importlib
import inspect
def can_mini_batch(obj):
    return hasattr(obj, 'learn_many')

for module in importlib.import_module('river.api').__all__:
    if module in ['datasets', 'synth']:
        continue
    for name, obj in inspect.getmembers(importlib.import_module(f'river.{module}'), can_mini_batch):
        print(name)
OneClassSVM
MiniBatchClassifier
MiniBatchRegressor
MiniBatchSupervisedTransformer
MiniBatchTransformer
SKL2RiverClassifier
SKL2RiverRegressor
Pipeline
Select
TransformerProduct
TransformerUnion
BagOfWords
TFIDF
LinearRegression
LogisticRegression
Perceptron
OneVsRestClassifier
BernoulliNB
ComplementNB
MultinomialNB
MLPRegressor
OneHotEncoder
StandardScaler
Because mini-batch learning isn't treated as a first-class citizen, some of River's functionalities require some work in order to play nicely with mini-batches. For instance, the objects from the metrics module have an update method that takes as input a single pair (y_true, y_pred). This might change in the future, depending on the demand.
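One possible workaround is to loop over a batch of targets and predictions and feed them to the metric one pair at a time. The sketch below uses a hypothetical Accuracy stand-in so that it runs without River installed; it only assumes the single-pair update interface described above. The update_many helper is also an assumption and is not part of River.

```python
class Accuracy:
    """Hypothetical stand-in for a metric exposing a single-pair update method."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, y_true, y_pred):
        # Consume exactly one (y_true, y_pred) pair.
        self.correct += int(y_true == y_pred)
        self.total += 1

    def get(self):
        return self.correct / self.total if self.total else 0.0


def update_many(metric, y_true_batch, y_pred_batch):
    """Hypothetical helper: feed a whole batch to a single-pair metric."""
    for y_true, y_pred in zip(y_true_batch, y_pred_batch):
        metric.update(y_true, y_pred)
    return metric


metric = Accuracy()
update_many(metric, [True, False, True, True], [True, True, True, False])
print(metric.get())
```

The same loop should carry over to two aligned pandas.Series, since iterating over a Series yields its values, although the Python-level loop gives up some of the speed gained from mini-batching.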
We plan to promote more models to the mini-batch regime. However, we will only be doing so for the methods that benefit the most from it, as well as those that are most popular. Indeed, River's core philosophy will remain to cater to single instance learning.