QOSplitter

Quantization observer (QO).

This splitter uses a hash-based quantization algorithm to keep track of the target statistics and evaluate split candidates. QO relies on the radius parameter to define the discretization intervals for each incoming feature. Split candidates are defined as the midpoints between two consecutive hash slots. This attribute observer can create both binary and multi-way splits. This class implements the algorithm described in ¹.
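The quantization step can be illustrated with a minimal sketch. It assumes that the slot index is `floor(x / radius)` and that each slot keeps the mean of the feature values it received. The helper names `slot_index` and `split_candidates` are hypothetical and are not part of the library.

```python
import math
from collections import defaultdict


def slot_index(x: float, radius: float) -> int:
    """Hash a feature value to the index of its quantization interval."""
    return math.floor(x / radius)


def split_candidates(values, radius):
    """Return midpoints between the mean feature values of consecutive slots."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x in values:
        h = slot_index(x, radius)
        sums[h] += x
        counts[h] += 1
    # Sort the occupied slots by index so that neighbours are consecutive.
    centers = [sums[h] / counts[h] for h in sorted(sums)]
    return [(a + b) / 2 for a, b in zip(centers, centers[1:])]


values = [0.0, 0.25, 0.5, 1.0, 1.5]
candidates = split_candidates(values, radius=0.5)
print(candidates)
```

Only the occupied slots are stored, so the memory cost follows the number of distinct intervals seen rather than the number of instances.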

The smaller the quantization radius, the more hash slots will be created to accommodate the discretized data. Hence, both the running time and memory consumption increase, but the resulting splits should be closer to the ones obtained by a batch exhaustive approach. On the other hand, if the radius is too large, fewer slots will be created and less memory and running time will be required, at the cost of coarser split suggestions.
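The trade-off can be made concrete by counting the occupied slots for a few radius values. This is a sketch that assumes the slot index is `floor(x / radius)`; `count_slots` is a hypothetical helper, and the evenly spaced values merely stand in for a feature spanning roughly three standard deviations on each side of zero.

```python
import math


def count_slots(values, radius):
    """Count the distinct quantization slots occupied by the values."""
    return len({math.floor(x / radius) for x in values})


# Evenly spaced values in [-3, 3), a stand-in for a standardized feature.
values = [i / 100 for i in range(-300, 300)]

slots_by_radius = {r: count_slots(values, r) for r in (0.125, 0.25, 1.0)}
for radius, n_slots in slots_by_radius.items():
    print(f"radius={radius}: {n_slots} slots")
```

On this data the slot count drops from 48 to 24 to 6 as the radius grows, and the number of split candidates shrinks with it.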

QO assumes that all features have the same range, so it is advisable to scale the features before applying this splitter. This can be done using the preprocessing module. A good rule of thumb is to scale the data using preprocessing.StandardScaler and define the radius as a proportion of the features' standard deviation. For instance, the default radius value corresponds to one quarter of the standard deviation of the scaled features (since the scaled data has zero mean and unit variance). If the features come from normal distributions, by following the empirical rule, roughly 32 hash slots will be created.
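The following sketch shows why a single radius suits scaled features better than raw ones. It assumes the slot index is `floor(x / radius)` and, for brevity, standardizes in batch with the standard library, whereas preprocessing.StandardScaler works incrementally. The two synthetic features and the helpers `count_slots` and `standardize` are hypothetical.

```python
import math
import random
import statistics


def count_slots(values, radius):
    """Count the distinct quantization slots occupied by the values."""
    return len({math.floor(x / radius) for x in values})


def standardize(values):
    """Scale values to zero mean and unit variance (batch version)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]


rng = random.Random(42)
# Two hypothetical features with very different scales.
small = [rng.gauss(0.0, 0.01) for _ in range(5000)]
large = [rng.gauss(0.0, 100.0) for _ in range(5000)]

radius = 0.25
raw = (count_slots(small, radius), count_slots(large, radius))
scaled = (
    count_slots(standardize(small), radius),
    count_slots(standardize(large), radius),
)
print("raw features:   ", raw)
print("scaled features:", scaled)
```

Without scaling, the small-range feature collapses into a couple of slots while the large-range one occupies a great many. After scaling, both features occupy a comparable, moderate number of slots.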

Parameters

radius

Type → float

Default → 0.25

The quantization radius. QO discretizes the incoming feature into intervals of equal length that are defined by this parameter.
allow_multiway_splits

Default → False

Whether to allow multi-way splits to be evaluated. Numeric multi-way splits use the same quantization strategy as QO to create multiple tree branches. The same quantization radius is used, and each stored slot represents the split-enabling statistics of one branch.
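As a rough illustration of the multi-way idea, the sketch below creates one branch per occupied slot and routes a value to the branch of its slot. It assumes the slot index is `floor(x / radius)`. The names `branch_of_slot` and `route` are hypothetical, and how the library treats values that fall into slots never observed during training is not covered here.

```python
import math

radius = 0.25
# Hypothetical observed values for one numeric feature.
observed = [0.1, 0.2, 0.3, 0.9, 1.1]

# One branch per occupied slot, numbered in slot order.
slot_ids = sorted({math.floor(x / radius) for x in observed})
branch_of_slot = {slot: branch for branch, slot in enumerate(slot_ids)}


def route(x: float):
    """Return the branch index for x, or None if its slot was never observed."""
    return branch_of_slot.get(math.floor(x / radius))


print(route(0.15), route(0.95), route(0.6))
```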

Attributes

is_numeric

Determine whether or not the splitter works with numerical features.
is_target_class

Indicates the kind of learning task the splitter is designed for. If True, the splitter works with classification trees; otherwise, it is designed for regression trees.

Methods

best_evaluated_split_suggestion

Get the best split suggestion given a criterion and the target's statistics.

Parameters

criterion — 'SplitCriterion'
pre_split_dist — 'list | dict'
att_idx — 'base.typing.FeatureName'
binary_only — 'bool' — defaults to True

Returns

BranchFactory: Suggestion of the best attribute split.
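A simplified view of how binary candidates could be scored is sketched below. It assumes the slot index is `floor(x / radius)` and uses variance reduction as a stand-in for the `criterion` argument. The function `best_binary_split` is hypothetical and returns a plain tuple instead of a BranchFactory.

```python
import math


def variance(n, s, s2):
    """Population variance from count, sum and sum of squares."""
    return s2 / n - (s / n) ** 2 if n > 0 else 0.0


def best_binary_split(xs, ys, radius):
    """Scan slot midpoints and return (threshold, variance reduction)."""
    # slot index -> [count, sum of x, sum of y, sum of y squared]
    slots = {}
    for x, y in zip(xs, ys):
        stats = slots.setdefault(math.floor(x / radius), [0, 0.0, 0.0, 0.0])
        stats[0] += 1
        stats[1] += x
        stats[2] += y
        stats[3] += y * y

    ordered = [slots[h] for h in sorted(slots)]
    n_tot = sum(s[0] for s in ordered)
    sy_tot = sum(s[2] for s in ordered)
    sy2_tot = sum(s[3] for s in ordered)
    total_var = variance(n_tot, sy_tot, sy2_tot)

    best = (None, -math.inf)
    n_l = sy_l = sy2_l = 0.0
    for left, right in zip(ordered, ordered[1:]):
        # Move the statistics of the current slot to the left branch.
        n_l += left[0]
        sy_l += left[2]
        sy2_l += left[3]
        n_r = n_tot - n_l
        # Candidate threshold: midpoint between the two slots' mean x values.
        threshold = (left[1] / left[0] + right[1] / right[0]) / 2
        merit = total_var - (
            n_l / n_tot * variance(n_l, sy_l, sy2_l)
            + n_r / n_tot * variance(n_r, sy_tot - sy_l, sy2_tot - sy2_l)
        )
        if merit > best[1]:
            best = (threshold, merit)
    return best


xs = [0.1, 0.2, 0.3, 0.4, 1.1, 1.2, 1.3, 1.4]
ys = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
threshold, merit = best_binary_split(xs, ys, radius=0.25)
print(threshold, merit)
```

Because the slots are visited in sorted order, the left-branch statistics accumulate in a single pass and the right-branch statistics follow by subtraction from the totals.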

cond_proba

Get the probability for an attribute value given a class.

Parameters

att_val
target_val — 'base.typing.ClfTarget'

Returns

float: Probability for an attribute value given a class.

update

Update statistics of this observer given an attribute value, its target value and the weight of the instance observed.

Parameters

att_val
target_val — 'base.typing.Target'
sample_weight — 'float'
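The role of the three arguments can be sketched with a toy observer that keeps weighted sums per slot. It assumes that the slot index is `floor(att_val / radius)` and that missing values are skipped. `QuantizerSketch` is a hypothetical class, not the library implementation.

```python
import math


class QuantizerSketch:
    """Toy observer: per-slot weighted sums of the feature and the target."""

    def __init__(self, radius: float = 0.25):
        self.radius = radius
        # slot index -> [total weight, weighted sum of x, weighted sum of y]
        self.slots = {}

    def update(self, att_val, target_val, sample_weight=1.0):
        if att_val is None:  # assumption: missing values are ignored
            return
        index = math.floor(att_val / self.radius)
        stats = self.slots.setdefault(index, [0.0, 0.0, 0.0])
        stats[0] += sample_weight
        stats[1] += sample_weight * att_val
        stats[2] += sample_weight * target_val


obs = QuantizerSketch(radius=0.25)
obs.update(0.10, 2.0, 1.0)
obs.update(0.20, 4.0, 3.0)   # same slot, three times the weight
obs.update(0.60, 10.0, 1.0)  # a different slot
obs.update(None, 1.0, 1.0)   # skipped

weight, _, sum_y = obs.slots[0]
print(sorted(obs.slots), sum_y / weight)
```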

¹ Mastelini, S.M. and de Leon Ferreira, A.C.P., 2021. Using dynamical quantization to perform split attempts in online tree regressors. Pattern Recognition Letters.