ExtremelyFastDecisionTreeClassifier
Extremely Fast Decision Tree classifier.
Also referred to as Hoeffding AnyTime Tree (HATT) classifier.
Parameters
-
grace_period (int) – defaults to
200
Number of instances a leaf should observe between split attempts.
-
max_depth (int) – defaults to
None
The maximum depth a tree can reach. If
None
, the tree will grow indefinitely.
-
min_samples_reevaluate (int) – defaults to
20
Number of instances a node should observe before reevaluating the best split.
-
split_criterion (str) – defaults to
info_gain
Split criterion to use: 'gini' (Gini), 'info_gain' (Information Gain), or 'hellinger' (Hellinger Distance).
-
split_confidence (float) – defaults to
1e-07
Allowed error in the split decision. With a value closer to 0, the tree takes longer to decide.
-
tie_threshold (float) – defaults to
0.05
Threshold below which a split will be forced to break ties.
-
leaf_prediction (str) – defaults to
nba
Prediction mechanism used at leaves: 'mc' (Majority Class), 'nb' (Naive Bayes), or 'nba' (Naive Bayes Adaptive).
-
nb_threshold (int) – defaults to
0
Number of instances a leaf should observe before allowing Naive Bayes.
-
nominal_attributes (list) – defaults to
None
List of nominal attribute identifiers. If empty, all numeric attributes are treated as continuous.
-
splitter (river.tree.splitter.base_splitter.Splitter) – defaults to
None
The Splitter or Attribute Observer (AO) used to monitor the class statistics of numeric features and perform splits. Splitters are available in the
tree.splitter
module. Different splitters are available for classification and regression tasks. Classification and regression splitters can be distinguished by their property is_target_class
. This is an advanced option. Special care must be taken when choosing different splitters. By default, tree.splitter.GaussianSplitter
is used if splitter
is None
.
-
kwargs
Other parameters passed to
tree.HoeffdingTree
. Check the tree
module documentation for more information.
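How grace_period, split_confidence, and tie_threshold interact can be illustrated with the Hoeffding bound that Hoeffding-tree learners conventionally rely on. The sketch below is not River's implementation: the helper name `hoeffding_bound` is hypothetical, and the range of 1 assumes binary classification with information gain (log2 of the number of classes).

```python
import math


def hoeffding_bound(value_range: float, confidence: float, n: float) -> float:
    """Hoeffding bound: sqrt(R^2 * ln(1 / delta) / (2 * n)).

    Hypothetical helper for illustration, not part of River's public API.
    """
    return math.sqrt((value_range ** 2) * math.log(1.0 / confidence) / (2.0 * n))


# Assumed setting: two classes and information gain, so R = log2(2) = 1.
value_range = math.log2(2)
split_confidence = 1e-7  # default listed above
tie_threshold = 0.05     # default listed above

for n in (200, 1_000, 5_000):
    eps = hoeffding_bound(value_range, split_confidence, n)
    # Once the bound drops below tie_threshold, near-equal candidates no longer block a split.
    tie = eps < tie_threshold
    print(f"n={n:>5}  epsilon={eps:.4f}  tie-break possible: {tie}")
```

The bound shrinks as a leaf observes more instances, so a smaller split_confidence demands more data before a split is accepted, while tie_threshold limits how long two near-equal candidates can delay the decision.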
Attributes
-
depth
The depth of the tree.
-
leaf_prediction
Return the prediction strategy used by the tree at its leaves.
-
max_size
Maximum size the tree is allowed to reach (in MB).
-
model_measurements
Collect metrics corresponding to the current status of the tree in a string buffer.
-
split_criterion
Return a string with the name of the split criterion being used by the tree.
Examples
>>> from river import synth
>>> from river import evaluate
>>> from river import metrics
>>> from river import tree
>>> gen = synth.Agrawal(classification_function=0, seed=42)
>>> # Take 1000 instances from the infinite data generator
>>> dataset = iter(gen.take(1000))
>>> model = tree.ExtremelyFastDecisionTreeClassifier(
...     grace_period=100,
...     split_confidence=1e-5,
...     nominal_attributes=['elevel', 'car', 'zipcode'],
...     min_samples_reevaluate=100
... )
>>> metric = metrics.Accuracy()
>>> evaluate.progressive_val_score(dataset, model, metric)
Accuracy: 89.09%
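The evaluation above follows a test-then-train protocol: each instance is used for prediction first and for learning afterwards. The sketch below mimics that loop in plain Python. `MajorityClass` and `progressive_accuracy` are hypothetical stand-ins written for illustration, and skipping instances for which the model cannot yet predict is an assumption, not a statement about River's internals.

```python
from collections import Counter


class MajorityClass:
    """Hypothetical stand-in model exposing learn_one / predict_one."""

    def __init__(self):
        self.counts = Counter()

    def learn_one(self, x, y):
        self.counts[y] += 1
        return self

    def predict_one(self, x):
        # No prediction is possible before the first label has been seen.
        return self.counts.most_common(1)[0][0] if self.counts else None


def progressive_accuracy(stream, model):
    """Test-then-train: predict each instance first, then learn from it."""
    correct = total = 0
    for x, y in stream:
        y_pred = model.predict_one(x)
        if y_pred is not None:
            correct += int(y_pred == y)
            total += 1
        model.learn_one(x, y)
    return correct / total if total else 0.0


# Tiny hand-made stream of (features, label) pairs.
stream = [({"a": 1}, True), ({"a": 2}, True), ({"a": 3}, False), ({"a": 4}, True)]
accuracy = progressive_accuracy(stream, MajorityClass())
print(f"Accuracy: {accuracy:.2%}")
```

Because every prediction is made before the model has seen the corresponding label, the resulting score reflects performance on unseen data without a separate hold-out set.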
Methods
clone
Return a fresh estimator with the same parameters.
The clone has the same parameters but has not been updated with any data. This works by looking at the parameters from the class signature. Each parameter is either recursively cloned, if it is a River class, or deep-copied via copy.deepcopy otherwise. If the calling object is stochastic (i.e. it accepts a seed parameter) and has not been seeded, then the clone will not be idempotent. Indeed, this method's purpose is simply to return a new instance with the same input parameters.
debug_one
Print an explanation of how x
is predicted.
Parameters
- x (dict)
Returns
typing.Union[str, NoneType]: A representation of the path followed by the tree to predict x
; None
if
draw
Draw the tree using the graphviz
library.
Since the tree is drawn without passing incoming samples, classification trees will show the majority class in their leaves, whereas regression trees will use the target mean.
Parameters
- max_depth (int) – defaults to
None
The maximum depth a tree can reach. If None
, the tree will grow indefinitely.
learn_one
Incrementally train the model.
Parameters
- x
- y
- sample_weight – defaults to
1.0
Returns
self
model_description
Walk the tree and return its structure in a buffer.
Returns
The description of the model.
predict_many
Predict the labels of a DataFrame X
.
Parameters
- X (pandas.core.frame.DataFrame)
Returns
Series: Series of predicted labels.
predict_one
Predict the label of a set of features x
.
Parameters
- x (dict)
Returns
typing.Union[bool, str, int]: The predicted label.
predict_proba_many
Predict the label probabilities for each row of a DataFrame X
.
Parameters
- X (pandas.core.frame.DataFrame)
Returns
DataFrame: DataFrame that associates probabilities with each label as columns.
predict_proba_one
Predict the probability of each label for a dictionary of features x
.
Parameters
- x
Returns
A dictionary that associates a probability with each label.
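A dictionary of label probabilities can be reduced to a single label by taking the most probable entry. The sketch below shows one way to do this; the helper `most_probable_label` and the example probabilities are hypothetical and are not taken from the library.

```python
def most_probable_label(proba: dict):
    """Return the label with the highest probability, or None if the dict is empty."""
    if not proba:
        return None
    return max(proba, key=proba.get)


# Hypothetical probabilities for a binary problem, shaped like the dictionary described above.
proba = {False: 0.27, True: 0.73}
label = most_probable_label(proba)
print(label)
```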
Notes
The Extremely Fast Decision Tree (EFDT) [1] constructs a tree incrementally. It selects and deploys a split as soon as it is confident that the split is useful, then revisits that decision and replaces the split if a better one subsequently becomes evident. When the data are drawn from a stationary distribution, the EFDT learns rapidly and eventually learns the asymptotic batch tree.
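The two decisions described in this note, deploying a split early and later revisiting it, can be sketched as two comparisons against a Hoeffding bound. This is a simplified illustration rather than River's code: the function names and the merit values are hypothetical, and tie handling is applied only to the initial split attempt as a simplifying assumption.

```python
import math


def hoeffding_bound(value_range: float, confidence: float, n: float) -> float:
    """Standard Hoeffding bound; hypothetical helper, not River's API."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / confidence) / (2.0 * n))


def should_split(best_merit: float, no_split_merit: float,
                 eps: float, tie_threshold: float) -> bool:
    """Leaf split attempt: deploy the best split once it beats not splitting."""
    return best_merit - no_split_merit > eps or eps < tie_threshold


def should_replace(best_merit: float, current_merit: float, eps: float) -> bool:
    """Re-evaluation at an internal node: replace only on a confident improvement."""
    return best_merit - current_merit > eps


eps = hoeffding_bound(value_range=1.0, confidence=1e-5, n=1_000)

# Assumed merits (for example, information gain) chosen purely for illustration.
split_now = should_split(best_merit=0.30, no_split_merit=0.0, eps=eps, tie_threshold=0.05)
replace_now = should_replace(best_merit=0.32, current_merit=0.30, eps=eps)
print(f"epsilon={eps:.4f} split={split_now} replace={replace_now}")
```

Comparing the best candidate against not splitting, rather than against the second-best candidate, is what lets a split be deployed early; the later re-evaluation is the safeguard that lets the tree correct an early choice.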
References
-
C. Manapragada, G. Webb, and M. Salehi. Extremely Fast Decision Tree. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 1953-1962. DOI: https://doi.org/10.1145/3219819.3220005