LocalOutlierFactor¶
Incremental Local Outlier Factor (Incremental LOF).
The Incremental Local Outlier Factor (ILOF) is an online version of the Local Outlier Factor (LOF), proposed by Pokrajac et al. (2017), and is used to identify outliers based on density of local neighbors.
The algorithm take into account the following elements: - NewPoints
: new points;
- `kNN(p)`: the k-nearest neighboors of `p` (the k-closest points to `p`);
- `RkNN(p)`: the reverse-k-nearest neighboors of `p` (points that have `p` as one of their neighboors);
- `set_upd_lrd`: Set of points that need to have the local reachability distance updated;
- `set_upd_lof`: Set of points that need to have the local outlier factor updated.
This current implementation within River
, based on the original one in the paper, follows the following steps: 1) Insert new data points (NewPoints
) and calculate its distance to existing points; 2) Update the nreaest neighboors and reverse nearest neighboors of all the points; 3) Define sets of affected points that required updates; 4) Calculate the reachability-distance from new point to neighboors (NewPoints
-> kNN(NewPoints)
) and from rev-neighboors to new point (RkNN(NewPoints)
-> NewPoints
); 5) Update the reachability-distance for affected points: RkNN(RkNN(NewPoints))
-> RkNN(NewPoints)
6) Update local reachability distance of affected points: lrd(set_upd_lrd)
; 7) Update local outlier factor: lof(set_upd_lof)
.
The incremental LOF algorithm is expected to provide equivalent detection performance as the iterated static LOF algroithm (applied after insertion of each data record), while requiring significantly less computational time. Moreover, the insertion of a new data point as well as deletion of an old data point influence only a limited number of their closest neighbors, which means that the number of updates per such insertion/deletion does not depend on the total number of instances learned/in the data set.
Parameters¶
-
n_neighbors
Type → int
Default →
10
The number of nearest neighbors to use for density estimation.
-
distance_func
Type → DistanceFunc
Default →
None
Distance function to be used. By default, the Euclidean distance is used.
Attributes¶
-
x_list
A list of stored observations.
-
x_batch
A buffer to hold incoming observations until it's time to update the model.
-
x_scores
A buffer to hold incoming observations until it's time to score them.
-
dist_dict
A dictionary to hold distances between observations.
-
neighborhoods
A dictionary to hold neighborhoods for each observation.
-
rev_neighborhoods
A dictionary to hold reverse neighborhoods for each observation.
-
k_dist
A dictionary to hold k-distances for each observation.
-
reach_dist
A dictionary to hold reachability distances for each observation.
-
lof
A dictionary to hold Local Outlier Factors for each observation.
-
local_reach
A dictionary to hold local reachability distances for each observation.
Examples¶
import pandas as pd
from river import anomaly
from river import datasets
cc_df = pd.DataFrame(datasets.CreditCard())
lof = anomaly.LocalOutlierFactor(n_neighbors=20)
for x, _ in datasets.CreditCard().take(200):
lof.learn_one(x)
lof.learn_many(cc_df[201:401])
scores = []
for x in cc_df[0][401:406]:
scores.append(lof.score_one(x))
[round(score, 3) for score in scores]
[1.802, 1.936, 1.566, 1.181, 1.272]
X = [0.5, 0.45, 0.43, 0.44, 0.445, 0.45, 0.0]
lof = anomaly.LocalOutlierFactor()
for x in X[:3]:
lof.learn_one({'x': x}) # Warming up
for x in X:
features = {'x': x}
print(
f'Anomaly score for x={x:.3f}: {lof.score_one(features):.3f}')
lof.learn_one(features)
Anomaly score for x=0.500: 0.000
Anomaly score for x=0.450: 0.000
Anomaly score for x=0.430: 0.000
Anomaly score for x=0.440: 1.020
Anomaly score for x=0.445: 1.032
Anomaly score for x=0.450: 0.000
Anomaly score for x=0.000: 0.980
Methods¶
learn
learn_many
learn_one
Update the model.
Parameters
- x — 'dict'
score_one
Return an outlier score.
A high score is indicative of an anomaly. A low score corresponds to a normal observation.
Parameters
- x — 'dict'
Returns
float: An anomaly score. A high score is indicative of an anomaly. A low score corresponds a
David Pokrajac, Aleksandar Lazarevic, and Longin Jan Latecki (2007). Incremental Local Outlier Detection for Data Streams. In: Proceedings of the 2007 IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007). 504-515. DOI: 10.1109/CIDM.2007.368917.