Synopsis
Identifies n outliers in the given ExampleSet based on the distance to their k nearest neighbors.
Description
This operator performs a D^k_n outlier search following the approach proposed by Ramaswamy, Rastogi and Shim in "Efficient Algorithms for Mining Outliers from Large Data Sets". It is primarily a statistical outlier search based on a distance measure, similar to the DB(p,D)-outlier search of Knorr and Ng. Because it measures the distance to the k-th nearest neighbour, however, it also incorporates a degree of locality.
The method assumes that the objects with the largest distance to their k-th nearest neighbour are likely to be outliers with respect to the data set, because such objects have a sparser neighbourhood than the average object. Since this yields a simple ranking of all objects in the data set by the distance to their k-th nearest neighbour, the user can specify a number n, and the n top-ranked objects are reported as the top-n outliers.
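The ranking described above can be sketched in a few lines of Python. This is a minimal brute-force illustration using NumPy and the Euclidean distance, not the operator's actual implementation; the function name `top_n_dk_outliers` and its arguments are hypothetical, and tie-breaking between equal distances is an assumption.

```python
import numpy as np

def top_n_dk_outliers(points, k, n):
    """Flag the n points with the largest distance to their k-th nearest neighbour.

    Hypothetical helper for illustration only (brute force, O(m^2) memory).
    """
    X = np.asarray(points, dtype=float)
    if not 1 <= k < len(X):
        raise ValueError("k must satisfy 1 <= k < number of points")
    # Pairwise Euclidean distances, shape (m, m).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # A point is not its own neighbour, so push the diagonal to infinity.
    np.fill_diagonal(dist, np.inf)
    # D^k: the k-th smallest distance in each row.
    dk = np.sort(dist, axis=1)[:, k - 1]
    # Rank by D^k in descending order and mark the top n as outliers.
    top = np.argsort(-dk, kind="stable")[:n]
    flags = np.zeros(len(X), dtype=bool)
    flags[top] = True
    return dk, flags

# Four points on a unit square plus one distant point.
data = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]]
dk, flags = top_n_dk_outliers(data, k=2, n=1)
```

In this example only the distant point at (10, 10) is flagged, since its second-nearest neighbour is far further away than that of any corner of the square.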
In addition to the Euclidean distance, the operator supports cosine, sine and squared distances; the distance function is selected with the distance function parameter. The operator takes an example set and passes it on with a new boolean-valued special outlier attribute that holds the top-n D^k outlier status: true (outlier) or false (no outlier).
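To illustrate how the choice of distance function plugs into the k-th nearest neighbour distance, here is a small Python sketch. The definitions are assumptions: the squared distance is taken to be the squared Euclidean distance and the cosine distance to be one minus the cosine similarity, which may differ from the operator's own formulas. The sine distance is omitted because this documentation does not define it. All function names are hypothetical.

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance between two vectors.
    return float(np.sqrt(np.sum((a - b) ** 2)))

def squared(a, b):
    # Assumed to mean the squared Euclidean distance.
    return float(np.sum((a - b) ** 2))

def cosine(a, b):
    # Assumed definition: 1 - cosine similarity (undefined for zero vectors).
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def kth_neighbour_distance(X, i, k, distance):
    """Distance from X[i] to its k-th nearest neighbour under `distance`."""
    others = sorted(distance(X[i], X[j]) for j in range(len(X)) if j != i)
    return others[k - 1]

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
d_euclid = kth_neighbour_distance(X, 0, 1, euclidean)
d_squared = kth_neighbour_distance(X, 0, 1, squared)
d_cos = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Under these assumptions the squared distance preserves the Euclidean ranking of neighbours, whereas the cosine distance depends only on direction and can therefore produce a different ranking.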
Input
- example set input: expects: ExampleSetMetaData: #examples: = 0; #attributes: 0
Output
- example set output:
- original:
Parameters
- number of neighbors: Specifies the value of k, i.e. which nearest neighbour's distance is analyzed (default: 10, minimum: 1, maximum: 1 million).
- number of outliers: The number n of top-n outliers to look for (default: 10; minimum: 2, for internal reasons; maximum: 1 million).
- distance function: Chooses the distance function used to calculate the distance between two objects.