Characteristics and classification of outlier detection
3.1 Statistical based techniques
These techniques require an underlying data distribution model for the detection of outliers.
They assume or estimate a statistical (probability distribution) model which captures the
distribution of the data and evaluate data instances with respect to how well they fit the
model. A data instance is declared an outlier if the probability of it being generated by the model is very low. These techniques
can further be classified as parametric or non-parametric. Parametric techniques assume knowledge of the underlying data distribution, i.e., that the data is generated from a known distribution. The distribution parameters are then estimated from the available data. These techniques are based on either a Gaussian or a non-Gaussian model.
Non-parametric techniques do not assume knowledge of the underlying data distribution. They typically define a distance measure between a new test instance and the statistical model and apply a threshold on this distance to determine whether the observation is an outlier. The most widely used approaches in this respect are histogram-, kernel density- and wavelet-based approaches. Some of the statistical based techniques considered in this paper are Dereszynski and Dietterich (2011), Zhang et al. (2012), Wu et al. (2007), Yozo et al. (2004), Jun et al. (2005), Bettencourt et al. (2007), Sheng et al. (2007), Palpanas et al. (2003), Subramaniam et al. (2006).
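A minimal sketch of the parametric, Gaussian-based idea: fit a mean and standard deviation to the data, then flag any instance whose deviation from the fitted mean exceeds a chosen threshold. The sample readings and the threshold value here are illustrative assumptions, not taken from the surveyed papers.

```python
import math

def gaussian_outliers(data, threshold=2.0):
    """Fit a Gaussian (mean, std) to the data and flag instances
    whose absolute z-score exceeds the threshold."""
    n = len(data)
    mean = sum(data) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    return [x for x in data if abs(x - mean) / std > threshold]

# Hypothetical sensor readings with one anomalous value.
readings = [20.1, 20.3, 19.8, 20.0, 20.2, 35.0, 19.9]
print(gaussian_outliers(readings))  # → [35.0]
```

Note that the outlier itself inflates the estimated standard deviation (the masking effect), which is one reason the threshold must be chosen with care.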
3.2 Nearest neighbor based techniques
Nearest neighbor-based approaches have long been among the most commonly used approaches in the data mining and machine learning communities: they analyze a data instance with respect to its nearest neighbors. They use well-defined distance notions to compute the distance (similarity measure) between two data instances, and a data instance is declared an outlier if it is located far from its neighbors. Euclidean distance is a popular choice for univariate data, whereas multivariate continuous attributes are typically handled with the Mahalanobis distance metric.
Some of the nearest neighbor based techniques considered in this paper are presented in
Branch et al. (2006), Zhang et al. (2007b), Zhuang and Chen (2006). These techniques have not recently been the focus of the research community due to limitations that will be discussed in the forthcoming sections.
3.3 Clustering based techniques
Grouping similar data instances into clusters with similar behavior is known as clustering.
Clustering algorithms can be either centralized or distributed. In centralized clustering algorithms, each node transmits its entire data to the gateway/central node, which then performs the data clustering; this approach is, however, communication-inefficient. In a distributed clustering approach, each node performs clustering on its own sensed data vectors and then sends specific parameters of the clustered data to the gateway node, reducing communication overhead. The nodes then use some distance measure from the nearest cluster to identify outliers (Bezdek et al. 2011; Rajasegarar et al. 2008a, 2010b, 2012; Moshtaghi et al. 2011a,b,c; Bezdek et al. 2010; Suthaharan et al. 2010a,b).
3.4 Classification based techniques
Classification based techniques learn a classification model using the set of data instances
during the training phase and then classify the data instance to one of the training classes