Anomaly detection in streaming environmental sensor data: A data-driven
modeling approach
David J. Hill a,*, Barbara S. Minsker b
a Department of Civil and Environmental Engineering, Rutgers University, 623 Bowser Rd, Piscataway, NJ 08854, USA
b Department of Civil and Environmental Engineering, University of Illinois Urbana-Champaign, 205 N. Mathews Ave., Urbana, IL 61801, USA
article info
Article history:
Received 9 March 2009
Received in revised form 25 August 2009
Accepted 25 August 2009
Available online 24 October 2009
Keywords:
Coastal environment
Data-driven modeling
Anomaly detection
Machine learning
Real-time data
Sensor networks
Data quality control
Artificial intelligence
abstract
The deployment of environmental sensors has generated an interest in real-time applications of the data
they collect. This research develops a real-time anomaly detection method for environmental data
streams that can be used to identify data that deviate from historical patterns. The method is based on an
autoregressive data-driven model of the data stream and its corresponding prediction interval. It
performs fast, incremental evaluation of data as it becomes available, scales to large quantities of data,
and requires no pre-classification of anomalies. Furthermore, this method can be easily deployed on
a large heterogeneous sensor network. Sixteen instantiations of this method are compared based on
their ability to identify measurement errors in a windspeed data stream from Corpus Christi, Texas. The
results indicate that a multilayer perceptron model of the data stream, coupled with replacement of
anomalous data points, performs well at identifying erroneous data in this data stream.
© 2009 Published by Elsevier Ltd.
1. Introduction
In-situ environmental sensors are sensors that are physically
located in the environment they are monitoring. Through telemetry,
the time-series data collected by these sensors can be transmitted
continuously to a repository as a data stream. Recently, there have
been efforts to make use of streaming data for real-time applications
(e.g., Bonner et al., 2002). For example, draft plans for the Water and
Environmental Research Systems (WATERS) Network, a proposed
national environmental observatory network, have identified real-
time analysis and modeling as a significant priority (NRC 2006).
Because in-situ sensors operate under harsh conditions, and
because the data they collect must be transmitted across communication
networks, the data can easily become corrupted. Undetected
errors can significantly affect the data's value for real-time
applications. Thus, the National Science Foundation (NSF, 2005) has
indicated a need for automated data quality assurance and control
(QA/QC). Anomaly detection is the process of identifying data that
deviate markedly from historical patterns (Hodge and Austin,
2004). Anomalous data can be caused by sensor or data transmission
errors or by infrequent system behaviors that are often of
interest to scientific and regulatory communities. In addition to
data QA/QC, where data anomalies may be the result of sensor or
telemetry errors, anomaly detection has many other practical
applications, such as adaptive monitoring, where anomalous data
indicate phenomena that researchers may wish to investigate
further through increased sampling, and anomalous event detection,
where anomalous data signal system behaviors that require
other actions to be taken, for example in the case of a natural
disaster. These applications require that data anomalies be identified
in near-real time; thus, the anomaly detection method must be
rapid and be performed incrementally to ensure that detection
keeps up with the rate of data collection.
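To make the idea of incremental detection concrete, the following is a minimal sketch of the general scheme summarized in the abstract: a one-step-ahead autoregressive prediction, a prediction interval around it, and replacement of flagged points so that they do not contaminate later predictions. It is an illustration under stated assumptions, not the implementation evaluated in this paper. The linear AR(1) model, the interval of z residual standard deviations, the synthetic sinusoidal data, and the names fit_ar1 and detect_stream are all hypothetical choices made for brevity.

```python
import math
from statistics import mean


def fit_ar1(history):
    """Least-squares fit of x[t] = a + b * x[t-1] (assumed model form).

    Returns (a, b, s), where s is the residual standard deviation.
    """
    x, y = history[:-1], history[1:]
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(r * r for r in resid) / (len(resid) - 2))
    return a, b, s


def detect_stream(history, stream, z=3.0, replace=True):
    """Flag each incoming value that falls outside pred +/- z * s.

    z = 3.0 is an assumed interval width. With replace=True, a flagged
    value is replaced by its prediction before the next step.
    """
    a, b, s = fit_ar1(history)
    prev = history[-1]
    flags = []
    for value in stream:  # one pass, constant work per measurement
        pred = a + b * prev
        is_anomaly = abs(value - pred) > z * s
        flags.append(is_anomaly)
        prev = pred if (is_anomaly and replace) else value
    return flags


# Synthetic example: a smooth signal with one injected spike.
history = [10 + math.sin(i / 5) for i in range(200)]
stream = [10 + math.sin(i / 5) for i in range(200, 230)]
stream[12] = 25.0  # injected measurement error

flags = detect_stream(history, stream)
flags_no_replace = detect_stream(history, stream, replace=False)
```

In this sketch, replacing a flagged value with its prediction keeps the spike out of the model input for the next step. Without replacement, the valid measurement that follows the spike is compared against a prediction computed from the spike itself and is flagged as well.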
Traditionally, anomaly detection has been carried out manually
with the assistance of data visualization tools (Mourad and
Bertrand-Krajewski, 2002), but manual methods are unsuitable for real-time
detection in streaming data, since they require an operator to
perform analysis 24 h a day, 7 days a week. More recently,
researchers have suggested automated statistical and machine
learning approaches, such as minimum volume ellipsoid (Rousseeuw
and Leroy, 1996), convex peeling (Rousseeuw and Leroy, 1996),
nearest neighbor (Tang et al., 2002; Ramaswamy et al., 2000),
clustering (Bolton and Hand, 2001), neural network classifier
* Corresponding author. Tel.: +1 217 714 3490.
E-mail addresses: ecodavid@rci.rutgers.edu (D.J. Hill), minsker@illinois.edu (B.S. Minsker).
Environmental Modelling & Software
1364-8152/$ – see front matter © 2009 Published by Elsevier Ltd.
doi:10.1016/j.envsoft.2009.08.010
Environmental Modelling & Software 25 (2010) 1014–1022