An Accurate and Low-cost PM2.5 Estimation Method Based on Artificial Neural Network
Lixue Xia, Rong Luo, Bin Zhao, Yu Wang, Huazhong Yang
Dept. of E.E., Tsinghua National Laboratory for Information Science and Technology (TNList),
Tsinghua University, Beijing, China
e-mail: xialx13@mails.tsinghua.edu.cn
Abstract—PM2.5 has become a major pollutant in many cities in
China. It is a harmful pollutant that may cause several kinds of
lung disease. However, the existing methods for monitoring PM2.5
with high accuracy are too expensive to be widely deployed, and
the high cost also limits further research on PM2.5. This paper
implements a method to estimate PM2.5 with low cost and high
accuracy using an Artificial Neural Network (ANN) fed with other
pollutants and meteorological factors that are easy to monitor.
An Entropy Maximization step is proposed to avoid the over-fitting
related to the distribution of pollutant data. In addition, the
choice of input attributes is formulated as an optimization
problem, and an iterative greedy algorithm is proposed to solve
it, which reduces the cost and increases the estimation accuracy
at the same time. Experiments show that the linear correlation
coefficient between the estimated and real values is 0.9488. Our
model can also classify PM2.5 levels with high accuracy.
Additionally, the trade-off between accuracy and cost is
investigated according to the price and error rate of each sensor.
I. INTRODUCTION
Nowadays, the high frequency of hazy weather in many cities in
China has drawn increasing attention to particles with an
aerodynamic diameter of less than 2.5 micrometers (PM2.5). PM2.5
can carry many kinds of poisonous chemicals and harm human health,
and may cause diseases such as asthma and chronic obstructive
pulmonary disease (COPD) [1]. As a result, many citizens urgently
want to know the PM2.5 level in their living and working
environments.
However, the existing methods to accurately monitor PM2.5 require
a high-cost and complicated system, which makes it difficult to
measure PM2.5 without a specialized monitoring station [2]. Table I
shows that the highly accurate instruments cost more than most
citizens and researchers can afford. As a result, PM2.5 monitoring
is far from universal, and the lack of data hinders research on
and control of PM2.5. The high cost also makes it difficult to
analyse the PM2.5 problem in a specific environment, such as
indoor PM2.5 [3].
In order to reduce the cost of monitoring PM2.5, some researchers
have recently used low-cost methods such as the ANN technique to
estimate PM2.5 [4]–[6]. The ANN technique attempts to calculate
PM2.5 from data that are easy to sense. Nevertheless, PM2.5 has
complex causes and is influenced by many more factors than
molecular pollutants such as O3 [7]. The estimation accuracy is
low when ANN is used directly, or the cost rises again once many kinds
of expensive data are used. Given this situation, we identify two
major problems that limit the estimation accuracy of the ANN
model and choose specific algorithms to solve them.

[Figure: PM2.5 IAQI (vertical axis, 0 to 500) against time
(horizontal axis, 13-Mar-2014 12:00:00 to 24-Apr-2014 22:00:00).
A marked line shows the boundary of Heavily Polluted according to
the definition of Air Quality Level [8].]
Fig. 1. IAQI of PM2.5 over a month

TABLE I
COST AND METHOD OF EQUIPMENT TO MONITOR PM2.5
Equipment          Cost          Principle
TEOM 1405          $22,000       TEOM gravimetric
BAM-1020           $23,000       Beta-ray
TSI DustTrak II    80,000 CNY    Photometric
Dylos DC1700       $425          Particle counter

The first problem is the estimation error caused by differences in
data distribution between data sets. As shown in Fig. 1, the
Individual Air Quality Index (IAQI) of PM2.5 usually takes a low
value and only rarely takes a high value. However, it is precisely
the data above the Heavily Polluted boundary in Fig. 1 that
contain important information. Because they are few, these
important data may be ignored or given little weight in the
training phase. This is a kind of over-fitting that leads to a
high error rate in the key range, so the trained model performs
poorly when the TestingDataset differs from the TrainingDataset,
for example in heavily polluted weeks. This paper proposes an
Entropy Maximization operation before the training phase to
emphasize the important data, which avoids the over-fitting
related to the data distribution and thus improves the estimation
accuracy.
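
The passage above does not say how the Entropy Maximization operation is
carried out. As a rough illustration of the underlying idea only, the
sketch below assumes one simple realization: the target values are
grouped into fixed-width bins, and rare bins are oversampled with
replacement until every occupied bin holds the same number of samples,
which maximizes the Shannon entropy of the bin histogram. The function
names, the bin width of 50, and the synthetic data are all hypothetical
and are not taken from the paper.

```python
import math
import random
from collections import Counter, defaultdict


def bin_entropy(values, bin_width=50):
    """Shannon entropy (in bits) of the fixed-width histogram of `values`."""
    counts = Counter(int(v // bin_width) for v in values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def oversample_to_uniform(samples, bin_width=50, seed=0):
    """Oversample (features, target) pairs so all occupied target bins are equal in size.

    This is an assumed stand-in for an entropy-maximizing step, not the
    authors' procedure.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for features, target in samples:
        bins[int(target // bin_width)].append((features, target))
    largest = max(len(members) for members in bins.values())
    balanced = []
    for members in bins.values():
        balanced.extend(members)
        # Pad smaller bins with copies drawn with replacement.
        balanced.extend(rng.choice(members) for _ in range(largest - len(members)))
    rng.shuffle(balanced)
    return balanced


# Synthetic, made-up data: mostly low IAQI values, very few high ones.
samples = (
    [((0.1,), 30.0)] * 90      # clean hours
    + [((0.5,), 120.0)] * 8    # lightly polluted hours
    + [((0.9,), 250.0)] * 2    # heavily polluted hours
)
balanced = oversample_to_uniform(samples)

print("entropy before:", bin_entropy([t for _, t in samples]))
print("entropy after: ", bin_entropy([t for _, t in balanced]))
```

After resampling, the three occupied bins are equally populated, so the
histogram entropy reaches its maximum for three bins and the rare
high-value samples carry as much total weight in training as the common
low-value ones. Duplicating samples is only one option; per-sample loss
weights would have a similar effect without enlarging the data set.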
Second, redundant input attributes lead to unnecessary cost and
may introduce noise that reduces the estimation accuracy. An
attribute refers to a kind of data, for example, WindSpeed.
Considering that some meteorological data tend to aggregate by
season, using irrelevant input attributes may also cause
over-fitting. In fact, the problem
978-1-4799-7792-5/15/$31.00 ©2015 IEEE