Fig. 2. Responses of the PS relative humidity sensor as the humidity is varied, in the initial state and after storage for 3 and 6 months at room temperature (separate symbols mark increasing and decreasing humidity).
drifts from the initial value with time at the same relative humidity. Over the long term, the drift behavior is also observed to be nonlinear. In short, a practical PS sensor exhibits nonideal characteristics such as nonlinearity, hysteresis and drift.
3. Support vector machine for regression
A simple description of the ν-SVM algorithm for regression
is provided here; for more details please refer to Ref. [20].
The regression approximation addresses the problem of estimating a function based on a given set of data points $\{(x_1, y_1), \ldots, (x_L, y_L)\}$, where $x_i \in \mathbb{R}^n$ is an input and $y_i \in \mathbb{R}$ is a desired output, produced from an unknown function. SVM approximates the function in the following form:
$$y = w^{T}\phi(x) + b, \qquad (1)$$
where φ(x) denotes the high-dimensional (possibly infinite-dimensional) feature vector onto which the input x is nonlinearly mapped. The coefficients w and b are estimated by minimizing the regularized risk function
$$R(C) = \frac{1}{2}w^{T}w + C R^{\varepsilon}_{\mathrm{emp}}, \qquad (2)$$
$$R^{\varepsilon}_{\mathrm{emp}} = \frac{1}{L}\sum_{i=1}^{L} |d_i - y_i|_{\varepsilon}, \qquad (3)$$
$$|d - y|_{\varepsilon} = \begin{cases} |d - y| - \varepsilon, & |d - y| \ge \varepsilon, \\ 0, & \text{otherwise.} \end{cases} \qquad (4)$$
The first term $\frac{1}{2}w^{T}w$ is called the regularization term; minimizing it makes the function as flat as possible. The second term $C R^{\varepsilon}_{\mathrm{emp}}$ is the empirical error (risk), measured by the ε-insensitive loss function given in Eq. (4). This loss function has the advantage that only a sparse subset of the data points is needed to represent the function given by Eq. (1). C is referred to as the regularization constant, which determines the trade-off between the empirical error and the regularization term, and ε is called the tube size of the SVM. Both C and ε are user-prescribed parameters selected empirically.
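As a concrete illustration of Eqs. (3) and (4), the following sketch (ours, not from the paper) evaluates the ε-insensitive loss and the resulting empirical risk in NumPy; the sample values and the tube size are arbitrary.

# Minimal NumPy sketch (not from the paper) of the epsilon-insensitive
# loss in Eq. (4) and the empirical risk in Eq. (3).
import numpy as np

def eps_insensitive_loss(d, y, eps):
    """|d - y|_eps: zero inside the tube, linear outside it."""
    return np.maximum(np.abs(d - y) - eps, 0.0)

def empirical_risk(d, y, eps):
    """R_emp^eps in Eq. (3): mean epsilon-insensitive loss over L points."""
    return np.mean(eps_insensitive_loss(d, y, eps))

# Example: deviations smaller than eps = 0.1 contribute nothing to the risk.
d = np.array([1.00, 2.00, 3.00])          # desired outputs d_i
y = np.array([1.05, 2.30, 2.80])          # model outputs y_i
print(empirical_risk(d, y, eps=0.1))      # (0 + 0.2 + 0.1) / 3 = 0.1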
The parameter ε is useful if the desired accuracy of the approximation can be specified beforehand. In some cases, however, we simply want the estimate to be as accurate as possible, without committing to a specific level of accuracy. Hence, Ref. [20] presented a modification of the SVM that automatically minimizes ε, thus adapting the accuracy level to the data at hand. A new parameter ν (0 ≤ ν ≤ 1) was introduced, which lets one control the number of support vectors and training errors. More precisely, ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. To obtain the estimates of w and b, Eq. (2) is transformed into the primal problem of ν-SVM regression:
$$\begin{aligned}
\text{minimize} \quad & \frac{1}{2}w^{T}w + C\left(\nu\varepsilon + \frac{1}{L}\sum_{i=1}^{L}(\xi_i + \xi_i^{*})\right) \\
\text{subject to} \quad & (w^{T}\phi(x_i) + b) - y_i \le \varepsilon + \xi_i, \\
& y_i - (w^{T}\phi(x_i) + b) \le \varepsilon + \xi_i^{*}, \\
& \xi_i, \xi_i^{*} \ge 0, \quad i = 1, \ldots, L, \quad \varepsilon \ge 0.
\end{aligned} \qquad (5)$$
Here, the slack variables ξ and ξ* are introduced: ξ is the upper training error (ξ* the lower) with respect to the ε-insensitive tube $|y - (w^{T}\phi(x) + b)| \le \varepsilon$.
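For a rough sense of how the ν-SVR problem in Eq. (5) behaves in practice, the sketch below uses scikit-learn's NuSVR (which solves a ν-SVR formulation of this kind via LIBSVM) on synthetic data; the data set and the choices of C, γ and the tested ν values are our own illustrative assumptions, not the paper's.

# Sketch (assuming scikit-learn's NuSVR, which solves a nu-SVR problem
# like Eq. (5)): nu should lower-bound the fraction of support vectors.
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))                              # synthetic inputs
y = np.sin(2 * np.pi * X).ravel() + 0.05 * rng.normal(size=200)   # noisy target

for nu in (0.1, 0.3, 0.6):
    model = NuSVR(nu=nu, C=10.0, kernel='rbf', gamma=2.0).fit(X, y)
    frac_sv = model.support_.size / X.shape[0]
    print(f"nu = {nu:.1f}  fraction of support vectors = {frac_sv:.2f}")

Consistent with the bound stated above, the printed fraction of support vectors should stay at or above each ν value.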
By introducing Lagrange multipliers and exploiting the opti-
mality constraints, solving Eq. (5) is equivalent to finding
$$\begin{aligned}
\text{minimize} \quad & \frac{1}{2}(\alpha - \alpha^{*})^{T} Q\,(\alpha - \alpha^{*}) + y^{T}(\alpha - \alpha^{*}) \\
\text{subject to} \quad & e^{T}(\alpha - \alpha^{*}) = 0, \quad e^{T}(\alpha + \alpha^{*}) \le C\nu, \\
& 0 \le \alpha_i, \alpha_i^{*} \le \frac{C}{L}, \quad i = 1, \ldots, L,
\end{aligned} \qquad (6)$$
where $Q_{ij} = K(x_i, x_j) = \phi(x_i)^{T}\phi(x_j)$ is the kernel and e is the vector of all ones; α and α* are the introduced Lagrange multipliers.
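To make the identity $Q_{ij} = K(x_i, x_j) = \phi(x_i)^{T}\phi(x_j)$ concrete, the short sketch below (our example, using a homogeneous degree-2 polynomial kernel whose feature map can be written out by hand) builds the Gram matrix Q both ways and checks that they agree.

# Sketch (our example): for the homogeneous degree-2 polynomial kernel
# K(u, v) = (u . v)^2 on R^2, the explicit feature map is
# phi(u) = [u1^2, u2^2, sqrt(2)*u1*u2], so the Gram matrix built from
# kernel evaluations matches the one built from explicit feature vectors.
import numpy as np

def phi(u):
    return np.array([u[0]**2, u[1]**2, np.sqrt(2) * u[0] * u[1]])

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))                      # five 2-d input points

Q_kernel = (X @ X.T) ** 2                        # Q_ij = K(x_i, x_j)
Phi = np.array([phi(x) for x in X])              # explicit feature map
Q_features = Phi @ Phi.T                         # Q_ij = phi(x_i)^T phi(x_j)

print(np.allclose(Q_kernel, Q_features))         # True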
Thus, the regression estimate given by Eq. (1) can take the following form:
$$y = \sum_{i=1}^{L} (\alpha_i - \alpha_i^{*})\, K(x_i, x) + b. \qquad (7)$$
Based on the Karush–Kuhn–Tucker (KKT) conditions of quadratic programming, only some of the coefficients $(\alpha_i - \alpha_i^{*})$ assume nonzero values, and the data points associated with them are referred to as support vectors.
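As a hedged sketch of Eq. (7) (assuming scikit-learn's NuSVR with an RBF kernel; the attributes dual_coef_, support_vectors_ and intercept_ are that library's names, not the paper's), the prediction can be rebuilt by hand by summing kernel evaluations over the support vectors only and adding the bias b:

# Sketch (assuming scikit-learn's NuSVR, RBF kernel): only the support
# vectors carry nonzero dual coefficients, so Eq. (7) can be reproduced
# from them directly.
import numpy as np
from sklearn.svm import NuSVR
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = np.sin(2 * np.pi * X).ravel() + 0.05 * rng.normal(size=100)

gamma = 2.0
model = NuSVR(nu=0.3, C=10.0, kernel='rbf', gamma=gamma).fit(X, y)

# dual_coef_ holds the nonzero dual coefficients (alpha_i - alpha_i*,
# up to the library's sign convention) of the support vectors only.
K = rbf_kernel(model.support_vectors_, X, gamma=gamma)   # K(x_i, x) for each SV
y_manual = model.dual_coef_ @ K + model.intercept_       # Eq. (7) by hand

print(np.allclose(y_manual.ravel(), model.predict(X)))   # True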
In Eq. (7), K(·, ·) is the kernel function: its value K(x_i, x_j) is equal to the inner product of the two vectors x_i and x_j in the feature space, that is, $K(x_i, x_j) = \phi(x_i)^{T}\phi(x_j)$. The elegance of using the kernel function lies in the fact that one can deal with feature spaces of arbitrary dimensionality without having to compute the map φ(x) explicitly. Any function that satisfies Mercer's condition [12] can be used as the kernel function. Common