
(Cristianini & Shawe-Taylor, 2000; Gunn, 1998; Hearst
et al., 1998; Vapnik, 1998).
SVM is simple enough to analyze mathematically, since it can be shown to correspond to a linear method in a high-dimensional feature space that is nonlinearly related to the input space. In this sense, SVM may serve as a promising alternative that combines the strengths of conventional statistical methods, which are more theory-driven and easy to analyze, with those of machine learning methods, which are more data-driven, distribution-free, and robust. Recently, the SVM approach has been introduced to several financial applications such as credit rating, time series prediction, and insurance claim fraud detection (Fan & Palaniswami, 2000; Gestel et al., 2001; Huang, Chen, Hsu, Chen, & Wu, 2004; Kim, 2003; Tay & Cao, 2001; Viaene, Derrig, Baesens, & Dedene, 2002). These studies reported that SVM was comparable to, and in some cases superior to, other classifiers including ANN, CBR, MDA, and Logit in terms of generalization performance. Motivated by these previous studies, we apply SVM to the domain of bankruptcy prediction and compare its prediction performance with those of MDA, Logit, and BPNs.
A simple description of the SVM algorithm is provided as follows. Given a training set $D = \{x_i, y_i\}_{i=1}^{N}$ with input vectors $x_i = (x_i^{(1)}, \ldots, x_i^{(n)})^T \in \mathbb{R}^n$ and target labels $y_i \in \{-1, +1\}$, the support vector machine (SVM) classifier, according to Vapnik's original formulation, satisfies the following conditions:

$$
\begin{cases}
w^T \phi(x_i) + b \ge +1, & \text{if } y_i = +1 \\
w^T \phi(x_i) + b \le -1, & \text{if } y_i = -1
\end{cases}
\qquad (1)
$$

which is equivalent to

$$
y_i \left[ w^T \phi(x_i) + b \right] \ge 1, \quad i = 1, \ldots, N \qquad (2)
$$
where w represents the weight vector and b the bias.
The nonlinear function $\phi(\cdot): \mathbb{R}^n \rightarrow \mathbb{R}^{n_k}$ maps the input or measurement space to a high-dimensional, and possibly infinite-dimensional, feature space. Eq. (2) then comes down to the construction of two parallel bounding hyperplanes at opposite sides of a separating hyperplane $w^T \phi(x) + b = 0$ in the feature space, with the margin width between both hyperplanes equal to $2/\|w\|_2$. In primal weight space, the classifier then takes the decision function form (3):

$$
\operatorname{sgn}\left( w^T \phi(x) + b \right) \qquad (3)
$$
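As a toy illustration (ours, not part of the original formulation), the sketch below checks the constraints of Eq. (2) and evaluates the decision function (3) for a hand-picked hyperplane, assuming the identity feature map $\phi(x) = x$; the data, $w$, and $b$ are arbitrary placeholders.

```python
import numpy as np

# Toy linearly separable data: two points per class in R^2.
X = np.array([[2.0, 2.0], [3.0, 3.0],       # positive class
              [-2.0, -2.0], [-3.0, -1.0]])  # negative class
y = np.array([+1, +1, -1, -1])

# Hand-picked separating hyperplane w^T x + b = 0 (phi = identity).
w = np.array([0.5, 0.5])
b = 0.0

# Constraints (2): y_i (w^T x_i + b) >= 1 for every training instance.
margins = y * (X @ w + b)
print("constraints satisfied:", np.all(margins >= 1))

# Decision function (3): sgn(w^T x + b) for a new point.
x_new = np.array([1.5, 0.5])
print("predicted label:", np.sign(w @ x_new + b))

# Margin width between the two bounding hyperplanes: 2 / ||w||_2.
print("margin width:", 2 / np.linalg.norm(w))
```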
Most classification problems are, however, not linearly separable. Therefore, slack variables $\xi_i$ are generally introduced to permit misclassification when finding the weight vector. One defines the primal optimization problem as
$$
\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i \qquad (4)
$$

subject to

$$
\begin{cases}
y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, & i = 1, \ldots, N \\
\xi_i \ge 0, & i = 1, \ldots, N
\end{cases}
\qquad (5)
$$
where the $\xi_i$ are slack variables needed to allow misclassifications in the set of inequalities, and $C \in \mathbb{R}^{+}$ is a tuning hyperparameter weighting the importance of classification errors vis-à-vis the margin width. The solution of the primal problem is obtained after constructing the Lagrangian. From the conditions of optimality, one obtains a quadratic programming (QP) problem with Lagrange multipliers $\alpha_i$. A multiplier $\alpha_i$ exists for each training data instance. Data instances corresponding to non-zero $\alpha_i$ are called support vectors.
On the other hand, the above primal problem can be converted into the following dual problem with objective function (6) and constraints (7). Since the decision variables are the Lagrange multipliers, whose non-zero values identify the support vectors, the results of this dual problem are easier to interpret than those of the primal one.

$$
\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \qquad (6)
$$

subject to

$$
\begin{cases}
0 \le \alpha_i \le C, & i = 1, \ldots, N \\
y^T \alpha = 0
\end{cases}
\qquad (7)
$$
In the dual problem above, $e$ is the vector of all ones, $Q$ is an $N \times N$ positive semi-definite matrix with $Q_{ij} = y_i y_j K(x_i, x_j)$, and $K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)$ is the kernel. Here, the training vectors $x_i$ are mapped into a higher (possibly infinite) dimensional space by the function $\phi$. As is typical for SVMs, we never calculate $w$ or $\phi(x)$ explicitly. This is made possible by Mercer's condition, which relates the mapping function $\phi(x)$ to the kernel function $K(\cdot,\cdot)$ as follows:

$$
K(x_i, x_j) = \phi(x_i)^T \phi(x_j) \qquad (8)
$$
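As a brief numerical aside (ours, not part of the original text), the kernel trick can be made concrete by forming the Gram matrix of an RBF kernel and the matrix $Q$ of the dual (6) directly from the data, without ever computing $\phi(x)$, and verifying that both are positive semi-definite as Mercer's condition guarantees; the data and $\gamma$ below are placeholders.

```python
import numpy as np

# Placeholder data and labels.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-1.0, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
gamma = 0.5  # RBF kernel width parameter

# Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2); no phi(x) is needed.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-gamma * sq_dists)

# Q_ij = y_i y_j K(x_i, x_j), the matrix appearing in the dual (6).
Q = np.outer(y, y) * K

# Mercer's condition implies K (and hence Q) is positive semi-definite,
# so the smallest eigenvalues are non-negative up to round-off.
print("min eigenvalue of K:", np.linalg.eigvalsh(K).min())
print("min eigenvalue of Q:", np.linalg.eigvalsh(Q).min())
```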
For the kernel function $K(\cdot,\cdot)$, one typically has several design choices, such as the linear kernel $K(x_i, x_j) = x_i^T x_j$; the polynomial kernel of degree $d$, $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, $\gamma > 0$; the radial basis function (RBF) kernel $K(x_i, x_j) = \exp\{-\gamma \|x_i - x_j\|^2\}$, $\gamma > 0$; and the sigmoid kernel $K(x_i, x_j) = \tanh\{\gamma x_i^T x_j + r\}$, where $d, r \in \mathbb{N}$ and $\gamma \in \mathbb{R}^{+}$ are constants. Then one constructs the final SVM classifier as

$$
\operatorname{sgn}\left( \sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b \right) \qquad (9)
$$
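For a concrete end-to-end sketch (ours, with placeholder data and hyperparameters), scikit-learn's SVC, a wrapper around the LIBSVM implementation cited below, exposes the support vectors, the products $\alpha_i y_i$, and the bias $b$, so the classifier of Eq. (9) can be re-evaluated by hand and compared with the library's decision function.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder training data and labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-1.0, -2.5],
              [0.5, 1.5], [-0.5, -1.0]])
y = np.array([1, 1, -1, -1, 1, -1])

gamma, C = 0.5, 1.0
clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, y)

# Re-evaluate Eq. (9) by hand: sum_i alpha_i y_i K(x, x_i) + b,
# where the sum runs over the support vectors only (alpha_i > 0).
x_new = np.array([1.0, 0.0])
k = np.exp(-gamma * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))
decision = clf.dual_coef_[0] @ k + clf.intercept_[0]  # dual_coef_ holds alpha_i * y_i

print("manual decision value:", decision)
print("library decision value:", clf.decision_function([x_new])[0])
print("predicted label:", np.sign(decision))
```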
The details of the optimization are discussed in (Chang &
Lin, 2001; Cristianini & Shawe-Taylor, 2000; Gunn, 1998;
Vapnik, 1998).