the entire set of non-zero Lagrange multipliers has been identified, hence the last step solves the
large QP problem.
Chunking seriously reduces the size of the matrix from the number of training examples squared
to approximately the number of non-zero Lagrange multipliers squared. However, chunking still
cannot handle large-scale training problems, since even this reduced matrix cannot fit into
memory.
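The chunking working-set update described above can be sketched as follows. This is a minimal illustration of the bookkeeping only: the numerical QP solve on each chunk is not shown, and the function name and arguments are our own, not from the original papers. Each step keeps the examples whose Lagrange multipliers remained non-zero and appends the M worst KKT violators from the rest of the training set.

```python
def next_chunk(alphas, current_chunk, violators, m):
    """One chunking step (bookkeeping sketch, not the QP solve).

    alphas        -- current Lagrange multiplier values, indexed by example
    current_chunk -- indices optimized in the previous sub-problem
    violators     -- indices of KKT violators, worst first
    m             -- how many new violators to add this step
    """
    # Discard examples whose multipliers went to zero in the last solve.
    kept = [i for i in current_chunk if alphas[i] != 0.0]
    # Add the m worst violators not already retained.
    new = [i for i in violators if i not in kept][:m]
    return kept + new
```

Note that the chunk can grow without bound: the number of retained non-zero multipliers tends to increase as training proceeds, which is exactly why chunking fails on problems whose support-vector set is itself too large for memory.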
In 1997, Osuna et al. [16] proved a theorem that suggests a whole new set of QP algorithms
for SVMs. The theorem proves that the large QP problem can be broken down into a series of
smaller QP sub-problems. As long as at least one example that violates the KKT conditions is
added to the examples for the previous sub-problem, each step will reduce the overall objective
function and maintain a feasible point that obeys all of the constraints. Therefore, a sequence of
QP sub-problems that always adds at least one violator is guaranteed to converge. Notice
that the chunking algorithm obeys the conditions of the theorem, and hence will converge.
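The KKT conditions referred to above can be made concrete. For the SVM dual with box constraint 0 ≤ α_i ≤ C, an example with α_i = 0 must satisfy y_i f(x_i) ≥ 1, an unbounded support vector (0 < α_i < C) must satisfy y_i f(x_i) = 1, and a bounded support vector (α_i = C) must satisfy y_i f(x_i) ≤ 1. A sketch of a violation check follows; the function name, argument layout, and tolerance are our own choices for illustration:

```python
def kkt_violations(alphas, y, f, C, tol=1e-3):
    """Return indices of examples violating the SVM dual KKT conditions.

    alphas -- Lagrange multipliers alpha_i
    y      -- labels, each -1 or +1
    f      -- current SVM outputs f(x_i)
    C      -- box-constraint upper bound
    tol    -- numerical tolerance on the conditions
    """
    bad = []
    for i, (a, yi, fi) in enumerate(zip(alphas, y, f)):
        r = yi * fi
        if a < tol:                      # alpha_i == 0: need y_i f_i >= 1
            if r < 1 - tol:
                bad.append(i)
        elif a > C - tol:                # alpha_i == C: need y_i f_i <= 1
            if r > 1 + tol:
                bad.append(i)
        else:                            # 0 < alpha_i < C: need y_i f_i == 1
            if abs(r - 1) > tol:
                bad.append(i)
    return bad
```

Any non-empty return value means the current point is not optimal, so the theorem guarantees that moving one such violator into the next sub-problem makes progress.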
Osuna et al. suggest keeping a constant-size matrix for every QP sub-problem, which implies
adding and deleting the same number of examples at every step [16] (see figure 2). Using a
constant-size matrix allows training on arbitrarily sized data sets. The algorithm given in
Osuna’s paper [16] suggests adding one example and subtracting one example every step.
Clearly this would be inefficient, because it would use an entire numerical QP optimization step
to cause one training example to obey the KKT conditions. In practice, researchers add and
subtract multiple examples according to unpublished heuristics [17]. In any event, a numerical
QP solver is required for all of these methods. Numerical QP is notoriously tricky to get right;
there are many numerical precision issues that need to be addressed.
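The constant-size working-set update can be sketched as a swap. Which examples to drop is precisely the unpublished heuristic mentioned above; dropping the oldest members of the working set here is only a placeholder for such a heuristic, and the function name is our own:

```python
def swap_working_set(working_set, violators, k):
    """Osuna-style update: keep the working set at a constant size by
    removing as many examples as are added.

    working_set -- indices currently in the QP sub-problem
    violators   -- KKT-violator indices outside the working set, worst first
    k           -- how many examples to swap in (and out)
    """
    # Bring in up to k violators not already being optimized.
    incoming = [i for i in violators if i not in working_set][:k]
    k = len(incoming)            # swap out exactly as many as we add
    # Placeholder heuristic: drop the k oldest working-set members.
    kept = working_set[k:]
    return kept + incoming
```

With k = 1 this is the variant given in Osuna's paper, which spends a full numerical QP solve per example moved; larger k amortizes that cost, which is why practical implementations swap several examples at a time.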
[Figure 2 diagram: three panels, labeled Chunking, Osuna, and SMO]
Figure 2. Three alternative methods for training SVMs: Chunking, Osuna’s algorithm, and SMO. For
each method, three steps are illustrated. The thin horizontal line at every step represents the training
set, while the thick boxes represent the Lagrange multipliers being optimized at that step. For
chunking, a fixed number of examples are added every step, while the zero Lagrange multipliers are
discarded at every step. Thus, the number of examples trained per step tends to grow. For Osuna’s
algorithm, a fixed number of examples are optimized every step: the same number of examples is
added to and discarded from the problem at every step. For SMO, only two examples are analytically
optimized at every step, so that each step is very fast.