2.4.3 Classification and detection
Both the classification and detection tasks were evaluated as a set of 20 independent two-class tasks: e.g. for classification "is there a car in the image?", and for detection "where are the cars in the image (if any)?". A separate 'score' is computed for each of the classes. For the classification task, participants submitted results in the form of a confidence level for each image and for each class, with larger values indicating greater confidence that the image contains the object of interest. For the detection task, participants submitted a bounding box for each detection, with a confidence level for each bounding box. The provision of a confidence level allows results to be ranked such that the trade-off between false positives and false negatives can be evaluated, without defining arbitrary costs on each type of classification error.
In the case of classification, the correctness of a class prediction depends only on whether an image contains an instance of that class or not. However, for detection a decision must be made on whether a prediction is correct or not. To this end, detections were assigned to ground truth objects and judged to be true or false positives by measuring bounding box overlap. To be considered a correct detection, the area of overlap $a_o$ between the predicted bounding box $B_p$ and ground truth bounding box $B_{gt}$ must exceed 50% by the formula:

$$a_o = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})}, \qquad (1)$$

where $B_p \cap B_{gt}$ denotes the intersection of the predicted and ground truth bounding boxes and $B_p \cup B_{gt}$ their union.
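As an illustration, the overlap criterion of Eq. (1) can be computed directly from box coordinates. The following is a minimal Python sketch under our own conventions (corner coordinates (xmin, ymin, xmax, ymax); note that the official MATLAB devkit additionally counts pixel coordinates inclusively, which this sketch omits):

```python
def bbox_iou(box_p, box_gt):
    """Area of overlap a_o between two boxes given as (xmin, ymin, xmax, ymax)."""
    # Width/height of the intersection rectangle; zero if the boxes are disjoint.
    iw = max(0.0, min(box_p[2], box_gt[2]) - max(box_p[0], box_gt[0]))
    ih = max(0.0, min(box_p[3], box_gt[3]) - max(box_p[1], box_gt[1]))
    inter = iw * ih
    # Union = sum of the two areas minus the doubly-counted intersection.
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    return inter / (area_p + area_gt - inter)

# Correct-detection test of Eq. (1): the overlap must exceed 50%.
assert bbox_iou((0, 0, 10, 10), (0, 0, 10, 10)) > 0.5
```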
Detections output by a method were assigned to ground truth object annotations satisfying the overlap criterion, in order of decreasing confidence output. Ground truth objects with no matching detection are false negatives. Multiple detections of the same object in an image were considered false detections, e.g. 5 detections of a single object counted as 1 correct detection and 4 false detections; it was the responsibility of the participant's system to filter multiple detections from its output.
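A simplified sketch of this assignment procedure, in the same spirit as the devkit but not the official code (it reuses the hypothetical bbox_iou above):

```python
def match_detections(detections, gt_boxes, threshold=0.5):
    """Greedily assign detections to ground truth boxes of one class in one
    image; returns one True (correct) / False flag per detection.

    detections: list of (confidence, box) pairs
    gt_boxes:   list of ground truth boxes
    """
    detections = sorted(detections, key=lambda d: -d[0])  # decreasing confidence
    taken = [False] * len(gt_boxes)   # each object may be matched at most once
    flags = []
    for conf, box in detections:
        # Ground truth box with the largest overlap, if any.
        overlaps = [bbox_iou(box, gt) for gt in gt_boxes]
        best = max(range(len(gt_boxes)), key=lambda j: overlaps[j], default=None)
        if best is not None and overlaps[best] > threshold and not taken[best]:
            taken[best] = True
            flags.append(True)        # true positive
        else:
            flags.append(False)       # false detection (including duplicates)
    return flags  # unmatched entries of gt_boxes are the false negatives
```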
For a given task and class, the precision-recall curve is computed from a method's ranked output. Up until 2009, interpolated average precision (Salton and McGill, 1986) was used to evaluate both classification and detection. However, from 2010 onwards the method of computing AP changed to use all data points rather than TREC-style sampling (which only sampled the monotonically decreasing curve at a fixed set of uniformly-spaced recall values 0, 0.1, 0.2, ..., 1). The intention in interpolating the precision-recall curve was to reduce the impact of the 'wiggles' in the precision-recall curve, caused by small variations in the ranking of examples. However, the downside of this interpolation was that the evaluation was too crude to discriminate between the methods at low AP.
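The two AP variants can be contrasted in a short sketch (our own illustrative NumPy code; the official evaluation is the MATLAB development kit):

```python
import numpy as np

def average_precision(is_tp, n_positive, interpolated=False):
    """AP for one class from output ranked by decreasing confidence.

    is_tp:      per-ranked-item flags, True where the item is a true positive
    n_positive: number of ground truth positives (defines recall)
    """
    is_tp = np.asarray(is_tp, dtype=bool)
    tp = np.cumsum(is_tp)
    fp = np.cumsum(~is_tp)
    recall = tp / n_positive
    precision = tp / (tp + fp)
    if interpolated:
        # TREC-style sampling used up to 2009: maximum precision at the
        # 11 fixed recall levels 0, 0.1, ..., 1.
        return float(np.mean([precision[recall >= r].max()
                              if (recall >= r).any() else 0.0
                              for r in np.linspace(0.0, 1.0, 11)]))
    # From 2010 onwards: area under the monotonically decreasing envelope
    # of the precision-recall curve, using every data point.
    envelope = np.maximum.accumulate(precision[::-1])[::-1]
    padded_recall = np.concatenate(([0.0], recall))
    return float(np.sum((padded_recall[1:] - padded_recall[:-1]) * envelope))
```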
2.4.4 Segmentation
The segmentation challenge was assessed per class on the intersection of the inferred segmentation and the ground truth, divided by the union (commonly referred to as the 'intersection over union' metric):

$$\text{seg. accuracy} = \frac{\text{true pos.}}{\text{true pos.} + \text{false pos.} + \text{false neg.}} \qquad (2)$$
Pixels marked 'void' in the ground truth (i.e. those around the border of an object that are marked as neither an object class nor background) are excluded from this measure. Note that we did not evaluate at the individual object level, even though the data had annotation that would have allowed this. Hence, the precision of the segmentation between overlapping objects of the same class was not assessed.
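A minimal sketch of this measure over label images (our own code; we assume the VOC convention of encoding void pixels with label value 255):

```python
import numpy as np

def segmentation_accuracy(pred, gt, class_id, void_id=255):
    """Per-class 'intersection over union' of Eq. (2) between predicted and
    ground truth label images (2-D integer arrays of class indices).

    Pixels labelled void in the ground truth are excluded entirely;
    void_id=255 is assumed from the VOC label encoding.
    """
    valid = gt != void_id
    p = (pred == class_id) & valid
    g = (gt == class_id) & valid
    tp = np.sum(p & g)      # true positives
    fp = np.sum(p & ~g)     # false positives
    fn = np.sum(~p & g)     # false negatives
    return tp / (tp + fp + fn)
```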
2.4.5 Action classification
The task is assessed in a similar manner to classification. For each action class a score for that class should be given for the person performing the action (indicated by a bounding box or a point), so that the test data can be ranked. The average precision is then computed for each class.
2.4.6 Person layout
At test time the method must output the bounding boxes of the parts (head, hands and feet) that are visible, together with a single real-valued confidence of the layout so that a precision/recall curve can be drawn. From VOC 2010 onwards, person layout was evaluated by how well each part individually could be predicted: for each of the part types (head, hands and feet) a precision/recall curve was computed, using the confidence supplied with the person layout to determine the ranking. A prediction of a part was considered true or false according to the overlap test, as used in the detection challenge, i.e. for a true prediction the bounding box of the part overlaps the ground truth by at least 50%. For each part type, the average precision was used as the quantitative measure.
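Conceptually this reuses the detection machinery per part type; a condensed sketch building on the hypothetical helpers above:

```python
def part_ap(part_results, n_gt_parts):
    """AP for one part type ('head', 'hand' or 'foot').

    part_results: one (layout_confidence, is_correct) pair per predicted
                  part, where is_correct is the result of the 50% overlap
                  test against the matching ground truth part (e.g. via
                  bbox_iou above)
    n_gt_parts:   number of ground truth parts of this type
    """
    ranked = sorted(part_results, key=lambda r: -r[0])  # rank by layout confidence
    return average_precision([correct for _, correct in ranked], n_gt_parts)
```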
This method of evaluation was introduced following
criticism of an earlier evaluation used in 2008, that was