A PREPRINT - DECEMBER 24, 2019
a total of 2,688 images. These and other relatively primitive image sets have been mostly abandoned in the
semantic segmentation literature due to their limited resolution and low volume.
2.1.2 Urban Street Semantic Segmentation Image Sets
•
Cityscapes [23]: This is a large-scale image set with a focus on the semantic understanding of urban street
scenes. It contains annotations for high-resolution images from 50 different cities, taken at different hours of
the day and in all seasons of the year, and also with varying background and scene layout. The annotations
are carried out at two quality levels: fine for 5,000 images and coarse for 20,000 images. There are 30 different
class labels, some of which also have instance annotations (vehicles, people, riders, etc.). Consequently, there
are two challenges with separate public leaderboards³: one for pixel-level semantic segmentation, and a second
for instance-level semantic segmentation. With more than 100 entries, it is the most popular challenge for
semantic segmentation of urban street scenes.
•
Other Urban Street Semantic Segmentation Image Sets: There are a number of alternative image sets for urban
street semantic segmentation, such as CamVid [24], KITTI [25], and SYNTHIA [26]. These are generally
overshadowed by the Cityscapes image set [23] for several reasons. Principally, their scale is relatively low.
Only the SYNTHIA image set [26] can be considered large-scale (with more than 13k annotated images);
however, it is an artificially generated image set, which is considered a major limitation for security-critical
systems like driverless cars.
2.2 Performance Evaluation
There are two main criteria for evaluating the performance of semantic segmentation: accuracy, or in other words, the
success of an algorithm; and computational complexity in terms of speed and memory requirements. In this section we
analyse these two criteria separately.
2.2.1 Accuracy
Measuring the performance of segmentation can be complicated, mainly because there are two distinct values to measure.
The first is classification, which is simply determining the pixel-wise class labels; and the second is localisation, or
finding the correct set of pixels that enclose the object. Different metrics can be found in the literature to measure one or
both of these values. The following is a brief explanation of the principal measures most commonly used in evaluating
semantic segmentation performance.
•
ROC-AUC: ROC stands for the Receiver Operating Characteristic curve, which summarises the trade-off
between the true positive rate and the false positive rate of a predictive model over different probability thresholds,
whereas AUC stands for the area under this curve, which is 1 at maximum. This tool is useful in the
interpretation of binary classification problems, and is appropriate when observations are balanced between
classes. However, since most semantic segmentation image sets [14, 15, 16, 17, 18, 19, 23] are not balanced
between the classes, this metric is no longer used by the most popular challenges.
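To make the definition above concrete, the following is a minimal NumPy sketch (the scores and labels are invented toy values) that sorts predictions by score, traces the ROC curve over the implied thresholds, and integrates it with the trapezoidal rule:

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = positive class)."""
    # Sort by descending score so TPR and FPR grow as the threshold is lowered.
    order = np.argsort(-scores)
    labels = labels[order]
    tpr = np.cumsum(labels) / labels.sum()            # true positive rate
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()  # false positive rate
    # Prepend the (0, 0) point and integrate with the trapezoidal rule.
    x = np.r_[0.0, fpr]
    y = np.r_[0.0, tpr]
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

# Toy example: a perfect ranking (all positives scored above all negatives)
# yields the maximum AUC of 1.
scores = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
labels = np.array([1, 1, 1, 0, 0, 0])
print(roc_auc(scores, labels))  # 1.0
```

Note that this sketch assumes untied scores; a production implementation would also handle ties between positive and negative scores.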
•
Pixel Accuracy: Also known as global accuracy [27], pixel accuracy (PA) is a very simple metric which
calculates the ratio between the number of properly classified pixels and the total number of pixels. Mean pixel
accuracy (mPA) is a version of this metric which computes the ratio of correct pixels on a per-class basis.
mPA is also referred to as class average accuracy [27].
PA = \frac{\sum_{j=1}^{k} n_{jj}}{\sum_{j=1}^{k} t_j}, \qquad mPA = \frac{1}{k} \sum_{j=1}^{k} \frac{n_{jj}}{t_j} \qquad (1)
where n_{jj} is the total number of pixels both classified and labelled as class j; in other words, n_{jj} corresponds
to the total number of True Positives for class j, and t_j is the total number of pixels labelled as class j.
•
Intersection over Union (IoU): Also known as the Jaccard Index, IoU is a statistic used for comparing the
similarity and diversity of sample sets. In semantic segmentation, it is the ratio of the intersection of the
pixel-wise classification results with the ground truth, to their union.
IoU = \frac{\sum_{j=1}^{k} n_{jj}}{\sum_{j=1}^{k} \left( n_{ij} + n_{ji} + n_{jj} \right)}, \quad i \neq j \qquad (2)
³ https://www.cityscapes-dataset.com/benchmarks/