useful information such as their purpose, number of classes,
data format, and training/validation/testing splits.
3.1 2D Datasets
Throughout the years, semantic segmentation has been
mostly focused on two-dimensional images. For that reason,
2D datasets are the most abundant ones. In this section
we describe the most popular 2D large-scale datasets for
semantic segmentation, considering as 2D any dataset that
contains any kind of two-dimensional representation, such
as gray-scale or Red Green Blue (RGB) images.
• PASCAL Visual Object Classes (VOC) [27]^1: this
challenge consists of a ground-truth annotated
dataset of images and five different competitions:
classification, detection, segmentation, action classi-
fication, and person layout. The segmentation one is
especially interesting since its goal is to predict the
object class of each pixel for each test image. There
are 21 classes categorized into vehicles, household,
animals, and other: aeroplane, bicycle, boat, bus, car,
motorbike, train, bottle, chair, dining table, potted
plant, sofa, TV/monitor, bird, cat, cow, dog, horse,
sheep, and person. Background is also considered if
the pixel does not belong to any of those classes.
The dataset is divided into two subsets: training
and validation, with 1464 and 1449 images respectively.
The test set is kept private for the challenge. This
dataset is arguably the most popular for semantic
segmentation, so almost every remarkable method in
the literature is submitted to its performance
evaluation server to be validated against the private
test set. Methods can be trained either using only
the dataset or using additional information.
Furthermore, its leaderboard is public and can be
consulted online^2.
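The per-pixel prediction task and its scoring can be illustrated with a toy sketch: a simplified per-class intersection over union (the metric used for VOC segmentation) computed on made-up label maps. The helper below is ours, not the official evaluation code.

```python
import numpy as np

# Simplified per-class intersection over union (IoU) between a predicted
# and a ground-truth label map; class 0 is background. A toy sketch only,
# not the official PASCAL VOC evaluation protocol.
def per_class_iou(pred, gt, num_classes):
    ious = {}
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious[c] = inter / union
    return ious

gt = np.array([[0, 1], [1, 1]])
pred = np.array([[0, 1], [1, 0]])
ious = per_class_iou(pred, gt, num_classes=21)
```

Averaging these per-class scores over the 21 classes present in the evaluation yields the mean IoU commonly reported on the leaderboard.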
• PASCAL Context [28]^3: this dataset is an extension
of the PASCAL VOC 2010 detection challenge which
contains pixel-wise labels for all training images
(10103). It contains a total of 540 classes – includ-
ing the original 20 classes plus background from
PASCAL VOC segmentation – divided into three
categories (objects, stuff, and hybrids). Despite the
large number of classes, only the 59 most frequent
ones are usually considered: since its classes follow
a power-law distribution, many of them are too
sparse throughout the dataset. For this reason, the
subset of the 59 most frequent classes is usually
selected to conduct studies on this dataset, relabeling
the rest of them as background.
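The relabeling step just described can be sketched as follows (a minimal example on made-up label maps; the helper name and the toy data are ours, not part of the PASCAL Context tooling):

```python
import numpy as np

# Hypothetical sketch: keep only the k most frequent classes across a set
# of label maps and relabel every other pixel as background (class 0), as
# is commonly done with the 59-class subset of PASCAL Context.
def relabel_to_frequent(label_maps, k=59, background=0):
    # Count the pixel frequency of every class over the whole dataset.
    counts = np.bincount(np.concatenate([m.ravel() for m in label_maps]))
    # The k most frequent classes, excluding background itself.
    frequent = set(np.argsort(counts)[::-1][:k]) - {background}
    # Map every pixel of a rare class to background.
    return [np.where(np.isin(m, list(frequent)), m, background)
            for m in label_maps]

maps = [np.array([[1, 1, 2], [2, 2, 2]]),
        np.array([[2, 3, 3], [3, 3, 0]])]
out = relabel_to_frequent(maps, k=2)
```

With `k=2` the toy classes 2 and 3 are kept and the sparse class 1 collapses into background, mirroring how rare PASCAL Context classes are folded into the background label.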
• PASCAL Part [29]^4: this database is an extension of
the PASCAL VOC 2010 detection challenge which
goes beyond that task to provide per-pixel segmen-
tation masks for each part of the objects (or at least
silhouette annotation if the object does not have a
consistent set of parts). The original classes of PAS-
CAL VOC are kept, but their parts are introduced,
e.g., bicycle is now decomposed into back wheel,
chain wheel, front wheel, handlebar, headlight, and
saddle. It contains labels for all training and valida-
tion images from PASCAL VOC as well as for the
9637 testing images.
1. http://host.robots.ox.ac.uk/pascal/VOC/voc2012/
2. http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?challengeid=11&compid=6
3. http://www.cs.stanford.edu/~roozbeh/pascal-context/
4. http://www.stat.ucla.edu/~xianjie.chen/pascal_part_dataset/pascal_part.html
• Semantic Boundaries Dataset (SBD) [30]^5: this
dataset is an extended version of the aforementioned
PASCAL VOC which provides semantic segmenta-
tion ground truth for those images that were not
labelled in VOC. It contains annotations for 11355
images from PASCAL VOC 2011. Those annotations
provide both category-level and instance-level infor-
mation, apart from boundaries for each object. Since
the images are obtained from the whole PASCAL
VOC challenge (not only from the segmentation one),
the training and validation splits diverge. In fact,
SBD provides its own training (8498 images) and
validation (2857 images) splits. Due to its increased
amount of training data, this dataset is often used as
a substitute for PASCAL VOC for deep learning.
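When SBD is used this way, a usual precaution is to keep the augmented training set disjoint from the VOC validation split. A minimal sketch, with made-up image identifiers and a hypothetical helper:

```python
# Hypothetical sketch: build an augmented training set from the VOC and
# SBD training images while excluding anything held out for VOC
# validation, since the two datasets' splits diverge. The identifiers
# below are made up for illustration.
def augmented_train_ids(voc_train, voc_val, sbd_train):
    # Union of VOC and SBD training images, minus the VOC validation set.
    return sorted((set(voc_train) | set(sbd_train)) - set(voc_val))

ids = augmented_train_ids(
    voc_train=["2007_000032", "2007_000039"],
    voc_val=["2007_000033"],
    sbd_train=["2007_000033", "2007_000042"],
)
```

The set difference is what prevents validation images from leaking into training when the two sources are merged.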
• Microsoft Common Objects in Context (COCO)
[31]^6: this is another large-scale image recognition,
segmentation, and captioning dataset. It features
various challenges, the detection one being the most
relevant for this field since one of its parts is focused
on segmentation. That challenge, which features more
than 80 classes, provides more than 82783 images
for training and 40504 for validation, while its test
set consists of more than 80000 images. In particular,
the test set is divided into four different subsets or
splits: test-dev (20000 images), for additional valida-
tion and debugging; test-standard (20000 images),
the default test data for the competition and the
one used to compare state-of-the-art methods; test-
challenge (20000 images), the split used when sub-
mitting to the evaluation server for the challenge;
and test-reserve (20000 images), a split used to
protect against possible overfitting in the challenge
(if a method is suspected of having made too many
submissions or of having trained on the test data, its
results will be compared with the reserve split). Its popularity
and importance have ramped up since its appearance
thanks to its large scale. In fact, the results of the
challenge are presented yearly at a joint workshop
at the European Conference on Computer Vision
(ECCV)^7, together with those of ImageNet.
• SYNTHetic Collection of Imagery and Annotations
(SYNTHIA) [32]^8: this is a large-scale collection of
photo-realistic renderings of a virtual city, seman-
tically segmented, whose purpose is scene under-
standing in the context of driving or urban scenarios.
The dataset provides fine-grained pixel-level annota-
tions for 11 classes (void, sky, building, road, side-
walk, fence, vegetation, pole, car, sign, pedestrian,
and cyclist). It
5. http://home.bharathh.info/home/sbd
6. http://mscoco.org/
7. http://image-net.org/challenges/ilsvrc+coco2016
8. http://synthia-dataset.net/