with 2D Scale Invariant Feature Transform (SIFT) to de-
velop multimodal face recognition. However, both the keypoint detection method and the features were sensitive to facial expressions. To achieve robustness to expressions, Mian et al. [35] proposed a parts-based multimodal hybrid method
(MMH) which exploited local and global features in the 2D
and 3D modalities. A key component of their method was
a variant of the ICP [7] algorithm which is computation-
ally expensive due to its iterative nature. Gupta et al. [23]
matched the 3D Euclidean and geodesic distances between
pairs of fiducial landmarks to perform 3D face recognition.
Berretti et al. [5] represented a 3D face with multiple mesh-
DOG keypoints and local geometric histogram descriptors
while Drira et al. [18] represented the facial surface by ra-
dial curves emanating from the nosetip.
Model based methods construct a 3D morphable face
model and fit it to each probe face. Face recognition is
performed by matching the model parameters to those in
the gallery. Gilani et al. [13] proposed a keypoint based
dense correspondence model and performed 3D face recog-
nition by matching the parameters of a statistical morphable
model called K3DM. Blanz et al. [8, 11] used the parame-
ters of their 3DMM [10] for face recognition. Passalis et
al. [46] proposed an Annotated Face Model (AFM) based
on an average facial 3D mesh. Later, Kakadiaris et al. [26]
proposed elastic registration using this AFM and performed
3D face recognition by comparing the wavelet coefficients
of the deformed images obtained from morphing. Model
fitting algorithms can be computationally expensive and do
not perform well on large galleries as shown in our results.
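The recognition step shared by these model-based methods reduces to comparing fitted parameter vectors between a probe and the gallery. The following minimal Python sketch illustrates only that matching step, assuming the (expensive) model fitting has already produced a coefficient vector per face; the use of cosine similarity and all names here are our own illustrative choices, not taken from any of the cited methods.

# Nearest-neighbour matching of morphable-model coefficients.
# The fitting stage that produces these coefficients is assumed given.
import numpy as np

def identify(probe_params, gallery_params, gallery_ids):
    """Return the gallery identity whose coefficient vector is closest
    to the probe under cosine similarity."""
    g = gallery_params / np.linalg.norm(gallery_params, axis=1, keepdims=True)
    p = probe_params / np.linalg.norm(probe_params)
    return gallery_ids[int(np.argmax(g @ p))]

# Toy example with hypothetical 100-dimensional model coefficients
gallery = np.random.randn(466, 100)      # an FRGCv2-sized gallery of 466 identities
ids = np.arange(466)
probe = gallery[42] + 0.05 * np.random.randn(100)
print(identify(probe, gallery, ids))     # almost certainly prints 42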
Both local and global techniques were tested on indi-
vidual 3D datasets, the largest one being FRGCv2 with a
gallery size of 466 identities. To the best of our knowledge,
none of the conventional methods have performed large-
scale 3D face recognition.
Deep Learning: Akin to progress in other applications of computer vision, deep learning has brought a dramatic leap in 2D face recognition accuracy. In 2014, the Facebook AI group proposed the nine-layer DeepFace model [58], consisting mainly of two convolutional, three locally-connected and two fully-connected (FC) layers. The network was trained on 4.4M 2D facial images of 4,030 identities and achieved an accuracy of 97.35% on the benchmark LFW [25] dataset, reducing the error of the previous state of the art by more than 27%. A year later, Google Inc. followed with FaceNet [53], based on eleven convolutional and three FC layers. The distinguishing features of this network were its training dataset of 200M face images of 8M identities and its triplet loss function.
The authors reported face recognition accuracy of 98.87%
on LFW. DeepFace and FaceNet were both trained on pri-
vate datasets which are not available to the broader research
community. Consequently, Parkhi et al. [45] crawled the web to collect a face database of 2.6M 2D images from 2,622 identities and presented the 16-layer VGG-Face model, comprising 13 convolutional and three FC layers. Despite training on a smaller dataset, the
authors reported face recognition accuracy of 98.95% on
the LFW dataset. Recently, however, the MegaFace Challenges [28, 42] claimed that the existing 2D benchmark datasets have reached saturation and proposed adding millions of faces to their galleries to match real-world scenarios. They showed that the face recognition accuracy of state-of-the-art 2D networks dropped by more than 20% when just a few thousand distractors were added to the galleries of public face recognition benchmark datasets.
The takeaway for the 3D domain is that CNNs on 2D data perform best when they learn from massive training sets and are designed specifically for the 2D modality, and yet their true performance can be validated only when they are tested against large gallery sizes.
To the best of our knowledge, only Kim et al. [29] have
presented deep 3D face recognition results. They fine-tuned the VGG-Face network [45] on an augmented dataset of 123,325 depth images and tested it individually on the Bosphorus [51], BU3DFE [65] and 3D-TEC (twins) [61] datasets. Except for the Bosphorus dataset, their results
do not outperform the state-of-the-art conventional meth-
ods. Moreover, they have not reported results on the chal-
lenging FRGCv2 dataset and their fine-tuned model is not
publicly available.
Data Augmentation: Dou et al. [17] and Richardson et
al. [50] generated thousands of synthetic 3D images for face
reconstruction using BFM [48], AFM [26] and 3DMM [10].
This approach generates 3D faces within the linear space of a specific statistical face model; the resulting faces typically vary within ±3 standard deviations of the model mean and have highly smooth surfaces. Gilani et al. [9] generated
synthetic images using a similar approach. However, these
images were used to train a 3D landmark identification net-
work. Kim et al. [29] fitted the BFM [48] to 577 identities
of FRGCv2 [49] database and induced 25 expressions in
each identity. They also introduced minor pose variations
between ±10° in yaw, pitch and roll for each original scan.
To simulate occlusions, the authors introduced eight ran-
dom occlusion patches to each 2D depth map to increase
the dataset to 123,325 scans. This method only increases
the intra-person variations without augmenting the number
of identities, which in this case remained 577.
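As an illustration of the above per-scan perturbations, the following minimal Python sketch applies a random rotation of up to ±10° in yaw, pitch and roll to a depth map and then adds random occlusion patches. It is written under our own assumptions (orthographic re-rendering, a zero-valued background, rectangular occluders, and coordinates and depth in comparable units) and is not the code of Kim et al. [29]; all function and parameter names are illustrative.

# Depth-map augmentation: small random pose jitter plus random occlusions.
import numpy as np
from scipy.spatial.transform import Rotation

def pose_jitter(depth, max_deg=10.0):
    """Rotate the depth map's point cloud by a random rotation of up to
    +/-max_deg about each axis and re-render an orthographic depth map."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    valid = depth > 0                            # assume 0 marks the background
    pts = np.stack([xs[valid], ys[valid], depth[valid]], 1).astype(np.float64)
    centroid = pts.mean(axis=0)
    angles = np.random.uniform(-max_deg, max_deg, size=3)
    rot = Rotation.from_euler('xyz', angles, degrees=True)
    pts = rot.apply(pts - centroid) + centroid   # rotate about the centroid
    out = np.zeros_like(depth)
    cols = np.clip(np.round(pts[:, 0]).astype(int), 0, w - 1)
    rows = np.clip(np.round(pts[:, 1]).astype(int), 0, h - 1)
    np.maximum.at(out, (rows, cols), pts[:, 2])  # on collisions keep the larger depth
    return out

def random_occlusions(depth, n_patches=8, max_size=0.2):
    """Zero out n_patches random rectangular regions to simulate occlusion."""
    h, w = depth.shape
    out = depth.copy()
    for _ in range(n_patches):
        ph = np.random.randint(1, int(max_size * h) + 1)
        pw = np.random.randint(1, int(max_size * w) + 1)
        r = np.random.randint(0, h - ph + 1)
        c = np.random.randint(0, w - pw + 1)
        out[r:r + ph, c:c + pw] = 0
    return out

# Example: one augmented copy of a (stand-in) 160x160 range image in mm
depth_map = 40.0 + 60.0 * np.random.rand(160, 160)
augmented = random_occlusions(pose_jitter(depth_map), n_patches=8)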
3. Proposed Data Generation for Training
We use 3D facial scans of 1,785 individuals (a proprietary dataset), who were participants in various studies at our institution, to train our deep network. The number of identities in this dataset is larger than that of any 3D dataset but still not