First release: 26 October 2017 www.sciencemag.org (Page numbers not final at time of first release)
at the lowest level have a “smoothing parameter” that sets an
estimate on the probability that an edge pixel is ON owing to
noise. This parameter can be set according to the noise levels
in a domain.
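As a rough illustration, the smoothing parameter can be read as the probability that an edge pixel turns ON purely from noise in a per-pixel Bernoulli likelihood over a binary edge map. The sketch below also assumes a hypothetical miss probability for true edges; the function name and default values are illustrative and not taken from the paper:

```python
import numpy as np

def edge_log_likelihood(observed, predicted, eps=0.02, miss=0.1):
    """Log-likelihood of an observed binary edge map given predicted edges.

    eps  : probability an edge pixel is ON owing to noise (the "smoothing
           parameter"; set according to the noise level of the domain)
    miss : probability that a true edge pixel fails to fire (illustrative)
    """
    observed = np.asarray(observed, dtype=bool)
    predicted = np.asarray(predicted, dtype=bool)
    # P(pixel ON) is (1 - miss) where an edge is predicted, eps elsewhere.
    p_on = np.where(predicted, 1.0 - miss, eps)
    # Likelihood of each pixel under its observed state, then sum the logs.
    p = np.where(observed, p_on, 1.0 - p_on)
    return float(np.log(p).sum())
```

Raising `eps` makes the model more tolerant of spurious ON pixels, which is why it would be set higher for noisier domains.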
Results
A CAPTCHA is considered broken if it can be automatically
solved at a rate above 1% (3). RCN was effective in breaking
a wide variety of text-based CAPTCHAs with very little train-
ing data, and without using CAPTCHA-specific heuristics
(Fig. 5). It was able to solve reCAPTCHAs at an accuracy rate
of 66.6% (character-level accuracy of 94.3%), BotDetect at 64.4%, Yahoo at 57.4%, and PayPal at 57.1%, all significantly above the 1% rate at which CAPTCHAs are considered ineffective (3). The only differences in architecture across different CAPTCHA tasks are the sets of clean fonts used for
training and the different choices of a few hyper-parameters,
which depend on the size of the CAPTCHA image and the
amount of clutter and deformations. These parameters are
straightforward to set by hand, or can be tuned automatically
via cross validation on an annotated CAPTCHA set. Noisy,
cluttered and deformed examples from the CAPTCHAs were
not used for training, yet RCN was effective in generalizing
to those variations.
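The automatic tuning mentioned above can be sketched as an exhaustive search over a small hyper-parameter grid, scored by transcription accuracy on an annotated CAPTCHA set. The `solve` callable, the parameter names, and the grid values are hypothetical stand-ins for the model's actual interface:

```python
from itertools import product

def tune_parser_params(solve, annotated, grid):
    """Pick hyper-parameters by cross validation on an annotated set.

    solve     : callable(image, params) -> transcribed string (hypothetical)
    annotated : list of (image, ground-truth string) pairs
    grid      : dict mapping parameter name -> list of candidate values
    """
    best_params, best_acc = None, -1.0
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        # Fraction of annotated CAPTCHAs transcribed exactly right.
        acc = sum(solve(img, label_params := params) == label
                  for img, label in annotated) / len(annotated)
        if acc > best_acc:
            best_params, best_acc = params, acc
    return best_params, best_acc
```

Because only a few parameters vary between CAPTCHA styles, a small annotated set suffices for this search.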
For reCAPTCHA parsing at 66.6% accuracy, RCN required
only five clean training examples per character. The model
uses three parameters that govern how single characters are combined to read out a string of characters; these parameters were independent of the length of the CAPTCHAs and robust to the spacing of the characters [Fig. 5B and section 8.4 of (33)]. In addition to obtaining a
transcription of the CAPTCHA, the model also provides a
highly accurate segmentation into individual characters, as
shown in Fig. 5A. For comparison, human accuracy on reCAPTCHA is 87.4%. Because many input images have multiple
valid interpretations (Fig. 5A), parses from two humans agree
only 81% of the time.
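A length-independent string readout can be sketched as greedy selection over scored single-character detections. The two parameters and the selection rule below are illustrative stand-ins, not the model's actual three parsing parameters:

```python
def read_string(detections, min_score=0.5, min_gap=8):
    """Combine single-character detections into a string readout.

    detections : list of (x_position, char, score) candidates
    min_score  : confidence threshold for keeping a detection (illustrative)
    min_gap    : minimum horizontal separation between kept characters,
                 in pixels (illustrative)

    Both parameters are independent of string length by construction.
    """
    kept = []
    # Consider the strongest candidates first.
    for x, ch, s in sorted(detections, key=lambda d: -d[2]):
        if s >= min_score and all(abs(x - kx) >= min_gap for kx, _ in kept):
            kept.append((x, ch))
    # Read the surviving characters out left to right.
    return "".join(ch for _, ch in sorted(kept))
```

Selecting by score and suppressing overlaps also yields the character segmentation as a by-product, mirroring the segmentation shown in Fig. 5A.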
In comparison to RCNs, a state-of-the-art CNN (6) re-
quired a 50,000-fold larger training set of actual CAPTCHA
strings, and it was less robust to perturbations to the input.
Because the CNN required a large number of labeled exam-
ples, this control study used a CAPTCHA-generator that we
created to emulate the appearance of reCAPTCHAs [see sec-
tion 8.4.3 of (33)]. The approach used a bank of position-spe-
cific CNNs, each trained to discriminate the letter at a
particular position. Training the CNNs to achieve a word-accuracy rate of 89.9% required over 2.3 million unique training images, created from 79,000 distinct CAPTCHA words using translated crops for data augmentation. The resulting network fails on string lengths not present during
training, and more importantly, the recognition accuracy of
the network deteriorates rapidly with even minor perturba-
tions to the spacing of characters that are barely perceptible
to humans: 15% more spacing reduced accuracy to 38.4%,
and 25% more spacing reduced accuracy to just 7%. This sug-
gests that the deep-learning method learned to exploit the
specifics of a particular CAPTCHA rather than learning mod-
els of characters that are then used for parsing the scene. For
RCN, increasing the spacing of the characters results in an
improvement in the recognition accuracy (Fig. 5B).
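The translated-crop augmentation used in the CNN control study can be sketched in one dimension for clarity; real inputs are 2-D CAPTCHA images, and the function name and arguments are hypothetical:

```python
def translated_crops(image_row, crop_width, max_shift):
    """Generate horizontally translated crops of a 1-D 'image' row for
    data augmentation (illustrative; the study used 2-D image crops)."""
    crops = []
    for shift in range(-max_shift, max_shift + 1):
        # Slide a fixed-width window around the centered position.
        start = (len(image_row) - crop_width) // 2 + shift
        if 0 <= start and start + crop_width <= len(image_row):
            crops.append(image_row[start:start + crop_width])
    return crops
```

Augmenting with small translations teaches each position-specific classifier tolerance to jitter, but it does not confer robustness to global spacing changes, consistent with the brittleness described above.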
The wide variety of character appearances in BotDetect
(Fig. 5C) demonstrates why the factorization of contours and
surfaces is important: models without this factorization
could latch on to the specific appearance details of a font,
thereby limiting their generalization. The RCN results are
based on testing on 10 different styles of CAPTCHAs from
BotDetect, all parsed by a single network trained on 24 training examples per character, and using the same parsing
parameters across all styles. Although BotDetect CAPTCHAs
can be parsed using contour information alone, adding the appearance information boosted the accuracy from 61.8% to 64.4%, with the same appearance model across all data sets.
See section 8.4.6 of (33) for more details.
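One simple way to read the contour-plus-appearance result is as score fusion: each candidate parse receives a contour score and an appearance score, and the parse with the best combined score wins. The additive weighting below is an assumption for illustration, not the model's actual fusion rule:

```python
def best_parse(candidates, appearance_weight=0.3):
    """Pick the candidate parse with the highest fused score.

    candidates        : list of (string, contour_score, appearance_score),
                        with scores as log-likelihoods (illustrative)
    appearance_weight : hypothetical weight on the appearance evidence
    """
    return max(candidates,
               key=lambda c: c[1] + appearance_weight * c[2])[0]
```

In this toy form, a parse that narrowly wins on contours can still lose once a strongly contradictory appearance score is fused in, which is how appearance evidence can change the final transcription.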
RCN outperformed other models on one-shot and few-
shot classification tasks on the standard MNIST handwritten
digit data set [section 8.7 of (33)]. We compared RCN’s clas-
sification performance on MNIST as we varied the number of
training examples from 1 to 100 per category. CNN comparisons were made with two state-of-the-art models, a LeNet-5
(45) and the VGG-fc6 CNN (46) with its levels pre-trained for
ImageNet (47) classification using millions of images. The
fully-connected-layer fc6 of VGG-CNN was chosen for com-
parison because it gave the best results for this task compared
to other pre-trained levels of the VGG-CNN, and compared to
other pre-trained CNNs that used the same data set and edge
pre-processing as RCN [section 5.1 of (33)]. In addition, we
compared against the Compositional Patch Model (48) that
recently reported state-of-the-art performance on this task.
RCN outperformed the CNNs and the CPM (Fig. 6A). The one-
shot recognition performance of RCN was 76.6% vs 68.9% for
CPM and 54.2% for VGG-fc6. RCN was also robust to different
forms of clutter that were introduced during testing, without
having to expose the network to those transformations dur-
ing training. In comparison, such out-of-sample test exam-
ples had a large detrimental effect on the generalization
performance of CNNs (Fig. 6B). To isolate the contributions
of lateral connections, forward pass, and backward pass to
RCN’s accuracy, we conducted lesion studies that selectively
turned off these mechanisms. The results, summarized in Fig.
6C, show that all these mechanisms contribute significantly
toward the performance of RCNs. RCN networks with two
levels of feature detection and pooling were sufficient to achieve the best accuracy on character-parsing tasks.
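The few-shot MNIST comparison follows a standard evaluation protocol: sample n training examples per class, fit a classifier, and score on held-out test data. The sketch below is model-agnostic; `fit_predict` and the sampling details are assumptions, not the paper's experimental code:

```python
import random

def few_shot_accuracy(train_pool, test_set, n_per_class, fit_predict, seed=0):
    """Few-shot evaluation protocol.

    train_pool  : dict mapping label -> list of available training examples
    test_set    : list of (example, label) pairs held out for scoring
    n_per_class : number of training examples sampled per category
    fit_predict : callable(train_pairs) -> predict function (any model)
    """
    rng = random.Random(seed)
    train = []
    for label, examples in train_pool.items():
        # Draw exactly n examples per category, e.g. n = 1 for one-shot.
        for x in rng.sample(examples, n_per_class):
            train.append((x, label))
    predict = fit_predict(train)
    correct = sum(predict(x) == y for x, y in test_set)
    return correct / len(test_set)
```

Sweeping `n_per_class` from 1 to 100 reproduces the shape of the comparison in Fig. 6A for whatever classifier is plugged in.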