First release: 26 October 2017 www.sciencemag.org (Page numbers not final at time of first release)
at the lowest level have a “smoothing parameter” that sets an
estimate on the probability that an edge pixel is ON owing to
noise. This parameter can be set according to the noise levels
in a domain.
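As a rough illustration, the smoothing parameter can be read as the probability that an edge pixel turns ON purely from noise in a per-pixel Bernoulli likelihood over a binary edge map. The sketch below also assumes a hypothetical miss probability for true edges; the function name and default values are illustrative and not taken from the paper:

```python
import numpy as np

def edge_log_likelihood(observed, predicted, eps=0.02, miss=0.1):
    """Log-likelihood of an observed binary edge map given predicted edges.

    eps  : probability an edge pixel is ON owing to noise (the "smoothing
           parameter"; set according to the noise level of the domain)
    miss : probability that a true edge pixel fails to fire (illustrative)
    """
    observed = np.asarray(observed, dtype=bool)
    predicted = np.asarray(predicted, dtype=bool)
    # P(pixel ON) is (1 - miss) where an edge is predicted, eps elsewhere.
    p_on = np.where(predicted, 1.0 - miss, eps)
    # Likelihood of each pixel under its observed state, then sum the logs.
    p = np.where(observed, p_on, 1.0 - p_on)
    return float(np.log(p).sum())
```

Raising `eps` makes the model more tolerant of spurious ON pixels, which is why it would be set higher for noisier domains.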
Results
A CAPTCHA is considered broken if it can be automatically
solved at a rate above 1% (3). RCN was effective in breaking
a wide variety of text-based CAPTCHAs with very little train-
ing data, and without using CAPTCHA-specific heuristics
(Fig. 5). It was able to solve reCAPTCHAs at an accuracy rate
of 66.6% (character-level accuracy of 94.3%), BotDetect at 64.4%, Yahoo at 57.4%, and PayPal at 57.1%, all significantly above the 1% rate at which CAPTCHAs are considered ineffective (3). The only differences in architecture across different CAPTCHA tasks are the sets of clean fonts used for
training and the different choices of a few hyper-parameters,
which depend on the size of the CAPTCHA image and the
amount of clutter and deformations. These parameters are
straightforward to set by hand, or can be tuned automatically
via cross validation on an annotated CAPTCHA set. Noisy,
cluttered and deformed examples from the CAPTCHAs were
not used for training, yet RCN was effective in generalizing
to those variations.
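The automatic tuning mentioned above can be sketched as an exhaustive search over a small hyper-parameter grid, scored by transcription accuracy on an annotated CAPTCHA set. The `solve` callable, the parameter names, and the grid values are hypothetical stand-ins for the model's actual interface:

```python
from itertools import product

def tune_parser_params(solve, annotated, grid):
    """Pick hyper-parameters by cross validation on an annotated set.

    solve     : callable(image, params) -> transcribed string (hypothetical)
    annotated : list of (image, ground-truth string) pairs
    grid      : dict mapping parameter name -> list of candidate values
    """
    best_params, best_acc = None, -1.0
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        # Fraction of annotated CAPTCHAs transcribed exactly right.
        acc = sum(solve(img, label_params := params) == label
                  for img, label in annotated) / len(annotated)
        if acc > best_acc:
            best_params, best_acc = params, acc
    return best_params, best_acc
```

Because only a few parameters vary between CAPTCHA styles, a small annotated set suffices for this search.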
For reCAPTCHA parsing at 66.6% accuracy, RCN required
only five clean training examples per character. The model
uses three parameters that govern how single characters are combined to read out a string of characters; these parameters were independent of the length of the CAPTCHAs and robust to the spacing of the characters [Fig. 5B and section 8.4 of (33)]. In addition to obtaining a
transcription of the CAPTCHA, the model also provides a
highly accurate segmentation into individual characters, as
shown in Fig. 5A. For comparison, human accuracy on reCAPTCHA is 87.4%. Because many input images have multiple
valid interpretations (Fig. 5A), parses from two humans agree
only 81% of the time.
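A length-independent string readout can be sketched as greedy selection over scored single-character detections. The two parameters and the selection rule below are illustrative stand-ins, not the model's actual three parsing parameters:

```python
def read_string(detections, min_score=0.5, min_gap=8):
    """Combine single-character detections into a string readout.

    detections : list of (x_position, char, score) candidates
    min_score  : confidence threshold for keeping a detection (illustrative)
    min_gap    : minimum horizontal separation between kept characters,
                 in pixels (illustrative)

    Both parameters are independent of string length by construction.
    """
    kept = []
    # Consider the strongest candidates first.
    for x, ch, s in sorted(detections, key=lambda d: -d[2]):
        if s >= min_score and all(abs(x - kx) >= min_gap for kx, _ in kept):
            kept.append((x, ch))
    # Read the surviving characters out left to right.
    return "".join(ch for _, ch in sorted(kept))
```

Selecting by score and suppressing overlaps also yields the character segmentation as a by-product, mirroring the segmentation shown in Fig. 5A.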
In comparison to RCNs, a state-of-the-art CNN (6) re-
quired a 50,000-fold larger training set of actual CAPTCHA
strings, and it was less robust to perturbations to the input.
Because the CNN required a large number of labeled exam-
ples, this control study used a CAPTCHA-generator that we
created to emulate the appearance of reCAPTCHAs [see sec-
tion 8.4.3 of (33)]. The approach used a bank of position-spe-
cific CNNs, each trained to discriminate the letter at a
particular position. Training the CNNs to achieve a word-accuracy rate of 89.9% required over 2.3 million unique training images, created from 79,000 distinct CAPTCHA words using translated crops for data augmentation. The resulting network fails on string lengths not present during
training, and more importantly, the recognition accuracy of
the network deteriorates rapidly with even minor perturba-
tions to the spacing of characters that are barely perceptible
to humans: 15% more spacing reduced accuracy to 38.4%,
and 25% more spacing reduced accuracy to just 7%. This sug-
gests that the deep-learning method learned to exploit the
specifics of a particular CAPTCHA rather than learning mod-
els of characters that are then used for parsing the scene. For
RCN, increasing the spacing of the characters results in an
improvement in the recognition accuracy (Fig. 5B).
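The translated-crop augmentation used in the CNN control study can be sketched in one dimension for clarity; real inputs are 2-D CAPTCHA images, and the function name and arguments are hypothetical:

```python
def translated_crops(image_row, crop_width, max_shift):
    """Generate horizontally translated crops of a 1-D 'image' row for
    data augmentation (illustrative; the study used 2-D image crops)."""
    crops = []
    for shift in range(-max_shift, max_shift + 1):
        # Slide a fixed-width window around the centered position.
        start = (len(image_row) - crop_width) // 2 + shift
        if 0 <= start and start + crop_width <= len(image_row):
            crops.append(image_row[start:start + crop_width])
    return crops
```

Augmenting with small translations teaches each position-specific classifier tolerance to jitter, but it does not confer robustness to global spacing changes, consistent with the brittleness described above.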
The wide variety of character appearances in BotDetect
(Fig. 5C) demonstrates why the factorization of contours and
surfaces is important: models without this factorization
could latch on to the specific appearance details of a font,
thereby limiting their generalization. The RCN results are
based on testing on 10 different styles of CAPTCHAs from
BotDetect, all parsed by a single network trained on 24 training examples per character, and using the same parsing
parameters across all styles. Although BotDetect CAPTCHAs
can be parsed using contour information alone, adding the appearance information boosted the accuracy from 61.8% to 64.4%, with the same appearance model across all data sets.
See section 8.4.6 of (33) for more details.
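One simple way to read the contour-plus-appearance result is as score fusion: each candidate parse receives a contour score and an appearance score, and the parse with the best combined score wins. The additive weighting below is an assumption for illustration, not the model's actual fusion rule:

```python
def best_parse(candidates, appearance_weight=0.3):
    """Pick the candidate parse with the highest fused score.

    candidates        : list of (string, contour_score, appearance_score),
                        with scores as log-likelihoods (illustrative)
    appearance_weight : hypothetical weight on the appearance evidence
    """
    return max(candidates,
               key=lambda c: c[1] + appearance_weight * c[2])[0]
```

In this toy form, a parse that narrowly wins on contours can still lose once a strongly contradictory appearance score is fused in, which is how appearance evidence can change the final transcription.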
RCN outperformed other models on one-shot and few-
shot classification tasks on the standard MNIST handwritten
digit data set [section 8.7 of (33)]. We compared RCN’s clas-
sification performance on MNIST as we varied the number of
training examples from 1 to 100 per category. CNN comparisons were made with two state-of-the-art models, a LeNet-5
(45) and the VGG-fc6 CNN (46) with its levels pre-trained for
ImageNet (47) classification using millions of images. The
fully-connected-layer fc6 of VGG-CNN was chosen for com-
parison because it gave the best results for this task compared
to other pre-trained levels of the VGG-CNN, and compared to
other pre-trained CNNs that used the same data set and edge
pre-processing as RCN [section 5.1 of (33)]. In addition, we
compared against the Compositional Patch Model (48) that
recently reported state-of-the-art performance on this task.
RCN outperformed the CNNs and the CPM (Fig. 6A). The one-
shot recognition performance of RCN was 76.6% vs 68.9% for
CPM and 54.2% for VGG-fc6. RCN was also robust to different
forms of clutter that were introduced during testing, without
having to expose the network to those transformations dur-
ing training. In comparison, such out-of-sample test exam-
ples had a large detrimental effect on the generalization
performance of CNNs (Fig. 6B). To isolate the contributions
of lateral connections, forward pass, and backward pass to
RCN’s accuracy, we conducted lesion studies that selectively
turned off these mechanisms. The results, summarized in Fig.
6C, show that all these mechanisms contribute significantly
toward the performance of RCNs. RCN networks with two
levels of feature detection and pooling were sufficient to achieve the best accuracy on character-parsing tasks.
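The few-shot MNIST comparison follows a standard evaluation protocol: sample n training examples per class, fit a classifier, and score on held-out test data. The sketch below is model-agnostic; `fit_predict` and the sampling details are assumptions, not the paper's experimental code:

```python
import random

def few_shot_accuracy(train_pool, test_set, n_per_class, fit_predict, seed=0):
    """Few-shot evaluation protocol.

    train_pool  : dict mapping label -> list of available training examples
    test_set    : list of (example, label) pairs held out for scoring
    n_per_class : number of training examples sampled per category
    fit_predict : callable(train_pairs) -> predict function (any model)
    """
    rng = random.Random(seed)
    train = []
    for label, examples in train_pool.items():
        # Draw exactly n examples per category, e.g. n = 1 for one-shot.
        for x in rng.sample(examples, n_per_class):
            train.append((x, label))
    predict = fit_predict(train)
    correct = sum(predict(x) == y for x, y in test_set)
    return correct / len(test_set)
```

Sweeping `n_per_class` from 1 to 100 reproduces the shape of the comparison in Fig. 6A for whatever classifier is plugged in.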