WAF-Supported Chinese Character Recognition: Improving the Efficiency of Spam Image Filtering


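This paper addresses Chinese character recognition for spam image filtering, specifically the detection of Chinese spam whose text is embedded in images (also called "image spam" or "visual spam"). As text-based spam filters have become widespread, spammers have turned to images to hide or obfuscate text and evade conventional text recognition, which makes image spam detection a harder task. In the title, "WAF-based Chinese Character Recognition for Spam Image Filtering", WAF stands for Word Activation Force, the model on which the proposed method is built.

The authors, Si-Yuan Li, Rui-Guang Li, Bin Xu, Han-Bing Yan and Hong-Gang Zhang, are affiliated with Beijing University of Posts and Telecommunications, the National Institute of Network and Information Security of China, and the National Computer Network Emergency Response Technical Team Coordination Center of China; Han-Bing Yan is the corresponding author.

The core contribution is a novel keyword reconstruction algorithm intended to compensate for the shortcomings of optical character recognition (OCR) systems on obscured image text. OCR generally relies on clean text patterns, whereas spam images often place deliberately obscured characters on complex, noisy backgrounds, so recognition becomes unreliable. The algorithm uses the WAF model to reconstruct keywords that OCR misses or misreads, which helps the subsequent text classification stage and improves the accuracy and efficiency of spam filtering.

The full text may also discuss the experimental setup, the choice of data set, evaluation metrics (such as precision, recall and F1 score) and comparisons with other image spam detection techniques. Overall, the paper shows how the WAF model and the keyword reconstruction algorithm can address Chinese character recognition in image spam filtering, which is relevant to anyone designing defenses against evolving spam techniques.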

Chinese Journal of Electronics

Vol.XX, No.X, Jan. (Apr. July Oct.) XX

WAF-based Chinese Character Recognition for Spam Image Filtering

Si-Yuan Li(1,2), Rui-Guang Li(3), Bin Xu(1,2), Han-Bing Yan(*2,3), Hong-Gang Zhang(1)

(1. Beijing University of Posts and Telecommunications)

(2. National Institute of Network and Information Security of China)

(3. National Computer Network Emergency Response Technical Team Coordination Center of China) (*: Corresponding Author)

Abstract — We address the problem of filtering image spam, a rapidly spreading kind of spam in which the text is embedded into images to defeat text-based spam filters. In particular, we focus on image spam containing Chinese text, which is a more challenging task. A popular way to detect image spam is to use an optical character recognition (OCR) system, which detects and recognizes the embedded text, followed by a text classifier that discriminates spam from ham. However, spammers have started to obscure image text to prevent OCR systems from discovering the spam text. To compensate for the shortcomings of OCR systems, we propose a novel method that is essentially a keyword reconstruction algorithm based on the Word Activation Force (WAF) model. It is effective at discovering keywords, and hence benefits the later classification stage and notably improves the performance of image spam filtering. Experimental results on a personal data set of spam images (publicly available) validate the effectiveness of our approach, which outperforms the original OCR system in practical use on image spam with complex backgrounds.

Key words — Spam image, Chinese character recognition, Keyword reconstruction, WAF
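As a rough illustration of the statistic named in the abstract, the sketch below computes a word activation force for ordered token pairs. It assumes the commonly cited form waf(a, b) = (f_ab / f_a) * (f_ab / f_b) / d_ab^2, where f_a and f_b are token frequencies, f_ab counts how often a precedes b within a window, and d_ab is their mean forward distance. The excerpt does not state which formula or window the authors use, so the function name, the `window` parameter and the toy data are all assumptions.

```python
from collections import defaultdict


def word_activation_forces(sentences, window=5):
    """Return {(a, b): waf} for ordered pairs where a precedes b within `window` tokens.

    Assumed definition (not taken from the paper excerpt):
        waf(a, b) = (f_ab / f_a) * (f_ab / f_b) / d_ab ** 2
    """
    freq = defaultdict(int)      # f_a: occurrences of each token
    co_freq = defaultdict(int)   # f_ab: times a precedes b within the window
    dist_sum = defaultdict(int)  # summed forward distances, used for the mean d_ab

    for tokens in sentences:
        for i, a in enumerate(tokens):
            freq[a] += 1
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                b = tokens[j]
                co_freq[(a, b)] += 1
                dist_sum[(a, b)] += j - i

    forces = {}
    for (a, b), f_ab in co_freq.items():
        d_ab = dist_sum[(a, b)] / f_ab  # mean distance from a to b
        forces[(a, b)] = (f_ab / freq[a]) * (f_ab / freq[b]) / (d_ab ** 2)
    return forces


# Toy usage: "a" precedes "b" once, at distance 2.
toy = word_activation_forces([["a", "x", "b"]])
print(toy[("a", "b")])  # 0.25
```

The force is directional: a pair appears only in the order in which it was observed, which is what would let a reconstruction step ask which character most strongly follows (or precedes) a recognized one.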

I. Introduction

The improvement of text document classification approaches to email spam detection has driven spammers to explore new variations of spam email that embed the spam message into attached images, known as image spam. The dramatically increasing amount of image spam makes it an ongoing task for researchers to explore more robust methods to fight it. This is challenging primarily due to: 1) the concealment of information content; 2) the large diversity of spam images; 3) image obscuring, that is, very complex background noise embedded in images.

There exists a considerable amount of previous work on image spam. Some early works detect spam images based on low-level image features [1]. These methods usually neglect the differences among text languages and latent semantic meanings. Another popular way is to extract the embedded text from spam images using an OCR system, which recognizes all the characters, and then apply text-based spam filtering techniques [2] to identify image spam. However, the results of OCR-based methods are often unsatisfactory because they cannot handle clutter in spam images efficiently. More specifically, the embedded text in spam images is often accompanied by a complex and noisy background, which prevents OCR-based systems from real success in practice.
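The OCR-then-classify pipeline described above, and the way obscured text defeats it, can be sketched in a few lines. This is a minimal illustration, not the authors' classifier: the OCR stage is replaced by plain strings standing in for OCR output, and the keyword list, the function names and the `min_hits` threshold are hypothetical.

```python
# Hypothetical keyword list: "发票" (invoice) and "公司" (company).
SPAM_KEYWORDS = frozenset({"发票", "公司"})


def keyword_hits(ocr_text, keywords=SPAM_KEYWORDS):
    """Return the spam keywords that appear verbatim in the OCR output."""
    return {kw for kw in keywords if kw in ocr_text}


def is_spam(ocr_text, min_hits=1):
    """Flag the text as spam when at least `min_hits` keywords are found."""
    return len(keyword_hits(ocr_text)) >= min_hits


clean_ocr = "本公司代开发票"   # stand-in for OCR output on a clean image
noisy_ocr = "本公口代开友票"   # stand-in for OCR output with two misread characters

print(is_spam(clean_ocr))  # True
print(is_spam(noisy_ocr))  # False
```

Two misrecognized characters are enough to hide both keywords from an exact-match filter, which is the gap a keyword reconstruction step is meant to close.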

In this work, we particularly focus on the problem of spam images with Chinese character spam. To date, there is no efficient method in this area because of the unique characteristics of the language. Compared with other languages, e.g. English, Chinese character recognition is considered an extremely difficult problem due to the huge number of categories, complicated structures, high similarity of strokes between characters with totally different meanings, and the variability of fonts. In mainland China, two character sets, containing 3,755 and 6,763 single characters respectively, are announced as the National Standard GB2312-80 (the first set is a subset of the second one) [3]. Although the large number of Chinese characters makes recognizing them more difficult, the good news is that just about 10 percent of the 6,763 characters in GB2312-80 cover most of the usage in the texts that appear in spam images, and only half of these key characters can form semantic words such as “发票” (invoice) and “公司” (company). Therefore it is wise to discriminate between spam and legitimate email relying on the recognition result of key characters. Howev-

∗ Manuscript Received Sept. 20XX; Accepted Nov. 20XX. This work is supported
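The key-character observation in the Introduction puts the working set at roughly 10 percent of 6,763, about 676 characters, of which about half (around 338) form semantic words. One way to check such a coverage claim on a corpus of one's own is to measure how much of the text the k most frequent characters account for. The sketch below assumes nothing about the authors' data set; the function name and the toy corpus are hypothetical.

```python
from collections import Counter


def coverage_of_top_k(texts, k):
    """Fraction of all non-whitespace characters covered by the k most frequent ones."""
    counts = Counter(ch for text in texts for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(k))
    return covered / total


# Toy corpus: "a" occurs 4 times, "b" and "c" once each.
corpus = ["aaab", "ac"]
print(coverage_of_top_k(corpus, 2))  # 5 of 6 characters are covered
```

On a real collection of text extracted from spam images, evaluating the function at k = 676 would show whether a key-character set of that size covers most of the usage.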
