A High-precision Duplicate Image Deduplication Approach

Ming Chen 1, Shupeng Wang 2, and Liang Tian 3

1. National Engineering Laboratory for Disaster Backup and Recovery, Beijing University of Posts and Telecommunications, Beijing, China
2. Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
3. College of Computer and Information Engineering, Xinxiang University, Xinxiang, China
Email: cm19834@163.com, wangshupeng@iie.ac.cn, gaa252@gmail.com
Abstract—Deduplication has been widely used in backup and archive systems to improve storage utilization effectively. However, traditional deduplication can only eliminate exactly identical images; it cannot handle duplicate images that have the same visual perception but different codes. To address this problem, this paper proposes a high-precision duplicate image deduplication approach. The main idea of the proposed approach is to eliminate duplicate images through five stages: feature extraction, high-dimensional indexing, accuracy optimization, centroid selection, and deduplication evaluation. Experimental results on a real dataset demonstrate that the proposed approach not only saves storage space effectively but also significantly improves the retrieval precision of duplicate images. In addition, the selected centroid images meet the requirements of human perception.
Index Terms—image deduplication; B+ tree; accuracy
optimization; centroid selection; fuzzy synthetic evaluation
I. INTRODUCTION
Recently, with the development of the Internet and the popularity of digital products, the volume of global digital resources has been growing at an alarming rate. In 2007, for the first time ever, the total volume of digital resources exceeded the global storage capacity, and it was estimated that by 2011 only half of the digital information produced could be stored [1]. Hence, the data explosion problem cannot be solved simply by adding more storage devices. To meet the demand for storage space, Professor Kai Li of Princeton University proposed a new technology known as global compression, or deduplication. Deduplication identifies redundant data, eliminates all but one copy, and creates local pointers through which users can still access the information. This technology has attracted widespread attention from both industry and academia [2, 3, 4, 5].
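As an illustration of this mechanism (a minimal sketch only, not the approach proposed in this paper), the following Python code fingerprints each file by its content hash, keeps a single stored copy per fingerprint, and records a pointer from every duplicate to the retained copy; the function name and the dictionary-based pointer table are assumptions made for the example.

import hashlib
from pathlib import Path

def exact_dedup(paths):
    # Exact (bit-level) deduplication: files with identical byte
    # streams share one stored copy; all other occurrences are
    # reduced to pointers (here, entries in a dictionary).
    store = {}     # fingerprint -> path of the single retained copy
    pointers = {}  # original path -> path of the retained copy
    for p in map(Path, paths):
        digest = hashlib.sha256(p.read_bytes()).hexdigest()
        if digest not in store:
            store[digest] = p        # first occurrence is kept
        pointers[p] = store[digest]  # duplicates only reference it
    return store, pointers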
However, traditional deduplication judges two data items to be redundant only if their underlying bit-streams are identical. This restriction is too strict for many applications [6]. For example, on an image storage platform, because of the encoding rules, even a tiny transformation completely changes the bit-stream of an image. Consequently, traditional deduplication can only eliminate exactly identical images; it cannot handle duplicate images that have the same visual perception but different codes.
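To make this concrete, the short sketch below (an illustration only; it assumes the Pillow library and uses hypothetical file names) re-encodes a photograph at a lower JPEG quality and compares the SHA-256 digests of the two files: the pictures look the same, but their bit-streams, and hence their hashes, differ, so exact deduplication treats them as unrelated objects.

import hashlib
from PIL import Image

original = "photo.jpg"     # hypothetical input file
recoded = "photo_q70.jpg"  # same picture, lower JPEG quality

# Re-encode the same picture; the visual content barely changes.
Image.open(original).save(recoded, "JPEG", quality=70)

h1 = hashlib.sha256(open(original, "rb").read()).hexdigest()
h2 = hashlib.sha256(open(recoded, "rb").read()).hexdigest()

# The byte streams now differ, so a bit-level deduplicator
# will not recognize the two files as duplicates.
print(h1 == h2)  # almost certainly False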
However, in practical applications, because of network transmission requirements or storage space restrictions, users often upload modified images, and images with the same content often exist in different versions that vary in resolution or quality. From a visual point of view, images that have the same visual perception but different codes can be regarded as redundant. Therefore, in large-scale data center storage and data clouds, effectively eliminating redundant copies of images can significantly improve storage utilization; such storage optimization has important practical significance.
At present, research on image deduplication has not yet produced satisfactory results. In 2011, Katiyar proposed an application-aware framework for video deduplication [6]. The framework used ordinal signatures to construct video signatures and measured the similarity of compared video sequences with Sequence Shape Similar (SSS). Finally, the proper centroid video was selected from each duplicate video collection by minimizing the compression ratio and maximizing the quality of the compressed videos. However, this framework had two defects. First, it did not consider deduplication accuracy; erroneous deduplication would cause losses to users and degrade the quality of service. Second, it did not consider system scalability: for 1017 videos, it required $\binom{1017}{2}$ pairwise video-comparisons, which would take nearly 2 hours [6].
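A quick back-of-the-envelope computation (illustrative Python, not taken from [6]) shows why all-pairs comparison does not scale: the number of comparisons grows quadratically with the collection size.

from math import comb

# Pairwise comparisons required by an all-pairs scheme: C(n, 2) = n*(n-1)/2
for n in (1017, 10_000, 1_000_000):
    print(f"{n:>9} items -> {comb(n, 2):,} comparisons")
# 1017 videos already require 516,636 comparisons (the case reported in [6]);
# one million items would require roughly 5 * 10^11.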
In addition, the Targeted Public Distribution System (TPDS) of India was a mechanism for ensuring the access and availability of food grains and other essential commodities at subsidized prices to households [7]. To bogus ration cards appear