MapReduce-Driven Kam1n0: Efficient Assembly Code Clone Search for Reverse Engineering


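This 2016 research paper examines why assembly code analysis matters when source code is unavailable: it is a key step in detecting software plagiarism and patent infringement, and it is also commonly used to discover vulnerabilities and exploits in existing software. Even for experienced reverse engineers, however, the process remains time-consuming and manually intensive. The main challenges for traditional assembly code clone search are efficiency and accuracy, especially on large code repositories.

Working with Defence Research and Development Canada (DRDC), the authors study, from a data mining perspective, the problems that existing methods run into in practice. They propose a new variant of locality-sensitive hashing (LSH) and combine it with graph matching to address these problems. The paper's core contribution is Kam1n0, a MapReduce-based assembly code clone search engine that efficiently finds the subgraph clones of a given query function in a large assembly code repository. The system is built on the Apache Spark computation framework and uses Cassandra-like distributed key-value storage to process and store data at scale. The authors also provide a publicly deployed demo system so that researchers and developers can try the tool for clone detection.

The experimental results suggest that Kam1n0 is accurate, efficient, and scalable enough to handle large volumes of assembly code. This is valuable for improving software security and protecting intellectual property, and it gives practitioners of assembly code analysis a new technical foundation. By parallelizing the work through a MapReduce architecture, Kam1n0 speeds up analysis and makes complex, tedious clone search tasks more feasible. The paper therefore presents both the theoretical method and its potential in practical use.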


Kam1n0: MapReduce-based Assembly Clone Search for Reverse Engineering

Steven H. H. Ding, Benjamin C. M. Fung, Philippe Charland†

School of Information Studies, McGill University, Montreal, QC, Canada

†Mission Critical Cyber Security Section, Defence R&D Canada - Valcartier, Quebec, QC, Canada

steven.h.ding@mail.mcgill.ca, ben.fung@mcgill.ca, philippe.charland@drdc-rddc.gc.ca

ABSTRACT

Assembly code analysis is one of the critical processes for detecting and proving software plagiarism and software patent infringements when the source code is unavailable. It is also a common practice to discover exploits and vulnerabilities in existing software. However, it is a manually intensive and time-consuming process even for experienced reverse engineers. An effective and efficient assembly code clone search engine can greatly reduce the effort of this process, since it can identify the cloned parts that have been previously analyzed. The assembly code clone search problem belongs to the field of software engineering. However, it strongly depends on practical nearest neighbor search techniques in data mining and databases. By closely collaborating with reverse engineers and Defence Research and Development Canada (DRDC), we study the concerns and challenges that make existing assembly code clone approaches not practically applicable from the perspective of data mining. We propose a new variant of LSH scheme and incorporate it with graph matching to address these challenges. We implement an integrated assembly clone search engine called Kam1n0. It is the first clone search engine that can efficiently identify the given query assembly function’s subgraph clones from a large assembly code repository. Kam1n0 is built upon the Apache Spark computation framework and Cassandra-like key-value distributed storage. A deployed demo system is publicly available. Extensive experimental results suggest that Kam1n0 is accurate, efficient, and scalable for handling large volume of assembly code.

Keywords

Assembly clone search; Information retrieval; Mining software repositories

Kam1n0 online demo (no installation required). Both the user name and password are “sigkdd2016”. Use Chrome for best experience. http://dmas.lab.mcgill.ca/projects/kam1n0.htm

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

KDD ’16, August 13–17, 2016, San Francisco, CA, USA

© 2016 Copyright held by the owner/author(s). Publication rights licensed to ACM.

ISBN 978-1-4503-4232-2/16/08...$15.00

DOI: http://dx.doi.org/10.1145/2939672.2939719

1. INTRODUCTION

Code reuse is a common but uncontrolled issue in software engineering [15]. Mockus [25] found that more than 50% of files were reused in more than one open source project. Sojer’s survey [29] indicates that more than 50% of the developers modify the components before reusing them. This massively uncontrolled reuse of source code does not only introduce legal issues such as GNU General Public License (GPL) violation [36, 17]. It also implies security concerns, as the source code and the vulnerabilities are uncontrollably shared between projects [4].

Identifying all these infringements and vulnerabilities requires intensive effort from reverse engineers. However, the learning curve to master reverse engineering is much steeper than for programming [4]. Reverse engineering is a time consuming process which involves inspecting the execution flow of the program in assembly code and determining the functionalities of the components. Given the fact that code reuse is prevalent in software development, there is a pressing need to develop an efficient and effective assembly clone search engine for reverse engineers. Previous clone search approaches only focus on the search accuracy. However, designing a practically useful clone search engine is a non-trivial task which involves multiple factors to be considered. By closely collaborating with reverse engineers and Defence Research and Development Canada (DRDC), we outline the deployment challenges and requirements as follows:

Interpretability and usability: An assembly function can be represented as a control flow graph consisting of connected basic blocks. Given an assembly function as query, all of the previous assembly code clone search approaches [7, 6, 18, 26] only provide the top-listed candidate assembly functions. They are useful when there exists a function in the repository that shares a high degree of similarity with the query. However, due to the unpredictable effects of different compilers, compiler optimization, and obfuscation techniques, given an unknown function, it is less probable to have a very similar function in the repository. Returning a list of clones with a low degree of similarity values is not useful. As per our discussions with DRDC, a practical search engine should be able to decompose the given query assembly function to different known subgraph clones which can help reverse engineers better understand the function’s composition. We define a subgraph clone as one of its subgraphs that can be found in the other function. Refer to the example in Figure 1. The previous clone search approaches cannot address this challenge.
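
To make the subgraph clone definition concrete, the following is a minimal Python sketch under simplifying assumptions. It is not Kam1n0's algorithm. The function name `subgraph_clones`, the dictionary-based control flow graph representation, and the `block_pairs` input (a one-to-one mapping from query basic blocks to repository basic blocks that some earlier step has already judged to be clones) are all hypothetical. The sketch groups matched query blocks into connected pieces, keeping an edge only when the corresponding edge also exists between the mapped blocks in the other function.

```python
from collections import deque


def subgraph_clones(query_edges, target_edges, block_pairs):
    """Group matched query blocks into connected subgraph clones.

    query_edges / target_edges: dict mapping a basic-block id to the set of
        its successor block ids (a control flow graph).
    block_pairs: dict mapping a query block id to the target block id it is
        ASSUMED to be a clone of (one-to-one; produced elsewhere).
    Returns a list of sets of query block ids, one set per subgraph clone.
    """

    def linked(a, b):
        # The query edge a -> b counts only if the mapped blocks in the
        # target function are connected by an edge in the same direction.
        return (b in query_edges.get(a, set())
                and block_pairs[b] in target_edges.get(block_pairs[a], set()))

    seen, clones = set(), []
    for start in block_pairs:
        if start in seen:
            continue
        # Breadth-first growth of one connected group of matched blocks.
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            cur = queue.popleft()
            for other in block_pairs:
                if other not in seen and (linked(cur, other) or linked(other, cur)):
                    seen.add(other)
                    component.add(other)
                    queue.append(other)
        clones.append(component)
    return clones


# Toy example (all block ids are made up for illustration).
query_cfg = {"q1": {"q2"}, "q2": {"q3"}, "q3": {"q4"}, "q4": set()}
target_cfg = {"t1": {"t2"}, "t2": {"t9"}, "t4": set(), "t9": set()}
pairs = {"q1": "t1", "q2": "t2", "q4": "t4"}

clones = subgraph_clones(query_cfg, target_cfg, pairs)
print([sorted(c) for c in clones])  # [['q1', 'q2'], ['q4']]
```

In this toy input, blocks `q1` and `q2` form one subgraph clone because the edge between them is mirrored by `t1 -> t2`, while `q4` is reported on its own because its only neighbor, `q3`, has no matched block. A result of this shape decomposes the query function into known pieces instead of returning a single ranked list of whole functions.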

http://dx.doi.org/10.1145/2939672.2939719
