Kam1n0: MapReduce-based Assembly Clone Search for
Reverse Engineering
Steven H. H. Ding
Benjamin C. M. Fung
Philippe Charland
†
School of Information Studies, McGill University, Montreal, QC, Canada
†
Mission Critical Cyber Security Section, Defence R&D Canada - Valcartier, Quebec, QC, Canada
steven.h.ding@mail.mcgill.ca ben.fung@mcgill.ca philippe.charland@drdc-rddc.gc.ca
ABSTRACT
Assembly code analysis is one of the critical processes for de-
tecting and proving software plagiarism and software patent
infringements when the source code is unavailable. It is also
a common practice to discover exploits and vulnerabilities
in existing software. However, it is a manually intensive and
time-consuming process even for experienced reverse engi-
neers. An effective and efficient assembly code clone search
engine can greatly reduce the effort of this process, since
it can identify the cloned parts that have been previously
analyzed. The assembly code clone search problem belongs
to the field of software engineering. However, it strongly
depends on practical nearest neighbor search techniques in
data mining and databases. By closely collaborating with
reverse engineers and Defence Research and Development
Canada (DRDC ), we study the concerns and challenges that
make existing assembly code clone approaches not practi-
cally applicable from the perspective of data mining. We
propose a new variant of LSH scheme and incorporate it with
graph matching to address these challenges. We implement
an integrated assembly clone search engine called Kam1n0.
It is the first clone search engine that can efficiently identify
the given query assembly function’s subgraph clones from a
large assembly code repository. Kam1n0 is built upon the
Apache Spark computation framework and Cassandra-like
key-value distributed storage. A deployed demo system is
publicly available.
1
Extensive experimental results suggest
that Kam1n0 is accurate, efficient, and scalable for handling
large volume of assembly code.
Keywords
Assembly clone search; Information retrieval; Mining soft-
ware repositories
1
Kam1n0 online demo (no installation required). Both the user
name and password are “sigkdd2016”. Use Chrome for best expe-
rience. http://dmas.lab.mcgill.ca/projects/kam1n0.htm
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
KDD ’16 August 13–17, 2016, San Francisco, CA, USA
c
2016 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ISBN 978-1-4503-4232-2/16/08. . . $15.00
DOI:
http://dx.doi.org/XXXX.XXXX
1. INTRODUCTION
Code reuse is a common but uncontrolled issue in software
engineering [15]. Mockus [25] found that more than 50%
of files were reused in more than one open source project.
Sojer’s survey [29] indicates that more than 50% of the de-
velopers modify the components before reusing them. This
massively uncontrolled reuse of source code does not only
introduce legal issues such as GNU General Public License
(GPL) violation [36, 17]. It also implies security concerns,
as the source code and the vulnerabilities are uncontrollably
shared between projects [4].
Identifying all these infringements and vulnerabilities re-
quires intensive effort from reverse engineers. However, the
learning curve to master reverse engineering is much steeper
than for programming [4]. Reverse engineering is a time
consuming process which involves inspecting the execution
flow of the program in assembly code and determining the
functionalities of the components. Given the fact that code
reuse is prevalent in software development, there is a press-
ing need to develop an efficient and effective assembly clone
search engine for reverse engineers. Previous clone search
approaches only focus on the search accuracy. However,
designing a practically useful clone search engine is a non-
trivial task which involves multiple factors to be considered.
By closely collaborating with reverse engineers and Defence
Research and Development Canada (DRDC ), we outline the
deployment challenges and requirements as follows:
Interpretability and usability: An assembly function
can be represented as a control flow graph consisting of con-
nected basic blocks. Given an assembly function as query, all
of the previous assembly code clone search approaches [7, 6,
18, 26] only provide the top-listed candidate assembly func-
tions. They are useful when there exists a function in the
repository that shares a high degree of similarity with the
query. However, due to the unpredictable effects of differ-
ent compilers, compiler optimization, and obfuscation tech-
niques, given an unknown function, it is less probable to have
a very similar function in the repository. Returning a list of
clones with a low degree of similarity values is not useful.
As per our discussions with DRDC, a practical search en-
gine should be able to decompose the given query assembly
function to different known subgraph clones which can help
reverse engineers better understand the function’s composi-
tion. We define a subgraph clone as one of its subgraphs that
can be found in the other function. Refer to the example
in Figure 1. The previous clone search approaches cannot
address this challenge.
1
http://dx.doi.org/10.1145/2939672.2939719