4 Front. Comput. Sci.
ing facts.
Fact 1 (metrical similar codes) If two code fragments P and
P
are similar under the set of features measured by M,then
the values of M(P)andM(P
) should be proximate [27].
Fact 2 (candidate similar codes) Based on Fact 1, P and P
are cand idate similar codes if the metric sim ilarity between
them is no less than the pr edefined threshold of metric simi-
larity T
m
.
Fact 3 (SDG) SDG is a semantic representation for a pro-
gram. If two programs have the same SDG, they are semanti-
cally equivalent. However, two semantically equivalent pro-
grams may have different SDGs [37,39].
Fact 4 (semantic-preserving transformation) A program
transfor mation is semantic-preserving if it preserves the ex-
ecuting result for the same input. A sufficient sequence of
semantic-pr eserving transformations can normalize semanti-
cally equivalent programs that use the same algorithm into
those having the same SDG [40].
Fact 5 (code normalization) Code normalization is a pro-
cess of performing semantic-preserving transformations on
the SDG of a program according to a set of predefined rules,
so as to transform the semantically equivalent codes to the
same representation.
Fact 6 (semantic similarity) Based on Fact 3, Fact 4 and
Fact 5, the semantic similarity of P and P
can be computed
by matching the normalized SDGs of P and P
.
Fact 7 (semantically similar codes) Based on Fact 2 and Fact
6, P and P
are semantically similar codes if they are candi-
date similar codes and the semantic similar ity between them
is no less than the predefined threshold of semantic similarity
T
s
.
3.3 Model of CMGA
In order to solve the three problems proposed in Section 3 .1,
we develop a model of CMGA as shown in Fig. 1 based on
the facts proposed in Section 3.2.
To solve Problem 1, we improve the traditional SDG. The
structure theorem [41,42] puts forward that any proper pro-
gram, of any complexity, can be represented only by three
basic constructs: sequence, selection and iteration. Based on
this theorem, some control transfer statements such as goto
and break can be eliminated. Therefore, the CDS is repre-
sented as an ordered tree, which is called control dependence
tree (CDT) by us. The data flow analysis is only performed
on the candidate similar codes in the advanced code normal-
ization stage. As the candidate similar codes only account a
small part of the source code, the complexity of representing
SDG is g reatly decreased, so the augmented SDG is scalable
to large programs.
To solve Problem 2, we propose a metrics-based candi-
date similar codes extracting approach based on Fact 1 and
Fact 2 to filter out most of the dissimilar code pairs, so as to
prune the search space. The metrics-based approach has an
advantage that the computational complexity is much lower
than that of graph matching. In order to make CMGA effi-
cient enough for practical applications, we use the m etrics-
base approach for a fast selection of potential similar codes
at a pre-processing stage to prune the search space, when us-
ing the more accurate but more computationally expensive
method of graph matching.
Fig. 1 Model of metric-based and graph-based combined similar codes detection