DETECTION OF REDUNDANT CODE USING R
2
D
2
365
Unfortunately, there are some shortcomings: only very simple syntactical elements
are compared, which excludes detection of semantically equal code fragments that are
syntactically different. Moreover, the size threshold that must be used to avoid false
positives limits the detection to duplications of relatively big sections of code.
1.3.2. Syntactical analysis Syntactical analysis methods are based on the compar-
ison of abstract syntax trees (AST).
Abstract syntax trees have two nice properties for the detection process: (1) com-
ments and white space are automatically eliminated; (2) each identifier is recognized
according to its context. The abstract syntax trees are then compared using several
possible techniques.
YAP (Wise, 1992, 1996) is a system that operates on a very simplified (and nor-
malized) abstract syntax tree. This tree is produced by canonicalizing the program
identifiers, reordering of the procedures according to invocation order, expansion of
each procedure on the first invocation point and replacement of the remaining invoca-
tion by distinct markers, and deletion of all syntactical elements that do not belong to
the language, leaving only reserved words and names of pre-defined procedures. The
result is a linearized sequence of syntactical elements which is then compared using an
algorithm that can deal with transposed subsequences. Note that YAP was developed
to deal primarily with the problem of software plagiarism where it is common to find
transposed elements.
Clone Doctor (Baxter et al., 1998) is a tool that finds redundant code fragments via
the comparison of all subtrees of the abstract syntax tree of a program. This allows for
fine grained comparison between code fragments but is computationally demanding.
The process is O(n
3
) on the number of nodes of the abstract syntax tree (an abstract
syntax tree typically has ten times more nodes than the number of lines of the repre-
sented program).
There are two solutions to make the process more efficient: (1) diminish the number
of compared trees; (2) compare each tree with only some of the other trees. Clone
Doctor explores both solutions. Each tree is stored in an hash table, thus dividing the
tree space into several different independent subspaces. Only the trees in a given sub-
space are compared. Moreover, the indexing function ignores very small trees, largely
reducing the number of trees and making the process less sensitive to small code vari-
ations. Each subspace will then contain all subtrees that have a similar structure, only
differingontheleaves.
Each pair of trees on a subspace is then analyzed using a similarity function that
takes into account both the number of common nodes and the number of different
nodes. This approach can deal with cases where the code was structurally changed (by
adding or removing code).
1.3.3. Metric-based analysis Program metric is a measure used to characterize
quantitatively essential features of a program in order to allow its classification, com-
parison and mathematical analysis (Conte et al., 1986).
One of the first metric-based duplication detectors was presented in (Ottenstein,
1976) and explored four metrics, namely the number of unique operators, the number
of unique operands, the number of operators and the number of operands. Programs