“sanitizers”, and S
3
is a set of “sinks”. A source is a method whose
return value is considered tainted, or untrusted.
2
A sanitizer is a
method that manipulates its input to produce taint-free output. A
sink is a pair (m, P ), where m is a method that performs security-
sensitive computations and P contains those parameters of m that
are vulnerable to attack via tainted data. TAJ statically checks that
no value derived from a source is passed as an input to a sink unless
it first undergoes appropriate sanitization.
TAJ consists of two stages. The first phase performs pointer
analysis and builds a call graph. The second phase runs a novel
slicing algorithm to track tainted data.
3.1 Pointer Analysis and Call-graph Construction
The TAJ architecture supports any preliminary pointer analysis
and call graph construction algorithm. The current implementation
relies on a context-sensitive variant of Andersen’s analysis [1] with
on-the-fly call graph construction.
TAJ employs a custom context-sensitivity policy tuned to ad-
dress precision and performance issues that arise when analyzing
real codes. Most methods are analyzed with one level of object
sensitivity [18; 22], in which the context of a method invocation
consists of the invoked method and the object abstraction represent-
ing the receiver. The policy also includes careful treatment of col-
lections and security-related Application Programming Interfaces
(APIs). In particular:
•
Java collection classes are treated with unlimited-depth (up to
recursion) object-sensitivity. This means that all internal objects
of a collection are cloned for each collection instance. As a
result, the contents of Java collections from different allocation
sites are fully disambiguated, eliminating a major source of
pointer-analysis pollution.
•
The pointer analysis adds one level of call-string context to
calls to library factory methods. These methods tend to pollute
pointer-flow precision if handled without context sensitivity,
because all the objects created by a factory method share the
same allocation site.
•
Taint-specific APIs, such as sources and sinks, are also analyzed
with one level of call-string context. This is necessary due to the
special role these APIs play in taint propagation. In the example
given in Figure 1, this context allows TAJ to disambiguate the
two calls to source method getParameter at lines 13 and 14,
even though they are performed on the same receiver object.
As for other dimensions of precision, the pointer analysis of TAJ
is field-sensitive [29]. Furthermore, it relies on a Static-Single As-
signment (SSA) register-transfer language representation of each
method [6], which gives a measure of flow sensitivity for points-to
sets of local variables [14].
3.2 Hybrid Thin Slicing
Using the preliminary pointer analysis and call graph, the second
phase of TAJ tracks data flow from tainted sources using hybrid
thin slicing, a novel thin-slicing algorithm [33]. Hybrid thin slicing
combines flow-insensitive reasoning about flow through the heap
with flow- and context-sensitive tracking of flow through local
variables.
Thin slicing [33] is a good basis for taint analysis since a thin
slice typically captures the statements most relevant to a tainted
flow. A forward thin slice from a statement t consists of those state-
ments that are data-dependent on t [16], excluding base-pointer de-
pendencies: for a store statement x.f=y, dependencies due to
2
Some methods, such as RandomAccessFile.readFully in package
java.io, receive parameters by reference and taint their internal state. TAJ
also supports the specification of such methods as sources.
uses of the base pointer x are ignored; loads are handled similarly.
Thin slices are typically much smaller and more understandable
than program slices. Note that in [33], the term “thin slice” refers
to a backward thin slice, in which data dependencies are consid-
ered in the opposite direction, while here we use this term to mean
a forward thin slice.
Thin slices do not include control dependencies, and hence
TAJ does not track the corresponding indirect information flow.
Experience shows that attacks based on control dependence are rare
and complex, and thus less important than direct vulnerabilities.
Hybrid thin slicing combines aspects of the previously proposed
context-sensitive (CS) and context-insensitive (CI) thin slicing al-
gorithms [33], achieving a better tradeoff between scalability and
precision for taint analysis. Like CS thin slicing, hybrid thin slicing
tracks flow through local variables with flow and context sensitiv-
ity. However, unlike CS thin slicing, the hybrid technique does not
track heap data dependencies via additional method parameters and
return values, as this treatment is a scalability bottleneck [33]. This
handling of heap dependencies by CS thin slicing is also unsound
for multi-threaded programs since it is partially flow-sensitive, and
many of our target Web applications are multi-threaded. Instead,
hybrid thin slicing tracks heap data dependencies via direct edges
from stores to loads. Such edges are added based on the prelim-
inary pointer analysis, as in CI thin slicing. As we shall show in
Section 7, the hybrid approach yields better scalability than CS thin
slicing and better precision than CI thin slicing (with better perfor-
mance than CI in some cases).
Hybrid thin slicing performs a demand-driven traversal over a
special System Dependence Graph (SDG) [16] called the Hybrid
SDG (HSDG). Nodes in an HSDG correspond to load and store
statements in the program, as well as call statements representing
source and sink methods.
An HSDG has two types of edges representing data dependence:
“direct edges” and “summary edges”. A direct edge connects a
store to a load and represents a data dependence computed by a
preliminary pointer analysis (as in CI thin slicing [33]). A sum-
mary edge can connect s to t if t is transitively data-dependent on
s purely via flow through local variables; flow through the heap
is excluded. Summary edges are obtained on demand by comput-
ing context-sensitive reachability over a no-heap SDG—an SDG
that elides all control- and data-dependence edges reflecting flow
through heap locations. Note that the no-heap SDG includes no
successor edges for sanitizer return and sink call statements, since
we need not track flow beyond these statements.
TAJ computes the successors of a statement x in the HSDG on
demand, as follows:
•
If x = st is a store statement, then precomputed points-to
information is used to connect st to all load statements l such
that the base pointers of st and l are may-aliased.
•
Otherwise, a context-sensitive slice is computed from program
point x on the no-heap SDG using the Reps-Horwitz-Sagiv
(RHS) tabulation algorithm [28]. All the statements in the slice
corresponding to store instructions and sink invocations are
registered as the successors of x.
Figure 2 shows an example, which displays the slice computed on
the no-heap SDG corresponding to a load-to-store summary edge
in the HSDG.
To find tainted flows, we compute reachability in the HSDG
from each source-call statement s, adding the necessary direct and
summary edges on demand. The nodes reachable from s represent
the load, store, and sink statements directly data-dependent on s
(ignoring base-pointer data dependencies). Our final output recon-
structs thin slices from s to sensitive sinks via the HSDG and rele-
vant no-heap SDGs.