Figure 1: Multi-level approach for performance efficient
taint tracking within a common smartphone architecture.
(Diagram labels: Network Interface, Native System Libraries,
two Virtual Machines each containing Application Code, Msg,
and Secondary Storage, annotated with message-level,
variable-level, method-level, and file-level tracking.)
clear semantics. First, we instrument the VM interpreter
to provide variable-level tracking within untrusted ap-
plication code.
The variable semantics provided by the interpreter
supply valuable context for avoiding
the taint explosion observed in the x86 instruction set.
Additionally, by tracking variables, we maintain taint
markings only for data and not code. Second, we use
message-level tracking between applications. Tracking
taint on messages instead of data within messages mini-
mizes IPC overhead while extending the analysis system-
wide. Third, for system-provided native libraries, we use
method-level tracking. Here, we run native code with-
out instrumentation and patch the taint propagation on
return. These methods accompany the system and have
known information flow semantics. Finally, we use file-
level tracking to ensure persistent information conserva-
tively retains its taint markings.
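The four granularities above can be illustrated with a minimal sketch. It assumes a taint tag is a 32-bit vector with one bit per marking and that propagation is a bitwise union. All class, method, and constant names are hypothetical, and the rule that a native method's return value takes the union of its argument tags is one simple assumed flow profile, not necessarily the one used for any particular library.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: one tag representation shared by four tracking granularities. */
final class TaintSketch {
    // Illustrative marking bits; the names and bit positions are assumptions.
    static final int TAINT_CLEAR    = 0;
    static final int TAINT_LOCATION = 1 << 0;
    static final int TAINT_CONTACTS = 1 << 1;

    /** Variable-level: a result derived from two operands carries the union of their tags. */
    static int combine(int tagA, int tagB) {
        return tagA | tagB;
    }

    /** Message-level: a single tag per message, the union of the tags of all data placed in it. */
    static int messageTag(int... fieldTags) {
        int tag = TAINT_CLEAR;
        for (int t : fieldTags) {
            tag |= t;
        }
        return tag;
    }

    /** Method-level: native code runs uninstrumented; the return tag is patched afterwards.
     *  Assumed profile: the return value depends on every argument. */
    static int nativeReturnTag(int... argTags) {
        return messageTag(argTags);
    }

    /** File-level: a single tag per file that is only ever widened, never cleared. */
    private static final Map<String, Integer> FILE_TAGS = new HashMap<>();

    static void onFileWrite(String path, int dataTag) {
        FILE_TAGS.merge(path, dataTag, (oldTag, newTag) -> oldTag | newTag);
    }

    static int onFileRead(String path) {
        return FILE_TAGS.getOrDefault(path, TAINT_CLEAR);
    }
}
```

In this sketch, writing untainted data to a file that already holds location data leaves the file tag unchanged, which corresponds to the conservative retention described above. The coarser the granularity, the cheaper the bookkeeping and the more likely a false positive.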
To assign labels, we take advantage of the well-
defined interfaces through which applications access sen-
sitive data. For example, all information retrieved from
GPS hardware is location-sensitive, and all informa-
tion retrieved from an address book database is contact-
sensitive. This avoids relying on heuristics [10] or man-
ual specification [61] for labels. We expand on informa-
tion sources in Section 5.
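As a rough illustration of interface-based labeling, the sketch below maps each sensitive interface to a fixed marking, so the label depends only on where the data was read and not on its content. The enum values, bit positions, and method name are hypothetical.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical sketch: labels are determined by the interface the data came from. */
final class SourceLabels {
    /** Illustrative sensitive interfaces; the set of sources is an assumption. */
    enum Source { GPS, ADDRESS_BOOK }

    static final int TAINT_LOCATION = 1 << 0;
    static final int TAINT_CONTACTS = 1 << 1;

    private static final Map<Source, Integer> LABELS = new EnumMap<>(Source.class);
    static {
        LABELS.put(Source.GPS, TAINT_LOCATION);
        LABELS.put(Source.ADDRESS_BOOK, TAINT_CONTACTS);
    }

    /** Returns the marking applied to every value read through the given interface. */
    static int labelFor(Source source) {
        return LABELS.get(source);
    }
}
```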
In order to achieve this tracking at multiple granulari-
ties, our approach relies on the firmware’s integrity. The
taint tracking system’s trusted computing base includes
the virtual machine executing in userspace and any na-
tive system libraries loaded by the untrusted interpreted
application. However, this code is part of the firmware,
and is therefore trusted. Applications can only escape
the virtual machine by executing native methods. In our
target platform (Android), we modified the native library
loader to ensure that applications can only load native li-
braries from the firmware and not those downloaded by
the application. Note that an early 2010 survey of the top
50 most popular free applications in each category of the
Android Market [2] (1100 applications in total) revealed
that less than 4% included a .so file. A similar survey
conducted in mid 2010 revealed this fraction increased to
5%, which indicates there is growth in the number of ap-
plications using native third-party libraries, but that the
number of affected applications remains small.
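The loader restriction can be sketched as a path check. The text does not describe how the modified loader decides, so the following is only an assumed policy: a library may be loaded if its normalized path lies under a firmware directory, taken here to be /system/lib. The class and method names are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical sketch of a loader policy that admits only firmware libraries. */
final class NativeLoadPolicy {
    // Assumption: firmware-provided native libraries live under this directory.
    static final Path FIRMWARE_LIB_DIR = Paths.get("/system/lib");

    /** Returns true only if the normalized path is inside the firmware library directory. */
    static boolean mayLoad(String libraryPath) {
        // normalize() collapses "." and ".." so a path cannot step out of the directory.
        Path normalized = Paths.get(libraryPath).normalize();
        // Path.startsWith compares whole path elements, so "/system/library" does not match.
        return normalized.startsWith(FIRMWARE_LIB_DIR);
    }
}
```

A production check would also need to resolve symbolic links before comparing, which this sketch omits.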
In summary, we provide a novel, efficient, system-
wide, multiple-marking, taint tracking design by com-
bining multiple granularities of information tracking.
While some techniques such as variable tracking within
an interpreter have been previously proposed (see Sec-
tion 9), to our knowledge, our approach is the first to
extend such tracking system-wide. By choosing a mul-
tiple granularity approach, we balance performance and
accuracy. As we show in Sections 6 and 7, our system-wide
approach is both highly efficient (∼14% CPU overhead
and ∼4.4% memory overhead for simultaneously
tracking 32 taint markings per data unit) and accurately
detects many suspicious network packets.
3 Background: Android
Android [1] is a Linux-based, open source, mobile
phone platform. Most core phone functionality is imple-
mented as applications running on top of a customized
middleware. The middleware itself is written in Java
and C/C++. Applications are written in Java and compiled
to a custom byte-code known as the Dalvik EXecutable
(DEX) byte-code format. Each application executes
within its own Dalvik VM interpreter instance. Each
instance executes under a unique UNIX user identity to isolate
applications within the Linux platform subsystem.
Applications communicate via the binder IPC mechanism.
Binder provides transparent message passing based on
parcels. We now discuss topics necessary to understand
our tracking system.
Dalvik VM Interpreter: DEX is a register-based ma-
chine language, as opposed to Java byte-code, which is
stack-based. Each DEX method has its own predefined
number of virtual registers (which we frequently refer to
as simply “registers”). The Dalvik VM interpreter man-
ages method registers with an internal execution state
stack; the current method’s registers are always on the
top stack frame. These registers loosely correspond to
local variables in the Java method and store primitive
types and object references. All computation occurs
on registers; therefore, values must be loaded from class
fields before use and stored back to them after use. Note that
DEX uses class fields for all long term storage, unlike
hardware register-based machine languages (e.g., x86),
which store values in arbitrary memory locations.
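To make the register model concrete, the following sketch shows a per-method frame of virtual registers with a parallel array of taint tags. Keeping tags in a parallel array is an assumption of this sketch, and the class and method names are hypothetical; the operations loosely mirror DEX-style const, move, and add-int instructions.

```java
/** Hypothetical sketch: a method frame of virtual registers with one taint tag per register. */
final class RegisterFrame {
    final int[] regs;  // register values (primitives only, for brevity)
    final int[] tags;  // taint tag for each register; 0 means untainted

    RegisterFrame(int registerCount) {
        regs = new int[registerCount];
        tags = new int[registerCount];
    }

    /** Models loading a value and its tag from a class field into a register. */
    void load(int dst, int value, int tag) {
        regs[dst] = value;
        tags[dst] = tag;
    }

    /** const vDst, #value: a literal carries no taint. */
    void constant(int dst, int value) {
        regs[dst] = value;
        tags[dst] = 0;
    }

    /** move vDst, vSrc: the tag travels with the value. */
    void move(int dst, int src) {
        regs[dst] = regs[src];
        tags[dst] = tags[src];
    }

    /** add-int vDst, vA, vB: the result carries the union of both operand tags. */
    void addInt(int dst, int a, int b) {
        regs[dst] = regs[a] + regs[b];
        tags[dst] = tags[a] | tags[b];
    }
}
```

Because every operand is a named register rather than a raw address, the interpreter knows which storage locations an instruction reads and writes, which is what makes per-variable propagation rules straightforward to state.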
Native Methods: The Android middleware provides ac-
cess to native libraries for performance optimization and
third-party libraries such as OpenGL and WebKit. Android
also uses Apache Harmony Java [3], which fre-
quently uses system libraries (e.g., math routines). Na-
tive methods are written in C/C++ and expose function-
ality provided by the underlying Linux kernel and ser-
vices. They can also access Java internals, and hence are