DNS resolution, the request is routed to an Edge Load
Balancer (ELB) [16]. ELBs are geo-distributed so that TCP sessions can be established closer to the user, avoiding excessive latency during the TCP handshake and SSL termination. ELBs also provide a point of indirec-
tion for better load balancing, acting as a proxy between
the user and data center.
Once a request is routed to a particular data center, a
Software Load Balancer routes it to one of many possi-
ble Web servers, each of which runs the HipHop Virtual
Machine runtime [35]. Request execution on the Web
server triggers many RPCs to caching layers that include
Memcache [20] and TAO [7]. Requests also occasionally
access databases.
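To make this fan-out concrete, the following sketch (Python, with hypothetical stubs standing in for the real Memcache, TAO, and database client libraries) shows how a single request handler might issue parallel RPCs to the caching layers and fall back to the database on a miss:

import asyncio

# Hypothetical stubs; the real services expose their own client libraries.
async def memcache_get(key):
    await asyncio.sleep(0.001)   # simulated cache RPC
    return None                  # simulate a miss

async def tao_get(obj_id):
    await asyncio.sleep(0.002)   # simulated TAO RPC
    return {"id": obj_id}

async def db_query(sql):
    await asyncio.sleep(0.010)   # databases are touched only occasionally
    return [("row",)]

async def handle_request(user_id):
    # One request fans out to many caching-layer RPCs in parallel.
    profile, friends = await asyncio.gather(
        tao_get(user_id),
        memcache_get("friends:%d" % user_id),
    )
    if friends is None:          # cache miss: fall back to the database
        friends = await db_query("SELECT friend_id FROM friends WHERE id=%d" % user_id)
    return {"profile": profile, "friends": friends}

print(asyncio.run(handle_request(42)))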
RPC responses pass through the load-balancing lay-
ers on their way back to the client. On the client, the
exact order and manner of rendering a Web page are
dependent on the implementation details of the user’s
browser. However, in general, there will be a Cascad-
ing Style Sheet (CSS) download stage and a Document
Object Model rendering stage, followed by a JavaScript
execution stage.
As with all modern Internet services, to achieve la-
tency objectives, the handling of an individual request
exhibits a high degree of concurrency. Tens to hun-
dreds of individual components execute in parallel over
a distributed set of computers, including both server and
client machines. Such concurrency makes performance
analysis and debugging complex. Fortunately, standard
techniques such as critical path analysis and slack analy-
sis can tame this complexity. However, all such analyses
need a model of the causal dependencies in the system
being analyzed. Our work fills this need.
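To illustrate why such a model is needed, the sketch below (an invented example in Python, not Facebook's implementation) computes the critical path over a small causal dependency DAG; slack analysis falls out of the same finish times:

import functools

# Hypothetical causal model: each segment has a duration, and an edge
# records that a segment cannot start until its predecessors finish.
durations = {"dns": 5, "elb": 2, "web": 30, "memcache": 8, "tao": 12, "render": 20}
deps = {
    "elb": ["dns"],
    "web": ["elb"],
    "memcache": ["web"],
    "tao": ["web"],
    "render": ["memcache", "tao"],   # rendering waits on both RPCs
}

@functools.lru_cache(maxsize=None)
def finish_time(seg):
    # A segment finishes after its latest dependency finishes plus its own work.
    start = max((finish_time(d) for d in deps.get(seg, [])), default=0)
    return start + durations[seg]

# Walk back from the last segment to finish, always through the
# latest-finishing dependency; that chain is the critical path.
end = max(durations, key=finish_time)
path = [end]
while deps.get(path[-1]):
    path.append(max(deps[path[-1]], key=finish_time))
print(list(reversed(path)), finish_time(end))

In this example the TAO call, not the Memcache call, lies on the critical path, so the Memcache segment has four units of slack: it could run that much longer without delaying the request.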
3 ÜberTrace: End-to-end Request Tracing
As discussed in the prior section, request execution
at Facebook involves many software components. Prior
to our work, almost all of these components had logging
mechanisms used for debugging and optimizing the indi-
vidual components. In fact, our results show that individ-
ual components are almost always well-optimized when
considered in isolation.
Yet, there existed no complete and detailed instru-
mentation for monitoring the end-to-end performance of
Facebook requests. Such end-to-end monitoring is vital
because individual components can be well-optimized in
isolation yet still miss opportunities to improve perfor-
mance when components interact. Indeed, the opportuni-
ties for performance improvement we identify all involve
the interaction of multiple components.
Thus, the first step in our work was to unify the indi-
vidual logging systems at Facebook into a single end-to-
end performance tracing tool, dubbed ÜberTrace. Our
basic approach is to define a minimal schema for the in-
formation contained in a log message, and then map ex-
isting log messages to that schema.
ÜberTrace requires that log messages contain at least:
1. A unique request identifier.
2. The executing computer (e.g., the client or a particular server).
3. A timestamp that uses the local clock of the executing computer.
4. An event name (e.g., “start of DOM rendering”).
5. A task name, where a task is defined to be a distributed thread of control.
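A minimal encoding of this schema might look like the following sketch (field names are illustrative, not ÜberTrace's actual representation):

from dataclasses import dataclass

@dataclass(frozen=True)
class LogEvent:
    request_id: str   # unique identifier shared by all events of one request
    host: str         # executing computer (the client or a particular server)
    timestamp: float  # local clock of the executing computer, not global time
    event: str        # event name, e.g., "start of DOM rendering"
    task: str         # a task is a distributed thread of control

# Existing log messages from each component are mapped onto this schema.
e = LogEvent("req-7f3a", "web042", 1405.221, "start of DOM rendering", "page_render")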
ÜberTrace requires that each <event, task> tuple is
unique, which implies that there are no cycles that would
cause a tuple to appear multiple times. Although this
assumption is not valid for all execution environments, it
holds at Facebook given how requests are processed. We
believe that it is also a reasonable assumption for similar
Internet service pipelines.
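Given the schema above, enforcing this uniqueness assumption reduces to a simple per-request check (a sketch reusing the hypothetical LogEvent):

def check_unique(events):
    # Each <event, task> tuple may appear at most once per request; a
    # repeat would indicate a cycle, which request processing rules out.
    seen = set()
    for e in events:
        key = (e.request_id, e.event, e.task)
        if key in seen:
            raise ValueError("duplicate <event, task> tuple: %r" % (key,))
        seen.add(key)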
Since all log timestamps are relative to local clocks, ÜberTrace translates them to estimated global clock values by compensating for clock skew. ÜberTrace looks
for the common RPC pattern of communication in which
the thread of control in an individual task passes from
one computer (called the client to simplify this explana-
tion) to another, executes on the second computer (called
the server), and returns to the client.
ÜberTrace calculates the server execution time as the difference between the latest and earliest server timestamps (according to the server’s local clock) nested within the client RPC. It then calculates the client-observed execution time as the difference between the client timestamps that immediately precede and succeed the RPC. The difference between the client and
server intervals is the estimated network round-trip time
(RTT) between the client and server. By assuming that
request and response delays are symmetric, ÜberTrace
calculates clock skew such that, after clock-skew adjust-
ment, the first server timestamp in the pattern is exactly
1/2 RTT after the previous client timestamp for the task.
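The arithmetic of a single estimate can be written out as follows (a sketch with hypothetical names; all four inputs are local-clock timestamps):

def estimate_skew(client_send, server_first, server_last, client_recv):
    # client_send/client_recv: client events immediately preceding and
    # succeeding the RPC; server_first/server_last: earliest and latest
    # server events nested within it.
    server_exec = server_last - server_first      # server's view of its work
    client_observed = client_recv - client_send   # client's view of the RPC
    rtt = client_observed - server_exec           # estimated network round trip
    # Assuming symmetric request/response delays, the server's first event
    # should fall exactly RTT/2 after client_send on a global clock.
    skew = server_first - (client_send + rtt / 2.0)
    return skew, rtt

skew, rtt = estimate_skew(100.0, 250.0, 280.0, 160.0)
print(skew, rtt)   # 135.0, 30.0  (subtract skew to align server clocks)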
The above methodology is subject to normal variation
in network performance. In addition, the imprecision
of using existing log messages rather than instrument-
ing communication points can add uncertainty. For in-
stance, the first logged server message could occur only
after substantial server execution has already completed,
leading to an under-estimation of server processing time
and an over-estimation of RTT.
ÜberTrace compensates
by calculating multiple estimates. Since there are many
request and response messages during the processing of
a higher-level request, it makes separate RTT and clock