USENIX Association 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13) 387
memcache get requests. For example, loading one of our
popular pages results in an average of 521 distinct items
fetched from memcache.¹
We provision hundreds of memcached servers in a
cluster to reduce load on databases and other services.
Items are distributed across the memcached servers
through consistent hashing [22]. Thus web servers have
to routinely communicate with many memcached servers
to satisfy a user request. As a result, all web servers
communicate with every memcached server in a short
period of time. This all-to-all communication pattern
can cause incast congestion [30] or allow a single server
to become the bottleneck for many web servers. Data
replication often alleviates the single-server bottleneck
but leads to significant memory inefficiencies in the
common case.
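The consistent-hashing scheme described above can be sketched as a hash ring: each server is hashed to several points on the ring, and a key is owned by the first server point clockwise from the key's hash. This is an illustrative sketch of the general technique cited in [22], not Facebook's implementation; server names and the replica count are made up.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring (illustrative sketch only).
    Each server is hashed to `replicas` points on the ring; a key
    maps to the first server point clockwise from the key's hash,
    so adding or removing a server only remaps nearby keys."""

    def __init__(self, servers, replicas=100):
        self.replicas = replicas
        self.ring = []  # sorted list of (hash, server) points
        for server in servers:
            self.add(server)

    def _hash(self, value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add(self, server):
        for i in range(self.replicas):
            point = self._hash(f"{server}#{i}")
            bisect.insort(self.ring, (point, server))

    def lookup(self, key):
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h, ""))
        if i == len(self.ring):
            i = 0  # wrap around the ring
        return self.ring[i][1]

# Hypothetical cluster of three memcached servers.
ring = ConsistentHashRing(["mc1", "mc2", "mc3"])
server = ring.lookup("user:42:profile")
```

Because each web server computes the same mapping locally, no coordination is needed to decide which memcached server holds a given item.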
We reduce latency mainly by focusing on the
memcache client, which runs on each web server. This
client serves a range of functions, including serializa-
tion, compression, request routing, error handling, and
request batching. Clients maintain a map of all available
servers, which is updated through an auxiliary configu-
ration system.
Parallel requests and batching: We structure our web-
application code to minimize the number of network
round trips necessary to respond to page requests. We
construct a directed acyclic graph (DAG) representing
the dependencies between data. A web server uses this
DAG to maximize the number of items that can be
fetched concurrently. On average these batches consist
of 24 keys per request.²
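The DAG-driven batching above amounts to a topological scheduling problem: keys whose dependencies are all satisfied can be fetched together in one round trip. A minimal sketch of that idea (not the production scheduler; the dependency map is a made-up example) is:

```python
from collections import defaultdict, deque

def batch_schedule(deps):
    """Group keys into batches that can be fetched concurrently.
    `deps` maps each key to the set of keys that must be fetched
    first (a DAG). All keys with no unmet dependencies form one
    batch, i.e. one round of parallel memcache gets."""
    indegree = {k: len(parents) for k, parents in deps.items()}
    dependents = defaultdict(list)
    for k, parents in deps.items():
        for p in parents:
            dependents[p].append(k)

    ready = deque(k for k, d in indegree.items() if d == 0)
    batches = []
    while ready:
        batch = list(ready)
        ready.clear()
        batches.append(batch)
        for k in batch:  # fetching k may unblock its dependents
            for child in dependents[k]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    return batches

# Hypothetical page: profile and friend list depend on the user
# record; the feed depends on the friend list.
deps = {"user": set(), "profile": {"user"},
        "friends": {"user"}, "feed": {"friends"}}
# batch_schedule(deps) -> [["user"], ["profile", "friends"], ["feed"]]
```

Three batches instead of four serial fetches: the wider the DAG's levels, the fewer round trips a page load needs.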
Client-server communication: Memcached servers do
not communicate with each other. When appropriate,
we embed the complexity of the system into a stateless
client rather than in the memcached servers. This greatly
simplifies memcached and allows us to focus on making
it highly performant for a more limited use case. Keep-
ing the clients stateless enables rapid iteration in the
software and simplifies our deployment process. Client
logic is provided as two components: a library that can
be embedded into applications, or a standalone proxy
named mcrouter. This proxy presents a memcached
server interface and routes the requests/replies to/from
other servers.
Clients use UDP and TCP to communicate with
memcached servers. We rely on UDP for get requests to
reduce latency and overhead. Since UDP is connectionless,
each thread in the web server can communicate with
memcached servers directly, bypassing mcrouter,
without establishing and maintaining a
¹ The 95th percentile of fetches for that page is 1,740 items.
² The 95th percentile is 95 keys per request.
[Figure 3: Get latency for UDP direct vs. TCP via mcrouter, in microseconds; bars show the average of medians and the average of 95th percentiles.]
connection, thereby reducing overhead. The UDP
implementation detects packets that are dropped or re-
ceived out of order (using sequence numbers) and treats
them as errors on the client side. It does not provide
any mechanism to try to recover from them. In our in-
frastructure, we find this decision to be practical. Un-
der peak load, memcache clients observe that 0.25% of
get requests are discarded. About 80% of these drops
are due to late or dropped packets, while the remainder
are due to out of order delivery. Clients treat get er-
rors as cache misses, but web servers will skip insert-
ing entries into memcached after querying for data to
avoid putting additional load on a possibly overloaded
network or server.
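The client-side error handling described above can be sketched as follows. The wire format here (a bare 4-byte sequence number prefix) is invented for illustration and is not memcached's actual UDP frame; what the sketch shows is the policy: any missing, late, or reordered packet fails the whole get, and the failure is surfaced as a miss with no retry.

```python
import struct

SEQ_HDR = struct.Struct("!I")  # hypothetical 4-byte sequence header

def decode_udp_replies(datagrams, expected_seq):
    """Reassemble a multi-packet UDP get reply. A gap or
    out-of-order sequence number marks the whole get as an
    error; there is no recovery mechanism."""
    chunks = []
    for data in datagrams:
        if len(data) < SEQ_HDR.size:
            return None                 # malformed -> error
        (seq,) = SEQ_HDR.unpack_from(data)
        if seq != expected_seq:
            return None                 # dropped or reordered -> error
        chunks.append(data[SEQ_HDR.size:])
        expected_seq += 1
    return b"".join(chunks)

def get(key, datagrams):
    payload = decode_udp_replies(datagrams, expected_seq=0)
    if payload is None:
        # Treat the error as a cache miss, but do NOT insert the
        # value into memcached afterwards, to avoid adding load to
        # a possibly overloaded network or server.
        return None
    return payload
```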
For reliability, clients perform set and delete opera-
tions over TCP through an instance of mcrouter run-
ning on the same machine as the web server. For opera-
tions where we need to confirm a state change (updates
and deletes) TCP alleviates the need to add a retry mech-
anism to our UDP implementation.
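The transport split above can be sketched with plain sockets: gets ride a connectionless UDP socket where any error is treated as a miss, while state-changing operations go over TCP to a local mcrouter so the client receives a confirmed reply. The addresses, port, and commands below are hypothetical placeholders, not the production protocol.

```python
import socket

MCROUTER_ADDR = ("127.0.0.1", 11211)  # assumed local mcrouter port

def send_get_udp(server_addr, key, timeout=0.05):
    """Fire a get over UDP directly to a memcached server; any
    socket error or timeout is surfaced as None, i.e. a miss."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(b"get " + key + b"\r\n", server_addr)
        data, _ = sock.recvfrom(65536)
        return data
    except OSError:
        return None  # treat as a cache miss; no retry
    finally:
        sock.close()

def send_set_tcp(key, value):
    """State-changing operations use TCP through the local mcrouter
    so the client gets a confirmed reply without needing a UDP
    retry mechanism."""
    with socket.create_connection(MCROUTER_ADDR) as sock:
        cmd = b"set %s 0 0 %d\r\n%s\r\n" % (key, len(value), value)
        sock.sendall(cmd)
        return sock.recv(1024)  # confirmation line from the server
```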
Web servers rely on a high degree of parallelism and
over-subscription to achieve high throughput. The high
memory demands of open TCP connections make it
prohibitively expensive to have an open connection be-
tween every web thread and memcached server without
some form of connection coalescing via mcrouter. Co-
alescing these connections improves the efficiency of
the server by reducing the network, CPU and memory
resources needed by high throughput TCP connections.
Figure 3 shows the average, median, and 95th percentile
latencies of web servers in production getting keys over
UDP and through mcrouter via TCP. In all cases, the
standard deviation from these averages was less than
1%. As the data show, relying on UDP can lead to a
20% reduction in latency to serve requests.
Incast congestion: Memcache clients implement flow-
control mechanisms to limit incast congestion. When a