Chukwa: A large-scale monitoring system
Jerome Boulon
jboulon@yahoo-inc.com
Yahoo!, Inc.
Andy Konwinski
andyk@cs.berkeley.edu
UC Berkeley
Runping Qi
runping@yahoo-inc.com
Yahoo!, Inc.
Ariel Rabkin
asrabkin@cs.berkeley.edu
UC Berkeley
Eric Yang
eyang@yahoo-inc.com
Yahoo!, Inc.
Mac Yang
macyang@yahoo-inc.com
Yahoo!, Inc.
Abstract
We describe the design and initial implementation of
Chukwa, a data collection system for monitoring and an-
alyzing large distributed systems. Chukwa is built on
top of Hadoop, an open source distributed filesystem and
MapReduce implementation, and inherits Hadoop’s scal-
ability and robustness. Chukwa also includes a flexible
and powerful toolkit for displaying monitoring and anal-
ysis results, in order to make the best use of this collected
data.
1 Introduction
Hadoop is a distributed filesystem and MapReduce [1]
implementation that is used pervasively at Yahoo! for a
variety of critical business purposes. Production clusters
often include thousands of nodes. Large distributed sys-
tems such as Hadoop are fearsomely complex, and can
fail in complicated and subtle ways. As a result, Hadoop
is extensively instrumented. A two-thousand node clus-
ter configured for normal operation generates nearly half
a terabyte of monitoring data per day, mostly application-
level log files.
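The cluster size and daily total above imply a modest per-node rate. The 2,000-node count and half-terabyte figure come from the text; the per-node breakdown below is our own back-of-the-envelope arithmetic:

```java
// Back-of-the-envelope check: ~0.5 TB/day of monitoring data spread
// over a 2,000-node cluster. Only the totals come from the paper;
// the per-node figures are derived here.
public class LogVolume {
    public static void main(String[] args) {
        long totalBytesPerDay = 500L * 1024 * 1024 * 1024; // ~0.5 TB/day
        int nodes = 2000;
        long perNodePerDay = totalBytesPerDay / nodes;     // ~256 MB/node/day
        double perNodePerSec = perNodePerDay / 86400.0;    // ~3 KB/s/node
        System.out.printf("%d MB/node/day, %.1f KB/s/node%n",
                perNodePerDay / (1024 * 1024), perNodePerSec / 1024);
    }
}
```

A few kilobytes per second per node is trivial to produce, which is why the challenge lies in aggregate collection and analysis rather than in generation.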
This data is invaluable for debugging, performance
measurement, and operational monitoring. However,
processing this data in real time at scale is a formidable
challenge. A good monitoring system ought to scale out
to very large deployments, and ought to handle crashes
gracefully. In Hadoop, only a handful of aggregate met-
rics, such as task completion rate and available disk
space, are computed in real time. The vast bulk of the
generated data is stored locally, and accessible via a per-
node web interface. Unfortunately, this mechanism does
not facilitate programmatic analysis of the log data, nor
the long term archiving of such data.
To make full use of log data, users must first write
ad-hoc log aggregation scripts to centralize the required
data, and then build mechanisms to analyze the collected
data. Logs are periodically deleted unless users take the
initiative to store them.
We believe that our situation is typical, and that lo-
cal storage of logging data is a common model for very
large deployments. To the extent that more sophisticated
data management techniques are utilized, they are largely
supported by ad-hoc proprietary solutions. A well docu-
mented open source toolset for handling monitoring data
thus solves a significant practical problem and provides
a valuable reference point for future development in this
area.
We did not aim to solve the problem of real-time mon-
itoring for failure detection, which systems such as Gan-
glia already do well. Rather, we wanted a system that
would process large volumes of data, in a timescale of
minutes, not seconds, to detect more subtle conditions,
and to aid in failure diagnosis. Human engineers do not
generally react on a timescale of seconds, and so a pro-
cessing delay of a few minutes is not a concern for us.
We are in the process of building a system, which we
call Chukwa, to demonstrate that practical large-scale
monitoring systems can be readily built atop this existing
infrastructure. Chukwa uses Hadoop's distributed file
system (HDFS) as its data store, and relies on
MapReduce jobs to process the data.
By leveraging these existing tools, Chukwa can scale
to thousands of nodes in both collection and analysis
capacities, while providing a standardized and familiar
framework for processing the collected data. Many com-
ponents of Chukwa are pluggable, allowing easy cus-
tomization and enhancement.
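Chukwa's actual analyses run as Hadoop MapReduce jobs over data in HDFS. As a minimal, self-contained sketch of the map-then-reduce structure such jobs follow (the log format, host names, and class name below are invented for illustration, and the computation runs in-process rather than on a cluster):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: mimics the shape of a MapReduce log analysis.
// "Map" parses each log line into a key (the host name); "reduce"
// sums the per-key counts. A real Chukwa job would express the same
// logic with Hadoop's Mapper/Reducer APIs over files in HDFS.
public class ErrorCount {
    public static void main(String[] args) {
        List<String> logLines = List.of(
                "host01 ERROR disk full",
                "host02 INFO task started",
                "host01 ERROR checksum mismatch",
                "host03 WARN slow heartbeat");

        // Map: keep ERROR lines, keyed by host. Reduce: count per host.
        Map<String, Long> errorsPerHost = logLines.stream()
                .filter(line -> line.contains(" ERROR "))
                .collect(Collectors.groupingBy(
                        line -> line.split(" ")[0],
                        Collectors.counting()));

        System.out.println(errorsPerHost); // {host01=2}
    }
}
```

Because the per-key reduction is associative, the same job parallelizes cleanly across thousands of nodes when run under Hadoop.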
The core components of Chukwa are largely complete,
and we expect the system to enter production use at Ya-
hoo! within the next few months. We have some ini-
tial operational experience, and preliminary performance
metrics. We begin by discussing our goals and require-
ments in some detail. We then describe our design, ex-
1 In Hindu mythology, Chukwa is the turtle that holds up Mahapudma,
the elephant that holds up the world. This name is especially
appropriate for us, since the Hadoop mascot is a yellow elephant.