Data Warehousing and Analytics Infrastructure at Facebook

Ashish Thusoo, Zheng Shao, Suresh Anthony, Dhruba Borthakur,
Namit Jain, Joydeep Sen Sarma, Raghotham Murthy, Hao Liu

Facebook

The authors can be reached at the following addresses:
{athusoo,dhruba,rmurthy,zshao,njain,hliu,suresh,jssarma}@facebook.com
ABSTRACT
Scalable analysis on large data sets has been core to the functions
of a number of teams at Facebook, both engineering and
non-engineering. Apart from ad hoc analysis of data and creation of
business intelligence dashboards by analysts across the company,
a number of Facebook's site features are also based on analyzing
large data sets. These features range from simple reporting
applications like Insights for the Facebook Advertisers, to more
advanced kinds such as friend recommendations. In order to
support this diversity of use cases on the ever-increasing amount
of data, a flexible infrastructure that scales up in a cost-effective
manner is critical. We have leveraged, authored and contributed
to a number of open source technologies in order to address these
requirements at Facebook. These include Scribe, Hadoop and
Hive which together form the cornerstones of the log collection,
storage and analytics infrastructure at Facebook. In this paper we
will present how these systems have come together and enabled us
to implement a data warehouse that stores more than 15PB of data
(2.5PB after compression) and loads more than 60TB of new data
(10TB after compression) every day. We discuss the motivations
behind our design choices, the capabilities of this solution, the
challenges that we face in day-to-day operations, and the future
capabilities and improvements that we are working on.
Categories and Subject Descriptors
H.m [Information Systems]: Miscellaneous.
General Terms
Management, Measurement, Performance, Design, Reliability,
Languages.
Keywords
Data warehouse, scalability, data discovery, resource sharing,
distributed file system, Hadoop, Hive, Facebook, Scribe, log
aggregation, analytics, map-reduce, distributed systems.
1. INTRODUCTION
A number of applications at Facebook rely on processing large
quantities of data. These applications range from simple reporting
and business intelligence applications that generate aggregated
measurements across different dimensions to the more advanced
machine learning applications that build models on training data
sets. At the same time, there are users who want to carry out ad hoc
analysis on data to test different hypotheses or to answer one-time
questions posed by different functional parts of the company.
On any given day, about 10,000 jobs are submitted by the users. These
jobs have very diverse characteristics, such as degree of
parallelism, execution time, resource needs, and data delivery
deadlines. This diversity in turn means that the data processing
infrastructure has to be flexible enough to support different
service levels as well as optimal algorithms and techniques for the
different query workloads.
What makes this task even more challenging is the fact that the
data under consideration continues to grow rapidly as more and
more users end up using Facebook as a ubiquitous social network
and as more and more instrumentation is added to the site. As an
example of this tremendous data growth, consider that while today
we load between 10TB and 15TB of compressed data every day, just
six months ago this number was in the 5-6TB range. Note that these
figures are for the data after compression; the uncompressed raw
data would be in the 60-90TB range (assuming a compression factor
of 6). Needless to say, such rapid growth places very strong
scalability
requirements on the data processing infrastructure. Strategies that
are reliant on systems that do not scale horizontally are
completely ineffective in this environment. The ability to scale
using commodity hardware is the only cost-effective option that
enables us to store and process such large data sets.
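The back-of-the-envelope arithmetic above can be made explicit with a short
Python sketch. The daily compressed volumes and the compression factor of 6
are taken from the preceding paragraph; the growth comparison over range
midpoints is only an illustrative assumption.

    # Back-of-the-envelope sketch of the daily volumes quoted above. The
    # compressed daily load (10-15TB) and the compression factor of 6 come
    # from the text; the growth comparison uses range midpoints and is
    # purely illustrative.
    COMPRESSION_FACTOR = 6  # assumed average compression ratio

    def uncompressed_tb(compressed_tb, factor=COMPRESSION_FACTOR):
        """Estimate raw (uncompressed) volume from the compressed size."""
        return compressed_tb * factor

    # Today's load: 10-15TB compressed per day -> 60-90TB raw per day.
    for compressed in (10, 15):
        print(f"{compressed}TB compressed ~ {uncompressed_tb(compressed)}TB raw")

    # Six months earlier the load was 5-6TB compressed per day, so the
    # daily volume roughly doubled over six months.
    growth = ((10 + 15) / 2) / ((5 + 6) / 2)
    print(f"Approximate growth over six months: {growth:.1f}x")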
In order to address both of these challenges of diversity and scale,
we have built our solutions on technologies that support these
characteristics at their core. On the storage and compute side we
rely heavily on Hadoop[1] and Hive[2] – two open source
technologies that we have significantly contributed to, and in the