RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems

Yongqiang He #$1, Rubao Lee %2, Yin Huai %3, Zheng Shao #4, Namit Jain #5, Xiaodong Zhang %6, Zhiwei Xu $7

# Facebook Data Infrastructure Team
{1 heyongqiang, 4 zshao, 5 njain}@fb.com

% Department of Computer Science and Engineering, The Ohio State University
{2 liru, 3 huai, 6 zhang}@cse.ohio-state.edu

$ Institute of Computing Technology, Chinese Academy of Sciences
7 zxu@ict.ac.cn
Abstract— MapReduce-based data warehouse systems play an important role in supporting big data analytics: they enable typical Web service providers and social network sites (e.g., Facebook) to quickly understand the dynamics of user behavior trends and user needs. In such a system, the data placement structure is a critical factor that can affect warehouse performance in a fundamental way. Based on our observations and analysis of Facebook production systems, we have characterized four requirements for the data placement structure: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) strong adaptivity to highly dynamic workload patterns. We have examined three commonly accepted data placement structures of conventional databases, namely row-stores, column-stores, and hybrid-stores, in the context of large data analysis using MapReduce, and we show that none of them is well suited to big data processing in distributed systems. In this paper, we present a big data placement structure called RCFile (Record Columnar File) and its implementation in the Hadoop system. Through intensive experiments, we show the effectiveness of RCFile in satisfying the four requirements. RCFile has been chosen as the default option in the Facebook data warehouse system. It has also been adopted by Hive and Pig, the two most widely used data analysis systems, developed at Facebook and Yahoo!, respectively.
I. INTRODUCTION
We have entered an era of data explosion, in which many of the data sets being processed and analyzed are called "big data". Big data not only requires a huge amount of storage, but also demands new data management on large distributed systems, because conventional database systems have difficulty managing big data. One important and emerging application of big data arises in social networks on the Internet, where billions of people all over the world connect and the number of users, along with their various activities, is growing rapidly. For example, the number of registered users of Facebook, the largest social network in the world, has exceeded 500 million [1]. One critical task at Facebook is to quickly understand the dynamics of user behavior trends and user needs based on big data sets that record busy user activities.
The MapReduce framework [2] and its open-source implementation Hadoop [3] provide a scalable and fault-tolerant infrastructure for big data analysis on large clusters. Furthermore, MapReduce-based data warehouse systems have been successfully built at major Web service providers and social network Websites, where they play critical roles in executing various daily operations, including Web click-stream analysis, advertisement analysis, data mining applications, and many others. Two widely used Hadoop-based warehouse systems are Hive [4][5] at Facebook and Pig [6] at Yahoo!.
These MapReduce-based warehouse systems cannot directly control storage disks in clusters. Instead, they have to utilize the cluster-level distributed file system (e.g., HDFS, the Hadoop Distributed File System) to store a huge amount of table data. Therefore, a serious challenge in building such a system is to find an efficient data placement structure that determines how table data are organized in the underlying HDFS. Being a critical factor that can affect warehouse performance in a fundamental way, such a data placement structure must be well optimized to meet the big data processing requirements and to efficiently leverage the merits of a MapReduce environment.
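As a rough, informal illustration of the three placement families named above (not the paper's actual RCFile implementation, and with an invented toy table), a data placement structure is simply a rule for the physical order in which a table's values are laid out:

```python
# Toy table: each row is (user_id, country, clicks). Invented for illustration.
rows = [(1, "US", 10), (2, "US", 12), (3, "DE", 11), (4, "US", 13)]

def row_store(rows):
    # Row-store: all values of one record are stored contiguously,
    # so loading a record is cheap but reading one column scans everything.
    return [list(r) for r in rows]

def column_store(rows):
    # Column-store: all values of one column are stored contiguously,
    # so a query touching one column reads only that column's data.
    return [list(c) for c in zip(*rows)]

def hybrid_store(rows, group_size):
    # Hybrid: partition rows horizontally into row groups, then store each
    # group column-by-column. Column reads stay cheap, while all columns of
    # a record remain inside one group (and thus can be co-located on one
    # node in a distributed file system such as HDFS).
    return [column_store(rows[i:i + group_size])
            for i in range(0, len(rows), group_size)]

print(hybrid_store(rows, 2))
```

With a group size of 2, the hybrid layout yields two row groups, each holding three contiguous column runs; RCFile, described later in the paper, follows this hybrid idea.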
A. Big Data Processing Requirements
Based on our analysis of Facebook systems and huge user data sets, we have summarized the following four critical requirements for a data placement structure in a MapReduce environment.
1) Fast data loading. Loading data quickly is critical for the Facebook production data warehouse. On average, more than 20TB of data are pushed into a Facebook data warehouse every day. Thus, it is highly desirable to reduce data loading time, since network and disk traffic during data loading interferes with normal query executions.
2) Fast query processing. Many queries are response-time critical in order to satisfy the requirements of both real-time Website requests and heavy workloads of decision-support queries submitted by highly concurrent users. This requires that the underlying data placement structure retain high query-processing speed as the number of queries rapidly increases.
3) Highly efficient storage space utilization. Rapidly growing user activities have constantly demanded scalable storage capacity and computing power. Limited disk
978-1-4244-8958-9/11/$26.00 © 2011 IEEE
ICDE Conference 2011