Data Warehousing and Analytics Infrastructure at Facebook

Ashish Thusoo, Zheng Shao, Suresh Anthony, Dhruba Borthakur,
Namit Jain, Joydeep Sen Sarma, Raghotham Murthy, Hao Liu

Facebook

The authors can be reached at the following addresses:
{athusoo,dhruba,rmurthy,zshao,njain,hliu,suresh,jssarma}@facebook.com
ABSTRACT
Scalable analysis on large data sets has been core to the functions
of a number of teams at Facebook, both engineering and
non-engineering. Apart from ad hoc analysis of data and creation of
business intelligence dashboards by analysts across the company,
a number of Facebook's site features are also based on analyzing
large data sets. These features range from simple reporting
applications like Insights for the Facebook Advertisers, to more
advanced kinds such as friend recommendations. In order to
support this diversity of use cases on the ever-increasing amount
of data, a flexible infrastructure that scales up in a cost-effective
manner is critical. We have leveraged, authored and contributed
to a number of open source technologies in order to address these
requirements at Facebook. These include Scribe, Hadoop and
Hive which together form the cornerstones of the log collection,
storage and analytics infrastructure at Facebook. In this paper we
will present how these systems have come together and enabled us
to implement a data warehouse that stores more than 15PB of data
(2.5PB after compression) and loads more than 60TB of new data
(10TB after compression) every day. We discuss the motivations
behind our design choices, the capabilities of this solution, the
challenges that we face in day-to-day operations, and the future
capabilities and improvements that we are working on.
Categories and Subject Descriptors
H.m [Information Systems]: Miscellaneous.
General Terms
Management, Measurement, Performance, Design, Reliability,
Languages.
Keywords
Data warehouse, scalability, data discovery, resource sharing,
distributed file system, Hadoop, Hive, Facebook, Scribe, log
aggregation, analytics, map-reduce, distributed systems.
1. INTRODUCTION
A number of applications at Facebook rely on processing large
quantities of data. These applications range from simple reporting
and business intelligence applications that generate aggregated
measurements across different dimensions to the more advanced
machine learning applications that build models on training data
sets. At the same time, there are users who want to carry out ad hoc
analysis on data to test different hypotheses or to answer one-time
questions posed by different functional parts of the company.
On any given day, about 10,000 jobs are submitted by the users. These
jobs have very diverse characteristics, such as degree of
parallelism, execution time, resource needs, and data delivery
deadlines. This diversity in turn means that the data processing
infrastructure has to be flexible enough to support different
service levels as well as optimal algorithms and techniques for the
different query workloads.
What makes this task even more challenging is the fact that the
data under consideration continues to grow rapidly as more and
more users end up using Facebook as a ubiquitous social network
and as more and more instrumentation is added to the site. As an
example of this tremendous data growth, consider that while today
we load between 10TB and 15TB of compressed data every day, just
six months ago this number was in the 5-6TB range. Note that these
figures are for the data after compression; the uncompressed raw
data would be in the 60-90TB range (assuming a compression factor
of 6). Needless to say, such rapid growth places very strong
scalability
requirements on the data processing infrastructure. Strategies that
are reliant on systems that do not scale horizontally are
completely ineffective in this environment. The ability to scale
using commodity hardware is the only cost-effective option that
enables us to store and process such large data sets.
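The back-of-the-envelope arithmetic above can be made explicit with a short
Python sketch. The daily compressed volumes and the compression factor of 6
are taken from the preceding paragraph; the growth comparison over range
midpoints is only an illustrative assumption.

    # Back-of-the-envelope sketch of the daily volumes quoted above. The
    # compressed daily load (10-15TB) and the compression factor of 6 come
    # from the text; the growth comparison uses range midpoints and is
    # purely illustrative.
    COMPRESSION_FACTOR = 6  # assumed average compression ratio

    def uncompressed_tb(compressed_tb, factor=COMPRESSION_FACTOR):
        """Estimate raw (uncompressed) volume from the compressed size."""
        return compressed_tb * factor

    # Today's load: 10-15TB compressed per day -> 60-90TB raw per day.
    for compressed in (10, 15):
        print(f"{compressed}TB compressed ~ {uncompressed_tb(compressed)}TB raw")

    # Six months earlier the load was 5-6TB compressed per day, so the
    # daily volume roughly doubled over six months.
    growth = ((10 + 15) / 2) / ((5 + 6) / 2)
    print(f"Approximate growth over six months: {growth:.1f}x")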
In order to address both of these challenges of diversity and scale,
we have built our solutions on technologies that support these
characteristics at their core. On the storage and compute side we
rely heavily on Hadoop[1] and Hive[2] – two open source
technologies that we have significantly contributed to, and in the