Apache Hadoop Goes Realtime at Facebook
Dhruba Borthakur, Jonathan Gray, Joydeep Sen Sarma, Kannan Muthukkaruppan,
Nicolas Spiegelberg, Hairong Kuang, Karthik Ranganathan, Dmytro Molkov,
Aravind Menon, Samuel Rash, Rodrigo Schmidt, Amitanand Aiyer
Facebook
{dhruba, jssarma, jgray, kannan, nicolas, hairong, kranganathan, dms,
aravind.menon, rash, rodrigo, amitanand.s}@fb.com
ABSTRACT
Facebook recently deployed Facebook Messages, its first ever
user-facing application built on the Apache Hadoop platform.
Apache HBase is a database-like layer built on Hadoop designed
to support billions of messages per day. This paper describes the
reasons why Facebook chose Hadoop and HBase over other
systems such as Apache Cassandra and Voldemort and discusses
the application’s requirements for consistency, availability,
partition tolerance, data model and scalability. We explore the
enhancements made to Hadoop to make it a more effective
realtime system, the tradeoffs we made while configuring the
system, and how this solution has significant advantages over the
sharded MySQL database scheme used in other applications at
Facebook and many other web-scale companies. We discuss the
motivations behind our design choices, the challenges that we
face in day-to-day operations, and future capabilities and
improvements still under development. We offer these
observations on our deployment as a model for other companies
that are contemplating Hadoop-based solutions over traditional
sharded RDBMS deployments.
Categories and Subject Descriptors
H.m [Information Systems]: Miscellaneous.
General Terms
Management, Measurement, Performance, Distributed Systems,
Design, Reliability, Languages.
Keywords
Data, scalability, resource sharing, distributed file system,
Hadoop, Hive, HBase, Facebook, Scribe, log aggregation,
distributed systems.
1. INTRODUCTION
Apache Hadoop [1] is a top-level Apache project that includes
open source implementations of a distributed file system [2] and
MapReduce that were inspired by Google’s GFS [5] and
MapReduce [6] projects. The Hadoop ecosystem also includes
projects like Apache HBase [4], which is inspired by Google's
BigTable; Apache Hive [3], a data warehouse built on top of
Hadoop; and Apache ZooKeeper [8], a coordination service for
distributed systems.
At Facebook, Hadoop has traditionally been used in conjunction
with Hive for storage and analysis of large data sets. Most of this
analysis occurs in offline batch jobs and the emphasis has been on
maximizing throughput and efficiency. These workloads typically
read and write large amounts of data from disk sequentially. As
such, there has been less emphasis on making Hadoop performant
for random access workloads by providing low latency access to
HDFS. Instead, we have used a combination of large clusters of
MySQL databases and caching tiers built using memcached [9]. In
many cases, results from Hadoop are uploaded into MySQL or
memcached for consumption by the web tier.
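As a concrete illustration of this hand-off, the sketch below loads the
tab-separated output of a batch job from HDFS into memcached using the
open source spymemcached client. The output path, cache host, record
format, and TTL are assumptions made for the example, not details of our
production pipeline.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import net.spy.memcached.MemcachedClient;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: load tab-separated batch-job output from HDFS into memcached
// for consumption by the web tier. Path, host, and TTL are illustrative.
public class ResultLoader {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path results = new Path("/user/hive/output/part-00000"); // hypothetical
    MemcachedClient mc =
        new MemcachedClient(new InetSocketAddress("cache-host", 11211));
    BufferedReader r = new BufferedReader(
        new InputStreamReader(fs.open(results), StandardCharsets.UTF_8));
    try {
      String line;
      while ((line = r.readLine()) != null) {
        String[] kv = line.split("\t", 2); // each record: key<TAB>value
        if (kv.length == 2) {
          mc.set(kv[0], 86400, kv[1]); // cache for one day (illustrative)
        }
      }
    } finally {
      r.close();
      mc.shutdown();
    }
  }
}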
Recently, a new generation of applications has arisen at Facebook
that require very high write throughput and cheap and elastic
storage, while simultaneously requiring low latency and disk
efficient sequential and random read performance. MySQL
storage engines are proven and have very good random read
performance, but typically suffer from low random write
throughput. It is difficult to scale up our MySQL clusters rapidly
while maintaining good load balancing and high uptime.
Administration of MySQL clusters carries a relatively high
management overhead, and the clusters typically use more expensive
hardware. Given our high confidence in the reliability and
scalability of HDFS, we began to explore Hadoop and HBase for
such applications.
The first set of applications requires realtime concurrent, but
sequential, read access to a very large stream of realtime data
being stored in HDFS. An example system generating and storing
such data is Scribe [10], an open source distributed log
aggregation service created by and used extensively at Facebook.
Previously, data generated by Scribe was stored in expensive and
hard-to-manage NFS servers. Two main applications that fall into
this category are Realtime Analytics [11] and MySQL backups.
We have enhanced HDFS to become a high performance low
latency file system and have been able to reduce our use of
expensive file servers.
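To make this access pattern concrete, the sketch below shows a minimal
log "tailer" built on the standard org.apache.hadoop.fs API: it
sequentially consumes an HDFS file that is still being written, picking
up new bytes as they become visible. The path, poll interval, and
process() hook are illustrative assumptions, not part of the system
described in this paper.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of a tailer that sequentially consumes a growing HDFS
// file, such as a Scribe log stream.
public class HdfsLogTailer {
  public static void main(String[] args)
      throws IOException, InterruptedException {
    FileSystem fs = FileSystem.get(new Configuration());
    Path log = new Path("/scribe/example-category/current.log"); // hypothetical
    long offset = 0;
    byte[] buf = new byte[64 * 1024];

    while (true) {
      // Re-check the visible length: as the writer syncs new data,
      // the readable length of the file advances.
      long len = fs.getFileStatus(log).getLen();
      if (len > offset) {
        FSDataInputStream in = fs.open(log);
        try {
          in.seek(offset); // resume where the previous pass stopped
          int n;
          while (offset < len && (n = in.read(buf)) > 0) {
            process(buf, n);
            offset += n;
          }
        } finally {
          in.close();
        }
      }
      Thread.sleep(1000); // poll interval, illustrative
    }
  }

  private static void process(byte[] b, int n) {
    System.out.print(new String(b, 0, n, StandardCharsets.UTF_8));
  }
}

Following a file under construction in this way depends on HDFS making
newly synced bytes visible to concurrent readers, one of the realtime
enhancements discussed later in the paper.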