Mastering Spark Streaming: A Hands-On Guide to Real-Time Analytics

"Pro.Spark.Streaming.The.Zen.of.Real-Time.Analytics.Using.Apache.Spark.1484" 本书《Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark》由Zubair Nabi撰写,是关于使用Apache Spark进行实时流处理的专业指南。它深入介绍了如何利用Spark Streaming构建各种实时流应用,通过实际应用案例、数据和代码来呈现端到端的实时应用开发过程。书中采用应用优先的方法,涉及社交媒体、共享经济、金融、在线广告、电信和物联网等多个行业的用例。 Spark近年来已成为大数据处理的代名词,而DStreams作为其核心部分,提供了微批处理模型以支持流分析。本书将帮助读者掌握DStreams、微批处理和函数式编程的关键特性,成为低延迟应用的专家。书中的实例和代码可以直接部署,旨在成为Spark Streaming的权威参考。 您将学到: 1. Spark Streaming应用程序开发和最佳实践。 2. DStreams的低级别细节,了解实时离散化流。 3. 流分析在不同行业和领域的应用及其重要性。 4. 通过配置策略和使用Graphite、collectd和Nagios进行监控,优化Spark Streaming的生产级部署。 5. 从MQTT、Flume、Kafka、Twitter和自定义HTTP接收器等多源数据的摄入。 6. 与HBase、Cassandra和Redis的集成和耦合。 7. 设计模式,以处理副作用并在Spark Streaming的微批处理模型中保持状态。 8. 使用数据帧、SparkSQL、Hive和SparkR实现实时和可扩展的ETL。 9. 利用流处理和批处理相结合的Lambda架构。 10. 实时机器学习、预测分析和推荐系统。 本书面向数据科学家、大数据专家、BI分析师和数据架构师。目录包括: 1. 大数据的旅行者指南 2. Spark简介 3. DStreams:实时RDD 4. 高速流:并行性和其他故事 5. 实时路线66:链接外部数据源 6. 副作用的艺术 7. 准备好黄金时间 8. 实时ETL和分析魔术 9. 扩大规模的机器学习 10. 云、Lambda和Python的世界 这本书是希望掌握实时数据分析和使用Spark Streaming构建高性能应用的专业人士的理想读物,它不仅提供理论知识,还包含实用的代码示例,使读者能够将所学应用于实际工作场景。
One million Uber rides are booked every day, 10 billion hours of Netflix videos are watched every month, and $1 trillion is spent on e-commerce web sites every year. The success of these services is underpinned by Big Data and, increasingly, real-time analytics. Real-time analytics enable practitioners to put their fingers on the pulse of consumers and incorporate their wants into critical business decisions. We have only touched the tip of the iceberg so far. Fifty billion devices will be connected to the Internet within the next decade, from smartphones, desktops, and cars to jet engines, refrigerators, and even your kitchen sink. The future is data, and it is becoming increasingly real-time. Now is the right time to ride that wave, and this book will turn you into a pro.

The low-latency stipulation of streaming applications, along with the requirements they share with general Big Data systems (scalability, fault-tolerance, and reliability), has led to a new breed of real-time computation. At the vanguard of this movement is Spark Streaming, which treats stream processing as discrete microbatch processing. This enables low-latency computation while retaining the scalability and fault-tolerance properties of Spark along with its simple programming model. In addition, it gives streaming applications access to the wider ecosystem of Spark libraries, including Spark SQL, MLlib, SparkR, and GraphX. Moreover, programmers can blend stream processing with batch processing to create applications that use data at rest as well as data in motion (a minimal sketch of this blend appears after the preface). Finally, these applications can use out-of-the-box integrations with other systems such as Kafka, Flume, HBase, and Cassandra. All of these features have turned Spark Streaming into the Swiss Army Knife of real-time Big Data processing. Throughout this book, you will exercise this knife to carve up problems from a number of domains and industries.

This book takes a use-case-first approach: each chapter is dedicated to a particular industry vertical. Real-time Big Data problems from that field are used to drive the discussion and illustrate concepts from Spark Streaming and stream processing in general. Going a step further, a publicly available dataset from that field is used to implement real-world applications in each chapter. In addition, all snippets of code are ready to be executed. To simplify this process, the code is available online, both on GitHub and on the publisher's web site. Everything in this book is real: real examples, real applications, real data, and real code. The best way to follow the flow of the book is to set up an environment, download the data, and run the applications as you go along. This will give you a taste for these real-world problems and their solutions.

These are exciting times for Spark Streaming and Spark in general. Spark has become the largest open source Big Data processing project in the world, with more than 750 contributors representing more than 200 organizations. The Spark codebase is rapidly evolving, with almost daily performance improvements and feature additions. For instance, Project Tungsten (first cut in Spark 1.4) has improved the performance of the underlying engine by many orders of magnitude. When I first started writing the book, the latest version of Spark was 1.4. Since then, there have been two more major releases of Spark (1.5 and 1.6).
The changes in these releases have included native memory management, more algorithms in MLlib, support for deep learning via TensorFlow, the Dataset API, and session management. On the Spark Streaming front, two major features have been added: mapWithState, to maintain state across batches, and back pressure, to throttle the input rate in case of queue buildup. In addition, managed Spark cloud offerings from the likes of Google, Databricks, and IBM have lowered the barrier to entry for developing and running Spark applications. Now get ready to add some "Spark" to your skillset!
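As an illustration of the two Spark Streaming features just mentioned, the sketch below enables back pressure through configuration and keeps a running count per word across batches with mapWithState. This is an assumed example rather than code from the book: the socket source, checkpoint directory, and batch interval are placeholders, and it presumes Spark 1.6, where mapWithState and the spark.streaming.backpressure.enabled setting are available.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object StatefulSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("StatefulSketch")
      .setMaster("local[2]")
      // Back pressure: throttle the input rate when batches start to queue up.
      .set("spark.streaming.backpressure.enabled", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/streaming-checkpoint")   // required for stateful operations; placeholder path

    val words = ssc.socketTextStream("localhost", 9999)   // assumed text source
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))

    // For each key, add the new count to the state kept from previous batches.
    val updateCount = (word: String, one: Option[Int], state: State[Int]) => {
      val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      (word, sum)
    }

    val runningCounts = words.mapWithState(StateSpec.function(updateCount))
    runningCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```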
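Similarly, the blend of data at rest and data in motion described in the preface can be illustrated with a hypothetical sketch that joins every micro-batch against a static lookup RDD via transform. The file path, record layout, and socket source below are assumptions made purely for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BlendSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BlendSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Data at rest: a lookup table of userId -> country, loaded once and cached.
    val userCountries = ssc.sparkContext
      .textFile("users.csv")                     // hypothetical path, lines like "u42,DE"
      .map(_.split(","))
      .map(fields => (fields(0), fields(1)))
      .cache()

    // Data in motion: events of the assumed form "userId,amount".
    val purchases = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(fields => (fields(0), fields(1).toDouble))

    // transform exposes each micro-batch as an ordinary RDD, so batch
    // operations such as join can be applied against static data.
    val enriched = purchases.transform(batch => batch.join(userCountries))
    enriched.print()   // (userId, (amount, country)) per micro-batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```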