Building Big Data Application Architectures with Hadoop

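"Hadoop Application Architectures" is a book by Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira, published by O'Reilly in 2015. It focuses on helping readers understand and build end-to-end data management solutions based on Apache Hadoop.

The book examines the key factors to consider when storing and modeling data in Hadoop, and how to move data into and out of the system effectively. It covers several data processing frameworks in detail, including classic MapReduce, the faster Spark engine, and the query tool Hive. Choosing among these frameworks, and knowing how to use each one, is central to building an efficient big data application.

Through worked examples, the authors explain common Hadoop processing patterns such as removing duplicate records and windowing analytics, both of which are core operations in large-scale data processing. The book also covers tools for processing large graphs, such as Giraph and GraphX, which are widely used in areas such as social network analysis and recommender systems.

Workflow orchestration and scheduling with Apache Oozie is covered as well. Oozie coordinates the different components of the Hadoop ecosystem so that tasks run in the intended order and at the intended time. For near-real-time stream processing, the authors discuss Apache Storm, Spark Streaming, and Flume, which play an important role in real-time data analysis and response.

In the second part, the authors present practical application architecture case studies: clickstream analysis, fraud detection, and data warehouse design. These case studies show how to apply the concepts to real business problems, for example using clickstream analysis to understand user behavior, using a fraud detection system to protect a business from losses, and building an efficient data warehouse to support decision-making.

"Hadoop Application Architectures" is a guide for IT professionals who need to design and implement complete Hadoop applications that match specific business requirements. Whether you are building a new Hadoop application or planning to integrate Hadoop into an existing data infrastructure, the book offers guidance for working through the process.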
Uploaded 2015-07-07
Get expert guidance on architecting end-to-end data management solutions with Apache Hadoop. While many sources explain how to use various components in the Hadoop ecosystem, this practical book takes you through architectural considerations necessary to tie those components together into a complete tailored application, based on your particular use case.

To reinforce those lessons, the book’s second section provides detailed examples of architectures used in some of the most commonly found Hadoop applications. Whether you’re designing a new Hadoop application, or planning to integrate Hadoop into your existing data infrastructure, Hadoop Application Architectures will skillfully guide you through the process.

This book covers:
- Factors to consider when using Hadoop to store and model data
- Best practices for moving data in and out of the system
- Data processing frameworks, including MapReduce, Spark, and Hive
- Common Hadoop processing patterns, such as removing duplicate records and using windowing analytics
- Giraph, GraphX, and other tools for large graph processing on Hadoop
- Using workflow orchestration and scheduling tools such as Apache Oozie
- Near-real-time stream processing with Apache Storm, Apache Spark Streaming, and Apache Flume
- Architecture examples for clickstream analysis, fraud detection, and data warehousing

Table of Contents

Part I. Architectural Considerations For Hadoop Applications
  Chapter 1. Data Modeling In Hadoop
  Chapter 2. Data Movement
  Chapter 3. Processing Data In Hadoop
  Chapter 4. Common Hadoop Processing Patterns
  Chapter 5. Graph Processing On Hadoop
  Chapter 6. Orchestration
  Chapter 7. Near-Real-Time Processing With Hadoop

Part II. Case Studies
  Chapter 8. Clickstream Analysis
  Chapter 9. Fraud Detection
  Chapter 10. Data Warehouse

Appendix A. Joins In Impala