Mastering the Hadoop Ecosystem: A Quick Guide


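"Hadoop.Essentials.1784396680"

Hadoop Essentials introduces the key concepts and technologies of the Hadoop ecosystem in an accessible way. It aims to help system and application developers, as well as Hadoop professionals, use the Hadoop framework to solve practical problems. The author, Shiva Achari, describes Hadoop's core components, its tools, and the scenarios where they apply. The book suits readers who are interested in Hadoop or already working on Hadoop projects. Its seven chapters run from big data fundamentals through the main parts of the Hadoop ecosystem:

1. **Introduction to Big Data and Hadoop**: The book first discusses the three Vs of big data (volume, velocity, variety), what big data means, and NoSQL databases. It then lists the different types of NoSQL databases and analytical databases, and looks at who generates big data and the common use cases. Finally, it covers Hadoop's history, advantages, and uses, along with the Hadoop ecosystem, including Apache Hadoop and the various Hadoop distributions.
2. **Hadoop Ecosystem**: This chapter examines the pillars of Hadoop: HDFS (the distributed file system), MapReduce (the parallel processing framework), and YARN (the resource scheduler). It also outlines the roles of the data access components, the data storage components (such as HBase), and the data ingestion components (such as Sqoop and Flume).
3. **HDFS, MapReduce, and YARN**: HDFS provides highly fault-tolerant distributed storage, MapReduce handles large-scale data processing, and YARN, as the resource manager, is responsible for job scheduling and cluster resource allocation.
4. **Data Access Components: Hive and Pig**: Hive provides a SQL-like tool for querying and analyzing data at scale. Pig offers a high-level scripting language, Pig Latin, that simplifies writing MapReduce jobs.
5. **Storage Components: HBase**: HBase is a non-relational database for real-time reads and writes on big data, and is especially suited to applications that need low-latency data access.
6. **Data Ingestion Components: Sqoop and Flume**: Sqoop imports structured data from traditional databases into Hadoop, while Flume collects, aggregates, and transports logs and other streaming data.
7. **Stream Processing and Real-time Analysis: Storm and Spark**: Storm provides real-time data processing. Spark is a fast, general-purpose, and scalable computing framework that supports batch processing, interactive queries, and real-time stream processing.

After reading the book, readers should understand how each Hadoop component works and how to use its tools, so that they can apply Hadoop effectively in real projects to process and analyze data efficiently.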

Cookbook, all by Packt Publishing.

He graduated from Moscow State University with an MSc degree in computer science, where he first got interested in parallel data processing, high load systems, and databases.

Preface

Hadoop is a fascinating project that has attracted a great deal of interest and contributions from many organizations and institutions. It has come a long way, from a batch processing system to a data lake and a platform for high-volume, low-latency streaming analysis, with the help of various Hadoop ecosystem components, specifically YARN. This progress has been substantial and has made Hadoop a powerful system that can be designed as a storage, transformation, batch processing, analytics, or streaming and real-time processing system.

A Hadoop project used as a data lake can be divided into multiple phases, such as data ingestion, data storage, data access, data processing, and data management. For each phase there are different sub-projects (tools, utilities, or frameworks) that help and accelerate the process. The Hadoop ecosystem components are tested, configurable, and proven; building similar utilities on our own would take a huge amount of time and effort. The core of the Hadoop framework is complex to develop for and to optimize. The smart way to speed up and ease the process is to use the appropriate Hadoop ecosystem components, so that we can concentrate more on application flow design and integration with other systems.

With the emergence of many useful sub-projects in Hadoop and other tools within the Hadoop ecosystem, the question that arises is which tool to use, when, and how to use it effectively. This book is intended to complete the jigsaw puzzle of when and how to use the various ecosystem components, and to make you well aware of the Hadoop ecosystem utilities and the cases and scenarios where they should be used.

What this book covers

Chapter 1, Introduction to Big Data and Hadoop, gives an overview of big data and Hadoop, plus different use case patterns along with the advantages and features of Hadoop.

Chapter 2, Hadoop Ecosystem, explores the different phases or layers of Hadoop project development and some components that can be used in each layer.

Chapter 3, Pillars of Hadoop – HDFS, MapReduce, and YARN, is about the three key components of Hadoop: HDFS, MapReduce, and YARN.

Chapter 4, Data Access Components – Hive and Pig, covers the data access components Hive and Pig, which provide abstraction layers on top of the MapReduce framework: a SQL-like language and the procedural Pig Latin language, respectively.

Chapter 5, Storage Components – HBase, covers the NoSQL database component HBase in detail.

Chapter 6, Data Ingestion in Hadoop – Sqoop and Flume, covers the data ingestion tools Sqoop and Flume.

Chapter 7, Streaming and Real-time Analysis – Storm and Spark, is about the streaming and real-time frameworks Storm and Spark, which can run on top of YARN.
