Python Guide: Working with Big Data Clusters Using Hadoop


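Hadoop with Python, by Zachary Radtka and Donald Miner, is a technical book that helps readers understand and use the Python language to work with Hadoop big data clusters. It is published by O'Reilly Media (copyright 2016), and O'Reilly books may be purchased for educational, business, or sales promotional use. The book is written in English, and an electronic edition is available through O'Reilly Safari Online.

Hadoop is an open source framework for processing large data sets; it combines distributed storage with distributed computation. The book pairs Python with the Hadoop ecosystem so that readers can write and run data processing jobs more efficiently. Python's concise, readable syntax and rich library support make it a common choice for data scientists and developers on Hadoop projects: pandas is widely used for data cleaning and analysis, and PySpark provides a Python interface to Apache Spark, a separate processing engine that commonly runs on Hadoop clusters (for example, scheduled by YARN and reading from HDFS).

The book covers the basic concepts of Hadoop, including the Hadoop Distributed File System (HDFS), the MapReduce model, and the YARN (Yet Another Resource Negotiator) resource scheduler. The authors describe how to use Python for data input, processing, and output, with tools such as Hadoop Streaming, Pig Latin, or Hive SQL. The book also discusses implementing machine learning algorithms on Hadoop with Python, for example using scikit-learn for data mining and predictive analytics.

Hadoop with Python suits developers who know Python but are new to Hadoop, as well as users who already know Hadoop and want to strengthen their data analysis skills. It is a useful reference for readers who want to simplify Hadoop workflows with Python, or who want to use Python in a Hadoop environment for deep learning and big data analysis.

According to its revision history, the book was first released in October 2015 and has been updated to reflect developments in Hadoop and Python. Errata and release details are available from the online resources on the O'Reilly website. Overall, Hadoop with Python is a practical tutorial that combines theory with hands-on experience and gives readers a complete guide to working with Python in a Hadoop environment.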

Running the -ls command on a new cluster will not return any results. This is because the -ls command, without any arguments, will attempt to display the contents of the user’s home directory on HDFS. This is not the same as the home directory on the host machine (e.g., /home/$USER), but is a directory within HDFS.
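
The path resolution described above can be sketched in Python. This is only an illustration, not code from the book: the function name resolve_hdfs_path is hypothetical, and the sketch assumes the usual HDFS convention that an empty or relative path is interpreted against /user/<username>.

```python
import posixpath


def resolve_hdfs_path(path, user):
    """Return the absolute HDFS path that a shell argument refers to.

    Hypothetical helper: it mimics how an empty or relative path is
    resolved against the user's HDFS home directory, /user/<user>.
    """
    home = posixpath.join("/user", user)
    if not path:
        # No argument: the command targets the HDFS home directory.
        return home
    # posixpath.join discards `home` when `path` is already absolute.
    return posixpath.normpath(posixpath.join(home, path))


if __name__ == "__main__":
    for argument in ("", "/", "data"):
        print(repr(argument), "->", resolve_hdfs_path(argument, "hduser"))
```

On a new cluster the resolved home directory does not exist yet, which is why -ls without arguments shows nothing useful.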

Providing -ls with the forward slash (/) as an argument displays the contents of the root of HDFS:

$ hdfs dfs -ls /
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2015-09-20 14:36 /hadoop
drwx------ - hadoop supergroup 0 2015-09-20 14:36 /tmp

The output provided by the hdfs dfs command is similar to the output on a Unix filesystem. By default, -ls displays the file and folder permissions, owners, and groups. The two folders displayed in this example are automatically created when HDFS is formatted. The hadoop user is the name of the user under which the Hadoop daemons were started (e.g., NameNode and DataNode), and the supergroup is the name of the group of superusers in HDFS (e.g., hadoop).
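
To make the columns of the listing concrete, here is a minimal Python sketch that splits each line of the sample output into named fields. The names LsEntry, parse_ls_line, and parse_ls_output are hypothetical, and the eight-column layout is assumed from the sample above; output from other Hadoop versions or configurations may differ.

```python
from typing import NamedTuple


class LsEntry(NamedTuple):
    permissions: str
    replication: str  # "-" for directories
    owner: str
    group: str
    size: int
    modified: str
    path: str


def parse_ls_line(line):
    """Split one listing line into its columns (layout assumed from the sample)."""
    # maxsplit=7 keeps a path that contains spaces in one piece.
    fields = line.split(None, 7)
    if len(fields) != 8:
        raise ValueError("unexpected listing line: %r" % line)
    perms, repl, owner, group, size, date, time, path = fields
    return LsEntry(perms, repl, owner, group, int(size), date + " " + time, path)


def parse_ls_output(text):
    """Parse a whole listing, skipping blank lines and the 'Found N items' header."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("Found "):
            continue
        entries.append(parse_ls_line(line))
    return entries


# Sample copied from the listing of the HDFS root shown above.
SAMPLE = """Found 2 items
drwxr-xr-x - hadoop supergroup 0 2015-09-20 14:36 /hadoop
drwx------ - hadoop supergroup 0 2015-09-20 14:36 /tmp
"""

if __name__ == "__main__":
    for entry in parse_ls_output(SAMPLE):
        kind = "dir" if entry.permissions.startswith("d") else "file"
        print(kind, entry.owner, entry.group, entry.path)
```

In practice the text would come from running hdfs dfs -ls on a cluster; the sample string is used here so the sketch runs without one.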

Creating a Directory

Home directories within HDFS are stored in /user/$USER. From the previous example with -ls, it can be seen that the /user directory does not currently exist. To create the /user directory within HDFS, use the -mkdir command:

$ hdfs dfs -mkdir /user

To make a home directory for the current user, hduser, use the -mkdir command again:

$ hdfs dfs -mkdir /user/hduser

Use the -ls command to verify that the previous directories were created:

$ hdfs dfs -ls -R /user
drwxr-xr-x - hduser supergroup 0 2015-09-22 18:01 /user/hduser
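
The same step can be driven from Python with the standard subprocess module. The sketch below is an assumption-laden illustration rather than the book's code: the helper names are hypothetical, and it uses the -p flag of -mkdir (not used in the commands above) to create /user and the home directory in a single call. By default it only builds the command, so it can be tried on a machine without Hadoop installed.

```python
import posixpath
import subprocess


def home_directory(user):
    """HDFS home directory for `user`, following the /user/<name> convention."""
    return posixpath.join("/user", user)


def mkdir_command(path, parents=True):
    """Build the argument list for `hdfs dfs -mkdir`.

    With parents=True the -p flag is added, so missing parent
    directories such as /user are created as well.
    """
    cmd = ["hdfs", "dfs", "-mkdir"]
    if parents:
        cmd.append("-p")
    cmd.append(path)
    return cmd


def create_home_directory(user, dry_run=True):
    """Create the user's HDFS home directory, or just return the command.

    dry_run=True (the default) never touches a cluster. With
    dry_run=False the `hdfs` executable must be on the PATH.
    """
    cmd = mkdir_command(home_directory(user))
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd


if __name__ == "__main__":
    print(" ".join(create_home_directory("hduser")))
```

Calling create_home_directory("hduser", dry_run=False) on a configured client would run the command; verifying the result with -ls -R /user works as shown above.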

Chapter 1: Hadoop Distributed File System (HDFS)
