Hive Programming Guide


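"Programming Hive" is an English-language book on Hive programming by Edward Capriolo, Dean Wampler, and Jason Rutherglen. It is published by O'Reilly Media and aims to cover Hive usage and programming in detail. O'Reilly books may be purchased for educational, business, or sales promotional use.

Hive is a data warehouse tool from the Apache Software Foundation that lets users process large datasets stored in a distributed file system (such as Hadoop's HDFS) using a SQL-like language, HiveQL (HQL, Hive Query Language). The book explores Hive's core concepts and techniques in depth, including:

1. **Hive installation and configuration**: how to install and configure Hive on different operating systems, and how to set up a Hadoop environment to support Hive.
2. **HQL basics**: HQL syntax in detail, including querying and loading data (the Hive releases the book covers do not support row-level updates or deletes), and how to create tables, partitions, and buckets.
3. **Loading and exporting data**: how to bring data into the Hive warehouse and export data from Hive, including the LOAD DATA command and external tables.
4. **Hive and MapReduce**: how Hive uses MapReduce for parallel computation, and how to tune those jobs for better performance.
5. **Advanced Hive features**: window functions, UDFs (user-defined functions), UDAFs (user-defined aggregate functions), and UDTFs (user-defined table-generating functions), and how to write your own functions to extend Hive.
6. **Data processing and analysis**: how to use Hive for data cleaning, transformation, and analysis, including more complex tasks such as statistical and time-series analysis.
7. **Performance optimization**: strategies for speeding up Hive queries, including choosing a suitable partitioning scheme, using indexes, and adjusting execution plans.
8. **Integration with other big data components**: how Hive works alongside tools such as Pig, HBase, and Spark to build more efficient data processing pipelines.
9. **Case studies**: real-world examples of Hive in different industries, such as online advertising, social media analysis, and financial data analysis.
10. **Best practices**: lessons the authors learned while using Hive, to help readers avoid common mistakes and work more efficiently.

The book suits developers, data analysts, and data scientists who already know some Hadoop and want to go further with Hive. After reading it, readers should be able to use Hive to manage and process large-scale data more effectively.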


Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.

This icon signifies a tip, suggestion, or general note.

This icon indicates a warning or caution.

Using Code Examples

This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Programming Hive by Edward Capriolo, Dean Wampler, and Jason Rutherglen (O’Reilly). Copyright 2012 Edward Capriolo, Aspect Research Associates, and Jason Rutherglen, 978-1-449-31933-5.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

Safari® Books Online

Safari Books Online (www.safaribooksonline.com) is an on-demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business.

Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.

xiv | Preface

champion Hadoop as a solution internally. Even though I am now very familiar with Hadoop internals, Hive is still my primary method of working with Hadoop.

It is an honor to write a Hive book. Being a Hive Committer and a member of the Apache Software Foundation is my most valued accolade.

Dean Wampler

As a “big data” consultant at Think Big Analytics, I work with experienced “data people” who eat and breathe SQL. For them, Hive is a necessary and sufficient condition for Hadoop to be a viable tool to leverage their investment in SQL and open up new opportunities for data analytics.

Hive has lacked good documentation. I suggested to my previous editor at O’Reilly, Mike Loukides, that a Hive book was needed by the community. So, here we are…

Jason Rutherglen

I work at Think Big Analytics as a software architect. My career has involved an array of technologies including search, Hadoop, mobile, cryptography, and natural language processing. Hive is the ultimate way to build a data warehouse using open technologies on any amount of data. I use Hive regularly on a variety of projects.

Acknowledgments

Everyone involved with Hive. This includes committers, contributors, as well as end users.

Mark Grover wrote the chapter on Hive and Amazon Web Services. He is a contributor to the Apache Hive project and is active in helping others on the Hive IRC channel.

David Ha and Rumit Patel, at M6D, contributed the case study and code on the Rank function. The ability to do Rank in Hive is a significant feature.

Ori Stitelman, at M6D, contributed the case study, Data Science using Hive and R, which demonstrates how Hive can be used to make a first pass on large data sets and produce results to be used by a second R process.

David Funk contributed three use cases on in-site referrer identification, sessionization, and counting unique visitors. David’s techniques show how rewriting and optimizing Hive queries can make large-scale MapReduce data analysis more efficient.

Ian Robertson read the entire first draft of the book and provided very helpful feedback on it. We’re grateful to him for providing that feedback on short notice and a tight schedule.

