Exploring Hive Programming: Highlights of an English-Language Technical Book

"hive programming" 《Programming Hive》是一本专注于Hive编程的英文技术书籍，由Edward Capriolo, Dean Wampler和Jason Rutherglen合著。这本书详细介绍了如何在云计算环境中使用Hive进行大数据处理和分析。Hive是Apache软件基金会的一个项目，它提供了一种基于Hadoop的数据仓库工具，可以将结构化的数据文件映射为一张数据库表，并提供SQL（HQL）查询功能，从而简化对大规模数据集的处理。 Hive的核心概念包括： 1. **Hive架构**：Hive构建在Hadoop之上，利用Hadoop的分布式存储和计算能力。它将数据存储在HDFS上，通过MapReduce进行并行处理。Hive包括元数据存储、查询解析、优化以及执行等组件。 2. **HQL（Hive Query Language）**：Hive提供了一种类似SQL的语言，允许用户对大数据进行查询和分析。HQL支持数据的插入、删除、更新和复杂的聚合操作，但不支持事务处理。 3. **表和分区**：在Hive中，数据被组织成表，可以进一步通过分区进行逻辑划分，提高查询效率。分区是根据特定列的值进行的，使得查询时只需要扫描相关的分区，而不是整个表。 4. **Hive元数据**：元数据包括表的结构、分区信息、表的存储位置等，存储在MySQL或其它支持的数据库中。元数据管理是Hive的关键部分，它帮助Hive理解数据的组织方式。 5. **Hive的优化**：Hive通过查询优化器自动转换HQL为MapReduce任务，包括选择最佳的执行计划，如决定是否使用join操作的优化、选择合适的排序和分组策略等。 6. **MapReduce与Hive的交互**：Hive将用户的查询转换为一系列的MapReduce任务，这些任务在Hadoop集群上并行运行。用户无需直接编写Java MapReduce程序，而是通过HQL进行操作。 7. **Hive的扩展性**：Hive支持UDF（User Defined Functions），用户可以自定义函数来处理特定的数据分析需求。此外，Hive还支持UDAF（User Defined Aggregate Functions）和UDTF（User Defined Table Generating Functions）。 8. **Hive与其它大数据工具的集成**：Hive可以与Pig、HBase、Spark等大数据工具无缝集成，实现更复杂的数据处理和分析流程。通过《Programming Hive》，读者可以深入理解Hive的工作原理，学习如何设计和优化Hive查询，以及如何在实际项目中有效利用Hive处理大规模数据。这本书适合数据分析师、数据科学家、大数据工程师以及对Hadoop和大数据处理感兴趣的读者。书中可能涵盖了Hive的安装与配置、HQL语法详解、性能调优、数据导入导出、数据建模、实时查询（如Hive on Tez或Hive on Spark）、以及与Hadoop生态系统中其他工具的整合等内容。对于想要提升Hive技能或者初次接触Hive的读者来说，这是一本宝贵的参考资料。

Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.

This icon signifies a tip, suggestion, or general note.

This icon indicates a warning or caution.

Using Code Examples

This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Programming Hive by Edward Capriolo, Dean Wampler, and Jason Rutherglen (O’Reilly). Copyright 2012 Edward Capriolo, Aspect Research Associates, and Jason Rutherglen, 978-1-449-31933-5.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

Safari® Books Online

Safari Books Online (www.safaribooksonline.com) is an on-demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business.

Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.

xiv | Preface

Safari Books Online offers a range of product mixes and pricing programs for organizations, government agencies, and individuals. Subscribers have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and dozens more. For more information about Safari Books Online, please visit us online.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.

1005 Gravenstein Highway North

Sebastopol, CA 95472

800-998-9938 (in the United States or Canada)

707-829-0515 (international or local)

707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://oreil.ly/Programming_Hive.

To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Watch us on YouTube: http://www.youtube.com/oreillymedia

What Brought Us to Hive?

The three of us arrived here from different directions.

Edward Capriolo

When I first became involved with Hadoop, I saw the distributed filesystem and MapReduce as a great way to tackle computer-intensive problems. However, programming in the MapReduce model was a paradigm shift for me. Hive offered a fast and simple way to take advantage of MapReduce in an SQL-like world I was comfortable in. This approach also made it easy to prototype proof-of-concept applications and also to champion Hadoop as a solution internally. Even though I am now very familiar with Hadoop internals, Hive is still my primary method of working with Hadoop.

It is an honor to write a Hive book. Being a Hive Committer and a member of the Apache Software Foundation is my most valued accolade.

Dean Wampler

As a “big data” consultant at Think Big Analytics, I work with experienced “data people” who eat and breathe SQL. For them, Hive is a necessary and sufficient condition for Hadoop to be a viable tool to leverage their investment in SQL and open up new opportunities for data analytics.

Hive has lacked good documentation. I suggested to my previous editor at O’Reilly, Mike Loukides, that a Hive book was needed by the community. So, here we are…

Jason Rutherglen

I work at Think Big Analytics as a software architect. My career has involved an array of technologies including search, Hadoop, mobile, cryptography, and natural language processing. Hive is the ultimate way to build a data warehouse using open technologies on any amount of data. I use Hive regularly on a variety of projects.

Acknowledgments

Everyone involved with Hive. This includes committers, contributors, as well as end users.

Mark Grover wrote the chapter on Hive and Amazon Web Services. He is a contributor to the Apache Hive project and is active helping others on the Hive IRC channel.

David Ha and Rumit Patel, at M6D, contributed the case study and code on the Rank function. The ability to do Rank in Hive is a significant feature.

Ori Stitelman, at M6D, contributed the case study, Data Science using Hive and R, which demonstrates how Hive can be used to make a first pass on large data sets and produce results to be used by a second R process.

David Funk contributed three use cases on in-site referrer identification, sessionization, and counting unique visitors. David’s techniques show how rewriting and optimizing Hive queries can make large-scale MapReduce data analysis more efficient.

Ian Robertson read the entire first draft of the book and provided very helpful feedback on it. We’re grateful to him for providing that feedback on short notice and a tight schedule.
