A Comprehensive Guide to the Hadoop Ecosystem: From Fundamentals to Practice


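Practical Hadoop Ecosystem (Apress, 2016), written by Deepak Vohra, is a hands-on guide to the frameworks and tools built around Hadoop. It is aimed at readers who want to use Apache Hadoop projects effectively on a big data development platform, especially developers who want to understand the whole ecosystem rather than only MapReduce and HDFS. The book covers the following areas:

1. **Environment setup**: step-by-step guidance on building a development environment for Hadoop projects on Linux with the Cloudera Hadoop Distribution CDH 5, so that readers can start and manage a cluster.
2. **MapReduce**: how to write and run MapReduce jobs, including the basic data-processing flow, input splitting strategies, and ways to improve performance.
3. **Storage solutions**: using Apache Hive for data storage and querying, and the NoSQL features of Apache HBase for handling very large data sets and real-time queries.
4. **Search and indexing**: using Apache Solr to build and manage data indexes on the Hadoop Distributed File System (HDFS) and speed up retrieval.
5. **Message queues and stream processing**: using the Kafka messaging system to design and implement a robust architecture for transporting and consuming data.
6. **Recommender systems**: using Mahout's user recommendation algorithms to develop a personalized recommender system based on user behavior.
7. **Log processing**: using Apache Flume to collect, clean, and move log data into HDFS without losing data.
8. **Database integration**: using Sqoop to import data from a MySQL database into Hive, HDFS, and HBase.
9. **Table creation and indexing**: creating a Hive table over Apache Solr for flexible querying and analysis.

Practical Hadoop Ecosystem combines background theory with an emphasis on hands-on practice, so that readers can apply these tools in real projects. It is a useful reference both for newcomers to big data and for experienced developers who want a deeper understanding, and it gives readers a view of the Hadoop ecosystem as a whole.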

About the Author

Deepak Vohra is a consultant and a principal member of the NuBean.com software company. Vohra is a Sun-certified Java programmer and web component developer. He has worked in the fields of XML, Java programming, and Java EE for over seven years. Vohra is the coauthor of Pro XML Development with Java Technology (Apress, 2006). He is also the author of JDBC 4.0 and Oracle JDeveloper for J2EE Development, Processing XML Documents with Oracle JDeveloper 11g, EJB 3.0 Database Persistence with Oracle Fusion Middleware 11g, and Java EE Development in Eclipse IDE (Packt Publishing). He also served as the technical reviewer on WebLogic: The Definitive Guide (O’Reilly Media, 2004) and Ruby Programming for the Absolute Beginner (Cengage Learning PTR, 2007).

Foreword

I want to welcome you to the world of Hadoop. If you are a novice or an expert looking to expand your knowledge of the technology, then you have arrived at the right place. This book contains a wealth of knowledge that can help the former become the latter. Even most experts in a technology focus on particular aspects. This book will broaden your horizon and add valuable tools to your toolbox of knowledge.

When Deepak asked me to write the foreword, I was honored and excited. Those of you who know me usually find I have no shortage of words. This case was no exception, but I found myself thinking more about what to say, and about how to keep it simple.

Every few years, technology has a period of uncertainty. It always seems we are on the cusp of the next “great” thing. Most of the time, we find that it is a fad that is soon replaced by the next shiny bauble. There are some moments that have had an impact, and some that leave the community guessing. Let’s take a look at a couple of examples to make a point.

Java appeared like manna from the heavens in 1995. Well, that is perhaps a bit dramatic. It did burst on to the scene and made development easier because you didn’t need to worry about memory management or networking. It also had this marketing mantra, which was “write once, run anywhere”. It turned out to be mostly true. This was the next “great” thing.

Rolling ahead to 1999 and the release of J2EE. Again, we encounter Java doing all the right things. J2EE technologies allowed, in a standard way, enterprises to focus on business value and not worry about the technology stack. Again, this was mostly true.

Next we take a quantum leap to 2006. I attended JavaOne 2005 and 2006 and listened to numerous presentations of where J2EE technology was going. I met a really passionate developer named Rod Johnson who was talking about Spring. Some of you may have heard of it. I also listened as Sun pushed Java EE 5, which was the next big change in the technology stack. I was also sold on a new component-based web UI framework called Woodstock, which was based on JavaServer Faces. I was in a unique position; I was in charge of making decisions for a variety of business systems at my employer at the time. I had to make a series of choices. On the one hand I could use Spring, or on the other, Java EE 5. I chose Java EE 5 because of the relationships I had developed at Sun, and because I wanted something based on a “standard”. Woodstock, which I thought was the next “great” thing, turned out to be a flash in the pan. Sun abandoned Woodstock, and well… I guess on occasion I maintain it along with some former Sun and Oracle employees. Spring, like Java EE 5, turned out to be a “great” thing.

Next was the collapse of Sun and its being acquired by the dark side. This doomsday scenario seemed to be on everyone’s mind in 2009. The darkness consumed them in 2010. What would happen to Java? It turned out everyone’s initial assessment was incorrect. Oracle courted the Java community initially, spent time and treasure to fix a number of issues in the Java SE stack, and worked on Java EE as well. It was a phenomenal wedding, and the first fruits of the union were fantastic: Java SE 7 and Java EE 7 were “great”. They allowed a number of the best ideas to become reality. Java SE 8, the third child, was developed in conjunction with the Java community. The lambda, you would have thought, was a religious movement.

While the Java drama was unfolding, a really bright fellow named Doug Cutting came along in 2006 and created an Apache project called Hadoop. The funny name was the result of his son’s toy elephant. Today it literally is the elephant in the room. This project was based on the Google File System and MapReduce. The baby elephant began to grow. Soon other projects with cute animal names like Pig, or even more apt, ZooKeeper, came along. The little elephant that could soon was “the next great thing”.

Suddenly, Hadoop was the talk of the Java community. In 2012, I handed the Cloudera team a Duke’s Choice Award for Java technology at JavaOne. Later that year, version 1.0 was released. It was the culmination of hard work for all the folks who invested sweat and tears to make the foundation of what we have today.

As I sit here wondering about Java EE 8 and its apparent collapse, I am reminded that there is still innovation going on around me. The elephant in the room is there to remind me of that.

Some of you may be wondering what Hadoop can do for you. Let’s imagine something fun and how we might use Hadoop to get some answers. Where I live in South Carolina, we have something called a Beer BBQ 5K. It is a 5K run that includes beer and BBQ at the end, along with music. Some folks will just do the beer and BBQ. That is fine, but in my case I need to do the run before. So we have data coming in on the registration; we have demographic data like age and gender. We have geographic data: where they call home. We have timing data from the timing chips. We have beer and BBQ data based on wristband scans. We have multiple races in a year.

Hmm… what can we do with that data? One item that comes to mind is marketing, or planning. How many women in which age groups attended and what beer did they drink? How many men? Did the level of physical activity have any effect on the consumption of BBQ and beer? Geographically, where did attendees come from? How diverse were the populations? Do changing locations and times of the year have different effects? How does this compare with the last three years? We have incomplete data for the first year, and the data formats have changed over time. We have become more sophisticated as the race and the available data have grown. Can we combine data from the race with publicly accessible information like runner tracking software data? How do we link the data from a provider site with our data?

Guess what? Hadoop can answer these questions and more. Each year, the quantity of data grows for simple things like a Beer BBQ 5K. It also grows in volumes as we become more connected online. Is there a correlation between Twitter data and disease outbreak and vector tracking? The answer is yes, using Hadoop. Can we track the relationship between terrorists, terrorist acts, and social media? Yes, using… well, you get the point.

If you have read this far, I don’t think I need to convince you that you are on the right path. I want to welcome you to our community, and if you are already a member, I ask you to consider contributing if you can. Remember “a rising tide raises all boats,” and you can be a part of the sea change tide.

The best way to learn any technology is to roll up your sleeves and put your fingers to work. So stop reading my foreword and get coding!

—John Yeary

NetBeans Dream Team

Founder Greenville Java Users Group

Java Users Groups Community Leader

Java Enterprise Community Leader
