Hands-On Big Data Analytics with Java: From Getting Started to In-Depth Application

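"Big Data Analytics with Java is an introductory guide for Java developers that explores how to use Java for data analysis in large-scale data environments. With the rise of big data technology, Java has become a common choice for big data processing because it is widely used on major platforms such as Hadoop. The book is practice-oriented and falls into two broad parts. The first part is an introduction that familiarizes readers with the big data environment, including the large-scale, predictive, social, and self-driven characteristics of data. The book contains a series of hands-on case studies: sentiment analysis on a Twitter dataset, personalized recommendations based on the MovieLens dataset, customer segmentation on an e-commerce dataset, and graph analytics on real flight data. These examples help readers understand and apply big data analytics in practice. The second part covers the core concepts and techniques of big data analytics, including data processing, visualization, and the basics of machine learning. The author walks through the practical use of regression and classification methods, including Naive Bayes, explains the concepts behind clustering, and discusses deep learning with the Deeplearning4j library alongside Apache Spark used from Java, so that readers can see how these tools are applied in real-world scenarios. For Java developers who want to learn these techniques and apply them at work, the book is a useful reference. It provides theory but puts more weight on practice, so that readers can build the ability to process and interpret large volumes of data while working in Java. Whether you are new to big data analytics or already have some experience, the book can help you improve your skills and meet growing data challenges."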

Preface

Even as you read this, a revolution is happening behind the scenes in the field of big data. From every coffee that you pick up at a coffee shop to everything you click or purchase online, almost every transaction, click, or choice of yours is being analyzed. From this analysis, many deductions are now being made to offer you new products and better choices according to your preferences. These techniques and associated technologies are growing so fast that, as developers, we should all be part of this new wave in the field of software. This would give us better career prospects and enhance our skill set so that we can directly impact the businesses we work for.

Earlier, technologies such as machine learning and artificial intelligence used to sit in the labs of PhD students. But with the rise of big data, these technologies have gone mainstream. Using them, you can now predict which advertisement a user is going to click on next or which product they would like to buy, or determine whether the image of a tumor is cancerous. The opportunities here are vast. Big data itself comprises a whole range of technologies, whether cluster computing frameworks such as Apache Spark or Tez, distributed storage such as HDFS and Amazon S3, or real-time SQL on underlying data using Impala or Spark SQL.

This book provides a lot of information on big data technologies, including machine learning, graph analytics, and real-time analytics, as well as an introductory chapter on deep learning. I have tried to cover both the technical and conceptual aspects of these technologies, using many real-world case studies to show how they can be applied in practice. So this book will teach you how to run a fast algorithm on the transactional data of an e-commerce site to figure out which items sell together, or how to run the PageRank algorithm on a flight dataset to figure out the most important airports in a country based on air traffic. There are many more gems like these in the book.

What this book covers

Chapter 1, Big Data Analytics with Java, introduces the core concepts of Hadoop and its key components. In easy-to-understand explanations, it shows how the components fit together and gives simple examples of using the core components HDFS and Apache Spark. This chapter also discusses the different sources that can feed data into Hadoop, the compression formats available, and the systems used to analyze that data.

Chapter 2, First Steps in Data Analysis, takes the first steps into the field of analytics on big data. We start with a simple example covering basic statistical analysis steps, followed by two popular algorithms for building association rules: the Apriori algorithm and the FP-Growth algorithm. For all case studies, we use realistic examples from an online e-commerce store to show how these algorithms can be used in the real world.
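
Both Apriori and FP-Growth search for itemsets and rules that clear support and confidence thresholds. As a minimal sketch of those two measures only (not the book's Spark-based code, and not an implementation of either algorithm), the plain-Java example below computes them over a hypothetical four-basket dataset. The class name, method names, and basket contents are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AssociationRuleSketch {

    // Fraction of baskets that contain every item in the given itemset.
    static double support(List<Set<String>> baskets, Set<String> itemset) {
        long hits = baskets.stream().filter(b -> b.containsAll(itemset)).count();
        return (double) hits / baskets.size();
    }

    // confidence(A -> B) = support(A and B together) / support(A)
    static double confidence(List<Set<String>> baskets,
                             Set<String> antecedent, Set<String> consequent) {
        Set<String> both = new HashSet<String>(antecedent);
        both.addAll(consequent);
        return support(baskets, both) / support(baskets, antecedent);
    }

    // Hypothetical baskets, used only for illustration.
    static List<Set<String>> sampleBaskets() {
        List<Set<String>> baskets = new ArrayList<Set<String>>();
        baskets.add(new HashSet<String>(Arrays.asList("bread", "butter", "milk")));
        baskets.add(new HashSet<String>(Arrays.asList("bread", "butter")));
        baskets.add(new HashSet<String>(Arrays.asList("bread", "jam")));
        baskets.add(new HashSet<String>(Arrays.asList("milk", "tea")));
        return baskets;
    }

    public static void main(String[] args) {
        List<Set<String>> baskets = sampleBaskets();
        Set<String> bread = new HashSet<String>(Arrays.asList("bread"));
        Set<String> butter = new HashSet<String>(Arrays.asList("butter"));
        System.out.println("support({bread}) = " + support(baskets, bread));
        System.out.println("confidence(bread -> butter) = "
                + confidence(baskets, bread, butter));
    }
}
```

A rule such as bread -> butter is typically reported only when both values exceed thresholds chosen by the analyst.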

Chapter 3, Data Visualization, helps you understand the different types of charts available for data analysis, how to use them, and why. With this understanding, we can make better decisions when exploring our data. This chapter also contains many code samples showing different types of charts built using Apache Spark and the JFreeChart library.

Chapter 4, Basics of Machine Learning, helps you understand the basic theoretical concepts behind machine learning: what exactly machine learning is, how it is used, examples of its use in real life, and the different forms it takes. If you are new to the field of machine learning, or want to brush up on your existing knowledge, this chapter is for you. Here I also show how, as a developer, you should approach a machine learning problem, including topics such as feature extraction, feature selection, model testing, and model selection.

Chapter 5, Regression on Big Data, explains how you can use linear regression to predict continuous values and how you can do binary classification using logistic regression. A real-world case study of estimating house prices from the features of a house is used to explain the concepts of linear regression. To explain the key concepts of logistic regression, a real-life case study of detecting heart disease in a patient based on different features is used.

Chapter 6, Naive Bayes and Sentiment Analysis, explains a probabilistic machine learning model called Naive Bayes and also briefly explains another popular model called the support vector machine. The chapter starts with basic concepts such as Bayes' theorem and then explains how these concepts are used in Naive Bayes. I then use the model to predict the sentiment, whether positive or negative, in a set of tweets from Twitter. The same case study is then re-run using the support vector machine model.
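
For reference, the relationship between Bayes' theorem and the Naive Bayes classifier mentioned above can be written out as follows. This is the standard textbook formulation rather than notation taken from the chapter; the symbols c (sentiment class) and w_1, ..., w_n (the words of a tweet) are assumptions chosen for illustration.

```latex
% Bayes' theorem for a class c and an observed tweet x
\[ P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)} \]

% Naive Bayes: the words w_1, ..., w_n are assumed to be
% conditionally independent given the class
\[ P(c \mid w_1, \dots, w_n) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c) \]

% Predicted sentiment: the class with the larger score
\[ \hat{c} = \arg\max_{c \in \{\text{positive},\, \text{negative}\}}
   P(c) \prod_{i=1}^{n} P(w_i \mid c) \]
```

The denominator P(x) is the same for every class, which is why it can be dropped when only the most likely class is needed.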

Chapter 7, Decision Trees, explains that decision trees are like flowcharts and can be built programmatically using concepts such as entropy or Gini impurity. The highlight of this chapter is a case study that shows how we can use decision trees to predict whether a person's loan application will be approved.
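
Entropy and Gini impurity both measure how mixed the class labels are at a node of the tree; a split is attractive when it lowers that measure. The following is a minimal plain-Java sketch of the two formulas, assuming the input is simply the number of samples per class at a node. The class and method names are hypothetical and are not taken from the book or from Spark.

```java
class ImpuritySketch {

    // Shannon entropy (in bits) of a node, given the sample count of each class.
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double h = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue;                 // 0 * log(0) is treated as 0
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2)); // log base 2
        }
        return h;
    }

    // Gini impurity: 1 minus the sum of squared class proportions.
    static double gini(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double sumSquares = 0.0;
        for (int c : classCounts) {
            double p = (double) c / total;
            sumSquares += p * p;
        }
        return 1.0 - sumSquares;
    }

    public static void main(String[] args) {
        int[] mixed = {5, 5};   // e.g. 5 approved and 5 rejected applications
        int[] pure = {10, 0};   // every application has the same outcome
        System.out.println("entropy(mixed) = " + entropy(mixed));
        System.out.println("gini(mixed)    = " + gini(mixed));
        System.out.println("entropy(pure)  = " + entropy(pure));
        System.out.println("gini(pure)     = " + gini(pure));
    }
}
```

Both measures are zero for a pure node and largest for an evenly mixed one, so a tree builder can compare candidate splits by how much they reduce either value.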

Chapter 8, Ensembling on Big Data, explains how ensembling plays a major role in improving predictive performance. I cover different concepts related to ensembling in this chapter, including how multiple models can be combined using bagging or boosting to improve the predictions. We also cover two highly popular and accurate ensemble models, random forests and gradient-boosted trees. Finally, we use these models to predict loan defaults in a real-world dataset from Lending Club (an online lending company).

Chapter 9, Recommendation Systems, covers a concept that has done much to make machine learning popular and that has a direct impact on business. In this chapter, we show what recommendation systems are, what they can do, and how they are built using machine learning. We cover both types of recommendation systems, content-based and collaborative, along with their strengths and weaknesses. Finally, we cover two case studies using the MovieLens dataset to recommend movies that users might like to see.

Chapter 10, Clustering and Customer Segmentation on Big Data, discusses clustering and how a real-world e-commerce store can use it to segment its customers based on how valuable they are. I cover both k-Means clustering and bisecting k-Means clustering, and use both of them in the corresponding case study on customer segmentation.
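
As a rough illustration of the k-Means idea (not the Spark-based implementation a book like this would use), the sketch below runs the usual assign-then-update loop on one-dimensional data in plain Java. The class name, the customer spend figures, and the starting centroids are all hypothetical.

```java
class KMeansSketch {

    // Index of the centroid closest to point p.
    static int nearest(double[] centroids, double p) {
        int best = 0;
        for (int k = 1; k < centroids.length; k++) {
            if (Math.abs(p - centroids[k]) < Math.abs(p - centroids[best])) {
                best = k;
            }
        }
        return best;
    }

    // k-Means on one-dimensional data: assign each point to its nearest
    // centroid, then move each centroid to the mean of its assigned points.
    static double[] kMeans(double[] points, double[] initialCentroids, int iterations) {
        double[] centroids = initialCentroids.clone();
        for (int it = 0; it < iterations; it++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (double p : points) {
                int k = nearest(centroids, p);
                sum[k] += p;
                count[k]++;
            }
            for (int k = 0; k < centroids.length; k++) {
                if (count[k] > 0) {               // an empty cluster keeps its centroid
                    centroids[k] = sum[k] / count[k];
                }
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        // Hypothetical yearly spend per customer.
        double[] spend = {10, 12, 11, 95, 100, 105};
        double[] centroids = kMeans(spend, new double[]{0, 50}, 10);
        for (int k = 0; k < centroids.length; k++) {
            System.out.println("segment " + k + " centroid = " + centroids[k]);
        }
    }
}
```

With real customer data each point would be a vector of several features, and the absolute difference would be replaced by a distance such as the Euclidean one.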

Chapter 11, Massive Graphs on Big Data, covers an interesting topic: graph analytics. We start with a refresher on graphs and their basic concepts, and then explore the different forms of analytics that can be run on graphs, whether path-based analytics involving algorithms such as breadth-first search, or connectivity analytics involving degrees of connection. A real-world flight dataset is then used to explore these forms of graph analytics, including finding the top airports using the PageRank algorithm.
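
To make the PageRank idea concrete, here is a minimal power-iteration sketch in plain Java. It is not the GraphFrames-based approach the book's required software suggests. The airport codes and routes are hypothetical, the damping factor of 0.85 is the commonly used default, and the code assumes every node has at least one outgoing edge.

```java
import java.util.Arrays;

class PageRankSketch {

    // Power-iteration PageRank. outLinks[i] lists the nodes that node i points to.
    // Assumes every node has at least one outgoing edge (no dangling nodes).
    static double[] pageRank(int[][] outLinks, double damping, int iterations) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);               // start from a uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);
            for (int src = 0; src < n; src++) {
                // Each node splits its damped rank evenly among its out-links.
                double share = damping * rank[src] / outLinks[src].length;
                for (int dst : outLinks[src]) {
                    next[dst] += share;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Hypothetical route map: index 0 is a hub, 1..3 are regional airports.
        String[] airports = {"HUB", "AAA", "BBB", "CCC"};
        int[][] routes = {
            {1},      // HUB -> AAA
            {0},      // AAA -> HUB
            {0},      // BBB -> HUB
            {0, 1}    // CCC -> HUB and CCC -> AAA
        };
        double[] rank = pageRank(routes, 0.85, 100);
        for (int i = 0; i < airports.length; i++) {
            System.out.println(airports[i] + " rank = " + rank[i]);
        }
    }
}
```

Airports that receive flights from many well-connected airports end up with higher ranks, which is the sense in which PageRank can surface the "most important" airports.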

Chapter 12, Real-Time Analytics on Big Data, introduces real-time analytics by first looking at a few examples from the real world. We also learn about the products that are used to build real-time analytics systems on top of big data, in particular Impala, Spark Streaming, and Apache Kafka. Finally, we cover two real-life case studies: building a list of trending videos from data that is generated in real time, and performing sentiment analysis on tweets by simulating a Twitter-like scenario using Apache Kafka and Spark Streaming.

Chapter 13, Deep Learning Using Big Data, discusses the wide range of applications that deep learning has in real life, whether in self-driving cars, disease detection, or speech recognition software. We start with the very basics of what a biological neural network is and how it is mimicked in an artificial neural network. We also cover much of the theory behind artificial neurons and then work through a simple case study of flower species detection using a multi-layer perceptron. We conclude the chapter with a brief introduction to the Deeplearning4j library and a case study on handwritten digit classification using convolutional neural networks.

What you need for this book

There are a few things you will require to follow the examples in this book: a text editor (I use Sublime Text), internet access, admin rights on your machine to install applications and download sample code, and an IDE (I use Eclipse and IntelliJ). You will also need other software such as Java, Maven, Apache Spark, Spark modules, the GraphFrames library, and the JFreeChart library. We mention the required software in the respective chapters.

You also need a computer with plenty of RAM, or you can run the samples on Amazon Web Services (AWS).

Who this book is for

If you already know some Java and understand the principles of big data, this book is for you. It can be used by a developer who has mostly worked on web programming or in any other field to move into the world of analytics using machine learning on big data.

A good understanding of Java and SQL is required. Some understanding of technologies such as Apache Spark, basic graphs, and messaging will also be beneficial.
