Big Data: Algorithms, Analytics, and Applications

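"Big Data: Algorithms, Analytics, and Applications is a specialist book that examines Big Data processing, analysis, and applications in depth, written jointly by experts in the field. The book is divided into five main sections, covering Big Data management, processing, stream techniques and algorithms, privacy protection, and application examples. By presenting recent research results and achievements, it shows how advanced algorithms and analytical strategies can be used in a Big Data setting to uncover patterns in data and thereby gain competitive advantage."

The first section, "Big Data Management," discusses research issues in managing Big Data, including index construction and scalability, to meet the challenges of storing and retrieving massive data; it also includes a scalability and cost evaluation of incremental data processing using Amazon's Hadoop service. The second section, "Big Data Processing," focuses on processing Big Data in a variety of resource-intensive computational settings. The third section, "Big Data Stream Techniques and Algorithms," explores research issues in managing and mining Big Data in streaming environments and offers new ideas and techniques for real-time data processing. The fourth section, "Big Data Privacy," covers models, techniques, and algorithms for protecting Big Data privacy, providing a theoretical basis for analyzing data while preserving privacy. The fifth section, "Big Data Applications," presents practical applications of Big Data in finance, multimedia tools, biometrics, satellite data processing, and other domains.

Topics covered include, but are not limited to: singular value decomposition, clustering, and indexing for similarity search in large data sets; multiple sequence alignment and clustering with genetic algorithms; approaches and challenges for high-performance Big Data processing; the art of scheduling for Big Data science; time–space scheduling in the MapReduce framework; a graph database engine for multithreaded systems; community detection in large-scale networks; approaches for making Big Data transparent to the software developer community; key technologies for Big Data stream computing; streaming algorithms for Big Data processing on multicore architectures; a unified framework for integrating and organizing personal Big Data; managing large-scale trajectory data through online processing of positional streams; personal data protection for Big Data; privacy-preserving Big Data management in OLAP (online analytical processing); and domain-specific applications such as Big Data in finance, semantic-based retrieval of heterogeneous multimedia Big Data, topic modeling for large-scale multimedia analysis and retrieval, Big Data biometrics processing on the Intel Xeon Phi (with iris matching as a case study), and the storage, management, and analysis of large satellite data.

The book not only reports frontier research in Big Data but also gives readers the foundations for exploring this challenging scientific field further, offering guidance for research on next-generation databases, data warehousing, data mining, and cloud computing. It also discusses related applications in different domains and covers technologies such as media/data communication, elastic media/data storage, cross-network media/data fusion, and SaaS (Software as a Service).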

xiv ◾ Foreword by Jack Dongarra

Data environment, the use of statistical significance (the P value) may not always be appropriate. In analytics terms, correlation is not equivalent to causality and normal distribution may not be that normal. An ensemble of multiple models is often used to improve forecasting, prediction, or decision making. Traditional computing problems use static databases, take input from logical and structured data, and run deterministic algorithms. In the Big Data era, a relational database has to be supplemented with other structures due to scalability issues. Moreover, input data are often unstructured and illogical (due to acquisition through cognition, speech, or perception). Due to rapid streaming of incoming data, it is necessary to bring computing to the data acquisition point. Intelligent informatics will use data mining and semisupervised machine learning techniques to deal with the uncertainty factor of the complex Big Data environment. The current book, BDA, has included many of these data-centric methods and analysis techniques.

The editors have assembled an impressive book consisting of 22 chapters, written by 57 authors, from 12 countries across America, Europe, and Asia. The chapters are properly divided into five sections on Big Data: management, processing, stream technologies and algorithms, privacy, and applications. Although the authors come from different disciplines and subfields, their journey is the same: to discover and analyze Big Data and to create value for them, for their organizations and society, and for the whole world. The chapters are well written by various authors who are active researchers or practical experts in the area related to or in Big Data. BDA will contribute tremendously to the emerging new paradigm (the fourth paradigm) of the scientific discovery process and will help generate many new research fields and disciplines such as those in computational x and x-informatics (x can be biology, neuroscience, social science, or history), as Jim Gray envisioned. On the other hand, it will stimulate technology innovation and possibly inspire entrepreneurship. In addition, it will have a great impact on cyber security, cloud computing, and mobility management for public and private sectors.

I would like to thank and congratulate the four editors of BDA (Kuan-Ching Li, Hai Jiang, Laurence T. Yang, and Alfredo Cuzzocrea) for their energy and dedication in putting together this significant volume. In the Big Data era, many institutions and enterprises in the public and private sectors have launched their Big Data strategy and platform. The current book, BDA, is different from those strategies and platforms and focuses on essential Big Data issues, such as management, processing, streaming technologies, privacy, and applications. This book has great potential to provide fundamental insight and privacy to individuals, long-lasting value to organizations, and security and sustainability to the cyber–physical–social ecosystem on the planet.

D. Frank Hsu

Fordham University, New York

Preface

With data being generated at an exponential rate all over the world, Big Data has become an indispensable issue. While organizations are capturing exponentially larger amounts of data than ever these days, they have to rethink and figure out how to digest it. The implicit meaning of data can be interpreted in reality through novel and evolving algorithms, analytics techniques, and innovative and effective use of hardware and software platforms so that organizations can harness the data, discover hidden patterns, and use newly acquired knowledge to act meaningfully for competitive advantages. This challenging vision has attracted a great deal of attention from the research community, which has reacted with a number of proposals focusing on fundamental issues, such as managing Big Data, querying and mining Big Data, making Big Data privacy-preserving, designing and running sophisticated analytics over Big Data, and critical applications, which span a large family of cases, from biomedical (Big) Data to graph (Big) Data, from social networks to sensor and spatiotemporal stream networks, and so forth.

A key observation that inspired our research is that classical management, querying, and mining algorithms, even those developed for very large data sets, are not suitable to cope with Big Data because of both methodological and performance issues. As a consequence, there is an emerging need for devising innovative models, algorithms, and techniques capable of managing and mining Big Data while dealing with their inherent properties, such as volume, variety, and velocity.

Inspired by this challenging paradigm, this book covers fundamental and realistic issues about Big Data, including efficient algorithmic methods to process data, better analytical strategies to digest data, and representative applications in diverse fields such as medicine, science, and engineering, seeking to bridge the gap between huge amounts of data and appropriate computational methods for scientific and social discovery and to bring technologies for media/data communication, elastic media/data storage, cross-network media/data fusion, Software as a Service (SaaS), and others together. It also aims at interesting applications involving Big Data.

According to this methodological vision, this book is organized into five main sections:

• “Big Data Management,” which focuses on research issues related to the effective and efficient management of Big Data, including indexing and scalability aspects.

• “Big Data Processing,” which moves the attention to the problem of processing Big Data in a widespread collection of resource-intensive computational settings, for example, those determined by MapReduce environments, commodity clusters, and data-preponderant networks.

• “Big Data Stream Techniques and Algorithms,” which explores research issues concerning the management and mining of Big Data in streaming environments, a typical scenario in which Big Data are most problematic to deal with; here, the focus is on how to manage Big Data on the fly, with limited resources and approximate computations.

• “Big Data Privacy,” which focuses on models, techniques, and algorithms that aim at making Big Data privacy-preserving, that is, protecting them against privacy breaches that may prevent the anonymity of Big Data in conventional settings (e.g., cloud environments).

• “Big Data Applications,” which, finally, addresses a rich collection of practical applications of Big Data in several domains, ranging from finance applications to multimedia tools, from biometrics applications to satellite (Big) Data processing, and so forth.

In the following, we will provide a description of the chapters contained in the book, according to the previous five sections.

e rst section (i.e., “Big Data Management”) is organized into the following chapters.

Chapter 1, “Scalable Indexing for Big Data Processing,” by Hisham Mohamed and Stéphane Marchand-Maillet, focuses on the K-nearest neighbor (K-NN) search problem, which is the problem of finding and predicting the objects closest and most similar to a given query. It finds many applications in information retrieval and visualization, machine learning, and data mining. The context of Big Data imposes the finding of approximate solutions. Permutation-based indexing is one of the most recent techniques for approximate similarity search in large-scale domains. Data objects are represented by a list of references (pivots), which are ordered with respect to their distances from the object. In this context, the authors show different distributed algorithms for efficient indexing and searching based on permutation-based indexing and evaluate them on big high-dimensional data sets.
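The pivot-ordering idea described for Chapter 1 can be illustrated with a small, single-machine sketch. It is not the chapter's distributed algorithm: the function names, the Euclidean distance, and the use of the Spearman footrule to compare permutations are assumptions chosen for illustration only.

```python
import math

def permutation(obj, pivots):
    """Return pivot indices ordered by increasing distance from obj."""
    return sorted(range(len(pivots)), key=lambda i: math.dist(obj, pivots[i]))

def footrule(p, q):
    """Spearman footrule: total displacement of each pivot between two orderings."""
    pos_q = {pivot: rank for rank, pivot in enumerate(q)}
    return sum(abs(rank - pos_q[pivot]) for rank, pivot in enumerate(p))

def approx_knn(query, data, pivots, k):
    """Rank data objects by how similar their pivot ordering is to the query's.

    The result is approximate: objects are compared through their
    permutations only, never through their true distance to the query.
    """
    q_perm = permutation(query, pivots)
    index = [permutation(x, pivots) for x in data]  # built once in practice
    ranked = sorted(range(len(data)), key=lambda i: footrule(index[i], q_perm))
    return ranked[:k]

# Hypothetical toy data: three pivots and three 2-D objects.
pivots = [(0, 0), (10, 0), (0, 10)]
data = [(1, 2), (9, 1), (1, 9)]
nearest = approx_knn((1, 3), data, pivots, k=2)
```

In a real system the permutations would be computed once at indexing time and stored, so that a query only needs its own permutation plus cheap rank comparisons.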

Chapter 2, “Scalability and Cost Evaluation of Incremental Data Processing Using Amazon’s Hadoop Service,” by Xing Wu, Yan Liu, and Ian Gorton, considers the case of Hadoop that, based on the MapReduce model and Hadoop Distributed File System (HDFS), enables the distributed processing of large data sets across clusters with scalability and fault tolerance. Many data-intensive applications involve continuous and incremental updates of data. Understanding the scalability and cost of a Hadoop platform to handle small and independent updates of data sets sheds light on the design of scalable and cost-effective data-intensive applications. With these ideas in mind, the authors introduce a motivating movie recommendation application implemented in the MapReduce model and deployed on Amazon Elastic MapReduce (EMR), a Hadoop service provided by Amazon. In particular, the authors present the deployment architecture with implementation details of the Hadoop application. With metrics collected by Amazon CloudWatch, they present an empirical scalability and cost evaluation of the Amazon Hadoop service on processing continuous and incremental data streams. The evaluation result highlights the potential of autoscaling for cost reduction on Hadoop services.

Chapter 3, “Singular Value Decomposition, Clustering, and Indexing for Similarity Search for Large Data Sets in High-Dimensional Spaces,” by Alexander Thomasian, addresses a popular paradigm, that is, representing objects such as images by their feature vectors and searching for similarity according to the distances of the points representing them in high-dimensional space via K-nearest neighbors (K-NNs) to a target image. The author discusses a combination of singular value decomposition (SVD), clustering, and indexing to reduce the cost of processing K-NN queries for large data sets with high-dimensional data. The chapter first reviews dimensionality reduction methods with emphasis on SVD and related methods, followed by a survey of clustering and indexing methods for high-dimensional numerical data. The author describes combining SVD and clustering as a framework and the main memory-resident ordered partition (OP)-tree index to speed up K-NN queries. Finally, the chapter discusses techniques to save the OP-tree on disk and specifies the stepwise dimensionality increasing (SDI) index suited for K-NN queries on dimensionally reduced data.

Chapter 4, “Multiple Sequence Alignment and Clustering with Dot Matrices, Entropy, and Genetic Algorithms,” by John Tsiligaridis, presents a set of algorithms and their efficiency for Multiple Sequence Alignment (MSA) and clustering problems, including solutions in distributed environments with Hadoop. The strength, the adaptability, and the effectiveness of the genetic algorithms (GAs) for both problems are pointed out. MSA is among the most important tasks in computational biology. In biological sequence comparison, emphasis is given to the simultaneous alignment of several sequences. GAs are stochastic approaches for efficient and robust search that can play a significant role for MSA and clustering. The divide-and-conquer principle ensures undisturbed consistency during vertical sequences’ segmentations. Indeed, the divide-and-conquer method (DCGA) can provide a solution for MSA utilizing appropriate cut points. As far as clustering is concerned, the aim is to divide the objects into clusters so that the error within each cluster is minimized. As an internal measure for cluster validity, the sum of squared error (SSE) is used. A clustering genetic algorithm with the SSE criterion (CGA_SSE), a hybrid approach using the most popular algorithm, the K-means, is presented. The CGA_SSE combines local and global search procedures. Comparison of the K-means and CGA_SSE is provided in terms of the accuracy and quality of the solution for clusters of different sizes and densities. The complexity of all proposed algorithms is examined. Hadoop, for the distributed environment, provides an alternative solution to the CGA_SSE, following the MapReduce paradigm. Simulation results are provided.
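To make the SSE criterion mentioned for Chapter 4 concrete, the following is a minimal sketch of plain K-means together with an SSE function. It does not implement the genetic algorithm (CGA_SSE) itself; the function names, the fixed iteration count, and the toy points are assumptions for illustration.

```python
import math

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]

def sse(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[c]) ** 2 for p, c in zip(points, labels))

def kmeans(points, centroids, iterations=10):
    """Plain K-means: alternate assignment and centroid update."""
    centroids = list(centroids)
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return centroids, assign(points, centroids)

# Hypothetical toy data: two well-separated pairs of points.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
start = [(0, 0), (10, 0)]
sse_before = sse(points, start, assign(points, start))
centroids, labels = kmeans(points, start)
sse_after = sse(points, centroids, labels)
```

K-means only reaches a local minimum of the SSE that depends on the starting centroids, which is one reason a hybrid with a global search such as a GA, as the chapter summary describes, can be attractive.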

The second section (i.e., “Big Data Processing”) is organized into the following chapters.

Chapter 5, “Approaches for High-Performance Big Data Processing: Applications and Challenges,” by Ouidad Achahbar, Mohamed Riduan Abid, Mohamed Bakhouya, Chaker El Amrani, Jaafar Gaber, Mohammed Essaaidi, and Tarek A. El Ghazawi, puts emphasis on social media websites, such as Facebook, Twitter, and YouTube, and job posting websites like LinkedIn and CareerBuilder, which involve a huge amount of data that are very useful for economy assessment and society development. These sites provide sentiments and interests of people connected to web communities and a lot of other information. The Big Data collected from the web is considered an unprecedented source to fuel data processing and business intelligence. However, collecting, storing, analyzing, and processing these Big Data as quickly as possible creates new challenges for both scientists and analysts. For example, analyzing Big Data from social media is now widely accepted by many companies as a way of testing the acceptance of their products and services based on customers’ opinions. Opinion mining or sentiment analysis methods have been recently proposed for extracting positive/negative words from Big Data. However, highly accurate and timely processing and analysis of the huge amount of data to extract their meaning requires new processing techniques. More precisely, a technology is needed to deal with the massive amounts of unstructured and semistructured information in order to understand hidden user behavior. Existing solutions are time consuming given the increase in data volume and complexity. It is possible to use high-performance computing technology to accelerate data processing through MapReduce ported to cloud computing. This will allow companies to deliver more business value to their end customers in the dynamic and changing business environment. This chapter discusses approaches proposed in literature and their use in the cloud for Big Data analysis and processing.
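The combination of opinion-word extraction and MapReduce mentioned for Chapter 5 can be sketched in a few lines. This is a single-process imitation of the map, shuffle, and reduce phases, not Hadoop code; the tiny word lists and all function names are hypothetical and stand in for a real sentiment lexicon.

```python
from collections import defaultdict

# Hypothetical miniature lexicons; a real system would use a curated one.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "poor", "hate"}

def map_phase(post):
    """Emit ("positive", 1) or ("negative", 1) for each opinion word in a post."""
    for word in post.lower().split():
        token = word.strip(".,!?")
        if token in POSITIVE:
            yield ("positive", 1)
        elif token in NEGATIVE:
            yield ("negative", 1)

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)

def count_sentiment(posts):
    pairs = (pair for post in posts for pair in map_phase(post))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

posts = ["Great phone, love it!", "Bad battery.", "good screen but poor sound"]
totals = count_sentiment(posts)
```

Because each post is mapped independently and each key is reduced independently, both phases can be spread over many machines, which is what makes this model a fit for the data volumes the chapter summary describes.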

Chapter 6, “e Art of Scheduling for Big Data Science,” by Florin Pop and Valentin

Cristea, moves the attention to applications that generate Big Data, like social networking

and social inuence programs, cloud applications, public websites, scientic experiments

and simulations, data warehouses, monitoring platforms, and e-government services. Data

grow rapidly, since applications produce continuously increasing volumes of both unstruc-

tured and structured data. e impact on data processing, transfer, and storage is the need

to reevaluate the approaches and solutions to better answer user needs. In this context,

scheduling models and algorithms have an important role. A large variety of solutions for

specic applications and platforms exist, so a thorough and systematic analysis of exist-

ing solutions for scheduling models, methods, and algorithms used in Big Data processing

and storage environments has high importance. is chapter presents the best of existing

solutions and creates an overview of current and near-future trends. It highlights, from

a research perspective, the performance and limitations of existing solutions and oers

an overview of the current situation in the area of scheduling and resource management

related to Big Data processing.

Chapter 7, “Time–Space Scheduling in the MapReduce Framework,” by Zhuo Tang, Ling Qi, Lingang Jiang, Kenli Li, and Keqin Li, focuses on the significance of Big Data, that is, analyzing people’s behavior, intentions, and preferences in the growing and popular social networks and, in addition to this, processing data with nontraditional structures and exploring their meanings. Big Data is often used to describe a company’s large amount of unstructured and semistructured data. Using analysis to create these data in a relational database for downloading will require too much time and money. Big Data analysis and cloud computing are often linked together because real-time analysis of large data requires a framework similar to MapReduce to assign work to hundreds or even thousands of computers. After several years of criticism, questioning, discussion, and speculation, Big Data finally ushered in the era belonging to it. Hadoop presents MapReduce as an analytics
