data-analysis methods that can be applied to ever-larger data sets.
However, such optimism must be tempered by an understanding of the major
difficulties that arise in attempting to achieve the envisioned goals. In part,
these difficulties are those familiar from implementations of large-scale
databases—finding and mitigating bottlenecks, achieving simplicity and
generality of the programming interface, propagating metadata, designing
a system that is robust to hardware failure, and exploiting parallel and
distributed hardware—all at an unprecedented scale. But the challenges
for massive data go beyond the storage, indexing, and querying that have
been the province of classical database systems (and classical search
engines) and, instead, hinge on the ambitious goal of inference. Inference
is the problem of turning data into knowledge, where knowledge often is
expressed in terms of entities that are not present in the data per se but
are present in models that one uses to interpret the data. Statistical rigor is
necessary to justify the inferential leap from data to knowledge, and many
difficulties arise in attempting to bring statistical principles to bear on
massive data. Overlooking this foundation may yield results that are, at
best, not useful and, at worst, harmful. In any discussion of massive data
and inference, it is essential to be aware that it is quite possible to
turn data into something that resembles knowledge but actually is not.
Moreover, it can be quite difficult to know that this has happened.
Indeed, many issues impinge on the quality of inference. A major one
is “sampling bias.” Data may have been collected according to a certain
criterion (for example, in a way that favors “larger” items over
“smaller” items), but the inferences and decisions made may refer to a
different sampling criterion. This issue seems likely to be particularly
severe in many massive data sets, which often consist of many
subcollections of data, each collected according to a particular choice
of sampling criterion and with little control over the overall
composition.
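
As a purely illustrative sketch of this effect, suppose that items enter
a data set with probability proportional to their size, while the target
of inference is the mean over all items. The sampling scheme and the
inverse-propensity reweighting below are assumptions made for the sake of
the example, not methods prescribed by this report.

    # Hypothetical sketch: items are recorded with probability
    # proportional to their size, so the naive sample mean is biased
    # upward relative to the population mean. Reweighting each record by
    # the inverse of its (assumed known) selection propensity corrects it.
    import random

    random.seed(0)
    population = [random.uniform(1.0, 10.0) for _ in range(100_000)]
    true_mean = sum(population) / len(population)

    # Size-biased collection: an item of size x is kept with probability
    # proportional to x.
    biggest = max(population)
    sample = [x for x in population if random.random() < x / biggest]

    naive_mean = sum(sample) / len(sample)

    # Inverse-propensity weights w = 1/x; the weighted mean is consistent
    # for the population mean under the assumed sampling scheme.
    weights = [1.0 / x for x in sample]
    weighted_mean = sum(w * x for w, x in zip(weights, sample)) / sum(weights)

    print(f"population mean : {true_mean:.3f}")
    print(f"naive mean      : {naive_mean:.3f}  (biased upward)")
    print(f"reweighted mean : {weighted_mean:.3f}")

Mismatches between the collection criterion and the inference criterion
that go unmodeled in this way cannot be corrected after the fact.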
Another major issue is “provenance.” Many systems involve layers of
inference, where “data” are not the original observations but are the
products of an inferential procedure of some kind. This often occurs,
for example, when there are missing entries in the original data. In a
large system involving interconnected inferences, it can be difficult to
avoid circularity, which can introduce additional biases and can amplify
noise.
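
To make the hazard concrete, the following hypothetical sketch (the
missingness rate and the mean-imputation step are illustrative choices,
not ones drawn from this report) shows how treating imputed values as raw
observations understates variability in everything computed downstream.

    # Hypothetical sketch: 40 percent of the entries are missing; a
    # mean-imputation step fills them in, and the "completed" column is
    # then handed downstream with no record of which values were
    # inferred. Its sample standard deviation understates the truth.
    import random
    import statistics

    random.seed(1)
    n = 10_000
    truth = [random.gauss(0.0, 1.0) for _ in range(n)]

    observed = [x for x in truth if random.random() < 0.6]
    fill_value = statistics.mean(observed)

    # Layer of inference: impute, then lose the provenance of the entries.
    completed = observed + [fill_value] * (n - len(observed))

    print(f"std. dev., observed entries : {statistics.stdev(observed):.3f}")
    print(f"std. dev., after imputation : {statistics.stdev(completed):.3f}")
    # The second figure is markedly smaller (roughly sqrt(0.6) times the
    # first), so downstream standard errors and confidence intervals
    # computed from the completed data will be overconfident.
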
Finally, there is the major issue of controlling error rates when many
hypotheses are being considered. Indeed, massive data sets generally
involve growth not merely in the number of individuals represented (the
“rows” of the database) but also in the number of descriptors of those
individuals (the “columns” of the database). Moreover, we are often
interested in the predictive ability associated with combinations of the
descriptors; this can lead to exponential growth in the number of
hypotheses considered, with severe consequences for error rates. That is,
a naive appeal to a “law of large numbers” for massive data is unlikely
to be justified.
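
A hypothetical simulation along the following lines (the family size, the
significance level, and the Bonferroni adjustment are assumptions made
for illustration) shows why per-hypothesis error control breaks down at
this scale.

    # Hypothetical sketch: 100,000 null hypotheses are tested on pure
    # noise, so every rejection is a false positive. A per-test level of
    # 0.05 yields roughly 5,000 spurious "discoveries"; a Bonferroni
    # correction, which tests each hypothesis at level alpha / m,
    # controls the chance of even one false positive in the family.
    import random
    from statistics import NormalDist

    random.seed(2)
    m = 100_000
    z = [random.gauss(0.0, 1.0) for _ in range(m)]  # null test statistics

    alpha = 0.05
    norm = NormalDist()
    per_test_cut = norm.inv_cdf(1 - alpha / 2)         # about 1.96
    bonferroni_cut = norm.inv_cdf(1 - alpha / (2 * m))

    naive_hits = sum(abs(s) > per_test_cut for s in z)
    bonferroni_hits = sum(abs(s) > bonferroni_cut for s in z)

    print(f"hypotheses tested           : {m}")
    print(f"false positives, per-test   : {naive_hits}")
    print(f"false positives, Bonferroni : {bonferroni_hits}")

The corrected threshold keeps the family-wise error rate near alpha, but
at the cost of power, which is one reason that error control for massive
data remains a genuine research challenge rather than a solved problem.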