irreproducible results. Thus, big data analytics offers tremendous opportunities
but also carries numerous potential pitfalls, said Daniels.
With such abundant, messy, and complex data, “statistical principles could hardly
be more important,” concluded Hogan.
Andrew Nobel cautioned that “big data isn’t necessarily the right data” for
answering a specific question. He alluded to the fundamental importance of
defining the question of interest and assessing the suitability of the available data to
support inferences about that question. Across the 2-day workshop, there was
notable variety in the inferential tasks described; for example, Sebastien Haneuse
described a comparative effectiveness study of two antidepressants to draw
inferences about differential effects on weight gain, whereas Daniela Witten described
the use of inferential tools to aid in scientific discovery. Some presenters remarked
that big data may tempt analysts to lean too heavily on exploratory analyses to define
research questions while underemphasizing the fundamental issues of data suitability and bias.
Understanding bias is particularly important with large, complex data sets such as
EHRs, explained Daniels, as analysts may not have control over sample selection,
among other sources of bias. Alfred Hero explained that when working with large
data sets that contain information on many diverse variables, quantifying bias and
understanding the conditions necessary for replicability can be particularly
challenging. Haneuse encouraged researchers using EHRs to compare the available
data with the data that an ideal randomized trial would have produced, as a strategy
for defining missing data and exploring selection bias (a simple diagnostic along
these lines is sketched below). More broadly, when analyses of big data
are used for scientific discovery, to help form scientific conclusions, or to inform
decision making, statistical reasoning and inferential formalism are required.
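To make Haneuse's benchmark concrete: under an ideal randomized trial, baseline
covariates would be balanced across treatment arms, so large imbalances in an
observational EHR cohort flag potential selection bias or confounding. The sketch
below is illustrative rather than anything presented at the workshop; it computes
standardized mean differences on a hypothetical EHR extract, and all column names
and data are assumptions.

```python
import numpy as np
import pandas as pd

def standardized_mean_difference(x_treated, x_control):
    """SMD is ~0 under ideal randomization; values above ~0.1 suggest imbalance."""
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

# Hypothetical EHR extract: a treatment indicator plus baseline covariates.
rng = np.random.default_rng(0)
n = 1000
ehr = pd.DataFrame({
    "treated": rng.integers(0, 2, size=n),
    "age": rng.normal(55, 12, size=n),
    "baseline_bmi": rng.normal(28, 5, size=n),
})
# Mimic non-random treatment assignment: treated patients skew older.
ehr.loc[ehr["treated"] == 1, "age"] += 4

for cov in ["age", "baseline_bmi"]:
    smd = standardized_mean_difference(
        ehr.loc[ehr["treated"] == 1, cov],
        ehr.loc[ehr["treated"] == 0, cov],
    )
    print(f"{cov}: SMD = {smd:.2f}")
```

Covariates with large standardized mean differences mark dimensions along which
the observed cohort departs from the randomized ideal and therefore warrant
explicit adjustment or sensitivity analysis.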
Inference Requires Evaluating Uncertainty
Many workshop presenters described significant advances made in
developing algorithms and methods for analyzing large, complex data sets. However, a
recurring topic of discussion was that most work to date stops short of formally
assessing the uncertainty associated with the predictions or comparisons made
with big data (as mentioned in the presentations by Michael Daniels, Alfred Hero,
Genevera Allen, Daniela Witten, Michael Kosorok, and Bin Yu). For example, data
mining algorithms that generate network structures representing a snapshot of
complex genetic processes are of limited value without some understanding of the
reliability of the nodes and edges identified, which in this case correspond to
specific genes and potential regulatory relationships, respectively. In an applied setting,
Allen and Witten suggested applying several estimation techniques to a single data
set, and likewise applying a single estimation technique to random subsamples of
the observations. In practice, results that hold up across estimation techniques and
across subsamples of the data are more likely to be scientifically useful (a brief
sketch of both checks follows).
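As a rough illustration of these stability checks, the sketch below runs two
different sparse estimators on one data set and a single estimator on repeated
random subsamples, keeping only the features selected in every case. The
estimators, regularization settings, threshold, and simulated data are all
illustrative assumptions, not methods endorsed at the workshop.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n)  # two true signals

def selected(estimator, X, y, tol=1e-6):
    """Return the indices of features with non-negligible coefficients."""
    coef = estimator.fit(X, y).coef_
    return {j for j, c in enumerate(coef) if abs(c) > tol}

# Check 1: do different estimation techniques agree on one data set?
across_methods = selected(Lasso(alpha=0.1), X, y) & selected(
    ElasticNet(alpha=0.1, l1_ratio=0.5), X, y
)

# Check 2: does one technique agree with itself across random subsamples?
across_subsamples = set(range(p))
for _ in range(20):
    idx = rng.choice(n, size=n // 2, replace=False)
    across_subsamples &= selected(Lasso(alpha=0.1), X[idx], y[idx])

stable = across_methods & across_subsamples
print("features stable across methods and subsamples:", sorted(stable))
```

Retaining only the features that survive both checks is a deliberately conservative
screen; in practice one might instead report how frequently each feature is selected
across subsamples. While this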