机器学习驱动的材料发现：征服数据稀缺与质量难题

版权申诉

175 浏览量更新于2024-07-06 1 收藏 6.6MB PDF 举报

"这篇文档是关于在计算材料发现领域中，如何使用机器学习技术克服数据稀缺和数据质量问题的研究。作者Aditya Nandy、Chenru Duan和Heather J. Kulik探讨了如何在面临数据量不足和数据质量参差不齐的情况下，通过创新方法推进机器学习的应用。他们隶属于麻省理工学院的化学工程和化学部门。文章指出，机器学习加速材料发现需要大量高精度的数据来建立结构-属性关系模型，但在实际中，由于数据获取的困难和高昂成本，导致可用数据量有限且质量难以保证。" 正文: 在计算材料发现的领域，机器学习(ML)已经成为一种强大的工具，它能够快速挖掘复杂的材料特性并预测新材料的性能。然而，"数据稀缺"和"数据质量"是阻碍这一进程的两大主要障碍。这篇论文《在计算材料发现的机器学习中，大胆地克服数据稀缺和数据质量的挑战》深入探讨了这些问题，并提出了一些应对策略。首先，数据稀缺是材料科学中的一大难题。实验获取新材料的数据通常需要大量的时间和资源，而且每种材料的测试可能都需要特定的条件和设备。因此，数据集往往小而分散，不足以支持有效的机器学习训练。为了解决这个问题，研究者们提倡采用多源数据融合，将来自不同实验、模拟或文献的数据整合在一起，扩大数据集的规模。此外，利用半监督学习和强化学习等方法，可以从少量标记数据中学习更多的模式。其次，数据质量问题不容忽视。数据的准确性和一致性对模型的预测能力至关重要。论文中提到，可以通过建立数据验证和清洗流程，以及利用异常检测技术来识别和处理错误或不一致的数据。同时，使用数据增强技术，如合成数据生成，可以增加数据多样性，提高模型的泛化能力。为了克服这些挑战，研究者们提出了使用共识学习策略。这种方法涉及训练多个模型，每个模型可能基于不同的数据源或使用不同的算法，然后通过比较它们的预测结果来达到更高的准确性。此外，密度 functional theory (DFT) 函数的组合使用也被认为是一种有效的方法，它可以在一定程度上弥补数据的不足，提供更可靠的结果。最后，论文强调了云计算和人工智能在处理大数据和复杂计算中的作用。云计算提供了弹性计算资源，可以处理大规模数据集的存储和处理需求，而人工智能则能自动化数据预处理和模型优化过程，进一步提高效率。这篇论文展示了如何在数据稀缺和质量不佳的环境下，通过机器学习和相关技术的创新应用，推动计算材料发现领域的进步。通过集成各种策略和工具，研究人员能够更好地利用有限的数据，挖掘出潜在的结构-属性关系，从而加速新材料的发现和开发。

energies comparable to a DFA, they lacked the error cancellation present in a standard DFA for

predicting relative properties, suggesting caution in applying data-driven models blindly in

materials discovery.

Figure 2. Machine learning approaches addressing fidelity limitations in quantum chemistry. a)

uniform manifold approximation and projection (UMAP) visualization of SCO complexes from

187 200 TMCs reported by Duan et al.[10]: the entire design space (gray), leads predicted by a

single NN trained on B3LYP data (red), by the consensus approach of NNs trained on 23 DFAs

(blue), and experimental observation (green) with approximate convex hulls shown as solid lines.

b) Performance of MR/SR classification on a set of 3165 equilibrium or distorted organic

molecules using different methods reported by Duan et al.[13]: k-means clustering (red), a

cutoff-based approach (green), and a semi-supervised learning method named virtual adversarial

training (VAT, blue). c) Schematic of a fully differentiable KS-DFT framework reported by

Kasim et al.,[14] with NNs representing a trainable exchange-correlation functional that yields

both the electron density and energy. d) Illustration of the network architecture of SchNOrb

developed by Schutt et al.[15], starting from initial representations of atom types and positions

(top), continuing with the construction of representations of chemical environments of atoms and

atom pairs (middle) before using these to predict energy and Hamiltonian matrix respectively

(bottom). Reproduced with permission from studies reported by Duan et al. [10] and [13], Kasim

et al.[14], and Schutt et al. [15], published by The Royal Society of Chemistry 2021, the

American Chemical Society 2020, the American Physical Society 2021, and the Nature

Publishing Group 2019, respectively.

Data-driven methods have augmented and supplanted the conventional approaches of

trial and error or local fitting of parameters to revisit the search for a universal exchange-

剩余22页未读，继续阅读

易小侠

粉丝: 6646

机器学习驱动的材料发现：征服数据稀缺与质量难题

FFmpeg_v2.2.2_for_Audacity_on_Windows，Audacity导致ffmpeg库

audacity-win-unicode-1.3.4.zip_audacity_audacity win_audacity-w_

audacity-minsrc-1.3.11.zip_audacity_audacity-src

audacity-src-1.3.3.tar.gz_ Audacity _audacity_音频_音频 源码_音频编辑

audacity-src.rar_audacity_beyonde9h_数字音频波形的程序 Audacity

audacityframe.rar_audacity

audacity-src-1.2.3.tar.gz_Free!_audacity_audio editor_free audio

音频隐写_audacity_win_V2.1.0.0_setup.1442397077.exe.7z

Audacity_1.3.2

Audacity_2.0.3

最新资源

audacity-src-1.3.3.tar.gz_ Audacity _audacity_音频_音频源码_音频编辑