3.2. Brief History of Deep Learning on Tabular Data
Tabular data are the oldest form of data. Before the digital collection of text, images, and sound became possible, almost all data were tabular. Therefore, tabular data were the target of early machine learning research. However, deep neural networks became popular in the digital age and were further developed with a focus on homogeneous data. In recent years, various supervised, self-supervised, and semi-supervised deep learning approaches have been proposed that explicitly address the problem of modeling tabular data again. Early works mostly focused on data transformation techniques for preprocessing (Giles et al., 1992; Horne and Giles, 1995; Willenborg and De Waal, 1996), which are still important today (Hancock and Khoshgoftaar, 2020).
A huge stimulus was the rise of e-commerce, which demanded novel solutions, especially in advertising (Richardson et al., 2007; Guo et al., 2017). These tasks required fast and accurate estimation on heterogeneous data sets with many categorical variables, for which traditional machine learning approaches are not well suited (e.g., categorical features with high cardinality can lead to very sparse, high-dimensional feature vectors and non-robust models). As a result, researchers and data scientists started looking for more flexible solutions, e.g., based on deep neural networks, that can capture complex non-linear dependencies in the data.
In particular, the click-through rate prediction problem
has received a lot of attention (Guo et al., 2017; Ke et al.,
2019; Wang et al., 2021). A large variety of approaches
were proposed, most of them relying on specialized neural
network architectures for heterogeneous tabular data. The
most important methods for click-through rate estimation are
included in our survey.
A newer line of research evolved based on the idea that regularization may improve the performance of deep neural networks on tabular data (Kadra et al., 2021). The idea was sparked by Shavitt and Segal (2018) and has led to intensified research on regularization approaches.
Due to the tremendous success of attention-based approaches such as transformers on textual (Brown et al., 2020) and visual data (Dosovitskiy et al., 2021; Khan et al., 2021), researchers have recently started applying attention-based methods and self-supervised learning techniques to tabular data. After the first and most influential work by Arik and Pfister (2019) raised research interest, transformers have quickly gained popularity, especially for large tabular data sets.
3.3. Challenges of Learning With Tabular Data
As mentioned above, deep neural networks are usually
inferior to more traditional (e.g., linear or tree-based) machine
learning methods when dealing with tabular data. However,
it is often unclear why deep learning cannot achieve the
same level of predictive quality as in other domains such
as image classification and natural language processing. In
the following, we identify and discuss four possible reasons:
1. Inappropriate Training Data: Data quality is a common issue for real-world tabular data sets. They often include missing values (Sánchez-Morales et al., 2020), extreme data (outliers) (Pang et al., 2021), and erroneous or inconsistent data (Karr et al., 2006), and they often have a small overall size relative to the high-dimensional feature vectors generated from the data (Xu and Veeramachaneni, 2018). Also, due to the expensive nature of data collection, tabular data are frequently class-imbalanced.
2. Missing or Complex Irregular Spatial Dependencies: There is often no spatial correlation between the
variables in tabular data sets (Zhu et al., 2021), or the
dependencies between features are rather complex and
irregular. Thus, the inductive biases used in popular
models for homogeneous data, such as convolutional
neural networks, are unsuitable for modeling this data
type (Katzir et al., 2021; Rahaman et al., 2019; Mitchell
et al., 2017).
3. Extensive Preprocessing: One of the main challenges when working with tabular data is how to handle categorical features (Hancock and Khoshgoftaar, 2020). In most cases, the first step is to convert the categories into a numerical representation, for example, using a simple one-hot or ordinal encoding scheme. However, since categorical features may have many distinct values (high cardinality), one-hot encoding can produce a very sparse, high-dimensional feature matrix (the curse of dimensionality), while ordinal encoding imposes a synthetic ordering on values that are actually unordered; a minimal sketch contrasting the two encodings follows this item. Hancock and Khoshgoftaar (2020) have analyzed different embedding techniques for categorical variables. Dealing with categorical features is also one of the main aspects we discuss in Section 4.
Applications that work with homogeneous data have
effectively used data augmentation (Perez and Wang,
2017), transfer learning (Tan et al., 2018) and test-
time augmentation (Shanmugam et al., 2020). For
heterogeneous tabular data, these techniques are often
difficult to apply. However, some frameworks for
learning with tabular data, such as VIME (Yoon et al.,
2020) and SAINT (Somepalli et al., 2021), use data
augmentation strategies in the embedding space.
Lastly, note that we often lose information with respect
to the original data when applying preprocessing methods for deep neural networks, leading to a reduction in
predictive performance (Fitkov-Norris et al., 2012).
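To make the encoding trade-off described in item 3 concrete, the following minimal sketch (illustrative only, not taken from any of the surveyed works) uses scikit-learn on a hypothetical categorical column; the column name, its values, and the printed shapes are assumptions for illustration.

    # Minimal sketch: one-hot vs. ordinal encoding of a categorical feature.
    # The toy "city" column is hypothetical; it only illustrates the
    # sparsity / synthetic-ordering trade-off discussed in item 3 above.
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    X = [["Berlin"], ["Tokyo"], ["Lima"], ["Oslo"], ["Tokyo"]]

    # One-hot: one column per category; with thousands of categories this
    # yields a very sparse, high-dimensional feature matrix.
    onehot = OneHotEncoder(handle_unknown="ignore")
    X_onehot = onehot.fit_transform(X)   # sparse matrix of shape (5, 4)
    print(X_onehot.shape)

    # Ordinal: a single integer column, but the alphabetical mapping
    # (Berlin=0, Lima=1, Oslo=2, Tokyo=3) is a synthetic order with no
    # real-world meaning that a model may nevertheless exploit.
    ordinal = OrdinalEncoder()
    X_ordinal = ordinal.fit_transform(X)
    print(X_ordinal.ravel())             # [0. 3. 1. 2. 3.]

Learned embedding layers, one of the techniques analyzed by Hancock and Khoshgoftaar (2020) and revisited in Section 4, sit between these two extremes by mapping each category to a dense, trainable vector.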
4. Model Sensitivity: Deep neural networks can be extremely sensitive to tiny perturbations of the input data (Szegedy et al., 2013; Levy et al., 2020). The smallest
possible change of a categorical (or binary) feature
might already have a large impact on the prediction.
This is usually less problematic for homogeneous
(continuous) data sets.
In contrast to deep neural networks, decision-tree algorithms can handle perturbations exceptionally well by selecting a feature and threshold value and "ignoring"