Deep Learning and Missing Data in Engineering Systems: New Advances and Applications
Updated on 2024-07-17
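Deep Learning and Missing Data in Engineering Systems is a volume in the Studies in Big Data series, whose series editor is Janusz Kacprzyk of the Polish Academy of Sciences in Warsaw. The book examines how deep learning techniques can handle missing values in large-scale, complex data in the context of engineering systems. As the big data era advances, engineering systems produce ever larger and more varied data from sensors, simulations, social networks and other channels, and these data include both physical measurements and records of user behaviour.
Deep learning is an important branch of computational intelligence, a field that also spans neural networks, evolutionary computation, soft computing and fuzzy systems. It shows strong potential on large-scale data, particularly in areas such as image recognition and natural language processing. In practice, however, data are often missing for reasons such as equipment failure or user privacy protection. This complicates data analysis and model training, because missing data can reduce model accuracy and degrade predictive performance.
In the book, the authors Collins Achepsah Leke and Tshilidzi Marwala are likely to explore the following key points:
1. **Identifying and measuring missing data**: They first introduce ways to identify missing values in engineering system data, covering types such as missing at random, missing not at random (patterned or structural missingness) and incomplete observation, and discuss the statistics and algorithms used to gauge data quality.
2. **Deep learning strategies for handling missing data**: They may discuss strategies such as filling (mean, median, regression prediction), interpolation (linear, polynomial, KNN), model-based methods (such as MICE and the EM algorithm), nearest-neighbour methods (such as KNN) and deep learning's own autoencoders, as applied within a deep learning framework.
3. **Robustness and improvement of deep learning models**: How to design and tune deep learning models so that they remain stable and accurate when data are missing is probably the core of the work. This may include regularization techniques, ensemble learning methods, or deep learning architectures designed specifically for missing data.
4. **Empirical analysis and case studies**: The book may use cases from real engineering systems to show how deep learning performs on missing data, evaluate the performance of different strategies, and discuss their suitability in different scenarios.
5. **Future trends and challenges**: Finally, the authors may discuss the potential limitations of deep learning for missing data, such as the risk of overfitting and computational cost, and how to combine it with other preprocessing techniques (such as data cleaning and feature selection) to improve the overall solution.
Deep Learning and Missing Data in Engineering Systems offers researchers and practitioners in engineering an important perspective on how deep learning can cope with missing values in complex engineering data, which matters for the accuracy and reliability of data-driven decisions.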
1.3 Missing Data Proportions
Missing data in datasets influences the analysis, inferences and conclusions reached
based on the information. The impact on machine learning algorithm performance
becomes more significant with an increase in the proportion of missing data in the
dataset. Researchers have shown that the impact on machine learning algorithms is
not as significant when the proportion of missing data is small in large-scale datasets
(Ramoni and Sebastiani 2001; Tremblay et al. 2010; Polikar et al. 2010). This could
be attributed to the fact that certain machine learning algorithms inherently possess
frameworks to cater to certain proportions of missing data. With an increase in
missing data proportions, for example cases where the proportion is greater than 25%,
it is observed that tolerance and performance levels of machine learning algorithms
decrease significantly (Twala 2009). It is because of these reduced levels in tolerance
and performance that more complex and reliable approaches to solve the problem of
missing data are required.
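
As a rough illustration of the proportions discussed above, the following sketch measures how much of a small dataset is missing, overall and per feature variable. It assumes missing entries are stored as `None`; the function names and the sample values are hypothetical, and the 25% level is only the example figure quoted in the text, not a universal cut-off.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]

# Example level quoted in the text (Twala 2009); treated here as an assumption.
HIGH_MISSINGNESS = 0.25


def missing_proportion(rows: Sequence[Row]) -> float:
    """Fraction of all cells in the dataset that are missing (None)."""
    total = sum(len(row) for row in rows)
    missing = sum(value is None for row in rows for value in row)
    return missing / total if total else 0.0


def column_missing_proportions(rows: Sequence[Row]) -> List[float]:
    """Fraction of missing entries in each feature variable (column)."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    return [sum(row[j] is None for row in rows) / n_rows for j in range(n_cols)]


# Hypothetical 4 x 4 dataset; None marks a missing entry.
data = [
    [0.38, None, 0.20, 0.19],
    [0.69, 0.11, 0.08, None],
    [0.17, 0.79, None, 0.53],
    [None, 0.24, 0.15, None],
]

overall = missing_proportion(data)
per_column = column_missing_proportions(data)
# Columns whose missing proportion exceeds the example 25% level.
flagged = [j for j, p in enumerate(per_column) if p > HIGH_MISSINGNESS]

print(overall, per_column, flagged)
```

Reporting the proportion per feature variable as well as overall helps show whether the missingness is spread evenly or concentrated in a few columns.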
1.4 Missing Data Mechanisms
As stated before, it is vital to identify the reason why data are missing. When
the explanation is known, a suitable method for missing data imputation may then be
chosen or derived, resulting in higher effectiveness and prediction accuracy. In many
situations, data collectors may be aware of such reasons, whereas statisticians
and data users may not have that information available to them when they perform
the analysis. In such scenarios, data users may have to use other techniques that
help them comprehend how missing data are related to observed
data, from which possible reasons may be derived. A variable or feature in
the dataset is viewed as a mechanism if it helps to explain why other variables
are missing or not missing. In datasets collected through surveys, variables that are
mechanisms are frequently associated with details that people are embarrassed to
divulge. However, such information can often be derived from the data that have
been given. As an example, low-income people may be embarrassed to disclose their
income but may disclose their highest level of education. Data users may then use
the supplied educational information to acquire an insight into the income. Let us
assume that the variable $Y$ is the complete dataset, then:

$$Y = \{Y_o, Y_m\}.$$

Here $Y_o$ is the observed component of $Y$ while $Y_m$ is the missing component of $Y$.
Any scenario whereby certain or all feature variables within a dataset have missing
data entries or contain data entries which are not exactly characterized within the
bounds of the problem domain is termed missing data (Rubin 1978). The presence
of missing data leads to several issues in a variety of sectors that depend on the
availability of complete and quality data. This has resulted in different methods
being introduced with their aim being to address the missing data problem in varying
disciplines (Rubin 1978; Allison 2000). Handling missing data in an acceptable way
4 1 Introduction to Missing Data Estimation
is dependent upon the nature of the missingness. There are currently four missing data
mechanisms in the literature and these are missing completely at random (MCAR),
missing at random (MAR), a non-ignorable case or missing not at random (MNAR)
and missing by natural design (MBND).
1.4.1 Missing Completely at Random (MCAR)
The MCAR case is observed when the possibility of a feature variable having missing
data entries is independent of the feature variable itself or of any of the other feature
variables within the dataset. Essentially, this means that the missing data entry does
not depend on the feature variable being considered or any of the other feature
variables in the dataset. This relationship is expressed mathematically as (Little and
Rubin 2014):

$$P(M \mid Y_o, Y_m) = P(M) \tag{1.1}$$

where $M \in \{0, 1\}$ is the missing data indicator, with $M = 1$ if $Y$ is
known and $M = 0$ if $Y$ is unknown/missing. $Y_o$ represents the observed values in
$Y$ while $Y_m$ represents the missing values of $Y$. From Eq. (1.1), the probability of a
missing entry in a variable is not related to $Y_o$ or $Y_m$. For instance, in
modelling software defects in relation to development time, if the missingness
is in no way linked to the missing values of the rate of defects itself and at the
same time not linked to the values of the development time, the data are said to be
MCAR. Researchers have successfully addressed cases where the data are MCAR.
Silva-Ramirez et al. (2011) successfully applied multilayer perceptrons (MLPs) for
missing data imputation in datasets with missing values. Other research work done
on this mechanism can be found in Pigott (2001) and Nishanth and Ravi (2013).
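
To make the MCAR condition concrete, the sketch below draws the indicator M without ever looking at the data, so the chance of an entry going missing is the same for every value. The function names, the defect-rate figures and the 30% missing probability are illustrative assumptions, not taken from the book; the convention M = 1 for known and M = 0 for missing follows the text.

```python
import random
from typing import List, Optional, Sequence


def mcar_indicator(n: int, p_missing: float, seed: int = 0) -> List[int]:
    """Draw M for n entries: M = 1 if the entry is known, M = 0 if missing.

    Every entry goes missing with the same probability p_missing, and no
    data value is consulted, which mirrors P(M | Y_o, Y_m) = P(M).
    """
    rng = random.Random(seed)
    return [0 if rng.random() < p_missing else 1 for _ in range(n)]


def apply_indicator(values: Sequence[float], indicator: Sequence[int]) -> List[Optional[float]]:
    """Replace entries whose indicator is 0 with None (missing)."""
    return [v if m == 1 else None for v, m in zip(values, indicator)]


# Hypothetical defect rates for eight software modules.
defect_rate = [0.12, 0.30, 0.05, 0.44, 0.27, 0.09, 0.51, 0.18]

m = mcar_indicator(len(defect_rate), p_missing=0.3, seed=42)
observed = apply_indicator(defect_rate, m)

print(m)
print(observed)
```

Because `mcar_indicator` takes only the number of entries, the same seed produces the same mask whatever the data values are, which is the independence the definition requires.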
1.4.2 Missing at Random (MAR)
The MAR case is observed when the possibility of a specific feature variable having
missing data entries is related to the other feature variables in the dataset. However,
this missing data does not depend on the feature variable itself. MAR means the
missing data in the feature variable is conditional on any other feature variable in the
dataset but not on that being considered (Scheffer 2000). For example, consider a
dataset with two related variables, monthly expenditure and monthly income. Assume
for instance that all high-income earners decline to reveal their monthly expenditures
while low-income earners do provide this information. This implies that in the dataset,
there is no monthly expenditure entry for high-income earners, while for low-income
earners, the information is available. The missing monthly expenditure entry is linked
to the income earning level of the individual. This relationship can be expressed
mathematically as (Marwala 2009):

$$P(M \mid Y_o, Y_m) = P(M \mid Y_o) \tag{1.2}$$

where $M \in \{0, 1\}$ is the missing data indicator, with $M = 1$ if $Y$ is known and $M = 0$
if $Y$ is unknown/missing. $Y_o$ represents the observed values in $Y$ while $Y_m$ represents
the missing values of $Y$. Equation (1.2) indicates that the probability of a missing entry
given both the observed and the missing values is equal to the probability of the
missing entry given the observed values only. Considering the example described
in Sect. 1.4.1, the software defects might not be revealed because of a certain
development time. Such a scenario points to the data being MAR. Several studies
have been conducted in the literature where the missing data mechanism is MAR. For
example, Nelwamondo et al. (2007b) compared the performance of expectation
maximization and a genetic algorithm (GA)-optimized auto-associative neural network
(AANN) and found that the AANN is the better method. Further research on
this mechanism was performed in García-Laencina et al. (2009), Poleto et al. (2011)
and Liu and Brown (2013).
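
The income and expenditure example can be sketched as follows. Whether an expenditure entry goes missing depends only on the fully observed income, never on the expenditure value itself, in the spirit of Eq. (1.2). The income threshold, the probabilities and the figures are hypothetical choices made for illustration.

```python
import random
from typing import List, Optional, Sequence


def mar_indicator(
    income: Sequence[float],
    threshold: float,
    p_missing_high: float,
    p_missing_low: float,
    seed: int = 0,
) -> List[int]:
    """Indicator M for the expenditure variable: 1 if known, 0 if missing.

    The missing probability is chosen from the observed income only, so the
    expenditure values play no part in deciding what goes missing.
    """
    rng = random.Random(seed)
    indicator = []
    for x in income:
        p = p_missing_high if x > threshold else p_missing_low
        indicator.append(0 if rng.random() < p else 1)
    return indicator


# Hypothetical monthly figures for six respondents.
income = [1200.0, 5400.0, 900.0, 7100.0, 2300.0, 6500.0]
expenditure = [1000.0, 3100.0, 850.0, 4000.0, 1900.0, 3600.0]

# As in the text's example: every high-income earner withholds expenditure,
# every low-income earner reports it.
m = mar_indicator(income, threshold=5000.0, p_missing_high=1.0, p_missing_low=0.0)
observed_expenditure: List[Optional[float]] = [
    e if k == 1 else None for e, k in zip(expenditure, m)
]

print(m)
print(observed_expenditure)
```

Setting the two probabilities to values strictly between 0 and 1 gives a softer version of the same mechanism, in which income only shifts the chance of non-response.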
1.4.3 Non-ignorable Case or Missing not at Random (MNAR)
The third missing data mechanism is the missing not at random or non-ignorable
case. The MNAR case is observed when the possibility of a feature variable having
a missing data entry depends on the value of the feature variable itself, irrespective of
any alteration or modification to the values of other feature variables in the dataset
(Allison 2000). In scenarios such as these, it is impossible to estimate the missing
data by making use of the other feature variables in the dataset since the nature
of the missing data is not random. MNAR is the most challenging missing data
mechanism to model and these values are quite tough to estimate (Rubin 1978).
Let us consider the same scenario described in the previous subsection. Assume for
instance that some high-income earners do reveal their monthly expenditures while
others refuse, and the same for low-income earners. Unlike the MAR mechanism,
in this instance the missing entries in the monthly expenditure variable cannot be
ignored because they are not directly linked to the income variable or any other
variable. Models developed to estimate this kind of missing data are very often
biased. A probabilistic formulation of this mechanism is not easy because the data
in the mechanism is neither MAR nor MCAR.
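
A small sketch of why MNAR data tend to produce biased results: here the chance that an expenditure entry goes missing depends on the expenditure value itself, following the definition above. The threshold, the probabilities and the figures are assumptions chosen only to make the effect visible.

```python
import random
from typing import List, Sequence


def mnar_indicator(
    values: Sequence[float],
    threshold: float,
    p_missing_above: float,
    p_missing_below: float,
    seed: int = 0,
) -> List[int]:
    """Indicator M: 1 if known, 0 if missing.

    The missing probability is selected from the variable's own value, so the
    missingness depends on exactly the quantity that ends up unobserved.
    """
    rng = random.Random(seed)
    indicator = []
    for v in values:
        p = p_missing_above if v > threshold else p_missing_below
        indicator.append(0 if rng.random() < p else 1)
    return indicator


# Hypothetical monthly expenditures; large values are never reported.
expenditure = [1000.0, 3100.0, 850.0, 4000.0, 1900.0, 3600.0]
m = mnar_indicator(expenditure, threshold=3000.0, p_missing_above=1.0, p_missing_below=0.0)

reported = [v for v, k in zip(expenditure, m) if k == 1]
true_mean = sum(expenditure) / len(expenditure)
observed_mean = sum(reported) / len(reported)

# The mean of the reported values understates the true mean.
print(true_mean, observed_mean)
```

No other variable in the dataset records which values were withheld, so the gap between the two means cannot be corrected from the observed data alone.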
1.4.4 Missing by Natural Design (MBND)
This is a mechanism whereby the missing data occurs because it cannot be measured
physically (Marwala 2009). It is impossible to measure these data entries; however,
they are quite relevant in the data analysis procedure. Overcoming this problem
requires that mathematical equations be formulated. This missing data mechanism
mainly applies to mechanical engineering and natural science problems. Therefore,
it will not be used in this book for the problem under consideration.
1.5 Missing Data Patterns
The way in which missing data occurs can be grouped into three patterns given
by Tables 1.1, 1.2 and 1.3. Table 1.1 depicts a univariate pattern, which is a scenario
described by the presence of missing data in only one feature variable, as seen in
column I7. Table 1.2 depicts an arbitrary missing data pattern, which is a scenario
whereby the missing data occurs in a distributed and random manner. The last pattern
is the monotone missing data pattern, which is shown in Table 1.3. In that table, once
an entry in a row is missing, every entry to its right is missing as well. This pattern is
also referred to as a uniform pattern; it occurs in cases whereby the missing data
can be present in more than one feature variable, and it is easy to understand and
recognize (Ramoni and Sebastiani 2001).
The missing data pattern considered in this book is the arbitrary pattern and the
mechanisms are the missing at random and missing completely at random mechanisms.
Table 1.1 Univariate missing data pattern
Sample  I1     I2     I3     I4     I5     I6     I7
1       0.38   0.18   0.20   0.19   0.75   0.67   0.96
2       0.69   0.11   0.08   0.41   0.65   0.63   ?
3       0.17   0.79   0.66   0.53   0.95   0.43   ?
4       0.19   0.24   0.15   0.91   0.46   0.82   ?
Table 1.2 Arbitrary missing data pattern
Sample  I1     I2     I3     I4     I5     I6     I7
1       0.38   ?      0.20   0.19   0.75   0.67   0.96
2       0.69   0.11   0.08   0.41   ?      0.63   0.04
3       0.17   0.79   ?      0.53   0.95   0.43   0.054
4       ?      0.24   0.15   0.91   0.46   0.82   ?
Table 1.3 Monotone missing data pattern
Sample  I1     I2     I3     I4     I5     I6     I7
1       0.38   0.18   0.20   0.19   0.75   0.67   ?
2       0.69   0.11   0.08   0.41   0.65   ?      ?
3       0.17   0.79   0.66   0.53   ?      ?      ?
4       0.19   0.24   0.15   ?      ?      ?      ?
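
The three patterns in Tables 1.1 to 1.3 can be told apart mechanically. The sketch below is one possible check, not a method from the book: it encodes the "?" entries as `None`, and it assumes the columns are already in the order shown in the tables, so that "monotone" means every entry to the right of a missing entry in a row is also missing. The function name `classify_pattern` is hypothetical.

```python
from typing import Optional, Sequence

Row = Sequence[Optional[float]]


def classify_pattern(rows: Sequence[Row]) -> str:
    """Label a dataset as complete, univariate, monotone or arbitrary."""
    missing_cols = {j for row in rows for j, v in enumerate(row) if v is None}
    if not missing_cols:
        return "complete"
    if len(missing_cols) == 1:
        return "univariate"
    for row in rows:
        seen_missing = False
        for v in row:
            if v is None:
                seen_missing = True
            elif seen_missing:
                # An observed value follows a missing one: not monotone.
                return "arbitrary"
    return "monotone"


# Tables 1.1 to 1.3 with "?" written as None.
table_1_1 = [
    [0.38, 0.18, 0.20, 0.19, 0.75, 0.67, 0.96],
    [0.69, 0.11, 0.08, 0.41, 0.65, 0.63, None],
    [0.17, 0.79, 0.66, 0.53, 0.95, 0.43, None],
    [0.19, 0.24, 0.15, 0.91, 0.46, 0.82, None],
]
table_1_2 = [
    [0.38, None, 0.20, 0.19, 0.75, 0.67, 0.96],
    [0.69, 0.11, 0.08, 0.41, None, 0.63, 0.04],
    [0.17, 0.79, None, 0.53, 0.95, 0.43, 0.054],
    [None, 0.24, 0.15, 0.91, 0.46, 0.82, None],
]
table_1_3 = [
    [0.38, 0.18, 0.20, 0.19, 0.75, 0.67, None],
    [0.69, 0.11, 0.08, 0.41, 0.65, None, None],
    [0.17, 0.79, 0.66, 0.53, None, None, None],
    [0.19, 0.24, 0.15, None, None, None, None],
]

for name, table in [("1.1", table_1_1), ("1.2", table_1_2), ("1.3", table_1_3)]:
    print(name, classify_pattern(table))
```

The univariate test runs first because a single incomplete column would otherwise also pass the monotone check when it happens to be the last column.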
1.6 Classical Missing Data Techniques
Depending on how data goes missing in a dataset, there currently exist several data
imputation techniques that are being used in statistical packages (Yansaneh et al.
1998). These techniques range from basic approaches such as casewise data deletion
to approaches that are characterized by the application of more refined
artificial intelligence and statistical methods. The subsections that follow present
some of the most commonly applied missing data imputation methods. We begin with
basic and naive approaches and then present more complex and competent
mechanisms. A variety of classical missing data imputation techniques exist and are
popular owing to their simplicity and ease of implementation. The techniques presented in
this section are listwise or casewise deletion, pairwise deletion, mean substitution,
stochastic imputation with expectation maximization, hot and cold deck imputation,
multiple imputation and regression methods.
1.6.1 Listwise or Casewise Deletion
Many statistical approaches discard an entire record if any of the columns in the
record has a missing data entry. Such an approach is termed casewise or listwise
data deletion. Listwise data deletion is the easiest and most basic way to handle
the problem of missing data, but it is also the least recommended option because it
tends to significantly reduce the number of records available for the data analysis
task and, by so doing, reduces the accuracy of the findings from the analysis of the
data. Applying this technique is a possibility if the ratio of records with missing
data to records with complete data is very small. If this is not the case, making use
of this approach may bias the resulting estimates.
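
A minimal sketch of listwise deletion, applied to the data of Tables 1.1 and 1.2 with "?" encoded as `None`. The function name is hypothetical; the point is only to show how quickly records disappear when missing entries are scattered across the dataset.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]


def listwise_delete(rows: Sequence[Row]) -> List[Row]:
    """Keep only the records in which every feature variable is observed."""
    return [row for row in rows if all(v is not None for v in row)]


# Table 1.1 (univariate pattern) and Table 1.2 (arbitrary pattern).
table_1_1 = [
    [0.38, 0.18, 0.20, 0.19, 0.75, 0.67, 0.96],
    [0.69, 0.11, 0.08, 0.41, 0.65, 0.63, None],
    [0.17, 0.79, 0.66, 0.53, 0.95, 0.43, None],
    [0.19, 0.24, 0.15, 0.91, 0.46, 0.82, None],
]
table_1_2 = [
    [0.38, None, 0.20, 0.19, 0.75, 0.67, 0.96],
    [0.69, 0.11, 0.08, 0.41, None, 0.63, 0.04],
    [0.17, 0.79, None, 0.53, 0.95, 0.43, 0.054],
    [None, 0.24, 0.15, 0.91, 0.46, 0.82, None],
]

kept_1_1 = listwise_delete(table_1_1)
kept_1_2 = listwise_delete(table_1_2)

print(len(table_1_1), "->", len(kept_1_1))
print(len(table_1_2), "->", len(kept_1_2))
```

In Table 1.2 only 5 of the 28 cells are missing, yet every record contains at least one of them, so listwise deletion leaves nothing to analyse.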