Deep Learning Breakthroughs in Handling Missing Data in Engineering Systems
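Deep Learning and Missing Data in Engineering Systems is a specialist book published in 2019, co-authored by Collins Achepsah Leke and Tshilidzi Marwala, and part of the "Studies in Big Data" series. The series is edited by Janusz Kacprzyk of the Polish Academy of Sciences, Warsaw, and aims to share new developments and high-quality research in big data quickly. The book focuses on applying deep learning to missing data in engineering systems, a critical and challenging task because engineering systems typically produce large, complex datasets that may be incomplete.
The book examines how deep learning techniques cope with the large-scale, high-dimensional and distributed data that engineering systems draw from sources such as sensors, simulation results, social media and internet transaction records. Such data often contain missing values, which challenges traditional data analysis methods. Deep learning, a powerful machine learning technique, can handle and predict missing data effectively, particularly through models such as neural networks, convolutional neural networks and recurrent neural networks, filling the gaps through feature learning and pattern recognition.
The strength of deep learning lies in automatic feature extraction: it can discover useful patterns in unstructured data and adapt to complex nonlinear relationships. For missing data, several strategies can be used, such as imputation (for example mean imputation or regression-based prediction), generative models (for example variational autoencoders), or filling gaps using the similarity of neighbouring samples. In addition, ensemble learning or transfer learning allows pre-trained models to be applied to the missing data problem of a specific engineering system, further improving prediction accuracy and efficiency.
Beyond theoretical analysis, the book provides practical case studies and development methods that show how deep learning performs in real engineering systems. Readers can learn the basic principles of deep learning and how to apply the technique to the problem of missing data in engineering systems, thereby advancing engineering practice and improving system performance.
Deep Learning and Missing Data in Engineering Systems is a valuable reference for data engineers, machine learning specialists and readers interested in big data applications in engineering. It offers a comprehensive, in-depth view of how deep learning plays a key role in current engineering practice.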
1.3 Missing Data Proportions
Missing data in datasets influences the analysis, inferences and conclusions reached
based on the information. The impact on machine learning algorithm performance
becomes more significant with an increase in the proportion of missing data in the
dataset. Researchers have shown that the impact on machine learning algorithms is
not as significant when the proportion of missing data is small in large-scale datasets
(Ramoni and Sebastiani 2001; Tremblay et al. 2010; Polikar et al. 2010). This could
be attributed to the fact that certain machine learning algorithms inherently possess
frameworks to cater to certain proportions of missing data. With an increase in
missing data proportions, for example cases where the proportion is greater than 25%,
it is observed that tolerance and performance levels of machine learning algorithms
decrease significantly (Twala 2009). It is because of these reduced levels in tolerance
and performance that more complex and reliable approaches to solve the problem of
missing data are required.
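As a rough illustration of the proportion discussed above, the following sketch counts the fraction of missing entries in a small table and compares it with the 25% figure quoted from Twala (2009). The function name, the toy data and the use of `None` as the missing-value marker are assumptions made for this example only.

```python
import math

def missing_proportion(rows):
    """Return the fraction of entries that are missing (None or NaN)."""
    total = 0
    missing = 0
    for row in rows:
        for value in row:
            total += 1
            if value is None or (isinstance(value, float) and math.isnan(value)):
                missing += 1
    return missing / total if total else 0.0

# Hypothetical toy dataset: 4 records x 3 features, None marks a missing entry.
data = [
    [0.38, 0.18, None],
    [0.69, None, 0.08],
    [0.17, 0.79, 0.66],
    [None, 0.24, 0.15],
]

proportion = missing_proportion(data)       # 3 of 12 entries are missing
# 0.25 is used here only as the illustrative threshold mentioned in the text.
exceeds_threshold = proportion > 0.25
print(f"missing proportion = {proportion:.2f}, exceeds 25%: {exceeds_threshold}")
```

In practice the threshold would depend on the algorithm and dataset; the value here is only the one cited in the text.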
1.4 Missing Data Mechanisms
As stated before, it is vital to identify the reason why data are missing. When
the explanation is known, a suitable method for missing data imputation may then be
chosen or derived, resulting in higher effectiveness and prediction accuracy. In many
situations, data collectors may be conscious of such reasons, whereas statisticians
and data users may not have that information available to them when they perform
the analysis. In such scenarios, data users may have to use other techniques that
can assist in data analysis to comprehend how missing data are related to observed
data and as a result, possible reasons may be derived. A variable or a feature in
the dataset is viewed as a mechanism, if it assists in explaining why other variables
are missing or not missing. In datasets collected through surveys, variables that are
mechanisms are frequently associated with details that people are embarrassed to
divulge. However, such information can often be derived from the data that have
been given. As an example, low-income people may be embarrassed to disclose their
income but may disclose their highest level of education. Data users may then use
the supplied educational information to acquire an insight into the income. Let us
assume that the variable $Y$ is the complete dataset, then: $Y = \{Y_o, Y_m\}$.
Here $Y_o$ is the observed component of $Y$ while $Y_m$ is the missing component of $Y$.
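The partition of Y into an observed and a missing component can be sketched in a few lines. This is only an illustration: the function name and the dictionary/list representation are assumptions, and `None` stands in for a missing entry.

```python
def split_observed_missing(rows):
    """Split a table into its observed part and the positions of its missing part.

    Returns (observed, missing_positions), where observed maps (row, col) to a
    value (the Y_o component) and missing_positions lists the (row, col)
    coordinates whose values are unknown (the Y_m component).
    """
    observed = {}
    missing_positions = []
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is None:
                missing_positions.append((i, j))
            else:
                observed[(i, j)] = value
    return observed, missing_positions

# Hypothetical 2 x 3 dataset with two gaps.
Y = [
    [0.38, None, 0.20],
    [0.69, 0.11, None],
]
Y_o, Y_m_positions = split_observed_missing(Y)
print(len(Y_o), "observed entries; missing at", Y_m_positions)
```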
Any scenario whereby certain or all feature variables within a dataset have missing
data entries or contain data entries which are not exactly characterized within the
bounds of the problem domain is termed missing data (Rubin 1978). The presence
of missing data leads to several issues in a variety of sectors that depend on the
availability of complete and quality data. This has led to the introduction of
various methods aimed at addressing the missing data problem in varying
disciplines (Rubin 1978; Allison 2000). Handling missing data in an acceptable way
4 1 Introduction to Missing Data Estimation
is dependent upon the nature of the missingness. There are currently four missing data
mechanisms in the literature: missing completely at random (MCAR),
missing at random (MAR), the non-ignorable case or missing not at random (MNAR),
and missing by natural design (MBND).
1.4.1 Missing Completely at Random (MCAR)
The MCAR case is observed when the possibility of a feature variable having missing
data entries is independent of the feature variable itself or of any of the other feature
variables within the dataset. Essentially, this means that the missing data entry does
not depend on the feature variable being considered or any of the other feature
variables in the dataset. This relationship is expressed mathematically as (Little and
Rubin 2014):
$$P(M \mid Y_o, Y_m) = P(M) \qquad (1.1)$$
where $M \in \{0, 1\}$ represents an indication of the missing value: $M = 1$ if $Y$ is
known and $M = 0$ if $Y$ is unknown/missing. $Y_o$ represents the observed values in
$Y$ while $Y_m$ represents the missing values of $Y$. From Eq. (1.1), the probability of a
missing entry in a variable is not related to $Y_o$ or $Y_m$. For instance, let us assume that
in modelling software defects in relation to development time, if the missingness
is in no way linked to the missing values of the rate of defects itself and at the
same time not linked to the values of the development time, the data is said to be
MCAR. Researchers have successfully addressed cases where the data is MCAR.
Silva-Ramirez et al. (2011) successfully applied multilayer perceptrons (MLPs) for
missing data imputation in datasets with missing values. Other research work done
on this mechanism can be found in Pigott (2001) and Nishanth and Ravi (2013).
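A minimal simulation may help make Eq. (1.1) concrete. In the sketch below the indicator M is drawn without ever looking at the data, so the chance of a gap is the same for every entry. The function names, the 10% missingness rate and the random data are assumptions chosen for illustration; the convention M = 1 for known and M = 0 for missing follows the text.

```python
import random

def mcar_mask(n_rows, n_cols, p_missing, seed=0):
    """Draw M[i][j] = 0 (missing) with probability p_missing, else 1 (known).

    The draw never inspects the data values, so P(M | Y_o, Y_m) = P(M).
    """
    rng = random.Random(seed)
    return [[0 if rng.random() < p_missing else 1 for _ in range(n_cols)]
            for _ in range(n_rows)]

def apply_mask(rows, mask):
    """Replace every entry whose indicator is 0 with None."""
    return [[v if m == 1 else None for v, m in zip(row, mask_row)]
            for row, mask_row in zip(rows, mask)]

# Hypothetical complete dataset: 2000 records with 3 features each.
data_rng = random.Random(1)
complete = [[data_rng.random() for _ in range(3)] for _ in range(2000)]

mask = mcar_mask(len(complete), 3, p_missing=0.10)
incomplete = apply_mask(complete, mask)

# Share of entries that remain known; it should sit close to 1 - p_missing.
observed_rate = sum(map(sum, mask)) / (len(mask) * 3)
print(f"observed rate = {observed_rate:.3f}")
```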
1.4.2 Missing at Random (MAR)
The MAR case is observed when the possibility of a specific feature variable having
missing data entries is related to the other feature variables in the dataset. However,
this missing data does not depend on the feature variable itself. MAR means the
missing data in the feature variable is conditional on any other feature variable in the
dataset but not on that being considered (Scheffer 2000). For example, consider a
dataset with two related variables, monthly expenditure and monthly income. Assume
for instance that all high-income earners refuse to reveal their monthly expenditures
while low-income earners do provide this information. This implies that in the dataset,
there is no monthly expenditure entry for high-income earners, while for low-income
earners, the information is available. The missing monthly expenditure entry is linked
to the income earning level of the individual. This relationship can be expressed
mathematically as (Marwala 2009):
$$P(M \mid Y_o, Y_m) = P(M \mid Y_o) \qquad (1.2)$$
where $M \in \{0, 1\}$ is the missing data indicator, with $M = 1$ if $Y$ is known and $M = 0$
if $Y$ is unknown/missing. $Y_o$ represents the observed values in $Y$ while $Y_m$ represents
the missing values of $Y$. Equation (1.2) indicates that the probability of a missing entry
given an observable entry and a missing entry is equivalent to the probability of the
missing entry given the observable entry only. Considering the example described
in Sect. 1.4.1, the software defects might not be revealed because of a certain
development time. Such a scenario points to the data being MAR. Several studies
in the literature address the case where the missing data mechanism is MAR. For
example, Nelwamondo et al. (2007b) compared the performance of expectation
maximization and a genetic algorithm (GA)-optimized auto-associative neural
network (AANN), and found that the AANN is a better method than expectation
maximization. Further research on this mechanism was performed in
García-Laencina et al. (2009), Poleto et al. (2011) and Liu and Brown (2013).
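The income and expenditure example can be simulated to show what Eq. (1.2) means in practice: the chance that expenditure is missing depends only on income, which is always observed. Everything numeric in the sketch below (the income cut-off, the missingness probabilities and the linear link between income and expenditure) is a hypothetical choice made for illustration, and the probabilistic rule is a softened version of the "all high-income earners refuse" scenario in the text.

```python
import random

rng = random.Random(42)
HIGH_INCOME = 50_000  # hypothetical cut-off separating the two income groups

records = []
for _ in range(5000):
    income = rng.uniform(10_000, 100_000)             # always observed
    expenditure = 0.6 * income + rng.gauss(0, 2_000)  # may go missing
    # MAR: the probability of a gap depends on the observed income only,
    # never on the expenditure value itself.
    p_missing = 0.8 if income > HIGH_INCOME else 0.1
    m = 0 if rng.random() < p_missing else 1          # M = 1 known, M = 0 missing
    records.append((income, expenditure if m == 1 else None, m))

def missing_rate(rows):
    """Fraction of records whose expenditure indicator is 0."""
    return sum(1 for _, _, m in rows if m == 0) / len(rows)

high = [r for r in records if r[0] > HIGH_INCOME]
low = [r for r in records if r[0] <= HIGH_INCOME]
rate_high = missing_rate(high)
rate_low = missing_rate(low)
print(f"missing rate: high income {rate_high:.2f}, low income {rate_low:.2f}")
```

Because the missingness is explained by an observed variable, the two groups show clearly different missing rates, which is the signature a data user would look for when diagnosing MAR.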
1.4.3 Non-ignorable Case or Missing not at Random (MNAR)
The third missing data mechanism is the missing not at random or non-ignorable
case. The MNAR case is observed when the possibility of a feature variable having
a missing data entry depends on the value of the feature variable itself irrespective of
any alteration or modification to the values of other feature variables in the datasets
(Allison 2000). In scenarios such as these, it is impossible to estimate the missing
data by making use of the other feature variables in the dataset since the nature
of the missing data is not random. MNAR is the most challenging missing data
mechanism to model and these values are quite tough to estimate (Rubin 1978).
Let us consider the same scenario described in the previous subsection. Assume for
instance that some high-income earners do reveal their monthly expenditures while
others refuse, and the same for low-income earners. Unlike the MAR mechanism,
in this instance the missing entries in the monthly expenditure variable cannot be
ignored because they are not directly linked to the income variable or any other
variable. Models developed to estimate this kind of missing data are very often
biased. A probabilistic formulation of this mechanism is not easy because the data
in the mechanism is neither MAR nor MCAR.
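To see why the non-ignorable case is troublesome, the sketch below makes the chance of withholding an expenditure value depend on that value itself and then compares the mean of the surviving entries with the true mean. The distribution, the cut-off and the withholding probabilities are assumptions picked only to make the effect visible.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical "true" monthly expenditures for 5000 people.
expenditure = [rng.gauss(3_000, 800) for _ in range(5000)]

def p_missing(value):
    """MNAR: the chance of a gap depends on the value that goes missing."""
    return 0.7 if value > 3_000 else 0.1

# Keep a value only when it is not withheld.
observed = [v for v in expenditure if rng.random() >= p_missing(v)]

true_mean = statistics.mean(expenditure)
observed_mean = statistics.mean(observed)
bias = observed_mean - true_mean
print(f"true mean {true_mean:.0f}, observed mean {observed_mean:.0f}, bias {bias:.0f}")
```

Since large values are withheld more often, the observed mean underestimates the true mean, and no other variable in the dataset explains the gap.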
1.4.4 Missing by Natural Design (MBND)
This is a mechanism whereby the missing data occurs because it cannot be measured
physically (Marwala 2009). It is impossible to measure these data entries; however,
they are quite relevant in the data analysis procedure. Overcoming this problem
requires that mathematical equations be formulated. This missing data mechanism
mainly applies to mechanical engineering and natural science problems. Therefore,
it will not be used in this book for the problem under consideration.
1.5 Missing Data Patterns
The way in which missing data occurs can be grouped into three patterns, given
in Tables 1.1, 1.2 and 1.3. Table 1.1 depicts a univariate pattern, a scenario
in which missing data is present in only one feature variable, as seen in
column I7. Table 1.2 depicts an arbitrary missing data pattern, a scenario
in which the missing data occurs in a distributed and random manner. The last pattern
is the monotone missing data pattern, shown in Table 1.3. This pattern is
also referred to as a uniform pattern; it occurs in cases where the missing data
can be present in more than one feature variable, and it is easy to understand and
recognize (Ramoni and Sebastiani 2001).
The missing data pattern considered in this book is the arbitrary pattern, and the
mechanisms are the missing at random and missing completely at random mechanisms.
Table 1.1 Univariate missing data pattern
Sample  I1     I2     I3     I4     I5     I6     I7
1       0.38   0.18   0.20   0.19   0.75   0.67   0.96
2       0.69   0.11   0.08   0.41   0.65   0.63   ?
3       0.17   0.79   0.66   0.53   0.95   0.43   ?
4       0.19   0.24   0.15   0.91   0.46   0.82   ?
Table 1.2 Arbitrary missing data pattern
Sample  I1     I2     I3     I4     I5     I6     I7
1       0.38   ?      0.20   0.19   0.75   0.67   0.96
2       0.69   0.11   0.08   0.41   ?      0.63   0.04
3       0.17   0.79   ?      0.53   0.95   0.43   0.054
4       ?      0.24   0.15   0.91   0.46   0.82   ?
Table 1.3 Monotone missing data pattern
Sample  I1     I2     I3     I4     I5     I6     I7
1       0.38   0.18   0.20   0.19   0.75   0.67   ?
2       0.69   0.11   0.08   0.41   0.65   ?      ?
3       0.17   0.79   0.66   0.53   ?      ?      ?
4       0.19   0.24   0.15   ?      ?      ?      ?
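The three patterns in Tables 1.1 to 1.3 can be told apart mechanically. The sketch below is one possible way to do so, not a method taken from the book: it treats a pattern as monotone when the columns can be ordered so that, within every record, a missing entry is followed only by missing entries. The function name and the use of `None` for "?" are assumptions.

```python
NA = None  # stands in for the "?" entries of the tables

def classify_pattern(rows):
    """Classify a table as 'complete', 'univariate', 'monotone' or 'arbitrary'."""
    n_cols = len(rows[0])
    missing_per_col = [sum(row[j] is None for row in rows) for j in range(n_cols)]
    cols_with_missing = [j for j, count in enumerate(missing_per_col) if count > 0]
    if not cols_with_missing:
        return "complete"
    if len(cols_with_missing) == 1:
        return "univariate"
    # Order columns from fewest to most gaps. Under a monotone pattern, once a
    # record has a gap, every later column in this order is also a gap.
    order = sorted(range(n_cols), key=lambda j: missing_per_col[j])
    for row in rows:
        seen_missing = False
        for j in order:
            if row[j] is None:
                seen_missing = True
            elif seen_missing:
                return "arbitrary"
    return "monotone"

table_1_1 = [
    [0.38, 0.18, 0.20, 0.19, 0.75, 0.67, 0.96],
    [0.69, 0.11, 0.08, 0.41, 0.65, 0.63, NA],
    [0.17, 0.79, 0.66, 0.53, 0.95, 0.43, NA],
    [0.19, 0.24, 0.15, 0.91, 0.46, 0.82, NA],
]
table_1_2 = [
    [0.38, NA, 0.20, 0.19, 0.75, 0.67, 0.96],
    [0.69, 0.11, 0.08, 0.41, NA, 0.63, 0.04],
    [0.17, 0.79, NA, 0.53, 0.95, 0.43, 0.054],
    [NA, 0.24, 0.15, 0.91, 0.46, 0.82, NA],
]
table_1_3 = [
    [0.38, 0.18, 0.20, 0.19, 0.75, 0.67, NA],
    [0.69, 0.11, 0.08, 0.41, 0.65, NA, NA],
    [0.17, 0.79, 0.66, 0.53, NA, NA, NA],
    [0.19, 0.24, 0.15, NA, NA, NA, NA],
]

for name, table in [("1.1", table_1_1), ("1.2", table_1_2), ("1.3", table_1_3)]:
    print("Table", name, "->", classify_pattern(table))
```

Note that a univariate pattern is a special case of a monotone one, which is why the univariate check comes first.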
1.6 Classical Missing Data Techniques
Depending on how data goes missing in a dataset, several data
imputation techniques are currently used in statistical packages (Yansaneh et al.
1998). These techniques range from basic approaches such as casewise data deletion
to approaches characterized by the application of more refined
artificial intelligence and statistical methods. The subsections that follow present
some of the most commonly applied missing data imputation methods. We begin with
basic and naive approaches and go on to present more complex and competent
mechanisms. A variety of classical missing data imputation techniques are in use,
courtesy of their simplicity and ease of implementation. The techniques presented in
this section are listwise or casewise deletion, pairwise deletion, mean substitution,
stochastic imputation with expectation maximization, hot and cold deck imputation,
multiple imputation and regression methods.
1.6.1 Listwise or Casewise Deletion
Many statistical approaches discard an entire record if any of the
columns in the record has a missing data entry. Such an approach is termed casewise
or listwise data deletion: if any column in a record has a missing value for a
feature variable, the entire record is deleted
from the dataset. Listwise data deletion is the easiest and most basic way to handle
the problem of missing data. It is also the least recommended option, as it tends
to significantly reduce the number of records in the dataset that
are necessary for the data analysis task and, by so doing, reduces the accuracy of
the findings from the analysis of the data. Applying this technique is a possibility if
the ratio of records with missing data to records with complete data is very small. If
this is not the case, making use of this approach may result in biased estimates
from the analysis.
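A short sketch shows how quickly listwise deletion can shrink a dataset. It reuses the values of Tables 1.1 and 1.2, with `None` standing in for "?"; the function name is an assumption made for this example.

```python
NA = None  # stands in for the "?" entries of the tables

def listwise_delete(rows):
    """Keep only the records that have no missing entry in any column."""
    return [row for row in rows if all(value is not None for value in row)]

table_1_1 = [
    [0.38, 0.18, 0.20, 0.19, 0.75, 0.67, 0.96],
    [0.69, 0.11, 0.08, 0.41, 0.65, 0.63, NA],
    [0.17, 0.79, 0.66, 0.53, 0.95, 0.43, NA],
    [0.19, 0.24, 0.15, 0.91, 0.46, 0.82, NA],
]
table_1_2 = [
    [0.38, NA, 0.20, 0.19, 0.75, 0.67, 0.96],
    [0.69, 0.11, 0.08, 0.41, NA, 0.63, 0.04],
    [0.17, 0.79, NA, 0.53, 0.95, 0.43, 0.054],
    [NA, 0.24, 0.15, 0.91, 0.46, 0.82, NA],
]

kept_univariate = listwise_delete(table_1_1)  # 1 of 4 records survives
kept_arbitrary = listwise_delete(table_1_2)   # no record survives
print(len(kept_univariate), "of", len(table_1_1), "records kept for Table 1.1")
print(len(kept_arbitrary), "of", len(table_1_2), "records kept for Table 1.2")
```

In the arbitrary pattern of Table 1.2 every record has at least one gap, so listwise deletion leaves nothing to analyse even though only 5 of the 28 entries are missing.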