Breakthroughs in Deep Learning for Handling Missing Data in Engineering Systems

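"Deep Learning and Missing Data in Engineering Systems" is a specialist book published in 2019, written by Collins Achepsah Leke and Tshilidzi Marwala as part of the "Studies in Big Data" series. The series is edited by Janusz Kacprzyk (Polish Academy of Sciences, Warsaw) and aims to share new developments and high-quality research in big data quickly.

The book focuses on applying deep learning to missing data in engineering systems. This is a key and challenging task, because engineering systems typically produce large, complex datasets that may be incomplete. The book examines how deep learning techniques cope with large-scale, high-dimensional and distributed data from sources such as sensors, simulation results, social media and internet transaction records. Such data often contain missing values, which poses a challenge for traditional data analysis methods.

Deep learning models, including neural networks, convolutional neural networks and recurrent neural networks, can estimate missing data by using feature learning and pattern recognition to fill in the gaps. Their strength lies in automatic feature extraction: they can find useful patterns in unstructured data and adapt to complex nonlinear relationships. Strategies for missing data include imputation (for example mean imputation or regression-based prediction), generative models (for example variational autoencoders) and filling in values from similar neighbouring samples. Ensemble learning or transfer learning can also be used to apply pre-trained models to the missing data problem of a specific engineering system, further improving prediction accuracy and efficiency.

Besides theoretical analysis, the book provides practical case studies and development methods that show how deep learning performs in real engineering systems. Readers can learn the basic principles of deep learning as well as how to apply the technique to missing data problems in engineering systems. It is a reference for data engineers, machine learning practitioners and readers interested in big data applications in engineering.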

1 Introduction to Missing Data Estimation

1.3 Missing Data Proportions

Missing data in a dataset influences the analysis, inferences and conclusions drawn from the information. The impact on the performance of machine learning algorithms becomes more significant as the proportion of missing data in the dataset increases. Researchers have shown that the impact on machine learning algorithms is not as significant when the proportion of missing data is small in large-scale datasets (Ramoni and Sebastiani 2001; Tremblay et al. 2010; Polikar et al. 2010). This could be attributed to the fact that certain machine learning algorithms inherently possess frameworks that cater to certain proportions of missing data. As the proportion of missing data increases, for example in cases where it is greater than 25%, the tolerance and performance levels of machine learning algorithms decrease significantly (Twala 2009). It is because of these reduced levels of tolerance and performance that more complex and reliable approaches to the missing data problem are required.
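As a rough illustration of the quantity discussed above, the following Python sketch computes the proportion of missing entries per feature and flags features that exceed the 25% figure quoted in the text. The record layout, the feature names and the use of `None` as the missing marker are assumptions made for this example only; they do not come from the book.

```python
# Illustrative sketch: proportion of missing entries per feature.
# None marks a missing entry (an assumption of this example).


def missing_proportions(records):
    """Return {feature: fraction of records in which that feature is None}."""
    if not records:
        return {}
    n = len(records)
    return {f: sum(r[f] is None for r in records) / n for f in records[0]}


def flag_heavy_missingness(proportions, threshold=0.25):
    """Return the features whose missing proportion exceeds the threshold."""
    return sorted(f for f, p in proportions.items() if p > threshold)


# Hypothetical sensor records, invented for illustration.
records = [
    {"temperature": 21.5, "pressure": 1.01, "flow": None},
    {"temperature": None, "pressure": 0.99, "flow": None},
    {"temperature": 22.1, "pressure": 1.02, "flow": 3.4},
    {"temperature": 20.9, "pressure": 1.00, "flow": None},
]

proportions = missing_proportions(records)
heavy = flag_heavy_missingness(proportions)
print(proportions)
print(heavy)
```

The threshold is a parameter rather than a constant because the 25% figure is given in the text only as an example of a high proportion.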

1.4 Missing Data Mechanisms

As stated before, it is vital to identify the reason why data are missing. When the explanation is known, a suitable method for missing data imputation may then be chosen or derived, resulting in higher effectiveness and prediction accuracy. In many situations, data collectors may be aware of such reasons, whereas statisticians and data users may not have that information available to them when they perform the analysis. In such scenarios, data users may have to use other techniques to understand how the missing data are related to the observed data, from which possible reasons may be derived. A variable or feature in the dataset is viewed as a mechanism if it helps explain why other variables are missing or not missing. In datasets collected through surveys, variables that are mechanisms are frequently associated with details that people are embarrassed to divulge. However, such information can often be inferred from the data that have been given. As an example, low-income people may be embarrassed to disclose their income but may disclose their highest level of education. Data users may then use the supplied educational information to gain insight into the income. Let us assume that the variable Y is the complete dataset; then:

$$Y = \{Y_{obs}, Y_{mis}\}$$

Here $Y_{obs}$ is the observed component of Y while $Y_{mis}$ is the missing component of Y.

Any scenario in which some or all feature variables within a dataset have missing data entries, or contain data entries that are not exactly characterized within the bounds of the problem domain, is termed missing data (Rubin 1978). The presence of missing data leads to several issues in a variety of sectors that depend on the availability of complete, quality data. This has resulted in different methods being introduced to address the missing data problem in various disciplines (Rubin 1978; Allison 2000). Handling missing data in an acceptable way depends on the nature of the missingness. There are currently four missing data mechanisms in the literature: missing completely at random (MCAR), missing at random (MAR), the non-ignorable case or missing not at random (MNAR), and missing by natural design (MBND).

1.4.1 Missing Completely at Random (MCAR)

The MCAR case is observed when the possibility of a feature variable having missing data entries is independent of the feature variable itself and of any of the other feature variables within the dataset. Essentially, this means that the missing data entry does not depend on the feature variable being considered or on any of the other feature variables in the dataset. This relationship is expressed mathematically as (Little and Rubin 2014):

$$P(M \mid Y_{obs}, Y_{mis}) = P(M) \tag{1.1}$$

where $M \in \{0, 1\}$ is the indicator of the missing value: M = 1 if Y is known and M = 0 if Y is unknown/missing. $Y_{obs}$ represents the observed values in Y while $Y_{mis}$ represents the missing values of Y. From Eq. (1.1), the probability of a missing entry in a variable is not related to $Y_{obs}$ or $Y_{mis}$. For instance, in modelling software defects in relation to development time, if the missingness is in no way linked to the missing values of the defect rate itself and at the same time not linked to the values of the development time, the data is said to be MCAR. Researchers have successfully addressed cases where the data is MCAR. Silva-Ramirez et al. (2011) successfully applied multilayer perceptrons (MLPs) for missing data imputation in datasets with missing values. Other research on this mechanism can be found in Pigott (2001) and Nishanth and Ravi (2013).
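To make the MCAR condition concrete, here is a minimal simulation sketch in Python. It reuses the software defects example, but the data are synthetic: the sample size, the linear relationship between development time and defect rate, and the 20% missing probability are all assumptions chosen for illustration. The indicator follows the convention above, with M = 1 for a known entry and M = 0 for a missing one.

```python
import random


def mcar_mask(n, p_missing, rng):
    """Return indicators M (1 = observed, 0 = missing).

    Every entry is dropped with the same probability, regardless of any
    data value, which is the MCAR assumption P(M | Y_obs, Y_mis) = P(M).
    """
    return [0 if rng.random() < p_missing else 1 for _ in range(n)]


rng = random.Random(0)

# Synthetic data: development time in months and a defect rate that
# depends on it. Both are invented for this sketch.
dev_time = [rng.uniform(1, 12) for _ in range(20000)]
defect_rate = [0.5 * t + rng.gauss(0, 1) for t in dev_time]

mask = mcar_mask(len(defect_rate), 0.2, rng)

observed = [y for y, m in zip(defect_rate, mask) if m == 1]
full_mean = sum(defect_rate) / len(defect_rate)
observed_mean = sum(observed) / len(observed)
missing_fraction = 1 - sum(mask) / len(mask)

print(round(missing_fraction, 3), round(full_mean, 3), round(observed_mean, 3))
```

Because the mask ignores the data, the mean of the observed defect rates should stay close to the mean of the full synthetic sample, which is why MCAR is the most benign of the mechanisms.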

1.4.2 Missing at Random (MAR)

The MAR case is observed when the possibility of a specific feature variable having missing data entries is related to the other feature variables in the dataset but does not depend on the feature variable itself. In other words, the missingness in the feature variable is conditional on other feature variables in the dataset but not on the one being considered (Scheffer 2000). For example, consider a dataset with two related variables, monthly expenditure and monthly income. Assume that all high-income earners decline to reveal their monthly expenditures while low-income earners do provide this information. This implies that the dataset has no monthly expenditure entry for high-income earners, while for low-income earners the information is available. The missing monthly expenditure entry is linked to the income level of the individual. This relationship can be expressed mathematically as (Marwala 2009):

$$P(M \mid Y_{obs}, Y_{mis}) = P(M \mid Y_{obs}) \tag{1.2}$$

where $M \in \{0, 1\}$ is the missing data indicator, with M = 1 if Y is known and M = 0 if Y is unknown/missing. $Y_{obs}$ represents the observed values in Y while $Y_{mis}$ represents the missing values of Y. Equation (1.2) indicates that the probability of a missing entry given an observable entry and a missing entry is equivalent to the probability of the missing entry given the observable entry only. Considering the software defects example described in Sect. 1.4.1, the software defects might not be revealed because of a certain development time. Such a scenario points to the data being MAR. Several studies in the literature consider the MAR mechanism. For example, Nelwamondo et al. (2007b) compared the performance of expectation maximization with that of an auto-associative neural network (AANN) optimized by a genetic algorithm (GA), and found the AANN to be the better method. Further research on this mechanism was performed in García-Laencina et al. (2009), Poleto et al. (2011) and Liu and Brown (2013).
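The income and expenditure example can be sketched as a small Python simulation. Everything numeric here is an assumption made for illustration: the income range, the hypothetical `HIGH_INCOME` threshold, the linear spending model, and the softened missing probabilities (80% for high earners and 10% for low earners, rather than "all" and "none", so that some high-income expenditures remain observable for comparison).

```python
import random

rng = random.Random(1)
HIGH_INCOME = 5000  # hypothetical threshold, not from the source


def mean(values):
    values = list(values)
    return sum(values) / len(values)


people = []
for _ in range(20000):
    income = rng.uniform(1000, 9000)
    expenditure = 0.6 * income + rng.gauss(0, 300)
    # MAR: missingness of expenditure depends only on the observed income,
    # never on the expenditure value itself.
    p_missing = 0.8 if income > HIGH_INCOME else 0.1
    m = 0 if rng.random() < p_missing else 1  # M = 1 observed, M = 0 missing
    people.append((income, expenditure, m))

# Over the whole sample, the observed expenditures are not representative.
overall_mean = mean(e for _, e, _ in people)
overall_obs_mean = mean(e for _, e, m in people if m == 1)

# Within the high-income group, observed entries are representative again.
high = [p for p in people if p[0] > HIGH_INCOME]
high_all_mean = mean(e for _, e, _ in high)
high_obs_mean = mean(e for _, e, m in high if m == 1)

print(round(overall_mean), round(overall_obs_mean))
print(round(high_all_mean), round(high_obs_mean))
```

The point of the comparison is that, under MAR, conditioning on the observed variable (income) removes the distortion, which is what makes the other feature variables useful for estimating the missing entries.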

1.4.3 Non-ignorable Case or Missing not at Random (MNAR)

The third missing data mechanism is the missing not at random or non-ignorable case. The MNAR case is observed when the possibility of a feature variable having a missing data entry depends on the value of the feature variable itself, irrespective of any alteration or modification to the values of other feature variables in the dataset (Allison 2000). In scenarios such as these, it is impossible to estimate the missing data by making use of the other feature variables in the dataset, since the nature of the missing data is not random. MNAR is the most challenging missing data mechanism to model and these values are quite tough to estimate (Rubin 1978). Let us consider the same scenario described in the previous subsection. Assume that some high-income earners reveal their monthly expenditures while others refuse, and that the same holds for low-income earners. Unlike the MAR mechanism, in this instance the missing entries in the monthly expenditure variable cannot be ignored because they are not directly linked to the income variable or any other variable. Models developed to estimate this kind of missing data are very often biased. A probabilistic formulation of this mechanism is not easy because the data in the mechanism is neither MAR nor MCAR.
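A short Python sketch can show why MNAR data resist estimation from the other variables. The setup is synthetic and every number is an assumption for illustration: here the chance that an expenditure entry is missing depends on the expenditure value itself (80% above a hypothetical cut-off of 3000, 10% below it), and the comparison is made inside a narrow income band so that income is effectively held fixed.

```python
import random

rng = random.Random(2)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


rows = []
for _ in range(20000):
    income = rng.uniform(1000, 9000)
    expenditure = 0.6 * income + rng.gauss(0, 600)
    # MNAR: missingness depends on the (possibly unobserved) value itself.
    p_missing = 0.8 if expenditure > 3000 else 0.1
    m = 0 if rng.random() < p_missing else 1  # M = 1 observed, M = 0 missing
    rows.append((income, expenditure, m))

# Hold income roughly fixed by looking at one narrow income band.
band = [r for r in rows if 4500 < r[0] < 5500]
band_all_mean = mean(e for _, e, _ in band)
band_obs_mean = mean(e for _, e, m in band if m == 1)

print(round(band_all_mean), round(band_obs_mean))
```

In this sketch the observed mean inside the band understates the true mean even though income is nearly constant, so conditioning on income does not remove the distortion. This is one way to see why estimates built on such data tend to be biased.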

Table 1.3 Monotone missing data pattern

Sample  I1    I2    I3    I4    I5    I6    I7
1       0.38  0.18  0.20  0.19  0.75  0.67  ?
2       0.69  0.11  0.08  0.41  0.65  ?     ?
3       0.17  0.79  0.66  0.53  ?     ?     ?
4       0.19  0.24  0.15  ?     ?     ?     ?
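The pattern in Table 1.3 can be checked programmatically. The Python sketch below assumes a working definition of "monotone" for a fixed column order: in every row, once an entry is missing, all later entries are missing too. The function name `is_monotone`, the use of `None` for the "?" entries, and the counter-example data are choices made for this illustration, not part of the source.

```python
def is_monotone(rows):
    """True if, in every row, all entries after the first missing one are missing."""
    for row in rows:
        seen_missing = False
        for value in row:
            if value is None:
                seen_missing = True
            elif seen_missing:
                # An observed value follows a missing one: not monotone.
                return False
    return True


# Values of Table 1.3, with "?" encoded as None.
table_1_3 = [
    [0.38, 0.18, 0.20, 0.19, 0.75, 0.67, None],
    [0.69, 0.11, 0.08, 0.41, 0.65, None, None],
    [0.17, 0.79, 0.66, 0.53, None, None, None],
    [0.19, 0.24, 0.15, None, None, None, None],
]

# Invented counter-example with a gap in the middle of a row.
arbitrary = [
    [0.38, None, 0.20],
    [0.69, 0.11, None],
]

print(is_monotone(table_1_3))
print(is_monotone(arbitrary))
```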

1.6 Classical Missing Data Techniques

Depending on how data go missing in a dataset, several data imputation techniques are currently used in statistical packages (Yansaneh et al. 1998). These range from basic approaches such as casewise data deletion to approaches that apply more refined artificial intelligence and statistical methods. The subsections that follow present some of the most commonly applied missing data imputation methods, beginning with basic, naive approaches and moving on to more complex and capable ones. A variety of classical missing data imputation techniques remain in use owing to their simplicity and ease of implementation. The techniques presented in this section are listwise or casewise deletion, pairwise deletion, mean substitution, stochastic imputation with expectation maximization, hot and cold deck imputation, multiple imputation and regression methods.

1.6.1 Listwise or Casewise Deletion

Many statistical approaches discard an entire record if any of the columns in the record has a missing data entry. This approach is termed casewise or listwise data deletion: whenever any feature variable in a record has a missing value, the entire record is deleted from the dataset. Listwise deletion is the easiest and most basic way to handle the problem of missing data, and also the least recommended, because it tends to significantly reduce the number of records available for the data analysis task and thereby reduces the accuracy of the findings. Applying this technique is feasible if the ratio of records with missing data to records with complete data is very small. If this is not the case, using this approach may result in biased estimates.
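A minimal Python sketch of listwise deletion follows. The records, feature names and the use of `None` as the missing marker are invented for illustration. The example is built so that only a fifth of the individual cells are missing, yet most of the records are lost.

```python
def listwise_delete(records):
    """Keep only the records in which no feature is missing (None)."""
    return [r for r in records if all(v is not None for v in r.values())]


# Hypothetical records: 15 cells in total, 3 of them missing.
records = [
    {"a": 1.0, "b": 2.0, "c": 3.0},
    {"a": None, "b": 2.1, "c": 3.1},
    {"a": 1.2, "b": None, "c": 3.2},
    {"a": 1.3, "b": 2.3, "c": None},
    {"a": 1.4, "b": 2.4, "c": 3.4},
]

complete = listwise_delete(records)

total_cells = sum(len(r) for r in records)
missing_cells = sum(v is None for r in records for v in r.values())

cell_missing_fraction = missing_cells / total_cells
record_retained_fraction = len(complete) / len(records)

print(cell_missing_fraction, record_retained_fraction)
```

The gap between the share of missing cells and the share of discarded records grows with the number of features, which is one reason the technique is only reasonable when very few records are affected.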
