Missing Data in Deep Learning and Engineering Systems: New Advances and Applications


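"Deep Learning and Missing Data in Engineering Systems" is a volume in the Studies in Big Data series, whose series editor is Janusz Kacprzyk of the Polish Academy of Sciences, Warsaw. The work examines how deep learning techniques handle missing values in large-scale, complex data in the context of engineering systems. With the rapid growth of big data, engineering systems produce ever larger and more varied datasets drawn from sensors, simulations, social networks and other channels, containing both physical measurements and records of user behaviour. Deep learning is an important branch of computational intelligence, a field that also spans neural networks, evolutionary computation, soft computing and fuzzy systems, and it shows strong potential for large-scale data, particularly in image recognition and natural language processing. In practice, however, data are often missing for reasons such as equipment failure or user privacy protection. This poses a challenge for data analysis and model training, because missing data can affect model accuracy and degrade predictive performance.

The authors, Collins Achepsah Leke and Tshilidzi Marwala, may explore the following key points:

1. **Identifying and measuring missing data**: methods for identifying missing values in engineering-system data, covering types such as random missingness, non-random (patterned or structural) missingness and incomplete observation, together with statistics and algorithms for measuring data quality.
2. **Deep learning strategies for missing data**: strategies such as filling (mean, median, regression prediction), interpolation (linear, polynomial, KNN), model-based methods (for example MICE and the EM algorithm), nearest-neighbour methods such as KNN, and deep learning's own autoencoders, as applied within a deep learning framework.
3. **Robustness and improvement of deep learning models**: how to design and tune deep learning models so that they remain stable and accurate when data are missing, which may be the core of the study. This may include regularization techniques, ensemble learning methods, or deep learning architectures designed specifically for missing data.
4. **Empirical analysis and case studies**: case studies on real engineering systems that show how deep learning performs on missing data, evaluate the different strategies, and examine their suitability in different scenarios.
5. **Future trends and challenges**: potential limitations of deep learning for missing data, such as the risk of overfitting and computational cost, and how to combine it with other preprocessing techniques (such as data cleaning and feature selection) to improve the overall solution.

The book offers researchers and practitioners in engineering an important perspective on how deep learning can cope with missing values in complex engineering data, which matters for the accuracy and reliability of data-driven decisions.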


1.3 Missing Data Proportions

Missing data in a dataset influences the analysis, the inferences and the conclusions drawn from that information. The impact on the performance of machine learning algorithms becomes more significant as the proportion of missing data in the dataset increases. Researchers have shown that the impact on machine learning algorithms is not as significant when the proportion of missing data is small in large-scale datasets (Ramoni and Sebastiani 2001; Tremblay et al. 2010; Polikar et al. 2010). This could be attributed to the fact that certain machine learning algorithms inherently possess mechanisms that tolerate a certain proportion of missing data. As the proportion grows, for example beyond 25%, the tolerance and performance of machine learning algorithms decrease significantly (Twala 2009). It is because of this reduced tolerance and performance that more sophisticated and reliable approaches to the missing data problem are required.
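Because the discussion turns on the proportion of missing data, it can help to measure that proportion before choosing a method. The following is a minimal sketch using only the Python standard library. The toy dataset, the use of `None` to mark a missing entry, and the helper names are assumptions made for illustration; the 25% level is simply the example figure quoted above, used here as a flagging threshold.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]  # None marks a missing entry (an assumed encoding)


def missing_proportions(rows: Sequence[Row]) -> List[float]:
    """Return the fraction of missing entries in each column."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    return [sum(row[j] is None for row in rows) / n_rows for j in range(n_cols)]


def overall_missing_proportion(rows: Sequence[Row]) -> float:
    """Return the fraction of missing entries over the whole dataset."""
    total = sum(len(row) for row in rows)
    missing = sum(value is None for row in rows for value in row)
    return missing / total


# Hypothetical toy dataset: 4 records, 3 features.
data = [
    [0.38, 0.18, None],
    [0.69, None, None],
    [0.17, 0.79, 0.66],
    [0.19, 0.24, None],
]

per_column = missing_proportions(data)
overall = overall_missing_proportion(data)

# 25% is the example level quoted in the text, used here as a flagging threshold.
HIGH_MISSINGNESS = 0.25
flagged_columns = [j for j, p in enumerate(per_column) if p > HIGH_MISSINGNESS]

print(per_column, round(overall, 3), flagged_columns)
```

Columns that end up in `flagged_columns` are the ones where, by the argument above, a simple method is most likely to struggle.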

1.4 Missing Data Mechanisms

As stated before, it is vital to identify why data are missing. When the reason is known, a suitable method for missing data imputation can be chosen or derived, resulting in higher effectiveness and prediction accuracy. In many situations, data collectors may be aware of such reasons, whereas statisticians and data users may not have that information available when they perform the analysis. In such scenarios, data users may have to use other analysis techniques to understand how the missing data relate to the observed data, from which possible reasons may be derived. A variable or feature in the dataset is viewed as a mechanism if it helps to explain why other variables are missing or not missing. In datasets collected through surveys, variables that are mechanisms are frequently associated with details that people are embarrassed to divulge. However, such information can often be inferred from the data that have been given. As an example, low-income people may be embarrassed to disclose their income but may disclose their highest level of education. Data users may then use the supplied educational information to gain insight into the income. Let us

assume that the variable Y is the complete dataset, then:

$$Y = \{Y_{obs}, Y_{mis}\}$$

Here $Y_{obs}$ is the observed component of Y while $Y_{mis}$ is the missing component of Y.
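The split of Y into an observed and a missing component can be made concrete with a small sketch. It assumes that a missing entry is encoded as `None` and uses an indicator with 1 for observed and 0 for missing, the convention this chapter adopts for M in Sect. 1.4.1. The record and the function name are hypothetical.

```python
from typing import List, Optional, Tuple


def split_observed_missing(
    y: List[Optional[float]],
) -> Tuple[List[int], List[float], List[int]]:
    """Return the indicator M, the observed values, and the missing positions."""
    m = [0 if value is None else 1 for value in y]  # 1 = observed, 0 = missing
    y_obs = [value for value in y if value is not None]
    mis_positions = [i for i, flag in enumerate(m) if flag == 0]
    return m, y_obs, mis_positions


record = [0.19, 0.24, 0.15, None, None]  # hypothetical record
m, y_obs, mis_positions = split_observed_missing(record)
print(m, y_obs, mis_positions)
```

The missing component has no values to list, so the sketch records only where it sits; an imputation method would later supply estimates for those positions.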

Any scenario whereby some or all feature variables within a dataset have missing data entries, or contain entries that are not properly characterized within the bounds of the problem domain, is termed missing data (Rubin 1978). The presence of missing data leads to several issues in a variety of sectors that depend on the availability of complete, high-quality data. This has resulted in different methods being introduced to address the missing data problem in various disciplines (Rubin 1978; Allison 2000). Handling missing data in an acceptable way


is dependent upon the nature of the missingness. There are currently four missing data mechanisms in the literature: missing completely at random (MCAR), missing at random (MAR), the non-ignorable case or missing not at random (MNAR), and missing by natural design (MBND).

1.4.1 Missing Completely at Random (MCAR)

The MCAR case is observed when the probability of a feature variable having missing data entries is independent of the feature variable itself and of all other feature variables in the dataset. In other words, whether an entry is missing does not depend on the feature variable being considered or on any other feature variable in the dataset. This relationship is expressed mathematically as (Little and Rubin 2014):

$$P(M \mid Y_{obs}, Y_{mis}) = P(M) \qquad (1.1)$$

where $M \in \{0, 1\}$ is the missing-value indicator: $M = 1$ if Y is known and $M = 0$ if Y is unknown/missing. $Y_{obs}$ represents the observed values in Y while $Y_{mis}$ represents the missing values of Y. From Eq. (1.1), the probability of a missing entry in a variable is not related to $Y_{obs}$ or $Y_{mis}$. For instance, let us assume that

in modelling software defects in relation to development time, the missingness is in no way linked to the missing values of the defect rate itself, nor to the values of the development time. The data are then said to be MCAR. Researchers have successfully addressed cases where the data are MCAR. Silva-Ramirez et al. (2011) successfully applied multilayer perceptrons (MLPs) to impute missing values in datasets. Other research on this mechanism can be found in Pigott (2001) and Nishanth and Ravi (2013).
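The MCAR condition can be illustrated with a small simulation of the software-defect example. This is a sketch under stated assumptions, not the authors' experiment: the data-generating model, the 20% missingness rate, the seed and all variable names are hypothetical. The indicator is drawn without looking at either variable, so observed and missing defect rates should have similar means.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible
N = 20000
P_MISSING = 0.2  # assumed missingness rate, the same for every record

# Hypothetical data: defect rate grows with development time plus noise.
development_time = [rng.uniform(1.0, 12.0) for _ in range(N)]
defect_rate = [0.5 * t + rng.gauss(0.0, 1.0) for t in development_time]

# MCAR: M is drawn independently of both variables (1 = observed, 0 = missing).
m = [0 if rng.random() < P_MISSING else 1 for _ in range(N)]

observed = [d for d, flag in zip(defect_rate, m) if flag == 1]
missing = [d for d, flag in zip(defect_rate, m) if flag == 0]

missing_fraction = 1.0 - sum(m) / N
# Under MCAR the two groups differ only by sampling noise.
mean_gap = abs(mean(observed) - mean(missing))

print(round(missing_fraction, 3), round(mean_gap, 3))
```

In a real dataset the missing values are not available for comparison, so this check is only possible in simulation; it is meant to show what Eq. (1.1) implies, not to serve as a test for MCAR.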

1.4.2 Missing at Random (MAR)

The MAR case is observed when the probability of a specific feature variable having missing data entries is related to the other feature variables in the dataset but does not depend on the feature variable itself. That is, the missingness in the feature variable is conditional on other feature variables in the dataset but not on the one being considered (Scheffer 2000). For example, consider a dataset with two related variables, monthly expenditure and monthly income. Assume that all high-income earners decline to reveal their monthly expenditures while low-income earners do provide this information. This implies that the dataset contains no monthly expenditure entry for high-income earners, while for low-income earners the information is available. The missing monthly expenditure entry is linked


to the income earning level of the individual. This relationship can be expressed mathematically as (Marwala 2009):

$$P(M \mid Y_{obs}, Y_{mis}) = P(M \mid Y_{obs}) \qquad (1.2)$$

where $M \in \{0, 1\}$ is the missing data indicator, with $M = 1$ if Y is known and $M = 0$ if Y is unknown/missing. $Y_{obs}$ represents the observed values in Y while $Y_{mis}$ represents the missing values of Y. Equation (1.2) indicates that the probability of a missing entry given both the observed and the missing entries equals the probability of the missing entry given the observed entries only. Considering the software defects example described in Sect. 1.4.1, the defect rate might go unreported because of a certain development time. Such a scenario points to the data being MAR. Several studies in the literature address the MAR mechanism. For example, Nelwamondo et al. (2007b) compared the performance of expectation maximization with that of a genetic algorithm (GA)-optimized auto-associative neural network (AANN) and found that the AANN performed better. Further research on this mechanism was performed in García-Laencina et al. (2009), Poleto et al. (2011) and Liu and Brown (2013).
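The income and expenditure example can be simulated to show what MAR looks like in data. The sketch below is illustrative only: the income range, the cut-off between low and high earners, and the missingness probabilities of 0.8 and 0.1 are assumptions (the text's example uses the extreme case where every high earner withholds the value). The key property is that the indicator is computed from income, which is always observed, and never from expenditure itself.

```python
import random
from statistics import mean
from typing import List

rng = random.Random(1)  # fixed seed so the sketch is reproducible
N = 20000
HIGH_INCOME = 5000.0  # hypothetical cut-off between low and high earners

income = [rng.uniform(1000.0, 9000.0) for _ in range(N)]
expenditure = [0.6 * x + rng.gauss(0.0, 300.0) for x in income]


def p_missing(inc: float) -> float:
    """Assumed probability that expenditure is withheld, given income only."""
    return 0.8 if inc > HIGH_INCOME else 0.1


# M = 1 if expenditure is observed, 0 if missing; it depends on income alone.
m = [0 if rng.random() < p_missing(x) else 1 for x in income]


def missing_rate(flags: List[int]) -> float:
    """Fraction of zero (missing) entries in an indicator list."""
    return 1.0 - sum(flags) / len(flags)


rate_high = missing_rate([f for x, f in zip(income, m) if x > HIGH_INCOME])
rate_low = missing_rate([f for x, f in zip(income, m) if x <= HIGH_INCOME])

true_mean = mean(expenditure)
observed_mean = mean([e for e, f in zip(expenditure, m) if f == 1])

print(round(rate_high, 2), round(rate_low, 2))
print(round(true_mean), round(observed_mean))
```

The mean of the observed expenditures falls well below the true mean because high earners are under-represented, yet the missingness is fully explained by income. That is why methods that condition on the observed variables can, in principle, correct for it.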

1.4.3 Non-ignorable Case or Missing not at Random (MNAR)

The third missing data mechanism is the missing not at random, or non-ignorable, case. The MNAR case is observed when the probability of a feature variable having a missing data entry depends on the value of the feature variable itself, irrespective of any alteration or modification to the values of other feature variables in the dataset (Allison 2000). In such scenarios, the missing data cannot be reliably estimated from the other feature variables in the dataset, since the missingness is not random. MNAR is the most challenging missing data mechanism to model, and the missing values are difficult to estimate (Rubin 1978). Let us consider the same scenario described in the previous subsection. Assume that some high-income earners reveal their monthly expenditures while others refuse, and the same holds for low-income earners. Unlike the MAR mechanism, in this instance the missing entries in the monthly expenditure variable cannot be ignored because they are not directly linked to the income variable or any other variable. Models developed to estimate this kind of missing data are very often biased. A probabilistic formulation of this mechanism is not easy because the data are neither MAR nor MCAR.
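A short simulation can show why MNAR data are hard to handle. In this sketch, which rests on assumed numbers throughout, the chance that an expenditure value is withheld depends on the expenditure value itself and on nothing else. The distribution, the threshold and the probabilities 0.7 and 0.1 are hypothetical choices made for illustration.

```python
import random
from statistics import mean

rng = random.Random(2)  # fixed seed so the sketch is reproducible
N = 20000
THRESHOLD = 3000.0  # hypothetical level above which people tend to withhold

expenditure = [rng.gauss(3000.0, 800.0) for _ in range(N)]


def p_missing(value: float) -> float:
    """Assumed probability of withholding, driven by the value itself (MNAR)."""
    return 0.7 if value > THRESHOLD else 0.1


# M = 1 if observed, 0 if missing; it depends on the unobserved value.
m = [0 if rng.random() < p_missing(v) else 1 for v in expenditure]

observed = [v for v, flag in zip(expenditure, m) if flag == 1]

true_mean = mean(expenditure)
observed_mean = mean(observed)
bias = observed_mean - true_mean  # negative: large values are under-represented

print(round(true_mean), round(observed_mean), round(bias))
```

No other variable in this setup explains the missingness, so an estimate built only from the observed entries inherits the shift in their mean.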


Table 1.3 Monotone missing data pattern

Sample  I1    I2    I3    I4    I5    I6    I7
1       0.38  0.18  0.20  0.19  0.75  0.67  ?
2       0.69  0.11  0.08  0.41  0.65  ?     ?
3       0.17  0.79  0.66  0.53  ?     ?     ?
4       0.19  0.24  0.15  ?     ?     ?     ?
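The pattern in Table 1.3 can be checked programmatically. The sketch below assumes a common reading of "monotone": with the columns in the order shown, once an entry in a record is missing, every later entry in that record is missing too. The function name and the `None` encoding for "?" are illustrative choices, and the check does not search for a column reordering that would make a pattern monotone.

```python
from typing import Optional, Sequence

Row = Sequence[Optional[float]]  # None stands for the "?" entries in the table


def is_monotone(rows: Sequence[Row]) -> bool:
    """True if no observed value follows a missing one in any row."""
    for row in rows:
        seen_missing = False
        for value in row:
            if value is None:
                seen_missing = True
            elif seen_missing:
                return False  # an observed value appears after a gap
    return True


# Values copied from Table 1.3.
table_1_3 = [
    [0.38, 0.18, 0.20, 0.19, 0.75, 0.67, None],
    [0.69, 0.11, 0.08, 0.41, 0.65, None, None],
    [0.17, 0.79, 0.66, 0.53, None, None, None],
    [0.19, 0.24, 0.15, None, None, None, None],
]

# Hypothetical counter-example: a gap in the middle of a record.
not_monotone = [[0.38, None, 0.20]]

print(is_monotone(table_1_3), is_monotone(not_monotone))
```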

1.6 Classical Missing Data Techniques

Depending on how data go missing in a dataset, several data imputation techniques are currently used in statistical packages (Yansaneh et al. 1998). These techniques range from basic approaches such as casewise data deletion to approaches characterized by the application of more refined artificial intelligence and statistical methods. The subsections that follow present some of the most commonly applied missing data imputation methods. We begin with basic and naive approaches and proceed to more complex and capable methods. A variety of classical missing data imputation techniques remain in use owing to their simplicity and ease of implementation. The techniques presented in this section are listwise or casewise deletion, pairwise deletion, mean substitution, stochastic imputation with expectation maximization, hot and cold deck imputation, multiple imputation and regression methods.

1.6.1 Listwise or Casewise Deletion

Many statistical approaches discard an entire record if any of its columns has a missing data entry. This approach is termed casewise or listwise data deletion: whenever any column in a record has a missing value for a feature variable, the entire record is deleted from the dataset. Listwise deletion is the easiest and most basic way to handle missing data, but it is also the least recommended option, because it tends to significantly reduce the number of records available for the data analysis task and thereby reduces the accuracy of the findings. Applying this technique is an option if the ratio of records with missing data to records with complete data is very small. If this is not the case, using this approach may bias the estimates obtained from the remaining data.
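Listwise deletion is simple enough to sketch in a few lines. The example below is a minimal illustration with a hypothetical five-record dataset in which `None` marks a missing entry; it also reports how much of the dataset survives, since that loss is the main drawback described above.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]  # None marks a missing entry (an assumed encoding)


def listwise_delete(rows: Sequence[Row]) -> List[Row]:
    """Keep only the records in which every column is observed."""
    return [row for row in rows if all(value is not None for value in row)]


# Hypothetical dataset: two of the five records have one missing entry each.
records = [
    [0.38, 0.18, 0.20],
    [0.69, None, 0.08],
    [0.17, 0.79, 0.66],
    [0.19, 0.24, None],
    [0.52, 0.33, 0.41],
]

complete = listwise_delete(records)
retained_fraction = len(complete) / len(records)

print(len(complete), retained_fraction)
```

Only two entries out of fifteen are missing here, yet two records out of five are discarded, which shows how quickly the method loses data once gaps are spread across many records.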
