overestimates the model fit and the correlation between the variables, as it does not
take into account the uncertainty in the missing data and underestimates variances and
covariances.
The primary objective of stochastic regression is to reduce this bias by an extra step
of augmenting each predicted score with a residual term, which is normally distributed
with a mean of zero and a variance equal to the residual variance of the regression
of the target on the predictors. This stochastic regression-based data imputation method
allows us to preserve the variability in the data and to obtain unbiased parameter estimates
with MAR data (see Section 5.1.1). However, since the uncertainty about the imputed values
is not fully accounted for, the standard error tends to be underestimated, which increases
the risk of type I errors [10].
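As a minimal sketch of this idea (assuming a complete NumPy predictor matrix X and a target vector y in which NaN marks the missing values; the function name stochastic_regression_impute is illustrative and not taken from the book's companion code), the imputation could be written with scikit-learn as follows:

import numpy as np
from sklearn.linear_model import LinearRegression

def stochastic_regression_impute(X, y, rng=None):
    """Impute missing entries of y with the regression prediction plus a
    normally distributed residual (mean 0, variance = residual variance)."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float).copy()
    missing = np.isnan(y)

    # Fit the regression on the complete cases only.
    model = LinearRegression().fit(X[~missing], y[~missing])

    # Residual variance estimated from the observed cases.
    residuals = y[~missing] - model.predict(X[~missing])
    sigma = residuals.std(ddof=X.shape[1] + 1)

    # Deterministic prediction + stochastic residual for the missing cases.
    y[missing] = model.predict(X[missing]) + rng.normal(0.0, sigma, missing.sum())
    return y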
Likewise, k-NN can be used for data imputation by filling in missing values with the
mean of the k values coming from the k most similar complete observations [11]. Here,
a distance function (e.g. Euclidean, Mahalanobis, Pearson, or Hamming) can be used
to determine the similarity of two observations. One of the advantages of this method is
that the correlation structure of the data is taken into consideration. However, the choice
of the value of k is critical: a higher value of k includes observations that are significantly
different from the target observation, while a lower value of k relies on too few neighbors
and may miss other relevant observations.
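A corresponding sketch using scikit-learn's KNNImputer, which fills each missing entry with the mean of the k nearest observations under a NaN-aware Euclidean distance, might look as follows (the toy matrix and choice of k are illustrative only):

import numpy as np
from sklearn.impute import KNNImputer

# Toy data: rows are observations, columns are features; NaN marks missing values.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# k controls the trade-off discussed above: a larger k pulls in less similar
# observations, while a smaller k relies on very few neighbors.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_imputed = imputer.fit_transform(X)
print(X_imputed)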
A SOM [12], which generally projects multidimensional data into a two-dimensional
(2D) (feature) map in such a way that the data with similar patterns are associated
with the same neurons (i.e. best matching units (BMUs)) or their neighbors, can be
used for dealing with missing data. The underlying concept of the SOM-based data
imputation method is to substitute missing values by their corresponding BMU values.
Additionally, principal component analysis (PCA) [13] enables data imputation by
projecting the data onto the linear subspace that retains the maximum variance and
reconstructing the missing values from this projection. More specifically, the linear
projection can be estimated from the observed data.
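One common way to realize this is iterative PCA imputation: initialize the missing entries with column means, then alternately fit PCA on the current matrix and overwrite the missing cells with the low-rank reconstruction. The sketch below is only illustrative (the function name pca_impute and its parameters are assumptions, and n_components must not exceed the smaller dimension of the data matrix):

import numpy as np
from sklearn.decomposition import PCA

def pca_impute(X, n_components=2, n_iter=50):
    """Iterative PCA imputation: start from column means, then repeatedly
    project onto the top principal components (estimated from the current
    matrix) and replace only the missing entries with the reconstruction."""
    X = np.asarray(X, dtype=float).copy()
    missing = np.isnan(X)

    # Initialize missing entries with the column means of the observed data.
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])

    for _ in range(n_iter):
        pca = PCA(n_components=n_components)
        scores = pca.fit_transform(X)          # project onto principal axes
        X_hat = pca.inverse_transform(scores)  # low-rank reconstruction
        X[missing] = X_hat[missing]            # update only the missing cells
    return X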
5.2 Feature Scaling
Supervised/unsupervised machine learning algorithms have been widely used for the
development of data-driven anomaly detection, diagnosis, and prognosis methods.
Additionally, the use of high-dimensional data is indispensable for PHM of complex
electronics. However, if the dimensions are not normalized to a similar scale, the
output of the machine learning algorithms can be biased toward the large-scale features.
For example, many classifiers calculate the distance between two points using the
Euclidean distance; if one of the features has a broad range of values, the distance will be
governed by this particular feature. Accordingly, feature scaling (or data normalization),
which standardizes the range of independent variables or features of the data, is one of the
critical tasks in data pre-processing, and this section primarily presents well-known
normalization methods used in PHM.
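The effect can be demonstrated with a small, made-up example (the feature names and numbers below are illustrative only): without scaling, the Euclidean distance is dominated by the feature with the largest range, whereas Min-Max scaling brings both features onto a comparable scale.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales: temperature (deg C) and vibration (g).
X = np.array([
    [25.0, 0.02],
    [90.0, 0.03],
    [30.0, 0.90],
])

# Without scaling, the Euclidean distance is governed by the temperature column.
print(np.linalg.norm(X[0] - X[1]))  # large: driven almost entirely by 25 vs 90
print(np.linalg.norm(X[0] - X[2]))  # small, even though vibration differs strongly

# Min-Max scaling maps both features to [0, 1] so they contribute comparably.
X_scaled = MinMaxScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))
print(np.linalg.norm(X_scaled[0] - X_scaled[2]))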
The Min–Max normalization method scales the values of feature X of a dataset
according to its minimum and maximum values. That is, the method converts a value x
2 Visit https://github.com/calceML/PHM.git for hands-on practice in feature scaling.