methods such as partitioning, hierarchical, and density-based
methods. Since hierarchical clustering[30,31] is a class of
simple, high-efficiency methods applicable to high-dimensional data, it is
adopted in this paper for automatically partitioning the operations
into several clusters. For multi-operation processes, changes in the
set values indicate changes in the operation.
Hence, hierarchical clustering is implemented according to the
similarity of set values.
The similarity of set values between batches is defined as
follows:

$$g(S_i, S_k) = 1 - \frac{\|S_i - S_k\|^2}{J} = 1 - \sum_{j=1}^{J} \frac{\|S_{i,j} - S_{k,j}\|^2}{J}, \quad i, k = 1, \ldots, I$$

where $S_i$ and $S_k$ are the set-value matrices of the $i$th and $k$th batches,
$S_{i,j}$ and $S_{k,j}$ are the set values of the $j$th variable of $S_i$ and $S_k$,
and $g_{i,k} = g(S_i, S_k) \in [0, 1]$. A larger $g_{i,k}$ means greater similarity between
batch data.
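
As a minimal sketch of the similarity measure defined above, the following Python function evaluates g for two batches. It assumes, beyond what the text states, that each batch's set values are stored as a vector of length J and have already been scaled to [0, 1] per variable, which is what keeps g within [0, 1]. The function name `similarity` is illustrative only.

```python
import numpy as np

def similarity(s_i, s_k):
    """Similarity g = 1 - ||s_i - s_k||^2 / J between two set-value vectors.

    Assumption: both vectors have length J and are scaled to [0, 1].
    """
    s_i = np.asarray(s_i, dtype=float)
    s_k = np.asarray(s_k, dtype=float)
    if s_i.shape != s_k.shape:
        raise ValueError("set-value vectors must have the same shape")
    J = s_i.size
    return 1.0 - np.sum((s_i - s_k) ** 2) / J

# Identical set values give the maximum similarity of 1.
print(similarity([0.5, 0.5], [0.5, 0.5]))  # 1.0
# Each variable differs by 0.5, so g = 1 - (0.25 + 0.25) / 2.
print(similarity([0.5, 0.5], [0.0, 1.0]))  # 0.75
```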
The hierarchical clustering method is applied according to $g_{i,k}$;
the specific algorithm is as follows:
(1) Calculate the similarity $g_{i,k}$ between every pair of batches,
find the two batches with the maximum $g$, and merge
them into a cluster.
(2) Calculate the average set value of the merged batches
and recalculate the similarity $g$ between batches or clusters;
find the batches or clusters with the maximum $g$ and
merge them into a new cluster.
(3) Repeat step (2) until $g < a$, where $a$ is the similarity threshold.
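
The three steps above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' implementation: it assumes set values are vectors scaled to [0, 1], represents each cluster by the mean set value of all its member batches, and checks the threshold before every merge (including the first one, which is a small simplification of step (1)). The names `similarity` and `cluster_batches` and the example data are hypothetical.

```python
import numpy as np

def similarity(s_i, s_k):
    """g = 1 - ||s_i - s_k||^2 / J for set-value vectors of length J."""
    s_i = np.asarray(s_i, dtype=float)
    s_k = np.asarray(s_k, dtype=float)
    return 1.0 - np.sum((s_i - s_k) ** 2) / s_i.size

def cluster_batches(set_values, a):
    """Merge the most similar batches/clusters until the best g falls below a.

    set_values: array of shape (I, J), one row of set values per batch.
    Returns a list of clusters, each a sorted list of batch indices.
    """
    S = np.asarray(set_values, dtype=float)
    clusters = [[i] for i in range(len(S))]
    while len(clusters) > 1:
        # Average set value of the batches in each cluster (step 2).
        centres = [S[members].mean(axis=0) for members in clusters]
        best_g, best_pair = -np.inf, None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                g = similarity(centres[p], centres[q])
                if g > best_g:
                    best_g, best_pair = g, (p, q)
        if best_g < a:  # stopping rule of step 3
            break
        p, q = best_pair
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]

# Five hypothetical batches with two set-value variables each.
S = [[0.10, 0.20], [0.12, 0.22], [0.80, 0.90], [0.82, 0.88], [0.11, 0.21]]
print(cluster_batches(S, a=0.95))
print(cluster_batches(S, a=0.40))
```

With the larger threshold the two groups of batches stay separate, while the smaller threshold merges everything into one cluster, which mirrors the trade-off between accuracy and the number of sub-models.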
The threshold $a$ ($0 < a < 1$) determines the accuracy and
complexity of the developed cluster-based sub-models.
Small values of $a$ result in coarse clustering and less accurate
modelling, whereas large values of $a$ improve modelling accuracy
but require more sub-models and increase the modelling complexity.
To facilitate understanding, we illustrate the clustering process
with Figure 1. Assume that there are five batches and that each
batch has two stages. Data shown with the same colour and
drawing style have similar set values and can therefore
be clustered together by the hierarchical clustering
method. In Figure 1, the number of clusters is 2 for each stage;
the subscript of X indicates the batch index, and the superscript of X
indicates the stage index. Data belonging to the same cluster have
set values in the same range.
Data Preprocessing
For a batch process, the data used for modelling form a three-way
matrix $X(I \times J \times K_i)$, $i = 1, 2, \ldots, I$, where $I$ is the number of
reference batches, $J$ is the number of selected process variables,
and $K_i$ is the number of samples in each batch. As mentioned above,
the duration $K_i$ differs between batches. The data belonging to
the same cluster are designated as $X'(I' \times J \times K'_i)$, where $I'$ is the
number of batches belonging to the same cluster and $K'_i$ is the number
of samples in batch $i$, $i = 1, 2, \ldots, I'$. In real industrial processes,
since the sequence of operating steps is quite complex and almost
non-reproducible, no attempt is made to synchronize the
time evolution of the batches. The three-way matrix is therefore
unfolded into a bi-dimensional array, designated as $X'(K'_i I' \times J)$,
by a variable-wise technique.[32]
Since a series of manual
operations makes the bi-dimensional array data follow a non-Gaussian
distribution, the array data should be pre-processed.
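
A minimal sketch of variable-wise unfolding for batches of unequal duration is given below. It assumes, for illustration only, that each batch in a cluster is stored as a NumPy array of shape (K_i, J), with samples in rows and variables in columns; the batches are then stacked along the sample axis so that every row of the result is one sample of the J variables. The function name `unfold_variable_wise` and the toy data are hypothetical.

```python
import numpy as np

def unfold_variable_wise(batches):
    """Stack batches of shape (K_i, J) into one array of shape (sum of K_i, J).

    Because the batches keep their own durations K_i, no time alignment
    is needed; only the number of variables J must agree.
    """
    batches = [np.asarray(b, dtype=float) for b in batches]
    J = batches[0].shape[1]
    for b in batches:
        if b.ndim != 2 or b.shape[1] != J:
            raise ValueError("every batch must have shape (K_i, J) with the same J")
    return np.vstack(batches)

# Three hypothetical batches with 4, 6 and 5 samples of J = 3 variables.
batches = [np.zeros((4, 3)), np.ones((6, 3)), np.full((5, 3), 2.0)]
unfolded = unfold_variable_wise(batches)
print(unfolded.shape)  # (15, 3)
```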
For most MSPM methods, the batch data often need to be
normalized by the z-score method before modelling. This
standardization brings each variable to the same level, with zero
mean and unit variance. For a training dataset $X \in \mathbb{R}^{K \times J}$, where $K$ is the
number of samples and $J$ is the number of process variables, the
normalizing process of the z-score method is expressed as follows:

$$\tilde{x}_n = \frac{x_n - E(X)}{S(X)}, \quad n = 1, 2, \ldots, K \tag{3}$$

$$E(X) = \frac{1}{K} \sum_{n=1}^{K} x_n \tag{4}$$

$$S(X) = \sqrt{\frac{1}{K - 1} \sum_{n=1}^{K} \left(x_n - E(X)\right)^2} \tag{5}$$

where $x_n$ is a sample in the training dataset, $E(X)$ is the mean
vector, and $S(X)$ is the standard deviation vector. The z-score
method brings each variable to the same level when
the training data follow a Gaussian distribution. However,
problems arise when the z-score method is applied to multi-operation
data. Because the z-score method uses a constant mean and standard
deviation computed from the entire dataset,
a change of operation can shift the mean of the data substantially,
so the standard deviation is likely to
change dramatically as well. The dataset might therefore still follow a
non-Gaussian distribution after z-score normalization. To make the data in multi-operation
Figure 1. Cluster process: (a) original data, (b) in the first stage, and (c) in the second stage.
VOLUME 94, OCTOBER 2016 THE CANADIAN JOURNAL OF CHEMICAL ENGINEERING
1967