32 W. Gan et al. / Knowledge-Based Systems 143 (2018) 30–41
Recently, a novel list-based algorithm named HUI-Miner [33] is
proposed to efficiently mine HUIs without candidate genera-
tion. It introduces a new data structure named utility-list which
stores the actual utility and remaining utility of an itemset
within each transaction in the processed database. Thus, the one-
phase HUI-Miner algorithm significantly outperforms the previ-
ous Two-Phase [34] and pattern-growth tree-based algorithms
(i.e., IHUP [6] , UP-Growth [40] , and UP-Growth + [41] ). After
that, two enhanced versions of HUI-Miner, FHM [18] and HUP-
Miner [22] , are further developed. Experiments have shown that
both FHM [18] and HUP-Miner [22] significantly outperform
the existing state-of-the-art HUI-Miner algorithm for most static
databases. At the same time, another approach named d
2
HUP is
also proposed to mine HUIs without candidate generation-and-
test [35] . Ryang and Yun introduced an indexed list-based IMHUP
algorithm for mining HUIs [38] . Recently, another faster one-phase
approach called EFIM [45] is designed for mining HUIs and it out-
performs all the previous HUIM algorithms.
Developing efficient high-utility pattern mining algorithms is an
active research. Many algorithms are also extensively developed
for various problems of HUIM, such as mining HUIs from dynamic
environment databases [27] , mining on-shelf high-utility itemsets
that the products having different on-shelf behavior [23] , top- k
high-utility itemset mining without setting the minimum utility
threshold [42] , mining the up-to-date HUIs which can show the
recent trend [26] , HUIM from big data [32] . Different from the tra-
ditional definition of HUIM, Hong et al. introduced the concept
of high average utility itemset [16] , and some approaches have
been developed in recent years, such as MAU-Growth [21] . Re-
cently, some interesting issues of mining high-utility itemsets un-
der various constrains has been extensively studied, such as HUIM
with multiple minimum utility thresholds [30] , HUIM with consid-
eration of various discount strategies [29] , utility-based association
rule mining [36] , mining HUIs over uncertain databases [25] , min-
ing HUIs with both positive and negative unit profits [31] , and so
on.
2.2. Affinity/correlation pattern mining
Traditional algorithm of frequent pattern mining (FPM) [14] and
association rule mining (ARM) [2,3] uses the support (frequency)
of patterns as a constraint to prune the search space. However,
this support-based approach has major drawbacks [39,43] . To ad-
dress this issue, the problem of strong affinity pattern mining
has been extensively studied in recent year [7,20,37,39,43] . To find
strong affinity patterns containing low-support items, the con-
cept of hyperclique patterns [20] was proposed. It defines a new
measure named hyperclique (h)-confidence, which is equivalent to
the all-confidence, to discover hyperclique patterns. With the h-
confidence measure, a cross-support property is used to effectively
eliminate spurious patterns. Besides, three interesting measures
called any-confidence, all-confidence, and bond [37] were proposed
to find frequency-based or support-based correlated patterns.
The degree of the expectation-based correlation is highly in-
fluenced by the number of null transactions [43] , i.e., transac-
tions which do not contain items whose correlation has been mea-
sured. Hence, such measures are not suitable for the study of
correlations in large datasets, where the number of null transac-
tions could be large and unstable. For the problem of contrast-
ing positive and negative correlations, it is crucial to adopt a re-
liable correlation measure that is unconcerned with the number
of null-transactions present in the database. These measures are
called null (transaction) - invariant [43] . The main property of a
null-invariant measure is its independence of the total number of
transactions in a database. According to the study in [43] , all five
known null-invariant measures, including: all-confidence [37] , Co-
herence [37] , Cosine [43] , Kulczynsky [9] , can be viewed as a gen-
eralized mean of conditional probabilities. Kulczynsky [9] has the
null (transaction)-invariant property, which implies that the corre-
lation measure is independent of the dataset size [43] .
2.3. Comparative analysis with related work
Up to now, the problem of HUIM has been extensively stud-
ied. Summary of the developed HUIM algorithms are shown in
Table 1 . Although the above traditional models of HUIM can reflect
the utility of the itemsets which beyond a minimum utility value,
the inherent correlation of items inside the patterns has not been
considered. In real-world situations, the discovered HUIs may be
meaningless, invalid even misleading (have happened by chance)
if they are weakly correlated. Identifying correlation and utility
information in databases can provide valuable information. Thus,
it is a critical issue and a challenge to discover non-redundant
and more correlated HUIs from transaction databases. In the past
studies, the HUIPM algorithm [7] and the faster FDHUP algorithm
[28] were respectively proposed to simultaneously consider both
the frequency affinity and utility of patterns. FDHUP introduces
two compact structures named EI-table and FU-tree, and utilizes
three pruning strategies to reduce the search space, thus performs
better than the HUIPM algorithm [7] . Both of them only consider,
however, the co-occur frequency the transactions as the correla-
tion factor, which cannot discover the real inherent correlation of
among items inside the desired patterns.
Thus, it is an important and challenging task to design an ef-
ficient algorithm for extracting non-redundant correlated purchase
behaviors by utility measure. In this paper, the proposed method
aims at discovering non-redundant correlated high-utility itemsets
from transactional quantitative databases and has the following
main differences compared to the state-of-the-art algorithms for
mining high-utility itemsets.
• The mining goal w.r.t. derived patterns is different. All the
above algorithms except HUIPM are designed to discover high-
utility itemsets, while the CoHUIM algorithm aims at mining
the high positive correlated but interesting patterns which hav-
ing high utility values.
• The correlation measure used in this paper is quite different
from the support affinity measure used in HUIPM. The sup-
port affinity was designed only with the co-occurrence consid-
eration. It used the utility relationship instead of the presence
information to measure the correlation of the items inside an
itemset. Utility factor of objects is, however, the user-specified
important value, the subjective criteria may be wrong to mea-
sure the objective correlation relationship. Moreover, they are
out of our purposes since the designed approach is used to
avoid mining the misleading HUIs from transactional databases.
• Furthermore, we develop some techniques to improve mining
performance. Not only the projection technology is adopted, but
also a global sorted downward closure ( SDC ) property is used in
the proposed approach.
3. Preliminaries and problem statement
Let I = { i
1
, i
2
, . . . , i
m
} be a finite set of m distinct items in a
temporal transactional database D = { T
1
, T
2
, . . . , T
n
}, where each
transaction T
q
= { q (i
1
, T
q
) , q (i
2
, T
q
) , . . . , q (i
j
, T
q
) } is a subset of I ,
and has an unique identifier ( TID ). Notice that the q ( i
j
, T
q
) in each
T
q
is the different purchase quantity of each item. An unique profit
pr ( i
j
) is assigned to each item i
j
∈ I , which represents its impor-
tance (e.g., profit, interest, risk), and they are stored in a profit-
table ptable = { pr ( i
1
), pr ( i
2
), . . . , pr ( i
m
)}. An itemset X ∈ I with k
distinct items { i
1
, i
2
, . . . , i
k
} is of length k and is referred to as a