8 P. Fournier-Viger et al.
2.3 Key Properties of the Problem of High Utility Itemset
Mining
For a given quantitative database and minimum utility threshold, the problem of high
utility itemset mining always has a single solution. It is to enumerate all patterns that
have a utility greater than or equal to the user-specified minimum utility threshold.
The problem of high utility itemset mining is difficult for two main reasons. The
first reason is that the number of itemsets to be considered can be very large to find
those that have a high utility. Generally, if a database contains m distinct items there
are 2
m
− 1 possible itemsets (excluding the empty set). For example, if I ={a, b, c},
the possible itemsets are {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, and {a, b, c }. Thus, t here
are 2
3
− 1 = 7 itemsets, which can be formed with I ={a, b, c}. A naive approach
to solve the problem of high utility itemset mining is to count the utilities of all
possible itemsets by scanning the database, to then keep the high utility itemsets.
Although this approach produces the correct result, it is inefficient. The reason is
that the number of possible itemsets can be very large. For example, if a retail store
has 10,000 items on its shelves (m = 10, 000), the utilities of 2
10,000
− 1 possible
itemsets should be calculated, which is unmanageable using the naive approach. It is
to be noted that the problem of high utility itemset mining can be very difficult even
for small databases. For example, a database containing a single transaction of 100
items can produce 2
100
− 1 possible itemsets. Thus, the size of the search space (the
number of possible itemsets) can be very large even if there are few transactions in
a database. In fact, the size of the search space does not only depend on the size of
the database, but also on how similar the transactions are in the database, how large
the utility values are, and also on how low the mi nut il threshold is set by the user.
A second reason why the problem of high utility itemset mining is difficult is
that high utility itemsets are often scattered in the search space. Thus, many itemsets
must be considered by an algorithm before it can find the actual high utility itemsets.
To illustrate this, Fig. 1 provides a visual representation of the search space for the
running example, as a Hasse diagram. A Hasse diagram is a graph where each
possible itemset is represented as a node, and an arrow is drawn from an itemset
X to another itemset Y if and only if X ⊆ Y and |X|+1 =|Y |.InFig.1,high
utility itemsets are depicted using light gray nodes, while low utility itemsets are
represented using white nodes. The utility value of each itemset is also indicated.
An important observation that can be made from that figure is t hat the utility of an
itemset can be greater, higher or equal, to the utility of any of its supersets/subsets.
For example, the utility of the itemset {b, c} is 28, while the utility of its supersets
{b, c, d} and {a, b, c, d, e} are 34 and 25, respectively. Formally, it is thus said that
the utility measure is neither monotone nor anti-monotone.
Property 1 (The utility measure is neither monotone nor anti-monotone) Let there
be two itemsets X and Y such that X ⊂ Y . The relationship between the utilities of
X and Y is either u(X)<u(Y ),u(X)>u(Y ),oru(X) = u(Y ) [83].