trajectory classification methods which define a distance function
to measure the similarity between trajectories also require that
the physical locations are available [18,22]. The popular distance
functions include Euclidean Distance [23], Hausdorff Distance
[24], Fréchet Distance [25] and Dynamic Time Wrapping Distance
(DTW) [26], etc.
For clustering and classifying cell-id trajectory data with no
regard to physical locations, the most challenging problem is to
define a similarity measure that can effectively capture the rela-
tions between cell-id trajectories. Görnerup [27] quantified the
similarity between two cell-id trajectories based on common sub-
set measure (i.e. the ratio of the number of common cell-ids shared
by the two cell-id trajectories to the total number of unique
cell-ids in the two cell-id trajectories). However, the common sub-
set measure does not account for the sequential nature of the
cell-id trajectories, basically reduced to an unordered set of
cell-ids to be calculated. Bayir et al. [28] formed cell tower clusters
by examining the oscillation patterns (oscillation is the phe-
nomenon that a mobile phone switches back and forth constantly
between multiple cell towers), and converted a cell-id trajectory to
a sequence of clusters. Then, the cell-id trajectory similarity could
be calculated by using such as Longest Common Subsequence
(LCSS) [29]. However, cell towers are almost uniformly distributed
in physical space, so it is always difficult to determine the border of
a cluster. Laasonen [30] and Yavas et al. [31] treated a cell-id tra-
jectory as a string, and defined the similarity measure based on
string alignment algorithms [32]. However, the similarity between
cell towers themselves is not taken into account. Thus, two cell-id
trajectories which differ from each other with respect to multiple
similar cell towers (cell towers which are spatially close to each
other) would be considered as dissimilar ones.
3. Method
3.1. Preliminary
The basic data explored in this paper are cell-id trajectories
defined as follow. Fig. 1 gives an overview of the system architec-
ture of our method. The core of our method is a cell-id trajectory
similarity measure that takes into account the cell tower similarity
(which is explored based on handoff patterns implied in the cell-id
trajectory data), the order of cell-ids in the cell-id trajectory and
the time duration of the connected cell towers. The trajectory clus-
tering algorithm uses the similarity measure to divide mobile
phone users’ cell-id trajectories into different groups to form route
patterns, and the route classification algorithm uses the similarity
measure to match the current cell-id trajectory of the mobile
phone user to one of the route patterns. The route pattern here
can be viewed as a representative cell-id trajectory, which reflects
the repeated moving behaviors of the mobile phone user.
Definition 1. A cell-id trajectory is a sequence of cell-id logs
CTraj = hCL
1
,...,CL
n
i. Cell-id log CL
k
is a pair CL
k
=(ID, Duration),
where ID and Duration are respectively identifier and time duration
of the connected cell towers.
3.2. Cell tower similarity measure
Since a cell-id trajectory is represented by a sequence of
cell-ids, it could be processed as a string [30,31]. However, it
ignores the potential similarity between cell towers. As shown in
Fig. 2, three cell-id trajectories (i.e. Traj
1
, Traj
2
and Traj
3
) move
among the area covered by multiple cell towers. These three trajec-
tories have almost totally different string representation: Traj
1
cor-
responds to ‘‘ACFJOSV’’, Traj
2
corresponds to ‘‘AGKPTW’’ and Traj
3
corresponds to ‘‘ABEIN’’. However, Traj
1
should have much higher
similarity with Traj
2
than with Traj
3
since they pass through the
area covered by cell towers that are spatially close to each other.
However, the physical locations of the cell towers are not avail-
able, and it is impossible to capture the spatial closeness between
cell towers by only observing their identifiers. Thus, we try to
explore the similarity between cell towers by analyzing their hand-
off patterns. Given a pair of cell towers, we define their handoff
pattern in a cell-id trajectory as the mutual switches between each
other. This idea is based on the fact that handoff usually occurs
from one cell tower to its neighboring cell towers in order to guar-
antee the continuation of active calls [33], so switch between a pair
of cell towers with higher spatial closeness often occurs more fre-
quently in a shorter interval.
Given a cell tower identifier pair (ID
1
, ID
2
) and a set of cell-id
trajectories TS ={T
1
,T
2
,...,T
N
}, we calculate the similarity score of
ID
1
and ID
2
taking into account both the frequency and interval of
their mutual switches as Eq. (1). In the equation, TS
1
is a subset
of TS in which each cell-id trajectory contains both ID
1
and ID
2
, TS
2
is another subset of TS in which each cell-id trajectory contains
either ID
1
or ID
2
. SN
i
is the number of mutual switches between
ID
1
and ID
2
in T
i
, interval
j
i
is the number of other cell-ids between
ID
1
and ID
2
in the jth switch in T
i
. SC(ID
1
, ID
2
) represents the final
similarity score of ID
1
and ID
2
. The inverse tangent function is
used to ensure that the similarity score falls in the range of
[0,1]. In order to verify the effectiveness of Eq. (1), we analyze
the correlation between cell tower similarities (calculated based
on Eq. (1) ) and cell tower distances (calculated based on cell
towers’ physical locations which could be queried from online
war-drivin g databases). As shown in Fig. 3, the similarity
generally increases as the distance decreases. It verifies the idea
behind Eq. (1), i.e. to give larger similarity for cell tower pairs that
are spatially closer to each other.
SCðID
1
; ID
2
Þ¼
2
p
tan
1
P
jTS
1
j
i¼1
P
SN
i
j¼1
1
1þinter
v
al
i
j
jTS
1
jþjTS
2
j
ð1Þ
Example 1. We give an example to explain the cell tower
similarity calculation process. Given a cell tower pair (c
1
, c
2
) and
TS ={T
1
, T
2
, T
3
}, where T
1
= hc
1
, c
2
, c
1
, c
3
, c
1
, c
3
, c
4
, c
2
i, T
2
= hc
3
, c
1
, c
4
,
c
3
i, T
3
= hc
5
, c
6
, c
5
, c
6
i, TS
1
={T
1
}, TS
2
={T
2
} (the connected time
durations of the cell towers are ignored here for simplification),
there are three switches between c
1
and c
2
in T
1
(i.e. SN
1
= 3): The
first switch occurs from c
1
(at index = 1) to c
2
(at index = 2), the
second switch occurs from c
2
(at index = 2) to c
1
(at index = 3), and
the third switch occurs from c
1
(at index = 5) to c
2
(at index = 8).
We do not force the switches to occur at consecutive positions
since the mobile phone might constantly dither between several
cell towers before transferring to a proper one due to the
oscillation problem [28]. Thus, the similarity score of (c
1
, c
2
)
in T
1
is 1/(1 + 0) + 1/(1 + 0) + 1/(1 + 2) = 2.33, and SC(c
1
, c
2
)=
tan
1
(2.33/(1 + 1)) (2/
p
) = 0.55.
3.3. Cell-id trajectory similarity measure
Since the cellular handoff patterns are proven to be stable over
the same route [18], two cell-id trajectories are similar if they have
long time duration to connect to similar cell towers in similar
orders. Based on this intuition, we propose an optimal alignment
algorithm to find the optimal alignment of two cell-id trajectories,
and then the similarity between the two cell-id trajectories is cal-
culated based on the optimal alignment. The proposed optimal
alignment algorithm is based on the concept of DNA alignment
[34]. However, the difference is that we should take the following
facts into account.
M. Lv et al. / Knowledge-Based Systems 89 (2015) 181–191
183