Topic Model Based Behaviour Modeling and
Clustering Analysis for Wireless Network Users
Bingjie Leng, Jingchu Liu, Huimin Pan, Sheng Zhou, and Zhisheng Niu
Tsinghua National Laboratory for Information Science and Technology
Department of Electronic Engineering
Tsinghua University, Beijing 100084, China
Email: {lengbj14, liu-jc12, phm13}@mails.tsinghua.edu.cn, {sheng.zhou, niuzhs}@tsinghua.edu.cn
Abstract—User behaviour analysis based on traffic logs in wireless
networks can benefit many real-life fields: not only
commercial applications, but also network service
quality and social management. We cluster users into groups
marked by the most frequently visited websites to find their
preferences. In this paper, we propose a user behaviour model
based on the topic model used in document classification. We
use the logarithmic TF-IDF (term frequency - inverse document
frequency) weighting to form a high-dimensional sparse feature
matrix. Then we apply LSA (latent semantic analysis) to infer
the latent topic distribution and generate a low-dimensional
dense feature matrix. K-means++, which is a classic clustering
algorithm, is then applied to the dense feature matrix and several
interpretable user clusters are found. Moreover, by combining
the clustering results with additional demographic information,
including age, gender, and financial information, we are able to
uncover real-world implications of the clustering results.
Keywords—traffic log, user behaviour modeling, clustering
analysis, topic model.
I. INTRODUCTION
Thanks to the wide adoption of smart devices such as
smart phones and tablets, nowadays people can perform an
unprecedented number of tasks online, ranging from news
and finance to social networking and gaming. As a consequence, Internet
browsing logs in wireless networks have become an essential
source of information for analyzing users’ hidden preferences
and inferring their real-life behaviour. With a deeper understanding
of the usage patterns of mobile users, network service
providers can offer more personalized services and
improve service quality. Users’ browsing interests
are also helpful in fields such as urban planning, mobile
advertisement, transportation, and education [1–3].
The most naive way to extract user behaviour from the
Internet browsing dataset is to observe the long-term global
statistics of various websites. However, such aggregate statistics mask
individual differences. Conversely, if we analyze each user in
isolation, the similarities between users’
browsing habits are ignored. Hence, clustering is an
efficient method to strike a balance between these two extremes
and extract the average behaviour of a group of users who have
similar browsing history. Therefore, we design and implement
a process to cluster similar users into groups, each of which
is labeled by the type of frequently visited websites.
To apply clustering algorithms, the first step is to represent
users with a profile vector through user behaviour modeling.
In this paper, we propose a user behaviour modeling method
based on the topic model, which was originally proposed for
document classification, to generate an original profile matrix.
To enhance the discriminative power of the original matrix, we
apply TF-IDF (term frequency - inverse document frequency)
weights to produce a high-dimensional sparse feature matrix.
Using latent semantic analysis (LSA) [4],
we then obtain a low-dimensional feature matrix that reflects
each user’s distribution over latent topics. Finally,
clustering algorithms such as K-means++ can be applied to
the final feature matrix and the clustering results are analyzed.
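The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the implementation used in the paper: the weighting (1 + ln tf) · ln(N / df) is only one common form of logarithmic TF-IDF, LSA is approximated by power iteration rather than a library SVD so that the sketch needs only the standard library, and the function names and toy visit counts are hypothetical.

```python
import math
import random


def log_tfidf(counts):
    """Weight visit counts as (1 + ln tf) * ln(N / df); zero counts stay zero.

    Assumed variant of logarithmic TF-IDF, with users as "documents"
    and websites as "terms".
    """
    n_users, n_sites = len(counts), len(counts[0])
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_sites)]
    idf = [math.log(n_users / d) if d else 0.0 for d in df]
    return [[(1.0 + math.log(c)) * idf[j] if c > 0 else 0.0
             for j, c in enumerate(row)] for row in counts]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def lsa(features, n_topics, n_iter=300):
    """Project users onto the leading right singular vectors of the matrix.

    The vectors are found by power iteration on A^T A with deflation.
    Assumes the matrix has rank of at least n_topics.
    """
    n_sites = len(features[0])
    basis = []
    for t in range(n_topics):
        v = [1.0 / (j + t + 1) for j in range(n_sites)]  # fixed starting vector
        for _ in range(n_iter):
            scores = [dot(row, v) for row in features]            # A v
            w = [sum(row[j] * s for row, s in zip(features, scores))
                 for j in range(n_sites)]                         # A^T (A v)
            for u in basis:                                       # remove found directions
                p = dot(w, u)
                w = [wj - p * uj for wj, uj in zip(w, u)]
            norm = math.sqrt(dot(w, w))
            v = [wj / norm for wj in w]
        basis.append(v)
    return [[dot(row, v) for v in basis] for row in features]


def kmeans_pp(points, k, n_iter=20, seed=0):
    """Lloyd's algorithm with k-means++ (distance-squared) seeding."""
    rng = random.Random(seed)
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        # Fall back to uniform sampling if every point coincides with a center.
        weights = d2 if sum(d2) > 0 else [1.0] * len(points)
        centers.append(list(rng.choices(points, weights=weights)[0]))
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy log: rows are users, columns are visit counts per website.
counts = [
    [10, 4, 0, 0, 7],
    [8, 6, 0, 0, 5],
    [12, 3, 1, 0, 9],
    [0, 0, 9, 5, 6],
    [0, 1, 7, 8, 4],
    [0, 0, 11, 6, 8],
]
features = log_tfidf(counts)        # high-dimensional, sparse
topics = lsa(features, n_topics=2)  # low-dimensional, dense
labels = kmeans_pp(topics, k=2)
print(labels)
```

On this toy input, the last column (a website that every user visits) receives zero weight, because its inverse document frequency is ln(1) = 0. This is the sense in which the weighting enhances discriminative power: sites visited by everyone contribute nothing to telling users apart.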
Concretely, we make the following contributions in this
paper:
• We analyze the similarities and differences between
network user modeling and document classification,
and propose to utilize text mining algorithms for
network user modeling problems.
• Based on an analysis of our dataset, we use
logarithmic TF-IDF to generate a sparse feature matrix
and use LSA for topic discovery and dimensionality
reduction. To our knowledge, this is the first study to
analyze user behaviour with a combination of these
tools.
• We extract users’ interests by clustering users with
similar browsing habits into groups. We also examine
our clustering results against additional demographic
information, including age, gender, and financial
information on campus over a five-month period. Clear
preference differences are found across
genders and age groups. This helps explain our clustering
findings and indicates that our algorithm
works effectively. Moreover, our findings can help with
many aspects of campus management.
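For orientation, the two tools named in the second contribution can be written compactly. The notation here is ours and assumes one common form of logarithmic TF-IDF; the exact weighting used in the paper may differ. Let $f_{ij}$ be the number of visits by user $i$ to website $j$, $N$ the number of users, $n_j$ the number of users who visited website $j$ at least once, and $k$ the number of retained topics.

```latex
w_{ij} =
\begin{cases}
(1 + \log f_{ij}) \, \log \dfrac{N}{n_j}, & f_{ij} > 0, \\[4pt]
0, & f_{ij} = 0,
\end{cases}
\qquad
W \approx U_k \Sigma_k V_k^{\top}
```

Under these assumptions, $W = (w_{ij})$ is the sparse feature matrix, the truncated singular value decomposition keeps the $k$ largest singular values, and each row of $U_k \Sigma_k$ serves as the low-dimensional topic representation of one user.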
The rest of the paper is organized as follows. Section
II introduces related work on user behaviour analysis in
WLANs. Section III presents the network user behaviour modeling
problem and explains its analogy with topic modeling
in document classification. In Section IV, we introduce our
datasets and the details of our algorithm implementation. In
Section V, we present the clustering results and explain the
findings. Finally, in Section VI, we conclude and discuss future
work.
II. RELATED WORK
With the rapid development of wireless networks, the
potential of user behaviour analysis has recently attracted tremendous
attention. The most common method for user
clustering is to apply K-means to the raw profile matrix.
For example, the web browsing similarity among users of
arXiv:1511.05618v1 [cs.SI] 17 Nov 2015