A Micro-Cluster-Based Data Stream Clustering Method For P2P Traffic
Classification
Guanghui Yan 1,a, Minghao Ai 2,b
1 School of Electrical and Computer Engineering, Lanzhou Jiaotong University
2 School of Electrical and Computer Engineering, Lanzhou Jiaotong University
a yangh9805@qq.com, b aiminghao@sina.com
Keywords: Peer-to-Peer Traffic Identification, Concept Drift, Micro-cluster Based Clustering
Abstract. Many machine learning techniques have been proposed to classify P2P traffic, each with
reasonable success. In a real P2P network environment, however, new communities of peers
frequently join and old communities frequently leave, so an identification method must be able to
cope with concept drift and update its model incrementally. In this paper, we present McStream, a
concept-adapting algorithm based on streaming data mining techniques, to identify P2P applications
in Internet traffic. McStream uses two micro-cluster structures, potential micro-clusters and outlier
micro-clusters, to classify P2P traffic and to discover concept drift with limited memory. Our
performance study on a number of real data sets captured at a main gateway router demonstrates the
effectiveness and efficiency of our method.
Introduction
With the development of Internet technology, accurate classification and identification of Internet
applications plays a very important role in many network tasks, such as fault monitoring, network
planning, flow prioritization and Internet security. Over the last few years, the use of peer-to-peer
(P2P) applications, such as file sharing, VoIP, VoD and media streaming, has grown dramatically
and now accounts for a significant portion of total Internet traffic. P2P technology allows data to be
transmitted between any two peers in a P2P network without limits on flow or bandwidth. This
transfer model exacerbates network congestion, degrades the performance of traditional
client-server applications and accelerates the spread of Internet viruses and pirated files. Identifying
P2P flows has therefore become an urgent problem.
Many papers on P2P identification have been published in international conferences and academic
journals since 2000. The initial approach to identifying P2P applications relied on mapping
applications to well-known port numbers, as discussed in Ref. [1]. However, [2, 3] confirmed that
this approach is no longer effective, because P2P file-sharing applications now use dynamic ports
for communication.
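As a minimal sketch of the port-based approach and of why dynamic ports defeat it, consider the lookup below. The port table and the function name `classify_by_port` are illustrative assumptions and are not taken from the cited works.

```python
# Hypothetical port table: the entries are conventional default ports used
# here for illustration only, not values taken from the paper.
WELL_KNOWN_PORTS = {
    80: "HTTP",
    6881: "BitTorrent",
    4662: "eDonkey",
}


def classify_by_port(src_port: int, dst_port: int) -> str:
    """Return the application mapped to either port, or 'unknown'."""
    for port in (dst_port, src_port):
        app = WELL_KNOWN_PORTS.get(port)
        if app is not None:
            return app
    return "unknown"


# A flow on the conventional port is recognized by the table lookup.
fixed = classify_by_port(51000, 6881)
# The same application on a dynamically chosen port is missed.
dynamic = classify_by_port(51000, 42017)
```

Under these assumptions, `fixed` is labeled as BitTorrent while `dynamic` falls through to "unknown", which is the failure mode reported for port-based identification.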
The second approach is payload-based analysis, described in [4, 5]. This approach, however, faces
several technical problems. First, its signature list has to be updated frequently to keep up with
changes in P2P applications. Moreover, these techniques fail to detect encrypted traffic, and many
P2P applications have begun to use encryption.
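The limitation of payload-based analysis with respect to encryption can be illustrated with a small sketch. The signature table, the function name `classify_by_payload` and the single-byte XOR used as a stand-in for real encryption are all assumptions made for illustration. They do not reproduce the methods of [4, 5].

```python
# Hypothetical signature list with one entry: the BitTorrent handshake
# begins with the byte 0x13 followed by the string "BitTorrent protocol".
SIGNATURES = {
    "BitTorrent": b"\x13BitTorrent protocol",
}


def classify_by_payload(payload: bytes) -> str:
    """Return the application whose signature prefixes the payload."""
    for app, signature in SIGNATURES.items():
        if payload.startswith(signature):
            return app
    return "unknown"


handshake = b"\x13BitTorrent protocol" + bytes(8)

# Plaintext payload: the signature is found.
plain = classify_by_payload(handshake)

# Obfuscated payload: XOR with a fixed key byte is only a toy stand-in for
# real encryption, but it already hides the signature from the matcher.
obfuscated = bytes(b ^ 0x5A for b in handshake)
hidden = classify_by_payload(obfuscated)
```

In this sketch, supporting a new or changed application means editing `SIGNATURES` by hand, and any transformation of the payload bytes makes the match fail.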
To overcome the above-mentioned limitations, machine learning algorithms were introduced to
identify Internet traffic, and each of them achieves satisfactory results in [7-9]. However,
machine-learning-based identification of P2P applications confronts several challenges:
1. Labeled samples are very scarce compared with unlabeled ones. Classifiers that traditional
supervised learning methods build from a few labeled samples do not perform well when they are
used to classify unknown samples.
2. Concept drift cannot be neglected in a real P2P flow environment. In a P2P environment, new
communities of peers often join and old communities of peers often leave, which makes the
distribution of samples change dynamically. So, the optimal classifiers built on old samples may not
Applied Mechanics and Materials Vols. 263-266 (2013) pp 1121-1126
© (2013) Trans Tech Publications, Switzerland
doi:10.4028/www.scientific.net/AMM.263-266.1121