Clustering-preserving Network Flow Sketching
Yongquan Fu
1
, Dongsheng Li
1
, Siqi Shen
1
, Yiming Zhang
1
, Kai Chen
2
1
Science and Technology Laboratory of Parallel and Distributed Processing,
College of Computer Science, National University of Defense Technology
2
SING Lab, Hong Kong University of Science and Technology
Abstract—Network monitoring is vital in modern clouds and
data center networks that need diverse traffic statistics ranging
from flow size distributions to heavy hitters. To cope with
increasing network rates and massive traffic volumes, sketch
based approximate measurement has been extensively studied
to trade the accuracy for memory and computation cost, which
unfortunately, is sensitive to hash collisions.
This paper presents a clustering-preserving sketch method
to be resilient to hash collisions. We provide an equivalence
analysis of the sketch in terms of the K-means clustering.
Based on the analysis result, we cluster similar network flows
to the same bucket array to reduce the estimation variance and
use the average to obtain unbiased estimation. Testbed shows
that the framework adapts to line rates and provides accurate
query results. Real-world trace-driven simulations show that LSS
remains stable performance under wide ranges of parameters and
dramatically outperforms state-of-the-art sketching structures,
with over 10
3
to 10
5
times reduction in relative errors for per-
flow queries as the ratio of the number of buckets to the number
of network flows reduces from 10% to 0.1%.
Index Terms—sketch, random projection, hash collision, clus-
tering
I. INTRODUCTION
Network measurement is of paramount importance for traf-
fic engineering, network diagnosis, network forensics, intru-
sion detection and prevention in clouds and data centers, which
need a variety of traffic measurement, such as delay, flow
size estimation, flow distribution, heavy hitters [1], [2], [3].
Recently, the self-running network proposal [4], [5] highlights
an automatic management loop for large-scale networks with
timely and accurate data-driven network statistics as the driv-
ing force for machine learning techniques.
Network-flow monitoring is challenging due to ever increas-
ing line rates, massive traffic volumes, and large numbers of
active flows [6], [7], [8], [9]. Traffic statistics tasks require
advanced data structures and traffic statistical algorithms.
Many space- and time-efficient approaches have been studied,
e.g., traffic sampling, traffic counting, traffic sketching. Com-
pared to other approaches, the sketch has received extensive
attentions due to their competitive trade off between space
resource consumption and query efficiency. Further, multiple
sketch structures can be composed for joint traffic analytics.
Existing sketch structures [10], [11], [12], [13] hash in-
coming packets to randomly chosen buckets and take the
This work was sponsored in part by National Key R&D Program of China
under Grant No. 2018YFB0204300, and the National Natural Science Foun-
dation of China (NSFC) under Grant No. 61972409, 61602500, 61402509,
61772541, 61872376.
accumulated counter in these buckets as the estimator. Re-
cently, OpenSketch [14], UnivMon [15], SketchVisor [16],
ElasticSketch [17], and SketchLearn [18] further extend the
generality of the sketch structure to support diverse monitoring
tasks.
The sketch based monitoring approach has a degree of
approximation error due to hash collisions of incoming items,
as multiple keys may be mapped to the same bucket. Hash
collisions are inevitable due to the randomness of the hash
functions. Thus existing methods typically keep multiple in-
dependent copies of the sketch structure and find the least
affected ones as the estimator. However, this approach wastes
the space significantly. Recently, several approaches [17], [18]
propose to separate large items from the rest into a hash table
to reduce the estimation error. Unfortunately, the hash table
needs to allocate dedicated space for new items, thus it is less
efficient than the sketch with a constant-size bucket structure.
Thus, finding a space-efficient approach that is resilient to hash
collisions is an open question.
We present a new class of sketch called locality-sensitive
sketch (LSS for short) that is resilient to hash collisions.
LSS approximately minimizes the estimation error based on
a theoretical equivalence relationship between the sketching
error and the approximation error of the K-means clustering in
Sec. IV. This equivalence provides two important insights for
the sketching methodology: clustering similar items together
reduces the approximation error, and averaging the bucket
counter obtains an unbiased estimator. We exploit these two
theoretical insights to the design of a locality-sensitive sketch
structure.
We adapt to online and dynamic network flows with back-
ground clustering and lightweight temporal caching tech-
niques. First, we maintain the clustering model in a back-
ground and periodical process, which obtains close-to-date
samples and trains a clustering model that enables mapping
online flow records with up-to-date cluster centers. Second, the
insertion process should deal with incremental flow counters,
since the flow size grows as packets are delivered. We adapt to
monodically increasing flow counters with a temporal cache
based on a lightweight Cuckoo hash table [19], [20], [21].
We perform extensive evaluation in Section VI. Testbed
shows that the framework adapts to line rates and provides
accurate query results. Trace-driven study reveals that LSS
remains stable performance under wide ranges of parameters
and dramatically outperforms state-of-the-art sketching struc-
tures, with over 10
3
to 10
5
times reduction in relative errors