1600 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 27, NO. 4, AUGUST 2019
set of latent factors, which may not hold when node pairs
experience diverse hidden factors. Our work addresses this
challenge via a skewness-aware matrix factorization model.
C. Matrix Completion
Our study is related with the matrix-completion theory [5],
which recovers an incomplete matrix via a subset of observed
entries. For a rank-rm×n matrix (r (m, n)) that meets an
incoherent condition
1
, a unique rank-r matrix can be recovered
with a high probability. Minimizing the matrix rank exactly is
NP-hard [5]. OR1MP [43] iteratively finds a rank-one matrix
out of the approximation residual with the SVD. However,
it is generally impossible to exactly recover the SVD result
for a partially observed matrix. Further, different node pairs
are likely to be correlated with heterogeneous latent factors.
D. Traffic Matrix Interpolation
Real-world traffic matrices are usually incomplete. Con-
sequently, interpolating missing entries becomes important.
Traffic matrix interpolation is a related, but different problem,
with different properties. Xie et al. [46]–[48] exploit hidden
spatial and temporal structures with three-dimensional low-
rank tensors, which effectively reduces the estimation error.
Zhang et al. [53] interpolate incomplete traffic matrices with
structure regularized low-rank matrix factorization and local
interpolation procedures. LENS [7] models the traffic matrices
as the sum of multiple matrices that are positively correlated
with the traffic matrix. Our work proposes a unifying model
that keeps the low-rank interpretation and adapts well to
skewed latent factors.
III. P
ROBLEM STATEMENT
We first present the measurement environment for
the mesh-structured cloud services, then introduce the
matrix-factorization results, and discuss the open questions.
A. Measurement Architecture
A service mesh typically consists of a set of nodes located in
mega data center networks or edge data-center networks. Each
node hosts a set of networked micro-services, as discussed
in the introduction. Service-to-service communication is fre-
quent, while the latency between sending service requests and
obtaining responses should meet network SLAs. We assume
that, the service mesh should have synchronized their clocks,
as otherwise we could not correlate the network problems in
different locations. The synchronization protocols such as Net-
work Time Protocol (NTP) [2] or the IEEE 1588 Precise Time
Protocol (PTP) [1] can provide millisecond-level precision for
geo-distributed nodes.
The measurement system is comprised of two main com-
ponents inspired by the software defined networks [23], [33],
[50]: a data plane that consists of service-mesh nodes and a
control plane on a logically centralized server.
(i) At the control plane, the logically centralized controller
schedules the Round-Trip Time (RTT) measurement process
1
Incoherence [5] states that the singular vectors spread out to help the matrix
be loosely aligned with the coordinate axes.
in the data plane. The controller randomly samples a small list
of nodes as probing targets for each service-mesh node. The
choices of probing targets are randomized for different nodes
for load balancing. The number of probing targets depends on
the node’s measurement capability. For a scale of hundreds
of nodes, selecting tens of probing targets suffices to obtain a
good level of accuracy.
Further, the controller handles the churns of nodes, since
an offline node is useless and should be detected and fil-
tered. Accordingly, the controller keeps the online status of
the data plane as volatile states in the main memory. Each
online node periodically sends a heart-beating message to the
centralized controller to notify its online status; as a response,
the controller piggybacks a list of sampled online nodes. The
frequency of the heart-beat messages is platform-dependent,
where stable platforms could choose a long period, while edge
platforms should choose a relatively short period (e.g., one
minute) to reflect system churns.
(ii) At the data plane, each service-mesh node performs a
number of measurements towards other nodes in the same
platform. It downloads the list of probing targets from the
controller, and measures the RTT status towards these probing
targets in a periodical approach. After collecting the RTT
samples in an interval, each node uploads the RTT results
to the persistent storage that is accessed by the controller.
The data plane could use any kinds of measurement meth-
ods. For example, at the network or transport level, the data
plane may choose ICMP or TCP protocol based measurement
methods; at the application level, the data plane could use
RPC or HTTP protocol based methods. Generally, the RTT
value amounts to the absolute difference between the time
of sending a request message to the probing target and that
of receiving the response message from this probing target.
The unit of a measurement interval determines the granularity
of the monitoring process. Increasing the sampling interval
towards a probing target yields a coarser granularity.
B. Challenges for RTT Matrix Completion
For a set of N nodes, the pairwise RTTs between N nodes
in an interval can be represented as a N-by-N matrix D.
The state-of-the-art approaches predict pairwise RTT values
based on the matrix factorization approach, which factorizes
amatrixD ∈ R
N×N
as a product of two low-dimensional
factor matrices F ∈ R
N×r
and G ∈ R
N×r
,i.e.,D ≈ FG
T
,
where r N,and
T
denotes the transpose of a matrix.
A matrix factorization model is equivalent to a sum of a set
of rank-one matrices:
ˆ
D = FG
T
=
k
F
k
G
T
k
(1)
where F
k
,G
k
denote the k-th (k ≤ r) column vector of the
matrix F and G, respectively. An objective function seeks
to minimize the approximation residual between the observed
entries and the sum of the rank-one matrices:
min
F,G
D −
r
k=1
F
∗k
G
T
∗k
(2)