Beyond Streams and Graphs: Dynamic Tensor Analysis
Jimeng Sun
†
Dacheng Tao
‡
Christos Faloutsos
†
† Computer Science Department, Carnegie Mellon University, Pittsburgh,USA
‡ School of Computer Science and Information Systems, Birkbeck College, University of London,UK
{jimeng,christos}@cs.cmu.edu, dacheng@dcs.bbk.ac.uk
ABSTRACT
How do we find patterns in author-keyword associations,
evolving over time? Or in DataCubes, with produ ct-branch-
customer sales information? Matrix decompositions, like
principal component analysis (PCA) and variants, are in-
valuable t ools for mining, dimensionality reduction, feature
selection, rule identification in numerous settings like stream-
ing data, text, graphs, social networks and many more.
However, they have only two orders, like author and key-
word, in the above example.
We propose to envision such higher order data as tensors,
and tap the vast literature on the topic. However, these
methods do not necessarily scale up, let alone operate on
semi-infinite streams. Thus, we introduce the dynamic ten-
sor analysis (DTA) method, and its variants. DTA provides
a compact summary for high-order and high-dimensional
data, and it also reveals the hidden correlations. Algorith-
mically, we designed DTA very carefully so that it is (a)
scalable, (b) space efficient (it does not need to store the
past) and (c) fully automatic with no need for user defined
parameters. Moreover, we propose STA, a streaming tensor
analysis method, which provides a fast, streaming approxi-
mation to DTA.
We implemented all our methods, and applied them in
two real settings, namely, anomaly detection and multi-way
latent semantic indexing. We used two real, large datasets,
one on network flow data (100GB over 1 month) and one
from DBLP (200MB over 25 years). Our experiments show
that our methods are fast, accurate and that they fin d in-
teresting patterns and outliers on the real datasets.
1. INTRODUCTION
Given a keyword-author-timestamp-conference bibliogra-
phy, how can we find patterns and latent concepts? Given
Internet traffic data (who sends packets to whom, on what
port, and when), how can we find anomalies, patterns and
summaries? Anomalies could be, e.g., port-scanners, p at-
terns could be of the form “workstations are down on week-
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
KDD’06, August 20–23, 2006, Philadelphia, Pennsylvania, USA.
Copyright 2006 ACM 1-59593-339-5/06/0008 ...$5.00.
ends, while servers spike at Fridays for backups”. Sum-
maries like the one above are useful to give us an idea what
is t he past (which is probably the norm), so that we can
spot deviations from it in the future.
Matrices and matrix operations like SVD/PCA have played
a vital role in finding patterns when the dataset is “2-dimensional”,
and can thus be represented as a matrix. Important appli-
cations of this view point include numerous settings like:
1) information retrieval, where the data can be turned
into a document-term matrix, and t hen apply LSI [9, 24];
2) market basket analysis, with customer-products ma-
trices, where we can apply association rules [2] or “Ratio
Rules” [21]; 3) the web, where both rows and columns are
pages, and links correspond to edges between them; then
we can apply HITS [19] or pageRank [3] to find hubs, au-
thorities and influential nodes; all of t hem are identical or
closely related to eigen analysis or derivatives; 4) social
networks, and in fact, any graph (with un-labelled edges):
people are rows and columns; edges again correspond t o non-
zero entries in the adjacency matrix. The network value
of a customer [13] has close ties to th e first eigenvector;
graph partitioning [18] is often done through matrix alge-
bra (e.g. spectral clustering [16]); 5) streams and co-
evolving sequences can also be envisioned as matrices:
each data source (sensor) corresponds to a row, and each
time-tick to a column. Then we can do multivariate anal-
ysis or SVD [25],“sketches” and random projections [14] to
find patterns and outliers.
The need for tensors: Powerful as they may be, matrix-
based tools can handle neither of the two problems we stated
in the beginning. The crux is that matrices have only two
“dimensions” (e.g., “customers” and “products”), while we
may often need more, like “authors”, “keywords”, “times-
tamps”, “conferences”. This is exactly what a tensor is, and
of course, a tensor is a generalization of a matrix (and of a
vector, and of a scalar). We propose to envision all such
problems as tensor problems, to use the vast literature of
tensors to our benefit, and to introduce new tensor analysis
tools, tailored for streaming applications. Using tensors, we
OLAP this paper tensor literature
dimensionality order order/mode
dimension mode order/mode
attribute value dimension dimension
Table 1: Termi nology correspondence
can attack an even wider range of problems, that matrices
can not even touch. For example, 1) Rich, time-evolving net-
work traffic data, as mentioned earlier: we have tensors of