Dynamic Tensor Analysis: Mining High-Order Data Patterns

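"This article introduces a method described as a 'tensor subspace algorithm', which uses tensors for image feature extraction. In data mining and analysis, matrix decompositions such as principal component analysis (PCA) are widely used for dimensionality reduction, feature selection, and rule identification. These traditional methods, however, are limited to two-dimensional data structures and cannot effectively handle data with higher-order relationships. The authors therefore propose dynamic tensor analysis (DTA) to handle higher-order, dynamically changing data streams, such as author-keyword associations that evolve over time or product-branch-customer sales information in DataCubes. DTA provides a compact summary for large-scale and even semi-infinite data streams, extending tensor methods to real-time analysis and mining."

The main points covered in this article are:

1. **Tensors and image processing**: A tensor is the mathematical notion of a multi-dimensional array and can represent complex data structures, such as the pixel values of an image. In image feature extraction, tensors let us consider several dimensions of an image at once, such as color, spatial position, and time, and so capture richer information.
2. **Matrix decomposition and PCA**: PCA is a common dimensionality-reduction technique that applies a linear transformation to project the original high-dimensional data onto a small set of uncorrelated components, reducing the complexity of the data. PCA and its variants are often used for feature selection and pattern recognition in text, streaming data, social networks, and other domains.
3. **Higher-order data and the limits of matrices**: Traditional matrix decompositions apply only to two-dimensional data and cannot effectively handle data with more levels of relationship. Author-keyword associations and product sales information, for example, involve three or more related entities, which calls for a higher-order structure, the tensor.
4. **Dynamic tensor analysis (DTA)**: DTA is a data analysis method aimed at continuously changing data streams. It handles high-order tensors and stays efficient when the data is large or the stream is unbounded. At its core, DTA provides a compressed summary of the data that preserves the key information and suits real-time analysis.
5. **Application areas**: DTA shows strong potential in scenarios such as analyzing how author-keyword associations change over time and analyzing sales data in DataCubes. These scenarios involve both temporal evolution and multi-way relationships, which traditional matrix decompositions struggle to handle.
6. **Scalability and real-time operation**: DTA scales to large datasets and can also process semi-infinite data streams, which matters for current big-data and real-time analysis needs. With DTA, researchers and data scientists can understand and mine complex, dynamic data structures more effectively and uncover hidden patterns and trends.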

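To make the idea of a higher-order relation concrete, the sketch below builds a small third-order tensor from (author, keyword, year) records and then flattens ("unfolds") it along one mode. The records, the function names `build_tensor` and `unfold`, and the dense NumPy representation are all assumptions made for illustration; real bibliographic data would normally need a sparse representation.

```python
import numpy as np

# Hypothetical (author, keyword, year) records, made up for illustration.
records = [
    ("alice", "tensor", 2004),
    ("alice", "stream", 2005),
    ("bob", "tensor", 2005),
    ("bob", "graph", 2005),
    ("alice", "tensor", 2005),
]

def build_tensor(records):
    """Count records into a dense tensor with one mode per record field."""
    # One sorted label list per mode: authors, keywords, years.
    labels = [sorted(set(column)) for column in zip(*records)]
    index = [{label: i for i, label in enumerate(mode_labels)}
             for mode_labels in labels]
    tensor = np.zeros([len(mode_labels) for mode_labels in labels])
    for record in records:
        position = tuple(index[mode][value] for mode, value in enumerate(record))
        tensor[position] += 1
    return tensor, labels

def unfold(tensor, mode):
    """Flatten a tensor into a matrix whose rows correspond to the chosen mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

tensor, labels = build_tensor(records)
author_view = unfold(tensor, 0)       # one row per author
author_keyword = tensor.sum(axis=2)   # summing out the year recovers a plain matrix
print(tensor.shape, author_view.shape)
```

Summing over the year mode gives back the ordinary author-keyword matrix, which shows what a two-way view discards: the information about when each association occurred.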
Beyond Streams and Graphs: Dynamic Tensor Analysis

Jimeng Sun†, Dacheng Tao‡, Christos Faloutsos†

† Computer Science Department, Carnegie Mellon University, Pittsburgh, USA

‡ School of Computer Science and Information Systems, Birkbeck College, University of London, UK

{jimeng,christos}@cs.cmu.edu, dacheng@dcs.bbk.ac.uk

ABSTRACT

How do we find patterns in author-keyword associations, evolving over time? Or in DataCubes, with product-branch-customer sales information? Matrix decompositions, like principal component analysis (PCA) and variants, are invaluable tools for mining, dimensionality reduction, feature selection, rule identification in numerous settings like streaming data, text, graphs, social networks and many more. However, they have only two orders, like author and keyword, in the above example.

We propose to envision such higher order data as tensors, and tap the vast literature on the topic. However, these methods do not necessarily scale up, let alone operate on semi-infinite streams. Thus, we introduce the dynamic tensor analysis (DTA) method, and its variants. DTA provides a compact summary for high-order and high-dimensional data, and it also reveals the hidden correlations. Algorithmically, we designed DTA very carefully so that it is (a) scalable, (b) space efficient (it does not need to store the past) and (c) fully automatic with no need for user defined parameters. Moreover, we propose STA, a streaming tensor analysis method, which provides a fast, streaming approximation to DTA.

We implemented all our methods, and applied them in two real settings, namely, anomaly detection and multi-way latent semantic indexing. We used two real, large datasets, one on network flow data (100GB over 1 month) and one from DBLP (200MB over 25 years). Our experiments show that our methods are fast, accurate and that they find interesting patterns and outliers on the real datasets.
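Point (b) of the abstract, that a summary can be maintained without storing past data, can be illustrated with a small sketch. This is an illustration under stated assumptions, not the authors' algorithm as given in the text above: it keeps one covariance-like matrix per mode, decays it with a hypothetical `forgetting` factor, and recomputes the leading eigenvectors whenever a new tensor arrives. The class name `StreamingModeSummary`, the fixed `ranks`, and the forgetting factor are all introduced here for illustration (the abstract states that DTA itself needs no user-defined parameters).

```python
import numpy as np

def unfold(tensor, mode):
    """Flatten a tensor into a matrix whose rows correspond to the chosen mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

class StreamingModeSummary:
    """Illustrative per-mode summary of a stream of equally shaped tensors."""

    def __init__(self, shape, ranks, forgetting=0.98):
        self.ranks = ranks
        self.forgetting = forgetting  # assumed decay applied to older data
        # One square matrix per mode; its size never grows with the stream.
        self.cov = [np.zeros((n, n)) for n in shape]

    def update(self, tensor):
        """Fold one new tensor into the summary and return one basis per mode."""
        projections = []
        for mode, rank in enumerate(self.ranks):
            x = unfold(tensor, mode)
            self.cov[mode] = self.forgetting * self.cov[mode] + x @ x.T
            # eigh returns eigenvalues in ascending order, so reverse the columns.
            _, eigvecs = np.linalg.eigh(self.cov[mode])
            projections.append(eigvecs[:, ::-1][:, :rank])
        return projections

rng = np.random.default_rng(0)
summary = StreamingModeSummary(shape=(4, 5, 6), ranks=(2, 2, 3))
for _ in range(10):
    # Each incoming tensor is used once and then discarded.
    projections = summary.update(rng.standard_normal((4, 5, 6)))
print([p.shape for p in projections])
```

The memory used by this sketch depends only on the mode sizes, not on how many tensors have been seen, which is the property the abstract highlights.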

1. INTRODUCTION

Given a keyword-author-timestamp-conference bibliography, how can we find patterns and latent concepts? Given Internet traffic data (who sends packets to whom, on what port, and when), how can we find anomalies, patterns and summaries? Anomalies could be, e.g., port-scanners; patterns could be of the form “workstations are down on weekends, while servers spike on Fridays for backups”. Summaries like the one above are useful to give us an idea of what the past looked like (which is probably the norm), so that we can spot deviations from it in the future.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

KDD’06, August 20–23, 2006, Philadelphia, Pennsylvania, USA.

Matrices and matrix operations like SVD/PCA have played a vital role in finding patterns when the dataset is “2-dimensional”, and can thus be represented as a matrix. Important applications of this viewpoint include numerous settings like: 1) information retrieval, where the data can be turned into a document-term matrix, to which we can then apply LSI [9, 24]; 2) market basket analysis, with customer-products matrices, where we can apply association rules [2] or “Ratio Rules” [21]; 3) the web, where both rows and columns are pages, and links correspond to edges between them; then we can apply HITS [19] or PageRank [3] to find hubs, authorities and influential nodes; all of them are identical or closely related to eigen analysis or derivatives; 4) social networks, and in fact, any graph (with unlabelled edges): people are rows and columns; edges again correspond to non-zero entries in the adjacency matrix. The network value of a customer [13] has close ties to the first eigenvector; graph partitioning [18] is often done through matrix algebra (e.g. spectral clustering [16]); 5) streams and co-evolving sequences can also be envisioned as matrices: each data source (sensor) corresponds to a row, and each time-tick to a column. Then we can do multivariate analysis or SVD [25], “sketches” and random projections [14] to find patterns and outliers.
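The stream-as-matrix view in item 5 can be sketched in a few lines: sensors are rows, time-ticks are columns, a truncated SVD captures the shared pattern, and the columns it reconstructs poorly are candidate outliers. The synthetic data, the injected spike, and the helper name `rank_k_residuals` are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def rank_k_residuals(data, k):
    """Return the per-column reconstruction error of the best rank-k approximation."""
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    approx = (u[:, :k] * s[:k]) @ vt[:k]
    return np.linalg.norm(data - approx, axis=0)

# Synthetic stream: 6 sensors follow one shared sinusoidal trend plus small noise.
rng = np.random.default_rng(0)
n_sensors, n_ticks = 6, 200
ticks = np.arange(n_ticks)
trend = np.sin(2 * np.pi * ticks / 50)
loadings = rng.uniform(0.5, 1.5, size=n_sensors)
data = np.outer(loadings, trend) + 0.01 * rng.standard_normal((n_sensors, n_ticks))

# Inject a single anomalous reading: sensor 3 at time-tick 120.
data[3, 120] += 5.0

residuals = rank_k_residuals(data, k=1)
outlier_tick = int(np.argmax(residuals))
print(outlier_tick)
```

A single component is enough here because the synthetic sensors share one trend; on real data the rank would have to be chosen, for example from the decay of the singular values.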

The need for tensors: Powerful as they may be, matrix-based tools can handle neither of the two problems we stated in the beginning. The crux is that matrices have only two “dimensions” (e.g., “customers” and “products”), while we may often need more, like “authors”, “keywords”, “timestamps”, “conferences”. This is exactly what a tensor is, and of course, a tensor is a generalization of a matrix (and of a vector, and of a scalar). We propose to envision all such problems as tensor problems, to use the vast literature of tensors to our benefit, and to introduce new tensor analysis tools, tailored for streaming applications.

| OLAP | this paper | tensor literature |
|---|---|---|
| dimensionality | order | order/mode |
| dimension | mode | order/mode |
| attribute value | dimension | dimension |

Table 1: Terminology correspondence

Using tensors, we can attack an even wider range of problems, that matrices cannot even touch. For example, 1) Rich, time-evolving network traffic data, as mentioned earlier: we have tensors of
