Exploring Linear Projections: Clusters, Outliers, and Trends in Subsets of Multi-Dimensional Data

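This research paper examines how linear projections can reveal clusters, outliers, and trends in subsets of multi-dimensional datasets. Through an interactive interface, the authors propose a voting-based strategy that strengthens the identification of local patterns and overcomes the limitations of traditional quality measures when dealing with noise and with patterns formed by only part of the data points.

Linear projection is a common technique for understanding and analyzing multi-dimensional datasets: it maps high-dimensional data onto a two-dimensional space, which simplifies visualization and interpretation. Linear projection methods include principal component analysis (PCA) and independent component analysis (ICA), which capture the main features of the data while reducing its dimensionality. However, traditional linear projection methods may overlook or blur local patterns, because they usually focus on global properties of the whole dataset.

To address this problem, the paper designs an interactive interface for exploring 2D linear projections of subsets of a dataset. The interface lets users focus on a specific region and uses a voting mechanism to emphasize and identify local patterns. This strategy helps users separate meaningful structure from noise, even when that structure is formed by only part of the data points.

Identifying local patterns is essential for finding clusters, outliers, and trends. Clusters are naturally formed groups in a dataset whose members share similar attributes or features. Outliers are points that differ markedly from the other data points, possibly because of measurement error or other anomalies. Trends indicate the direction in which the data change over time or with respect to some variable. These patterns can be hard to see directly in multi-dimensional data, which is the problem the proposed method aims to solve.

The paper may also touch on the following keywords:

- Multi-dimensional data: a dataset with multiple independent variables that together describe a phenomenon or entity.
- Voting mechanism: a decision or evaluation strategy that collects multiple observations or judgments to determine the most salient or most meaningful pattern.
- Visual exploration: exploring and understanding data through graphical and visual representations, helping users discover potential associations and patterns.
- Data subset: a portion selected from the original dataset for deeper analysis or for a specific purpose.

Overall, the paper offers a new tool and method for improving the understanding of multi-dimensional datasets, particularly for identifying local patterns, outliers, and trends, which has practical value in data mining, machine learning, and data analysis.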

Contents lists available at ScienceDirect

Journal of Visual Languages and Computing

journal homepage: www.elsevier.com/locate/jvlc

Exploring linear projections for revealing clusters, outliers, and trends in subsets of multi-dimensional datasets

Jiazhi Xia, Le Gao, Kezhi Kong, Ying Zhao⁎, Yi Chen, Xiaoyan Kui, Yixiong Liang

School of Information Science and Engineering, Central South University, Changsha, China

State Key Lab of CAD&CG, Zhejiang University, Hangzhou, China

Beijing Key Laboratory of Big Data Technology for Food Safety, Beijing Technology and Business University, Beijing, China

ARTICLE INFO

Keywords:

Multi-dimensional data

Projection

Visual exploring

ABSTRACT

Identifying patterns in 2D linear projections is important in understanding multi-dimensional datasets. However, local patterns, which are composed of partial data points, are usually obscured by noise and missed by traditional quality measure approaches that measure the whole dataset. In this paper, we propose an interactive interface to explore 2D linear projections with visual patterns on subsets. First, we propose a voting-based algorithm to recommend the optimal projection, in which the identified pattern looks the most salient. Specifically, we propose three kinds of point-wise quality metrics of 2D linear projections for outliers, clusters, and trends, respectively. For each sampled projection, we measure its importance by accumulating the metrics of selected points. The projection with the highest importance is recommended. Second, we design an exploration interface with a scatterplot, a projection trail map, and a control panel. Our interface allows users to explore projections by specifying data subsets of interest. Finally, we employ three datasets and demonstrate the effectiveness of our approach through three case studies of exploring clusters, outliers, and trends.
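The voting-based recommendation outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the paper's point-wise metrics are not given in this section, so `outlier_scores` is a hypothetical stand-in (distance from the projected centroid, normalized by the mean distance), and `random_projection`, `recommend_projection`, `n_samples`, and `seed` are illustrative names. Sampling projections uniformly at random is likewise an assumption.

```python
import numpy as np


def random_projection(dim, rng):
    """Sample a random 2D orthonormal projection matrix of shape (dim, 2)."""
    # The reduced QR factorization of a Gaussian matrix has orthonormal columns.
    q, _ = np.linalg.qr(rng.standard_normal((dim, 2)))
    return q


def outlier_scores(points_2d):
    """Hypothetical point-wise outlier metric (an assumption, not the paper's).

    Each point's score is its distance to the 2D centroid, divided by the
    mean distance of all points to that centroid.
    """
    dist = np.linalg.norm(points_2d - points_2d.mean(axis=0), axis=1)
    return dist / (dist.mean() + 1e-12)


def recommend_projection(data, selected, n_samples=500, seed=0):
    """Return the sampled projection in which the selected points vote highest.

    data:     array of shape (n_points, dim)
    selected: indices of the user-specified subset
    """
    rng = np.random.default_rng(seed)
    best_score, best_proj = -np.inf, None
    for _ in range(n_samples):
        proj = random_projection(data.shape[1], rng)
        scores = outlier_scores(data @ proj)
        # "Voting": accumulate the point-wise metric over the selected points only.
        importance = scores[selected].sum()
        if importance > best_score:
            best_score, best_proj = importance, proj
    return best_proj, best_score
```

Swapping `outlier_scores` for a point-wise cluster or trend metric would leave the accumulation loop unchanged, which is the appeal of scoring per point rather than per projection.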

1. Introduction

Multi-dimensional data visualization plays an important role in data exploration and understanding. Among a variety of visualization approaches, 2D linear projection remains the most popular method to provide insights into structures and patterns in datasets [1]. Specifically, users are interested in the visual patterns of clusters, outliers, and trends in linear projections [2,3]. However, identifying interesting projections among the numerous possible projections is considered a fundamental challenge [1].

To resolve this issue, several approaches have been proposed to provide a small set of representative projections. First, quality measures are adopted to rank possible projections [4]. Quality measures of clusters (e.g. Linear Discriminant Analysis [5]), trends (e.g. the Pearson correlation coefficient), and outliers (e.g. statistical analysis [4]) are widely studied. Specifically, scagnostics [6] comprises nine measurements describing the patterns of points in projections, including outliers, shape, trend, and density (e.g. clumpy). Second, dissimilarities among projections are measured to reduce redundancy in the recommendation set [7]. Alternatively, Liu et al. [1] look for local maximum projections to provide a representative set.
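As a rough illustration of ranking projections with a global quality measure, the following Python sketch scores every axis-aligned 2D projection (a pair of dimensions) by the absolute Pearson correlation coefficient computed over the whole dataset. Restricting the search to axis-aligned projections and the function name `rank_axis_pairs_by_trend` are simplifying assumptions, not something taken from the cited work.

```python
from itertools import combinations

import numpy as np


def rank_axis_pairs_by_trend(data):
    """Rank axis-aligned 2D projections by |Pearson r| over all points.

    data: array of shape (n_points, dim)
    Returns a list of (abs_correlation, i, j) tuples, strongest trend first.
    """
    # rowvar=False treats each column as one variable (dimension).
    corr = np.corrcoef(data, rowvar=False)
    pairs = combinations(range(data.shape[1]), 2)
    return sorted(((abs(corr[i, j]), i, j) for i, j in pairs), reverse=True)
```

Because the correlation is computed over every point, a trend that exists only within a subset can be diluted by the remaining points and rank low.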

However, most existing quality measures are defined on projections of the whole dataset. Real-world datasets often contain multiple clusters and noise. Local patterns that exist in a subset can be obscured by other components or noise. For instance, it is highly improbable that a single projection presents the patterns of clusters that lie in different subspaces. Therefore, it is challenging to provide insight into local patterns based on global quality measures.

Let us consider a typical exploratory analysis scenario. When users explore projections for patterns of interest, the exploration process often contains three stages. First, users look around the projection space until a global or local pattern is observed. Because the exploration space is large and the dataset is usually complicated, it is non-trivial to reach a projection with a clear pattern in this stage. More probably, users observe a noisy pattern, such as a set of densely gathered points mixed with sparsely distributed points. Second, this observation yields a hypothesis that a local pattern exists. Specifically, the hypothesis is composed of the pattern type (e.g. cluster, trend, or outlier) and the subset of points that form the pattern. Third, this hypothesis motivates subsequent exploration operations to verify it. The loop of looking around, suggesting a hypothesis, and verifying the hypothesis is performed iteratively in the exploration process.

https://doi.org/10.1016/j.jvlc.2018.08.003

Received 20 July 2018; Accepted 6 August 2018

⁎ Corresponding author.

E-mail addresses: xiajiazhi@csu.edu.cn (J. Xia), csugaole@csu.edu.cn (L. Gao), durantkong@zju.edu.cn (K. Kong), zhaoying@csu.edu.cn (Y. Zhao), chenyi@th.btbu.edu.cn (Y. Chen), xykui@csu.edu.cn (X. Kui), yxliang@csu.edu.cn (Y. Liang).

Journal of Visual Languages and Computing 48 (2018) 52–60

Available online 09 August 2018
