k-means++: The Advantages of Careful Seeding
David Arthur∗
Sergei Vassilvitskii†
Abstract
The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.
1 Introduction
Clustering is one of the classic problems in machine learning and computational geometry. In the popular k-means formulation, one is given an integer k and a set of n data points in ℝ^d. The goal is to choose k centers so as to minimize φ, the sum of the squared distances between each point and its closest center.
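The potential φ defined above is straightforward to compute directly. The following is a minimal NumPy sketch (not from the paper; the function name `potential` is my own) for concreteness:

```python
import numpy as np

def potential(points, centers):
    """k-means potential phi: the sum of squared distances from
    each point to its nearest center."""
    # pairwise squared distances, shape (n_points, n_centers)
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # each point contributes the squared distance to its closest center
    return d2.min(axis=1).sum()
```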
Solving this problem exactly is NP-hard, even with just two clusters [10], but twenty-five years ago, Lloyd [20] proposed a local search solution that is still very widely used today (see for example [1, 11, 15]). Indeed, a recent survey of data mining techniques states that it “is by far the most popular clustering algorithm used in scientific and industrial applications” [5].
Usually referred to simply as k-means, Lloyd’s algorithm begins with k arbitrary centers, typically chosen uniformly at random from the data points. Each point is then assigned to the nearest center, and each center is recomputed as the center of mass of all points assigned to it. These two steps (assignment and center calculation) are repeated until the process stabilizes.
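The two steps above can be sketched in NumPy as follows. This is an illustrative implementation, not the paper's own code; the function name `lloyd` and the `max_iter` safeguard are my additions:

```python
import numpy as np

def lloyd(points, k, max_iter=100, rng=None):
    """Lloyd's algorithm with the standard uniform-random seeding.
    Returns the final centers and the point-to-center assignment."""
    rng = np.random.default_rng(rng)
    # seed: k centers chosen uniformly at random from the data points
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest center
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: each center becomes the mean of its assigned points
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else centers[j]  # leave a center with no points in place
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # assignments have stabilized
        centers = new_centers
    return centers, labels
```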
One can check that the total error φ is monotonically decreasing, which ensures that no clustering is repeated during the course of the algorithm. Since there are at most k^n possible clusterings, the process will always terminate. In practice, very few iterations are usually required, which makes the algorithm much faster than most of its competitors.

∗Stanford University. Supported in part by an NDSEG Fellowship, NSF Grant ITR-0331640, and grants from Media-X and SNRC.
†Stanford University. Supported in part by NSF Grant ITR-0331640, and grants from Media-X and SNRC.
Unfortunately, the empirical speed and simplicity of the k-means algorithm come at the price of accuracy. There are many natural examples for which the algorithm generates arbitrarily bad clusterings (i.e., φ/φ_OPT is unbounded even when n and k are fixed). Furthermore, these examples do not rely on an adversarial placement of the starting centers, and the ratio can be unbounded with high probability even with the standard randomized seeding technique.
In this paper, we propose a way of initializing k-means by choosing random starting centers with very specific probabilities. Specifically, we choose a point p as a center with probability proportional to p’s contribution to the overall potential. Letting φ denote the potential after choosing centers in this way, we show the following.

Theorem 1.1. For any set of data points, E[φ] ≤ 8(ln k + 2)φ_OPT.
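The seeding rule, often called D² seeding, can be sketched as follows: after the first center is drawn uniformly, each subsequent center is drawn with probability proportional to the point's squared distance to the nearest center chosen so far (its contribution to φ). This is an illustrative NumPy sketch, not the authors' code; the function name `kmeanspp_seed` is my own:

```python
import numpy as np

def kmeanspp_seed(points, k, rng=None):
    """D^2 seeding: draw each new center with probability
    proportional to its squared distance to the nearest
    already-chosen center."""
    rng = np.random.default_rng(rng)
    n = len(points)
    # first center: uniformly at random from the data points
    centers = [points[rng.integers(n)]]
    for _ in range(k - 1):
        # D(x)^2 for every point x, against the chosen centers so far
        d2 = np.min(
            ((points[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )
        # sample the next center proportionally to D(x)^2
        centers.append(points[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)
```

Note that a point coinciding with an existing center has D(x)² = 0 and thus can never be chosen again, so the seeds are distinct whenever the data points are.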
This sampling is both fast and simple, and it already achieves approximation guarantees that k-means cannot. We propose using it to seed the initial centers for k-means, leading to a combined algorithm we call k-means++.

This complements a very recent result of Ostrovsky et al. [24], who independently proposed much the same algorithm. Whereas they showed this randomized seeding is O(1)-competitive on data sets following a certain separation condition, we show it is O(log k)-competitive on all data sets.
We also show that the analysis for Theorem 1.1 is tight up to a constant factor, and that it can be easily extended to various potential functions in arbitrary metric spaces. In particular, we can also get a simple O(log k) approximation algorithm for the k-median objective. Furthermore, we provide preliminary experimental data showing that in practice, k-means++ really does outperform k-means in terms of both accuracy and speed, often by a substantial margin.
1.1 Related work As a fundamental problem in machine learning, k-means has a rich history. Because of its simplicity and its observed speed, Lloyd’s method [20] remains the most popular approach in practice,