k-means++: The Advantages of Careful Seeding
David Arthur∗
Sergei Vassilvitskii†
Abstract
The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.
1 Introduction
Clustering is one of the classic problems in machine learning and computational geometry. In the popular k-means formulation, one is given an integer k and a set of n data points in ℝ^d. The goal is to choose k centers so as to minimize φ, the sum of the squared distances between each point and its closest center.
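The potential φ defined above is straightforward to compute directly. The following is a minimal NumPy sketch (not from the paper; the function name `potential` is my own) for concreteness:

```python
import numpy as np

def potential(points, centers):
    """k-means potential phi: the sum of squared distances from
    each point to its nearest center."""
    # pairwise squared distances, shape (n_points, n_centers)
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # each point contributes the squared distance to its closest center
    return d2.min(axis=1).sum()
```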
Solving this problem exactly is NP-hard, even with just two clusters [10], but twenty-five years ago, Lloyd [20] proposed a local search solution that is still very widely used today (see for example [1, 11, 15]). Indeed, a recent survey of data mining techniques states that it “is by far the most popular clustering algorithm used in scientific and industrial applications” [5].
Usually referred to simply as k-means, Lloyd’s algorithm begins with k arbitrary centers, typically chosen uniformly at random from the data points. Each point is then assigned to the nearest center, and each center is recomputed as the center of mass of all points assigned to it. These two steps (assignment and center calculation) are repeated until the process stabilizes.
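The two steps above can be sketched in NumPy as follows. This is an illustrative implementation, not the paper's own code; the function name `lloyd` and the `max_iter` safeguard are my additions:

```python
import numpy as np

def lloyd(points, k, max_iter=100, rng=None):
    """Lloyd's algorithm with the standard uniform-random seeding.
    Returns the final centers and the point-to-center assignment."""
    rng = np.random.default_rng(rng)
    # seed: k centers chosen uniformly at random from the data points
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest center
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: each center becomes the mean of its assigned points
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else centers[j]  # leave a center with no points in place
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # assignments have stabilized
        centers = new_centers
    return centers, labels
```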
One can check that the total error φ is monotonically decreasing, which ensures that no clustering is repeated during the course of the algorithm. Since there are at most k^n possible clusterings, the process will always terminate. In practice, very few iterations are usually required, which makes the algorithm much faster than most of its competitors.

∗Stanford University. Supported in part by an NDSEG Fellowship, NSF Grant ITR-0331640, and grants from Media-X and SNRC.
†Stanford University. Supported in part by NSF Grant ITR-0331640, and grants from Media-X and SNRC.
Unfortunately, the empirical speed and simplicity of the k-means algorithm come at the price of accuracy. There are many natural examples for which the algorithm generates arbitrarily bad clusterings (i.e., φ/φ_OPT is unbounded even when n and k are fixed). Furthermore, these examples do not rely on an adversarial placement of the starting centers, and the ratio can be unbounded with high probability even with the standard randomized seeding technique.
In this paper, we propose a way of initializing k-means by choosing random starting centers with very specific probabilities. Specifically, we choose a point p as a center with probability proportional to p’s contribution to the overall potential. Letting φ denote the potential after choosing centers in this way, we show the following.

Theorem 1.1. For any set of data points, E[φ] ≤ 8(ln k + 2)φ_OPT.
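The seeding rule, often called D² seeding, can be sketched as follows: after the first center is drawn uniformly, each subsequent center is drawn with probability proportional to the point's squared distance to the nearest center chosen so far (its contribution to φ). This is an illustrative NumPy sketch, not the authors' code; the function name `kmeanspp_seed` is my own:

```python
import numpy as np

def kmeanspp_seed(points, k, rng=None):
    """D^2 seeding: draw each new center with probability
    proportional to its squared distance to the nearest
    already-chosen center."""
    rng = np.random.default_rng(rng)
    n = len(points)
    # first center: uniformly at random from the data points
    centers = [points[rng.integers(n)]]
    for _ in range(k - 1):
        # D(x)^2 for every point x, against the chosen centers so far
        d2 = np.min(
            ((points[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )
        # sample the next center proportionally to D(x)^2
        centers.append(points[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)
```

Note that a point coinciding with an existing center has D(x)² = 0 and thus can never be chosen again, so the seeds are distinct whenever the data points are.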
This sampling is both fast and simple, and it already achieves approximation guarantees that k-means cannot. We propose using it to seed the initial centers for k-means, leading to a combined algorithm we call k-means++.

This complements a very recent result of Ostrovsky et al. [24], who independently proposed much the same algorithm. Whereas they showed this randomized seeding is O(1)-competitive on data sets following a certain separation condition, we show it is O(log k)-competitive on all data sets.
We also show that the analysis for Theorem 1.1 is tight up to a constant factor, and that it can be easily extended to various potential functions in arbitrary metric spaces. In particular, we can also get a simple O(log k) approximation algorithm for the k-median objective. Furthermore, we provide preliminary experimental data showing that in practice, k-means++ really does outperform k-means in terms of both accuracy and speed, often by a substantial margin.
1.1 Related work As a fundamental problem in machine learning, k-means has a rich history. Because of its simplicity and its observed speed, Lloyd’s method [20] remains the most popular approach in practice,