Outlier Detection by Clustering 2D Coordinates in C++
Date: 2023-10-25 17:06:57
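A commonly used clustering algorithm for 2D coordinates is K-means. Its basic idea is to partition the dataset into k clusters so that every data point belongs to the cluster whose centroid is nearest to it. The steps are as follows:
1. Randomly select k data points as the initial centroids.
2. For each data point, compute its distance to every centroid and assign it to the cluster with the nearest centroid.
3. Update each cluster's centroid to the mean of all data points assigned to that cluster.
4. Repeat steps 2 and 3 until the centroids stop changing or the maximum number of iterations is reached.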
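Step 1 can be sketched as follows. This is only an illustration: the helper name `pickInitialCentroids`, the `seed` parameter, and the decision to clamp `k` to the dataset size are assumptions, not part of the original description.

```c++
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

struct Point {
    double x, y;
};

// Hypothetical helper: choose k distinct data points at random as initial centroids.
// If k exceeds the number of points, every point is returned (in shuffled order).
std::vector<Point> pickInitialCentroids(const std::vector<Point>& data,
                                        std::size_t k, unsigned seed) {
    // Shuffle the indices 0..n-1 and take the first k of them.
    std::vector<std::size_t> indices(data.size());
    std::iota(indices.begin(), indices.end(), std::size_t{0});
    std::mt19937 rng(seed);
    std::shuffle(indices.begin(), indices.end(), rng);

    k = std::min(k, data.size());
    std::vector<Point> centroids;
    centroids.reserve(k);
    for (std::size_t i = 0; i < k; i++) {
        centroids.push_back(data[indices[i]]);
    }
    return centroids;
}
```

Passing a fixed seed makes runs reproducible; seeding from `std::random_device` instead would give a different starting point on each run, which matters because K-means results can depend on the initial centroids.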
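After K-means clustering, outliers can be identified by computing the distance from each data point to the centroid of its own cluster. Outliers typically lie far from their centroid, so you can set a distance threshold and treat every point whose distance exceeds it as an outlier. This threshold is independent of the tolerance used to decide when the centroids have converged, and it has to be chosen to suit the scale of the data.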
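When no natural cutoff is known in advance, one common heuristic is to derive it from the distances themselves, for example the mean plus a multiple of the standard deviation. The sketch below assumes that heuristic; the name `suggestThreshold` and the `factor` parameter are hypothetical, and the approach is not prescribed by the text above.

```c++
#include <cmath>
#include <vector>

// Hypothetical helper: suggest a cutoff of mean + factor * standard deviation
// over the point-to-centroid distances. Returns 0 for an empty input.
double suggestThreshold(const std::vector<double>& distances, double factor) {
    if (distances.empty()) {
        return 0.0;
    }
    double sum = 0.0;
    for (double d : distances) {
        sum += d;
    }
    const double mean = sum / distances.size();

    double squaredDeviations = 0.0;
    for (double d : distances) {
        squaredDeviations += (d - mean) * (d - mean);
    }
    // Population standard deviation (divides by n, not n - 1).
    const double stddev = std::sqrt(squaredDeviations / distances.size());
    return mean + factor * stddev;
}
```

A larger `factor` flags fewer points. On very small datasets a single extreme point inflates both the mean and the standard deviation, so a hand-picked threshold may work better there.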
Here is a simple, complete C++ example:
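```c++
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

struct Point {
    double x, y;
};

// Euclidean distance between two points.
double distance(const Point& a, const Point& b) {
    const double dx = a.x - b.x;
    const double dy = a.y - b.y;
    return std::sqrt(dx * dx + dy * dy);
}

// Index of the centroid closest to p.
int findNearestCluster(const Point& p, const std::vector<Point>& centroids) {
    int cluster = 0;
    double minDist = distance(p, centroids[0]);
    for (std::size_t i = 1; i < centroids.size(); i++) {
        double dist = distance(p, centroids[i]);
        if (dist < minDist) {
            minDist = dist;
            cluster = static_cast<int>(i);
        }
    }
    return cluster;
}

struct KMeansResult {
    std::vector<int> clusters;     // cluster index of each data point
    std::vector<Point> centroids;  // final centroid of each cluster
};

// Runs K-means until no centroid moves more than `tolerance`
// or `maxIterations` is reached. Requires 1 <= k <= data.size().
KMeansResult kMeans(const std::vector<Point>& data, int k, double tolerance, int maxIterations) {
    // For simplicity, the first k points serve as the initial centroids.
    std::vector<Point> centroids(data.begin(), data.begin() + k);
    std::vector<int> clusters(data.size(), 0);
    bool converged = false;
    int iterations = 0;
    while (!converged && iterations < maxIterations) {
        // Assignment step: attach each point to its nearest centroid.
        for (std::size_t i = 0; i < data.size(); i++) {
            clusters[i] = findNearestCluster(data[i], centroids);
        }
        // Update step: move each centroid to the mean of its points.
        std::vector<Point> newCentroids(k, Point{0.0, 0.0});
        std::vector<int> clusterSizes(k, 0);
        for (std::size_t i = 0; i < data.size(); i++) {
            int cluster = clusters[i];
            newCentroids[cluster].x += data[i].x;
            newCentroids[cluster].y += data[i].y;
            clusterSizes[cluster]++;
        }
        for (int i = 0; i < k; i++) {
            if (clusterSizes[i] > 0) {
                newCentroids[i].x /= clusterSizes[i];
                newCentroids[i].y /= clusterSizes[i];
            } else {
                newCentroids[i] = centroids[i];  // an empty cluster keeps its old centroid
            }
        }
        // Converged when every centroid moved by at most `tolerance`.
        converged = true;
        for (int i = 0; i < k; i++) {
            if (distance(centroids[i], newCentroids[i]) > tolerance) {
                converged = false;
                break;
            }
        }
        centroids = newCentroids;
        iterations++;
    }
    return {clusters, centroids};
}

int main() {
    std::vector<Point> data = {
        {1.0, 1.0},
        {1.5, 2.0},
        {3.0, 4.0},
        {5.0, 7.0},
        {3.5, 5.0},
        {4.5, 5.0},
        {3.5, 4.5}
    };
    int k = 2;
    double tolerance = 0.1;         // convergence tolerance for centroid movement
    double outlierThreshold = 2.0;  // distance to centroid beyond which a point is an outlier
    int maxIterations = 100;
    KMeansResult result = kMeans(data, k, tolerance, maxIterations);
    for (std::size_t i = 0; i < data.size(); i++) {
        if (distance(data[i], result.centroids[result.clusters[i]]) > outlierThreshold) {
            std::cout << "(" << data[i].x << ", " << data[i].y << ") is an outlier" << std::endl;
        }
    }
    return 0;
}
```
In this example the dataset is split into 2 clusters, with a convergence tolerance of 0.1, an outlier threshold of 2.0, and a maximum of 100 iterations. The two values are kept separate because a distance of 0.1 is far too small to serve as an outlier cutoff for this data: every point lies more than 0.1 from its centroid. K-means converges to the centroids (1.25, 1.5) and (3.9, 5.1). The point (5, 7) lies about 2.2 from its centroid, while every other point is within about 1.4 of its own. The output is:
```
(5, 7) is an outlier
```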