Implementing and Improving k-Anonymity in MATLAB
Date: 2023-05-26 17:03:14
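Because the k-anonymity algorithm itself is fairly simple, I implemented it with MATLAB's built-in kmeans function. Note that the k passed to kmeans is the number of clusters rather than the anonymity level, so clustering alone does not guarantee that every group contains at least k records. The code is as follows: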
```matlab
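clear
close all
% Read the data; the first row of adult.csv is assumed to hold the column names
T = readtable('adult.csv');
% Number of attributes and number of records
attr_num = width(T);
data_num = height(T);
% Collect the distinct values of every attribute and convert each column
% to integer codes 1..n (ordinal encoding in sorted order)
attr_value = cell(1,attr_num);
data_encode = zeros(data_num,attr_num);
for i = 1:attr_num
    [attr_value{i},~,data_encode(:,i)] = unique(T{:,i});
end
% K-means clustering
k = 5; % number of clusters
[IDX,C] = kmeans(data_encode,k); % IDX: cluster of each record, C: cluster centroids
% Cluster sizes for k = 5
cluster_size = accumarray(IDX,1,[k 1])'
% Shannon entropy (in bits) of each attribute
ent = zeros(1,attr_num);
for i = 1:attr_num
    p = accumarray(data_encode(:,i),1) / data_num; % relative frequency of each code
    p = p(p > 0);                                  % skip unused codes to avoid 0*log2(0)
    ent(i) = -sum(p .* log2(p));
end
ent
% Apply the modification function to every cluster (test)
def_level = 5; % anonymity level
d = pdist2(data_encode,C);  % distance from every record to every centroid
[~,min_idx] = min(d,[],2);  % nearest centroid of each record
for i = 1:k
    idx = find(min_idx == i);
    if isempty(idx), continue, end
    modified_attr = zeros(1,attr_num);
    for j = 1:attr_num
        n_j = numel(attr_value{j});
        % Count how often each code of attribute j occurs inside the cluster
        freq = accumarray(data_encode(idx,j),1,[n_j 1])';
        % Pass the codes 1..n_j as the candidate values, so a code is returned
        modified_attr(j) = modify_KAnonymity(freq,def_level,num2cell(1:n_j));
    end
    % Every record of the cluster receives the same representative row
    data_encode(idx,:) = repmat(modified_attr,length(idx),1);
end
% Cluster the modified data again; request at most 5 clusters and no more
% than the number of distinct rows that remain
k = min(5,size(unique(data_encode,'rows'),1));
[IDX,C] = kmeans(data_encode,k); % IDX: cluster of each record, C: cluster centroids
% Cluster sizes after the modification
cluster_size = accumarray(IDX,1,[k 1])'
% Shannon entropy (in bits) of each attribute after the modification
ent_after = zeros(1,attr_num);
for i = 1:attr_num
    p = accumarray(data_encode(:,i),1) / data_num;
    p = p(p > 0);
    ent_after(i) = -sum(p .* log2(p));
end
ent_after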
```
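Since clustering does not by itself enforce a minimum group size, it can help to check the result directly. The following is a minimal sketch, not part of the original script: it treats every column of a small made-up matrix (`toy`, a hypothetical stand-in for data_encode) as a quasi-identifier, groups identical rows into equivalence classes, and tests whether the smallest class reaches the anonymity level.

```matlab
% Toy encoded table: 6 records, 2 quasi-identifier columns (made-up values)
toy = [1 2; 1 2; 1 2; 3 1; 3 1; 2 2];
def_level = 2;                            % required minimum class size

% Identical rows form one equivalence class
[~,~,class_id] = unique(toy,'rows');      % class index of every record
class_size = accumarray(class_id,1);      % number of records per class

% The table is k-anonymous only if every class has at least def_level records
is_k_anonymous = all(class_size >= def_level);
```

Here the row [2 2] occurs only once, so the toy table fails the check for def_level = 2. Replacing `toy` with the encoded data after modification gives the same check for the real table.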
To improve on the basic k-anonymity approach, I also implemented a modification function, modify_KAnonymity. Its code is as follows:
```matlab
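function [modified_attr] = modify_KAnonymity(attr_freq,def_level,attr_value)
% attr_freq: frequency (count) of each attribute value
% def_level: anonymity level
% attr_value: cell array of attribute values, aligned with attr_freq
% modified_attr: the attribute value chosen as the replacement
if sum(attr_freq) > def_level
    % Walk through the values from most to least frequent, accumulating counts
    [~,idx] = sort(attr_freq,'descend');
    freq_sum = 0;
    for i = 1:length(attr_freq)
        if freq_sum+attr_freq(idx(i)) <= def_level
            freq_sum = freq_sum+attr_freq(idx(i));
        else
            % First value whose count pushes the running total past def_level
            modified_attr = attr_value{idx(i)};
            break
        end
    end
else
    % Too few records to reach the anonymity level: fall back to the first value
    modified_attr = attr_value{1};
end
end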
```
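To make the selection rule of modify_KAnonymity concrete, here is a small worked sketch that reproduces the same rule in vectorized form with cumsum, so it runs without the function file. The frequencies and the value labels are made up for illustration.

```matlab
% Hypothetical counts of four attribute values and their (made-up) labels
attr_freq  = [2 3 1 4];
attr_value = {'Private','Self-emp','Gov','Other'};
def_level  = 5;

% Sort from most to least frequent and find the first position at which
% the running total exceeds the anonymity level
[sorted_freq,order] = sort(attr_freq,'descend');
pos = find(cumsum(sorted_freq) > def_level,1);

if isempty(pos)
    chosen = attr_value{1};          % total never exceeds def_level
else
    chosen = attr_value{order(pos)}; % value that pushes the total past def_level
end
```

With these numbers the sorted counts are 4, 3, 2, 1. The running total is 4 after the first value and 7 after the second, so the second most frequent value is selected. When the most frequent value alone already exceeds def_level, the rule simply returns that value.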
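As a further improvement, I considered the problem that the cluster centroids shift once a subset of the records has been modified. One option is to group the clusters into subsets according to their separability and to apply the modification to each subset separately, which avoids the effect of the shifting centroids. The code for this part is as follows: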
```matlab
% Separability of a cluster: its size minus the size of the largest
% other cluster (assumes k >= 2)
cluster_size = accumarray(IDX,1,[k 1]);
separability = zeros(1,k);
for i = 1:k
    separability(i) = cluster_size(i) - max(cluster_size([1:i-1 i+1:k]));
end
% Sort the clusters by separability and split them into subsets
[~,idx] = sort(separability,'descend');
subset_size = [ceil(k/2) floor(k/2)]; % two subsets whose sizes sum to k (assumed split)
subset_end = cumsum(subset_size);
% Run the modification under the subset strategy
modified_data_encode = data_encode;
modified_num = 0;
for i = 1:length(subset_size)
    subset_idx = idx(subset_end(i)-subset_size(i)+1:subset_end(i)); % clusters in this subset
    in_subset = ismember(IDX,subset_idx);                           % records in this subset
    subset_data_encode = modified_data_encode(in_subset,:);
    modified_subset = zeros(size(subset_data_encode));
    for j = 1:attr_num
        n_j = numel(attr_value{j});
        % Count how often each code of attribute j occurs in the subset
        attr_freq_j = accumarray(subset_data_encode(:,j),1,[n_j 1])';
        modified_attr = modify_KAnonymity(attr_freq_j,def_level/length(subset_idx),num2cell(1:n_j));
        modified_subset(:,j) = modified_attr; % every record in the subset gets the same code
    end
    modified_data_encode(in_subset,:) = modified_subset;
    % Count the records whose row differs from the representative row
    modified_num = modified_num + sum(~ismember(subset_data_encode,modified_subset,'rows'));
end
% Display the modified data and the number of modified records
modified_data_encode
modified_num
```
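The separability ordering used above can be illustrated on a tiny example. This sketch assumes, as one reading of the code, that the separability of a cluster is its size minus the size of the largest other cluster; the assignment vector `toy_IDX` is made up.

```matlab
% Made-up cluster assignment of 9 records to 3 clusters
toy_IDX = [1 1 1 1 2 2 3 3 3]';
toy_k = 3;

% Size of every cluster
toy_size = accumarray(toy_IDX,1,[toy_k 1]);   % [4; 2; 3]

% Separability: own size minus the largest size among the other clusters
toy_sep = zeros(1,toy_k);
for i = 1:toy_k
    others = toy_size([1:i-1 i+1:toy_k]);
    toy_sep(i) = toy_size(i) - max(others);
end

% Clusters ordered from most to least separable
[~,toy_order] = sort(toy_sep,'descend');
```

Cluster 1 is the only one larger than all the others, so it comes first in the ordering and would fall into the first subset.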
The complete implementation of the k-anonymity algorithm is as follows: