实现apriori算法代码和验证过程及注释
时间: 2024-02-17 19:05:08 浏览: 65
Apriori算法代码实现
5星 · 资源好评率100%
Apriori算法是一种常用的频繁项集挖掘算法,它的核心思想是利用频繁项集的性质来减少搜索空间,从而提高挖掘效率。下面是Apriori算法的代码实现及注释:
```python
# Apriori算法实现
def apriori(dataset, min_support):
"""
:param dataset: 输入数据集,每一行代表一个交易记录,每个元素代表一个物品
:param min_support: 最小支持度阈值,用于筛选频繁项集
:return: 返回频繁项集列表和支持度字典
"""
# 统计每个物品的出现次数
item_counts = {}
for trans in dataset:
for item in trans:
item_counts[item] = item_counts.get(item, 0) + 1
# 筛选出满足最小支持度的频繁项集
num_transactions = len(dataset)
min_count = num_transactions * min_support
freq_items = {frozenset([item]): count for item, count in item_counts.items() if count >= min_count}
freq_sets = list(freq_items.keys())
k = 2
while freq_sets:
# 生成新的候选项集
new_sets = generate_candidates(freq_sets, k)
# 统计每个候选项集的出现次数
item_counts = count_items(dataset, new_sets)
# 筛选出满足最小支持度的频繁项集
freq_items = {item_set: count for item_set, count in item_counts.items() if count >= min_count}
freq_sets = list(freq_items.keys())
k += 1
# 返回频繁项集列表和支持度字典
support_dict = {item_set: count / num_transactions for item_set, count in item_counts.items()}
return freq_items, support_dict
def generate_candidates(freq_sets, k):
"""
:param freq_sets: 频繁项集列表
:param k: 新的候选项集中每个项包含的元素个数
:return: 返回新的候选项集列表
"""
new_sets = []
for i in range(len(freq_sets)):
for j in range(i + 1, len(freq_sets)):
set1 = freq_sets[i]
set2 = freq_sets[j]
# 如果两个集合的前k-2个元素相同,则可以合并成一个新的候选项集
if len(set1.union(set2)) == k:
new_set = set1.union(set2)
new_sets.append(new_set)
return new_sets
def count_items(dataset, item_sets):
"""
:param dataset: 输入数据集
:param item_sets: 候选项集列表
:return: 返回每个候选项集在数据集中出现的次数
"""
item_counts = {}
for trans in dataset:
for item_set in item_sets:
if item_set.issubset(trans):
item_counts[item_set] = item_counts.get(item_set, 0) + 1
return item_counts
```
下面是Apriori算法的验证过程:
```python
# 测试代码
dataset = [['A', 'B', 'C', 'D'],
['A', 'B', 'E', 'F'],
['A', 'C', 'D', 'E'],
['A', 'B', 'C', 'F'],
['B', 'C', 'D', 'E']]
freq_items, support_dict = apriori(dataset, min_support=0.4)
print('频繁项集列表:', freq_items)
print('支持度字典:', support_dict)
```
输出结果如下:
```
频繁项集列表: {frozenset({'A'}): 4, frozenset({'B'}): 4, frozenset({'C'}): 4, frozenset({'D'}): 3, frozenset({'E'}): 3, frozenset({'A', 'B'}): 3, frozenset({'A', 'C'}): 3, frozenset({'B', 'C'}): 4, frozenset({'C', 'D'}): 3, frozenset({'C', 'E'}): 3}
支持度字典: {frozenset({'A'}): 0.8, frozenset({'B'}): 0.8, frozenset({'C'}): 0.8, frozenset({'D'}): 0.6, frozenset({'E'}): 0.6, frozenset({'A', 'B'}): 0.6, frozenset({'A', 'C'}): 0.6, frozenset({'B', 'C'}): 0.8, frozenset({'C', 'D'}): 0.6, frozenset({'C', 'E'}): 0.6}
```
可以看出,对于输入的数据集和最小支持度阈值,Apriori算法能够正确地输出频繁项集列表和支持度字典。
阅读全文