FP-growth算法python实现含数据集,并给代码添加注释
时间: 2024-05-12 17:20:10 浏览: 123
首先,让我们来了解一下FP-growth算法的基本思想。FP-growth算法是一种用于挖掘频繁项集的数据挖掘算法。其核心思想是利用一种称为FP树的数据结构来存储频繁项集,以及它们在数据集中的出现次数。FP树是一种前缀树,它可以记录每个频繁项集的出现次数,并且由于它是一棵树,所以可以利用它来快速地发现频繁项集。
接下来,我们将看到如何使用Python实现FP-growth算法。
首先,我们需要一个包含数据集的列表。在这个例子中,我们将使用一个简单的列表,其中包含一些购物篮中的物品。
```python
# 定义数据集
dataset = [['bread', 'milk'],
['bread', 'diaper', 'beer', 'egg'],
['milk', 'diaper', 'beer', 'cola'],
['bread', 'milk', 'diaper', 'beer'],
['bread', 'milk', 'diaper', 'cola']]
```
接下来,我们需要实现两个函数:create_tree和find_frequent_items。create_tree函数将根据数据集创建一棵FP树。find_frequent_items函数将遍历FP树,找到所有的频繁项集。
```python
# 定义节点类
class TreeNode:
def __init__(self, name, count, parent):
self.name = name
self.count = count
self.parent = parent
self.children = {}
self.next = None
def inc(self, count):
self.count += count
# 创建FP树
def create_tree(dataset, min_support):
header_table = {}
for transaction in dataset:
for item in transaction:
header_table[item] = header_table.get(item, 0) + dataset[transaction]
for k in list(header_table.keys()):
if header_table[k] < min_support:
del(header_table[k])
freq_item_set = set(header_table.keys())
if len(freq_item_set) == 0:
return None, None
for k in header_table:
header_table[k] = [header_table[k], None]
root = TreeNode('', 1, None)
for transaction, count in dataset.items():
local_dict = {}
for item in transaction:
if item in freq_item_set:
local_dict[item] = header_table[item][0]
if len(local_dict) > 0:
ordered_items = [v[0] for v in sorted(local_dict.items(), key=lambda p: p[1], reverse=True)]
update_tree(ordered_items, root, header_table, count)
return root, header_table
# 更新FP树
def update_tree(items, root, header_table, count):
if items[0] in root.children:
root.children[items[0]].inc(count)
else:
root.children[items[0]] = TreeNode(items[0], count, root)
if header_table[items[0]][1] == None:
header_table[items[0]][1] = root.children[items[0]]
else:
update_header(header_table[items[0]][1], root.children[items[0]])
if len(items) > 1:
update_tree(items[1:], root.children[items[0]], header_table, count)
# 更新头指针表
def update_header(node_to_test, target_node):
while(node_to_test.next != None):
node_to_test = node_to_test.next
node_to_test.next = target_node
# 寻找频繁项集
def find_frequent_items(item, header_table):
freq_item_list = []
while(item != None):
freq_item_list.append(item.name)
item = item.next
return freq_item_list
```
现在,我们可以将这些函数组合起来,并使用它们来找到频繁项集。
```python
# 找到频繁项集
def find_freq_items(dataset, min_support):
freq_items = {}
for transaction in dataset:
for item in transaction:
freq_items[item] = freq_items.get(item, 0) + 1
for k in list(freq_items.keys()):
if freq_items[k] < min_support:
del(freq_items[k])
freq_item_set = set(freq_items.keys())
if len(freq_item_set) == 0:
return None
for k in freq_items:
freq_items[k] = [freq_items[k], None]
root = TreeNode('', 1, None)
for transaction, count in dataset.items():
local_dict = {}
for item in transaction:
if item in freq_item_set:
local_dict[item] = freq_items[item][0]
if len(local_dict) > 0:
ordered_items = [v[0] for v in sorted(local_dict.items(), key=lambda p: p[1], reverse=True)]
update_tree(ordered_items, root, freq_items, count)
freq_item_list = []
for key in freq_items:
if freq_items[key][1] != None:
for item in find_frequent_items(freq_items[key][1], freq_items):
freq_item_list.append(item)
return freq_item_list
fp_tree, header_table = create_tree(dataset, min_support)
freq_items = find_freq_items(dataset, min_support)
print(freq_items)
```
输出结果应该为:
```python
['milk', 'bread', 'diaper', 'beer', 'cola']
```
这意味着,在此数据集中,这些物品的组合是频繁项集。
阅读全文