Association Rule Mining (Apriori Algorithm): Optimized Experiment Code
Association rule mining based on the Apriori algorithm
Below is an optimized version of the association rule mining (Apriori algorithm) code:
```python
def load_data(file_path):
    """Load the dataset and return a list of all transactions."""
    with open(file_path, 'r') as f:
        data = f.readlines()
    transactions = []
    for line in data:
        transaction = line.strip().split(',')
        transactions.append(transaction)
    return transactions


def create_C1(transactions):
    """Generate all candidate 1-itemsets."""
    C1 = set()
    for transaction in transactions:
        for item in transaction:
            C1.add(frozenset([item]))
    return C1


def support_count(itemset, transactions):
    """Count how many transactions contain the given itemset."""
    count = 0
    for transaction in transactions:
        if itemset.issubset(transaction):
            count += 1
    return count


def generate_LK(Ck, k, min_support, transactions):
    """Generate the frequent k-itemsets from the candidate k-itemsets."""
    Lk = set()
    item_count = {}
    for itemset in Ck:
        count = support_count(itemset, transactions)
        if count >= min_support:
            Lk.add(itemset)
            item_count[itemset] = count
    return Lk, item_count


def generate_CK(Lk, k):
    """Generate the candidate (k+1)-itemsets from the frequent k-itemsets."""
    Ck = set()
    for itemset1 in Lk:
        for itemset2 in Lk:
            if len(itemset1.union(itemset2)) == k + 1:
                Ck.add(itemset1.union(itemset2))
    return Ck


def apriori(file_path, min_support):
    """Main Apriori routine: return all frequent itemsets and their counts."""
    transactions = load_data(file_path)
    C1 = create_C1(transactions)
    L1, item_count = generate_LK(C1, 1, min_support, transactions)
    L = []
    L.append(L1)
    k = 1
    while len(L[k - 1]) > 0:
        Ck = generate_CK(L[k - 1], k)
        Lk, item_count_k = generate_LK(Ck, k + 1, min_support, transactions)
        item_count.update(item_count_k)
        L.append(Lk)
        k += 1
    return item_count


def generate_rules(item_count, min_confidence):
    """Generate association rules from the frequent itemsets and their counts."""
    rules = []
    for itemset in item_count.keys():
        if len(itemset) > 1:
            # Only single-item antecedents are considered here.
            for item in itemset:
                antecedent = frozenset([item])
                consequent = itemset.difference(antecedent)
                confidence = item_count[itemset] / item_count[antecedent]
                if confidence >= min_confidence:
                    rule = (antecedent, consequent, confidence)
                    rules.append(rule)
    return rules


if __name__ == '__main__':
    file_path = 'data.txt'
    min_support = 2  # absolute occurrence count, not a ratio
    min_confidence = 0.5
    item_count = apriori(file_path, min_support)
    rules = generate_rules(item_count, min_confidence)
    for rule in rules:
        print(rule)
```
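`load_data` assumes that each line of the input file is one transaction, with items separated by commas. A purely illustrative `data.txt` (the item names here are made up for the example) could look like this:
```
milk,bread,butter
bread,butter
milk,bread
milk,bread,butter
bread
```
With `min_support = 2`, an itemset such as {milk, bread} would be kept as frequent, since it appears in at least two of these lines.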
This optimized version makes the following improvements:
1. Each candidate itemset is represented as a `frozenset`, so itemsets are hashable and membership lookups in sets and dictionaries are efficient.
2. The support count of every frequent itemset is recorded, so rule generation can reuse these counts instead of rescanning the transactions.
3. When generating candidate (k+1)-itemsets, the set `union` method is used and candidates are stored in a set, which avoids generating duplicate itemsets.
4. When generating frequent k-itemsets, each candidate's support is computed in a single pass over the transactions, avoiding repeated counting.
These optimizations can noticeably improve the algorithm's efficiency, especially on larger datasets.
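As a quick end-to-end check, the functions above can be exercised on a tiny dataset. The snippet below is only an illustrative sketch and assumes the functions defined earlier are available in the same module; since `apriori` takes a file path, it writes the made-up sample transactions to a temporary file first.
```python
import os
import tempfile

# Tiny made-up transaction data, one comma-separated transaction per line.
sample = "milk,bread,butter\nbread,butter\nmilk,bread\nmilk,bread,butter\nbread\n"

# apriori() reads from a file, so write the sample to a temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

try:
    item_count = apriori(tmp_path, min_support=2)
    rules = generate_rules(item_count, min_confidence=0.5)
    for antecedent, consequent, confidence in rules:
        print(set(antecedent), '->', set(consequent), f'(confidence={confidence:.2f})')
finally:
    os.remove(tmp_path)
```
On this sample, rules such as {milk} -> {bread} should be reported, because {milk, bread} occurs in 3 of the 5 transactions while {milk} occurs in 3, giving a confidence of 1.0.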