Importing a dataset from CSV and using Jupyter to compute its entropy and information gain
Time: 2024-01-27 13:41:23
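The following program uses Python's pandas library to import a CSV dataset and compute its entropy and information gain; numpy is imported as well, though the calculation itself relies only on `log2` from the standard `math` module.
First, import the required libraries: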
```python
import pandas as pd
import numpy as np
from math import log2
```
Then read the CSV dataset:
```python
df = pd.read_csv('dataset.csv')
```
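If no dataset is at hand, a small file can be generated to try the program on. The following is a minimal sketch using only the standard library; the column names (`outlook`, `windy`, `play`), the rows, and the file name `toy_dataset.csv` are all hypothetical and chosen only for illustration. The label is placed in the last column, which is the layout the functions in this answer expect.

```python
import csv
import os
import tempfile

# Hypothetical toy data: two feature columns followed by the label column.
FIELDS = ["outlook", "windy", "play"]
ROWS = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]


def write_toy_csv(path):
    """Write the toy rows to `path`, preceded by a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(ROWS)


# Write into a fresh temporary directory so no existing file is overwritten.
toy_path = os.path.join(tempfile.mkdtemp(), "toy_dataset.csv")
write_toy_csv(toy_path)

# Read the file back to confirm its layout.
with open(toy_path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(toy_path)
print(len(loaded), "rows; columns:", list(loaded[0].keys()))
```

Passing the printed path to `pd.read_csv` in place of `'dataset.csv'` would load the same table as a DataFrame.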
Next, define a function that computes the entropy of the dataset:
```python
def entropy(data):
    # Take the label column (assumed to be the last column)
    labels = data.iloc[:, -1]
    # Count how many times each label occurs
    label_counts = labels.value_counts()
    # Accumulate -p * log2(p) over all labels
    result = 0
    for count in label_counts:
        p = count / len(labels)
        result -= p * log2(p)
    return result
```
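The function above sums -p * log2(p) over the label frequencies. As a sanity check, the same formula can be applied to plain lists where the answer is easy to reason about. This is a minimal sketch, not part of the original program: the helper name `entropy_of_labels` and the label lists are hypothetical. Two equally likely classes should give 1 bit, and a single class should give 0.

```python
from collections import Counter
from math import log2


def entropy_of_labels(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


balanced = ["yes", "no", "yes", "no"]  # two classes, equally likely
pure = ["yes", "yes", "yes"]           # only one class
skewed = ["yes"] * 9 + ["no"] * 5      # 9 of one class, 5 of the other

print(entropy_of_labels(balanced))  # 1.0
print(entropy_of_labels(pure))      # 0.0
print(entropy_of_labels(skewed))    # somewhere between 0 and 1
```

The skewed case lands between the two extremes, which matches the intuition that entropy measures how mixed the labels are.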
Then define a function that computes the information gain of a given feature:
```python
def info_gain(data, feature):
    # Get the feature column
    feature_col = data[feature]
    # Get the distinct values the feature takes
    feature_values = feature_col.unique()
    # Entropy of the whole dataset
    data_entropy = entropy(data)
    # Conditional entropy: subset entropies weighted by subset size
    feature_entropy = 0
    for value in feature_values:
        subset = data[feature_col == value]
        p = len(subset) / len(data)
        feature_entropy += p * entropy(subset)
    # Information gain = entropy minus conditional entropy
    return data_entropy - feature_entropy
```
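Information gain is the dataset entropy minus the size-weighted entropy of the subsets produced by splitting on a feature. The two boundary cases make this concrete: a feature that separates the labels perfectly recovers the full entropy, and a feature unrelated to the labels gains nothing. The following standard-library sketch illustrates both; the helper names `entropy_of_labels` and `gain`, the column names `a`, `b`, `y`, and the rows are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def entropy_of_labels(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def gain(rows, feature, label):
    """Information gain of `feature` with respect to `label` for a list of dict rows."""
    # Group the labels by the value the feature takes in each row.
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[label])
    # Conditional entropy: each group's entropy weighted by its share of the rows.
    conditional = sum(len(g) / len(rows) * entropy_of_labels(g) for g in groups.values())
    return entropy_of_labels([row[label] for row in rows]) - conditional


# Hypothetical toy rows: "a" determines the label, "b" is unrelated to it.
rows = [
    {"a": "x", "b": "p", "y": "yes"},
    {"a": "x", "b": "q", "y": "yes"},
    {"a": "z", "b": "p", "y": "no"},
    {"a": "z", "b": "q", "y": "no"},
]

print(gain(rows, "a", "y"))  # 1.0: splitting on "a" leaves pure subsets
print(gain(rows, "b", "y"))  # 0.0: splitting on "b" leaves the labels just as mixed
```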
Finally, call the functions to compute the entropy of the dataset and the information gain of each feature:
```python
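# Compute the entropy of the dataset
data_entropy = entropy(df)
print('Entropy of the dataset:', data_entropy)
# Compute the information gain of each feature (every column except the last)
for feature in df.columns[:-1]:
    info_gain_value = info_gain(df, feature)
    print(f'Information gain of feature {feature}: {info_gain_value}')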
```
The complete code is as follows:
```python
import pandas as pd
import numpy as np
from math import log2


def entropy(data):
    # Take the label column (assumed to be the last column)
    labels = data.iloc[:, -1]
    # Count how many times each label occurs
    label_counts = labels.value_counts()
    # Accumulate -p * log2(p) over all labels
    result = 0
    for count in label_counts:
        p = count / len(labels)
        result -= p * log2(p)
    return result


def info_gain(data, feature):
    # Get the feature column
    feature_col = data[feature]
    # Get the distinct values the feature takes
    feature_values = feature_col.unique()
    # Entropy of the whole dataset
    data_entropy = entropy(data)
    # Conditional entropy: subset entropies weighted by subset size
    feature_entropy = 0
    for value in feature_values:
        subset = data[feature_col == value]
        p = len(subset) / len(data)
        feature_entropy += p * entropy(subset)
    # Information gain = entropy minus conditional entropy
    return data_entropy - feature_entropy


# Read the CSV dataset
df = pd.read_csv('dataset.csv')
# Compute the entropy of the dataset
data_entropy = entropy(df)
print('Entropy of the dataset:', data_entropy)
# Compute the information gain of each feature (every column except the last)
for feature in df.columns[:-1]:
    info_gain_value = info_gain(df, feature)
    print(f'Information gain of feature {feature}: {info_gain_value}')
```
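Note: the program assumes that the label is the last column of the CSV file and that every other column is a discrete feature. In practice, adapt it to the actual dataset and requirements.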