Implementing the decision tree algorithm for data mining in Python
Posted: 2023-05-17 21:02:52
You can implement a decision tree with Python's scikit-learn library. Here is a simple example:
```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
# Load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create the decision tree classifier
clf = DecisionTreeClassifier()
# Fit the model
clf.fit(X_train, y_train)
# Predict on the test set
y_pred = clf.predict(X_test)
# Print the model's accuracy
print("Accuracy:", clf.score(X_test, y_test))
```
This code trains and tests on the iris dataset and prints the model's accuracy on the held-out test set.
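Once trained, the learned rules can also be inspected, for example with scikit-learn's `export_text` helper. A minimal sketch reusing the iris setup above (the depth cap is just to keep the printout short):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# Cap the depth so the printed rules stay readable
clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(iris.data, iris.target)

# Print the learned if/else rules with the original feature names
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

This is often the quickest way to sanity-check what the tree actually learned.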
Related questions
Using Python to read an Excel sheet for data mining with a decision tree
To mine data from an Excel spreadsheet with Python, you can read the sheet with the pandas library and run the decision tree with scikit-learn. Below is a simple example that loads Excel data and classifies it with a decision tree.
First, install pandas and scikit-learn:
```shell
pip install pandas scikit-learn
```
Then read the Excel data with pandas:
```python
import pandas as pd
# Read the Excel spreadsheet
df = pd.read_excel('data.xlsx')
```
Next, you can clean the data and select features, for example by dropping missing and duplicate rows:
```python
# Drop rows with missing values, then drop duplicate rows
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)
```
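scikit-learn's trees expect numeric inputs, so any text columns read from Excel need encoding before training. A minimal sketch using pandas one-hot encoding (the column names here are made up for illustration, not taken from any real sheet):

```python
import pandas as pd

# Hypothetical frame standing in for the cleaned Excel data
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "city": ["Beijing", "Shanghai", "Beijing", "Shenzhen"],
    "y": [0, 1, 1, 0],
})

# One-hot encode the text column so the tree can consume it
encoded = pd.get_dummies(df, columns=["city"])
print(encoded.columns.tolist())
```

The `city` column is replaced by one 0/1 indicator column per distinct value.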
Then split the dataset into training and test sets (the label column is assumed here to be named `y`):
```python
from sklearn.model_selection import train_test_split
# Split into training and test sets; 'y' is the label column
X_train, X_test, y_train, y_test = train_test_split(df.drop('y', axis=1), df['y'], test_size=0.2, random_state=42)
```
Next, fit the decision tree classifier:
```python
from sklearn.tree import DecisionTreeClassifier
# Build the decision tree model
model = DecisionTreeClassifier()
# Fit the model
model.fit(X_train, y_train)
```
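An unconstrained tree will usually memorize the training set. Parameters such as `max_depth` and `min_samples_leaf` rein that in; a sketch with illustrative, untuned values:

```python
from sklearn.tree import DecisionTreeClassifier

# Illustrative settings, not tuned for any particular dataset
model = DecisionTreeClassifier(
    max_depth=4,         # cap the tree depth to reduce overfitting
    min_samples_leaf=5,  # each leaf must cover at least 5 samples
    random_state=42,     # make the fit reproducible
)
```

In practice these values are best chosen by validation, e.g. with `GridSearchCV`.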
Finally, evaluate the model and make predictions:
```python
from sklearn.metrics import accuracy_score
# Evaluate on the test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
# Predict on new data (it must have the same columns and preprocessing as the training features)
new_data = pd.read_excel('new_data.xlsx')
new_predictions = model.predict(new_data)
```
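A single train/test split can give a noisy accuracy estimate; cross-validation averages over several splits. A sketch using the iris data as a stand-in (the Excel-based frame above would work the same way):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
# 5-fold cross-validation: fit on 4 folds, score on the held-out fold, 5 times
scores = cross_val_score(DecisionTreeClassifier(random_state=42),
                         iris.data, iris.target, cv=5)
print(scores.mean())
```

The mean of the five fold scores is a steadier estimate than any single split.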
Simple data mining: ID3 decision tree classification and prediction in Python
Here is a simple Python implementation of decision tree classification and prediction based on the ID3 algorithm:
```python
import pandas as pd
import numpy as np

# Entropy of the label column (assumed to be the last column)
def calc_entropy(data):
    label_col = data.iloc[:, -1]
    _, counts = np.unique(label_col, return_counts=True)
    probs = counts / len(label_col)
    entropy = np.sum(probs * -np.log2(probs))
    return entropy

# Information gain from splitting on a given feature
def calc_info_gain(data, feature):
    entropy_before_split = calc_entropy(data)
    vals, counts = np.unique(data[feature], return_counts=True)
    probs = counts / counts.sum()
    entropy_after_split = 0
    for i in range(len(vals)):
        sub_data = data[data[feature] == vals[i]]
        entropy_after_split += probs[i] * calc_entropy(sub_data)
    info_gain = entropy_before_split - entropy_after_split
    return info_gain

# Pick the feature with the highest information gain
def get_best_split_feature(data):
    features = data.columns[:-1]
    best_feature = None
    best_info_gain = -1
    for feature in features:
        info_gain = calc_info_gain(data, feature)
        if info_gain > best_info_gain:
            best_info_gain = info_gain
            best_feature = feature
    return best_feature

# Recursively build the ID3 tree as nested dicts
def train_decision_tree(data):
    # Stopping case 1: all samples share one class -> return that class
    if len(np.unique(data.iloc[:, -1])) == 1:
        return np.unique(data.iloc[:, -1])[0]
    # Stopping case 2: no features left -> return the majority class
    # (np.bincount assumes non-negative integer class labels)
    if len(data.columns) == 1:
        return np.bincount(data.iloc[:, -1]).argmax()
    # Choose the best feature to split on
    best_feature = get_best_split_feature(data)
    decision_tree = {best_feature: {}}
    vals = np.unique(data[best_feature])
    for val in vals:
        # Drop the used feature and recurse on the matching subset
        sub_data = data[data[best_feature] == val].drop(best_feature, axis=1)
        decision_tree[best_feature][val] = train_decision_tree(sub_data)
    return decision_tree

# Classify a single sample (a pandas Series) by walking the tree
def predict(sample, tree):
    for feature, subtree in tree.items():
        val = sample[feature]
        if val not in subtree:
            # Unseen feature value: no branch to follow
            # (a fuller implementation would fall back to the majority class)
            return None
        subtree = subtree[val]
        if isinstance(subtree, dict):
            return predict(sample.drop(feature), subtree)
        else:
            return subtree

# Load the training data (the last column is assumed to be the class label)
data = pd.read_csv('data.csv')
# Build the decision tree
decision_tree = train_decision_tree(data)
# Predict new samples
new_data = pd.DataFrame({'feature1': [1, 1, 0, 0], 'feature2': [1, 0, 1, 0]})
for i in range(len(new_data)):
    prediction = predict(new_data.iloc[i], decision_tree)
    print('Sample', i + 1, 'predicted class:', prediction)
```
Note: this code is for learning purposes only; a real application needs more care and optimization, e.g. pruning, handling of continuous features, and a fallback for unseen feature values.
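The entropy formula at the heart of ID3 can be checked by hand. A self-contained sketch using the classic 9-"yes"/5-"no" label split from the play-tennis textbook example:

```python
import numpy as np

# A label column with 9 "yes" and 5 "no" samples
labels = np.array(["yes"] * 9 + ["no"] * 5)

# Entropy: H = -sum(p * log2(p)) over the class proportions
_, counts = np.unique(labels, return_counts=True)
probs = counts / counts.sum()
entropy = -(probs * np.log2(probs)).sum()  # ≈ 0.940 bits
print(round(entropy, 3))
```

The value 0.940 matches the standard worked example for this split.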