Visualizing `decision_tree = tree.DecisionTreeClassifier(max_depth=6); decision_tree.fit(X_train, y_train)` with Matplotlib
Posted: 2024-05-16 12:13:31
Sure, you can visualize the decision tree with the following steps:
1. First, import the required libraries and modules:
```python
import matplotlib.pyplot as plt
from sklearn import tree
```
2. Train the decision tree model:
```python
decision_tree = tree.DecisionTreeClassifier(max_depth=6)
decision_tree.fit(X_train, y_train)
```
3. Visualize the decision tree:
```python
plt.figure(figsize=(20, 10))
tree.plot_tree(decision_tree, filled=True, feature_names=X.columns, class_names=['0', '1'])
plt.show()
```
Here, `plt.figure(figsize=(20, 10))` sets the canvas size; `tree.plot_tree()` draws the tree; `filled=True` colors each node by its majority class; `feature_names=X.columns` uses the DataFrame's column names as feature names (this assumes `X` is the DataFrame the model was trained on); and `class_names=['0', '1']` labels the two classes. Finally, `plt.show()` displays the plot.
Hope this helps!
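As a side note, if no plotting backend is available, sklearn's `export_text` renders the same tree structure as indented text. A minimal sketch with stand-in data (swap in your own fitted model and feature names):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data for illustration; use your own X_train / y_train instead
X_demo = [[0], [1], [2], [3]]
y_demo = [0, 0, 1, 1]
clf = DecisionTreeClassifier(max_depth=2).fit(X_demo, y_demo)
print(export_text(clf, feature_names=["x0"]))
```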
Related question
Write code to process high_diamond_ranked_10min.csv and perform feature engineering: start with data exploration and analysis, preprocess the data with normalization, apply filter-based feature selection to pick the best number of features, reduce the two classes of data to 2 dimensions with PCA, and visualize them. Apply PCA to the full dataset and use a scree plot to choose a suitable range for the number of components. In one chart, plot line curves comparing decision tree and random forest accuracy for different feature counts. Output the total feature count and corresponding accuracy for the decision tree and random forest, the feature count after filtering and its accuracy, and the feature count after PCA and its accuracy.
Sure, let me walk through this.
First, import the required libraries and load the data. Assuming the file is named `high_diamond_ranked_10min.csv`, you can load and explore it like this:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load the data
df = pd.read_csv("high_diamond_ranked_10min.csv")
# Inspect column types and missing values
print(df.info())
# Summary statistics
print(df.describe())
# First 5 rows
print(df.head())
```
Next, preprocess the data: handle missing values and normalize. For example:
```python
# Drop rows with missing values
df.dropna(inplace=True)
# Min-max normalization
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
```
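As a quick aside, `MinMaxScaler` rescales each column to the [0, 1] range; the underlying formula, `(x - min) / (max - min)`, can be sketched in plain Python:

```python
# What MinMaxScaler does per column: (x - min) / (max - min)
def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([10, 20, 40]))  # [0.0, 0.333..., 1.0]
```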
Then perform feature selection. Filter methods such as variance thresholding or correlation-based selection work here; the following example keeps features whose absolute correlation with the target exceeds a threshold:
```python
# Filter-based selection: keep features strongly correlated with the target
corr = df_scaled.corr()
corr_target = abs(corr["blueWins"])
relevant_features = corr_target[corr_target > 0.2].drop("blueWins")  # drop the target itself
print(relevant_features)
```
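The 0.2 threshold above is applied to the absolute Pearson correlation with the target. For intuition, a minimal hand-rolled version of that coefficient:

```python
# Pearson correlation from first principles
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

print(pearson([1, 2, 3], [2, 4, 6]))  # 1.0: perfectly linear
print(pearson([1, 2, 3], [6, 4, 2]))  # -1.0: perfectly anti-correlated
```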
Next, we can use PCA to reduce the data to 2 dimensions:
```python
# PCA: reduce the features (excluding the target) to 2 dimensions
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(df_scaled.drop(columns=["blueWins"]))
df_pca = pd.DataFrame(data=pca_result, columns=["PC1", "PC2"])
```
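For intuition, PCA amounts to an eigendecomposition of the covariance matrix of the centered data. A small numpy-only sketch on illustrative toy points:

```python
import numpy as np

# Center the toy data, eigendecompose its covariance matrix,
# and report each component's share of the total variance
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0]])
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]  # sort by descending variance
ratio = eigvals[order] / eigvals.sum()
print(ratio)  # first component carries most of the variance
```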
Then use a scree plot to choose a suitable number of components after dimensionality reduction:
```python
# Scree plot: cumulative explained variance per principal component
# (the original KMeans elbow curve measured clustering SSE, not PCA variance)
from sklearn.decomposition import PCA
pca_full = PCA().fit(df_scaled.drop(columns=["blueWins"]))
ratios = pca_full.explained_variance_ratio_
plt.plot(range(1, len(ratios) + 1), np.cumsum(ratios), marker="o")
plt.title("Scree Plot")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.show()
```
Next, plot line curves comparing decision tree and random forest accuracy for different feature counts:
```python
# Compare decision tree vs. random forest accuracy as the feature count grows
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X = df_scaled[relevant_features.index]
y = df_scaled["blueWins"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
dt_scores = []
rf_scores = []
for i in range(1, len(X.columns) + 1):
    cols = X.columns[:i]  # train on the first i features
    dt = DecisionTreeClassifier(max_depth=6, random_state=0)
    dt.fit(X_train[cols], y_train)
    dt_scores.append(dt.score(X_test[cols], y_test))
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_train[cols], y_train)
    rf_scores.append(rf.score(X_test[cols], y_test))
plt.plot(range(1, len(X.columns)+1), dt_scores, label="Decision Tree")
plt.plot(range(1, len(X.columns)+1), rf_scores, label="Random Forest")
plt.legend()
plt.title("Accuracy vs. Number of Features")
plt.xlabel("Number of Features")
plt.ylabel("Accuracy")
plt.show()
```
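Note that `score()` on both classifiers returns plain accuracy, i.e. the fraction of correct predictions; equivalently:

```python
# Accuracy is just the fraction of matching predictions
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```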
Finally, output the total feature count with its accuracy for the decision tree and random forest, the feature count after filtering with its accuracy, and the feature count after PCA with its accuracy:
```python
# Total feature count and corresponding accuracy for both models
print("Decision Tree:")
print("Number of Features:", len(X.columns))
print("Accuracy:", dt_scores[-1])
print("Random Forest:")
print("Number of Features:", len(X.columns))
print("Accuracy:", rf_scores[-1])
# Feature count after filtering and the corresponding accuracy
X_filtered = df_scaled[["blueWardsPlaced", "redWardsPlaced", "blueWardsDestroyed", "redWardsDestroyed", "blueTotalGold", "redTotalGold", "blueTotalExperience", "redTotalExperience", "blueCSPerMin", "redCSPerMin", "blueGoldDiff", "redGoldDiff", "blueExperienceDiff", "redExperienceDiff", "blueDeaths", "redDeaths"]]
X_filtered_train, X_filtered_test, y_train, y_test = train_test_split(X_filtered, y, test_size=0.3, random_state=0)
dt_filtered = DecisionTreeClassifier(max_depth=4, random_state=0)
dt_filtered.fit(X_filtered_train, y_train)
dt_filtered_score = dt_filtered.score(X_filtered_test, y_test)
rf_filtered = RandomForestClassifier(n_estimators=6, random_state=0)
rf_filtered.fit(X_filtered_train, y_train)
rf_filtered_score = rf_filtered.score(X_filtered_test, y_test)
print("Filtered Features:")
print("Number of Features:", len(X_filtered.columns))
print("Decision Tree Accuracy:", dt_filtered_score)
print("Random Forest Accuracy:", rf_filtered_score)
# Feature count after PCA and the corresponding accuracy
pca = PCA(n_components=8)
pca_result = pca.fit_transform(df_scaled.drop(columns=["blueWins"]))  # exclude the target
df_pca = pd.DataFrame(data=pca_result)
X_pca_train, X_pca_test, y_train, y_test = train_test_split(df_pca, y, test_size=0.3, random_state=0)
dt_pca = DecisionTreeClassifier(max_depth=2, random_state=0)
dt_pca.fit(X_pca_train, y_train)
dt_pca_score = dt_pca.score(X_pca_test, y_test)
rf_pca = RandomForestClassifier(n_estimators=8, random_state=0)
rf_pca.fit(X_pca_train, y_train)
rf_pca_score = rf_pca.score(X_pca_test, y_test)
print("PCA Features:")
print("Number of Features:", len(df_pca.columns))
print("Decision Tree Accuracy:", dt_pca_score)
print("Random Forest Accuracy:", rf_pca_score)
```
Hope this code helps you complete your project.
Without using the sklearn library, write decision tree code for heart disease prediction, including data processing, model building, tuning, evaluation, and visualization.
Sure, here is a Python example of heart disease prediction with a decision tree implemented from scratch, covering data processing, model building, tuning, evaluation, and visualization:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Load the data
data = pd.read_csv('heart.csv')
# Preprocessing: split features and target
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
# Encode categorical variables as integers
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
X[:, 1] = labelencoder.fit_transform(X[:, 1])
X[:, 2] = labelencoder.fit_transform(X[:, 2])
X[:, 5] = labelencoder.fit_transform(X[:, 5])
X[:, 6] = labelencoder.fit_transform(X[:, 6])
X[:, 8] = labelencoder.fit_transform(X[:, 8])
y = labelencoder.fit_transform(y)
# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Build the decision tree model from scratch
class DecisionTree:
    def __init__(self, max_depth=5, min_samples_split=2, min_samples_leaf=1, criterion='entropy'):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.criterion = criterion

    # get_params/set_params/score make the class usable with GridSearchCV below
    def get_params(self, deep=True):
        return {'max_depth': self.max_depth,
                'min_samples_split': self.min_samples_split,
                'min_samples_leaf': self.min_samples_leaf,
                'criterion': self.criterion}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def score(self, X, y):
        return np.mean(np.array(self.predict(X)) == y)

    def fit(self, X, y):
        self.n_features_ = X.shape[1]
        self.tree_ = self._grow_tree(X, y)
        return self

    def _grow_tree(self, X, y, depth=0):
        n_samples, n_features = X.shape
        n_labels = len(np.unique(y))
        # Return a leaf node if a stopping condition is met
        if depth >= self.max_depth or n_labels == 1 or n_samples < self.min_samples_split:
            leaf_value = self._most_common_label(y)
            return Node(value=leaf_value)
        # Otherwise keep splitting
        feature_indices = np.random.choice(n_features, self.n_features_, replace=False)
        best_feature, best_threshold = self._best_split(X, y, feature_indices)
        # If no valid split is found, return a leaf node
        if best_feature is None:
            leaf_value = self._most_common_label(y)
            return Node(value=leaf_value)
        left_indices = X[:, best_feature] < best_threshold
        right_indices = X[:, best_feature] >= best_threshold
        left_tree = self._grow_tree(X[left_indices], y[left_indices], depth + 1)
        right_tree = self._grow_tree(X[right_indices], y[right_indices], depth + 1)
        return Node(feature=best_feature, threshold=best_threshold,
                    left_tree=left_tree, right_tree=right_tree)

    def _best_split(self, X, y, feature_indices):
        best_gain = -1
        best_feature = None
        best_threshold = None
        for feature in feature_indices:
            thresholds = np.unique(X[:, feature])
            for threshold in thresholds:
                gain = self._information_gain(X, y, feature, threshold)
                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature
                    best_threshold = threshold
        return best_feature, best_threshold

    def _information_gain(self, X, y, feature, threshold):
        parent_entropy = self._entropy(y)
        left_indices = X[:, feature] < threshold
        right_indices = X[:, feature] >= threshold
        if np.sum(left_indices) == 0 or np.sum(right_indices) == 0:
            return 0
        left_entropy = self._entropy(y[left_indices])
        right_entropy = self._entropy(y[right_indices])
        child_entropy = (np.sum(left_indices) / len(y)) * left_entropy \
            + (np.sum(right_indices) / len(y)) * right_entropy
        return parent_entropy - child_entropy

    def _entropy(self, y):
        _, counts = np.unique(y, return_counts=True)
        probabilities = counts / np.sum(counts)
        return -np.sum(probabilities * np.log2(probabilities))

    def _most_common_label(self, y):
        values, counts = np.unique(y, return_counts=True)
        return values[np.argmax(counts)]  # the label with the highest count

    def predict(self, X):
        return [self._predict(inputs) for inputs in X]

    def _predict(self, inputs):
        node = self.tree_
        while node.feature is not None:
            if inputs[node.feature] < node.threshold:
                node = node.left_tree
            else:
                node = node.right_tree
        return node.value

class Node:
    def __init__(self, feature=None, threshold=None, value=None, left_tree=None, right_tree=None):
        self.feature = feature
        self.threshold = threshold
        self.value = value
        self.left_tree = left_tree
        self.right_tree = right_tree
# Train the model
tree = DecisionTree()
tree.fit(X_train, y_train)
# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV
parameters = {'max_depth': [3, 5, 7, 9], 'min_samples_split': [2, 4, 6, 8], 'min_samples_leaf': [1, 2, 3, 4]}
grid_search = GridSearchCV(estimator=tree, param_grid=parameters, scoring='accuracy', cv=10)
grid_search = grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
# Evaluate the best model on the test set
y_pred = grid_search.best_estimator_.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
# Visualization: sklearn's export_graphviz only works with sklearn trees,
# so print the custom tree recursively instead
feature_names = list(data.columns)[:-1]

def print_tree(node, depth=0):
    indent = '  ' * depth
    if node.feature is None:
        print(f"{indent}predict: {node.value}")
    else:
        print(f"{indent}{feature_names[node.feature]} < {node.threshold}")
        print_tree(node.left_tree, depth + 1)
        print_tree(node.right_tree, depth + 1)

print_tree(tree.tree_)
```
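As a quick sanity check on the entropy formula the class uses (`-Σ p·log2(p)`), here is a standalone version: a balanced binary split carries 1 bit of entropy, while a pure node carries none.

```python
import math

# Shannon entropy of a label list: -sum(p * log2(p)) over class proportions
def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

print(entropy([0, 0, 1, 1]))  # 1.0
```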
This example covers basic data processing, a from-scratch decision tree, hyperparameter tuning, model evaluation, and visualization. Note that although the tree itself is hand-written, sklearn is still used for preprocessing, splitting, tuning, and metrics, and the implementation is not optimal; treat it as a simple reference.