Code for building and visualizing a decision tree model in Spark-based Boston housing price prediction
Posted: 2024-03-03 11:48:33
Sure, here is code for building and visualizing a decision tree model in Spark-based Boston housing price prediction:
```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder
from pyspark.ml.regression import DecisionTreeRegressor, DecisionTreeRegressionModel
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
import numpy as np
import matplotlib.pyplot as plt

# Create the SparkSession (omitted in the original snippet)
spark = SparkSession.builder.appName("BostonHousing").getOrCreate()

# Read the data
data = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("boston.csv")

# Preprocessing: the Boston dataset is all numeric, so categoricalCols is empty;
# the loop is kept for datasets that do contain categorical columns
stages = []
categoricalCols = []
numericCols = data.columns
numericCols.remove('medv')
for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]

# medv is a continuous target, so it is used directly as the regression label;
# running it through a StringIndexer would wrongly treat it as categorical
assembler = VectorAssembler(inputCols=numericCols, outputCol="features")
stages += [assembler]

# Split into training and test sets
(trainingData, testData) = data.randomSplit([0.7, 0.3], seed=100)

# Build the decision tree model
dt = DecisionTreeRegressor(labelCol="medv", featuresCol="features")

# Set up the parameter grid
paramGrid = (ParamGridBuilder()
             .addGrid(dt.maxDepth, [2, 4, 6, 8])
             .addGrid(dt.minInstancesPerNode, [1, 2, 3])
             .build())

# Cross-validation over the tree's hyperparameters
evaluator = RegressionEvaluator(labelCol="medv", predictionCol="prediction", metricName="rmse")
cv = CrossValidator(estimator=dt, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

# Build the Pipeline; the cross-validator must be included as a stage,
# otherwise the tree is never trained
pipeline = Pipeline(stages=stages + [cv])

# Train the model
model = pipeline.fit(trainingData)

# Predict on the test set
predictions = model.transform(testData)

# Evaluate the model
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

# Visualize the decision tree: the last pipeline stage is the CrossValidatorModel,
# whose bestModel is the fitted DecisionTreeRegressionModel
treeModel = model.stages[-1].bestModel
treeModel.write().overwrite().save("dt_model")
sameModel = DecisionTreeRegressionModel.load("dt_model")
print(sameModel.toDebugString)

# Scatter plot of predicted vs. actual values
plt.scatter(np.array(predictions.select('prediction').collect()),
            np.array(predictions.select('medv').collect()))
plt.xlabel('Predictions')
plt.ylabel('Actual')
plt.show()
```
In the code above, we first read and preprocess the data, then split it into training and test sets. Next, we build a decision tree model and set up a parameter grid for cross-validation. A Pipeline ties the preprocessing and model-selection steps together; we fit it on the training data and predict on the test set. Finally, we evaluate the model with a RegressionEvaluator, print the tree structure for inspection, and plot a scatter chart of predicted versus actual values.
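Beyond printing `toDebugString`, a fitted `DecisionTreeRegressionModel` also exposes `featureImportances`, which is often more readable than the raw tree. The snippet below is a minimal, Spark-free sketch of how such importances could be paired with column names and ranked; the feature names and importance values here are made up for illustration, as if taken from `treeModel.featureImportances.toArray()`.

```python
# Hypothetical importance values, as might come from
# treeModel.featureImportances.toArray() after fitting (made-up numbers)
feature_names = ["crim", "rm", "lstat", "dis", "nox"]
importances = [0.05, 0.35, 0.45, 0.10, 0.05]

# Pair each feature with its importance and sort by importance, descending
ranked = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)

for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

The same `ranked` list could then feed a horizontal bar chart (`plt.barh`) to visualize which features drive the tree's splits.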