Build a machine learning model on Houston house price data with the PySpark ML library and make predictions
Date: 2023-12-22 14:06:16
First, we need to import PySpark and load the dataset. We'll assume the data is stored in CSV format.
```python
from pyspark.sql import SparkSession

# Create the SparkSession
spark = SparkSession.builder.appName("HousePricePrediction").getOrCreate()

# Load the dataset; inferSchema is needed so numeric columns
# are read as numbers rather than as strings
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("houston_house_prices.csv")

# Peek at the first few rows
df.show(5)
```
Next, we preprocess the data: drop columns we don't need, one-hot encode the categorical variables, and standardize the numeric ones.
```python
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml import Pipeline
# Drop columns that should not be used as features
df = df.drop("MLS", "Address", "Street", "Zip", "Longitude", "Latitude")

# Index and one-hot encode the categorical variables
categoricalCols = ["Neighborhood", "Type"]
indexers = [StringIndexer(inputCol=c, outputCol=c + "_index") for c in categoricalCols]
encoders = [OneHotEncoder(inputCol=c + "_index", outputCol=c + "_vec") for c in categoricalCols]

# Assemble the numeric variables into a vector and standardize them
numericCols = ["Age", "LotSize", "LivingArea", "Rooms", "Bedrooms", "Bathrooms"]
numAssembler = VectorAssembler(inputCols=numericCols, outputCol="features_n")
scaler = StandardScaler(inputCol="features_n", outputCol="features_n_scaled")

# Combine the scaled numeric vector (not the raw numeric columns)
# with the encoded categorical vectors
finalAssembler = VectorAssembler(
    inputCols=["features_n_scaled"] + [c + "_vec" for c in categoricalCols],
    outputCol="features")

# Fit the whole preprocessing pipeline once and keep the fitted
# model so the same transformations can be reused on new data
pipeline = Pipeline(stages=indexers + encoders + [numAssembler, scaler, finalAssembler])
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)

# Keep only the feature vector and the label
df = df.select("features", "Price")

# Split into training and test sets
train, test = df.randomSplit([0.8, 0.2], seed=12345)
```
Now we can build the machine learning model. Here we use a decision tree regressor.
```python
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.evaluation import RegressionEvaluator
# Build the decision tree regression model
dt = DecisionTreeRegressor(featuresCol="features", labelCol="Price")

# Train the model
model = dt.fit(train)

# Make predictions on the test set
predictions = model.transform(test)

# Evaluate the model with RMSE
evaluator = RegressionEvaluator(labelCol="Price", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)
```
Finally, we can use the model to make predictions on new data.
```python
# New rows must include every raw column the preprocessing pipeline
# expects, including the categorical "Type" field (the values below
# are illustrative placeholders)
newData = spark.createDataFrame(
    [(35, 5000.0, 2000.0, 5, 3, 2, "Braeswood", "Single-Family"),
     (45, 6000.0, 2500.0, 6, 4, 3, "Memorial", "Single-Family")],
    ["Age", "LotSize", "LivingArea", "Rooms", "Bedrooms", "Bathrooms",
     "Neighborhood", "Type"])

# Apply pipelineModel, the preprocessing pipeline fitted on the
# training data (an unfitted Pipeline has no transform method),
# then score with the trained model
newData = pipelineModel.transform(newData).select("features")
predictions = model.transform(newData)

# Print the predictions
predictions.show()
```
That is the full process of building a machine learning model on Houston house price data with the PySpark ML library and using it to make predictions.