```python
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataPrediction").getOrCreate()

# Load the data
data = spark.read.csv("graduate.csv", header=True, inferSchema=True)

# Index the categorical variables
gender_indexer = StringIndexer(inputCol="sex", outputCol="gender_index")
major_indexer = StringIndexer(inputCol="major", outputCol="major_index")
hobby_indexer = StringIndexer(inputCol="hobby", outputCol="hobby_index")

# Assemble all feature columns into a single vector
assembler = VectorAssembler(
    inputCols=["gender_index", "major_index", "hobby_index", "salary"],
    outputCol="features")

# Split the data into training and test sets
train_data, test_data = data.randomSplit([0.7, 0.3], seed=123)

# Create the linear regression model
lr = LinearRegression(featuresCol="features", labelCol="grade")

# Train the model
model = lr.fit(train_data)

# Make predictions on the test set
predictions = model.transform(test_data)

# Evaluate the model
evaluator = RegressionEvaluator(labelCol="grade", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)
```

Why does this code raise `pyspark.sql.utils.IllegalArgumentException: 'Field "features" does not exist.\nAvailable fields: id, grade, sex, major, salary, hobby'`?
This error occurs because the `features` column is never actually created: the `StringIndexer` and `VectorAssembler` objects are only constructed, never fitted to or applied to the DataFrame, so `train_data` still contains only the raw CSV columns (`id, grade, sex, major, salary, hobby`), exactly as the error message lists. When `lr.fit(train_data)` looks up the `features` column, it does not exist. Points to check:
1. Apply the transformers rather than just instantiating them: fit and transform the data with each `StringIndexer`, then transform the result with the `VectorAssembler` before fitting the regression model, or chain all the stages together in a `Pipeline` (see the sketch below).
2. Make sure the input column names passed to the transformers exactly match the column names in the dataset, including case and whitespace.
3. Handle missing values before assembling; rows with nulls in the input columns will cause `VectorAssembler` to fail.
Fixing the first point is what resolves this particular error; the other two are common follow-up pitfalls.
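A minimal sketch of the `Pipeline`-based fix, assuming the same `graduate.csv` schema and column names used in the question:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataPrediction").getOrCreate()
data = spark.read.csv("graduate.csv", header=True, inferSchema=True)

# Chain the indexers, the assembler, and the regressor into one Pipeline,
# so that fitting the pipeline also creates the "features" column.
gender_indexer = StringIndexer(inputCol="sex", outputCol="gender_index")
major_indexer = StringIndexer(inputCol="major", outputCol="major_index")
hobby_indexer = StringIndexer(inputCol="hobby", outputCol="hobby_index")
assembler = VectorAssembler(
    inputCols=["gender_index", "major_index", "hobby_index", "salary"],
    outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="grade")

pipeline = Pipeline(stages=[gender_indexer, major_indexer, hobby_indexer, assembler, lr])

train_data, test_data = data.randomSplit([0.7, 0.3], seed=123)

# fit() runs every stage in order, so "features" exists by the time LinearRegression sees the data.
model = pipeline.fit(train_data)
predictions = model.transform(test_data)

evaluator = RegressionEvaluator(labelCol="grade", predictionCol="prediction", metricName="rmse")
print("RMSE on test data = %g" % evaluator.evaluate(predictions))
```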
Related questions
```python
from pyspark.mllib.evaluation import BinaryClassificationMetrics
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
```
This code imports the classes needed from the Spark MLlib and Spark SQL modules. Specifically:
- `BinaryClassificationMetrics`: the evaluation-metrics class for binary classification models in the RDD-based `pyspark.mllib` API;
- `SparkSession`: the Spark SQL entry point class used to create a SparkSession;
- `VectorAssembler`: the feature transformer that merges multiple feature columns into a single feature-vector column;
- `StandardScaler`: the transformer that standardizes feature vectors;
- `LogisticRegression`: the logistic regression classifier;
- `BinaryClassificationEvaluator`: the evaluator for binary classification models, used to measure model performance.
These classes are standard machine-learning building blocks for preprocessing data, training a model, and evaluating its performance; the corresponding modules must be imported before they can be used. A minimal usage sketch follows.
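For illustration only, here is a hedged sketch of how the DataFrame-based classes typically fit together (the toy DataFrame, its column names `f1`, `f2`, `label`, and the app name are invented for the example; the RDD-based `BinaryClassificationMetrics` is not used here):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("BinaryClassificationDemo").getOrCreate()

# Toy data: two numeric features and a binary label (hypothetical).
df = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 1.5, 0.0), (3.0, 3.5, 1.0), (4.0, 3.0, 1.0)],
    ["f1", "f2", "label"])

# Assemble the raw columns into a vector, standardize it, then fit logistic regression.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features",
                        withMean=True, withStd=True)
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, scaler, lr])

# Fit and score on the same toy data purely to show the wiring.
model = pipeline.fit(df)
predictions = model.transform(df)

# areaUnderROC is the evaluator's default metric.
evaluator = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="rawPrediction")
print("AUC =", evaluator.evaluate(predictions))
```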
Based on the following code, describe the model selection procedure that was adopted, and report and discuss the estimation results on the training set for each candidate model:

```python
from sklearn.model_selection import train_test_split

X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_tra, X_val, y_tra, y_val = train_test_split(X_tv, y_tv, test_size=0.25, random_state=1)

# setting features
F1 = ["Panel_Capacity"]
F2 = ["Panel_Capacity", "Roof_Azimuth", "Latitude", "Roof_Pitch", "Shading_Partial", "Shading_Significant"]
F3 = ["Panel_Capacity", "Roof_Azimuth", "Latitude", "Roof_Pitch", "Shading_Partial", "Shading_Significant",
      "Shading", "Year", "City_Melbourne", "City_Sydney", "Shading*Panel_Capacity"]

x1_tra = X_tra[F1].to_numpy().reshape(-1, 1)
y1_tra = y_tra

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as mse

# model estimation by using training set
M1 = LinearRegression()
M1.fit(x1_tra, y1_tra)
# coefficients
print(M1.intercept_)
print(M1.coef_)

x2_tra = X_tra[F2].to_numpy()
y2_tra = y_tra

# model estimation by using training set
M2 = LinearRegression()
M2.fit(x2_tra, y2_tra)
# coefficients
print(M2.intercept_)
print(M2.coef_)

# model selection by using validation set
x2_val = X_val[F2].to_numpy()
M2_pre = M2.predict(x2_val)
```
The model selection procedure adopted in this code involves splitting the data into training, validation, and testing sets. The training set is used to fit the models, the validation set is used to select the best model, and the testing set is used to evaluate the performance of the final model.
The data is split using the train_test_split function from the sklearn.model_selection module. The test_size parameter is set to 0.2, which means that the testing set will contain 20% of the data. The random_state parameter is set to 1 to ensure reproducibility.
The remaining 80% (X_tv) is then split again with the same function into a training subset and a validation subset. Here test_size is set to 0.25, so the validation set holds 25% of X_tv, i.e. 20% of the full dataset. Again, random_state is set to 1 for reproducibility.
Three sets of features are defined: F1, F2, and F3. F1 contains only the "Panel_Capacity" feature, F2 contains "Panel_Capacity", "Roof_Azimuth", "Latitude", "Roof_Pitch", "Shading_Partial", and "Shading_Significant" features, and F3 contains all the features in F2 plus "Shading", "Year", "City_Melbourne", "City_Sydney", and "Shading*Panel_Capacity".
Linear regression models are then estimated on the training subset with LinearRegression from sklearn.linear_model: M1 uses the single feature in F1 and M2 uses the features in F2; the snippet shown does not fit a model for F3. The mean_squared_error function from sklearn.metrics is imported (aliased as mse) as the intended evaluation metric, although the snippet never actually calls it.
After estimation, the intercept and coefficients of each model are printed.
The model selection step then begins: the F2 columns of X_val are converted to a NumPy array and passed to M2's predict method, and the validation-set predictions are stored in M2_pre. The snippet stops there, so no validation error has been computed and no model has actually been selected yet.
No further analysis or discussion of the estimation results is provided in the code. To complete the procedure, one would fit the F3 model in the same way, compute the validation MSE of each candidate, choose the model with the lowest validation error, and only then report that model's error on the held-out test set. A more thorough evaluation could also examine the residuals and check for violations of the linear regression assumptions. A hedged sketch of the missing selection step is shown below.
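As an illustration only (the feature lists F1, F2, F3 and the split variables X_tra, X_val, y_tra, y_val come from the question; the candidates dictionary and the loop are assumptions about how one might organize the comparison), the selection step could look like this:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as mse

# F1/F2/F3, X_tra, X_val, y_tra, y_val are assumed to be defined as in the original snippet.
candidates = {"M1": F1, "M2": F2, "M3": F3}

results = {}
for name, cols in candidates.items():
    model = LinearRegression()
    model.fit(X_tra[cols].to_numpy(), y_tra)          # estimate on the training subset
    val_pred = model.predict(X_val[cols].to_numpy())  # predict on the validation subset
    results[name] = mse(y_val, val_pred)              # validation MSE for this candidate
    print(name, "intercept:", model.intercept_, "coefficients:", model.coef_)

best = min(results, key=results.get)
print("Validation MSE:", results)
print("Selected model:", best)
```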