Please provide code that uses Spark ML to analyze graduate data with the following fields: ID, GPA, gender, major, monthly salary, and hobbies
Time: 2024-04-21 07:30:14 Views: 12
Sure. Here is a simple example for reference:
```python
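from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession

# Create the SparkSession
spark = SparkSession.builder.appName("graduate_analysis").getOrCreate()

# Load the dataset (assumes the CSV header uses the column names
# id, score, gender, major, salary, hobby, where score is the GPA)
data = spark.read.csv("path/to/graduate_data.csv", header=True, inferSchema=True)

# Data cleaning: drop rows without a label, then fill the remaining gaps.
# fillna(0) only touches numeric columns, so string columns are filled separately.
data = data.dropna(subset=["score"])
data = data.fillna(0).fillna("unknown", subset=["gender", "major"])

# Feature engineering: index the string columns, one-hot encode the indices,
# then assemble everything except the label into one feature vector
gender_indexer = StringIndexer(inputCol="gender", outputCol="genderIndex", handleInvalid="keep")
major_indexer = StringIndexer(inputCol="major", outputCol="majorIndex", handleInvalid="keep")
encoder = OneHotEncoder(inputCols=["genderIndex", "majorIndex"], outputCols=["genderVec", "majorVec"])
assembler = VectorAssembler(inputCols=["genderVec", "majorVec", "salary"], outputCol="features")

# Model training: a Pipeline fits every stage on the training split only
# and reuses the fitted stages for the test data and for new data
training_data, test_data = data.randomSplit([0.8, 0.2], seed=42)
lr = LinearRegression(featuresCol="features", labelCol="score")
pipeline = Pipeline(stages=[gender_indexer, major_indexer, encoder, assembler, lr])
model = pipeline.fit(training_data)

# Model evaluation
predictions = model.transform(test_data)
evaluator = RegressionEvaluator(labelCol="score", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

# Model application: the fitted pipeline indexes, encodes, and assembles new rows itself
new_data = spark.createDataFrame(
    [(0, 3.5, "female", "computer science", 5000, "reading")],
    ["id", "score", "gender", "major", "salary", "hobby"])
result = model.transform(new_data)
result.select("id", "gender", "major", "salary", "prediction").show()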
```
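The code above uses a linear regression model to predict graduates' GPA. Gender and major are one-hot encoded and assembled into a single feature vector together with the numeric inputs, and the example also shows how to evaluate the model and apply it to a new record. The hobby field is not used here. In practice, choose the algorithm and methods that fit your requirements and the characteristics of your dataset.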
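To make the encoding and evaluation steps more concrete, here is a minimal plain-Python sketch of what they compute. It is not Spark's implementation. It assumes the default settings: `StringIndexer` assigns index 0 to the most frequent label, `OneHotEncoder` drops the last category so it becomes an all-zero vector, and RMSE is the square root of the mean squared error. The helper names and sample values are made up for illustration.

```python
import math
from collections import Counter


def string_index(values):
    """Map each label to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a 0/1 list; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


def rmse(labels, predictions):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up sample values, not taken from any real dataset
majors = ["cs", "math", "cs", "physics", "cs", "math"]
mapping = string_index(majors)
encoded = [one_hot(mapping[m], len(mapping)) for m in majors]

# Made-up GPAs and predictions
error = rmse([3.0, 3.5, 4.0], [3.0, 3.0, 3.0])

print(mapping)
print(encoded)
print(error)
```

Because the last category is dropped, three majors yield two-element vectors. This avoids a redundant column, which matters for linear models such as the linear regression used above.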