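One-hot encoding province, city, and district string features with Spark 2 in Java
Time: 2024-01-31 14:01:43  Views: 76
First convert the province, city, and district string features into numeric indices with `StringIndexer`, then one-hot encode those indices with `OneHotEncoder`. Sample code: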
```java
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.OneHotEncoder;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkOneHotEncoding {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SparkOneHotEncoding")
                .master("local")
                .getOrCreate();

        // Read the data. The file starts with a header row, so tell the
        // reader to consume it instead of treating it as a data row.
        Dataset<Row> data = spark.read()
                .option("header", "true")
                .csv("path/to/your/data.csv")
                .toDF("province", "city", "district");

        // Map each string feature to a numeric index
        StringIndexer provinceIndexer = new StringIndexer()
                .setInputCol("province")
                .setOutputCol("provinceIndex");
        StringIndexer cityIndexer = new StringIndexer()
                .setInputCol("city")
                .setOutputCol("cityIndex");
        StringIndexer districtIndexer = new StringIndexer()
                .setInputCol("district")
                .setOutputCol("districtIndex");

        // One-hot encode the numeric indices.
        // Note: in Spark 2.3 and 2.4 this class is deprecated in favor of
        // OneHotEncoderEstimator, but it still works inside a Pipeline.
        OneHotEncoder provinceEncoder = new OneHotEncoder()
                .setInputCol("provinceIndex")
                .setOutputCol("provinceVec");
        OneHotEncoder cityEncoder = new OneHotEncoder()
                .setInputCol("cityIndex")
                .setOutputCol("cityVec");
        OneHotEncoder districtEncoder = new OneHotEncoder()
                .setInputCol("districtIndex")
                .setOutputCol("districtVec");

        // Concatenate the three one-hot vectors into a single feature vector
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{"provinceVec", "cityVec", "districtVec"})
                .setOutputCol("features");

        // Build the Pipeline. setStages takes a single PipelineStage[] array,
        // with the stages listed in execution order.
        Pipeline pipeline = new Pipeline()
                .setStages(new PipelineStage[]{
                        provinceIndexer, cityIndexer, districtIndexer,
                        provinceEncoder, cityEncoder, districtEncoder,
                        assembler});

        // Fit the pipeline (this learns the label-to-index mappings)
        PipelineModel model = pipeline.fit(data);

        // Apply the fitted pipeline to produce the encoded columns
        Dataset<Row> encodedData = model.transform(data);
        encodedData.show();

        spark.stop();
    }
}
```
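As a rough illustration of what the indexing and encoding stages compute, here is a plain-Java sketch that does not use Spark. The class and method names (`OneHotSketch`, `fitIndex`, `oneHot`) are hypothetical and exist only for this example. It assumes frequency-descending indexing with ties kept in first-seen order, which is a simplification: Spark's own tie order may differ. Place names are romanized to keep the source ASCII-only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; this is not how Spark implements these stages.
final class OneHotSketch {

    // Assign each distinct label an index, most frequent label first.
    // Assumption: ties keep first-seen order (Spark's tie order may differ).
    static Map<String, Integer> fitIndex(List<String> labels) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so equally frequent labels stay in first-seen order.
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        Map<String, Integer> index = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries) {
            index.put(e.getKey(), index.size());
        }
        return index;
    }

    // One-hot encode an index. With dropLast, the vector has one slot fewer
    // and the last category is represented by the all-zero vector.
    static double[] oneHot(int index, int numCategories, boolean dropLast) {
        int size = dropLast ? numCategories - 1 : numCategories;
        double[] vec = new double[size];
        if (index < size) {
            vec[index] = 1.0;
        }
        return vec;
    }

    public static void main(String[] args) {
        List<String> cities = Arrays.asList(
                "Nanjing", "Nanjing", "Suzhou", "Hangzhou", "Hangzhou", "Ningbo");
        Map<String, Integer> index = fitIndex(cities);
        for (String city : index.keySet()) {
            double[] vec = oneHot(index.get(city), index.size(), true);
            System.out.println(city + " -> " + index.get(city) + " -> " + Arrays.toString(vec));
        }
    }
}
```

With four distinct cities and `dropLast` enabled, each vector has three slots, and the city that receives the highest index is encoded as all zeros.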
The `data.csv` file contains:
```
province,city,district
江苏省,南京市,玄武区
江苏省,南京市,鼓楼区
江苏省,苏州市,姑苏区
浙江省,杭州市,拱墅区
浙江省,杭州市,上城区
浙江省,宁波市,海曙区
```
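One possible result is shown here. `StringIndexer` assigns indices by descending label frequency, and the order among equally frequent labels is not guaranteed, so the exact indices may differ on your run. The intermediate `provinceVec`, `cityVec`, and `districtVec` columns are omitted for width:
```
+--------+------+--------+-------------+---------+-------------+--------------------+
|province|  city|district|provinceIndex|cityIndex|districtIndex|            features|
+--------+------+--------+-------------+---------+-------------+--------------------+
|  江苏省|南京市|  玄武区|          0.0|      0.0|          0.0|(9,[0,1,4],[1.0,1...|
|  江苏省|南京市|  鼓楼区|          0.0|      0.0|          1.0|(9,[0,1,5],[1.0,1...|
|  江苏省|苏州市|  姑苏区|          0.0|      2.0|          2.0|(9,[0,3,6],[1.0,1...|
|  浙江省|杭州市|  拱墅区|          1.0|      1.0|          3.0| (9,[2,7],[1.0,1.0])|
|  浙江省|杭州市|  上城区|          1.0|      1.0|          4.0| (9,[2,8],[1.0,1.0])|
|  浙江省|宁波市|  海曙区|          1.0|      3.0|          5.0|           (9,[],[])|
+--------+------+--------+-------------+---------+-------------+--------------------+
```
The `features` vector concatenates three blocks: the one-hot encodings of the province, city, and district features, in that order. Because `OneHotEncoder` drops the last category by default (`dropLast = true`), the blocks have sizes 1, 3, and 5 for this data, giving a total length of 9, and the last category of each feature is encoded as all zeros.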
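To help read the sparse `features` values, the following plain-Java sketch (no Spark) computes which positions are set in a concatenated vector. It assumes the default `dropLast = true` behavior, so a feature with n categories contributes a block of n - 1 slots; for the sample data that means category counts of 2, 4, and 6 and blocks of 1, 3, and 5 slots. The names `AssembleSketch`, `activePositions`, and `totalSize` are hypothetical helpers for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of how per-feature one-hot blocks line up
// inside one concatenated vector. Not Spark's implementation.
final class AssembleSketch {

    // Returns the positions that hold 1.0 in the concatenated vector.
    // Assumption: each block drops its last category (dropLast = true).
    static List<Integer> activePositions(int[] categoryIndices, int[] numCategories) {
        List<Integer> active = new ArrayList<>();
        int offset = 0;
        for (int i = 0; i < categoryIndices.length; i++) {
            int blockSize = numCategories[i] - 1;
            if (categoryIndices[i] < blockSize) {
                active.add(offset + categoryIndices[i]);
            }
            // The next feature's block starts right after this one.
            offset += blockSize;
        }
        return active;
    }

    // Total length of the concatenated vector.
    static int totalSize(int[] numCategories) {
        int total = 0;
        for (int n : numCategories) {
            total += n - 1;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] numCategories = {2, 4, 6}; // provinces, cities, districts
        System.out.println("vector size = " + totalSize(numCategories));           // vector size = 9
        System.out.println(activePositions(new int[]{0, 0, 0}, numCategories));    // [0, 1, 4]
        System.out.println(activePositions(new int[]{1, 3, 5}, numCategories));    // []
    }
}
```

A row whose three indices are all 0 activates positions 0, 1, and 4, while a row that holds the last category of every feature yields an empty (all-zero) vector.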