Java code for converting province, city, and county string features into numeric features in Spark 2
Date: 2023-12-24 08:05:49
Spark 2.0 feature processing
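Suppose you have a DataFrame with three string columns named "province", "city", and "county", holding the province, city, and county respectively. To convert these string features into numeric features, you can use Spark's StringIndexer and OneHotEncoder: StringIndexer maps each distinct string to a numeric index (the most frequent label gets 0.0), and OneHotEncoder turns each index into a sparse binary vector.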
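To make the indexing step concrete, here is a minimal plain-Java sketch (no Spark dependency) of what a frequency-based string indexer computes. The class name `FrequencyIndexSketch` and the method `fitIndex` are hypothetical, the labels are romanized for illustration, and ties are broken alphabetically as an assumption; how Spark itself breaks ties depends on the version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FrequencyIndexSketch {

    // Assigns 0.0 to the most frequent label, 1.0 to the next, and so on.
    // Assumption: labels with equal counts are ordered alphabetically.
    static Map<String, Double> fitIndex(List<String> labels) {
        // TreeMap keeps the labels sorted, which gives the alphabetical tie-break.
        Map<String, Integer> counts = new TreeMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }

        // List.sort is stable, so equal counts keep their alphabetical order.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        Map<String, Double> index = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : entries) {
            index.put(entry.getKey(), (double) index.size());
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> provinces = Arrays.asList("Guangdong", "Guangdong", "Guangxi", "Sichuan");
        // prints {Guangdong=0.0, Guangxi=1.0, Sichuan=2.0}
        System.out.println(fitIndex(provinces));
    }
}
```

The resulting numbers are only identifiers: they carry no ordering meaning, which is why a one-hot step usually follows before the values are fed to a linear model.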
The full Spark example is as follows:
```java
import java.util.Arrays;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.OneHotEncoder;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class StringIndexerExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("StringIndexerExample")
                .master("local[*]")
                .getOrCreate();

        // Create a DataFrame with province, city, and county columns
        Dataset<Row> df = spark.createDataFrame(Arrays.asList(
                RowFactory.create("广东", "广州", "天河"),
                RowFactory.create("广东", "深圳", "南山"),
                RowFactory.create("广西", "南宁", "青秀"),
                RowFactory.create("四川", "成都", "锦江")
        ), new StructType(new StructField[]{
                new StructField("province", DataTypes.StringType, false, Metadata.empty()),
                new StructField("city", DataTypes.StringType, false, Metadata.empty()),
                new StructField("county", DataTypes.StringType, false, Metadata.empty())
        }));

        // StringIndexer maps each string column to a numeric index column
        StringIndexer provinceIndexer = new StringIndexer()
                .setInputCol("province")
                .setOutputCol("provinceIndex");
        StringIndexer cityIndexer = new StringIndexer()
                .setInputCol("city")
                .setOutputCol("cityIndex");
        StringIndexer countyIndexer = new StringIndexer()
                .setInputCol("county")
                .setOutputCol("countyIndex");

        // OneHotEncoder turns each index column into a sparse binary vector.
        // Note: from Spark 2.3 this class is deprecated in favor of OneHotEncoderEstimator.
        OneHotEncoder provinceEncoder = new OneHotEncoder()
                .setInputCol("provinceIndex")
                .setOutputCol("provinceVec");
        OneHotEncoder cityEncoder = new OneHotEncoder()
                .setInputCol("cityIndex")
                .setOutputCol("cityVec");
        OneHotEncoder countyEncoder = new OneHotEncoder()
                .setInputCol("countyIndex")
                .setOutputCol("countyVec");

        // Combine the encoded columns into a single feature vector column
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{"provinceVec", "cityVec", "countyVec"})
                .setOutputCol("features");

        // Build the Pipeline: indexers first, then encoders, then the assembler
        Pipeline pipeline = new Pipeline()
                .setStages(new PipelineStage[]{
                        provinceIndexer, cityIndexer, countyIndexer,
                        provinceEncoder, cityEncoder, countyEncoder,
                        assembler
                });

        // Fit and run the Pipeline to obtain the transformed DataFrame
        Dataset<Row> transformed = pipeline.fit(df).transform(df);
        transformed.show();

        spark.stop();
    }
}
```
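The output looks roughly like the following. The intermediate provinceVec, cityVec, and countyVec columns are omitted here for readability, and the indices assigned to labels with equal frequency depend on how the Spark version breaks ties. Because OneHotEncoder drops the last category by default, the assembled vector has 2 + 3 + 3 = 8 dimensions, and a row whose three values all fall in the last category encodes to all zeros:
```
+--------+----+------+-------------+---------+-----------+--------------------+
|province|city|county|provinceIndex|cityIndex|countyIndex|            features|
+--------+----+------+-------------+---------+-----------+--------------------+
|    广东|广州|  天河|          0.0|      0.0|        0.0|(8,[0,2,5],[1.0,1...|
|    广东|深圳|  南山|          0.0|      1.0|        1.0|(8,[0,3,6],[1.0,1...|
|    广西|南宁|  青秀|          1.0|      2.0|        2.0|(8,[1,4,7],[1.0,1...|
|    四川|成都|  锦江|          2.0|      3.0|        3.0|           (8,[],[])|
+--------+----+------+-------------+---------+-----------+--------------------+
```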
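As you can see, each string feature column is first converted to a numeric index column, and each index column is then converted by OneHotEncoder into a sparse binary vector column. Finally, VectorAssembler concatenates the encoded columns, in the order given, into a single feature vector column.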
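The layout of the assembled vector can be checked by hand. The following is a minimal plain-Java sketch (no Spark dependency) of one-hot encoding with the last category dropped, followed by concatenation. The class `OneHotLayoutSketch` and its methods are hypothetical names for illustration, dense arrays stand in for Spark's sparse vectors, and the category counts (3 provinces, 4 cities, 4 counties) are taken from the sample data.

```java
import java.util.Arrays;

class OneHotLayoutSketch {

    // One-hot encodes an index into an array of length numCategories - 1.
    // The last category maps to all zeros, mirroring OneHotEncoder's
    // default of dropping the last category.
    static double[] oneHotDropLast(int index, int numCategories) {
        if (index < 0 || index >= numCategories) {
            throw new IllegalArgumentException("index out of range: " + index);
        }
        double[] vec = new double[numCategories - 1];
        if (index < numCategories - 1) {
            vec[index] = 1.0;
        }
        return vec;
    }

    // Concatenates the parts in order, the way an assembler lays out its input columns.
    static double[] concat(double[]... parts) {
        int total = 0;
        for (double[] part : parts) {
            total += part.length;
        }
        double[] out = new double[total];
        int offset = 0;
        for (double[] part : parts) {
            System.arraycopy(part, 0, out, offset, part.length);
            offset += part.length;
        }
        return out;
    }

    public static void main(String[] args) {
        // A row with provinceIndex = 1, cityIndex = 2, countyIndex = 2.
        double[] row = concat(
                oneHotDropLast(1, 3),   // 3 provinces -> 2 slots
                oneHotDropLast(2, 4),   // 4 cities    -> 3 slots
                oneHotDropLast(2, 4));  // 4 counties  -> 3 slots
        // prints 8 [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
        System.out.println(row.length + " " + Arrays.toString(row));
    }
}
```

The non-zero positions 1, 4, and 7 correspond to the province slot 1, the city slot 2 shifted by the 2 province slots, and the county slot 2 shifted by the 5 preceding slots.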