Computing tf-idf vectors for text in a Spark DataFrame
Date: 2023-08-14 19:13:14
In Spark, you can use the `HashingTF` and `IDF` classes from `pyspark.ml.feature` to compute tf-idf vectors for text in a DataFrame: `HashingTF` maps each document's tokens to a fixed-size term-frequency vector via the hashing trick, and `IDF` rescales those counts by inverse document frequency. Here is a simple example:
```python
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

# Create a DataFrame (assumes an existing SparkSession named `spark`)
df = spark.createDataFrame(
    [(0, "hello world"), (1, "world is good"), (2, "goodbye world")],
    ["id", "sentence"],
)

# Tokenize the sentences
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
words_df = tokenizer.transform(df)

# Compute term-frequency vectors with the hashing trick
hashing_tf = HashingTF(inputCol="words", outputCol="raw_features", numFeatures=20)
tf_df = hashing_tf.transform(words_df)

# Fit the IDF model and compute tf-idf vectors
idf = IDF(inputCol="raw_features", outputCol="tf_idf_features")
idf_model = idf.fit(tf_df)
tf_idf_df = idf_model.transform(tf_df)

# UDF to convert the ML vector (usually sparse) to a plain array of doubles
vector_to_array = udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))

# Keep the original text and extract the tf-idf vector as an array column
tf_idf_df = tf_idf_df.select(
    "id", "sentence", vector_to_array("tf_idf_features").alias("tf_idf_features")
)

# Expand the array into one column per feature
for i in range(20):
    col_name = "tf_idf_" + str(i)
    tf_idf_df = tf_idf_df.withColumn(col_name, tf_idf_df.tf_idf_features.getItem(i))

# Drop the intermediate array column
tf_idf_df = tf_idf_df.drop("tf_idf_features")
```
With that, `tf_idf_df` is a DataFrame containing the original text alongside its tf-idf features, one column per hash bucket.
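To make the two stages above concrete, here is a minimal pure-Python sketch of what `HashingTF` and `IDF` compute. It assumes Spark's smoothed IDF formula, log((N + 1) / (df + 1)); the bucket assignment uses Python's built-in `hash()` purely for illustration (Spark's `HashingTF` actually uses MurmurHash3, so the bucket indices will differ):

```python
import math

def hashing_tf(words, num_features=20):
    # Hashing trick: each term is counted in the bucket hash(term) % num_features.
    # Illustration only -- Spark uses MurmurHash3, not Python's hash().
    vec = [0.0] * num_features
    for w in words:
        vec[hash(w) % num_features] += 1.0
    return vec

def idf_weights(tf_vectors):
    # Spark's smoothed IDF: log((N + 1) / (df + 1)), where df is the number
    # of documents with a nonzero count in that bucket.
    n = len(tf_vectors)
    num_features = len(tf_vectors[0])
    df = [sum(1 for v in tf_vectors if v[i] > 0) for i in range(num_features)]
    return [math.log((n + 1) / (df[i] + 1)) for i in range(num_features)]

docs = [["hello", "world"], ["world", "is", "good"], ["goodbye", "world"]]
tf = [hashing_tf(d) for d in docs]
idf = idf_weights(tf)
tf_idf = [[t * w for t, w in zip(vec, idf)] for vec in tf]
```

Note that a bucket hit by every document (here, the one holding "world") gets weight log(4/4) = 0, which is exactly why common terms are suppressed in the tf-idf output.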