Computing tf-idf vectors for text in a Spark DataFrame
Date: 2023-08-14 19:13:14
In Spark, you can use the `HashingTF` and `IDF` classes from `pyspark.ml.feature` to compute tf-idf vectors for text in a DataFrame: `HashingTF` maps each document's tokens to a fixed-size term-frequency vector via the hashing trick, and `IDF` rescales those counts by inverse document frequency. Here is a simple example:
```python
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

# Create a DataFrame (assumes an existing SparkSession named `spark`)
df = spark.createDataFrame(
    [(0, "hello world"), (1, "world is good"), (2, "goodbye world")],
    ["id", "sentence"],
)

# Tokenize the sentences
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
words_df = tokenizer.transform(df)

# Compute term-frequency vectors with the hashing trick
hashing_tf = HashingTF(inputCol="words", outputCol="raw_features", numFeatures=20)
tf_df = hashing_tf.transform(words_df)

# Fit the IDF model and compute tf-idf vectors
idf = IDF(inputCol="raw_features", outputCol="tf_idf_features")
idf_model = idf.fit(tf_df)
tf_idf_df = idf_model.transform(tf_df)

# UDF to convert the ML vector (usually sparse) to a plain array of doubles
vector_to_array = udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))

# Keep the original text and extract the tf-idf vector as an array column
tf_idf_df = tf_idf_df.select(
    "id", "sentence", vector_to_array("tf_idf_features").alias("tf_idf_features")
)

# Expand the array into one column per feature
for i in range(20):
    col_name = "tf_idf_" + str(i)
    tf_idf_df = tf_idf_df.withColumn(col_name, tf_idf_df.tf_idf_features.getItem(i))

# Drop the intermediate array column
tf_idf_df = tf_idf_df.drop("tf_idf_features")
```
With that, `tf_idf_df` is a DataFrame containing the original text alongside its tf-idf features, one column per hash bucket.
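To make the two stages above concrete, here is a minimal pure-Python sketch of what `HashingTF` and `IDF` compute. It assumes Spark's smoothed IDF formula, log((N + 1) / (df + 1)); the bucket assignment uses Python's built-in `hash()` purely for illustration (Spark's `HashingTF` actually uses MurmurHash3, so the bucket indices will differ):

```python
import math

def hashing_tf(words, num_features=20):
    # Hashing trick: each term is counted in the bucket hash(term) % num_features.
    # Illustration only -- Spark uses MurmurHash3, not Python's hash().
    vec = [0.0] * num_features
    for w in words:
        vec[hash(w) % num_features] += 1.0
    return vec

def idf_weights(tf_vectors):
    # Spark's smoothed IDF: log((N + 1) / (df + 1)), where df is the number
    # of documents with a nonzero count in that bucket.
    n = len(tf_vectors)
    num_features = len(tf_vectors[0])
    df = [sum(1 for v in tf_vectors if v[i] > 0) for i in range(num_features)]
    return [math.log((n + 1) / (df[i] + 1)) for i in range(num_features)]

docs = [["hello", "world"], ["world", "is", "good"], ["goodbye", "world"]]
tf = [hashing_tf(d) for d in docs]
idf = idf_weights(tf)
tf_idf = [[t * w for t, w in zip(vec, idf)] for vec in tf]
```

Note that a bucket hit by every document (here, the one holding "world") gets weight log(4/4) = 0, which is exactly why common terms are suppressed in the tf-idf output.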