Code for extracting text topics with LDA using sklearn

Posted: 2024-01-04 11:03:36

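Below is an example of extracting text topics with LDA using sklearn:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Assume we already have a list of texts, each one a plain string
texts = [
    "this is the first document",
    "this is the second document",
    "and this is the third one",
    "is this the first document",
]

# Build the bag-of-words model (document-term count matrix)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Initialize the LDA model
n_topics = 2
lda = LatentDirichletAllocation(
    n_components=n_topics, max_iter=50, learning_method='online'
)

# Train the LDA model
lda.fit(X)

# Print the top words of each topic.
# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() (available since 1.0) replaces it.
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print("Topic #%d:" % topic_idx)
    # argsort() is ascending, so this reversed slice takes the 10 largest weights
    print(" ".join([feature_names[i] for i in topic.argsort()[:-10 - 1:-1]]))
    print()

# Get the dominant topic of each text
doc_topic = lda.transform(X)
for i in range(len(texts)):
    print("Document #%d (topic: %d): %s" % (i, doc_topic[i].argmax(), texts[i]))
```

This example assumes a list of texts, `texts`, already exists and uses `CountVectorizer` to build the bag-of-words model. It then initializes an LDA model with `LatentDirichletAllocation` and trains it with `fit`. Finally, it reads the keywords of each topic from the `components_` attribute, where each row holds one topic's word weights, and calls `transform` to get each text's topic distribution, taking the `argmax` of that distribution as the topic the text belongs to.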
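The same `transform` step can also be applied to text the model has not seen during training. The following is a minimal sketch under stated assumptions, not part of the original answer: the `new_texts` list is a hypothetical input made up for illustration, and `random_state=0` is added only so that repeated runs give the same result. The key point is that new text must go through the already fitted vectorizer with `transform`, not `fit_transform`, so that its columns line up with the vocabulary the LDA model was trained on.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Same toy corpus as in the example above
texts = [
    "this is the first document",
    "this is the second document",
    "and this is the third one",
    "is this the first document",
]

# Fit the vocabulary and the LDA model on the training texts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
lda = LatentDirichletAllocation(
    n_components=2,
    max_iter=50,
    learning_method="online",
    random_state=0,  # assumption: fixed seed for reproducibility only
)
lda.fit(X)

# Hypothetical unseen document, used purely for illustration
new_texts = ["is this the second document"]

# Reuse the fitted vocabulary: transform(), not fit_transform()
X_new = vectorizer.transform(new_texts)

# Each row is the document's distribution over the topics
new_doc_topic = lda.transform(X_new)

for text, dist in zip(new_texts, new_doc_topic):
    print("Dominant topic: %d, distribution: %s, text: %s"
          % (dist.argmax(), dist.round(3), text))
```

Words in a new document that are not in the fitted vocabulary are silently ignored by `CountVectorizer.transform`, so a document made up entirely of unseen words would carry no information for the topic assignment.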
