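Python code that runs LDA topic analysis on short texts in an Excel file, chooses the best number of topics by coherence, applies it, and writes an Excel file containing every topic and its probability for each text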
Date: 2024-01-24 18:17:21
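The following Python code runs LDA topic analysis on short texts stored in an Excel file, selects the number of topics with the highest coherence score, retrains with that number, and writes an Excel file listing every topic and its probability for each text: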
```python
import re

import nltk
import numpy as np
import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel
from nltk.corpus import stopwords

# Load the Excel data; the texts are expected in a column named 'text'
df = pd.read_excel('your_file_name.xlsx')
text_data = df['text'].fillna('').astype(str).tolist()

# Preprocessing: remove stopwords, digits, punctuation, and words shorter than 3 characters
nltk.download('stopwords')
en_stopwords = set(stopwords.words('english'))

def preprocess(text):
    """Return the list of cleaned tokens for one document."""
    text = re.sub(r'\d+', '', str(text)).lower()
    words = [word for word in text.split() if word not in en_stopwords]
    text = re.sub(r'[^\w\s]', '', ' '.join(words))
    return [word for word in text.split() if len(word) >= 3]

# gensim expects each document as a list of tokens, not a joined string
tokenized_data = [preprocess(text) for text in text_data]

# Build the dictionary and the bag-of-words corpus in gensim's own format
id2word = Dictionary(tokenized_data)
corpus = [id2word.doc2bow(tokens) for tokens in tokenized_data]

def train_lda(num_topics):
    """Train an LDA model with the same settings for any topic count."""
    return LdaModel(corpus=corpus, id2word=id2word, num_topics=num_topics,
                    random_state=100, chunksize=100, passes=10, alpha='auto')

# Train one model per candidate topic count and record its c_v coherence
topic_range = range(2, 11)
coherence_scores = []
for num_topics in topic_range:
    coherence_model = CoherenceModel(model=train_lda(num_topics), texts=tokenized_data,
                                     dictionary=id2word, coherence='c_v')
    coherence_scores.append(coherence_model.get_coherence())

# Pick the topic count with the highest coherence and retrain with it
best_num_topics = topic_range[int(np.argmax(coherence_scores))]
lda_model = train_lda(best_num_topics)

# Keywords of each topic, used to label the dominant topic of a document
topic_keywords = {k: ", ".join(word for word, _ in lda_model.show_topic(k))
                  for k in range(best_num_topics)}

# One row per document, with one probability column per topic
rows = []
for doc_id, bow in enumerate(corpus):
    probs = dict(lda_model.get_document_topics(bow, minimum_probability=0.0))
    row = {'Document_Id': doc_id, 'Text': text_data[doc_id]}
    for k in range(best_num_topics):
        row[f'Topic_{k}_Prob'] = float(probs.get(k, 0.0))
    dominant = max(range(best_num_topics), key=lambda k: probs.get(k, 0.0))
    row['Dominant_Topic'] = dominant
    row['Topic_Keywords'] = topic_keywords[dominant]
    rows.append(row)

df_topics = pd.DataFrame(rows)
df_topics.to_excel('output_file_name.xlsx', index=False)
```
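Note: replace `your_file_name.xlsx` in the code with the name of the Excel file that holds the input data, and `output_file_name.xlsx` with the name of the Excel file to write. The third-party packages the code imports must also be installed (pandas, numpy, nltk, gensim, and scikit-learn if `CountVectorizer` is used), along with openpyxl, which pandas uses to read and write `.xlsx` files. `re` is part of the Python standard library and needs no installation.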
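The two steps that are easiest to get wrong, picking the topic count from the coherence scores and reporting all topics per text rather than only the top one, can be sketched without gensim. This is a minimal illustration, not the library's API: `pick_best_k` and `to_wide_row` are hypothetical helper names, and the scores and topic pairs are made-up values rather than real model output.

```python
def pick_best_k(candidate_ks, coherence_scores):
    """Return the candidate topic count with the highest coherence score."""
    if len(candidate_ks) != len(coherence_scores):
        raise ValueError("one score is required per candidate topic count")
    best_index = max(range(len(coherence_scores)),
                     key=coherence_scores.__getitem__)
    return candidate_ks[best_index]


def to_wide_row(doc_id, topic_probs, num_topics):
    """Expand sparse (topic_id, probability) pairs into one column per topic."""
    probs = dict(topic_probs)
    row = {"Document_Id": doc_id}
    for k in range(num_topics):
        # Topics missing from the sparse list get probability 0.0
        row[f"Topic_{k}_Prob"] = float(probs.get(k, 0.0))
    row["Dominant_Topic"] = max(range(num_topics),
                                key=lambda k: probs.get(k, 0.0))
    return row


# Illustrative values only (assumed, not produced by a trained model)
ks = list(range(2, 6))
scores = [0.31, 0.42, 0.38, 0.40]
best_k = pick_best_k(ks, scores)
example_row = to_wide_row(0, [(1, 0.7), (0, 0.3)], best_k)
print(best_k, example_row)
```

Mapping the best index back through the list of candidates avoids hard-coding an offset such as `+ 2`, which silently breaks if the candidate range changes. A list of such rows can be passed to `pd.DataFrame` and written with `to_excel`.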