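A DataFrame column contains English text. How do I extract keywords together with a weight from another column and keep only the few highest-weighted words? Please give an example.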
Posted: 2023-09-19 07:08:05
Keyword Extraction Methods for Chinese and English Text, with Python Examples
You can use Python's nltk and sklearn libraries to extract keywords and compute their weights. Here is a simple example:
```python
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# One-time setup: nltk.download('stopwords') and nltk.download('punkt')
# (newer NLTK releases may also need 'punkt_tab')

# Load the data; data.csv is expected to have an English 'text' column
df = pd.read_csv('data.csv')

# Tokenize, lowercase, and drop stop words and non-alphabetic tokens
stop_words = set(stopwords.words('english'))
df['text'] = df['text'].apply(lambda x: [word for word in word_tokenize(x.lower()) if word.isalpha() and word not in stop_words])

# Count term frequencies
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['text'].apply(lambda x: ' '.join(x)))

# Convert the counts to TF-IDF weights
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)

# Extract the n highest-weighted words for each row
n = 5
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()
top_n = []
for i in range(len(df)):
    row = tfidf[i].toarray()[0]
    top_n_indices = row.argsort()[-n:][::-1]
    # Skip zero-weight entries so short texts are not padded with absent words
    top_n.append([(feature_names[j], row[j]) for j in top_n_indices if row[j] > 0])

# Add the result to the DataFrame
df['top_n'] = top_n
```
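This code reads the data into a DataFrame, then tokenizes each text, removes stop words, computes term frequencies and TF-IDF weights, and extracts the n highest-weighted words. Finally, it stores the result in a new DataFrame column, `top_n`. Change the value of n to extract more or fewer keywords.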
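To make the TF-IDF step less of a black box, and to show one way the weight from another column could be folded in, here is a minimal pure-Python sketch. It uses the smoothed IDF formula that scikit-learn applies by default, `ln((1 + n_docs) / (1 + df)) + 1`, followed by L2 normalization. The function name `top_tfidf_words`, the `row_weights` parameter, and the sample data are assumptions made for illustration; multiplying each score by the row's weight is only one possible reading of the question, not the only way to combine them.

```python
import math
from collections import Counter


def top_tfidf_words(docs, n=5, row_weights=None):
    """Return the n highest-scoring (word, score) pairs per tokenized document.

    docs: list of token lists (already lowercased, stop words removed).
    row_weights: optional per-row multipliers, e.g. values taken from
                 another DataFrame column (hypothetical).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each word appears in
    doc_freq = Counter(word for doc in docs for word in set(doc))
    # Smoothed IDF, matching scikit-learn's default (smooth_idf=True)
    idf = {w: math.log((1 + n_docs) / (1 + c)) + 1 for w, c in doc_freq.items()}

    results = []
    for i, doc in enumerate(docs):
        term_freq = Counter(doc)
        scores = {w: count * idf[w] for w, count in term_freq.items()}
        # L2-normalize so long and short texts are comparable
        norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
        scale = row_weights[i] if row_weights is not None else 1.0
        # Sort by descending score; break ties alphabetically
        ranked = sorted(
            ((w, s / norm * scale) for w, s in scores.items()),
            key=lambda pair: (-pair[1], pair[0]),
        )
        results.append(ranked[:n])
    return results


# Hypothetical sample data standing in for the tokenized 'text' column
docs = [
    ["cat", "sat", "mat"],
    ["cat", "chased", "mouse", "mouse"],
]
weights = [1.0, 2.0]  # hypothetical values from another column
top = top_tfidf_words(docs, n=2, row_weights=weights)
print(top)
```

With a real DataFrame, you could pass `df['text'].tolist()` (after tokenization) as `docs` and the values of your weight column as `row_weights`, then assign the result to a new column. Words that appear in every document, such as "cat" here, get the lowest IDF and tend to drop out of the top n.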