Using TextRank to extract keywords from English text in Python
Below is one way to implement the TextRank algorithm in Python to extract keywords from English text:
1. Import the required libraries: nltk, numpy, and networkx (a note on downloading the required NLTK data follows the imports).
```python
import nltk
import numpy as np
import networkx as nx
```
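Depending on the environment, the NLTK data used by the preprocessing step may still need to be downloaded once. A minimal setup sketch (resource names as in typical NLTK releases; they can vary slightly between versions):
```python
import nltk

# One-time downloads of the NLTK data used below (skip any that are already installed)
nltk.download('punkt')                        # sentence/word tokenizers
nltk.download('stopwords')                    # English stopword list
nltk.download('wordnet')                      # WordNet data for the lemmatizer
nltk.download('averaged_perceptron_tagger')   # POS tagger model
```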
2. Load the text data and preprocess it: split into sentences, tokenize, remove stopwords, POS-tag, and lemmatize (a quick sanity check follows the code).
```python
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag

def preprocess_text(text):
    # Split the text into sentences, then tokenize and lowercase each sentence
    sentences = sent_tokenize(text)
    tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    # POS-tag each tokenized sentence (the tags could also be used to keep only nouns/adjectives)
    tagged_sentences = [pos_tag(sentence) for sentence in tokenized_sentences]
    preprocessed_sentences = []
    for tagged_sentence in tagged_sentences:
        preprocessed_sentence = []
        for word, tag in tagged_sentence:
            # Keep alphabetic, non-stopword tokens longer than two characters
            if word not in stop_words and word.isalpha() and len(word) > 2:
                preprocessed_word = lemmatizer.lemmatize(word)
                preprocessed_sentence.append(preprocessed_word)
        preprocessed_sentences.append(preprocessed_sentence)
    return preprocessed_sentences
```
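As a quick sanity check, a short sentence should come back as a list of lemmatized content words. The sentence and the expected result below are only indicative; the exact output can differ slightly depending on the NLTK version and data:
```python
sample = "Cats are running in the garden."  # hypothetical test sentence
print(preprocess_text(sample))
# Expected output along the lines of: [['cat', 'running', 'garden']]
```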
3. Compute each word's frequency and build a co-occurrence matrix (a toy example follows the code).
```python
def compute_word_frequency(preprocessed_sentences):
    # Count how often each word appears across all sentences
    word_frequency = {}
    for sentence in preprocessed_sentences:
        for word in sentence:
            if word not in word_frequency:
                word_frequency[word] = 0
            word_frequency[word] += 1
    return word_frequency

def compute_co_occurrence_matrix(preprocessed_sentences, word_frequency, window_size=2):
    words = list(word_frequency.keys())
    word_index = {word: index for index, word in enumerate(words)}
    co_occurrence_matrix = np.zeros((len(words), len(words)))
    for sentence in preprocessed_sentences:
        for i in range(len(sentence)):
            # Count co-occurrences within a sliding window around position i
            for j in range(max(0, i - window_size), min(len(sentence), i + window_size + 1)):
                if i != j:
                    word_i = sentence[i]
                    word_j = sentence[j]
                    if word_i in word_index and word_j in word_index:
                        index_i = word_index[word_i]
                        index_j = word_index[word_j]
                        co_occurrence_matrix[index_i][index_j] += 1
    return co_occurrence_matrix
```
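To illustrate what these two helpers produce, here is a small run on a hypothetical, already-preprocessed input; the numbers follow directly from the window logic above:
```python
# Hypothetical toy input: two already-preprocessed "sentences"
toy_sentences = [['natural', 'language', 'processing'],
                 ['language', 'model']]

freq = compute_word_frequency(toy_sentences)
print(freq)    # {'natural': 1, 'language': 2, 'processing': 1, 'model': 1}

matrix = compute_co_occurrence_matrix(toy_sentences, freq, window_size=2)
print(matrix)  # 4x4 symmetric matrix; e.g. the 'natural'/'language' cell is 1.0
               # because the two words co-occur once within the window
```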
4. Use the PageRank algorithm to score each word, then sort by score and return the top keywords.
```python
def compute_textrank_scores(co_occurrence_matrix, words):
    # Build a weighted undirected graph from the co-occurrence matrix
    graph = nx.from_numpy_array(co_occurrence_matrix)
    # Relabel the integer node indices with the actual words so that
    # the PageRank scores are keyed by word instead of by index
    graph = nx.relabel_nodes(graph, {index: word for index, word in enumerate(words)})
    pagerank_scores = nx.pagerank(graph, weight='weight')
    return pagerank_scores

def extract_keywords(text, num_keywords=10):
    preprocessed_sentences = preprocess_text(text)
    word_frequency = compute_word_frequency(preprocessed_sentences)
    co_occurrence_matrix = compute_co_occurrence_matrix(preprocessed_sentences, word_frequency)
    words = list(word_frequency.keys())
    pagerank_scores = compute_textrank_scores(co_occurrence_matrix, words)
    # Sort words by their TextRank (PageRank) score in descending order
    keywords = sorted(pagerank_scores, key=pagerank_scores.get, reverse=True)[:num_keywords]
    return keywords
```
Usage example:
```python
text = "Natural language processing (NLP) is a subfield of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation."
keywords = extract_keywords(text)
print(keywords)
```
Example output:
```
['language', 'natural', 'computers', 'human', 'interaction', 'intelligence', 'science', 'subfield', 'nlp', 'artificial']
```