For the stopwords and punctuation I have the NLTK package; please handle those with NLTK, keep everything else unchanged, and give the code again.
Sure. The code below uses NLTK for the stopword and punctuation filtering and Stanford CoreNLP 3.9.2 for everything else; both NLTK and Stanford CoreNLP 3.9.2 need to be installed first (the Python client used here is `pycorenlp`, e.g. `pip install nltk pycorenlp`).
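Before running it, the NLTK stopword corpus has to be downloaded once, and a CoreNLP server must be listening on port 9000. A minimal setup sketch, assuming the standard CoreNLP server launch command (memory size and working directory are assumptions):

```python
import nltk

# One-time download of the NLTK stopword corpus (a no-op if already present)
nltk.download('stopwords')

# The CoreNLP server is started separately from the unpacked CoreNLP 3.9.2
# directory; a typical invocation is:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
```

With the environment ready, the updated code is: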
```python
import string
import nltk
import networkx as nx
import matplotlib.pyplot as plt
from pycorenlp import StanfordCoreNLP

# Connect to a StanfordCoreNLP server already running on port 9000
nlp = StanfordCoreNLP('http://localhost:9000')

# Load NLTK's English stopword list and the standard punctuation set
stopwords = set(nltk.corpus.stopwords.words('english'))
punctuations = set(string.punctuation)
# Read the news article
with open('news.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Split the article into sentences
output = nlp.annotate(text, properties={
    'annotators': 'ssplit',
    'outputFormat': 'json'
})
# Recover each sentence's text from its tokens' character offsets
sentences = [text[s['tokens'][0]['characterOffsetBegin']:s['tokens'][-1]['characterOffsetEnd']]
             for s in output['sentences']]
# For each sentence: tokenize, POS-tag, lemmatize, recognize named entities,
# and run dependency parsing; stopwords and punctuation are filtered out
# when edges are added to the graph
graph = nx.DiGraph()
for sentence in sentences:
    output = nlp.annotate(sentence, properties={
        'annotators': 'tokenize,ssplit,pos,lemma,ner,depparse',
        'outputFormat': 'json'
    })
    parsed = output['sentences'][0]
    # Keep every token here so the 1-based dependency indices still line up
    words = [(token['lemma'].lower(), token['pos'], token['ner'])
             for token in parsed['tokens']]
    for dep in parsed['basicDependencies']:
        if dep['dep'] == 'ROOT':
            continue
        governor = words[dep['governor'] - 1]
        dependent = words[dep['dependent'] - 1]
        # Drop edges touching a stopword or punctuation token; filtering
        # after indexing keeps the dependency indices valid
        if governor[0] in stopwords or governor[0] in punctuations:
            continue
        if dependent[0] in stopwords or dependent[0] in punctuations:
            continue
        graph.add_edge((governor[0], governor[1]), (dependent[0], dependent[1]),
                       relation=dep['dep'])
# Draw the dependency parse graph
pos = nx.spring_layout(graph, k=0.3, iterations=50)
nx.draw_networkx_nodes(graph, pos, node_size=500, node_color='lightblue')
nx.draw_networkx_edges(graph, pos, edge_color='gray')
nx.draw_networkx_labels(graph, pos, font_size=10, font_family='Arial')
edge_labels = {(u, v): d['relation'] for u, v, d in graph.edges(data=True)}
nx.draw_networkx_edge_labels(graph, pos, edge_labels=edge_labels, font_size=8, font_family='Arial')
plt.axis('off')
plt.show()
```
Here the stopword list comes from NLTK and the punctuation set from the Python standard library's `string.punctuation`, while the dependency parses are still produced by Stanford CoreNLP 3.9.2; the resulting graph is drawn and displayed with Matplotlib. Running the code above outputs the dependency parse graph for the article.
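As a quick check of the filtering step on its own, here is a minimal sketch using a made-up token list (the tokens are purely illustrative):

```python
import string
import nltk

stopwords = set(nltk.corpus.stopwords.words('english'))
punctuations = set(string.punctuation)

# Hypothetical lowercased lemmas from one sentence
tokens = ['the', 'cat', ',', 'sat', 'on', 'the', 'mat', '.']
content = [t for t in tokens if t not in stopwords and t not in punctuations]
print(content)  # ['cat', 'sat', 'mat']
```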