Import a Word document into Python (file path C:\Users\Admin\Desktop\三国演义.docx), use jieba to count word frequencies, print the 10 most frequent words, and build a knowledge graph with 20 nodes
Posted: 2023-12-08 19:03:47
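First, install the required libraries from the command line. jieba handles word segmentation, python-docx reads the .docx file, and networkx and matplotlib draw the graph; wordcloud is only needed if you also want to render a word cloud:
```
pip install jieba wordcloud
pip install python-docx networkx matplotlib
```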
The task can then be done in the following steps:
1. Import the libraries
```python
import jieba
from docx import Document  # provided by the python-docx package
```
2. Read the text
```python
doc = Document('C:\\Users\\Admin\\Desktop\\三国演义.docx')
# Join paragraphs with newlines so words at paragraph boundaries do not run together
text = '\n'.join(p.text for p in doc.paragraphs)
```
3. Segment the text into words
```python
words = jieba.lcut(text)
```
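4. Count word frequencies
```python
freq = {}
for word in words:
    # Skip single characters, which are mostly punctuation, whitespace, and particles
    if len(word.strip()) < 2:
        continue
    freq[word] = freq.get(word, 0) + 1
```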
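The counting loop above can also be written with `collections.Counter`, which makes it easy to drop common function words before ranking. This is a minimal sketch, not the method used in this answer: the `STOPWORDS` set, the `count_words` helper, and the sample token list are illustrative assumptions, and in practice the input would be the output of `jieba.lcut(text)`.

```python
from collections import Counter

# Hypothetical stopword set; extend it with whatever function words dominate your text.
STOPWORDS = {"不能", "如此", "却说", "二人", "不可"}

def count_words(words, stopwords=STOPWORDS, min_len=2):
    """Count tokens, skipping short tokens, whitespace, and stopwords."""
    counter = Counter()
    for word in words:
        token = word.strip()
        if len(token) < min_len or token in stopwords:
            continue
        counter[token] += 1
    return counter

# Stand-in for jieba.lcut(text); not real output from the novel.
sample_words = ["曹操", "却说", "曹操", "孔明", "，", "孔明", "曹操", "\n", "不能"]
freq_demo = count_words(sample_words)
print(freq_demo.most_common(2))
```

`Counter.most_common(n)` returns the `n` highest counts already sorted, so no separate `sorted` call is needed.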
5. Print the 10 most frequent words
```python
# Sort by count in descending order and keep the first 10 entries
top10 = sorted(freq.items(), key=lambda x: x[1], reverse=True)[:10]
for word, count in top10:
    print(word, count)
```
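In a novel like 三国演义, one character may show up under several tokens, for example a courtesy name, a full name, or a name fused with 曰 ("said"), so the count is split across several entries. The following is a minimal sketch of merging such variants before ranking. The `ALIASES` table and the `merge_aliases` helper are hypothetical and hand-written for illustration, and the sample counts are made up rather than taken from the novel.

```python
from collections import Counter

# Hypothetical alias table: maps variant tokens to one canonical name.
ALIASES = {
    "孔明": "诸葛亮",
    "孔明曰": "诸葛亮",
    "玄德": "刘备",
    "玄德曰": "刘备",
    "云长": "关羽",
}

def merge_aliases(freq, aliases=ALIASES):
    """Return a Counter in which each variant's count is added to its canonical name."""
    merged = Counter()
    for word, count in freq.items():
        merged[aliases.get(word, word)] += count
    return merged

# Made-up counts standing in for the real freq dictionary.
sample_freq = {"孔明": 5, "孔明曰": 3, "诸葛亮": 2, "玄德": 4, "刘备": 1, "曹操": 6}
merged = merge_aliases(sample_freq)
for word, count in merged.most_common(3):
    print(word, count)
```

Which tokens need merging depends on what jieba actually returns for the text, so it is worth inspecting the raw top list first and building the table from that.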
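6. Build the knowledge graph
```python
import networkx as nx
import matplotlib.pyplot as plt

# Use a font with Chinese glyphs for the labels (assumes SimHei is installed)
plt.rcParams['font.sans-serif'] = ['SimHei']

# Nodes are the 20 most frequent words, not the first 20 words encountered
top20 = sorted(freq.items(), key=lambda x: x[1], reverse=True)[:20]
G = nx.Graph()
for i, (word1, count1) in enumerate(top20):
    G.add_node(word1)
    # Compare only against later top-20 words so the graph stays at 20 nodes
    for word2, count2 in top20[i + 1:]:
        # Link two words when one is a prefix of the other
        if word2.startswith(word1) or word1.startswith(word2):
            weight = min(count1, count2)
            G.add_edge(word1, word2, weight=weight)
pos = nx.spring_layout(G, k=0.5)
nx.draw(G, pos, with_labels=True, font_size=10, node_color='y', node_size=1000, edge_color='gray', width=1.5, alpha=0.8)
nx.draw_networkx_edge_labels(G, pos, edge_labels={(u, v): G[u][v]['weight'] for u, v in G.edges()}, font_size=8)
plt.show()
```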
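The prefix test above only links words that share their leading characters, so most nodes may end up with no edges. A common alternative is to link two words when they appear in the same paragraph and use the number of shared paragraphs as the edge weight. The following is a minimal sketch of that idea using only the standard library. The `cooccurrence_edges` helper and the sample paragraphs are illustrative assumptions, not part of the original answer; in practice each inner list would come from `jieba.lcut` applied to one paragraph.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(paragraph_tokens, keywords):
    """Map each (word_a, word_b) pair to the number of paragraphs containing both."""
    keywords = set(keywords)
    edges = Counter()
    for tokens in paragraph_tokens:
        # Sort so that each pair is always stored in the same order
        present = sorted(keywords.intersection(tokens))
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges

# Stand-in data: each inner list plays the role of one segmented paragraph.
sample_paragraphs = [
    ["曹操", "引兵", "追赶", "刘备"],
    ["刘备", "请", "孔明", "出山"],
    ["曹操", "与", "刘备", "煮酒"],
]
edges = cooccurrence_edges(sample_paragraphs, ["曹操", "刘备", "孔明"])
for (a, b), weight in edges.items():
    print(a, b, weight)
```

Each `(a, b), weight` pair can then be passed to `G.add_edge(a, b, weight=weight)` in place of the prefix test.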
The complete code is as follows:
```python
import jieba
from docx import Document  # provided by the python-docx package
import networkx as nx
import matplotlib.pyplot as plt

# Read the text, joining paragraphs with newlines so words do not run together
doc = Document('C:\\Users\\Admin\\Desktop\\三国演义.docx')
text = '\n'.join(p.text for p in doc.paragraphs)

# Segment the text into words
words = jieba.lcut(text)

# Count word frequencies, skipping single characters and whitespace
freq = {}
for word in words:
    if len(word.strip()) < 2:
        continue
    freq[word] = freq.get(word, 0) + 1

# Print the 10 most frequent words
ranked = sorted(freq.items(), key=lambda x: x[1], reverse=True)
for word, count in ranked[:10]:
    print(word, count)

# Build the knowledge graph from the 20 most frequent words
plt.rcParams['font.sans-serif'] = ['SimHei']  # assumes SimHei is installed
top20 = ranked[:20]
G = nx.Graph()
for i, (word1, count1) in enumerate(top20):
    G.add_node(word1)
    # Compare only against later top-20 words so the graph stays at 20 nodes
    for word2, count2 in top20[i + 1:]:
        # Link two words when one is a prefix of the other
        if word2.startswith(word1) or word1.startswith(word2):
            weight = min(count1, count2)
            G.add_edge(word1, word2, weight=weight)
pos = nx.spring_layout(G, k=0.5)
nx.draw(G, pos, with_labels=True, font_size=10, node_color='y', node_size=1000, edge_color='gray', width=1.5, alpha=0.8)
nx.draw_networkx_edge_labels(G, pos, edge_labels={(u, v): G[u][v]['weight'] for u, v in G.edges()}, font_size=8)
plt.show()
```