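Window-based co-occurrence matrix code, with the matrix converted to a DataFrame with index names
Date: 2023-12-12 18:03:55
The following code builds a window-based co-occurrence matrix and converts it to a DataFrame whose rows and columns are labelled with the words: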
```python
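import re
import numpy as np
import pandas as pd

def co_occurrence_matrix(text, window_size):
    # Split the text into tokens; each punctuation mark becomes its own token
    words = re.findall(r"\w+|[^\w\s]", text)
    # Build the vocabulary of unique tokens, in order of first appearance
    vocab = list(dict.fromkeys(words))
    index = {word: k for k, word in enumerate(vocab)}
    n = len(words)
    # Create the co-occurrence matrix: one row and one column per unique token
    matrix = np.zeros((len(vocab), len(vocab)))
    # For each token position in the text
    for i in range(n):
        # Collect the tokens up to window_size positions to the left and right
        window = words[max(0, i - window_size):i] + words[i + 1:min(n, i + window_size + 1)]
        # Add 1 to the count of the current token with each token in its window
        for neighbour in window:
            matrix[index[words[i]], index[neighbour]] += 1
    # Convert the matrix to a DataFrame with the tokens as index and column names
    df = pd.DataFrame(matrix, index=vocab, columns=vocab)
    return df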
```
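The window slice is the step most likely to go wrong at the edges of the text. As a minimal sketch (the helper name `context_window` is hypothetical and is not part of the code above), the following isolates that step and shows what it returns at the start, middle, and end of a token list:

```python
def context_window(words, i, window_size):
    """Return the tokens within window_size positions of words[i], excluding words[i]."""
    # max(0, ...) is required: a negative start index would count from the end of the list.
    left = words[max(0, i - window_size):i]
    # No upper clamp is needed here: slicing past the end of a list just stops at the end.
    right = words[i + 1:i + window_size + 1]
    return left + right

tokens = ["I", "love", "to", "eat", "cake"]
print(context_window(tokens, 0, 2))  # ['love', 'to']
print(context_window(tokens, 2, 2))  # ['I', 'love', 'eat', 'cake']
print(context_window(tokens, 4, 2))  # ['to', 'eat']
```

The window is truncated at both ends of the text, so edge tokens simply have fewer neighbours.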
Usage example:
```python
text = "I love to eat cake. I also love to eat ice cream."
window_size = 2
co_matrix = co_occurrence_matrix(text, window_size)
print(co_matrix)
```
Output:
```
         I  love   to  eat  cake    .  also  ice  cream
I      0.0   2.0  1.0  0.0   1.0  1.0   1.0  0.0    0.0
love   2.0   0.0  2.0  2.0   0.0  0.0   1.0  0.0    0.0
to     1.0   2.0  0.0  2.0   1.0  0.0   1.0  1.0    0.0
eat    0.0   2.0  2.0  0.0   1.0  1.0   0.0  1.0    1.0
cake   1.0   0.0  1.0  1.0   0.0  1.0   0.0  0.0    0.0
.      1.0   0.0  0.0  1.0   1.0  0.0   1.0  1.0    1.0
also   1.0   1.0  1.0  0.0   0.0  1.0   0.0  0.0    0.0
ice    0.0   0.0  1.0  1.0   0.0  1.0   0.0  0.0    1.0
cream  0.0   0.0  0.0  1.0   0.0  1.0   0.0  1.0    0.0
```
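Each unique token is both a row label (index) and a column name of the DataFrame, and each cell holds the number of times the two corresponding tokens occur within `window_size` positions of each other in the text. For example, "love" and "to" co-occur 2 times.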
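To read a single cell instead of printing the whole table, the labels can be passed to `DataFrame.loc`. The sketch below is self-contained and rebuilds the counts with `collections.Counter` rather than a NumPy array. That is an alternative construction chosen for illustration, not the method used above, and it assumes each punctuation mark is treated as its own token:

```python
import re
from collections import Counter

import pandas as pd

text = "I love to eat cake. I also love to eat ice cream."
window_size = 2

# Assumption: words and single punctuation marks are separate tokens.
tokens = re.findall(r"\w+|[^\w\s]", text)
vocab = list(dict.fromkeys(tokens))  # unique tokens, first-appearance order

# Count every ordered (centre, neighbour) pair that falls inside the window.
pair_counts = Counter()
for i, centre in enumerate(tokens):
    start = max(0, i - window_size)
    stop = min(len(tokens), i + window_size + 1)
    for j in range(start, stop):
        if j != i:
            pair_counts[(centre, tokens[j])] += 1

# Lay the counts out as a labelled table; pairs never seen count as 0.
co_matrix = pd.DataFrame(
    [[pair_counts[(row, col)] for col in vocab] for row in vocab],
    index=vocab,
    columns=vocab,
)

love_to = co_matrix.loc["love", "to"]
print(love_to)  # 2

# A symmetric window (same size left and right) gives a symmetric matrix.
values = co_matrix.to_numpy()
is_symmetric = bool((values == values.T).all())
print(is_symmetric)  # True
```

Because the counts here are integers, the cells print as `2` rather than `2.0`.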