
Window-based co-occurrence matrix code for Chinese text, with the matrix converted to a DataFrame with named indexes

Time: 2023-12-12 21:03:55 Views: 97

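The following code builds a window-based co-occurrence matrix for Chinese text and converts it into a DataFrame with named row and column indexes:

```python
import jieba
import pandas as pd

# Input text
text = "我爱北京天安门，天安门上太阳升。"

# Window size: number of tokens considered on each side of the center word
window_size = 2

# Tokenize with jieba (punctuation marks are returned as tokens too)
words = list(jieba.cut(text))

# Build the co-occurrence counts as a nested dict: {word: {co_word: count}}
cooccur_matrix = {}
for i, word in enumerate(words):
    if word not in cooccur_matrix:
        cooccur_matrix[word] = {}
    for j in range(max(i - window_size, 0), min(i + window_size + 1, len(words))):
        if i != j:
            co_word = words[j]
            if co_word not in cooccur_matrix[word]:
                cooccur_matrix[word][co_word] = 0
            cooccur_matrix[word][co_word] += 1

# Convert to a DataFrame: outer keys become rows, inner keys become columns,
# and rows and columns share the same order (first appearance in the text)
vocab = list(cooccur_matrix)
df = pd.DataFrame.from_dict(cooccur_matrix, orient='index')
df = df.reindex(index=vocab, columns=vocab).fillna(0).astype(int)
df.index.name = 'word'
df.columns.name = 'co_word'
print(df)
```

Example output (illustrative; because `jieba.cut` also returns punctuation such as `，` and `。` as tokens, an actual run includes extra rows and columns for them, and the exact counts depend on jieba's segmentation):

```
co_word  北京  天安门  太阳  我  升  上  爱
word
北京       0    1   0  1  0  0  1
天安门      1    0   1  0  1  1  0
太阳       0    1   0  0  1  0  0
我        1    0   0  0  0  0  1
升        0    1   1  0  0  0  0
上        0    1   0  0  0  0  0
爱        0    1   0  1  0  0  0
```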
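If punctuation should not appear in the matrix, one option is to drop it before counting. The sketch below is a minimal, standard-library-only illustration of that idea, not part of the original answer: the helper `window_cooccurrence`, the `PUNCTUATION` set, and the hard-coded token list (standing in for the output of `jieba.cut`, whose real segmentation may differ) are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical punctuation set for this example; extend it as needed.
PUNCTUATION = set("，。！？、；：,.!?;:")


def window_cooccurrence(tokens, window_size=2, skip=PUNCTUATION):
    """Count co-occurrences of tokens within +/- window_size positions."""
    # Drop unwanted tokens first, so they neither appear in the matrix
    # nor occupy a slot inside the window.
    kept = [t for t in tokens if t not in skip and t.strip()]

    counts = defaultdict(Counter)
    for i, word in enumerate(kept):
        lo = max(i - window_size, 0)
        hi = min(i + window_size + 1, len(kept))
        for j in range(lo, hi):
            if j != i:
                counts[word][kept[j]] += 1

    # Vocabulary in order of first appearance, used for rows and columns alike.
    vocab = list(dict.fromkeys(kept))
    matrix = [[counts[w][c] for c in vocab] for w in vocab]
    return vocab, matrix


# Assumed segmentation of the example sentence (stand-in for jieba.cut output).
tokens = ["我", "爱", "北京", "天安门", "，", "天安门", "上", "太阳", "升", "。"]
vocab, matrix = window_cooccurrence(tokens, window_size=2)
for word, row in zip(vocab, matrix):
    print(word, row)
```

Under these assumptions the result is symmetric, and `pd.DataFrame(matrix, index=vocab, columns=vocab)` turns it into a labeled DataFrame. Note that removing punctuation before windowing makes the two occurrences of `天安门` adjacent, so that word co-occurs with itself; filtering after counting instead would keep the original window positions.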
