Python code for sentence-by-sentence comparison of Word documents using the TF-IDF algorithm
Posted: 2023-06-01 15:05:20
The following code can serve as a reference implementation:
```python
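import os

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def load_files(path):
    """Recursively collect the paths of all files under `path`."""
    files = []
    for dirpath, _, filenames in os.walk(path):
        for filename in filenames:
            files.append(os.path.join(dirpath, filename))
    return files


def compare_sentences(file1, file2):
    """Return the TF-IDF cosine similarity between the full texts of two files."""
    # Files are read as plain UTF-8 text; a binary .docx file must be
    # converted to text before it can be passed to this function.
    with open(file1, 'r', encoding='utf-8') as f:
        text1 = f.read()
    with open(file2, 'r', encoding='utf-8') as f:
        text2 = f.read()
    # Fit on both texts so they share one vocabulary and one set of IDF weights.
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform([text1, text2])
    # Each row of X is one document; compare row 0 with row 1.
    return cosine_similarity(X[0], X[1])[0, 0]


if __name__ == '__main__':
    files = load_files('/path/to/files')
    # Compare every unordered pair of files exactly once.
    for i in range(len(files)):
        for j in range(i + 1, len(files)):
            similarity = compare_sentences(files[i], files[j])
            print('Similarity between {} and {}: {}'.format(files[i], files[j], similarity))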
```
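Here, `load_files` collects the paths of the documents to compare, and `compare_sentences` computes the similarity of two documents, using TF-IDF for feature extraction and cosine similarity for scoring. The main block then walks the file list and compares every pair of documents. Note that, despite its name, `compare_sentences` scores each pair of files as a whole rather than sentence by sentence, and it reads the files as plain text.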
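Since the question asks for a sentence-by-sentence comparison, the sketch below shows one way the same idea could be applied at sentence level. It is not part of the original answer: the names `split_sentences`, `tokenize`, `tfidf_vectors`, `cosine`, and `compare_by_sentence` are hypothetical, the sentence splitter is deliberately naive, and the tokenizer assumes a whitespace-delimited language such as English (Chinese text would need a word segmenter first). To keep the example dependency-free, TF-IDF is computed by hand with a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which is the weighting scikit-learn applies by default.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: a sentence ends at ., !, ? or their CJK forms."""
    parts = re.findall(r"[^.!?。！？]+[.!?。！？]*", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Lower-case word tokens (assumes whitespace-delimited text)."""
    return re.findall(r"\w+", sentence.lower())


def tfidf_vectors(sentences):
    """Return one L2-normalised {term: weight} dict per sentence."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences each term occurs.
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0)
               for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def compare_by_sentence(text1, text2):
    """For each sentence of text1, find its most similar sentence in text2."""
    s1, s2 = split_sentences(text1), split_sentences(text2)
    # Weight all sentences together so both texts share one vocabulary.
    vecs = tfidf_vectors(s1 + s2)
    v1, v2 = vecs[:len(s1)], vecs[len(s1):]
    results = []
    for sentence, vec in zip(s1, v1):
        best_match, best_score = None, 0.0
        for candidate, other in zip(s2, v2):
            score = cosine(vec, other)
            if best_match is None or score > best_score:
                best_match, best_score = candidate, score
        results.append((sentence, best_match, best_score))
    return results


if __name__ == "__main__":
    a = "The cat sat on the mat. Dogs bark loudly."
    b = "Dogs bark loudly. Birds sing in the morning."
    for sentence, match, score in compare_by_sentence(a, b):
        print("{!r} -> {!r} ({:.3f})".format(sentence, match, score))
```

`compare_by_sentence` takes plain strings, so the text of a Word document has to be extracted first. One option, assuming the third-party python-docx package is installed, is to join the `text` of each entry in `Document(path).paragraphs` and pass the result in.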