Computing the similarity of two texts in Python
Posted: 2023-12-08 07:06:10
The similarity of two texts can be computed with Python text-similarity libraries such as gensim and nltk. Two approaches are shown below:
1. Computing text similarity with gensim
```python
from gensim.matutils import softcossim  # removed in gensim 4.0; requires gensim < 4
from gensim import corpora
from gensim.utils import simple_preprocess
from gensim.models import Word2Vec

# Train a small word-vector model on a toy corpus
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, min_count=1)

# Build a dictionary over the corpus
documents = ["cat say meow", "dog say woof"]
texts = [simple_preprocess(document) for document in documents]
dictionary = corpora.Dictionary(texts)

# Term-similarity matrix derived from the word vectors
similarity_matrix = model.wv.similarity_matrix(dictionary)

# Soft cosine similarity between a query and the first document
query = "cat say meow"
query_bow = dictionary.doc2bow(simple_preprocess(query))
document_bow = dictionary.doc2bow(simple_preprocess(documents[0]))
similarity = softcossim(query_bow, document_bow, similarity_matrix)
print(similarity)
```
2. Computing text similarity with nltk and scikit-learn
```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# First run only: fetch tokenizer, stop-word, and lemmatizer data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

doc1 = "This is a sample sentence"
doc2 = "This is another example sentence"
stop_words = set(stopwords.words('english'))

# Preprocess each document: tokenize, drop stop words, lemmatize
lemmatizer = WordNetLemmatizer()
def preprocess(doc):
    return " ".join(
        lemmatizer.lemmatize(word.lower())
        for word in word_tokenize(doc)
        if word.lower() not in stop_words
    )

# Build TF-IDF vectors from the preprocessed documents
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform([preprocess(doc1), preprocess(doc2)])

# Cosine similarity between the two TF-IDF vectors
similarity = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0, 0]
print(similarity)
```
Both approaches compute the similarity of two texts; which to choose depends on the task. TF-IDF cosine similarity only credits exact shared terms, while the soft-cosine approach also gives partial credit to semantically related words via the word vectors.
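For a quick dependency-free baseline, the standard library's `difflib` can also score two strings by the ratio of matching character runs. This measures surface overlap rather than semantic similarity, so it is only a rough sanity check:

```python
import difflib

doc1 = "This is a sample sentence"
doc2 = "This is another example sentence"

# Ratio of matching character runs, in [0, 1]
similarity = difflib.SequenceMatcher(None, doc1, doc2).ratio()
print(similarity)
```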