Computing the similarity of two texts in Python
To compute the similarity of two texts, you can use one of Python's text-similarity libraries, such as gensim or nltk. Two concrete approaches follow:
1. Computing text similarity with gensim
```python
from gensim.matutils import softcossim  # requires gensim < 4.0
from gensim import corpora
from gensim.utils import simple_preprocess
from gensim.models import Word2Vec

# Train a small Word2Vec model (toy corpus; use a pretrained model in practice)
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, min_count=1)

# Build a dictionary over the tokenized corpus
documents = ["cat say meow", "dog say woof"]
texts = [simple_preprocess(document) for document in documents]
dictionary = corpora.Dictionary(texts)

# Term-similarity matrix derived from the word vectors (gensim 3.x API)
similarity_matrix = model.wv.similarity_matrix(dictionary)

# Soft cosine similarity between the query and the first document
query = "cat say meow"
query_bow = dictionary.doc2bow(simple_preprocess(query))
document_bow = dictionary.doc2bow(simple_preprocess(documents[0]))
similarity = softcossim(query_bow, document_bow, similarity_matrix)
print(similarity)
```
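Note that `softcossim` and `similarity_matrix` were removed in gensim 4.0. A minimal sketch of the same computation under the gensim 4.x API (reusing `model`, `dictionary`, `query_bow`, and `document_bow` from the block above) might look like this:
```python
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex

# Term-to-term similarity index backed by the trained word vectors
termsim_index = WordEmbeddingSimilarityIndex(model.wv)
termsim_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)

# normalized=(True, True) divides by both vector norms, matching softcossim
similarity = termsim_matrix.inner_product(query_bow, document_bow, normalized=(True, True))
print(similarity)
```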
2. Computing text similarity with nltk and scikit-learn
```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# One-time downloads of the required NLTK resources:
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')

# Input texts
doc1 = "This is a sample sentence"
doc2 = "This is another example sentence"
stop_words = set(stopwords.words('english'))

# Preprocess: tokenize, lowercase, drop stopwords, lemmatize
lemmatizer = WordNetLemmatizer()
tokens1 = [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(doc1) if word.lower() not in stop_words]
tokens2 = [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(doc2) if word.lower() not in stop_words]

# Build TF-IDF vectors from the preprocessed tokens
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform([' '.join(tokens1), ' '.join(tokens2)])

# TF-IDF rows are L2-normalized, so their dot product is the cosine similarity
similarity = (tfidf_matrix * tfidf_matrix.T).toarray()[0, 1]
print(similarity)
```
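Equivalently, scikit-learn's `cosine_similarity` helper makes the final step more explicit. A short sketch reusing `tfidf_matrix` from the block above:
```python
from sklearn.metrics.pairwise import cosine_similarity

# cosine_similarity returns a matrix of pairwise scores; with one row
# vector on each side the result is 1x1
similarity = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0, 0]
print(similarity)
```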
Both approaches yield a similarity score between two texts. The soft cosine measure uses word embeddings, so it can credit semantically related but non-identical words, while the TF-IDF cosine compares texts only by their weighted shared terms; choose whichever fits your data and requirements.