Text similarity code with gensim in Jupyter Notebook
Time: 2023-11-16 10:07:50
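The following code computes text similarity with the gensim package in a Jupyter Notebook. It builds TF-IDF and LSI representations of a small corpus, then ranks the documents by their similarity to a query: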
```python
# Import the required packages
from gensim import corpora, models, similarities
# Prepare the data
documents = ["This is a sample sentence.",
             "This is another sentence.",
             "I love coding in Python.",
             "I hate coding in Java."]
# Tokenize each document (lowercase, then split on whitespace)
texts = [document.lower().split() for document in documents]
# Build the dictionary (maps each token to an integer id)
dictionary = corpora.Dictionary(texts)
# Build the bag-of-words corpus
corpus = [dictionary.doc2bow(text) for text in texts]
# Train the TF-IDF model
tfidf = models.TfidfModel(corpus)
# Convert the corpus to its TF-IDF representation
corpus_tfidf = tfidf[corpus]
# Train the LSI model
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
# Convert the corpus to its LSI representation
corpus_lsi = lsi[corpus_tfidf]
# Build the similarity index
index = similarities.MatrixSimilarity(corpus_lsi)
# Compute the similarity of the query against every document,
# applying the same tokenization and transformations as the corpus
query = "I love coding in Python."
query_bow = dictionary.doc2bow(query.lower().split())
query_tfidf = tfidf[query_bow]
query_lsi = lsi[query_tfidf]
sims = index[query_lsi]
# Print the results, most similar document first
for document_number, score in sorted(enumerate(sims), key=lambda x: x[1], reverse=True):
    print("Document number: {} Similarity score: {}".format(document_number, score))
```
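The scores returned by `MatrixSimilarity` are cosine similarities between the query vector and each document vector. To make that step concrete, here is a minimal sketch that computes cosine similarity directly on raw bag-of-words counts, using only the standard library. It is not gensim's implementation and it skips the TF-IDF and LSI steps, so its scores will differ from the ones above. The helper names `tokenize` and `cosine` are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter

documents = [
    "This is a sample sentence.",
    "This is another sentence.",
    "I love coding in Python.",
    "I hate coding in Java.",
]


def tokenize(text):
    # Hypothetical helper: same naive tokenization as the gensim example
    # (lowercase, split on whitespace, punctuation stays attached).
    return text.lower().split()


def cosine(a, b):
    # Hypothetical helper: cosine similarity of two sparse vectors
    # stored as {token: weight} mappings.
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Raw term counts stand in for the TF-IDF/LSI vectors (an assumption
# made to keep the sketch short).
bows = [Counter(tokenize(doc)) for doc in documents]
query_bow = Counter(tokenize("I love coding in Python."))

scores = [cosine(query_bow, bow) for bow in bows]
ranking = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

for document_number, score in ranking:
    print(f"Document number: {document_number} Similarity score: {score:.3f}")
```

With raw counts, only documents that share tokens with the query get a non-zero score. The LSI step in the gensim pipeline projects documents into a low-dimensional topic space, so documents with no shared tokens can still receive a non-zero similarity there.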