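Write a piece of Python code that computes the similarity of two input texts. Requirements: cosine similarity, the input function, building a bag of words, vector representation, stop words, word segmentation, and a plot using the k-means clustering algorithm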
Time: 2024-05-15 14:17:58 Views: 13
The following Python example computes the cosine similarity between two texts:
```
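import math
import re
from collections import Counter

# A small illustrative English stop-word list
STOPWORDS = {'the', 'and', 'to', 'of', 'that', 'in', 'it', 'for', 'with', 'as'}

def tokenize(text):
    # Split into lowercase word tokens and drop stop words
    return [word for word in re.findall(r'\w+', text.lower()) if word not in STOPWORDS]

def cosine_similarity(text1, text2):
    words1, words2 = tokenize(text1), tokenize(text2)
    # Build the bag of words: the shared vocabulary of both texts
    vocabulary = sorted(set(words1) | set(words2))
    # Vector representation: each text's term counts over the shared vocabulary
    counts1, counts2 = Counter(words1), Counter(words2)
    vec1 = [counts1[word] for word in vocabulary]
    vec2 = [counts2[word] for word in vocabulary]
    # Cosine similarity: dot product divided by the product of the vector norms
    numerator = sum(x * y for x, y in zip(vec1, vec2))
    denominator = math.sqrt(sum(x ** 2 for x in vec1)) * math.sqrt(sum(y ** 2 for y in vec2))
    # Guard against empty texts (or texts made up only of stop words)
    return numerator / denominator if denominator else 0.0

# Test code
text1 = input("Enter the first text: ")
text2 = input("Enter the second text: ")
similarity = cosine_similarity(text1, text2)
print("Cosine similarity of the two texts:", similarity)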
```
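The bag-of-words step is easiest to check on a fixed pair of inputs rather than through `input`. The sketch below is only an illustration: the helper names `bow_vectors` and `cosine` and the two sample sentences are assumptions made for this example, not part of the answer above. It shows that both count vectors must be indexed by the same vocabulary before the cosine is taken.

```python
import math
from collections import Counter

def bow_vectors(tokens1, tokens2):
    """Return (vocabulary, vec1, vec2), with both vectors aligned to one vocabulary."""
    vocabulary = sorted(set(tokens1) | set(tokens2))
    counts1, counts2 = Counter(tokens1), Counter(tokens2)
    vec1 = [counts1[word] for word in vocabulary]
    vec2 = [counts2[word] for word in vocabulary]
    return vocabulary, vec1, vec2

def cosine(vec1, vec2):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm = math.sqrt(sum(a * a for a in vec1)) * math.sqrt(sum(b * b for b in vec2))
    return dot / norm if norm else 0.0

# Hypothetical sample sentences, already split into tokens
tokens_a = "cats chase mice".split()
tokens_b = "dogs chase cats".split()

vocab, va, vb = bow_vectors(tokens_a, tokens_b)
print(vocab)           # ['cats', 'chase', 'dogs', 'mice']
print(va, vb)          # [1, 1, 0, 1] [1, 1, 1, 0]
print(cosine(va, vb))  # two shared words out of three each: about 2/3
```

Because the two sentences share two of their three words, the dot product is 2 and each norm is the square root of 3, which gives a similarity of about 0.667.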
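To cluster texts with the k-means algorithm and visualize the result, you can use scikit-learn for the clustering and the matplotlib library for the plot. Example code: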
```
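import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

# Create sample data
texts = ["I am a boy", "You are a girl", "He is a man", "She is a woman"]

# Bag-of-words vectorizer (its default tokenizer drops one-letter tokens such as "I" and "a")
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Cluster with KMeans, K=2
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)

# Plot the clusters: project the vectors to 2D with PCA instead of
# plotting two arbitrary vocabulary columns
points = PCA(n_components=2).fit_transform(X.toarray())
plt.scatter(points[:, 0], points[:, 1], c=labels)
plt.show()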
```
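This code converts the four texts into bag-of-words vectors, uses the KMeans algorithm to split them into two clusters, and draws a scatter plot of the clustering result with matplotlib.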
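For readers curious about what the clustering step does internally, here is a minimal pure-Python sketch of the basic k-means loop (Lloyd's algorithm). It is not scikit-learn's implementation: the function name `kmeans`, the naive seeding with the first `k` points, and the toy count vectors are all assumptions made for illustration. scikit-learn's `KMeans` uses k-means++ seeding by default.

```python
def kmeans(points, k, iterations=100):
    """Basic k-means on a list of equal-length numeric vectors.

    Returns (labels, centroids). Assumes len(points) >= k and iterations >= 1.
    """
    # Assumption: seed the centroids with the first k points (deterministic but naive)
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Stop once the assignments no longer change
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical 2-dimensional count vectors forming two obvious groups
vectors = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centroids = kmeans(vectors, k=2)
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [[0.0, 0.5], [10.0, 10.5]]
```

The result of this loop depends on the starting centroids, which is why library implementations typically use smarter seeding and several restarts.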