Information Retrieval / Text Similarity: Computing the Similarity Distribution of Website Pages with TF-IDF (Python)
Date: 2023-12-29 16:06:18
TF-IDF is a widely used method for computing text similarity, and it can be applied to measure the similarity distribution across a set of website pages. Below is an example in Python that fetches a list of pages, extracts their visible text, and computes pairwise TF-IDF similarities:
```python
import requests
from bs4 import BeautifulSoup
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Fetch the raw HTML of a page; return an empty string on any request failure
def get_page_content(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        return ""

# Strip <script>/<style> tags and extract the visible text from the HTML
def parse_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup(["script", "style"]):
        tag.extract()
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
    return '\n'.join(chunk for chunk in chunks if chunk)

# Build TF-IDF vectors for all pages and compute the pairwise similarity
# matrix. TfidfVectorizer L2-normalizes each row by default, so the dot
# product tfidf * tfidf.T is the cosine similarity between pages.
def calculate_similarity_matrix(content_list):
    vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
    tfidf = vectorizer.fit_transform(content_list)
    similarity_matrix = (tfidf * tfidf.T).toarray()
    return similarity_matrix

# Summarize the distribution of pairwise similarities, taking the upper
# triangle so each pair is counted once, and dropping zero-similarity pairs
def print_similarity_distribution(similarity_matrix):
    similarity_distribution = np.triu(similarity_matrix, k=1).flatten()
    similarity_distribution = similarity_distribution[similarity_distribution > 0]
    print("Mean similarity:", np.mean(similarity_distribution))
    print("Median similarity:", np.median(similarity_distribution))
    print("Max similarity:", np.max(similarity_distribution))
    print("Min similarity:", np.min(similarity_distribution))
    print("Standard deviation of similarity:", np.std(similarity_distribution))

# Example usage
if __name__ == "__main__":
    urls = ["https://www.baidu.com/", "https://www.zhihu.com/", "https://www.google.com/"]
    content_list = [parse_html(get_page_content(url)) for url in urls]
    similarity_matrix = calculate_similarity_matrix(content_list)
    print_similarity_distribution(similarity_matrix)
```
This example fetches the pages of Baidu, Zhihu, and Google and computes the distribution of pairwise TF-IDF similarities between them. Edit the `urls` list to compare other sites. Note that `TfidfVectorizer`'s default token pattern does not segment Chinese text, so for Chinese-language pages you would want to pre-segment the text with a tokenizer such as jieba and drop the English stop-word list.
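As a quick offline sanity check, the same TF-IDF-and-upper-triangle pipeline can be exercised on in-memory strings instead of fetched pages; the three short document texts below are made up for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical in-memory "pages" standing in for fetched HTML text
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices rose sharply today",
]

vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Rows are L2-normalized, so the dot product is cosine similarity
sim = (tfidf * tfidf.T).toarray()

# Keep each pair once (upper triangle) and drop zero-similarity pairs,
# mirroring the filtering done in print_similarity_distribution
upper = np.triu(sim, k=1).flatten()
upper = upper[upper > 0]

print("positive pairs:", len(upper))
print("mean similarity:", round(float(np.mean(upper)), 3))
```

The two cat sentences share the term "cat" and so form the one positive-similarity pair, while the finance sentence shares no non-stopword terms with either of them, so its pairs are zero and get filtered out.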