
Counting Word Frequency with a Python Web Crawler

Time: 2024-05-10 09:12:56 Views: 127

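The basic approach to counting word frequency with a Python crawler is as follows:

1. Use a web crawler to fetch the page content.
2. Parse the page content with regular expressions or a tool such as BeautifulSoup.
3. Segment the extracted text into words; a third-party library such as jieba can handle Chinese word segmentation.
4. Count how many times each word occurs and sort the words by frequency, from highest to lowest.
5. Optionally, visualize the result with a library such as Matplotlib.

The following is example code for counting word frequency with a Python crawler:

```python
import requests
import jieba
from bs4 import BeautifulSoup
from collections import Counter
import matplotlib.pyplot as plt

# Fetch the page content
url = 'https://www.csdn.net/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers)
html = response.text

# Parse the page content (regular expressions would also work;
# BeautifulSoup is used here)
soup = BeautifulSoup(html, 'html.parser')
text = soup.get_text()

# Segment the text into words, dropping whitespace-only tokens
words = [w for w in jieba.cut(text) if w.strip()]

# Count occurrences of each word and take the 20 most frequent
counter = Counter(words)
top_words = counter.most_common(20)

# Visualize the result. most_common() returns (word, count) pairs,
# so the words go on the x axis and the counts on the y axis.
# Note: Chinese labels need a CJK-capable font configured in Matplotlib.
x_data = [word for word, count in top_words]
y_data = [count for word, count in top_words]
plt.bar(x_data, y_data)
plt.show()
```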
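In practice, the token stream from a segmenter such as jieba tends to contain whitespace, punctuation, and single-character function words, which can crowd out meaningful words in the top-20 list. The following is a minimal sketch of a filtering step for the counting stage, using only the standard library. The helper name `count_tokens`, its parameters `min_len` and `top_n`, and the character ranges in the regular expression are assumptions made for illustration; they are not part of the original example.

```python
import re
from collections import Counter

# Assumption: a token is worth counting if it contains at least one
# common CJK character, ASCII letter, or digit.
_WORD_RE = re.compile(r"[\u4e00-\u9fffA-Za-z0-9]")


def count_tokens(tokens, min_len=2, top_n=20):
    """Count tokens, skipping whitespace, punctuation, and short tokens.

    `tokens` can be any iterable of strings, for example the output of
    a word segmenter. Hypothetical helper, for illustration only.
    """
    cleaned = (t.strip() for t in tokens)
    kept = [t for t in cleaned if len(t) >= min_len and _WORD_RE.search(t)]
    return Counter(kept).most_common(top_n)


# Hand-written sample tokens standing in for segmenter output
sample = ["爬虫", " ", "，", "爬虫", "词频", "的", "\n", "Python", "词频", "爬虫"]
print(count_tokens(sample, top_n=2))
```

Raising `min_len` or swapping the regular expression for a stop-word list are both reasonable variations, depending on how noisy the scraped text is.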
