
Scraping Web Page Text with a Python Crawler

Date: 2023-11-08 19:53:14 Views: 139

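The steps for scraping web page text with a Python crawler are as follows:

1. Import the required modules, such as requests and beautifulsoup4.
2. Use the requests module to send an HTTP request and fetch the page's source code.
3. Use the beautifulsoup4 module to parse the HTML document and extract the data you need.
4. Optionally, use regular expressions to further process and extract the data.

Here is a simple example of a Python crawler that scrapes web page text:

```python
import re

import requests
from bs4 import BeautifulSoup

# Send an HTTP request and get the page source
url = 'http://example.com'
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop on 4xx/5xx responses
html = response.text

# Parse the HTML document with beautifulsoup4 and extract the data we need
soup = BeautifulSoup(html, 'html.parser')
node = soup.find('div', {'class': 'data'})
# find() returns None when no element matches, so guard before reading the text
data = node.get_text() if node is not None else ''

# Use a regular expression to further process the extracted text
pattern = re.compile(r'\d+')
result = pattern.findall(data)
```

Note that a crawler must comply with applicable laws and regulations and with the website's terms of use when fetching pages; illegal scraping is not permitted.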
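One practical way to act on the compliance note above is to consult a site's robots.txt before fetching a page. The following is a minimal sketch, not a complete compliance check: it uses the standard library's `urllib.robotparser`, and the robots.txt content, the `is_allowed` helper, and the `my-crawler` user agent are all hypothetical names introduced for illustration. The rules are inlined so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "my-crawler" is an assumed user-agent name, not taken from the original answer
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "http://example.com/index.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "http://example.com/private/data.html"))
```

Against a live site, the same parser can load the real file with `set_url()` followed by `read()` instead of `parse()`. A robots.txt check covers only the site's stated crawling preferences; the terms of use and applicable law still need to be reviewed separately.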
