Common Python web scraping code
Date: 2023-11-03 09:58:13
The following are commonly used web scraping snippets in Python:
1. Send an HTTP request with the requests library to fetch the page content
```python
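import requests

url = 'https://www.example.com'
# Set a timeout so a stalled server cannot hang the script
response = requests.get(url, timeout=10)
# Raise an exception on 4xx/5xx responses instead of parsing an error page
response.raise_for_status()
html_content = response.text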
```
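For repeated requests, a shared session with retries and an explicit User-Agent is often more robust than bare `requests.get` calls. The sketch below is one possible setup, not a required pattern: the helper names `make_session` and `fetch_html`, the User-Agent string, and the retry settings are all assumptions chosen for illustration.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(user_agent="example-crawler/0.1", retries=3):
    """Build a session that retries transient failures (hypothetical helper)."""
    retry = Retry(
        total=retries,
        backoff_factor=0.5,  # wait progressively longer between attempts
        status_forcelist=(429, 500, 502, 503, 504),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Use the retrying adapter for both URL schemes
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(session, url, timeout=10):
    """Return the page body as text, raising on HTTP error status codes."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text
```

A call such as `fetch_html(make_session(), 'https://www.example.com')` would then reuse one connection pool across pages.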
2. Parse the HTML with the BeautifulSoup library and extract information
```python
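from bs4 import BeautifulSoup

# html_content is the page source fetched in the previous step
soup = BeautifulSoup(html_content, 'html.parser')
# soup.title is None when the page has no <title> tag
title = soup.title.get_text(strip=True) if soup.title else None
# href=True skips <a> tags that have no href attribute
links = [link['href'] for link in soup.find_all('a', href=True)]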
```
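Extracted `href` values are often relative (for example `/about`), so they usually need to be resolved against the page URL before they can be requested. A minimal sketch, assuming a hypothetical helper named `extract_links` and that only http(s) links are of interest:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return unique absolute http(s) links found in html, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        # Resolve relative paths such as "/about" against the page URL
        absolute = urljoin(base_url, anchor["href"])
        # Skip mailto:, javascript: and other non-HTTP schemes, and duplicates
        if absolute.startswith(("http://", "https://")) and absolute not in links:
            links.append(absolute)
    return links
```

Here `base_url` should be the URL the HTML was actually fetched from, so that relative paths resolve correctly.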
3. Match page content with regular expressions
```python
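import re

# Non-greedy match; DOTALL lets the title span several lines
pattern = r'<title>(.*?)</title>'
# findall returns a list of every captured title string
titles = re.findall(pattern, html_content, flags=re.IGNORECASE | re.DOTALL)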
```
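Regex matches on raw HTML tend to come back with entities and stray whitespace still in place. The following sketch cleans a single title match; the names `TITLE_RE` and `extract_title` are hypothetical, and a real HTML parser is generally the safer choice for anything beyond simple patterns like this one.

```python
import html
import re

# Tolerates attributes on the tag, mixed case, and multi-line titles
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(page):
    """Return the cleaned <title> text, or None if the page has no title."""
    match = TITLE_RE.search(page)
    if match is None:
        return None
    # Decode entities such as &amp; and collapse runs of whitespace
    text = html.unescape(match.group(1))
    return " ".join(text.split())
```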
4. Simulate browser behavior with the Selenium library
```python
from selenium import webdriver

url = 'https://www.example.com'
driver = webdriver.Chrome()
try:
    driver.get(url)
    html_content = driver.page_source
finally:
    # Always close the browser, even if loading fails
    driver.quit()
```
5. Crawl in a structured way with the Scrapy framework
```python
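import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://www.example.com']

    def parse(self, response):
        # get() returns the first match or None; getall() returns every match
        title = response.xpath('//title/text()').get()
        links = response.css('a::attr(href)').getall()
        # Yield the scraped data so Scrapy can pass it to pipelines and exporters
        yield {'title': title, 'links': links}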
```
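Note that when running a crawler you must comply with the site's robots.txt file and terms of use, so that you do not put unnecessary load on the site.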
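The robots.txt check can be automated with the standard library's `urllib.robotparser`. This is a minimal sketch: the helper names `build_parser` and `is_allowed` and the user agent string `example-crawler` are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Create a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent="example-crawler"):
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)
```

To load the rules from a live site instead of a string, call `set_url('https://www.example.com/robots.txt')` followed by `read()` on a `RobotFileParser` instance. The parser's `crawl_delay(user_agent)` method returns the site's requested delay in seconds, or `None` if the file does not specify one.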