Python Web Crawler Basic Syntax

Time: 2024-12-10 17:14:29

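Sure, here is an introduction to the basic syntax of Python web crawlers. Python is a common language for writing crawlers, and its concise, readable syntax is well suited to the task. Below are some basic patterns that Python crawlers use frequently: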

Import the necessary libraries:
```
import requests
from bs4 import BeautifulSoup
import re
```
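requests sends HTTP requests, BeautifulSoup parses HTML, and re is the regular expression library. requests and bs4 are third-party packages (installed with pip install requests beautifulsoup4), while re ships with the standard library.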

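Send an HTTP request:

response = requests.get(url)  # url is the address of the page to fetch
html = response.text  # response body decoded to a string

This sends a GET request and stores the page content, decoded as text, in html.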

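Parse the HTML:

soup = BeautifulSoup(html, 'html.parser')  # parse with Python's built-in HTML parser
title = soup.title.string  # text of the <title> tag

This parses the HTML with BeautifulSoup and extracts the page title. If the page has no <title> tag, soup.title is None and the second line raises AttributeError, so check for None when the input is not guaranteed to contain one.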
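The 'html.parser' argument in the snippet above selects Python's built-in html.parser module as the parsing backend. As a rough sketch of what title extraction involves at that level, the following uses the module directly, with no third-party packages. The class name TitleParser, the helper extract_title, and the sample HTML are assumptions made for illustration; this is not a description of how BeautifulSoup works internally.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only start collecting if no title has been captured yet.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the stripped page title, or None if there is no <title> tag."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip() if parser.title is not None else None


sample_html = "<html><head><title> Example Page </title></head><body><p>Hi</p></body></html>"
print(extract_title(sample_html))
print(extract_title("<p>no title here</p>"))
```

Returning None for a missing title mirrors the None check that soup.title needs before .string is read.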

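Find elements:

links = soup.find_all('a')  # every <a> tag in the document
for link in links:
    print(link.get('href'))  # prints None when the tag has no href attribute

This finds all <a> tags and extracts their href attributes.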
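The href values collected this way are often relative paths, in-page anchors, or None, so a crawler usually has to normalize them before following them. A minimal sketch using the standard library's urljoin is shown here; the helper name absolute_links, the list of skipped prefixes, and the example.com URLs are assumptions for illustration.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve raw href values against base_url, skipping unusable ones."""
    result = []
    for href in hrefs:
        # link.get('href') yields None for <a> tags without an href attribute.
        if not href:
            continue
        # Skip in-page anchors and schemes that a crawler cannot fetch over HTTP.
        if href.startswith(("#", "mailto:", "javascript:")):
            continue
        result.append(urljoin(base_url, href))
    return result


hrefs = ["/about", "news/today.html", None, "#top",
         "https://example.org/x", "mailto:a@example.com"]
for url in absolute_links("https://example.com/blog/index.html", hrefs):
    print(url)
```

Here base_url should be the address of the page the links were found on, since relative paths are resolved against it.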

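Regular expressions:

pattern = re.compile(r'\d+')  # one or more consecutive digits
numbers = pattern.findall(text)  # list of the matched digit strings

This uses a regular expression to find the numbers in text. findall returns them as strings, so convert with int() if numeric values are needed.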
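To make the behavior of this pattern concrete, here is a small worked example on a made-up string. The sample text and the two extra patterns (one for decimals, one with named groups for a date) are illustrative assumptions, not part of the original snippet.

```python
import re

text = "Order 66 shipped on 2024-12-10, total 3 items, price: 199.50 USD"

# \d+ matches each run of consecutive digits, so the decimal point splits 199.50.
digit_runs = re.compile(r"\d+").findall(text)
print(digit_runs)

# Allow an optional fractional part to keep decimal numbers together.
decimal_pattern = re.compile(r"\d+(?:\.\d+)?")
numbers = decimal_pattern.findall(text)
print(numbers)

# Named groups pull structured fields out of a single match.
date_pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
match = date_pattern.search(text)
date_parts = {name: int(value) for name, value in match.groupdict().items()}
print(date_parts)
```

Compiling the pattern once with re.compile, as the snippet above does, is convenient when the same pattern is applied to many pages.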

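Data storage:

import csv

with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Column1', 'Column2'])  # header row
    writer.writerows(data)  # data is an iterable of rows, one list or tuple per row

This saves the scraped data to a CSV file. The csv module must be imported in addition to the libraries listed earlier.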
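The shape of data matters here: writerows expects one sequence per row. The sketch below writes two hypothetical scraped rows and reads them back; an in-memory io.StringIO buffer stands in for the file so the example leaves nothing on disk, and the row contents and column names are invented for illustration.

```python
import csv
import io

# Hypothetical scraped rows: (title, url) pairs.
data = [
    ("First post", "https://example.com/1"),
    ("Title with, a comma", "https://example.com/2"),
]

# newline="" plays the same role as in open(...): the csv module controls line endings.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["title", "url"])
writer.writerows(data)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back shows that the comma inside the title survives the round trip.
rows = list(csv.reader(io.StringIO(csv_text, newline="")))
print(rows)
```

The writer quotes the field containing a comma automatically, which is the main reason to prefer the csv module over joining strings by hand.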

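Handling pagination:

for page in range(1, total_pages + 1):
    url = base_url + str(page)  # assumes base_url ends where the page number goes
    # process the data on each page

When the data spans multiple pages, build each page's URL in a loop.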
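String concatenation works when the page number is the last part of the URL. When the page number is a query parameter alongside others, building the query string with urlencode is less error-prone. The following is a sketch under that assumption; the function name page_urls, the parameter name page, and the example.com address are hypothetical, and base_url is assumed to carry no query string of its own.

```python
from urllib.parse import urlencode


def page_urls(base_url, total_pages, page_param="page", **extra_params):
    """Yield one URL per page, numbering pages from 1 to total_pages."""
    for page in range(1, total_pages + 1):
        params = {**extra_params, page_param: page}
        # urlencode escapes values, so spaces and non-ASCII text are safe.
        yield f"{base_url}?{urlencode(params)}"


for url in page_urls("https://example.com/search", 3, q="web crawler"):
    print(url)
```

Because page_urls is a generator, each URL is produced only when the loop asks for it, which suits crawlers that stop early once a page comes back empty.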

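Exception handling:

try:
    response = requests.get(url, timeout=10)  # wait at most 10 seconds for the server
    response.raise_for_status()  # raise HTTPError for 4xx and 5xx status codes
except requests.exceptions.RequestException as e:
    print(f"Error fetching {url}: {e}")

Adding exception handling makes the crawler more robust: RequestException is the base class of the errors requests raises, so this one handler covers connection failures, timeouts, and bad status codes.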
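Catching the error is often combined with retrying, since many network failures are transient. The sketch below shows one possible retry loop with exponential backoff. It is deliberately independent of any HTTP library: fetch is a hypothetical zero-argument callable, and the function name, parameters, and the simulated flaky_fetch are assumptions for illustration.

```python
import time


def fetch_with_retries(fetch, max_attempts=3, base_delay=1.0, retry_on=(Exception,)):
    """Call fetch() until it succeeds or max_attempts is reached.

    Waits base_delay, then 2 * base_delay, then 4 * base_delay, ... between attempts.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except retry_on as exc:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.2f}s")
            time.sleep(delay)


# Stand-in for a network call: fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "<html>ok</html>"


result = fetch_with_retries(flaky_fetch, max_attempts=5, base_delay=0.01)
print(result, "after", calls["count"], "calls")
```

With requests, fetch could be a small function that calls requests.get(url, timeout=10) followed by raise_for_status(), with retry_on set to (requests.exceptions.RequestException,).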

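These are some of the basic building blocks of a Python crawler. When writing a real one, you also need to consider anti-scraping mechanisms, dynamically rendered content, proxies, and other aspects.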