```python
import csv

import requests
from bs4 import BeautifulSoup

# Build a request header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# Fetch the page
url = 'http://www.pm25.in/shandong'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

# Extract the index values from each table row (skip the header row)
data_list = []
for tr in soup.find_all('tr')[1:]:
    tds = tr.find_all('td')
    aqi = tds[0].text
    pm25 = tds[1].text
    pm10 = tds[2].text
    co = tds[3].text
    so2 = tds[4].text
    no2 = tds[5].text
    o3 = tds[6].text
    data_list.append([aqi, pm25, pm10, co, so2, no2, o3])

# Save the results as a CSV file
with open('shandong.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['AQI', 'PM2.5', 'PM10', 'CO', 'SO2', 'NO2', 'O3'])
    writer.writerows(data_list)
```
This code scrapes air quality index data for cities in Shandong Province and saves it to a CSV file.
It first imports the requests and BeautifulSoup libraries: requests is used to send HTTP requests, and BeautifulSoup is used to parse the HTML content.
It then defines a request header, headers, so that the request looks like it comes from a browser, which improves the chance that the scrape succeeds.
Next, it sends a GET request with requests to fetch the HTML of the Shandong air quality monitoring page, then parses that HTML with BeautifulSoup and pulls the index values out of the table rows.
Finally, it saves the data to a CSV file. A with statement opens shandong.csv in write mode with newline='' and encoding='utf-8', which makes the file UTF-8 encoded and avoids extra blank lines between rows. csv.writer then writes the header row followed by the data rows.
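One fragile point in the extraction loop is that it assumes every row after the header contains at least seven cells; a shorter row would raise an IndexError. A minimal defensive sketch of the same loop, assuming the page layout and variable names used above:

```python
# Sketch: skip rows that do not have the expected seven cells
data_list = []
for tr in soup.find_all('tr')[1:]:
    tds = tr.find_all('td')
    if len(tds) < 7:
        continue  # malformed or empty row, skip it
    # strip surrounding whitespace from each cell before storing it
    data_list.append([td.text.strip() for td in tds[:7]])
```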
Related questions
Optimize this code:
```python
import requests
from bs4 import BeautifulSoup

url = "https://www.gupang.com/202305/64619.html"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
title = soup.find("h1", class_="title").text
content = soup.find("div", class_="content").text
print(title)
print(content)
```
This code can be improved in the following ways:
1. Add exception handling so that a failed request or a failed parse does not crash the program:
```python
import sys

import requests
from bs4 import BeautifulSoup

url = "https://www.gupang.com/202305/64619.html"

try:
    response = requests.get(url)
    response.raise_for_status()  # raise an exception if the request failed
except requests.exceptions.RequestException as e:
    print(e)
    sys.exit(1)

soup = BeautifulSoup(response.content, "html.parser")

try:
    title = soup.find("h1", class_="title").text
    content = soup.find("div", class_="content").text
except AttributeError as e:
    print(e)
    sys.exit(1)

print(title)
print(content)
```
2. Use a `with` statement to manage resources, so that the network connection is released automatically once you are done with it:
```python
import sys

import requests
from bs4 import BeautifulSoup

url = "https://www.gupang.com/202305/64619.html"

try:
    with requests.get(url) as response:
        response.raise_for_status()
        soup = BeautifulSoup(response.content, "html.parser")
        title = soup.find("h1", class_="title").text
        content = soup.find("div", class_="content").text
except (requests.exceptions.RequestException, AttributeError) as e:
    print(e)
    sys.exit(1)

print(title)
print(content)
```
This way you do not have to close the connection manually, and the code is more concise.
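If the script grows to fetch several pages, a `requests.Session` used as a context manager gives the same automatic cleanup while also reusing the underlying connection. A minimal sketch, where the URL list is only illustrative:

```python
import requests
from bs4 import BeautifulSoup

# Illustrative list of pages to fetch; replace with the real URLs
urls = [
    "https://www.gupang.com/202305/64619.html",
]

with requests.Session() as session:
    for url in urls:
        response = session.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, "html.parser")
        # print the page title (or the URL if no <title> is present)
        print(soup.title.text if soup.title else url)
```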
Modify the code so that li_list uses UTF-8 encoding:
```python
import requests
from bs4 import BeautifulSoup

url = 'https://www.icbc.com.cn/page/827855918799994880.html'
response = requests.get(url=url)
page_response = response.text
soup = BeautifulSoup(page_response, 'html.parser', from_encoding='utf-8')
li_list = soup.select('#mypagehtmlcontent p')
```
```python
import requests
from bs4 import BeautifulSoup

url = 'https://www.icbc.com.cn/page/827855918799994880.html'
response = requests.get(url=url)
# Decode the raw bytes as UTF-8 instead of relying on the guessed encoding
page_response = response.content.decode('utf-8')
soup = BeautifulSoup(page_response, 'html.parser')
li_list = soup.select('#mypagehtmlcontent p')

# Test code below, just for verification
for li in li_list:
    print(li.text)
```
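An equivalent approach, shown here as a sketch of an alternative rather than as part of the original answer, is to tell requests which encoding to use and then work with response.text:

```python
import requests
from bs4 import BeautifulSoup

url = 'https://www.icbc.com.cn/page/827855918799994880.html'
response = requests.get(url)
# Override the encoding guessed from the headers before reading response.text
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')
li_list = soup.select('#mypagehtmlcontent p')
```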