Scraping assignment: scrape all data and images from http://www.nmc.cn/publish/agro/soil-moisture-monitoring-10cm.html into an Excel spreadsheet (full code)
Date: 2024-12-18 08:32:27
Scraping a web page and saving its contents to an Excel spreadsheet typically uses Python libraries such as requests to fetch the page, BeautifulSoup to parse the HTML, pandas to handle the data, and xlsxwriter to write the Excel file. Below is a complete minimal example; it assumes the required libraries are already installed:
```python
import io
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Fetch the page
url = "http://www.nmc.cn/publish/agro/soil-moisture-monitoring-10cm.html"
response = requests.get(url, timeout=10)

# Check that the request succeeded
if response.status_code == 200:
    # Parse the HTML
    soup = BeautifulSoup(response.text, "html.parser")

    # Locate the data (this assumes the data sits in a <table> element;
    # adjust the selector to match the actual page structure)
    data_table = soup.find("table")

    # Extract every row; the first row is treated as the header
    rows = []
    if data_table is not None:
        for tr in data_table.find_all("tr"):
            cols = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
            if cols:
                rows.append(cols)

    # Build a DataFrame from the extracted rows
    df_data = pd.DataFrame(rows[1:], columns=rows[0]) if rows else pd.DataFrame()

    # Collect image URLs, resolving relative paths against the page URL
    img_links = [urljoin(url, img["src"]) for img in soup.find_all("img", src=True)]

    # Write the table to Excel and embed the images below it
    with pd.ExcelWriter("output.xlsx", engine="xlsxwriter") as writer:
        df_data.to_excel(writer, sheet_name="Sheet1", index=False)
        worksheet = writer.sheets["Sheet1"]
        for i, link in enumerate(img_links):
            img_response = requests.get(link, timeout=10)
            if img_response.status_code == 200:
                # Each image is downloaded once and inserted from memory,
                # spaced 20 rows apart so they do not overlap
                worksheet.insert_image(
                    len(df_data) + 2 + i * 20, 0, link,
                    {"image_data": io.BytesIO(img_response.content)},
                )
else:
    print(f"Request failed with status code {response.status_code}")
```
Note that this is only a basic example. The actual page structure may differ, so the `soup.find()` calls may need adjusting to match the real page elements. Before running the code, make sure you have network access, and be prepared to handle anti-scraping measures that some sites apply.
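As a sketch of such handling, the snippet below builds a requests session with a browser-like User-Agent and automatic retries on transient errors. This is only a common first step under the assumption that the site merely filters default client headers; sites with stricter measures (CAPTCHAs, JavaScript rendering) need other tools.

```python
import requests
from requests.adapters import HTTPAdapter, Retry

def make_session():
    """Build a session with a browser-like User-Agent and retry policy."""
    session = requests.Session()
    # Many sites reject the default "python-requests" User-Agent,
    # so send a browser-like one instead.
    session.headers.update({
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120.0 Safari/537.36"),
    })
    # Retry transient failures (connection errors, 429/5xx) with backoff.
    retries = Retry(total=3, backoff_factor=1,
                    status_forcelist=[429, 500, 502, 503, 504])
    session.mount("http://", HTTPAdapter(max_retries=retries))
    session.mount("https://", HTTPAdapter(max_retries=retries))
    return session
```

You would then replace the plain `requests.get(url, timeout=10)` calls in the main script with `make_session().get(url, timeout=10)`.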