Using Python with XPath, scrape each second-hand housing listing from the first ten pages of the 58同城 (58.com) real-estate site for the Keqiao (柯桥) district; on each detail page extract the title, community name, unit price, total price, layout, floor, and area; write the results to a CSV file, and add anti-scraping measures.
Posted: 2024-06-08 13:11:50
First, we need to install the `requests` and `lxml` libraries.
```bash
pip install requests lxml
```
Next, we send a request with `requests` and parse the HTML with `lxml`'s `etree` module.
```python
import requests
from lxml import etree
# NOTE: this path points at hz.58.com's 'kaifaqu' district, not the Keqiao
# district the question asks for; substitute Keqiao's actual 58.com path here.
url = 'https://hz.58.com/ershoufang/kaifaqu/pn1/'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers)
html = etree.HTML(response.text)
```
Next, we use XPath to collect each listing's link from the results page and visit its detail page to extract the information.
```python
for i in range(1, 11):  # first ten pages
    url = f'https://hz.58.com/ershoufang/kaifaqu/pn{i}/'
    response = requests.get(url, headers=headers)
    html = etree.HTML(response.text)
    house_list = html.xpath('//ul[@class="house-list-wrap"]/li')
    for house in house_list:
        # The class names below assume 58.com's current markup, which may change.
        detail_url = house.xpath('./div[@class="list-info"]/h2/a/@href')[0]
        detail_response = requests.get(detail_url, headers=headers)
        detail_html = etree.HTML(detail_response.text)
        title = detail_html.xpath('//div[@class="title"]/h1/text()')[0]
        community = detail_html.xpath('//div[@class="communityName"]/a/text()')[0]
        price = detail_html.xpath('//span[@class="total"]/text()')[0]
        unit_price = detail_html.xpath('//span[@class="unit"]/text()')[0]
        layout = detail_html.xpath('//div[@class="room"]/ul/li[1]/text()')[0]
        floor = detail_html.xpath('//div[@class="room"]/ul/li[2]/text()')[0]
        area = detail_html.xpath('//div[@class="room"]/ul/li[3]/text()')[0]
```
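The `[0]` indexing above raises `IndexError` whenever a selector matches nothing, and the class names are assumptions about 58.com's current markup, which changes often. A defensive helper (a sketch; `first` is a hypothetical name) keeps one missing field from aborting the whole crawl:

```python
from lxml import etree

def first(node, path, default=''):
    """Return the first XPath match, stripped of whitespace,
    or a default value when the selector matches nothing."""
    results = node.xpath(path)
    return results[0].strip() if results else default
```

Each extraction then becomes, for example, `title = first(detail_html, '//div[@class="title"]/h1/text()')`, with a blank cell instead of a crash when the page layout differs.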
Finally, we write the collected information to a CSV file.
```python
import csv
with open('houses.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['标题', '小区名称', '单价', '总价', '户型', '所在楼层', '面积'])
    for i in range(1, 11):  # first ten pages
        url = f'https://hz.58.com/ershoufang/kaifaqu/pn{i}/'
        response = requests.get(url, headers=headers)
        html = etree.HTML(response.text)
        house_list = html.xpath('//ul[@class="house-list-wrap"]/li')
        for house in house_list:
            detail_url = house.xpath('./div[@class="list-info"]/h2/a/@href')[0]
            detail_response = requests.get(detail_url, headers=headers)
            detail_html = etree.HTML(detail_response.text)
            title = detail_html.xpath('//div[@class="title"]/h1/text()')[0]
            community = detail_html.xpath('//div[@class="communityName"]/a/text()')[0]
            price = detail_html.xpath('//span[@class="total"]/text()')[0]
            unit_price = detail_html.xpath('//span[@class="unit"]/text()')[0]
            layout = detail_html.xpath('//div[@class="room"]/ul/li[1]/text()')[0]
            floor = detail_html.xpath('//div[@class="room"]/ul/li[2]/text()')[0]
            area = detail_html.xpath('//div[@class="room"]/ul/li[3]/text()')[0]
            writer.writerow([title, community, unit_price, price, layout, floor, area])
```
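Once the file is written, a quick sanity check with `csv.DictReader` confirms that the header row and the `utf-8-sig` encoding round-trip correctly (`read_rows` is a hypothetical helper name):

```python
import csv

def read_rows(path):
    """Read the CSV back as a list of dicts keyed by the header row."""
    with open(path, newline='', encoding='utf-8-sig') as f:
        return list(csv.DictReader(f))
```

For example, `read_rows('houses.csv')[0]['标题']` should return the first listing's title.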
To avoid having our IP blocked by anti-scraping mechanisms, we should add countermeasures such as rotating the User-Agent and pausing between requests.
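A minimal sketch of those two measures, assuming a small hand-picked User-Agent pool (the strings below are example values, not a maintained list) and a random delay before each request; `polite_get` takes a `requests.Session`, which also reuses connections and keeps cookies across requests:

```python
import random
import time

# Example User-Agent strings to rotate; extend with real browser UAs as needed.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]

def random_headers():
    """Build request headers with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}

def polite_get(session, url, min_delay=1.0, max_delay=3.0):
    """Sleep a random interval, then GET the URL with rotated headers."""
    time.sleep(random.uniform(min_delay, max_delay))
    return session.get(url, headers=random_headers(), timeout=10)
```

In the loops above, create `session = requests.Session()` once, then replace each `requests.get(url, headers=headers)` with `polite_get(session, url)`.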