Use Python to scrape 20 pages of room data for two-, three-, four-, and five-star hotels from https://hotels.ctrip.com/
Posted: 2023-12-03 10:01:45
Sure. This task uses Python's requests library to fetch the pages and the BeautifulSoup library to parse them. The code is as follows:
```python
import requests
from bs4 import BeautifulSoup

url = 'https://hotels.ctrip.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}


def text_or_none(node):
    # select_one returns None when a selector matches nothing
    return node.get_text(strip=True) if node else None


# Collect the hotel links for each star rating (2 to 5 stars)
star_urls = {}
for star in range(2, 6):
    params = {'star': star}
    html = requests.get(url + 'hotel', headers=headers, params=params, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.select('.hotel_list .hotel_new_list a.hotel_name')
    star_urls[star] = [link['href'] for link in links]

# Scrape the room data from the first 20 pages of each hotel
data = []
for star in range(2, 6):
    for link in star_urls[star]:
        hotel_id = link.split('/')[-1]
        for page in range(1, 21):
            params = {'hotelId': hotel_id, 'pageIndex': page}
            html = requests.get(url + 'hotel/dianping/' + hotel_id, headers=headers, params=params, timeout=10).text
            soup = BeautifulSoup(html, 'html.parser')
            hotel_name = text_or_none(soup.select_one('.hotel_name'))
            rooms = soup.select('.hotel_comment_list .comment_block')
            for room in rooms:
                room_data = {}
                room_data['hotel'] = hotel_name
                room_data['star'] = star
                room_data['room'] = text_or_none(room.select_one('.room_name'))
                room_data['price'] = text_or_none(room.select_one('.comment_txt span'))
                data.append(room_data)
print(data)
```
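The code first collects the hotel links for each star rating, then loops over every link and over the first 20 pages of each link to scrape the room data. All results are stored in a list and printed to the screen.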
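Printing a long list of dictionaries is hard to read, so one option is to serialize it as CSV instead. The following is only a minimal sketch using the standard library. The helper name `rows_to_csv` and the two sample rows are hypothetical placeholders, not real scraped data; the field names match the keys of `room_data` in the code above.

```python
import csv
import io

# Hypothetical rows in the same shape as the `data` list built by the scraper
data = [
    {'hotel': 'Example Hotel A', 'star': 3, 'room': 'Twin Room', 'price': '300'},
    {'hotel': 'Example Hotel B', 'star': 5, 'room': 'Suite', 'price': '1200'},
]


def rows_to_csv(rows, fieldnames=('hotel', 'star', 'room', 'price')):
    """Serialize a list of room dicts to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator='\n')
    writer.writeheader()       # first line: column names
    writer.writerows(rows)     # one line per room dict
    return buffer.getvalue()


csv_text = rows_to_csv(data)
print(csv_text)
```

To keep the result, the returned text could be written to a file opened with `encoding='utf-8'`, since hotel and room names on this site are mostly Chinese.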
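Note that the site has anti-scraping measures, so the requests set headers that imitate a browser. The code also relies on the requests and BeautifulSoup libraries, which must be installed with pip first (the package names are `requests` and `beautifulsoup4`).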
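To see what "imitating a browser" amounts to, it can help to build a request and inspect it without sending anything. The sketch below uses the standard library's `urllib` rather than requests, purely so that it runs with no extra packages. The helper `build_request` and the shortened User-Agent string are assumptions for illustration, and whether the site actually honors a `star` query parameter is not verified here.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = 'https://hotels.ctrip.com/'
# Shortened User-Agent for illustration; a real browser string is longer
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}


def build_request(path, params):
    """Build (but do not send) a GET request with browser-like headers."""
    query = urlencode(params)
    return Request(BASE_URL + path + '?' + query, headers=HEADERS)


req = build_request('hotel', {'star': 5})
print(req.full_url)
# urllib stores header names capitalized as 'User-agent'
print(req.get_header('User-agent'))
```

Without the `headers` argument, the request would carry the library's default User-Agent, which is an easy signal for a site to filter on.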