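Choose a well-known travel website (or a travel-related review site) and use the hotel search page for a given location as the collection target. Design two scraping programs: one collects the basic data (hotel name, rating, price, tags, location, etc.) of the first 25 hotels on the search results page (the first page), and the other collects the first 3 pages of detailed reviews (reviewer name, review time, score, review text, stay date) for the first 10 hotels.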
Time: 2023-11-21 22:06:02 Views: 72
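Sure. I chose TripAdvisor as the collection target. Below are the two scraping programs I designed:
1. Collect the basic data of the first 25 hotels on the search results page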
```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.tripadvisor.com/Hotels-g60763-New_York_City_New_York-Hotels.html'
driver = webdriver.Chrome()
driver.get(url)
# Wait for the page to finish loading
time.sleep(5)
# NOTE: the CSS selectors are kept from the original answer and have not been
# verified against the site's current markup; adjust them before running.
# Selenium 4 removed find_element(s)_by_css_selector, so By.CSS_SELECTOR is used.
hotels = driver.find_elements(By.CSS_SELECTOR, '.srp-list .sr_property_block')
# Collect name, rating, price, tags, and location for the first 25 hotels
for hotel in hotels[:25]:
    name = hotel.find_element(By.CSS_SELECTOR, '.sr_item_main_block .sr-hotel__title span').text
    # "span[1]" is XPath syntax and invalid in CSS; ":first-of-type" is the CSS equivalent
    rating = hotel.find_element(By.CSS_SELECTOR, '.sr_item_main_block .sr-hotel__rating span:first-of-type').get_attribute('aria-label')
    price = hotel.find_element(By.CSS_SELECTOR, '.sr_item_main_block .sr_price_estimate .price').text
    tags = [tag.text for tag in hotel.find_elements(By.CSS_SELECTOR, '.sr_item_main_block .sr_card_secondary_content .sr_card_tag_line span')]
    location = hotel.find_element(By.CSS_SELECTOR, '.sr_item_main_block .sr_card_secondary_content .sr_card_address .sr_card_address_line a').text
    print('Hotel name:', name)
    print('Rating:', rating)
    print('Price:', price)
    print('Tags:', tags)
    print('Location:', location)
driver.quit()
```
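The program above only prints each field. As a minimal sketch of how the collected rows could be kept for later use, the example below writes them to CSV with the standard library. The `HotelRecord` class, its field names, the `write_hotels_csv` helper, and the sample row are all hypothetical and are not part of the original answer.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class HotelRecord:
    """One row of basic hotel data (hypothetical field names)."""
    name: str
    rating: str
    price: str
    tags: str
    location: str


def write_hotels_csv(records, stream):
    """Write HotelRecord rows to an open text stream as CSV with a header row."""
    writer = csv.DictWriter(stream, fieldnames=[f.name for f in fields(HotelRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


def demo_csv():
    """Return the CSV text for one placeholder row (not real scraped data)."""
    rows = [HotelRecord("Example Hotel", "4.5 of 5 bubbles", "$199",
                        "Free Wifi;Pool", "New York City")]
    buffer = io.StringIO()
    write_hotels_csv(rows, buffer)
    return buffer.getvalue()
```

Inside the scraping loop, each hotel's values could be appended to a list as a `HotelRecord` (joining the tag list with `;`), and the list passed to `write_hotels_csv` with a file opened using `newline=''`.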
2. Collect the first 3 pages of detailed reviews for the first 10 hotels
```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.tripadvisor.com/Hotels-g60763-New_York_City_New_York-Hotels.html'
driver = webdriver.Chrome()
driver.get(url)
# Wait for the page to finish loading
time.sleep(5)
# NOTE: the CSS selectors are kept from the original answer and have not been
# verified against the site's current markup; adjust them before running.
# Collect the links of the first 10 hotels
hotel_links = driver.find_elements(By.CSS_SELECTOR, '.srp-list .sr_property_block .sr_item_main_block .sr-hotel__title a')[:10]
hotel_links = [link.get_attribute('href') for link in hotel_links]
for link in hotel_links:
    driver.get(link)
    time.sleep(5)
    # Collect the first 3 pages of reviews
    for i in range(3):
        # From the second page on, click the next-page button first
        if i > 0:
            next_buttons = driver.find_elements(By.CSS_SELECTOR, '.location-review-review-list-parts-Pagination__container--2M6le .pagination .next a')
            # Stop when the button is missing or disabled (no further page)
            if not next_buttons or 'disabled' in (next_buttons[0].get_attribute('class') or ''):
                break
            next_buttons[0].click()
            # Give the next page of reviews time to load
            time.sleep(5)
        # Collect the reviews on the current page
        reviews = driver.find_elements(By.CSS_SELECTOR, '.location-review-review-list-parts-SingleReview__reviewContainer--dZpEy')
        for review in reviews:
            name = review.find_element(By.CSS_SELECTOR, '.location-review-review-list-parts-Profile__avatar--1Y6zT .ui_avatar a').get_attribute('aria-label')
            # As in the original, the review date and stay date come from the same container
            date = review.find_element(By.CSS_SELECTOR, '.location-review-review-list-parts-EventDate__event_date--1epHa').text
            # The rating is the trailing part of the class name, e.g. '50'
            rating = review.find_element(By.CSS_SELECTOR, '.location-review-review-list-parts-RatingLine__bubbles--GcJvM .ui_bubble_rating').get_attribute('class').split('_')[-1]
            comment = review.find_element(By.CSS_SELECTOR, '.location-review-review-list-parts-ExpandableReview__reviewText--gOmRC').text
            stay_date = review.find_element(By.CSS_SELECTOR, '.location-review-review-list-parts-EventDate__event_date--1epHa span:last-child').text
            print('Reviewer:', name)
            print('Date:', date)
            print('Rating:', rating)
            print('Review:', comment)
            print('Stay date:', stay_date)
driver.quit()
```
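The review program takes the rating from the last `_`-separated part of the element's `class` attribute, which yields a string such as `'50'`. The sketch below turns that into a numeric score. It assumes the attribute looks like `ui_bubble_rating bubble_50`, with the trailing number equal to ten times the score; that format is an assumption to check against the live page, and `parse_bubble_rating` is a hypothetical helper name.

```python
import re
from typing import Optional


def parse_bubble_rating(class_attr: Optional[str]) -> Optional[float]:
    """Convert a class attribute such as 'ui_bubble_rating bubble_45' to 4.5.

    Returns None when no 'bubble_<digits>' token is present.
    """
    match = re.search(r'bubble_(\d+)', class_attr or '')
    if match is None:
        return None
    # Assumed encoding: the digits are the score multiplied by ten.
    return int(match.group(1)) / 10
```

Returning `None` instead of raising lets the scraping loop skip a review with an unexpected class name rather than abort the whole run.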
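Both programs are built on Selenium and drive a Chrome browser automatically. The first collects the basic hotel data, and the second collects the detailed reviews. Note that the second program has to click the next-page button to load further pages of reviews.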
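The page-by-page logic of the second program (read the current page, then advance, for at most three pages, stopping early when there is no next page) can be separated from the browser calls. The following is a sketch under that assumption; `collect_pages`, `read_page`, `go_next`, and `FakePager` are hypothetical names, and `FakePager` only stands in for the browser so the loop can be tried without Selenium.

```python
from typing import Callable, List, TypeVar

T = TypeVar('T')


def collect_pages(read_page: Callable[[], List[T]],
                  go_next: Callable[[], bool],
                  max_pages: int = 3) -> List[T]:
    """Collect items from up to max_pages consecutive pages.

    read_page returns the items on the current page. go_next tries to move
    to the next page and returns False when there is none.
    """
    items: List[T] = []
    for page_index in range(max_pages):
        # The first page is already loaded; only later pages need a click.
        if page_index > 0 and not go_next():
            break
        items.extend(read_page())
    return items


class FakePager:
    """In-memory stand-in for a paginated review list."""

    def __init__(self, pages: List[List[str]]) -> None:
        self.pages = pages
        self.index = 0

    def read(self) -> List[str]:
        return list(self.pages[self.index])

    def next(self) -> bool:
        if self.index + 1 >= len(self.pages):
            return False
        self.index += 1
        return True
```

With Selenium, `read_page` would wrap the review extraction and `go_next` would wrap the next-button lookup, click, and wait, returning `False` when the button is missing or disabled.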