Code that uses XPath (in PyCharm) to scrape the first five pages of https://www.shanghairanking.cn/institution, parse five fields per school (school name, region, supervising department, category, Double First-Class status), and save the results to a CSV file
First, install the `requests` and `lxml` libraries (`csv` is part of the Python standard library and needs no installation). You can run the following commands in PyCharm's Terminal:
```
pip install requests
pip install lxml
```
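To confirm the packages are importable before running the script, a quick one-line sanity check from the same Terminal (note that `csv` ships with Python, so it should always import):
```
python -c "import requests, lxml.etree, csv; print('ok')"
```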
Then you can write the code, as follows:
```python
import requests
from lxml import etree
import csv
# List that accumulates one parsed row per school
data_list = []

# Shared request headers: identify as a regular browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Loop over the first five pages of the institution list
for page in range(1, 6):
    url = f'https://www.shanghairanking.cn/institution/index/{page}'
    response = requests.get(url, headers=headers, timeout=10)
    html = etree.HTML(response.text)
    # Collect the detail-page link of every school on the current page
    link_list = html.xpath('//td[@class="align-left"]/a/@href')
    for link in link_list:
        # Fetch the school's detail page
        school_url = f'https://www.shanghairanking.cn{link}'
        school_response = requests.get(school_url, headers=headers, timeout=10)
        school_html = etree.HTML(school_response.text)
        # School name
        name = school_html.xpath('//div[@class="univ-name"]/h1/text()')[0]
        # Region
        region = school_html.xpath('//div[@class="col-lg-2 col-md-2 col-sm-12"]/div/p[1]/text()')[0]
        # Supervising department
        department = school_html.xpath('//div[@class="col-lg-2 col-md-2 col-sm-12"]/div/p[2]/text()')[0]
        # Category
        category = school_html.xpath('//div[@class="col-lg-2 col-md-2 col-sm-12"]/div/p[3]/text()')[0]
        # Double First-Class status
        is_double_first_class = school_html.xpath('//div[@class="col-lg-2 col-md-2 col-sm-12"]/div/p[4]/text()')[0]
        # Store the parsed row
        data_list.append([name, region, department, category, is_double_first_class])

# Write all rows to a CSV file; the header stays in Chinese to match the site
# (school name, region, supervising department, category, Double First-Class)
with open('schools.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['校名', '地区', '管理部门', '类别', '双一流'])
    writer.writerows(data_list)
```
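If the page markup changes or a request comes back incomplete, an XPath query can return an empty list and the bare `[0]` indexing above will raise an `IndexError`. A more defensive variant wraps each extraction in a small helper; a minimal sketch (the name `first_or_default` is my own, illustrative choice):
```python
def first_or_default(tree, xpath_expr, default=''):
    """Return the first XPath match, stripped, or a default when nothing matches."""
    matches = tree.xpath(xpath_expr)
    return matches[0].strip() if matches else default

# Example: replaces the bare [0] indexing used above
# name = first_or_default(school_html, '//div[@class="univ-name"]/h1/text()')
```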
Once the code finishes running, a file named `schools.csv` is created in the current directory containing the parsed results.
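To spot-check the output, the file can be read back with the same `csv` module:
```python
import csv

with open('schools.csv', newline='', encoding='utf-8') as f:
    for row in csv.reader(f):
        print(row)
```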