Code to scrape the first five pages of https://www.shanghairanking.cn/institution in PyCharm, parse five fields (university name, region, administering department, category, Double First-Class status), and save them to a CSV file
Sure, here is a code implementation:
```python
import csv

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.shanghairanking.cn/institution"

# Request headers: a browser-like User-Agent reduces the chance of being blocked
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# Open (create) the CSV file once, write the header row, then append each page's rows
with open('universities.csv', 'w', newline='', encoding='utf-8-sig') as file:
    writer = csv.writer(file)
    writer.writerow(['校名', '地区', '管理部门', '类别', '双一流'])

    # Scrape the first 5 pages
    for i in range(1, 6):
        # Build the page URL; page 1 has no numeric suffix.
        # NOTE: this pagination pattern is assumed and should be verified against
        # the live site, which may render its list via JavaScript instead.
        if i == 1:
            page_url = BASE_URL + "/ranking_2021_en.html"
        else:
            page_url = BASE_URL + "/ranking_2021_en_" + str(i) + ".html"

        # Send the request and fail fast on HTTP errors
        response = requests.get(page_url, headers=headers)
        response.raise_for_status()

        # Parse the page content (the 'lxml' parser requires the lxml package)
        soup = BeautifulSoup(response.content.decode('utf-8'), 'lxml')

        # Select the table rows, one university per row
        universities = soup.select('#ranking-section > div:nth-child(3) > table > tbody > tr')

        # Extract the five fields from each row and write them to the CSV file
        for university in universities:
            name = university.select_one('td:nth-child(2) > a').text.strip()
            location = university.select_one('td:nth-child(3)').text.strip()
            department = university.select_one('td:nth-child(4)').text.strip()
            category = university.select_one('td:nth-child(5)').text.strip()
            double_first_class = university.select_one('td:nth-child(6)').text.strip()
            writer.writerow([name, location, department, category, double_first_class])
```
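Note that besides requests and beautifulsoup4, the script needs the lxml package for the 'lxml' parser; if it is not available, the standard library's 'html.parser' can be substituted in the BeautifulSoup call.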
This script fetches the first five pages of https://www.shanghairanking.cn/institution, parses the five fields (university name, region, administering department, category, Double First-Class status) from each table row, and saves them to a file named "universities.csv".
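As a quick sanity check, the CSV can be read back with the standard library. This is a minimal sketch, assuming the scrape above completed and universities.csv is in the working directory:

```python
import csv

# Read the file back with the same encoding used when writing it
with open('universities.csv', 'r', encoding='utf-8-sig', newline='') as file:
    reader = csv.reader(file)
    header = next(reader)   # first row is the header written by the scraper
    rows = list(reader)

print(header)               # ['校名', '地区', '管理部门', '类别', '双一流']
print(f"{len(rows)} records scraped")
for row in rows[:3]:        # preview the first few records
    print(row)
```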