song_list = soup.select('#song > li') Optimize this so it scrapes this link: <a target="_1" href="/play/588129.htm" class="songNameA"><font>As Long As You Love Me (Acoustic Version)</font></a>
To extract the song name and link from `<a target="_1" href="/play/588129.htm" class="songNameA"><font>As Long As You Love Me (Acoustic Version)</font></a>`, you can use the following code:
```python
# soup is assumed to be a BeautifulSoup object for the page containing <ul id="song">
song_list = soup.select('#song > li')
for song in song_list:
    song_link = song.select_one('.songNameA')['href']    # href attribute of the <a> tag
    song_name = song.select_one('.songNameA font').text  # text inside the nested <font> tag
    print(song_name, song_link)
```
Here the `select_one` method is used to get the link and the song name: `select_one('.songNameA')['href']` returns the link, and `select_one('.songNameA font').text` returns the song name. Note that `select_one` only returns the first match; since each song has exactly one link and one name, that is sufficient here.
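As a quick, self-contained check, the snippet below parses just the sample `<a>` tag from the question; the surrounding `<ul id="song"><li>` wrapper is an assumption, since only the anchor tag was shown:
```python
from bs4 import BeautifulSoup

# Minimal HTML reproducing the structure the selectors above assume;
# the <ul id="song"><li> wrapper is an assumption, only the <a> tag was given.
html = '''
<ul id="song">
  <li><a target="_1" href="/play/588129.htm" class="songNameA"><font>As Long As You Love Me (Acoustic Version)</font></a></li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')
for song in soup.select('#song > li'):
    link = song.select_one('.songNameA')
    print(link.select_one('font').text, link['href'])
# -> As Long As You Love Me (Acoustic Version) /play/588129.htm
```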
Related questions
Optimize this code: if soup.find_all("div", class_="zg_page list_pagebox"): another_url = soup.select('div.zg_page.list_pagebox > p > a')[1].get("href") wb2_data = requests.get(another_url) wb2_data.encoding = 'gb2312' soup = BeautifulSoup(wb2_data.text, 'lxml') passage1 = soup.select('div.cont.clearfix > div.zgsz_show.fl > div.zgsz_sContent.clearfix > p') passage1.pop(0) for paragraph1 in passage1: data1 = paragraph1.get_text() if len(data1) > 30: f.write(data1 + '\n')
This code can be optimized in a few ways, for example:
1. Use a try-except block to handle exceptions that may occur, so the program does not crash on an error.
2. Wrap the code in a function to make it more modular and reusable.
3. Use a with statement to manage the file object, so the file is always closed properly.
An optimized version of the code is shown below:
```python
import requests
from bs4 import BeautifulSoup

def get_passages(url, file_path):
    try:
        response = requests.get(url)
        response.encoding = 'gb2312'
        soup = BeautifulSoup(response.text, 'lxml')
        # Follow the "next page" link recursively if a pagination box is present
        if soup.find_all("div", class_="zg_page list_pagebox"):
            next_url = soup.select('div.zg_page.list_pagebox > p > a')[1].get("href")
            get_passages(next_url, file_path)
        passages = soup.select('div.cont.clearfix > div.zgsz_show.fl > div.zgsz_sContent.clearfix > p')
        passages.pop(0)  # drop the first paragraph, as in the original code
        with open(file_path, 'a', encoding='utf-8') as f:
            for passage in passages:
                data = passage.get_text().strip()
                if len(data) > 30:  # keep only paragraphs longer than 30 characters
                    f.write(data + '\n')
    except Exception as e:
        print('Error:', e)
```
This function takes two arguments, url and file_path: the URL of the page to scrape and the path of the file where the data is saved. It crawls all pages recursively and writes every paragraph that meets the length condition to the file. If an exception occurs along the way, the function prints the error message.
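A minimal way to call it might look like this; the start URL and output path are placeholders, not taken from the question:
```python
if __name__ == '__main__':
    start_url = 'https://example.com/news/list_1.html'  # placeholder URL
    get_passages(start_url, 'passages.txt')  # appends matching paragraphs to passages.txt
```
Note that because the function follows pagination by calling itself, very long chains of pages are bounded by Python's recursion limit (about 1000 frames by default); rewriting the pagination as an iterative loop would remove that limit.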
Optimize this code: import requests from bs4 import BeautifulSoup import csv # Request URL url = "https://pvp.qq.com/web201605/herodetail/527.shtml" # Request headers headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.57"} # Send the request response = requests.get(url, headers=headers) # Parse the HTML soup = BeautifulSoup(response.content, "html.parser") # Get all hero links hero_links = [] for hero in soup.select(".herolist > li > a"): hero_links.append(hero["href"]) # Scrape each hero's attributes heroes = [] for link in hero_links: response = requests.get(link, headers=headers) soup = BeautifulSoup(response.content, "html.parser") # Get hero attributes name = soup.select(".cover-name")[0].text survive = soup.select(".")[0].text attack = soup.select(".cover-list-bar data-bar2 fl")[0].text skill = soup.select(".skill")[0].text difficulty = soup.select(".difficulty")[0].text # Save hero attributes heroes.append({"name": name, "survive": survive, "attack": attack, "skill": skill, "difficulty": difficulty}) # Write the data to a CSV file with open("heroes.csv", "w", newline="", encoding="utf-8-sig") as csvfile: fieldnames = ["name", "survive", "attack", "skill", "difficulty"] writer = csv.DictWriter(csvfile, fieldnames=fieldnames) # Write the header row writer.writeheader() # Write the data rows for hero in heroes: writer.writerow(hero)
You can use multithreading or asynchronous requests to speed up the scraping, for example with `ThreadPoolExecutor` from the `concurrent.futures` library. The code can also be tightened to avoid unnecessary HTTP requests, for instance by de-duplicating the links in `hero_links` so the same hero is not scraped twice. In addition, the hero-attribute selectors can be wrapped in a function to improve readability and maintainability. The optimized code:
```python
import csv
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin

# Request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.57"
}

# Fetch one hero's attributes
def get_hero_attrs(link):
    response = requests.get(link, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    name = soup.select(".cover-name")[0].text
    # NOTE: the selectors below are carried over from the question and look
    # incomplete; adjust them to the real page structure before relying on them.
    survive = soup.select(".")[0].text
    attack = soup.select(".cover-list-bar data-bar2 fl")[0].text
    skill = soup.select(".skill")[0].text
    difficulty = soup.select(".difficulty")[0].text
    return {"name": name, "survive": survive, "attack": attack,
            "skill": skill, "difficulty": difficulty}

# Page to scrape
url = "https://pvp.qq.com/web201605/herodetail/527.shtml"

# Send the request and parse the HTML
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")

# Collect all hero links, de-duplicated; resolve possibly-relative hrefs against the page URL
hero_links = set(urljoin(url, hero["href"]) for hero in soup.select(".herolist > li > a"))

# Scrape every hero's attributes with a thread pool
heroes = []
with ThreadPoolExecutor(max_workers=8) as executor:
    hero_attrs = executor.map(get_hero_attrs, hero_links)
    for attr in hero_attrs:
        heroes.append(attr)

# Write the data to a CSV file
with open("heroes.csv", "w", newline="", encoding="utf-8-sig") as csvfile:
    fieldnames = ["name", "survive", "attack", "skill", "difficulty"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()  # header row
    for hero in heroes:
        writer.writerow(hero)
```
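One caveat with `executor.map`: if any `get_hero_attrs` call raises (for example because a selector finds nothing), the exception is re-raised when the results are iterated and the whole loop stops. Below is a sketch of an alternative using `submit` and `as_completed`, reusing the same `get_hero_attrs` defined above, so a single failing hero page is skipped instead of aborting the run:
```python
from concurrent.futures import ThreadPoolExecutor, as_completed

heroes = []
with ThreadPoolExecutor(max_workers=8) as executor:
    # Map each future back to its link so failures can be reported clearly
    futures = {executor.submit(get_hero_attrs, link): link for link in hero_links}
    for future in as_completed(futures):
        try:
            heroes.append(future.result())
        except Exception as e:
            # Skip hero pages whose request failed or whose selectors did not match
            print('Failed to scrape', futures[future], ':', e)
```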