```python
import requests
from bs4 import BeautifulSoup

url = 'https://top.baidu.com/board?tab=realtime'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.37'}
res = requests.get(url=url, headers=headers)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
hot_list = soup.find_all(class_='c-single-text-ellipsis')
print(hot_list)
for hot in hot_list:
    print(hot.get_text().strip())
```
Building on this code, use the open() function to write the scraped content to the local file /home/qingjiao/records.txt.
Posted: 2024-03-19 10:40:52 · Views: 20
Sure. You can append the following at the end of the code to write the scraped content to the specified file:
```python
with open('/home/qingjiao/records.txt', 'w', encoding='utf-8') as f:
    for hot in hot_list:
        f.write(hot.get_text().strip() + '\n')
```
This writes the scraped hot-search entries to the "records.txt" file at the given path, one entry per line. Note that you should change the path to a file path on your own machine.
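The write step can be verified by reading the file back. A minimal, self-contained sketch, using plain strings in place of the parsed tags and a temporary directory in place of the real path:

```python
import os
import tempfile

# Stand-in for the stripped text of each parsed tag from hot_list.
titles = ["热搜一", "热搜二", "热搜三"]

path = os.path.join(tempfile.gettempdir(), "records.txt")
with open(path, "w", encoding="utf-8") as f:
    for t in titles:
        f.write(t + "\n")

# Read the file back to confirm one entry per line.
with open(path, encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]
print(lines)
```

The round trip confirms the newline-per-entry layout before you point the path at /home/qingjiao/records.txt.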
Related questions
Use the requests library and BeautifulSoup to scrape the site https://top.baidu.com/board?tab=realtime
Sure. First, install the requests and beautifulsoup4 libraries with the following command:
```
pip install requests beautifulsoup4
```
Then you can scrape the page with the following code:
```python
import requests
from bs4 import BeautifulSoup

url = 'https://top.baidu.com/board?tab=realtime'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
hot_list = soup.find_all('a', class_='title-link')
for hot in hot_list:
    print(hot.get_text().strip())
```
This scrapes the titles from Baidu's realtime hot-search board and prints them. Adapt it as needed, and remember to add exception handling and an anti-scraping strategy (for example, a browser User-Agent header).
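Those two suggestions can be sketched in one small helper. This is a sketch, not a definitive implementation: the minimal User-Agent string and the `c-single-text-ellipsis` selector are taken from the snippet at the top of the thread and may need updating if the page layout changes.

```python
import requests
from bs4 import BeautifulSoup


def fetch_titles(url, timeout=10):
    """Fetch a page and return hot-search titles, or None on any request error."""
    headers = {"User-Agent": "Mozilla/5.0"}  # minimal browser-like header
    try:
        res = requests.get(url, headers=headers, timeout=timeout)
        res.raise_for_status()  # turn HTTP error codes into exceptions
    except requests.exceptions.RequestException:
        return None  # network failure, timeout, or non-2xx status
    res.encoding = "utf-8"
    soup = BeautifulSoup(res.text, "html.parser")
    return [tag.get_text(strip=True)
            for tag in soup.find_all(class_="c-single-text-ellipsis")]
```

Returning None on failure lets the caller decide whether to retry or abort, instead of crashing mid-scrape.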
Optimize this code:
```python
import requests
from bs4 import BeautifulSoup

url = "https://www.gupang.com/202305/64619.html"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
title = soup.find("h1", class_="title").text
content = soup.find("div", class_="content").text
print(title)
print(content)
```
You can optimize it in the following ways:
1. Add exception handling so that a failed request or a failed parse does not crash the program:
```python
import sys

import requests
from bs4 import BeautifulSoup

url = "https://www.gupang.com/202305/64619.html"
try:
    response = requests.get(url)
    response.raise_for_status()  # raise an exception if the request failed
except requests.exceptions.RequestException as e:
    print(e)
    sys.exit(1)

soup = BeautifulSoup(response.content, "html.parser")
try:
    title = soup.find("h1", class_="title").text
    content = soup.find("div", class_="content").text
except AttributeError as e:  # raised when find() returns None
    print(e)
    sys.exit(1)

print(title)
print(content)
```
2. Use a `with` statement to manage resources, so the network connection is released automatically after use:
```python
import sys

import requests
from bs4 import BeautifulSoup

url = "https://www.gupang.com/202305/64619.html"
try:
    with requests.get(url) as response:
        response.raise_for_status()
        soup = BeautifulSoup(response.content, "html.parser")
        title = soup.find("h1", class_="title").text
        content = soup.find("div", class_="content").text
except (requests.exceptions.RequestException, AttributeError) as e:
    print(e)
    sys.exit(1)

print(title)
print(content)
```
This avoids closing the connection or file by hand, and the code is more concise.
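The parsing half of the snippet above is also easier to get right if it is pulled into its own function, because the AttributeError guard can then be exercised without any network request. A sketch, with the `extract_article` name being a hypothetical helper and the class names taken from the code above:

```python
from bs4 import BeautifulSoup


def extract_article(html):
    """Return (title, content) from an article page, or None if either is missing."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1", class_="title")
    content = soup.find("div", class_="content")
    if title is None or content is None:
        return None  # explicit check instead of letting .text raise AttributeError
    return title.get_text(strip=True), content.get_text(strip=True)


# Exercise both paths with inline HTML, no network needed.
sample = '<h1 class="title">Demo title</h1><div class="content">Body text</div>'
print(extract_article(sample))
print(extract_article("<p>no article here</p>"))
```

Checking `find()` for None up front makes the "element not found" case an ordinary return value rather than an exception, which simplifies the caller.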