import requests
import re

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}

def baidu(page):
    num = (page - 1) * 10
    url = 'http://www.baidu.com/s?ie=utf-8&medium=0&rtt=1&bsst=1&rsv_dl=news_b_pn&cl=2&wd=meizu&tn=news&rsv_bp=1&rsv_sug3=16&rsv_sug1=6&rsv_sug7=100&oq=&rsv_btype=t&f=8&inputT=5752&rsv_sug4=6599&x_bfe_rqs=032000000000000000004800000000000000000000000008&x_bfe_tjscore=0.080000&tngroupname=organic_news&newVideo=12&goods_entry_switch=1&pn='+ str(num)
    res = requests.get(url, headers=headers).text
    baidu(i+1)
    print("Page "+str(i+1)+" crawled successfully")
    import re
    p_title = '}">(.?)</a></h3><div class=' # extract titles
    title=re.findall(p_title,res,re.S)
    print(title)
    p_info = 'c-gap-right-xsmall" aria-label=".?">(.?)</span><span ' # extract times
    info=re.findall(p_info,res,re.S)
    print(info)
    p_href= '<a class="source-link_Ft1ov" href="(.?)"' # extract URLs
    href=re.findall(p_href,res,re.S)
    print(href)
    p_souce='<span class="c-color-gray" aria-label=".?">(.?)</span>' # extract sources
    souce=re.findall(p_souce,res,re.S)
    print(souce)
    for i in range(len(title)):
        title[i]=re.sub('<.*?>','',title[i]) # strip <em> and </em>
        print(str(i+1)+"."+title[i]+' ('+info[i]+"-"+souce[i]+')')
        print(" ",href[i])

Time: 2024-02-22 10:01:07

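Your code has a few problems that need fixing:

1. Inside `baidu`, the call `baidu(i+1)` uses `i`, which is not defined at that point, so it raises a `NameError`. An unconditional self-call would also never terminate. The call should be based on the `page` parameter and guarded by a stopping condition (the code below stops after page 5).
2. Quotes (`'` and `"`) are not regex metacharacters, so they need no backslash in the pattern itself. They only need escaping when they would end the Python string literal. The patterns here are single-quoted and contain only double quotes, so no escaping is required.
3. The capture groups should use the non-greedy `.*?` rather than the greedy `.*`, so that each match stops at the first closing delimiter instead of swallowing extra content. Note that `(.?)`, as written in the question, matches at most one character.

Here is the revised code:

```python
import requests
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
}


def baidu(page):
    num = (page - 1) * 10  # pn offset: 10 results per page
    url = 'http://www.baidu.com/s?ie=utf-8&medium=0&rtt=1&bsst=1&rsv_dl=news_b_pn&cl=2&wd=meizu&tn=news&rsv_bp=1&rsv_sug3=16&rsv_sug1=6&rsv_sug7=100&oq=&rsv_btype=t&f=8&inputT=5752&rsv_sug4=6599&x_bfe_rqs=032000000000000000004800000000000000000000000008&x_bfe_tjscore=0.080000&tngroupname=organic_news&newVideo=12&goods_entry_switch=1&pn=' + str(num)
    res = requests.get(url, headers=headers).text
    print("Page " + str(page) + " crawled successfully")

    p_title = '}">(.*?)</a></h3><div class='  # extract titles
    title = re.findall(p_title, res, re.S)
    print(title)
    p_info = 'c-gap-right-xsmall" aria-label="(.*?)">'  # extract times
    info = re.findall(p_info, res, re.S)
    print(info)
    p_href = '<a class="source-link_Ft1ov" href="(.*?)"'  # extract URLs
    href = re.findall(p_href, res, re.S)
    print(href)
    p_source = '<span class="c-color-gray" aria-label="(.*?)">'  # extract sources
    source = re.findall(p_source, res, re.S)
    print(source)

    # zip() stops at the shortest list, so a missing field cannot raise IndexError
    for i, (t, when, src, link) in enumerate(zip(title, info, source, href), start=1):
        t = re.sub('<.*?>', '', t)  # strip <em> and </em>
        print(str(i) + "." + t + ' (' + when + " - " + src + ')')
        print(" ", link)

    if page < 5:  # stop after page 5 so the recursion terminates
        baidu(page + 1)


if __name__ == '__main__':
    baidu(1)
```

Keep in mind that the structure of Baidu's search result pages changes frequently. If the code stops extracting data, the patterns will need to be adjusted to match the current markup. A crawler must also follow the site's crawling rules and avoid sending requests too often, so that it does not put undue load on the site.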
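The parsing step can be checked without touching the network, and the paging can be written as a loop with a pause between requests instead of recursion. The following is a minimal sketch under stated assumptions, not a definitive implementation: `page_offset`, `parse_results`, `crawl`, the `fetch` callable, and `SAMPLE_HTML` are hypothetical names introduced here for illustration, and the sample markup is made up to fit the patterns from the answer rather than copied from a real Baidu page.

```python
import re
import time

# Patterns taken from the answer; they depend on Baidu's markup at the time.
P_TITLE = '}">(.*?)</a></h3><div class='
P_INFO = 'c-gap-right-xsmall" aria-label="(.*?)">'
P_HREF = '<a class="source-link_Ft1ov" href="(.*?)"'
P_SOURCE = '<span class="c-color-gray" aria-label="(.*?)">'


def page_offset(page):
    """Convert a 1-based page number to the `pn` offset (10 results per page)."""
    return (page - 1) * 10


def parse_results(html):
    """Return one dict per search result found in `html`."""
    titles = re.findall(P_TITLE, html, re.S)
    infos = re.findall(P_INFO, html, re.S)
    sources = re.findall(P_SOURCE, html, re.S)
    hrefs = re.findall(P_HREF, html, re.S)
    results = []
    # zip() stops at the shortest list, so mismatched counts cannot raise IndexError.
    for title, info, source, href in zip(titles, infos, sources, hrefs):
        results.append({
            'title': re.sub('<.*?>', '', title),  # strip <em> and </em>
            'info': info,
            'source': source,
            'href': href,
        })
    return results


def crawl(fetch, pages, delay=1.0):
    """Call fetch(page) for pages 1..pages, sleeping `delay` seconds between calls.

    `fetch` is a hypothetical callable that takes a page number and returns HTML text.
    """
    all_results = []
    for page in range(1, pages + 1):
        all_results.extend(parse_results(fetch(page)))
        if page < pages:
            time.sleep(delay)  # be polite: space out consecutive requests
    return all_results


# Hypothetical markup, written only to exercise the patterns above.
SAMPLE_HTML = (
    '<h3><a href="http://example.com/a" data-click="{}">'
    '<em>meizu</em> launches phone</a></h3><div class="c-row">'
    '<a class="source-link_Ft1ov" href="http://example.com/src">'
    '<span class="c-color-gray" aria-label="Example News">Example News</span></a>'
    '<span class="c-color-gray2 c-gap-right-xsmall" aria-label="2 hours ago">'
    '2 hours ago</span></div>'
)

if __name__ == '__main__':
    for n, item in enumerate(crawl(lambda page: SAMPLE_HTML, pages=2, delay=0), start=1):
        print(str(n) + '.' + item['title'] + ' (' + item['info'] + ' - ' + item['source'] + ')')
        print(' ', item['href'])
```

For a real crawl, `fetch` could be a small function that builds the URL with `page_offset(page)` and returns `requests.get(url, headers=headers).text`. Keeping the download separate from the parsing means that when Baidu's markup changes, only the four patterns need updating, and they can be tested against a saved page.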
