
Scraping PDFs from a web page with Python

Posted: 2024-01-14 15:03:11

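You can use Python's requests and BeautifulSoup libraries to download the PDF files linked from a web page. The steps are as follows:

1. First, use requests to fetch the HTML of the target page, for example:

```
import requests

url = 'https://example.com'
response = requests.get(url)
```

2. Next, parse the HTML with BeautifulSoup and collect every PDF link, for example:

```
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
# href may be None for <a> tags without that attribute, so guard before calling endswith
pdf_links = soup.find_all('a', href=lambda href: href and href.lower().endswith('.pdf'))
```

This uses BeautifulSoup's find_all method with a lambda expression to keep only the links whose href ends in ".pdf".

3. Finally, loop over the PDF links and download each file with requests, for example:

```
from urllib.parse import urljoin

for link in pdf_links:
    pdf_url = urljoin(url, link['href'])  # resolve relative links against the page URL
    response = requests.get(pdf_url)
    with open('{}.pdf'.format(link.text.strip()), 'wb') as f:
        f.write(response.content)
```

This fetches each PDF's content with requests.get and writes it to a local file opened in binary mode with a with statement. The file name is taken from the link text, so a link whose text is empty or contains characters such as "/" needs a different name. Note that if the target site has anti-scraping measures, you will need to add appropriate headers and a delay between requests.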
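The closing note about headers and delays, together with relative links and unsafe file names, can be handled with a few small helpers. The following is only a minimal sketch under stated assumptions: the function names `resolve_pdf_url`, `safe_filename` and `download_pdfs`, the `pdfs` output directory, the one-second delay and the 30-second timeout are all illustrative choices, not part of the original answer. The `session` argument is assumed to be something like a `requests.Session`, meaning any object with a `get(url, timeout=...)` method that returns a response exposing `raise_for_status()` and `content`.

```python
import os
import re
import time
from urllib.parse import unquote, urljoin, urlparse


def resolve_pdf_url(page_url, href):
    """Turn a possibly relative href into an absolute URL."""
    return urljoin(page_url, href)


def safe_filename(pdf_url, fallback="download.pdf"):
    """Derive a file name from the URL path, replacing risky characters."""
    name = os.path.basename(unquote(urlparse(pdf_url).path))
    # Replace whitespace and characters that are invalid on common filesystems.
    name = re.sub(r'[\\/:*?"<>|\s]+', "_", name).strip("_")
    return name or fallback


def download_pdfs(session, page_url, hrefs, out_dir="pdfs", delay=1.0):
    """Download each href via `session` (assumed requests.Session-like)."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for href in hrefs:
        pdf_url = resolve_pdf_url(page_url, href)
        resp = session.get(pdf_url, timeout=30)
        resp.raise_for_status()  # stop on HTTP errors instead of saving an error page
        path = os.path.join(out_dir, safe_filename(pdf_url))
        with open(path, "wb") as f:
            f.write(resp.content)
        saved.append(path)
        time.sleep(delay)  # pause between requests to go easy on the server
    return saved
```

With requests, one way to use this would be to create `session = requests.Session()`, set a browser-like header with `session.headers.update({'User-Agent': '...'})`, and then call `download_pdfs(session, url, [a['href'] for a in pdf_links])`. Passing the session in as a parameter keeps the headers in one place and lets the helpers be exercised without network access.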