
Scraping PDFs with Python

Date: 2023-09-30 17:07:10 Views: 104

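You can scrape PDF files with third-party Python libraries such as requests and BeautifulSoup. Here is a simple example:

```
import requests
from bs4 import BeautifulSoup

# Download a PDF file and save it to disk
def download_pdf(url, file_name):
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail on HTTP errors instead of saving an error page
    with open(file_name, "wb") as file:
        file.write(response.content)

# Collect the PDF links found on a page
def get_pdf_links(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    links = []
    for link in soup.find_all("a"):
        href = link.get("href")
        # href is None for <a> tags that have no href attribute
        if href and href.endswith(".pdf"):
            links.append(href)
    return links

# Example usage
pdf_url = "https://example.com/sample.pdf"
pdf_file = "sample.pdf"
download_pdf(pdf_url, pdf_file)
```

Note that this is only a basic example; the actual scraping steps depend on how the target site is structured. The links returned by get_pdf_links may be relative, and you will likely need additional link processing and error handling to cover different cases.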
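As one possible way to do the extra link processing mentioned above, the sketch below resolves relative links against the page URL, matches the `.pdf` extension case-insensitively while ignoring query strings, and removes duplicates. It is an illustration under stated assumptions, not part of the original answer: the names `extract_pdf_links` and `file_name_from_url` are hypothetical, and it uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra installs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
import posixpath


class _AnchorCollector(HTMLParser):
    """Collects the raw href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors without an href
                self.hrefs.append(href)


def extract_pdf_links(html, base_url):
    """Return absolute, de-duplicated PDF URLs in document order."""
    collector = _AnchorCollector()
    collector.feed(html)
    seen, result = set(), []
    for href in collector.hrefs:
        # Resolve relative links such as "/docs/a.pdf" against the page URL
        absolute = urljoin(base_url, href.strip())
        # Check the path only, so "report.PDF?download=1" still matches
        if not urlsplit(absolute).path.lower().endswith(".pdf"):
            continue
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


def file_name_from_url(url, default="download.pdf"):
    """Derive a local file name from the last path segment of the URL."""
    name = posixpath.basename(urlsplit(url).path)
    return name or default
```

Each URL returned by `extract_pdf_links(response.text, url)` could then be passed to `download_pdf` together with `file_name_from_url(...)`, ideally inside a `try`/`except requests.RequestException` block so that one failed download does not stop the rest.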
