Converting Scraped Web Page Content to a PDF File with Python

Updated 2023-03-03

This article shares working Python code that scrapes web page content and converts it to PDF, offered as a reference. The details follow.

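The code converts Liao Xuefeng's tutorials into PDF files. It is written for that site only; to convert tutorials from another site, it will probably need some small modifications.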

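# coding=utf-8
import os
import re
import sys
import time

import pdfkit
import requests
from bs4 import BeautifulSoup
from PyPDF2 import PdfFileMerger

# The original listing called reload(sys) and sys.setdefaultencoding('utf8'),
# a Python 2-only workaround. Python 3 strings are Unicode by default, so the
# two calls are dropped here.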

html_template = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
</head>
<body>
{content}
</body>
</html>
"""

#----------------------------------------------------------------------
def parse_url_to_html(url, name):
    """
    Parse a URL and save the article body as an HTML file.

    :param url: the URL to parse
    :param name: file name for the saved HTML
    :return: the name of the saved HTML file
    """
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        # Article body
        body = soup.find_all(class_="x-wiki-content")[0]
        # Title
        title = soup.find('h4').get_text()
        # Insert the title at the top of the body, centered
        center_tag = soup.new_tag("center")
        title_tag = soup.new_tag('h1')
        title_tag.string = title
        center_tag.insert(1, title_tag)
        body.insert(1, center_tag)
        html = str(body)

        # Rewrite relative src paths of img tags in the body to absolute URLs
        pattern = "(<img .*?src=\")(.*?)(\")"

        def func(m):
            # group(2) is the src value; group(3) is only the closing quote,
            # so the "http" check has to look at group(2)
            if not m.group(2).startswith("http"):
                rtn = m.group(1) + "http://www.liaoxuefeng.com" + m.group(2) + m.group(3)
                return rtn
            else:
                return m.group(1) + m.group(2) + m.group(3)

        html = re.compile(pattern).sub(func, html)
        html = html_template.format(content=html)
        html = html.encode("utf-8")
        with open(name, 'wb') as f:
            f.write(html)
        return name
    except Exception:
        # The listing ends before the original handler is shown;
        # re-raising is a placeholder assumption, not the author's code.
        raise
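
The image-path step in the listing can be tried on its own, without fetching any page. The following is a minimal sketch, not the author's code: it swaps the string concatenation for the standard library's `urljoin`, which leaves absolute URLs untouched and resolves relative ones against a base. The names `IMG_SRC` and `absolutize_img_src` and the sample HTML are hypothetical, introduced only for illustration.

```python
import re
from urllib.parse import urljoin

# Captures an img tag's src attribute as (prefix)(src value)(closing quote).
# Like the pattern in the listing, it assumes double-quoted attributes.
IMG_SRC = re.compile(r'(<img\b[^>]*?\ssrc=")([^"]*)(")')


def absolutize_img_src(html, base_url):
    """Return html with every relative img src resolved against base_url."""
    def repl(m):
        # urljoin keeps absolute URLs as they are and resolves relative ones
        return m.group(1) + urljoin(base_url, m.group(2)) + m.group(3)
    return IMG_SRC.sub(repl, html)


if __name__ == "__main__":
    # Hypothetical sample input: one relative and one absolute image path
    sample = ('<p><img alt="a" src="/files/a.png">'
              '<img src="http://example.com/b.png"></p>')
    print(absolutize_img_src(sample, "http://www.liaoxuefeng.com"))
```

Passing the base URL as a parameter keeps the site-specific part in one place, which is the kind of small change needed when pointing the script at a different site.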
