The Python BeautifulSoup Library: Introduction and Practical Walkthrough

Updated 2024-08-31 · PDF, 72 KB

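BeautifulSoup is a Python library for parsing HTML and XML documents and extracting data from them. It exposes a parsed document as a navigable tree of tags, comparable in spirit to the DOM (Document Object Model), but with a lighter API that is easy to pick up, which makes it a good fit for beginners and for rapid development. This article covers BeautifulSoup's basic concepts, its main features, and a simple usage example.

I. About BeautifulSoup

BeautifulSoup was created by Leonard Richardson. It parses markup such as HTML so that page content can be queried without hand-written DOM traversal. Its main advantage is that it simplifies web data extraction: you can locate and pull out the information you need without writing complex regular expressions.

BeautifulSoup can work with the following parsers:

1. Python standard library (html.parser): ships with Python and runs at moderate speed, but it handles malformed documents poorly in older versions (before Python 2.7.3 or 3.2.2).
2. lxml HTML parser: built on a C library; faster and more tolerant of malformed documents, but it must be installed separately.
3. lxml XML parser: designed for XML and the only XML parser BeautifulSoup supports; it also depends on the separately installed C library.
4. html5lib: the most tolerant parser; it parses documents the way a browser does and produces valid HTML5, but it is slow. It is pure Python, so it needs no C extension, although the html5lib package itself must be installed.

II. Getting Started

To use BeautifulSoup, import it from the `bs4` package and build a BeautifulSoup object from an HTML document. For example, given a string `html_doc` that holds some HTML:

```python
from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie">Elsie</a>,
...
"""

soup = BeautifulSoup(html_doc, 'html.parser')  # use Python's built-in HTML parser
```

Here `soup` is the BeautifulSoup object, and its methods are how you parse and manipulate the HTML elements. To get the contents of the `<title>` tag, use `soup.title.string`; to get all `<p>` tags, use `soup.find_all('p')`. Because `find_all()` returns a list of tags, link targets are read from each tag in turn, for example `[a['href'] for a in soup.find_all('a')]`. Node operations include selecting specific nodes, traversing nodes, and modifying node attributes; to delete the `<p class="title">` tag, write `soup.find('p', {'class': 'title'}).decompose()`. Elements can also be selected with CSS selectors through `soup.select('css selector')`, and the text of the first `<p class="story">` tag is available as `soup.find('p', class_='story').text`.

BeautifulSoup is powerful and easy to use. It makes it straightforward to process HTML documents and extract the data you need, which makes it a valuable tool for Python crawlers and web scraping projects. Learning it well can noticeably improve both the speed and the quality of your data processing.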
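The operations named in the summary above (reading `href` values, selecting by CSS selector, deleting a node) can be sketched together as follows. This is a minimal illustration, not the article's own code: the markup is an assumed, shortened version of the "three sisters" sample, and `html.parser` is chosen only so the sketch runs without extra packages.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, modeled on the "three sisters" document.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all() returns a list-like ResultSet, so 'href' is read per tag.
hrefs = [a["href"] for a in soup.find_all("a")]

# CSS selectors go through select() (all matches) and select_one() (first match).
sister_ids = [a["id"] for a in soup.select("p.story > a.sister")]
first_link = soup.select_one("a#link1")

# decompose() removes a tag and everything inside it from the tree.
soup.find("p", class_="title").decompose()
remaining_paragraphs = len(soup.find_all("p"))

print(hrefs)
print(sister_ids)
print(first_link.get_text())
print(remaining_paragraphs)
```

Indexing the result of `find_all()` with a string such as `['href']` does not work, because the result is a collection of tags rather than a single tag; iterating over it, as above, is the usual pattern.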

The Python Crawler Library BeautifulSoup: Introduction and Simple Usage Examples

BeautifulSoup is a Python library for extracting data from HTML and XML files. This article introduces the library and walks through simple examples, including parsing HTML, getting content, manipulating nodes, and selecting elements with CSS selectors.

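I. Introduction

BeautifulSoup is a flexible and convenient library for parsing web pages. It is efficient and supports several parsers, and it lets you extract information from a page without writing regular expressions.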

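Commonly used parsers

| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires a C library to be installed |
| lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires a C library to be installed |
| html5lib | BeautifulSoup(markup, "html5lib") | Best tolerance of malformed documents; parses the way a browser does; produces valid HTML5 | Slow; pure Python, so no C extension is needed, but the html5lib package must be installed |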
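Since only `html.parser` is guaranteed to be present, one way to apply the comparison above is to prefer a faster parser and fall back when it is missing. The sketch below assumes a hypothetical helper named `make_soup`; it relies on the `FeatureNotFound` exception that BeautifulSoup raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed.

    Returns the soup together with the name of the parser that was used.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("none of the requested parsers is available")


soup, used = make_soup("<p>Hello, <b>world</b>")
print(used)
print(soup.b.string)
```

Different parsers can build different trees from the same malformed markup, so code that must behave identically everywhere may prefer to name one parser explicitly instead of falling back.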

II. Quick Start

Given an HTML document, create a BeautifulSoup object:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# 'lxml' requires the lxml package to be installed
soup = BeautifulSoup(html_doc, 'lxml')

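Print the whole document

print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>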

Navigate the structured data

print(soup.title)              # the <title> tag and its contents
print(soup.title.name)         # the name of the <title> tag
print(soup.title.string)       # the string inside <title>
print(soup.title.parent.name)  # the name of <title>'s parent tag (head)
print(soup.p)                  # the first <p> tag
print(soup.p['class'])         # the class of the first <p> tag
print(soup.a)                  # the first <a> tag
print(soup.find_all('a'))      # all <a> tags
print(soup.find(id="link3"))   # the first tag with id='link3'

<title>The Dormouse's story</title>
title
The Dormouse's story
head
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
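The lookups above assume that every tag and attribute exists. As a hedged sketch of what happens when one does not, the example below uses an assumed variant of the sample markup in which the second link has no `href`, and parses it with `html.parser` so that it runs without extra packages.

```python
from bs4 import BeautifulSoup

# Assumed variant of the sample markup: the second link has no href.
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.a
print(first.attrs)         # all attributes of the tag, as a dict
print(first["class"])      # class is multi-valued, so the value is a list

second = soup.find(id="link2")
print(second.get("href"))  # .get() returns None when the attribute is absent

missing = soup.find(id="link9")
print(missing)             # find() returns None when nothing matches
```

Indexing with `second["href"]` would raise `KeyError` here, which is why `.get()` is the safer choice when an attribute may be missing.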

Find the links in all <a> tags

for link in soup.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
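A small variation on the loop above collects the links into a dictionary keyed by link text. This is only a sketch: the markup is an assumed variant in which the third link has no `href`, included to show how the `href=True` filter skips such tags.

```python
from bs4 import BeautifulSoup

# Assumed variant of the sample markup: the third link has no href.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute.
links = {
    a.get_text(strip=True): a["href"]
    for a in soup.find_all("a", href=True)
}
print(links)
```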

Get all the text content

print(soup.get_text())

The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.

...
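The raw `get_text()` output keeps the line breaks and blank lines of the source markup. When a compact string is wanted, the `separator` and `strip` arguments help, as this sketch on an assumed, shortened version of the sample shows.

```python
from bs4 import BeautifulSoup

# Assumed, shortened version of the sample markup.
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie">Elsie</a>,
<a href="http://example.com/lacie">Lacie</a> and
<a href="http://example.com/tillie">Tillie</a>.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# strip=True trims each text fragment and drops whitespace-only ones;
# separator is placed between the fragments that remain.
compact = soup.get_text(separator=" ", strip=True)
print(compact)

# stripped_strings yields the same trimmed fragments one at a time.
fragments = list(soup.stripped_strings)
print(fragments)
```

Because the separator is inserted between every pair of fragments, punctuation that sits in its own text node ends up with a space in front of it; post-process the string if that matters.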

Automatically complete tags and format the output

html = """
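A minimal sketch of tag completion, using an assumed fragment rather than the article's own markup: the `<div>`, `<p>`, and `<b>` tags below are never closed, and `html.parser` is used so the sketch needs no extra packages.

```python
from bs4 import BeautifulSoup

# Assumed fragment: none of the three tags is ever closed.
broken_html = "<div><p>Unclosed paragraph<b>bold text"

soup = BeautifulSoup(broken_html, "html.parser")

# The parsed tree always has complete tags, so the serialized output
# contains the closing tags that the input left out.
fixed = soup.prettify()
print(fixed)
```

With `html.parser` the fragment stays a fragment; the `lxml` parser would additionally wrap it in `<html>` and `<body>` tags.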
