Python Web Scraping, Second Edition: A Practical Tutorial

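"Python Web Scraping, Second Edition" is a practical guide co-written by Katharine Jarmul and Richard Lawson for readers who want to master data extraction and web crawling. It explains in accessible terms how to collect web data with Python, with particular emphasis on combining PyQt, Selenium, HTML, and Python. This is the second edition of Python Web Scraping; the copyright belongs to Packt Publishing, and it was published in October 2017. The book covers the following core topics:

1. **Python fundamentals**: As a foundation, the authors introduce Python's basic syntax, data types, control structures, and functions, so that readers have a solid grasp of the Python environment.
2. **Web scraping principles**: Readers learn how to read and parse HTML, understand site structure, identify the data elements to extract, and handle varying page layouts and dynamically loaded content.
3. **PyQt**: PyQt is a graphical user interface library for Python that helps readers build simple interfaces for data visualization and for presenting results. The book describes in detail how to use PyQt for scraping and data manipulation.
4. **Selenium**: The book also covers Selenium in depth. Selenium is a tool for automating browser behavior and is especially important for crawling dynamic pages. Readers learn how to drive a browser with Selenium, simulate real user interaction, and retrieve dynamically loaded content.
5. **Data processing and analysis**: Scraped data usually needs cleaning, organizing, and analysis. The book covers preprocessing with Python data libraries (such as Pandas and NumPy) and how to analyze the results.
6. **Hands-on projects and case studies**: To reinforce the theory, the book includes several practical projects in areas such as news aggregation, e-commerce data mining, and social media monitoring.
7. **Best practices and caveats**: The authors discuss important considerations around network security, regulatory compliance, and ethics, so that readers follow sound principles when scraping data.
8. **Copyright and legal issues**: The book reminds readers to respect copyright law and stresses that its content may not be copied or distributed without prior permission from the publisher.

"Python Web Scraping, Second Edition" is a practical and comprehensive resource for readers who want to work on data extraction in the IT industry or who are interested in web data analysis. By reading and practicing its content, readers can learn both the Python techniques and how to collect web data legally and efficiently in real work.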

Preface

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this
book: what you liked or may have disliked. Reader feedback is important for us to develop
titles that you really get the most out of.

To send us general feedback, simply send an e-mail to feedback@packtpub.com, and
mention the book title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or
contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you
to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at
http://www.packtpub.com. If you purchased this book elsewhere, you can visit
http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's
webpage at the Packt Publishing website. This page can be accessed by entering the book's
name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the
latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at
https://github.com/PacktPublishing/Python-Web-Scraping-Second-Edition. We also have other code bundles from
our rich catalog of books and videos available at https://github.com/PacktPublishing/.
Check them out!
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do
happen. If you find a mistake in one of our books—maybe a mistake in the text or the
code—we would be grateful if you could report this to us. By doing so, you can save other
readers from frustration and help us improve subsequent versions of this book. If you find
any errata, please report them by visiting http://www.packtpub.com/submit-errata,
selecting your book, clicking on the Errata Submission Form link, and entering the details of
your errata. Once your errata are verified, your submission will be accepted and the errata
will be uploaded to our website or added to any list of existing errata under the Errata
section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support
and enter the name of the book in the search field. The required information will
appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At
Packt, we take the protection of our copyright and licenses very seriously. If you come
across any illegal copies of our works in any form on the Internet, please provide us with
the location address or website name immediately so that we can pursue a remedy.
Please contact us at copyright@packtpub.com with a link to the suspected pirated
material.

Introduction to Web Scraping

In an ideal world, web scraping wouldn't be necessary and each website would provide an
API to share data in a structured format. Indeed, some websites do provide APIs, but they
typically restrict the data that is available and how frequently it can be accessed.
Additionally, a website developer might change, remove, or restrict the backend API. In
short, we cannot rely on APIs to access the online data we may want. Therefore, we need to
learn about web scraping techniques.

Is web scraping legal?

Web scraping, and what is legally permissible when web scraping, are still being
established despite numerous rulings over the past two decades. If the scraped data is being
used for personal and private use, and within fair use of copyright laws, there is usually no
problem. However, if the data is going to be republished, if the scraping is aggressive
enough to take down the site, or if the content is copyrighted and the scraper violates the
terms of service, then there are several legal precedents to note.

In Feist Publications, Inc. v. Rural Telephone Service Co., the United States Supreme Court
decided that scraping and republishing facts, such as telephone listings, is allowed. A similar
case in Australia, Telstra Corporation Limited v. Phone Directories Company Pty Ltd,
demonstrated that only data with an identifiable author can be copyrighted. Another
scraped content case in the United States, evaluating the reuse of Associated Press stories
for an aggregated news product, was ruled a violation of copyright in Associated Press v.
Meltwater. A European Union case in Denmark, ofir.dk vs home.dk, concluded that regular
crawling and deep linking are permissible.

There have also been several cases in which companies have accused a scraper of
aggressive scraping and attempted to stop the scraping via a legal order. The most recent
case, QVC v. Resultly, ruled that, unless the scraping resulted in private property damage, it
could not be considered intentional harm, despite the crawler activity leading to some site
stability issues.

These cases suggest that, when the scraped data constitutes public facts (such as business
locations and telephone listings), it can be republished following fair use rules. However, if
the data is original (such as opinions and reviews or private user data), it most likely cannot
be republished for copyright reasons. In any case, when you are scraping data from a
website, remember you are their guest and need to behave politely; otherwise, they may
ban your IP address or proceed with legal action. This means you should make download
requests at a reasonable rate and define a user agent to identify your crawler. You should
also take measures to review the Terms of Service of the site and ensure the data you are
taking is not considered private or copyrighted.
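
The two politeness measures named above, a reasonable request rate and an identifying user agent, can be sketched with the Python standard library alone. This is a minimal illustration and not code from the book: the names Throttle, build_request, and download, the default user agent string, and the 10-second timeout are all assumptions made for the example.

```python
import time
import urllib.request
from urllib.parse import urlparse


class Throttle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay          # minimum seconds between hits on one domain
        self._clock = clock         # injectable so the class can be tested
        self._sleep = sleep
        self._last_seen = {}        # domain -> time of the previous request

    def wait(self, url):
        """Sleep if this domain was requested too recently; return seconds slept."""
        domain = urlparse(url).netloc
        last = self._last_seen.get(domain)
        waited = 0.0
        if last is not None:
            remaining = self.delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_seen[domain] = self._clock()
        return waited


def build_request(url, user_agent):
    """Create a request that identifies the crawler through its User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, throttle, user_agent="example-crawler (contact@example.com)"):
    """Fetch a page politely: throttle first, then send an identified request."""
    throttle.wait(url)
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()
```

A crawler would create one Throttle, for example Throttle(delay=2.0), and pass it to every download call so that the delay is tracked per domain. A user agent that includes contact details lets a site owner reach the crawler's operator instead of simply banning its IP address.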
