Python Web Scraping in Practice, Second Edition: Data Extraction from Static Pages to JavaScript Websites

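"Python Web Scraping 2nd Edition" is a professional guide that explores in depth how to scrape web data with Python 3.x. The book aims to help readers extract valuable information from the vast amount of data on the Internet. This information is usually embedded in a website's structure and styling and has to be extracted with specific techniques.

In the early chapters, readers learn how to obtain data from static web pages, including how to use caching (in databases and files, for example) to save time and reduce server load. Once the basics are in place, readers practice building more sophisticated crawlers, working with browsers, crawlers, and concurrent scrapers. The authors, Katharine Jarmul and Richard Lawson, show how to handle JavaScript-dependent websites using tools such as PyQt and Selenium. The book also covers submitting forms on complex websites, including those protected by CAPTCHA, and introduces Python libraries such as mechanize for automating these operations. It further teaches how to build class-based scrapers with the Scrapy library and apply them to real websites.

In the second half of the book, readers learn how to test websites with scrapers, along with remote scraping, best practices, working with images, and other related topics. By the end, readers will have a thorough understanding of how to scrape web data efficiently and responsibly while complying with website rules and the law.

The second edition was published in May 2017 by Packt Publishing. Although the authors and the publisher have made every effort to ensure the accuracy of the information in the book, they accept no liability for any direct or indirect damages arising from it. Packt Publishing has tried to identify all company and product trademarks mentioned in the book through the appropriate use of capitals, but cannot guarantee the accuracy of this information. The book is a valuable resource for anyone who wants to learn web scraping with Python: both beginners and experienced developers will find practical knowledge and techniques for improving their data mining and analysis skills.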

Preface

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this
book: what you liked or may have disliked. Reader feedback is important for us to develop
titles that you really get the most out of.

To send us general feedback, simply send an e-mail to feedback@packtpub.com, and
mention the book title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or
contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you
to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at
http://www.packtpub.com. If you purchased this book elsewhere, you can visit
http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's
webpage at the Packt Publishing website. This page can be accessed by entering the book's
name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the
latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at
https://github.com/PacktPublishing/Python-Web-Scraping-Second-Edition. We also have other
code bundles from our rich catalog of books and videos available at
https://github.com/PacktPublishing/. Check them out!
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do
happen. If you find a mistake in one of our books—maybe a mistake in the text or the
code—we would be grateful if you could report this to us. By doing so, you can save other
readers from frustration and help us improve subsequent versions of this book. If you find
any errata, please report them by visiting http://www.packtpub.com/submit-errata,
selecting your book, clicking on the Errata Submission Form link, and entering the details of
your errata. Once your errata are verified, your submission will be accepted and the errata
will be uploaded to our website or added to any list of existing errata under the Errata
section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support
and enter the name of the book in the search field. The required information will
appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At
Packt, we take the protection of our copyright and licenses very seriously. If you come
across any illegal copies of our works in any form on the Internet, please provide us with
the location address or website name immediately so that we can pursue a remedy.
Please contact us at copyright@packtpub.com with a link to the suspected pirated
material.

Introduction to Web Scraping

In an ideal world, web scraping wouldn't be necessary and each website would provide an
API to share data in a structured format. Indeed, some websites do provide APIs, but they
typically restrict the data that is available and how frequently it can be accessed.
Additionally, a website developer might change, remove, or restrict the backend API. In
short, we cannot rely on APIs to access the online data we may want. Therefore we need to
learn about web scraping techniques.

Is web scraping legal?

Web scraping, and what is legally permissible when web scraping, are still being
established despite numerous rulings over the past two decades. If the scraped data is being
used for personal and private use, and within fair use of copyright laws, there is usually no
problem. However, if the data is going to be republished, if the scraping is aggressive
enough to take down the site, or if the content is copyrighted and the scraper violates the

In Feist Publications, Inc. v. Rural Telephone Service Co., the United States Supreme Court
decided that scraping and republishing facts, such as telephone listings, is allowed. A similar
case in Australia, Telstra Corporation Limited v. Phone Directories Company Pty Ltd,
demonstrated that only data with an identifiable author can be copyrighted. Another
scraped content case in the United States, evaluating the reuse of Associated Press stories
for an aggregated news product, was ruled a violation of copyright in Associated Press v.
Meltwater. A European Union case in Denmark, ofir.dk vs home.dk, concluded that regular
crawling and deep linking are permissible.

There have also been several cases in which companies have accused the defendant of
aggressive scraping and attempted to stop the scraping via a legal order. The most recent
case, QVC v. Resultly, ruled that, unless the scraping resulted in private property damage, it
could not be considered intentional harm, despite the crawler activity leading to some site
stability issues.

These cases suggest that, when the scraped data constitutes public facts (such as business
locations and telephone listings), it can be republished following fair use rules. However, if
the data is original (such as opinions and reviews or private user data), it most likely cannot
be republished for copyright reasons. In any case, when you are scraping data from a
website, remember you are their guest and need to behave politely; otherwise, they may
ban your IP address or proceed with legal action. This means you should make download
requests at a reasonable rate and define a user agent to identify your crawler. You should
also take measures to review the Terms of Service of the site and ensure the data you are
taking is not considered private or copyrighted.
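
The two technical points above, a reasonable request rate and an identifying user agent, can be sketched in a few lines of Python. This is only a minimal illustration built on the standard library, not the book's own implementation: the names Throttle, build_request, and download, the default user agent string, and the delay value are all assumptions made for this example.

```python
import time
import urllib.request
from urllib.parse import urlparse

# Hypothetical identifier; a real crawler should name itself and give a contact.
DEFAULT_USER_AGENT = "example-crawler (contact: admin@example.com)"


class Throttle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay      # minimum number of seconds between two hits
        self.clock = clock      # injectable so the logic can be tested
        self.sleep = sleep
        self.last_seen = {}     # domain -> time of the most recent request

    def wait(self, url):
        """Sleep if this domain was contacted less than `delay` seconds ago."""
        domain = urlparse(url).netloc
        last = self.last_seen.get(domain)
        if last is not None:
            remaining = self.delay - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)
        self.last_seen[domain] = self.clock()


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Create a request that identifies the crawler through its User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, throttle, user_agent=DEFAULT_USER_AGENT):
    """Fetch a URL politely: wait for the throttle, then send the request."""
    throttle.wait(url)
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()
```

A crawler would create one Throttle, for instance `Throttle(delay=5)`, and pass it to every `download` call so that pages on the same domain are never requested back to back. What counts as a reasonable delay depends on the site, so the value here is a placeholder.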
