Python Web Scraping, Second Edition (Packt, 2017)

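Python Web Scraping, Second Edition (file name Packt.Python.Web.Scraping.2nd.Edition.2017.5) is a book on web scraping with Python by Katharine Jarmul and Richard Lawson, updated in May 2017. It explains how to obtain data from the web, covers both the concepts and the practical techniques of scraping with Python, and is aimed at developers and researchers interested in data collection. The main topics include, but are not limited to:

1. **Python web scraping basics**: how Python interacts with web interfaces, how to send HTTP requests with the requests library, how to parse HTML documents with libraries such as BeautifulSoup or Scrapy, and how to handle cookies and sessions.
2. **Page structure analysis**: how to analyze the structure of a web page, the role of XPath and CSS selectors in locating page elements, and how to design a more effective scraping strategy for pages that load content dynamically.
3. **Data extraction and parsing**: how to extract the required data from HTML, including tables, images, and links; this may also involve other data formats such as JSON and XML.
4. **Anti-scraping measures and countermeasures**: common anti-scraping mechanisms such as CAPTCHAs, IP restrictions, and User-Agent checks, and ways of working around them such as proxy IPs and request delays.
5. **Performance and efficiency**: how to write efficient crawler code, including concurrency, queue systems, and database storage, to meet the needs of large-scale scraping.
6. **Legal and ethical issues**: the need to comply with copyright law and each website's terms of service when scraping, to respect the rights of the data source, and to consider the boundaries of lawful scraping and the ethics involved.
7. **Case studies and hands-on projects**: several practical example projects that let readers consolidate what they have learned, with scenarios such as news scraping, product price comparison, and social media data collection.
8. **Updated tools and techniques**: as a second edition, the book reflects the technology of 2017 and may cover the then-latest scraping library updates, API usage, and emerging scraping frameworks.

The book aims to help readers acquire Python web scraping skills, whether they are beginners or more advanced. It also reminds readers to respect laws and regulations and professional ethics while collecting data. The content is protected by copyright and may not be copied or distributed without permission.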

Preface

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this
book: what you liked or may have disliked. Reader feedback is important for us to develop
titles that you really get the most out of.

To send us general feedback, simply send an e-mail to feedback@packtpub.com, and
mention the book title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or
contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you
to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at
http://www.packtpub.com. If you purchased this book elsewhere, you can visit
http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer over the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's
webpage at the Packt Publishing website. This page can be accessed by entering the book's
name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the
latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at
https://github.com/PacktPublishing/Python-Web-Scraping-Second-Edition. We also have other
code bundles from our rich catalog of books and videos available at
https://github.com/PacktPublishing/. Check them out!

Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do
happen. If you find a mistake in one of our books—maybe a mistake in the text or the
code—we would be grateful if you could report this to us. By doing so, you can save other
readers from frustration and help us improve subsequent versions of this book. If you find
any errata, please report them by visiting http://www.packtpub.com/submit-errata,
selecting your book, clicking on the Errata Submission Form link, and entering the details of
your errata. Once your errata are verified, your submission will be accepted and the errata
will be uploaded to our website or added to any list of existing errata under the Errata
section of that title.
To view the previously submitted errata, go to
https://www.packtpub.com/books/content/support and enter the name of the book in the
search field. The required information will appear under the Errata section.

Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At
Packt, we take the protection of our copyright and licenses very seriously. If you come
across any illegal copies of our works in any form on the Internet, please provide us with
the location address or website name immediately so that we can pursue a remedy.
Please contact us at copyright@packtpub.com with a link to the suspected pirated
material.

Introduction to Web Scraping

In an ideal world, web scraping wouldn't be necessary and each website would provide an
API to share data in a structured format. Indeed, some websites do provide APIs, but they
typically restrict the data that is available and how frequently it can be accessed.
Additionally, a website developer might change, remove, or restrict the backend API. In
short, we cannot rely on APIs to access the online data we may want. Therefore we need to
learn about web scraping techniques.

Is web scraping legal?

The legality of web scraping, and what is permissible when scraping, is still being
established despite numerous rulings over the past two decades. If the scraped data is
used for personal and private purposes, and within fair use of copyright laws, there is
usually no problem. However, if the data is going to be republished, if the scraping is
aggressive enough to take down the site, or if the content is copyrighted and the scraper
violates the …

In Feist Publications, Inc. v. Rural Telephone Service Co., the United States Supreme Court
decided that scraping and republishing facts, such as telephone listings, is allowed. A
similar case in Australia, Telstra Corporation Limited v. Phone Directories Company Pty Ltd,
demonstrated that only data with an identifiable author can be copyrighted. Another
scraped-content case in the United States, which evaluated the reuse of Associated Press
stories for an aggregated news product, was ruled a violation of copyright in Associated
Press v. Meltwater. A European Union case in Denmark, ofir.dk vs home.dk, concluded that
regular crawling and deep linking are permissible.

There have also been several cases in which companies have accused the defendant of
aggressive scraping and attempted to stop the scraping via a legal order. The most recent
case, QVC v. Resultly, ruled that, unless the scraping resulted in private property damage,
it could not be considered intentional harm, despite the crawler activity leading to some
site stability issues.

These cases suggest that, when the scraped data constitutes public facts (such as business
locations and telephone listings), it can be republished following fair use rules. However,
if the data is original (such as opinions and reviews or private user data), it most likely
cannot be republished for copyright reasons. In any case, when you are scraping data from a
website, remember you are their guest and need to behave politely; otherwise, they may
ban your IP address or proceed with legal action. This means you should make download
requests at a reasonable rate and define a user agent to identify your crawler. You should
also take measures to review the Terms of Service of the site and ensure the data you are
taking is not considered private or copyrighted.
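
The two technical recommendations above, a reasonable request rate and an identifying user agent, can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not code taken from the book: the names Throttle, build_request, and download, the USER_AGENT string, and the delay value are all hypothetical choices made for this example.

```python
import time
import urllib.request
from urllib.parse import urlparse

# Hypothetical identifier; a real crawler should name itself and give a contact.
USER_AGENT = "example-crawler/0.1 (+http://example.com/contact)"


class Throttle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay          # minimum seconds between hits on one domain
        self.clock = clock          # injectable so the logic can be tested
        self.sleep = sleep
        self.last_accessed = {}     # domain -> time of the most recent request

    def wait(self, url):
        domain = urlparse(url).netloc
        last = self.last_accessed.get(domain)
        if last is not None:
            remaining = self.delay - (self.clock() - last)
            if remaining > 0:
                # This domain was requested too recently, so pause first.
                self.sleep(remaining)
        self.last_accessed[domain] = self.clock()


def build_request(url, user_agent=USER_AGENT):
    """Create a request that identifies the crawler through its User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, throttle, user_agent=USER_AGENT):
    """Fetch a URL politely: wait if needed, then send an identified request."""
    throttle.wait(url)
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        return response.read()
```

A crawler would create one Throttle, for example `Throttle(delay=2.0)`, and pass it to every `download` call, so that requests to the same domain are spaced out while requests to different domains are not held back. What counts as a reasonable delay depends on the site and is an assumption the crawler's author has to make.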
