Python Web Scraping in Practice, Second Edition

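"Python Web Scraping Second Edition is an introductory book on web scraping with Python, co-authored by Katharine Jarmul and Richard Lawson. It aims to teach readers how to scrape data from the internet, covering the basic principles of crawlers and how to use frameworks such as Scrapy to crawl efficiently."

In Python programming, web scraping is a technique for automatically extracting large amounts of information from web pages. The second edition of this book was published in 2017 and gives beginners a comprehensive learning path. It first introduces the basic concepts of crawlers, including an understanding of the HTTP protocol, analysis of web page structure (HTML, CSS, JavaScript), and basic methods for fetching pages. Readers then study the relevant Python libraries in depth, such as BeautifulSoup and Requests, which are key tools for building web crawlers. BeautifulSoup parses HTML and XML documents, while Requests sends HTTP requests; together they make it easy to fetch and process web page content.

The book also places particular emphasis on the Scrapy framework. Scrapy is a powerful Python framework for web scraping that provides many advanced features, such as data storage, middleware, and crawler management, which simplify the implementation of complex crawler projects. Learning Scrapy lets readers build large-scale crawler projects more efficiently and handle anti-scraping measures, for example by setting a user agent, handling cookies, and simulating logins.

The book also touches on the ethical and legal issues of web scraping, reminding readers to respect the rules in a website's robots.txt file when scraping data, to avoid infringing copyright and privacy, and to comply with local laws and regulations.

In the practical part, readers learn how to clean, store, and analyze data. This includes cleaning unstructured data with regular expressions, exporting data to CSV or JSON files, and possibly database operations with SQLite or MySQL. The book may also explain how to use data analysis libraries such as Pandas for initial processing and analysis of the scraped data.

"Python Web Scraping Second Edition" is a detailed tutorial suited to Python beginners who want to get into web scraping. Through this book, readers can master the basic techniques of web scraping and learn how to use them to scrape and analyze data efficiently.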

Preface

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this
book: what you liked or may have disliked. Reader feedback is important for us to develop
titles that you really get the most out of.

To send us general feedback, simply send an e-mail to feedback@packtpub.com, and
mention the book title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or
contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you
get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at
http://www.packtpub.com. If you purchased this book elsewhere, you can visit
http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's
webpage at the Packt Publishing website. This page can be accessed by entering the book's
name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the
latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at
https://github.com/PacktPublishing/Python-Web-Scraping-Second-Edition. We also have other
code bundles from our rich catalog of books and videos available at
https://github.com/PacktPublishing/. Check them out!
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do
happen. If you find a mistake in one of our books—maybe a mistake in the text or the
code—we would be grateful if you could report this to us. By doing so, you can save other
readers from frustration and help us improve subsequent versions of this book. If you find
any errata, please report them by visiting http://www.packtpub.com/submit-errata,
selecting your book, clicking on the Errata Submission Form link, and entering the details of
your errata. Once your errata are verified, your submission will be accepted and the errata
will be uploaded to our website or added to any list of existing errata under the Errata
section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support
and enter the name of the book in the search field. The required information will
appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At
Packt, we take the protection of our copyright and licenses very seriously. If you come
across any illegal copies of our works in any form on the Internet, please provide us with
the location address or website name immediately so that we can pursue a remedy.
Please contact us at copyright@packtpub.com with a link to the suspected pirated
material.

Introduction to Web Scraping

In an ideal world, web scraping wouldn't be necessary and each website would provide an
API to share data in a structured format. Indeed, some websites do provide APIs, but they
typically restrict the data that is available and how frequently it can be accessed.
Additionally, a website developer might change, remove, or restrict the backend API. In
short, we cannot rely on APIs to access the online data we may want. Therefore, we need to
learn about web scraping techniques.

Is web scraping legal?

What is legally permissible when web scraping is still being established, despite numerous
rulings over the past two decades. If the scraped data is being used for personal and
private use, and within fair use of copyright laws, there is usually no problem. However, if
the data is going to be republished, if the scraping is aggressive enough to take down the
site, or if the content is copyrighted and the scraper violates the terms of service, then
there are several legal precedents to note.

In Feist Publications, Inc. v. Rural Telephone Service Co., the United States Supreme Court
decided that scraping and republishing facts, such as telephone listings, is allowed. A
similar case in Australia, Telstra Corporation Limited v. Phone Directories Company Pty Ltd,
demonstrated that only data with an identifiable author can be copyrighted. Another
scraped content case in the United States, evaluating the reuse of Associated Press stories
for an aggregated news product, was ruled a violation of copyright in Associated Press v.
Meltwater. A European Union case in Denmark, ofir.dk vs home.dk, concluded that regular
crawling and deep linking is permissible.

There have also been several cases in which companies have charged the defendant with
aggressive scraping and attempted to stop the scraping via a legal order. The most recent
case, QVC v. Resultly, ruled that, unless the scraping resulted in private property damage, it
could not be considered intentional harm, despite the crawler activity leading to some site
stability issues.

These cases suggest that, when the scraped data constitutes public facts (such as business
locations and telephone listings), it can be republished following fair use rules. However, if
the data is original (such as opinions and reviews or private user data), it most likely cannot
be republished for copyright reasons. In any case, when you are scraping data from a
website, remember you are their guest and need to behave politely; otherwise, they may
ban your IP address or proceed with legal action. This means you should make download
requests at a reasonable rate and define a user agent to identify your crawler. You should
also take measures to review the Terms of Service of the site and ensure the data you are
taking is not considered private or copyrighted.
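
The advice to request pages at a reasonable rate and to identify your crawler with a user agent can be sketched roughly as follows. This is a minimal illustration that uses only the Python standard library and is not code taken from the book; the names Throttle, build_request, and download, the default user agent string example-crawler, and the delay value are all assumptions chosen for this sketch.

```python
import time
import urllib.request
from urllib.parse import urlparse


class Throttle:
    """Pause between consecutive requests to the same domain (hypothetical helper)."""

    def __init__(self, delay):
        self.delay = delay      # minimum number of seconds between requests per domain
        self.last_seen = {}     # domain -> monotonic time of the previous request

    def wait(self, url):
        """Sleep if this domain was requested too recently; return the seconds slept."""
        domain = urlparse(url).netloc
        last = self.last_seen.get(domain)
        slept = 0.0
        if self.delay > 0 and last is not None:
            remaining = self.delay - (time.monotonic() - last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self.last_seen[domain] = time.monotonic()
        return slept


def build_request(url, user_agent="example-crawler"):
    """Create a request whose User-Agent header identifies the crawler."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, throttle, user_agent="example-crawler"):
    """Fetch a URL after waiting for the per-domain delay (performs network I/O)."""
    throttle.wait(url)
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        return response.read()
```

A crawler would create one Throttle, for example Throttle(delay=5), and pass it to every download call, so that requests to the same domain are spaced out while requests to other domains are not delayed. What counts as a reasonable delay depends on the site, so the value here is only a placeholder.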
