Python Web Scraping in Practice: 2018 Data Science Best Practices and a requests/beautifulsoup Tutorial

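"Practical Web Scraping for Data Science (2018) is a practical guide written for data science enthusiasts and professionals by Seppe vanden Broucke and Bart Baesens. Published in 2018, the book targets the Python programming language and focuses on efficient, compliant web scraping with the requests and Beautiful Soup libraries. Beyond the fundamentals, it offers best practices and real-world cases that help readers understand and master the various aspects of web data collection. The main topics include, but are not limited to:

1. Python scraping basics: introduces Python as a key tool in data science and the central role of requests and Beautiful Soup in scraper development. It starts with installation and configuration and guides readers step by step toward writing simple page-scraping scripts.
2. The HTTP protocol and network requests: explains in detail how HTTP works, how to construct and parse requests, and how to handle responses, all of which is essential for understanding the core logic of a scraper.
3. Parsing HTML and XML: uses Beautiful Soup to parse HTML documents and extract the required data elements, such as links, text, and tables, and covers common page structures and error cases.
4. Data cleaning and preprocessing: shows how to clean scraped data by removing noise and handling missing values and outliers, so that it can serve as a basis for later analysis.
5. Anti-scraping measures and how to respond: discusses website defenses such as robots.txt rules, User-Agent checks, and IP restrictions, and offers strategies for working around or adapting to them.
6. Scraper architecture: explains how to design and implement scalable, stable, and maintainable scraping systems, including the use of scraping frameworks (such as Scrapy) and distributed scraping.
7. Privacy and legal issues: stresses the legal rules that must be followed when web scraping, such as copyright and data protection law, and the importance of respecting a site's robots.txt.
8. Hands-on case studies: the book contains several real projects covering news aggregation, product price comparison, social media data mining, and other areas, helping readers apply the theory to concrete scenarios.
9. Continued learning: provides resources and tips for further study, along with new technologies and challenges readers may encounter.

Practical Web Scraping for Data Science (2018) is a comprehensive and practical guide. Whether you are a newcomer to scraping or an intermediate developer looking to improve existing skills, you can benefit from it. By reading the book, readers can both improve their programming skills and learn how to use web data to add value to data science projects."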

1.1.1 Why Web Scraping for Data Science?

When surfing the web using a normal web browser, you’ve probably encountered multiple sites where you considered the possibility of gathering, storing, and analyzing the data presented on the site’s pages. Especially for data scientists, whose “raw material” is data, the web exposes a lot of interesting opportunities:

• There might be an interesting table on a Wikipedia page (or pages) you want to retrieve to perform some statistical analysis.

• Perhaps you want to get a list of reviews from a movie site to perform text mining, create a recommendation engine, or build a predictive model to spot fake reviews.

• You might wish to get a listing of properties on a real-estate site to build an appealing geo-visualization.

• You’d like to gather additional features to enrich your data set based on information found on the web, say, weather information to forecast, for example, soft drink sales.

• You might be wondering about doing social network analytics using profile data found on a web forum.

• It might be interesting to monitor a news site for trending new stories on a particular topic of interest.

The web contains lots of interesting data sources that provide a treasure trove for all sorts of interesting things. Sadly, the current unstructured nature of the web does not always make it easy to gather or export this data. Web browsers are very good at showing images, displaying animations, and laying out websites in a way that is visually appealing to humans, but they do not expose a simple way to export their data, at least not in most cases. Instead of viewing the web page by page through your web browser’s window, wouldn’t it be nice to be able to automatically gather a rich data set? This is exactly where web scraping enters the picture.
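
To make the idea of gathering a data set automatically a little more concrete, here is a minimal sketch of pulling the rows out of an HTML table. It is not the book's code: the book works with the requests and Beautiful Soup libraries, whereas this sketch swaps in Python's standard-library `html.parser` so that it runs without extra installs or network access. The `SAMPLE_HTML` string, its contents, and the `TableExtractor` and `extract_table` names are all hypothetical and only serve the illustration.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for a downloaded page. A real scraper would first
# fetch the HTML over HTTP; the values below are made up for illustration.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Title</th><th>Score</th></tr>
  <tr><td>Film A</td><td>7.5</td></tr>
  <tr><td>Film B</td><td>6.0</td></tr>
</table>
</body></html>
"""


class TableExtractor(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_table(html):
    """Return the table cells found in html as a list of rows."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = extract_table(SAMPLE_HTML)
header, records = rows[0], rows[1:]
print(header)
print(records)
```

The result is a plain list of rows that could be handed to a statistics or data-frame library. Real pages are usually messier than this sample (nested tables, merged cells, markup errors), which is one reason a dedicated parsing library tends to be the more comfortable choice.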

If you know your way around the web a bit, you’ll probably be wondering: “Isn’t this exactly what Application Programming Interfaces (APIs) are for?” Indeed, many websites nowadays offer such an API, which gives the outside world a means to access their data repository in a structured way, meant to be consumed and accessed

CHAPTER 1 INTRODUCTION

• Scraping is being applied a lot in HR and employee analytics. The San Francisco-based hiQ startup specializes in selling employee analyses by collecting and examining public profile information, for instance, from LinkedIn (who was not happy about this but was so far unable to prevent this practice following a court case; see https://www.bloomberg.com/news/features/2017-11-15/the-brutal-fight-to-mine-your-data-and-sell-it-to-your-boss).

• Digital marketeers and digital artists often use data from the web for all sorts of interesting and creative projects. “We Feel Fine” by Jonathan Harris and Sep Kamvar, for instance, scraped various blog sites for phrases starting with “I feel,” the results of which were then used to visualize how the world was feeling throughout the day.

• In another study, messages from Twitter, blogs, and other social media were scraped to construct a data set that was used to build a predictive model for identifying patterns of depression and suicidal thoughts. This might be an invaluable tool for aid providers, though of course it warrants a thorough consideration of privacy-related issues as well (see https://www.sas.com/en_ca/insights/articles/analytics/using-big-data-to-predict-suicide-risk-canada.html).

• Emmanuel Sales also scraped Twitter, though here with the goal of making sense of his own social circle and timeline of posts (see https://emsal.me/blog/4). An interesting observation here is that the author first considered using Twitter’s API, but found that “Twitter heavily rate limits doing this: if you want to get a user’s follow list, then you can only do so 15 times every 15 minutes, which is pretty unwieldy to work with.”

• In a paper titled “The Billion Prices Project: Using Online Prices for Measurement and Research” (see http://www.nber.org/papers/w22111), web scraping was used to collect a data set of online price information that was used to construct a robust daily price index for multiple countries.

• Banks and other financial institutions are using web scraping for competitor analysis. For example, banks frequently scrape competitors’ sites to get an idea of where branches are being opened or closed, or to track loan rates offered, all of which is interesting information that can be incorporated in their internal models and forecasting. Investment firms also often use web scraping, for instance, to keep track of news articles regarding assets in their portfolio.

• Sociopolitical scientists are scraping social websites to track population sentiment and political orientation. A famous article called “Dissecting Trump’s Most Rabid Online Following” (see https://fivethirtyeight.com/features/dissecting-trumps-most-rabid-online-following/) analyzes user discussions on Reddit using semantic analysis to characterize the online followers and fans of Donald Trump.

• One researcher was able to train a deep learning model based on scraped images from Tinder and Instagram together with their “likes” to predict whether an image would be deemed “attractive” (see http://karpathy.github.io/2015/10/25/selfie/). Smartphone makers are already incorporating such models in their photo apps to help you brush up your pictures.

• In “The Girl with the Brick Earring,” Lucas Woltmann sets out to scrape Lego brick information from https://www.bricklink.com to determine the best selection of Lego pieces (see http://lucaswoltmann.de/art'n'images/2017/04/08/the-girl-with-the-brick-earring.html) to represent an image (one of the co-authors of this book is an avid Lego fan, so we had to include this example).

• In “Analyzing 1000+ Greek Wines With Python,” Florents Tselai scrapes information about a thousand wine varieties from a Greek wine shop (see https://tselai.com/greek-wines-analysis.html) to analyze their origin, rating, type, and strength (one of the co-authors of this book is an avid wine fan, so we had to include this example).
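
The rate limit quoted in the Twitter example above (15 follow-list requests every 15 minutes) can be turned into a rough back-of-the-envelope calculation of why the author found the API unwieldy. The sketch below is only an estimate under stated assumptions: the limit is treated as fixed, non-overlapping windows, each request is assumed to take no time, and one request is assumed to be enough per user (no pagination). The function name and its parameters are hypothetical.

```python
import math


def minutes_until_last_request(n_requests, per_window=15, window_minutes=15):
    """Minimum waiting time before the final request can be sent.

    Assumes fixed windows that each allow `per_window` requests, and that
    the requests themselves are instantaneous.
    """
    if n_requests <= 0:
        return 0
    windows_needed = math.ceil(n_requests / per_window)
    # The first window starts immediately, so only the remaining ones cost time.
    return (windows_needed - 1) * window_minutes


for n_users in (15, 100, 1000):
    minutes = minutes_until_last_request(n_users)
    print(f"{n_users} follow lists: at least {minutes} min ({minutes / 60:.1f} h)")
```

Under these assumptions, the follow lists of 1,000 users need at least 990 minutes, or about 16.5 hours, of waiting, which helps explain why scraping looked more attractive in that case.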
