Practical Web Scraping for Data Science: Best Practices and Examples with Python

PDF, 4.84 MB, updated 2024-07-18

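"This is the unwatermarked original English PDF of Practical Web Scraping for Data Science: Best Practices and Examples with Python, co-authored by Seppe vanden Broucke and Bart Baesens. The resource is shared online; if it infringes any rights, contact the uploader or CDN to have it removed. More information about the book is available on Amazon's US site."

Practical Web Scraping for Data Science covers practical techniques and best practices for web scraping in data science, using Python throughout. Web scraping is an important part of data science: it lets researchers and analysts collect large amounts of unstructured data from the internet as raw material for analysis and mining. The book may cover the following key topics:

1. **Python basics**: Although readers are assumed to have some Python experience, the authors may review the fundamentals, including syntax, data types, control flow, and functions, all of which the later scraping work depends on.
2. **Web fundamentals**: Understanding HTTP and the structure of web pages (HTML, CSS, JavaScript) is the basis for scraping web content. The book may introduce these concepts and how they apply to scraping.
3. **Scraping libraries**: Python offers many libraries for web scraping, such as BeautifulSoup and Scrapy. The book explains how to use such libraries to parse and extract web page data.
4. **Data storage**: Scraped data usually has to be stored for later analysis. The book may cover storage options such as CSV, JSON, and databases (for example SQLite or MySQL).
5. **Handling anti-scraping measures**: Websites may deploy measures against scraping, such as CAPTCHAs and IP blocking. The authors may discuss how to deal with these, including using proxies and simulating browser behavior.
6. **Data cleaning and preprocessing**: Scraped data often needs cleaning and preprocessing before it is usable. This part may cover regular expressions, string handling, and missing-value handling.
7. **Legal and ethical aspects of scraping**: The book may stress complying with a site's robots.txt file, respecting copyright, and scraping in a lawful and ethical way.
8. **Hands-on projects**: Through case studies, readers can learn how to apply the theory to real projects, for example scraping product prices from e-commerce sites or user behavior data from social media platforms.
9. **Crawler architecture**: Large-scale scraping tasks need a sound crawler architecture. The authors may discuss advanced topics such as distributed crawlers, multithreading, and asynchronous requests.
10. **Automation and continuous integration**: How to automate scraping scripts and integrate them into a continuous integration system so that the data is refreshed regularly.

The book aims to give data scientists, and other readers interested in data acquisition, a comprehensive set of tools and strategies for collecting data from the web effectively as input for data analysis and machine learning projects. After working through it, readers should be able to build their own web scraping solutions and make use of the vast amount of information available online.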

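The overview above mentions complying with a site's robots.txt file. As a minimal sketch of what such a check can look like, the following uses Python's standard-library `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, and the `example.com` URLs are hypothetical and inlined so that the example runs without network access; this is not taken from the book.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether a (hypothetical) user agent may fetch two different URLs.
allowed = parser.can_fetch("example-bot", "https://example.com/index.html")
blocked = parser.can_fetch("example-bot", "https://example.com/private/data.html")

# Crawl-delay is a non-standard directive that some sites use; it is None if absent.
delay = parser.crawl_delay("example-bot")

print(allowed, blocked, delay)
```

In a real scraper one would typically call `set_url()` and `read()` on the parser to load the live robots.txt, and wait at least `delay` seconds between requests.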

1.1.1 Why Web Scraping for Data Science?

When surfing the web using a normal web browser, you’ve probably encountered multiple sites where you considered the possibility of gathering, storing, and analyzing the data presented on the site’s pages. Especially for data scientists, whose “raw material” is data, the web exposes a lot of interesting opportunities:

• There might be an interesting table on a Wikipedia page (or pages) you want to retrieve to perform some statistical analysis.

• Perhaps you want to get a list of reviews from a movie site to perform text mining, create a recommendation engine, or build a predictive model to spot fake reviews.

• You might wish to get a listing of properties on a real-estate site to build an appealing geo-visualization.

• You’d like to gather additional features to enrich your data set based on information found on the web, say, weather information to forecast, for example, soft drink sales.

• You might be wondering about doing social network analytics using profile data found on a web forum.

• It might be interesting to monitor a news site for trending new stories on a particular topic of interest.
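To make the first opportunity above concrete, here is a minimal sketch of pulling the rows out of an HTML table and computing a simple statistic. It uses only the standard-library `html.parser` so that it stays self-contained; a library such as BeautifulSoup would be the more usual choice. The sample HTML, the `TableExtractor` class, and the `extract_rows` helper are assumptions made for illustration, and the sketch does not handle nested tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects the cell text of every <tr> in the parsed HTML as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample data standing in for a table found on a web page.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Population (millions)</th></tr>
  <tr><td>Atlantis</td><td>1.5</td></tr>
  <tr><td>El Dorado</td><td>2.25</td></tr>
</table>
"""


def extract_rows(html):
    """Return all table rows found in the given HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = extract_rows(SAMPLE_HTML)
header, records = rows[0], rows[1:]

# A simple statistic over the second column.
mean_population = sum(float(record[1]) for record in records) / len(records)
print(header, records, mean_population)
```

In practice the HTML string would come from an HTTP response rather than a literal, but the extraction step stays the same.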

The web contains lots of interesting data sources that provide a treasure trove for all sorts of interesting things. Sadly, the current unstructured nature of the web does not always make it easy to gather or export this data. Web browsers are very good at showing images, displaying animations, and laying out websites in a way that is visually appealing to humans, but they do not expose a simple way to export their data, at least not in most cases. Instead of viewing the web page by page through your web browser’s window, wouldn’t it be nice to be able to automatically gather a rich data set? This is exactly where web scraping enters the picture.

If you know your way around the web a bit, you’ll probably be wondering: “Isn’t this exactly what Application Programming Interfaces (APIs) are for?” Indeed, many websites nowadays provide such an API that provides a means for the outside world to access their data repository in a structured way, meant to be consumed and accessed

CHAPTER 1 INTRODUCTION

• Scraping is being applied a lot in HR and employee analytics. The San Francisco-based hiQ startup specializes in selling employee analyses by collecting and examining public profile information, for instance, from LinkedIn (who was not happy about this but was so far unable to prevent this practice following a court case; see https://www.bloomberg.com/news/features/2017-11-15/the-brutal-fight-to-mine-your-data-and-sell-it-to-your-boss).

• Digital marketeers and digital artists often use data from the web for all sorts of interesting and creative projects. “We Feel Fine” by Jonathan Harris and Sep Kamvar, for instance, scraped various blog sites for phrases starting with “I feel,” the results of which could then visualize how the world was feeling throughout the day.

• In another study, messages from Twitter, blogs, and other social media were scraped to construct a data set that was used to build a predictive model toward identifying patterns of depression and suicidal thoughts. This might be an invaluable tool for aid providers, though of course it warrants a thorough consideration of privacy-related issues as well (see https://www.sas.com/en_ca/insights/articles/analytics/using-big-data-to-predict-suicide-risk-canada.html).

• Emmanuel Sales also scraped Twitter, though here with the goal to make sense of his own social circle and time line of posts (see https://emsal.me/blog/4). An interesting observation here is that the author first considered using Twitter’s API, but found that “Twitter heavily rate limits doing this: if you want to get a user’s follow list, then you can only do so 15 times every 15 minutes, which is pretty unwieldy to work with.”

• In a paper titled “The Billion Prices Project: Using Online Prices for Measurement and Research” (see http://www.nber.org/papers/w22111), web scraping was used to collect a data set of online price information that was used to construct a robust daily price index for multiple countries.


• Banks and other financial institutions are using web scraping for competitor analysis. For example, banks frequently scrape competitors’ sites to get an idea of where branches are being opened or closed, or to track loan rates offered, all of which is interesting information that can be incorporated in their internal models and forecasting. Investment firms also often use web scraping, for instance, to keep track of news articles regarding assets in their portfolio.

• Sociopolitical scientists are scraping social websites to track population sentiment and political orientation. A famous article called “Dissecting Trump’s Most Rabid Online Following” (see https://fivethirtyeight.com/features/dissecting-trumps-most-rabid-online-following/) analyzes user discussions on Reddit using semantic analysis to characterize the online followers and fans of Donald Trump.

• One researcher was able to train a deep learning model based on scraped images from Tinder and Instagram together with their “likes” to predict whether an image would be deemed “attractive” (see http://karpathy.github.io/2015/10/25/selfie/). Smartphone makers are already incorporating such models in their photo apps to help you brush up your pictures.

• In “The Girl with the Brick Earring,” Lucas Woltmann sets out to scrape Lego brick information from https://www.bricklink.com to determine the best selection of Lego pieces (see http://lucaswoltmann.de/art'n'images/2017/04/08/the-girl-with-the-brick-earring.html) to represent an image (one of the co-authors of this book is an avid Lego fan, so we had to include this example).

• In “Analyzing 1000+ Greek Wines With Python,” Florents Tselai scrapes information about a thousand wine varieties from a Greek wine shop (see https://tselai.com/greek-wines-analysis.html) to analyze their origin, rating, type, and strength (one of the co-authors of this book is an avid wine fan, so we had to include this example).
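One of the examples in this list quotes an API rate limit of 15 calls every 15 minutes. A client that wants to stay within such a limit can track the timestamps of its recent calls in a sliding window. The following is a minimal sketch under stated assumptions: the `SlidingWindowLimiter` and `FakeClock` classes are hypothetical helpers written for this illustration, not part of any API or of the book, and the fake clock only exists so the demonstration runs instantly.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allows at most max_calls calls within any window of period_seconds."""

    def __init__(self, max_calls, period_seconds, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period_seconds
        self.clock = clock     # injectable time source (seconds as a float)
        self._calls = deque()  # timestamps of calls still inside the window

    def wait_time(self):
        """Return the seconds to wait before the next call (0.0 if allowed now)."""
        now = self.clock()
        # Forget calls that have left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            return 0.0
        # The next slot frees up when the oldest remembered call expires.
        return self.period - (now - self._calls[0])

    def record_call(self):
        """Register that a call is being made right now."""
        self._calls.append(self.clock())


class FakeClock:
    """Manually advanced clock, used here instead of real waiting."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


clock = FakeClock()
limiter = SlidingWindowLimiter(max_calls=15, period_seconds=15 * 60, clock=clock)

# Make 15 calls, one simulated second apart.
for _ in range(15):
    limiter.record_call()
    clock.advance(1.0)

wait_after_burst = limiter.wait_time()   # the limit is reached, so we must wait
clock.advance(wait_after_burst)
wait_after_window = limiter.wait_time()  # the oldest call has expired by now

print(wait_after_burst, wait_after_window)
```

With the default `time.monotonic` clock, a scraper would call `time.sleep(limiter.wait_time())` before each request and `record_call()` right after sending it.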

