Python Web Scraping in Practice, Second Edition: Extracting Data from Static Pages to JavaScript Websites

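"Python Web Scraping, 2nd Edition" is a professional guide that explores in depth how to scrape web data with Python 3.x. The book aims to help readers extract valuable information from the vast amount of data on the internet. That information is usually embedded in a website's structure and styling and has to be extracted with specific techniques.

In the early chapters, readers learn how to retrieve data from static web pages, including how to use caching (for example, in databases and files) to save time and reduce server load. As the fundamentals build up, readers get to practice building more complex scrapers, working with browsers, crawlers, and concurrent scrapers.

Authors Katharine Jarmul and Richard Lawson guide readers through handling JavaScript-dependent websites, using tools such as PyQt and Selenium to scrape data. The book also covers how to fill in forms on complex websites and even how to deal with sites protected by CAPTCHAs, introducing Python libraries such as mechanize to automate these operations. It also teaches how to build class-based scrapers with the Scrapy library and apply them to real websites.

In the latter part of the book, readers learn how to test websites with scrapers, scrape remotely, follow best practices, work with images, and handle other related topics. By working through the book, readers gain a thorough understanding of how to scrape web data efficiently and responsibly while complying with website rules and the law.

The second edition was published in May 2017 by Packt Publishing. Although the authors and publisher have made every effort to ensure the accuracy of the information in the book, they accept no liability for any direct or indirect damages arising from it. Packt Publishing has tried to indicate all company and product trademarks mentioned in the book accurately through the use of capitals, but cannot guarantee the accuracy of this information.

The book is a valuable resource for anyone who wants to learn web scraping with Python. Both beginners and developers with some experience can find practical knowledge and techniques in it to strengthen their data mining and analysis skills.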
Uploaded 2018-03-31
The internet contains a wealth of data. This data is provided both through structured APIs and by content delivered directly through websites. While the data in APIs is highly structured, information found in web pages is often unstructured and requires collection, extraction, and processing to be of value. And collecting data is just the start of the journey, as that data must also be stored, mined, and then exposed to others in a value-added form.

With this book, you will learn many of the core tasks needed in collecting various forms of information from websites. We will cover how to collect it, how to perform several common data operations (including storage in local and remote databases), how to perform common media-based tasks such as converting images and videos to thumbnails, how to clean unstructured data with NLTK, how to examine several data mining and visualization tools, and finally core skills in building a microservices-based scraper and API that can, and will, be run on the cloud.

Through a recipe-based approach, we will learn independent techniques to solve specific tasks involved in not only scraping but also data manipulation and management, data mining, visualization, microservices, containers, and cloud operations. These recipes will build skills in a progressive and holistic manner, not only teaching how to perform the fundamentals of scraping but also taking you from the results of scraping to a service offered to others through the cloud. We will be building an actual web-scraper-as-a-service using common tools in the Python, container, and cloud ecosystems.