Hands-On Tutorial: Processing Big Data with Apache Spark and Python

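Frank Kane's Taming Big Data with Apache Spark and Python is a technical book by Frank Kane that uses real-world examples to help readers analyze large data sets hands-on. It was published by Packt Publishing in 2017 and is protected by copyright: no part of it may be reproduced, stored, or transmitted in any form without the prior written permission of the copyright holder.

The book centers on two data-processing tools, Apache Spark and Python. Apache Spark is an open-source distributed computing framework designed for large-scale data; it keeps working data in memory where possible, which supports fast and near-real-time processing and analysis. Python is an easy-to-learn, feature-rich language that is widely used for data analysis, and libraries such as Pandas, NumPy, and SciPy make data manipulation and analysis efficient.

Through a series of examples, the book explains how to process data with Spark's DataFrame API and Spark SQL, covering key steps such as data cleaning, data transformation, aggregation, and machine learning. Readers learn how to use Spark's parallel computing capabilities and how to write concise, efficient Python code for complex computations. The book also touches on integrating other Python libraries, and on tools such as Databricks Notebook, to streamline the workflow.

Although the author and Packt Publishing have tried to ensure that the information in the book is accurate, the content is not guaranteed to be error-free, and readers may run into problems caused by differences in environment or software version. Trademark information cited in the book may likewise be inaccurate or out of date. Neither point affects what readers can learn about the core uses of Spark and Python in big data processing.

Frank Kane's Taming Big Data with Apache Spark and Python is a practical guide for data analysts, data engineers, and other professionals who want to improve their big data skills. Both beginners and experienced developers can find useful material in it for working with very large data sets more efficiently.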
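The cleaning, transformation, and aggregation steps mentioned in the summary above can be illustrated without a Spark installation. The following is a minimal sketch in plain Python, not Spark code and not taken from the book: it imitates the pattern behind a distributed aggregation, where each partition produces partial results that are then merged. The sample records, the `user_id,movie_id,rating` layout, and every function name are hypothetical and chosen only for illustration. Spark's RDD API exposes a similar pattern through operations such as `map` and `reduceByKey`.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical raw records in "user_id,movie_id,rating" form; two are malformed.
RAW_LINES = [
    "1,50,5",
    "2,50,3",
    "3,172,4",
    "bad line",
    "4,172,",
    "5,50,4",
    "6,133,1",
]


def parse(line):
    """Cleaning step: return (movie_id, rating), or None for a malformed row."""
    parts = line.split(",")
    if len(parts) != 3:
        return None
    try:
        return int(parts[1]), float(parts[2])
    except ValueError:
        return None


def split_into_partitions(items, n):
    """Deal items round-robin into n lists, standing in for data distribution."""
    return [items[i::n] for i in range(n)]


def aggregate_partition(lines):
    """Per-partition work: clean rows, then keep a partial (sum, count) per movie."""
    partial = defaultdict(lambda: (0.0, 0))
    for line in lines:
        record = parse(line)
        if record is None:
            continue  # drop rows that failed cleaning
        movie_id, rating = record
        total, count = partial[movie_id]
        partial[movie_id] = (total + rating, count + 1)
    return dict(partial)


def merge(left, right):
    """Combine two partial results by adding sums and counts key by key."""
    merged = dict(left)
    for movie_id, (total, count) in right.items():
        prev_total, prev_count = merged.get(movie_id, (0.0, 0))
        merged[movie_id] = (prev_total + total, prev_count + count)
    return merged


def average_ratings(lines, num_partitions=3):
    """Average rating per movie, computed partition by partition and then merged."""
    partitions = split_into_partitions(lines, num_partitions)
    partials = [aggregate_partition(p) for p in partitions]
    totals = reduce(merge, partials, {})
    return {movie_id: total / count for movie_id, (total, count) in totals.items()}


if __name__ == "__main__":
    for movie_id, avg in sorted(average_ratings(RAW_LINES).items()):
        print(movie_id, round(avg, 2))
```

Carrying a (sum, count) pair rather than a running average is a deliberate choice in this sketch: sums and counts can be merged in any order, so the result does not depend on how the records were split across partitions.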
Uploaded 2017-07-11
Frank Kane's Taming Big Data with Apache Spark and Python
English | 2017 | ISBN-10: 1787287947 | 296 pages | AZW3/PDF/EPUB (conv) | 6.12 Mb

Key Features
- Understand how Spark can be distributed across computing clusters
- Develop and run Spark jobs efficiently using Python
- A hands-on tutorial by Frank Kane with over 15 real-world examples teaching you Big Data processing with Spark

Book Description
Frank Kane's Taming Big Data with Apache Spark and Python is your companion to learning Apache Spark in a hands-on manner. Frank will start you off by teaching you how to set up Spark on a single system or on a cluster, and you'll soon move on to analyzing large data sets using Spark RDD, and developing and running effective Spark jobs quickly using Python.

Apache Spark has emerged as the next big thing in the Big Data domain – quickly rising from an ascending technology to an established superstar in just a matter of years. Spark allows you to quickly extract actionable insights from large amounts of data, on a real-time basis, making it an essential tool in many modern businesses.

Frank has packed this book with over 15 interactive, fun-filled examples relevant to the real world, and he will empower you to understand the Spark ecosystem and implement production-grade real-time Spark projects with ease.

What you will learn
- Find out how you can identify Big Data problems as Spark problems
- Install and run Apache Spark on your computer or on a cluster
- Analyze large data sets across many CPUs using Spark's Resilient Distributed Datasets
- Implement machine learning on Spark using the MLlib library
- Process continuous streams of data in real time using the Spark streaming module
- Perform complex network analysis using Spark's GraphX library
- Use Amazon's Elastic MapReduce service to run your Spark jobs on a cluster

About the Author
My name is Frank Kane. I spent nine years at Amazon and IMDb, wrangling millions of customer ratings and customer transactions to produce things such