An Introduction to PySpark: Building Data-Intensive Applications

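"Spark for Python Developers is a book published by Packt Publishing in 2015 that introduces Apache Spark to Python developers. It runs to 300 pages and aims to help readers understand Spark and apply it to big data processing."

The book covers topics ranging from Spark fundamentals to practical development:

1. Spark architecture: The book opens with the architecture of data-intensive applications, which comprises an infrastructure layer, a persistence layer, an integration layer, and an analytics layer. Together these layers form an efficient data processing system.
   - Infrastructure layer: the hardware and software resources, such as compute nodes and network devices.
   - Persistence layer: stores and manages data, keeping it reliable and accessible.
   - Integration layer: lets the different components communicate and cooperate.
   - Analytics layer: provides the tools and algorithms used for data processing and analysis.
2. Spark core concepts: This part introduces Spark's core libraries and PySpark, the main interface to Spark for Python developers. The Resilient Distributed Dataset (RDD) is the core data structure in Spark; it provides fault tolerance and distributed computation.
3. Installation and environment configuration: The book walks through setting up a Python development environment for Spark, including setting up Oracle VirtualBox with Ubuntu, installing Anaconda (with Python 2.7), installing Java 8, and installing Spark. It also covers enabling IPython Notebook so that Spark programs can be written and run interactively.
4. Virtualization and cloud deployment: Beyond the local environment, the book shows how to build a virtualized environment with Vagrant and then discusses deploying applications to Amazon Web Services (AWS). It also uses Docker containers to make environment deployment more flexible and convenient.
5. Later chapters: Judging from the abstract, the later chapters likely cover more advanced topics such as data processing, machine learning, and graph computation, as well as developing real projects with PySpark.

The book suits readers who have some grounding in Python programming and want to use Spark for large-scale data processing. By working through it, readers can learn how to build and run big data applications with PySpark and how to deploy and manage them in different environments. The book's notes on conventions, reader feedback, and customer support also help readers take part in learning and discussion.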
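The RDD idea in item 2 (partitioned data, fault tolerance, distributed computation) can be made concrete with a small pure-Python sketch. This is only a toy model and not PySpark's API: the class `ToyRDD` and its method `compute_partition` are hypothetical names introduced here for illustration. It assumes that the essence of an RDD is a set of immutable source partitions plus a recorded chain of transformations (the lineage), so that a lost partition can be rebuilt by replaying that chain instead of restoring a replica.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitioned data plus a recorded lineage.

    Hypothetical class for illustration only; this is not PySpark's API.
    """

    def __init__(self, partitions, lineage=()):
        self._partitions = [list(p) for p in partitions]  # source data, never mutated
        self._lineage = tuple(lineage)  # pending (kind, function) steps

    @classmethod
    def parallelize(cls, data, num_partitions):
        data = list(data)
        # Deal elements round-robin into chunks (a simplification chosen here).
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Lazy transformation: record the step, do no work yet.
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Lazy transformation: record the step, do no work yet.
        return ToyRDD(self._partitions, self._lineage + (("filter", pred),))

    def compute_partition(self, index):
        """(Re)build one partition by replaying the lineage on its source data."""
        rows = self._partitions[index]
        for kind, fn in self._lineage:
            if kind == "map":
                rows = [fn(x) for x in rows]
            else:
                rows = [x for x in rows if fn(x)]
        return rows

    def collect(self):
        # Action: triggers the actual computation, one partition at a time.
        out = []
        for i in range(len(self._partitions)):
            out.extend(self.compute_partition(i))
        return out


numbers = ToyRDD.parallelize(range(1, 11), num_partitions=2)
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = sorted(even_squares.collect())
print(result)  # [4, 16, 36, 64, 100]
```

Because each partition is computed independently from its own source data, calling `compute_partition(1)` again after a simulated failure yields the same rows, which is the intuition behind lineage-based recovery. In PySpark itself, given a `SparkContext` named `sc`, the corresponding pipeline would be written with `sc.parallelize(...)`, `.map(...)`, `.filter(...)`, and `.collect()`.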
Paperback: 146 pages
Publisher: Packt Publishing - ebooks Account (February 4, 2016)
Language: English
ISBN-10: 1784399698
ISBN-13: 978-1784399696

Key Features
- Set up real-time streaming and batch data-intensive infrastructure using Spark and Python
- Deliver insightful visualizations in a web app using Spark (PySpark)
- Inject live data using Spark Streaming with real-time events

Book Description
Looking for a cluster computing system that provides high-level APIs? Apache Spark is your answer: an open source, fast, and general-purpose cluster computing system. Spark's multi-stage in-memory primitives provide performance up to 100 times faster than Hadoop, and it is also well suited to machine learning algorithms.

Are you a Python developer inclined to work with the Spark engine? If so, this book will be your companion as you create a data-intensive app using Spark as a processing engine, Python visualization libraries, and web frameworks such as Flask.

To begin with, you will learn the most effective way to install the Python development environment powered by Spark, Blaze, and Bokeh. You will then find out how to connect with data stores such as MySQL, MongoDB, Cassandra, and Hadoop. You'll expand your skills throughout, getting familiar with the various data sources (GitHub, Twitter, Meetup, and blogs), their data structures, and solutions to effectively tackle complexities. You'll explore datasets using IPython Notebook and will discover how to optimize the data models and pipeline. Finally, you'll get to know how to create training datasets and train the machine learning models.

By the end of the book, you will have created a real-time and insightful trend tracker data-intensive app with Spark.
