Building Scalable Data Science Infrastructure with Docker: A Guide to the Jupyter Notebook Server


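Docker for Data Science, by Joshua Cook, is a professional guide for practitioners in the data science field. Its central theme is building scalable, flexible data infrastructure around the Jupyter Notebook Server, using Docker's container technology to streamline data science workflows. Docker lets developers package and distribute applications together with their dependencies, which keeps behavior consistent across environments such as a local development machine, a server cluster, or the cloud. The author emphasizes how Docker simplifies deploying and managing complex data science toolchains, including Python, R, and Apache Spark, along with related data processing libraries such as NumPy, Pandas, and TensorFlow.

The book introduces the basic Docker concepts, such as writing a Dockerfile and creating and managing images. It also covers deploying multi-container services with Docker Compose and managing larger clusters with Docker Swarm. For the Jupyter Notebook Server, readers learn how to combine it with Docker to create private, secure environments while keeping code under version control and easy to share. Team members can then share and collaborate on analysis projects while keeping data and environments isolated.

The book also discusses using Docker for continuous integration and continuous deployment (CI/CD) of data science projects, so that the whole pipeline from data processing to model training is automated. It further explores deploying Docker containers on cloud platforms such as AWS, Google Cloud, or Azure for cost efficiency and elasticity.

Copyright: all rights are reserved by Joshua Cook, including translation, reproduction, reprinting, broadcasting, electronic adaptation, and use in computer software, by any technology now known or developed in the future. The book may contain trademarked names, logos, and images, whose use should follow the applicable copyright and trademark rules.

Docker for Data Science is a practical tutorial for data scientists, machine learning engineers, and data engineers who want to understand and use Docker to improve their data science projects and work more efficiently in increasingly complex IT environments.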


Introduction

This text is designed to teach the concepts and techniques of Docker and its ecosystem as applied to the field of data science. Besides introducing the core Docker technologies (the container and image, the engine, the Dockerfile), this book contains a discussion on building larger integrated systems using the Jupyter Notebook Server and open source data stores MongoDB, PostgreSQL, and Redis.

The first chapter walks the reader through a recommended hardware configuration for working through the text using an AWS t2.micro. Chapters 2 and 3 introduce the core technologies used in the book, Docker and Jupyter, as well as the idea of interactive programming. Chapters 4, 5, 6, and 9 dig deeper into specific areas of the Docker ecosystem. Chapter 7 explores the official Jupyter Docker images developed and maintained by the Jupyter development team. Chapter 8 introduces the Docker images for three open source data stores. Chapters 9 and 10 tie everything together, connecting Jupyter to data stores using Docker Compose. After having completed the book, readers are encouraged to reread Chapter 3 and Chapter 10 to begin to develop their own interactive software development style.

The concepts presented herein can be challenging, especially in terms of the abstraction of computer resources and processes. That said, no requisite knowledge is assumed. An attempt has been made to build the discussion from base principles. With this in mind, the reader should be comfortable working at the command line and have an adventurous and inquisitive spirit. We hope that readers with an intermediate to advanced understanding of Docker, Jupyter, or both will gain a deeper understanding of the concepts and learn novel approaches to the solving of computational problems using these tools.

J. Cook, Docker for Data Science, DOI 10.1007/978-1-4842-3012-1_1

CHAPTER 1

Introduction

The typical data scientist consistently has a series of extremely complicated problems on their mind beyond considerations stemming from their system infrastructure. Still, it is inevitable that infrastructure issues will present themselves. To oversimplify, we might draw a distinction between the “modeling problem” and the “engineering problem.” The data scientist is uniquely qualified to solve the former, but can often come up short in solving the latter.

Docker has been widely adopted by the system administrator and DevOps community as a modern solution to the challenges presented in high availability and high performance computing. Docker is being used for the following: transitioning legacy applications to a more modern “microservice”-based approach, facilitating continuous integration and continuous deployment for the integration of new code, and optimizing infrastructure usage.

In this book, I discuss Docker as a tool for the data scientist, in particular in conjunction with the popular interactive programming platform Jupyter. Using Docker and Jupyter, the data scientist can easily take ownership of their system configuration and maintenance, prototype easily deployable and scalable data solutions, and trivially clone entire systems with an eye toward replicability and communication. In short, I propose that skill with Docker is just enough knowledge of systems operations to make the data scientist dangerous. Having done this, I propose that Docker can add high performance and availability tools to the data scientist’s toolbelt and fundamentally change the way that models are developed, prototyped, and scaled.

“Big Data”

A precise definition of “big data” will elude even the most seasoned data wizard. I favor the idea that big data is the exact scope of data that is no longer manageable without explicit consideration of its scope. This will no doubt vary from individual to individual and from development team to development team. I believe that mastering the concepts and techniques associated with Docker presented herein will drastically increase the size and scope of what exactly big data is for any reader.

www.docker.com/use-cases


Recommended Practice for Learning

In this first chapter, you will jump headlong into using Docker and Jupyter on a cloud system. I hope that readers have a solid grasp of the Python numerical computing stack, although I believe that nearly anyone should be able to work their way through this book with enough curiosity and liberal Googling.

For the purposes of working through this book, I recommend using a sandbox system. If you are able to install Docker in an isolated, non-mission-critical setting, you can work through this text without fear of “breaking things.” For this purpose, I here describe the process of setting up a minimal cloud-based system for running Docker using Amazon Web Services (AWS).

As of the writing of this book, AWS is the dominant cloud-based service provider. I don’t endorse the idea that its dominance is a reason a priori to use its services. Rather, I present an AWS solution here as one that will be the easiest to adopt by the largest group of people. Furthermore, I believe that this method will generalize to other cloud-based offerings such as DigitalOcean or Google Cloud Platform, provided that the reader has secure shell (ssh) access to these systems and that they are running a Linux variant.

I present instructions for configuring a system using Elastic Compute Cloud (EC2). New users receive 750 hours of free usage on the t2.micro instance type, and I believe that this should be more than enough for the typical reader’s journey through this text.
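To put the 750-hour figure in perspective, the short Python sketch below converts it into days of continuous running. The function name and the idea of dividing the allowance across several always-on instances are illustrative assumptions, not something the text specifies.

```python
FREE_HOURS = 750      # free-usage figure quoted in the text
HOURS_PER_DAY = 24


def days_of_continuous_use(free_hours, instances=1):
    """Days that `instances` always-on instances can run within `free_hours`."""
    return free_hours / (HOURS_PER_DAY * instances)


print(days_of_continuous_use(FREE_HOURS))     # 31.25
print(days_of_continuous_use(FREE_HOURS, 2))  # 15.625
```

In other words, a single instance left running around the clock uses up 750 hours in a little over 31 days, so stopping the instance between study sessions stretches the allowance considerably.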

Over the next few pages, I outline the process of configuring an AWS EC2 system for the purposes of working through this text. This process consists of:

1. Configuring a key pair

2. Creating a new security group

3. Creating a new EC2 instance

4. Configuring the new instance to use Docker
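For readers who prefer the command line to the web console, steps 1 to 3 above can be sketched as a dry run in Python that only assembles AWS CLI commands and prints them. This is an illustrative alternative, not the procedure the book itself walks through. The key name, group name, AMI ID, and CIDR range are hypothetical placeholders, and the `--group-name` and `--security-groups` forms assume the account's default VPC. Step 4 is omitted because this excerpt does not describe how Docker is installed on the instance.

```python
import shlex


def ec2_setup_commands(key_name, group_name, ami_id, ssh_cidr):
    """Return AWS CLI commands for steps 1-3 as argument lists.

    Nothing is executed here; the caller decides whether to run them.
    """
    return [
        # Step 1: create a key pair; the private key material goes to stdout
        ["aws", "ec2", "create-key-pair", "--key-name", key_name,
         "--query", "KeyMaterial", "--output", "text"],
        # Step 2: create a security group, then allow inbound SSH (port 22)
        ["aws", "ec2", "create-security-group", "--group-name", group_name,
         "--description", "Sandbox for Docker and Jupyter"],
        ["aws", "ec2", "authorize-security-group-ingress",
         "--group-name", group_name,
         "--protocol", "tcp", "--port", "22", "--cidr", ssh_cidr],
        # Step 3: launch a single t2.micro instance with that key and group
        ["aws", "ec2", "run-instances", "--image-id", ami_id, "--count", "1",
         "--instance-type", "t2.micro", "--key-name", key_name,
         "--security-groups", group_name],
    ]


# Hypothetical placeholder values; substitute your own before running anything.
for command in ec2_setup_commands("docker-ds-key", "docker-ds-sg",
                                  "ami-EXAMPLE", "203.0.113.0/24"):
    print(shlex.join(command))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; each printed line can be reviewed and pasted into a shell that has the AWS CLI configured for the chosen region.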

Set up a New AWS Account

To begin, set up an AWS account if you do not already have one.

■ Note This work can be done in any region, although it is recommended that readers take note of which region they have selected for work (Figure 1-1). For reasons I have long forgotten, I choose to work in us-west-2.

www.digitalocean.com

https://cloud.google.com

Instructions for creating a new AWS account can be found at https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/.
