TensorFlow Distributed Basics: DistributedStrategy Explained and Data Processing

Updated 2024-07-01, 489KB DOC

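This document examines the basic concepts and implementation of DistributedStrategy, TensorFlow's distributed training framework. It starts with an overview of the Strategy class hierarchy and its components:

1. **StrategyBase** is the base abstraction for distributed training strategies and covers initialization, usage, and training-loop control. Subclasses such as `MirroredStrategy` and `MultiWorkerMirroredStrategy` build on `StrategyBase` to support single-machine multi-GPU and multi-machine multi-GPU training, respectively.
   - `Initialization` covers the configuration applied when a strategy is instantiated, for example how devices are assigned.
   - `Usage` covers how a strategy is applied inside the training loop, for example data-parallel processing and model updates.
   - `CTL`, which in TensorFlow usage normally stands for custom training loop, gives control over the training flow, including how data is passed between devices.
   - `Scope` is an abstraction that encapsulates strategy-related context and helps with resource management and isolation.
2. **StrategyExtendedV2** is the extended interface. It adds finer-grained features such as locality control, meaning the management of where data and computation physically live, and it also covers the mechanism for updating model parameters consistently.
3. **Data processing** is the core of a distribution strategy and mainly concerns reading and distributing data. `DistributedDataset` and `DistributedIterator` are responsible for creating and iterating over distributed datasets. The material covers reading datasets directly, the `MirroredExtended` implementation, and the `input_lib` functionality. `InputWorkers` defines the worker nodes that handle input.
   - `Directly reading data sets` includes example usage and the implementation under different strategies, such as how `MirroredStrategy` distributes data.
   - `Input Workers` and `input_contexts` are the key components of data distribution; they build and manage the input workload.
4. **Advanced usage** describes how Strategy integrates with different frameworks. The `Keras` integration lets users apply distributed training to Keras models directly, while `Custom Training Loop` and `Estimator` offer lower-level control.

Overall, the document describes the principles and implementation details of TensorFlow distribution strategies, including the design of the strategy classes, how data is distributed, and how strategies combine with frameworks such as Keras, so that developers can move models to a distributed environment and train more efficiently.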

>>> @tf.function
... def replica_fn(input):
...   return input*2
>>> result = []
>>> # Iterate over the tf.distribute.DistributedDataset
... for x in dist_dataset:
...   # process dataset elements
...   result.append(strategy.run(replica_fn, args=(x,)))
>>> print(result)
[PerReplica:{
  0: ,
}, PerReplica:{
  0: ,
}]

2.1.2 Base class implementation

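Among the StrategyBase methods, the three main data-related operations are batching, sharding, and prefetching (readers can go back to the PyTorch data-loading part to compare the two).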

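In the code snippet above, batching works as follows:

The dataset is first batched with global_batch_size.

experimental_distribute_dataset is then called to rebatch the dataset with a new batch size, equal to the global batch size divided by the number of replicas in sync. The result can be traversed with a Pythonic for loop.

x is a tf.distribute.DistributedValues that holds the data for all replicas; each replica receives data of the new batch size.

tf.distribute.Strategy.run takes the per-replica data in x and dispatches it to the replicas, each of which executes the worker function replica_fn.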
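The batching steps above can be mimicked in plain Python. This is only a sketch of the arithmetic, not TensorFlow's implementation: `rebatch` and `run` are hypothetical helpers standing in for `experimental_distribute_dataset` and `tf.distribute.Strategy.run`, and the sketch assumes the global batch size divides evenly by the replica count.

```python
def rebatch(global_batch, num_replicas_in_sync):
    """Split one global batch into equally sized per-replica batches.

    Hypothetical helper: assumes the global batch size is divisible by
    the number of replicas, which real TensorFlow does not require.
    """
    if len(global_batch) % num_replicas_in_sync != 0:
        raise ValueError("global batch size must be divisible by replica count")
    per_replica = len(global_batch) // num_replicas_in_sync
    return [global_batch[i * per_replica:(i + 1) * per_replica]
            for i in range(num_replicas_in_sync)]


def replica_fn(batch):
    # Same computation as the replica_fn in the snippet above.
    return [value * 2 for value in batch]


def run(fn, per_replica_batches):
    """Apply fn to each replica's slice (sequentially here, for illustration)."""
    return [fn(batch) for batch in per_replica_batches]


global_batch_size = 4
num_replicas_in_sync = 2
data = list(range(8))

# Step 1: batch the dataset with the global batch size.
global_batches = [data[i:i + global_batch_size]
                  for i in range(0, len(data), global_batch_size)]

# Steps 2 and 3: rebatch each global batch per replica, then run replica_fn.
results = [run(replica_fn, rebatch(batch, num_replicas_in_sync))
           for batch in global_batches]
```

With a global batch size of 4 and 2 replicas, each replica sees 2 elements per step, which is the "global batch size divided by the number of replicas in sync" rule.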

Sharding includes autosharding across multiple workers.

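First, in multi-worker distributed training (that is, when using tf.distribute.experimental.MultiWorkerMirroredStrategy or tf.distribute.TPUStrategy), autosharding a dataset over a set of workers means that each worker is assigned a subset of the entire dataset (if the right tf.data.experimental.AutoShardPolicy is set). This ensures that at each step every worker processes a global batch size of non-overlapping dataset elements. Autosharding has several options, which can be specified with tf.data.experimental.DistributeOptions.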

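Then, sharding within each worker means that the method splits the data among all of that worker's devices (if more than one is present). This happens regardless of whether multi-worker autosharding is set.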

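For autosharding across multiple workers, the default mode is tf.data.experimental.AutoShardPolicy.AUTO. If the dataset was created from reader datasets (for example tf.data.TFRecordDataset or tf.data.TextLineDataset), this mode attempts to shard by file; otherwise it shards by data, where each worker reads the entire dataset but only processes the shard assigned to it. However, if there is less than one input file per worker, we recommend disabling cross-worker autosharding by setting tf.data.experimental.DistributeOptions.auto_shard_policy to tf.data.experimental.AutoShardPolicy.OFF.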
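The difference between sharding by file and sharding by data can be illustrated with a small plain-Python sketch. The helpers `shard_by_file` and `shard_by_data` and the file names are hypothetical and only model the behaviour described above; they are not how tf.data implements its shard policies.

```python
def shard_by_file(files, num_workers, worker_index):
    """File-style sharding: each worker opens only its own subset of files."""
    return files[worker_index::num_workers]


def shard_by_data(elements, num_workers, worker_index):
    """Data-style sharding: each worker scans every element but keeps
    only those whose position maps to its own index."""
    return [element for position, element in enumerate(elements)
            if position % num_workers == worker_index]


# Hypothetical input files, used only for illustration.
files = ["part-0.tfrecord", "part-1.tfrecord",
         "part-2.tfrecord", "part-3.tfrecord"]

worker0_files = shard_by_file(files, num_workers=2, worker_index=0)
worker1_files = shard_by_file(files, num_workers=2, worker_index=1)

# Sharding by data: both workers read all six elements, keep disjoint subsets.
elements = list(range(6))
worker0_data = shard_by_data(elements, num_workers=2, worker_index=0)
worker1_data = shard_by_data(elements, num_workers=2, worker_index=1)

# Fewer files than workers: file-style sharding leaves worker 1 with nothing,
# which is the situation in which the text recommends turning autosharding off.
starved_worker_files = shard_by_file(["only.tfrecord"],
                                     num_workers=2, worker_index=1)
```

Sharding by file avoids reading data a worker will discard, while sharding by data always yields balanced shards at the cost of every worker reading the full input.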

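For prefetching, by default the method adds a prefetch transformation at the end of the user-provided tf.data.Dataset instance. The argument to the prefetch transformation is buffer_size, which is set to the number of replicas in sync.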
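To make the role of buffer_size concrete, here is a minimal sketch of a prefetching iterator built from the Python standard library. The `prefetch` function is a hypothetical illustration of the general technique (a background producer filling a bounded queue); it is not tf.data's implementation, and it does not propagate exceptions raised by the producer.

```python
import queue
import threading


def prefetch(iterable, buffer_size):
    """Yield items from `iterable` while a background thread produces
    up to `buffer_size` items ahead of the consumer."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the input

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks once the buffer is full
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


# Mirror the default described above: buffer size = replicas in sync.
num_replicas_in_sync = 2
batches = [[0, 1], [2, 3], [4, 5]]
consumed = list(prefetch(batches, buffer_size=num_replicas_in_sync))
```

Sizing the buffer to the number of replicas means that, in this model, one batch per replica can be ready before the next step begins.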

experimental_distribute_dataset is defined as follows; it simply delegates to extended to do the work.

def experimental_distribute_dataset(self, dataset, options=None):
  """Creates tf.distribute.DistributedDataset from tf.data.Dataset.

  Args:
    dataset: tf.data.Dataset that will be sharded across all replicas using
      the rules stated above.
    options: tf.distribute.InputOptions used to control options on how this
      dataset is distributed.

  Returns:
    A tf.distribute.DistributedDataset.
  """
  distribution_strategy_input_api_counter.get_cell(
      self.__class__.__name__, "distribute_dataset").increase_by(1)
  return self._extended._experimental_distribute_dataset(dataset, options)

2.1.3 MirroredExtended implementation

We use MirroredExtended to look at a concrete implementation. It simply calls input_lib.get_distributed_dataset to do the work, so we go on to look inside input_lib.

def _experimental_distribute_dataset(self, dataset, options):
  if (options and options.experimental_replication_mode ==
      distribute_lib.InputReplicationMode.PER_REPLICA):
    raise NotImplementedError(
        "InputReplicationMode.PER_REPLICA "
        "is only supported in "
        "distribute_datasets_from_function."
    )
  return input_lib.get_distributed_dataset(
      dataset,
      self._input_workers_with_options(options),
      self._container_strategy(),
      num_replicas_in_sync=self._num_replicas_in_sync,
      options=options)
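The call chain in the two listings (the strategy forwards to its extended object, which rejects PER_REPLICA and otherwise builds the distributed dataset) can be sketched without TensorFlow. Everything below is a toy stand-in: `ToyStrategy`, `ToyExtended`, and the local `InputOptions` and `InputReplicationMode` classes are hypothetical, and the list slicing merely takes the place of input_lib.get_distributed_dataset.

```python
import enum


class InputReplicationMode(enum.Enum):
    # Local stand-in for the enum referenced in the listing above.
    PER_WORKER = "PER_WORKER"
    PER_REPLICA = "PER_REPLICA"


class InputOptions:
    # Local stand-in carrying only the field the listing inspects.
    def __init__(self,
                 experimental_replication_mode=InputReplicationMode.PER_WORKER):
        self.experimental_replication_mode = experimental_replication_mode


class ToyExtended:
    """Holds the strategy-specific logic, like MirroredExtended."""

    def __init__(self, num_replicas_in_sync):
        self._num_replicas_in_sync = num_replicas_in_sync

    def _experimental_distribute_dataset(self, dataset, options):
        if (options and options.experimental_replication_mode ==
                InputReplicationMode.PER_REPLICA):
            raise NotImplementedError(
                "InputReplicationMode.PER_REPLICA is only supported in "
                "distribute_datasets_from_function.")
        # Stand-in for input_lib.get_distributed_dataset: split every
        # global batch into contiguous per-replica slices.
        n = self._num_replicas_in_sync
        return [[batch[i * len(batch) // n:(i + 1) * len(batch) // n]
                 for i in range(n)]
                for batch in dataset]


class ToyStrategy:
    """Public facade that only forwards to the extended object."""

    def __init__(self, extended):
        self._extended = extended

    def experimental_distribute_dataset(self, dataset, options=None):
        return self._extended._experimental_distribute_dataset(dataset, options)


strategy = ToyStrategy(ToyExtended(num_replicas_in_sync=2))
distributed = strategy.experimental_distribute_dataset([[0, 1], [2, 3]])

# Requesting PER_REPLICA through this entry point is rejected.
try:
    strategy.experimental_distribute_dataset(
        [[0, 1]], InputOptions(InputReplicationMode.PER_REPLICA))
    rejected = False
except NotImplementedError:
    rejected = True
```

Keeping the public class thin and placing the strategy-specific behaviour in the extended object is what allows each strategy to plug in its own input handling behind the same public method.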

2.1.4 input_lib functionality

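input_lib provides some basic functionality for handling input data. get_distributed_dataset is a general-purpose function that every strategy can use to return a distributed dataset. The distributed dataset instance that is returned differs depending on whether we are in a TF1 or a TF2 context, and so does its API. DistributedDataset and input_workers are used here, so we analyze each of them in turn.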

def get_distributed_dataset(dataset,
                            input_workers,
                            strategy,
                            num_replicas_in_sync=None,
                            input_context=None,
                            options=None,
                            build=True):
  """Returns a distributed dataset from the given tf.data.Dataset instance.

  Args:
    dataset: a tf.data.Dataset instance.
    input_workers: an InputWorkers object which specifies devices on which
      iterators should be created.
    strategy: a tf.distribute.Strategy object, used to run all-reduce to
      handle last partial batch.
    num_replicas_in_sync: Optional integer. If this is not None, the value is
      used to decide how to rebatch datasets into smaller batches so that
      the total batch size for each step (across all workers and replicas)
      adds up to dataset's batch size.
    input_context: InputContext for sharding. Only pass this in for between
      graph multi-worker cases where there is only one input_worker. In
      these cases, we will shard based on the input_pipeline_id and
      num_input_pipelines in the InputContext.
    options: Default is None. tf.distribute.InputOptions used to control
      options on how this dataset is distributed.
    build: whether to build underlying datasets when a DistributedDataset is
      created. This is only useful for ParameterServerStrategy now.

  Returns:
    A distributed dataset instance.
  """
  if tf2.enabled():
    return DistributedDataset(  # DistributedDataset is analyzed next
        input_workers,
