TensorFlow dataset.padded_batch Explained: Tips for Handling Variable-Length Sequences

Updated 2024-08-30

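To understand TensorFlow's `dataset.padded_batch`, start with what it does and how it works. The function matters most for sequence data, for example when building a Seq2Seq (sequence-to-sequence) model: it pads the samples in each batch to a common shape so that they can be stacked into one tensor and fed to the network.

1. The English docstring and a Chinese translation:
- English docstring: the function combines consecutive elements into padded batches, similar to `Dataset.dense_to_sparse_batch()`. It merges multiple consecutive elements, which may have different shapes, into a single element with an additional outer dimension. Each resulting tensor is padded to the shape given in `padded_shapes`.
- W3Cschool Chinese translation: the method combines consecutive elements of the dataset, which may have different shapes, into one batch. Before batching, each component is padded to the shape defined by `padded_shapes`. Any unknown dimension (`None` in a `tf.TensorShape`, or `-1` in a tensor-like object) is padded to the largest size of that dimension within each batch.

2. Parameters:
- `batch_size`: a `tf.int64` scalar tensor giving the number of consecutive elements to combine into one batch.
- `padded_shapes`: a nested structure of `tf.TensorShape` or `tf.int64` vector tensor-like objects, giving the shape to which each component of an input element is padded. An unknown dimension (`None` or `-1`) is padded to the maximum size of that dimension in the current batch, so different batches may end up with different padded lengths.

3. In practice: in Seq2Seq models, `dataset.padded_batch` is typically used when preprocessing text, for example in machine translation, where source and target sentences differ in length. Padding aligns the sequences within a batch so the model can process them together, which in turn lets training exploit GPU parallelism.

4. Exploring further: to understand the function in depth, try the following steps:
- Write a simple example that uses `padded_batch` on sequences of different lengths.
- Compare it with `Dataset.dense_to_sparse_batch`, which batches elements into a `tf.SparseTensor` instead of padding them.
- Step through the code and observe how the input data changes after padding and batching.
- Read the comments in the source code to learn the implementation details.

Working through these steps teaches you how to use `dataset.padded_batch` and also improves your technical English reading and programming skills, both of which help when learning a deep learning framework. Combining theory with practice is the key to understanding any technique.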
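The padding rule described above (an unknown dimension is padded to the longest element in each batch) can be illustrated without TensorFlow. The following is a minimal pure-Python sketch of that rule for one-dimensional sequences. It is not TensorFlow's implementation, and the helper name `padded_batch_sketch` and the sample data are hypothetical, introduced only for illustration.

```python
def padded_batch_sketch(sequences, batch_size, pad_value=0):
    """Group consecutive sequences and pad each group to its own longest member.

    Illustrative only: mimics the per-batch padding rule for 1-D sequences,
    not TensorFlow's actual implementation.
    """
    batches = []
    for start in range(0, len(sequences), batch_size):
        group = sequences[start:start + batch_size]
        # The padded width is decided per batch, not for the whole dataset.
        width = max(len(seq) for seq in group)
        batches.append(
            [list(seq) + [pad_value] * (width - len(seq)) for seq in group]
        )
    return batches


# Hypothetical sample data: five sequences of different lengths.
sequences = [[1], [2, 3], [4, 5, 6], [7, 8], [9]]
batches = padded_batch_sketch(sequences, batch_size=2)
for batch in batches:
    print(batch)
```

Note that the first batch is padded to width 2 and the second to width 3: the padded length follows the longest sequence in each batch, and the last batch simply holds the one remaining element.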

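How I worked out the dataset.padded_batch function in TensorFlow

Today I kept working through the book Tensorflow实战Google深度学习框架. I had trouble following the Seq2Seq model code on page 250, and the function padded_batch(batch_size, padded_shapes) was the hardest part. This post simply records how I got to the bottom of it, and also sorts out a process I can reuse for understanding similar functions.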

1. Read the English explanation directly, alongside the W3Cschool Chinese translation, to practice reading technical English, especially the specialized vocabulary.

View the English docstring that ships with the code directly in PyCharm:

"""Combines consecutive elements of this dataset into padded batches.

Like `Dataset.dense_to_sparse_batch()`, this method combines
multiple consecutive elements of this dataset, which might have
different shapes, into a single element. The tensors in the
resulting element have an additional outer dimension, and are
padded to the respective shape in `padded_shapes`.

Args:
  batch_size: A `tf.int64` scalar `tf.Tensor`, representing the number of
    consecutive elements of this dataset to combine in a single batch.
  padded_shapes: A nested structure of `tf.TensorShape` or
    `tf.int64` vector tensor-like objects representing the shape
    to which the respective component of each input element should
    be padded prior to batching. Any unknown dimensions
    (e.g. `tf.Dimension(None)` in a `tf.TensorShape` or `-1` in a
    tensor-like object) will be padded to the maximum size of that
    dimension in each batch.
  padding_values: (Optional.) A nested structure of scalar-shaped
    `tf.Tensor`, representing the padding values to use for the
    respective components. Defaults are `0` for numeric types and
    the empty string for string types.

Returns:
  A `Dataset`.
"""

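Compare this with the W3Cschool Chinese translation, https://www.w3cschool.cn/tensorflow_python/tensorflow_python-pqdr2cqn.html, which reads:

Combines consecutive elements of this dataset into padded batches.

Like Dataset.dense_to_sparse_batch(), this method combines multiple consecutive elements of this dataset (which may have different shapes) into a single element. The tensors in the resulting element have an additional outer dimension and are padded to the respective shape in padded_shapes.

Args:

batch_size: a tf.int64 scalar tf.Tensor, representing the number of consecutive elements of this dataset to combine in a single batch.

padded_shapes: a nested structure of tf.TensorShape or tf.int64 vector tensor-like objects, representing the shape to which the respective component of each input element should be padded before batching. Any unknown dimension (for example tf.Dimension(None) in a TensorShape, or -1 in a tensor-like object) is padded to the maximum size of that dimension in each batch.

padding_values: (optional) a nested structure of scalar-shaped tf.Tensor, representing the padding values to use for the respective components. The default is 0 for numeric types and the empty string for string types.

Returns:

A dataset.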
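To see how `padded_shapes` and `padding_values` interact, here is a small sketch written against the TensorFlow 2.x eager API (it assumes TF 2.4 or later, because it uses the `output_signature` argument of `from_generator`). The sample data and the variable names `sequences`, `dynamic`, and `fixed` are my own, not from the book or the referenced post.

```python
import tensorflow as tf

# Hypothetical sample data: four sequences of different lengths.
sequences = [[1], [2, 3], [4, 5, 6], [7, 8]]

# from_generator is used because the rows are ragged and cannot be
# turned into a single dense tensor by from_tensor_slices.
dataset = tf.data.Dataset.from_generator(
    lambda: iter(sequences),
    output_signature=tf.TensorSpec(shape=[None], dtype=tf.int32),
)

# Unknown dimension (None): each batch is padded to its own longest element,
# using the default padding value 0.
dynamic = dataset.padded_batch(2, padded_shapes=[None])
dynamic_batches = [batch.numpy().tolist() for batch in dynamic]

# Fixed dimension plus a custom padding value: every element is padded
# to length 4 with -1 (the padding value must match the element dtype).
fixed = dataset.padded_batch(
    2,
    padded_shapes=[4],
    padding_values=tf.constant(-1, dtype=tf.int32),
)
fixed_batches = [batch.numpy().tolist() for batch in fixed]

print(dynamic_batches)
print(fixed_batches)
```

With `[None]` the two batches come out with widths 2 and 3, while with `[4]` both batches have width 4. A fixed shape must be at least as large as the longest element, otherwise TensorFlow raises an error at iteration time.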

For a concrete example I referred to this blogger's post, https://blog.csdn.net/z2539329562/article/details/89791783, trimmed it down, and added my own comments.

First example:

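import tensorflow as tf
import numpy as np

# Note: this example uses the TensorFlow 1.x graph/session API.
tf.reset_default_graph()

x = [[1, 0, 0],
     [2, 3, 0],
     [4, 5, 6],
     [7, 8, 0],
     [9, 0, 0],
     [0, 1, 0]]

# tf.TensorShape([]) denotes a scalar (a single number)
# tf.TensorShape([None]) denotes a vector of unknown length
padded_shapes = (
    tf.TensorShape([None])
)

dataset = tf.data.Dataset.from_tensor_slices(x)
iterator_before = dataset.make_one_shot_iterator()
# Build the get_next op once, so the loop does not add a new op per iteration
next_before = iterator_before.get_next()

dataset_padded = dataset.padded_batch(2, padded_shapes=padded_shapes)
# dataset_padded = dataset.padded_batch(2, padded_shapes=[None]) also works; see note 1 below
# iterator_later iterates over the dataset after padded_batch has been applied
iterator_later = dataset_padded.make_one_shot_iterator()
next_later = iterator_later.get_next()

sess = tf.Session()

# Print the elements of the original dataset one by one
try:
    while True:
        print(sess.run(next_before))
except tf.errors.OutOfRangeError:
    print("end")

try:
    while True:
        # The source is cut off after this `while True:`. The rest of this loop
        # is an assumed completion that mirrors the loop above.
        print(sess.run(next_later))
except tf.errors.OutOfRangeError:
    print("end")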
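The first example relies on TensorFlow 1.x APIs (`tf.reset_default_graph`, `make_one_shot_iterator`, `tf.Session`) that are not available at the top level in TensorFlow 2.x. As a rough equivalent, the same experiment might be written with eager iteration as below. This is my own port under the assumption of a TF 2.x install, not code from the book or the referenced post; the variable names `rows` and `batches` are illustrative.

```python
import tensorflow as tf

x = [[1, 0, 0],
     [2, 3, 0],
     [4, 5, 6],
     [7, 8, 0],
     [9, 0, 0],
     [0, 1, 0]]

dataset = tf.data.Dataset.from_tensor_slices(x)

# Before batching: each element is one row of x.
rows = [row.numpy().tolist() for row in dataset]
for row in rows:
    print(row)

# After padded_batch: consecutive rows are grouped two at a time.
dataset_padded = dataset.padded_batch(2, padded_shapes=[None])
batches = [batch.numpy().tolist() for batch in dataset_padded]
for batch in batches:
    print(batch)
```

Because every row of `x` already has length 3, `padded_batch` adds no padding here and simply stacks pairs of rows. The padding only becomes visible once the elements differ in length.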
