TensorFlow 2.1.0 tf.data Tutorial: Exploring the New Features

Updated 2024-08-30 · PDF, 386 KB

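"Official tf.data tutorial, based on TF v2"

In TensorFlow v2, the `tf.data` API is the main tool for building efficient, scalable input pipelines. It lets you read data from a variety of sources and preprocess it to suit a deep learning model. This tutorial centers on the core `tf.data.Dataset` abstraction and explains how to build input pipelines with it.

1. **Basics**
   - **Dataset structure**: a `tf.data.Dataset` represents a sequence of elements. Every element has the same structure: a single tensor, or a nested structure (such as a tuple or dict) of tensors.
2. **Reading input data**
   - **NumPy arrays**: NumPy arrays can be converted directly into a `Dataset`, which is convenient for training.
   - **Python generators**: `tf.data.Dataset.from_generator` turns a generator function, which produces data on the fly, into a `Dataset`.
   - **TFRecord data**: TFRecord is a binary file format commonly used to store TensorFlow data. `tf.data.TFRecordDataset` reads these files.
   - **Text data**: `tf.data.TextLineDataset` reads text files line by line.
   - **CSV data**: `tf.data.experimental.CsvDataset` parses CSV files into a `Dataset`.
3. **Batching dataset elements**
   - **Simple batching**: `dataset.batch(batch_size)` stacks consecutive elements into batches.
   - **Batching with padding**: `padded_batch` pads elements of different sizes to a common shape within each batch, which is especially useful for variable-length sequences.
4. **Training workflows**
   - **Repeating data for multiple epochs**: `dataset.repeat(num_epochs)` repeats the dataset the given number of times, one pass per training epoch.
   - **Shuffling input data**: `dataset.shuffle(buffer_size)` randomizes element order, which helps generalization during training.
5. **Preprocessing data**
   - **Preprocessing with Dataset.map()**: `map` takes a function and applies it to every element. Use it for transformations such as normalization or decoding.
   - **Preprocessing with non-TF functions**: `tf.py_function` brings arbitrary Python functions into the pipeline for operations that are hard to express in TensorFlow.
   - **Parsing tf.Example protocol buffer messages**: `tf.io.parse_example` (or `tf.io.parse_single_example`), usually applied through `Dataset.map`, parses serialized `tf.train.Example` messages.
   - **Time series windowing**: `dataset.window` creates sliding windows over time series data, and `dataset.flat_map` flattens the windows back into individual samples.
   - **Resampling**: `tf.data.experimental.sample_from_datasets` and `tf.data.experimental.rejection_resample` resample a dataset, for example to rebalance classes.
6. **Using tf.data with high-level APIs**
   - **With tf.keras**: Keras models accept a `tf.data.Dataset` directly as input, which simplifies training.
   - **With tf.estimator**: the Estimator framework can also consume `tf.data` pipelines through its input functions.

The `tf.data` API aims to make data handling simple and efficient. By composing its operations you can build complex processing flows for images, text, or time series, and tailor the input pipeline to the needs of your model.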

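As a rough illustration of how the pieces in this outline compose, the following sketch builds a small pipeline from in-memory NumPy arrays. The array contents, the names `features` and `labels`, and the scaling step are made up for illustration and are not taken from the tutorial.

```python
import numpy as np
import tensorflow as tf

# Toy data (hypothetical): 10 samples with 2 features each, plus binary labels.
features = np.arange(20, dtype=np.float32).reshape(10, 2)
labels = np.arange(10, dtype=np.int32) % 2

pipeline = (
    tf.data.Dataset.from_tensor_slices((features, labels))  # one element per row
    .shuffle(buffer_size=10)           # buffer covers the whole toy dataset
    .map(lambda x, y: (x / 19.0, y))   # scale features into [0, 1]
    .batch(4)                          # 10 elements give batches of 4, 4 and 2
    .repeat(2)                         # two passes over the batched data
)

batch_sizes = [int(x.shape[0]) for x, _ in pipeline]
print(batch_sizes)  # → [4, 4, 2, 4, 4, 2]
```

Placing `repeat` after `batch` keeps epoch boundaries visible: the short final batch of each pass is never merged with the start of the next pass.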
ds_series = tf.data.Dataset.from_generator(
    gen_series,
    output_types=(tf.int32, tf.float32),  # required argument
    output_shapes=((), (None,)))  # optional, but best supplied, for the reasons given earlier

ds_series
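The call above depends on a generator named `gen_series` that is defined earlier in the tutorial and does not appear in this excerpt. The sketch below assumes it yields an integer id together with a float vector of random length; that definition is an assumption made so the example runs on its own.

```python
import numpy as np
import tensorflow as tf

def gen_series():
    # Assumed stand-in: yields (id, float32 vector of random length 0..9) forever.
    i = 0
    while True:
        size = np.random.randint(0, 10)
        yield i, np.random.normal(size=(size,)).astype(np.float32)
        i += 1

ds_series = tf.data.Dataset.from_generator(
    gen_series,
    output_types=(tf.int32, tf.float32),
    output_shapes=((), (None,)))  # scalar id, vector of unknown length

# The generator is infinite, so always bound it with take() before iterating.
first_ids = [int(i) for i, _ in ds_series.take(3)]
print(first_ids)  # → [0, 1, 2]
```

Each fresh iteration over the dataset restarts the generator, which is why the ids begin at 0 again.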

The tf.data.Dataset is now built. Note, however, that batching a dataset whose elements vary in shape requires Dataset.padded_batch.

ds_series_batch = ds_series.shuffle(20).padded_batch(10, padded_shapes=([], [None]))

ids, sequence_batch = next(iter(ds_series_batch))
print(ids.numpy())
print()
print(sequence_batch.numpy())

[ 6  1 10  0  3 17 12  9  5 23]

[[ 0.5812 -0.825   0.6075 -1.3856 -0.8151 -1.1908  0.      0.    ]
 [-0.7208  0.0611  0.0084  0.6592  0.8364  0.8327 -0.7164  0.8826]
 [ 0.0391 -2.0019  0.4077  0.9304  0.      0.      0.      0.    ]
 [ 0.4397 -0.0901 -0.4993  0.3485  0.2481  0.      0.      0.    ]
 [ 0.0346  0.      0.      0.      0.      0.      0.      0.    ]
 [-1.0478  0.      0.      0.      0.      0.      0.      0.    ]
 [ 0.      0.      0.      0.      0.      0.      0.      0.    ]
 [ 0.      0.      0.      0.      0.      0.      0.      0.    ]
 [ 0.3163  0.      0.      0.      0.      0.      0.      0.    ]
 [ 0.      0.      0.      0.      0.      0.      0.      0.    ]]

Note: as of TensorFlow 2.2 the padded_shapes argument is no longer required. The default behavior is to pad all axes to the longest in the batch.

ds_series_batch = ds_series.shuffle(20).padded_batch(10)
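Because the example above uses random data, the effect of padding is easier to see on a deterministic input. The following sketch uses a made-up generator, `gen_ramps`, and passes `padded_shapes` explicitly so that it does not depend on the newer default.

```python
import tensorflow as tf

def gen_ramps():
    # Hypothetical generator: yields [1], [1, 2], [1, 2, 3], [1, 2, 3, 4].
    for n in range(1, 5):
        yield list(range(1, n + 1))

ramps = tf.data.Dataset.from_generator(
    gen_ramps,
    output_types=tf.int32,
    output_shapes=tf.TensorShape([None]))

# Each batch is padded with zeros to the length of its own longest element.
padded = ramps.padded_batch(2, padded_shapes=(None,))

batches = [b.numpy().tolist() for b in padded]
print(batches)  # → [[[1, 0], [1, 2]], [[1, 2, 3, 0], [1, 2, 3, 4]]]
```

The two batches end up with different widths (2 and 4): padding is computed per batch, not across the whole dataset.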

For a more realistic example, try wrapping preprocessing.image.ImageDataGenerator as a tf.data.Dataset.

First, download the data:

flowers = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    untar=True)

Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz

228818944/228813984 [==============================] – 5s 0us/step

Create the image.ImageDataGenerator:

img_gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255, rotation_range=20)

images, labels = next(img_gen.flow_from_directory(flowers))

Found 3670 images belonging to 5 classes.

print(images.dtype, images.shape)

print(labels.dtype, labels.shape)

float32 (32, 256, 256, 3)

float32 (32, 5)

ds = tf.data.Dataset.from_generator(
    img_gen.flow_from_directory, args=[flowers],
    output_types=(tf.float32, tf.float32),
    output_shapes=([32,256,256,3], [32,5])
)
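The `args` parameter used above forwards values to the generator function each time the dataset is iterated. The sketch below shows the same mechanism without downloading any images. The generator `fake_image_batches` and its small 8x8 image size are stand-ins invented for this example and are not part of Keras.

```python
import numpy as np
import tensorflow as tf

def fake_image_batches(num_batches, batch_size):
    # Hypothetical stand-in for img_gen.flow_from_directory. The arguments
    # arrive as NumPy values, so convert them to plain ints first.
    num_batches, batch_size = int(num_batches), int(batch_size)
    for _ in range(num_batches):
        images = np.random.rand(batch_size, 8, 8, 3).astype(np.float32)
        classes = np.random.randint(0, 5, size=batch_size)
        labels = np.eye(5, dtype=np.float32)[classes]  # one-hot over 5 classes
        yield images, labels

ds = tf.data.Dataset.from_generator(
    fake_image_batches,
    args=[3, 4],  # num_batches=3, batch_size=4
    output_types=(tf.float32, tf.float32),
    output_shapes=([4, 8, 8, 3], [4, 5]))

shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in ds]
print(shapes)
```

Declaring full `output_shapes` only works when every yielded batch really has that shape. If the final batch of a real generator can be smaller, the batch dimension would have to be declared as `None` instead.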

2.3 Reading TFRecord data

See Loading TFRecords for an end-to-end example.

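The tf.data API supports a variety of file formats, so you can process large datasets that do not fit in memory. For example, the TFRecord file format is a simple record-oriented binary format that many TensorFlow applications use for training data. The tf.data.TFRecordDataset class lets you stream the contents of one or more TFRecord files as the input of a data pipeline.

The example below uses the test file from the French Street Name Signs (FSNS) dataset: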

# Download one TFRecord test file from the FSNS dataset.

fsns_test_file = tf.keras.utils.get_file("fsns.tfrec", "https://storage.googleapis.com/download.tensorflow.org/data/fsns-20160927/testdata/fsns-00000-of-00001")

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/fsns-20160927/testdata/fsns-00000-of-00001

7905280/7904079 [==============================] – 0s 0us/step

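The filenames argument of TFRecordDataset can be a string, a list of strings, or a tf.Tensor of strings. So if you have two sets of files, one for training and one for validation, you can write a factory method that takes filenames as its input argument and produces the dataset.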

dataset = tf.data.TFRecordDataset(filenames = [fsns_test_file])

dataset
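The factory-method idea can be sketched as follows. The helper `write_records`, the function name `make_dataset`, and the temporary file names are all hypothetical. The records are plain byte strings written to local files so that the example needs no download.

```python
import os
import tempfile
import tensorflow as tf

def write_records(path, payloads):
    # Hypothetical helper: writes raw byte strings as TFRecord entries.
    with tf.io.TFRecordWriter(path) as writer:
        for payload in payloads:
            writer.write(payload)

def make_dataset(filenames):
    # Factory: one pipeline definition, reused for any file or list of files.
    return tf.data.TFRecordDataset(filenames=filenames)

tmp_dir = tempfile.mkdtemp()
train_file = os.path.join(tmp_dir, "train.tfrec")
val_file = os.path.join(tmp_dir, "val.tfrec")
write_records(train_file, [b"t0", b"t1", b"t2"])
write_records(val_file, [b"v0"])

train_records = [r.numpy() for r in make_dataset([train_file])]  # list of files
val_records = [r.numpy() for r in make_dataset(val_file)]        # single string
print(train_records, val_records)  # → [b't0', b't1', b't2'] [b'v0']
```

Any shared preprocessing (parsing, batching) would go inside `make_dataset`, so the training and validation pipelines cannot drift apart.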

Many TensorFlow projects store serialized tf.train.Example records in their TFRecord files. These need to be decoded before they can be inspected:

raw_example = next(iter(dataset))

parsed = tf.train.Example.FromString(raw_example.numpy())

parsed.features.feature['image/text']

bytes_list {
  value: "Rue Perreyon"
}
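Decoding with `tf.train.Example.FromString` is handy for inspection, but it runs in Python on one record at a time. Inside a pipeline, parsing is usually done with `tf.io.parse_single_example` in `Dataset.map`. The sketch below writes a single made-up record to a temporary file and parses it back. The feature key mirrors the one shown above, and the rest of the real FSNS schema is not assumed.

```python
import os
import tempfile
import tensorflow as tf

# Build one tf.train.Example with a single bytes feature (illustrative value).
example = tf.train.Example(features=tf.train.Features(feature={
    "image/text": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"Rue Perreyon"])),
}))

path = os.path.join(tempfile.mkdtemp(), "demo.tfrec")
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example.SerializeToString())

# The feature spec tells the parser which keys to extract and their types.
feature_spec = {"image/text": tf.io.FixedLenFeature([], tf.string)}

def parse(serialized):
    # Runs as part of the pipeline, with no .numpy() round trip.
    return tf.io.parse_single_example(serialized, feature_spec)

parsed_ds = tf.data.TFRecordDataset(path).map(parse)
text = next(iter(parsed_ds))["image/text"].numpy()
print(text)  # → b'Rue Perreyon'
```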

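2.4 Reading text data

See Loading Text for an end-to-end example.

Many datasets are distributed as one or more text files. tf.data.TextLineDataset extracts lines from one or more text files. Given one or more filenames, a TextLineDataset produces one string-valued element per line of those files.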

directory_url = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
file_names = ['cowper.txt', 'derby.txt', 'butler.txt']

file_paths = [
    tf.keras.utils.get_file(file_name, directory_url + file_name)
    for file_name in file_names
]

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt

819200/815980 [==============================] – 0s 0us/step

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt

811008/809730 [==============================] – 0s 0us/step

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt

811008/807992 [==============================] – 0s 0us/step

dataset = tf.data.TextLineDataset(file_paths)

Here are the first few lines of the first file:

for line in dataset.take(5):
    print(line.numpy())

b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;"
b'His wrath pernicious, who ten thousand woes'
b"Caused to Achaia's host, sent many a soul"
b'Illustrious into Ades premature,'
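The same behavior can be reproduced offline. The sketch below writes two small text files to a temporary directory and reads them back through `TextLineDataset`. The file names and contents are invented for illustration.

```python
import os
import tempfile
import tensorflow as tf

tmp_dir = tempfile.mkdtemp()
paths = []
for name, lines in [("a.txt", ["alpha", "beta"]), ("b.txt", ["gamma"])]:
    path = os.path.join(tmp_dir, name)
    # newline="\n" keeps line endings identical on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("\n".join(lines) + "\n")
    paths.append(path)

# Files are read in the order given, each one from its first to its last line.
lines_ds = tf.data.TextLineDataset(paths)

# Elements are byte strings. Decode them when text is needed.
decoded = [line.numpy().decode("utf-8") for line in lines_ds]
print(decoded)  # → ['alpha', 'beta', 'gamma']
```

Because elements are raw bytes, artifacts such as the UTF-8 byte order mark (`\xef\xbb\xbf`) at the start of the first line in the output above are passed through unchanged.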
