Apache Beam Programming Guide
The Beam Programming Guide is intended for Beam users who want to use the Beam SDKs to create data processing pipelines. It provides guidance for using the
Beam SDK classes to build and test your pipeline. It is not intended as an exhaustive reference, but as a language-agnostic, high-level guide to programmatically
building your Beam pipeline. As the programming guide is filled out, the text will include code samples in multiple languages to help illustrate how to implement Beam
concepts in your pipelines.
1. Overview
To use Beam, you need to first create a driver program using the classes in one of the Beam SDKs. Your driver program defines your pipeline, including all of the
inputs, transforms, and outputs; it also sets execution options for your pipeline (typically passed in using command-line options). These include the Pipeline Runner,
which, in turn, determines what back-end your pipeline will run on.
The Beam SDKs provide a number of abstractions that simplify the mechanics of large-scale distributed data processing. The same Beam abstractions work with both
batch and streaming data sources. When you create your Beam pipeline, you can think about your data processing task in terms of these abstractions. They include:
Pipeline: A Pipeline encapsulates your entire data processing task, from start to finish. This includes reading input data, transforming that data, and writing output data. All Beam driver programs must create a Pipeline. When you create the Pipeline, you must also specify the execution options that tell the Pipeline where and how to run.
PCollection: A PCollection represents a distributed data set that your Beam pipeline operates on. The data set can be bounded, meaning it comes from a fixed source like a file, or unbounded, meaning it comes from a continuously updating source via a subscription or other mechanism. Your pipeline typically creates an initial PCollection by reading data from an external data source, but you can also create a PCollection from in-memory data within your driver program. From there, PCollections are the inputs and outputs for each step in your pipeline.
PTransform: A PTransform represents a data processing operation, or a step, in your pipeline. Every PTransform takes one or more PCollection objects as input, performs a processing function that you provide on the elements of that PCollection, and produces zero or more output PCollection objects.
I/O transforms: Beam comes with a number of “IOs” - library PTransforms that read or write data to various external storage systems.
A typical Beam driver program works as follows:
Create a Pipeline object and set the pipeline execution options, including the Pipeline Runner.
Create an initial PCollection for pipeline data, either using the IOs to read data from an external storage system, or using a Create transform to build
a PCollection from in-memory data.
Apply PTransforms to each PCollection. Transforms can change, filter, group, analyze, or otherwise process the elements in a PCollection. A transform creates a new output PCollection without modifying the input collection. A typical pipeline applies subsequent transforms to each new output PCollection in turn until processing is complete. However, note that a pipeline does not have to be a single straight line of transforms applied one after another: think of PCollections as variables and PTransforms as functions applied to these variables: the shape of the pipeline can be an arbitrarily complex processing graph.
Use IOs to write the final, transformed PCollection(s) to an external source.
Run the pipeline using the designated Pipeline Runner.
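Taken together, those steps look roughly like the following minimal sketch, assuming the Java SDK. The bucket paths and the Filter step are illustrative placeholders, not part of the guide:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Filter;

public class MinimalPipeline {
    public static void main(String[] args) {
        // Create a Pipeline with execution options taken from the command line.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        // Create an initial PCollection by reading from external storage,
        // apply a PTransform (each apply produces a new PCollection),
        // and write the final PCollection back out.
        p.apply("ReadLines", TextIO.read().from("gs://my-bucket/input.txt"))
         .apply("DropEmptyLines", Filter.by((String line) -> !line.isEmpty()))
         .apply("WriteLines", TextIO.write().to("gs://my-bucket/output"));

        // Run the pipeline using the designated runner.
        p.run().waitUntilFinish();
    }
}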
When you run your Beam driver program, the Pipeline Runner that you designate constructs a workflow graph of your pipeline based on the PCollection objects
you’ve created and transforms that you’ve applied. That graph is then executed using the appropriate distributed processing back-end, becoming an asynchronous
“job” (or equivalent) on that back-end.
2. Creating a pipeline
The Pipeline abstraction encapsulates all the data and steps in your data processing task. Your Beam driver program typically starts by constructing
a Pipeline object, and then using that object as the basis for creating the pipeline’s data sets as PCollections and its operations as transforms.
To use Beam, your driver program must first create an instance of the Beam SDK class Pipeline (typically in the main() function). When you create your Pipeline,
you’ll also need to set some configuration options. You can set your pipeline’s configuration options programmatically, but it’s often easier to set the options ahead of
time (or read them from the command line) and pass them to the Pipeline object when you create the object.
// Start by defining the options for the pipeline.
PipelineOptions options = PipelineOptionsFactory.create();
// Then create the pipeline.
Pipeline p = Pipeline.create(options);
2.1. Configuring pipeline options
Use the pipeline options to configure different aspects of your pipeline, such as the pipeline runner that will execute your pipeline and any runner-specific configuration
required by the chosen runner. Your pipeline options will potentially include information such as your project ID or a location for storing files.
When you run the pipeline on a runner of your choice, a copy of the PipelineOptions will be available to your code. For example, if you add a PipelineOptions
parameter to a DoFn’s @ProcessElement method, it will be populated by the system.
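For example, a DoFn along the following lines would have its options parameter populated at runtime (a minimal sketch; the DoFn body itself is illustrative):

import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.transforms.DoFn;

// A DoFn whose @ProcessElement method declares a PipelineOptions parameter;
// the runner fills it in with the options the pipeline was launched with.
class TagWithRunnerFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c, PipelineOptions options) {
        c.output(c.element() + " [runner: " + options.getRunner().getSimpleName() + "]");
    }
}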
2.1.1. Setting PipelineOptions from command-line arguments
While you can configure your pipeline by creating a PipelineOptions object and setting the fields directly, the Beam SDKs include a command-line parser that you
can use to set fields in PipelineOptions using command-line arguments.
To read options from the command-line, construct your PipelineOptions object as demonstrated in the following example code:
PipelineOptions options =
PipelineOptionsFactory.fromArgs(args).withValidation().create();
This interprets command-line arguments that follow the format:
--<option>=<value>
Note: Appending the method .withValidation will check for required command-line arguments and validate argument values.
Building your PipelineOptions this way lets you specify any of the options as a command-line argument.
Note: The WordCount example pipeline demonstrates how to set pipeline options at runtime by using command-line options.
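As a hedged illustration, the same parsing with a hard-coded argument array; --runner and --tempLocation are standard options, and the values here are placeholders:

// Parse a fixed set of arguments instead of the real command line.
String[] sampleArgs = new String[] {
    "--runner=DirectRunner",             // which back-end runs the pipeline
    "--tempLocation=gs://my-bucket/tmp"  // where the runner may stage temporary files
};
PipelineOptions options =
    PipelineOptionsFactory.fromArgs(sampleArgs).withValidation().create();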
2.1.2. Creating custom options
You can add your own custom options in addition to the standard PipelineOptions. To add your own options, define an interface with getter and setter methods for
each option, as in the following example for adding input and output custom options:
public interface MyOptions extends PipelineOptions {
    String getInput();
    void setInput(String input);

    String getOutput();
    void setOutput(String output);
}
You can also specify a description, which appears when a user passes --help as a command-line argument, and a default value.
You set the description and default value using annotations, as follows:
public interface MyOptions extends PipelineOptions {
    @Description("Input for the pipeline")
    @Default.String("gs://my-bucket/input")
    String getInput();
    void setInput(String input);

    @Description("Output for the pipeline")
    @Default.String("gs://my-bucket/output")
    String getOutput();
    void setOutput(String output);
}
It’s recommended that you register your interface with PipelineOptionsFactory and then pass the interface when creating the PipelineOptions object. When you
register your interface with PipelineOptionsFactory, --help can find your custom options interface and add it to the output of the --help command. PipelineOptionsFactory will also validate that your custom options are compatible with all other registered options.
The following example code shows how to register your custom options interface with PipelineOptionsFactory:
PipelineOptionsFactory.register(MyOptions.class);
MyOptions options = PipelineOptionsFactory.fromArgs(args)
    .withValidation()
    .as(MyOptions.class);
Now your pipeline can accept --input=value and --output=value as command-line arguments.
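Once parsed, the custom options can configure the pipeline’s steps, as in this short sketch (the TextIO read and write are illustrative assumptions):

// Wire the parsed custom options into the pipeline's source and sink.
Pipeline p = Pipeline.create(options);
p.apply("ReadInput", TextIO.read().from(options.getInput()))
 .apply("WriteOutput", TextIO.write().to(options.getOutput()));
p.run();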
3. PCollections
The PCollection abstraction represents a potentially distributed, multi-element data set. You can think of a PCollection as “pipeline” data; Beam transforms
use PCollection objects as inputs and outputs. As such, if you want to work with data in your pipeline, it must be in the form of a PCollection.
After you’ve created your Pipeline, you’ll need to begin by creating at least one PCollection in some form. The PCollection you create serves as the input for the
first operation in your pipeline.
3.1. Creating a PCollection
You create a PCollection either by reading data from an external source using Beam’s Source API, or by creating a PCollection of data stored in an in-memory
collection class in your driver program. The former is typically how a production pipeline would ingest data; Beam’s Source APIs contain adapters to help you read
from external sources like large cloud-based files, databases, or subscription services. The latter is primarily useful for testing and debugging purposes.
3.1.1. Reading from an external source
To read from an external source, you use one of the Beam-provided I/O adapters. The adapters vary in their exact usage, but all of them read from some external data
source and return a PCollection whose elements represent the data records in that source.
Each data source adapter has a Read transform; to read, you must apply that transform to the Pipeline object itself. TextIO.Read, for example, reads from an
external text file and returns a PCollection whose elements are of type String; each String represents one line from the text file. Here’s how you would
apply TextIO.Read to your Pipeline to create a PCollection:
public static void main(String[] args) {
    // Create the pipeline.
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // Create the PCollection 'lines' by applying a 'Read' transform.
    PCollection<String> lines = p.apply(
        "ReadMyFile", TextIO.read().from("gs://some/inputData.txt"));
}
See the section on I/O to learn more about how to read from the various data sources supported by the Beam SDK.
3.1.2. Creating a PCollection from in-memory data
To create a PCollection from an in-memory Java Collection, you use the Beam-provided Create transform. Much like a data adapter’s Read, you
apply Create directly to your Pipeline object itself.
As parameters, Create accepts the Java Collection and a Coder object. The Coder specifies how the elements in the Collection should be encoded.
The following example code shows how to create a PCollection from an in-memory List:
public static void main(String[] args) {
    // Create a Java Collection, in this case a List of Strings.
    final List<String> LINES = Arrays.asList(
        "To be, or not to be: that is the question: ",
        "Whether 'tis nobler in the mind to suffer ",
        "The slings and arrows of outrageous fortune, ",
        "Or to take arms against a sea of troubles, ");

    // Create the pipeline.
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // Apply Create, passing the list and the coder, to create the PCollection.
    p.apply(Create.of(LINES)).setCoder(StringUtf8Coder.of());
}
3.2. PCollection characteristics
A PCollection is owned by the specific Pipeline object for which it is created; multiple pipelines cannot share a PCollection. In some respects,
a PCollection functions like a collection class. However, a PCollection can differ in a few key ways:
3.2.1. Element type
The elements of a PCollection may be of any type, but must all be of the same type. However, to support distributed processing, Beam needs to be able to encode
each individual element as a byte string (so elements can be passed around to distributed workers). The Beam SDKs provide a data encoding mechanism that
includes built-in encoding for commonly-used types as well as support for specifying custom encodings as needed.
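For instance, a PCollection of key/value pairs can be given an explicit encoding assembled from built-in coders (a minimal sketch; the element values are illustrative):

// Encode KV<String, Integer> elements using built-in coders: UTF-8 strings
// for the keys and variable-length integers for the values.
PCollection<KV<String, Integer>> pairs = p.apply(
    Create.of(KV.of("apple", 1), KV.of("banana", 2))
          .withCoder(KvCoder.of(StringUtf8Coder.of(), VarIntCoder.of())));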