Apache Beam Programming Guide
The Beam Programming Guide is intended for Beam users who want to use the Beam SDKs to create data processing pipelines. It provides guidance for using the
Beam SDK classes to build and test your pipeline. It is not intended as an exhaustive reference, but as a language-agnostic, high-level guide to programmatically
building your Beam pipeline. As the programming guide is filled out, the text will include code samples in multiple languages to help illustrate how to implement Beam
concepts in your pipelines.
1. Overview
To use Beam, you need to first create a driver program using the classes in one of the Beam SDKs. Your driver program defines your pipeline, including all of the
inputs, transforms, and outputs; it also sets execution options for your pipeline (typically passed in using command-line options). These include the Pipeline Runner,
which, in turn, determines what back-end your pipeline will run on.
The Beam SDKs provide a number of abstractions that simplify the mechanics of large-scale distributed data processing. The same Beam abstractions work with both
batch and streaming data sources. When you create your Beam pipeline, you can think about your data processing task in terms of these abstractions. They include:
Pipeline: A Pipeline encapsulates your entire data processing task, from start to finish. This includes reading input data, transforming that data, and writing output data. All Beam driver programs must create a Pipeline. When you create the Pipeline, you must also specify the execution options that tell the Pipeline where and how to run.
PCollection: A PCollection represents a distributed data set that your Beam pipeline operates on. The data set can be bounded, meaning it comes from a fixed source like a file, or unbounded, meaning it comes from a continuously updating source via a subscription or other mechanism. Your pipeline typically creates an initial PCollection by reading data from an external data source, but you can also create a PCollection from in-memory data within your driver program. From there, PCollections are the inputs and outputs for each step in your pipeline.
PTransform: A PTransform represents a data processing operation, or a step, in your pipeline. Every PTransform takes one or more PCollection objects as input, performs a processing function that you provide on the elements of that PCollection, and produces zero or more output PCollection objects.
I/O transforms: Beam comes with a number of “IOs” - library PTransforms that read or write data to various external storage systems.
A typical Beam driver program works as follows:
Create a Pipeline object and set the pipeline execution options, including the Pipeline Runner.
Create an initial PCollection for pipeline data, either using the IOs to read data from an external storage system, or using a Create transform to build
a PCollection from in-memory data.
Apply PTransforms to each PCollection. Transforms can change, filter, group, analyze, or otherwise process the elements in a PCollection. A transform creates a new output PCollection without modifying the input collection. A typical pipeline applies subsequent transforms to each new output PCollection in turn until processing is complete. However, note that a pipeline does not have to be a single straight line of transforms applied one after another: think of PCollections as variables and PTransforms as functions applied to these variables: the shape of the pipeline can be an arbitrarily complex processing graph.
Use IOs to write the final, transformed PCollection(s) to an external source.
Run the pipeline using the designated Pipeline Runner.
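Taken together, those steps look roughly like the following minimal sketch, assuming the Java SDK. The bucket paths and the Filter step are illustrative placeholders, not part of the guide:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Filter;

public class MinimalPipeline {
    public static void main(String[] args) {
        // Create a Pipeline with execution options taken from the command line.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        // Create an initial PCollection by reading from external storage,
        // apply a PTransform (each apply produces a new PCollection),
        // and write the final PCollection back out.
        p.apply("ReadLines", TextIO.read().from("gs://my-bucket/input.txt"))
         .apply("DropEmptyLines", Filter.by((String line) -> !line.isEmpty()))
         .apply("WriteLines", TextIO.write().to("gs://my-bucket/output"));

        // Run the pipeline using the designated runner.
        p.run().waitUntilFinish();
    }
}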
When you run your Beam driver program, the Pipeline Runner that you designate constructs a workflow graph of your pipeline based on the PCollection objects
you’ve created and transforms that you’ve applied. That graph is then executed using the appropriate distributed processing back-end, becoming an asynchronous
“job” (or equivalent) on that back-end.
2. Creating a pipeline
The Pipeline abstraction encapsulates all the data and steps in your data processing task. Your Beam driver program typically starts by constructing
a Pipeline object, and then using that object as the basis for creating the pipeline’s data sets as PCollections and its operations as transforms.
To use Beam, your driver program must first create an instance of the Beam SDK class Pipeline (typically in the main() function). When you create your Pipeline,
you’ll also need to set some configuration options. You can set your pipeline’s configuration options programmatically, but it’s often easier to set the options ahead of
time (or read them from the command line) and pass them to the Pipeline object when you create the object.
// Start by defining the options for the pipeline.
PipelineOptions options = PipelineOptionsFactory.create();
// Then create the pipeline.
Pipeline p = Pipeline.create(options);
2.1. Configuring pipeline options
Use the pipeline options to configure different aspects of your pipeline, such as the pipeline runner that will execute your pipeline and any runner-specific configuration
required by the chosen runner. Your pipeline options will potentially include information such as your project ID or a location for storing files.
When you run the pipeline on a runner of your choice, a copy of the PipelineOptions will be available to your code. For example, if you add a PipelineOptions
parameter to a DoFn’s @ProcessElement method, it will be populated by the system.
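For example, a DoFn along the following lines would have its options parameter populated at runtime (a minimal sketch; the DoFn body itself is illustrative):

import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.transforms.DoFn;

// A DoFn whose @ProcessElement method declares a PipelineOptions parameter;
// the runner fills it in with the options the pipeline was launched with.
class TagWithRunnerFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c, PipelineOptions options) {
        c.output(c.element() + " [runner: " + options.getRunner().getSimpleName() + "]");
    }
}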
2.1.1. Setting PipelineOptions from command-line arguments
While you can configure your pipeline by creating a PipelineOptions object and setting the fields directly, the Beam SDKs include a command-line parser that you
can use to set fields in PipelineOptions using command-line arguments.
To read options from the command-line, construct your PipelineOptions object as demonstrated in the following example code:
PipelineOptions options =
PipelineOptionsFactory.fromArgs(args).withValidation().create();
This interprets command-line arguments that follow the format:
--<option>=<value>
Note: Appending the method .withValidation will check for required command-line arguments and validate argument values.
Building your PipelineOptions this way lets you specify any of the options as a command-line argument.
Note: The WordCount example pipeline demonstrates how to set pipeline options at runtime by using command-line options.
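As a hedged illustration, the same parsing with a hard-coded argument array; --runner and --tempLocation are standard options, and the values here are placeholders:

// Parse a fixed set of arguments instead of the real command line.
String[] sampleArgs = new String[] {
    "--runner=DirectRunner",             // which back-end runs the pipeline
    "--tempLocation=gs://my-bucket/tmp"  // where the runner may stage temporary files
};
PipelineOptions options =
    PipelineOptionsFactory.fromArgs(sampleArgs).withValidation().create();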
2.1.2. Creating custom options
You can add your own custom options in addition to the standard PipelineOptions. To add your own options, define an interface with getter and setter methods for
each option, as in the following example for adding input and output custom options:
public interface MyOptions extends PipelineOptions {
    String getInput();
    void setInput(String input);

    String getOutput();
    void setOutput(String output);
}
You can also specify a description, which appears when a user passes --help as a command-line argument, and a default value.
You set the description and default value using annotations, as follows:
public interface MyOptions extends PipelineOptions {
    @Description("Input for the pipeline")
    @Default.String("gs://my-bucket/input")
    String getInput();
    void setInput(String input);

    @Description("Output for the pipeline")
    @Default.String("gs://my-bucket/output")
    String getOutput();
    void setOutput(String output);
}
It’s recommended that you register your interface with PipelineOptionsFactory and then pass the interface when creating the PipelineOptions object. When you
register your interface with PipelineOptionsFactory, --help can find your custom options interface and add it to the output of the --help command. PipelineOptionsFactory will also validate that your custom options are compatible with all other registered options.
The following example code shows how to register your custom options interface with PipelineOptionsFactory:
PipelineOptionsFactory.register(MyOptions.class);
MyOptions options = PipelineOptionsFactory.fromArgs(args)
    .withValidation()
    .as(MyOptions.class);
Now your pipeline can accept --input=value and --output=value as command-line arguments.
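Once parsed, the custom options can configure the pipeline’s steps, as in this short sketch (the TextIO read and write are illustrative assumptions):

// Wire the parsed custom options into the pipeline's source and sink.
Pipeline p = Pipeline.create(options);
p.apply("ReadInput", TextIO.read().from(options.getInput()))
 .apply("WriteOutput", TextIO.write().to(options.getOutput()));
p.run();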
3. PCollections
The PCollection abstraction represents a potentially distributed, multi-element data set. You can think of a PCollection as “pipeline” data; Beam transforms
use PCollection objects as inputs and outputs. As such, if you want to work with data in your pipeline, it must be in the form of a PCollection.
After you’ve created your Pipeline, you’ll need to begin by creating at least one PCollection in some form. The PCollection you create serves as the input for the
first operation in your pipeline.
3.1. Creating a PCollection
You create a PCollection either by reading data from an external source using Beam’s Source API, or by creating a PCollection of data stored in an in-memory
collection class in your driver program. The former is typically how a production pipeline would ingest data; Beam’s Source APIs contain adapters to help you read
from external sources like large cloud-based files, databases, or subscription services. The latter is primarily useful for testing and debugging purposes.
3.1.1. Reading from an external source
To read from an external source, you use one of the Beam-provided I/O adapters. The adapters vary in their exact usage, but all of them read from some external data
source and return a PCollection whose elements represent the data records in that source.
Each data source adapter has a Read transform; to read, you must apply that transform to the Pipeline object itself. TextIO.Read, for example, reads from an
external text file and returns a PCollection whose elements are of type String; each String represents one line from the text file. Here’s how you would
apply TextIO.Read to your Pipeline to create a PCollection:
public static void main(String[] args) {
    // Create the pipeline.
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // Create the PCollection 'lines' by applying a 'Read' transform.
    PCollection<String> lines = p.apply(
        "ReadMyFile", TextIO.read().from("gs://some/inputData.txt"));
}
See the section on I/O to learn more about how to read from the various data sources supported by the Beam SDK.
3.1.2. Creating a PCollection from in-memory data
To create a PCollection from an in-memory Java Collection, you use the Beam-provided Create transform. Much like a data adapter’s Read, you
apply Create directly to your Pipeline object itself.
As parameters, Create accepts the Java Collection and a Coder object. The Coder specifies how the elements in the Collection should be encoded.
The following example code shows how to create a PCollection from an in-memory List:
public static void main(String[] args) {
    // Create a Java Collection, in this case a List of Strings.
    final List<String> LINES = Arrays.asList(
        "To be, or not to be: that is the question: ",
        "Whether 'tis nobler in the mind to suffer ",
        "The slings and arrows of outrageous fortune, ",
        "Or to take arms against a sea of troubles, ");

    // Create the pipeline.
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // Apply Create, passing the list and the coder, to create the PCollection.
    p.apply(Create.of(LINES)).setCoder(StringUtf8Coder.of());
}
3.2. PCollection characteristics
A PCollection is owned by the specific Pipeline object for which it is created; multiple pipelines cannot share a PCollection. In some respects,
a PCollection functions like a collection class. However, a PCollection can differ in a few key ways:
3.2.1. Element type
The elements of a PCollection may be of any type, but must all be of the same type. However, to support distributed processing, Beam needs to be able to encode
each individual element as a byte string (so elements can be passed around to distributed workers). The Beam SDKs provide a data encoding mechanism that
includes built-in encoding for commonly-used types as well as support for specifying custom encodings as needed.
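For instance, a PCollection of key/value pairs can be given an explicit encoding assembled from built-in coders (a minimal sketch; the element values are illustrative):

// Encode KV<String, Integer> elements using built-in coders: UTF-8 strings
// for the keys and variable-length integers for the values.
PCollection<KV<String, Integer>> pairs = p.apply(
    Create.of(KV.of("apple", 1), KV.of("banana", 2))
          .withCoder(KvCoder.of(StringUtf8Coder.of(), VarIntCoder.of())));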