FlumeJava: A Java Library for Building Efficient Data-Parallel Pipelines

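"FlumeJava: Easy, Efficient Data-Parallel Pipelines"

In large-scale data processing, FlumeJava is a Java library for building, testing, and running efficient data-parallel pipelines. It targets complex computations that would otherwise require chaining a series of MapReduce jobs, and it simplifies both programming and managing them. Developed at Google by Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R. Henry, Robert Bradshaw, and Nathan Weizenbaum, the library aims to provide a simple, high-level, uniform abstraction over different data representations and execution strategies.

At the core of FlumeJava are two classes that represent immutable parallel collections. These collections support a set of operations for processing them in parallel. By working with parallel collections and their operations, developers can handle a variety of data types and execution strategies without having to understand the underlying execution details.

To keep parallel operations efficient, FlumeJava uses deferred (lazy) evaluation. Invoking an operation does not run it; instead, FlumeJava internally builds a dataflow graph that serves as an execution plan. The graph is executed only when the final results of the parallel operations are requested, which improves resource utilization and parallelism and avoids unnecessary computation. For large data sets this can noticeably reduce running time and resource consumption.

A key advantage of FlumeJava is its flexibility. It can adapt to different hardware environments and data sources, and it lets users adjust the execution strategy as needed. Because it is a Java library, developers can also use the wider Java ecosystem and its tooling, such as JUnit for unit testing, Maven for project management, and an IDE for debugging.

In practice, FlumeJava helps developers build data-processing pipelines quickly: for example, collecting data from multiple sources, preprocessing it, and passing it on to a storage system or a further analysis stage. Its concise API reduces the effort of writing and maintaining such pipelines, so developers can focus on business logic rather than on low-level parallel-computing details. In short, FlumeJava is best suited to computations that need multiple MapReduce steps, where it aims to keep code readable and maintainable without giving up execution efficiency.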

computation, and abstract away from the lower-level “physical” details of the different kinds of input and output storage formats and the appropriate partitioning of the logical computation into a graph of MapReduces.

3.1 Core Abstractions

The central class of the FlumeJava library is PCollection<T>, a (possibly huge) immutable bag of elements of type T. A PCollection can either have a well-defined order (called a sequence), or the elements can be unordered (called a collection). Because they are less constrained, collections are more efficient to generate and process than sequences. A PCollection<T> can be created from an in-memory Java Collection<T>. A PCollection<T> can also be created by reading a file in one of several possible formats. For example, a text file can be read as a PCollection<String>, and a binary record-oriented file can be read as a PCollection<T>, given a specification of how to decode each binary record into a Java object of type T. Data sets represented by multiple file shards can be read in as a single logical PCollection. For example:

    PCollection<String> lines =
        readTextFileCollection("/gfs/data/shakes/hamlet.txt");
    PCollection<DocInfo> docInfos =
        readRecordFileCollection("/gfs/webdocinfo/part-*",
                                 recordsOf(DocInfo.class));

In this code, recordsOf(...) specifies a particular way in which a DocInfo instance is encoded as a binary record. Other predefined encoding specifiers are strings() for UTF-8-encoded text, ints() for a variable-length encoding of 32-bit integers, and pairsOf(e1,e2) for an encoding of pairs derived from the encodings of the components. Users can specify their own custom encodings.
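The idea that a pair encoding is derived from the encodings of its components can be sketched in a few lines of plain Java. This is only an illustration: the Encoding interface, the toBytes helper, and the byte layouts (a 7-bits-per-byte variable-length integer and a length-prefixed UTF-8 string) are assumptions made here, not the formats or API that FlumeJava actually uses. Only the names ints(), strings(), and pairsOf() are borrowed from the text.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical sketch of composable encodings; not the FlumeJava API, and
// the byte layouts chosen here are assumptions made for illustration.
class EncodingSketch {
    interface Encoding<T> {
        void encode(T value, ByteArrayOutputStream out);
    }

    // Variable-length integers: 7 payload bits per byte, low-order group
    // first; the high bit of a byte is set when more bytes follow.
    static Encoding<Integer> ints() {
        return (value, out) -> {
            int v = value;
            while ((v & ~0x7F) != 0) {
                out.write((v & 0x7F) | 0x80);
                v >>>= 7;
            }
            out.write(v);
        };
    }

    // UTF-8 text, prefixed with its byte length so it can be delimited.
    static Encoding<String> strings() {
        return (value, out) -> {
            byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
            ints().encode(utf8.length, out);
            out.write(utf8, 0, utf8.length);
        };
    }

    // A pair encoding built purely from the encodings of its components.
    static <A, B> Encoding<Map.Entry<A, B>> pairsOf(Encoding<A> e1,
                                                    Encoding<B> e2) {
        return (pair, out) -> {
            e1.encode(pair.getKey(), out);
            e2.encode(pair.getValue(), out);
        };
    }

    // Convenience helper: encode one value into a fresh byte array.
    static <T> byte[] toBytes(Encoding<T> encoding, T value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        encoding.encode(value, out);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Map.Entry<String, Integer> pair = new SimpleEntry<>("hi", 300);
        Encoding<Map.Entry<String, Integer>> enc = pairsOf(strings(), ints());
        // 1 length byte + 2 text bytes + 2 integer bytes
        System.out.println(toBytes(enc, pair).length);   // prints 5
    }
}
```

Because pairsOf() only delegates to its two arguments, any custom encoding that implements the same interface can be used as a component without changing the pair logic.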

A second core class is PTable<K,V>, which represents a (possibly huge) immutable multi-map with keys of type K and values of type V. PTable<K,V> is a subclass of PCollection<Pair<K,V>>, and indeed is just an unordered bag of pairs. Some FlumeJava operations apply only to PCollections of pairs, and in Java we choose to define a subclass to capture this abstraction; in another language, PTable<K,V> might better be defined as a type synonym of PCollection<Pair<K,V>>.

The main way to manipulate a PCollection is to invoke a data-parallel operation on it. The FlumeJava library defines only a few primitive data-parallel operations; other operations are implemented in terms of these primitives. The core data-parallel primitive is parallelDo(), which supports elementwise computation over an input PCollection<T> to produce a new output PCollection<S>. This operation takes as its main argument a DoFn<T, S>, a function-like object defining how to map each value in the input PCollection<T> into zero or more values to appear in the output PCollection<S>. It also takes an indication of the kind of PCollection or PTable to produce as a result. For example:

    PCollection<String> words =
        lines.parallelDo(new DoFn<String,String>() {
          void process(String line, EmitFn<String> emitFn) {
            // Emit every word of the line as a separate output element.
            for (String word : splitIntoWords(line)) {
              emitFn.emit(word);
            }
          }
        }, collectionOf(strings()));

In this code, collectionOf(strings()) specifies that the parallelDo() operation should produce an unordered PCollection whose String elements should be encoded using UTF-8. Other options include sequenceOf(elemEncoding) for ordered PCollections and tableOf(keyEncoding, valueEncoding) for PTables. emitFn is a call-back function FlumeJava passes to the user’s process(...) method, which should invoke emitFn.emit(outElem) for each outElem that should be added to the output PCollection. FlumeJava includes subclasses of DoFn, e.g., MapFn and FilterFn, that provide simpler interfaces in special cases. There is also a version of parallelDo() that allows multiple output PCollections to be produced simultaneously from a single traversal of the input PCollection.

Footnote: Some of these examples have been simplified in minor ways from the real versions, for clarity and compactness.
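Since the FlumeJava classes themselves are not shown in this excerpt, the elementwise contract of parallelDo() can be approximated with an in-memory stand-in. The sketch below is an assumption-laden model, not the library: DoFn and EmitFn are re-declared locally as minimal interfaces, a java.util.List plays the role of a PCollection, the encoding argument is dropped, elements are processed sequentially, and splitIntoWords() is given a guessed whitespace-splitting body because the original example does not define it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// In-memory stand-in for the element-wise semantics of parallelDo().
// The names mirror the excerpt, but this is NOT the FlumeJava library.
class ParallelDoSketch {
    // Call-back used by a DoFn to add elements to the output.
    interface EmitFn<S> {
        void emit(S outElem);
    }

    // Maps one input element to zero or more output elements.
    interface DoFn<T, S> {
        void process(T in, EmitFn<S> emitFn);
    }

    // Sequential model: apply fn to every element and gather what it emits.
    static <T, S> List<S> parallelDo(List<T> input, DoFn<T, S> fn) {
        List<S> output = new ArrayList<>();
        for (T element : input) {
            fn.process(element, output::add);
        }
        return output;
    }

    // Assumed helper: the excerpt calls splitIntoWords() without defining it.
    static List<String> splitIntoWords(String line) {
        List<String> words = new ArrayList<>();
        for (String w : line.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // The word-splitting example from the text, run against the stand-in.
    static List<String> words(List<String> lines) {
        return parallelDo(lines, new DoFn<String, String>() {
            @Override
            public void process(String line, EmitFn<String> emitFn) {
                for (String word : splitIntoWords(line)) {
                    emitFn.emit(word);
                }
            }
        });
    }

    public static void main(String[] args) {
        // The empty line contributes zero output elements.
        System.out.println(words(Arrays.asList("to be or", "", "not to be")));
        // prints [to, be, or, not, to, be]
    }
}
```

The empty input line shows the "zero or more" part of the contract: a DoFn is free to emit nothing for a given element, which is also how a filter can be expressed with the same primitive.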

parallelDo() can be used to express both the map and reduce parts of MapReduce. Since they will potentially be distributed remotely and run in parallel, DoFn functions should not access any global mutable state of the enclosing Java program. Ideally, they should be pure functions of their inputs. It is also legal for DoFn objects to maintain local instance variable state, but users should be aware that there may be multiple DoFn replicas operating concurrently with no shared state. These restrictions are shared by MapReduce as well.

A second primitive, groupByKey(), converts a multi-map of type PTable<K,V> (which can have many key/value pairs with the same key) into a uni-map of type PTable<K, Collection<V>> where each key maps to an unordered, plain Java Collection of all the values with that key. For example, the following computes a table mapping URLs to the collection of documents that link to them:

    PTable<URL,DocInfo> backlinks =
        docInfos.parallelDo(new DoFn<DocInfo,
                                     Pair<URL,DocInfo>>() {
          void process(DocInfo docInfo,
                       EmitFn<Pair<URL,DocInfo>> emitFn) {
            // Emit one (target URL, referring document) pair per link.
            for (URL targetUrl : docInfo.getLinks()) {
              emitFn.emit(Pair.of(targetUrl, docInfo));
            }
          }
        }, tableOf(recordsOf(URL.class),
                   recordsOf(DocInfo.class)));
    PTable<URL,Collection<DocInfo>> referringDocInfos =
        backlinks.groupByKey();

groupByKey() captures the essence of the shuffle step of MapReduce. There is also a variant that allows specifying a sorting order for the collection of values for each key.
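The multi-map to uni-map conversion performed by groupByKey() can be modelled with ordinary Java collections. The following is a minimal sketch under stated assumptions, not the FlumeJava implementation: a List of Map.Entry pairs stands in for PTable<K,V>, a Map<K, List<V>> stands in for PTable<K, Collection<V>>, and plain strings replace the URL and DocInfo types. The sketch happens to keep values in arrival order, whereas the text only promises an unordered collection.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory model of groupByKey() semantics; not the FlumeJava library.
class GroupByKeySketch {
    // Turns a bag of (key, value) pairs, where keys may repeat, into a map
    // from each distinct key to the collection of all its values.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // (target URL, referring document) pairs, as in the backlinks example.
        List<Map.Entry<String, String>> backlinks = new ArrayList<>();
        backlinks.add(new SimpleEntry<>("a.example", "doc1"));
        backlinks.add(new SimpleEntry<>("b.example", "doc1"));
        backlinks.add(new SimpleEntry<>("a.example", "doc2"));
        System.out.println(groupByKey(backlinks));
        // prints {a.example=[doc1, doc2], b.example=[doc1]}
    }
}
```

In a distributed setting the same grouping requires moving all pairs with equal keys to the same worker, which is why the text equates this primitive with the MapReduce shuffle.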

A third primitive, combineValues(), takes an input PTable<K, Collection<V>> and an associative combining function on Vs, and returns a PTable<K, V> where each input collection of values has been combined into a single output value. For example:

    PTable<String,Integer> wordsWithOnes =
        words.parallelDo(
            new DoFn<String, Pair<String,Integer>>() {
              void process(String word,
                           EmitFn<Pair<String,Integer>> emitFn) {
                emitFn.emit(Pair.of(word, 1));
              }
            }, tableOf(strings(), ints()));
    PTable<String,Collection<Integer>>
        groupedWordsWithOnes = wordsWithOnes.groupByKey();
    PTable<String,Integer> wordCounts =
        groupedWordsWithOnes.combineValues(SUM_INTS);

combineValues() is semantically just a special case of parallelDo(), but the associativity of the combining function allows it to be implemented via a combination of a MapReduce combiner (which runs as part of each mapper) and a MapReduce reducer (to finish the combining), which is more efficient than doing all the combining in the reducer.
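Why associativity permits the combiner/reducer split can be seen in a small in-memory model of word counting. This is a sketch under assumptions rather than FlumeJava or MapReduce code: each inner list plays the role of the input seen by one mapper, combineShard() and reduce() are hypothetical names for the two phases, and Integer::sum stands in for SUM_INTS.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Models why an associative combining function permits partial combining.
// Hypothetical in-memory sketch; not the FlumeJava or MapReduce API.
class CombineValuesSketch {
    // "Combiner" step: pre-combine the values seen by one mapper shard.
    // Each word contributes the value 1, as in the wordsWithOnes example.
    static Map<String, Integer> combineShard(List<String> words,
                                             BinaryOperator<Integer> fn) {
        Map<String, Integer> partial = new HashMap<>();
        for (String w : words) {
            partial.merge(w, 1, fn);
        }
        return partial;
    }

    // "Reducer" step: finish combining the per-shard partial results.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials,
                                       BinaryOperator<Integer> fn) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, fn));
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> shard1 = Arrays.asList("to", "be", "or");
        List<String> shard2 = Arrays.asList("not", "to", "be");
        Map<String, Integer> counts = reduce(
            Arrays.asList(combineShard(shard1, Integer::sum),
                          combineShard(shard2, Integer::sum)),
            Integer::sum);
        System.out.println(counts.get("to") + " " + counts.get("not"));
        // prints 2 1
    }
}
```

Each shard sends at most one partial value per distinct key to the reducer instead of one value per occurrence, and because the function is associative the final totals match what a single reducer would compute from all the raw values.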

A fourth primitive, flatten(), takes a list of PCollection<T>s and returns a single PCollection<T> that
