Apache Kafka and MapR Streams Drive New Designs for Real-Time Stream Processing


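"Streaming Architecture: New Designs Using Apache Kafka and MapR Streams", by Ted Dunning and Ellen Friedman, explores recent developments in stream processing for big data. It examines how Apache Kafka serves as a key distributed streaming platform alongside Hadoop and Spark, and where MapR Streams fits in. Drawing on both practical experience and theory, the authors offer guidance for developers, data analysts, and system administrators who want to understand and build efficient, scalable real-time data processing systems. The book covers the following main topics:

1. **Streaming architecture basics**: the fundamental concepts of stream processing, including the event-driven processing model, the requirements of real-time data streams, and how streaming differs from batch processing. The core of a streaming architecture is handling continuous, high-throughput flows of data rather than one-off batches.

2. **Apache Kafka**: Kafka is presented in detail as a distributed messaging system that provides a reliable, high-throughput platform for real-time data streams. The book touches on Kafka's design principles, architecture, partitioning and replication, and how to configure and manage a Kafka cluster.

3. **MapR Streams**: MapR Streams is MapR's messaging system, which is compatible with the Kafka API. This summary describes it as simplifying the development of stream processing tasks, supporting real-time analytics and SQL queries, and integrating with the MapR Data Platform.

4. **Integration with Hadoop and Spark**: how to combine Apache Kafka and MapR Streams with the Hadoop ecosystem, for example by processing data through YARN or Spark Streaming. The book also discusses using Hadoop storage to persist stream data and how to tune performance and fault tolerance.

5. **Case studies and best practices**: beyond theory, the book includes practical use cases and best practices for designing and implementing stream processing applications such as log analysis, real-time monitoring, and real-time decision support.

6. **Certification and training**: for readers who want to build their skills further, the book mentions online training courses offered by MapR, including free Hadoop training resources.

7. **Copyright information**: the copyright notice states that all rights are reserved, notes that copies may be purchased for educational, business, or sales promotional use, and gives O'Reilly Media contact details for more information.

Overall, the book is a reference for professionals working in big data and real-time analytics. It explains the key technologies and tools of modern streaming architectures and helps readers build and optimize high-performance stream processing systems.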

Figure 1-2. Display of a smartphone application known as Waze. In addition to providing point-to-point directions, it also adds value by supplying real-time traffic information shared by millions of drivers.

Knowing that there is a slow-down caused by an accident on a particular freeway during the morning commute is useful to a driver while the incident and its effect on traffic are happening. Knowing about this an hour after the event or at the end of the day, in contrast, has much less value, except perhaps as a way to review the history of traffic patterns. But these after-the-fact insights do little to help the morning commuter get to work faster. Waze is just one straightforward example of the time-value of information: the value of that particular knowledge decreases quickly with elapsed time. Being able to process streaming data via a 4G network and deliver reports to drivers in a timely manner is essential for this navigation tool to work as it is intended.

6 | Chapter 1: Why Stream?

Low-latency analysis of streaming data lets you respond to life as it happens.

Time-value of information is significant in many use cases where the value of particular insights diminishes very quickly after the event. The following section touches on a few more examples.

Where Streaming Matters

Let’s start with retail marketing. Consider the opportunities for improving customer experience and raising a customer’s tendency to buy something as they pass through a brick-and-mortar store. Perhaps the customer would be encouraged by a discount coupon, particularly if it were for an item or service that really appealed to them.

The idea of encouraging sales through coupons is certainly not new, but think of the evolution in style and effectiveness of how this marketing technique can be applied. In the somewhat distant past, discount coupons were mailed en masse to the public, with only very rough targeting in terms of large areas of population, very much a fire hose approach. Improvements were made when coupons were offered to a more selective mailing list based on other information about a customer’s interests or activities. But even if the coupon was well-matched to the customer’s interest, there was a large gap in time and focus between receiving it via mail or newspaper and being able to act on it by going to the store. That left plenty of time for the impact of the coupon to “wear off” as the customer became distracted by other issues, making even this targeted approach fairly hit-or-miss.

Now imagine instead that as a customer passes through a store, a display sign lights up as they pass to offer a nice selection of colors in a specific style of sweater or handbag that interests them. Perhaps a discount coupon code shows up on the customer’s phone as they reach the electronics department. Or suppose the store is an outdoor outfitter that can distinguish customers who are interested in camping plus canoeing from those who like camping plus mountain biking, based on their past purchases or web-viewing habits. Beacons might react to the smartphones of customers as they enter and provide offers via text messages to their phones that fit these different tastes. How much more effective could a discount coupon be if it’s offered not only to the right person but also at just the right moment?

Streaming Data: Life As It Happens | 7

These new approaches to customer-responsive, in-the-moment marketing are already being implemented by some large retail merchants, in some cases developed in-house and in others through vendors who provide innovative new services. The ability to recognize the presence of a particular customer may make use of a WiFi connection to a cell phone or sometimes via beacons placed strategically in a store. These techniques are not limited to retail stores. Hotels and other service organizations are also beginning to look at how these approaches can help them better recognize return customers or be alert to constantly changing levels needed for service at check-in or in the hotel lounge.

These approaches are not limited to retail marketing. Surprisingly, similar techniques can also be used to track the position of garbage trucks and how they service “smart” dumpsters that announce their relative fill levels. Trucks can be deployed on customized schedules that better match actual needs, thus optimizing operations with regard to drivers’ time, gas consumption, and equipment usage.

The main goal in each of these sample situations is to gain actionable insights in a timely manner. The response to these insights may be made by humans or may be automated processes. Either way, timing is the key. The aim is to exploit streaming data and new technologies to be able to respond to life in the moment. But as it turns out, that’s not the only advantage to be gained from using streaming data, as we discuss later in this chapter. It turns out that a streaming architecture forms the core for a wide-ranging set of processes, some of which you may not previously have thought of in terms of streaming.

One of the most important and widespread situations in which it is important to be able to carry out low-latency analytics on streaming data is for defending data security. With a well-designed project, it is possible to monitor a large variety of things that take place in a system. These actions might include the transactions involving a credit card or the sequence of events related to logins for a banking website. With anomaly detection techniques and very low-latency technologies, cyber attacks by humans or robots may be discovered quickly so that action can be taken to thwart the intrusion or at least to mitigate loss.
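To make the idea of low-latency anomaly detection concrete, here is a minimal sketch in Python. It is not the authors' method: it assumes a simple rolling z-score over recent values, and the function name `detect_anomalies`, the window size, the threshold, and the sample transaction amounts are all hypothetical choices made for illustration.

```python
from collections import deque
from math import sqrt
from typing import Iterable, Iterator, Tuple


def detect_anomalies(
    events: Iterable[float], window: int = 20, threshold: float = 3.0
) -> Iterator[Tuple[int, float]]:
    """Yield (index, value) for values far from the recent rolling mean.

    Assumption: a plain z-score over the last `window` values stands in
    for whatever anomaly detection technique a real system would use.
    """
    recent = deque(maxlen=window)  # only the last `window` values are kept
    for i, value in enumerate(events):
        if len(recent) >= 5:  # wait for a little history before judging
            mean = sum(recent) / len(recent)
            var = sum((x - mean) ** 2 for x in recent) / len(recent)
            std = sqrt(var)
            if std > 0 and abs(value - mean) / std > threshold:
                yield i, value  # flagged as soon as the event arrives
        recent.append(value)


# Hypothetical credit card transaction amounts arriving one at a time.
amounts = [20.0, 22.5, 19.0, 21.0, 20.5, 23.0, 18.5, 950.0, 21.5, 20.0]
flagged = list(detect_anomalies(amounts))
print(flagged)
```

Because the generator inspects each event as it arrives, the unusual amount is reported immediately rather than at the end of a batch. In this simple version a flagged value still enters the window, which inflates the spread for later events; a production system would likely handle that differently.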

Batch Versus Streaming

In the past, in order to handle data analysis at scale, data was collected and analyzed in batch. What’s the difference between a batch and a streaming process? Consider for a moment this simple analogy: compare data to water that may be collected in a bucket and delivered to the user versus water that flows to the user via a pipe.

It’s possible to put a valve on the pipe such that the flow of water is periodically interrupted when the tap is closed. But with the pipe and valve, it is the choice of the user whether to hold back the water or to let it flow; it can handle both styles of delivery. In contrast, even if you carry buckets very quickly to the recipient, the water delivered by bucket (batch) will never occur as a continuous stream.

In computing, batch processing is a good way to deal with huge amounts of distributed data, and batch-based computational approaches such as MapReduce or Spark are still useful in many situations. If you require an hourly summation of a series of events and an end-of-day or weekly final sum, batch processes may serve your needs well. But for many use cases, batch does not sufficiently reflect the way life happens. That observation underlies the increasing interest in flow-based computing, which is explained more thoroughly in Chapter 3.
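The contrast between the two styles can be sketched with the hourly summation mentioned above. This is an illustrative example only: the `(timestamp, value)` event shape and the function names `batch_hourly_sums` and `streaming_hourly_sums` are assumptions, not anything defined in the book.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[int, float]  # (timestamp in seconds, value); a hypothetical shape


def batch_hourly_sums(events: List[Event]) -> Dict[int, float]:
    """Batch style: all events are collected first, then summed in one pass."""
    sums: Dict[int, float] = defaultdict(float)
    for ts, value in events:
        sums[ts // 3600] += value  # bucket each event by its hour
    return dict(sums)


def streaming_hourly_sums(events: Iterable[Event]) -> Iterator[Tuple[int, float]]:
    """Streaming style: emit the updated hourly sum as each event arrives."""
    sums: Dict[int, float] = defaultdict(float)
    for ts, value in events:
        hour = ts // 3600
        sums[hour] += value
        yield hour, sums[hour]  # a result is available after every event


events = [(10, 1.0), (1800, 2.0), (3700, 5.0), (7100, 0.5)]
batch_result = batch_hourly_sums(events)
stream_updates = list(streaming_hourly_sums(iter(events)))
print(batch_result)
print(stream_updates)
```

Both versions reach the same final totals. The difference is when an answer exists: the batch function returns nothing until the whole collection has been read, while the streaming generator produces a usable running total after each event, much like the pipe with a valve in the analogy.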

As mentioned earlier, the benefits of adopting a streaming style of handling data go far beyond the opportunity to carry out real-time or near-real-time analytics, as powerful as those immediate insights may be. Some of the broader advantages require durability: you need a message-passing system that persists the event stream data in such a way that you can apply checkpoints to let you restart reading from a specific point in the flow.
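The following toy sketch shows why persistence plus checkpoints allows a reader to restart from a specific point. It is an in-memory stand-in, not the Kafka or MapR Streams API: the classes `EventLog` and `CheckpointedReader`, the reader name, and the sample events are all hypothetical, and a real system would keep both the log and the checkpoints in durable storage.

```python
from typing import Dict, List


class EventLog:
    """Toy append-only log; a stand-in for a durable message stream."""

    def __init__(self) -> None:
        self._events: List[str] = []

    def append(self, event: str) -> int:
        self._events.append(event)
        return len(self._events) - 1  # offset of the newly written event

    def read_from(self, offset: int) -> List[str]:
        return self._events[offset:]  # events are never removed by reading


class CheckpointedReader:
    """Reader that remembers how far it has read under a given name."""

    def __init__(self, log: EventLog, checkpoints: Dict[str, int], name: str) -> None:
        self.log = log
        self.checkpoints = checkpoints  # would be durable storage in practice
        self.name = name

    def poll(self) -> List[str]:
        start = self.checkpoints.get(self.name, 0)
        batch = self.log.read_from(start)
        self.checkpoints[self.name] = start + len(batch)  # record new position
        return batch


log = EventLog()
for e in ["login", "purchase", "logout"]:
    log.append(e)

checkpoints: Dict[str, int] = {}
first = CheckpointedReader(log, checkpoints, "analytics").poll()

log.append("login")
# A restarted reader resumes from the saved checkpoint, not from the beginning.
second = CheckpointedReader(log, checkpoints, "analytics").poll()
# Because the log persists events, any earlier offset can also be replayed.
replay = log.read_from(1)
print(first, second, replay)
```

The key design point is that reading does not consume the data. The log keeps every event, and each reader only tracks a position, so a reader can resume where it left off or rewind to an earlier offset.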
