be distributed within the system to provide additional protections against failures, as
well as significant opportunities for scaling performance.
Messages and Batches
The unit of data within Kafka is called a message. If you are approaching Kafka from a
database background, you can think of this as similar to a row or a record. A message
is simply an array of bytes as far as Kafka is concerned, so the data contained within
it does not have a specific format or meaning to Kafka. A message can have an optional
piece of metadata, which is referred to as a key. The key is also a byte array and, as with
the message, has no specific meaning to Kafka. Keys are used when messages are to
be written to partitions in a more controlled manner. The simplest such scheme is to
treat partitions as a hash ring and ensure that messages with the same key are always
written to the same partition. Usage of keys is discussed more thoroughly in Chapter 3.
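As a minimal sketch of that hash-ring idea, the following Java snippet maps a key to a
partition by hashing the key bytes modulo the partition count. The key and partition
count here are hypothetical, and a plain Java hash stands in for the murmur2 hash that
Kafka's default partitioner actually applies to the key bytes:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class KeyPartitionerSketch {
        // Map a message key to a partition so that equal keys always
        // land on the same partition. Kafka's default partitioner uses
        // a murmur2 hash of the key bytes; Arrays.hashCode is used here
        // purely for illustration.
        static int partitionFor(byte[] key, int numPartitions) {
            return Math.floorMod(Arrays.hashCode(key), numPartitions);
        }

        public static void main(String[] args) {
            byte[] key = "customer-42".getBytes(StandardCharsets.UTF_8);
            // Same key, same partition count -> same partition, every time
            System.out.println(partitionFor(key, 6));
        }
    }

Because the mapping is deterministic, all messages for a given key end up in a single
partition, which also preserves their relative order.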
For efficiency, messages are written into Kafka in batches. A batch is just a collection
of messages, all of which are being produced to the same topic and partition. An
individual round trip across the network for each message would result in excessive
overhead, and collecting messages together into a batch reduces this. This, of course,
presents a tradeoff between latency and throughput: the larger the batches, the more
messages that can be handled per unit of time, but the longer it takes an individual
message to propagate. Batches are also typically compressed, which provides for more
efficient data transfer and storage at the cost of some processing power.
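As a sketch of how this tradeoff is typically tuned, the following Java producer uses
Kafka's batch.size, linger.ms, and compression.type settings; the broker address, topic
name, and the specific values chosen are illustrative assumptions, not recommendations:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class BatchingProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            // Trade latency for throughput: allow up to 32 KB per batch,
            // wait up to 10 ms for a batch to fill, and compress each batch.
            props.put("batch.size", 32768);
            props.put("linger.ms", 10);
            props.put("compression.type", "snappy");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("example-topic", "key", "value"));
            }
        }
    }

Raising linger.ms lets more messages accumulate into each batch, increasing throughput
at the cost of per-message latency, which is exactly the tradeoff described above.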
Schemas
While messages are opaque byte arrays to Kafka itself, it is recommended that additional
structure be imposed on the message content so that it can be easily understood. There
are many options available for message schemas, depending on your application's
individual needs. Simplistic systems, such as JavaScript Object Notation (JSON) and
Extensible Markup Language (XML), are easy to use and human readable. However,
they lack features such as robust type handling and compatibility between schema
versions. Many Kafka developers favor the use of Apache Avro, a serialization
framework originally developed for Hadoop. Avro provides a compact serialization
format; schemas that are separate from the message payloads and that do not require
code to be generated when they change; and strong data typing and schema evolution,
with both backward and forward compatibility.
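To make this concrete, here is a minimal sketch using Avro's Java API to parse a schema
and build a record. The Customer schema and its fields are hypothetical; the optional
email field with a null default illustrates the kind of change that remains backward and
forward compatible as a schema evolves:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public class AvroSchemaSketch {
        // A hypothetical schema: id is required; email is optional,
        // so readers with or without the field can still decode records.
        private static final String SCHEMA_JSON =
            "{\"type\": \"record\", \"name\": \"Customer\", \"fields\": ["
            + "{\"name\": \"id\", \"type\": \"long\"},"
            + "{\"name\": \"email\", \"type\": [\"null\", \"string\"],"
            + " \"default\": null}"
            + "]}";

        public static void main(String[] args) {
            Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
            GenericRecord customer = new GenericData.Record(schema);
            customer.put("id", 42L);
            customer.put("email", "user@example.com");
            System.out.println(customer);
        }
    }

Because the schema travels separately from the message payload, producers and
consumers can each upgrade on their own timetable, as the next section discusses.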
A consistent data format is important in Kafka, as it allows writing and reading messages
to be decoupled. When these tasks are tightly coupled, applications that subscribe
to messages must be updated to handle the new data format, in parallel with
the old format. Only then can the applications that publish the messages be updated
to utilize the new format. New applications that wish to use data must be coupled