Apache Kafka Official Documentation Explained: Getting Started, Configuration, and API Overview

Updated 2024-07-19 · PDF, 1.71 MB

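Apache Kafka is an open-source distributed stream-processing platform, originally developed at LinkedIn and now maintained by the Apache Software Foundation. The Kafka documentation is a comprehensive guide that covers everything from installation and configuration to advanced features and design principles.

1. **Getting started and introduction**
   - Kafka is designed for high-throughput, low-latency message delivery and suits large-scale real-time stream processing. Typical use cases include log collection, monitoring data, and real-time analytics.
   - The quick start walks the user through setting up and running a basic Kafka cluster in order to try the core features.
2. **Ecosystem**
   - Kafka provides Producer and Consumer APIs together with Kafka Connect (for data integration) and Kafka Streams (for real-time data processing), and a broad set of third-party tools builds on them. The different API versions reflect Kafka's history, from the simple interfaces of older releases to the higher-level APIs of newer ones.
3. **Configuration**
   - Kafka is highly configurable: both brokers (the message nodes) and clients (producers and consumers) can be tuned. Configuration parameters cover data durability, performance, replication policy, and flow control.
4. **API design**
   - The Producer API sends messages to topics, and the Consumer API reads messages from topics. Older releases document two legacy consumer APIs, the Old High Level Consumer API and the Old Simple Consumer API; newer releases provide the New Consumer API, which emphasizes ease of use and consistency.
5. **Stream processing**
   - The Streams API is a core Kafka component for building real-time data pipelines. It reads data from a source, processes it, and writes it to a destination, and it supports complex business logic and data transformations.
6. **Design and implementation**
   - Kafka's design is motivated by the need to process large volumes of data efficiently; replication and partitioning provide data reliability. The design also covers persistence policies (such as log compaction), message delivery semantics (such as at-least-once or exactly-once), memory management, and performance optimization.
7. **Fault recovery and security**
   - Replication keeps redundant copies of data on multiple nodes, so messages remain available when a node fails. Kafka also supports encryption, authentication, and authorization.

In short, the Apache Kafka documentation is a comprehensive learning resource, both for newcomers and for developers who want to understand Kafka's internals and advanced features. As Kafka continues to evolve, understanding these core concepts is essential for using the platform effectively.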

1.2应用场景UseCases

本章节介绍几种主流的ApacheKafka的应用场景。关于几个场景实践的概述可以参考这篇博

客.

信息系统Messaging

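Kafka can replace a traditional message broker. Message brokers are introduced into a system for various reasons, such as decoupling producers from consumers or buffering unprocessed messages. Compared with other message brokers, Kafka's high throughput and built-in partitioning, replication, and fault tolerance make it a good solution for large-scale message processing applications.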

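In our experience with messaging scenarios, the required throughput is often not high, but such systems need low end-to-end latency and rely on the strong durability guarantees Kafka provides.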

In this domain Kafka is often compared with traditional messaging systems such as ActiveMQ or RabbitMQ.

网站活动追踪WebsiteActivityTracking

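Kafka's original use case was to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds. That is, the actions users take on a site (page views, searches, or other actions) are published to central topics, with one topic per activity type. These feeds can then be subscribed to for a range of purposes, including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehouses for offline processing and reporting.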

Activity tracking is usually very high volume, because many activity messages are generated for each user page view.

监控Metrics

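Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce a centralized feed of operational data.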

日志收集LogAggregation

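Many people use Kafka as a replacement for a log aggregation service. Log aggregation basically means collecting physical log files from servers and putting them in a central place (a file server or HDFS) for later processing. Kafka abstracts away the details of files and presents log or event data as a stream of messages. This makes it well suited to low-latency processing, multiple data sources, and distributed data consumption. Compared with other log aggregation systems such as Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees thanks to replication, and lower end-to-end latency.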

流处理StreamProcessing


1.2UseCases

HereisadescriptionofafewofthepopularusecasesforApacheKafka.Foranoverviewof

anumberoftheseareasinaction,seethisblogpost.

Messaging

Kafkaworkswellasareplacementforamoretraditionalmessagebroker.Messagebrokers

areusedforavarietyofreasons(todecoupleprocessingfromdataproducers,tobuffer

unprocessedmessages,etc).IncomparisontomostmessagingsystemsKafkahasbetter

throughput,built-inpartitioning,replication,andfault-tolerancewhichmakesitagood

solutionforlargescalemessageprocessingapplications.

Inourexperiencemessagingusesareoftencomparativelylow-throughput,butmayrequire

lowend-to-endlatencyandoftendependonthestrongdurabilityguaranteesKafkaprovides.

InthisdomainKafkaiscomparabletotraditionalmessagingsystemssuchasActiveMQor

RabbitMQ.

WebsiteActivityTracking

TheoriginalusecaseforKafkawastobeabletorebuildauseractivitytrackingpipelineasa

setofreal-timepublish-subscribefeeds.Thismeanssiteactivity(pageviews,searches,or

otheractionsusersmaytake)ispublishedtocentraltopicswithonetopicperactivitytype.

Thesefeedsareavailableforsubscriptionforarangeofusecasesincludingreal-time

processing,real-timemonitoring,andloadingintoHadooporofflinedatawarehousing

systemsforofflineprocessingandreporting.

Activitytrackingisoftenveryhighvolumeasmanyactivitymessagesaregeneratedforeach

userpageview.

Metrics

Kafkaisoftenusedforoperationalmonitoringdata.Thisinvolvesaggregatingstatisticsfrom

distributedapplicationstoproducecentralizedfeedsofoperationaldata.

LogAggregation

ManypeopleuseKafkaasareplacementforalogaggregationsolution.Logaggregation

typicallycollectsphysicallogfilesoffserversandputstheminacentralplace(afileserver

orHDFSperhaps)forprocessing.Kafkaabstractsawaythedetailsoffilesandgivesa

UseCases

cleanerabstractionoflogoreventdataasastreamofmessages.Thisallowsforlower-

latencyprocessingandeasiersupportformultipledatasourcesanddistributeddata

consumption.Incomparisontolog-centricsystemslikeScribeorFlume,Kafkaoffersequally

goodperformance,strongerdurabilityguaranteesduetoreplication,andmuchlowerend-to-

endlatency.

StreamProcessing

ManyusersofKafkaprocessdatainprocessingpipelinesconsistingofmultiplestages,

whererawinputdataisconsumedfromKafkatopicsandthenaggregated,enriched,or

otherwisetransformedintonewtopicsforfurtherconsumptionorfollow-upprocessing.For

example,aprocessingpipelineforrecommendingnewsarticlesmightcrawlarticlecontent

fromRSSfeedsandpublishittoan"articles"topic;furtherprocessingmightnormalizeor

deduplicatethiscontentandpublishedthecleansedarticlecontenttoanewtopic;afinal

processingstagemightattempttorecommendthiscontenttousers.Suchprocessing

pipelinescreategraphsofreal-timedataflowsbasedontheindividualtopics.Startingin

0.10.0.0,alight-weightbutpowerfulstreamprocessinglibrarycalledKafkaStreamsis

availableinApacheKafkatoperformsuchdataprocessingasdescribedabove.Apartfrom

KafkaStreams,alternativeopensourcestreamprocessingtoolsincludeApacheStormand

ApacheSamza.

EventSourcing

Eventsourcingisastyleofapplicationdesignwherestatechangesareloggedasatime-

orderedsequenceofrecords.Kafka'ssupportforverylargestoredlogdatamakesitan

excellentbackendforanapplicationbuiltinthisstyle.

CommitLog

Kafkacanserveasakindofexternalcommit-logforadistributedsystem.Theloghelps

replicatedatabetweennodesandactsasare-syncingmechanismforfailednodesto

restoretheirdata.ThelogcompactionfeatureinKafkahelpssupportthisusage.Inthis

usageKafkaissimilartoApacheBookKeeperproject.

UseCases


