what Apache Kafka is built to be. This way of thinking about data
might be a little different from what you’re used to, but it turns out to be an incredibly
powerful abstraction for building applications and architectures. Kafka is often compared
to a few existing technology categories: enterprise messaging systems, big
data systems like Hadoop, and data integration or ETL tools. Each of these comparisons
has some validity but also falls a little short.
Kafka is like a messaging system in that it lets you publish and subscribe to streams of
messages. In this way, it is similar to products such as ActiveMQ, RabbitMQ, and IBM’s
MQSeries. But even with these similarities, Kafka has a number
of core differences from traditional messaging systems that make it another kind of
animal entirely. Here are the three big differences. First, it works as a modern distributed
system that runs as a cluster and can scale to handle all the applications in
even the largest companies. Rather than running dozens of individual messaging
brokers, hand-wired to different apps, you get a central platform that
can scale elastically to handle all the streams of data in a company. Second, Kafka is
a true storage system built to store data for as long as you might like. This is a huge
advantage when using it as a connecting layer, because it provides real delivery guarantees: its
data is replicated, persistent, and can be kept around as long as you like. Finally, the
world of stream processing raises the level of abstraction quite significantly. Messaging
systems mostly just hand out messages. The stream processing capabilities in
Kafka let you compute derived streams and datasets dynamically from your streams
with far less code. These differences make Kafka enough of its own thing that it
doesn’t really make sense to think of it as “yet another queue.”
Another view of Kafka, and one of our motivating lenses in designing and building
it, is to think of it as a kind of real-time version of Hadoop. Hadoop lets you store
and periodically process file data at a very large scale. Kafka lets you store and continuously
process streams of data, also at a large scale. At a technical level, there are definitely
similarities, and many people see the emerging area of stream processing as a
superset of the kind of batch processing people have done with Hadoop and its various
processing layers. What this comparison misses is that the use cases that continuous,
low-latency processing opens up are quite different from those that naturally fall
to a batch processing system. Whereas Hadoop and big data targeted analytics applications,
often in the data warehousing space, the low-latency nature of Kafka makes it
applicable to the kind of core applications that directly power a business. This makes
sense: events in a business are happening all the time and the ability to react to them
as they occur makes it much easier to build services that directly power the operation
of the business, feed back into customer experiences, and so on.
The final area Kafka gets compared to is ETL or data integration tools. After all, these
tools move data around, and Kafka moves data around. There is some validity to this
as well, but I think the core difference is that Kafka has inverted the problem. Rather
than a tool for scraping data out of one system and inserting it into another, Kafka is