Tutorial Conclusion
Distributed Nature
Next Steps
The output is basically an enriched version of the first aggregation we ran. We still have a list of interests and their counts, but
now each interest has an additional avg_age, which shows the average age for all employees having that interest.
Even if you don’t understand the syntax yet, you can see how this feature lets you build complex aggregations and groupings.
The sky is the limit as to what kind of data you can extract!
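To make the shape of that result concrete, here is a minimal sketch in plain Python that computes the same kind of output locally: one bucket per interest, with a count and an average age. It does not call Elasticsearch. The sample employees and the function name are hypothetical, and the bucket layout is a simplified imitation of an aggregation response rather than the exact format Elasticsearch returns.

```python
from collections import defaultdict

# Hypothetical sample documents, for illustration only.
employees = [
    {"name": "A", "age": 25, "interests": ["sports", "music"]},
    {"name": "B", "age": 32, "interests": ["music"]},
    {"name": "C", "age": 35, "interests": ["forestry"]},
]


def interests_with_avg_age(docs):
    """Group documents by interest, then compute a count and average age per group."""
    ages_by_interest = defaultdict(list)
    for doc in docs:
        for interest in doc["interests"]:
            ages_by_interest[interest].append(doc["age"])

    buckets = [
        {"key": interest, "doc_count": len(ages), "avg_age": sum(ages) / len(ages)}
        for interest, ages in ages_by_interest.items()
    ]
    # Most common interests first; ties are broken alphabetically.
    buckets.sort(key=lambda bucket: (-bucket["doc_count"], bucket["key"]))
    return buckets


for bucket in interests_with_avg_age(employees):
    print(bucket)
```

An employee with two interests contributes their age to both buckets, which is why the average is computed per interest rather than once for the whole data set.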
Hopefully, this little tutorial was a good demonstration of what is possible in Elasticsearch. It only scratches the
surface, and many features (such as suggestions, geolocation, percolation, and fuzzy and partial matching) were omitted to
keep the tutorial short. But it did highlight just how easy it is to start building advanced search functionality. No configuration
was needed: just add data and start searching!
It’s likely that the syntax left you confused in places, and you may have questions about how to tweak and tune various
aspects. That’s fine! The rest of the book dives into each of these issues in detail, giving you a solid understanding of how
Elasticsearch works.
At the beginning of this chapter, we said that Elasticsearch can scale out to hundreds (or even thousands) of servers and
handle petabytes of data. While our tutorial gave examples of how to use Elasticsearch, it didn’t touch on the mechanics at
all. Elasticsearch is distributed by nature, and it is designed to hide the complexity that comes with being distributed.
The distributed aspect of Elasticsearch is largely transparent. Nothing in the tutorial required you to know about distributed
systems, sharding, cluster discovery, or dozens of other distributed concepts. The tutorial ran happily on a single node on
your laptop, but if you were to run it on a cluster containing 100 nodes, everything would work in exactly the
same way.
Elasticsearch tries hard to hide the complexity of distributed systems. Here are some of the operations happening
automatically under the hood:
- Partitioning your documents into different containers or shards, which can be stored on a single node or on multiple nodes
- Balancing these shards across the nodes in your cluster to spread the indexing and search load
- Duplicating each shard to provide redundant copies of your data, to prevent data loss in case of hardware failure
- Routing requests from any node in the cluster to the nodes that hold the data you’re interested in
- Seamlessly integrating new nodes as your cluster grows or redistributing shards to recover from node loss
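The partitioning and routing steps in this list can be pictured with a small sketch. The idea shown is hash-based routing: hash a document's ID and take the result modulo the number of primary shards, so that any node can work out which shard holds a document. This is an illustration under stated assumptions, not Elasticsearch's implementation: the shard count, the function names, and the use of CRC32 as the hash function are all stand-ins chosen for the example.

```python
import zlib

NUM_PRIMARY_SHARDS = 5  # hypothetical shard count for this sketch


def shard_for(doc_id, num_shards=NUM_PRIMARY_SHARDS):
    """Map a document ID to a shard number with a stable hash.

    CRC32 is a stand-in here; it is not the hash function Elasticsearch uses.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def partition(doc_ids, num_shards=NUM_PRIMARY_SHARDS):
    """Group document IDs by the shard each one is routed to."""
    shards = {number: [] for number in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


if __name__ == "__main__":
    for number, ids in partition([str(i) for i in range(1, 11)]).items():
        print(f"shard {number}: {ids}")
```

Because the shard number depends only on the ID and the shard count, the same ID is always routed to the same shard, no matter which node receives the request.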
As you read through this book, you’ll encounter supplemental chapters about the distributed nature of Elasticsearch. These
chapters explain how the cluster scales and deals with failover ([distributed-cluster]), how it handles document storage
([distributed-docs]), how it executes distributed search ([distributed-search]), and what a shard is and how it works ([inside-a-shard]).
These chapters are not required reading—you can use Elasticsearch without understanding these internals—but they will
provide insight that will make your knowledge of Elasticsearch more complete. Feel free to skim them and revisit at a later
point when you need a more complete understanding.
By now you should have a taste of what you can do with Elasticsearch, and how easy it is to get started. Elasticsearch tries
hard to work out of the box with minimal knowledge and configuration. The best way to learn Elasticsearch is by jumping in:
just start indexing and searching!
However, the more you know about Elasticsearch, the more productive you can become. The more you can tell Elasticsearch
about the domain-specific elements of your application, the more you can fine-tune the output.
The rest of this book will help you move from novice to expert. Each chapter explains the essentials, but also includes expert-
level tips. If you’re just getting started, these tips are probably not immediately relevant to you; Elasticsearch has sensible
defaults and will generally do the right thing without any interference. You can always revisit these chapters later, when you
are looking to improve performance by shaving off any wasted milliseconds.
Life Inside a Cluster
Supplemental Chapter
As mentioned earlier, this is the first of several supplemental chapters about how Elasticsearch operates in a distributed
environment. In this chapter, we explain commonly used terminology like cluster, node, and shard, the mechanics of how
Elasticsearch scales out, and how it deals with hardware failure.
Although this chapter is not required reading—you can use Elasticsearch for a long time without worrying about shards,
replication, and failover—it will help you to understand the processes at work inside Elasticsearch. Feel free to skim
through the chapter and to refer to it again later.
Elasticsearch is built to be always available, and to scale with your needs. Scale can come from buying bigger servers
(vertical scale, or scaling up) or from buying more servers (horizontal scale, or scaling out).
While Elasticsearch can benefit from more-powerful hardware, vertical scale has its limits. Real scalability comes from
horizontal scale: the ability to add more nodes to the cluster and to spread load and reliability among them.
With most databases, scaling horizontally usually requires a major overhaul of your application to take advantage of these
extra boxes. In contrast, Elasticsearch is distributed by nature: it knows how to manage multiple nodes to provide scale and
high availability. This also means that your application doesn’t need to be aware of how the cluster is distributed.
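As a rough picture of what spreading load and reliability across nodes means, the following sketch places primary and replica shard copies on a list of nodes. It is a toy allocator written for this illustration, not Elasticsearch's allocation algorithm. The node names, the shard labels, and the function name are hypothetical. It follows one rule that Elasticsearch also follows: a replica is never placed on the same node as its primary.

```python
def allocate(nodes, num_primaries, num_replicas=1):
    """Place shard copies on nodes, always choosing the least-loaded eligible node.

    Returns (layout, unassigned). layout maps each node to its shard labels,
    and unassigned lists the copies that could not be placed.
    """
    layout = {node: [] for node in nodes}
    unassigned = []
    for shard in range(num_primaries):
        used = set()  # nodes that already hold a copy of this shard
        for copy in range(num_replicas + 1):
            label = f"P{shard}" if copy == 0 else f"R{shard}"
            candidates = [node for node in nodes if node not in used]
            if not candidates:
                # No node left that lacks a copy of this shard.
                unassigned.append(label)
                continue
            target = min(candidates, key=lambda node: len(layout[node]))
            layout[target].append(label)
            used.add(target)
    return layout, unassigned


if __name__ == "__main__":
    print(allocate(["node-1"], num_primaries=3))
    print(allocate(["node-1", "node-2", "node-3"], num_primaries=3))
```

With a single node, the replicas have nowhere to go and stay unassigned. With three nodes, every copy is placed and each node ends up holding two shards, and the application code that indexes and searches does not change.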