I have been working with Elasticsearch since the late 2.x versions. Coming from an SQL background, I have figured out over the years that you have to think about your data differently when working with Elasticsearch. Over time I fell in love with how Elasticsearch handles partitioning the data into neat little cubes called shards.
This morning I was brainstorming the best shard and schema layout for an infrastructure monitoring project I am working on and had one of those light bulb moments. That is where I came up with the data cube concept. Much has changed in Elasticsearch since the 2.x versions, but much has stayed the same, specifically the data cube architecture. Without a doubt, it has been optimized as the Lucene library has grown and internal improvements made to the horizontal scaling algorithms, but fundamentally it is still the same. Why change a great concept?
When Elasticsearch reached version 7, one basic change was made to the default settings. Instead of four shards and one replica per index, the default is now one shard and one replica per index. This makes perfect sense in some cases. It falls right in line with the Index Lifecycle Management system. But is this always the best answer? Lately, I have reviewed a few Elasticsearch systems that are still using the best practices from older versions in the 5.x and 6.x series. Enough things have changed that this older data is no longer relevant in a lot of cases. The system will still work fine but will be far from optimized.
This post will try to explain how I view the Elasticsearch sharding system when working with time-series data. Time series data has different requirements than static search engine data since the data is ever moving, not ever-changing. Elasticsearch does not change documents unless you force updates. For best efficiency, time-series data should never be updated, only ingested clean then using the search tools later to build what you need from the individual documents. I have always said, at the core level, Elasticsearch is only limited by your imagination and the data you have to work with.
Some major pros and cons revolve around how many shards that you assign to an index. The first and most basic rule is that if you have more than one shard, the shard count should be evenly divisible by the number of nodes you have. This is not saying that if you have 20 nodes you should have 20 shards for one index, but if you have 20 nodes I would not create a 35 shard index. Same concept, just a different set of numbers. Technically this covers hot and warm nodes as they are fundamentally the same; they just have different routing rules and different resources. Using more shards does not always equal better performance. The image below shows the imbalance this can cause.
For this to make sense, you need to understand how ingestion and querying work. To dive deep into the internals of ingestion and querying is a document in itself, so we will just stick to the basic concepts.
When you have an index setup with multiple shards, Elasticsearch does its best to keep all of the shards the same size. As I am not the Elasticsearch document routing engine, some assumptions will be made, but fundamentally Elasticsearch would treat it the same. Using the example above, we have 4 primary shards labeled P1-P4 and 4 replica shards labeled R1-R4. This would be an index setup with 4 primary shards and one replica shard. Some are confused by how replicas work. It is not one replica of your data. It is a replica of every shard that holds your data. Elasticsearch balances these shards, so no primary and no replica should ever be on the same node. This makes Elasticsearch data handling very data resilient. When designed properly, all data has a backup.
Elasticsearch looks at the shard values during ingestion, decides which shard has the least amount of documents, and attempts to ingest the new document to that shard. Once it has found a suitable primary shard to write the document to, the ingestion process starts. The document is ingested, verified, and then replicated to the replica shard of the same number. For instance, the first document being ingested into index production-run-05-2021 would potentially go to P1. After written to P1 and verified it would also go to R1. The replication takes place internally as the document is read directly from P1 to R1. P1 and R1 should always be very, very close in document count and storage size. This should be true for all shards and their replicas.
An index and cluster design like the one noted will work. It will work great. Everything will seem fine until you turn up the ingestion rate. You always have to keep this in mind to future-proof your system. Your data will grow. It will grow exponentially. It is better to think about this during planning than having to rush and fix it in production. Think about it this way, Elasticsearch is rotating through the shards as fast as your data is coming in. If your system is processing 10,000 documents per second, Elasticsearch has to rotate through the shards, well, 10,000 times per second. This is where imbalanced nodes become a real problem and can provide very random and sporadic results during querying.
Using the 3 node cluster above, which is the best practices recommended cluster size, we can get a good view of where this becomes a problem. Every document written to P1, P2 or replicated to R4 will use Node 1, R2, R3 and P4 will use the resources of Node 3, where Node 2 will only process P3 or R1. The bottleneck comes in where Elasticsearch has to spend time updating resources for two primary shards on one node. Yes, you could balance this out by making it 6 shards, 2 primaries, and 1 replica per node, but that would increase the open file count, resource usage, IO, and network traffic on Node 2 as it is constantly rotating through the shards. Another thing to keep in mind is that this also imbalances the storage requirements of the node. Assuming these are 40GB shards Nodes 1 and 3 would use 120GB of space while node 2 would only be using 80GB of space. The can make calculating storage requirements more difficult and you will always have some nodes using more space than others. Do not think about Elasticsearch indices in terms of “all my data” think of each index as in individual entity that has to share resources with the other index entities. Similar to how shared cloud VM’s work.
If we reduce this to three shards with one replica we are headed to better territory. This would put 1 primary shard and one replica shard on each node. This generally levels out the resource usage but may be total overkill depending on the use case. Though this greatly depends on if your system is ingest heavy or query heavy.
But wait you say, what about just using the default of 1 primary shard and letting Index Lifecycle Management handle the shard rotation so there is always 1 shard and 1 replica per index. Well, again this comes back to ingest rate along with query complexity and frequency.
When you query any data with Elasticsearch the query is first verified then if it passes vetting a broadcast is sent out to the whole cluster asking for which node has the data requested in the query. This scope can be narrowed using query routing but we will assume the default routing is being used. If you are using one shard this is quite simple as only two nodes will answer, the one with the primary shard and the one with the secondary shard. Whichever source responds the fastest sends the data, the node that the query came in on then processes the data and returns it to the client. This is pretty simple and straight forward. It does not matter how big your timeframe is all of this data will come from either one primary or one replica. But what happens when your data grows beyond the 40 – 50GB recommended shard size?
More than once I have looked at an Elasticsearch cluster and found shards quite in excess of the 50GB recommended size, in some circumstances this is a custom configuration that is just what the use case needs. But in most cases this is an oversight in the design phase or just a pure lack of knowledge as to how the system works. Shards over 50GB will continue to work, shards over 100GB will continue to work, in theory a single shard approaching 1TB will work, it will work until it doesn’t.
No doubt that was very ambiguous, but here is why, this is all based on use case. If you want to dive deep into Elasticsearch ingestion routing you can choose exactly which shard on which node a single piece of new data will be ingested to or on the other hand you can direct you query to only one node, and or one shard. These are deep topics for complex use cases but it can be done. The good thing about Elasticsearch is it was designed to take care for all of this automatically. The Lucene library is just the storage and query engine, Elasticsearch brings the routing and the easy horizontal scalability, so in most cases it is just best to let the system keep its self balanced. With this in mind lets look at why these larger shards are normally bad.
When you send a query to an index and the data has been recently ingested or queried that data will be in the cache, this is the main use for the HEAP ram. But when you step out of the cache into older data Elasticsearch has to do some heavy lifting to dig deep into this huge well of data. This may happen infrequently but when it does happen there is a good chance it will dump the cache of recent documents and load in these older time series documents. Depending on how far back you want to look at the data Elasticsearch has to keep digging and digging until it finds it. All during this time processor threads, HEAP ram, and IO are stuck waiting on it to find the data you asked for. Once it is found the data is cached but will quickly be replaced by new queries and new data coming in. So if the same query were ran again in 30 seconds it may find what it needs in the cache, but if it is ran again in 30 minutes it probably will not.
Well again, can’t we just leave it at one shard and let ILM handle this? In and optimal setup Index Lifecycle Management will play a role in all time series data but shard count is always a question.
Setting up ILM to rotate your data at 50GB is great if you are generating this much data over multiple days. and you could do it with 2 nodes. Though you still have to always keep the shard count in mind. This is always the sticking point that many do not seem to grasp, when it comes to shards, you can have too many. No matter if it is your hot nodes or warm nodes you have to keep the shard count as low as possible. This helps with both ingestion and query performance as the system has fewer files to keep track of. In the end a shard is just a collection of files on disk, so much like the OS, low open file count is a great thing. This becomes more prevalent if you are running Elasticsearch in a container such as docker as they can be very strict on resource usage.
Another thing to keep in mind when designing these indices and shard layouts is the fact that even if you have a 3 node cluster using 1 shard that technically leaves one node unused until enough data is collected to roll over the shard. So any queries you send will only use the processing power and ram of one node where spreading the shards out across multiple nodes will bring all processors into play when the data is queried. But there is a caveat here, querying an index with one shard can produce different document scores vs. querying a data with multiple shards.
When you query one shard the results are scored and returned from that one shard. When you query an index with multiple shards every shard returns the data it has already scored. If you request 1000 documents the coordinating node will receive 1000 documents from every shard and rescore them to present you with 1000 total results. So if you have your data on 4 shards the coordinating node would reorder 4000 documents to return your 1000 results.
When looking at moving time series data such as a line visualization this does not cause much of a problem since it is based on moving data points over time. In these situations scoring does not even come into play because the data is being sorted on timestamp which runs in a specific order in a specific direction depending on your sort. But if you query this same data say for a list of devices to generate a report based on a large timeframe that covers multiple shards you could come up with somewhat different results compared to the same query on one shard. This has to be kept in mind. The potential scoring discrepancy will grow with the number of primary shards. This has to be weighed against performance requirements. Though Elasticsearch does support the boost query option which can help level this out.
Visualize your shards as cubes of data, or maybe a book. The data makes up the vertical plane, or book pages and the timestamps make up the horizontal plane, or page numbers. When you dig deep into an index for old time series data Elasticsearch has to flip through everyone of those pages until it finds the beginning of your query and continue reading until it finds the end of your data. If your data is at the bottom of a 50GB stack, it is going to take a while to find it. This is where what I called time slicing can come into play. If you frequently need to query old and new data look into using date stamped indices . This splits the big book into multiple smaller books that are easier to read. This is explained in great depth here https://www.elastic.co/blog/control-ilm-phase-transition-timings-using-origination-date and should be considered in all time series situations. When using date stamped indices you can focus your queries directly to the indices that hold the data you are looking for instead of making the system dig through gigabytes of data to get to that specific date.
There are many variables that come into play when building an Elasticsearch index. Everything from ingest rate, to query rate, to type of query, depth of query, cluster budget and many other factors. I hope the tips above give some clarity in your next cluster design.
Hope this helps.
--Go for it, every day.