Today we live in a world of big data. What we thought of as big data only a few years ago is now insignificant compared to the amount of data that needs to be processed and stored every minute of every day. Every new device that is released has some kind of analytics attached to it. From your latest RFID credit card, to every door you open in a new automobile, to your latest online grocery order, it is all tracked and stored somewhere.
This data is very useful to both the business and the consumer, and we all use it every day. The ratings on any e-commerce website are big data. The logistics for ordering and delivering e-commerce products are also big data, from predicting what you might like based on what you have already purchased to plotting the most economical route to get the package to its destination. The retailer, in this case, would like to know what is selling well and what is not, so they know what to stock; that, too, is big data.
But all of this data is worthless without a way to use it. Raw data is no good on its own. It has to be enriched, stored, analyzed, and compared, even to the point of analyzing the analyzed data. And that is not even the best part: using time-series data, you can compare what has happened to what might happen through the use of machine learning.
The process of enriching and analyzing this data can seem daunting even to seasoned veterans who are dealing with an unknown data set. Taking the raw data and turning it into something that can be analyzed generally takes one of two routes. The first route, tried and true since the beginning of analytical computing, is to roll your own processor. Rolling your own processor shares some parts with the second option: choosing a data analytics platform that is well suited to your use case. There are many available, but this guide focuses only on Elasticsearch.
Rolling your own processor generally has four parts: decipher the data, write a schema for the data, write the data processor, and write the interface for using the data, be it command line or GUI. This may not sound too complex, but once you add up the time, the last two parts generally take far longer than deciphering the data or writing the schema. Another hurdle when processing very large data sets is the need to process them in parallel across multiple compute resources. A single server with the maximum supported resources cannot compete with a cluster of smaller, more efficient machines. This is another area where Elasticsearch excels and why it is the topic of this guide.
When designing a data analytics system, one of the first things you must do is understand the data you are working with. If you do not understand the data, your system is bound to fail at the most critical time. This is where deciphering the data and writing a schema for it come in, for either option; you have to know your data before you do anything with it. Trying to fix a data problem on a production system while the system is in use is a painful experience, so it is best not to let it happen in the first place. Elasticsearch excels at everything needed to analyze data sets from a few hundred records to many billions of records, from a single node for smaller data sets to hundreds of nodes for large or heavily accessed data sets. Be it bare metal servers, in-house virtual machines, cloud virtual machines, or one of the Elasticsearch-based SaaS offerings, I have yet to find a data set that I could not process with Elasticsearch.
The short answer is that Elasticsearch is a horizontally scalable, distributed search engine based on the Apache Lucene project. Since its inception over 10 years ago, the Elasticsearch feature set has grown to the point that it is no longer just a search engine. Elasticsearch is a powerful analytical and research engine with machine learning capabilities.
The goal of Elasticsearch from the beginning has been to be the fastest full-text search engine available. Though there have been licensing changes, the Basic version of Elasticsearch is still free. Of course, some features are not available in the Basic version, but there is definitely enough to perform amazing analytics on nearly any kind of data. The caching logic and processing speed originally designed to make an awesome search engine turned out to also be a great fit for analytics. The parallel processing behind full-text search queries works equally well with the time-series data used for analytics. Throw in the excellent interface provided by Kibana, along with some machine learning, and you essentially have an analytics platform in a box. Machine learning is not available with the Basic license, but you can activate a 30-day trial license through the Kibana interface to test it if you like.
A few commands typed into a terminal, or a few clicks at a cloud provider, and you can have this running within an hour, analyzing your data and finding anomalies that you probably didn't know existed.
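To give a concrete sense of how little is needed to get started, here is a minimal sketch of indexing and querying a document, assuming a single-node cluster reachable at http://localhost:9200 with security disabled and the official elasticsearch Python client (8.x); the "orders" index and its fields are purely hypothetical.

from elasticsearch import Elasticsearch

# Assumed local, single-node cluster with security disabled.
es = Elasticsearch("http://localhost:9200")

# Index a sample document into the hypothetical "orders" index.
es.index(index="orders", document={
    "customer": "jane.doe",
    "item": "coffee beans",
    "quantity": 2,
    "ordered_at": "2023-04-01T09:30:00Z",
})

# Make the document searchable right away, then run a full-text query.
es.indices.refresh(index="orders")
resp = es.search(index="orders", query={"match": {"item": "coffee"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"])

That is the whole loop: send JSON in, ask a question, get scored results back. Everything else in this guide is about making that loop fast and reliable at scale.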
Of course, the compute resources that you choose for your Elasticsearch cluster are the most important part of the decision-making process. There is no doubt you could just build the biggest nodes available and do all of your analytics work there, but whether those are fully loaded bare metal servers or the highest VM offering available to you, in almost all cases this is not the best way to design any service, let alone an Elasticsearch cluster.
That is one major advantage of Elasticsearch: you do not need huge monolithic servers, because you can always add new compute nodes when needed. With this in mind, you can build a cluster just big enough to fit your current needs, then expand it on the fly when necessary. In some of the more dynamic analytical applications, the cluster can grow and shrink as needed. With this capability, Elasticsearch is both economically and environmentally friendly, only using what it needs when it needs it.
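When you do add a node, one way to confirm it has joined and that the cluster is still healthy is to ask the cluster itself. A minimal sketch with the Python client, again assuming a local endpoint:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local endpoint

# Overall cluster health: status, node count, and any unassigned shards.
health = es.cluster.health()
print(health["status"], health["number_of_nodes"], health["unassigned_shards"])

# A quick per-node view: name, roles, and resource pressure.
print(es.cat.nodes(v=True, h="name,node.role,heap.percent,ram.percent,cpu"))

A green status and the expected node count are usually all you need to see before pointing more traffic at the cluster.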
There are three main compute resources to discuss when designing an Elasticsearch cluster: RAM, CPU cores, and disk storage. Years of experience with this platform have led me to conclude that this is the best order of importance for these resources, though any one of them being out of spec, even on a single node, will impact the whole cluster. The three sections below discuss how these resources work together to provide the data processing and query speed offered by Elasticsearch, with some real-world insight along the way.
RAM, or Random Access Memory, has been the nightmare of computer users and programmers since the beginning of computing. Users always want more insight, more cool visuals, more data compressed and packed into whatever RAM is available to that user or system. At one time the maximum system limits on RAM were thousands of times smaller than what is available now, but the issue persists: how to make the best use of all of the RAM available to an application.
This guide will not go into a deep discussion of exactly what RAM is or how application garbage collection works; those are deep topics for another guide. Let us just say that writing memory management and garbage collection routines into your application can take about as much programming time as writing the core application itself. Elasticsearch sidesteps this extra overhead by using Java as its runtime. Java has built-in memory management that takes that burden off of the programmer, and Java's memory management and garbage collection have been proven over many years and are trusted by many corporations and developers for their application programming needs.
Elasticsearch uses memory in two ways. The first and most basic way is that the Lucene library that Elasticsearch is built on makes heavy use of the operating system's disk cache. This gives Lucene very fast access to data that would otherwise be read from disk. If the data is not found in the system disk cache, Lucene fetches it from disk, and the operating system in turn adds it to the cache, evicting older items as needed. This is all handled at the operating system level, so there is no overhead from trying to cache something that is already being cached.
The second and more interesting way that Elasticsearch uses RAM is what it does with the Java heap. The heap is used predominantly by Elasticsearch to store queries and results. When you send a query to Elasticsearch, the node that receives it acts as a coordinator and forwards the request to the nodes holding copies of the requested data, favoring the copies it expects to respond fastest. This routing gets into shards, shard counts, shard replicas, and a few other areas which we will touch on later. For now, we can simply say the heap holds what Elasticsearch has already done so it does not have to do it again, from queries to results to the location of the data.
Elasticsearch's heap follows the standard generational JVM layout, commonly described in three parts: the eden space, the survivor spaces (together forming the young generation), and the old generation. There are complete books on exactly what happens in these isolated memory spaces, but the simple version is this: newly allocated objects start in the eden space, objects that are still in use after a collection are moved into the survivor spaces, and objects that have been around for a long time are promoted to the old generation, where they are cleaned up when space is needed. This is where garbage collection comes in, and it can be a real nemesis to the developer and the end user.
This is a very oversimplified view, but I hope it gives enough information to understand how this works.
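If you want to see how the heap is actually behaving on each node, the cluster exposes its JVM statistics. A minimal sketch with the Python client, assuming a local endpoint; the fields shown follow the nodes stats API documentation, but it is worth verifying them against your version:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local endpoint

# Per-node JVM statistics: heap usage and garbage collection counts.
stats = es.nodes.stats(metric="jvm")
for node_id, node in stats["nodes"].items():
    jvm = node["jvm"]
    heap_pct = jvm["mem"]["heap_used_percent"]
    old_gc = jvm["gc"]["collectors"]["old"]["collection_count"]
    print(f'{node["name"]}: heap {heap_pct}% used, old-generation GCs: {old_gc}')

A heap that sits constantly near its limit, or an old-generation collection count that climbs quickly, is an early warning that the cluster needs more memory or more nodes.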
The amount of total system RAM that you will need for your Elasticsearch nodes is completely use case dependent. The main factors in this decision are ingest rate, query rate, and document complexity, and in most cases query rate and document complexity weigh far more heavily than ingest rate. Remember, though, that query rate is not just the time to run a query; it also includes the number of users running queries and the complexity of those queries. Mix this with the potential complexity of the data, and things can turn sour very quickly if the cluster is not properly designed. To claim a full high-availability setup, the design should also include an expansion plan along with a disaster recovery plan.
Much like the system disk cache managed by the operating system, the Elasticsearch heap cache has one function: high-speed data access. Frequently accessed data is held in the cache and, just as with the system disk cache, old, infrequently accessed data is purged to make way for new data. So it might seem that maximum RAM is always the best option, but that is not the case, and it is certainly not budget-friendly.
RAM is a complex topic when working with Elasticsearch. Along with what we went over above, there are other considerations that have to be factored in. As of this writing, the maximum amount of RAM that should be assigned to the Elasticsearch heap is just under 32GB. The reason for this comes down to Java memory pointer compression. Like some other topics in this guide, pointer compression is a whole topic in itself, but the fastest way to explain it is that once the heap goes above roughly 32GB, Java's memory pointers can no longer be compressed, and a larger percentage of the heap is spent just tracking memory pointers. Unless you have a highly specialized and tuned Elasticsearch system, you will not see any benefit from going past that threshold until you go well beyond it, somewhere around 64GB of heap. Of course, this is the benefit of being able to horizontally scale the cluster and spread your workload out over many compute nodes: you do not need a massive amount of RAM on any one machine.
The final memory consideration has to do with balancing the system disk cache against the amount of heap that you assign to Elasticsearch. As noted above, unless you have a very specialized use case you should not assign more than 32GB to the heap, but you also have to keep the disk cache in mind, which is just as important as the heap, if not more so. The best-practice advice here is 50%: do not assign more than 50% of the available system RAM to the heap. There is a little bit of wiggle room, but once you start going beyond 50%, performance will start to degrade.
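One way to sanity-check a running node against these two rules is to ask it how big its heap is and whether compressed object pointers are still in use. A minimal sketch with the Python client, assuming a local endpoint; the field names follow the nodes info API documentation, but verify them against your version:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local endpoint

# Node info includes the configured heap size and whether the JVM is
# still using compressed ordinary object pointers.
info = es.nodes.info(metric="jvm")
for node_id, node in info["nodes"].items():
    jvm = node["jvm"]
    heap_gb = jvm["mem"]["heap_max_in_bytes"] / 1024 ** 3
    oops = jvm.get("using_compressed_ordinary_object_pointers", "unknown")
    print(f'{node["name"]}: heap {heap_gb:.1f} GB, compressed oops: {oops}')

If the compressed pointers flag ever reports false, the heap has crossed the threshold discussed above and is almost certainly doing more work for less benefit.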
To wrap up this part on memory, we can just say: do not assign more than 50% of available RAM to the heap, do not assign more than 32GB to the Elasticsearch heap, and do not think you have to design one or two huge machines. Know your data before deciding on memory configuration or node count. As we continue through the next two parts on CPU and storage, you will see how all of this ties together.
A final note: the next heap size beyond 32GB where you might expect to see an improvement would be around 64GB. The problem is that when you are working with that much heap, garbage collection times can quickly erode any performance improvement you were hoping for. As was mentioned earlier, every use case is unique, so do not look at this as a hard limit; it is a recommended best practice that I fully support unless you have a very good understanding of how Elasticsearch works.
CPU cores are just as important as RAM; they are only number two on the list because there is a lot less to think about when deciding on the number of cores. But along with the number of CPU cores, we have to think about CPU speed. As far as Elasticsearch processing is concerned, the number of cores is more important than the core speed, up to a point.
Every Elasticsearch task runs in a thread. In a perfect world, you would assign enough cores for every thread to always run on its own core, but this would waste a lot of resources and a lot of budget, since not all threads are always running. Elasticsearch has a task scheduler that does a very good job of starting and stopping threads to make optimum use of the CPU cores available, so we can focus on selecting the resources without having to worry about all of the threads individually.
CPU frequency, while it plays a role in the processing speed of your Elasticsearch nodes, is not nearly as important as the number of threads that can run in parallel. More cores equate to more threads running in parallel. In most cases, a node will perform better with four 3.2GHz cores than with a single core running at 4.3GHz. Keep this in mind when planning: if four slightly slower cores are going to cost the same as one faster core, go for the four slower cores.
This is a general overview of CPU cores and threads. If you look into complex cluster designs and dedicated node types, this topic goes much deeper, but for most use cases this should cover the basics.
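If you want to see how those threads are actually being used on a running cluster, the cat thread pool API gives a quick per-node view. A minimal sketch with the Python client, assuming a local endpoint:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local endpoint

# Per-node thread pool activity; a steadily growing "rejected" count on
# the search or write pools is a common sign of CPU (or I/O) pressure.
print(es.cat.thread_pool(v=True, h="node_name,name,active,queue,rejected"))

Watching the active and queue columns during a busy period tells you far more about whether you bought enough cores than any spec sheet will.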
Disk-based storage has been the bottleneck of all computers since it was introduced decades ago. Even with the advances in SATA and NVMe storage, it continues to be the bottleneck of every computer system, and that becomes very evident very fast when working with a system that has sub-optimal storage. Writing to storage is not like writing to the blazing-fast RAM that sits directly on the CPU data bus. The bottleneck comes from converting the data into the device's format and writing it to a specific location on the storage device: the storage controller calculates where the data goes, which takes time, then performs the write, which takes time, then verifies the data, which takes time. Add up all of those steps and storage is nowhere near as fast as RAM, so storage is an important decision.
In general, there are two types of storage still in use, commonly referred to as spinning and non-spinning (solid state). Standalone spinning drives installed directly in a server are becoming rare, but they are still very popular in RAID storage arrays and hybrid storage arrays.
When planning a new cluster, the most important storage metric to think about is the overall IOPS (input/output operations per second) available between the storage and your compute resource. You can have a top-tier server to build your cluster on, but if the IOPS count is too low, you will be wasting the other resources that you are paying for.
This is, of course, workload-dependent, but my recommendation here is to use the highest-IOPS storage available on the platform. Storage speed impacts the whole cluster: the longer a node has to wait for data, the longer that thread is tied up when other threads could be using that CPU time.
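To get a rough sense of whether storage is holding a running cluster back, the node statistics include filesystem and, on Linux, I/O counters. A minimal sketch with the Python client, assuming a local endpoint; the exact fields vary by platform, so treat the keys below as an assumption to verify:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local endpoint

# Filesystem stats per node; io_stats is only reported on Linux.
stats = es.nodes.stats(metric="fs")
for node_id, node in stats["nodes"].items():
    fs = node["fs"]
    free_gb = fs["total"]["available_in_bytes"] / 1024 ** 3
    io = fs.get("io_stats", {}).get("total", {})
    print(f'{node["name"]}: {free_gb:.0f} GB free, '
          f'read ops {io.get("read_operations", "n/a")}, '
          f'write ops {io.get("write_operations", "n/a")}')

Pair these numbers with the thread pool view from the CPU section: busy search threads plus a saturated disk usually means the money should go into faster storage, not more cores.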
I hope these tips help when designing your next Elasticsearch cluster. Check back for new content every week.
--Parting Wisdom
-Research all angles of a situation before committing. What looks good can become bad very quickly.-