Logstash Tips on Batch Size

It is quite important to get right.

Many settings across the Logstash core, filters, input, and output plugins can be tweaked to improve performance. Though Logstash is documented very well at Elastic, and there are many other articles about using and tuning Logstash, I decided to take a different approach: less technical, more fun. At least as much fun as you can have explaining software settings.

The batch size setting, in my opinion, is one of the most important settings when you are trying to squeeze every bit of performance out of your resources. The technical documentation only tells you that you may need more Java heap RAM and that setting it correctly should increase performance. But since each environment is unique and faces different challenges with different resources, it is hard to know exactly what the setting is doing and exactly how it helps.

To put my spin on it, the Logstash batch size is a delicate dance between the resources available to your Logstash instance, the resources available to your Elasticsearch cluster, the size of your documents, the speed of input, the complexity of your filters, network speed, and so on. So even in this post, I cannot tell you exactly what your environment needs, but I will give you some useful hints.

The batch size is simply how many documents Logstash will collect from an input into one batch before sending them through the specified pipeline’s filters and outputs. The straightforward block diagram below shows how this works.

This diagram greatly oversimplifies how this works but may help some understand it better. The batch size can be global or per pipeline. The global setting lives in the logstash.yml configuration file. The per-pipeline settings can be set in the pipelines.yml file if you are using that method for multiple pipelines, or through the GUI if you are using centralized pipeline management. No matter which option you use, the batch size information below still applies.
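As a sketch of the per-pipeline form, a pipelines.yml might look like the following (the pipeline ids and config paths are made-up examples; the global equivalent is a single pipeline.batch.size line in logstash.yml, which ships with a default of 125):

```yaml
# pipelines.yml -- per-pipeline batch sizes overriding the global default.
# Pipeline ids and config paths below are hypothetical examples.
- pipeline.id: syslog
  path.config: "/etc/logstash/conf.d/syslog.conf"
  pipeline.batch.size: 500
- pipeline.id: beats
  path.config: "/etc/logstash/conf.d/beats.conf"
  pipeline.batch.size: 2000
```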


Part 1. Heap RAM

One of the first things that gets overlooked when working with the Logstash batch size is what resources are actually available to the running Logstash instance. Whether you are running on bare metal, a VM, a cloud VM, standalone Docker, or Kubernetes, you always need to be aware of the resources available to Logstash and what resources it is using.

By default, Logstash assigns 1 GB of RAM to the Java heap, and all pipelines share it. The amount of RAM available to Logstash can be adjusted in the jvm.options file, which generally sits in the same location as the logstash.yml file. Also by default, Logstash will use all CPU cores presented to it. For now, we will assume a single pipeline with a reasonably simple filter set.
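For reference, the heap is set with the standard JVM flags in jvm.options; the 4 GB value below is only an example, and the usual advice is to set the minimum and maximum to the same value so the heap does not resize at runtime:

```
## jvm.options -- initial (-Xms) and maximum (-Xmx) Java heap size
-Xms4g
-Xmx4g
```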

Our primary focus here is the heap RAM. Since there is 1 GB of heap by default, we will figure our batch numbers around that. You can get a good ballpark figure for how big a batch you can set by dividing 70% of the available heap RAM by your average document size. You never want to use all of the heap, as Logstash depends on the headroom for its own bookkeeping. Once you go over 70% of heap usage in any Java application, performance starts to degrade because more time is spent in garbage collection. And since Logstash can be very heavy on IO and network, depending on the workload, those are also things to keep in mind when designing a pipeline.

So with this in mind, 70% of 1 GB is 700 MB. Assuming the documents to be processed are 100 KB each, at most we could set a batch size of 7,000 and still be in the safe zone as far as RAM is concerned. If your documents are larger or smaller, the formula is simple: (0.7 × heap) / document size. So if your documents are 500 KB each, it would be (0.7 × 1,000,000,000) / 500,000, or 1,400 documents in the batch before you reach the recommended heap limit. You can always increase the heap size to hold larger batches, but many other factors go into the optimal batch size; this is just a good number to start with. The next consideration is how the CPU cores and worker threads will handle this batch.
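The formula above can be sketched as a quick helper (integer arithmetic, using the post's 70% headroom rule of thumb; plug in whatever your own document measurements say):

```python
# Ballpark maximum batch size: (0.7 * heap) / average document size.
# Uses integer math so the example numbers come out exact.

def max_batch_size(heap_bytes, avg_doc_bytes):
    usable = heap_bytes * 7 // 10   # keep 30% of the heap as GC headroom
    return usable // avg_doc_bytes

GB = 1_000_000_000
KB = 1_000

print(max_batch_size(1 * GB, 100 * KB))  # -> 7000
print(max_batch_size(1 * GB, 500 * KB))  # -> 1400
```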

Remember, increasing heap RAM will not always solve your problems. Sometimes it can make small problems worse, because a larger heap takes longer to garbage-collect and leaves more to churn through in RAM. If you are seeing delays alongside high heap usage, the cause could also be saturated CPU cores.

Part 2. CPU Cores and Worker Threads

While mismanagement of Logstash memory usage can quickly cause headaches, the next consideration is CPU cores and worker threads. By default, Logstash assumes it has access to all CPU cores and automatically sets the worker threads to that value: if you have 12 CPU cores, you will have 12 worker threads.
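If you would rather pin the worker count than rely on the core-count default, it can be set explicitly; a sketch (the 12 here just mirrors the example above):

```yaml
# logstash.yml -- cap worker threads instead of defaulting to
# one worker per detected CPU core
pipeline.workers: 12
```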

There are a lot of complexities behind the scenes in how worker threads operate. The easiest way to look at it is that the number of worker threads is how many batches of events Logstash will process in parallel. Keep in mind that the CPU cores are spread across all pipelines: if you have defined five pipelines with four worker threads each, they would want 20 CPU cores. There is some wiggle room here, which is explained in the Logstash Tuning and Profiling documentation, but again I will put my spin on it.

All of this information comes back around to document size and filter complexity. If you have very small documents with no or very few filters, then yes, you can for sure oversubscribe the worker threads by many factors. If your documents are large or your filters complex, you must be careful with oversubscribing. Also, keep in mind that all of your pipelines add up as far as resource usage goes. Take the same five pipelines with four workers mentioned above and give each pipeline a batch size of 1,000. Since the batch size applies per worker thread, at 100% usage that works out to 20 CPU cores in demand and up to 5 × 4 × 1,000, or 20,000 events held in memory at once.
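That worst case can be sketched as a quick calculation (the numbers are this post's example scenario; note that pipeline.batch.size applies per worker thread, so each worker holds its own batch):

```python
# Worst-case events held in memory across all pipelines at 100% load.
# Each worker thread collects its own batch, so a single pipeline can
# hold workers * batch_size events in flight at once.

def in_flight_events(pipelines, workers_per_pipeline, batch_size):
    return pipelines * workers_per_pipeline * batch_size

# The example scenario: five pipelines, four workers each, batches of 1,000.
print(in_flight_events(5, 4, 1000))  # -> 20000 events in memory
print(5 * 4)                         # -> 20 worker threads wanting 20 cores
```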

Part 3. It Could Be The Output or Maybe the Input

Even when your resources are spot on and your filters are tuned to 100%, there are still two main factors that come into play with Logstash performance: the input and the output. Logstash does its best to balance its multitude of tasks across available resources but has little control over the input source or the output destinations. A full analysis of all input and output congestion would cover many posts, which we will get to in time. For now, we will stick with the basics.

Of course, the input is where your data comes from; the output is where it is going. There are many input plugins available for Logstash; the three most popular are the beats input, file input, and syslog input. The two most popular outputs are the file output and the Elasticsearch output. This array of inputs and outputs covers the whole gamut of network and local system IO traffic, but in most cases, what happens there is outside the realm of Logstash.

If the network is saturated, of course Logstash is going to respond slowly; all communication has to go over the network if you are using a network input. Likewise, if IO on the local system is saturated, you should expect slowdowns when using the file input to read 10 log files that each add 100 new lines per second. And yes, the network stack will become saturated if you use 10 UDP, 10 syslog, and 10 beats inputs all running at 100%. These are just things to keep in mind as you design your Logstash pipeline inputs and outputs. Logstash is always bound by the environment where it is installed.

Wrap Up

Logstash is bound by the hardware and operating system that it is running on; there is no way around that. This is especially important if you are running standalone Docker or a container service.

Container systems such as Kubernetes keep stringent control of the resources each pod may use. Here you have to be even more vigilant about giving Logstash precisely what it needs for the workload you present to it. Too many resources and you are, of course, wasting them; too few and Logstash suffers, frequently becoming the scapegoat for poorly designed pods, or even full VMs for that matter.

Please also remember that most cloud services, such as AWS, GCP, and Azure, are shared services by default. Your VMs there share the underlying hardware with any number of other customers. Also, in the cloud arena, a significant performance factor that is frequently overlooked is the IOPS number, that is, IO operations per second. If your Logstash instance is heavy on IO operations, such as writing to or reading log files from disk, this number is significant. The number of CPU cores you are paying for does not matter if saturated IOPS causes the system to wait on IO.

Again, with cloud services, also look out for the maximum network speed that your specific VM allows. Processing 1 TB of logs per hour works out to a sustained rate of roughly 2.2 Gb/s, so even a 1 Gb/s link would fall behind, taking over two hours per terabyte. One step down to the next common tier, 100 Mb/s, would take over 22 hours to transfer 1 TB of logs; it would never catch up.
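A quick back-of-the-envelope check of those transfer times (decimal units assumed, so 1 TB = 10^12 bytes; real throughput will be lower once protocol overhead is counted):

```python
# Hours needed to move a given number of terabytes over a link,
# ignoring protocol overhead and compression.

def hours_to_transfer(terabytes, link_gbps):
    bits = terabytes * 10**12 * 8          # decimal TB -> bits
    seconds = bits / (link_gbps * 10**9)   # Gb/s -> bits per second
    return seconds / 3600

print(round(hours_to_transfer(1, 1), 2))    # 1 Gb/s: ~2.22 hours per TB
print(round(hours_to_transfer(1, 0.1), 1))  # 100 Mb/s: ~22.2 hours per TB
```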

Disk speed also makes a huge difference to IO operations. Most deployments now run on SSDs, but there are different tiers of SSD, especially in the cloud.

I want to tell everyone to use the biggest, fastest platform that you can find, but that is rarely practical. So below I have provided my thought process and tips when designing a new Logstash instance and the pipelines that will go with it.

The first thing to consider is the number and type of inputs. All plugins inherently use a little bit of heap RAM just by being loaded in memory, even if they do not do anything. This is generally insignificant, so we will not use it in our calculations. But we do care about the type of input, as a file-based input will need more IO, whereas network-based inputs are weighted more towards network and CPU.

Next, we move on to the number and complexity of filters across every pipeline. There are a few filters available to Logstash that make network calls, such as the Elasticsearch filter. Those kinds of filters deserve a post of their own, so we will stick with the basic filters that do not. When it comes to filters, we are generally talking raw CPU and memory. There is no IO involved, or at least there should not be: if the operating system uses any swap, performance will suffer severely as memory is paged to disk. Swap should be avoided in all situations.

The filters are all event manipulation routines that run in memory; in my opinion, they make up a good 65% of the resource usage. These filters are where most of the work is done to enrich your data. As mentioned above, this is almost wholly CPU work, with the IO demand depending mostly on the input source.

Keep in mind that any inputs, outputs, or filters that depend on an external network-based resource, such as the http input or output, or even the Elasticsearch output, are bound by the response speed of that resource. And yes, one pipeline can slow down another pipeline, since they are all sharing the same resources.


I hope this information helps those who read it with planning their Logstash installs and pipelines. Stay tuned for Part 2, coming soon, which goes through profiling a Logstash system with a custom-written NodeJS utility.

--Parting Wisdom--
-Measure twice, cut once.-

Tech Thoughts