Metricbeat to PostgreSQL Using Perl and TimescaleDB


It Lives


Preface

As I continued to work on and test different data loading techniques using TimescaleDB and Bureau of Labor Statistics data, it dawned on me that I was not collecting any good metrics on all of this work I was doing. Sure I could program metrics into the data loaders and such, but I am talking about system metrics from the VMs doing the work. Things like CPU usage, memory usage, network usage, and all of those other metrics that we all love so much.

Since I am currently focused on PostgreSQL, I do not have a running Elasticsearch cluster, but I have a lot of hosts that still have Metricbeat installed. For the past five years, I have used Elasticsearch and Metricbeat to collect metrics on my projects and live servers. Without a running Elasticsearch cluster, I paused the first project and pondered how I should go about collecting system metrics for my current and future projects.

There are many options available for collecting system metrics. There are also many different ways to store these metrics, from flat text to rrdtool and its whole ecosystem of configuration options. There is Prometheus and its tools, InfluxDB, Cassandra, and many more. So, where I could have spent days looking for something I wanted to work with, I decided to bring this full circle back to what I was working on anyway; Postgresql and TimescaleDB. But how to collect the metrics? I do not want to spend days learning something new. So after some deep mind searching, I decided to stick to what I already knew and work something out to convert Metricbeat JSON data into a format compatible with my current PostgreSQL work. Those thoughts spawned this project.

When I started this side project I had great ambitions that I was going to reinvent the wheel. I wrote the first version of this with JavaScript and NodeJS. The idea was to use JavaScript to emulate the communication between a Redis server and a client. I worked on that version for three days before figuring out that TCP buffers are not easy to manage, and I was losing 10% of my records because of truncated documents left in the buffer. Which, of course, became a much worse problem as I started pushing more metrics from more hosts. Of course, there are ways to buffer the buffers, but then it dawned on me that I would lose all metrics currently in that buffer if the script crashed. Losing a buffer full of metrics is never a good thing, so I moved on to version 2 using Perl.

Version two was a little less ambitious but still buggy. The initial idea was to hold all values I wanted to INSERT in a Perl HASH until all values were filled over a specific timestamp and then insert them into a PostgreSQL wide table. I loved the idea of being able to store everything I wanted in a wide table, but there was a problem here also. As I turned up more hosts, the hash quickly grew, and I realized it was not much different than before. If the script crashed, it would lose all metrics currently in memory. Also, I would have to write hash cleanup routines otherwise the script would have a memory leak that could grow quickly if many hosts were being monitored.

Now we have landed at this final version. This version could use a lot of tweaks and additions, but for the time being, it works for my needs and shows the basics of converting the deep nested Metricbeat data into SQL inserts. This version uses Redis as a metric buffer but is a far cry from what the first version tried to do. In this version the idea of using the wide table format was modified into what I will call a medium table format. We will get into the details on that when we get to that section of this project document.

Before starting this project I knew bits and pieces of how to do parts of these things. This project document wraps years of knowledge, lots of research, many nights of testing code, testing queries, and the comprehension to put it all together into one document. As problem solvers, this is what we do, with the hope that our hard work and research all combined into one well documented project will help others further their knowledge.

Follow along as we build this project and learn a little bit about converting deep nested JSON documents into SQL style inserts.

Let's Get Going!




The Project Platform

I chose these components because I am very familiar with them. I have used each of these for at least a handful of years, except for TimescaleDB. TimescaleDB is a set of extensions that you install on top of PostgreSQL. These extensions do some crafty work in the background to take advantage of many PostgreSQL concepts that can be intimidating to even a seasoned database engineer. Things like table partitioning, automatic data aggregation and complex queries, just name a few things I have figured out so far.

At one time, most of the databases running Internet applications and collecting metrics were SQL based. Many SQL-based options are available, and now with the fast rise of NoSQL databases, there are many other options to consider for processing metrics. I have used both SQL and NoSQL for many different tasks. They both have their upsides and downsides, but everything is use-case dependent in the end. This project is designed to take the SQL route, not define which route is better.

The PostgreSQL database that we will build has two tables. One is a relational table that holds metric descriptions and names. The other will be a TimescaleDB hypertable that holds the metrics. We will work through building all of these from a VM with a fresh install of Ubuntu Linux 20.04 all the way to a working metrics converter and display platform.

This project makes a few assumptions about skill level. This project assumes that you have a freshly installed Ubuntu 20.04 server, VM or otherwise, and that you know how to navigate a Linux command line. Other than that, this project document covers everything else that you will need to know to complete the project.


The Hardware (Virtual or Otherwise)

Since this project will only be collecting metrics from one host, we can use a fairly low-resource platform for this project. My platform, as shown below, has four processor cores, 4 gigabytes of RAM, and 50 gigabytes of virtual storage running on a 1TB NvME. Whether running as a local VM, a cloud-based VM, or a bare metal server, these resources should work fine. Below is a screenshot of the Proxmox VM creation GUI showing what I chose for the resources.



This pretty much covers the hardware needed to run this project. The 50 gigabytes of storage is a little more than what will be required, but it is a good buffer if someone decides to push data from other hosts into this system. We will discuss how to do that when we get to the Redis configuration section.

There are many cloud services available where you can lease a VM with these resources for a nominal fee. If you are only going to use a leased VM long enough to run through this project once, then you only need to lease it for a few hours. In most cases, this costs less than one US dollar.



Installing the Core Packages

We need to install some core packages that will be needed as we work through this project. I pondered a few different ways to handle installing packages, from very detailed explanations of every command to offering just a blob of commands and moving on from there. The first way added a dozen pages to the document. The second way offered no detail on what was being installed. Neither of them are a good answer. With this in mind, I came up with this hybrid format. These will be in a numbered list with the summary of what the command does and then the command.

This first set of installs takes care of updating the operating system to the latest packages and installing the PostgreSQL database, the client, and the TimescaleDB extensions.



PostgreSQL and the TimescaleDB Extensions

As of October 2022, the install instructions for the PostgreSQL and TimescaleDB packages presented here are slightly different than what is offered on the Timescale website. The official instructions state that you must be logged in as root to follow these procedures. The list below has been updated and tested by me to allow this installation without having to be logged in as root.

  • 1) Update the current system repositories
  • sudo apt update

  • 2) Upgrade the system to the latest packages
  • sudo apt upgrade

  • 3) Install PostgreSQL Dependancies
  • sudo apt install gnupg postgresql-common apt-transport-https lsb-release wget apt-transport-https software-properties-common

  • 4) Run PostgreSQL Repository Builder
  • sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh

  • 5) Install Timescale repository Into the apt-get sources List
  • echo "deb https://packagecloud.io/timescale/timescaledb/ubuntu/ $(lsb_release -c -s) main" | sudo tee /etc/apt/sources.list.d/timescaledb.list

  • 6) Add Timescale repository GPG key to apt-get
  • wget --quiet -O - https://packagecloud.io/timescale/timescaledb/gpgkey | sudo apt-key add -

  • 7) Update apt-get package list using the repositories we added above
  • sudo apt update

  • 8) Instale TimescaleDB extensions, PostgreSQL, and PostgreSQL client
  • sudo apt install timescaledb-2-postgresql-14

  • 9) Run timescaledb-tune
    This will ask a few questions and modify the PostgreSQL configuration to work with the extensions. For this project the defaults will work fine.
  • sudo timescaledb-tune

  • 10) Restart PostgreSQL server to load new configuration
  • sudo systemctl restart postgresql

  • 11) Connect to our newly installed PostgreSQL database
  • sudo su postgres -c psql

After completing these 11 steps, we should have a fully installed PostgreSQL database with the TimescaleDB extensions and the PostgreSQL database client. If all is well, we should now see an output that looks like the one below.

                     
psql (14.5 (Ubuntu 14.5-1.pgdg20.04+1))
Type "help" for help.

postgres=#
                     
                  

This shows us that the PostgreSQL client can connect to the database. These steps that we just completed are the most complex part of the core component installations. In the end, these 11 steps install two repositories, three core packages, and all their dependencies. Now we will move on to our metric buffer, Redis.



Install Redis

For this project, Redis is a very simple install. We do not need to install any keys or repositories as the version supplied by default using Ubuntu 20.04 apt-get works very well for this project. Run the command below to install the Redis server.

sudo apt-get install redis

This package also installs redis-cli, the default Redis database client. This package starts the redis server after installation. We can verify that it is running by running the command below. We do not need to run sudo for redis-cli in its default setup. So just redis-cli.

redis-cli

If there are no errors, you should see the output below after running redis-cli.

                     
127.0.0.1:6379>
                     
                  

This is the Redis database prompt. We will use this a few times while working through this project, mainly to check the status of the data we are collecting. The next installation is Metricbeat. Metricbeat requires repository additions just as PostgreSQL and TimescaleDB did.



Install Metricbeat

The documentation offered by Elastic at this URL Repositories for APT and YUM explains these steps very well. I have included a slightly modified version of the instructions here so that this project can be completed with this one document. The modification is that we will not be trying to reinstall packages we already installed earlier, such as, apt-transport-https. This was designed around Metricbeat version 8.4.3.

  • 1) Add Elastic GPG Key to apt-get repository
  • wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -

  • 2) Add Elastic OSS Repository
  • echo "deb https://artifacts.elastic.co/packages/oss-8.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-8.x.list

  • 3)Update Repository List
  • sudo apt-get update

  • 4)Install Metricbeat
  • sudo apt-get install metricbeat

Metricbeat will not start after the installation as PostgreSQL and Redis did. Metricbeat needs to be configured before we can start it. We will start on configurations after installing these last few core packages. Let's now move on to installing Grafana.



Install Grafana

We will install the latest version of Grafana over these next few steps. Grafana also requires adding a repository and a GPG key. So we will go through very similar steps as what we did with Metricbeat. We will use a modified version of the steps listed in the Grafana documentation at Install on Debian or Ubuntu. The guide here will be a few steps simpler as we have already installed some of the components mentioned in the official document. The steps are broken down below.

  • 1) Add Grafana GPG Key to apt-get repository
  • sudo wget -q -O /usr/share/keyrings/grafana.key https://packages.grafana.com/gpg.key

  • 2) Add Grafana OSS Repository
  • echo "deb [signed-by=/usr/share/keyrings/grafana.key] https://packages.grafana.com/oss/deb stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list

  • 3) Update Repository List
  • sudo apt-get update

  • 3) Install Grafana
  • sudo apt-get install grafana

  • 4) Start Grafana
  • sudo service grafana-server start

Since Grafana uses a web interface, the easiest way to see if it is running is to use a web browser and navigate to the web interface. As we used default settings, the URL will be http://server-ip:3000/. If you do not already know it, you can find the IP of the project server by running the ip route command. This command and its output are shown below.

                     
bvest@project-beat-interceptor:~$ ip route
default via 192.168.2.1 dev ens18 proto dhcp src 192.168.2.196 metric 100
192.168.2.0/24 dev ens18 proto kernel scope link src 192.168.2.196
192.168.2.1 dev ens18 proto dhcp scope link src 192.168.2.196 metric 100
                     
                  

In the case of my project server, the IP address is 192.168.2.196, which is shown as the second IP address in the first entry, "default". So we would navigate to http://192.168.2.196:3000. If everything is working, your browser should show the window below.



We will log into this later in this document, so you can close this browser window now. Now we move on to the last part of the core parts of this project, the Perl modules.



Installing the Perl Modules

I chose Perl for this project for a few reasons. It does this kind of work well, I have been writing Perl script for around 15 years, and I have used it as the glue code in many systems similar to what we are doing with this project. It is very solid and is pre-installed on most Linux distributions. I may rewrite this in Go or another compiled language someday, but for now, this Perl version does what I need it to do. Though Perl is installed by default with Ubuntu Server 20.04, we need to install a few modules that make this kind of processing much easier.

All of the modules that we need can be installed using apt-get, though the CPAN system is quite useful if you write a lot of Perl code. Lets get to it, as these are the last core components that need to be installed.

  • 1) Install Redis Client Perl Module
  • sudo apt-get install libredis-perl

  • 2) Install Perl DBI Module (Database Interface)
  • sudo apt-get install libdbi-perl

  • 3) Install Perl Postgres Client Module (Connects to DBI)
  • sudo apt install libdbd-pg-perl

These are the last core components that can be installed with the apt-get system. The next steps take us through downloading the repository containing the custom files and script for this project and the configuration of all the pieces we just installed.




Download the Project Repository

Though we will step through the configurations and the script that runs this project, the whole project is in a repo on GitHub. We will use the files in this repo to copy the new configurations into place and finally run the script. Since git is installed by default, we will use that route to download this project.

Navigate to your home folder on this project machine and download the repo using the commands below.

cd ~
git clone https://github.com/b-vest/project-metricbeat-to-postgresql-with-perl.git

This git clone will have created a new folder named project-metricbeat-to-postgresql-with-perl, lets navigate to that folder and see the files that we will be working with.

cd project-metricbeat-to-postgresql-with-perl
ls -al

Out of the listing of files that is printed, the ones we need for this project are listed and explained below.

  • 1) hybrid-poller.pl
    This is the script that does all of the conversion betwen JSON formatted Metricbeat data and PostgreSQL compatible database inserts.

  • 2) project-file-fields.asciidoc
    This file defines all of the field names and descriptions that we will load into the field_names table later in this document.

  • 3) project-file-metricbeat.yml
    This is a custom metricbeat.yml configuration file that is preloaded with the Redis output and process filters for this project.

  • 4) project-file-redis.yml
    This is a preconfigured Metricbeat module for gathering metrics from the Redis server that we installed.

  • 5) project-file-system.yml
    This is a preconfigured Metricbeat module for gathering metrics from the Core system, such as CPU, Memory, Network usage and so on.

This brings us to the end of what we must install and download. Now we move on to building some database tables, configuring Metricbeat, and loading the field_names table. We will first configure Metricbeat and get it running so our Redis buffer can start collecting some metrics on the system. Having the metrics already in the Redis buffer will give us something to work with by the time we get back to Grafana and start working with the database queries.



Configuring Metricbeat

metricbeat.yml


Including the configurations in the repo make this project much easier to set up. For all of these configurations, we will first see the command to copy the configuration into the correct place and then a summary of what was changed in those configurations. Let's ensure we are in the project folder and run the commands listed below.

sudo cp ./project-file-metricbeat.yml /etc/metricbeat/metricbeat.yml

This step copies the metricbeat.yml from the project folder to where metricbeat stores its configuration information at /etc/metricbeat. What was copied into that folder is shown below.

                     
metricbeat.config.modules:
  # Glob pattern for configuration loading
  path: ${path.config}/modules.d/*.yml
  # Set to true to enable config reloading
  reload.enabled: false

output.redis:
  hosts: ["localhost:6379"]
  bulk_max_size: 6
  backoff_max: 1s
  #datatype: "channel"
  max_retries: 10
  #password: "thisisnotasecurekey"
  key: "metricbeat"
#  db: 0
#  timeout: 15


# ================================= Processors =================================

# Configure processors to enhance or manipulate events generated by the beat.

processors:
  - drop_fields:
       fields: ['agent','ecs','service']
  #- add_host_metadata: ~
  #- add_cloud_metadata: ~
  #- add_docker_metadata: ~
  #- add_kubernetes_metadata: ~

                     
                  

This custom configuration is a very stripped-down configuration versus the Metricbeat default configuration. This configuration only shows what we are actually using and a few commented out items for future use. Our only output is Redis at localhost:6379, which is where Redis listens by default. The one drop_fields: process is to remove fields that we do not need for this project.


redis.yml

Copy this metrics module into the metricbeat modules.d folder.

sudo cp ./project-file-redis.yml /etc/metricbeat/modules.d/redis.yml

This file is copied to nearly the same location but we added the modules.d folder. This is the Metricbeat module that collects metrics from the Redis server. All modules go into the /etc/metricbeat/moduels.d/ folder by default. Though this can be changed in the core metricbeat.yml configuration, for this project it is fine where it is.

                     
# Module: redis
# Docs: https://www.elastic.co/guide/en/beats/metricbeat/8.4/metricbeat-module-redis.html

- module: redis
  metricsets:
    - info
    - keyspace
  period: 10s

  # Redis hosts
  hosts: ["127.0.0.1:6379"]

  # Network type to be used for redis connection. Default: tcp
  #network: tcp

  # Max number of concurrent connections. Default: 10
  #maxconn: 10

  # Redis AUTH password. Empty by default.
  #password: foobared

                     
                  

The only change between this module and the stock redis.yml module is enabling the metricsets. Since we did not set up a redis password, nothing else needs to be changed.

Note: Redis is only secure without a password because for this project it only responds to localhost. If Redis is bound to anything other than localhost it should always have a password.


system.yml

Copy this metrics module into the metricbeat modules.d folder.

sudo cp ./project-file-system.yml /etc/metricbeat/modules.d/system.yml

Similar to the redis.yml above this file was copied to /etc/metricbeat/modules.d/system.yml. What was copied is below.

                     
# Module: system
# Docs: https://www.elastic.co/guide/en/beats/metricbeat/8.4/metricbeat-module-system.html

- module: system
  period: 10s
  metricsets:
    - cpu
    - load
    - memory
    - network
    #- process
    #- process_summary
    #- socket_summary
    #- entropy
    #- core
    - diskio
    #- socket
    #- service
    #- users
  process.include_top_n:
    by_cpu: 5      # include top 5 processes by CPU
    by_memory: 5   # include top 5 processes by memory
# Configure the mount point of the host’s filesystem for use in monitoring a host from within a container
# hostfs: "/hostfs"

- module: system
  period: 1m
  metricsets:
    - filesystem
    - fsstat
  processors:
  - drop_event.when.regexp:
      system.filesystem.mount_point: '^/(sys|cgroup|proc|dev|etc|host|lib|snap)($|/)'

- module: system
  period: 15m
  metricsets:
    - uptime

#- module: system
#  period: 5m
#  metricsets:
#    - raid
#  raid.mount_point: '/'

                     
                  

This module configuration is a bit larger than the one for Redis, as it can collect information about many parts of the system. For this project, I have disabled all process and socket monitoring and enabled diskio monitoring. The poller script should work with any of these options enabled, but these are the only ones we will use in this project.

With all of these things complete we can now start Metricbeat and start collecting data about our project system.


Start Metricbeat

sudo service metricbeat start

The easiest way to see if we are collecting metrics from Metricbeat is to check if our Redis metrics buffer is collecting any documents. Let's connect to Redis with the redis-cli utility and see if we have any documents.

redis-cli

Once connected lets run the Redis command LLEN metricbeat. This commamnd stands for List Length, the metricbeat part tells Redis which key to count. By default the Mericbeat redis module writes into a key named metricbeat. The output should be an integer similar to what is shown below.

                     
127.0.0.1:6379> llen metricbeat
(integer) 18   
                     
                  

This output shows that at this time, Redis is buffering 18 metric records that will be processed when we first start the polling script. This value will grow as we work through the rest of this document which is great because we will have a few metrics to build visualizations around when we get everything in place. Since we are using the default 10-second poll cycle, this record count will grow quickly.

Now we have data in one place, but it is still not usable by Grafana. We need to convert this data from JSON to SQL Inserts and send it to PostgreSQL. Before we can do that, we need to create the database, table, and users that we will use for the PostgreSQL side of this work. We also need to install the TimescaleDB extensions on our database and turn one of our tables into a hypertable. We move on to those steps in the next section.




Setting Up The Database and Tables

Now that we have data in Redis, we need to build the PostgreSQL and TimescaleDB parts. These tables will define and store the metrics currently collected by Metricbeat and stored in our Redis buffer. First, we need to reconnect to the database as we did earlier when we first set up PostgreSQL. Let's do that with the command below. You can either do this in a new terminal to the project server or exit from the redis-cli utility and run these commands in that same terminal.

sudo su postgres -c psql

Now that we are connected to the PostgreSQL database we need to create our project database. Run the commands below to create the project database and then connect to it.

CREATE DATABASE metrics_ht;
\c metrics_ht


If everthing has worked well to this point we should see the message below after connecting to the new database.

                     
postgres=# \c metrics_ht
You are now connected to database "metrics_ht" as user "postgres".
metrics_ht=#
                     
                  

Instale TimescaleDB Extensions

The first thing that we need to do after connecting to this new database is install the TimescaleDB extensions. All TimescaleDB functions run as parts of the query, as everything is tightly bound with PostgreSQL. Below is the query to install the extensions on our newly created metrics_ht database.

CREATE EXTENSION IF NOT EXISTS timescaledb;

If everything goes well you will be presented with a screen that looks like the one below.

                     
WARNING:
WELCOME TO
 _____ _                               _     ____________
|_   _(_)                             | |    |  _  \ ___ \
  | |  _ _ __ ___   ___  ___  ___ __ _| | ___| | | | |_/ /
  | | | |  _ ` _ \ / _ \/ __|/ __/ _` | |/ _ \ | | | ___ \
  | | | | | | | | |  __/\__ \ (_| (_| | |  __/ |/ /| |_/ /
  |_| |_|_| |_| |_|\___||___/\___\__,_|_|\___|___/ \____/
               Running version 2.8.1
For more information on TimescaleDB, please visit the following links:

 1. Getting started: https://docs.timescale.com/timescaledb/latest/getting-started
 2. API reference documentation: https://docs.timescale.com/api/latest
 3. How TimescaleDB is designed: https://docs.timescale.com/timescaledb/latest/overview/core-concepts

Note: TimescaleDB collects anonymous reports to better understand and assist our users.
For more information and how to disable, please see our docs https://docs.timescale.com/timescaledb/latest/how-to-guides/configuration/telemetry.

CREATE EXTENSION
                     
                  

If you see this then everything is good and the extensions are installed. Now we can move on to creating the tables.


Create The Tables

Now we are ready to create the tables. We will create two tables named field_names and metric_values. field_names will hold metric field names and the definitions of those names. metric_values will hold the real-time metrics that we pull from Redis and parse into SQL inserts. Let's create the field_names table first. Run the complete command listed below in the terminal where we just created the metrics_ht database.


field_names table

                     
CREATE  TABLE "public".field_names ( 
   field_id             serial  NOT NULL  ,
   field_name           varchar(100)    ,
   description          varchar    ,
   CONSTRAINT pk_field_names PRIMARY KEY ( field_id ),
   CONSTRAINT unq_field_names UNIQUE ( field_name ) 
 );
                     
                  

This table is our static relational table. This table does not update automatically as the metric_values table will. We load this data set using the poller script and manually refresh the data if any metric values are added or changed as Metricbeat updates. This project is designed around version 8.4.3, but these fields rarely change.

Below I will break down each variable and constraint in the order that they are shown in the query.

  • 1) field_id
    This is an auto incrementing field that will create the field_id which relates to metric_values.field_id. This is the only relation betwen these two tables.

  • 2) field_name
    Contains the full JSON key path that builds the actual name of the metric, such as system.network.in.bytes. We will investigate this more deeply when we look at the script.

  • 3) description
    Contains the definition of this field_name as provided by the metricbeat fields.asciidoc.

  • 4) Constraint Primary Key field_id
    As field_id is an auto incrementing key it is the Primary key. We need to be aware of this constraint when we build the PostgreSQL metrics user in just a bit.

  • 5) Constraint Unique field_name
    When this table does update or find unknown fields we do not want to of the same name. Making this a unique 100 character varchar column will stop this from happening.

Now we move on to the table that will hold all of our metrics and be a hypertable, metric_values


metric_values

                     
CREATE  TABLE "public".metric_values ( 
   "timestamp"          timestamp  NOT NULL  ,
   field_id             integer    ,
   hostname             varchar(100)    ,
   device               varchar  NOT NULL  ,
   "value"              numeric(40,6)    
 );                     
                     
                  

This table is our real time metrics collector. This is where all of the data that we process out of our Redis buffer ends up. Below I describe all of the fields as I did with field_names above.

  • 1) timestamp
    This is one of the most important parts of all time series data, the timestamp. Without the timestamp we can not have time series data of any kind. This uses the timestamp field which is compatible with Metricbeat timestamps out of the box.

  • 2) field_id
    This is the field_id that relates back to the field_names table. This allows us to save space by not storing the long text names for every data point. We only store the integer and then JOIN the field_names table when needed.

  • 3) hostname
    The name of the host that our data point came from. In an ideal world this would also be an Interger relation to a table named metric_hosts. For now this will work for our project.

  • 4) device
    This holds the name of the device that this metric belongs to. How this works will be explained after this basic summary.

  • 5) value
    This field is the other most important part of these metrics, the metric value. The metric value and timestamp make up the core of all time series data. The other fields are metric identification information but without timestamp or value, there is no way to visualize this metric.

The device column: As we collect metrics from many different sources in this project the devices column is ambiguous. In most circles, devices are things such as disk drives, network controllers, memory controllers, and so on, which are denoted by their device name in the metrics. Things such as sda for disk storage and ens18 or eth0 for network controllers. This device field goes beyond that and is used to identify any metric as to where it came from. If you look at things like memory, cpu, or system load, the device will be system as the system is the device. In the case of monitoring Redis, the system will be redis since that is the only process sending metrics. When gathering metrics for databases such as PostgreSQL and MySQL, the device column can potentially be the name of the database if you are collecting per-database metrics.


A couple more things

These are the only tables we need for this project. Now we need to add a couple things to the metric_values table. First we will add an index for the timestamp and field_id fields as these will be the most heavily searched.

CREATE INDEX idx_metric_values ON "public".metric_values ( "timestamp", field_id );

The response from this command will be a simpe CREATE INDEX if there were no errors. Now on to the TimescaleDB hypertable.



Convert metric_values table to hypertable

Hypertables are one of the core features of TimescaleDB. This chunked table structure allows data in one table to be sliced into many different chunks. A hypertable can outperform a standard table by many factors when designed correctly. Many variables come into play when designing the hypertable layout. These range from ingest rate and query rate to query span, document size, and the number of simultaneous users, among other factors. There are many ways to do this and many different use cases for how the hypertable should be created. The options for creating hypertables are documented very well at the Timescale website in this document, create_hypertable.

Though I can not think of a production environment where a ten-minute hypertable chunk is a good idea, it is a good idea for this project so we can take a peek at the show_chunks command later. Heed the warning about not doing this in production and run the command below to turn metric_values into a hypertable with 10-minute chunks.

SELECT create_hypertable('metric_values', 'timestamp', chunk_time_interval => INTERVAL '10 minutes');

In short, this says to create a hypertable on metric_values using the timestamp column to split the chunks into ten minute blocks. We can verify that the hypertable was created by running the show_chunks command below.

SELECT show_chunks('metric_values');

Since we are not ingesting data into the hypertable yet, the chunk count will be zero. Now we have to create the user that will be allowed to read and write from these tables, and then, we can start working with the script and Grafana.



Create PostgreSQL Users

We will create the metrics_user using the PostgreSQL client where we just created the tables. Run each of the commands below one at a time. These are fairly self explanitory based on the query

                     
CREATE USER metrics_user WITH PASSWORD 'notasecurepassword';
GRANT ALL PRIVILEGES ON DATABASE "metrics_ht" to metrics_user;
GRANT ALL ON field_names TO metrics_user;
GRANT ALL ON metric_values TO metrics_user;
GRANT USAGE, SELECT ON SEQUENCE field_names_field_id_seq TO metrics_user;
                     
                  

That last GRANT command GRANT USAGE, SELECT ON SEQUENCE field_names_field_id_seq TO metrics_user; took me a minute to figure out as I was used to older versions of MySQl which allowed a GRANT ALL to everything option. This is much more secure so I am glad I had to take the time to slow down and figure it out. Something to keep in mind.

And this ends all the work we have to do on the databases. The next steps are the fun part, the script, Grafana, and QUERIES. First, we will use the script to load the field_names data and start converting some of the data for us.




The Script: Sonata in Perl 5


Now we are getting to the good stuff that all of this setup has led us to, the script and data conversion. I chose Perl 5 for this project for a few reasons. I have been scripting with Perl 5 for over 15 years. It has all the modules we need, with one noticeable limitation I have worked around, and I can prototype fairly fast with Perl 5. With the proper configuration and optimizations, I believe this script could handle processing metrics for thousands of devices on top of the single host we will be processing for this project. There was a time I wrote a Perl server using only SNMP querying to collect metrics that could handle nearly 10,000 cable modems every 15 minutes. That system did not have Redis as a buffer, and the script had to do all the work. Most of this comes down to ingest and querying optimizations in the pipeline, not so much what the script alone is doing.

With my rambling out of the way lets take a look at the script and break down what the major parts are doing. We will not define every single line of code. Only the parts that do something meaningful. The line numbers do match up to what you will find in the script.



Deep Dive: hybrid-poller.pl

                     
17: use Redis;
18: use DBI;
19: use Data::Dumper;
20: use JSON::XS;
21: use Getopt::Long;
22: use Scalar::Util qw(looks_like_number);
                     
                  

These lines load all of the perl modules that are needed to read the data from Redis, convert it from JSON to SQL inserts and insert it into PostgreSQL.

                     
28: my @isRunning = qx(ps auxwwww | grep $0 | grep -v grep | grep -v $$);
29: if ( scalar(@isRunning) > 0 ) {
30:     print "Script is running. Exiting\n";
31:     exit;
32: }
33: 
                     
                  

These lines are a piece of code that I add to all scrits that run as a daemon. This prevents multiple copies of the script from running which allows us to setup a cron job to restart the script if it fails.

                     
36: GetOptions(
37:     'help'               => \my $help,
38:     'dbname=s'           => \my $dbname,
39:     'dbhost=s'           => \my $dbhost,
40:     'dbport=i'           => \my $dbport,
41:     'dbuser=s'           => \my $dbuser,
42:     'dbpass=s'           => \my $dbpass,
43:     'batchsize=i'        => \my $batchsize,
44:     'rkey=s'             => \my $rediskey,
45:     'debug=i'            => \my $debug,
46:     'fields_file=s'      => \my $fields_file,
47:     'build_fields_table' => \my $build_fields_table,
48:     'live'               => \my $live
49: ) or die "Invalid options passed to $0\n";
50: 
51: $dbport    = 5432         if !$dbport;
52: $rediskey  = 'metricbeat' if !$rediskey;
53: $debug = 0 if !$debug;
54: $dbhost = 'localhost' if !$dbhost;
55: 
56: if ( !$dbname || !$dbhost || !$dbuser || !$dbpass || $help ) {
57:     print "Missing command line settings.\n";
58:     printHelp();
59:     exit;
60: }
61: 
62: if ($debug) {
63:     print "Using Values\n";
64:     print "--dbname $dbname --dbhost $dbhost --dbport $dbport --dbuser $dbuser --dbpass $dbpass --rkey $rediskey\n";
65: }

                     
                  

These lines load our command line options and set some defaults for the known values for this project. Then it checks for missing values and runs the printHelp() sub routine if any required values are missing. At line 62 we pring the given values if the --debug 1 option was given.

                     
68: my $redisClient = Redis->new;

71: my $dbh = DBI->connect( "dbi:Pg:dbname=$dbname;host=$dbhost;port=$dbport", $dbuser, $dbpass, { AutoCommit => 0, RaiseError => 1 } ) or die $DBI::errstr;
                     
                  

These lines load the Redis and PostgreSQL clients that this script will use to move the data from Redis to PostgreSQl after it has been converted from JSON to SQL.

                     
73: if ($build_fields_table) {
74:   buildFieldsTable();
75:   exit;
76: }
                     
                  

If the --build_fields_table option was given then run the buildFeidlsTable(); sub routine. The code for that subroutine is below.

                     
245: sub buildFieldsTable{
246:      print "build_fields_table\n";
247:     if ( !$fields_file ) {
248:         print "Missing --fields_file\n";
249:         printHelp();
250:         exit;
251:     }
252:     my @fieldsLines = qx(cat $fields_file);
253:     my $fieldsLines;
254:     my $field_name;
255:     my $insertQuery;
256:     my $lineCounter   = 0;
257:     my $insertedCount = 0;
258:     my $updatedCount  = 0;
259:     my @insertArray;
260: 
261:     foreach my $line (@fieldsLines) {
262:         chomp($line);
263:         if ( $line =~ /::/ ) {
264:             chomp( $fieldsLines[ $lineCounter + 3 ] );
265:             $fieldsLines[ $lineCounter + 3 ] =~ s/'//g;
266: 
267:             #Check and ingest field_name if it does not exist
268:             if ( $insertQuery =~ "INSERT" ) {
269:                 print "Do The insert\n";
270:                 $insertQuery = "";
271:             }
272: 
273:             #print "Split Field Name\n";
274:             ( undef, $field_name, undef ) = split( /\`/, $line );
275: 
276:             #print $field_name."\n";
277:             #check if this row already exists in the database
278:             my $checkSql       = "SELECT * from field_names where field_name = '" . $field_name . "'";
279:             my $fieldNameCheck = $dbh->selectall_hashref( $checkSql, "field_name" );
280:             $dbh->commit();
281: 
282:             if ( $fieldNameCheck->{$field_name}->{field_id} ) {
283:                 my $updateSql = "UPDATE field_names SET description = '" . $fieldsLines[ $lineCounter + 3 ] . "' WHERE field_name = '" . $field_name . "'";
284:                 $dbh->do($updateSql);
285:                 ++$updatedCount;
286:             }
287:             else {
288:                 my $insertQuery = "INSERT INTO field_names (field_name, description) VALUES('" . $field_name . "','" . $fieldsLines[ $lineCounter + 3 ] . "')";
289:                 $dbh->do($insertQuery);
290: 
291:                 ++$insertedCount;
292:             }
293:         }
294: 
295:         #print "Field Name: ".$field_name." ".$line ."\n";
296:         ++$lineCounter;
297:     }
298: 
299:     $dbh->commit();
300:     print "Finished processing fields_talbe. Inserted: " . $insertedCount . " Updated: " . $updatedCount . " \n";
301: 
302:     exit;
303: }

                     
                  

This subroutine parses the project file project-file-fields.asciidoc to find metric value names and their definitions. This file has the same content as the file /etc/metricbeat/fields.yml, which is installed with Metricbeat. I find the asciidoc version easier to process, it comes from the same package and can be found at the Metricbeat GitHub repo here fields.asciidoc. Next we will take a look at the do{foreach{foreach{foreach{}}}}while loop that keeps this running. This large chunk of code will be summarized after the code.

                      
091: do {
092: START:
093:     my $redisCounter = 0;
094:     my $metric  = $redisClient->lpop( $rediskey);
095: 
096:     if ( !$metric ) {
097: 
098:         #if no metrics sleep for 2 seconds and try again.
099:         print "No metrics. Sleeping\n" if $debug;
100:         sleep(1);
101:         goto START;
102:     }
103: 
104:         #Loop through the metric array. This is an array of JSON text
105:         #remove new line characters from end of string
106:         chomp($metric);
107: 
108:         #print $metric."\n";
109:         #Convert JSON string into perl Hash object
110:         my $metricHash = decode_json($metric);
111: 
112:         #Assign a few variables for needed infomration at the root of the Hash
113:         my $timestamp   = $metricHash->{'@timestamp'};
114:         my $hostname    = $metricHash->{host}->{name};
115:         my $eventModule = $metricHash->{event}->{module};
116:         my $metricSet   = $metricHash->{metricset}->{name};
117:         my $deviceName  = $eventModule;
118: 
119:         #This creates the hash that contains the values that we want to ingest
120:         #we create the $theseMetrics variable to hold the metrics hash so the
121:         #hash keychains to not get long and confusing
122:         my $theseMetrics = $metricHash->{$eventModule}->{$metricSet};
123:         #If $theseMetrics is empty skip it. Maybe add logging in the future for things that are skipped
124:         #for this project dropping it is fine.
125:         print $timestamp. " ->" . $hostname . " -> " . $eventModule . " -> " . $metricSet . "\n" if $debug >= 2;
126:         print $theseMetrics. "\n"                                                                if $debug >= 2;
127: 
128:         #At this point we have to start looping through the metrics to get
129:         #the names and values of the metrics.
130:         #These can be up to four layers deep depending on the type of
131:         #metric
132:         #first level loop
133:         if ( $theseMetrics->{name} ) {
134:             print "This is a Device Name: " . $theseMetrics->{name} . "\n" if $debug >= 2;
135:             $deviceName = $theseMetrics->{name};
136:         }
137:         foreach my $level_1 ( keys %{$theseMetrics} ) {
138:             #Create New Variable for this level
139:             my $level_1Hash = $theseMetrics->{$level_1};
140:             #Check to see if this is a key with a value or another hash to loop
141:             if ( ref($level_1Hash) eq "HASH" ) {
142:                 #Here we will loop the variable that we created above which is actually $theseMetrics->{$level_1}
143:                 foreach my $level_2 ( keys %{$level_1Hash} ) {
144:                     #second Level Loop
145:                     my $level_2Hash = $level_1Hash->{$level_2};
146:                     if ( ref($level_2Hash) eq "HASH" ) {
147:                         #third level loop
148:                         foreach my $level_3 ( keys %{$level_2Hash} ) {
149:                             my $level_3Hash = $level_2Hash->{$level_3};
150:                             if ( ref($level_3Hash) eq "HASH" ) {
151:                                 #fourth level loop
152:                                 foreach my $level_4 ( keys %{$level_3Hash} ) {
153:                                     my $level_4Hash = $level_3Hash->{$level_4};
154:                                     if ( ref($level_4Hash) eq "HASH" ) {
155:                                         #No hashes are above 4 deep in this data set
156:                                         #This is a place holder for logic flow
157:                                     }
158:                                     else {
159:                                         my $field_name = $eventModule . "." . $metricSet . "." . $level_1 . "." . $level_2 . "." . $level_3 . "." . $level_4;
160:                                         my $fieldCheckReturn = checkFieldNameExsts($field_name);
161:                                         my $field_id   = $fieldNameRef->{$field_name}->{field_id};
162:                                         push(@insertArray,[$timestamp, $field_id, $hostname, $deviceName, $level_4Hash, $field_name]);
163:                                         ++$insertCounter;
164:                                         checkBatchSize(@insertArray);
165:                                     }
166:                                 }
167:                             }
168:                             else {
169:                                 my $field_name = $eventModule . "." . $metricSet . "." . $level_1 . "." . $level_2 . "." . $level_3;
170:                                 my $fieldCheckReturn = checkFieldNameExsts($field_name);
171:                                 my $field_id   = $fieldNameRef->{$field_name}->{field_id};
172:                                 push(@insertArray, [$timestamp, $field_id, $hostname, $deviceName, $level_3Hash, $field_name]);
173:                                 ++$insertCounter;
174:                                 checkBatchSize(@insertArray);
175:                             }
176:                         }
177:                     }
178:                     else {
179:                         my $field_name = $eventModule . "." . $metricSet . "." . $level_1 . "." . $level_2;
180:                         my $fieldCheckReturn = checkFieldNameExsts($field_name);
181:                         my $field_id   = $fieldNameRef->{$field_name}->{field_id};
182:                         push(@insertArray,[$timestamp, $field_id, $hostname, $deviceName, $level_2Hash, $field_name]);
183:                         ++$insertCounter;
184:                         checkBatchSize(@insertArray);
185:                     }
186:                 }    #end $level_1Hash Loop
187:             }
188:             else {
189:                 #This is jsut a place holder for logic flow
190:                 #We do no inserts here because this can only contain the
191:                 #name string which we assigned at the top of
192:                 #this loop
193:             }
194:         }# end $theseMetricsHash Loop
195: 
196: } while 1;
197: 

                     
                  

The do{}while part of this loop is what keeps it running over and over looking for new metrics. Since this all works together we will use a list to step through the code. The list entris will be marked with the identifying line numbers.

  • 091) Start the do loop. This loop ends on line 196 with }while 1;

  • 092) I dentify this as the START: loop so we can use goto at line 101 if there are no metrics available.

  • 094) Send an LPOP command to Redis. In this mode Redis is a FIFO buffer. We pop the first record off of the top, return it to the script and remove it from Redis. This Perl module does not support the LPOP count function so we query it as a single variable. (I may add LPOP count option to the Redis Perl module in the future)

  • 096 - 102) If no metrics were available at from the LPOP command sleep for one second and check again bu returning to START after the sleep.

  • 110 - 122) If we have made it this far we have JSON data to work with. Convert it into a Perl hash with decode_json, and create the initial variables avalable in the root of the JSON data that is now a Perl Hash. At line 122 we assign only the part of the JSON data that we want into a new variable. This new HASH is defined by variables that we assigned at lines 115 and 116.

  • 133 - 136) This conditional is here in the rare event that we might run into a record that does not have a device name. Otherwise it assigns the name key to a variable $deviceName which we will use in the SQL insert later.

  • 137 - 193) The code chunk at these lines is the infamous deep nested loop. The way this data is stored in the Metricbeat JSON data, the metric values we want to process could potentially be up to 4 levels deep in the JSON data we are processing as a Perl Hash. The timestamp is at the first level of keys, yet the value may be at levels 2, 3, or 4, so to make sure we get everything, we have to traverse the whole JSON tree. But, as we approach the level that the value is on, we have to continually append items to the field_name variable as each new level that we enter has a part of that field_name that we need. The deepest example is this variable at level 3 as shown in this code snip from line 169 my $field_name = $eventModule . "." . $metricSet . "." . $level_1 . "." . $level_2 . "." . $level_3;. This would result in something like system.network.in.bytes which will be the field_name for this metric value. However, we convert that field name to the field_id integer at lines 161, 171, and 181.

    As the values are processed they are pushed into the array @insertArray to be used as part of the insert. On every loop cycle at every level the sub routine checkBatchSize runs to see if the number of elements in the array is greter than or equal to the --batchsize option. That sub routine is shown below.
                     
199: sub checkBatchSize{
200:   my @thisArray = @_;
201:   if(scalar(@thisArray >= $batchsize)){
202:       #print scalar(@thisArray)."\n";
203:     my $insertReturn = processInsertArray(@thisArray);
204:   }
205: }                     
                     
                  
  • We pass the whole array to this subroutine and it uses the Perl scalar function to check how many elements are in the Array. If the result is larger than the $batchsize variable that we set at the command line then it runs the processInsertArray() sub routine and pass the same array to it. Below we will look at the processInsertarray() sub routine.
                     
207: sub processInsertArray{
208:   my @thisArray = @_;
209:       my $insertMetric = "INSERT INTO metric_values (timestamp,field_id,hostname,device,value) VALUES(?,?,?,?,?)";
210:     my $sth = $dbh->prepare($insertMetric);
211:   foreach my $metric(@thisArray){
212:     #In this version we are going to skip text and empty values. Maybe in the future log these and or add support
213:     #print Dumper($metric);
214: 
215:     if(looks_like_number($$metric[4])){
216:       $sth->execute($$metric[0],$$metric[1],$$metric[2],$$metric[3],$$metric[4]);
217:     }
218:   }
219:   $dbh->commit();
220: 
221:   @insertArray = ();
222: 
223: }
224: 

                     
                  
  • This is the subroutine that does the final work of inserting the records into PostgreSQL. We initialize the insert variable at line 209 and prepare it at line 210. We do these outside of the loop because these only need to be done once, and it wastes processing time if we do it over and over in the loop. We also used the ?,?,?,? insert style so we can loop through the array just passing variables into the $sth->execute() function. We use $$ notation of the array variables because the @thisArray array is an array of arrays $$ denotes that we are accessing an Array reference. At line 215, you will notice that we are using the function from the use Scalar::Util qw(looks_like_number); module that we loaded on line 22. This is because some metrics returned from metricbeat are not numeric but text. This script can not process text metrics at this time, and that is a whole new arena of brain hurt so we will skip those and focus on numeric values.

    Once the full array has been executed, we commit the changes to the database at line 219. Since our database connector at line 71 has AutoCommit disabled, we have to call commit. This is especially efficient if you are ingesting a lot of records at the same time. Though we will only process a few at a time with this project, it is a good way to set it up anyway.

With that out of the way, there is one last processing subroutine that we will take a quick peek at then we can get to actually running the script. This routine is a sometimes do something routine that takes care of field names that were not imported from the file project-file-fields.asciidoc. Some field names are set up as wildcards in the asciidoc file. The field_names table builder does not know how to deal with these, so the script below detects that condition and creates a new field_name, and new field_id entry in the field_names table, then reloads the field_name hash into the global variable.

                     228: sub checkFieldNameExsts{
229:  my $field_name = shift;
230:  #Some field_names are wildcarded in the fields.asciidoc. We will of course have no match for these.
231:  #Here we will verify that the field name exists, create it if it does not and return the created value
232:  if(!$fieldNameRef->{$field_name}){
233:     print "Field Name Does Not Exist. Creating: ".$field_name."\n";
234:     my $insertQuery = "INSERT INTO field_names (field_name) VALUES('" . $field_name . "')";
235:     my $newFieldReturn = $dbh->do($insertQuery);
236:     my $fieldNamesSql = "SELECT * from field_names";
237:     print "Reload the Hash: ".$fieldNamesSql. "\n" if $debug;
238:     $fieldNameRef = $dbh->selectall_hashref( $fieldNamesSql, "field_name" );
239: 
240: $dbh->commit();
241: 
242:  }
243: }
                  

End of the Script

And that is it. This is how the script works to convert the JSON formatted metrics provided by Metricbeat into SQL inserts. Lots of reading and information ingesting, but it is all worth it now as we run the script for the first time.




Running the Script

Now we are all finished with installs, explanations, configurations, and explaining how all of these backend parts work together. Now we can run the script for the first time to do its first job of building the field_names table. If you run the script without any options, it will print the detailed help file to the screen. It is good to know these options, but with this project, many items are set to defaults that will work with this system. The only thing that we have to provide to run the script is the database name, in this case, metrics_ht, and the database username and password that we created earlier. If you followed along with all of the setup to this point, you should be able to run the command below with no problems.

perl hyprid-poller.pl --dbname metrics_ht --dbuser metrics_user --dbpass notasecurepassword --build_fields_table --fields_file project-file-fields.asciidoc

This first run of the script builds the field_names table that is required to match the field_id from the metric_values table to the field_id in this table, allowing us to use metric names instead of the numeric id. If there are no errors, this script will only print build_field_table and a count of how many fields were inserted or updated into the field_names table. And that is the final step. Now we start collecting metrics. All we have to do to change the script to that mode is remove the --build_fields_table and --fields_file options, as shown below.

perl hyprid-poller.pl --dbname metrics_ht --dbuser metrics_user --dbpass notasecurepassword

If there are no errors, then we are collecting data. We can easily verify this by checking our record count in the PostgreSQL client. If you do not have a terminal connected to the database, open a new terminal and run the command below. We will leave this window running because this is where the script is running. We can move the script into background mode by running perl hybrid-poller.pl --dbname metrics_ht --dbuser metrics_user --dbpass notasecurepassword &, which will detach the process from the console but for this project, it is good to leave it running in the terminal for now. This time we will connect with the user we created earlier, using the same username and password we are using when we launch the script. You will be prompted for the password.


psql -h localhost -U metrics_user -d metrics_ht

This will take us directly to our metrics_ht database. Now we can count the rows in the values table to see how many records we have collected. Run the SQL command below to count the records.

                     SELECT COUNT(*) FROM metric_values;
                  

This command will return an integer named count that shows how many metrics have been inserted into the metric_values hypertable. Now that we have some data in the tables, we can look at how many chunks have been created by our hypertable with a 10-minute interval. Run the command below to see the chunk information.

                     SELECT show_chunks('metric_values');
                  

This query will display an output similar to the one below. This output shows how many different files this hypertable has been chunked into. Since we set these chunks to a 10-minute interval, the count of chunks should match the amount of time since we started Metricbeat. If you have been working through this for an hour, then you should see six chunks; for 30 minutes, you would see three chunks, and so on. This count may not be exact as it depends on where our script is in processing the metrics from the Redis buffer. Remember, 10-minute intervals on the chunks is probably never a good idea, but in the case of this project, it helps explain how these chunks work. Also, if, for some reason, you have to backfill data, it goes to the chunk where that timeframe exists.

                     
metrics_ht=> SELECT show_chunks('metric_values');
               show_chunks
-----------------------------------------
 _timescaledb_internal._hyper_1_1_chunk
 _timescaledb_internal._hyper_1_2_chunk
(2 rows)
                     
                  


End of the terminal work

Here on line 1175 of this HTML document, we come to the end of all the steps needed to prepare this data for use by Grafana. From here on out, we will be working with Grafana. We will set up the data source and step through a few custom queries needed to use this data set with Grafana, even the coveted, on-the-fly events per second calculations. Now the fun begins, visuals and dashboards.



On to Grafana We Go


By default, Grafana does not understand the layout of the relational database that we have created. This database is not a standard key/value setup. I call this a key/key/key/value setup. For this to work, we have to know the hostname, system, and field_name (which is field_id before the relation), so we can build visuals based on field names instead of having to search the field_id every time we want to create a new visual. But that is just for queries for the data type of gauge. If we are working with counter-type data, data that is always increasing or decreasing in the same direction, then we have to use a LAG query which gets a little more complex. We will cover both ways as we move forward.

As we have not logged into Grafana yet, we will have to go through the steps of logging in as admin/admin and creating a new password. Let's navigate back to the web interface that we opened many lines ago, or open a new one at http://project-server-ip:3000 and log in with the username admin with the password admin. When prompted, choose a new password and continue.

The first thing that we need to do is add a new data source. Once you are logged in, navigate to the Data sources option, as shown in the screenshot below.



Once you are at the new window, click the Add data source button. A new window will load with a screen full of options for data sources. Either type Postgres into the filter box at the top of this page or scroll through the list until you find PostgreSQL and select that option.

After clicking on PostgreSQL, you will see a screen like the one below. The values in the image below are the values for this project. SSL is disabled in this project as this is all self-contained on one VM. If this idea were to move into production somewhere, all of the SSL and other security precautions should be taken. Everything on this project is contained in this single localhost, so it is fairly secure.



The Host setting is the localhost connection to the database. The User and Password are the same ones we created above with the CREATE USER command. If you changed either the username or the password, that should be reflected here. And finally, remember that we are connecting to the metrics_ht database. Once all these settings are filled in and verified, click the Save and Test button at the bottom of this screen. If everything is good, you will see a green popup above the button that says "Database Connection OK". Now we can build a visual.

From the page you are at now click on Dashboards icon as shown in the image below and select Manage.



Once the new page has loaded select the top left option "Add a new panel". You should see the screen below.



As shown in the image above, click the pencil edit icon, which is circled in red, and paste the query below into the query box. Once you have done all of this, click on the visualization window, and you should see a graph similar to the one in the image.

                     
SELECT
  "timestamp" AS "time",
  field_names.field_name AS metric,
  hostname AS hostname,
  avg(value) AS "Load 1m"
FROM metric_values LEFT JOIN field_names on field_names.field_id = metric_values.field_id 
WHERE field_names.field_name = 'system.load.norm.1' AND $__timeFilter(timestamp)
GROUP BY "timestamp",2,3
ORDER BY 1,2
                     
                  

The graph and query above show the 1-minute processor load of the project server. We chose that metric to graph with this line of the query WHERE field_names.field_name = 'system.load.norm.1'. system.load.norm.1 is the metric we are querying from the database.

For an overall view of the system load, we can add more queries to see the 5-minute and 15-minute cpu load on this server. We do this by clicking the + Query button under our current query editor. When the new query editor box loads, click on the edit pencil as we did before and paste in the same query, though we will change a few things. Below is the query for the 5-minute load average.

                     
SELECT
  "timestamp" AS "time",
  field_names.field_name AS metric,
  hostname AS hostname,
  avg(value) AS "Load 5m"
FROM metric_values LEFT JOIN field_names on field_names.field_id = metric_values.field_id 
WHERE field_names.field_name = 'system.load.norm.5' AND $__timeFilter(timestamp)
GROUP BY "timestamp",2,3
ORDER BY 1,2
                     
                  

If you compare the two queries, you will see that lines 5 and 7 differ. Line 5 changed Load 1m to Load 5m, and line 6 changed system.load.norm.1 to system.load.norm.5. The change on line 5 changed the graph legend to Load 5m. The change on line 7 tells the query to look for system.load.norm.5 instead of system.load.norm.1. Before we look at another screenshot, let's go ahead and add a third query to get the 15-minute cpu load reported by this project server.

                     
SELECT
  "timestamp" AS "time",
  field_names.field_name AS metric,
  hostname AS hostname,
  avg(value) AS "Load 15m"
FROM metric_values LEFT JOIN field_names on field_names.field_id = metric_values.field_id 
WHERE field_names.field_name = 'system.load.norm.15' AND $__timeFilter(timestamp)
GROUP BY "timestamp",2,3
ORDER BY 1,2
                     
                  

Click on the graph window once more and we should now have a third metric named Load 15m as shown in the image below.



This updated visual shows this project server's 1-minute, 5-minute, and 15-minute load. These are always good values to know on any system because these metrics easily let us know if our server has a problem or is lacking CPU power. We can save this and move on to the time series events per-second query. Choose a name for this visual and put it in Panel title on the top right of the screen. I named this one Processor Load. Then click Apply. After clicking apply, the browser should return to the dashboard with this graph present on that dashboard. In almost all cases, there will be a bit of a spike in this graph that lets us know when we first started reading data out of Redis and ingesting it into PostgreSQL. This spike happens because Redis has been buffering our metrics since we started the script. There were many records to process at first. After that initial processing, the ingest rate falls to a few every 10 to 20 seconds. This data can be better visualized if we use the timestamp dropdown to show the last 15 minutes, as we are currently looking at the last 6 hours.



We will keep this visual around for a while, so give it a title using the box on the right outlined in blue. This one I called System CPU Load. Then click Apply. This will return you to the new dashboard with our system load visual as the only thing on the dashboard.





Take a Mind Break

We have accomplished a lot in these few pages. We have installed two types of databases, TimescaleDB extensions, Perl modules, metricbeat, and Grafana. Not to mention that we have created databases, database tables, database users, and a hypertable. Now we are running a script that converts nested JSON data into SQL inserts to store metrics. And now, we are visualizing that data using Grafana.

We are taking a breather here because next, we get into a query that will let us build visuals around data based on counters. Unlike gauge data, counter data is, in most cases, always increasing. Bandwidth counters, disk usage counters, and such generally go up or down in the same direction for long periods. If you try to visualize these kinds of values without LAG queries, you will see an ever-increasing line or sawtooth dips when counters reset or, in the case of disk usage, when data is removed. To visualize the data correctly, we have to perform math between the current database row and the row before the current row. This type of query gives us the difference in the value over that time range which we then render on the visualization.

At this point, we have built a fully functioning visualization system out of readily available parts and tied them together with a bit of Perl. This is a grand accomplishment in an environment where many systems like this are single-tasked and split across many people on a team. I think it is good for everyone to know how all of the parts work, which is what spawned this document. Be proud of what we have accomplished to this point, and we now take on the PostgreSQL LAG query.




PostgreSQL Time series LAG Queries

Let's head back to the dashboard where our System CPU Load graph is displayed. Click the add panel icon at the top of this screen, and choose Add New Panel. This is the same screen we used for the last visualization; we will just be using a different query.



Click on the edit button as before and paste in the query below.

                     
WITH var1 AS(SELECT timestamp, hostname,metric_values.field_id, field_name, device, value
             FROM metric_values
             LEFT JOIN field_names on field_names.field_id = metric_values.field_id
             WHERE field_name = 'system.network.in.bytes'
             AND device != 'lo'
             AND $__timeFilter(timestamp)
            ),
     var2 AS(SELECT timestamp, hostname, device, value, 
          LAG(value,1) OVER (ORDER BY hostname,device,timestamp) previous_value,
             LAG(timestamp,1) OVER (ORDER BY hostname,device,timestamp) previous_timestamp,
             LAG(device,1) OVER (ORDER BY hostname,device,timestamp) previous_device
             FROM var1
            ),
     var3 AS(SELECT timestamp, hostname, device, value, previous_value,
             CASE 
               WHEN device = previous_device AND  EXTRACT (EPOCH FROM(timestamp - previous_timestamp)) > 0
                 THEN (value - previous_value) / EXTRACT (EPOCH FROM(timestamp - previous_timestamp))
               ELSE NULL 
             END AS  bytes_sec
             FROM var2 
            )
SELECT timestamp AS "time", hostname, device, bytes_sec AS "Inbound" from var3 ORDER BY time
                  
               

If everything is still working smoothly, this should generate a new matric visualization showing the Inbound network traffic on all interfaces. This visualization is generated with the slightly more complex LAG query that we ran for this visual. The whole query is broken down in detail in the next section.


PostgreSQL LAG Query Details

The query above does everything needed to calculate the bytes per second value for every host that has metrics for the field_name system.network.in.bytes. Like many other metrics, this one is a counter. The value is ever-increasing. If you were to visualize this metric's time series raw values, you would see an ever-growing angle as the counter increases. This type of visual is not very useful information, maybe in a table or a single metric readout, but not on a time series line graph, which is what most of us want.

Since we store all of the metricbeat metric values in a table that is keyed by hostname, device, and field_id, it takes a custom query like this one for Grafana to render the data correctly. There may be ways to do this in Grafana, but I found it easier to work with the queries, especially since the PostgreSQL data source is not like most key/value sources.

Without including the field_names table, we can see that we are working with timestamp, hostname, device, field_id, and value. The three keys that allow the metric to be identified are hostname, device, and field_id. The other two keys, timestamp and value, are our actual metrics as they are the two ever-changing values. In the case of counters, this ever-changing value is always incrementing. The timestamp is also incrementing with every new metric entry. To determine the number of events per second, we have to find the difference between current and previous values, find the difference between current and previous timestamps and then divide the value difference by the time difference. That is what this query does. Below we will break the query parts down and explain what they are doing.

Query Breakdown


                  
1 WITH var1 AS(SELECT timestamp, hostname,metric_values.field_id, field_name, device, value
2              FROM metric_values
3              LEFT JOIN field_names on field_names.field_id = metric_values.field_id
4              WHERE field_name = 'system.network.in.bytes'
5              AND $__timeFilter(timestamp)
6            ),
                  
               

Lines 1-6 Explained

  • 1) Assign Select to Variable var1
  • 2) Selecting from the table metric_values
  • 3) JOIN field_names with metric_values by relating the field_id in both tables. This allows us to select our metric with a text name such as system.network.in.bytes instead of its numeric field_id
  • 4) First query filter that limits the results to field_name system.network.in.bytes only
  • 5) Filter results by timestamp, in this case based on the Grafana variable $__timeFilter

So, in this case we are looking for only the field_name system.network.in.bytes and only within the time range that is selected on the Grafana dashboard. $__timeFilter(timestamp) can be changed to a normal query such as AND timestamp > now() - INTERVAL '15 minutes' to work as a standard query. The results of this query are then passed on to the next section as var1.

                  
7     var2 AS(SELECT timestamp, hostname, device, value, 
8          LAG(value,1) OVER (ORDER BY hostname,device,timestamp) previous_value,
9             LAG(timestamp,1) OVER (ORDER BY hostname,device,timestamp) previous_timestamp,
10            LAG(device,1) OVER (ORDER BY hostname,device,timestamp) previous_device
11            FROM var1
12            ),
                  
               

Lines 7-12 Explained

  • 7) Assign Select to Variable var2
  • 8) Select the previous value of the current host and device ordered by ascending timestamp and assign it to previous_value.
  • 9) Select the previous timestamp of the current host and device ordered by ascending timestamp and assign it to previous_value.
  • 10) Select the previous device of the current host and device ordered by ascending timestamp and assign it to previous_value.
  • 11) Selecting from var1 which we created in the last section

                  
13     var3 AS(SELECT timestamp, hostname, device, value, previous_value,
14            CASE 
15               WHEN device = previous_device AND  EXTRACT (EPOCH FROM(timestamp - previous_timestamp)) > 0
16                 THEN (value - previous_value) / EXTRACT (EPOCH FROM(timestamp - previous_timestamp))
17               ELSE NULL 
18             END AS  bytes_sec
19             FROM var2 
20            )
                  
               

Lines 13-20 Explained

  • 13) Assign Select to Variable var3
  • 14) Start SQL conditional. Similar to if then
  • 15) This conditional compares the current device to the previous_device and checks that the difference between the current and previous timestamps is greater than zero. The time difference calculation can result in 0 when counters reset or the device being queried changes. This stops divide by 0 errors and returns NULL, which Grafana will treat as a missing data point and connect to the next data point.
  • 16) If the above conditions are met, then do the events per second calculation and return it as bytes_sec
  • 17) If the conditions on line 15 are not met then return NULL as bytes_sec
  • 18) End this conditional and assign the result to bytes_sec
  • 19) We SELECTing from the data in var2 created above.

                  
21 SELECT timestamp AS "time", hostname, device, bytes_sec AS "Inbound" from var3 ORDER BY time
                  
               

Line 21 Explained

  • 21) Select the data from var3 created above, identify bytes_sec as Inbound and send the data to Grafana. Selecting bytes_sec AS "Inbound" is what sets the data type name on the Grafana visual legend. If you change Inbound to Ingress, that is what would be displayed on the visual.

Below is an example of the output sent to Grafana with this query.

                  

          time           |       hostname       |     device      |        Inbound
-------------------------+----------------------+-----------------+------------------------
 2022-10-08 12:17:33.628 | haproxy-dev          | ens18           |  1508.4491550844915508
 2022-10-08 12:17:35.999 | dev-timescale        | ens18           |     28008.100000000000
 2022-10-08 12:17:36.221 | beat-interceptor     | ens18           |     18365.536553655366
 2022-10-08 12:17:38.132 | dangerousmetrics.org | eth0            |    50.2000000000000000
 2022-10-08 12:17:38.573 | dev-dangrousmetrics  | ens18           |   772.6000000000000000
 2022-10-08 12:17:43.629 | haproxy-dev          | ens18           |  1833.7166283371662834
 2022-10-08 12:17:45.999 | dev-timescale        | ens18           |     28669.500000000000
 2022-10-08 12:17:46.222 | beat-interceptor     | ens18           |     18400.759924007599
 2022-10-08 12:17:48.132 | dangerousmetrics.org | eth0            |    62.0000000000000000
 2022-10-08 12:17:48.573 | dev-dangrousmetrics  | ens18           |   521.9000000000000000
 2022-10-08 12:17:53.628 | haproxy-dev          | ens18           |  1528.2528252825282528
 2022-10-08 12:17:55.998 | dev-timescale        | ens18           |     74631.463146314631
 2022-10-08 12:17:56.222 | beat-interceptor     | ens18           |     57966.900000000000
 2022-10-08 12:17:58.133 | dangerousmetrics.org | eth0            |    60.9939006099390061
 2022-10-08 12:17:58.574 | dev-dangrousmetrics  | ens18           |   778.9221077892210779
                  
               

Remember that we touched on the device column earlier. It can identify many different things, but this is taken into account when we select the metric type to see, so as long as we are familiar with the data, this is a good way to track what sent the metric in the shortest form possible. In a future version, the hostname and device will be integers linking to metric_hostnames and metric_devices. This design will further reduce the size of the metric_values table just as using the field_names table did. Relating two fields on an integer is vastly more efficient than linking on any text, especially what could be long text, like the FQDN of a host that is sending metrics.

With all of the explanations about the query out of the way, let's go ahead and create the Outbound time series for this visual. We will add a second query and change two things, just like when we were working on the load graph. We will try this one without reprinting the query. The changes are simple and outlined below. First, click the + Query button and paste our current query into the new query window by clicking the edit pencil, as we have done a few times already. Once the query is in the box, make the changes below.

  • 1) At line 4 at the WHERE section of the query change system.network.in.bytes to system.network.out.bytes
  • 2) At line 21 at the final SELECT statement change Inbound to Outbound.

Once these changes are made, a new set of lines and new legend items should appear on the visual showing Outbound traffic along with the Inbound traffic. The x-axis shows raw values. If you would like to see this as a converted value, such as bytes per second, scroll through the Grafana menu on the right until you find Standard options. Choose bytes/s from the Unit selector, Data Rate, and then bytes/s. Whether you choose IEC units or SI units is up to you. This update should change the x-axis to a format that is easier to understand when working with bandwidth. Give this panel a name like NEtwork Traffic and click Apply at the top right. Now we have Network Traffic and Processor Load visualizations.



And there is our dashboard with two running graphs. But you may ask, how do I know what data I have to work with here? Lets build a query and find out.


Browsing the Available Data

It is pointless to have a ton of metrics to work with but have no idea what they are. We can build a fairly simple query that will sample the data and connect it to the field_names with their definitions. With a bit of WHERE filtering, we can zoom in on what metrics we want to work with. We will build this as a Grafana table and display it on the same dashboard as our other graphs.

Let's head back to the dashboard that shows the two visuals we created and click Add panel. Choose Add a new panel, and we will be back at our favorite part of Grafnana, designing a new metric visual. This visual will be different. In the upper right-hand dropdown menu that now says Time series, click that dropdown and choose Table. Then at the bottom of our query designer change the format to Table. Once these changes are made, click our favorite old editor pencil and paste in the query below.

                  
SELECT field_name,description, field_names.field_id, value, device, hostname FROM field_names JOIN (
  SELECT DISTINCT ON (field_id) * 
  FROM metric_values
  WHERE timestamp > now() - INTERVAL '1 minute'
  ORDER BY field_id
) as singleID
ON field_names.field_id = singleID.field_id
                  
               

When finished you should see what is in the screenshot below.



This query provides a list of all of the metric definitions available that we have metrics for. This query initially selects all rows in the field_names table, then JOINS on a sub-query that selects one sample for every field_id found in the field_names table. If no sample is found, the row is not returned. Let's give this a name, something like Available Metrics, and then click Apply.

When selecting the correct query for the visualization you want to build, it takes a bit of data intuition to know a gauge data type from a counter data type. If you can not tell from the definition, you can write a query to watch that data point to see if it is ever-increasing or ever-changing. Ever increasing in one direction is generally a counter. A value that goes up and down over time is generally a gauge. You can browse through all these metrics and make many different visualizations. Play around with the data and see what you can come up with.

Final Tip

Remember what we have built when you find metrics you want to work with. Gauge data will use the first query we used earlier for visualizing processor load, which is an ever-changing value. We used the slightly longer PostgreSQL LAG query to visualize data where we needed to calculate events per second from a counter. It does not matter what you are visualizing. From bushels of corn to quantum particles, we are always looking for a value that is changing, be it up, down, left, right, or random. In the end, use queries instead of middleware wherever possible. Make the database do the work.



Coming To an End

We are now at the end of this guide. If you followed along through the guide, you should have a self-contained system that can convert Metricbeat JSON data sets into SQL inserts using Redis as a JSON buffer and Grafana as the display engine. We threw in a few multi-level PostgreSQL queries and some deep explanations for this project. I hope anyone following this guide learned a little about time series metrics and how to process them with SQL.



Where To From Here

This project is what I like to call a side quest. This project spawned because I needed a way to collect a nice set of metrics for a larger project, "Visualizing Bureau of Labor Statistics Employment Data with PostgreSQL and TimescaleDB". Stay tuned for updates on that project and a document similar in detail to this one that explains the parsing and storing of that data. That project will also come with a bit of JavaScript and NodeJS to handle rendering that data.

The TimescaleDB table chunking we used here is only a small part of its functionality. I will dig more deeply into the available hyper_functions as I process the BLS and CDC data in future projects. Stay tuned for those.




Have a Wonderful Day