Pseudo Relational NoSQL

Humans Make the Link.

I was thinking about SQL and NoSQL yesterday as I explained the difference to my daughter. More or less the difference between relational and non-relational. As I explained it to her, something started to form in my mind about how NoSQL can be pseudo relational without the need for deeply nested trees and hidden fields. In the end, this is quite simple though it has been made to seem complex by some articles I have read.

When you design an SQL relational database set, you generally match field names between the tables. For instance, as I explained to my daughter, you could have employee.department_id and as table.field set, with that you could select all data from both by telling the query to JOIN those two tables and fields, also explaining that these relations extend all the way out to the database level. Fairly simple to those familiar with SQL. SQL has many other powers as far as how to query this relational data. You can use AS to change the name of a table.field combo on the fly. There are stored procedures and many other nifty SQL things. SQL offers a plethora of tools for working with this relational data. SQL is a query language with the power of a programming language. Everything happens programmatically at the query.

On the other hand, NoSQL databases were generally designed for raw speed. There is nothing speedy about relational databases, not that they are slow, especially these days, but they were designed for relating data, not raw speed.

My NoSQL system of choice is Elasticsearch. I have worked with others, but Elasticsearch is my first choice for a NoSQL solution if at all possible. With this in mind, I started to think, well, I have designed many dashboards that worked just like a relational database. It just comes down to the human using the dashboard ends up being the relational algorithm, and we are damn good at it. But you have to follow some strict rules when designing your indices. Elastic has a good bead on this with the Elastic Common Schema. ECS is a great start for common items that Elasticsearch knows of. But Elasticsearch, as great as it is, can not know everything, can not know all of your data. So designing the indices and dashboards that are relatable is in the court of the data science/analytics team. Below are some tips and ways to think about it.

It does not matter if you are designing the one and only index that will ever run on your Elasticsearch cluster or your cluster will host hundreds of indices; you should create an internal index schema. Doing this will save you many headaches in the future as a new developer or engineer will not be able to come in and create a bunch of random field names that match nothing else. The core of this whole pseudo relational NoSQL idea is keeping all field names the same across all indices. If your indices contain a field for a user's first name, but one index has them as first_name and one as firstName, then the pseudo relational idea can never work because it depends on filtering Elasticsearch queries. If you filter firstName for Bill, it will only return the fields named firstName, unlike SQL, where you could JOIN first_name to firstName. This will break any visuals, hence dashboards that are set to look at one or the other. Then the human looking at the dashboard never makes the link, and the pseudo relation is gone. Field names MUST be the same.

Sure, you could search for first_name: "Bill" AND firstName: "Bill", but what happens as your data set grows and you have 20 million records for first_name and 20 million records for firstName? And then, another developer comes along and adds a new index with a field called Firstname. Each AND you add to that search makes it much more expensive in memory and processing time. Keeping your field names lined up using a company schema also eases pressure on the cache and processor threads because it can hold one field name across many indices instead of many names across many indices.

This may not make any sense for a NoSQL database until we come to the current part: analyzing and visualizing all of this data. As a single query against a single index, this makes no difference. But, if you design your schema and indices with these same field names, thoughts in mind, you can use multiple indices in the same analytic dashboard and use filtering just like normal since it would apply to all indices included in the dashboard. Then, as the dashboard is filtered, the human using the dashboard will see the data relations across multiple indices in a single window. Humans are natural pattern matchers, but the pattern is not always apparent. Data scientists and analysts are here to coax and explain that data into a pattern that others can benefit from.

This is not just the analytical dashboards that you can build with Kibana. Any application that uses Elasticsearch indices can benefit from these tips. In a lot of cases having the company schema will make developing an application easier because there are fewer field names to track.

Have a Great Day