Earlier today, I was part of a conversation where, among other things, we talked about optimizing and hardening data delivery pipelines. Pipelines are arguably the most important part of a big data infrastructure. If your data is not right, then you are not right, and you will not last long in today's fast-moving, terabyte-per-second world. The pipeline manipulates your data; if the pipeline is not correct, your data is not correct.
The amount of data being sourced, parsed, stored, analyzed, enriched, and analyzed again grows exponentially every day. Soon every powered device, and some that are not actively powered, will be sending some type of metrics. Metrics make the world. They tell you where your products are selling, where they are being used, and who is using them, all the way down to your last grocery store order and the street where the product is being used. Before this data is of any use, it has to be stored somewhere, somehow. If your pipelines are not correct or not optimized, 1 terabyte of data is worth about as much as 1 kilobyte.
The more I thought about this, the more I pondered the dimensionality of the data. I can create an array of an array of an array of an object of an array and call it a multidimensional array. Still, the dimensions are just keys to the next set of data; it all lives in one-dimensional space. Or does it? If you are working with time-series data, you automatically add one more dimension, because now you can track the data through time. But the way we tend to look at it now, you are still just working with a flat text file full of data that you can watch over time.
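A quick sketch of that point, with an entirely invented record: no matter how deeply you nest, every leaf value is reachable by a single composite key, so the nesting is really just an addressing scheme over flat storage.

```python
# A "multidimensional" record: arrays nested inside objects inside arrays.
record = {"sensors": [{"readings": [[1.2, 1.3], [1.4, 1.5]]}]}

def flatten(value, path=()):
    """Walk nested dicts/lists and yield (key-path, leaf) pairs."""
    if isinstance(value, dict):
        for k, v in value.items():
            yield from flatten(v, path + (k,))
    elif isinstance(value, list):
        for i, v in enumerate(value):
            yield from flatten(v, path + (i,))
    else:
        yield path, value

flat = dict(flatten(record))
# Every leaf now sits behind one flat composite key:
print(flat[("sensors", 0, "readings", 1, 0)])  # 1.4
```

The "dimensions" collapse into tuples of keys; the data itself was one-dimensional all along.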
What if we query two indices that contain different data but share the same structure over a specified timeframe? Is the data truly multidimensional then? No, it is still just keyed flat text that you are searching within a specific timeframe, so technically this data is 2D. It has two dimensions: time and the individual document. At least, in my opinion, that is how it looks.
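To make the 2D claim concrete, here is a toy version of that query (the indices, field names, and values are all made up for illustration): merging two same-shaped indices over a timeframe yields rows addressed by exactly two things, a timestamp and a document.

```python
from datetime import datetime

# Two hypothetical indices sharing the same document structure.
index_a = [{"ts": datetime(2024, 1, 1), "value": 10},
           {"ts": datetime(2024, 1, 3), "value": 12}]
index_b = [{"ts": datetime(2024, 1, 2), "value": 7}]

def query(indices, start, end):
    """Collect documents from several indices within [start, end),
    ordered by time -- effectively a (time, document) plane."""
    docs = [d for idx in indices for d in idx if start <= d["ts"] < end]
    return sorted(docs, key=lambda d: d["ts"])

rows = query([index_a, index_b], datetime(2024, 1, 1), datetime(2024, 1, 4))
# Each row is located by (timestamp, document) -- two dimensions, no more.
```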
Then I thought: how would you actually create multidimensional data that is true to the definition? I envisioned a cube moving through time, sort of like watching the animation of a tesseract, a cube moving through a cube. As the data moves through the time dimension, it drags pieces of the previous data with it, data deemed important by a person or an algorithm. New data is then compared against the data that has been brought forward to the current time. Anomalies are noted and included with the next data set. The depth of this dragging is also set by an algorithm that knows what it is looking for. Over time, enough data is compiled by this method to potentially detect anomalies before they happen, and to see them for certain as they happen.
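The dragging-through-time idea can be sketched in a few lines. This is only a toy: the "importance" rule (keep the most recent points) and the anomaly rule (distance from the carried mean) are stand-ins I invented; the real versions would be whatever a person or algorithm deems important.

```python
def carry_forward(batches, keep, is_anomaly):
    """Process time-ordered batches; at each step, compare new points
    against data dragged along from earlier steps, flag anomalies,
    and fold them into the context carried to the next step."""
    carried = []   # data deemed important so far
    anomalies = []
    for t, batch in enumerate(batches):
        flagged = [x for x in batch if is_anomaly(x, carried)]
        anomalies.extend((t, x) for x in flagged)
        # Drag forward: prune to the "important" points plus anomalies.
        carried = keep(carried + batch + flagged)
    return anomalies

# Toy importance rule: recency (keep the last 5 points).
def keep_last(points, n=5):
    return points[-n:]

# Toy anomaly rule: far from the mean of the carried data.
def far_from_mean(x, carried, tol=3.0):
    if not carried:
        return False
    return abs(x - sum(carried) / len(carried)) > tol

batches = [[1.0, 1.1], [0.9, 1.2], [5.0, 1.0]]
found = carry_forward(batches, keep_last, far_from_mean)
print(found)  # [(2, 5.0)] -- the spike is flagged at step 2
```

The `keep` depth plays the role of the algorithm that "knows what it is looking for": deepen it and more history rides along inside each time step.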
As I thought about this, I realized that this is how almost all machine learning works. Be it massive, fast data sampling or training a neural network, this is pretty much how it works. This is our current evolution of data storage and usage: storing what we think may happen in the future, comparing it to what already happened in the past, storing the result, and getting just a bit smarter every time.
I define this as a 4D dataset.
-Say it before someone else does.-