
In this post, we will discuss the place and purpose of online analytical processing (OLAP) databases in the analytics stack.

TL;DR: OLAPs provide useful realtime analytics capabilities at the expense of traditional DB guarantees. They step in where time series databases (TSDBs) leave off and are complementary to both TSDBs and traditional DBs.

Overview

The analytics space today (and tomorrow) is quite crowded, yet new and improved analytics DBs keep coming out. What are we looking for that keeps us writing new DBs? The shallow answer is: faster queries and more data. But if we dig just a little deeper, the real reason is that we want understanding, and bigger & faster analytics is how we'll get there.

TSDBs

Before talking about OLAPs, let's talk about time series databases: the bread and butter of graph-building tools. If you work in site reliability or monitoring, you know that much of the job is looking at graphs, and how much of a difference a good graph vs. a bad graph can make. That's why TSDBs tend to be our best friends, and we live and die by our TSDB stack.

However, graphs are only where the fun begins - a time series graph can show the motion of a value over time, but it rarely gives the why.

This is primarily because:

  1. TSDBs tend to store pre-aggregated data
  2. TSDB data is extremely denormalized
  3. TSDBs only support time series queries

All of the above are for good reasons (like scaling and performance), but together they often prevent further investigation into anomalies once they are discovered. Given that a TSDB lets us quickly locate anomalous graphs, what do we do when we find them?

The OLAP

This is where OLAPs come in. OLAPs fall somewhere between a relational DB and a TSDB. They store a finite amount of recent (and potentially unstructured) data and run realtime analytic queries on that data, while relaxing some of the constraints and requirements of a relational DB.

Because they run queries quickly (with sub-second execution times), OLAPs are great at exploratory data analysis and at bringing the shape of our datasets to life.

In an OLAP, we can add new filters, change GROUP BYs and run advanced queries with great response times. This kind of query refinement usually requires full table scans or heavy indexing, which can be expensive, so OLAPs only need to store enough data for diagnosing anomalies - whether that's 1 day, 1 week or 1 month of data.
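
To make the refinement loop concrete, here is a minimal sketch using Python's sqlite3 over a tiny in-memory events table. The table and column names (host, status, latency_ms) are made up for illustration - the point is that each refinement is just another scan over the same raw rows, not a new schema.

```python
# A sketch of OLAP-style query refinement (hypothetical table/column names).
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (ts INTEGER, host TEXT, status INTEGER, latency_ms REAL)")
db.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", [
    (1, "web-1", 200, 12.0),
    (2, "web-1", 500, 95.0),
    (3, "web-2", 200, 15.0),
    (4, "web-2", 500, 120.0),
])

# First pass: how do requests break down by status?
print(db.execute("SELECT status, COUNT(*) FROM events GROUP BY status").fetchall())

# Refinement: only the errors, regrouped by host - a different scan over the
# same raw rows, no new schema or index required.
print(db.execute(
    "SELECT host, COUNT(*), AVG(latency_ms) FROM events "
    "WHERE status >= 500 GROUP BY host"
).fetchall())
```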

If you are thinking: couldn't an OLAP be built with a relational DB? The answer is yes! But, in general, OLAPs are built to perform well with append-only data, sparse user queries and a high volume of inserts, which is not quite the same workload as a traditional DB's.
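
As a rough illustration of that write path (not any particular OLAP's API), here is a toy append-only table: rows arrive, get sealed into batches, and are never updated in place.

```python
# Toy append-only ingest path, illustrative only.
import time

class AppendOnlyTable:
    def __init__(self):
        self.segments = []   # sealed, immutable batches of rows
        self.buffer = []     # rows waiting to be flushed

    def log(self, **fields):
        # Writes are cheap appends; there are no UPDATEs or DELETEs.
        self.buffer.append(dict(fields, ts=time.time()))

    def flush(self):
        if self.buffer:
            self.segments.append(self.buffer)
            self.buffer = []

table = AppendOnlyTable()
table.log(host="web-1", status=200)
table.log(host="web-1", status=500)
table.flush()
print(sum(len(seg) for seg in table.segments), "rows ingested")
```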

If the OLAP sounds too good to be true, that's because it is - the OLAP is not a panacea. For every benefit of an OLAP there is a disadvantage: simplistic queries, short retention times, and no consistency guarantees, to name a few. But in spite of all of the disadvantages, an OLAP is a vital part of any analytics pipeline as a way of exploring data and answering questions immediately.

What Now?

Hopefully you’ve decided to give an OLAP a try. But how do you choose one? Personally, I think a good OLAP comes down to several things: ease of use (easy to set up, easy to log data into a new table), performance (queries finish in under 1s), debuggability, and interoperability with existing pipelines.

It’s also useful to have an OLAP that lets you log ad-hoc data without requiring a table schema, similar to the way data gets logged into time series databases. This lets you add instrumentation to your code with a simple logging call and be done - no table creation or migration steps.
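
Schema-less instrumentation can be as small as the snippet below, which ships a JSON event to a collector over HTTP. The endpoint and payload shape are assumptions for illustration, not a specific OLAP's API - the point is that there is no CREATE TABLE step, and new fields simply become new columns.

```python
# Hypothetical schema-less logging call; the collector URL and payload
# format are assumptions, not a specific OLAP's API.
import json
import urllib.request

def log_event(table, **fields):
    payload = json.dumps({"table": table, "data": fields}).encode()
    req = urllib.request.Request(
        "http://localhost:3000/data/import",  # assumed collector endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# One call at the instrumentation site, and we're done.
log_event("checkout", user_id=42, status="ok", latency_ms=87.5)
```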

Of course, the backend is not enough: a query exploration UI is vital. A good UI takes advantage of an OLAP's speed and ties together the different queries the OLAP is capable of. A great UI will have workflows oriented toward the specific goals and tasks at hand, be it exploratory analysis, model building, experiment evaluation or process monitoring.

Appendix A: The List

Backends & Backend Ideas

Frontends

CHANGELOG

2016-09-16

2016-09-15

2016-09-13

Appendix B: more notes on TSDBs

In a traditional TSDB, each series contains only one numeric quantity. Instead of keeping all the relevant data together in one row, TSDBs spread the data out over multiple series (each series is essentially a table of points). This works for a small number of datasets, but as the keyspace grows larger and larger, the data becomes harder and harder to comb through. In other words: by storing each numeric quantity as a separate table, we give up our ability to explore and correlate the meaningful information that the programmer gave us at instrumentation time.
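
To illustrate the difference (with made-up key and field names): a TSDB spreads one logical event across many single-value series keys, while an OLAP-style row keeps the related fields together, so a question like "group errors by host" stays easy to ask after the fact.

```python
# TSDB-style: each numeric quantity lives under its own series key.
tsdb_points = {
    "web-1.checkout.latency_ms": [(1694563200, 87.5)],
    "web-1.checkout.errors":     [(1694563200, 1)],
    "web-2.checkout.latency_ms": [(1694563200, 45.0)],
}

# OLAP-style: one wide row per event, with all the context attached.
olap_rows = [
    {"ts": 1694563200, "host": "web-1", "endpoint": "checkout", "latency_ms": 87.5, "status": 500},
    {"ts": 1694563200, "host": "web-2", "endpoint": "checkout", "latency_ms": 45.0, "status": 200},
]

# With rows, correlating fields is a simple scan; with bare series keys we'd
# be parsing key names and joining separate tables of points.
errors_by_host = {}
for row in olap_rows:
    if row["status"] >= 500:
        errors_by_host[row["host"]] = errors_by_host.get(row["host"], 0) + 1
print(errors_by_host)
```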

One potential solution is to build a system that can issue and analyze thousands of graphs and compute correlations in real time. The other is to let a person do the digging.

Having the time series keys be denormalized this way has some problems, namely:

  • it's hard to navigate and find related data solely by key name
  • data must be known beforehand: each key was purposefully inserted
  • there are no rollups or ability to drill down into a time series graph
  • aggregates tend to be pre-computed and have to be maintained for each time granularity

Given all these problems, why do we love time series databases so much? One reason is that they are easy to use and produce visuals that can be digested by eye. A relational table requires table creation and schema setup before any analysis can happen, while a TSDB only requires logging data to it.