In this post, we will discuss the place and purpose of online analytical processing systems (OLAPs for short) in the analytics stack.
TL;DR: OLAPs provide useful realtime analytics capabilities at the expense of traditional DB guarantees. They pick up where TSDBs leave off and are complementary to both TSDBs and traditional DBs.
Overview
The analytics space today (and tomorrow) is quite crowded. Yet there are always new and improved analytics DBs coming out. What is it that we are looking for that keeps us writing new DBs? The shallow answer is: faster queries or more data. But if we dig just a little deeper, the reason is that we want understanding, and bigger and faster analytics is how we’ll get there.
TSDBs
Before talking about OLAPs, let’s talk time series databases: the bread and butter of graph building tools. If you are used to site reliability and monitoring, you know how much of the job is looking at graphs, and how much of a difference a good graph can make compared to a bad one. That’s why TSDBs tend to be our best friends and we live and die by our TSDB stack.
However, graphs are only where the fun begins: a time series graph can show the motion of a value over time, but it rarely gives the why.
This is primarily because:
- TSDBs tend to store pre-aggregated data
- TSDB data is extremely denormalized
- TSDBs only support time series queries
All of the above are for good reasons (like scaling and performance), but together they often prevent further investigation into anomalies once they are discovered. Given that a TSDB lets us quickly locate anomalous graphs, what do we do when we find them?
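To make the pre-aggregation point concrete, here is a minimal Python sketch. The events, field names and helper functions are all invented for illustration and do not reflect any particular TSDB. It rolls raw events up into one average per minute, the kind of series a graph is drawn from, and then shows that the breakdown by endpoint is only available while the raw events are still around.

```python
from collections import defaultdict

# Hypothetical raw request events; the field names are invented for this sketch.
EVENTS = [
    {"minute": 0, "host": "web1", "endpoint": "/home", "latency_ms": 40},
    {"minute": 0, "host": "web2", "endpoint": "/home", "latency_ms": 45},
    {"minute": 1, "host": "web1", "endpoint": "/home", "latency_ms": 42},
    {"minute": 1, "host": "web2", "endpoint": "/search", "latency_ms": 900},
]


def preaggregate(events):
    """Roll events up to one average latency per minute (a TSDB-style series)."""
    buckets = defaultdict(list)
    for event in events:
        buckets[event["minute"]].append(event["latency_ms"])
    return {minute: sum(v) / len(v) for minute, v in sorted(buckets.items())}


def breakdown(events, minute, field):
    """Worst latency per value of `field` in one minute; needs the raw events."""
    worst = {}
    for event in events:
        if event["minute"] == minute:
            key = event[field]
            worst[key] = max(worst.get(key, 0), event["latency_ms"])
    return worst


series = preaggregate(EVENTS)
by_endpoint = breakdown(EVENTS, minute=1, field="endpoint")

print(series)       # {0: 42.5, 1: 471.0}
print(by_endpoint)  # {'/home': 42, '/search': 900}
```

The series shows that something went wrong in minute 1, but nothing in it says which endpoint or host was responsible. That information was dropped at aggregation time.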
The OLAP
This is where OLAPs come in. OLAPs fall somewhere between a relational DB and a TSDB. They store a finite amount of recent (and potentially unstructured) data and run realtime analytic queries on that data, while relaxing some of the constraints and requirements of a relational DB.
Because they run queries quickly (with sub-second execution times), OLAPs are great at exploratory data analysis and bringing the shape of our datasets to life.
In an OLAP, we can add new filters, change GROUP BYs and run advanced queries with great response times. This kind of query refinement usually requires full table scans or extreme indexing, both of which can be expensive, so an OLAP needs to store only enough data for diagnosing anomalies - whether it’s 1 day, 1 week or 1 month of data.
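As a rough illustration of that refinement loop, the Python sketch below answers each query with a full scan over a small in-memory table. The rows, column names and the `scan` helper are assumptions made up for this example; a real OLAP would use columnar storage and a proper query language, but the workflow of tightening a filter and changing the grouping is the same.

```python
from collections import defaultdict

# Hypothetical request log; the column names are invented for this sketch.
ROWS = [
    {"host": "web1", "endpoint": "/home", "status": 200, "latency_ms": 40},
    {"host": "web1", "endpoint": "/search", "status": 200, "latency_ms": 60},
    {"host": "web2", "endpoint": "/home", "status": 200, "latency_ms": 50},
    {"host": "web2", "endpoint": "/search", "status": 500, "latency_ms": 900},
    {"host": "web2", "endpoint": "/search", "status": 500, "latency_ms": 1150},
]


def scan(rows, where=None, group_by=()):
    """Full scan: filter on equality, then count and average latency per group."""
    where = where or {}
    groups = defaultdict(list)
    for row in rows:
        if all(row.get(col) == val for col, val in where.items()):
            key = tuple(row.get(col) for col in group_by)
            groups[key].append(row["latency_ms"])
    return {
        key: {"count": len(vals), "avg": sum(vals) / len(vals)}
        for key, vals in groups.items()
    }


# First pass: which host looks slow?
by_host = scan(ROWS, group_by=("host",))

# Refinement: filter down to the slow host and regroup by endpoint and status.
web2_detail = scan(ROWS, where={"host": "web2"}, group_by=("endpoint", "status"))

print(by_host)
print(web2_detail)
```

Every refinement rescans all rows, which is why the amount of retained data has to stay bounded for the queries to remain fast.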
If you are wondering whether an OLAP could be built with a relational DB, the answer is yes! But, in general, OLAPs are built to perform well with append-only data, sparse user queries and a high volume of inserts, which is not quite the workload a traditional DB is designed for.
If the OLAP sounds too good to be true, that’s because it is - the OLAP is not a panacea. The benefits come with disadvantages: simplistic queries, short retention times and no consistency guarantees, to name a few. In spite of these disadvantages, an OLAP is a vital part of any analytics pipeline as a way of exploring data and answering questions immediately.
What Now?
Hopefully you’ve decided to give an OLAP a try. But how do you choose one? Personally, I think several things make a good OLAP: ease of use (easy to set up, easy to log data into a new table), performance (queries finish in under 1s), debuggability and interoperability with current pipelines.
It’s also useful to have an OLAP that lets you log ad hoc data without requiring a table schema, similar to the way data gets logged into time series databases. This lets you add instrumentation into your code with a simple logging call and be done - no table creation or migration steps.
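A minimal sketch of what schema-free logging could look like follows, in Python. The `AdhocTable` class and its `log` method are hypothetical and are not the API of any backend listed in the appendix. Columns are created the first time a field is seen, and rows that lack a field are padded with `None`, so no table creation or migration step is needed.

```python
class AdhocTable:
    """Hypothetical append-only table whose columns appear on first use."""

    def __init__(self):
        self.columns = {}   # column name -> list of values, one per row
        self.num_rows = 0

    def log(self, **fields):
        """Append one row; unseen field names become new columns."""
        for name in fields:
            if name not in self.columns:
                # Backfill earlier rows so every column has the same length.
                self.columns[name] = [None] * self.num_rows
        for name, values in self.columns.items():
            values.append(fields.get(name))
        self.num_rows += 1


table = AdhocTable()
table.log(event="login", user="alice")
table.log(event="purchase", amount=12.5)  # new column, no migration step

print(table.columns)
```

Storing values per column rather than per row is a deliberate choice here: sparse fields cost one `None` per row, and an aggregation only has to touch the columns it actually uses.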
Of course, the backend is not enough: a query exploration UI is vital. A good UI takes advantage of an OLAP’s speed and ties together the different queries that the OLAP is capable of. A great UI will have workflows oriented towards the specific goals and tasks at hand, be it exploratory analysis, model building, experiment evaluation or process monitoring.
Appendix A: The List
Backends & Backend Ideas
- filodb
- honeycomb
- interana
- memsql
- mapd
- splunk
- druid
- prestodb
- clickhouse.yandex
- apache storm
- greenplum
- citusdb + postgres
- pandas
- R language
Frontends
CHANGELOG
2016-09-16
- add list of DBs to appendix, add more notes about TSDB (chewbranca)
2016-09-15
- has intro, middle and semi-conclusion
2016-09-13
- first write up of OLAP abilities.