Logv is a project to bring free and libre instrumentation tools to people and organizations. Logv is primarily the work of a human, with help and inspiration from countless others.
The Short Story
At FB, I worked on the site speed team from 2011 to 2013, where we primarily did two things: blamed the packager for site performance problems and built tools to try to understand why the packager hated us. Our desire to understand the packager (imagine webpack, but it learns which packages belong together based on historical data) was so strong that my teammates ended up building a datastore specifically crafted for queries against our packager instrumentation. Eventually that datastore came to be known as scuba.
After leaving FB in 2013, I knew I wanted the scuba style of analysis (iterative and comparative queries on top of a fast query engine) to become available to me in the future. Having spoken to multiple companies building proprietary distributed databases, I realized that scuba’s backend (or its equivalent) was not going to be released as open source any time soon, so I decided to build the front-end (snorkel) in anticipation of backends becoming available later.
After several years of waiting, no compelling backends were on the horizon, so in January 2016 I set about writing a backend (sybil) suited to my specific instrumentation use cases.
For a backend to be compelling, it primarily has to support one feature: being able to start logging instrumentation without defining a table schema ahead of time (it also has to be fast and free). Unfortunately, this is a somewhat uncommon feature for databases, so in 2013 I decided to build snorkel on top of mongo and its aggregation framework, thinking that my dream datastore would come along eventually.
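A minimal sketch of what "no schema ahead of time" looks like in practice: columns are discovered from the records themselves rather than declared up front. The `Table` class and its methods here are hypothetical illustration, not snorkel's or sybil's actual API.

```python
# Hypothetical sketch of schema-less ingestion: the set of columns is
# inferred from incoming records, so no CREATE TABLE is needed first.
# (Names are illustrative, not snorkel/sybil's real interface.)

class Table:
    def __init__(self):
        self.rows = []
        self.columns = {}  # column name -> inferred type name

    def insert(self, record):
        """Accept any flat dict; widen the schema as new keys appear."""
        for key, value in record.items():
            self.columns.setdefault(key, type(value).__name__)
        self.rows.append(record)

events = Table()
events.insert({"page": "/home", "load_ms": 123})
events.insert({"page": "/about", "load_ms": 98, "browser": "firefox"})
# The "browser" column exists only because a record mentioned it.
```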
Over 6 months, snorkel was upgraded and put into production as an analytics tool at rdio. I ran performance experiments with it, while other engineers were able to get insights into how their features were being used in the wild.
As we ramped up data collection, the limits of how much data could be held became apparent: after a few million samples, queries started to take a while. Most people didn't mind waiting 10 seconds for a query, but knowing that mongo was taking 10 seconds to run a query on 3 million samples bothered me. Luckily, snorkel had been built with sampling rates as a first-class feature, so lowering the collection rate was not really a problem, but I knew that I wanted my datasets to support more than 5 million samples at a time.
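Treating sampling rates as a first-class feature usually means weighting each kept event by the inverse of its sampling rate, so aggregates can be scaled back to population estimates. A rough sketch of that idea (the function names are my own, not snorkel's):

```python
import random

def record(event, store, sample_rate):
    """Keep an event with probability sample_rate; weight survivors
    by 1/sample_rate so totals can be scaled back up.
    (Illustrative only, not snorkel's actual implementation.)"""
    if random.random() < sample_rate:
        store.append({**event, "weight": 1.0 / sample_rate})

def estimated_count(store):
    # Each stored sample stands in for `weight` original events.
    return sum(e["weight"] for e in store)
```

With a 1% sample rate, 20,000 stored samples each carry a weight of 100, estimating roughly 2 million original events; dropping the rate further trades accuracy for storage.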
I waited for mongo's performance enhancements to the aggregation framework to be released, thinking that mongo would scale horizontally (just add another node and magically increase query throughput? don't mind if I do), but eventually realized that the aggregation framework would not get those horizontal scaling features. (see here)
Hearing good things about postgres' recent support for JSONB, I added a postgres driver to snorkel and started sending the same data I sent to mongo straight into postgres. The queries were not as fast as I was hoping: the max size of a reasonable dataset seemed to be only 3 to 5 million samples, just like with mongo.
I speculated that the JSONB columns must not be fast enough, so I wrote yet another driver for snorkel that let me query postgres tables directly, without JSONB columns. Unfortunately, time series queries were still slow, probably because of the dynamic time buckets being created.
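Dynamic time buckets are typically computed at query time by aligning every row's timestamp to a bucket boundary chosen by the user, which means fresh per-row arithmetic on each query rather than a precomputed index. A small sketch of the pattern, assuming unix-epoch timestamps:

```python
from collections import defaultdict

def time_bucket(ts, width_secs):
    """Align a unix timestamp to the start of its bucket."""
    return ts - (ts % width_secs)

def bucketed_counts(timestamps, width_secs):
    """Query-time bucketing: the bucket width is a query parameter,
    so every row is re-bucketed on every query."""
    counts = defaultdict(int)
    for ts in timestamps:
        counts[time_bucket(ts, width_secs)] += 1
    return dict(counts)
```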
Thinking about what was going on, I decided to investigate what it would take to get a datastore that does fast full table scans (necessary for the OLAP workloads I intended for snorkel) and started building one specifically for my use case.
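A full table scan for OLAP boils down to reading every row once and accumulating per-group aggregates, with no index lookups; throughput comes from how fast the scan itself runs. A toy sketch of that access pattern (not sybil's implementation):

```python
from collections import defaultdict

def scan_aggregate(rows, group_by, agg_col):
    """Full-scan group-by average: touch every row exactly once,
    accumulating a sum and count per group. (Illustrative only.)"""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = row.get(group_by)
        val = row.get(agg_col)
        if key is None or val is None:
            continue  # schema-less rows may lack either column
        sums[key] += val
        counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}
```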
In January 2016, development started on sybil, an append-only datastore for instrumentation. After some months of furious development, sybil was announced publicly in June 2016.
- first write up