Intro

In sybil, we made the choice to build a query engine that runs on columnar data. There are several advantages to columnar storage - particularly 1) only data relevant to queries needs to be loaded off disk and 2) data stored in column form compresses much better than data stored in row form.

In this post, we will discuss the advantages of columnar compression and dive into the specifics a little bit more. Columnar compression is briefly touched on in the sybil implementation notes, but here we will explore the techniques used and the results obtained.

What is columnar storage?

Columnar storage is a way of storing data such that each column (or field) of a record is stored separately. This is in contrast to row storage, where all the fields from a row are stored next to each other. With row storage, if you do a full table scan, all the data in the table has to be loaded off disk or referenced via indexes. With columnar storage, only the fields that are accessed directly in the query are loaded off disk. The advantage of columnar storage is that it allows for full table scans of only a subset of columns, which makes it particularly well suited to OLAP workloads.

Alongside the reduced amount of data that needs to be loaded per query, columnar storage also groups similar data together, allowing per-column compression tactics.

As an example, one year of my nginx logs took about 1GB inside MongoDB as row-form data. The same data imported into sybil takes under 90MB - a savings of more than 10x.

Storage layout

In sybil, data is organized into blocks of 65K records, and within a block the values for each field are stored in their own file. For example, if we have a table with 200,000 records and five fields, we will have 4 blocks: 3 full blocks and 1 partial block. Each block is represented as a directory on disk and each field is a file in that directory.

A typical full block from my nginx access logs looks like the following:

$ ls block853711672/ -lah
total 940K
drwxr-xr-x  2 okay okay 4.0K Jan 18 15:30 .
drwxrwxrwx 57 okay okay  12K Oct 20 17:26 ..
-rw-rw-rw-  1 okay okay  869 Jul  6  2017 info.db
-rw-rw-rw-  1 okay okay 1.7K Jul  6  2017 int_parsedport.db
-rw-rw-rw-  1 okay okay  84K Jul  6  2017 int_size.db
-rw-rw-rw-  1 okay okay 143K Jul  6  2017 int_time.db
-rw-rw-rw-  1 okay okay  74K Jul  6  2017 str_agent.db
-rw-rw-rw-  1 okay okay  65K Jul  6  2017 str_code.db
-rw-rw-rw-  1 okay okay  65K Jul  6  2017 str_host.db
-rw-rw-rw-  1 okay okay  65K Jul  6  2017 str_method.db
-rw-rw-rw-  1 okay okay 1.8K Jul  6  2017 str_parsedhost.db
-rw-rw-rw-  1 okay okay  73K Jul  6  2017 str_parsedpath.db
-rw-rw-rw-  1 okay okay 1.7K Jul  6  2017 str_parsedscheme.db
-rw-rw-rw-  1 okay okay  69K Jul  6  2017 str_referer.db
-rw-rw-rw-  1 okay okay  69K Jul  6  2017 str_remote.db
-rw-rw-rw-  1 okay okay  65K Jul  6  2017 str_user.db

Notice that there is an info.db file that contains metadata, as well as one file per field. There are 3 integer fields (size, timestamp and parsed port) and 10 string fields.

If we use gzip to compress our data, we can get even greater gains:

$ ls block853711672/ -lah
total 300K
drwxr-xr-x  2 okay okay 4.0K Jan 18 15:36 .
drwxrwxrwx 57 okay okay  12K Oct 20 17:26 ..
-rw-rw-rw-  1 okay okay  736 Jul  6  2017 info.db.gz
-rw-rw-rw-  1 okay okay  393 Jul  6  2017 int_parsedport.db.gz
-rw-rw-rw-  1 okay okay  30K Jul  6  2017 int_size.db.gz
-rw-rw-rw-  1 okay okay  58K Jul  6  2017 int_time.db.gz
-rw-rw-rw-  1 okay okay 7.6K Jul  6  2017 str_agent.db.gz
-rw-rw-rw-  1 okay okay 4.7K Jul  6  2017 str_code.db.gz
-rw-rw-rw-  1 okay okay  366 Jul  6  2017 str_host.db.gz
-rw-rw-rw-  1 okay okay 6.0K Jul  6  2017 str_method.db.gz
-rw-rw-rw-  1 okay okay  503 Jul  6  2017 str_parsedhost.db.gz
-rw-rw-rw-  1 okay okay  13K Jul  6  2017 str_parsedpath.db.gz
-rw-rw-rw-  1 okay okay  418 Jul  6  2017 str_parsedscheme.db.gz
-rw-rw-rw-  1 okay okay 5.0K Jul  6  2017 str_referer.db.gz
-rw-rw-rw-  1 okay okay 6.7K Jul  6  2017 str_remote.db.gz
-rw-rw-rw-  1 okay okay  400 Jul  6  2017 str_user.db.gz

Notice that we are doing whole-file gzip compression. An astute observer will say: “doesn’t that eliminate the ability to seek to a specific record in the file?” and the answer is yes! - sybil is meant for full table scans. When accessing any records in a block, the whole block of 65K records needs to be read off disk; there is no ability to seek to specific records. This works because sybil is an analytic datastore: it calculates aggregations over many records and doesn't have to worry about reading or updating individual ones.
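As a rough illustration of this tradeoff, here is a minimal Go sketch (not sybil's actual I/O code; the file name and on-disk format are made up) that writes a column as a single gzip stream and reads it back. There is no way to decompress just one record - readers always pay for the whole file.

package main

import (
	"compress/gzip"
	"encoding/gob"
	"fmt"
	"os"
)

// writeColumn gob-encodes the entire column and gzips it as one stream.
func writeColumn(path string, values []int64) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	zw := gzip.NewWriter(f)
	defer zw.Close()
	return gob.NewEncoder(zw).Encode(values)
}

// readColumn must decompress and decode the whole file - there is no
// seeking to an individual record inside the gzip stream.
func readColumn(path string) ([]int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	zr, err := gzip.NewReader(f)
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	var values []int64
	err = gob.NewDecoder(zr).Decode(&values)
	return values, err
}

func main() {
	// hypothetical column file
	if err := writeColumn("int_size.db.gz", []int64{512, 1024, 2048}); err != nil {
		panic(err)
	}
	values, err := readColumn("int_size.db.gz")
	if err != nil {
		panic(err)
	}
	fmt.Println(values)
}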

How it works

High Cardinality Integers

For an integer column, we have two options, depending on cardinality. For high cardinality integers (where there are more than 10,000 distinct values), we store the integers in an array, with the array index representing the virtual ID of the record in the block. We then delta encode the array, storing only the difference between each value and the previous one. Finally, we serialize with golang's encoding/gob, which supports Variable Length Quantities - a way of storing integers on a byte by byte basis, so that small values take fewer bytes. By using a combination of Delta Encoding and VLQ, we are able to store each integer using only 1 to 3 bytes.
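As a rough sketch of the idea (not sybil's actual encoder - the timestamps below are made up), delta encoding a column and serializing it with encoding/gob looks something like this:

package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// deltaEncode replaces each value with its difference from the previous
// one, turning a mostly-increasing column into a series of small integers.
func deltaEncode(values []int64) []int64 {
	deltas := make([]int64, len(values))
	var prev int64
	for i, v := range values {
		deltas[i] = v - prev
		prev = v
	}
	return deltas
}

func main() {
	// Made-up timestamps (seconds since the epoch), close together in time.
	timestamps := []int64{1499300000, 1499300012, 1499300013, 1499300047}

	// encoding/gob stores integers in a variable-length format, so the
	// small deltas cost only a byte or two each (plus a fixed gob header).
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(deltaEncode(timestamps)); err != nil {
		panic(err)
	}
	fmt.Printf("%d timestamps encoded in %d bytes\n", len(timestamps), buf.Len())
}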

For a high cardinality column like timestamps (represented as seconds since the epoch), we use about 2.2 bytes per timestamp. If we were to store the timestamps naively, we would require at least 4 bytes each.

If we add gzip compression on top of this, we are able to store 65K timestamps in only 58K - less than one byte per timestamp.

Timestamps

Running some experiments with cardinality, we can see that the spread of the integers will determine how much space they take up.

Below are ungzipped file sizes for timestamps spread over different time ranges within a single block. test_0.25 represents uniform timestamps that range over a quarter of an hour, while test_48 represents timestamps that range over a 48 hour timespan.

This should give some idea of the compression to expect based on the number of distinct values in that particular column.

167K tsdb/test_0.25/block803942162/int_time.db
180K tsdb/test_0.5/block998479838/int_time.db
186K tsdb/test_1/block350848165/int_time.db
190K tsdb/test_3/block090475483/int_time.db
192K tsdb/test_6/block741306497/int_time.db
196K tsdb/test_12/block050239454/int_time.db
217K tsdb/test_24/block212806080/int_time.db
235K tsdb/test_48/block571945678/int_time.db

You’ll notice there is some difference in size between these timestamps and the ones from my nginx logs: that is likely because the nginx log timestamps are not uniformly distributed - they follow human traffic patterns, leading to better compressibility.

Low Cardinality Integers

For low cardinality columns (with fewer than 5,000 unique values), we store the data somewhat differently: instead of storing an array of all the values, we store the data as an inverted index - each value is a key in a dictionary, and for each key we store the list of record IDs that have that value.

To put it more clearly, let’s say we have 4 possible integer values and 10 records. The data might be stored as a dictionary like so:

0: [1,2,5],
1: [0, 3],
2: [4,6,7],
3: [8, 9]

Each key represents the field value, and attached to it is the list of all records that share that value. We then delta encode the record IDs and once again store them using VLQ. Delta Encoding reduces the size of the values we need to store, at the cost of some overhead when unpacking them. After delta encoding, the dictionary above becomes:

0: [1,1,3],
1: [0, 3],
2: [4,2,1],
3: [8, 1]
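Here is a minimal Go sketch of this scheme (hypothetical code, not sybil's actual implementation), building the inverted index for the 10-record example above and delta encoding each record ID list:

package main

import "fmt"

// buildInvertedIndex maps each distinct value to the delta-encoded list
// of record IDs that contain it.
func buildInvertedIndex(column []int) map[int][]int {
	index := make(map[int][]int)
	last := make(map[int]int) // last record ID seen for each value
	for recordID, value := range column {
		index[value] = append(index[value], recordID-last[value])
		last[value] = recordID
	}
	return index
}

func main() {
	// The 10-record, 4-value example from above: records 1, 2 and 5 hold
	// value 0, records 0 and 3 hold value 1, and so on.
	column := []int{1, 0, 0, 1, 2, 0, 2, 2, 3, 3}
	for value, deltas := range buildInvertedIndex(column) {
		fmt.Println(value, deltas) // e.g. 0 [1 1 3]
	}
}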

For low cardinality columns, the combination of an inverted index + Delta Encoding + VLQ typically leads to about 1 to 2 bytes per sample, with very low cardinality columns requiring about 1 byte per record.

If we add gzip on top of that, our savings can become quite large. In the data above, the integer column representing the file sizes served takes only 0.5 bytes per sample, while an extremely low cardinality column (like the parsed port) takes only a few hundred bytes to store all 65K records.

String Columns

For string columns, we build a lookup from string value to integer ID by storing an array of the distinct strings; the integer ID for a string is its index in the array.

The effect is that each string appears only once per file; every subsequent appearance shows up as an integer. By storing strings as integers, we are able to re-use our integer column compression schemes, as well as reduce the number of string comparisons when running filters.
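A minimal sketch of this kind of dictionary encoding (the column contents are made up and this is not sybil's actual string writer):

package main

import "fmt"

// dictEncode replaces each string with an integer ID and returns the
// lookup array; the ID of a string is its index in the array.
func dictEncode(column []string) (ids []int32, lookup []string) {
	seen := make(map[string]int32)
	for _, s := range column {
		id, ok := seen[s]
		if !ok {
			id = int32(len(lookup))
			lookup = append(lookup, s)
			seen[s] = id
		}
		ids = append(ids, id)
	}
	return ids, lookup
}

func main() {
	agents := []string{"curl/7.58.0", "Mozilla/5.0", "Mozilla/5.0", "curl/7.58.0"}
	ids, lookup := dictEncode(agents)
	fmt.Println(ids)    // [0 1 1 0]
	fmt.Println(lookup) // [curl/7.58.0 Mozilla/5.0]
}

The integer IDs can then be fed through the same Delta Encoding and VLQ machinery described above.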

If we look at the Agent field, which represents a user agent, we can see that without gzip compression, it takes roughly 1.2 bytes to represent a user agent. If we naively stored each user agent as a string, the cost would be much larger, considering that an agent string typically looks like: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36

With gzip compression, the savings go up even more: we can store the 65K agent fields at about 0.11 bytes per sample.
