Intro
In sybil, we made the choice to build a query engine that runs on columnar data. Columnar storage has two main advantages: 1) only the data relevant to a query needs to be loaded off disk, and 2) data stored in column form compresses much better than data stored in row form.
In this post, we will discuss the advantages of columnar compression and dive into the specifics a little bit more. We briefly touch on columnar compression in the sybil implementation notes, but this post will explore the techniques used and results obtained.
What is columnar storage?
Columnar storage is a way of storing data such that each column (or field) of a record is stored separately. This is in contrast to row storage, where all the fields of a row are stored next to each other. With row storage, a full table scan has to load all the data in the table off disk (or reach it through indexes). With columnar storage, only the fields directly referenced by the query are loaded off disk. The advantage is that full table scans can be run over just a subset of columns, which is particularly well suited to OLAP workloads.
Alongside reducing the amount of data that needs to be loaded per query, columnar storage also groups similar data together, enabling per-column compression tactics.
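To make the layout concrete, here is a minimal Go sketch of the difference (the LogRow and LogColumns types are hypothetical, not sybil's actual structures): records are transposed into per-field slices, and a query that only aggregates one field never has to touch the others.

package main

import "fmt"

// Row storage: each record keeps all of its fields together.
type LogRow struct {
	Time int64
	Code string
	Size int64
}

// Columnar storage: each field lives in its own slice (one file per
// field on disk); index i across the slices reconstructs record i.
type LogColumns struct {
	Time []int64
	Code []string
	Size []int64
}

func main() {
	rows := []LogRow{
		{Time: 1499300000, Code: "200", Size: 512},
		{Time: 1499300007, Code: "404", Size: 128},
	}

	// Transpose rows into columns.
	var cols LogColumns
	for _, r := range rows {
		cols.Time = append(cols.Time, r.Time)
		cols.Code = append(cols.Code, r.Code)
		cols.Size = append(cols.Size, r.Size)
	}

	// A query that only touches Size never reads Time or Code.
	var total int64
	for _, s := range cols.Size {
		total += s
	}
	fmt.Println("total bytes served:", total)
}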
As an example, one year of my nginx logs took about 1GB stored as row data inside MongoDB. The same data imported into sybil takes under 90MB - a savings of more than 10x.
Storage layout
In sybil, data is organized into blocks of 65K records, and each field is stored in its own file. For example, a table with 200,000 records and five fields will be stored as 4 blocks: 3 full blocks and one partial block. Each block is a directory on disk, and each field is a file in that directory.
A typical full block from my nginx access logs looks like the following:
$ ls block853711672/ -lah
total 940K
drwxr-xr-x 2 okay okay 4.0K Jan 18 15:30 .
drwxrwxrwx 57 okay okay 12K Oct 20 17:26 ..
-rw-rw-rw- 1 okay okay 869 Jul 6 2017 info.db
-rw-rw-rw- 1 okay okay 1.7K Jul 6 2017 int_parsedport.db
-rw-rw-rw- 1 okay okay 84K Jul 6 2017 int_size.db
-rw-rw-rw- 1 okay okay 143K Jul 6 2017 int_time.db
-rw-rw-rw- 1 okay okay 74K Jul 6 2017 str_agent.db
-rw-rw-rw- 1 okay okay 65K Jul 6 2017 str_code.db
-rw-rw-rw- 1 okay okay 65K Jul 6 2017 str_host.db
-rw-rw-rw- 1 okay okay 65K Jul 6 2017 str_method.db
-rw-rw-rw- 1 okay okay 1.8K Jul 6 2017 str_parsedhost.db
-rw-rw-rw- 1 okay okay 73K Jul 6 2017 str_parsedpath.db
-rw-rw-rw- 1 okay okay 1.7K Jul 6 2017 str_parsedscheme.db
-rw-rw-rw- 1 okay okay 69K Jul 6 2017 str_referer.db
-rw-rw-rw- 1 okay okay 69K Jul 6 2017 str_remote.db
-rw-rw-rw- 1 okay okay 65K Jul 6 2017 str_user.db
Notice that there is an info.db file that contains metadata, as well as one file per field. There are 3 integer fields (size, timestamp and parsed port) and 10 string fields.
If we use gzip to compress our data, we can get even greater gains:
$ ls block853711672/ -lah
total 300K
drwxr-xr-x 2 okay okay 4.0K Jan 18 15:36 .
drwxrwxrwx 57 okay okay 12K Oct 20 17:26 ..
-rw-rw-rw- 1 okay okay 736 Jul 6 2017 info.db.gz
-rw-rw-rw- 1 okay okay 393 Jul 6 2017 int_parsedport.db.gz
-rw-rw-rw- 1 okay okay 30K Jul 6 2017 int_size.db.gz
-rw-rw-rw- 1 okay okay 58K Jul 6 2017 int_time.db.gz
-rw-rw-rw- 1 okay okay 7.6K Jul 6 2017 str_agent.db.gz
-rw-rw-rw- 1 okay okay 4.7K Jul 6 2017 str_code.db.gz
-rw-rw-rw- 1 okay okay 366 Jul 6 2017 str_host.db.gz
-rw-rw-rw- 1 okay okay 6.0K Jul 6 2017 str_method.db.gz
-rw-rw-rw- 1 okay okay 503 Jul 6 2017 str_parsedhost.db.gz
-rw-rw-rw- 1 okay okay 13K Jul 6 2017 str_parsedpath.db.gz
-rw-rw-rw- 1 okay okay 418 Jul 6 2017 str_parsedscheme.db.gz
-rw-rw-rw- 1 okay okay 5.0K Jul 6 2017 str_referer.db.gz
-rw-rw-rw- 1 okay okay 6.7K Jul 6 2017 str_remote.db.gz
-rw-rw-rw- 1 okay okay 400 Jul 6 2017 str_user.db.gz
Notice that we are doing whole-file gzip compression. An astute observer will ask: “doesn’t that eliminate the ability to seek to a specific record in the file?” The answer is yes! - sybil is meant for full table scans. When accessing any record in a block, the whole block of 65K records is read off disk; there is no way to seek to individual records. This is fine because sybil is an analytic datastore that computes aggregations and doesn’t have to worry about updating individual records.
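For illustration, here is a minimal Go sketch of reading a whole-file gzipped column (the readColumn helper and the hard-coded path are hypothetical; this is not sybil's actual reader). Note that the entire stream has to be decompressed to reach any record:

package main

import (
	"compress/gzip"
	"io"
	"os"
)

// readColumn decompresses an entire gzipped column file into memory.
// Because the file is compressed as one stream, there is no way to seek
// to a single record: the whole block's column is read every time.
func readColumn(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		return nil, err
	}
	defer gz.Close()

	return io.ReadAll(gz)
}

func main() {
	// Hypothetical path, mirroring the block layout shown above.
	data, err := readColumn("block853711672/int_time.db.gz")
	if err != nil {
		panic(err)
	}
	_ = data // decode the column's records from the raw bytes here
}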
How it works
High Cardinality Integers
For an integer column, we have two options. For high cardinality integers (where there are more than 10,000 distinct values), we store the integers in an array, with the array index representing the virtual ID of the record in the block. We then delta encode the array, storing the difference between each value and the one before it. Finally, we serialize it with golang's encoding/gob, which supports Variable Length Quantities (VLQ) - a way of storing integers on a byte by byte basis. By using a combination of Delta Encoding and VLQ, we are able to store integers using only 1 to 3 bytes.
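Here is a minimal Go sketch of the idea. Sybil serializes with encoding/gob; for brevity this sketch uses encoding/binary's varint functions to show the variable-length step, and the sample timestamps are made up:

package main

import (
	"encoding/binary"
	"fmt"
)

// deltaEncode replaces each value (after the first) with the difference
// from the previous value, so nearby timestamps become small integers.
func deltaEncode(vals []int64) []int64 {
	out := make([]int64, len(vals))
	prev := int64(0)
	for i, v := range vals {
		out[i] = v - prev
		prev = v
	}
	return out
}

func main() {
	// Hypothetical block of timestamps (seconds since the epoch),
	// indexed by the record's position in the block.
	times := []int64{1499300000, 1499300002, 1499300002, 1499300007}

	deltas := deltaEncode(times)

	// Variable-length encoding: small deltas take 1-2 bytes instead of
	// the 4+ bytes a fixed-width encoding would use.
	buf := make([]byte, 0, len(deltas)*binary.MaxVarintLen64)
	tmp := make([]byte, binary.MaxVarintLen64)
	for _, d := range deltas {
		n := binary.PutVarint(tmp, d)
		buf = append(buf, tmp[:n]...)
	}
	fmt.Printf("%d timestamps -> %d bytes\n", len(times), len(buf))
}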
For a high cardinality column like timestamps (represented as seconds since the epoch), we use about 2.2 bytes per timestamp. If we were to store the timestamps naively, each one would require 4 bytes.
If we add gzip compression on top of this, we are able to store 65K timestamps in only 58K - less than one byte per timestamp.
Timestamps
Running some experiments with cardinality, we can see that the spread of the integers will determine how much space they take up.
Below are ungzipped results for the range of the timestamps stored in a single block. test_0.25 represents uniform timestamps that range over a quarter of an hour, while test_48 represents timestamps that range over a 48 hour timespan.
This should give some idea of the compression to expect based on the number of distinct values in that particular column.
167K tsdb/test_0.25/block803942162/int_time.db
180K tsdb/test_0.5/block998479838/int_time.db
186K tsdb/test_1/block350848165/int_time.db
190K tsdb/test_3/block090475483/int_time.db
192K tsdb/test_6/block741306497/int_time.db
196K tsdb/test_12/block050239454/int_time.db
217K tsdb/test_24/block212806080/int_time.db
235K tsdb/test_48/block571945678/int_time.db
You’ll notice there is some difference in size between these timestamps and the ones from my nginx logs: that is likely because the nginx log timestamps are not uniformly distributed - they follow human traffic patterns, leading to better compressibility.
Low Cardinality Integers
For low cardinality columns (with fewer than 5,000 unique values), we store the data somewhat differently: instead of storing an array of all the values, we store the data as an inverted index - each value is a key in a dictionary, and for each key we store the list of record IDs that have that value.
To put it more clearly, let’s say we have 4 possible integer values and 10 records. The data might be stored as a dictionary like so:
0: [1,2,5],
1: [0, 3],
2: [4,6,7],
3: [8, 9]
Each key represents a field value, and attached to that value is the list of all records that share it. We then delta encode the record IDs and once again store them using VLQ. Delta Encoding reduces the size of the values we need to store, at the cost of some overhead when unpacking them. After delta encoding, the dictionary above becomes:
0: [1,1,3],
1: [0, 3],
2: [4,2,1],
3: [8, 1]
For low cardinality columns, the combination of an inverted index, Delta Encoding, and VLQ typically leads to about 1 to 2 bytes per sample, with very low cardinality columns requiring 1 byte per record.
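Here is a minimal Go sketch of building such an inverted index with delta encoded record IDs (the buildInvertedIndex helper is hypothetical, not sybil's code); it reproduces the dictionaries shown above:

package main

import "fmt"

// buildInvertedIndex maps each distinct value to the list of record IDs
// that contain it, with the IDs delta encoded (each entry after the
// first stores the gap from the previous ID for that value).
func buildInvertedIndex(column []int) map[int][]int {
	index := make(map[int][]int)
	lastID := make(map[int]int)
	for id, val := range column {
		index[val] = append(index[val], id-lastID[val])
		lastID[val] = id
	}
	return index
}

func main() {
	// One low cardinality value per record, indexed by record ID;
	// this matches the 10-record example above.
	column := []int{1, 0, 0, 1, 2, 0, 2, 2, 3, 3}

	// Prints 0:[1 1 3], 1:[0 3], 2:[4 2 1], 3:[8 1] (in map order),
	// matching the delta encoded dictionary above.
	for val, deltas := range buildInvertedIndex(column) {
		fmt.Println(val, deltas)
	}
}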
If we add gzip on top of that, our savings become quite large. In the data above, the integer column representing the file sizes served takes only 0.5 bytes per sample, while an extremely low cardinality column (like the parsed port) takes only a few hundred bytes to store all 65K records.
String Columns
For string columns, we store a lookup from string value to integer ID by storing an array of strings; the integer ID for a string is its index in the array.
The effect is that each string appears only once per file, and every subsequent appearance is stored as its integer ID. By storing strings as integers, we are able to re-use our integer column compression schemes, as well as reduce the number of string comparisons when running filters.
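A minimal Go sketch of this dictionary encoding (the encodeStrings helper and sample agents are hypothetical, not sybil's code):

package main

import "fmt"

// encodeStrings turns a column of strings into a lookup array (a
// string's ID is its index in the array) plus a column of integer IDs.
func encodeStrings(column []string) (lookup []string, ids []int) {
	seen := make(map[string]int)
	for _, s := range column {
		id, ok := seen[s]
		if !ok {
			id = len(lookup)
			lookup = append(lookup, s)
			seen[s] = id
		}
		ids = append(ids, id)
	}
	return lookup, ids
}

func main() {
	agents := []string{"curl/7.58.0", "Mozilla/5.0", "curl/7.58.0", "Mozilla/5.0"}
	lookup, ids := encodeStrings(agents)

	// Each distinct string is stored once; the records themselves are
	// small integers that compress like any other integer column.
	fmt.Println(lookup) // [curl/7.58.0 Mozilla/5.0]
	fmt.Println(ids)    // [0 1 0 1]
}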
If we look at the Agent field, which represents a user agent, we can see that without gzip compression, it takes roughly 1.2 bytes to represent a useragent. If we naively stored each useragent as a string, the cost would be much larger, considering that an agent string typically looks like: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
With gzip compression, the savings go up even more: we can store 65K agent fields at only about 0.11 bytes per sample.
Changelog
February 2019
- reduce unique value recommendation to 5,000 unique values
- add timestamp results section
- gzip clarification
January 2019
- First write-up
- Add example block storage
- Discuss string and integer storage