logv

Overview
- Sybil vs Other DBs
Execution Engine
Storage Details
Errata
Changelog

Sybil is a multi-threaded OLAP engine built specifically for append only instrumentation and event logs. Its architecture is based around ingestion of JSON or CSV samples in an ongoing basis: either through real-time or batch jobs. This page contains information on the implementation and quirks of sybil

Overview

Sybil is a Go binary with 4 main responsibilities: ingestion, digestion, querying and expiration of samples . A sample is simply a JSON record with multiple string and int fields:

{
  time: <seconds since epoch>, // REQUIRED
  gentime: 200,
  host: "aws-112-e1",
  pageload: 1233,
  browser: "Chrome",
  path: "/home",
  country: "US",
  city: "NY",
  region: "NY"
}

Samples typically represent event data: system logs, physical sensors, web analytics, fitness monitors, etc.

When sybil ingests samples from stdin, a new file is created in the ingest log that holds the samples in row form. Each invocation of the sybil binary will create its own ingestion log file, to reduce lock contention. If more than a certain amount of data is in the ingestion log, a digest is initiated. During digestion, files are pulled out of the ingestion log and processed into blocks in the column store, with each block holding up to 65K records.

When running a query, sybil first loads metadata off disk and then decides which blocks actually need to be read off disk. The blocks (and their relevant columns) are then loaded in parallel and a map reduce query is run on each block. This sounds like it would be slow, but in practice it is faster than doing full table scans on single-threaded row stores because we are loading less data off disk and analyzing it in parallel.

The final part of the sample lifecycle in sybil is expiring data: old blocks can be compressed and deleted out of a table using sybil’s trim command based on the desired table size or sample age.

Sybil vs Other DBs

Sybil falls somewhere between a time series database (TSDB) and a relational DB (RDB). Like a TSDB, there is no up front schema for tables and sybil stores immutable data. Unlike a TSDB, sybil also supports records with multiple fields and can quickly run queries with dynamic filters and groupings (similar to a RDB).

Unlike an RDB, Sybil does not support UPDATEs or JOINs or even have the idea of a unique ID for each record. Instead, blocks of records are pulled off disk and quickly aggregated. Doing full block scans means that we can continually adjust our query filters and time granularity while we ask for different views of our tables and still have reasonable (sub second) query execution times on millions of samples.

In other words, sybil is a naive aggregation engine, optimized for execution speed of simple dynamic queries on raw samples and log data over time.

Execution Engine

Querying

Sybil supports three main query types: raw samples, tables and time series as well as avg, sum, count, count distinct and percentile summary aggregations.

A simple query might be the SQL equivalent of:

SELECT count(*), PERCENTILE(pageload, 50)
FROM pagestats
WHERE time > -1 week AND time < now
GROUP BY city, browser, country;

To issue queries, sybil uses a map-reduce execution style on the blocks in a table. Each block is loaded and queried in parallel by a worker pool and then combined into the master aggregator’s results.

The process looks something like:

for block in blocks:
  * load block info.db off disk
  * test the block extrema (from info.db) against filters to see block can be skipped
  * check the block's query cache to see if this query is cached
  * allocate records and load pertinent columns off disk using the LoadSpec
  * assemble columns into record form inside the block
  * filter the records in the block
  * group and aggregate the records into per block results
  * save block results to query cache (if caching is enabled)
  * recycle the block and put it in the table's block pool

combine and rollup block results into master result
print results

We get some nice feature from block by block execution. For example, during string group by queries, the rollup rows are stored as small byte buffer values and only expanded into their full string values when finalizing the result from a block, saving on string joining and hashing time. Similarly, when we are loading data off disk, string filters are executed only once per string per block, even if shared among multiple records.

Note: Because of the map reduce style execution, certain queries are liable to blow up the size of data: for example, a group by on a high cardinality column will incur some large penalties when there are thousands or millions of values in the group by.

Serverless Architecture & Locking

Sybil is a binary that exists only for the duration of each command and exits when the command is finished. This is in contrast to many databases which are either standalone servers or embedded in other processes.

To be multi-process safe (without a daemon process), Sybil uses files for locking key parts of a table during the ingestion and digestion process. Each table has 4 main locks: the info lock, the digest lock, the cache lock and the block locks.

The info lock is used by every sybil invocation to read the info.db file for the table; it is the most commonly contested lock.
The digest lock is used to prevent multiple processes from trying to digest the same ingest files.
The block lock is used when writing data into a block during the digestion process to prevents another process from trying to append that same block.
The cache lock is for syncing the block metadata cache - this only comes into usage after there are hundreds of blocks.

When a lock is grabbed, the owner process writes its PID into the lock file. If a process tries to grab a lock and its taken, the process will verify that the PID is still alive. If the PID is dead, sybil will initiate a lock recovery process and try to automatically resolve the situation.

To stress the serverless architecture, we spin up several ingestion threads and have them ingest and digest records as fast as they can in batches of 10, 100 and 1000 and then check for record consistency.

To test the lock recovery process we simulate disk out of space errors and purposefully create locks with no owner. We then verify sybil does not lose any data and can recover the locks.

Turning off the GC

For all of Go’s advantages, Go does have some drawbacks, including a garbage collector that can jump in unexpectedly. In order to speed up sybil’s execution, sybil makes several memory oriented optimizations. Firstly, sybil allocates records in giant re-usable slabs. When a worker is done querying, the slab goes back to the pool and is re-used. Secondly, sybil turns off the GC during block aggregation and turns it on at pre-defined points to free memory back to the OS.

Together, these optimizations increase sybil’s performance speed significantly, because memory allocations are minimized and the GC doesn’t need to halt the world unexpectedly.

In terms of performance, with GC turned off and slab reallocation on, instantiating 15mm records takes about 450ms. With GC on, sybil takes 550ms. With slab recycling off, we take over 750ms, regardless of whether GC is enabled or not. This is likely due to the large cost of allocating new memory and thread contention over mallocs, even though we are still doing slab allocations.

Channels vs Mutexes

Another of the common performance problems I ran into with Go is channels. Originally, the map reduce pattern and record materialization looks perfect for channels, but in practice, sending data over channels introduces more overhead between threads than well placed (or no) mutexes.

One nice feature of Go is its race detector tool, which finds when multiple threads access the same variable. Plugging up race conditions with mutexes has never been easier.

Time Series Queries

Time Series queries are a first class citizen in sybil and built into the aggregation engine. Whether this is a good thing remains to be seen. In essence, a time series query is just adding a group by time to a column, but often that is difficult in a relational DB unless the granularities are specified ahead of time, involving an up front schema

Weighted Columns

Since sybil is meant for instrumentation, the idea of a sample rate is supported in the DB. If you want to sample at the rate of 1 in 100 samples, you can add a ‘weight’ column to a record to have the that record serve as a weighted sample. If a weight column is specified for a query, it will be used to weight the counts, averages and histograms when running aggregations.

The purpose of this is to allow downsampling while still sampling some events at 1:1 ratio and others at 1:100 ratio: for example errors vs. successful requests.

Histograms

The basic rollup result for each numeric quantity is a histogram. The histogram is created by segmenting the range for the column into even buckets. This is purposefully creating approximate histograms in exchange for query speeds.

I’ve evaluated advanced histogramming methods, like the hdr histogram and t-digests, but they will often be more expensive than a simple flat histogram, whether it is in recording the values or joining histograms across blocks.

Oct 2017 Update:

After deliberating for some time, I decided to implement nested histograms - with the outer hist having logarithmic buckets and the inner histograms having fixed size buckets. it works well for certain distributions, like file sizes.

When used in a group by, query timing can go up enormously, so these nested hists are to be used with care. Combining cached query results with a large group by can take 1.2s (vs. 400ms) because of the large number of buckets being used, which is: LG(col_max - col_min) * NUM_BUCKETS.

Query Caching

To enable query caching, use “-cache-queries” flag.

When query caching is enabled, sybil will try to hash query results based on the query parameters. This can give a speedup of 5 - 10x (or more), depending on the given query and if its been run before.

Each block of 65K records has its own query cache, so time is still spent stitching together block results. For queries with a high cardinality group by, that means the query cache will not help as much (but will still help!)

One of the interesting problems that came up when generating the hash from the query parameters is that most queries come with a time range on them, so we can’t naively hash all filters. If we did generate a hash with all filters, it would be constantly changing as the time filter will be changing every query.

Storage Details

Sybil stores all data using the encoding/gob module for better (improved disk usage) or worse (slightly degraded performance due to reflection). Every table is just a directory on disk with an info.db file, an ingest log and sub-directories of records. Each sub-directory is known as a block and holds 65K records. Each field in a block is stored as a separate file on disk. To access any particular field in a block, the whole file for that field is read off disk and unpacked into row form.

By storing data in blocks and columns, a query only needs to load the data it needs off disk and not more.

Layout and Data types

Sybil supports 3 main datatypes: integers, strings and sets of strings and that’s all.

An example db might look like:

db/
  mytable/
    info.db
    block0000001/
      str_col_1.db
      int_col_2.db
      int_col_3.db
      set_col_4.db
      info.db
      cache/
        823018baedf.db.gz
        e2384abdfff.db.gz
    block0000002/
    block0000003/
    cache/
    ingest/
      row_file_109283.log
      row_file_892912.log

Schema-less

Sybil doesn’t require up-front schemas: ingest a sample and your table now exists. But woe is the person who ingests a column with multiple data types under it: only the majority datatype will be accepted after some consensus is reached.

The schema-less ingestion is vital to frictionless logging of instrumentation data. Any hiccups in that process can prevent new instrumentation from being created.

Schema-less samples means that new columns can be added on the fly without adjusting the dataset schema. Getting rid of unwanted columns is a matter of deleting files from disk, no locking or compaction of tables is necessary.

No Indeces

Unusually, sybil has no indeces, instead each block has a metadata file with information about the records in the block. this is all that’s really necessary when the data is being sent to sybil in (mostly) chronological order, because most of the queries we will process are filtered on a date range. This means that we have two working sizes on our data: how big the data is on disk and how many blocks we need to actually load and process off disk. For row store DBs, these numbers are often the same and/or dictated by indeces. With sybil, its possible to have hundreds of MB on disk but only need to load a few MB for a query.

Column Compression

More details about Column Compression in Sybil

Sybil stores digested data in column form, meaning that each block of data is stored as one file per column. This allows for per column storage schemes and optimization.

On average, sybil uses very little space for samples, with low cardinality fields taking only a byte. To do so, sybil stores all strings as integer IDs with a lookup table along side them. Sybil then uses a combination of variable length int encoding, delta compression and bucketing low cardinality values to make the most of per column compression.

For a low cardinality string column, the typical space can be 1 or 2 bytes per field, while higher cardinality columns can take up to 2 or 3 bytes per field + the actual string content.

For integer columns, a second timestamp (2^31) will take ~3 bytes, while smaller values will take up as little as a single byte.

File Compression

In addition to column compression, sybil supports selective compression of files: files can be gzipped as they become old or less relevant and will still be loaded off disk if they are relevant to the query. In fact, sybil encourages this: gzipping a file is a great way to save hard disk space and sybil comes with commands for finding the oldest blocks. Any .db file in a block is up for gzipping, but there is no point to gzipping the info.db files (and they will likely get replaced anyways)

Using gzip on columns will greatly decrease the space they use. A typical low cardinality column can drop to less than 1KB of space for 65K records (previously at 65Kbytes, 1 byte per record), while a timestamp column drops to 1 byte per record (previously at 3 bytes per record)

Auto Compaction

At the end of an ingestion, sybil checks to see if the digestion threshold has been hit by inspecting the number and size of the files in the ingestion log and automatically starts a digestion if necessary. By using the digest locks and keeping track of the number of files when the digestion started, sybil avoids duplicated or uneccessary consecutive digestions without needing to read the ingestion log. This allows multiple ingestors to initiate a digestion without worrying that multiple digestions will be executed. Of course, executing multiple digestions is perfectly safe, but it is a waste of CPU.

Block Metadata Cache

When a dataset has more than a few hundred blocks (10mm+ samples), the process of reading the block metadata starts to take a non-negligible amount of time. To reduce this time, sybil keeps track of the finished blocks and their metadata once the total number of blocks in a dataset reaches a threshold.

By keeping track of cached metadata on finished blocks, sybil can avoid dozens of file loads off disk and save hundreds of milliseconds as the dataset grows in size.

Errata

Perf Numbers

On my laptop (4 core, 2.4 GHz i5-6300U), my query numbers are:

allocate, but don’t load 15mm rows: 450ms
query 15mm with 1 int col: 900ms
query 15mm with 1 low cardinality str col: 1.4s
query 15mm with 2 low cardinality str col: 1.9s

My typical working set is between 100K to 1mm samples, so queries generally run in under a second . If you have 10mm or more samples in your query range, you’ll want to invest in a nice and strong server machine. If you don’t want to spend time on upgrading and maintaing a server, you can always sample your data and log 1 in 10 or 1 in 100 samples.

Using Go

In choosing to use Go for this project, I made the decision to try out a new language on a larger project, not knowing what the bottlenecks would be or where it would take me. Overall, the development time in Go to get up and running with a threaded application has been very nice, as well as the tooling: profiler, race detection, unit testing, package manager, etc.

The bad parts of Go have been around memory management and ability to really optimize execution. While its nice to never worry about freeing pointers, fighting the GC has been annoying. Similarly, Go does not really implement architecture specific optimizations or have good SIMD support (yet), so the overall timing of the program is maybe 2 - 5x slower than it could be while development time was 2 - 5x faster.

Reflections

Going in to this project, my overall goal was to create a backend that was easy to setup and usable for the typical instrumentation analysis I do (usually up to 10mm samples in a given time period). I had wanted to test the feasibility of several ideas: 1) see if full table scans are really such a bad idea, 2) create a schema-less datastore and 3) run multi-threaded aggregations.

Looking back, sybil has been a pretty good effort in this respect. It can drop in as a mongo replacement for me with much better aggregation performance and lower disk usage. My uptime DB had been taking 1GB of data for 10mm samples in mongo. After moving to sybil, getting rid of mongo’s globally unique record IDs and using gzip compression on old records, this simple db only takes 5MB for 15mm samples and queries run much faster.

Admittedly, mongo is a low target, but sybil will also out-perform postgres’s JSONB columns as well. It turns out, if you want to store schema-less data, there are not many options that are also fast.

Design Decisions

When starting on the sybil, the question was “how can we make full table scans go as fast as possible on data that doesn’t have an up front schema”.

Walking forward from that (with the scuba whitepapers and abadi survey in hand), several design decisions came out:

tables must not require a schema to be created
the data would be stored on disk, each column in one file - 65K records per file
the queries would be run in multi-threaded mode, trying to go as fast as possible
data would be stored in a columnar format, allowing only a subset of data to be loaded off disk
histogram queries must be supported
time series queries would be first class
tables must support expiring old data out using disk limits
ability to skip chunks of irrelevant data based on time filters
sybil would be run as a serverless binary (no server process involved)

Critiques

An obvious criticism to make is that: postgres and mongo support indeces, so it should be possible to make any queries of these DBs fast. Unfortunately, this means creating and maintaining a schema and knowing ahead of time which columns will be important. Otherwise, these DBs will have to run full table scans.

Using Go as the base language (instead of C or C++) may have been a mistake, as well. In terms of productivity, Go has been very nice, but for performance, there is some fight to the finish. At this point, I am pretty satisfied with the performance on a single machine, even though I know there is a lot of room to improve by using SIMD or GPUs.

As a column store, sybil is naive - it spends a lot of time in stitching records together (aka materialization). This means that sybil stores data as a column store, but turns the data into rows before issuing queries. According to the abadi survey on column stores, almost 50% of query performance (for specific queries that filter the data down to a smaller subset) can be improved by moving to a late materialization model, by filtering before turning the columns into records.

The current generation of SQL query engines and distributed clusters are JIT’ing SQL expressions into byte code on the fly, with performance speed at the level of compiled C++ code. Sybil is nowhere near this level of sophisticated: all queries are map reduce based and hand written in Go.

Another weakness is in using encoding/gob to encode all our data. This makes the data less transportable to other languages and does require overhead during reflection. A todo is to decide on whether parquet or other column storage formats are viable in Go

Databases I’m jealous of

There are plenty of fast and powerful DBs, but the ones that impress me the most are kdb+, clickhouse (even though it requires schemas) and scuba.

Useful tools

fio(1) for file IO testing
vmtouch for checking the file system cache
goprof’s profiling views: both weblist and the call graphs
pidstats for keeping track of per process performance over long periods

Changelog

2017-12-12

add design decisions section

2017-12-01

remove experimental queries, so sybil can focus on one goal

2017-10-01

add multi hist
add query cache
add race detector to channel mutex detection
add mention of materialization to critiques

2017-06-19

add channels vs. mutex section in go performance

2017-06-06

add more query notes: histograms, time series, etc
add critiques

2017-06-05

Add errata section: go and reflections
Add sybil vs rdb vs tsdb

2017-06-04

first write up of sybil implementation & features

log(v)