log(v)

back
Need help setting up snorkel?
Please send an email to okay.zed at gmail and I'll be more than glad to help you get setup with an instance and answer any questions about your use case and requirements.
Happy snorkeling!

On the Benefits of Being Schema-less

Inheriting ideas from Scuba, one of the main design decisions of Sybil is that tables do not require their schema defined ahead of time before inserting data. I think this is an important feature distinction for several reasons.

Blobs and databases

If we consider analytic data that clients send to us, most modern analytic solutions allow their customers to send them blobs of JSON data alongside a structured payload. Inside a structured database with schemas, that customer blob will end up in a mega column that is unstructured. The row for each sample would look something like

user id | timestamp | event name | client blob

With this schema and some indeces, we’re able to do fast user counting, time filtering and per event aggregations. However, because the client blob is stored as a mega column, we lose out on indexing and compressability. For the structured DB to get the most out of our client blob, it’ll need hints on how to treat each interior column. Without these hints, running analytic queries on this data becomes more expensive, especially for queries that use filters or generate time series.

Storing the blob in one mega column also means that this column is essentially a row store, so queries involving the mega column will need to load all its data off disk, regardless of which interior columns are being used.

Because of the missing column statistics and implied row storage, querying a relational DB that holds these blobs is essentially forcing single threaded full table scans on row store data, which tends to be quite slow.

These issues can be worked around (by using JSON indeces in postgres, for example), but overall, it requires per table schema maintenance and upkeep from either the person logging the data or the owners of the database, which is often hard to enforce or remember to do.

Schema-less Sybil

Sybil recognizes that ad-hoc data will often create this blobby situation and was built from the ground up for this particular workload: full table scans on unstructured data. But the main differences are that Sybil 1) uses columnar storage even on schema-less data and 2) has a multi-threaded query engine, together reducing the total disk space required and speeding up the execution time on analytic queries.

By not insisting on an up-front schema, sybil basically treats each dataset as a giant table with sparse columns. When samples come in with new columns, the columns are added as new sparse columns in the datastore.

The major difficulty with schema-less data is when the datatype changes underneath a column. It’s often the case that someone logged an int as a string or vice versa and they want to change the column type. This happens much more frequently than one would think, especially when first adding instrumentation.

For sybil, we decide to start throwing away samples with the wrong column type (assuming a majority consensus has been reached), so instead of changing a column type, its better to just create a new column. While its not impossible to change the type of a column, it does require some manual intervention: first the old column files need to be deleted and then the sybil info.db needs to be rebuilt using the sybil rebuild command.

Performance

As naive as full table scans sound, they work out surprisingly well, as long as you are willing to chew up some CPU.

In my practical experience, a relational DB with single threaded readers will not scale for analytic queries well past 1 - 3mm rows inside an unstructured column (taking up to 10 or 15 seconds to execute simple time series queries) because adding filters tend to induce full table scans. In sybil, which is built for full table scans, we can build time series queries on 10mm rows in 1 - 3 seconds.

Comparing against a schema-less row store DB that is not built for aggregations, we encounter the same performance difference: mongo will fall over at time series queries on 1 - 3mm records, while sybil keeps going.

Conclusion

While it may look like sloppy programming, requiring schema creation and maintenance can be worse for the customer and client experience with certain types of data, particularly analytics and instrumentation. And while I’m recommending some use cases for schema-less data, I’m not recommending that we get rid of all schemas and I do advise double logging your instrumentation data to long term storage.