In this post, we will talk about the differences between Scuba and Snorkel, two realtime data analysis tools. If you want to read more about OLAP engines and time series databases (TSDBs), or see a list of OLAP engines, try this page.
Overview
Scuba is a realtime log analysis tool used at Facebook. It’s both a UI for building queries and a distributed online analytical processing (OLAP) query engine. With a fleet of servers at its disposal, Scuba holds billions of records in RAM and runs simple aggregation queries in under a second. Today, Scuba is used alongside ODS (a time series database) to monitor and maintain Facebook’s infrastructure.
In general, Scuba trades accuracy and consistency for speed: its primary purpose is as a diagnosis and debugging tool for infrastructure. When people want to monitor an existing metric, they use their TSDB. When they want to explore, debug or diagnose, they use Scuba.
Unfortunately for us, Scuba is not available outside Facebook. That’s where Snorkel comes in. Snorkel is conceptually the little sister to Scuba. The major differences are that Snorkel is 1) free, 2) publicly available and 3) smaller in scope.
Snorkel’s backend, Sybil, runs on a single machine and handles much smaller datasets than Scuba’s backend. Sybil stores data on disk (instead of in memory) and is generally good for queries on up to 10M rows on commodity hardware (30M rows on server-class hardware), versus the billions of rows that Scuba can handle.
Aside from their difference in scale, Scuba and Snorkel are similar in many ways: they allow ad-hoc queries, they perform full table scans, they do not require up-front schema definitions, and they cap table size by memory and time limits. Together, these features make for a compelling backend for digging through instrumentation.
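To make those four properties concrete, here is a minimal Python sketch of a table that accepts records without a schema, answers ad-hoc queries with a full scan, and expires old rows. It is not how Scuba or Sybil are implemented: `CappedTable`, its parameters and the field names are all hypothetical, and a row-count cap stands in for a real memory limit.

```python
from collections import defaultdict, deque
import time


class CappedTable:
    """Hypothetical append-only, schemaless table capped by size and age."""

    def __init__(self, max_rows=1000, max_age_secs=3600):
        self.max_rows = max_rows          # assumption: row count as a proxy for memory
        self.max_age_secs = max_age_secs  # time limit for retained records
        self.rows = deque()               # (insert_time, record) pairs, oldest first

    def insert(self, record, now=None):
        now = time.time() if now is None else now
        # No schema check: any dict of fields is accepted as-is.
        self.rows.append((now, dict(record)))
        self._expire(now)

    def _expire(self, now):
        # Drop the oldest rows while either cap is exceeded.
        while self.rows and (
            len(self.rows) > self.max_rows
            or now - self.rows[0][0] > self.max_age_secs
        ):
            self.rows.popleft()

    def query(self, group_by, metric, where=lambda record: True):
        """Ad-hoc query: average `metric` per `group_by` value via a full scan."""
        sums = defaultdict(float)
        counts = defaultdict(int)
        for _, record in self.rows:  # no index, so every row is visited
            if group_by in record and metric in record and where(record):
                sums[record[group_by]] += record[metric]
                counts[record[group_by]] += 1
        return {key: sums[key] / counts[key] for key in sums}


table = CappedTable(max_rows=3, max_age_secs=60)
table.insert({"host": "web1", "latency_ms": 10}, now=0)
table.insert({"host": "web1", "latency_ms": 30, "region": "eu"}, now=1)  # new field, no schema change
table.insert({"host": "web2", "latency_ms": 50}, now=2)
table.insert({"host": "web2", "latency_ms": 70}, now=3)  # exceeds max_rows, evicts the oldest row
print(table.query("host", "latency_ms"))  # {'web1': 30.0, 'web2': 60.0}
```

Because old rows simply fall off the end, results only ever describe the most recent window of data, which is the kind of accuracy-for-speed trade described above.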
Feature Comparison Matrix
Storage Features | Scuba | Snorkel | TSDB | OLAP |
---|---|---|---|---|
Requires Schemas | X | |||
Distributed | X | Planned | Commercial | Commercial |
Column Store | X | X | Optional | |
Tabular Data | X | X | X | |
Indices | X | X | ||
Append only data | X | X | X | |
Max Insertion | 1M+/s | 1K/s | 10K+/s | 1M+/s |
Mem Capped Tables | X | X | X | |
Query Features | Scuba | Snorkel | TSDB | OLAP |
SQL Support | X | X | ||
JOIN Queries | X | |||
Parallel Query Engine | X | X | ||
Table Queries | X | X | X | |
Time-Series Queries | X | X | X | X |
Distribution Queries | X | X | X | |
Samples Queries | X | X | X | |
Feasible Table Scans | 1B+ | 10 - 30M | N/A | 1B+ |
Frontend Features | Scuba | Snorkel | TSDB | OLAP |
Time Controls | X | X | X | ? |
Filter Controls | X | X | ? | |
Table View | X | X | ? | |
Time View | X | X | X | ? |
Composable Time Series | X | ? | ||
Dist. View | X | X | ? | |
Graph View | X | X | ? | |
Scatter Plot | X | X | ? | |
Sankey | X | ? | ||
Custom Views by Dataset | X | X | ? | |
Key typeaheads | X | X | X | ? |
Value typeaheads | X | ? | ||
Time Comparison | X | X | X | ? |
Filter Comparison | X | X | ? | |
Dashboarding | X | External | External | ? |
CHANGELOG
2017-11-30
- Add note about distributed query work being planned
- Add note about support for custom views per dataset
2017-06-17
- First write up.
- Add comparison table
- Add intro paragraphs