As sybil nears its two-year mark, one of the most requested features is distributed queries: a way of scaling analysis with the number of machines and cores. This page documents progress toward that goal.
Goal: add a distributed mode to sybil that enhances its query performance.
Sybil is a binary that can ingest and query JSON or CSV data, specifically meant for ongoing ingestion and analysis of instrumentation. Sybil is serverless, meaning it runs without a server - instead, ingestion and queries are each invocations of the binary.
As an aggregation engine, sybil runs map-reduce style queries on mostly immutable blocks of data stored in columnar form. Because sybil is built around these map-reduce aggregations, moving from single-node to multi-node aggregation should be straightforward.
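The map-reduce structure is what makes multi-node aggregation a natural extension: partial results merge the same way whether they come from local blocks or remote leaves. A minimal sketch of the shape (illustrative names, not sybil's actual API):

```python
# Sketch of a sybil-style map/reduce aggregation over immutable blocks.
# map_block / reduce_partials are illustrative names, not sybil's API.

from collections import defaultdict

def map_block(rows):
    """Map phase: aggregate one immutable block into a partial result."""
    partial = defaultdict(lambda: [0, 0])  # group -> [count, sum]
    for group, value in rows:
        partial[group][0] += 1
        partial[group][1] += value
    return dict(partial)

def reduce_partials(partials):
    """Reduce phase: merge partials; this step is identical whether the
    partials came from blocks on one node or from many leaf nodes."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for group, (count, total) in partial.items():
            merged[group][0] += count
            merged[group][1] += total
    return {g: {"count": c, "avg": s / c} for g, (c, s) in merged.items()}

block_a = [("us", 10), ("eu", 20)]
block_b = [("us", 30)]
result = reduce_partials([map_block(block_a), map_block(block_b)])
```

Because the reduce step only sees partial results, a distributed query is just the same reduce applied to partials shipped over the network instead of read from local blocks.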
There are more specific implementation notes for sybil here
In terms of performance, each core should be able to aggregate 1 - 2mm rows per second. With a 4 CPU machine, we can run simple queries on 10mm rows in about 1s; with 16 CPUs, we should get queries on 40mm rows in 1 - 3s (+ time to combine results).
See the wiki page on sybil performance for more information and estimates.
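The per-core figure makes these estimates easy to sanity-check. A back-of-envelope calculation (using the 1 - 2mm rows/sec/core assumption above, and ignoring result-combination time):

```python
# Back-of-envelope query-time estimate from the per-core throughput
# numbers above (1 - 2mm rows/sec/core); combine time is not included.

def query_seconds(rows, cores, rows_per_core_per_sec):
    """Estimated scan time for a simple aggregation query."""
    return rows / (cores * rows_per_core_per_sec)

# 16 cores scanning 40mm rows:
fast = query_seconds(40_000_000, 16, 2_000_000)  # best case: 1.25s
slow = query_seconds(40_000_000, 16, 1_000_000)  # worst case: 2.5s
```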
- Leaf Node Preparation: Done
- Distributed Queries: Done
- Distributed Ingestion: Not Started
- Data Backup & Archiving: Not Started
- Node & Cluster Management: Not Started
Thus far, distributed and remote queries work, but distributed ingestion is not yet implemented. A preliminary msybil binary for randomized ingestion has been added, but there is no cluster management support yet.
A gRPC server has been accepted and incorporated into sybil master; it can accept and run queries over gRPC.
Dockerization efforts have begun under @tmc’s guidance - it will hopefully soon be possible to deploy and maintain snorkel + sybil via Docker.
Distributed queries are working and part of snorkel / sybil. The main client interested in building the distributed ingestion example has gone off the radar, so there has been no forward progress. It’s likely that what they’ve built is proprietary.
re: distributed ingestion: I think every team has its own ingestion story (likely involving kafka or other pipelines) - sybil just sits at the end of one of these pipes and ingests.
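The "end of the pipe" pattern usually amounts to batching records from whatever pipeline a team already runs and feeding them to sybil's ingest invocation as newline-delimited JSON on stdin. A hedged sketch - the `sybil ingest -table` flags here are assumptions, so check the binary's own help output for the real interface:

```python
# Sketch: drain a batch of records from any upstream pipeline (kafka,
# etc.) and pipe them into a sybil ingest invocation as JSON lines.
# The sybil command and flags below are assumptions, not verified API.

import json
import subprocess

def ingest_batch(records, cmd=("sybil", "ingest", "-table", "events")):
    """Serialize a batch of dicts and feed them to an ingest process."""
    payload = "\n".join(json.dumps(r) for r in records)
    return subprocess.run(cmd, input=payload, text=True,
                          capture_output=True, check=True)
```

Because ingestion is just a binary invocation reading stdin, the upstream pipeline needs no sybil-specific integration - any consumer that can spawn a process and write JSON lines will do.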
December has been a laid back month with the holidays, so I’m pushing the goals for cluster setup into Q1 of 2018.
November has been spent preparing the sybil binary for usage in a distributed scenario. In order to get there, several features and necessary adjustments have been added, with the goal of creating a leaf that can work with high cardinality (1mm+ unique values in a column) data. Specific enhancements are:
- using loglogbeta for approximate count distinct queries
- a new intermediate result pruning stage during map / reduce
- lowering the overall memory usage of queries
- speeding up high cardinality string loading off disk
- serializing query results to stdout as bytes (leaf)
- loading multiple saved QuerySpecs from disk and combining them (aggregator)
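The last two items define the leaf/aggregator split: a leaf writes its partial result to stdout as bytes, and an aggregator decodes several such payloads and combines them. A sketch of that round trip - the wire format here (JSON bytes) is a stand-in for whatever serialization sybil actually uses:

```python
# Sketch of the leaf/aggregator serialization round trip. JSON bytes
# stand in for sybil's real wire format, which is not specified here.

import json
import sys
from collections import Counter

def leaf_emit(partial_counts, out=sys.stdout.buffer):
    """Leaf side: serialize a partial group->count result as bytes."""
    out.write(json.dumps(partial_counts).encode("utf-8"))

def aggregator_combine(payloads):
    """Aggregator side: decode each leaf's bytes and merge the counts."""
    merged = Counter()
    for payload in payloads:
        merged.update(json.loads(payload.decode("utf-8")))
    return dict(merged)
```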
msybil is a python script that calls out to multiple machines via SSH, issues queries, and stitches their results together. It’s being bundled with snorkel for multi-machine queries, but may be separated into its own package later.
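The fan-out shape msybil uses can be sketched in a few lines: run the same query command against several hosts and merge the JSON results. The hostnames, the exact sybil query flags, and the merge policy below are all illustrative, not msybil's actual code:

```python
# Sketch of an msybil-style fan-out: issue one query command per leaf
# host (msybil does this over ssh) and stitch the JSON results together.
# Commands and merge policy are illustrative, not msybil's actual code.

import json
import subprocess
from collections import Counter

def query_host(cmd):
    """Run one leaf query command and parse its JSON output."""
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

def fan_out(commands):
    """Issue the query to every leaf and merge the per-group counts."""
    merged = Counter()
    for cmd in commands:
        merged.update(query_host(cmd))
    return dict(merged)

# Over ssh this would look roughly like:
#   fan_out([["ssh", host, "sybil", "query", ...] for host in hosts])
```

A real implementation would issue the leaf queries concurrently and handle unreachable hosts; the sequential loop here just shows the stitch step.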
- manual cluster setup & allocation
- setup example pidstats dataset on all cluster machines
- write automation script for setting up sybil and binaries
- distributed ingestion using scalable message queues
- better cluster management commands
- distributed table trimming
- backup & restoration of data
Example Cluster Setups
early business / open source projects
- 1 machine, estimated cost: ~$10 - 50 / mo
- deployment is manual and requires backups; single point of failure
- dataset working size: 10 - 20mm rows for fast queries, 40mm for slow queries
- 4 machines (4 CPU, 8GB RAM) + 1 aggregator, estimated cost: ~$200 / mo
- deployment is manual; each machine has a user called “sybil”, under which the db and sybil binaries are installed
- data is backed up via an external process (on a per machine basis). if a node is lost, its image has to be restored from a backup
- estimated working table size: 50 - 100mm for fast query speeds (< 3 seconds)
- estimated ingestion rate: 20k rows per second
- 20 machines + 5 aggregators, estimated cost: ~$1000 / mo
- 4 beefy machines + 1 aggregator: estimated cost ???
- deployment is managed via cloud interface, nodes can enter and exit at will
TODO: determine the performance of high compute machines and clusters
- not thinking about this yet, but a cluster of 100 commodity machines with hierarchical aggregation could get us 1 - 4 billion records per query
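The 1 - 4 billion figure is consistent with the per-core numbers earlier on this page, assuming 4-core commodity machines at 1 - 2mm rows/sec/core and a 2.5 - 5s query window (the machine size and window are assumptions; hierarchical aggregation only amortizes the combine step):

```python
# Checking the 1 - 4 billion records/query figure against the earlier
# per-core throughput numbers. Assumes 4-core commodity machines at
# 1-2mm rows/sec/core and a 2.5-5s query window (both assumptions).

machines, cores = 100, 4
low  = machines * cores * 1_000_000 * 2.5  # slow cores, short window
high = machines * cores * 2_000_000 * 5.0  # fast cores, long window
```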
- update with 2018 progress
- remove impl. details
- update december progress, move cluster setup goals into Q1 of 2018
- update november progress
- First write up.
- Add november progress
- Add goals and roadmap
- Add notes about leaf raising protocol