- Cluster Management
- Example Cluster Setups
As sybil nears its 2 year age mark, one of the biggest features that is requested is distributed queries: a way of scaling analysis with the number of machines and cores. This page will document progress towards that goal.
Add a distributed mode to sybil that enhances its query performance
Sybil is a binary that can ingest and query JSON or CSV data, specifically meant for ongoing ingestion and analysis of instrumentation. Sybil is serverless, meaning that it runs without a server - instead, ingestion and queries are run as invocations of a binary.
As an aggregation engine, sybil runs map reduce style queries on mostly immutable blocks of data stored in columnar form. Because sybil is built around these map reduce aggregations, moving from a single node to multiple node aggregation should be straight forward.
There’s more specific implementation notes for sybil here
In terms of performance, each core should be able to aggregate 1 - 2mm rows per second. With a 4 CPU machine, we can get simple queries on 10mm rows in 1s, with 16 CPU, we should get 40mm queries in 1 - 3s (+ time to combine results)
See the wiki page on sybil performance, for more information and estimation.
- Leaf Node Preparation: Done
- Distributed Queries: First Draft
- Distributed Ingestion: Not Started
- Data Backup & Archiving: Not Started
- Node & Cluster Management: Not Started
GRPC server has been accepted and incorporated into sybil master. The grpc
server can accept and run queries over grpc using the
Dockerization efforts have begun under @tmc’s guidance - it will hopefully soon be possible to deploy and maintain snorkel + sybil via docker.
Distributed queries are working and part of snorkel / sybil. The main client interested in building the distributed ingestion example has gone off the radar, so there has been no forward progress. It’s likely that what they’ve built is proprietary.
re: distributed ingestion: I think every team has their own ingestion story (likely involving kafka or other pipelines) - sybil is just sits at the end of one of these pipes and ingest.
December has been a laid back month with the holidays, so I’m pushing the goals for cluster setup into Q1 of 2018.
November has been spent preparing the sybil binary for usage in a distributed scenario. In order to get there, several features and necessary adjustments have been added, with the goal of creating a leaf that can work with high cardinality (1mm+ unique values in a column) data. Specific enhancements are:
- using loglogbeta for approximate count distinct queries
- a new intermediate result pruning stage during map / reduce
- lowering the overall memory usage of queries
- speeding up high cardinality string loading off disk
- can serialize query results to stdout as bytes (leaf)
- load multiple saved QuerySpec from disk and combine them again (aggregator)
msybil is a python script that calls out to multiple machines via SSH and issues queries and stitches their results together. It’s being bundled with snorkel for multi-machine queries, but may be separated into its own package later.
- prepare sybil for usage as a leaf node
- plug in distributed sybil binary to snorkel
- manual cluster setup & allocation
- setup example pidstats dataset on all cluster machines
- write automation script for setting up sybil and binaries
- distributed ingestion using scalable message queues
- better cluster management commands
- distributed table trimming
- backup & restoration of data
- init new aggregator
- init new leaf node (where data is stored)
- list leaf nodes
- keep track of up and down leaf nodes
- send new data to a leaf node
- backup data from leaf nodes
- bring down a leaf node
Ingestion & Storage
- send samples into the cluster via thrift or grpc or other API
- read samples off message queues
- backup data to s3 (or other storage)
- clear query caches
- create deployable cronjobs for managing sybils directories
- distributed table trim with per dataset customizations on max size
- archive old data
- have a query queue / scheduler that makes sure queries are not thrashing resources
- run query on multiple hosts and bring results back for aggregation
- report the total number of records in the table, number matching and number aggregated
- need to make sure that the aggregator is always up
- need to make sure that the message queues are always running
- runbook for new leaves
- runbook for restoring backups
- diagnosing issues
Example Cluster Setups
early business / open source projects
- 1 machine, estimated cost ~10 - 50$ / month
- deployment manual, requires backups. single point of failure
- dataset working size 10 - 20mm for fast queries, 40mm for slow queries
- 4 machines (4 cpu, 8gb RAM) + 1 aggregator, estimated cost: 200$ / mo
- deployment is manual, each machine has a user called “sybil” which is where the db and sybil binaries are installed
- data is backed up via external process (on a per machine basis). if a node is lost, the whole image has to be restarted from a backup
- estimated working table size: 50 - 100mm for fast query speeds (< 3 seconds)
- estimated ingestion rate: 20k / per second
- 20 machines + 5 aggregators, estimated cost: ~1000$/mo
- 4 beefy machines + 1 aggregator: estimated cost ???
- deployment is managed via cloud interface, nodes can enter and exit at will
TODO: determine the performance of high compute machine and cluster
- not thinking about this yet, but a cluster of 100 commodity machines with hierarchical aggregation could get us 1 - 4billion records per query
- update december progress, move cluster setup goals into Q1 of 2018
- update november progress
- First write up.
- Add november progress
- Add goals and roadmap
- Add nodes about leaf raising protocol