logv

Overview
- Goal
- About Sybil
- Status
Progress
Roadmap
Example Cluster Setups
Changelog

As sybil nears its 2 year age mark, one of the biggest features that is requested is distributed queries: a way of scaling analysis with the number of machines and cores. This page will document progress towards that goal.

Overview

Goal

Add a distributed mode to sybil that enhances its query performance

About Sybil

Sybil is a binary that can ingest and query JSON or CSV data, specifically meant for ongoing ingestion and analysis of instrumentation. Sybil is serverless, meaning that it runs without a server - instead, ingestion and queries are run as invocations of a binary.

As an aggregation engine, sybil runs map reduce style queries on mostly immutable blocks of data stored in columnar form. Because sybil is built around these map reduce aggregations, moving from a single node to multiple node aggregation should be straight forward.

There’s more specific implementation notes for sybil here

In terms of performance, each core should be able to aggregate 1 - 2mm rows per second. With a 4 CPU machine, we can get simple queries on 10mm rows in 1s, with 16 CPU, we should get 40mm queries in 1 - 3s (+ time to combine results)

See the wiki page on sybil performance, for more information and estimation.

Status

Leaf Node Preparation: Done
Distributed Queries: Done
Distributed Ingestion: Not Started
Data Backup & Archiving: Not Started
Node & Cluster Management: Not Started

Thus far, distributed and remote queries work but distributed ingestion is not yet implemented. A preliminary msybil binary for randomized ingestion has been added, but no cluster management support has been added yet

Progress

October 2018

GRPC server has been accepted and incorporated into sybil master. The grpc server can accept and run queries over grpc using the -dial flag

May 2018

Dockerization efforts have begun under @tmc’s guidance - it will hopefully soon be possible to deploy and maintain snorkel + sybil via docker.

April 2018

Distributed queries are working and part of snorkel / sybil. The main client interested in building the distributed ingestion example has gone off the radar, so there has been no forward progress. It’s likely that what they’ve built is proprietary.

re: distributed ingestion: I think every team has their own ingestion story (likely involving kafka or other pipelines) - sybil is just sits at the end of one of these pipes and ingest.

December 2017

December has been a laid back month with the holidays, so I’m pushing the goals for cluster setup into Q1 of 2018.

November 2017

Leaf Node

November has been spent preparing the sybil binary for usage in a distributed scenario. In order to get there, several features and necessary adjustments have been added, with the goal of creating a leaf that can work with high cardinality (1mm+ unique values in a column) data. Specific enhancements are:

using loglogbeta for approximate count distinct queries
a new intermediate result pruning stage during map / reduce
lowering the overall memory usage of queries
speeding up high cardinality string loading off disk
can serialize query results to stdout as bytes (leaf)
load multiple saved QuerySpec from disk and combine them again (aggregator)

msybil is a python script that calls out to multiple machines via SSH and issues queries and stitches their results together. It’s being bundled with snorkel for multi-machine queries, but may be separated into its own package later.

Roadmap

Q? 2019

manual cluster setup & allocation
setup example pidstats dataset on all cluster machines
write automation script for setting up sybil and binaries
distributed ingestion using scalable message queues

Q? 2019

better cluster management commands
distributed table trimming
backup & restoration of data

Example Cluster Setups

early business / open source projects

1 machine, estimated cost ~10 - 50$ / month
deployment manual, requires backups. single point of failure
dataset working size 10 - 20mm for fast queries, 40mm for slow queries

small business

4 machines (4 cpu, 8gb RAM) + 1 aggregator, estimated cost: 200$ / mo
deployment is manual, each machine has a user called “sybil” which is where the db and sybil binaries are installed
data is backed up via external process (on a per machine basis). if a node is lost, the whole image has to be restarted from a backup
estimated working table size: 50 - 100mm for fast query speeds (< 3 seconds)
estimated ingestion rate: 20k / per second

medium business

20 machines + 5 aggregators, estimated cost: ~1000$/mo
4 beefy machines + 1 aggregator: estimated cost ???
deployment is managed via cloud interface, nodes can enter and exit at will

large business

TODO: determine the performance of high compute machine and cluster

not thinking about this yet, but a cluster of 100 commodity machines with hierarchical aggregation could get us 1 - 4billion records per query

Changelog

2019-04-15

update with 2018 progress
remove impl. details

2017-12-31

update december progress, move cluster setup goals into Q1 of 2018

2017-11-29

update november progress

2017-11-22

First write up.
Add november progress
Add goals and roadmap
Add nodes about leaf raising protocol

log(v)

Overview

Goal

About Sybil

Status

Progress

October 2018

May 2018

April 2018

December 2017

November 2017

Leaf Node

Roadmap

Q? 2019

Q? 2019

Example Cluster Setups

early business / open source projects

small business

medium business

large business

Changelog

2019-04-15

2017-12-31

2017-11-29

2017-11-22