
As sybil nears its two-year mark, one of the most requested features is distributed queries: a way of scaling analysis with the number of machines and cores. This page documents progress towards that goal.

Overview

Goal

Add a distributed mode to sybil that improves its query performance.

About Sybil

Sybil is a binary that can ingest and query JSON or CSV data, specifically meant for ongoing ingestion and analysis of instrumentation. Sybil is serverless, meaning that it runs without a server - instead, ingestion and queries are run as invocations of a binary.

As an aggregation engine, sybil runs map reduce style queries on mostly immutable blocks of data stored in columnar form. Because sybil is built around these map reduce aggregations, moving from single-node to multi-node aggregation should be straightforward.
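To make the map reduce shape concrete, here is a minimal sketch in Python. It is not sybil's implementation: the block layout (a dict of column lists), the column names, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical columnar blocks: each column is a list, and row i is made up
# of the i-th entry of every column. This layout is an illustration only.
BLOCKS = [
    {"host": ["a", "b", "a"], "latency": [10, 20, 30]},
    {"host": ["b", "b", "c"], "latency": [5, 15, 25]},
]


def map_block(block, group_col, value_col):
    """Map step: aggregate one block into per-group partial results."""
    partial = defaultdict(lambda: {"count": 0, "sum": 0})
    for key, value in zip(block[group_col], block[value_col]):
        partial[key]["count"] += 1
        partial[key]["sum"] += value
    return dict(partial)


def reduce_partials(partials):
    """Reduce step: combine partial results from any number of blocks."""
    combined = defaultdict(lambda: {"count": 0, "sum": 0})
    for partial in partials:
        for key, agg in partial.items():
            combined[key]["count"] += agg["count"]
            combined[key]["sum"] += agg["sum"]
    return dict(combined)


def query(blocks, group_col, value_col):
    """Run the map step on every block, then reduce into one result."""
    return reduce_partials(map_block(b, group_col, value_col) for b in blocks)


if __name__ == "__main__":
    for host, agg in sorted(query(BLOCKS, "host", "latency").items()):
        print(host, agg["count"], agg["sum"] / agg["count"])
```

The reduce step only ever sees partial results, so it does not care whether they came from blocks on one machine or from several machines. That property is what makes the move to multiple nodes plausible.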

More specific implementation notes for sybil are available here.

In terms of performance, each core should be able to aggregate 1 - 2mm rows per second. On a 4 CPU machine, simple queries on 10mm rows take about 1s; with 16 CPUs, queries on 40mm rows should take 1 - 3s (plus time to combine results).
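As a back-of-the-envelope check on these numbers, the sketch below estimates query time from the per-core rate. It assumes that work splits evenly across cores and it ignores the time to combine results; the function names are hypothetical and this is not a benchmark.

```python
MILLION = 1_000_000


def estimate_query_seconds(rows, cores, rows_per_core_per_sec):
    """Estimate aggregation time, assuming work splits evenly across cores."""
    if rows < 0 or cores <= 0 or rows_per_core_per_sec <= 0:
        raise ValueError("rows must be >= 0; cores and rate must be > 0")
    return rows / (cores * rows_per_core_per_sec)


def estimate_range(rows, cores, slow=1 * MILLION, fast=2 * MILLION):
    """Return (best, worst) seconds for the 1 - 2mm rows/sec/core range."""
    best = estimate_query_seconds(rows, cores, fast)
    worst = estimate_query_seconds(rows, cores, slow)
    return best, worst


if __name__ == "__main__":
    best, worst = estimate_range(40 * MILLION, 16)
    print(f"40mm rows on 16 cores: {best:.2f}s - {worst:.2f}s")
```

For 40mm rows on 16 cores this gives 1.25s to 2.5s, which sits inside the 1 - 3s range quoted above before any combine time is added.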

See the wiki page on sybil performance for more information and estimates.

Status

Thus far, distributed and remote queries work, but distributed ingestion is not yet implemented. A preliminary msybil binary for randomized ingestion has been added, but there is no cluster management support yet.

Progress

October 2018

A gRPC server has been accepted and incorporated into sybil master. The gRPC server can accept and run queries over gRPC using the -dial flag.

May 2018

Dockerization efforts have begun under @tmc’s guidance - it will hopefully soon be possible to deploy and maintain snorkel + sybil via docker.

April 2018

Distributed queries are working and part of snorkel / sybil. The main client interested in building the distributed ingestion example has gone off the radar, so there has been no forward progress. It’s likely that what they’ve built is proprietary.

re: distributed ingestion: I think every team has their own ingestion story (likely involving kafka or other pipelines) - sybil just sits at the end of one of these pipes and ingests.

December 2017

December has been a laid-back month with the holidays, so I’m pushing the goals for cluster setup into Q1 of 2018.

November 2017

Leaf Node

November has been spent preparing the sybil binary for usage in a distributed scenario. In order to get there, several features and necessary adjustments have been added, with the goal of creating a leaf that can work with high cardinality (1mm+ unique values in a column) data. Specific enhancements are:

msybil is a Python script that calls out to multiple machines via SSH, issues queries on each, and stitches their results together. It’s being bundled with snorkel for multi-machine queries, but may be separated into its own package later.
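The fan-out-and-stitch pattern described here can be sketched roughly as follows. This is not msybil's code. It assumes each host prints a JSON object mapping group name to count, and that the caller supplies the remote command. The function names and the result format are hypothetical.

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_over_ssh(host, remote_command):
    """Run remote_command on host via the ssh client and return its stdout."""
    done = subprocess.run(
        ["ssh", host, remote_command],
        capture_output=True,
        text=True,
        check=True,  # raise if ssh or the remote command fails
    )
    return done.stdout


def merge_counts(results):
    """Stitch per-host results together by summing counts per group."""
    merged = {}
    for result in results:
        for group, count in result.items():
            merged[group] = merged.get(group, 0) + count
    return merged


def fan_out(hosts, remote_command, runner=run_over_ssh):
    """Issue the same query to every host in parallel and merge the output.

    Assumes (hypothetically) that each host prints a JSON object of
    {group: count}; a real result format would need its own merge logic.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        outputs = list(pool.map(lambda h: runner(h, remote_command), hosts))
    return merge_counts(json.loads(out) for out in outputs)
```

The `runner` parameter exists so the merge path can be exercised without real machines, for example by passing a function that returns canned JSON per host.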

Roadmap

Q? 2019

Q? 2019

Example Cluster Setups

early business / open source projects

small business

medium business

large business

TODO: determine the performance of high compute machine and cluster

Changelog

2019-04-15

2017-12-31

2017-11-29

2017-11-22