
As sybil nears its two-year mark, one of the most requested features is distributed queries: a way of scaling analysis with the number of machines and cores. This page documents progress towards that goal.

Overview

Goal

Add a distributed mode to sybil that improves its query performance.

About Sybil

Sybil is a binary that can ingest and query JSON or CSV data, specifically meant for ongoing ingestion and analysis of instrumentation. Sybil is serverless, meaning that it runs without a server; instead, ingestion and queries are run as invocations of a binary.

As an aggregation engine, sybil runs map-reduce-style queries on mostly immutable blocks of data stored in columnar form. Because sybil is built around these map-reduce aggregations, moving from single-node to multi-node aggregation should be straightforward.
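To illustrate why this style of aggregation extends naturally across machines, here is a minimal sketch in Python. It is not sybil's implementation: the function names, the row layout, and the (count, sum) partial format are all assumptions made for the example. The point is only that partial aggregates merge the same way whether they come from blocks on one machine or from separate nodes.

```python
from collections import defaultdict


def map_block(rows, group_key, value_key):
    """Map step: aggregate one block of rows into {group: (count, sum)}."""
    partial = defaultdict(lambda: [0, 0.0])
    for row in rows:
        acc = partial[row[group_key]]
        acc[0] += 1
        acc[1] += row[value_key]
    return {group: tuple(acc) for group, acc in partial.items()}


def reduce_partials(partials):
    """Reduce step: merge partial (count, sum) pairs and derive averages.

    The inputs can come from blocks on one machine or from many machines;
    the merge logic does not change.
    """
    merged = defaultdict(lambda: [0, 0.0])
    for partial in partials:
        for group, (count, total) in partial.items():
            merged[group][0] += count
            merged[group][1] += total
    return {g: {"count": c, "avg": t / c} for g, (c, t) in merged.items()}


# Hypothetical sample data: two blocks of instrumentation rows.
block_a = [{"host": "web1", "latency": 10}, {"host": "web2", "latency": 30}]
block_b = [{"host": "web1", "latency": 20}]

result = reduce_partials([
    map_block(block_a, "host", "latency"),
    map_block(block_b, "host", "latency"),
])
```

Carrying (count, sum) rather than a finished average is what makes the merge exact: averages of averages would be wrong when blocks differ in size.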

There are more specific implementation notes for sybil here.

In terms of performance, each core should be able to aggregate 1 - 2mm rows per second. With a 4 CPU machine, we can run simple queries on 10mm rows in about 1s; with 16 CPUs, we should be able to query 40mm rows in 1 - 3s (plus time to combine results).
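As a back-of-the-envelope check on those figures, the estimate can be written out in a few lines of Python. This is only a rough sketch: the function names are made up for the example, it assumes perfectly linear scaling across cores, and it ignores the time needed to combine results.

```python
def estimate_scan_seconds(rows, cores, rows_per_core_per_sec):
    """Seconds to aggregate `rows`, assuming perfect scaling across cores."""
    return rows / (cores * rows_per_core_per_sec)


def estimate_range(rows, cores, low_rate=1_000_000, high_rate=2_000_000):
    """Return (best_case, worst_case) seconds for the 1 - 2mm rows/s/core range."""
    best = estimate_scan_seconds(rows, cores, high_rate)
    worst = estimate_scan_seconds(rows, cores, low_rate)
    return best, worst


# 40mm rows on a 16 CPU machine.
best, worst = estimate_range(40_000_000, 16)
print(f"{best:.2f}s - {worst:.2f}s")
```

For 40mm rows on 16 cores this gives a range of 1.25s to 2.5s, which sits inside the 1 - 3s estimate above before any combine time is added.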

See the wiki page on sybil performance for more information and estimates.

Status

Progress

November 2017

Leaf Node

November has been spent preparing the sybil binary for usage in a distributed scenario. In order to get there, several features and necessary adjustments have been added, with the goal of creating a leaf that can work with high cardinality (1mm+ unique values in a column) data. Specific enhancements are:

msybil is a Python script that connects to multiple machines via SSH, issues queries, and stitches their results together. It’s being bundled with snorkel for multi-machine queries, but may be separated into its own package later.
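The general shape of such a script can be sketched as follows. This is not msybil's code: the function names, the remote command string, and the JSON result shape (`{group: {"count": n, "sum": s}}`) are all assumptions for illustration. The only external dependency is the standard `ssh` client on the machine issuing the query.

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_over_ssh(host, remote_cmd):
    """Run remote_cmd on host through the ssh client and return its stdout."""
    done = subprocess.run(
        ["ssh", host, remote_cmd],
        capture_output=True, text=True, check=True, timeout=60,
    )
    return done.stdout


def fan_out(hosts, remote_cmd, run=run_over_ssh):
    """Issue the same query to every host in parallel and parse each JSON reply.

    `run` is injectable so the fan-out can be exercised without real SSH.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        outputs = list(pool.map(lambda host: run(host, remote_cmd), hosts))
    return [json.loads(out) for out in outputs]


def stitch(partials):
    """Merge per-host results of the assumed shape into one combined result."""
    merged = {}
    for partial in partials:
        for group, agg in partial.items():
            slot = merged.setdefault(group, {"count": 0, "sum": 0})
            slot["count"] += agg["count"]
            slot["sum"] += agg["sum"]
    return merged


# Usage (hypothetical host names and placeholder command):
#   combined = stitch(fan_out(["leaf1", "leaf2"], "<query command emitting JSON>"))
```

Each leaf does its own aggregation and returns only a small partial result, so the machine issuing the query merges summaries rather than raw rows.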

Roadmap

Q4 2017

Q1 2018

Q2 2018

Cluster Management

Orchestration

Ingestion & Storage

Query

Alerting

Documentation

Example Cluster Setups

early business / open source projects

small business

medium business

large business

TODO: determine the performance of a high-compute machine and cluster

Changelog

2017-11-29

2017-11-22