This post details the transition of snorkel from nodejs to python.
Snorkel started in 2013 as a UI for aggregating schemaless instrumentation. At the time, mongo was chosen to hold data since it was schemaless and had an aggregation pipeline. Many design decisions were made around using mongo (and later postgres) before I started working on Sybil in 2016.
I built Sybil to overcome some of the limitations of mongo: Sybil is faster than mongo and uses less disk space because it stores immutable data and uses columnar storage. In addition, the Sybil query model is parallelized vs. the single thread used in mongo. For these reasons, a Sybil dataset can take 5-10x (or more) less space while still outperforming mongo’s aggregation framework.
After several years, Sybil has become my primary datastore and preferred way of storing instrumentation, but Snorkel still had baggage from mongo and nodejs.
PS: Read more about the history of Snorkel and Sybil here
There are several reasons to move to python, but the main ones are:
- less mongo baggage: mongo was originally used to store data, config and sessions
- the nodejs ecosystem has too much churn for me
- security vulnerabilities in nodejs packages
- python is an easier language for backend devs to pick up and contribute to
- deploying and distributing python apps is more stable than node apps
- it’s easier to self-host python apps
- I want snorkel to last for another 10 years, and I’m not confident that a nodejs app built today would be installable in 10 years
The python app (snorkel-lite) is built on top of flask with the intent of minimizing external dependencies. To keep things simple, saved queries and session data are stored in SQLite instead of leveldb or mongo.
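To give a feel for why SQLite keeps things simple, here is a minimal sketch of storing and loading a saved query using only python's standard library. The table name, columns and helper functions (`open_store`, `save_query`, `load_query`) are hypothetical and made up for illustration; they are not snorkel-lite's actual schema or API.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the SQLite file and make sure the table exists."""
    conn = sqlite3.connect(path)
    # Hypothetical schema: one row per saved query, params kept as JSON text.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS saved_queries ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " dataset TEXT NOT NULL,"
        " params TEXT NOT NULL)"
    )
    return conn


def save_query(conn, dataset, params):
    """Insert a saved query and return its row id."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO saved_queries (dataset, params) VALUES (?, ?)",
            (dataset, json.dumps(params)),
        )
    return cur.lastrowid


def load_query(conn, query_id):
    """Return the saved query as a dict, or None if the id is unknown."""
    row = conn.execute(
        "SELECT dataset, params FROM saved_queries WHERE id = ?",
        (query_id,),
    ).fetchone()
    if row is None:
        return None
    return {"dataset": row[0], "params": json.loads(row[1])}
```

Since `sqlite3` ships with python, a store like this adds no external dependency and no separate server process; the whole thing lives in a single file on disk.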
Advantages of nodejs:
- it’s easier to use sockets in nodejs than in python
- code can be shared between server and client
- asynchronous execution model
Advantages of python:
- the ecosystem is more mature
- it’s simpler to follow and understand code
- classes! ES6 does have classes and modules, but they are more natural in python
- it’s easy to publish packages to PyPI (for installing with pip) and plugins are easier to write
One of the initial goals of snorkel-lite was to re-use as much code as possible from Snorkel while simplifying and removing extra code that was unnecessary. To do so, I built a component framework for flask that supported the old Snorkel components. Luckily, this worked out pretty well: the components and views from Snorkel.js were usable in snorkel-lite with minimal effort.
There was an initial burst of activity in September 2018 where I ported the main views of Snorkel to python: table, time, dist and samples as a proof of concept. By November 2018, I was using snorkel-lite full-time on my local data.
Between Dec 2018 and January 2019, the Alternative and Advanced views were ported over. It was relatively easy to port them. Google auth and RBAC controls were added in January 2019.
In Feb 2019, I transitioned all my servers to snorkel-lite and pointed my grafana dashboards at my snorkel-lite instances. Additionally, the snorkel-lite package was built and released to PyPI.
In March 2019, I started redirecting requests from snorkel.logv.org to slite.logv.org. The snorkel-lite package is now bundled with sybil and several helper binaries for easily ingesting and querying from the CLI.
Some remaining work to be done for snorkel-lite is:
- continue adding polish and refining the interactions
- add RSS feeds
- create better dataset presenter configs
- port Map view over
- write UI tests
Despite the large amount of work left, I’m confident in snorkel-lite’s codebase and usefulness, especially the getting started portion.
- Initial write up
- Timeline of work
- Background and motivation section
- Future work