This Week in Glean: Your personal Glean data pipeline

Feb 25, 2022 · 1 minute read

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean.) All "This Week in Glean" blog posts are listed in the TWiG index (and on the Mozilla Data blog). This article is cross-posted on the Mozilla Data blog.

On February 11th, 2022 we hosted a Data Club Lightning Talk session. There I presented my small side project of setting up a minimal data pipeline & data storage for Glean.

The premise:

Can I build and run a small pipeline & data server to collect telemetry data from my own usage of tools?

To which I can now answer: Yes, it's possible. The complete ingestion server is a couple hundred lines of Rust code. It's able to receive pings conforming to the Glean ping schema, transform them and store it into an SQLite database. It's very robust, not crashing once on me (except when I created an infinite loop within it).

You can watch the lightning talk here:

Instead of creating some slides for the talk I created an interactive report. The full report can be read online.

Besides actually writing a small pipeline server this was also an experiment in trying out Irydium and Datasette to produce an interactive & live-updated data report.

Irydium is a set of tooling designed to allow people to create interactive documents using web technologies, started by wlach a while back. Datasette is an open source multi-tool for exploring and publishing data, created and maintained by simonw. Combining both makes for a nice experience, even though there's still some things that could be simplified.

My pipeline server is currently not open source. I might publish it as an example at a later point.