Migrating Shift RMM to Containers


We have been working on a branch of shift-rmm that will change how it is deployed from manually downloading binaries from GitHub to using containers. Before we merge the changes and document how to migrate an existing install, we feel it would be good to explain why we are changing how Shift RMM is deployed. Let’s begin with some simple reasons and then dive into how the product works as well as the vision for the future of Shift RMM.

We Need More Reps with Containers

In order to better support and serve our customers as well as grow as engineers, we need to gain more experience with containers. Having more experience with containers will help us build better automation and observability tools which will lead to improvements for projects like our Docker dashboard.

Tracking and Bumping Version Numbers Manually is Tedious

We currently follow the RSS feeds for VictoriaMetrics, Loki, and Grafana to track updates. After reading the release notes and the upgrade guides (which usually don’t require any changes on our end), someone has to edit a variable in the Ansible role and submit that change to GitLab. It is not difficult, but it is a manual task, and it can delay cool new upstream features like VictoriaMetrics’ new Cardinality Explorer and Loki’s new query sharding from reaching customers. We will still follow these feeds, but it will be nice to follow a tag in a Docker Compose file instead of having to rush to a computer to change one number and push it to Git every time there is an update.
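As a rough illustration of what following a tag could look like, here is a minimal Compose sketch. It is not the actual shift-rmm Compose file: the service names and the use of the latest tag are assumptions for illustration, and a real deployment may prefer a more specific version tag.

```yaml
# Hypothetical docker-compose.yml fragment (service names and tags are assumptions).
services:
  victoriametrics:
    # Following a tag means a pull picks up whatever upstream publishes under it.
    image: victoriametrics/victoria-metrics:latest
  loki:
    image: grafana/loki:latest
  grafana:
    image: grafana/grafana:latest
```

With a file like this, an upgrade becomes `docker compose pull` followed by `docker compose up -d` rather than an edit to an Ansible variable.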

Upstream Projects Want Us To

In investing, there is a saying, “don’t fight the Fed,” which means that you should listen to the news coming out of the Federal Reserve and invest accordingly. We think this saying can be tweaked for software to say “don’t fight the vendor or upstream projects.” It is not a secret that Shift RMM is gluing together several tools from vendors like VictoriaMetrics and Grafana Labs.

We have noticed that the newer a tool from Grafana Labs is, the more focused on containers it is. Grafana Labs’ first tool, Grafana, still has deb and rpm packages as well as repositories, so you can easily install it on any VM. No containers needed, so what’s the problem? Newer tools from Grafana Labs are cloud native first and therefore may not have a non-container option. Loki does have documentation for running as a single binary, but the example config file stores everything in /tmp, which means that if you copy/paste this config, you would lose your logs after every reboot. Changing this to /opt/loki the way shift-rmm does fixes the issue, but we are taking this as a hint that they would like to see things move away from local storage on a VM or bare metal to object storage in a container.

Mimir, Grafana Labs’ new metric storage tool, is currently not used in shift-rmm, but we are following it closely since it is made by Grafana Labs. Mimir might gain support for Telegraf and may have Grafana-specific benefits over a vanilla Prometheus data source like VictoriaMetrics. Mimir’s single binary option runs every component in a single process, so its components cannot be scaled independently and it will not scale very well. We are taking this as another indication of a cloud native future.
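To make the Loki /tmp point concrete, here is a minimal sketch of the storage section with the paths moved under /opt/loki. It assumes the layout of the `common` block used in recent upstream local example configs; the actual shift-rmm config may be organized differently.

```yaml
# Sketch of a Loki storage section (assumes the upstream local example layout).
common:
  # The upstream example points these paths at /tmp/loki,
  # which is typically cleared on reboot.
  path_prefix: /opt/loki
  storage:
    filesystem:
      chunks_directory: /opt/loki/chunks
      rules_directory: /opt/loki/rules
```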

New Toys to Play With

Grafana OnCall

Grafana Labs recently open-sourced Grafana OnCall, which is a Grafana plugin that enables complex alerting workflows such as calendar-based on-call rotations and escalation paths. This tool was originally purchased from Amixr Inc. and was written as a Python web app. Golang-based programs, like Grafana, compile down to a single binary, whereas Python web apps usually require multiple components to be installed and stitched together. Grafana’s docs only cover Kubernetes, Docker Compose, and local dev mode for OnCall. If we wanted to avoid containers, we would end up fighting the upstream vendor and spending a significant amount of time reverse engineering and automating the Docker Compose setup, only to take on technical debt that we would then have to keep up to date and in good working order. Alternatively, we can follow the vendor docs, which would be better supported, better maintained, and easier to deliver to users.

Traefik and Crowdsec

We are aware that both of these tools are written in Golang, so a manual install isn’t terrible to automate, especially since Crowdsec has deb and rpm packages. We also already have an Ansible role written to deploy them. However, the docs for the Traefik Crowdsec Bouncer list Docker and Docker Compose as dependencies, so once again, avoiding containers would involve going against the vendor’s recommendations. The reason for switching to Traefik instead of sticking with NGINX is that Traefik has better built-in open-source observability: it has both InfluxDB and Prometheus metrics that can be added in a few lines of yaml, whereas with NGINX we would have to write custom log formats, pay for the better metrics, or build them ourselves. Traefik also offers things like OpenID and LDAP-based auth that are only available in NGINX Plus. Features such as ModSecurity WAF would require compiling NGINX from source. A full list of Traefik’s plugins can be found in the Traefik Plugin Catalog.
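To illustrate the "few lines of yaml" for metrics mentioned above, here is a minimal sketch of a Traefik static configuration that exposes Prometheus metrics on a dedicated entry point. The entry point name and port are assumptions for illustration, not values taken from shift-rmm.

```yaml
# Sketch of a Traefik static config fragment (entry point name and port are assumptions).
entryPoints:
  metrics:
    address: ":8082"

metrics:
  prometheus:
    # Serve the metrics on the dedicated entry point defined above.
    entryPoint: metrics
```

Any Prometheus-compatible scraper, VictoriaMetrics included, could then be pointed at the /metrics path on that port.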