As an SRE (or whatever they're calling Ops here), this blog mostly left me thinking "please hire an Ops principal." That has nothing to do with Elixir.
We (Ops-type people) already have a well-developed system for gathering metrics: the Prometheus stack. Instead of integrating with that system, OpsMaru decided it doesn't work and went with their own custom one. The code you're showing builds up CPU metrics that a simple PromQL query handles easily, and your code assumes 15-second scrapes, so if we need higher resolution temporarily, well, sucks to be your customer. Also, if you supported Remote Write, you could Remote Write back to a customer if they wanted it. Hell, you could have built a system where we don't need to run Prometheus locally at all, since you'd scrape everything and send it back to us.
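For example, a rate()-based PromQL query stays correct regardless of the scrape interval, since it normalizes over the query window rather than assuming a fixed 15s cadence (metric name below assumes the standard node_exporter CPU metric):

```promql
# Per-instance CPU utilization in percent. rate() divides by elapsed time
# inside the [5m] window, so 5s and 15s scrapes both produce correct results.
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
```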
Also, you're already running "my company's code", so it may well be emitting Prometheus metrics, which means I'm probably running Prometheus already to monitor my own code. However, if I wanted to keep an eye on OpsMaru Uplink, I can't, because Uplink doesn't appear to have a metrics endpoint I can monitor. Maybe your customers are too small to have Ops people, but if they do have them, those people are now blind.
I'd like a blog article explaining all the options you tested and the pitfalls you ran into before settling on this.
Thank you for the feedback; you have a valid point about the /metrics endpoint. We're planning to provide one in the future. I mentioned this in another reply as well.
This isn't a custom system at all. We're simply removing the need to install / configure / manage another external package by implementing a data shipper inside Uplink using Elixir Broadway. The end goal is still that Ops / SREs can use their existing favorite monitoring pipeline, whether that's Grafana / Prometheus / Loki or the Elastic / OpenSearch stack. There are several advantages: it means fewer things to install / maintain / patch / secure, as mentioned in the post. We believe doing things this way leads to a more robust and secure system in the long term.
As for the 15-second scrape interval, we can tune that and offer it as an option for customers as well. For now, for the MVP, we're shipping data to the Elastic stack, and certain decisions were made to simplify and reduce the amount of work needed to get the product to an MVP.
We can provide the /metrics endpoint in the future, it's just a matter of time and priorities.
There are reasons we're shipping data into Elastic that will become clearer once things mature a little more. There are things Elastic can do that we need at a base level for our internal product plans; there will be a follow-up post about this later.
We'll publish more blog posts giving you details on the decisions we've made and why we made them. Always happy to read feedback.
I await the blog article. I just think throwing out the Prometheus stack was a terrible idea. Storing metrics in Elastic (which I've done, and it always ends in tears) is your call. My concern is deferring Prometheus compatibility until the last second.
If I'm a customer and say, "Hey, my applications emit Prometheus metrics, how do I scrape them?", what is your recommendation on the platform?
We're not throwing out the Prometheus stack, that's for sure.
I've used Elastic for a long time; I understand its capabilities and limitations, I'm familiar with its querying capabilities, and things have always played out nicely with Elastic for me. I guess it's all about which tools you're familiar with.
If you're a paying customer and want Prometheus metrics today, we'd show you how to access the already-available LXD /metrics endpoint securely. You can learn more about this here: https://documentation.ubuntu.com/lxd/en/latest/metrics/. With Opsmaru you get full access to the underlying LXD cluster by default, out of the box.
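As a rough sketch (hostnames and file paths are placeholders; see the LXD docs linked above for the full setup, including registering a client certificate with `--type=metrics`), a Prometheus scrape job against the LXD endpoint looks approximately like:

```yaml
scrape_configs:
  - job_name: lxd
    metrics_path: /1.0/metrics
    scheme: https
    static_configs:
      - targets: ["lxd.example.com:8443"]  # placeholder cluster address
    tls_config:
      # Client certificate previously added to LXD's trust store as a
      # metrics-type certificate.
      cert_file: /etc/prometheus/tls/metrics.crt
      key_file: /etc/prometheus/tls/metrics.key
      # Prefer pinning the server cert via ca_file in production.
      insecure_skip_verify: true
```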
Ideally we'd build an Uplink version of the /metrics endpoint for a smoother, more automated and seamless integration, but the above will work perfectly well if you want it today. The Uplink /metrics endpoint would also expose application-specific metrics not available from the LXD endpoint; LXD only provides infrastructure metrics, not application metrics.
In the future we want to integrate APM by default. Since we handle the routing for the apps, we can bring together metrics like response times and SIEM data, and provide more than infrastructure-level metrics.
This is a pretty neat product. It seems like the purpose is to enable deploying and selling self-hosted instances?
Thank you! Yes, that's the main focus moving forward. Parts of it are still being built out. Essentially, we want to enable an App Store-like experience for web applications: open-source developers should be able to monetize their applications by selling instances of their app to people who are non-technical but need them. Developers get paid via Stripe Connect.
Neato. You might hear from me in the near future. I'm working on a self-hosted CRM/Email Marketing/Drip type of app with Phoenix LiveView, and I'm assuming you handle Elixir as a first-class citizen based on your blog post. I also have some hybrid Astro apps I'm considering productizing.
Happy to help! Yes, we treat Elixir / Phoenix as first-class citizens; the whole thing is built in Elixir / Phoenix. We can also handle Astro apps: our docs are built in Astro and hosted on Opsmaru.
LXD seems like an unusual choice when Kubernetes already has cAdvisor and strong monitoring integrations. Avoiding extra agents is nice, but does this really scale better than existing solutions like Prometheus and OpenTelemetry?
What’s the advantage here beyond keeping things lightweight? Feels like this could hit limitations as complexity grows.
I chose LXD for several reasons. There's much less overhead when it comes to managing an LXD cluster:
- It's more vertically integrated: for example, cross-node networking is built in, so you get it out of the box.
- It supports stateful workloads out of the box with no fuss. Running DBs, taking snapshots, deletion protection, etc. is very simple with LXD.
- LXD supports running Docker inside containers, which means in the future we'll be enabling Docker containers. Each LXD container can be treated as a 'pod' that runs multiple Docker containers inside, but it's still just a simple system container you can treat like a VM.
- Working with GPUs is simple and straightforward. This will be key as we start to enable AI workloads.
- LXD doesn't require a master node, which means every instance I provision can run my workload. It also supports redundancy as the cluster grows, because it handles distribution through Raft. In terms of overhead, it's much lower than K8s.
- Overall, LXD feels like a batteries-included container hypervisor.
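The Docker-in-LXD point above comes down to one container flag; as an illustrative sketch (requires a live LXD host, image alias may differ):

```sh
# Launch a system container with nesting enabled so Docker can run inside it
lxc launch ubuntu:24.04 pod1 -c security.nesting=true

# Install and use Docker inside the container as if it were a VM
lxc exec pod1 -- sh -c "apt-get update && apt-get install -y docker.io"
lxc exec pod1 -- docker run --rm hello-world
```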
This solution doesn't replace things like Prometheus. In fact, LXD has native Prometheus support, and we could extend our solution to push data to a Prometheus instance or expose a /metrics endpoint for Prometheus to consume.
For our MVP we chose Elastic, but it will be easy to extend to support Prometheus as well. We ship data in OpenTelemetry format; OpenTelemetry is a specification, and when we ship data we keep it as close to what OpenTelemetry does as possible. Elastic's observability stack supports this out of the box.
All this solution does is query the underlying infrastructure metrics and ship them to a destination. The only scaling it needs to handle is shipping the data and applying back-pressure in case the destination can't handle the load. Broadway does this out of the box.
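This isn't our actual shipper, but a minimal Broadway sketch shows the shape: a demand-driven producer polls metrics, and batchers flush to the destination, so back-pressure falls out of the pipeline for free (module names like Uplink.MetricsProducer are hypothetical):

```elixir
defmodule Uplink.MetricsShipper do
  use Broadway

  alias Broadway.Message

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        # Hypothetical GenStage producer that polls the LXD metrics API;
        # Broadway only sends it demand it can actually process downstream.
        module: {Uplink.MetricsProducer, poll_interval: :timer.seconds(15)},
        concurrency: 1
      ],
      processors: [default: [concurrency: 2]],
      batchers: [
        elastic: [batch_size: 100, batch_timeout: 5_000, concurrency: 2]
      ]
    )
  end

  @impl true
  def handle_message(_processor, %Message{} = message, _context) do
    # Route every metric sample to the Elastic batcher.
    Message.put_batcher(message, :elastic)
  end

  @impl true
  def handle_batch(:elastic, messages, _batch_info, _context) do
    # Bulk-index the batch here; a slow destination slows demand upstream.
    messages
  end
end
```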
> For our MVP we just chose Elastic
Honest question: why Elastic over OpenSearch?
This was not an easy decision. I believe both are great products and you wouldn't go wrong either way. I was on the fence for a long time before making the decision.
I think it's a combination of a lot of things. I've been a long-time Elasticsearch user; I think I've used it since version 0.17.
Elastic just seems to have a lot more built in, and they seem to be the leader on this front when it comes to innovation. They started the whole project and are leading when it comes to new things being built in. Their business and survival depend on building the best search product in whatever circumstances they find themselves in.
OpenSearch is funded by Amazon, and it's not their sole focus.
Things just feel more polished and better integrated than OpenSearch: the Kibana UI, the Observability stack, and their AI search features, some of which are unique to Elasticsearch. I'm sure there's more yet to be uncovered.
Taking a long-term view, Elastic's offering just seems to fit our product requirements better.
That said, what we've implemented in this blog post would also work with OpenSearch. In the future we'll enable customers to bring their own time-series DB, and that would work with OpenSearch too.
Great answer, thanks.
Hey there! Founder of Opsmaru here. I didn't expect the post to reach the front page after it didn't get upvoted when I first posted it. Happy to answer any questions about the product and this post!
Do you also support Incus (the LXD fork)?
We were planning on supporting Incus. However, Incus dropped support for Fan networking in favor of OVN for cross-node networking.
OVN is very heavy and requires a lot of management when it comes to provisioning and maintenance, so for now we didn't want to go there just yet.
We're sticking with LXD; it's been receiving a lot of updates from Canonical, and the team is responsive on the forum and has been a pleasure to work with.
Once we have some breathing room, we definitely want to explore Incus and see what networking options are out there.
Maybe we'll just adopt WireGuard and make it work out of the box with Incus in a future iteration.