Site Reliability Engineer

About Axiom

At Axiom, our vision is to unlock insights from event data at any scale. We believe that unified access to all event data, all the time, makes organizations far more effective. That’s why we built Axiom: the first cloud-native data store designed specifically for timestamped events, providing live-streaming, search, reporting, and monitoring capabilities. Built from the ground up, Axiom delivers performance and scalability at a lower cost. Today, thousands of companies — from ambitious startups to the world’s largest enterprises — trust Axiom to make sense of event data for engineering, security, and operational use cases.

Our work so far is just the first step towards our vision. We’ve raised funds from top-tier investors including Crane Venture Partners, LocalGlobe, Fly VC, and Mango Capital, as well as leading angels such as former GitHub CEO Nat Friedman, Heroku co-founder Adam Wiggins, and Vercel CEO Guillermo Rauch. To forge ahead, we’re building a diverse group of kind and brilliant people who are the best at what they do. Join us!

About the Role

We are looking for a Site Reliability Engineer to join our team at Axiom. You will be pivotal in upholding our promise of superior reliability and performance to our customers. Collaborating with backend engineers and product teams, you will focus on building and operating scalable and reliable systems. At Axiom, SRE work centers on automating, measuring, and continuously improving the reliability and efficiency of our systems.

Your primary responsibilities:

Engineer and maintain a robust, secure, and scalable infrastructure for Axiom Cloud.
Collaborate with engineering teams to define and refine service level objectives.
Contribute to disaster recovery planning, capacity engineering, performance analysis, and system tuning.
Foster best practices for code deployments, aiding in the education of the broader development team.
Roll out tooling and solutions that improve system reliability and reduce manual toil.
Address and remediate service incidents and contribute to postmortems and root cause analyses.
Foster a culture of monitoring, alerting, and observability across the organization.

You are an ideal candidate if:

You have over two years of experience in a reliability-focused engineering environment.
You are passionate about system reliability, latency, performance, and efficiency.
You're familiar with AWS tools and technologies.
You have hands-on experience with Docker, Kubernetes, and Amazon EKS.
You understand infrastructure-as-code tools such as Terraform or Pulumi.
You possess strong networking knowledge and are adept with Linux systems.
You're familiar with CI platforms such as GitHub Actions, GitLab, or CircleCI.
You have experience with monitoring, alerting, and observability tools.

Bonus skills and experiences:

Proven track record of maintaining production systems at scale.
A software engineering background with expertise in Golang.

We provide:

Competitive salary and equity package.
Flexibility to work from wherever suits you best. For this role, we are considering individuals based in the timezone range UTC-10 to UTC-4 or UTC+6 to UTC+12.
Budget to build your home office set-up.
Monthly budget to support mental and physical wellness.
Uncapped, continued education stipend.
Annual offsite with your teammates from around the world.
A focus day each week with no meetings, Slack or Zoom. Uninterrupted time to focus on work and run errands.
Uncapped vacation to unplug and rejuvenate.
Generous and flexible family leave for everyone.

Remote

Australia