Keep the lights on. You’ll be responsible for the overall performance and reliability of the production systems powering our GraphQL insights tool, Apollo Engine. You’ll be working closely with a team of backend and product engineers to build in and improve fault tolerance, monitoring, and recovery plans for our core product and its most important features.

In this role, you’ll complement our backend devs by focusing on building and maintaining the architecture behind Engine (and making it awesome). You’ll take on a lot of responsibility quickly: you’ll be in charge of our uptime monitoring, pager rotation, backup strategies, etc. GraphQL is taking off in the industry, and you’ll help us build the scalable and reliable system we need to ingest huge volumes of data from our customers as GraphQL usage increases at their respective companies.

What you’ll do

Write and maintain docs that we can distribute to customers and sales prospects, answering the questions we regularly get about architecture, security, data retention, PII, internal policies, etc.
Design and implement a backup strategy that covers all of our critical data.
Own our pager rotation and on-call scheduling for production alerts and critical support tickets.
Make sure all our production systems have proper monitoring and alerting in place.
Contribute to new feature designs to make sure there’s a performance and reliability element to the technical plans.
Edit our Kotlin backend code to improve things like logging, monitoring, etc.
Maintain our Terraform configurations, Kubernetes files, and other deployment configuration tools and extend them to support new use cases, features, production environments, etc.
Design SLAs for Engine and its components and make sure our systems are built in line with them.
Design, implement, and simulate a complete disaster recovery plan.

About you

You know how to write code — you like not only finding problems, but fixing them too.
You’ve operated production systems at scale in the past and know what a world-class ops culture looks like.
You’re pragmatic — you know how to make tradeoffs between different design points that optimize for overall business goals, not just a technical result.
You’d be excited to teach the rest of the backend team how to do SRE-style work themselves.
You elevate the team around you.

You can do this work from our San Francisco headquarters, or anywhere else in the world. MDG is proud to be an equal opportunity workplace dedicated to pursuing and hiring a talented and diverse workforce.

Site Reliability Engineer

AI & Data
Anywhere
Posted March 4
Meteor
