Overview

Come join the SRE team at Stack Overflow! As one of the top 50 websites by traffic volume worldwide, we hit some unique challenges. Recently we’ve launched Stack Overflow for Enterprise and Stack Overflow for Teams, allowing organizations to have a private experience on the platform they already know and love. The success of these new products requires us to rethink our infrastructure strategy for supporting on-prem, cloud, and remote deployments.

We’re looking for someone with 3+ years of Linux administration experience; experience with containerization and managing cloud resources is a plus. You’ll join our team of SREs and devs to keep driving and improving our systems automation efforts and to manage Linux and container-based services. We don’t expect you to know everything about all of the technologies we use, so you’ll work with other members of the team to learn and develop your skills.

As an SRE, you’ll bring a developer mindset to system administration, always looking for ways to automate manual work and create repeatable, scalable systems and processes. We are wiki-centric and prefer to document and automate in small increments as we work.

While we are a remote-first team with members all over the world, this position involves occasional datacenter work, so you must live near our Denver, CO datacenter. You’ll work primarily from home, going into the datacenter only a few times per month.

What you’ll do:

  • Maintain the services and infrastructure platform used by the Stack Overflow websites.
  • Help us scale traffic from 6,000 hits/sec to twice that next year
  • Be part of our on-call rotation (approximately 1 week out of 5); we get paged rarely
  • Be responsible for the maintenance and upkeep of our Denver datacenter infrastructure — typically this means coordinating vendors and remote hands, but sometimes requires physical presence for larger-scale projects
  • Act as a subject matter expert around our Linux infrastructure and automation.
  • Work iteratively to scope and deliver large projects

Technologies you’ll work with:

  • Linux (CentOS 7 and Alpine)
  • Kubernetes (cluster administration and containerizing applications)
  • Go / Bash
  • Some Windows Server 2012 R2 and 2016, PowerShell and C#
  • GitHub Enterprise, TeamCity (CI)
  • Puppet, some Ansible
  • HAProxy, Redis, Elasticsearch
  • Dell Servers and EqualLogic storage
  • Fortinet and Cisco Routers, ASAs, and Switches, HSRP / Keepalived / BGP
  • IIS, DFS, Multi-site AD, SQL Server 2017

Some projects that we’ve recently completed or are working on:

  • Improving infrastructure automation around our Windows and Linux servers
  • Creating a secure replica of our infrastructure for storing private Q&A data
  • Reinventing how DNS is managed
  • Implementing autonomous OS upgrades for both Windows and Linux servers
  • Upgrading hardware with zero downtime across a variety of services
  • Improving how we monitor service internals
  • Migrating to a new CDN

Skills & Requirements

We’re looking for:

  • In-depth experience in Linux (and comfortable working with Windows)
  • Basic understanding of networking: the HTTP protocol, how load balancers work, IP addressing (we use HAProxy, Fastly/Varnish, and IIS)
  • Experience working hands-on with computer hardware
  • Experience with configuration management systems or Infrastructure as Code (we use Puppet and Ansible)
  • A track record of taking on challenges and delivering thorough, stable, and maintainable systems
  • Strong written communication skills and a strong inclination to “document as you go”

Not required, but please let us know if you have experience with:

  • Dell OME (or another firmware management system)
  • Network device administration
  • TeamCity, Jenkins, OctoDeploy, or other CI systems
  • HBase system administration
  • Security, including work in a SOC or PCI environment
  • Azure or other cloud environments
  • Some of the other technologies we use: Elasticsearch, Redis, HAProxy, Puppet, VMware, TeamCity, DSC, IIS, and SSL cert management
  • Involvement with open source projects

When you apply… Please include an up-to-date resume. We also strongly encourage you to include a cover letter explaining why you’re interested in working at Stack Overflow.

What you’ll get in return:

  • Flexible hours
  • 20 days paid vacation + holidays
  • Completely free health insurance – no copay, no premiums
  • Generous parental leave (10-16 weeks at 100% pay), family care leave, and unlimited sick days
  • Employees will never be poked with a sharp stick

When you work remotely (within 1 hour travel time to Denver, CO)… We’ll help you set up a great home office, with an ergonomic chair, standing desk, and any other equipment you need to do your job.