Overview

Come join the SRE team at Stack Overflow! As one of the top 50 websites by traffic volume worldwide, we hit some unique challenges. Recently we’ve launched Stack Overflow for Enterprise and Stack Overflow for Teams, allowing organizations to have a private experience on the platform they already know and love. The success of these new products requires us to rethink our infrastructure strategy for supporting on-prem, cloud, and remote deployments.

We’re looking for someone with 3+ years of Windows Server experience; experience managing cloud resources is a plus. You’ll join our team of SREs and devs to keep driving and improving our systems automation efforts and to manage Windows-based services. We don’t expect you to know everything about all of the technologies we use, so you’ll work with other members of the team to learn and develop your skills.

As an SRE, you’ll bring a developer mindset to system administration, always looking for ways to automate manual work and create repeatable, scalable systems and processes. We are wiki-centric and prefer to document and automate in small increments as we work.

While we are a remote-first team with team members all over the world, this position has occasional datacenter work requirements, so you must live within 1 hour’s travel time of the Jersey City, NJ datacenter.

What you’ll do:

  • Maintain the services and infrastructure platform used by the Stack Overflow websites.
  • Help us scale traffic from 6,000 hits/sec to twice that next year.
  • Be part of our on-call rotation (approximately 1 week out of 5); we get paged rarely.
  • Be responsible for the maintenance and upkeep of our Jersey City datacenter infrastructure. Typically this means coordinating vendors and remote hands, but larger-scale projects sometimes require physical presence.
  • Act as a subject matter expert around our Windows infrastructure and automation.
  • Work iteratively to scope and deliver large projects.

Technologies you’ll work with:

  • Windows Server 2012 R2 and 2016; Linux CentOS 7 and Alpine
  • PowerShell / Go / Bash / Some C#
  • GitHub Enterprise, TeamCity (CI)
  • Puppet, some Ansible
  • HAProxy, Redis, Elasticsearch
  • Dell Servers and EqualLogic storage
  • Fortinet and Cisco Routers, ASAs, and Switches, HSRP / Keepalived / BGP
  • IIS, DFS, Multi-site AD, SQL Server 2017
  • Future: Containers and Kubernetes for both on-prem and cloud infrastructure

Some projects that we’ve recently completed or are working on:

  • Improving infrastructure automation around our Windows and Linux servers
  • Creating a secure replica of our infrastructure for storing private Q&A data
  • Reinventing how DNS is managed
  • Implementing autonomous OS upgrades for both Windows and Linux servers
  • Upgrading hardware with zero downtime across a variety of services
  • Improving how we monitor service internals
  • Migrating to a new CDN

Skills & Requirements

We’re looking for:

  • In-depth experience in Windows and comfortable working in Linux
  • Basic understanding of networking: the HTTP protocol, how load balancers work, IP addressing (we use HAProxy, Fastly/Varnish, and IIS)
  • Experience working hands-on with computer hardware
  • Experience with configuration management systems or Infrastructure as Code (we use Puppet and Ansible)
  • A track record of taking on challenges and delivering thorough, stable, and maintainable systems
  • Strong written communication skills and a strong inclination to “document as you go”

Not required, but please let us know if you have experience with:

  • Experience with Microsoft SQL Server administration and query tuning
  • Experience with Dell OME (or other firmware management system)
  • Experience with network device administration
  • Experience with TeamCity, Jenkins, OctoDeploy, or other CI systems
  • HBase system administration
  • Experience working in security, or in a SOC or PCI environment
  • Experience with Azure or other cloud environments
  • Experience with some of the other technologies we use: Elasticsearch, Redis, HAProxy, Puppet, VMware, TeamCity, DSC, IIS, and SSL cert management
  • Involvement with open source projects

When you apply… Please include an up-to-date resume. We also strongly encourage you to include a cover letter explaining why you’re interested in working at Stack Overflow.

What you’ll get in return:

  • Flexible hours
  • 20 days paid vacation + holidays
  • Completely free health insurance – no copay, no premiums
  • Generous parental leave (10-16 weeks at 100% pay), family care leave, and unlimited sick days
  • Employees will never be poked with a sharp stick

If you want to work in our office… You’ll get your own private office in our headquarters in New York City, and enjoy additional benefits like free lunch every day prepared by our own in-house chefs, transportation reimbursement, and all the espresso you can drink.

If you want to work remotely (within 1 hour travel time to Jersey City)… We’ll help you set up a great home office, with an ergonomic chair, standing desk, and any other equipment you need to do your job.