Moving at our pace brings a lot of change, complexity, and ambiguity—and a little bit of chaos. Shopifolk thrive on that and are comfortable being uncomfortable. That means Shopify is not the right place for everyone.
Before you apply, consider if you can:
- Care deeply about what you do and about making commerce better for everyone
- Excel by seeking professional and personal hypergrowth
- Keep up with an unrelenting pace (the week, not the quarter)
- Be resilient and resourceful in the face of ambiguity and thrive on (rather than endure) change
- Bring critical thought and opinion
- Embrace differences and disagreement to get shit done and move forward
- Work digital-first for your daily work
About the role
At Shopify, we ship on quality instead of time. Our teams deploy new code many times a day, and our production scale is massive. Shopify has many critical components, and sometimes they fail. That's where the Site Reliability team comes in, ensuring we can get back to green as fast as possible. Site Reliability sets the foundation for building and running resilient systems at Shopify. The team is made up of engineers with a foundation in software development and in-depth operational knowledge of the entire Shopify stack, who act as first responders and leaders during incidents.
Our job is to resolve incidents as quickly as possible, and guide teams to build a more resilient Shopify. We build and improve tools as necessary to satisfy this requirement, and constantly seek out ways to automate away manual toil.
Commerce happens 24/7, so we have built a globally distributed team that can respond whenever necessary. Our team hires across four regions (Asia-Pacific, North America West, North America East, and EMEA) in a follow-the-sun model that provides 24/7 coverage for incident management.
We welcome remote candidates based in the Asia-Pacific region.
What’s in it for you
- Help Shopify run its planet-scale systems by enabling our engineering teams to create resilient systems.
- Build and improve tools to keep our platform resilient and performant.
- Help define what Resiliency and Site Reliability Engineering mean for Shopify.
- Directly impact production systems underpinning commerce for millions of merchants, who generate revenue for their livelihood, their families, and their employees, through the businesses they’ve built on our platform.
- Possibility of relocation to a region the team operates within.
What you'll do
- Respond to automated alerts and execute playbooks.
- Manage ongoing incidents, using your understanding of Shopify to involve the right teams and resolve them as quickly as possible.
- Identify gaps in our processes and build or improve tools to support incident management.
- Develop production tooling and services to improve our platform’s resilience.
- Clean up the noise in our signals so we can understand our platform and debug problems more efficiently.
- Set standards with engineering teams across the company for building resilient, performant systems.
- Ensure we never fail for the same reason twice.
- Follow up on incidents to ensure retrospective learning takes place, and appropriate action items are created and prioritized.