About the role
This is a new role — one that doesn’t yet exist at Wealthsimple — and it’s a meaningful one. As a Staff Software Developer on Production Engineering, you’ll bring senior technical leadership to the work of making Wealthsimple more reliable at scale. You’ll work across platform and product teams, identify the highest-leverage reliability problems, and build solutions that don’t just fix the immediate issue but raise the floor for everyone. This isn’t a role where you sit in one corner of the codebase. It’s a role where you shape how engineering gets done across the company.
What you’ll do
- Improve the platform to prevent incidents — designing and driving adoption of guardrails, sensible defaults, and engineering standards that reduce the likelihood of failures across services
- Build tooling that reduces time to mitigation when incidents occur, including contributing to our in-house product for AI-assisted incident response
- Own the investigation and follow-through on load test findings — translating results into concrete reliability improvements across critical flows
- Work across platform and product engineering teams as a technical influencer — participating in architecture and readiness reviews, coaching service owners, and driving adoption of scalable reliability practices
- Identify recurring failure patterns and design platform-level fixes that prevent them from showing up again in a different service
- Contribute to the team’s reliability syncs with product engineering, helping align on incident themes, critical-flow risks, and the next highest-leverage initiatives
Skills you bring
- 8+ years of software engineering experience, with significant time in platform, infrastructure, or SRE work
- Demonstrated track record of improving reliability at scale — reducing incidents, building guardrails, or driving operational standards across multiple teams
- Strong proficiency in backend systems and distributed architecture; you can diagnose complex failure modes across a service mesh
- Experience with load testing and capacity planning, and the ability to translate findings into concrete engineering improvements
- Proven ability to work across engineering teams as a technical influencer — driving adoption of standards and practices without direct authority
- Familiarity with Kubernetes, Helm, Argo, and other modern deployment tooling
- Strong written and verbal communication — comfortable presenting findings and recommendations to both engineering teams and senior leadership
Who you are
- You think in systems — you’re not looking for the fix, you’re looking for what caused the problem and how to make sure it doesn’t happen elsewhere
- You’re comfortable working without direct authority; you build credibility through the quality of your thinking and the clarity of your recommendations
- You hold a high bar for operational excellence without leaving others to catch up on their own; you bring people along
- You’re energised by ambiguity, not slowed down by it; you know how to prioritise when everything feels urgent
- You’re curious about where AI-assisted tooling is headed in reliability engineering, and you want to help shape how we use it — not just observe it from a distance