Principal Software Engineer - ML Platform Engineer

Riot Games was established in 2006 by entrepreneurial gamers who believe that player-focused game development can result in great games. In 2009, Riot released its debut title, League of Legends, to critical and player acclaim. It is the most played PC game in the world, with over 100 million players every month. Players form the foundation of our community, and it’s for them that we continue to evolve and improve the League of Legends experience.

We’re looking for humble but ambitious, razor-sharp professionals who can teach us a thing or two. We promise to return the favor. Like us, you take play seriously; you’re passionate about games. We embrace those who see things differently, aren’t afraid to experiment, and who have a healthy disregard for constraints.

That's where you come in.

The AI Efficiency team at Riot Games builds the platforms, tools, and technical foundations that help Rioters safely and effectively use AI to accelerate how we work. As these systems become increasingly important to creative, product, and development workflows across Riot, we need platform engineering that can keep pace with growing scale, complexity, and expectations.

As a Principal Platform Engineer on the AI Efficiency team, you will design, build, and evolve the internal platforms, automation systems, and operational guardrails that make our AI services and developer workflows more scalable, more reliable, and easier to use. You will partner closely with software engineers, infrastructure teams, and cross-functional stakeholders to improve developer experience, platform reliability, deployment safety, observability, and operational excellence across a growing portfolio of AI services, internal tooling, and supporting infrastructure.

You will also help the team evaluate and operationalize a new generation of AI-native engineering workflows, including agent-assisted code review, automated bug triage and remediation, AI-driven performance and security analysis, browser-based UI validation, and other emerging automation patterns that can safely augment human judgment. This role is not only about building better platforms for today, but also about shaping how Riot adopts the next wave of intelligent engineering tooling responsibly and effectively. That direction is in line with current platform-engineering practice, which emphasizes product-minded internal platforms, self-service, and golden paths, and with modern browser automation tooling, which now explicitly supports AI-agent workflows and accessibility-focused testing.

You’re right for this role if you enjoy making complex systems easier to use and operate, reducing cognitive load for engineers, building paved roads instead of one-off solutions, and improving reliability through strong platform design. You are energized not only by hard infrastructure and operational problems, but also by the opportunity to responsibly bring new AI-native automation patterns into real engineering workflows.

Responsibilities:

Design, implement, and evolve internal platform capabilities that make AI Efficiency services easier to build, ship, observe, secure, and operate
Build and maintain self-service workflows, reusable platform abstractions, and golden paths that improve developer productivity while preserving reliability, security, and governance
Improve platform reliability through better monitoring, alerting, observability, deployment safety, release practices, and incident readiness
Define and operationalize service health indicators, SLIs, SLOs, and related reliability metrics that help teams make informed tradeoffs between reliability, velocity, and cost
Build automation that reduces operational toil and improves mean time to detect, respond to, and recover from incidents
Partner with engineers throughout the software development lifecycle to embed operability, production readiness, and maintainability into system design, implementation, rollout, and ongoing support
Improve CI/CD systems, developer workflows, and release pipelines so shipping becomes safer, faster, and more repeatable
Identify platform and reliability risks across distributed systems, infrastructure, service dependencies, and operational workflows, and drive durable improvements
Troubleshoot AI model-serving issues across frameworks, runtimes, and hardware environments, including diagnosing configuration, compatibility, and performance issues across different GPU platforms and supporting model format conversion workflows when needed
Design and run resilience, recovery, and failure-mode testing to validate system behavior under stress and uncover hidden weaknesses before they impact users
Evaluate, integrate, and operate AI-assisted engineering tools that improve code quality, reliability, security, performance, and developer productivity across the software delivery lifecycle
Build and evolve automation pipelines that combine conventional CI/CD systems with agentic workflows such as automated code review, bug detection, regression analysis, test generation, remediation suggestions, and workflow verification
Partner with engineers to introduce safe, auditable, and measurable uses of AI agents in areas such as pull request review, operational diagnostics, UI and UX validation, accessibility checks, and production readiness checks
Define guardrails, approval workflows, observability, reporting, and escalation paths for AI-assisted automation to ensure these systems remain safe, trustworthy, and operationally effective
Establish evaluation frameworks and success metrics for AI-native development tooling, including quality lift, false positive rates, latency, cost, operational risk, and impact on engineering throughput
Lead or contribute to incident response and post-incident improvement work for critical internal platforms and services, with a focus on systemic fixes and long-term resilience
Champion platform and operational excellence through documentation, runbooks, standards, and tooling that raise the engineering bar across the broader organization

Required Qualifications:

Bachelor’s degree in Computer Science or a related field, or equivalent professional experience
5+ years of experience in Platform Engineering, Infrastructure Engineering, Site Reliability Engineering, DevOps, Developer Experience, or a similar role supporting production systems and engineering workflows
Strong programming and automation skills in one or more languages such as Python, Go, or JavaScript / TypeScript
Experience designing, building, or operating internal platforms, developer tooling, CI/CD systems, or shared infrastructure used by multiple engineering teams
Experience operating and improving cloud-based production systems in AWS, GCP, Azure, or comparable environments
Strong understanding of observability practices, including metrics, logs, traces, dashboards, and alert design
Experience improving reliability and operability for distributed systems, service-oriented architectures, APIs, or platform infrastructure
Experience with incident management, root cause analysis, and driving durable operational improvements after production issues
Strong understanding of containerized environments and orchestration platforms such as Kubernetes, ECS, or similar technologies
Ability to collaborate across teams, influence technical direction, and communicate clearly with both engineers and non-engineers

Desired Qualifications:

Experience supporting AI/ML platforms, inference services, model-serving systems, data pipelines, or GPU-backed workloads
Experience defining and using SLOs, error budgets, and reliability metrics to guide prioritization and engineering decisions
Experience with platform product thinking, including designing self-service experiences, paved roads, or golden paths for internal users
Experience improving the reliability and usability of developer platforms, internal tools, or enterprise-facing services
Familiarity with infrastructure as code and configuration management systems such as Terraform, Pulumi, or similar tools
Experience with security, access control, secrets management, and operational hardening in production environments
Experience balancing availability, latency, efficiency, cost, and ease of use in systems operating at scale
Experience mentoring other engineers and raising platform or reliability standards through technical leadership
Experience evaluating or integrating AI-assisted software engineering tools for code review, static analysis, test generation, incident investigation, or operational automation
Familiarity with emerging agentic engineering workflows, including how AI agents can safely interact with source control, CI/CD pipelines, browser automation, and developer platforms
Experience with end-to-end browser automation and testing frameworks such as Playwright, including their use for UI validation, accessibility checks, regression testing, or workflow automation
Experience establishing governance, review loops, or quality controls for automated and AI-assisted engineering systems
Comfort working at the intersection of platform engineering, reliability engineering, developer productivity, and emerging AI-native software delivery practices

Our Perks:

Full relocation support
Comprehensive health insurance for you, your spouse, and children
Open paid time off
Retirement benefits with company matching
Life insurance, parental leave, and short-term and long-term disability coverage
Play Fund so you can deepen your knowledge of our players and community through games
We’ll double down on your donations of time and money to non-profits