Service Reliability Specialist - Live Operations, NOC
Job Id: REQ-0002967
The Network Operations Center (NOC) manages the 24x7 monitoring and response components of Riot's player-facing services. We are the first line of defense when things go wrong with any of our live services and many of our internal services as well. We leverage technical familiarity with best-practice processes to rapidly remediate incidents. The team is staffed with Administrators and Specialists who provide reliable triage services across many levels of technical and process operations. The team also mentors other Riot teams and helps them create best practices in alerting, monitoring, and operational processes.
As a Service Reliability Specialist, you will work closely with the Live Operations team and Riot Games globally to establish and maintain a high-performing and highly available game service for players. You will monitor and support all aspects of production environments, development environments, and general system needs. Your technical skills and grasp of system integration will help you diagnose and communicate potential issues to Rioters and the community, improving the quality of the player experience. You will be looked to as the expert in NOC craft and incident management principles and relied upon to look for more proactive solutions to incidents.
- Works under general direction within a clear framework of accountability.
- Exercises substantial personal responsibility and autonomy.
- Works with external teams to execute the mission of the team.
- Acts as first responder, triage agent, or escalation point from the NOC to external teams.
- Work with internal and external teams to create and update documentation.
- Multitask and rapidly address issues affecting our players and services.
- Gather and report data on the health and operation of Riot services.
- Work in a fast-paced, constantly changing environment.
- Proactively triage and investigate live incidents.
- Speak with authority on incident management processes.
- Perform technical troubleshooting (SSH, IP addressing, command-line interfaces).
- 6+ years of experience as a NOC Technician or in an equivalent role (Analyst, System Administrator, Live Operations, Network Administrator, etc.)
- Familiarity with the core concepts of operating systems, networking, and software life cycles
- Enthusiasm around operations and technology
- Highly driven and self-motivated
- Excellent logical troubleshooting skills
- Strong organizational skills
- Excellent communication skills
- Experience with the following:
  - Monitoring solutions, e.g., New Relic, Nagios, Elasticsearch, Grafana
  - Event management tools, e.g., BigPanda, Moogsoft
  - ITIL-based ticketing systems, e.g., ServiceNow, Jira
- Scripting proficiency is highly desired
- Experience working on deployments in a live environment is a plus
- Proficiency in multiple languages is a plus, especially Mandarin
- Linux+ and Network+ certifications, or equivalents
- Engineering degree or equivalent
- Software engineering experience
For this role, you'll find success through craft expertise, a collaborative spirit, and decision-making that prioritizes the delight of players. We will certainly be looking at your past studies and experience, but for this role, we also look for dedicated people with a personal relationship with games. If you embody player empathy and care about the experiences of players, this could be the role for you!
- Full health insurance for you, your spouse, and your children
- Open paid time off
- Savings benefit with company matching
- Life insurance, parental leave, plus short-term and long-term disability
- Play Fund so you can broaden and deepen your knowledge of our players and community through games
- Wellness Fund to encourage a balanced body and mind
- Monthly phone bill allowance
- Monthly food allowance
- We will double down on your donations of time and money to non-profits
Don’t forget to include a resume and cover letter. We receive a lot of applications, but we’ll notice a fun, well-written intro that shows us you take play seriously.