Service Reliability Manager

Job Title: Service Reliability Manager
Location: Cardiff Bay, Wales
Salary: £92,500.00 per annum
Department: Group IT & Security
Reports To: Head of Service Management

Role objectives:

• Ensure our products are ready for life in Production
• Embed reliability and supportability as features across the lifecycle of solution development
• Help to guide our engineering team’s transformation
• Help to design services that are engineered appropriately for our target market
• Raise the bar for engineering quality
• Deliver higher service availability
• Establish and lead an enablement team

Personal qualities:

• Trustworthy and quick-thinking; be one of the smartest people in the room, and likeable
• Optimistic & Resilient; breed positivity and don’t give up on the “right thing”
• Leadership & Negotiation; sell not tell, build support and consensus
• Creativity and High standards; develop imaginative solutions without cutting corners
• Fully rounded; experience of dev, support, security, ops, architecture and sales

Day to day the Service Reliability Manager will:

• Conduct Service Readiness Reviews
• Take the lead by combining domain knowledge and technical expertise with a passion for coaching and developing people
• Influence and mentor a wide range of colleagues on building robust and resilient applications that include self-healing and fault tolerance techniques
• Be responsible for the performance, reliability and resilience of our internal and external services
• Help our engineering teams resolve priority issues
• Contribute to the running and improvement of Change Management, Emergency Response and Capacity Planning activities
• Work with architectural team members to ensure that systems are loosely or fully decoupled, and maintain oversight of how systems relate to each other
• Limit the time spent on operational tasks and automate wherever possible
• Lead the engineering activities that enable root causes to be identified, debugged and resolved to prevent recurrence
• Proactively identify the causes of outages that haven’t yet happened

The Service Reliability Manager should have:

• A track record of troubleshooting and resolving issues in live production environments and implementing strategies to prevent their recurrence
• Experience in a technical operations support role
• Solutions architecture experience
• Shell scripting experience
• Proficiency in container-based environments, including Docker and Kubernetes
• Experience of automating infrastructure using “as code” tooling
• Strong OS skills, Windows and Linux
• Solid understanding of relational and NoSQL databases
• Experience in establishing/negotiating SLIs, SLOs, and SLAs
• Experience working with hybrid cloud-based infrastructure
• Understanding of infrastructure services including DNS, DHCP, LDAP, virtualization, server monitoring, cloud services (Azure and AWS)
• Fluency in one or more high-level programming languages, such as JavaScript or C# (.NET)
• Knowledge of continuous integration and continuous delivery, testing methodologies, TDD and agile development methodologies
• Strong ability and enthusiasm to learn new technologies in a short time
• Strong people management skills, covering both direct and indirect reports