1. Work with the Engineering, Product, Delivery and Architecture teams to ensure that appropriate attention is given to 'Reliability Engineering'.
· Focus on (at least) meeting the standards of the Licence to Operate framework (LtO) within the bank.
· Focus on automation, particularly self-healing and auto-remediation.
· Ensure that the systems are able to withstand 'chaos engineering' practices.
· Ensure adequate instrumentation and alerting exist to spot leading indicators of an impending incident, both in the platform and in the systems on which it depends.
· Ensure the means exist to quickly recover a degraded service (instrumentation, runbooks, tooling, etc.).
· Ensure platform features fail gracefully when services are degraded.
· Drive SRE education across the wider team to improve quality and reliability.
2. Lead the management of incidents (during and after) involving solutions that are the responsibility of the Communications Hub team.
· Incident management (during and post).
· Root cause identification.
· Drive through fixes to systems (mobile and our dependencies) so that the same underlying defects do not cause multiple incidents.
3. Establish and lead a global support capability for the Communications Hub team.
· Establish and manage a 24x7 (follow the sun) global support model.
· Ensure Communications Hub team members are good citizens in the community (reporting, meetings etc).
4. Perform the role of ‘IT Service Owner’ for services for which the Communications Hub team is accountable, either personally or with the support of others.
What we’re looking for
· Significant experience working through the full definition, design, release and run cycle of software products brought to market, using Agile/Scrum methodologies.
· Experience with DevOps, ITIL, Cloud Services, IT Infrastructure and Operations, including environment standup, server builds, firewalls, security and regulatory compliance.
· Experience with an object-oriented language (preferably Java or Scala), as well as AWS, Docker, Terraform, Kubernetes and Chef.
· Proficiency working in Unix/Linux environments. Familiarity with Git, Jenkins, Maven, Nexus, Elasticsearch, AWS Cloud Computing, Pivotal Cloud Foundry, Spark and Kafka.
· Familiarity with Amazon cloud solutions and architectures (EC2, S3, CloudFormation, DynamoDB, Route 53, IAM, ELB, CloudWatch, Lambda, Kinesis, etc.).
· Experience managing public-facing production backend services used by both internal application users and external customers.
· Experience implementing and managing logging, monitoring and alerting frameworks for hybrid cloud or third-party services using AppDynamics, Splunk or Datadog.
· Experience with agile development (Scrum, Kanban, etc.) within an agile project team (able to perform cross-functional tasks quickly), balancing multiple projects and collaborating closely with other development teams.
· Understands the importance of a good team dynamic, is a very capable communicator and comfortable receiving feedback.
· Experience with the Atlassian toolset (JIRA, Confluence).
· Experience with tools such as Jenkins, Ansible, Chef, Puppet, Salt and OpsWorks (for AWS).
· Experience with Cloud development and deployment best practices on AWS.
· A passion for keeping up with the latest cloud security trends and tools.
Please note that we offer permanent employment as well as B2B.
If the above sounds like something you're currently doing, something you believe you can do, or even something you would like to do, get in touch with us and let's have a chat.
We aim to provide feedback as soon as possible. In the meantime, if you have any feedback on the process, we would be very keen to hear it. We are constantly looking for ways to improve and refine how we work, so we would love to hear your side of the story, good or bad.