As an SRE/Infrastructure Engineer, you will be responsible for designing, implementing, and maintaining the cloud infrastructure our platform sits on, as well as the monitoring and deployment services that enable the rest of engineering to develop, deliver, and maintain our platform services. You will also be instrumental in both monitoring and incident response, playing a key role in ensuring maximum reliability and minimal downtime. You will collaborate with teams across the company, including developers, customer support, product owners, and sales, to ensure the reliability, scalability, and performance of our platform.
- Infrastructure Design and Implementation: Assist or lead in the design, deployment, and operation of the infrastructure components required to support our applications and services. This includes managed cloud infrastructure, networking, security, data storage, and cloud-hosted services
- System Automation: Develop and maintain automation and tools to streamline infrastructure provisioning, configuration management, deployment, and monitoring. Implement infrastructure as code (IaC) practices using tools such as Terraform and Ansible
- Monitoring and Alerting: Implement monitoring solutions to track the health, performance, and availability of infrastructure components and applications. Configure alerting mechanisms to notify teams of potential issues and proactively address them before they impact users
- Incident Response and Root Cause Analysis: Participate in incident response activities to identify, troubleshoot, and resolve incidents. Communicate incident status and updates to ensure both internal and external customers are fully informed. Conduct root cause analysis to determine the underlying causes of incidents and implement preventive measures to avoid recurrence
- Performance & Cost Optimization: Analyze system performance metrics and identify opportunities for optimization. Tune infrastructure components, optimize configurations, and implement enhancements to improve performance and resource utilization
- Security and Compliance: Implement security controls and respond to security incidents in accordance with established policies and procedures
- Disaster Recovery and High Availability: Design and implement disaster recovery (DR) and high availability (HA) solutions to ensure business continuity and minimize downtime. Develop and test DR plans, implement failover mechanisms, and conduct periodic drills to validate readiness
- Capacity Planning and Scaling: Monitor resource utilization trends and prepare the infrastructure to handle predicted future changes
- Documentation and Knowledge Sharing: Create and maintain documentation for infrastructure configurations, procedures, and best practices. Share knowledge and expertise with team members through documentation, training sessions, and mentorship to foster a culture of learning and collaboration
Requirements
- Proficiency in scripting and automation using languages such as Go, Python, and Bash
- Experience with cloud platforms (e.g., AWS, Azure, Google Cloud) and containerization technologies (e.g., Docker, Kubernetes)
- Strong understanding of networking concepts, protocols, and security principles
- Familiarity with infrastructure as code (IaC) tools and configuration management frameworks (e.g., Terraform)
- Knowledge of monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack, AWS CloudWatch) for infrastructure and application monitoring
- Excellent problem-solving skills, attention to detail, and ability to work independently and collaboratively in a fast-paced environment
- Effective communication skills, both written and verbal, with the ability to articulate technical concepts to non-infrastructure stakeholders
Benefits
As well as working as part of an amazing, engaging and collaborative team, we offer our staff a wide range of benefits to motivate them to be the best they can be! Here's an overview of everything we offer right now!
You will receive:
- A competitive salary based on your experience and ability to perform in role
- 25 days annual leave (excluding bank holidays)
- 8% pension contribution
- Private health care via AXA
- Fantastic corporate discounts and mental wellbeing support via Perkbox, including a top-of-the-line EAP
- Salary sacrifice schemes as well as the opportunity to receive share options
We have fantastic offices in Basingstoke and London (Midford Place), complete with a fully stocked fridge, snacks, and catered lunches twice a week.
We also reward our teams with monthly socials, half-day Fridays during the summer months of July and August, 3 extra days off during the Christmas holidays, and a culture built on recognition, collaboration, and success.