Position Overview 

The SRE Lead manages the day-to-day operations of the SRE team and oversees the reliability, scalability, and performance of infrastructure and services. The role involves defining strategies for improving system reliability and ensuring the team adopts best practices in automation, incident response, and infrastructure management.

Key Responsibilities 

1. System Reliability and Performance 

  • Maintain and enhance the reliability, uptime, and performance of production systems. 
  • Monitor system health and proactively identify performance bottlenecks and areas for improvement. 
  • Conduct root cause analysis (RCA) and contribute to post-incident reviews to prevent recurrence. 
     

2. Incident Response and Operations 

  • Participate in on-call rotations and respond to incidents to minimize downtime. 
  • Collaborate in incident management processes including triage, mitigation, documentation, and recovery. 
  • Develop runbooks and automation scripts to streamline troubleshooting and recovery procedures. 
     

3. Automation and Infrastructure Optimization 

  • Implement and maintain Infrastructure as Code (IaC) using tools such as Terraform, Ansible, or CloudFormation. 
  • Improve CI/CD pipelines to ensure seamless, repeatable, and reliable deployments. 
  • Automate operational tasks to reduce manual effort and increase efficiency. 
  • Optimize cloud resource usage for performance and cost efficiency. 
     

4. Monitoring and Observability 

  • Build and maintain comprehensive monitoring, alerting, and observability solutions (e.g., Prometheus, Grafana, ELK, Datadog). 
  • Ensure meaningful alerts and actionable metrics are in place to detect and respond to system anomalies. 
  • Collaborate with development teams to embed observability into new services from design to deployment. 
     

5. Cross-Functional Collaboration 

  • Work closely with developers, QA, and DevOps to ensure system reliability is integrated into every phase of the software lifecycle. 
  • Partner with stakeholders to support reliable deployments and continuous delivery. 
  • Contribute to documentation, playbooks, and process improvements. 
     

6. Continuous Improvement and Innovation 

  • Identify areas of improvement in existing systems, processes, and automation frameworks. 
  • Research and implement emerging technologies that enhance system scalability, security, and resilience. 
  • Participate in post-mortem reviews and reliability improvement initiatives. 
     

7. Security and Compliance 

  • Apply security best practices to system configuration, monitoring, and access control. 
  • Collaborate with security teams to maintain compliance with organizational and industry standards. 
  • Assist in vulnerability management and ensure patches or mitigations are deployed in a timely manner. 
     

8. Reporting and Metrics 

  • Track, analyze, and report on system performance, reliability, and incident trends. 
  • Use metrics-driven insights to support reliability improvements and operational excellence initiatives. 

 
Education & Qualifications 

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent experience. 

Preferred certifications (optional): 

  • AWS Cloud Engineer 
  • AWS Machine Learning Ops Engineer 

Collaboration 

  • Passion for building scalable, reliable, and secure systems in a fast-paced environment. 
  • Ability to translate complex technical concepts into clear, actionable insights for technical teams. 
  • Strong interpersonal skills with the ability to work effectively across cross-functional teams. 
  • Excellent problem-solving and analytical skills. 
     

Our Recruitment Philosophy
 
We value self-awareness and strong communication skills in our recruitment process. We seek fiercely passionate people who understand themselves and their career goals. We're after those who have the right skills and have made a conscious choice to join our field. The perfect fit? A trading and crypto enthusiast who is driven and collaborative, acts with ownership, and delivers solid, scalable outcomes.