SRE Lead

Position Overview

The SRE Lead manages the daily operations of the SRE team and oversees the reliability, scalability, and performance of the infrastructure and services. The role also involves defining strategies for improving system reliability and ensuring the team adopts best practices in automation, incident response, and infrastructure management.

Key Responsibilities

1. Leadership and Team Management

Manage the daily operations of the SRE team, including scheduling, assignment of tasks, and performance tracking
Provide technical guidance, mentorship, and feedback to individual team members, and promote feedback exchange within the team, using feedback tools where possible
Conduct regular performance reviews, set goals, and develop individual growth plans for SRE team members in coordination with the TM Specialist

2. SRE Practices

Implement SRE strategy, processes, and practices defined by the organization, ensuring that they are adhered to within the team

3. Strategic Planning and Roadmap Development

Develop and execute a roadmap for improving observability, scalability, disaster recovery, and performance optimization

4. System Reliability and Incident Management

Oversee system health, ensuring a high level of reliability, uptime, and performance across production environments
Lead incident management efforts, including response, resolution, and post-mortem reviews, ensuring root causes are identified and mitigated
Drive the development of incident response protocols and on-call rotations to ensure 24/7 support and quick resolution of critical issues

5. Automation and Infrastructure Optimization

Drive the adoption and scaling of automation practices across the team, reducing manual tasks related to deployments, scaling, and monitoring
Ensure the team implements Infrastructure as Code (IaC) and continuously refines CI/CD pipelines to support efficient, repeatable, and reliable infrastructure management
Lead initiatives for optimizing cloud infrastructure and resource usage, ensuring performance meets business needs while optimizing costs

6. Production Release Support

Oversee and support the deployment of new features and updates to production, ensuring minimal downtime and maximum reliability
Collaborate with development and management teams to ensure a smooth and efficient release process, adhering to established release procedures
Monitor production environments during and after releases, and be ready to address issues or perform rollbacks if necessary

7. Monitoring, Observability, and Performance Tuning

Oversee the development and maintenance of monitoring and observability systems, ensuring they provide real-time insights into system performance and reliability
Ensure that system metrics are regularly reviewed and that performance-tuning efforts are prioritized based on system bottlenecks and resource usage patterns
Work with development teams to ensure observability is integrated into the design and development of applications and services

8. Cross-Functional Collaboration and Communication

Serve as the point of contact for reliability-related matters, providing regular updates on system health, incident trends, and improvement plans
Foster a culture of shared responsibility between SRE and development teams, encouraging collaboration on building reliable, scalable, and performant systems

9. Continuous Improvement and Innovation

Promote the adoption of new technologies, frameworks, and tools that enhance system resilience, scalability, and automation
Regularly review and refine processes to increase the efficiency and effectiveness of incident response, system monitoring, and infrastructure management

10. Security, Compliance, and Risk Management

Ensure that security best practices are integrated into all aspects of infrastructure management, including access control, vulnerability management, and data protection
Collaborate with security teams to ensure compliance with industry standards and regulations while maintaining system availability and performance
Proactively manage risks related to system reliability and availability, identifying and mitigating potential threats before they impact production environments

11. Reporting and Metrics

Define, track, and report on key metrics related to system performance, uptime, and incident response, providing insights to both the engineering team and leadership
Lead efforts to use data-driven insights for system improvements and to measure the impact of changes to reliability and performance
Present regular reports on the state of system reliability, key incidents, and ongoing improvement initiatives to leadership and stakeholders

12. Collaboration in Hiring

Participate in the hiring process for SRE roles, evaluating candidates and helping build a strong, capable team

Education & Qualifications

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent experience.

Preferred certifications (optional):

AWS Cloud Engineer
AWS Machine Learning Ops Engineer

Leadership & Collaboration

Passion for building scalable, reliable, and secure systems in a fast-paced environment.
Ability to translate complex technical concepts into clear, actionable insights for technical teams.
Strong interpersonal skills with the ability to work effectively across cross-functional teams.
Excellent problem-solving and analytical skills.

Our recruitment philosophy

We value self-awareness and strong communication skills in our recruitment process. We seek fiercely passionate people who understand themselves and their career goals, who have the right skills, and who have made a conscious choice to join our field. The perfect fit? A trading and crypto enthusiast who is driven and collaborative, acts with ownership, and delivers solid, scalable outcomes.
