Site Reliability Engineer Lead

12-15 years

Job Description

Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. The SRE Lead will work with OCBC's infrastructure and software development teams to ensure service reliability and uptime appropriate to users' needs while supporting fast iterations of improvement. The candidate has a passion for software development and pays close attention to optimizing existing systems, building infrastructure, and reducing or eliminating toil through automation.
As a Tech Lead, you will build and lead a team of software and systems engineers. You are expected to set up the processes needed for efficient execution and to advocate good engineering practices. You will also coordinate and communicate regularly with other application and infrastructure teams as well as internal users.

1. Build and manage the SRE team, including recruitment, training of new talent, system operation, maintenance and coordination, and team culture building.
2. Develop a long-term technical plan with a clear implementation path and milestones, and continuously ensure the competitiveness of the team and its technology.
3. Formulate process specifications and plans for access, configuration, disaster recovery, and fault handling across all critical paths of the operating platform.
4. Design and implement software platforms and monitoring frameworks for efficient, automated, and intelligent governance of event-driven / service-oriented architecture.
5. Cooperate with the system development team to ensure system reliability throughout the entire life cycle, from system design to launch (Cradle to Grave). Identify opportunities for continuous improvement across the full lifecycle of a large distributed system (i.e. design, development, configuration, testing, deployment, monitoring, and operations). Continuously evolve automated operation and maintenance facilities and platforms.
6. Strengthen communication and cooperation with business teams, improve cross-team coordination, and ensure continuous improvement and optimization of business flows. Promote the evolution of business architecture design by reducing customer anxiety.

  • Work closely with solution architects and application development teams to ensure that design and coding adhere to best practices with respect to SRE & CRE principles.
  • Monitor, troubleshoot & analyse performance issues in applications and the underlying infrastructure as part of performance engineering exercises, and derive gold-configuration parameters.
  • Drive thorough performance analysis of microservices code using single-user code profiling techniques.
  • Assist the development team in tuning applications and configurations for critical systems to comply with the NFR (non-functional requirements) before going live in production, and ensure the performance recommendations are part of the change request process.
  • Ensure appropriate governance of framework usage across multiple delivery streams and enhance the framework's capability to meet upcoming requirements.
  • Participate in & contribute to resiliency validation exercises and report the results to the stakeholders.
  • Define critical performance KPIs, set alert rules, and roll out monitoring dashboards for Production with timely reporting to the stakeholders.
  • Automate manual tasks related to performance monitoring, alerting, analysis, reporting, capacity planning, etc. to improve application observability, resiliency & operational efficiency.

Qualifications
  • Bachelor's degree in Computer Science or equivalent, with 12+ years of work experience. 3+ years of R&D experience is a bonus.
  • Systematic approach to operations and maintenance, with the ability to balance tactical and strategic work. Familiar with Linux systems and networking.
  • Practical experience with development or intelligent operation and maintenance of large-scale distributed system architectures (e.g. distributed storage, scheduling, big data computing systems), hybrid cloud/on-premise environments, and event-driven or event-stream systems is preferred.
  • Self-driven, with strong planning, summarizing, analytical, and problem-solving skills. Experienced with project and team management. Positive attitude towards continuous learning.
  • Possesses a strong sense of responsibility, a proactive team spirit, and a strong ability to comprehensively analyze and solve problems.
  • Minimum 3-5 years of hands-on experience writing maintainable, testable code in Python, Java/J2EE, Spring Boot, JavaScript, and SQL/PostgreSQL.
  • Minimum 2 years of hands-on experience in any of the following technologies: Red Hat OpenShift/Kubernetes, Docker, Kafka, ELK, Redis, and DevOps tools such as Jenkins, Bitbucket, and JIRA.
  • Hands-on experience in application monitoring with Grafana, Kibana, Prometheus, AppDynamics, or Dynatrace is a plus.
  • Hands-on experience in Chaos Engineering is a plus.
  • Familiarity with Helm / Terraform.

About OCBC

OCBC Bank is the longest-established Singapore bank, formed in 1932 from the merger of three local banks, the oldest of which was founded in 1912. It is now the second largest financial services group in Southeast Asia by assets and one of the world’s most highly-rated banks, with Aa1 by Moody’s and AA- by both Fitch and S&P. Recognised for its financial strength and stability, OCBC Bank is consistently ranked among the World’s Top 50 Safest Banks by Global Finance and has been named Best Managed Bank in Singapore by The Asian Banker.
