Numrah is seeking a highly experienced DevOps Engineer to join our dynamic team. The ideal candidate will have a strong background in Google Cloud Platform (GCP), Kubernetes, PostgreSQL, real-time application management, high-availability services, and site reliability engineering (SRE). This role is essential to our mission of building resilient, scalable, and efficient infrastructure for our applications, ensuring seamless performance and availability for our clients.
Infrastructure Management: Design, implement, and manage scalable infrastructure on Google Cloud Platform (GCP) to support highly available and secure applications.
Kubernetes Orchestration: Manage, deploy, and scale containerized applications using Kubernetes, ensuring optimal resource usage and smooth operation in production environments.
Database Administration: Configure, monitor, and optimize PostgreSQL databases to ensure data integrity, high availability, and performance for real-time applications.
Site Reliability Engineering: Develop and implement SRE practices to monitor, automate, and improve system reliability and performance, aiming for zero downtime and high service availability.
Real-Time Application Support: Work with engineering teams to ensure that real-time applications run smoothly, with fast response times and minimal latency, focusing on performance tuning and troubleshooting as necessary.
CI/CD and Automation: Develop and maintain CI/CD pipelines for code deployment, infrastructure changes, and automated testing, focusing on reducing lead time and improving deployment speed.
Monitoring & Incident Management: Set up and manage monitoring, logging, and alerting systems for early detection of issues; lead incident response and postmortem analysis to continuously improve the infrastructure.
Performance Optimization: Identify bottlenecks and optimize the performance of services, including the tuning of Kubernetes clusters, databases, and GCP resources.
Experience: 2+ years as a DevOps Engineer, Site Reliability Engineer, or in a similar role, with a strong focus on Google Cloud Platform.
Technical Expertise:
In-depth knowledge of Google Cloud Platform services, including Compute Engine, Cloud Storage, VPC, GKE, Cloud Pub/Sub, Cloud SQL, and BigQuery.
Proficiency in Kubernetes (preferably GKE), including deployment, scaling, and troubleshooting.
Strong experience with PostgreSQL administration, including clustering, replication, and backup strategies.
Background in supporting real-time applications and ensuring low-latency, high-performance environments.
Skills:
Advanced knowledge of CI/CD tools (such as Google Cloud Deploy, GitHub Actions, or equivalent) and Infrastructure as Code (IaC) using Terraform or equivalent.
Familiarity with observability tools like Prometheus, Grafana, Cloud Monitoring and Cloud Logging (formerly Stackdriver), and the ELK/EFK stack.
Proven experience implementing SRE principles and working with SLAs, SLIs, and SLOs to improve service reliability.
Soft Skills: Strong analytical skills, a collaborative approach to problem-solving, and excellent communication skills with both technical and non-technical stakeholders.