4Bell Technology
Senior Site Reliability Engineer (R-1564)
1,800,000.00-2,000,000.00/A
Any Graduation
IT (Information Technology)
Full-time
Bangalore/Bengaluru
23-Jun-2026
Job Description
We are looking for a Senior Site Reliability Engineer (SRE) with deep expertise in observability, cloud-native infrastructure, and large-scale distributed systems. This role is highly hands-on and focuses on designing, building, and operating reliable, observable, and scalable platforms running on Kubernetes, with a strong preference for Google Cloud Platform (GCP) and AWS.
Mandatory Skills: Python, Site Reliability Engineering, ELK
Roles & Responsibilities
Reliability & Operations:
- Design, implement, and maintain highly available and resilient systems in Kubernetes-based environments
- Define and enforce SLOs, SLIs, and error budgets
- Lead incident response, RCA, and postmortems
- Drive reliability improvements through automation
Observability (Core Focus):
- Architect and operate observability platforms for metrics, logging, tracing, and alerting
- Work with Prometheus, Alertmanager, OpenTelemetry, Grafana, Loki / ELK / OpenSearch
- Implement cloud-native monitoring (GCP Cloud Monitoring & Logging preferred)
- Establish actionable alerting standards
Cloud & Platform Engineering:
- Build and manage infrastructure on GCP (preferred) or AWS
- Operate Kubernetes clusters (GKE preferred)
- Deploy services using Helm
- Manage containerized workloads using Docker
Automation & Tooling:
- Strong Python skills with emphasis on reliability, automation, and observability tooling
- Develop automation and tooling using Python
- Create internal reliability and monitoring tools
- Integrate CI/CD pipelines with observability and reliability checks
Collaboration & Leadership:
- Mentor junior engineers
- Influence architecture decisions
- Collaborate across engineering teams
Skills to Evaluate: Python, Site Reliability Engineering, ELK, AWS, GCP, Kubernetes, Docker, Ansible, Packer, Jenkins, Splunk, Cribl, Terraform, Vector, Prometheus, Linux, Helm, Datadog
Project Details:
What You’ll Work On:
- Build and operate a centralized observability platform for metrics, logs, traces, and alerting across Kubernetes workloads using Prometheus, Grafana, OpenTelemetry, and GCP Cloud Monitoring
- Define and drive SLOs, SLIs, and error budgets to improve reliability, reduce MTTR, and guide release decisions
- Design, operate, and optimize EKS/GKE-based Kubernetes platforms using Helm and containerized workloads with Docker
- Develop Python-based automation and tooling for observability, SLO reporting, incident response, and operational workflows
- Lead incident response for production issues, conduct blameless postmortems, and drive long-term reliability improvements
- Optimize platform scalability, performance, and cloud cost efficiency with a strong focus on GCP and AWS
- Act as a technical leader, influencing architecture and mentoring teams on reliability and observability best practices