Job Details

Reliability Operations Engineer

Job Code: J50243


Job Summary

Experience:

2 - 5 Years

Industry Type:

IT-Hardware & Networking

Location:

Mumbai

Functional Area:

IT Software - Other

Designation:

Reliability Operations Engineer

Key Skills:

Observability, Monitoring, ELK, Elasticsearch, Logstash, Filebeat, Kibana

Educational Level:

Graduate/Bachelors

Job Post Date:

2026-02-26 18:02:47

Stream of Study:

Degree:

BCA, BCS, BE-Comp/IT, BE-Other, BSc-Comp/IT, BSc-Other, BTech-Comp/IT, BTech-Other

Company Description

Our client was founded in 2000 by three IIM alumni. It is an electronic presentment technology and payment services company. The company focuses on leveraging technology to enable banks, businesses, and other institutions to present invoices, statements, and bills to consumers or businesses and receive payments against them.

Their product powers electronic payment and collection services for the largest banks and companies in India and also manages the bill payment service of Visa in India. It operates as a neutral service bureau, aggregating multiple banks, billing companies, and other corporations onto a common standards-based platform for delivering electronic payment and collection services across multiple electronic channels.

Their product manages these services across a range of access channels, such as Internet banking, ATM banking, telebanking, and mobile banking. The payment gateway services of our client enable customers to pay online using either their electronic banking accounts or credit cards.

Job Description

NOTE: This is a 24x7 rotational shift role supporting high-availability production environments.

Position: Reliability Operations Engineer – Production Support Monitoring (24x7).
Work Location: Andheri (W); near Azadnagar Metro Station

Role
We are seeking a proactive and detail-oriented Reliability Operations Engineer to support the uptime, monitoring, and operational stability of our critical internet-facing web applications and transaction workflows.
This role is responsible for real-time production monitoring, early anomaly detection, and first-level incident response to ensure seamless application performance and near 100% availability.
The ideal candidate will have hands-on experience with monitoring tools (ELK Stack, Grafana), a strong understanding of API workflows, and the ability to analyze logs and system metrics to identify and escalate issues promptly.

Responsibilities
Production Monitoring and Uptime Management
• Perform 24x7 monitoring of critical internet-facing web applications and system workflows
• Ensure high system availability and proactively detect service disruptions
• Monitor API performance, latency, error rates, and transaction flows
• Track server health parameters including CPU, memory, disk utilization, and network stability
• Monitor Elasticsearch cluster health, index growth, and log ingestion pipelines

Incident Detection & Escalation
• Identify incidents (P1 / P2 / P3) and raise timely alerts to application and infrastructure teams
• Perform first-level triage to determine whether issues are application, database, infrastructure, or external dependency related
• Maintain incident logs and support Root Cause Analysis (RCA) documentation
• Follow the defined escalation matrix and SLA guidelines

Application Workflow Validation
• Understand end-to-end application workflows and business transactions
• Review API request/response payloads and validate error conditions
• Distinguish between 4xx and 5xx errors and identify recurring failure patterns
• Support validation of integration points with external systems and APIs

Observability & Dashboard Management
• Create and maintain real-time dashboards using Kibana and Grafana
• Monitor Filebeat and Logstash log ingestion processes
• Configure alerts and threshold-based monitoring for proactive incident detection
• Analyze system logs and application metrics to identify trends and anomalies

Continuous Monitoring Improvements
• Suggest enhancements to monitoring metrics and dashboards
• Reduce false positives and improve alert quality
• Contribute to automation of monitoring tasks where feasible

Experience
• 2–5 years of experience in Production Support, NOC, SRE Support, or Application Monitoring roles
• Hands-on experience working in 24x7 monitoring environments
• Experience supporting high-availability internet-based applications is preferred

Skills
Application and Workflow Understanding
• Strong understanding of HTTP/REST APIs and JSON payload structures
• Knowledge of API status codes and error classification
• Ability to trace application workflows across services

Monitoring and Observability
• Hands-on experience with ELK Stack (Elasticsearch, Logstash, Filebeat, Kibana)
• Experience monitoring Elasticsearch cluster health and managing index size
• Experience creating dashboards in Kibana and/or Grafana
• Ability to configure monitoring alerts and thresholds

Infrastructure and Systems Knowledge
• Basic Linux command-line proficiency (top, df -h, grep, tail, curl, netstat, etc.)
• Understanding of server health metrics and system performance indicators
• Basic knowledge of databases and log analysis

Incident Management
• Ability to quickly assess and classify production issues
• Strong analytical and troubleshooting skills
• Familiarity with ticketing systems such as Jira or ServiceNow

Soft Skills
• Strong analytical and problem-solving capabilities
• Ability to remain calm under pressure during high-severity incidents
• Clear verbal and written communication skills
• Strong attention to detail
• Willingness to work rotational shifts including nights and weekends

Qualifications
• Bachelor’s degree in Computer Science, Information Technology, or a related field
• Certifications in Linux, Cloud, or Monitoring tools are an added advantage
