Are you ready to grow your career in the cloud? Do you like the feeling that you are making a difference? This is your chance to be an integral part of a dynamic team of talented professionals deploying and maintaining innovative, industry-leading, cloud-based software.

Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. This technical role focuses on deploying and maintaining the Software as a Service (SaaS) environments of the Instana observability and application performance monitoring (APM) tool on AWS, Google Cloud, and IBM Cloud, and on automating a wide range of operational tasks for them. You will work collaboratively with the entire cloud organization and IBM vendors to support, maintain, and operationally improve the availability and reliability of the Instana offerings.

Your Role and Responsibilities

Instana is a leading observability and application performance monitoring (APM) tool. Our mission is to be the best-in-class tool for cloud-native microservice architecture observability. To achieve this, we receive and process billions of data points every day. We are looking for an experienced engineer to join our globally distributed Site Reliability Engineering (SRE) team that operates Instana’s SaaS platforms.
As a member of the Instana SRE team, you will:

  • Operate and improve the Instana infrastructure with a strong focus on reliability, security and cost
  • Develop automation tooling for deployments, upgrades and self-remediation
  • Participate in a 24x7 on-call rotation, incident response, and root cause analysis
  • Work with product teams to:
    • Understand the reliability and cost implications of new components and services
    • Perform production readiness checks
    • Design meaningful SLOs that help meet our availability goals
    • Assist in the design of the application and system architecture to meet future scalability requirements


Required Technical and Professional Expertise

You should demonstrate a mix of experience and skills in the following areas:

  • 5+ years of software development, software engineering, and/or system operations experience supporting cloud offerings
  • System administration/engineering experience (Ubuntu and Red Hat)
  • Experience with at least one of these datastores:
    • Kafka
    • Cassandra
    • Elasticsearch
    • ClickHouse
    • CockroachDB
  • Experience with at least one of these clouds:
    • AWS
    • Google Cloud Platform (GCP)
    • IBM Cloud
  • Experience with cloud technologies such as Docker, Kubernetes, and OpenShift
  • Experience with infrastructure as code and configuration management tools (e.g. Terraform, Chef, Ansible)
  • A systematic approach to troubleshooting and a deep sense of ownership for your work
  • Passion for resolving reliability issues and identifying strategies to mitigate them going forward

Preferred Technical and Professional Expertise

In addition, knowledge of or experience in any of the following would be an advantage:

  • Experience with DevOps engineering or SRE
  • Networking knowledge (HTTP, TLS, DNS, Cloudflare, Akamai) for troubleshooting network and load balancer issues
  • Source control (Git, GitHub) and CI/CD pipelines (Jenkins)
  • Software development experience (Go and Java preferred)
  • Experience developing monitoring for production components and instrumenting code for observability using Instana or LogDNA
  • Motivation to learn new technologies
  • Strong verbal and written communication skills
  • Ability to work in a global, multicultural, and diverse environment