Senior Site Reliability Engineer, IMF

bloomreach • Czechia

No Relocation

Posted: January 23, 2026

Job Description

We are looking for a dedicated DevOps Engineer to join our Analytics team and manage our in-memory database (IMF) and related services. Our system runs on Google Cloud Platform (GCP) and Kubernetes and integrates with Kafka, MongoDB, and other services. Your job will be to keep our databases and services running smoothly, maintain reliable monitoring, and develop tools and automation for new releases, maintenance, and incident management.

The team works remotely in the Central European Time Zone. We are happy to meet you in Brno, Prague (Czechia) or Bratislava (Slovakia), where our headquarters is located.

Responsibilities

System Administration:

Manage and configure our Kubernetes components to ensure they are highly available, reliable, and perform well.

Incident Management:

Handle incident responses and perform root cause analysis for critical issues.
Participate in a 24/7 on-call rotation, with each duty lasting 1 week. We aim to have 4 engineers in the rotation.

Automation and Tools Development:

Create and maintain scripts and tools using Python and Go to automate operations and reduce manual tasks.

Scaling and Resource Planning:

Monitor system performance and plan for future scaling.
Ensure there are enough resources during peak times.

Monitoring and Logging:

Set up and maintain systems to monitor and log activities, so issues can be detected and addressed early.

Backup and Recovery:

Ensure our database has reliable backups and efficient tools for quick and smooth recovery.

Collaboration:

Work closely with other engineers and product managers to ensure successful project delivery.
Collaborate with L2 support engineers to ensure seamless operations and effective problem resolution.

Qualifications

Experience:

Worked in DevOps or Site Reliability Engineering (SRE) before.
Understand basic DevOps principles.
Familiar with cloud platforms, especially Google Cloud Platform (GCP).
Know how to use Kubernetes (this is important).
Know how to build and maintain CI/CD pipelines in GitLab or similar.

Skills:

Good at automating tasks and scripting with Python, Go, or Shell (for basic Linux tasks and Kubernetes management).
Experienced in handling and resolving incidents.

Tools:

Know how to use monitoring tools such as VictoriaMetrics and Grafana.
Familiar with logging tools.

Problem-solving:

Good at analyzing issues and finding solutions.

Communication:

Can communicate well and work well with remote teams.

Adaptability:

Able to work on your own and manage multiple tasks.
Comfortable working in a fast-paced environment.

Our stack

GitLab
VictoriaMetrics, Grafana, InfluxDB, Chronograf, Sentry
IMF (our in-memory database written in C++), Apache Kafka, MongoDB
Kubernetes (GKE), Google Cloud Platform, gRPC
Python, Go

Your success story.

First 30 Days:

Get to know the team, the company, and key processes.
Start working on your first tasks.
Learn about our infrastructure, release process, tools, and product with our help.

First 90 Days:

Take an active role in daily operations, including monitoring and incident management.
Work on small automation projects to make routine tasks easier and improve efficiency.
Help develop and maintain internal tools for monitoring, logging, and automation.
Join the on-call rotation with support from experienced team members.

First 180 Days:

Take ownership of specific tasks and projects, working independently.
Contribute to scaling and resource planning to ensure the system can handle future growth and peak times.
Understand the team's direction and help shape our future.

