Senior Python Systems Engineer (Agent & Infrastructure)

ClearML, Germany


No Relocation

Posted: March 8, 2026

Job Description

About the company

At ClearML, our mission is to make infrastructure management effortless across every phase of the AI lifecycle -- from building and training AI models to large-scale production. Trusted by more than 2,000 organizations, AI builders and IT teams use our AI infrastructure platform to power everything from early-stage R&D to mission-critical public sector and enterprise-grade AI pipelines.

We’re growing quickly and looking for curious, self-driven individuals who are excited to shape the future of AI and the infrastructure that powers it. Our customers are tackling some of the world’s most important challenges -- revolutionizing healthcare, discovering new medicines, securing global finance, protecting national security, and preserving our planet’s ecosystems.

About the Role:

We are looking for a Senior Systems Engineer to own the execution layer of the ClearML platform. You will be responsible for some of the critical components that spin up containers, manage GPUs, and tunnel connections, so that ClearML works seamlessly across multiple environments.

This role sits at the intersection of Software Engineering and DevOps. You will write Python code that orchestrates infrastructure, manages Docker containers, interacts with the Kubernetes API, and handles low-level networking.

Responsibilities

● Agent Development: Design and optimize the clearml-agent, a Python service responsible for pulling jobs, setting up environments, and executing ML pipelines.

● Kubernetes Integration: Write logic to interact directly with K8s APIs, manage Pod lifecycles, and handle Custom Resource Definitions (CRDs).

● Resource Management: Implement logic for dynamic resource allocation (GPU/CPU/Memory) and container orchestration.

● Systems Programming: Build robust daemons and services that interact with OS-level primitives (systemd, signals, I/O streams).

● Networking: Troubleshoot and optimize TCP/IP connections, DNS resolution, and firewall traversal to ensure seamless connectivity for users.

Requirements

● 8+ years of development experience with a strong focus on Systems Programming.

● Kubernetes Mastery: Deep understanding of Kubernetes architecture (beyond just writing YAML). You should know how to write code that controls K8s.

● Container Internals: Extensive experience with Docker, including building and maintaining images.

● Python for Systems: Experience using Python for automation, daemons, or CLI tools (using libraries like subprocess, socket, asyncio).

● Networking Fundamentals: Strong grasp of HTTP/S, WebSockets, TCP/IP, Proxies, and Reverse Proxies.

● OS Knowledge: Strong understanding of Linux internals and shell scripting.

Advantages

● Experience with GPU hardware management (NVIDIA drivers, CUDA, NVIDIA Container Toolkit).

● Experience building Kubernetes Operators/Controllers (using Kopf or Operator SDK).

● Background in HPC (High-Performance Computing) or Slurm/MPI.

● Experience with Go (Golang) is a plus (for specific K8s components).
