Featherless AI logo

AI Researcher — Inference Optimization

Featherless AIRemote (world)


No Relocation

Posted: January 23, 2026

Job Description

Role Overview

We are seeking an AI Researcher with deep experience in inference optimization to design, evaluate, and deploy high-performance inference systems for large-scale machine learning models. You will work at the intersection of model architecture, systems engineering, and hardware-aware optimization, improving latency, throughput, and cost efficiency across real-world production environments.

Key Responsibilities

  • Research and develop techniques to optimize inference performance for large neural networks.

  • Improve latency, throughput, memory efficiency, and cost per inference.

  • Design and evaluate model-level optimizations (quantization, pruning, KV-cache optimization, architecture-aware simplifications).

  • Implement systems-level optimizations (dynamic batching, kernel fusion, multi-GPU inference, prefill vs decode optimization).

  • Benchmark inference workloads across hardware accelerators.

  • Collaborate with engineering teams to deploy optimized inference pipelines.

  • Translate research insights into production-ready improvements.

Required Qualifications

  • Strong background in machine learning, deep learning, or AI systems.

  • Hands-on experience optimizing inference for large-scale models.

  • Proficiency in Python and modern ML frameworks (e.g., PyTorch).

  • Experience with inference tooling (e.g., Triton, TensorRT, vLLM, ONNX Runtime).

  • Ability to design experiments and communicate results clearly.

Preferred / Nice-to-Have Qualifications

  • Experience deploying production inference systems at scale.

  • Familiarity with distributed and multi-GPU inference.

  • Experience contributing to open-source ML or inference frameworks.

  • Authorship or co-authorship of peer-reviewed research papers in machine learning, systems, or related fields.

  • Experience working close to hardware (CUDA, ROCm, profiling tools).

What Success Looks Like

  • Measurable gains in latency, throughput, and cost efficiency.

  • Optimized inference systems running reliably in production.

  • Research ideas successfully translated into deployable systems.

  • Clear benchmarks and documentation that inform product decisions.

Relevant Research Areas (Bonus)

  • Long-context inference optimization

  • Speculative decoding

  • KV-cache compression and paging

  • Efficient decoding strategies

  • Hardware-aware inference design

Additional Content

Role Overview

We are seeking an AI Researcher with deep experience in inference optimization to design, evaluate, and deploy high-performance inference systems for large-scale machine learning models. You will work at the intersection of model architecture, systems engineering, and hardware-aware optimization, improving latency, throughput, and cost efficiency across real-world production environments.

Key Responsibilities

  • Research and develop techniques to optimize inference performance for large neural networks.

  • Improve latency, throughput, memory efficiency, and cost per inference.

  • Design and evaluate model-level optimizations (quantization, pruning, KV-cache optimization, architecture-aware simplifications).

  • Implement systems-level optimizations (dynamic batching, kernel fusion, multi-GPU inference, prefill vs decode optimization).

  • Benchmark inference workloads across hardware accelerators.

  • Collaborate with engineering teams to deploy optimized inference pipelines.

  • Translate research insights into production-ready improvements.

Required Qualifications

  • Strong background in machine learning, deep learning, or AI systems.

  • Hands-on experience optimizing inference for large-scale models.

  • Proficiency in Python and modern ML frameworks (e.g., PyTorch).

  • Experience with inference tooling (e.g., Triton, TensorRT, vLLM, ONNX Runtime).

  • Ability to design experiments and communicate results clearly.

Preferred / Nice-to-Have Qualifications

  • Experience deploying production inference systems at scale.

  • Familiarity with distributed and multi-GPU inference.

  • Experience contributing to open-source ML or inference frameworks.

  • Authorship or co-authorship of peer-reviewed research papers in machine learning, systems, or related fields.

  • Experience working close to hardware (CUDA, ROCm, profiling tools).

What Success Looks Like

  • Measurable gains in latency, throughput, and cost efficiency.

  • Optimized inference systems running reliably in production.

  • Research ideas successfully translated into deployable systems.

  • Clear benchmarks and documentation that inform product decisions.

Relevant Research Areas (Bonus)

  • Long-context inference optimization

  • Speculative decoding

  • KV-cache compression and paging

  • Efficient decoding strategies

  • Hardware-aware inference design