Staff Engineer, Inference Platform - Reliability

Callosum

London, UK

Posted on Apr 29, 2026

Location

London

Employment Type

Full time

Location Type

On-site

Department

Intelligent Systems Engineering

About Us

Artificial intelligence scaled on a bet - that bigger models, more identical chips, and more data would keep delivering. As problems grow more complex and the requirements of intelligence more diverse, that bet is breaking down. The next era belongs to heterogeneous intelligence: diverse models on diverse chips, each with distinct strengths, co-evolving into systems of capability unreachable by any single model or accelerator.

Callosum is the Intelligent Systems company. We built the infrastructure to make that possible. Our co-evolution engine optimises simultaneously across workflows, agents, and silicon. We launched in early 2026 showing orders-of-magnitude improvements in performance and a shift in the cost-performance frontier that no single chip or model provider can match.

We believe intelligence comes from the system, not the model.

We are scientists and engineers solving what others consider impossible. If you thrive on hard problems and are energised by the scale of the challenge, we'd love to hear from you.

About the Role

A platform is only as good as its worst hour. Every customer running production traffic on Callosum depends on the platform behaving predictably under load, degrading gracefully when something fails, and recovering quickly when something breaks. Owning that behaviour is what this role does.

You will own the operational integrity of the platform end-to-end: deployment, scaling, observability, how it responds to failure, and how it stays inside the latency and availability envelopes our enterprise customers depend on. As the platform grows by orders of magnitude over the coming year, the work shifts from building the operational foundation to defending it under conditions most teams never encounter.

Concretely, you will own Kubernetes-based deployment across the platform's services, the observability stack - metrics, tracing, alerting, SLOs - the cloud infrastructure underneath (AWS, infrastructure as code), the rollout and release process, and the incident response posture. You will set the on-call practice and the standards every engineer on the platform team is held to operationally.

You will partner directly with the Staff Engineer responsible for platform architecture, setting the technical direction of the platform. You will work closely with the hardware and orchestration teams to expose heterogeneous backends reliably through the platform.

Who we are looking for

This is a senior role and the bar is correspondingly specific. We're looking for someone who has done this kind of work before, at this kind of scale, and is ready to do it again with much more on the line.

Demonstrated experience operating distributed systems at production scale as the engineer responsible for their reliability, not adjacent to it.
Deep, hands-on Kubernetes experience - autoscaling, failure recovery, networking, multi-tenancy - with a demonstrable record of owning clusters.
Strong cloud infrastructure fundamentals on AWS, with infrastructure as code as a default rather than a project.
Strong fluency with observability: you have designed an observability stack from primitives.
You have run on-call for systems where outages have business consequences and learned from doing so.

What Stands Out

You have owned reliability for an LLM serving platform, agent runtime, or comparable high-throughput inference system handling billions of tokens or requests.
You have established SLO practice from scratch, run blameless incident reviews that changed how a team builds, or designed deployment systems that scaled without becoming the constraint.
Experience with developer tooling, or with AI-native developer products from either side of the interface.

What We Offer

Competitive salary, determined by skills and experience
Equity & Ownership
Private healthcare
We offer visa sponsorship and relocation benefits to hire the best in the world
We work in person at our London office. You'll have the tools, space and setup to do your best work, and if you have specific needs, just tell us

We're committed to building an inclusive workplace where everyone feels welcome, and believe in equal opportunities for all.