Systems Tooling & Infrastructure - Member of Technical Staff
Callosum
Other Engineering, IT
London, UK
Location: London
Employment Type: Full time
Location Type: On-site
Department: Intelligent Systems Engineering
About Us
Artificial intelligence scaled on a bet - that bigger models, more identical chips, and more data would keep delivering. As problems grow more complex and the requirements of intelligence more diverse, that bet is breaking down. The next era belongs to heterogeneous intelligence: diverse models on diverse chips, each with distinct strengths, co-evolving into systems of capability unreachable by any single model or accelerator.
Callosum is the Intelligent Systems company. We built the infrastructure to make that possible. Our co-evolution engine optimises simultaneously across workflows, agents, and silicon. We launched in early 2026, showing orders-of-magnitude improvements in performance and a shift in the cost-performance frontier that no single chip or model provider can match.
We believe intelligence comes from the system, not the model.
We are scientists and engineers solving what others consider impossible. If you thrive on hard problems and are energised by the scale of the challenge, we'd love to hear from you.
About the Role
Callosum believes that orders-of-magnitude improvements in AI systems will come through application-aware orchestration across heterogeneous hardware. We are building that vision: infrastructure that treats the full landscape of compute, well beyond GPUs, as a unified, co-evolving system.
This role owns the developer experience of Callosum's stack, turning complex, low-level systems into something observable, debuggable, and usable by the rest of the team. You'll build the profiling, tracing, and developer tooling that defines how engineers interact with heterogeneous systems, enabling fast experimentation with new accelerators and complex inference workflows. You will own the abstractions, CLIs, and instrumentation that the engineering organisation is built on - primitives that don't yet exist for the next generation of compute infrastructure. As multi-stage and multi-agent workflows grow in complexity, your work is what keeps execution paths visible and tractable, ensuring the organisation can scale without losing insight or control.
What You’ll Build
Extend profiling and tracing tooling for new accelerators, including collection, compression, and visualisation of performance data
Develop CLI tools and automation wrappers that simplify common workflows - spinning up inference stacks, launching benchmarks, managing configurations
Convert prototypes of internal tooling into high-performance, scalable, accessible commands
Build tooling to support multi-agent serving workflows: request tracing across agent boundaries, pipeline visualisation, and debugging tools for complex inference DAGs
Create internal libraries and abstractions that let other teams move faster without reinventing shared infrastructure
What You Bring
Strong software engineering fundamentals: clean APIs, good error handling, sensible defaults, and clear documentation
Experience with profiling and tracing systems (perf, Nsight, Tracy, or similar) and a good sense of how to make trace data actionable rather than overwhelming
Familiarity with observability stacks (Prometheus, Grafana, OpenTelemetry, or equivalent) in varied infrastructure environments
Comfortable across the stack - from low-level trace collection to dashboards and developer-facing CLI tools