Own LLM Serving Runtime and KV Cache systems across scheduling, batching, profiling, and prototype hardware bring-up.

Serve as the core owner for LLM Serving Runtime and KV Cache capabilities, from planning and design to implementation.
Build key mechanisms for online LLM inference runtime, including request scheduling, batching, KV Cache management, long-context support, and performance optimization.
Close performance gaps end to end on real workloads, and improve the methods used for runtime benchmarking, profiling, and performance analysis.
Work with compiler, kernel, and silicon architecture teams to define and connect critical interfaces across the execution stack.
Advance bring-up, debugging, and iteration in prototype hardware environments so prototypes become stable systems.
Capture reusable designs and engineering practices for long-term system evolution.

PhD in computer science, electronic engineering, automation, mathematics, computational science, or a related field.
Strong systems-software foundation across operating systems, concurrency, memory management, distributed systems, or high-performance computing.
Strong engineering skills, with the ability to independently design, implement, debug, and profile complex modules.
Familiarity with at least one mainstream deep-learning or LLM inference stack such as PyTorch, CUDA, Triton, vLLM, SGLang, TensorRT-LLM, or DeepSpeed.
Clear understanding of online LLM inference concepts including prefill / decode, KV Cache, attention, batching, long context, and multi-card deployment.
Strong abstraction and collaboration skills, and a strong sense of ownership.

Postdoctoral Researcher - LLM Runtime