Own LLM Serving Runtime and KV Cache systems across scheduling, batching, profiling, and prototype hardware bring-up.
Responsibilities
- Serve as the core owner for LLM Serving Runtime and KV Cache capabilities, from planning and design to implementation.
- Build key mechanisms for online LLM inference runtime, including request scheduling, batching, KV Cache management, long-context support, and performance optimization.
- Drive end-to-end performance closure on real workloads, and improve the runtime's benchmarking, profiling, and performance-analysis methods.
- Work with compiler, kernel, and silicon architecture teams to define and connect critical interfaces across the execution stack.
- Advance bring-up, debugging, and iteration in prototype hardware environments until prototypes mature into stable systems.
- Capture reusable designs and engineering practices for long-term system evolution.
Requirements
- PhD in computer science, electronic engineering, automation, mathematics, computational science, or a related field.
- Strong systems-software foundation across operating systems, concurrency, memory management, distributed systems, or high-performance computing.
- Strong engineering skills, with the ability to independently design, implement, debug, and profile complex modules.
- Familiarity with at least one mainstream deep-learning or LLM inference stack such as PyTorch, CUDA, Triton, vLLM, SGLang, TensorRT-LLM, or DeepSpeed.
- Clear understanding of online LLM inference concepts including prefill / decode, KV Cache, attention, batching, long context, and multi-card deployment.
- Strong abstraction and collaboration skills, and a strong sense of ownership.