FuriosaAI

Software Engineer, Low Level Programming Interface & Runtime

Company

FuriosaAI

Role

Software Engineer, Low Level Programming Interface & Runtime

Job type

Full-time

Posted

Yesterday

Salary

Not disclosed by employer

Job description

ABOUT THE JOB

Designs and implements the low-level runtime stack that drives FuriosaAI's NPU hardware to its theoretical limits — from device driver interfaces and DMA-based I/O to kernel execution scheduling, multi-node inference, and embedded firmware.

RESPONSIBILITIES

  • Develops the low-level runtime responsible for DMA-based I/O operations and kernel execution scheduling, maximizing inference throughput while minimizing end-to-end latency.
  • Builds and optimizes asynchronous execution pipelines that orchestrate data movement and compute across the NPU hardware.
  • Enables multi-node inference by implementing foundational communication primitives, including RDMA-based data transfer for low-latency, high-bandwidth inter-node operations.
  • Develops embedded firmware (PERT) that runs on the NPU's integrated ARM core, managing on-device scheduling, synchronization, and hardware resource control.
  • Profiles and tunes system-level performance across the full runtime stack — from firmware to user-space — to eliminate bottlenecks in real-world inference workloads.

MINIMUM QUALIFICATIONS

  • Bachelor's degree in Computer Science, Electrical Engineering, or equivalent work experience.
  • Strong communication skills for cross-team requirement gathering and technical alignment.
  • 3+ years of systems programming experience in Rust, C, or C++.
  • Solid understanding of computer architecture and OS fundamentals: memory hierarchy, cache coherency, DMA, interrupts, and MMIO.

PREFERRED QUALIFICATIONS

  • Deep expertise in low-latency runtime systems, embedded firmware development, or high-performance I/O — especially in the context of accelerator hardware.
  • Experience designing and implementing low-latency asynchronous execution models and scheduling systems.
  • Experience with DMA engines, scatter-gather I/O, or other zero-copy data transfer mechanisms.
  • Experience developing embedded firmware for ARM-based processors (bare-metal or lightweight RTOS environments).
  • Familiarity with RDMA technologies and high-performance networking for distributed or multi-node systems.
  • Experience with CUDA low-level runtime internals such as CUDA Graphs, stream-based execution, and asynchronous kernel launch optimization.
  • Experience with kernel-level performance optimizations (e.g., Linux kernel modules, eBPF, perf, ftrace).
  • Understanding of deep learning inference workloads and their hardware execution characteristics.
  • Experience with profiling and performance tuning of system software on accelerator or SoC platforms.

CONTACT

  • recruit@furiosa.ai