FuriosaAI
Software Engineer, Low Level Programming Interface & Runtime
Job type
Full-time
Posted
Yesterday
Job description
ABOUT THE JOB
Designs and implements the low-level runtime stack that drives FuriosaAI's NPU hardware to its theoretical limits — from device driver interfaces and DMA-based I/O to kernel execution scheduling, multi-node inference, and embedded firmware.
RESPONSIBILITIES
- Develops the low-level runtime responsible for DMA-based I/O operations and kernel execution scheduling, maximizing inference throughput while minimizing end-to-end latency.
- Builds and optimizes asynchronous execution pipelines that orchestrate data movement and compute across the NPU hardware.
- Enables multi-node inference by implementing foundational communication primitives, including RDMA-based data transfer for low-latency, high-bandwidth inter-node operations.
- Develops embedded firmware (PERT) that runs on the NPU's integrated ARM core, managing on-device scheduling, synchronization, and hardware resource control.
- Profiles and tunes system-level performance across the full runtime stack — from firmware to user-space — to eliminate bottlenecks in real-world inference workloads.
MINIMUM QUALIFICATIONS
- Bachelor's degree in Computer Science, Electrical Engineering, or equivalent work experience.
- Strong communication skills for cross-team requirement gathering and technical alignment.
- 3+ years of systems programming experience in Rust, C, or C++.
- Solid understanding of computer architecture fundamentals: memory hierarchy, cache coherency, OS, DMA, interrupts, and MMIO.
PREFERRED QUALIFICATIONS
- Deep expertise in low-latency runtime systems, embedded firmware development, or high-performance I/O — especially in the context of accelerator hardware.
- Experience designing and implementing low-latency asynchronous execution models and scheduling systems.
- Experience with DMA engines, scatter-gather I/O, or other zero-copy data transfer mechanisms.
- Experience developing embedded firmware for ARM-based processors (bare-metal or lightweight RTOS environments).
- Familiarity with RDMA technologies and high-performance networking for distributed or multi-node systems.
- Experience with CUDA low-level runtime internals such as CUDA Graphs, stream-based execution, and asynchronous kernel launch optimization.
- Experience with kernel-level performance optimizations (e.g., Linux kernel modules, eBPF, perf, ftrace).
- Understanding of deep learning inference workloads and their hardware execution characteristics.
- Experience with profiling and performance tuning of system software on accelerator or SoC platforms.
CONTACT
- recruit@furiosa.ai