Netflix Distributed Systems Engineer (Compute Runtime) — Study Program
Target Role: L5/L6 Distributed Systems Engineer, Compute Runtime Team
Compensation Range: $499,000 – $900,000
Location: USA Remote
Date Created: April 6, 2026
What This Team Does: The Compute Runtime team owns Netflix's Kubernetes data plane — the kubelet, container runtime (containerd),
and the entire node-level stack that runs every Netflix workload on AWS. You'd be building and customizing the software that actually
executes containers, not just deploying apps on top of K8s.
1. Role Analysis & Gap Assessment
What They Want (Prioritized)
| Requirement | Priority | Your Status |
| Container runtime internals (containerd, runc, NRI plugins, kubelet) | Critical | Gap |
| Go proficiency | Critical | Gap |
| Linux system performance debugging | Critical | Partial (have linux_sysadmin) |
| Large-scale distributed systems design | High | Strong |
| Kubernetes architecture & operations | High | Strong |
| Networking (TCP, IPv4, sockets, container networking) | High | Strong |
| Docker / container fundamentals | High | Strong |
| Operational troubleshooting at scale | High | Have |
| Open source contributions | Preferred | Gap |
| Linux kernel development | Preferred | Gap |
| AI/ML compute infrastructure | Preferred | Gap |
Good News: Your existing study library covers distributed systems, K8s architecture, Docker, networking, and system design very well.
The gaps are deep but narrow: Go + container runtime internals + Linux perf — all learnable in a focused program.
2. Study Program Overview
Phase 1 (Weeks 1-4) Phase 2 (Weeks 5-10) Phase 3 (Weeks 11-16) Phase 4 (Weeks 17-22)
┌─────────────────┐ ┌──────────────────────────┐ ┌──────────────────────────┐ ┌──────────────────────────┐
│ Go Language │────▶│ Container Runtime │────▶│ KaaS Portal Project │────▶│ Open Source + Interview │
│ Fundamentals │ │ Internals & K8s Data │ │ (Multi-Cloud K8s) │ │ Prep & Contributions │
│ + Concurrency │ │ Plane Deep Dive │ │ Go + K8s API + Clouds │ │ │
└─────────────────┘ └──────────────────────────┘ └──────────────────────────┘ └──────────────────────────┘
│ │ │ │
Linux Perf ──────────── Linux Perf ──────────────── Linux Perf ──────────────────── Linux Perf
(ongoing thread) (ongoing) (ongoing) (ongoing)
3. Phase 1 — Go Language & Linux Performance (Weeks 1-4)
Go is the language of Kubernetes, containerd, and much of the CNCF ecosystem. This isn't about learning syntax — it's about thinking in Go's concurrency model and understanding how K8s itself is written.
Week 1: Go Fundamentals
- Install Go, set up workspace and IDE (GoLand or VS Code + gopls)
- Types, structs, interfaces, embedding — Go's composition model
- Pointers, slices, maps — understand memory layout
- Error handling patterns (no exceptions — error interface, fmt.Errorf, wrapping)
- Packages, modules, go mod, dependency management
- Build: CLI tool that parses /proc filesystem to show container resource usage
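The Week 1 build can start small. This sketch reads a process's /proc/[pid]/status file and pulls out numeric fields such as VmRSS; the parsing is split into its own function so it can be tested without a live /proc. The field names are real /proc keys, but the tool itself is just an illustrative starting point.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseStatus extracts numeric fields (e.g. VmRSS in kB, Threads) from the
// contents of a /proc/<pid>/status file. Non-numeric fields are skipped.
func parseStatus(contents string) map[string]int64 {
	fields := map[string]int64{}
	for _, line := range strings.Split(contents, "\n") {
		key, rest, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		parts := strings.Fields(rest)
		if len(parts) == 0 {
			continue
		}
		if v, err := strconv.ParseInt(parts[0], 10, 64); err == nil {
			fields[key] = v
		}
	}
	return fields
}

func main() {
	// On Linux, read this process's own status file; the same code works
	// for any PID visible under /proc (including containerized ones).
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Fprintln(os.Stderr, "/proc unavailable (not Linux?):", err)
		return
	}
	fields := parseStatus(string(data))
	fmt.Printf("VmRSS: %d kB, Threads: %d\n", fields["VmRSS"], fields["Threads"])
}
```

Extending this to walk every PID in a container's cgroup is a natural next step.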
Week 2: Concurrency Deep Dive
- Goroutines and the Go scheduler (M:N threading, GOMAXPROCS)
- Channels — buffered, unbuffered, directional, select
- sync package — Mutex, RWMutex, WaitGroup, Once, Pool
- Context package — cancellation, timeouts, value propagation
- Race detector (go run -race)
- Build: Concurrent container health checker that monitors multiple containers with timeouts
Week 3: Systems Programming in Go
- syscall and x/sys/unix packages
- Working with Linux namespaces from Go (clone, unshare, setns)
- cgroups v2 manipulation from Go
- File I/O, os/exec for process management
- net package — TCP/UDP servers, Unix domain sockets
- Build: Minimal container runtime in Go (namespaces + cgroups + chroot)
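Before manipulating namespaces, it helps to read where a process already sits in the cgroup hierarchy. This sketch (Linux paths assumed) parses the cgroups v2 entry from /proc/[pid]/cgroup and shows where the accounting files live under /sys/fs/cgroup:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// cgroupV2Path extracts the unified-hierarchy path from the contents of
// /proc/<pid>/cgroup. Under cgroups v2 the file holds a line of the form
// "0::/some/path"; v1 controller lines ("12:cpu:/...") are ignored.
func cgroupV2Path(contents string) (string, bool) {
	for _, line := range strings.Split(strings.TrimSpace(contents), "\n") {
		parts := strings.SplitN(line, ":", 3)
		if len(parts) == 3 && parts[0] == "0" && parts[1] == "" {
			return parts[2], true
		}
	}
	return "", false
}

func main() {
	data, err := os.ReadFile("/proc/self/cgroup")
	if err != nil {
		fmt.Fprintln(os.Stderr, "/proc unavailable (not Linux?):", err)
		return
	}
	path, ok := cgroupV2Path(string(data))
	if !ok {
		fmt.Println("not on a cgroups v2 (unified) hierarchy")
		return
	}
	// Resource accounting lives in files under /sys/fs/cgroup/<path>;
	// memory.current holds current memory usage in bytes.
	if mem, err := os.ReadFile(filepath.Join("/sys/fs/cgroup", path, "memory.current")); err == nil {
		fmt.Printf("cgroup %s memory.current = %s", path, mem)
	}
}
```

Writing to files like cpu.max in a cgroup you create is the enforcement half of the same mechanism.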
Week 4: Go Patterns for K8s Development
- client-go library — informers, listers, workqueues
- Controller pattern and reconciliation loops
- Code generation — deepcopy, client, informer generators
- Testing in Go — table-driven tests, mocks, integration tests
- Profiling — pprof, trace, benchmarks
- Build: Simple K8s controller that watches Pods and logs lifecycle events
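client-go's informer/workqueue machinery boils down to one idea worth internalizing before touching the real library: reconcile by key against desired state, never by event payload. A deliberately simplified, dependency-free sketch — the two maps stand in for the API server and the cluster:

```go
package main

import "fmt"

// reconcile drives actual toward desired for one object key. Real
// controllers re-read desired state from the API server here. Events carry
// only the key, never a payload, so the loop is level-triggered: replaying
// or reordering events is safe.
func reconcile(key string, desired, actual map[string]string) {
	want, ok := desired[key]
	if !ok {
		delete(actual, key) // object was deleted: clean up
		return
	}
	actual[key] = want // create or update toward desired state
}

func main() {
	desired := map[string]string{"default/web": "running"}
	actual := map[string]string{}
	// Duplicate and stale events are harmless — reconcile is idempotent.
	for _, key := range []string{"default/web", "default/web", "default/gone"} {
		reconcile(key, desired, actual)
	}
	fmt.Println(len(actual), actual["default/web"]) // 1 running
}
```

The real controller adds informers (cached watches), a rate-limited workqueue, and retries around exactly this loop.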
Key Resources
| Resource | Type | Focus |
| The Go Programming Language (Donovan & Kernighan) | Book | Comprehensive foundation |
| Concurrency in Go (Katherine Cox-Buday) | Book | Deep concurrency patterns |
| Let's Go & Let's Go Further (Alex Edwards) | Book | Production Go patterns |
| Go by Example (gobyexample.com) | Web | Quick reference |
| Effective Go & Go Blog | Web | Idiomatic patterns |
| Kubernetes source code (k8s.io/kubernetes) | Code | Real-world Go at scale |
Linux Performance — Ongoing Thread
Netflix explicitly requires "Linux system performance debugging capability." This is Brendan Gregg territory — he spent years doing exactly this work at Netflix and literally wrote the book on it.
Core Skills to Build
| Tool / Area | What to Learn | Why It Matters for This Role |
| perf | CPU profiling, flame graphs, perf stat, perf record | Profiling container workloads, finding hot paths in kubelet |
| eBPF / bpftrace | Tracing syscalls, kernel functions, custom probes | Dynamic tracing of container runtime behavior without restarting |
| strace / ltrace | Syscall tracing, latency analysis | Debugging container startup failures, I/O issues |
| cgroups v2 | Resource accounting, limits, CPU/memory/IO controllers | This is literally how containers enforce resource limits |
| /proc and /sys | Process info, cgroup hierarchies, network stats | Diagnosing container resource issues at the source |
| Flame graphs | Generating and reading CPU, off-CPU, memory flame graphs | Visual performance analysis of container workloads |
| Networking tools | ss, ip, tc, conntrack, tcpdump | Debugging container networking, CNI issues, service mesh overhead |
Key Resources
- Systems Performance, 2nd Ed (Brendan Gregg) — The bible
- BPF Performance Tools (Brendan Gregg) — eBPF deep dive
- brendangregg.com — Free articles, methodologies, checklists
- Linux Observability with BPF (Calavera & Fontana)
Practice Method: Spin up containers with deliberate performance issues (CPU hog, memory leak, I/O contention) and use these tools to diagnose them. Document your debugging process as you go.
4. Phase 2 — Container Runtime Internals & K8s Data Plane (Weeks 5-10)
This is the core of the Netflix role. You need to understand not just how to use containers, but how they actually work at every layer.
Kubernetes Node (Data Plane)
┌──────────────────────────────────────────────────────────────────────┐
│ │
│ ┌──────────┐ CRI gRPC ┌────────────┐ │
│ │ kubelet │──────────────────▶│ containerd │ │
│ │ │ │ │ │
│ │ - Pod │ │ - Images │ OCI Runtime │
│ │ lifecycle │ - Snapshots│───────────────┐ │
│ │ - Volume │ NRI Interface │ - Content │ │ │
│ │ mgmt │ │ │ store │ ┌─────▼──┐ │
│ │ - Device │ ▼ │ - Tasks │ │ runc │ │
│ │ plugins│ ┌─────────┐ │ - NRI host │ │ │ │
│ │ - cAdvisor│ │NRI Plugin│ └────────────┘ │ creates│ │
│ └──────────┘ │(your │ │ & runs │ │
│ │ code) │ │ the │ │
│ └─────────┘ │container│ │
│ └────────┘ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Linux Kernel │ │
│ │ namespaces | cgroups v2 | seccomp | AppArmor | eBPF │ │
│ └───────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
Weeks 5-6: OCI & containerd
- OCI Runtime Specification — config.json, rootfs, lifecycle hooks
- OCI Image Specification — layers, manifests, image indexes
- runc source code walkthrough — how it creates namespaces, sets up cgroups, pivots root, execs the process
- containerd architecture — GRPC API, plugins, shim v2, snapshotter, content store
- CRI (Container Runtime Interface) — how kubelet talks to containerd
- Hands-on: Use ctr and crictl to pull images, create containers, inspect namespaces directly
- Code Read: Walk through containerd's container creation path in the source
Weeks 7-8: Kubelet Internals
- Kubelet source code structure — pkg/kubelet/
- Pod lifecycle management — SyncPod, pod workers, status manager
- Container runtime manager (genericRuntimeManager)
- Volume management — attach, mount, unmount, detach lifecycle
- Device plugin framework — GPU, FPGA, custom device allocation
- Resource management — CPU manager, memory manager, topology manager
- cAdvisor integration — how resource metrics are collected
- Build: Custom kubelet plugin or admission webhook in Go
Week 9: NRI (Node Resource Interface)
- NRI specification and purpose — runtime-level hooks for resource management
- NRI plugin API — container create, update, stop events
- Existing NRI plugins — topology-aware scheduling, memory tiering, balloon
- How Netflix likely uses NRI — custom resource policies, workload isolation
- How NRI differs from admission webhooks (node-level vs API-level)
- Build: NRI plugin that enforces custom CPU pinning policy
Week 10: Container Networking & Security
- CNI (Container Network Interface) — how containers get network interfaces
- veth pairs, network namespaces, bridge networking
- iptables/nftables rules that kube-proxy creates
- Container security — seccomp profiles, AppArmor, SELinux, capabilities
- Image security — signing, scanning, admission policies
- Hands-on: Trace a packet from pod A to pod B, documenting every hop
Key Resources
| Resource | Type | Focus |
| containerd source (github.com/containerd/containerd) | Code | The runtime Netflix customizes |
| runc source (github.com/opencontainers/runc) | Code | The OCI reference runtime |
| Kubernetes kubelet source (k8s.io/kubernetes/pkg/kubelet) | Code | Data plane brain |
| NRI repo (github.com/containerd/nri) | Code | Plugin interface you'd extend |
| Container Security (Liz Rice) | Book | Practical container security from the ground up |
| Kubernetes in Action, 2nd Ed (Marko Lukša) | Book | Excellent K8s internals coverage |
| Programming Kubernetes (Hausenblas & Schimanski) | Book | Building K8s-native Go applications |
Netflix-Specific Context: Netflix runs entirely on AWS. Their Titus platform was their original container orchestrator.
They've been migrating to/integrating with Kubernetes. This team specifically maintains the data plane — the node-level
software. Read Netflix's tech blog posts about Titus and their K8s journey. Understanding their migration context will set you apart in interviews.
5. Phase 3 — Capstone Project: KaaS Portal (Weeks 11-16)
This project directly demonstrates the skills Netflix wants: Go proficiency, K8s API mastery, infrastructure automation, and systems thinking. It's also genuinely useful and fun.
Architecture
┌──────────────────────────────────────────────────────┐
│ KaaS Web Portal │
│ (Go backend + HTMX frontend) │
│ │
│ ┌─────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ Cluster │ │ Workload │ │ Observability │ │
│ │ Manager │ │ Deployer │ │ Dashboard │ │
│ └────┬─────┘ └────┬─────┘ └────────┬─────────┘ │
│ │ │ │ │
│ ┌────▼──────────────▼─────────────────▼──────────┐ │
│ │ Unified K8s Abstraction Layer │ │
│ │ (client-go, controller-runtime, dynamic) │ │
│ └─────┬──────────────┬────────────────┬──────────┘ │
│ │ │ │ │
└────────┼──────────────┼────────────────┼──────────────┘
│ │ │
┌────▼────┐ ┌─────▼─────┐ ┌─────▼─────┐
│ GKE │ │ EKS │ │ AKS │
│ (GCP) │ │ (AWS) │ │ (Azure) │
└─────────┘ └───────────┘ └───────────┘
Feature Roadmap
Weeks 11-12: Foundation
- Go HTTP server with chi or echo router
- Multi-cloud provider interface (Go interfaces for GKE/EKS/AKS)
- Cluster CRUD via cloud provider SDKs
- Kubeconfig management and secure storage
- HTMX frontend for dynamic UI without JS framework
- Authentication (OAuth2 / OIDC)
Weeks 13-14: K8s Integration
- Dynamic client for multi-cluster resource management
- Deploy workloads across clusters (Deployments, Services, Ingress)
- Real-time log streaming via K8s API
- Namespace and RBAC management
- Cost estimation per cluster/workload
- Custom K8s controller for managing KaaS resources (CRDs)
Weeks 15-16: Advanced Features
- Node pool management — scale up/down, instance types
- Cluster health monitoring and alerting (Prometheus metrics)
- Multi-cluster service mesh or federation
- Custom containerd configuration per cluster
- GPU node pool support (relevant to AI/ML compute)
- Terraform/Pulumi provider under the hood for IaC
Stretch Goals
- NRI plugin deployment management
- Custom kubelet configuration profiles
- eBPF-based node diagnostics dashboard
- Cluster upgrade orchestration
- Spot/preemptible instance workload scheduling
- Disaster recovery: cross-cloud cluster failover
Why This Project Is Perfect:
- Demonstrates Go proficiency in a real systems project
- Shows K8s API mastery via client-go and controllers
- Proves multi-cloud infrastructure experience
- The custom containerd/NRI features show runtime-level understanding
- It's a portfolio piece that doubles as a useful tool
- Open-source it to check the "OSS contributions" box
6. Phase 4 — Open Source Contributions & Interview Prep (Weeks 17-22)
Open Source Contribution Strategy
The job posting lists "open source project contribution history" as preferred. Target these repos:
| Project | Why | Entry Points |
| containerd/containerd | Directly relevant to the role | Bug fixes, test improvements, documentation, small features tagged "good first issue" |
| containerd/nri | NRI is explicitly in the job requirements | Example plugins, test coverage, documentation improvements |
| kubernetes/kubernetes | Direct relevance, especially sig-node | sig-node issues, kubelet improvements, test-infra |
| opencontainers/runc | OCI runtime they'd expect you to understand | Bug fixes, test improvements |
Contribution Approach: Start by joining Kubernetes Slack (sig-node channel). Attend sig-node meetings. Pick up issues
labeled good-first-issue or help-wanted. Even small, well-crafted PRs (test fixes, doc improvements) show
you understand the development workflow and codebase. Quality over quantity.
Interview Preparation
System Design Topics (Netflix Focus)
- Design a container orchestration platform (Titus-like)
- Design a multi-tenant Kubernetes cluster with resource isolation
- Design a container image distribution system for 100K+ nodes
- Design a node health monitoring and auto-remediation system
- Design a GPU scheduling system for ML workloads
- Design a container runtime with custom security policies
- Design a zero-downtime node upgrade system for K8s
Coding Interview Topics (Go)
- Implement a basic container runtime (namespaces, cgroups, rootfs)
- Implement a K8s controller with leader election
- Implement a rate limiter with workqueue
- Concurrent systems problems — producer/consumer, fan-out/fan-in
- Data structures and algorithms in Go (standard LeetCode + systems flavor)
- Debug a goroutine leak, deadlock, or race condition
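For the rate-limiter question, a from-scratch token bucket is a common expected answer. This is a sketch of that pattern, not client-go's workqueue limiter:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// TokenBucket is a minimal rate limiter: capacity tokens, refilled at
// rate tokens/second; Allow spends one token if available.
type TokenBucket struct {
	mu       sync.Mutex
	tokens   float64
	capacity float64
	rate     float64 // tokens added per second
	last     time.Time
}

func NewTokenBucket(capacity, rate float64) *TokenBucket {
	return &TokenBucket{tokens: capacity, capacity: capacity, rate: rate, last: time.Now()}
}

// Allow refills lazily based on elapsed time, then tries to spend a token.
func (b *TokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	// Burst of 2 allowed immediately, then refill at 1 request/second.
	tb := NewTokenBucket(2, 1)
	fmt.Println(tb.Allow(), tb.Allow(), tb.Allow()) // true true false
}
```

Lazy refill (computing tokens from elapsed time on each call) avoids a background goroutine, which is worth calling out in an interview.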
Behavioral / Culture (Netflix Specifics)
Netflix's culture is "context not control" — they emphasize:
- Freedom & Responsibility — autonomous decision-making
- High performance — "adequate performance gets a generous severance"
- Radical candor — direct feedback culture
- Context over control — explain why, not how
Prepare stories that demonstrate you thriving with autonomy, making high-judgment calls, and giving/receiving direct feedback.
7. Weekly Schedule Template
Assuming ~15-20 hours/week of study time:
| Day | Focus (2-3 hrs) | Activity |
| Monday | Go Programming | Read + exercises from current chapter/topic. Write code. |
| Tuesday | Container Runtime / K8s Internals | Source code reading, documentation, hands-on labs |
| Wednesday | Go Programming | Build project feature or solve problems in Go |
| Thursday | Linux Performance | Tool practice, debugging exercises, Brendan Gregg material |
| Friday | Project Work (KaaS Portal) | Implement features, write tests, push code |
| Saturday | Project Work + Open Source | Continue project or work on OSS contribution |
| Sunday | Review & System Design | Review week's material, practice one system design problem |
8. Netflix Tech Blog — Required Reading
These posts give you insight into how this team thinks and what they've built:
| Topic | What to Search For | Why |
| Titus | "Titus, the Netflix container management platform" | Their original container orchestrator — understand what they're migrating from |
| Container Runtime | "Netflix container runtime" and "Titus Executor" | Their custom runtime work — direct context for this role |
| Compute | "Auto Scaling Production Services on Titus" | How they think about compute at scale |
| Networking | "Networking in Titus" and container networking posts | Their approach to container networking on AWS |
| Performance | Brendan Gregg's Netflix posts on performance | Performance culture and tools they use |
| Linux | "Netflix and Linux" and kernel-related posts | Their Linux kernel customization approach |
9. Success Metrics — How to Know You're Ready
| Skill | You're Ready When You Can... |
| Go | Write a K8s controller from scratch, debug goroutine leaks with pprof, read K8s source code fluently |
| Container Runtime | Explain the full path from kubectl run to a process executing in a container, including every component involved. Write an NRI plugin. |
| Kubelet | Explain pod lifecycle from kubelet's perspective, how it communicates with containerd via CRI, and how the resource managers work |
| Linux Perf | Given a "containers are slow" report, systematically diagnose whether the issue is CPU, memory, I/O, or network using perf/eBPF/bpftrace |
| System Design | Design a container orchestration platform on a whiteboard, covering scheduling, networking, storage, security, and monitoring |
| Netflix Culture | Articulate how you work with high autonomy; give examples of high-impact decisions you made with incomplete information |
Timeline: 22 weeks (~5.5 months) at 15-20 hrs/week. You can compress this if you dedicate more time, or extend it if
you want to go deeper on any area. The KaaS project is the crown jewel — it gives you a concrete artifact to talk about in interviews
and doubles as a portfolio piece on GitHub.