Netflix Distributed Systems Engineer (Compute Runtime) — Study Program

Target Role: L5/L6 Distributed Systems Engineer, Compute Runtime Team
Compensation Range: $499,000 – $900,000
Location: USA Remote
Date Created: April 6, 2026

What This Team Does: The Compute Runtime team owns Netflix's Kubernetes data plane — the kubelet, container runtime (containerd), and the entire node-level stack that runs every Netflix workload on AWS. You'd be building and customizing the software that actually executes containers, not just deploying apps on top of K8s.

1. Role Analysis & Gap Assessment

What They Want (Prioritized)

| Requirement | Priority | Your Status |
|---|---|---|
| Container runtime internals (containerd, runc, NRI plugins, kubelet) | Critical | Gap |
| Go proficiency | Critical | Gap |
| Linux system performance debugging | Critical | Partial (have linux_sysadmin) |
| Large-scale distributed systems design | High | Strong |
| Kubernetes architecture & operations | High | Strong |
| Networking (TCP, IPv4, sockets, container networking) | High | Strong |
| Docker / container fundamentals | High | Strong |
| Operational troubleshooting at scale | High | Have |
| Open source contributions | Preferred | Gap |
| Linux kernel development | Preferred | Gap |
| AI/ML compute infrastructure | Preferred | Gap |

Good News: Your existing study library covers distributed systems, K8s architecture, Docker, networking, and system design very well. The gaps are deep but focused: Go + container runtime internals + Linux perf. These are learnable in a focused program.

2. Study Program Overview

Phase 1 (Weeks 1-4)     Phase 2 (Weeks 5-10)          Phase 3 (Weeks 11-16)         Phase 4 (Weeks 17-22)
┌─────────────────┐     ┌──────────────────────────┐  ┌──────────────────────────┐  ┌──────────────────────────┐
│ Go Language     │────▶│ Container Runtime        │─▶│ KaaS Portal Project      │─▶│ Open Source + Interview  │
│ Fundamentals    │     │ Internals & K8s Data     │  │ (Multi-Cloud K8s)        │  │ Prep & Contributions     │
│ + Concurrency   │     │ Plane Deep Dive          │  │ Go + K8s API + Clouds    │  │                          │
└─────────────────┘     └──────────────────────────┘  └──────────────────────────┘  └──────────────────────────┘
         │                          │                             │                             │
    Linux Perf ─────────────── Linux Perf ──────────────── Linux Perf ───────────────── Linux Perf
  (ongoing thread)             (ongoing)                    (ongoing)                    (ongoing)

3. Phase 1 — Go Language & Linux Performance (Weeks 1-4)

Go Language Mastery

4 weeks

Go is the language of Kubernetes, containerd, and the entire CNCF ecosystem. This isn't about learning syntax — it's about thinking in Go's concurrency model and understanding how K8s itself is written.

Week 1: Go Fundamentals

  • Install Go, set up workspace and IDE (GoLand or VS Code + gopls)
  • Types, structs, interfaces, embedding — Go's composition model
  • Pointers, slices, maps — understand memory layout
  • Error handling patterns (no exceptions — error interface, fmt.Errorf, wrapping)
  • Packages, modules, go mod, dependency management
  • Build: CLI tool that parses /proc filesystem to show container resource usage
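The Week 1 build can start small. Here is a minimal, stdlib-only sketch of the /proc-parsing idea: it pulls selected fields out of a status file. Reading /proc/self/status keeps the example self-contained; pointing it at a container's init process (/proc/&lt;pid&gt;/status) is the real exercise.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// readStatus parses a /proc/<pid>/status file and returns the requested fields.
func readStatus(path string, keys ...string) (map[string]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	want := make(map[string]bool, len(keys))
	for _, k := range keys {
		want[k] = true
	}

	out := make(map[string]string)
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Lines look like "VmRSS:      1234 kB".
		k, v, ok := strings.Cut(sc.Text(), ":")
		if ok && want[k] {
			out[k] = strings.TrimSpace(v)
		}
	}
	return out, sc.Err()
}

func main() {
	// Inspect the current process; for a container, pass the PID of its
	// init process (e.g. /proc/12345/status) instead of "self".
	fields, err := readStatus("/proc/self/status", "Name", "VmRSS", "Threads")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for k, v := range fields {
		fmt.Printf("%s = %s\n", k, v)
	}
}
```

From here the tool can grow toward the actual goal: walk /proc for all PIDs, group them by cgroup (from /proc/&lt;pid&gt;/cgroup), and sum per-container resource usage.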

Week 2: Concurrency Deep Dive

  • Goroutines and the Go scheduler (M:N threading, GOMAXPROCS)
  • Channels — buffered, unbuffered, directional, select
  • sync package — Mutex, RWMutex, WaitGroup, Once, Pool
  • Context package — cancellation, timeouts, value propagation
  • Race detector (go run -race)
  • Build: Concurrent container health checker that monitors multiple containers with timeouts

Week 3: Systems Programming in Go

  • syscall and x/sys/unix packages
  • Working with Linux namespaces from Go (clone, unshare, setns)
  • cgroups v2 manipulation from Go
  • File I/O, os/exec for process management
  • net package — TCP/UDP servers, Unix domain sockets
  • Build: Minimal container runtime in Go (namespaces + cgroups + chroot)

Week 4: Go Patterns for K8s Development

  • client-go library — informers, listers, workqueues
  • Controller pattern and reconciliation loops
  • Code generation — deepcopy, client, informer generators
  • Testing in Go — table-driven tests, mocks, integration tests
  • Profiling — pprof, trace, benchmarks
  • Build: Simple K8s controller that watches Pods and logs lifecycle events
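Before reaching for client-go, it helps to see the controller pattern with the machinery stripped away. This stdlib-only toy uses a channel as a stand-in for a workqueue and plain maps as the object store; none of this is client-go's actual API, just the shape its informers and workqueues give you.

```go
package main

import (
	"fmt"
	"sync"
)

// store holds desired vs observed state; reconcile converges one key.
type store struct {
	mu      sync.Mutex
	desired map[string]int // desired replica count per "namespace/name" key
	actual  map[string]int
}

func (s *store) reconcile(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	want, ok := s.desired[key]
	if !ok {
		delete(s.actual, key) // object was deleted: clean up
		return
	}
	s.actual[key] = want // converge actual toward desired
}

func main() {
	s := &store{
		desired: map[string]int{"default/web": 3, "default/batch": 1},
		actual:  map[string]int{},
	}

	queue := make(chan string, 16)
	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // a single worker, like a workqueue consumer
		defer wg.Done()
		for key := range queue {
			s.reconcile(key)
		}
	}()

	// Informer event handlers would enqueue keys like this; duplicate
	// enqueues are harmless because reconcile is idempotent.
	for _, key := range []string{"default/web", "default/batch", "default/web"} {
		queue <- key
	}
	close(queue)
	wg.Wait()

	fmt.Println("actual:", s.actual) // prints: actual: map[default/batch:1 default/web:3]
}
```

The two properties worth internalizing here, idempotent reconciliation and level-triggered (not edge-triggered) processing, carry over directly to real controllers.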

Key Resources

| Resource | Type | Focus |
|---|---|---|
| The Go Programming Language (Donovan & Kernighan) | Book | Comprehensive foundation |
| Concurrency in Go (Katherine Cox-Buday) | Book | Deep concurrency patterns |
| Let's Go & Let's Go Further (Alex Edwards) | Book | Production Go patterns |
| Go by Example (gobyexample.com) | Web | Quick reference |
| Effective Go & the Go Blog | Web | Idiomatic patterns |
| Kubernetes source code (k8s.io/kubernetes) | Code | Real-world Go at scale |

Linux Performance (Ongoing Thread)

Ongoing through all phases

Netflix explicitly requires "Linux system performance debugging capability." This is Brendan Gregg territory: he spent years doing exactly this work at Netflix and wrote the standard reference, Systems Performance.

Core Skills to Build

| Tool / Area | What to Learn | Why It Matters for This Role |
|---|---|---|
| perf | CPU profiling, flame graphs, perf stat, perf record | Profiling container workloads, finding hot paths in kubelet |
| eBPF / bpftrace | Tracing syscalls, kernel functions, custom probes | Dynamic tracing of container runtime behavior without restarting |
| strace / ltrace | Syscall tracing, latency analysis | Debugging container startup failures, I/O issues |
| cgroups v2 | Resource accounting, limits, CPU/memory/IO controllers | This is literally how containers enforce resource limits |
| /proc and /sys | Process info, cgroup hierarchies, network stats | Diagnosing container resource issues at the source |
| Flame graphs | Generating and reading CPU, off-CPU, and memory flame graphs | Visual performance analysis of container workloads |
| Networking tools | ss, ip, tc, conntrack, tcpdump | Debugging container networking, CNI issues, service mesh overhead |

Key Resources

Practice Method: Spin up containers with deliberate performance issues (CPU hog, memory leak, I/O contention) and use these tools to diagnose them. Document your debugging process as you go.
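One stdlib-only way to practice the profiling half of this loop: a Go program with a deliberate hot loop that records its own CPU profile via runtime/pprof, ready to open with `go tool pprof` and turn into a flame graph.

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
)

// hotLoop is a deliberate CPU hog: the kind of hot path you would then
// locate with go tool pprof and a flame graph.
func hotLoop(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i * i % 7
	}
	return sum
}

func main() {
	f, err := os.Create("cpu.pprof")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	result := hotLoop(50_000_000)
	pprof.StopCPUProfile()

	fmt.Println("result:", result, "- profile written to cpu.pprof")
	// Inspect with: go tool pprof -top cpu.pprof
}
```

The same binary makes a good target for the external tools too: profile it with `perf record -g`, then render a flame graph and confirm hotLoop dominates.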

4. Phase 2 — Container Runtime Internals & K8s Data Plane (Weeks 5-10)

Container Runtime Deep Dive

6 weeks

This is the core of the Netflix role. You need to understand not just how to use containers, but how they actually work at every layer.

Kubernetes Node (Data Plane)

┌────────────┐    CRI gRPC     ┌─────────────┐
│  kubelet   │────────────────▶│ containerd  │
│            │                 │             │
│ - Pod      │                 │ - Images    │  OCI Runtime   ┌─────────┐
│   lifecycle│                 │ - Snapshots │───────────────▶│  runc   │
│ - Volume   │  NRI Interface  │ - Content   │                │ creates │
│   mgmt     │       │         │   store     │                │ & runs  │
│ - Device   │       ▼         │ - Tasks     │                │ the     │
│   plugins  │  ┌───────────┐  │ - NRI host  │                │container│
│ - cAdvisor │  │NRI Plugin │  └─────────────┘                └─────────┘
└────────────┘  │(your code)│
                └───────────┘
┌──────────────────────────────────────────────────────────────────────┐
│                             Linux Kernel                             │
│           namespaces | cgroups v2 | seccomp | AppArmor | eBPF        │
└──────────────────────────────────────────────────────────────────────┘

Weeks 5-6: OCI & containerd

  • OCI Runtime Specification — config.json, rootfs, lifecycle hooks
  • OCI Image Specification — layers, manifests, image indexes
  • runc source code walkthrough — how it creates namespaces, sets up cgroups, pivots root, execs the process
  • containerd architecture — gRPC API, plugins, shim v2, snapshotters, content store
  • CRI (Container Runtime Interface) — how kubelet talks to containerd
  • Hands-on: Use ctr and crictl to pull images, create containers, inspect namespaces directly
  • Code Read: Walk through containerd's container creation path in the source
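For orientation, this is roughly what an OCI runtime config.json looks like, heavily trimmed; `runc spec` generates the full default version, and the Runtime Specification defines every field.

```json
{
  "ociVersion": "1.0.2",
  "process": {
    "terminal": false,
    "user": { "uid": 0, "gid": 0 },
    "args": ["sh"],
    "cwd": "/"
  },
  "root": { "path": "rootfs", "readonly": true },
  "hostname": "demo",
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "mount" },
      { "type": "uts" },
      { "type": "network" }
    ]
  }
}
```

A useful exercise: generate the default with `runc spec`, diff it against this trimmed version, and map each extra section (capabilities, mounts, cgroup resources, seccomp) to the kernel feature it configures.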

Weeks 7-8: Kubelet Internals

  • Kubelet source code structure — pkg/kubelet/
  • Pod lifecycle management — SyncPod, pod workers, status manager
  • Container runtime manager (genericRuntimeManager)
  • Volume management — attach, mount, unmount, detach lifecycle
  • Device plugin framework — GPU, FPGA, custom device allocation
  • Resource management — CPU manager, memory manager, topology manager
  • cAdvisor integration — how resource metrics are collected
  • Build: Custom kubelet plugin or admission webhook in Go

Week 9: NRI (Node Resource Interface)

  • NRI specification and purpose — runtime-level hooks for resource management
  • NRI plugin API — container create, update, stop events
  • Existing NRI plugins — topology-aware scheduling, memory tiering, balloon
  • How Netflix likely uses NRI — custom resource policies, workload isolation
  • Build: NRI plugin that enforces custom CPU pinning policy
  • How NRI differs from admission webhooks (node-level vs API-level)

Week 10: Container Networking & Security

  • CNI (Container Network Interface) — how containers get network interfaces
  • veth pairs, network namespaces, bridge networking
  • iptables/nftables rules that kube-proxy creates
  • Container security — seccomp profiles, AppArmor, SELinux, capabilities
  • Image security — signing, scanning, admission policies
  • Hands-on: Trace a packet from pod A to pod B, documenting every hop

Key Resources

| Resource | Type | Focus |
|---|---|---|
| containerd source (github.com/containerd/containerd) | Code | The runtime Netflix customizes |
| runc source (github.com/opencontainers/runc) | Code | The OCI reference runtime |
| Kubernetes kubelet source (k8s.io/kubernetes/pkg/kubelet) | Code | Data plane brain |
| NRI repo (github.com/containerd/nri) | Code | Plugin interface you'd extend |
| Container Security (Liz Rice) | Book | Practical container security from the ground up |
| Kubernetes in Action, 2nd Ed (Marko Lukša) | Book | Excellent K8s internals coverage |
| Programming Kubernetes (Hausenblas & Schimanski) | Book | Building K8s-native Go applications |

Netflix-Specific Context: Netflix runs entirely on AWS. Their Titus platform was their original container orchestrator, and they've been migrating to/integrating with Kubernetes. This team specifically maintains the data plane — the node-level software. Read Netflix's tech blog posts about Titus and their K8s journey; understanding their migration context will set you apart in interviews.

5. Phase 3 — Capstone Project: KaaS Portal (Weeks 11-16)

Kubernetes-as-a-Service Multi-Cloud Portal

6 weeks

This project directly demonstrates the skills Netflix wants: Go proficiency, K8s API mastery, infrastructure automation, and systems thinking. It's also genuinely useful and fun.

Architecture

┌──────────────────────────────────────────────────────┐
│                   KaaS Web Portal                    │
│             (Go backend + HTMX frontend)             │
│                                                      │
│  ┌─────────┐   ┌──────────┐   ┌──────────────────┐   │
│  │ Cluster │   │ Workload │   │  Observability   │   │
│  │ Manager │   │ Deployer │   │    Dashboard     │   │
│  └────┬────┘   └────┬─────┘   └────────┬─────────┘   │
│       │             │                  │             │
│  ┌────▼─────────────▼──────────────────▼─────────┐   │
│  │        Unified K8s Abstraction Layer          │   │
│  │   (client-go, controller-runtime, dynamic)    │   │
│  └────┬──────────────┬────────────────┬──────────┘   │
│       │              │                │              │
└───────┼──────────────┼────────────────┼──────────────┘
        │              │                │
   ┌────▼────┐   ┌─────▼─────┐    ┌─────▼─────┐
   │   GKE   │   │    EKS    │    │    AKS    │
   │  (GCP)  │   │   (AWS)   │    │  (Azure)  │
   └─────────┘   └───────────┘    └───────────┘

Feature Roadmap

Weeks 11-12: Foundation

  • Go HTTP server with chi or echo router
  • Multi-cloud provider interface (Go interfaces for GKE/EKS/AKS)
  • Cluster CRUD via cloud provider SDKs
  • Kubeconfig management and secure storage
  • HTMX frontend for dynamic UI without JS framework
  • Authentication (OAuth2 / OIDC)

Weeks 13-14: K8s Integration

  • Dynamic client for multi-cluster resource management
  • Deploy workloads across clusters (Deployments, Services, Ingress)
  • Real-time log streaming via K8s API
  • Namespace and RBAC management
  • Cost estimation per cluster/workload
  • Custom K8s controller for managing KaaS resources (CRDs)
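The CRD for the portal's own Cluster resource might start roughly like this; the group name and schema are illustrative, not prescribed:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: clusters.kaas.example.com   # hypothetical group for this project
spec:
  group: kaas.example.com
  scope: Namespaced
  names:
    kind: Cluster
    plural: clusters
    singular: cluster
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                provider:
                  type: string
                  enum: [gke, eks, aks]
                region:
                  type: string
                nodeCount:
                  type: integer
                  minimum: 1
```

Your custom controller then reconciles these objects against the cloud provider APIs, which ties the project back to the controller pattern practiced in Phase 1.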

Weeks 15-16: Advanced Features

  • Node pool management — scale up/down, instance types
  • Cluster health monitoring and alerting (Prometheus metrics)
  • Multi-cluster service mesh or federation
  • Custom containerd configuration per cluster
  • GPU node pool support (relevant to AI/ML compute)
  • Terraform/Pulumi provider under the hood for IaC

Stretch Goals

  • NRI plugin deployment management
  • Custom kubelet configuration profiles
  • eBPF-based node diagnostics dashboard
  • Cluster upgrade orchestration
  • Spot/preemptible instance workload scheduling
  • Disaster recovery: cross-cloud cluster failover

Why This Project Is Perfect:
  • Demonstrates Go proficiency in a real systems project
  • Shows K8s API mastery via client-go and controllers
  • Proves multi-cloud infrastructure experience
  • The custom containerd/NRI features show runtime-level understanding
  • It's a portfolio piece that doubles as a useful tool
  • Open-source it to check the "OSS contributions" box

6. Phase 4 — Open Source Contributions & Interview Prep (Weeks 17-22)

Open Source & Interview Preparation

6 weeks

Open Source Contribution Strategy

The job posting lists "open source project contribution history" as preferred. Target these repos:

| Project | Why | Entry Points |
|---|---|---|
| containerd/containerd | Directly relevant to the role | Bug fixes, test improvements, documentation, small features tagged "good first issue" |
| containerd/nri | NRI is explicitly in the job requirements | Example plugins, test coverage, documentation improvements |
| kubernetes/kubernetes | Direct relevance, especially sig-node | sig-node issues, kubelet improvements, test-infra |
| opencontainers/runc | OCI runtime they'd expect you to understand | Bug fixes, test improvements |

Contribution Approach: Start by joining Kubernetes Slack (sig-node channel). Attend sig-node meetings. Pick up issues labeled good-first-issue or help-wanted. Even small, well-crafted PRs (test fixes, doc improvements) show you understand the development workflow and codebase. Quality over quantity.

Interview Preparation

System Design Topics (Netflix Focus)

Coding Interview Topics (Go)

Behavioral / Culture (Netflix Specifics)

Netflix's culture is "context not control" — they emphasize:
  • Freedom & Responsibility — autonomous decision-making
  • High performance — "adequate performance gets a generous severance"
  • Radical candor — direct feedback culture
  • Context over control — explain why, not how
Prepare stories that demonstrate you thriving with autonomy, making high-judgment calls, and giving/receiving direct feedback.

7. Weekly Schedule Template

Assuming ~15-20 hours/week of study time:

| Day | Focus (2-3 hrs) | Activity |
|---|---|---|
| Monday | Go Programming | Read + exercises from current chapter/topic. Write code. |
| Tuesday | Container Runtime / K8s Internals | Source code reading, documentation, hands-on labs |
| Wednesday | Go Programming | Build project feature or solve problems in Go |
| Thursday | Linux Performance | Tool practice, debugging exercises, Brendan Gregg material |
| Friday | Project Work (KaaS Portal) | Implement features, write tests, push code |
| Saturday | Project Work + Open Source | Continue project or work on OSS contribution |
| Sunday | Review & System Design | Review week's material, practice one system design problem |

8. Netflix Tech Blog — Required Reading

These posts give you insight into how this team thinks and what they've built:

| Topic | What to Search For | Why |
|---|---|---|
| Titus | "Titus, the Netflix container management platform" | Their original container orchestrator — understand what they're migrating from |
| Container Runtime | "Netflix container runtime" and "Titus Executor" | Their custom runtime work — direct context for this role |
| Compute | "Auto Scaling Production Services on Titus" | How they think about compute at scale |
| Networking | "Networking in Titus" and container networking posts | Their approach to container networking on AWS |
| Performance | Brendan Gregg's Netflix posts on performance | Performance culture and tools they use |
| Linux | "Netflix and Linux" and kernel-related posts | Their Linux kernel customization approach |

9. Success Metrics — How to Know You're Ready

| Skill | You're Ready When You Can... |
|---|---|
| Go | Write a K8s controller from scratch, debug goroutine leaks with pprof, read K8s source code fluently |
| Container Runtime | Explain the full path from kubectl run to a process executing in a container, including every component involved. Write an NRI plugin. |
| Kubelet | Explain pod lifecycle from kubelet's perspective, how it communicates with containerd via CRI, how resource managers work |
| Linux Perf | Given a "containers are slow" report, systematically diagnose whether the issue is CPU, memory, I/O, or network using perf/eBPF/bpftrace |
| System Design | Design a container orchestration platform on a whiteboard, covering scheduling, networking, storage, security, and monitoring |
| Netflix Culture | Articulate how you work with high autonomy, give examples of high-impact decisions you made with incomplete information |

Timeline: 22 weeks (~5.5 months) at 15-20 hrs/week. You can compress this if you dedicate more time, or extend it if you want to go deeper on any area. The KaaS project is the crown jewel — it gives you a concrete artifact to talk about in interviews and doubles as a portfolio piece on GitHub.