KaaS Portal — Learn-by-Building Guide

This guide walks you through building a Kubernetes-as-a-Service portal from scratch. YOU write every line of code. Each milestone teaches Go concepts, K8s internals, and container runtime knowledge directly relevant to the Netflix Compute Runtime role.

The project is at ~/PyCharmProjects/kaas-portal. There's already a scaffold committed from an earlier session β€” you can reference it for ideas, rewrite it completely, or rm -rf it and start fresh. It's YOUR project.

How to use this guide: Each milestone has steps, concepts to understand, and "why this matters for Netflix" context. If you get stuck, ask Claude — but ask questions, don't ask for code. The hints (expandable sections) give you direction without giving you the answer.

Required Reading: Netflix's Compute Runtime Blog Posts

The job posting references two blog posts that describe work done by this exact team. Read both before you start building — they'll shape how you think about the project.

Blog Post 1: Noisy Neighbor Detection with eBPF

Source: Netflix Tech Blog — Noisy Neighbor Detection with eBPF (Sept 2024)

The Problem

Netflix runs Titus, their multi-tenant compute platform. Multiple containers share the same physical host. A "noisy neighbor" is a container (or system process) that hogs host resources — especially CPU — and degrades performance for other containers on the same machine.

Traditional tools like perf add significant overhead and are deployed after the problem is noticed. By then the noisy neighbor has moved on, or the profiling overhead makes things worse.

The eBPF Solution

They instrument run queue latency — the time a process sits waiting for a CPU after it becomes runnable. This uses three Linux scheduler hooks:

| Hook | When It Fires | What They Do |
| --- | --- | --- |
| sched_wakeup | Process transitions from sleeping → runnable | Record timestamp in a BPF hash map, keyed by PID |
| sched_wakeup_new | Newly created process becomes runnable | Same — record timestamp |
| sched_switch | CPU switches to running a different process | Look up the wakeup timestamp, compute delta = now - wakeup_time. That delta is the run queue latency. |

Container Attribution

They use the process's cgroup ID to map each scheduling event back to its container. This is critical: without it, you just have PID-level data. With cgroup mapping, you can say "container X is experiencing high run queue latency because container Y is causing CPU contention."

Key Subtlety: Throttling vs. Noisy Neighbor

A container exceeding its cgroup CPU limit gets throttled, which also appears as high run queue latency. You must distinguish throttling (self-inflicted) from actual noisy-neighbor preemption (caused by others). The team found that system processes (not just other containers) are often the real noisy neighbors.

Data Pipeline

BPF hooks (sched_wakeup, sched_wakeup_new, sched_switch)
    → BPF ring buffer (zero-copy, kernel→user)
    → Go userspace program (reads events, computes latency, maps to cgroup, emits metrics)
    → Atlas (Netflix metrics backend)

Performance Numbers

What to learn from this:

Blog Post 2: Mount Mayhem — Scaling Containers on Modern CPUs

Source: Netflix Tech Blog — Mount Mayhem at Netflix (Feb 2026)

The Problem

Netflix nodes stalled for tens of seconds when starting many containers concurrently. The mount table ballooned during startup because containerd executes thousands of bind mount operations when assembling multi-layer container images. A health check that reads the mount table would take 30+ seconds.

Root Cause: Kernel VFS Lock Contention

Almost all time was spent trying to grab a global kernel lock in the Linux Virtual Filesystem (VFS) layer. The hottest code path was path_init() — specifically a sequence lock (seqlock) that serializes mount table lookups.

When hundreds of containers start simultaneously, each needing dozens of mount operations, they all fight over this single lock.

How Container Overlay Filesystems Work (Background)

Container Image: 5 layers

    Layer 5 (app code)   ← writable (container's changes)
    Layer 4 (pip pkgs)   ← read-only
    Layer 3 (python)     ← read-only
    Layer 2 (apt pkgs)   ← read-only
    Layer 1 (ubuntu)     ← read-only

containerd uses overlayfs to stack these layers. Each layer requires bind mount operations → O(n) mounts per container. 100 containers × 20 layers = 2,000 mount operations → all hitting the same kernel lock.

CPU Microarchitecture Made It Worse

Using Intel's Topdown Microarchitecture Analysis (TMA), they found:

Hardware architecture mattered enormously:

| Instance Type | Architecture | Behavior Under Contention |
| --- | --- | --- |
| AWS r5.metal (older) | Dual-socket, NUMA, mesh cache coherence | Severe stalls — cross-socket cache line bouncing |
| AWS m7i.metal / m7a.24xlarge (newer) | Single-socket, distributed cache | Scaled smoothly, far less contention |

Disabling hyperthreading improved latency by up to 30%.

The Fix: O(n) → O(1) Mount Operations

Two approaches were considered:

  1. New kernel mount APIs (fsopen() / fsmount()) — these use file descriptors instead of path-based lookups, avoiding the global VFS lock. But requires newer kernels.
  2. Redesign overlay assembly — group layer mounts under a common parent, reducing mount operations from O(n) per layer to O(1) per container. This is what they shipped.
What to learn from this:

Milestone 0: Set Up Your Go Environment

Step 1: Verify Go is installed

Run go version. You should see Go 1.26.x (already installed via brew).

Step 2: Understand what changed since 2016-2018

You used Go before modules existed. Here's what's different now:

| Then (2016-2018) | Now (2026) | Why It Matters |
| --- | --- | --- |
| $GOPATH and vendor/ | go mod (modules) | No more GOPATH. Run go mod init <module-path> in any directory. Dependencies in go.mod. |
| No generics | Generics (Go 1.18+) | Type parameters: func Map[T any](s []T, f func(T) T) []T. Used in newer K8s libraries. |
| log.Printf | log/slog (Go 1.21+) | Structured logging in stdlib: slog.Info("msg", "key", value) |
| Basic http.ServeMux | Enhanced mux (Go 1.22+) | Method-based routing: mux.HandleFunc("GET /api/users/{id}", handler). No need for gorilla/chi for basic apps. |
| interface{} | any (alias, Go 1.18+) | any is just interface{}. Cleaner to read. |
| Manual context threading | context is everywhere | Every API call, every K8s client method, every DB query takes a context.Context as first arg. Non-negotiable pattern. |
| Error wrapping was manual | fmt.Errorf("...: %w", err) | The %w verb wraps errors. errors.Is() and errors.As() unwrap them. This is how K8s code handles errors. |

Step 3: Warm up: Write a small program

Before touching the KaaS project, write a standalone Go program that:

Why: This covers the exact patterns you'll use in the KaaS portal, in a throwaway sandbox.

Hint: Graceful shutdown pattern

The key pieces are: signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM) gives you a context that cancels on Ctrl+C. Start the HTTP server in a goroutine. Block on <-ctx.Done(). Then call server.Shutdown(timeoutCtx). This is the standard Go HTTP server pattern.

Milestone 1: Project Scaffold & Health Check

Goal: Get a running Go API server with proper project structure, structured logging, and a health endpoint.

Step 1: Initialize the project

You can either start from the existing scaffold at ~/PyCharmProjects/kaas-portal or wipe it and start fresh. If starting fresh:

Step 2: Choose your project layout

Go doesn't enforce a project structure, but the community convention for non-trivial projects is:

kaas-portal/
├── cmd/
│   └── kaas-portal/
│       └── main.go      ← entry point
├── internal/            ← private packages (can't be imported by other modules)
│   ├── api/             ← HTTP handlers, middleware
│   ├── config/          ← configuration loading
│   └── provider/        ← cloud provider implementations
│       ├── kind/
│       ├── aws/
│       └── gcp/
├── pkg/                 ← public packages (could be imported by others)
│   └── models/          ← domain types (Cluster, etc.)
├── go.mod
├── go.sum
├── Makefile
└── .gitignore

Why internal/? The Go compiler enforces that packages under internal/ can only be imported by code in the parent directory tree. This is a hard compiler guarantee, not just convention. It keeps your API surface intentional.

Step 3: Build these pieces yourself

Step 4: Test it

Run go run ./cmd/kaas-portal and curl http://localhost:8080/healthz.

Then write an actual Go test. Create internal/api/server_test.go:

Why: Go's testing story is built into the language (go test ./...). No test framework needed. Table-driven tests are the Go idiom β€” learn this pattern early.

Hint: Logging middleware approach

You need to capture the status code, but http.ResponseWriter doesn't expose it after WriteHeader(). The solution: wrap ResponseWriter in your own struct that records the status code. Then call next.ServeHTTP(yourWrapper, r).

Checkpoint: You have a running Go HTTP server with structured logging, a health check, and a test. Commit it.

Milestone 2: Domain Model & Provider Interface

Goal: Design the core abstractions — the Cluster model and the Provider interface that every cloud backend will implement.

Step 1: Design the Cluster model

Think about what a cluster is, provider-agnostically. At minimum:

Write this as a struct in pkg/models/cluster.go with JSON tags.

Why pkg/? This is your public API. If someone imported your module, they could use these types. The provider implementations in internal/ will create and return these models.

Step 2: Design the Provider interface

This is where Go's interface model shines. Define an interface with methods like:

Go Interfaces — Key Concept:

In Java/C#, you write class KindProvider implements Provider. In Go, you don't. If a type has the right methods, it is a Provider. The compiler checks this at the call site, not at the declaration. This is called structural typing (or "duck typing, but checked at compile time").

This matters for K8s: the entire K8s codebase is built on this pattern. The kubelet talks to containerd via the CRI interface. containerd talks to runc via the OCI runtime interface. Plugins everywhere β€” all using Go interfaces.

Step 3: Think about error handling

Every method returns error. Think about what errors mean for each operation:

For now, keep it simple — you can use errors.Is() and sentinel errors. Refine later.

Step 4: Wire the provider map into your server

Your server should accept a map[string]Provider. Add a GET /api/v1/providers endpoint that returns the list of registered provider names.

Hint: Compile-time interface check

Go won't tell you a type fails to implement an interface until you try to use it as one. To get early feedback, add this line to your provider file: var _ Provider = (*KindProvider)(nil). This asserts at compile time that KindProvider implements Provider.

Checkpoint: You have a Cluster model, a Provider interface, and your server knows about providers. No providers implement it yet — that's next. Commit.

Milestone 3: Kind Provider β€” Your First Real Cluster

Goal: Implement a provider that creates real local K8s clusters using Kind. By the end, you'll run curl -X POST .../clusters and get a real, working Kubernetes cluster.

This is where it starts connecting to the Netflix role. Kind creates clusters using containerd (the same container runtime Netflix customizes). When you create a Kind cluster, you're exercising the same kubelet → CRI → containerd → runc stack that the Compute Runtime team owns at Netflix.

Step 1: Understand what Kind does under the hood

Before writing code, understand what happens when Kind creates a cluster:

  1. Kind pulls a Docker image (kindest/node:v1.x.x) that contains kubelet, kubeadm, and containerd
  2. It runs that image as Docker containers — each container = one K8s node
  3. Inside each container, containerd runs as the container runtime and kubelet manages it via CRI
  4. kubeadm bootstraps the control plane
  5. Kind generates a kubeconfig pointing to the API server

This is containers running inside containers — the node is a Docker container, and the pods inside it use containerd. Understanding this nesting is key.

Step 2: Add the Kind library dependency

go get sigs.k8s.io/kind

Look at the sigs.k8s.io/kind/pkg/cluster package. The key type is cluster.Provider which has methods like Create(), Delete(), List(), and KubeConfig().

Step 3: Implement the Kind provider

Create internal/provider/kind/kind.go. You need:

Go Concurrency Concept — sync.RWMutex:

Your API server handles concurrent HTTP requests (each in its own goroutine). If two requests try to read/write the clusters map simultaneously, you'll get a data race. sync.RWMutex allows many concurrent readers (RLock) or exactly one writer (Lock), never both at once.

Use go run -race ./cmd/kaas-portal to run with the race detector — it will catch data races at runtime.
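A minimal sketch of an RWMutex-guarded cluster store — simplified to string values here, where your real map would hold *models.Cluster:

```go
package main

import (
	"fmt"
	"sync"
)

// clusterStore guards the in-memory cluster map. Many HTTP handler
// goroutines may read while creation goroutines write.
type clusterStore struct {
	mu       sync.RWMutex
	clusters map[string]string // name -> status (simplified)
}

func newClusterStore() *clusterStore {
	return &clusterStore{clusters: make(map[string]string)}
}

func (s *clusterStore) set(name, status string) {
	s.mu.Lock() // exclusive: blocks readers and other writers
	defer s.mu.Unlock()
	s.clusters[name] = status
}

func (s *clusterStore) get(name string) (string, bool) {
	s.mu.RLock() // shared: many readers may hold this at once
	defer s.mu.RUnlock()
	st, ok := s.clusters[name]
	return st, ok
}

func main() {
	s := newClusterStore()
	var wg sync.WaitGroup
	// Concurrent writers: safe because of the mutex. Run this under
	// `go run -race` to confirm.
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.set(fmt.Sprintf("c%d", i), "running")
		}(i)
	}
	wg.Wait()
	st, ok := s.get("c3")
	fmt.Println(st, ok) // running true
}
```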

Step 4: Add CRUD handlers

Wire up these endpoints:

| Method | Path | What It Does |
| --- | --- | --- |
| POST | /api/v1/clusters | Create cluster (takes JSON body) |
| GET | /api/v1/clusters | List all clusters (optional ?provider= filter) |
| GET | /api/v1/clusters/{id} | Get single cluster (include kubeconfig) |
| DELETE | /api/v1/clusters/{id} | Delete cluster |
| GET | /api/v1/clusters/{id}/kubeconfig | Get kubeconfig as YAML |

Step 5: Test it end-to-end

Start the server and create an actual cluster:

# Create a Kind cluster
curl -X POST http://localhost:8080/api/v1/clusters \
  -H "Content-Type: application/json" \
  -d '{"name": "test-1", "provider": "kind", "node_count": 1}'

# This will take 1-2 minutes. When it returns, you have a real K8s cluster.

# List clusters
curl http://localhost:8080/api/v1/clusters | jq

# Get kubeconfig and use it
curl http://localhost:8080/api/v1/clusters/kind-test-1/kubeconfig > /tmp/test-1.kubeconfig
kubectl --kubeconfig /tmp/test-1.kubeconfig get nodes

# Clean up
curl -X DELETE http://localhost:8080/api/v1/clusters/kind-test-1

Step 6: Explore what Kind created

While the cluster is running, investigate what's happening at the container runtime level:

# See the Docker containers Kind created (each is a K8s "node")
docker ps

# Exec into the node container and look at containerd
docker exec -it test-1-control-plane bash
  # Inside the node:
  crictl ps          # List containers via CRI (like kubectl for the runtime)
  crictl images      # List images in containerd
  cat /etc/containerd/config.toml   # containerd configuration
  ps aux | grep kubelet              # The kubelet process
  mount | head -30                   # See the mount table (remember the Mount Mayhem blog!)
  cat /proc/1/cgroup                 # cgroup of the init process
When you run crictl ps inside the Kind node, you're using the same CRI interface that the Netflix Compute Runtime team works on. When you look at /etc/containerd/config.toml, you're seeing the same configuration they customize. When you look at the mount table, you're seeing the same mount point explosion described in the "Mount Mayhem" blog post. This is the stack.

Checkpoint: You can create and destroy real K8s clusters via your API. You've looked inside a node and seen kubelet, containerd, the mount table, and cgroups. Commit.

Milestone 4: client-go β€” Talk to Your Clusters

Goal: Use the official Kubernetes Go client to query your clusters — get nodes, list pods, read namespaces. This is the library that kubelet, controllers, and operators are built on.

Step 1: Add the client-go dependency

go get k8s.io/client-go@latest

This is a large dependency. It pulls in the same code that powers kubectl and every K8s controller.

Step 2: Build a cluster info endpoint

Add GET /api/v1/clusters/{id}/info that:

Step 3: Add a pod listing endpoint

Add GET /api/v1/clusters/{id}/pods that lists pods across all namespaces (or filtered by ?namespace=).

Use clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})

Step 4: Deploy a workload through the API

Add POST /api/v1/clusters/{id}/deployments that creates a Deployment in the cluster. Accept a simple payload (image, replicas, name) and use client-go to create it.

Go Concept — Context Propagation:

Every client-go call takes a context.Context. Your HTTP handler gets one from r.Context(). Pass it through to the K8s client calls. If the HTTP client disconnects, the context cancels, and the K8s API call is abandoned. This is how Go propagates cancellation through entire call chains without try/catch/finally.

Hint: Building a clientset from kubeconfig string

The trick is that clientcmd normally reads from a file. To use an in-memory kubeconfig string, use clientcmd.RESTConfigFromKubeConfig([]byte(kubeconfigStr)). This gives you a *rest.Config that you pass to kubernetes.NewForConfig(config).

Checkpoint: Your portal can create clusters AND interact with them — listing nodes, pods, deploying workloads. You're writing the same client-go code that K8s controllers use. Commit.

Milestone 5: Async Cluster Creation (Go Concurrency)

Goal: Cluster creation takes minutes. Right now the API blocks. Fix this using Go's concurrency primitives β€” goroutines, channels, and status polling.

Step 1: Make creation async

Change CreateCluster to:

  1. Immediately return the cluster in "provisioning" status with a 202 Accepted
  2. Kick off the actual Kind creation in a goroutine
  3. The goroutine updates the cluster status to "running" or "failed" when done

Clients poll GET /api/v1/clusters/{id} to check status.

2Handle cancellation

What if someone deletes a cluster while it's still provisioning? Think about:

Step 3: Add a status event stream (stretch goal)

Add GET /api/v1/clusters/{id}/events using Server-Sent Events (SSE) to stream status updates in real-time.

Go Concurrency Patterns to Understand:

Checkpoint: Cluster creation is non-blocking. You understand goroutines, mutexes, and context cancellation at a practical level. Commit.

Milestone 6: EKS Provider (AWS)

Goal: Add a real cloud provider. EKS creation takes 10-15 minutes — your async pattern from Milestone 5 pays off here.

Step 1: Add the AWS SDK

go get github.com/aws/aws-sdk-go-v2 and the EKS, EC2, IAM, and STS service packages.

You used the AWS SDK in Go at Symantec — this will feel familiar, but the SDK has been rewritten (v2). The API design is different: it uses functional options and context everywhere.

Step 2: Implement the EKS provider

This is more complex than Kind. You'll need to:

Step 3: Handle the cost angle

EKS clusters cost money. Consider:

Netflix runs entirely on AWS. Their Titus/K8s platform manages thousands of EC2 instances. Understanding EKS, VPCs, IAM roles, and instance types is directly relevant. The EKS data plane (kubelet + containerd on each EC2 node) is essentially what the Compute Runtime team manages, but with Netflix's customizations.

Milestone 7: GKE Provider (GCP)

Goal: Add your second cloud provider. By now the Provider interface should make this clean — same interface, different backend.

Same pattern as EKS but with cloud.google.com/go/container. GKE clusters are faster to create (~5 min) and have a simpler IAM model.

The real learning here is: does your Provider interface hold up? If adding GKE requires changing the interface, that's a design smell worth reflecting on.

Future Milestones (Ideas)

Once the core is solid, these are directions that go deeper into Netflix-relevant territory:

| Milestone | What You'll Learn | Netflix Relevance |
| --- | --- | --- |
| Custom K8s Controller (CRD) | controller-runtime, reconciliation loops, CRD design | This is how K8s operators work. The Compute Runtime team builds controllers. |
| Node diagnostics endpoint | SSH into nodes, collect metrics, read cgroup stats | Operational troubleshooting — "why is this container slow?" |
| containerd configuration management | containerd config.toml, runtime classes, snapshotter config | Literally what the team customizes — runtime configuration at the node level |
| eBPF-based node monitoring | Write a Go program that uses eBPF to collect scheduling metrics | Directly from the "Noisy Neighbor" blog post — same technology, same use case |
| React frontend | Cluster dashboard, real-time status, log viewer | Not Netflix-relevant, but makes the project a complete portfolio piece |

Learning Resources by Milestone

| Topic | Resource | When to Read |
| --- | --- | --- |
| Modern Go | Effective Go + Go Blog | Milestone 0 — skim for what changed |
| Go modules | go.mod reference | Milestone 0 |
| Go concurrency | Concurrency in Go by Katherine Cox-Buday, chapters 1-4 | Milestone 5 |
| client-go | client-go examples | Milestone 4 |
| Kind internals | Kind design docs | Milestone 3 |
| Container runtime | Container Security by Liz Rice | Milestone 3 (while exploring the Kind node) |
| Linux performance | Systems Performance by Brendan Gregg, chapters 1-6 | Ongoing — start during Milestone 3 |
| eBPF | Netflix blog post + Learning eBPF by Liz Rice | Future milestone on node monitoring |