KaaS Portal — Learn-by-Building Guide

This guide walks you through building a Kubernetes-as-a-Service portal from scratch. YOU write every line of code. Each milestone teaches Go concepts, K8s internals, and container runtime knowledge directly relevant to the Netflix Compute Runtime role.

The project is at ~/PyCharmProjects/kaas-portal. There's already a scaffold committed from an earlier session β€” you can reference it for ideas, rewrite it completely, or rm -rf it and start fresh. It's YOUR project.

How to use this guide: Each milestone has steps, concepts to understand, and "why this matters for Netflix" context. If you get stuck, ask Claude — but ask questions, don't ask for code. The hints (expandable sections) give you direction without giving you the answer.

Required Reading: Netflix's Compute Runtime Blog Posts

The job posting references two blog posts that describe work done by this exact team. Read both before you start building — they'll shape how you think about the project.

Blog Post 1: Noisy Neighbor Detection with eBPF

Source: Netflix Tech Blog — Noisy Neighbor Detection with eBPF (Sept 2024)

The Problem

Netflix runs Titus, their multi-tenant compute platform. Multiple containers share the same physical host. A "noisy neighbor" is a container (or system process) that hogs host resources — especially CPU — and degrades performance for other containers on the same machine.

Traditional tools like perf add significant overhead and are deployed after the problem is noticed. By then the noisy neighbor has moved on, or the profiling overhead makes things worse.

The eBPF Solution

They instrument run queue latency — the time a process sits waiting for a CPU after it becomes runnable. This uses three Linux scheduler hooks:

| Hook | When It Fires | What They Do |
| --- | --- | --- |
| sched_wakeup | Process transitions from sleeping → runnable | Record timestamp in a BPF hash map, keyed by PID |
| sched_wakeup_new | Newly created process becomes runnable | Same — record timestamp |
| sched_switch | CPU switches to running a different process | Look up the wakeup timestamp, compute delta = now - wakeup_time. That delta is the run queue latency. |

Container Attribution

They use the process's cgroup ID to map each scheduling event back to its container. This is critical: without it, you just have PID-level data. With cgroup mapping, you can say "container X is experiencing high run queue latency because container Y is causing CPU contention."

Key Subtlety: Throttling vs. Noisy Neighbor

A container exceeding its cgroup CPU limit gets throttled, which also appears as high run queue latency. You must distinguish throttling (self-inflicted) from actual noisy-neighbor preemption (caused by others). The team found that system processes (not just other containers) are often the real noisy neighbors.

Data Pipeline

BPF hooks (sched_wakeup, sched_wakeup_new, sched_switch)
    → BPF ring buffer (zero-copy, kernel→user)
    → Go userspace program (reads events, computes latency, maps to cgroup, emits metrics)
    → Atlas (Netflix metrics backend)

Performance Numbers

What to learn from this:

Blog Post 2: Mount Mayhem — Scaling Containers on Modern CPUs

Source: Netflix Tech Blog — Mount Mayhem at Netflix (Feb 2026)

The Problem

Netflix nodes stalled for tens of seconds when starting many containers concurrently. The mount table ballooned during startup because containerd executes thousands of bind mount operations when assembling multi-layer container images. A health check that reads the mount table would take 30+ seconds.

Root Cause: Kernel VFS Lock Contention

Almost all time was spent trying to grab a global kernel lock in the Linux Virtual Filesystem (VFS) layer. The hottest code path was path_init() — specifically a sequence lock (seqlock) that serializes mount table lookups.

When hundreds of containers start simultaneously, each needing dozens of mount operations, they all fight over this single lock.

How Container Overlay Filesystems Work (Background)

Container Image: 5 layers

    Layer 5 (app code)   ← writable (container's changes)
    Layer 4 (pip pkgs)   ← read-only
    Layer 3 (python)     ← read-only
    Layer 2 (apt pkgs)   ← read-only
    Layer 1 (ubuntu)     ← read-only

containerd uses overlayfs to stack these layers. Each layer requires bind mount operations → O(n) mounts per container. 100 containers × 20 layers = 2,000 mount operations → all hitting the same kernel lock.

CPU Microarchitecture Made It Worse

Using Intel's Topdown Microarchitecture Analysis (TMA), they found:

Hardware architecture mattered enormously:

| Instance Type | Architecture | Behavior Under Contention |
| --- | --- | --- |
| AWS r5.metal (older) | Dual-socket, NUMA, mesh cache coherence | Severe stalls — cross-socket cache line bouncing |
| AWS m7i.metal / m7a.24xlarge (newer) | Single-socket, distributed cache | Scaled smoothly, far less contention |

Disabling hyperthreading improved latency by up to 30%.

The Fix: O(n) → O(1) Mount Operations

Two approaches were considered:

  1. New kernel mount APIs (fsopen() / fsmount()) — these use file descriptors instead of path-based lookups, avoiding the global VFS lock. But requires newer kernels.
  2. Redesign overlay assembly — group layer mounts under a common parent, reducing mount operations from O(n) per layer to O(1) per container. This is what they shipped.
What to learn from this:

Milestone 0: Set Up Your Go Environment

Step 1: Verify Go is installed

Run go version. You should see Go 1.26.x (already installed via brew).

Step 2: Understand what changed since 2016-2018

You used Go before modules existed. Here's what's different now:

| Then (2016-2018) | Now (2026) | Why It Matters |
| --- | --- | --- |
| $GOPATH and vendor/ | go mod (modules) | No more GOPATH. Run go mod init <module-path> in any directory. Dependencies in go.mod. |
| No generics | Generics (Go 1.18+) | Type parameters: func Map[T any](s []T, f func(T) T) []T. Used in newer K8s libraries. |
| log.Printf | log/slog (Go 1.21+) | Structured logging in stdlib: slog.Info("msg", "key", value) |
| Basic http.ServeMux | Enhanced mux (Go 1.22+) | Method-based routing: mux.HandleFunc("GET /api/users/{id}", handler). No need for gorilla/chi for basic apps. |
| interface{} | any (alias, Go 1.18+) | any is just interface{}. Cleaner to read. |
| Manual context threading | context is everywhere | Every API call, every K8s client method, every DB query takes a context.Context as first arg. Non-negotiable pattern. |
| Error wrapping was manual | fmt.Errorf("...: %w", err) | The %w verb wraps errors. errors.Is() and errors.As() unwrap them. This is how K8s code handles errors. |

Step 3: Warm up: Write a small program

Before touching the KaaS project, write a standalone Go program that:

Why: This covers the exact patterns you'll use in the KaaS portal, in a throwaway sandbox.

Hint: Graceful shutdown pattern

The key pieces are: signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM) gives you a context that cancels on Ctrl+C. Start the HTTP server in a goroutine. Block on <-ctx.Done(). Then call server.Shutdown(timeoutCtx). This is the standard Go HTTP server pattern.

Milestone 1: Project Scaffold & Health Check

Goal: Get a running Go API server with proper project structure, structured logging, and a health endpoint.

Step 1: Initialize the project

You can either start from the existing scaffold at ~/PyCharmProjects/kaas-portal or wipe it and start fresh. If starting fresh:

Step 2: Choose your project layout

Go doesn't enforce a project structure, but the community convention for non-trivial projects is:

kaas-portal/
├── cmd/
│   └── kaas-portal/
│       └── main.go      ← entry point
├── internal/            ← private packages (can't be imported by other modules)
│   ├── api/             ← HTTP handlers, middleware
│   ├── config/          ← configuration loading
│   └── provider/        ← cloud provider implementations
│       ├── kind/
│       ├── aws/
│       └── gcp/
├── pkg/                 ← public packages (could be imported by others)
│   └── models/          ← domain types (Cluster, etc.)
├── go.mod
├── go.sum
├── Makefile
└── .gitignore

Why internal/? The Go compiler enforces that packages under internal/ can only be imported by code in the parent directory tree. This is a hard compiler guarantee, not just convention. It keeps your API surface intentional.

Step 3: Build these pieces yourself

Step 4: Test it

Run go run ./cmd/kaas-portal and curl http://localhost:8080/healthz.

Then write an actual Go test. Create internal/api/server_test.go:

Why: Go's testing story is built into the language (go test ./...). No test framework needed. Table-driven tests are the Go idiom β€” learn this pattern early.

Hint: Logging middleware approach

You need to capture the status code, but http.ResponseWriter doesn't expose it after WriteHeader(). The solution: wrap ResponseWriter in your own struct that records the status code. Then call next.ServeHTTP(yourWrapper, r).

Checkpoint: You have a running Go HTTP server with structured logging, a health check, and a test. Commit it.

Milestone 2: Domain Model & Provider Interface

Goal: Design the core abstractions — the Cluster model and the Provider interface that every cloud backend will implement.

Step 1: Design the Cluster model

Think about what a cluster is, provider-agnostically. At minimum:

Write this as a struct in pkg/models/cluster.go with JSON tags.

Why pkg/? This is your public API. If someone imported your module, they could use these types. The provider implementations in internal/ will create and return these models.

Step 2: Design the Provider interface

This is where Go's interface model shines. Define an interface with methods like:

Go Interfaces — Key Concept:

In Java/C#, you write class KindProvider implements Provider. In Go, you don't. If a type has the right methods, it is a Provider. The compiler checks this at the call site, not at the declaration. This is called structural typing (or "duck typing, but checked at compile time").

This matters for K8s: the entire K8s codebase is built on this pattern. The kubelet talks to containerd via the CRI interface. containerd talks to runc via the OCI runtime interface. Plugins everywhere β€” all using Go interfaces.

Step 3: Think about error handling

Every method returns error. Think about what errors mean for each operation:

For now, keep it simple — you can use errors.Is() and sentinel errors. Refine later.

Step 4: Wire the provider map into your server

Your server should accept a map[string]Provider. Add a GET /api/v1/providers endpoint that returns the list of registered provider names.

Hint: Compile-time interface check

Go won't tell you a type fails to implement an interface until you try to use it as one. To get early feedback, add this line to your provider file: var _ Provider = (*KindProvider)(nil). This asserts at compile time that KindProvider implements Provider.

Checkpoint: You have a Cluster model, a Provider interface, and your server knows about providers. No providers implement it yet — that's next. Commit.

Milestone 3: Kind Provider β€” Your First Real Cluster

Goal: Implement a provider that creates real local K8s clusters using Kind. By the end, you'll run curl -X POST .../clusters and get a real, working Kubernetes cluster.

This is where it starts connecting to the Netflix role. Kind creates clusters using containerd (the same container runtime Netflix customizes). When you create a Kind cluster, you're exercising the same kubelet → CRI → containerd → runc stack that the Compute Runtime team owns at Netflix.

Step 1: Understand what Kind does under the hood

Before writing code, understand what happens when Kind creates a cluster:

  1. Kind pulls a Docker image (kindest/node:v1.x.x) that contains kubelet, kubeadm, and containerd
  2. It runs that image as Docker containers — each container = one K8s node
  3. Inside each container, containerd runs as the container runtime and kubelet manages it via CRI
  4. kubeadm bootstraps the control plane
  5. Kind generates a kubeconfig pointing to the API server

This is containers running inside containers — the node is a Docker container, and the pods inside it use containerd. Understanding this nesting is key.

Step 2: Add the Kind library dependency

go get sigs.k8s.io/kind

Look at the sigs.k8s.io/kind/pkg/cluster package. The key type is cluster.Provider which has methods like Create(), Delete(), List(), and KubeConfig().

Step 3: Implement the Kind provider

Create internal/provider/kind/kind.go. You need:

Go Concurrency Concept — sync.RWMutex:

Your API server handles concurrent HTTP requests (each in its own goroutine). If two requests try to read/write the clusters map simultaneously, you'll get a data race. sync.RWMutex allows many concurrent readers (RLock) or exactly one writer (Lock), never both at once.

Use go run -race ./cmd/kaas-portal to run with the race detector — it will catch data races at runtime.
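A minimal sketch of an RWMutex-guarded cluster store — simplified to string values here, where your real map would hold *models.Cluster:

```go
package main

import (
	"fmt"
	"sync"
)

// clusterStore guards the in-memory cluster map. Many HTTP handler
// goroutines may read while creation goroutines write.
type clusterStore struct {
	mu       sync.RWMutex
	clusters map[string]string // name -> status (simplified)
}

func newClusterStore() *clusterStore {
	return &clusterStore{clusters: make(map[string]string)}
}

func (s *clusterStore) set(name, status string) {
	s.mu.Lock() // exclusive: blocks readers and other writers
	defer s.mu.Unlock()
	s.clusters[name] = status
}

func (s *clusterStore) get(name string) (string, bool) {
	s.mu.RLock() // shared: many readers may hold this at once
	defer s.mu.RUnlock()
	st, ok := s.clusters[name]
	return st, ok
}

func main() {
	s := newClusterStore()
	var wg sync.WaitGroup
	// Concurrent writers: safe because of the mutex. Run this under
	// `go run -race` to confirm.
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.set(fmt.Sprintf("c%d", i), "running")
		}(i)
	}
	wg.Wait()
	st, ok := s.get("c3")
	fmt.Println(st, ok) // running true
}
```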

Step 4: Add CRUD handlers

Wire up these endpoints:

| Method | Path | What It Does |
| --- | --- | --- |
| POST | /api/v1/clusters | Create cluster (takes JSON body) |
| GET | /api/v1/clusters | List all clusters (optional ?provider= filter) |
| GET | /api/v1/clusters/{id} | Get single cluster (include kubeconfig) |
| DELETE | /api/v1/clusters/{id} | Delete cluster |
| GET | /api/v1/clusters/{id}/kubeconfig | Get kubeconfig as YAML |

Step 5: Test it end-to-end

Start the server and create an actual cluster:

# Create a Kind cluster
curl -X POST http://localhost:8080/api/v1/clusters \
  -H "Content-Type: application/json" \
  -d '{"name": "test-1", "provider": "kind", "node_count": 1}'

# This will take 1-2 minutes. When it returns, you have a real K8s cluster.

# List clusters
curl http://localhost:8080/api/v1/clusters | jq

# Get kubeconfig and use it
curl http://localhost:8080/api/v1/clusters/kind-test-1/kubeconfig > /tmp/test-1.kubeconfig
kubectl --kubeconfig /tmp/test-1.kubeconfig get nodes

# Clean up
curl -X DELETE http://localhost:8080/api/v1/clusters/kind-test-1

Step 6: Explore what Kind created

While the cluster is running, investigate what's happening at the container runtime level:

# See the Docker containers Kind created (each is a K8s "node")
docker ps

# Exec into the node container and look at containerd
docker exec -it test-1-control-plane bash
  # Inside the node:
  crictl ps          # List containers via CRI (like kubectl for the runtime)
  crictl images      # List images in containerd
  cat /etc/containerd/config.toml   # containerd configuration
  ps aux | grep kubelet              # The kubelet process
  mount | head -30                   # See the mount table (remember the Mount Mayhem blog!)
  cat /proc/1/cgroup                 # cgroup of the init process
When you run crictl ps inside the Kind node, you're using the same CRI interface that the Netflix Compute Runtime team works on. When you look at /etc/containerd/config.toml, you're seeing the same configuration they customize. When you look at the mount table, you're seeing the same mount point explosion described in the "Mount Mayhem" blog post. This is the stack.

Checkpoint: You can create and destroy real K8s clusters via your API. You've looked inside a node and seen kubelet, containerd, the mount table, and cgroups. Commit.

Milestone 4: client-go β€” Talk to Your Clusters

Goal: Use the official Kubernetes Go client to query your clusters — get nodes, list pods, read namespaces. This is the library that kubelet, controllers, and operators are built on.

Step 1: Add the client-go dependency

go get k8s.io/client-go@latest

This is a large dependency. It pulls in the same code that powers kubectl and every K8s controller.

Step 2: Build a cluster info endpoint

Add GET /api/v1/clusters/{id}/info that:

Step 3: Add a pod listing endpoint

Add GET /api/v1/clusters/{id}/pods that lists pods across all namespaces (or filtered by ?namespace=).

Use clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})

Step 4: Deploy a workload through the API

Add POST /api/v1/clusters/{id}/deployments that creates a Deployment in the cluster. Accept a simple payload (image, replicas, name) and use client-go to create it.

Go Concept — Context Propagation:

Every client-go call takes a context.Context. Your HTTP handler gets one from r.Context(). Pass it through to the K8s client calls. If the HTTP client disconnects, the context cancels, and the K8s API call is abandoned. This is how Go propagates cancellation through entire call chains without try/catch/finally.

Hint: Building a clientset from kubeconfig string

The trick is that clientcmd normally reads from a file. To use an in-memory kubeconfig string, use clientcmd.RESTConfigFromKubeConfig([]byte(kubeconfigStr)). This gives you a *rest.Config that you pass to kubernetes.NewForConfig(config).

Checkpoint: Your portal can create clusters AND interact with them — listing nodes, pods, deploying workloads. You're writing the same client-go code that K8s controllers use. Commit.

Milestone 5: Async Cluster Creation (Go Concurrency)

Goal: Cluster creation takes minutes. Right now the API blocks. Fix this using Go's concurrency primitives β€” goroutines, channels, and status polling.

Step 1: Make creation async

Change CreateCluster to:

  1. Immediately return the cluster in "provisioning" status with a 202 Accepted
  2. Kick off the actual Kind creation in a goroutine
  3. The goroutine updates the cluster status to "running" or "failed" when done

Clients poll GET /api/v1/clusters/{id} to check status.

2Handle cancellation

What if someone deletes a cluster while it's still provisioning? Think about:

Step 3: Add a status event stream (stretch goal)

Add GET /api/v1/clusters/{id}/events using Server-Sent Events (SSE) to stream status updates in real-time.

Go Concurrency Patterns to Understand:

Checkpoint: Cluster creation is non-blocking. You understand goroutines, mutexes, and context cancellation at a practical level. Commit.

Milestone 6: EKS Provider (AWS)

Goal: Add a real cloud provider. EKS creation takes 10-15 minutes — your async pattern from Milestone 5 pays off here.

Step 1: Add the AWS SDK

go get github.com/aws/aws-sdk-go-v2 and the EKS, EC2, IAM, and STS service packages.

You used the AWS SDK in Go at Symantec — this will feel familiar, but the SDK has been rewritten (v2). The API design is different: it uses functional options and context everywhere.

Step 2: Implement the EKS provider

This is more complex than Kind. You'll need to:

Step 3: Handle the cost angle

EKS clusters cost money. Consider:

Netflix runs entirely on AWS. Their Titus/K8s platform manages thousands of EC2 instances. Understanding EKS, VPCs, IAM roles, and instance types is directly relevant. The EKS data plane (kubelet + containerd on each EC2 node) is essentially what the Compute Runtime team manages, but with Netflix's customizations.

Milestone 7: GKE Provider (GCP)

Goal: Add your second cloud provider. By now the Provider interface should make this clean — same interface, different backend.

Same pattern as EKS but with cloud.google.com/go/container. GKE clusters are faster to create (~5 min) and have a simpler IAM model.

The real learning here is: does your Provider interface hold up? If adding GKE requires changing the interface, that's a design smell worth reflecting on.

Future Milestones (Ideas)

Once the core is solid, these are directions that go deeper into Netflix-relevant territory:

| Milestone | What You'll Learn | Netflix Relevance |
| --- | --- | --- |
| Custom K8s Controller (CRD) | controller-runtime, reconciliation loops, CRD design | This is how K8s operators work. The Compute Runtime team builds controllers. |
| Node diagnostics endpoint | SSH into nodes, collect metrics, read cgroup stats | Operational troubleshooting — "why is this container slow?" |
| containerd configuration management | containerd config.toml, runtime classes, snapshotter config | Literally what the team customizes — runtime configuration at the node level |
| eBPF-based node monitoring | Write a Go program that uses eBPF to collect scheduling metrics | Directly from the "Noisy Neighbor" blog post — same technology, same use case |
| React frontend | Cluster dashboard, real-time status, log viewer | Not Netflix-relevant, but makes the project a complete portfolio piece |

Learning Resources by Milestone

| Topic | Resource | When to Read |
| --- | --- | --- |
| Modern Go | Effective Go + Go Blog | Milestone 0 — skim for what changed |
| Go modules | go.mod reference | Milestone 0 |
| Go concurrency | Concurrency in Go by Katherine Cox-Buday, chapters 1-4 | Milestone 5 |
| client-go | client-go examples | Milestone 4 |
| Kind internals | Kind design docs | Milestone 3 |
| Container runtime | Container Security by Liz Rice | Milestone 3 (while exploring the Kind node) |
| Linux performance | Systems Performance by Brendan Gregg, chapters 1-6 | Ongoing — start during Milestone 3 |
| eBPF | Netflix blog post + Learning eBPF by Liz Rice | Future milestone on node monitoring |