From Swarm to Kubernetes

If you’ve been following the “Here Comes the Swarm” series, you know I had a perfectly good Docker Swarm running. Six nodes, three managers, three workers, all humming along nicely. So why would I throw that away for Kubernetes?

Well, I didn’t throw it away. I evolved it.

Why the Switch? #

Look, Docker Swarm is great. It’s simple, it just works, and you can go from zero to a running cluster in about 15 minutes. But over time, I kept bumping into the same walls:

Stateful workloads are painful - Swarm’s volume management is basic. CephFS helped, but mounting it on every node and hoping the right task landed on the right node got old.
GitOps is bolted on - You can kind-of do GitOps with Swarm + Portainer, but it never felt native.
The ecosystem pressure - Almost everything interesting in the CNCF landscape assumes Kubernetes. Not Swarm.
I wanted to learn - Let’s be honest, Kubernetes is the industry standard. Swarm knowledge is great, but K8s is where the jobs (and the cool toys) are.

So I decided to migrate. But not to some cloud-managed K8s—I wanted to run it on my own hardware, the same Proxmox cluster that was hosting the Swarm VMs.

The Architecture #

Here’s what I ended up with:

┌─────────────────────────────────────────────────────────────────┐
│                    Proxmox VE Cluster                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  k8s-ctrl-01     k8s-ctrl-02     k8s-ctrl-03                    │
│  10.0.40.90      10.0.40.91      10.0.40.92                     │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐                     │
│  │  Talos   │   │  Talos   │   │  Talos   │                     │
│  │  Linux   │   │  Linux   │   │  Linux   │                     │
│  └──────────┘   └──────────┘   └──────────┘                     │
│                                                                 │
│  VIP: 10.0.40.101 (floats between control-plane nodes)           │
│                                                                 │
│  VLAN 40 (10.0.40.0/24) - Main cluster traffic                  │
│  VLAN 70 (10.0.70.0/24) - Ceph storage traffic                  │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │              Ceph (via Ceph CSI)                        │    │
│  └─────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘

Three nodes. All control-plane. No dedicated workers.

That might sound weird if you’re used to production Kubernetes where you have separate worker pools. But for a homelab? Three beefy control-plane nodes running everything is perfectly fine. The Kubernetes scheduler handles placement (Talos only needs to be told to allow workloads on control-plane nodes), and with taints and tolerations I can keep system workloads isolated from user apps when needed.
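For reference, here is roughly what that looks like in config. This is a minimal sketch, not the repo's actual files: `allowSchedulingOnControlPlanes` is the Talos machine-config field that lets control-plane nodes run regular workloads, while the `dedicated=system` taint, the Deployment name, and the image are hypothetical placeholders.

```yaml
# Talos machine config patch: allow ordinary pods on control-plane nodes.
cluster:
  allowSchedulingOnControlPlanes: true
```

```yaml
# Hypothetical system workload that tolerates a taint applied by hand, e.g.
#   kubectl taint nodes k8s-ctrl-03 dedicated=system:NoSchedule
# Pods without this toleration will not be scheduled onto the tainted node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-system-agent
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-system-agent
  template:
    metadata:
      labels:
        app: example-system-agent
    spec:
      tolerations:
        - key: dedicated
          operator: Equal
          value: system
          effect: NoSchedule
      containers:
        - name: agent
          image: registry.example.com/system-agent:1.0.0  # placeholder image
```

A toleration only permits scheduling onto the tainted node; it does not force it. Pinning a workload there would additionally need a node selector or affinity rule.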

Why Talos? #

Talos Linux is the secret sauce here. It’s not just another Linux distro with Kubernetes bolted on—it’s an operating system designed from the ground up for Kubernetes.

No SSH. No package manager. No shell, not even in maintenance mode, which only exposes a limited API. Everything is configured through a declarative API. You want to change a kernel parameter? You update your config file and apply it. You want to update the OS? You run talosctl upgrade and the node reboots straight into the new image.
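To make the kernel-parameter case concrete, here is a small sketch of a machine-config patch. The file name and the specific sysctl values are assumptions chosen for illustration, not settings from my cluster; `machine.sysctls` is the Talos field for this.

```yaml
# sysctl-patch.yaml (hypothetical file name)
# Applied with something like:
#   talosctl --nodes 10.0.40.90 patch machineconfig --patch @sysctl-patch.yaml
machine:
  sysctls:
    # Values are strings in the Talos machine config.
    net.core.somaxconn: "65535"
    vm.max_map_count: "262144"
```

The same change can also live in the config templates so that it survives a full regenerate-and-apply cycle instead of existing only as a one-off patch.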

[Image: Kubernetes nodes, my three control-plane nodes]

The deploy process is clean:

Generate the Talos config from templates
Bootstrap the first node
Join the other nodes
Install Cilium
Deploy ArgoCD
Let ArgoCD handle the rest

The Stack #

Here’s what runs on top of the cluster:

┌─────────────────────────────────────────────────────────────────┐
│                      ArgoCD (GitOps)                            │
├─────────────────────────────────────────────────────────────────┤
│  Cilium (CNI + eBPF)  │  Envoy Gateway (Ingress)                │
│  cert-manager          │  Cloudflare Tunnel                     │
│  k8s-gateway (DNS)     │  Ceph CSI (Storage)                    │
│  Doppler Operator      │  Prometheus + Grafana                  │
├─────────────────────────────────────────────────────────────────┤
│  Web: Glance, Homepage                                          │
│  Network: Cloudflare DNS/Tunnel, Envoy Gateway, k8s-gateway      │
│  Storage: Ceph CSI                                               │
│  Media: (you'll see)                                             │
│  Monitoring: Prometheus + Grafana                                │
└─────────────────────────────────────────────────────────────────┘

Cilium - The Networking Layer #

Cilium is my CNI of choice. It uses eBPF (Extended Berkeley Packet Filter) to do networking at the kernel level—way faster than the old iptables-based approaches. It replaces kube-proxy entirely, handles network policies, and even does BGP peering for load-balanced service IPs.

Config is minimal:

cniConfig:
  name: none  # We disable the built-in CNI

And we let Cilium handle everything. The cluster uses 10.42.0.0/16 for pods and 10.43.0.0/16 for services.
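Expressed as a raw Talos machine-config fragment, those settings would look something like the sketch below. The field names are from the Talos machine config; whether kube-proxy is disabled at the Talos level in my setup is an assumption here, and the generated config in the repo may be laid out differently.

```yaml
cluster:
  network:
    cni:
      name: none          # no built-in CNI; Cilium is installed afterwards
    podSubnets:
      - 10.42.0.0/16      # pod CIDR
    serviceSubnets:
      - 10.43.0.0/16      # service CIDR
  proxy:
    disabled: true        # skip kube-proxy so Cilium's eBPF replacement takes over
```

On the Cilium side, the matching Helm value is `kubeProxyReplacement`, which has to be enabled for Cilium to take over service handling.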

ArgoCD - The GitOps Engine #

This is the heart of the operation. ArgoCD watches my kubernetes/ directory and makes sure the cluster matches what’s in Git. If I want to add an app, I create a manifest file, push to the repo, and ArgoCD picks it up within minutes.
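As a rough illustration, an Argo CD Application pointing at one app directory might look like this. The repository URL, branch, and application name are placeholders, and the `argo-system` and `web` namespaces are assumed from the app layout in the repo; treat it as a sketch rather than the actual manifest.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web                          # hypothetical application name
  namespace: argo-system             # namespace where Argo CD runs (assumed)
spec:
  project: default
  source:
    repoURL: https://git.example.com/me/project-homelab.git  # placeholder URL
    targetRevision: main             # assumed branch
    path: kubernetes/apps/web        # directory Argo CD renders and applies
  destination:
    server: https://kubernetes.default.svc  # the cluster Argo CD itself runs in
    namespace: web
  syncPolicy:
    automated:
      prune: true                    # delete resources that were removed from Git
      selfHeal: true                 # revert manual drift back to the Git state
    syncOptions:
      - CreateNamespace=true
```

With `automated` sync enabled, no one has to click "Sync"; Argo CD applies changes on its next poll of the repository, or sooner if a webhook is configured.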

The bootstrap flow is clean:

# Generate and apply Talos config
task talos:genconfig
task talos:bootstrap

# Install platform components (Cilium, ArgoCD, etc.)
task apps:bootstrap

# Verify everything is healthy
task verify:cluster

All the apps are organized by namespace:

kubernetes/apps/
├── argo-system/        # ArgoCD itself
├── cert-manager/       # SSL certificates
├── default/            # Basic validation apps
├── doppler-operator-system/  # Secrets management
├── kube-system/        # System components
├── media/              # Media apps
├── monitoring/         # Prometheus + Grafana
├── network/            # DNS, tunnels, ingress
├── productivity/       # Utility apps
├── storage/            # Ceph CSI
└── web/                # Glance, Homepage

And even lower-level cluster member info is just a talosctl command away.

The Proxmox Foundation #

Just like in the Swarm series, everything sits on top of Proxmox. I use the same bpg/proxmox Terraform provider to define the VMs:

module "k8s_ctrl_01" {
  source = "./modules/proxmox-vm"
  
  name         = "k8s-ctrl-01"
  vm_id        = 4090
  node_name    = "pve-0"
  ipv4_address = "10.0.40.90/24"
  ipv4_gateway = "10.0.40.1"
  
  memory_dedicated = 16384  # 16GB RAM
  cpu_cores        = 8
  disk_size        = 128     # 128GB SSD
}

Each node boots from a Talos image downloaded from the Talos Factory. The machine configs are generated from templates in talos/:

# talos/talconfig.yaml
clusterName: kubernetes
talosVersion: "${talosVersion}"
kubernetesVersion: "${kubernetesVersion}"
endpoint: https://10.0.40.101:6443

nodes:
  - hostname: "k8s-ctrl-01"
    ipAddress: "10.0.40.90"
    installDisk: "/dev/sda"
    controlPlane: true
    networkInterfaces:
      - deviceSelector:
          hardwareAddr: "bc:24:11:79:b5:8f"
        addresses:
          - "10.0.40.90/24"
        routes:
          - network: "0.0.0.0/0"
            gateway: "10.0.40.1"
        vip:
          ip: "10.0.40.101"
      - deviceSelector:
          hardwareAddr: "bc:24:11:ce:0c:fb"
        addresses:
          - "10.0.70.90/24"

Notice the two NICs: one on VLAN 40 for cluster traffic, and one on VLAN 70 for Ceph storage. Same pattern as the Swarm VMs had.

Managing Secrets #

I carried over Doppler from the Swarm setup. The Doppler Kubernetes Operator syncs secrets from Doppler into Kubernetes Secret objects that pods consume, so there are no .env files, no manual copying, and no secrets in Git. ArgoCD syncs the config, and the operator handles the rest.
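In practice the operator is driven by a DopplerSecret resource that names a Doppler token and the Kubernetes Secret to keep in sync. The sketch below uses hypothetical resource names and assumes the token Secret was created out of band; it is not copied from my repo.

```yaml
apiVersion: secrets.doppler.com/v1alpha1
kind: DopplerSecret
metadata:
  name: homepage-secrets              # hypothetical name
  namespace: doppler-operator-system
spec:
  tokenSecret:
    name: doppler-token-secret        # Secret holding a Doppler service token (assumed to exist)
  managedSecret:
    name: homepage-env                # Kubernetes Secret the operator creates and keeps in sync
    namespace: web
```

A workload would then reference `homepage-env` like any other Secret, for example through `envFrom` with a `secretRef`. Only this resource lives in Git; the secret values themselves never do.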

The Bootstrap Flow #

Getting from an empty Proxmox cluster to a working Kubernetes cluster is a sequence of well-defined steps:

# 1. Provision the VMs with OpenTofu
task tf:proxmox:apply

# 2. Generate Talos machine configs
task talos:genconfig

# 3. Bootstrap the Talos cluster
task talos:bootstrap

# 4. Install Cilium (the CNI)
task platform:cilium

# 5. Install ArgoCD
task platform:argocd

# 6. Let ArgoCD sync all the apps
task apps:bootstrap

# 7. Verify everything
task verify:cluster

The entire process takes about 30 minutes from “I have nothing” to “I have a working Kubernetes cluster with apps”. And if a node dies? I tear down the VM with task tf:proxmox:destroy (targeting just that node), rebuild it, and rejoin. The whole thing is reproducible by design.
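The seven steps can also be chained behind a single entry point. Here is a minimal Taskfile sketch, assuming it sits in the root Taskfile where those task names resolve; the `rebuild-cluster` wrapper name is hypothetical and is not a task from the repo.

```yaml
# Taskfile.yml (sketch): one hypothetical wrapper task that runs the
# bootstrap steps in order. Commands run serially, and Task stops at the
# first step that fails.
version: "3"

tasks:
  rebuild-cluster:
    desc: Provision VMs, bootstrap Talos, install the platform, sync apps
    cmds:
      - task: "tf:proxmox:apply"
      - task: "talos:genconfig"
      - task: "talos:bootstrap"
      - task: "platform:cilium"
      - task: "platform:argocd"
      - task: "apps:bootstrap"
      - task: "verify:cluster"
```

Keeping the individual tasks separate still matters: when one step fails, it can be rerun on its own without repeating everything before it.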

What Runs Where #

Not everything moved to Kubernetes. Some things stayed on the host as Docker containers:

On the host (Docker Compose):

Portainer (yes, still—for the occasional quick check)
Cloudflare Tunnel (at the host level for backup ingress)
Beszel (system monitoring)
Traefik (host-level reverse proxy)
Uptime-Kuma (uptime monitoring)

In Kubernetes:

Glance (personal dashboard)
Homepage (homelab homepage)
Everything else that benefits from orchestration

What I Learned #

Making the jump from Swarm to Kubernetes on my own hardware was… humbling. Here are the things I wish I had known going in:

Talos is amazing, but different - No SSH means you debug everything through talosctl and kubectl. It takes getting used to, but once it clicks, you never want to go back.
Cilium is worth the hype - eBPF networking is genuinely faster. Latency between services dropped noticeably compared to the old Swarm setup.
ArgoCD changes the game - The “git push and it’s live” feeling? That’s not just hype. It’s genuinely satisfying to see your cluster auto-sync shortly after you push.
GitOps for secrets is still hard - SOPS + Age works, but it adds friction. Doppler helps, but it’s one more dependency.
Three nodes is enough - For a homelab, three control-plane nodes running everything is more than sufficient. I haven’t had a single resource crunch.

The Repo #

Everything is in the project-homelab repository. The Talos configs, the Kubernetes manifests, the OpenTofu stacks, the Docker compose files—it’s all there.

Be warned: it’s a living document. I’m still migrating apps from the old Swarm, adding monitoring, and generally poking at things. But the foundation is solid, and the bootstrap flow is well-tested.

And the best part? When I inevitably want to rebuild everything again in 2027, I’ll just run task infra:provision and go make coffee.