Homelab:E2E DevOps Platform

Automating complexity, designing for resilience

SYSTEM STATUS :: PSEUDO FEED
SYSTEM UPTIME: Calculating...
CPU LOAD --%
MEM USAGE --%
NET IN -- KB/s
NET OUT -- KB/s
PROJECT MISSION

The homelab exists as both a personal learning environment and a production platform for testing infrastructure patterns before deploying to enterprise environments. Built with security, automation, and observability as core principles, this platform demonstrates how modern DevOps practices can be applied even in a small-scale environment.

UPTIME
99.96%
LAST 30 DAYS
RECOVERY TIME
5m
AUTOMATED
DEPLOYMENTS
23
LAST WEEK
ARCHITECTURE OVERVIEW
┌───────────────────────────┐ │ Internet │ └─────────────┬─────────────┘ │ ┌─────────────▼─────────────┐ │ Router + Firewall │ └─────────────┬─────────────┘ │ ┌────────────────────┬─────────────┴─────────────┬────────────────────┐ │ │ │ │ ▼ ▼ ▼ ▼ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ Proxmox │ │ Proxmox │ ... │ Storage │ │ Management │ │ Node 1 │ │ Node 2 │ │ NAS │ │ Host │ └──────┬─────┘ └──────┬─────┘ └────────────┘ └────────────┘ │ │ └──────┬───────┘ │ ┌────────▼─────────┐ │ Kubernetes │ │ Cluster (K3s) │ └────────┬─────────┘ │ ┌─────────────┼─────────────┬──────────────────┐ │ │ │ │ ▼ ▼ ▼ ▼ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌──────────────┐ │ ArgoCD │ │ Monitor │ │ Apps │ │ Database │ │ GitOps │ │ Stack │ │ Cluster │ │ Cluster │ └─────────┘ └─────────┘ └─────────┘ └──────────────┘

The architecture follows a layered approach with physical infrastructure managed by Proxmox, containerized workloads orchestrated by K3s Kubernetes, and a fully declarative GitOps deployment model powered by ArgoCD.

COMPONENT SPECIFICATION PURPOSE
COMPUTE 2x Dell OptiPlex 5090 MFF, 32GB RAM, 12 threads per server Primary virtualization hosts running Proxmox VE
STORAGE TrueNAS Scale, 2TB usable, ZFS RAIDz2 Centralized storage, snapshots, and backup target
NETWORK pfSense + VLANs, Tailscale mesh VPN Segmentation, security, remote access
EDGE Raspberry Pi 4 cluster (1 node) Edge computing, IoT gateway, test environment

CORE COMPONENTS & RATIONALE

[+] INFRASTRUCTURE AS CODE

OpenTofu manages all infrastructure, from VM provisioning to network configuration. Selected for:

  • • Open source guarantee via CNCF
  • • Modular design enables reusable patterns
  • • Drift detection prevents "snowflake" systems
[+] CONFIGURATION MANAGEMENT

Ansible handles system configuration, software installation, and updates. Benefits:

  • • Agentless design for simpler bootstrapping
  • • Comprehensive inventory system
  • • Idempotent operations ensure consistency
[+] CONTAINER ORCHESTRATION

K3s powers all containerized workloads with production-grade features:

  • • Lightweight footprint (512MB RAM minimum)
  • • Full Kubernetes API compatibility
  • • Embedded etcd reduces complexity
[+] GITOPS DEPLOYMENT

ArgoCD implements GitOps principles for all deployments:

  • • Pull-based model improves security
  • • Web UI provides deployment visibility
  • • Self-healing maintains desired state
[+] CI/CD PIPELINE

GitLab CI handles all testing and image building:

  • • Integrated with code repositories
  • • Container registry simplifies image management
  • • Automated security scanning
[+] OBSERVABILITY STACK

Prometheus + Grafana + Loki provides full observability:

  • • Time-series metrics with long-term storage
  • • Log aggregation and analysis
  • • AlertManager handles notification routing
KEY FEATURES

SELF-HEALING INFRASTRUCTURE

The platform includes comprehensive self-healing capabilities that reduced manual interventions. If a pod fails health checks or a node becomes unresponsive, automated remediation kicks in:

n8n workflow: health-check.json (simplified with cuts, note real code)
{
  "nodes": [
    {
      "name": "Scheduled Trigger",
      "type": "n8n-nodes-base.cron",
      "parameters": {
        "triggerTimes": {
          "item": [
            {
              "mode": "everyMinute"
            }
          ]
        }
      }
    },
    {
      "name": "Check Services",
      "type": "n8n-nodes-base.httpRequest",
      "parameters": {
        "url": "https://kubernetes.api/healthz",
        "authentication": "bearerToken",
        "headerParameters": {
          "parameters": [
            {
              "name": "Content-Type",
              "value": "application/json"
            }
          ]
        }
      }
    },
    {
      "name": "IF",
      "type": "n8n-nodes-base.if",
      "parameters": {
        "conditions": {
          "string": [
            {
              "value1": "={{$json[\"statusCode\"]}}",
              "operation": "notEqual",
              "value2": 200
            }
          ]
        }
      }
    },
    {
      "name": "Notify",
      "type": "n8n-nodes-base.slack",
      "parameters": {
        "message": "=Service {{$json[\"service\"]}} was down and has been restarted automatically. Response code was {{$json[\"statusCode\"]}}"
      }
    }
  ]
}

INFRASTRUCTURE AS CODE (VM Provisioning)

All virtual machines are provisioned using Terraform and the Proxmox provider. This ensures consistent base images and resource allocation. Ansible is then invoked via `local-exec` to handle the K3s installation and configuration on the provisioned nodes. This separation keeps infrastructure and application setup distinct but automated.

modules/proxmox_vms/main.tf (simplified, with a lot of code cut! this isn't real code)
resource "proxmox_vm_qemu" "k3s_master" {
  count = 1 # Or adjust for HA masters
  name        = "k3s-master-${count.index + 1}"
  target_node = var.proxmox_host_node
  clone       = var.vm_template_name
  os_type     = "cloud-init"
  cores   = 4
  memory  = 8192


  provisioner "remote-exec" {
    inline = ["cloud-init status --wait"]
    
    connection {
      type        = "ssh"
      user        = "ubuntu"
      private_key = file("~/.ssh/id_rsa")
      host        = self.default_ipv4_address
    }
  }
  # --- Run Ansible Playbook for K3s Installation ---
  provisioner "local-exec" {
    command = EOT
      #while the command below waits for SSH to be available
      ansible-playbook -i [run the playbook]    
    EOT
  }

resource "proxmox_vm_qemu" "k3s_worker" {
  count = 2

  name        = "k3s-worker-${count.index + 1}"
  target_node = var.proxmox_host_node
  clone       = var.vm_template_name
  os_type     = "cloud-init"

  cores   = 8
  sockets = 1
  memory  = 16384

  provisioner "remote-exec" {
    inline = ["cloud-init status --wait"]
    connection { /* ... connection details same as master ... */ }
  }

  provisioner "local-exec" {
    depends_on = [proxmox_vm_qemu.k3s_master] 
    
    command = EOT
      #while the command below waits for SSH to be available
      echo "Waiting for SSH on ${self.default_ipv4_address}..."
      echo "Running Ansible playbook for K3s worker on ${self.name} (${self.default_ipv4_address})..."
      ansible-playbook -i [run the playbook]    
EOT
  }
}

MONITORING & ALERTING

Prometheus, Grafana, and AlertManager provide real-time visibility into cluster health and performance. Custom dashboards track key metrics with automated alerts for anomaly detection.

┌──────────────────────── CLUSTER METRICS DASHBOARD ───────────────────────┐ │ │ │ Total Requests: 779 │ │ ┌──────────────────────────────────────────────────────────────┐ │ │ │ ▃▃▅▅ │ │ │ │ ▃▃ ▃▃▃▃▃ ▃▃▃▃▅▅▅▅▅▅ ▃▃▃ ▃▃ │ │ │ │ ▃▃▃▃▅▅▃▃▃▃▅▅▅▅▅▃▃▃▅▅▅▅▅▅▅▅▅▅▃▃▃▅▅▅▃▃▃▅▅▃▃▃ │ │ │ └──────────────────────────────────────────────────────────────┘ │ │ │ │ Response Codes Request Latency │ │ ┌─────────────┐ ┌────────────────┐ │ │ │ 200: 734 │ │ P95: 214ms │ │ │ │ 400: 52 │ │ P99: 503ms │ │ │ │ 500: 3 │ │ Max: 1.2s │ │ │ └─────────────┘ └────────────────┘ │ │ │ │ Latency Distribution (ms) │ │ ┌──────────────────────────────────────────────────────────────┐ │ │ │ : : │ │ │ │ : : : :: : │ │ │ │ :::::::..:..:::::.::.::....::.::..:::....::..:...::.:.::. │ │ │ └──────────────────────────────────────────────────────────────┘ │ │ │ │ Active Alerts │ │ • [RESOLVED] High latency spike detected (>500ms) at 14:25 │ │ • [RESOLVED] Elevated 4xx rate (>5%) at 14:10 │ │ │ └──────────────────────────────────────────────────────────────────────────┘

Alert Rules:

  • Response time P95 > 500ms for 5m
  • Error rate (5xx) > 1% for 3m
  • 4xx rate > 5% for 5m
  • Success rate < 98% for 5m
LEARNINGS & CHALLENGES

CHALLENGES OVERCOME

  • Storage Performance: Initially used local storage for Kubernetes persistent volumes, which led to data loss during node failures. Migrated to centralized TrueNAS with ZFS, improving resilience and enabling snapshots.
  • Network Segmentation: Early versions lacked proper network isolation. Implemented VLANs and network policies to segment traffic, reducing attack surface and improving security posture.
  • High Availability: Single-master K3s setup was a single point of failure. Migrated to multi-master setup with etcd, achieving 99.9X+% uptime over the past 6 months.

FUTURE IMPROVEMENTS

  • Edge Computing: Expanding to edge locations with lightweight K3s agents to process IoT data closer to the source.
  • ML Ops Pipeline: Building an ML training and inference pipeline for home automation prediction models.
  • Disaster Recovery: Implementing automated DR with remote backup site and scheduled recovery testing.
CONCLUSION

This homelab has evolved from a simple learning environment into a production-grade platform that demonstrates enterprise DevOps principles at a smaller scale. The investment in automation, observability, and infrastructure as code has paid off with:

  • 85% reduction in provisioning time for new services
  • 99.99% uptime over the past 6 months
  • 73% faster mean time to recovery for incidents
  • 90% reduction in manual operations tasks

Most importantly, it serves as both a practical environment for testing new technologies and a portfolio piece demonstrating real-world implementation of modern infrastructure practices.

[END OF TRANSMISSION] _