
The 3 AM Outage That Changed My Architecture

Photo by Taylor Vick on Unsplash

Two years ago, my primary Proxmox server's motherboard died at 3 AM. My self-hosted services went down simultaneously: password manager, DNS, monitoring. I was dead in the water until I could source a replacement part.

That painful lesson taught me: single points of failure are unacceptable, even in a homelab. Understanding how to design resilient systems became essential.

This incident became a driving force behind building a security-focused homelab with resilience baked in from the start.

High Availability Architecture

```mermaid
flowchart TB
    subgraph clusternodes["Cluster Nodes"]
        Node1[Proxmox Node 1<br/>Dell R940]
        Node2[Proxmox Node 2<br/>Dell R730]
        Node3[Proxmox Node 3<br/>Custom Build]
    end
    subgraph sharedstorage["Shared Storage"]
        Ceph[(Ceph Cluster<br/>Distributed Storage)]
    end
    subgraph networkinfrastructure["Network Infrastructure"]
        Switch1[10Gb Switch<br/>Primary]
        Switch2[1Gb Switch<br/>Management]
    end
    subgraph haservices["HA Services"]
        Corosync[Corosync<br/>Cluster Communication]
        PVE[PVE HA Manager<br/>Failover Orchestration]
        Fencing[Fencing Agent<br/>Split-Brain Prevention]
    end
    subgraph vmswithha["VMs with HA"]
        DNS[Pi-hole DNS]
        Vault[Bitwarden]
        Monitor[Wazuh SIEM]
        Web[Web Services]
    end

    Node1 --> Ceph
    Node2 --> Ceph
    Node3 --> Ceph

    Node1 --> Switch1
    Node2 --> Switch1
    Node3 --> Switch1

    Node1 --> Switch2
    Node2 --> Switch2
    Node3 --> Switch2

    Corosync --> Node1
    Corosync --> Node2
    Corosync --> Node3

    PVE --> Corosync
    Fencing --> PVE

    Ceph --> DNS
    Ceph --> Vault
    Ceph --> Monitor
    Ceph --> Web

    classDef greenNode fill:#4caf50,color:#fff
    classDef blueNode fill:#2196f3,color:#fff
    classDef redNode fill:#f44336,color:#fff
    class Ceph greenNode
    class Corosync blueNode
    class Fencing redNode
```

Planning Your HA Cluster

Hardware Requirements

Minimum (3 nodes required for quorum):

  • Node 1: Dell R940 (primary) - 32GB RAM, 8 cores
  • Node 2: Dell R730 (secondary) - 24GB RAM, 6 cores
  • Node 3: Custom build (witness) - 16GB RAM, 4 cores

Network Requirements:

  • Two separate networks (cluster + management)
  • 10Gb preferred for Ceph storage network
  • 1Gb acceptable for management/corosync

Storage Requirements:

  • 3× identical disks per node for Ceph (minimum)
  • NVMe recommended for journal/metadata
  • Dedicated disks for Ceph (not shared with OS)

Why Three Nodes?

Proxmox HA requires a majority of node votes (quorum) to operate: quorum is ⌊N/2⌋+1 votes, so an N-node cluster tolerates ⌊(N−1)/2⌋ node failures. That's why odd node counts are the efficient choice:

  • 2 nodes: has quorum only while both are up, so it can't survive any failure
  • 3 nodes: survives 1 node failure ✓
  • 5 nodes: survives 2 node failures
  • 7 nodes: survives 3 node failures (overkill for a homelab)

My setup: three nodes offer a good balance of reliability versus cost.

Initial Proxmox Cluster Setup

Prepare Each Node

📎 Complete setup script: Full node preparation with networking and repositories

Update packages, configure bridge interfaces with static IPs
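
A minimal preparation sketch, assuming Debian-based Proxmox defaults with ifupdown2; the interface name (eno1) and addresses are examples for your environment:

```bash
# Update the node before clustering
apt update && apt full-upgrade -y

# Example bridge with a static IP in /etc/network/interfaces
cat >> /etc/network/interfaces <<'EOF'
auto vmbr0
iface vmbr0 inet static
    address 192.168.10.11/24
    gateway 192.168.10.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
EOF

ifreload -a   # apply with ifupdown2, no reboot needed
```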

Create Cluster

📎 Complete cluster setup: Full 3-node cluster creation with dual links

Create the cluster on node 1: pvecm create homelab-cluster
Join from the other nodes: pvecm add <node1-ip>
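
A sketch of the dual-link variant the full script covers; the 10.0.0.x (cluster) and 192.168.10.x (management) addresses are placeholders:

```bash
# On node 1: create the cluster with two corosync links
pvecm create homelab-cluster --link0 10.0.0.11 --link1 192.168.10.11

# On nodes 2 and 3: join via node 1, declaring this node's own link addresses
pvecm add 10.0.0.11 --link0 10.0.0.12 --link1 192.168.10.12

pvecm status   # confirm all three nodes and quorum
```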

Configure Corosync

📎 Complete configuration: Full corosync.conf with redundant rings and crypto

Enable knet transport with AES256 encryption, configure redundant links
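
An illustrative totem-section excerpt (not the full file); edit a copy of /etc/pve/corosync.conf and bump config_version before moving it back into place:

```
totem {
  version: 2
  cluster_name: homelab-cluster
  config_version: 4        # increment on every change
  transport: knet
  crypto_cipher: aes256
  crypto_hash: sha256
  interface {
    linknumber: 0          # primary ring (10Gb storage network)
  }
  interface {
    linknumber: 1          # redundant ring (1Gb management)
  }
}
```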

Ceph Storage Configuration

Install Ceph

📎 Complete setup: Full Ceph installation with all monitors

Install Ceph packages, initialize cluster on storage network, create monitors
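
The core commands, as a sketch; the storage subnet is an example:

```bash
pveceph install                      # install Ceph packages on each node
pveceph init --network 10.0.0.0/24   # once: bind Ceph to the storage network
pveceph mon create                   # run on each of the three nodes
ceph -s                              # expect HEALTH_OK with 3 monitors
```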

Configure Ceph OSDs

📎 Complete setup: OSD creation script for all nodes and disks

Create OSD on each disk: pveceph osd create /dev/sdX
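
A per-node sketch, assuming three dedicated data disks at /dev/sdb through /dev/sdd (check lsblk for your actual devices):

```bash
for disk in /dev/sdb /dev/sdc /dev/sdd; do
    pveceph osd create "$disk"
done

ceph osd tree   # should show 9 OSDs across the 3 nodes
```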

Create Ceph Pools

Create pools with 3x replication (min 2), map to Proxmox storage
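
For example ("vm-pool" is an arbitrary name):

```bash
# 3 replicas, writes still allowed with 2; --add_storages registers it in Proxmox
pveceph pool create vm-pool --size 3 --min_size 2 --pg_num 128 --add_storages

pvesm status   # the new RBD storage should be listed
```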

Ceph Performance Tuning

Set placement groups to 128, enable RBD caching
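
A tuning sketch; the cache size is illustrative, and option names should be verified against your Ceph release:

```bash
# Client-side RBD caching
ceph config set client rbd_cache true
ceph config set client rbd_cache_size 67108864   # 64 MiB

# Placement groups were set at pool creation; to adjust later:
ceph osd pool set vm-pool pg_num 128
```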

High Availability Configuration

Enable HA Manager

📎 Complete HA setup: Full HA manager configuration and verification

Verify HA services running, check cluster status
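
For example:

```bash
systemctl status pve-ha-crm pve-ha-lrm   # CRM and LRM active on every node
ha-manager status                        # quorum state and managed resources
pvecm status                             # membership and vote counts
```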

Configure Fencing

Fencing prevents split-brain scenarios by forcibly powering off unresponsive nodes.

Install fence-agents, configure IPMI credentials for each node
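
Note that stock Proxmox HA uses watchdog-based self-fencing; this sketch assumes you're layering IPMI power fencing on top, with a placeholder BMC address and credentials:

```bash
apt install -y fence-agents

# Verify IPMI is reachable for each node's BMC
fence_ipmilan -a 192.168.10.101 -l fenceuser -p 'secret' -o status
```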

Enable HA for VMs

📎 Complete VM HA configuration: HA resource management with groups and priorities

Add VMs to HA: ha-manager add vm:100 --state started --max_restart 3
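
A sketch of the group-based variant; the group name, node priorities, and VM IDs are examples:

```bash
# Higher priority number = preferred node
ha-manager groupadd prefer-node1 --nodes "node1:2,node2:1,node3:1"

ha-manager add vm:100 --state started --max_restart 3 --group prefer-node1
ha-manager add vm:101 --state started --group prefer-node1

ha-manager status
```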

Testing Failover

Simulated Node Failure

Power off node, watch VMs migrate within 2 minutes. This pattern integrates well with zero trust VLAN segmentation to ensure services remain isolated during failover.

Simulated Network Partition

Block all traffic with iptables, verify fencing powers off minority partition
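
A rough sketch — run this from the physical console of the node you're isolating, since you're cutting its network out from under you:

```bash
# On the node to isolate
iptables -A INPUT -j DROP
iptables -A OUTPUT -j DROP

# From a surviving node, watch the reaction
pvecm status        # isolated node drops out of quorum
ha-manager status   # its VMs are fenced and restarted elsewhere
```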

Simulated Ceph Failure

Stop OSD daemon, verify data remains accessible via replication
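
For example, using OSD ID 0:

```bash
systemctl stop ceph-osd@0    # take one OSD down
ceph -s                      # HEALTH_WARN, but data stays accessible
ceph osd tree                # the OSD shows as down

systemctl start ceph-osd@0   # recovery/backfill restores full redundancy
```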

Backup Strategy

Proxmox Backup Server Integration

📎 Complete backup configuration: Full PBS setup with schedules and retention

Add PBS storage, schedule nightly snapshots at 2 AM
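
A registration sketch; the server address, datastore name, and fingerprint are placeholders for your PBS instance, and the 2 AM schedule itself is configured under Datacenter → Backup (or cron):

```bash
pvesm add pbs pbs-backup --server 192.168.10.50 --datastore homelab \
    --username backup@pbs --password '<pbs-password>' \
    --fingerprint '<pbs-cert-fingerprint>'

# The nightly job runs the equivalent of:
vzdump --all --storage pbs-backup --mode snapshot --compress zstd
```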

Automated Backup Script

Backup cluster config to tarball, sync offsite with rclone
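
A minimal sketch, assuming an rclone remote named "offsite" is already configured:

```bash
#!/bin/bash
set -euo pipefail

# Archive the cluster configuration (pmxcfs, readable while quorate)
STAMP=$(date +%F)
tar czf "/root/pve-config-${STAMP}.tar.gz" /etc/pve

# Sync the archive offsite
rclone copy "/root/pve-config-${STAMP}.tar.gz" offsite:homelab-backups/
```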

Monitoring and Alerting

Prometheus Exporter

📎 Complete monitoring setup: Full Prometheus exporter config with metrics

Install exporter, configure PVE credentials, expose metrics
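
A setup sketch using prometheus-pve-exporter with API-token auth; the user, token name, and paths are examples, and CLI flags vary between exporter versions:

```bash
apt install -y pipx
pipx install prometheus-pve-exporter   # binary lands in ~/.local/bin

cat > /etc/prometheus/pve.yml <<'EOF'
default:
  user: prometheus@pve
  token_name: exporter
  token_value: <token-secret>
  verify_ssl: false
EOF

pve_exporter --config.file /etc/prometheus/pve.yml   # metrics on :9221
```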

Grafana Dashboard

Import dashboard with cluster quorum, Ceph health, VM status panels

Alerting Rules

Alert on quorum loss, Ceph errors, node failures
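
A sketch of one such rule; the metric name comes from prometheus-pve-exporter, so verify it against your exporter's /metrics output:

```bash
cat > /etc/prometheus/rules/proxmox.yml <<'EOF'
groups:
  - name: proxmox-ha
    rules:
      - alert: ProxmoxNodeDown
        expr: pve_up == 0
        for: 2m
        labels:
          severity: critical
EOF
```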

Operational Procedures

Maintenance Mode

Migrate all VMs off node, set maintenance state, perform updates

Rolling Updates

Migrate VMs, update packages, reboot node, repeat for all nodes
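
The per-node sequence, sketched with a hypothetical node name; node-maintenance via crm-command requires PVE 7.3 or newer:

```bash
NODE=pve-node1

# Drain: the HA manager live-migrates guests off the node
ha-manager crm-command node-maintenance enable "$NODE"

# Patch and reboot the drained node
ssh "$NODE" 'apt update && apt full-upgrade -y && reboot'

# Once it rejoins quorum, return it to service and repeat on the next node
ha-manager crm-command node-maintenance disable "$NODE"
```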

Disaster Recovery

Scenario 1: Single Node Failure

Automatic Response:

  1. Corosync detects node failure
  2. Fencing agent confirms node is offline
  3. HA manager migrates VMs to surviving nodes
  4. Services resume on new nodes

Time to Recovery: 2-5 minutes (automatic)

Scenario 2: Split-Brain

Automatic Response:

  1. Network partition detected
  2. Majority partition maintains quorum
  3. Minority partition loses quorum, stops VMs
  4. Fencing prevents both partitions from writing to Ceph

Manual Recovery:

Set expected votes, restart cluster services, verify quorum

Scenario 3: Total Cluster Failure

Manual Recovery:

Set expected=1, start VMs manually, restore quorum after nodes rejoin
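
A recovery sketch (the VM ID is an example); lowering expected votes overrides quorum, so only do this when you're certain the other nodes are truly down:

```bash
# On the first node you bring back up
pvecm expected 1    # make this lone node quorate
qm start 100        # manually start critical VMs

# After the remaining nodes rejoin the cluster
pvecm expected 3    # restore normal quorum requirements
```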

Cost Analysis

My 3-node HA cluster cost:

| Component | Cost | Notes |
|---|---|---|
| Dell R940 (used) | $800 | Primary node |
| Dell R730 (used) | $500 | Secondary node |
| Custom build | $400 | Witness node |
| 10Gb switch | $200 | Storage network |
| Ceph SSDs (9×1TB) | $900 | Distributed storage |
| UPS systems (3) | $300 | Power protection |
| Total | $3,100 | One-time investment |

Monthly costs: ~$30 (electricity, though this varies by region)

Compared to cloud: $150-300/month for equivalent HA VMs

Break-even: roughly 12-26 months ($3,100 ÷ $120-270/month in savings after electricity, depending on which cloud tier you compare against)

Lessons Learned

After running HA Proxmox for two years:

1. Three Nodes is the Sweet Spot

Two nodes have quorum only while both are up, so a single failure takes the cluster down. Four nodes still tolerate only one failure, so the fourth adds cost without resilience. Three provides the best balance.

2. Network Reliability is Critical

Your cluster is only as reliable as the network connecting it. Invest in quality switches and redundant links.

3. Ceph is Powerful but Complex

Ceph provides excellent distributed storage, but monitor it carefully: degraded OSDs can significantly impact performance, though in my experience the severity depends on your workload. The distributed architecture also pays off for storage-hungry workloads; for local LLM deployments that need high-performance storage, it delivers good IOPS for model loading.

4. Test Failover Regularly

I test failover monthly. The first few times revealed configuration issues that would've been disastrous in a real outage.

5. Have a Runbook for Disasters

When your cluster is down at 3 AM, you don't want to figure out recovery procedures. Document everything.

6. Backup Beyond the Cluster

Ceph replication protects against disk failures, not logical corruption. Maintain independent backups.

Performance Metrics

My cluster performance:

  • Uptime: 99.97% (3 hours downtime in 2 years)
  • Failover time: 2-3 minutes average
  • VM migration: <30 seconds (live migration)
  • Ceph write latency: 2-5ms (NVMe SSDs)
  • Ceph read latency: <1ms (cached)
  • Network throughput: 8-9 Gbps (10Gb links)

Research & References

Proxmox Documentation

  1. Proxmox VE Administration Guide - Official documentation
  2. Proxmox Cluster Documentation - Cluster setup guide

Ceph Storage

  1. Ceph Architecture and Design - Official Ceph docs

High Availability Concepts

  1. CAP Theorem - Consistency, Availability, Partition tolerance
  2. Raft Consensus Algorithm - Distributed consensus explanation
  3. Corosync Documentation - Cluster communication

Conclusion

Building an HA Proxmox cluster eliminated my single point of failure and dramatically improved homelab reliability. I can now perform maintenance without downtime, and hardware failures no longer cause panic.

Is HA overkill for a homelab? Maybe. But when you self-host critical services like password management and DNS, the peace of mind is worth the investment. Plus, learning enterprise-grade HA concepts in a homelab environment is invaluable experience.

Start with a 3-node cluster, use Ceph for storage, test failover regularly, and enjoy worry-free infrastructure.


Running HA in your homelab? What failure scenarios have you encountered? Share your clustering stories and lessons learned!
