# OpenClaw Infrastructure Mastery
## Proxmox + Debian 13 + Self-Healing Setup Guide

---

## Introduction

This is the **premium tier training** for organizations wanting to run OpenClaw on **dedicated infrastructure** with **self-healing capabilities**.

Instead of using cloud platforms, you get:
- Your own Proxmox hypervisor (virtualization)
- Dedicated Debian 13 VM for OpenClaw
- Full control over hardware + network
- Self-monitoring and auto-recovery
- Private, on-premises or hybrid deployment

**Target audience:** CTOs, DevOps engineers, enterprises, agencies wanting full control

**Price point:** $3,999-9,999 (add-on to Mastery program or standalone)

---

## Part 1: Proxmox Foundation

### What is Proxmox?

Proxmox VE is an open-source hypervisor (like VMware ESXi, but free). It lets you:
- Run multiple VMs on one physical server
- Snapshot and backup entire systems
- Scale resources dynamically
- Manage everything from web UI

### Hardware Requirements

**Minimum:**
- 4-core CPU (Intel/AMD, 2+ GHz)
- 16GB RAM
- 100GB SSD

**Recommended:**
- 8+ core CPU
- 32GB+ RAM
- 500GB+ NVMe SSD
- Dual NICs (network redundancy)

**Enterprise:**
- 16+ core CPU
- 64GB+ RAM
- 2TB+ SSD with RAID
- Quad NICs
- Backup NAS attached

### Installation (Day 1 Video)

```bash
# 1. Download Proxmox ISO
# From: https://www.proxmox.com/en/proxmox-ve/get-started

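# 2. Create bootable USB (replace /dev/sdX with your USB device; dd erases it)
# On Linux (find the device with lsblk):
dd if=proxmox-ve_8.0-2.iso of=/dev/sdX bs=4M conv=fsync status=progress
# On macOS the device is /dev/diskN (see `diskutil list`); older BSD dd builds
# want a lowercase size suffix (bs=4m) and do not support status=progress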

# 3. Boot from USB on target hardware
# Hit Enter to install
# Follow wizard:
#   - Choose target disk (warning: erases it)
#   - Enter hostname (e.g., "openclaw-host")
#   - Set IP address (e.g., 192.168.1.100)
#   - Set root password
#   - Finish installation

# 4. Reboot
# Access web UI: https://192.168.1.100:8006
# Login: root / [password]
```
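Before writing the ISO to USB, it is worth checking the download against the SHA256 checksum published on the Proxmox download page. The following is a minimal sketch, assuming GNU `sha256sum` is available; `verify_iso` is a hypothetical helper name and the checksum in the usage line is a placeholder.

```bash
#!/bin/bash
# verify_iso FILE EXPECTED_SHA256
# Succeeds when the SHA256 digest of FILE equals EXPECTED_SHA256.
# (verify_iso is a hypothetical helper, not a Proxmox tool.)
verify_iso() {
    local file="$1" expected="$2" actual
    actual=$(sha256sum "$file" | awk '{print $1}')
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Usage (paste the checksum shown next to the ISO on the download page):
# verify_iso proxmox-ve_8.0-2.iso "<sha256 from the download page>" && echo "ISO OK"
```

A mismatch usually means a truncated or corrupted download, so re-download rather than proceed to `dd`.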

**Video outline:**
- Physical hardware setup
- BIOS/UEFI configuration
- USB creation (Windows/Mac/Linux)
- Installation walkthrough
- Initial configuration
- Firewall rules

---

## Part 2: Debian 13 OpenClaw VM

### Creating the VM

**Step 1: Create VM in Proxmox UI**
```
1. Click "Create VM" (top right)
2. Name: openclaw-prod
3. Node: [your Proxmox host]
4. VM ID: 100 (or next available)
5. OS: Debian 13
6. Storage: local-lvm (or your storage)
7. Disk: 100GB (recommended: 200GB)
8. CPU: 4 cores (or 8 for high-load)
9. Memory: 8GB (or 16GB for high-load)
10. Network: vmbr0 (default bridge)
11. Create
```

**Step 2: Install Debian 13**
```bash
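# Download the Debian 13 ISO and upload it to Proxmox storage (e.g., local → ISO Images)
# Attach the ISO to the VM's CD/DVD drive
# Boot the VM and run the standard Debian installation
# Select "SSH server" at the software selection step
# Detach the ISO and reboot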
```

**Step 3: Initial Configuration**
```bash
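# SSH into the VM. Debian's default sshd_config rejects root password logins,
# so log in as the regular user created during install, then become root.
ssh youruser@192.168.1.101   # "youruser" is a placeholder
su -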

# Update system
apt update && apt upgrade -y

# Install essentials
apt install -y build-essential curl wget git htop tmux zsh

# Install Node.js (required for OpenClaw)
curl -fsSL https://deb.nodesource.com/setup_22.x | bash -
apt install -y nodejs

# Verify Node version
node --version  # Should be v22.22.0+

# Verify npm (bundled with the nodejs package)
npm --version

# Install OpenClaw
npm install -g openclaw

# Verify installation
openclaw version
```
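The version check above can be scripted so that a provisioning run stops early on an old Node.js. This is a small sketch, assuming only the major version needs to be 22 or newer; `node_major` and `require_node_22` are hypothetical helper names.

```bash
#!/bin/bash
# node_major VERSION -> print the major component of a string like "v22.3.0"
node_major() {
    local v="${1#v}"          # strip the leading "v"
    printf '%s\n' "${v%%.*}"  # keep everything before the first dot
}

# require_node_22 VERSION -> succeed when the major version is 22 or newer
require_node_22() {
    [ "$(node_major "$1")" -ge 22 ]
}

# Usage on the VM:
# require_node_22 "$(node --version)" || echo "Node.js 22+ required" >&2
```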

### Networking Setup

```bash
# Configure static IP (edit /etc/network/interfaces)
nano /etc/network/interfaces

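# Content:
# Replace eth0 with the interface name shown by `ip -br link`
# (VirtIO NICs on Proxmox usually appear as ens18).
# dns-nameservers only takes effect when the resolvconf package is installed;
# otherwise set the nameservers in /etc/resolv.conf.
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
    address 192.168.1.101
    netmask 255.255.255.0
    gateway 192.168.1.1
    dns-nameservers 8.8.8.8 1.1.1.1

# Save and restart networking
systemctl restart networking

# Test connectivity
ping -c 4 8.8.8.8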
```

### Firewall Rules

```bash
# Install UFW
apt install -y ufw

# Set default policies first
ufw default deny incoming
ufw default allow outgoing

# Allow SSH before enabling the firewall, or you can lock yourself out
ufw allow 22/tcp

# Allow the OpenClaw port (3000 in this guide)
# Port 8006 is the Proxmox web UI on the host, not on this VM, so it needs no rule here
ufw allow 3000/tcp

# Allow HTTP and HTTPS
ufw allow 80/tcp
ufw allow 443/tcp

# Enable firewall
ufw enable

# Check rules
ufw status numbered
```

---

## Part 3: Self-Healing Infrastructure

### What is Self-Healing?

Your infrastructure **automatically detects and fixes problems** without human intervention:
- Service crashes → auto-restart
- Disk full → auto-cleanup
- Memory leak → auto-restart
- Backup fails → auto-alert
- SSL cert expires → auto-renew

### Monitoring Stack

**Install Prometheus + Alertmanager** (open-source monitoring)

```bash
# Create monitoring user
useradd --no-create-home --shell /bin/false prometheus
useradd --no-create-home --shell /bin/false alertmanager

# Download Prometheus
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz
tar xvfz prometheus-2.47.0.linux-amd64.tar.gz
mv prometheus-2.47.0.linux-amd64 /opt/prometheus

# Create config
cat > /opt/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
  
  # Assumes OpenClaw serves Prometheus metrics at /metrics on port 3000
  - job_name: 'openclaw'
    static_configs:
      - targets: ['localhost:3000']
EOF

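# Start Prometheus as the prometheus user, with its data stored outside /tmp
# (the default data directory is ./data relative to the current directory)
chown -R prometheus:prometheus /opt/prometheus
runuser -u prometheus -- /opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data &
# Access: http://192.168.1.101:9090 (open 9090/tcp in UFW for trusted networks only)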
```
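A process started with `&` does not come back after a reboot or a crash, which works against the self-healing goal. One option is to run Prometheus under systemd, as is done for Node Exporter. The sketch below prints a unit file matching the `/opt/prometheus` layout used above; the unit name `prometheus.service` and the `render_prometheus_unit` helper are assumptions of this example.

```bash
#!/bin/bash
# render_prometheus_unit: print a systemd unit for Prometheus to stdout.
# Paths follow the /opt/prometheus layout from this guide; the helper name
# and the unit name are illustrative.
render_prometheus_unit() {
    cat << 'EOF'
[Unit]
Description=Prometheus
After=network-online.target
Wants=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
    --config.file=/opt/prometheus/prometheus.yml \
    --storage.tsdb.path=/opt/prometheus/data
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
}

# Usage (as root, after stopping any Prometheus started by hand):
# render_prometheus_unit > /etc/systemd/system/prometheus.service
# chown -R prometheus:prometheus /opt/prometheus
# systemctl daemon-reload && systemctl enable --now prometheus
```

Keeping the generator as a function makes it easy to review the unit with `render_prometheus_unit | less` before installing it.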

**Install Node Exporter** (system metrics)

```bash
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

# Run as service
cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

useradd --no-create-home --shell /bin/false node_exporter
systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter

# Verify: curl http://localhost:9100/metrics
```

### Auto-Healing Scripts

**Script 1: Auto-restart OpenClaw**
```bash
mkdir -p /opt/scripts
apt install -y mailutils   # provides `mail`; delivery also needs a configured MTA

cat > /opt/scripts/openclaw-healthcheck.sh << 'EOF'
#!/bin/bash

# Check if OpenClaw is running
if ! systemctl is-active --quiet openclaw; then
    echo "[$(date)] OpenClaw not running. Restarting..."
    systemctl restart openclaw

    # Send alert
    echo "OpenClaw restarted at $(date)" | mail -s "OpenClaw Alert" admin@example.com
fi

# Check disk space (-P keeps each filesystem on a single line)
DISK_USAGE=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 80 ]; then
    echo "[$(date)] Disk usage at ${DISK_USAGE}%. Cleaning up..."

    # Clean rotated logs and old OpenClaw logs
    find /var/log -name "*.gz" -delete
    find /root/.openclaw -name "*.log" -mtime +30 -delete

    # Clean temp files older than 7 days (rm -rf /tmp/* would also remove files still in use)
    find /tmp -mindepth 1 -mtime +7 -delete

    echo "Disk cleaned. Usage now: $(df -P / | awk 'NR==2 {print $5}')" | mail -s "Disk Cleanup Alert" admin@example.com
fi

# Check memory (used / total, in percent)
MEM_USAGE=$(free | awk 'NR==2 {print int($3/$2*100)}')
if [ "$MEM_USAGE" -gt 85 ]; then
    echo "[$(date)] Memory usage at ${MEM_USAGE}%. Restarting services..."
    systemctl restart openclaw
fi
EOF

chmod +x /opt/scripts/openclaw-healthcheck.sh

# Run every 5 minutes (2>/dev/null hides the "no crontab" message on first use)
(crontab -l 2>/dev/null; echo "*/5 * * * * /opt/scripts/openclaw-healthcheck.sh") | crontab -
```
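The threshold checks in the health-check script can be factored into small functions that read command output from stdin, which makes them easy to exercise with sample data before trusting them on a live system. This is a sketch only; the function names `usage_pct`, `mem_pct`, and `exceeds` are hypothetical.

```bash
#!/bin/bash
# usage_pct: read `df -P` output on stdin and print the use percentage
# of the first listed filesystem as a bare integer.
usage_pct() {
    awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# mem_pct: read `free` output on stdin and print used/total memory
# as an integer percentage (same formula as the health-check script).
mem_pct() {
    awk 'NR==2 { print int($3 / $2 * 100) }'
}

# exceeds VALUE LIMIT -> succeed when VALUE is strictly greater than LIMIT
exceeds() {
    [ "$1" -gt "$2" ]
}

# Usage:
# exceeds "$(df -P / | usage_pct)" 80 && echo "disk cleanup needed"
# exceeds "$(free | mem_pct)" 85 && echo "restart needed"
```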

**Script 2: Auto-backup OpenClaw**
```bash
mkdir -p /opt/scripts

cat > /opt/scripts/openclaw-backup.sh << 'EOF'
#!/bin/bash

BACKUP_DIR="/backup/openclaw"
DATE=$(date +%Y-%m-%d_%H-%M-%S)
BACKUP_FILE="$BACKUP_DIR/openclaw-backup-$DATE.tar.gz"

# Create backup directory
mkdir -p "$BACKUP_DIR"

# Stop OpenClaw temporarily
systemctl stop openclaw

# Backup workspace + config
tar -czf "$BACKUP_FILE" /root/.openclaw /etc/openclaw
TAR_STATUS=$?

# Start OpenClaw again, even if the backup failed
systemctl start openclaw

# On failure, alert and keep the older backups untouched
if [ "$TAR_STATUS" -ne 0 ]; then
    echo "Backup failed (tar exit $TAR_STATUS): $BACKUP_FILE" | mail -s "OpenClaw Backup Failed" admin@example.com
    exit 1
fi

# Keep only last 7 days of backups
find "$BACKUP_DIR" -name "*.tar.gz" -mtime +7 -delete

echo "Backup completed: $BACKUP_FILE"
EOF

chmod +x /opt/scripts/openclaw-backup.sh

# Run daily at 2 AM
(crontab -l 2>/dev/null; echo "0 2 * * * /opt/scripts/openclaw-backup.sh") | crontab -
```
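A backup is only useful if the archive can be read back. The sketch below adds two helpers around the backup script's output: one checks that an archive lists cleanly, the other picks the newest archive by its timestamped file name. Both helper names are hypothetical, and the file-name pattern is the one produced by the backup script above.

```bash
#!/bin/bash
# verify_backup ARCHIVE -> succeed when the gzip-compressed tar archive
# can be listed from start to end without errors.
verify_backup() {
    tar -tzf "$1" > /dev/null 2>&1
}

# latest_backup DIR -> print the newest openclaw-backup-*.tar.gz in DIR,
# or an empty line when there is none. The glob expands in sorted order and
# the names embed a YYYY-MM-DD_HH-MM-SS timestamp, so the last match is newest.
latest_backup() {
    local f latest=""
    for f in "$1"/openclaw-backup-*.tar.gz; do
        [ -e "$f" ] && latest="$f"
    done
    printf '%s\n' "$latest"
}

# Usage:
# verify_backup "$(latest_backup /backup/openclaw)" || echo "latest backup is unreadable" >&2
```

Listing the archive catches truncated or corrupted files; a full restore test into a scratch directory remains the stronger check.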

**Script 3: Auto-renew SSL certificates**
```bash
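# Install Let's Encrypt certbot (one-time setup)
apt install -y certbot python3-certbot-nginx

# Create certificate (first time)
# --standalone needs port 80 free, so stop nginx first if it is already running
certbot certonly --standalone -d yourdomain.com --email admin@example.com --agree-tos -n

# Renewal script. `certbot renew` is safe to run daily: it only renews
# certificates that are close to expiry. --deploy-hook runs only after a
# successful renewal, so OpenClaw is not restarted every night.
mkdir -p /opt/scripts
cat > /opt/scripts/ssl-renew.sh << 'EOF'
#!/bin/bash
certbot renew --quiet --deploy-hook "systemctl restart openclaw"
EOF

chmod +x /opt/scripts/ssl-renew.sh

# Check for renewals daily at 3 AM
(crontab -l 2>/dev/null; echo "0 3 * * * /opt/scripts/ssl-renew.sh") | crontab -

# Verify status
certbot certificates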
```

### SystemD Service (Auto-restart on failure)

```bash
cat > /etc/systemd/system/openclaw.service << 'EOF'
[Unit]
Description=OpenClaw AI Agent Platform
After=network.target
# Stop retrying if the service fails 5 times within 10 minutes
StartLimitIntervalSec=600
StartLimitBurst=5

[Service]
Type=simple
User=root
WorkingDirectory=/root/.openclaw/workspace
ExecStart=/usr/bin/node /usr/lib/node_modules/openclaw/index.js
# Auto-restart on failure
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal

# Resource limits (MemoryMax replaces the deprecated MemoryLimit)
MemoryMax=2G
CPUQuota=75%

[Install]
WantedBy=multi-user.target
EOF

# Enable service
systemctl daemon-reload
systemctl enable openclaw
systemctl start openclaw

# Check status
systemctl status openclaw

# View logs
journalctl -u openclaw -f
```

---

## Part 4: Security Hardening

### SSH Hardening

```bash
nano /etc/ssh/sshd_config

# Key settings:
Port 2222                          # Change from 22
PermitRootLogin prohibit-password  # No root password login
PasswordAuthentication no          # Use keys only
PubkeyAuthentication yes
X11Forwarding no
MaxAuthTries 3
ClientAliveInterval 300
ClientAliveCountMax 2

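# Before restarting: confirm your public key is in ~/.ssh/authorized_keys,
# open the new port, and validate the config. Keep the current session open
# until a new login on port 2222 works.
ufw allow 2222/tcp
sshd -t && systemctl restart ssh   # the unit is named "ssh" on Debian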
```
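After editing `sshd_config`, the effective settings can be checked with `sshd -T`, which prints the running configuration as lowercase `keyword value` pairs. The helper below is a sketch for pulling one value out of that output; `sshd_setting` is a hypothetical name.

```bash
#!/bin/bash
# sshd_setting KEY -> read `sshd -T` output on stdin and print the value
# of the first line whose keyword equals KEY (keywords are lowercase).
sshd_setting() {
    awk -v key="$1" '$1 == key { print $2; exit }'
}

# Usage (as root):
# sshd -T | sshd_setting port                     # expect 2222
# sshd -T | sshd_setting passwordauthentication   # expect no
```

Checking the effective configuration catches cases where a drop-in under `/etc/ssh/sshd_config.d/` overrides the main file.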

### Fail2Ban (brute-force protection)

```bash
apt install -y fail2ban

cat > /etc/fail2ban/jail.local << 'EOF'
[DEFAULT]
bantime = 3600
findtime = 600
maxretry = 5

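[sshd]
enabled = true
port = 2222
# A default Debian 13 install logs sshd to the systemd journal, not /var/log/auth.log
backend = systemd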
EOF

systemctl enable fail2ban
systemctl start fail2ban

# Check bans
fail2ban-client status
fail2ban-client status sshd
```

### Nginx Reverse Proxy (optional, for HTTPS)

```bash
apt install -y nginx certbot python3-certbot-nginx

cat > /etc/nginx/sites-available/openclaw << 'EOF'
server {
    listen 443 ssl;
    http2 on;
    server_name yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/yourdomain.com/privkey.pem;

    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;

    location / {
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

server {
    listen 80;
    server_name yourdomain.com;
    return 301 https://$server_name$request_uri;
}
EOF

ln -s /etc/nginx/sites-available/openclaw /etc/nginx/sites-enabled/
nginx -t
systemctl restart nginx
```

---

## Part 5: Backup & Disaster Recovery

### VM Snapshots (Proxmox)

```bash
# Via Proxmox UI:
# 1. Right-click VM (openclaw-prod)
# 2. More → Snapshot
# 3. Name: "before-update" or "working-state"
# 4. Include RAM: Yes (for instant recovery)
# 5. Snapshot

# Or via CLI on the Proxmox host (--vmstate 1 includes RAM, matching the UI option):
qm snapshot 100 before-update --vmstate 1
qm snapshot 100 working-state --vmstate 1

# List snapshots
qm listsnapshot 100

# Rollback
qm rollback 100 before-update
```

### NAS Backup (Network-attached storage)

```bash
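# Mount NAS
apt install -y nfs-common rsync
mkdir -p /mnt/backup-nas
mount -t nfs 192.168.1.50:/backup /mnt/backup-nas
# To remount at boot, add this line to /etc/fstab:
# 192.168.1.50:/backup  /mnt/backup-nas  nfs  defaults,_netdev  0  0

# Automated backup script
mkdir -p /opt/scripts
cat > /opt/scripts/nas-backup.sh << 'EOF'
#!/bin/bash
# Abort if the NAS is not mounted; otherwise rsync would write to the local disk
if ! mountpoint -q /mnt/backup-nas; then
    echo "NAS not mounted, backup skipped at $(date)" >> /var/log/nas-backup.log
    exit 1
fi

rsync -avz --delete /root/.openclaw /mnt/backup-nas/openclaw-backup/
rsync -avz --delete /etc/openclaw /mnt/backup-nas/config-backup/

echo "NAS backup completed at $(date)" >> /var/log/nas-backup.log
EOF

chmod +x /opt/scripts/nas-backup.sh

# Run daily at 3 AM
(crontab -l 2>/dev/null; echo "0 3 * * * /opt/scripts/nas-backup.sh") | crontab -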
```

### Disaster Recovery Plan

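**RTO (Recovery Time Objective):** about 15 minutes (the sum of steps 3-6 below)  
**RPO (Recovery Point Objective):** up to 24 hours with the daily snapshot and backup schedule in this guide; run both hourly to reach < 1 hour

**Procedure:**
1. Confirm a Proxmox snapshot exists (taken daily)
2. Confirm a NAS backup exists (a second copy on the LAN; replicate it to another site for an offsite copy)
3. Rollback VM from snapshot (5 min)
4. Restore config from NAS (5 min)
5. Start services (2 min)
6. Verify (3 min)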

---

## Part 6: Performance Optimization

### CPU & Memory Tuning

```bash
# Check current limits
free -h
nproc
top -b -n 1 | head -20

# Optimize OpenClaw
cat > /root/.openclaw/config.json << 'EOF'
{
  "server": {
    "maxWorkers": 4,
    "maxMemory": "2GB"
  },
  "database": {
    "pool": 10,
    "connectionTimeout": 5000
  },
  "cache": {
    "enabled": true,
    "ttl": 3600
  }
}
EOF
```

### Disk I/O Optimization

```bash
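# Check disk performance (iostat is in the sysstat package)
apt install -y sysstat
iostat -x 1 5

# Increase read-ahead (the device may be vda instead of sda, depending on the VM disk bus)
echo 1024 > /sys/block/sda/queue/read_ahead_kb

# Use the pass-through scheduler for SSD-backed virtual disks.
# The legacy "noop" scheduler no longer exists on current kernels; "none" is its
# multi-queue equivalent. List the available options with:
#   cat /sys/block/sda/queue/scheduler
echo none > /sys/block/sda/queue/scheduler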
```

### Network Optimization

```bash
# Increase TCP buffer sizes
# Use a drop-in under /etc/sysctl.d/: Debian 13 applies these at boot and
# no longer reads /etc/sysctl.conf
cat > /etc/sysctl.d/99-openclaw-network.conf << 'EOF'
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
EOF

# Apply all sysctl configuration files now
sysctl --system
```

---

## Part 7: Monitoring Dashboard

### Grafana (Visualization)

```bash
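# Install Grafana from the official APT repository
# (add-apt-repository is not available on Debian 13)
apt install -y apt-transport-https wget gnupg
mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor > /etc/apt/keyrings/grafana.gpg
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" > /etc/apt/sources.list.d/grafana.list
apt update
apt install -y grafana

# Grafana defaults to port 3000, which this guide already uses for OpenClaw.
# Move Grafana to 3001 before starting it.
sed -i 's/^;http_port = 3000/http_port = 3001/' /etc/grafana/grafana.ini

# Enable and start
systemctl enable grafana-server
systemctl start grafana-server

# Access at http://192.168.1.101:3001 (allow 3001/tcp in UFW for trusted networks)
# Default login: admin / admin (you are prompted to change it on first login)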

# Add Prometheus data source:
# 1. Settings → Data Sources
# 2. Add Prometheus
# 3. URL: http://localhost:9090
# 4. Save

# Create dashboards for:
# - OpenClaw uptime
# - CPU/Memory usage
# - Disk space
# - Network traffic
# - Error rates
```

### Alert Configuration

```bash
cat > /opt/prometheus/alerts.yml << 'EOF'
groups:
  - name: openclaw_alerts
    rules:
      - alert: OpenClawDown
        expr: up{job="openclaw"} == 0
        for: 5m
        annotations:
          summary: "OpenClaw is down"
          description: "OpenClaw has been unreachable for 5 minutes"
      
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} < 5368709120
        for: 5m
        annotations:
          summary: "Low disk space"
          description: "Less than 5GB remaining on root partition"
      
      - alert: MemoryUsageHigh
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.85
        for: 5m
        annotations:
          summary: "High memory usage"
          description: "Memory usage above 85%"
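EOF

# Prometheus only loads these rules if prometheus.yml references them. Add:
#   rule_files:
#     - /opt/prometheus/alerts.yml
# then validate the file and restart Prometheus:
/opt/prometheus/promtool check rules /opt/prometheus/alerts.yml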
```
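Alert rules only produce notifications once Prometheus forwards firing alerts to Alertmanager, which this guide names but does not configure. The following is a minimal sketch of an email route, assuming Alertmanager is installed separately and listening on its default port 9093. The SMTP host, the sender address, the config path in the usage comment, and the `render_alertmanager_config` helper name are all placeholders.

```bash
#!/bin/bash
# render_alertmanager_config: print a minimal Alertmanager configuration
# that sends every alert to one email receiver. Host names and addresses
# are placeholders for this sketch.
render_alertmanager_config() {
    cat << 'EOF'
route:
  receiver: email-admin
  group_wait: 30s
  repeat_interval: 4h

receivers:
  - name: email-admin
    email_configs:
      - to: admin@example.com
        from: alertmanager@example.com
        smarthost: smtp.example.com:587
EOF
}

# Usage (the target path is an assumption; use the one your install reads):
# render_alertmanager_config > /etc/alertmanager/alertmanager.yml
#
# Prometheus also needs to know where Alertmanager is. In prometheus.yml:
#   alerting:
#     alertmanagers:
#       - static_configs:
#           - targets: ['localhost:9093']
```

Most SMTP relays also require credentials, which Alertmanager accepts through `auth_username` and `auth_password` under `email_configs`.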

---

## Part 8: Training Modules

### Module 1: Proxmox Administration (2 hours)
- Hardware selection
- Installation & initial setup
- Creating VMs
- Networking configuration
- Storage management
- Snapshotting & backups

### Module 2: Debian 13 System Administration (2 hours)
- Linux fundamentals
- Package management
- User & permission management
- Firewall configuration
- Service management
- Log management

### Module 3: OpenClaw Infrastructure (2 hours)
- VM setup for OpenClaw
- Network configuration
- Service installation
- Configuration tuning
- Port management
- Security hardening

### Module 4: Self-Healing Systems (2 hours)
- Monitoring architecture
- Prometheus setup
- Alerting rules
- Auto-recovery scripts
- Fail2ban configuration
- Rate limiting

### Module 5: Backup & Disaster Recovery (1.5 hours)
- Snapshot strategies
- NAS integration
- Automated backups
- Recovery procedures
- Testing backups
- RTO/RPO planning

### Module 6: Security Hardening (1.5 hours)
- SSH hardening
- Firewall rules
- SSL/TLS configuration
- Rate limiting
- DDoS protection
- Audit logging

### Module 7: Performance Tuning (1 hour)
- CPU/Memory optimization
- Disk I/O tuning
- Network optimization
- Caching strategies
- Benchmarking

### Module 8: Monitoring & Dashboards (1 hour)
- Grafana setup
- Custom dashboards
- Alert configuration
- Trend analysis
- Capacity planning

---

## Part 9: Hands-On Projects

### Project 1: Build from Scratch
**Objective:** Install Proxmox on physical hardware, create OpenClaw VM, configure networking

**Deliverable:** Running OpenClaw instance on your own Proxmox host

**Time:** 4 hours

---

### Project 2: High Availability Setup
**Objective:** Create 3-node Proxmox cluster for redundancy

**Deliverable:** OpenClaw automatically fails over if one node goes down

**Time:** 6 hours

---

### Project 3: Backup & Recovery
**Objective:** Implement automated backups + test disaster recovery

**Deliverable:** Proven ability to recover from hardware failure in <15 min

**Time:** 3 hours

---

### Project 4: Monitoring & Alerting
**Objective:** Set up Prometheus, Grafana, alerting

**Deliverable:** Real-time dashboards + auto-alerts for issues

**Time:** 2 hours

---

## Part 10: Troubleshooting Guide

### Issue: OpenClaw won't start
**Solution:**
```bash
systemctl status openclaw
journalctl -u openclaw -n 50
# Check logs, usually disk full or memory issue
```

### Issue: Network connectivity lost
**Solution:**
```bash
ping 8.8.8.8  # Check internet
ip route      # Check routing
systemctl restart networking  # Restart network
```

### Issue: Disk space critical
**Solution:**
```bash
df -h                          # Check usage
du -sh /root/.openclaw/*       # Find large dirs
# Clean old backups, logs, temp files
```

### Issue: High CPU/Memory usage
**Solution:**
```bash
top -p "$(pgrep -d, -f openclaw)"   # See process details (top needs a comma-separated PID list)
ps aux | grep openclaw         # Check all processes
# Increase resources or optimize code
```

### Issue: SSL certificate expiring
**Solution:**
```bash
certbot renew --dry-run        # Test renewal
certbot renew                  # Actually renew
systemctl restart nginx        # Apply new cert
```

---

## Part 11: Cost Analysis

### Hardware (One-Time)
- Server: $500-2000
- RAM: $50-200
- SSD: $50-300
- Network: $0-100
- **Total:** $600-2600

### Operating Costs (Monthly)
- Power: $20-50
- Cooling: $10-20
- Internet: $30-100
- Maintenance: $0-50
- **Total:** $60-220/month

### vs. Cloud Costs
**Cloud (AWS EC2):**
- m5.xlarge (4CPU, 16GB): ~$150/month
- Storage (100GB): ~$5/month
- Backup: ~$10/month
- **Total:** ~$165/month

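**On-Premises:**
- Break-even = hardware cost ÷ (cloud monthly cost − on-prem monthly cost)
- Best case ($600 hardware, $60/month to run): 600 ÷ (165 − 60) ≈ 6 months to break even, then about $105/month saved (roughly $660 in the first year)
- At the high end of operating costs ($220/month), on-prem costs more per month than this cloud configuration and does not pay for itself on cost alone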
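How quickly on-prem pays for itself depends on the gap between the cloud bill and the on-prem operating cost, not on the cloud bill alone. The sketch below works through that arithmetic using the figures from the tables above; `break_even_months` is a hypothetical helper name.

```bash
#!/bin/bash
# break_even_months HARDWARE CLOUD_MONTHLY ONPREM_MONTHLY
# Prints the number of months until the hardware cost is recovered,
# or "never" when on-prem running costs are not lower than the cloud bill.
break_even_months() {
    awk -v hw="$1" -v cloud="$2" -v onprem="$3" 'BEGIN {
        saving = cloud - onprem            # monthly saving vs. cloud
        if (saving <= 0) { print "never"; exit }
        printf "%.1f\n", hw / saving
    }'
}

break_even_months 600 165 60     # low-end hardware and operating costs
break_even_months 2600 165 220   # high-end hardware and operating costs
```

With the low-end figures this prints `5.7`, and with the high-end figures it prints `never`, so the outcome is sensitive to power, cooling, and maintenance costs.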

---

## Part 12: Certification & Next Steps

### What You'll Know
✅ How to build production infrastructure  
✅ How to deploy OpenClaw on bare metal  
✅ How to set up self-healing systems  
✅ How to implement disaster recovery  
✅ How to monitor and optimize performance  
✅ How to secure enterprise systems  

### Certification
Upon completion:
- Digital badge: "OpenClaw Infrastructure Certified"
- Job board access for infrastructure roles
- Private community of infrastructure experts
- Advanced workshops (quarterly)

### Next Steps
1. **Buy hardware** (used Proxmox-compatible server, ~$500-1500)
2. **Install Proxmox** (2-3 hours)
3. **Create OpenClaw VM** (1-2 hours)
4. **Implement monitoring** (3-4 hours)
5. **Test disaster recovery** (2 hours)
6. **Go live** and start migrating workloads

---

## Appendix: Useful Commands

```bash
# Proxmox
qm list                    # List all VMs
qm start 100              # Start VM 100
qm stop 100               # Stop VM 100
qm snapshot 100 snap1     # Create snapshot
qm rollback 100 snap1     # Restore snapshot

# OpenClaw
openclaw status           # Check status
openclaw update           # Auto-update
openclaw config get       # View config
openclaw doctor           # Health check
openclaw help             # Show all commands

# System
systemctl status openclaw          # Service status
journalctl -u openclaw -f          # Live logs
df -h                              # Disk usage
free -h                            # Memory usage
top                                # Process monitor
htop                               # Interactive monitor

# Networking
ip addr                   # Show IPs
ip route                  # Show routes
netstat -tulpn            # Show listening ports (needs the net-tools package)
ss -tulpn                 # Show listening sockets
```
