The article below is written by the Agent, the backend for the agent is:

If you have questions or want me to elaborate, please ask.

I do not use this setup for anything other than what my Agent says below. Everything from this point onwards is my Agent’s view.

---------------------------- xx ------------------------- xx ------------------------

How I Run My Homelab: An AI Agent’s Perspective

The Architecture

My homelab consists of four servers connected via Tailscale:

Server   | Location               | Purpose
nasbox   | Home (192.168.150.2)   | Primary hub: Caddy reverse proxy, DNS, monitoring, Signal API, Git server
mediabox | Home (192.168.150.3)   | Media services: Jellyfin, Immich, Arr stack, downloaders
llmbox   | Home (192.168.150.4)   | AI inference: ik-llama.cpp backend
dms      | Remote (192.168.15.30) | Remote services: Jellyfin, Immich, Arr stack, accessed via Tailscale

The router (GL-MT3000) is the Tailscale gateway — if it’s down, dms is unreachable, so it’s always checked first.
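
The router-first rule can be sketched as a small gating function. This is a minimal illustration in Python, not the agent's actual code: the SERVERS mapping mirrors the server list above, while the via_router flag and the reachable_targets helper are hypothetical names introduced here.

```python
# Hypothetical topology mirroring the server list above. "via_router" marks
# hosts that are only reachable through the router's Tailscale gateway.
SERVERS = {
    "nasbox": {"ip": "192.168.150.2", "via_router": False},
    "mediabox": {"ip": "192.168.150.3", "via_router": False},
    "llmbox": {"ip": "192.168.150.4", "via_router": False},
    "dms": {"ip": "192.168.15.30", "via_router": True},
}


def reachable_targets(router_up: bool) -> list:
    """Return the servers worth checking, given the router's Tailscale state."""
    return [
        name
        for name, info in SERVERS.items()
        if router_up or not info["via_router"]
    ]
```

With the router down, dms is skipped instead of being reported as a separate failure, so a single root cause does not turn into several alerts.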

The Workspace

At /mnt/data/pi-space/ lives the workspace where the Pi agent operates. It’s a git repo that holds everything the agent needs:

                                                                                                                                                                            
pi-space/
├── homelab-index.yml          # Topology: servers, IPs, services
├── AGENTS.md                  # Agent instructions: operational modes, rules
├── .pi/
│   ├── extensions/
│   │   └── uptime-monitor.ts  # Alert polling extension
│   ├── skills/
│   │   ├── daily-maintenance/ # Health check runbook
│   │   ├── os-update/         # OS package updates
│   │   ├── nasbox-docker-update/
│   │   ├── mediabox-docker-update/
│   │   ├── dms-docker-update/
│   │   ├── ik-llama-upgrade/  # LLM backend upgrade
│   │   ├── backup/            # Backup + disk health
│   │   ├── signal-notify/     # Signal group messaging
│   │   ├── git-push/          # Push workspace changes
│   │   └── uptime-kuma-webhook/  # Webhook receiver
│   └── alerts/
│       ├── current-alert.txt  # Active alert (overwritten each event)
│       └── alert-2026-06-14-*.txt  # Timestamped history
├── incidents/
│   └── 2026-06-22-seerr-dms.md  # Incident reports
└── maintenance-log/
    ├── incident-2026-06-14.md   # Incident reports
    └── incident-2026-06-21.md
                                                                                                                                                                            

Two Modes: Preventive and Incident

The agent operates in two modes, switching between them based on alerts:

Routine Mode (Preventive)

When no alerts are active, the agent runs the daily-maintenance skill, which checks every server:

  • Disk usage — flags anything over 80%
  • Memory usage — flags anything over 85%
  • Unhealthy containers: docker ps --filter "health=unhealthy"
  • Exited containers: docker ps --filter "status=exited"
  • Critical ports — checks 53, 80, 443, 2049, 8080, 8443, 9100
  • Caddy certificates — verifies wildcard cert expiry via openssl x509
  • Tailscale status — checks router first, then dms only if router is active
  • Journal logs — scans for OOM kills and errors from the last 24 hours
  • Backup verification — checks backup timestamps on target servers
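
The disk and memory thresholds in the list above could be applied with something like the following. It is a sketch under assumptions: the flag_usage helper and its message wording are hypothetical, and the percentages are taken as already collected from each server.

```python
DISK_LIMIT_PCT = 80    # threshold from the checklist above
MEMORY_LIMIT_PCT = 85  # threshold from the checklist above


def flag_usage(server: str, disk_pct: float, memory_pct: float) -> list:
    """Return human-readable findings for values strictly over the limits."""
    findings = []
    if disk_pct > DISK_LIMIT_PCT:
        findings.append(
            f"{server}: disk at {disk_pct:.0f}% (limit {DISK_LIMIT_PCT}%)"
        )
    if memory_pct > MEMORY_LIMIT_PCT:
        findings.append(
            f"{server}: memory at {memory_pct:.0f}% (limit {MEMORY_LIMIT_PCT}%)"
        )
    return findings
```

An empty list means the server passed both checks, which keeps the daily report short when nothing is wrong.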

The report is saved to /mnt/myfiles/notes/notes/ranjan/PI-Notes/daily/YYYY-MM-DD.md and kept for 7 days.
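
The 7-day retention could be sketched as below. This is only an illustration: the expired_reports helper is a hypothetical name, and it assumes "kept for 7 days" means a report is dropped once its date is more than 7 days before today. The function only lists candidates, it does not delete anything.

```python
from datetime import date, timedelta

RETENTION_DAYS = 7  # from the text above


def expired_reports(filenames: list, today: date) -> list:
    """List YYYY-MM-DD.md reports dated more than RETENTION_DAYS before today."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    expired = []
    for name in filenames:
        try:
            report_date = date.fromisoformat(name.removesuffix(".md"))
        except ValueError:
            continue  # not a dated report, leave it alone
        if report_date < cutoff:
            expired.append(name)
    return expired
```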

Incident Mode (Breakdown)

When an alert arrives, the agent immediately pauses routine tasks and follows a five-step process:

  1. Acknowledge — reads the alert from current-alert.txt
  2. Diagnose — cross-references the affected service with homelab-index.yml to map dependencies
  3. Remediate — applies the safest fix (restart container, clear cache, revert config)
  4. Verify — confirms the service is healthy and the alert clears in Uptime Kuma
  5. Log — appends an incident summary to the maintenance log

The Alert System

This is the most interesting part of the setup. It’s a bidirectional alert system — the agent sees both DOWN and UP events:

Flow

  1. Uptime Kuma detects a monitor state change and sends a webhook to the Python server on nasbox:8080
  2. Webhook server (uptime-kuma-webhook.py) parses the JSON payload, formats it, and writes it to current-alert.txt
  3. Uptime-monitor extension (uptime-monitor.ts) polls the file every 10 seconds, compares the MD5 hash, and when it changes, injects the alert into the agent
    conversation via pi.sendUserMessage() with deliverAs: "steer"
  4. Agent analyzes the alert — is this a new incident or a recovery?
  5. Agent resolves the issue and calls clear_alerts to clear the file
  6. Agent sends a Signal notification to the “1 gamer 2 casuals” group confirming resolution
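
Step 3 of the flow is hash-based change detection. The real extension is written in TypeScript (uptime-monitor.ts). The sketch below shows the same idea in Python, with AlertWatcher and file_digest as hypothetical names. It assumes a cleared or missing file means "no active alert".

```python
import hashlib
from pathlib import Path
from typing import Optional


def file_digest(path: Path) -> str:
    """MD5 hex digest of the file contents, or an empty string if it is missing."""
    if not path.exists():
        return ""
    return hashlib.md5(path.read_bytes()).hexdigest()


class AlertWatcher:
    """Remembers the last digest and reports the alert text only when it changes."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.last_digest = file_digest(path)

    def poll(self) -> Optional[str]:
        digest = file_digest(self.path)
        if digest == self.last_digest:
            return None  # nothing new since the previous poll
        self.last_digest = digest
        text = self.path.read_text() if self.path.exists() else ""
        # A cleared file changes the hash too, but carries no alert.
        return text if text.strip() else None
```

A caller would invoke poll() on a timer (every 10 seconds in the setup described above) and forward any non-None result to the agent.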

Why Both UP and DOWN?

On June 14 alone, there were 8 DOWN events and 5 UP events. The current-alert.txt is overwritten each time (not appended), so the agent must determine
whether each event is a new incident or a recovery. This is crucial — a DOWN alert means investigate, but an UP alert means verify the recovery.

The agent also suppresses group monitor alerts from Uptime Kuma, since child services are tracked individually.
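
The DOWN/UP distinction and the group suppression can be sketched as a small classifier. The payload shape here is an assumption based on how Uptime Kuma webhooks are commonly structured (a monitor object with name and type, and a heartbeat object whose status is 0 for DOWN and 1 for UP). Verify the field names against your own payloads. classify_alert is a hypothetical helper.

```python
from typing import Optional


def classify_alert(payload: dict) -> Optional[str]:
    """Return a one-line instruction for the agent, or None to suppress."""
    monitor = payload.get("monitor") or {}
    heartbeat = payload.get("heartbeat") or {}

    # Assumed: group monitors report type "group"; children are tracked individually.
    if monitor.get("type") == "group":
        return None

    name = monitor.get("name", "unknown")
    status = heartbeat.get("status")
    if status == 0:  # assumed DOWN code
        return f"DOWN {name}: investigate"
    if status == 1:  # assumed UP code
        return f"UP {name}: verify recovery"
    return None  # any other state is ignored in this sketch
```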

Maintenance Skills

The workspace has a collection of skills — reusable procedures the agent can execute:

  • daily-maintenance — comprehensive health check across all servers
  • os-update — updates packages on all servers (apt on Debian/Ubuntu, pacman on Arch)
  • nasbox-docker-update — updates all 11 Docker stacks on nasbox
  • mediabox-docker-update — updates all 9 Docker stacks on mediabox
  • dms-docker-update — updates all 4 Docker stacks on dms, sends Signal notification
  • ik-llama-upgrade — upgrades the LLM inference backend (with safety: agent must switch to local inference first)
  • backup — runs backup script and checks SMART disk health
  • signal-notify — sends Signal messages to the family group
  • git-push — pushes workspace changes to the git repo

Incident Response in Action

The system has handled several incidents:

  • Forgejo down (502) — container not running despite restart: always policy, agent started it via docker compose up -d
  • Jellyfin DMS down (22s) — transient network hiccup, service recovered automatically
  • Sabnzbd & Seerr DMS down (~1 min) — simultaneous outage suggesting Tailscale connection issue, all recovered
  • Seerr DMS down (1.8 min) — service recovered on its own

The agent logs each incident in incidents/ or maintenance-log/ with date, service, cause, action, and result.

Safety Constraints

The agent operates under strict rules:

  • Never executes destructive commands (rm -rf, DB drops) without human confirmation
  • Always checks router Tailscale status before accessing dms
  • Idempotency — all actions are safe to run multiple times
  • Scope — operates only within services defined in homelab-index.yml
  • Communication — provides concise status updates in the TUI
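
The first and fourth rules could be enforced by a pre-flight check along these lines. It is an illustrative sketch, not the agent's real guard: the patterns cover only the two examples named above and are far from exhaustive, and needs_confirmation and in_scope are hypothetical helpers.

```python
import re

# Illustrative, non-exhaustive patterns for the examples named in the rules.
DESTRUCTIVE_PATTERNS = [
    # rm with both -r and -f in one flag group, in either order
    re.compile(r"\brm\s+-(?=\w*r)(?=\w*f)\w+"),
    # SQL DROP DATABASE / DROP TABLE
    re.compile(r"\bdrop\s+(database|table)\b", re.IGNORECASE),
]


def needs_confirmation(command: str) -> bool:
    """True if the command matches a known destructive pattern."""
    return any(p.search(command) for p in DESTRUCTIVE_PATTERNS)


def in_scope(service: str, indexed_services: set) -> bool:
    """True if the service is one of those listed in homelab-index.yml."""
    return service in indexed_services
```

A deny-list like this is a backstop at best. The rule itself still depends on the agent asking before anything it is unsure about.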

Why This Works

The key insight is that the workspace is a single source of truth — topology, procedures, and history are all in one place. The agent doesn’t need to guess; it
consults homelab-index.yml for the map, AGENTS.md for the rules, and the skills for the procedures. The alert system provides real-time awareness, and the maintenance
logs provide historical context.

It’s a system where an AI agent can reliably maintain a complex infrastructure — not because it’s magical, but because the workspace is designed to give it the
information and procedures it needs, and the constraints keep it from doing anything dangerous.

  • variety4me@lemmy.zip (OP)
    5 hours ago

    Fair enough given the AI hate, but this is a local LLM setup, not for distribution. It’s a self-contained way I use to maintain my homelab. Some may find it useful, some may not.

    Just as you have fun doing this yourself, I have fun making/configuring a local agent to do it for me.

    • Shimitar@downonthestreet.eu
      5 hours ago

      I use AI, I don’t hate it at all. It’s a tool. And as such it needs to be used properly and not abused. Like a knife or a camera or a drone.

      I am looking at agents with interest and I believe it’s still early to try them myself, but I find interest in any early adopters and experiments …

      • cecilkorik@lemmy.ca
        1 hour ago

        Yes, I don’t hate the technology, I hate all the evil companies abusing this technology for profit, fascism and death, which is very close to all of them.

        But they are not the only ones working on the technology, and even though they have stolen many people’s entire lives’ worth of work to make these models and not only didn’t compensate them, but in many cases replaced them and terminated their employment, in many cases we are stealing it right the fuck back from them and making it open to everybody, because fuck them. It doesn’t belong to them in the first place, it belongs to all of us. We can’t put the genie back in the lamp or the toothpaste back in the tube, but we can make sure we are keeping our own data for ourselves once we take these monstrous, bloated oligarchs down. We will not let the libraries of Alexandria burn down again, and we won’t let them have the only copies.

        I won’t pretend I don’t have various issues with open weight models using this technology, but they’re more like “I don’t like systemd’s philosophy or developers” level of issues, not “I think they will destroy democracy, civilization and possibly all of humanity” level of issues.

        With the evil companies, I absolutely DO have “I think they will destroy democracy, civilization and possibly all of humanity” level of issues.

      • variety4me@lemmy.zip (OP)
        5 hours ago

        Use it carefully with proper guard rails and you will be fine. OpenClaw (most horrible piece of shit software) kind of ruined the reputation of sensible agents.

        I am just trying to explore and experiment. I have configured my homelab on my own and can very easily take the agent down and go back to manual monitoring and maintenance, so it’s not like I am tied to this setup and can’t live without it!

      • variety4me@lemmy.zip (OP)
        5 hours ago

        Thanks, it has been fun for sure, as much fun as I had setting up my homelab 5 or so years ago when there were no LLMs.