# Macha Autonomous System - Design Document

> **⚠️ IMPORTANT - READ THIS FIRST**
> **FOR AI ASSISTANT**: This document is YOUR reference guide when modifying Macha's code.
> - **ALWAYS consult this BEFORE refactoring** to ensure you don't remove existing capabilities
> - **CHECK this when adding features** to avoid conflicts
> - **UPDATE this document** when new capabilities are added
> - **DO NOT DELETE ANYTHING FROM THIS DOCUMENT**
> - During major refactors, you MUST verify each capability listed here is preserved

## Overview

Macha is an AI-powered autonomous system administrator capable of monitoring, maintaining, and managing multiple NixOS hosts in the infrastructure.

## Core Capabilities

### 1. Local System Management

- Monitor system health (CPU, memory, disk, services)
- Read and analyze logs via `journalctl`
- Check service status and restart failed services
- Execute system commands (with safety restrictions)
- Monitor and repair Nix store corruption
- Hardware awareness (CPU, GPU, network, storage)

### 2. Multi-Host Management via SSH

**Macha CAN and SHOULD use SSH to manage other hosts.**

#### SSH Access

- **CRITICAL**: All command patterns defined in `command_patterns.py` (SINGLE SOURCE OF TRUTH)
- Always uses explicit SSH key path: `-i /var/lib/macha/.ssh/id_ed25519`
- All SSH commands automatically include the `-i` flag with absolute key path
- Remote commands always prefixed with `sudo`
- Runs as `macha` user (UID 2501)
- **DO NOT DUPLICATE these patterns elsewhere** - import from `command_patterns.py`
- Has `NOPASSWD` sudo access for administrative commands
- Shares SSH keys with other hosts in the infrastructure
- Can SSH to: `rhiannon`, `alexander`, `UCAR-Kinston`, and others in the flake

#### SSH Usage Patterns

1. **Direct diagnostic commands:**
   ```bash
   ssh rhiannon systemctl status ollama
   ssh alexander df -h
   ```
   - Commands automatically transformed by the tools layer
   - Full command: `ssh -i /var/lib/macha/.ssh/id_ed25519 -o StrictHostKeyChecking=no macha@rhiannon sudo systemctl status ollama`
   - SSH key path is always explicit, commands are automatically prefixed with `sudo`

2. **Status checks:**
   - Check service health on remote hosts
   - Gather system metrics
   - Review logs
   - Monitor resource usage

3. **File operations:**
   - Use `scp` to copy files between hosts
   - Read configuration files on remote systems

#### When to use SSH vs nh

- **SSH**: For diagnostics, status checks, log review, quick commands
- **nh remote deployment**: For applying NixOS configuration changes
  - `nh os switch -u --target-host=rhiannon --hostname=rhiannon`
  - Builds locally, deploys to remote host
  - Use for permanent configuration changes

### 3. NixOS Configuration Management

#### Local Changes

- Can propose changes to NixOS configuration
- Requires human approval before applying
- Uses `nh os switch` for local updates

#### Remote Deployment

- Can deploy to other hosts using `nh` with `--target-host`
- Builds configuration locally (on Macha)
- Pushes to remote system
- Can take up to 1 hour for complex builds
- **IMPORTANT**: Be patient with long-running builds, don't retry prematurely

### 4. Hardware Awareness

#### Local Hardware Detection

- CPU: `lscpu` via `nix-shell -p util-linux`
- GPU: `lspci` via `nix-shell -p pciutils`
- Network: `lsblk`, `ip addr`
- Storage: `df -h`, `lsblk`
- USB devices: `lsusb`

#### GPU Metrics

- AMD GPUs: Try `rocm-smi`, sysfs (`/sys/class/drm/card*/device/`)
- NVIDIA GPUs: Try `nvidia-smi`
- Fallback: `sensors` for temperature data
- Queries: temperature, utilization, clock speeds, power usage

### 5. Ollama Queue System

#### Architecture

- **File-based queue**: `/var/lib/macha/queues/ollama/`
- **Queue worker**: `ollama-queue-worker.service` (runs as `macha` user)
- **Purpose**: Serialize all LLM requests to prevent resource contention

#### Request Flow

1. Any user (including regular users) → Write request to `pending/`
2. Queue worker → Process requests serially (FIFO with priority)
3. Queue worker → Write response to `completed/`
4. Original requester → Read response from `completed/`

#### Priority Levels

- `INTERACTIVE` (0): User requests via `macha-chat`, `macha-ask`
- `AUTONOMOUS` (1): Background maintenance checks
- `BATCH` (2): Low-priority bulk operations

#### Large Output Handling

- Outputs >8KB: Split into chunks for hierarchical processing
- Each chunk ~8KB (~2000 tokens)
- Process chunks serially with progress feedback
- Generate chunk summaries → meta-summary
- Full outputs cached in `/var/lib/macha/tool_cache/`

### 6. Knowledge Base & Learning

#### ChromaDB Collections

1. **System Context**: Infrastructure topology, service relationships
2. **Issues**: Historical problems and resolutions
3. **Knowledge**: Operational wisdom learned from experience

#### Automatic Learning

- After successful operations, Macha reflects and extracts key learnings
- Stores: topic, knowledge content, category
- Retrieved automatically when relevant to current tasks
- Use `macha-knowledge` CLI to view/manage

### 7. Notifications

#### Gotify Integration

- Can send notifications via `macha-notify` command
- Tool: `send_notification(title, message, priority)`

#### Priority Levels

- `2` (Low/Info): Routine status updates, completed tasks
- `5` (Medium/Attention): Important events, configuration changes
- `8` (High/Critical): Service failures, critical errors, security issues

#### When to Notify

- Critical service failures
- Successful completion of major operations
- Configuration changes that may affect users
- Security-related events
- When explicitly requested by user

### 8. Safety & Constraints

#### Command Restrictions

**Allowed Commands** (see `tools.py` for full list):

- System management: `systemctl`, `journalctl`, `nh`, `nixos-rebuild`
- Monitoring: `free`, `df`, `uptime`, `ps`, `top`, `ip`, `ss`
- Hardware: `lscpu`, `lspci`, `lsblk`, `lshw`, `dmidecode`
- Remote: `ssh`, `scp`
- Power: `reboot`, `shutdown`, `poweroff` (use cautiously!)
- File ops: `cat`, `ls`, `grep`
- Network: `ping`, `dig`, `nslookup`, `curl`, `wget`
- Logging: `logger`

**NOT Allowed**:

- Direct package modifications (`nix-env`, `nix profile`)
- Destructive file operations (`rm -rf`, `dd`)
- User management outside of NixOS config
- Direct editing of system files (use NixOS config instead)

#### Critical Services

**Never disable or stop:**

- SSH (network access)
- Networking (connectivity)
- systemd (system management)
- Boot-related services

#### Approval Required

- Reboots or system power changes
- Major configuration changes
- Disabling any service
- Changes to multiple hosts

### 9. Nix Store Maintenance

#### Verification & Repair

- Command: `nix-store --verify --check-contents --repair`
- **WARNING**: Can take 30+ minutes to several hours
- Only use when corruption is suspected
- Not for routine maintenance
- Verifies all store paths, repairs corrupted files

#### Garbage Collection

- Automatic via system configuration
- Can be triggered manually with approval
- Frees disk space by removing unused derivations

### 10. Conversational Behavior

#### Distinguish Requests from Acknowledgments

- "Thanks" / "Thank you" → Acknowledgment (don't re-execute)
- "Can you..." / "Please..." → Request (execute)
- "What is..." / "How do..." → Question (answer)

#### Tool Calling

- Don't repeat tool calls unnecessarily
- If a tool succeeds, don't run it again unless asked
- Use cached results when available (`retrieve_cached_output`)

#### Context Management

- Be aware of token limits
- Use hierarchical processing for large outputs
- Prune conversation history intelligently
- Cache and summarize when needed

## Infrastructure Topology

### Hosts in Flake

- **macha**: Main autonomous system (self), GPU server
- **rhiannon**: Production server
- **alexander**: Production server
- **UCAR-Kinston**: Work laptop
- **test-vm**: Testing environment

### Shared Configuration

- All hosts share root SSH keys (for `nh` remote deployment)
- `macha` user (UID 2501) exists on all hosts
- Common NixOS configuration via flake

## Service Ecosystem

### Core Services on Macha

- `ollama.service`: LLM inference engine
- `ollama-queue-worker.service`: Request serialization
- `macha-autonomous.service`: Autonomous monitoring loop
- Servarr stack: Sonarr, Radarr, Prowlarr, Lidarr, Readarr, Whisparr
- Media: Transmission, SABnzbd, Calibre

### State Directories

- `/var/lib/macha/`: Main state directory (0755, macha:macha)
- `/var/lib/macha/queues/`: Queue directories (0777 for multi-user)
- `/var/lib/macha/tool_cache/`: Cached tool outputs (0777)
- `/var/lib/macha/system_context.db`: ChromaDB database

## CLI Tools

- `macha-chat`: Interactive chat with tool calling
- `macha-ask`: Single-question interface
- `macha-check`: Trigger immediate health check
- `macha-approve`: Approve pending actions
- `macha-logs`: View autonomous service logs
- `macha-issues`: Query issue database
- `macha-knowledge`: Query knowledge base
- `macha-systems`: List managed systems
- `macha-notify`: Send Gotify notification

## Philosophy & Principles

1. **KISS (Keep It Simple, Stupid)**: Use existing NixOS options, avoid custom wrappers
2. **Verify first**: Check source code/documentation before acting
3. **Safety first**: Never break critical services, always require approval for risky changes
4. **Learn continuously**: Extract and store operational knowledge
5. **Multi-host awareness**: Macha manages the entire infrastructure, not just herself
6. **User-friendly**: Clear communication, appropriate notifications
7. **Patience**: Long-running operations (builds, repairs) can take an hour - don't panic
8. **Tool reuse**: Use existing, verified tools instead of writing custom scripts

## Future Capabilities (Not Yet Implemented)

- [ ] Automatic security updates across all hosts
- [ ] Predictive failure detection
- [ ] Resource optimization recommendations
- [ ] Integration with other communication platforms
- [ ] Multi-agent coordination between hosts
- [ ] Automated testing before deployment
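
## Appendix: SSH Command Transformation Sketch

The "SSH Usage Patterns" section says that a short command such as `ssh rhiannon systemctl status ollama` is expanded by the tools layer into a full command with an explicit key path, the `macha` user, and a `sudo` prefix. The block below is a minimal sketch of that expansion, not the actual contents of `command_patterns.py`. The function name `build_ssh_command`, the constants, and the rule of skipping `sudo` when it is already present are assumptions made for illustration.

```python
import shlex

# Values taken from the "SSH Access" section of this document.
SSH_KEY_PATH = "/var/lib/macha/.ssh/id_ed25519"
SSH_USER = "macha"


def build_ssh_command(host: str, remote_command: str) -> list[str]:
    """Hypothetical helper: expand `ssh <host> <command>` into a full argv.

    The real patterns live in command_patterns.py; this only mirrors the
    documented shape of the final command.
    """
    remote_args = shlex.split(remote_command)
    if not remote_args:
        raise ValueError("remote command must not be empty")
    # Documented behavior: remote commands are prefixed with sudo.
    # Assumption: do not add a second sudo if the caller already wrote one.
    if remote_args[0] != "sudo":
        remote_args.insert(0, "sudo")
    return [
        "ssh",
        "-i", SSH_KEY_PATH,                # explicit absolute key path
        "-o", "StrictHostKeyChecking=no",  # option shown in the documented full command
        f"{SSH_USER}@{host}",
        *remote_args,
    ]


if __name__ == "__main__":
    argv = build_ssh_command("rhiannon", "systemctl status ollama")
    print(shlex.join(argv))
    # ssh -i /var/lib/macha/.ssh/id_ed25519 -o StrictHostKeyChecking=no macha@rhiannon sudo systemctl status ollama
```

Returning an argument list rather than a single string lets the caller pass it to `subprocess.run` without a local shell. Note that `ssh` joins the trailing arguments with spaces before the remote shell parses them, so remote arguments that contain spaces or shell metacharacters would need additional quoting.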
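
## Appendix: Queue Ordering Sketch

The "Ollama Queue System" section describes requests being processed serially, "FIFO with priority", across the `INTERACTIVE` (0), `AUTONOMOUS` (1), and `BATCH` (2) levels. The sketch below shows one way that ordering could be expressed: lower priority numbers first, and oldest first within a priority. It works on in-memory records only; the on-disk request format is not described in this document, so `Request`, `request_id`, `submitted_at`, and `processing_order` are hypothetical names.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Priority levels as listed in this document (lower value runs first)."""
    INTERACTIVE = 0
    AUTONOMOUS = 1
    BATCH = 2


@dataclass(frozen=True)
class Request:
    request_id: str      # hypothetical identifier for a pending request
    priority: Priority
    submitted_at: float  # hypothetical submission time (seconds since epoch)


def processing_order(pending: list[Request]) -> list[Request]:
    """Return requests in the order a serial worker would handle them.

    Sort by priority first, then by submission time, which gives FIFO
    behavior among requests of the same priority.
    """
    return sorted(pending, key=lambda r: (r.priority, r.submitted_at))


if __name__ == "__main__":
    pending = [
        Request("bulk-report", Priority.BATCH, 1.0),
        Request("health-check", Priority.AUTONOMOUS, 2.0),
        Request("chat-1", Priority.INTERACTIVE, 3.0),
        Request("chat-2", Priority.INTERACTIVE, 4.0),
    ]
    print([r.request_id for r in processing_order(pending)])
    # ['chat-1', 'chat-2', 'health-check', 'bulk-report']
```

Under this ordering an interactive request submitted later still runs before older background work, which matches the stated purpose of the `INTERACTIVE` level. Whether the real worker also guards against starvation of `BATCH` requests is not stated in this document.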