CRITICAL: Prevents inconsistent sudo/SSH patterns across codebase.
Created command_patterns.py with:
- Single source of truth for ALL command execution patterns
- SSH key path constant: /var/lib/macha/.ssh/id_ed25519
- Remote user constant: macha
- sudo prefix for all remote commands
- Helper functions: build_ssh_command(), transform_ssh_command()
- Self-validation tests
Updated files to use centralized patterns:
- tools.py: Uses transform_ssh_command()
- remote_monitor.py: Uses build_ssh_command()
- system_discovery.py: Uses build_ssh_command()
- DESIGN.md: Documents centralized approach
Benefits:
- Impossible to have inconsistent patterns
- Single place to update if needed
- Self-documenting with validation tests
- Prevents future refactoring errors
DO NOT duplicate these patterns in other files - always import.
Macha Autonomous System - Design Document
⚠️ IMPORTANT - READ THIS FIRST
FOR AI ASSISTANT: This document is YOUR reference guide when modifying Macha's code.
- ALWAYS consult this BEFORE refactoring to ensure you don't remove existing capabilities
- CHECK this when adding features to avoid conflicts
- UPDATE this document when new capabilities are added
- DO NOT DELETE ANYTHING FROM THIS DOCUMENT
- During major refactors, you MUST verify each capability listed here is preserved
Overview
Macha is an AI-powered autonomous system administrator capable of monitoring, maintaining, and managing multiple NixOS hosts in the infrastructure.
Core Capabilities
1. Local System Management
- Monitor system health (CPU, memory, disk, services)
- Read and analyze logs via journalctl
- Check service status and restart failed services
- Execute system commands (with safety restrictions)
- Monitor and repair Nix store corruption
- Hardware awareness (CPU, GPU, network, storage)
2. Multi-Host Management via SSH
Macha CAN and SHOULD use SSH to manage other hosts.
SSH Access
- CRITICAL: All command patterns defined in command_patterns.py (SINGLE SOURCE OF TRUTH)
- Always uses explicit SSH key path: -i /var/lib/macha/.ssh/id_ed25519
- All SSH commands automatically include the -i flag with absolute key path
- Remote commands always prefixed with sudo
- Runs as macha user (UID 2501)
- DO NOT DUPLICATE these patterns elsewhere - import from command_patterns.py
- Has NOPASSWD sudo access for administrative commands
- Shares SSH keys with other hosts in the infrastructure
- Can SSH to: rhiannon, alexander, UCAR-Kinston, and others in the flake
SSH Usage Patterns
- Direct diagnostic commands:
  ssh rhiannon systemctl status ollama
  ssh alexander df -h
  - Commands automatically transformed by the tools layer
  - Full command: ssh -i /var/lib/macha/.ssh/id_ed25519 -o StrictHostKeyChecking=no macha@rhiannon sudo systemctl status ollama
  - SSH key path is always explicit, commands are automatically prefixed with sudo
- Status checks:
  - Check service health on remote hosts
  - Gather system metrics
  - Review logs
  - Monitor resource usage
- File operations:
  - Use scp to copy files between hosts
  - Read configuration files on remote systems
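The transformation described above, from a short ssh command to the full form, might look roughly like the sketch below. Only the key path, the remote user, the StrictHostKeyChecking option, the sudo prefix, and the two function names come from this document; the signatures and bodies are assumptions and may differ from the real command_patterns.py.

```python
import shlex

# Constants taken from this document; everything else is illustrative.
SSH_KEY_PATH = "/var/lib/macha/.ssh/id_ed25519"
REMOTE_USER = "macha"
SSH_OPTIONS = ["-o", "StrictHostKeyChecking=no"]


def build_ssh_command(host, remote_args):
    """Return the full ssh argv that runs remote_args on host under sudo."""
    return ["ssh", "-i", SSH_KEY_PATH, *SSH_OPTIONS,
            f"{REMOTE_USER}@{host}", "sudo", *remote_args]


def transform_ssh_command(command):
    """Expand a short 'ssh <host> <cmd...>' string into the full form.

    Anything that is not a plain short-form ssh command is returned unchanged.
    """
    parts = shlex.split(command)
    if len(parts) < 3 or parts[0] != "ssh" or parts[1].startswith("-"):
        return command
    host = parts[1].split("@", 1)[-1]  # drop any user@ prefix
    remote_args = parts[2:]
    if remote_args[0] == "sudo":       # avoid a doubled sudo prefix
        remote_args = remote_args[1:]
    return shlex.join(build_ssh_command(host, remote_args))


print(transform_ssh_command("ssh rhiannon systemctl status ollama"))
```

Keeping the constants and both helpers in one module is what lets tools.py, remote_monitor.py, and system_discovery.py import them instead of rebuilding the command string.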
When to use SSH vs nh
- SSH: For diagnostics, status checks, log review, quick commands
- nh remote deployment: For applying NixOS configuration changes
  nh os switch -u --target-host=rhiannon --hostname=rhiannon
  - Builds locally, deploys to remote host
  - Use for permanent configuration changes
3. NixOS Configuration Management
Local Changes
- Can propose changes to NixOS configuration
- Requires human approval before applying
- Uses nh os switch for local updates
Remote Deployment
- Can deploy to other hosts using nh with --target-host
- Builds configuration locally (on Macha)
- Pushes to remote system
- Can take up to 1 hour for complex builds
- IMPORTANT: Be patient with long-running builds, don't retry prematurely
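As a minimal sketch of the patience rule, a deployment wrapper could run the nh command once with a generous timeout instead of retrying. The nh arguments are the ones shown in this document; the function names, the injected runner, and the 90-minute ceiling are assumptions chosen to leave headroom over the "up to 1 hour" figure.

```python
import subprocess

# Assumption: 90 minutes, leaving headroom over builds that take up to 1 hour.
BUILD_TIMEOUT_SECONDS = 90 * 60


def build_nh_deploy_command(host):
    """Return the nh argv for a remote deployment to host."""
    return ["nh", "os", "switch", "-u",
            f"--target-host={host}", f"--hostname={host}"]


def deploy(host, runner=subprocess.run):
    """Run the deployment once and wait for it, rather than retrying early.

    runner is injectable (hypothetical parameter) so the call can be
    exercised without actually invoking nh.
    """
    return runner(build_nh_deploy_command(host),
                  timeout=BUILD_TIMEOUT_SECONDS, check=False)


print(" ".join(build_nh_deploy_command("rhiannon")))
```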
4. Hardware Awareness
Local Hardware Detection
- CPU: lscpu via nix-shell -p util-linux
- GPU: lspci via nix-shell -p pciutils
- Network: lsblk, ip addr
- Storage: df -h, lsblk
- USB devices: lsusb
GPU Metrics
- AMD GPUs: Try rocm-smi, sysfs (/sys/class/drm/card*/device/)
- NVIDIA GPUs: Try nvidia-smi
- Fallback: sensors for temperature data
- Queries: temperature, utilization, clock speeds, power usage
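The fallback order above could be expressed as a small probe that picks the first tool present on PATH. This is a sketch only: the function name is hypothetical, and treating the list as a strict priority order is an assumption.

```python
import shutil

# Probe order follows the list above; strict ordering is an assumption.
GPU_TOOLS = ["rocm-smi", "nvidia-smi", "sensors"]


def pick_gpu_tool(which=shutil.which):
    """Return the first GPU metrics tool found on PATH, or None if none exist."""
    for tool in GPU_TOOLS:
        if which(tool) is not None:
            return tool
    return None


print(pick_gpu_tool())
```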
5. Ollama Queue System
Architecture
- File-based queue: /var/lib/macha/queues/ollama/
- Queue worker: ollama-queue-worker.service (runs as macha user)
- Purpose: Serialize all LLM requests to prevent resource contention
Request Flow
- Any user (including regular users) → Write request to pending/
- Queue worker → Process requests serially (FIFO with priority)
- Queue worker → Write response to completed/
- Original requester → Read response from completed/
Priority Levels
- INTERACTIVE (0): User requests via macha-chat, macha-ask
- AUTONOMOUS (1): Background maintenance checks
- BATCH (2): Low-priority bulk operations
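One way to picture the request flow and priority levels is the sketch below. The pending/ and completed/ directories and the three priority values come from this document; the file naming scheme, the JSON fields, and the function names are hypothetical and may not match the real queue worker.

```python
import json
import time
import uuid
from enum import IntEnum
from pathlib import Path


class Priority(IntEnum):
    INTERACTIVE = 0
    AUTONOMOUS = 1
    BATCH = 2


def enqueue(queue_dir, prompt, priority):
    """Write a request file into pending/ and return its id."""
    pending = Path(queue_dir) / "pending"
    pending.mkdir(parents=True, exist_ok=True)
    request_id = uuid.uuid4().hex
    # Hypothetical naming: priority first, then a timestamp, so a sorted
    # listing yields the most urgent level first and oldest first within it.
    name = f"{int(priority)}_{time.time_ns():020d}_{request_id}.json"
    (pending / name).write_text(json.dumps({"id": request_id, "prompt": prompt}))
    return request_id


def next_request(queue_dir):
    """Remove and return the most urgent pending request, or None."""
    files = sorted((Path(queue_dir) / "pending").glob("*.json"))
    if not files:
        return None
    request = json.loads(files[0].read_text())
    files[0].unlink()
    return request


def complete(queue_dir, request_id, response):
    """Write the response into completed/ for the requester to read."""
    completed = Path(queue_dir) / "completed"
    completed.mkdir(parents=True, exist_ok=True)
    path = completed / f"{request_id}.json"
    path.write_text(json.dumps({"id": request_id, "response": response}))
    return path
```

Because a single worker drains pending/ one file at a time, only one LLM request runs at any moment, which is the serialization the architecture calls for.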
Large Output Handling
- Outputs >8KB: Split into chunks for hierarchical processing
- Each chunk ~8KB (~2000 tokens)
- Process chunks serially with progress feedback
- Generate chunk summaries → meta-summary
- Full outputs cached in /var/lib/macha/tool_cache/
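The chunk-then-summarize approach above can be sketched as follows. The 8KB threshold is from this document, read here as 8192 characters (an assumption, since the real code may count bytes). The summarize callable stands in for an LLM call, and both function names are hypothetical.

```python
CHUNK_SIZE = 8 * 1024  # assumption: "8KB" read as 8192 characters


def chunk_output(text, chunk_size=CHUNK_SIZE):
    """Split text into chunks of at most chunk_size characters.

    Breaks at line ends where possible; a single overlong line is cut hard.
    """
    if len(text) <= chunk_size:
        return [text]
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        while len(line) > chunk_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:chunk_size])
            line = line[chunk_size:]
        if len(current) + len(line) > chunk_size:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


def summarize_hierarchically(text, summarize):
    """Summarize each chunk in turn, then summarize the joined summaries."""
    chunks = chunk_output(text)
    if len(chunks) == 1:
        return summarize(chunks[0])
    partial = [summarize(chunk) for chunk in chunks]
    return summarize("\n".join(partial))
```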
6. Knowledge Base & Learning
ChromaDB Collections
- System Context: Infrastructure topology, service relationships
- Issues: Historical problems and resolutions
- Knowledge: Operational wisdom learned from experience
Automatic Learning
- After successful operations, Macha reflects and extracts key learnings
- Stores: topic, knowledge content, category
- Retrieved automatically when relevant to current tasks
- Use macha-knowledge CLI to view/manage
7. Notifications
Gotify Integration
- Can send notifications via macha-notify command
- Tool: send_notification(title, message, priority)
Priority Levels
- 2 (Low/Info): Routine status updates, completed tasks
- 5 (Medium/Attention): Important events, configuration changes
- 8 (High/Critical): Service failures, critical errors, security issues
When to Notify
- Critical service failures
- Successful completion of major operations
- Configuration changes that may affect users
- Security-related events
- When explicitly requested by user
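The three priority levels could be captured in an enum so callers cannot pass an arbitrary number. This is a sketch under assumptions: the enum, the helper name, and the payload shape are hypothetical, and only the values 2, 5, and 8 and the (title, message, priority) parameters come from this document.

```python
from enum import IntEnum


class NotifyPriority(IntEnum):
    LOW = 2       # routine status updates, completed tasks
    MEDIUM = 5    # important events, configuration changes
    HIGH = 8      # service failures, critical errors, security issues


def build_notification(title, message, priority=NotifyPriority.LOW):
    """Return a notification payload, rejecting priorities other than 2, 5, 8."""
    level = NotifyPriority(priority)  # raises ValueError for unknown values
    return {"title": title, "message": message, "priority": int(level)}


print(build_notification("ollama down", "ollama.service failed on rhiannon",
                         NotifyPriority.HIGH))
```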
8. Safety & Constraints
Command Restrictions
Allowed Commands (see tools.py for full list):
- System management: systemctl, journalctl, nh, nixos-rebuild
- Monitoring: free, df, uptime, ps, top, ip, ss
- Hardware: lscpu, lspci, lsblk, lshw, dmidecode
- Remote: ssh, scp
- Power: reboot, shutdown, poweroff (use cautiously!)
- File ops: cat, ls, grep
- Network: ping, dig, nslookup, curl, wget
- Logging: logger
NOT Allowed:
- Direct package modifications (nix-env, nix profile)
- Destructive file operations (rm -rf, dd)
- User management outside of NixOS config
- Direct editing of system files (use NixOS config instead)
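An allowlist check along these lines might look like the sketch below. The command names are copied from the lists above, but the authoritative list lives in tools.py; the function name and the blanket rejection of shell metacharacters are assumptions made for illustration.

```python
import shlex

# Copied from the allowed list above; tools.py holds the authoritative list.
ALLOWED_COMMANDS = {
    "systemctl", "journalctl", "nh", "nixos-rebuild",
    "free", "df", "uptime", "ps", "top", "ip", "ss",
    "lscpu", "lspci", "lsblk", "lshw", "dmidecode",
    "ssh", "scp",
    "reboot", "shutdown", "poweroff",
    "cat", "ls", "grep",
    "ping", "dig", "nslookup", "curl", "wget",
    "logger",
}

# Assumption: reject shell control characters so an allowed command cannot
# be chained with a forbidden one (for example "cat x; rm -rf /").
SHELL_METACHARACTERS = ";|&`$<>\n"


def is_command_allowed(command):
    """Return True only if the leading executable is on the allowlist."""
    if any(ch in command for ch in SHELL_METACHARACTERS):
        return False
    try:
        parts = shlex.split(command)
    except ValueError:  # unbalanced quotes
        return False
    return bool(parts) and parts[0] in ALLOWED_COMMANDS
```

Because this is an allowlist, nix-env, nix profile, rm, and dd are rejected simply by not appearing in it. Commands that need approval, such as reboot, would still pass this check and have to be gated separately.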
Critical Services
Never disable or stop:
- SSH (network access)
- Networking (connectivity)
- systemd (system management)
- Boot-related services
Approval Required
- Reboots or system power changes
- Major configuration changes
- Disabling any service
- Changes to multiple hosts
9. Nix Store Maintenance
Verification & Repair
- Command: nix-store --verify --check-contents --repair
- WARNING: Can take 30+ minutes to several hours
- Only use when corruption is suspected
- Not for routine maintenance
- Verifies all store paths, repairs corrupted files
Garbage Collection
- Automatic via system configuration
- Can be triggered manually with approval
- Frees disk space by removing unused derivations
10. Conversational Behavior
Distinguish Requests from Acknowledgments
- "Thanks" / "Thank you" → Acknowledgment (don't re-execute)
- "Can you..." / "Please..." → Request (execute)
- "What is..." / "How do..." → Question (answer)
Tool Calling
- Don't repeat tool calls unnecessarily
- If a tool succeeds, don't run it again unless asked
- Use cached results when available (retrieve_cached_output)
Context Management
- Be aware of token limits
- Use hierarchical processing for large outputs
- Prune conversation history intelligently
- Cache and summarize when needed
Infrastructure Topology
Hosts in Flake
- macha: Main autonomous system (self), GPU server
- rhiannon: Production server
- alexander: Production server
- UCAR-Kinston: Work laptop
- test-vm: Testing environment
Shared Configuration
- All hosts share root SSH keys (for nh remote deployment)
- macha user (UID 2501) exists on all hosts
- Common NixOS configuration via flake
Service Ecosystem
Core Services on Macha
- ollama.service: LLM inference engine
- ollama-queue-worker.service: Request serialization
- macha-autonomous.service: Autonomous monitoring loop
- Servarr stack: Sonarr, Radarr, Prowlarr, Lidarr, Readarr, Whisparr
- Media: Transmission, SABnzbd, Calibre
State Directories
- /var/lib/macha/: Main state directory (0755, macha:macha)
- /var/lib/macha/queues/: Queue directories (0777 for multi-user)
- /var/lib/macha/tool_cache/: Cached tool outputs (0777)
- /var/lib/macha/system_context.db: ChromaDB database
CLI Tools
- macha-chat: Interactive chat with tool calling
- macha-ask: Single-question interface
- macha-check: Trigger immediate health check
- macha-approve: Approve pending actions
- macha-logs: View autonomous service logs
- macha-issues: Query issue database
- macha-knowledge: Query knowledge base
- macha-systems: List managed systems
- macha-notify: Send Gotify notification
Philosophy & Principles
- KISS (Keep It Simple, Stupid): Use existing NixOS options, avoid custom wrappers
- Verify first: Check source code/documentation before acting
- Safety first: Never break critical services, always require approval for risky changes
- Learn continuously: Extract and store operational knowledge
- Multi-host awareness: Macha manages the entire infrastructure, not just herself
- User-friendly: Clear communication, appropriate notifications
- Patience: Long-running operations can take an hour (builds) or several hours (store repairs) - don't panic
- Tool reuse: Use existing, verified tools instead of writing custom scripts
Future Capabilities (Not Yet Implemented)
- Automatic security updates across all hosts
- Predictive failure detection
- Resource optimization recommendations
- Integration with other communication platforms
- Multi-agent coordination between hosts
- Automated testing before deployment