Macha Autonomous System - Design Document
⚠️ IMPORTANT - READ THIS FIRST
FOR AI ASSISTANT: This document is YOUR reference guide when modifying Macha's code.
- ALWAYS consult this BEFORE refactoring to ensure you don't remove existing capabilities
- CHECK this when adding features to avoid conflicts
- UPDATE this document when new capabilities are added
- DO NOT DELETE ANYTHING FROM THIS DOCUMENT
- During major refactors, you MUST verify each capability listed here is preserved
Overview
Macha is an AI-powered autonomous system administrator capable of monitoring, maintaining, and managing multiple NixOS hosts in the infrastructure.
Core Capabilities
1. Local System Management
- Monitor system health (CPU, memory, disk, services)
- Read and analyze logs via journalctl
- Check service status and restart failed services
- Execute system commands (with safety restrictions)
- Monitor and repair Nix store corruption
- Hardware awareness (CPU, GPU, network, storage)
- USB device detection and monitoring
2. Multi-Host Management via SSH
Macha CAN and SHOULD use SSH to manage other hosts.
SSH Access
- CRITICAL: All command patterns defined in command_patterns.py (SINGLE SOURCE OF TRUTH)
- Always uses explicit SSH key path: -i /var/lib/macha/.ssh/id_ed25519
- All SSH commands automatically include the -i flag with absolute key path
- Remote commands always prefixed with sudo
- Runs as macha user (UID 2501) for standard operations
- Note: Some internal operations (like remote monitoring) may use root SSH for privileged access
- DO NOT DUPLICATE these patterns elsewhere - import from command_patterns.py
- Has NOPASSWD sudo access for administrative commands
- Shares SSH keys with other hosts in the infrastructure
- Can SSH to any hosts defined in your NixOS flake configuration
SSH Usage Patterns
- Direct diagnostic commands:
  ssh hostname systemctl status service-name
  ssh hostname df -h
  - Commands automatically transformed by the tools layer
  - Full command: ssh -i /var/lib/macha/.ssh/id_ed25519 -o StrictHostKeyChecking=no macha@hostname sudo systemctl status service-name
  - SSH key path is always explicit, commands are automatically prefixed with sudo
- Status checks:
  - Check service health on remote hosts
  - Gather system metrics
  - Review logs
  - Monitor resource usage
- File operations:
  - Use scp to copy files between hosts
  - Read configuration files on remote systems
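The transformation from a short diagnostic command to the full SSH invocation can be sketched as follows. This is a minimal illustration, not the contents of command_patterns.py: the key path, user, and option come from the text above, while the function signature and the list-based return value are assumptions.

```python
import shlex

# Values taken from the "SSH Access" section of this document.
SSH_KEY_PATH = "/var/lib/macha/.ssh/id_ed25519"
SSH_USER = "macha"


def build_ssh_command(hostname: str, command: str) -> list[str]:
    """Expand a short remote command into a full SSH argument list.

    Hypothetical signature; the real helper in command_patterns.py may differ.
    """
    remote = shlex.split(command)
    if not remote:
        raise ValueError("empty remote command")
    # Remote commands always run through sudo; avoid prefixing twice.
    if remote[0] != "sudo":
        remote = ["sudo", *remote]
    return [
        "ssh",
        "-i", SSH_KEY_PATH,
        "-o", "StrictHostKeyChecking=no",
        f"{SSH_USER}@{hostname}",
        *remote,
    ]
```

Joining the result of build_ssh_command("hostname", "systemctl status service-name") with spaces reproduces the full command shown in the list above. Returning an argument list rather than a string lets a caller pass it to subprocess without going through a local shell.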
When to use SSH vs nh
- SSH: For diagnostics, status checks, log review, quick commands
- nh remote deployment: For applying NixOS configuration changes
  nh os switch -u --target-host=hostname --hostname=hostname
  - Builds locally, deploys to remote host
  - Use for permanent configuration changes
3. Automatic System Discovery
Discovery from Journal Logs
- Monitors systemd-journal-remote logs for new systems sending logs
- Parses _HOSTNAME field from remote journal entries
- Automatically converts short hostnames to FQDNs (.coven.systems)
- Discovers systems within configurable time windows (default: 10 minutes)
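The hostname normalization and time-window filter described above might look roughly like this. It is a sketch under assumptions: the function names are hypothetical, and the (hostname, timestamp) pairs stand in for whatever system_discovery.py actually parses out of the journal.

```python
from datetime import datetime, timedelta

DEFAULT_DOMAIN = "coven.systems"        # suffix named in the list above
DEFAULT_WINDOW = timedelta(minutes=10)  # default discovery window


def to_fqdn(hostname: str, domain: str = DEFAULT_DOMAIN) -> str:
    """Append the domain to short hostnames; leave dotted names untouched."""
    hostname = hostname.strip().rstrip(".").lower()
    return hostname if "." in hostname else f"{hostname}.{domain}"


def recently_seen(entries, now: datetime, window: timedelta = DEFAULT_WINDOW):
    """Return sorted FQDNs of hosts that logged within the window.

    `entries` is an iterable of (hostname, timestamp) pairs, a simplified
    stand-in for parsed _HOSTNAME fields and entry times.
    """
    cutoff = now - window
    return sorted({to_fqdn(host) for host, seen in entries if seen >= cutoff})
```

Collecting into a set before sorting means a host that logged many entries inside the window is reported once.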
OS Detection
- Detects operating system via SSH (/etc/os-release, uname)
- Supports: NixOS, Ubuntu, Debian, Arch, Fedora, RHEL, Alpine, macOS, FreeBSD
- Falls back to generic "linux" if specific distro cannot be determined
- Records OS type in system registry for appropriate management strategies
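One way to implement the detection logic above is to check the uname output for non-Linux systems first, then read the ID field of /etc/os-release, and fall back to "linux". The sketch below assumes both outputs have already been fetched over SSH; the function names and returned labels are illustrative, not the ones used in system_discovery.py.

```python
# ID values as they appear in /etc/os-release on the supported distros.
KNOWN_LINUX_IDS = {"nixos", "ubuntu", "debian", "arch", "fedora", "rhel", "alpine"}
# Kernel names reported by `uname -s` on the supported non-Linux systems.
UNAME_IDS = {"Darwin": "macos", "FreeBSD": "freebsd"}


def parse_os_release(text: str) -> dict[str, str]:
    """Parse KEY=value lines from /etc/os-release into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip().strip('"').strip("'")
    return fields


def detect_os(os_release_text: str, uname_s: str) -> str:
    """Return an OS label, falling back to the generic "linux"."""
    kernel = uname_s.strip()
    if kernel in UNAME_IDS:
        return UNAME_IDS[kernel]
    os_id = parse_os_release(os_release_text).get("ID", "").lower()
    return os_id if os_id in KNOWN_LINUX_IDS else "linux"
```

For example, detect_os('ID=nixos', 'Linux') yields "nixos", while an unrecognized ID such as gentoo yields the generic "linux".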
System Profiling
- Automatically gathers system information upon discovery:
  - Running services (via systemctl list-units)
  - Hardware info (CPU cores, memory)
  - Capabilities (containers, web-server, database, remote-access)
  - System role determination (ai-workstation, server, workstation, minimal)
- Registers discovered systems in ChromaDB context database
- Sends notification when new systems are discovered
4. NixOS Configuration Management
Local Changes
- Can propose changes to NixOS configuration
- Requires human approval before applying
- Uses nh os switch for local updates
Remote Deployment
- Can deploy to other hosts using nh with --target-host
- Builds configuration locally (on Macha)
- Pushes to remote system
- Can take up to 1 hour for complex builds
- IMPORTANT: Be patient with long-running builds, don't retry prematurely
5. Configuration File Intelligence (RAG)
Semantic Configuration Search
- NixOS configuration files indexed in ChromaDB for semantic search
- Query configurations by natural language: "gotify configuration", "journald settings"
- CLI tools: macha-configs <query> and macha-configs-read <path>
- Relevance scoring helps find the right config files quickly
- Supports filtering by system hostname or category
Configuration Categories
- apps/ - Application configurations (Gotify, Nextcloud, etc.)
- systems/ - Per-host system configurations
- osconfigs/ - Operating system level settings
- users/ - User management configurations
Git Context Analysis
- Tracks recent changes to configuration files via git_context.py
- Correlates config changes with system behavior
- Provides context when investigating issues after deployments
- Helps understand "what changed" when debugging
6. Hardware Awareness
Local Hardware Detection
- CPU: lscpu via nix-shell -p util-linux
- GPU: lspci via nix-shell -p pciutils
- Network: lsblk, ip addr
- Storage: df -h, lsblk
- USB devices: lsusb
GPU Metrics
- AMD GPUs: Try rocm-smi, sysfs (/sys/class/drm/card*/device/)
- NVIDIA GPUs: Try nvidia-smi
- Fallback: sensors for temperature data
- Queries: temperature, utilization, clock speeds, power usage
7. Ollama Queue System
Architecture
- File-based queue: /var/lib/macha/queues/ollama/
- Queue worker: ollama-queue-worker.service (runs as macha user)
- Purpose: Serialize all LLM requests to prevent resource contention
Request Flow
- Any user (including regular users) → Write request to pending/
- Queue worker → Process requests serially (FIFO with priority)
- Queue worker → Write response to completed/
- Original requester → Read response from completed/
Priority Levels
- INTERACTIVE (0): User requests via macha-chat, macha-ask
- AUTONOMOUS (1): Background maintenance checks
- BATCH (2): Low-priority bulk operations
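The request flow and priority levels above can be illustrated with a small file-based queue. This is a sketch only: the pending/ and completed/ directory names and the three priority values come from this document, but the file naming scheme, the JSON fields, and the function names are assumptions, not the format used by ollama_queue.py.

```python
import json
import time
import uuid
from enum import IntEnum
from pathlib import Path
from typing import Optional


class Priority(IntEnum):
    INTERACTIVE = 0
    AUTONOMOUS = 1
    BATCH = 2


def enqueue(queue_dir: Path, prompt: str, priority: Priority) -> Path:
    """Write a request file into pending/ and return its path."""
    pending = queue_dir / "pending"
    pending.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming: "<priority>-<timestamp>-<random>.json", so a plain
    # lexical sort gives priority order first, then arrival order.
    name = f"{int(priority)}-{time.time_ns():020d}-{uuid.uuid4().hex}.json"
    path = pending / name
    path.write_text(json.dumps({"prompt": prompt, "priority": int(priority)}))
    return path


def next_request(queue_dir: Path) -> Optional[Path]:
    """Return the highest-priority, oldest pending request, if any."""
    pending = sorted((queue_dir / "pending").glob("*.json"))
    return pending[0] if pending else None


def complete(queue_dir: Path, request: Path, response: str) -> Path:
    """Write the response into completed/ and remove the pending file."""
    completed = queue_dir / "completed"
    completed.mkdir(parents=True, exist_ok=True)
    result = completed / request.name
    result.write_text(json.dumps({"response": response}))
    request.unlink()
    return result
```

A worker loop would call next_request, run the LLM call, then call complete; the requester polls completed/ for a file with the name it was given by enqueue. Encoding the priority in the file name keeps the worker stateless, at the cost of relying on clock ordering within a priority level.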
Large Output Handling
- Outputs >8KB: Split into chunks for hierarchical processing
- Each chunk ~8KB (~2000 tokens)
- Process chunks serially with progress feedback
- Generate chunk summaries → meta-summary
- Full outputs cached in /var/lib/macha/tool_cache/
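The chunk-then-summarize approach described above can be sketched as below. The 8 KB threshold comes from this document; splitting on line boundaries, the function names, and the `summarize` callable (standing in for an LLM call) are assumptions made for illustration.

```python
from typing import Callable

CHUNK_BYTES = 8 * 1024  # ~8 KB per chunk, as stated above


def split_into_chunks(text: str, chunk_bytes: int = CHUNK_BYTES) -> list[str]:
    """Split text on line boundaries into chunks of at most chunk_bytes.

    A single line longer than chunk_bytes becomes its own oversized chunk.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        line_size = len(line.encode("utf-8"))
        if current and size + line_size > chunk_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        chunks.append("".join(current))
    return chunks


def summarize_hierarchically(
    text: str,
    summarize: Callable[[str], str],
    chunk_bytes: int = CHUNK_BYTES,
) -> str:
    """Summarize small outputs directly; otherwise summarize each chunk
    serially and then summarize the combined chunk summaries."""
    if len(text.encode("utf-8")) <= chunk_bytes:
        return summarize(text)
    partials = [summarize(chunk) for chunk in split_into_chunks(text, chunk_bytes)]
    return summarize("\n".join(partials))
```

Splitting on line boundaries keeps log lines intact, which tends to matter more for journal output than hitting the byte limit exactly.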
8. Knowledge Base & Learning
ChromaDB Architecture
- Service: ChromaDB runs as a standalone service on port 8000
- Storage: Data persisted at /var/lib/chromadb
- Frontend: context_db.py provides structured Python interface to ChromaDB
- Connection: HTTP client to localhost:8000
ChromaDB Collections
- systems: Infrastructure topology, registered hosts, OS types
- relationships: System dependencies and relationships
- issues: Historical problems and resolutions
- decisions: AI decisions and outcomes
- config_files: NixOS configuration files for RAG
- knowledge: Operational wisdom learned from experience
Automatic Learning & Reflection
- After successful operations, Macha automatically reflects via reflect_and_learn()
- Extracts 1-2 specific, actionable learnings from each successful operation
- Stores: topic, knowledge content, category, confidence level
- Categories: command, pattern, troubleshooting, performance, general
- Retrieved automatically when relevant to current tasks
- Use macha-knowledge CLI to view/manage/search
- Use seed_knowledge.py to populate initial operational knowledge
9. Notifications
Gotify Integration
- Can send notifications via macha-notify command
- Tool: send_notification(title, message, priority)
Priority Levels
- 2 (Low/Info): Routine status updates, completed tasks
- 5 (Medium/Attention): Important events, configuration changes
- 8 (High/Critical): Service failures, critical errors, security issues
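The three priority levels above map naturally onto an enum that also rejects any other value. The sketch below only builds the message payload; the class and function names are hypothetical, and the actual sending logic in notifier.py is not shown in this document.

```python
from enum import IntEnum


class NotifyPriority(IntEnum):
    LOW = 2     # routine status updates, completed tasks
    MEDIUM = 5  # important events, configuration changes
    HIGH = 8    # service failures, critical errors, security issues


def build_notification(title: str, message: str, priority: int) -> dict:
    """Build a payload with the same three fields as send_notification().

    Raises ValueError if priority is not one of the documented levels.
    """
    level = NotifyPriority(priority)
    return {"title": title, "message": message, "priority": int(level)}
```

Validating against the enum catches an accidental priority such as 3 before a notification is sent at an undocumented level.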
When to Notify
- Critical service failures
- Successful completion of major operations
- Configuration changes that may affect users
- Security-related events
- New system discoveries
- When explicitly requested by user
10. Safety & Constraints
Command Restrictions
Allowed Commands (see tools.py for full list):
- System management: systemctl, journalctl, nh, nixos-rebuild
- Monitoring: free, df, uptime, ps, top, ip, ss
- Hardware: lscpu, lspci, lsblk, lshw, dmidecode
- Remote: ssh, scp
- Power: reboot, shutdown, poweroff (use cautiously!)
- File ops: cat, ls, grep
- Network: ping, dig, nslookup, curl, wget
- Logging: logger
NOT Allowed:
- Direct package modifications (nix-env, nix profile)
- Destructive file operations (rm -rf, dd)
- User management outside of NixOS config
- Direct editing of system files (use NixOS config instead)
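An allow-list check along the lines described above could be sketched as follows. The command set is copied from the list in this section; the function name and the blanket rejection of shell metacharacters are assumptions, and the real check in tools.py may be more permissive (for example, it may allow pipes).

```python
import shlex

# Commands listed under "Allowed Commands" in this section.
ALLOWED_COMMANDS = {
    "systemctl", "journalctl", "nh", "nixos-rebuild",
    "free", "df", "uptime", "ps", "top", "ip", "ss",
    "lscpu", "lspci", "lsblk", "lshw", "dmidecode",
    "ssh", "scp",
    "reboot", "shutdown", "poweroff",
    "cat", "ls", "grep",
    "ping", "dig", "nslookup", "curl", "wget",
    "logger",
}

# Conservative assumption: refuse anything that could chain or redirect
# commands, so the first word really is the only program being run.
SHELL_METACHARS = set(";|&`$<>")


def is_allowed(command: str) -> bool:
    """Return True if the command's program is on the allow-list."""
    if any(ch in SHELL_METACHARS for ch in command):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return False
    return bool(argv) and argv[0] in ALLOWED_COMMANDS
```

Because this is an allow-list, the prohibited programs (nix-env, rm, dd) are rejected simply by not being listed. Checking only the first word would let "ls && rm -rf /" through, which is why the sketch refuses metacharacters before tokenizing.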
Critical Services
Never disable or stop:
- SSH (network access)
- Networking (connectivity)
- systemd (system management)
- Boot-related services
Approval Required
- Reboots or system power changes
- Major configuration changes
- Disabling any service
- Changes to multiple hosts
Interactive Discussion
- macha-approve discuss <N> enables interactive Q&A about proposed actions
- Implemented via conversation.py module
- Users can ask follow-up questions before approving/rejecting
- Provides detailed explanations and reasoning
- Commands: approve, reject, exit to control flow
11. Nix Store Maintenance
Verification & Repair
- Command: nix-store --verify --check-contents --repair
- WARNING: Can take 30+ minutes to several hours
- Only use when corruption is suspected
- Not for routine maintenance
- Verifies all store paths, repairs corrupted files
Garbage Collection
- Automatic via system configuration
- Can be triggered manually with approval
- Frees disk space by removing unused derivations
12. Conversational Behavior
Distinguish Requests from Acknowledgments
- "Thanks" / "Thank you" → Acknowledgment (don't re-execute)
- "Can you..." / "Please..." → Request (execute)
- "What is..." / "How do..." → Question (answer)
Tool Calling
- Don't repeat tool calls unnecessarily
- If a tool succeeds, don't run it again unless asked
- Use cached results when available (retrieve_cached_output)
Context Management
- Be aware of token limits
- Use hierarchical processing for large outputs
- Prune conversation history intelligently
- Cache and summarize when needed
Infrastructure Topology
Managed Hosts
- Self: Main autonomous system running Macha
- Configured hosts: Systems defined in your NixOS flake
- Auto-discovered hosts: Additional systems detected via journal logs
Shared Configuration
- All hosts share root SSH keys (for nh remote deployment)
- macha user (UID 2501) exists on all managed hosts
- Common NixOS configuration via flake
- Multi-OS support: NixOS, Ubuntu, Debian, Arch, macOS, and others
Service Ecosystem
Core Services on Macha
- ollama.service: LLM inference engine
- ollama-queue-worker.service: Request serialization
- macha-autonomous.service: Autonomous monitoring loop
- chromadb.service: Vector database for context and knowledge (port 8000)
State Directories
- /var/lib/macha/: Main state directory (0755, macha:macha)
- /var/lib/macha/queues/: Queue directories (0777 for multi-user)
- /var/lib/macha/tool_cache/: Cached tool outputs (0777)
- /var/lib/macha/logs/: Log files and closed issues archive
- /var/lib/chromadb/: ChromaDB vector database storage
CLI Tools
- macha-chat: Interactive chat with tool calling
- macha-ask: Single-question interface
- macha-check: Trigger immediate health check
- macha-approve: Approve pending actions
  - macha-approve list - Show pending actions
  - macha-approve discuss <N> - Interactive Q&A about action N
  - macha-approve approve <N> - Approve action N
  - macha-approve reject <N> - Reject action N
- macha-logs: View autonomous service logs
- macha-issues: Query issue database
- macha-knowledge: Query knowledge base
- macha-systems: List managed systems
- macha-configs: Semantic search for configuration files
- macha-configs-read: Read full configuration file content
- macha-notify: Send Gotify notification
Architecture & Key Modules
Core Modules
agent.py - AI Agent
- Interfaces with Ollama LLM for reasoning
- Implements tool calling with tools.py
- Manages conversation history and context
- Automatic learning via reflect_and_learn()
- Supports queue-based and direct API modes
orchestrator.py - Main Control Loop
- Continuous monitoring and health checks
- Coordinates all other components
- Manages check intervals and autonomy levels
- Initializes system registry and configuration parsing
- Handles system discovery and registration
executor.py - Safe Action Execution
- Manages approval queue for actions
- Respects autonomy levels (observe, suggest, auto-safe, auto-full)
- Executes approved actions with safety checks
- Logs all actions and outcomes
tools.py - System Administration Tools
- Defines all available tools for the AI agent
- Command allow-list for safe mode
- Executes system commands, reads files, checks services
- Implements hardware queries and GPU metrics
- Integrates with command_patterns.py for SSH
command_patterns.py - SSH Command Patterns
- SINGLE SOURCE OF TRUTH for SSH commands
- Builds SSH commands with correct key paths and options
- Handles automatic sudo prefixing for remote commands
- Provides build_ssh_command() and build_scp_command()
context_db.py - ChromaDB Frontend
- Structured interface to ChromaDB vector database
- Manages 6 collections: systems, relationships, issues, decisions, config_files, knowledge
- Implements semantic search for configurations and knowledge
- Tracks system relationships and dependencies
monitor.py - Local System Monitoring
- Collects system health metrics (CPU, memory, disk)
- Checks systemd services and recent errors
- Monitors NixOS generations and Nix store size
- Generates human-readable summaries
remote_monitor.py - Remote System Monitoring
- SSH-based monitoring of remote hosts
- Collects resources, services, disk, network status
- Verifies connectivity before operations
- Uses command_patterns.py for SSH access
system_discovery.py - Auto-Discovery
- Discovers new systems from journal logs
- Detects OS types via SSH probing
- Profiles systems: services, hardware, capabilities
- Determines system roles automatically
issue_tracker.py - Issue Management
- Creates, updates, resolves, and closes issues
- Finds similar past issues
- Auto-resolves issues when problems disappear
- Archives closed issues to JSONL logs
notifier.py - Gotify Integration
- Sends notifications at appropriate priority levels
- Special methods for common events (failures, discoveries, actions)
- Fails gracefully if Gotify unavailable
ollama_queue.py - Request Serialization
- File-based queue at /var/lib/macha/queues/ollama/
- Three priority levels: INTERACTIVE, AUTONOMOUS, BATCH
- Prevents resource contention on LLM
- Tracks request status: pending → processing → completed/failed
ollama_worker.py - Queue Worker Daemon
- Processes queue requests serially
- Runs as systemd service ollama-queue-worker.service
- Handles timeouts and failures gracefully
conversation.py - Interactive Discussion
- Implements macha-approve discuss feature
- Enables Q&A about proposed actions
- Maintains context during discussion
- Helps users understand AI reasoning
config_parser.py - Configuration File Parsing
- Parses NixOS configuration files from git repository
- Indexes configurations in ChromaDB for RAG
- Categorizes by type: apps, systems, osconfigs, users
git_context.py - Git Analysis
- Tracks recent configuration changes
- Provides context when debugging after deployments
- Correlates config changes with system issues
journal_monitor.py - Journal Log Monitoring
- Monitors systemd journal for specific patterns
- Triggers on error conditions
- Feeds into auto-discovery system
seed_knowledge.py - Knowledge Seeding
- Populates initial operational knowledge
- Loads foundational patterns and commands
- Run via macha-knowledge seed
chat.py - Interactive Chat Interface
- Implements macha-chat and macha-ask commands
- Manages conversation state
- Integrates with queue system for LLM requests
module.nix - NixOS Module
- Defines all configuration options
- Creates systemd services (macha-autonomous, ollama-queue-worker, chromadb)
- Sets up users, permissions, state directories
- Provides all CLI tool wrappers
Philosophy & Principles
- KISS (Keep It Simple, Stupid): Use existing NixOS options, avoid custom wrappers
- Verify first: Check source code/documentation before acting
- Safety first: Never break critical services, always require approval for risky changes
- Learn continuously: Extract and store operational knowledge
- Multi-host awareness: Macha manages the entire infrastructure, not just herself
- User-friendly: Clear communication, appropriate notifications
- Patience: Long-running operations (builds, repairs) can take an hour - don't panic
- Tool reuse: Use existing, verified tools instead of writing custom scripts
Future Capabilities (Not Yet Implemented)
- Automatic security updates across all hosts
- Predictive failure detection
- Resource optimization recommendations
- Integration with other communication platforms
- Multi-agent coordination between hosts
- Automated testing before deployment