macha-autonomous/DESIGN.md
Lily Miller cc8334f2c5 Enhance DESIGN.md with new features and clarifications
- Added USB device detection and monitoring capabilities.
- Updated SSH usage patterns to reflect dynamic host configurations.
- Introduced automatic system discovery from journal logs, including OS detection and system profiling.
- Enhanced configuration file intelligence with semantic search and categorization.
- Expanded knowledge base structure and automatic learning processes.
- Clarified the architecture and key modules for better understanding of system components.
2025-10-09 16:18:34 -06:00


Macha Autonomous System - Design Document

⚠️ IMPORTANT - READ THIS FIRST
FOR AI ASSISTANT: This document is YOUR reference guide when modifying Macha's code.

  • ALWAYS consult this BEFORE refactoring to ensure you don't remove existing capabilities
  • CHECK this when adding features to avoid conflicts
  • UPDATE this document when new capabilities are added
  • DO NOT DELETE ANYTHING FROM THIS DOCUMENT
  • During major refactors, you MUST verify each capability listed here is preserved

Overview

Macha is an AI-powered autonomous system administrator capable of monitoring, maintaining, and managing multiple NixOS hosts in the infrastructure.

Core Capabilities

1. Local System Management

  • Monitor system health (CPU, memory, disk, services)
  • Read and analyze logs via journalctl
  • Check service status and restart failed services
  • Execute system commands (with safety restrictions)
  • Monitor and repair Nix store corruption
  • Hardware awareness (CPU, GPU, network, storage)
  • USB device detection and monitoring

2. Multi-Host Management via SSH

Macha CAN and SHOULD use SSH to manage other hosts.

SSH Access

  • CRITICAL: All SSH command patterns are defined in command_patterns.py (SINGLE SOURCE OF TRUTH)
  • DO NOT DUPLICATE these patterns elsewhere - import from command_patterns.py
  • Always uses an explicit SSH key path: -i /var/lib/macha/.ssh/id_ed25519
  • All SSH commands automatically include the -i flag with the absolute key path
  • Remote commands are always prefixed with sudo
  • Runs as the macha user (UID 2501) for standard operations
  • The macha user has NOPASSWD sudo access for administrative commands
  • Note: Some internal operations (like remote monitoring) may use root SSH for privileged access
  • Shares SSH keys with other hosts in the infrastructure
  • Can SSH to any host defined in your NixOS flake configuration

SSH Usage Patterns

  1. Direct diagnostic commands:

    ssh hostname systemctl status service-name
    ssh hostname df -h
    
    • Commands automatically transformed by the tools layer
    • Full command: ssh -i /var/lib/macha/.ssh/id_ed25519 -o StrictHostKeyChecking=no macha@hostname sudo systemctl status service-name
    • SSH key path is always explicit, commands are automatically prefixed with sudo
  2. Status checks:

    • Check service health on remote hosts
    • Gather system metrics
    • Review logs
    • Monitor resource usage
  3. File operations:

    • Use scp to copy files between hosts
    • Read configuration files on remote systems
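The transformation described in pattern 1 can be sketched as follows. This is a minimal illustration, not the contents of command_patterns.py: the function signature and the `user` default are assumptions, and only the key path, the StrictHostKeyChecking option, and the sudo prefix come from the text above.

```python
import shlex

SSH_KEY = "/var/lib/macha/.ssh/id_ed25519"  # key path stated in this document


def build_ssh_command(host: str, remote_cmd: str, user: str = "macha") -> list[str]:
    """Return an argv list that runs remote_cmd on host over SSH.

    Hypothetical signature: the real build_ssh_command() may differ.
    """
    remote = remote_cmd.strip()
    # Prefix with sudo unless the caller already did so.
    if not remote.startswith("sudo "):
        remote = f"sudo {remote}"
    return [
        "ssh",
        "-i", SSH_KEY,
        "-o", "StrictHostKeyChecking=no",
        f"{user}@{host}",
        remote,
    ]


if __name__ == "__main__":
    argv = build_ssh_command("hostname", "systemctl status service-name")
    print(shlex.join(argv))
```

Returning an argv list rather than a single string lets the caller pass it to subprocess without a local shell, so only the remote side interprets the command.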

When to use SSH vs nh

  • SSH: For diagnostics, status checks, log review, quick commands
  • nh remote deployment: For applying NixOS configuration changes
    • nh os switch -u --target-host=hostname --hostname=hostname
    • Builds locally, deploys to remote host
    • Use for permanent configuration changes

3. Automatic System Discovery

Discovery from Journal Logs

  • Monitors systemd-journal-remote logs for new systems sending logs
  • Parses _HOSTNAME field from remote journal entries
  • Automatically converts short hostnames to FQDNs (.coven.systems)
  • Discovers systems within configurable time windows (default: 10 minutes)
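A minimal sketch of the hostname normalization and time-window filter described above. The entry format is an assumption: each entry is a dict carrying the journal's _HOSTNAME field plus a hypothetical "timestamp" key holding a timezone-aware datetime. Lowercasing the hostname is also an assumption, not something this document specifies.

```python
from datetime import datetime, timedelta, timezone

DOMAIN = "coven.systems"                 # suffix named in this document
DEFAULT_WINDOW = timedelta(minutes=10)   # default discovery window


def to_fqdn(hostname: str, domain: str = DOMAIN) -> str:
    """Append the domain to a short hostname; names containing a dot are kept."""
    name = hostname.strip().lower().rstrip(".")
    return name if "." in name else f"{name}.{domain}"


def discover_hosts(entries, known, now=None, window=DEFAULT_WINDOW):
    """Return FQDNs seen within the window that are not yet registered."""
    now = now or datetime.now(timezone.utc)
    found = set()
    for entry in entries:
        host = entry.get("_HOSTNAME")
        # Skip entries without a hostname or older than the window.
        if not host or now - entry["timestamp"] > window:
            continue
        fqdn = to_fqdn(host)
        if fqdn not in known:
            found.add(fqdn)
    return found
```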

OS Detection

  • Detects operating system via SSH (/etc/os-release, uname)
  • Supports: NixOS, Ubuntu, Debian, Arch, Fedora, RHEL, Alpine, macOS, FreeBSD
  • Falls back to generic "linux" if specific distro cannot be determined
  • Records OS type in system registry for appropriate management strategies
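The detection logic above might look roughly like this once the output of /etc/os-release and uname has been fetched over SSH. The function names and the returned labels are illustrative assumptions. The ID values are the ones these distributions set in os-release.

```python
KNOWN_DISTROS = {"nixos", "ubuntu", "debian", "arch", "fedora", "rhel", "alpine"}


def parse_os_release(text: str) -> dict[str, str]:
    """Parse KEY=value lines from /etc/os-release, stripping optional quotes."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def detect_os(os_release: str = "", uname: str = "") -> str:
    """Return an OS label, falling back to generic "linux"."""
    distro = parse_os_release(os_release).get("ID", "").lower()
    if distro in KNOWN_DISTROS:
        return distro
    # Systems without a recognized os-release ID are identified by kernel name.
    kernel = uname.strip().lower()
    if kernel == "darwin":
        return "macos"
    if kernel == "freebsd":
        return "freebsd"
    return "linux"
```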

System Profiling

  • Automatically gathers system information upon discovery:
    • Running services (via systemctl list-units)
    • Hardware info (CPU cores, memory)
    • Capabilities (containers, web-server, database, remote-access)
    • System role determination (ai-workstation, server, workstation, minimal)
  • Registers discovered systems in ChromaDB context database
  • Sends notification when new systems are discovered

4. NixOS Configuration Management

Local Changes

  • Can propose changes to NixOS configuration
  • Requires human approval before applying
  • Uses nh os switch for local updates

Remote Deployment

  • Can deploy to other hosts using nh with --target-host
  • Builds configuration locally (on Macha)
  • Pushes to remote system
  • Can take up to 1 hour for complex builds
  • IMPORTANT: Be patient with long-running builds, don't retry prematurely

5. Configuration File Intelligence (RAG)

  • NixOS configuration files indexed in ChromaDB for semantic search
  • Query configurations by natural language: "gotify configuration", "journald settings"
  • CLI tools: macha-configs <query> and macha-configs-read <path>
  • Relevance scoring helps find the right config files quickly
  • Supports filtering by system hostname or category

Configuration Categories

  • apps/ - Application configurations (Gotify, Nextcloud, etc.)
  • systems/ - Per-host system configurations
  • osconfigs/ - Operating system level settings
  • users/ - User management configurations

Git Context Analysis

  • Tracks recent changes to configuration files via git_context.py
  • Correlates config changes with system behavior
  • Provides context when investigating issues after deployments
  • Helps understand "what changed" when debugging

6. Hardware Awareness

Local Hardware Detection

  • CPU: lscpu via nix-shell -p util-linux
  • GPU: lspci via nix-shell -p pciutils
  • Network: ip addr

  • Storage: df -h, lsblk
  • USB devices: lsusb

GPU Metrics

  • AMD GPUs: Try rocm-smi, sysfs (/sys/class/drm/card*/device/)
  • NVIDIA GPUs: Try nvidia-smi
  • Fallback: sensors for temperature data
  • Queries: temperature, utilization, clock speeds, power usage

7. Ollama Queue System

Architecture

  • File-based queue: /var/lib/macha/queues/ollama/
  • Queue worker: ollama-queue-worker.service (runs as macha user)
  • Purpose: Serialize all LLM requests to prevent resource contention

Request Flow

  1. Any user (including regular users) → Write request to pending/
  2. Queue worker → Process requests serially (FIFO with priority)
  3. Queue worker → Write response to completed/
  4. Original requester → Read response from completed/

Priority Levels

  • INTERACTIVE (0): User requests via macha-chat, macha-ask
  • AUTONOMOUS (1): Background maintenance checks
  • BATCH (2): Low-priority bulk operations
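The request flow and priority levels can be sketched with a small file-based queue. This is an illustration under assumptions, not the code in ollama_queue.py: the JSON field names, the one-file-per-request layout, and the function names are hypothetical. Only the pending/ directory and the three priority values come from this document.

```python
import json
import time
import uuid
from enum import IntEnum
from pathlib import Path


class Priority(IntEnum):
    INTERACTIVE = 0
    AUTONOMOUS = 1
    BATCH = 2


def enqueue(queue_dir: Path, payload: dict, priority: Priority) -> Path:
    """Write one request file into pending/ and return its path."""
    pending = queue_dir / "pending"
    pending.mkdir(parents=True, exist_ok=True)
    request = {
        "id": uuid.uuid4().hex,
        "priority": int(priority),
        "created": time.time_ns(),
        "payload": payload,
    }
    path = pending / f"{request['id']}.json"
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(request))
    tmp.rename(path)  # rename so the worker never reads a half-written file
    return path


def next_request(queue_dir: Path):
    """Return the pending request with the lowest priority value, oldest first."""
    requests = [
        json.loads(p.read_text())
        for p in (queue_dir / "pending").glob("*.json")
    ]
    if not requests:
        return None
    return min(requests, key=lambda r: (r["priority"], r["created"]))
```

Sorting on the pair (priority, created) gives FIFO order within a priority level while letting interactive requests jump ahead of background work.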

Large Output Handling

  • Outputs >8KB: Split into chunks for hierarchical processing
  • Each chunk ~8KB (~2000 tokens)
  • Process chunks serially with progress feedback
  • Generate chunk summaries → meta-summary
  • Full outputs cached in /var/lib/macha/tool_cache/
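A minimal sketch of the chunk-then-summarize approach described above. The `summarize` argument is a caller-supplied callable standing in for an LLM request. Splitting on line boundaries and returning short outputs unchanged are assumptions of this sketch.

```python
CHUNK_BYTES = 8 * 1024  # ~8KB per chunk, as stated above


def split_output(text: str, chunk_bytes: int = CHUNK_BYTES) -> list[str]:
    """Split text into chunks of roughly chunk_bytes, breaking between lines.

    A single line longer than chunk_bytes becomes its own oversized chunk.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        line_size = len(line.encode("utf-8"))
        if current and size + line_size > chunk_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        chunks.append("".join(current))
    return chunks


def summarize_hierarchically(text: str, summarize, chunk_bytes: int = CHUNK_BYTES) -> str:
    """Summarize each chunk in turn, then summarize the combined summaries."""
    chunks = split_output(text, chunk_bytes)
    if len(chunks) <= 1:
        return text  # small enough to use directly (assumption)
    partials = [summarize(chunk) for chunk in chunks]  # processed serially
    return summarize("\n".join(partials))              # meta-summary
```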

8. Knowledge Base & Learning

ChromaDB Architecture

  • Service: ChromaDB runs as a standalone service on port 8000
  • Storage: Data persisted at /var/lib/chromadb
  • Frontend: context_db.py provides structured Python interface to ChromaDB
  • Connection: HTTP client to localhost:8000

ChromaDB Collections

  1. systems: Infrastructure topology, registered hosts, OS types
  2. relationships: System dependencies and relationships
  3. issues: Historical problems and resolutions
  4. decisions: AI decisions and outcomes
  5. config_files: NixOS configuration files for RAG
  6. knowledge: Operational wisdom learned from experience

Automatic Learning & Reflection

  • After successful operations, Macha automatically reflects via reflect_and_learn()
  • Extracts 1-2 specific, actionable learnings from each successful operation
  • Stores: topic, knowledge content, category, confidence level
  • Categories: command, pattern, troubleshooting, performance, general
  • Retrieved automatically when relevant to current tasks
  • Use macha-knowledge CLI to view/manage/search
  • Use seed_knowledge.py to populate initial operational knowledge
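The stored fields listed above suggest a record along these lines. This is a hypothetical shape, not the schema used by context_db.py: the class name, the default values, and the confidence scale are assumptions, since this document names the fields and categories but not their types.

```python
from dataclasses import asdict, dataclass

CATEGORIES = {"command", "pattern", "troubleshooting", "performance", "general"}


@dataclass(frozen=True)
class Learning:
    topic: str
    content: str
    category: str = "general"
    confidence: str = "medium"  # hypothetical scale, not defined in this document

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if not self.topic.strip() or not self.content.strip():
            raise ValueError("topic and content must be non-empty")


def to_record(learning: Learning) -> dict:
    """Shape a learning as document text plus metadata for a vector store."""
    return {
        "document": f"{learning.topic}: {learning.content}",
        "metadata": asdict(learning),
    }
```

Validating the category at construction time keeps free-form labels out of the knowledge collection, which makes later filtering by category reliable.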

9. Notifications

Gotify Integration

  • Can send notifications via macha-notify command
  • Tool: send_notification(title, message, priority)

Priority Levels

  • 2 (Low/Info): Routine status updates, completed tasks
  • 5 (Medium/Attention): Important events, configuration changes
  • 8 (High/Critical): Service failures, critical errors, security issues
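As an illustration of how these priority values map onto a Gotify message, the sketch below builds (but does not send) a request for Gotify's POST /message endpoint, authenticated with an application token in the X-Gotify-Key header. The function and enum names are assumptions, and this is not the code in notifier.py.

```python
import json
import urllib.request
from enum import IntEnum


class NotifyPriority(IntEnum):
    LOW = 2      # routine status updates, completed tasks
    MEDIUM = 5   # important events, configuration changes
    HIGH = 8     # service failures, critical errors, security issues


def build_notification(base_url: str, token: str, title: str, message: str,
                       priority: NotifyPriority = NotifyPriority.MEDIUM):
    """Return a urllib Request for Gotify's POST /message endpoint."""
    body = json.dumps(
        {"title": title, "message": message, "priority": int(priority)}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/message",
        data=body,
        headers={"Content-Type": "application/json", "X-Gotify-Key": token},
        method="POST",
    )
```

Passing the result to urllib.request.urlopen() with a timeout would send it. Wrapping that call in a try/except for URLError is one way to fail gracefully when Gotify is unavailable.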

When to Notify

  • Critical service failures
  • Successful completion of major operations
  • Configuration changes that may affect users
  • Security-related events
  • New system discoveries
  • When explicitly requested by user

10. Safety & Constraints

Command Restrictions

Allowed Commands (see tools.py for full list):

  • System management: systemctl, journalctl, nh, nixos-rebuild
  • Monitoring: free, df, uptime, ps, top, ip, ss
  • Hardware: lscpu, lspci, lsblk, lshw, dmidecode
  • Remote: ssh, scp
  • Power: reboot, shutdown, poweroff (use cautiously!)
  • File ops: cat, ls, grep
  • Network: ping, dig, nslookup, curl, wget
  • Logging: logger

NOT Allowed:

  • Direct package modifications (nix-env, nix profile)
  • Destructive file operations (rm -rf, dd)
  • User management outside of NixOS config
  • Direct editing of system files (use NixOS config instead)
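A minimal sketch of an allow-list check built from the lists above. The real list lives in tools.py and may be enforced differently. The function name and the blanket rejection of shell metacharacters are assumptions of this sketch, chosen so that a permitted command cannot be chained with a forbidden one.

```python
import shlex

ALLOWED = {
    "systemctl", "journalctl", "nh", "nixos-rebuild",
    "free", "df", "uptime", "ps", "top", "ip", "ss",
    "lscpu", "lspci", "lsblk", "lshw", "dmidecode",
    "ssh", "scp", "reboot", "shutdown", "poweroff",
    "cat", "ls", "grep", "ping", "dig", "nslookup",
    "curl", "wget", "logger",
}
SHELL_METACHARACTERS = set(";|&<>$`\n")


def is_allowed(command: str) -> bool:
    """Return True if the command's program is on the allow-list."""
    # Conservative: refuse anything that could chain or redirect commands.
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:  # unbalanced quotes
        return False
    # Require a bare program name so a path cannot shadow an allowed tool.
    return bool(argv) and argv[0] in ALLOWED
```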

Critical Services

Never disable or stop:

  • SSH (network access)
  • Networking (connectivity)
  • systemd (system management)
  • Boot-related services

Approval Required

  • Reboots or system power changes
  • Major configuration changes
  • Disabling any service
  • Changes to multiple hosts
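The approval rules above could be expressed as a single predicate. This is a sketch under assumptions: the Action fields, the `major_config_change` flag, and the set of systemctl subcommands treated as disabling a service or changing power state are all hypothetical, and executor.py may decide differently.

```python
from dataclasses import dataclass, field

POWER_COMMANDS = {"reboot", "shutdown", "poweroff"}
SENSITIVE_SYSTEMCTL = {"disable", "mask", "reboot", "poweroff", "halt"}


@dataclass
class Action:
    command: list[str]
    hosts: list[str] = field(default_factory=lambda: ["localhost"])
    major_config_change: bool = False  # set by the caller (assumption)


def requires_approval(action: Action) -> bool:
    """Return True when the action falls under the approval rules above."""
    program = action.command[0] if action.command else ""
    if program in POWER_COMMANDS:
        return True  # reboots or system power changes
    if action.major_config_change:
        return True  # major configuration changes
    if program == "systemctl" and any(
        arg in SENSITIVE_SYSTEMCTL for arg in action.command[1:]
    ):
        return True  # disabling a service, or power changes via systemctl
    return len(set(action.hosts)) > 1  # changes to multiple hosts
```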

Interactive Discussion

  • macha-approve discuss <N> enables interactive Q&A about proposed actions
  • Implemented via conversation.py module
  • Users can ask follow-up questions before approving/rejecting
  • Provides detailed explanations and reasoning
  • Commands: approve, reject, exit to control flow

11. Nix Store Maintenance

Verification & Repair

  • Command: nix-store --verify --check-contents --repair
  • WARNING: Can take 30+ minutes to several hours
  • Only use when corruption is suspected
  • Not for routine maintenance
  • Verifies all store paths, repairs corrupted files

Garbage Collection

  • Automatic via system configuration
  • Can be triggered manually with approval
  • Frees disk space by removing unused derivations

12. Conversational Behavior

Distinguish Requests from Acknowledgments

  • "Thanks" / "Thank you" → Acknowledgment (don't re-execute)
  • "Can you..." / "Please..." → Request (execute)
  • "What is..." / "How do..." → Question (answer)
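The distinction above can be illustrated with a small keyword heuristic. In practice the LLM makes this judgment from context. The function name, the phrase lists, and the "unknown" label are assumptions of this sketch.

```python
import re

REQUEST = re.compile(r"\b(can you|could you|please)\b")
ACKNOWLEDGMENT = re.compile(r"^(thanks|thank you|thx)\b")
QUESTION = re.compile(r"^(what|how|why|when|where|who|which)\b")


def classify(message: str) -> str:
    """Label a message as request, acknowledgment, question, or unknown."""
    text = message.strip().lower()
    # A request phrase anywhere wins, so "Thanks, can you also..." still runs.
    if REQUEST.search(text):
        return "request"
    if ACKNOWLEDGMENT.match(text):
        return "acknowledgment"
    if QUESTION.match(text) or text.endswith("?"):
        return "question"
    return "unknown"
```

Checking for a request before an acknowledgment matters: a message that opens with thanks but goes on to ask for something should still be executed.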

Tool Calling

  • Don't repeat tool calls unnecessarily
  • If a tool succeeds, don't run it again unless asked
  • Use cached results when available (retrieve_cached_output)

Context Management

  • Be aware of token limits
  • Use hierarchical processing for large outputs
  • Prune conversation history intelligently
  • Cache and summarize when needed

Infrastructure Topology

Managed Hosts

  • Self: Main autonomous system running Macha
  • Configured hosts: Systems defined in your NixOS flake
  • Auto-discovered hosts: Additional systems detected via journal logs

Shared Configuration

  • All hosts share root SSH keys (for nh remote deployment)
  • macha user (UID 2501) exists on all managed hosts
  • Common NixOS configuration via flake
  • Multi-OS support: NixOS, Ubuntu, Debian, Arch, macOS, and others

Service Ecosystem

Core Services on Macha

  • ollama.service: LLM inference engine
  • ollama-queue-worker.service: Request serialization
  • macha-autonomous.service: Autonomous monitoring loop
  • chromadb.service: Vector database for context and knowledge (port 8000)

State Directories

  • /var/lib/macha/: Main state directory (0755, macha:macha)
  • /var/lib/macha/queues/: Queue directories (0777 for multi-user)
  • /var/lib/macha/tool_cache/: Cached tool outputs (0777)
  • /var/lib/macha/logs/: Log files and closed issues archive
  • /var/lib/chromadb/: ChromaDB vector database storage

CLI Tools

  • macha-chat: Interactive chat with tool calling
  • macha-ask: Single-question interface
  • macha-check: Trigger immediate health check
  • macha-approve: Approve pending actions
    • macha-approve list - Show pending actions
    • macha-approve discuss <N> - Interactive Q&A about action N
    • macha-approve approve <N> - Approve action N
    • macha-approve reject <N> - Reject action N
  • macha-logs: View autonomous service logs
  • macha-issues: Query issue database
  • macha-knowledge: Query knowledge base
  • macha-systems: List managed systems
  • macha-configs: Semantic search for configuration files
  • macha-configs-read: Read full configuration file content
  • macha-notify: Send Gotify notification

Architecture & Key Modules

Core Modules

agent.py - AI Agent

  • Interfaces with Ollama LLM for reasoning
  • Implements tool calling with tools.py
  • Manages conversation history and context
  • Automatic learning via reflect_and_learn()
  • Supports queue-based and direct API modes

orchestrator.py - Main Control Loop

  • Continuous monitoring and health checks
  • Coordinates all other components
  • Manages check intervals and autonomy levels
  • Initializes system registry and configuration parsing
  • Handles system discovery and registration

executor.py - Safe Action Execution

  • Manages approval queue for actions
  • Respects autonomy levels (observe, suggest, auto-safe, auto-full)
  • Executes approved actions with safety checks
  • Logs all actions and outcomes

tools.py - System Administration Tools

  • Defines all available tools for the AI agent
  • Command allow-list for safe mode
  • Executes system commands, reads files, checks services
  • Implements hardware queries and GPU metrics
  • Integrates with command_patterns.py for SSH

command_patterns.py - SSH Command Patterns

  • SINGLE SOURCE OF TRUTH for SSH commands
  • Builds SSH commands with correct key paths and options
  • Handles automatic sudo prefixing for remote commands
  • Provides build_ssh_command() and build_scp_command()

context_db.py - ChromaDB Frontend

  • Structured interface to ChromaDB vector database
  • Manages 6 collections: systems, relationships, issues, decisions, config_files, knowledge
  • Implements semantic search for configurations and knowledge
  • Tracks system relationships and dependencies

monitor.py - Local System Monitoring

  • Collects system health metrics (CPU, memory, disk)
  • Checks systemd services and recent errors
  • Monitors NixOS generations and Nix store size
  • Generates human-readable summaries

remote_monitor.py - Remote System Monitoring

  • SSH-based monitoring of remote hosts
  • Collects resources, services, disk, network status
  • Verifies connectivity before operations
  • Uses command_patterns.py for SSH access

system_discovery.py - Auto-Discovery

  • Discovers new systems from journal logs
  • Detects OS types via SSH probing
  • Profiles systems: services, hardware, capabilities
  • Determines system roles automatically

issue_tracker.py - Issue Management

  • Creates, updates, resolves, and closes issues
  • Finds similar past issues
  • Auto-resolves issues when problems disappear
  • Archives closed issues to JSONL logs

notifier.py - Gotify Integration

  • Sends notifications at appropriate priority levels
  • Special methods for common events (failures, discoveries, actions)
  • Fails gracefully if Gotify unavailable

ollama_queue.py - Request Serialization

  • File-based queue at /var/lib/macha/queues/ollama/
  • Three priority levels: INTERACTIVE, AUTONOMOUS, BATCH
  • Prevents resource contention on LLM
  • Tracks request status: pending → processing → completed/failed

ollama_worker.py - Queue Worker Daemon

  • Processes queue requests serially
  • Runs as systemd service ollama-queue-worker.service
  • Handles timeouts and failures gracefully

conversation.py - Interactive Discussion

  • Implements macha-approve discuss feature
  • Enables Q&A about proposed actions
  • Maintains context during discussion
  • Helps users understand AI reasoning

config_parser.py - Configuration File Parsing

  • Parses NixOS configuration files from git repository
  • Indexes configurations in ChromaDB for RAG
  • Categorizes by type: apps, systems, osconfigs, users

git_context.py - Git Analysis

  • Tracks recent configuration changes
  • Provides context when debugging after deployments
  • Correlates config changes with system issues

journal_monitor.py - Journal Log Monitoring

  • Monitors systemd journal for specific patterns
  • Triggers on error conditions
  • Feeds into auto-discovery system

seed_knowledge.py - Knowledge Seeding

  • Populates initial operational knowledge
  • Loads foundational patterns and commands
  • Run via macha-knowledge seed

chat.py - Interactive Chat Interface

  • Implements macha-chat and macha-ask commands
  • Manages conversation state
  • Integrates with queue system for LLM requests

module.nix - NixOS Module

  • Defines all configuration options
  • Creates systemd services (macha-autonomous, ollama-queue-worker, chromadb)
  • Sets up users, permissions, state directories
  • Provides all CLI tool wrappers

Philosophy & Principles

  1. KISS (Keep It Simple, Stupid): Use existing NixOS options, avoid custom wrappers
  2. Verify first: Check source code/documentation before acting
  3. Safety first: Never break critical services, always require approval for risky changes
  4. Learn continuously: Extract and store operational knowledge
  5. Multi-host awareness: Macha manages the entire infrastructure, not just herself
  6. User-friendly: Clear communication, appropriate notifications
  7. Patience: Long-running operations (builds, repairs) can take an hour or more - don't panic
  8. Tool reuse: Use existing, verified tools instead of writing custom scripts

Future Capabilities (Not Yet Implemented)

  • Automatic security updates across all hosts
  • Predictive failure detection
  • Resource optimization recommendations
  • Integration with other communication platforms
  • Multi-agent coordination between hosts
  • Automated testing before deployment