Enhance DESIGN.md with new features and clarifications
- Added USB device detection and monitoring capabilities.
- Updated SSH usage patterns to reflect dynamic host configurations.
- Introduced automatic system discovery from journal logs, including OS detection and system profiling.
- Enhanced configuration file intelligence with semantic search and categorization.
- Expanded knowledge base structure and automatic learning processes.
- Clarified the architecture and key modules for better understanding of system components.
253
DESIGN.md
@@ -20,6 +20,7 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
- Execute system commands (with safety restrictions)
- Monitor and repair Nix store corruption
- Hardware awareness (CPU, GPU, network, storage)
- USB device detection and monitoring

### 2. Multi-Host Management via SSH
@@ -30,20 +31,21 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
- Always uses explicit SSH key path: `-i /var/lib/macha/.ssh/id_ed25519`
- All SSH commands automatically include the `-i` flag with absolute key path
- Remote commands always prefixed with `sudo`
- Runs as `macha` user (UID 2501)
- Runs as `macha` user (UID 2501) for standard operations
- **Note**: Some internal operations (like remote monitoring) may use `root` SSH for privileged access
- **DO NOT DUPLICATE these patterns elsewhere** - import from `command_patterns.py`
- Has `NOPASSWD` sudo access for administrative commands
- Shares SSH keys with other hosts in the infrastructure
- Can SSH to: `rhiannon`, `alexander`, `UCAR-Kinston`, and others in the flake
- Can SSH to any hosts defined in your NixOS flake configuration

#### SSH Usage Patterns
1. **Direct diagnostic commands:**
   ```bash
   ssh rhiannon systemctl status ollama
   ssh alexander df -h
   ssh hostname systemctl status service-name
   ssh hostname df -h
   ```
   - Commands automatically transformed by the tools layer
   - Full command: `ssh -i /var/lib/macha/.ssh/id_ed25519 -o StrictHostKeyChecking=no macha@rhiannon sudo systemctl status ollama`
   - Full command: `ssh -i /var/lib/macha/.ssh/id_ed25519 -o StrictHostKeyChecking=no macha@hostname sudo systemctl status service-name`
   - SSH key path is always explicit, commands are automatically prefixed with `sudo`

2. **Status checks:**
@@ -59,11 +61,34 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
#### When to use SSH vs nh
- **SSH**: For diagnostics, status checks, log review, quick commands
- **nh remote deployment**: For applying NixOS configuration changes
  - `nh os switch -u --target-host=rhiannon --hostname=rhiannon`
  - `nh os switch -u --target-host=hostname --hostname=hostname`
  - Builds locally, deploys to remote host
  - Use for permanent configuration changes

### 3. NixOS Configuration Management
### 3. Automatic System Discovery

#### Discovery from Journal Logs
- Monitors `systemd-journal-remote` logs for new systems sending logs
- Parses `_HOSTNAME` field from remote journal entries
- Automatically converts short hostnames to FQDNs (`.coven.systems`)
- Discovers systems within configurable time windows (default: 10 minutes)
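
The discovery rules above can be sketched roughly as follows. This is an illustration only, not the code in `system_discovery.py`: the names `to_fqdn` and `discover_hosts` are hypothetical, and the sketch assumes entries shaped like `journalctl -o json` output, where `__REALTIME_TIMESTAMP` holds microseconds since the epoch.

```python
from datetime import datetime, timedelta, timezone

DOMAIN_SUFFIX = ".coven.systems"        # suffix named in the design
DEFAULT_WINDOW = timedelta(minutes=10)  # default discovery window


def to_fqdn(hostname: str) -> str:
    """Append the domain suffix to short hostnames; leave dotted names alone."""
    hostname = hostname.strip()
    return hostname if "." in hostname else hostname + DOMAIN_SUFFIX


def discover_hosts(entries, known, now=None, window=DEFAULT_WINDOW):
    """Return sorted FQDNs seen within `window` that are not already in `known`.

    `entries` is an iterable of dicts (hypothetical input shape, see lead-in).
    """
    now = now or datetime.now(timezone.utc)
    found = set()
    for entry in entries:
        host = entry.get("_HOSTNAME")
        stamp = entry.get("__REALTIME_TIMESTAMP")
        if not host or stamp is None:
            continue  # skip entries without the fields we need
        seen_at = datetime.fromtimestamp(int(stamp) / 1_000_000, tz=timezone.utc)
        if now - seen_at <= window:
            fqdn = to_fqdn(host)
            if fqdn not in known:
                found.add(fqdn)
    return sorted(found)
```

Passing `now` explicitly keeps the window check deterministic, which makes the function easy to test against recorded journal entries.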

#### OS Detection
- Detects operating system via SSH (`/etc/os-release`, `uname`)
- Supports: NixOS, Ubuntu, Debian, Arch, Fedora, RHEL, Alpine, macOS, FreeBSD
- Falls back to generic "linux" if specific distro cannot be determined
- Records OS type in system registry for appropriate management strategies
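
A minimal sketch of the detection logic described above, assuming the remote output of `cat /etc/os-release` and `uname -s` has already been fetched over SSH. The function names and the use of `ID_LIKE` as a secondary hint are assumptions for illustration, not the project's actual implementation.

```python
# Distribution IDs as they appear in the ID= field of /etc/os-release.
KNOWN_IDS = {"nixos", "ubuntu", "debian", "arch", "fedora", "rhel", "alpine", "freebsd"}


def parse_os_release(text: str) -> dict:
    """Parse KEY=value lines from os-release content, stripping quotes."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip("\"'")
    return fields


def detect_os(os_release_text: str, uname_s: str = "") -> str:
    """Return a known OS id, or the generic "linux" fallback."""
    fields = parse_os_release(os_release_text)
    # Assumption: try ID first, then the space-separated ID_LIKE hints.
    candidates = [fields.get("ID", "")] + fields.get("ID_LIKE", "").split()
    for candidate in candidates:
        if candidate.lower() in KNOWN_IDS:
            return candidate.lower()
    kernel = uname_s.strip().lower()
    if kernel == "darwin":  # macOS has no /etc/os-release
        return "macos"
    if kernel == "freebsd":
        return "freebsd"
    return "linux"
```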

#### System Profiling
- Automatically gathers system information upon discovery:
  - Running services (via `systemctl list-units`)
  - Hardware info (CPU cores, memory)
  - Capabilities (containers, web-server, database, remote-access)
  - System role determination (ai-workstation, server, workstation, minimal)
- Registers discovered systems in ChromaDB context database
- Sends notification when new systems are discovered

### 4. NixOS Configuration Management

#### Local Changes
- Can propose changes to NixOS configuration
@@ -77,7 +102,28 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
- Can take up to 1 hour for complex builds
- **IMPORTANT**: Be patient with long-running builds, don't retry prematurely

### 4. Hardware Awareness
### 5. Configuration File Intelligence (RAG)

#### Semantic Configuration Search
- NixOS configuration files indexed in ChromaDB for semantic search
- Query configurations by natural language: "gotify configuration", "journald settings"
- CLI tools: `macha-configs <query>` and `macha-configs-read <path>`
- Relevance scoring helps find the right config files quickly
- Supports filtering by system hostname or category

#### Configuration Categories
- `apps/` - Application configurations (Gotify, Nextcloud, etc.)
- `systems/` - Per-host system configurations
- `osconfigs/` - Operating system level settings
- `users/` - User management configurations
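
One simple way to derive these categories is from the directory names in a file's repository-relative path. The sketch below assumes that approach; `categorize` is a hypothetical helper, and `config_parser.py` may work differently.

```python
from pathlib import PurePosixPath
from typing import Optional

CATEGORIES = ("apps", "systems", "osconfigs", "users")


def categorize(path: str) -> Optional[str]:
    """Return the first path component that names a known category, else None."""
    for part in PurePosixPath(path).parts:
        if part in CATEGORIES:
            return part
    return None
```

The returned label could then be stored as metadata next to each indexed file, which is what would make filtering by category possible at query time.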

#### Git Context Analysis
- Tracks recent changes to configuration files via `git_context.py`
- Correlates config changes with system behavior
- Provides context when investigating issues after deployments
- Helps understand "what changed" when debugging

### 6. Hardware Awareness

#### Local Hardware Detection
- CPU: `lscpu` via `nix-shell -p util-linux`
@@ -92,7 +138,7 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
- Fallback: `sensors` for temperature data
- Queries: temperature, utilization, clock speeds, power usage

### 5. Ollama Queue System
### 7. Ollama Queue System

#### Architecture
- **File-based queue**: `/var/lib/macha/queues/ollama/`
@@ -117,20 +163,32 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
- Generate chunk summaries → meta-summary
- Full outputs cached in `/var/lib/macha/tool_cache/`

### 6. Knowledge Base & Learning
### 8. Knowledge Base & Learning

#### ChromaDB Architecture
- **Service**: ChromaDB runs as a standalone service on port 8000
- **Storage**: Data persisted at `/var/lib/chromadb`
- **Frontend**: `context_db.py` provides structured Python interface to ChromaDB
- **Connection**: HTTP client to `localhost:8000`

#### ChromaDB Collections
1. **System Context**: Infrastructure topology, service relationships
2. **Issues**: Historical problems and resolutions
3. **Knowledge**: Operational wisdom learned from experience
1. **systems**: Infrastructure topology, registered hosts, OS types
2. **relationships**: System dependencies and relationships
3. **issues**: Historical problems and resolutions
4. **decisions**: AI decisions and outcomes
5. **config_files**: NixOS configuration files for RAG
6. **knowledge**: Operational wisdom learned from experience

#### Automatic Learning
- After successful operations, Macha reflects and extracts key learnings
- Stores: topic, knowledge content, category
#### Automatic Learning & Reflection
- After successful operations, Macha automatically reflects via `reflect_and_learn()`
- Extracts 1-2 specific, actionable learnings from each successful operation
- Stores: topic, knowledge content, category, confidence level
- Categories: command, pattern, troubleshooting, performance, general
- Retrieved automatically when relevant to current tasks
- Use `macha-knowledge` CLI to view/manage
- Use `macha-knowledge` CLI to view/manage/search
- Use `seed_knowledge.py` to populate initial operational knowledge
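
The stored fields and the 1-2 learnings limit listed above could be modeled as shown below. This is a sketch under assumptions: the `Learning` class, the `keep_learnings` helper, and the low/medium/high confidence scale are all hypothetical, since the design does not specify how confidence is represented.

```python
from dataclasses import dataclass

CATEGORIES = {"command", "pattern", "troubleshooting", "performance", "general"}
CONFIDENCE_LEVELS = {"low", "medium", "high"}  # assumed scale, not from the design


@dataclass(frozen=True)
class Learning:
    topic: str
    knowledge: str
    category: str = "general"
    confidence: str = "medium"

    def __post_init__(self):
        # Reject values outside the documented categories / assumed scale.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence: {self.confidence!r}")


def keep_learnings(candidates, limit=2):
    """Drop entries with blank content and keep at most `limit` of the rest."""
    usable = [c for c in candidates if c.topic.strip() and c.knowledge.strip()]
    return usable[:limit]
```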

### 7. Notifications
### 9. Notifications

#### Gotify Integration
- Can send notifications via `macha-notify` command
@@ -146,9 +204,10 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
- Successful completion of major operations
- Configuration changes that may affect users
- Security-related events
- New system discoveries
- When explicitly requested by user

### 8. Safety & Constraints
### 10. Safety & Constraints

#### Command Restrictions
**Allowed Commands** (see `tools.py` for full list):
@@ -180,7 +239,14 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
- Disabling any service
- Changes to multiple hosts

### 9. Nix Store Maintenance
#### Interactive Discussion
- `macha-approve discuss <N>` enables interactive Q&A about proposed actions
- Implemented via `conversation.py` module
- Users can ask follow-up questions before approving/rejecting
- Provides detailed explanations and reasoning
- Commands: `approve`, `reject`, `exit` to control flow

### 11. Nix Store Maintenance

#### Verification & Repair
- Command: `nix-store --verify --check-contents --repair`
@@ -194,7 +260,7 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
- Can be triggered manually with approval
- Frees disk space by removing unused derivations

### 10. Conversational Behavior
### 12. Conversational Behavior

#### Distinguish Requests from Acknowledgments
- "Thanks" / "Thank you" → Acknowledgment (don't re-execute)
@@ -214,17 +280,16 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma

## Infrastructure Topology

### Hosts in Flake
- **macha**: Main autonomous system (self), GPU server
- **rhiannon**: Production server
- **alexander**: Production server
- **UCAR-Kinston**: Work laptop
- **test-vm**: Testing environment
### Managed Hosts
- **Self**: Main autonomous system running Macha
- **Configured hosts**: Systems defined in your NixOS flake
- **Auto-discovered hosts**: Additional systems detected via journal logs

### Shared Configuration
- All hosts share root SSH keys (for `nh` remote deployment)
- `macha` user (UID 2501) exists on all hosts
- `macha` user (UID 2501) exists on all managed hosts
- Common NixOS configuration via flake
- Multi-OS support: NixOS, Ubuntu, Debian, Arch, macOS, and others

## Service Ecosystem
@@ -232,14 +297,14 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
- `ollama.service`: LLM inference engine
- `ollama-queue-worker.service`: Request serialization
- `macha-autonomous.service`: Autonomous monitoring loop
- Servarr stack: Sonarr, Radarr, Prowlarr, Lidarr, Readarr, Whisparr
- Media: Transmission, SABnzbd, Calibre
- `chromadb.service`: Vector database for context and knowledge (port 8000)

### State Directories
- `/var/lib/macha/`: Main state directory (0755, macha:macha)
- `/var/lib/macha/queues/`: Queue directories (0777 for multi-user)
- `/var/lib/macha/tool_cache/`: Cached tool outputs (0777)
- `/var/lib/macha/system_context.db`: ChromaDB database
- `/var/lib/macha/logs/`: Log files and closed issues archive
- `/var/lib/chromadb/`: ChromaDB vector database storage

## CLI Tools
@@ -247,12 +312,138 @@ Macha is an AI-powered autonomous system administrator capable of monitoring, ma
- `macha-ask`: Single-question interface
- `macha-check`: Trigger immediate health check
- `macha-approve`: Approve pending actions
  - `macha-approve list` - Show pending actions
  - `macha-approve discuss <N>` - Interactive Q&A about action N
  - `macha-approve approve <N>` - Approve action N
  - `macha-approve reject <N>` - Reject action N
- `macha-logs`: View autonomous service logs
- `macha-issues`: Query issue database
- `macha-knowledge`: Query knowledge base
- `macha-systems`: List managed systems
- `macha-configs`: Semantic search for configuration files
- `macha-configs-read`: Read full configuration file content
- `macha-notify`: Send Gotify notification
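
For readers who want to see how the `macha-approve` subcommand layout above maps onto a parser, here is a minimal sketch using the standard-library `argparse` module. Only the subcommand names come from this document; the `build_parser` function and the `index` attribute name are assumptions, and the real tool is a wrapper provided by `module.nix`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented `macha-approve` subcommands."""
    parser = argparse.ArgumentParser(prog="macha-approve")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("list", help="Show pending actions")
    for name, help_text in (
        ("discuss", "Interactive Q&A about action N"),
        ("approve", "Approve action N"),
        ("reject", "Reject action N"),
    ):
        cmd = sub.add_parser(name, help=help_text)
        # N is the position of the action in the pending list.
        cmd.add_argument("index", type=int, metavar="N")
    return parser
```

Calling `build_parser().parse_args(["discuss", "3"])` yields a namespace with `command="discuss"` and `index=3`, which a dispatcher could then route to the matching handler.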

## Architecture & Key Modules

### Core Modules

#### `agent.py` - AI Agent
- Interfaces with Ollama LLM for reasoning
- Implements tool calling with `tools.py`
- Manages conversation history and context
- Automatic learning via `reflect_and_learn()`
- Supports queue-based and direct API modes

#### `orchestrator.py` - Main Control Loop
- Continuous monitoring and health checks
- Coordinates all other components
- Manages check intervals and autonomy levels
- Initializes system registry and configuration parsing
- Handles system discovery and registration

#### `executor.py` - Safe Action Execution
- Manages approval queue for actions
- Respects autonomy levels (observe, suggest, auto-safe, auto-full)
- Executes approved actions with safety checks
- Logs all actions and outcomes
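
The four autonomy levels suggest a small decision function. The document names the levels but does not define them, so the mapping below is an assumed interpretation for illustration only: `observe` only logs, `suggest` queues everything for approval, `auto-safe` runs actions classified as safe and queues the rest, and `auto-full` runs everything. The `Decision` enum and `decide` function are hypothetical.

```python
from enum import Enum


class Autonomy(Enum):
    OBSERVE = "observe"
    SUGGEST = "suggest"
    AUTO_SAFE = "auto-safe"
    AUTO_FULL = "auto-full"


class Decision(Enum):
    LOG_ONLY = "log-only"
    QUEUE = "queue-for-approval"
    EXECUTE = "execute"


def decide(level: Autonomy, action_is_safe: bool) -> Decision:
    """Map an autonomy level and a safety classification to a decision.

    The semantics here are assumptions, not taken from the design document.
    """
    if level is Autonomy.OBSERVE:
        return Decision.LOG_ONLY
    if level is Autonomy.SUGGEST:
        return Decision.QUEUE
    if level is Autonomy.AUTO_SAFE:
        return Decision.EXECUTE if action_is_safe else Decision.QUEUE
    return Decision.EXECUTE
```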

#### `tools.py` - System Administration Tools
- Defines all available tools for the AI agent
- Command allow-list for safe mode
- Executes system commands, reads files, checks services
- Implements hardware queries and GPU metrics
- Integrates with `command_patterns.py` for SSH

#### `command_patterns.py` - SSH Command Patterns
- **SINGLE SOURCE OF TRUTH** for SSH commands
- Builds SSH commands with correct key paths and options
- Handles automatic sudo prefixing for remote commands
- Provides `build_ssh_command()` and `build_scp_command()`
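
To illustrate the transformation from `ssh hostname systemctl status service-name` to the full command shown earlier in this document, here is a hedged sketch. The function name `build_ssh_command()` appears in the design, but its real signature is not documented; the parameters below (`host`, `command`, `user`, `use_sudo`) are assumptions.

```python
import shlex

SSH_KEY = "/var/lib/macha/.ssh/id_ed25519"  # key path from the design


def build_ssh_command(host, command, user="macha", use_sudo=True):
    """Return an argv list for running `command` on `host` over SSH."""
    remote = shlex.split(command)
    # Prefix sudo once; do not double it if the caller already added it.
    if use_sudo and remote[:1] != ["sudo"]:
        remote = ["sudo", *remote]
    return [
        "ssh",
        "-i", SSH_KEY,
        "-o", "StrictHostKeyChecking=no",
        f"{user}@{host}",
        *remote,
    ]
```

Returning an argv list rather than a shell string lets the caller pass it straight to `subprocess.run` without a local shell. Note that `ssh` joins the remote arguments with spaces, so remote arguments containing spaces would need extra quoting that this sketch does not attempt.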

#### `context_db.py` - ChromaDB Frontend
- Structured interface to ChromaDB vector database
- Manages 6 collections: systems, relationships, issues, decisions, config_files, knowledge
- Implements semantic search for configurations and knowledge
- Tracks system relationships and dependencies

#### `monitor.py` - Local System Monitoring
- Collects system health metrics (CPU, memory, disk)
- Checks systemd services and recent errors
- Monitors NixOS generations and Nix store size
- Generates human-readable summaries

#### `remote_monitor.py` - Remote System Monitoring
- SSH-based monitoring of remote hosts
- Collects resources, services, disk, network status
- Verifies connectivity before operations
- Uses `command_patterns.py` for SSH access

#### `system_discovery.py` - Auto-Discovery
- Discovers new systems from journal logs
- Detects OS types via SSH probing
- Profiles systems: services, hardware, capabilities
- Determines system roles automatically

#### `issue_tracker.py` - Issue Management
- Creates, updates, resolves, and closes issues
- Finds similar past issues
- Auto-resolves issues when problems disappear
- Archives closed issues to JSONL logs
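
Archiving to JSONL means appending one JSON object per line, so the archive can grow without rewriting earlier records. A minimal sketch, assuming issues are plain dictionaries; `archive_issue`, `read_archive`, and the `archived_at` field are hypothetical names, not taken from `issue_tracker.py`.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def archive_issue(issue: dict, archive_path: Path) -> None:
    """Append one closed issue to the archive as a single JSON line."""
    record = dict(issue, archived_at=datetime.now(timezone.utc).isoformat())
    archive_path.parent.mkdir(parents=True, exist_ok=True)
    with archive_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")


def read_archive(archive_path: Path) -> list:
    """Load every archived issue, skipping blank lines."""
    with archive_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```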

#### `notifier.py` - Gotify Integration
- Sends notifications at appropriate priority levels
- Special methods for common events (failures, discoveries, actions)
- Fails gracefully if Gotify unavailable

#### `ollama_queue.py` - Request Serialization
- File-based queue at `/var/lib/macha/queues/ollama/`
- Three priority levels: INTERACTIVE, AUTONOMOUS, BATCH
- Prevents resource contention on LLM
- Tracks request status: pending → processing → completed/failed
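
A file-based priority queue of this kind can be sketched with one directory per status and file names that sort by priority, then by submission time. Everything beyond the three priority names and the pending/processing states is an assumption here: the subdirectory layout, the file naming scheme, the numeric priority values, and the helper names `enqueue` and `claim_next`.

```python
import json
import time
import uuid
from enum import IntEnum
from pathlib import Path


class Priority(IntEnum):
    INTERACTIVE = 0  # assumed to be served first
    AUTONOMOUS = 1
    BATCH = 2


def enqueue(queue_dir: Path, payload: dict, priority: Priority) -> Path:
    """Write a request into pending/ under a name that sorts by priority, then time."""
    pending = queue_dir / "pending"
    pending.mkdir(parents=True, exist_ok=True)
    name = f"{int(priority)}-{time.time_ns():020d}-{uuid.uuid4().hex}.json"
    tmp = pending / (name + ".tmp")
    tmp.write_text(json.dumps(payload), encoding="utf-8")
    final = pending / name
    tmp.rename(final)  # publish only once the file is fully written
    return final


def claim_next(queue_dir: Path):
    """Move the highest-priority pending request to processing/ and return its path."""
    pending = queue_dir / "pending"
    processing = queue_dir / "processing"
    pending.mkdir(parents=True, exist_ok=True)
    processing.mkdir(parents=True, exist_ok=True)
    for path in sorted(pending.glob("*.json")):
        target = processing / path.name
        try:
            path.rename(target)  # the rename acts as the claim
        except FileNotFoundError:
            continue  # another worker claimed it first
        return target
    return None
```

Because a single worker processes requests serially, the rename-based claim is more than strictly needed, but it keeps the sketch safe if a second worker is ever started by mistake.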

#### `ollama_worker.py` - Queue Worker Daemon
- Processes queue requests serially
- Runs as systemd service `ollama-queue-worker.service`
- Handles timeouts and failures gracefully

#### `conversation.py` - Interactive Discussion
- Implements `macha-approve discuss` feature
- Enables Q&A about proposed actions
- Maintains context during discussion
- Helps users understand AI reasoning

#### `config_parser.py` - Configuration File Parsing
- Parses NixOS configuration files from git repository
- Indexes configurations in ChromaDB for RAG
- Categorizes by type: apps, systems, osconfigs, users

#### `git_context.py` - Git Analysis
- Tracks recent configuration changes
- Provides context when debugging after deployments
- Correlates config changes with system issues

#### `journal_monitor.py` - Journal Log Monitoring
- Monitors systemd journal for specific patterns
- Triggers on error conditions
- Feeds into auto-discovery system

#### `seed_knowledge.py` - Knowledge Seeding
- Populates initial operational knowledge
- Loads foundational patterns and commands
- Run via `macha-knowledge seed`

#### `chat.py` - Interactive Chat Interface
- Implements `macha-chat` and `macha-ask` commands
- Manages conversation state
- Integrates with queue system for LLM requests

#### `module.nix` - NixOS Module
- Defines all configuration options
- Creates systemd services (macha-autonomous, ollama-queue-worker, chromadb)
- Sets up users, permissions, state directories
- Provides all CLI tool wrappers

## Philosophy & Principles

1. **KISS (Keep It Simple, Stupid)**: Use existing NixOS options, avoid custom wrappers