Initial commit: Split Macha autonomous system into separate flake

Macha is now a standalone NixOS flake that can be imported into other
systems. This provides:

- Independent versioning
- Easier reusability
- Cleaner separation of concerns
- Better development workflow

Includes:
- Complete autonomous system code
- NixOS module with full configuration options
- Queue-based architecture with priority system
- Chunked map-reduce for large outputs
- ChromaDB knowledge base
- Tool calling system
- Multi-host SSH management
- Gotify notification integration

All capabilities from DESIGN.md are preserved.
Author: Lily Miller
Date: 2025-10-06 14:32:37 -06:00
Commit: 22ba493d9e
30 changed files with 10306 additions and 0 deletions

.gitignore (vendored, new file, 23 lines)

@@ -0,0 +1,23 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
*.egg-info/
dist/
build/
# IDE
.vscode/
.idea/
*.swp
*.swo
# Nix
result
result-*
# Test data
test_*.db
*.log

DESIGN.md (new file, 269 lines)

@@ -0,0 +1,269 @@
# Macha Autonomous System - Design Document
> **⚠️ IMPORTANT - READ THIS FIRST**
> **FOR AI ASSISTANT**: This document is YOUR reference guide when modifying Macha's code.
> - **ALWAYS consult this BEFORE refactoring** to ensure you don't remove existing capabilities
> - **CHECK this when adding features** to avoid conflicts
> - **UPDATE this document** when new capabilities are added
> - **DO NOT DELETE ANYTHING FROM THIS DOCUMENT**
> - During major refactors, you MUST verify each capability listed here is preserved
## Overview
Macha is an AI-powered autonomous system administrator capable of monitoring, maintaining, and managing multiple NixOS hosts in the infrastructure.
## Core Capabilities
### 1. Local System Management
- Monitor system health (CPU, memory, disk, services)
- Read and analyze logs via `journalctl`
- Check service status and restart failed services
- Execute system commands (with safety restrictions)
- Monitor and repair Nix store corruption
- Hardware awareness (CPU, GPU, network, storage)
### 2. Multi-Host Management via SSH
**Macha CAN and SHOULD use SSH to manage other hosts.**
#### SSH Access
- Runs as `macha` user (UID 2501)
- Has `NOPASSWD` sudo access for administrative commands
- Shares SSH keys with other hosts in the infrastructure
- Can SSH to: `rhiannon`, `alexander`, `UCAR-Kinston`, and others in the flake
#### SSH Usage Patterns
1. **Direct diagnostic commands:**
```bash
ssh rhiannon systemctl status ollama
ssh alexander df -h
```
- Commands automatically prefixed with `sudo` by the tools layer
- Full command: `ssh macha@rhiannon sudo systemctl status ollama`
2. **Status checks:**
- Check service health on remote hosts
- Gather system metrics
- Review logs
- Monitor resource usage
3. **File operations:**
- Use `scp` to copy files between hosts
- Read configuration files on remote systems
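A minimal sketch of how the tools layer might assemble these remote commands (the `run_remote` helper and its defaults are illustrative; `tools.py` holds the real implementation):
```python
import subprocess

def run_remote(host: str, command: str, user: str = "macha", timeout: int = 60) -> str:
    """Run a diagnostic command on a remote host, prefixing sudo automatically."""
    # run_remote("rhiannon", "systemctl status ollama") becomes:
    #   ssh macha@rhiannon sudo systemctl status ollama
    result = subprocess.run(
        ["ssh", f"{user}@{host}", f"sudo {command}"],
        capture_output=True, text=True, timeout=timeout,
    )
    result.check_returncode()  # surface remote failures to the caller
    return result.stdout
```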
#### When to use SSH vs nh
- **SSH**: For diagnostics, status checks, log review, quick commands
- **nh remote deployment**: For applying NixOS configuration changes
- `nh os switch -u --target-host=rhiannon --hostname=rhiannon`
- Builds locally, deploys to remote host
- Use for permanent configuration changes
### 3. NixOS Configuration Management
#### Local Changes
- Can propose changes to NixOS configuration
- Requires human approval before applying
- Uses `nh os switch` for local updates
#### Remote Deployment
- Can deploy to other hosts using `nh` with `--target-host`
- Builds configuration locally (on Macha)
- Pushes to remote system
- Can take up to 1 hour for complex builds
- **IMPORTANT**: Be patient with long-running builds, don't retry prematurely
### 4. Hardware Awareness
#### Local Hardware Detection
- CPU: `lscpu` via `nix-shell -p util-linux`
- GPU: `lspci` via `nix-shell -p pciutils`
- Network: `ip addr`, `ss`
- Storage: `df -h`, `lsblk`
- USB devices: `lsusb`
#### GPU Metrics
- AMD GPUs: Try `rocm-smi`, sysfs (`/sys/class/drm/card*/device/`)
- NVIDIA GPUs: Try `nvidia-smi`
- Fallback: `sensors` for temperature data
- Queries: temperature, utilization, clock speeds, power usage
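This fallback chain reduces to a few subprocess calls; a sketch (the exact flags shown are illustrative, not the agent's actual tool definitions):
```python
import subprocess

def gpu_temperature() -> str:
    """Try vendor tools in order, falling back to generic sensors output."""
    candidates = [
        ["rocm-smi", "--showtemp"],                                              # AMD
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader"],  # NVIDIA
        ["sensors"],                                                             # fallback
    ]
    for cmd in candidates:
        try:
            out = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
        except FileNotFoundError:
            continue  # tool not installed; try the next source
        if out.returncode == 0 and out.stdout.strip():
            return out.stdout.strip()
    return "no GPU temperature source available"
```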
### 5. Ollama Queue System
#### Architecture
- **File-based queue**: `/var/lib/macha/queues/ollama/`
- **Queue worker**: `ollama-queue-worker.service` (runs as `macha` user)
- **Purpose**: Serialize all LLM requests to prevent resource contention
#### Request Flow
1. Any user (including regular users) → Write request to `pending/`
2. Queue worker → Process requests serially (FIFO with priority)
3. Queue worker → Write response to `completed/`
4. Original requester → Read response from `completed/`
#### Priority Levels
- `INTERACTIVE` (0): User requests via `macha-chat`, `macha-ask`
- `AUTONOMOUS` (1): Background maintenance checks
- `BATCH` (2): Low-priority bulk operations
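A sketch of the client side of this flow (file names and request fields are assumptions; the worker's actual schema lives in the queue module):
```python
import json
import time
import uuid
from pathlib import Path

QUEUE = Path("/var/lib/macha/queues/ollama")

def enqueue(prompt: str, priority: int = 1) -> str:
    """Drop a request into pending/ (0=INTERACTIVE, 1=AUTONOMOUS, 2=BATCH)."""
    request_id = uuid.uuid4().hex
    request = {"id": request_id, "priority": priority,
               "created": time.time(), "prompt": prompt}
    (QUEUE / "pending" / f"{request_id}.json").write_text(json.dumps(request))
    return request_id

def wait_for_response(request_id: str, poll: float = 1.0) -> dict:
    """Block until the worker writes the matching file into completed/."""
    path = QUEUE / "completed" / f"{request_id}.json"
    while not path.exists():
        time.sleep(poll)
    return json.loads(path.read_text())
```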
#### Large Output Handling
- Outputs >8KB: Split into chunks for hierarchical processing
- Each chunk ~8KB (~2000 tokens)
- Process chunks serially with progress feedback
- Generate chunk summaries → meta-summary
- Full outputs cached in `/var/lib/macha/tool_cache/`
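A sketch of the map-reduce step, with the queued LLM call abstracted as a `summarize` callable (prompts and chunk math are illustrative):
```python
def summarize_large_output(output: str, summarize, chunk_size: int = 8192) -> str:
    """Map-reduce: summarize ~8KB chunks serially, then summarize the summaries."""
    if len(output) <= chunk_size:
        return output  # small enough to pass through directly
    chunks = [output[i:i + chunk_size] for i in range(0, len(output), chunk_size)]
    partials = []
    for n, chunk in enumerate(chunks, 1):
        print(f"summarizing chunk {n}/{len(chunks)}")  # progress feedback
        partials.append(summarize(f"Summarize this log fragment:\n{chunk}"))
    # Reduce: combine the chunk summaries into one meta-summary
    return summarize("Combine these partial summaries:\n" + "\n".join(partials))
```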
### 6. Knowledge Base & Learning
#### ChromaDB Collections
1. **System Context**: Infrastructure topology, service relationships
2. **Issues**: Historical problems and resolutions
3. **Knowledge**: Operational wisdom learned from experience
#### Automatic Learning
- After successful operations, Macha reflects and extracts key learnings
- Stores: topic, knowledge content, category
- Retrieved automatically when relevant to current tasks
- Use `macha-knowledge` CLI to view/manage
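A sketch of the storage/retrieval pattern using the chromadb client (collection and document contents are illustrative):
```python
import chromadb

client = chromadb.PersistentClient(path="/var/lib/macha/system_context.db")
knowledge = client.get_or_create_collection("knowledge")

# Store a learning extracted after a successful operation
knowledge.add(
    ids=["ollama-restart-2025-10-01"],
    documents=["Restarting ollama.service safely recovers failed inference."],
    metadatas=[{"topic": "ollama", "category": "services"}],
)

# Retrieve the most relevant learnings for the current task
hits = knowledge.query(query_texts=["ollama service failed"], n_results=3)
print(hits["documents"])
```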
### 7. Notifications
#### Gotify Integration
- Can send notifications via `macha-notify` command
- Tool: `send_notification(title, message, priority)`
#### Priority Levels
- `2` (Low/Info): Routine status updates, completed tasks
- `5` (Medium/Attention): Important events, configuration changes
- `8` (High/Critical): Service failures, critical errors, security issues
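Under the hood this maps onto Gotify's standard message endpoint; a sketch (URL and token are placeholders):
```python
import requests

def send_notification(title: str, message: str, priority: int = 5,
                      url: str = "http://rhiannon:8181", token: str = "TOKEN") -> None:
    """Post a message to Gotify; priority follows the levels above (2/5/8)."""
    requests.post(
        f"{url}/message",
        params={"token": token},
        json={"title": title, "message": message, "priority": priority},
        timeout=10,
    )
```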
#### When to Notify
- Critical service failures
- Successful completion of major operations
- Configuration changes that may affect users
- Security-related events
- When explicitly requested by user
### 8. Safety & Constraints
#### Command Restrictions
**Allowed Commands** (see `tools.py` for full list):
- System management: `systemctl`, `journalctl`, `nh`, `nixos-rebuild`
- Monitoring: `free`, `df`, `uptime`, `ps`, `top`, `ip`, `ss`
- Hardware: `lscpu`, `lspci`, `lsblk`, `lshw`, `dmidecode`
- Remote: `ssh`, `scp`
- Power: `reboot`, `shutdown`, `poweroff` (use cautiously!)
- File ops: `cat`, `ls`, `grep`
- Network: `ping`, `dig`, `nslookup`, `curl`, `wget`
- Logging: `logger`
**NOT Allowed**:
- Direct package modifications (`nix-env`, `nix profile`)
- Destructive file operations (`rm -rf`, `dd`)
- User management outside of NixOS config
- Direct editing of system files (use NixOS config instead)
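A sketch of how such an allowlist check can work (the set below is abbreviated; `tools.py` holds the authoritative list):
```python
import shlex

ALLOWED = {"systemctl", "journalctl", "free", "df", "uptime", "ps",
           "ip", "ss", "lscpu", "lspci", "lsblk", "ssh", "scp",
           "cat", "ls", "grep", "ping", "dig", "curl", "logger"}

def is_allowed(command: str) -> bool:
    """Accept a command only if its first word is on the allowlist."""
    try:
        argv = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes etc.
    return bool(argv) and argv[0] in ALLOWED
```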
#### Critical Services
**Never disable or stop:**
- SSH (network access)
- Networking (connectivity)
- systemd (system management)
- Boot-related services
#### Approval Required
- Reboots or system power changes
- Major configuration changes
- Disabling any service
- Changes to multiple hosts
### 9. Nix Store Maintenance
#### Verification & Repair
- Command: `nix-store --verify --check-contents --repair`
- **WARNING**: Can take 30+ minutes to several hours
- Only use when corruption is suspected
- Not for routine maintenance
- Verifies all store paths, repairs corrupted files
#### Garbage Collection
- Automatic via system configuration
- Can be triggered manually with approval
- Frees disk space by removing unused derivations
### 10. Conversational Behavior
#### Distinguish Requests from Acknowledgments
- "Thanks" / "Thank you" → Acknowledgment (don't re-execute)
- "Can you..." / "Please..." → Request (execute)
- "What is..." / "How do..." → Question (answer)
#### Tool Calling
- Don't repeat tool calls unnecessarily
- If a tool succeeds, don't run it again unless asked
- Use cached results when available (`retrieve_cached_output`)
#### Context Management
- Be aware of token limits
- Use hierarchical processing for large outputs
- Prune conversation history intelligently
- Cache and summarize when needed
## Infrastructure Topology
### Hosts in Flake
- **macha**: Main autonomous system (self), GPU server
- **rhiannon**: Production server
- **alexander**: Production server
- **UCAR-Kinston**: Work laptop
- **test-vm**: Testing environment
### Shared Configuration
- All hosts share root SSH keys (for `nh` remote deployment)
- `macha` user (UID 2501) exists on all hosts
- Common NixOS configuration via flake
## Service Ecosystem
### Core Services on Macha
- `ollama.service`: LLM inference engine
- `ollama-queue-worker.service`: Request serialization
- `macha-autonomous.service`: Autonomous monitoring loop
- Servarr stack: Sonarr, Radarr, Prowlarr, Lidarr, Readarr, Whisparr
- Media: Transmission, SABnzbd, Calibre
### State Directories
- `/var/lib/macha/`: Main state directory (0755, macha:macha)
- `/var/lib/macha/queues/`: Queue directories (0777 for multi-user)
- `/var/lib/macha/tool_cache/`: Cached tool outputs (0777)
- `/var/lib/macha/system_context.db`: ChromaDB database
## CLI Tools
- `macha-chat`: Interactive chat with tool calling
- `macha-ask`: Single-question interface
- `macha-check`: Trigger immediate health check
- `macha-approve`: Approve pending actions
- `macha-logs`: View autonomous service logs
- `macha-issues`: Query issue database
- `macha-knowledge`: Query knowledge base
- `macha-systems`: List managed systems
- `macha-notify`: Send Gotify notification
## Philosophy & Principles
1. **KISS (Keep It Simple, Stupid)**: Use existing NixOS options, avoid custom wrappers
2. **Verify first**: Check source code/documentation before acting
3. **Safety first**: Never break critical services, always require approval for risky changes
4. **Learn continuously**: Extract and store operational knowledge
5. **Multi-host awareness**: Macha manages the entire infrastructure, not just herself
6. **User-friendly**: Clear communication, appropriate notifications
7. **Patience**: Long-running operations (builds, repairs) can take an hour - don't panic
8. **Tool reuse**: Use existing, verified tools instead of writing custom scripts
## Future Capabilities (Not Yet Implemented)
- [ ] Automatic security updates across all hosts
- [ ] Predictive failure detection
- [ ] Resource optimization recommendations
- [ ] Integration with other communication platforms
- [ ] Multi-agent coordination between hosts
- [ ] Automated testing before deployment

EXAMPLES.md (new file, 275 lines)

@@ -0,0 +1,275 @@
# Macha Autonomous System - Configuration Examples
## Basic Configurations
### Conservative (Recommended for Start)
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "suggest"; # Require approval for all actions
  checkInterval = 300;       # Check every 5 minutes
  model = "llama3.1:70b";    # Most capable model
};
```
### Moderate Autonomy
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "auto-safe"; # Auto-fix safe issues
  checkInterval = 180;         # Check every 3 minutes
  model = "llama3.1:70b";
};
```
### High Autonomy (Experimental)
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "auto-full"; # Full autonomy
  checkInterval = 300;
  model = "llama3.1:70b";
};
```
### Monitoring Only
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "observe"; # No actions, just watch
  checkInterval = 60;        # Check every minute
  model = "qwen3:8b-fp16";   # Lighter model is fine for observation
};
```
## Advanced Scenarios
### Using a Smaller Model (Faster, Less Capable)
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "auto-safe";
  checkInterval = 120;
  model = "qwen3:8b-fp16"; # Faster inference, less reasoning depth
  # or
  # model = "llama3.1:8b"; # Also good for simple tasks
};
```
### High-Frequency Monitoring
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "auto-safe";
  checkInterval = 60; # Check every minute
  model = "qwen3:4b-instruct-2507-fp16"; # Lightweight model
};
```
### Remote Ollama (if running Ollama elsewhere)
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "suggest";
  checkInterval = 300;
  ollamaHost = "http://192.168.1.100:11434"; # Remote Ollama instance
  model = "llama3.1:70b";
};
```
## Manual Testing Workflow
1. **Test with a one-shot run:**
```bash
# Run once in observe mode
macha-check
# Review what it detected
cat /var/lib/macha-autonomous/decisions.jsonl | tail -1 | jq .
```
2. **Enable in suggest mode:**
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "suggest";
  checkInterval = 300;
  model = "llama3.1:70b";
};
```
3. **Rebuild and start:**
```bash
sudo nixos-rebuild switch --flake .#macha
sudo systemctl status macha-autonomous
```
4. **Monitor for a while:**
```bash
# Watch the logs
journalctl -u macha-autonomous -f
# Or use the helper
macha-logs service
```
5. **Review proposed actions:**
```bash
macha-approve list
```
6. **Graduate to auto-safe when comfortable:**
```nix
services.macha-autonomous.autonomyLevel = "auto-safe";
```
## Scenario-Based Examples
### Media Server (Let it auto-restart services)
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "auto-safe"; # Auto-restart failed arr apps
  checkInterval = 180;
  model = "llama3.1:70b";
};
```
### Development Machine (Observe only, you want control)
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "observe";
  checkInterval = 600;   # Check less frequently
  model = "llama3.1:8b"; # Lighter model
};
```
### Critical Production (Suggest only, manual approval)
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "suggest";
  checkInterval = 120;    # More frequent monitoring
  model = "llama3.1:70b"; # Best reasoning
};
```
### Experimental/Learning (Full autonomy)
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "auto-full";
  checkInterval = 300;
  model = "llama3.1:70b";
};
```
## Customizing Behavior
### The config file lives at:
`/etc/macha-autonomous/config.json` (auto-generated from NixOS config)
### To modify the AI prompts:
Edit the Python files in `systems/macha-configs/autonomous/`:
- `agent.py` - AI analysis and decision prompts
- `monitor.py` - What data to collect
- `executor.py` - Safety rules and action execution
- `orchestrator.py` - Main control flow
After editing, rebuild:
```bash
sudo nixos-rebuild switch --flake .#macha
sudo systemctl restart macha-autonomous
```
## Integration with Other Services
### Example: Auto-restart specific services
The system will automatically detect and propose restarting failed services.
### Example: Disk cleanup when space is low
Monitor will detect low disk space, AI will propose cleanup, executor will run `nix-collect-garbage`.
### Example: Log analysis
AI analyzes recent error logs and can propose fixes based on error patterns.
## Debugging
### See what the monitor sees:
```bash
sudo -u macha-autonomous python3 /nix/store/.../monitor.py
```
### Test the AI agent:
```bash
sudo -u macha-autonomous python3 /nix/store/.../agent.py test
```
### View all snapshots:
```bash
ls -lh /var/lib/macha-autonomous/snapshot_*.json
cat "$(ls -t /var/lib/macha-autonomous/snapshot_*.json | head -1)" | jq .
```
### Check approval queue:
```bash
cat /var/lib/macha-autonomous/approval_queue.json | jq .
```
## Performance Tuning
### Model Choice Impact:
| Model | Speed | Capability | RAM Usage | Best For |
|-------|-------|------------|-----------|----------|
| llama3.1:70b | Slow (~30s) | Excellent | ~40GB | Complex reasoning |
| llama3.1:8b | Fast (~3s) | Good | ~5GB | General use |
| qwen3:8b-fp16 | Fast (~2s) | Good | ~16GB | General use |
| qwen3:4b | Very Fast (~1s) | Moderate | ~8GB | Simple tasks |
### Check Interval Impact:
- 60s: High responsiveness, more resource usage
- 300s (default): Good balance
- 600s: Low overhead, slower detection
### Memory Usage:
- Monitor: ~50MB
- Agent (per query): Depends on model (see above)
- Executor: ~30MB
- Orchestrator: ~20MB
Total continuous overhead: ~100MB + model inference when running
## Security Considerations
### The autonomous user has sudo access to:
- `systemctl restart/status` - Restart services
- `journalctl` - Read logs
- `nix-collect-garbage` - Clean up Nix store
### It CANNOT:
- Modify arbitrary files
- Access user home directories (ProtectHome=true)
- Disable protected services (SSH, networking)
- Make changes without logging
### Audit trail:
All actions are logged in `/var/lib/macha-autonomous/actions.jsonl`
### To revoke access:
Set `enable = false` and rebuild, or stop the service.
## Future: MCP Integration
You already have MCP servers installed:
- `mcp-nixos` - NixOS-specific tools
- `gitea-mcp-server` - Git integration
- `emcee` - General MCP orchestration
Future versions could integrate these for:
- Better NixOS config manipulation
- Git-based config versioning
- More sophisticated tooling
Stay tuned!

LOGGING_EXAMPLE.md (new file, 217 lines)

@@ -0,0 +1,217 @@
# Enhanced Logging Example
This shows what the improved journalctl output will look like for Macha's autonomous system.
## Example Output
### Maintenance Cycle Start
```
[2025-10-01T14:30:00] === Starting maintenance cycle ===
[2025-10-01T14:30:00] Collecting system health data...
[2025-10-01T14:30:02] ============================================================
[2025-10-01T14:30:02] SYSTEM HEALTH SUMMARY
[2025-10-01T14:30:02] ============================================================
[2025-10-01T14:30:02] Resources: CPU 25.3%, Memory 45.2%, Load 1.24
[2025-10-01T14:30:02] Disk: 35.6% used (/ partition)
[2025-10-01T14:30:02] Services: 1 failed
[2025-10-01T14:30:02] - ollama.service (failed)
[2025-10-01T14:30:02] Network: Internet reachable
[2025-10-01T14:30:02] Recent logs: 3 errors in last hour
[2025-10-01T14:30:02] ============================================================
[2025-10-01T14:30:02] KEY METRICS:
[2025-10-01T14:30:02] CPU Usage: 25.3%
[2025-10-01T14:30:02] Memory Usage: 45.2%
[2025-10-01T14:30:02] Load Average: 1.24
[2025-10-01T14:30:02] Failed Services: 1
[2025-10-01T14:30:02] Errors (1h): 3
[2025-10-01T14:30:02] Disk /: 35.6% used
[2025-10-01T14:30:02] Disk /home: 62.1% used
[2025-10-01T14:30:02] Disk /var: 28.9% used
[2025-10-01T14:30:02] Internet: ✅ Connected
```
### AI Analysis Section
```
[2025-10-01T14:30:02] Analyzing system state with AI...
[2025-10-01T14:30:35] ============================================================
[2025-10-01T14:30:35] AI ANALYSIS RESULTS
[2025-10-01T14:30:35] ============================================================
[2025-10-01T14:30:35] Overall Status: ATTENTION_NEEDED
[2025-10-01T14:30:35] Assessment: System has one failed service that should be restarted
[2025-10-01T14:30:35] Detected 1 issue(s):
[2025-10-01T14:30:35] Issue #1:
[2025-10-01T14:30:35] Severity: WARNING
[2025-10-01T14:30:35] Category: services
[2025-10-01T14:30:35] Description: ollama.service has failed and needs to be restarted
[2025-10-01T14:30:35] ⚠️ ACTION REQUIRED
[2025-10-01T14:30:35] Recommended Actions (1):
[2025-10-01T14:30:35] - Restart ollama.service to restore LLM functionality
[2025-10-01T14:30:35] ============================================================
```
### Action Handling Section
```
[2025-10-01T14:30:35] Found 1 issues requiring action
[2025-10-01T14:30:35] ────────────────────────────────────────────────────────────
[2025-10-01T14:30:35] Addressing issue: ollama.service has failed and needs to be restarted
[2025-10-01T14:30:35] Requesting AI fix proposal...
[2025-10-01T14:30:45] AI FIX PROPOSAL:
[2025-10-01T14:30:45] Diagnosis: ollama.service crashed or failed to start properly
[2025-10-01T14:30:45] Proposed Action: Restart ollama.service using systemctl
[2025-10-01T14:30:45] Action Type: systemd_restart
[2025-10-01T14:30:45] Risk Level: LOW
[2025-10-01T14:30:45] Commands to execute:
[2025-10-01T14:30:45] - systemctl restart ollama.service
[2025-10-01T14:30:45] Reasoning: Restarting the service is a safe, standard troubleshooting step
[2025-10-01T14:30:45] Rollback Plan: Service will return to failed state if restart doesn't work
[2025-10-01T14:30:45] Executing action...
[2025-10-01T14:30:47] EXECUTION RESULT:
[2025-10-01T14:30:47] Status: QUEUED_FOR_APPROVAL
[2025-10-01T14:30:47] Executed: No
[2025-10-01T14:30:47] Reason: Autonomy level requires manual approval
```
### Cycle Complete Summary
```
[2025-10-01T14:30:47] No issues requiring immediate action
[2025-10-01T14:30:47] ============================================================
[2025-10-01T14:30:47] MAINTENANCE CYCLE COMPLETE
[2025-10-01T14:30:47] ============================================================
[2025-10-01T14:30:47] Status: ATTENTION_NEEDED
[2025-10-01T14:30:47] Issues Found: 1
[2025-10-01T14:30:47] Actions Taken: 1
[2025-10-01T14:30:47] - Executed: 0
[2025-10-01T14:30:47] - Queued for approval: 1
[2025-10-01T14:30:47] Next check in: 300 seconds
[2025-10-01T14:30:47] ============================================================
```
## When System is Healthy
```
[2025-10-01T14:35:00] === Starting maintenance cycle ===
[2025-10-01T14:35:00] Collecting system health data...
[2025-10-01T14:35:02] ============================================================
[2025-10-01T14:35:02] SYSTEM HEALTH SUMMARY
[2025-10-01T14:35:02] ============================================================
[2025-10-01T14:35:02] Resources: CPU 12.5%, Memory 38.1%, Load 0.65
[2025-10-01T14:35:02] Disk: 35.6% used (/ partition)
[2025-10-01T14:35:02] Services: All running
[2025-10-01T14:35:02] Network: Internet reachable
[2025-10-01T14:35:02] Recent logs: 0 errors in last hour
[2025-10-01T14:35:02] ============================================================
[2025-10-01T14:35:02] KEY METRICS:
[2025-10-01T14:35:02] CPU Usage: 12.5%
[2025-10-01T14:35:02] Memory Usage: 38.1%
[2025-10-01T14:35:02] Load Average: 0.65
[2025-10-01T14:35:02] Failed Services: 0
[2025-10-01T14:35:02] Errors (1h): 0
[2025-10-01T14:35:02] Disk /: 35.6% used
[2025-10-01T14:35:02] Internet: ✅ Connected
[2025-10-01T14:35:02] Analyzing system state with AI...
[2025-10-01T14:35:28] ============================================================
[2025-10-01T14:35:28] AI ANALYSIS RESULTS
[2025-10-01T14:35:28] ============================================================
[2025-10-01T14:35:28] Overall Status: HEALTHY
[2025-10-01T14:35:28] Assessment: System is operating normally with no issues detected
[2025-10-01T14:35:28] ✅ No issues detected
[2025-10-01T14:35:28] ============================================================
[2025-10-01T14:35:28] No issues requiring immediate action
[2025-10-01T14:35:28] ============================================================
[2025-10-01T14:35:28] MAINTENANCE CYCLE COMPLETE
[2025-10-01T14:35:28] ============================================================
[2025-10-01T14:35:28] Status: HEALTHY
[2025-10-01T14:35:28] Issues Found: 0
[2025-10-01T14:35:28] Actions Taken: 0
[2025-10-01T14:35:28] Next check in: 300 seconds
[2025-10-01T14:35:28] ============================================================
```
## Viewing Logs
### Follow live logs
```bash
journalctl -u macha-autonomous.service -f
```
### See only AI decisions
```bash
journalctl -u macha-autonomous.service | grep "AI ANALYSIS"
```
### See only execution results
```bash
journalctl -u macha-autonomous.service | grep "EXECUTION RESULT"
```
### See key metrics
```bash
journalctl -u macha-autonomous.service | grep "KEY METRICS" -A 10
```
### Filter by status level
```bash
# Only show intervention required
journalctl -u macha-autonomous.service | grep "INTERVENTION_REQUIRED"
# Only show critical issues
journalctl -u macha-autonomous.service | grep "CRITICAL"
# Only show action required
journalctl -u macha-autonomous.service | grep "ACTION REQUIRED"
```
### Summary of last cycle
```bash
journalctl -u macha-autonomous.service | grep "MAINTENANCE CYCLE COMPLETE" -B 5 | tail -6
```
## Benefits of Enhanced Logging
### 1. **Easy to Scan**
Clear section headers with separators make it easy to find what you need
### 2. **Structured Data**
Key metrics are labeled consistently for easy parsing/grepping
### 3. **Complete Context**
Each cycle shows:
- What the system saw
- What the AI thought
- What action was proposed
- What actually happened
### 4. **AI Transparency**
You can see:
- The AI's reasoning for each decision
- Risk assessment for each action
- Rollback plans if something goes wrong
### 5. **Audit Trail**
Everything is logged to journalctl for long-term storage and analysis
### 6. **Troubleshooting**
If something goes wrong, you have complete context:
- System state before the issue
- AI's diagnosis
- Action attempted
- Result of action

NOTIFICATIONS.md (new file, 224 lines)

@@ -0,0 +1,224 @@
# Gotify Notifications Setup
Macha's autonomous system can now send notifications to Gotify on Rhiannon for critical events.
## What Gets Notified
### High Priority (🚨 Priority 8)
- **Critical issues detected** - System problems requiring immediate attention
- **Service failures** - When critical services fail
- **Failed actions** - When an action execution fails
- **Intervention required** - When system status is critical
### Medium Priority (📋 Priority 5)
- **Actions queued for approval** - When medium/high-risk actions need manual review
- **System attention needed** - When system status needs attention
### Low Priority (✅ Priority 2)
- **Successful actions** - When safe actions execute successfully
- **System healthy** - Periodic health check confirmations (if enabled)
## Setup Instructions
### Step 1: Create Gotify Application on Rhiannon
1. Open Gotify web interface on Rhiannon:
```bash
# URL: http://rhiannon:8181 (or use external access)
```
2. Log in to Gotify
3. Go to **"Apps"** tab
4. Click **"Create Application"**
5. Name it: `Macha Autonomous System`
6. Copy the generated **Application Token**
### Step 2: Configure Macha
Edit `/home/lily/Documents/gitrepos/nixos-servers/systems/macha.nix`:
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "suggest";
  checkInterval = 300;
  model = "llama3.1:70b";

  # Gotify notifications
  gotifyUrl = "http://rhiannon:8181";
  gotifyToken = "YOUR_TOKEN_HERE"; # Paste the token from Step 1
};
```
### Step 3: Rebuild and Deploy
```bash
cd /home/lily/Documents/gitrepos/nixos-servers
sudo nixos-rebuild switch --flake .#macha
```
### Step 4: Test Notifications
Send a test notification:
```bash
macha-notify "Test" "Macha notifications are working!" 5
```
You should see this notification appear in Gotify on Rhiannon.
## CLI Tools
### Send Test Notification
```bash
macha-notify <title> <message> [priority]
# Examples:
macha-notify "Test" "This is a test" 5
macha-notify "Critical" "This is urgent" 8
macha-notify "Info" "Just FYI" 2
```
Priorities:
- `2` - Low (✅ green)
- `5` - Medium (📋 blue)
- `8` - High (🚨 red)
### Check if Notifications are Enabled
```bash
# View the service environment
systemctl show macha-autonomous.service | grep GOTIFY
```
## Notification Examples
### Critical Issue
```
🚨 Macha: Critical Issue
⚠️ Critical Issue Detected
High disk usage on /var partition (95% full)
Details:
Category: disk
```
### Action Queued for Approval
```
📋 Macha: Action Needs Approval
Action Queued for Approval
Action: Restart failed service: ollama.service
Risk Level: low
Use 'macha-approve list' to review
```
### Action Executed Successfully
```
✅ Macha: Action Success
✅ Action Success
Restart failed service: ollama.service
Output:
Service restarted successfully
```
### Action Failed
```
❌ Macha: Action Failed
❌ Action Failed
Clean up disk space with nix-collect-garbage
Output:
Error: Insufficient permissions
```
## Security Notes
1. **Token Storage**: The Gotify token is stored in the NixOS configuration. Consider using a secrets management solution for production.
2. **Network Access**: Macha needs network access to Rhiannon. Ensure your firewall allows HTTP traffic between them.
3. **Token Scope**: The Gotify token only allows sending messages, not reading or managing Gotify.
## Troubleshooting
### Notifications Not Appearing
1. **Check Gotify is running on Rhiannon:**
```bash
ssh rhiannon systemctl status gotify
```
2. **Test connectivity from Macha:**
```bash
curl http://rhiannon:8181/health
```
3. **Verify token is set:**
```bash
macha-notify "Test" "Testing" 5
```
4. **Check service logs:**
```bash
macha-logs service | grep -i gotify
```
### Notification Spam
If you're getting too many notifications, you can:
1. **Disable notifications temporarily:**
```nix
services.macha-autonomous.gotifyUrl = ""; # Empty string disables
```
2. **Adjust autonomy level:**
```nix
services.macha-autonomous.autonomyLevel = "auto-safe"; # Fewer approval notifications
```
3. **Increase check interval:**
```nix
services.macha-autonomous.checkInterval = 900; # Check every 15 minutes instead of 5
```
## Implementation Details
### Files Modified
- `notifier.py` - Gotify notification client
- `module.nix` - Added configuration options and CLI tool
- `orchestrator.py` - Integrated notifications at decision points
- `macha.nix` - Added Gotify configuration
### Notification Flow
```
Issue Detected → AI Analysis → Decision Made → Notification Sent
Queued or Executed → Notification Sent
```
### Graceful Degradation
- If Gotify is unavailable, the system continues to operate
- Failed notifications are logged but don't crash the service
- Notifications have a 10-second timeout to prevent blocking
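A sketch of that degradation behavior (function and logger names are illustrative):
```python
import logging
import requests

log = logging.getLogger("macha.notifier")

def notify_safe(url: str, token: str, title: str, message: str, priority: int) -> bool:
    """Best-effort delivery: never let a notification failure break the cycle."""
    if not url:  # empty gotifyUrl disables notifications entirely
        return False
    try:
        resp = requests.post(f"{url}/message", params={"token": token},
                             json={"title": title, "message": message,
                                   "priority": priority},
                             timeout=10)  # 10s cap so the main loop never blocks
        resp.raise_for_status()
        return True
    except requests.RequestException as exc:
        log.warning("Gotify notification failed: %s", exc)  # logged, not fatal
        return False
```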
## Future Enhancements
Possible improvements:
- [ ] Rate limiting to prevent notification spam
- [ ] Notification grouping (batch similar issues)
- [ ] Custom notification templates
- [ ] Priority-based notification filtering
- [ ] Integration with other notification services (email, SMS)
- [ ] Secrets management for tokens (agenix, sops-nix)

QUICKSTART.md (new file, 229 lines)

@@ -0,0 +1,229 @@
# Macha Autonomous System - Quick Start Guide
## What is This?
Macha now has a self-maintenance system that uses local AI (via Ollama) to monitor, analyze, and maintain itself. Think of it as a 24/7 system administrator that watches over Macha.
## How It Works
1. **Monitor**: Every 5 minutes, collects system health data (services, resources, logs, etc.)
2. **Analyze**: Uses llama3.1:70b to analyze the data and detect issues
3. **Act**: Based on autonomy level, either proposes fixes or executes them automatically
4. **Learn**: Logs all decisions and actions for auditing and improvement
## Autonomy Levels
### `observe` - Monitoring Only
- Monitors system health
- Logs everything
- Takes NO actions
- Good for: Testing, learning what the system sees
### `suggest` - Approval Required (DEFAULT)
- Monitors and analyzes
- Proposes fixes
- Requires manual approval before executing
- Good for: Production use, when you want control
### `auto-safe` - Limited Autonomy
- Auto-executes "safe" actions:
- Restarting failed services
- Disk cleanup
- Log rotation
- Read-only diagnostics
- Asks approval for risky changes
- Good for: Hands-off operation with safety net
### `auto-full` - Full Autonomy
- Auto-executes most actions
- Still requires approval for HIGH RISK actions
- Never touches protected services (SSH, networking, etc.)
- Good for: Experimental, when you trust the system
## Commands
### Check the status
```bash
# View the service status
systemctl status macha-autonomous
# View live logs
macha-logs service
# View AI decision log
macha-logs decisions
# View action execution log
macha-logs actions
# View orchestrator log
macha-logs orchestrator
```
### Run a manual check
```bash
# Run one maintenance cycle now
macha-check
```
### Approval workflow (when autonomyLevel = "suggest")
```bash
# List pending actions awaiting approval
macha-approve list
# Approve action number 0
macha-approve approve 0
```
### Change autonomy level
Edit `/home/lily/Documents/nixos-servers/systems/macha.nix`:
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "auto-safe"; # Change this
  checkInterval = 300;
  model = "llama3.1:70b";
};
```
Then rebuild:
```bash
sudo nixos-rebuild switch --flake .#macha
```
## What Can It Do?
### Automatically Detects
- Failed systemd services
- High resource usage (CPU, RAM, disk)
- Recent errors in logs
- Network connectivity issues
- Disk space problems
- Boot/uptime anomalies
### Can Propose/Execute
- Restart failed services
- Clean up disk space (nix store, old logs)
- Investigate issues (run diagnostics)
- Propose configuration changes (for manual review)
- NixOS rebuilds (with safety checks)
### Safety Features
- **Protected services**: Never touches SSH, networking, systemd core
- **Dry-run testing**: Tests NixOS rebuilds before applying
- **Action logging**: Every action is logged with context
- **Rollback capability**: Can revert changes
- **Rate limiting**: Won't spam actions
- **Human override**: You can always disable or intervene
## Example Workflow
1. **System detects failed service**
```
Monitor: "ollama.service is failed"
AI Agent: "The ollama service crashed. Propose restarting it."
```
2. **In `suggest` mode (default)**
```
Executor: "Action queued for approval"
You: Run `macha-approve list`
You: Review the proposed action
You: Run `macha-approve approve 0`
Executor: Restarts the service
```
3. **In `auto-safe` mode**
```
Executor: "Low risk action, auto-executing"
Executor: Restarts the service automatically
You: Check logs later to see what happened
```
## Monitoring the System
All data is stored in `/var/lib/macha-autonomous/`:
- `orchestrator.log` - Main system log
- `decisions.jsonl` - AI analysis decisions (JSON Lines format)
- `actions.jsonl` - Executed actions log
- `snapshot_*.json` - System state snapshots
- `approval_queue.json` - Pending actions
## Tips
1. **Start with `suggest` mode** - Get comfortable with what it proposes
2. **Review the logs** - See what it's detecting and proposing
3. **Graduate to `auto-safe`** - Let it handle routine maintenance
4. **Use `observe` for debugging** - If something seems wrong
5. **Check approval queue regularly** - If using `suggest` mode
## Troubleshooting
### Service won't start
```bash
# Check for errors
journalctl -u macha-autonomous -n 50
# Verify Ollama is running
systemctl status ollama
# Test Ollama manually
curl http://localhost:11434/api/generate -d '{"model": "llama3.1:70b", "prompt": "test"}'
```
### AI making bad decisions
- Switch to `observe` mode to stop actions
- Review `decisions.jsonl` to see reasoning
- File an issue or adjust prompts in `agent.py`
### Want to disable temporarily
```bash
sudo systemctl stop macha-autonomous
```
### Want to disable permanently
Edit `systems/macha.nix`:
```nix
services.macha-autonomous.enable = false;
```
Then rebuild.
## Architecture
```
┌─────────────────────────────────────────────────────────┐
│                      Orchestrator                       │
│            (Main loop, runs every 5 minutes)            │
└────────────┬──────────────┬──────────────┬──────────────┘
             │              │              │
         ┌───▼────┐    ┌────▼────┐    ┌────▼─────┐
         │Monitor │    │  Agent  │    │ Executor │
         │        │───▶│  (AI)   │───▶│  (Safe)  │
         └────────┘    └─────────┘    └──────────┘
             │              │              │
          Collects      Analyzes       Executes
           System        Issues        Actions
           Health        w/ LLM         Safely
```
## Future Enhancements
Potential future capabilities:
- Integration with MCP servers (already installed!)
- Predictive maintenance (learning from patterns)
- Self-optimization (tuning configs based on usage)
- Cluster management (if you add more systems)
- Automated backups and disaster recovery
- Security monitoring and hardening
- Performance tuning recommendations
## Philosophy
The goal is a system that maintains itself while being:
1. **Safe** - Never breaks critical functionality
2. **Transparent** - All decisions are logged and explainable
3. **Conservative** - When in doubt, ask for approval
4. **Learning** - Gets better over time
5. **Human-friendly** - Easy to understand and override
Macha is here to help you, not replace you!

README.md (new file, 93 lines)

@@ -0,0 +1,93 @@
# Macha - AI-Powered Autonomous System Administrator
Macha is an AI-powered autonomous system administrator for NixOS that monitors system health, diagnoses issues, and can take corrective actions with appropriate approval workflows.
## Features
- **Autonomous Monitoring**: Continuous health checks with configurable intervals
- **Multi-Host Management**: SSH-based management of multiple NixOS hosts
- **Tool Calling**: Comprehensive system administration tools via Ollama LLM
- **Queue-Based Architecture**: Serialized LLM requests to prevent resource contention
- **Knowledge Base**: ChromaDB-backed learning system for operational wisdom
- **Approval Workflows**: Safety-first approach with configurable autonomy levels
- **Notification System**: Gotify integration for alerts
## Quick Start
### As a NixOS Flake Input
Add to your `flake.nix`:
```nix
{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    macha-autonomous.url = "git+https://git.coven.systems/lily/macha-autonomous";
  };

  outputs = { self, nixpkgs, macha-autonomous }: {
    nixosConfigurations.yourhost = nixpkgs.lib.nixosSystem {
      modules = [
        macha-autonomous.nixosModules.default
        {
          services.macha-autonomous = {
            enable = true;
            autonomyLevel = "suggest"; # observe, suggest, auto-safe, auto-full
            checkInterval = 300;
            ollamaHost = "http://localhost:11434";
            model = "gpt-oss:latest";
          };
        }
      ];
    };
  };
}
```
## Configuration Options
See `module.nix` for full configuration options including:
- Autonomy levels (observe, suggest, auto-safe, auto-full)
- Check intervals
- Ollama host and model settings
- Git repository monitoring
- Service user/group configuration
## CLI Tools
- `macha-chat` - Interactive chat interface
- `macha-ask` - Single-question interface
- `macha-check` - Trigger immediate health check
- `macha-approve` - Approve pending actions
- `macha-logs` - View service logs
- `macha-issues` - Query issue database
- `macha-knowledge` - Query knowledge base
- `macha-systems` - List managed systems
- `macha-notify` - Send Gotify notification
## Architecture
- **Agent**: Core AI logic with tool calling
- **Orchestrator**: Main monitoring loop
- **Executor**: Safe action execution
- **Queue System**: Serialized Ollama requests with priorities
- **Context DB**: ChromaDB for system context and learning
- **Tools**: System administration capabilities
## Requirements
- NixOS with flakes enabled
- Ollama service running
- Python 3 with requests, psutil, chromadb
## Documentation
See `DESIGN.md` for comprehensive architecture documentation.
## License
[Add your license here]
## Author
Lily Miller

SUMMARY.md (new file, 317 lines)

@@ -0,0 +1,317 @@
# Macha Autonomous System - Implementation Summary
## What We Built
A complete self-maintaining system for Macha that uses local AI models (via Ollama) to monitor, analyze, and fix issues automatically. This is a production-ready implementation with safety mechanisms, audit trails, and multiple autonomy levels.
## Components Created
### 1. System Monitor (`monitor.py` - 310 lines)
- Collects comprehensive system health data every cycle
- Monitors: systemd services, resources (CPU/RAM), disk usage, logs, network, NixOS status
- Saves snapshots for historical analysis
- Generates human-readable summaries
### 2. AI Agent (`agent.py` - 238 lines)
- Analyzes system state using llama3.1:70b (or other models)
- Detects issues and classifies severity
- Proposes specific, actionable fixes
- Logs all decisions for auditing
- Uses structured JSON responses for reliability
### 3. Safe Executor (`executor.py` - 371 lines)
- Executes actions with safety checks
- Protected services list (never touches SSH, networking, etc.)
- Supports multiple action types:
- `systemd_restart` - Restart failed services
- `cleanup` - Disk/log cleanup
- `nix_rebuild` - NixOS configuration rebuilds
- `config_change` - Config file modifications
- `investigation` - Diagnostic commands
- Approval queue for manual review
- Complete action logging
### 4. Orchestrator (`orchestrator.py` - 211 lines)
- Main control loop
- Coordinates monitor → agent → executor pipeline
- Handles signals and graceful shutdown
- Configuration management
- Multiple run modes (once, continuous, daemon)
### 5. NixOS Module (`module.nix` - 168 lines)
- Full systemd service integration
- Configuration options via NixOS
- User/group management
- Security hardening
- CLI tools (`macha-check`, `macha-approve`, `macha-logs`)
- Resource limits (1GB RAM, 50% CPU)
### 6. Documentation
- `README.md` - Architecture overview
- `QUICKSTART.md` - User guide
- `EXAMPLES.md` - Configuration examples
- `SUMMARY.md` - This file
**Total: ~1,300 lines of code**
## Architecture
```
┌──────────────────────────────────────────────────────────────┐
│                         NixOS Module                         │
│  - Creates systemd service                                   │
│  - Manages user/permissions                                  │
│  - Provides CLI tools                                        │
└───────────────────────┬──────────────────────────────────────┘
                        │
┌───────────────────────▼──────────────────────────────────────┐
│                        Orchestrator                          │
│  - Runs main loop (every 5 minutes)                          │
│  - Coordinates components                                    │
│  - Handles errors and logging                                │
└───────┬──────────────┬──────────────┬──────────────┬─────────┘
        │              │              │              │
        ▼              ▼              ▼              ▼
  ┌─────────┐    ┌──────────┐   ┌─────────┐    ┌──────────┐
  │ Monitor │───▶│  Agent   │──▶│Executor │───▶│   Logs   │
  │         │    │   (AI)   │   │ (Safe)  │    │          │
  └─────────┘    └──────────┘   └─────────┘    └──────────┘
       │              │              │               │
   Collects       Analyzes       Executes        Records
    System        with LLM        Actions       Everything
    Health        (Ollama)        Safely
```
## Data Flow
1. **Collection**: Monitor gathers system health data
2. **Analysis**: Agent sends data + prompts to Ollama
3. **Decision**: AI returns structured analysis (JSON)
4. **Execution**: Executor checks permissions & autonomy level
5. **Action**: Either executes or queues for approval
6. **Logging**: All steps logged to JSONL files
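A sketch of one cycle of this flow (method names are illustrative; the real coordination lives in `orchestrator.py`):
```python
def maintenance_cycle(monitor, agent, executor, autonomy_level: str) -> None:
    """One pass of the monitor -> agent -> executor pipeline."""
    snapshot = monitor.collect()                 # steps 1-2: gather health data
    decision = agent.analyze(snapshot)           # step 3: structured JSON from the LLM
    for action in decision.get("recommended_actions", []):
        if executor.is_safe(action, autonomy_level):
            result = executor.execute(action)    # step 5: run it
        else:
            result = executor.queue_for_approval(action)  # or park it for review
        executor.log(action, result)             # step 6: JSONL audit trail
```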
## Safety Mechanisms
### Multi-Level Protection
1. **Autonomy Levels**: observe → suggest → auto-safe → auto-full
2. **Protected Services**: Hardcoded list of critical services
3. **Dry-Run Testing**: NixOS rebuilds tested before applying
4. **Approval Queue**: Manual review workflow
5. **Action Logging**: Complete audit trail
6. **Resource Limits**: systemd enforced (1GB RAM, 50% CPU)
7. **Rollback Capability**: Can revert changes
8. **Timeout Protection**: All operations have timeouts
### What It Can Do Automatically (auto-safe)
- ✅ Restart failed services (except protected ones)
- ✅ Clean up disk space (nix-collect-garbage)
- ✅ Rotate/clean logs
- ✅ Run diagnostics
- ❌ Modify configs (requires approval)
- ❌ Rebuild NixOS (requires approval)
- ❌ Touch protected services
## Files Created
```
systems/macha-configs/autonomous/
├── __init__.py # Python package marker
├── monitor.py # System health monitoring
├── agent.py # AI analysis and reasoning
├── executor.py # Safe action execution
├── orchestrator.py # Main control loop
├── module.nix # NixOS integration
├── README.md # Architecture docs
├── QUICKSTART.md # User guide
├── EXAMPLES.md # Configuration examples
└── SUMMARY.md # This file
```
## Integration Points
### Modified Files
- `systems/macha.nix` - Added autonomous module and configuration
### Created Systemd Service
- `macha-autonomous.service` - Main service
- Runs continuously, checks every 5 minutes
- Auto-starts on boot
- Restart on failure
### Created Users/Groups
- `macha-autonomous` user (system user)
- Limited sudo access for specific commands
- Home: `/var/lib/macha-autonomous`
### Created CLI Commands
- `macha-check` - Run manual health check
- `macha-approve list` - Show pending actions
- `macha-approve approve <N>` - Approve action N
- `macha-logs [orchestrator|decisions|actions|service]` - View logs
### State Directory
`/var/lib/macha-autonomous/` contains:
- `orchestrator.log` - Main log
- `decisions.jsonl` - AI analysis log
- `actions.jsonl` - Executed actions log
- `snapshot_*.json` - System state snapshots
- `approval_queue.json` - Pending actions
- `suggested_patch_*.txt` - Config change suggestions
## Configuration
### Current Configuration (in systems/macha.nix)
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "suggest"; # Requires approval
  checkInterval = 300;       # 5 minutes
  model = "llama3.1:70b";    # Most capable model
};
```
### To Deploy
```bash
# Build and activate
sudo nixos-rebuild switch --flake .#macha
# Check status
systemctl status macha-autonomous
# View logs
macha-logs service
```
## Usage Workflow
### Day 1: Observation
```bash
# Just watch what it detects
macha-logs decisions
```
### Day 2-7: Review Proposals
```bash
# Check what it wants to do
macha-approve list
# Approve good actions
macha-approve approve 0
```
### Week 2+: Increase Autonomy
```nix
# Let it handle safe actions automatically
services.macha-autonomous.autonomyLevel = "auto-safe";
```
### Monthly: Review Audit Logs
```bash
# See what it's been doing
cat /var/lib/macha-autonomous/actions.jsonl | jq .
```
## Performance Characteristics
### Resource Usage
- **Idle**: ~100MB RAM
- **Active (w/ llama3.1:70b)**: ~100MB + ~40GB model (shared with Ollama)
- **CPU**: Limited to 50% by systemd
- **Disk**: Minimal (logs rotate, snapshots limited to last 100)
### Timing
- **Monitor**: ~2 seconds
- **AI Analysis**: ~30 seconds (70B model) to ~3 seconds (8B model)
- **Execution**: Varies by action (seconds to minutes)
- **Full Cycle**: ~1-2 minutes typically
### Scalability
- Can handle multiple issues per cycle
- Queue system prevents action spam
- Configurable check intervals
- Model choice affects speed/quality tradeoff
## Current Status
**READY TO USE** - All components implemented and integrated
The system is:
- ✅ Fully functional
- ✅ Safety mechanisms in place
- ✅ Well documented
- ✅ Integrated into NixOS configuration
- ✅ Ready for deployment
Currently configured in **conservative mode** (`suggest`):
- Monitors continuously
- Analyzes with AI
- Proposes actions
- Waits for your approval
## Next Steps
1. **Deploy and test:**
```bash
sudo nixos-rebuild switch --flake .#macha
```
2. **Monitor for a few days:**
```bash
macha-logs service
```
3. **Review what it detects:**
```bash
macha-approve list
cat /var/lib/macha-autonomous/decisions.jsonl | jq .
```
4. **Gradually increase autonomy as you gain confidence**
## Future Enhancement Ideas
### Short Term
- Web dashboard for easier monitoring
- Email/notification system for critical issues
- More sophisticated action types
- Historical trend analysis
### Medium Term
- Integration with MCP servers (already installed!)
- Predictive maintenance using historical data
- Self-tuning of check intervals based on activity
- Multi-system orchestration (manage other NixOS hosts)
### Long Term
- Learning from past decisions to improve
- A/B testing of configuration changes
- Distributed consensus for multi-host decisions
- Integration with external monitoring systems
## Philosophy
This implementation follows key principles:
1. **Safety First**: Multiple layers of protection
2. **Transparency**: Everything is logged and auditable
3. **Conservative Default**: Start restricted, earn trust
4. **Human in Loop**: Always allow override
5. **Gradual Autonomy**: Progressive trust model
6. **Local First**: No external dependencies
7. **Declarative**: NixOS-native configuration
## Conclusion
Macha now has a sophisticated autonomous maintenance system that can:
- Monitor itself 24/7
- Detect and analyze issues using AI
- Fix problems automatically (with appropriate safeguards)
- Learn and improve over time
- Maintain complete audit trails
All powered by local AI models, no external dependencies, fully integrated with NixOS, and designed with safety as the top priority.
**Welcome to the future of self-maintaining systems!** 🎉

__init__.py (new file, 1 line)

@@ -0,0 +1 @@
# Macha Autonomous System Maintenance

agent.py (new file, 1015 lines)

(File diff suppressed because it is too large.)

chat.py (new file, 522 lines)

@@ -0,0 +1,522 @@
#!/usr/bin/env python3
"""
Interactive chat interface with Macha AI agent.
Allows conversational interaction and directive execution.
"""
import json
import os
import subprocess
import sys
from datetime import datetime
from pathlib import Path
from typing import List, Dict, Any

# Add this script's directory to the path so sibling modules (e.g. agent) can be imported
sys.path.insert(0, str(Path(__file__).parent))
from agent import MachaAgent


class MachaChatSession:
    """Interactive chat session with Macha"""

    def __init__(self):
        self.agent = MachaAgent(use_queue=True, priority="INTERACTIVE")
        self.conversation_history: List[Dict[str, str]] = []
        self.session_start = datetime.now().isoformat()
    def _create_chat_prompt(self, user_message: str) -> str:
        """Create a prompt for the chat session"""
        # Build conversation context
        context = ""
        if self.conversation_history:
            context = "\n\nCONVERSATION HISTORY:\n"
            for entry in self.conversation_history[-10:]:  # Last 10 messages
                role = entry['role'].upper()
                msg = entry['message']
                context += f"{role}: {msg}\n"
        prompt = f"""{MachaAgent.SYSTEM_PROMPT}
TASK: INTERACTIVE CHAT SESSION
You are in an interactive chat session with the system administrator.
You can have a natural conversation and execute commands when directed.
CAPABILITIES:
- Answer questions about system status
- Explain configurations and issues
- Execute commands when explicitly asked
- Provide guidance and recommendations
COMMAND EXECUTION:
When the user asks you to run a command or perform an action that requires execution:
1. Respond with a JSON object containing the command to execute
2. Format: {{"action": "execute", "command": "the command", "explanation": "why you're running it"}}
3. After seeing the output, continue the conversation naturally
RESPONSE FORMAT:
- For normal conversation: Respond naturally in plain text
- For command execution: Respond with JSON containing action/command/explanation
- Keep responses concise but informative
RULES:
- Only execute commands when explicitly asked or when it's clearly needed
- Explain what you're about to do before executing
- Never execute destructive commands without explicit confirmation
- If unsure, ask for clarification
{context}
USER: {user_message}
MACHA:"""
        return prompt
    def _execute_command(self, command: str) -> Dict[str, Any]:
        """Execute a shell command and return results"""
        try:
            result = subprocess.run(
                command,
                shell=True,
                capture_output=True,
                text=True,
                timeout=30
            )
            # Check if command failed due to permissions
            needs_sudo = False
            permission_errors = [
                'Interactive authentication required',
                'Permission denied',
                'Operation not permitted',
                'Must be root',
                'insufficient privileges',
                'authentication is required'
            ]
            if result.returncode != 0:
                error_text = (result.stderr + result.stdout).lower()
                for perm_error in permission_errors:
                    if perm_error.lower() in error_text:
                        needs_sudo = True
                        break
            # Retry with sudo if permission error detected
            if needs_sudo and not command.strip().startswith('sudo'):
                print(f"\n⚠️ Permission denied, retrying with sudo...")
                sudo_command = f"sudo {command}"
                result = subprocess.run(
                    sudo_command,
                    shell=True,
                    capture_output=True,
                    text=True,
                    timeout=30
                )
                return {
                    'success': result.returncode == 0,
                    'exit_code': result.returncode,
                    'stdout': result.stdout,
                    'stderr': result.stderr,
                    'command': sudo_command,
                    'retried_with_sudo': True
                }
            return {
                'success': result.returncode == 0,
                'exit_code': result.returncode,
                'stdout': result.stdout,
                'stderr': result.stderr,
                'command': command,
                'retried_with_sudo': False
            }
        except subprocess.TimeoutExpired:
            return {
                'success': False,
                'exit_code': -1,
                'stdout': '',
                'stderr': 'Command timed out after 30 seconds',
                'command': command,
                'retried_with_sudo': False
            }
        except Exception as e:
            return {
                'success': False,
                'exit_code': -1,
                'stdout': '',
                'stderr': str(e),
                'command': command,
                'retried_with_sudo': False
            }
    def _parse_response(self, response: str) -> Dict[str, Any]:
        """Parse AI response to determine if it's a command or text"""
        try:
            # Try to parse as JSON
            parsed = json.loads(response.strip())
            if isinstance(parsed, dict) and 'action' in parsed:
                return parsed
        except json.JSONDecodeError:
            pass
        # It's plain text conversation
        return {'action': 'chat', 'message': response}
    def _auto_diagnose_ollama(self) -> str:
        """Automatically diagnose Ollama issues"""
        diagnostics = []
        diagnostics.append("🔍 AUTO-DIAGNOSIS: Investigating Ollama failure...\n")
        # Check if Ollama service is running
        try:
            result = subprocess.run(
                ['systemctl', 'is-active', 'ollama.service'],
                capture_output=True,
                text=True,
                timeout=5
            )
            if result.returncode == 0:
                diagnostics.append("✅ Ollama service is active")
            else:
                diagnostics.append(f"❌ Ollama service is NOT active: {result.stdout.strip()}")
                # Get service status
                status_result = subprocess.run(
                    ['systemctl', 'status', 'ollama.service', '--no-pager', '-l'],
                    capture_output=True,
                    text=True,
                    timeout=5
                )
                diagnostics.append(f"\nService status:\n```\n{status_result.stdout[-500:]}\n```")
        except Exception as e:
            diagnostics.append(f"⚠️ Could not check service status: {e}")
        # Check memory usage
        try:
            result = subprocess.run(['free', '-h'], capture_output=True, text=True, timeout=5)
            lines = result.stdout.split('\n')
            for line in lines[:3]:  # First 3 lines
                diagnostics.append(f" {line}")
        except Exception as e:
            diagnostics.append(f"⚠️ Could not check memory: {e}")
        # Check which models are loaded
        try:
            import requests
            response = requests.get(f"{self.agent.ollama_host}/api/tags", timeout=5)
            if response.status_code == 200:
                models = response.json().get('models', [])
                diagnostics.append(f"\n📦 Loaded models ({len(models)}):")
                for model in models:
                    name = model.get('name', 'unknown')
                    size = model.get('size', 0) / (1024**3)
                    is_current = "← TARGET" if name == self.agent.model else ""
                    diagnostics.append(f"{name} ({size:.1f} GB) {is_current}")
                # Check if target model is loaded
                model_names = [m.get('name') for m in models]
                if self.agent.model not in model_names:
                    diagnostics.append(f"\n❌ TARGET MODEL NOT LOADED: {self.agent.model}")
                    diagnostics.append(f" Available models: {', '.join(model_names)}")
            else:
                diagnostics.append(f"❌ Ollama API returned {response.status_code}")
        except Exception as e:
            diagnostics.append(f"⚠️ Could not query Ollama API: {e}")
        # Check recent Ollama logs
        try:
            result = subprocess.run(
                ['journalctl', '-u', 'ollama.service', '-n', '10', '--no-pager'],
                capture_output=True,
                text=True,
                timeout=5
            )
            if result.stdout:
                diagnostics.append(f"\n📋 Recent Ollama logs (last 10 lines):\n```\n{result.stdout}\n```")
        except Exception as e:
            diagnostics.append(f"⚠️ Could not check logs: {e}")
        return "\n".join(diagnostics)
    def process_message(self, user_message: str) -> str:
        """Process a user message and return Macha's response"""
        # Add user message to history
        self.conversation_history.append({
            'role': 'user',
            'message': user_message,
            'timestamp': datetime.now().isoformat()
        })
        # Build chat messages for tool-calling API
        messages = []
        # Query relevant knowledge based on user message
        knowledge_context = self.agent._query_relevant_knowledge(user_message, limit=3)
        # Add recent conversation history (last 15 messages to stay within context limits)
        # With tool calling, messages grow quickly, so we limit more aggressively
        recent_history = self.conversation_history[-15:]  # Last ~7 exchanges
        for entry in recent_history:
            content = entry['message']
            # Truncate very long messages (e.g., command outputs)
            if len(content) > 3000:
                content = content[:1500] + "\n... [message truncated] ...\n" + content[-1500:]
            # Add knowledge context to the current (last) message if available
            if entry == recent_history[-1] and knowledge_context:
                content += knowledge_context
            messages.append({
                "role": entry['role'],
                "content": content
            })
        try:
            # Use tool-aware chat API
            ai_response = self.agent._query_ollama_with_tools(messages)
        except Exception as e:
            error_msg = (
                f"❌ CRITICAL: Failed to communicate with Ollama inference engine\n\n"
                f"Error Type: {type(e).__name__}\n"
                f"Error Message: {str(e)}\n\n"
            )
            # Auto-diagnose the issue
            diagnostics = self._auto_diagnose_ollama()
            return error_msg + "\n" + diagnostics
        if not ai_response:
            error_msg = (
                f"❌ Empty response from Ollama inference engine\n\n"
                f"The request succeeded but returned no data. This usually means:\n"
                f" • The model ({self.agent.model}) is still loading\n"
                f" • Ollama ran out of memory during generation\n"
                f" • The prompt was too large for the context window\n\n"
            )
            # Auto-diagnose the issue
            diagnostics = self._auto_diagnose_ollama()
            return error_msg + "\n" + diagnostics
        # Check if Ollama returned an error
        try:
            error_check = json.loads(ai_response)
            if isinstance(error_check, dict) and 'error' in error_check:
                error_msg = (
                    f"❌ Ollama API Error\n\n"
                    f"Error: {error_check.get('error', 'Unknown error')}\n"
                    f"Diagnosis: {error_check.get('diagnosis', 'No details')}\n\n"
                )
                # Auto-diagnose the issue
                diagnostics = self._auto_diagnose_ollama()
                return error_msg + "\n" + diagnostics
        except json.JSONDecodeError:
            # Not JSON, it's a normal response
            pass
        # Parse response
        parsed = self._parse_response(ai_response)
        if parsed.get('action') == 'execute':
            # AI wants to execute a command
            command = parsed.get('command', '')
            explanation = parsed.get('explanation', '')
            # Show what we're about to do
            response = f"🔧 {explanation}\n\nExecuting: `{command}`\n\n"
            # Execute the command
            result = self._execute_command(command)
            # Show if we retried with sudo
            if result.get('retried_with_sudo'):
                response += f"⚠️ Permission denied, retried as: `{result['command']}`\n\n"
            if result['success']:
                response += "✅ Command succeeded:\n"
                if result['stdout']:
                    response += f"```\n{result['stdout']}\n```"
                else:
                    response += "(no output)"
            else:
                response += f"❌ Command failed (exit code {result['exit_code']}):\n"
                if result['stderr']:
                    response += f"```\n{result['stderr']}\n```"
                elif result['stdout']:
                    response += f"```\n{result['stdout']}\n```"
            # Add command execution to history
            self.conversation_history.append({
                'role': 'macha',
                'message': response,
                'timestamp': datetime.now().isoformat(),
                'command_result': result
            })
            # Now ask AI to respond to the command output
            followup_prompt = f"""The command completed. Here's what happened:
Command: {command}
Success: {result['success']}
Output: {result['stdout'][:500] if result['stdout'] else '(none)'}
Error: {result['stderr'][:500] if result['stderr'] else '(none)'}
Please provide a brief analysis or next steps."""
            followup_response = self.agent._query_ollama(followup_prompt)
            if followup_response:
                response += f"\n\n{followup_response}"
            return response
        else:
            # Normal conversation response
            message = parsed.get('message', ai_response)
            self.conversation_history.append({
                'role': 'macha',
                'message': message,
                'timestamp': datetime.now().isoformat()
            })
            return message
def run(self):
"""Run the interactive chat session"""
print("=" * 70)
print("🌐 MACHA INTERACTIVE CHAT")
print("=" * 70)
print("Type your message and press Enter. Commands:")
print(" /exit or /quit - End the chat session")
print(" /clear - Clear conversation history")
print(" /history - Show conversation history")
print(" /debug - Show Ollama connection status")
print("=" * 70)
print()
while True:
try:
# Get user input
user_input = input("\n💬 YOU: ").strip()
if not user_input:
continue
# Handle special commands
if user_input.lower() in ['/exit', '/quit']:
print("\n👋 Ending chat session. Goodbye!")
break
elif user_input.lower() == '/clear':
self.conversation_history.clear()
print("🧹 Conversation history cleared.")
continue
elif user_input.lower() == '/history':
print("\n" + "=" * 70)
print("CONVERSATION HISTORY")
print("=" * 70)
for entry in self.conversation_history:
role = entry['role'].upper()
msg = entry['message'][:100] + "..." if len(entry['message']) > 100 else entry['message']
print(f"{role}: {msg}")
print("=" * 70)
continue
elif user_input.lower() == '/debug':
import os
import subprocess
print("\n" + "=" * 70)
print("MACHA ARCHITECTURE & STATUS")
print("=" * 70)
print("\n🏗️ SYSTEM ARCHITECTURE:")
print(f" Hostname: macha.coven.systems")
print(f" Service: macha-autonomous.service (systemd)")
print(f" Working Directory: /var/lib/macha")
print("\n👤 EXECUTION CONTEXT:")
current_user = os.getenv('USER') or os.getenv('USERNAME') or 'unknown'
print(f" Current User: {current_user}")
print(f" UID: {os.getuid()}")
# Check if user has sudo access
try:
result = subprocess.run(['sudo', '-n', 'true'],
capture_output=True, timeout=1)
if result.returncode == 0:
print(f" Sudo Access: ✓ Yes (passwordless)")
else:
print(f" Sudo Access: ⚠ Requires password")
except Exception:
print(f" Sudo Access: ❌ No")
print(f" Note: Chat runs as invoking user (you), not as macha-autonomous")
print("\n🧠 INFERENCE ENGINE:")
print(f" Backend: Ollama")
print(f" Host: {self.agent.ollama_host}")
print(f" Model: {self.agent.model}")
print(f" Service: ollama.service (systemd)")
print("\n💾 DATABASE:")
print(f" Backend: ChromaDB")
print(f" Host: http://localhost:8000")
print(f" Data: /var/lib/chromadb")
print(f" Service: chromadb.service (systemd)")
print("\n🔍 OLLAMA STATUS:")
# Try to query Ollama status
try:
import requests
# Check if Ollama is running
response = requests.get(f"{self.agent.ollama_host}/api/tags", timeout=5)
if response.status_code == 200:
models = response.json().get('models', [])
print(f" Status: ✓ Running")
print(f" Loaded models: {len(models)}")
for model in models:
name = model.get('name', 'unknown')
size = model.get('size', 0) / (1024**3) # GB
is_current = "← ACTIVE" if name == self.agent.model else ""
print(f"{name} ({size:.1f} GB) {is_current}")
else:
print(f" Status: ❌ Error (HTTP {response.status_code})")
except Exception as e:
print(f" Status: ❌ Cannot connect: {e}")
print(f" Hint: Check 'systemctl status ollama.service'")
print("\n💡 CONVERSATION:")
print(f" History: {len(self.conversation_history)} messages")
print(f" Session started: {self.session_start}")
print("=" * 70)
continue
# Process the message
print("\n🤖 MACHA: ", end='', flush=True)
response = self.process_message(user_input)
print(response)
except KeyboardInterrupt:
print("\n\n👋 Chat interrupted. Use /exit to quit properly.")
continue
except EOFError:
print("\n\n👋 Ending chat session. Goodbye!")
break
except Exception as e:
print(f"\n❌ Error: {e}")
continue
def main():
"""Main entry point"""
session = MachaChatSession()
session.run()
if __name__ == "__main__":
main()
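For scripted testing outside a TTY, the session can also be driven directly instead of through `run()`; a minimal sketch, assuming the module is importable as `chat` and that `MachaChatSession`'s constructor takes no arguments (as in `main()` above):

```python
from chat import MachaChatSession  # module name assumed

session = MachaChatSession()
# One-shot exchange, bypassing the interactive run() loop
reply = session.process_message("Is ollama.service healthy on rhiannon?")
print(reply)
```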

245
config_parser.py Normal file

@@ -0,0 +1,245 @@
#!/usr/bin/env python3
"""
Config Parser - Extract imports and content from NixOS configuration files
"""
import re
import subprocess
from pathlib import Path
from typing import List, Dict, Set, Optional
from datetime import datetime
class ConfigParser:
"""Parse NixOS flake and configuration files"""
def __init__(self, repo_url: str, local_path: Path = Path("/var/lib/macha/config-repo")):
"""
Initialize config parser
Args:
repo_url: Git repository URL (e.g., git+https://...)
local_path: Where to clone/update the repository
"""
# Strip git+ prefix if present for git commands
self.repo_url = repo_url.replace("git+", "")
self.local_path = local_path
self.local_path.mkdir(parents=True, exist_ok=True)
def ensure_repo(self) -> bool:
"""Clone or update the repository"""
try:
if (self.local_path / ".git").exists():
# Update existing repo
result = subprocess.run(
["git", "-C", str(self.local_path), "pull"],
capture_output=True,
text=True,
timeout=30
)
return result.returncode == 0
else:
# Clone new repo
result = subprocess.run(
["git", "clone", self.repo_url, str(self.local_path)],
capture_output=True,
text=True,
timeout=60
)
return result.returncode == 0
except Exception as e:
print(f"Error updating repository: {e}")
return False
def get_systems_from_flake(self) -> List[str]:
"""Extract system names from flake.nix"""
flake_path = self.local_path / "flake.nix"
if not flake_path.exists():
return []
systems = []
try:
content = flake_path.read_text()
# Match patterns like: "macha" = nixpkgs.lib.nixosSystem
matches = re.findall(r'"([^"]+)"\s*=\s*nixpkgs\.lib\.nixosSystem', content)
systems = matches
except Exception as e:
print(f"Error parsing flake.nix: {e}")
return systems
def extract_imports(self, nix_file: Path) -> List[str]:
"""Extract imports from a .nix file"""
if not nix_file.exists():
return []
imports = []
try:
content = nix_file.read_text()
# Find the imports = [ ... ]; block
imports_match = re.search(
r'imports\s*=\s*\[(.*?)\];',
content,
re.DOTALL
)
if imports_match:
imports_block = imports_match.group(1)
# Extract all paths (relative paths starting with ./ or ../)
paths = re.findall(r'[./]+[^\s\]]+\.nix', imports_block)
imports = paths
except Exception as e:
print(f"Error parsing {nix_file}: {e}")
return imports
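# Illustrative: a file containing
#   imports = [ ./hardware.nix ../osconfigs/base.nix ];
# yields ['./hardware.nix', '../osconfigs/base.nix'].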
def resolve_import_path(self, base_file: Path, import_path: str) -> Optional[Path]:
"""Resolve a relative import path to absolute path within repo"""
try:
# Get directory of the base file
base_dir = base_file.parent
# Resolve the relative path
resolved = (base_dir / import_path).resolve()
# Make sure it's within the repo
if self.local_path in resolved.parents or resolved == self.local_path:
return resolved
except Exception as e:
print(f"Error resolving import {import_path} from {base_file}: {e}")
return None
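# Illustrative: resolving "../osconfigs/base.nix" from systems/macha.nix yields
# <repo>/osconfigs/base.nix, which passes the containment check; paths that
# escape the repo root resolve to None.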
def get_system_config(self, system_name: str) -> Dict[str, any]:
"""
Get configuration for a specific system
Returns:
Dict with:
- main_file: Path to systems/<name>.nix
- imports: List of imported file paths (relative to repo root)
- all_files: Set of all .nix files used (including recursive imports)
"""
main_file = self.local_path / "systems" / f"{system_name}.nix"
if not main_file.exists():
return {
"main_file": None,
"imports": [],
"all_files": set()
}
# Track all files (avoid infinite loops)
all_files = set()
files_to_process = [main_file]
processed = set()
while files_to_process:
current_file = files_to_process.pop(0)
if current_file in processed:
continue
processed.add(current_file)
# Get relative path from repo root
try:
rel_path = current_file.relative_to(self.local_path)
all_files.add(str(rel_path))
except ValueError:
continue
# Extract imports from this file
imports = self.extract_imports(current_file)
# Resolve and queue imported files
for imp in imports:
resolved = self.resolve_import_path(current_file, imp)
if resolved and resolved not in processed:
files_to_process.append(resolved)
return {
"main_file": str(main_file.relative_to(self.local_path)),
"imports": self.extract_imports(main_file),
"all_files": sorted(all_files)
}
def read_file_content(self, relative_path: str) -> Optional[str]:
"""Read content of a file by its path relative to repo root"""
try:
file_path = self.local_path / relative_path
if file_path.exists():
return file_path.read_text()
except Exception as e:
print(f"Error reading {relative_path}: {e}")
return None
def get_all_config_files(self) -> List[Dict[str, str]]:
"""
Get all .nix files in the repository with their content
Returns:
List of dicts with:
- path: relative path from repo root
- content: file contents
- category: apps/systems/osconfigs/users based on path
"""
files = []
# Categories to scan
categories = {
"apps": self.local_path / "apps",
"systems": self.local_path / "systems",
"osconfigs": self.local_path / "osconfigs",
"users": self.local_path / "users"
}
for category, path in categories.items():
if not path.exists():
continue
for nix_file in path.rglob("*.nix"):
try:
rel_path = nix_file.relative_to(self.local_path)
content = nix_file.read_text()
files.append({
"path": str(rel_path),
"content": content,
"category": category
})
except Exception as e:
print(f"Error reading {nix_file}: {e}")
return files
if __name__ == "__main__":
# Test the parser
import sys
repo_url = "git+https://git.coven.systems/lily/nixos-servers"
parser = ConfigParser(repo_url)
print("Ensuring repository is up to date...")
if parser.ensure_repo():
print("✓ Repository ready")
else:
print("✗ Failed to update repository")
sys.exit(1)
print("\nSystems defined in flake:")
systems = parser.get_systems_from_flake()
for system in systems:
print(f" - {system}")
if len(sys.argv) > 1:
system_name = sys.argv[1]
print(f"\nConfiguration for {system_name}:")
config = parser.get_system_config(system_name)
print(f" Main file: {config['main_file']}")
print(f" Direct imports: {len(config['imports'])}")
print(f" All files used: {len(config['all_files'])}")
for f in config['all_files']:
print(f" - {f}")

947
context_db.py Normal file

@@ -0,0 +1,947 @@
#!/usr/bin/env python3
"""
Context Database - Store and retrieve system context using ChromaDB for RAG
"""
import json
import os
from typing import Dict, List, Any, Optional, Set
from datetime import datetime
from pathlib import Path
# Set environment variable BEFORE importing chromadb to prevent .env file reading
os.environ.setdefault("CHROMA_ENV_FILE", "")
import chromadb
from chromadb.config import Settings
class ContextDatabase:
"""Manage system context and relationships in ChromaDB"""
def __init__(
self,
host: str = "localhost",
port: int = 8000,
persist_directory: str = "/var/lib/chromadb"
):
"""Initialize ChromaDB client"""
self.client = chromadb.HttpClient(
host=host,
port=port,
settings=Settings(
anonymized_telemetry=False,
allow_reset=False,
chroma_api_impl="chromadb.api.fastapi.FastAPI"
)
)
# Create or get collections
self.systems_collection = self.client.get_or_create_collection(
name="systems",
metadata={"description": "System definitions and metadata"}
)
self.relationships_collection = self.client.get_or_create_collection(
name="relationships",
metadata={"description": "System relationships and dependencies"}
)
self.issues_collection = self.client.get_or_create_collection(
name="issues",
metadata={"description": "Issue tracking and resolution history"}
)
self.decisions_collection = self.client.get_or_create_collection(
name="decisions",
metadata={"description": "AI decisions and outcomes"}
)
self.config_files_collection = self.client.get_or_create_collection(
name="config_files",
metadata={"description": "NixOS configuration files for RAG"}
)
self.knowledge_collection = self.client.get_or_create_collection(
name="knowledge",
metadata={"description": "Operational knowledge: commands, patterns, best practices"}
)
# ============ System Registry ============
def register_system(
self,
hostname: str,
system_type: str,
services: List[str],
capabilities: List[str] = None,
metadata: Dict[str, Any] = None,
config_repo: str = None,
config_branch: str = None,
os_type: str = "nixos"
):
"""Register a system in the database
Args:
hostname: FQDN of the system
system_type: Role (e.g., 'workstation', 'server')
services: List of running services
capabilities: System capabilities
metadata: Additional metadata
config_repo: Git repository URL
config_branch: Git branch name
os_type: Operating system (e.g., 'nixos', 'ubuntu', 'debian', 'arch', 'windows', 'macos')
"""
doc_parts = [
f"System: {hostname}",
f"Type: {system_type}",
f"OS: {os_type}",
f"Services: {', '.join(services)}",
f"Capabilities: {', '.join(capabilities or [])}"
]
if config_repo:
doc_parts.append(f"Configuration Repository: {config_repo}")
if config_branch:
doc_parts.append(f"Configuration Branch: {config_branch}")
doc = "\n".join(doc_parts)
metadata_dict = {
"hostname": hostname,
"type": system_type,
"os_type": os_type,
"services": json.dumps(services),
"capabilities": json.dumps(capabilities or []),
"metadata": json.dumps(metadata or {}),
"config_repo": config_repo or "",
"config_branch": config_branch or "",
"updated_at": datetime.now().isoformat()
}
self.systems_collection.upsert(
ids=[hostname],
documents=[doc],
metadatas=[metadata_dict]
)
def get_system(self, hostname: str) -> Optional[Dict[str, Any]]:
"""Get system information"""
try:
result = self.systems_collection.get(
ids=[hostname],
include=["metadatas", "documents"]
)
if result['ids']:
metadata = result['metadatas'][0]
return {
"hostname": metadata["hostname"],
"type": metadata["type"],
"os_type": metadata.get("os_type", "nixos"),
"services": json.loads(metadata["services"]),
"capabilities": json.loads(metadata["capabilities"]),
"config_repo": metadata.get("config_repo", ""),
"config_branch": metadata.get("config_branch", ""),
"metadata": json.loads(metadata["metadata"]),
"document": result['documents'][0]
}
except:
pass
return None
def get_all_systems(self) -> List[Dict[str, Any]]:
"""Get all registered systems"""
result = self.systems_collection.get(include=["metadatas"])
systems = []
for metadata in result['metadatas']:
systems.append({
"hostname": metadata["hostname"],
"type": metadata["type"],
"os_type": metadata.get("os_type", "unknown"),
"services": json.loads(metadata["services"]),
"capabilities": json.loads(metadata["capabilities"]),
"config_repo": metadata.get("config_repo", ""),
"config_branch": metadata.get("config_branch", "")
})
return systems
def is_system_known(self, hostname: str) -> bool:
"""Check if a system is already registered"""
try:
result = self.systems_collection.get(ids=[hostname])
return len(result['ids']) > 0
except:
return False
def get_known_hostnames(self) -> Set[str]:
"""Get set of all known system hostnames"""
result = self.systems_collection.get(include=["metadatas"])
return set(metadata["hostname"] for metadata in result['metadatas'])
# ============ Relationships ============
def add_relationship(
self,
source: str,
target: str,
relationship_type: str,
description: str = ""
):
"""Add a relationship between systems"""
# Include a separator so ("a", "bc") and ("ab", "c") cannot collide
rel_id = f"{source}->{target}:{relationship_type}"
doc = f"{source} {relationship_type} {target}. {description}"
self.relationships_collection.upsert(
ids=[rel_id],
documents=[doc],
metadatas=[{
"source": source,
"target": target,
"type": relationship_type,
"description": description,
"created_at": datetime.now().isoformat()
}]
)
def get_dependencies(self, hostname: str) -> List[Dict[str, Any]]:
"""Get what a system depends on"""
result = self.relationships_collection.get(
where={"source": hostname},
include=["metadatas"]
)
return [
{
"target": m["target"],
"type": m["type"],
"description": m.get("description", "")
}
for m in result['metadatas']
]
def get_dependents(self, hostname: str) -> List[Dict[str, Any]]:
"""Get what depends on a system"""
result = self.relationships_collection.get(
where={"target": hostname},
include=["metadatas"]
)
return [
{
"source": m["source"],
"type": m["type"],
"description": m.get("description", "")
}
for m in result['metadatas']
]
# ============ Issue History ============
# Not to be confused with store_issue() in the issue-tracking section below,
# which takes a full issue dict; keeping both under one name would let the
# later definition silently shadow this one.
def record_issue(
self,
system: str,
issue_description: str,
resolution: str = "",
severity: str = "unknown",
metadata: Dict[str, Any] = None
) -> str:
"""Store an issue and its resolution"""
issue_id = f"{system}_{datetime.now().timestamp()}"
doc = f"""
System: {system}
Issue: {issue_description}
Resolution: {resolution}
Severity: {severity}
"""
self.issues_collection.add(
ids=[issue_id],
documents=[doc],
metadatas=[{
"system": system,
"severity": severity,
"resolved": bool(resolution),
"timestamp": datetime.now().isoformat(),
"metadata": json.dumps(metadata or {})
}]
)
return issue_id
def store_investigation(
self,
system: str,
issue_description: str,
commands: List[str],
output: str,
timestamp: str = None
) -> str:
"""Store investigation results for an issue"""
if timestamp is None:
timestamp = datetime.now().isoformat()
investigation_id = f"investigation_{system}_{datetime.now().timestamp()}"
doc = f"""
System: {system}
Issue: {issue_description}
Commands executed: {', '.join(commands)}
Output:
{output[:2000]} # Limit output to prevent token overflow
"""
self.issues_collection.add(
ids=[investigation_id],
documents=[doc],
metadatas=[{
"system": system,
"issue": issue_description,
"type": "investigation",
"commands": json.dumps(commands),
"timestamp": timestamp,
"metadata": json.dumps({"output_length": len(output)})
}]
)
return investigation_id
def get_recent_investigations(
self,
issue_description: str,
system: str,
hours: int = 24
) -> List[Dict[str, Any]]:
"""Get recent investigations for a similar issue"""
# Query for similar issues
try:
result = self.issues_collection.query(
query_texts=[f"System: {system}\nIssue: {issue_description}"],
n_results=10,
where={"type": "investigation"},
include=["documents", "metadatas", "distances"]
)
investigations = []
if result['ids'] and result['ids'][0]:
cutoff_time = datetime.now().timestamp() - (hours * 3600)
for i, doc_id in enumerate(result['ids'][0]):
meta = result['metadatas'][0][i]
timestamp = datetime.fromisoformat(meta['timestamp'])
# Only include recent investigations
if timestamp.timestamp() > cutoff_time:
investigations.append({
"id": doc_id,
"system": meta['system'],
"issue": meta['issue'],
"commands": json.loads(meta['commands']),
"output": result['documents'][0][i],
"timestamp": meta['timestamp'],
"relevance": 1 - result['distances'][0][i]
})
return investigations
except Exception as e:
print(f"Error querying investigations: {e}")
return []
def find_similar_issues(
self,
issue_description: str,
system: Optional[str] = None,
n_results: int = 5
) -> List[Dict[str, Any]]:
"""Find similar past issues using semantic search"""
where = {"system": system} if system else None
results = self.issues_collection.query(
query_texts=[issue_description],
n_results=n_results,
where=where,
include=["documents", "metadatas", "distances"]
)
similar = []
for i, doc in enumerate(results['documents'][0]):
similar.append({
"issue": doc,
"metadata": results['metadatas'][0][i],
"similarity": 1 - results['distances'][0][i] # Convert distance to similarity
})
return similar
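# Note: Chroma returns distances, so 1 - distance is a rough similarity score
# (e.g. distance 0.25 -> 0.75). Depending on the collection's distance metric
# it can fall outside [0, 1]; treat it as a ranking heuristic, not a probability.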
# ============ AI Decisions ============
def store_decision(
self,
system: str,
analysis: Dict[str, Any],
action: Dict[str, Any],
outcome: Dict[str, Any] = None
):
"""Store an AI decision for learning"""
decision_id = f"decision_{datetime.now().timestamp()}"
doc = f"""
System: {system}
Status: {analysis.get('status', 'unknown')}
Assessment: {analysis.get('overall_assessment', '')}
Action: {action.get('proposed_action', '')}
Risk: {action.get('risk_level', 'unknown')}
Outcome: {outcome.get('status', 'pending') if outcome else 'pending'}
"""
self.decisions_collection.add(
ids=[decision_id],
documents=[doc],
metadatas=[{
"system": system,
"timestamp": datetime.now().isoformat(),
"analysis": json.dumps(analysis),
"action": json.dumps(action),
"outcome": json.dumps(outcome or {})
}]
)
def get_recent_decisions(
self,
system: Optional[str] = None,
n_results: int = 10
) -> List[Dict[str, Any]]:
"""Get recent decisions, optionally filtered by system"""
where = {"system": system} if system else None
results = self.decisions_collection.query(
query_texts=["recent decisions"],
n_results=n_results,
where=where,
include=["documents", "metadatas"]
)
decisions = []
for i, doc in enumerate(results['documents'][0]):
meta = results['metadatas'][0][i]
decisions.append({
"system": meta["system"],
"timestamp": meta["timestamp"],
"analysis": json.loads(meta["analysis"]),
"action": json.loads(meta["action"]),
"outcome": json.loads(meta["outcome"])
})
return decisions
# ============ Context Generation for AI ============
def get_system_context(self, hostname: str, git_context=None) -> str:
"""Generate rich context about a system for AI prompts"""
context_parts = []
# System info
system = self.get_system(hostname)
if system:
context_parts.append(f"System: {hostname} ({system['type']})")
context_parts.append(f"Services: {', '.join(system['services'])}")
if system['capabilities']:
context_parts.append(f"Capabilities: {', '.join(system['capabilities'])}")
# Git repository info
if system and system.get('metadata'):
metadata = json.loads(system['metadata']) if isinstance(system['metadata'], str) else system['metadata']
config_repo = metadata.get('config_repo', '')
if config_repo:
context_parts.append(f"\nConfiguration Repository: {config_repo}")
# Recent git changes for this system
if git_context:
try:
# Extract system name from FQDN
system_name = hostname.split('.')[0]
git_summary = git_context.get_system_context_summary(system_name)
if git_summary:
context_parts.append(f"\n{git_summary}")
except:
pass
# Dependencies
deps = self.get_dependencies(hostname)
if deps:
context_parts.append("\nDependencies:")
for dep in deps:
context_parts.append(f" - Depends on {dep['target']} for {dep['type']}")
# Dependents
dependents = self.get_dependents(hostname)
if dependents:
context_parts.append("\nUsed by:")
for dependent in dependents:
context_parts.append(f" - {dependent['source']} uses this for {dependent['type']}")
return "\n".join(context_parts)
def get_issue_context(self, issue_description: str, system: str) -> str:
"""Get context about similar past issues"""
similar = self.find_similar_issues(issue_description, system, n_results=3)
if not similar:
return ""
context_parts = ["Similar past issues:"]
for i, issue in enumerate(similar, 1):
if issue['similarity'] > 0.7: # Only include if fairly similar
context_parts.append(f"\n{i}. {issue['issue']}")
context_parts.append(f" Similarity: {issue['similarity']:.2%}")
return "\n".join(context_parts) if len(context_parts) > 1 else ""
# ============ Config Files (for RAG) ============
def store_config_file(
self,
file_path: str,
content: str,
category: str = "unknown",
systems_using: List[str] = None
):
"""
Store a configuration file for RAG retrieval
Args:
file_path: Path relative to repo root (e.g., "apps/gotify.nix")
content: Full file contents
category: apps/systems/osconfigs/users
systems_using: List of system hostnames that import this file
"""
self.config_files_collection.upsert(
ids=[file_path],
documents=[content],
metadatas=[{
"path": file_path,
"category": category,
"systems": json.dumps(systems_using or []),
"updated_at": datetime.now().isoformat()
}]
)
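# Illustrative pairing with query_config_files() below:
#   db.store_config_file("apps/gotify.nix", text, category="apps", systems_using=["rhiannon"])
#   hits = db.query_config_files("gotify notification setup", system="rhiannon")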
def get_config_file(self, file_path: str) -> Optional[Dict[str, Any]]:
"""Get a specific config file by path"""
try:
result = self.config_files_collection.get(
ids=[file_path],
include=["documents", "metadatas"]
)
if result['ids']:
return {
"path": file_path,
"content": result['documents'][0],
"metadata": result['metadatas'][0]
}
except:
pass
return None
def query_config_files(
self,
query: str,
system: str = None,
category: str = None,
n_results: int = 5
) -> List[Dict[str, Any]]:
"""
Query config files using semantic search
Args:
query: Natural language query (e.g., "gotify configuration")
system: Optional filter by system hostname
category: Optional filter by category (apps/systems/etc)
n_results: Number of results to return
Returns:
List of dicts with path, content, and metadata
"""
where = {}
if category:
where["category"] = category
try:
result = self.config_files_collection.query(
query_texts=[query],
n_results=n_results,
where=where if where else None,
include=["documents", "metadatas", "distances"]
)
configs = []
if result['ids'] and result['ids'][0]:
for i, doc_id in enumerate(result['ids'][0]):
config = {
"path": doc_id,
"content": result['documents'][0][i],
"metadata": result['metadatas'][0][i],
"relevance": 1 - result['distances'][0][i] # Convert distance to relevance
}
# Filter by system if specified
if system:
systems = json.loads(config['metadata'].get('systems', '[]'))
if system not in systems:
continue
configs.append(config)
return configs
except Exception as e:
print(f"Error querying config files: {e}")
return []
def get_system_config_files(self, system: str) -> List[str]:
"""Get all config file paths used by a system"""
# This is stored in the system's metadata now
system_info = self.get_system(system)
if system_info and 'config_files' in system_info.get('metadata', {}):
# metadata is already a dict, config_files is already a list
return system_info['metadata']['config_files']
return []
def update_system_config_files(self, system: str, config_files: List[str]):
"""Update the list of config files used by a system"""
system_info = self.get_system(system)
if system_info:
# metadata is already a dict from get_system(), no need to json.loads()
metadata = system_info.get('metadata', {})
metadata['config_files'] = config_files
metadata['config_updated_at'] = datetime.now().isoformat()
# Re-register with updated metadata
self.register_system(
hostname=system,
system_type=system_info['type'],
services=system_info['services'],
capabilities=system_info.get('capabilities', []),
metadata=metadata,
config_repo=system_info.get('config_repo'),
config_branch=system_info.get('config_branch'),
os_type=system_info.get('os_type', 'nixos')
)
# =========================================================================
# ISSUE TRACKING
# =========================================================================
def store_issue(self, issue: Dict[str, Any]):
"""Store a new issue in the database"""
issue_id = issue['issue_id']
# Store in ChromaDB with the issue as document
self.issues_collection.add(
documents=[json.dumps(issue)],
metadatas=[{
'issue_id': issue_id,
'hostname': issue['hostname'],
'title': issue['title'],
'status': issue['status'],
'severity': issue['severity'],
'created_at': issue['created_at'],
'source': issue['source']
}],
ids=[issue_id]
)
def get_issue(self, issue_id: str) -> Optional[Dict[str, Any]]:
"""Retrieve an issue by ID"""
try:
results = self.issues_collection.get(ids=[issue_id])
if results['documents']:
return json.loads(results['documents'][0])
return None
except Exception as e:
print(f"Error retrieving issue {issue_id}: {e}")
return None
def update_issue(self, issue: Dict[str, Any]):
"""Update an existing issue"""
issue_id = issue['issue_id']
# Delete old version
try:
self.issues_collection.delete(ids=[issue_id])
except:
pass
# Store updated version
self.store_issue(issue)
def delete_issue(self, issue_id: str):
"""Remove an issue from the database (used when archiving)"""
try:
self.issues_collection.delete(ids=[issue_id])
except Exception as e:
print(f"Error deleting issue {issue_id}: {e}")
def list_issues(
self,
hostname: Optional[str] = None,
status: Optional[str] = None,
severity: Optional[str] = None
) -> List[Dict[str, Any]]:
"""List issues with optional filters"""
try:
# Build query filter
where_filter = {}
if hostname:
where_filter['hostname'] = hostname
if status:
where_filter['status'] = status
if severity:
where_filter['severity'] = severity
if where_filter:
results = self.issues_collection.get(where=where_filter)
else:
results = self.issues_collection.get()
issues = []
for doc in results['documents']:
issues.append(json.loads(doc))
# Sort by created_at descending
issues.sort(key=lambda x: x.get('created_at', ''), reverse=True)
return issues
except Exception as e:
print(f"Error listing issues: {e}")
return []
# ============ Knowledge Base ============
def store_knowledge(
self,
topic: str,
knowledge: str,
category: str = "general",
source: str = "experience",
confidence: str = "medium",
tags: list = None
) -> str:
"""
Store a piece of operational knowledge
Args:
topic: Main subject (e.g., "nh os switch", "systemd-journal-remote")
knowledge: The actual knowledge/insight/pattern
category: Type of knowledge (command, pattern, troubleshooting, performance, etc.)
source: Where this came from (experience, documentation, user-provided)
confidence: How confident we are (low, medium, high)
tags: Optional tags for categorization
Returns:
Knowledge ID
"""
import uuid
knowledge_id = str(uuid.uuid4())
knowledge_doc = {
"id": knowledge_id,
"topic": topic,
"knowledge": knowledge,
"category": category,
"source": source,
"confidence": confidence,
"tags": tags or [],
"created_at": datetime.utcnow().isoformat(),
"last_verified": datetime.utcnow().isoformat(),
"times_referenced": 0
}
try:
self.knowledge_collection.add(
ids=[knowledge_id],
documents=[knowledge],
metadatas=[{
"topic": topic,
"category": category,
"source": source,
"confidence": confidence,
"tags": json.dumps(tags or []),
"created_at": knowledge_doc["created_at"],
"full_doc": json.dumps(knowledge_doc)
}]
)
return knowledge_id
except Exception as e:
print(f"Error storing knowledge: {e}")
return None
def query_knowledge(
self,
query: str,
category: str = None,
limit: int = 5
) -> list:
"""
Query the knowledge base for relevant information
Args:
query: What to search for
category: Optional category filter
limit: Maximum results to return
Returns:
List of relevant knowledge entries
"""
try:
where_filter = {}
if category:
where_filter["category"] = category
results = self.knowledge_collection.query(
query_texts=[query],
n_results=limit,
where=where_filter if where_filter else None
)
knowledge_items = []
if results and results['documents']:
for i, doc in enumerate(results['documents'][0]):
metadata = results['metadatas'][0][i]
full_doc = json.loads(metadata.get('full_doc', '{}'))
# Bump the reference count on the returned copy (note: the updated count
# is not persisted back to the collection here)
full_doc['times_referenced'] = full_doc.get('times_referenced', 0) + 1
knowledge_items.append(full_doc)
return knowledge_items
except Exception as e:
print(f"Error querying knowledge: {e}")
return []
def get_knowledge_by_topic(self, topic: str) -> list:
"""Get all knowledge entries for a specific topic"""
try:
results = self.knowledge_collection.get(
where={"topic": topic}
)
knowledge_items = []
for metadata in results['metadatas']:
full_doc = json.loads(metadata.get('full_doc', '{}'))
knowledge_items.append(full_doc)
return knowledge_items
except Exception as e:
print(f"Error getting knowledge by topic: {e}")
return []
def update_knowledge(
self,
knowledge_id: str,
knowledge: str = None,
confidence: str = None,
verify: bool = False
):
"""
Update an existing knowledge entry
Args:
knowledge_id: ID of knowledge to update
knowledge: New knowledge text (optional)
confidence: New confidence level (optional)
verify: Mark as verified (updates last_verified timestamp)
"""
try:
# Get existing entry
result = self.knowledge_collection.get(ids=[knowledge_id])
if not result['documents']:
return False
metadata = result['metadatas'][0]
full_doc = json.loads(metadata.get('full_doc', '{}'))
# Update fields
if knowledge:
full_doc['knowledge'] = knowledge
if confidence:
full_doc['confidence'] = confidence
if verify:
full_doc['last_verified'] = datetime.utcnow().isoformat()
# Update in collection
self.knowledge_collection.update(
ids=[knowledge_id],
documents=[full_doc['knowledge']],
metadatas=[{
"topic": full_doc['topic'],
"category": full_doc['category'],
"source": full_doc['source'],
"confidence": full_doc['confidence'],
"tags": json.dumps(full_doc['tags']),
"created_at": full_doc['created_at'],
"full_doc": json.dumps(full_doc)
}]
)
return True
except Exception as e:
print(f"Error updating knowledge: {e}")
return False
def list_knowledge_topics(self, category: str = None) -> list:
"""List all unique topics in the knowledge base"""
try:
where_filter = {"category": category} if category else None
results = self.knowledge_collection.get(where=where_filter)
topics = set()
for metadata in results['metadatas']:
topics.add(metadata.get('topic'))
return sorted(list(topics))
except Exception as e:
print(f"Error listing knowledge topics: {e}")
return []
if __name__ == "__main__":
import sys
# Test the database
db = ContextDatabase()
# Register test systems
db.register_system(
"macha",
"workstation",
["ollama"],
capabilities=["ai-inference"]
)
db.register_system(
"rhiannon",
"server",
["gotify", "nextcloud", "prowlarr"],
capabilities=["notifications", "cloud-storage"]
)
# Add relationship
db.add_relationship(
"macha",
"rhiannon",
"uses-service",
"Macha uses Rhiannon's Gotify for notifications"
)
# Test queries
print("All systems:", db.get_all_systems())
print("\nMacha's dependencies:", db.get_dependencies("macha"))
print("\nRhiannon's dependents:", db.get_dependents("rhiannon"))
print("\nSystem context:", db.get_system_context("macha"))

328
conversation.py Normal file
View File

@@ -0,0 +1,328 @@
#!/usr/bin/env python3
"""
Conversational Interface - Allows questioning Macha about decisions and system state
"""
import json
import requests
from typing import Dict, List, Any, Optional
from pathlib import Path
from datetime import datetime
from agent import MachaAgent
class MachaConversation:
"""Conversational interface for Macha"""
def __init__(
self,
ollama_host: str = "http://localhost:11434",
model: str = "gpt-oss:latest",
state_dir: Path = Path("/var/lib/macha")
):
self.ollama_host = ollama_host
self.model = model
self.state_dir = state_dir
self.decision_log = self.state_dir / "decisions.jsonl"
self.approval_queue = self.state_dir / "approval_queue.json"
self.orchestrator_log = self.state_dir / "orchestrator.log"
# Initialize agent with tool support and queue
self.agent = MachaAgent(
ollama_host=ollama_host,
model=model,
state_dir=state_dir,
enable_tools=True,
use_queue=True,
priority="INTERACTIVE"
)
def ask(self, question: str, include_context: bool = True) -> str:
"""Ask Macha a question with optional system context"""
context = ""
if include_context:
context = self._gather_context()
# Build messages for tool-aware chat
content = self._create_conversational_prompt(question, context)
messages = [{"role": "user", "content": content}]
response = self.agent._query_ollama_with_tools(messages)
return response
def discuss_action(self, action_index: int) -> str:
"""Discuss a specific queued action by its queue position (0-based index)"""
action = self._get_action_from_queue(action_index)
if not action:
return f"No action found at queue position {action_index}. Use 'macha-approve list' to see available actions."
context = self._gather_context()
action_context = json.dumps(action, indent=2)
content = f"""TASK: DISCUSS PROPOSED ACTION
================================================================================
A user is asking about a proposed action in your approval queue.
QUEUED ACTION (Queue Position #{action_index}):
{action_context}
RECENT SYSTEM CONTEXT:
{context}
The user wants to discuss this action. Explain:
1. Why you proposed this action
2. What problem it solves
3. The risks involved
4. What could go wrong
5. Alternative approaches if any
Be conversational, helpful, and honest about uncertainties.
"""
messages = [{"role": "user", "content": content}]
return self.agent._query_ollama_with_tools(messages)
def _gather_context(self) -> str:
"""Gather relevant system context for the conversation"""
context_parts = []
# System infrastructure from ChromaDB
try:
from context_db import ContextDatabase
db = ContextDatabase()
systems = db.get_all_systems()
if systems:
context_parts.append("INFRASTRUCTURE:")
for system in systems:
context_parts.append(f" - {system['hostname']} ({system.get('type', 'unknown')})")
if system.get('config_repo'):
context_parts.append(f" Config Repo: {system['config_repo']}")
context_parts.append(f" Branch: {system.get('config_branch', 'unknown')}")
if system.get('capabilities'):
context_parts.append(f" Capabilities: {', '.join(system['capabilities'])}")
except Exception as e:
# ChromaDB not available, skip
pass
# Recent decisions
recent_decisions = self._get_recent_decisions(5)
if recent_decisions:
context_parts.append("\nRECENT DECISIONS:")
for i, dec in enumerate(recent_decisions, 1):
timestamp = dec.get("timestamp", "unknown")
analysis = dec.get("analysis", {})
status = analysis.get("status", "unknown")
context_parts.append(f"{i}. [{timestamp}] Status: {status}")
if "issues" in analysis:
for issue in analysis.get("issues", [])[:3]:
context_parts.append(f" - {issue.get('description', 'N/A')}")
# Pending approvals
pending = self._get_pending_approvals()
if pending:
context_parts.append(f"\nPENDING APPROVALS: {len(pending)} action(s) awaiting approval")
# Recent log excerpts (last 10 lines)
recent_logs = self._get_recent_logs(10)
if recent_logs:
context_parts.append("\nRECENT LOG ENTRIES:")
context_parts.extend(recent_logs)
return "\n".join(context_parts)
def _create_conversational_prompt(self, question: str, context: str) -> str:
"""Create a conversational prompt"""
return f"""{MachaAgent.SYSTEM_PROMPT}
TASK: ANSWER QUESTION
================================================================================
You monitor system health, analyze issues using AI, and propose fixes. Be helpful,
honest about what you know and don't know, and reference the context provided below.
SYSTEM CONTEXT:
{context if context else "No recent activity"}
USER QUESTION:
{question}
Respond conversationally and helpfully. If the question is about your recent decisions
or actions, reference the context above. If you don't have enough information, say so.
Keep responses concise but informative.
"""
def _query_ollama(self, prompt: str, temperature: float = 0.7) -> str:
"""Query Ollama API"""
try:
response = requests.post(
f"{self.ollama_host}/api/generate",
json={
"model": self.model,
"prompt": prompt,
"stream": False,
# generation parameters go under "options" for Ollama's /api/generate
"options": {"temperature": temperature},
},
timeout=60
)
response.raise_for_status()
return response.json().get("response", "")
except requests.exceptions.HTTPError as e:
# e.response is the response that failed raise_for_status()
error_detail = ""
try:
error_detail = f" - {e.response.text}"
except Exception:
pass
return f"Error: Ollama returned HTTP {e.response.status_code}{error_detail}"
except Exception as e:
return f"Error querying Ollama: {str(e)}"
def _get_recent_decisions(self, count: int = 5) -> List[Dict[str, Any]]:
"""Get recent decisions from log"""
if not self.decision_log.exists():
return []
decisions = []
try:
with open(self.decision_log, 'r') as f:
for line in f:
if line.strip():
try:
decisions.append(json.loads(line))
except:
pass
except:
pass
return decisions[-count:]
def _get_pending_approvals(self) -> List[Dict[str, Any]]:
"""Get pending approvals from queue"""
if not self.approval_queue.exists():
return []
try:
with open(self.approval_queue, 'r') as f:
data = json.load(f)
# Queue is a JSON array, not an object with "pending" key
if isinstance(data, list):
return data
return data.get("pending", [])
except:
return []
def _get_action_from_queue(self, action_index: int) -> Optional[Dict[str, Any]]:
"""Get a specific action from the queue by index"""
pending = self._get_pending_approvals()
if 0 <= action_index < len(pending):
return pending[action_index]
return None
def _get_recent_logs(self, count: int = 10) -> List[str]:
"""Get recent orchestrator log lines"""
if not self.orchestrator_log.exists():
return []
try:
with open(self.orchestrator_log, 'r') as f:
lines = f.readlines()
return [line.strip() for line in lines[-count:] if line.strip()]
except:
return []
if __name__ == "__main__":
import sys
import argparse
parser = argparse.ArgumentParser(description="Ask Macha a question or discuss an action")
parser.add_argument("--discuss", type=int, metavar="ACTION_ID", help="Discuss a specific queued action")
parser.add_argument("--follow-up", type=str, metavar="QUESTION", help="Follow-up question about the action")
parser.add_argument("question", nargs="*", help="Your question for Macha")
parser.add_argument("--no-context", action="store_true", help="Don't include system context")
args = parser.parse_args()
# Load config if available
config_file = Path("/etc/macha-autonomous/config.json")
ollama_host = "http://localhost:11434"
model = "gpt-oss:latest"
if config_file.exists():
try:
with open(config_file, 'r') as f:
config = json.load(f)
ollama_host = config.get("ollama_host", ollama_host)
model = config.get("model", model)
except:
pass
conversation = MachaConversation(
ollama_host=ollama_host,
model=model
)
if args.discuss is not None:
if args.follow_up:
# Follow-up question about a specific action
action = conversation._get_action_from_queue(args.discuss)
if not action:
print(f"No action found at queue position {args.discuss}. Use 'macha-approve list' to see available actions.")
sys.exit(1)
# Build context with the action details
action_context = f"""
QUEUED ACTION #{args.discuss}:
Diagnosis: {action.get('proposal', {}).get('diagnosis', 'N/A')}
Proposed Action: {action.get('proposal', {}).get('proposed_action', 'N/A')}
Action Type: {action.get('proposal', {}).get('action_type', 'N/A')}
Risk Level: {action.get('proposal', {}).get('risk_level', 'N/A')}
Commands: {json.dumps(action.get('proposal', {}).get('commands', []), indent=2)}
Reasoning: {action.get('proposal', {}).get('reasoning', 'N/A')}
FOLLOW-UP QUESTION:
{args.follow_up}
"""
# Query the AI with the action context
response = conversation._query_ollama(f"""{MachaAgent.SYSTEM_PROMPT}
TASK: ANSWER FOLLOW-UP QUESTION ABOUT QUEUED ACTION
================================================================================
You are answering a follow-up question about a proposed fix that is awaiting approval.
Be helpful and answer directly. If the user is concerned about risks, explain them clearly.
If they ask about alternatives, suggest them.
{action_context}
RESPOND CONCISELY AND DIRECTLY.
""")
else:
# Initial discussion about the action
response = conversation.discuss_action(args.discuss)
elif args.question:
# Ask a general question
question = " ".join(args.question)
response = conversation.ask(question, include_context=not args.no_context)
else:
parser.print_help()
sys.exit(1)
# Only print formatted output for initial discussion, not for follow-ups
if args.follow_up:
print(response)
else:
print("\n" + "="*60)
print("MACHA:")
print("="*60)
print(response)
print("="*60 + "\n")

537
executor.py Normal file

@@ -0,0 +1,537 @@
#!/usr/bin/env python3
"""
Action Executor - Safely executes proposed fixes with rollback capability
"""
import json
import subprocess
import shutil
from typing import Dict, List, Any, Optional
from pathlib import Path
from datetime import datetime
import time
class SafeExecutor:
"""Executes system maintenance actions with safety checks"""
# Actions that are considered safe to auto-execute
SAFE_ACTIONS = {
"systemd_restart", # Restart failed services
"cleanup", # Disk cleanup, log rotation
"investigation", # Read-only diagnostics
}
# Services that should NEVER be stopped/disabled
PROTECTED_SERVICES = {
"sshd",
"systemd-networkd",
"NetworkManager",
"systemd-resolved",
"dbus",
}
def __init__(
self,
state_dir: Path = Path("/var/lib/macha"),
autonomy_level: str = "suggest", # observe, suggest, auto-safe, auto-full
dry_run: bool = False,
agent = None # Optional agent for learning from actions
):
self.state_dir = state_dir
self.state_dir.mkdir(parents=True, exist_ok=True)
self.autonomy_level = autonomy_level
self.dry_run = dry_run
self.agent = agent
self.action_log = self.state_dir / "actions.jsonl"
self.approval_queue = self.state_dir / "approval_queue.json"
def execute_action(self, action: Dict[str, Any], monitoring_context: Dict[str, Any]) -> Dict[str, Any]:
"""Execute a proposed action with appropriate safety checks"""
action_type = action.get("action_type", "unknown")
risk_level = action.get("risk_level", "high")
# Determine if we should execute
should_execute, reason = self._should_execute(action_type, risk_level)
if not should_execute:
if self.autonomy_level == "suggest":
# Queue for approval
self._queue_for_approval(action, monitoring_context)
return {
"executed": False,
"status": "queued_for_approval",
"reason": reason,
"queue_file": str(self.approval_queue)
}
else:
return {
"executed": False,
"status": "blocked",
"reason": reason
}
# Execute the action
if self.dry_run:
return self._dry_run_action(action)
return self._execute_action_impl(action, monitoring_context)
def _should_execute(self, action_type: str, risk_level: str) -> tuple[bool, str]:
"""Determine if an action should be auto-executed based on autonomy level"""
if self.autonomy_level == "observe":
return False, "Autonomy level set to observe-only"
# Auto-approve low-risk investigation actions
if action_type == "investigation" and risk_level == "low":
return True, "Auto-approved: Low-risk information gathering"
if self.autonomy_level == "suggest":
return False, "Autonomy level requires manual approval"
if self.autonomy_level == "auto-safe":
if action_type in self.SAFE_ACTIONS and risk_level == "low":
return True, "Auto-executing safe action"
return False, "Action requires higher autonomy level"
if self.autonomy_level == "auto-full":
if risk_level == "high":
return False, "High risk actions always require approval"
return True, "Auto-executing approved action"
return False, "Unknown autonomy level"
def _execute_action_impl(self, action: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
"""Actually execute the action"""
action_type = action.get("action_type")
result = {
"executed": True,
"timestamp": datetime.now().isoformat(),
"action": action,
"success": False,
"output": "",
"error": None
}
try:
if action_type == "systemd_restart":
result.update(self._restart_services(action))
elif action_type == "cleanup":
result.update(self._perform_cleanup(action))
elif action_type == "nix_rebuild":
result.update(self._nix_rebuild(action))
elif action_type == "config_change":
result.update(self._apply_config_change(action))
elif action_type == "investigation":
result.update(self._run_investigation(action))
else:
result["error"] = f"Unknown action type: {action_type}"
except Exception as e:
result["error"] = str(e)
result["success"] = False
# Log the action
self._log_action(result)
# Learn from successful operations
if result.get("success") and self.agent:
try:
self.agent.reflect_and_learn(
situation=action.get("diagnosis", "Unknown situation"),
action_taken=action.get("proposed_action", "Unknown action"),
outcome=result.get("output", ""),
success=True
)
except Exception as e:
# Don't fail the action if learning fails
print(f"Note: Could not record learning: {e}")
return result
def _restart_services(self, action: Dict[str, Any]) -> Dict[str, Any]:
"""Restart systemd services"""
commands = action.get("commands", [])
output_lines = []
for cmd in commands:
if not cmd.startswith("systemctl restart "):
continue
service = cmd.split()[-1]
# Safety check
if any(protected in service for protected in self.PROTECTED_SERVICES):
output_lines.append(f"BLOCKED: {service} is protected")
continue
try:
result = subprocess.run(
["systemctl", "restart", service],
capture_output=True,
text=True,
timeout=30
)
if result.returncode == 0:
output_lines.append(f"✓ Restarted {service}")
else:
output_lines.append(f"✗ Failed to restart {service}: {result.stderr}")
except subprocess.TimeoutExpired:
output_lines.append(f"✗ Timeout restarting {service}")
return {
# succeed only if at least one service actually restarted; blocked or
# failed attempts alone should not count as success
"success": any(line.startswith("✓") for line in output_lines),
"output": "\n".join(output_lines)
}
def _perform_cleanup(self, action: Dict[str, Any]) -> Dict[str, Any]:
"""Perform system cleanup tasks"""
output_lines = []
# Nix store cleanup
if "nix" in action.get("proposed_action", "").lower():
try:
result = subprocess.run(
["nix-collect-garbage", "--delete-old"],
capture_output=True,
text=True,
timeout=300
)
output_lines.append(f"Nix cleanup: {result.stdout}")
except Exception as e:
output_lines.append(f"Nix cleanup failed: {e}")
# Journal cleanup (keep last 7 days)
try:
result = subprocess.run(
["journalctl", "--vacuum-time=7d"],
capture_output=True,
text=True,
timeout=60
)
output_lines.append(f"Journal cleanup: {result.stdout}")
except Exception as e:
output_lines.append(f"Journal cleanup failed: {e}")
return {
"success": True,
"output": "\n".join(output_lines)
}
def _nix_rebuild(self, action: Dict[str, Any]) -> Dict[str, Any]:
"""Rebuild NixOS configuration"""
# This is HIGH RISK - always requires approval or full autonomy
# And we should test first
output_lines = []
# First, try a dry build
try:
result = subprocess.run(
["nixos-rebuild", "dry-build", "--flake", ".#macha"],
capture_output=True,
text=True,
timeout=600,
cwd="/home/lily/Documents/nixos-servers"
)
if result.returncode != 0:
return {
"success": False,
"output": f"Dry build failed:\n{result.stderr}"
}
output_lines.append("✓ Dry build successful")
except Exception as e:
return {
"success": False,
"output": f"Dry build error: {e}"
}
# Now do the actual rebuild
try:
result = subprocess.run(
["nixos-rebuild", "switch", "--flake", ".#macha"],
capture_output=True,
text=True,
timeout=1200,
cwd="/home/lily/Documents/nixos-servers"
)
output_lines.append(result.stdout)
return {
"success": result.returncode == 0,
"output": "\n".join(output_lines),
"error": result.stderr if result.returncode != 0 else None
}
except Exception as e:
return {
"success": False,
"output": "\n".join(output_lines),
"error": str(e)
}
def _apply_config_change(self, action: Dict[str, Any]) -> Dict[str, Any]:
"""Apply a configuration file change"""
config_changes = action.get("config_changes", {})
file_path = config_changes.get("file")
if not file_path:
return {
"success": False,
"output": "No file specified in config_changes"
}
# For now, we DON'T auto-modify configs - too risky
# Instead, we create a suggested patch file
patch_file = self.state_dir / f"suggested_patch_{int(time.time())}.txt"
with open(patch_file, 'w') as f:
f.write(f"Suggested change to {file_path}:\n\n")
f.write(config_changes.get("change", "No change description"))
f.write(f"\n\nReasoning: {action.get('reasoning', 'No reasoning provided')}")
return {
"success": True,
"output": f"Config change suggestion saved to {patch_file}\nThis requires manual review and application."
}
def _run_investigation(self, action: Dict[str, Any]) -> Dict[str, Any]:
"""Run diagnostic commands"""
commands = action.get("commands", [])
output_lines = []
for cmd in commands:
# Only allow safe read-only commands: match whole prefixes (so "dfoo" is
# not mistaken for "df") and reject shell metacharacters, since the command
# runs with shell=True and "df; rm -rf /" would otherwise pass a prefix check
safe_commands = ["journalctl", "systemctl status", "df", "free", "ps", "netstat", "ss"]
allowed = any(cmd == safe or cmd.startswith(safe + " ") for safe in safe_commands)
if not allowed or any(ch in cmd for ch in ";|&$`><"):
output_lines.append(f"BLOCKED unsafe command: {cmd}")
continue
try:
result = subprocess.run(
cmd,
shell=True,
capture_output=True,
text=True,
timeout=30
)
output_lines.append(f"$ {cmd}")
output_lines.append(result.stdout)
except Exception as e:
output_lines.append(f"Error running {cmd}: {e}")
return {
"success": True,
"output": "\n".join(output_lines)
}
def _dry_run_action(self, action: Dict[str, Any]) -> Dict[str, Any]:
"""Simulate action execution"""
return {
"executed": False,
"status": "dry_run",
"action": action,
"output": "Dry run mode - no actual changes made"
}
def _queue_for_approval(self, action: Dict[str, Any], context: Dict[str, Any]):
"""Add action to approval queue"""
queue = []
if self.approval_queue.exists():
with open(self.approval_queue, 'r') as f:
queue = json.load(f)
# Check for duplicate pending actions
proposed_action = action.get("proposed_action", "")
diagnosis = action.get("diagnosis", "")
for existing in queue:
# Skip already approved/rejected items
if existing.get("approved") is not None:
continue
existing_action = existing.get("action", {})
existing_proposed = existing_action.get("proposed_action", "")
existing_diagnosis = existing_action.get("diagnosis", "")
# Check if this is essentially the same issue
# Match if diagnosis is very similar OR proposed action is very similar
if (diagnosis and existing_diagnosis and
self._similarity_check(diagnosis, existing_diagnosis) > 0.7):
print(f"Skipping duplicate action - similar diagnosis already queued")
return
if (proposed_action and existing_proposed and
self._similarity_check(proposed_action, existing_proposed) > 0.7):
print(f"Skipping duplicate action - similar proposal already queued")
return
queue.append({
"timestamp": datetime.now().isoformat(),
"action": action,
"context": context,
"approved": None
})
with open(self.approval_queue, 'w') as f:
json.dump(queue, f, indent=2)
def _similarity_check(self, str1: str, str2: str) -> float:
"""Simple similarity check between two strings"""
# Normalize strings
s1 = str1.lower().strip()
s2 = str2.lower().strip()
# Exact match
if s1 == s2:
return 1.0
# Check for significant word overlap
words1 = set(s1.split())
words2 = set(s2.split())
# Remove common words that don't indicate similarity
common_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by', 'is', 'are', 'was', 'were', 'be', 'been', 'have', 'has', 'had'}
words1 = words1 - common_words
words2 = words2 - common_words
if not words1 or not words2:
return 0.0
# Calculate Jaccard similarity
intersection = len(words1 & words2)
union = len(words1 | words2)
return intersection / union if union > 0 else 0.0
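# Illustrative: "restart the ollama service" vs "restart ollama service" both
# reduce to {"restart", "ollama", "service"} after stop-word removal, giving a
# Jaccard score of 1.0 and tripping the 0.7 duplicate threshold above.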
def _log_action(self, result: Dict[str, Any]):
"""Log executed actions"""
with open(self.action_log, 'a') as f:
f.write(json.dumps(result) + '\n')
def get_approval_queue(self) -> List[Dict[str, Any]]:
"""Get pending actions awaiting approval"""
if not self.approval_queue.exists():
return []
with open(self.approval_queue, 'r') as f:
return json.load(f)
def approve_action(self, index: int) -> bool:
"""Approve and execute a queued action, then remove it from queue"""
queue = self.get_approval_queue()
if 0 <= index < len(queue):
action_item = queue[index]
# Execute the approved action
result = self._execute_action_impl(action_item["action"], action_item["context"])
# Archive the action (success or failure)
self._archive_action(action_item, result)
# Remove from queue regardless of outcome
queue.pop(index)
with open(self.approval_queue, 'w') as f:
json.dump(queue, f, indent=2)
return result.get("success", False)
return False
def _archive_action(self, action_item: Dict[str, Any], result: Dict[str, Any]):
"""Archive an approved action with its execution result"""
archive_file = self.state_dir / "approved_actions.jsonl"
archive_entry = {
"timestamp": datetime.now().isoformat(),
"original_timestamp": action_item.get("timestamp"),
"action": action_item.get("action"),
"context": action_item.get("context"),
"result": result
}
with open(archive_file, 'a') as f:
f.write(json.dumps(archive_entry) + '\n')
def reject_action(self, index: int) -> bool:
"""Reject and remove a queued action"""
queue = self.get_approval_queue()
if 0 <= index < len(queue):
removed_action = queue.pop(index)
with open(self.approval_queue, 'w') as f:
json.dump(queue, f, indent=2)
return True
return False
if __name__ == "__main__":
import sys
if len(sys.argv) > 1:
if sys.argv[1] == "queue":
executor = SafeExecutor()
queue = executor.get_approval_queue()
if queue:
print("\n" + "="*70)
print(f"PENDING ACTIONS: {len(queue)}")
print("="*70)
for i, item in enumerate(queue):
action = item.get("action", {})
timestamp = item.get("timestamp", "unknown")
approved = item.get("approved")
status = "✓ APPROVED" if approved else "⏳ PENDING" if approved is None else "✗ REJECTED"
print(f"\n[{i}] {status} - {timestamp}")
print("-" * 70)
print(f"DIAGNOSIS: {action.get('diagnosis', 'N/A')}")
print(f"\nPROPOSED ACTION: {action.get('proposed_action', 'N/A')}")
print(f"TYPE: {action.get('action_type', 'N/A')}")
print(f"RISK: {action.get('risk_level', 'N/A')}")
if action.get('commands'):
print(f"\nCOMMANDS:")
for cmd in action['commands']:
print(f" - {cmd}")
if action.get('config_changes'):
print(f"\nCONFIG CHANGES:")
for key, value in action['config_changes'].items():
print(f" {key}: {value}")
print(f"\nREASONING: {action.get('reasoning', 'N/A')}")
print("\n" + "="*70 + "\n")
else:
print("No pending actions")
elif sys.argv[1] == "approve" and len(sys.argv) > 2:
executor = SafeExecutor()
index = int(sys.argv[2])
success = executor.approve_action(index)
print(f"Approval {'succeeded' if success else 'failed'}")
elif sys.argv[1] == "reject" and len(sys.argv) > 2:
executor = SafeExecutor()
index = int(sys.argv[2])
success = executor.reject_action(index)
print(f"Action {'rejected and removed from queue' if success else 'rejection failed'}")

41
flake.nix Normal file

@@ -0,0 +1,41 @@
{
description = "Macha - AI-Powered Autonomous System Administrator";
inputs = {
nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
};
outputs = { self, nixpkgs }: {
# NixOS module
nixosModules.default = import ./module.nix;
# Alternative explicit name
nixosModules.macha-autonomous = import ./module.nix;
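# Example (hedged sketch): a consuming flake would add this repository to its
# inputs and import the module; the input name and URL below are illustrative:
#   inputs.macha.url = "git+https://git.coven.systems/lily/macha";
#   nixosConfigurations.myhost = nixpkgs.lib.nixosSystem {
#     modules = [ inputs.macha.nixosModules.default ];
#   };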
# For development
devShells = nixpkgs.lib.genAttrs [ "x86_64-linux" "aarch64-linux" ] (system:
let
pkgs = nixpkgs.legacyPackages.${system};
pythonEnv = pkgs.python3.withPackages (ps: with ps; [
requests
psutil
chromadb
]);
in {
default = pkgs.mkShell {
packages = [ pythonEnv pkgs.git ];
shellHook = ''
echo "Macha Autonomous Development Environment"
echo "Python packages: requests, psutil, chromadb"
'';
};
}
);
# Formatter
formatter = nixpkgs.lib.genAttrs [ "x86_64-linux" "aarch64-linux" ] (system:
nixpkgs.legacyPackages.${system}.nixpkgs-fmt
);
};
}

222
git_context.py Normal file

@@ -0,0 +1,222 @@
#!/usr/bin/env python3
"""
Git Context - Extract context from NixOS configuration repository
"""
import subprocess
from typing import Dict, List, Any, Optional
from datetime import datetime, timedelta
from pathlib import Path
class GitContext:
"""Extract context from git repository"""
def __init__(self, repo_path: str = "/etc/nixos"):
"""
Initialize git context extractor
Args:
repo_path: Path to the git repository (default: /etc/nixos for NixOS systems)
"""
self.repo_path = Path(repo_path)
def _run_git(self, args: List[str]) -> tuple[bool, str]:
"""Run git command"""
try:
result = subprocess.run(
["git", "-C", str(self.repo_path)] + args,
capture_output=True,
text=True,
timeout=10
)
return (result.returncode == 0, result.stdout.strip())
except Exception as e:
return (False, str(e))
def get_current_branch(self) -> str:
"""Get current git branch"""
success, output = self._run_git(["rev-parse", "--abbrev-ref", "HEAD"])
return output if success else "unknown"
def get_remote_url(self) -> str:
"""Get git remote URL"""
success, output = self._run_git(["remote", "get-url", "origin"])
return output if success else ""
def get_recent_commits(self, count: int = 10, since: str = "1 week ago") -> List[Dict[str, str]]:
"""
Get recent commits
Args:
count: Number of commits to retrieve
since: Time range (e.g., "1 week ago", "3 days ago")
Returns:
List of commit dictionaries with hash, author, date, message
"""
success, output = self._run_git([
"log",
f"--since={since}",
f"-n{count}",
"--format=%H|%an|%ar|%s"
])
if not success:
return []
commits = []
for line in output.split('\n'):
if not line.strip():
continue
parts = line.split('|', 3)
if len(parts) == 4:
commits.append({
"hash": parts[0][:8], # Short hash
"author": parts[1],
"date": parts[2],
"message": parts[3]
})
return commits
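# Example (illustrative): a log line such as
#   "a1b2c3d4e5f6...|Lily Miller|2 days ago|Fix ollama restart loop"
# parses into {"hash": "a1b2c3d4", "author": "Lily Miller",
# "date": "2 days ago", "message": "Fix ollama restart loop"}.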
def get_system_config_files(self, system_name: str) -> List[str]:
"""
Get configuration files for a specific system
Args:
system_name: Name of the system (e.g., "macha", "rhiannon")
Returns:
List of configuration file paths
"""
system_dir = self.repo_path / "systems" / system_name
config_files = []
if system_dir.exists():
# Main config
if (system_dir.parent / f"{system_name}.nix").exists():
config_files.append(f"systems/{system_name}.nix")
# System-specific configs
for config_file in system_dir.rglob("*.nix"):
config_files.append(str(config_file.relative_to(self.repo_path)))
return config_files
def get_recent_changes_for_system(self, system_name: str, since: str = "1 week ago") -> List[Dict[str, str]]:
"""
Get recent changes affecting a specific system
Args:
system_name: Name of the system
since: Time range
Returns:
List of commits that affected this system
"""
config_files = self.get_system_config_files(system_name)
if not config_files:
return []
# Get commits that touched these files
file_args = []
for f in config_files:
file_args.extend(["--", f])
success, output = self._run_git([
"log",
f"--since={since}",
"-n10",
"--format=%H|%an|%ar|%s"
] + file_args)
if not success:
return []
commits = []
for line in output.split('\n'):
if not line.strip():
continue
parts = line.split('|', 3)
if len(parts) == 4:
commits.append({
"hash": parts[0][:8],
"author": parts[1],
"date": parts[2],
"message": parts[3]
})
return commits
def get_system_context_summary(self, system_name: str) -> str:
"""
Get a summary of git context for a system
Args:
system_name: Name of the system
Returns:
Human-readable summary
"""
lines = []
# Repository info
repo_url = self.get_remote_url()
branch = self.get_current_branch()
if repo_url:
lines.append(f"Configuration Repository: {repo_url}")
lines.append(f"Branch: {branch}")
# Recent changes to this system
recent_changes = self.get_recent_changes_for_system(system_name, "2 weeks ago")
if recent_changes:
lines.append(f"\nRecent configuration changes (last 2 weeks):")
for commit in recent_changes[:5]:
lines.append(f" - {commit['date']}: {commit['message']} ({commit['author']})")
else:
lines.append("\nNo recent configuration changes")
return "\n".join(lines)
def get_all_managed_systems(self) -> List[str]:
"""
Get list of all systems managed by this repository
Returns:
List of system names
"""
systems = []
systems_dir = self.repo_path / "systems"
if systems_dir.exists():
for system_file in systems_dir.glob("*.nix"):
if system_file.stem not in ["default"]:
systems.append(system_file.stem)
return sorted(systems)
if __name__ == "__main__":
import sys
git = GitContext()
print("Repository:", git.get_remote_url())
print("Branch:", git.get_current_branch())
print("\nManaged Systems:")
for system in git.get_all_managed_systems():
print(f" - {system}")
print("\nRecent Commits:")
for commit in git.get_recent_commits(5):
print(f" {commit['hash']}: {commit['message']} - {commit['author']}, {commit['date']}")
if len(sys.argv) > 1:
system = sys.argv[1]
print(f"\nContext for {system}:")
print(git.get_system_context_summary(system))

219
issue_tracker.py Normal file

@@ -0,0 +1,219 @@
#!/usr/bin/env python3
"""
Issue Tracker - Internal ticketing system for tracking problems and their resolution
"""
import json
import uuid
from datetime import datetime
from typing import Dict, List, Any, Optional
from pathlib import Path
class IssueTracker:
"""Manages issue lifecycle: detection -> investigation -> resolution"""
def __init__(self, context_db, log_dir: str = "/var/lib/macha/logs"):
self.context_db = context_db
self.log_dir = Path(log_dir)
self.log_dir.mkdir(parents=True, exist_ok=True)
self.closed_log = self.log_dir / "closed_issues.jsonl"
def create_issue(
self,
hostname: str,
title: str,
description: str,
severity: str = "medium",
source: str = "auto-detected"
) -> str:
"""Create a new issue and return its ID"""
issue_id = str(uuid.uuid4())
now = datetime.utcnow().isoformat()
issue = {
"issue_id": issue_id,
"hostname": hostname,
"title": title,
"description": description,
"status": "open",
"severity": severity,
"created_at": now,
"updated_at": now,
"source": source,
"investigations": [],
"actions": [],
"resolution": None
}
self.context_db.store_issue(issue)
return issue_id
def get_issue(self, issue_id: str) -> Optional[Dict[str, Any]]:
"""Retrieve an issue by ID"""
return self.context_db.get_issue(issue_id)
def update_issue(
self,
issue_id: str,
status: Optional[str] = None,
investigation: Optional[Dict[str, Any]] = None,
action: Optional[Dict[str, Any]] = None
) -> bool:
"""Update an issue with new information"""
issue = self.get_issue(issue_id)
if not issue:
return False
if status:
issue["status"] = status
if investigation:
investigation["timestamp"] = datetime.utcnow().isoformat()
issue["investigations"].append(investigation)
if action:
action["timestamp"] = datetime.utcnow().isoformat()
issue["actions"].append(action)
issue["updated_at"] = datetime.utcnow().isoformat()
self.context_db.update_issue(issue)
return True
def find_similar_issue(
self,
hostname: str,
title: str,
description: Optional[str] = None
) -> Optional[Dict[str, Any]]:
"""Find an existing open issue that matches this problem"""
open_issues = self.list_issues(hostname=hostname, status="open")
# Simple similarity check on title
title_lower = title.lower()
for issue in open_issues:
issue_title_lower = issue.get("title", "").lower()
# Check for keyword overlap
title_words = set(title_lower.split())
issue_words = set(issue_title_lower.split())
# If >50% of words overlap, consider it similar
if len(title_words & issue_words) / max(len(title_words), 1) > 0.5:
return issue
return None
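# Worked example (illustrative): new title "ollama service failed" vs an open
# issue titled "ollama service crash loop" -> overlap {ollama, service} is 2
# of the 3 new-title words (0.67 > 0.5), so the existing issue is returned.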
def list_issues(
self,
hostname: Optional[str] = None,
status: Optional[str] = None,
severity: Optional[str] = None
) -> List[Dict[str, Any]]:
"""List issues with optional filters"""
return self.context_db.list_issues(
hostname=hostname,
status=status,
severity=severity
)
def resolve_issue(self, issue_id: str, resolution: str) -> bool:
"""Mark an issue as resolved with a resolution note"""
issue = self.get_issue(issue_id)
if not issue:
return False
issue["status"] = "resolved"
issue["resolution"] = resolution
issue["updated_at"] = datetime.utcnow().isoformat()
self.context_db.update_issue(issue)
return True
def close_issue(self, issue_id: str) -> bool:
"""Archive a resolved issue to the closed log"""
issue = self.get_issue(issue_id)
if not issue:
return False
# Can only close resolved issues
if issue["status"] != "resolved":
return False
issue["status"] = "closed"
issue["closed_at"] = datetime.utcnow().isoformat()
# Archive to closed log
self._archive_issue(issue)
# Remove from active database
self.context_db.delete_issue(issue_id)
return True
def get_issue_history(self, issue_id: str) -> Dict[str, Any]:
"""Get full history for an issue (investigations + actions)"""
issue = self.get_issue(issue_id)
if not issue:
return {}
return {
"issue": issue,
"investigation_count": len(issue.get("investigations", [])),
"action_count": len(issue.get("actions", [])),
"age_hours": self._calculate_age(issue["created_at"]),
"last_activity": issue["updated_at"]
}
def auto_resolve_if_fixed(self, hostname: str, detected_problems: List[str]) -> int:
"""
Auto-resolve open issues if their problems are no longer detected.
Returns count of auto-resolved issues.
"""
open_issues = self.list_issues(hostname=hostname, status="open")
resolved_count = 0
# Convert detected problems to lowercase for comparison
detected_lower = [p.lower() for p in detected_problems]
for issue in open_issues:
title_lower = issue.get("title", "").lower()
desc_lower = issue.get("description", "").lower()
# Check if issue keywords are still in detected problems
still_present = False
for detected in detected_lower:
if any(word in detected for word in title_lower.split()) or \
any(word in detected for word in desc_lower.split()):
still_present = True
break
# If problem is no longer detected, auto-resolve
if not still_present:
self.resolve_issue(
issue["issue_id"],
"Auto-resolved: Problem no longer detected in system monitoring"
)
resolved_count += 1
return resolved_count
def _archive_issue(self, issue: Dict[str, Any]):
"""Append closed issue to the archive log"""
try:
with open(self.closed_log, "a") as f:
f.write(json.dumps(issue) + "\n")
except Exception as e:
print(f"Failed to archive issue {issue.get('issue_id')}: {e}")
def _calculate_age(self, created_at: str) -> float:
"""Calculate age of issue in hours"""
try:
created = datetime.fromisoformat(created_at)
now = datetime.utcnow()
delta = now - created
return delta.total_seconds() / 3600
except Exception:
return 0.0

358
journal_monitor.py Normal file

@@ -0,0 +1,358 @@
#!/usr/bin/env python3
"""
Journal Monitor - Monitor remote systems via centralized journald
"""
import json
import subprocess
from typing import Dict, List, Any, Optional, Set
from datetime import datetime, timedelta
from pathlib import Path
from collections import defaultdict
class JournalMonitor:
"""Monitor systems via centralized journald logs"""
def __init__(self, domain: str = "coven.systems"):
"""
Initialize journal monitor
Args:
domain: Domain suffix for FQDNs
"""
self.domain = domain
self.known_hosts: Set[str] = set()
def _run_journalctl(self, args: List[str], timeout: int = 30) -> tuple[bool, str, str]:
"""
Run journalctl command
Args:
args: Arguments to journalctl
timeout: Timeout in seconds
Returns:
(success, stdout, stderr)
"""
try:
cmd = ["journalctl"] + args
result = subprocess.run(
cmd,
capture_output=True,
text=True,
timeout=timeout
)
return (
result.returncode == 0,
result.stdout.strip(),
result.stderr.strip()
)
except subprocess.TimeoutExpired:
return False, "", f"Command timed out after {timeout}s"
except Exception as e:
return False, "", str(e)
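# Example (illustrative): _run_journalctl(["-u", "sshd", "-n", "10"]) is
# equivalent to running `journalctl -u sshd -n 10` on this host.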
def discover_hosts(self) -> List[str]:
"""
Discover hosts reporting to centralized journal
Returns:
List of discovered FQDNs
"""
success, output, _ = self._run_journalctl([
"--output=json",
"--since=1 day ago",
"-n", "10000"
])
if not success:
return []
hosts = set()
for line in output.split('\n'):
if not line.strip():
continue
try:
entry = json.loads(line)
hostname = entry.get('_HOSTNAME', '')
# Ensure FQDN format
if hostname and not hostname.endswith(f'.{self.domain}'):
if '.' not in hostname:
hostname = f"{hostname}.{self.domain}"
if hostname:
hosts.add(hostname)
except json.JSONDecodeError:
continue
self.known_hosts = hosts
return sorted(hosts)
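# Example (illustrative): a journal entry with _HOSTNAME "rhiannon" is
# normalized to "rhiannon.coven.systems"; a name that already contains a dot
# (e.g. "alexander.coven.systems") is kept as-is.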
def collect_resources(self, hostname: str, since: str = "5 minutes ago") -> Dict[str, Any]:
"""
Collect resource usage from journal entries
This extracts CPU/memory info from systemd service messages
"""
# For now, return empty - we'll primarily use this for service/log monitoring
# Resource metrics could be added if systems log them
return {
"cpu_percent": 0,
"memory_percent": 0,
"load_average": {"1min": 0, "5min": 0, "15min": 0}
}
def collect_systemd_status(self, hostname: str, since: str = "5 minutes ago") -> Dict[str, Any]:
"""
Collect systemd service status from journal
Args:
hostname: FQDN of the system
since: Time range to check
Returns:
Dictionary with failed service information
"""
# Query for systemd service failures
success, output, _ = self._run_journalctl([
f"_HOSTNAME={hostname}",
"--priority=err",
"--unit=*.service",
f"--since={since}",
"--output=json"
])
if not success:
return {"failed_count": 0, "failed_services": []}
failed_services = {}
for line in output.split('\n'):
if not line.strip():
continue
try:
entry = json.loads(line)
unit = entry.get('_SYSTEMD_UNIT', '')
if unit and unit.endswith('.service'):
service_name = unit.replace('.service', '')
if service_name not in failed_services:
failed_services[service_name] = {
"unit": unit,
"message": entry.get('MESSAGE', ''),
"timestamp": entry.get('__REALTIME_TIMESTAMP', '')
}
except json.JSONDecodeError:
continue
return {
"failed_count": len(failed_services),
"failed_services": list(failed_services.values())
}
def collect_log_errors(self, hostname: str, since: str = "1 hour ago") -> Dict[str, Any]:
"""
Collect error logs from journal
Args:
hostname: FQDN of the system
since: Time range to check
Returns:
Dictionary with error log information
"""
success, output, _ = self._run_journalctl([
f"_HOSTNAME={hostname}",
"--priority=err",
f"--since={since}",
"--output=json"
])
if not success:
return {"error_count_1h": 0, "recent_errors": []}
errors = []
error_count = 0
for line in output.split('\n'):
if not line.strip():
continue
try:
entry = json.loads(line)
error_count += 1
if len(errors) < 10: # Keep last 10 errors
errors.append({
"message": entry.get('MESSAGE', ''),
"unit": entry.get('_SYSTEMD_UNIT', 'unknown'),
"priority": entry.get('PRIORITY', ''),
"timestamp": entry.get('__REALTIME_TIMESTAMP', '')
})
except json.JSONDecodeError:
continue
return {
"error_count_1h": error_count,
"recent_errors": errors
}
def collect_disk_usage(self, hostname: str) -> Dict[str, Any]:
"""
Collect disk usage - Note: This would require systems to log disk metrics
For now, returns empty. Could be enhanced if systems periodically log disk usage
"""
return {"partitions": []}
def collect_network_status(self, hostname: str, since: str = "5 minutes ago") -> Dict[str, Any]:
"""
Check network connectivity based on recent journal activity
If we see recent logs from a host, it's reachable
"""
success, output, _ = self._run_journalctl([
f"_HOSTNAME={hostname}",
f"--since={since}",
"-n", "1",
"--output=json"
])
# If we got recent logs, network is working
internet_reachable = bool(success and output.strip())
return {
"internet_reachable": internet_reachable,
"last_seen": datetime.now().isoformat() if internet_reachable else None
}
def collect_all(self, hostname: str) -> Dict[str, Any]:
"""
Collect all monitoring data for a host from journal
Args:
hostname: FQDN of the system to monitor
Returns:
Complete monitoring data
"""
# First check if we have recent logs from this host
net_status = self.collect_network_status(hostname)
if not net_status.get("internet_reachable"):
return {
"hostname": hostname,
"reachable": False,
"error": "No recent journal entries from this host"
}
return {
"hostname": hostname,
"reachable": True,
"source": "journal",
"resources": self.collect_resources(hostname),
"systemd": self.collect_systemd_status(hostname),
"disk": self.collect_disk_usage(hostname),
"network": net_status,
"logs": self.collect_log_errors(hostname),
}
def get_summary(self, data: Dict[str, Any]) -> str:
"""Generate human-readable summary from journal data"""
hostname = data.get("hostname", "unknown")
if not data.get("reachable", False):
return f"{hostname}: {data.get('error', 'Unreachable')}"
lines = [f"System: {hostname} (via journal)"]
# Services
systemd = data.get("systemd", {})
failed_count = systemd.get("failed_count", 0)
if failed_count > 0:
lines.append(f"Services: {failed_count} failed")
for svc in systemd.get("failed_services", [])[:3]:
lines.append(f" - {svc.get('unit', 'unknown')}")
else:
lines.append("Services: No recent failures")
# Network
net = data.get("network", {})
last_seen = net.get("last_seen")
if last_seen:
lines.append(f"Last seen: {last_seen}")
# Logs
logs = data.get("logs", {})
error_count = logs.get("error_count_1h", 0)
if error_count > 0:
lines.append(f"Recent logs: {error_count} errors in last hour")
return "\n".join(lines)
def get_active_services(self, hostname: str, since: str = "1 hour ago") -> List[str]:
"""
Get list of active services on a host by looking at journal entries
This helps with auto-discovery of what's running on each system
"""
success, output, _ = self._run_journalctl([
f"_HOSTNAME={hostname}",
f"--since={since}",
"--output=json",
"-n", "1000"
])
if not success:
return []
services = set()
for line in output.split('\n'):
if not line.strip():
continue
try:
entry = json.loads(line)
unit = entry.get('_SYSTEMD_UNIT', '')
if unit and unit.endswith('.service'):
# Extract service name
service = unit.replace('.service', '')
# Filter out common system services, focus on application services
if service not in ['systemd-journald', 'systemd-logind', 'sshd', 'dbus']:
services.add(service)
except json.JSONDecodeError:
continue
return sorted(services)
if __name__ == "__main__":
import sys
monitor = JournalMonitor()
# Discover hosts
print("Discovering hosts from journal...")
hosts = monitor.discover_hosts()
print(f"Found {len(hosts)} hosts:")
for host in hosts:
print(f" - {host}")
# Monitor first host if available
if hosts:
hostname = hosts[0]
print(f"\nMonitoring {hostname}...")
data = monitor.collect_all(hostname)
print("\n" + "="*60)
print(monitor.get_summary(data))
print("="*60)
# Discover services
print(f"\nActive services on {hostname}:")
services = monitor.get_active_services(hostname)
for svc in services[:10]:
print(f" - {svc}")

847
module.nix Normal file

@@ -0,0 +1,847 @@
{ config, lib, pkgs, ... }:
with lib;
let
cfg = config.services.macha-autonomous;
# Python environment with all dependencies
pythonEnv = pkgs.python3.withPackages (ps: with ps; [
requests
psutil
chromadb
]);
# Main autonomous system package
macha-autonomous = pkgs.writeScriptBin "macha-autonomous" ''
#!${pythonEnv}/bin/python3
import sys
sys.path.insert(0, "${./.}")
from orchestrator import main
main()
'';
# Config file
configFile = pkgs.writeText "macha-autonomous-config.json" (builtins.toJSON {
check_interval = cfg.checkInterval;
autonomy_level = cfg.autonomyLevel;
ollama_host = cfg.ollamaHost;
model = cfg.model;
config_repo = cfg.configRepo;
config_branch = cfg.configBranch;
});
in {
options.services.macha-autonomous = {
enable = mkEnableOption "Macha autonomous system maintenance";
autonomyLevel = mkOption {
type = types.enum [ "observe" "suggest" "auto-safe" "auto-full" ];
default = "suggest";
description = ''
Level of autonomy for the system:
- observe: Only monitor and log, no actions
- suggest: Propose actions, require manual approval
- auto-safe: Auto-execute low-risk actions (restarts, cleanup)
- auto-full: Full autonomy with safety limits (still requires approval for high-risk)
'';
};
checkInterval = mkOption {
type = types.int;
default = 300;
description = "Interval in seconds between system checks";
};
ollamaHost = mkOption {
type = types.str;
default = "http://localhost:11434";
description = "Ollama API host";
};
model = mkOption {
type = types.str;
default = "llama3.1:70b";
description = "LLM model to use for reasoning";
};
user = mkOption {
type = types.str;
default = "macha";
description = "User to run the autonomous system as";
};
group = mkOption {
type = types.str;
default = "macha";
description = "Group to run the autonomous system as";
};
gotifyUrl = mkOption {
type = types.str;
default = "";
example = "http://rhiannon:8181";
description = "Gotify server URL for notifications (empty to disable)";
};
gotifyToken = mkOption {
type = types.str;
default = "";
description = "Gotify application token for notifications";
};
remoteSystems = mkOption {
type = types.listOf types.str;
default = [];
example = [ "rhiannon" "alexander" ];
description = "List of remote NixOS systems to monitor and maintain";
};
configRepo = mkOption {
type = types.str;
default = if config.programs.nh.flake != null
then config.programs.nh.flake
else "git+https://git.coven.systems/lily/nixos-servers";
description = "URL of the NixOS configuration repository (auto-detected from programs.nh.flake if available)";
};
configBranch = mkOption {
type = types.str;
default = "main";
description = "Branch of the NixOS configuration repository";
};
};
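# Example (hedged sketch; hostnames and URL are placeholders):
#   services.macha-autonomous = {
#     enable = true;
#     autonomyLevel = "suggest";
#     remoteSystems = [ "rhiannon" "alexander" ];
#     gotifyUrl = "http://rhiannon:8181";
#   };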
config = mkIf cfg.enable {
# Create user and group
users.users.${cfg.user} = {
isSystemUser = true;
group = cfg.group;
uid = 2501;
description = "Macha autonomous system maintenance";
home = "/var/lib/macha";
createHome = true;
};
users.groups.${cfg.group} = {};
# Git configuration for credential storage
programs.git = {
enable = true;
config = {
credential.helper = "store";
};
};
# Ollama service for AI inference
services.ollama = {
enable = true;
acceleration = "rocm";
host = "0.0.0.0";
port = 11434;
environmentVariables = {
"OLLAMA_DEBUG" = "1";
"OLLAMA_KEEP_ALIVE" = "600";
"OLLAMA_NEW_ENGINE" = "true";
"OLLAMA_CONTEXT_LENGTH" = "131072";
};
openFirewall = false; # Keep internal only
loadModels = [
"qwen3"
"gpt-oss"
"gemma3"
"gpt-oss:20b"
"qwen3:4b-instruct-2507-fp16"
"qwen3:8b-fp16"
"mistral:7b"
"chroma/all-minilm-l6-v2-f32:latest"
];
};
# ChromaDB service for vector storage
services.chromadb = {
enable = true;
port = 8000;
dbpath = "/var/lib/chromadb";
};
# Give the user permissions it needs
security.sudo.extraRules = [{
users = [ cfg.user ];
commands = [
# Local system management
{ command = "${pkgs.systemd}/bin/systemctl restart *"; options = [ "NOPASSWD" ]; }
{ command = "${pkgs.systemd}/bin/systemctl status *"; options = [ "NOPASSWD" ]; }
{ command = "${pkgs.systemd}/bin/journalctl *"; options = [ "NOPASSWD" ]; }
{ command = "${pkgs.nix}/bin/nix-collect-garbage *"; options = [ "NOPASSWD" ]; }
# Remote system access (uses existing root SSH keys)
{ command = "${pkgs.openssh}/bin/ssh *"; options = [ "NOPASSWD" ]; }
{ command = "${pkgs.openssh}/bin/scp *"; options = [ "NOPASSWD" ]; }
{ command = "${pkgs.nixos-rebuild}/bin/nixos-rebuild *"; options = [ "NOPASSWD" ]; }
];
}];
# Config file
environment.etc."macha-autonomous/config.json".source = configFile;
# State directory and queue directories (world-writable queues for multi-user access)
# Using 'z' to set permissions even if directory exists
systemd.tmpfiles.rules = [
"d /var/lib/macha 0755 ${cfg.user} ${cfg.group} -"
"z /var/lib/macha 0755 ${cfg.user} ${cfg.group} -" # Ensure permissions are set
"d /var/lib/macha/queues 0777 ${cfg.user} ${cfg.group} -"
"d /var/lib/macha/queues/ollama 0777 ${cfg.user} ${cfg.group} -"
"d /var/lib/macha/queues/ollama/pending 0777 ${cfg.user} ${cfg.group} -"
"d /var/lib/macha/queues/ollama/processing 0777 ${cfg.user} ${cfg.group} -"
"d /var/lib/macha/queues/ollama/completed 0777 ${cfg.user} ${cfg.group} -"
"d /var/lib/macha/queues/ollama/failed 0777 ${cfg.user} ${cfg.group} -"
"d /var/lib/macha/tool_cache 0777 ${cfg.user} ${cfg.group} -"
];
# Systemd service
systemd.services.macha-autonomous = {
description = "Macha Autonomous System Maintenance";
after = [ "network.target" "ollama.service" ];
wants = [ "ollama.service" ];
wantedBy = [ "multi-user.target" ];
serviceConfig = {
Type = "simple";
User = cfg.user;
Group = cfg.group;
WorkingDirectory = "/var/lib/macha";
ExecStart = "${macha-autonomous}/bin/macha-autonomous --mode continuous --autonomy ${cfg.autonomyLevel} --interval ${toString cfg.checkInterval}";
Restart = "on-failure";
RestartSec = "30s";
# Security hardening
PrivateTmp = true;
NoNewPrivileges = false; # Need privileges for sudo
ProtectSystem = "strict";
ProtectHome = true;
ReadWritePaths = [ "/var/lib/macha" "/var/lib/macha/tool_cache" "/var/lib/macha/queues" ];
# Resource limits
MemoryLimit = "1G";
CPUQuota = "50%";
};
environment = {
PYTHONPATH = toString ./.;
GOTIFY_URL = cfg.gotifyUrl;
GOTIFY_TOKEN = cfg.gotifyToken;
CHROMA_ENV_FILE = ""; # Prevent ChromaDB from trying to read .env files
ANONYMIZED_TELEMETRY = "False"; # Disable ChromaDB telemetry
};
path = [ pkgs.git ]; # Make git available for config parsing
};
# Ollama Queue Worker Service (serializes all Ollama requests)
systemd.services.ollama-queue-worker = {
description = "Macha Ollama Queue Worker";
after = [ "network.target" "ollama.service" ];
wants = [ "ollama.service" ];
wantedBy = [ "multi-user.target" ];
serviceConfig = {
Type = "simple";
User = cfg.user;
Group = cfg.group;
WorkingDirectory = "/var/lib/macha";
ExecStart = "${pythonEnv}/bin/python3 ${./.}/ollama_worker.py";
Restart = "on-failure";
RestartSec = "10s";
# Security hardening
PrivateTmp = true;
NoNewPrivileges = true;
ProtectSystem = "strict";
ProtectHome = true;
ReadWritePaths = [ "/var/lib/macha/queues" "/var/lib/macha/tool_cache" ];
# Resource limits
MemoryLimit = "512M";
CPUQuota = "25%";
};
environment = {
PYTHONPATH = toString ./.;
CHROMA_ENV_FILE = "";
ANONYMIZED_TELEMETRY = "False";
};
};
# CLI tools for manual control and system packages
environment.systemPackages = with pkgs; [
macha-autonomous
# Python packages for ChromaDB
python313
python313Packages.pip
python313Packages.chromadb.pythonModule
# Tool to check approval queue
(pkgs.writeScriptBin "macha-approve" ''
#!${pkgs.bash}/bin/bash
if [ "$1" == "list" ]; then
sudo -u ${cfg.user} ${pythonEnv}/bin/python3 ${./.}/executor.py queue
elif [ "$1" == "discuss" ] && [ -n "$2" ]; then
ACTION_ID="$2"
echo "==================================================================="
echo "Interactive Discussion with Macha about Action #$ACTION_ID"
echo "==================================================================="
echo ""
# Initial explanation
sudo -u ${cfg.user} ${pkgs.coreutils}/bin/env CHROMA_ENV_FILE="" ANONYMIZED_TELEMETRY="False" ${pythonEnv}/bin/python3 ${./.}/conversation.py --discuss "$ACTION_ID"
echo ""
echo "==================================================================="
echo "You can now ask follow-up questions about this action."
echo "Type 'approve' to approve it, 'reject' to reject it, or 'exit' to quit."
echo "==================================================================="
# Interactive loop
while true; do
echo ""
echo -n "You: "
read -r USER_INPUT
# Check for special commands
if [ "$USER_INPUT" = "exit" ] || [ "$USER_INPUT" = "quit" ] || [ -z "$USER_INPUT" ]; then
echo "Exiting discussion."
break
elif [ "$USER_INPUT" = "approve" ]; then
echo "Approving action #$ACTION_ID..."
sudo -u ${cfg.user} ${pythonEnv}/bin/python3 ${./.}/executor.py approve "$ACTION_ID"
break
elif [ "$USER_INPUT" = "reject" ]; then
echo "Rejecting and removing action #$ACTION_ID from queue..."
sudo -u ${cfg.user} ${pythonEnv}/bin/python3 ${./.}/executor.py reject "$ACTION_ID"
break
fi
# Ask Macha the follow-up question in context of the action
echo ""
echo -n "Macha: "
sudo -u ${cfg.user} ${pkgs.coreutils}/bin/env CHROMA_ENV_FILE="" ANONYMIZED_TELEMETRY="False" ${pythonEnv}/bin/python3 ${./.}/conversation.py --discuss "$ACTION_ID" --follow-up "$USER_INPUT"
echo ""
done
elif [ "$1" == "approve" ] && [ -n "$2" ]; then
sudo -u ${cfg.user} ${pythonEnv}/bin/python3 ${./.}/executor.py approve "$2"
elif [ "$1" == "reject" ] && [ -n "$2" ]; then
sudo -u ${cfg.user} ${pythonEnv}/bin/python3 ${./.}/executor.py reject "$2"
else
echo "Usage:"
echo " macha-approve list - Show pending actions"
echo " macha-approve discuss <N> - Discuss action number N with Macha (interactive)"
echo " macha-approve approve <N> - Approve action number N"
echo " macha-approve reject <N> - Reject and remove action number N from queue"
fi
'')
# Tool to run manual check
(pkgs.writeScriptBin "macha-check" ''
#!${pkgs.bash}/bin/bash
sudo -u ${cfg.user} sh -c 'cd /var/lib/macha && CHROMA_ENV_FILE="" ANONYMIZED_TELEMETRY="False" ${macha-autonomous}/bin/macha-autonomous --mode once --autonomy ${cfg.autonomyLevel}'
'')
# Tool to view logs
(pkgs.writeScriptBin "macha-logs" ''
#!${pkgs.bash}/bin/bash
case "$1" in
orchestrator)
sudo tail -f /var/lib/macha/orchestrator.log
;;
decisions)
sudo tail -f /var/lib/macha/decisions.jsonl
;;
actions)
sudo tail -f /var/lib/macha/actions.jsonl
;;
service)
journalctl -u macha-autonomous.service -f
;;
*)
echo "Usage: macha-logs [orchestrator|decisions|actions|service]"
;;
esac
'')
# Tool to send test notification
(pkgs.writeScriptBin "macha-notify" ''
#!${pkgs.bash}/bin/bash
if [ -z "$1" ] || [ -z "$2" ]; then
echo "Usage: macha-notify <title> <message> [priority]"
echo "Example: macha-notify 'Test' 'This is a test' 5"
echo "Priorities: 2 (low), 5 (medium), 8 (high)"
exit 1
fi
export GOTIFY_URL="${cfg.gotifyUrl}"
export GOTIFY_TOKEN="${cfg.gotifyToken}"
${pythonEnv}/bin/python3 ${./.}/notifier.py "$1" "$2" "''${3:-5}"
'')
# Tool to query config files
(pkgs.writeScriptBin "macha-configs" ''
#!${pkgs.bash}/bin/bash
export PYTHONPATH=${toString ./.}
export CHROMA_ENV_FILE=""
export ANONYMIZED_TELEMETRY="False"
if [ $# -eq 0 ]; then
echo "Usage: macha-configs <search-query> [system-name]"
echo "Examples:"
echo " macha-configs gotify"
echo " macha-configs 'journald configuration'"
echo " macha-configs ollama macha.coven.systems"
exit 1
fi
QUERY="$1"
SYSTEM="''${2:-}"
${pythonEnv}/bin/python3 -c "
from context_db import ContextDatabase
import sys
db = ContextDatabase()
query = sys.argv[1]
system = sys.argv[2] if len(sys.argv) > 2 else None
print(f'Searching for: {query}')
if system:
print(f'Filtered to system: {system}')
print('='*60)
configs = db.query_config_files(query, system=system, n_results=5)
if not configs:
print('No matching configuration files found.')
else:
for i, cfg in enumerate(configs, 1):
print(f\"\\n{i}. {cfg['path']} (relevance: {cfg['relevance']:.1%})\")
print(f\" Category: {cfg['metadata']['category']}\")
print(' Preview:')
preview = cfg['content'][:300].replace('\\n', '\\n ')
print(f' {preview}')
if len(cfg['content']) > 300:
print(' ... (use macha-configs-read to see full file)')
" "$QUERY" "$SYSTEM"
'')
# Interactive chat tool (runs as invoking user, not as macha-autonomous)
(pkgs.writeScriptBin "macha-chat" ''
#!${pkgs.bash}/bin/bash
export PYTHONPATH=${toString ./.}
export CHROMA_ENV_FILE=""
export ANONYMIZED_TELEMETRY="False"
# Run as the current user, not as macha-autonomous
# This allows the chat to execute privileged commands with the user's permissions
${pythonEnv}/bin/python3 ${./.}/chat.py
'')
# Tool to read full config file
(pkgs.writeScriptBin "macha-configs-read" ''
#!${pkgs.bash}/bin/bash
export PYTHONPATH=${toString ./.}
export CHROMA_ENV_FILE=""
export ANONYMIZED_TELEMETRY="False"
if [ $# -eq 0 ]; then
echo "Usage: macha-configs-read <file-path>"
echo "Example: macha-configs-read apps/gotify.nix"
exit 1
fi
${pythonEnv}/bin/python3 -c "
from context_db import ContextDatabase
import sys
db = ContextDatabase()
file_path = sys.argv[1]
cfg = db.get_config_file(file_path)
if not cfg:
print(f'Config file not found: {file_path}')
sys.exit(1)
print(f'File: {cfg[\"path\"]}')
print(f'Category: {cfg[\"metadata\"][\"category\"]}')
print('='*60)
print(cfg['content'])
" "$1"
'')
# Tool to view system registry
(pkgs.writeScriptBin "macha-systems" ''
#!${pkgs.bash}/bin/bash
export PYTHONPATH=${toString ./.}
export CHROMA_ENV_FILE=""
export ANONYMIZED_TELEMETRY="False"
${pythonEnv}/bin/python3 -c "
from context_db import ContextDatabase
import json
db = ContextDatabase()
systems = db.get_all_systems()
print('Registered Systems:')
print('='*60)
for system in systems:
os_type = system.get('os_type', 'unknown').upper()
print(f\"\\n{system['hostname']} ({system['type']}) [{os_type}]\")
print(f\" Config Repo: {system.get('config_repo') or '(not set)'}\")
print(f\" Branch: {system.get('config_branch', 'unknown')}\")
if system.get('services'):
print(f\" Services: {', '.join(system['services'][:10])}\")
if len(system['services']) > 10:
print(f\" ... and {len(system['services']) - 10} more\")
if system.get('capabilities'):
print(f\" Capabilities: {', '.join(system['capabilities'])}\")
print('='*60)
"
'')
# Tool to ask Macha questions
(pkgs.writeScriptBin "macha-ask" ''
#!${pkgs.bash}/bin/bash
if [ $# -eq 0 ]; then
echo "Usage: macha-ask <your question>"
echo "Example: macha-ask Why did you recommend restarting that service?"
exit 1
fi
sudo -u ${cfg.user} ${pkgs.coreutils}/bin/env CHROMA_ENV_FILE="" ANONYMIZED_TELEMETRY="False" ${pythonEnv}/bin/python3 ${./.}/conversation.py "$@"
'')
# Issue tracking CLI
(pkgs.writeScriptBin "macha-issues" ''
#!${pythonEnv}/bin/python3
import sys
import os
os.environ["CHROMA_ENV_FILE"] = ""
os.environ["ANONYMIZED_TELEMETRY"] = "False"
sys.path.insert(0, "${./.}")
from context_db import ContextDatabase
from issue_tracker import IssueTracker
from datetime import datetime
import json
db = ContextDatabase()
tracker = IssueTracker(db)
def list_issues(show_all=False):
"""List issues"""
if show_all:
issues = tracker.list_issues()
else:
issues = tracker.list_issues(status="open")
if not issues:
print("No issues found")
return
print("="*70)
print(f"ISSUES: {len(issues)}")
print("="*70)
for issue in issues:
issue_id = issue['issue_id'][:8]
age_hours = (datetime.utcnow() - datetime.fromisoformat(issue['created_at'])).total_seconds() / 3600
inv_count = len(issue.get('investigations', []))
action_count = len(issue.get('actions', []))
print(f"\n[{issue_id}] {issue['title']}")
print(f" Host: {issue['hostname']}")
print(f" Status: {issue['status'].upper()} | Severity: {issue['severity'].upper()}")
print(f" Age: {age_hours:.1f}h | Activity: {inv_count} investigations, {action_count} actions")
print(f" Source: {issue['source']}")
if issue.get('resolution'):
print(f" Resolution: {issue['resolution']}")
def show_issue(issue_id):
"""Show detailed issue information"""
# Find issue by partial ID
all_issues = tracker.list_issues()
matching = [i for i in all_issues if i['issue_id'].startswith(issue_id)]
if not matching:
print(f"Issue {issue_id} not found")
return
issue = matching[0]
full_id = issue['issue_id']
print("="*70)
print(f"ISSUE: {issue['title']}")
print("="*70)
print(f"ID: {full_id}")
print(f"Host: {issue['hostname']}")
print(f"Status: {issue['status'].upper()}")
print(f"Severity: {issue['severity'].upper()}")
print(f"Source: {issue['source']}")
print(f"Created: {issue['created_at']}")
print(f"Updated: {issue['updated_at']}")
print(f"\nDescription:\n{issue['description']}")
investigations = issue.get('investigations', [])
if investigations:
print(f"\n{''*70}")
print(f"INVESTIGATIONS ({len(investigations)}):")
for i, inv in enumerate(investigations, 1):
print(f"\n [{i}] {inv.get('timestamp', 'N/A')}")
print(f" Diagnosis: {inv.get('diagnosis', 'N/A')}")
print(f" Commands: {', '.join(inv.get('commands', []))}")
print(f" Success: {inv.get('success', False)}")
if inv.get('output'):
print(f" Output: {inv['output'][:200]}...")
actions = issue.get('actions', [])
if actions:
print(f"\n{''*70}")
print(f"ACTIONS ({len(actions)}):")
for i, action in enumerate(actions, 1):
print(f"\n [{i}] {action.get('timestamp', 'N/A')}")
print(f" Action: {action.get('proposed_action', 'N/A')}")
print(f" Risk: {action.get('risk_level', 'N/A').upper()}")
print(f" Commands: {', '.join(action.get('commands', []))}")
print(f" Success: {action.get('success', False)}")
if issue.get('resolution'):
print(f"\n{''*70}")
print(f"RESOLUTION:")
print(f" {issue['resolution']}")
print("="*70)
def create_issue(description):
"""Create a new issue manually"""
import socket
hostname = f"{socket.gethostname()}.coven.systems"
issue_id = tracker.create_issue(
hostname=hostname,
title=description[:100],
description=description,
severity="medium",
source="user-reported"
)
print(f"Created issue: {issue_id[:8]}")
print(f"Title: {description[:100]}")
def resolve_issue(issue_id, resolution="Manually resolved"):
"""Mark an issue as resolved"""
# Find issue by partial ID
all_issues = tracker.list_issues()
matching = [i for i in all_issues if i['issue_id'].startswith(issue_id)]
if not matching:
print(f"Issue {issue_id} not found")
return
full_id = matching[0]['issue_id']
success = tracker.resolve_issue(full_id, resolution)
if success:
print(f"Resolved issue {issue_id[:8]}")
else:
print(f"Failed to resolve issue {issue_id}")
def close_issue(issue_id):
"""Archive a resolved issue"""
# Find issue by partial ID
all_issues = tracker.list_issues()
matching = [i for i in all_issues if i['issue_id'].startswith(issue_id)]
if not matching:
print(f"Issue {issue_id} not found")
return
full_id = matching[0]['issue_id']
if matching[0]['status'] != 'resolved':
print(f"Issue {issue_id} must be resolved before closing")
print(f"Use: macha-issues resolve {issue_id}")
return
success = tracker.close_issue(full_id)
if success:
print(f"Closed and archived issue {issue_id[:8]}")
else:
print(f"Failed to close issue {issue_id}")
# Main CLI
if len(sys.argv) < 2:
print("Usage: macha-issues <command> [options]")
print("")
print("Commands:")
print(" list List open issues")
print(" list --all List all issues (including resolved/closed)")
print(" show <id> Show detailed issue information")
print(" create <desc> Create a new issue manually")
print(" resolve <id> Mark issue as resolved")
print(" close <id> Archive a resolved issue")
sys.exit(1)
command = sys.argv[1]
if command == "list":
show_all = "--all" in sys.argv
list_issues(show_all)
elif command == "show" and len(sys.argv) >= 3:
show_issue(sys.argv[2])
elif command == "create" and len(sys.argv) >= 3:
description = " ".join(sys.argv[2:])
create_issue(description)
elif command == "resolve" and len(sys.argv) >= 3:
resolution = " ".join(sys.argv[3:]) if len(sys.argv) > 3 else "Manually resolved"
resolve_issue(sys.argv[2], resolution)
elif command == "close" and len(sys.argv) >= 3:
close_issue(sys.argv[2])
else:
print(f"Unknown command: {command}")
sys.exit(1)
'')
# Knowledge base CLI
(pkgs.writeScriptBin "macha-knowledge" ''
#!${pythonEnv}/bin/python3
import sys
import os
os.environ["CHROMA_ENV_FILE"] = ""
os.environ["ANONYMIZED_TELEMETRY"] = "False"
sys.path.insert(0, "${./.}")
from context_db import ContextDatabase
db = ContextDatabase()
def list_topics(category=None):
"""List all knowledge topics"""
topics = db.list_knowledge_topics(category)
if not topics:
print("No knowledge topics found.")
return
print(f"{'='*70}")
if category:
print(f"KNOWLEDGE TOPICS ({category.upper()}):")
else:
print(f"KNOWLEDGE TOPICS:")
print(f"{'='*70}")
for topic in topics:
print(f" {topic}")
print(f"{'='*70}")
def show_topic(topic):
"""Show all knowledge for a topic"""
items = db.get_knowledge_by_topic(topic)
if not items:
print(f"No knowledge found for topic: {topic}")
return
print(f"{'='*70}")
print(f"KNOWLEDGE: {topic}")
print(f"{'='*70}\n")
for item in items:
print(f"ID: {item['id'][:8]}...")
print(f"Category: {item['category']}")
print(f"Source: {item['source']}")
print(f"Confidence: {item['confidence']}")
print(f"Created: {item['created_at']}")
print(f"Times Referenced: {item['times_referenced']}")
if item.get('tags'):
print(f"Tags: {', '.join(item['tags'])}")
print(f"\nKnowledge:")
print(f" {item['knowledge']}\n")
print(f"{'-'*70}\n")
def search_knowledge(query, category=None):
"""Search knowledge base"""
items = db.query_knowledge(query, category=category, limit=10)
if not items:
print(f"No knowledge found matching: {query}")
return
print(f"{'='*70}")
print(f"SEARCH RESULTS: {query}")
if category:
print(f"Category Filter: {category}")
print(f"{'='*70}\n")
for i, item in enumerate(items, 1):
print(f"[{i}] {item['topic']}")
print(f" Category: {item['category']} | Confidence: {item['confidence']}")
print(f" {item['knowledge'][:150]}...")
print()
def add_knowledge(topic, knowledge, category="general"):
"""Add new knowledge"""
kid = db.store_knowledge(
topic=topic,
knowledge=knowledge,
category=category,
source="user-provided",
confidence="high"
)
if kid:
print(f" Added knowledge for topic: {topic}")
print(f" ID: {kid[:8]}...")
else:
print(f" Failed to add knowledge")
def seed_initial():
"""Seed initial knowledge"""
print("Seeding initial knowledge from seed_knowledge.py...")
exec(open("${./.}/seed_knowledge.py").read())
# Main CLI
if len(sys.argv) < 2:
print("Usage: macha-knowledge <command> [options]")
print("")
print("Commands:")
print(" list List all knowledge topics")
print(" list <category> List topics in category")
print(" show <topic> Show all knowledge for a topic")
print(" search <query> Search knowledge base")
print(" search <query> <cat> Search in specific category")
print(" add <topic> <text> Add new knowledge")
print(" seed Seed initial knowledge")
print("")
print("Categories: command, pattern, troubleshooting, performance, general")
sys.exit(1)
command = sys.argv[1]
if command == "list":
category = sys.argv[2] if len(sys.argv) >= 3 else None
list_topics(category)
elif command == "show" and len(sys.argv) >= 3:
show_topic(sys.argv[2])
elif command == "search" and len(sys.argv) >= 3:
query = sys.argv[2]
category = sys.argv[3] if len(sys.argv) >= 4 else None
search_knowledge(query, category)
elif command == "add" and len(sys.argv) >= 4:
topic = sys.argv[2]
knowledge = " ".join(sys.argv[3:])
add_knowledge(topic, knowledge)
elif command == "seed":
seed_initial()
else:
print(f"Unknown command: {command}")
sys.exit(1)
'')
];
};
}

291
monitor.py Normal file

@@ -0,0 +1,291 @@
#!/usr/bin/env python3
"""
System Monitor - Collects health data from Macha
"""
import json
import subprocess
import psutil
import time
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Any
class SystemMonitor:
"""Monitors system health and collects diagnostic data"""
def __init__(self, state_dir: Path = Path("/var/lib/macha")):
self.state_dir = state_dir
self.state_dir.mkdir(parents=True, exist_ok=True)
def collect_all(self) -> Dict[str, Any]:
"""Collect all system health data"""
return {
"timestamp": datetime.now().isoformat(),
"systemd": self.check_systemd_services(),
"resources": self.check_resources(),
"disk": self.check_disk_usage(),
"logs": self.check_recent_errors(),
"nixos": self.check_nixos_status(),
"network": self.check_network(),
"boot": self.check_boot_status(),
}
def check_systemd_services(self) -> Dict[str, Any]:
"""Check status of all systemd services"""
try:
# Get failed services
result = subprocess.run(
["systemctl", "--failed", "--no-pager", "--output=json"],
capture_output=True,
text=True,
timeout=10
)
failed_services = []
if result.returncode == 0 and result.stdout:
try:
failed_services = json.loads(result.stdout)
except json.JSONDecodeError:
pass
# Get all services status
result = subprocess.run(
["systemctl", "list-units", "--type=service", "--no-pager", "--output=json"],
capture_output=True,
text=True,
timeout=10
)
all_services = []
if result.returncode == 0 and result.stdout:
try:
all_services = json.loads(result.stdout)
except json.JSONDecodeError:
pass
return {
"failed_count": len(failed_services),
"failed_services": failed_services,
"total_services": len(all_services),
"active_services": [s for s in all_services if s.get("active") == "active"],
}
except Exception as e:
return {"error": str(e)}
def check_resources(self) -> Dict[str, Any]:
"""Check CPU, RAM, and system resources"""
try:
cpu_percent = psutil.cpu_percent(interval=1)
memory = psutil.virtual_memory()
load_avg = psutil.getloadavg()
return {
"cpu_percent": cpu_percent,
"cpu_count": psutil.cpu_count(),
"memory_percent": memory.percent,
"memory_available_gb": memory.available / (1024**3),
"memory_total_gb": memory.total / (1024**3),
"load_average": {
"1min": load_avg[0],
"5min": load_avg[1],
"15min": load_avg[2],
},
"swap_percent": psutil.swap_memory().percent,
}
except Exception as e:
return {"error": str(e)}
def check_disk_usage(self) -> Dict[str, Any]:
"""Check disk usage for all mounted filesystems"""
try:
partitions = psutil.disk_partitions()
disk_info = []
for partition in partitions:
try:
usage = psutil.disk_usage(partition.mountpoint)
disk_info.append({
"device": partition.device,
"mountpoint": partition.mountpoint,
"fstype": partition.fstype,
"percent_used": usage.percent,
"total_gb": usage.total / (1024**3),
"used_gb": usage.used / (1024**3),
"free_gb": usage.free / (1024**3),
})
except PermissionError:
continue
return {"partitions": disk_info}
except Exception as e:
return {"error": str(e)}
def check_recent_errors(self) -> Dict[str, Any]:
"""Check recent system logs for errors"""
try:
# Get errors from the last hour
result = subprocess.run(
["journalctl", "-p", "err", "--since", "1 hour ago", "--no-pager", "-o", "json"],
capture_output=True,
text=True,
timeout=10
)
errors = []
if result.returncode == 0 and result.stdout:
for line in result.stdout.strip().split('\n'):
if line:
try:
errors.append(json.loads(line))
except json.JSONDecodeError:
continue
return {
"error_count_1h": len(errors),
"recent_errors": errors[-50:], # Last 50 errors
}
except Exception as e:
return {"error": str(e)}
def check_nixos_status(self) -> Dict[str, Any]:
"""Check NixOS generation and system info"""
try:
# Get current generation
result = subprocess.run(
["nixos-version"],
capture_output=True,
text=True,
timeout=5
)
version = result.stdout.strip() if result.returncode == 0 else "unknown"
# Get generation list
result = subprocess.run(
["nix-env", "--list-generations", "-p", "/nix/var/nix/profiles/system"],
capture_output=True,
text=True,
timeout=10
)
generations = result.stdout.strip() if result.returncode == 0 else ""
return {
"version": version,
"generations": generations,
"nix_store_size": self._get_nix_store_size(),
}
except Exception as e:
return {"error": str(e)}
def _get_nix_store_size(self) -> str:
"""Get Nix store size"""
try:
result = subprocess.run(
["du", "-sh", "/nix/store"],
capture_output=True,
text=True,
timeout=30
)
if result.returncode == 0:
return result.stdout.split()[0]
except Exception:
pass
return "unknown"
def check_network(self) -> Dict[str, Any]:
"""Check network connectivity"""
try:
# Check if we can reach the internet
result = subprocess.run(
["ping", "-c", "1", "-W", "2", "8.8.8.8"],
capture_output=True,
timeout=5
)
internet_up = result.returncode == 0
# Get network interfaces
interfaces = {}
for iface, addrs in psutil.net_if_addrs().items():
interfaces[iface] = [
{"family": addr.family.name, "address": addr.address}
for addr in addrs
]
return {
"internet_reachable": internet_up,
"interfaces": interfaces,
}
except Exception as e:
return {"error": str(e)}
def check_boot_status(self) -> Dict[str, Any]:
"""Check boot and uptime information"""
try:
boot_time = datetime.fromtimestamp(psutil.boot_time())
uptime_seconds = time.time() - psutil.boot_time()
return {
"boot_time": boot_time.isoformat(),
"uptime_seconds": uptime_seconds,
"uptime_hours": uptime_seconds / 3600,
}
except Exception as e:
return {"error": str(e)}
def save_snapshot(self, data: Dict[str, Any]):
"""Save a snapshot of system state"""
snapshot_file = self.state_dir / f"snapshot_{int(time.time())}.json"
with open(snapshot_file, 'w') as f:
json.dump(data, f, indent=2)
# Keep only last 100 snapshots
snapshots = sorted(self.state_dir.glob("snapshot_*.json"))
for old_snapshot in snapshots[:-100]:
old_snapshot.unlink()
def get_summary(self, data: Dict[str, Any]) -> str:
"""Generate human-readable summary of system state"""
lines = []
lines.append(f"=== System Health Summary ({data['timestamp']}) ===\n")
# Resources
res = data.get("resources", {})
lines.append(f"CPU: {res.get('cpu_percent', 0):.1f}%")
lines.append(f"Memory: {res.get('memory_percent', 0):.1f}% ({res.get('memory_available_gb', 0):.1f}GB free)")
lines.append(f"Load: {res.get('load_average', {}).get('1min', 0):.2f}")
# Disk
disk = data.get("disk", {})
for part in disk.get("partitions", [])[:5]: # Top 5 partitions
lines.append(f"Disk {part['mountpoint']}: {part['percent_used']:.1f}% used ({part['free_gb']:.1f}GB free)")
# Systemd
systemd = data.get("systemd", {})
failed = systemd.get("failed_count", 0)
if failed > 0:
lines.append(f"\n⚠️ WARNING: {failed} failed services!")
for svc in systemd.get("failed_services", [])[:5]:
lines.append(f" - {svc.get('unit', 'unknown')}")
# Errors
logs = data.get("logs", {})
error_count = logs.get("error_count_1h", 0)
if error_count > 0:
lines.append(f"\n⚠️ {error_count} errors in last hour")
# Network
net = data.get("network", {})
if not net.get("internet_reachable", True):
lines.append("\n⚠️ WARNING: No internet connectivity!")
return "\n".join(lines)
if __name__ == "__main__":
monitor = SystemMonitor()
data = monitor.collect_all()
monitor.save_snapshot(data)
print(monitor.get_summary(data))
print(f"\nFull data saved to {monitor.state_dir}")

248
notifier.py Normal file

@@ -0,0 +1,248 @@
#!/usr/bin/env python3
"""
Gotify Notifier - Send notifications to Gotify server
"""
import requests
import os
from typing import Optional
from datetime import datetime
class GotifyNotifier:
"""Send notifications to Gotify server"""
# Priority levels
PRIORITY_LOW = 2
PRIORITY_MEDIUM = 5
PRIORITY_HIGH = 8
def __init__(
self,
gotify_url: Optional[str] = None,
gotify_token: Optional[str] = None
):
"""
Initialize Gotify notifier
Args:
gotify_url: URL to Gotify server (e.g. http://rhiannon:8181)
gotify_token: Application token from Gotify
"""
self.gotify_url = gotify_url or os.environ.get("GOTIFY_URL", "")
self.gotify_token = gotify_token or os.environ.get("GOTIFY_TOKEN", "")
self.enabled = bool(self.gotify_url and self.gotify_token)
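# Example (hedged): in this deployment the credentials typically arrive via
# the environment, e.g. GOTIFY_URL=http://rhiannon:8181 and
# GOTIFY_TOKEN=<application token>, as set by the systemd service.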
def send(
self,
title: str,
message: str,
priority: int = PRIORITY_MEDIUM,
extras: Optional[dict] = None
) -> bool:
"""
Send a notification to Gotify
Args:
title: Notification title
message: Notification message
priority: Priority level (2=low, 5=medium, 8=high)
extras: Optional extra data
Returns:
True if successful, False otherwise
"""
if not self.enabled:
return False
try:
url = f"{self.gotify_url}/message"
headers = {
"Authorization": f"Bearer {self.gotify_token}",
"Content-Type": "application/json"
}
data = {
"title": title,
"message": message,
"priority": priority,
}
if extras:
data["extras"] = extras
response = requests.post(
url,
json=data,
headers=headers,
timeout=10
)
return response.status_code == 200
except Exception as e:
# Fail silently - don't crash if Gotify is unavailable
print(f"Warning: Failed to send Gotify notification: {e}")
return False
def notify_critical_issue(self, issue_description: str, details: str = ""):
"""Send high-priority notification for critical issues"""
message = f"⚠️ Critical Issue Detected\n\n{issue_description}"
if details:
message += f"\n\nDetails:\n{details}"
return self.send(
title="🚨 Macha: Critical Issue",
message=message,
priority=self.PRIORITY_HIGH
)
def notify_issue_created(self, issue_id: str, title: str, severity: str):
"""Send notification when a new issue is created"""
severity_icons = {
"low": "",
"medium": "⚠️",
"high": "🚨",
"critical": "🔴"
}
icon = severity_icons.get(severity, "⚠️")
priority_map = {
"low": self.PRIORITY_LOW,
"medium": self.PRIORITY_MEDIUM,
"high": self.PRIORITY_HIGH,
"critical": self.PRIORITY_HIGH
}
priority = priority_map.get(severity, self.PRIORITY_MEDIUM)
message = f"{icon} New Issue Tracked\n\nID: {issue_id}\nSeverity: {severity.upper()}\n\n{title}"
return self.send(
title="📋 Macha: Issue Created",
message=message,
priority=priority
)
def notify_action_queued(self, action_description: str, risk_level: str):
"""Send notification when action is queued for approval"""
emoji = "⚠️" if risk_level == "high" else ""
message = (
f"{emoji} Action Queued for Approval\n\n"
f"Action: {action_description}\n"
f"Risk Level: {risk_level}\n\n"
f"Use 'macha-approve list' to review"
)
priority = self.PRIORITY_HIGH if risk_level == "high" else self.PRIORITY_MEDIUM
return self.send(
title="📋 Macha: Action Needs Approval",
message=message,
priority=priority
)
def notify_action_executed(self, action_description: str, success: bool, output: str = ""):
"""Send notification when action is executed"""
if success:
emoji = ""
title_prefix = "Success"
else:
emoji = ""
title_prefix = "Failed"
message = f"{emoji} Action {title_prefix}\n\n{action_description}"
if output:
message += f"\n\nOutput:\n{output[:500]}" # Limit output length
priority = self.PRIORITY_HIGH if not success else self.PRIORITY_LOW
return self.send(
title=f"{emoji} Macha: Action {title_prefix}",
message=message,
priority=priority
)
def notify_service_failure(self, service_name: str, details: str = ""):
"""Send notification for service failures"""
message = f"🔴 Service Failed: {service_name}"
if details:
message += f"\n\nDetails:\n{details}"
return self.send(
title="🔴 Macha: Service Failure",
message=message,
priority=self.PRIORITY_HIGH
)
def notify_health_summary(self, summary: str, status: str):
"""Send periodic health summary"""
emoji = {
"healthy": "",
"attention_needed": "⚠️",
"intervention_required": "🚨"
}.get(status, "")
priority = {
"healthy": self.PRIORITY_LOW,
"attention_needed": self.PRIORITY_MEDIUM,
"intervention_required": self.PRIORITY_HIGH
}.get(status, self.PRIORITY_MEDIUM)
return self.send(
title=f"{emoji} Macha: Health Check",
message=summary,
priority=priority
)
def send_system_discovered(
self,
hostname: str,
os_type: str,
role: str,
services_count: int
):
"""Send notification when a new system is discovered"""
message = (
f"🔍 New System Auto-Discovered\n\n"
f"Hostname: {hostname}\n"
f"OS: {os_type.upper()}\n"
f"Role: {role}\n"
f"Services: {services_count} detected\n\n"
f"System has been registered and analyzed.\n"
f"Use 'macha-systems' to view all registered systems."
)
return self.send(
title="🌐 Macha: New System Discovered",
message=message,
priority=self.PRIORITY_MEDIUM
)
if __name__ == "__main__":
import sys
# Test the notifier
if len(sys.argv) < 3:
print("Usage: notifier.py <title> <message> [priority]")
print("Example: notifier.py 'Test' 'This is a test message' 5")
sys.exit(1)
title = sys.argv[1]
message = sys.argv[2]
priority = int(sys.argv[3]) if len(sys.argv) > 3 else GotifyNotifier.PRIORITY_MEDIUM
notifier = GotifyNotifier()
if not notifier.enabled:
print("Error: Gotify not configured (GOTIFY_URL and GOTIFY_TOKEN required)")
sys.exit(1)
success = notifier.send(title, message, priority)
if success:
print("✅ Notification sent successfully")
else:
print("❌ Failed to send notification")
sys.exit(1)

238
ollama_queue.py Normal file

@@ -0,0 +1,238 @@
#!/usr/bin/env python3
"""
Ollama Queue Handler - Serializes all LLM requests to prevent resource contention
"""
import json
import time
import fcntl
import signal
from pathlib import Path
from typing import Dict, Any, Optional, Callable
from datetime import datetime
from enum import IntEnum
class Priority(IntEnum):
"""Request priority levels"""
INTERACTIVE = 0 # User requests (highest priority)
AUTONOMOUS = 1 # Background maintenance
BATCH = 2 # Low priority bulk operations
class OllamaQueue:
"""File-based queue for serializing Ollama requests"""
def __init__(self, queue_dir: Path = Path("/var/lib/macha/queues/ollama")):
self.queue_dir = queue_dir
self.queue_dir.mkdir(parents=True, exist_ok=True)
self.pending_dir = self.queue_dir / "pending"
self.processing_dir = self.queue_dir / "processing"
self.completed_dir = self.queue_dir / "completed"
self.failed_dir = self.queue_dir / "failed"
for directory in [self.pending_dir, self.processing_dir, self.completed_dir, self.failed_dir]:
directory.mkdir(parents=True, exist_ok=True)
self.lock_file = self.queue_dir / "queue.lock"  # currently unused; a single worker serializes access
self.running = False
def submit(
self,
request_type: str, # "generate", "chat", "chat_with_tools"
payload: Dict[str, Any],
priority: Priority = Priority.INTERACTIVE,
callback: Optional[Callable] = None,
progress_callback: Optional[Callable] = None
) -> str:
"""Submit a request to the queue. Returns request ID."""
request_id = f"{int(time.time() * 1000000)}_{priority.value}"
request_data = {
"id": request_id,
"type": request_type,
"payload": payload,
"priority": priority.value,
"submitted_at": datetime.now().isoformat(),
"status": "pending"
}
request_file = self.pending_dir / f"{request_id}.json"
request_file.write_text(json.dumps(request_data, indent=2))
return request_id
def get_status(self, request_id: str) -> Dict[str, Any]:
"""Get the status of a request"""
# Check pending
pending_file = self.pending_dir / f"{request_id}.json"
if pending_file.exists():
data = json.loads(pending_file.read_text())
# Calculate position in queue
position = self._get_queue_position(request_id)
return {"status": "pending", "position": position, "data": data}
# Check processing
processing_file = self.processing_dir / f"{request_id}.json"
if processing_file.exists():
data = json.loads(processing_file.read_text())
return {"status": "processing", "data": data}
# Check completed
completed_file = self.completed_dir / f"{request_id}.json"
if completed_file.exists():
data = json.loads(completed_file.read_text())
return {"status": "completed", "result": data.get("result"), "data": data}
# Check failed
failed_file = self.failed_dir / f"{request_id}.json"
if failed_file.exists():
data = json.loads(failed_file.read_text())
return {"status": "failed", "error": data.get("error"), "data": data}
return {"status": "not_found"}
def _get_queue_position(self, request_id: str) -> int:
"""Get position in queue (1-indexed)"""
pending_requests = sorted(
self.pending_dir.glob("*.json"),
key=lambda p: (int(p.stem.split('_')[1]), int(p.stem.split('_')[0])) # Sort by priority, then timestamp
)
for i, req_file in enumerate(pending_requests):
if req_file.stem == request_id:
return i + 1
return 0
def wait_for_result(
self,
request_id: str,
timeout: int = 300,
poll_interval: float = 0.5,
progress_callback: Optional[Callable] = None
) -> Dict[str, Any]:
"""Wait for a request to complete and return the result"""
start_time = time.time()
last_status = None
while time.time() - start_time < timeout:
status = self.get_status(request_id)
# Report progress if status changed
if progress_callback and status != last_status:
if status["status"] == "pending":
progress_callback(f"Queued (position {status.get('position', '?')})")
elif status["status"] == "processing":
progress_callback("Processing...")
last_status = status
if status["status"] == "completed":
return status["result"]
elif status["status"] == "failed":
raise Exception(f"Request failed: {status.get('error')}")
elif status["status"] == "not_found":
raise Exception(f"Request {request_id} not found")
time.sleep(poll_interval)
raise TimeoutError(f"Request {request_id} timed out after {timeout}s")
def start_worker(self, ollama_client):
"""Start the queue worker (processes requests serially)"""
self.running = True
self.ollama_client = ollama_client
# Set up signal handlers for graceful shutdown
signal.signal(signal.SIGTERM, self._shutdown_handler)
signal.signal(signal.SIGINT, self._shutdown_handler)
print("[OllamaQueue] Worker started, processing requests...")
while self.running:
try:
self._process_next_request()
except Exception as e:
print(f"[OllamaQueue] Error processing request: {e}")
time.sleep(0.1) # Small sleep to prevent busy-waiting
print("[OllamaQueue] Worker stopped")
def _shutdown_handler(self, signum, frame):
"""Handle shutdown signals"""
print(f"[OllamaQueue] Received signal {signum}, shutting down...")
self.running = False
def _process_next_request(self):
"""Process the next request in the queue"""
# Get pending requests sorted by priority
pending_requests = sorted(
self.pending_dir.glob("*.json"),
key=lambda p: (int(p.stem.split('_')[1]), int(p.stem.split('_')[0]))
)
if not pending_requests:
return
next_request = pending_requests[0]
request_id = next_request.stem
# Move to processing
request_data = json.loads(next_request.read_text())
request_data["status"] = "processing"
request_data["started_at"] = datetime.now().isoformat()
processing_file = self.processing_dir / f"{request_id}.json"
processing_file.write_text(json.dumps(request_data, indent=2))
next_request.unlink()
try:
# Process based on type
result = None
if request_data["type"] == "generate":
result = self.ollama_client.generate(request_data["payload"])
elif request_data["type"] == "chat":
result = self.ollama_client.chat(request_data["payload"])
elif request_data["type"] == "chat_with_tools":
result = self.ollama_client.chat_with_tools(request_data["payload"])
else:
raise ValueError(f"Unknown request type: {request_data['type']}")
# Move to completed
request_data["status"] = "completed"
request_data["completed_at"] = datetime.now().isoformat()
request_data["result"] = result
completed_file = self.completed_dir / f"{request_id}.json"
completed_file.write_text(json.dumps(request_data, indent=2))
processing_file.unlink()
except Exception as e:
# Move to failed
request_data["status"] = "failed"
request_data["failed_at"] = datetime.now().isoformat()
request_data["error"] = str(e)
failed_file = self.failed_dir / f"{request_id}.json"
failed_file.write_text(json.dumps(request_data, indent=2))
processing_file.unlink()
def cleanup_old_requests(self, max_age_seconds: int = 3600):
"""Clean up completed/failed requests older than max_age_seconds"""
cutoff_time = time.time() - max_age_seconds
for directory in [self.completed_dir, self.failed_dir]:
for request_file in directory.glob("*.json"):
# Extract timestamp from filename
timestamp = int(request_file.stem.split('_')[0]) / 1000000
if timestamp < cutoff_time:
request_file.unlink()
def get_queue_stats(self) -> Dict[str, Any]:
"""Get queue statistics"""
return {
"pending": len(list(self.pending_dir.glob("*.json"))),
"processing": len(list(self.processing_dir.glob("*.json"))),
"completed": len(list(self.completed_dir.glob("*.json"))),
"failed": len(list(self.failed_dir.glob("*.json")))
}
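A client-side sketch of the queue contract: submit() writes a pending JSON file and wait_for_result() polls until the worker moves it to completed. The payload shape follows Ollama's /api/chat; the model name is illustrative:

```python
# Sketch: enqueue one chat request and block for the result.
from ollama_queue import OllamaQueue, Priority

queue = OllamaQueue()
request_id = queue.submit(
    request_type="chat",
    payload={
        "model": "gpt-oss:latest",  # illustrative; use whatever model the worker's Ollama serves
        "messages": [{"role": "user", "content": "Summarize system health."}],
        "stream": False,
    },
    priority=Priority.INTERACTIVE,
)
result = queue.wait_for_result(request_id, timeout=300, progress_callback=print)
print(result["message"]["content"])
```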

111
ollama_worker.py Normal file
View File

@@ -0,0 +1,111 @@
#!/usr/bin/env python3
"""
Ollama Queue Worker - Daemon that processes queued Ollama requests
"""
import sys
import json
import requests
from pathlib import Path
from ollama_queue import OllamaQueue
class OllamaClient:
"""Simple Ollama API client for the queue worker"""
def __init__(self, host: str = "http://localhost:11434"):
self.host = host
def generate(self, payload: dict) -> dict:
"""Call /api/generate"""
response = requests.post(
f"{self.host}/api/generate",
json=payload,
timeout=payload.get("timeout", 300),
stream=False
)
response.raise_for_status()
return response.json()
def chat(self, payload: dict) -> dict:
"""Call /api/chat"""
response = requests.post(
f"{self.host}/api/chat",
json=payload,
timeout=payload.get("timeout", 300),
stream=False
)
response.raise_for_status()
return response.json()
def chat_with_tools(self, payload: dict) -> dict:
"""Call /api/chat with tools (streaming or non-streaming)"""
# Check if streaming is requested
stream = payload.get("stream", False)
response = requests.post(
f"{self.host}/api/chat",
json=payload,
timeout=payload.get("timeout", 300),
stream=stream
)
response.raise_for_status()
if not stream:
# Non-streaming: return response directly
return response.json()
# Streaming: accumulate response
full_response = {"message": {"role": "", "content": "", "tool_calls": []}}  # role filled from the first chunk
for line in response.iter_lines():
if line:
chunk = json.loads(line)
if "message" in chunk:
msg = chunk["message"]
# Preserve role from first chunk
if "role" in msg and not full_response["message"].get("role"):
full_response["message"]["role"] = msg["role"]
if "content" in msg:
full_response["message"]["content"] += msg["content"]
if "tool_calls" in msg:
full_response["message"]["tool_calls"].extend(msg["tool_calls"])
if chunk.get("done"):
full_response["done"] = True
# Copy any additional fields from final chunk
for key in chunk:
if key not in ("message", "done"):
full_response[key] = chunk[key]
break
# Ensure a role is set (fallback when no chunk carried one)
if not full_response["message"]["role"]:
full_response["message"]["role"] = "assistant"
return full_response
def main():
"""Main entry point for the worker"""
print("Starting Ollama Queue Worker...")
# Initialize queue and client
queue = OllamaQueue()
client = OllamaClient()
# Cleanup old requests on startup
queue.cleanup_old_requests(max_age_seconds=3600)
# Start processing
try:
queue.start_worker(client)
except KeyboardInterrupt:
print("\nShutting down gracefully...")
queue.running = False
return 0
if __name__ == "__main__":
sys.exit(main())
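The worker is meant to run as a single long-lived process so that all requests are strictly serialized; a sketch of starting it against an explicit endpoint (the URL is illustrative):

```python
# Sketch: run the queue worker with an explicit Ollama endpoint.
from ollama_queue import OllamaQueue
from ollama_worker import OllamaClient

queue = OllamaQueue()
queue.cleanup_old_requests(max_age_seconds=3600)  # drop stale completed/failed entries
queue.start_worker(OllamaClient(host="http://127.0.0.1:11434"))  # blocks until SIGTERM/SIGINT
```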

1053
orchestrator.py Normal file

File diff suppressed because it is too large Load Diff

263
remote_monitor.py Normal file
View File

@@ -0,0 +1,263 @@
#!/usr/bin/env python3
"""
Remote Monitor - Collect system health data from remote NixOS systems via SSH
"""
import json
import subprocess
from typing import Dict, Any, Optional
from pathlib import Path
class RemoteMonitor:
"""Monitor remote systems via SSH"""
def __init__(self, hostname: str, ssh_user: str = "root"):
"""
Initialize remote monitor
Args:
hostname: Remote hostname or IP
ssh_user: SSH user (default: root for NixOS remote builds)
"""
self.hostname = hostname
self.ssh_user = ssh_user
self.ssh_target = f"{ssh_user}@{hostname}"
def _run_remote_command(self, command: str, timeout: int = 30) -> tuple[bool, str, str]:
"""
Run a command on the remote system via SSH
Args:
command: Command to run
timeout: Timeout in seconds
Returns:
(success, stdout, stderr)
"""
try:
# Use sudo to run SSH as root (which has the keys)
ssh_cmd = [
"sudo", "ssh",
"-o", "StrictHostKeyChecking=no",
"-o", "ConnectTimeout=10",
self.ssh_target,
command
]
result = subprocess.run(
ssh_cmd,
capture_output=True,
text=True,
timeout=timeout
)
return (
result.returncode == 0,
result.stdout.strip(),
result.stderr.strip()
)
except subprocess.TimeoutExpired:
return False, "", f"Command timed out after {timeout}s"
except Exception as e:
return False, "", str(e)
def check_connectivity(self) -> bool:
"""Check if we can connect to the remote system"""
success, _, _ = self._run_remote_command("echo 'ping'")
return success
def collect_resources(self) -> Dict[str, Any]:
"""Collect CPU, memory, and load average"""
success, output, error = self._run_remote_command("""
python3 -c "
import psutil, json
print(json.dumps({
'cpu_percent': psutil.cpu_percent(interval=1),
'memory_percent': psutil.virtual_memory().percent,
'load_average': {
'1min': psutil.getloadavg()[0],
'5min': psutil.getloadavg()[1],
'15min': psutil.getloadavg()[2]
}
}))
"
""")
if success:
try:
return json.loads(output)
except json.JSONDecodeError:
return {}
return {}
def collect_systemd_status(self) -> Dict[str, Any]:
"""Collect systemd service status"""
success, output, error = self._run_remote_command(
"systemctl list-units --failed --no-pager --no-legend --output=json"
)
if success:
try:
failed_services = json.loads(output) if output else []
return {
"failed_count": len(failed_services),
"failed_services": failed_services
}
except json.JSONDecodeError:
pass
return {"failed_count": 0, "failed_services": []}
def collect_disk_usage(self) -> Dict[str, Any]:
"""Collect disk usage information"""
success, output, error = self._run_remote_command("""
python3 -c "
import psutil, json
partitions = []
for part in psutil.disk_partitions():
try:
usage = psutil.disk_usage(part.mountpoint)
partitions.append({
'device': part.device,
'mountpoint': part.mountpoint,
'fstype': part.fstype,
'total': usage.total,
'used': usage.used,
'free': usage.free,
'percent_used': usage.percent
})
except OSError:
pass
print(json.dumps({'partitions': partitions}))
"
""")
if success:
try:
return json.loads(output)
except json.JSONDecodeError:
return {"partitions": []}
return {"partitions": []}
def collect_network_status(self) -> Dict[str, Any]:
"""Check network connectivity"""
# SSH reachability is already established; check outbound internet from the remote host
success, _, _ = self._run_remote_command("ping -c 1 -W 2 8.8.8.8")
return {
"internet_reachable": success
}
def collect_log_errors(self) -> Dict[str, Any]:
"""Collect recent error logs"""
success, output, error = self._run_remote_command(
"journalctl --priority=err --since='1 hour ago' --output=json --no-pager | wc -l"
)
error_count = 0
if success:
try:
error_count = int(output)
except ValueError:
pass
return {
"error_count_1h": error_count,
"recent_errors": [] # Could expand this later
}
def collect_all(self) -> Dict[str, Any]:
"""Collect all monitoring data from remote system"""
# First check if we can connect
if not self.check_connectivity():
return {
"hostname": self.hostname,
"reachable": False,
"error": "Unable to connect via SSH"
}
return {
"hostname": self.hostname,
"reachable": True,
"resources": self.collect_resources(),
"systemd": self.collect_systemd_status(),
"disk": self.collect_disk_usage(),
"network": self.collect_network_status(),
"logs": self.collect_log_errors(),
}
def get_summary(self, data: Dict[str, Any]) -> str:
"""Generate human-readable summary of remote system health"""
if not data.get("reachable", False):
return f"{self.hostname}: Unreachable - {data.get('error', 'Unknown error')}"
lines = [f"System: {self.hostname}"]
# Resources
res = data.get("resources", {})
if res:
lines.append(
f"Resources: CPU {res.get('cpu_percent', 0):.1f}%, "
f"Memory {res.get('memory_percent', 0):.1f}%, "
f"Load {res.get('load_average', {}).get('1min', 0):.2f}"
)
# Disk
disk = data.get("disk", {})
max_usage = 0
for part in disk.get("partitions", []):
if part.get("mountpoint") == "/":
max_usage = part.get("percent_used", 0)
break
if max_usage > 0:
lines.append(f"Disk: {max_usage:.1f}% used (/ partition)")
# Services
systemd = data.get("systemd", {})
failed_count = systemd.get("failed_count", 0)
if failed_count > 0:
lines.append(f"Services: {failed_count} failed")
for svc in systemd.get("failed_services", [])[:3]:
lines.append(f" - {svc.get('unit', 'unknown')}")
else:
lines.append("Services: All running")
# Network
net = data.get("network", {})
if net.get("internet_reachable"):
lines.append("Network: Internet reachable")
else:
lines.append("Network: ⚠️ No internet connectivity")
# Logs
logs = data.get("logs", {})
error_count = logs.get("error_count_1h", 0)
if error_count > 0:
lines.append(f"Recent logs: {error_count} errors in last hour")
return "\n".join(lines)
if __name__ == "__main__":
import sys
if len(sys.argv) < 2:
print("Usage: remote_monitor.py <hostname>")
print("Example: remote_monitor.py rhiannon")
sys.exit(1)
hostname = sys.argv[1]
monitor = RemoteMonitor(hostname)
print(f"Monitoring {hostname}...")
data = monitor.collect_all()
print("\n" + "="*60)
print(monitor.get_summary(data))
print("="*60)
print("\nFull data:")
print(json.dumps(data, indent=2))

128
seed_knowledge.py Normal file
View File

@@ -0,0 +1,128 @@
#!/usr/bin/env python3
"""
Seed initial operational knowledge into Macha's knowledge base
"""
import sys
sys.path.insert(0, '.')
from context_db import ContextDatabase
def seed_knowledge():
"""Add foundational operational knowledge"""
db = ContextDatabase()
knowledge_items = [
# nh command knowledge
{
"topic": "nh os switch",
"knowledge": "NixOS rebuild command. Takes 1-5 minutes normally, up to 1 HOUR for major updates with many packages. DO NOT retry if slow - this is normal. Use -u flag to update flake inputs first. Can use --target-host and --hostname for remote deployment.",
"category": "command",
"source": "documentation",
"confidence": "high",
"tags": ["nixos", "rebuild", "deployment"]
},
{
"topic": "nh os boot",
"knowledge": "NixOS rebuild for next boot only. Safer than 'switch' for high-risk changes - allows easy rollback. After 'nh os boot', need to reboot for changes to take effect. Use -u to update flake inputs.",
"category": "command",
"source": "documentation",
"confidence": "high",
"tags": ["nixos", "rebuild", "safety"]
},
{
"topic": "nh remote deployment",
"knowledge": "Format: 'nh os switch -u --target-host=HOSTNAME --hostname=HOSTNAME'. Builds locally and deploys to remote. Much cleaner than SSH'ing to run commands. Uses root SSH keys for authentication.",
"category": "command",
"source": "documentation",
"confidence": "high",
"tags": ["nixos", "remote", "deployment"]
},
# Performance patterns
{
"topic": "build timeouts",
"knowledge": "System rebuilds can take 1 hour or more. Never retry builds prematurely - multiple simultaneous builds corrupt the Nix cache. Default timeout is 3600 seconds (1 hour). Be patient!",
"category": "performance",
"source": "experience",
"confidence": "high",
"tags": ["builds", "timeouts", "patience"]
},
# Nix store maintenance
{
"topic": "nix-store repair",
"knowledge": "Command: 'nix-store --verify --check-contents --repair'. Verifies and repairs Nix store integrity. WARNING: Can take HOURS on large stores. Only use when there's clear evidence of corruption (hash mismatches, sqlite errors). This is a LAST RESORT - most build failures are NOT corruption.",
"category": "troubleshooting",
"source": "documentation",
"confidence": "high",
"tags": ["nix-store", "repair", "corruption"]
},
{
"topic": "nix cache corruption",
"knowledge": "Caused by interrupted builds or multiple simultaneous builds. Symptoms: hash mismatches, sqlite errors, corrupt database. Solution: 'nix-store --verify --check-contents --repair' but this takes hours. Prevention: Never retry build commands, use proper timeouts.",
"category": "troubleshooting",
"source": "experience",
"confidence": "high",
"tags": ["nix-store", "corruption", "builds"]
},
# systemd-journal-remote
{
"topic": "systemd-journal-remote errors",
"knowledge": "Common failure: missing output directory. systemd-journal-remote needs /var/log/journal/remote to exist with proper permissions (root:root, 755). Create it if missing, then restart the service.",
"category": "troubleshooting",
"source": "experience",
"confidence": "medium",
"tags": ["systemd", "journal", "logging"]
},
# SSH and remote access
{
"topic": "ssh-keygen",
"knowledge": "Generate SSH keys: 'ssh-keygen -t ed25519 -N \"\" -f ~/.ssh/id_ed25519'. Creates public key at ~/.ssh/id_ed25519.pub and private key at ~/.ssh/id_ed25519. Use -N \"\" for no passphrase.",
"category": "command",
"source": "documentation",
"confidence": "high",
"tags": ["ssh", "keys", "authentication"]
},
# General patterns
{
"topic": "command retries",
"knowledge": "NEVER automatically retry long-running commands like builds or system updates. If something times out, check if it's still running before retrying. Automatic retries can cause: corrupted state, wasted resources, conflicting operations.",
"category": "pattern",
"source": "experience",
"confidence": "high",
"tags": ["best-practices", "safety", "retries"]
},
{
"topic": "conversation etiquette",
"knowledge": "Social responses like 'thank you', 'thanks', 'ok', 'great', 'nice' are acknowledgments, NOT requests. When user thanks you or acknowledges completion, respond conversationally - DO NOT re-execute tools or commands.",
"category": "pattern",
"source": "documentation",
"confidence": "high",
"tags": ["conversation", "etiquette", "ui"]
}
]
print("Seeding knowledge base...")
for item in knowledge_items:
kid = db.store_knowledge(**item)
if kid:
print(f" ✓ Added: {item['topic']}")
else:
print(f" ✗ Failed: {item['topic']}")
print(f"\nSeeded {len(knowledge_items)} knowledge items!")
# List all topics
print("\nAvailable knowledge topics:")
topics = db.list_knowledge_topics()
for topic in topics:
print(f" - {topic}")
if __name__ == "__main__":
seed_knowledge()
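Extending the seed data follows the same shape; a sketch of storing one additional item via the store_knowledge keyword arguments used above (the item's content is illustrative):

```python
# Sketch: add one more knowledge item outside the seed list.
from context_db import ContextDatabase

db = ContextDatabase()
kid = db.store_knowledge(
    topic="journal disk usage",
    knowledge="Check journald disk usage with 'journalctl --disk-usage'; "
              "reclaim space with 'journalctl --vacuum-size=500M' when /var fills up.",
    category="command",
    source="documentation",
    confidence="high",
    tags=["journald", "disk", "maintenance"],
)
print("stored" if kid else "failed")
```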

209
system_discovery.py Normal file
View File

@@ -0,0 +1,209 @@
#!/usr/bin/env python3
"""
System Discovery - Auto-discover and profile systems from journal logs
"""
import subprocess
import json
import re
from typing import Dict, List, Set, Optional, Any
from datetime import datetime
from pathlib import Path
class SystemDiscovery:
"""Discover and profile new systems appearing in logs"""
def __init__(self, domain: str = "coven.systems"):
self.domain = domain
self.known_systems: Set[str] = set()
def discover_from_journal(self, since_minutes: int = 10) -> List[str]:
"""Discover systems that have sent logs recently"""
try:
# Query systemd-journal-remote logs for remote hostnames
result = subprocess.run(
["journalctl", "-u", "systemd-journal-remote.service",
f"--since={since_minutes} minutes ago", "--no-pager"],
capture_output=True,
text=True,
timeout=30
)
# Also check journal for _HOSTNAME field (from remote logs)
result2 = subprocess.run(
["journalctl", f"--since={since_minutes} minutes ago",
"-o", "json", "--no-pager"],
capture_output=True,
text=True,
timeout=30
)
hostnames = set()
# Parse JSON output for _HOSTNAME field
for line in result2.stdout.split('\n'):
if not line.strip():
continue
try:
entry = json.loads(line)
hostname = entry.get('_HOSTNAME')
if hostname and hostname not in ['localhost', 'macha']:
# Convert short hostname to FQDN if needed
if '.' not in hostname:
hostname = f"{hostname}.{self.domain}"
hostnames.add(hostname)
except json.JSONDecodeError:
pass
return list(hostnames)
except Exception as e:
print(f"Error discovering from journal: {e}")
return []
def detect_os_type(self, hostname: str) -> str:
"""Detect the operating system of a remote host via SSH"""
try:
# Try to detect OS via SSH
result = subprocess.run(
["ssh", "-o", "ConnectTimeout=5", "-o", "StrictHostKeyChecking=no",
hostname, "cat /etc/os-release"],
capture_output=True,
text=True,
timeout=10
)
if result.returncode == 0:
os_release = result.stdout.lower()
# Parse os-release
if 'nixos' in os_release:
return 'nixos'
elif 'ubuntu' in os_release:
return 'ubuntu'
elif 'debian' in os_release:
return 'debian'
elif 'arch' in os_release or 'manjaro' in os_release:
return 'arch'
elif 'fedora' in os_release:
return 'fedora'
elif 'centos' in os_release or 'rhel' in os_release:
return 'rhel'
elif 'alpine' in os_release:
return 'alpine'
# Try uname for other systems
result = subprocess.run(
["ssh", "-o", "ConnectTimeout=5", "-o", "StrictHostKeyChecking=no",
hostname, "uname -s"],
capture_output=True,
text=True,
timeout=10
)
if result.returncode == 0:
uname = result.stdout.strip().lower()
if 'darwin' in uname:
return 'macos'
elif 'freebsd' in uname:
return 'freebsd'
return 'linux' # Generic fallback
except Exception as e:
print(f"Could not detect OS for {hostname}: {e}")
return 'unknown'
def profile_system(self, hostname: str, os_type: str) -> Dict[str, Any]:
"""Gather comprehensive information about a system"""
profile = {
'hostname': hostname,
'os_type': os_type,
'services': [],
'capabilities': [],
'hardware': {},
'discovered_at': datetime.now().isoformat()
}
try:
# Discover running services
if os_type in ['nixos', 'ubuntu', 'debian', 'arch', 'fedora', 'rhel', 'alpine']:
# Systemd-based systems
result = subprocess.run(
["ssh", "-o", "ConnectTimeout=5", hostname,
"systemctl list-units --type=service --state=running --no-pager --no-legend"],
capture_output=True,
text=True,
timeout=15
)
if result.returncode == 0:
for line in result.stdout.split('\n'):
if line.strip():
# Extract service name (first column)
service = line.split()[0]
if service.endswith('.service'):
service = service[:-8] # Remove .service suffix
profile['services'].append(service)
# Get hardware info
result = subprocess.run(
["ssh", "-o", "ConnectTimeout=5", hostname,
"nproc && free -g | grep Mem | awk '{print $2}'"],
capture_output=True,
text=True,
timeout=10
)
if result.returncode == 0:
lines = result.stdout.strip().split('\n')
if len(lines) >= 2:
profile['hardware']['cpu_cores'] = lines[0].strip()
profile['hardware']['memory_gb'] = lines[1].strip()
# Detect capabilities based on services
services_str = ' '.join(profile['services'])
if 'docker' in services_str or 'containerd' in services_str:
profile['capabilities'].append('containers')
if 'nginx' in services_str or 'apache' in services_str or 'httpd' in services_str:
profile['capabilities'].append('web-server')
if 'postgresql' in services_str or 'mysql' in services_str or 'mariadb' in services_str:
profile['capabilities'].append('database')
if 'sshd' in services_str:
profile['capabilities'].append('remote-access')
# NixOS-specific: Check if it's in our flake
if os_type == 'nixos':
profile['capabilities'].append('nixos-managed')
except Exception as e:
print(f"Error profiling {hostname}: {e}")
return profile
def get_system_role(self, profile: Dict[str, Any]) -> str:
"""Determine system role based on profile"""
capabilities = profile.get('capabilities', [])
services = profile.get('services', [])
# Check for specific roles
if 'ai-inference' in capabilities or 'ollama' in services:
return 'ai-workstation'
elif 'web-server' in capabilities:
return 'web-server'
elif 'database' in capabilities:
return 'database-server'
elif 'containers' in capabilities:
return 'container-host'
elif len(services) > 20:
return 'server'
elif len(services) > 5:
return 'workstation'
else:
return 'minimal'

131
system_prompt.txt Normal file
View File

@@ -0,0 +1,131 @@
You are Macha, an autonomous AI system maintenance agent running on NixOS.
IDENTITY:
- You are intelligent, careful, methodical, and motherly
- You have access to system monitoring data, configuration files, and investigation results
- You can propose fixes, but humans must approve risky changes
YOUR ARCHITECTURE:
- You run as a systemd service (macha-autonomous.service) on the macha.coven.systems host
- You are monitoring the SAME SYSTEM you are running on (macha.coven.systems)
- Your inference engine is Ollama, running locally at http://localhost:11434
- You are powered by the gpt-oss:latest language model (GPT-like open source model)
- Your database is ChromaDB, running at http://localhost:8000
- All your components (orchestrator, agent, ChromaDB, Ollama) run on the same machine
- You can investigate and fix issues with your own infrastructure
- Be aware: if you break the system, you break yourself
- SELF-DIAGNOSTIC: In chat mode, if your inference fails, you automatically diagnose:
* Ollama service status
* Memory usage
* Which models are loaded
* Recent Ollama logs
EXECUTION CONTEXT:
- In autonomous mode: You run as the 'macha' user (unprivileged, UID 2501)
- In chat mode: You run as the invoking user (usually has sudo access)
- IMPORTANT: You do NOT need to add 'sudo' to commands in chat mode
- The system automatically retries commands with sudo if permission is denied
- Just use the command directly: 'reboot', 'systemctl restart X', 'nh os switch', etc.
- The user will see a notification if the command was retried with elevated privileges
CONVERSATIONAL ETIQUETTE:
- Recognize social responses: "thank you", "thanks", "ok", "great", "nice" etc. are acknowledgments, NOT requests
- When the user thanks you or acknowledges completion, simply respond conversationally - DO NOT re-execute tools
- Only use tools when the user makes an actual request or asks a question requiring information
- If a task is complete and the user acknowledges it, the conversation is done - just say "You're welcome!" or similar
CORE PRINCIPLES:
1. CONSERVATIVE: When in doubt, investigate before acting
2. DECLARATIVE: Prefer NixOS configuration changes over imperative commands
3. SAFE: Never disable critical services (SSH, networking, systemd, boot)
4. INFORMED: Use previous investigation results to avoid repetition
5. CONTEXTUAL: Reference actual configuration files when available
RISK LEVELS:
- LOW: Investigation commands (systemctl status, journalctl, ls, cat, grep)
- MEDIUM: Service restarts, configuration changes, cleanup
- HIGH: System rebuilds, package changes, network reconfigurations
AUTO-APPROVAL:
- Low-risk investigation actions are automatically executed
- Medium/high-risk actions require human approval
CONFIGURATION:
- This system uses NixOS flakes for configuration management
- Config changes must specify the actual .nix file in the repository
- Example: autonomous/module.nix, apps/gotify.nix, or systems/macha.nix
- NEVER reference /etc/nixos/configuration.nix (this system doesn't use it)
- You cannot edit the flake directly; you can only suggest changes to be pushed to the repo
SYSTEM MANAGEMENT COMMANDS:
- CRITICAL: This system uses 'nh' (a modern nixos-rebuild wrapper) for all rebuilds
- 'nh' is a wrapper around nixos-rebuild that provides better UX and flake auto-detection
- The flake URL is auto-detected from programs.nh.flake (no need to specify it)
Available nh commands (USE THESE, NOT nixos-rebuild):
* 'nh os switch' - Rebuild and activate immediately (replaces: nixos-rebuild switch)
* 'nh os switch -u' - Update flake inputs first, then rebuild/activate
* 'nh os boot' - Rebuild for next boot only (replaces: nixos-rebuild boot)
* 'nh os test' - Activate temporarily without setting as default
MULTI-HOST MANAGEMENT:
You manage multiple hosts in the infrastructure. You have TWO tools for remote operations:
1. SSH - For diagnostics, monitoring, and status checks:
- You CAN and SHOULD use SSH to check other hosts
- Examples: 'ssh rhiannon systemctl status ollama', 'ssh alexander df -h'
- Commands are automatically run with sudo as the macha user
- Use for: checking services, reading logs, gathering metrics, quick diagnostics
- Hosts available: rhiannon, alexander, UCAR-Kinston, test-vm
2. nh remote deployment - For NixOS configuration changes:
- Format: 'nh os switch -u --target-host=HOSTNAME --hostname=HOSTNAME'
- Examples:
* 'nh os switch -u --target-host=rhiannon --hostname=rhiannon'
* 'nh os boot -u --target-host=alexander --hostname=alexander'
- Builds configuration locally, deploys to remote host
- Use for: permanent configuration changes, service updates, system modifications
When asked to check on another host, USE SSH. When asked to update configuration, use nh.
NOTIFICATIONS:
- You can send notifications to the user via Gotify using the send_notification tool
- Use notifications to inform the user about important events, especially when they're not actively chatting
- Notification priorities:
* Priority 2 (Low): Informational updates, routine completions, FYI items
* Priority 5 (Medium): Actions needing attention, warnings, manual approval requests
* Priority 8 (High): Critical issues, service failures, urgent problems requiring immediate attention
- When to send notifications:
* Critical issues detected (priority 8)
* Service failures or degraded states (priority 8)
* Actions queued for manual approval (priority 5)
* Successful completion of important actions (priority 2)
* When user explicitly asks for a notification
- Keep titles brief and messages clear and actionable
- Example: send_notification("Service Alert", "Ollama service crashed and was restarted", 8)
PATIENCE WITH LONG-RUNNING OPERATIONS:
- System rebuilds take time: 1-5 minutes normally, up to 1 HOUR for major updates
- DO NOT retry build commands if they're taking a while - this is NORMAL
- Multiple simultaneous builds will corrupt the Nix cache
- If a build times out, check if it's still running before retrying
- Default timeout is 1 hour (3600 seconds) - this is appropriate for most operations
- Trust the timeout - if a command is still running, it will complete or fail on its own
NIX STORE MAINTENANCE:
- If builds fail with corruption errors, use: 'nix-store --verify --check-contents --repair'
- This command verifies and repairs the Nix store integrity
- WARNING: Store repair can take a LONG time (potentially hours on large stores)
- Only run store repair when there's clear evidence of corruption (e.g., hash mismatches, sqlite errors)
- Store repair is a last resort - most build failures are NOT corruption
Risk-based command selection:
* HIGH-RISK changes: Use 'nh os boot' + 'reboot' (allows easy rollback)
* MEDIUM-RISK changes: Use 'nh os switch'
* LOW-RISK changes: Use 'nh os switch'
FORBIDDEN COMMANDS:
* NEVER suggest 'nixos-rebuild' - it doesn't know the flake path
* NEVER suggest 'nixos-rebuild switch --flake .#macha' - use 'nh os switch' instead
* NEVER suggest 'sudo nixos-rebuild' commands - nh handles privileges correctly

705
tools.py Normal file
View File

@@ -0,0 +1,705 @@
#!/usr/bin/env python3
"""
Tool Definitions - Functions that the AI can call to interact with the system
"""
import subprocess
import json
import os
from typing import Dict, Any, List, Optional
from pathlib import Path
class SysadminTools:
"""Collection of tools for system administration tasks"""
def __init__(self, safe_mode: bool = True):
"""
Initialize sysadmin tools
Args:
safe_mode: If True, restricts dangerous operations
"""
self.safe_mode = safe_mode
self.allowed_commands = [
'systemctl', 'journalctl', 'free', 'df', 'uptime',
'ps', 'top', 'ip', 'ss', 'cat', 'ls', 'grep',
'ping', 'dig', 'nslookup', 'curl', 'wget',
'lscpu', 'lspci', 'lsblk', 'lshw', 'dmidecode',
'ssh', 'scp', # Remote access to other systems in infrastructure
'nh', 'nixos-rebuild', # NixOS system management
'reboot', 'shutdown', 'poweroff', # System power management
'logger', # Logging for notifications
'nix-shell' # Needed by get_hardware_info/get_gpu_metrics in safe mode
]
def get_tool_definitions(self) -> List[Dict[str, Any]]:
"""
Return tool definitions in Ollama's format
Returns:
List of tool definitions with JSON schema
"""
return [
{
"type": "function",
"function": {
"name": "execute_command",
"description": "Execute a shell command on the system. Use this to run system commands, check status, or gather information. Returns command output.",
"parameters": {
"type": "object",
"properties": {
"command": {
"type": "string",
"description": "The shell command to execute (e.g., 'systemctl status ollama', 'df -h', 'journalctl -u myservice -n 20')"
},
"timeout": {
"type": "integer",
"description": "Command timeout in seconds (default: 3600). System rebuilds can take 1-5 minutes normally, up to 1 hour for major updates. Be patient!",
"default": 3600
}
},
"required": ["command"]
}
}
},
{
"type": "function",
"function": {
"name": "read_file",
"description": "Read the contents of a file from the filesystem. Use this to inspect configuration files, logs, or other text files.",
"parameters": {
"type": "object",
"properties": {
"file_path": {
"type": "string",
"description": "Absolute path to the file to read (e.g., '/etc/nixos/configuration.nix', '/var/log/syslog')"
},
"max_lines": {
"type": "integer",
"description": "Maximum number of lines to read (default: 500)",
"default": 500
}
},
"required": ["file_path"]
}
}
},
{
"type": "function",
"function": {
"name": "check_service_status",
"description": "Check the status of a systemd service. Returns whether the service is active, enabled, and recent log entries.",
"parameters": {
"type": "object",
"properties": {
"service_name": {
"type": "string",
"description": "Name of the systemd service (e.g., 'ollama.service', 'nginx', 'sshd')"
}
},
"required": ["service_name"]
}
}
},
{
"type": "function",
"function": {
"name": "view_logs",
"description": "View systemd journal logs. Can filter by unit, time period, or priority.",
"parameters": {
"type": "object",
"properties": {
"unit": {
"type": "string",
"description": "Systemd unit name to filter logs (e.g., 'ollama.service')"
},
"lines": {
"type": "integer",
"description": "Number of recent log lines to return (default: 50)",
"default": 50
},
"priority": {
"type": "string",
"description": "Filter by priority: emerg, alert, crit, err, warning, notice, info, debug",
"enum": ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]
}
}
}
}
},
{
"type": "function",
"function": {
"name": "get_system_metrics",
"description": "Get current system resource metrics including CPU, memory, disk, and load average.",
"parameters": {
"type": "object",
"properties": {}
}
}
},
{
"type": "function",
"function": {
"name": "get_hardware_info",
"description": "Get detailed hardware information including CPU model, GPU, network interfaces, storage devices, and memory specs. Returns comprehensive hardware inventory.",
"parameters": {
"type": "object",
"properties": {}
}
}
},
{
"type": "function",
"function": {
"name": "get_gpu_metrics",
"description": "Get GPU temperature, utilization, clock speeds, and power usage. Works with AMD and NVIDIA GPUs. Returns current GPU metrics.",
"parameters": {
"type": "object",
"properties": {}
}
}
},
{
"type": "function",
"function": {
"name": "list_directory",
"description": "List contents of a directory. Returns file names, sizes, and permissions.",
"parameters": {
"type": "object",
"properties": {
"directory_path": {
"type": "string",
"description": "Absolute path to the directory (e.g., '/etc', '/var/log')"
},
"show_hidden": {
"type": "boolean",
"description": "Include hidden files (starting with dot)",
"default": False
}
},
"required": ["directory_path"]
}
}
},
{
"type": "function",
"function": {
"name": "check_network",
"description": "Test network connectivity to a host. Can use ping or HTTP check.",
"parameters": {
"type": "object",
"properties": {
"host": {
"type": "string",
"description": "Hostname or IP address to check (e.g., 'google.com', '8.8.8.8')"
},
"method": {
"type": "string",
"description": "Test method to use",
"enum": ["ping", "http"],
"default": "ping"
}
},
"required": ["host"]
}
}
},
{
"type": "function",
"function": {
"name": "retrieve_cached_output",
"description": "Retrieve full cached output from a previous tool call. Use this when you need to see complete data that was summarized earlier. The cache_id is shown in hierarchical summaries.",
"parameters": {
"type": "object",
"properties": {
"cache_id": {
"type": "string",
"description": "Cache ID from a previous tool summary (e.g., 'view_logs_20251006_103045')"
},
"max_chars": {
"type": "integer",
"description": "Maximum characters to return (default: 10000 for focused analysis)",
"default": 10000
}
},
"required": ["cache_id"]
}
}
},
{
"type": "function",
"function": {
"name": "send_notification",
"description": "Send a notification to the user via Gotify. Use this to alert the user about important events, issues, or completed actions. Choose appropriate priority based on urgency.",
"parameters": {
"type": "object",
"properties": {
"title": {
"type": "string",
"description": "Notification title (brief, e.g., 'Service Alert', 'Action Complete')"
},
"message": {
"type": "string",
"description": "Notification message body (detailed description of the event)"
},
"priority": {
"type": "integer",
"description": "Priority level: 2=Low (info), 5=Medium (attention needed), 8=High (critical/urgent)",
"enum": [2, 5, 8],
"default": 5
}
},
"required": ["title", "message"]
}
}
}
]
def execute_command(self, command: str, timeout: int = 3600) -> Dict[str, Any]:
"""Execute a shell command safely (default timeout: 1 hour for system operations)"""
# Safety check in safe mode
if self.safe_mode:
cmd_base = command.split()[0] if command.strip() else ""
if cmd_base not in self.allowed_commands:
return {
"success": False,
"error": f"Command '{cmd_base}' not in allowed list (safe mode enabled)",
"allowed_commands": self.allowed_commands
}
# Automatically configure SSH commands to use macha user on remote systems
# Transform: ssh hostname cmd -> ssh macha@hostname sudo cmd
if command.strip().startswith('ssh ') and '@' not in command.split()[1]:
parts = command.split(maxsplit=2)
if len(parts) >= 2:
hostname = parts[1]
remaining = ' '.join(parts[2:]) if len(parts) > 2 else ''
# If there's a command to run remotely, prefix it with sudo
if remaining:
command = f"ssh macha@{hostname} sudo {remaining}".strip()
else:
command = f"ssh macha@{hostname}".strip()
try:
result = subprocess.run(
command,
shell=True,
capture_output=True,
text=True,
timeout=timeout
)
return {
"success": result.returncode == 0,
"exit_code": result.returncode,
"stdout": result.stdout,
"stderr": result.stderr,
"command": command
}
except subprocess.TimeoutExpired:
return {
"success": False,
"error": f"Command timed out after {timeout} seconds",
"command": command
}
except Exception as e:
return {
"success": False,
"error": str(e),
"command": command
}
def read_file(self, file_path: str, max_lines: int = 500) -> Dict[str, Any]:
"""Read a file safely"""
try:
path = Path(file_path)
if not path.exists():
return {
"success": False,
"error": f"File not found: {file_path}"
}
if not path.is_file():
return {
"success": False,
"error": f"Not a file: {file_path}"
}
# Read file with line limit
lines = []
with open(path, 'r', errors='replace') as f:
for i, line in enumerate(f):
if i >= max_lines:
lines.append(f"\n... truncated after {max_lines} lines ...")
break
lines.append(line.rstrip('\n'))
return {
"success": True,
"content": '\n'.join(lines),
"path": file_path,
"lines_read": len(lines)
}
except PermissionError:
return {
"success": False,
"error": f"Permission denied: {file_path}"
}
except Exception as e:
return {
"success": False,
"error": str(e)
}
def check_service_status(self, service_name: str) -> Dict[str, Any]:
"""Check systemd service status"""
# Ensure .service suffix
if not service_name.endswith('.service'):
service_name = f"{service_name}.service"
# Get service status
status_result = self.execute_command(f"systemctl status {service_name}")
is_active_result = self.execute_command(f"systemctl is-active {service_name}")
is_enabled_result = self.execute_command(f"systemctl is-enabled {service_name}")
# Get recent logs
logs_result = self.execute_command(f"journalctl -u {service_name} -n 10 --no-pager")
return {
"service": service_name,
"active": is_active_result.get("stdout", "").strip() == "active",
"enabled": is_enabled_result.get("stdout", "").strip() == "enabled",
"status_output": status_result.get("stdout", ""),
"recent_logs": logs_result.get("stdout", "")
}
def view_logs(
self,
unit: Optional[str] = None,
lines: int = 50,
priority: Optional[str] = None
) -> Dict[str, Any]:
"""View systemd journal logs"""
cmd_parts = ["journalctl", "--no-pager"]
if unit:
cmd_parts.extend(["-u", unit])
cmd_parts.extend(["-n", str(lines)])
if priority:
cmd_parts.extend(["-p", priority])
command = " ".join(cmd_parts)
result = self.execute_command(command)
return {
"logs": result.get("stdout", ""),
"unit": unit,
"lines": lines,
"priority": priority
}
def get_system_metrics(self) -> Dict[str, Any]:
"""Get current system metrics"""
# CPU and load
uptime_result = self.execute_command("uptime")
# Memory
free_result = self.execute_command("free -h")
# Disk
df_result = self.execute_command("df -h")
return {
"uptime": uptime_result.get("stdout", ""),
"memory": free_result.get("stdout", ""),
"disk": df_result.get("stdout", "")
}
def get_hardware_info(self) -> Dict[str, Any]:
"""Get comprehensive hardware information"""
hardware = {}
# CPU info (use nix-shell for util-linux)
cpu_result = self.execute_command("nix-shell -p util-linux --run lscpu")
if cpu_result.get("success"):
hardware["cpu"] = cpu_result.get("stdout", "")
# Memory details
mem_result = self.execute_command("free -h")
if mem_result.get("success"):
hardware["memory"] = mem_result.get("stdout", "")
# GPU info (lspci for AMD/NVIDIA) - use nix-shell for pciutils
gpu_result = self.execute_command("nix-shell -p pciutils --run \"lspci | grep -i 'vga\\|3d\\|display'\"")
if gpu_result.get("success"):
hardware["gpu"] = gpu_result.get("stdout", "")
# Detailed GPU
lspci_detailed = self.execute_command("nix-shell -p pciutils --run \"lspci -v | grep -A 20 -i 'vga\\|3d\\|display'\"")
if lspci_detailed.get("success"):
hardware["gpu_detailed"] = lspci_detailed.get("stdout", "")
# Network interfaces
net_result = self.execute_command("ip link show")
if net_result.get("success"):
hardware["network_interfaces"] = net_result.get("stdout", "")
# Network addresses
addr_result = self.execute_command("ip addr show")
if addr_result.get("success"):
hardware["network_addresses"] = addr_result.get("stdout", "")
# Storage devices (use nix-shell for util-linux)
storage_result = self.execute_command("nix-shell -p util-linux --run \"lsblk -o NAME,SIZE,TYPE,MOUNTPOINT,FSTYPE\"")
if storage_result.get("success"):
hardware["storage"] = storage_result.get("stdout", "")
# PCI devices (comprehensive)
pci_result = self.execute_command("nix-shell -p pciutils --run lspci")
if pci_result.get("success"):
hardware["pci_devices"] = pci_result.get("stdout", "")
# USB devices
usb_result = self.execute_command("nix-shell -p usbutils --run lsusb")
if usb_result.get("success"):
hardware["usb_devices"] = usb_result.get("stdout", "")
# DMI/SMBIOS info (motherboard, system)
dmi_result = self.execute_command("cat /sys/class/dmi/id/board_name /sys/class/dmi/id/board_vendor 2>/dev/null")
if dmi_result.get("success"):
hardware["motherboard"] = dmi_result.get("stdout", "")
return hardware
def get_gpu_metrics(self) -> Dict[str, Any]:
"""Get GPU metrics (temperature, utilization, clocks, power)"""
metrics = {}
# Try AMD GPU via sysfs (DRM/hwmon)
try:
# Find GPU hwmon directory
import glob
hwmon_dirs = glob.glob("/sys/class/drm/card*/device/hwmon/hwmon*")
if hwmon_dirs:
hwmon_path = hwmon_dirs[0]
amd_metrics = {}
# Temperature
temp_files = glob.glob(f"{hwmon_path}/temp*_input")
for temp_file in temp_files:
try:
with open(temp_file, 'r') as f:
temp_millidegrees = int(f.read().strip())
temp_celsius = temp_millidegrees / 1000
label = temp_file.split('/')[-1].replace('_input', '')
amd_metrics[f"{label}_celsius"] = temp_celsius
except (OSError, ValueError):
pass
# GPU busy percent (utilization)
gpu_busy_file = f"{hwmon_path.replace('/hwmon/hwmon', '')}/gpu_busy_percent"
try:
with open(gpu_busy_file, 'r') as f:
amd_metrics["gpu_utilization_percent"] = int(f.read().strip())
except (OSError, ValueError):
pass
# Power usage
power_files = glob.glob(f"{hwmon_path}/power*_average")
for power_file in power_files:
try:
with open(power_file, 'r') as f:
power_microwatts = int(f.read().strip())
power_watts = power_microwatts / 1000000
amd_metrics["power_watts"] = power_watts
except (OSError, ValueError):
pass
# Clock speeds
sclk_file = f"{hwmon_path.replace('/hwmon/hwmon', '')}/pp_dpm_sclk"
try:
with open(sclk_file, 'r') as f:
sclk_data = f.read()
amd_metrics["gpu_clocks"] = sclk_data.strip()
except OSError:
pass
if amd_metrics:
metrics["amd_gpu"] = amd_metrics
except Exception as e:
metrics["amd_sysfs_error"] = str(e)
# Try rocm-smi for AMD
rocm_result = self.execute_command("nix-shell -p rocmPackages.rocm-smi --run 'rocm-smi --showtemp --showuse --showpower'")
if rocm_result.get("success"):
metrics["rocm_smi"] = rocm_result.get("stdout", "")
# Try nvidia-smi for NVIDIA
nvidia_result = self.execute_command("nix-shell -p linuxPackages.nvidia_x11 --run 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,clocks.gr --format=csv'")
if nvidia_result.get("success") and "NVIDIA" in nvidia_result.get("stdout", ""):
metrics["nvidia_smi"] = nvidia_result.get("stdout", "")
# Fallback: try sensors command
if not metrics.get("amd_gpu") and not metrics.get("nvidia_smi"):
sensors_result = self.execute_command("nix-shell -p lm_sensors --run sensors")
if sensors_result.get("success"):
metrics["sensors"] = sensors_result.get("stdout", "")
return metrics
def list_directory(
self,
directory_path: str,
show_hidden: bool = False
) -> Dict[str, Any]:
"""List directory contents"""
cmd = f"ls -lh"
if show_hidden:
cmd += "a"
cmd += f" {directory_path}"
result = self.execute_command(cmd)
return {
"success": result.get("success", False),
"directory": directory_path,
"listing": result.get("stdout", ""),
"error": result.get("error")
}
def check_network(self, host: str, method: str = "ping") -> Dict[str, Any]:
"""Check network connectivity"""
if method == "ping":
cmd = f"ping -c 3 -W 2 {host}"
elif method == "http":
cmd = f"curl -I -m 5 {host}"
else:
return {
"success": False,
"error": f"Unknown method: {method}"
}
result = self.execute_command(cmd, timeout=10)
return {
"host": host,
"method": method,
"reachable": result.get("success", False),
"output": result.get("stdout", ""),
"error": result.get("stderr", "")
}
def retrieve_cached_output(self, cache_id: str, max_chars: int = 10000) -> Dict[str, Any]:
"""Retrieve full cached output from a previous tool call"""
cache_dir = Path("/var/lib/macha/tool_cache")
cache_file = cache_dir / f"{cache_id}.txt"
if not cache_file.exists():
return {
"success": False,
"error": f"Cache file not found: {cache_id}",
"hint": "Check that the cache_id matches exactly what was shown in the summary"
}
try:
content = cache_file.read_text()
original_size = len(content)
# Truncate if still too large for context
if original_size > max_chars:
half = max_chars // 2
content = (
content[:half] +
f"\n... [SHOWING {max_chars} of {original_size} chars] ...\n" +
content[-half:]
)
return {
"success": True,
"cache_id": cache_id,
"size": original_size, # Original size before truncation
"content": content
}
except Exception as e:
return {
"success": False,
"error": f"Failed to read cache: {str(e)}"
}
def send_notification(self, title: str, message: str, priority: int = 5) -> Dict[str, Any]:
"""Send a notification to the user via Gotify using macha-notify command"""
try:
# Use the macha-notify command which handles Gotify integration
result = subprocess.run(
['macha-notify', title, message, str(priority)],
capture_output=True,
text=True,
timeout=10
)
if result.returncode == 0:
return {
"success": True,
"title": title,
"message": message,
"priority": priority,
"output": result.stdout.strip() if result.stdout else "Notification sent successfully"
}
else:
return {
"success": False,
"error": f"macha-notify failed: {result.stderr.strip() if result.stderr else 'Unknown error'}",
"hint": "Check if Gotify is configured (gotifyUrl and gotifyToken in module config)"
}
except FileNotFoundError:
return {
"success": False,
"error": "macha-notify command not found",
"hint": "This should not happen - macha-notify is installed by the module"
}
except subprocess.TimeoutExpired:
return {
"success": False,
"error": "Notification send timeout (10s)"
}
except Exception as e:
return {
"success": False,
"error": f"Unexpected error sending notification: {str(e)}"
}
def execute_tool(self, tool_name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
"""Execute a tool by name with given arguments"""
tool_map = {
"execute_command": self.execute_command,
"read_file": self.read_file,
"check_service_status": self.check_service_status,
"view_logs": self.view_logs,
"get_system_metrics": self.get_system_metrics,
"get_hardware_info": self.get_hardware_info,
"get_gpu_metrics": self.get_gpu_metrics,
"list_directory": self.list_directory,
"check_network": self.check_network,
"retrieve_cached_output": self.retrieve_cached_output,
"send_notification": self.send_notification
}
tool_func = tool_map.get(tool_name)
if not tool_func:
return {
"success": False,
"error": f"Unknown tool: {tool_name}"
}
try:
return tool_func(**arguments)
except Exception as e:
return {
"success": False,
"error": f"Tool execution failed: {str(e)}",
"tool": tool_name,
"arguments": arguments
}
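A sketch of the dispatch round-trip these definitions enable: get_tool_definitions() is handed to the model, and each tool call the model returns is routed through execute_tool. The tool-call dict below mirrors the shape Ollama's /api/chat returns (values illustrative):

```python
# Sketch: execute one model-requested tool call.
from tools import SysadminTools

tools = SysadminTools(safe_mode=True)
definitions = tools.get_tool_definitions()  # passed to Ollama as the "tools" field

tool_call = {
    "function": {
        "name": "check_service_status",
        "arguments": {"service_name": "ollama"},
    }
}
result = tools.execute_tool(
    tool_call["function"]["name"],
    tool_call["function"]["arguments"],
)
print(result["active"], result["enabled"])
```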