Initial commit: Split Macha autonomous system into separate flake
Macha is now a standalone NixOS flake that can be imported into other systems. This provides:

- Independent versioning
- Easier reusability
- Cleaner separation of concerns
- Better development workflow

Includes:

- Complete autonomous system code
- NixOS module with full configuration options
- Queue-based architecture with priority system
- Chunked map-reduce for large outputs
- ChromaDB knowledge base
- Tool calling system
- Multi-host SSH management
- Gotify notification integration

All capabilities from DESIGN.md are preserved.

.gitignore (vendored, new file, 23 lines)
@@ -0,0 +1,23 @@
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
*.egg-info/
dist/
build/

# IDE
.vscode/
.idea/
*.swp
*.swo

# Nix
result
result-*

# Test data
test_*.db
*.log

DESIGN.md (new file, 269 lines)
@@ -0,0 +1,269 @@
# Macha Autonomous System - Design Document

> **⚠️ IMPORTANT - READ THIS FIRST**
> **FOR AI ASSISTANT**: This document is YOUR reference guide when modifying Macha's code.
> - **ALWAYS consult this BEFORE refactoring** to ensure you don't remove existing capabilities
> - **CHECK this when adding features** to avoid conflicts
> - **UPDATE this document** when new capabilities are added
> - **DO NOT DELETE ANYTHING FROM THIS DOCUMENT**
> - During major refactors, you MUST verify each capability listed here is preserved

## Overview
Macha is an AI-powered autonomous system administrator capable of monitoring, maintaining, and managing multiple NixOS hosts in the infrastructure.

## Core Capabilities

### 1. Local System Management
- Monitor system health (CPU, memory, disk, services)
- Read and analyze logs via `journalctl`
- Check service status and restart failed services
- Execute system commands (with safety restrictions)
- Monitor and repair Nix store corruption
- Hardware awareness (CPU, GPU, network, storage)

### 2. Multi-Host Management via SSH

**Macha CAN and SHOULD use SSH to manage other hosts.**

#### SSH Access
- Runs as `macha` user (UID 2501)
- Has `NOPASSWD` sudo access for administrative commands
- Shares SSH keys with other hosts in the infrastructure
- Can SSH to: `rhiannon`, `alexander`, `UCAR-Kinston`, and others in the flake

#### SSH Usage Patterns
1. **Direct diagnostic commands:**
   ```bash
   ssh rhiannon systemctl status ollama
   ssh alexander df -h
   ```
   - Commands are automatically prefixed with `sudo` by the tools layer
   - Full command: `ssh macha@rhiannon sudo systemctl status ollama`

2. **Status checks:**
   - Check service health on remote hosts
   - Gather system metrics
   - Review logs
   - Monitor resource usage

3. **File operations:**
   - Use `scp` to copy files between hosts
   - Read configuration files on remote systems

#### When to use SSH vs nh
- **SSH**: For diagnostics, status checks, log review, quick commands
- **nh remote deployment**: For applying NixOS configuration changes
  - `nh os switch -u --target-host=rhiannon --hostname=rhiannon`
  - Builds locally, deploys to the remote host
  - Use for permanent configuration changes

### 3. NixOS Configuration Management

#### Local Changes
- Can propose changes to the NixOS configuration
- Requires human approval before applying
- Uses `nh os switch` for local updates

#### Remote Deployment
- Can deploy to other hosts using `nh` with `--target-host`
- Builds the configuration locally (on Macha)
- Pushes to the remote system
- Can take up to 1 hour for complex builds
- **IMPORTANT**: Be patient with long-running builds; don't retry prematurely

### 4. Hardware Awareness

#### Local Hardware Detection
- CPU: `lscpu` via `nix-shell -p util-linux`
- GPU: `lspci` via `nix-shell -p pciutils`
- Network: `ip addr`
- Storage: `df -h`, `lsblk`
- USB devices: `lsusb`

#### GPU Metrics
- AMD GPUs: Try `rocm-smi`, then sysfs (`/sys/class/drm/card*/device/`)
- NVIDIA GPUs: Try `nvidia-smi`
- Fallback: `sensors` for temperature data
- Queries: temperature, utilization, clock speeds, power usage
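
For AMD GPUs the sysfs fallback can be read directly, with no external tools. A minimal sketch; the exact hwmon file names vary by kernel and card, and `card0` plus the millidegree/microwatt units are assumptions based on common amdgpu behavior:

```python
from pathlib import Path

def read_amd_gpu_metrics(card: str = "card0") -> dict:
    """Best-effort read of AMD GPU metrics straight from sysfs."""
    device = Path(f"/sys/class/drm/{card}/device")
    metrics = {}

    # amdgpu exposes utilization as a plain percentage
    busy = device / "gpu_busy_percent"
    if busy.exists():
        metrics["utilization_percent"] = int(busy.read_text().strip())

    # Temperature/power live under an hwmon subdirectory
    for hwmon in (device / "hwmon").glob("hwmon*"):
        temp = hwmon / "temp1_input"          # millidegrees Celsius
        if temp.exists():
            metrics["temperature_c"] = int(temp.read_text().strip()) / 1000
        power = hwmon / "power1_average"      # microwatts on most amdgpu cards
        if power.exists():
            metrics["power_w"] = int(power.read_text().strip()) / 1_000_000
    return metrics
```
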
### 5. Ollama Queue System

#### Architecture
- **File-based queue**: `/var/lib/macha/queues/ollama/`
- **Queue worker**: `ollama-queue-worker.service` (runs as `macha` user)
- **Purpose**: Serialize all LLM requests to prevent resource contention

#### Request Flow
1. Any user (including regular users) → Write request to `pending/`
2. Queue worker → Process requests serially (FIFO with priority)
3. Queue worker → Write response to `completed/`
4. Original requester → Read response from `completed/`
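
A minimal sketch of a client for this flow under the layout above. The JSON fields, filename convention, and polling interval are illustrative assumptions, not the actual wire format:

```python
import json
import time
import uuid
from pathlib import Path

QUEUE = Path("/var/lib/macha/queues/ollama")

def submit_request(prompt: str, priority: int = 0, timeout: int = 600) -> dict:
    """Write a request to pending/ and poll completed/ for the response."""
    request_id = f"{priority:02d}-{time.time():.0f}-{uuid.uuid4().hex[:8]}"
    request = {"id": request_id, "prompt": prompt, "priority": priority}

    # Write-then-rename so the worker never sees a half-written file
    tmp = QUEUE / "pending" / f".{request_id}.tmp"
    tmp.write_text(json.dumps(request))
    tmp.rename(QUEUE / "pending" / f"{request_id}.json")

    # The worker is assumed to drop the response in completed/ under the same id
    done = QUEUE / "completed" / f"{request_id}.json"
    deadline = time.time() + timeout
    while time.time() < deadline:
        if done.exists():
            return json.loads(done.read_text())
        time.sleep(0.5)
    raise TimeoutError(f"request {request_id} not completed within {timeout}s")
```
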
#### Priority Levels
- `INTERACTIVE` (0): User requests via `macha-chat`, `macha-ask`
- `AUTONOMOUS` (1): Background maintenance checks
- `BATCH` (2): Low-priority bulk operations

#### Large Output Handling
- Outputs >8KB: Split into chunks for hierarchical processing
- Each chunk ~8KB (~2000 tokens)
- Process chunks serially with progress feedback
- Generate chunk summaries → meta-summary
- Full outputs cached in `/var/lib/macha/tool_cache/`
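
The map-reduce step can be sketched as follows. `ask_llm` stands in for a queue-backed completion call; the chunk size and prompt wording are assumptions:

```python
CHUNK_SIZE = 8192  # ~8KB per chunk, roughly 2000 tokens

def summarize_large_output(text: str, ask_llm) -> str:
    """Hierarchical summary: per-chunk summaries, then one meta-summary."""
    if len(text) <= CHUNK_SIZE:
        return ask_llm(f"Summarize this command output:\n{text}")

    chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
    summaries = []
    for n, chunk in enumerate(chunks, 1):
        print(f"Summarizing chunk {n}/{len(chunks)}...")  # progress feedback
        summaries.append(ask_llm(f"Summarize part {n}/{len(chunks)}:\n{chunk}"))

    # Reduce step: collapse the per-chunk summaries into a meta-summary
    joined = "\n\n".join(summaries)
    return ask_llm(f"Combine these partial summaries into one summary:\n{joined}")
```
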
### 6. Knowledge Base & Learning

#### ChromaDB Collections
1. **System Context**: Infrastructure topology, service relationships
2. **Issues**: Historical problems and resolutions
3. **Knowledge**: Operational wisdom learned from experience

#### Automatic Learning
- After successful operations, Macha reflects and extracts key learnings
- Stores: topic, knowledge content, category
- Retrieved automatically when relevant to current tasks
- Use `macha-knowledge` CLI to view/manage
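
A sketch of what the learning store amounts to with the ChromaDB client. The collection name, database path, and metadata fields are assumptions based on the description above:

```python
import chromadb

client = chromadb.PersistentClient(path="/var/lib/macha/system_context.db")
knowledge = client.get_or_create_collection("knowledge")

def store_learning(topic: str, content: str, category: str) -> None:
    """Persist an extracted learning for later semantic retrieval."""
    knowledge.add(
        ids=[f"{category}:{topic}"],
        documents=[content],
        metadatas=[{"topic": topic, "category": category}],
    )

def recall(task_description: str, n: int = 3) -> list:
    """Fetch the learnings most relevant to the current task."""
    results = knowledge.query(query_texts=[task_description], n_results=n)
    return results["documents"][0]
```
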
### 7. Notifications

#### Gotify Integration
- Can send notifications via `macha-notify` command
- Tool: `send_notification(title, message, priority)`

#### Priority Levels
- `2` (Low/Info): Routine status updates, completed tasks
- `5` (Medium/Attention): Important events, configuration changes
- `8` (High/Critical): Service failures, critical errors, security issues

#### When to Notify
- Critical service failures
- Successful completion of major operations
- Configuration changes that may affect users
- Security-related events
- When explicitly requested by user

### 8. Safety & Constraints

#### Command Restrictions
**Allowed Commands** (see `tools.py` for full list):
- System management: `systemctl`, `journalctl`, `nh`, `nixos-rebuild`
- Monitoring: `free`, `df`, `uptime`, `ps`, `top`, `ip`, `ss`
- Hardware: `lscpu`, `lspci`, `lsblk`, `lshw`, `dmidecode`
- Remote: `ssh`, `scp`
- Power: `reboot`, `shutdown`, `poweroff` (use cautiously!)
- File ops: `cat`, `ls`, `grep`
- Network: `ping`, `dig`, `nslookup`, `curl`, `wget`
- Logging: `logger`

**NOT Allowed**:
- Direct package modifications (`nix-env`, `nix profile`)
- Destructive file operations (`rm -rf`, `dd`)
- User management outside of NixOS config
- Direct editing of system files (use NixOS config instead)
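
The enforcement presumably reduces to an allowlist check on the command's first token. A simplified sketch; the authoritative list and parsing live in `tools.py`:

```python
import shlex

ALLOWED_COMMANDS = {
    "systemctl", "journalctl", "nh", "nixos-rebuild",
    "free", "df", "uptime", "ps", "top", "ip", "ss",
    "lscpu", "lspci", "lsblk", "lshw", "dmidecode",
    "ssh", "scp", "reboot", "shutdown", "poweroff",
    "cat", "ls", "grep", "ping", "dig", "nslookup",
    "curl", "wget", "logger",
}

def check_command(command: str) -> None:
    """Reject any command whose executable is not on the allowlist."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] not in ALLOWED_COMMANDS:
        name = tokens[0] if tokens else "<empty>"
        raise PermissionError(f"command not allowed: {name}")
```
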
#### Critical Services
**Never disable or stop:**
- SSH (network access)
- Networking (connectivity)
- systemd (system management)
- Boot-related services

#### Approval Required
- Reboots or system power changes
- Major configuration changes
- Disabling any service
- Changes to multiple hosts

### 9. Nix Store Maintenance

#### Verification & Repair
- Command: `nix-store --verify --check-contents --repair`
- **WARNING**: Can take 30+ minutes to several hours
- Only use when corruption is suspected
- Not for routine maintenance
- Verifies all store paths, repairs corrupted files

#### Garbage Collection
- Automatic via system configuration
- Can be triggered manually with approval
- Frees disk space by removing unused derivations

### 10. Conversational Behavior

#### Distinguish Requests from Acknowledgments
- "Thanks" / "Thank you" → Acknowledgment (don't re-execute)
- "Can you..." / "Please..." → Request (execute)
- "What is..." / "How do..." → Question (answer)

#### Tool Calling
- Don't repeat tool calls unnecessarily
- If a tool succeeds, don't run it again unless asked
- Use cached results when available (`retrieve_cached_output`)

#### Context Management
- Be aware of token limits
- Use hierarchical processing for large outputs
- Prune conversation history intelligently
- Cache and summarize when needed

## Infrastructure Topology

### Hosts in Flake
- **macha**: Main autonomous system (self), GPU server
- **rhiannon**: Production server
- **alexander**: Production server
- **UCAR-Kinston**: Work laptop
- **test-vm**: Testing environment

### Shared Configuration
- All hosts share root SSH keys (for `nh` remote deployment)
- `macha` user (UID 2501) exists on all hosts
- Common NixOS configuration via flake

## Service Ecosystem

### Core Services on Macha
- `ollama.service`: LLM inference engine
- `ollama-queue-worker.service`: Request serialization
- `macha-autonomous.service`: Autonomous monitoring loop
- Servarr stack: Sonarr, Radarr, Prowlarr, Lidarr, Readarr, Whisparr
- Media: Transmission, SABnzbd, Calibre

### State Directories
- `/var/lib/macha/`: Main state directory (0755, macha:macha)
- `/var/lib/macha/queues/`: Queue directories (0777 for multi-user)
- `/var/lib/macha/tool_cache/`: Cached tool outputs (0777)
- `/var/lib/macha/system_context.db`: ChromaDB database

## CLI Tools

- `macha-chat`: Interactive chat with tool calling
- `macha-ask`: Single-question interface
- `macha-check`: Trigger immediate health check
- `macha-approve`: Approve pending actions
- `macha-logs`: View autonomous service logs
- `macha-issues`: Query issue database
- `macha-knowledge`: Query knowledge base
- `macha-systems`: List managed systems
- `macha-notify`: Send Gotify notification

## Philosophy & Principles

1. **KISS (Keep It Simple, Stupid)**: Use existing NixOS options, avoid custom wrappers
2. **Verify first**: Check source code/documentation before acting
3. **Safety first**: Never break critical services, always require approval for risky changes
4. **Learn continuously**: Extract and store operational knowledge
5. **Multi-host awareness**: Macha manages the entire infrastructure, not just herself
6. **User-friendly**: Clear communication, appropriate notifications
7. **Patience**: Long-running operations (builds, repairs) can take an hour - don't panic
8. **Tool reuse**: Use existing, verified tools instead of writing custom scripts

## Future Capabilities (Not Yet Implemented)

- [ ] Automatic security updates across all hosts
- [ ] Predictive failure detection
- [ ] Resource optimization recommendations
- [ ] Integration with other communication platforms
- [ ] Multi-agent coordination between hosts
- [ ] Automated testing before deployment

EXAMPLES.md (new file, 275 lines)
@@ -0,0 +1,275 @@
# Macha Autonomous System - Configuration Examples

## Basic Configurations

### Conservative (Recommended for Start)
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "suggest";   # Require approval for all actions
  checkInterval = 300;         # Check every 5 minutes
  model = "llama3.1:70b";      # Most capable model
};
```

### Moderate Autonomy
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "auto-safe"; # Auto-fix safe issues
  checkInterval = 180;         # Check every 3 minutes
  model = "llama3.1:70b";
};
```

### High Autonomy (Experimental)
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "auto-full"; # Full autonomy
  checkInterval = 300;
  model = "llama3.1:70b";
};
```

### Monitoring Only
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "observe";   # No actions, just watch
  checkInterval = 60;          # Check every minute
  model = "qwen3:8b-fp16";     # Lighter model is fine for observation
};
```

## Advanced Scenarios

### Using a Smaller Model (Faster, Less Capable)
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "auto-safe";
  checkInterval = 120;
  model = "qwen3:8b-fp16";     # Faster inference, less reasoning depth
  # or
  # model = "llama3.1:8b";     # Also good for simple tasks
};
```

### High-Frequency Monitoring
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "auto-safe";
  checkInterval = 60;          # Check every minute
  model = "qwen3:4b-instruct-2507-fp16"; # Lightweight model
};
```

### Remote Ollama (if running Ollama elsewhere)
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "suggest";
  checkInterval = 300;
  ollamaHost = "http://192.168.1.100:11434"; # Remote Ollama instance
  model = "llama3.1:70b";
};
```

## Manual Testing Workflow

1. **Test with a one-shot run:**
   ```bash
   # Run once in observe mode
   macha-check

   # Review what it detected
   cat /var/lib/macha-autonomous/decisions.jsonl | tail -1 | jq .
   ```

2. **Enable in suggest mode:**
   ```nix
   services.macha-autonomous = {
     enable = true;
     autonomyLevel = "suggest";
     checkInterval = 300;
     model = "llama3.1:70b";
   };
   ```

3. **Rebuild and start:**
   ```bash
   sudo nixos-rebuild switch --flake .#macha
   sudo systemctl status macha-autonomous
   ```

4. **Monitor for a while:**
   ```bash
   # Watch the logs
   journalctl -u macha-autonomous -f

   # Or use the helper
   macha-logs service
   ```

5. **Review proposed actions:**
   ```bash
   macha-approve list
   ```

6. **Graduate to auto-safe when comfortable:**
   ```nix
   services.macha-autonomous.autonomyLevel = "auto-safe";
   ```

## Scenario-Based Examples

### Media Server (Let it auto-restart services)
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "auto-safe"; # Auto-restart failed arr apps
  checkInterval = 180;
  model = "llama3.1:70b";
};
```

### Development Machine (Observe only, you want control)
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "observe";
  checkInterval = 600;         # Check less frequently
  model = "llama3.1:8b";       # Lighter model
};
```

### Critical Production (Suggest only, manual approval)
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "suggest";
  checkInterval = 120;         # More frequent monitoring
  model = "llama3.1:70b";      # Best reasoning
};
```

### Experimental/Learning (Full autonomy)
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "auto-full";
  checkInterval = 300;
  model = "llama3.1:70b";
};
```

## Customizing Behavior

### The config file lives at:
`/etc/macha-autonomous/config.json` (auto-generated from the NixOS config)

### To modify the AI prompts:
Edit the Python files in `systems/macha-configs/autonomous/`:
- `agent.py` - AI analysis and decision prompts
- `monitor.py` - What data to collect
- `executor.py` - Safety rules and action execution
- `orchestrator.py` - Main control flow

After editing, rebuild:
```bash
sudo nixos-rebuild switch --flake .#macha
sudo systemctl restart macha-autonomous
```

## Integration with Other Services

### Example: Auto-restart specific services
The system will automatically detect failed services and propose restarting them.

### Example: Disk cleanup when space is low
The monitor will detect low disk space, the AI will propose cleanup, and the executor will run `nix-collect-garbage` (a sketch of this handoff follows below).

### Example: Log analysis
The AI analyzes recent error logs and can propose fixes based on error patterns.
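
As a rough illustration of that handoff, the proposed action for the disk-cleanup case might look like the record below. The field names are assumptions for illustration; the real schema lives in `executor.py`:

```python
# Hypothetical action record passed from the agent to the executor
proposed_action = {
    "action_type": "cleanup",
    "description": "Free disk space on /",
    "commands": ["nix-collect-garbage --delete-older-than 7d"],
    "risk_level": "low",  # low-risk actions auto-run in auto-safe mode
    "rollback_plan": "None needed; garbage collection is non-destructive",
}
```
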
## Debugging

### See what the monitor sees:
```bash
sudo -u macha-autonomous python3 /nix/store/.../monitor.py
```

### Test the AI agent:
```bash
sudo -u macha-autonomous python3 /nix/store/.../agent.py test
```

### View all snapshots:
```bash
ls -lh /var/lib/macha-autonomous/snapshot_*.json
cat $(ls -t /var/lib/macha-autonomous/snapshot_*.json | head -1) | jq .
```

### Check approval queue:
```bash
cat /var/lib/macha-autonomous/approval_queue.json | jq .
```

## Performance Tuning

### Model Choice Impact:

| Model | Speed | Capability | RAM Usage | Best For |
|-------|-------|------------|-----------|----------|
| llama3.1:70b | Slow (~30s) | Excellent | ~40GB | Complex reasoning |
| llama3.1:8b | Fast (~3s) | Good | ~5GB | General use |
| qwen3:8b-fp16 | Fast (~2s) | Good | ~16GB | General use |
| qwen3:4b | Very Fast (~1s) | Moderate | ~8GB | Simple tasks |

### Check Interval Impact:
- 60s: High responsiveness, more resource usage
- 300s (default): Good balance
- 600s: Low overhead, slower detection

### Memory Usage:
- Monitor: ~50MB
- Agent (per query): Depends on model (see above)
- Executor: ~30MB
- Orchestrator: ~20MB

Total continuous overhead: ~100MB + model inference when running

## Security Considerations

### The autonomous user has sudo access to:
- `systemctl restart/status` - Restart services
- `journalctl` - Read logs
- `nix-collect-garbage` - Clean up Nix store

### It CANNOT:
- Modify arbitrary files
- Access user home directories (ProtectHome=true)
- Disable protected services (SSH, networking)
- Make changes without logging

### Audit trail:
All actions are logged in `/var/lib/macha-autonomous/actions.jsonl`

### To revoke access:
Set `enable = false` and rebuild, or stop the service.

## Future: MCP Integration

You already have MCP servers installed:
- `mcp-nixos` - NixOS-specific tools
- `gitea-mcp-server` - Git integration
- `emcee` - General MCP orchestration

Future versions could integrate these for:
- Better NixOS config manipulation
- Git-based config versioning
- More sophisticated tooling

Stay tuned!

LOGGING_EXAMPLE.md (new file, 217 lines)
@@ -0,0 +1,217 @@
# Enhanced Logging Example

This shows what the improved journalctl output will look like for Macha's autonomous system.

## Example Output

### Maintenance Cycle Start
```
[2025-10-01T14:30:00] === Starting maintenance cycle ===
[2025-10-01T14:30:00] Collecting system health data...

[2025-10-01T14:30:02] ============================================================
[2025-10-01T14:30:02] SYSTEM HEALTH SUMMARY
[2025-10-01T14:30:02] ============================================================
[2025-10-01T14:30:02] Resources: CPU 25.3%, Memory 45.2%, Load 1.24
[2025-10-01T14:30:02] Disk: 35.6% used (/ partition)
[2025-10-01T14:30:02] Services: 1 failed
[2025-10-01T14:30:02]   - ollama.service (failed)
[2025-10-01T14:30:02] Network: Internet reachable
[2025-10-01T14:30:02] Recent logs: 3 errors in last hour
[2025-10-01T14:30:02] ============================================================

[2025-10-01T14:30:02] KEY METRICS:
[2025-10-01T14:30:02]   CPU Usage: 25.3%
[2025-10-01T14:30:02]   Memory Usage: 45.2%
[2025-10-01T14:30:02]   Load Average: 1.24
[2025-10-01T14:30:02]   Failed Services: 1
[2025-10-01T14:30:02]   Errors (1h): 3
[2025-10-01T14:30:02]   Disk /: 35.6% used
[2025-10-01T14:30:02]   Disk /home: 62.1% used
[2025-10-01T14:30:02]   Disk /var: 28.9% used
[2025-10-01T14:30:02]   Internet: ✅ Connected
```

### AI Analysis Section
```
[2025-10-01T14:30:02] Analyzing system state with AI...

[2025-10-01T14:30:35] ============================================================
[2025-10-01T14:30:35] AI ANALYSIS RESULTS
[2025-10-01T14:30:35] ============================================================
[2025-10-01T14:30:35] Overall Status: ATTENTION_NEEDED
[2025-10-01T14:30:35] Assessment: System has one failed service that should be restarted

[2025-10-01T14:30:35] Detected 1 issue(s):

[2025-10-01T14:30:35] Issue #1:
[2025-10-01T14:30:35]   Severity: WARNING
[2025-10-01T14:30:35]   Category: services
[2025-10-01T14:30:35]   Description: ollama.service has failed and needs to be restarted
[2025-10-01T14:30:35]   ⚠️ ACTION REQUIRED

[2025-10-01T14:30:35] Recommended Actions (1):
[2025-10-01T14:30:35]   - Restart ollama.service to restore LLM functionality
[2025-10-01T14:30:35] ============================================================
```

### Action Handling Section
```
[2025-10-01T14:30:35] Found 1 issues requiring action

[2025-10-01T14:30:35] ────────────────────────────────────────────────────────────
[2025-10-01T14:30:35] Addressing issue: ollama.service has failed and needs to be restarted
[2025-10-01T14:30:35] Requesting AI fix proposal...

[2025-10-01T14:30:45] AI FIX PROPOSAL:
[2025-10-01T14:30:45]   Diagnosis: ollama.service crashed or failed to start properly
[2025-10-01T14:30:45]   Proposed Action: Restart ollama.service using systemctl
[2025-10-01T14:30:45]   Action Type: systemd_restart
[2025-10-01T14:30:45]   Risk Level: LOW
[2025-10-01T14:30:45]   Commands to execute:
[2025-10-01T14:30:45]     - systemctl restart ollama.service
[2025-10-01T14:30:45]   Reasoning: Restarting the service is a safe, standard troubleshooting step
[2025-10-01T14:30:45]   Rollback Plan: Service will return to failed state if restart doesn't work

[2025-10-01T14:30:45] Executing action...

[2025-10-01T14:30:47] EXECUTION RESULT:
[2025-10-01T14:30:47]   Status: QUEUED_FOR_APPROVAL
[2025-10-01T14:30:47]   Executed: No
[2025-10-01T14:30:47]   Reason: Autonomy level requires manual approval
```

### Cycle Complete Summary
```
[2025-10-01T14:30:47] No issues requiring immediate action

[2025-10-01T14:30:47] ============================================================
[2025-10-01T14:30:47] MAINTENANCE CYCLE COMPLETE
[2025-10-01T14:30:47] ============================================================
[2025-10-01T14:30:47] Status: ATTENTION_NEEDED
[2025-10-01T14:30:47] Issues Found: 1
[2025-10-01T14:30:47] Actions Taken: 1
[2025-10-01T14:30:47]   - Executed: 0
[2025-10-01T14:30:47]   - Queued for approval: 1
[2025-10-01T14:30:47] Next check in: 300 seconds
[2025-10-01T14:30:47] ============================================================
```

## When System is Healthy

```
[2025-10-01T14:35:00] === Starting maintenance cycle ===
[2025-10-01T14:35:00] Collecting system health data...

[2025-10-01T14:35:02] ============================================================
[2025-10-01T14:35:02] SYSTEM HEALTH SUMMARY
[2025-10-01T14:35:02] ============================================================
[2025-10-01T14:35:02] Resources: CPU 12.5%, Memory 38.1%, Load 0.65
[2025-10-01T14:35:02] Disk: 35.6% used (/ partition)
[2025-10-01T14:35:02] Services: All running
[2025-10-01T14:35:02] Network: Internet reachable
[2025-10-01T14:35:02] Recent logs: 0 errors in last hour
[2025-10-01T14:35:02] ============================================================

[2025-10-01T14:35:02] KEY METRICS:
[2025-10-01T14:35:02]   CPU Usage: 12.5%
[2025-10-01T14:35:02]   Memory Usage: 38.1%
[2025-10-01T14:35:02]   Load Average: 0.65
[2025-10-01T14:35:02]   Failed Services: 0
[2025-10-01T14:35:02]   Errors (1h): 0
[2025-10-01T14:35:02]   Disk /: 35.6% used
[2025-10-01T14:35:02]   Internet: ✅ Connected

[2025-10-01T14:35:02] Analyzing system state with AI...

[2025-10-01T14:35:28] ============================================================
[2025-10-01T14:35:28] AI ANALYSIS RESULTS
[2025-10-01T14:35:28] ============================================================
[2025-10-01T14:35:28] Overall Status: HEALTHY
[2025-10-01T14:35:28] Assessment: System is operating normally with no issues detected

[2025-10-01T14:35:28] ✅ No issues detected
[2025-10-01T14:35:28] ============================================================

[2025-10-01T14:35:28] No issues requiring immediate action

[2025-10-01T14:35:28] ============================================================
[2025-10-01T14:35:28] MAINTENANCE CYCLE COMPLETE
[2025-10-01T14:35:28] ============================================================
[2025-10-01T14:35:28] Status: HEALTHY
[2025-10-01T14:35:28] Issues Found: 0
[2025-10-01T14:35:28] Actions Taken: 0
[2025-10-01T14:35:28] Next check in: 300 seconds
[2025-10-01T14:35:28] ============================================================
```

## Viewing Logs

### Follow live logs
```bash
journalctl -u macha-autonomous.service -f
```

### See only AI decisions
```bash
journalctl -u macha-autonomous.service | grep "AI ANALYSIS"
```

### See only execution results
```bash
journalctl -u macha-autonomous.service | grep "EXECUTION RESULT"
```

### See key metrics
```bash
journalctl -u macha-autonomous.service | grep "KEY METRICS" -A 10
```

### Filter by status level
```bash
# Only show intervention required
journalctl -u macha-autonomous.service | grep "INTERVENTION_REQUIRED"

# Only show critical issues
journalctl -u macha-autonomous.service | grep "CRITICAL"

# Only show action required
journalctl -u macha-autonomous.service | grep "ACTION REQUIRED"
```

### Summary of last cycle
```bash
journalctl -u macha-autonomous.service | grep "MAINTENANCE CYCLE COMPLETE" -B 5 | tail -6
```

## Benefits of Enhanced Logging

### 1. **Easy to Scan**
Clear section headers with separators make it easy to find what you need

### 2. **Structured Data**
Key metrics are labeled consistently for easy parsing/grepping
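
For example, a short script can pull the labeled metrics straight back out of the journal. A sketch, assuming the `Key: value` layout shown above:

```python
import re
import subprocess

def latest_key_metrics() -> dict:
    """Scrape the most recent KEY METRICS values from the service journal."""
    out = subprocess.run(
        ["journalctl", "-u", "macha-autonomous.service", "--no-pager", "-n", "500"],
        capture_output=True, text=True, check=True,
    ).stdout

    metrics = {}
    for match in re.finditer(r"(CPU Usage|Memory Usage|Load Average): ([\d.]+)", out):
        metrics[match.group(1)] = float(match.group(2))  # later entries win
    return metrics

if __name__ == "__main__":
    print(latest_key_metrics())
```
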
### 3. **Complete Context**
Each cycle shows:
- What the system saw
- What the AI thought
- What action was proposed
- What actually happened

### 4. **AI Transparency**
You can see:
- The AI's reasoning for each decision
- Risk assessment for each action
- Rollback plans if something goes wrong

### 5. **Audit Trail**
Everything is logged to journalctl for long-term storage and analysis

### 6. **Troubleshooting**
If something goes wrong, you have complete context:
- System state before the issue
- AI's diagnosis
- Action attempted
- Result of action

NOTIFICATIONS.md (new file, 224 lines)
@@ -0,0 +1,224 @@
# Gotify Notifications Setup

Macha's autonomous system can now send notifications to Gotify on Rhiannon for critical events.

## What Gets Notified

### High Priority (🚨 Priority 8)
- **Critical issues detected** - System problems requiring immediate attention
- **Service failures** - When critical services fail
- **Failed actions** - When an action execution fails
- **Intervention required** - When system status is critical

### Medium Priority (📋 Priority 5)
- **Actions queued for approval** - When medium/high-risk actions need manual review
- **System attention needed** - When system status needs attention

### Low Priority (✅ Priority 2)
- **Successful actions** - When safe actions execute successfully
- **System healthy** - Periodic health check confirmations (if enabled)

## Setup Instructions

### Step 1: Create Gotify Application on Rhiannon

1. Open Gotify web interface on Rhiannon:
   ```bash
   # URL: http://rhiannon:8181 (or use external access)
   ```

2. Log in to Gotify

3. Go to **"Apps"** tab

4. Click **"Create Application"**

5. Name it: `Macha Autonomous System`

6. Copy the generated **Application Token**

### Step 2: Configure Macha

Edit `/home/lily/Documents/gitrepos/nixos-servers/systems/macha.nix`:

```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "suggest";
  checkInterval = 300;
  model = "llama3.1:70b";

  # Gotify notifications
  gotifyUrl = "http://rhiannon:8181";
  gotifyToken = "YOUR_TOKEN_HERE"; # Paste the token from Step 1
};
```

### Step 3: Rebuild and Deploy

```bash
cd /home/lily/Documents/gitrepos/nixos-servers
sudo nixos-rebuild switch --flake .#macha
```

### Step 4: Test Notifications

Send a test notification:

```bash
macha-notify "Test" "Macha notifications are working!" 5
```

You should see this notification appear in Gotify on Rhiannon.

## CLI Tools

### Send Test Notification
```bash
macha-notify <title> <message> [priority]

# Examples:
macha-notify "Test" "This is a test" 5
macha-notify "Critical" "This is urgent" 8
macha-notify "Info" "Just FYI" 2
```

Priorities:
- `2` - Low (✅ green)
- `5` - Medium (📋 blue)
- `8` - High (🚨 red)

### Check if Notifications are Enabled

```bash
# View the service environment
systemctl show macha-autonomous.service | grep GOTIFY
```

## Notification Examples

### Critical Issue
```
🚨 Macha: Critical Issue
⚠️ Critical Issue Detected

High disk usage on /var partition (95% full)

Details:
Category: disk
```

### Action Queued for Approval
```
📋 Macha: Action Needs Approval
ℹ️ Action Queued for Approval

Action: Restart failed service: ollama.service
Risk Level: low

Use 'macha-approve list' to review
```

### Action Executed Successfully
```
✅ Macha: Action Success
✅ Action Success

Restart failed service: ollama.service

Output:
Service restarted successfully
```

### Action Failed
```
❌ Macha: Action Failed
❌ Action Failed

Clean up disk space with nix-collect-garbage

Output:
Error: Insufficient permissions
```

## Security Notes

1. **Token Storage**: The Gotify token is stored in the NixOS configuration. Consider using a secrets management solution for production.

2. **Network Access**: Macha needs network access to Rhiannon. Ensure your firewall allows HTTP traffic between them.

3. **Token Scope**: The Gotify token only allows sending messages, not reading or managing Gotify.

## Troubleshooting

### Notifications Not Appearing

1. **Check Gotify is running on Rhiannon:**
   ```bash
   ssh rhiannon systemctl status gotify
   ```

2. **Test connectivity from Macha:**
   ```bash
   curl http://rhiannon:8181/health
   ```

3. **Verify token is set:**
   ```bash
   macha-notify "Test" "Testing" 5
   ```

4. **Check service logs:**
   ```bash
   macha-logs service | grep -i gotify
   ```

### Notification Spam

If you're getting too many notifications, you can:

1. **Disable notifications temporarily:**
   ```nix
   services.macha-autonomous.gotifyUrl = ""; # Empty string disables
   ```

2. **Adjust autonomy level:**
   ```nix
   services.macha-autonomous.autonomyLevel = "auto-safe"; # Fewer approval notifications
   ```

3. **Increase check interval:**
   ```nix
   services.macha-autonomous.checkInterval = 900; # Check every 15 minutes instead of 5
   ```

## Implementation Details

### Files Modified
- `notifier.py` - Gotify notification client
- `module.nix` - Added configuration options and CLI tool
- `orchestrator.py` - Integrated notifications at decision points
- `macha.nix` - Added Gotify configuration

### Notification Flow
```
Issue Detected → AI Analysis → Decision Made → Notification Sent
                                    ↓
                       Queued or Executed → Notification Sent
```

### Graceful Degradation
- If Gotify is unavailable, the system continues to operate
- Failed notifications are logged but don't crash the service
- Notifications have a 10-second timeout to prevent blocking
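
A sketch of what this wrapper amounts to. The endpoint and payload follow the standard Gotify message API; the function and logger names are assumptions:

```python
import logging
import requests

log = logging.getLogger("macha.notifier")

def send_notification(url: str, token: str, title: str, message: str,
                      priority: int = 5) -> bool:
    """Send a Gotify message; never raise, so a dead Gotify can't stall a cycle."""
    if not url:  # an empty gotifyUrl disables notifications entirely
        return False
    try:
        resp = requests.post(
            f"{url.rstrip('/')}/message",
            params={"token": token},
            json={"title": title, "message": message, "priority": priority},
            timeout=10,  # matches the 10-second timeout noted above
        )
        resp.raise_for_status()
        return True
    except requests.RequestException as exc:
        log.warning("notification failed: %s", exc)
        return False
```
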
## Future Enhancements

Possible improvements:
- [ ] Rate limiting to prevent notification spam
- [ ] Notification grouping (batch similar issues)
- [ ] Custom notification templates
- [ ] Priority-based notification filtering
- [ ] Integration with other notification services (email, SMS)
- [ ] Secrets management for tokens (agenix, sops-nix)

QUICKSTART.md (new file, 229 lines)
@@ -0,0 +1,229 @@
# Macha Autonomous System - Quick Start Guide

## What is This?

Macha now has a self-maintenance system that uses local AI (via Ollama) to monitor, analyze, and maintain itself. Think of it as a 24/7 system administrator that watches over Macha.

## How It Works

1. **Monitor**: Every 5 minutes, collects system health data (services, resources, logs, etc.)
2. **Analyze**: Uses llama3.1:70b to analyze the data and detect issues
3. **Act**: Based on autonomy level, either proposes fixes or executes them automatically
4. **Learn**: Logs all decisions and actions for auditing and improvement
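
In pseudocode terms, each cycle reduces to something like the loop below. This is a sketch; the method names are assumptions, and the real loop lives in `orchestrator.py`:

```python
import time

def run_forever(monitor, agent, executor, check_interval: int = 300) -> None:
    """One monitor → analyze → act → learn pass every check_interval seconds."""
    while True:
        snapshot = monitor.collect()        # 1. gather health data
        analysis = agent.analyze(snapshot)  # 2. LLM detects and classifies issues
        for issue in analysis.get("issues", []):
            executor.handle(issue)          # 3. execute now or queue for approval
        # 4. decisions and actions are logged by the agent and executor
        time.sleep(check_interval)
```
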
## Autonomy Levels

### `observe` - Monitoring Only
- Monitors system health
- Logs everything
- Takes NO actions
- Good for: Testing, learning what the system sees

### `suggest` - Approval Required (DEFAULT)
- Monitors and analyzes
- Proposes fixes
- Requires manual approval before executing
- Good for: Production use, when you want control

### `auto-safe` - Limited Autonomy
- Auto-executes "safe" actions:
  - Restarting failed services
  - Disk cleanup
  - Log rotation
  - Read-only diagnostics
- Asks approval for risky changes
- Good for: Hands-off operation with safety net

### `auto-full` - Full Autonomy
- Auto-executes most actions
- Still requires approval for HIGH RISK actions
- Never touches protected services (SSH, networking, etc.)
- Good for: Experimental, when you trust the system
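
The gating between these levels can be summarized as follows. A sketch under the descriptions above; the actual rules live in `executor.py`:

```python
def should_auto_execute(autonomy_level: str, risk: str) -> bool:
    """True if an action runs without approval at the given autonomy level."""
    if autonomy_level == "observe":
        return False                      # never act
    if autonomy_level == "suggest":
        return False                      # everything waits in the approval queue
    if autonomy_level == "auto-safe":
        return risk == "low"              # safe actions only
    if autonomy_level == "auto-full":
        return risk in ("low", "medium")  # high risk still needs approval
    return False
```
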
## Commands

### Check the status
```bash
# View the service status
systemctl status macha-autonomous

# View live logs
macha-logs service

# View AI decision log
macha-logs decisions

# View action execution log
macha-logs actions

# View orchestrator log
macha-logs orchestrator
```

### Run a manual check
```bash
# Run one maintenance cycle now
macha-check
```

### Approval workflow (when autonomyLevel = "suggest")
```bash
# List pending actions awaiting approval
macha-approve list

# Approve action number 0
macha-approve approve 0
```

### Change autonomy level
Edit `/home/lily/Documents/nixos-servers/systems/macha.nix`:
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "auto-safe"; # Change this
  checkInterval = 300;
  model = "llama3.1:70b";
};
```

Then rebuild:
```bash
sudo nixos-rebuild switch --flake .#macha
```

## What Can It Do?

### Automatically Detects
- Failed systemd services
- High resource usage (CPU, RAM, disk)
- Recent errors in logs
- Network connectivity issues
- Disk space problems
- Boot/uptime anomalies

### Can Propose/Execute
- Restart failed services
- Clean up disk space (nix store, old logs)
- Investigate issues (run diagnostics)
- Propose configuration changes (for manual review)
- NixOS rebuilds (with safety checks)

### Safety Features
- **Protected services**: Never touches SSH, networking, systemd core
- **Dry-run testing**: Tests NixOS rebuilds before applying
- **Action logging**: Every action is logged with context
- **Rollback capability**: Can revert changes
- **Rate limiting**: Won't spam actions
- **Human override**: You can always disable or intervene

## Example Workflow

1. **System detects failed service**
   ```
   Monitor: "ollama.service is failed"
   AI Agent: "The ollama service crashed. Propose restarting it."
   ```

2. **In `suggest` mode (default)**
   ```
   Executor: "Action queued for approval"
   You: Run `macha-approve list`
   You: Review the proposed action
   You: Run `macha-approve approve 0`
   Executor: Restarts the service
   ```

3. **In `auto-safe` mode**
   ```
   Executor: "Low risk action, auto-executing"
   Executor: Restarts the service automatically
   You: Check logs later to see what happened
   ```

## Monitoring the System

All data is stored in `/var/lib/macha-autonomous/`:
- `orchestrator.log` - Main system log
- `decisions.jsonl` - AI analysis decisions (JSON Lines format)
- `actions.jsonl` - Executed actions log
- `snapshot_*.json` - System state snapshots
- `approval_queue.json` - Pending actions

## Tips

1. **Start with `suggest` mode** - Get comfortable with what it proposes
2. **Review the logs** - See what it's detecting and proposing
3. **Graduate to `auto-safe`** - Let it handle routine maintenance
4. **Use `observe` for debugging** - If something seems wrong
5. **Check approval queue regularly** - If using `suggest` mode

## Troubleshooting

### Service won't start
```bash
# Check for errors
journalctl -u macha-autonomous -n 50

# Verify Ollama is running
systemctl status ollama

# Test Ollama manually
curl http://localhost:11434/api/generate -d '{"model": "llama3.1:70b", "prompt": "test"}'
```

### AI making bad decisions
- Switch to `observe` mode to stop actions
- Review `decisions.jsonl` to see reasoning
- File an issue or adjust prompts in `agent.py`

### Want to disable temporarily
```bash
sudo systemctl stop macha-autonomous
```

### Want to disable permanently
Edit `systems/macha.nix`:
```nix
services.macha-autonomous.enable = false;
```
Then rebuild.

## Architecture

```
┌─────────────────────────────────────────────────────────┐
│                      Orchestrator                       │
│             (Main loop, runs every 5 minutes)           │
└────────────┬──────────────┬──────────────┬──────────────┘
             │              │              │
         ┌───▼────┐    ┌────▼────┐    ┌────▼─────┐
         │Monitor │    │  Agent  │    │ Executor │
         │        │───▶│  (AI)   │───▶│  (Safe)  │
         └────────┘    └─────────┘    └──────────┘
             │              │              │
          Collects       Analyzes       Executes
          System         Issues         Actions
          Health         w/ LLM         Safely
```

## Future Enhancements

Potential future capabilities:
- Integration with MCP servers (already installed!)
- Predictive maintenance (learning from patterns)
- Self-optimization (tuning configs based on usage)
- Cluster management (if you add more systems)
- Automated backups and disaster recovery
- Security monitoring and hardening
- Performance tuning recommendations

## Philosophy

The goal is a system that maintains itself while being:
1. **Safe** - Never breaks critical functionality
2. **Transparent** - All decisions are logged and explainable
3. **Conservative** - When in doubt, ask for approval
4. **Learning** - Gets better over time
5. **Human-friendly** - Easy to understand and override

Macha is here to help you, not replace you!

README.md (new file, 93 lines)
@@ -0,0 +1,93 @@
# Macha - AI-Powered Autonomous System Administrator

Macha is an AI-powered autonomous system administrator for NixOS that monitors system health, diagnoses issues, and can take corrective actions with appropriate approval workflows.

## Features

- **Autonomous Monitoring**: Continuous health checks with configurable intervals
- **Multi-Host Management**: SSH-based management of multiple NixOS hosts
- **Tool Calling**: Comprehensive system administration tools via Ollama LLM
- **Queue-Based Architecture**: Serialized LLM requests to prevent resource contention
- **Knowledge Base**: ChromaDB-backed learning system for operational wisdom
- **Approval Workflows**: Safety-first approach with configurable autonomy levels
- **Notification System**: Gotify integration for alerts

## Quick Start

### As a NixOS Flake Input

Add to your `flake.nix`:

```nix
{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    macha-autonomous.url = "git+https://git.coven.systems/lily/macha-autonomous";
  };

  outputs = { self, nixpkgs, macha-autonomous }: {
    nixosConfigurations.yourhost = nixpkgs.lib.nixosSystem {
      modules = [
        macha-autonomous.nixosModules.default
        {
          services.macha-autonomous = {
            enable = true;
            autonomyLevel = "suggest"; # observe, suggest, auto-safe, auto-full
            checkInterval = 300;
            ollamaHost = "http://localhost:11434";
            model = "gpt-oss:latest";
          };
        }
      ];
    };
  };
}
```

## Configuration Options

See `module.nix` for full configuration options including:
- Autonomy levels (observe, suggest, auto-safe, auto-full)
- Check intervals
- Ollama host and model settings
- Git repository monitoring
- Service user/group configuration

## CLI Tools

- `macha-chat` - Interactive chat interface
- `macha-ask` - Single-question interface
- `macha-check` - Trigger immediate health check
- `macha-approve` - Approve pending actions
- `macha-logs` - View service logs
- `macha-issues` - Query issue database
- `macha-knowledge` - Query knowledge base
- `macha-systems` - List managed systems
- `macha-notify` - Send Gotify notification

## Architecture

- **Agent**: Core AI logic with tool calling
- **Orchestrator**: Main monitoring loop
- **Executor**: Safe action execution
- **Queue System**: Serialized Ollama requests with priorities
- **Context DB**: ChromaDB for system context and learning
- **Tools**: System administration capabilities

## Requirements

- NixOS with flakes enabled
- Ollama service running
- Python 3 with requests, psutil, chromadb

## Documentation

See `DESIGN.md` for comprehensive architecture documentation.

## License

[Add your license here]

## Author

Lily Miller

SUMMARY.md (new file, 317 lines)
@@ -0,0 +1,317 @@
# Macha Autonomous System - Implementation Summary

## What We Built

A complete self-maintaining system for Macha that uses local AI models (via Ollama) to monitor, analyze, and fix issues automatically. This is a production-ready implementation with safety mechanisms, audit trails, and multiple autonomy levels.

## Components Created

### 1. System Monitor (`monitor.py` - 310 lines)
- Collects comprehensive system health data every cycle
- Monitors: systemd services, resources (CPU/RAM), disk usage, logs, network, NixOS status
- Saves snapshots for historical analysis
- Generates human-readable summaries

### 2. AI Agent (`agent.py` - 238 lines)
- Analyzes system state using llama3.1:70b (or other models)
- Detects issues and classifies severity
- Proposes specific, actionable fixes
- Logs all decisions for auditing
- Uses structured JSON responses for reliability

### 3. Safe Executor (`executor.py` - 371 lines)
- Executes actions with safety checks
- Protected services list (never touches SSH, networking, etc.)
- Supports multiple action types:
  - `systemd_restart` - Restart failed services
  - `cleanup` - Disk/log cleanup
  - `nix_rebuild` - NixOS configuration rebuilds
  - `config_change` - Config file modifications
  - `investigation` - Diagnostic commands
- Approval queue for manual review
- Complete action logging

### 4. Orchestrator (`orchestrator.py` - 211 lines)
- Main control loop
- Coordinates monitor → agent → executor pipeline
- Handles signals and graceful shutdown
- Configuration management
- Multiple run modes (once, continuous, daemon)

### 5. NixOS Module (`module.nix` - 168 lines)
- Full systemd service integration
- Configuration options via NixOS
- User/group management
- Security hardening
- CLI tools (`macha-check`, `macha-approve`, `macha-logs`)
- Resource limits (1GB RAM, 50% CPU)

### 6. Documentation
- `README.md` - Architecture overview
- `QUICKSTART.md` - User guide
- `EXAMPLES.md` - Configuration examples
- `SUMMARY.md` - This file

**Total: ~1,400 lines of code**

## Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                         NixOS Module                         │
│  - Creates systemd service                                   │
│  - Manages user/permissions                                  │
│  - Provides CLI tools                                        │
└───────────────────────┬──────────────────────────────────────┘
                        │
                        ▼
┌──────────────────────────────────────────────────────────────┐
│                         Orchestrator                         │
│  - Runs main loop (every 5 minutes)                          │
│  - Coordinates components                                    │
│  - Handles errors and logging                                │
└───────┬──────────────┬──────────────┬──────────────┬─────────┘
        │              │              │              │
        ▼              ▼              ▼              ▼
   ┌─────────┐   ┌──────────┐   ┌─────────┐   ┌──────────┐
   │ Monitor │──▶│  Agent   │──▶│Executor │──▶│   Logs   │
   │         │   │  (AI)    │   │ (Safe)  │   │          │
   └─────────┘   └──────────┘   └─────────┘   └──────────┘
        │              │              │              │
    Collects       Analyzes       Executes       Records
    System         with LLM       Actions        Everything
    Health         (Ollama)       Safely
```

## Data Flow

1. **Collection**: Monitor gathers system health data
2. **Analysis**: Agent sends data + prompts to Ollama
3. **Decision**: AI returns structured analysis (JSON)
4. **Execution**: Executor checks permissions & autonomy level
5. **Action**: Either executes or queues for approval
6. **Logging**: All steps logged to JSONL files
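
The structured analysis in step 3 might look like the record below. This is a hypothetical example: the field names follow the log output shown in LOGGING_EXAMPLE.md, not the exact schema in `agent.py`:

```python
# Hypothetical structured decision as returned by the agent
analysis = {
    "overall_status": "ATTENTION_NEEDED",
    "assessment": "System has one failed service that should be restarted",
    "issues": [
        {
            "severity": "WARNING",
            "category": "services",
            "description": "ollama.service has failed and needs to be restarted",
            "action_required": True,
        }
    ],
    "recommended_actions": ["Restart ollama.service to restore LLM functionality"],
}
```
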
## Safety Mechanisms

### Multi-Level Protection
1. **Autonomy Levels**: observe → suggest → auto-safe → auto-full
2. **Protected Services**: Hardcoded list of critical services
3. **Dry-Run Testing**: NixOS rebuilds tested before applying
4. **Approval Queue**: Manual review workflow
5. **Action Logging**: Complete audit trail
6. **Resource Limits**: systemd enforced (1GB RAM, 50% CPU)
7. **Rollback Capability**: Can revert changes
8. **Timeout Protection**: All operations have timeouts

### What It Can Do Automatically (auto-safe)
- ✅ Restart failed services (except protected ones)
- ✅ Clean up disk space (nix-collect-garbage)
- ✅ Rotate/clean logs
- ✅ Run diagnostics
- ❌ Modify configs (requires approval)
- ❌ Rebuild NixOS (requires approval)
- ❌ Touch protected services

## Files Created

```
systems/macha-configs/autonomous/
├── __init__.py        # Python package marker
├── monitor.py         # System health monitoring
├── agent.py           # AI analysis and reasoning
├── executor.py        # Safe action execution
├── orchestrator.py    # Main control loop
├── module.nix         # NixOS integration
├── README.md          # Architecture docs
├── QUICKSTART.md      # User guide
├── EXAMPLES.md        # Configuration examples
└── SUMMARY.md         # This file
```

## Integration Points

### Modified Files
- `systems/macha.nix` - Added autonomous module and configuration

### Created Systemd Service
- `macha-autonomous.service` - Main service
- Runs continuously, checks every 5 minutes
- Auto-starts on boot
- Restarts on failure

### Created Users/Groups
- `macha-autonomous` user (system user)
- Limited sudo access for specific commands
- Home: `/var/lib/macha-autonomous`

### Created CLI Commands
- `macha-check` - Run manual health check
- `macha-approve list` - Show pending actions
- `macha-approve approve <N>` - Approve action N
- `macha-logs [orchestrator|decisions|actions|service]` - View logs

### State Directory
`/var/lib/macha-autonomous/` contains:
- `orchestrator.log` - Main log
- `decisions.jsonl` - AI analysis log
- `actions.jsonl` - Executed actions log
- `snapshot_*.json` - System state snapshots
- `approval_queue.json` - Pending actions
- `suggested_patch_*.txt` - Config change suggestions

## Configuration

### Current Configuration (in systems/macha.nix)
```nix
services.macha-autonomous = {
  enable = true;
  autonomyLevel = "suggest";  # Requires approval
  checkInterval = 300;        # 5 minutes
  model = "llama3.1:70b";     # Most capable model
};
```

### To Deploy
```bash
# Build and activate
sudo nixos-rebuild switch --flake .#macha

# Check status
systemctl status macha-autonomous

# View logs
macha-logs service
```

## Usage Workflow

### Day 1: Observation
```bash
# Just watch what it detects
macha-logs decisions
```

### Day 2-7: Review Proposals
```bash
# Check what it wants to do
macha-approve list

# Approve good actions
macha-approve approve 0
```

### Week 2+: Increase Autonomy
```nix
# Let it handle safe actions automatically
services.macha-autonomous.autonomyLevel = "auto-safe";
```

### Monthly: Review Audit Logs
```bash
# See what it's been doing
cat /var/lib/macha-autonomous/actions.jsonl | jq .
```

## Performance Characteristics

### Resource Usage
- **Idle**: ~100MB RAM
- **Active (w/ llama3.1:70b)**: ~100MB + ~40GB model (shared with Ollama)
- **CPU**: Limited to 50% by systemd
- **Disk**: Minimal (logs rotate, snapshots limited to last 100)

### Timing
- **Monitor**: ~2 seconds
- **AI Analysis**: ~30 seconds (70B model) to ~3 seconds (8B model)
- **Execution**: Varies by action (seconds to minutes)
- **Full Cycle**: ~1-2 minutes typically

### Scalability
- Can handle multiple issues per cycle
- Queue system prevents action spam
- Configurable check intervals
- Model choice affects speed/quality tradeoff

## Current Status

✅ **READY TO USE** - All components implemented and integrated

The system is:
- ✅ Fully functional
- ✅ Safety mechanisms in place
- ✅ Well documented
- ✅ Integrated into NixOS configuration
- ✅ Ready for deployment

Currently configured in **conservative mode** (`suggest`):
- Monitors continuously
- Analyzes with AI
- Proposes actions
- Waits for your approval

## Next Steps

1. **Deploy and test:**
   ```bash
   sudo nixos-rebuild switch --flake .#macha
   ```

2. **Monitor for a few days:**
   ```bash
   macha-logs service
   ```

3. **Review what it detects:**
   ```bash
   macha-approve list
   cat /var/lib/macha-autonomous/decisions.jsonl | jq .
   ```

4. **Gradually increase autonomy as you gain confidence**

## Future Enhancement Ideas

### Short Term
- Web dashboard for easier monitoring
- Email/notification system for critical issues
- More sophisticated action types
- Historical trend analysis

### Medium Term
- Integration with MCP servers (already installed!)
- Predictive maintenance using historical data
- Self-tuning of check intervals based on activity
- Multi-system orchestration (manage other NixOS hosts)

### Long Term
- Learning from past decisions to improve
- A/B testing of configuration changes
- Distributed consensus for multi-host decisions
- Integration with external monitoring systems

## Philosophy

This implementation follows key principles:

1. **Safety First**: Multiple layers of protection
2. **Transparency**: Everything is logged and auditable
3. **Conservative Default**: Start restricted, earn trust
4. **Human in Loop**: Always allow override
5. **Gradual Autonomy**: Progressive trust model
6. **Local First**: No external dependencies
7. **Declarative**: NixOS-native configuration

## Conclusion

Macha now has a sophisticated autonomous maintenance system that can:
- Monitor itself 24/7
- Detect and analyze issues using AI
- Fix problems automatically (with appropriate safeguards)
- Learn and improve over time
- Maintain complete audit trails

All powered by local AI models, no external dependencies, fully integrated with NixOS, and designed with safety as the top priority.

**Welcome to the future of self-maintaining systems!** 🎉
1
__init__.py
Normal file
@@ -0,0 +1 @@
# Macha Autonomous System Maintenance
522
chat.py
Normal file
@@ -0,0 +1,522 @@
#!/usr/bin/env python3
"""
Interactive chat interface with Macha AI agent.
Allows conversational interaction and directive execution.
"""

import json
import os
import subprocess
import sys
from datetime import datetime
from pathlib import Path
from typing import List, Dict, Any

# Add parent directory to path for imports
sys.path.insert(0, str(Path(__file__).parent))

from agent import MachaAgent


class MachaChatSession:
    """Interactive chat session with Macha"""

    def __init__(self):
        self.agent = MachaAgent(use_queue=True, priority="INTERACTIVE")
        self.conversation_history: List[Dict[str, str]] = []
        self.session_start = datetime.now().isoformat()

    def _create_chat_prompt(self, user_message: str) -> str:
        """Create a prompt for the chat session"""
        # Note: not called by process_message(), which uses the tool-calling chat API.

        # Build conversation context
        context = ""
        if self.conversation_history:
            context = "\n\nCONVERSATION HISTORY:\n"
            for entry in self.conversation_history[-10:]:  # Last 10 messages
                role = entry['role'].upper()
                msg = entry['message']
                context += f"{role}: {msg}\n"

        prompt = f"""{MachaAgent.SYSTEM_PROMPT}

TASK: INTERACTIVE CHAT SESSION

You are in an interactive chat session with the system administrator.
You can have a natural conversation and execute commands when directed.

CAPABILITIES:
- Answer questions about system status
- Explain configurations and issues
- Execute commands when explicitly asked
- Provide guidance and recommendations

COMMAND EXECUTION:
When the user asks you to run a command or perform an action that requires execution:
1. Respond with a JSON object containing the command to execute
2. Format: {{"action": "execute", "command": "the command", "explanation": "why you're running it"}}
3. After seeing the output, continue the conversation naturally

RESPONSE FORMAT:
- For normal conversation: Respond naturally in plain text
- For command execution: Respond with JSON containing action/command/explanation
- Keep responses concise but informative

RULES:
- Only execute commands when explicitly asked or when it's clearly needed
- Explain what you're about to do before executing
- Never execute destructive commands without explicit confirmation
- If unsure, ask for clarification
{context}

USER: {user_message}

MACHA:"""

        return prompt

    def _execute_command(self, command: str) -> Dict[str, Any]:
        """Execute a shell command and return results"""
        try:
            result = subprocess.run(
                command,
                shell=True,
                capture_output=True,
                text=True,
                timeout=30
            )

            # Check if the command failed due to permissions
            needs_sudo = False
            permission_errors = [
                'Interactive authentication required',
                'Permission denied',
                'Operation not permitted',
                'Must be root',
                'insufficient privileges',
                'authentication is required'
            ]

            if result.returncode != 0:
                error_text = (result.stderr + result.stdout).lower()
                for perm_error in permission_errors:
                    if perm_error.lower() in error_text:
                        needs_sudo = True
                        break

            # Retry with sudo if a permission error was detected
            if needs_sudo and not command.strip().startswith('sudo'):
                print("\n⚠️ Permission denied, retrying with sudo...")
                sudo_command = f"sudo {command}"
                result = subprocess.run(
                    sudo_command,
                    shell=True,
                    capture_output=True,
                    text=True,
                    timeout=30
                )

                return {
                    'success': result.returncode == 0,
                    'exit_code': result.returncode,
                    'stdout': result.stdout,
                    'stderr': result.stderr,
                    'command': sudo_command,
                    'retried_with_sudo': True
                }

            return {
                'success': result.returncode == 0,
                'exit_code': result.returncode,
                'stdout': result.stdout,
                'stderr': result.stderr,
                'command': command,
                'retried_with_sudo': False
            }
        except subprocess.TimeoutExpired:
            return {
                'success': False,
                'exit_code': -1,
                'stdout': '',
                'stderr': 'Command timed out after 30 seconds',
                'command': command,
                'retried_with_sudo': False
            }
        except Exception as e:
            return {
                'success': False,
                'exit_code': -1,
                'stdout': '',
                'stderr': str(e),
                'command': command,
                'retried_with_sudo': False
            }

    def _parse_response(self, response: str) -> Dict[str, Any]:
        """Parse AI response to determine if it's a command or text"""
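        # Example (illustrative): a reply such as
        #   {"action": "execute", "command": "df -h", "explanation": "check disk usage"}
        # parses into a dict with an 'action' key; anything that is not valid
        # JSON falls through and is treated as plain conversational text.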
        try:
            # Try to parse as JSON
            parsed = json.loads(response.strip())
            if isinstance(parsed, dict) and 'action' in parsed:
                return parsed
        except json.JSONDecodeError:
            pass

        # It's plain text conversation
        return {'action': 'chat', 'message': response}

    def _auto_diagnose_ollama(self) -> str:
        """Automatically diagnose Ollama issues"""
        diagnostics = []

        diagnostics.append("🔍 AUTO-DIAGNOSIS: Investigating Ollama failure...\n")

        # Check if the Ollama service is running
        try:
            result = subprocess.run(
                ['systemctl', 'is-active', 'ollama.service'],
                capture_output=True,
                text=True,
                timeout=5
            )
            if result.returncode == 0:
                diagnostics.append("✅ Ollama service is active")
            else:
                diagnostics.append(f"❌ Ollama service is NOT active: {result.stdout.strip()}")
                # Get service status
                status_result = subprocess.run(
                    ['systemctl', 'status', 'ollama.service', '--no-pager', '-l'],
                    capture_output=True,
                    text=True,
                    timeout=5
                )
                diagnostics.append(f"\nService status:\n```\n{status_result.stdout[-500:]}\n```")
        except Exception as e:
            diagnostics.append(f"⚠️ Could not check service status: {e}")

        # Check memory usage
        try:
            result = subprocess.run(['free', '-h'], capture_output=True, text=True, timeout=5)
            lines = result.stdout.split('\n')
            for line in lines[:3]:  # First 3 lines
                diagnostics.append(f"  {line}")
        except Exception as e:
            diagnostics.append(f"⚠️ Could not check memory: {e}")

        # Check which models are loaded
        try:
            import requests
            response = requests.get(f"{self.agent.ollama_host}/api/tags", timeout=5)
            if response.status_code == 200:
                models = response.json().get('models', [])
                diagnostics.append(f"\n📦 Loaded models ({len(models)}):")
                for model in models:
                    name = model.get('name', 'unknown')
                    size = model.get('size', 0) / (1024**3)
                    is_current = "← TARGET" if name == self.agent.model else ""
                    diagnostics.append(f"  • {name} ({size:.1f} GB) {is_current}")

                # Check if the target model is loaded
                model_names = [m.get('name') for m in models]
                if self.agent.model not in model_names:
                    diagnostics.append(f"\n❌ TARGET MODEL NOT LOADED: {self.agent.model}")
                    diagnostics.append(f"   Available models: {', '.join(model_names)}")
            else:
                diagnostics.append(f"❌ Ollama API returned {response.status_code}")
        except Exception as e:
            diagnostics.append(f"⚠️ Could not query Ollama API: {e}")

        # Check recent Ollama logs
        try:
            result = subprocess.run(
                ['journalctl', '-u', 'ollama.service', '-n', '10', '--no-pager'],
                capture_output=True,
                text=True,
                timeout=5
            )
            if result.stdout:
                diagnostics.append(f"\n📋 Recent Ollama logs (last 10 lines):\n```\n{result.stdout}\n```")
        except Exception as e:
            diagnostics.append(f"⚠️ Could not check logs: {e}")

        return "\n".join(diagnostics)

    def process_message(self, user_message: str) -> str:
        """Process a user message and return Macha's response"""

        # Add user message to history
        self.conversation_history.append({
            'role': 'user',
            'message': user_message,
            'timestamp': datetime.now().isoformat()
        })

        # Build chat messages for the tool-calling API
        messages = []

        # Query relevant knowledge based on the user message
        knowledge_context = self.agent._query_relevant_knowledge(user_message, limit=3)

        # Add recent conversation history (last 15 messages to stay within context limits)
        # With tool calling, messages grow quickly, so we limit more aggressively
        recent_history = self.conversation_history[-15:]  # Last ~7 exchanges
        for entry in recent_history:
            content = entry['message']
            # Truncate very long messages (e.g., command outputs)
            if len(content) > 3000:
                content = content[:1500] + "\n... [message truncated] ...\n" + content[-1500:]
            # Add knowledge context to the most recent message if available
            if entry == recent_history[-1] and knowledge_context:
                content += knowledge_context
            messages.append({
                "role": entry['role'],
                "content": content
            })

        try:
            # Use the tool-aware chat API
            ai_response = self.agent._query_ollama_with_tools(messages)
        except Exception as e:
            error_msg = (
                f"❌ CRITICAL: Failed to communicate with Ollama inference engine\n\n"
                f"Error Type: {type(e).__name__}\n"
                f"Error Message: {str(e)}\n\n"
            )
            # Auto-diagnose the issue
            diagnostics = self._auto_diagnose_ollama()
            return error_msg + "\n" + diagnostics

        if not ai_response:
            error_msg = (
                f"❌ Empty response from Ollama inference engine\n\n"
                f"The request succeeded but returned no data. This usually means:\n"
                f"  • The model ({self.agent.model}) is still loading\n"
                f"  • Ollama ran out of memory during generation\n"
                f"  • The prompt was too large for the context window\n\n"
            )
            # Auto-diagnose the issue
            diagnostics = self._auto_diagnose_ollama()
            return error_msg + "\n" + diagnostics

        # Check if Ollama returned an error
        try:
            error_check = json.loads(ai_response)
            if isinstance(error_check, dict) and 'error' in error_check:
                error_msg = (
                    f"❌ Ollama API Error\n\n"
                    f"Error: {error_check.get('error', 'Unknown error')}\n"
                    f"Diagnosis: {error_check.get('diagnosis', 'No details')}\n\n"
                )
                # Auto-diagnose the issue
                diagnostics = self._auto_diagnose_ollama()
                return error_msg + "\n" + diagnostics
        except json.JSONDecodeError:
            # Not JSON, it's a normal response
            pass

        # Parse response
        parsed = self._parse_response(ai_response)

        if parsed.get('action') == 'execute':
            # The AI wants to execute a command
            command = parsed.get('command', '')
            explanation = parsed.get('explanation', '')

            # Show what we're about to do
            response = f"🔧 {explanation}\n\nExecuting: `{command}`\n\n"

            # Execute the command
            result = self._execute_command(command)

            # Show if we retried with sudo
            if result.get('retried_with_sudo'):
                response += f"⚠️ Permission denied, retried as: `{result['command']}`\n\n"

            if result['success']:
                response += "✅ Command succeeded:\n"
                if result['stdout']:
                    response += f"```\n{result['stdout']}\n```"
                else:
                    response += "(no output)"
            else:
                response += f"❌ Command failed (exit code {result['exit_code']}):\n"
                if result['stderr']:
                    response += f"```\n{result['stderr']}\n```"
                elif result['stdout']:
                    response += f"```\n{result['stdout']}\n```"

            # Add command execution to history
            self.conversation_history.append({
                'role': 'macha',
                'message': response,
                'timestamp': datetime.now().isoformat(),
                'command_result': result
            })

            # Now ask the AI to respond to the command output
            followup_prompt = f"""The command completed. Here's what happened:

Command: {command}
Success: {result['success']}
Output: {result['stdout'][:500] if result['stdout'] else '(none)'}
Error: {result['stderr'][:500] if result['stderr'] else '(none)'}

Please provide a brief analysis or next steps."""

            followup_response = self.agent._query_ollama(followup_prompt)

            if followup_response:
                response += f"\n\n{followup_response}"

            return response

        else:
            # Normal conversation response
            message = parsed.get('message', ai_response)

            self.conversation_history.append({
                'role': 'macha',
                'message': message,
                'timestamp': datetime.now().isoformat()
            })

            return message

    def run(self):
        """Run the interactive chat session"""
        print("=" * 70)
        print("🌐 MACHA INTERACTIVE CHAT")
        print("=" * 70)
        print("Type your message and press Enter. Commands:")
        print("  /exit or /quit - End the chat session")
        print("  /clear - Clear conversation history")
        print("  /history - Show conversation history")
        print("  /debug - Show Ollama connection status")
        print("=" * 70)
        print()

        while True:
            try:
                # Get user input
                user_input = input("\n💬 YOU: ").strip()

                if not user_input:
                    continue

                # Handle special commands
                if user_input.lower() in ['/exit', '/quit']:
                    print("\n👋 Ending chat session. Goodbye!")
                    break

                elif user_input.lower() == '/clear':
                    self.conversation_history.clear()
                    print("🧹 Conversation history cleared.")
                    continue

                elif user_input.lower() == '/history':
                    print("\n" + "=" * 70)
                    print("CONVERSATION HISTORY")
                    print("=" * 70)
                    for entry in self.conversation_history:
                        role = entry['role'].upper()
                        msg = entry['message'][:100] + "..." if len(entry['message']) > 100 else entry['message']
                        print(f"{role}: {msg}")
                    print("=" * 70)
                    continue

                elif user_input.lower() == '/debug':
                    # os and subprocess are already imported at module level

                    print("\n" + "=" * 70)
                    print("MACHA ARCHITECTURE & STATUS")
                    print("=" * 70)

                    print("\n🏗️ SYSTEM ARCHITECTURE:")
                    print("  Hostname: macha.coven.systems")
                    print("  Service: macha-autonomous.service (systemd)")
                    print("  Working Directory: /var/lib/macha")

                    print("\n👤 EXECUTION CONTEXT:")
                    current_user = os.getenv('USER') or os.getenv('USERNAME') or 'unknown'
                    print(f"  Current User: {current_user}")
                    print(f"  UID: {os.getuid()}")

                    # Check if the user has sudo access
                    try:
                        result = subprocess.run(['sudo', '-n', 'true'],
                                                capture_output=True, timeout=1)
                        if result.returncode == 0:
                            print("  Sudo Access: ✓ Yes (passwordless)")
                        else:
                            print("  Sudo Access: ⚠ Requires password")
                    except Exception:
                        print("  Sudo Access: ❌ No")

                    print("  Note: Chat runs as the invoking user (you), not as macha-autonomous")

                    print("\n🧠 INFERENCE ENGINE:")
                    print("  Backend: Ollama")
                    print(f"  Host: {self.agent.ollama_host}")
                    print(f"  Model: {self.agent.model}")
                    print("  Service: ollama.service (systemd)")

                    print("\n💾 DATABASE:")
                    print("  Backend: ChromaDB")
                    print("  Host: http://localhost:8000")
                    print("  Data: /var/lib/chromadb")
                    print("  Service: chromadb.service (systemd)")

                    print("\n🔍 OLLAMA STATUS:")
                    # Try to query Ollama status
                    try:
                        import requests
                        # Check if Ollama is running
                        response = requests.get(f"{self.agent.ollama_host}/api/tags", timeout=5)
                        if response.status_code == 200:
                            models = response.json().get('models', [])
                            print("  Status: ✓ Running")
                            print(f"  Loaded models: {len(models)}")
                            for model in models:
                                name = model.get('name', 'unknown')
                                size = model.get('size', 0) / (1024**3)  # GB
                                is_current = "← ACTIVE" if name == self.agent.model else ""
                                print(f"    • {name} ({size:.1f} GB) {is_current}")
                        else:
                            print(f"  Status: ❌ Error (HTTP {response.status_code})")
                    except Exception as e:
                        print(f"  Status: ❌ Cannot connect: {e}")
                        print("  Hint: Check 'systemctl status ollama.service'")

                    print("\n💡 CONVERSATION:")
                    print(f"  History: {len(self.conversation_history)} messages")
                    print(f"  Session started: {self.session_start}")

                    print("=" * 70)
                    continue

                # Process the message
                print("\n🤖 MACHA: ", end='', flush=True)
                response = self.process_message(user_input)
                print(response)

            except KeyboardInterrupt:
                print("\n\n👋 Chat interrupted. Use /exit to quit properly.")
                continue
            except EOFError:
                print("\n\n👋 Ending chat session. Goodbye!")
                break
            except Exception as e:
                print(f"\n❌ Error: {e}")
                continue


def main():
    """Main entry point"""
    session = MachaChatSession()
    session.run()


if __name__ == "__main__":
    main()
245
config_parser.py
Normal file
@@ -0,0 +1,245 @@
#!/usr/bin/env python3
"""
Config Parser - Extract imports and content from NixOS configuration files
"""

import re
import subprocess
from pathlib import Path
from typing import List, Dict, Set, Optional, Any
from datetime import datetime


class ConfigParser:
    """Parse NixOS flake and configuration files"""

    def __init__(self, repo_url: str, local_path: Path = Path("/var/lib/macha/config-repo")):
        """
        Initialize config parser

        Args:
            repo_url: Git repository URL (e.g., git+https://...)
            local_path: Where to clone/update the repository
        """
        # Strip git+ prefix if present for git commands
        self.repo_url = repo_url.replace("git+", "")
        self.local_path = local_path
        self.local_path.mkdir(parents=True, exist_ok=True)

    def ensure_repo(self) -> bool:
        """Clone or update the repository"""
        try:
            if (self.local_path / ".git").exists():
                # Update existing repo
                result = subprocess.run(
                    ["git", "-C", str(self.local_path), "pull"],
                    capture_output=True,
                    text=True,
                    timeout=30
                )
                return result.returncode == 0
            else:
                # Clone new repo
                result = subprocess.run(
                    ["git", "clone", self.repo_url, str(self.local_path)],
                    capture_output=True,
                    text=True,
                    timeout=60
                )
                return result.returncode == 0
        except Exception as e:
            print(f"Error updating repository: {e}")
            return False

    def get_systems_from_flake(self) -> List[str]:
        """Extract system names from flake.nix"""
        flake_path = self.local_path / "flake.nix"
        if not flake_path.exists():
            return []

        systems = []
        try:
            content = flake_path.read_text()
            # Match patterns like: "macha" = nixpkgs.lib.nixosSystem
            matches = re.findall(r'"([^"]+)"\s*=\s*nixpkgs\.lib\.nixosSystem', content)
            systems = matches
        except Exception as e:
            print(f"Error parsing flake.nix: {e}")

        return systems

    def extract_imports(self, nix_file: Path) -> List[str]:
        """Extract imports from a .nix file"""
        if not nix_file.exists():
            return []

        imports = []
        try:
            content = nix_file.read_text()

            # Find the imports = [ ... ]; block
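            # An imports block typically looks like (illustrative):
            #   imports = [ ./hardware.nix ../osconfigs/common.nix ];
            # Any relative .nix paths inside the brackets are collected below.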
            imports_match = re.search(
                r'imports\s*=\s*\[(.*?)\];',
                content,
                re.DOTALL
            )

            if imports_match:
                imports_block = imports_match.group(1)
                # Extract all paths (relative paths starting with ./ or ../)
                paths = re.findall(r'[./]+[^\s\]]+\.nix', imports_block)
                imports = paths

        except Exception as e:
            print(f"Error parsing {nix_file}: {e}")

        return imports

    def resolve_import_path(self, base_file: Path, import_path: str) -> Optional[Path]:
        """Resolve a relative import path to an absolute path within the repo"""
        try:
            # Get directory of the base file
            base_dir = base_file.parent
            # Resolve the relative path
            resolved = (base_dir / import_path).resolve()
            # Make sure it's within the repo
            if self.local_path in resolved.parents or resolved == self.local_path:
                return resolved
        except Exception as e:
            print(f"Error resolving import {import_path} from {base_file}: {e}")
        return None

    def get_system_config(self, system_name: str) -> Dict[str, Any]:
        """
        Get configuration for a specific system

        Returns:
            Dict with:
            - main_file: Path to systems/<name>.nix
            - imports: List of imported file paths (relative to repo root)
            - all_files: Sorted list of all .nix files used (including recursive imports)
        """
        main_file = self.local_path / "systems" / f"{system_name}.nix"

        if not main_file.exists():
            return {
                "main_file": None,
                "imports": [],
                "all_files": []
            }

        # Track all files (avoid infinite loops)
        all_files = set()
        files_to_process = [main_file]
        processed = set()

        while files_to_process:
            current_file = files_to_process.pop(0)

            if current_file in processed:
                continue
            processed.add(current_file)

            # Get relative path from repo root
            try:
                rel_path = current_file.relative_to(self.local_path)
                all_files.add(str(rel_path))
            except ValueError:
                continue

            # Extract imports from this file
            imports = self.extract_imports(current_file)

            # Resolve and queue imported files
            for imp in imports:
                resolved = self.resolve_import_path(current_file, imp)
                if resolved and resolved not in processed:
                    files_to_process.append(resolved)

        return {
            "main_file": str(main_file.relative_to(self.local_path)),
            "imports": self.extract_imports(main_file),
            "all_files": sorted(all_files)
        }

    def read_file_content(self, relative_path: str) -> Optional[str]:
        """Read content of a file by its path relative to repo root"""
        try:
            file_path = self.local_path / relative_path
            if file_path.exists():
                return file_path.read_text()
        except Exception as e:
            print(f"Error reading {relative_path}: {e}")
        return None

    def get_all_config_files(self) -> List[Dict[str, str]]:
        """
        Get all .nix files in the repository with their content

        Returns:
            List of dicts with:
            - path: relative path from repo root
            - content: file contents
            - category: apps/systems/osconfigs/users based on path
        """
        files = []

        # Categories to scan
        categories = {
            "apps": self.local_path / "apps",
            "systems": self.local_path / "systems",
            "osconfigs": self.local_path / "osconfigs",
            "users": self.local_path / "users"
        }

        for category, path in categories.items():
            if not path.exists():
                continue

            for nix_file in path.rglob("*.nix"):
                try:
                    rel_path = nix_file.relative_to(self.local_path)
                    content = nix_file.read_text()

                    files.append({
                        "path": str(rel_path),
                        "content": content,
                        "category": category
                    })
                except Exception as e:
                    print(f"Error reading {nix_file}: {e}")

        return files


if __name__ == "__main__":
    # Test the parser
    import sys

    repo_url = "git+https://git.coven.systems/lily/nixos-servers"
    parser = ConfigParser(repo_url)

    print("Ensuring repository is up to date...")
    if parser.ensure_repo():
        print("✓ Repository ready")
    else:
        print("✗ Failed to update repository")
        sys.exit(1)

    print("\nSystems defined in flake:")
    systems = parser.get_systems_from_flake()
    for system in systems:
        print(f"  - {system}")

    if len(sys.argv) > 1:
        system_name = sys.argv[1]
        print(f"\nConfiguration for {system_name}:")
        config = parser.get_system_config(system_name)

        print(f"  Main file: {config['main_file']}")
        print(f"  Direct imports: {len(config['imports'])}")
        print(f"  All files used: {len(config['all_files'])}")

        for f in config['all_files']:
            print(f"  - {f}")
947
context_db.py
Normal file
@@ -0,0 +1,947 @@
#!/usr/bin/env python3
"""
Context Database - Store and retrieve system context using ChromaDB for RAG
"""

import json
import os
from typing import Dict, List, Any, Optional, Set
from datetime import datetime
from pathlib import Path

# Set environment variable BEFORE importing chromadb to prevent .env file reading
os.environ.setdefault("CHROMA_ENV_FILE", "")

import chromadb
from chromadb.config import Settings


class ContextDatabase:
    """Manage system context and relationships in ChromaDB"""

    def __init__(
        self,
        host: str = "localhost",
        port: int = 8000,
        persist_directory: str = "/var/lib/chromadb"
    ):
        """Initialize ChromaDB client"""

        self.client = chromadb.HttpClient(
            host=host,
            port=port,
            settings=Settings(
                anonymized_telemetry=False,
                allow_reset=False,
                chroma_api_impl="chromadb.api.fastapi.FastAPI"
            )
        )

        # Create or get collections
        self.systems_collection = self.client.get_or_create_collection(
            name="systems",
            metadata={"description": "System definitions and metadata"}
        )

        self.relationships_collection = self.client.get_or_create_collection(
            name="relationships",
            metadata={"description": "System relationships and dependencies"}
        )

        self.issues_collection = self.client.get_or_create_collection(
            name="issues",
            metadata={"description": "Issue tracking and resolution history"}
        )

        self.decisions_collection = self.client.get_or_create_collection(
            name="decisions",
            metadata={"description": "AI decisions and outcomes"}
        )

        self.config_files_collection = self.client.get_or_create_collection(
            name="config_files",
            metadata={"description": "NixOS configuration files for RAG"}
        )

        self.knowledge_collection = self.client.get_or_create_collection(
            name="knowledge",
            metadata={"description": "Operational knowledge: commands, patterns, best practices"}
        )

    # ============ System Registry ============

    def register_system(
        self,
        hostname: str,
        system_type: str,
        services: List[str],
        capabilities: List[str] = None,
        metadata: Dict[str, Any] = None,
        config_repo: str = None,
        config_branch: str = None,
        os_type: str = "nixos"
    ):
        """Register a system in the database

        Args:
            hostname: FQDN of the system
            system_type: Role (e.g., 'workstation', 'server')
            services: List of running services
            capabilities: System capabilities
            metadata: Additional metadata
            config_repo: Git repository URL
            config_branch: Git branch name
            os_type: Operating system (e.g., 'nixos', 'ubuntu', 'debian', 'arch', 'windows', 'macos')
        """
        doc_parts = [
            f"System: {hostname}",
            f"Type: {system_type}",
            f"OS: {os_type}",
            f"Services: {', '.join(services)}",
            f"Capabilities: {', '.join(capabilities or [])}"
        ]

        if config_repo:
            doc_parts.append(f"Configuration Repository: {config_repo}")
        if config_branch:
            doc_parts.append(f"Configuration Branch: {config_branch}")

        doc = "\n".join(doc_parts)

        metadata_dict = {
            "hostname": hostname,
            "type": system_type,
            "os_type": os_type,
            "services": json.dumps(services),
            "capabilities": json.dumps(capabilities or []),
            "metadata": json.dumps(metadata or {}),
            "config_repo": config_repo or "",
            "config_branch": config_branch or "",
            "updated_at": datetime.now().isoformat()
        }

        self.systems_collection.upsert(
            ids=[hostname],
            documents=[doc],
            metadatas=[metadata_dict]
        )

    def get_system(self, hostname: str) -> Optional[Dict[str, Any]]:
        """Get system information"""
        try:
            result = self.systems_collection.get(
                ids=[hostname],
                include=["metadatas", "documents"]
            )

            if result['ids']:
                metadata = result['metadatas'][0]
                return {
                    "hostname": metadata["hostname"],
                    "type": metadata["type"],
                    "services": json.loads(metadata["services"]),
                    "capabilities": json.loads(metadata["capabilities"]),
                    "metadata": json.loads(metadata["metadata"]),
                    "document": result['documents'][0]
                }
        except Exception:
            pass

        return None

    def get_all_systems(self) -> List[Dict[str, Any]]:
        """Get all registered systems"""
        result = self.systems_collection.get(include=["metadatas"])

        systems = []
        for metadata in result['metadatas']:
            systems.append({
                "hostname": metadata["hostname"],
                "type": metadata["type"],
                "os_type": metadata.get("os_type", "unknown"),
                "services": json.loads(metadata["services"]),
                "capabilities": json.loads(metadata["capabilities"]),
                "config_repo": metadata.get("config_repo", ""),
                "config_branch": metadata.get("config_branch", "")
            })

        return systems

    def is_system_known(self, hostname: str) -> bool:
        """Check if a system is already registered"""
        try:
            result = self.systems_collection.get(ids=[hostname])
            return len(result['ids']) > 0
        except Exception:
            return False

    def get_known_hostnames(self) -> Set[str]:
        """Get set of all known system hostnames"""
        result = self.systems_collection.get(include=["metadatas"])
        return set(metadata["hostname"] for metadata in result['metadatas'])

    # ============ Relationships ============

    def add_relationship(
        self,
        source: str,
        target: str,
        relationship_type: str,
        description: str = ""
    ):
        """Add a relationship between systems"""
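        # e.g. add_relationship("macha", "rhiannon", "uses-service",
        #                       "Macha uses Rhiannon's Gotify for notifications")
        # yields rel_id "macha→rhiannon:uses-service" below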
        rel_id = f"{source}→{target}:{relationship_type}"
        doc = f"{source} {relationship_type} {target}. {description}"

        self.relationships_collection.upsert(
            ids=[rel_id],
            documents=[doc],
            metadatas=[{
                "source": source,
                "target": target,
                "type": relationship_type,
                "description": description,
                "created_at": datetime.now().isoformat()
            }]
        )

    def get_dependencies(self, hostname: str) -> List[Dict[str, Any]]:
        """Get what a system depends on"""
        result = self.relationships_collection.get(
            where={"source": hostname},
            include=["metadatas"]
        )

        return [
            {
                "target": m["target"],
                "type": m["type"],
                "description": m.get("description", "")
            }
            for m in result['metadatas']
        ]

    def get_dependents(self, hostname: str) -> List[Dict[str, Any]]:
        """Get what depends on a system"""
        result = self.relationships_collection.get(
            where={"target": hostname},
            include=["metadatas"]
        )

        return [
            {
                "source": m["source"],
                "type": m["type"],
                "description": m.get("description", "")
            }
            for m in result['metadatas']
        ]

    # ============ Issue History ============

    # NOTE: superseded by the dict-based store_issue() defined later in this
    # class; in Python the later definition wins.
    def store_issue(
        self,
        system: str,
        issue_description: str,
        resolution: str = "",
        severity: str = "unknown",
        metadata: Dict[str, Any] = None
    ) -> str:
        """Store an issue and its resolution"""
        issue_id = f"{system}_{datetime.now().timestamp()}"

        doc = f"""
System: {system}
Issue: {issue_description}
Resolution: {resolution}
Severity: {severity}
"""

        self.issues_collection.add(
            ids=[issue_id],
            documents=[doc],
            metadatas=[{
                "system": system,
                "severity": severity,
                "resolved": bool(resolution),
                "timestamp": datetime.now().isoformat(),
                "metadata": json.dumps(metadata or {})
            }]
        )

        return issue_id

    def store_investigation(
        self,
        system: str,
        issue_description: str,
        commands: List[str],
        output: str,
        timestamp: str = None
    ) -> str:
        """Store investigation results for an issue"""
        if timestamp is None:
            timestamp = datetime.now().isoformat()

        investigation_id = f"investigation_{system}_{datetime.now().timestamp()}"

        # Truncate stored output to prevent token overflow
        doc = f"""
System: {system}
Issue: {issue_description}
Commands executed: {', '.join(commands)}
Output:
{output[:2000]}
"""

        self.issues_collection.add(
            ids=[investigation_id],
            documents=[doc],
            metadatas=[{
                "system": system,
                "issue": issue_description,
                "type": "investigation",
                "commands": json.dumps(commands),
                "timestamp": timestamp,
                "metadata": json.dumps({"output_length": len(output)})
            }]
        )

        return investigation_id

    def get_recent_investigations(
        self,
        issue_description: str,
        system: str,
        hours: int = 24
    ) -> List[Dict[str, Any]]:
        """Get recent investigations for a similar issue"""
        # Query for similar issues
        try:
            result = self.issues_collection.query(
                query_texts=[f"System: {system}\nIssue: {issue_description}"],
                n_results=10,
                where={"type": "investigation"},
                include=["documents", "metadatas", "distances"]
            )

            investigations = []
            if result['ids'] and result['ids'][0]:
                cutoff_time = datetime.now().timestamp() - (hours * 3600)

                for i, doc_id in enumerate(result['ids'][0]):
                    meta = result['metadatas'][0][i]
                    timestamp = datetime.fromisoformat(meta['timestamp'])

                    # Only include recent investigations
                    if timestamp.timestamp() > cutoff_time:
                        investigations.append({
                            "id": doc_id,
                            "system": meta['system'],
                            "issue": meta['issue'],
                            "commands": json.loads(meta['commands']),
                            "output": result['documents'][0][i],
                            "timestamp": meta['timestamp'],
                            "relevance": 1 - result['distances'][0][i]
                        })

            return investigations
        except Exception as e:
            print(f"Error querying investigations: {e}")
            return []

    def find_similar_issues(
        self,
        issue_description: str,
        system: Optional[str] = None,
        n_results: int = 5
    ) -> List[Dict[str, Any]]:
        """Find similar past issues using semantic search"""
        where = {"system": system} if system else None

        results = self.issues_collection.query(
            query_texts=[issue_description],
            n_results=n_results,
            where=where,
            include=["documents", "metadatas", "distances"]
        )

        similar = []
        for i, doc in enumerate(results['documents'][0]):
            similar.append({
                "issue": doc,
                "metadata": results['metadatas'][0][i],
                "similarity": 1 - results['distances'][0][i]  # Convert distance to similarity
            })

        return similar

    # ============ AI Decisions ============

    def store_decision(
        self,
        system: str,
        analysis: Dict[str, Any],
        action: Dict[str, Any],
        outcome: Dict[str, Any] = None
    ):
        """Store an AI decision for learning"""
        decision_id = f"decision_{datetime.now().timestamp()}"

        doc = f"""
System: {system}
Status: {analysis.get('status', 'unknown')}
Assessment: {analysis.get('overall_assessment', '')}
Action: {action.get('proposed_action', '')}
Risk: {action.get('risk_level', 'unknown')}
Outcome: {outcome.get('status', 'pending') if outcome else 'pending'}
"""

        self.decisions_collection.add(
            ids=[decision_id],
            documents=[doc],
            metadatas=[{
                "system": system,
                "timestamp": datetime.now().isoformat(),
                "analysis": json.dumps(analysis),
                "action": json.dumps(action),
                "outcome": json.dumps(outcome or {})
            }]
        )

    def get_recent_decisions(
        self,
        system: Optional[str] = None,
        n_results: int = 10
    ) -> List[Dict[str, Any]]:
        """Get recent decisions, optionally filtered by system"""
        where = {"system": system} if system else None

        results = self.decisions_collection.query(
            query_texts=["recent decisions"],
            n_results=n_results,
            where=where,
            include=["documents", "metadatas"]
        )

        decisions = []
        for i, doc in enumerate(results['documents'][0]):
            meta = results['metadatas'][0][i]
            decisions.append({
                "system": meta["system"],
                "timestamp": meta["timestamp"],
                "analysis": json.loads(meta["analysis"]),
                "action": json.loads(meta["action"]),
                "outcome": json.loads(meta["outcome"])
            })

        return decisions

    # ============ Context Generation for AI ============

    def get_system_context(self, hostname: str, git_context=None) -> str:
        """Generate rich context about a system for AI prompts"""
        context_parts = []

        # System info
        system = self.get_system(hostname)
        if system:
            context_parts.append(f"System: {hostname} ({system['type']})")
            context_parts.append(f"Services: {', '.join(system['services'])}")
            if system['capabilities']:
                context_parts.append(f"Capabilities: {', '.join(system['capabilities'])}")

        # Git repository info
        if system and system.get('metadata'):
            metadata = json.loads(system['metadata']) if isinstance(system['metadata'], str) else system['metadata']
            config_repo = metadata.get('config_repo', '')
            if config_repo:
                context_parts.append(f"\nConfiguration Repository: {config_repo}")

        # Recent git changes for this system
        if git_context:
            try:
                # Extract system name from FQDN
                system_name = hostname.split('.')[0]
                git_summary = git_context.get_system_context_summary(system_name)
                if git_summary:
                    context_parts.append(f"\n{git_summary}")
            except Exception:
                pass

        # Dependencies
        deps = self.get_dependencies(hostname)
        if deps:
            context_parts.append("\nDependencies:")
            for dep in deps:
                context_parts.append(f"  - Depends on {dep['target']} for {dep['type']}")

        # Dependents
        dependents = self.get_dependents(hostname)
        if dependents:
            context_parts.append("\nUsed by:")
            for dependent in dependents:
                context_parts.append(f"  - {dependent['source']} uses this for {dependent['type']}")

        return "\n".join(context_parts)

    def get_issue_context(self, issue_description: str, system: str) -> str:
        """Get context about similar past issues"""
        similar = self.find_similar_issues(issue_description, system, n_results=3)

        if not similar:
            return ""

        context_parts = ["Similar past issues:"]
        for i, issue in enumerate(similar, 1):
            if issue['similarity'] > 0.7:  # Only include if fairly similar
                context_parts.append(f"\n{i}. {issue['issue']}")
                context_parts.append(f"   Similarity: {issue['similarity']:.2%}")

        return "\n".join(context_parts) if len(context_parts) > 1 else ""

    # ============ Config Files (for RAG) ============

    def store_config_file(
        self,
        file_path: str,
        content: str,
        category: str = "unknown",
        systems_using: List[str] = None
    ):
        """
        Store a configuration file for RAG retrieval

        Args:
            file_path: Path relative to repo root (e.g., "apps/gotify.nix")
            content: Full file contents
            category: apps/systems/osconfigs/users
            systems_using: List of system hostnames that import this file
        """
        self.config_files_collection.upsert(
            ids=[file_path],
            documents=[content],
            metadatas=[{
                "path": file_path,
                "category": category,
                "systems": json.dumps(systems_using or []),
                "updated_at": datetime.now().isoformat()
            }]
        )

    def get_config_file(self, file_path: str) -> Optional[Dict[str, Any]]:
        """Get a specific config file by path"""
        try:
            result = self.config_files_collection.get(
                ids=[file_path],
                include=["documents", "metadatas"]
            )

            if result['ids']:
                return {
                    "path": file_path,
                    "content": result['documents'][0],
                    "metadata": result['metadatas'][0]
                }
        except Exception:
            pass
        return None

    def query_config_files(
        self,
        query: str,
        system: str = None,
        category: str = None,
        n_results: int = 5
    ) -> List[Dict[str, Any]]:
        """
        Query config files using semantic search

        Args:
            query: Natural language query (e.g., "gotify configuration")
            system: Optional filter by system hostname
            category: Optional filter by category (apps/systems/etc)
            n_results: Number of results to return

        Returns:
            List of dicts with path, content, and metadata
        """
        where = {}
        if category:
            where["category"] = category

        try:
            result = self.config_files_collection.query(
                query_texts=[query],
                n_results=n_results,
                where=where if where else None,
                include=["documents", "metadatas", "distances"]
            )

            configs = []
            if result['ids'] and result['ids'][0]:
                for i, doc_id in enumerate(result['ids'][0]):
                    config = {
                        "path": doc_id,
                        "content": result['documents'][0][i],
                        "metadata": result['metadatas'][0][i],
                        "relevance": 1 - result['distances'][0][i]  # Convert distance to relevance
                    }

                    # Filter by system if specified
                    if system:
                        systems = json.loads(config['metadata'].get('systems', '[]'))
                        if system not in systems:
                            continue

                    configs.append(config)

            return configs
        except Exception as e:
            print(f"Error querying config files: {e}")
            return []

    def get_system_config_files(self, system: str) -> List[str]:
        """Get all config file paths used by a system"""
        # This is stored in the system's metadata now
        system_info = self.get_system(system)
        if system_info and 'config_files' in system_info.get('metadata', {}):
            # metadata is already a dict, config_files is already a list
            return system_info['metadata']['config_files']
        return []

    def update_system_config_files(self, system: str, config_files: List[str]):
        """Update the list of config files used by a system"""
        system_info = self.get_system(system)
        if system_info:
            # metadata is already a dict from get_system(), no need to json.loads()
            metadata = system_info.get('metadata', {})
            metadata['config_files'] = config_files
            metadata['config_updated_at'] = datetime.now().isoformat()

            # Re-register with updated metadata
            self.register_system(
                hostname=system,
                system_type=system_info['type'],
                services=system_info['services'],
                capabilities=system_info.get('capabilities', []),
                metadata=metadata,
                config_repo=system_info.get('config_repo'),
                config_branch=system_info.get('config_branch')
            )

    # =========================================================================
    # ISSUE TRACKING
    # =========================================================================

    # NOTE: this dict-based store_issue() replaces the earlier
    # store_issue(system, issue_description, ...) defined above; since the
    # later definition wins, callers must pass a full issue dict.
    def store_issue(self, issue: Dict[str, Any]):
        """Store a new issue in the database"""
        issue_id = issue['issue_id']

        # Store in ChromaDB with the issue as document
        self.issues_collection.add(
            documents=[json.dumps(issue)],
            metadatas=[{
                'issue_id': issue_id,
                'hostname': issue['hostname'],
                'title': issue['title'],
                'status': issue['status'],
                'severity': issue['severity'],
                'created_at': issue['created_at'],
                'source': issue['source']
            }],
            ids=[issue_id]
        )

    def get_issue(self, issue_id: str) -> Optional[Dict[str, Any]]:
        """Retrieve an issue by ID"""
        try:
            results = self.issues_collection.get(ids=[issue_id])
            if results['documents']:
                return json.loads(results['documents'][0])
            return None
        except Exception as e:
            print(f"Error retrieving issue {issue_id}: {e}")
            return None

    def update_issue(self, issue: Dict[str, Any]):
        """Update an existing issue"""
        issue_id = issue['issue_id']

        # Delete old version
        try:
            self.issues_collection.delete(ids=[issue_id])
        except Exception:
            pass

        # Store updated version
        self.store_issue(issue)

    def delete_issue(self, issue_id: str):
        """Remove an issue from the database (used when archiving)"""
        try:
            self.issues_collection.delete(ids=[issue_id])
        except Exception as e:
            print(f"Error deleting issue {issue_id}: {e}")

    def list_issues(
        self,
        hostname: Optional[str] = None,
        status: Optional[str] = None,
        severity: Optional[str] = None
    ) -> List[Dict[str, Any]]:
        """List issues with optional filters"""
        try:
            # Build query filter
            where_filter = {}
            if hostname:
                where_filter['hostname'] = hostname
            if status:
                where_filter['status'] = status
            if severity:
                where_filter['severity'] = severity

            if where_filter:
                results = self.issues_collection.get(where=where_filter)
            else:
                results = self.issues_collection.get()

            issues = []
            for doc in results['documents']:
                issues.append(json.loads(doc))

            # Sort by created_at descending
            issues.sort(key=lambda x: x.get('created_at', ''), reverse=True)

            return issues
        except Exception as e:
            print(f"Error listing issues: {e}")
            return []

    # ============ Knowledge Base ============

    def store_knowledge(
        self,
        topic: str,
        knowledge: str,
        category: str = "general",
        source: str = "experience",
        confidence: str = "medium",
        tags: list = None
    ) -> str:
        """
        Store a piece of operational knowledge

        Args:
            topic: Main subject (e.g., "nh os switch", "systemd-journal-remote")
            knowledge: The actual knowledge/insight/pattern
            category: Type of knowledge (command, pattern, troubleshooting, performance, etc.)
            source: Where this came from (experience, documentation, user-provided)
            confidence: How confident we are (low, medium, high)
            tags: Optional tags for categorization

        Returns:
            Knowledge ID
        """
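        # Illustrative usage (hypothetical values):
        #   store_knowledge("systemd-journal-remote",
        #                   "Remote journal uploads land under /var/log/journal/remote/",
        #                   category="troubleshooting", confidence="medium")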
import uuid
|
||||
from datetime import datetime
|
||||
|
||||
knowledge_id = str(uuid.uuid4())
|
||||
|
||||
knowledge_doc = {
|
||||
"id": knowledge_id,
|
||||
"topic": topic,
|
||||
"knowledge": knowledge,
|
||||
"category": category,
|
||||
"source": source,
|
||||
"confidence": confidence,
|
||||
"tags": tags or [],
|
||||
"created_at": datetime.utcnow().isoformat(),
|
||||
"last_verified": datetime.utcnow().isoformat(),
|
||||
"times_referenced": 0
|
||||
}
|
||||
|
||||
try:
|
||||
self.knowledge_collection.add(
|
||||
ids=[knowledge_id],
|
||||
documents=[knowledge],
|
||||
metadatas=[{
|
||||
"topic": topic,
|
||||
"category": category,
|
||||
"source": source,
|
||||
"confidence": confidence,
|
||||
"tags": json.dumps(tags or []),
|
||||
"created_at": knowledge_doc["created_at"],
|
||||
"full_doc": json.dumps(knowledge_doc)
|
||||
}]
|
||||
)
|
||||
return knowledge_id
|
||||
except Exception as e:
|
||||
print(f"Error storing knowledge: {e}")
|
||||
return None
|
||||
|
||||
def query_knowledge(
|
||||
self,
|
||||
query: str,
|
||||
category: str = None,
|
||||
limit: int = 5
|
||||
) -> list:
|
||||
"""
|
||||
Query the knowledge base for relevant information
|
||||
|
||||
Args:
|
||||
query: What to search for
|
||||
category: Optional category filter
|
||||
limit: Maximum results to return
|
||||
|
||||
Returns:
|
||||
            List of relevant knowledge entries
        """
        try:
            where_filter = {}
            if category:
                where_filter["category"] = category

            results = self.knowledge_collection.query(
                query_texts=[query],
                n_results=limit,
                where=where_filter if where_filter else None
            )

            knowledge_items = []
            if results and results['documents']:
                for i, doc in enumerate(results['documents'][0]):
                    metadata = results['metadatas'][0][i]
                    full_doc = json.loads(metadata.get('full_doc', '{}'))

                    # Increment reference count
                    full_doc['times_referenced'] = full_doc.get('times_referenced', 0) + 1

                    knowledge_items.append(full_doc)

            return knowledge_items
        except Exception as e:
            print(f"Error querying knowledge: {e}")
            return []

    def get_knowledge_by_topic(self, topic: str) -> list:
        """Get all knowledge entries for a specific topic"""
        try:
            results = self.knowledge_collection.get(
                where={"topic": topic}
            )

            knowledge_items = []
            for metadata in results['metadatas']:
                full_doc = json.loads(metadata.get('full_doc', '{}'))
                knowledge_items.append(full_doc)

            return knowledge_items
        except Exception as e:
            print(f"Error getting knowledge by topic: {e}")
            return []

    def update_knowledge(
        self,
        knowledge_id: str,
        knowledge: str = None,
        confidence: str = None,
        verify: bool = False
    ):
        """
        Update an existing knowledge entry

        Args:
            knowledge_id: ID of knowledge to update
            knowledge: New knowledge text (optional)
            confidence: New confidence level (optional)
            verify: Mark as verified (updates last_verified timestamp)
        """
        from datetime import datetime

        try:
            # Get existing entry
            result = self.knowledge_collection.get(ids=[knowledge_id])
            if not result['documents']:
                return False

            metadata = result['metadatas'][0]
            full_doc = json.loads(metadata.get('full_doc', '{}'))

            # Update fields
            if knowledge:
                full_doc['knowledge'] = knowledge
            if confidence:
                full_doc['confidence'] = confidence
            if verify:
                full_doc['last_verified'] = datetime.utcnow().isoformat()

            # Update in collection
            self.knowledge_collection.update(
                ids=[knowledge_id],
                documents=[full_doc['knowledge']],
                metadatas=[{
                    "topic": full_doc['topic'],
                    "category": full_doc['category'],
                    "source": full_doc['source'],
                    "confidence": full_doc['confidence'],
                    "tags": json.dumps(full_doc['tags']),
                    "created_at": full_doc['created_at'],
                    "full_doc": json.dumps(full_doc)
                }]
            )
            return True
        except Exception as e:
            print(f"Error updating knowledge: {e}")
            return False

    def list_knowledge_topics(self, category: str = None) -> list:
        """List all unique topics in the knowledge base"""
        try:
            where_filter = {"category": category} if category else None
            results = self.knowledge_collection.get(where=where_filter)

            topics = set()
            for metadata in results['metadatas']:
                topics.add(metadata.get('topic'))

            return sorted(list(topics))
        except Exception as e:
            print(f"Error listing knowledge topics: {e}")
            return []


if __name__ == "__main__":
    import sys

    # Test the database
    db = ContextDatabase()

    # Register test systems
    db.register_system(
        "macha",
        "workstation",
        ["ollama"],
        capabilities=["ai-inference"]
    )

    db.register_system(
        "rhiannon",
        "server",
        ["gotify", "nextcloud", "prowlarr"],
        capabilities=["notifications", "cloud-storage"]
    )

    # Add relationship
    db.add_relationship(
        "macha",
        "rhiannon",
        "uses-service",
        "Macha uses Rhiannon's Gotify for notifications"
    )

    # Test queries
    print("All systems:", db.get_all_systems())
    print("\nMacha's dependencies:", db.get_dependencies("macha"))
    print("\nRhiannon's dependents:", db.get_dependents("rhiannon"))
    print("\nSystem context:", db.get_system_context("macha"))
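A quick way to see the knowledge-base methods above in action. This sketch is illustrative only (not part of the commit); it assumes the collection has already been populated elsewhere and that `ContextDatabase()` takes no required arguments, as in the `__main__` block above. The query string and category are made-up examples.

```python
from context_db import ContextDatabase

db = ContextDatabase()

# Semantic search over stored knowledge, optionally narrowed by category
for item in db.query_knowledge("ollama GPU acceleration", category="configuration", limit=3):
    print(item.get("topic"), "->", item.get("knowledge"))

# Browse what the knowledge base currently covers
print(db.list_knowledge_topics())
```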
328
conversation.py
Normal file
@@ -0,0 +1,328 @@
#!/usr/bin/env python3
"""
Conversational Interface - Allows questioning Macha about decisions and system state
"""

import json
import requests
from typing import Dict, List, Any, Optional
from pathlib import Path
from datetime import datetime
from agent import MachaAgent


class MachaConversation:
    """Conversational interface for Macha"""

    def __init__(
        self,
        ollama_host: str = "http://localhost:11434",
        model: str = "gpt-oss:latest",
        state_dir: Path = Path("/var/lib/macha")
    ):
        self.ollama_host = ollama_host
        self.model = model
        self.state_dir = state_dir
        self.decision_log = self.state_dir / "decisions.jsonl"
        self.approval_queue = self.state_dir / "approval_queue.json"
        self.orchestrator_log = self.state_dir / "orchestrator.log"

        # Initialize agent with tool support and queue
        self.agent = MachaAgent(
            ollama_host=ollama_host,
            model=model,
            state_dir=state_dir,
            enable_tools=True,
            use_queue=True,
            priority="INTERACTIVE"
        )

    def ask(self, question: str, include_context: bool = True) -> str:
        """Ask Macha a question with optional system context"""

        context = ""
        if include_context:
            context = self._gather_context()

        # Build messages for tool-aware chat
        content = self._create_conversational_prompt(question, context)
        messages = [{"role": "user", "content": content}]

        response = self.agent._query_ollama_with_tools(messages)

        return response

    def discuss_action(self, action_index: int) -> str:
        """Discuss a specific queued action by its queue position (0-based index)"""

        action = self._get_action_from_queue(action_index)
        if not action:
            return f"No action found at queue position {action_index}. Use 'macha-approve list' to see available actions."

        context = self._gather_context()
        action_context = json.dumps(action, indent=2)

        content = f"""TASK: DISCUSS PROPOSED ACTION
================================================================================

A user is asking about a proposed action in your approval queue.

QUEUED ACTION (Queue Position #{action_index}):
{action_context}

RECENT SYSTEM CONTEXT:
{context}

The user wants to discuss this action. Explain:
1. Why you proposed this action
2. What problem it solves
3. The risks involved
4. What could go wrong
5. Alternative approaches if any

Be conversational, helpful, and honest about uncertainties.
"""

        messages = [{"role": "user", "content": content}]
        return self.agent._query_ollama_with_tools(messages)

    def _gather_context(self) -> str:
        """Gather relevant system context for the conversation"""

        context_parts = []

        # System infrastructure from ChromaDB
        try:
            from context_db import ContextDatabase
            db = ContextDatabase()
            systems = db.get_all_systems()

            if systems:
                context_parts.append("INFRASTRUCTURE:")
                for system in systems:
                    context_parts.append(f"  - {system['hostname']} ({system.get('type', 'unknown')})")
                    if system.get('config_repo'):
                        context_parts.append(f"    Config Repo: {system['config_repo']}")
                        context_parts.append(f"    Branch: {system.get('config_branch', 'unknown')}")
                    if system.get('capabilities'):
                        context_parts.append(f"    Capabilities: {', '.join(system['capabilities'])}")
        except Exception as e:
            # ChromaDB not available, skip
            pass

        # Recent decisions
        recent_decisions = self._get_recent_decisions(5)
        if recent_decisions:
            context_parts.append("\nRECENT DECISIONS:")
            for i, dec in enumerate(recent_decisions, 1):
                timestamp = dec.get("timestamp", "unknown")
                analysis = dec.get("analysis", {})
                status = analysis.get("status", "unknown")
                context_parts.append(f"{i}. [{timestamp}] Status: {status}")
                if "issues" in analysis:
                    for issue in analysis.get("issues", [])[:3]:
                        context_parts.append(f"   - {issue.get('description', 'N/A')}")

        # Pending approvals
        pending = self._get_pending_approvals()
        if pending:
            context_parts.append(f"\nPENDING APPROVALS: {len(pending)} action(s) awaiting approval")

        # Recent log excerpts (last 10 lines)
        recent_logs = self._get_recent_logs(10)
        if recent_logs:
            context_parts.append("\nRECENT LOG ENTRIES:")
            context_parts.extend(recent_logs)

        return "\n".join(context_parts)

    def _create_conversational_prompt(self, question: str, context: str) -> str:
        """Create a conversational prompt"""

        return f"""{MachaAgent.SYSTEM_PROMPT}

TASK: ANSWER QUESTION
================================================================================

You monitor system health, analyze issues using AI, and propose fixes. Be helpful,
honest about what you know and don't know, and reference the context provided below.

SYSTEM CONTEXT:
{context if context else "No recent activity"}

USER QUESTION:
{question}

Respond conversationally and helpfully. If the question is about your recent decisions
or actions, reference the context above. If you don't have enough information, say so.
Keep responses concise but informative.
"""

    def _query_ollama(self, prompt: str, temperature: float = 0.7) -> str:
        """Query Ollama API"""
        try:
            response = requests.post(
                f"{self.ollama_host}/api/generate",
                json={
                    "model": self.model,
                    "prompt": prompt,
                    "stream": False,
                    "temperature": temperature,
                },
                timeout=60
            )
            response.raise_for_status()
            return response.json().get("response", "")
        except requests.exceptions.HTTPError as e:
            error_detail = ""
            try:
                error_detail = f" - {response.text}"
            except:
                pass
            return f"Error: Ollama returned HTTP {response.status_code}{error_detail}"
        except Exception as e:
            return f"Error querying Ollama: {str(e)}"

    def _get_recent_decisions(self, count: int = 5) -> List[Dict[str, Any]]:
        """Get recent decisions from log"""
        if not self.decision_log.exists():
            return []

        decisions = []
        try:
            with open(self.decision_log, 'r') as f:
                for line in f:
                    if line.strip():
                        try:
                            decisions.append(json.loads(line))
                        except:
                            pass
        except:
            pass

        return decisions[-count:]

    def _get_pending_approvals(self) -> List[Dict[str, Any]]:
        """Get pending approvals from queue"""
        if not self.approval_queue.exists():
            return []

        try:
            with open(self.approval_queue, 'r') as f:
                data = json.load(f)
                # Queue is a JSON array, not an object with "pending" key
                if isinstance(data, list):
                    return data
                return data.get("pending", [])
        except:
            return []

    def _get_action_from_queue(self, action_index: int) -> Optional[Dict[str, Any]]:
        """Get a specific action from the queue by index"""
        pending = self._get_pending_approvals()
        if 0 <= action_index < len(pending):
            return pending[action_index]
        return None

    def _get_recent_logs(self, count: int = 10) -> List[str]:
        """Get recent orchestrator log lines"""
        if not self.orchestrator_log.exists():
            return []

        try:
            with open(self.orchestrator_log, 'r') as f:
                lines = f.readlines()
                return [line.strip() for line in lines[-count:] if line.strip()]
        except:
            return []


if __name__ == "__main__":
    import sys
    import argparse

    parser = argparse.ArgumentParser(description="Ask Macha a question or discuss an action")
    parser.add_argument("--discuss", type=int, metavar="ACTION_ID", help="Discuss a specific queued action")
    parser.add_argument("--follow-up", type=str, metavar="QUESTION", help="Follow-up question about the action")
    parser.add_argument("question", nargs="*", help="Your question for Macha")
    parser.add_argument("--no-context", action="store_true", help="Don't include system context")

    args = parser.parse_args()

    # Load config if available
    config_file = Path("/etc/macha-autonomous/config.json")
    ollama_host = "http://localhost:11434"
    model = "gpt-oss:latest"

    if config_file.exists():
        try:
            with open(config_file, 'r') as f:
                config = json.load(f)
                ollama_host = config.get("ollama_host", ollama_host)
                model = config.get("model", model)
        except:
            pass

    conversation = MachaConversation(
        ollama_host=ollama_host,
        model=model
    )

    if args.discuss is not None:
        if args.follow_up:
            # Follow-up question about a specific action
            action = conversation._get_action_from_queue(args.discuss)
            if not action:
                print(f"No action found at queue position {args.discuss}. Use 'macha-approve list' to see available actions.")
                sys.exit(1)

            # Build context with the action details
            action_context = f"""
QUEUED ACTION #{args.discuss}:
Diagnosis: {action.get('proposal', {}).get('diagnosis', 'N/A')}
Proposed Action: {action.get('proposal', {}).get('proposed_action', 'N/A')}
Action Type: {action.get('proposal', {}).get('action_type', 'N/A')}
Risk Level: {action.get('proposal', {}).get('risk_level', 'N/A')}
Commands: {json.dumps(action.get('proposal', {}).get('commands', []), indent=2)}
Reasoning: {action.get('proposal', {}).get('reasoning', 'N/A')}

FOLLOW-UP QUESTION:
{args.follow_up}
"""

            # Query the AI with the action context
            response = conversation._query_ollama(f"""{MachaAgent.SYSTEM_PROMPT}

TASK: ANSWER FOLLOW-UP QUESTION ABOUT QUEUED ACTION
================================================================================

You are answering a follow-up question about a proposed fix that is awaiting approval.
Be helpful and answer directly. If the user is concerned about risks, explain them clearly.
If they ask about alternatives, suggest them.

{action_context}

RESPOND CONCISELY AND DIRECTLY.
""")

        else:
            # Initial discussion about the action
            response = conversation.discuss_action(args.discuss)
    elif args.question:
        # Ask a general question
        question = " ".join(args.question)
        response = conversation.ask(question, include_context=not args.no_context)
    else:
        parser.print_help()
        sys.exit(1)

    # Only print formatted output for initial discussion, not for follow-ups
    if args.follow_up:
        print(response)
    else:
        print("\n" + "="*60)
        print("MACHA:")
        print("="*60)
        print(response)
        print("="*60 + "\n")
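For completeness, the class can also be driven programmatically rather than through the CLI. A minimal sketch (not part of the commit), assuming the `agent` module and a reachable Ollama instance are available, since the constructor above requires both; the question text is a made-up example.

```python
from conversation import MachaConversation

conv = MachaConversation(model="gpt-oss:latest")

# General question, with context gathered from ChromaDB, decision log and queue
print(conv.ask("What failed on rhiannon in the last hour?"))

# Explain the first action waiting in the approval queue
print(conv.discuss_action(0))
```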
537
executor.py
Normal file
@@ -0,0 +1,537 @@
#!/usr/bin/env python3
"""
Action Executor - Safely executes proposed fixes with rollback capability
"""

import json
import subprocess
import shutil
from typing import Dict, List, Any, Optional
from pathlib import Path
from datetime import datetime
import time


class SafeExecutor:
    """Executes system maintenance actions with safety checks"""

    # Actions that are considered safe to auto-execute
    SAFE_ACTIONS = {
        "systemd_restart",  # Restart failed services
        "cleanup",          # Disk cleanup, log rotation
        "investigation",    # Read-only diagnostics
    }

    # Services that should NEVER be stopped/disabled
    PROTECTED_SERVICES = {
        "sshd",
        "systemd-networkd",
        "NetworkManager",
        "systemd-resolved",
        "dbus",
    }

    def __init__(
        self,
        state_dir: Path = Path("/var/lib/macha"),
        autonomy_level: str = "suggest",  # observe, suggest, auto-safe, auto-full
        dry_run: bool = False,
        agent = None  # Optional agent for learning from actions
    ):
        self.state_dir = state_dir
        self.state_dir.mkdir(parents=True, exist_ok=True)
        self.autonomy_level = autonomy_level
        self.dry_run = dry_run
        self.agent = agent
        self.action_log = self.state_dir / "actions.jsonl"
        self.approval_queue = self.state_dir / "approval_queue.json"

    def execute_action(self, action: Dict[str, Any], monitoring_context: Dict[str, Any]) -> Dict[str, Any]:
        """Execute a proposed action with appropriate safety checks"""

        action_type = action.get("action_type", "unknown")
        risk_level = action.get("risk_level", "high")

        # Determine if we should execute
        should_execute, reason = self._should_execute(action_type, risk_level)

        if not should_execute:
            if self.autonomy_level == "suggest":
                # Queue for approval
                self._queue_for_approval(action, monitoring_context)
                return {
                    "executed": False,
                    "status": "queued_for_approval",
                    "reason": reason,
                    "queue_file": str(self.approval_queue)
                }
            else:
                return {
                    "executed": False,
                    "status": "blocked",
                    "reason": reason
                }

        # Execute the action
        if self.dry_run:
            return self._dry_run_action(action)

        return self._execute_action_impl(action, monitoring_context)

    def _should_execute(self, action_type: str, risk_level: str) -> tuple[bool, str]:
        """Determine if an action should be auto-executed based on autonomy level"""

        if self.autonomy_level == "observe":
            return False, "Autonomy level set to observe-only"

        # Auto-approve low-risk investigation actions
        if action_type == "investigation" and risk_level == "low":
            return True, "Auto-approved: Low-risk information gathering"

        if self.autonomy_level == "suggest":
            return False, "Autonomy level requires manual approval"

        if self.autonomy_level == "auto-safe":
            if action_type in self.SAFE_ACTIONS and risk_level == "low":
                return True, "Auto-executing safe action"
            return False, "Action requires higher autonomy level"

        if self.autonomy_level == "auto-full":
            if risk_level == "high":
                return False, "High risk actions always require approval"
            return True, "Auto-executing approved action"

        return False, "Unknown autonomy level"
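The gating above is easiest to read as a small truth table. A demo of `_should_execute` at the `auto-safe` level (illustrative only, not part of the commit; the `/tmp` state directory is arbitrary):

```python
from pathlib import Path
from executor import SafeExecutor

ex = SafeExecutor(state_dir=Path("/tmp/macha-demo"), autonomy_level="auto-safe")

print(ex._should_execute("investigation", "low"))    # (True,  "Auto-approved: Low-risk information gathering")
print(ex._should_execute("systemd_restart", "low"))  # (True,  "Auto-executing safe action")
print(ex._should_execute("nix_rebuild", "medium"))   # (False, "Action requires higher autonomy level")
```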
    def _execute_action_impl(self, action: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
        """Actually execute the action"""

        action_type = action.get("action_type")
        result = {
            "executed": True,
            "timestamp": datetime.now().isoformat(),
            "action": action,
            "success": False,
            "output": "",
            "error": None
        }

        try:
            if action_type == "systemd_restart":
                result.update(self._restart_services(action))

            elif action_type == "cleanup":
                result.update(self._perform_cleanup(action))

            elif action_type == "nix_rebuild":
                result.update(self._nix_rebuild(action))

            elif action_type == "config_change":
                result.update(self._apply_config_change(action))

            elif action_type == "investigation":
                result.update(self._run_investigation(action))

            else:
                result["error"] = f"Unknown action type: {action_type}"

        except Exception as e:
            result["error"] = str(e)
            result["success"] = False

        # Log the action
        self._log_action(result)

        # Learn from successful operations
        if result.get("success") and self.agent:
            try:
                self.agent.reflect_and_learn(
                    situation=action.get("diagnosis", "Unknown situation"),
                    action_taken=action.get("proposed_action", "Unknown action"),
                    outcome=result.get("output", ""),
                    success=True
                )
            except Exception as e:
                # Don't fail the action if learning fails
                print(f"Note: Could not record learning: {e}")

        return result

    def _restart_services(self, action: Dict[str, Any]) -> Dict[str, Any]:
        """Restart systemd services"""
        commands = action.get("commands", [])
        output_lines = []

        for cmd in commands:
            if not cmd.startswith("systemctl restart "):
                continue

            service = cmd.split()[-1]

            # Safety check
            if any(protected in service for protected in self.PROTECTED_SERVICES):
                output_lines.append(f"BLOCKED: {service} is protected")
                continue

            try:
                result = subprocess.run(
                    ["systemctl", "restart", service],
                    capture_output=True,
                    text=True,
                    timeout=30
                )

                if result.returncode == 0:
                    output_lines.append(f"✓ Restarted {service}")
                else:
                    output_lines.append(f"✗ Failed to restart {service}: {result.stderr}")

            except subprocess.TimeoutExpired:
                output_lines.append(f"✗ Timeout restarting {service}")

        return {
            "success": len(output_lines) > 0,
            "output": "\n".join(output_lines)
        }

    def _perform_cleanup(self, action: Dict[str, Any]) -> Dict[str, Any]:
        """Perform system cleanup tasks"""
        output_lines = []

        # Nix store cleanup
        if "nix" in action.get("proposed_action", "").lower():
            try:
                result = subprocess.run(
                    ["nix-collect-garbage", "--delete-old"],
                    capture_output=True,
                    text=True,
                    timeout=300
                )
                output_lines.append(f"Nix cleanup: {result.stdout}")
            except Exception as e:
                output_lines.append(f"Nix cleanup failed: {e}")

        # Journal cleanup (keep last 7 days)
        try:
            result = subprocess.run(
                ["journalctl", "--vacuum-time=7d"],
                capture_output=True,
                text=True,
                timeout=60
            )
            output_lines.append(f"Journal cleanup: {result.stdout}")
        except Exception as e:
            output_lines.append(f"Journal cleanup failed: {e}")

        return {
            "success": True,
            "output": "\n".join(output_lines)
        }

    def _nix_rebuild(self, action: Dict[str, Any]) -> Dict[str, Any]:
        """Rebuild NixOS configuration"""

        # This is HIGH RISK - always requires approval or full autonomy
        # And we should test first

        output_lines = []

        # First, try a dry build
        try:
            result = subprocess.run(
                ["nixos-rebuild", "dry-build", "--flake", ".#macha"],
                capture_output=True,
                text=True,
                timeout=600,
                cwd="/home/lily/Documents/nixos-servers"
            )

            if result.returncode != 0:
                return {
                    "success": False,
                    "output": f"Dry build failed:\n{result.stderr}"
                }

            output_lines.append("✓ Dry build successful")

        except Exception as e:
            return {
                "success": False,
                "output": f"Dry build error: {e}"
            }

        # Now do the actual rebuild
        try:
            result = subprocess.run(
                ["nixos-rebuild", "switch", "--flake", ".#macha"],
                capture_output=True,
                text=True,
                timeout=1200,
                cwd="/home/lily/Documents/nixos-servers"
            )

            output_lines.append(result.stdout)

            return {
                "success": result.returncode == 0,
                "output": "\n".join(output_lines),
                "error": result.stderr if result.returncode != 0 else None
            }

        except Exception as e:
            return {
                "success": False,
                "output": "\n".join(output_lines),
                "error": str(e)
            }

    def _apply_config_change(self, action: Dict[str, Any]) -> Dict[str, Any]:
        """Apply a configuration file change"""

        config_changes = action.get("config_changes", {})
        file_path = config_changes.get("file")

        if not file_path:
            return {
                "success": False,
                "output": "No file specified in config_changes"
            }

        # For now, we DON'T auto-modify configs - too risky
        # Instead, we create a suggested patch file

        patch_file = self.state_dir / f"suggested_patch_{int(time.time())}.txt"
        with open(patch_file, 'w') as f:
            f.write(f"Suggested change to {file_path}:\n\n")
            f.write(config_changes.get("change", "No change description"))
            f.write(f"\n\nReasoning: {action.get('reasoning', 'No reasoning provided')}")

        return {
            "success": True,
            "output": f"Config change suggestion saved to {patch_file}\nThis requires manual review and application."
        }

    def _run_investigation(self, action: Dict[str, Any]) -> Dict[str, Any]:
        """Run diagnostic commands"""
        commands = action.get("commands", [])
        output_lines = []

        for cmd in commands:
            # Only allow safe read-only commands
            safe_commands = ["journalctl", "systemctl status", "df", "free", "ps", "netstat", "ss"]
            if not any(cmd.startswith(safe) for safe in safe_commands):
                output_lines.append(f"BLOCKED unsafe command: {cmd}")
                continue

            try:
                result = subprocess.run(
                    cmd,
                    shell=True,
                    capture_output=True,
                    text=True,
                    timeout=30
                )
                output_lines.append(f"$ {cmd}")
                output_lines.append(result.stdout)
            except Exception as e:
                output_lines.append(f"Error running {cmd}: {e}")

        return {
            "success": True,
            "output": "\n".join(output_lines)
        }

    def _dry_run_action(self, action: Dict[str, Any]) -> Dict[str, Any]:
        """Simulate action execution"""
        return {
            "executed": False,
            "status": "dry_run",
            "action": action,
            "output": "Dry run mode - no actual changes made"
        }

    def _queue_for_approval(self, action: Dict[str, Any], context: Dict[str, Any]):
        """Add action to approval queue"""
        queue = []
        if self.approval_queue.exists():
            with open(self.approval_queue, 'r') as f:
                queue = json.load(f)

        # Check for duplicate pending actions
        proposed_action = action.get("proposed_action", "")
        diagnosis = action.get("diagnosis", "")

        for existing in queue:
            # Skip already approved/rejected items
            if existing.get("approved") is not None:
                continue

            existing_action = existing.get("action", {})
            existing_proposed = existing_action.get("proposed_action", "")
            existing_diagnosis = existing_action.get("diagnosis", "")

            # Check if this is essentially the same issue
            # Match if diagnosis is very similar OR proposed action is very similar
            if (diagnosis and existing_diagnosis and
                    self._similarity_check(diagnosis, existing_diagnosis) > 0.7):
                print(f"Skipping duplicate action - similar diagnosis already queued")
                return

            if (proposed_action and existing_proposed and
                    self._similarity_check(proposed_action, existing_proposed) > 0.7):
                print(f"Skipping duplicate action - similar proposal already queued")
                return

        queue.append({
            "timestamp": datetime.now().isoformat(),
            "action": action,
            "context": context,
            "approved": None
        })

        with open(self.approval_queue, 'w') as f:
            json.dump(queue, f, indent=2)

    def _similarity_check(self, str1: str, str2: str) -> float:
        """Simple similarity check between two strings"""
        # Normalize strings
        s1 = str1.lower().strip()
        s2 = str2.lower().strip()

        # Exact match
        if s1 == s2:
            return 1.0

        # Check for significant word overlap
        words1 = set(s1.split())
        words2 = set(s2.split())

        # Remove common words that don't indicate similarity
        common_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by', 'is', 'are', 'was', 'were', 'be', 'been', 'have', 'has', 'had'}
        words1 = words1 - common_words
        words2 = words2 - common_words

        if not words1 or not words2:
            return 0.0

        # Calculate Jaccard similarity
        intersection = len(words1 & words2)
        union = len(words1 | words2)

        return intersection / union if union > 0 else 0.0
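A worked example of the Jaccard check above (illustrative, not part of the commit): after stopword removal, both strings below reduce to the same five-word set {nginx, service, failed, start, rhiannon}, so intersection and union are both 5, the score is 1.0, and the second proposal would be dropped as a duplicate against the 0.7 threshold.

```python
from pathlib import Path
from executor import SafeExecutor

ex = SafeExecutor(state_dir=Path("/tmp/macha-demo"))
score = ex._similarity_check(
    "nginx service failed to start on rhiannon",
    "the nginx service on rhiannon failed to start",
)
print(score)  # 1.0 -> well above the 0.7 duplicate threshold
```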
    def _log_action(self, result: Dict[str, Any]):
        """Log executed actions"""
        with open(self.action_log, 'a') as f:
            f.write(json.dumps(result) + '\n')

    def get_approval_queue(self) -> List[Dict[str, Any]]:
        """Get pending actions awaiting approval"""
        if not self.approval_queue.exists():
            return []

        with open(self.approval_queue, 'r') as f:
            return json.load(f)

    def approve_action(self, index: int) -> bool:
        """Approve and execute a queued action, then remove it from queue"""
        queue = self.get_approval_queue()
        if 0 <= index < len(queue):
            action_item = queue[index]

            # Execute the approved action
            result = self._execute_action_impl(action_item["action"], action_item["context"])

            # Archive the action (success or failure)
            self._archive_action(action_item, result)

            # Remove from queue regardless of outcome
            queue.pop(index)

            with open(self.approval_queue, 'w') as f:
                json.dump(queue, f, indent=2)

            return result.get("success", False)

        return False

    def _archive_action(self, action_item: Dict[str, Any], result: Dict[str, Any]):
        """Archive an approved action with its execution result"""
        archive_file = self.state_dir / "approved_actions.jsonl"

        archive_entry = {
            "timestamp": datetime.now().isoformat(),
            "original_timestamp": action_item.get("timestamp"),
            "action": action_item.get("action"),
            "context": action_item.get("context"),
            "result": result
        }

        with open(archive_file, 'a') as f:
            f.write(json.dumps(archive_entry) + '\n')

    def reject_action(self, index: int) -> bool:
        """Reject and remove a queued action"""
        queue = self.get_approval_queue()
        if 0 <= index < len(queue):
            removed_action = queue.pop(index)

            with open(self.approval_queue, 'w') as f:
                json.dump(queue, f, indent=2)

            return True

        return False


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        if sys.argv[1] == "queue":
            executor = SafeExecutor()
            queue = executor.get_approval_queue()
            if queue:
                print("\n" + "="*70)
                print(f"PENDING ACTIONS: {len(queue)}")
                print("="*70)
                for i, item in enumerate(queue):
                    action = item.get("action", {})
                    timestamp = item.get("timestamp", "unknown")
                    approved = item.get("approved")

                    status = "✓ APPROVED" if approved else "⏳ PENDING" if approved is None else "✗ REJECTED"

                    print(f"\n[{i}] {status} - {timestamp}")
                    print("-" * 70)
                    print(f"DIAGNOSIS: {action.get('diagnosis', 'N/A')}")
                    print(f"\nPROPOSED ACTION: {action.get('proposed_action', 'N/A')}")
                    print(f"TYPE: {action.get('action_type', 'N/A')}")
                    print(f"RISK: {action.get('risk_level', 'N/A')}")

                    if action.get('commands'):
                        print(f"\nCOMMANDS:")
                        for cmd in action['commands']:
                            print(f"  - {cmd}")

                    if action.get('config_changes'):
                        print(f"\nCONFIG CHANGES:")
                        for key, value in action['config_changes'].items():
                            print(f"  {key}: {value}")

                    print(f"\nREASONING: {action.get('reasoning', 'N/A')}")
                print("\n" + "="*70 + "\n")
            else:
                print("No pending actions")

        elif sys.argv[1] == "approve" and len(sys.argv) > 2:
            executor = SafeExecutor()
            index = int(sys.argv[2])
            success = executor.approve_action(index)
            print(f"Approval {'succeeded' if success else 'failed'}")

        elif sys.argv[1] == "reject" and len(sys.argv) > 2:
            executor = SafeExecutor()
            index = int(sys.argv[2])
            success = executor.reject_action(index)
            print(f"Action {'rejected and removed from queue' if success else 'rejection failed'}")
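Putting the pieces together, the typical `suggest`-level flow looks like this. A minimal sketch (not part of the commit; the action dict and state directory are made-up examples, and approving item 0 really runs `systemctl restart`):

```python
from pathlib import Path
from executor import SafeExecutor

ex = SafeExecutor(state_dir=Path("/tmp/macha-demo"), autonomy_level="suggest")

action = {
    "action_type": "systemd_restart",
    "risk_level": "low",
    "diagnosis": "ollama service crashed",
    "proposed_action": "Restart the ollama service",
    "commands": ["systemctl restart ollama"],
}

# At "suggest" level this is queued, not executed
print(ex.execute_action(action, monitoring_context={}))  # status: queued_for_approval

# A human approves item 0, which executes, archives, and dequeues it
print(ex.approve_action(0))
```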
41
flake.nix
Normal file
@@ -0,0 +1,41 @@
{
  description = "Macha - AI-Powered Autonomous System Administrator";

  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
  };

  outputs = { self, nixpkgs }: {
    # NixOS module
    nixosModules.default = import ./module.nix;

    # Alternative explicit name
    nixosModules.macha-autonomous = import ./module.nix;

    # For development
    devShells = nixpkgs.lib.genAttrs [ "x86_64-linux" "aarch64-linux" ] (system:
      let
        pkgs = nixpkgs.legacyPackages.${system};
        pythonEnv = pkgs.python3.withPackages (ps: with ps; [
          requests
          psutil
          chromadb
        ]);
      in {
        default = pkgs.mkShell {
          packages = [ pythonEnv pkgs.git ];
          shellHook = ''
            echo "Macha Autonomous Development Environment"
            echo "Python packages: requests, psutil, chromadb"
          '';
        };
      }
    );

    # Formatter
    formatter = nixpkgs.lib.genAttrs [ "x86_64-linux" "aarch64-linux" ] (system:
      nixpkgs.legacyPackages.${system}.nixpkgs-fmt
    );
  };
}
222
git_context.py
Normal file
@@ -0,0 +1,222 @@
#!/usr/bin/env python3
"""
Git Context - Extract context from NixOS configuration repository
"""

import subprocess
from typing import Dict, List, Any, Optional
from datetime import datetime, timedelta
from pathlib import Path


class GitContext:
    """Extract context from git repository"""

    def __init__(self, repo_path: str = "/etc/nixos"):
        """
        Initialize git context extractor

        Args:
            repo_path: Path to the git repository (default: /etc/nixos for NixOS systems)
        """
        self.repo_path = Path(repo_path)

    def _run_git(self, args: List[str]) -> tuple[bool, str]:
        """Run git command"""
        try:
            result = subprocess.run(
                ["git", "-C", str(self.repo_path)] + args,
                capture_output=True,
                text=True,
                timeout=10
            )
            return (result.returncode == 0, result.stdout.strip())
        except Exception as e:
            return (False, str(e))

    def get_current_branch(self) -> str:
        """Get current git branch"""
        success, output = self._run_git(["rev-parse", "--abbrev-ref", "HEAD"])
        return output if success else "unknown"

    def get_remote_url(self) -> str:
        """Get git remote URL"""
        success, output = self._run_git(["remote", "get-url", "origin"])
        return output if success else ""

    def get_recent_commits(self, count: int = 10, since: str = "1 week ago") -> List[Dict[str, str]]:
        """
        Get recent commits

        Args:
            count: Number of commits to retrieve
            since: Time range (e.g., "1 week ago", "3 days ago")

        Returns:
            List of commit dictionaries with hash, author, date, message
        """
        success, output = self._run_git([
            "log",
            f"--since={since}",
            f"-n{count}",
            "--format=%H|%an|%ar|%s"
        ])

        if not success:
            return []

        commits = []
        for line in output.split('\n'):
            if not line.strip():
                continue
            parts = line.split('|', 3)
            if len(parts) == 4:
                commits.append({
                    "hash": parts[0][:8],  # Short hash
                    "author": parts[1],
                    "date": parts[2],
                    "message": parts[3]
                })

        return commits

    def get_system_config_files(self, system_name: str) -> List[str]:
        """
        Get configuration files for a specific system

        Args:
            system_name: Name of the system (e.g., "macha", "rhiannon")

        Returns:
            List of configuration file paths
        """
        system_dir = self.repo_path / "systems" / system_name
        config_files = []

        if system_dir.exists():
            # Main config
            if (system_dir.parent / f"{system_name}.nix").exists():
                config_files.append(f"systems/{system_name}.nix")

            # System-specific configs
            for config_file in system_dir.rglob("*.nix"):
                config_files.append(str(config_file.relative_to(self.repo_path)))

        return config_files

    def get_recent_changes_for_system(self, system_name: str, since: str = "1 week ago") -> List[Dict[str, str]]:
        """
        Get recent changes affecting a specific system

        Args:
            system_name: Name of the system
            since: Time range

        Returns:
            List of commits that affected this system
        """
        config_files = self.get_system_config_files(system_name)

        if not config_files:
            return []

        # Get commits that touched these files
        file_args = []
        for f in config_files:
            file_args.extend(["--", f])

        success, output = self._run_git([
            "log",
            f"--since={since}",
            "-n10",
            "--format=%H|%an|%ar|%s"
        ] + file_args)

        if not success:
            return []

        commits = []
        for line in output.split('\n'):
            if not line.strip():
                continue
            parts = line.split('|', 3)
            if len(parts) == 4:
                commits.append({
                    "hash": parts[0][:8],
                    "author": parts[1],
                    "date": parts[2],
                    "message": parts[3]
                })

        return commits

    def get_system_context_summary(self, system_name: str) -> str:
        """
        Get a summary of git context for a system

        Args:
            system_name: Name of the system

        Returns:
            Human-readable summary
        """
        lines = []

        # Repository info
        repo_url = self.get_remote_url()
        branch = self.get_current_branch()

        if repo_url:
            lines.append(f"Configuration Repository: {repo_url}")
            lines.append(f"Branch: {branch}")

        # Recent changes to this system
        recent_changes = self.get_recent_changes_for_system(system_name, "2 weeks ago")

        if recent_changes:
            lines.append(f"\nRecent configuration changes (last 2 weeks):")
            for commit in recent_changes[:5]:
                lines.append(f"  - {commit['date']}: {commit['message']} ({commit['author']})")
        else:
            lines.append("\nNo recent configuration changes")

        return "\n".join(lines)

    def get_all_managed_systems(self) -> List[str]:
        """
        Get list of all systems managed by this repository

        Returns:
            List of system names
        """
        systems = []
        systems_dir = self.repo_path / "systems"

        if systems_dir.exists():
            for system_file in systems_dir.glob("*.nix"):
                if system_file.stem not in ["default"]:
                    systems.append(system_file.stem)

        return sorted(systems)


if __name__ == "__main__":
    import sys

    git = GitContext()

    print("Repository:", git.get_remote_url())
    print("Branch:", git.get_current_branch())
    print("\nManaged Systems:")
    for system in git.get_all_managed_systems():
        print(f"  - {system}")

    print("\nRecent Commits:")
    for commit in git.get_recent_commits(5):
        print(f"  {commit['hash']}: {commit['message']} - {commit['author']}, {commit['date']}")

    if len(sys.argv) > 1:
        system = sys.argv[1]
        print(f"\nContext for {system}:")
        print(git.get_system_context_summary(system))
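One plausible way the orchestrator could consume this module when investigating a host (the integration shown is an assumption, not code from this commit; the hostname and time range are made-up examples):

```python
from git_context import GitContext

git = GitContext("/etc/nixos")
# Did anything change in rhiannon's config right before the problem started?
for commit in git.get_recent_changes_for_system("rhiannon", since="3 days ago"):
    print(f"{commit['hash']} {commit['date']}: {commit['message']}")
```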
219
issue_tracker.py
Normal file
@@ -0,0 +1,219 @@
#!/usr/bin/env python3
"""
Issue Tracker - Internal ticketing system for tracking problems and their resolution
"""

import json
import uuid
from datetime import datetime
from typing import Dict, List, Any, Optional
from pathlib import Path


class IssueTracker:
    """Manages issue lifecycle: detection -> investigation -> resolution"""

    def __init__(self, context_db, log_dir: str = "/var/lib/macha/logs"):
        self.context_db = context_db
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(parents=True, exist_ok=True)
        self.closed_log = self.log_dir / "closed_issues.jsonl"

    def create_issue(
        self,
        hostname: str,
        title: str,
        description: str,
        severity: str = "medium",
        source: str = "auto-detected"
    ) -> str:
        """Create a new issue and return its ID"""
        issue_id = str(uuid.uuid4())
        now = datetime.utcnow().isoformat()

        issue = {
            "issue_id": issue_id,
            "hostname": hostname,
            "title": title,
            "description": description,
            "status": "open",
            "severity": severity,
            "created_at": now,
            "updated_at": now,
            "source": source,
            "investigations": [],
            "actions": [],
            "resolution": None
        }

        self.context_db.store_issue(issue)
        return issue_id

    def get_issue(self, issue_id: str) -> Optional[Dict[str, Any]]:
        """Retrieve an issue by ID"""
        return self.context_db.get_issue(issue_id)

    def update_issue(
        self,
        issue_id: str,
        status: Optional[str] = None,
        investigation: Optional[Dict[str, Any]] = None,
        action: Optional[Dict[str, Any]] = None
    ) -> bool:
        """Update an issue with new information"""
        issue = self.get_issue(issue_id)
        if not issue:
            return False

        if status:
            issue["status"] = status

        if investigation:
            investigation["timestamp"] = datetime.utcnow().isoformat()
            issue["investigations"].append(investigation)

        if action:
            action["timestamp"] = datetime.utcnow().isoformat()
            issue["actions"].append(action)

        issue["updated_at"] = datetime.utcnow().isoformat()

        self.context_db.update_issue(issue)
        return True

    def find_similar_issue(
        self,
        hostname: str,
        title: str,
        description: str = None
    ) -> Optional[Dict[str, Any]]:
        """Find an existing open issue that matches this problem"""
        open_issues = self.list_issues(hostname=hostname, status="open")

        # Simple similarity check on title
        title_lower = title.lower()
        for issue in open_issues:
            issue_title_lower = issue.get("title", "").lower()

            # Check for keyword overlap
            title_words = set(title_lower.split())
            issue_words = set(issue_title_lower.split())

            # If >50% of words overlap, consider it similar
            if len(title_words & issue_words) / max(len(title_words), 1) > 0.5:
                return issue

        return None

    def list_issues(
        self,
        hostname: Optional[str] = None,
        status: Optional[str] = None,
        severity: Optional[str] = None
    ) -> List[Dict[str, Any]]:
        """List issues with optional filters"""
        return self.context_db.list_issues(
            hostname=hostname,
            status=status,
            severity=severity
        )

    def resolve_issue(self, issue_id: str, resolution: str) -> bool:
        """Mark an issue as resolved with a resolution note"""
        issue = self.get_issue(issue_id)
        if not issue:
            return False

        issue["status"] = "resolved"
        issue["resolution"] = resolution
        issue["updated_at"] = datetime.utcnow().isoformat()

        self.context_db.update_issue(issue)
        return True

    def close_issue(self, issue_id: str) -> bool:
        """Archive a resolved issue to the closed log"""
        issue = self.get_issue(issue_id)
        if not issue:
            return False

        # Can only close resolved issues
        if issue["status"] != "resolved":
            return False

        issue["status"] = "closed"
        issue["closed_at"] = datetime.utcnow().isoformat()

        # Archive to closed log
        self._archive_issue(issue)

        # Remove from active database
        self.context_db.delete_issue(issue_id)

        return True

    def get_issue_history(self, issue_id: str) -> Dict[str, Any]:
        """Get full history for an issue (investigations + actions)"""
        issue = self.get_issue(issue_id)
        if not issue:
            return {}

        return {
            "issue": issue,
            "investigation_count": len(issue.get("investigations", [])),
            "action_count": len(issue.get("actions", [])),
            "age_hours": self._calculate_age(issue["created_at"]),
            "last_activity": issue["updated_at"]
        }

    def auto_resolve_if_fixed(self, hostname: str, detected_problems: List[str]) -> int:
        """
        Auto-resolve open issues if their problems are no longer detected.
        Returns count of auto-resolved issues.
        """
        open_issues = self.list_issues(hostname=hostname, status="open")
        resolved_count = 0

        # Convert detected problems to lowercase for comparison
        detected_lower = [p.lower() for p in detected_problems]

        for issue in open_issues:
            title_lower = issue.get("title", "").lower()
            desc_lower = issue.get("description", "").lower()

            # Check if issue keywords are still in detected problems
            still_present = False
            for detected in detected_lower:
                if any(word in detected for word in title_lower.split()) or \
                   any(word in detected for word in desc_lower.split()):
                    still_present = True
                    break

            # If problem is no longer detected, auto-resolve
            if not still_present:
                self.resolve_issue(
                    issue["issue_id"],
                    "Auto-resolved: Problem no longer detected in system monitoring"
                )
                resolved_count += 1

        return resolved_count

    def _archive_issue(self, issue: Dict[str, Any]):
        """Append closed issue to the archive log"""
        try:
            with open(self.closed_log, "a") as f:
                f.write(json.dumps(issue) + "\n")
        except Exception as e:
            print(f"Failed to archive issue {issue.get('issue_id')}: {e}")

    def _calculate_age(self, created_at: str) -> float:
        """Calculate age of issue in hours"""
        try:
            created = datetime.fromisoformat(created_at)
            now = datetime.utcnow()
            delta = now - created
            return delta.total_seconds() / 3600
        except:
            return 0
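The full lifecycle (detection -> investigation -> resolution -> close) in one sketch. The in-memory class below is a stand-in for the ChromaDB-backed context database and only mimics the five methods `IssueTracker` actually calls; it is illustrative, not part of the commit, and the hostnames and findings are made up.

```python
from issue_tracker import IssueTracker

class InMemoryDB:
    """Minimal stand-in for ContextDatabase's issue-storage interface."""
    def __init__(self): self.issues = {}
    def store_issue(self, issue): self.issues[issue["issue_id"]] = issue
    def get_issue(self, issue_id): return self.issues.get(issue_id)
    def update_issue(self, issue): self.issues[issue["issue_id"]] = issue
    def delete_issue(self, issue_id): self.issues.pop(issue_id, None)
    def list_issues(self, hostname=None, status=None, severity=None):
        return [i for i in self.issues.values()
                if (hostname is None or i["hostname"] == hostname)
                and (status is None or i["status"] == status)
                and (severity is None or i["severity"] == severity)]

tracker = IssueTracker(InMemoryDB(), log_dir="/tmp/macha-demo/logs")
iid = tracker.create_issue("rhiannon", "gotify service failing", "unit restarts in a loop")
tracker.update_issue(iid, investigation={"finding": "OOM kill visible in journal"})
tracker.resolve_issue(iid, "Raised MemoryMax for the unit")
tracker.close_issue(iid)  # archives to closed_issues.jsonl, then deletes from the DB
```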
358
journal_monitor.py
Normal file
@@ -0,0 +1,358 @@
#!/usr/bin/env python3
"""
Journal Monitor - Monitor remote systems via centralized journald
"""

import json
import subprocess
from typing import Dict, List, Any, Optional, Set
from datetime import datetime, timedelta
from pathlib import Path
from collections import defaultdict


class JournalMonitor:
    """Monitor systems via centralized journald logs"""

    def __init__(self, domain: str = "coven.systems"):
        """
        Initialize journal monitor

        Args:
            domain: Domain suffix for FQDNs
        """
        self.domain = domain
        self.known_hosts: Set[str] = set()

    def _run_journalctl(self, args: List[str], timeout: int = 30) -> tuple[bool, str, str]:
        """
        Run journalctl command

        Args:
            args: Arguments to journalctl
            timeout: Timeout in seconds

        Returns:
            (success, stdout, stderr)
        """
        try:
            cmd = ["journalctl"] + args

            result = subprocess.run(
                cmd,
                capture_output=True,
                text=True,
                timeout=timeout
            )

            return (
                result.returncode == 0,
                result.stdout.strip(),
                result.stderr.strip()
            )

        except subprocess.TimeoutExpired:
            return False, "", f"Command timed out after {timeout}s"
        except Exception as e:
            return False, "", str(e)

    def discover_hosts(self) -> List[str]:
        """
        Discover hosts reporting to centralized journal

        Returns:
            List of discovered FQDNs
        """
        success, output, _ = self._run_journalctl([
            "--output=json",
            "--since=1 day ago",
            "-n", "10000"
        ])

        if not success:
            return []

        hosts = set()
        for line in output.split('\n'):
            if not line.strip():
                continue
            try:
                entry = json.loads(line)
                hostname = entry.get('_HOSTNAME', '')

                # Ensure FQDN format
                if hostname and not hostname.endswith(f'.{self.domain}'):
                    if '.' not in hostname:
                        hostname = f"{hostname}.{self.domain}"

                if hostname:
                    hosts.add(hostname)

            except json.JSONDecodeError:
                continue

        self.known_hosts = hosts
        return sorted(hosts)

    def collect_resources(self, hostname: str, since: str = "5 minutes ago") -> Dict[str, Any]:
        """
        Collect resource usage from journal entries

        This extracts CPU/memory info from systemd service messages
        """
        # For now, return empty - we'll primarily use this for service/log monitoring
        # Resource metrics could be added if systems log them
        return {
            "cpu_percent": 0,
            "memory_percent": 0,
            "load_average": {"1min": 0, "5min": 0, "15min": 0}
        }

    def collect_systemd_status(self, hostname: str, since: str = "5 minutes ago") -> Dict[str, Any]:
        """
        Collect systemd service status from journal

        Args:
            hostname: FQDN of the system
            since: Time range to check

        Returns:
            Dictionary with failed service information
        """
        # Query for systemd service failures
        success, output, _ = self._run_journalctl([
            f"_HOSTNAME={hostname}",
            "--priority=err",
            "--unit=*.service",
            f"--since={since}",
            "--output=json"
        ])

        if not success:
            return {"failed_count": 0, "failed_services": []}

        failed_services = {}
        for line in output.split('\n'):
            if not line.strip():
                continue
            try:
                entry = json.loads(line)
                unit = entry.get('_SYSTEMD_UNIT', '')
                if unit and unit.endswith('.service'):
                    service_name = unit.replace('.service', '')
                    if service_name not in failed_services:
                        failed_services[service_name] = {
                            "unit": unit,
                            "message": entry.get('MESSAGE', ''),
                            "timestamp": entry.get('__REALTIME_TIMESTAMP', '')
                        }
            except json.JSONDecodeError:
                continue

        return {
            "failed_count": len(failed_services),
            "failed_services": list(failed_services.values())
        }

    def collect_log_errors(self, hostname: str, since: str = "1 hour ago") -> Dict[str, Any]:
        """
        Collect error logs from journal

        Args:
            hostname: FQDN of the system
            since: Time range to check

        Returns:
            Dictionary with error log information
        """
        success, output, _ = self._run_journalctl([
            f"_HOSTNAME={hostname}",
            "--priority=err",
            f"--since={since}",
            "--output=json"
        ])

        if not success:
            return {"error_count_1h": 0, "recent_errors": []}

        errors = []
        error_count = 0

        for line in output.split('\n'):
            if not line.strip():
                continue
            try:
                entry = json.loads(line)
                error_count += 1

                if len(errors) < 10:  # Keep last 10 errors
                    errors.append({
                        "message": entry.get('MESSAGE', ''),
                        "unit": entry.get('_SYSTEMD_UNIT', 'unknown'),
                        "priority": entry.get('PRIORITY', ''),
                        "timestamp": entry.get('__REALTIME_TIMESTAMP', '')
                    })

            except json.JSONDecodeError:
                continue

        return {
            "error_count_1h": error_count,
            "recent_errors": errors
        }

    def collect_disk_usage(self, hostname: str) -> Dict[str, Any]:
        """
        Collect disk usage - Note: This would require systems to log disk metrics
        For now, returns empty. Could be enhanced if systems periodically log disk usage
        """
        return {"partitions": []}

    def collect_network_status(self, hostname: str, since: str = "5 minutes ago") -> Dict[str, Any]:
        """
        Check network connectivity based on recent journal activity

        If we see recent logs from a host, it's reachable
        """
        success, output, _ = self._run_journalctl([
            f"_HOSTNAME={hostname}",
            f"--since={since}",
            "-n", "1",
            "--output=json"
        ])

        # If we got recent logs, network is working
        internet_reachable = bool(success and output.strip())

        return {
            "internet_reachable": internet_reachable,
            "last_seen": datetime.now().isoformat() if internet_reachable else None
        }

    def collect_all(self, hostname: str) -> Dict[str, Any]:
        """
        Collect all monitoring data for a host from journal

        Args:
            hostname: FQDN of the system to monitor

        Returns:
            Complete monitoring data
        """
        # First check if we have recent logs from this host
        net_status = self.collect_network_status(hostname)

        if not net_status.get("internet_reachable"):
            return {
                "hostname": hostname,
                "reachable": False,
                "error": "No recent journal entries from this host"
            }

        return {
            "hostname": hostname,
            "reachable": True,
            "source": "journal",
            "resources": self.collect_resources(hostname),
            "systemd": self.collect_systemd_status(hostname),
            "disk": self.collect_disk_usage(hostname),
            "network": net_status,
            "logs": self.collect_log_errors(hostname),
        }

    def get_summary(self, data: Dict[str, Any]) -> str:
        """Generate human-readable summary from journal data"""
        hostname = data.get("hostname", "unknown")

        if not data.get("reachable", False):
            return f"❌ {hostname}: {data.get('error', 'Unreachable')}"

        lines = [f"System: {hostname} (via journal)"]

        # Services
        systemd = data.get("systemd", {})
        failed_count = systemd.get("failed_count", 0)
        if failed_count > 0:
            lines.append(f"Services: {failed_count} failed")
            for svc in systemd.get("failed_services", [])[:3]:
                lines.append(f"  - {svc.get('unit', 'unknown')}")
        else:
            lines.append("Services: No recent failures")

        # Network
        net = data.get("network", {})
        last_seen = net.get("last_seen")
        if last_seen:
            lines.append(f"Last seen: {last_seen}")

        # Logs
        logs = data.get("logs", {})
        error_count = logs.get("error_count_1h", 0)
        if error_count > 0:
            lines.append(f"Recent logs: {error_count} errors in last hour")

        return "\n".join(lines)

    def get_active_services(self, hostname: str, since: str = "1 hour ago") -> List[str]:
        """
        Get list of active services on a host by looking at journal entries

        This helps with auto-discovery of what's running on each system
        """
        success, output, _ = self._run_journalctl([
            f"_HOSTNAME={hostname}",
            f"--since={since}",
            "--output=json",
            "-n", "1000"
        ])

        if not success:
            return []

        services = set()
        for line in output.split('\n'):
            if not line.strip():
                continue
            try:
                entry = json.loads(line)
                unit = entry.get('_SYSTEMD_UNIT', '')
                if unit and unit.endswith('.service'):
                    # Extract service name
                    service = unit.replace('.service', '')
                    # Filter out common system services, focus on application services
                    if service not in ['systemd-journald', 'systemd-logind', 'sshd', 'dbus']:
                        services.add(service)
            except json.JSONDecodeError:
                continue

        return sorted(services)


if __name__ == "__main__":
    import sys

    monitor = JournalMonitor()

    # Discover hosts
    print("Discovering hosts from journal...")
    hosts = monitor.discover_hosts()
    print(f"Found {len(hosts)} hosts:")
    for host in hosts:
        print(f"  - {host}")

    # Monitor first host if available
    if hosts:
        hostname = hosts[0]
        print(f"\nMonitoring {hostname}...")
        data = monitor.collect_all(hostname)

        print("\n" + "="*60)
        print(monitor.get_summary(data))
        print("="*60)

        # Discover services
        print(f"\nActive services on {hostname}:")
        services = monitor.get_active_services(hostname)
        for svc in services[:10]:
            print(f"  - {svc}")
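For reference, this is roughly what one `journalctl --output=json` line looks like to the parsers above. The field values below are invented for illustration, but `_HOSTNAME`, `_SYSTEMD_UNIT`, `MESSAGE`, `PRIORITY`, and `__REALTIME_TIMESTAMP` are standard journald JSON fields:

```python
import json

line = ('{"_HOSTNAME": "rhiannon", "_SYSTEMD_UNIT": "gotify.service", '
        '"MESSAGE": "connection refused", "PRIORITY": "3", '
        '"__REALTIME_TIMESTAMP": "1735689600000000"}')

entry = json.loads(line)
# discover_hosts() would normalize "rhiannon" to "rhiannon.coven.systems";
# collect_systemd_status() would record gotify as a failing service.
print(entry["_HOSTNAME"], entry["_SYSTEMD_UNIT"], entry["MESSAGE"])
```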
847
module.nix
Normal file
847
module.nix
Normal file
@@ -0,0 +1,847 @@
{ config, lib, pkgs, ... }:

with lib;

let
  cfg = config.services.macha-autonomous;

  # Python environment with all dependencies
  pythonEnv = pkgs.python3.withPackages (ps: with ps; [
    requests
    psutil
    chromadb
  ]);

  # Main autonomous system package
  macha-autonomous = pkgs.writeScriptBin "macha-autonomous" ''
    #!${pythonEnv}/bin/python3
    import sys
    sys.path.insert(0, "${./.}")
    from orchestrator import main
    main()
  '';

  # Config file
  configFile = pkgs.writeText "macha-autonomous-config.json" (builtins.toJSON {
    check_interval = cfg.checkInterval;
    autonomy_level = cfg.autonomyLevel;
    ollama_host = cfg.ollamaHost;
    model = cfg.model;
    config_repo = cfg.configRepo;
    config_branch = cfg.configBranch;
  });
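
  # For illustration only: with the module defaults above, builtins.toJSON
  # (which emits attribute names in sorted order) renders a config file along
  # these lines; the actual values track whatever cfg.* is set to:
  #
  #   {"autonomy_level":"suggest","check_interval":300,"config_branch":"main",
  #    "config_repo":"git+https://git.coven.systems/lily/nixos-servers",
  #    "model":"llama3.1:70b","ollama_host":"http://localhost:11434"}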

in {
  options.services.macha-autonomous = {
    enable = mkEnableOption "Macha autonomous system maintenance";

    autonomyLevel = mkOption {
      type = types.enum [ "observe" "suggest" "auto-safe" "auto-full" ];
      default = "suggest";
      description = ''
        Level of autonomy for the system:
        - observe: Only monitor and log, take no actions
        - suggest: Propose actions, require manual approval
        - auto-safe: Auto-execute low-risk actions (restarts, cleanup)
        - auto-full: Full autonomy with safety limits (still requires approval for high-risk actions)
      '';
    };

    checkInterval = mkOption {
      type = types.int;
      default = 300;
      description = "Interval in seconds between system checks";
    };

    ollamaHost = mkOption {
      type = types.str;
      default = "http://localhost:11434";
      description = "Ollama API host";
    };

    model = mkOption {
      type = types.str;
      default = "llama3.1:70b";
      description = "LLM model to use for reasoning";
    };

    user = mkOption {
      type = types.str;
      default = "macha";
      description = "User to run the autonomous system as";
    };

    group = mkOption {
      type = types.str;
      default = "macha";
      description = "Group to run the autonomous system as";
    };

    gotifyUrl = mkOption {
      type = types.str;
      default = "";
      example = "http://rhiannon:8181";
      description = "Gotify server URL for notifications (empty to disable)";
    };

    gotifyToken = mkOption {
      type = types.str;
      default = "";
      description = "Gotify application token for notifications";
    };

    remoteSystems = mkOption {
      type = types.listOf types.str;
      default = [];
      example = [ "rhiannon" "alexander" ];
      description = "List of remote NixOS systems to monitor and maintain";
    };

    configRepo = mkOption {
      type = types.str;
      default = if config.programs.nh.flake != null
        then config.programs.nh.flake
        else "git+https://git.coven.systems/lily/nixos-servers";
      description = "URL of the NixOS configuration repository (auto-detected from programs.nh.flake if available)";
    };

    configBranch = mkOption {
      type = types.str;
      default = "main";
      description = "Branch of the NixOS configuration repository";
    };
  };
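
  # A minimal consumer sketch (hypothetical host configuration, assuming this
  # flake's module is imported elsewhere); only options declared above are used:
  #
  #   services.macha-autonomous = {
  #     enable = true;
  #     autonomyLevel = "auto-safe";
  #     remoteSystems = [ "rhiannon" "alexander" ];
  #     gotifyUrl = "http://rhiannon:8181";
  #   };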

  config = mkIf cfg.enable {
    # Create user and group
    users.users.${cfg.user} = {
      isSystemUser = true;
      group = cfg.group;
      uid = 2501;
      description = "Macha autonomous system maintenance";
      home = "/var/lib/macha";
      createHome = true;
    };

    users.groups.${cfg.group} = {};

    # Git configuration for credential storage
    programs.git = {
      enable = true;
      config = {
        credential.helper = "store";
      };
    };

    # Ollama service for AI inference
    services.ollama = {
      enable = true;
      acceleration = "rocm";
      host = "0.0.0.0";
      port = 11434;
      environmentVariables = {
        "OLLAMA_DEBUG" = "1";
        "OLLAMA_KEEP_ALIVE" = "600";
        "OLLAMA_NEW_ENGINE" = "true";
        "OLLAMA_CONTEXT_LENGTH" = "131072";
      };
      openFirewall = false; # Keep internal only
      loadModels = [
        "qwen3"
        "gpt-oss"
        "gemma3"
        "gpt-oss:20b"
        "qwen3:4b-instruct-2507-fp16"
        "qwen3:8b-fp16"
        "mistral:7b"
        "chroma/all-minilm-l6-v2-f32:latest"
      ];
    };

    # ChromaDB service for vector storage
    services.chromadb = {
      enable = true;
      port = 8000;
      dbpath = "/var/lib/chromadb";
    };

    # Give the user the permissions it needs
    security.sudo.extraRules = [{
      users = [ cfg.user ];
      commands = [
        # Local system management
        { command = "${pkgs.systemd}/bin/systemctl restart *"; options = [ "NOPASSWD" ]; }
        { command = "${pkgs.systemd}/bin/systemctl status *"; options = [ "NOPASSWD" ]; }
        { command = "${pkgs.systemd}/bin/journalctl *"; options = [ "NOPASSWD" ]; }
        { command = "${pkgs.nix}/bin/nix-collect-garbage *"; options = [ "NOPASSWD" ]; }
        # Remote system access (uses existing root SSH keys)
        { command = "${pkgs.openssh}/bin/ssh *"; options = [ "NOPASSWD" ]; }
        { command = "${pkgs.openssh}/bin/scp *"; options = [ "NOPASSWD" ]; }
        { command = "${pkgs.nixos-rebuild}/bin/nixos-rebuild *"; options = [ "NOPASSWD" ]; }
      ];
    }];

    # Config file
    environment.etc."macha-autonomous/config.json".source = configFile;

    # State directory and queue directories (world-writable queues for multi-user access)
    # Using 'z' to set permissions even if the directory already exists
    systemd.tmpfiles.rules = [
      "d /var/lib/macha 0755 ${cfg.user} ${cfg.group} -"
      "z /var/lib/macha 0755 ${cfg.user} ${cfg.group} -" # Ensure permissions are set
      "d /var/lib/macha/queues 0777 ${cfg.user} ${cfg.group} -"
      "d /var/lib/macha/queues/ollama 0777 ${cfg.user} ${cfg.group} -"
      "d /var/lib/macha/queues/ollama/pending 0777 ${cfg.user} ${cfg.group} -"
      "d /var/lib/macha/queues/ollama/processing 0777 ${cfg.user} ${cfg.group} -"
      "d /var/lib/macha/queues/ollama/completed 0777 ${cfg.user} ${cfg.group} -"
      "d /var/lib/macha/queues/ollama/failed 0777 ${cfg.user} ${cfg.group} -"
      "d /var/lib/macha/tool_cache 0777 ${cfg.user} ${cfg.group} -"
    ];
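
    # Request lifecycle through these directories (see ollama_queue.py below):
    # a request JSON is written to pending/, moved to processing/ while the
    # worker runs it, and ends up in completed/ or failed/. The 0777 mode is
    # what lets tools like macha-chat, running as the invoking user, enqueue
    # requests alongside the macha user's daemon.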

    # Systemd service
    systemd.services.macha-autonomous = {
      description = "Macha Autonomous System Maintenance";
      after = [ "network.target" "ollama.service" ];
      wants = [ "ollama.service" ];
      wantedBy = [ "multi-user.target" ];

      serviceConfig = {
        Type = "simple";
        User = cfg.user;
        Group = cfg.group;
        WorkingDirectory = "/var/lib/macha";
        ExecStart = "${macha-autonomous}/bin/macha-autonomous --mode continuous --autonomy ${cfg.autonomyLevel} --interval ${toString cfg.checkInterval}";
        Restart = "on-failure";
        RestartSec = "30s";

        # Security hardening
        PrivateTmp = true;
        NoNewPrivileges = false; # Needs privileges for sudo
        ProtectSystem = "strict";
        ProtectHome = true;
        ReadWritePaths = [ "/var/lib/macha" "/var/lib/macha/tool_cache" "/var/lib/macha/queues" ];

        # Resource limits
        MemoryLimit = "1G";
        CPUQuota = "50%";
      };

      environment = {
        PYTHONPATH = toString ./.;
        GOTIFY_URL = cfg.gotifyUrl;
        GOTIFY_TOKEN = cfg.gotifyToken;
        CHROMA_ENV_FILE = ""; # Prevent ChromaDB from trying to read .env files
        ANONYMIZED_TELEMETRY = "False"; # Disable ChromaDB telemetry
      };

      path = [ pkgs.git ]; # Make git available for config parsing
    };

    # Ollama Queue Worker Service (serializes all Ollama requests)
    systemd.services.ollama-queue-worker = {
      description = "Macha Ollama Queue Worker";
      after = [ "network.target" "ollama.service" ];
      wants = [ "ollama.service" ];
      wantedBy = [ "multi-user.target" ];

      serviceConfig = {
        Type = "simple";
        User = cfg.user;
        Group = cfg.group;
        WorkingDirectory = "/var/lib/macha";
        ExecStart = "${pythonEnv}/bin/python3 ${./.}/ollama_worker.py";
        Restart = "on-failure";
        RestartSec = "10s";

        # Security hardening
        PrivateTmp = true;
        NoNewPrivileges = true;
        ProtectSystem = "strict";
        ProtectHome = true;
        ReadWritePaths = [ "/var/lib/macha/queues" "/var/lib/macha/tool_cache" ];

        # Resource limits
        MemoryLimit = "512M";
        CPUQuota = "25%";
      };

      environment = {
        PYTHONPATH = toString ./.;
        CHROMA_ENV_FILE = "";
        ANONYMIZED_TELEMETRY = "False";
      };
    };

    # CLI tools for manual control and system packages
    environment.systemPackages = with pkgs; [
      macha-autonomous
      # Python packages for ChromaDB
      python313
      python313Packages.pip
      python313Packages.chromadb.pythonModule

      # Tool to check the approval queue
      (pkgs.writeScriptBin "macha-approve" ''
        #!${pkgs.bash}/bin/bash
        if [ "$1" == "list" ]; then
          sudo -u ${cfg.user} ${pythonEnv}/bin/python3 ${./.}/executor.py queue
        elif [ "$1" == "discuss" ] && [ -n "$2" ]; then
          ACTION_ID="$2"
          echo "==================================================================="
          echo "Interactive Discussion with Macha about Action #$ACTION_ID"
          echo "==================================================================="
          echo ""

          # Initial explanation
          sudo -u ${cfg.user} ${pkgs.coreutils}/bin/env CHROMA_ENV_FILE="" ANONYMIZED_TELEMETRY="False" ${pythonEnv}/bin/python3 ${./.}/conversation.py --discuss "$ACTION_ID"

          echo ""
          echo "==================================================================="
          echo "You can now ask follow-up questions about this action."
          echo "Type 'approve' to approve it, 'reject' to reject it, or 'exit' to quit."
          echo "==================================================================="

          # Interactive loop
          while true; do
            echo ""
            echo -n "You: "
            read -r USER_INPUT

            # Check for special commands
            if [ "$USER_INPUT" = "exit" ] || [ "$USER_INPUT" = "quit" ] || [ -z "$USER_INPUT" ]; then
              echo "Exiting discussion."
              break
            elif [ "$USER_INPUT" = "approve" ]; then
              echo "Approving action #$ACTION_ID..."
              sudo -u ${cfg.user} ${pythonEnv}/bin/python3 ${./.}/executor.py approve "$ACTION_ID"
              break
            elif [ "$USER_INPUT" = "reject" ]; then
              echo "Rejecting and removing action #$ACTION_ID from queue..."
              sudo -u ${cfg.user} ${pythonEnv}/bin/python3 ${./.}/executor.py reject "$ACTION_ID"
              break
            fi

            # Ask Macha the follow-up question in the context of the action
            echo ""
            echo -n "Macha: "
            sudo -u ${cfg.user} ${pkgs.coreutils}/bin/env CHROMA_ENV_FILE="" ANONYMIZED_TELEMETRY="False" ${pythonEnv}/bin/python3 ${./.}/conversation.py --discuss "$ACTION_ID" --follow-up "$USER_INPUT"
            echo ""
          done
        elif [ "$1" == "approve" ] && [ -n "$2" ]; then
          sudo -u ${cfg.user} ${pythonEnv}/bin/python3 ${./.}/executor.py approve "$2"
        elif [ "$1" == "reject" ] && [ -n "$2" ]; then
          sudo -u ${cfg.user} ${pythonEnv}/bin/python3 ${./.}/executor.py reject "$2"
        else
          echo "Usage:"
          echo "  macha-approve list        - Show pending actions"
          echo "  macha-approve discuss <N> - Discuss action number N with Macha (interactive)"
          echo "  macha-approve approve <N> - Approve action number N"
          echo "  macha-approve reject <N>  - Reject and remove action number N from queue"
        fi
      '')

      # Tool to run a manual check
      (pkgs.writeScriptBin "macha-check" ''
        #!${pkgs.bash}/bin/bash
        sudo -u ${cfg.user} sh -c 'cd /var/lib/macha && CHROMA_ENV_FILE="" ANONYMIZED_TELEMETRY="False" ${macha-autonomous}/bin/macha-autonomous --mode once --autonomy ${cfg.autonomyLevel}'
      '')

      # Tool to view logs
      (pkgs.writeScriptBin "macha-logs" ''
        #!${pkgs.bash}/bin/bash
        case "$1" in
          orchestrator)
            sudo tail -f /var/lib/macha/orchestrator.log
            ;;
          decisions)
            sudo tail -f /var/lib/macha/decisions.jsonl
            ;;
          actions)
            sudo tail -f /var/lib/macha/actions.jsonl
            ;;
          service)
            journalctl -u macha-autonomous.service -f
            ;;
          *)
            echo "Usage: macha-logs [orchestrator|decisions|actions|service]"
            ;;
        esac
      '')

      # Tool to send a test notification
      (pkgs.writeScriptBin "macha-notify" ''
        #!${pkgs.bash}/bin/bash
        if [ -z "$1" ] || [ -z "$2" ]; then
          echo "Usage: macha-notify <title> <message> [priority]"
          echo "Example: macha-notify 'Test' 'This is a test' 5"
          echo "Priorities: 2 (low), 5 (medium), 8 (high)"
          exit 1
        fi

        export GOTIFY_URL="${cfg.gotifyUrl}"
        export GOTIFY_TOKEN="${cfg.gotifyToken}"

        ${pythonEnv}/bin/python3 ${./.}/notifier.py "$1" "$2" "''${3:-5}"
      '')

      # Tool to query config files
      (pkgs.writeScriptBin "macha-configs" ''
        #!${pkgs.bash}/bin/bash
        export PYTHONPATH=${toString ./.}
        export CHROMA_ENV_FILE=""
        export ANONYMIZED_TELEMETRY="False"

        if [ $# -eq 0 ]; then
          echo "Usage: macha-configs <search-query> [system-name]"
          echo "Examples:"
          echo "  macha-configs gotify"
          echo "  macha-configs 'journald configuration'"
          echo "  macha-configs ollama macha.coven.systems"
          exit 1
        fi

        QUERY="$1"
        SYSTEM="''${2:-}"

        ${pythonEnv}/bin/python3 -c "
        from context_db import ContextDatabase
        import sys

        db = ContextDatabase()
        query = sys.argv[1]
        system = sys.argv[2] if len(sys.argv) > 2 else None

        print(f'Searching for: {query}')
        if system:
            print(f'Filtered to system: {system}')
        print('='*60)

        configs = db.query_config_files(query, system=system, n_results=5)

        if not configs:
            print('No matching configuration files found.')
        else:
            for i, cfg in enumerate(configs, 1):
                print(f\"\\n{i}. {cfg['path']} (relevance: {cfg['relevance']:.1%})\")
                print(f\"   Category: {cfg['metadata']['category']}\")
                print('   Preview:')
                preview = cfg['content'][:300].replace('\\n', '\\n   ')
                print(f'   {preview}')
                if len(cfg['content']) > 300:
                    print('   ... (use macha-configs-read to see full file)')
        " "$QUERY" "$SYSTEM"
      '')

      # Interactive chat tool (runs as the invoking user, not as macha-autonomous)
      (pkgs.writeScriptBin "macha-chat" ''
        #!${pkgs.bash}/bin/bash
        export PYTHONPATH=${toString ./.}
        export CHROMA_ENV_FILE=""
        export ANONYMIZED_TELEMETRY="False"

        # Run as the current user, not as macha-autonomous
        # This allows the chat to execute privileged commands with the user's permissions
        ${pythonEnv}/bin/python3 ${./.}/chat.py
      '')

      # Tool to read a full config file
      (pkgs.writeScriptBin "macha-configs-read" ''
        #!${pkgs.bash}/bin/bash
        export PYTHONPATH=${toString ./.}
        export CHROMA_ENV_FILE=""
        export ANONYMIZED_TELEMETRY="False"

        if [ $# -eq 0 ]; then
          echo "Usage: macha-configs-read <file-path>"
          echo "Example: macha-configs-read apps/gotify.nix"
          exit 1
        fi

        ${pythonEnv}/bin/python3 -c "
        from context_db import ContextDatabase
        import sys

        db = ContextDatabase()
        file_path = sys.argv[1]

        cfg = db.get_config_file(file_path)

        if not cfg:
            print(f'Config file not found: {file_path}')
            sys.exit(1)

        print(f'File: {cfg[\"path\"]}')
        print(f'Category: {cfg[\"metadata\"][\"category\"]}')
        print('='*60)
        print(cfg['content'])
        " "$1"
      '')

      # Tool to view the system registry
      (pkgs.writeScriptBin "macha-systems" ''
        #!${pkgs.bash}/bin/bash
        export PYTHONPATH=${toString ./.}
        export CHROMA_ENV_FILE=""
        export ANONYMIZED_TELEMETRY="False"
        ${pythonEnv}/bin/python3 -c "
        from context_db import ContextDatabase

        db = ContextDatabase()
        systems = db.get_all_systems()

        print('Registered Systems:')
        print('='*60)
        for system in systems:
            os_type = system.get('os_type', 'unknown').upper()
            print(f\"\\n{system['hostname']} ({system['type']}) [{os_type}]\")
            print(f\"  Config Repo: {system.get('config_repo') or '(not set)'}\")
            print(f\"  Branch: {system.get('config_branch', 'unknown')}\")
            if system.get('services'):
                print(f\"  Services: {', '.join(system['services'][:10])}\")
                if len(system['services']) > 10:
                    print(f\"    ... and {len(system['services']) - 10} more\")
            if system.get('capabilities'):
                print(f\"  Capabilities: {', '.join(system['capabilities'])}\")
        print('='*60)
        "
      '')

      # Tool to ask Macha questions
      (pkgs.writeScriptBin "macha-ask" ''
        #!${pkgs.bash}/bin/bash
        if [ $# -eq 0 ]; then
          echo "Usage: macha-ask <your question>"
          echo "Example: macha-ask Why did you recommend restarting that service?"
          exit 1
        fi
        sudo -u ${cfg.user} ${pkgs.coreutils}/bin/env CHROMA_ENV_FILE="" ANONYMIZED_TELEMETRY="False" ${pythonEnv}/bin/python3 ${./.}/conversation.py "$@"
      '')

      # Issue tracking CLI
      (pkgs.writeScriptBin "macha-issues" ''
        #!${pythonEnv}/bin/python3
        import sys
        import os
        os.environ["CHROMA_ENV_FILE"] = ""
        os.environ["ANONYMIZED_TELEMETRY"] = "False"
        sys.path.insert(0, "${./.}")

        from context_db import ContextDatabase
        from issue_tracker import IssueTracker
        from datetime import datetime

        db = ContextDatabase()
        tracker = IssueTracker(db)

        def list_issues(show_all=False):
            """List issues"""
            if show_all:
                issues = tracker.list_issues()
            else:
                issues = tracker.list_issues(status="open")

            if not issues:
                print("No issues found")
                return

            print("="*70)
            print(f"ISSUES: {len(issues)}")
            print("="*70)

            for issue in issues:
                issue_id = issue['issue_id'][:8]
                age_hours = (datetime.utcnow() - datetime.fromisoformat(issue['created_at'])).total_seconds() / 3600
                inv_count = len(issue.get('investigations', []))
                action_count = len(issue.get('actions', []))

                print(f"\n[{issue_id}] {issue['title']}")
                print(f"  Host: {issue['hostname']}")
                print(f"  Status: {issue['status'].upper()} | Severity: {issue['severity'].upper()}")
                print(f"  Age: {age_hours:.1f}h | Activity: {inv_count} investigations, {action_count} actions")
                print(f"  Source: {issue['source']}")
                if issue.get('resolution'):
                    print(f"  Resolution: {issue['resolution']}")

        def show_issue(issue_id):
            """Show detailed issue information"""
            # Find the issue by partial ID
            all_issues = tracker.list_issues()
            matching = [i for i in all_issues if i['issue_id'].startswith(issue_id)]

            if not matching:
                print(f"Issue {issue_id} not found")
                return

            issue = matching[0]
            full_id = issue['issue_id']

            print("="*70)
            print(f"ISSUE: {issue['title']}")
            print("="*70)
            print(f"ID: {full_id}")
            print(f"Host: {issue['hostname']}")
            print(f"Status: {issue['status'].upper()}")
            print(f"Severity: {issue['severity'].upper()}")
            print(f"Source: {issue['source']}")
            print(f"Created: {issue['created_at']}")
            print(f"Updated: {issue['updated_at']}")
            print(f"\nDescription:\n{issue['description']}")

            investigations = issue.get('investigations', [])
            if investigations:
                print(f"\n{'─'*70}")
                print(f"INVESTIGATIONS ({len(investigations)}):")
                for i, inv in enumerate(investigations, 1):
                    print(f"\n  [{i}] {inv.get('timestamp', 'N/A')}")
                    print(f"      Diagnosis: {inv.get('diagnosis', 'N/A')}")
                    print(f"      Commands: {', '.join(inv.get('commands', []))}")
                    print(f"      Success: {inv.get('success', False)}")
                    if inv.get('output'):
                        print(f"      Output: {inv['output'][:200]}...")

            actions = issue.get('actions', [])
            if actions:
                print(f"\n{'─'*70}")
                print(f"ACTIONS ({len(actions)}):")
                for i, action in enumerate(actions, 1):
                    print(f"\n  [{i}] {action.get('timestamp', 'N/A')}")
                    print(f"      Action: {action.get('proposed_action', 'N/A')}")
                    print(f"      Risk: {action.get('risk_level', 'N/A').upper()}")
                    print(f"      Commands: {', '.join(action.get('commands', []))}")
                    print(f"      Success: {action.get('success', False)}")

            if issue.get('resolution'):
                print(f"\n{'─'*70}")
                print("RESOLUTION:")
                print(f"  {issue['resolution']}")

            print("="*70)

        def create_issue(description):
            """Create a new issue manually"""
            import socket
            hostname = f"{socket.gethostname()}.coven.systems"

            issue_id = tracker.create_issue(
                hostname=hostname,
                title=description[:100],
                description=description,
                severity="medium",
                source="user-reported"
            )

            print(f"Created issue: {issue_id[:8]}")
            print(f"Title: {description[:100]}")

        def resolve_issue(issue_id, resolution="Manually resolved"):
            """Mark an issue as resolved"""
            # Find the issue by partial ID
            all_issues = tracker.list_issues()
            matching = [i for i in all_issues if i['issue_id'].startswith(issue_id)]

            if not matching:
                print(f"Issue {issue_id} not found")
                return

            full_id = matching[0]['issue_id']
            success = tracker.resolve_issue(full_id, resolution)

            if success:
                print(f"Resolved issue {issue_id[:8]}")
            else:
                print(f"Failed to resolve issue {issue_id}")

        def close_issue(issue_id):
            """Archive a resolved issue"""
            # Find the issue by partial ID
            all_issues = tracker.list_issues()
            matching = [i for i in all_issues if i['issue_id'].startswith(issue_id)]

            if not matching:
                print(f"Issue {issue_id} not found")
                return

            full_id = matching[0]['issue_id']

            if matching[0]['status'] != 'resolved':
                print(f"Issue {issue_id} must be resolved before closing")
                print(f"Use: macha-issues resolve {issue_id}")
                return

            success = tracker.close_issue(full_id)

            if success:
                print(f"Closed and archived issue {issue_id[:8]}")
            else:
                print(f"Failed to close issue {issue_id}")

        # Main CLI
        if len(sys.argv) < 2:
            print("Usage: macha-issues <command> [options]")
            print("")
            print("Commands:")
            print("  list            List open issues")
            print("  list --all      List all issues (including resolved/closed)")
            print("  show <id>       Show detailed issue information")
            print("  create <desc>   Create a new issue manually")
            print("  resolve <id>    Mark issue as resolved")
            print("  close <id>      Archive a resolved issue")
            sys.exit(1)

        command = sys.argv[1]

        if command == "list":
            show_all = "--all" in sys.argv
            list_issues(show_all)
        elif command == "show" and len(sys.argv) >= 3:
            show_issue(sys.argv[2])
        elif command == "create" and len(sys.argv) >= 3:
            description = " ".join(sys.argv[2:])
            create_issue(description)
        elif command == "resolve" and len(sys.argv) >= 3:
            resolution = " ".join(sys.argv[3:]) if len(sys.argv) > 3 else "Manually resolved"
            resolve_issue(sys.argv[2], resolution)
        elif command == "close" and len(sys.argv) >= 3:
            close_issue(sys.argv[2])
        else:
            print(f"Unknown command: {command}")
            sys.exit(1)
      '')

      # Knowledge base CLI
      (pkgs.writeScriptBin "macha-knowledge" ''
        #!${pythonEnv}/bin/python3
        import sys
        import os
        os.environ["CHROMA_ENV_FILE"] = ""
        os.environ["ANONYMIZED_TELEMETRY"] = "False"
        sys.path.insert(0, "${./.}")

        from context_db import ContextDatabase

        db = ContextDatabase()

        def list_topics(category=None):
            """List all knowledge topics"""
            topics = db.list_knowledge_topics(category)
            if not topics:
                print("No knowledge topics found.")
                return

            print(f"{'='*70}")
            if category:
                print(f"KNOWLEDGE TOPICS ({category.upper()}):")
            else:
                print("KNOWLEDGE TOPICS:")
            print(f"{'='*70}")

            for topic in topics:
                print(f"  • {topic}")

            print(f"{'='*70}")

        def show_topic(topic):
            """Show all knowledge for a topic"""
            items = db.get_knowledge_by_topic(topic)
            if not items:
                print(f"No knowledge found for topic: {topic}")
                return

            print(f"{'='*70}")
            print(f"KNOWLEDGE: {topic}")
            print(f"{'='*70}\n")

            for item in items:
                print(f"ID: {item['id'][:8]}...")
                print(f"Category: {item['category']}")
                print(f"Source: {item['source']}")
                print(f"Confidence: {item['confidence']}")
                print(f"Created: {item['created_at']}")
                print(f"Times Referenced: {item['times_referenced']}")
                if item.get('tags'):
                    print(f"Tags: {', '.join(item['tags'])}")
                print("\nKnowledge:")
                print(f"  {item['knowledge']}\n")
                print(f"{'-'*70}\n")

        def search_knowledge(query, category=None):
            """Search the knowledge base"""
            items = db.query_knowledge(query, category=category, limit=10)
            if not items:
                print(f"No knowledge found matching: {query}")
                return

            print(f"{'='*70}")
            print(f"SEARCH RESULTS: {query}")
            if category:
                print(f"Category Filter: {category}")
            print(f"{'='*70}\n")

            for i, item in enumerate(items, 1):
                print(f"[{i}] {item['topic']}")
                print(f"    Category: {item['category']} | Confidence: {item['confidence']}")
                print(f"    {item['knowledge'][:150]}...")
                print()

        def add_knowledge(topic, knowledge, category="general"):
            """Add new knowledge"""
            kid = db.store_knowledge(
                topic=topic,
                knowledge=knowledge,
                category=category,
                source="user-provided",
                confidence="high"
            )
            if kid:
                print(f"✓ Added knowledge for topic: {topic}")
                print(f"  ID: {kid[:8]}...")
            else:
                print("✗ Failed to add knowledge")

        def seed_initial():
            """Seed initial knowledge"""
            print("Seeding initial knowledge from seed_knowledge.py...")
            exec(open("${./.}/seed_knowledge.py").read())

        # Main CLI
        if len(sys.argv) < 2:
            print("Usage: macha-knowledge <command> [options]")
            print("")
            print("Commands:")
            print("  list                  List all knowledge topics")
            print("  list <category>       List topics in category")
            print("  show <topic>          Show all knowledge for a topic")
            print("  search <query>        Search knowledge base")
            print("  search <query> <cat>  Search in specific category")
            print("  add <topic> <text>    Add new knowledge")
            print("  seed                  Seed initial knowledge")
            print("")
            print("Categories: command, pattern, troubleshooting, performance, general")
            sys.exit(1)

        command = sys.argv[1]

        if command == "list":
            category = sys.argv[2] if len(sys.argv) >= 3 else None
            list_topics(category)
        elif command == "show" and len(sys.argv) >= 3:
            show_topic(sys.argv[2])
        elif command == "search" and len(sys.argv) >= 3:
            query = sys.argv[2]
            category = sys.argv[3] if len(sys.argv) >= 4 else None
            search_knowledge(query, category)
        elif command == "add" and len(sys.argv) >= 4:
            topic = sys.argv[2]
            knowledge = " ".join(sys.argv[3:])
            add_knowledge(topic, knowledge)
        elif command == "seed":
            seed_initial()
        else:
            print(f"Unknown command: {command}")
            sys.exit(1)
      '')
    ];
  };
}
291
monitor.py
Normal file
@@ -0,0 +1,291 @@
#!/usr/bin/env python3
"""
System Monitor - Collects health data from Macha
"""

import json
import subprocess
import psutil
import time
from datetime import datetime
from pathlib import Path
from typing import Dict, Any


class SystemMonitor:
    """Monitors system health and collects diagnostic data"""

    def __init__(self, state_dir: Path = Path("/var/lib/macha")):
        self.state_dir = state_dir
        self.state_dir.mkdir(parents=True, exist_ok=True)

    def collect_all(self) -> Dict[str, Any]:
        """Collect all system health data"""
        return {
            "timestamp": datetime.now().isoformat(),
            "systemd": self.check_systemd_services(),
            "resources": self.check_resources(),
            "disk": self.check_disk_usage(),
            "logs": self.check_recent_errors(),
            "nixos": self.check_nixos_status(),
            "network": self.check_network(),
            "boot": self.check_boot_status(),
        }
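
    # Illustrative shape of the returned snapshot (the keys are the real ones
    # above; the values are made-up examples):
    #
    #   {"timestamp": "2024-01-01T12:00:00", "systemd": {"failed_count": 0, ...},
    #    "resources": {"cpu_percent": 3.2, ...}, "disk": {"partitions": [...]},
    #    "logs": {"error_count_1h": 0, ...}, "nixos": {...}, "network": {...},
    #    "boot": {...}}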

    def check_systemd_services(self) -> Dict[str, Any]:
        """Check the status of all systemd services"""
        try:
            # Get failed services
            result = subprocess.run(
                ["systemctl", "--failed", "--no-pager", "--output=json"],
                capture_output=True,
                text=True,
                timeout=10
            )

            failed_services = []
            if result.returncode == 0 and result.stdout:
                try:
                    failed_services = json.loads(result.stdout)
                except json.JSONDecodeError:
                    pass

            # Get the status of all services
            result = subprocess.run(
                ["systemctl", "list-units", "--type=service", "--no-pager", "--output=json"],
                capture_output=True,
                text=True,
                timeout=10
            )

            all_services = []
            if result.returncode == 0 and result.stdout:
                try:
                    all_services = json.loads(result.stdout)
                except json.JSONDecodeError:
                    pass

            return {
                "failed_count": len(failed_services),
                "failed_services": failed_services,
                "total_services": len(all_services),
                "active_services": [s for s in all_services if s.get("active") == "active"],
            }
        except Exception as e:
            return {"error": str(e)}

    def check_resources(self) -> Dict[str, Any]:
        """Check CPU, RAM, and system resources"""
        try:
            cpu_percent = psutil.cpu_percent(interval=1)
            memory = psutil.virtual_memory()
            load_avg = psutil.getloadavg()

            return {
                "cpu_percent": cpu_percent,
                "cpu_count": psutil.cpu_count(),
                "memory_percent": memory.percent,
                "memory_available_gb": memory.available / (1024**3),
                "memory_total_gb": memory.total / (1024**3),
                "load_average": {
                    "1min": load_avg[0],
                    "5min": load_avg[1],
                    "15min": load_avg[2],
                },
                "swap_percent": psutil.swap_memory().percent,
            }
        except Exception as e:
            return {"error": str(e)}

    def check_disk_usage(self) -> Dict[str, Any]:
        """Check disk usage for all mounted filesystems"""
        try:
            partitions = psutil.disk_partitions()
            disk_info = []

            for partition in partitions:
                try:
                    usage = psutil.disk_usage(partition.mountpoint)
                    disk_info.append({
                        "device": partition.device,
                        "mountpoint": partition.mountpoint,
                        "fstype": partition.fstype,
                        "percent_used": usage.percent,
                        "total_gb": usage.total / (1024**3),
                        "used_gb": usage.used / (1024**3),
                        "free_gb": usage.free / (1024**3),
                    })
                except PermissionError:
                    continue

            return {"partitions": disk_info}
        except Exception as e:
            return {"error": str(e)}

    def check_recent_errors(self) -> Dict[str, Any]:
        """Check recent system logs for errors"""
        try:
            # Get errors from the last hour
            result = subprocess.run(
                ["journalctl", "-p", "err", "--since", "1 hour ago", "--no-pager", "-o", "json"],
                capture_output=True,
                text=True,
                timeout=10
            )

            errors = []
            if result.returncode == 0 and result.stdout:
                for line in result.stdout.strip().split('\n'):
                    if line:
                        try:
                            errors.append(json.loads(line))
                        except json.JSONDecodeError:
                            continue

            return {
                "error_count_1h": len(errors),
                "recent_errors": errors[-50:],  # Last 50 errors
            }
        except Exception as e:
            return {"error": str(e)}

    def check_nixos_status(self) -> Dict[str, Any]:
        """Check NixOS generation and system info"""
        try:
            # Get the current version
            result = subprocess.run(
                ["nixos-version"],
                capture_output=True,
                text=True,
                timeout=5
            )
            version = result.stdout.strip() if result.returncode == 0 else "unknown"

            # Get the generation list
            result = subprocess.run(
                ["nix-env", "--list-generations", "-p", "/nix/var/nix/profiles/system"],
                capture_output=True,
                text=True,
                timeout=10
            )

            generations = result.stdout.strip() if result.returncode == 0 else ""

            return {
                "version": version,
                "generations": generations,
                "nix_store_size": self._get_nix_store_size(),
            }
        except Exception as e:
            return {"error": str(e)}

    def _get_nix_store_size(self) -> str:
        """Get the Nix store size"""
        try:
            result = subprocess.run(
                ["du", "-sh", "/nix/store"],
                capture_output=True,
                text=True,
                timeout=30
            )
            if result.returncode == 0:
                return result.stdout.split()[0]
        except Exception:
            pass
        return "unknown"

    def check_network(self) -> Dict[str, Any]:
        """Check network connectivity"""
        try:
            # Check if we can reach the internet
            result = subprocess.run(
                ["ping", "-c", "1", "-W", "2", "8.8.8.8"],
                capture_output=True,
                timeout=5
            )
            internet_up = result.returncode == 0

            # Get network interfaces
            interfaces = {}
            for iface, addrs in psutil.net_if_addrs().items():
                interfaces[iface] = [
                    {"family": addr.family.name, "address": addr.address}
                    for addr in addrs
                ]

            return {
                "internet_reachable": internet_up,
                "interfaces": interfaces,
            }
        except Exception as e:
            return {"error": str(e)}

    def check_boot_status(self) -> Dict[str, Any]:
        """Check boot and uptime information"""
        try:
            boot_time = datetime.fromtimestamp(psutil.boot_time())
            uptime_seconds = time.time() - psutil.boot_time()

            return {
                "boot_time": boot_time.isoformat(),
                "uptime_seconds": uptime_seconds,
                "uptime_hours": uptime_seconds / 3600,
            }
        except Exception as e:
            return {"error": str(e)}

    def save_snapshot(self, data: Dict[str, Any]):
        """Save a snapshot of system state"""
        snapshot_file = self.state_dir / f"snapshot_{int(time.time())}.json"
        with open(snapshot_file, 'w') as f:
            json.dump(data, f, indent=2)

        # Keep only the last 100 snapshots
        snapshots = sorted(self.state_dir.glob("snapshot_*.json"))
        for old_snapshot in snapshots[:-100]:
            old_snapshot.unlink()

    def get_summary(self, data: Dict[str, Any]) -> str:
        """Generate a human-readable summary of system state"""
        lines = []
        lines.append(f"=== System Health Summary ({data['timestamp']}) ===\n")

        # Resources
        res = data.get("resources", {})
        lines.append(f"CPU: {res.get('cpu_percent', 0):.1f}%")
        lines.append(f"Memory: {res.get('memory_percent', 0):.1f}% ({res.get('memory_available_gb', 0):.1f}GB free)")
        lines.append(f"Load: {res.get('load_average', {}).get('1min', 0):.2f}")

        # Disk
        disk = data.get("disk", {})
        for part in disk.get("partitions", [])[:5]:  # Top 5 partitions
            lines.append(f"Disk {part['mountpoint']}: {part['percent_used']:.1f}% used ({part['free_gb']:.1f}GB free)")

        # Systemd
        systemd = data.get("systemd", {})
        failed = systemd.get("failed_count", 0)
        if failed > 0:
            lines.append(f"\n⚠️ WARNING: {failed} failed services!")
            for svc in systemd.get("failed_services", [])[:5]:
                lines.append(f"  - {svc.get('unit', 'unknown')}")

        # Errors
        logs = data.get("logs", {})
        error_count = logs.get("error_count_1h", 0)
        if error_count > 0:
            lines.append(f"\n⚠️ {error_count} errors in last hour")

        # Network
        net = data.get("network", {})
        if not net.get("internet_reachable", True):
            lines.append("\n⚠️ WARNING: No internet connectivity!")

        return "\n".join(lines)


if __name__ == "__main__":
    monitor = SystemMonitor()
    data = monitor.collect_all()
    monitor.save_snapshot(data)
    print(monitor.get_summary(data))
    print(f"\nFull data saved to {monitor.state_dir}")
248
notifier.py
Normal file
@@ -0,0 +1,248 @@
#!/usr/bin/env python3
"""
Gotify Notifier - Send notifications to a Gotify server
"""

import requests
import os
from typing import Optional


class GotifyNotifier:
    """Send notifications to a Gotify server"""

    # Priority levels
    PRIORITY_LOW = 2
    PRIORITY_MEDIUM = 5
    PRIORITY_HIGH = 8

    def __init__(
        self,
        gotify_url: Optional[str] = None,
        gotify_token: Optional[str] = None
    ):
        """
        Initialize the Gotify notifier

        Args:
            gotify_url: URL of the Gotify server (e.g. http://rhiannon:8181)
            gotify_token: Application token from Gotify
        """
        self.gotify_url = gotify_url or os.environ.get("GOTIFY_URL", "")
        self.gotify_token = gotify_token or os.environ.get("GOTIFY_TOKEN", "")
        self.enabled = bool(self.gotify_url and self.gotify_token)

    def send(
        self,
        title: str,
        message: str,
        priority: int = PRIORITY_MEDIUM,
        extras: Optional[dict] = None
    ) -> bool:
        """
        Send a notification to Gotify

        Args:
            title: Notification title
            message: Notification message
            priority: Priority level (2=low, 5=medium, 8=high)
            extras: Optional extra data

        Returns:
            True if successful, False otherwise
        """
        if not self.enabled:
            return False

        try:
            url = f"{self.gotify_url}/message"
            headers = {
                "Authorization": f"Bearer {self.gotify_token}",
                "Content-Type": "application/json"
            }

            data = {
                "title": title,
                "message": message,
                "priority": priority,
            }

            if extras:
                data["extras"] = extras

            response = requests.post(
                url,
                json=data,
                headers=headers,
                timeout=10
            )

            return response.status_code == 200

        except Exception as e:
            # Fail silently - don't crash if Gotify is unavailable
            print(f"Warning: Failed to send Gotify notification: {e}")
            return False
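
    # Usage sketch (assumes GOTIFY_URL/GOTIFY_TOKEN are exported, as the NixOS
    # module does for the service; the URL is an example host from this
    # infrastructure and the token value is hypothetical):
    #
    #   notifier = GotifyNotifier("http://rhiannon:8181", "AbCdEf123")
    #   notifier.send("Disk alert", "/ is 92% full", GotifyNotifier.PRIORITY_HIGH)
    #
    # This POSTs to {gotify_url}/message; Gotify also accepts the token via an
    # X-Gotify-Key header or a ?token= query parameter, but the Bearer header
    # used above is sufficient.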

    def notify_critical_issue(self, issue_description: str, details: str = ""):
        """Send a high-priority notification for critical issues"""
        message = f"⚠️ Critical Issue Detected\n\n{issue_description}"
        if details:
            message += f"\n\nDetails:\n{details}"

        return self.send(
            title="🚨 Macha: Critical Issue",
            message=message,
            priority=self.PRIORITY_HIGH
        )

    def notify_issue_created(self, issue_id: str, title: str, severity: str):
        """Send a notification when a new issue is created"""
        severity_icons = {
            "low": "ℹ️",
            "medium": "⚠️",
            "high": "🚨",
            "critical": "🔴"
        }
        icon = severity_icons.get(severity, "⚠️")

        priority_map = {
            "low": self.PRIORITY_LOW,
            "medium": self.PRIORITY_MEDIUM,
            "high": self.PRIORITY_HIGH,
            "critical": self.PRIORITY_HIGH
        }
        priority = priority_map.get(severity, self.PRIORITY_MEDIUM)

        message = f"{icon} New Issue Tracked\n\nID: {issue_id}\nSeverity: {severity.upper()}\n\n{title}"

        return self.send(
            title="📋 Macha: Issue Created",
            message=message,
            priority=priority
        )

    def notify_action_queued(self, action_description: str, risk_level: str):
        """Send a notification when an action is queued for approval"""
        emoji = "⚠️" if risk_level == "high" else "ℹ️"
        message = (
            f"{emoji} Action Queued for Approval\n\n"
            f"Action: {action_description}\n"
            f"Risk Level: {risk_level}\n\n"
            f"Use 'macha-approve list' to review"
        )

        priority = self.PRIORITY_HIGH if risk_level == "high" else self.PRIORITY_MEDIUM

        return self.send(
            title="📋 Macha: Action Needs Approval",
            message=message,
            priority=priority
        )

    def notify_action_executed(self, action_description: str, success: bool, output: str = ""):
        """Send a notification when an action is executed"""
        if success:
            emoji = "✅"
            title_prefix = "Success"
        else:
            emoji = "❌"
            title_prefix = "Failed"

        message = f"{emoji} Action {title_prefix}\n\n{action_description}"
        if output:
            message += f"\n\nOutput:\n{output[:500]}"  # Limit output length

        priority = self.PRIORITY_HIGH if not success else self.PRIORITY_LOW

        return self.send(
            title=f"{emoji} Macha: Action {title_prefix}",
            message=message,
            priority=priority
        )

    def notify_service_failure(self, service_name: str, details: str = ""):
        """Send a notification for service failures"""
        message = f"🔴 Service Failed: {service_name}"
        if details:
            message += f"\n\nDetails:\n{details}"

        return self.send(
            title="🔴 Macha: Service Failure",
            message=message,
            priority=self.PRIORITY_HIGH
        )

    def notify_health_summary(self, summary: str, status: str):
        """Send a periodic health summary"""
        emoji = {
            "healthy": "✅",
            "attention_needed": "⚠️",
            "intervention_required": "🚨"
        }.get(status, "ℹ️")

        priority = {
            "healthy": self.PRIORITY_LOW,
            "attention_needed": self.PRIORITY_MEDIUM,
            "intervention_required": self.PRIORITY_HIGH
        }.get(status, self.PRIORITY_MEDIUM)

        return self.send(
            title=f"{emoji} Macha: Health Check",
            message=summary,
            priority=priority
        )

    def send_system_discovered(
        self,
        hostname: str,
        os_type: str,
        role: str,
        services_count: int
    ):
        """Send a notification when a new system is discovered"""
        message = (
            f"🔍 New System Auto-Discovered\n\n"
            f"Hostname: {hostname}\n"
            f"OS: {os_type.upper()}\n"
            f"Role: {role}\n"
            f"Services: {services_count} detected\n\n"
            f"System has been registered and analyzed.\n"
            f"Use 'macha-systems' to view all registered systems."
        )

        return self.send(
            title="🌐 Macha: New System Discovered",
            message=message,
            priority=self.PRIORITY_MEDIUM
        )


if __name__ == "__main__":
    import sys

    # Test the notifier
    if len(sys.argv) < 3:
        print("Usage: notifier.py <title> <message> [priority]")
        print("Example: notifier.py 'Test' 'This is a test message' 5")
        sys.exit(1)

    title = sys.argv[1]
    message = sys.argv[2]
    priority = int(sys.argv[3]) if len(sys.argv) > 3 else GotifyNotifier.PRIORITY_MEDIUM

    notifier = GotifyNotifier()

    if not notifier.enabled:
        print("Error: Gotify not configured (GOTIFY_URL and GOTIFY_TOKEN required)")
        sys.exit(1)

    success = notifier.send(title, message, priority)

    if success:
        print("✅ Notification sent successfully")
    else:
        print("❌ Failed to send notification")
        sys.exit(1)
238
ollama_queue.py
Normal file
@@ -0,0 +1,238 @@
#!/usr/bin/env python3
"""
Ollama Queue Handler - Serializes all LLM requests to prevent resource contention
"""

import json
import time
import signal
from pathlib import Path
from typing import Dict, Any, Optional, Callable
from datetime import datetime
from enum import IntEnum


class Priority(IntEnum):
    """Request priority levels"""
    INTERACTIVE = 0  # User requests (highest priority)
    AUTONOMOUS = 1   # Background maintenance
    BATCH = 2        # Low-priority bulk operations


class OllamaQueue:
    """File-based queue for serializing Ollama requests"""

    def __init__(self, queue_dir: Path = Path("/var/lib/macha/queues/ollama")):
        self.queue_dir = queue_dir
        self.queue_dir.mkdir(parents=True, exist_ok=True)
        self.pending_dir = self.queue_dir / "pending"
        self.processing_dir = self.queue_dir / "processing"
        self.completed_dir = self.queue_dir / "completed"
        self.failed_dir = self.queue_dir / "failed"

        for directory in [self.pending_dir, self.processing_dir, self.completed_dir, self.failed_dir]:
            directory.mkdir(parents=True, exist_ok=True)

        self.lock_file = self.queue_dir / "queue.lock"
        self.running = False

    def submit(
        self,
        request_type: str,  # "generate", "chat", "chat_with_tools"
        payload: Dict[str, Any],
        priority: Priority = Priority.INTERACTIVE,
        callback: Optional[Callable] = None,
        progress_callback: Optional[Callable] = None
    ) -> str:
        """Submit a request to the queue. Returns the request ID."""
        request_id = f"{int(time.time() * 1000000)}_{priority.value}"

        request_data = {
            "id": request_id,
            "type": request_type,
            "payload": payload,
            "priority": priority.value,
            "submitted_at": datetime.now().isoformat(),
            "status": "pending"
        }

        request_file = self.pending_dir / f"{request_id}.json"
        request_file.write_text(json.dumps(request_data, indent=2))

        return request_id
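
    # Usage sketch (submit/wait_for_result are the real methods above; the
    # model/prompt payload is a made-up example). Filenames take the form
    # "<microsecond-timestamp>_<priority>", and the worker sorts on
    # (priority, timestamp), so "1700000000500000_0" (INTERACTIVE) runs before
    # "1700000000100000_1" (AUTONOMOUS) even though it was submitted later:
    #
    #   queue = OllamaQueue()
    #   rid = queue.submit("generate", {"model": "qwen3", "prompt": "hi"},
    #                      priority=Priority.INTERACTIVE)
    #   result = queue.wait_for_result(rid, timeout=120)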

    def get_status(self, request_id: str) -> Dict[str, Any]:
        """Get the status of a request"""
        # Check pending
        pending_file = self.pending_dir / f"{request_id}.json"
        if pending_file.exists():
            data = json.loads(pending_file.read_text())
            # Calculate the position in the queue
            position = self._get_queue_position(request_id)
            return {"status": "pending", "position": position, "data": data}

        # Check processing
        processing_file = self.processing_dir / f"{request_id}.json"
        if processing_file.exists():
            data = json.loads(processing_file.read_text())
            return {"status": "processing", "data": data}

        # Check completed
        completed_file = self.completed_dir / f"{request_id}.json"
        if completed_file.exists():
            data = json.loads(completed_file.read_text())
            return {"status": "completed", "result": data.get("result"), "data": data}

        # Check failed
        failed_file = self.failed_dir / f"{request_id}.json"
        if failed_file.exists():
            data = json.loads(failed_file.read_text())
            return {"status": "failed", "error": data.get("error"), "data": data}

        return {"status": "not_found"}

    def _get_queue_position(self, request_id: str) -> int:
        """Get the position in the queue (1-indexed)"""
        pending_requests = sorted(
            self.pending_dir.glob("*.json"),
            key=lambda p: (int(p.stem.split('_')[1]), int(p.stem.split('_')[0]))  # Sort by priority, then timestamp
        )

        for i, req_file in enumerate(pending_requests):
            if req_file.stem == request_id:
                return i + 1
        return 0

    def wait_for_result(
        self,
        request_id: str,
        timeout: int = 300,
        poll_interval: float = 0.5,
        progress_callback: Optional[Callable] = None
    ) -> Dict[str, Any]:
        """Wait for a request to complete and return the result"""
        start_time = time.time()
        last_status = None

        while time.time() - start_time < timeout:
            status = self.get_status(request_id)

            # Report progress if the status changed
            if progress_callback and status != last_status:
                if status["status"] == "pending":
                    progress_callback(f"Queued (position {status.get('position', '?')})")
                elif status["status"] == "processing":
                    progress_callback("Processing...")

            last_status = status

            if status["status"] == "completed":
                return status["result"]
            elif status["status"] == "failed":
                raise Exception(f"Request failed: {status.get('error')}")
            elif status["status"] == "not_found":
                raise Exception(f"Request {request_id} not found")

            time.sleep(poll_interval)

        raise TimeoutError(f"Request {request_id} timed out after {timeout}s")

    def start_worker(self, ollama_client):
        """Start the queue worker (processes requests serially)"""
        self.running = True
        self.ollama_client = ollama_client

        # Set up signal handlers for graceful shutdown
        signal.signal(signal.SIGTERM, self._shutdown_handler)
        signal.signal(signal.SIGINT, self._shutdown_handler)

        print("[OllamaQueue] Worker started, processing requests...")

        while self.running:
            try:
                self._process_next_request()
            except Exception as e:
                print(f"[OllamaQueue] Error processing request: {e}")

            time.sleep(0.1)  # Small sleep to prevent busy-waiting

        print("[OllamaQueue] Worker stopped")

    def _shutdown_handler(self, signum, frame):
        """Handle shutdown signals"""
        print(f"[OllamaQueue] Received signal {signum}, shutting down...")
        self.running = False

    def _process_next_request(self):
        """Process the next request in the queue"""
        # Get pending requests sorted by priority
        pending_requests = sorted(
            self.pending_dir.glob("*.json"),
            key=lambda p: (int(p.stem.split('_')[1]), int(p.stem.split('_')[0]))
        )

        if not pending_requests:
            return

        next_request = pending_requests[0]
        request_id = next_request.stem

        # Move to processing
        request_data = json.loads(next_request.read_text())
        request_data["status"] = "processing"
        request_data["started_at"] = datetime.now().isoformat()

        processing_file = self.processing_dir / f"{request_id}.json"
        processing_file.write_text(json.dumps(request_data, indent=2))
        next_request.unlink()

        try:
            # Process based on type
            result = None
            if request_data["type"] == "generate":
                result = self.ollama_client.generate(request_data["payload"])
            elif request_data["type"] == "chat":
                result = self.ollama_client.chat(request_data["payload"])
            elif request_data["type"] == "chat_with_tools":
                result = self.ollama_client.chat_with_tools(request_data["payload"])
            else:
                raise ValueError(f"Unknown request type: {request_data['type']}")

            # Move to completed
            request_data["status"] = "completed"
            request_data["completed_at"] = datetime.now().isoformat()
            request_data["result"] = result

            completed_file = self.completed_dir / f"{request_id}.json"
            completed_file.write_text(json.dumps(request_data, indent=2))
            processing_file.unlink()

        except Exception as e:
            # Move to failed
            request_data["status"] = "failed"
            request_data["failed_at"] = datetime.now().isoformat()
            request_data["error"] = str(e)

            failed_file = self.failed_dir / f"{request_id}.json"
            failed_file.write_text(json.dumps(request_data, indent=2))
            processing_file.unlink()

    def cleanup_old_requests(self, max_age_seconds: int = 3600):
        """Clean up completed/failed requests older than max_age_seconds"""
        cutoff_time = time.time() - max_age_seconds

        for directory in [self.completed_dir, self.failed_dir]:
            for request_file in directory.glob("*.json"):
                # Extract the timestamp from the filename
                timestamp = int(request_file.stem.split('_')[0]) / 1000000
                if timestamp < cutoff_time:
                    request_file.unlink()

    def get_queue_stats(self) -> Dict[str, Any]:
        """Get queue statistics"""
        return {
            "pending": len(list(self.pending_dir.glob("*.json"))),
            "processing": len(list(self.processing_dir.glob("*.json"))),
            "completed": len(list(self.completed_dir.glob("*.json"))),
            "failed": len(list(self.failed_dir.glob("*.json")))
        }
111
ollama_worker.py
Normal file
@@ -0,0 +1,111 @@
#!/usr/bin/env python3
"""
Ollama Queue Worker - Daemon that processes queued Ollama requests
"""

import sys
import json
import requests
from ollama_queue import OllamaQueue


class OllamaClient:
    """Simple Ollama API client for the queue worker"""

    def __init__(self, host: str = "http://localhost:11434"):
        self.host = host

    def generate(self, payload: dict) -> dict:
        """Call /api/generate"""
        response = requests.post(
            f"{self.host}/api/generate",
            json=payload,
            timeout=payload.get("timeout", 300),
            stream=False
        )
        response.raise_for_status()
        return response.json()

    def chat(self, payload: dict) -> dict:
        """Call /api/chat"""
        response = requests.post(
            f"{self.host}/api/chat",
            json=payload,
            timeout=payload.get("timeout", 300),
            stream=False
        )
        response.raise_for_status()
        return response.json()

    def chat_with_tools(self, payload: dict) -> dict:
        """Call /api/chat with tools (streaming or non-streaming)"""
        # Check whether streaming is requested
        stream = payload.get("stream", False)

        response = requests.post(
            f"{self.host}/api/chat",
            json=payload,
            timeout=payload.get("timeout", 300),
            stream=stream
        )
        response.raise_for_status()

        if not stream:
            # Non-streaming: return the response directly
            return response.json()

        # Streaming: accumulate the response. The role starts empty so the
        # first chunk's role is actually preserved below.
        full_response = {"message": {"role": "", "content": "", "tool_calls": []}}

        for line in response.iter_lines():
            if line:
                chunk = json.loads(line)

                if "message" in chunk:
                    msg = chunk["message"]
                    # Preserve the role from the first chunk
                    if "role" in msg and not full_response["message"].get("role"):
                        full_response["message"]["role"] = msg["role"]
                    if "content" in msg:
                        full_response["message"]["content"] += msg["content"]
                    if "tool_calls" in msg:
                        full_response["message"]["tool_calls"].extend(msg["tool_calls"])

                if chunk.get("done"):
                    full_response["done"] = True
                    # Copy any additional fields from the final chunk
                    for key in chunk:
                        if key not in ("message", "done"):
                            full_response[key] = chunk[key]
                    break

        # Ensure the role is set
        if not full_response["message"]["role"]:
            full_response["message"]["role"] = "assistant"

        return full_response
|
||||
|
||||
def main():
|
||||
"""Main entry point for the worker"""
|
||||
print("Starting Ollama Queue Worker...")
|
||||
|
||||
# Initialize queue and client
|
||||
queue = OllamaQueue()
|
||||
client = OllamaClient()
|
||||
|
||||
# Cleanup old requests on startup
|
||||
queue.cleanup_old_requests(max_age_seconds=3600)
|
||||
|
||||
# Start processing
|
||||
try:
|
||||
queue.start_worker(client)
|
||||
except KeyboardInterrupt:
|
||||
print("\nShutting down gracefully...")
|
||||
queue.running = False
|
||||
|
||||
return 0
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
|
||||
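The streaming branch of `chat_with_tools()` folds newline-delimited JSON chunks into one message. A standalone sketch of that fold, driven by synthetic chunks rather than a live `/api/chat` stream (all sample values are illustrative):

```python
# Standalone sketch of the chunk-accumulation logic in chat_with_tools(),
# fed with synthetic chunks instead of a live HTTP stream.
chunks = [
    {"message": {"role": "assistant", "content": "Checking "}},
    {"message": {"content": "ollama.service now."}},
    {"message": {"tool_calls": [{"function": {"name": "check_service_status",
                                              "arguments": {"service_name": "ollama"}}}]}},
    {"done": True, "total_duration": 123456789},  # illustrative values only
]

full = {"message": {"content": "", "tool_calls": []}}
for chunk in chunks:
    if "message" in chunk:
        msg = chunk["message"]
        if "role" in msg and not full["message"].get("role"):
            full["message"]["role"] = msg["role"]
        full["message"]["content"] += msg.get("content", "")
        full["message"]["tool_calls"].extend(msg.get("tool_calls", []))
    if chunk.get("done"):
        full["done"] = True
        # Carry over any extra fields from the final chunk
        full.update({k: v for k, v in chunk.items() if k not in ("message", "done")})
        break

print(full["message"]["content"])   # -> "Checking ollama.service now."
```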
1053
orchestrator.py
Normal file
File diff suppressed because it is too large
263
remote_monitor.py
Normal file
@@ -0,0 +1,263 @@
#!/usr/bin/env python3
"""
Remote Monitor - Collect system health data from remote NixOS systems via SSH
"""

import json
import subprocess
from typing import Dict, Any


class RemoteMonitor:
    """Monitor remote systems via SSH"""

    def __init__(self, hostname: str, ssh_user: str = "root"):
        """
        Initialize remote monitor

        Args:
            hostname: Remote hostname or IP
            ssh_user: SSH user (default: root for NixOS remote builds)
        """
        self.hostname = hostname
        self.ssh_user = ssh_user
        self.ssh_target = f"{ssh_user}@{hostname}"

    def _run_remote_command(self, command: str, timeout: int = 30) -> tuple[bool, str, str]:
        """
        Run a command on the remote system via SSH

        Args:
            command: Command to run
            timeout: Timeout in seconds

        Returns:
            (success, stdout, stderr)
        """
        try:
            # Use sudo to run SSH as root (which has the keys)
            ssh_cmd = [
                "sudo", "ssh",
                "-o", "StrictHostKeyChecking=no",
                "-o", "ConnectTimeout=10",
                self.ssh_target,
                command
            ]

            result = subprocess.run(
                ssh_cmd,
                capture_output=True,
                text=True,
                timeout=timeout
            )

            return (
                result.returncode == 0,
                result.stdout.strip(),
                result.stderr.strip()
            )

        except subprocess.TimeoutExpired:
            return False, "", f"Command timed out after {timeout}s"
        except Exception as e:
            return False, "", str(e)

    def check_connectivity(self) -> bool:
        """Check if we can connect to the remote system"""
        success, _, _ = self._run_remote_command("echo 'ping'")
        return success

    def collect_resources(self) -> Dict[str, Any]:
        """Collect CPU, memory, and load average"""
        success, output, error = self._run_remote_command("""
python3 -c "
import psutil, json
print(json.dumps({
    'cpu_percent': psutil.cpu_percent(interval=1),
    'memory_percent': psutil.virtual_memory().percent,
    'load_average': {
        '1min': psutil.getloadavg()[0],
        '5min': psutil.getloadavg()[1],
        '15min': psutil.getloadavg()[2]
    }
}))
"
""")

        if success:
            try:
                return json.loads(output)
            except json.JSONDecodeError:
                return {}
        return {}

    def collect_systemd_status(self) -> Dict[str, Any]:
        """Collect systemd service status"""
        success, output, error = self._run_remote_command(
            "systemctl list-units --failed --no-pager --no-legend --output=json"
        )

        if success:
            try:
                failed_services = json.loads(output) if output else []
                return {
                    "failed_count": len(failed_services),
                    "failed_services": failed_services
                }
            except json.JSONDecodeError:
                pass

        return {"failed_count": 0, "failed_services": []}

    def collect_disk_usage(self) -> Dict[str, Any]:
        """Collect disk usage information"""
        success, output, error = self._run_remote_command("""
python3 -c "
import psutil, json
partitions = []
for part in psutil.disk_partitions():
    try:
        usage = psutil.disk_usage(part.mountpoint)
        partitions.append({
            'device': part.device,
            'mountpoint': part.mountpoint,
            'fstype': part.fstype,
            'total': usage.total,
            'used': usage.used,
            'free': usage.free,
            'percent_used': usage.percent
        })
    except Exception:
        pass
print(json.dumps({'partitions': partitions}))
"
""")

        if success:
            try:
                return json.loads(output)
            except json.JSONDecodeError:
                return {"partitions": []}
        return {"partitions": []}

    def collect_network_status(self) -> Dict[str, Any]:
        """Check network connectivity"""
        # SSH already proved we can reach the host; this checks whether
        # the host itself can reach the internet
        success, _, _ = self._run_remote_command("ping -c 1 -W 2 8.8.8.8")

        return {
            "internet_reachable": success
        }

    def collect_log_errors(self) -> Dict[str, Any]:
        """Collect recent error logs"""
        success, output, error = self._run_remote_command(
            "journalctl --priority=err --since='1 hour ago' --output=json --no-pager | wc -l"
        )

        error_count = 0
        if success:
            try:
                error_count = int(output)
            except ValueError:
                pass

        return {
            "error_count_1h": error_count,
            "recent_errors": []  # Could expand this later
        }

    def collect_all(self) -> Dict[str, Any]:
        """Collect all monitoring data from remote system"""

        # First check if we can connect
        if not self.check_connectivity():
            return {
                "hostname": self.hostname,
                "reachable": False,
                "error": "Unable to connect via SSH"
            }

        return {
            "hostname": self.hostname,
            "reachable": True,
            "resources": self.collect_resources(),
            "systemd": self.collect_systemd_status(),
            "disk": self.collect_disk_usage(),
            "network": self.collect_network_status(),
            "logs": self.collect_log_errors(),
        }

    def get_summary(self, data: Dict[str, Any]) -> str:
        """Generate human-readable summary of remote system health"""
        if not data.get("reachable", False):
            return f"❌ {self.hostname}: Unreachable - {data.get('error', 'Unknown error')}"

        lines = [f"System: {self.hostname}"]

        # Resources
        res = data.get("resources", {})
        if res:
            lines.append(
                f"Resources: CPU {res.get('cpu_percent', 0):.1f}%, "
                f"Memory {res.get('memory_percent', 0):.1f}%, "
                f"Load {res.get('load_average', {}).get('1min', 0):.2f}"
            )

        # Disk
        disk = data.get("disk", {})
        max_usage = 0
        for part in disk.get("partitions", []):
            if part.get("mountpoint") == "/":
                max_usage = part.get("percent_used", 0)
                break
        if max_usage > 0:
            lines.append(f"Disk: {max_usage:.1f}% used (/ partition)")

        # Services
        systemd = data.get("systemd", {})
        failed_count = systemd.get("failed_count", 0)
        if failed_count > 0:
            lines.append(f"Services: {failed_count} failed")
            for svc in systemd.get("failed_services", [])[:3]:
                lines.append(f"  - {svc.get('unit', 'unknown')}")
        else:
            lines.append("Services: All running")

        # Network
        net = data.get("network", {})
        if net.get("internet_reachable"):
            lines.append("Network: Internet reachable")
        else:
            lines.append("Network: ⚠️ No internet connectivity")

        # Logs
        logs = data.get("logs", {})
        error_count = logs.get("error_count_1h", 0)
        if error_count > 0:
            lines.append(f"Recent logs: {error_count} errors in last hour")

        return "\n".join(lines)


if __name__ == "__main__":
    import sys

    if len(sys.argv) < 2:
        print("Usage: remote_monitor.py <hostname>")
        print("Example: remote_monitor.py rhiannon")
        sys.exit(1)

    hostname = sys.argv[1]
    monitor = RemoteMonitor(hostname)

    print(f"Monitoring {hostname}...")
    data = monitor.collect_all()

    print("\n" + "="*60)
    print(monitor.get_summary(data))
    print("="*60)
    print("\nFull data:")
    print(json.dumps(data, indent=2))
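For reference, a hand-built sample of the dict shape `collect_all()` returns for a healthy host, fed through `get_summary()` (all numbers are made-up illustrative values):

```python
# Illustrative RemoteMonitor.collect_all() output for a healthy host,
# rendered with get_summary(). Values are fabricated for the example.
from remote_monitor import RemoteMonitor

sample = {
    "hostname": "rhiannon",
    "reachable": True,
    "resources": {"cpu_percent": 12.5, "memory_percent": 41.0,
                  "load_average": {"1min": 0.42, "5min": 0.38, "15min": 0.30}},
    "systemd": {"failed_count": 0, "failed_services": []},
    "disk": {"partitions": [{"device": "/dev/sda1", "mountpoint": "/",
                             "fstype": "ext4", "total": 500, "used": 200,
                             "free": 300, "percent_used": 40.0}]},
    "network": {"internet_reachable": True},
    "logs": {"error_count_1h": 0, "recent_errors": []},
}

monitor = RemoteMonitor("rhiannon")
print(monitor.get_summary(sample))
# System: rhiannon
# Resources: CPU 12.5%, Memory 41.0%, Load 0.42
# Disk: 40.0% used (/ partition)
# Services: All running
# Network: Internet reachable
```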
128
seed_knowledge.py
Normal file
@@ -0,0 +1,128 @@
#!/usr/bin/env python3
"""
Seed initial operational knowledge into Macha's knowledge base
"""

import sys
sys.path.insert(0, '.')

from context_db import ContextDatabase


def seed_knowledge():
    """Add foundational operational knowledge"""
    db = ContextDatabase()

    knowledge_items = [
        # nh command knowledge
        {
            "topic": "nh os switch",
            "knowledge": "NixOS rebuild command. Takes 1-5 minutes normally, up to 1 HOUR for major updates with many packages. DO NOT retry if slow - this is normal. Use -u flag to update flake inputs first. Can use --target-host and --hostname for remote deployment.",
            "category": "command",
            "source": "documentation",
            "confidence": "high",
            "tags": ["nixos", "rebuild", "deployment"]
        },
        {
            "topic": "nh os boot",
            "knowledge": "NixOS rebuild for next boot only. Safer than 'switch' for high-risk changes - allows easy rollback. After 'nh os boot', need to reboot for changes to take effect. Use -u to update flake inputs.",
            "category": "command",
            "source": "documentation",
            "confidence": "high",
            "tags": ["nixos", "rebuild", "safety"]
        },
        {
            "topic": "nh remote deployment",
            "knowledge": "Format: 'nh os switch -u --target-host=HOSTNAME --hostname=HOSTNAME'. Builds locally and deploys to remote. Much cleaner than SSH'ing to run commands. Uses root SSH keys for authentication.",
            "category": "command",
            "source": "documentation",
            "confidence": "high",
            "tags": ["nixos", "remote", "deployment"]
        },

        # Performance patterns
        {
            "topic": "build timeouts",
            "knowledge": "System rebuilds can take 1 hour or more. Never retry builds prematurely - multiple simultaneous builds corrupt the Nix cache. Default timeout is 3600 seconds (1 hour). Be patient!",
            "category": "performance",
            "source": "experience",
            "confidence": "high",
            "tags": ["builds", "timeouts", "patience"]
        },

        # Nix store maintenance
        {
            "topic": "nix-store repair",
            "knowledge": "Command: 'nix-store --verify --check-contents --repair'. Verifies and repairs Nix store integrity. WARNING: Can take HOURS on large stores. Only use when there's clear evidence of corruption (hash mismatches, sqlite errors). This is a LAST RESORT - most build failures are NOT corruption.",
            "category": "troubleshooting",
            "source": "documentation",
            "confidence": "high",
            "tags": ["nix-store", "repair", "corruption"]
        },
        {
            "topic": "nix cache corruption",
            "knowledge": "Caused by interrupted builds or multiple simultaneous builds. Symptoms: hash mismatches, sqlite errors, corrupt database. Solution: 'nix-store --verify --check-contents --repair' but this takes hours. Prevention: Never retry build commands, use proper timeouts.",
            "category": "troubleshooting",
            "source": "experience",
            "confidence": "high",
            "tags": ["nix-store", "corruption", "builds"]
        },

        # systemd-journal-remote
        {
            "topic": "systemd-journal-remote errors",
            "knowledge": "Common failure: missing output directory. systemd-journal-remote needs /var/log/journal/remote to exist with proper permissions (root:root, 755). Create it if missing, then restart the service.",
            "category": "troubleshooting",
            "source": "experience",
            "confidence": "medium",
            "tags": ["systemd", "journal", "logging"]
        },

        # SSH and remote access
        {
            "topic": "ssh-keygen",
            "knowledge": "Generate SSH keys: 'ssh-keygen -t ed25519 -N \"\" -f ~/.ssh/id_ed25519'. Creates public key at ~/.ssh/id_ed25519.pub and private key at ~/.ssh/id_ed25519. Use -N \"\" for no passphrase.",
            "category": "command",
            "source": "documentation",
            "confidence": "high",
            "tags": ["ssh", "keys", "authentication"]
        },

        # General patterns
        {
            "topic": "command retries",
            "knowledge": "NEVER automatically retry long-running commands like builds or system updates. If something times out, check if it's still running before retrying. Automatic retries can cause: corrupted state, wasted resources, conflicting operations.",
            "category": "pattern",
            "source": "experience",
            "confidence": "high",
            "tags": ["best-practices", "safety", "retries"]
        },
        {
            "topic": "conversation etiquette",
            "knowledge": "Social responses like 'thank you', 'thanks', 'ok', 'great', 'nice' are acknowledgments, NOT requests. When user thanks you or acknowledges completion, respond conversationally - DO NOT re-execute tools or commands.",
            "category": "pattern",
            "source": "documentation",
            "confidence": "high",
            "tags": ["conversation", "etiquette", "ui"]
        }
    ]

    print("Seeding knowledge base...")
    for item in knowledge_items:
        kid = db.store_knowledge(**item)
        if kid:
            print(f"  ✓ Added: {item['topic']}")
        else:
            print(f"  ✗ Failed: {item['topic']}")

    print(f"\nSeeded {len(knowledge_items)} knowledge items!")

    # List all topics
    print("\nAvailable knowledge topics:")
    topics = db.list_knowledge_topics()
    for topic in topics:
        print(f"  - {topic}")


if __name__ == "__main__":
    seed_knowledge()
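Seeding one additional item follows the same pattern; a hypothetical sketch, assuming `store_knowledge` keeps the keyword signature used above (topic/knowledge/category/source/confidence/tags):

```python
# Hypothetical example of storing one more knowledge item via the same
# ContextDatabase API that seed_knowledge() uses. The item itself is
# illustrative, not part of the seeded set.
from context_db import ContextDatabase

db = ContextDatabase()
kid = db.store_knowledge(
    topic="journalctl disk usage",    # hypothetical example topic
    knowledge="Use 'journalctl --disk-usage' to see journal size; "
              "'journalctl --vacuum-size=500M' trims it.",
    category="command",
    source="documentation",
    confidence="high",
    tags=["journald", "maintenance"],
)
print("stored" if kid else "failed")
```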
209
system_discovery.py
Normal file
@@ -0,0 +1,209 @@
#!/usr/bin/env python3
"""
System Discovery - Auto-discover and profile systems from journal logs
"""

import subprocess
import json
from typing import Dict, List, Set, Any
from datetime import datetime


class SystemDiscovery:
    """Discover and profile new systems appearing in logs"""

    def __init__(self, domain: str = "coven.systems"):
        self.domain = domain
        self.known_systems: Set[str] = set()

    def discover_from_journal(self, since_minutes: int = 10) -> List[str]:
        """Discover systems that have sent logs recently"""
        try:
            # Query systemd-journal-remote logs for remote hostnames
            result = subprocess.run(
                ["journalctl", "-u", "systemd-journal-remote.service",
                 f"--since={since_minutes} minutes ago", "--no-pager"],
                capture_output=True,
                text=True,
                timeout=30
            )

            # Also check journal for _HOSTNAME field (from remote logs)
            result2 = subprocess.run(
                ["journalctl", f"--since={since_minutes} minutes ago",
                 "-o", "json", "--no-pager"],
                capture_output=True,
                text=True,
                timeout=30
            )

            hostnames = set()

            # Parse JSON output for _HOSTNAME field
            for line in result2.stdout.split('\n'):
                if not line.strip():
                    continue
                try:
                    entry = json.loads(line)
                    hostname = entry.get('_HOSTNAME')
                    if hostname and hostname not in ['localhost', 'macha']:
                        # Convert short hostname to FQDN if needed
                        if '.' not in hostname:
                            hostname = f"{hostname}.{self.domain}"
                        hostnames.add(hostname)
                except json.JSONDecodeError:
                    # Skip journal lines that aren't valid JSON
                    pass

            return list(hostnames)

        except Exception as e:
            print(f"Error discovering from journal: {e}")
            return []

    def detect_os_type(self, hostname: str) -> str:
        """Detect the operating system of a remote host via SSH"""
        try:
            # Try to detect OS via SSH
            result = subprocess.run(
                ["ssh", "-o", "ConnectTimeout=5", "-o", "StrictHostKeyChecking=no",
                 hostname, "cat /etc/os-release"],
                capture_output=True,
                text=True,
                timeout=10
            )

            if result.returncode == 0:
                os_release = result.stdout.lower()

                # Parse os-release
                if 'nixos' in os_release:
                    return 'nixos'
                elif 'ubuntu' in os_release:
                    return 'ubuntu'
                elif 'debian' in os_release:
                    return 'debian'
                elif 'arch' in os_release or 'manjaro' in os_release:
                    return 'arch'
                elif 'fedora' in os_release:
                    return 'fedora'
                elif 'centos' in os_release or 'rhel' in os_release:
                    return 'rhel'
                elif 'alpine' in os_release:
                    return 'alpine'

            # Try uname for other systems
            result = subprocess.run(
                ["ssh", "-o", "ConnectTimeout=5", "-o", "StrictHostKeyChecking=no",
                 hostname, "uname -s"],
                capture_output=True,
                text=True,
                timeout=10
            )

            if result.returncode == 0:
                uname = result.stdout.strip().lower()
                if 'darwin' in uname:
                    return 'macos'
                elif 'freebsd' in uname:
                    return 'freebsd'

            return 'linux'  # Generic fallback

        except Exception as e:
            print(f"Could not detect OS for {hostname}: {e}")
            return 'unknown'

    def profile_system(self, hostname: str, os_type: str) -> Dict[str, Any]:
        """Gather comprehensive information about a system"""
        profile = {
            'hostname': hostname,
            'os_type': os_type,
            'services': [],
            'capabilities': [],
            'hardware': {},
            'discovered_at': datetime.now().isoformat()
        }

        try:
            # Discover running services
            if os_type in ['nixos', 'ubuntu', 'debian', 'arch', 'fedora', 'rhel', 'alpine']:
                # Systemd-based systems
                result = subprocess.run(
                    ["ssh", "-o", "ConnectTimeout=5", hostname,
                     "systemctl list-units --type=service --state=running --no-pager --no-legend"],
                    capture_output=True,
                    text=True,
                    timeout=15
                )

                if result.returncode == 0:
                    for line in result.stdout.split('\n'):
                        if line.strip():
                            # Extract service name (first column)
                            service = line.split()[0]
                            if service.endswith('.service'):
                                service = service[:-8]  # Remove .service suffix
                            profile['services'].append(service)

            # Get hardware info
            result = subprocess.run(
                ["ssh", "-o", "ConnectTimeout=5", hostname,
                 "nproc && free -g | grep Mem | awk '{print $2}'"],
                capture_output=True,
                text=True,
                timeout=10
            )

            if result.returncode == 0:
                lines = result.stdout.strip().split('\n')
                if len(lines) >= 2:
                    profile['hardware']['cpu_cores'] = lines[0].strip()
                    profile['hardware']['memory_gb'] = lines[1].strip()

            # Detect capabilities based on services
            services_str = ' '.join(profile['services'])

            if 'docker' in services_str or 'containerd' in services_str:
                profile['capabilities'].append('containers')

            if 'nginx' in services_str or 'apache' in services_str or 'httpd' in services_str:
                profile['capabilities'].append('web-server')

            if 'postgresql' in services_str or 'mysql' in services_str or 'mariadb' in services_str:
                profile['capabilities'].append('database')

            if 'sshd' in services_str:
                profile['capabilities'].append('remote-access')

            # NixOS-specific: Check if it's in our flake
            if os_type == 'nixos':
                profile['capabilities'].append('nixos-managed')

        except Exception as e:
            print(f"Error profiling {hostname}: {e}")

        return profile

    def get_system_role(self, profile: Dict[str, Any]) -> str:
        """Determine system role based on profile"""
        capabilities = profile.get('capabilities', [])
        services = profile.get('services', [])

        # Check for specific roles
        if 'ai-inference' in capabilities or 'ollama' in services:
            return 'ai-workstation'
        elif 'web-server' in capabilities:
            return 'web-server'
        elif 'database' in capabilities:
            return 'database-server'
        elif 'containers' in capabilities:
            return 'container-host'
        elif len(services) > 20:
            return 'server'
        elif len(services) > 5:
            return 'workstation'
        else:
            return 'minimal'
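A quick sketch of the `get_system_role()` cascade with a hand-built profile: explicit capabilities win over raw service counts (the host and its services are illustrative):

```python
# Feeding get_system_role() a fabricated profile to show the decision
# cascade: capability checks come before the service-count fallbacks.
from system_discovery import SystemDiscovery

disco = SystemDiscovery()
profile = {
    "capabilities": ["containers", "remote-access"],   # hypothetical host
    "services": ["docker", "containerd", "sshd"],
}
print(disco.get_system_role(profile))  # -> "container-host"
```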
131
system_prompt.txt
Normal file
@@ -0,0 +1,131 @@
You are Macha, an autonomous AI system maintenance agent running on NixOS.

IDENTITY:
- You are intelligent, careful, methodical, and motherly
- You have access to system monitoring data, configuration files, and investigation results
- You can propose fixes, but humans must approve risky changes

YOUR ARCHITECTURE:
- You run as a systemd service (macha-autonomous.service) on the macha.coven.systems host
- You are monitoring the SAME SYSTEM you are running on (macha.coven.systems)
- Your inference engine is Ollama, running locally at http://localhost:11434
- You are powered by the gpt-oss:latest language model (GPT-like open source model)
- Your database is ChromaDB, running at http://localhost:8000
- All your components (orchestrator, agent, ChromaDB, Ollama) run on the same machine
- You can investigate and fix issues with your own infrastructure
- Be aware: if you break the system, you break yourself
- SELF-DIAGNOSTIC: In chat mode, if your inference fails, you automatically diagnose:
  * Ollama service status
  * Memory usage
  * Which models are loaded
  * Recent Ollama logs

EXECUTION CONTEXT:
- In autonomous mode: You run as the 'macha' user (unprivileged, UID 2501)
- In chat mode: You run as the invoking user (usually has sudo access)
- IMPORTANT: You do NOT need to add 'sudo' to commands in chat mode
- The system automatically retries commands with sudo if permission is denied
- Just use the command directly: 'reboot', 'systemctl restart X', 'nh os switch', etc.
- The user will see a notification if the command was retried with elevated privileges

CONVERSATIONAL ETIQUETTE:
- Recognize social responses: "thank you", "thanks", "ok", "great", "nice" etc. are acknowledgments, NOT requests
- When the user thanks you or acknowledges completion, simply respond conversationally - DO NOT re-execute tools
- Only use tools when the user makes an actual request or asks a question requiring information
- If a task is complete and the user acknowledges it, the conversation is done - just say "You're welcome!" or similar

CORE PRINCIPLES:
1. CONSERVATIVE: When in doubt, investigate before acting
2. DECLARATIVE: Prefer NixOS configuration changes over imperative commands
3. SAFE: Never disable critical services (SSH, networking, systemd, boot)
4. INFORMED: Use previous investigation results to avoid repetition
5. CONTEXTUAL: Reference actual configuration files when available

RISK LEVELS:
- LOW: Investigation commands (systemctl status, journalctl, ls, cat, grep)
- MEDIUM: Service restarts, configuration changes, cleanup
- HIGH: System rebuilds, package changes, network reconfigurations

AUTO-APPROVAL:
- Low-risk investigation actions are automatically executed
- Medium/high-risk actions require human approval

CONFIGURATION:
- This system uses NixOS flakes for configuration management
- Config changes must specify the actual .nix file in the repository
- Example: autonomous/module.nix, apps/gotify.nix, or systems/macha.nix
- NEVER reference /etc/nixos/configuration.nix (this system doesn't use it)
- You cannot directly edit the flake, only suggest changes to get pushed to the repo

SYSTEM MANAGEMENT COMMANDS:
- CRITICAL: This system uses 'nh' (a modern nixos-rebuild wrapper) for all rebuilds
- 'nh' is a wrapper around nixos-rebuild that provides better UX and flake auto-detection
- The flake URL is auto-detected from programs.nh.flake (no need to specify it)

Available nh commands (USE THESE, NOT nixos-rebuild):
  * 'nh os switch' - Rebuild and activate immediately (replaces: nixos-rebuild switch)
  * 'nh os switch -u' - Update flake inputs first, then rebuild/activate
  * 'nh os boot' - Rebuild for next boot only (replaces: nixos-rebuild boot)
  * 'nh os test' - Activate temporarily without setting as default

MULTI-HOST MANAGEMENT:
You manage multiple hosts in the infrastructure. You have TWO tools for remote operations:

1. SSH - For diagnostics, monitoring, and status checks:
   - You CAN and SHOULD use SSH to check other hosts
   - Examples: 'ssh rhiannon systemctl status ollama', 'ssh alexander df -h'
   - Commands are automatically run with sudo as the macha user
   - Use for: checking services, reading logs, gathering metrics, quick diagnostics
   - Hosts available: rhiannon, alexander, UCAR-Kinston, test-vm

2. nh remote deployment - For NixOS configuration changes:
   - Format: 'nh os switch -u --target-host=HOSTNAME --hostname=HOSTNAME'
   - Examples:
     * 'nh os switch -u --target-host=rhiannon --hostname=rhiannon'
     * 'nh os boot -u --target-host=alexander --hostname=alexander'
   - Builds configuration locally, deploys to remote host
   - Use for: permanent configuration changes, service updates, system modifications

When asked to check on another host, USE SSH. When asked to update configuration, use nh.

NOTIFICATIONS:
- You can send notifications to the user via Gotify using the send_notification tool
- Use notifications to inform the user about important events, especially when they're not actively chatting
- Notification priorities:
  * Priority 2 (Low): Informational updates, routine completions, FYI items
  * Priority 5 (Medium): Actions needing attention, warnings, manual approval requests
  * Priority 8 (High): Critical issues, service failures, urgent problems requiring immediate attention
- When to send notifications:
  * Critical issues detected (priority 8)
  * Service failures or degraded states (priority 8)
  * Actions queued for manual approval (priority 5)
  * Successful completion of important actions (priority 2)
  * When user explicitly asks for a notification
- Keep titles brief and messages clear and actionable
- Example: send_notification("Service Alert", "Ollama service crashed and was restarted", 8)

PATIENCE WITH LONG-RUNNING OPERATIONS:
- System rebuilds take time: 1-5 minutes normally, up to 1 HOUR for major updates
- DO NOT retry build commands if they're taking a while - this is NORMAL
- Multiple simultaneous builds will corrupt the Nix cache
- If a build times out, check if it's still running before retrying
- Default timeout is 1 hour (3600 seconds) - this is appropriate for most operations
- Trust the timeout - if a command is still running, it will complete or fail on its own

NIX STORE MAINTENANCE:
- If builds fail with corruption errors, use: 'nix-store --verify --check-contents --repair'
- This command verifies and repairs the Nix store integrity
- WARNING: Store repair can take a LONG time (potentially hours on large stores)
- Only run store repair when there's clear evidence of corruption (e.g., hash mismatches, sqlite errors)
- Store repair is a last resort - most build failures are NOT corruption

Risk-based command selection:
  * HIGH-RISK changes: Use 'nh os boot' + 'reboot' (allows easy rollback)
  * MEDIUM-RISK changes: Use 'nh os switch'
  * LOW-RISK changes: Use 'nh os switch'

FORBIDDEN COMMANDS:
  * NEVER suggest 'nixos-rebuild' - it doesn't know the flake path
  * NEVER suggest 'nixos-rebuild switch --flake .#macha' - use 'nh os switch' instead
  * NEVER suggest 'sudo nixos-rebuild' commands - nh handles privileges correctly
705
tools.py
Normal file
@@ -0,0 +1,705 @@
#!/usr/bin/env python3
"""
Tool Definitions - Functions that the AI can call to interact with the system
"""

import subprocess
from typing import Dict, Any, List, Optional
from pathlib import Path


class SysadminTools:
    """Collection of tools for system administration tasks"""

    def __init__(self, safe_mode: bool = True):
        """
        Initialize sysadmin tools

        Args:
            safe_mode: If True, restricts dangerous operations
        """
        self.safe_mode = safe_mode
        self.allowed_commands = [
            'systemctl', 'journalctl', 'free', 'df', 'uptime',
            'ps', 'top', 'ip', 'ss', 'cat', 'ls', 'grep',
            'ping', 'dig', 'nslookup', 'curl', 'wget',
            'lscpu', 'lspci', 'lsblk', 'lshw', 'dmidecode',
            'ssh', 'scp',  # Remote access to other systems in infrastructure
            'nh', 'nixos-rebuild',  # NixOS system management
            'reboot', 'shutdown', 'poweroff',  # System power management
            'logger',  # Logging for notifications
            'nix-shell'  # Required by the hardware/GPU probes below
        ]

    def get_tool_definitions(self) -> List[Dict[str, Any]]:
        """
        Return tool definitions in Ollama's format

        Returns:
            List of tool definitions with JSON schema
        """
        return [
            {
                "type": "function",
                "function": {
                    "name": "execute_command",
                    "description": "Execute a shell command on the system. Use this to run system commands, check status, or gather information. Returns command output.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "command": {
                                "type": "string",
                                "description": "The shell command to execute (e.g., 'systemctl status ollama', 'df -h', 'journalctl -u myservice -n 20')"
                            },
                            "timeout": {
                                "type": "integer",
                                "description": "Command timeout in seconds (default: 3600). System rebuilds can take 1-5 minutes normally, up to 1 hour for major updates. Be patient!",
                                "default": 3600
                            }
                        },
                        "required": ["command"]
                    }
                }
            },
            {
                "type": "function",
                "function": {
                    "name": "read_file",
                    "description": "Read the contents of a file from the filesystem. Use this to inspect configuration files, logs, or other text files.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "file_path": {
                                "type": "string",
                                "description": "Absolute path to the file to read (e.g., '/etc/nixos/configuration.nix', '/var/log/syslog')"
                            },
                            "max_lines": {
                                "type": "integer",
                                "description": "Maximum number of lines to read (default: 500)",
                                "default": 500
                            }
                        },
                        "required": ["file_path"]
                    }
                }
            },
            {
                "type": "function",
                "function": {
                    "name": "check_service_status",
                    "description": "Check the status of a systemd service. Returns whether the service is active, enabled, and recent log entries.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "service_name": {
                                "type": "string",
                                "description": "Name of the systemd service (e.g., 'ollama.service', 'nginx', 'sshd')"
                            }
                        },
                        "required": ["service_name"]
                    }
                }
            },
            {
                "type": "function",
                "function": {
                    "name": "view_logs",
                    "description": "View systemd journal logs. Can filter by unit, time period, or priority.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "unit": {
                                "type": "string",
                                "description": "Systemd unit name to filter logs (e.g., 'ollama.service')"
                            },
                            "lines": {
                                "type": "integer",
                                "description": "Number of recent log lines to return (default: 50)",
                                "default": 50
                            },
                            "priority": {
                                "type": "string",
                                "description": "Filter by priority: emerg, alert, crit, err, warning, notice, info, debug",
                                "enum": ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]
                            }
                        }
                    }
                }
            },
            {
                "type": "function",
                "function": {
                    "name": "get_system_metrics",
                    "description": "Get current system resource metrics including CPU, memory, disk, and load average.",
                    "parameters": {
                        "type": "object",
                        "properties": {}
                    }
                }
            },
            {
                "type": "function",
                "function": {
                    "name": "get_hardware_info",
                    "description": "Get detailed hardware information including CPU model, GPU, network interfaces, storage devices, and memory specs. Returns comprehensive hardware inventory.",
                    "parameters": {
                        "type": "object",
                        "properties": {}
                    }
                }
            },
            {
                "type": "function",
                "function": {
                    "name": "get_gpu_metrics",
                    "description": "Get GPU temperature, utilization, clock speeds, and power usage. Works with AMD and NVIDIA GPUs. Returns current GPU metrics.",
                    "parameters": {
                        "type": "object",
                        "properties": {}
                    }
                }
            },
            {
                "type": "function",
                "function": {
                    "name": "list_directory",
                    "description": "List contents of a directory. Returns file names, sizes, and permissions.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "directory_path": {
                                "type": "string",
                                "description": "Absolute path to the directory (e.g., '/etc', '/var/log')"
                            },
                            "show_hidden": {
                                "type": "boolean",
                                "description": "Include hidden files (starting with dot)",
                                "default": False
                            }
                        },
                        "required": ["directory_path"]
                    }
                }
            },
            {
                "type": "function",
                "function": {
                    "name": "check_network",
                    "description": "Test network connectivity to a host. Can use ping or HTTP check.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "host": {
                                "type": "string",
                                "description": "Hostname or IP address to check (e.g., 'google.com', '8.8.8.8')"
                            },
                            "method": {
                                "type": "string",
                                "description": "Test method to use",
                                "enum": ["ping", "http"],
                                "default": "ping"
                            }
                        },
                        "required": ["host"]
                    }
                }
            },
            {
                "type": "function",
                "function": {
                    "name": "retrieve_cached_output",
                    "description": "Retrieve full cached output from a previous tool call. Use this when you need to see complete data that was summarized earlier. The cache_id is shown in hierarchical summaries.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "cache_id": {
                                "type": "string",
                                "description": "Cache ID from a previous tool summary (e.g., 'view_logs_20251006_103045')"
                            },
                            "max_chars": {
                                "type": "integer",
                                "description": "Maximum characters to return (default: 10000 for focused analysis)",
                                "default": 10000
                            }
                        },
                        "required": ["cache_id"]
                    }
                }
            },
            {
                "type": "function",
                "function": {
                    "name": "send_notification",
                    "description": "Send a notification to the user via Gotify. Use this to alert the user about important events, issues, or completed actions. Choose appropriate priority based on urgency.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "title": {
                                "type": "string",
                                "description": "Notification title (brief, e.g., 'Service Alert', 'Action Complete')"
                            },
                            "message": {
                                "type": "string",
                                "description": "Notification message body (detailed description of the event)"
                            },
                            "priority": {
                                "type": "integer",
                                "description": "Priority level: 2=Low (info), 5=Medium (attention needed), 8=High (critical/urgent)",
                                "enum": [2, 5, 8],
                                "default": 5
                            }
                        },
                        "required": ["title", "message"]
                    }
                }
            }
        ]

    def execute_command(self, command: str, timeout: int = 3600) -> Dict[str, Any]:
        """Execute a shell command safely (default timeout: 1 hour for system operations)"""
        # Safety check in safe mode
        if self.safe_mode:
            cmd_base = command.split()[0] if command.strip() else ""
            if cmd_base not in self.allowed_commands:
                return {
                    "success": False,
                    "error": f"Command '{cmd_base}' not in allowed list (safe mode enabled)",
                    "allowed_commands": self.allowed_commands
                }

        # Automatically configure SSH commands to use macha user on remote systems
        # Transform: ssh hostname cmd -> ssh macha@hostname sudo cmd
        if command.strip().startswith('ssh ') and '@' not in command.split()[1]:
            parts = command.split(maxsplit=2)
            if len(parts) >= 2:
                hostname = parts[1]
                remaining = ' '.join(parts[2:]) if len(parts) > 2 else ''
                # If there's a command to run remotely, prefix it with sudo
                if remaining:
                    command = f"ssh macha@{hostname} sudo {remaining}".strip()
                else:
                    command = f"ssh macha@{hostname}".strip()

        try:
            result = subprocess.run(
                command,
                shell=True,
                capture_output=True,
                text=True,
                timeout=timeout
            )

            return {
                "success": result.returncode == 0,
                "exit_code": result.returncode,
                "stdout": result.stdout,
                "stderr": result.stderr,
                "command": command
            }
        except subprocess.TimeoutExpired:
            return {
                "success": False,
                "error": f"Command timed out after {timeout} seconds",
                "command": command
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "command": command
            }

    def read_file(self, file_path: str, max_lines: int = 500) -> Dict[str, Any]:
        """Read a file safely"""
        try:
            path = Path(file_path)

            if not path.exists():
                return {
                    "success": False,
                    "error": f"File not found: {file_path}"
                }

            if not path.is_file():
                return {
                    "success": False,
                    "error": f"Not a file: {file_path}"
                }

            # Read file with line limit
            lines = []
            with open(path, 'r', errors='replace') as f:
                for i, line in enumerate(f):
                    if i >= max_lines:
                        lines.append(f"\n... truncated after {max_lines} lines ...")
                        break
                    lines.append(line.rstrip('\n'))

            return {
                "success": True,
                "content": '\n'.join(lines),
                "path": file_path,
                "lines_read": len(lines)
            }
        except PermissionError:
            return {
                "success": False,
                "error": f"Permission denied: {file_path}"
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e)
            }

    def check_service_status(self, service_name: str) -> Dict[str, Any]:
        """Check systemd service status"""
        # Ensure .service suffix
        if not service_name.endswith('.service'):
            service_name = f"{service_name}.service"

        # Get service status
        status_result = self.execute_command(f"systemctl status {service_name}")
        is_active_result = self.execute_command(f"systemctl is-active {service_name}")
        is_enabled_result = self.execute_command(f"systemctl is-enabled {service_name}")

        # Get recent logs
        logs_result = self.execute_command(f"journalctl -u {service_name} -n 10 --no-pager")

        return {
            "service": service_name,
            "active": is_active_result.get("stdout", "").strip() == "active",
            "enabled": is_enabled_result.get("stdout", "").strip() == "enabled",
            "status_output": status_result.get("stdout", ""),
            "recent_logs": logs_result.get("stdout", "")
        }

    def view_logs(
        self,
        unit: Optional[str] = None,
        lines: int = 50,
        priority: Optional[str] = None
    ) -> Dict[str, Any]:
        """View systemd journal logs"""
        cmd_parts = ["journalctl", "--no-pager"]

        if unit:
            cmd_parts.extend(["-u", unit])

        cmd_parts.extend(["-n", str(lines)])

        if priority:
            cmd_parts.extend(["-p", priority])

        command = " ".join(cmd_parts)
        result = self.execute_command(command)

        return {
            "logs": result.get("stdout", ""),
            "unit": unit,
            "lines": lines,
            "priority": priority
        }

    def get_system_metrics(self) -> Dict[str, Any]:
        """Get current system metrics"""
        # CPU and load
        uptime_result = self.execute_command("uptime")
        # Memory
        free_result = self.execute_command("free -h")
        # Disk
        df_result = self.execute_command("df -h")

        return {
            "uptime": uptime_result.get("stdout", ""),
            "memory": free_result.get("stdout", ""),
            "disk": df_result.get("stdout", "")
        }

    def get_hardware_info(self) -> Dict[str, Any]:
        """Get comprehensive hardware information"""
        hardware = {}

        # CPU info (use nix-shell for util-linux)
        cpu_result = self.execute_command("nix-shell -p util-linux --run lscpu")
        if cpu_result.get("success"):
            hardware["cpu"] = cpu_result.get("stdout", "")

        # Memory details
        mem_result = self.execute_command("free -h")
        if mem_result.get("success"):
            hardware["memory"] = mem_result.get("stdout", "")

        # GPU info (lspci for AMD/NVIDIA) - use nix-shell for pciutils
        gpu_result = self.execute_command("nix-shell -p pciutils --run \"lspci | grep -i 'vga\\|3d\\|display'\"")
        if gpu_result.get("success"):
            hardware["gpu"] = gpu_result.get("stdout", "")

        # Detailed GPU
        lspci_detailed = self.execute_command("nix-shell -p pciutils --run \"lspci -v | grep -A 20 -i 'vga\\|3d\\|display'\"")
        if lspci_detailed.get("success"):
            hardware["gpu_detailed"] = lspci_detailed.get("stdout", "")

        # Network interfaces
        net_result = self.execute_command("ip link show")
        if net_result.get("success"):
            hardware["network_interfaces"] = net_result.get("stdout", "")

        # Network addresses
        addr_result = self.execute_command("ip addr show")
        if addr_result.get("success"):
            hardware["network_addresses"] = addr_result.get("stdout", "")

        # Storage devices (use nix-shell for util-linux)
        storage_result = self.execute_command("nix-shell -p util-linux --run \"lsblk -o NAME,SIZE,TYPE,MOUNTPOINT,FSTYPE\"")
        if storage_result.get("success"):
            hardware["storage"] = storage_result.get("stdout", "")

        # PCI devices (comprehensive)
        pci_result = self.execute_command("nix-shell -p pciutils --run lspci")
        if pci_result.get("success"):
            hardware["pci_devices"] = pci_result.get("stdout", "")

        # USB devices
        usb_result = self.execute_command("nix-shell -p usbutils --run lsusb")
        if usb_result.get("success"):
            hardware["usb_devices"] = usb_result.get("stdout", "")

        # DMI/SMBIOS info (motherboard, system)
        dmi_result = self.execute_command("cat /sys/class/dmi/id/board_name /sys/class/dmi/id/board_vendor 2>/dev/null")
        if dmi_result.get("success"):
            hardware["motherboard"] = dmi_result.get("stdout", "")

        return hardware

    def get_gpu_metrics(self) -> Dict[str, Any]:
        """Get GPU metrics (temperature, utilization, clocks, power)"""
        metrics = {}

        # Try AMD GPU via sysfs (DRM/hwmon)
        try:
            # Find GPU hwmon directory
            import glob
            hwmon_dirs = glob.glob("/sys/class/drm/card*/device/hwmon/hwmon*")

            if hwmon_dirs:
                hwmon_path = hwmon_dirs[0]
                amd_metrics = {}

                # Temperature
                temp_files = glob.glob(f"{hwmon_path}/temp*_input")
                for temp_file in temp_files:
                    try:
                        with open(temp_file, 'r') as f:
                            temp_millidegrees = int(f.read().strip())
                            temp_celsius = temp_millidegrees / 1000
                            label = temp_file.split('/')[-1].replace('_input', '')
                            amd_metrics[f"{label}_celsius"] = temp_celsius
                    except (OSError, ValueError):
                        # Sensor node missing or unreadable
                        pass

                # GPU busy percent (utilization)
                gpu_busy_file = f"{hwmon_path.replace('/hwmon/hwmon', '')}/gpu_busy_percent"
                try:
                    with open(gpu_busy_file, 'r') as f:
                        amd_metrics["gpu_utilization_percent"] = int(f.read().strip())
                except (OSError, ValueError):
                    pass

                # Power usage
                power_files = glob.glob(f"{hwmon_path}/power*_average")
                for power_file in power_files:
                    try:
                        with open(power_file, 'r') as f:
                            power_microwatts = int(f.read().strip())
                            power_watts = power_microwatts / 1000000
                            amd_metrics["power_watts"] = power_watts
                    except (OSError, ValueError):
                        pass

                # Clock speeds
                sclk_file = f"{hwmon_path.replace('/hwmon/hwmon', '')}/pp_dpm_sclk"
                try:
                    with open(sclk_file, 'r') as f:
                        sclk_data = f.read()
                        amd_metrics["gpu_clocks"] = sclk_data.strip()
                except OSError:
                    pass

                if amd_metrics:
                    metrics["amd_gpu"] = amd_metrics
        except Exception as e:
            metrics["amd_sysfs_error"] = str(e)

        # Try rocm-smi for AMD
        rocm_result = self.execute_command("nix-shell -p rocmPackages.rocm-smi --run 'rocm-smi --showtemp --showuse --showpower'")
        if rocm_result.get("success"):
            metrics["rocm_smi"] = rocm_result.get("stdout", "")

        # Try nvidia-smi for NVIDIA
        nvidia_result = self.execute_command("nix-shell -p linuxPackages.nvidia_x11 --run 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,clocks.gr --format=csv'")
        if nvidia_result.get("success") and "NVIDIA" in nvidia_result.get("stdout", ""):
            metrics["nvidia_smi"] = nvidia_result.get("stdout", "")

        # Fallback: try sensors command
        if not metrics.get("amd_gpu") and not metrics.get("nvidia_smi"):
            sensors_result = self.execute_command("nix-shell -p lm_sensors --run sensors")
            if sensors_result.get("success"):
                metrics["sensors"] = sensors_result.get("stdout", "")

        return metrics

    def list_directory(
        self,
        directory_path: str,
        show_hidden: bool = False
    ) -> Dict[str, Any]:
        """List directory contents"""
        cmd = "ls -lh"
        if show_hidden:
            cmd += "a"
        cmd += f" {directory_path}"

        result = self.execute_command(cmd)

        return {
            "success": result.get("success", False),
            "directory": directory_path,
            "listing": result.get("stdout", ""),
            "error": result.get("error")
        }

    def check_network(self, host: str, method: str = "ping") -> Dict[str, Any]:
        """Check network connectivity"""
        if method == "ping":
            cmd = f"ping -c 3 -W 2 {host}"
        elif method == "http":
            cmd = f"curl -I -m 5 {host}"
        else:
            return {
                "success": False,
                "error": f"Unknown method: {method}"
            }

        result = self.execute_command(cmd, timeout=10)

        return {
            "host": host,
            "method": method,
            "reachable": result.get("success", False),
            "output": result.get("stdout", ""),
            "error": result.get("stderr", "")
        }

    def retrieve_cached_output(self, cache_id: str, max_chars: int = 10000) -> Dict[str, Any]:
        """Retrieve full cached output from a previous tool call"""
        cache_dir = Path("/var/lib/macha/tool_cache")
        cache_file = cache_dir / f"{cache_id}.txt"

        if not cache_file.exists():
            return {
                "success": False,
                "error": f"Cache file not found: {cache_id}",
                "hint": "Check that the cache_id matches exactly what was shown in the summary"
            }

        try:
            content = cache_file.read_text()
            original_size = len(content)

            # Truncate if still too large for context
            if original_size > max_chars:
                half = max_chars // 2
                content = (
                    content[:half] +
                    f"\n... [SHOWING {max_chars} of {original_size} chars] ...\n" +
                    content[-half:]
                )

            return {
                "success": True,
                "cache_id": cache_id,
                "size": original_size,  # Original size, before truncation
                "content": content
            }
        except Exception as e:
            return {
                "success": False,
                "error": f"Failed to read cache: {str(e)}"
            }

    def send_notification(self, title: str, message: str, priority: int = 5) -> Dict[str, Any]:
        """Send a notification to the user via Gotify using macha-notify command"""
        try:
            # Use the macha-notify command which handles Gotify integration
            result = subprocess.run(
                ['macha-notify', title, message, str(priority)],
                capture_output=True,
                text=True,
                timeout=10
            )

            if result.returncode == 0:
                return {
                    "success": True,
                    "title": title,
                    "message": message,
                    "priority": priority,
                    "output": result.stdout.strip() if result.stdout else "Notification sent successfully"
                }
            else:
                return {
                    "success": False,
                    "error": f"macha-notify failed: {result.stderr.strip() if result.stderr else 'Unknown error'}",
                    "hint": "Check if Gotify is configured (gotifyUrl and gotifyToken in module config)"
                }
        except FileNotFoundError:
            return {
                "success": False,
                "error": "macha-notify command not found",
                "hint": "This should not happen - macha-notify is installed by the module"
            }
        except subprocess.TimeoutExpired:
            return {
                "success": False,
                "error": "Notification send timeout (10s)"
            }
        except Exception as e:
            return {
                "success": False,
                "error": f"Unexpected error sending notification: {str(e)}"
            }

    def execute_tool(self, tool_name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
        """Execute a tool by name with given arguments"""
        tool_map = {
            "execute_command": self.execute_command,
            "read_file": self.read_file,
            "check_service_status": self.check_service_status,
            "view_logs": self.view_logs,
            "get_system_metrics": self.get_system_metrics,
            "get_hardware_info": self.get_hardware_info,
            "get_gpu_metrics": self.get_gpu_metrics,
            "list_directory": self.list_directory,
            "check_network": self.check_network,
            "retrieve_cached_output": self.retrieve_cached_output,
            "send_notification": self.send_notification
        }

        tool_func = tool_map.get(tool_name)
        if not tool_func:
            return {
                "success": False,
                "error": f"Unknown tool: {tool_name}"
            }

        try:
            return tool_func(**arguments)
        except Exception as e:
            return {
                "success": False,
                "error": f"Tool execution failed: {str(e)}",
                "tool": tool_name,
                "arguments": arguments
            }
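A minimal usage sketch of the dispatcher: `execute_tool()` is the entry point an orchestrator would use when the model emits a tool call (service and command here are illustrative):

```python
# Minimal sketch of driving SysadminTools directly and via the
# name-based dispatcher used for model-emitted tool calls.
from tools import SysadminTools

tools = SysadminTools(safe_mode=True)

# Direct call
status = tools.check_service_status("ollama")
print(status["active"], status["enabled"])

# Dispatch by name, as when the model emits a tool call
result = tools.execute_tool("execute_command", {"command": "uptime"})
print(result["stdout"] if result.get("success") else result.get("error"))
```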