feat: CI/CD pipeline setup complete - Ansible playbooks updated, secrets configured, workflow ready

2025-10-31 01:39:24 +01:00
parent 55c04e4fd0
commit e26eb2aa12
601 changed files with 44184 additions and 32477 deletions
--- a/deployment/stacks/monitoring/.env.example
+++ b/deployment/stacks/monitoring/.env.example
@@ -0,0 +1,21 @@
+# Monitoring Stack Environment Configuration
+# Copy to .env and configure with your actual values
+
+# Domain Configuration
+DOMAIN=michaelschiemer.de
+
+# Grafana Configuration
+GRAFANA_ADMIN_USER=admin
+GRAFANA_ADMIN_PASSWORD=changeme_secure_password
+
+# Grafana Plugins (comma-separated)
+# Common useful plugins:
+# - grafana-clock-panel
+# - grafana-piechart-panel
+# - grafana-worldmap-panel
+GRAFANA_PLUGINS=
+
+# Prometheus BasicAuth
+# Generate with: htpasswd -nb admin password
+# Format: username:hashed_password
+PROMETHEUS_AUTH=admin:$$apr1$$xyz...
--- a/deployment/stacks/monitoring/README.md
+++ b/deployment/stacks/monitoring/README.md
@@ -0,0 +1,751 @@
+# Stack 6: Monitoring (Portainer + Grafana + Prometheus)
+
+Comprehensive monitoring stack for infrastructure and application observability.
+
+## Overview
+
+This stack provides complete monitoring and visualization capabilities for the entire infrastructure:
+- **Prometheus**: Time-series metrics collection and alerting
+- **Grafana**: Metrics visualization with pre-configured dashboards
+- **Portainer**: Container management UI
+- **Node Exporter**: Host system metrics (CPU, memory, disk, network)
+- **cAdvisor**: Container resource usage metrics
+- **Alerting**: Alert rules evaluated by Prometheus's built-in rule engine (this stack's docker-compose.yml defines no separate Alertmanager service)
+
+## Features
+
+### Prometheus
+- Multi-target scraping (node-exporter, cadvisor, traefik)
+- 15-second scrape interval for near real-time metrics
+- 30-day retention period (`--storage.tsdb.retention.time=30d` in docker-compose.yml)
+- Pre-configured alert rules for critical conditions
+- Built-in alerting engine
+- Service discovery via static configs
+- HTTPS support with BasicAuth protection
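+
+The scrape settings above live in `prometheus/prometheus.yml`, which this README does not reproduce. A minimal sketch of how such a file might look, assuming the Compose service names as target hosts and Traefik metrics exposed on `traefik:8080` (the Traefik host and port are assumptions, not taken from this stack):
+
+```yaml
+global:
+  scrape_interval: 15s        # matches the interval listed above
+  evaluation_interval: 15s    # assumed; how often alert rules are evaluated
+
+rule_files:
+  - /etc/prometheus/alerts.yml   # path docker-compose.yml mounts alerts.yml to
+
+scrape_configs:
+  - job_name: node-exporter
+    static_configs:
+      - targets: ['node-exporter:9100']
+  - job_name: cadvisor
+    static_configs:
+      - targets: ['cadvisor:8080']
+  - job_name: traefik
+    static_configs:
+      - targets: ['traefik:8080']   # assumed host and port
+```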
+
+### Grafana
+- Pre-configured Prometheus datasource
+- Three comprehensive dashboards:
+  - **Docker Containers**: Container CPU, memory, network I/O, restarts
+  - **Host System**: System CPU, memory, disk, network, uptime
+  - **Traefik**: Request rates, response times, status codes, error rates
+- Auto-provisioning (no manual configuration needed)
+- HTTPS access via Traefik
+- 30-second auto-refresh
+- Dark theme for reduced eye strain
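+
+Auto-provisioning relies on the files under `grafana/provisioning/`. A minimal sketch of what `datasources/prometheus.yml` could contain, assuming Grafana reaches Prometheus at `http://prometheus:9090` over the shared Docker network. The datasource name has to match the `"datasource": "Prometheus"` references in the dashboard JSON files:
+
+```yaml
+apiVersion: 1
+
+datasources:
+  - name: Prometheus              # must match the name used by the dashboards
+    type: prometheus
+    access: proxy                 # Grafana's backend performs the queries
+    url: http://prometheus:9090   # assumed: Compose service name and default port
+    isDefault: true
+    editable: false               # assumed; keeps the provisioned datasource read-only in the UI
+```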
+
+### Portainer
+- Web-based Docker management UI
+- Container start/stop/restart/logs
+- Stack management and deployment
+- Volume and network management
+- Resource usage visualization
+- HTTPS access via Traefik
+
+### Node Exporter
+- Host system metrics:
+  - CPU usage by core and mode
+  - Memory usage and available memory
+  - Disk usage by filesystem
+  - Network I/O by interface
+  - System load averages
+  - System uptime
+
+### cAdvisor
+- Container metrics:
+  - CPU usage per container
+  - Memory usage per container
+  - Network I/O per container
+  - Disk I/O per container
+  - Container restart counts
+  - Container health status
+
+## Services
+
+| Service | Domain | Port | Purpose |
+|---------|--------|------|---------|
+| Grafana | grafana.michaelschiemer.de | 3000 | Metrics visualization |
+| Prometheus | prometheus.michaelschiemer.de | 9090 | Metrics collection |
+| Portainer | portainer.michaelschiemer.de | 9000/9443 | Container management |
+| Node Exporter | - | 9100 | Host metrics (internal) |
+| cAdvisor | - | 8080 | Container metrics (internal) |
+
+## Prerequisites
+
+- Traefik stack deployed and running (Stack 1)
+- Docker networks: `traefik-public`, `app-internal` (both are declared `external` in docker-compose.yml, so they must already exist)
+- Docker Swarm initialized (if using swarm mode)
+- Domain DNS configured (grafana/prometheus/portainer subdomains)
+
+## Directory Structure
+
+```
+monitoring/
+├── docker-compose.yml              # Main stack definition
+├── .env.example                    # Environment template
+├── prometheus/
+│   ├── prometheus.yml             # Prometheus configuration
+│   └── alerts.yml                 # Alert rules
+├── grafana/
+│   ├── provisioning/
+│   │   ├── datasources/
+│   │   │   └── prometheus.yml     # Auto-configured datasource
+│   │   └── dashboards/
+│   │       └── dashboard.yml      # Dashboard provisioning
+│   └── dashboards/
+│       ├── docker-containers.json # Container metrics dashboard
+│       ├── host-system.json       # Host metrics dashboard
+│       └── traefik.json          # Traefik metrics dashboard
+└── README.md                       # This file
+```
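+
+`dashboard.yml` tells Grafana where to load the JSON dashboards from. A sketch assuming a single file provider: the provider name and update interval are illustrative, while the path is the one docker-compose.yml mounts `./grafana/dashboards` to:
+
+```yaml
+apiVersion: 1
+
+providers:
+  - name: default                 # illustrative provider name
+    orgId: 1
+    type: file
+    disableDeletion: false
+    updateIntervalSeconds: 30     # assumed; how often Grafana rescans the directory
+    options:
+      path: /var/lib/grafana/dashboards   # container path from docker-compose.yml
+```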
+
+## Configuration
+
+### 1. Create Environment File
+
+```bash
+cp .env.example .env
+```
+
+### 2. Configure Environment Variables
+
+Edit `.env` and set the following variables:
+
+```bash
+# Domain Configuration
+DOMAIN=michaelschiemer.de
+
+# Grafana Configuration
+GRAFANA_ADMIN_USER=admin
+GRAFANA_ADMIN_PASSWORD=<generate-strong-password>
+
+# Grafana Plugins (comma-separated, may be empty)
+GRAFANA_PLUGINS=
+
+# Prometheus BasicAuth for the Traefik middleware
+# Format: username:hashed_password
+PROMETHEUS_AUTH=admin:<htpasswd-hash>
+```
+
+### 3. Generate Strong Passwords
+
+```bash
+# Generate random passwords
+openssl rand -base64 32
+
+# For Prometheus BasicAuth (bcrypt hash)
+docker run --rm httpd:alpine htpasswd -nbB admin "your-password" | cut -d ":" -f 2
+```
+
+### 4. Update Traefik BasicAuth (Optional)
+
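+docker-compose.yml reads the BasicAuth credentials from `${PROMETHEUS_AUTH}` (format `username:hashed_password`, see `.env.example`). If you hardcode them in the Traefik label instead, escape every `$` in the hash as `$$`: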
+
+```yaml
+- "traefik.http.middlewares.prometheus-auth.basicauth.users=admin:$$2y$$05$$..."
+```
+
+## Deployment
+
+### Deploy Stack
+
+```bash
+cd /home/michael/dev/michaelschiemer/deployment/stacks/monitoring
+
+# Deploy with Docker Compose
+docker compose up -d
+
+# Or with Docker Stack (Swarm mode)
+docker stack deploy -c docker-compose.yml monitoring
+```
+
+### Verify Deployment
+
+```bash
+# Check running containers
+docker compose ps
+
+# Check service logs
+docker compose logs -f grafana
+docker compose logs -f prometheus
+
+# Check Prometheus targets
+curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/targets
+```
+
+### Initial Access
+
+1. **Grafana**: https://grafana.michaelschiemer.de
+   - Login: `admin` / `<GRAFANA_ADMIN_PASSWORD>`
+   - Dashboards are pre-loaded and ready to use
+
+2. **Prometheus**: https://prometheus.michaelschiemer.de
+   - BasicAuth: the username and password whose hash is stored in `PROMETHEUS_AUTH`
+   - Check targets at `/targets`
+   - View alerts at `/alerts`
+
+3. **Portainer**: https://portainer.michaelschiemer.de
+   - First login: Set admin password
+   - Connect to local Docker environment
+
+## Usage
+
+### Grafana Dashboards
+
+#### Docker Containers Dashboard
+Access: https://grafana.michaelschiemer.de/d/docker-containers
+
+**Metrics Displayed**:
+- Container CPU Usage % (per container, timeseries)
+- Container Memory Usage (bytes per container, timeseries)
+- Containers Running (current count, stat)
+- Container Restarts in 5m (rate with thresholds, stat)
+- Container Network I/O (RX/TX per container, timeseries)
+
+**Use Cases**:
+- Identify containers with high resource usage
+- Monitor container stability (restart rates)
+- Track network bandwidth consumption
+- Verify all expected containers are running
+
+#### Host System Dashboard
+Access: https://grafana.michaelschiemer.de/d/host-system
+
+**Metrics Displayed**:
+- CPU Usage % (historical and current)
+- Memory Usage % (historical and current)
+- Disk Usage % (root filesystem, historical and current)
+- Network I/O (RX/TX by interface)
+- System Uptime (seconds since boot)
+
+**Thresholds**:
+- Green: < 80% usage
+- Yellow: 80-90% usage
+- Red: > 90% usage
+
+**Use Cases**:
+- Monitor server health and resource utilization
+- Identify resource bottlenecks
+- Plan capacity upgrades
+- Track system stability (uptime)
+
+#### Traefik Dashboard
+Access: https://grafana.michaelschiemer.de/d/traefik
+
+**Metrics Displayed**:
+- Request Rate by Service (req/s, timeseries)
+- Response Time p95/p99 (milliseconds, timeseries)
+- HTTP Status Codes (2xx/4xx/5xx stacked, color-coded)
+- Service Status (Up/Down per service)
+- Requests per Minute (total)
+- 4xx Error Rate (percentage)
+- 5xx Error Rate (percentage)
+- Active Services (count)
+
+**Thresholds**:
+- 4xx errors: Green < 5%, Yellow < 10%, Red ≥ 10%
+- 5xx errors: Green < 1%, Yellow < 5%, Red ≥ 5%
+
+**Use Cases**:
+- Monitor HTTP traffic patterns
+- Identify performance issues (high latency)
+- Track error rates and types
+- Verify service availability
+
+### Prometheus Queries
+
+#### Common PromQL Examples
+
+**CPU Usage**:
+```promql
+# Overall CPU usage
+100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
+
+# Per-core CPU usage
+100 - (avg by (cpu) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
+```
+
+**Memory Usage**:
+```promql
+# Memory usage percentage
+100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)
+
+# Memory available in GB
+node_memory_MemAvailable_bytes / 1024 / 1024 / 1024
+```
+
+**Disk Usage**:
+```promql
+# Disk usage percentage
+100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100)
+
+# Disk I/O utilization (fraction of each second the device was busy)
+rate(node_disk_io_time_seconds_total[5m])
+```
+
+**Container Metrics**:
+```promql
+# Container CPU usage
+sum(rate(container_cpu_usage_seconds_total{name!~".*exporter.*"}[5m])) by (name) * 100
+
+# Container memory usage
+sum(container_memory_usage_bytes{name!~".*exporter.*"}) by (name)
+
+# Container network I/O
+rate(container_network_receive_bytes_total[5m])
+rate(container_network_transmit_bytes_total[5m])
+```
+
+**Traefik Metrics**:
+```promql
+# Request rate by service
+sum(rate(traefik_service_requests_total[5m])) by (service)
+
+# Response time percentiles
+histogram_quantile(0.95, sum(rate(traefik_service_request_duration_seconds_bucket[5m])) by (service, le))
+
+# Error rate
+sum(rate(traefik_service_requests_total{code=~"5.."}[5m])) / sum(rate(traefik_service_requests_total[5m])) * 100
+```
+
+### Alert Management
+
+#### Configured Alerts
+
+Alerts are defined in `prometheus/alerts.yml`:
+
+1. **HostHighCPU**: CPU usage > 80% for 5 minutes
+2. **HostHighMemory**: Memory usage > 80% for 5 minutes
+3. **HostDiskSpaceLow**: Disk usage > 80%
+4. **ContainerHighCPU**: Container CPU > 80% for 5 minutes
+5. **ContainerHighMemory**: Container memory > 80% for 5 minutes
+6. **ServiceDown**: Service unavailable
+7. **HighErrorRate**: Error rate > 5% for 5 minutes
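+
+The contents of `prometheus/alerts.yml` are not reproduced in this README. As a rough sketch, two of the rules above might be written as follows. The group name, `severity` labels, and annotation texts are assumptions, and the expressions reuse the PromQL examples from this document:
+
+```yaml
+groups:
+  - name: monitoring-alerts        # hypothetical group name
+    rules:
+      - alert: HostHighCPU
+        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
+        for: 5m                    # condition must hold for 5 minutes before firing
+        labels:
+          severity: warning        # assumed label
+        annotations:
+          summary: "CPU usage above 80% on {{ $labels.instance }}"
+
+      - alert: HighErrorRate
+        expr: |
+          sum(rate(traefik_service_requests_total{code=~"5.."}[5m]))
+            / sum(rate(traefik_service_requests_total[5m])) * 100 > 5
+        for: 5m
+        labels:
+          severity: critical       # assumed label
+        annotations:
+          summary: "5xx error rate above 5% across Traefik services"
+```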
+
+#### View Active Alerts
+
+```bash
+# Via Prometheus UI
+https://prometheus.michaelschiemer.de/alerts
+
+# Via API
+curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/alerts
+
+# Check alert rules
+curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/rules
+```
+
+#### Silence Alerts
+
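+Prometheus itself has no silencing API: silences are an Alertmanager feature, and `/api/v1/alerts` is read-only. This stack's docker-compose.yml does not define an Alertmanager service, so the command below only applies once one is added (the `http://alertmanager:9093` URL is an assumption):
+
+```bash
+# Silence HostHighCPU for one hour via Alertmanager's amtool
+amtool silence add alertname=HostHighCPU \
+  --duration=1h \
+  --comment="planned maintenance" \
+  --alertmanager.url=http://alertmanager:9093
+```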
+
+### Portainer Usage
+
+#### Container Management
+
+1. Navigate to https://portainer.michaelschiemer.de
+2. Select "Local" environment
+3. Go to "Containers" section
+4. Available actions:
+   - Start/Stop/Restart containers
+   - View logs (live stream)
+   - Inspect container details
+   - Execute commands in containers
+   - View resource statistics
+
+#### Stack Management
+
+1. Go to "Stacks" section
+2. View deployed stacks
+3. Actions available:
+   - View stack definition
+   - Update stack (edit compose file)
+   - Stop/Start entire stack
+   - Remove stack
+
+#### Volume Management
+
+1. Go to "Volumes" section
+2. View volume details and size
+3. Browse volume contents
+4. Backup/restore volumes
+
+## Integration with Other Stacks
+
+### Stack 1: Traefik
+- Provides HTTPS reverse proxy for Grafana, Prometheus, Portainer
+- Automatic SSL certificate management
+- BasicAuth middleware for Prometheus
+
+### Stack 2: Gitea
+- Monitor Gitea container resources
+- Track HTTP requests to Gitea via Traefik dashboard
+- Alert on Gitea service downtime
+
+### Stack 3: Docker Registry
+- Monitor registry container resources
+- Track registry HTTP requests
+- Alert on registry unavailability
+
+### Stack 4: Application
+- Monitor PHP-FPM, Nginx, Redis, Worker containers
+- Track application response times
+- Monitor queue worker health
+
+### Stack 5: PostgreSQL
+- Monitor database container resources
+- Track PostgreSQL metrics (if postgres_exporter added)
+- Alert on database unavailability
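+
+If a postgres_exporter container is added, Prometheus also needs a scrape job for it. A sketch for `prometheus/prometheus.yml`, assuming a hypothetical service name `postgres-exporter` on a network Prometheus can reach and the exporter's default port 9187:
+
+```yaml
+scrape_configs:
+  # ...existing jobs stay as they are...
+  - job_name: postgres
+    static_configs:
+      - targets: ['postgres-exporter:9187']   # hypothetical service name
+```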
+
+## Monitoring Best Practices
+
+### 1. Regular Dashboard Review
+- Check dashboards daily for anomalies
+- Review error rates and response times
+- Monitor resource utilization trends
+
+### 2. Alert Configuration
+- Tune alert thresholds based on baseline metrics
+- Avoid alert fatigue (too many non-critical alerts)
+- Document alert response procedures
+
+### 3. Capacity Planning
+- Review resource usage trends weekly
+- Plan capacity upgrades before hitting limits
+- Monitor growth rates for proactive scaling
+
+### 4. Performance Optimization
+- Identify containers with high resource usage
+- Optimize slow endpoints (high p95/p99 latency)
+- Balance load across services
+
+### 5. Security Monitoring
+- Monitor failed authentication attempts
+- Track unusual traffic patterns
+- Review service availability trends
+
+## Troubleshooting
+
+### Grafana Issues
+
+#### Dashboard Not Loading
+```bash
+# Check Grafana logs
+docker compose logs grafana
+
+# Check Grafana health (port 3000 is not published to the host, so run inside the container)
+docker compose exec grafana wget -qO- http://localhost:3000/api/health
+
+# Restart Grafana
+docker compose restart grafana
+```
+
+#### Missing Metrics
+```bash
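+# Check Prometheus targets from inside the Grafana container
+# ("prometheus" only resolves on the Docker network, not on the host)
+docker compose exec grafana wget -qO- http://prometheus:9090/api/v1/targets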
+
+# Verify Prometheus is scraping
+docker compose logs prometheus | grep "Scrape"
+
+# Check network connectivity
+docker compose exec grafana ping prometheus
+```
+
+### Prometheus Issues
+
+#### Targets Down
+```bash
+# Check target status
+curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/targets
+
+# Verify target services are running
+docker compose ps
+
+# Check Prometheus configuration
+docker compose exec prometheus cat /etc/prometheus/prometheus.yml
+
+# Reload configuration
+curl -X POST -u admin:password https://prometheus.michaelschiemer.de/-/reload
+```
+
+#### High Memory Usage
+```bash
+# Check Prometheus memory
+docker stats prometheus
+
+# Reduce retention period in docker-compose.yml:
+# --storage.tsdb.retention.time=7d
+
+# Reduce scrape interval in prometheus.yml:
+# scrape_interval: 30s
+```
+
+### Node Exporter Issues
+
+#### No Host Metrics
+```bash
+# Check node-exporter is running
+docker compose ps node-exporter
+
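+# Test metrics endpoint (port 9100 is not published to the host, so run inside the container)
+docker compose exec node-exporter wget -qO- http://localhost:9100/metrics | head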
+
+# Check Prometheus scraping
+docker compose logs prometheus | grep node-exporter
+```
+
+### cAdvisor Issues
+
+#### No Container Metrics
+```bash
+# Check cAdvisor is running
+docker compose ps cadvisor
+
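+# Test metrics endpoint (port 8080 is not published to the host, so run inside the container)
+docker compose exec cadvisor wget -qO- http://localhost:8080/metrics | head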
+
+# Verify Docker socket mount
+docker compose exec cadvisor ls -la /var/run/docker.sock
+```
+
+### Portainer Issues
+
+#### Cannot Access UI
+```bash
+# Check Portainer is running
+docker compose ps portainer
+
+# Check Traefik routing
+docker compose -f ../traefik/docker-compose.yml logs
+
+# Verify the Traefik network exists
+docker network ls | grep traefik-public
+```
+
+#### Cannot Connect to Docker
+```bash
+# Verify Docker socket permissions
+ls -la /var/run/docker.sock
+
+# Check Portainer logs
+docker compose logs portainer
+
+# Restart Portainer
+docker compose restart portainer
+```
+
+## Performance Tuning
+
+### Prometheus Optimization
+
+#### Reduce Memory Usage
+```yaml
+# In docker-compose.yml, adjust retention:
+command:
+  - '--storage.tsdb.retention.time=7d'  # Reduce from 30d
+  - '--storage.tsdb.retention.size=5GB' # Add size limit
+```
+
+#### Optimize Scrape Intervals
+```yaml
+# In prometheus/prometheus.yml:
+global:
+  scrape_interval: 30s  # Increase from 15s for less load
+  evaluation_interval: 30s
+```
+
+#### Reduce Cardinality
+```yaml
+# In prometheus/prometheus.yml, add metric_relabel_configs:
+metric_relabel_configs:
+  - source_labels: [__name__]
+    regex: 'unused_metric_.*'
+    action: drop
+```
+
+### Grafana Optimization
+
+#### Reduce Query Load
+```json
+// In dashboard JSON, adjust refresh rate:
+"refresh": "1m"  // Increase from 30s
+```
+
+#### Optimize Panel Queries
+- Use recording rules for expensive queries
+- Reduce time range for heavy queries
+- Use appropriate resolution (step parameter)
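+
+A recording rule precomputes an expensive expression so that dashboards query the stored result instead. A sketch that could be added to a rule file listed under `rule_files`, using the container CPU query from this document. The group name and the recorded series name are assumptions:
+
+```yaml
+groups:
+  - name: container-recording      # hypothetical group name
+    interval: 30s                  # how often the rule is evaluated
+    rules:
+      - record: name:container_cpu_usage_percent:rate5m
+        expr: sum by (name) (rate(container_cpu_usage_seconds_total{name!~".*exporter.*"}[5m])) * 100
+```
+
+A panel would then query `name:container_cpu_usage_percent:rate5m` directly rather than re-evaluating the `rate()` on every refresh.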
+
+### Storage Optimization
+
+#### Prometheus Data Volume
+```bash
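+# Check current size (prometheus-data is a named volume, not a bind mount)
+docker system df -v | grep prometheus-data
+
+# Remove already-deleted series from disk
+# (requires Prometheus to be started with --web.enable-admin-api, which docker-compose.yml does not set)
+docker compose exec prometheus wget -qO- --post-data='' http://localhost:9090/api/v1/admin/tsdb/clean_tombstones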
+```
+
+#### Grafana Data Volume
+```bash
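+# Check current size (grafana-data is a named volume, not a bind mount)
+docker system df -v | grep grafana-data
+
+# Reset the admin password (this does not clean sessions or free disk space)
+docker compose exec grafana grafana-cli admin reset-admin-password "<new-password>"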
+```
+
+## Security Considerations
+
+### 1. Password Security
+- Use strong, randomly generated passwords
+- Store passwords securely (password manager)
+- Rotate passwords regularly
+- Use bcrypt for Prometheus BasicAuth
+
+### 2. Network Security
+- Node Exporter and cAdvisor are attached only to the `app-internal` network and publish no host ports
+- Traefik handles SSL/TLS termination
+- BasicAuth protects Prometheus UI
+- Grafana requires login for dashboard access
+
+### 3. Access Control
+- Limit Grafana admin access
+- Use Grafana organizations for multi-tenancy
+- Configure Prometheus with read-only access where possible
+- Restrict Portainer access to trusted users
+
+### 4. Data Security
+- Prometheus stores metrics in plain text
+- Grafana encrypts passwords in database
+- Backup volumes contain sensitive data
+- Secure backups with encryption
+
+### 5. Container Security
+- Use official Docker images
+- Keep images updated (security patches)
+- Run containers as non-root where possible
+- Limit container capabilities
+
+## Backup and Recovery
+
+### Backup Prometheus Data
+```bash
+# Stop Prometheus
+docker compose stop prometheus
+
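+# Backup data volume (prometheus-data is a named volume, so archive it through a helper container)
+docker run --rm -v prometheus-data:/data:ro -v "$(pwd)":/backup alpine \
+  tar czf "/backup/prometheus-backup-$(date +%Y%m%d).tar.gz" -C /data .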
+
+# Restart Prometheus
+docker compose start prometheus
+```
+
+### Backup Grafana Data
+```bash
+# Backup Grafana database and dashboards (-T disables the pseudo-TTY so the binary stream is not corrupted)
+docker compose exec -T grafana tar czf - /var/lib/grafana > grafana-backup-$(date +%Y%m%d).tar.gz
+```
+
+### Restore from Backup
+```bash
+# Stop services
+docker compose down
+
+# Restore Prometheus data into the named volume
+docker run --rm -v prometheus-data:/data -v "$(pwd)":/backup alpine \
+  tar xzf /backup/prometheus-backup-YYYYMMDD.tar.gz -C /data
+
+# Restore Grafana data
+docker compose up -d grafana
+docker compose exec -T grafana tar xzf - -C / < grafana-backup-YYYYMMDD.tar.gz
+
+# Start all services
+docker compose up -d
+```
+
+## Maintenance
+
+### Regular Tasks
+
+#### Daily
+- Review dashboards for anomalies
+- Check active alerts
+- Verify all services are running
+
+#### Weekly
+- Review resource usage trends
+- Check disk space usage
+- Update passwords if needed
+
+#### Monthly
+- Review and update alert rules
+- Optimize slow queries
+- Clean up old data if needed
+- Update Docker images
+
+### Update Procedure
+
+```bash
+# Pull latest images
+docker compose pull
+
+# Recreate containers with new images
+docker compose up -d
+
+# Verify services are healthy
+docker compose ps
+docker compose logs -f
+```
+
+## Support
+
+### Documentation
+- Prometheus: https://prometheus.io/docs/
+- Grafana: https://grafana.com/docs/
+- Portainer: https://docs.portainer.io/
+
+### Logs
+```bash
+# View all logs
+docker compose logs
+
+# Follow specific service logs
+docker compose logs -f grafana
+docker compose logs -f prometheus
+
+# View last 100 lines
+docker compose logs --tail=100
+```
+
+### Health Checks
+```bash
+# Check service health
+docker compose ps
+
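+# Test endpoints (no ports are published to the host, so run inside the containers)
+docker compose exec prometheus wget -qO- http://localhost:9090/-/healthy   # Prometheus
+docker compose exec grafana wget -qO- http://localhost:3000/api/health      # Grafana
+
+# Check metrics
+docker compose exec node-exporter wget -qO- http://localhost:9100/metrics | head   # Node Exporter
+docker compose exec cadvisor wget -qO- http://localhost:8080/metrics | head        # cAdvisor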
+```
+
+---
+
+**Stack Version**: 1.0
+**Last Updated**: 2025-01-30
+**Maintained By**: DevOps Team
--- a/deployment/stacks/monitoring/docker-compose.yml
+++ b/deployment/stacks/monitoring/docker-compose.yml
@@ -0,0 +1,147 @@
+services:
+  portainer:
+    image: portainer/portainer-ce:latest
+    container_name: portainer
+    restart: unless-stopped
+    networks:
+      - traefik-public
+    volumes:
+      - /var/run/docker.sock:/var/run/docker.sock:ro
+      - portainer-data:/data
+    labels:
+      - "traefik.enable=true"
+      - "traefik.http.routers.portainer.rule=Host(`portainer.${DOMAIN}`)"
+      - "traefik.http.routers.portainer.entrypoints=websecure"
+      - "traefik.http.routers.portainer.tls=true"
+      - "traefik.http.routers.portainer.tls.certresolver=letsencrypt"
+      - "traefik.http.services.portainer.loadbalancer.server.port=9000"
+    healthcheck:
+      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:9000/api/system/status"]
+      interval: 30s
+      timeout: 10s
+      retries: 3
+      start_period: 40s
+
+  prometheus:
+    image: prom/prometheus:latest
+    container_name: prometheus
+    restart: unless-stopped
+    user: "65534:65534"
+    networks:
+      - traefik-public
+      - app-internal
+    volumes:
+      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
+      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml:ro
+      - prometheus-data:/prometheus
+    command:
+      - '--config.file=/etc/prometheus/prometheus.yml'
+      - '--storage.tsdb.path=/prometheus'
+      - '--storage.tsdb.retention.time=30d'
+      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
+      - '--web.console.templates=/usr/share/prometheus/consoles'
+      - '--web.enable-lifecycle'
+    labels:
+      - "traefik.enable=true"
+      - "traefik.http.routers.prometheus.rule=Host(`prometheus.${DOMAIN}`)"
+      - "traefik.http.routers.prometheus.entrypoints=websecure"
+      - "traefik.http.routers.prometheus.tls=true"
+      - "traefik.http.routers.prometheus.tls.certresolver=letsencrypt"
+      - "traefik.http.routers.prometheus.middlewares=prometheus-auth"
+      - "traefik.http.middlewares.prometheus-auth.basicauth.users=${PROMETHEUS_AUTH}"
+      - "traefik.http.services.prometheus.loadbalancer.server.port=9090"
+    healthcheck:
+      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:9090/-/healthy"]
+      interval: 30s
+      timeout: 10s
+      retries: 3
+
+  grafana:
+    image: grafana/grafana:latest
+    container_name: grafana
+    restart: unless-stopped
+    networks:
+      - traefik-public
+      - app-internal
+    environment:
+      - GF_SERVER_ROOT_URL=https://grafana.${DOMAIN}
+      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER}
+      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
+      - GF_USERS_ALLOW_SIGN_UP=false
+      - GF_INSTALL_PLUGINS=${GRAFANA_PLUGINS}
+      - GF_LOG_LEVEL=info
+      - GF_ANALYTICS_REPORTING_ENABLED=false
+    volumes:
+      - grafana-data:/var/lib/grafana
+      - ./grafana/provisioning:/etc/grafana/provisioning:ro
+      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
+    labels:
+      - "traefik.enable=true"
+      - "traefik.http.routers.grafana.rule=Host(`grafana.${DOMAIN}`)"
+      - "traefik.http.routers.grafana.entrypoints=websecure"
+      - "traefik.http.routers.grafana.tls=true"
+      - "traefik.http.routers.grafana.tls.certresolver=letsencrypt"
+      - "traefik.http.services.grafana.loadbalancer.server.port=3000"
+    depends_on:
+      prometheus:
+        condition: service_healthy
+    healthcheck:
+      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/api/health"]
+      interval: 30s
+      timeout: 10s
+      retries: 3
+
+  node-exporter:
+    image: prom/node-exporter:latest
+    container_name: node-exporter
+    restart: unless-stopped
+    networks:
+      - app-internal
+    volumes:
+      - /proc:/host/proc:ro
+      - /sys:/host/sys:ro
+      - /:/rootfs:ro
+    command:
+      - '--path.procfs=/host/proc'
+      - '--path.sysfs=/host/sys'
+      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
+    healthcheck:
+      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:9100/metrics"]
+      interval: 30s
+      timeout: 10s
+      retries: 3
+
+  cadvisor:
+    image: gcr.io/cadvisor/cadvisor:latest
+    container_name: cadvisor
+    restart: unless-stopped
+    privileged: true
+    networks:
+      - app-internal
+    volumes:
+      - /:/rootfs:ro
+      - /var/run:/var/run:ro
+      - /sys:/sys:ro
+      - /var/lib/docker/:/var/lib/docker:ro
+      - /dev/disk/:/dev/disk:ro
+    devices:
+      - /dev/kmsg
+    healthcheck:
+      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:8080/healthz"]
+      interval: 30s
+      timeout: 10s
+      retries: 3
+
+volumes:
+  portainer-data:
+    name: portainer-data
+  prometheus-data:
+    name: prometheus-data
+  grafana-data:
+    name: grafana-data
+
+networks:
+  traefik-public:
+    external: true
+  app-internal:
+    external: true
--- a/deployment/stacks/monitoring/grafana/dashboards/docker-containers.json
+++ b/deployment/stacks/monitoring/grafana/dashboards/docker-containers.json
@@ -0,0 +1,397 @@
+{
+  "annotations": {
+    "list": [
+      {
+        "builtIn": 1,
+        "datasource": "-- Grafana --",
+        "enable": true,
+        "hide": true,
+        "iconColor": "rgba(0, 211, 255, 1)",
+        "name": "Annotations & Alerts",
+        "type": "dashboard"
+      }
+    ]
+  },
+  "editable": true,
+  "gnetId": null,
+  "graphTooltip": 0,
+  "id": null,
+  "links": [],
+  "panels": [
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          },
+          "custom": {
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": {
+              "tooltip": false,
+              "viz": false,
+              "legend": false
+            },
+            "lineInterpolation": "linear",
+            "lineWidth": 1,
+            "pointSize": 5,
+            "scaleDistribution": {
+              "type": "linear"
+            },
+            "showPoints": "never",
+            "spanNulls": false,
+            "stacking": {
+              "group": "A",
+              "mode": "none"
+            },
+            "thresholdsStyle": {
+              "mode": "off"
+            }
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              }
+            ]
+          },
+          "unit": "percent"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 0,
+        "y": 0
+      },
+      "id": 1,
+      "options": {
+        "legend": {
+          "calcs": ["lastNotNull"],
+          "displayMode": "table",
+          "placement": "right"
+        },
+        "tooltip": {
+          "mode": "single"
+        }
+      },
+      "targets": [
+        {
+          "expr": "sum(rate(container_cpu_usage_seconds_total{name!~\".*exporter.*\"}[5m])) by (name) * 100",
+          "refId": "A",
+          "legendFormat": "{{name}}"
+        }
+      ],
+      "title": "Container CPU Usage %",
+      "type": "timeseries"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          },
+          "custom": {
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": {
+              "tooltip": false,
+              "viz": false,
+              "legend": false
+            },
+            "lineInterpolation": "linear",
+            "lineWidth": 1,
+            "pointSize": 5,
+            "scaleDistribution": {
+              "type": "linear"
+            },
+            "showPoints": "never",
+            "spanNulls": false,
+            "stacking": {
+              "group": "A",
+              "mode": "none"
+            },
+            "thresholdsStyle": {
+              "mode": "off"
+            }
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              }
+            ]
+          },
+          "unit": "bytes"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 12,
+        "y": 0
+      },
+      "id": 2,
+      "options": {
+        "legend": {
+          "calcs": ["lastNotNull"],
+          "displayMode": "table",
+          "placement": "right"
+        },
+        "tooltip": {
+          "mode": "single"
+        }
+      },
+      "targets": [
+        {
+          "expr": "sum(container_memory_usage_bytes{name!~\".*exporter.*\"}) by (name)",
+          "refId": "A",
+          "legendFormat": "{{name}}"
+        }
+      ],
+      "title": "Container Memory Usage",
+      "type": "timeseries"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [
+            {
+              "options": {
+                "0": {
+                  "color": "red",
+                  "index": 1,
+                  "text": "Down"
+                },
+                "1": {
+                  "color": "green",
+                  "index": 0,
+                  "text": "Up"
+                }
+              },
+              "type": "value"
+            }
+          ],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "red",
+                "value": null
+              },
+              {
+                "color": "green",
+                "value": 1
+              }
+            ]
+          }
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 4,
+        "w": 6,
+        "x": 0,
+        "y": 8
+      },
+      "id": 3,
+      "options": {
+        "colorMode": "background",
+        "graphMode": "none",
+        "justifyMode": "auto",
+        "orientation": "auto",
+        "reduceOptions": {
+          "calcs": ["lastNotNull"],
+          "fields": "",
+          "values": false
+        },
+        "textMode": "auto"
+      },
+      "pluginVersion": "9.0.0",
+      "targets": [
+        {
+          "expr": "count(container_last_seen{name!~\".*exporter.*\"}) > 0",
+          "refId": "A"
+        }
+      ],
+      "title": "Containers Running",
+      "type": "stat"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 3
+              },
+              {
+                "color": "red",
+                "value": 5
+              }
+            ]
+          }
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 4,
+        "w": 6,
+        "x": 6,
+        "y": 8
+      },
+      "id": 4,
+      "options": {
+        "colorMode": "background",
+        "graphMode": "none",
+        "justifyMode": "auto",
+        "orientation": "auto",
+        "reduceOptions": {
+          "calcs": ["lastNotNull"],
+          "fields": "",
+          "values": false
+        },
+        "textMode": "auto"
+      },
+      "pluginVersion": "9.0.0",
+      "targets": [
+        {
+          "expr": "sum(increase(container_restart_count[5m])) or vector(0)",
+          "refId": "A"
+        }
+      ],
+      "title": "Container Restarts (5m)",
+      "type": "stat"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          },
+          "custom": {
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": {
+              "tooltip": false,
+              "viz": false,
+              "legend": false
+            },
+            "lineInterpolation": "linear",
+            "lineWidth": 1,
+            "pointSize": 5,
+            "scaleDistribution": {
+              "type": "linear"
+            },
+            "showPoints": "never",
+            "spanNulls": false,
+            "stacking": {
+              "group": "A",
+              "mode": "none"
+            },
+            "thresholdsStyle": {
+              "mode": "off"
+            }
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              }
+            ]
+          },
+          "unit": "Bps"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 12,
+        "y": 8
+      },
+      "id": 5,
+      "options": {
+        "legend": {
+          "calcs": ["lastNotNull"],
+          "displayMode": "table",
+          "placement": "right"
+        },
+        "tooltip": {
+          "mode": "single"
+        }
+      },
+      "targets": [
+        {
+          "expr": "sum(rate(container_network_receive_bytes_total{name!~\".*exporter.*\"}[5m])) by (name)",
+          "refId": "A",
+          "legendFormat": "{{name}} RX"
+        },
+        {
+          "expr": "sum(rate(container_network_transmit_bytes_total{name!~\".*exporter.*\"}[5m])) by (name)",
+          "refId": "B",
+          "legendFormat": "{{name}} TX"
+        }
+      ],
+      "title": "Container Network I/O",
+      "type": "timeseries"
+    }
+  ],
+  "refresh": "30s",
+  "schemaVersion": 36,
+  "style": "dark",
+  "tags": ["docker", "containers"],
+  "templating": {
+    "list": []
+  },
+  "time": {
+    "from": "now-1h",
+    "to": "now"
+  },
+  "timepicker": {},
+  "timezone": "",
+  "title": "Docker Containers",
+  "uid": "docker-containers",
+  "version": 1
+}
--- a/deployment/stacks/monitoring/grafana/dashboards/host-system.json
+++ b/deployment/stacks/monitoring/grafana/dashboards/host-system.json
@@ -0,0 +1,591 @@
+{
+  "annotations": {
+    "list": [
+      {
+        "builtIn": 1,
+        "datasource": "-- Grafana --",
+        "enable": true,
+        "hide": true,
+        "iconColor": "rgba(0, 211, 255, 1)",
+        "name": "Annotations & Alerts",
+        "type": "dashboard"
+      }
+    ]
+  },
+  "editable": true,
+  "gnetId": null,
+  "graphTooltip": 0,
+  "id": null,
+  "links": [],
+  "panels": [
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          },
+          "custom": {
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": {
+              "tooltip": false,
+              "viz": false,
+              "legend": false
+            },
+            "lineInterpolation": "linear",
+            "lineWidth": 1,
+            "pointSize": 5,
+            "scaleDistribution": {
+              "type": "linear"
+            },
+            "showPoints": "never",
+            "spanNulls": false,
+            "stacking": {
+              "group": "A",
+              "mode": "none"
+            },
+            "thresholdsStyle": {
+              "mode": "line"
+            }
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "red",
+                "value": 80
+              }
+            ]
+          },
+          "unit": "percent"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 0,
+        "y": 0
+      },
+      "id": 1,
+      "options": {
+        "legend": {
+          "calcs": ["lastNotNull", "mean", "max"],
+          "displayMode": "table",
+          "placement": "right"
+        },
+        "tooltip": {
+          "mode": "single"
+        }
+      },
+      "targets": [
+        {
+          "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
+          "refId": "A",
+          "legendFormat": "{{instance}}"
+        }
+      ],
+      "title": "CPU Usage %",
+      "type": "timeseries"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          },
+          "custom": {
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": {
+              "tooltip": false,
+              "viz": false,
+              "legend": false
+            },
+            "lineInterpolation": "linear",
+            "lineWidth": 1,
+            "pointSize": 5,
+            "scaleDistribution": {
+              "type": "linear"
+            },
+            "showPoints": "never",
+            "spanNulls": false,
+            "stacking": {
+              "group": "A",
+              "mode": "none"
+            },
+            "thresholdsStyle": {
+              "mode": "line"
+            }
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 80
+              },
+              {
+                "color": "red",
+                "value": 90
+              }
+            ]
+          },
+          "unit": "percent"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 12,
+        "y": 0
+      },
+      "id": 2,
+      "options": {
+        "legend": {
+          "calcs": ["lastNotNull", "mean", "max"],
+          "displayMode": "table",
+          "placement": "right"
+        },
+        "tooltip": {
+          "mode": "single"
+        }
+      },
+      "targets": [
+        {
+          "expr": "100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)",
+          "refId": "A",
+          "legendFormat": "{{instance}}"
+        }
+      ],
+      "title": "Memory Usage %",
+      "type": "timeseries"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          },
+          "custom": {
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": {
+              "tooltip": false,
+              "viz": false,
+              "legend": false
+            },
+            "lineInterpolation": "linear",
+            "lineWidth": 1,
+            "pointSize": 5,
+            "scaleDistribution": {
+              "type": "linear"
+            },
+            "showPoints": "never",
+            "spanNulls": false,
+            "stacking": {
+              "group": "A",
+              "mode": "none"
+            },
+            "thresholdsStyle": {
+              "mode": "line"
+            }
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 80
+              },
+              {
+                "color": "red",
+                "value": 90
+              }
+            ]
+          },
+          "unit": "percent"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 0,
+        "y": 8
+      },
+      "id": 3,
+      "options": {
+        "legend": {
+          "calcs": ["lastNotNull", "mean", "max"],
+          "displayMode": "table",
+          "placement": "right"
+        },
+        "tooltip": {
+          "mode": "single"
+        }
+      },
+      "targets": [
+        {
+          "expr": "100 - ((node_filesystem_avail_bytes{mountpoint=\"/\",fstype!=\"rootfs\"} / node_filesystem_size_bytes{mountpoint=\"/\",fstype!=\"rootfs\"}) * 100)",
+          "refId": "A",
+          "legendFormat": "{{instance}}"
+        }
+      ],
+      "title": "Disk Usage %",
+      "type": "timeseries"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          },
+          "custom": {
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": {
+              "tooltip": false,
+              "viz": false,
+              "legend": false
+            },
+            "lineInterpolation": "linear",
+            "lineWidth": 1,
+            "pointSize": 5,
+            "scaleDistribution": {
+              "type": "linear"
+            },
+            "showPoints": "never",
+            "spanNulls": false,
+            "stacking": {
+              "group": "A",
+              "mode": "none"
+            },
+            "thresholdsStyle": {
+              "mode": "off"
+            }
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              }
+            ]
+          },
+          "unit": "Bps"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 12,
+        "y": 8
+      },
+      "id": 4,
+      "options": {
+        "legend": {
+          "calcs": ["lastNotNull", "mean", "max"],
+          "displayMode": "table",
+          "placement": "right"
+        },
+        "tooltip": {
+          "mode": "single"
+        }
+      },
+      "targets": [
+        {
+          "expr": "rate(node_network_receive_bytes_total[5m])",
+          "refId": "A",
+          "legendFormat": "{{instance}} - {{device}} RX"
+        },
+        {
+          "expr": "rate(node_network_transmit_bytes_total[5m])",
+          "refId": "B",
+          "legendFormat": "{{instance}} - {{device}} TX"
+        }
+      ],
+      "title": "Network I/O",
+      "type": "timeseries"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 80
+              },
+              {
+                "color": "red",
+                "value": 90
+              }
+            ]
+          },
+          "unit": "percent"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 4,
+        "w": 6,
+        "x": 0,
+        "y": 16
+      },
+      "id": 5,
+      "options": {
+        "colorMode": "background",
+        "graphMode": "area",
+        "justifyMode": "auto",
+        "orientation": "auto",
+        "reduceOptions": {
+          "calcs": ["lastNotNull"],
+          "fields": "",
+          "values": false
+        },
+        "textMode": "auto"
+      },
+      "pluginVersion": "9.0.0",
+      "targets": [
+        {
+          "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
+          "refId": "A"
+        }
+      ],
+      "title": "Current CPU Usage",
+      "type": "stat"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 80
+              },
+              {
+                "color": "red",
+                "value": 90
+              }
+            ]
+          },
+          "unit": "percent"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 4,
+        "w": 6,
+        "x": 6,
+        "y": 16
+      },
+      "id": 6,
+      "options": {
+        "colorMode": "background",
+        "graphMode": "area",
+        "justifyMode": "auto",
+        "orientation": "auto",
+        "reduceOptions": {
+          "calcs": ["lastNotNull"],
+          "fields": "",
+          "values": false
+        },
+        "textMode": "auto"
+      },
+      "pluginVersion": "9.0.0",
+      "targets": [
+        {
+          "expr": "100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)",
+          "refId": "A"
+        }
+      ],
+      "title": "Current Memory Usage",
+      "type": "stat"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 80
+              },
+              {
+                "color": "red",
+                "value": 90
+              }
+            ]
+          },
+          "unit": "percent"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 4,
+        "w": 6,
+        "x": 12,
+        "y": 16
+      },
+      "id": 7,
+      "options": {
+        "colorMode": "background",
+        "graphMode": "area",
+        "justifyMode": "auto",
+        "orientation": "auto",
+        "reduceOptions": {
+          "calcs": ["lastNotNull"],
+          "fields": "",
+          "values": false
+        },
+        "textMode": "auto"
+      },
+      "pluginVersion": "9.0.0",
+      "targets": [
+        {
+          "expr": "100 - ((node_filesystem_avail_bytes{mountpoint=\"/\",fstype!=\"rootfs\"} / node_filesystem_size_bytes{mountpoint=\"/\",fstype!=\"rootfs\"}) * 100)",
+          "refId": "A"
+        }
+      ],
+      "title": "Current Disk Usage",
+      "type": "stat"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              }
+            ]
+          },
+          "unit": "s"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 4,
+        "w": 6,
+        "x": 18,
+        "y": 16
+      },
+      "id": 8,
+      "options": {
+        "colorMode": "value",
+        "graphMode": "area",
+        "justifyMode": "auto",
+        "orientation": "auto",
+        "reduceOptions": {
+          "calcs": ["lastNotNull"],
+          "fields": "",
+          "values": false
+        },
+        "textMode": "auto"
+      },
+      "pluginVersion": "9.0.0",
+      "targets": [
+        {
+          "expr": "time() - node_boot_time_seconds",
+          "refId": "A"
+        }
+      ],
+      "title": "System Uptime",
+      "type": "stat"
+    }
+  ],
+  "refresh": "30s",
+  "schemaVersion": 36,
+  "style": "dark",
+  "tags": ["host", "system"],
+  "templating": {
+    "list": []
+  },
+  "time": {
+    "from": "now-1h",
+    "to": "now"
+  },
+  "timepicker": {},
+  "timezone": "",
+  "title": "Host System",
+  "uid": "host-system",
+  "version": 1
+}
--- a/deployment/stacks/monitoring/grafana/dashboards/traefik.json
+++ b/deployment/stacks/monitoring/grafana/dashboards/traefik.json
@@ -0,0 +1,613 @@
+{
+  "annotations": {
+    "list": [
+      {
+        "builtIn": 1,
+        "datasource": "-- Grafana --",
+        "enable": true,
+        "hide": true,
+        "iconColor": "rgba(0, 211, 255, 1)",
+        "name": "Annotations & Alerts",
+        "type": "dashboard"
+      }
+    ]
+  },
+  "editable": true,
+  "gnetId": null,
+  "graphTooltip": 0,
+  "id": null,
+  "links": [],
+  "panels": [
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          },
+          "custom": {
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": {
+              "tooltip": false,
+              "viz": false,
+              "legend": false
+            },
+            "lineInterpolation": "linear",
+            "lineWidth": 1,
+            "pointSize": 5,
+            "scaleDistribution": {
+              "type": "linear"
+            },
+            "showPoints": "never",
+            "spanNulls": false,
+            "stacking": {
+              "group": "A",
+              "mode": "none"
+            },
+            "thresholdsStyle": {
+              "mode": "off"
+            }
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              }
+            ]
+          },
+          "unit": "reqps"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 0,
+        "y": 0
+      },
+      "id": 1,
+      "options": {
+        "legend": {
+          "calcs": ["lastNotNull", "mean", "max"],
+          "displayMode": "table",
+          "placement": "right"
+        },
+        "tooltip": {
+          "mode": "single"
+        }
+      },
+      "targets": [
+        {
+          "expr": "sum(rate(traefik_service_requests_total[5m])) by (service)",
+          "refId": "A",
+          "legendFormat": "{{service}}"
+        }
+      ],
+      "title": "Request Rate by Service",
+      "type": "timeseries"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          },
+          "custom": {
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": {
+              "tooltip": false,
+              "viz": false,
+              "legend": false
+            },
+            "lineInterpolation": "linear",
+            "lineWidth": 1,
+            "pointSize": 5,
+            "scaleDistribution": {
+              "type": "linear"
+            },
+            "showPoints": "never",
+            "spanNulls": false,
+            "stacking": {
+              "group": "A",
+              "mode": "none"
+            },
+            "thresholdsStyle": {
+              "mode": "off"
+            }
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              }
+            ]
+          },
+          "unit": "ms"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 12,
+        "y": 0
+      },
+      "id": 2,
+      "options": {
+        "legend": {
+          "calcs": ["lastNotNull", "mean", "max"],
+          "displayMode": "table",
+          "placement": "right"
+        },
+        "tooltip": {
+          "mode": "single"
+        }
+      },
+      "targets": [
+        {
+          "expr": "histogram_quantile(0.95, sum(rate(traefik_service_request_duration_seconds_bucket[5m])) by (service, le)) * 1000",
+          "refId": "A",
+          "legendFormat": "{{service}} p95"
+        },
+        {
+          "expr": "histogram_quantile(0.99, sum(rate(traefik_service_request_duration_seconds_bucket[5m])) by (service, le)) * 1000",
+          "refId": "B",
+          "legendFormat": "{{service}} p99"
+        }
+      ],
+      "title": "Response Time (p95/p99)",
+      "type": "timeseries"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic"
+          },
+          "custom": {
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": {
+              "tooltip": false,
+              "viz": false,
+              "legend": false
+            },
+            "lineInterpolation": "linear",
+            "lineWidth": 1,
+            "pointSize": 5,
+            "scaleDistribution": {
+              "type": "linear"
+            },
+            "showPoints": "never",
+            "spanNulls": false,
+            "stacking": {
+              "group": "A",
+              "mode": "normal"
+            },
+            "thresholdsStyle": {
+              "mode": "off"
+            }
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              }
+            ]
+          },
+          "unit": "reqps"
+        },
+        "overrides": [
+          {
+            "matcher": {
+              "id": "byRegexp",
+              "options": ".*2xx.*"
+            },
+            "properties": [
+              {
+                "id": "color",
+                "value": {
+                  "fixedColor": "green",
+                  "mode": "fixed"
+                }
+              }
+            ]
+          },
+          {
+            "matcher": {
+              "id": "byRegexp",
+              "options": ".*4xx.*"
+            },
+            "properties": [
+              {
+                "id": "color",
+                "value": {
+                  "fixedColor": "yellow",
+                  "mode": "fixed"
+                }
+              }
+            ]
+          },
+          {
+            "matcher": {
+              "id": "byRegexp",
+              "options": ".*5xx.*"
+            },
+            "properties": [
+              {
+                "id": "color",
+                "value": {
+                  "fixedColor": "red",
+                  "mode": "fixed"
+                }
+              }
+            ]
+          }
+        ]
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 0,
+        "y": 8
+      },
+      "id": 3,
+      "options": {
+        "legend": {
+          "calcs": ["lastNotNull", "sum"],
+          "displayMode": "table",
+          "placement": "right"
+        },
+        "tooltip": {
+          "mode": "single"
+        }
+      },
+      "targets": [
+        {
+          "expr": "sum(rate(traefik_service_requests_total{code=~\"2..\"}[5m])) by (service)",
+          "refId": "A",
+          "legendFormat": "{{service}} 2xx"
+        },
+        {
+          "expr": "sum(rate(traefik_service_requests_total{code=~\"4..\"}[5m])) by (service)",
+          "refId": "B",
+          "legendFormat": "{{service}} 4xx"
+        },
+        {
+          "expr": "sum(rate(traefik_service_requests_total{code=~\"5..\"}[5m])) by (service)",
+          "refId": "C",
+          "legendFormat": "{{service}} 5xx"
+        }
+      ],
+      "title": "HTTP Status Codes",
+      "type": "timeseries"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [
+            {
+              "options": {
+                "0": {
+                  "color": "red",
+                  "index": 1,
+                  "text": "Down"
+                },
+                "1": {
+                  "color": "green",
+                  "index": 0,
+                  "text": "Up"
+                }
+              },
+              "type": "value"
+            }
+          ],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "red",
+                "value": null
+              },
+              {
+                "color": "green",
+                "value": 1
+              }
+            ]
+          }
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 8,
+        "w": 12,
+        "x": 12,
+        "y": 8
+      },
+      "id": 4,
+      "options": {
+        "colorMode": "background",
+        "graphMode": "none",
+        "justifyMode": "auto",
+        "orientation": "auto",
+        "reduceOptions": {
+          "calcs": ["lastNotNull"],
+          "fields": "",
+          "values": false
+        },
+        "textMode": "auto"
+      },
+      "pluginVersion": "9.0.0",
+      "targets": [
+        {
+          "expr": "traefik_service_server_up",
+          "refId": "A",
+          "legendFormat": "{{service}}"
+        }
+      ],
+      "title": "Service Status",
+      "type": "stat"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              }
+            ]
+          },
+          "unit": "short"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 4,
+        "w": 6,
+        "x": 0,
+        "y": 16
+      },
+      "id": 5,
+      "options": {
+        "colorMode": "value",
+        "graphMode": "area",
+        "justifyMode": "auto",
+        "orientation": "auto",
+        "reduceOptions": {
+          "calcs": ["lastNotNull"],
+          "fields": "",
+          "values": false
+        },
+        "textMode": "auto"
+      },
+      "pluginVersion": "9.0.0",
+      "targets": [
+        {
+          "expr": "sum(rate(traefik_service_requests_total[5m])) * 60",
+          "refId": "A"
+        }
+      ],
+      "title": "Requests per Minute",
+      "type": "stat"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 5
+              },
+              {
+                "color": "red",
+                "value": 10
+              }
+            ]
+          },
+          "unit": "percent"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 4,
+        "w": 6,
+        "x": 6,
+        "y": 16
+      },
+      "id": 6,
+      "options": {
+        "colorMode": "background",
+        "graphMode": "area",
+        "justifyMode": "auto",
+        "orientation": "auto",
+        "reduceOptions": {
+          "calcs": ["lastNotNull"],
+          "fields": "",
+          "values": false
+        },
+        "textMode": "auto"
+      },
+      "pluginVersion": "9.0.0",
+      "targets": [
+        {
+          "expr": "(sum(rate(traefik_service_requests_total{code=~\"4..\"}[5m])) / sum(rate(traefik_service_requests_total[5m]))) * 100",
+          "refId": "A"
+        }
+      ],
+      "title": "4xx Error Rate",
+      "type": "stat"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              },
+              {
+                "color": "yellow",
+                "value": 1
+              },
+              {
+                "color": "red",
+                "value": 5
+              }
+            ]
+          },
+          "unit": "percent"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 4,
+        "w": 6,
+        "x": 12,
+        "y": 16
+      },
+      "id": 7,
+      "options": {
+        "colorMode": "background",
+        "graphMode": "area",
+        "justifyMode": "auto",
+        "orientation": "auto",
+        "reduceOptions": {
+          "calcs": ["lastNotNull"],
+          "fields": "",
+          "values": false
+        },
+        "textMode": "auto"
+      },
+      "pluginVersion": "9.0.0",
+      "targets": [
+        {
+          "expr": "(sum(rate(traefik_service_requests_total{code=~\"5..\"}[5m])) / sum(rate(traefik_service_requests_total[5m]))) * 100",
+          "refId": "A"
+        }
+      ],
+      "title": "5xx Error Rate",
+      "type": "stat"
+    },
+    {
+      "datasource": "Prometheus",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "thresholds"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {
+                "color": "green",
+                "value": null
+              }
+            ]
+          },
+          "unit": "short"
+        },
+        "overrides": []
+      },
+      "gridPos": {
+        "h": 4,
+        "w": 6,
+        "x": 18,
+        "y": 16
+      },
+      "id": 8,
+      "options": {
+        "colorMode": "value",
+        "graphMode": "none",
+        "justifyMode": "auto",
+        "orientation": "auto",
+        "reduceOptions": {
+          "calcs": ["lastNotNull"],
+          "fields": "",
+          "values": false
+        },
+        "textMode": "auto"
+      },
+      "pluginVersion": "9.0.0",
+      "targets": [
+        {
+          "expr": "count(traefik_service_server_up == 1)",
+          "refId": "A"
+        }
+      ],
+      "title": "Active Services",
+      "type": "stat"
+    }
+  ],
+  "refresh": "30s",
+  "schemaVersion": 36,
+  "style": "dark",
+  "tags": ["traefik", "proxy"],
+  "templating": {
+    "list": []
+  },
+  "time": {
+    "from": "now-1h",
+    "to": "now"
+  },
+  "timepicker": {},
+  "timezone": "",
+  "title": "Traefik",
+  "uid": "traefik",
+  "version": 1
+}
--- a/deployment/stacks/monitoring/grafana/provisioning/dashboards/dashboard.yml
+++ b/deployment/stacks/monitoring/grafana/provisioning/dashboards/dashboard.yml
@@ -0,0 +1,15 @@
+# Grafana Dashboard Provisioning
+# https://grafana.com/docs/grafana/latest/administration/provisioning/#dashboards
+
+apiVersion: 1
+
+providers:
+  - name: 'Default'
+    orgId: 1
+    folder: ''
+    type: file
+    disableDeletion: false
+    updateIntervalSeconds: 10
+    allowUiUpdates: true
+    options:
+      path: /var/lib/grafana/dashboards
--- a/deployment/stacks/monitoring/grafana/provisioning/datasources/prometheus.yml
+++ b/deployment/stacks/monitoring/grafana/provisioning/datasources/prometheus.yml
@@ -0,0 +1,17 @@
+# Grafana Datasource Provisioning
+# https://grafana.com/docs/grafana/latest/administration/provisioning/#data-sources
+
+apiVersion: 1
+
+datasources:
+  - name: Prometheus
+    type: prometheus
+    access: proxy
+    url: http://prometheus:9090
+    isDefault: true
+    editable: false
+    jsonData:
+      timeInterval: 15s
+      queryTimeout: 60s
+      httpMethod: POST
+    version: 1
--- a/deployment/stacks/monitoring/prometheus/alerts.yml
+++ b/deployment/stacks/monitoring/prometheus/alerts.yml
@@ -0,0 +1,245 @@
+# Prometheus Alerting Rules
+# https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
+
+groups:
+  - name: infrastructure_alerts
+    interval: 30s
+    rules:
+      # Host System Alerts
+      - alert: HostHighCpuLoad
+        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
+        for: 5m
+        labels:
+          severity: warning
+          category: infrastructure
+        annotations:
+          summary: "High CPU load on {{ $labels.instance }}"
+          description: "CPU load is above 80% (current value: {{ $value }}%)"
+
+      - alert: HostOutOfMemory
+        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
+        for: 2m
+        labels:
+          severity: critical
+          category: infrastructure
+        annotations:
+          summary: "Host out of memory on {{ $labels.instance }}"
+          description: "Available memory is below 10% (current value: {{ $value }}%)"
+
+      - alert: HostOutOfDiskSpace
+        expr: (node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"} * 100) < 10
+        for: 2m
+        labels:
+          severity: critical
+          category: infrastructure
+        annotations:
+          summary: "Host out of disk space on {{ $labels.instance }}"
+          description: "Disk space is below 10% (current value: {{ $value }}%)"
+
+      - alert: HostDiskSpaceWarning
+        expr: (node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"} * 100) < 20
+        for: 5m
+        labels:
+          severity: warning
+          category: infrastructure
+        annotations:
+          summary: "Disk space warning on {{ $labels.instance }}"
+          description: "Disk space is below 20% (current value: {{ $value }}%)"
+
+      - alert: HostHighDiskReadLatency
+        expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1
+        for: 5m
+        labels:
+          severity: warning
+          category: infrastructure
+        annotations:
+          summary: "High disk read latency on {{ $labels.instance }}"
+          description: "Disk read latency is high (current value: {{ $value }}s)"
+
+      # Container Alerts
+      - alert: ContainerKilled
+        expr: time() - container_last_seen{name!~".*exporter.*"} > 60
+        for: 1m
+        labels:
+          severity: critical
+          category: container
+        annotations:
+          summary: "Container killed: {{ $labels.name }}"
+          description: "Container {{ $labels.name }} has disappeared"
+
+      - alert: ContainerHighCpuUsage
+        expr: (sum(rate(container_cpu_usage_seconds_total{name!~".*exporter.*"}[5m])) by (name) * 100) > 80
+        for: 5m
+        labels:
+          severity: warning
+          category: container
+        annotations:
+          summary: "High CPU usage in container {{ $labels.name }}"
+          description: "Container CPU usage is above 80% (current value: {{ $value }}%)"
+
+      - alert: ContainerHighMemoryUsage
+        expr: (sum(container_memory_usage_bytes{name!~".*exporter.*"}) by (name) / sum(container_spec_memory_limit_bytes{name!~".*exporter.*"} > 0) by (name) * 100) > 80
+        for: 5m
+        labels:
+          severity: warning
+          category: container
+        annotations:
+          summary: "High memory usage in container {{ $labels.name }}"
+          description: "Container memory usage is above 80% (current value: {{ $value }}%)"
+
+      - alert: ContainerVolumeUsage
+        expr: (1 - (sum(container_fs_inodes_free) BY (instance) / sum(container_fs_inodes_total) BY (instance))) * 100 > 80
+        for: 5m
+        labels:
+          severity: warning
+          category: container
+        annotations:
+          summary: "Container filesystem inode usage on {{ $labels.instance }}"
+          description: "Container filesystem inode usage is above 80% (current value: {{ $value }}%)"
+
+      - alert: ContainerRestartCount
+        expr: changes(container_start_time_seconds{name!~".*exporter.*"}[5m]) > 0
+        for: 1m
+        labels:
+          severity: warning
+          category: container
+        annotations:
+          summary: "Container restarting: {{ $labels.name }}"
+          description: "Container {{ $labels.name }} is restarting frequently"
+
+      # Prometheus Self-Monitoring
+      - alert: PrometheusTargetDown
+        expr: up == 0
+        for: 1m
+        labels:
+          severity: critical
+          category: prometheus
+        annotations:
+          summary: "Prometheus target down: {{ $labels.job }}"
+          description: "Target {{ $labels.job }} on {{ $labels.instance }} is down"
+
+      - alert: PrometheusConfigReloadFailure
+        expr: prometheus_config_last_reload_successful == 0
+        for: 1m
+        labels:
+          severity: critical
+          category: prometheus
+        annotations:
+          summary: "Prometheus configuration reload failure"
+          description: "Prometheus configuration reload has failed"
+
+      - alert: PrometheusTooManyRestarts
+        expr: changes(process_start_time_seconds{job=~"prometheus"}[15m]) > 2
+        for: 1m
+        labels:
+          severity: warning
+          category: prometheus
+        annotations:
+          summary: "Prometheus restarting frequently"
+          description: "Prometheus has restarted more than twice in the last 15 minutes"
+
+      - alert: PrometheusTargetScrapingSlow
+        expr: prometheus_target_interval_length_seconds{quantile="0.9"} > 60
+        for: 5m
+        labels:
+          severity: warning
+          category: prometheus
+        annotations:
+          summary: "Prometheus target scraping slow"
+          description: "Prometheus is scraping targets slowly (current value: {{ $value }}s)"
+
+      # Traefik Alerts
+      - alert: TraefikServiceDown
+        expr: sum(traefik_service_server_up) by (service) == 0
+        for: 1m
+        labels:
+          severity: critical
+          category: traefik
+        annotations:
+          summary: "Traefik service down: {{ $labels.service }}"
+          description: "Traefik service {{ $labels.service }} is down"
+
+      - alert: TraefikHighHttp4xxErrorRate
+        expr: sum(rate(traefik_service_requests_total{code=~"4.."}[5m])) by (service) / sum(rate(traefik_service_requests_total[5m])) by (service) * 100 > 5
+        for: 5m
+        labels:
+          severity: warning
+          category: traefik
+        annotations:
+          summary: "High HTTP 4xx error rate for {{ $labels.service }}"
+          description: "HTTP 4xx error rate is above 5% (current value: {{ $value }}%)"
+
+      - alert: TraefikHighHttp5xxErrorRate
+        expr: sum(rate(traefik_service_requests_total{code=~"5.."}[5m])) by (service) / sum(rate(traefik_service_requests_total[5m])) by (service) * 100 > 1
+        for: 5m
+        labels:
+          severity: critical
+          category: traefik
+        annotations:
+          summary: "High HTTP 5xx error rate for {{ $labels.service }}"
+          description: "HTTP 5xx error rate is above 1% (current value: {{ $value }}%)"
+
+  - name: database_alerts
+    interval: 30s
+    rules:
+      # PostgreSQL Alerts (uncomment when postgres-exporter is deployed)
+      # - alert: PostgresqlDown
+      #   expr: pg_up == 0
+      #   for: 1m
+      #   labels:
+      #     severity: critical
+      #     category: database
+      #   annotations:
+      #     summary: "PostgreSQL down on {{ $labels.instance }}"
+      #     description: "PostgreSQL instance is down"
+
+      # - alert: PostgresqlTooManyConnections
+      #   expr: sum by (instance) (pg_stat_activity_count) > min by (instance) (pg_settings_max_connections * 0.8)
+      #   for: 5m
+      #   labels:
+      #     severity: warning
+      #     category: database
+      #   annotations:
+      #     summary: "Too many PostgreSQL connections on {{ $labels.instance }}"
+      #     description: "PostgreSQL connections are above 80% of max_connections"
+
+      # - alert: PostgresqlDeadLocks
+      #   expr: rate(pg_stat_database_deadlocks[1m]) > 0
+      #   for: 1m
+      #   labels:
+      #     severity: warning
+      #     category: database
+      #   annotations:
+      #     summary: "PostgreSQL deadlocks on {{ $labels.instance }}"
+      #     description: "PostgreSQL has deadlocks"
+
+      # Redis Alerts (uncomment when redis-exporter is deployed)
+      # - alert: RedisDown
+      #   expr: redis_up == 0
+      #   for: 1m
+      #   labels:
+      #     severity: critical
+      #     category: cache
+      #   annotations:
+      #     summary: "Redis down on {{ $labels.instance }}"
+      #     description: "Redis instance is down"
+
+      # - alert: RedisOutOfMemory
+      #   expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 90 and on(instance) redis_memory_max_bytes > 0
+      #   for: 5m
+      #   labels:
+      #     severity: critical
+      #     category: cache
+      #   annotations:
+      #     summary: "Redis out of memory on {{ $labels.instance }}"
+      #     description: "Redis memory usage is above 90%"
+
+      # - alert: RedisTooManyConnections
+      #   expr: redis_connected_clients > 100
+      #   for: 5m
+      #   labels:
+      #     severity: warning
+      #     category: cache
+      #   annotations:
+      #     summary: "Too many Redis connections on {{ $labels.instance }}"
+      #     description: "Redis has too many client connections (current value: {{ $value }})"
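Review note: the rules in `alerts.yml` can be exercised offline with `promtool test rules` before they are deployed. The fragment below is a minimal sketch, not part of this commit. It assumes a hypothetical file `alerts_test.yml` placed next to `alerts.yml`, and it uses a made-up `traefik` target as input. It checks only `PrometheusTargetDown`; `external_labels` from `prometheus.yml` are not applied in rule tests, so they are not expected here.

```yaml
# alerts_test.yml (hypothetical file name), run with:
#   promtool test rules alerts_test.yml
rule_files:
  - alerts.yml

evaluation_interval: 15s

tests:
  - interval: 15s
    input_series:
      # Assumed example target that stays down for the whole test window
      - series: 'up{job="traefik", instance="traefik:8080"}'
        values: '0x20'
    alert_rule_test:
      # The rule has "for: 1m", so it should be firing by the 2m mark
      - eval_time: 2m
        alertname: PrometheusTargetDown
        exp_alerts:
          - exp_labels:
              severity: critical
              category: prometheus
              job: traefik
              instance: traefik:8080
            exp_annotations:
              summary: "Prometheus target down: traefik"
              description: "Target traefik on traefik:8080 is down"
```

The same pattern extends to the other rules by adding further `input_series` and `alert_rule_test` entries.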
--- a/deployment/stacks/monitoring/prometheus/prometheus.yml
+++ b/deployment/stacks/monitoring/prometheus/prometheus.yml
@@ -0,0 +1,82 @@
+# Prometheus Configuration
+# https://prometheus.io/docs/prometheus/latest/configuration/configuration/
+
+global:
+  scrape_interval: 15s
+  evaluation_interval: 15s
+  external_labels:
+    cluster: 'production'
+    environment: 'michaelschiemer'
+
+# Alertmanager configuration (optional)
+# alerting:
+#   alertmanagers:
+#     - static_configs:
+#         - targets:
+#           - alertmanager:9093
+
+# Load alerting rules
+rule_files:
+  - '/etc/prometheus/alerts.yml'
+
+# Scrape configurations
+scrape_configs:
+  # Prometheus self-monitoring
+  - job_name: 'prometheus'
+    static_configs:
+      - targets: ['localhost:9090']
+        labels:
+          service: 'prometheus'
+
+  # Node Exporter - Host system metrics
+  - job_name: 'node-exporter'
+    static_configs:
+      - targets: ['node-exporter:9100']
+        labels:
+          service: 'node-exporter'
+          instance: 'production-server'
+
+  # cAdvisor - Container metrics
+  - job_name: 'cadvisor'
+    static_configs:
+      - targets: ['cadvisor:8080']
+        labels:
+          service: 'cadvisor'
+
+  # Traefik metrics
+  - job_name: 'traefik'
+    static_configs:
+      - targets: ['traefik:8080']
+        labels:
+          service: 'traefik'
+
+  # PostgreSQL Exporter (if deployed)
+  # Uncomment if you add postgres-exporter to postgresql stack
+  # - job_name: 'postgres'
+  #   static_configs:
+  #     - targets: ['postgres-exporter:9187']
+  #       labels:
+  #         service: 'postgresql'
+
+  # Redis Exporter (if deployed)
+  # Uncomment if you add redis-exporter to application stack
+  # - job_name: 'redis'
+  #   static_configs:
+  #     - targets: ['redis-exporter:9121']
+  #       labels:
+  #         service: 'redis'
+
+  # Application metrics endpoint (if available)
+  # Uncomment and configure if your PHP app exposes Prometheus metrics
+  # - job_name: 'application'
+  #   static_configs:
+  #     - targets: ['app:9000']
+  #       labels:
+  #         service: 'application'
+
+  # Nginx metrics (if nginx-prometheus-exporter deployed)
+  # - job_name: 'nginx'
+  #   static_configs:
+  #     - targets: ['nginx-exporter:9113']
+  #       labels:
+  #         service: 'nginx'
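Review note: the `alerting:` block above is commented out, so the rules in `alerts.yml` are evaluated but no notifications are sent. If that block is enabled, the `alertmanager:9093` target needs its own configuration. The fragment below is a minimal sketch of what such an `alertmanager.yml` could look like. The receiver name `default` and the webhook URL are placeholders and are not part of this commit; the `severity` and `category` labels are the ones set by the rules in `alerts.yml`.

```yaml
# alertmanager.yml (sketch; receiver name and URL are placeholders)
route:
  receiver: default
  # Group notifications by alert name and the "category" label from alerts.yml
  group_by: ['alertname', 'category']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Repeat critical alerts more often than warnings
    - matchers:
        - severity="critical"
      receiver: default
      repeat_interval: 1h

receivers:
  - name: default
    webhook_configs:
      # Hypothetical endpoint; replace with a real notification target
      - url: http://alert-receiver.example.internal:5001/
```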