# Stack 6: Monitoring (Portainer + Grafana + Prometheus)

Comprehensive monitoring stack for infrastructure and application observability.

## Overview

This stack provides monitoring and visualization for the entire infrastructure:

- **Prometheus**: Time-series metrics collection and alerting
- **Grafana**: Metrics visualization with pre-configured dashboards
- **Portainer**: Container management UI
- **Node Exporter**: Host system metrics (CPU, memory, disk, network)
- **cAdvisor**: Container resource usage metrics
- **Alertmanager**: Alert routing and management (via Prometheus)

## Features

### Prometheus

- Multi-target scraping (node-exporter, cadvisor, traefik)
- 15-second scrape interval for near real-time metrics
- 15-day retention period
- Pre-configured alert rules for critical conditions
- Built-in alerting engine
- Service discovery via static configs
- HTTPS support with BasicAuth protection

### Grafana

- Pre-configured Prometheus datasource
- Three dashboards:
  - **Docker Containers**: Container CPU, memory, network I/O, restarts
  - **Host System**: System CPU, memory, disk, network, uptime
  - **Traefik**: Request rates, response times, status codes, error rates
- Auto-provisioning (no manual configuration needed)
- HTTPS access via Traefik
- 30-second auto-refresh
- Dark theme for reduced eye strain

### Portainer

- Web-based Docker management UI
- Container start/stop/restart/logs
- Stack management and deployment
- Volume and network management
- Resource usage visualization
- HTTPS access via Traefik

### Node Exporter

- Host system metrics:
  - CPU usage by core and mode
  - Memory usage and available memory
  - Disk usage by filesystem
  - Network I/O by interface
  - System load averages
  - System uptime

### cAdvisor

- Container metrics:
  - CPU usage per container
  - Memory usage per container
  - Network I/O per container
  - Disk I/O per container
  - Container restart counts
  - Container health status

## Services

| Service | Domain | Port | Purpose |
|---------|--------|------|---------|
| Grafana | grafana.michaelschiemer.de | 3000 | Metrics visualization |
| Prometheus | prometheus.michaelschiemer.de | 9090 | Metrics collection |
| Portainer | portainer.michaelschiemer.de | 9000/9443 | Container management |
| Node Exporter | - | 9100 | Host metrics (internal) |
| cAdvisor | - | 8080 | Container metrics (internal) |

## Prerequisites

- Traefik stack deployed and running (Stack 1)
- Docker networks: `traefik-public`, `monitoring`
- Docker Swarm initialized (if using swarm mode)
- Domain DNS configured (grafana/prometheus/portainer subdomains)

## Directory Structure

```
monitoring/
├── docker-compose.yml                  # Main stack definition
├── .env.example                        # Environment template
├── prometheus/
│   ├── prometheus.yml                  # Prometheus configuration
│   └── alerts.yml                      # Alert rules
├── grafana/
│   ├── provisioning/
│   │   ├── datasources/
│   │   │   └── prometheus.yml          # Auto-configured datasource
│   │   └── dashboards/
│   │       └── dashboard.yml           # Dashboard provisioning
│   └── dashboards/
│       ├── docker-containers.json      # Container metrics dashboard
│       ├── host-system.json            # Host metrics dashboard
│       └── traefik.json                # Traefik metrics dashboard
└── README.md                           # This file
```

## Configuration

### 1. Create Environment File

```bash
cp .env.example .env
```

### 2. Configure Environment Variables

Edit `.env` and set the following variables:

```bash
# Domain Configuration
DOMAIN=michaelschiemer.de

# Grafana Configuration
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=

# Prometheus Configuration
PROMETHEUS_USER=admin
PROMETHEUS_PASSWORD=

# Portainer Configuration
PORTAINER_ADMIN_PASSWORD=

# Network Configuration
TRAEFIK_NETWORK=traefik-public
MONITORING_NETWORK=monitoring
```

### 3. Generate Strong Passwords

```bash
# Generate random passwords
openssl rand -base64 32

# For Prometheus BasicAuth (bcrypt hash)
docker run --rm httpd:alpine htpasswd -nbB admin "your-password" | cut -d ":" -f 2
```

### 4. Update Traefik BasicAuth (Optional)

If using Prometheus BasicAuth, add the bcrypt hash to the Traefik labels in `docker-compose.yml`. Every `$` in the hash must be doubled (`$$`) so that Docker Compose does not treat it as variable interpolation:

```yaml
- "traefik.http.middlewares.prometheus-auth.basicauth.users=admin:$$2y$$05$$..."
```

## Deployment

### Deploy Stack

```bash
cd /home/michael/dev/michaelschiemer/deployment/stacks/monitoring

# Deploy with Docker Compose
docker compose up -d

# Or with Docker Stack (Swarm mode)
docker stack deploy -c docker-compose.yml monitoring
```

### Verify Deployment

```bash
# Check running containers
docker compose ps

# Check service logs
docker compose logs -f grafana
docker compose logs -f prometheus

# Check Prometheus targets
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/targets
```

### Initial Access

1. **Grafana**: https://grafana.michaelschiemer.de
   - Login: `admin` / the value of `GRAFANA_ADMIN_PASSWORD` from `.env`
   - Dashboards are pre-loaded and ready to use
2. **Prometheus**: https://prometheus.michaelschiemer.de
   - BasicAuth: `admin` / the value of `PROMETHEUS_PASSWORD` from `.env`
   - Check targets at `/targets`
   - View alerts at `/alerts`
3. **Portainer**: https://portainer.michaelschiemer.de
   - First login: Set admin password
   - Connect to local Docker environment

## Usage

### Grafana Dashboards

#### Docker Containers Dashboard

Access: https://grafana.michaelschiemer.de/d/docker-containers

**Metrics Displayed**:
- Container CPU Usage % (per container, timeseries)
- Container Memory Usage (bytes per container, timeseries)
- Containers Running (current count, stat)
- Container Restarts in 5m (rate with thresholds, stat)
- Container Network I/O (RX/TX per container, timeseries)

**Use Cases**:
- Identify containers with high resource usage
- Monitor container stability (restart rates)
- Track network bandwidth consumption
- Verify all expected containers are running

#### Host System Dashboard

Access: https://grafana.michaelschiemer.de/d/host-system

**Metrics Displayed**:
- CPU Usage % (historical and current)
- Memory Usage % (historical and current)
- Disk Usage % (root filesystem, historical and current)
- Network I/O (RX/TX by interface)
- System Uptime (seconds since boot)

**Thresholds**:
- Green: < 80% usage
- Yellow: 80-90% usage
- Red: > 90% usage

**Use Cases**:
- Monitor server health and resource utilization
- Identify resource bottlenecks
- Plan capacity upgrades
- Track system stability (uptime)

#### Traefik Dashboard

Access: https://grafana.michaelschiemer.de/d/traefik

**Metrics Displayed**:
- Request Rate by Service (req/s, timeseries)
- Response Time p95/p99 (milliseconds, timeseries)
- HTTP Status Codes (2xx/4xx/5xx stacked, color-coded)
- Service Status (Up/Down per service)
- Requests per Minute (total)
- 4xx Error Rate (percentage)
- 5xx Error Rate (percentage)
- Active Services (count)

**Thresholds**:
- 4xx errors: Green < 5%, Yellow < 10%, Red ≥ 10%
- 5xx errors: Green < 1%, Yellow < 5%, Red ≥ 5%

**Use Cases**:
- Monitor HTTP traffic patterns
- Identify performance issues (high latency)
- Track error rates and types
- Verify service availability

### Prometheus Queries

#### Common PromQL Examples

**CPU Usage**:
```promql
# Overall CPU usage (%)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Per-core CPU usage (%)
100 - (avg by (cpu) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

**Memory Usage**:
```promql
# Memory usage percentage
100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

# Memory available in GiB
node_memory_MemAvailable_bytes / 1024 / 1024 / 1024
```

**Disk Usage**:
```promql
# Disk usage percentage
100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100)

# Disk I/O utilization (fraction of time each device was busy)
rate(node_disk_io_time_seconds_total[5m])
```

**Container Metrics**:
```promql
# Container CPU usage (% of one core)
sum(rate(container_cpu_usage_seconds_total{name!~".*exporter.*"}[5m])) by (name) * 100

# Container memory usage
sum(container_memory_usage_bytes{name!~".*exporter.*"}) by (name)

# Container network I/O (bytes/s received)
rate(container_network_receive_bytes_total[5m])

# Container network I/O (bytes/s transmitted)
rate(container_network_transmit_bytes_total[5m])
```

**Traefik Metrics**:
```promql
# Request rate by service
sum(rate(traefik_service_requests_total[5m])) by (service)

# Response time, 95th percentile
histogram_quantile(0.95, sum(rate(traefik_service_request_duration_seconds_bucket[5m])) by (service, le))

# 5xx error rate (%)
sum(rate(traefik_service_requests_total{code=~"5.."}[5m])) / sum(rate(traefik_service_requests_total[5m])) * 100
```

### Alert Management

#### Configured Alerts

Alerts are defined in `prometheus/alerts.yml`:

1. **HostHighCPU**: CPU usage > 80% for 5 minutes
2. **HostHighMemory**: Memory usage > 80% for 5 minutes
3. **HostDiskSpaceLow**: Disk usage > 80%
4. **ContainerHighCPU**: Container CPU > 80% for 5 minutes
5. **ContainerHighMemory**: Container memory > 80% for 5 minutes
6. **ServiceDown**: Service unavailable
7. **HighErrorRate**: Error rate > 5% for 5 minutes

#### View Active Alerts

```bash
# Via Prometheus UI:
#   https://prometheus.michaelschiemer.de/alerts

# Via API
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/alerts

# Check alert rules
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/rules
```

#### Silence Alerts

Silences are an Alertmanager feature; Prometheus itself has no API for silencing alerts. During maintenance, create a silence in the Alertmanager UI or with `amtool`:

```bash
# Silence HostHighCPU for one hour (example).
# Assumption: Alertmanager runs as a compose service named `alertmanager`
# and listens on port 9093; adjust both to match your setup.
docker compose exec alertmanager amtool silence add alertname=HostHighCPU \
  --duration=1h --author=admin --comment="Planned maintenance" \
  --alertmanager.url=http://localhost:9093
```

### Portainer Usage

#### Container Management

1. Navigate to https://portainer.michaelschiemer.de
2. Select "Local" environment
3. Go to "Containers" section
4. Available actions:
   - Start/Stop/Restart containers
   - View logs (live stream)
   - Inspect container details
   - Execute commands in containers
   - View resource statistics

#### Stack Management

1. Go to "Stacks" section
2. View deployed stacks
3. Actions available:
   - View stack definition
   - Update stack (edit compose file)
   - Stop/Start entire stack
   - Remove stack

#### Volume Management

1. Go to "Volumes" section
2. View volume details and size
3. Browse volume contents
4. Backup/restore volumes

## Integration with Other Stacks

### Stack 1: Traefik

- Provides HTTPS reverse proxy for Grafana, Prometheus, Portainer
- Automatic SSL certificate management
- BasicAuth middleware for Prometheus

### Stack 2: Gitea

- Monitor Gitea container resources
- Track HTTP requests to Gitea via Traefik dashboard
- Alert on Gitea service downtime

### Stack 3: Docker Registry

- Monitor registry container resources
- Track registry HTTP requests
- Alert on registry unavailability

### Stack 4: Application

- Monitor PHP-FPM, Nginx, Redis, Worker containers
- Track application response times
- Monitor queue worker health

### Stack 5: PostgreSQL

- Monitor database container resources
- Track PostgreSQL metrics (if postgres_exporter added)
- Alert on database unavailability

## Monitoring Best Practices

### 1. Regular Dashboard Review

- Check dashboards daily for anomalies
- Review error rates and response times
- Monitor resource utilization trends

### 2. Alert Configuration

- Tune alert thresholds based on baseline metrics
- Avoid alert fatigue (too many non-critical alerts)
- Document alert response procedures

### 3. Capacity Planning

- Review resource usage trends weekly
- Plan capacity upgrades before hitting limits
- Monitor growth rates for proactive scaling

### 4. Performance Optimization

- Identify containers with high resource usage
- Optimize slow endpoints (high p95/p99 latency)
- Balance load across services

### 5. Security Monitoring

- Monitor failed authentication attempts
- Track unusual traffic patterns
- Review service availability trends

## Troubleshooting

### Grafana Issues

#### Dashboard Not Loading

```bash
# Check Grafana logs
docker compose logs grafana

# Verify Grafana is healthy
curl http://localhost:3000/api/health

# Restart Grafana
docker compose restart grafana
```

#### Missing Metrics

```bash
# Check Prometheus targets (run from a container on the monitoring
# network; the hostname `prometheus` only resolves there)
curl http://prometheus:9090/api/v1/targets

# Verify Prometheus is scraping
docker compose logs prometheus | grep "Scrape"

# Check network connectivity
docker compose exec grafana ping prometheus
```

### Prometheus Issues

#### Targets Down

```bash
# Check target status
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/targets

# Verify target services are running
docker compose ps

# Check Prometheus configuration
docker compose exec prometheus cat /etc/prometheus/prometheus.yml

# Reload configuration (requires Prometheus to run with --web.enable-lifecycle)
curl -X POST -u admin:password https://prometheus.michaelschiemer.de/-/reload
```

#### High Memory Usage

```bash
# Check Prometheus memory
docker stats prometheus

# Reduce retention period in docker-compose.yml:
#   --storage.tsdb.retention.time=7d

# Reduce scrape frequency in prometheus.yml:
#   scrape_interval: 30s
```

### Node Exporter Issues

#### No Host Metrics

```bash
# Check node-exporter is running
docker compose ps node-exporter

# Test metrics endpoint (requires port 9100 to be reachable from the host)
curl http://localhost:9100/metrics

# Check Prometheus scraping
docker compose logs prometheus | grep node-exporter
```

### cAdvisor Issues

#### No Container Metrics

```bash
# Check cAdvisor is running
docker compose ps cadvisor

# Test metrics endpoint (requires port 8080 to be reachable from the host)
curl http://localhost:8080/metrics

# Verify Docker socket mount
docker compose exec cadvisor ls -la /var/run/docker.sock
```

### Portainer Issues

#### Cannot Access UI

```bash
# Check Portainer is running
docker compose ps portainer

# Check Traefik routing
docker compose -f ../traefik/docker-compose.yml logs

# Verify network connectivity
docker network ls | grep monitoring
```

#### Cannot Connect to Docker

```bash
# Verify Docker socket permissions
ls -la /var/run/docker.sock

# Check Portainer logs
docker compose logs portainer

# Restart Portainer
docker compose restart portainer
```

## Performance Tuning

### Prometheus Optimization

#### Reduce Memory Usage

```yaml
# In docker-compose.yml, adjust retention:
command:
  - '--storage.tsdb.retention.time=7d'    # Reduce from 15d
  - '--storage.tsdb.retention.size=5GB'   # Add size limit
```

#### Optimize Scrape Intervals

```yaml
# In prometheus/prometheus.yml:
global:
  scrape_interval: 30s      # Increase from 15s for less load
  evaluation_interval: 30s
```

#### Reduce Cardinality

```yaml
# In prometheus/prometheus.yml, add metric_relabel_configs
# under the scrape job that produces the unwanted series:
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'unused_metric_.*'
    action: drop
```

### Grafana Optimization

#### Reduce Query Load

In the dashboard JSON, lengthen the refresh interval from `30s`:

```json
"refresh": "1m"
```

#### Optimize Panel Queries

- Use recording rules for expensive queries
- Reduce time range for heavy queries
- Use appropriate resolution (step parameter)

### Storage Optimization

#### Prometheus Data Volume

```bash
# Check current size
du -sh volumes/prometheus/

# Remove data already marked as deleted (tombstones) from disk.
# Requires Prometheus to run with --web.enable-admin-api.
curl -X POST -u admin:password https://prometheus.michaelschiemer.de/api/v1/admin/tsdb/clean_tombstones
```

#### Grafana Data Volume

```bash
# Check current size
du -sh volumes/grafana/

# Reset the Grafana admin password if access is lost
# (this command does not clean up sessions or free disk space)
docker compose exec grafana grafana-cli admin reset-admin-password "<new-password>"
```

## Security Considerations

### 1. Password Security

- Use strong, randomly generated passwords
- Store passwords securely (password manager)
- Rotate passwords regularly
- Use bcrypt for Prometheus BasicAuth

### 2. Network Security

- Monitoring network is internal-only (except exporters)
- Traefik handles SSL/TLS termination
- BasicAuth protects Prometheus UI
- Grafana requires login for dashboard access

### 3. Access Control

- Limit Grafana admin access
- Use Grafana organizations for multi-tenancy
- Configure Prometheus with read-only access where possible
- Restrict Portainer access to trusted users

### 4. Data Security

- Prometheus stores metrics unencrypted on disk
- Grafana hashes user passwords and encrypts datasource secrets in its database
- Backup volumes contain sensitive data
- Secure backups with encryption

### 5. Container Security

- Use official Docker images
- Keep images updated (security patches)
- Run containers as non-root where possible
- Limit container capabilities

## Backup and Recovery

### Backup Prometheus Data

```bash
# Stop Prometheus
docker compose stop prometheus

# Backup data volume
tar czf prometheus-backup-$(date +%Y%m%d).tar.gz -C volumes/prometheus .

# Restart Prometheus
docker compose start prometheus
```

### Backup Grafana Data

```bash
# Backup Grafana database and dashboards
# (-T disables the pseudo-TTY, which would otherwise corrupt the binary stream)
docker compose exec -T grafana tar czf - /var/lib/grafana > grafana-backup-$(date +%Y%m%d).tar.gz
```

### Restore from Backup

```bash
# Stop services
docker compose down

# Restore Prometheus data
tar xzf prometheus-backup-YYYYMMDD.tar.gz -C volumes/prometheus/

# Restore Grafana data (-T keeps stdin a plain pipe)
docker compose up -d grafana
docker compose exec -T grafana tar xzf - -C / < grafana-backup-YYYYMMDD.tar.gz

# Start all services and restart Grafana so it picks up the restored files
docker compose up -d
docker compose restart grafana
```

## Maintenance

### Regular Tasks

#### Daily

- Review dashboards for anomalies
- Check active alerts
- Verify all services are running

#### Weekly

- Review resource usage trends
- Check disk space usage
- Update passwords if needed

#### Monthly

- Review and update alert rules
- Optimize slow queries
- Clean up old data if needed
- Update Docker images

### Update Procedure

```bash
# Pull latest images
docker compose pull

# Recreate containers with new images
docker compose up -d

# Verify services are healthy
docker compose ps
docker compose logs -f
```

## Support

### Documentation

- Prometheus: https://prometheus.io/docs/
- Grafana: https://grafana.com/docs/
- Portainer: https://docs.portainer.io/

### Logs

```bash
# View all logs
docker compose logs

# Follow specific service logs
docker compose logs -f grafana
docker compose logs -f prometheus

# View last 100 lines
docker compose logs --tail=100
```

### Health Checks

```bash
# Check service health
docker compose ps

# Test endpoints
curl http://localhost:9090/-/healthy   # Prometheus
curl http://localhost:3000/api/health  # Grafana

# Check metrics
curl http://localhost:9100/metrics     # Node Exporter
curl http://localhost:8080/metrics     # cAdvisor
```

---

**Stack Version**: 1.0
**Last Updated**: 2025-01-30
**Maintained By**: DevOps Team
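
The individual `curl` commands under Health Checks can be combined into one small script for cron or CI use. The following is a minimal sketch, not part of the stack itself: the function names `check_endpoint` and `check_all` are hypothetical, and the URLs assume the service ports are reachable on `localhost`, as in the health-check commands in this README.

```shell
#!/usr/bin/env bash
# Sketch of a health-check helper for the monitoring stack.
# Assumption: the ports below are reachable on localhost.
# check_endpoint and check_all are hypothetical helper names.

# Print "<name> ok" or "<name> FAILED"; return non-zero on failure.
check_endpoint() {
  local name="$1" url="$2"
  # --fail makes curl exit non-zero on HTTP status codes >= 400.
  if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
    echo "$name ok"
  else
    echo "$name FAILED"
    return 1
  fi
}

# Check every endpoint, print a summary line, and return non-zero
# if at least one check failed.
check_all() {
  local failures=0
  check_endpoint "Prometheus"    "http://localhost:9090/-/healthy"  || failures=$((failures + 1))
  check_endpoint "Grafana"       "http://localhost:3000/api/health" || failures=$((failures + 1))
  check_endpoint "Node Exporter" "http://localhost:9100/metrics"    || failures=$((failures + 1))
  check_endpoint "cAdvisor"      "http://localhost:8080/metrics"    || failures=$((failures + 1))
  echo "$failures endpoint(s) failing"
  [ "$failures" -eq 0 ]
}
```

After sourcing the file, running `check_all` prints one line per service plus a summary, and its exit status can drive an alert or fail a CI job. Keeping the per-endpoint logic in `check_endpoint` makes it straightforward to add further services later.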