Stack 6: Monitoring (Portainer + Grafana + Prometheus)
Comprehensive monitoring stack for infrastructure and application observability.
Overview
This stack collects, stores, and visualizes metrics for the hosts and containers that make up the infrastructure:
- Prometheus: Time-series metrics collection and alerting
- Grafana: Metrics visualization with pre-configured dashboards
- Portainer: Container management UI
- Node Exporter: Host system metrics (CPU, memory, disk, network)
- cAdvisor: Container resource usage metrics
- Alertmanager: Alert routing and management (via Prometheus)
Features
Prometheus
- Multi-target scraping (node-exporter, cadvisor, traefik)
- 15-second scrape interval for near real-time metrics
- 15-day retention period
- Pre-configured alert rules for critical conditions
- Built-in alerting engine
- Service discovery via static configs
- HTTPS access via Traefik, protected by BasicAuth
Grafana
- Pre-configured Prometheus datasource
- Three comprehensive dashboards:
- Docker Containers: Container CPU, memory, network I/O, restarts
- Host System: System CPU, memory, disk, network, uptime
- Traefik: Request rates, response times, status codes, error rates
- Auto-provisioning (no manual configuration needed)
- HTTPS access via Traefik
- 30-second auto-refresh
- Dark theme for reduced eye strain
Portainer
- Web-based Docker management UI
- Container start/stop/restart/logs
- Stack management and deployment
- Volume and network management
- Resource usage visualization
- HTTPS access via Traefik
Node Exporter
- Host system metrics:
- CPU usage by core and mode
- Memory usage and available memory
- Disk usage by filesystem
- Network I/O by interface
- System load averages
- System uptime
cAdvisor
- Container metrics:
- CPU usage per container
- Memory usage per container
- Network I/O per container
- Disk I/O per container
- Container restart counts
- Container health status
Services
| Service | Domain | Port | Purpose |
|---|---|---|---|
| Grafana | grafana.michaelschiemer.de | 3000 | Metrics visualization |
| Prometheus | prometheus.michaelschiemer.de | 9090 | Metrics collection |
| Portainer | portainer.michaelschiemer.de | 9000/9443 | Container management |
| Node Exporter | - | 9100 | Host metrics (internal) |
| cAdvisor | - | 8080 | Container metrics (internal) |
Prerequisites
- Traefik stack deployed and running (Stack 1)
- Docker networks: traefik-public, monitoring
- Docker Swarm initialized (if using swarm mode)
- Domain DNS configured (grafana/prometheus/portainer subdomains)
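The network prerequisite can be checked and satisfied with a small helper. This is a sketch rather than part of the stack: `ensure_network` is a hypothetical function name, and it assumes plain bridge networks. In Swarm mode the networks would likely need `--driver overlay` instead.

```shell
# Create a Docker network only if it does not exist yet.
# Prints "exists: <name>" or "created: <name>".
ensure_network() {
  if docker network inspect "$1" > /dev/null 2>&1; then
    printf 'exists: %s\n' "$1"
  else
    docker network create "$1" > /dev/null && printf 'created: %s\n' "$1"
  fi
}

# Usage (run manually):
#   ensure_network traefik-public
#   ensure_network monitoring
```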
Directory Structure
monitoring/
├── docker-compose.yml # Main stack definition
├── .env.example # Environment template
├── prometheus/
│ ├── prometheus.yml # Prometheus configuration
│ └── alerts.yml # Alert rules
├── grafana/
│ ├── provisioning/
│ │ ├── datasources/
│ │ │ └── prometheus.yml # Auto-configured datasource
│ │ └── dashboards/
│ │ └── dashboard.yml # Dashboard provisioning
│ └── dashboards/
│ ├── docker-containers.json # Container metrics dashboard
│ ├── host-system.json # Host metrics dashboard
│ └── traefik.json # Traefik metrics dashboard
└── README.md # This file
Configuration
1. Create Environment File
cp .env.example .env
2. Configure Environment Variables
Edit .env and set the following variables:
# Domain Configuration
DOMAIN=michaelschiemer.de
# Grafana Configuration
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=<generate-strong-password>
# Prometheus Configuration
PROMETHEUS_USER=admin
PROMETHEUS_PASSWORD=<generate-strong-password>
# Portainer Configuration
PORTAINER_ADMIN_PASSWORD=<generate-strong-password>
# Network Configuration
TRAEFIK_NETWORK=traefik-public
MONITORING_NETWORK=monitoring
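Before deploying, it may help to confirm that no `<generate-strong-password>` style placeholder is left in `.env`. The helper below is an illustrative sketch; `check_env_placeholders` is a hypothetical name and the pattern only catches values written entirely as `<...>`.

```shell
# Succeeds only when no KEY=<placeholder> line remains in the given file.
# Offending lines are printed with their line numbers.
check_env_placeholders() {
  if grep -n '=<.*>$' "$1"; then
    return 1
  fi
}

# Usage (run manually):
#   check_env_placeholders .env && echo ".env looks complete"
```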
3. Generate Strong Passwords
# Generate random passwords
openssl rand -base64 32
# For Prometheus BasicAuth (bcrypt hash)
docker run --rm httpd:alpine htpasswd -nbB admin "your-password" | cut -d ":" -f 2
4. Update Traefik BasicAuth (Optional)
If using Prometheus BasicAuth, add the bcrypt hash to Traefik labels in docker-compose.yml:
- "traefik.http.middlewares.prometheus-auth.basicauth.users=admin:$$2y$$05$$..."
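The doubled `$$` in the label above is needed because Docker Compose treats a single `$` as the start of a variable. A minimal sketch of the escaping step, assuming the hash comes from the `htpasswd` command shown earlier (`escape_for_compose` is a hypothetical helper name, and the hash in the comment is a placeholder, not a real one):

```shell
# Replace every "$" with "$$" so Compose passes the hash through literally.
escape_for_compose() {
  sed -e 's/\$/\$\$/g'
}

# Usage (run manually):
#   docker run --rm httpd:alpine htpasswd -nbB admin "your-password" | escape_for_compose
# A line such as  admin:$2y$05$abc  becomes  admin:$$2y$$05$$abc
```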
Deployment
Deploy Stack
cd /home/michael/dev/michaelschiemer/deployment/stacks/monitoring
# Deploy with Docker Compose
docker compose up -d
# Or with Docker Stack (Swarm mode)
docker stack deploy -c docker-compose.yml monitoring
Verify Deployment
# Check running containers
docker compose ps
# Check service logs
docker compose logs -f grafana
docker compose logs -f prometheus
# Check Prometheus targets
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/targets
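The targets response is verbose JSON, so a quick up/down count can be easier to read. The following is a rough sketch under the assumption that Prometheus returns compact JSON containing `"health":"up"` and `"health":"down"` fields; `summarize_targets` is a hypothetical helper, and a real JSON parser such as `jq` would be more robust where available.

```shell
# Read a /api/v1/targets response on stdin and print "up=N down=M".
summarize_targets() {
  body=$(cat)
  up=$(printf '%s' "$body" | grep -o '"health":"up"' | wc -l)
  down=$(printf '%s' "$body" | grep -o '"health":"down"' | wc -l)
  printf 'up=%d down=%d\n' "$up" "$down"
}

# Usage (run manually):
#   curl -s -u admin:password https://prometheus.michaelschiemer.de/api/v1/targets | summarize_targets
```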
Initial Access
- Grafana: https://grafana.michaelschiemer.de
  - Login: admin/<GRAFANA_ADMIN_PASSWORD>
  - Dashboards are pre-loaded and ready to use
- Prometheus: https://prometheus.michaelschiemer.de
  - BasicAuth: admin/<PROMETHEUS_PASSWORD>
  - Check targets at /targets
  - View alerts at /alerts
- Portainer: https://portainer.michaelschiemer.de
  - First login: Set admin password
  - Connect to local Docker environment
Usage
Grafana Dashboards
Docker Containers Dashboard
Access: https://grafana.michaelschiemer.de/d/docker-containers
Metrics Displayed:
- Container CPU Usage % (per container, timeseries)
- Container Memory Usage (bytes per container, timeseries)
- Containers Running (current count, stat)
- Container Restarts in 5m (rate with thresholds, stat)
- Container Network I/O (RX/TX per container, timeseries)
Use Cases:
- Identify containers with high resource usage
- Monitor container stability (restart rates)
- Track network bandwidth consumption
- Verify all expected containers are running
Host System Dashboard
Access: https://grafana.michaelschiemer.de/d/host-system
Metrics Displayed:
- CPU Usage % (historical and current)
- Memory Usage % (historical and current)
- Disk Usage % (root filesystem, historical and current)
- Network I/O (RX/TX by interface)
- System Uptime (seconds since boot)
Thresholds:
- Green: < 80% usage
- Yellow: 80-90% usage
- Red: > 90% usage
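The threshold bands above can be expressed as a small function, for example to reuse the same colouring in a shell report. This is only an illustration of the boundaries as listed (80 and 90 both fall in the yellow band); `usage_color` is a hypothetical name and is not something the dashboard itself uses.

```shell
# Map a usage percentage to the dashboard colour band.
usage_color() {
  awk -v v="$1" 'BEGIN {
    if (v < 80)       print "green"
    else if (v <= 90) print "yellow"
    else              print "red"
  }'
}

# Usage (run manually):
#   usage_color 85.5
```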
Use Cases:
- Monitor server health and resource utilization
- Identify resource bottlenecks
- Plan capacity upgrades
- Track system stability (uptime)
Traefik Dashboard
Access: https://grafana.michaelschiemer.de/d/traefik
Metrics Displayed:
- Request Rate by Service (req/s, timeseries)
- Response Time p95/p99 (milliseconds, timeseries)
- HTTP Status Codes (2xx/4xx/5xx stacked, color-coded)
- Service Status (Up/Down per service)
- Requests per Minute (total)
- 4xx Error Rate (percentage)
- 5xx Error Rate (percentage)
- Active Services (count)
Thresholds:
- 4xx errors: Green < 5%, Yellow < 10%, Red ≥ 10%
- 5xx errors: Green < 1%, Yellow < 5%, Red ≥ 5%
Use Cases:
- Monitor HTTP traffic patterns
- Identify performance issues (high latency)
- Track error rates and types
- Verify service availability
Prometheus Queries
Common PromQL Examples
CPU Usage:
# Overall CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Per-core CPU usage
100 - (avg by (cpu) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Memory Usage:
# Memory usage percentage
100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)
# Memory available in GB
node_memory_MemAvailable_bytes / 1024 / 1024 / 1024
Disk Usage:
# Disk usage percentage
100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100)
# Disk I/O rate
rate(node_disk_io_time_seconds_total[5m])
Container Metrics:
# Container CPU usage
sum(rate(container_cpu_usage_seconds_total{name!~".*exporter.*"}[5m])) by (name) * 100
# Container memory usage
sum(container_memory_usage_bytes{name!~".*exporter.*"}) by (name)
# Container network I/O
rate(container_network_receive_bytes_total[5m])
rate(container_network_transmit_bytes_total[5m])
Traefik Metrics:
# Request rate by service
sum(rate(traefik_service_requests_total[5m])) by (service)
# Response time percentiles
histogram_quantile(0.95, sum(rate(traefik_service_request_duration_seconds_bucket[5m])) by (service, le))
# Error rate
sum(rate(traefik_service_requests_total{code=~"5.."}[5m])) / sum(rate(traefik_service_requests_total[5m])) * 100
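Any of the expressions above can also be run from the command line through the Prometheus HTTP API (`/api/v1/query`). A minimal sketch, assuming the `PROMETHEUS_USER` and `PROMETHEUS_PASSWORD` values from `.env` are exported in the shell; `prom_query` and `PROM_URL` are hypothetical names introduced here.

```shell
# Run an instant PromQL query and print the raw JSON response.
# --data-urlencode takes care of quotes, braces, and spaces in the expression.
prom_query() {
  curl -fsS -G \
    -u "${PROMETHEUS_USER}:${PROMETHEUS_PASSWORD}" \
    --data-urlencode "query=$1" \
    "${PROM_URL:-https://prometheus.michaelschiemer.de}/api/v1/query"
}

# Usage (run manually):
#   prom_query 'sum(rate(traefik_service_requests_total[5m])) by (service)'
```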
Alert Management
Configured Alerts
Alerts are defined in prometheus/alerts.yml:
- HostHighCPU: CPU usage > 80% for 5 minutes
- HostHighMemory: Memory usage > 80% for 5 minutes
- HostDiskSpaceLow: Disk usage > 80%
- ContainerHighCPU: Container CPU > 80% for 5 minutes
- ContainerHighMemory: Container memory > 80% for 5 minutes
- ServiceDown: Service unavailable
- HighErrorRate: Error rate > 5% for 5 minutes
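For orientation, a rule such as HostHighCPU could look roughly like the file written below. This is a sketch, not the actual contents of prometheus/alerts.yml: the group name, the `severity` label, and the annotation text are assumptions, and the expression is simply the overall CPU query from the PromQL examples with a `> 80` threshold. `write_example_rule` is a hypothetical helper.

```shell
# Write an example Prometheus rule file to the given path.
write_example_rule() {
  cat > "$1" <<'EOF'
groups:
  - name: host-example
    rules:
      - alert: HostHighCPU
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80% for 5 minutes"
EOF
}

# Usage (run manually):
#   write_example_rule /tmp/example-rule.yml
# The result can be validated with promtool, which ships in the Prometheus image:
#   promtool check rules /tmp/example-rule.yml
```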
View Active Alerts
# Via Prometheus UI
https://prometheus.michaelschiemer.de/alerts
# Via API
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/alerts
# Check alert rules
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/rules
Silence Alerts
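Silences are created in Alertmanager, not in Prometheus; the Prometheus /api/v1/alerts endpoint is read-only. During maintenance, create a silence in the Alertmanager UI or through its v2 API:
# Silence via Alertmanager API (example; the alertmanager:9093 address and the timestamps are assumptions, adjust them to your deployment)
curl -X POST -H "Content-Type: application/json" \
http://alertmanager:9093/api/v2/silences \
-d '{"matchers":[{"name":"alertname","value":"HostHighCPU","isRegex":false}],"startsAt":"2025-01-30T12:00:00Z","endsAt":"2025-01-30T13:00:00Z","createdBy":"admin","comment":"Maintenance window"}'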
Portainer Usage
Container Management
- Navigate to https://portainer.michaelschiemer.de
- Select "Local" environment
- Go to "Containers" section
- Available actions:
- Start/Stop/Restart containers
- View logs (live stream)
- Inspect container details
- Execute commands in containers
- View resource statistics
Stack Management
- Go to "Stacks" section
- View deployed stacks
- Actions available:
- View stack definition
- Update stack (edit compose file)
- Stop/Start entire stack
- Remove stack
Volume Management
- Go to "Volumes" section
- View volume details and size
- Browse volume contents
- Backup/restore volumes
Integration with Other Stacks
Stack 1: Traefik
- Provides HTTPS reverse proxy for Grafana, Prometheus, Portainer
- Automatic SSL certificate management
- BasicAuth middleware for Prometheus
Stack 2: Gitea
- Monitor Gitea container resources
- Track HTTP requests to Gitea via Traefik dashboard
- Alert on Gitea service downtime
Stack 3: Docker Registry
- Monitor registry container resources
- Track registry HTTP requests
- Alert on registry unavailability
Stack 4: Application
- Monitor PHP-FPM, Nginx, Redis, Worker containers
- Track application response times
- Monitor queue worker health
Stack 5: PostgreSQL
- Monitor database container resources
- Track PostgreSQL metrics (if postgres_exporter added)
- Alert on database unavailability
Monitoring Best Practices
1. Regular Dashboard Review
- Check dashboards daily for anomalies
- Review error rates and response times
- Monitor resource utilization trends
2. Alert Configuration
- Tune alert thresholds based on baseline metrics
- Avoid alert fatigue (too many non-critical alerts)
- Document alert response procedures
3. Capacity Planning
- Review resource usage trends weekly
- Plan capacity upgrades before hitting limits
- Monitor growth rates for proactive scaling
4. Performance Optimization
- Identify containers with high resource usage
- Optimize slow endpoints (high p95/p99 latency)
- Balance load across services
5. Security Monitoring
- Monitor failed authentication attempts
- Track unusual traffic patterns
- Review service availability trends
Troubleshooting
Grafana Issues
Dashboard Not Loading
# Check Grafana logs
docker compose logs grafana
# Verify datasource connection
curl http://localhost:3000/api/health
# Restart Grafana
docker compose restart grafana
Missing Metrics
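# Check Prometheus targets from inside the container (the "prometheus" hostname does not resolve on the Docker host)
docker compose exec prometheus wget -qO- http://localhost:9090/api/v1/targets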
# Verify Prometheus is scraping
docker compose logs prometheus | grep "Scrape"
# Check network connectivity
docker compose exec grafana ping prometheus
Prometheus Issues
Targets Down
# Check target status
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/targets
# Verify target services are running
docker compose ps
# Check Prometheus configuration
docker compose exec prometheus cat /etc/prometheus/prometheus.yml
# Reload configuration (requires Prometheus to be started with --web.enable-lifecycle)
curl -X POST -u admin:password https://prometheus.michaelschiemer.de/-/reload
High Memory Usage
# Check Prometheus memory
docker stats prometheus
# Reduce retention period in docker-compose.yml:
# --storage.tsdb.retention.time=7d
# Reduce scrape interval in prometheus.yml:
# scrape_interval: 30s
Node Exporter Issues
No Host Metrics
# Check node-exporter is running
docker compose ps node-exporter
# Test metrics endpoint
curl http://localhost:9100/metrics
# Check Prometheus scraping
docker compose logs prometheus | grep node-exporter
cAdvisor Issues
No Container Metrics
# Check cAdvisor is running
docker compose ps cadvisor
# Test metrics endpoint
curl http://localhost:8080/metrics
# Verify Docker socket mount
docker compose exec cadvisor ls -la /var/run/docker.sock
Portainer Issues
Cannot Access UI
# Check Portainer is running
docker compose ps portainer
# Check Traefik routing
docker compose -f ../traefik/docker-compose.yml logs
# Verify network connectivity
docker network ls | grep monitoring
Cannot Connect to Docker
# Verify Docker socket permissions
ls -la /var/run/docker.sock
# Check Portainer logs
docker compose logs portainer
# Restart Portainer
docker compose restart portainer
Performance Tuning
Prometheus Optimization
Reduce Memory Usage
# In docker-compose.yml, adjust retention:
command:
- '--storage.tsdb.retention.time=7d' # Reduce from 15d
- '--storage.tsdb.retention.size=5GB' # Add size limit
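To pick sensible retention values, the Prometheus documentation gives a rough sizing rule: disk space ≈ retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The sketch below applies that rule; `estimate_tsdb_bytes` is a hypothetical helper and the input numbers in the usage line are made up for illustration.

```shell
# Estimate TSDB disk usage in bytes.
# $1 = retention in days, $2 = ingested samples per second, $3 = bytes per sample
estimate_tsdb_bytes() {
  awk -v days="$1" -v sps="$2" -v bps="$3" \
    'BEGIN { printf "%.0f\n", days * 86400 * sps * bps }'
}

# Usage (run manually):
#   estimate_tsdb_bytes 15 1000 2
# The current ingestion rate can be read from Prometheus with:
#   rate(prometheus_tsdb_head_samples_appended_total[5m])
```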
Optimize Scrape Intervals
# In prometheus/prometheus.yml:
global:
scrape_interval: 30s # Increase from 15s for less load
evaluation_interval: 30s
Reduce Cardinality
# In prometheus/prometheus.yml, add metric_relabel_configs:
metric_relabel_configs:
- source_labels: [__name__]
regex: 'unused_metric_.*'
action: drop
Grafana Optimization
Reduce Query Load
// In dashboard JSON, adjust refresh rate:
"refresh": "1m" // Increase from 30s
Optimize Panel Queries
- Use recording rules for expensive queries
- Reduce time range for heavy queries
- Use appropriate resolution (step parameter)
Storage Optimization
Prometheus Data Volume
# Check current size
du -sh volumes/prometheus/
# Remove data already marked as deleted (requires Prometheus to be started with --web.enable-admin-api)
curl -X POST -u admin:password https://prometheus.michaelschiemer.de/api/v1/admin/tsdb/clean_tombstones
Grafana Data Volume
# Check current size
du -sh volumes/grafana/
# Reset the Grafana admin password if access is lost (this does not clean sessions or free disk space)
docker compose exec grafana grafana-cli admin reset-admin-password <new-password>
Security Considerations
1. Password Security
- Use strong, randomly generated passwords
- Store passwords securely (password manager)
- Rotate passwords regularly
- Use bcrypt for Prometheus BasicAuth
2. Network Security
- Monitoring network is internal-only (except exporters)
- Traefik handles SSL/TLS termination
- BasicAuth protects Prometheus UI
- Grafana requires login for dashboard access
3. Access Control
- Limit Grafana admin access
- Use Grafana organizations for multi-tenancy
- Configure Prometheus with read-only access where possible
- Restrict Portainer access to trusted users
4. Data Security
- Prometheus stores metrics unencrypted on disk
- Grafana stores user passwords hashed and encrypts data source secrets in its database
- Backup volumes contain sensitive data
- Secure backups with encryption
5. Container Security
- Use official Docker images
- Keep images updated (security patches)
- Run containers as non-root where possible
- Limit container capabilities
Backup and Recovery
Backup Prometheus Data
# Stop Prometheus
docker compose stop prometheus
# Backup data volume
tar czf prometheus-backup-$(date +%Y%m%d).tar.gz -C volumes/prometheus .
# Restart Prometheus
docker compose start prometheus
Backup Grafana Data
# Backup Grafana database and dashboards (-T disables TTY allocation so the binary stream is not corrupted)
docker compose exec -T grafana tar czf - /var/lib/grafana > grafana-backup-$(date +%Y%m%d).tar.gz
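A backup is only useful if the archive can be read back, so it may be worth listing each archive right after creating it. A minimal sketch (`verify_backup` is a hypothetical helper name):

```shell
# Succeeds when the file is non-empty and tar can list it without errors.
# Listing (-t) reads the whole gzip stream but extracts nothing.
verify_backup() {
  [ -s "$1" ] && tar tzf "$1" > /dev/null 2>&1
}

# Usage (run manually):
#   verify_backup prometheus-backup-YYYYMMDD.tar.gz && echo "archive readable"
```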
Restore from Backup
# Stop services
docker compose down
# Restore Prometheus data
tar xzf prometheus-backup-YYYYMMDD.tar.gz -C volumes/prometheus/
# Restore Grafana data (-T disables TTY allocation so the archive can be piped in)
docker compose up -d grafana
docker compose exec -T grafana tar xzf - -C / < grafana-backup-YYYYMMDD.tar.gz
# Start all services
docker compose up -d
Maintenance
Regular Tasks
Daily
- Review dashboards for anomalies
- Check active alerts
- Verify all services are running
Weekly
- Review resource usage trends
- Check disk space usage
- Update passwords if needed
Monthly
- Review and update alert rules
- Optimize slow queries
- Clean up old data if needed
- Update Docker images
Update Procedure
# Pull latest images
docker compose pull
# Recreate containers with new images
docker compose up -d
# Verify services are healthy
docker compose ps
docker compose logs -f
Support
Documentation
- Prometheus: https://prometheus.io/docs/
- Grafana: https://grafana.com/docs/
- Portainer: https://docs.portainer.io/
Logs
# View all logs
docker compose logs
# Follow specific service logs
docker compose logs -f grafana
docker compose logs -f prometheus
# View last 100 lines
docker compose logs --tail=100
Health Checks
# Check service health
docker compose ps
# Test endpoints
curl http://localhost:9090/-/healthy # Prometheus
curl http://localhost:3000/api/health # Grafana
# Check metrics
curl http://localhost:9100/metrics # Node Exporter
curl http://localhost:8080/metrics # cAdvisor
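The endpoint checks above can be wrapped so that each one prints a single OK or FAIL line instead of the full response. This is a sketch only; `check_endpoint` is a hypothetical helper, and the localhost URLs assume the ports are reachable from where the script runs.

```shell
# $1 = label, $2 = URL. Prints "OK <label>" or "FAIL <label>" and
# returns non-zero on failure (-f makes curl fail on HTTP errors).
check_endpoint() {
  if curl -fsS -o /dev/null --max-time 5 "$2"; then
    printf 'OK   %s\n' "$1"
  else
    printf 'FAIL %s\n' "$1"
    return 1
  fi
}

# Usage (run manually):
#   check_endpoint Prometheus    http://localhost:9090/-/healthy
#   check_endpoint Grafana       http://localhost:3000/api/health
#   check_endpoint NodeExporter  http://localhost:9100/metrics
#   check_endpoint cAdvisor      http://localhost:8080/metrics
```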
Stack Version: 1.0
Last Updated: 2025-01-30
Maintained By: DevOps Team