Michael Schiemer 95147ff23e refactor(deployment): Remove WireGuard VPN dependency and restore public service access
Remove WireGuard integration from production deployment to simplify infrastructure:
- Remove docker-compose-direct-access.yml (VPN-bound services)
- Remove VPN-only middlewares from Grafana, Prometheus, Portainer
- Remove WireGuard middleware definitions from Traefik
- Remove WireGuard IPs (10.8.0.0/24) from Traefik forwarded headers

All monitoring services now publicly accessible via subdomains:
- grafana.michaelschiemer.de (with Grafana native auth)
- prometheus.michaelschiemer.de (with Basic Auth)
- portainer.michaelschiemer.de (with Portainer native auth)

All services use Let's Encrypt SSL certificates via Traefik.
2025-11-05 12:48:25 +01:00

Stack 6: Monitoring (Portainer + Grafana + Prometheus)

Comprehensive monitoring stack for infrastructure and application observability.

Overview

This stack provides complete monitoring and visualization capabilities for the entire infrastructure:

  • Prometheus: Time-series metrics collection and alerting
  • Grafana: Metrics visualization with pre-configured dashboards
  • Portainer: Container management UI
  • Node Exporter: Host system metrics (CPU, memory, disk, network)
  • cAdvisor: Container resource usage metrics
  • Alertmanager: Alert routing and management (via Prometheus)

Features

Prometheus

  • Multi-target scraping (node-exporter, cadvisor, traefik)
  • 15-second scrape interval for near real-time metrics
  • 15-day retention period
  • Pre-configured alert rules for critical conditions
  • Built-in alerting engine
  • Service discovery via static configs
  • HTTPS support with BasicAuth protection

Grafana

  • Pre-configured Prometheus datasource
  • Three comprehensive dashboards:
    • Docker Containers: Container CPU, memory, network I/O, restarts
    • Host System: System CPU, memory, disk, network, uptime
    • Traefik: Request rates, response times, status codes, error rates
  • Auto-provisioning (no manual configuration needed)
  • HTTPS access via Traefik
  • 30-second auto-refresh
  • Dark theme for reduced eye strain

Portainer

  • Web-based Docker management UI
  • Container start/stop/restart/logs
  • Stack management and deployment
  • Volume and network management
  • Resource usage visualization
  • HTTPS access via Traefik

Node Exporter

  • Host system metrics:
    • CPU usage by core and mode
    • Memory usage and available memory
    • Disk usage by filesystem
    • Network I/O by interface
    • System load averages
    • System uptime

cAdvisor

  • Container metrics:
    • CPU usage per container
    • Memory usage per container
    • Network I/O per container
    • Disk I/O per container
    • Container restart counts
    • Container health status

Services

Service        Domain                          Port        Purpose
Grafana        grafana.michaelschiemer.de      3000        Metrics visualization
Prometheus     prometheus.michaelschiemer.de   9090        Metrics collection
Portainer      portainer.michaelschiemer.de    9000/9443   Container management
Node Exporter  -                               9100        Host metrics (internal)
cAdvisor       -                               8080        Container metrics (internal)

Prerequisites

  • Traefik stack deployed and running (Stack 1)
  • Docker networks: traefik-public, monitoring
  • Docker Swarm initialized (if using swarm mode)
  • Domain DNS configured (grafana/prometheus/portainer subdomains)
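
The two networks can be created idempotently before the first deployment. The following is a minimal sketch, assuming both are ordinary user-defined networks with default settings; the helper name ensure_network is illustrative and not part of this repository, and a Swarm setup would likely need an overlay driver instead.

```shell
# Create a Docker network only if it does not exist yet.
# Hypothetical helper; network options (driver, subnet) are left at defaults.
ensure_network() {
  name="$1"
  if docker network inspect "$name" >/dev/null 2>&1; then
    echo "network $name already exists"
  else
    docker network create "$name" >/dev/null && echo "network $name created"
  fi
}

# Usage (run on the Docker host):
#   ensure_network traefik-public
#   ensure_network monitoring
```

Because the function inspects before creating, it is safe to run repeatedly, for example from a provisioning script.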

Directory Structure

monitoring/
├── docker-compose.yml              # Main stack definition
├── .env.example                    # Environment template
├── prometheus/
│   ├── prometheus.yml             # Prometheus configuration
│   └── alerts.yml                 # Alert rules
├── grafana/
│   ├── provisioning/
│   │   ├── datasources/
│   │   │   └── prometheus.yml     # Auto-configured datasource
│   │   └── dashboards/
│   │       └── dashboard.yml      # Dashboard provisioning
│   └── dashboards/
│       ├── docker-containers.json # Container metrics dashboard
│       ├── host-system.json       # Host metrics dashboard
│       └── traefik.json          # Traefik metrics dashboard
└── README.md                       # This file

Configuration

1. Create Environment File

cp .env.example .env

2. Configure Environment Variables

Edit .env and set the following variables:

# Domain Configuration
DOMAIN=michaelschiemer.de

# Grafana Configuration
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=<generate-strong-password>

# Prometheus Configuration
PROMETHEUS_USER=admin
PROMETHEUS_PASSWORD=<generate-strong-password>

# Portainer Configuration
PORTAINER_ADMIN_PASSWORD=<generate-strong-password>

# Network Configuration
TRAEFIK_NETWORK=traefik-public
MONITORING_NETWORK=monitoring
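
Before deploying, it can help to confirm that no variable is still empty or set to a template placeholder. This is a small sketch, assuming placeholders follow the <...> convention used above; check_env is a hypothetical helper name.

```shell
# Print the names of variables in an env file that are empty or still
# hold a <placeholder> value. Returns non-zero if any are found.
check_env() {
  file="$1"
  # Drop comment lines, then match KEY= or KEY=<...> with nothing after it.
  bad=$(grep -v '^[[:space:]]*#' "$file" \
        | grep -E '^[A-Za-z_][A-Za-z0-9_]*=(<.*>)?[[:space:]]*$' || true)
  if [ -n "$bad" ]; then
    echo "$bad" | cut -d= -f1
    return 1
  fi
  return 0
}

# Usage:
#   check_env .env || echo "fix the variables listed above"
```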

3. Generate Strong Passwords

# Generate random passwords
openssl rand -base64 32

# For Prometheus BasicAuth (bcrypt hash)
docker run --rm httpd:alpine htpasswd -nbB admin "your-password" | cut -d ":" -f 2

4. Update Traefik BasicAuth (Optional)

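If using Prometheus BasicAuth, add the bcrypt hash to the Traefik labels in docker-compose.yml. Double every $ in the hash to $$ so that Docker Compose does not treat it as variable interpolation: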

- "traefik.http.middlewares.prometheus-auth.basicauth.users=admin:$$2y$$05$$..."
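
The $ doubling can be done mechanically instead of by hand. A minimal sketch follows; escape_for_compose is an illustrative helper name, and the sample hash in the usage note is a shortened dummy value.

```shell
# Double every "$" so Docker Compose keeps the bcrypt hash literal
# instead of reading "$2y", "$05", ... as variable references.
escape_for_compose() {
  printf '%s\n' "$1" | sed 's/\$/$$/g'
}

# Usage (prints user:hash ready to paste into the label):
#   escape_for_compose "$(docker run --rm httpd:alpine htpasswd -nbB admin 'your-password')"
```

For example, `escape_for_compose 'admin:$2y$05$abc'` prints `admin:$$2y$$05$$abc`.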

Deployment

Deploy Stack

cd /home/michael/dev/michaelschiemer/deployment/stacks/monitoring

# Deploy with Docker Compose
docker compose up -d

# Or with Docker Stack (Swarm mode)
docker stack deploy -c docker-compose.yml monitoring

Verify Deployment

# Check running containers
docker compose ps

# Check service logs
docker compose logs -f grafana
docker compose logs -f prometheus

# Check Prometheus targets
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/targets

Initial Access

  1. Grafana: https://grafana.michaelschiemer.de

    • Login: admin / <GRAFANA_ADMIN_PASSWORD>
    • Dashboards are pre-loaded and ready to use
  2. Prometheus: https://prometheus.michaelschiemer.de

    • BasicAuth: admin / <PROMETHEUS_PASSWORD>
    • Check targets at /targets
    • View alerts at /alerts
  3. Portainer: https://portainer.michaelschiemer.de

    • First login: Set admin password
    • Connect to local Docker environment

Usage

Grafana Dashboards

Docker Containers Dashboard

Access: https://grafana.michaelschiemer.de/d/docker-containers

Metrics Displayed:

  • Container CPU Usage % (per container, timeseries)
  • Container Memory Usage (bytes per container, timeseries)
  • Containers Running (current count, stat)
  • Container Restarts in 5m (rate with thresholds, stat)
  • Container Network I/O (RX/TX per container, timeseries)

Use Cases:

  • Identify containers with high resource usage
  • Monitor container stability (restart rates)
  • Track network bandwidth consumption
  • Verify all expected containers are running

Host System Dashboard

Access: https://grafana.michaelschiemer.de/d/host-system

Metrics Displayed:

  • CPU Usage % (historical and current)
  • Memory Usage % (historical and current)
  • Disk Usage % (root filesystem, historical and current)
  • Network I/O (RX/TX by interface)
  • System Uptime (seconds since boot)

Thresholds:

  • Green: < 80% usage
  • Yellow: 80-90% usage
  • Red: > 90% usage

Use Cases:

  • Monitor server health and resource utilization
  • Identify resource bottlenecks
  • Plan capacity upgrades
  • Track system stability (uptime)

Traefik Dashboard

Access: https://grafana.michaelschiemer.de/d/traefik

Metrics Displayed:

  • Request Rate by Service (req/s, timeseries)
  • Response Time p95/p99 (milliseconds, timeseries)
  • HTTP Status Codes (2xx/4xx/5xx stacked, color-coded)
  • Service Status (Up/Down per service)
  • Requests per Minute (total)
  • 4xx Error Rate (percentage)
  • 5xx Error Rate (percentage)
  • Active Services (count)

Thresholds:

  • 4xx errors: Green < 5%, Yellow < 10%, Red ≥ 10%
  • 5xx errors: Green < 1%, Yellow < 5%, Red ≥ 5%
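
To make the threshold bands concrete, the sketch below computes an error percentage from two request counts and maps it to a colour. The function names are hypothetical, and the dashboard itself evaluates these thresholds in Grafana, not in shell.

```shell
# Error rate in percent from an error count and a total request count.
error_rate_pct() {
  awk -v e="$1" -v t="$2" 'BEGIN {
    if (t == 0) { print "0.00"; exit }
    printf "%.2f\n", e / t * 100
  }'
}

# Classify a percentage: below warn is green, below crit is yellow,
# anything at or above crit is red.
classify_rate() {
  awk -v v="$1" -v warn="$2" -v crit="$3" 'BEGIN {
    if (v < warn)      print "green"
    else if (v < crit) print "yellow"
    else               print "red"
  }'
}

# Usage:
#   classify_rate "$(error_rate_pct 3 200)" 1 5    # 5xx thresholds
#   classify_rate "$(error_rate_pct 14 200)" 5 10  # 4xx thresholds
```

With the 5xx thresholds, 3 failed requests out of 200 give 1.50%, which falls in the yellow band.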

Use Cases:

  • Monitor HTTP traffic patterns
  • Identify performance issues (high latency)
  • Track error rates and types
  • Verify service availability

Prometheus Queries

Common PromQL Examples

CPU Usage:

# Overall CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Per-core CPU usage (100% minus the idle share of each core)
100 - (avg by (cpu) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory Usage:

# Memory usage percentage
100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

# Memory available in GB
node_memory_MemAvailable_bytes / 1024 / 1024 / 1024

Disk Usage:

# Disk usage percentage
100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100)

# Disk I/O utilization (fraction of each second the device was busy)
rate(node_disk_io_time_seconds_total[5m])

Container Metrics:

# Container CPU usage
sum(rate(container_cpu_usage_seconds_total{name!~".*exporter.*"}[5m])) by (name) * 100

# Container memory usage
sum(container_memory_usage_bytes{name!~".*exporter.*"}) by (name)

# Container network I/O
rate(container_network_receive_bytes_total[5m])
rate(container_network_transmit_bytes_total[5m])

Traefik Metrics:

# Request rate by service
sum(rate(traefik_service_requests_total[5m])) by (service)

# Response time percentiles
histogram_quantile(0.95, sum(rate(traefik_service_request_duration_seconds_bucket[5m])) by (service, le))

# Error rate
sum(rate(traefik_service_requests_total{code=~"5.."}[5m])) / sum(rate(traefik_service_requests_total[5m])) * 100
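
Any of the queries above can also be run from the command line through the Prometheus HTTP API (GET /api/v1/query). The following is a sketch under assumptions: prom_query, PROM_URL, PROM_AUTH and CURL are illustrative names, and the default credentials are placeholders.

```shell
# Run an instant PromQL query against the Prometheus HTTP API.
# PROM_URL / PROM_AUTH override the defaults; CURL can be pointed at a
# stub for dry runs. --data-urlencode handles braces and quotes in PromQL.
prom_query() {
  "${CURL:-curl}" -sS -G -u "${PROM_AUTH:-admin:password}" \
    --data-urlencode "query=$1" \
    "${PROM_URL:-https://prometheus.michaelschiemer.de}/api/v1/query"
}

# Usage:
#   prom_query 'up'
#   prom_query 'sum(rate(traefik_service_requests_total[5m])) by (service)'
```

The response is JSON; piping it through a JSON formatter makes the result vector easier to read.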

Alert Management

Configured Alerts

Alerts are defined in prometheus/alerts.yml:

  1. HostHighCPU: CPU usage > 80% for 5 minutes
  2. HostHighMemory: Memory usage > 80% for 5 minutes
  3. HostDiskSpaceLow: Disk usage > 80%
  4. ContainerHighCPU: Container CPU > 80% for 5 minutes
  5. ContainerHighMemory: Container memory > 80% for 5 minutes
  6. ServiceDown: Service unavailable
  7. HighErrorRate: Error rate > 5% for 5 minutes
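
For orientation, a rule such as HostHighCPU might look like the sketch below. This is an assumption built from the threshold listed above and the CPU query in the PromQL examples; the actual contents of prometheus/alerts.yml may differ, and the group name, labels, annotation text and the helper name write_sample_alert are illustrative.

```shell
# Write a sample alert rule file. The quoted 'EOF' keeps the YAML literal.
write_sample_alert() {
  cat > "$1" <<'EOF'
groups:
  - name: host
    rules:
      - alert: HostHighCPU
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80% for 5 minutes"
EOF
}

# Usage:
#   write_sample_alert /tmp/alerts-sample.yml
#   promtool check rules /tmp/alerts-sample.yml   # if promtool is installed
```

The `for: 5m` clause is what turns "CPU usage > 80%" into "for 5 minutes": the expression must stay true for that long before the alert fires.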

View Active Alerts

# Via Prometheus UI
https://prometheus.michaelschiemer.de/alerts

# Via API
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/alerts

# Check alert rules
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/rules

Silence Alerts

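Silences are an Alertmanager feature; Prometheus itself has no silence UI or API. Create silences in the Alertmanager UI or through its API during maintenance:

# Silence HostHighCPU for one hour via the Alertmanager v2 API (example)
# Assumes Alertmanager is reachable at http://alertmanager:9093 and GNU date is available
START=$(date -u +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u -d '+1 hour' +%Y-%m-%dT%H:%M:%SZ)
curl -X POST http://alertmanager:9093/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d "{\"matchers\":[{\"name\":\"alertname\",\"value\":\"HostHighCPU\",\"isRegex\":false}],\"startsAt\":\"$START\",\"endsAt\":\"$END\",\"createdBy\":\"admin\",\"comment\":\"maintenance\"}"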

Portainer Usage

Container Management

  1. Navigate to https://portainer.michaelschiemer.de
  2. Select "Local" environment
  3. Go to "Containers" section
  4. Available actions:
    • Start/Stop/Restart containers
    • View logs (live stream)
    • Inspect container details
    • Execute commands in containers
    • View resource statistics

Stack Management

  1. Go to "Stacks" section
  2. View deployed stacks
  3. Actions available:
    • View stack definition
    • Update stack (edit compose file)
    • Stop/Start entire stack
    • Remove stack

Volume Management

  1. Go to "Volumes" section
  2. View volume details and size
  3. Browse volume contents
  4. Backup/restore volumes

Integration with Other Stacks

Stack 1: Traefik

  • Provides HTTPS reverse proxy for Grafana, Prometheus, Portainer
  • Automatic SSL certificate management
  • BasicAuth middleware for Prometheus

Stack 2: Gitea

  • Monitor Gitea container resources
  • Track HTTP requests to Gitea via Traefik dashboard
  • Alert on Gitea service downtime

Stack 3: Docker Registry

  • Monitor registry container resources
  • Track registry HTTP requests
  • Alert on registry unavailability

Stack 4: Application

  • Monitor PHP-FPM, Nginx, Redis, Worker containers
  • Track application response times
  • Monitor queue worker health

Stack 5: PostgreSQL

  • Monitor database container resources
  • Track PostgreSQL metrics (if postgres_exporter added)
  • Alert on database unavailability

Monitoring Best Practices

1. Regular Dashboard Review

  • Check dashboards daily for anomalies
  • Review error rates and response times
  • Monitor resource utilization trends

2. Alert Configuration

  • Tune alert thresholds based on baseline metrics
  • Avoid alert fatigue (too many non-critical alerts)
  • Document alert response procedures

3. Capacity Planning

  • Review resource usage trends weekly
  • Plan capacity upgrades before hitting limits
  • Monitor growth rates for proactive scaling

4. Performance Optimization

  • Identify containers with high resource usage
  • Optimize slow endpoints (high p95/p99 latency)
  • Balance load across services

5. Security Monitoring

  • Monitor failed authentication attempts
  • Track unusual traffic patterns
  • Review service availability trends

Troubleshooting

Grafana Issues

Dashboard Not Loading

# Check Grafana logs
docker compose logs grafana

# Check Grafana health (assumes port 3000 is reachable from where you run this)
curl http://localhost:3000/api/health

# Restart Grafana
docker compose restart grafana

Missing Metrics

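# Check Prometheus targets (the hostname "prometheus" resolves only inside the
# monitoring network, so run this from a container attached to it)
curl http://prometheus:9090/api/v1/targets

# Look for scrape errors in the Prometheus logs
docker compose logs prometheus | grep -i "scrape"

# Check network connectivity
docker compose exec grafana ping -c 3 prometheus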

Prometheus Issues

Targets Down

# Check target status
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/targets

# Verify target services are running
docker compose ps

# Check Prometheus configuration
docker compose exec prometheus cat /etc/prometheus/prometheus.yml

# Reload configuration (requires Prometheus to run with --web.enable-lifecycle)
curl -X POST -u admin:password https://prometheus.michaelschiemer.de/-/reload

High Memory Usage

# Check Prometheus memory
docker stats prometheus

# Reduce retention period in docker-compose.yml:
# --storage.tsdb.retention.time=7d

# Reduce scrape interval in prometheus.yml:
# scrape_interval: 30s

Node Exporter Issues

No Host Metrics

# Check node-exporter is running
docker compose ps node-exporter

# Test metrics endpoint
curl http://localhost:9100/metrics

# Check Prometheus scraping
docker compose logs prometheus | grep node-exporter

cAdvisor Issues

No Container Metrics

# Check cAdvisor is running
docker compose ps cadvisor

# Test metrics endpoint
curl http://localhost:8080/metrics

# Verify Docker socket mount
docker compose exec cadvisor ls -la /var/run/docker.sock

Portainer Issues

Cannot Access UI

# Check Portainer is running
docker compose ps portainer

# Check Traefik routing
docker compose -f ../traefik/docker-compose.yml logs

# Verify the monitoring network exists
docker network ls | grep monitoring

Cannot Connect to Docker

# Verify Docker socket permissions
ls -la /var/run/docker.sock

# Check Portainer logs
docker compose logs portainer

# Restart Portainer
docker compose restart portainer

Performance Tuning

Prometheus Optimization

Reduce Memory Usage

# In docker-compose.yml, adjust retention:
command:
  - '--storage.tsdb.retention.time=7d'  # Reduce from 15d
  - '--storage.tsdb.retention.size=5GB' # Add size limit

Optimize Scrape Intervals

# In prometheus/prometheus.yml:
global:
  scrape_interval: 30s  # Increase from 15s for less load
  evaluation_interval: 30s

Reduce Cardinality

# In prometheus/prometheus.yml, add metric_relabel_configs under the relevant scrape job:
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'unused_metric_.*'
    action: drop

Grafana Optimization

Reduce Query Load

// In dashboard JSON, adjust refresh rate:
"refresh": "1m"  // Increase from 30s

Optimize Panel Queries

  • Use recording rules for expensive queries
  • Reduce time range for heavy queries
  • Use appropriate resolution (step parameter)
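
As an illustration of the first point, a recording rule can precompute the per-service request rate so that dashboard panels read one stored series instead of re-evaluating the aggregation on every refresh. This is a sketch, not a rule shipped with this stack: the group name, the rule name and the helper write_sample_recording_rule are assumptions.

```shell
# Write a sample recording rule file. The quoted 'EOF' keeps the YAML literal.
write_sample_recording_rule() {
  cat > "$1" <<'EOF'
groups:
  - name: traefik_recording
    interval: 30s
    rules:
      - record: service:traefik_service_requests:rate5m
        expr: sum(rate(traefik_service_requests_total[5m])) by (service)
EOF
}

# Usage:
#   write_sample_recording_rule /tmp/recording-sample.yml
#   promtool check rules /tmp/recording-sample.yml   # if promtool is installed
```

The file has to be listed under rule_files in prometheus.yml before Prometheus evaluates it; panels can then query service:traefik_service_requests:rate5m directly.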

Storage Optimization

Prometheus Data Volume

# Check current size
du -sh volumes/prometheus/

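# Remove data already marked as deleted (tombstones) from disk.
# Requires --web.enable-admin-api; the prom/prometheus image ships BusyBox wget, not curl.
docker compose exec prometheus wget -qO- --post-data='' http://localhost:9090/api/v1/admin/tsdb/clean_tombstones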

Grafana Data Volume

# Check current size
du -sh volumes/grafana/

# Note: this command resets the admin password; it does not clean sessions or free disk space
docker compose exec grafana grafana-cli admin reset-admin-password <new-password>

Security Considerations

1. Password Security

  • Use strong, randomly generated passwords
  • Store passwords securely (password manager)
  • Rotate passwords regularly
  • Use bcrypt for Prometheus BasicAuth

2. Network Security

  • Monitoring network is internal-only (except exporters)
  • Traefik handles SSL/TLS termination
  • BasicAuth protects Prometheus UI
  • Grafana requires login for dashboard access

3. Access Control

  • Limit Grafana admin access
  • Use Grafana organizations for multi-tenancy
  • Configure Prometheus with read-only access where possible
  • Restrict Portainer access to trusted users

4. Data Security

  • Prometheus stores metrics unencrypted on disk
  • Grafana encrypts passwords in database
  • Backup volumes contain sensitive data
  • Secure backups with encryption

5. Container Security

  • Use official Docker images
  • Keep images updated (security patches)
  • Run containers as non-root where possible
  • Limit container capabilities

Backup and Recovery

Backup Prometheus Data

# Stop Prometheus
docker compose stop prometheus

# Backup data volume
tar czf prometheus-backup-$(date +%Y%m%d).tar.gz -C volumes/prometheus .

# Restart Prometheus
docker compose start prometheus

Backup Grafana Data

# Backup Grafana database and dashboards
# (-T disables the pseudo-TTY so the binary tar stream is not corrupted)
docker compose exec -T grafana tar czf - /var/lib/grafana > grafana-backup-$(date +%Y%m%d).tar.gz
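
A backup is only useful if it can be read back. The sketch below checks that an archive exists, is non-empty and lists cleanly with tar; verify_backup is a hypothetical helper, and it does not validate the Grafana or Prometheus data inside the archive.

```shell
# Succeed only if the archive is a readable gzip tarball with at least one entry.
verify_backup() {
  archive="$1"
  if [ ! -s "$archive" ]; then
    echo "missing or empty: $archive"
    return 1
  fi
  # tar tzf lists entries without extracting; it fails on corrupt archives.
  if entries=$(tar tzf "$archive" 2>/dev/null) && [ -n "$entries" ]; then
    echo "ok: $archive"
  else
    echo "unreadable: $archive"
    return 1
  fi
}

# Usage:
#   verify_backup "grafana-backup-$(date +%Y%m%d).tar.gz"
#   verify_backup "prometheus-backup-$(date +%Y%m%d).tar.gz"
```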

Restore from Backup

# Stop services
docker compose down

# Restore Prometheus data
tar xzf prometheus-backup-YYYYMMDD.tar.gz -C volumes/prometheus/

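# Restore Grafana data (-T lets tar read the archive from stdin), then restart to load it
docker compose up -d grafana
docker compose exec -T grafana tar xzf - -C / < grafana-backup-YYYYMMDD.tar.gz
docker compose restart grafana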

# Start all services
docker compose up -d

Maintenance

Regular Tasks

Daily

  • Review dashboards for anomalies
  • Check active alerts
  • Verify all services are running

Weekly

  • Review resource usage trends
  • Check disk space usage
  • Update passwords if needed

Monthly

  • Review and update alert rules
  • Optimize slow queries
  • Clean up old data if needed
  • Update Docker images

Update Procedure

# Pull latest images
docker compose pull

# Recreate containers with new images
docker compose up -d

# Verify services are healthy
docker compose ps
docker compose logs -f

Support

Logs

# View all logs
docker compose logs

# Follow specific service logs
docker compose logs -f grafana
docker compose logs -f prometheus

# View last 100 lines
docker compose logs --tail=100

Health Checks

# Check service health
docker compose ps

# Test endpoints
curl http://localhost:9090/-/healthy  # Prometheus
curl http://localhost:3000/api/health # Grafana

# Check metrics
curl http://localhost:9100/metrics    # Node Exporter
curl http://localhost:8080/metrics    # cAdvisor
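
The individual checks can be combined into one pass that reports each endpoint and returns a non-zero status if any fails. This is a sketch assuming the ports are reachable from where it runs; check_health and the CURL override are illustrative, not part of this stack.

```shell
# Probe each "name=url" pair and print one status line per service.
# Returns non-zero if any probe fails. CURL can be overridden with a stub.
check_health() {
  failed=0
  for pair in "$@"; do
    name=${pair%%=*}   # text before the first "="
    url=${pair#*=}     # text after the first "="
    if "${CURL:-curl}" -fsS -o /dev/null --max-time 5 "$url"; then
      echo "$name: ok"
    else
      echo "$name: FAILED"
      failed=1
    fi
  done
  return "$failed"
}

# Usage:
#   check_health \
#     prometheus=http://localhost:9090/-/healthy \
#     grafana=http://localhost:3000/api/health \
#     node-exporter=http://localhost:9100/metrics \
#     cadvisor=http://localhost:8080/metrics
```

The exit status makes the function usable from cron jobs or deployment scripts that should stop when a service is unhealthy.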

Stack Version: 1.0
Last Updated: 2025-01-30
Maintained By: DevOps Team