Stack 6: Monitoring (Portainer + Grafana + Prometheus)

Comprehensive monitoring stack for infrastructure and application observability.

Overview

This stack provides complete monitoring and visualization capabilities for the entire infrastructure:

  • Prometheus: Time-series metrics collection and alerting
  • Grafana: Metrics visualization with pre-configured dashboards
  • Portainer: Container management UI
  • Node Exporter: Host system metrics (CPU, memory, disk, network)
  • cAdvisor: Container resource usage metrics
  • Alertmanager: Alert routing and management (via Prometheus)

Features

Prometheus

  • Multi-target scraping (node-exporter, cadvisor, traefik)
  • 15-second scrape interval for near real-time metrics
  • 15-day retention period
  • Pre-configured alert rules for critical conditions
  • Built-in alerting engine
  • Service discovery via static configs
  • HTTPS support with BasicAuth protection

Grafana

  • Pre-configured Prometheus datasource
  • Three comprehensive dashboards:
    • Docker Containers: Container CPU, memory, network I/O, restarts
    • Host System: System CPU, memory, disk, network, uptime
    • Traefik: Request rates, response times, status codes, error rates
  • Auto-provisioning (no manual configuration needed)
  • HTTPS access via Traefik
  • 30-second auto-refresh
  • Dark theme for reduced eye strain

Portainer

  • Web-based Docker management UI
  • Container start/stop/restart/logs
  • Stack management and deployment
  • Volume and network management
  • Resource usage visualization
  • HTTPS access via Traefik

Node Exporter

  • Host system metrics:
    • CPU usage by core and mode
    • Memory usage and available memory
    • Disk usage by filesystem
    • Network I/O by interface
    • System load averages
    • System uptime

cAdvisor

  • Container metrics:
    • CPU usage per container
    • Memory usage per container
    • Network I/O per container
    • Disk I/O per container
    • Container restart counts
    • Container health status

Services

Service        Domain                          Port        Purpose
Grafana        grafana.michaelschiemer.de      3000        Metrics visualization
Prometheus     prometheus.michaelschiemer.de   9090        Metrics collection
Portainer      portainer.michaelschiemer.de    9000/9443   Container management
Node Exporter  -                               9100        Host metrics (internal)
cAdvisor       -                               8080        Container metrics (internal)

Prerequisites

  • Traefik stack deployed and running (Stack 1)
  • Docker networks: traefik-public, monitoring
  • Docker Swarm initialized (if using swarm mode)
  • Domain DNS configured (grafana/prometheus/portainer subdomains)

Directory Structure

monitoring/
├── docker-compose.yml              # Main stack definition
├── .env.example                    # Environment template
├── prometheus/
│   ├── prometheus.yml             # Prometheus configuration
│   └── alerts.yml                 # Alert rules
├── grafana/
│   ├── provisioning/
│   │   ├── datasources/
│   │   │   └── prometheus.yml     # Auto-configured datasource
│   │   └── dashboards/
│   │       └── dashboard.yml      # Dashboard provisioning
│   └── dashboards/
│       ├── docker-containers.json # Container metrics dashboard
│       ├── host-system.json       # Host metrics dashboard
│       └── traefik.json          # Traefik metrics dashboard
└── README.md                       # This file

Configuration

1. Create Environment File

cp .env.example .env

2. Configure Environment Variables

Edit .env and set the following variables:

# Domain Configuration
DOMAIN=michaelschiemer.de

# Grafana Configuration
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=<generate-strong-password>

# Prometheus Configuration
PROMETHEUS_USER=admin
PROMETHEUS_PASSWORD=<generate-strong-password>

# Portainer Configuration
PORTAINER_ADMIN_PASSWORD=<generate-strong-password>

# Network Configuration
TRAEFIK_NETWORK=traefik-public
MONITORING_NETWORK=monitoring
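
Before deploying, it can help to confirm that every variable above is set and that no placeholder was left in place. The following is a minimal sketch: the function name check_env is hypothetical, and the variable list is copied from the template above.

```shell
#!/bin/sh
# Hypothetical helper: verify that a .env file defines every variable
# listed in this README and that no <placeholder> values remain.
check_env() {
    env_file="${1:-.env}"
    missing=0
    for var in DOMAIN GRAFANA_ADMIN_USER GRAFANA_ADMIN_PASSWORD \
               PROMETHEUS_USER PROMETHEUS_PASSWORD PORTAINER_ADMIN_PASSWORD \
               TRAEFIK_NETWORK MONITORING_NETWORK; do
        # Take the last assignment of the variable, if any.
        value=$(grep -E "^${var}=" "$env_file" | tail -n 1 | cut -d '=' -f 2-)
        case "$value" in
            ""|"<"*">")
                echo "unset or placeholder: $var" >&2
                missing=1
                ;;
        esac
    done
    return "$missing"
}
```

Running `check_env .env && docker compose up -d` would then refuse to deploy while a placeholder is still present.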

3. Generate Strong Passwords

# Generate random passwords
openssl rand -base64 32

# For Prometheus BasicAuth (bcrypt hash)
docker run --rm httpd:alpine htpasswd -nbB admin "your-password" | cut -d ":" -f 2

4. Update Traefik BasicAuth (Optional)

If using Prometheus BasicAuth, add the bcrypt hash to the Traefik labels in docker-compose.yml. Docker Compose interpolates $, so double every $ in the hash:

- "traefik.http.middlewares.prometheus-auth.basicauth.users=admin:$$2y$$05$$..."
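
The doubled `$` characters in the label above can be produced with a small filter rather than by hand. This is a minimal sketch; escape_for_compose is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: double every "$" read from stdin so Docker Compose
# passes the bcrypt hash through literally instead of interpolating it.
escape_for_compose() {
    sed 's/\$/$$/g'
}

# Usage (requires Docker; prints a "user:hash" entry ready for the label):
#   docker run --rm httpd:alpine htpasswd -nbB admin "your-password" | escape_for_compose
```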

Deployment

Deploy Stack

cd /home/michael/dev/michaelschiemer/deployment/stacks/monitoring

# Deploy with Docker Compose
docker compose up -d

# Or with Docker Stack (Swarm mode). Unlike Compose, this does not read .env,
# so export the variables into the shell first
docker stack deploy -c docker-compose.yml monitoring

Verify Deployment

# Check running containers
docker compose ps

# Check service logs
docker compose logs -f grafana
docker compose logs -f prometheus

# Check Prometheus targets
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/targets
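
The targets endpoint returns a large JSON document, so a quick count of unhealthy targets can be easier to read. The sketch below is a rough text match that assumes the compact JSON Prometheus normally emits; count_down_targets is a hypothetical name, and a JSON-aware tool would be more robust.

```shell
#!/bin/sh
# Hypothetical helper: count targets reported as down in the JSON
# returned by /api/v1/targets (read from stdin).
count_down_targets() {
    # Plain text match on the "health" field; tolerates optional spaces.
    grep -o '"health": *"down"' | wc -l | tr -d ' '
}

# Usage:
#   curl -s -u admin:password https://prometheus.michaelschiemer.de/api/v1/targets | count_down_targets
```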

Initial Access

  1. Grafana: https://grafana.michaelschiemer.de

    • Login: admin / <GRAFANA_ADMIN_PASSWORD>
    • Dashboards are pre-loaded and ready to use
  2. Prometheus: https://prometheus.michaelschiemer.de

    • BasicAuth: admin / <PROMETHEUS_PASSWORD>
    • Check targets at /targets
    • View alerts at /alerts
  3. Portainer: https://portainer.michaelschiemer.de

    • First login: Set admin password
    • Connect to local Docker environment

Usage

Grafana Dashboards

Docker Containers Dashboard

Access: https://grafana.michaelschiemer.de/d/docker-containers

Metrics Displayed:

  • Container CPU Usage % (per container, timeseries)
  • Container Memory Usage (bytes per container, timeseries)
  • Containers Running (current count, stat)
  • Container Restarts in 5m (rate with thresholds, stat)
  • Container Network I/O (RX/TX per container, timeseries)

Use Cases:

  • Identify containers with high resource usage
  • Monitor container stability (restart rates)
  • Track network bandwidth consumption
  • Verify all expected containers are running

Host System Dashboard

Access: https://grafana.michaelschiemer.de/d/host-system

Metrics Displayed:

  • CPU Usage % (historical and current)
  • Memory Usage % (historical and current)
  • Disk Usage % (root filesystem, historical and current)
  • Network I/O (RX/TX by interface)
  • System Uptime (seconds since boot)

Thresholds:

  • Green: < 80% usage
  • Yellow: 80-90% usage
  • Red: > 90% usage
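
For a command-line spot check against the same bands, the following sketch maps a percentage to a colour and reads root filesystem usage from df. Both function names are hypothetical, and the boundaries follow the list above (80 and 90 count as yellow).

```shell
#!/bin/sh
# Hypothetical helper: map an integer usage percentage to the dashboard
# colour bands (green < 80, yellow 80-90, red > 90).
usage_band() {
    pct="$1"
    if [ "$pct" -lt 80 ]; then
        echo green
    elif [ "$pct" -le 90 ]; then
        echo yellow
    else
        echo red
    fi
}

# Hypothetical helper: current root filesystem usage as an integer percent.
root_disk_pct() {
    df -P / | awk 'NR==2 { sub("%", "", $5); print $5 }'
}

# Usage:
#   usage_band "$(root_disk_pct)"
```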

Use Cases:

  • Monitor server health and resource utilization
  • Identify resource bottlenecks
  • Plan capacity upgrades
  • Track system stability (uptime)

Traefik Dashboard

Access: https://grafana.michaelschiemer.de/d/traefik

Metrics Displayed:

  • Request Rate by Service (req/s, timeseries)
  • Response Time p95/p99 (milliseconds, timeseries)
  • HTTP Status Codes (2xx/4xx/5xx stacked, color-coded)
  • Service Status (Up/Down per service)
  • Requests per Minute (total)
  • 4xx Error Rate (percentage)
  • 5xx Error Rate (percentage)
  • Active Services (count)

Thresholds:

  • 4xx errors: Green < 5%, Yellow < 10%, Red ≥ 10%
  • 5xx errors: Green < 1%, Yellow < 5%, Red ≥ 5%

Use Cases:

  • Monitor HTTP traffic patterns
  • Identify performance issues (high latency)
  • Track error rates and types
  • Verify service availability

Prometheus Queries

Common PromQL Examples

CPU Usage:

# Overall CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Per-core CPU usage
100 - (avg by (cpu) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory Usage:

# Memory usage percentage
100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

# Memory available in GB
node_memory_MemAvailable_bytes / 1024 / 1024 / 1024

Disk Usage:

# Disk usage percentage
100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100)

# Disk I/O utilization (fraction of time each device was busy)
rate(node_disk_io_time_seconds_total[5m])

Container Metrics:

# Container CPU usage
sum(rate(container_cpu_usage_seconds_total{name!~".*exporter.*"}[5m])) by (name) * 100

# Container memory usage
sum(container_memory_usage_bytes{name!~".*exporter.*"}) by (name)

# Container network I/O
rate(container_network_receive_bytes_total[5m])
rate(container_network_transmit_bytes_total[5m])

Traefik Metrics:

# Request rate by service
sum(rate(traefik_service_requests_total[5m])) by (service)

# Response time percentiles
histogram_quantile(0.95, sum(rate(traefik_service_request_duration_seconds_bucket[5m])) by (service, le))

# Error rate
sum(rate(traefik_service_requests_total{code=~"5.."}[5m])) / sum(rate(traefik_service_requests_total[5m])) * 100

Alert Management

Configured Alerts

Alerts are defined in prometheus/alerts.yml:

  1. HostHighCPU: CPU usage > 80% for 5 minutes
  2. HostHighMemory: Memory usage > 80% for 5 minutes
  3. HostDiskSpaceLow: Disk usage > 80%
  4. ContainerHighCPU: Container CPU > 80% for 5 minutes
  5. ContainerHighMemory: Container memory > 80% for 5 minutes
  6. ServiceDown: Service unavailable
  7. HighErrorRate: Error rate > 5% for 5 minutes

View Active Alerts

# Via Prometheus UI
https://prometheus.michaelschiemer.de/alerts

# Via API
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/alerts

# Check alert rules
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/rules

Silence Alerts

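Silences are managed by Alertmanager, not Prometheus; the Prometheus API has no endpoint for creating them. Use the Alertmanager UI or API to silence alerts during maintenance:

# Silence via the Alertmanager v2 API (example: alertmanager:9093 is an assumed address
# and the timestamps are sample values, adjust both to your setup)
curl -X POST http://alertmanager:9093/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d '{"matchers":[{"name":"alertname","value":"HostHighCPU","isRegex":false}],"startsAt":"2025-01-30T12:00:00Z","endsAt":"2025-01-30T13:00:00Z","createdBy":"admin","comment":"maintenance"}'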
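
A silence needs start and end timestamps in RFC 3339 form, which are tedious to type by hand. As a sketch, the helper below builds a request body for Alertmanager's POST /api/v2/silences endpoint. The name silence_payload, the createdBy and comment values, and the alertmanager:9093 address in the usage line are assumptions, not taken from this stack.

```shell
#!/bin/sh
# Hypothetical helper: print a JSON body for Alertmanager's
# POST /api/v2/silences endpoint, starting now and lasting N hours.
# Needs a date that accepts "-d @epoch" (GNU coreutils or BusyBox).
# The alert name is inserted as-is, so it must not contain quotes.
silence_payload() {
    alertname="$1"
    hours="${2:-1}"
    now=$(date -u +%s)
    end=$((now + hours * 3600))
    starts_at=$(date -u -d "@$now" +%Y-%m-%dT%H:%M:%SZ)
    ends_at=$(date -u -d "@$end" +%Y-%m-%dT%H:%M:%SZ)
    printf '{"matchers":[{"name":"alertname","value":"%s","isRegex":false}],' "$alertname"
    printf '"startsAt":"%s","endsAt":"%s",' "$starts_at" "$ends_at"
    printf '"createdBy":"admin","comment":"maintenance"}\n'
}

# Usage (address is an assumption):
#   silence_payload HostHighCPU 1 | curl -X POST http://alertmanager:9093/api/v2/silences \
#     -H 'Content-Type: application/json' -d @-
```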

Portainer Usage

Container Management

  1. Navigate to https://portainer.michaelschiemer.de
  2. Select "Local" environment
  3. Go to "Containers" section
  4. Available actions:
    • Start/Stop/Restart containers
    • View logs (live stream)
    • Inspect container details
    • Execute commands in containers
    • View resource statistics

Stack Management

  1. Go to "Stacks" section
  2. View deployed stacks
  3. Actions available:
    • View stack definition
    • Update stack (edit compose file)
    • Stop/Start entire stack
    • Remove stack

Volume Management

  1. Go to "Volumes" section
  2. View volume details and size
  3. Browse volume contents
  4. Backup/restore volumes

Integration with Other Stacks

Stack 1: Traefik

  • Provides HTTPS reverse proxy for Grafana, Prometheus, Portainer
  • Automatic SSL certificate management
  • BasicAuth middleware for Prometheus

Stack 2: Gitea

  • Monitor Gitea container resources
  • Track HTTP requests to Gitea via Traefik dashboard
  • Alert on Gitea service downtime

Stack 3: Docker Registry

  • Monitor registry container resources
  • Track registry HTTP requests
  • Alert on registry unavailability

Stack 4: Application

  • Monitor PHP-FPM, Nginx, Redis, Worker containers
  • Track application response times
  • Monitor queue worker health

Stack 5: PostgreSQL

  • Monitor database container resources
  • Track PostgreSQL metrics (if postgres_exporter added)
  • Alert on database unavailability

Monitoring Best Practices

1. Regular Dashboard Review

  • Check dashboards daily for anomalies
  • Review error rates and response times
  • Monitor resource utilization trends

2. Alert Configuration

  • Tune alert thresholds based on baseline metrics
  • Avoid alert fatigue (too many non-critical alerts)
  • Document alert response procedures

3. Capacity Planning

  • Review resource usage trends weekly
  • Plan capacity upgrades before hitting limits
  • Monitor growth rates for proactive scaling

4. Performance Optimization

  • Identify containers with high resource usage
  • Optimize slow endpoints (high p95/p99 latency)
  • Balance load across services

5. Security Monitoring

  • Monitor failed authentication attempts
  • Track unusual traffic patterns
  • Review service availability trends

Troubleshooting

Grafana Issues

Dashboard Not Loading

# Check Grafana logs
docker compose logs grafana

# Check Grafana health (this confirms Grafana itself is up, not the datasource)
curl http://localhost:3000/api/health

# Restart Grafana
docker compose restart grafana

Missing Metrics

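# Check that Prometheus targets are up
# (the hostname "prometheus" resolves only inside the Docker network, so use the public URL from the host)
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/targets

# Look for scrape errors in the Prometheus logs
docker compose logs prometheus | grep -i "scrape"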

# Check network connectivity
docker compose exec grafana ping prometheus

Prometheus Issues

Targets Down

# Check target status
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/targets

# Verify target services are running
docker compose ps

# Check Prometheus configuration
docker compose exec prometheus cat /etc/prometheus/prometheus.yml

# Reload configuration (requires Prometheus to run with --web.enable-lifecycle)
curl -X POST -u admin:password https://prometheus.michaelschiemer.de/-/reload

High Memory Usage

# Check Prometheus memory
docker stats prometheus

# Reduce retention period in docker-compose.yml:
# --storage.tsdb.retention.time=7d

# Reduce scrape interval in prometheus.yml:
# scrape_interval: 30s

Node Exporter Issues

No Host Metrics

# Check node-exporter is running
docker compose ps node-exporter

# Test metrics endpoint
curl http://localhost:9100/metrics

# Check Prometheus scraping
docker compose logs prometheus | grep node-exporter

cAdvisor Issues

No Container Metrics

# Check cAdvisor is running
docker compose ps cadvisor

# Test metrics endpoint
curl http://localhost:8080/metrics

# Verify Docker socket mount
docker compose exec cadvisor ls -la /var/run/docker.sock

Portainer Issues

Cannot Access UI

# Check Portainer is running
docker compose ps portainer

# Check Traefik routing
docker compose -f ../traefik/docker-compose.yml logs

# Verify network connectivity
docker network ls | grep monitoring

Cannot Connect to Docker

# Verify Docker socket permissions
ls -la /var/run/docker.sock

# Check Portainer logs
docker compose logs portainer

# Restart Portainer
docker compose restart portainer

Performance Tuning

Prometheus Optimization

Reduce Memory Usage

# In docker-compose.yml, adjust retention:
command:
  - '--storage.tsdb.retention.time=7d'  # Reduce from 15d
  - '--storage.tsdb.retention.size=5GB' # Add size limit
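
To choose a retention value, the Prometheus storage documentation gives a rough sizing formula: retention in seconds times ingested samples per second times bytes per sample, with roughly 1 to 2 bytes per sample. A minimal sketch of that arithmetic follows; estimate_tsdb_bytes is a hypothetical name and the result is only an estimate.

```shell
#!/bin/sh
# Rough on-disk size in bytes for a given retention:
#   retention_seconds * samples_per_second * bytes_per_sample
estimate_tsdb_bytes() {
    retention_days="$1"
    samples_per_second="$2"
    bytes_per_sample="${3:-2}"   # upper end of the 1-2 bytes cited in the docs
    echo $(( retention_days * 86400 * samples_per_second * bytes_per_sample ))
}

# Usage: 15 days vs. 7 days at 1000 samples/s
#   estimate_tsdb_bytes 15 1000
#   estimate_tsdb_bytes 7 1000
```

The ingestion rate for a running instance can be read with the query rate(prometheus_tsdb_head_samples_appended_total[5m]).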

Optimize Scrape Intervals

# In prometheus/prometheus.yml:
global:
  scrape_interval: 30s  # Increase from 15s for less load
  evaluation_interval: 30s

Reduce Cardinality

# In prometheus/prometheus.yml, add metric_relabel_configs under a scrape job
# (the job name below is an example):
scrape_configs:
  - job_name: 'cadvisor'
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'unused_metric_.*'
        action: drop

Grafana Optimization

Reduce Query Load

In the dashboard JSON, raise the refresh interval from 30s (JSON itself does not allow comments):

"refresh": "1m"

Optimize Panel Queries

  • Use recording rules for expensive queries
  • Reduce time range for heavy queries
  • Use appropriate resolution (step parameter)

Storage Optimization

Prometheus Data Volume

# Check current size
du -sh volumes/prometheus/

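# Remove already-deleted series from disk (requires --web.enable-admin-api; this does not compact live data).
# Called from the host because the official Prometheus image does not include curl
curl -X POST -u admin:password https://prometheus.michaelschiemer.de/api/v1/admin/tsdb/clean_tombstones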

Grafana Data Volume

# Check current size
du -sh volumes/grafana/

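# Reset the admin password (note: this does not clean sessions or free disk space;
# the new password is a required argument)
docker compose exec grafana grafana-cli admin reset-admin-password <new-password>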

Security Considerations

1. Password Security

  • Use strong, randomly generated passwords
  • Store passwords securely (password manager)
  • Rotate passwords regularly
  • Use bcrypt for Prometheus BasicAuth

2. Network Security

  • Monitoring network is internal-only (except exporters)
  • Traefik handles SSL/TLS termination
  • BasicAuth protects Prometheus UI
  • Grafana requires login for dashboard access

3. Access Control

  • Limit Grafana admin access
  • Use Grafana organizations for multi-tenancy
  • Configure Prometheus with read-only access where possible
  • Restrict Portainer access to trusted users

4. Data Security

  • Prometheus stores metrics in plain text
  • Grafana encrypts passwords in database
  • Backup volumes contain sensitive data
  • Secure backups with encryption

5. Container Security

  • Use official Docker images
  • Keep images updated (security patches)
  • Run containers as non-root where possible
  • Limit container capabilities

Backup and Recovery

Backup Prometheus Data

# Stop Prometheus
docker compose stop prometheus

# Backup data volume
tar czf prometheus-backup-$(date +%Y%m%d).tar.gz -C volumes/prometheus .

# Restart Prometheus
docker compose start prometheus

Backup Grafana Data

# Backup Grafana database and dashboards
# (-T disables pseudo-TTY allocation so the binary stream passes through unmodified)
docker compose exec -T grafana tar czf - /var/lib/grafana > grafana-backup-$(date +%Y%m%d).tar.gz
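
A backup is only useful if it can be read back. As a hedged sketch, the helper below checks that an archive exists, is non-empty, and can be listed end to end; verify_backup is a hypothetical name and the check does not compare contents against the live data.

```shell
#!/bin/sh
# Hypothetical helper: confirm a .tar.gz backup is readable before
# relying on it for a restore.
verify_backup() {
    archive="$1"
    [ -s "$archive" ] || { echo "missing or empty: $archive" >&2; return 1; }
    # Listing the archive forces gzip and tar to read it completely.
    tar tzf "$archive" > /dev/null 2>&1 || { echo "unreadable: $archive" >&2; return 1; }
}

# Usage:
#   verify_backup "prometheus-backup-$(date +%Y%m%d).tar.gz"
#   verify_backup "grafana-backup-$(date +%Y%m%d).tar.gz"
```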

Restore from Backup

# Stop services
docker compose down

# Restore Prometheus data
tar xzf prometheus-backup-YYYYMMDD.tar.gz -C volumes/prometheus/

# Restore Grafana data (-T is needed so the archive can be read from stdin)
docker compose up -d grafana
docker compose exec -T grafana tar xzf - -C / < grafana-backup-YYYYMMDD.tar.gz

# Start all services
docker compose up -d

Maintenance

Regular Tasks

Daily

  • Review dashboards for anomalies
  • Check active alerts
  • Verify all services are running

Weekly

  • Review resource usage trends
  • Check disk space usage
  • Update passwords if needed

Monthly

  • Review and update alert rules
  • Optimize slow queries
  • Clean up old data if needed
  • Update Docker images

Update Procedure

# Pull latest images
docker compose pull

# Recreate containers with new images
docker compose up -d

# Verify services are healthy
docker compose ps
docker compose logs -f

Support

Documentation

Logs

# View all logs
docker compose logs

# Follow specific service logs
docker compose logs -f grafana
docker compose logs -f prometheus

# View last 100 lines
docker compose logs --tail=100

Health Checks

# Check service health
docker compose ps

# Test endpoints
curl http://localhost:9090/-/healthy  # Prometheus
curl http://localhost:3000/api/health # Grafana

# Check metrics
curl http://localhost:9100/metrics    # Node Exporter
curl http://localhost:8080/metrics    # cAdvisor
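
Right after a deploy or update the endpoints may need a few seconds before they answer, so a single curl can fail spuriously. The sketch below wraps any command in a simple retry loop; retry is a hypothetical helper, and the attempt count and delay in the usage lines are arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: run a command until it succeeds, up to a maximum
# number of attempts, sleeping between tries.
#   retry <attempts> <delay-seconds> <command> [args...]
retry() {
    attempts="$1"
    delay="$2"
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        # Sleep only if another attempt follows.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Usage:
#   retry 10 3 curl -fsS http://localhost:9090/-/healthy
#   retry 10 3 curl -fsS http://localhost:3000/api/health
```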

Stack Version: 1.0
Last Updated: 2025-01-30
Maintained By: DevOps Team