# Stack 6: Monitoring (Portainer + Grafana + Prometheus)

Comprehensive monitoring stack for infrastructure and application observability.

## Overview

This stack provides complete monitoring and visualization capabilities for the entire infrastructure:
- **Prometheus**: Time-series metrics collection and alerting
- **Grafana**: Metrics visualization with pre-configured dashboards
- **Portainer**: Container management UI
- **Node Exporter**: Host system metrics (CPU, memory, disk, network)
- **cAdvisor**: Container resource usage metrics
- **Alertmanager**: Alert routing and management (via Prometheus)

## Features

### Prometheus
- Multi-target scraping (node-exporter, cadvisor, traefik)
- 15-second scrape interval for near real-time metrics
- 15-day retention period
- Pre-configured alert rules for critical conditions
- Built-in alerting engine
- Static target configuration (`static_configs`)
- HTTPS support with BasicAuth protection

### Grafana
- Pre-configured Prometheus datasource
- Three comprehensive dashboards:
  - **Docker Containers**: Container CPU, memory, network I/O, restarts
  - **Host System**: System CPU, memory, disk, network, uptime
  - **Traefik**: Request rates, response times, status codes, error rates
- Auto-provisioning (no manual configuration needed)
- HTTPS access via Traefik
- 30-second auto-refresh
- Dark theme for reduced eye strain

### Portainer
- Web-based Docker management UI
- Container start/stop/restart/logs
- Stack management and deployment
- Volume and network management
- Resource usage visualization
- HTTPS access via Traefik

### Node Exporter
- Host system metrics:
  - CPU usage by core and mode
  - Memory usage and available memory
  - Disk usage by filesystem
  - Network I/O by interface
  - System load averages
  - System uptime

### cAdvisor
- Container metrics:
  - CPU usage per container
  - Memory usage per container
  - Network I/O per container
  - Disk I/O per container
  - Container restart counts
  - Container health status

## Services

| Service | Domain | Port | Purpose |
|---------|--------|------|---------|
| Grafana | grafana.michaelschiemer.de | 3000 | Metrics visualization |
| Prometheus | prometheus.michaelschiemer.de | 9090 | Metrics collection |
| Portainer | portainer.michaelschiemer.de | 9000/9443 | Container management |
| Node Exporter | - | 9100 | Host metrics (internal) |
| cAdvisor | - | 8080 | Container metrics (internal) |

## Prerequisites

- Traefik stack deployed and running (Stack 1)
- Docker networks: `traefik-public`, `monitoring`
- Docker Swarm initialized (if using swarm mode)
- Domain DNS configured (grafana/prometheus/portainer subdomains)
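
The two networks listed above must exist before the stack starts. A minimal sketch of an idempotent helper follows, assuming default bridge networks are sufficient (a Swarm deployment would typically need `--driver overlay` instead). The function name `ensure_network` is hypothetical:

```bash
# Create a Docker network only if it does not exist yet.
ensure_network() {
  name="$1"
  if docker network inspect "$name" >/dev/null 2>&1; then
    echo "network $name already exists"
  else
    docker network create "$name"
  fi
}

# Usage:
#   ensure_network traefik-public
#   ensure_network monitoring
```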

## Directory Structure

```
monitoring/
├── docker-compose.yml              # Main stack definition
├── .env.example                    # Environment template
├── prometheus/
│   ├── prometheus.yml             # Prometheus configuration
│   └── alerts.yml                 # Alert rules
├── grafana/
│   ├── provisioning/
│   │   ├── datasources/
│   │   │   └── prometheus.yml     # Auto-configured datasource
│   │   └── dashboards/
│   │       └── dashboard.yml      # Dashboard provisioning
│   └── dashboards/
│       ├── docker-containers.json # Container metrics dashboard
│       ├── host-system.json       # Host metrics dashboard
│       └── traefik.json          # Traefik metrics dashboard
└── README.md                       # This file
```

## Configuration

### 1. Create Environment File

```bash
cp .env.example .env
```

### 2. Configure Environment Variables

Edit `.env` and set the following variables:

```bash
# Domain Configuration
DOMAIN=michaelschiemer.de

# Grafana Configuration
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=<generate-strong-password>

# Prometheus Configuration
PROMETHEUS_USER=admin
PROMETHEUS_PASSWORD=<generate-strong-password>

# Portainer Configuration
PORTAINER_ADMIN_PASSWORD=<generate-strong-password>

# Network Configuration
TRAEFIK_NETWORK=traefik-public
MONITORING_NETWORK=monitoring
```

### 3. Generate Strong Passwords

```bash
# Generate random passwords
openssl rand -base64 32

# For Prometheus BasicAuth (bcrypt hash)
docker run --rm httpd:alpine htpasswd -nbB admin "your-password" | cut -d ":" -f 2
```
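
Once the generated passwords are pasted into `.env`, a quick check can confirm that no `<...>` placeholder was left behind. This is a sketch only; `check_env_placeholders` is a hypothetical helper, and it assumes placeholders keep the `VARIABLE=<text>` shape used in the template above:

```bash
# Report variables in an env file that still hold a <placeholder> value.
# Returns 0 when none are left, 1 otherwise.
check_env_placeholders() {
  env_file="${1:-.env}"
  if grep -nE '^[A-Za-z_][A-Za-z0-9_]*=<[^>]*>' "$env_file"; then
    echo "placeholders remain in $env_file" >&2
    return 1
  fi
  return 0
}

# Usage:
#   check_env_placeholders .env && docker compose up -d
```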

### 4. Update Traefik BasicAuth (Optional)

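If using Prometheus BasicAuth, add the bcrypt hash to the Traefik labels in `docker-compose.yml`. Every `$` in the hash must be written as `$$` so that Docker Compose does not interpret it as a variable: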

```yaml
- "traefik.http.middlewares.prometheus-auth.basicauth.users=admin:$$2y$$05$$..."
```
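
The `$$` sequences in the label are Compose escaping for the literal `$` characters of a bcrypt hash. The doubling can be done mechanically, as in this small sketch; `escape_for_compose` is a hypothetical helper name:

```bash
# Double every "$" so Docker Compose treats it literally.
escape_for_compose() {
  printf '%s\n' "$1" | sed 's/\$/$$/g'
}

# Usage (prints a user:hash pair ready for the basicauth.users label):
#   escape_for_compose "$(docker run --rm httpd:alpine htpasswd -nbB admin 'your-password')"
```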

## Deployment

### Deploy Stack

```bash
cd /home/michael/dev/michaelschiemer/deployment/stacks/monitoring

# Deploy with Docker Compose
docker compose up -d

# Or with Docker Stack (Swarm mode)
docker stack deploy -c docker-compose.yml monitoring
```

### Verify Deployment

```bash
# Check running containers
docker compose ps

# Check service logs
docker compose logs -f grafana
docker compose logs -f prometheus

# Check Prometheus targets
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/targets
```
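
The raw `/api/v1/targets` response is verbose. A rough way to reduce it to a pass/fail summary is sketched below. It assumes the API returns compact JSON (no spaces around `:`), which is an assumption about the output format rather than a guarantee; a JSON-aware tool would be more robust. `summarize_targets` is a hypothetical helper:

```bash
# Summarize target health from the JSON returned by /api/v1/targets.
# Reads the API response on stdin; exits non-zero if any target is down.
summarize_targets() {
  json="$(cat)"
  up="$(printf '%s' "$json" | { grep -o '"health":"up"' || true; } | wc -l | tr -d ' ')"
  down="$(printf '%s' "$json" | { grep -o '"health":"down"' || true; } | wc -l | tr -d ' ')"
  echo "up=$up down=$down"
  [ "$down" -eq 0 ]
}

# Usage:
#   curl -fsS -u admin:password https://prometheus.michaelschiemer.de/api/v1/targets | summarize_targets
```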

### Initial Access

1. **Grafana**: https://grafana.michaelschiemer.de
   - Login: `admin` / `<GRAFANA_ADMIN_PASSWORD>`
   - Dashboards are pre-loaded and ready to use

2. **Prometheus**: https://prometheus.michaelschiemer.de
   - BasicAuth: `admin` / `<PROMETHEUS_PASSWORD>`
   - Check targets at `/targets`
   - View alerts at `/alerts`

3. **Portainer**: https://portainer.michaelschiemer.de
   - First login: Set admin password
   - Connect to local Docker environment

## Usage

### Grafana Dashboards

#### Docker Containers Dashboard
Access: https://grafana.michaelschiemer.de/d/docker-containers

**Metrics Displayed**:
- Container CPU Usage % (per container, timeseries)
- Container Memory Usage (bytes per container, timeseries)
- Containers Running (current count, stat)
- Container Restarts in 5m (rate with thresholds, stat)
- Container Network I/O (RX/TX per container, timeseries)

**Use Cases**:
- Identify containers with high resource usage
- Monitor container stability (restart rates)
- Track network bandwidth consumption
- Verify all expected containers are running

#### Host System Dashboard
Access: https://grafana.michaelschiemer.de/d/host-system

**Metrics Displayed**:
- CPU Usage % (historical and current)
- Memory Usage % (historical and current)
- Disk Usage % (root filesystem, historical and current)
- Network I/O (RX/TX by interface)
- System Uptime (seconds since boot)

**Thresholds**:
- Green: < 80% usage
- Yellow: 80-90% usage
- Red: > 90% usage
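
The same bands can be reused outside Grafana, for example in a cron check. The sketch below only mirrors the thresholds listed above; it does not read the dashboard definition, and `usage_status` is a hypothetical name:

```bash
# Map a usage percentage to the dashboard color bands:
# green below 80, yellow from 80 to 90, red above 90.
usage_status() {
  awk -v v="$1" 'BEGIN {
    if (v + 0 < 80) print "green"
    else if (v + 0 <= 90) print "yellow"
    else print "red"
  }'
}

# Usage:
#   usage_status 85.2
```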

**Use Cases**:
- Monitor server health and resource utilization
- Identify resource bottlenecks
- Plan capacity upgrades
- Track system stability (uptime)

#### Traefik Dashboard
Access: https://grafana.michaelschiemer.de/d/traefik

**Metrics Displayed**:
- Request Rate by Service (req/s, timeseries)
- Response Time p95/p99 (milliseconds, timeseries)
- HTTP Status Codes (2xx/4xx/5xx stacked, color-coded)
- Service Status (Up/Down per service)
- Requests per Minute (total)
- 4xx Error Rate (percentage)
- 5xx Error Rate (percentage)
- Active Services (count)

**Thresholds**:
- 4xx errors: Green < 5%, Yellow < 10%, Red ≥ 10%
- 5xx errors: Green < 1%, Yellow < 5%, Red ≥ 5%

**Use Cases**:
- Monitor HTTP traffic patterns
- Identify performance issues (high latency)
- Track error rates and types
- Verify service availability

### Prometheus Queries

#### Common PromQL Examples

**CPU Usage**:
```promql
# Overall CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Per-core CPU usage
100 - (avg by (cpu) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

**Memory Usage**:
```promql
# Memory usage percentage
100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

# Memory available in GB
node_memory_MemAvailable_bytes / 1024 / 1024 / 1024
```

**Disk Usage**:
```promql
# Disk usage percentage
100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100)

# Disk I/O utilization (fraction of time each device was busy)
rate(node_disk_io_time_seconds_total[5m])
```

**Container Metrics**:
```promql
# Container CPU usage
sum(rate(container_cpu_usage_seconds_total{name!~".*exporter.*"}[5m])) by (name) * 100

# Container memory usage
sum(container_memory_usage_bytes{name!~".*exporter.*"}) by (name)

# Container network I/O
rate(container_network_receive_bytes_total[5m])
rate(container_network_transmit_bytes_total[5m])
```

**Traefik Metrics**:
```promql
# Request rate by service
sum(rate(traefik_service_requests_total[5m])) by (service)

# Response time percentiles
histogram_quantile(0.95, sum(rate(traefik_service_request_duration_seconds_bucket[5m])) by (service, le))

# Error rate
sum(rate(traefik_service_requests_total{code=~"5.."}[5m])) / sum(rate(traefik_service_requests_total[5m])) * 100
```
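
Any of the expressions above can also be run from a shell through the Prometheus HTTP API (`/api/v1/query`). The wrapper below is a sketch; `prom_query`, `PROM_URL`, and `PROM_AUTH` are hypothetical names, and the defaults simply reuse the URL and placeholder credentials shown elsewhere in this README:

```bash
# Defaults can be overridden from the environment.
PROM_URL="${PROM_URL:-https://prometheus.michaelschiemer.de}"
PROM_AUTH="${PROM_AUTH:-admin:password}"

# Run an instant PromQL query; curl URL-encodes the expression.
prom_query() {
  curl -fsS -G -u "$PROM_AUTH" \
    --data-urlencode "query=$1" \
    "$PROM_URL/api/v1/query"
}

# Usage:
#   prom_query '100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
```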

### Alert Management

#### Configured Alerts

Alerts are defined in `prometheus/alerts.yml`:

1. **HostHighCPU**: CPU usage > 80% for 5 minutes
2. **HostHighMemory**: Memory usage > 80% for 5 minutes
3. **HostDiskSpaceLow**: Disk usage > 80%
4. **ContainerHighCPU**: Container CPU > 80% for 5 minutes
5. **ContainerHighMemory**: Container memory > 80% for 5 minutes
6. **ServiceDown**: Service unavailable
7. **HighErrorRate**: Error rate > 5% for 5 minutes
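
After editing the rules file, it can be validated with `promtool`, which ships in the official Prometheus image. The sketch below assumes the Compose service is named `prometheus` and that the rules are mounted at `/etc/prometheus/alerts.yml` inside the container; both are assumptions about this stack. `check_alert_rules` is a hypothetical helper:

```bash
# Validate alert rule syntax inside the running Prometheus container.
# The default path is an assumption; pass the real mount path if it differs.
check_alert_rules() {
  rules_path="${1:-/etc/prometheus/alerts.yml}"
  docker compose exec -T prometheus promtool check rules "$rules_path"
}

# Usage:
#   check_alert_rules
```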

#### View Active Alerts

```bash
# Via Prometheus UI: https://prometheus.michaelschiemer.de/alerts

# Via API
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/alerts

# Check alert rules
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/rules
```

#### Silence Alerts

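Prometheus itself cannot silence alerts. Silences are created in Alertmanager, either in its UI or with `amtool`, during maintenance:

```bash
# Silence via amtool (example; assumes the Compose service is named "alertmanager")
docker compose exec alertmanager amtool silence add alertname=HostHighCPU \
  --duration=1h \
  --author="admin" \
  --comment="Planned maintenance" \
  --alertmanager.url=http://localhost:9093
```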

### Portainer Usage

#### Container Management

1. Navigate to https://portainer.michaelschiemer.de
2. Select "Local" environment
3. Go to "Containers" section
4. Available actions:
   - Start/Stop/Restart containers
   - View logs (live stream)
   - Inspect container details
   - Execute commands in containers
   - View resource statistics

#### Stack Management

1. Go to "Stacks" section
2. View deployed stacks
3. Actions available:
   - View stack definition
   - Update stack (edit compose file)
   - Stop/Start entire stack
   - Remove stack

#### Volume Management

1. Go to "Volumes" section
2. View volume details and size
3. Browse volume contents
4. Backup/restore volumes

## Integration with Other Stacks

### Stack 1: Traefik
- Provides HTTPS reverse proxy for Grafana, Prometheus, Portainer
- Automatic SSL certificate management
- BasicAuth middleware for Prometheus

### Stack 2: Gitea
- Monitor Gitea container resources
- Track HTTP requests to Gitea via Traefik dashboard
- Alert on Gitea service downtime

### Stack 3: Docker Registry
- Monitor registry container resources
- Track registry HTTP requests
- Alert on registry unavailability

### Stack 4: Application
- Monitor PHP-FPM, Nginx, Redis, Worker containers
- Track application response times
- Monitor queue worker health

### Stack 5: PostgreSQL
- Monitor database container resources
- Track PostgreSQL metrics (if postgres_exporter added)
- Alert on database unavailability

## Monitoring Best Practices

### 1. Regular Dashboard Review
- Check dashboards daily for anomalies
- Review error rates and response times
- Monitor resource utilization trends

### 2. Alert Configuration
- Tune alert thresholds based on baseline metrics
- Avoid alert fatigue (too many non-critical alerts)
- Document alert response procedures

### 3. Capacity Planning
- Review resource usage trends weekly
- Plan capacity upgrades before hitting limits
- Monitor growth rates for proactive scaling

### 4. Performance Optimization
- Identify containers with high resource usage
- Optimize slow endpoints (high p95/p99 latency)
- Balance load across services

### 5. Security Monitoring
- Monitor failed authentication attempts
- Track unusual traffic patterns
- Review service availability trends

## Troubleshooting

### Grafana Issues

#### Dashboard Not Loading
```bash
# Check Grafana logs
docker compose logs grafana

# Verify Grafana itself is healthy
curl http://localhost:3000/api/health

# Restart Grafana
docker compose restart grafana
```

#### Missing Metrics
```bash
# Check Prometheus targets (the "prometheus" hostname resolves only inside the monitoring network)
curl http://prometheus:9090/api/v1/targets

# Look for scrape errors in the Prometheus logs
docker compose logs prometheus | grep -i "scrape"

# Check network connectivity
docker compose exec grafana ping prometheus
```

### Prometheus Issues

#### Targets Down
```bash
# Check target status
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/targets

# Verify target services are running
docker compose ps

# Check Prometheus configuration
docker compose exec prometheus cat /etc/prometheus/prometheus.yml

# Reload configuration (requires Prometheus to run with --web.enable-lifecycle)
curl -X POST -u admin:password https://prometheus.michaelschiemer.de/-/reload
```

#### High Memory Usage
```bash
# Check Prometheus memory
docker stats prometheus

# Reduce retention period in docker-compose.yml:
# --storage.tsdb.retention.time=7d

# Reduce scrape interval in prometheus.yml:
# scrape_interval: 30s
```

### Node Exporter Issues

#### No Host Metrics
```bash
# Check node-exporter is running
docker compose ps node-exporter

# Test metrics endpoint
curl http://localhost:9100/metrics

# Check Prometheus scraping
docker compose logs prometheus | grep node-exporter
```

### cAdvisor Issues

#### No Container Metrics
```bash
# Check cAdvisor is running
docker compose ps cadvisor

# Test metrics endpoint
curl http://localhost:8080/metrics

# Verify Docker socket mount
docker compose exec cadvisor ls -la /var/run/docker.sock
```

### Portainer Issues

#### Cannot Access UI
```bash
# Check Portainer is running
docker compose ps portainer

# Check Traefik routing
docker compose -f ../traefik/docker-compose.yml logs

# Verify network connectivity
docker network ls | grep monitoring
```

#### Cannot Connect to Docker
```bash
# Verify Docker socket permissions
ls -la /var/run/docker.sock

# Check Portainer logs
docker compose logs portainer

# Restart Portainer
docker compose restart portainer
```

## Performance Tuning

### Prometheus Optimization

#### Reduce Memory Usage
```yaml
# In docker-compose.yml, adjust retention:
command:
  - '--storage.tsdb.retention.time=7d'  # Reduce from 15d
  - '--storage.tsdb.retention.size=5GB' # Add size limit
```
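
To pick sensible retention values, the Prometheus storage documentation gives a rough sizing rule: disk space is approximately retention time in seconds, multiplied by ingested samples per second, multiplied by bytes per sample (typically 1 to 2 bytes). The sketch below applies that rule; `estimate_tsdb_bytes` is a hypothetical helper, and the 2-byte default is a conservative assumption:

```bash
# Rough TSDB disk estimate in bytes:
#   retention_seconds * samples_per_second * bytes_per_sample
estimate_tsdb_bytes() {
  retention_days="$1"
  samples_per_second="$2"
  bytes_per_sample="${3:-2}"
  echo $(( retention_days * 86400 * samples_per_second * bytes_per_sample ))
}

# Usage (1000 samples/s kept for 15 days):
#   estimate_tsdb_bytes 15 1000
```

The current ingestion rate can be read from Prometheus itself with `rate(prometheus_tsdb_head_samples_appended_total[5m])`.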

#### Optimize Scrape Intervals
```yaml
# In prometheus/prometheus.yml:
global:
  scrape_interval: 30s  # Increase from 15s for less load
  evaluation_interval: 30s
```

#### Reduce Cardinality
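```yaml
# In prometheus/prometheus.yml, add metric_relabel_configs to the relevant
# scrape job (the job name below is an example):
scrape_configs:
  - job_name: 'cadvisor'
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'unused_metric_.*'
        action: drop
```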

### Grafana Optimization

#### Reduce Query Load
In the dashboard JSON, raise the refresh interval from `30s`:

```json
"refresh": "1m"
```

#### Optimize Panel Queries
- Use recording rules for expensive queries
- Reduce time range for heavy queries
- Use appropriate resolution (step parameter)

### Storage Optimization

#### Prometheus Data Volume
```bash
# Check current size
du -sh volumes/prometheus/

# Remove tombstoned (deleted) series from disk
# (requires Prometheus to run with --web.enable-admin-api)
curl -X POST -u admin:password https://prometheus.michaelschiemer.de/api/v1/admin/tsdb/clean_tombstones
```

#### Grafana Data Volume
```bash
# Check current size
du -sh volumes/grafana/

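# Reset the Grafana admin password if access is lost
# (this does not clean sessions or free disk space)
docker compose exec grafana grafana-cli admin reset-admin-password '<new-password>'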
```

## Security Considerations

### 1. Password Security
- Use strong, randomly generated passwords
- Store passwords securely (password manager)
- Rotate passwords regularly
- Use bcrypt for Prometheus BasicAuth

### 2. Network Security
- Monitoring network is internal-only (except exporters)
- Traefik handles SSL/TLS termination
- BasicAuth protects Prometheus UI
- Grafana requires login for dashboard access

### 3. Access Control
- Limit Grafana admin access
- Use Grafana organizations for multi-tenancy
- Configure Prometheus with read-only access where possible
- Restrict Portainer access to trusted users

### 4. Data Security
- Prometheus stores metrics unencrypted on disk
- Grafana hashes user passwords and encrypts data source credentials in its database
- Backup volumes contain sensitive data
- Secure backups with encryption

### 5. Container Security
- Use official Docker images
- Keep images updated (security patches)
- Run containers as non-root where possible
- Limit container capabilities

## Backup and Recovery

### Backup Prometheus Data
```bash
# Stop Prometheus
docker compose stop prometheus

# Backup data volume
tar czf prometheus-backup-$(date +%Y%m%d).tar.gz -C volumes/prometheus .

# Restart Prometheus
docker compose start prometheus
```

### Backup Grafana Data
```bash
# Backup Grafana database and dashboards
# (-T disables the pseudo-TTY, which would otherwise corrupt the binary stream)
docker compose exec -T grafana tar czf - /var/lib/grafana > grafana-backup-$(date +%Y%m%d).tar.gz
```
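
An archive is only useful if it can be read back. Listing its contents is a cheap integrity check that can follow each backup. This is a sketch; `verify_backup` is a hypothetical helper, and it confirms only that the archive is readable, not that the data inside is consistent:

```bash
# Return 0 if the gzip-compressed tar archive can be read end to end.
verify_backup() {
  archive="$1"
  [ -s "$archive" ] || { echo "missing or empty: $archive" >&2; return 1; }
  if tar tzf "$archive" >/dev/null 2>&1; then
    echo "ok: $archive"
  else
    echo "corrupt: $archive" >&2
    return 1
  fi
}

# Usage:
#   verify_backup "grafana-backup-$(date +%Y%m%d).tar.gz"
```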

### Restore from Backup
```bash
# Stop services
docker compose down

# Restore Prometheus data
tar xzf prometheus-backup-YYYYMMDD.tar.gz -C volumes/prometheus/

# Restore Grafana data (-T is required so the archive can be piped via stdin)
docker compose up -d grafana
docker compose exec -T grafana tar xzf - -C / < grafana-backup-YYYYMMDD.tar.gz

# Start all services
docker compose up -d
```

## Maintenance

### Regular Tasks

#### Daily
- Review dashboards for anomalies
- Check active alerts
- Verify all services are running

#### Weekly
- Review resource usage trends
- Check disk space usage
- Update passwords if needed

#### Monthly
- Review and update alert rules
- Optimize slow queries
- Clean up old data if needed
- Update Docker images

### Update Procedure

```bash
# Pull latest images
docker compose pull

# Recreate containers with new images
docker compose up -d

# Verify services are healthy
docker compose ps
docker compose logs -f
```

## Support

### Documentation
- Prometheus: https://prometheus.io/docs/
- Grafana: https://grafana.com/docs/
- Portainer: https://docs.portainer.io/

### Logs
```bash
# View all logs
docker compose logs

# Follow specific service logs
docker compose logs -f grafana
docker compose logs -f prometheus

# View last 100 lines
docker compose logs --tail=100
```

### Health Checks
```bash
# Check service health
docker compose ps

# Test endpoints
curl http://localhost:9090/-/healthy  # Prometheus
curl http://localhost:3000/api/health # Grafana

# Check metrics
curl http://localhost:9100/metrics    # Node Exporter
curl http://localhost:8080/metrics    # cAdvisor
```
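
After an update or restart, the health endpoints may need a few seconds before they respond. A polling loop avoids false alarms in scripts. The sketch below is one way to do it; `wait_for_healthy` and its default of 30 attempts at 2-second intervals are hypothetical choices:

```bash
# Poll a health URL until it returns a success status or attempts run out.
# Arguments: URL, max attempts (default 30), delay in seconds (default 2).
wait_for_healthy() {
  url="$1"
  attempts="${2:-30}"
  delay="${3:-2}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -fsS -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for $url" >&2
  return 1
}

# Usage:
#   wait_for_healthy http://localhost:9090/-/healthy
#   wait_for_healthy http://localhost:3000/api/health
```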

---

- **Stack Version**: 1.0
- **Last Updated**: 2025-01-30
- **Maintained By**: DevOps Team