# Stack 6: Monitoring (Portainer + Grafana + Prometheus)

Comprehensive monitoring stack for infrastructure and application observability.

## Overview

This stack provides complete monitoring and visualization capabilities for the entire infrastructure:

- **Prometheus**: Time-series metrics collection and alerting
- **Grafana**: Metrics visualization with pre-configured dashboards
- **Portainer**: Container management UI
- **Node Exporter**: Host system metrics (CPU, memory, disk, network)
- **cAdvisor**: Container resource usage metrics
- **Alertmanager**: Alert routing and management (via Prometheus)

## Features

### Prometheus
- Multi-target scraping (node-exporter, cadvisor, traefik)
- 15-second scrape interval for near real-time metrics
- 15-day retention period
- Pre-configured alert rules for critical conditions
- Built-in alerting engine
- Service discovery via static configs
- HTTPS support with BasicAuth protection

### Grafana
- Pre-configured Prometheus datasource
- Three comprehensive dashboards:
  - **Docker Containers**: Container CPU, memory, network I/O, restarts
  - **Host System**: System CPU, memory, disk, network, uptime
  - **Traefik**: Request rates, response times, status codes, error rates
- Auto-provisioning (no manual configuration needed)
- HTTPS access via Traefik
- 30-second auto-refresh
- Dark theme for reduced eye strain

### Portainer
- Web-based Docker management UI
- Container start/stop/restart/logs
- Stack management and deployment
- Volume and network management
- Resource usage visualization
- HTTPS access via Traefik

### Node Exporter
- Host system metrics:
  - CPU usage by core and mode
  - Memory usage and available memory
  - Disk usage by filesystem
  - Network I/O by interface
  - System load averages
  - System uptime

### cAdvisor
- Container metrics:
  - CPU usage per container
  - Memory usage per container
  - Network I/O per container
  - Disk I/O per container
  - Container restart counts
  - Container health status

## Services

| Service | Domain | Port | Purpose |
|---------|--------|------|---------|
| Grafana | grafana.michaelschiemer.de | 3000 | Metrics visualization |
| Prometheus | prometheus.michaelschiemer.de | 9090 | Metrics collection |
| Portainer | portainer.michaelschiemer.de | 9000/9443 | Container management |
| Node Exporter | - | 9100 | Host metrics (internal) |
| cAdvisor | - | 8080 | Container metrics (internal) |

## Prerequisites

- Traefik stack deployed and running (Stack 1)
- Docker networks: `traefik-public`, `monitoring`
- Docker Swarm initialized (if using swarm mode)
- Domain DNS configured (grafana/prometheus/portainer subdomains)

## Directory Structure

```
monitoring/
├── docker-compose.yml              # Main stack definition
├── .env.example                    # Environment template
├── prometheus/
│   ├── prometheus.yml              # Prometheus configuration
│   └── alerts.yml                  # Alert rules
├── grafana/
│   ├── provisioning/
│   │   ├── datasources/
│   │   │   └── prometheus.yml      # Auto-configured datasource
│   │   └── dashboards/
│   │       └── dashboard.yml       # Dashboard provisioning
│   └── dashboards/
│       ├── docker-containers.json  # Container metrics dashboard
│       ├── host-system.json        # Host metrics dashboard
│       └── traefik.json            # Traefik metrics dashboard
└── README.md                       # This file
```

## Configuration

### 1. Create Environment File

```bash
cp .env.example .env
```

### 2. Configure Environment Variables

Edit `.env` and set the following variables:

```bash
# Domain Configuration
DOMAIN=michaelschiemer.de

# Grafana Configuration
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=<generate-strong-password>

# Prometheus Configuration
PROMETHEUS_USER=admin
PROMETHEUS_PASSWORD=<generate-strong-password>

# Portainer Configuration
PORTAINER_ADMIN_PASSWORD=<generate-strong-password>

# Network Configuration
TRAEFIK_NETWORK=traefik-public
MONITORING_NETWORK=monitoring
```

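Before deploying, it can help to confirm that no placeholder from `.env.example` is left in `.env`. The following is a minimal sketch, assuming the placeholders keep the angle-bracket form shown above (for example `<generate-strong-password>`); the function name `find_unset_secrets` is hypothetical and not part of this stack.

```shell
# Print the names of variables in an env file whose value is still an
# angle-bracket placeholder such as <generate-strong-password>.
# Returns 0 when none are left, 1 when placeholders remain, 2 if the file is missing.
find_unset_secrets() {
    env_file="${1:-.env}"
    if [ ! -f "$env_file" ]; then
        echo "env file not found: $env_file" >&2
        return 2
    fi
    # Match KEY=<...> lines (comment lines do not match) and keep only the key.
    unset_keys=$(grep -E '^[A-Za-z_][A-Za-z0-9_]*=<[^>]*>$' "$env_file" | cut -d= -f1)
    if [ -n "$unset_keys" ]; then
        printf '%s\n' "$unset_keys"
        return 1
    fi
    return 0
}
```

Typical use would be `find_unset_secrets .env || echo "Set the variables listed above before deploying"`.
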
### 3. Generate Strong Passwords

```bash
# Generate random passwords
openssl rand -base64 32

# For Prometheus BasicAuth (bcrypt hash)
docker run --rm httpd:alpine htpasswd -nbB admin "your-password" | cut -d ":" -f 2
```

### 4. Update Traefik BasicAuth (Optional)

If using Prometheus BasicAuth, add the bcrypt hash to the Traefik labels in `docker-compose.yml`:

```yaml
- "traefik.http.middlewares.prometheus-auth.basicauth.users=admin:$$2y$$05$$..."
```

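Docker Compose treats `$` as the start of a variable reference, which is why the hash in the label above is written with `$$`. Below is a small sketch of producing a ready-to-paste `user:hash` entry; `compose_escape` is a hypothetical helper name, and the commented `htpasswd` call is the one from step 3 without the `cut`.

```shell
# Double every "$" so Docker Compose passes the bcrypt hash through literally.
compose_escape() {
    sed -e 's/\$/$$/g'
}

# Example (requires Docker; prints an admin:$$2y$$... entry for the label):
# docker run --rm httpd:alpine htpasswd -nbB admin "your-password" | compose_escape
```
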
## Deployment

### Deploy Stack

```bash
cd /home/michael/dev/michaelschiemer/deployment/stacks/monitoring

# Deploy with Docker Compose
docker compose up -d

# Or with Docker Stack (Swarm mode)
docker stack deploy -c docker-compose.yml monitoring
```

### Verify Deployment

```bash
# Check running containers
docker compose ps

# Check service logs
docker compose logs -f grafana
docker compose logs -f prometheus

# Check Prometheus targets
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/targets
```

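The `/api/v1/targets` response is JSON in which each active target carries a `health` field (`up`, `down`, or `unknown`). The sketch below summarizes it without extra tools. It relies on plain text matching rather than a JSON parser, so treat it as a quick check only; the function name is hypothetical.

```shell
# Read a /api/v1/targets JSON response on stdin and print one
# "<health>=<count>" line per health state found.
count_targets_by_health() {
    grep -o '"health":"[a-z]*"' \
        | cut -d'"' -f4 \
        | sort \
        | uniq -c \
        | awk '{ print $2 "=" $1 }'
}

# Example:
# curl -s -u admin:password https://prometheus.michaelschiemer.de/api/v1/targets | count_targets_by_health
```
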
### Initial Access

1. **Grafana**: https://grafana.michaelschiemer.de
   - Login: `admin` / `<GRAFANA_ADMIN_PASSWORD>`
   - Dashboards are pre-loaded and ready to use

2. **Prometheus**: https://prometheus.michaelschiemer.de
   - BasicAuth: `admin` / `<PROMETHEUS_PASSWORD>`
   - Check targets at `/targets`
   - View alerts at `/alerts`

3. **Portainer**: https://portainer.michaelschiemer.de
   - First login: Set admin password
   - Connect to local Docker environment

## Usage

### Grafana Dashboards

#### Docker Containers Dashboard
Access: https://grafana.michaelschiemer.de/d/docker-containers

**Metrics Displayed**:
- Container CPU Usage % (per container, timeseries)
- Container Memory Usage (bytes per container, timeseries)
- Containers Running (current count, stat)
- Container Restarts in 5m (rate with thresholds, stat)
- Container Network I/O (RX/TX per container, timeseries)

**Use Cases**:
- Identify containers with high resource usage
- Monitor container stability (restart rates)
- Track network bandwidth consumption
- Verify all expected containers are running

#### Host System Dashboard
Access: https://grafana.michaelschiemer.de/d/host-system

**Metrics Displayed**:
- CPU Usage % (historical and current)
- Memory Usage % (historical and current)
- Disk Usage % (root filesystem, historical and current)
- Network I/O (RX/TX by interface)
- System Uptime (seconds since boot)

**Thresholds**:
- Green: < 80% usage
- Yellow: 80-90% usage
- Red: > 90% usage

**Use Cases**:
- Monitor server health and resource utilization
- Identify resource bottlenecks
- Plan capacity upgrades
- Track system stability (uptime)

#### Traefik Dashboard
Access: https://grafana.michaelschiemer.de/d/traefik

**Metrics Displayed**:
- Request Rate by Service (req/s, timeseries)
- Response Time p95/p99 (milliseconds, timeseries)
- HTTP Status Codes (2xx/4xx/5xx stacked, color-coded)
- Service Status (Up/Down per service)
- Requests per Minute (total)
- 4xx Error Rate (percentage)
- 5xx Error Rate (percentage)
- Active Services (count)

**Thresholds**:
- 4xx errors: Green < 5%, Yellow < 10%, Red ≥ 10%
- 5xx errors: Green < 1%, Yellow < 5%, Red ≥ 5%

**Use Cases**:
- Monitor HTTP traffic patterns
- Identify performance issues (high latency)
- Track error rates and types
- Verify service availability

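To make the Traefik error-rate thresholds concrete: an error rate is the share of responses in one status class among all responses. The sketch below shows that arithmetic together with the 5xx color bands listed above. The function names are hypothetical and the request counts in the usage note are made up.

```shell
# Percentage of error responses among all responses, two decimals.
error_rate_pct() {
    errors="$1"; total="$2"
    awk -v e="$errors" -v t="$total" \
        'BEGIN { if (t == 0) print "0.00"; else printf "%.2f\n", e / t * 100 }'
}

# Map a 5xx error percentage to the dashboard bands:
# green < 1%, yellow < 5%, red >= 5%.
color_5xx() {
    awk -v p="$1" \
        'BEGIN { if (p < 1) print "green"; else if (p < 5) print "yellow"; else print "red" }'
}
```

For example, 12 server errors out of 2400 requests gives `error_rate_pct 12 2400` = `0.50`, which `color_5xx` places in the green band.
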
### Prometheus Queries

#### Common PromQL Examples

**CPU Usage**:
```promql
# Overall CPU usage (%)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Per-core CPU usage (%), computed from each core's idle time
100 - (avg by (cpu) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

**Memory Usage**:
```promql
# Memory usage percentage
100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

# Memory available in GiB
node_memory_MemAvailable_bytes / 1024 / 1024 / 1024
```

**Disk Usage**:
```promql
# Disk usage percentage (root filesystem)
100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100)

# Disk I/O utilization (fraction of time each device was busy)
rate(node_disk_io_time_seconds_total[5m])
```

**Container Metrics**:
```promql
# Container CPU usage (% of one core)
sum(rate(container_cpu_usage_seconds_total{name!~".*exporter.*"}[5m])) by (name) * 100

# Container memory usage (bytes)
sum(container_memory_usage_bytes{name!~".*exporter.*"}) by (name)

# Container network I/O (bytes per second)
rate(container_network_receive_bytes_total[5m])
rate(container_network_transmit_bytes_total[5m])
```

**Traefik Metrics**:
```promql
# Request rate by service
sum(rate(traefik_service_requests_total[5m])) by (service)

# Response time, 95th percentile
histogram_quantile(0.95, sum(rate(traefik_service_request_duration_seconds_bucket[5m])) by (service, le))

# 5xx error rate (%)
sum(rate(traefik_service_requests_total{code=~"5.."}[5m])) / sum(rate(traefik_service_requests_total[5m])) * 100
```

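These expressions can also be evaluated outside the UI through the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch, assuming the BasicAuth credentials from `.env` are exported as `PROMETHEUS_USER` and `PROMETHEUS_PASSWORD`. The helper name `prom_query` and the `PROM_URL` override variable are illustrative, not part of this stack.

```shell
# Evaluate a PromQL expression at the current time and print the JSON response.
# --data-urlencode takes care of characters such as {, } and " in the query.
prom_query() {
    curl -sS -G \
        -u "${PROMETHEUS_USER}:${PROMETHEUS_PASSWORD}" \
        --data-urlencode "query=$1" \
        "${PROM_URL:-https://prometheus.michaelschiemer.de}/api/v1/query"
}

# Example:
# prom_query 'node_memory_MemAvailable_bytes / 1024 / 1024 / 1024'
```
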
### Alert Management

#### Configured Alerts

Alerts are defined in `prometheus/alerts.yml`:

1. **HostHighCPU**: CPU usage > 80% for 5 minutes
2. **HostHighMemory**: Memory usage > 80% for 5 minutes
3. **HostDiskSpaceLow**: Disk usage > 80%
4. **ContainerHighCPU**: Container CPU > 80% for 5 minutes
5. **ContainerHighMemory**: Container memory > 80% for 5 minutes
6. **ServiceDown**: Service unavailable
7. **HighErrorRate**: Error rate > 5% for 5 minutes

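For orientation, a rule such as HostHighCPU could look roughly like the following in Prometheus rule-file syntax. This is a sketch built from the threshold listed above and the CPU expression used elsewhere in this README. The group name, the `severity` label, and the annotation text are assumptions, and the actual `prometheus/alerts.yml` may differ.

```yaml
groups:
  - name: host-alerts            # hypothetical group name
    rules:
      - alert: HostHighCPU
        # Average non-idle CPU per instance above 80% ...
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        # ... sustained for at least 5 minutes before the alert fires.
        for: 5m
        labels:
          severity: warning      # assumed label
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
```

Rule files can be validated before a reload with `promtool check rules prometheus/alerts.yml`.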
#### View Active Alerts

```bash
# Via the Prometheus UI, open:
#   https://prometheus.michaelschiemer.de/alerts

# Via API
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/alerts

# Check alert rules
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/rules
```

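#### Silence Alerts

Prometheus itself has no API for silencing alerts; silences are created in Alertmanager (web UI, `amtool`, or HTTP API). The example assumes an Alertmanager instance reachable at `http://alertmanager:9093`, which is not listed in the Services table, so adjust the address to your setup:

```bash
# Silence HostHighCPU for 1 hour via the Alertmanager v2 API (GNU date syntax)
curl -X POST http://alertmanager:9093/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d "{\"matchers\":[{\"name\":\"alertname\",\"value\":\"HostHighCPU\",\"isRegex\":false}],
       \"startsAt\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",
       \"endsAt\":\"$(date -u -d '+1 hour' +%Y-%m-%dT%H:%M:%SZ)\",
       \"createdBy\":\"admin\",\"comment\":\"maintenance\"}"
```
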
### Portainer Usage

#### Container Management

1. Navigate to https://portainer.michaelschiemer.de
2. Select "Local" environment
3. Go to "Containers" section
4. Available actions:
   - Start/Stop/Restart containers
   - View logs (live stream)
   - Inspect container details
   - Execute commands in containers
   - View resource statistics

#### Stack Management

1. Go to "Stacks" section
2. View deployed stacks
3. Actions available:
   - View stack definition
   - Update stack (edit compose file)
   - Stop/Start entire stack
   - Remove stack

#### Volume Management

1. Go to "Volumes" section
2. View volume details and size
3. Browse volume contents
4. Backup/restore volumes

## Integration with Other Stacks

### Stack 1: Traefik
- Provides HTTPS reverse proxy for Grafana, Prometheus, Portainer
- Automatic SSL certificate management
- BasicAuth middleware for Prometheus

### Stack 2: Gitea
- Monitor Gitea container resources
- Track HTTP requests to Gitea via Traefik dashboard
- Alert on Gitea service downtime

### Stack 3: Docker Registry
- Monitor registry container resources
- Track registry HTTP requests
- Alert on registry unavailability

### Stack 4: Application
- Monitor PHP-FPM, Nginx, Redis, Worker containers
- Track application response times
- Monitor queue worker health

### Stack 5: PostgreSQL
- Monitor database container resources
- Track PostgreSQL metrics (if postgres_exporter added)
- Alert on database unavailability

## Monitoring Best Practices

### 1. Regular Dashboard Review
- Check dashboards daily for anomalies
- Review error rates and response times
- Monitor resource utilization trends

### 2. Alert Configuration
- Tune alert thresholds based on baseline metrics
- Avoid alert fatigue (too many non-critical alerts)
- Document alert response procedures

### 3. Capacity Planning
- Review resource usage trends weekly
- Plan capacity upgrades before hitting limits
- Monitor growth rates for proactive scaling

### 4. Performance Optimization
- Identify containers with high resource usage
- Optimize slow endpoints (high p95/p99 latency)
- Balance load across services

### 5. Security Monitoring
- Monitor failed authentication attempts
- Track unusual traffic patterns
- Review service availability trends

## Troubleshooting

### Grafana Issues

#### Dashboard Not Loading
```bash
# Check Grafana logs
docker compose logs grafana

# Check Grafana's own health endpoint (reports database status, not the datasource)
curl http://localhost:3000/api/health

# Restart Grafana
docker compose restart grafana
```

#### Missing Metrics
```bash
# Check scrape targets from inside the Prometheus container
# (the hostname "prometheus" only resolves on the Docker network)
docker compose exec prometheus wget -qO- http://localhost:9090/api/v1/targets

# Look for scrape-related messages in the Prometheus logs
docker compose logs prometheus | grep -i scrape

# Check network connectivity
docker compose exec grafana ping prometheus
```

### Prometheus Issues

#### Targets Down
```bash
# Check target status
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/targets

# Verify target services are running
docker compose ps

# Check Prometheus configuration
docker compose exec prometheus cat /etc/prometheus/prometheus.yml

# Reload configuration (requires Prometheus to run with --web.enable-lifecycle)
curl -X POST -u admin:password https://prometheus.michaelschiemer.de/-/reload
```

#### High Memory Usage
```bash
# Check Prometheus memory
docker stats prometheus

# Reduce retention period in docker-compose.yml:
# --storage.tsdb.retention.time=7d

# Reduce scrape frequency in prometheus.yml (longer interval):
# scrape_interval: 30s
```

### Node Exporter Issues

#### No Host Metrics
```bash
# Check node-exporter is running
docker compose ps node-exporter

# Test metrics endpoint
curl http://localhost:9100/metrics

# Check Prometheus scraping
docker compose logs prometheus | grep node-exporter
```

### cAdvisor Issues

#### No Container Metrics
```bash
# Check cAdvisor is running
docker compose ps cadvisor

# Test metrics endpoint
curl http://localhost:8080/metrics

# Verify Docker socket mount
docker compose exec cadvisor ls -la /var/run/docker.sock
```

### Portainer Issues

#### Cannot Access UI
```bash
# Check Portainer is running
docker compose ps portainer

# Check Traefik routing
docker compose -f ../traefik/docker-compose.yml logs

# Verify the monitoring network exists
docker network ls | grep monitoring
```

#### Cannot Connect to Docker
```bash
# Verify Docker socket permissions
ls -la /var/run/docker.sock

# Check Portainer logs
docker compose logs portainer

# Restart Portainer
docker compose restart portainer
```

## Performance Tuning

### Prometheus Optimization

#### Reduce Memory Usage
```yaml
# In docker-compose.yml, adjust retention:
command:
  - '--storage.tsdb.retention.time=7d'   # Reduce from 15d
  - '--storage.tsdb.retention.size=5GB'  # Add size limit
```

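Retention mainly determines disk usage. The Prometheus documentation gives a rough sizing rule: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. Here is a sketch of that arithmetic; the function name is hypothetical and the sample rate in the usage note is made up.

```shell
# Rough TSDB disk estimate in bytes.
#   $1 retention in days
#   $2 ingested samples per second
#   $3 bytes per sample (optional, defaults to the conservative value 2)
estimate_tsdb_bytes() {
    days="$1"; samples_per_sec="$2"; bytes_per_sample="${3:-2}"
    echo $(( days * 86400 * samples_per_sec * bytes_per_sample ))
}
```

At an assumed 1000 samples per second, `estimate_tsdb_bytes 15 1000` yields 2592000000 (about 2.6 GB), and cutting retention to 7 days gives 1209600000 (about 1.2 GB). The real ingestion rate can be read from `rate(prometheus_tsdb_head_samples_appended_total[5m])`.
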
#### Optimize Scrape Intervals
```yaml
# In prometheus/prometheus.yml:
global:
  scrape_interval: 30s      # Increase from 15s for less load
  evaluation_interval: 30s
```

#### Reduce Cardinality
```yaml
# In prometheus/prometheus.yml, add metric_relabel_configs:
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'unused_metric_.*'
    action: drop
```

### Grafana Optimization

#### Reduce Query Load

In the dashboard JSON, raise the refresh interval from `30s` to `1m`:

```json
"refresh": "1m"
```

#### Optimize Panel Queries
- Use recording rules for expensive queries
- Reduce time range for heavy queries
- Use appropriate resolution (step parameter)

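As an illustration of the first point, a recording rule precomputes an expensive expression at every evaluation interval so that panels query the stored result instead. This sketch reuses the Traefik request-rate expression from this README. The group and rule names are assumptions, and the rule file would need to be listed under `rule_files` in `prometheus.yml`.

```yaml
groups:
  - name: traefik-recording      # hypothetical group name
    rules:
      # The result is stored as a new series that dashboards can query by name.
      - record: service:traefik_requests:rate5m
        expr: sum(rate(traefik_service_requests_total[5m])) by (service)
```

A panel would then query `service:traefik_requests:rate5m` rather than re-evaluating the full `sum(rate(...))` on every refresh.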
### Storage Optimization

#### Prometheus Data Volume
```bash
# Check current size
du -sh volumes/prometheus/

# Purge data already marked as deleted (tombstones) from disk
# (requires Prometheus to run with --web.enable-admin-api)
curl -X POST -u admin:password https://prometheus.michaelschiemer.de/api/v1/admin/tsdb/clean_tombstones
```

#### Grafana Data Volume
```bash
# Check current size
du -sh volumes/grafana/

# Check the size of the Grafana database (default SQLite path in the official image)
# Note: `grafana-cli admin reset-admin-password` resets the admin password; it frees no space
docker compose exec grafana ls -lh /var/lib/grafana/grafana.db
```

## Security Considerations

### 1. Password Security
- Use strong, randomly generated passwords
- Store passwords securely (password manager)
- Rotate passwords regularly
- Use bcrypt for Prometheus BasicAuth

### 2. Network Security
- Monitoring network is internal-only (except exporters)
- Traefik handles SSL/TLS termination
- BasicAuth protects Prometheus UI
- Grafana requires login for dashboard access

### 3. Access Control
- Limit Grafana admin access
- Use Grafana organizations for multi-tenancy
- Configure Prometheus with read-only access where possible
- Restrict Portainer access to trusted users

### 4. Data Security
- Prometheus stores metrics unencrypted on disk
- Grafana hashes user passwords and encrypts data source secrets in its database
- Backup volumes contain sensitive data
- Secure backups with encryption

### 5. Container Security
- Use official Docker images
- Keep images updated (security patches)
- Run containers as non-root where possible
- Limit container capabilities

## Backup and Recovery

### Backup Prometheus Data
```bash
# Stop Prometheus
docker compose stop prometheus

# Backup data volume
tar czf prometheus-backup-$(date +%Y%m%d).tar.gz -C volumes/prometheus .

# Restart Prometheus
docker compose start prometheus
```

### Backup Grafana Data
```bash
# Backup Grafana database and dashboards
# (-T disables TTY allocation so the binary tar stream is not corrupted)
docker compose exec -T grafana tar czf - /var/lib/grafana > grafana-backup-$(date +%Y%m%d).tar.gz
```

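A backup is only useful if the archive can be read back. The sketch below checks that a `.tar.gz` file can be listed end to end and records a checksum next to it. The function name is hypothetical, and `sha256sum` is assumed to be available (GNU coreutils).

```shell
# Verify that a gzip-compressed tar archive is readable, then write
# "<archive>.sha256" beside it. Returns non-zero if the archive is damaged.
verify_backup() {
    archive="$1"
    if ! tar tzf "$archive" > /dev/null 2>&1; then
        echo "backup unreadable: $archive" >&2
        return 1
    fi
    sha256sum "$archive" > "$archive.sha256"
}

# Example:
# verify_backup prometheus-backup-YYYYMMDD.tar.gz
```
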
### Restore from Backup
```bash
# Stop services
docker compose down

# Restore Prometheus data
tar xzf prometheus-backup-YYYYMMDD.tar.gz -C volumes/prometheus/

# Restore Grafana data (-T keeps stdin a plain pipe for the tar stream)
docker compose up -d grafana
docker compose exec -T grafana tar xzf - -C / < grafana-backup-YYYYMMDD.tar.gz

# Start all services
docker compose up -d
```

## Maintenance

### Regular Tasks

#### Daily
- Review dashboards for anomalies
- Check active alerts
- Verify all services are running

#### Weekly
- Review resource usage trends
- Check disk space usage
- Update passwords if needed

#### Monthly
- Review and update alert rules
- Optimize slow queries
- Clean up old data if needed
- Update Docker images

### Update Procedure

```bash
# Pull latest images
docker compose pull

# Recreate containers with new images
docker compose up -d

# Verify services are healthy
docker compose ps
docker compose logs -f
```

## Support

### Documentation
- Prometheus: https://prometheus.io/docs/
- Grafana: https://grafana.com/docs/
- Portainer: https://docs.portainer.io/

### Logs
```bash
# View all logs
docker compose logs

# Follow specific service logs
docker compose logs -f grafana
docker compose logs -f prometheus

# View last 100 lines
docker compose logs --tail=100
```

### Health Checks
```bash
# Check service health
docker compose ps

# Test endpoints
curl http://localhost:9090/-/healthy    # Prometheus
curl http://localhost:3000/api/health   # Grafana

# Check metrics
curl http://localhost:9100/metrics      # Node Exporter
curl http://localhost:8080/metrics      # cAdvisor
```

---

**Stack Version**: 1.0
**Last Updated**: 2025-01-30
**Maintained By**: DevOps Team