feat: CI/CD pipeline setup complete - Ansible playbooks updated, secrets configured, workflow ready
21
deployment/stacks/monitoring/.env.example
Normal file
@@ -0,0 +1,21 @@
# Monitoring Stack Environment Configuration
# Copy to .env and configure with your actual values

# Domain Configuration
DOMAIN=michaelschiemer.de

# Grafana Configuration
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=changeme_secure_password

# Grafana Plugins (comma-separated)
# Common useful plugins:
# - grafana-clock-panel
# - grafana-piechart-panel
# - grafana-worldmap-panel
GRAFANA_PLUGINS=

# Prometheus BasicAuth
# Generate with: htpasswd -nb admin password
# Format: username:hashed_password
PROMETHEUS_AUTH=admin:$$apr1$$xyz...
751
deployment/stacks/monitoring/README.md
Normal file
@@ -0,0 +1,751 @@
# Stack 6: Monitoring (Portainer + Grafana + Prometheus)

Comprehensive monitoring stack for infrastructure and application observability.

## Overview

This stack provides complete monitoring and visualization capabilities for the entire infrastructure:
- **Prometheus**: Time-series metrics collection and alerting
- **Grafana**: Metrics visualization with pre-configured dashboards
- **Portainer**: Container management UI
- **Node Exporter**: Host system metrics (CPU, memory, disk, network)
- **cAdvisor**: Container resource usage metrics
- **Alertmanager**: Alert routing and management (via Prometheus)

## Features

### Prometheus
- Multi-target scraping (node-exporter, cadvisor, traefik)
- 15-second scrape interval for near real-time metrics
- 30-day retention period (`--storage.tsdb.retention.time=30d` in `docker-compose.yml`)
- Pre-configured alert rules for critical conditions
- Built-in alerting engine
- Service discovery via static configs
- HTTPS support with BasicAuth protection

### Grafana
- Pre-configured Prometheus datasource
- Three comprehensive dashboards:
  - **Docker Containers**: Container CPU, memory, network I/O, restarts
  - **Host System**: System CPU, memory, disk, network, uptime
  - **Traefik**: Request rates, response times, status codes, error rates
- Auto-provisioning (no manual configuration needed)
- HTTPS access via Traefik
- 30-second auto-refresh
- Dark theme for reduced eye strain

### Portainer
- Web-based Docker management UI
- Container start/stop/restart/logs
- Stack management and deployment
- Volume and network management
- Resource usage visualization
- HTTPS access via Traefik

### Node Exporter
- Host system metrics:
  - CPU usage by core and mode
  - Memory usage and available memory
  - Disk usage by filesystem
  - Network I/O by interface
  - System load averages
  - System uptime

### cAdvisor
- Container metrics:
  - CPU usage per container
  - Memory usage per container
  - Network I/O per container
  - Disk I/O per container
  - Container restart counts
  - Container health status

## Services

| Service | Domain | Port | Purpose |
|---------|--------|------|---------|
| Grafana | grafana.michaelschiemer.de | 3000 | Metrics visualization |
| Prometheus | prometheus.michaelschiemer.de | 9090 | Metrics collection |
| Portainer | portainer.michaelschiemer.de | 9000/9443 | Container management |
| Node Exporter | - | 9100 | Host metrics (internal) |
| cAdvisor | - | 8080 | Container metrics (internal) |

## Prerequisites

- Traefik stack deployed and running (Stack 1)
- Docker networks `traefik-public` and `app-internal` created (both are declared `external` in `docker-compose.yml`)
- Docker Swarm initialized (if using swarm mode)
- Domain DNS configured (grafana/prometheus/portainer subdomains)
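
Because `docker-compose.yml` in this commit declares both `traefik-public` and `app-internal` as `external`, Compose will not create them and startup fails if they are missing. A minimal sketch of an idempotent check follows; the helper name `ensure_network` is hypothetical and not part of the repository.

```bash
# ensure_network NAME: create the Docker network NAME only if it does not exist yet.
# Hypothetical helper; in Swarm mode you would likely add "--driver overlay".
ensure_network() {
  docker network inspect "$1" >/dev/null 2>&1 || docker network create "$1"
}
```

Running `ensure_network traefik-public` and `ensure_network app-internal` before `docker compose up -d` should be safe to repeat, since an existing network is left untouched.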

## Directory Structure

```
monitoring/
├── docker-compose.yml              # Main stack definition
├── .env.example                    # Environment template
├── prometheus/
│   ├── prometheus.yml              # Prometheus configuration
│   └── alerts.yml                  # Alert rules
├── grafana/
│   ├── provisioning/
│   │   ├── datasources/
│   │   │   └── prometheus.yml      # Auto-configured datasource
│   │   └── dashboards/
│   │       └── dashboard.yml       # Dashboard provisioning
│   └── dashboards/
│       ├── docker-containers.json  # Container metrics dashboard
│       ├── host-system.json        # Host metrics dashboard
│       └── traefik.json            # Traefik metrics dashboard
└── README.md                       # This file
```

## Configuration

### 1. Create Environment File

```bash
cp .env.example .env
```

### 2. Configure Environment Variables

Edit `.env` and set the variables that `docker-compose.yml` reads:

```bash
# Domain Configuration
DOMAIN=michaelschiemer.de

# Grafana Configuration
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=<generate-strong-password>

# Grafana Plugins (comma-separated, may be empty)
GRAFANA_PLUGINS=

# Prometheus BasicAuth (username:hashed_password, see step 3)
PROMETHEUS_AUTH=admin:<hashed-password>
```

### 3. Generate Strong Passwords

```bash
# Generate random passwords
openssl rand -base64 32

# For Prometheus BasicAuth (prints username:bcrypt-hash)
docker run --rm httpd:alpine htpasswd -nbB admin "your-password"
```

### 4. Update Traefik BasicAuth (Optional)

`docker-compose.yml` passes `PROMETHEUS_AUTH` to the Traefik BasicAuth middleware. To hardcode the credentials in the label instead, write every `$` of the hash as `$$`:

```yaml
- "traefik.http.middlewares.prometheus-auth.basicauth.users=admin:$$2y$$05$$..."
```
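
Compose interpolates `$VAR` inside `docker-compose.yml`, so a literal `$` has to be written as `$$`, which is why the hash in the label is doubled. A small sketch of that escaping step follows; the function name `escape_compose_dollars` is an assumption made for illustration.

```bash
# escape_compose_dollars: read text on stdin and double every "$" so that
# Compose treats it as a literal dollar sign instead of a variable reference.
escape_compose_dollars() {
  sed -e 's/[$]/$$/g'
}
```

Piping the output of `htpasswd -nbB admin "your-password"` through `escape_compose_dollars` should yield a `username:hash` string that can be pasted into the label.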

## Deployment

### Deploy Stack

```bash
cd /home/michael/dev/michaelschiemer/deployment/stacks/monitoring

# Deploy with Docker Compose
docker compose up -d

# Or with Docker Stack (Swarm mode)
docker stack deploy -c docker-compose.yml monitoring
```

### Verify Deployment

```bash
# Check running containers
docker compose ps

# Check service logs
docker compose logs -f grafana
docker compose logs -f prometheus

# Check Prometheus targets
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/targets
```

### Initial Access

1. **Grafana**: https://grafana.michaelschiemer.de
   - Login: `admin` / `<GRAFANA_ADMIN_PASSWORD>`
   - Dashboards are pre-loaded and ready to use

2. **Prometheus**: https://prometheus.michaelschiemer.de
   - BasicAuth: `admin` / the password hashed into `PROMETHEUS_AUTH`
   - Check targets at `/targets`
   - View alerts at `/alerts`

3. **Portainer**: https://portainer.michaelschiemer.de
   - First login: Set admin password
   - Connect to local Docker environment

## Usage

### Grafana Dashboards

#### Docker Containers Dashboard
Access: https://grafana.michaelschiemer.de/d/docker-containers

**Metrics Displayed**:
- Container CPU Usage % (per container, timeseries)
- Container Memory Usage (bytes per container, timeseries)
- Containers Running (current count, stat)
- Container Restarts in 5m (rate with thresholds, stat)
- Container Network I/O (RX/TX per container, timeseries)

**Use Cases**:
- Identify containers with high resource usage
- Monitor container stability (restart rates)
- Track network bandwidth consumption
- Verify all expected containers are running

#### Host System Dashboard
Access: https://grafana.michaelschiemer.de/d/host-system

**Metrics Displayed**:
- CPU Usage % (historical and current)
- Memory Usage % (historical and current)
- Disk Usage % (root filesystem, historical and current)
- Network I/O (RX/TX by interface)
- System Uptime (seconds since boot)

**Thresholds**:
- Green: < 80% usage
- Yellow: 80-90% usage
- Red: > 90% usage
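
The threshold bands above can be expressed as a small classifier, which may help when reusing the same limits in scripts. This is a sketch only; the function name `usage_level` is hypothetical, and it treats exactly 90% as yellow, following the ranges as written.

```bash
# usage_level PERCENT: map a usage percentage to the dashboard colour band.
# green: < 80, yellow: 80 to 90 inclusive, red: > 90
usage_level() {
  awk -v p="$1" 'BEGIN {
    if (p < 80) print "green"
    else if (p <= 90) print "yellow"
    else print "red"
  }'
}
```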

**Use Cases**:
- Monitor server health and resource utilization
- Identify resource bottlenecks
- Plan capacity upgrades
- Track system stability (uptime)

#### Traefik Dashboard
Access: https://grafana.michaelschiemer.de/d/traefik

**Metrics Displayed**:
- Request Rate by Service (req/s, timeseries)
- Response Time p95/p99 (milliseconds, timeseries)
- HTTP Status Codes (2xx/4xx/5xx stacked, color-coded)
- Service Status (Up/Down per service)
- Requests per Minute (total)
- 4xx Error Rate (percentage)
- 5xx Error Rate (percentage)
- Active Services (count)

**Thresholds**:
- 4xx errors: Green < 5%, Yellow < 10%, Red ≥ 10%
- 5xx errors: Green < 1%, Yellow < 5%, Red ≥ 5%
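
As a worked illustration of these thresholds, the sketch below computes an error percentage from raw request counts and maps it to a colour band. Both function names (`error_rate_pct`, `error_level`) are hypothetical, and the counts used when calling them are whatever the operator reads from Traefik.

```bash
# error_rate_pct ERRORS TOTAL: percentage of requests that were errors
# (prints 0.00 when there was no traffic, to avoid dividing by zero).
error_rate_pct() {
  awk -v e="$1" -v t="$2" 'BEGIN {
    if (t == 0) print "0.00"
    else printf "%.2f\n", e / t * 100
  }'
}

# error_level CLASS PERCENT: CLASS is "4xx" or "5xx"; prints green, yellow or red.
error_level() {
  case "$1" in
    4xx) warn=5; crit=10 ;;
    5xx) warn=1; crit=5 ;;
    *) echo "unknown class: $1" >&2; return 2 ;;
  esac
  awk -v p="$2" -v w="$warn" -v c="$crit" 'BEGIN {
    if (p < w) print "green"
    else if (p < c) print "yellow"
    else print "red"
  }'
}
```

For example, 3 server errors out of 200 requests is 1.50%, which falls in the yellow band for 5xx but would be green for 4xx.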

**Use Cases**:
- Monitor HTTP traffic patterns
- Identify performance issues (high latency)
- Track error rates and types
- Verify service availability

### Prometheus Queries

#### Common PromQL Examples

**CPU Usage**:
```promql
# Overall CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Per-core CPU usage
100 - (avg by (cpu) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

**Memory Usage**:
```promql
# Memory usage percentage
100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

# Memory available in GB
node_memory_MemAvailable_bytes / 1024 / 1024 / 1024
```

**Disk Usage**:
```promql
# Disk usage percentage
100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100)

# Disk I/O utilization (fraction of time the device was busy)
rate(node_disk_io_time_seconds_total[5m])
```

**Container Metrics**:
```promql
# Container CPU usage
sum(rate(container_cpu_usage_seconds_total{name!~".*exporter.*"}[5m])) by (name) * 100

# Container memory usage
sum(container_memory_usage_bytes{name!~".*exporter.*"}) by (name)

# Container network I/O
rate(container_network_receive_bytes_total[5m])
rate(container_network_transmit_bytes_total[5m])
```

**Traefik Metrics**:
```promql
# Request rate by service
sum(rate(traefik_service_requests_total[5m])) by (service)

# Response time percentiles
histogram_quantile(0.95, sum(rate(traefik_service_request_duration_seconds_bucket[5m])) by (service, le))

# Error rate
sum(rate(traefik_service_requests_total{code=~"5.."}[5m])) / sum(rate(traefik_service_requests_total[5m])) * 100
```

### Alert Management

#### Configured Alerts

Alerts are defined in `prometheus/alerts.yml`:

1. **HostHighCPU**: CPU usage > 80% for 5 minutes
2. **HostHighMemory**: Memory usage > 80% for 5 minutes
3. **HostDiskSpaceLow**: Disk usage > 80%
4. **ContainerHighCPU**: Container CPU > 80% for 5 minutes
5. **ContainerHighMemory**: Container memory > 80% for 5 minutes
6. **ServiceDown**: Service unavailable
7. **HighErrorRate**: Error rate > 5% for 5 minutes
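
The contents of `prometheus/alerts.yml` are not shown here, so the following is only a sketch of how the first alert in the list might be written, reusing the overall CPU query from this README. The group name, the `severity` label, the annotation text, and the helper name `write_example_rule` are all assumptions.

```bash
# write_example_rule FILE: write a sample Prometheus rule file to FILE.
# The rule mirrors "HostHighCPU: CPU usage > 80% for 5 minutes"; everything
# else (group name, severity label, summary text) is illustrative only.
write_example_rule() {
  cat > "$1" <<'EOF'
groups:
  - name: host
    rules:
      - alert: HostHighCPU
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host CPU usage above 80% for 5 minutes"
EOF
}
```

A file produced this way can be validated with `promtool check rules FILE`, which ships in the `prom/prometheus` image.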

#### View Active Alerts

```bash
# Via Prometheus UI: https://prometheus.michaelschiemer.de/alerts

# Via API
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/alerts

# Check alert rules
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/rules
```

#### Silence Alerts

Prometheus itself has no silence API; its `/api/v1/alerts` endpoint is read-only. Silences are created in Alertmanager, which is not defined in this stack's `docker-compose.yml`. Once an Alertmanager instance is running, silence alerts during maintenance through its UI or API:

```bash
# Silence via the Alertmanager v2 API (example; the host alertmanager:9093 is an assumption)
curl -X POST -H 'Content-Type: application/json' \
  http://alertmanager:9093/api/v2/silences \
  -d '{"matchers":[{"name":"alertname","value":"HostHighCPU","isRegex":false}],"startsAt":"2025-01-30T12:00:00Z","endsAt":"2025-01-30T13:00:00Z","createdBy":"admin","comment":"maintenance"}'
```

### Portainer Usage

#### Container Management

1. Navigate to https://portainer.michaelschiemer.de
2. Select "Local" environment
3. Go to "Containers" section
4. Available actions:
   - Start/Stop/Restart containers
   - View logs (live stream)
   - Inspect container details
   - Execute commands in containers
   - View resource statistics

#### Stack Management

1. Go to "Stacks" section
2. View deployed stacks
3. Actions available:
   - View stack definition
   - Update stack (edit compose file)
   - Stop/Start entire stack
   - Remove stack

#### Volume Management

1. Go to "Volumes" section
2. View volume details and size
3. Browse volume contents
4. Backup/restore volumes

## Integration with Other Stacks

### Stack 1: Traefik
- Provides HTTPS reverse proxy for Grafana, Prometheus, Portainer
- Automatic SSL certificate management
- BasicAuth middleware for Prometheus

### Stack 2: Gitea
- Monitor Gitea container resources
- Track HTTP requests to Gitea via Traefik dashboard
- Alert on Gitea service downtime

### Stack 3: Docker Registry
- Monitor registry container resources
- Track registry HTTP requests
- Alert on registry unavailability

### Stack 4: Application
- Monitor PHP-FPM, Nginx, Redis, Worker containers
- Track application response times
- Monitor queue worker health

### Stack 5: PostgreSQL
- Monitor database container resources
- Track PostgreSQL metrics (if postgres_exporter added)
- Alert on database unavailability

## Monitoring Best Practices

### 1. Regular Dashboard Review
- Check dashboards daily for anomalies
- Review error rates and response times
- Monitor resource utilization trends

### 2. Alert Configuration
- Tune alert thresholds based on baseline metrics
- Avoid alert fatigue (too many non-critical alerts)
- Document alert response procedures

### 3. Capacity Planning
- Review resource usage trends weekly
- Plan capacity upgrades before hitting limits
- Monitor growth rates for proactive scaling

### 4. Performance Optimization
- Identify containers with high resource usage
- Optimize slow endpoints (high p95/p99 latency)
- Balance load across services

### 5. Security Monitoring
- Monitor failed authentication attempts
- Track unusual traffic patterns
- Review service availability trends

## Troubleshooting

### Grafana Issues

#### Dashboard Not Loading
```bash
# Check Grafana logs
docker compose logs grafana

# Check Grafana health (port 3000 is not published on the host, so run inside the container)
docker compose exec grafana wget -qO- http://localhost:3000/api/health

# Restart Grafana
docker compose restart grafana
```

#### Missing Metrics
```bash
# Check Prometheus targets from inside the Grafana container
docker compose exec grafana wget -qO- http://prometheus:9090/api/v1/targets

# Verify Prometheus is scraping
docker compose logs prometheus | grep "Scrape"

# Check network connectivity
docker compose exec grafana ping -c 3 prometheus
```

### Prometheus Issues

#### Targets Down
```bash
# Check target status
curl -u admin:password https://prometheus.michaelschiemer.de/api/v1/targets

# Verify target services are running
docker compose ps

# Check Prometheus configuration
docker compose exec prometheus cat /etc/prometheus/prometheus.yml

# Reload configuration
curl -X POST -u admin:password https://prometheus.michaelschiemer.de/-/reload
```

#### High Memory Usage
```bash
# Check Prometheus memory
docker stats prometheus

# Reduce retention period in docker-compose.yml:
# --storage.tsdb.retention.time=7d

# Scrape less often by raising the interval in prometheus.yml:
# scrape_interval: 30s
```

### Node Exporter Issues

#### No Host Metrics
```bash
# Check node-exporter is running
docker compose ps node-exporter

# Test metrics endpoint (port 9100 is not published on the host)
docker compose exec node-exporter wget -qO- http://localhost:9100/metrics

# Check Prometheus scraping
docker compose logs prometheus | grep node-exporter
```

### cAdvisor Issues

#### No Container Metrics
```bash
# Check cAdvisor is running
docker compose ps cadvisor

# Test metrics endpoint (port 8080 is not published on the host)
docker compose exec cadvisor wget -qO- http://localhost:8080/metrics

# Verify Docker socket mount
docker compose exec cadvisor ls -la /var/run/docker.sock
```

### Portainer Issues

#### Cannot Access UI
```bash
# Check Portainer is running
docker compose ps portainer

# Check Traefik routing
docker compose -f ../traefik/docker-compose.yml logs

# Verify the network Portainer shares with Traefik exists
docker network ls | grep traefik-public
```

#### Cannot Connect to Docker
```bash
# Verify Docker socket permissions
ls -la /var/run/docker.sock

# Check Portainer logs
docker compose logs portainer

# Restart Portainer
docker compose restart portainer
```

## Performance Tuning

### Prometheus Optimization

#### Reduce Memory Usage
```yaml
# In docker-compose.yml, adjust retention:
command:
  - '--storage.tsdb.retention.time=7d'   # Reduce from 30d
  - '--storage.tsdb.retention.size=5GB'  # Add size limit
```
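
To judge whether a retention change is worthwhile, Prometheus' storage documentation suggests estimating disk use as retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The sketch below applies that formula; the function name `tsdb_disk_gb` is hypothetical and the series count you pass in is whatever your own instance reports.

```bash
# tsdb_disk_gb SERIES SCRAPE_INTERVAL_S RETENTION_DAYS [BYTES_PER_SAMPLE]
# Rough estimate in GB (10^9 bytes):
#   retention_seconds * samples_per_second * bytes_per_sample
# BYTES_PER_SAMPLE defaults to 2, the upper end of the documented 1-2 range.
tsdb_disk_gb() {
  awk -v series="$1" -v interval="$2" -v days="$3" -v bps="${4:-2}" 'BEGIN {
    samples_per_second = series / interval
    bytes = days * 86400 * samples_per_second * bps
    printf "%.2f\n", bytes / 1e9
  }'
}
```

With an illustrative 15000 active series scraped every 15 seconds, `tsdb_disk_gb 15000 15 30` estimates about 5.18 GB for 30 days, and `tsdb_disk_gb 15000 15 7` about 1.21 GB for 7 days.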

#### Optimize Scrape Intervals
```yaml
# In prometheus/prometheus.yml:
global:
  scrape_interval: 30s      # Increase from 15s for less load
  evaluation_interval: 30s
```

#### Reduce Cardinality
```yaml
# In prometheus/prometheus.yml, add metric_relabel_configs:
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'unused_metric_.*'
    action: drop
```

### Grafana Optimization

#### Reduce Query Load
```json
// In dashboard JSON, adjust refresh rate:
"refresh": "1m" // Increase from 30s
```

#### Optimize Panel Queries
- Use recording rules for expensive queries
- Reduce time range for heavy queries
- Use appropriate resolution (step parameter)

### Storage Optimization

#### Prometheus Data Volume
```bash
# Check current size of the prometheus-data named volume
docker system df -v | grep prometheus-data

# Remove tombstones left by deleted series
# (requires Prometheus to be started with --web.enable-admin-api, which docker-compose.yml does not set)
docker compose exec prometheus wget -qO- --post-data='' http://localhost:9090/api/v1/admin/tsdb/clean_tombstones
```

#### Grafana Data Volume
```bash
# Check current size of the grafana-data named volume
docker system df -v | grep grafana-data

# Reset the Grafana admin password (this does not free disk space)
docker compose exec grafana grafana-cli admin reset-admin-password "<new-password>"
```

## Security Considerations

### 1. Password Security
- Use strong, randomly generated passwords
- Store passwords securely (password manager)
- Rotate passwords regularly
- Use bcrypt for Prometheus BasicAuth

### 2. Network Security
- Monitoring network is internal-only (except exporters)
- Traefik handles SSL/TLS termination
- BasicAuth protects Prometheus UI
- Grafana requires login for dashboard access

### 3. Access Control
- Limit Grafana admin access
- Use Grafana organizations for multi-tenancy
- Configure Prometheus with read-only access where possible
- Restrict Portainer access to trusted users

### 4. Data Security
- Prometheus stores metrics in plain text
- Grafana encrypts passwords in database
- Backup volumes contain sensitive data
- Secure backups with encryption

### 5. Container Security
- Use official Docker images
- Keep images updated (security patches)
- Run containers as non-root where possible
- Limit container capabilities

## Backup and Recovery

### Backup Prometheus Data
```bash
# Stop Prometheus
docker compose stop prometheus

# Backup the prometheus-data named volume
docker run --rm -v prometheus-data:/data:ro -v "$(pwd)":/backup alpine \
  tar czf /backup/prometheus-backup-$(date +%Y%m%d).tar.gz -C /data .

# Restart Prometheus
docker compose start prometheus
```

### Backup Grafana Data
```bash
# Backup Grafana database and dashboards (-T disables the TTY so the archive is not corrupted)
docker compose exec -T grafana tar czf - /var/lib/grafana > grafana-backup-$(date +%Y%m%d).tar.gz
```

### Restore from Backup
```bash
# Stop services
docker compose down

# Restore Prometheus data into the prometheus-data named volume
docker run --rm -v prometheus-data:/data -v "$(pwd)":/backup alpine \
  tar xzf /backup/prometheus-backup-YYYYMMDD.tar.gz -C /data

# Restore Grafana data
docker compose up -d grafana
docker compose exec -T grafana tar xzf - -C / < grafana-backup-YYYYMMDD.tar.gz

# Start all services
docker compose up -d
```
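
Before restoring, it can be worth confirming that an archive is actually readable, since a truncated or corrupted tarball would otherwise only surface halfway through the restore. A minimal sketch, with the hypothetical helper name `verify_backup`:

```bash
# verify_backup ARCHIVE: succeed only if ARCHIVE is a non-empty, readable gzip tarball.
verify_backup() {
  archive=$1
  # -s is true only for an existing file with size greater than zero.
  [ -s "$archive" ] || { echo "missing or empty: $archive" >&2; return 1; }
  # "tar tzf" lists the members without extracting; a corrupt archive makes it fail.
  tar tzf "$archive" >/dev/null 2>&1 || { echo "unreadable archive: $archive" >&2; return 1; }
  echo "ok: $archive"
}
```

Running `verify_backup prometheus-backup-YYYYMMDD.tar.gz` before `docker compose down` keeps the services up if the archive turns out to be unusable.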

## Maintenance

### Regular Tasks

#### Daily
- Review dashboards for anomalies
- Check active alerts
- Verify all services are running

#### Weekly
- Review resource usage trends
- Check disk space usage
- Update passwords if needed

#### Monthly
- Review and update alert rules
- Optimize slow queries
- Clean up old data if needed
- Update Docker images

### Update Procedure

```bash
# Pull latest images
docker compose pull

# Recreate containers with new images
docker compose up -d

# Verify services are healthy
docker compose ps
docker compose logs -f
```

## Support

### Documentation
- Prometheus: https://prometheus.io/docs/
- Grafana: https://grafana.com/docs/
- Portainer: https://docs.portainer.io/

### Logs
```bash
# View all logs
docker compose logs

# Follow specific service logs
docker compose logs -f grafana
docker compose logs -f prometheus

# View last 100 lines
docker compose logs --tail=100
```

### Health Checks
```bash
# Check service health
docker compose ps

# Test endpoints (ports are not published on the host, so run inside the containers)
docker compose exec prometheus wget -qO- http://localhost:9090/-/healthy   # Prometheus
docker compose exec grafana wget -qO- http://localhost:3000/api/health     # Grafana

# Check metrics
docker compose exec node-exporter wget -qO- http://localhost:9100/metrics  # Node Exporter
docker compose exec cadvisor wget -qO- http://localhost:8080/metrics       # cAdvisor
```
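
Every service in `docker-compose.yml` defines a `healthcheck`, so Docker's own health status can be polled instead of probing each endpoint by hand. The sketch below assumes the `container_name` values from the Compose file; the helper names `container_health` and `all_healthy` are hypothetical.

```bash
# container_health NAME: print the Docker health status of container NAME
# (for example "starting", "healthy" or "unhealthy").
container_health() {
  docker inspect --format '{{.State.Health.Status}}' "$1"
}

# all_healthy NAME...: print one "name: status" line per container and
# succeed only if every container reports "healthy".
all_healthy() {
  rc=0
  for name in "$@"; do
    status=$(container_health "$name" 2>/dev/null) || status=unknown
    echo "$name: $status"
    [ "$status" = healthy ] || rc=1
  done
  return "$rc"
}
```

For this stack the call would be `all_healthy portainer prometheus grafana node-exporter cadvisor`; a non-zero exit status means at least one container is not healthy yet.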

---

**Stack Version**: 1.0
**Last Updated**: 2025-01-30
**Maintained By**: DevOps Team
147
deployment/stacks/monitoring/docker-compose.yml
Normal file
@@ -0,0 +1,147 @@
services:
  portainer:
    image: portainer/portainer-ce:latest
    container_name: portainer
    restart: unless-stopped
    networks:
      - traefik-public
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - portainer-data:/data
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.portainer.rule=Host(`portainer.${DOMAIN}`)"
      - "traefik.http.routers.portainer.entrypoints=websecure"
      - "traefik.http.routers.portainer.tls=true"
      - "traefik.http.routers.portainer.tls.certresolver=letsencrypt"
      - "traefik.http.services.portainer.loadbalancer.server.port=9000"
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:9000/api/system/status"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    user: "65534:65534"
    networks:
      - traefik-public
      - app-internal
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--web.enable-lifecycle'
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.prometheus.rule=Host(`prometheus.${DOMAIN}`)"
      - "traefik.http.routers.prometheus.entrypoints=websecure"
      - "traefik.http.routers.prometheus.tls=true"
      - "traefik.http.routers.prometheus.tls.certresolver=letsencrypt"
      - "traefik.http.routers.prometheus.middlewares=prometheus-auth"
      - "traefik.http.middlewares.prometheus-auth.basicauth.users=${PROMETHEUS_AUTH}"
      - "traefik.http.services.prometheus.loadbalancer.server.port=9090"
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 10s
      retries: 3

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    networks:
      - traefik-public
      - app-internal
    environment:
      - GF_SERVER_ROOT_URL=https://grafana.${DOMAIN}
      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER}
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_INSTALL_PLUGINS=${GRAFANA_PLUGINS}
      - GF_LOG_LEVEL=info
      - GF_ANALYTICS_REPORTING_ENABLED=false
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.grafana.rule=Host(`grafana.${DOMAIN}`)"
      - "traefik.http.routers.grafana.entrypoints=websecure"
      - "traefik.http.routers.grafana.tls=true"
      - "traefik.http.routers.grafana.tls.certresolver=letsencrypt"
      - "traefik.http.services.grafana.loadbalancer.server.port=3000"
    depends_on:
      prometheus:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/api/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    networks:
      - app-internal
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:9100/metrics"]
      interval: 30s
      timeout: 10s
      retries: 3

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    networks:
      - app-internal
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    devices:
      - /dev/kmsg
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:8080/healthz"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  portainer-data:
    name: portainer-data
  prometheus-data:
    name: prometheus-data
  grafana-data:
    name: grafana-data

networks:
  traefik-public:
    external: true
  app-internal:
    external: true
@@ -0,0 +1,397 @@
|
||||
{
|
||||
"annotations": {
|
||||
"list": [
|
||||
{
|
||||
"builtIn": 1,
|
||||
"datasource": "-- Grafana --",
|
||||
"enable": true,
|
||||
"hide": true,
|
||||
"iconColor": "rgba(0, 211, 255, 1)",
|
||||
"name": "Annotations & Alerts",
|
||||
"type": "dashboard"
|
||||
}
|
||||
]
|
||||
},
|
||||
"editable": true,
|
||||
"gnetId": null,
|
||||
"graphTooltip": 0,
|
||||
"id": null,
|
||||
"links": [],
|
||||
"panels": [
|
||||
{
|
||||
"datasource": "Prometheus",
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"color": {
|
||||
"mode": "palette-classic"
|
||||
},
|
||||
"custom": {
|
||||
"axisLabel": "",
|
||||
"axisPlacement": "auto",
|
||||
"barAlignment": 0,
|
||||
"drawStyle": "line",
|
||||
"fillOpacity": 10,
|
||||
"gradientMode": "none",
|
||||
"hideFrom": {
|
||||
"tooltip": false,
|
||||
"viz": false,
|
||||
"legend": false
|
||||
},
|
||||
"lineInterpolation": "linear",
|
||||
"lineWidth": 1,
|
||||
"pointSize": 5,
|
||||
"scaleDistribution": {
|
||||
"type": "linear"
|
||||
},
|
||||
"showPoints": "never",
|
||||
"spanNulls": false,
|
||||
"stacking": {
|
||||
"group": "A",
|
||||
"mode": "none"
|
||||
},
|
||||
"thresholdsStyle": {
|
||||
"mode": "off"
|
||||
}
|
||||
},
|
||||
"mappings": [],
|
||||
"thresholds": {
|
||||
"mode": "absolute",
|
||||
"steps": [
|
||||
{
|
||||
"color": "green",
|
||||
"value": null
|
||||
}
|
||||
]
|
||||
},
|
||||
"unit": "percent"
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 0
|
||||
},
|
||||
"id": 1,
|
||||
"options": {
|
||||
"legend": {
|
||||
"calcs": ["lastNotNull"],
|
||||
"displayMode": "table",
|
||||
"placement": "right"
|
||||
},
|
||||
"tooltip": {
|
||||
"mode": "single"
|
||||
}
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"expr": "sum(rate(container_cpu_usage_seconds_total{name!~\".*exporter.*\"}[5m])) by (name) * 100",
|
||||
"refId": "A",
|
||||
"legendFormat": "{{name}}"
|
||||
}
|
||||
],
|
||||
"title": "Container CPU Usage %",
|
||||
"type": "timeseries"
|
||||
},
|
||||
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {
              "tooltip": false,
              "viz": false,
              "legend": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              }
            ]
          },
          "unit": "bytes"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 0
      },
      "id": 2,
      "options": {
        "legend": {
          "calcs": ["lastNotNull"],
          "displayMode": "table",
          "placement": "right"
        },
        "tooltip": {
          "mode": "single"
        }
      },
      "targets": [
        {
          "expr": "sum(container_memory_usage_bytes{name!~\".*exporter.*\"}) by (name)",
          "refId": "A",
          "legendFormat": "{{name}}"
        }
      ],
      "title": "Container Memory Usage",
      "type": "timeseries"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [
            {
              "options": {
                "0": {
                  "color": "red",
                  "index": 1,
                  "text": "Down"
                },
                "1": {
                  "color": "green",
                  "index": 0,
                  "text": "Up"
                }
              },
              "type": "value"
            }
          ],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "red",
                "value": null
              },
              {
                "color": "green",
                "value": 1
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 4,
        "w": 6,
        "x": 0,
        "y": 8
      },
      "id": 3,
      "options": {
        "colorMode": "background",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "9.0.0",
      "targets": [
        {
          "expr": "count(container_last_seen{name!~\".*exporter.*\"}) > 0",
          "refId": "A"
        }
      ],
      "title": "Containers Running",
      "type": "stat"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "yellow",
                "value": 3
              },
              {
                "color": "red",
                "value": 5
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 4,
        "w": 6,
        "x": 6,
        "y": 8
      },
      "id": 4,
      "options": {
        "colorMode": "background",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "9.0.0",
      "targets": [
        {
          "expr": "sum(rate(container_restart_count[5m])) > 0",
          "refId": "A"
        }
      ],
      "title": "Container Restarts (5m)",
      "type": "stat"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {
              "tooltip": false,
              "viz": false,
              "legend": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              }
            ]
          },
          "unit": "Bps"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 8
      },
      "id": 5,
      "options": {
        "legend": {
          "calcs": ["lastNotNull"],
          "displayMode": "table",
          "placement": "right"
        },
        "tooltip": {
          "mode": "single"
        }
      },
      "targets": [
        {
          "expr": "sum(rate(container_network_receive_bytes_total{name!~\".*exporter.*\"}[5m])) by (name)",
          "refId": "A",
          "legendFormat": "{{name}} RX"
        },
        {
          "expr": "sum(rate(container_network_transmit_bytes_total{name!~\".*exporter.*\"}[5m])) by (name)",
          "refId": "B",
          "legendFormat": "{{name}} TX"
        }
      ],
      "title": "Container Network I/O",
      "type": "timeseries"
    }
  ],
  "refresh": "30s",
  "schemaVersion": 36,
  "style": "dark",
  "tags": ["docker", "containers"],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-1h",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "Docker Containers",
  "uid": "docker-containers",
  "version": 1
}
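Note on the panel layout in the dashboard above: each panel's `gridPos` places it on Grafana's 24-column grid. The following is a minimal sketch, not part of this commit, of how such a layout could be sanity-checked before provisioning. The helper names `overlaps` and `lint_panels` are hypothetical, and the sample data copies only the `id` and `gridPos` values from the five panels above.

```python
from itertools import combinations

GRID_COLUMNS = 24  # Grafana's dashboard grid is 24 columns wide


def overlaps(a, b):
    """Return True if two gridPos rectangles share any grid cell."""
    return (
        a["x"] < b["x"] + b["w"]
        and b["x"] < a["x"] + a["w"]
        and a["y"] < b["y"] + b["h"]
        and b["y"] < a["y"] + a["h"]
    )


def lint_panels(panels):
    """Collect layout problems: duplicate ids, grid overflow, overlapping panels."""
    problems = []
    ids = [p["id"] for p in panels]
    if len(ids) != len(set(ids)):
        problems.append("duplicate panel id")
    for p in panels:
        g = p["gridPos"]
        if g["x"] + g["w"] > GRID_COLUMNS:
            problems.append(f"panel {p['id']} exceeds the {GRID_COLUMNS}-column grid")
    for a, b in combinations(panels, 2):
        if overlaps(a["gridPos"], b["gridPos"]):
            problems.append(f"panels {a['id']} and {b['id']} overlap")
    return problems


# id and gridPos copied from the "Docker Containers" dashboard above
DOCKER_PANELS = [
    {"id": 1, "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}},
    {"id": 2, "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}},
    {"id": 3, "gridPos": {"h": 4, "w": 6, "x": 0, "y": 8}},
    {"id": 4, "gridPos": {"h": 4, "w": 6, "x": 6, "y": 8}},
    {"id": 5, "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8}},
]

print(lint_panels(DOCKER_PANELS))
```

In practice the `panels` list would come from `json.load` on the dashboard file rather than a hand-copied constant.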
591
deployment/stacks/monitoring/grafana/dashboards/host-system.json
Normal file
@@ -0,0 +1,591 @@
{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "panels": [
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {
              "tooltip": false,
              "viz": false,
              "legend": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "line"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "percent"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 0
      },
      "id": 1,
      "options": {
        "legend": {
          "calcs": ["lastNotNull", "mean", "max"],
          "displayMode": "table",
          "placement": "right"
        },
        "tooltip": {
          "mode": "single"
        }
      },
      "targets": [
        {
          "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
          "refId": "A",
          "legendFormat": "{{instance}}"
        }
      ],
      "title": "CPU Usage %",
      "type": "timeseries"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {
              "tooltip": false,
              "viz": false,
              "legend": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "line"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "yellow",
                "value": 80
              },
              {
                "color": "red",
                "value": 90
              }
            ]
          },
          "unit": "percent"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 0
      },
      "id": 2,
      "options": {
        "legend": {
          "calcs": ["lastNotNull", "mean", "max"],
          "displayMode": "table",
          "placement": "right"
        },
        "tooltip": {
          "mode": "single"
        }
      },
      "targets": [
        {
          "expr": "100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)",
          "refId": "A",
          "legendFormat": "{{instance}}"
        }
      ],
      "title": "Memory Usage %",
      "type": "timeseries"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {
              "tooltip": false,
              "viz": false,
              "legend": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "line"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "yellow",
                "value": 80
              },
              {
                "color": "red",
                "value": 90
              }
            ]
          },
          "unit": "percent"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 8
      },
      "id": 3,
      "options": {
        "legend": {
          "calcs": ["lastNotNull", "mean", "max"],
          "displayMode": "table",
          "placement": "right"
        },
        "tooltip": {
          "mode": "single"
        }
      },
      "targets": [
        {
          "expr": "100 - ((node_filesystem_avail_bytes{mountpoint=\"/\",fstype!=\"rootfs\"} / node_filesystem_size_bytes{mountpoint=\"/\",fstype!=\"rootfs\"}) * 100)",
          "refId": "A",
          "legendFormat": "{{instance}}"
        }
      ],
      "title": "Disk Usage %",
      "type": "timeseries"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {
              "tooltip": false,
              "viz": false,
              "legend": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              }
            ]
          },
          "unit": "Bps"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 8
      },
      "id": 4,
      "options": {
        "legend": {
          "calcs": ["lastNotNull", "mean", "max"],
          "displayMode": "table",
          "placement": "right"
        },
        "tooltip": {
          "mode": "single"
        }
      },
      "targets": [
        {
          "expr": "rate(node_network_receive_bytes_total[5m])",
          "refId": "A",
          "legendFormat": "{{instance}} - {{device}} RX"
        },
        {
          "expr": "rate(node_network_transmit_bytes_total[5m])",
          "refId": "B",
          "legendFormat": "{{instance}} - {{device}} TX"
        }
      ],
      "title": "Network I/O",
      "type": "timeseries"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "yellow",
                "value": 80
              },
              {
                "color": "red",
                "value": 90
              }
            ]
          },
          "unit": "percent"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 4,
        "w": 6,
        "x": 0,
        "y": 16
      },
      "id": 5,
      "options": {
        "colorMode": "background",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "9.0.0",
      "targets": [
        {
          "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
          "refId": "A"
        }
      ],
      "title": "Current CPU Usage",
      "type": "stat"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "yellow",
                "value": 80
              },
              {
                "color": "red",
                "value": 90
              }
            ]
          },
          "unit": "percent"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 4,
        "w": 6,
        "x": 6,
        "y": 16
      },
      "id": 6,
      "options": {
        "colorMode": "background",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "9.0.0",
      "targets": [
        {
          "expr": "100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)",
          "refId": "A"
        }
      ],
      "title": "Current Memory Usage",
      "type": "stat"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "yellow",
                "value": 80
              },
              {
                "color": "red",
                "value": 90
              }
            ]
          },
          "unit": "percent"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 4,
        "w": 6,
        "x": 12,
        "y": 16
      },
      "id": 7,
      "options": {
        "colorMode": "background",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "9.0.0",
      "targets": [
        {
          "expr": "100 - ((node_filesystem_avail_bytes{mountpoint=\"/\",fstype!=\"rootfs\"} / node_filesystem_size_bytes{mountpoint=\"/\",fstype!=\"rootfs\"}) * 100)",
          "refId": "A"
        }
      ],
      "title": "Current Disk Usage",
      "type": "stat"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              }
            ]
          },
          "unit": "s"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 4,
        "w": 6,
        "x": 18,
        "y": 16
      },
      "id": 8,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "9.0.0",
      "targets": [
        {
          "expr": "time() - node_boot_time_seconds",
          "refId": "A"
        }
      ],
      "title": "System Uptime",
      "type": "stat"
    }
  ],
  "refresh": "30s",
  "schemaVersion": 36,
  "style": "dark",
  "tags": ["host", "system"],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-1h",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "Host System",
  "uid": "host-system",
  "version": 1
}
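Note on the usage panels above: the memory and disk queries share the form `100 - (available / total) * 100`, and the stat panels color the result with absolute threshold steps (green base, yellow from 80, red from 90). The sketch below works through that arithmetic with invented byte counts. `usage_percent` and `threshold_color` are hypothetical helper names, and the step lookup (the last step whose value is at or below the reading wins) is a simplified reading of how absolute thresholds behave, not Grafana's own code.

```python
def usage_percent(available, total):
    """Mirror the dashboard expression: 100 - (available / total) * 100."""
    return 100 - (available / total) * 100


def threshold_color(value, steps):
    """Pick the color of the highest step whose lower bound is <= value.

    steps: list of (lower_bound, color) in ascending order; a bound of
    None marks the base step that applies to every value.
    """
    color = steps[0][1]
    for bound, step_color in steps:
        if bound is None or value >= bound:
            color = step_color
    return color


# Steps copied from the "Current Memory Usage" stat panel above
HOST_STEPS = [(None, "green"), (80, "yellow"), (90, "red")]

GIB = 1024 ** 3
# Illustrative numbers only: 2 GiB available out of 16 GiB total
memory_used = usage_percent(2 * GIB, 16 * GIB)
print(memory_used, threshold_color(memory_used, HOST_STEPS))
```

With these sample numbers the host sits at 87.5% memory usage, which falls in the yellow band.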
613
deployment/stacks/monitoring/grafana/dashboards/traefik.json
Normal file
@@ -0,0 +1,613 @@
{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "panels": [
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {
              "tooltip": false,
              "viz": false,
              "legend": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              }
            ]
          },
          "unit": "reqps"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 0
      },
      "id": 1,
      "options": {
        "legend": {
          "calcs": ["lastNotNull", "mean", "max"],
          "displayMode": "table",
          "placement": "right"
        },
        "tooltip": {
          "mode": "single"
        }
      },
      "targets": [
        {
          "expr": "sum(rate(traefik_service_requests_total[5m])) by (service)",
          "refId": "A",
          "legendFormat": "{{service}}"
        }
      ],
      "title": "Request Rate by Service",
      "type": "timeseries"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {
              "tooltip": false,
              "viz": false,
              "legend": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              }
            ]
          },
          "unit": "ms"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 0
      },
      "id": 2,
      "options": {
        "legend": {
          "calcs": ["lastNotNull", "mean", "max"],
          "displayMode": "table",
          "placement": "right"
        },
        "tooltip": {
          "mode": "single"
        }
      },
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum(rate(traefik_service_request_duration_seconds_bucket[5m])) by (service, le)) * 1000",
          "refId": "A",
          "legendFormat": "{{service}} p95"
        },
        {
          "expr": "histogram_quantile(0.99, sum(rate(traefik_service_request_duration_seconds_bucket[5m])) by (service, le)) * 1000",
          "refId": "B",
          "legendFormat": "{{service}} p99"
        }
      ],
      "title": "Response Time (p95/p99)",
      "type": "timeseries"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {
              "tooltip": false,
              "viz": false,
              "legend": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "normal"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              }
            ]
          },
          "unit": "reqps"
        },
        "overrides": [
          {
            "matcher": {
              "id": "byRegexp",
              "options": ".*2xx.*"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "green",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byRegexp",
              "options": ".*4xx.*"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "yellow",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byRegexp",
              "options": ".*5xx.*"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "red",
                  "mode": "fixed"
                }
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 8
      },
      "id": 3,
      "options": {
        "legend": {
          "calcs": ["lastNotNull", "sum"],
          "displayMode": "table",
          "placement": "right"
        },
        "tooltip": {
          "mode": "single"
        }
      },
      "targets": [
        {
          "expr": "sum(rate(traefik_service_requests_total{code=~\"2..\"}[5m])) by (service)",
          "refId": "A",
          "legendFormat": "{{service}} 2xx"
        },
        {
          "expr": "sum(rate(traefik_service_requests_total{code=~\"4..\"}[5m])) by (service)",
          "refId": "B",
          "legendFormat": "{{service}} 4xx"
        },
        {
          "expr": "sum(rate(traefik_service_requests_total{code=~\"5..\"}[5m])) by (service)",
          "refId": "C",
          "legendFormat": "{{service}} 5xx"
        }
      ],
      "title": "HTTP Status Codes",
      "type": "timeseries"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [
            {
              "options": {
                "0": {
                  "color": "red",
                  "index": 1,
                  "text": "Down"
                },
                "1": {
                  "color": "green",
                  "index": 0,
                  "text": "Up"
                }
              },
              "type": "value"
            }
          ],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "red",
                "value": null
              },
              {
                "color": "green",
                "value": 1
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 8
      },
      "id": 4,
      "options": {
        "colorMode": "background",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "9.0.0",
      "targets": [
        {
          "expr": "traefik_service_server_up",
          "refId": "A",
          "legendFormat": "{{service}}"
        }
      ],
      "title": "Service Status",
      "type": "stat"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 4,
        "w": 6,
        "x": 0,
        "y": 16
      },
      "id": 5,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "9.0.0",
      "targets": [
        {
          "expr": "sum(rate(traefik_service_requests_total[5m])) * 60",
          "refId": "A"
        }
      ],
      "title": "Requests per Minute",
      "type": "stat"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "yellow",
                "value": 5
              },
              {
                "color": "red",
                "value": 10
              }
            ]
          },
          "unit": "percent"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 4,
        "w": 6,
        "x": 6,
        "y": 16
      },
      "id": 6,
      "options": {
        "colorMode": "background",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "9.0.0",
      "targets": [
        {
          "expr": "(sum(rate(traefik_service_requests_total{code=~\"4..\"}[5m])) / sum(rate(traefik_service_requests_total[5m]))) * 100",
          "refId": "A"
        }
      ],
      "title": "4xx Error Rate",
      "type": "stat"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "yellow",
                "value": 1
              },
              {
                "color": "red",
                "value": 5
              }
            ]
          },
          "unit": "percent"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 4,
        "w": 6,
        "x": 12,
        "y": 16
      },
      "id": 7,
      "options": {
        "colorMode": "background",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "9.0.0",
      "targets": [
        {
          "expr": "(sum(rate(traefik_service_requests_total{code=~\"5..\"}[5m])) / sum(rate(traefik_service_requests_total[5m]))) * 100",
          "refId": "A"
        }
      ],
      "title": "5xx Error Rate",
      "type": "stat"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 4,
        "w": 6,
        "x": 18,
        "y": 16
      },
      "id": 8,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "9.0.0",
      "targets": [
        {
          "expr": "count(traefik_service_server_up == 1)",
          "refId": "A"
        }
      ],
      "title": "Active Services",
      "type": "stat"
    }
  ],
  "refresh": "30s",
  "schemaVersion": 36,
  "style": "dark",
  "tags": ["traefik", "proxy"],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-1h",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "Traefik",
  "uid": "traefik",
  "version": 1
}
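Note on the "Response Time (p95/p99)" panel above: `histogram_quantile` estimates a quantile from cumulative `le` buckets by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified re-implementation for illustration only, not Prometheus code. It assumes the lowest bucket starts at 0 and that at least one finite bucket exists, and the sample bucket counts are invented. As in the panel query, the result in seconds is multiplied by 1000 to get milliseconds.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    ending with the (math.inf, total_count) bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Invented sample: 100 requests, bucket upper bounds in seconds
SAMPLE_BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]

p95_ms = histogram_quantile(0.95, SAMPLE_BUCKETS) * 1000
print(round(p95_ms, 2))
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the real latencies.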
@@ -0,0 +1,15 @@
# Grafana Dashboard Provisioning
# https://grafana.com/docs/grafana/latest/administration/provisioning/#dashboards

apiVersion: 1

providers:
  - name: 'Default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
@@ -0,0 +1,17 @@
# Grafana Datasource Provisioning
# https://grafana.com/docs/grafana/latest/administration/provisioning/#data-sources

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: 15s
      queryTimeout: 60s
      httpMethod: POST
    version: 1
245
deployment/stacks/monitoring/prometheus/alerts.yml
Normal file
@@ -0,0 +1,245 @@
# Prometheus Alerting Rules
# https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

groups:
  - name: infrastructure_alerts
    interval: 30s
    rules:
      # Host System Alerts
      - alert: HostHighCpuLoad
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          category: infrastructure
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
          description: "CPU load is above 80% (current value: {{ $value }}%)"

      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 2m
        labels:
          severity: critical
          category: infrastructure
        annotations:
          summary: "Host out of memory on {{ $labels.instance }}"
          description: "Available memory is below 10% (current value: {{ $value }}%)"

      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"} * 100) < 10
        for: 2m
        labels:
          severity: critical
          category: infrastructure
        annotations:
          summary: "Host out of disk space on {{ $labels.instance }}"
          description: "Disk space is below 10% (current value: {{ $value }}%)"

      - alert: HostDiskSpaceWarning
        expr: (node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"} * 100) < 20
        for: 5m
        labels:
          severity: warning
          category: infrastructure
        annotations:
          summary: "Disk space warning on {{ $labels.instance }}"
          description: "Disk space is below 20% (current value: {{ $value }}%)"

      - alert: HostHighDiskReadLatency
        expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1
        for: 5m
        labels:
          severity: warning
          category: infrastructure
        annotations:
          summary: "High disk read latency on {{ $labels.instance }}"
          description: "Disk read latency is high (current value: {{ $value }}s)"

      # Container Alerts
      - alert: ContainerKilled
        expr: time() - container_last_seen{name!~".*exporter.*"} > 60
        for: 1m
        labels:
          severity: critical
          category: container
        annotations:
          summary: "Container killed: {{ $labels.name }}"
          description: "Container {{ $labels.name }} has disappeared"

      - alert: ContainerHighCpuUsage
        expr: (sum(rate(container_cpu_usage_seconds_total{name!~".*exporter.*"}[5m])) by (name) * 100) > 80
        for: 5m
        labels:
          severity: warning
          category: container
        annotations:
          summary: "High CPU usage in container {{ $labels.name }}"
          description: "Container CPU usage is above 80% (current value: {{ $value }}%)"

      - alert: ContainerHighMemoryUsage
        expr: (sum(container_memory_usage_bytes{name!~".*exporter.*"}) by (name) / sum(container_spec_memory_limit_bytes{name!~".*exporter.*"}) by (name) * 100) > 80
        for: 5m
        labels:
          severity: warning
          category: container
        annotations:
          summary: "High memory usage in container {{ $labels.name }}"
          description: "Container memory usage is above 80% (current value: {{ $value }}%)"

      # Note: this expression measures filesystem inode usage, not bytes used.
      - alert: ContainerVolumeUsage
        expr: (1 - (sum(container_fs_inodes_free) BY (instance) / sum(container_fs_inodes_total) BY (instance))) * 100 > 80
        for: 5m
        labels:
          severity: warning
          category: container
        annotations:
          summary: "Container volume usage on {{ $labels.instance }}"
          description: "Container filesystem inode usage is above 80% (current value: {{ $value }}%)"

      - alert: ContainerRestartCount
        expr: rate(container_restart_count[5m]) > 0
        for: 1m
        labels:
          severity: warning
          category: container
        annotations:
          summary: "Container restarting: {{ $labels.name }}"
          description: "Container {{ $labels.name }} is restarting frequently"

      # Prometheus Self-Monitoring
      - alert: PrometheusTargetDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
          category: prometheus
        annotations:
          summary: "Prometheus target down: {{ $labels.job }}"
          description: "Target {{ $labels.job }} on {{ $labels.instance }} is down"

      - alert: PrometheusConfigReloadFailure
        expr: prometheus_config_last_reload_successful == 0
        for: 1m
        labels:
          severity: critical
          category: prometheus
        annotations:
          summary: "Prometheus configuration reload failure"
          description: "Prometheus configuration reload has failed"

      - alert: PrometheusTooManyRestarts
        expr: changes(process_start_time_seconds{job=~"prometheus"}[15m]) > 2
        for: 1m
        labels:
          severity: warning
          category: prometheus
        annotations:
          summary: "Prometheus restarting frequently"
          description: "Prometheus has restarted more than twice in the last 15 minutes"

      - alert: PrometheusTargetScrapingSlow
        expr: prometheus_target_interval_length_seconds{quantile="0.9"} > 60
        for: 5m
        labels:
          severity: warning
          category: prometheus
        annotations:
          summary: "Prometheus target scraping slow"
          description: "Prometheus is scraping targets slowly (current value: {{ $value }}s)"

      # Traefik Alerts
      # sum() rather than count(): count() returns the number of series regardless
      # of their value, so it can never equal 0 while a service still reports.
      - alert: TraefikServiceDown
        expr: sum(traefik_service_server_up) by (service) == 0
        for: 1m
        labels:
          severity: critical
          category: traefik
        annotations:
          summary: "Traefik service down: {{ $labels.service }}"
          description: "Traefik service {{ $labels.service }} is down"

      - alert: TraefikHighHttp4xxErrorRate
        expr: sum(rate(traefik_service_requests_total{code=~"4.."}[5m])) by (service) / sum(rate(traefik_service_requests_total[5m])) by (service) * 100 > 5
        for: 5m
        labels:
          severity: warning
          category: traefik
        annotations:
          summary: "High HTTP 4xx error rate for {{ $labels.service }}"
          description: "HTTP 4xx error rate is above 5% (current value: {{ $value }}%)"

      - alert: TraefikHighHttp5xxErrorRate
        expr: sum(rate(traefik_service_requests_total{code=~"5.."}[5m])) by (service) / sum(rate(traefik_service_requests_total[5m])) by (service) * 100 > 1
        for: 5m
        labels:
          severity: critical
          category: traefik
        annotations:
          summary: "High HTTP 5xx error rate for {{ $labels.service }}"
          description: "HTTP 5xx error rate is above 1% (current value: {{ $value }}%)"
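
      # Sketch (not part of the original commit): the 5xx ratio above is also
      # computed by the "5xx Error Rate" dashboard panel. Assuming one shared
      # definition is wanted, a recording rule could precompute it per service.
      # The group name and the recorded series name are hypothetical, and the
      # group would sit as its own entry under `groups:`.
      #
      #   - name: traefik_recording_rules
      #     interval: 30s
      #     rules:
      #       - record: service:traefik_service_requests_5xx:ratio_rate5m
      #         expr: |
      #           sum by (service) (rate(traefik_service_requests_total{code=~"5.."}[5m]))
      #             /
      #           sum by (service) (rate(traefik_service_requests_total[5m]))
      #
      # The alert expression could then be written as:
      #   service:traefik_service_requests_5xx:ratio_rate5m * 100 > 1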

  - name: database_alerts
    interval: 30s
    rules:
      # PostgreSQL Alerts (uncomment when postgres-exporter is deployed)
      # - alert: PostgresqlDown
      #   expr: pg_up == 0
      #   for: 1m
      #   labels:
      #     severity: critical
      #     category: database
      #   annotations:
      #     summary: "PostgreSQL down on {{ $labels.instance }}"
      #     description: "PostgreSQL instance is down"

      # Both sides are aggregated by instance so their label sets match.
      # - alert: PostgresqlTooManyConnections
      #   expr: sum by (instance) (pg_stat_activity_count) > min by (instance) (pg_settings_max_connections) * 0.8
      #   for: 5m
      #   labels:
      #     severity: warning
      #     category: database
      #   annotations:
      #     summary: "Too many PostgreSQL connections on {{ $labels.instance }}"
      #     description: "PostgreSQL connections are above 80% of max_connections"

      # - alert: PostgresqlDeadLocks
      #   expr: rate(pg_stat_database_deadlocks[1m]) > 0
      #   for: 1m
      #   labels:
      #     severity: warning
      #     category: database
      #   annotations:
      #     summary: "PostgreSQL deadlocks on {{ $labels.instance }}"
      #     description: "PostgreSQL has deadlocks"

      # Redis Alerts (uncomment when redis-exporter is deployed)
      # - alert: RedisDown
      #   expr: redis_up == 0
      #   for: 1m
      #   labels:
      #     severity: critical
      #     category: cache
      #   annotations:
      #     summary: "Redis down on {{ $labels.instance }}"
      #     description: "Redis instance is down"

      # - alert: RedisOutOfMemory
      #   expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 90
      #   for: 5m
      #   labels:
      #     severity: critical
      #     category: cache
      #   annotations:
      #     summary: "Redis out of memory on {{ $labels.instance }}"
      #     description: "Redis memory usage is above 90%"

      # - alert: RedisTooManyConnections
      #   expr: redis_connected_clients > 100
      #   for: 5m
      #   labels:
      #     severity: warning
      #     category: cache
      #   annotations:
      #     summary: "Too many Redis connections on {{ $labels.instance }}"
      #     description: "Redis has too many client connections (current value: {{ $value }})"
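
# Sketch (not part of the original commit): a unit test for HostOutOfMemory,
# assuming it is saved as a separate, hypothetical file alerts_test.yml next
# to alerts.yml and run with `promtool test rules alerts_test.yml`.
# The instance label value and the sample numbers are made up for the test:
# 5 of 100 bytes available is 5%, which stays below the 10% threshold for
# longer than the rule's `for: 2m`.
#
#   rule_files:
#     - alerts.yml
#   evaluation_interval: 1m
#   tests:
#     - interval: 1m
#       input_series:
#         - series: 'node_memory_MemAvailable_bytes{instance="production-server"}'
#           values: '5+0x10'
#         - series: 'node_memory_MemTotal_bytes{instance="production-server"}'
#           values: '100+0x10'
#       alert_rule_test:
#         - eval_time: 5m
#           alertname: HostOutOfMemory
#           exp_alerts:
#             - exp_labels:
#                 severity: critical
#                 category: infrastructure
#                 instance: production-server
#               exp_annotations:
#                 summary: "Host out of memory on production-server"
#                 description: "Available memory is below 10% (current value: 5%)"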
82
deployment/stacks/monitoring/prometheus/prometheus.yml
Normal file
@@ -0,0 +1,82 @@
# Prometheus Configuration
# https://prometheus.io/docs/prometheus/latest/configuration/configuration/

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    environment: 'michaelschiemer'

# Alertmanager configuration (optional)
# alerting:
#   alertmanagers:
#     - static_configs:
#         - targets:
#             - alertmanager:9093
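#
# Sketch (not part of the original commit): while the block above stays
# commented out, the rules in alerts.yml are still evaluated and visible in
# the Prometheus UI, but no notifications are sent. If an Alertmanager
# container reachable as alertmanager:9093 is added, a minimal, hypothetical
# alertmanager.yml (a separate file; the receiver name and timings are
# assumptions) could look like this:
#
#   route:
#     receiver: 'default'
#     group_by: ['alertname', 'category']
#     group_wait: 30s
#     repeat_interval: 4h
#   receivers:
#     - name: 'default'
#
# A receiver with only a name drops notifications; add e.g. email_configs or
# webhook_configs to it to deliver them.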

# Load alerting rules
rule_files:
  - '/etc/prometheus/alerts.yml'

# Scrape configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          service: 'prometheus'

  # Node Exporter - Host system metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          service: 'node-exporter'
          instance: 'production-server'

  # cAdvisor - Container metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
        labels:
          service: 'cadvisor'

  # Traefik metrics
  - job_name: 'traefik'
    static_configs:
      - targets: ['traefik:8080']
        labels:
          service: 'traefik'

  # PostgreSQL Exporter (if deployed)
  # Uncomment if you add postgres-exporter to postgresql stack
  # - job_name: 'postgres'
  #   static_configs:
  #     - targets: ['postgres-exporter:9187']
  #       labels:
  #         service: 'postgresql'

  # Redis Exporter (if deployed)
  # Uncomment if you add redis-exporter to application stack
  # - job_name: 'redis'
  #   static_configs:
  #     - targets: ['redis-exporter:9121']
  #       labels:
  #         service: 'redis'

  # Application metrics endpoint (if available)
  # Uncomment and configure if your PHP app exposes Prometheus metrics
  # - job_name: 'application'
  #   static_configs:
  #     - targets: ['app:9000']
  #       labels:
  #         service: 'application'

  # Nginx metrics (if nginx-prometheus-exporter deployed)
  # - job_name: 'nginx'
  #   static_configs:
  #     - targets: ['nginx-exporter:9113']
  #       labels:
  #         service: 'nginx'