# Production Deployment Documentation - Distributed Queue System
## Overview
This documentation describes the production deployment of the Custom PHP Framework's distributed queue processing system.
## System Requirements
### Minimum Requirements
- **PHP**: 8.3+
- **MySQL/PostgreSQL**: 8.0+ / 13+
- **Redis**: 7.0+ (optional, for Redis-based queues)
- **RAM**: 2GB per worker node
- **CPU**: 2 cores per worker node
- **Disk**: 10GB for logs and temporary files
### Recommended Production Environment
- **Load Balancer**: Nginx/HAProxy
- **Database**: MySQL 8.0+ with primary/replica replication
- **Caching**: Redis Cluster for high availability
- **Monitoring**: Prometheus + Grafana
- **Logging**: ELK Stack (Elasticsearch, Logstash, Kibana)
## Deployment Steps
### 1. Database Setup
```bash
# Run migrations
php console.php db:migrate
# Verify migrations
php console.php db:status
```
**Expected tables:**
- `queue_workers` - Worker Registration
- `distributed_locks` - Distributed Locking
- `job_assignments` - Job-Worker Assignments
- `worker_health_checks` - Worker Health Monitoring
- `failover_events` - Failover Event Tracking
### 2. Environment Configuration
**Production environment (.env.production):**
```bash
# Database Configuration
DB_HOST=production-db-cluster
DB_PORT=3306
DB_NAME=framework_production
DB_USER=queue_user
DB_PASS=secure_production_password
# Queue Configuration
QUEUE_DRIVER=database
QUEUE_DEFAULT=default
# Worker Configuration
WORKER_HEALTH_CHECK_INTERVAL=30
WORKER_REGISTRATION_TTL=300
FAILOVER_CHECK_INTERVAL=60
# Performance Tuning
DB_POOL_SIZE=20
DB_MAX_IDLE_TIME=3600
CACHE_TTL=3600
# Monitoring
LOG_LEVEL=info
PERFORMANCE_MONITORING=true
HEALTH_CHECK_ENDPOINT=/health
```
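Before starting workers, it pays to fail fast when a required setting is missing. This is only a sketch: the `require` helper is hypothetical, not a framework command, and the example values mirror the `.env.production` keys above.

```bash
#!/bin/sh
# Hypothetical pre-start sanity check; variable names follow .env.production.
require() {
  eval "val=\${$1:-}"
  [ -n "$val" ] || { echo "missing required setting: $1" >&2; return 1; }
}

# Example values; in production they are loaded from .env.production.
DB_HOST=production-db-cluster
DB_NAME=framework_production
DB_USER=queue_user

require DB_HOST && require DB_NAME && require DB_USER && echo "environment ok"
```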
### 3. Worker Node Deployment
**Docker Compose for a worker node:**
```yaml
version: '3.8'
services:
  queue-worker:
    image: custom-php-framework:production
    environment:
      - NODE_ROLE=worker
      - WORKER_QUEUES=default,emails,reports
      - WORKER_CONCURRENCY=4
      - DB_HOST=${DB_HOST}
      - DB_NAME=${DB_NAME}
      - DB_USER=${DB_USER}
      - DB_PASS=${DB_PASS}
    command: php console.php worker:start
    restart: unless-stopped
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 2G
          cpus: '2'
        reservations:
          memory: 1G
          cpus: '1'
    healthcheck:
      test: ["CMD", "php", "console.php", "worker:health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s
```
### 4. Load Balancer Configuration
**Nginx Configuration:**
```nginx
upstream queue_workers {
    least_conn;
    server worker-node-1:80 max_fails=3 fail_timeout=30s;
    server worker-node-2:80 max_fails=3 fail_timeout=30s;
    server worker-node-3:80 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name queue.production.example.com;

    location /health {
        proxy_pass http://queue_workers/health;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        access_log off;
    }

    location /admin/queue {
        proxy_pass http://queue_workers/admin/queue;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        # Admin access from internal IPs only
        allow 10.0.0.0/8;
        allow 172.16.0.0/12;
        allow 192.168.0.0/16;
        deny all;
    }
}
```
### 5. Monitoring Setup
**Prometheus Metrics Configuration:**
```yaml
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'queue-workers'
    static_configs:
      - targets: ['worker-node-1:9090', 'worker-node-2:9090', 'worker-node-3:9090']
    metrics_path: '/metrics'
    scrape_interval: 30s

  - job_name: 'queue-system'
    static_configs:
      - targets: ['queue.production.example.com:80']
    metrics_path: '/admin/metrics'
    scrape_interval: 60s
```
**Grafana Dashboard Queries:**
```promql
# Worker Health Status
sum(rate(worker_health_checks_total[5m])) by (status)
# Job Processing Rate
rate(jobs_processed_total[5m])
# Queue Length
queue_length{queue_name=~".*"}
# Worker CPU Usage
worker_cpu_usage_percent
# Database Connection Pool
db_connection_pool_active / db_connection_pool_max * 100
```
## Operational Commands
### Worker Management
```bash
# Start a worker
php console.php worker:start --queues=default,emails --concurrency=4
# Check worker status
php console.php worker:list
# Stop a worker (graceful shutdown)
php console.php worker:stop --worker-id=worker_123
# Stop all workers
php console.php worker:stop-all
# Worker health check
php console.php worker:health
# Run failover recovery
php console.php worker:failover-recovery
# Deregister a worker
php console.php worker:deregister --worker-id=worker_123
# Worker statistics
php console.php worker:stats
```
### System Monitoring
```bash
# System health check
curl -f http://queue.production.example.com/health
# Worker status API
curl http://queue.production.example.com/admin/queue/workers
# Queue statistics
curl http://queue.production.example.com/admin/queue/stats
# Metrics endpoint
curl http://queue.production.example.com/admin/metrics
```
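For automated probing (cron, load balancer checks), the curl call above can be wrapped in a small helper. This is a sketch, not a framework feature; the logic is shown with the HTTP status code passed in directly so it is testable without network access.

```bash
#!/bin/sh
# Hypothetical probe around the /health endpoint. In production the code
# would come from:
#   curl -s -o /dev/null -w '%{http_code}' http://queue.production.example.com/health
check_health() {
  case "$1" in
    200) echo "healthy";   return 0 ;;
    *)   echo "unhealthy"; return 1 ;;
  esac
}

check_health 200
```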
## Performance Tuning
### Database Optimization
**MySQL Configuration (my.cnf):**
```ini
[mysqld]
# InnoDB settings for the queue system
innodb_buffer_pool_size = 2G
innodb_log_file_size = 512M
innodb_flush_log_at_trx_commit = 2
innodb_lock_wait_timeout = 5
# Connection settings
max_connections = 500
max_connect_errors = 100000
connect_timeout = 10
wait_timeout = 28800
# Note: the query cache (query_cache_*) was removed in MySQL 8.0;
# do not configure it on the 8.0+ servers required here.
```
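The 2G buffer pool follows the common rule of thumb of dedicating roughly half the host's RAM when the database shares the node with other services (more on a dedicated server). A quick sizing sketch, assuming a 4GB node:

```bash
#!/bin/sh
# Rule-of-thumb sizing (assumption: 4GB node shared with other services).
ram_mb=4096
pool_mb=$((ram_mb * 50 / 100))   # ~50% of RAM
echo "innodb_buffer_pool_size = ${pool_mb}M"
```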
**Recommended indexes:**
```sql
-- Queue Workers Performance
CREATE INDEX idx_worker_status_updated ON queue_workers(status, updated_at);
CREATE INDEX idx_worker_queues ON queue_workers(queues(255));
-- Distributed Locks Performance
CREATE INDEX idx_lock_expires_worker ON distributed_locks(expires_at, worker_id);
-- Job Assignments Performance
CREATE INDEX idx_assignment_worker_time ON job_assignments(worker_id, assigned_at);
CREATE INDEX idx_assignment_queue_time ON job_assignments(queue_name, assigned_at);
-- Health Checks Performance
CREATE INDEX idx_health_worker_time ON worker_health_checks(worker_id, checked_at);
CREATE INDEX idx_health_status_time ON worker_health_checks(status, checked_at);
-- Failover Events Performance
CREATE INDEX idx_failover_worker_time ON failover_events(failed_worker_id, failover_at);
CREATE INDEX idx_failover_event_type ON failover_events(event_type, failover_at);
```
### Application Performance
**PHP Configuration (php.ini):**
```ini
; Memory limits
memory_limit = 2G
max_execution_time = 300
; OPcache for production
opcache.enable = 1
opcache.memory_consumption = 256
opcache.max_accelerated_files = 20000
opcache.validate_timestamps = 0
; Sessions (not needed for workers)
session.auto_start = 0
```
**Worker Concurrency Tuning:**
```bash
# Light jobs (emails, notifications)
php console.php worker:start --concurrency=8
# Heavy jobs (reports, exports)
php console.php worker:start --concurrency=2
# Mixed workload
php console.php worker:start --concurrency=4
```
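The values above can be derived from the 2-core node size with a simple heuristic: oversubscribe I/O-bound queues, run one job per core for CPU-bound queues. This is an assumption for illustration, not a framework rule.

```bash
#!/bin/sh
# Concurrency heuristic per 2-core worker node (see minimum requirements).
cores=2
light=$((cores * 4))   # I/O-bound (emails, notifications): oversubscribe 4x
heavy=$cores           # CPU-bound (reports, exports): one job per core
mixed=$((cores * 2))   # mixed workload
echo "light=$light heavy=$heavy mixed=$mixed"
```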
## Security Configuration
### Network Security
**Firewall Rules (iptables):**
```bash
# Worker nodes talking to each other (health checks)
iptables -A INPUT -p tcp --dport 8080 -s 10.0.1.0/24 -j ACCEPT
# Database access from worker nodes only
iptables -A INPUT -p tcp --dport 3306 -s 10.0.1.0/24 -j ACCEPT
# Redis access (if used)
iptables -A INPUT -p tcp --dport 6379 -s 10.0.1.0/24 -j ACCEPT
# Admin interface from the management network only
iptables -A INPUT -p tcp --dport 80 -s 10.0.0.0/24 -j ACCEPT
```
### Database Security
```sql
-- Dedicated queue user with minimal privileges
CREATE USER 'queue_user'@'10.0.1.%' IDENTIFIED BY 'secure_production_password';
GRANT SELECT, INSERT, UPDATE, DELETE ON framework_production.queue_workers TO 'queue_user'@'10.0.1.%';
GRANT SELECT, INSERT, UPDATE, DELETE ON framework_production.distributed_locks TO 'queue_user'@'10.0.1.%';
GRANT SELECT, INSERT, UPDATE, DELETE ON framework_production.job_assignments TO 'queue_user'@'10.0.1.%';
GRANT SELECT, INSERT, UPDATE, DELETE ON framework_production.worker_health_checks TO 'queue_user'@'10.0.1.%';
GRANT SELECT, INSERT, UPDATE, DELETE ON framework_production.failover_events TO 'queue_user'@'10.0.1.%';
FLUSH PRIVILEGES;
```
## Disaster Recovery
### Backup Strategy
**Database Backup:**
```bash
#!/bin/bash
# daily-queue-backup.sh
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backup/queue-system"
mkdir -p "$BACKUP_DIR"
# Back up only the queue-related tables
# (mysqldump expects the database name before the table list)
mysqldump --single-transaction \
  --routines --triggers \
  framework_production \
  queue_workers distributed_locks job_assignments worker_health_checks failover_events \
  > "$BACKUP_DIR/queue_backup_$DATE.sql"
# Retention: 30 days
find "$BACKUP_DIR" -name "queue_backup_*.sql" -mtime +30 -delete
```
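The retention rule is easy to dry-run on a scratch directory before trusting it with real backups; a sketch using GNU touch/find:

```bash
#!/bin/sh
# Dry run of the 30-day retention rule on a scratch directory (GNU coreutils).
BACKUP_DIR=$(mktemp -d)
touch "$BACKUP_DIR/queue_backup_recent.sql"
touch -d '40 days ago' "$BACKUP_DIR/queue_backup_stale.sql"
find "$BACKUP_DIR" -name "queue_backup_*.sql" -mtime +30 -delete
ls "$BACKUP_DIR"
```

Only the recent backup survives the `find ... -mtime +30 -delete` pass.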
**Worker State Backup:**
```bash
# Back up worker configuration
php console.php worker:export-config > /backup/worker-config-$(date +%Y%m%d).json
```
### Recovery Procedures
**Database Recovery:**
```bash
# Restore tables
mysql framework_production < /backup/queue_backup_YYYYMMDD_HHMMSS.sql
# Verify migrations
php console.php db:status
# Restart workers
docker-compose restart queue-worker
```
**Worker Recovery:**
```bash
# Clean up crashed workers
php console.php worker:cleanup-crashed
# Failover for lost jobs
php console.php worker:failover-recovery
# Restart the worker pool
docker-compose up -d --scale queue-worker=3
```
## Monitoring & Alerting
### Critical Alerts
**Prometheus Alert Rules (alerts.yml):**
```yaml
groups:
  - name: queue-system
    rules:
      - alert: WorkerDown
        expr: up{job="queue-workers"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Queue worker {{ $labels.instance }} is down"

      - alert: HighJobBacklog
        expr: queue_length > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High job backlog in queue {{ $labels.queue_name }}"

      - alert: WorkerHealthFailing
        expr: rate(worker_health_checks_failed_total[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Worker health checks failing"

      - alert: DatabaseConnectionExhaustion
        expr: db_connection_pool_active / db_connection_pool_max > 0.9
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool near exhaustion"
```
### Log Monitoring
**Logstash Configuration:**
```ruby
input {
  file {
    path  => "/var/log/queue-workers/*.log"
    type  => "queue-worker"
    codec => json
  }
}

filter {
  if [type] == "queue-worker" {
    if [level] == "ERROR" or [level] == "CRITICAL" {
      mutate {
        add_tag => ["alert"]
      }
    }
  }
}

output {
  if "alert" in [tags] {
    email {
      to      => "ops-team@example.com"
      subject => "Queue System Alert: %{[level]} in %{[component]}"
      body    => "Error: %{[message]}\nContext: %{[context]}"
    }
  }
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "queue-logs-%{+YYYY.MM.dd}"
  }
}
```
## Maintenance Procedures
### Routine Maintenance
**Weekly Maintenance Script:**
```bash
#!/bin/bash
# weekly-queue-maintenance.sh
echo "Starting weekly queue system maintenance..."
# Clean up old health check entries (> 30 days)
php console.php queue:cleanup-health-checks --days=30
# Clean up old failover events (> 90 days)
php console.php queue:cleanup-failover-events --days=90
# Clean up orphaned job assignments
php console.php queue:cleanup-orphaned-assignments
# Refresh database statistics
mysql framework_production -e "ANALYZE TABLE queue_workers, distributed_locks, job_assignments, worker_health_checks, failover_events;"
# Worker pool health check
php console.php worker:validate-pool
echo "Weekly maintenance completed."
```
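To run this unattended, the script can be scheduled via cron; the path, log file, and time below are examples, not fixed conventions.

```bash
# crontab entry: Sundays at 03:00 (script path is an example)
0 3 * * 0 /opt/scripts/weekly-queue-maintenance.sh >> /var/log/queue-maintenance.log 2>&1
```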
**Rolling Updates:**
```bash
#!/bin/bash
# rolling-update-workers.sh
WORKER_NODES=("worker-node-1" "worker-node-2" "worker-node-3")

for node in "${WORKER_NODES[@]}"; do
  echo "Updating $node..."
  # Graceful shutdown
  ssh "$node" "docker exec queue-worker php console.php worker:stop --graceful"
  # Wait for shutdown
  sleep 30
  # Update container
  ssh "$node" "docker-compose pull && docker-compose up -d"
  # Wait for startup
  sleep 60
  # Health check
  ssh "$node" "docker exec queue-worker php console.php worker:health" || exit 1
  echo "$node updated successfully"
done

echo "Rolling update completed."
```
## Troubleshooting
### Common Issues
**1. Worker does not start:**
```bash
# Check Database Connection
php console.php db:status
# Check Worker Configuration
php console.php worker:validate-config
# Check System Resources
free -h && df -h
# Check Docker Logs
docker-compose logs queue-worker
```
**2. Jobs get stuck in the queue:**
```bash
# Check Worker Status
php console.php worker:list
# Check Distributed Locks
php console.php queue:list-locks
# Force Failover Recovery
php console.php worker:failover-recovery --force
# Clear Stale Locks
php console.php queue:clear-stale-locks
```
**3. Performance Issues:**
```bash
# Check Database Performance
php console.php db:performance-stats
# Check Worker Resource Usage
php console.php worker:resource-stats
# Analyze Slow Queries
mysql framework_production -e "SELECT * FROM performance_schema.events_statements_summary_by_digest ORDER BY avg_timer_wait DESC LIMIT 10;"
```
### Emergency Procedures
**Worker Pool Restart:**
```bash
# Gracefully stop all workers
docker-compose exec queue-worker php console.php worker:stop-all --graceful
# Wait for shutdown
sleep 60
# Restart Container
docker-compose restart queue-worker
# Verify restart
php console.php worker:list
```
**Database Failover:**
```bash
# Switch to backup database
sed -i 's/DB_HOST=primary-db/DB_HOST=backup-db/' .env.production
# Restart worker pool
docker-compose restart queue-worker
# Verify connection
php console.php db:status
```
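The sed substitution is worth dry-running against a scratch copy before touching the live `.env.production`; a minimal sketch:

```bash
#!/bin/sh
# Dry run of the failover substitution on a scratch copy.
env_copy=$(mktemp)
echo "DB_HOST=primary-db" > "$env_copy"
sed -i 's/DB_HOST=primary-db/DB_HOST=backup-db/' "$env_copy"
grep '^DB_HOST=' "$env_copy"
```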
## Performance Benchmarks
Based on the performance tests, the following benchmarks can be expected:
### Single Worker
- **Job Distribution**: < 10ms per job
- **Worker Selection**: < 5ms per selection
- **Lock Acquisition**: < 2ms per lock
- **Health Check**: < 1ms per check
### Multi-Worker Setup (3 workers)
- **Throughput**: 500+ jobs/second
- **Load Balancing**: even distribution within ±5%
- **Failover Time**: < 30 seconds
- **Resource Usage**: < 80% CPU at full load
### Database Performance
- **Connection Pool**: 95%+ efficiency
- **Query Response**: < 10ms for standard operations
- **Lock Contention**: < 1% under normal load
This documentation should serve as the basis for the production deployment and be updated regularly based on operational experience.