# Docker Swarm + Traefik Deployment Guide
Production deployment guide for the Custom PHP Framework, using Docker Swarm for orchestration and Traefik as the load balancer.
## Architecture Overview
```
Internet → Traefik (SSL Termination, Load Balancing)
                ↓
    [Web Service - 3 Replicas]
        ↓               ↓
    Database          Redis            Queue Workers
  (PostgreSQL)   (Cache/Sessions)      (2 Replicas)
```
**Key Components**:
- **Traefik v2.10**: Reverse proxy, SSL termination, automatic service discovery
- **Web Service**: 3 replicas of PHP-FPM + Nginx (HTTP only, Traefik handles HTTPS)
- **PostgreSQL 16**: Single instance database (manager node)
- **Redis 7**: Sessions and cache (manager node)
- **Queue Workers**: 2 replicas for background job processing
- **Docker Swarm**: Native container orchestration with rolling updates and health checks
## Prerequisites
1. **Docker Engine 28.0+** with Swarm mode enabled
2. **Production Server** with SSH access
3. **SSL Certificates** in `./ssl/` directory (cert.pem, key.pem)
4. **Environment Variables** in `.env` file on production server
5. **Docker Image** built and available
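The file-based prerequisites in this list can be checked before deploying. The following is a minimal sketch, assuming the file names used in this guide (`ssl/cert.pem`, `ssl/key.pem`, `.env`, `docker-compose.prod.yml`) all live in one project directory; the function name `missing_prereqs` is hypothetical.

```bash
# Print every prerequisite file that is missing from the project directory.
# The exit status is 0 only when nothing is missing.
missing_prereqs() {
  dir=${1:-.}
  rc=0
  for f in ssl/cert.pem ssl/key.pem .env docker-compose.prod.yml; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f"
      rc=1
    fi
  done
  return "$rc"
}
```

On the production server this could be run as `missing_prereqs /path/to/project && echo "prerequisites OK"`.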
## Initial Setup
### 1. Initialize Docker Swarm
On production server:
```bash
docker swarm init
```
Verify:
```bash
docker node ls
# Should show 1 node as Leader
```
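If the setup is scripted, it can help to make the initialization idempotent so that re-running it does not fail on a node that is already in a swarm. A small sketch, assuming a single-node setup as described here; `ensure_swarm` is a hypothetical helper name. On hosts with several IP addresses, `docker swarm init` may additionally need `--advertise-addr`.

```bash
# Initialize Swarm only when this node is not already part of one.
# LocalNodeState is reported by `docker info` (for example "active" or "inactive").
ensure_swarm() {
  state=$(docker info --format '{{.Swarm.LocalNodeState}}')
  if [ "$state" = "active" ]; then
    echo "swarm already active"
  else
    docker swarm init
  fi
}
```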
### 2. Create Docker Secrets
Create secrets from .env file values:
```bash
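cd /home/deploy/framework
# Load the values from .env into the current shell (assumes plain KEY=value lines)
set -a; . ./.env; set +a
# Create secrets (one-time setup); printf avoids storing the trailing newline that echo adds
printf '%s' "$DB_PASSWORD" | docker secret create db_password -
printf '%s' "$APP_KEY" | docker secret create app_key -
printf '%s' "$VAULT_ENCRYPTION_KEY" | docker secret create vault_encryption_key -
printf '%s' "$SHOPIFY_WEBHOOK_SECRET" | docker secret create shopify_webhook_secret -
printf '%s' "$RAPIDMAIL_PASSWORD" | docker secret create rapidmail_password -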
```
Or use the automated script:
```bash
./scripts/setup-production-secrets.sh
```
Verify secrets:
```bash
docker secret ls
```
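Rather than reading the `docker secret ls` table by eye, the output can be compared against the five secret names created in this step. A minimal sketch; the helper name `missing_secrets` and the variable `EXPECTED_SECRETS` are hypothetical.

```bash
# Names of the five secrets created in this step.
EXPECTED_SECRETS="db_password app_key vault_encryption_key shopify_webhook_secret rapidmail_password"

# Read existing secret names (one per line) on stdin
# and print every expected name that is absent.
missing_secrets() {
  existing=$(cat)
  for name in $EXPECTED_SECRETS; do
    printf '%s\n' "$existing" | grep -qxF "$name" || echo "$name"
  done
}

# Usage on the manager node:
#   docker secret ls --format '{{.Name}}' | missing_secrets
```

Empty output means every expected secret exists.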
### 3. Build and Transfer Docker Image
On local machine:
**Option A: Via Private Registry** (if available):
```bash
# Build image
docker build -f Dockerfile.production -t 94.16.110.151:5000/framework:latest .
# Push to registry
docker push 94.16.110.151:5000/framework:latest
```
**Option B: Direct Transfer via SSH** (recommended for now):
```bash
# Build image
docker build -f Dockerfile.production -t 94.16.110.151:5000/framework:latest .
# Save and transfer to production
docker save 94.16.110.151:5000/framework:latest | \
ssh -i ~/.ssh/production deploy@94.16.110.151 'docker load'
```
### 4. Deploy Stack
On production server:
```bash
cd /home/deploy/framework
# Deploy the stack
docker stack deploy -c docker-compose.prod.yml framework
# Monitor deployment
watch docker stack ps framework
# Check service status
docker stack services framework
```
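For scripted deployments, watching the task list by hand can be replaced with a loop that waits until every service reports as many running replicas as desired. This is a sketch under the assumption that comparing the `current/desired` replica column is a sufficient readiness signal; `unconverged_services` and `wait_for_stack` are hypothetical helper names.

```bash
# Read "name current/desired" lines on stdin and print services still converging.
unconverged_services() {
  awk '{
    split($2, r, "/")
    if (r[1] != r[2]) print $1 " " $2
  }'
}

# Poll until every service has current == desired, or give up after N tries.
wait_for_stack() {
  stack=$1
  tries=${2:-30}
  pending=""
  while [ "$tries" -gt 0 ]; do
    pending=$(docker stack services "$stack" --format '{{.Name}} {{.Replicas}}' | unconverged_services)
    if [ -z "$pending" ]; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 5
  done
  echo "still converging:"
  echo "$pending"
  return 1
}
```

After `docker stack deploy`, calling `wait_for_stack framework` returns 0 once all services have converged and 1 if they have not done so within the retry budget.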
## Health Monitoring
### Check Service Status
```bash
# List all services
docker stack services framework
# Check specific service
docker service ps framework_web
# Follow service logs (each command blocks until Ctrl+C, so run them in separate terminals)
docker service logs framework_web -f
docker service logs framework_traefik -f
docker service logs framework_db -f
```
### Health Check Endpoints
- **Main Health**: http://localhost/health (via Traefik)
- **Traefik Dashboard**: http://traefik.localhost:8080 (manager node only)
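The main health endpoint can be polled from a script, for example right after a deployment. A minimal sketch using `curl`; the helper name `wait_for_health` and its retry defaults are assumptions, not part of the framework. If Traefik redirects HTTP to HTTPS, the probe may need `-L` or an `https://` URL instead.

```bash
# Poll a health URL until curl succeeds or the attempts run out.
# -f makes curl exit non-zero on HTTP 4xx/5xx responses.
wait_for_health() {
  url=$1
  attempts=${2:-10}
  delay=${3:-3}
  while [ "$attempts" -gt 0 ]; do
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "healthy: $url"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  echo "unhealthy: $url" >&2
  return 1
}
```

Example: `wait_for_health http://localhost/health 10 3`.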
### Expected Service Replicas
| Service | Replicas | Purpose |
|---------|----------|---------|
| traefik | 1 | Reverse proxy + SSL |
| web | 3 | Application servers |
| db | 1 | PostgreSQL database |
| redis | 1 | Cache + sessions |
| queue-worker | 2 | Background jobs |
## Rolling Updates
### Update Application
1. Build new image with updated code:
```bash
docker build -f Dockerfile.production -t 94.16.110.151:5000/framework:latest .
```
2. Transfer to production (if no registry):
```bash
docker save 94.16.110.151:5000/framework:latest | \
ssh -i ~/.ssh/production deploy@94.16.110.151 'docker load'
```
3. Update the service:
```bash
# On production server (--force redeploys the tasks even though the tag is still "latest")
docker service update --force --image 94.16.110.151:5000/framework:latest framework_web
```
The update will:
- Roll out to 1 container at a time (`parallelism: 1`)
- Wait 10 seconds between updates (`delay: 10s`)
- Start new container before stopping old one (`order: start-first`)
- Automatically roll back on failure (`failure_action: rollback`)
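To get a feel for how long a rollout takes with these settings, the waiting time between batches can be estimated. The sketch below is a rough lower bound only: it ignores container start time and health-check monitoring, and `min_rollout_wait` is a hypothetical helper name.

```bash
# Lower bound (in seconds) on the time spent waiting between update batches.
# batches = ceil(replicas / parallelism); the delay applies between batches.
min_rollout_wait() {
  replicas=$1
  parallelism=$2
  delay=$3
  batches=$(( (replicas + parallelism - 1) / parallelism ))
  echo $(( (batches - 1) * delay ))
}

# 3 web replicas, parallelism 1, delay 10s
min_rollout_wait 3 1 10   # prints 20

# The settings actually applied to the service can be read back with:
#   docker service inspect framework_web --format '{{json .Spec.UpdateConfig}}'
```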
### Monitor Update Progress
```bash
# Watch update status
watch docker service ps framework_web
# View update logs
docker service logs framework_web -f --tail 50
```
### Manual Rollback
If needed, roll back to the previous version:
```bash
docker service rollback framework_web
```
## Troubleshooting
### Service Won't Start
Check service logs:
```bash
docker service logs framework_web --tail 100
```
Check task failures:
```bash
docker service ps framework_web --no-trunc
```
### Container Crashing
Inspect an individual container:
```bash
# Get container ID
docker ps -a | grep framework_web
# View logs
docker logs <container_id>
# Exec into running container
docker exec -it <container_id> bash
```
### SSL/TLS Issues
Traefik handles SSL termination. Check Traefik logs:
```bash
docker service logs framework_traefik -f
```
Verify that the SSL certificates are mounted in `docker-compose.prod.yml`:
```yaml
volumes:
  - ./ssl:/ssl:ro
```
### Database Connection Issues
Check PostgreSQL health:
```bash
docker service logs framework_db --tail 50
# Exec into db container
docker exec -it $(docker ps -q -f name=framework_db) psql -U postgres -d framework_prod
```
### Redis Connection Issues
Check Redis availability:
```bash
docker service logs framework_redis --tail 50
# Test Redis connection
docker exec -it $(docker ps -q -f name=framework_redis) redis-cli ping
```
### Performance Issues
Check resource usage:
```bash
# Service resource limits
docker service inspect framework_web --format='{{json .Spec.TaskTemplate.Resources}}' | jq
# Container stats
docker stats
```
## Scaling
### Scale Web Service
```bash
# Scale up to 5 replicas
docker service scale framework_web=5
# Scale down to 2 replicas
docker service scale framework_web=2
```
### Scale Queue Workers
```bash
# Scale workers based on queue backlog
docker service scale framework_queue-worker=4
```
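One way to turn a queue backlog into a replica count is to allot a fixed number of queued jobs per worker and clamp the result. This is a sketch only: how the backlog is measured is not covered in this guide, and the helper name `workers_for_backlog` and all thresholds are assumptions.

```bash
# Desired worker count: one worker per `per_worker` queued jobs,
# clamped to the range [min, max].
workers_for_backlog() {
  backlog=$1
  per_worker=$2
  min=$3
  max=$4
  want=$(( (backlog + per_worker - 1) / per_worker ))
  if [ "$want" -lt "$min" ]; then want=$min; fi
  if [ "$want" -gt "$max" ]; then want=$max; fi
  echo "$want"
}

# Usage (backlog value obtained elsewhere):
#   docker service scale framework_queue-worker="$(workers_for_backlog "$backlog" 100 2 8)"
```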
## Cleanup
### Remove Stack
```bash
# Remove entire stack
docker stack rm framework
# Verify removal
docker stack ls
```
### Remove Secrets
```bash
# List secrets
docker secret ls
# Remove specific secret
docker secret rm db_password
# Remove all framework secrets (remove the stack first; secrets in use cannot be deleted)
docker secret rm db_password app_key vault_encryption_key shopify_webhook_secret rapidmail_password
```
### Leave Swarm
```bash
# Force leave Swarm (removes all services and secrets)
docker swarm leave --force
```
## Network Architecture
### Overlay Networks
- **traefik-public**: External network for Traefik ↔ Web communication
- **backend**: Internal network for Web ↔ Database/Redis communication
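If `traefik-public` is declared as `external: true` in `docker-compose.prod.yml` (an assumption; the compose file is not shown in this guide), the network has to exist before `docker stack deploy` runs. A small sketch with a hypothetical helper name, `ensure_overlay_network`:

```bash
# Create an overlay network only if it does not exist yet.
ensure_overlay_network() {
  name=$1
  if docker network inspect "$name" >/dev/null 2>&1; then
    echo "network exists: $name"
  else
    docker network create --driver overlay "$name" >/dev/null
    echo "network created: $name"
  fi
}

# Usage on the manager node:
#   ensure_overlay_network traefik-public
```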
### Port Mappings
| Port | Service | Purpose |
|------|---------|---------|
| 80 | Traefik | HTTP (redirects to 443) |
| 443 | Traefik | HTTPS (production traffic) |
| 8080 | Traefik | Dashboard (manager node only) |
## Volume Management
### Named Volumes
| Volume | Purpose | Mounted In |
|--------|---------|------------|
| traefik-logs | Traefik access logs | traefik |
| storage-logs | Application logs | web, queue-worker |
| storage-uploads | User uploads | web |
| storage-queue | Queue data | queue-worker |
| db-data | PostgreSQL data | db |
| redis-data | Redis persistence | redis |
### Backup Volumes
```bash
# Backup database
docker exec $(docker ps -q -f name=framework_db) pg_dump -U postgres framework_prod > backup.sql
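# Backup Redis: write the snapshot to a scratch path (not Redis' own /data/dump.rdb), then copy it out
docker exec $(docker ps -q -f name=framework_redis) redis-cli --rdb /tmp/redis-backup.rdb
docker cp $(docker ps -q -f name=framework_redis):/tmp/redis-backup.rdb ./redis-backup.rdb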
```
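When backups run repeatedly, writing to a timestamped file name and keeping only the newest few avoids both overwriting the previous dump and filling the disk. The sketch below assumes dumps are named `framework_prod-<UTC timestamp>.sql`, which is an illustrative convention rather than something this guide prescribes; `prune_backups` is a hypothetical helper name.

```bash
# Keep only the newest `keep` dumps in `dir`.
# Zero-padded UTC timestamps sort chronologically, so the oldest names come first.
prune_backups() {
  dir=$1
  keep=$2
  total=$(find "$dir" -maxdepth 1 -type f -name 'framework_prod-*.sql' | wc -l)
  excess=$((total - keep))
  if [ "$excess" -le 0 ]; then
    return 0
  fi
  find "$dir" -maxdepth 1 -type f -name 'framework_prod-*.sql' | sort | head -n "$excess" |
    while IFS= read -r f; do
      rm -f -- "$f"
    done
}

# Usage:
#   out="backups/framework_prod-$(date -u +%Y%m%dT%H%M%SZ).sql"
#   docker exec $(docker ps -q -f name=framework_db) pg_dump -U postgres framework_prod > "$out"
#   prune_backups backups 7
```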
## Security Best Practices
1. **Secrets Management**: Never commit secrets to version control; use Docker Secrets instead
2. **Network Isolation**: Backend network is internal-only, no external access
3. **SSL/TLS**: Traefik enforces HTTPS, redirects HTTP → HTTPS
4. **Health Checks**: All services have health checks with automatic restart
5. **Resource Limits**: Production services have memory/CPU limits
6. **Least Privilege**: Containers run as www-data (not root) where possible
## Phase 2 - Monitoring (Coming Soon)
- Prometheus for metrics collection
- Grafana dashboards
- Automated PostgreSQL backups
- Email/Slack alerting
## Phase 3 - CI/CD (Coming Soon)
- Gitea Actions workflow
- Loki + Promtail for log aggregation
- Performance tuning
## Phase 4 - High Availability (Future)
- Multi-node Swarm cluster
- Varnish CDN cache layer
- PostgreSQL Primary/Replica with pgpool
- MinIO object storage
## References
- [Docker Swarm Documentation](https://docs.docker.com/engine/swarm/)
- [Traefik v2 Documentation](https://doc.traefik.io/traefik/)
- [Docker Secrets Management](https://docs.docker.com/engine/swarm/secrets/)