# Production Logging Configuration

This guide covers the production logging setup, including its performance optimization, resilience, and observability features.

## Overview

The framework provides production-ready logging configurations optimized for different deployment scenarios:

- **Standard Production**: Balanced configuration for typical production workloads
- **High Performance**: Optimized for high-throughput applications with sampling
- **Production with Aggregation**: Reduces log volume by 70-90% while preserving critical logs
- **Debug**: Temporary configuration for production troubleshooting
- **Staging**: Development-friendly configuration for staging environments

## Configuration Options

### 1. Standard Production (Recommended)

**Use Case**: Default production setup for most applications

**Features**:
- Resilient logging with automatic fallback
- Buffered writes for performance (100 entries, 5s flush)
- 14-day rotating log files
- Structured logs with request/trace context
- INFO level and above
- Performance metrics included

**Setup**:
```php
use App\Framework\Logging\ProductionLogConfig;

$logConfig = ProductionLogConfig::production(
    logPath: '/var/log/app',
    requestIdGenerator: $container->get(RequestIdGenerator::class)
);
```

**Log Files**:
- `/var/log/app/app.log` - Primary application logs (14 days retention)
- `/var/log/app/fallback.log` - Fallback when primary fails (7 days retention)
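
The fallback behaviour described above can be pictured with a small sketch. `FallbackWriter` and its two callables are hypothetical names used only for illustration; the framework's actual resilient handler may be structured differently.

```php
// Hypothetical sketch: not the framework's actual resilient handler.
final class FallbackWriter
{
    /** @var callable(string): void */
    private $primary;
    /** @var callable(string): void */
    private $fallback;
    private int $fallbackActivations = 0;

    public function __construct(callable $primary, callable $fallback)
    {
        $this->primary = $primary;
        $this->fallback = $fallback;
    }

    public function write(string $line): void
    {
        try {
            ($this->primary)($line);
        } catch (\Throwable $e) {
            // Primary failed (disk full, permissions, ...): count it and
            // route the entry to the fallback target instead of losing it.
            $this->fallbackActivations++;
            ($this->fallback)($line);
        }
    }

    public function fallbackActivations(): int
    {
        return $this->fallbackActivations;
    }
}

// Usage: simulate a failing primary target.
$fallbackLines = [];
$writer = new FallbackWriter(
    static function (string $line): void {
        throw new \RuntimeException('primary log target unavailable');
    },
    static function (string $line) use (&$fallbackLines): void {
        $fallbackLines[] = $line;
    }
);
$writer->write('[INFO] User login successful');
```

Counting activations in the writer is one way to feed a metric such as `log_fallback_activations_total`.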

**Performance**:
- Write Latency: <1ms (buffered)
- Throughput: 10,000+ logs/second
- Disk I/O: Minimized via buffering

### 2. High Performance with Sampling

**Use Case**: High-traffic applications (>100 req/s) where log volume must be kept under control

**Features**:
- Intelligent sampling reduces volume by 80-90%
- Always logs ERROR and CRITICAL
- Larger buffer (500 entries, 10s flush)
- Minimal processors for reduced overhead

**Setup**:
```php
$logConfig = ProductionLogConfig::highPerformance(
    logPath: '/var/log/app',
    requestIdGenerator: $container->get(RequestIdGenerator::class)
);
```

**Sampling Strategy**:
```
DEBUG:   5% sampled (1 in 20)
INFO:    10% sampled (1 in 10)
WARNING: 50% sampled (1 in 2)
ERROR:   100% (always logged)
CRITICAL: 100% (always logged)
```
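
As a rough illustration of how per-level sampling with these rates can work, here is a minimal sketch. `LevelSampler` is a hypothetical class, and the injectable random source exists only to make the behaviour testable; the framework's sampler may use a different strategy.

```php
// Hypothetical sketch of level-based sampling; rates mirror the table above.
final class LevelSampler
{
    /** @var array<string, float> */
    private array $rates;
    /** @var callable(): float returns a value in [0, 1) */
    private $random;

    public function __construct(array $rates, ?callable $random = null)
    {
        $this->rates = $rates;
        $this->random = $random
            ?? static fn (): float => mt_rand() / (mt_getrandmax() + 1);
    }

    public function shouldLog(string $level): bool
    {
        // Unknown levels are kept rather than silently dropped.
        $rate = $this->rates[strtoupper($level)] ?? 1.0;
        if ($rate >= 1.0) {
            return true; // ERROR and CRITICAL never reach the random check
        }
        return ($this->random)() < $rate;
    }
}

$sampler = new LevelSampler([
    'DEBUG' => 0.05,
    'INFO' => 0.10,
    'WARNING' => 0.50,
    'ERROR' => 1.0,
    'CRITICAL' => 1.0,
]);

if ($sampler->shouldLog('INFO')) {
    // write the entry
}
```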

**Performance**:
- Write Latency: <0.5ms (buffered + sampling)
- Throughput: 50,000+ logs/second
- Disk I/O: 80% reduction vs. standard

### 3. Production with Aggregation (High Volume)

**Use Case**: Applications with repetitive log patterns (e.g., API gateways, proxies)

**Features**:
- Aggregates identical messages over time window
- Reduces log volume by 70-90%
- Preserves all ERROR and CRITICAL logs
- Aggregation summary logged periodically

**Setup**:
```php
$logConfig = ProductionLogConfig::productionWithAggregation(
    logPath: '/var/log/app',
    requestIdGenerator: $container->get(RequestIdGenerator::class)
);
```

**Aggregation Example**:
```
Before Aggregation (1000 entries):
[INFO] User login successful (x1000)

After Aggregation (1 entry):
[INFO] User login successful (count: 1000, first: 2025-01-15 10:00:00, last: 2025-01-15 10:00:59)
```
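
The grouping step can be sketched as follows. `MessageAggregator` is a hypothetical class: it assumes the caller invokes `flush()` at the end of each window and leaves out the ERROR/CRITICAL bypass for brevity.

```php
// Hypothetical sketch: groups identical messages within one window.
final class MessageAggregator
{
    /** @var array<string, array{level: string, message: string, count: int, first: int, last: int}> */
    private array $groups = [];

    public function add(string $level, string $message, int $timestamp): void
    {
        $key = $level . '|' . $message;
        if (!isset($this->groups[$key])) {
            $this->groups[$key] = [
                'level' => $level, 'message' => $message,
                'count' => 0, 'first' => $timestamp, 'last' => $timestamp,
            ];
        }
        $this->groups[$key]['count']++;
        $this->groups[$key]['last'] = max($this->groups[$key]['last'], $timestamp);
    }

    /** @return list<string> one summary line per distinct message; resets the window */
    public function flush(): array
    {
        $lines = [];
        foreach ($this->groups as $g) {
            $lines[] = sprintf(
                '[%s] %s (count: %d, first: %s, last: %s)',
                $g['level'], $g['message'], $g['count'],
                gmdate('Y-m-d H:i:s', $g['first']),
                gmdate('Y-m-d H:i:s', $g['last'])
            );
        }
        $this->groups = [];
        return $lines;
    }
}

$aggregator = new MessageAggregator();
$start = gmmktime(10, 0, 0, 1, 15, 2025); // 2025-01-15 10:00:00 UTC
for ($i = 0; $i < 1000; $i++) {
    // Hypothetical input: 1000 identical entries spread over 59 seconds.
    $aggregator->add('INFO', 'User login successful', $start + intdiv($i * 59, 999));
}
$summary = $aggregator->flush();
```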

**Performance**:
- Write Latency: <1ms
- Throughput: 20,000+ logs/second
- Disk I/O: 70-90% reduction
- Aggregation Window: 60 seconds

### 4. Debug Configuration

**Use Case**: Temporary production debugging (short-term troubleshooting)

**Features**:
- DEBUG level enabled
- Smaller buffer for faster feedback (50 entries, 2s flush)
- Extensive performance metrics
- 3-day retention (auto-cleanup)

**Setup**:
```php
$logConfig = ProductionLogConfig::debug(logPath: '/var/log/app');
```

**⚠️ Warning**: High overhead - use sparingly and disable after debugging

**Performance Impact**:
- 5-10x higher log volume
- 2-3ms write latency
- Increased disk I/O

### 5. Staging Environment

**Use Case**: Pre-production staging environment

**Features**:
- DEBUG level for development visibility
- Production-like resilience features
- 7-day retention
- Full processor stack for testing

**Setup**:
```php
$logConfig = ProductionLogConfig::staging(logPath: '/var/log/app');
```

## Integration with Application

### Environment-Based Configuration

```php
use App\Framework\Config\Environment;
use App\Framework\Config\EnvKey;
use App\Framework\Logging\ProductionLogConfig;

$env = $container->get(Environment::class);
$logPath = $env->get(EnvKey::LOG_PATH, '/var/log/app');

$logConfig = match ($env->get(EnvKey::APP_ENV)) {
    'production' => ProductionLogConfig::productionWithAggregation(
        logPath: $logPath,
        requestIdGenerator: $container->get(RequestIdGenerator::class)
    ),
    'staging' => ProductionLogConfig::staging($logPath),
    'debug' => ProductionLogConfig::debug($logPath),
    default => ProductionLogConfig::production(
        logPath: $logPath,
        requestIdGenerator: $container->get(RequestIdGenerator::class)
    )
};

// Register in DI container
$container->singleton(LogConfig::class, $logConfig);
```

### Environment Variables

```env
# .env.production
APP_ENV=production
LOG_PATH=/var/log/app
LOG_LEVEL=INFO
LOG_ENABLE_SAMPLING=true
LOG_ENABLE_AGGREGATION=true
LOG_BUFFER_SIZE=100
LOG_FLUSH_INTERVAL=5

# .env.staging
APP_ENV=staging
LOG_PATH=/var/log/app
LOG_LEVEL=DEBUG
LOG_ENABLE_SAMPLING=false
LOG_ENABLE_AGGREGATION=false
```
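
If these variables need to be read outside the framework's `Environment` class, a plain-PHP sketch could look like the following. `logSettingsFromEnv` is a hypothetical helper, and the defaults it applies are assumptions taken from the production values above.

```php
// Hypothetical helper: reads the LOG_* variables shown above from an array
// (e.g. $_ENV) and falls back to assumed defaults when a key is absent.
function logSettingsFromEnv(array $env): array
{
    // Accepts "true"/"false", "1"/"0", "on"/"off"; anything else uses the default.
    $bool = static fn (string $key, bool $default): bool =>
        isset($env[$key])
            ? (filter_var($env[$key], FILTER_VALIDATE_BOOLEAN, FILTER_NULL_ON_FAILURE) ?? $default)
            : $default;

    return [
        'path' => (string) ($env['LOG_PATH'] ?? '/var/log/app'),
        'level' => strtoupper((string) ($env['LOG_LEVEL'] ?? 'INFO')),
        'sampling' => $bool('LOG_ENABLE_SAMPLING', false),
        'aggregation' => $bool('LOG_ENABLE_AGGREGATION', false),
        'bufferSize' => max(1, (int) ($env['LOG_BUFFER_SIZE'] ?? 100)),
        'flushIntervalSeconds' => max(0.0, (float) ($env['LOG_FLUSH_INTERVAL'] ?? 5)),
    ];
}

$settings = logSettingsFromEnv([
    'LOG_LEVEL' => 'debug',
    'LOG_ENABLE_SAMPLING' => 'true',
    'LOG_BUFFER_SIZE' => '100',
    'LOG_FLUSH_INTERVAL' => '5',
]);
```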

## Log Rotation and Retention

All production configurations use rotating file handlers:

**Rotation Strategy**:
- Daily rotation at midnight
- Compressed archives (.gz) for old logs
- Automatic cleanup of old files

**Retention Policies**:
```
Production:     14 days (app.log), 7 days (fallback.log)
High Perf:      7 days (app.log), 3 days (fallback.log)
Debug:          3 days (debug.log), 1 day (fallback.log)
Staging:        7 days (staging.log), 3 days (fallback.log)
```
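
To make the retention rule concrete, the sketch below selects rotated files that are older than the retention period. The helper name `expiredLogFiles` and the `app-YYYY-MM-DD.log[.gz]` naming scheme are assumptions for illustration; the built-in rotation may name and prune files differently.

```php
// Hypothetical helper: the file naming scheme (app-YYYY-MM-DD.log[.gz]) is an
// assumption for illustration, not the framework's documented layout.
function expiredLogFiles(array $fileNames, int $retentionDays, \DateTimeImmutable $today): array
{
    $cutoff = $today->setTime(0, 0)->modify("-{$retentionDays} days");
    $expired = [];
    foreach ($fileNames as $name) {
        if (!preg_match('/-(\d{4}-\d{2}-\d{2})\.log(\.gz)?$/', $name, $m)) {
            continue; // not a rotated file (e.g. the active app.log)
        }
        $date = \DateTimeImmutable::createFromFormat('!Y-m-d', $m[1], $today->getTimezone());
        if ($date !== false && $date < $cutoff) {
            $expired[] = $name;
        }
    }
    return $expired;
}

$today = new \DateTimeImmutable('2025-01-15', new \DateTimeZone('UTC'));
$expired = expiredLogFiles(
    ['app.log', 'app-2025-01-14.log.gz', 'app-2025-01-01.log.gz', 'app-2024-12-31.log.gz'],
    14,
    $today
);
```

With a 14-day retention and this cutoff rule, only the file dated before 2025-01-01 is selected for deletion.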

**Disk Space Requirements**:
- Standard Production: ~2-5 GB (14 days)
- With Sampling: ~500 MB (7 days)
- With Aggregation: ~300 MB (14 days)

## Log Format

### Structured JSON Logs

All production configurations output structured JSON:

```json
{
  "timestamp": "2025-01-15T10:00:00+00:00",
  "level": "INFO",
  "message": "User login successful",
  "context": {
    "user_id": "12345",
    "ip_address": "203.0.113.42"
  },
  "request_id": "req_8f3a9b2c1d",
  "trace_id": "trace_7e4f2a1b",
  "span_id": "span_9c6d3e8f",
  "performance": {
    "memory_mb": 45.2,
    "execution_time_ms": 23.5
  }
}
```
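
A record of this shape can be produced with `json_encode`. The function below is a hypothetical formatter that only mirrors the field names in the example; it is not the framework's formatter.

```php
// Hypothetical formatter: field names follow the example above.
function formatLogEntry(
    \DateTimeImmutable $time,
    string $level,
    string $message,
    array $context = [],
    array $extra = []
): string {
    $record = [
        'timestamp' => $time->format(\DateTimeInterface::ATOM),
        'level' => strtoupper($level),
        'message' => $message,
        'context' => (object) $context, // encode an empty context as {} rather than []
    ] + $extra;

    // JSON_THROW_ON_ERROR surfaces encoding problems instead of returning false.
    return json_encode(
        $record,
        JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE
    );
}

$line = formatLogEntry(
    new \DateTimeImmutable('2025-01-15T10:00:00+00:00'),
    'info',
    'User login successful',
    ['user_id' => '12345', 'ip_address' => '203.0.113.42'],
    ['request_id' => 'req_8f3a9b2c1d']
);
```

Writing one JSON object per line keeps the file easy to ingest with log aggregation tools.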

### Log Processors

**RequestIdProcessor**:
- Adds unique request ID to all logs
- Enables request tracing across services
- Integration with RequestIdGenerator

**TraceContextProcessor**:
- Adds distributed tracing context (trace_id, span_id)
- OpenTelemetry compatible
- Cross-service correlation

**PerformanceProcessor**:
- Memory usage at log time
- Execution time since request start
- CPU usage (optional)

**MetricsCollectingProcessor** (with Aggregation):
- Collects log volume metrics
- Error rate tracking
- Performance metrics aggregation
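
Conceptually, processors form a chain in which each one enriches the record before it is written. The sketch below models them as plain callables; `applyProcessors` and the two closures are hypothetical stand-ins, not the framework's processor interface.

```php
// Hypothetical processor chain: each processor is a callable that receives
// the record array and returns an enriched copy.
function applyProcessors(array $record, callable ...$processors): array
{
    foreach ($processors as $processor) {
        $record = $processor($record);
    }
    return $record;
}

$requestId = 'req_' . bin2hex(random_bytes(5)); // stand-in for RequestIdGenerator
$requestStart = microtime(true);

// Adds the request ID unless the record already carries one.
$addRequestId = static fn (array $r): array => $r + ['request_id' => $requestId];

// Adds memory usage and elapsed time since the (assumed) request start.
$addPerformance = static fn (array $r): array => $r + ['performance' => [
    'memory_mb' => round(memory_get_usage(true) / 1048576, 1),
    'execution_time_ms' => round((microtime(true) - $requestStart) * 1000, 1),
]];

$record = applyProcessors(
    ['level' => 'INFO', 'message' => 'User login successful'],
    $addRequestId,
    $addPerformance
);
```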

## Monitoring and Alerting

### Health Check Integration

```php
use App\Framework\Health\Checks\LoggingHealthCheck;

// Automatically registered via HealthCheckManagerInitializer
// Checks:
// - Log files writable
// - Disk space available
// - No fallback handler activation
// - Log handler performance within SLA
```

### Metrics Exposed

**Via `/metrics` endpoint** (Prometheus format):

```prometheus
# Log volume
log_entries_total{level="info"} 15234
log_entries_total{level="error"} 23

# Log processing performance
log_write_duration_seconds{percentile="p50"} 0.001
log_write_duration_seconds{percentile="p95"} 0.003
log_write_duration_seconds{percentile="p99"} 0.005

# Disk usage
log_disk_usage_bytes{path="/var/log/app"} 2147483648
log_disk_available_bytes{path="/var/log/app"} 10737418240

# Handler health
log_fallback_activations_total 0
log_buffer_full_events_total 2
```

### Alerting Rules (Example)

```yaml
# Prometheus Alert Rules
groups:
  - name: logging
    rules:
      - alert: HighErrorRate
        expr: rate(log_entries_total{level="error"}[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error log rate detected"

      - alert: LogDiskSpaceLow
        # Free space as a fraction of (free space + space used by logs)
        expr: log_disk_available_bytes / (log_disk_available_bytes + log_disk_usage_bytes) < 0.1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Log disk space critically low"

      - alert: FallbackHandlerActive
        expr: rate(log_fallback_activations_total[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Log fallback handler activated"
```

## Performance Tuning

### Buffer Size Optimization

- **Small Buffers (50-100)**: Lower latency, more disk I/O
- **Large Buffers (500-1000)**: Higher throughput, higher memory

**Tuning Guide**:
```php
// Low latency requirement (<100ms flush)
bufferSize: 50,
flushIntervalSeconds: 0.1

// Balanced (recommended)
bufferSize: 100,
flushIntervalSeconds: 5.0

// High throughput (>50k logs/s)
bufferSize: 500,
flushIntervalSeconds: 10.0
```
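
To show how the two parameters interact, here is a minimal sketch of a buffer that flushes when it is full or when the flush interval has elapsed. `LogBuffer` is a hypothetical class; the real `BufferedLogHandler` may differ, and this version only checks the interval when a new entry arrives.

```php
// Hypothetical sketch of size- and time-based flushing.
final class LogBuffer
{
    /** @var list<string> */
    private array $entries = [];
    private float $lastFlush;
    /** @var callable(list<string>): void */
    private $sink;

    public function __construct(
        callable $sink,
        private int $bufferSize = 100,
        private float $flushIntervalSeconds = 5.0,
        ?float $now = null
    ) {
        $this->sink = $sink;
        $this->lastFlush = $now ?? microtime(true);
    }

    public function add(string $entry, ?float $now = null): void
    {
        $now ??= microtime(true);
        $this->entries[] = $entry;
        $full = count($this->entries) >= $this->bufferSize;
        $stale = ($now - $this->lastFlush) >= $this->flushIntervalSeconds;
        if ($full || $stale) {
            $this->flush($now);
        }
    }

    public function flush(?float $now = null): void
    {
        if ($this->entries !== []) {
            ($this->sink)($this->entries); // one write for the whole batch
            $this->entries = [];
        }
        $this->lastFlush = $now ?? microtime(true);
    }
}

$batches = [];
$buffer = new LogBuffer(
    static function (array $batch) use (&$batches): void { $batches[] = $batch; },
    bufferSize: 3,
    flushIntervalSeconds: 5.0,
    now: 0.0
);
foreach (['a', 'b', 'c', 'd'] as $i => $entry) {
    $buffer->add($entry, now: 1.0 + $i);
}
// Flush what is left at shutdown so a clean exit does not lose entries.
$buffer->flush();
```

A larger `bufferSize` means fewer, bigger writes but more entries at risk if the process dies before a flush.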

### Sampling Configuration

```php
use App\Framework\Logging\Sampling\SamplingConfig;

// Conservative sampling (production default)
SamplingConfig::production(); // INFO: 10%, DEBUG: 5%

// Aggressive sampling (high load)
SamplingConfig::highLoad(); // INFO: 5%, DEBUG: 2%

// Custom sampling
new SamplingConfig(
    debugRate: 0.01,    // 1%
    infoRate: 0.05,     // 5%
    warningRate: 0.25,  // 25%
    errorRate: 1.0,     // 100%
    criticalRate: 1.0   // 100%
);
```

### Aggregation Configuration

```php
use App\Framework\Logging\Aggregation\AggregationConfig;

// Standard aggregation (1 minute window)
AggregationConfig::production();

// Extended aggregation (5 minute window)
new AggregationConfig(
    enabled: true,
    windowSeconds: 300,
    minLevel: LogLevel::DEBUG,
    excludedPatterns: ['Critical error', 'Fatal exception']
);
```

## Troubleshooting

### Issue: High Disk Usage

**Diagnosis**:
```bash
# Check log sizes
du -sh /var/log/app/*

# Check retention policy
ls -lh /var/log/app/app.log*
```

**Solutions**:
1. Enable sampling: `ProductionLogConfig::highPerformance()`
2. Enable aggregation: `ProductionLogConfig::productionWithAggregation()`
3. Reduce retention: Modify `maxFiles` parameter
4. Raise the minimum log level: `minLevel: LogLevel::WARNING`

### Issue: Fallback Handler Activated

**Diagnosis**:
```bash
# Check fallback logs
tail -f /var/log/app/fallback.log

# Check metrics
curl http://localhost/metrics | grep log_fallback
```

**Common Causes**:
- Disk full or permissions error
- Log file corruption
- Handler exception or crash

**Solutions**:
1. Check disk space: `df -h /var/log/app`
2. Check permissions: `ls -la /var/log/app`
3. Review error logs: `tail -n 100 /var/log/app/fallback.log`

### Issue: High Log Write Latency

**Diagnosis**:
```bash
# Check metrics
curl http://localhost/metrics | grep log_write_duration

# Check disk I/O
iostat -x 5
```

**Solutions**:
1. Increase buffer size: `bufferSize: 200`
2. Increase flush interval: `flushIntervalSeconds: 10.0`
3. Enable sampling: `ProductionLogConfig::highPerformance()`
4. Use faster disk (SSD recommended)

### Issue: Logs Missing or Incomplete

**Diagnosis**:
```bash
# Check buffer status
curl http://localhost/health/detailed | jq '.checks.logging'

# Check buffer-full events (entries may have been dropped)
curl http://localhost/metrics | grep log_buffer_full_events
```

**Common Causes**:
- Application crash before buffer flush
- Buffer overflow (logs dropped)
- Aggressive sampling configuration

**Solutions**:
1. Enable `flushOnError: true` in BufferedLogHandler
2. Reduce buffer size for more frequent flushes
3. Review sampling configuration
4. Check application error logs

## Best Practices

### 1. Use Environment-Specific Configurations

```php
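// ✅ Good: Environment-aware, falling back to the standard production config
$logConfig = match ($env) {
    'production' => ProductionLogConfig::productionWithAggregation(
        logPath: $logPath,
        requestIdGenerator: $requestIdGenerator
    ),
    'staging' => ProductionLogConfig::staging(logPath: $logPath),
    'debug' => ProductionLogConfig::debug(logPath: $logPath),
    default => ProductionLogConfig::production(
        logPath: $logPath,
        requestIdGenerator: $requestIdGenerator
    )
};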

// ❌ Bad: Hardcoded debug in production
$logConfig = ProductionLogConfig::debug();
```

### 2. Always Include Request Context

```php
// ✅ Good: Request ID for tracing
$logger->info('User login', [
    'user_id' => $userId,
    'request_id' => $request->getRequestId()
]);

// ❌ Bad: No context for debugging
$logger->info('User login');
```

### 3. Use Appropriate Log Levels

```php
// ✅ Good: Proper severity levels
$logger->debug('Cache miss for key', ['key' => $key]);
$logger->info('User logged in', ['user_id' => $userId]);
$logger->warning('Rate limit approaching', ['current' => 90, 'limit' => 100]);
$logger->error('Payment failed', ['order_id' => $orderId, 'error' => $e->getMessage()]);
$logger->critical('Database connection lost', ['attempts' => 3]);

// ❌ Bad: Everything as INFO
$logger->info('Payment failed');
```

### 4. Avoid Logging Sensitive Data

```php
// ✅ Good: Masked or excluded
$logger->info('Payment processed', [
    'order_id' => $orderId,
    'amount' => $amount,
    'card_last4' => $card->getLast4(),
]);

// ❌ Bad: Sensitive data exposed
$logger->info('Payment processed', [
    'credit_card' => $card->getNumber(),
    'cvv' => $card->getCvv()
]);
```
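
One way to enforce this centrally is to redact known sensitive keys before the context reaches the logger. `redactContext` and its default key list are hypothetical; a real deployment would need its own list and may prefer a dedicated processor.

```php
// Hypothetical helper: the key list is an example, not an exhaustive policy.
function redactContext(
    array $context,
    array $sensitiveKeys = ['password', 'credit_card', 'cvv', 'token']
): array {
    $redacted = [];
    foreach ($context as $key => $value) {
        if (is_string($key) && in_array(strtolower($key), $sensitiveKeys, true)) {
            $redacted[$key] = '[REDACTED]';
        } elseif (is_array($value)) {
            $redacted[$key] = redactContext($value, $sensitiveKeys); // nested context
        } else {
            $redacted[$key] = $value;
        }
    }
    return $redacted;
}

$safe = redactContext([
    'order_id' => 'A-1001',
    'cvv' => '123',
    'card' => ['credit_card' => '4111111111111111', 'last4' => '1111'],
]);
```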

### 5. Monitor Log Health

```php
// Set up health checks
$healthCheckManager->registerHealthCheck(
    new LoggingHealthCheck($logger, $logPath)
);

// Monitor metrics
$metricsCollector->track([
    'log_volume' => $logger->getMetrics()->getTotalLogs(),
    'error_rate' => $logger->getMetrics()->getErrorRate(),
    'disk_usage' => $diskMonitor->getUsage($logPath)
]);
```

## Production Checklist

### Pre-Deployment

- [ ] Log directory created: `/var/log/app`
- [ ] Permissions set: `chown www-data:www-data /var/log/app`
- [ ] Disk space allocated: Minimum 5GB free
- [ ] Rotation configured: logrotate or built-in rotation
- [ ] Environment configured: `.env.production` with correct settings

### Configuration

- [ ] Production config selected (standard/sampling/aggregation)
- [ ] Request ID generator integrated
- [ ] Processors configured appropriately
- [ ] Buffer size tuned for workload
- [ ] Sampling rates validated (if enabled)
- [ ] Aggregation tested (if enabled)

### Monitoring

- [ ] Health check endpoint verified: `/health/detailed`
- [ ] Metrics endpoint verified: `/metrics`
- [ ] Prometheus/monitoring integration tested
- [ ] Alert rules configured
- [ ] Log aggregation tool configured (ELK, Datadog, etc.)

### Testing

- [ ] Log writing tested in production environment
- [ ] Fallback handler tested (simulate primary failure)
- [ ] Log rotation tested (manual trigger)
- [ ] Performance tested under load
- [ ] Disk space monitoring tested

## Support and Troubleshooting

For issues with production logging:

1. Check health endpoint: `curl http://localhost/health/detailed | jq '.checks.logging'`
2. Check metrics: `curl http://localhost/metrics | grep log_`
3. Review fallback logs: `tail -n 100 /var/log/app/fallback.log`
4. Verify disk space: `df -h /var/log/app`
5. Check permissions: `ls -la /var/log/app`

For further assistance, see:
- Framework Documentation: `/docs/claude/guidelines.md`
- Error Handling Guide: `/docs/claude/error-handling.md`
- Performance Monitoring: `/docs/claude/performance-monitoring.md`