michaelschiemer/docs/ml-model-management-deployment.md

# ML Model Management System - Production Deployment Guide

Complete guide for deploying the ML Model Management System to production with all three integrated ML systems.

## Table of Contents

1. [System Overview](#system-overview)
2. [Prerequisites](#prerequisites)
3. [Configuration](#configuration)
4. [Deployment Steps](#deployment-steps)
5. [Monitoring Setup](#monitoring-setup)
6. [Troubleshooting](#troubleshooting)
7. [Maintenance](#maintenance)

## System Overview

The ML Model Management System provides centralized management for three ML systems:

1. **N+1 Detection** (`n1-detector` - v1.0.0)
   - Detects N+1 query patterns in database operations
   - Alert threshold: 85% accuracy
   - Auto-tuning: confidence_threshold (0.5-0.9)

2. **WAF Behavioral Analysis** (`waf-behavioral` - v1.0.0)
   - Detects malicious behavioral patterns in web requests
   - Alert threshold: 90% accuracy (critical)
   - Auto-tuning: anomaly_threshold (0.5-0.9)

3. **Queue Job Anomaly Detection** (`queue-anomaly` - v1.0.0)
   - Detects anomalous job execution patterns
   - Alert threshold: 80% accuracy
   - Auto-tuning: anomaly_threshold (0.4-0.8)

**Core Components**:
- **ModelRegistry**: Cache-based model version and configuration storage
- **ModelPerformanceMonitor**: Real-time performance metrics tracking
- **AutoTuningEngine**: Automatic threshold optimization via grid search
- **AlertingService**: Log-based alerting for performance issues
- **MLMonitoringScheduler**: Periodic monitoring jobs

## Prerequisites

### System Requirements

- PHP 8.3+
- Redis (for cache-based model registry)
- Sufficient memory for model tracking (minimum 256MB PHP memory_limit)

### Framework Components

Ensure these framework components are initialized:

```php
// Required initializers:
- ModelManagementInitializer
- NPlusOneDetectionEngineInitializer
- MLEnhancedWafLayerInitializer
- JobAnomalyDetectionInitializer
- MLMonitoringSchedulerInitializer
```

### Environment Variables

```env
# Model Management
MODEL_REGISTRY_CACHE_TTL=604800        # 7 days
AUTO_TUNING_ENABLED=true

# N+1 Detection
NPLUSONE_ML_ENABLED=true
NPLUSONE_ML_AUTO_REGISTER=true
NPLUSONE_ML_TIMEOUT_MS=5000
NPLUSONE_ML_CONFIDENCE_THRESHOLD=60.0

# WAF Behavioral
WAF_ML_ENABLED=true
WAF_ML_AUTO_REGISTER=true

# Queue Anomaly
QUEUE_ML_ENABLED=true
QUEUE_ML_AUTO_REGISTER=true
JOB_ANOMALY_MIN_CONFIDENCE=0.6
JOB_ANOMALY_THRESHOLD=50
JOB_ANOMALY_ZSCORE_THRESHOLD=3.0
JOB_ANOMALY_IQR_MULTIPLIER=1.5
```

## Configuration

### 1. Model Registry Configuration

The model registry uses cache for storage with a default TTL of 7 days:

```php
// In ModelManagementInitializer
$cacheKey = CacheKey::fromString('ml_model_registry');
$ttl = Duration::fromDays($environment->getInt('MODEL_REGISTRY_CACHE_TTL', 7));
```

**Production Recommendations**:
- Use Redis for cache backend (better persistence than file cache)
- Set TTL to at least 7 days to maintain model history
- Configure Redis persistence (AOF or RDB) for data durability

### 2. Performance Monitor Configuration

Performance monitoring tracks predictions with sliding windows:

```php
// Window configurations
- Short-term: Last 100 predictions
- Medium-term: Last 1000 predictions
- Long-term: All predictions since deployment
```

**Memory Considerations**:
- Each prediction: ~200 bytes
- 1000 predictions: ~200KB
- Recommend periodic cleanup for long-running systems

### 3. Auto-Tuning Configuration

Auto-tuning uses grid search optimization:

```php
// N+1 Detector
thresholdRange: [0.5, 0.9]
step: 0.05
timeWindow: 1 hour
metricToOptimize: 'f1_score'

// WAF Behavioral
thresholdRange: [0.5, 0.9]
step: 0.05
timeWindow: 1 hour
metricToOptimize: 'f1_score'

// Queue Anomaly
thresholdRange: [0.4, 0.8]
step: 0.05
timeWindow: 1 hour
metricToOptimize: 'f1_score'
```

**Tuning Recommendations**:
- Auto-apply only if improvement > 5%
- Run hourly to balance responsiveness and stability
- Monitor threshold changes via logs

### 4. Monitoring Schedule Configuration

Four periodic jobs with different frequencies:

```php
// Performance Monitoring - Every 5 minutes
- Checks current accuracy against thresholds
- Sends alerts if accuracy drops below threshold
- Logs: 'ml-performance-monitoring'

// Degradation Detection - Every 15 minutes
- Compares current vs. baseline performance
- Detects 5%+ degradation
- Logs: 'ml-degradation-detection'

// Auto-Tuning - Every hour
- Optimizes thresholds via grid search
- Auto-applies if improvement > 5%
- Logs: 'ml-auto-tuning'

// Registry Cleanup - Daily
- Cleans up old performance data
- Maintains production model list
- Logs: 'ml-registry-cleanup'
```

## Deployment Steps

### Step 1: Pre-Deployment Verification

```bash
# 1. Verify all ML systems are functional
docker exec php php console.php ml:verify-systems

# 2. Run integration tests
docker exec php ./vendor/bin/pest tests/Integration/MachineLearning/

# 3. Check environment configuration
docker exec php php console.php config:check-ml

# 4. Verify cache backend is accessible
docker exec php php console.php cache:health
```

### Step 2: Deploy Core Components

```bash
# 1. Deploy application code (including ML components)
git pull origin main

# 2. Rebuild dependencies if needed
docker exec php composer install --no-dev --optimize-autoloader

# 3. Clear and warm up caches
docker exec php php console.php cache:clear
docker exec php php console.php cache:warmup
```

### Step 3: Initialize Model Registry

```bash
# Register all three models on first deployment
docker exec php php console.php ml:register-all-models

# Expected output:
# - N+1 Detector (n1-detector v1.0.0) registered
# - WAF Behavioral (waf-behavioral v1.0.0) registered
# - Queue Anomaly (queue-anomaly v1.0.0) registered
```

### Step 4: Start Monitoring Scheduler

The monitoring scheduler is automatically initialized during framework startup.

Verify it's running:

```bash
# Check scheduled tasks
docker exec php php console.php scheduler:status

# Expected tasks:
# - ml-performance-monitoring (next run: +5 min)
# - ml-degradation-detection (next run: +15 min)
# - ml-auto-tuning (next run: +1 hour)
# - ml-registry-cleanup (next run: +1 day)
```

### Step 5: Deploy to Production

```bash
# 1. Deploy all models to production environment
docker exec php php console.php ml:deploy-to-production --all

# 2. Verify production deployment
docker exec php php console.php ml:list-production-models

# Expected output:
# - n1-detector v1.0.0 (deployed: YYYY-MM-DD HH:MM:SS)
# - waf-behavioral v1.0.0 (deployed: YYYY-MM-DD HH:MM:SS)
# - queue-anomaly v1.0.0 (deployed: YYYY-MM-DD HH:MM:SS)
```

### Step 6: Verify Monitoring

```bash
# Wait 5 minutes for first monitoring run, then check logs
docker exec php tail -f storage/logs/ml-monitoring.log

# Expected log entries:
# [INFO] ML monitoring scheduler initialized (jobs_scheduled: 4, models_monitored: 3)
# [INFO] Performance monitoring executed successfully
# [INFO] N+1 detector status: monitored (accuracy: 0.XX)
# [INFO] WAF behavioral status: monitored (accuracy: 0.XX)
# [INFO] Queue anomaly status: monitored (accuracy: 0.XX)
```

## Monitoring Setup

### 1. Log Monitoring

All ML operations are logged to:
- Main log: `storage/logs/application.log`
- ML-specific: `storage/logs/ml-monitoring.log` (if configured)

**Key log patterns to monitor**:

```bash
# Performance warnings
grep "Performance Warning" storage/logs/*.log

# Degradation alerts
grep "Performance Degraded" storage/logs/*.log

# Auto-tuning applications
grep "auto-tuned" storage/logs/*.log

# Monitoring failures
grep "monitoring failed" storage/logs/*.log
```

### 2. Metrics Dashboard

Create a dashboard to track:

**N+1 Detector Metrics**:
```
- Accuracy (target: >85%)
- Total predictions per day
- False positive rate
- Average confidence score
```

**WAF Behavioral Metrics**:
```
- Accuracy (target: >90%)
- Total requests analyzed
- Blocked requests ratio
- Detection latency
```

**Queue Anomaly Metrics**:
```
- Accuracy (target: >80%)
- Total jobs analyzed
- Anomaly detection rate
- Average processing time
```

**System Metrics**:
```
- Model registry cache hit rate
- Auto-tuning improvement rate
- Alert frequency
- Scheduler job success rate
```

### 3. Alerting Rules

Configure alerts for:

```yaml
# Critical Alerts (immediate action)
- WAF accuracy < 90%
- WAF performance degraded > 5%
- Scheduler job failures > 3 consecutive

# Warning Alerts (investigate within 1 hour)
- N+1 accuracy < 85%
- Queue anomaly accuracy < 80%
- Auto-tuning improvements < 1% (indicates stagnation)
- Model registry cache misses > 10%

# Info Alerts (review during business hours)
- Auto-tuning applied successfully
- Model configuration updated
- Registry cleanup completed
```

### 4. Performance Monitoring Commands

```bash
# Get current performance metrics for all models
docker exec php php console.php ml:performance-summary

# Check for degradation
docker exec php php console.php ml:check-degradation

# View auto-tuning history
docker exec php php console.php ml:auto-tuning-history

# Get model configuration
docker exec php php console.php ml:show-config n1-detector
docker exec php php console.php ml:show-config waf-behavioral
docker exec php php console.php ml:show-config queue-anomaly
```

## Troubleshooting

### Issue: Model Not Registered

**Symptoms**:
```
RuntimeException: Model not registered. Call registerCurrentModel() first.
```

**Solution**:
```bash
# Check if model exists
docker exec php php console.php ml:list-models

# Register missing model
docker exec php php console.php ml:register-model n1-detector
docker exec php php console.php ml:register-model waf-behavioral
docker exec php php console.php ml:register-model queue-anomaly

# Or register all at once
docker exec php php console.php ml:register-all-models
```

### Issue: Low Accuracy Alerts

**Symptoms**:
```
[WARNING] N+1 Detector Performance Warning: Accuracy dropped to 78.50%
```

**Investigation Steps**:

1. **Check recent prediction patterns**:
```bash
docker exec php php console.php ml:performance-details n1-detector --recent 100
```

2. **Analyze false positives/negatives**:
```bash
docker exec php php console.php ml:analyze-errors n1-detector
```

3. **Review ground truth data**:
```bash
docker exec php php console.php ml:validate-ground-truth n1-detector
```

4. **Possible causes**:
- Data drift (application behavior changed)
- Incorrect ground truth labels
- Model needs retraining
- Configuration drift (thresholds changed)

**Resolution**:
- If data drift: Consider retraining model
- If ground truth issues: Update labeling process
- If config drift: Revert to known-good configuration
- If temporary: Wait for auto-tuning to adapt

### Issue: Auto-Tuning Not Improving

**Symptoms**:
```
[INFO] Auto-tuning completed (improvement_percent: 0.5%)
```

**Investigation**:

1. **Check optimization history**:
```bash
docker exec php php console.php ml:auto-tuning-history n1-detector --limit 10
```

2. **Verify enough data for optimization**:
```bash
docker exec php php console.php ml:data-sufficiency n1-detector
```

3. **Possible causes**:
- Insufficient prediction data (<100 predictions)
- Already at optimal threshold
- Metric plateaued (reached model capacity)
- Time window too short

**Resolution**:
- Wait for more prediction data
- Consider different optimization metric (precision, recall)
- Adjust threshold range or step size
- Increase time window for optimization

### Issue: Performance Degradation Detected

**Symptoms**:
```
[CRITICAL] WAF Behavioral Performance Degraded (degradation_percent: 8.5%)
```

**Immediate Actions**:

1. **Verify it's not a false alarm**:
```bash
docker exec php php console.php ml:verify-degradation waf-behavioral
```

2. **Check for system changes**:
```bash
# Review recent deployments
git log --oneline --since="1 week ago"

# Check configuration changes
docker exec php php console.php config:diff --since="1 week ago"
```

3. **Rollback if necessary**:
```bash
# Revert to previous model configuration
docker exec php php console.php ml:rollback-config waf-behavioral

# Or revert to previous model version
docker exec php php console.php ml:rollback-version waf-behavioral
```

4. **Monitor recovery**:
```bash
# Watch real-time performance
docker exec php watch -n 60 "php console.php ml:performance-summary"
```

### Issue: Monitoring Jobs Not Running

**Symptoms**:
```
No monitoring logs in the last 5 minutes
```

**Investigation**:

1. **Check scheduler status**:
```bash
docker exec php php console.php scheduler:status
```

2. **Verify scheduler is running**:
```bash
# Check if scheduler daemon is active
docker exec php ps aux | grep scheduler
```

3. **Check for errors**:
```bash
docker exec php grep "scheduler" storage/logs/application.log | tail -20
```

**Resolution**:
```bash
# Restart scheduler daemon
docker exec php php console.php scheduler:restart

# Or restart entire application
docker-compose restart php

# Verify monitoring resumes
docker exec php tail -f storage/logs/ml-monitoring.log
```

### Issue: Cache Miss Rate High

**Symptoms**:
```
Model registry cache miss rate: 45% (expected: <10%)
```

**Possible Causes**:
- Cache TTL too short
- Cache eviction due to memory pressure
- Cache backend not persistent

**Resolution**:

1. **Increase cache TTL**:
```env
# .env
MODEL_REGISTRY_CACHE_TTL=1209600  # 14 days
```

2. **Configure Redis persistence**:
```redis
# redis.conf
appendonly yes
appendfsync everysec
```

3. **Monitor Redis memory**:
```bash
docker exec redis redis-cli INFO memory
```

## Maintenance

### Daily Tasks

1. **Review performance metrics**:
```bash
docker exec php php console.php ml:daily-summary
```

2. **Check alert log**:
```bash
grep -E "(WARNING|CRITICAL)" storage/logs/ml-monitoring.log | tail -20
```

3. **Verify scheduler health**:
```bash
docker exec php php console.php scheduler:health
```

### Weekly Tasks

1. **Review auto-tuning effectiveness**:
```bash
docker exec php php console.php ml:auto-tuning-report --last-week
```

2. **Analyze degradation incidents**:
```bash
docker exec php php console.php ml:degradation-report --last-week
```

3. **Check model configuration drift**:
```bash
docker exec php php console.php ml:config-drift-report
```

### Monthly Tasks

1. **Performance trend analysis**:
```bash
docker exec php php console.php ml:performance-trends --last-month
```

2. **Evaluate model improvement opportunities**:
```bash
docker exec php php console.php ml:improvement-recommendations
```

3. **Review and update thresholds**:
```bash
# Review current thresholds
docker exec php php console.php ml:show-thresholds

# Update if needed
docker exec php php console.php ml:update-threshold n1-detector --accuracy 0.88
```

4. **Clean up old performance data**:
```bash
docker exec php php console.php ml:cleanup-old-data --older-than 90days
```

### Quarterly Tasks

1. **Model retraining evaluation**:
- Review accumulated prediction data
- Evaluate if retraining is beneficial
- Plan retraining if significant drift detected

2. **System capacity planning**:
- Review prediction volume trends
- Assess cache and memory requirements
- Plan for scaling if needed

3. **Alert threshold tuning**:
- Review alert frequency and relevance
- Adjust thresholds based on operational experience
- Update escalation procedures

### Model Version Upgrades

When deploying a new model version:

```bash
# 1. Register new version
docker exec php php console.php ml:register-model n1-detector --version 1.1.0

# 2. Deploy to staging first
docker exec php php console.php ml:deploy n1-detector --version 1.1.0 --env staging

# 3. Monitor staging performance
docker exec php php console.php ml:compare-versions n1-detector 1.0.0 1.1.0

# 4. If successful, deploy to production
docker exec php php console.php ml:deploy n1-detector --version 1.1.0 --env production

# 5. Monitor production rollout
docker exec php watch -n 60 "php console.php ml:deployment-status n1-detector"

# 6. Rollback if issues detected
docker exec php php console.php ml:rollback n1-detector --to-version 1.0.0
```

## Performance Baselines

Expected performance metrics after deployment:

### N+1 Detector
```
Accuracy: 88-92%
Precision: 85-90%
Recall: 85-90%
F1-Score: 86-91%
Total Predictions: 1000+ per day
False Positive Rate: <10%
Detection Latency: <50ms
```

### WAF Behavioral
```
Accuracy: 92-96%
Precision: 90-95%
Recall: 90-95%
F1-Score: 91-95%
Total Predictions: 10,000+ per day
False Positive Rate: <5%
Detection Latency: <100ms
```

### Queue Anomaly
```
Accuracy: 82-88%
Precision: 80-85%
Recall: 80-85%
F1-Score: 81-86%
Total Predictions: 500+ per day
False Positive Rate: <15%
Detection Latency: <30ms
```

### System Performance
```
Model Registry Cache Hit Rate: >90%
Scheduler Job Success Rate: >99%
Auto-Tuning Application Rate: 20-30% (1-2 times per week)
Alert Rate: <5 alerts per day (excluding info alerts)
Memory Usage: <200MB for model tracking
```

## Security Considerations

1. **Model Configuration Access**:
- Restrict access to model configuration updates
- Log all configuration changes with user attribution
- Implement approval workflow for production changes

2. **Performance Data Privacy**:
- Anonymize sensitive data in tracking
- Implement data retention policies
- Ensure GDPR compliance for tracked predictions

3. **Alert Notification Security**:
- Use secure channels for critical alerts
- Implement alert authentication
- Rate-limit alert notifications

## Support and Escalation

### L1 Support (Monitoring Team)
- Monitor dashboard and alerts
- Execute basic troubleshooting commands
- Escalate to L2 for unresolved issues

### L2 Support (ML Team)
- Investigate performance degradation
- Adjust thresholds and configurations
- Plan model retraining
- Escalate to L3 for system issues

### L3 Support (Core Team)
- System architecture issues
- Framework integration problems
- Performance optimization
- Long-term capacity planning

## Appendix

### Example Integration Code

See `/examples/n1-model-management-integration.php` for comprehensive integration examples.

### CLI Commands Reference

Complete list of ML management commands:

```bash
# Model Registry
ml:list-models              # List all registered models
ml:register-model           # Register a specific model
ml:register-all-models      # Register all three models
ml:show-config              # Show model configuration
ml:update-config            # Update model configuration
ml:rollback-config          # Rollback configuration
ml:list-production-models   # List production-deployed models

# Performance Monitoring
ml:performance-summary      # Current performance for all models
ml:performance-details      # Detailed performance for specific model
ml:check-degradation        # Check for performance degradation
ml:performance-trends       # Historical performance trends
ml:daily-summary           # Daily performance summary
ml:compare-versions        # Compare performance between versions

# Auto-Tuning
ml:auto-tuning-history     # View auto-tuning history
ml:auto-tuning-report      # Generate auto-tuning report
ml:show-thresholds         # Show current thresholds
ml:update-threshold        # Manually update threshold

# Deployment
ml:deploy                  # Deploy model to environment
ml:deploy-to-production    # Deploy to production
ml:rollback                # Rollback to previous version
ml:deployment-status       # Check deployment status

# Monitoring
ml:verify-systems          # Verify all ML systems functional
ml:scheduler-status        # Check scheduler status
ml:scheduler-health        # Scheduler health check
ml:analyze-errors          # Analyze prediction errors
ml:validate-ground-truth   # Validate ground truth data

# Maintenance
ml:cleanup-old-data        # Clean up old performance data
ml:data-sufficiency        # Check if enough data for optimization
ml:improvement-recommendations  # Get improvement recommendations
ml:config-drift-report     # Check for configuration drift
```

### Additional Resources

- [ML Model Management Architecture](/docs/ml-model-management-architecture.md)
- [Integration Examples](/examples/n1-model-management-integration.php)
- [Integration Tests](/tests/Integration/MachineLearning/MLModelManagementIntegrationTest.php)
- [Framework Documentation](/docs/claude/)