ML Model Management System - Production Deployment Guide

Complete guide for deploying the ML Model Management System to production with all three integrated ML systems.

Table of Contents

  1. System Overview
  2. Prerequisites
  3. Configuration
  4. Deployment Steps
  5. Monitoring Setup
  6. Troubleshooting
  7. Maintenance

System Overview

The ML Model Management System provides centralized management for three ML systems:

  1. N+1 Detection (n1-detector - v1.0.0)

    • Detects N+1 query patterns in database operations
    • Alert threshold: 85% accuracy
    • Auto-tuning: confidence_threshold (0.5-0.9)
  2. WAF Behavioral Analysis (waf-behavioral - v1.0.0)

    • Detects malicious behavioral patterns in web requests
    • Alert threshold: 90% accuracy (critical)
    • Auto-tuning: anomaly_threshold (0.5-0.9)
  3. Queue Job Anomaly Detection (queue-anomaly - v1.0.0)

    • Detects anomalous job execution patterns
    • Alert threshold: 80% accuracy
    • Auto-tuning: anomaly_threshold (0.4-0.8)

Core Components:

  • ModelRegistry: Cache-based model version and configuration storage
  • ModelPerformanceMonitor: Real-time performance metrics tracking
  • AutoTuningEngine: Automatic threshold optimization via grid search
  • AlertingService: Log-based alerting for performance issues
  • MLMonitoringScheduler: Periodic monitoring jobs
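
The following sketch shows how these components are intended to fit together at registration and prediction time. It is illustrative only: the container lookup and the method names (register(), recordPrediction()) are assumptions, not the framework's actual API.

// Illustrative sketch - the method names below are assumptions, not the real API.
$registry = $container->get(ModelRegistry::class);
$monitor  = $container->get(ModelPerformanceMonitor::class);

// Register a model version together with its tunable configuration.
$registry->register('n1-detector', '1.0.0', [
    'confidence_threshold' => 0.6,
]);

// Record each prediction outcome so the monitor can compute accuracy.
$monitor->recordPrediction(
    model: 'n1-detector',
    predicted: true,   // the detector flagged an N+1 pattern
    actual: true,      // ground truth label
    confidence: 0.82,
);

// The MLMonitoringScheduler jobs read these metrics periodically and route
// threshold violations through the AlertingService.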

Prerequisites

System Requirements

  • PHP 8.3+
  • Redis (for cache-based model registry)
  • Sufficient memory for model tracking (minimum 256MB PHP memory_limit)

Framework Components

Ensure these framework components are initialized:

// Required initializers:
- ModelManagementInitializer
- NPlusOneDetectionEngineInitializer
- MLEnhancedWafLayerInitializer
- JobAnomalyDetectionInitializer
- MLMonitoringSchedulerInitializer

Environment Variables

# Model Management
MODEL_REGISTRY_CACHE_TTL=604800        # 7 days
AUTO_TUNING_ENABLED=true

# N+1 Detection
NPLUSONE_ML_ENABLED=true
NPLUSONE_ML_AUTO_REGISTER=true
NPLUSONE_ML_TIMEOUT_MS=5000
NPLUSONE_ML_CONFIDENCE_THRESHOLD=60.0

# WAF Behavioral
WAF_ML_ENABLED=true
WAF_ML_AUTO_REGISTER=true

# Queue Anomaly
QUEUE_ML_ENABLED=true
QUEUE_ML_AUTO_REGISTER=true
JOB_ANOMALY_MIN_CONFIDENCE=0.6
JOB_ANOMALY_THRESHOLD=50
JOB_ANOMALY_ZSCORE_THRESHOLD=3.0
JOB_ANOMALY_IQR_MULTIPLIER=1.5

Configuration

1. Model Registry Configuration

The model registry stores model versions and configurations in the cache, with a default TTL of 7 days:

// In ModelManagementInitializer
$cacheKey = CacheKey::fromString('ml_model_registry');
$ttl = Duration::fromDays($environment->getInt('MODEL_REGISTRY_CACHE_TTL', 7));

Production Recommendations:

  • Use Redis as the cache backend (better persistence than the file cache)
  • Set TTL to at least 7 days to maintain model history
  • Configure Redis persistence (AOF or RDB) for data durability

2. Performance Monitor Configuration

Performance monitoring tracks predictions with sliding windows:

// Window configurations
- Short-term: Last 100 predictions
- Medium-term: Last 1000 predictions
- Long-term: All predictions since deployment

Memory Considerations:

  • Each prediction: ~200 bytes
  • 1000 predictions: ~200KB
  • Recommend periodic cleanup for long-running systems
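
To make the window and memory figures concrete, here is a minimal, framework-independent sketch of a capped sliding window; the class name and internals are illustrative, not the actual ModelPerformanceMonitor implementation.

// Sliding-window sketch (illustrative only).
final class SlidingWindow
{
    private \SplQueue $predictions;

    public function __construct(private int $capacity = 1000)
    {
        $this->predictions = new \SplQueue();
    }

    public function record(bool $predicted, bool $actual): void
    {
        $this->predictions->enqueue(['predicted' => $predicted, 'actual' => $actual]);
        if ($this->predictions->count() > $this->capacity) {
            $this->predictions->dequeue(); // drop the oldest entry, bounding memory
        }
    }

    public function accuracy(): float
    {
        $total = $this->predictions->count();
        if ($total === 0) {
            return 0.0;
        }
        $correct = 0;
        foreach ($this->predictions as $p) {
            if ($p['predicted'] === $p['actual']) {
                $correct++;
            }
        }
        return $correct / $total;
    }
}

// At ~200 bytes per entry, the 1000-entry medium-term window stays near 200KB.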

3. Auto-Tuning Configuration

Auto-tuning uses grid search optimization:

// N+1 Detector
thresholdRange: [0.5, 0.9]
step: 0.05
timeWindow: 1 hour
metricToOptimize: 'f1_score'

// WAF Behavioral
thresholdRange: [0.5, 0.9]
step: 0.05
timeWindow: 1 hour
metricToOptimize: 'f1_score'

// Queue Anomaly
thresholdRange: [0.4, 0.8]
step: 0.05
timeWindow: 1 hour
metricToOptimize: 'f1_score'

Tuning Recommendations:

  • Auto-apply only if improvement > 5%
  • Run hourly to balance responsiveness and stability
  • Monitor threshold changes via logs
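
The grid search described above amounts to evaluating the chosen metric at every candidate threshold and keeping the best one. The sketch below shows that idea in isolation; the function and callback names are assumptions, not the AutoTuningEngine API.

// Grid-search sketch (illustrative). $f1At must return the F1-score the model
// would have achieved over the time window at a given threshold.
function tuneThreshold(callable $f1At, float $min, float $max, float $step): array
{
    $bestThreshold = $min;
    $bestScore = $f1At($min);

    for ($t = $min + $step; $t <= $max + 1e-9; $t += $step) {
        $score = $f1At($t);
        if ($score > $bestScore) {
            $bestScore = $score;
            $bestThreshold = $t;
        }
    }

    return ['threshold' => $bestThreshold, 'f1_score' => $bestScore];
}

// Example for the N+1 detector: tuneThreshold($f1At, 0.5, 0.9, 0.05).
// The new threshold is applied only if it improves F1 by more than 5%.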

4. Monitoring Schedule Configuration

Four periodic jobs with different frequencies:

// Performance Monitoring - Every 5 minutes
- Checks current accuracy against thresholds
- Sends alerts if accuracy drops below threshold
- Logs: 'ml-performance-monitoring'

// Degradation Detection - Every 15 minutes
- Compares current vs. baseline performance
- Detects 5%+ degradation
- Logs: 'ml-degradation-detection'

// Auto-Tuning - Every hour
- Optimizes thresholds via grid search
- Auto-applies if improvement > 5%
- Logs: 'ml-auto-tuning'

// Registry Cleanup - Daily
- Cleans up old performance data
- Maintains production model list
- Logs: 'ml-registry-cleanup'
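
The degradation check is a relative comparison against the baseline recorded at deployment time. A short sketch of the calculation, with assumed variable names and illustrative values, follows.

// Degradation-check sketch (illustrative names and values).
$baselineAccuracy = 0.94;  // accuracy recorded when the model was deployed
$currentAccuracy  = 0.86;  // accuracy over the recent window

$degradationPercent = ($baselineAccuracy - $currentAccuracy) / $baselineAccuracy * 100;

if ($degradationPercent >= 5.0) {
    // The 15-minute job would raise a degradation alert here,
    // e.g. "Performance Degraded (degradation_percent: 8.5%)".
}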

Deployment Steps

Step 1: Pre-Deployment Verification

# 1. Verify all ML systems are functional
docker exec php php console.php ml:verify-systems

# 2. Run integration tests
docker exec php ./vendor/bin/pest tests/Integration/MachineLearning/

# 3. Check environment configuration
docker exec php php console.php config:check-ml

# 4. Verify cache backend is accessible
docker exec php php console.php cache:health

Step 2: Deploy Core Components

# 1. Deploy application code (including ML components)
git pull origin main

# 2. Rebuild dependencies if needed
docker exec php composer install --no-dev --optimize-autoloader

# 3. Clear and warm up caches
docker exec php php console.php cache:clear
docker exec php php console.php cache:warmup

Step 3: Initialize Model Registry

# Register all three models on first deployment
docker exec php php console.php ml:register-all-models

# Expected output:
# - N+1 Detector (n1-detector v1.0.0) registered
# - WAF Behavioral (waf-behavioral v1.0.0) registered
# - Queue Anomaly (queue-anomaly v1.0.0) registered

Step 4: Start Monitoring Scheduler

The monitoring scheduler is automatically initialized during framework startup.

Verify it's running:

# Check scheduled tasks
docker exec php php console.php scheduler:status

# Expected tasks:
# - ml-performance-monitoring (next run: +5 min)
# - ml-degradation-detection (next run: +15 min)
# - ml-auto-tuning (next run: +1 hour)
# - ml-registry-cleanup (next run: +1 day)

Step 5: Deploy to Production

# 1. Deploy all models to production environment
docker exec php php console.php ml:deploy-to-production --all

# 2. Verify production deployment
docker exec php php console.php ml:list-production-models

# Expected output:
# - n1-detector v1.0.0 (deployed: YYYY-MM-DD HH:MM:SS)
# - waf-behavioral v1.0.0 (deployed: YYYY-MM-DD HH:MM:SS)
# - queue-anomaly v1.0.0 (deployed: YYYY-MM-DD HH:MM:SS)

Step 6: Verify Monitoring

# Wait 5 minutes for first monitoring run, then check logs
docker exec php tail -f storage/logs/ml-monitoring.log

# Expected log entries:
# [INFO] ML monitoring scheduler initialized (jobs_scheduled: 4, models_monitored: 3)
# [INFO] Performance monitoring executed successfully
# [INFO] N+1 detector status: monitored (accuracy: 0.XX)
# [INFO] WAF behavioral status: monitored (accuracy: 0.XX)
# [INFO] Queue anomaly status: monitored (accuracy: 0.XX)

Monitoring Setup

1. Log Monitoring

All ML operations are logged to:

  • Main log: storage/logs/application.log
  • ML-specific: storage/logs/ml-monitoring.log (if configured)

Key log patterns to monitor:

# Performance warnings
grep "Performance Warning" storage/logs/*.log

# Degradation alerts
grep "Performance Degraded" storage/logs/*.log

# Auto-tuning applications
grep "auto-tuned" storage/logs/*.log

# Monitoring failures
grep "monitoring failed" storage/logs/*.log

2. Metrics Dashboard

Create a dashboard to track:

N+1 Detector Metrics:

- Accuracy (target: >85%)
- Total predictions per day
- False positive rate
- Average confidence score

WAF Behavioral Metrics:

- Accuracy (target: >90%)
- Total requests analyzed
- Blocked requests ratio
- Detection latency

Queue Anomaly Metrics:

- Accuracy (target: >80%)
- Total jobs analyzed
- Anomaly detection rate
- Average processing time

System Metrics:

- Model registry cache hit rate
- Auto-tuning improvement rate
- Alert frequency
- Scheduler job success rate
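
If the dashboard is backed by a Prometheus-style scrape endpoint (an assumption; any metrics backend works), the exposition could look roughly like this placeholder sketch. Metric names and values are illustrative.

// Hypothetical Prometheus text-format exposition (placeholder names and values).
header('Content-Type: text/plain; version=0.0.4');

$accuracy = [
    'n1-detector'    => 0.91,
    'waf-behavioral' => 0.94,
    'queue-anomaly'  => 0.85,
];

echo "# HELP ml_model_accuracy Current model accuracy (0-1).\n";
echo "# TYPE ml_model_accuracy gauge\n";
foreach ($accuracy as $model => $value) {
    printf("ml_model_accuracy{model=\"%s\"} %.4f\n", $model, $value);
}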

3. Alerting Rules

Configure alerts for:

# Critical Alerts (immediate action)
- WAF accuracy < 90%
- WAF performance degraded > 5%
- Scheduler job failures > 3 consecutive

# Warning Alerts (investigate within 1 hour)
- N+1 accuracy < 85%
- Queue anomaly accuracy < 80%
- Auto-tuning improvements < 1% (indicates stagnation)
- Model registry cache misses > 10%

# Info Alerts (review during business hours)
- Auto-tuning applied successfully
- Model configuration updated
- Registry cleanup completed
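
As a reference for implementing the accuracy rules above, a minimal severity mapping could look like the following sketch; the function name is an assumption and the thresholds are taken from the list above.

// Severity-mapping sketch using the alert thresholds listed above.
function accuracyAlertLevel(string $model, float $accuracy): ?string
{
    return match (true) {
        $model === 'waf-behavioral' && $accuracy < 0.90 => 'critical',
        $model === 'n1-detector' && $accuracy < 0.85    => 'warning',
        $model === 'queue-anomaly' && $accuracy < 0.80  => 'warning',
        default                                         => null, // no alert
    };
}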

4. Performance Monitoring Commands

# Get current performance metrics for all models
docker exec php php console.php ml:performance-summary

# Check for degradation
docker exec php php console.php ml:check-degradation

# View auto-tuning history
docker exec php php console.php ml:auto-tuning-history

# Get model configuration
docker exec php php console.php ml:show-config n1-detector
docker exec php php console.php ml:show-config waf-behavioral
docker exec php php console.php ml:show-config queue-anomaly

Troubleshooting

Issue: Model Not Registered

Symptoms:

RuntimeException: Model not registered. Call registerCurrentModel() first.

Solution:

# Check if model exists
docker exec php php console.php ml:list-models

# Register missing model
docker exec php php console.php ml:register-model n1-detector
docker exec php php console.php ml:register-model waf-behavioral
docker exec php php console.php ml:register-model queue-anomaly

# Or register all at once
docker exec php php console.php ml:register-all-models
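
If registration from application code is preferred, the exception message indicates the detection engines expose a registerCurrentModel() method. A hedged sketch follows; how the engine instance is obtained here is an assumption.

// Sketch only: the container lookup and class name are assumptions;
// registerCurrentModel() is the method named in the exception message.
$detector = $container->get(NPlusOneDetectionEngine::class);
$detector->registerCurrentModel();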

Issue: Low Accuracy Alerts

Symptoms:

[WARNING] N+1 Detector Performance Warning: Accuracy dropped to 78.50%

Investigation Steps:

  1. Check recent prediction patterns:
docker exec php php console.php ml:performance-details n1-detector --recent 100
  2. Analyze false positives/negatives:
docker exec php php console.php ml:analyze-errors n1-detector
  3. Review ground truth data:
docker exec php php console.php ml:validate-ground-truth n1-detector
  4. Possible causes:
  • Data drift (application behavior changed)
  • Incorrect ground truth labels
  • Model needs retraining
  • Configuration drift (thresholds changed)

Resolution:

  • If data drift: Consider retraining model
  • If ground truth issues: Update labeling process
  • If config drift: Revert to known-good configuration
  • If temporary: Wait for auto-tuning to adapt

Issue: Auto-Tuning Not Improving

Symptoms:

[INFO] Auto-tuning completed (improvement_percent: 0.5%)

Investigation:

  1. Check optimization history:
docker exec php php console.php ml:auto-tuning-history n1-detector --limit 10
  2. Verify enough data for optimization:
docker exec php php console.php ml:data-sufficiency n1-detector
  3. Possible causes:
  • Insufficient prediction data (<100 predictions)
  • Already at optimal threshold
  • Metric plateaued (reached model capacity)
  • Time window too short

Resolution:

  • Wait for more prediction data
  • Consider different optimization metric (precision, recall)
  • Adjust threshold range or step size
  • Increase time window for optimization

Issue: Performance Degradation Detected

Symptoms:

[CRITICAL] WAF Behavioral Performance Degraded (degradation_percent: 8.5%)

Immediate Actions:

  1. Verify it's not a false alarm:
docker exec php php console.php ml:verify-degradation waf-behavioral
  2. Check for system changes:
# Review recent deployments
git log --oneline --since="1 week ago"

# Check configuration changes
docker exec php php console.php config:diff --since="1 week ago"
  3. Rollback if necessary:
# Revert to previous model configuration
docker exec php php console.php ml:rollback-config waf-behavioral

# Or revert to previous model version
docker exec php php console.php ml:rollback-version waf-behavioral
  4. Monitor recovery:
# Watch real-time performance
docker exec php watch -n 60 "php console.php ml:performance-summary"

Issue: Monitoring Jobs Not Running

Symptoms:

No monitoring logs in the last 5 minutes

Investigation:

  1. Check scheduler status:
docker exec php php console.php scheduler:status
  2. Verify scheduler is running:
# Check if scheduler daemon is active
docker exec php ps aux | grep scheduler
  3. Check for errors:
docker exec php grep "scheduler" storage/logs/application.log | tail -20

Resolution:

# Restart scheduler daemon
docker exec php php console.php scheduler:restart

# Or restart entire application
docker-compose restart php

# Verify monitoring resumes
docker exec php tail -f storage/logs/ml-monitoring.log

Issue: Cache Miss Rate High

Symptoms:

Model registry cache miss rate: 45% (expected: <10%)

Possible Causes:

  • Cache TTL too short
  • Cache eviction due to memory pressure
  • Cache backend not persistent

Resolution:

  1. Increase cache TTL:
# .env
MODEL_REGISTRY_CACHE_TTL=1209600  # 14 days
  2. Configure Redis persistence:
# redis.conf
appendonly yes
appendfsync everysec
  3. Monitor Redis memory:
docker exec redis redis-cli INFO memory

Maintenance

Daily Tasks

  1. Review performance metrics:
docker exec php php console.php ml:daily-summary
  2. Check alert log:
grep -E "(WARNING|CRITICAL)" storage/logs/ml-monitoring.log | tail -20
  3. Verify scheduler health:
docker exec php php console.php scheduler:health

Weekly Tasks

  1. Review auto-tuning effectiveness:
docker exec php php console.php ml:auto-tuning-report --last-week
  2. Analyze degradation incidents:
docker exec php php console.php ml:degradation-report --last-week
  3. Check model configuration drift:
docker exec php php console.php ml:config-drift-report

Monthly Tasks

  1. Performance trend analysis:
docker exec php php console.php ml:performance-trends --last-month
  2. Evaluate model improvement opportunities:
docker exec php php console.php ml:improvement-recommendations
  3. Review and update thresholds:
# Review current thresholds
docker exec php php console.php ml:show-thresholds

# Update if needed
docker exec php php console.php ml:update-threshold n1-detector --accuracy 0.88
  4. Clean up old performance data:
docker exec php php console.php ml:cleanup-old-data --older-than 90days

Quarterly Tasks

  1. Model retraining evaluation:
  • Review accumulated prediction data
  • Evaluate if retraining is beneficial
  • Plan retraining if significant drift detected
  2. System capacity planning:
  • Review prediction volume trends
  • Assess cache and memory requirements
  • Plan for scaling if needed
  3. Alert threshold tuning:
  • Review alert frequency and relevance
  • Adjust thresholds based on operational experience
  • Update escalation procedures

Model Version Upgrades

When deploying a new model version:

# 1. Register new version
docker exec php php console.php ml:register-model n1-detector --version 1.1.0

# 2. Deploy to staging first
docker exec php php console.php ml:deploy n1-detector --version 1.1.0 --env staging

# 3. Monitor staging performance
docker exec php php console.php ml:compare-versions n1-detector 1.0.0 1.1.0

# 4. If successful, deploy to production
docker exec php php console.php ml:deploy n1-detector --version 1.1.0 --env production

# 5. Monitor production rollout
docker exec php watch -n 60 "php console.php ml:deployment-status n1-detector"

# 6. Rollback if issues detected
docker exec php php console.php ml:rollback n1-detector --to-version 1.0.0

Performance Baselines

Expected performance metrics after deployment:

N+1 Detector

Accuracy: 88-92%
Precision: 85-90%
Recall: 85-90%
F1-Score: 86-91%
Total Predictions: 1000+ per day
False Positive Rate: <10%
Detection Latency: <50ms

WAF Behavioral

Accuracy: 92-96%
Precision: 90-95%
Recall: 90-95%
F1-Score: 91-95%
Total Predictions: 10,000+ per day
False Positive Rate: <5%
Detection Latency: <100ms

Queue Anomaly

Accuracy: 82-88%
Precision: 80-85%
Recall: 80-85%
F1-Score: 81-86%
Total Predictions: 500+ per day
False Positive Rate: <15%
Detection Latency: <30ms

System Performance

Model Registry Cache Hit Rate: >90%
Scheduler Job Success Rate: >99%
Auto-Tuning Application Rate: 20-30% (1-2 times per week)
Alert Rate: <5 alerts per day (excluding info alerts)
Memory Usage: <200MB for model tracking

Security Considerations

  1. Model Configuration Access:
  • Restrict access to model configuration updates
  • Log all configuration changes with user attribution
  • Implement approval workflow for production changes
  2. Performance Data Privacy:
  • Anonymize sensitive data in tracking
  • Implement data retention policies
  • Ensure GDPR compliance for tracked predictions
  3. Alert Notification Security:
  • Use secure channels for critical alerts
  • Implement alert authentication
  • Rate-limit alert notifications

Support and Escalation

L1 Support (Monitoring Team)

  • Monitor dashboard and alerts
  • Execute basic troubleshooting commands
  • Escalate to L2 for unresolved issues

L2 Support (ML Team)

  • Investigate performance degradation
  • Adjust thresholds and configurations
  • Plan model retraining
  • Escalate to L3 for system issues

L3 Support (Core Team)

  • System architecture issues
  • Framework integration problems
  • Performance optimization
  • Long-term capacity planning

Appendix

Example Integration Code

See /examples/n1-model-management-integration.php for comprehensive integration examples.

CLI Commands Reference

Complete list of ML management commands:

# Model Registry
ml:list-models              # List all registered models
ml:register-model           # Register a specific model
ml:register-all-models      # Register all three models
ml:show-config              # Show model configuration
ml:update-config            # Update model configuration
ml:rollback-config          # Rollback configuration
ml:list-production-models   # List production-deployed models

# Performance Monitoring
ml:performance-summary      # Current performance for all models
ml:performance-details      # Detailed performance for specific model
ml:check-degradation        # Check for performance degradation
ml:performance-trends       # Historical performance trends
ml:daily-summary           # Daily performance summary
ml:compare-versions        # Compare performance between versions

# Auto-Tuning
ml:auto-tuning-history     # View auto-tuning history
ml:auto-tuning-report      # Generate auto-tuning report
ml:show-thresholds         # Show current thresholds
ml:update-threshold        # Manually update threshold

# Deployment
ml:deploy                  # Deploy model to environment
ml:deploy-to-production    # Deploy to production
ml:rollback                # Rollback to previous version
ml:deployment-status       # Check deployment status

# Monitoring
ml:verify-systems          # Verify all ML systems functional
ml:scheduler-status        # Check scheduler status
ml:scheduler-health        # Scheduler health check
ml:analyze-errors          # Analyze prediction errors
ml:validate-ground-truth   # Validate ground truth data

# Maintenance
ml:cleanup-old-data        # Clean up old performance data
ml:data-sufficiency        # Check if enough data for optimization
ml:improvement-recommendations  # Get improvement recommendations
ml:config-drift-report     # Check for configuration drift

Additional Resources