# ML Model Management System - Production Deployment Guide

Complete guide for deploying the ML Model Management System to production with all three integrated ML systems.

## Table of Contents

1. [System Overview](#system-overview)
2. [Prerequisites](#prerequisites)
3. [Configuration](#configuration)
4. [Deployment Steps](#deployment-steps)
5. [Monitoring Setup](#monitoring-setup)
6. [Troubleshooting](#troubleshooting)
7. [Maintenance](#maintenance)

## System Overview

The ML Model Management System provides centralized management for three ML systems:

1. **N+1 Detection** (`n1-detector` - v1.0.0)
   - Detects N+1 query patterns in database operations
   - Alert threshold: 85% accuracy
   - Auto-tuning: confidence_threshold (0.5-0.9)

2. **WAF Behavioral Analysis** (`waf-behavioral` - v1.0.0)
   - Detects malicious behavioral patterns in web requests
   - Alert threshold: 90% accuracy (critical)
   - Auto-tuning: anomaly_threshold (0.5-0.9)

3. **Queue Job Anomaly Detection** (`queue-anomaly` - v1.0.0)
   - Detects anomalous job execution patterns
   - Alert threshold: 80% accuracy
   - Auto-tuning: anomaly_threshold (0.4-0.8)

**Core Components**:

- **ModelRegistry**: Cache-based model version and configuration storage
- **ModelPerformanceMonitor**: Real-time performance metrics tracking
- **AutoTuningEngine**: Automatic threshold optimization via grid search
- **AlertingService**: Log-based alerting for performance issues
- **MLMonitoringScheduler**: Periodic monitoring jobs

## Prerequisites

### System Requirements

- PHP 8.3+
- Redis (for the cache-based model registry)
- Sufficient memory for model tracking (minimum 256MB PHP `memory_limit`)

### Framework Components

Ensure these framework components are initialized:

```php
// Required initializers:
// - ModelManagementInitializer
// - NPlusOneDetectionEngineInitializer
// - MLEnhancedWafLayerInitializer
// - JobAnomalyDetectionInitializer
// - MLMonitoringSchedulerInitializer
```

### Environment Variables

```env
# Model Management
MODEL_REGISTRY_CACHE_TTL=604800   # 7 days, in seconds
AUTO_TUNING_ENABLED=true

# N+1 Detection
NPLUSONE_ML_ENABLED=true
NPLUSONE_ML_AUTO_REGISTER=true
NPLUSONE_ML_TIMEOUT_MS=5000
NPLUSONE_ML_CONFIDENCE_THRESHOLD=60.0

# WAF Behavioral
WAF_ML_ENABLED=true
WAF_ML_AUTO_REGISTER=true

# Queue Anomaly
QUEUE_ML_ENABLED=true
QUEUE_ML_AUTO_REGISTER=true
JOB_ANOMALY_MIN_CONFIDENCE=0.6
JOB_ANOMALY_THRESHOLD=50
JOB_ANOMALY_ZSCORE_THRESHOLD=3.0
JOB_ANOMALY_IQR_MULTIPLIER=1.5
```

## Configuration

### 1. Model Registry Configuration

The model registry uses the cache for storage with a default TTL of 7 days. Note that `MODEL_REGISTRY_CACHE_TTL` is expressed in seconds:

```php
// In ModelManagementInitializer
$cacheKey = CacheKey::fromString('ml_model_registry');
$ttl = Duration::fromSeconds($environment->getInt('MODEL_REGISTRY_CACHE_TTL', 604800));
```

**Production Recommendations**:

- Use Redis as the cache backend (better persistence than a file cache)
- Set the TTL to at least 7 days to maintain model history
- Configure Redis persistence (AOF or RDB) for data durability

### 2. Performance Monitor Configuration

Performance monitoring tracks predictions with sliding windows:

```php
// Window configurations
// - Short-term: last 100 predictions
// - Medium-term: last 1000 predictions
// - Long-term: all predictions since deployment
```

**Memory Considerations**:

- Each prediction: ~200 bytes
- 1000 predictions: ~200KB
- Periodic cleanup is recommended for long-running systems

### 3. Auto-Tuning Configuration

Auto-tuning uses grid search optimization:

```php
// N+1 Detector
thresholdRange: [0.5, 0.9]
step: 0.05
timeWindow: 1 hour
metricToOptimize: 'f1_score'

// WAF Behavioral
thresholdRange: [0.5, 0.9]
step: 0.05
timeWindow: 1 hour
metricToOptimize: 'f1_score'

// Queue Anomaly
thresholdRange: [0.4, 0.8]
step: 0.05
timeWindow: 1 hour
metricToOptimize: 'f1_score'
```

**Tuning Recommendations**:

- Auto-apply only if improvement > 5%
- Run hourly to balance responsiveness and stability
- Monitor threshold changes via logs
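The grid search described above can be sketched in a few lines. This is a minimal, framework-independent illustration (the names `grid_search_threshold` and `samples` are hypothetical, not the actual `AutoTuningEngine` API): sweep candidate thresholds across the configured range, score each against labeled predictions from the time window, and keep the F1-maximizing value.

```python
# Illustrative grid search over detection thresholds, optimizing F1.
# Hypothetical names; not the framework's AutoTuningEngine interface.

def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def grid_search_threshold(samples, lo, hi, step):
    """samples: (confidence_score, ground_truth_label) pairs from the time window.
    Returns the threshold in [lo, hi] that maximizes F1."""
    best_t, best_f1 = lo, -1.0
    n_steps = int(round((hi - lo) / step))
    for i in range(n_steps + 1):
        t = round(lo + i * step, 2)  # avoid float-accumulation drift
        tp = sum(1 for s, y in samples if s >= t and y)
        fp = sum(1 for s, y in samples if s >= t and not y)
        fn = sum(1 for s, y in samples if s < t and y)
        f1 = f1_score(tp, fp, fn)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

Per the tuning recommendations, the resulting threshold would only be auto-applied when its F1 improves on the current configuration by more than 5%.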
### 4. Monitoring Schedule Configuration

Four periodic jobs run at different frequencies:

```php
// Performance Monitoring - every 5 minutes
// - Checks current accuracy against thresholds
// - Sends alerts if accuracy drops below threshold
// - Logs: 'ml-performance-monitoring'

// Degradation Detection - every 15 minutes
// - Compares current vs. baseline performance
// - Detects 5%+ degradation
// - Logs: 'ml-degradation-detection'

// Auto-Tuning - every hour
// - Optimizes thresholds via grid search
// - Auto-applies if improvement > 5%
// - Logs: 'ml-auto-tuning'

// Registry Cleanup - daily
// - Cleans up old performance data
// - Maintains production model list
// - Logs: 'ml-registry-cleanup'
```

## Deployment Steps

### Step 1: Pre-Deployment Verification

```bash
# 1. Verify all ML systems are functional
docker exec php php console.php ml:verify-systems

# 2. Run integration tests
docker exec php ./vendor/bin/pest tests/Integration/MachineLearning/

# 3. Check environment configuration
docker exec php php console.php config:check-ml

# 4. Verify cache backend is accessible
docker exec php php console.php cache:health
```

### Step 2: Deploy Core Components

```bash
# 1. Deploy application code (including ML components)
git pull origin main

# 2. Rebuild dependencies if needed
docker exec php composer install --no-dev --optimize-autoloader

# 3. Clear and warm up caches
docker exec php php console.php cache:clear
docker exec php php console.php cache:warmup
```

### Step 3: Initialize Model Registry

```bash
# Register all three models on first deployment
docker exec php php console.php ml:register-all-models

# Expected output:
# - N+1 Detector (n1-detector v1.0.0) registered
# - WAF Behavioral (waf-behavioral v1.0.0) registered
# - Queue Anomaly (queue-anomaly v1.0.0) registered
```

### Step 4: Start Monitoring Scheduler

The monitoring scheduler is initialized automatically during framework startup. Verify it's running:

```bash
# Check scheduled tasks
docker exec php php console.php scheduler:status

# Expected tasks:
# - ml-performance-monitoring (next run: +5 min)
# - ml-degradation-detection (next run: +15 min)
# - ml-auto-tuning (next run: +1 hour)
# - ml-registry-cleanup (next run: +1 day)
```

### Step 5: Deploy to Production

```bash
# 1. Deploy all models to the production environment
docker exec php php console.php ml:deploy-to-production --all

# 2. Verify production deployment
docker exec php php console.php ml:list-production-models

# Expected output:
# - n1-detector v1.0.0 (deployed: YYYY-MM-DD HH:MM:SS)
# - waf-behavioral v1.0.0 (deployed: YYYY-MM-DD HH:MM:SS)
# - queue-anomaly v1.0.0 (deployed: YYYY-MM-DD HH:MM:SS)
```

### Step 6: Verify Monitoring

```bash
# Wait 5 minutes for the first monitoring run, then check logs
docker exec php tail -f storage/logs/ml-monitoring.log

# Expected log entries:
# [INFO] ML monitoring scheduler initialized (jobs_scheduled: 4, models_monitored: 3)
# [INFO] Performance monitoring executed successfully
# [INFO] N+1 detector status: monitored (accuracy: 0.XX)
# [INFO] WAF behavioral status: monitored (accuracy: 0.XX)
# [INFO] Queue anomaly status: monitored (accuracy: 0.XX)
```

## Monitoring Setup

### 1. Log Monitoring

All ML operations are logged to:

- Main log: `storage/logs/application.log`
- ML-specific: `storage/logs/ml-monitoring.log` (if configured)

**Key log patterns to monitor**:

```bash
# Performance warnings
grep "Performance Warning" storage/logs/*.log

# Degradation alerts
grep "Performance Degraded" storage/logs/*.log

# Auto-tuning applications
grep "auto-tuned" storage/logs/*.log

# Monitoring failures
grep "monitoring failed" storage/logs/*.log
```
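The "Performance Degraded" entries grepped above come from the degradation-detection job's baseline comparison: the current accuracy window is compared against a stored baseline, and a relative drop of 5% or more raises an alert. A minimal sketch of that check, with illustrative function names (not the framework's API):

```python
# Baseline-vs-current degradation check, as described for the
# degradation-detection job. Names are illustrative only.

def degradation_percent(baseline: float, current: float) -> float:
    """Relative accuracy drop vs. baseline, in percent (0 if improved)."""
    if baseline <= 0:
        return 0.0
    return max(0.0, (baseline - current) / baseline * 100.0)

def is_degraded(baseline: float, current: float, threshold: float = 5.0) -> bool:
    """True when the drop reaches the alerting threshold (default 5%)."""
    return degradation_percent(baseline, current) >= threshold
```

For example, a WAF baseline of 0.94 against a current window of 0.86 is a roughly 8.5% relative drop, the shape of alert shown in the Troubleshooting section below.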
### 2. Metrics Dashboard

Create a dashboard to track:

**N+1 Detector Metrics**:

- Accuracy (target: >85%)
- Total predictions per day
- False positive rate
- Average confidence score

**WAF Behavioral Metrics**:

- Accuracy (target: >90%)
- Total requests analyzed
- Blocked requests ratio
- Detection latency

**Queue Anomaly Metrics**:

- Accuracy (target: >80%)
- Total jobs analyzed
- Anomaly detection rate
- Average processing time

**System Metrics**:

- Model registry cache hit rate
- Auto-tuning improvement rate
- Alert frequency
- Scheduler job success rate

### 3. Alerting Rules

Configure alerts for:

```yaml
# Critical Alerts (immediate action)
- WAF accuracy < 90%
- WAF performance degraded > 5%
- Scheduler job failures > 3 consecutive

# Warning Alerts (investigate within 1 hour)
- N+1 accuracy < 85%
- Queue anomaly accuracy < 80%
- Auto-tuning improvements < 1% (indicates stagnation)
- Model registry cache misses > 10%

# Info Alerts (review during business hours)
- Auto-tuning applied successfully
- Model configuration updated
- Registry cleanup completed
```

### 4. Performance Monitoring Commands

```bash
# Get current performance metrics for all models
docker exec php php console.php ml:performance-summary

# Check for degradation
docker exec php php console.php ml:check-degradation

# View auto-tuning history
docker exec php php console.php ml:auto-tuning-history

# Get model configuration
docker exec php php console.php ml:show-config n1-detector
docker exec php php console.php ml:show-config waf-behavioral
docker exec php php console.php ml:show-config queue-anomaly
```

## Troubleshooting

### Issue: Model Not Registered

**Symptoms**:

```
RuntimeException: Model not registered. Call registerCurrentModel() first.
```

**Solution**:

```bash
# Check if model exists
docker exec php php console.php ml:list-models

# Register missing model
docker exec php php console.php ml:register-model n1-detector
docker exec php php console.php ml:register-model waf-behavioral
docker exec php php console.php ml:register-model queue-anomaly

# Or register all at once
docker exec php php console.php ml:register-all-models
```

### Issue: Low Accuracy Alerts

**Symptoms**:

```
[WARNING] N+1 Detector Performance Warning: Accuracy dropped to 78.50%
```

**Investigation Steps**:

1. **Check recent prediction patterns**:

   ```bash
   docker exec php php console.php ml:performance-details n1-detector --recent 100
   ```

2. **Analyze false positives/negatives**:

   ```bash
   docker exec php php console.php ml:analyze-errors n1-detector
   ```

3. **Review ground truth data**:

   ```bash
   docker exec php php console.php ml:validate-ground-truth n1-detector
   ```

4. **Possible causes**:
   - Data drift (application behavior changed)
   - Incorrect ground truth labels
   - Model needs retraining
   - Configuration drift (thresholds changed)

**Resolution**:

- If data drift: consider retraining the model
- If ground truth issues: update the labeling process
- If config drift: revert to a known-good configuration
- If temporary: wait for auto-tuning to adapt

### Issue: Auto-Tuning Not Improving

**Symptoms**:

```
[INFO] Auto-tuning completed (improvement_percent: 0.5%)
```

**Investigation**:

1. **Check optimization history**:

   ```bash
   docker exec php php console.php ml:auto-tuning-history n1-detector --limit 10
   ```

2. **Verify enough data for optimization**:

   ```bash
   docker exec php php console.php ml:data-sufficiency n1-detector
   ```
3. **Possible causes**:
   - Insufficient prediction data (<100 predictions)
   - Already at the optimal threshold
   - Metric plateaued (reached model capacity)
   - Time window too short

**Resolution**:

- Wait for more prediction data
- Consider a different optimization metric (precision, recall)
- Adjust the threshold range or step size
- Increase the time window for optimization

### Issue: Performance Degradation Detected

**Symptoms**:

```
[CRITICAL] WAF Behavioral Performance Degraded (degradation_percent: 8.5%)
```

**Immediate Actions**:

1. **Verify it's not a false alarm**:

   ```bash
   docker exec php php console.php ml:verify-degradation waf-behavioral
   ```

2. **Check for system changes**:

   ```bash
   # Review recent deployments
   git log --oneline --since="1 week ago"

   # Check configuration changes
   docker exec php php console.php config:diff --since="1 week ago"
   ```

3. **Rollback if necessary**:

   ```bash
   # Revert to the previous model configuration
   docker exec php php console.php ml:rollback-config waf-behavioral

   # Or revert to the previous model version
   docker exec php php console.php ml:rollback-version waf-behavioral
   ```

4. **Monitor recovery**:

   ```bash
   # Watch real-time performance
   docker exec php watch -n 60 "php console.php ml:performance-summary"
   ```

### Issue: Monitoring Jobs Not Running

**Symptoms**:

```
No monitoring logs in the last 5 minutes
```

**Investigation**:

1. **Check scheduler status**:

   ```bash
   docker exec php php console.php scheduler:status
   ```

2. **Verify the scheduler is running**:

   ```bash
   # Check if the scheduler daemon is active
   docker exec php ps aux | grep scheduler
   ```

3. **Check for errors**:

   ```bash
   docker exec php grep "scheduler" storage/logs/application.log | tail -20
   ```

**Resolution**:

```bash
# Restart the scheduler daemon
docker exec php php console.php scheduler:restart

# Or restart the entire application
docker-compose restart php

# Verify monitoring resumes
docker exec php tail -f storage/logs/ml-monitoring.log
```

### Issue: Cache Miss Rate High

**Symptoms**:

```
Model registry cache miss rate: 45% (expected: <10%)
```

**Possible Causes**:

- Cache TTL too short
- Cache eviction due to memory pressure
- Cache backend not persistent

**Resolution**:

1. **Increase the cache TTL**:

   ```env
   # .env
   MODEL_REGISTRY_CACHE_TTL=1209600   # 14 days
   ```

2. **Configure Redis persistence**:

   ```redis
   # redis.conf
   appendonly yes
   appendfsync everysec
   ```

3. **Monitor Redis memory**:

   ```bash
   docker exec redis redis-cli INFO memory
   ```

## Maintenance

### Daily Tasks

1. **Review performance metrics**:

   ```bash
   docker exec php php console.php ml:daily-summary
   ```

2. **Check the alert log**:

   ```bash
   grep -E "(WARNING|CRITICAL)" storage/logs/ml-monitoring.log | tail -20
   ```

3. **Verify scheduler health**:

   ```bash
   docker exec php php console.php scheduler:health
   ```

### Weekly Tasks

1. **Review auto-tuning effectiveness**:

   ```bash
   docker exec php php console.php ml:auto-tuning-report --last-week
   ```

2. **Analyze degradation incidents**:

   ```bash
   docker exec php php console.php ml:degradation-report --last-week
   ```

3. **Check model configuration drift**:

   ```bash
   docker exec php php console.php ml:config-drift-report
   ```

### Monthly Tasks

1. **Performance trend analysis**:

   ```bash
   docker exec php php console.php ml:performance-trends --last-month
   ```

2. **Evaluate model improvement opportunities**:

   ```bash
   docker exec php php console.php ml:improvement-recommendations
   ```
3. **Review and update thresholds**:

   ```bash
   # Review current thresholds
   docker exec php php console.php ml:show-thresholds

   # Update if needed
   docker exec php php console.php ml:update-threshold n1-detector --accuracy 0.88
   ```

4. **Clean up old performance data**:

   ```bash
   docker exec php php console.php ml:cleanup-old-data --older-than 90days
   ```

### Quarterly Tasks

1. **Model retraining evaluation**:
   - Review accumulated prediction data
   - Evaluate whether retraining is beneficial
   - Plan retraining if significant drift is detected

2. **System capacity planning**:
   - Review prediction volume trends
   - Assess cache and memory requirements
   - Plan for scaling if needed

3. **Alert threshold tuning**:
   - Review alert frequency and relevance
   - Adjust thresholds based on operational experience
   - Update escalation procedures

### Model Version Upgrades

When deploying a new model version:

```bash
# 1. Register the new version
docker exec php php console.php ml:register-model n1-detector --version 1.1.0

# 2. Deploy to staging first
docker exec php php console.php ml:deploy n1-detector --version 1.1.0 --env staging

# 3. Monitor staging performance
docker exec php php console.php ml:compare-versions n1-detector 1.0.0 1.1.0

# 4. If successful, deploy to production
docker exec php php console.php ml:deploy n1-detector --version 1.1.0 --env production

# 5. Monitor the production rollout
docker exec php watch -n 60 "php console.php ml:deployment-status n1-detector"

# 6. Rollback if issues are detected
docker exec php php console.php ml:rollback n1-detector --to-version 1.0.0
```

## Performance Baselines

Expected performance metrics after deployment:

### N+1 Detector

```
Accuracy:            88-92%
Precision:           85-90%
Recall:              85-90%
F1-Score:            86-91%
Total Predictions:   1000+ per day
False Positive Rate: <10%
Detection Latency:   <50ms
```

### WAF Behavioral

```
Accuracy:            92-96%
Precision:           90-95%
Recall:              90-95%
F1-Score:            91-95%
Total Predictions:   10,000+ per day
False Positive Rate: <5%
Detection Latency:   <100ms
```

### Queue Anomaly

```
Accuracy:            82-88%
Precision:           80-85%
Recall:              80-85%
F1-Score:            81-86%
Total Predictions:   500+ per day
False Positive Rate: <15%
Detection Latency:   <30ms
```

### System Performance

```
Model Registry Cache Hit Rate: >90%
Scheduler Job Success Rate:    >99%
Auto-Tuning Application Rate:  20-30% (1-2 times per week)
Alert Rate:                    <5 alerts per day (excluding info alerts)
Memory Usage:                  <200MB for model tracking
```

## Security Considerations

1. **Model Configuration Access**:
   - Restrict access to model configuration updates
   - Log all configuration changes with user attribution
   - Implement an approval workflow for production changes

2. **Performance Data Privacy**:
   - Anonymize sensitive data in tracking
   - Implement data retention policies
   - Ensure GDPR compliance for tracked predictions
3. **Alert Notification Security**:
   - Use secure channels for critical alerts
   - Implement alert authentication
   - Rate-limit alert notifications

## Support and Escalation

### L1 Support (Monitoring Team)

- Monitor dashboard and alerts
- Execute basic troubleshooting commands
- Escalate to L2 for unresolved issues

### L2 Support (ML Team)

- Investigate performance degradation
- Adjust thresholds and configurations
- Plan model retraining
- Escalate to L3 for system issues

### L3 Support (Core Team)

- System architecture issues
- Framework integration problems
- Performance optimization
- Long-term capacity planning

## Appendix

### Example Integration Code

See `/examples/n1-model-management-integration.php` for comprehensive integration examples.

### CLI Commands Reference

Complete list of ML management commands:

```bash
# Model Registry
ml:list-models                  # List all registered models
ml:register-model               # Register a specific model
ml:register-all-models          # Register all three models
ml:show-config                  # Show model configuration
ml:update-config                # Update model configuration
ml:rollback-config              # Roll back configuration
ml:list-production-models       # List production-deployed models

# Performance Monitoring
ml:performance-summary          # Current performance for all models
ml:performance-details          # Detailed performance for a specific model
ml:check-degradation            # Check for performance degradation
ml:performance-trends           # Historical performance trends
ml:daily-summary                # Daily performance summary
ml:compare-versions             # Compare performance between versions

# Auto-Tuning
ml:auto-tuning-history          # View auto-tuning history
ml:auto-tuning-report           # Generate auto-tuning report
ml:show-thresholds              # Show current thresholds
ml:update-threshold             # Manually update a threshold

# Deployment
ml:deploy                       # Deploy model to an environment
ml:deploy-to-production         # Deploy to production
ml:rollback                     # Roll back to a previous version
ml:deployment-status            # Check deployment status

# Monitoring
ml:verify-systems               # Verify all ML systems are functional
ml:scheduler-status             # Check scheduler status
ml:scheduler-health             # Scheduler health check
ml:analyze-errors               # Analyze prediction errors
ml:validate-ground-truth        # Validate ground truth data

# Maintenance
ml:cleanup-old-data             # Clean up old performance data
ml:data-sufficiency             # Check if there is enough data for optimization
ml:improvement-recommendations  # Get improvement recommendations
ml:config-drift-report          # Check for configuration drift
```

### Additional Resources

- [ML Model Management Architecture](/docs/ml-model-management-architecture.md)
- [Integration Examples](/examples/n1-model-management-integration.php)
- [Integration Tests](/tests/Integration/MachineLearning/MLModelManagementIntegrationTest.php)
- [Framework Documentation](/docs/claude/)
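### Threshold Logic Illustration

The queue anomaly environment variables in this guide (`JOB_ANOMALY_ZSCORE_THRESHOLD=3.0`, `JOB_ANOMALY_IQR_MULTIPLIER=1.5`) correspond to two standard statistical outlier tests. The sketch below shows how such thresholds are typically applied to job execution times; it is an illustration of the statistics only, not the framework's `JobAnomalyDetection` implementation:

```python
# Standard outlier tests matching the JOB_ANOMALY_* defaults:
# a z-score cutoff of 3.0 and a Tukey IQR multiplier of 1.5.
import statistics

def zscore_outlier(samples: list[float], value: float, threshold: float = 3.0) -> bool:
    """Flag value if it lies more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples)
    if stdev == 0:
        return False
    return abs(value - mean) / stdev > threshold

def iqr_outlier(samples: list[float], value: float, multiplier: float = 1.5) -> bool:
    """Flag value outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(samples, n=4)
    iqr = q3 - q1
    return value < q1 - multiplier * iqr or value > q3 + multiplier * iqr
```

A job whose runtime is far outside the recent distribution (e.g. 100s against a ~10s history) trips both tests, while runtimes near the distribution do not.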