ML Model Management System - Production Deployment Guide
Complete guide for deploying the ML Model Management System to production with all three integrated ML systems.
Table of Contents
- System Overview
- Prerequisites
- Configuration
- Deployment Steps
- Monitoring Setup
- Troubleshooting
- Maintenance
System Overview
The ML Model Management System provides centralized management for three ML systems:
- N+1 Detection (n1-detector v1.0.0) - Detects N+1 query patterns in database operations
- Alert threshold: 85% accuracy
- Auto-tuning: confidence_threshold (0.5-0.9)
- WAF Behavioral Analysis (waf-behavioral v1.0.0) - Detects malicious behavioral patterns in web requests
- Alert threshold: 90% accuracy (critical)
- Auto-tuning: anomaly_threshold (0.5-0.9)
- Queue Job Anomaly Detection (queue-anomaly v1.0.0) - Detects anomalous job execution patterns
- Alert threshold: 80% accuracy
- Auto-tuning: anomaly_threshold (0.4-0.8)
Core Components:
- ModelRegistry: Cache-based model version and configuration storage
- ModelPerformanceMonitor: Real-time performance metrics tracking
- AutoTuningEngine: Automatic threshold optimization via grid search
- AlertingService: Log-based alerting for performance issues
- MLMonitoringScheduler: Periodic monitoring jobs
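The sketch below shows how these components cooperate conceptually. It is self-contained and runnable, but the class and method names are simplified stand-ins, not the framework's actual API; treat them as illustrative only.

```php
<?php
declare(strict_types=1);

// Simplified stand-ins for ModelRegistry / ModelPerformanceMonitor / AlertingService.
// Names and signatures here are assumptions for illustration, not the real API.

final class SketchRegistry
{
    /** @var array<string, array{version: string, config: array<string, float>}> */
    private array $models = [];

    public function register(string $model, string $version, array $config): void
    {
        $this->models[$model] = ['version' => $version, 'config' => $config];
    }

    public function isRegistered(string $model): bool
    {
        return isset($this->models[$model]);
    }
}

final class SketchMonitor
{
    /** @var array<string, list<bool>> correctness of recorded predictions per model */
    private array $outcomes = [];

    public function recordPrediction(string $model, bool $predicted, bool $actual): void
    {
        $this->outcomes[$model][] = ($predicted === $actual);
    }

    public function accuracy(string $model): float
    {
        $o = $this->outcomes[$model] ?? [];
        return $o === [] ? 1.0 : array_sum($o) / count($o);
    }
}

// Register the model, record predictions as ground truth arrives, alert on breach.
$registry = new SketchRegistry();
$registry->register('n1-detector', '1.0.0', ['confidence_threshold' => 0.6]);

$monitor = new SketchMonitor();
$monitor->recordPrediction('n1-detector', predicted: true, actual: true);
$monitor->recordPrediction('n1-detector', predicted: true, actual: false);

if ($monitor->accuracy('n1-detector') < 0.85) {
    error_log('N+1 Detector Performance Warning: accuracy below 85% threshold');
}
```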
Prerequisites
System Requirements
- PHP 8.3+
- Redis (for cache-based model registry)
- Sufficient memory for model tracking (minimum 256MB PHP memory_limit)
Framework Components
Ensure these framework components are initialized:
// Required initializers:
- ModelManagementInitializer
- NPlusOneDetectionEngineInitializer
- MLEnhancedWafLayerInitializer
- JobAnomalyDetectionInitializer
- MLMonitoringSchedulerInitializer
Environment Variables
# Model Management
MODEL_REGISTRY_CACHE_TTL=604800 # 7 days
AUTO_TUNING_ENABLED=true
# N+1 Detection
NPLUSONE_ML_ENABLED=true
NPLUSONE_ML_AUTO_REGISTER=true
NPLUSONE_ML_TIMEOUT_MS=5000
NPLUSONE_ML_CONFIDENCE_THRESHOLD=60.0
# WAF Behavioral
WAF_ML_ENABLED=true
WAF_ML_AUTO_REGISTER=true
# Queue Anomaly
QUEUE_ML_ENABLED=true
QUEUE_ML_AUTO_REGISTER=true
JOB_ANOMALY_MIN_CONFIDENCE=0.6
JOB_ANOMALY_THRESHOLD=50
JOB_ANOMALY_ZSCORE_THRESHOLD=3.0
JOB_ANOMALY_IQR_MULTIPLIER=1.5
Configuration
1. Model Registry Configuration
The model registry stores its data in the cache backend, with a default TTL of 7 days:
// In ModelManagementInitializer
$cacheKey = CacheKey::fromString('ml_model_registry');
$ttl = Duration::fromDays($environment->getInt('MODEL_REGISTRY_CACHE_TTL', 7));
Production Recommendations:
- Use Redis for cache backend (better persistence than file cache)
- Set TTL to at least 7 days to maintain model history
- Configure Redis persistence (AOF or RDB) for data durability
2. Performance Monitor Configuration
Performance monitoring tracks predictions with sliding windows:
// Window configurations
- Short-term: Last 100 predictions
- Medium-term: Last 1000 predictions
- Long-term: All predictions since deployment
Memory Considerations:
- Each prediction: ~200 bytes
- 1000 predictions: ~200KB
- Recommend periodic cleanup for long-running systems
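As an illustration of the sliding-window idea (not the monitor's actual implementation), the sketch below keeps only the outcome of each prediction and derives short-term, medium-term, and long-term accuracy from it. The real monitor stores more per prediction (confidence, latency, timestamps), which is where the ~200 bytes estimate comes from.

```php
<?php
declare(strict_types=1);

// Minimal sliding-window accuracy tracker: short-term (last 100), medium-term
// (last 1000), and lifetime accuracy since deployment. Data shape is an assumption.

final class SlidingWindowAccuracy
{
    /** @var list<bool> correctness flags, newest last */
    private array $recent = [];
    private int $lifetimeTotal = 0;
    private int $lifetimeCorrect = 0;

    public function __construct(private readonly int $maxWindow = 1000) {}

    public function record(bool $predicted, bool $actual): void
    {
        $correct = ($predicted === $actual);
        $this->recent[] = $correct;
        if (count($this->recent) > $this->maxWindow) {
            array_shift($this->recent); // keep only the last $maxWindow predictions
        }
        $this->lifetimeTotal++;
        $this->lifetimeCorrect += $correct ? 1 : 0;
    }

    /** Accuracy over the last $window predictions (short-term: 100, medium-term: 1000). */
    public function windowAccuracy(int $window): float
    {
        $slice = array_slice($this->recent, -$window);
        return $slice === [] ? 1.0 : array_sum($slice) / count($slice);
    }

    /** Accuracy over all predictions since deployment (long-term). */
    public function lifetimeAccuracy(): float
    {
        return $this->lifetimeTotal === 0 ? 1.0 : $this->lifetimeCorrect / $this->lifetimeTotal;
    }
}

$tracker = new SlidingWindowAccuracy();
$tracker->record(predicted: true, actual: true);
$tracker->record(predicted: false, actual: true);
printf("short-term: %.2f, long-term: %.2f\n", $tracker->windowAccuracy(100), $tracker->lifetimeAccuracy());
```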
3. Auto-Tuning Configuration
Auto-tuning uses grid search optimization:
// N+1 Detector
thresholdRange: [0.5, 0.9]
step: 0.05
timeWindow: 1 hour
metricToOptimize: 'f1_score'
// WAF Behavioral
thresholdRange: [0.5, 0.9]
step: 0.05
timeWindow: 1 hour
metricToOptimize: 'f1_score'
// Queue Anomaly
thresholdRange: [0.4, 0.8]
step: 0.05
timeWindow: 1 hour
metricToOptimize: 'f1_score'
Tuning Recommendations:
- Auto-apply only if improvement > 5%
- Run hourly to balance responsiveness and stability
- Monitor threshold changes via logs
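The grid-search loop can be pictured as follows. The data shape and function names are assumptions for illustration, but the F1 computation and the 5% auto-apply rule follow the configuration above.

```php
<?php
declare(strict_types=1);

// Sweep candidate thresholds over recent labeled predictions, compute F1 for each,
// and accept the best candidate only if it beats the current threshold's F1 by >5%.

/** @param list<array{score: float, actual: bool}> $samples */
function f1AtThreshold(array $samples, float $threshold): float
{
    $tp = $fp = $fn = 0;
    foreach ($samples as $s) {
        $predicted = $s['score'] >= $threshold;
        if ($predicted && $s['actual'])  { $tp++; }
        if ($predicted && !$s['actual']) { $fp++; }
        if (!$predicted && $s['actual']) { $fn++; }
    }
    $precision = ($tp + $fp) > 0 ? $tp / ($tp + $fp) : 0.0;
    $recall    = ($tp + $fn) > 0 ? $tp / ($tp + $fn) : 0.0;
    return ($precision + $recall) > 0 ? 2 * $precision * $recall / ($precision + $recall) : 0.0;
}

/** @param list<array{score: float, actual: bool}> $samples */
function tuneThreshold(array $samples, float $current, float $min, float $max, float $step): float
{
    $currentF1 = f1AtThreshold($samples, $current);
    $best = $current;
    $bestF1 = $currentF1;
    for ($t = $min; $t <= $max + 1e-9; $t += $step) {
        $f1 = f1AtThreshold($samples, $t);
        if ($f1 > $bestF1) {
            $best = $t;
            $bestF1 = $f1;
        }
    }
    // Auto-apply only when the relative improvement exceeds 5%.
    return ($currentF1 > 0 && ($bestF1 - $currentF1) / $currentF1 > 0.05) ? $best : $current;
}

// Example: last hour of N+1 detector predictions with ground truth labels.
$samples = [
    ['score' => 0.82, 'actual' => true],
    ['score' => 0.55, 'actual' => false],
    ['score' => 0.71, 'actual' => true],
    ['score' => 0.48, 'actual' => false],
];
echo tuneThreshold($samples, current: 0.6, min: 0.5, max: 0.9, step: 0.05), PHP_EOL;
```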
4. Monitoring Schedule Configuration
Four periodic jobs with different frequencies:
// Performance Monitoring - Every 5 minutes
- Checks current accuracy against thresholds
- Sends alerts if accuracy drops below threshold
- Logs: 'ml-performance-monitoring'
// Degradation Detection - Every 15 minutes
- Compares current vs. baseline performance
- Detects 5%+ degradation
- Logs: 'ml-degradation-detection'
// Auto-Tuning - Every hour
- Optimizes thresholds via grid search
- Auto-applies if improvement > 5%
- Logs: 'ml-auto-tuning'
// Registry Cleanup - Daily
- Cleans up old performance data
- Maintains production model list
- Logs: 'ml-registry-cleanup'
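The degradation-detection job boils down to a relative comparison against the baseline recorded at deployment. A minimal sketch with example numbers (the severity mapping is an assumption for illustration):

```php
<?php
declare(strict_types=1);

// Compare current window accuracy against the deployment baseline and flag a
// relative drop of 5% or more, mirroring the 15-minute degradation check above.

function degradationPercent(float $baselineAccuracy, float $currentAccuracy): float
{
    if ($baselineAccuracy <= 0.0) {
        return 0.0;
    }
    return max(0.0, ($baselineAccuracy - $currentAccuracy) / $baselineAccuracy * 100.0);
}

$models = [
    // model            => [baseline, current]
    'n1-detector'    => [0.90, 0.88],
    'waf-behavioral' => [0.94, 0.86],
    'queue-anomaly'  => [0.84, 0.83],
];

foreach ($models as $model => [$baseline, $current]) {
    $drop = degradationPercent($baseline, $current);
    if ($drop >= 5.0) {
        printf("[CRITICAL] %s performance degraded (degradation_percent: %.1f%%)\n", $model, $drop);
    } else {
        printf("[INFO] %s within baseline (degradation_percent: %.1f%%)\n", $model, $drop);
    }
}
```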
Deployment Steps
Step 1: Pre-Deployment Verification
# 1. Verify all ML systems are functional
docker exec php php console.php ml:verify-systems
# 2. Run integration tests
docker exec php ./vendor/bin/pest tests/Integration/MachineLearning/
# 3. Check environment configuration
docker exec php php console.php config:check-ml
# 4. Verify cache backend is accessible
docker exec php php console.php cache:health
Step 2: Deploy Core Components
# 1. Deploy application code (including ML components)
git pull origin main
# 2. Rebuild dependencies if needed
docker exec php composer install --no-dev --optimize-autoloader
# 3. Clear and warm up caches
docker exec php php console.php cache:clear
docker exec php php console.php cache:warmup
Step 3: Initialize Model Registry
# Register all three models on first deployment
docker exec php php console.php ml:register-all-models
# Expected output:
# - N+1 Detector (n1-detector v1.0.0) registered
# - WAF Behavioral (waf-behavioral v1.0.0) registered
# - Queue Anomaly (queue-anomaly v1.0.0) registered
Step 4: Start Monitoring Scheduler
The monitoring scheduler is automatically initialized during framework startup.
Verify it's running:
# Check scheduled tasks
docker exec php php console.php scheduler:status
# Expected tasks:
# - ml-performance-monitoring (next run: +5 min)
# - ml-degradation-detection (next run: +15 min)
# - ml-auto-tuning (next run: +1 hour)
# - ml-registry-cleanup (next run: +1 day)
Step 5: Deploy to Production
# 1. Deploy all models to production environment
docker exec php php console.php ml:deploy-to-production --all
# 2. Verify production deployment
docker exec php php console.php ml:list-production-models
# Expected output:
# - n1-detector v1.0.0 (deployed: YYYY-MM-DD HH:MM:SS)
# - waf-behavioral v1.0.0 (deployed: YYYY-MM-DD HH:MM:SS)
# - queue-anomaly v1.0.0 (deployed: YYYY-MM-DD HH:MM:SS)
Step 6: Verify Monitoring
# Wait 5 minutes for first monitoring run, then check logs
docker exec php tail -f storage/logs/ml-monitoring.log
# Expected log entries:
# [INFO] ML monitoring scheduler initialized (jobs_scheduled: 4, models_monitored: 3)
# [INFO] Performance monitoring executed successfully
# [INFO] N+1 detector status: monitored (accuracy: 0.XX)
# [INFO] WAF behavioral status: monitored (accuracy: 0.XX)
# [INFO] Queue anomaly status: monitored (accuracy: 0.XX)
Monitoring Setup
1. Log Monitoring
All ML operations are logged to:
- Main log: storage/logs/application.log
- ML-specific: storage/logs/ml-monitoring.log (if configured)
Key log patterns to monitor:
# Performance warnings
grep "Performance Warning" storage/logs/*.log
# Degradation alerts
grep "Performance Degraded" storage/logs/*.log
# Auto-tuning applications
grep "auto-tuned" storage/logs/*.log
# Monitoring failures
grep "monitoring failed" storage/logs/*.log
2. Metrics Dashboard
Create a dashboard to track:
N+1 Detector Metrics:
- Accuracy (target: >85%)
- Total predictions per day
- False positive rate
- Average confidence score
WAF Behavioral Metrics:
- Accuracy (target: >90%)
- Total requests analyzed
- Blocked requests ratio
- Detection latency
Queue Anomaly Metrics:
- Accuracy (target: >80%)
- Total jobs analyzed
- Anomaly detection rate
- Average processing time
System Metrics:
- Model registry cache hit rate
- Auto-tuning improvement rate
- Alert frequency
- Scheduler job success rate
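If your monitoring stack scrapes a Prometheus-style endpoint, the dashboard metrics above can be exposed in the standard text exposition format. A minimal sketch; the metric names and data shape are assumptions, so adapt them to whatever exporter you actually run:

```php
<?php
declare(strict_types=1);

// Render per-model metrics in Prometheus text exposition format.

/** @param array<string, array{accuracy: float, predictions_total: int, false_positive_rate: float}> $models */
function renderMetrics(array $models): string
{
    $lines = [
        '# HELP ml_model_accuracy Current accuracy per model (0-1).',
        '# TYPE ml_model_accuracy gauge',
    ];
    foreach ($models as $name => $m) {
        $lines[] = sprintf('ml_model_accuracy{model="%s"} %.4f', $name, $m['accuracy']);
    }
    $lines[] = '# HELP ml_model_predictions_total Predictions recorded since deployment.';
    $lines[] = '# TYPE ml_model_predictions_total counter';
    foreach ($models as $name => $m) {
        $lines[] = sprintf('ml_model_predictions_total{model="%s"} %d', $name, $m['predictions_total']);
    }
    $lines[] = '# HELP ml_model_false_positive_rate False positive rate per model (0-1).';
    $lines[] = '# TYPE ml_model_false_positive_rate gauge';
    foreach ($models as $name => $m) {
        $lines[] = sprintf('ml_model_false_positive_rate{model="%s"} %.4f', $name, $m['false_positive_rate']);
    }
    return implode("\n", $lines) . "\n";
}

header('Content-Type: text/plain; version=0.0.4');
echo renderMetrics([
    'n1-detector'    => ['accuracy' => 0.91, 'predictions_total' => 1240,  'false_positive_rate' => 0.06],
    'waf-behavioral' => ['accuracy' => 0.95, 'predictions_total' => 18530, 'false_positive_rate' => 0.03],
    'queue-anomaly'  => ['accuracy' => 0.86, 'predictions_total' => 640,   'false_positive_rate' => 0.11],
]);
```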
3. Alerting Rules
Configure alerts for:
# Critical Alerts (immediate action)
- WAF accuracy < 90%
- WAF performance degraded > 5%
- Scheduler job failures > 3 consecutive
# Warning Alerts (investigate within 1 hour)
- N+1 accuracy < 85%
- Queue anomaly accuracy < 80%
- Auto-tuning improvements < 1% (indicates stagnation)
- Model registry cache misses > 10%
# Info Alerts (review during business hours)
- Auto-tuning applied successfully
- Model configuration updated
- Registry cleanup completed
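A minimal sketch of how the accuracy rules above could be evaluated in code; the thresholds and severities mirror this list, but the data shape is an assumption:

```php
<?php
declare(strict_types=1);

// Evaluate per-model accuracy rules and emit alerts for breached thresholds.

/**
 * @param array<string, float> $accuracy current accuracy per model (0-1)
 * @return list<array{severity: string, message: string}>
 */
function evaluateAccuracyRules(array $accuracy): array
{
    // Per-model accuracy thresholds and the severity to raise when breached.
    $rules = [
        'waf-behavioral' => ['threshold' => 0.90, 'severity' => 'CRITICAL'],
        'n1-detector'    => ['threshold' => 0.85, 'severity' => 'WARNING'],
        'queue-anomaly'  => ['threshold' => 0.80, 'severity' => 'WARNING'],
    ];

    $alerts = [];
    foreach ($rules as $model => $rule) {
        $current = $accuracy[$model] ?? null;
        if ($current !== null && $current < $rule['threshold']) {
            $alerts[] = [
                'severity' => $rule['severity'],
                'message'  => sprintf('%s accuracy %.1f%% below %.0f%% threshold',
                    $model, $current * 100, $rule['threshold'] * 100),
            ];
        }
    }
    return $alerts;
}

foreach (evaluateAccuracyRules(['waf-behavioral' => 0.87, 'n1-detector' => 0.91, 'queue-anomaly' => 0.78]) as $alert) {
    printf("[%s] %s\n", $alert['severity'], $alert['message']);
}
```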
4. Performance Monitoring Commands
# Get current performance metrics for all models
docker exec php php console.php ml:performance-summary
# Check for degradation
docker exec php php console.php ml:check-degradation
# View auto-tuning history
docker exec php php console.php ml:auto-tuning-history
# Get model configuration
docker exec php php console.php ml:show-config n1-detector
docker exec php php console.php ml:show-config waf-behavioral
docker exec php php console.php ml:show-config queue-anomaly
Troubleshooting
Issue: Model Not Registered
Symptoms:
RuntimeException: Model not registered. Call registerCurrentModel() first.
Solution:
# Check if model exists
docker exec php php console.php ml:list-models
# Register missing model
docker exec php php console.php ml:register-model n1-detector
docker exec php php console.php ml:register-model waf-behavioral
docker exec php php console.php ml:register-model queue-anomaly
# Or register all at once
docker exec php php console.php ml:register-all-models
Issue: Low Accuracy Alerts
Symptoms:
[WARNING] N+1 Detector Performance Warning: Accuracy dropped to 78.50%
Investigation Steps:
- Check recent prediction patterns:
docker exec php php console.php ml:performance-details n1-detector --recent 100
- Analyze false positives/negatives:
docker exec php php console.php ml:analyze-errors n1-detector
- Review ground truth data:
docker exec php php console.php ml:validate-ground-truth n1-detector
- Possible causes:
- Data drift (application behavior changed)
- Incorrect ground truth labels
- Model needs retraining
- Configuration drift (thresholds changed)
Resolution:
- If data drift: Consider retraining model
- If ground truth issues: Update labeling process
- If config drift: Revert to known-good configuration
- If temporary: Wait for auto-tuning to adapt
Issue: Auto-Tuning Not Improving
Symptoms:
[INFO] Auto-tuning completed (improvement_percent: 0.5%)
Investigation:
- Check optimization history:
docker exec php php console.php ml:auto-tuning-history n1-detector --limit 10
- Verify enough data for optimization:
docker exec php php console.php ml:data-sufficiency n1-detector
- Possible causes:
- Insufficient prediction data (<100 predictions)
- Already at optimal threshold
- Metric plateaued (reached model capacity)
- Time window too short
Resolution:
- Wait for more prediction data
- Consider different optimization metric (precision, recall)
- Adjust threshold range or step size
- Increase time window for optimization
Issue: Performance Degradation Detected
Symptoms:
[CRITICAL] WAF Behavioral Performance Degraded (degradation_percent: 8.5%)
Immediate Actions:
- Verify it's not a false alarm:
docker exec php php console.php ml:verify-degradation waf-behavioral
- Check for system changes:
# Review recent deployments
git log --oneline --since="1 week ago"
# Check configuration changes
docker exec php php console.php config:diff --since="1 week ago"
- Rollback if necessary:
# Revert to previous model configuration
docker exec php php console.php ml:rollback-config waf-behavioral
# Or revert to previous model version
docker exec php php console.php ml:rollback-version waf-behavioral
- Monitor recovery:
# Watch real-time performance
docker exec php watch -n 60 "php console.php ml:performance-summary"
Issue: Monitoring Jobs Not Running
Symptoms:
No monitoring logs in the last 5 minutes
Investigation:
- Check scheduler status:
docker exec php php console.php scheduler:status
- Verify scheduler is running:
# Check if scheduler daemon is active
docker exec php ps aux | grep scheduler
- Check for errors:
docker exec php grep "scheduler" storage/logs/application.log | tail -20
Resolution:
# Restart scheduler daemon
docker exec php php console.php scheduler:restart
# Or restart entire application
docker-compose restart php
# Verify monitoring resumes
docker exec php tail -f storage/logs/ml-monitoring.log
Issue: Cache Miss Rate High
Symptoms:
Model registry cache miss rate: 45% (expected: <10%)
Possible Causes:
- Cache TTL too short
- Cache eviction due to memory pressure
- Cache backend not persistent
Resolution:
- Increase cache TTL:
# .env
MODEL_REGISTRY_CACHE_TTL=1209600 # 14 days
- Configure Redis persistence:
# redis.conf
appendonly yes
appendfsync everysec
- Monitor Redis memory:
docker exec redis redis-cli INFO memory
Maintenance
Daily Tasks
- Review performance metrics:
docker exec php php console.php ml:daily-summary
- Check alert log:
grep -E "(WARNING|CRITICAL)" storage/logs/ml-monitoring.log | tail -20
- Verify scheduler health:
docker exec php php console.php scheduler:health
Weekly Tasks
- Review auto-tuning effectiveness:
docker exec php php console.php ml:auto-tuning-report --last-week
- Analyze degradation incidents:
docker exec php php console.php ml:degradation-report --last-week
- Check model configuration drift:
docker exec php php console.php ml:config-drift-report
Monthly Tasks
- Performance trend analysis:
docker exec php php console.php ml:performance-trends --last-month
- Evaluate model improvement opportunities:
docker exec php php console.php ml:improvement-recommendations
- Review and update thresholds:
# Review current thresholds
docker exec php php console.php ml:show-thresholds
# Update if needed
docker exec php php console.php ml:update-threshold n1-detector --accuracy 0.88
- Clean up old performance data:
docker exec php php console.php ml:cleanup-old-data --older-than 90days
Quarterly Tasks
- Model retraining evaluation:
- Review accumulated prediction data
- Evaluate if retraining is beneficial
- Plan retraining if significant drift detected
- System capacity planning:
- Review prediction volume trends
- Assess cache and memory requirements
- Plan for scaling if needed
- Alert threshold tuning:
- Review alert frequency and relevance
- Adjust thresholds based on operational experience
- Update escalation procedures
Model Version Upgrades
When deploying a new model version:
# 1. Register new version
docker exec php php console.php ml:register-model n1-detector --version 1.1.0
# 2. Deploy to staging first
docker exec php php console.php ml:deploy n1-detector --version 1.1.0 --env staging
# 3. Monitor staging performance
docker exec php php console.php ml:compare-versions n1-detector 1.0.0 1.1.0
# 4. If successful, deploy to production
docker exec php php console.php ml:deploy n1-detector --version 1.1.0 --env production
# 5. Monitor production rollout
docker exec php watch -n 60 "php console.php ml:deployment-status n1-detector"
# 6. Rollback if issues detected
docker exec php php console.php ml:rollback n1-detector --to-version 1.0.0
Performance Baselines
Expected performance metrics after deployment:
N+1 Detector
Accuracy: 88-92%
Precision: 85-90%
Recall: 85-90%
F1-Score: 86-91%
Total Predictions: 1000+ per day
False Positive Rate: <10%
Detection Latency: <50ms
WAF Behavioral
Accuracy: 92-96%
Precision: 90-95%
Recall: 90-95%
F1-Score: 91-95%
Total Predictions: 10,000+ per day
False Positive Rate: <5%
Detection Latency: <100ms
Queue Anomaly
Accuracy: 82-88%
Precision: 80-85%
Recall: 80-85%
F1-Score: 81-86%
Total Predictions: 500+ per day
False Positive Rate: <15%
Detection Latency: <30ms
System Performance
Model Registry Cache Hit Rate: >90%
Scheduler Job Success Rate: >99%
Auto-Tuning Application Rate: 20-30% (1-2 times per week)
Alert Rate: <5 alerts per day (excluding info alerts)
Memory Usage: <200MB for model tracking
Security Considerations
- Model Configuration Access:
- Restrict access to model configuration updates
- Log all configuration changes with user attribution
- Implement approval workflow for production changes
- Performance Data Privacy:
- Anonymize sensitive data in tracking
- Implement data retention policies
- Ensure GDPR compliance for tracked predictions
- Alert Notification Security:
- Use secure channels for critical alerts
- Implement alert authentication
- Rate-limit alert notifications
Support and Escalation
L1 Support (Monitoring Team)
- Monitor dashboard and alerts
- Execute basic troubleshooting commands
- Escalate to L2 for unresolved issues
L2 Support (ML Team)
- Investigate performance degradation
- Adjust thresholds and configurations
- Plan model retraining
- Escalate to L3 for system issues
L3 Support (Core Team)
- System architecture issues
- Framework integration problems
- Performance optimization
- Long-term capacity planning
Appendix
Example Integration Code
See /examples/n1-model-management-integration.php for comprehensive integration examples.
CLI Commands Reference
Complete list of ML management commands:
# Model Registry
ml:list-models # List all registered models
ml:register-model # Register a specific model
ml:register-all-models # Register all three models
ml:show-config # Show model configuration
ml:update-config # Update model configuration
ml:rollback-config # Rollback configuration
ml:list-production-models # List production-deployed models
# Performance Monitoring
ml:performance-summary # Current performance for all models
ml:performance-details # Detailed performance for specific model
ml:check-degradation # Check for performance degradation
ml:performance-trends # Historical performance trends
ml:daily-summary # Daily performance summary
ml:compare-versions # Compare performance between versions
# Auto-Tuning
ml:auto-tuning-history # View auto-tuning history
ml:auto-tuning-report # Generate auto-tuning report
ml:show-thresholds # Show current thresholds
ml:update-threshold # Manually update threshold
# Deployment
ml:deploy # Deploy model to environment
ml:deploy-to-production # Deploy to production
ml:rollback # Rollback to previous version
ml:deployment-status # Check deployment status
# Monitoring
ml:verify-systems # Verify all ML systems functional
ml:scheduler-status # Check scheduler status
ml:scheduler-health # Scheduler health check
ml:analyze-errors # Analyze prediction errors
ml:validate-ground-truth # Validate ground truth data
# Maintenance
ml:cleanup-old-data # Clean up old performance data
ml:data-sufficiency # Check if enough data for optimization
ml:improvement-recommendations # Get improvement recommendations
ml:config-drift-report # Check for configuration drift