ML Model Management System - Production Deployment Guide
Complete guide for deploying the ML Model Management System to production with all three integrated ML systems.
Table of Contents
- System Overview
- Prerequisites
- Configuration
- Deployment Steps
- Monitoring Setup
- Troubleshooting
- Maintenance
System Overview
The ML Model Management System provides centralized management for three ML systems:
- N+1 Detection (n1-detector v1.0.0) - Detects N+1 query patterns in database operations
- Alert threshold: 85% accuracy
- Auto-tuning: confidence_threshold (0.5-0.9)
- WAF Behavioral Analysis (waf-behavioral v1.0.0) - Detects malicious behavioral patterns in web requests
- Alert threshold: 90% accuracy (critical)
- Auto-tuning: anomaly_threshold (0.5-0.9)
- Queue Job Anomaly Detection (queue-anomaly v1.0.0) - Detects anomalous job execution patterns
- Alert threshold: 80% accuracy
- Auto-tuning: anomaly_threshold (0.4-0.8)
Core Components:
- ModelRegistry: Cache-based model version and configuration storage
- ModelPerformanceMonitor: Real-time performance metrics tracking
- AutoTuningEngine: Automatic threshold optimization via grid search
- AlertingService: Log-based alerting for performance issues
- MLMonitoringScheduler: Periodic monitoring jobs
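The sketch below shows how these components cooperate conceptually. It is self-contained and runnable, but the class and method names are simplified stand-ins, not the framework's actual API; treat them as illustrative only.

```php
<?php
declare(strict_types=1);

// Simplified stand-ins for ModelRegistry / ModelPerformanceMonitor / AlertingService.
// Names and signatures here are assumptions for illustration, not the real API.

final class SketchRegistry
{
    /** @var array<string, array{version: string, config: array<string, float>}> */
    private array $models = [];

    public function register(string $model, string $version, array $config): void
    {
        $this->models[$model] = ['version' => $version, 'config' => $config];
    }

    public function isRegistered(string $model): bool
    {
        return isset($this->models[$model]);
    }
}

final class SketchMonitor
{
    /** @var array<string, list<bool>> correctness of recorded predictions per model */
    private array $outcomes = [];

    public function recordPrediction(string $model, bool $predicted, bool $actual): void
    {
        $this->outcomes[$model][] = ($predicted === $actual);
    }

    public function accuracy(string $model): float
    {
        $o = $this->outcomes[$model] ?? [];
        return $o === [] ? 1.0 : array_sum($o) / count($o);
    }
}

// Register the model, record predictions as ground truth arrives, alert on breach.
$registry = new SketchRegistry();
$registry->register('n1-detector', '1.0.0', ['confidence_threshold' => 0.6]);

$monitor = new SketchMonitor();
$monitor->recordPrediction('n1-detector', predicted: true, actual: true);
$monitor->recordPrediction('n1-detector', predicted: true, actual: false);

if ($monitor->accuracy('n1-detector') < 0.85) {
    error_log('N+1 Detector Performance Warning: accuracy below 85% threshold');
}
```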
Prerequisites
System Requirements
- PHP 8.3+
- Redis (for cache-based model registry)
- Sufficient memory for model tracking (minimum 256MB PHP memory_limit)
Framework Components
Ensure these framework components are initialized:
// Required initializers:
- ModelManagementInitializer
- NPlusOneDetectionEngineInitializer
- MLEnhancedWafLayerInitializer
- JobAnomalyDetectionInitializer
- MLMonitoringSchedulerInitializer
Environment Variables
# Model Management
MODEL_REGISTRY_CACHE_TTL=604800 # 7 days
AUTO_TUNING_ENABLED=true
# N+1 Detection
NPLUSONE_ML_ENABLED=true
NPLUSONE_ML_AUTO_REGISTER=true
NPLUSONE_ML_TIMEOUT_MS=5000
NPLUSONE_ML_CONFIDENCE_THRESHOLD=60.0
# WAF Behavioral
WAF_ML_ENABLED=true
WAF_ML_AUTO_REGISTER=true
# Queue Anomaly
QUEUE_ML_ENABLED=true
QUEUE_ML_AUTO_REGISTER=true
JOB_ANOMALY_MIN_CONFIDENCE=0.6
JOB_ANOMALY_THRESHOLD=50
JOB_ANOMALY_ZSCORE_THRESHOLD=3.0
JOB_ANOMALY_IQR_MULTIPLIER=1.5
Configuration
1. Model Registry Configuration
The model registry stores its data in the cache backend, with a default TTL of 7 days:
// In ModelManagementInitializer
$cacheKey = CacheKey::fromString('ml_model_registry');
$ttl = Duration::fromDays($environment->getInt('MODEL_REGISTRY_CACHE_TTL', 7));
Production Recommendations:
- Use Redis for cache backend (better persistence than file cache)
- Set TTL to at least 7 days to maintain model history
- Configure Redis persistence (AOF or RDB) for data durability
2. Performance Monitor Configuration
Performance monitoring tracks predictions with sliding windows:
// Window configurations
- Short-term: Last 100 predictions
- Medium-term: Last 1000 predictions
- Long-term: All predictions since deployment
Memory Considerations:
- Each prediction: ~200 bytes
- 1000 predictions: ~200KB
- Recommend periodic cleanup for long-running systems
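As an illustration of the sliding-window idea (not the monitor's actual implementation), the sketch below keeps only the outcome of each prediction and derives short-term, medium-term, and long-term accuracy from it. The real monitor stores more per prediction (confidence, latency, timestamps), which is where the ~200 bytes estimate comes from.

```php
<?php
declare(strict_types=1);

// Minimal sliding-window accuracy tracker: short-term (last 100), medium-term
// (last 1000), and lifetime accuracy since deployment. Data shape is an assumption.

final class SlidingWindowAccuracy
{
    /** @var list<bool> correctness flags, newest last */
    private array $recent = [];
    private int $lifetimeTotal = 0;
    private int $lifetimeCorrect = 0;

    public function __construct(private readonly int $maxWindow = 1000) {}

    public function record(bool $predicted, bool $actual): void
    {
        $correct = ($predicted === $actual);
        $this->recent[] = $correct;
        if (count($this->recent) > $this->maxWindow) {
            array_shift($this->recent); // keep only the last $maxWindow predictions
        }
        $this->lifetimeTotal++;
        $this->lifetimeCorrect += $correct ? 1 : 0;
    }

    /** Accuracy over the last $window predictions (short-term: 100, medium-term: 1000). */
    public function windowAccuracy(int $window): float
    {
        $slice = array_slice($this->recent, -$window);
        return $slice === [] ? 1.0 : array_sum($slice) / count($slice);
    }

    /** Accuracy over all predictions since deployment (long-term). */
    public function lifetimeAccuracy(): float
    {
        return $this->lifetimeTotal === 0 ? 1.0 : $this->lifetimeCorrect / $this->lifetimeTotal;
    }
}

$tracker = new SlidingWindowAccuracy();
$tracker->record(predicted: true, actual: true);
$tracker->record(predicted: false, actual: true);
printf("short-term: %.2f, long-term: %.2f\n", $tracker->windowAccuracy(100), $tracker->lifetimeAccuracy());
```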
3. Auto-Tuning Configuration
Auto-tuning uses grid search optimization:
// N+1 Detector
thresholdRange: [0.5, 0.9]
step: 0.05
timeWindow: 1 hour
metricToOptimize: 'f1_score'
// WAF Behavioral
thresholdRange: [0.5, 0.9]
step: 0.05
timeWindow: 1 hour
metricToOptimize: 'f1_score'
// Queue Anomaly
thresholdRange: [0.4, 0.8]
step: 0.05
timeWindow: 1 hour
metricToOptimize: 'f1_score'
Tuning Recommendations:
- Auto-apply only if improvement > 5%
- Run hourly to balance responsiveness and stability
- Monitor threshold changes via logs
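The grid-search loop can be pictured as follows. The data shape and function names are assumptions for illustration, but the F1 computation and the 5% auto-apply rule follow the configuration above.

```php
<?php
declare(strict_types=1);

// Sweep candidate thresholds over recent labeled predictions, compute F1 for each,
// and accept the best candidate only if it beats the current threshold's F1 by >5%.

/** @param list<array{score: float, actual: bool}> $samples */
function f1AtThreshold(array $samples, float $threshold): float
{
    $tp = $fp = $fn = 0;
    foreach ($samples as $s) {
        $predicted = $s['score'] >= $threshold;
        if ($predicted && $s['actual'])  { $tp++; }
        if ($predicted && !$s['actual']) { $fp++; }
        if (!$predicted && $s['actual']) { $fn++; }
    }
    $precision = ($tp + $fp) > 0 ? $tp / ($tp + $fp) : 0.0;
    $recall    = ($tp + $fn) > 0 ? $tp / ($tp + $fn) : 0.0;
    return ($precision + $recall) > 0 ? 2 * $precision * $recall / ($precision + $recall) : 0.0;
}

/** @param list<array{score: float, actual: bool}> $samples */
function tuneThreshold(array $samples, float $current, float $min, float $max, float $step): float
{
    $currentF1 = f1AtThreshold($samples, $current);
    $best = $current;
    $bestF1 = $currentF1;
    for ($t = $min; $t <= $max + 1e-9; $t += $step) {
        $f1 = f1AtThreshold($samples, $t);
        if ($f1 > $bestF1) {
            $best = $t;
            $bestF1 = $f1;
        }
    }
    // Auto-apply only when the relative improvement exceeds 5%.
    return ($currentF1 > 0 && ($bestF1 - $currentF1) / $currentF1 > 0.05) ? $best : $current;
}

// Example: last hour of N+1 detector predictions with ground truth labels.
$samples = [
    ['score' => 0.82, 'actual' => true],
    ['score' => 0.55, 'actual' => false],
    ['score' => 0.71, 'actual' => true],
    ['score' => 0.48, 'actual' => false],
];
echo tuneThreshold($samples, current: 0.6, min: 0.5, max: 0.9, step: 0.05), PHP_EOL;
```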
4. Monitoring Schedule Configuration
Four periodic jobs with different frequencies:
// Performance Monitoring - Every 5 minutes
- Checks current accuracy against thresholds
- Sends alerts if accuracy drops below threshold
- Logs: 'ml-performance-monitoring'
// Degradation Detection - Every 15 minutes
- Compares current vs. baseline performance
- Detects 5%+ degradation
- Logs: 'ml-degradation-detection'
// Auto-Tuning - Every hour
- Optimizes thresholds via grid search
- Auto-applies if improvement > 5%
- Logs: 'ml-auto-tuning'
// Registry Cleanup - Daily
- Cleans up old performance data
- Maintains production model list
- Logs: 'ml-registry-cleanup'
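The degradation-detection job boils down to a relative comparison against the baseline recorded at deployment. A minimal sketch with example numbers (the severity mapping is an assumption for illustration):

```php
<?php
declare(strict_types=1);

// Compare current window accuracy against the deployment baseline and flag a
// relative drop of 5% or more, mirroring the 15-minute degradation check above.

function degradationPercent(float $baselineAccuracy, float $currentAccuracy): float
{
    if ($baselineAccuracy <= 0.0) {
        return 0.0;
    }
    return max(0.0, ($baselineAccuracy - $currentAccuracy) / $baselineAccuracy * 100.0);
}

$models = [
    // model            => [baseline, current]
    'n1-detector'    => [0.90, 0.88],
    'waf-behavioral' => [0.94, 0.86],
    'queue-anomaly'  => [0.84, 0.83],
];

foreach ($models as $model => [$baseline, $current]) {
    $drop = degradationPercent($baseline, $current);
    if ($drop >= 5.0) {
        printf("[CRITICAL] %s performance degraded (degradation_percent: %.1f%%)\n", $model, $drop);
    } else {
        printf("[INFO] %s within baseline (degradation_percent: %.1f%%)\n", $model, $drop);
    }
}
```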
Deployment Steps
Step 1: Pre-Deployment Verification
# 1. Verify all ML systems are functional
docker exec php php console.php ml:verify-systems
# 2. Run integration tests
docker exec php ./vendor/bin/pest tests/Integration/MachineLearning/
# 3. Check environment configuration
docker exec php php console.php config:check-ml
# 4. Verify cache backend is accessible
docker exec php php console.php cache:health
Step 2: Deploy Core Components
# 1. Deploy application code (including ML components)
git pull origin main
# 2. Rebuild dependencies if needed
docker exec php composer install --no-dev --optimize-autoloader
# 3. Clear and warm up caches
docker exec php php console.php cache:clear
docker exec php php console.php cache:warmup
Step 3: Initialize Model Registry
# Register all three models on first deployment
docker exec php php console.php ml:register-all-models
# Expected output:
# - N+1 Detector (n1-detector v1.0.0) registered
# - WAF Behavioral (waf-behavioral v1.0.0) registered
# - Queue Anomaly (queue-anomaly v1.0.0) registered
Step 4: Start Monitoring Scheduler
The monitoring scheduler is automatically initialized during framework startup.
Verify it's running:
# Check scheduled tasks
docker exec php php console.php scheduler:status
# Expected tasks:
# - ml-performance-monitoring (next run: +5 min)
# - ml-degradation-detection (next run: +15 min)
# - ml-auto-tuning (next run: +1 hour)
# - ml-registry-cleanup (next run: +1 day)
Step 5: Deploy to Production
# 1. Deploy all models to production environment
docker exec php php console.php ml:deploy-to-production --all
# 2. Verify production deployment
docker exec php php console.php ml:list-production-models
# Expected output:
# - n1-detector v1.0.0 (deployed: YYYY-MM-DD HH:MM:SS)
# - waf-behavioral v1.0.0 (deployed: YYYY-MM-DD HH:MM:SS)
# - queue-anomaly v1.0.0 (deployed: YYYY-MM-DD HH:MM:SS)
Step 6: Verify Monitoring
# Wait 5 minutes for first monitoring run, then check logs
docker exec php tail -f storage/logs/ml-monitoring.log
# Expected log entries:
# [INFO] ML monitoring scheduler initialized (jobs_scheduled: 4, models_monitored: 3)
# [INFO] Performance monitoring executed successfully
# [INFO] N+1 detector status: monitored (accuracy: 0.XX)
# [INFO] WAF behavioral status: monitored (accuracy: 0.XX)
# [INFO] Queue anomaly status: monitored (accuracy: 0.XX)
Monitoring Setup
1. Log Monitoring
All ML operations are logged to:
- Main log: storage/logs/application.log
- ML-specific: storage/logs/ml-monitoring.log (if configured)
Key log patterns to monitor:
# Performance warnings
grep "Performance Warning" storage/logs/*.log
# Degradation alerts
grep "Performance Degraded" storage/logs/*.log
# Auto-tuning applications
grep "auto-tuned" storage/logs/*.log
# Monitoring failures
grep "monitoring failed" storage/logs/*.log
2. Metrics Dashboard
Create a dashboard to track:
N+1 Detector Metrics:
- Accuracy (target: >85%)
- Total predictions per day
- False positive rate
- Average confidence score
WAF Behavioral Metrics:
- Accuracy (target: >90%)
- Total requests analyzed
- Blocked requests ratio
- Detection latency
Queue Anomaly Metrics:
- Accuracy (target: >80%)
- Total jobs analyzed
- Anomaly detection rate
- Average processing time
System Metrics:
- Model registry cache hit rate
- Auto-tuning improvement rate
- Alert frequency
- Scheduler job success rate
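If your monitoring stack scrapes a Prometheus-style endpoint, the dashboard metrics above can be exposed in the standard text exposition format. A minimal sketch; the metric names and data shape are assumptions, so adapt them to whatever exporter you actually run:

```php
<?php
declare(strict_types=1);

// Render per-model metrics in Prometheus text exposition format.

/** @param array<string, array{accuracy: float, predictions_total: int, false_positive_rate: float}> $models */
function renderMetrics(array $models): string
{
    $lines = [
        '# HELP ml_model_accuracy Current accuracy per model (0-1).',
        '# TYPE ml_model_accuracy gauge',
    ];
    foreach ($models as $name => $m) {
        $lines[] = sprintf('ml_model_accuracy{model="%s"} %.4f', $name, $m['accuracy']);
    }
    $lines[] = '# HELP ml_model_predictions_total Predictions recorded since deployment.';
    $lines[] = '# TYPE ml_model_predictions_total counter';
    foreach ($models as $name => $m) {
        $lines[] = sprintf('ml_model_predictions_total{model="%s"} %d', $name, $m['predictions_total']);
    }
    $lines[] = '# HELP ml_model_false_positive_rate False positive rate per model (0-1).';
    $lines[] = '# TYPE ml_model_false_positive_rate gauge';
    foreach ($models as $name => $m) {
        $lines[] = sprintf('ml_model_false_positive_rate{model="%s"} %.4f', $name, $m['false_positive_rate']);
    }
    return implode("\n", $lines) . "\n";
}

header('Content-Type: text/plain; version=0.0.4');
echo renderMetrics([
    'n1-detector'    => ['accuracy' => 0.91, 'predictions_total' => 1240,  'false_positive_rate' => 0.06],
    'waf-behavioral' => ['accuracy' => 0.95, 'predictions_total' => 18530, 'false_positive_rate' => 0.03],
    'queue-anomaly'  => ['accuracy' => 0.86, 'predictions_total' => 640,   'false_positive_rate' => 0.11],
]);
```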
3. Alerting Rules
Configure alerts for:
# Critical Alerts (immediate action)
- WAF accuracy < 90%
- WAF performance degraded > 5%
- Scheduler job failures > 3 consecutive
# Warning Alerts (investigate within 1 hour)
- N+1 accuracy < 85%
- Queue anomaly accuracy < 80%
- Auto-tuning improvements < 1% (indicates stagnation)
- Model registry cache misses > 10%
# Info Alerts (review during business hours)
- Auto-tuning applied successfully
- Model configuration updated
- Registry cleanup completed
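A minimal sketch of how the accuracy rules above could be evaluated in code; the thresholds and severities mirror this list, but the data shape is an assumption:

```php
<?php
declare(strict_types=1);

// Evaluate per-model accuracy rules and emit alerts for breached thresholds.

/**
 * @param array<string, float> $accuracy current accuracy per model (0-1)
 * @return list<array{severity: string, message: string}>
 */
function evaluateAccuracyRules(array $accuracy): array
{
    // Per-model accuracy thresholds and the severity to raise when breached.
    $rules = [
        'waf-behavioral' => ['threshold' => 0.90, 'severity' => 'CRITICAL'],
        'n1-detector'    => ['threshold' => 0.85, 'severity' => 'WARNING'],
        'queue-anomaly'  => ['threshold' => 0.80, 'severity' => 'WARNING'],
    ];

    $alerts = [];
    foreach ($rules as $model => $rule) {
        $current = $accuracy[$model] ?? null;
        if ($current !== null && $current < $rule['threshold']) {
            $alerts[] = [
                'severity' => $rule['severity'],
                'message'  => sprintf('%s accuracy %.1f%% below %.0f%% threshold',
                    $model, $current * 100, $rule['threshold'] * 100),
            ];
        }
    }
    return $alerts;
}

foreach (evaluateAccuracyRules(['waf-behavioral' => 0.87, 'n1-detector' => 0.91, 'queue-anomaly' => 0.78]) as $alert) {
    printf("[%s] %s\n", $alert['severity'], $alert['message']);
}
```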
4. Performance Monitoring Commands
# Get current performance metrics for all models
docker exec php php console.php ml:performance-summary
# Check for degradation
docker exec php php console.php ml:check-degradation
# View auto-tuning history
docker exec php php console.php ml:auto-tuning-history
# Get model configuration
docker exec php php console.php ml:show-config n1-detector
docker exec php php console.php ml:show-config waf-behavioral
docker exec php php console.php ml:show-config queue-anomaly
Troubleshooting
Issue: Model Not Registered
Symptoms:
RuntimeException: Model not registered. Call registerCurrentModel() first.
Solution:
# Check if model exists
docker exec php php console.php ml:list-models
# Register missing model
docker exec php php console.php ml:register-model n1-detector
docker exec php php console.php ml:register-model waf-behavioral
docker exec php php console.php ml:register-model queue-anomaly
# Or register all at once
docker exec php php console.php ml:register-all-models
Issue: Low Accuracy Alerts
Symptoms:
[WARNING] N+1 Detector Performance Warning: Accuracy dropped to 78.50%
Investigation Steps:
- Check recent prediction patterns:
docker exec php php console.php ml:performance-details n1-detector --recent 100
- Analyze false positives/negatives:
docker exec php php console.php ml:analyze-errors n1-detector
- Review ground truth data:
docker exec php php console.php ml:validate-ground-truth n1-detector
- Possible causes:
- Data drift (application behavior changed)
- Incorrect ground truth labels
- Model needs retraining
- Configuration drift (thresholds changed)
Resolution:
- If data drift: Consider retraining model
- If ground truth issues: Update labeling process
- If config drift: Revert to known-good configuration
- If temporary: Wait for auto-tuning to adapt
Issue: Auto-Tuning Not Improving
Symptoms:
[INFO] Auto-tuning completed (improvement_percent: 0.5%)
Investigation:
- Check optimization history:
docker exec php php console.php ml:auto-tuning-history n1-detector --limit 10
- Verify enough data for optimization:
docker exec php php console.php ml:data-sufficiency n1-detector
- Possible causes:
- Insufficient prediction data (<100 predictions)
- Already at optimal threshold
- Metric plateaued (reached model capacity)
- Time window too short
Resolution:
- Wait for more prediction data
- Consider different optimization metric (precision, recall)
- Adjust threshold range or step size
- Increase time window for optimization
Issue: Performance Degradation Detected
Symptoms:
[CRITICAL] WAF Behavioral Performance Degraded (degradation_percent: 8.5%)
Immediate Actions:
- Verify it's not a false alarm:
docker exec php php console.php ml:verify-degradation waf-behavioral
- Check for system changes:
# Review recent deployments
git log --oneline --since="1 week ago"
# Check configuration changes
docker exec php php console.php config:diff --since="1 week ago"
- Rollback if necessary:
# Revert to previous model configuration
docker exec php php console.php ml:rollback-config waf-behavioral
# Or revert to previous model version
docker exec php php console.php ml:rollback-version waf-behavioral
- Monitor recovery:
# Watch real-time performance
docker exec php watch -n 60 "php console.php ml:performance-summary"
Issue: Monitoring Jobs Not Running
Symptoms:
No monitoring logs in the last 5 minutes
Investigation:
- Check scheduler status:
docker exec php php console.php scheduler:status
- Verify scheduler is running:
# Check if scheduler daemon is active
docker exec php ps aux | grep scheduler
- Check for errors:
docker exec php grep "scheduler" storage/logs/application.log | tail -20
Resolution:
# Restart scheduler daemon
docker exec php php console.php scheduler:restart
# Or restart entire application
docker-compose restart php
# Verify monitoring resumes
docker exec php tail -f storage/logs/ml-monitoring.log
Issue: Cache Miss Rate High
Symptoms:
Model registry cache miss rate: 45% (expected: <10%)
Possible Causes:
- Cache TTL too short
- Cache eviction due to memory pressure
- Cache backend not persistent
Resolution:
- Increase cache TTL:
# .env
MODEL_REGISTRY_CACHE_TTL=1209600 # 14 days
- Configure Redis persistence:
# redis.conf
appendonly yes
appendfsync everysec
- Monitor Redis memory:
docker exec redis redis-cli INFO memory
Maintenance
Daily Tasks
- Review performance metrics:
docker exec php php console.php ml:daily-summary
- Check alert log:
grep -E "(WARNING|CRITICAL)" storage/logs/ml-monitoring.log | tail -20
- Verify scheduler health:
docker exec php php console.php scheduler:health
Weekly Tasks
- Review auto-tuning effectiveness:
docker exec php php console.php ml:auto-tuning-report --last-week
- Analyze degradation incidents:
docker exec php php console.php ml:degradation-report --last-week
- Check model configuration drift:
docker exec php php console.php ml:config-drift-report
Monthly Tasks
- Performance trend analysis:
docker exec php php console.php ml:performance-trends --last-month
- Evaluate model improvement opportunities:
docker exec php php console.php ml:improvement-recommendations
- Review and update thresholds:
# Review current thresholds
docker exec php php console.php ml:show-thresholds
# Update if needed
docker exec php php console.php ml:update-threshold n1-detector --accuracy 0.88
- Clean up old performance data:
docker exec php php console.php ml:cleanup-old-data --older-than 90days
Quarterly Tasks
- Model retraining evaluation:
- Review accumulated prediction data
- Evaluate if retraining is beneficial
- Plan retraining if significant drift detected
- System capacity planning:
- Review prediction volume trends
- Assess cache and memory requirements
- Plan for scaling if needed
- Alert threshold tuning:
- Review alert frequency and relevance
- Adjust thresholds based on operational experience
- Update escalation procedures
Model Version Upgrades
When deploying a new model version:
# 1. Register new version
docker exec php php console.php ml:register-model n1-detector --version 1.1.0
# 2. Deploy to staging first
docker exec php php console.php ml:deploy n1-detector --version 1.1.0 --env staging
# 3. Monitor staging performance
docker exec php php console.php ml:compare-versions n1-detector 1.0.0 1.1.0
# 4. If successful, deploy to production
docker exec php php console.php ml:deploy n1-detector --version 1.1.0 --env production
# 5. Monitor production rollout
docker exec php watch -n 60 "php console.php ml:deployment-status n1-detector"
# 6. Rollback if issues detected
docker exec php php console.php ml:rollback n1-detector --to-version 1.0.0
Performance Baselines
Expected performance metrics after deployment:
N+1 Detector
Accuracy: 88-92%
Precision: 85-90%
Recall: 85-90%
F1-Score: 86-91%
Total Predictions: 1000+ per day
False Positive Rate: <10%
Detection Latency: <50ms
WAF Behavioral
Accuracy: 92-96%
Precision: 90-95%
Recall: 90-95%
F1-Score: 91-95%
Total Predictions: 10,000+ per day
False Positive Rate: <5%
Detection Latency: <100ms
Queue Anomaly
Accuracy: 82-88%
Precision: 80-85%
Recall: 80-85%
F1-Score: 81-86%
Total Predictions: 500+ per day
False Positive Rate: <15%
Detection Latency: <30ms
System Performance
Model Registry Cache Hit Rate: >90%
Scheduler Job Success Rate: >99%
Auto-Tuning Application Rate: 20-30% (1-2 times per week)
Alert Rate: <5 alerts per day (excluding info alerts)
Memory Usage: <200MB for model tracking
Security Considerations
- Model Configuration Access:
- Restrict access to model configuration updates
- Log all configuration changes with user attribution
- Implement approval workflow for production changes
- Performance Data Privacy:
- Anonymize sensitive data in tracking
- Implement data retention policies
- Ensure GDPR compliance for tracked predictions
- Alert Notification Security:
- Use secure channels for critical alerts
- Implement alert authentication
- Rate-limit alert notifications
Support and Escalation
L1 Support (Monitoring Team)
- Monitor dashboard and alerts
- Execute basic troubleshooting commands
- Escalate to L2 for unresolved issues
L2 Support (ML Team)
- Investigate performance degradation
- Adjust thresholds and configurations
- Plan model retraining
- Escalate to L3 for system issues
L3 Support (Core Team)
- System architecture issues
- Framework integration problems
- Performance optimization
- Long-term capacity planning
Appendix
Example Integration Code
See /examples/n1-model-management-integration.php for comprehensive integration examples.
CLI Commands Reference
Complete list of ML management commands:
# Model Registry
ml:list-models # List all registered models
ml:register-model # Register a specific model
ml:register-all-models # Register all three models
ml:show-config # Show model configuration
ml:update-config # Update model configuration
ml:rollback-config # Rollback configuration
ml:list-production-models # List production-deployed models
# Performance Monitoring
ml:performance-summary # Current performance for all models
ml:performance-details # Detailed performance for specific model
ml:check-degradation # Check for performance degradation
ml:performance-trends # Historical performance trends
ml:daily-summary # Daily performance summary
ml:compare-versions # Compare performance between versions
# Auto-Tuning
ml:auto-tuning-history # View auto-tuning history
ml:auto-tuning-report # Generate auto-tuning report
ml:show-thresholds # Show current thresholds
ml:update-threshold # Manually update threshold
# Deployment
ml:deploy # Deploy model to environment
ml:deploy-to-production # Deploy to production
ml:rollback # Rollback to previous version
ml:deployment-status # Check deployment status
# Monitoring
ml:verify-systems # Verify all ML systems functional
ml:scheduler-status # Check scheduler status
ml:scheduler-health # Scheduler health check
ml:analyze-errors # Analyze prediction errors
ml:validate-ground-truth # Validate ground truth data
# Maintenance
ml:cleanup-old-data # Clean up old performance data
ml:data-sufficiency # Check if enough data for optimization
ml:improvement-recommendations # Get improvement recommendations
ml:config-drift-report # Check for configuration drift