ML System Patterns That Actually Scale
After deploying ML at scale across multiple companies, these are the patterns that consistently work - and the ones that don't.
The Production ML Stack
┌────────────────────────────────────────┐
│             Model Serving              │
│     (Low latency, high throughput)     │
├────────────────────────────────────────┤
│             Feature Store              │
│   (Consistent train/serve features)    │
├────────────────────────────────────────┤
│           Training Pipeline            │
│       (Reproducible, versioned)        │
├────────────────────────────────────────┤
│             Data Platform              │
│        (Single source of truth)        │
└────────────────────────────────────────┘
Pattern 1: Feature Store
The #1 source of training-serving skew is feature computation:
WRONG: Duplicate feature logic

# training/features.py
def compute_features_training(user):
    return user.purchases[-30:].mean()

# serving/features.py
def compute_features_serving(user):
    return user.recent_purchases.average()  # Subtly different!

Solution: Single feature definition, multiple materializations:
@feature(
    entities=["user_id"],
    online=True,   # Materialize to Redis
    offline=True,  # Materialize to warehouse
)
def avg_purchase_30d(user_purchases: DataFrame) -> float:
    """Single definition, used everywhere."""
    return user_purchases.last("30d")["amount"].mean()
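The two paths then differ only in where they read that feature from. A minimal sketch, assuming a hypothetical Feast-style client; FeatureStoreClient, get_historical_features, and get_online_features are illustrative names, not a specific library's API:

# Hypothetical feature-store client; names are illustrative.
store = FeatureStoreClient()

# Training: point-in-time correct values joined onto labeled events.
training_df = store.get_historical_features(
    entity_df=labeled_events,            # user_id + event_timestamp + label
    features=["avg_purchase_30d"],
)

# Serving: the same feature, read from the online store (e.g. Redis).
feature_vector = store.get_online_features(
    entity_rows=[{"user_id": user_id}],
    features=["avg_purchase_30d"],
)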
Pattern 2: Model Versioning
Every model in production needs:
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ModelArtifact:
    # Identity
    model_id: str
    version: str

    # Reproducibility
    training_data_hash: str
    code_commit: str
    hyperparameters: dict

    # Provenance
    trained_at: datetime
    trained_by: str
    training_metrics: dict

    # Deployment
    serving_signature: dict
    resource_requirements: dict
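Populating that record at training time is mostly bookkeeping. A sketch, assuming the training pipeline already has the trained model, a pandas train_df, and evaluation metrics in scope; the field values below are placeholders:

import hashlib
import subprocess
from datetime import datetime, timezone

import pandas as pd

def hash_training_data(df: pd.DataFrame) -> str:
    """Stable hash of the training set so the exact data can be traced later."""
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    return hashlib.sha256(row_hashes.values.tobytes()).hexdigest()

artifact = ModelArtifact(
    model_id="churn-classifier",                      # placeholder
    version="2024-06-01-a",                           # placeholder
    training_data_hash=hash_training_data(train_df),
    code_commit=subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip(),
    hyperparameters={"max_depth": 8, "learning_rate": 0.05},
    trained_at=datetime.now(timezone.utc),
    trained_by="training-pipeline",
    training_metrics={"auc": 0.91},                   # placeholder
    serving_signature={"inputs": ["avg_purchase_30d"], "output": "score"},
    resource_requirements={"cpu": "2", "memory": "4Gi"},
)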
Pattern 3: Shadow Deployment
Never deploy directly to production:
Request → Load Balancer
               │
               ├──→ Production Model (serves response)
               │
               └──→ Shadow Model (logs only)
                          │
                          └──→ Compare metrics offline

Only promote shadow → production when the metrics prove out.
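A minimal sketch of the routing layer, assuming prod_model, shadow_model, and a log_shadow_result sink are supplied by the surrounding service:

import asyncio

async def handle_request(features, prod_model, shadow_model, log_shadow_result):
    """Serve from production; mirror the request to the shadow model off the hot path."""
    # The caller only ever waits on the production model.
    response = prod_model.predict(features)

    def shadow_call():
        try:
            shadow_pred = shadow_model.predict(features)
            # Log both predictions so they can be compared offline.
            log_shadow_result(features, prod=response, shadow=shadow_pred)
        except Exception:
            # A failing shadow must never affect the user-facing response.
            pass

    # Fire and forget: run the shadow call in a worker thread.
    asyncio.get_running_loop().run_in_executor(None, shadow_call)
    return response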
Pattern 4: Graceful Degradation
ML systems fail. Plan for it:
class ResilientPredictor:
    def predict(self, features):
        try:
            # Try the ML model, with a hard latency budget (50 ms, in seconds)
            return self.ml_model.predict(features, timeout=0.05)
        except TimeoutError:
            # Fall back to a simpler, faster model
            return self.fallback_model.predict(features)
        except Exception:
            # Fall back to business rules
            return self.rule_based_fallback(features)
Pattern 5: Monitoring That Matters
Input Drift
def monitor_input_drift(current_batch, reference_distribution):
    drift_score = kolmogorov_smirnov_test(
        current_batch,
        reference_distribution,
    )
    if drift_score > THRESHOLD:
        alert("Input distribution shift detected")
Output Drift
def monitor_predictions(predictions, window="1h"):
    # Prediction distribution over the window
    pred_mean = predictions.mean()
    pred_std = predictions.std()

    # Compare to the training-time baseline
    if abs(pred_mean - BASELINE_MEAN) > 2 * BASELINE_STD:
        alert("Prediction distribution shift")
Business Metrics
def monitor_business_impact(model_cohort, control_cohort):
    # The metric that actually matters
    conversion_lift = (
        model_cohort.conversion_rate -
        control_cohort.conversion_rate
    )
    if conversion_lift < MINIMUM_LIFT:
        alert("Model not providing expected lift")
Anti-Patterns to Avoid
1. Notebook → Production
# This is not deployment:
model.save("model.pkl")
# Put it on the server somehow???
2. No Rollback Plan
Always have:
- the previous model version retained and loadable,
- a one-step way to point serving back at it,
- a record of which version is live right now (see the sketch below).
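One way to keep rollback a one-line change: pin serving to an explicit model version in config rather than "latest". A sketch; the config file name, keys, and registry object are illustrative, not a specific tool's API:

import yaml

def load_pinned_model(registry, config_path: str = "serving_config.yaml"):
    """Resolve the serving model from pinned config, never from 'latest'."""
    with open(config_path) as f:
        cfg = yaml.safe_load(f)  # e.g. {"model_id": "churn-classifier", "version": "2024-06-01-a"}
    # Rollback = re-pin cfg["version"] to the previous value and redeploy.
    return registry.load(cfg["model_id"], cfg["version"])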
3. Optimizing for Offline Metrics Only
Offline accuracy: 94%
Online conversion: dropped 15%

Why? Latency increased from 20ms to 200ms.
Always measure end-to-end business impact.
The Production Checklist
Before any model goes live:
- Feature definitions shared between training and serving (no duplicated logic)
- Model artifact fully versioned: data hash, code commit, hyperparameters
- Shadow deployment run and compared against the current production model
- Fallback path and timeout behavior tested
- Input drift, output drift, and business-metric monitoring wired to alerts
- Rollback plan in place and rehearsed
Production ML is 10% models, 90% engineering. Plan accordingly.
Next: Feature Engineering at Scale - Lessons from 1B+ Daily Predictions
