
ML System Patterns That Actually Scale

Battle-tested architectures for deploying machine learning in production. From feature stores to model serving.

Marcus Chen

2023-12-15

18 min read

After deploying ML at scale across multiple companies, these are the patterns that consistently work - and the ones that don't.

The Production ML Stack

┌────────────────────────────────────────┐
│           Model Serving                │
│    (Low latency, high throughput)      │
├────────────────────────────────────────┤
│          Feature Store                 │
│    (Consistent train/serve features)   │
├────────────────────────────────────────┤
│        Training Pipeline               │
│    (Reproducible, versioned)           │
├────────────────────────────────────────┤
│         Data Platform                  │
│    (Single source of truth)            │
└────────────────────────────────────────┘

Pattern 1: Feature Store

The #1 source of training-serving skew is feature computation:

WRONG: Duplicate feature logic

training/features.py

def compute_features_training(user):
    return user.purchases[-30:].mean()

serving/features.py

def compute_features_serving(user):
    return user.recent_purchases.average()  # Subtly different!

Solution: Single feature definition, multiple materializations:

from pandas import DataFrame

@feature(
    entities=["user_id"],
    online=True,   # Materialize to Redis
    offline=True,  # Materialize to warehouse
)
def avg_purchase_30d(user_purchases: DataFrame) -> float:
    """Single definition, used everywhere."""
    # Assumes a time-indexed frame of the user's purchases
    return user_purchases.last("30d")["amount"].mean()
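
The same definition then backs both paths. A minimal usage sketch, assuming a Feast-style store object (the feature reference and entity rows here are illustrative):

# Training: point-in-time-correct features from the offline store
training_df = store.get_historical_features(
    entity_df=labels_df,  # user_id plus event-timestamp columns
    features=["avg_purchase_30d"],
).to_df()

# Serving: the same feature, read from the online store (e.g. Redis)
online_features = store.get_online_features(
    features=["avg_purchase_30d"],
    entity_rows=[{"user_id": 42}],
).to_dict()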

Pattern 2: Model Versioning

Every model in production needs:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class ModelArtifact:
    # Identity
    model_id: str
    version: str
    
    # Reproducibility
    training_data_hash: str
    code_commit: str
    hyperparameters: dict
    
    # Provenance
    trained_at: datetime
    trained_by: str
    training_metrics: dict
    
    # Deployment
    serving_signature: dict
    resource_requirements: dict
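
A sketch of how these fields might be captured at training time; the helper, the CSV-bytes hashing, and the git call are illustrative assumptions rather than a prescribed API:

import hashlib
import subprocess
from datetime import datetime, timezone

def build_artifact(model_id, version, train_df, params, metrics):
    # Hash the exact bytes the model trained on
    data_hash = hashlib.sha256(
        train_df.to_csv(index=False).encode()
    ).hexdigest()
    # Record the code state (assumes training runs in a git checkout)
    commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"]
    ).decode().strip()
    return ModelArtifact(
        model_id=model_id,
        version=version,
        training_data_hash=data_hash,
        code_commit=commit,
        hyperparameters=params,
        trained_at=datetime.now(timezone.utc),
        trained_by="training-pipeline",  # or the triggering user
        training_metrics=metrics,
        serving_signature={},           # filled in at model export
        resource_requirements={},       # filled in at deploy time
    )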

Pattern 3: Shadow Deployment

Never deploy directly to production:

Request → Load Balancer
              │
              ├──→ Production Model (serves response)
              │
              └──→ Shadow Model (logs only)
                        │
                        └──→ Compare metrics offline

Only promote shadow → production when metrics prove out.
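
A minimal sketch of the mirroring logic, assuming synchronous model clients (production_model, shadow_model, and the logged fields are illustrative):

import logging
from concurrent.futures import ThreadPoolExecutor

shadow_pool = ThreadPoolExecutor(max_workers=4)

def handle_request(features, production_model, shadow_model):
    # Production path: this result is what the caller receives
    response = production_model.predict(features)

    # Shadow path: fire-and-forget, logged for offline comparison
    def run_shadow():
        try:
            shadow_pred = shadow_model.predict(features)
            logging.info(
                "shadow comparison: prod=%s shadow=%s",
                response, shadow_pred,
            )
        except Exception:
            # A shadow failure must never affect production traffic
            logging.exception("shadow model failed")

    shadow_pool.submit(run_shadow)
    return response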

Pattern 4: Graceful Degradation

ML systems fail. Plan for it:

class ResilientPredictor:
    def predict(self, features):
        try:
            # Try the ML model first, within a hard latency budget
            return self.ml_model.predict(features, timeout=0.050)  # 50 ms
        except TimeoutError:
            # Fall back to simpler model
            return self.fallback_model.predict(features)
        except Exception:
            # Fall back to business rules
            return self.rule_based_fallback(features)
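
The last tier can be deliberately boring. As a sketch, a rule-based fallback for a recommendation-style system (the attribute and key names here are illustrative):

def rule_based_fallback(self, features):
    # Heuristic of last resort: popularity, optionally coarsened
    # by user segment; no model inference required
    segment = features.get("user_segment", "default")
    return self.popular_items_by_segment.get(
        segment, self.global_popular_items
    )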

Pattern 5: Monitoring That Matters

Input Drift

from scipy.stats import ks_2samp

def monitor_input_drift(current_batch, reference_distribution):
    # Two-sample KS test: the statistic grows with distribution distance
    drift_score, p_value = ks_2samp(
        current_batch,
        reference_distribution,
    )
    if drift_score > THRESHOLD:
        alert("Input distribution shift detected")

Output Drift

def monitor_predictions(predictions, window="1h"):
    # Prediction distribution
    pred_mean = predictions.mean()
    pred_std = predictions.std()
    
    # Compare to baseline
    if abs(pred_mean - BASELINE_MEAN) > 2 * BASELINE_STD:
        alert("Prediction distribution shift")

Business Metrics

def monitor_business_impact(model_cohort, control_cohort):
    # The metric that actually matters
    conversion_lift = (
        model_cohort.conversion_rate - 
        control_cohort.conversion_rate
    )
    
    if conversion_lift < MINIMUM_LIFT:
        alert("Model not providing expected lift")

Anti-Patterns to Avoid

1. Notebook → Production

# This is not deployment
model.save("model.pkl")
# Put it on the server somehow???

2. No Rollback Plan

Always have:

  • Previous model version ready
  • One-click rollback procedure
  • Automated rollback on metric degradation (see the sketch below)
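
A sketch of the automated check (the registry interface, metric names, and thresholds are illustrative assumptions):

def maybe_rollback(live, baseline, registry):
    # Trip on clear degradation, not noise; hence generous multipliers
    degraded = (
        live["error_rate"] > 1.5 * baseline["error_rate"]
        or live["p99_latency_ms"] > 2 * baseline["p99_latency_ms"]
    )
    if degraded:
        previous = registry.get_previous_version("my-model")
        registry.promote(previous)  # the pre-tested one-click path
        alert(f"Auto-rolled back to {previous.version}")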

3. Optimizing for Offline Metrics Only

Offline accuracy: 94%
Online conversion: Dropped 15%

Why? Latency increased from 20ms to 200ms.

Always measure end-to-end business impact.

The Production Checklist

Before any model goes live:

  • [ ] Training-serving feature parity verified
  • [ ] Model versioned with full provenance
  • [ ] Shadow deployment completed
  • [ ] Fallback mechanisms tested
  • [ ] Monitoring dashboards ready
  • [ ] Rollback procedure documented
  • [ ] On-call runbook updated

Production ML is 10% models, 90% engineering. Plan accordingly.


*Next: Feature Engineering at Scale - Lessons from 1B+ Daily Predictions*
