ML Ops Flashcards

(7 cards)

1
Q

Let’s say we’ve built a churn prediction model for a subscription service using user behavior data from the past 12 months, and now we want to deploy it to production and also apply it to a new user segment that wasn’t in our original training data.

How would you approach preparing the model for deployment and ensuring it generalizes well to the new dataset?

A

1. Deployment Preparation:

  • Package model with preprocessing steps (scaling, encoding) for consistency.
  • Test inference pipeline on representative data.
  • Set monitoring for latency, errors, and system performance.
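The packaging idea above can be sketched as a small bundle that carries training-time preprocessing statistics alongside the model, so serving applies exactly the same transforms (the feature names, scaler, and stand-in model below are hypothetical):

```python
class ChurnModelBundle:
    """Bundle preprocessing parameters with the model so the serving
    path applies exactly the same transforms as training.
    Illustrative sketch; feature names and the model are hypothetical."""

    def __init__(self, model, feature_means, feature_stds):
        self.model = model
        self.feature_means = feature_means
        self.feature_stds = feature_stds

    def preprocess(self, row):
        # Standardize each feature with the *training-time* statistics,
        # never with statistics recomputed on serving traffic.
        return {
            f: (row[f] - self.feature_means[f]) / self.feature_stds[f]
            for f in self.feature_means
        }

    def predict(self, row):
        return self.model(self.preprocess(row))

# Package once at training time; load the same artifact in production.
bundle = ChurnModelBundle(
    model=lambda x: 1 if x["logins_per_week"] < 0 else 0,  # stand-in model
    feature_means={"logins_per_week": 3.0},
    feature_stds={"logins_per_week": 1.5},
)
```

Serializing this one object (rather than model and preprocessing separately) removes a common source of training/serving skew.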

2. Ensuring Generalization:

  • Validate on new segment: Collect a sample of the new users and evaluate model performance.
  • Domain adaptation: Retrain or fine-tune the model using new segment data if available.
  • Feature consistency: Ensure features in new data match the format and distribution of training data.
  • Monitor drift: Track changes in input features and model predictions over time.
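One common drift check for the new segment is the Population Stability Index; a minimal pure-Python sketch (the bin count and the interpretation thresholds are conventional rules of thumb, not from the card):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a training-time feature sample
    (`expected`) and a new-segment or recent sample (`actual`).
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        # Small floor avoids log(0) for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    p, q = fractions(expected), fractions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

Running this per feature, training sample vs. new-segment sample, gives a quick ranking of which inputs shifted most.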

3. Feedback Loop:

  • Collect new user outcomes (churn labels) to update the model periodically.
  • Adjust thresholds if model predictions show bias or miscalibration for the new segment.
2
Q

Let’s say we’ve trained a machine learning model on SageMaker and now want to deploy it as an API endpoint that can handle up to 100 requests per second with low latency.

How would you design the end-to-end deployment architecture using AWS services, and what considerations would you have for scalability and monitoring?

A

1. Deployment Architecture:

  • Model Hosting: Use SageMaker Endpoint (real-time inference).
  • API Gateway: Front SageMaker endpoint with Amazon API Gateway to handle requests and routing.
  • Load Balancing: Optional AWS Application Load Balancer if multiple endpoints are used.
  • Auto-scaling: Enable SageMaker endpoint autoscaling to handle traffic spikes.
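Before enabling autoscaling, a back-of-envelope sizing check helps pick a starting instance count for the 100 req/s target (the latency, worker count, and headroom figures below are illustrative assumptions, not SageMaker defaults):

```python
import math

def instances_needed(target_rps, p99_latency_s, workers_per_instance,
                     headroom=0.5):
    """Back-of-envelope sizing for a real-time endpoint.
    Each worker serves roughly 1 / p99_latency_s requests per second;
    `headroom` leaves slack for spikes while autoscaling reacts."""
    per_instance_rps = workers_per_instance * (1.0 / p99_latency_s)
    usable_rps = per_instance_rps * (1.0 - headroom)
    return math.ceil(target_rps / usable_rps)

# e.g. 100 req/s target, 50 ms p99 latency, 4 model workers per instance:
# 80 rps per instance raw; with 50% headroom, 40 usable rps.
```

The same per-instance throughput number then informs the autoscaling target (e.g. an invocations-per-instance policy).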

2. Scalability Considerations:

  • Choose appropriate instance type (CPU vs GPU) based on model size and inference latency.
  • Configure endpoint auto-scaling policies based on request rate or CPU/GPU utilization.
  • Use multi-model endpoints to serve multiple models efficiently from shared instances.

3. Monitoring & Logging:

  • Enable CloudWatch metrics for latency, request count, error rates.
  • Use CloudWatch Logs to capture detailed request/response logs.
  • Monitor model drift and prediction quality over time.

4. Additional Considerations:

  • Implement caching for repeated requests if possible.
  • Secure the endpoint with IAM roles, VPC, and API keys.
  • Set timeout and retry policies to maintain low latency under load.

Summary:
Use SageMaker real-time endpoints with API Gateway, autoscaling, monitoring via CloudWatch, and proper instance selection to ensure low-latency, scalable, and secure API deployment.

3
Q

Let’s say we’ve deployed a machine learning model to flag potentially fraudulent transactions, and over the past month, its precision has dropped from 92% to 78%.

How would you go about investigating the reasons for this decline in precision, and what steps could we take to improve the model’s performance?

A

1. Investigate the decline:

  • Check data drift: Compare feature distributions of recent transactions vs training data.
  • Review labeling errors: Ensure fraud labels in recent data are accurate.
  • Examine model inputs: Look for missing, malformed, or new feature values.
  • Analyze false positives: Identify patterns causing incorrect fraud flags.
  • Monitor external changes: New merchants, payment methods, or fraud schemes.
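A quick way to surface false-positive patterns is a breakdown by categorical feature; a minimal sketch (the field names `flagged`, `is_fraud`, and the grouping key are hypothetical):

```python
from collections import Counter

def false_positive_breakdown(alerts, key):
    """Count false positives (flagged but not actually fraud) by a
    categorical feature, to surface patterns such as a new merchant
    category dominating the incorrect flags."""
    fps = [a for a in alerts if a["flagged"] and not a["is_fraud"]]
    return Counter(a[key] for a in fps).most_common()
```

Running this over the last month's alerts, keyed by merchant type or payment method, often points straight at the segment driving the precision drop.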

2. Steps to improve performance:

  • Retrain or fine-tune the model on recent data including new fraud patterns.
  • Update features to capture new behaviors or risk indicators.
  • Adjust decision thresholds to optimize precision vs recall for current conditions.
  • Use ensemble methods or anomaly detection to catch novel fraud patterns.
  • Implement a feedback loop: Incorporate human-in-the-loop corrections from recent alerts.
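Threshold adjustment can be done by rescoring recent labeled data and picking the highest-precision cutoff that keeps recall above a floor; a minimal sketch (the recall floor is an illustrative business constraint):

```python
def pick_threshold(scores, labels, min_recall=0.6):
    """Return (threshold, precision) maximizing precision subject to
    a recall floor, computed on recent labeled transactions."""
    best = (None, 0.0)
    total_pos = sum(labels)
    for t in sorted(set(scores)):
        preds = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        flagged = sum(preds)
        if flagged == 0 or total_pos == 0:
            continue
        precision, recall = tp / flagged, tp / total_pos
        if recall >= min_recall and precision > best[1]:
            best = (t, precision)
    return best
```

Recomputing this periodically on fresh labels keeps the operating point matched to current fraud patterns.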

Summary:
Investigate drift, errors, and new patterns; update features or retrain; adjust thresholds; and leverage human feedback to restore high precision.

4
Q

Let’s say you’re a data analyst at a company that uses a machine learning model to dynamically set prices for a ride-hailing service. Yesterday, you noticed that the model’s average ride price dropped from $15 to $7 for about an hour before returning to normal.

What hypotheses would you consider for this brief drop in average ride price, and how would you go about investigating the cause?

A

Possible hypotheses:

  1. Data issues: Missing or incorrect input data (e.g., demand, traffic, or surge factors).
  2. Model behavior: Model mispredicted due to unusual patterns or feature values.
  3. External events: Temporary surge in supply or drop in demand (e.g., weather, events).
  4. System errors: Deployment bug or misconfiguration affecting pricing calculations.

Investigation steps:

  • Check input data logs during the affected hour for anomalies.
  • Examine model predictions and feature values to identify unusual inputs.
  • Review external factors (traffic, events, promotions) impacting pricing.
  • Audit deployment and pipeline logs for errors or rollback events.
  • Compare with historical patterns for similar anomalies.
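The historical comparison can start with a simple trailing-window z-score scan over hourly average prices (the window size and cutoff below are illustrative choices):

```python
import statistics

def flag_price_anomalies(hourly_avg, window=24, z_cut=3.0):
    """Flag hours whose average price deviates sharply from the trailing
    window; a first-pass scan before digging into logs for those hours."""
    anomalies = []
    for i in range(window, len(hourly_avg)):
        hist = hourly_avg[i - window:i]
        mu, sd = statistics.mean(hist), statistics.pstdev(hist)
        if sd > 0 and abs(hourly_avg[i] - mu) / sd > z_cut:
            anomalies.append(i)
    return anomalies
```

Flagged hours then become the focus for the input-data, model-output, and deployment-log checks above.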

Summary:
Combine data validation, model output inspection, system logs, and external context to identify the root cause of temporary price drops.

5
Q

Let’s say we manage a cloud infrastructure platform, and over the last 6 months, we’ve noticed that 20% of our compute nodes consistently sit idle while others run near full capacity.

How would you analyze and address this stranded capacity to improve overall resource utilization?

A

1. Analyze the problem:

  • Monitor utilization metrics: CPU, memory, disk, network per node.
  • Identify patterns: Are idle nodes tied to specific workloads, regions, or times?
  • Check scheduling policies: Are workloads unevenly distributed due to resource allocation rules?
  • Review task characteristics: Some jobs may require specific node types, causing fragmentation.
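A first pass at quantifying the problem is to classify nodes by average utilization and measure the stranded fraction (the 10% / 85% cutoffs are illustrative, not from the card):

```python
def classify_nodes(cpu_by_node, idle_pct=10.0, hot_pct=85.0):
    """Split nodes into idle / hot by average CPU utilization, the
    first step in quantifying stranded capacity."""
    idle = [n for n, u in cpu_by_node.items() if u < idle_pct]
    hot = [n for n, u in cpu_by_node.items() if u > hot_pct]
    return idle, hot

def stranded_fraction(cpu_by_node, idle_pct=10.0):
    """Fraction of the fleet sitting below the idle threshold."""
    idle = sum(1 for u in cpu_by_node.values() if u < idle_pct)
    return idle / len(cpu_by_node)
```

Repeating the classification per region, node type, and time of day reveals whether the idleness is structural (fragmentation, scheduling rules) or temporal.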

2. Address stranded capacity:

  • Load balancing & scheduling: Implement smarter workload placement or autoscaling.
  • Consolidation: Migrate small workloads to fewer nodes to free idle ones.
  • Resource pooling: Use cluster-wide resource pools to allow dynamic allocation.
  • Right-sizing instances: Match node types to workload requirements.
  • Implement policies: Enforce fair-share scheduling or preemptible workloads to fill gaps.

3. Continuous monitoring:

  • Track utilization trends to detect recurring idle resources.
  • Adjust scaling and allocation rules dynamically.
6
Q

Let’s say we’re responsible for deploying a new version of Uber’s routing system, and the release includes both backend and frontend changes that may impact user workflows.

How would you design the deployment process to minimize risk, ensure clear communication with both clients and internal teams, and handle rollbacks if unexpected issues arise?

A

1. Deployment Strategy:

  • Staged rollout / Canary releases: Deploy to a small subset of users first to monitor impact.
  • Blue-Green deployment: Run the new version alongside the old, shift traffic to it, and keep the old version warm for instant rollback.
  • Feature flags: Enable or disable new features dynamically without full redeploy.
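Staged rollouts and feature flags are often driven by deterministic hash bucketing, so a given user's assignment is stable and raising the percentage only adds users; a minimal sketch:

```python
import hashlib

def in_rollout(user_id, feature, percent):
    """Deterministically bucket a user into a staged rollout.
    The same (feature, user) pair always hashes to the same bucket,
    so increasing `percent` never flips already-enabled users off."""
    digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < percent
```

Gating both backend and frontend changes behind the same flag keeps the two halves of the release in sync during the canary phase.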

2. Risk Mitigation:

  • Automated testing: Unit, integration, and end-to-end tests before deployment.
  • Monitoring & alerting: Track errors, latency, user behavior, and key metrics in real time.
  • Rollback plan: Predefine steps to revert to previous version quickly if issues arise.

3. Communication:

  • Internal teams: Share release notes, expected changes, and escalation paths.
  • Clients/users: Notify of potential downtime or new workflows if needed; provide support channels.

4. Post-deployment:

  • Gradually increase user traffic to the new version.
  • Monitor logs, metrics, and user feedback closely.
  • Iterate and fix issues promptly before full-scale rollout.

Summary:
Use staged deployment, feature flags, and automated testing to reduce risk, maintain clear communication, monitor closely, and have a rollback plan ready to ensure safe and smooth release.

7
Q

Let’s say we’re launching a new version of a payments app, but due to a tight deadline, we don’t have time to run the entire suite of 300 automated test cases before release. How would you handle this situation?

A

1. Prioritize tests:

  • Run critical test cases covering core payment flows, security, and high-risk features first.
  • Use risk-based testing to focus on areas most likely to fail.
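Risk-based prioritization can be as simple as scoring each test by impact times recent failure rate and filling the available time budget (the field names below are hypothetical):

```python
def prioritize(tests, budget):
    """Order tests by risk score (failure impact x recent failure rate)
    and keep as many as fit in the time budget (minutes)."""
    ranked = sorted(tests, key=lambda t: t["impact"] * t["fail_rate"],
                    reverse=True)
    chosen, used = [], 0.0
    for t in ranked:
        if used + t["minutes"] <= budget:
            chosen.append(t["name"])
            used += t["minutes"]
    return chosen
```

Core payment and security suites should get a high enough impact score that they always make the cut regardless of budget.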

2. Staged rollout:

  • Deploy to a small subset of users (canary release) to monitor for issues before full release.

3. Monitoring & Alerts:

  • Set up real-time monitoring for errors, transaction failures, and system performance.
  • Prepare alerting and rollback procedures in case issues arise.

4. Communication & Contingency:

  • Inform internal teams and support about potential risks.
  • Plan rapid patch deployment for critical issues post-release.