System Design - Problem Solving Flashcards

(4 cards)

1
Q

How would you design a customer support chatbot for an online banking platform (like Capital One) to ensure it provides secure, helpful, and relevant responses?

A

Architecture: Chatbot frontend + secure backend APIs (no direct DB access).

Security: Enforce authentication (OAuth2, MFA), encrypt data, mask sensitive info, use least privilege.

NLU: Detect intents (e.g., check balance, lost card) and retrieve FAQs safely.

Responses: Use fixed templates for account data; use an LLM or retrieval for general FAQs, grounding answers in retrieved documents to avoid hallucinations.

Masking: Restrict the bot to the authenticated customer's own data; never expose other customers' information or data outside the current session's scope.

Monitoring: Log safely, detect misuse, retrain on anonymized data.

Compliance: Follow GDPR, PCI DSS, and banking privacy laws.
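The intent-detection and templated-response flow above can be sketched minimally. The intent names, templates, and masking helper below are illustrative assumptions; a production system would use a trained NLU model and a real policy engine.

```python
# Minimal sketch of the intent -> template -> masking flow.
# Intents, templates, and the regex are hypothetical examples.
import re

TEMPLATES = {
    "check_balance": "Your available balance is {balance}.",
    "lost_card": "I've flagged your card as lost. A replacement will be mailed to you.",
}

def mask_account_number(text: str) -> str:
    """Show only the last 4 digits of anything that looks like an account number."""
    return re.sub(r"\b\d{8,16}\b", lambda m: "****" + m.group()[-4:], text)

def detect_intent(message: str) -> str:
    """Toy keyword-based intent detection; a real system would use an NLU model."""
    msg = message.lower()
    if "balance" in msg:
        return "check_balance"
    if "lost" in msg and "card" in msg:
        return "lost_card"
    return "fallback"

def respond(message: str, account_data: dict) -> str:
    intent = detect_intent(message)
    if intent in TEMPLATES:
        # Account data flows only through fixed templates, never free-form LLM text.
        return mask_account_number(TEMPLATES[intent].format(**account_data))
    return "Let me connect you with a support agent."
```

Note the design choice: sensitive account fields only ever appear inside fixed templates, so the LLM path never sees raw account data.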

2
Q

As a consultant advising a fintech on using generative AI for fraud detection with a labeled dataset of 50,000 transactions, how would you decide between RAG, prompt engineering, or fine-tuning? What factors guide your choice, and how would your recommendation change if the dataset doubled?

A

Choice depends on dataset size, task specificity, and performance needs:

  • Prompt engineering: Quick, low-resource; suitable for small datasets or exploratory tasks; limited accuracy for complex fraud patterns.
  • RAG: Combines retrieval with LLM reasoning; handles rare or evolving fraud cases; useful for complex patterns and explainability.
  • Fine-tuning / PEFT: Uses labeled data to adapt the model; best for medium/large datasets (50k+); provides high accuracy on domain-specific fraud patterns; requires more compute.

Factors guiding choice:

  • Dataset size & quality: Small → Prompt or RAG; Medium/Large → Fine-tuning/PEFT
  • Task complexity & explainability: Simple → Prompt; Complex → RAG or Fine-tuning
  • Resources: Limited → Prompt or RAG; Enough compute → Fine-tuning/PEFT
  • Fraud pattern changes: Rapid → RAG; Stable → Fine-tuning/PEFT

If dataset doubles (~100k):

  • Fine-tuning/PEFT becomes more attractive due to more labeled data supporting supervised learning.
  • RAG still useful for rare or unusual fraud cases.
  • Prompt engineering alone likely insufficient for high accuracy.
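The decision factors above can be encoded as a toy heuristic. The thresholds and budget labels are illustrative assumptions, not industry standards.

```python
# Toy encoding of the decision factors above; thresholds are assumptions.
def recommend_approach(n_labeled: int, patterns_change_fast: bool,
                       compute_budget: str) -> str:
    if n_labeled < 5_000:
        return "prompt engineering (optionally with RAG)"
    if patterns_change_fast:
        return "RAG, possibly combined with a fine-tuned base model"
    if compute_budget == "limited":
        return "PEFT (e.g., LoRA) to reduce fine-tuning cost"
    return "full fine-tuning (or PEFT) on the labeled dataset"
```

With 50k labels, stable patterns, and ample compute this recommends fine-tuning; doubling to 100k only strengthens that case, matching the answer above.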
3
Q

Let’s say BCG is helping a large e-commerce client develop a generative AI tool that can automatically generate marketing copy and brand visuals by processing both product photos and text reviews (multi-modal), rather than relying on text data alone (unimodal). How would you think about the implications of using a multi-modal model versus a unimodal one in this business setting, and what steps would you take to identify and reduce potential biases in its outputs?

A

Implications of multi-modal vs unimodal:

  • Multi-modal:
    • Uses both product images and text reviews → richer, more context-aware outputs.
    • Generates visuals aligned with copy and customer sentiment.
    • More complex and compute-intensive; biases from images and text can combine.
  • Unimodal (text only):
    • Easier to train and deploy.
    • Limited context; may miss visual cues or aesthetic details.

Steps to identify and reduce biases:

  1. Audit training data: Check text and images for gaps or imbalances in product types, styles, or demographic representation.
  2. Analyze outputs: Look for stereotypical, unfair, or harmful content in copy and visuals.
  3. Bias mitigation:
    • Add diverse examples to training data
    • Filter or post-process outputs
    • Evaluate fairness and inclusivity metrics
  4. Human review: Marketing and ethics teams validate outputs before publishing.
  5. Continuous monitoring: Track outputs over time and retrain to correct emerging biases.

Summary: Multi-modal models produce richer, aligned outputs but need careful bias auditing; unimodal models are simpler but less context-aware.
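Step 1 (auditing training data for imbalance) can be sketched as a simple representation check. The attribute names, sample data, and 5% threshold are hypothetical.

```python
# Sketch of a training-data representation audit; threshold is an assumption.
from collections import Counter

def audit_representation(records: list[dict], attribute: str,
                         min_share: float = 0.05) -> list[str]:
    """Flag attribute values that are underrepresented in the training set."""
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    return sorted(v for v, c in counts.items() if c / total < min_share)

# Hypothetical catalog sample: 96% outdoor gear, 4% formalwear.
data = [{"category": "outdoor"}] * 96 + [{"category": "formalwear"}] * 4
# audit_representation(data, "category") flags "formalwear" (4% < 5%)
```

The same idea extends to image attributes (styles, demographics) once they are labeled, feeding the mitigation steps listed above.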

4
Q

You work as a machine learning engineer at Amazon, focusing on product discovery. Your team is tasked with building a system that enables customers to enter a text description, such as “red hiking backpack with water bottle holder”, and retrieve the most relevant product images from Amazon’s vast catalog.

How would you design this system end-to-end?

A

1. Data Preparation:

  • Collect product images, titles, descriptions, and metadata.
  • Clean and normalize text (tokenization, lowercasing, remove stopwords).
  • Preprocess images (resize, normalize; optionally precompute embeddings offline).

2. Feature Representation:

  • Text: Encode using a pre-trained language model (e.g., BERT, Sentence-BERT) to get dense embeddings.
  • Images: Encode using a pre-trained vision model (e.g., ResNet, CLIP visual encoder) to get image embeddings.
  • Optional: Use a multi-modal model (like CLIP) to embed text and images in the same vector space.

3. Indexing & Retrieval:

  • Store image embeddings in a vector database (e.g., FAISS, Milvus, Pinecone).
  • At query time: Encode user text into the same embedding space.
  • Retrieve top-K nearest image embeddings using cosine similarity or dot product.
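The query-time retrieval step above can be sketched with plain cosine similarity. The toy 3-d vectors stand in for CLIP-style embeddings; a real system would use a vector database like FAISS instead of a linear scan.

```python
# Minimal sketch of top-K retrieval over precomputed image embeddings.
# The 3-d vectors are toy stand-ins for CLIP-style embeddings.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_emb, image_embs, k=2):
    """Return indices of the k catalog images most similar to the query."""
    ranked = sorted(range(len(image_embs)),
                    key=lambda i: cosine(query_emb, image_embs[i]),
                    reverse=True)
    return ranked[:k]

catalog = [[0.9, 0.1, 0.0],   # e.g., red backpack
           [0.0, 1.0, 0.2],   # e.g., blue tent
           [0.8, 0.2, 0.1]]   # e.g., red daypack
query = [1.0, 0.0, 0.1]       # embedding of "red hiking backpack ..."
```

Here `top_k(query, catalog)` surfaces the two red packs first, which is exactly the behavior the shared embedding space is meant to deliver.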

4. Ranking & Post-processing:

  • Re-rank retrieved results using additional signals: popularity, relevance, or business rules.
  • Optional filtering by category, price, or availability.

5. System Design Considerations:

  • Use caching for popular queries.
  • Ensure low-latency retrieval for real-time user queries.
  • Monitor system performance and periodically update embeddings as catalog grows.

6. Evaluation:

  • Metrics: Precision@K (fraction of top-K retrieved items that are relevant), Recall@K (fraction of all relevant items that appear in the top-K results), and MRR (mean of the reciprocal rank of the first relevant item across queries) → together these evaluate how well the system ranks relevant results at the top.
  • Perform A/B tests to measure user satisfaction and click-through rate.
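The three metrics can be computed directly; the item IDs and relevance sets below are made-up examples.

```python
# The evaluation metrics above, on toy data; relevance labels are assumed.
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for item in retrieved[:k] if item in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top-k results."""
    return sum(1 for item in retrieved[:k] if item in relevant) / len(relevant)

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant item per query."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(retrieved, start=1):
            if item in relevant:
                total += 1 / rank
                break
    return total / len(ranked_lists)
```

For example, retrieving `["a", "b", "c"]` when `{"a", "c", "d"}` is relevant gives Precision@3 = Recall@3 = 2/3.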

Summary:
Transform both text and images into embeddings (ideally in a shared space), store image embeddings in a vector database, retrieve nearest neighbors for user queries, then rank and filter results for relevance.
