Interview Flashcards

(20 cards)

1
Q

What is abstraction in Python OOP?

A

Abstraction means exposing only the necessary parts of an object while hiding the complex internal implementation.

from abc import ABC, abstractmethod

class Animal(ABC):
    @abstractmethod
    def make_sound(self):
        pass  # subclasses must provide this

class Dog(Animal):
    # Concrete subclass: must implement make_sound to be instantiable
    def make_sound(self):
        return "Woof"
2
Q

What is an interface?

A

A contract that specifies which methods a class must implement, without specifying how.
Python uses ABCs (abstract base classes).
TypeScript uses the interface keyword.
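In Python, the same idea can also be expressed structurally with typing.Protocol, a minimal sketch with illustrative class names:

```python
from typing import Protocol

class Greeter(Protocol):
    """Interface: anything with a greet() -> str method conforms."""
    def greet(self) -> str: ...

class FrenchGreeter:
    # Conforms to Greeter structurally, without inheriting from it.
    def greet(self) -> str:
        return "Bonjour"

def welcome(g: Greeter) -> str:
    # Accepts any object that matches the Greeter protocol.
    return g.greet() + "!"
```

Unlike an ABC, a Protocol does not require explicit inheritance; conformance is checked by shape, which is closer to how TypeScript interfaces behave.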

3
Q

What is generative AI?

A

“Generative AI models learn patterns from data and create new content — text, images, code, audio — that resembles the training distribution. Most modern generative models use transformers and autoregressive decoding.”

4
Q
“Difference between Generative AI and Agentic AI?”

A

Generative AI → produces content.

Agentic AI → takes actions, uses tools, calls APIs, interacts with environments.
Dust agents use LLM + tools + memory + workflows, not just text generation.

5
Q

“What is LLM-as-a-judge?”

A

“A strong LLM is used to evaluate or score the output of another model. It can perform pairwise comparisons, give qualitative ratings, or act as a proxy for human evaluation.”
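A minimal sketch of the judging step; the judge model call itself is assumed, and the prompt template plus the "Score: n" reply format are illustrative, not a standard:

```python
import re

# Illustrative judge prompt; a hypothetical call_llm(prompt) would send it
# to a strong judge model and return its reply as a string.
JUDGE_TEMPLATE = (
    "You are an impartial judge. Rate the answer to the question on a 1-5 scale.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Reply with 'Score: <n>'."
)

def build_judge_prompt(question: str, answer: str) -> str:
    return JUDGE_TEMPLATE.format(question=question, answer=answer)

def parse_score(judge_reply: str) -> int:
    # Extract the numeric score from the judge model's reply.
    match = re.search(r"Score:\s*([1-5])", judge_reply)
    if match is None:
        raise ValueError("no score found in judge reply")
    return int(match.group(1))
```

Pairwise comparison works the same way: put two candidate answers in one prompt and ask the judge which one is better.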

6
Q

Which evaluation metrics would you use for assessing an AI assistant?

A

Accuracy / correctness

Relevance

Safety / toxicity

Latency

Cost per request

Robustness to edge cases

Hallucination rate
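A toy aggregation over per-request eval records covering several of these metrics; the record fields are assumptions, not a fixed schema:

```python
def summarize(results):
    """Aggregate per-request eval records into summary metrics.

    Each record is assumed to look like:
    {"correct": bool, "hallucinated": bool,
     "latency_ms": float, "cost_usd": float}
    """
    n = len(results)
    return {
        "accuracy": sum(r["correct"] for r in results) / n,
        "hallucination_rate": sum(r["hallucinated"] for r in results) / n,
        "avg_latency_ms": sum(r["latency_ms"] for r in results) / n,
        "avg_cost_usd": sum(r["cost_usd"] for r in results) / n,
    }
```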

7
Q

How would you design an evaluation pipeline for Dust agents?

A

“I would design the pipeline in four layers:

  1. Dataset & test-set creation
    – Golden sets: curated high-quality Q/A pairs
    – Edge-case sets: adversarial, ambiguous, malformed inputs
    – Real-world logs sampling from Dust usage
    – Versioned via DVC or Git-LFS
  2. Orchestration & execution
    – A scheduler (Airflow, Prefect, or simple CI job) runs weekly/auto tests
    – Each run sends inputs through the current agent version
    – Outputs stored with metadata: model version, prompt template, timestamps
  3. Scoring & metrics
    – Automatic metrics: accuracy, latency, token cost
    – LLM-as-judge scoring for qualitative metrics
    – Safety classifiers / toxicity checks
    – Regression detection: comparing current metrics to previous runs
  4. Reporting & monitoring
    – Dashboards (Grafana, Superset, or internal tools)
    – Alerts when score drifts or regression detected
    – Integrate with CI so releases require passing evaluation thresholds

This matches Yousign’s JD: reproducibility, traceability, versioning, metrics, quality SLOs, dashboards, and CI/CD integration.”
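The regression-detection step in layer 3 could be sketched like this, assuming higher-is-better metrics; the metric names and tolerance are illustrative:

```python
def detect_regressions(previous, current, tolerance=0.02):
    """Flag metrics where the current run is worse than the previous one.

    Assumes higher-is-better metrics; `tolerance` absorbs normal run-to-run
    noise so small fluctuations don't trigger alerts.
    """
    regressions = {}
    for name, prev_value in previous.items():
        curr_value = current.get(name)
        if curr_value is not None and curr_value < prev_value - tolerance:
            regressions[name] = (prev_value, curr_value)
    return regressions
```

In CI, a non-empty result would fail the release gate or fire an alert.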

8
Q

What tools do you know for LLM evaluation?

A

LangSmith – traces, evaluation datasets, LLM-as-judge

OpenAI Evals – automated evals with templates

Langfuse – observability and monitoring for LLM apps

Custom pipelines using Python + dashboards
Mention: “I already built my own in the RAG project.”

9
Q

Do you have experience with CI/CD?

A

“I’ve worked with GitHub Actions, Docker, AWS CI pipelines, and ZenML for ML CI. I wrote unit/integration tests for Next.js and backend APIs, so I’m used to test-driven delivery.”

10
Q

Why should we hire you?

A

“First, I already built evaluation workflows in my multi-agent RAG project, including LLM-as-judge scoring, test-set design, and metric tracking — exactly the core of this internship.

Second, I bring both ML skills and software engineering experience from full-stack development and real-time pipelines. I’m comfortable with Python, TS, CI/CD, data versioning, and production constraints.

Finally, I’m genuinely motivated by reliability and quality — I love building things that are measurable, reproducible, and robust. I believe I can contribute quickly and help scale Yousign’s AI productivity agents in a safe and controlled way.”

11
Q

What’s the difference between fine-tuning and prompt engineering?

A

Prompt engineering = steer the model without changing weights.
Fine-tuning = update weights using domain-specific data.

When to use:

Prompt engineering → when data is small, tasks are simple.

Fine-tuning → when model must learn new domain knowledge or style.

12
Q

How do you evaluate an LLM for correctness, safety, and cost?

A

We need multi-dimensional evaluation:

Correctness → accuracy, groundedness, hallucination rate (I’ve done this in my RAG multi-agent benchmark).

Safety → jailbreaking tests, refusal consistency.

Cost → per-token computation, latency, throughput.

In my RAG project, I measured accuracy + hallucination rate + latency, which maps directly to Yousign’s needs (safety + reliability + cost).
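Cost per request follows directly from token counts and per-1k-token prices; a sketch where the prices are placeholders, not real rates:

```python
def cost_per_request(prompt_tokens, completion_tokens,
                     input_price_per_1k, output_price_per_1k):
    """Cost of one request given per-1k-token prices.

    Prices are placeholders; real providers publish separate input and
    output rates, which is why the two token counts are priced apart.
    """
    return (prompt_tokens / 1000 * input_price_per_1k
            + completion_tokens / 1000 * output_price_per_1k)
```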

13
Q

Design a system to evaluate internal agents used by Yousign

A

To evaluate AI agents for productivity workflows, I’d implement a three-stage evaluation loop:

Dataset creation:

Collect real internal tasks (summaries, document understanding, template generation).

Build a labeled benchmark: expected outputs, constraints, rejection cases.

Agent evaluation pipeline:

Automatically run each agent on a large batch of tasks.

Record metrics: accuracy, groundedness, hallucination rate, latency, cost/token.

Comparison & continuous monitoring:

Compare agents against each other using statistical scoring.

Create dashboards for drift detection and performance degradation.

My experience building a multi-agent evaluation workflow with LangGraph matches this exactly — I already implemented automated task/agent comparison and used reflection patterns to assess quality.
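The batch-evaluation step above as a sketch, assuming an agent is just a callable from input to output (the record fields are illustrative):

```python
import time

def run_batch(agent, tasks, model_version):
    """Run one agent over a batch of tasks, recording output and metadata.

    `agent` is any callable input -> answer (a stand-in for a real agent
    invocation); each task is assumed to carry an "id" and an "input".
    """
    records = []
    for task in tasks:
        start = time.perf_counter()
        output = agent(task["input"])
        records.append({
            "task_id": task["id"],
            "output": output,
            "latency_s": time.perf_counter() - start,
            "model_version": model_version,
        })
    return records
```

Storing the model version with every record is what makes later agent-vs-agent comparison and drift dashboards possible.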

14
Q

How do you measure hallucinations?

A

I measure hallucinations using:

Context Recall → how much of the ground-truth answer is covered by the retrieved context

Context Precision → how much of the retrieved context is actually relevant to the question

Faithfulness score (e.g., RAGAS)

Manual annotation for high-risk tasks

In my RAG multi-agent project, hallucination dropped 15% using a reflection-based workflow.
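A toy word-overlap proxy for faithfulness; real tools like RAGAS use LLM-based claim checks, this only illustrates the idea:

```python
def toy_faithfulness(answer: str, context: str) -> float:
    """Toy proxy for faithfulness: share of answer words found in the context.

    A score of 1.0 means every answer word appears somewhere in the context;
    lower scores hint at unsupported (possibly hallucinated) content.
    """
    answer_words = set(answer.lower().split())
    context_words = set(context.lower().split())
    if not answer_words:
        return 1.0
    return len(answer_words & context_words) / len(answer_words)
```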

15
Q

Explain the tradeoff between accuracy, latency, cost, and safety.

A

Accuracy ↑ → Latency ↑ & Cost ↑
Better models are slower and more expensive.

Safety ↑ → Accuracy ↓ sometimes
Safe models refuse more often.

Low latency → smaller or optimized models

16
Q

How do you measure whether retrieval improves accuracy in a RAG pipeline?

A

Compare 3 conditions:

No RAG (LLM alone)

RAG with dense retrieval

RAG with hybrid (BM25 + embedding)

Use metrics:

Faithfulness

Accuracy

Recall@k

Hallucination rate
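Recall@k can be computed directly from the retrieved ranking; a minimal sketch:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k retrieved.

    `retrieved_ids` is the ranked list returned by the retriever;
    `relevant_ids` is the labeled ground truth for the query.
    """
    top_k = set(retrieved_ids[:k])
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(top_k & relevant) / len(relevant)
```

Running this for the dense and hybrid conditions on the same queries shows whether the retrieval change is what moved end-to-end accuracy.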

17
Q

What is an agentic AI workflow?

A

An agentic AI workflow is a process where an LLM-based app executes multiple steps to complete a task (plan → tools → action → revise).
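As a toy sketch of that plan → tool → revise loop, with the LLM calls replaced by stand-in callables:

```python
def agentic_loop(llm_plan, tools, llm_revise, task, max_steps=3):
    """Toy plan -> tool -> revise loop.

    The llm_* callables are stand-ins for real LLM calls, and `tools`
    maps tool names to plain functions; everything here is illustrative.
    """
    result = None
    for _ in range(max_steps):
        tool_name, tool_input = llm_plan(task, result)  # plan: pick a tool
        result = tools[tool_name](tool_input)           # action: run the tool
        task, done = llm_revise(task, result)           # revise: done or retry?
        if done:
            break
    return result
```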

18
Q

What are LLMs?

A

“LLMs, or Large Language Models, are a class of machine learning models trained on massive text datasets to understand and generate human-like language. They’re typically built using transformer architectures, which allow them to capture long-range context and relationships in text efficiently. Because of this, LLMs can perform a wide range of tasks—like text generation, summarization, translation, question answering, code completion—without being explicitly programmed for each one.

In practice, an LLM works by predicting the next token in a sequence, but thanks to large-scale training and fine-tuning, it can generalize patterns, reason over context, and adapt its responses to different domains. Today, LLMs play a major role in applications such as chatbots, productivity tools, search, and even security log analysis.”
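The next-token idea can be illustrated with a toy bigram counter; a real LLM learns this with a transformer over huge corpora, not raw counts:

```python
from collections import Counter, defaultdict

def train_bigrams(tokens):
    """Toy autoregressive model: count which token follows which."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, token):
    """Predict the most frequent continuation, mimicking next-token prediction."""
    if token not in counts:
        return None
    return counts[token].most_common(1)[0][0]
```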

19
Q

Data drift

A

Data drift happens when the input data changes compared to the data the model was trained on.

The world changes, so data distributions move.
Examples:

Users start using new types of wording in support tickets.

A cybersecurity dataset suddenly has new attack patterns.

A price prediction model sees inflation that wasn’t in the training data.

Consequence:

The model sees new kinds of inputs and can’t generalize well → accuracy drops.
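One crude way to detect such drift on a numeric input feature, as a sketch (the threshold of 3 standard deviations is an arbitrary example; production systems use statistical tests like KS or PSI):

```python
def mean_shift_drift(train_values, live_values, threshold=3.0):
    """Crude drift check: flag if the live mean moved more than `threshold`
    training standard deviations away from the training mean."""
    n = len(train_values)
    train_mean = sum(train_values) / n
    variance = sum((x - train_mean) ** 2 for x in train_values) / n
    std = variance ** 0.5
    live_mean = sum(live_values) / len(live_values)
    if std == 0:
        return live_mean != train_mean
    return abs(live_mean - train_mean) / std > threshold
```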