What are the characteristics of abstraction in Python OOP?
Exposing only the necessary parts of an object and hiding the complex internal implementation.
from abc import ABC, abstractmethod

class Animal(ABC):
    @abstractmethod
    def make_sound(self):
        pass
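A concrete subclass has to implement every abstract method before it can be instantiated; a minimal sketch (the `Dog` class is my own example):

```python
from abc import ABC, abstractmethod

class Animal(ABC):
    @abstractmethod
    def make_sound(self):
        pass

class Dog(Animal):
    def make_sound(self):
        return "Woof"

Dog().make_sound()   # works: every abstract method is implemented
# Animal() raises TypeError: can't instantiate an abstract class
```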
What is an interface?
A contract that specifies which methods a class must implement, without specifying how.
Python uses ABCs (abstract base classes).
TypeScript uses interface keywords.
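Python can also express interfaces structurally with `typing.Protocol`, where matching the method shape is enough (the `Speaker`/`Robot` names are illustrative):

```python
from typing import Protocol

class Speaker(Protocol):
    def make_sound(self) -> str: ...

class Robot:  # no inheritance needed; matching the shape is enough
    def make_sound(self) -> str:
        return "Beep"

def announce(s: Speaker) -> str:
    return s.make_sound().upper()

announce(Robot())   # → 'BEEP'
```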
What is generative AI?
“Generative AI models learn patterns from data and create new content — text, images, code, audio — that resembles the training distribution. Most modern generative models use transformers and autoregressive decoding.”
Generative AI → produces content.
Agentic AI → takes actions, uses tools, calls APIs, interacts with environments.
Dust agents use LLM + tools + memory + workflows, not just text generation.
“What is LLM-as-a-judge?”
“A strong LLM is used to evaluate or score the output of another model. It can perform pairwise comparisons, give qualitative ratings, or act as a proxy for human evaluation.”
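A minimal pairwise-judge sketch, with the actual LLM call left out (only the prompt construction and verdict parsing, both my own illustrative helpers):

```python
def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Pairwise comparison prompt: the judge model must reply 'A' or 'B'."""
    return (
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Which answer is better? Reply with exactly one letter: A or B."
    )

def parse_verdict(raw: str) -> str:
    """Extract the judge's choice, tolerating whitespace and case noise."""
    verdict = raw.strip().upper()
    if verdict not in {"A", "B"}:
        raise ValueError(f"unparseable judge output: {raw!r}")
    return verdict

parse_verdict(" b\n")   # → 'B'
```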
Which evaluation metrics would you use for assessing an AI assistant?
Accuracy / correctness
Relevance
Safety / toxicity
Latency
Cost per request
Robustness to edge cases
Hallucination rate
How would you design an evaluation pipeline for Dust agents?
“I would design the pipeline in four layers:
1. Versioned benchmark datasets → reproducibility and traceability
2. Automated evaluation runs → metrics recorded per agent and per version
3. Dashboards and quality SLOs → continuous monitoring
4. CI/CD integration → every agent change is re-evaluated before release
This matches Yousign’s JD: reproducibility, traceability, versioning, metrics, quality SLOs, dashboards, and CI/CD integration.”
What tools do you know for LLM evaluation?
LangSmith – traces, evaluation datasets, LLM-as-judge
OpenAI Evals – automated evals with templates
Langfuse – observability and monitoring for LLM apps
Custom pipelines using Python + dashboards
Mention: “I already built my own in the RAG project.”
Do you have experience with CI/CD?
“I’ve worked with GitHub Actions, Docker, AWS CI pipelines, and ZenML for ML CI. I wrote unit/integration tests for Next.js and backend APIs, so I’m used to test-driven delivery.”
Why should we hire you?
“First, I already built evaluation workflows in my multi-agent RAG project, including LLM-as-judge scoring, test-set design, and metric tracking — exactly the core of this internship.
Second, I bring both ML skills and software engineering experience from full-stack development and real-time pipelines. I’m comfortable with Python, TS, CI/CD, data versioning, and production constraints.
Finally, I’m genuinely motivated by reliability and quality — I love building things that are measurable, reproducible, and robust. I believe I can contribute quickly and help scale Yousign’s AI productivity agents in a safe and controlled way.”
What’s the difference between fine-tuning and prompt engineering?
Prompt engineering = steer the model without changing weights.
Fine-tuning = update weights using domain-specific data.
When to use:
Prompt engineering → when data is small, tasks are simple.
Fine-tuning → when model must learn new domain knowledge or style.
How do you evaluate an LLM for correctness, safety, and cost?
We need multi-dimensional evaluation:
Correctness → accuracy, groundedness, hallucination rate (I’ve done this in my RAG multi-agent benchmark).
Safety → jailbreaking tests, refusal consistency.
Cost → per-token computation, latency, throughput.
In my RAG project, I measured accuracy + hallucination rate + latency, which maps directly to Yousign’s needs (safety + reliability + cost).
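These dimensions can be tracked together in one record; a sketch with field names of my own choosing, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    accuracy: float            # fraction of correct answers
    hallucination_rate: float  # unsupported-claim rate
    refusal_rate: float        # safety proxy: how often the model declines
    latency_ms: float
    cost_per_1k_tokens: float

def within_budget(r: EvalResult, max_latency_ms: float, max_cost: float) -> bool:
    """Gate a model on cost/latency before comparing quality."""
    return r.latency_ms <= max_latency_ms and r.cost_per_1k_tokens <= max_cost

r = EvalResult(0.91, 0.04, 0.02, 850.0, 0.5)
within_budget(r, max_latency_ms=1000, max_cost=1.0)   # → True
```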
Design a system to evaluate internal agents used by Yousign
To evaluate AI agents for productivity workflows, I’d implement a three-stage evaluation loop:
Dataset creation:
Collect real internal tasks (summaries, document understanding, template generation).
Build a labeled benchmark: expected outputs, constraints, rejection cases.
Agent evaluation pipeline:
Automatically run each agent on a large batch of tasks.
Record metrics: accuracy, groundedness, hallucination rate, latency, cost/token.
Comparison & continuous monitoring:
Compare agents against each other using statistical scoring.
Create dashboards for drift detection and performance degradation.
My experience building a multi-agent evaluation workflow with LangGraph matches this exactly — I already implemented automated task/agent comparison and used reflection patterns to assess quality.
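The three stages above can be sketched as a batch loop (the agents and scoring function here are toy stand-ins, not real Dust agents):

```python
import time
from statistics import mean

def run_benchmark(agents: dict, tasks: list, score_fn) -> dict:
    """Run every agent on every task; record per-agent accuracy and latency."""
    results = {}
    for name, agent in agents.items():
        scores, latencies = [], []
        for task in tasks:
            start = time.perf_counter()
            output = agent(task["input"])
            latencies.append((time.perf_counter() - start) * 1000)
            scores.append(score_fn(output, task["expected"]))
        results[name] = {"accuracy": mean(scores), "latency_ms": mean(latencies)}
    return results

# Toy agents standing in for real ones:
agents = {"echo": lambda x: x, "upper": lambda x: x.upper()}
tasks = [{"input": "hello", "expected": "HELLO"}]
exact = lambda out, exp: float(out == exp)
run_benchmark(agents, tasks, exact)
```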
How do you measure hallucinations?
I measure hallucinations using:
Context Recall → how much of the ground-truth answer is covered by the retrieved context
Context Precision → how much of the retrieved context is actually relevant
Faithfulness score (e.g., RAGAS) → fraction of answer claims supported by the context
Manual annotation for high-risk tasks
In my RAG multi-agent project, hallucination dropped 15% using a reflection-based workflow.
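A deliberately crude faithfulness sketch based on word overlap, nothing like the real RAGAS implementation:

```python
def faithfulness(answer_sentences: list, context: str, threshold: float = 0.5) -> float:
    """Fraction of answer sentences whose content words appear in the context."""
    context_words = {w.strip(".,").lower() for w in context.split()}
    supported = 0
    for sent in answer_sentences:
        # crude content-word filter: keep words longer than 3 characters
        words = [w.strip(".,").lower() for w in sent.split() if len(w.strip(".,")) > 3]
        if words and sum(w in context_words for w in words) / len(words) >= threshold:
            supported += 1
    return supported / len(answer_sentences) if answer_sentences else 0.0

ctx = "Yousign signatures are legally binding across the European Union."
ans = ["Signatures are legally binding.", "They cost nothing forever."]
faithfulness(ans, ctx)   # → 0.5 (second sentence unsupported)
```

The hallucination rate is then simply `1 - faithfulness`.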
Explain the tradeoff between accuracy, latency, cost, and safety.
Accuracy ↑ → Latency ↑ & Cost ↑
Better models are slower and more expensive.
Safety ↑ → Accuracy ↓ sometimes
Safe models refuse more often.
Low latency → smaller or optimized models
How do you measure whether retrieval improves accuracy in a RAG pipeline?
Compare 3 conditions:
No RAG (LLM alone)
RAG with dense retrieval
RAG with hybrid (BM25 + embedding)
Use metrics:
Faithfulness
Accuracy
Recall@k
Hallucination rate
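Recall@k for the retrieval step can be computed like this (document ids are illustrative):

```python
def recall_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)

recall_at_k(["d3", "d1", "d7", "d2"], {"d1", "d2"}, k=3)   # → 0.5 (only d1 in top 3)
```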
What is an agentic AI workflow?
An agentic AI workflow is a process where an LLM-based app executes multiple steps to complete a task (plan → use tools → act → revise).
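A toy plan → act → revise loop (the planner and tool are stand-ins, not a real agent framework):

```python
def agent_loop(goal: str, plan_fn, tools: dict, max_steps: int = 5) -> list:
    """Minimal agentic loop: plan the next action, dispatch to a tool, stop on 'done'."""
    history = []
    for _ in range(max_steps):
        action, arg = plan_fn(goal, history)   # plan the next action
        if action == "done":
            break
        history.append(tools[action](arg))     # act via a tool, record the observation
    return history

# Toy planner: search once, then finish.
def planner(goal, history):
    return ("done", None) if history else ("search", goal)

tools = {"search": lambda q: f"results for {q}"}
agent_loop("eval metrics", planner, tools)   # → ['results for eval metrics']
```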
What are LLMs?
“LLMs, or Large Language Models, are a class of machine learning models trained on massive text datasets to understand and generate human-like language. They’re typically built using transformer architectures, which allow them to capture long-range context and relationships in text efficiently. Because of this, LLMs can perform a wide range of tasks—like text generation, summarization, translation, question answering, code completion—without being explicitly programmed for each one.
In practice, an LLM works by predicting the next token in a sequence, but thanks to large-scale training and fine-tuning, it can generalize patterns, reason over context, and adapt its responses to different domains. Today, LLMs play a major role in applications such as chatbots, productivity tools, search, and even security log analysis.”
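The next-token idea can be illustrated with a toy bigram model, which shares only the autoregressive loop with a real transformer:

```python
from collections import Counter, defaultdict

def train_bigram(text: str) -> dict:
    """Count which token tends to follow each token."""
    tokens = text.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts: dict, start: str, n: int) -> list:
    """Autoregressive decoding: repeatedly pick the most likely next token."""
    out = [start]
    for _ in range(n):
        followers = counts.get(out[-1])
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return out

model = train_bigram("the model predicts the next token and the next token")
generate(model, "the", 3)   # → ['the', 'next', 'token', 'and']
```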
Data drift
Data drift happens when the input data changes compared to the data the model was trained on.
The world changes, so data distributions move.
Examples:
Users start using new types of wording in support tickets.
A cybersecurity dataset suddenly has new attack patterns.
A price prediction model sees inflation that wasn’t in the training data.
Consequence:
The model sees new kinds of inputs and can’t generalize well → accuracy drops.
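One simple drift check is a mean-shift alert on a numeric input feature; a sketch with an arbitrary threshold of my own choosing:

```python
from statistics import mean, stdev

def mean_shift_alert(train_values: list, live_values: list, z_threshold: float = 3.0) -> bool:
    """Flag drift when the live mean moves > z_threshold training stdevs from the training mean."""
    mu, sigma = mean(train_values), stdev(train_values)
    if sigma == 0:
        return mean(live_values) != mu
    z = abs(mean(live_values) - mu) / sigma
    return z > z_threshold

train = [10.0, 11.0, 9.0, 10.5, 9.5]    # e.g. pre-inflation prices
live = [19.0, 21.0, 20.0, 22.0, 18.0]   # inflated prices the model never saw
mean_shift_alert(train, live)           # → True
```

In production this would run per feature on a schedule, feeding the drift dashboards mentioned above.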