What are the characteristics of abstraction in Python OOP?
Exposing only the necessary parts of an object and hiding the complex internal implementation.
from abc import ABC, abstractmethod

class Animal(ABC):
    @abstractmethod
    def make_sound(self):
        pass
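A concrete subclass has to implement every abstract method before it can be instantiated; a minimal sketch (the `Dog` class is my own example):

```python
from abc import ABC, abstractmethod

class Animal(ABC):
    @abstractmethod
    def make_sound(self):
        pass

class Dog(Animal):
    def make_sound(self):
        return "Woof"

Dog().make_sound()   # works: every abstract method is implemented
# Animal() raises TypeError: can't instantiate an abstract class
```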
What is an interface?
A contract that specifies which methods a class must implement, without specifying how.
Python uses ABCs (abstract base classes).
TypeScript uses interface keywords.
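Python can also express interfaces structurally with `typing.Protocol`, where matching the method shape is enough (the `Speaker`/`Robot` names are illustrative):

```python
from typing import Protocol

class Speaker(Protocol):
    def make_sound(self) -> str: ...

class Robot:  # no inheritance needed; matching the shape is enough
    def make_sound(self) -> str:
        return "Beep"

def announce(s: Speaker) -> str:
    return s.make_sound().upper()

announce(Robot())   # → 'BEEP'
```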
What is generative AI?
“Generative AI models learn patterns from data and create new content — text, images, code, audio — that resembles the training distribution. Most modern generative models use transformers and autoregressive decoding.”
Generative AI → produces content.
Agentic AI → takes actions, uses tools, calls APIs, interacts with environments.
Dust agents use LLM + tools + memory + workflows, not just text generation.
“What is LLM-as-a-judge?”
“A strong LLM is used to evaluate or score the output of another model. It can perform pairwise comparisons, give qualitative ratings, or act as a proxy for human evaluation.”
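A minimal pairwise-judge sketch, with the actual LLM call left out (only the prompt construction and verdict parsing, both my own illustrative helpers):

```python
def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Pairwise comparison prompt: the judge model must reply 'A' or 'B'."""
    return (
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Which answer is better? Reply with exactly one letter: A or B."
    )

def parse_verdict(raw: str) -> str:
    """Extract the judge's choice, tolerating whitespace and case noise."""
    verdict = raw.strip().upper()
    if verdict not in {"A", "B"}:
        raise ValueError(f"unparseable judge output: {raw!r}")
    return verdict

parse_verdict(" b\n")   # → 'B'
```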
Which evaluation metrics would you use for assessing an AI assistant?
Accuracy / correctness
Relevance
Safety / toxicity
Latency
Cost per request
Robustness to edge cases
Hallucination rate
How would you design an evaluation pipeline for Dust agents?
“I would design the pipeline in four layers:
1. Versioned benchmark datasets → reproducibility and traceability
2. Automated evaluation runs → metrics recorded per agent and per version
3. Dashboards and quality SLOs → continuous monitoring
4. CI/CD integration → every agent change is re-evaluated before release
This matches Yousign’s JD: reproducibility, traceability, versioning, metrics, quality SLOs, dashboards, and CI/CD integration.”
What tools do you know for LLM evaluation?
LangSmith – traces, evaluation datasets, LLM-as-judge
OpenAI Evals – automated evals with templates
Langfuse – observability and monitoring for LLM apps
Custom pipelines using Python + dashboards
Mention: “I already built my own in the RAG project.”
Do you have experience with CI/CD?
“I’ve worked with GitHub Actions, Docker, AWS CI pipelines, and ZenML for ML CI. I wrote unit/integration tests for Next.js and backend APIs, so I’m used to test-driven delivery.”
Why should we hire you?
“First, I already built evaluation workflows in my multi-agent RAG project, including LLM-as-judge scoring, test-set design, and metric tracking — exactly the core of this internship.
Second, I bring both ML skills and software engineering experience from full-stack development and real-time pipelines. I’m comfortable with Python, TS, CI/CD, data versioning, and production constraints.
Finally, I’m genuinely motivated by reliability and quality — I love building things that are measurable, reproducible, and robust. I believe I can contribute quickly and help scale Yousign’s AI productivity agents in a safe and controlled way.”
What’s the difference between fine-tuning and prompt engineering?
Prompt engineering = steer the model without changing weights.
Fine-tuning = update weights using domain-specific data.
When to use:
Prompt engineering → when data is small, tasks are simple.
Fine-tuning → when model must learn new domain knowledge or style.
How do you evaluate an LLM for correctness, safety, and cost?
We need multi-dimensional evaluation:
Correctness → accuracy, groundedness, hallucination rate (I’ve done this in my RAG multi-agent benchmark).
Safety → jailbreaking tests, refusal consistency.
Cost → per-token computation, latency, throughput.
In my RAG project, I measured accuracy + hallucination rate + latency, which maps directly to Yousign’s needs (safety + reliability + cost).
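These dimensions can be tracked together in one record; a sketch with field names of my own choosing, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    accuracy: float            # fraction of correct answers
    hallucination_rate: float  # unsupported-claim rate
    refusal_rate: float        # safety proxy: how often the model declines
    latency_ms: float
    cost_per_1k_tokens: float

def within_budget(r: EvalResult, max_latency_ms: float, max_cost: float) -> bool:
    """Gate a model on cost/latency before comparing quality."""
    return r.latency_ms <= max_latency_ms and r.cost_per_1k_tokens <= max_cost

r = EvalResult(0.91, 0.04, 0.02, 850.0, 0.5)
within_budget(r, max_latency_ms=1000, max_cost=1.0)   # → True
```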
Design a system to evaluate internal agents used by Yousign
To evaluate AI agents for productivity workflows, I’d implement a three-stage evaluation loop:
Dataset creation:
Collect real internal tasks (summaries, document understanding, template generation).
Build a labeled benchmark: expected outputs, constraints, rejection cases.
Agent evaluation pipeline:
Automatically run each agent on a large batch of tasks.
Record metrics: accuracy, groundedness, hallucination rate, latency, cost/token.
Comparison & continuous monitoring:
Compare agents against each other using statistical scoring.
Create dashboards for drift detection and performance degradation.
My experience building a multi-agent evaluation workflow with LangGraph matches this exactly — I already implemented automated task/agent comparison and used reflection patterns to assess quality.
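The three stages above can be sketched as a batch loop (the agents and scoring function here are toy stand-ins, not real Dust agents):

```python
import time
from statistics import mean

def run_benchmark(agents: dict, tasks: list, score_fn) -> dict:
    """Run every agent on every task; record per-agent accuracy and latency."""
    results = {}
    for name, agent in agents.items():
        scores, latencies = [], []
        for task in tasks:
            start = time.perf_counter()
            output = agent(task["input"])
            latencies.append((time.perf_counter() - start) * 1000)
            scores.append(score_fn(output, task["expected"]))
        results[name] = {"accuracy": mean(scores), "latency_ms": mean(latencies)}
    return results

# Toy agents standing in for real ones:
agents = {"echo": lambda x: x, "upper": lambda x: x.upper()}
tasks = [{"input": "hello", "expected": "HELLO"}]
exact = lambda out, exp: float(out == exp)
run_benchmark(agents, tasks, exact)
```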
How do you measure hallucinations?
I measure hallucinations using:
Context Recall → how much of the ground-truth answer is covered by the retrieved context
Context Precision → how much of the retrieved context is actually relevant
Faithfulness score (e.g., RAGAS) → fraction of answer claims supported by the context
Manual annotation for high-risk tasks
In my RAG multi-agent project, hallucination dropped 15% using a reflection-based workflow.
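A deliberately crude faithfulness sketch based on word overlap, nothing like the real RAGAS implementation:

```python
def faithfulness(answer_sentences: list, context: str, threshold: float = 0.5) -> float:
    """Fraction of answer sentences whose content words appear in the context."""
    context_words = {w.strip(".,").lower() for w in context.split()}
    supported = 0
    for sent in answer_sentences:
        # crude content-word filter: keep words longer than 3 characters
        words = [w.strip(".,").lower() for w in sent.split() if len(w.strip(".,")) > 3]
        if words and sum(w in context_words for w in words) / len(words) >= threshold:
            supported += 1
    return supported / len(answer_sentences) if answer_sentences else 0.0

ctx = "Yousign signatures are legally binding across the European Union."
ans = ["Signatures are legally binding.", "They cost nothing forever."]
faithfulness(ans, ctx)   # → 0.5 (second sentence unsupported)
```

The hallucination rate is then simply `1 - faithfulness`.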
Explain the tradeoff between accuracy, latency, cost, and safety.
Accuracy ↑ → Latency ↑ & Cost ↑
Better models are slower and more expensive.
Safety ↑ → Accuracy ↓ sometimes
Safe models refuse more often.
Low latency → smaller or optimized models
How do you measure whether retrieval improves accuracy in a RAG pipeline?
Compare 3 conditions:
No RAG (LLM alone)
RAG with dense retrieval
RAG with hybrid (BM25 + embedding)
Use metrics:
Faithfulness
Accuracy
Recall@k
Hallucination rate
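Recall@k for the retrieval step can be computed like this (document ids are illustrative):

```python
def recall_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)

recall_at_k(["d3", "d1", "d7", "d2"], {"d1", "d2"}, k=3)   # → 0.5 (only d1 in top 3)
```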
What is an agentic AI workflow?
An agentic AI workflow is a process where an LLM-based app executes multiple steps to complete a task (plan → use tools → act → revise).
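A toy plan → act → revise loop (the planner and tool are stand-ins, not a real agent framework):

```python
def agent_loop(goal: str, plan_fn, tools: dict, max_steps: int = 5) -> list:
    """Minimal agentic loop: plan the next action, dispatch to a tool, stop on 'done'."""
    history = []
    for _ in range(max_steps):
        action, arg = plan_fn(goal, history)   # plan the next action
        if action == "done":
            break
        history.append(tools[action](arg))     # act via a tool, record the observation
    return history

# Toy planner: search once, then finish.
def planner(goal, history):
    return ("done", None) if history else ("search", goal)

tools = {"search": lambda q: f"results for {q}"}
agent_loop("eval metrics", planner, tools)   # → ['results for eval metrics']
```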
What are LLMs?
“LLMs, or Large Language Models, are a class of machine learning models trained on massive text datasets to understand and generate human-like language. They’re typically built using transformer architectures, which allow them to capture long-range context and relationships in text efficiently. Because of this, LLMs can perform a wide range of tasks—like text generation, summarization, translation, question answering, code completion—without being explicitly programmed for each one.
In practice, an LLM works by predicting the next token in a sequence, but thanks to large-scale training and fine-tuning, it can generalize patterns, reason over context, and adapt its responses to different domains. Today, LLMs play a major role in applications such as chatbots, productivity tools, search, and even security log analysis.”
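The next-token idea can be illustrated with a toy bigram model, which shares only the autoregressive loop with a real transformer:

```python
from collections import Counter, defaultdict

def train_bigram(text: str) -> dict:
    """Count which token tends to follow each token."""
    tokens = text.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts: dict, start: str, n: int) -> list:
    """Autoregressive decoding: repeatedly pick the most likely next token."""
    out = [start]
    for _ in range(n):
        followers = counts.get(out[-1])
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return out

model = train_bigram("the model predicts the next token and the next token")
generate(model, "the", 3)   # → ['the', 'next', 'token', 'and']
```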
Data drift
Data drift happens when the input data changes compared to the data the model was trained on.
The world changes, so data distributions move.
Examples:
Users start using new types of wording in support tickets.
A cybersecurity dataset suddenly has new attack patterns.
A price prediction model sees inflation that wasn’t in the training data.
Consequence:
The model sees new kinds of inputs and can’t generalize well → accuracy drops.
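One simple drift check is a mean-shift alert on a numeric input feature; a sketch with an arbitrary threshold of my own choosing:

```python
from statistics import mean, stdev

def mean_shift_alert(train_values: list, live_values: list, z_threshold: float = 3.0) -> bool:
    """Flag drift when the live mean moves > z_threshold training stdevs from the training mean."""
    mu, sigma = mean(train_values), stdev(train_values)
    if sigma == 0:
        return mean(live_values) != mu
    z = abs(mean(live_values) - mu) / sigma
    return z > z_threshold

train = [10.0, 11.0, 9.0, 10.5, 9.5]    # e.g. pre-inflation prices
live = [19.0, 21.0, 20.0, 22.0, 18.0]   # inflated prices the model never saw
mean_shift_alert(train, live)           # → True
```

In production this would run per feature on a schedule, feeding the drift dashboards mentioned above.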