06-evaluation-methodology Flashcards

(10 cards)

1
Q

What is AI-as-a-Judge (LLM-as-a-Judge)?

A

An evaluation pattern where a Large Language Model grades the outputs of another LLM. It scales evaluation far beyond what human reviewers can handle but introduces evaluator bias (the judge may favor outputs that resemble its own style).

2
Q

What are the two main AI-as-a-Judge approaches?

A

(1) Pairwise comparison — the judge sees a prompt and two answers and picks the better one based on criteria like truthfulness or neutrality; (2) Reference-free evaluation — the judge assesses quality without a “gold standard” answer, using its own reasoning.
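The pairwise approach can be sketched in a few lines. This is a minimal illustration, not a production harness: `call_llm` is a hypothetical stand-in for any chat-completion client, and the rubric wording and verdict parsing are assumptions.

```python
# Hedged sketch of pairwise AI-as-a-Judge. `call_llm` is a hypothetical
# function (prompt in, text out); swap in any real LLM client.
JUDGE_PROMPT = """You are an impartial judge. Given a prompt and two answers,
pick the better answer based on truthfulness and neutrality.

Prompt: {prompt}
Answer A: {a}
Answer B: {b}

Reply with exactly "A" or "B"."""

def pairwise_judge(prompt: str, answer_a: str, answer_b: str, call_llm) -> str:
    """Return 'A' or 'B' according to the judge model's verdict."""
    verdict = call_llm(JUDGE_PROMPT.format(prompt=prompt, a=answer_a, b=answer_b))
    # Tolerate minor formatting drift in the judge's reply.
    return "A" if verdict.strip().upper().startswith("A") else "B"
```

In practice you would also swap the positions of the two answers and re-judge, since judge models often show position bias toward the first answer.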

3
Q

What is Evaluator Bias in AI-as-a-Judge?

A

The tendency for a judge model to favor outputs that sound like itself or match its own style. Mitigated by using diverse judge models and calibrating against human-labeled datasets.

4
Q

Why are traditional NLP metrics (BLEU, ROUGE) insufficient for AI Engineering?

A

They measure word overlap with reference text, which is inadequate for reasoning tasks, open-ended generation, and domain-specific quality (e.g., factual accuracy, neutrality, faithfulness). Huyen advocates model-based evaluation and domain-specific metrics instead.

5
Q

What is Perplexity as an evaluation metric?

A

A measure of how “surprised” a model is by a sequence of text, computed as the exponentiated average negative log-likelihood per token. Lower perplexity generally indicates better fluency, but NOT necessarily factual accuracy: it measures language quality, not truthfulness.
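The standard definition (exponentiated average negative log-likelihood per token) is easy to compute once you have per-token log-probabilities, which most model APIs can return:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-probability per token.
    Lower means the model found the sequence less surprising."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# Toy illustration: a fluent sequence (high per-token probability) scores
# lower perplexity than an unlikely one.
fluent = [math.log(0.9)] * 5      # each token ~90% likely
surprising = [math.log(0.1)] * 5  # each token ~10% likely → perplexity 10.0
```

Note that a confidently stated falsehood can still have low perplexity, which is exactly why the card warns that perplexity is not a truthfulness metric.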

6
Q

What is the Faithfulness metric?

A

A measure of whether a generated answer is derived ONLY from the provided context (retrieved documents). It detects hallucination by checking if the model introduced information not present in the source material. Typically measured via AI-as-a-Judge.
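A judge-based faithfulness check can be sketched as a yes/no grounding question. As above, `call_llm` is a hypothetical client and the binary rubric is an illustrative simplification; real implementations often decompose the answer into claims and verify each one.

```python
# Hedged sketch of faithfulness scoring via AI-as-a-Judge.
# `call_llm` is a hypothetical function (prompt in, text out).
FAITHFULNESS_PROMPT = """Context:
{context}

Answer:
{answer}

Does the answer contain ONLY information supported by the context?
Reply with exactly "yes" or "no"."""

def is_faithful(context: str, answer: str, call_llm) -> bool:
    """True if the judge finds every claim grounded in the context."""
    verdict = call_llm(FAITHFULNESS_PROMPT.format(context=context, answer=answer))
    return verdict.strip().lower().startswith("yes")
```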

7
Q

What is Semantic Similarity as a metric?

A

Measures how close a generated answer is to a reference document or expected answer using embedding cosine similarity. Used to verify RAG accuracy — ensuring responses align with the meaning of source policy documents.
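Cosine similarity itself is simple to compute; the toy vectors below stand in for real embeddings, which would come from an embedding model applied to the generated answer and the reference text:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Identical direction → 1.0; orthogonal (unrelated meaning) → 0.0.
aligned = cosine_similarity([1.0, 2.0], [2.0, 4.0])
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

A typical RAG check then asserts the similarity between answer and source passage exceeds some threshold (the threshold itself must be tuned per embedding model).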

8
Q

What does Functional Correctness measure?

A

Whether code runs successfully or whether a tool call executes properly. Essential for evaluating agentic workflows where the agent generates and runs code or makes API calls.
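A minimal functional-correctness check executes the generated code and runs assertions against it. This sketch deliberately skips sandboxing for brevity; real harnesses isolate execution (subprocess, container, or remote sandbox) because the code under test is untrusted.

```python
# Hedged sketch: does model-generated code pass its tests?
def passes(generated_code: str, test_code: str) -> bool:
    """Run generated code, then its tests; any exception counts as failure."""
    namespace = {}
    try:
        exec(generated_code, namespace)  # define the candidate function(s)
        exec(test_code, namespace)       # assertions raise on failure
        return True
    except Exception:
        return False

good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
unit_tests = "assert add(2, 3) == 5"
```

The same pass/fail shape applies to tool calls in agentic workflows: did the API call execute, and did it produce the expected state change?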

9
Q

What is a Neutrality metric?

A

A measure of whether AI output is non-partisan and objective in tone. Assessed via AI-as-a-Judge with rubric-based scoring. Required for federal compliance with EO 14319’s ideological neutrality mandate.

10
Q

How does Evaluation-Driven Development progress in sophistication?

A

Stage 1: Evaluation as a quality gate (metrics defined before building). Stage 2: Evaluation drives architecture selection (e.g., hybrid search chosen because it scores higher on citation accuracy). Stage 3: Evaluation drives fundamental constraints (e.g., fine-tuning prohibited entirely due to leakage risk).
