Eval Logs Flashcards

(24 cards)

1
Q

Things to measure (tailored to specific use case)

A
  • accuracy
  • relevancy
  • coherence
  • toxicity
  • sentiment
2
Q

List 3 approaches to evaluation

A
  • code-based
  • LLM as a judge
  • human in the loop
3
Q

List 3 types of evals

A
  • context specific
  • user experience specific
  • security and safety evaluation
4
Q

what is context specific evaluation?

A

Gauges your app’s ability to retrieve relevant information and produce appropriate outputs

If the LLM’s answer contains information that isn’t in the source documents, the LLM may be hallucinating

5
Q

Name two types of context specific evaluations

A
  • Needle in the haystack test
  • Faithfulness evaluations
6
Q

Needle in the haystack test

A

checks how well an LLM is able to retrieve a discrete piece of information from within all the data in its context window

7
Q

how is needle in a haystack performed

A
  • Place a random fact or statement (the ‘needle’) in the middle of a long context window (the ‘haystack’)
  • Ask the model to retrieve this statement
  • Iterate over various document depths (where the needle is placed) and context lengths to measure performance
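The steps above can be sketched as a small test harness. This is a minimal sketch, not any particular benchmark's implementation: `ask_model` is a hypothetical wrapper around your LLM API, and repeated filler sentences stand in for real documents.

```python
# Hypothetical needle-in-a-haystack harness: plant a fact at varying
# depths and context lengths, then check whether the model retrieves it.

def build_haystack(needle: str, depth: float, length: int) -> str:
    """Insert the needle at a fractional depth within `length` filler sentences."""
    filler = ["The sky was clear over the harbor that morning."] * length
    filler.insert(int(depth * length), needle)
    return " ".join(filler)

def run_needle_test(ask_model, needle: str, question: str,
                    depths=(0.0, 0.25, 0.5, 0.75, 1.0),
                    lengths=(100, 500, 1000)) -> dict:
    """Score retrieval at each (depth, length) combination."""
    results = {}
    for length in lengths:
        for depth in depths:
            context = build_haystack(needle, depth, length)
            answer = ask_model(context, question)
            # Simple containment check; a judge LLM could grade instead.
            results[(depth, length)] = needle.lower() in answer.lower()
    return results
```

The `(depth, length)` grid is what produces the familiar heatmap of retrieval performance across context positions and sizes.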
8
Q

Faithfulness

A
  • Faithfulness evaluations use a secondary LLM to test whether an LLM application’s response can be logically inferred from the context the application used to create it—typically with a RAG pipeline.
  • A response is considered faithful if all its claims can be supported by the retrieved context, while a low faithfulness score can indicate the prevalence of hallucinations in your RAG-based LLM app’s responses.
9
Q

How is a faithfulness evaluation conducted?

A
  • Break down the response into discrete claims.
  • Ask an LLM whether each claim can be inferred from the provided context.
  • Determine the fraction of claims that were correctly inferred.
  • Produce a score between 0 and 1 from this fraction.
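Those four steps reduce to a single fraction. In this minimal sketch, `split_into_claims` and `judge_claim` are hypothetical hooks: the first would use an LLM to break the response into atomic claims, the second would ask a judge LLM whether one claim is supported by the retrieved context.

```python
# Hypothetical faithfulness scorer: fraction of claims the context supports.

def faithfulness_score(response: str, context: str,
                       split_into_claims, judge_claim) -> float:
    """Return a 0–1 score: supported claims / total claims."""
    claims = split_into_claims(response)
    if not claims:
        return 1.0  # an empty response makes no unsupported claims
    supported = sum(1 for claim in claims if judge_claim(claim, context))
    return supported / len(claims)
```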
10
Q

HIL

A

Have humans manually evaluate and label model outputs

11
Q

Name 2 LLM-as-a-judge approaches

A
  • topic relevancy
  • negative sentiment
12
Q

when should code-based evals be used?

A

for deterministic failures
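Deterministic failures are properties a plain program can check with no model in the loop. A minimal sketch with two illustrative checks (the check names and the SSN pattern are assumptions, not a standard suite):

```python
import json
import re

def eval_valid_json(output: str) -> bool:
    """The model was asked for JSON; anything unparsable is a failure."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def eval_no_pii(output: str) -> bool:
    """Fail if the output leaks something shaped like a US SSN."""
    return re.search(r"\b\d{3}-\d{2}-\d{4}\b", output) is None
```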

13
Q

when should LLM as judge be used?

A

For subjective cases
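A subjective quality like tone can't be checked with string matching, so a judge LLM grades it instead. A minimal sketch, where `ask_model` is a hypothetical wrapper around your LLM API and the prompt wording is illustrative:

```python
# Hypothetical LLM-as-judge eval for a subjective property (tone).

JUDGE_PROMPT = """You are evaluating a customer-support reply.
Reply: {reply}
Is the tone polite and professional? Answer only PASS or FAIL."""

def judge_tone(ask_model, reply: str) -> bool:
    """Return True if the judge LLM's verdict starts with PASS."""
    verdict = ask_model(JUDGE_PROMPT.format(reply=reply))
    return verdict.strip().upper().startswith("PASS")
```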

14
Q

how do you evaluate the LLM judge’s expertise?

A

Compare the judge’s verdicts against human labels and measure its true positive rate and true negative rate
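The comparison can be sketched as a small function over (human label, judge label) pairs, where `True` means "pass":

```python
# Hypothetical judge-agreement check: TPR and TNR against human labels.

def judge_agreement(pairs) -> tuple[float, float]:
    """Return (true positive rate, true negative rate) for the judge."""
    tp = sum(1 for human, judge in pairs if human and judge)
    fn = sum(1 for human, judge in pairs if human and not judge)
    tn = sum(1 for human, judge in pairs if not human and not judge)
    fp = sum(1 for human, judge in pairs if not human and judge)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr, tnr
```

Tracking both rates matters because a judge that passes everything gets a perfect TPR while its TNR collapses to zero.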

15
Q

what’s a potential problem with using evals in the CI/CD pipeline?

A

CI/CD evals only run against a curated test set before deployment; you also need to evaluate production data for continuous validation

16
Q

what makes LLM evaluation difficult?

A
  • LLM pipelines don’t produce deterministic outputs
  • their responses are subjective and context dependent
  • there are multiple ways to be wrong: a response can be accurate yet strike the wrong tone
17
Q

what’s a vibes based approach to LLM evals?

A

change the prompt and manually test a few inputs

18
Q

Three gaps

A
  • gulf of comprehension
  • gulf of specification
  • gulf of generalization
19
Q

gulf of comprehension

A
  • the gap between a developer and the model’s behavior at scale: you can’t manually inspect every user query and AI response
20
Q

gulf of specification

A

the gap between what we want the LLM to do and what our prompts actually instruct it to do

21
Q

gulf of generalization

A
  • the gap between a well-written prompt and the model’s ability to apply those instructions reliably across inputs.
  • even with perfect instructions, a model can still fail on new or unusual data
22
Q

eval flywheel

A
  • analyze to find failures
  • measure with evals
  • improve through experimentation
  • automate with evals used as regression tests
23
Q

what’s a trace?

A
  • a complete record of an interaction: initial user query, intermediate LLM reasoning steps, any tool calls made, and the final user-facing response
  • everything you need to reconstruct what actually happened
  • reading traces is a good starting point in the analysis phase
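A trace record might be modeled as a small data structure. The field names here are illustrative, not any particular tool’s schema:

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str        # which tool was invoked
    arguments: dict  # the arguments the model passed
    result: str      # what the tool returned

@dataclass
class Trace:
    user_query: str
    reasoning_steps: list[str] = field(default_factory=list)
    tool_calls: list[ToolCall] = field(default_factory=list)
    final_response: str = ""
```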
24
Q

tools for viewing traces

A
  • LangSmith, Arize, Braintrust