Eval Logs Flashcards

(24 cards)

1
Q

Things to measure (tailored to specific use case)

A
  • accuracy
  • relevancy
  • coherence
  • toxicity
  • sentiment
2
Q

List 3 approaches to evaluation

A
  • code-based
  • LLM as a judge
  • human in the loop
3
Q

List 3 types of evals

A
  • context specific
  • user experience specific
  • security and safety evaluation
4
Q

what is context specific evaluation?

A

Gauges your app’s ability to retrieve relevant information and produce appropriate outputs

If the LLM’s answer contains information that isn’t in the source documents, the LLM may be hallucinating

5
Q

Name two types of context specific evaluations

A
  • Needle in the haystack test
  • Faithfulness evaluations
6
Q

Needle in the haystack test

A

checks how well an LLM is able to retrieve a discrete piece of information from within all the data in its context window

7
Q

how is needle in a haystack performed

A
  • Place a random fact or statement (the ‘needle’) in the middle of a long context window (the ‘haystack’)
  • Ask the model to retrieve this statement
  • Iterate over various document depths (where the needle is placed) and context lengths to measure performance
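The steps above can be sketched as a small test harness. This is a minimal sketch, not any particular benchmark's implementation: `ask_model` is a hypothetical wrapper around your LLM API, and repeated filler sentences stand in for real documents.

```python
# Hypothetical needle-in-a-haystack harness: plant a fact at varying
# depths and context lengths, then check whether the model retrieves it.

def build_haystack(needle: str, depth: float, length: int) -> str:
    """Insert the needle at a fractional depth within `length` filler sentences."""
    filler = ["The sky was clear over the harbor that morning."] * length
    filler.insert(int(depth * length), needle)
    return " ".join(filler)

def run_needle_test(ask_model, needle: str, question: str,
                    depths=(0.0, 0.25, 0.5, 0.75, 1.0),
                    lengths=(100, 500, 1000)) -> dict:
    """Score retrieval at each (depth, length) combination."""
    results = {}
    for length in lengths:
        for depth in depths:
            context = build_haystack(needle, depth, length)
            answer = ask_model(context, question)
            # Simple containment check; a judge LLM could grade instead.
            results[(depth, length)] = needle.lower() in answer.lower()
    return results
```

The `(depth, length)` grid is what produces the familiar heatmap of retrieval performance across context positions and sizes.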
8
Q

Faithfulness

A
  • Faithfulness evaluations use a secondary LLM to test whether an LLM application’s response can be logically inferred from the context the application used to create it—typically with a RAG pipeline.
  • A response is considered faithful if all its claims can be supported by the retrieved context, while a low faithfulness score can indicate the prevalence of hallucinations in your RAG-based LLM app’s responses.
9
Q

How is a faithfulness evaluation conducted?

A
  • Break down the response into discrete claims.
  • Ask an LLM whether each claim can be inferred from the provided context.
  • Determine the fraction of claims that were correctly inferred.
  • Produce a score between 0 and 1 from this fraction.
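Those four steps reduce to a single fraction. In this minimal sketch, `split_into_claims` and `judge_claim` are hypothetical hooks: the first would use an LLM to break the response into atomic claims, the second would ask a judge LLM whether one claim is supported by the retrieved context.

```python
# Hypothetical faithfulness scorer: fraction of claims the context supports.

def faithfulness_score(response: str, context: str,
                       split_into_claims, judge_claim) -> float:
    """Return a 0–1 score: supported claims / total claims."""
    claims = split_into_claims(response)
    if not claims:
        return 1.0  # an empty response makes no unsupported claims
    supported = sum(1 for claim in claims if judge_claim(claim, context))
    return supported / len(claims)
```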
10
Q

HIL

A

Have humans manually evaluate and label model outputs

11
Q

Name 2 LLM-as-a-judge approaches

A
  • topic relevancy
  • negative sentiment
12
Q

when should code-based evals be used?

A

for deterministic failures
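Deterministic failures are properties a plain program can check with no model in the loop. A minimal sketch with two illustrative checks (the check names and the SSN pattern are assumptions, not a standard suite):

```python
import json
import re

def eval_valid_json(output: str) -> bool:
    """The model was asked for JSON; anything unparsable is a failure."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def eval_no_pii(output: str) -> bool:
    """Fail if the output leaks something shaped like a US SSN."""
    return re.search(r"\b\d{3}-\d{2}-\d{4}\b", output) is None
```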

13
Q

when should LLM as judge be used?

A

For subjective cases
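A subjective quality like tone can't be checked with string matching, so a judge LLM grades it instead. A minimal sketch, where `ask_model` is a hypothetical wrapper around your LLM API and the prompt wording is illustrative:

```python
# Hypothetical LLM-as-judge eval for a subjective property (tone).

JUDGE_PROMPT = """You are evaluating a customer-support reply.
Reply: {reply}
Is the tone polite and professional? Answer only PASS or FAIL."""

def judge_tone(ask_model, reply: str) -> bool:
    """Return True if the judge LLM's verdict starts with PASS."""
    verdict = ask_model(JUDGE_PROMPT.format(reply=reply))
    return verdict.strip().upper().startswith("PASS")
```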

14
Q

how do you evaluate the LLM judge’s expertise?

A

Compare the judge’s verdicts against human labels and measure its true positive rate and true negative rate
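The comparison can be sketched as a small function over (human label, judge label) pairs, where `True` means "pass":

```python
# Hypothetical judge-agreement check: TPR and TNR against human labels.

def judge_agreement(pairs) -> tuple[float, float]:
    """Return (true positive rate, true negative rate) for the judge."""
    tp = sum(1 for human, judge in pairs if human and judge)
    fn = sum(1 for human, judge in pairs if human and not judge)
    tn = sum(1 for human, judge in pairs if not human and not judge)
    fp = sum(1 for human, judge in pairs if not human and judge)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr, tnr
```

Tracking both rates matters because a judge that passes everything gets a perfect TPR while its TNR collapses to zero.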

15
Q

what’s a potential problem with using evals in the CI/CD pipeline?

A

CI/CD evals only run against a curated test set before deployment; you also need to evaluate production data for continuous validation

16
Q

what makes LLM evaluation difficult?

A
  • LLM pipelines don’t produce deterministic outputs
  • their responses are subjective and context dependent
  • there are multiple ways to be wrong: a response can be accurate yet strike the wrong tone
17
Q

what’s a vibes based approach to LLM evals?

A

change the prompt and manually test a few inputs

18
Q

Three gaps

A
  • gulf of comprehension
  • gulf of specification
  • gulf of generalization
19
Q

gulf of comprehension

A
  • the gap between a developer and the model’s behavior at scale: you can’t manually inspect every user query and AI response
20
Q

gulf of specification

A

the gap between what we want the LLM to do and what our prompts actually instruct it to do

21
Q

gulf of generalization

A
  • the gap between a well-written prompt and the model’s ability to apply those instructions reliably across inputs.
  • even with perfect instructions, a model can still fail on new or unusual data
22
Q

eval flywheel

A
  • analyze to find failures
  • measure with evals
  • improve through experimentation
  • automate with evals used as regression tests
23
Q

what’s a trace?

A
  • a complete record of an interaction: initial user query, intermediate LLM reasoning steps, any tool calls made, and the final user-facing response
  • everything you need to reconstruct what actually happened
  • reading traces is a good starting point in the analysis phase
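A trace record might be modeled as a small data structure. The field names here are illustrative, not any particular tool’s schema:

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str        # which tool was invoked
    arguments: dict  # the arguments the model passed
    result: str      # what the tool returned

@dataclass
class Trace:
    user_query: str
    reasoning_steps: list[str] = field(default_factory=list)
    tool_calls: list[ToolCall] = field(default_factory=list)
    final_response: str = ""
```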
24
Q

tools for viewing traces

A
  • LangSmith, Arize, Braintrust