Things to measure (tailored to specific use case)
List 3 approaches to evaluation
List 3 types of evals
Code-based, LLM-as-a-judge, and human (HIL)
what is context specific evaluation?
Gauges your app’s ability to retrieve relevant information and produce appropriate outputs
If the LLM’s answer contains information that isn’t in the source documents, the LLM may be hallucinating
Name two types of context specific evaluations
Needle-in-the-haystack and faithfulness
Needle in the haystack test
checks how well an LLM is able to retrieve a discrete piece of information from within all the data in its context window
how is needle in a haystack performed?
Insert a specific fact (the “needle”) at varying depths in a long filler context, ask the model a question only that fact answers, and check whether the answer contains the fact
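The needle-in-a-haystack procedure above can be sketched as a small harness. `call_model` is a hypothetical stand-in here, not a real API; swap in your provider’s client:

```python
def call_model(prompt: str) -> str:
    # Stand-in for a real LLM call; replace with your provider's client.
    return "The secret code is 4471."

def build_haystack(filler: str, needle: str, depth: float, total_chars: int = 2000) -> str:
    """Insert the needle sentence at a relative depth (0.0 = start, 1.0 = end)."""
    context = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pos = int(len(context) * depth)
    return context[:pos] + " " + needle + " " + context[pos:]

def score(response: str, expected: str) -> bool:
    """Pass if the expected fact appears in the model's answer."""
    return expected.lower() in response.lower()

if __name__ == "__main__":
    needle = "The secret code is 4471."
    filler = "The quick brown fox jumps over the lazy dog. "
    # Sweep the needle through the context to find depths where retrieval fails.
    for depth in (0.0, 0.5, 1.0):
        haystack = build_haystack(filler, needle, depth)
        prompt = haystack + "\n\nQuestion: What is the secret code?"
        print(depth, score(call_model(prompt), "4471"))
```

In practice the sweep also varies total context length, giving a depth-by-length grid of pass/fail results.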
Faithfulness
Measures whether the claims in the LLM’s answer are supported by the retrieved source documents
How is a faithfulness evaluation conducted?
Break the answer into individual claims and check each one against the source context, typically with an LLM judge; the score is the fraction of supported claims
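A minimal sketch of one common way to run a faithfulness check per claim. The judge prompt wording and the `judge_model` callable are illustrative assumptions, not from these notes:

```python
# Template for an LLM-as-judge faithfulness verdict on a single claim.
JUDGE_PROMPT = """You are checking faithfulness.
Context:
{context}

Claim:
{claim}

Answer SUPPORTED if the claim is backed by the context, otherwise UNSUPPORTED."""

def is_faithful(claim: str, context: str, judge_model) -> bool:
    """judge_model is any callable mapping a prompt string to a response string."""
    verdict = judge_model(JUDGE_PROMPT.format(context=context, claim=claim))
    return verdict.strip().upper().startswith("SUPPORTED")
```

The overall faithfulness score would then be the mean of `is_faithful` over all claims extracted from the answer.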
HIL (human in the loop)
Have humans manually evaluate and label model outputs
Name 2 LLM-as-a-judge approaches
Pairwise comparison (pick the better of two outputs) and direct scoring of a single output
when should code-based evals be used?
for deterministic failure modes that can be checked programmatically (format, regex, exact match)
when should LLM as judge be used?
For subjective or open-ended qualities (tone, helpfulness) that code can’t check
how do you evaluate the LLM judge’s expertise?
Compare its verdicts against human labels and measure the true positive rate and true negative rate
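Scoring a judge against human labels can be sketched as follows (the function name and boolean label convention are mine):

```python
def judge_accuracy(judge_labels, human_labels):
    """TPR and TNR of an LLM judge against human ground truth.

    Labels are booleans: True = the output passes. TPR = fraction of
    human-passed outputs the judge also passed; TNR = fraction of
    human-failed outputs the judge also failed.
    """
    tp = sum(j and h for j, h in zip(judge_labels, human_labels))
    tn = sum((not j) and (not h) for j, h in zip(judge_labels, human_labels))
    pos = sum(human_labels)
    neg = len(human_labels) - pos
    return tp / pos, tn / neg
```

A judge with high TPR but low TNR rubber-stamps everything; both rates need to be high before you trust it in place of human review.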
what’s a potential problem with using evals only in the CI/CD pipeline?
CI/CD evals run on a fixed pre-deployment test set; you also need to evaluate production data for continuous validation
what makes LLM evaluation difficult?
Outputs are open-ended and non-deterministic, so there is often no single correct answer to compare against
what’s a vibes based approach to LLM evals?
change the prompt and manually test a few inputs
Three gaps
gaps of comprehension
the gap between what is actually in your data and what you know about it; you can’t manually read every input and output
gaps of specification
the gap between what we want the LLM to do and what our prompts actually instruct it to do
gaps of generalization
the gap between how the LLM performs on the examples you’ve seen and how it performs on new, unseen inputs
eval flywheel
the loop of inspecting traces, labeling failures, turning them into evals, improving the app, and repeating
what’s a trace?
A complete record of one request through the app: inputs, intermediate steps (retrievals, tool calls), and the final output
tools for viewing traces
e.g., LangSmith, Langfuse, Arize Phoenix