Model Evaluation
What are you evaluating?
Quality Control
Model Evaluation
Examples of tasks in model evaluation?
text summarization, Q&A, text classification
Model Evaluation
What do you supply to the model for Model Evaluation?
Prompt dataset
Model Evaluation
Where can you create them?
Create custom yourself, or use curated prompt datasets
Model Evaluation
What’s in a prompt data set?
Questions and what you think a good answer should be
Model Evaluation
How do you score the model during Model Evaluation?
You don’t. The scores are calculated automatically.
Model Evaluation
How does the model calculate a score?
It doesn’t. A separate judge model scores the actual response vs. your supplied exemplar response.
Model Evaluation
What statistical methods are used to score a response?
BERTScore, F1 score, and others
Model Evaluation
What’s an example of something Benchmark tests can help detect?
Bias and discrimination
Model Evaluation
Metrics to eval an FM?
ROUGE, BLEU, BERTScore
Model Evaluation
What is ROUGE?
Recall-Oriented Understudy for Gisting Evaluation: evaluates automatic summarization and machine translation
Model Evaluation
What is ROUGE-N?
Measures the number of matching n-grams between the reference and generated text
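A minimal Python sketch of the ROUGE-N recall idea (function names are illustrative; this is not the official ROUGE implementation):

```python
from collections import Counter

def rouge_n(reference, generated, n=2):
    """Recall-oriented n-gram overlap: matching n-grams divided by
    the number of n-grams in the reference. Simplified sketch."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref = ngrams(reference.split(), n)
    gen = ngrams(generated.split(), n)
    overlap = sum((ref & gen).values())  # clipped count of shared n-grams
    return overlap / max(sum(ref.values()), 1)
```

Identical texts score 1.0; texts with no shared n-grams score 0.0.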
Model Evaluation
What is ROUGE-L?
Longest common subsequence (LCS) shared by the reference and generated text
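A minimal sketch of the LCS computation behind ROUGE-L (illustrative names, simplified recall-only score):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists,
    via standard dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_recall(reference, generated):
    # LCS length normalized by the reference length (recall side of ROUGE-L)
    ref, gen = reference.split(), generated.split()
    return lcs_length(ref, gen) / max(len(ref), 1)
```

Unlike ROUGE-N, the matched words need not be contiguous, only in order.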
Model Evaluation
What is BLEU?
Bilingual Evaluation Understudy
Model Evaluation
What is BLEU used for?
Quality of generated text, especially for translations
Model Evaluation
How does BLEU work?
Looks at a combination of n-gram precisions, with a brevity penalty for short outputs
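A simplified sentence-level BLEU sketch (geometric mean of clipped n-gram precisions times a brevity penalty; not the full corpus-level algorithm):

```python
import math
from collections import Counter

def bleu(reference, generated, max_n=4):
    """Simplified BLEU: geometric mean of clipped 1..max_n-gram
    precisions, scaled by a brevity penalty."""
    ref, gen = reference.split(), generated.split()
    precisions = []
    for n in range(1, max_n + 1):
        ref_c = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        gen_c = Counter(tuple(gen[i:i + n]) for i in range(len(gen) - n + 1))
        total = sum(gen_c.values())
        match = sum((ref_c & gen_c).values())  # clipped matches
        if total == 0 or match == 0:
            return 0.0
        precisions.append(match / total)
    # Brevity penalty: punish generated text shorter than the reference
    bp = 1.0 if len(gen) >= len(ref) else math.exp(1 - len(ref) / len(gen))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```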
Model Evaluation
What is BERT?
Semantic similarity: looks at how close the /meanings/ of the reference and generated text are
Model Evaluation
What’s the special techie term for what BERT does?
Computes the cosine similarity between embedding vectors
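A minimal sketch of cosine similarity between two embedding vectors (the real metric compares BERT token embeddings; the tiny vectors here are just illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 for same direction,
    0.0 for orthogonal, regardless of vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Because it ignores magnitude, two texts can score high even when worded differently, as long as their embeddings point the same way.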
Model Evaluation
Which evaluation metric is the best for nuanced meaning between reference and generated text?
BERT: it uses semantics, not n-grams
Model Evaluation
What is Perplexity metric?
How well the model predicts the next token
Model Evaluation
For Perplexity, is a higher or lower score better?
Lower is better
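A minimal sketch of perplexity from the probabilities the model assigned to each actual next token (input format is an assumption for illustration):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability of the
    true next tokens. Confident, correct predictions give a low score."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

A model that spreads probability uniformly over 4 choices gets perplexity 4; assigning higher probability to the right tokens drives the score toward 1, which is why lower is better.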