Model Evaluation
What are you evaluating?
Quality Control
Model Evaluation
Examples of tasks in model evaluation?
text summarization, Q&A, text classification
Model Evaluation
What do you supply to the model for Model Evaluation?
Prompt dataset
Model Evaluation
Where can you create them?
Create custom yourself, or use curated prompt datasets
Model Evaluation
What’s in a prompt data set?
Questions and what you think a good answer should be
Model Evaluation
How do you score the model during Model Evaluation?
You don’t. The scores are calculated automatically.
Model Evaluation
How does the model calculate a score?
It doesn’t. A separate judge model scores the actual response vs. your supplied exemplar response.
Model Evaluation
What statistical methods are used to score a response?
BERTScore, F1 score, and others
Model Evaluation
What’s an example of something Benchmark tests can help detect?
Bias and discrimination
Model Evaluation
Metrics to eval an FM?
ROUGE, BLEU, BERTScore
Model Evaluation
What is ROUGE?
Recall-Oriented Understudy for Gisting Evaluation: evaluates automatic summarization and machine translation
Model Evaluation
What is ROUGE-N?
Measures the number of matching n-grams between the reference and generated text
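A minimal Python sketch of the ROUGE-N recall idea (function names are illustrative; this is not the official ROUGE implementation):

```python
from collections import Counter

def rouge_n(reference, generated, n=2):
    """Recall-oriented n-gram overlap: matching n-grams divided by
    the number of n-grams in the reference. Simplified sketch."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref = ngrams(reference.split(), n)
    gen = ngrams(generated.split(), n)
    overlap = sum((ref & gen).values())  # clipped count of shared n-grams
    return overlap / max(sum(ref.values()), 1)
```

Identical texts score 1.0; texts with no shared n-grams score 0.0.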
Model Evaluation
What is ROUGE-L?
Longest common subsequence (LCS) shared by the reference and generated text
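A minimal sketch of the LCS computation behind ROUGE-L (illustrative names, simplified recall-only score):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists,
    via standard dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_recall(reference, generated):
    # LCS length normalized by the reference length (recall side of ROUGE-L)
    ref, gen = reference.split(), generated.split()
    return lcs_length(ref, gen) / max(len(ref), 1)
```

Unlike ROUGE-N, the matched words need not be contiguous, only in order.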
Model Evaluation
What is BLEU?
Bilingual Evaluation Understudy
Model Evaluation
What is BLEU used for?
Quality of generated text, especially for translations
Model Evaluation
How does BLEU work?
Looks at a combination of n-gram precisions, with a brevity penalty for short outputs
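A simplified sentence-level BLEU sketch (geometric mean of clipped n-gram precisions times a brevity penalty; not the full corpus-level algorithm):

```python
import math
from collections import Counter

def bleu(reference, generated, max_n=4):
    """Simplified BLEU: geometric mean of clipped 1..max_n-gram
    precisions, scaled by a brevity penalty."""
    ref, gen = reference.split(), generated.split()
    precisions = []
    for n in range(1, max_n + 1):
        ref_c = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        gen_c = Counter(tuple(gen[i:i + n]) for i in range(len(gen) - n + 1))
        total = sum(gen_c.values())
        match = sum((ref_c & gen_c).values())  # clipped matches
        if total == 0 or match == 0:
            return 0.0
        precisions.append(match / total)
    # Brevity penalty: punish generated text shorter than the reference
    bp = 1.0 if len(gen) >= len(ref) else math.exp(1 - len(ref) / len(gen))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```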
Model Evaluation
What is BERT?
Semantic similarity: looks at how close the /meanings/ of the reference and generated text are
Model Evaluation
What’s the special techie term for what BERT does?
Computes the cosine similarity between embedding vectors
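A minimal sketch of cosine similarity between two embedding vectors (the real metric compares BERT token embeddings; the tiny vectors here are just illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 for same direction,
    0.0 for orthogonal, regardless of vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Because it ignores magnitude, two texts can score high even when worded differently, as long as their embeddings point the same way.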
Model Evaluation
Which evaluation metric is the best for nuanced meaning between reference and generated text?
BERT: it uses semantics, not n-grams
Model Evaluation
What is Perplexity metric?
How well the model predicts the next token
Model Evaluation
For Perplexity, is a higher or lower score better?
Lower is better
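A minimal sketch of perplexity from the probabilities the model assigned to each actual next token (input format is an assumption for illustration):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability of the
    true next tokens. Confident, correct predictions give a low score."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

A model that spreads probability uniformly over 4 choices gets perplexity 4; assigning higher probability to the right tokens drives the score toward 1, which is why lower is better.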