Model Evaluation (1) Flashcards

(21 cards)

1
Q

Model Evaluation

What are you evaluating?

A

Quality Control

2
Q

Model Evaluation

Examples of tasks in model evaluation?

A

text summarization, Q&A, text classification

3
Q

Model Evaluation

What do you supply to the model for Model Evaluation?

A

A prompt dataset

4
Q

Model Evaluation

Where do prompt datasets come from?

A

Create a custom one yourself, or use curated prompt datasets

5
Q

Model Evaluation

What’s in a prompt data set?

A

Questions paired with what you think a good answer should be

6
Q

Model Evaluation

How do you score the model during Model Evaluation?

A

You don’t. The scores are calculated automatically.

7
Q

Model Evaluation

How does the model calculate a score?

A

It doesn’t. A separate judge model scores the actual response vs. your supplied exemplar response.

8
Q

Model Evaluation

What statistical methods are used to score a response?

A

BERTScore, F1 score, and others
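A toy sketch of a token-level F1 score: the harmonic mean of precision and recall over tokens shared between the generated answer and the reference. Plain Python; the whitespace tokenization is a simplifying assumption, not how any particular evaluation service tokenizes.

```python
from collections import Counter

def token_f1(reference: str, candidate: str) -> float:
    """F1 over tokens shared between a reference answer and a generated answer."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    # Counter intersection gives per-token overlap, capped by each side's count
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the cat sat on the mat", "the cat sat on a mat"))  # 5 of 6 tokens match
```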

9
Q

Model Evaluation

What’s an example of something Benchmark tests can help detect?

A

Bias and discrimination

10
Q

Model Evaluation

Metrics to eval an FM?

A

ROUGE, BLEU, BERTScore

11
Q

Model Evaluation

What is ROUGE?

A

Recall-Oriented Understudy for Gisting Evaluation; evaluates automatic summarization and machine translation

12
Q

Model Evaluation

What is ROUGE-N?

A

Measures the number of matching n-grams between the reference and the generated text
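As a concrete illustration, a minimal ROUGE-N recall sketch: what fraction of the reference's n-grams also appear in the generated text. This is a toy version for intuition, not the official implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(reference: str, candidate: str, n: int = 2) -> float:
    """ROUGE-N recall: matched reference n-grams / total reference n-grams."""
    ref = Counter(ngrams(reference.lower().split(), n))
    cand = Counter(ngrams(candidate.lower().split(), n))
    matched = sum((ref & cand).values())
    total = sum(ref.values())
    return matched / total if total else 0.0

print(rouge_n("the cat sat on the mat", "the cat sat on a mat", n=2))  # 3 of 5 bigrams match
```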

13
Q

Model Evaluation

What is ROUGE-L?

A

The longest common subsequence shared by the reference and the generated text
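A toy ROUGE-L sketch using the classic dynamic-programming longest-common-subsequence length, scored as recall against the reference. Again an illustration, not the official metric code.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_recall(reference: str, candidate: str) -> float:
    """ROUGE-L recall: LCS length / reference length."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    return lcs_length(ref, cand) / len(ref) if ref else 0.0

print(rouge_l_recall("the cat sat on the mat", "the cat sat on a mat"))  # LCS of 5 tokens
```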

14
Q

Model Evaluation

What is BLEU?

A

Bilingual Evaluation Understudy

15
Q

Model Evaluation

What is BLEU used for?

A

Quality of generated text, especially for translations

16
Q

Model Evaluation

How does BLEU work?

A

Looks at the precision of matching n-grams across several n-gram sizes, with a penalty for overly short output
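A simplified single-reference BLEU sketch: the geometric mean of clipped n-gram precisions times a brevity penalty. Real BLEU adds smoothing and multiple references; this toy version only shows the mechanics.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference: str, candidate: str, max_n: int = 2) -> float:
    """Simplified BLEU: geometric mean of clipped n-gram precisions x brevity penalty."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(ref, n))
        cand_counts = Counter(ngrams(cand, n))
        # Clip each candidate n-gram's count at its count in the reference
        clipped = sum((cand_counts & ref_counts).values())
        total = max(sum(cand_counts.values()), 1)
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on a mat"))
```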

17
Q

Model Evaluation

What is BERTScore?

A

Semantic similarity: looks at how close the meanings of the two texts are to each other

18
Q

Model Evaluation

What’s the special techie term for what BERTScore does?

A

Computes the cosine similarity between embedding vectors
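Cosine similarity itself is simple: the dot product of two vectors divided by the product of their lengths, so parallel vectors score 1.0 and unrelated (orthogonal) vectors score 0. A sketch with hand-made vectors standing in for real embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors -> 0.0
```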

19
Q

Model Evaluation

Which evaluation metric is best for nuanced meaning between reference and generated text?

A

BERTScore: it uses semantics, not n-grams

20
Q

Model Evaluation

What is the Perplexity metric?

A

How well the model predicts the next token

21
Q

Model Evaluation

For Perplexity, is a higher or lower score better?

A

Lower is better
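Perplexity can be computed as the exponential of the average negative log probability the model assigned to each actual next token, which is why lower is better: a perplexity of 4 means the model was, on average, as uncertain as a uniform 4-way guess. A toy sketch over hypothetical per-token probabilities:

```python
import math

def perplexity(token_probs):
    """exp of the average negative log probability of each actual next token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that always gives the correct next token probability 0.25:
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0 -- as uncertain as a 4-way guess
# A perfect model (probability 1.0 every time) has perplexity 1.0, the minimum.
print(perplexity([1.0, 1.0, 1.0]))
```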