3 Software testing types:
Best practices for testing
What is the testing pyramid?
Write more unit tests than integration tests, and more integration tests than e2e tests (unit > integration > e2e)
A common split is roughly 70/20/10
Unit tests are faster, more reliable, and better at isolating failures.
Solitary testing
Doesn’t rely on real data from other units, so you make up the data and test with it. It’s good for testing exactly what you want
Sociable testing
Assumes that the other modules are working and tests with their real outputs
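As a sketch of the two styles (the `total_price` and `tax_rate` names are made up for illustration):

```python
# Solitary vs. sociable unit tests for a hypothetical `total_price`
# function that depends on a `tax_rate` lookup.
def tax_rate(region):
    # The real dependency; in a real codebase this might hit a config store.
    return {"us": 0.07, "eu": 0.20}[region]

def total_price(amount, region, rate_fn=tax_rate):
    return round(amount * (1 + rate_fn(region)), 2)

def test_total_price_solitary():
    # Solitary: replace the dependency with made-up data.
    fake_rate = lambda region: 0.10
    assert total_price(100.0, "anywhere", rate_fn=fake_rate) == 110.0

def test_total_price_sociable():
    # Sociable: assume tax_rate works and test through it.
    assert total_price(100.0, "us") == 107.0

test_total_price_solitary()
test_total_price_sociable()
```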
Test coverage
Shows how many lines of code are being tested
Good for finding areas that are not tested.
But it can be misleading, because it doesn’t measure the quality of the tests, which is what we really care about.
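A contrived sketch of why coverage alone can mislead: the weak test below executes every line of `clamp` (so coverage reports 100%), yet it asserts nothing, so the planted bug survives.

```python
def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return lo  # planted bug: should return hi; coverage won't notice
    return x

def weak_test():
    # Touches all three branches, so line coverage reports 100%...
    clamp(-1, 0, 10)
    clamp(20, 0, 10)
    clamp(5, 0, 10)
    # ...but with no assertions, the bug above slips through.

def strong_test():
    # This assertion would actually catch the bug (and fail here).
    assert clamp(20, 0, 10) == 10

weak_test()  # passes despite the bug
```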
Test driven development
You first write your tests, then you write small pieces of code that just make the last test pass, then check against the bigger tests, and iterate.
(Not sure how accurate this is, but the idea is simple)
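A toy sketch of one red-green iteration (the `slugify` example is made up):

```python
# Step 1: the test comes first; it fails while there is no implementation.
def test_slugify_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"

# Step 2: the smallest implementation that makes that one test pass.
def slugify(title):
    return title.lower().replace(" ", "-")

# Step 3: run the test; it's green. The next iteration would add a new
# failing test (e.g., stripping punctuation) and extend the code just enough.
test_slugify_lowercases_and_hyphenates()
```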
Testing in production, why and how?
Why - most bugs will not be caught beforehand anyway, it’s inevitable. So you might as well build a system that surfaces errors quickly and clearly so you can fix them once the code is out.
How:
CI/CD
Testing done by a SaaS as a cloud job, running against the code once it is pushed.
The best free and easy option is GitHub Actions
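As a sketch, a minimal GitHub Actions workflow that runs a Python test suite on every push might look like this (file path, job name, and versions are illustrative):

```yaml
# .github/workflows/tests.yml (illustrative)
name: tests
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install pytest
      - run: pytest
```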
Testing only the machine learning model and not the whole system is not enough, why?
The model itself is just a small piece of the system, which includes:
Training system → model → prediction system → serving system → production data → labeling system → storage and preprocessing system → and back to the start
So each one of these steps should be tested and monitored
Infrastructure tests - unit test for training code
Goal: avoid bugs in the training pipeline
How:
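One common sketch of such a unit test, framework-free here for illustration: run a few gradient steps on a single fixed batch and assert that the loss goes down.

```python
# Toy stand-in for "unit test the training code": fit y = w * x by
# gradient descent on one fixed batch and check the loss decreases.
def training_step(w, batch, lr=0.1):
    # One gradient-descent step for mean squared error on y = w * x.
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return w - lr * grad

def test_loss_decreases_on_one_batch():
    batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # true w = 2
    loss = lambda w: sum((w * x - y) ** 2 for x, y in batch) / len(batch)
    w = 0.0
    before = loss(w)
    for _ in range(10):
        w = training_step(w, batch)
    assert loss(w) < before  # the training code is at least able to learn

test_loss_decreases_on_one_batch()
```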
Integration test - test the step between the data system and the training system
Goal: make sure training is reproducible
How:
Take a piece of the dataset and run a training run :)
Then check to make sure that the performance remains consistent
Consider pulling a sliding window of data (e.g., the data from the last week…)
Run it periodically
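The steps above can be sketched like this (`train_and_score`, the reference score, and the tolerance are all illustrative stand-ins):

```python
import random

REFERENCE_SCORE = 0.75   # metric recorded from a known-good run
TOLERANCE = 0.05         # allowed drift before the test fails

def train_and_score(dataset_slice, seed=0):
    # Stand-in for "run a training run on a slice and evaluate it";
    # seeding makes the run reproducible.
    random.seed(seed)
    return 0.75 + random.uniform(-0.01, 0.01)

def test_training_is_reproducible():
    # Same slice + same seed should reproduce the reference performance.
    score = train_and_score(dataset_slice="last_week", seed=0)
    assert abs(score - REFERENCE_SCORE) <= TOLERANCE

test_training_is_reproducible()
```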
Functionality tests - unit test for the prediction code
Goal: avoid bugs in the code that makes up the prediction infrastructure
How:
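A hedged sketch of what such unit tests might check (the `predict` function here is a stand-in): output type, output range, and determinism on a single well-formed input.

```python
def predict(features):
    # Stand-in for the real prediction function; returns a probability.
    score = sum(features) / (len(features) * 10)
    return min(max(score, 0.0), 1.0)

def test_predict_returns_valid_probability():
    out = predict([1.0, 2.0, 3.0])
    assert isinstance(out, float)
    assert 0.0 <= out <= 1.0

def test_predict_is_deterministic():
    x = [1.0, 2.0, 3.0]
    assert predict(x) == predict(x)

test_predict_returns_valid_probability()
test_predict_is_deterministic()
```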
Evaluation tests, goal and how?
Goal:
make sure a new model is ready to go into production
How:
Behavioral tests using metrics (part of evaluation tests):
Goal:
Make sure the model has the invariances we expect. Meaning - does it perform the way we expect on perturbations of the data (deviations of the datasets)?
Types:
Invariance tests: assert that a change in the input shouldn’t affect the output (if we change a city name in sentiment analysis it should give the same result)
Directional tests: assert that a change in the input should change the output (like changing a negative word to a positive one in sentiment analysis)
Minimum functionality tests - certain inputs should always produce a given result
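The three test types can be sketched against a toy lexicon “model” (the model and word lists are made up for illustration):

```python
POSITIVE, NEGATIVE = {"great", "good", "love"}, {"bad", "awful", "hate"}

def sentiment(text):
    # Toy sentiment "model": count positive vs. negative words.
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "pos" if score > 0 else "neg" if score < 0 else "neutral"

def test_invariance():
    # Changing a city name should not change the prediction.
    assert sentiment("great food in paris") == sentiment("great food in london")

def test_directional():
    # Swapping a negative word for a positive one should flip the output.
    assert sentiment("the service was awful") == "neg"
    assert sentiment("the service was great") == "pos"

def test_min_functionality():
    # Certain simple inputs should always produce a fixed result.
    assert sentiment("i love it") == "pos"

test_invariance()
test_directional()
test_min_functionality()
```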
Robustness metrics (part of evaluation testing) goal and 4 tests:
Goal:
Understand the performance envelope, i.e. where would you expect the model to fail?
Privacy and fairness metrics (part of evaluation tests)
Make sure that you test on the different groups and see the results for each of them
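A minimal sketch of a per-group check (the data, group names, and gap threshold are made up):

```python
def per_group_accuracy(rows):
    # rows: iterable of (group, prediction, label) triples.
    totals, correct = {}, {}
    for group, pred, label in rows:
        totals[group] = totals.get(group, 0) + 1
        correct[group] = correct.get(group, 0) + (pred == label)
    return {g: correct[g] / totals[g] for g in totals}

rows = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 0), ("group_a", 1, 1),
    ("group_b", 0, 1), ("group_b", 0, 0), ("group_b", 1, 0), ("group_b", 0, 0),
]
acc = per_group_accuracy(rows)
# Flag the model if accuracy differs too much between groups.
gap = max(acc.values()) - min(acc.values())
assert gap <= 0.5, f"accuracy gap across groups too large: {acc}"
```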
Simulation tests (part of evaluation tests)
Goal:
Understand how the performance of the model could affect the rest of the system. For example - cars will affect other cars, recommendations will affect the users, etc.
Relevant especially in robotics
Eval tests - deeper on what to evaluate:
Tools:
What-If Tool by Google - shows the results on different slices
Shadow tests - from the prediction system to the serving system. Goal and how?
Goal:
Detect production bugs before they hit users
Detect inconsistency between the offline and online models
Detect issues that appear on production data
How:
Run the new model in the production system alongside the old model, but don’t return its predictions to users
Save the production data and run the offline model on it
Inspect the prediction distributions: old vs new, and offline vs online
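A minimal sketch of that final comparison step (the mean-based statistic and threshold are illustrative; a real system might use a KS test or population stability index instead):

```python
def mean(xs):
    return sum(xs) / len(xs)

def distributions_agree(old_preds, new_preds, max_mean_shift=0.05):
    # Compare the two prediction distributions with a simple summary
    # statistic: flag the new model if the mean prediction shifts too much.
    return abs(mean(old_preds) - mean(new_preds)) <= max_mean_shift

old = [0.20, 0.30, 0.25, 0.40]  # old model's shadow predictions
new = [0.22, 0.31, 0.24, 0.41]  # new model's shadow predictions
assert distributions_agree(old, new)
```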
A/B testing - goal and how
Goal:
Test how the rest of the system will react - how will the users and the business metrics react
How:
Start by “canarying” the model on a tiny fraction of the data.
Consider using a more statistically principled split
Compare the two cohorts
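One statistically principled way to compare the two cohorts is a two-proportion z-test; a sketch with made-up conversion counts:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    # z-statistic for the difference between two conversion rates.
    pa, pb = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)          # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))  # pooled std. error
    return (pa - pb) / se

# Canary cohort converts 260/2000, control converts 200/2000.
z = two_proportion_z(260, 2000, 200, 2000)
assert abs(z) > 1.96  # significant at the 5% level (two-sided)
```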
Labeling tests - goal and how
Goal:
Catch poor-quality labels before they corrupt your model
How:
Train and certify the labelers
Aggregate labels of multiple labelers
Assign labelers a trust score based on how often they are wrong
Manually spot check the labels from your labeling service
Run a previous model and check the labels it disagrees with most
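The aggregation and trust-score ideas above can be sketched like this (annotator names and data are made up):

```python
from collections import Counter

def majority_label(votes):
    # Aggregate multiple labelers' votes into one label.
    return Counter(votes).most_common(1)[0][0]

def trust_scores(annotations):
    # annotations: {item_id: {labeler: label}}. Trust = how often a
    # labeler agrees with the majority vote.
    majorities = {i: majority_label(list(v.values())) for i, v in annotations.items()}
    agree, total = Counter(), Counter()
    for item, votes in annotations.items():
        for labeler, label in votes.items():
            total[labeler] += 1
            agree[labeler] += (label == majorities[item])
    return {labeler: agree[labeler] / total[labeler] for labeler in total}

annotations = {
    "img1": {"ann_a": "cat", "ann_b": "cat", "ann_c": "dog"},
    "img2": {"ann_a": "dog", "ann_b": "dog", "ann_c": "dog"},
}
scores = trust_scores(annotations)
assert scores["ann_c"] < scores["ann_a"]  # ann_c disagrees more often
```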
Expectation tests - unit tests for data
Goal and how
Goal:
Catch data quality issues before they reach your pipeline
How:
Define rules about the properties of each of your data tables at each stage of your data cleaning and preprocessing
Run them when you run batch data pipeline jobs
Open source for this:
Great Expectations
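In the spirit of Great Expectations but framework-free, expectation tests might look like this sketch (the column names and rules are illustrative):

```python
# Declare rules about a data table's properties and run them per batch.
def expect_not_null(rows, column):
    return all(row.get(column) is not None for row in rows)

def expect_between(rows, column, lo, hi):
    return all(lo <= row[column] <= hi for row in rows)

batch = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": 27},
]

# Run these whenever the batch data pipeline job runs:
assert expect_not_null(batch, "user_id")
assert expect_between(batch, "age", 0, 120)
```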
Build up the testing gradually; start here: