3 Software testing types:
Best practices for testing (5)
What is the testing pyramid?
Write more unit tests than integration tests, and more integration tests than end-to-end tests (unit > integration > e2e).
A common split: 70% unit / 20% integration / 10% end-to-end.
Unit tests are faster, more reliable, and better at isolating failures.
Solitary testing
Doesn't rely on data from other units; you fabricate the data and test the unit in isolation.
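A minimal sketch of a solitary test in Python: the unit under test normally depends on another component, so we stand in for that dependency with a mock and fabricate its data. All names here (`average_order_value`, `fetch_orders`) are hypothetical.

```python
# Solitary unit test: replace the unit's dependency with a mock and
# fabricate the data it would normally fetch from elsewhere.
from unittest.mock import Mock

def average_order_value(order_store):
    """Unit under test: depends on an external order store."""
    orders = order_store.fetch_orders()
    return sum(o["total"] for o in orders) / len(orders)

def test_average_order_value_solitary():
    store = Mock()
    # Fabricated data instead of a real database or service call
    store.fetch_orders.return_value = [{"total": 10.0}, {"total": 30.0}]
    assert average_order_value(store) == 20.0

test_average_order_value_solitary()
```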
Test coverage
Shows how many lines of code are exercised by tests.
Good for finding areas that are not tested.
But it can be misleading, because it doesn't measure the quality of the tests, which is what we really care about.
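A tiny illustration of why coverage can mislead: both tests below give `apply_discount` 100% line coverage, but only the second would actually catch a bug. (`apply_discount` is an invented example.)

```python
# Coverage counts executed lines, not meaningful assertions.
def apply_discount(price, pct):
    discounted = price * (1 - pct / 100)
    return round(discounted, 2)

def test_apply_discount_weak():
    # Executes every line of apply_discount (100% coverage), but the
    # missing assertion means a bug (e.g. pct / 10) would still pass.
    apply_discount(100.0, 25)

def test_apply_discount_strong():
    # Same coverage, but actually verifies the behaviour.
    assert apply_discount(100.0, 25) == 75.0

test_apply_discount_weak()
test_apply_discount_strong()
```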
Test driven development: 3 laws
1) You may not write any production code until you have written a failing unit test
2) You may not write more of a unit test than is sufficient to fail, and failing to compile counts as failing.
3) You may not write more production code than is sufficient to pass the currently failing unit test.
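The three laws in miniature, with an invented example: the failing test is written first, then just enough production code to make it pass.

```python
# Law 1: write the failing test before any production code.
def test_slugify():
    # At this point slugify doesn't exist yet, so the test fails (law 2:
    # a test that doesn't even run counts as failing).
    assert slugify("Hello World") == "hello-world"

# Law 3: write only enough production code to pass the failing test.
def slugify(title):
    return title.lower().replace(" ", "-")

test_slugify()  # now passes; repeat the cycle for the next behaviour
```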
Testing in production, why and how?
Why: most bugs will not be caught beforehand anyway; it's inevitable. So you might as well build a system that surfaces errors quickly and clearly, so you can fix them once the code is out.
How:
CI/CD
Testing done by a SaaS service that runs your tests as a cloud job once the code is pushed.
The best free and easy option is GitHub Actions.
Testing only the machine learning model and not the whole system is not enough. Why?
The model itself is just a small piece of the system, which includes:
Training system → model → prediction system → serving system → production data → labeling system → storage and preprocessing system → and back to the start.
So each of these pieces should be tested and monitored.
Infrastructure tests - unit tests for the training code. Goal and how?
Goal: avoid bugs in the training pipeline
How:
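The note leaves the "how" blank; one common infrastructure test (an assumption here, not taken from the notes) is to check that the training loop can actually drive the loss down on a tiny batch. A pure-Python toy version:

```python
# Infrastructure test sketch: verify the training loop reduces the loss
# on a single tiny batch. The 1-parameter model y = w * x is a stand-in
# for your real training code.
def train_step(w, batch, lr=0.1):
    # One gradient step for least squares on y = w * x.
    grad = sum(2 * x * (w * x - y) for x, y in batch) / len(batch)
    return w - lr * grad

def loss(w, batch):
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)

def test_loss_decreases_on_tiny_batch():
    batch = [(1.0, 2.0), (2.0, 4.0)]   # true relationship: w = 2
    w = 0.0
    before = loss(w, batch)
    for _ in range(20):
        w = train_step(w, batch)
    assert loss(w, batch) < before, "training loop failed to reduce loss"

test_loss_decreases_on_tiny_batch()
```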
Integration tests - test the step between the data system and the training system. Goal and how?
Goal: make sure training is reproducible
How:
Take a slice of the dataset and run a training run :)
Then check that the performance remains consistent.
Consider pulling a sliding window of data (e.g., the last week's data).
Run it periodically
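The steps above can be sketched as a periodic check. `train_model` and `evaluate` are stand-ins for your pipeline's real entry points, and the data is synthetic; in practice you would also compare the score against a metric recorded from a known-good run.

```python
# Integration test sketch: train on a fixed data slice and check the
# run is reproducible (same data in, same performance out).
import random

def train_model(data):
    # Toy deterministic "training": fit the mean of the labels.
    labels = [y for _, y in data]
    return sum(labels) / len(labels)

def evaluate(model, data):
    # Mean absolute error of the constant predictor.
    return sum(abs(model - y) for _, y in data) / len(data)

def test_training_is_reproducible():
    rng = random.Random(42)   # stands in for e.g. last week's sliding window
    window = [(rng.random(), rng.random()) for _ in range(100)]
    first = evaluate(train_model(window), window)
    second = evaluate(train_model(window), window)
    assert abs(first - second) < 1e-12, "training run is not reproducible"

test_training_is_reproducible()
```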
Functionality tests - unit tests for the prediction code.
What is the goal and how do you achieve it?
Goal: avoid bugs in the code that mess up the predictions.
How:
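The "how" is left blank in the notes; a common approach (an assumption here) is to unit-test the prediction function on fixed inputs, checking properties like output range and determinism rather than exact values. `predict` below is a toy stand-in.

```python
# Functionality test sketch: check prediction-code invariants on fixed inputs.
def predict(features):
    # Toy prediction function: returns a probability-like score in [0, 1].
    score = sum(features) / (len(features) or 1)
    return min(max(score, 0.0), 1.0)

def test_predict_output_range():
    for features in ([0.1, 0.9], [5.0, 5.0], [-3.0]):
        p = predict(features)
        assert 0.0 <= p <= 1.0, "prediction outside valid range"

def test_predict_is_deterministic():
    assert predict([0.2, 0.4]) == predict([0.2, 0.4])

test_predict_output_range()
test_predict_is_deterministic()
```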
Evaluation tests, goal and how?
Goal:
make sure a new model is ready to go into production
How:
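The "how" is left blank; one common evaluation gate (an assumption, not from the notes) is to require the candidate model to match or beat the current production model on a held-out evaluation set. The models, metric, and margin below are all illustrative.

```python
# Evaluation test sketch: promote a candidate model only if it is at least
# as good as the current model (within a small margin) on a held-out set.
def accuracy(predict, examples):
    return sum(predict(x) == y for x, y in examples) / len(examples)

def ready_for_production(candidate, current, eval_set, margin=0.01):
    """True if the candidate is within `margin` of the current model, or better."""
    return accuracy(candidate, eval_set) >= accuracy(current, eval_set) - margin

# Toy models: parity classifiers over integers.
eval_set = [(n, n % 2) for n in range(100)]
current = lambda n: n % 2                          # perfect: accuracy 1.00
candidate = lambda n: 1 if n == 0 else n % 2       # one mistake: accuracy 0.99

print(ready_for_production(candidate, current, eval_set))  # True: 0.99 >= 1.0 - 0.01
```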
Robustness metrics (part of evaluation testing) - goal and 4 tests:
Goal:
Understand the performance envelope, i.e., where would you expect the model to fail?
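One common robustness check (an assumption here, since the four tests aren't listed) is a perturbation test: nudge the inputs slightly and verify predictions don't flip, which maps out part of the performance envelope. The threshold model is a toy stand-in.

```python
# Robustness sketch: small input perturbations should not flip predictions
# for points that sit away from the decision boundary.
def model(x):
    # Stand-in model: a simple threshold classifier.
    return 1 if x > 0.5 else 0

def test_robust_to_small_noise():
    for x in [0.1, 0.3, 0.7, 0.9]:        # points away from the boundary at 0.5
        for eps in (-0.05, 0.05):
            assert model(x + eps) == model(x), f"prediction flipped near x={x}"

test_robust_to_small_noise()
```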
Shadow tests - prediction system to serving system. Goals (3) and how (3)?
Goal:
1) Detect production bugs before they hit users
2) Detect inconsistency between the offline and online models
3) Detect issues that appear on production data
How:
1) Run the new model in the production system alongside the old model, but don't return its predictions to users.
2) Save the production data and run the offline model on it.
3) Inspect the prediction distributions: old vs. new, and offline vs. online.
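Step 3 can be sketched as a simple distribution comparison: summarize each model's predictions and flag large shifts. The tolerance and summary statistic here are assumptions; in practice you might use a proper two-sample test.

```python
# Shadow test sketch: compare prediction distributions between the
# reference (offline/old) model and the shadowed (online/new) model.
def summarize(preds):
    mean = sum(preds) / len(preds)
    var = sum((p - mean) ** 2 for p in preds) / len(preds)
    return mean, var

def distributions_consistent(ref_preds, shadow_preds, mean_tol=0.05):
    ref_mean, _ = summarize(ref_preds)
    shadow_mean, _ = summarize(shadow_preds)
    return abs(ref_mean - shadow_mean) <= mean_tol

offline = [0.2, 0.4, 0.6, 0.8]        # old model, replayed on saved data
online = [0.22, 0.41, 0.58, 0.83]     # new model, shadowing in production

print(distributions_consistent(offline, online))  # True: means differ by ~0.01
```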
A/B testing - goal and how
Goal:
Test how the rest of the system will react - how will the users and the business metrics react?
How:
Start by “canarying” model on a tiny fraction of data.
Consider using a more statistically principled split.
Compare the two cohorts
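One statistically principled way to compare the two cohorts (an illustrative choice, not prescribed by the notes) is a two-proportion z-test on a success metric such as click-through rate. The numbers below are made up.

```python
# A/B test sketch: two-proportion z-test comparing a control cohort (old
# model) against a canary cohort (new model).
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    p_a, p_b = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)          # pooled success rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Cohort A: old model (control); cohort B: new model (canary).
z = two_proportion_z(success_a=200, n_a=1000, success_b=240, n_b=1000)
print(abs(z) > 1.96)  # True: the difference is significant at ~5%
```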
Labeling tests - goal and how(5)
Goal:
Catch poor-quality labels before they corrupt your model.
How:
1) Train and certify the labelers
2) Aggregate labels of multiple labelers
3) Assign labelers a trust score based on how often they are wrong
4) Manually spot check the labels from your labeling service
5) Run a previous model on the data and check the biggest disagreements.
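Points 2 and 3 can be sketched together: aggregate labels with a trust-weighted vote, then lower a labeler's trust when they disagree with the consensus. Names, weights, and the penalty scheme are all illustrative assumptions.

```python
# Labeling test sketch: trust-weighted label aggregation plus trust updates.
from collections import defaultdict

def aggregate(labels, trust):
    """labels: {labeler: label}; trust: {labeler: weight}. Returns winning label."""
    votes = defaultdict(float)
    for labeler, label in labels.items():
        votes[label] += trust.get(labeler, 1.0)
    return max(votes, key=votes.get)

def update_trust(labels, trust, consensus, penalty=0.1):
    # Labelers who disagree with the consensus lose some trust.
    for labeler, label in labels.items():
        if label != consensus:
            trust[labeler] = max(0.0, trust[labeler] - penalty)

trust = {"ann": 1.0, "bob": 1.0, "eve": 0.3}          # eve is often wrong
labels = {"ann": "cat", "bob": "cat", "eve": "dog"}
consensus = aggregate(labels, trust)                   # "cat" (2.0 vs 0.3)
update_trust(labels, trust, consensus)                 # eve's trust drops further
```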
Expectation tests - unit tests for data
Goal and how
Goal:
Catch data quality issues before they reach your pipeline
How:
Define rules about the properties of each of your data tables at each stage of your data cleaning and preprocessing.
Run them when you run batch data pipeline jobs
Build up the testing gradually; start small.
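A minimal sketch of expectation tests in plain Python: declare rules about each column and run them inside the batch pipeline. (Libraries such as Great Expectations formalize this pattern; the rules and column names below are invented for illustration.)

```python
# Expectation test sketch: per-column rules checked against every row.
def check_expectations(rows, expectations):
    """rows: list of dicts; expectations: {column: predicate}. Returns failures."""
    failures = []
    for i, row in enumerate(rows):
        for col, ok in expectations.items():
            if col not in row or not ok(row[col]):
                failures.append((i, col))
    return failures

expectations = {
    "age": lambda v: isinstance(v, int) and 0 <= v < 130,
    "email": lambda v: isinstance(v, str) and "@" in v,
}
rows = [
    {"age": 34, "email": "a@example.com"},
    {"age": -1, "email": "not-an-email"},   # violates both rules
]

print(check_expectations(rows, expectations))  # [(1, 'age'), (1, 'email')]
```

Running a check like this at each stage of cleaning and preprocessing catches bad batches before they reach training.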