3 Software testing types:
Best practices for testing
What is the testing pyramid?
Write more unit tests than integration tests, and more integration tests than e2e tests (unit > integration > e2e)
A common split is roughly 70/20/10
Unit tests are faster, more reliable, and better at isolating failures.
Solitary testing
Doesn’t rely on real data from other units, so you make up the data and test with it. It’s good for testing exactly what you want
Sociable testing
Assumes that the other modules are working and tests with their real outputs
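As a sketch of the two styles (the `total_price` and `tax_rate` names are made up for illustration):

```python
# Solitary vs. sociable unit tests for a hypothetical `total_price`
# function that depends on a `tax_rate` lookup.
def tax_rate(region):
    # The real dependency; in a real codebase this might hit a config store.
    return {"us": 0.07, "eu": 0.20}[region]

def total_price(amount, region, rate_fn=tax_rate):
    return round(amount * (1 + rate_fn(region)), 2)

def test_total_price_solitary():
    # Solitary: replace the dependency with made-up data.
    fake_rate = lambda region: 0.10
    assert total_price(100.0, "anywhere", rate_fn=fake_rate) == 110.0

def test_total_price_sociable():
    # Sociable: assume tax_rate works and test through it.
    assert total_price(100.0, "us") == 107.0

test_total_price_solitary()
test_total_price_sociable()
```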
Test coverage
Shows how many lines of code are being tested
Good for finding areas that are not tested.
But it can be misleading, because it doesn’t measure the quality of the tests, which is what we really care about.
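A contrived sketch of why coverage alone can mislead: the weak test below executes every line of `clamp` (so coverage reports 100%), yet it asserts nothing, so the planted bug survives.

```python
def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return lo  # planted bug: should return hi; coverage won't notice
    return x

def weak_test():
    # Touches all three branches, so line coverage reports 100%...
    clamp(-1, 0, 10)
    clamp(20, 0, 10)
    clamp(5, 0, 10)
    # ...but with no assertions, the bug above slips through.

def strong_test():
    # This assertion would actually catch the bug (and fail here).
    assert clamp(20, 0, 10) == 10

weak_test()  # passes despite the bug
```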
Test driven development
You first write your tests, then you write small pieces of code that just make the last test pass, then check against the bigger tests, and iterate.
(Not sure how accurate this is, but the idea is simple)
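A toy sketch of one red-green iteration (the `slugify` example is made up):

```python
# Step 1: the test comes first; it fails while there is no implementation.
def test_slugify_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"

# Step 2: the smallest implementation that makes that one test pass.
def slugify(title):
    return title.lower().replace(" ", "-")

# Step 3: run the test; it's green. The next iteration would add a new
# failing test (e.g., stripping punctuation) and extend the code just enough.
test_slugify_lowercases_and_hyphenates()
```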
Testing in production, why and how?
Why - most bugs will not be caught beforehand anyway, it’s inevitable. So you might as well build a system that surfaces errors quickly and clearly so you can fix them once the code is out.
How:
CI/CD
Testing done by a SaaS as a cloud job, running against the code once it is pushed.
The best free and easy option is GitHub Actions
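As a sketch, a minimal GitHub Actions workflow that runs a Python test suite on every push might look like this (file path, job name, and versions are illustrative):

```yaml
# .github/workflows/tests.yml (illustrative)
name: tests
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install pytest
      - run: pytest
```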
Testing only the machine learning model and not the whole system is not enough, why?
The model itself is just a small piece of the system, which includes:
Training system → model → prediction system → serving system → production data → labeling system → storage and preprocessing system → and back to the start
So each one of these steps should be tested and monitored
Infrastructure tests - unit test for training code
Goal: avoid bugs in the training pipeline
How:
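One common sketch of such a unit test, framework-free here for illustration: run a few gradient steps on a single fixed batch and assert that the loss goes down.

```python
# Toy stand-in for "unit test the training code": fit y = w * x by
# gradient descent on one fixed batch and check the loss decreases.
def training_step(w, batch, lr=0.1):
    # One gradient-descent step for mean squared error on y = w * x.
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return w - lr * grad

def test_loss_decreases_on_one_batch():
    batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # true w = 2
    loss = lambda w: sum((w * x - y) ** 2 for x, y in batch) / len(batch)
    w = 0.0
    before = loss(w)
    for _ in range(10):
        w = training_step(w, batch)
    assert loss(w) < before  # the training code is at least able to learn

test_loss_decreases_on_one_batch()
```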
Integration test - test the step between the data system and the training system
Goal: make sure training is reproducible
How:
Take a piece of the dataset and run a training run :)
Then check to make sure that the performance remains consistent
Consider pulling a sliding window of data (e.g., the data from the last week…)
Run it periodically
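The steps above can be sketched like this (`train_and_score`, the reference score, and the tolerance are all illustrative stand-ins):

```python
import random

REFERENCE_SCORE = 0.75   # metric recorded from a known-good run
TOLERANCE = 0.05         # allowed drift before the test fails

def train_and_score(dataset_slice, seed=0):
    # Stand-in for "run a training run on a slice and evaluate it";
    # seeding makes the run reproducible.
    random.seed(seed)
    return 0.75 + random.uniform(-0.01, 0.01)

def test_training_is_reproducible():
    # Same slice + same seed should reproduce the reference performance.
    score = train_and_score(dataset_slice="last_week", seed=0)
    assert abs(score - REFERENCE_SCORE) <= TOLERANCE

test_training_is_reproducible()
```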
Functionality tests - unit test for the prediction code
Goal: avoid bugs in the code that makes up the prediction infrastructure
How:
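A hedged sketch of what such unit tests might check (the `predict` function here is a stand-in): output type, output range, and determinism on a single well-formed input.

```python
def predict(features):
    # Stand-in for the real prediction function; returns a probability.
    score = sum(features) / (len(features) * 10)
    return min(max(score, 0.0), 1.0)

def test_predict_returns_valid_probability():
    out = predict([1.0, 2.0, 3.0])
    assert isinstance(out, float)
    assert 0.0 <= out <= 1.0

def test_predict_is_deterministic():
    x = [1.0, 2.0, 3.0]
    assert predict(x) == predict(x)

test_predict_returns_valid_probability()
test_predict_is_deterministic()
```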
Evaluation tests, goal and how?
Goal:
make sure a new model is ready to go into production
How:
Behavioral tests using metrics (part of evaluation tests):
Goal:
Make sure the model has the invariances we expect. Meaning - does it perform the way we expect on perturbations of the data (deviations of the datasets)?
Types:
Invariance tests: assert that a change in the input shouldn’t affect the output (if we change a city name in sentiment analysis it should give the same result)
Directional tests: assert that a change in the input should change the output (like changing a negative word to a positive one in sentiment analysis)
Minimum functionality tests - certain inputs should always produce a given result
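The three test types can be sketched against a toy lexicon “model” (the model and word lists are made up for illustration):

```python
POSITIVE, NEGATIVE = {"great", "good", "love"}, {"bad", "awful", "hate"}

def sentiment(text):
    # Toy sentiment "model": count positive vs. negative words.
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "pos" if score > 0 else "neg" if score < 0 else "neutral"

def test_invariance():
    # Changing a city name should not change the prediction.
    assert sentiment("great food in paris") == sentiment("great food in london")

def test_directional():
    # Swapping a negative word for a positive one should flip the output.
    assert sentiment("the service was awful") == "neg"
    assert sentiment("the service was great") == "pos"

def test_min_functionality():
    # Certain simple inputs should always produce a fixed result.
    assert sentiment("i love it") == "pos"

test_invariance()
test_directional()
test_min_functionality()
```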
Robustness metrics (part of evaluation testing) goal and 4 tests:
Goal:
Understand the performance envelope, i.e. where would you expect the model to fail?
Privacy and fairness metrics (part of evaluation tests)
Make sure that you test on the different groups and see the results for each of them
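A minimal sketch of a per-group check (the data, group names, and gap threshold are made up):

```python
def per_group_accuracy(rows):
    # rows: iterable of (group, prediction, label) triples.
    totals, correct = {}, {}
    for group, pred, label in rows:
        totals[group] = totals.get(group, 0) + 1
        correct[group] = correct.get(group, 0) + (pred == label)
    return {g: correct[g] / totals[g] for g in totals}

rows = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 0), ("group_a", 1, 1),
    ("group_b", 0, 1), ("group_b", 0, 0), ("group_b", 1, 0), ("group_b", 0, 0),
]
acc = per_group_accuracy(rows)
# Flag the model if accuracy differs too much between groups.
gap = max(acc.values()) - min(acc.values())
assert gap <= 0.5, f"accuracy gap across groups too large: {acc}"
```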
Simulation tests (part of evaluation tests)
Goal:
Understand how the performance of the model could affect the rest of the system. For example - cars will affect other cars, recommendations will affect the users, etc.
Relevant especially in robotics
Eval tests - deeper on what to evaluate:
Tools:
What-If Tool by Google - shows the results on different slices
Shadow tests - from the prediction system to the serving system. Goal and how?
Goal:
Detect production bugs before they hit users
Detect inconsistency between the offline and online models
Detect issues that appear on production data
How:
Run the new model in the production system alongside the old model, but don’t return its predictions to users
Save the production data and run the offline model on it
Inspect the prediction distributions: old vs new, and offline vs online
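A minimal sketch of that final comparison step (the mean-based statistic and threshold are illustrative; a real system might use a KS test or population stability index instead):

```python
def mean(xs):
    return sum(xs) / len(xs)

def distributions_agree(old_preds, new_preds, max_mean_shift=0.05):
    # Compare the two prediction distributions with a simple summary
    # statistic: flag the new model if the mean prediction shifts too much.
    return abs(mean(old_preds) - mean(new_preds)) <= max_mean_shift

old = [0.20, 0.30, 0.25, 0.40]  # old model's shadow predictions
new = [0.22, 0.31, 0.24, 0.41]  # new model's shadow predictions
assert distributions_agree(old, new)
```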
A/B testing - goal and how
Goal:
Test how the rest of the system will react - how will the users and the business metrics react
How:
Start by “canarying” the model on a tiny fraction of the data.
Consider using a more statistically principled split
Compare the two cohorts
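One statistically principled way to compare the two cohorts is a two-proportion z-test; a sketch with made-up conversion counts:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    # z-statistic for the difference between two conversion rates.
    pa, pb = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)          # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))  # pooled std. error
    return (pa - pb) / se

# Canary cohort converts 260/2000, control converts 200/2000.
z = two_proportion_z(260, 2000, 200, 2000)
assert abs(z) > 1.96  # significant at the 5% level (two-sided)
```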
Labeling tests - goal and how
Goal:
Catch poor-quality labels before they corrupt your model
How:
Train and certify the labelers
Aggregate labels of multiple labelers
Assign labelers a trust score based on how often they are wrong
Manually spot check the labels from your labeling service
Run a previous model and check the labels it disagrees with most
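The aggregation and trust-score ideas above can be sketched like this (annotator names and data are made up):

```python
from collections import Counter

def majority_label(votes):
    # Aggregate multiple labelers' votes into one label.
    return Counter(votes).most_common(1)[0][0]

def trust_scores(annotations):
    # annotations: {item_id: {labeler: label}}. Trust = how often a
    # labeler agrees with the majority vote.
    majorities = {i: majority_label(list(v.values())) for i, v in annotations.items()}
    agree, total = Counter(), Counter()
    for item, votes in annotations.items():
        for labeler, label in votes.items():
            total[labeler] += 1
            agree[labeler] += (label == majorities[item])
    return {labeler: agree[labeler] / total[labeler] for labeler in total}

annotations = {
    "img1": {"ann_a": "cat", "ann_b": "cat", "ann_c": "dog"},
    "img2": {"ann_a": "dog", "ann_b": "dog", "ann_c": "dog"},
}
scores = trust_scores(annotations)
assert scores["ann_c"] < scores["ann_a"]  # ann_c disagrees more often
```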
Expectation tests - unit tests for data
Goal and how
Goal:
Catch data quality issues before they reach your pipeline
How:
Define rules about the properties of each of your data tables at each stage of your data cleaning and preprocessing
Run them when you run batch data pipeline jobs
Open source for this:
Great Expectations
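In the spirit of Great Expectations but framework-free, expectation tests might look like this sketch (the column names and rules are illustrative):

```python
# Declare rules about a data table's properties and run them per batch.
def expect_not_null(rows, column):
    return all(row.get(column) is not None for row in rows)

def expect_between(rows, column, lo, hi):
    return all(lo <= row[column] <= hi for row in rows)

batch = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": 27},
]

# Run these whenever the batch data pipeline job runs:
assert expect_not_null(batch, "user_id")
assert expect_between(batch, "age", 0, 120)
```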
Build up the testing gradually; start here: