What steps and primary questions comprise the data wrangling process?
What are some examples of the definition of a population (in data terms)?
All users on Facebook.
All US users on Facebook.
All US users on Facebook in the last month.
All the watermelons in the back of the truck.
All the watermelons greater than 5lbs in the back of the truck.
Etc…
How do we obtain data from a population?
Sampling
What are two simple probability-based methods for sampling?
1. Simple Random Sampling
2. Stratified Random Sampling
What is simple random sampling of a population?
Every observation from the population has the same chance of being sampled
What is stratified random sampling of a population?
Population is partitioned into groups and then a simple random sampling approach is applied within each group.
Example: In the watermelons in the back of the truck example, we could partition into 3 groups: (1) less than 5lbs, (2) greater than 5lbs but less than 10lbs, and (3) greater than 10lbs. We could then randomly sample within each group. This is stratified random sampling.
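The two sampling methods above can be sketched in a few lines of Python. This is a minimal illustration using the watermelon example (the weights and group names are made up for the demo):

```python
import random

def simple_random_sample(population, k):
    # Every observation has the same chance of being chosen.
    return random.sample(population, k)

def stratified_random_sample(population, stratum_fn, k_per_stratum):
    # Partition the population into groups, then simple-random-sample
    # within each group.
    strata = {}
    for item in population:
        strata.setdefault(stratum_fn(item), []).append(item)
    sample = []
    for group in strata.values():
        sample.extend(random.sample(group, min(k_per_stratum, len(group))))
    return sample

# Watermelon weights in lbs; strata: <5, 5-10, >10
melons = [3.1, 4.2, 6.5, 7.8, 9.9, 11.0, 12.4, 4.9, 8.3, 13.7]

def stratum(w):
    return "light" if w < 5 else ("medium" if w <= 10 else "heavy")

print(stratified_random_sample(melons, stratum, 2))
```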
What are some best practices for data wrangling?
What is Cross Validation (CV)?
A method for estimating prediction error.
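A minimal sketch of how K-fold CV splits the data (in practice, a library helper like scikit-learn's `KFold` is the usual choice): each fold serves once as the held-out validation set while the remaining folds are used for training, and the validation errors are averaged to estimate prediction error.

```python
def k_fold_indices(n, k):
    # Split indices 0..n-1 into k roughly equal folds. Yield one
    # (train, validation) index split per fold.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

for train, val in k_fold_indices(10, 5):
    print(train, val)
```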
Grid search is always better than random search when trying to optimize hyperparameters? (True/False)
False. A 2012 paper by Bergstra and Bengio found that random search is often just as good as, if not better than, grid search.
What are two methods for handling class imbalances?
1. Oversampling the minority class
2. Undersampling the majority class
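As an illustration, random oversampling of the minority class (one common remedy for class imbalance) can be sketched as follows. This is a minimal version; libraries like imbalanced-learn provide more sophisticated variants.

```python
import random

def random_oversample(X, y, seed=0):
    # Duplicate examples from smaller classes at random until every class
    # has as many examples as the largest class.
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(items) for items in by_class.values())
    X_out, y_out = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for xi in items + extra:
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out

print(random_oversample([1, 2, 3, 4, 5], [0, 0, 0, 0, 1]))
```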
What is one type of plot we can use to gauge the confidence a model has in its prediction?
Calibration plot
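The points on a calibration plot can be computed as sketched below: bin the predicted probabilities, and in each bin compare the mean predicted probability to the observed fraction of positives. A well-calibrated model's points sit near the diagonal. (In practice a helper like scikit-learn's `calibration_curve` does this.)

```python
def calibration_bins(probs, labels, n_bins=5):
    # Bin predicted probabilities into n_bins equal-width bins; for each
    # non-empty bin return (mean predicted prob, observed positive rate).
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    points = []
    for b in bins:
        if b:
            mean_pred = sum(p for p, _ in b) / len(b)
            frac_pos = sum(y for _, y in b) / len(b)
            points.append((mean_pred, frac_pos))
    return points

print(calibration_bins([0.05, 0.15, 0.85, 0.95], [0, 0, 1, 1], n_bins=2))
```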
It isn’t necessary to include a datasheet when creating a new dataset? (True/False)
False. It can be very helpful to future researchers (including yourself!) to understand how the dataset was constructed.
What things should be included in a datasheet for a dataset?
What are the three steps in the Data Cleaning process for ML?
What are three mechanisms that can cause missing data?
Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR).
What are some ways we can fix missing data?
1. Deletion (drop rows or columns with missing values)
2. Imputation (mean/median, using a learned model to predict, etc.)
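Mean imputation, the simplest of these fixes, can be sketched as below (missing values are represented as `None` for the demo):

```python
def mean_impute(column):
    # Replace missing entries (None) with the mean of the observed values.
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

print(mean_impute([1.0, None, 3.0]))  # → [1.0, 2.0, 3.0]
```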
What are some examples of the data transformation step in the data cleaning process?
What are some examples of the data preprocessing step in the data cleaning process?
Zero-centering the data, normalization, etc.
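A minimal sketch of both preprocessing steps on a single feature column: zero-centering subtracts the mean, and normalization additionally scales to unit variance.

```python
def zero_center(xs):
    # Subtract the mean so the feature is centered at zero.
    mu = sum(xs) / len(xs)
    return [x - mu for x in xs]

def normalize(xs):
    # Zero-center, then divide by the standard deviation (unit variance).
    centered = zero_center(xs)
    std = (sum(c * c for c in centered) / len(centered)) ** 0.5
    return [c / std for c in centered]

print(zero_center([2.0, 4.0, 6.0]))  # → [-2.0, 0.0, 2.0]
```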
What are three important components of fairness in ML?
What is an example of how a proxy to a protected attribute might result in an unfair ML model?
One example might be using features like zip code in areas with high racial segregation. If the model learns that zip code is an important discriminatory feature, there’s a good chance that it has learned a subtle proxy for racial discrimination.
Layers in a NN must always be fully connected? (True/False)
False. Other connectivity structures are possible, and in many cases (like images) desirable.
Why does it make sense to consider small patches of inputs when building a NN for image data? What are these small patches called?
They are called receptive fields, modeled after similar structures in the human visual cortex. They make sense to use because structure in image data is often localized: edges and lines, and collections of those edges and lines that form higher-level motifs.
Why does using linear layers not make sense for some applications?
Consider the case of image data. If we connect each pixel to every unit in a hidden linear layer, there could be hundreds of millions of parameters to learn for just one layer. Furthermore, patterns in images tend to be SPATIALLY LOCAL: a pixel in the upper right corner will, in all likelihood, have very little to do with a pixel in the lower left.
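Back-of-envelope arithmetic makes the parameter explosion concrete. The image size, hidden width, and filter configuration below are illustrative assumptions, counting weights only (no biases):

```python
# Fully connected: a 1000x1000 RGB image feeding 1000 hidden units.
inputs = 1000 * 1000 * 3
hidden = 1000
fc_params = inputs * hidden
print(fc_params)  # 3_000_000_000 weights in a single layer

# Compare: a convolutional layer with 64 filters of size 3x3 over 3
# channels, which shares weights across all spatial locations.
conv_params = 64 * 3 * 3 * 3
print(conv_params)  # 1_728 weights
```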
As the number of parameters to learn in a model increases, more data is needed to ensure a robust model that generalizes to new data? (True/False)
True.