Describe the characteristics of predictive modeling problems
How do you produce a meaningful problem definition?
General Strategy: get to the root cause of the business issue and make it specific enough to be solvable
Specific Strategies:
* (Hypotheses) Use prior knowledge of the business problem to ask questions and develop testable hypotheses
* (KPIs) Select appropriate key performance indicators to provide a quantitative basis for measuring success
Define granularity
Granularity refers to how precisely a variable is measured, that is, the level of detail of the information the variable contains
What is the goal of exploratory data analysis (EDA)?
The goal is to use descriptive statistics and graphical displays to gain insights into the distribution of each variable on its own and in relation to other variables, especially the target variable
How do you perform EDA?
List the common issues for numeric variables
What is the issue with right skewness for numeric variables and what are possible solutions?
Right skewness is a problem because extreme values distort visualizations and exert a disproportionate effect on the model fit.
The solution is to apply a transformation that reduces the right skewness and makes the distribution more symmetric, which can improve the fit of a GLM when the variable serves as a predictor:
* Log transformation (works only for strictly positive variables)
* Square root transformations (works for non-negative variables)
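A minimal sketch of the two transformations above, using only the Python standard library. The `claims` values and the `skewness` helper are hypothetical and are there only to show the skewness dropping after each transformation.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: the average cubed z-score."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical right-skewed data (one very large value); not from the source.
claims = [100, 120, 150, 180, 200, 250, 400, 900, 5000]

log_claims = [math.log(v) for v in claims]    # valid only when every v > 0
sqrt_claims = [math.sqrt(v) for v in claims]  # valid when every v >= 0

print(skewness(claims), skewness(sqrt_claims), skewness(log_claims))
```

On this sample the log transformation reduces skewness more than the square root, but which one works better depends on the data.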
How can you handle the presence of outliers?
How can you handle highly correlated predictors?
List some reasons why a numeric variable should be converted to a factor
List some reasons why a numeric variable should not be converted to a factor
What is the common issue for categorical predictors and how should we handle them?
The common issue for categorical predictors is sparse levels
* Motivation: sparse factor levels (common in a high-dimensional categorical predictor) reduce the robustness of models and can cause overfitting
* What to do: combine sparse levels with more populous levels where the target variable behaves similarly to form representative groups
* Trade-off: strike a balance between ensuring each level has a sufficient number of observations and preserving the differences in the behavior of the target variable among different factor levels for prediction
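One possible way to sketch the "what to do" step: map each sparse level to the populous level whose target mean is closest. The function name `combine_sparse_levels`, the `min_count` threshold, and the sample data are all hypothetical. In practice the target mean of a sparse level is itself noisy, so judgment is still needed.

```python
from collections import defaultdict

def combine_sparse_levels(levels, targets, min_count):
    """Return a mapping from each level to the level it should be merged into."""
    sums, counts = defaultdict(float), defaultdict(int)
    for lvl, y in zip(levels, targets):
        sums[lvl] += y
        counts[lvl] += 1
    means = {lvl: sums[lvl] / counts[lvl] for lvl in counts}

    populous = [lvl for lvl in counts if counts[lvl] >= min_count]
    if not populous:
        raise ValueError("no level has at least min_count observations")

    mapping = {}
    for lvl in counts:
        if counts[lvl] >= min_count:
            mapping[lvl] = lvl  # populous levels are kept as they are
        else:
            # merge into the populous level with the most similar target mean
            mapping[lvl] = min(populous, key=lambda p: abs(means[p] - means[lvl]))
    return mapping

# Hypothetical data: A and B are populous, C and D are sparse.
levels = ["A"] * 5 + ["B"] * 5 + ["C", "D"]
targets = [10.0] * 5 + [50.0] * 5 + [12.0, 47.0]
mapping = combine_sparse_levels(levels, targets, min_count=3)
print(mapping)
```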
What is the difference between interaction and correlation?
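Interaction concerns a 3-way relationship among 1 target variable and 2 predictors: the relationship between one predictor and the target depends on the value of the other predictor. Correlation concerns the relationship between 2 numeric predictors only, without reference to the target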
Why should we split the data into training data and test data?
Why should we use stratified sampling?
To produce representative training and test sets with respect to the target variable (not with respect to the predictors)
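A minimal sketch of stratified sampling, assuming a categorical (here binary) target. The function `stratified_split`, the 70/30 split, and the sample data are hypothetical. The idea is to split each target class separately so both sets keep the same class proportions.

```python
import random
from collections import defaultdict

def stratified_split(targets, train_fraction=0.7, seed=42):
    """Split row indices so each target class keeps the same train/test ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(targets):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                       # random order within the class
        cut = round(len(idx) * train_fraction)
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical target: 20% of observations belong to class 1.
targets = [1] * 20 + [0] * 80
train_idx, test_idx = stratified_split(targets)

train_rate = sum(targets[i] for i in train_idx) / len(train_idx)
test_rate = sum(targets[i] for i in test_idx) / len(test_idx)
print(train_rate, test_rate)
```

For a numeric target, one option is to stratify on bins of the target (for example, quantile groups) instead of on the raw values.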
Explain how cross validation works
List some common examples of GLMs, and their common distributions and link functions
Gamma and inverse Gaussian require the target variable to be strictly positive. Zero values are not allowed
Explain what binning is and the pros and cons
Binning refers to converting a numeric variable into a categorical variable with levels defined as non-overlapping intervals over the range of the original variable
* Pros: the coefficients of the dummy variables for the different bins are not constrained to follow any particular order, so the target mean can vary highly irregularly over the bins. This gives the model the flexibility to capture nonlinear relationships
* Cons: (1) usually no clear choice of the number of bins and the associated boundaries, and (2) results in a loss of information
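A short sketch of binning with the standard library. The cut points in `boundaries` are hypothetical, which is exactly the first con above: there is usually no obvious choice.

```python
from bisect import bisect_right

def bin_label(value, boundaries):
    """Label the half-open interval [lower, upper) that contains value."""
    edges = [float("-inf")] + list(boundaries) + [float("inf")]
    i = bisect_right(boundaries, value)  # number of boundaries <= value
    return f"[{edges[i]}, {edges[i + 1]})"

ages = [17, 25, 40, 64, 65, 80]
boundaries = [25, 65]  # hypothetical cut points, must be sorted
labels = [bin_label(a, boundaries) for a in ages]
print(labels)
```

Ages 25, 40, and 64 all receive the same label, which illustrates the loss of information: the model can no longer tell them apart.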
What are some general statements you can make when interpreting GLMs?
What are some specific statements you can make when interpreting GLMs with log link?
What are weights and offsets?
Weights and offsets are modeling techniques:
* Offsets: usually used when the target observations are aggregated over an exposure unit. The larger the exposure, the larger the mean. An offset is commonly used with a log-link GLM, where including ln(exposure) as the offset (a term whose coefficient is fixed at 1) makes the target mean directly proportional to the exposure
* Weights: usually used when the target observations are averaged over an exposure unit. The larger the exposure, the more precise the observations (which means a lower variance). Observations with a larger weight will play a more important role in the estimation of the model coefficients
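To see why a log-link offset makes the mean proportional to the exposure, here is a small sketch. The linear predictor value `eta` and the exposures are hypothetical; no model is being fitted here.

```python
import math

def predicted_mean(exposure, linear_predictor):
    """Log link with offset: ln(mean) = ln(exposure) + linear_predictor."""
    return math.exp(math.log(exposure) + linear_predictor)

eta = -2.0  # hypothetical value of the linear predictor (intercept + coefficients * predictors)
mean_100 = predicted_mean(100, eta)
mean_200 = predicted_mean(200, eta)

# Doubling the exposure doubles the mean, since mean = exposure * exp(eta).
print(mean_200 / mean_100)
```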
Explain accuracy vs precision in the context of predictive analytics
Accuracy and precision measure different aspects of prediction performance. Bias quantifies accuracy (how closely the predictions, on average, capture the true signal) and variance quantifies precision (how tightly the predictions are concentrated in a small region rather than spread out)
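A small illustration of the distinction, assuming we could observe repeated predictions of the same true value. The numbers in `true_signal`, `model_b`, and `model_c` are made up.

```python
import statistics

true_signal = 10.0  # hypothetical true value being predicted

# Hypothetical predictions from models refitted on different training samples.
model_b = [12.0, 12.1, 11.9, 12.0, 12.0]  # precise but inaccurate
model_c = [7.0, 13.0, 10.0, 6.0, 14.0]    # accurate on average but imprecise

def bias(preds, truth):
    """Average prediction minus the truth (accuracy)."""
    return statistics.fmean(preds) - truth

def variance(preds):
    """Spread of the predictions around their own mean (precision)."""
    return statistics.pvariance(preds)

print(bias(model_b, true_signal), variance(model_b))
print(bias(model_c, true_signal), variance(model_c))
```

Here `model_b` has high bias with low variance, and `model_c` has essentially no bias with high variance.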