Linear Regression Flashcards

(64 cards)

1
Q

What is the fundamental purpose of the linear regression algorithm?

A

It predicts a scalar value from a set of feature values by computing a weighted combination of the features plus a bias term.

2
Q

Is linear regression a supervised or unsupervised machine learning algorithm?

A

Supervised.

3
Q

What type of input does a linear regression model require?

A

A dataset where each data point is represented by a set of numerical features (continuous values, or categorical variables encoded numerically).

4
Q

What is the output of a linear regression model?

A

A single scalar value that is the prediction or estimation.

5
Q

Name two common use cases for linear regression.

A

Real estate property valuation and marketing sales prediction.

6
Q

For small datasets, how can the weights in a linear regression model be learned directly?

A

Through direct application of the pseudoinverse (a closed-form solution) using scientific linear algebra libraries.

7
Q

For large datasets, what is the common technique used to learn the weights in a linear regression model?

A

Stochastic gradient descent (SGD) or related techniques, where weights are learned incrementally.

8
Q

What is the general linear formula used to predict a value $\hat{y}$ in linear regression?

A

The formula is $\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$, where the $\beta$ values are the model parameters.

9
Q

In the iterative training process of linear regression, what is the typical first step for the weights, $\beta$?

A

Initialize the weights to random values.

10
Q

What is the most common loss function used in linear regression, calculated for each row?

A

Squared loss, which is calculated as $(y - \hat{y})^2$.

11
Q

In linear regression training, the technique used to prevent overfitting by adding a penalty term to the loss function is known as _____.

A

regularization

12
Q

What is the formula for the gradient of the squared loss with respect to a weight $\beta_k$?

A

The gradient is $2(y - \hat{y})(-x_k) = -2(y - \hat{y})x_k$.

13
Q

How are weights updated in each step of stochastic gradient descent for linear regression?

A

The update rule is $\beta_k := \beta_k + 2\eta(y - \hat{y})x_k$, i.e. a step against the gradient $-2(y - \hat{y})x_k$, where $\eta$ is the learning rate.

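Cards 9–13 can be sketched end to end in plain NumPy. This is a minimal illustration on synthetic data; the true coefficients, learning rate, and epoch count are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 1 + 2*x1 - 3*x2 + small noise (illustrative values)
X = rng.normal(size=(200, 2))
y = 1 + 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the bias beta_0 is learned like any other weight
Xb = np.hstack([np.ones((200, 1)), X])

beta = rng.normal(size=3)  # step 1: initialize the weights randomly
eta = 0.01                 # single global learning rate

for epoch in range(50):
    for xi, yi in zip(Xb, y):
        y_hat = xi @ beta
        # gradient of (y - y_hat)^2 w.r.t. beta is 2*(y - y_hat)*(-x),
        # so stepping against it gives beta += 2*eta*(y - y_hat)*x
        beta += 2 * eta * (yi - y_hat) * xi

print(beta.round(2))  # approaches the true parameters [1, 2, -3]
```

In practice one would also shuffle the rows each epoch and stop once the average loss falls below a tolerance, as the convergence card below describes.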
14
Q

What is the hyperparameter $\eta$ in the SGD weight update rule?

A

A single global learning rate, specified by the engineer.

15
Q

When does the iterative training process for linear regression typically stop?

A

It stops upon convergence (when average loss is below a tolerance), or after a certain number of epochs.

16
Q

For small datasets, what is the closed-form matrix equation to solve for the parameter vector $\beta$?

A

The equation is $\beta = (X^TX)^{-1}X^Ty$.

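The closed-form route can be sketched in NumPy on synthetic data (the coefficients here are arbitrary illustrative values); `pinv` and `lstsq` are the numerically safer library equivalents of the textbook normal equation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with known parameters beta = [1, 2, -3] (illustrative)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=100)

# Textbook normal equation: beta = (X^T X)^{-1} X^T y
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically safer library routes via the pseudoinverse / least squares
beta_pinv = np.linalg.pinv(X) @ y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal.round(2))  # all three agree, roughly [1, 2, -3]
```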
17
Q

The term $(X^TX)^{-1}X^Ty$ is known as the best estimator for $\beta$ that minimizes what type of loss?

A

It minimizes the quadratic loss $\|y - X\beta\|^2$.

18
Q

What is Ridge regression?

A

A type of linear regression that uses ridge loss (squared loss plus a penalty for large parameter values) to prevent overfitting.

19
Q

What is the effect of Ridge regression on model parameters?

A

It tends to make all model parameters smaller, but not exactly zero.

20
Q

What is the closed-form solution for Ridge regression?

A

The solution is $\beta = (X^TX + \lambda I)^{-1}X^Ty$.

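A minimal NumPy sketch of the Ridge closed form, with synthetic data and an arbitrary $\lambda$, showing the shrinkage effect from the previous card:

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam = 100, 1.0

X = rng.normal(size=(n, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)

# Ridge closed form: beta = (X^T X + lambda * I)^{-1} X^T y
I = np.eye(X.shape[1])
beta_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)

# Compare with the unregularized solution: ridge shrinks the coefficients
# toward zero (smaller norm), but none of them become exactly zero
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ols.round(3), beta_ridge.round(3))
```

Using `np.linalg.solve` instead of explicitly inverting $X^TX + \lambda I$ is the standard numerically stable choice.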
21
Q

In the Ridge regression formula, what problem does adding the $\lambda I$ term help solve?

A

It helps with multicollinearity, underdetermined systems, and stabilizes calculations when the correlation matrix is poorly conditioned.

22
Q

What is Lasso regression?

A

A type of linear regression that uses lasso loss, which adds a penalty based on the absolute values of the coefficients.

23
Q

What is the primary effect of Lasso regression on model parameters, which makes it useful for feature selection?

A

It often reduces less important parameters to exactly zero, resulting in a simpler, sparser model.

24
Q

What is the optimization objective for Lasso regression?

A

Minimize $||y - X\beta||^2 + \lambda||\beta||_1$, where $||\beta||_1$ is the L1 norm of the coefficients.

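The sparsity effect can be demonstrated quickly, assuming scikit-learn is available; the feature count, `alpha`, and synthetic coefficients are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)

# Five features, but only the first two actually influence the target
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5)  # alpha plays the role of lambda in the objective
lasso.fit(X, y)
print(lasso.coef_.round(2))  # the three irrelevant coefficients are exactly 0
```

Note that scikit-learn scales the squared-error term by $1/(2n)$, so its `alpha` is not numerically identical to the $\lambda$ in the card's formula.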
25
Q

Name one of the six core data requirements for a linear regression model to be a useful predictor.

A

Linearity, Normality of Residuals, Non-collinearity, Homoscedasticity, Similar Scales, or Independence.

26
Q

Requirement: Linearity

A

Definition: There should be a linear relationship between the input features and the target variable (labels).

27
Q

Requirement: Normality of Residuals

A

Definition: The residuals (the differences between observed and predicted values) should be normally distributed.

28
Q

Requirement: Non-collinearity

A

Definition: The features should not be highly correlated with each other.

29
Q

Requirement: Homoscedasticity

A

Definition: The residuals should have a constant variance at every level of the independent variables.

30
Q

Requirement: Similar Scales

A

Definition: The feature variables should be on similar scales to avoid numerical instability.

31
Q

Requirement: Independence

A

Definition: The observed data rows should be independent of each other.

32
Q

What is multicollinearity in the context of linear regression?

A

It occurs when two or more features are linearly correlated, leading to unreliable and unstable coefficient estimates.

33
Q

In matrix algebra terms, what problem can multicollinearity cause when trying to find the closed-form solution?

A

It can lead to non-invertible (singular) matrices, making it difficult or impossible to compute the matrix inverse accurately.

34
Q

The phenomenon where the variability of residuals is not constant across the range of predicted values is called _____.

A

heteroscedasticity

35
Q

Why is it problematic if feature variables have vastly different scales in linear regression?

A

It can lead to unreliable and unstable results, manifesting as high condition numbers in linear algebra.

36
Q

What is the key metric for evaluating the goodness of fit of a linear regression model, representing the proportion of explained variance?

A

The $R^2$ metric, also known as the coefficient of determination.

37
Q

How is the $R^2$ metric calculated in terms of Mean Squared Error (MSE)?

A

The formula is $R^2 = 1 - \frac{\text{MSE}}{\text{Total variance}}$.

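A tiny worked example of that formula with made-up observed and predicted values:

```python
import numpy as np

# Toy observed values and model predictions (illustrative numbers)
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.8, 5.1, 7.2, 8.9])

mse = np.mean((y - y_hat) ** 2)  # mean squared error of the predictions
total_variance = np.var(y)       # variance of y around its own mean
r2 = 1 - mse / total_variance
print(round(r2, 4))              # 0.995
```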
38
Q

What do AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) account for that $R^2$ does not?

A

They add a penalty for the number of parameters in the model, thus adjusting for model complexity.

39
Q

When comparing two models with similar $R^2$ values, does a higher or lower AIC/BIC score indicate a better model?

A

A lower AIC/BIC score indicates a better model.

40
Q

What is a major limitation of linear regression when applied to complex scenarios like stock price prediction?

A

It assumes a simple linear relationship and struggles to capture complex, non-linear, and dynamic interactions between factors.

41
Q

How can the predictive power of linear regression be improved when feature interactions are important?

A

Through feature engineering, such as creating new features that explicitly capture interactions or non-linear transformations.

42
Q

Linear regression forms the basis of which common classification model?

A

Logistic regression.

43
Q

What type of plot is useful in exploratory data analysis (EDA) to immediately identify collinear features?

A

A correlation plot or heatmap.

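The underlying correlation matrix can be computed directly with NumPy; the nearly-collinear pair here is a synthetic illustration (a heatmap library such as seaborn would typically render this matrix visually):

```python
import numpy as np

rng = np.random.default_rng(6)
a = rng.normal(size=100)
b = 0.9 * a + rng.normal(scale=0.1, size=100)  # nearly collinear with a
c = rng.normal(size=100)                       # independent of both

corr = np.corrcoef(np.vstack([a, b, c]))
print(corr.round(2))  # |corr(a, b)| is near 1, flagging collinearity
```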
44
Q

In an OLS regression summary, what does a large 'Cond. No.' (Condition Number) suggest?

A

It indicates potential problems like strong multicollinearity or other numerical issues.

45
Q

What diagnostic plot is used to visually check for homoscedasticity?

A

A plot of residuals versus fitted values.

46
Q

On a residuals vs. fitted values plot, what pattern would suggest heteroscedasticity (non-constant variance)?

A

A funnel shape, where the spread of residuals changes as the fitted values increase or decrease.

47
Q

What does the lowess curve on a residuals vs. fitted plot suggest if it is not flat and near zero?

A

It suggests that there might be non-linear relationships or feature interactions not captured by the model.

48
Q

What type of plot is used to check if the residuals of a regression model follow a normal distribution?

A

A QQ (Quantile-Quantile) plot of the residuals.

49
Q

On a QQ plot of residuals, what indicates that the residuals are not normally distributed?

A

The data points deviate significantly from the straight diagonal line.

50
Q

What is a helpful data transformation technique for the output variable (e.g., 'price') if it is not normally distributed?

A

A Box-Cox transformation, which includes options like log, square root, or inverse transforms.

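A short sketch of this transform, assuming SciPy is available; the log-normal "prices" are synthetic, chosen so the fit is expected to land near the log transform:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Right-skewed synthetic prices (log-normal, like many real price columns)
prices = rng.lognormal(mean=12, sigma=0.5, size=500)

transformed, lam = stats.boxcox(prices)
# A fitted lambda near 0 means Box-Cox chose essentially a log transform;
# lambda near 0.5 would be a square root, and -1 an inverse
print(round(lam, 2))
```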
51
Q

Why does L1 regularization (Lasso) tend to produce sparse models?

A

The L1 penalty's contours are sharp-cornered (a diamond-shaped cross-polytope), which makes it likely that the optimum lands on a corner where some weights are exactly zero.

52
Q

In contrast to L1 regularization, why does L2 regularization (Ridge) shrink weights towards zero but rarely sets them to exactly zero?

A

Its loss function's contours are smooth hyperspheres, so the optimization path is less likely to intersect an axis perfectly.

53
Q

Under what condition is the assumption that coefficient magnitudes are a proxy for feature importance incorrect?

A

This assumption is incorrect when the features are on different scales.

54
Q

When feature scales are different, what metric should be used instead of raw coefficient magnitude to assess feature importance?

A

The magnitude of the t-statistic (coefficient divided by its standard error).

55
Q

Which Python library is preferred in traditional statistics and social sciences for its detailed statistical summaries of regression models?

A

The `statsmodels` library.

56
Q

Which Python library is a popular choice for integrating linear regression into broader machine learning pipelines?

A

The `scikit-learn` library.

57
Q

In linear regression, if a plot of residuals vs. a specific feature shows a clear pattern or curve, what does this suggest?

A

It suggests a non-linear relationship between that feature and the target variable that the model is not capturing.

58
Q

The R-squared value in an OLS regression summary indicates the model's ____.

A

goodness of fit

59
Q

In an OLS regression summary, what do the p-values (P>|t|) associated with each coefficient assess?

A

They assess the statistical significance of each coefficient: the probability of observing an estimate at least this extreme if the true coefficient were zero.

60
Q

In L1 regularization, the penalty term is based on the sum of the _____ values of the weights.

A

absolute

61
Q

In L2 regularization, the penalty term is based on the sum of the _____ values of the weights.

A

squared

62
Q

The loss function that is a combination of squared loss and absolute error is called _____ loss.

A

Huber

63
Q

What concept at the senior level refers to the variability of model parameter estimates when the model is retrained on different subsets of the data?

A

Stability.

64
Q

What does a Variance Inflation Factor (VIF) measure?

A

It measures the severity of multicollinearity in an ordinary least squares regression analysis.
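The VIF definition (regress each feature on the others and compute $\text{VIF}_j = 1/(1 - R^2_j)$) can be hand-rolled in NumPy; `statsmodels` also offers `variance_inflation_factor` for this. The nearly-collinear feature pair below is a synthetic illustration:

```python
import numpy as np

def vif(X):
    """VIF per column: regress it on the others; VIF_j = 1 / (1 - R^2_j)."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.delete(X, j, axis=1)
        A = np.hstack([np.ones((n, 1)), others])  # include an intercept
        beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ beta
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(5)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)                  # unrelated

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs.round(1))  # x1 and x2 get large VIFs; x3 stays near 1
```

A common rule of thumb is that a VIF above 5–10 signals problematic multicollinearity.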