What is the fundamental purpose of the linear regression algorithm?
It predicts a scalar value from a set of feature values by computing a weighted combination of the features plus a bias term.
Is linear regression a supervised or unsupervised machine learning algorithm?
Supervised.
What type of input does a linear regression model require?
A dataset where each data point is represented by a set of numerical features; categorical features must first be encoded numerically (e.g., one-hot).
What is the output of a linear regression model?
A single scalar value that is the prediction or estimation.
Name two common use cases for linear regression.
Real estate property valuation and marketing sales prediction.
For small datasets, how can the weights in a linear regression model be learned directly?
Through direct application of the Moore-Penrose pseudoinverse, using scientific linear algebra libraries.
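As a minimal sketch of the pseudoinverse approach (the toy dataset below is illustrative, with a column of ones for the bias term):

```python
import numpy as np

# Illustrative dataset: bias column of ones plus one feature,
# with targets generated exactly by y = 4 + 2x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([4.0, 6.0, 8.0, 10.0])

# The Moore-Penrose pseudoinverse gives the least-squares weights in one call.
beta = np.linalg.pinv(X) @ y
```

For small, well-behaved datasets this recovers the least-squares weights directly, with no iterative training.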
For large datasets, what is the common technique used to learn the weights in a linear regression model?
Stochastic gradient descent (SGD) or related techniques, where weights are learned incrementally.
What is the general linear formula used to predict a value $\hat{y}$ in linear regression?
The formula is $\hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n$, where the $\beta$ values are the model parameters.
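The formula translates directly to code; this one-function sketch (names are illustrative) computes $\hat{y}$ from a bias and per-feature weights:

```python
def predict(beta0, betas, xs):
    # y_hat = beta_0 + beta_1*x_1 + ... + beta_n*x_n
    return beta0 + sum(b * x for b, x in zip(betas, xs))
```

For example, `predict(1.0, [2.0, 3.0], [1.0, 1.0])` evaluates $1 + 2 + 3 = 6$.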
In the iterative training process of linear regression, what is the typical first step for the weights, $\beta$?
Initialize the weights to random values.
What is the most common loss function used in linear regression, calculated for each row?
Squared loss, which is calculated as $(y - \hat{y})^2$.
In linear regression training, the technique used to prevent overfitting by adding a penalty term to the loss function is known as _____.
regularization
What is the formula for the gradient of the squared loss with respect to a weight $\beta_k$?
The gradient is $2(y - \hat{y})(-x_k)$, i.e. $-2(y - \hat{y})x_k$.
How are weights updated in each step of stochastic gradient descent for linear regression?
The update rule is $\beta_k := \beta_k + \eta(y - \hat{y})x_k$, where $\eta$ is the learning rate (the constant factor of 2 from the gradient is absorbed into $\eta$).
What is the hyperparameter $\eta$ in the SGD weight update rule?
A single global learning rate, specified by the engineer.
When does the iterative training process for linear regression typically stop?
It stops upon convergence (when average loss is below a tolerance), or after a certain number of epochs.
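The full iterative procedure described above (initialize the weights, predict, accumulate squared loss, apply the SGD update, stop on convergence or after a fixed number of epochs) can be sketched in plain Python; the learning rate, tolerance, and toy dataset here are illustrative assumptions:

```python
def sgd_train(data, lr=0.05, epochs=2000, tol=1e-6):
    """data: list of (features, target) pairs; returns (beta_0, [beta_1..beta_n])."""
    n_features = len(data[0][0])
    # Zero init for reproducibility here; random init (as on the card) works the same way.
    beta0, betas = 0.0, [0.0] * n_features
    for _ in range(epochs):
        total_loss = 0.0
        for xs, y in data:
            y_hat = beta0 + sum(b * x for b, x in zip(betas, xs))
            err = y - y_hat
            total_loss += err * err               # squared loss for this row
            # Gradient of (y - y_hat)^2 w.r.t. beta_k is 2*(y - y_hat)*(-x_k),
            # so the descent step adds err * x_k (the 2 is absorbed into lr).
            beta0 += lr * err
            betas = [b + lr * err * x for b, x in zip(betas, xs)]
        if total_loss / len(data) < tol:          # convergence: average loss below tolerance
            break
    return beta0, betas

# Toy data generated by y = 1 + 2x; training recovers those parameters.
b0, bs = sgd_train([([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0), ([3.0], 7.0)])
```

Because the toy data is noiseless, the loop drives the average loss below the tolerance and stops early.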
For small datasets, what is the closed-form matrix equation to solve for the parameter vector $\beta$?
The equation is $\beta = (X^TX)^{-1}X^Ty$.
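A minimal NumPy sketch of the closed form on an illustrative dataset; in practice one solves the normal equations $(X^TX)\beta = X^Ty$ rather than forming the inverse explicitly:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # bias column + one feature
y = np.array([3.0, 5.0, 7.0, 9.0])                               # generated by y = 1 + 2x

# Mathematically equivalent to inv(X.T @ X) @ X.T @ y, but numerically preferable.
beta = np.linalg.solve(X.T @ X, X.T @ y)
```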
The term $(X^TX)^{-1}X^Ty$ is known as the best estimator for $\beta$ that minimizes what type of loss?
It minimizes the quadratic loss $||y - X\beta||^2$.
What is Ridge regression?
A type of linear regression that uses ridge loss (squared loss plus a penalty for large parameter values) to prevent overfitting.
What is the effect of Ridge regression on model parameters?
It tends to make all model parameters smaller, but not exactly zero.
What is the closed-form solution for Ridge regression?
The solution is $\beta = (X^TX + \lambda I)^{-1}X^Ty$.
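The Ridge closed form differs from the ordinary least-squares solution only by the $\lambda I$ term; this sketch (illustrative data and $\lambda$) computes both for comparison. Note that it penalizes the intercept along with the other weights, which libraries typically avoid:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # bias column + one feature
y = np.array([3.0, 5.0, 7.0, 9.0])                               # generated by y = 1 + 2x
lam = 1.0                                                        # illustrative penalty strength

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
# Ridge: add lam * I to X^T X before solving.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
```

The Ridge coefficient vector has a smaller norm than the OLS one, but no coefficient is forced exactly to zero.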
In the Ridge regression formula, what problem does adding the $\lambda I$ term help solve?
It helps with multicollinearity and underdetermined systems, and it stabilizes the calculation when $X^TX$ is poorly conditioned.
What is Lasso regression?
A type of linear regression that uses lasso loss, which adds a penalty based on the absolute values of the coefficients.
What is the primary effect of Lasso regression on model parameters, which makes it useful for feature selection?
It often reduces less important parameters to exactly zero, resulting in a simpler, sparser model.
What is the optimization objective for Lasso regression?
Minimize $||y - X\beta||^2 + \lambda||\beta||_1$, where $||\beta||_1$ is the L1 norm of the coefficients.
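Unlike Ridge, Lasso has no closed form, but the objective can be minimized with proximal gradient descent (ISTA): alternate a gradient step on the smooth term $||y - X\beta||^2$ with soft-thresholding, the proximal operator of $\lambda||\beta||_1$. This is one standard approach, sketched here on synthetic data where only the first feature matters, not necessarily how any particular library implements it:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([3.0, 0.0, 0.0])        # only the first feature is relevant

lam = 1.0                                 # illustrative L1 penalty strength
step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)  # safe step size for the smooth part

beta = np.zeros(X.shape[1])
for _ in range(2000):
    # Gradient step on ||y - X beta||^2 ...
    beta = beta - step * (2 * X.T @ (X @ beta - y))
    # ... then soft-thresholding, which sets small coefficients exactly to zero.
    beta = np.sign(beta) * np.maximum(np.abs(beta) - step * lam, 0.0)
```

The irrelevant coefficients land at exactly zero, illustrating why Lasso is used for feature selection.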