Regression analysis
The analysis of the statistical relationship among variables. In the simplest form there are only two variables:
Simple linear regression
Y = a + bX + e, where a = the intercept, the model's value for Y when X = 0; b = the slope of the linear equation that specifies the model; e = a term that represents the errors associated with the model.
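A minimal sketch (not part of the notes) of fitting Y = a + bX + e with numpy; the data points are made up for illustration:

```python
import numpy as np

# Hypothetical example data
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# np.polyfit with degree 1 returns [slope b, intercept a]
b, a = np.polyfit(X, Y, 1)

# The error term e is the gap between observed and fitted Y
residuals = Y - (a + b * X)
```

Here the fitted slope is close to 2 and the intercept close to 0, matching the roughly Y = 2X pattern in the example data.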
R^2
The coefficient indicating goodness of fit (maximum = 1): the proportion of variation in Y 'explained' by all X variables in the model. As R^2 increases, the fit of the model improves, but more modelling noise is captured as well.
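A sketch of how R^2 is computed from residuals (the data and predictions below are hypothetical):

```python
import numpy as np

Y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])
Y_hat = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # hypothetical model predictions

ss_res = np.sum((Y - Y_hat) ** 2)        # residual sum of squares
ss_tot = np.sum((Y - Y.mean()) ** 2)     # total sum of squares
r_squared = 1 - ss_res / ss_tot          # proportion of variation explained
```

A value near 1 means nearly all variation in Y is accounted for by the model's predictions.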
Ordinary least squares (OLS)
Method for finding the model with the best fit. It minimizes the errors associated with predicting the values for Y. It uses a least-squares criterion because without squaring, positive and negative deviations from the model would cancel each other out.
What is OLS often used for.?
Hedonic price models
Collinearity
An independent variable that depends on another independent variable in the model.
Multiple regression model
Y = b0x0 + b1x1 + b2x2 + … + bNxN + e (with x0 = 1, so b0 acts as the intercept)
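A sketch of fitting the multiple regression model with numpy's least-squares solver, using simulated data with known coefficients (all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Simulate Y = 1.0 + 2.0*x1 - 0.5*x2 + small noise
Y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix with the constant column x0 = 1
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
b0, b1, b2 = coef  # estimates should be close to 1.0, 2.0, -0.5
```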
Neural network
Non-linear multiple regression model
Adjusted R^2
Compensates for the number of explanatory variables by penalizing extra variables. R^2 never decreases when a new X variable is added to the model, which may cause overfitting. To avoid overfitting, use two data sets: a training set and a validation set.
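A sketch of the standard adjusted R^2 formula, showing how the penalty grows with the number of explanatory variables (the R^2 and sample-size values are hypothetical):

```python
def adjusted_r_squared(r_squared, n, k):
    """n = number of observations, k = number of explanatory variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# With the same raw R^2, a model using more variables scores lower
print(round(adjusted_r_squared(0.90, n=50, k=2), 3))   # 0.896
print(round(adjusted_r_squared(0.90, n=50, k=10), 3))  # 0.874
```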
Model selection
2 ways: