Why do we need econometrics? Class size example
Economic theory suggests important relationships with policy implications, but it rarely indicates the quantitative magnitude of causal effects. Ideally these would be determined by an experiment (randomized + controlled); however, we almost always have only observational (non-experimental) data.
Ex. A decrease in class size increases student achievement, so the provincial government should create policy to decrease class sizes, but by how much? That is the quantitative effect.
Random Sampling must satisfy: no confounds, each person has an equal chance of selection (ex. tasting saltiness of well-mixed soup)
* n > 25 (Law of Large Numbers)
* Random sample
* Identically distributed
* Independently distributed
Explain difference b/w Independent & Identically distributed? Example of coin flip?
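One way to picture the coin flip example is a short simulation. The sketch below is only an illustration: the seed and the number of flips are arbitrary choices, not from the notes. Every flip uses the same fair coin (identically distributed) and no flip depends on an earlier one (independent). Swapping in a biased coin halfway through would break "identical"; copying the previous flip would break "independent".

```r
set.seed(1)  # arbitrary seed so the run is reproducible

# 10,000 flips of the same fair coin: 1 = heads, 0 = tails.
# Identically distributed: every flip has P(heads) = 0.5.
# Independent: sample() with replace = TRUE draws each flip afresh.
flips <- sample(c(0, 1), size = 10000, replace = TRUE)

# Running sample mean after each flip
running_mean <- cumsum(flips) / seq_along(flips)

# Law of Large Numbers: the mean settles near the population mean 0.5
running_mean[c(10, 100, 1000, 10000)]
```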
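Prove the conditional expectation of Y given X equals the population regression function
E[Yi|Xi] = E[β0 + β1Xi + ui | Xi] = β0 + β1Xi + E[ui|Xi] = β0 + β1Xi, since β0 + β1Xi is fixed once Xi is given and E[ui|Xi] = 0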
3 Measures of Fit + Formula + Drawing
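A possible sketch of the formulas, assuming the three measures meant here are R-squared, the standard error of the regression (SER), and the root mean squared error (RMSE); the notes do not name them. With residuals u-hat, SSR = sum(u-hat^2) and TSS = sum((Y - Ybar)^2), so R2 = 1 - SSR/TSS, SER = sqrt(SSR/(n - 2)), and RMSE = sqrt(SSR/n). The data below are simulated for illustration only.

```r
set.seed(42)  # arbitrary seed; simulated data, not caschool
n <- 200
x <- runif(n, 14, 26)
y <- 700 - 2 * x + rnorm(n, sd = 15)

fit <- lm(y ~ x)
u_hat <- resid(fit)

ssr <- sum(u_hat^2)          # sum of squared residuals
tss <- sum((y - mean(y))^2)  # total sum of squares

r2   <- 1 - ssr / tss        # share of the variance of Y explained by X
ser  <- sqrt(ssr / (n - 2))  # spread of residuals, df-corrected, in units of Y
rmse <- sqrt(ssr / n)        # same idea without the df correction

c(R2 = r2, SER = ser, RMSE = rmse)
```

R2 and SER match `summary(fit)$r.squared` and `summary(fit)$sigma`.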
OLS Estimation + Proof
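OLS picks b0 and b1 to minimize the sum of squared residuals sum((Yi - b0 - b1*Xi)^2). Setting the two first-order conditions to zero gives b1 = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2) and b0 = Ybar - b1*Xbar. The sketch below checks these formulas against `lm()` on simulated data (the data-generating numbers are made up for illustration).

```r
set.seed(7)  # arbitrary seed; simulated data
n <- 100
x <- rnorm(n, mean = 20, sd = 2)
y <- 5 + 3 * x + rnorm(n)

# Closed-form OLS estimators from the first-order conditions
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Same estimates from lm()
fit <- lm(y ~ x)
rbind(by_hand = c(b0, b1), lm = unname(coef(fit)))
```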
R Regression Code
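library(lmtest)    # provides coeftest()
library(sandwich)  # provides vcovHC()
regression1 <- lm(dependent ~ independent, data = caschool)
summary(regression1)  # homoskedasticity-only standard errors
coeftest(regression1, vcov = vcovHC(regression1, type = "HC1"))  # heteroskedasticity-robust standard errors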
Look at R code result, Interpret each element
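To see where the robust standard errors in the `coeftest()` output come from, here is a base-R sketch that builds the HC1 variance matrix by hand. It uses simulated heteroskedastic data (an assumption for illustration, not the caschool data) and does not call lmtest or sandwich. Each output row reads: estimate, standard error, t = estimate / SE, and the two-sided p-value for H0: coefficient = 0.

```r
set.seed(3)  # arbitrary seed; simulated data
n <- 500
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 + 0.5 * x)  # error sd grows with x

fit <- lm(y ~ x)
X <- model.matrix(fit)  # n x k matrix: intercept column and x
u_hat <- resid(fit)
k <- ncol(X)

bread <- solve(crossprod(X))   # (X'X)^-1
meat  <- crossprod(X * u_hat)  # X' diag(u_hat^2) X
# n / (n - k) is the HC1 degrees-of-freedom correction
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

se_robust  <- sqrt(diag(vcov_hc1))
se_classic <- sqrt(diag(vcov(fit)))  # homoskedasticity-only SEs
t_robust   <- coef(fit) / se_robust
p_robust   <- 2 * pt(abs(t_robust), df = n - k, lower.tail = FALSE)

cbind(estimate = coef(fit), se_classic, se_robust, t_robust, p_robust)
```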
1 Least Squares Assumptions for Causal Inference
Randomized Controlled Experiment: for a binary treatment, the causal effect is the expected difference in means b/w the treatment & control groups, which are divided by random assignment (by computer). Random assignment ensures X is uncorrelated with all other determinants of Y, so there are no confounding variables (no OVB/bias). All individual characteristics that make up u are distributed independently of X, so the conditional mean is E(u|X=x)=0: all other qualities cancel out on average across both groups, implying the OLS estimator of β1 is an unbiased estimator of the causal effect.
See graph in doc
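The random-assignment argument can be checked with a small simulation. Everything below is made up for illustration (the "ability" variable, the true effect of 5, the sample size). Ability strongly affects Y but is never included in the regression; because treatment is assigned at random, it is nearly uncorrelated with ability and the difference in means still lands close to the true effect.

```r
set.seed(11)  # arbitrary seed; simulated experiment
n <- 2000
ability <- rnorm(n)                       # unobserved determinant of Y (part of u)
treat <- rbinom(n, size = 1, prob = 0.5)  # random assignment to treatment
true_effect <- 5                          # hypothetical causal effect
y <- 50 + true_effect * treat + 10 * ability + rnorm(n)

# Difference in group means
diff_means <- mean(y[treat == 1]) - mean(y[treat == 0])

# OLS slope on the binary regressor is the same number
b1 <- coef(lm(y ~ treat))[["treat"]]

c(diff_means = diff_means, ols_slope = b1,
  corr_treat_ability = cor(treat, ability))
```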
2 Least Squares Assumptions for Causal Inference
(Xi, Yi) are Independently & Identically Distributed (i.i.d.), which holds under simple random sampling: all entities are selected from the same population (identically distributed) and at random, so the probability of selecting one school has no correlation with selecting another (independently distributed). This allows the Central Limit Theorem (CLT) to give the large-sample sampling distribution of the OLS estimators of β0 & β1.
3 Least Squares Assumptions for Causal Inference
Large outliers in X and/or Y are rare: E(X^4) < ∞ and E(Y^4) < ∞ (finite fourth moments). A large outlier could strongly influence results or create meaningless values of the estimate of β1. Usually X & Y are bounded, so they have finite fourth moments.
Check with a scatterplot and remove extreme values of X or Y, or else:
Trimming: take 1% of the data off both ends
Winsorizing: replace extreme values with less extreme values from within the data distribution, rather than removing them entirely, to mitigate their effects without discarding data points.
See doc for graph.
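A minimal base-R sketch of the two fixes, assuming 1st and 99th percentile cutoffs and a made-up variable with two planted outliers. In a regression you would drop or cap whole rows of the data frame rather than a single vector.

```r
set.seed(5)  # arbitrary seed; simulated data
x <- c(rnorm(198), 25, -30)  # 198 ordinary values plus two planted outliers

# 1st and 99th percentile cutoffs
cutoffs <- quantile(x, probs = c(0.01, 0.99))

# Trimming: drop observations outside the cutoffs
x_trimmed <- x[x >= cutoffs[1] & x <= cutoffs[2]]

# Winsorizing: keep every observation but cap it at the cutoffs
x_winsor <- pmin(pmax(x, cutoffs[1]), cutoffs[2])

c(original = length(x), trimmed = length(x_trimmed), winsorized = length(x_winsor))
range(x_winsor)
```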
Interpret b0 and b1
b0 is the predicted (average) value of Y when X=0
b1 is the expected change in Y associated with a 1-unit change in X, holding all other factors/variables constant.
Heteroskedasticity means that:
A) homogeneity cannot be assumed automatically for the model.
B) the variance of the error term is not constant.
C) the observed units have different preferences.
D) agents are not all rational.
B
The power of the test is:
A) dependent on whether you calculate a t or a t^2 statistic.
B) one minus the probability of committing a type I error.
C) a subjective view taken by the econometrician dependent on the situation.
D) one minus the probability of committing a type II error.
D
With i.i.d. sampling each of the following is true EXCEPT:
A) E(Ȳ) = μ_Y.
B) var(Ȳ) = σ_Y^2/n.
C) E(Ȳ) < E(Y).
D) Ȳ is a random variable
C
Central limit theorem states:
A) states conditions under which a variable involving the sum of Y1,…, Yn i.i.d. variables
becomes the standard normal distribution.
B) postulates that the sample mean is a consistent estimator of the population mean μ_Y.
C) only holds in the presence of the law of large numbers.
D) states conditions under which a variable involving the sum of Y1,…, Yn i.i.d. variables
becomes the Student t distribution
A
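A quick simulation can make the CLT statement concrete. The sketch below assumes a skewed population (exponential with mean 1 and variance 1, chosen only for illustration) and standardizes the sample mean as (Ybar - μ_Y) / (σ_Y / sqrt(n)). Even though the population is far from normal, the standardized mean behaves roughly like a standard normal variable.

```r
set.seed(9)  # arbitrary seed
n <- 50       # observations per sample
reps <- 5000  # number of simulated samples

# Exponential(1) population: mean 1, standard deviation 1
z <- replicate(reps, {
  y <- rexp(n, rate = 1)
  (mean(y) - 1) / (1 / sqrt(n))  # standardized sample mean
})

# Compare with N(0, 1): mean 0, sd 1, about 95% of draws within +/- 1.96
c(mean = mean(z), sd = sd(z), share_within_1.96 = mean(abs(z) < 1.96))
```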
You have estimated a linear regression to understand the relationship between salary and
years of experience. You want to test the hypothesis:
* Null Hypothesis H0 : The effect of experience on salary is zero (β1=0).
* Alternative Hypothesis HA : Experience significantly affects salary (β1≠0).
Which of the following R commands will provide the t-statistic and p-value for this
hypothesis test?
A) summary(model)
B) coefficients(model)
C) confint(model)
D) t.test(company_data$salary, company_data$experience)
A
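To see why `summary(model)` is the answer, the sketch below pulls the t-statistic and p-value for experience out of the coefficient table. The `company_data` values are simulated (made-up salary numbers), since the notes do not include the data.

```r
set.seed(21)  # arbitrary seed; simulated stand-in for company_data
company_data <- data.frame(experience = runif(80, 0, 30))
company_data$salary <- 40000 + 1500 * company_data$experience +
  rnorm(80, sd = 8000)

model <- lm(salary ~ experience, data = company_data)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(model)$coefficients
coef_table["experience", c("t value", "Pr(>|t|)")]

# The t-statistic for H0: beta1 = 0 is estimate / standard error
t_stat <- coef_table["experience", "Estimate"] /
  coef_table["experience", "Std. Error"]
p_value <- coef_table["experience", "Pr(>|t|)"]
```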
Which command will predict sales if the advertising budget is 1000 units?
A) predict(model, newdata = data.frame(advertising = 1000))
B) predict(model, newdata = list(advertising = 1000))
C) model$predict(1000)
D) predict(model, advertising = 1000)
A
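As a check on what `predict()` returns, the sketch below fits the model on simulated data (the `dataset` values are made up for illustration) and compares the prediction with b0 + b1 * 1000 computed by hand.

```r
set.seed(2)  # arbitrary seed; simulated stand-in for dataset
dataset <- data.frame(advertising = runif(60, 100, 2000))
dataset$sales <- 200 + 0.8 * dataset$advertising + rnorm(60, sd = 50)

model <- lm(sales ~ advertising, data = dataset)

# newdata must contain a column named exactly like the regressor
pred <- predict(model, newdata = data.frame(advertising = 1000))

# Same number from the estimated regression line
by_hand <- coef(model)[["(Intercept)"]] + coef(model)[["advertising"]] * 1000

c(predict = unname(pred), by_hand = by_hand)
```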
Which command extracts the intercept and slope coefficients from the model?
A) coef(model)
B) summary(model)
C) model$coefficients
D) coefficients(model)
C
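Worth noting when reviewing this question: in R, options A, C and D all return the same named vector of intercept and slope for an `lm` fit, so the answer key is picking one of several working commands. A small check, using R's built-in `cars` dataset as a stand-in model:

```r
# Stand-in model on a dataset that ships with R
model <- lm(dist ~ speed, data = cars)

a <- coef(model)          # extractor function
c_opt <- model$coefficients  # direct access to the list element
d <- coefficients(model)  # alias of coef()

rbind(coef = a, dollar = c_opt, coefficients = d)
```

`summary(model)` (option B) also shows the coefficients, but as part of a larger printed table rather than as a plain vector.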
Which R command will show the detailed results (coefficients, residuals, R-squared, etc.) of
the regression?
A) summary(model)
B) print(model)
C) model$coefficients
D) coefficients(model)
A
Which of the following is the correct way to run a simple linear regression in R, where sales
is the dependent variable and advertising is the independent variable using the lm()
function?
A) lm(sales ~ advertising, data = dataset)
B) lm(advertising ~ sales, dataset)
C) lm(data = dataset, sales ~ advertising)
D) lm(dataset$sales, dataset$advertising)
A
To infer the political tendencies of the students at your college/university, you sample 150
of them. Only one of the following is a simple random sample. You:
A) make sure that the proportion of minorities are the same in your sample as in the
entire student body.
B) call every fiftieth person in the student directory at 9 a.m. If the person does not answer
the phone, you pick the next name listed, and so on.
C) go to the main dining hall on campus and interview students randomly there.
D) have your statistical package generate 150 random numbers in the range from 1 to the
total number of students in your academic institution, and then choose the corresponding
names in the student telephone directory
D
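Option D can be sketched in one call to `sample()`. The total enrolment below is a hypothetical number; the point is that every student ID has the same chance of being drawn and no ID is drawn twice.

```r
set.seed(123)  # arbitrary seed so the draw is reproducible
total_students <- 12000  # hypothetical size of the student directory

# 150 distinct random positions in the directory, each equally likely
sampled_ids <- sample(seq_len(total_students), size = 150, replace = FALSE)

head(sort(sampled_ids))
```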
4 elements of Ideal Randomized Controlled Experiment
4th Least Squares Assumption for Causal Inference in Multiple Regression & How it can be violated & Solutions
No perfect multicollinearity. Perfect multicollinearity: one regressor is an exact linear function of the other regressors. (Highly correlated regressors are imperfect multicollinearity: this does not violate the assumption, but the coefficients are estimated imprecisely.)
1. Inserting the same variable twice gives NA in the R output; Stata reports the variable as dropped
2. Dummy Variable Trap: when the dummy variables are mutually exclusive & exhaustive, including all of them & a constant gives perfect multicollinearity, because the dummies sum to 1 and so are redundant with the intercept term. One dummy can be perfectly predicted from the others, making it impossible to separate the individual effect of each dummy variable. Ex. income v. provinces
* Solution: modify list of regressors, omit intercept or omit a categorical group
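The trap and both solutions can be reproduced in a few lines. The sketch below uses simulated income data for three provinces (labels and numbers are made up for illustration). With an intercept and all three dummies, `lm()` cannot estimate the last dummy and reports NA for it.

```r
set.seed(8)  # arbitrary seed; simulated data
n <- 90
province <- factor(sample(c("AB", "BC", "ON"), n, replace = TRUE))
income <- 50 + 5 * (province == "BC") + 10 * (province == "ON") + rnorm(n)

# One dummy per province: mutually exclusive and exhaustive
d_ab <- as.numeric(province == "AB")
d_bc <- as.numeric(province == "BC")
d_on <- as.numeric(province == "ON")

# Trap: intercept plus all three dummies (d_ab + d_bc + d_on = 1 for every row)
trap <- lm(income ~ d_ab + d_bc + d_on)
coef(trap)  # the coefficient on d_on is NA

# Solution 1: omit one group; it becomes the base category
fix_drop_group <- lm(income ~ d_bc + d_on)

# Solution 2: omit the intercept and keep all dummies
fix_no_intercept <- lm(income ~ 0 + d_ab + d_bc + d_on)

coef(fix_drop_group)
coef(fix_no_intercept)
```

Both fixes give the same fitted values; only the interpretation changes. In solution 1 each coefficient is a difference from the omitted group, while in solution 2 each coefficient is that group's mean.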