Quantitative Methods Flashcards

Basics of multiple regression and underlying assumptions, evaluating regression model fit and interpreting model results, model misspecification, extensions of multiple regression, time-series analysis, machine learning, and big data projects (143 cards)

1
Q

Adjusted R²

A

The coefficient of determination adjusted for degrees of freedom.
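The adjustment can be sketched in Python (an illustrative helper; the function name is ours, not from any library):

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
    where n = observations and k = independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Unlike plain R^2, adding a weak variable can lower this figure:
print(adjusted_r2(0.60, 50, 3))  # ~0.574
print(adjusted_r2(0.61, 50, 4))  # ~0.575
```

Adjusted R² rises only when a new variable improves fit by more than its cost in degrees of freedom.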

2
Q

Akaike’s Information Criterion (AIC)

A

A measure of model fit, adjusted for parsimony, used when the goal is prediction; among candidate models, the one with the lowest AIC is expected to provide the most accurate forecasts.

3
Q

Schwarz’s Bayesian Information Criterion (BIC)

A

A measure of model fit that penalizes additional parameters more heavily than AIC, used when the goal is goodness of fit; among candidate models, the one with the lowest BIC is preferred.

4
Q

Joint test of hypotheses

What are its degrees of freedom? What are its conclusions?

A

A form of hypothesis test that determines whether a subset of q independent variables (those excluded from the restricted model) has significant power to explain changes in the dependent variable, relative to the unrestricted model that includes them.

The test is F-distributed with q and n - k - 1 degrees of freedom:

F = [(SSE_restricted - SSE_unrestricted) / q] / [SSE_unrestricted / (n - k - 1)]

H₀: the q excluded slope coefficients all equal 0
Hₐ: at least one of the q excluded slope coefficients ≠ 0

Rejecting H₀ means that the excluded variables jointly have significant explanatory power; the unrestricted model is preferred.
Failing to reject H₀ means that the excluded variables add no significant explanatory power; the restricted model is preferred.

5
Q

General linear F-test

What are its degrees of freedom? What are its conclusions?

A

A form of hypothesis test that determines whether an entire regression has significant power to explain changes in the dependent variable.

The test is F-distributed with k and n - k - 1 degrees of freedom.

H₀: b₁ = b₂ = … = bₖ = 0
Hₐ: At least one b ≠ 0

Rejecting H₀ means that at least one independent variable has significant explanatory power; the regression as a whole is significant.
Failing to reject H₀ means that the regression as a whole has no significant explanatory power.

6
Q

Model specification

A

The set of independent variables in a regression model as well as its functional form; in order for a model to be correctly specified, it must:

1.) Be grounded in economic reasoning
2.) Be parsimonious
3.) Perform well out-of-sample
4.) Have the appropriate functional form
5.) Satisfy regression assumptions

7
Q

Omitted variable bias

What can it lead to?

A

A functional form misspecification in which one or more independent variables with significant explanatory power for the dependent variable are missing from the regression; may lead to biased and inconsistent coefficient estimates, as well as heteroskedasticity and/or serial correlation in the residuals.

8
Q

Inappropriate form of variables

What can it lead to?

A

A functional form misspecification in which a nonlinear relationship between the independent and dependent variables is ignored; may lead to heteroskedasticity.

9
Q

Inappropriate scaling of variables

What can it lead to?

A

A functional form misspecification in which variables must be transformed before estimating the regression; may lead to heteroskedasticity and/or multicollinearity.

10
Q

Inappropriate pooling of variables

What can it lead to?

A

A functional form misspecification in which the regression pools observations from different contexts (e.g. fiscal regime, recession) leading to data clustering; may lead to heteroskedasticity and/or serial correlation.

11
Q

Breusch-Pagan test

What is it? How is it calculated? What are its conclusions?

A

A form of hypothesis test that determines whether or not conditional heteroskedasticity exists in a regression model.

BP = nR², where R² comes from a regression of the squared residuals on the independent variables; the statistic is chi-squared distributed with k degrees of freedom.

H₀: no conditional heteroskedasticity exists
Hₐ: conditional heteroskedasticity exists
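A minimal numeric sketch of the statistic (Python, illustrative only; in practice the R² input comes from the auxiliary regression of squared residuals on the independent variables):

```python
def breusch_pagan_stat(n: int, r2_resid: float) -> float:
    """BP = n * R^2; compare against a chi-squared critical value
    with k degrees of freedom (k = number of independent variables)."""
    return n * r2_resid

# 60 observations, auxiliary-regression R^2 of 0.08:
bp = breusch_pagan_stat(60, 0.08)  # 4.8; the 5% chi-squared critical
# value with k = 1 is ~3.84, so H0 would be rejected in this example
```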

12
Q

Durbin-Watson test

What is it? What are its degrees of freedom? What are its conclusions?

A

A form of hypothesis test that determines whether a model exhibits first-order serial correlation; the statistic is approximately DW ≈ 2(1 - r), where r is the correlation between consecutive residuals.

The critical values dₗ and dᵤ depend on n and k.

H₀: no positive serial correlation (DW = 2)
Hₐ: positive serial correlation (DW < 2)

DW < dₗ: reject H₀; the model exhibits positive serial correlation.
DW > dᵤ: fail to reject H₀; dₗ ≤ DW ≤ dᵤ: the test is inconclusive.

13
Q

Breusch-Godfrey test

What is it? What are its degrees of freedom? What does it conclude?

A

A form of hypothesis test that determines whether a model exhibits serial correlation up to an order p.

The test is F-distributed with p and n - p - k - 1 degrees of freedom.

H₀: no pth-order serial correlation exists
Hₐ: pth-order serial correlation exists

14
Q

Variance inflation factor

What is it? How is it calculated? What are its conclusions?

A

A figure representing the magnitude of multicollinearity.

VIF = 1/(1-R²ⱼ), where R²ⱼ comes from regressing independent variable j on the remaining independent variables

VIF > 5: Further investigation into the independent variable is warranted
VIF > 10: A serious multicollinearity issue is present with regard to the independent variable
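The thresholds can be illustrated with a small Python helper (the function name is ours):

```python
def vif(r2_j: float) -> float:
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    independent variable j on all the other independent variables."""
    return 1.0 / (1.0 - r2_j)

print(vif(0.75))  # 4.0  -> no obvious issue
print(vif(0.85))  # ~6.7 -> warrants investigation (VIF > 5)
print(vif(0.95))  # ~20  -> serious multicollinearity (VIF > 10)
```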

15
Q

Maximum likelihood estimation

a.k.a. MLE

A

A method that estimates values for the intercept and slope coefficients in a logistic regression; the logit equivalent of ordinary least squares (OLS).

16
Q

Likelihood ratio test

What is it and how is it calculated?

A

A joint test of hypotheses for a logit regression which uses the chi-squared distribution.

The closer the log-likelihood is to 0, the better the model fits the data.

LR = -2(LLR - LLU), where LLR = log-likelihood of restricted model and LLU = log-likelihood of unrestricted model

17
Q

Time series

A

A set of observations of a variable measured over successive periods of time; in a trend model, time serves as the independent variable.

18
Q

Studentized residual

A

A t-statistic which is used to determine whether an observation is an outlier.

19
Q

Trend

A

A long-term pattern of the dependent variable’s movement in a particular direction.

20
Q

Leverage (regression)

What is it? How is it used to determine influence?

A

A measure of how distant an observation's value of the independent variable is from the mean of that variable; it ranges from 0 to 1, with higher values indicating greater potential influence on the regression.

If the leverage of an observation > 3[(k+1)/n], then the observation is potentially influential

21
Q

Linear trend

A

A trend in which the dependent variable moves at a constant rate with respect to time.

22
Q

Studentized deleted residual

What is it? How is it used to determine influence?

A

A figure that quantifies the effect of removing an observation from a regression on the residuals of that regression.

If |studentized deleted residual| > 3, the observation is an outlier
If |studentized deleted residual| > critical t-value with n-k-2 degrees of freedom at a selected significance level, the observation is potentially influential

23
Q

Log-linear trend

A

A trend in which the dependent variable moves at an exponential rate with respect to time.

24
Q

Autoregressive model

a.k.a. AR model

A

A type of time-series regression in which the dependent variable is modeled to be explained by previous values of itself.

25
Covariance stationary | What is it? What are its implications on a time series?
The property of a time series in which:

1.) The expected value is constant & finite in all periods
2.) The variance is constant & finite in all periods
3.) The covariance of the series with its own past/future values, for a fixed number of periods of separation, is constant & finite in all periods

A time series must be covariance stationary in order for an AR model to yield inferences which are statistically sound.
26
kth-order autocorrelation
Correlation between observations in a time series separated by k periods.
27
Residual autocorrelation
Sample autocorrelation of an error term used to deduce whether the errors of an AR model are correlated.
28
Mean reversion | What is it and how is it calculated?
The property of a time series in which the value of the dependent variable returns to its average in the long term. Mean-reverting level of an AR(1) model = b0 / (1-b1)
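The mean-reverting level can be sketched in Python (illustrative helper, not a library function):

```python
def mean_reverting_level(b0: float, b1: float) -> float:
    """Long-run level of an AR(1) model x_t = b0 + b1 * x_(t-1) + e_t;
    only defined for a mean-reverting series (|b1| < 1)."""
    if abs(b1) >= 1:
        raise ValueError("not mean-reverting when |b1| >= 1")
    return b0 / (1 - b1)

print(mean_reverting_level(2.0, 0.6))  # ~5.0: forecasts drift toward 5
```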
29
Chain rule of forecasting
An estimation method in which the next period's value as predicted by the forecasting equation is substituted into the right-hand side of the equation in order to deduce the value two periods ahead.
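The substitution can be sketched as a short loop (Python, illustrative; coefficient values are made up):

```python
def chain_forecast(b0: float, b1: float, x_t: float, steps: int) -> float:
    """Iterate x_(t+1) = b0 + b1 * x_t, feeding each forecast
    back into the right-hand side for the next period."""
    x = x_t
    for _ in range(steps):
        x = b0 + b1 * x
    return x

# AR(1) with b0 = 1, b1 = 0.5, starting from x_t = 10:
print(chain_forecast(1.0, 0.5, 10.0, 1))  # 6.0 (one period ahead)
print(chain_forecast(1.0, 0.5, 10.0, 2))  # 4.0 (two periods ahead)
```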
30
Slope dummy
A dummy variable that functions to steepen or flatten the regression line.
31
In-sample forecast errors
The residuals within the sample that was used to fit the model.
32
Out-of-sample forecast errors
The residuals of forecasts outside of the sample that was used to fit the model.
33
Root mean squared error | a.k.a. RMSE
The criterion used to quantify and compare the out-of-sample forecast performances of models against each other.
34
Regime
A set of technological, legal, political and regulatory characteristics that defines an economic environment over a particular timeframe.
35
Random walk
A time-series in which the value in the current period is equal to the value in the past period plus some random error.
36
First-differencing | What is it and what is it used for?
A regression transformation in which the value at time t -1 is subtracted from the value at time t; used to make time-series covariance stationary.
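The transformation can be sketched in a line of Python (illustrative):

```python
def first_difference(series):
    """y_t = x_t - x_(t-1); the result has one fewer observation."""
    return [curr - prev for prev, curr in zip(series, series[1:])]

# Differencing a random walk leaves only the (stationary) error term:
print(first_difference([100, 103, 101, 106]))  # [3, -2, 5]
```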
37
Unit root
A time series whose lag coefficient in an AR(1) model equals 1 (as in a random walk) is said to have a unit root; such a series is not covariance stationary.
38
Dickey-Fuller test | What is it used for? How is it performed? What are its conclusions?
A hypothesis test used to determine whether a time series contains a unit root, based on the transformed AR(1) regression xₜ - xₜ₋₁ = g₁xₜ₋₁ + eₜ, where g₁ = b₁ - 1. The test uses a modified t-table.

H₀: g₁ = 0, thus the time series has a unit root
Hₐ: g₁ < 0, thus the time series does not have a unit root

Rejecting H₀ means that the time series is covariance stationary.
Failing to reject H₀ means that the time series is non-covariance-stationary.
39
Seasonality | How do we test for seasonality in a time series?
A characteristic of a time series in which a pattern emerges within a particular smaller timeframe. To test for seasonality in a time series: 1. Check the autocorrelations of each lag's residuals. 2. If one of the autocorrelations seems high, test if it is significantly different from zero by calculating its t-statistic. 3. If the t-statistic rejects the null hypothesis that autocorrelation is zero, this lag contains seasonality.
40
Seasonal lag
The value of a time series one year before the current period, included as an extra term to correct seasonality.
41
Autoregressive conditional heteroskedasticity (ARCH) | How do we test for ARCH? What are its implications?
A characteristic of an autoregressive model in which the variance of the residuals in one period depends on the variance of the residuals in a previous period.

To test for ARCH, regress the squared residuals on their own lagged values: êₜ² = a₀ + a₁êₜ₋₁² + uₜ. Then perform the following hypothesis test:

H₀: a₁ = 0
Hₐ: a₁ ≠ 0

Rejecting H₀ means that the model exhibits ARCH and must be re-estimated using generalized least squares.
Failing to reject H₀ means that the model does not exhibit ARCH.
42
Cointegration
The characteristic of a linear regression with two time series in which the series share a common long-term trend and thus do not diverge without bound in the long run.
43
Engle-Granger Dickey-Fuller test | What is it used for? How is it performed? What are its conclusions?
A hypothesis test used to determine whether two time series in a linear regression are cointegrated; it is performed by testing whether eₜ in y = b₀ + b₁x₁ + eₜ contains a unit root.

H₀: eₜ contains a unit root and thus is non-covariance-stationary
Hₐ: eₜ does not contain a unit root and thus is covariance stationary

Rejecting H₀ means that y and x₁ are cointegrated.
Failing to reject H₀ means that y and x₁ are not cointegrated.
44
Least absolute shrinkage & selection operator | What type of algorithm is this and what is it best suited for?
An ML algorithm using penalized regression in which the penalty term is a hyperparameter λ multiplied by the sum of the absolute values of the regression coefficients; this is a supervised algorithm best suited for regression problems.
45
Hyperparameter
A value set by a researcher before machine learning begins.
46
Regularization
The reduction of statistical variability in high-dimensional data estimation problems.
47
Support vector machine | What is it and what is it used for?
An ML algorithm characterized by a linear classifier that aims to maximize the distance between two groups of data; this is a supervised algorithm best suited for classification problems.
48
Linear classifier
A vector used to separate groups of data based on the features of each data point.
49
Soft margin classification
A modification of the support vector machine as to both maximize the distance between the two groups of data and minimize misclassification of data points.
50
K-nearest neighbour | What is it and what is it used for?
An ML algorithm which classifies the new data point based on the similarity of its characteristics to k existing data points, where k is a hyperparameter; this is a supervised algorithm used mostly for classification problems, but can also be used for regression.
51
Classification & regression tree
An ML algorithm that can be applied to predict either a categorical target variable, creating a classification tree, or a continuous target variable, creating a regression tree.
52
Ensemble learning
A method that combines multiple models to reach a more accurate classification or regression.
53
Ensemble method
A method that combines multiple ML algorithms to reach a more accurate classification or regression.
54
Majority-vote classifier
A method that assigns to a new data point the label with the most votes from the ensemble.
55
Bootstrap aggregating | A.k.a. bagging
A technique whereby the original training dataset is used to generate n new training datasets (bags) of data; each new dataset is generated by random sampling with replacement from the original dataset.
56
Random forest classifier
A collection of a large number of classification trees trained through a bagging method.
57
Precision
The ratio of correctly predicted positive classes to all predicted positive classes. Precision = TP / (TP + FP)
58
Recall | A.k.a. sensitivity
The ratio of correctly predicted positive classes to all actual positive classes. Recall = TP / (TP + FN)
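Both ratios can be sketched in Python (illustrative helpers; the counts are made up):

```python
def precision(tp: int, fp: int) -> float:
    """Of all predicted positives, the share that really was positive."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Of all actual positives, the share the model found."""
    return tp / (tp + fn)

# 40 true positives, 10 false positives, 20 false negatives:
print(precision(40, 10))         # 0.8
print(round(recall(40, 20), 3))  # 0.667
```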
59
Penalized regression
A system in which regression coefficients are chosen so as to minimize the sum of squared errors plus a penalty term that grows with each additional independent variable included in the model.
60
Pruning
Regularization of CART in which sections that do not hold sufficient classification/regression power are removed from the tree.
61
Principal components analysis
A type of dimension reduction that converts many correlated variables into fewer, uncorrelated composite variables.
62
Supervised machine learning
Algorithms which are trained to classify or regress data using a labeled dataset; includes penalized regression, LASSO, SVM, KNN, CART, ensemble learning and random forest.
63
Unsupervised machine learning
Algorithms which are trained to classify or regress data by finding patterns within the data itself; includes dimension reduction and clustering.
64
Dimension reduction
An ML algorithm which aims to represent a dataset with many correlated features to one represented by fewer features that maintain their explanatory power.
65
Composite variable | What are they and how are they formed?
A variable that is formed of multiple variables that are statistically strongly correlated with each other; they are represented by eigenvectors that each have an eigenvalue corresponding to their power in explaining the initial dataset.
66
Projection error
The perpendicular distance of a data point from a principal component; PCA aims to minimize the sum of this error across all data points.
67
Spread
The distance between data points measured along (parallel to) a principal component; PCA aims to maximize the spread across all data points.
68
Scree plot
A plot that shows the total variance of the data explained by each principal component.
69
Clustering
An ML algorithm that organizes data points into subsets called clusters, in which all points within a cluster are deemed similar; can be k-means or hierarchical.
70
K-means clustering | What is an advantage and disadvantage?
Clustering that partitions observations into k clusters; works well for large datasets, but k is a hyperparameter that must be estimated beforehand.
71
Centroid
The centre of a cluster formed using k-means clustering.
72
Hierarchical clustering | What are the advantages and disadvantages of each type?
Clustering that creates intermediate clusters that increase (agglomerative) or decrease (divisive) in size until a final clustering is reached; agglomerative clustering computes quicker, handles large datasets better, and is better suited for identifying small clusters, while divisive clustering is better suited for identifying large clusters.
73
Dendrogram
A type of tree diagram used to outline the results of hierarchical clustering at each iteration.
74
Structured ML model building steps
1. Conceptualization 2. Data collection 3. Data preparation (cleansing) and preprocessing (wrangling) 4. Data exploration 5. Model training
75
Unstructured ML model building steps
1. Text problem formulation 2. Text curation 3. Text preparation (cleansing) and preprocessing (wrangling) 4. Text exploration 5. Model training
76
Web scraping | a.k.a. web spidering, web crawling
The employment of a program to scour external data sources (usually websites) in order to collect raw textual information.
77
Readme files
Files containing instructions on how to use a given piece of raw data.
78
Application Programming Interface (API)
A set of well-defined methods that allow software applications to communicate; often used to deliver data.
79
Data Preparation (Cleansing)
The correction of invalid, inaccurate, inconsistent, incomplete, non-uniform or duplicate data before its use as an input to the preprocessing stage.
80
Steps in numerical data preprocessing (wrangling)
1. Transformation (extracting, aggregating, filtering, selecting, converting) 2. Outlier removal (trimming, winsorization) 3. Scaling (normalization, standardization)
81
Metadata
Data that provides information about other data.
82
Trimming
The removal of k% of highest and lowest values from a dataset.
83
Winsorization
The replacement of outliers with maximum and minimum values which are not outliers.
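The two outlier treatments can be contrasted in a short Python sketch (illustrative helpers; the data and bounds are made up):

```python
def trim(xs, k):
    """Drop the k smallest and the k largest values outright."""
    s = sorted(xs)
    return s[k:len(s) - k]

def winsorize(xs, lower, upper):
    """Replace values outside [lower, upper] with the boundary values."""
    return [min(max(x, lower), upper) for x in xs]

data = [1, 50, 52, 55, 58, 200]
print(trim(data, 1))            # [50, 52, 55, 58]
print(winsorize(data, 50, 58))  # [50, 50, 52, 55, 58, 58]
```

Trimming shrinks the sample; winsorization keeps every observation but caps the extremes.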
84
Scaling
Adjusting the range of data by shifting and changing the scale of the data; normalization and standardization are two types of scaling
85
Normalization
A type of scaling in which numbers are rescaled to fit in the range [0, 1]; this scaling method is sensitive to outliers. Xnorm = (X - Xmin)/(Xmax-Xmin)
86
Standardization
A type of scaling in which numerical data is centered around a mean of 0; data must follow a normal distribution in order for it to be standardized. Xstandard = (X - mean)/standard deviation
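The two scaling methods can be sketched together (Python, illustrative; the population standard deviation is used here as an assumption):

```python
def normalize(xs):
    """Min-max rescaling into [0, 1]; sensitive to outliers."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Center on mean 0 and scale by the standard deviation."""
    mean = sum(xs) / len(xs)
    sd = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / sd for x in xs]

print(normalize([2, 4, 6, 8]))  # [0.0, ~0.333, ~0.667, 1.0]
```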
87
Summation operator
The functional part of a neural network's node that multiplies each input value by its respective weight and sums the weighted values to form the total net input, which is then passed to the activation function.
88
Activation function
The functional part of a neural network's node that receives the total net input from the summation operator and transforms it into the final output of the node; operates akin to a dimmer switch that in/decreases the strength of the output.
89
Forward propagation
The process by which input values are passed forward through a neural network's layers, each node applying its weights and activation function, to produce the network's output.
90
Backward propagation
The method of adjusting weights in a neural network to minimize its error by moving backward through the network.
91
Neural network weight updating | What is learning rate?
New weight = Old weight - (Learning rate × partial derivative of the total error with respect to that weight), where learning rate is a hyperparameter controlling the size of each adjustment.
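The update rule can be sketched as one gradient-descent step (Python, illustrative; the numbers are made up):

```python
def update_weight(old_weight: float, learning_rate: float, grad: float) -> float:
    """One gradient-descent step: grad is the partial derivative of
    the total error with respect to this particular weight."""
    return old_weight - learning_rate * grad

# A positive gradient means error rises with the weight, so step down:
print(update_weight(0.8, 0.1, 2.5))  # ~0.55
```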
92
Deep neural network (DNN)
A neural network with at least two hidden layers.
93
Reinforcement learning
Machine learning in which an agent learns through trial and error, choosing actions that maximize its rewards over time from interacting with its environment.
94
Regular expression (Regex)
A series of particular characters in order used to find patterns in a body of text.
95
Text cleansing
The removal of html tags, punctuation, numbers, and whitespaces from a body of text in order to prepare it for preprocessing.
96
Steps in text data preprocessing (wrangling)
1. Tokenization 2. Normalization (lowercasing, removing stop words, stemming, lemmatization) 3. Bag-of-words 4. Document term matrix
97
Tokenization (data analysis)
The process of splitting a body of text into separate words, or tokens.
98
Bag-of-words (BOW)
A collection of a distinct set of tokens from all texts in a sample dataset.
99
Document term matrix (DTM)
A two-dimensional representation in which each column represents a token from the BOW, each row represents the name of a text, and each intersection represents the number of appearances of that token in that text.
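The BOW and DTM constructions can be sketched in plain Python (illustrative helpers; whitespace splitting stands in for real tokenization):

```python
def bag_of_words(texts):
    """Distinct tokens across all texts, in first-seen order."""
    bow = []
    for text in texts:
        for token in text.split():
            if token not in bow:
                bow.append(token)
    return bow

def document_term_matrix(texts):
    """One row per text, one column per BOW token, cell = count."""
    bow = bag_of_words(texts)
    return bow, [[t.split().count(tok) for tok in bow] for t in texts]

bow, dtm = document_term_matrix(["buy low sell high", "buy buy hold"])
print(bow)  # ['buy', 'low', 'sell', 'high', 'hold']
print(dtm)  # [[1, 1, 1, 1, 0], [2, 0, 0, 0, 1]]
```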
100
N-grams
The representation of a sequence of words aggregated into a single token.
101
Exploratory data analysis (EDA)
The first step in data exploration; it involves creating graphs, charts and other visualizations and analyzing descriptive statistics in order to understand relationships within the data, and typically requires a high degree of collaboration with other departments.
102
Feature selection
The second step in data exploration; a process in which only the features most relevant to model training are kept in the dataset in order to prevent overfitting and model overcomplication.
103
Feature engineering
The third and final step in data exploration; a process in which new features are created by changing or transforming existing features.
104
One-hot encoding
The process in which categorical variables are transformed into binary form for machine reading.
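The transformation can be sketched in one line of Python (illustrative; the category names are made up):

```python
def one_hot(value, categories):
    """Binary vector with a 1 at the position of the matching category."""
    return [1 if value == c else 0 for c in categories]

sectors = ["energy", "tech", "utilities"]
print(one_hot("tech", sectors))  # [0, 1, 0]
```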
105
Term frequency (TF)
TF = frequency of token in a dataset / total number of tokens in a dataset
106
Feature selection methods for text data
1. Frequency analysis 2. Chi-squared test 3. Mutual information (MI)
107
Document frequency (DF)
DF = number of documents (i.e. sentences, texts) in a dataset containing a token / total number of documents in the dataset
108
Chi-squared test (feature selection) | What does a high/low statistic mean?
A statistical method used to determine the independence of two events: the occurrence of a token and the occurrence of a class; a token with a high chi-squared statistic indicates higher discriminatory potential.
109
Mutual information (MI)
A measure of a token's tendency to appear in texts of a specific class rather than uniformly across all texts; MI of 0 indicates uniformity, while MI approaching 1 indicates concentration in one class and thus discriminatory potential.
110
Feature engineering methods for text data
1. Numbers 2. N-grams 3. Name entity recognition 4. Parts-of-speech
111
Parts of speech (POS)
An algorithm that tags a token to an element of a sentence (e.g. noun, verb) based on the words surrounding it.
112
Name entity recognition (NER)
An algorithm that tags a token to an object class (e.g. organization, year, name) based on the words surrounding it.
113
Steps in model training
1. Method selection 2. Performance evaluation 3. Tuning
114
Factors in model selection
1. Type of machine learning (supervised vs unsupervised) 2. Type of data 3. Size of data
115
Ground truth
The known, actual outcome of the target variable; its availability is a defining characteristic of supervised ML.
116
Class imbalance
An event in which the number of data points belonging to one class greatly outnumbers those belonging to another class; can be alleviated by oversampling the underrepresented class and undersampling the overrepresented class.
117
Intercept dummy
A dummy variable that functions to raise or lower the regression line parallel to the original regression.
118
Base error
Model error due to randomness in the data.
119
Bias error
Describes the degree to which the model fits the training data; high bias error can be caused by erroneous assumptions and will lead to underfitting and high in-sample error.
120
Complexity
Refers to the number of dimensions/features/parameters in the data and whether they are linear or non-linear.
121
Cross-validation
The process of estimating out-of-sample error directly by determining the error in validation samples.
122
Deep learning
An ML algorithm that uses neural networks to find patterns in highly complex data.
123
Dendrogram
A tree diagram used to visualize hierarchical clustering.
124
F1 score
The harmonic mean of precision and recall; F1 score is a better performance measure than accuracy when there is class imbalance. F1 score = 2PR / (P+R)
125
Accuracy
The ratio of correctly predicted classes to total predictions; an overall performance metric for classification problems. Accuracy = (TP + TN) / (TP + FP + TN + FN)
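Accuracy and F1 can be compared in a short Python sketch (illustrative helpers; the counts are made up to show class imbalance):

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fp + tn + fn)

def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)

# Class imbalance: accuracy looks strong while F1 exposes weak precision.
print(accuracy(tp=5, fp=5, tn=90, fn=0))  # 0.95
print(round(f1_score(0.5, 1.0), 3))       # 0.667
```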
126
Features
The independent variables in a labeled dataset.
127
Fitting curve
A curve that shows in- and out-of-sample error rates on the y-axis plotted against model complexity on the x-axis.
128
Generalization
The retention of a model's explanatory power when performing out-of-sample.
129
Target (machine learning)
The dependent variable in a labeled dataset.
130
Confusion matrix
A grid with actual classes on the x-axis and predicted classes on the y-axis used to evaluate Type I and Type II error rates, as well as correct predictions.

Top left cell → True positives
Top right cell → False positives (Type I error)
Bottom left cell → False negatives (Type II error)
Bottom right cell → True negatives
131
Performance evaluation
The measurement of model performance for goodness of fit. The three performance evaluation methods are as follows: 1. Error analysis 2. Receiver operating characteristic 3. RMSE
132
Receiver operating characteristic (ROC)
A curve illustrating the trade-off between the false positive rate (x-axis) and the true positive rate (y-axis) for various cutoff points.
133
Tuning | What are the two tuning methods?
A process performed on an ML model that aims to achieve the optimal parameters and hyperparameters that neither underfits nor overfits the model, and thus involves optimizing the bias-variance error tradeoff; two tuning methods are grid search and ceiling analysis.
134
Grid search
A tuning method in which different combinations of hyperparameters are applied to an ML model until the best model is found.
135
Ceiling analysis
An assessment of the pipeline of ML model development to locate at which step the model requires tuning.
136
Corpus
A collection of text data in any form.
137
Sentence length
The number of characters, including spaces, in a sentence.
138
Frequency analysis
The process of determining how important certain tokens are in a sentence and the corpus as a whole. Frequency analysis measures include: 1. Term frequency 2. Document frequency 3. Collection frequency 4. Inverse document frequency
139
Collection frequency (CF)
Number of instances of a token in a corpus / Number of tokens in the corpus
140
Inverse document frequency (IDF)
A relative measure of how unique a token is across the corpus. IDF = log(1/DF)
141
TF-IDF
An overall measure of the value of a token across the entire dataset. TF-IDF = TF x IDF A token with a higher TF-IDF appears more frequently throughout a small number of documents. A token with a lower TF-IDF appears across many documents.
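The TF, DF, IDF and TF-IDF definitions above can be applied literally in a Python sketch (illustrative; each document is modeled as a list of tokens, and the example tokens are made up):

```python
import math

def tf_idf(token, documents):
    """TF x IDF with IDF = log(1/DF), per the definitions above."""
    all_tokens = [t for doc in documents for t in doc]
    tf = all_tokens.count(token) / len(all_tokens)
    df = sum(token in doc for doc in documents) / len(documents)
    return tf * math.log(1 / df)

docs = [["rates", "rose"], ["rates", "fell"], ["stocks", "fell"]]
# "rates" is spread across documents, so it scores lower than "stocks":
print(tf_idf("rates", docs) < tf_idf("stocks", docs))  # True
```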
142
Holdout samples
Data samples that are not used to train the model.
143
K-fold cross validation | What is it used for?
A method in which data is shuffled and divided randomly into k equal subsamples; k - 1 subsamples are used as the training sample and the remaining kth subsample is used as the validation sample, rotating until each subsample has served as the validation sample once; used to mitigate the excessive shrinkage of the training sample caused by setting aside holdout data.
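The rotation can be sketched in Python (illustrative helper; the striding split and fixed seed are our simplifications):

```python
import random

def k_fold_splits(data, k, seed=0):
    """Shuffle, split into k folds, and let each fold serve once as
    the validation sample while the rest form the training sample."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, folds[i]

splits = list(k_fold_splits(list(range(10)), k=5))
print(len(splits))  # 5 train/validation pairs, each with 8 + 2 points
```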