Machine Learning Flashcards

(32 cards)

1
Q

What is machine learning?

A

Using data to teach algorithms to predict outcomes they have never seen before.

2
Q

What are the three steps in machine learning?

A
  • Give the computer data and outcomes
  • The computer figures out patterns by itself
  • It uses these patterns to make predictions on new data
3
Q

Give 5 differences between statistics and machine learning?

A
  1. Goal
  2. Questions
  3. Evaluation
  4. Approach
  5. Style
4
Q

How does the goal of statistics differ from the goal of machine learning?

A

Understanding relationships VS Making accurate predictions

5
Q

How does the central question in statistics differ from that in machine learning?

A

How does X relate to Y? VS Given X, what is Y?

6
Q

How does the evaluation of statistics differ from that of machine learning?

A

Coefficients, p-values VS Test-set accuracy, error rate

6
Q

How does the style of statistics differ from that of machine learning?

A

Transparent but rigid VS Flexible but often opaque

7
Q

How does the approach of statistics differ from that of machine learning?

A

Model assumptions VS Algorithms that learn patterns

8
Q

What are 4 important concepts in machine learning?

A
  1. Feature = independent variable
     Any variable used to make predictions
  2. Target = dependent variable
     Outcome to predict
     In classification, the outcome is a category
  3. Training: estimating a model
  4. Loss function: objective to minimize
  The concepts are the same as in statistics, just used more broadly.
8
Q

What are the two types of machine learning?

A
  1. Supervised: we have labels (outcomes). Model learns to predict them
  2. Unsupervised: no labels. The model finds hidden structure in the data
9
Q

Give 4 ways in which you move from inference to prediction?

A
  • Does X cause Y? -> Can we forecast Y?
  • Interpreting coefficients -> Minimizing prediction error
  • Worry about unobserved variables, reverse causality -> Use whatever patterns work
  • Use fixed effects to control for unobservables -> Use flexible algorithms that capture non-linearities and interactions
10
Q

What is a regression tree?

A

A flowchart that predicts a continuous outcome by asking a series of yes/no questions that split the data step by step. Each endpoint then gives a prediction, which is the average outcome of the observations that end up there.

11
Q

What are three characteristics of regression trees?

A
  • Easy to understand and visualize
  • Fundamentally algorithmic: computer searches for best splits rather than estimating coefficients
  • Showcase common strengths and problems of ML algorithms: flexibility, overfitting and cross-validation.
12
Q

What are the two elements of a regression tree?

A
  • Node: asks a yes/no question about a variable that splits the data into two groups
  • Leaf: endpoint where the tree makes a prediction (mean of the observations that land there)
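The node and leaf structure can be sketched as a small nested dictionary. This is only an illustration: the variable names (`size`, `age`), the thresholds, and the leaf values are made up, and `predict` is a hypothetical helper, not part of any library.

```python
# A hypothetical regression tree stored as nested dicts.
# Variable names, thresholds, and leaf values are made up for illustration.
tree = {
    "feature": "size",        # node: "is size <= 50?"
    "threshold": 50,
    "yes": {"leaf": 100.0},   # leaf: mean outcome of observations that land here
    "no": {
        "feature": "age",     # node: "is age <= 10?"
        "threshold": 10,
        "yes": {"leaf": 250.0},
        "no": {"leaf": 180.0},
    },
}


def predict(node, observation):
    """Follow the yes/no answers from the root until a leaf is reached."""
    while "leaf" not in node:
        answer = observation[node["feature"]] <= node["threshold"]
        node = node["yes"] if answer else node["no"]
    return node["leaf"]
```

Calling `predict(tree, {"size": 80, "age": 5})` answers "no" at the first node and "yes" at the second, so it returns that leaf's value.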
13
Q

What are the steps in which a regression tree works?

A
  • Consider every variable and every threshold at each node
  • Pick the split that minimizes the variance within the resulting groups
  • This is analogous to minimizing within-group variation in panel regression
  • Continue until a stopping rule is met
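The search for a single split can be sketched roughly as follows. It assumes that "variance within the resulting groups" is measured as the total sum of squared deviations from each group's mean, which is one common choice; the function names and the toy rows are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the group mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, target):
    """Try every variable and every observed threshold; return the split
    (variable, threshold, total within-group SSE) with the lowest SSE."""
    best = None
    features = [name for name in rows[0] if name != target]
    for feature in features:
        for threshold in sorted({row[feature] for row in rows}):
            left = [r[target] for r in rows if r[feature] <= threshold]
            right = [r[target] for r in rows if r[feature] > threshold]
            if not left or not right:
                continue  # a split must create two non-empty groups
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (feature, threshold, total)
    return best


# Toy data, made up for illustration only.
rows = [
    {"x1": 1, "x2": 5, "y": 10.0},
    {"x1": 2, "x2": 1, "y": 12.0},
    {"x1": 3, "x2": 4, "y": 30.0},
    {"x1": 4, "x2": 2, "y": 32.0},
]
split = best_split(rows, "y")
```

A full tree would apply the same search again inside each resulting group until a stopping rule is met; this sketch stops after one split.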
14
Q

Give three key advantages of a regression tree?

A
  • Handles non-linear relationships automatically
  • No need to specify interactions between variables
  • Intuitive and easy to explain
15
Q

How do you measure prediction quality of algorithms?

A

RMSE = SQRT((1/n) * Sum of (actual value - predicted value)^2)

Lower RMSE means better predictions.
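As a minimal sketch, the formula can be written directly in Python. The function name `rmse` and its input checks are choices made for this example.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Perfect predictions give an RMSE of 0, and the value is in the same units as the outcome.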

16
Q

What is meant by model complexity?

A

In a regression tree, complexity is the number of leaves: more leaves let the tree capture more intricate patterns.

17
Q

What is the difference between a training set and a test set?

A

Training set: the data used to estimate (train) the model
Test set: data hidden during estimation and used only to evaluate performance
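One simple way to create the two sets is to shuffle the rows and hold some of them out. The sketch below assumes a random 80/20 split, which is a common convention rather than something the cards prescribe; `split_train_test` and its parameters are hypothetical names.

```python
import random


def split_train_test(rows, test_share=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a share of them as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_test = max(1, round(len(shuffled) * test_share))
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_train_test(list(range(10)), test_share=0.2)
```

The model is estimated on `train` only; `test` is touched only when measuring performance.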

18
Q

What is meant by overfitting?

A

The model learns the training data too well, including its noise and peculiarities, so it predicts new data poorly.

18
Q

What is the sweet spot in ML algorithm development?

A

The number of leaves at which the RMSE on the test data is lowest.
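Given a test-set RMSE for each candidate tree size, picking the sweet spot is a simple minimum. The numbers below are made up to show the typical U-shape; they are not results from any real model, and `sweet_spot` is a hypothetical helper.

```python
# Hypothetical test-set RMSE by number of leaves (made-up numbers, not results).
test_rmse_by_leaves = {2: 9.1, 4: 7.4, 8: 6.8, 16: 7.2, 32: 8.5}


def sweet_spot(rmse_by_leaves):
    """Return the number of leaves with the lowest test-set RMSE."""
    return min(rmse_by_leaves, key=rmse_by_leaves.get)


best_leaves = sweet_spot(test_rmse_by_leaves)
```

In these illustrative numbers the error falls as leaves are added and then rises again once the tree starts to overfit.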

19
Q

How does classification using logistic regression work?

A

Classification is done using logistic regression. Compared with a regression tree (evaluated on RMSE), a classification tree changes two things:
1. Prediction: each leaf predicts a class instead of a number
2. Splitting criterion: each split should make the groups as pure as possible instead of minimizing prediction error. Purity is measured by entropy

20
Q

What is entropy?

A

Measure of how mixed a group is

21
Q

What are the different gradations in entropy and what do they mean?

A
  • Zero entropy: perfectly pure, all of one class
  • Low entropy: mostly one class
  • Medium entropy: some mixing
  • High entropy: even mix
  The tree picks the split that reduces the entropy the most.
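The cards do not give a formula, so the sketch below assumes the usual Shannon entropy with base-2 logarithms, where a pure group scores 0 and an even two-class mix scores 1. The function name `entropy` is chosen for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


pure = entropy(["a", "a", "a", "a"])    # zero entropy: all one class
mostly = entropy(["a", "a", "a", "b"])  # low entropy: mostly one class
even = entropy(["a", "a", "b", "b"])    # high entropy: even mix
```

A split is judged by how much it lowers the entropy of the resulting groups compared with the group before the split.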
22
Q

What are the 4 steps in a classification algorithm?

A
  • Each split is a yes/no question
  • At each step, the tree tries every word (feature) and picks the one that reduces entropy the most
  • It keeps splitting until the improvement falls below a threshold (complexity parameter)
  • Pattern recognition quickly outperforms humans
23
Q

What is a confusion matrix?

A

A table that compares what the model predicted with what actually happened.

24
Q

What are the four possible outcomes of the confusion matrix?

A
  1. True negative: Predicted A, Actual A
  2. False negative: Predicted A, Actual B
  3. False positive: Predicted B, Actual A
  4. True positive: Predicted B, Actual B
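Counting the four outcomes can be sketched as below, treating class B as the positive class to match the card. The function name, the `positive` parameter, and the example labels are hypothetical.

```python
def confusion_counts(actual, predicted, positive="B"):
    """Count TN, FN, FP, and TP for a two-class problem."""
    counts = {"TN": 0, "FN": 0, "FP": 0, "TP": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct if the actual class is also positive.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: wrong if the actual class is positive.
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Made-up labels for illustration only.
actual = ["A", "A", "B", "B", "B", "A"]
predicted = ["A", "B", "B", "A", "B", "A"]
counts = confusion_counts(actual, predicted)
```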
25
Q

What are three key metrics of the confusion matrix?

A
  1. Accuracy
  2. True positive rate
  3. False positive rate
26
Q

What is accuracy in a confusion matrix and how do you calculate it?

A

Accuracy = (TP + TN) / (TP + TN + FP + FN)
The share of all predictions that are correct.

27
Q

What is the true positive rate and how do you measure it?

A

True positive rate = TP / (TP + FN)
The share of actual positives that we caught.

28
Q

What is the false positive rate?

A

False positive rate = FP / (FP + TN)
The share of actual negatives that we falsely flagged.

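The three metrics follow directly from the four counts. This sketch assumes every denominator is non-zero (at least one actual positive and one actual negative); `confusion_metrics` is a hypothetical helper and the example counts are made up.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, true positive rate, and false positive rate from the four counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "true_positive_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
    }


# Made-up counts for illustration only.
metrics = confusion_metrics(tp=30, tn=50, fp=10, fn=10)
```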
29
Q

Give 4 ways in which machine learning is used as a measurement tool?

A
  1. Sentiment analysis: optimism vs pessimism
  2. Innovation measurement: does a patent describe a breakthrough?
  3. Readability: how complex is this disclosure?
  4. AI writing detection: purely human or AI generated?