Hyperparameters & Activation Functions Flashcards

(28 cards)

1
Q

What are hyperparameters?

A

Hyperparameters are configuration variables external to the model that are set before the training process begins and control how the model learns.
- They define how the model is trained, e.g. how long we train and the learning rate

2
Q

What is dropout?

A
  • Dropout is a regularization technique used to prevent overfitting in neural networks.
  • Works by randomly turning off (zeroing) neurons during training
3
Q

How does dropout work?

A

During training, dropout randomly “drops out” (sets to zero) a certain percentage of neurons in a layer at each training step. This means the dropped-out neurons do not contribute to the forward pass or backpropagation for that specific training example.
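The mechanics above can be sketched in a few lines of NumPy. This is a minimal illustration of inverted dropout (the variant most frameworks use); the function name, seed, and demo values are illustrative, not from any particular library:

```python
import numpy as np

def dropout(x, p=0.5, training=True, seed=0):
    # Inverted dropout: zero a fraction p of units during training and
    # rescale the survivors by 1/(1-p) so the expected activation is unchanged.
    if not training or p == 0.0:
        return x
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) >= p   # True = neuron kept
    return x * mask / (1.0 - p)

activations = np.ones(1000)
dropped = dropout(activations, p=0.5)   # ~half zeroed, survivors doubled
```

At inference time (`training=False`) the input passes through unchanged, which is why the rescaling during training matters.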

4
Q

What’s the typical range of dropout?

A

0.2 to 0.5 (the probability of dropping each neuron)

5
Q

What is the effect of dropout?

A
  • Improved generalization and reduced overfitting
  • By forcing the network to learn more robust features that are less reliant on the presence of any single neuron, dropout encourages the model to learn redundant representations.
  • This effectively trains an ensemble of smaller, different networks at each iteration, leading to improved generalization and reduced overfitting.
6
Q

Overfitting

A
  • When a model learns the training data too well, including noise and specific patterns that do not generalize to unseen data.
  • Causes the model to perform poorly on unseen data
7
Q

Learning Rate

A

- step size during the optimization process, particularly in gradient descent algorithms
- Dictates how much the model’s parameters are updated in response to the estimated error at each iteration
- Influences both the speed of convergence and final accuracy
- how fast the model parameters are allowed to change
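A minimal sketch of how the learning rate scales each parameter update in gradient descent, minimizing the toy function f(w) = w² (gradient 2w); the values and names are illustrative:

```python
def gd_step(w, grad, lr):
    # One gradient-descent update: move against the gradient,
    # scaled by the learning rate.
    return w - lr * grad

# Minimize f(w) = w**2 starting from w = 1.0.
w_good, w_high = 1.0, 1.0
for _ in range(50):
    w_good = gd_step(w_good, 2 * w_good, lr=0.1)   # shrinks toward 0
    w_high = gd_step(w_high, 2 * w_high, lr=1.1)   # overshoots, diverges
```

The same loop with two learning rates also previews the next two cards: a moderate rate converges, while a too-high rate makes the iterates grow without bound.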

8
Q

What happens if the learning rate is too high?

A
  • the optimization process may overshoot the optimal solution
  • leads to unstable training, oscillations in loss, or even divergence
9
Q

What happens if the learning rate is too low?

A
  • Very slow convergence, potentially getting stuck in local minima and taking a long time to reach an optimal solution
10
Q

What is a typical learning rate?

A
  • Problem dependent and often requires tuning
  • 0.01 is a recommended starting point for general neural network training
  • 0.001 for optimizers like Adam
  • 0.0001 for optimizers like AdamW
11
Q

Max Depth

A
  • number of layers
  • control the complexity of the network
  • more layers allow the model to fit more complex patterns but also increase the risk of overfitting
12
Q

Regularization Strength

A
  • Adds a penalty term to the model’s loss function, preventing it from fitting the training data too closely and improving its ability to generalize to unseen data
13
Q

L1 Regularization

A
  • Lasso
  • has the effect of reducing the number of features used in the model by pushing to zero the weights of features that would otherwise have small weights.
  • results in sparse models and reduces the amount of noise in the model
14
Q

L2 regularization

A
  • Ridge
  • results in smaller overall weight values and stabilizes the weights when there is a high correlation between the input features
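To make the two penalties concrete, here is a small NumPy sketch of the terms that L1 (Lasso) and L2 (Ridge) add to the loss; the weight vector and lambda are illustrative toy values:

```python
import numpy as np

def l1_penalty(w, lam):
    # Lasso adds lam * sum(|w_i|): tends to push small weights to exactly zero.
    return lam * np.abs(w).sum()

def l2_penalty(w, lam):
    # Ridge adds lam * sum(w_i**2): shrinks all weights but rarely zeroes them.
    return lam * np.square(w).sum()

weights = np.array([0.5, -0.5, 2.0])
```

Note how the L2 term is dominated by the single large weight (2.0² = 4), which is why Ridge pressures large weights hardest, while L1 charges every nonzero weight at the same rate.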
15
Q

Number of epochs

A

how many times the entire training dataset is passed through the learning algorithm

16
Q

scale_pos_weight

A
  • XGBoost param
  • used to address class imbalance in binary classification problems
  • can adjust to reduce high FN rate
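A common starting value suggested in the XGBoost docs is the ratio of negative to positive examples. A quick pure-Python sketch (the helper name and toy labels are illustrative):

```python
def scale_pos_weight(labels):
    # Starting heuristic: sum(negative) / sum(positive).
    # Raising it penalizes missed positives (false negatives) more heavily.
    pos = sum(labels)
    neg = len(labels) - pos
    return neg / pos

labels = [1] * 10 + [0] * 90   # toy dataset: 10% positive class
```

For this 90/10 imbalance the heuristic gives 9.0, i.e. each positive example counts nine times as much in the loss.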
17
Q

batch_size

A
  • in stochastic gradient descent, the batch size determines how many samples are used to compute the gradient at each step
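A minimal sketch of how the batch size slices the training set into the groups used for each gradient estimate (names and toy data are illustrative):

```python
def minibatches(data, batch_size):
    # Yield consecutive slices of `data`; each slice is the set of
    # samples used to compute one gradient step in minibatch SGD.
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

batches = list(minibatches(list(range(10)), batch_size=4))
# 10 samples with batch_size=4 -> 3 steps per epoch (last batch is smaller)
```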
18
Q

n_estimators

A
  • used in random forest
  • number of decision trees in the ensemble
19
Q

max tree depth

A

limits the depth of individual decision trees, preventing them from becoming overly complex

20
Q

Early stopping

A
  • regularization technique
  • Training is stopped when the validation error starts to increase, indicating that the model is starting to overfit the training data.
  • This prevents the model from continuing to learn the training data too closely
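The stopping rule can be sketched as a patience loop over per-epoch validation losses; the function name, patience value, and loss curve below are illustrative:

```python
def early_stop_epoch(val_losses, patience=2):
    # Return the epoch at which training stops: the first epoch where the
    # validation loss has failed to improve for `patience` epochs in a row.
    best = float("inf")
    bad = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, bad = loss, 0
        else:
            bad += 1
            if bad >= patience:
                return epoch
    return len(val_losses) - 1

# Validation loss improves through epoch 2, then rises: stop at epoch 4.
stop = early_stop_epoch([0.9, 0.7, 0.6, 0.65, 0.66, 0.7], patience=2)
```

Real implementations usually also restore the weights from the best epoch (epoch 2 here), not the epoch where training halted.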
21
Q

What are some regularization techniques?

A

L1, L2, Dropout, Early stopping

22
Q

What is the goal of regularization?

A

To prevent over-fitting

23
Q

What hyperparameters should be adjusted for binary classification problems with imbalanced datasets?

A
  • scale_pos_weight
  • eval_metric
24
Q

eval_metric

A
  • XGBoost param
  • specifies the evaluation metric used to monitor the model’s performance on a validation dataset
25
Q

Underfitting

A

The model performs poorly even on the training dataset
26
Q

What to do about underfitting?

A
  • Add new domain-specific features and more feature Cartesian products
  • Change the types of feature processing used (e.g., increasing n-gram size)
  • Decrease the amount of regularization used
27
Q

Target Encoding

A

Encoding that replaces a categorical variable with a single new numerical variable, mapping each category to the corresponding probability (mean) of the target.
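A minimal pure-Python sketch of mean target encoding under this definition (the function and toy data are illustrative; real implementations typically add smoothing so rare categories don't overfit):

```python
from collections import defaultdict

def target_encode(categories, targets):
    # Replace each category with the mean of the (binary) target over the
    # rows having that category -- its empirical target probability.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories]

encoded = target_encode(["a", "a", "b", "b"], [1, 0, 1, 1])
# category "a" -> 0.5 (1 of 2 positive), "b" -> 1.0 (2 of 2 positive)
```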
28
Q

Standard Scaling

A

A feature engineering method that transforms continuous features by centering their mean around zero and scaling them to unit variance. This helps regression models perform better by ensuring that no single feature dominates the others due to differences in scale.
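The transform above is the z-score; a minimal NumPy sketch (names and toy values are illustrative):

```python
import numpy as np

def standard_scale(x):
    # Center to zero mean and scale to unit variance (z-score).
    return (x - x.mean()) / x.std()

scaled = standard_scale(np.array([1.0, 2.0, 3.0, 4.0]))
```

After scaling, the feature has mean 0 and standard deviation 1, so features measured in different units contribute on a comparable scale.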