Hyperparameters & Activation Functions Flashcards

(28 cards)

1
Q

What are hyperparameters?

A

Hyperparameters are configuration variables external to the model that are set before the training process begins and control how the model learns.
- They define how the model is trained, e.g. how long we train and the learning rate

2
Q

What is dropout?

A
  • Dropout is a regularization technique used to prevent overfitting in neural networks.
  • Works by randomly turning off (zeroing) neurons during training
3
Q

How does dropout work?

A

During training, dropout randomly “drops out” (sets to zero) a certain percentage of neurons in a layer at each training step. This means the dropped-out neurons do not contribute to the forward pass or backpropagation for that specific training example.
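The mechanics above can be sketched in a few lines of NumPy. This is a minimal illustration of inverted dropout (the variant most frameworks use); the function name, seed, and demo values are illustrative, not from any particular library:

```python
import numpy as np

def dropout(x, p=0.5, training=True, seed=0):
    # Inverted dropout: zero a fraction p of units during training and
    # rescale the survivors by 1/(1-p) so the expected activation is unchanged.
    if not training or p == 0.0:
        return x
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) >= p   # True = neuron kept
    return x * mask / (1.0 - p)

activations = np.ones(1000)
dropped = dropout(activations, p=0.5)   # ~half zeroed, survivors doubled
```

At inference time (`training=False`) the input passes through unchanged, which is why the rescaling during training matters.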

4
Q

What’s the typical range of dropout?

A

0.2 to 0.5 (the probability of dropping each neuron)

5
Q

What is the effect of dropout?

A
  • Improved generalization and reduced overfitting
  • By forcing the network to learn more robust features that are less reliant on the presence of any single neuron, dropout encourages the model to learn redundant representations.
  • This effectively trains an ensemble of smaller, different networks at each iteration, leading to improved generalization and reduced overfitting.
6
Q

Overfitting

A
  • When a model learns the training data too well, including noise and specific patterns that do not generalize to unseen data.
  • Causes the model to perform poorly on unseen data
7
Q

Learning Rate

A

- step size during the optimization process, particularly in gradient descent algorithms
- Dictates how much the model’s parameters are updated in response to the estimated error at each iteration
- Influences both the speed of convergence and final accuracy
- how fast the model parameters are allowed to change
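A minimal sketch of how the learning rate scales each parameter update in gradient descent, minimizing the toy function f(w) = w² (gradient 2w); the values and names are illustrative:

```python
def gd_step(w, grad, lr):
    # One gradient-descent update: move against the gradient,
    # scaled by the learning rate.
    return w - lr * grad

# Minimize f(w) = w**2 starting from w = 1.0.
w_good, w_high = 1.0, 1.0
for _ in range(50):
    w_good = gd_step(w_good, 2 * w_good, lr=0.1)   # shrinks toward 0
    w_high = gd_step(w_high, 2 * w_high, lr=1.1)   # overshoots, diverges
```

The same loop with two learning rates also previews the next two cards: a moderate rate converges, while a too-high rate makes the iterates grow without bound.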

8
Q

What happens if the learning rate is too high?

A
  • the optimization process may overshoot the optimal solution
  • leads to unstable training, oscillations in loss, or even divergence
9
Q

What happens if the learning rate is too low?

A
  • Very slow convergence, potentially getting stuck in local minima and taking a long time to reach an optimal solution
10
Q

What is a typical learning rate?

A
  • Problem dependent and often requires tuning
  • 0.01 is a recommended starting point for general neural network training
  • 0.001 for optimizers like Adam
  • 0.0001 for optimizers like AdamW
11
Q

Max Depth

A
  • number of layers
  • control the complexity of the network
  • more layers allow the model to fit more complex patterns but also increase the risk of overfitting
12
Q

Regularization Strength

A
  • Adds a penalty term to the model’s loss function, preventing it from fitting the training data too closely and improving its ability to generalize to unseen data
13
Q

L1 Regularization

A
  • Lasso
  • has the effect of reducing the number of features used in the model by pushing to zero the weights of features that would otherwise have small weights.
  • results in sparse models and reduces the amount of noise in the model
14
Q

L2 regularization

A
  • Ridge
  • results in smaller overall weight values and stabilizes the weights when there is a high correlation between the input features
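To make the two penalties concrete, here is a small NumPy sketch of the terms that L1 (Lasso) and L2 (Ridge) add to the loss; the weight vector and lambda are illustrative toy values:

```python
import numpy as np

def l1_penalty(w, lam):
    # Lasso adds lam * sum(|w_i|): tends to push small weights to exactly zero.
    return lam * np.abs(w).sum()

def l2_penalty(w, lam):
    # Ridge adds lam * sum(w_i**2): shrinks all weights but rarely zeroes them.
    return lam * np.square(w).sum()

weights = np.array([0.5, -0.5, 2.0])
```

Note how the L2 term is dominated by the single large weight (2.0² = 4), which is why Ridge pressures large weights hardest, while L1 charges every nonzero weight at the same rate.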
15
Q

Number of epochs

A

how many times the entire training dataset is passed through the learning algorithm

16
Q

scale_pos_weight

A
  • XGBoost param
  • used to address class imbalance in binary classification problems
  • can adjust to reduce high FN rate
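A common starting value suggested in the XGBoost docs is the ratio of negative to positive examples. A quick pure-Python sketch (the helper name and toy labels are illustrative):

```python
def scale_pos_weight(labels):
    # Starting heuristic: sum(negative) / sum(positive).
    # Raising it penalizes missed positives (false negatives) more heavily.
    pos = sum(labels)
    neg = len(labels) - pos
    return neg / pos

labels = [1] * 10 + [0] * 90   # toy dataset: 10% positive class
```

For this 90/10 imbalance the heuristic gives 9.0, i.e. each positive example counts nine times as much in the loss.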
17
Q

batch_size

A
  • in stochastic gradient descent, the batch size determines how many samples are used to compute the gradient at each step
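A minimal sketch of how the batch size slices the training set into the groups used for each gradient estimate (names and toy data are illustrative):

```python
def minibatches(data, batch_size):
    # Yield consecutive slices of `data`; each slice is the set of
    # samples used to compute one gradient step in minibatch SGD.
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

batches = list(minibatches(list(range(10)), batch_size=4))
# 10 samples with batch_size=4 -> 3 steps per epoch (last batch is smaller)
```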
18
Q

n_estimators

A
  • used in random forest
  • number of decision trees in the ensemble
19
Q

max tree depth

A

limits the depth of individual decision trees, preventing them from becoming overly complex

20
Q

Early stopping

A
  • regularization technique
  • Training is stopped when the validation error starts to increase, indicating that the model is starting to overfit the training data.
  • This prevents the model from continuing to learn the training data too closely
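The stopping rule can be sketched as a patience loop over per-epoch validation losses; the function name, patience value, and loss curve below are illustrative:

```python
def early_stop_epoch(val_losses, patience=2):
    # Return the epoch at which training stops: the first epoch where the
    # validation loss has failed to improve for `patience` epochs in a row.
    best = float("inf")
    bad = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, bad = loss, 0
        else:
            bad += 1
            if bad >= patience:
                return epoch
    return len(val_losses) - 1

# Validation loss improves through epoch 2, then rises: stop at epoch 4.
stop = early_stop_epoch([0.9, 0.7, 0.6, 0.65, 0.66, 0.7], patience=2)
```

Real implementations usually also restore the weights from the best epoch (epoch 2 here), not the epoch where training halted.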
21
Q

What are some regularization techniques?

A

L1, L2, Dropout, Early stopping

22
Q

What is the goal of regularization?

A

To prevent over-fitting

23
Q

What hyperparameters should be adjusted for binary classification problems with imbalanced datasets?

A
  • scale_pos_weight
  • eval_metric
24
Q

eval_metric

A
  • XGBoost param
  • specifies the evaluation metric used to monitor the model’s performance on a validation dataset
25
Q

Underfitting

A

The model performs poorly even on the training dataset
26
Q

What to do about underfitting?

A
  • Add new domain-specific features and more feature Cartesian products
  • Change the types of feature processing used (e.g., increasing n-gram size)
  • Decrease the amount of regularization used
27
Q

Target Encoding

A

Encoding that replaces a categorical variable with a single new numerical variable, mapping each category to the corresponding probability (mean) of the target.
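A minimal pure-Python sketch of mean target encoding under this definition (the function and toy data are illustrative; real implementations typically add smoothing so rare categories don't overfit):

```python
from collections import defaultdict

def target_encode(categories, targets):
    # Replace each category with the mean of the (binary) target over the
    # rows having that category -- its empirical target probability.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories]

encoded = target_encode(["a", "a", "b", "b"], [1, 0, 1, 1])
# category "a" -> 0.5 (1 of 2 positive), "b" -> 1.0 (2 of 2 positive)
```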
28
Q

Standard Scaling

A

A feature engineering method that transforms continuous features by centering their mean around zero and scaling them to unit variance. This helps regression models perform better by ensuring that no single feature dominates the others due to differences in scale.
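The transform above is the z-score; a minimal NumPy sketch (names and toy values are illustrative):

```python
import numpy as np

def standard_scale(x):
    # Center to zero mean and scale to unit variance (z-score).
    return (x - x.mean()) / x.std()

scaled = standard_scale(np.array([1.0, 2.0, 3.0, 4.0]))
```

After scaling, the feature has mean 0 and standard deviation 1, so features measured in different units contribute on a comparable scale.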