Optimization Algorithms Flashcards

(13 cards)

1
Q

What are some ways to make the cost function J decrease faster?

A

Normalize the input data
Use gradient descent with momentum
Use RMSprop
Initialize the weights randomly, with an initialization scheme that keeps the weights from being too large
Use mini-batch gradient descent

The Adam optimization algorithm combines gradient descent with momentum and RMSprop, tracking both moments of the gradients.

2
Q

In very high-dimensional spaces, gradient descent is more likely to end up in a local minimum than at a saddle point of the cost function. True/False?

A

False. In high dimensions, most points where the gradient is zero are saddle points rather than local minima, because a minimum requires the cost to curve upward in every dimension at once.

3
Q

What are the steps required to create the mini-batches?

A

First shuffle the dataset, then partition it into mini-batches of the chosen size. The last mini-batch may be smaller: it holds the remaining examples, from the last full batch boundary up to m, the total number of examples.
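The two steps above can be sketched in NumPy as follows (function name and array shapes are illustrative, following the convention of examples stacked along columns):

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the dataset, then partition it into mini-batches.
    X has shape (n_features, m); Y has shape (1, m)."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]                        # total number of examples
    perm = rng.permutation(m)             # step 1: shuffle
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    batches = []
    for start in range(0, m, batch_size):  # step 2: partition
        end = min(start + batch_size, m)   # last mini-batch may be smaller
        batches.append((X_shuf[:, start:end], Y_shuf[:, start:end]))
    return batches
```

Note that X and Y are shuffled with the same permutation so each example stays paired with its label.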

4
Q

What is the usual size of mini-batches?

A

Powers of 2, such as 64, 128, 256, or 512, which tend to fit hardware memory layouts well

5
Q

What does momentum do in gradient descent?

A

It takes an exponentially weighted average of the gradients from previous steps, so the updates oscillate less and move more steadily toward the minimum

6
Q

What are the steps required to use momentum in gradient descent

A

First, initialize the velocity v to zeros, with one velocity array for each parameter gradient (each dW and db). Then, on every iteration, compute the velocities as exponentially weighted averages of the gradients and use them to update the parameters.
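The initialization step might look like this minimal sketch, assuming parameters are stored in a dictionary keyed `W1`, `b1`, `W2`, … (a hypothetical layout, two entries per layer):

```python
import numpy as np

def initialize_velocity(parameters):
    """Create one zero velocity array per parameter gradient (dW and db),
    with shapes matching the corresponding W and b."""
    v = {}
    L = len(parameters) // 2              # number of layers
    for l in range(1, L + 1):
        v[f"dW{l}"] = np.zeros_like(parameters[f"W{l}"])
        v[f"db{l}"] = np.zeros_like(parameters[f"b{l}"])
    return v
```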

7
Q

What are the usual recommended values of the hyperparameters alpha, beta 1, beta 2 and epsilon

A

Alpha (the learning rate) needs to be tuned
Beta 1 (the momentum term) is usually 0.9
Beta 2 (the RMSprop term) is usually 0.999
Epsilon is usually a small number such as 10^-8

8
Q

What are the steps required to update the parameters with momentum

A

After initializing the velocities to zero, on each iteration:
1. Compute the new velocity of each parameter as an exponentially weighted average using beta
2. Update the parameters with that velocity
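The two steps above, sketched for a single parameter array (function name is illustrative):

```python
import numpy as np

def update_with_momentum(w, dw, v, beta=0.9, alpha=0.01):
    """One momentum step. v is the running velocity, initialized
    to zeros beforehand."""
    v = beta * v + (1 - beta) * dw    # step 1: new velocity (EWA of gradients)
    w = w - alpha * v                 # step 2: update parameters with velocity
    return w, v
```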

9
Q

How do you implement Adam optimization

A

First initialize v and s to zeros
Then compute the velocity v (the momentum term) and its bias-corrected version
Then compute s (the RMSprop term) and its bias-corrected version
Finally update the parameters using the corrected v and s, with epsilon added for numerical stability
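These steps can be sketched for one parameter array as follows (a minimal single-step version; t is the 1-based iteration count used for bias correction):

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. v and s start as zeros."""
    v = beta1 * v + (1 - beta1) * dw         # momentum term (velocity)
    s = beta2 * s + (1 - beta2) * dw ** 2    # RMSprop term
    v_corr = v / (1 - beta1 ** t)            # bias-corrected velocity
    s_corr = s / (1 - beta2 ** t)            # bias-corrected s
    w = w - alpha * v_corr / (np.sqrt(s_corr) + eps)  # update with epsilon
    return w, v, s
```

The bias correction matters early on: with v and s starting at zero, the uncorrected averages would be biased toward zero for the first iterations.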

10
Q

You need to make the model run and converge faster. What are the different options to use?

A

Mini batch gradient descent
Momentum in gradient descent
Adam (momentum + RMSprop)

11
Q

What is learning rate decay?

A

It gradually decreases the learning rate, so we take smaller steps as we run more iterations and get closer to convergence
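One common decay formula (an illustrative choice; other schedules exist) divides the initial rate by a factor that grows with the epoch number:

```python
def decayed_learning_rate(alpha0, epoch_num, decay_rate=1.0):
    """Shrink the learning rate as training progresses:
    alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)
```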

12
Q

What is a problem that can occur if we add learning rate decay?

A

The learning rate can effectively go to zero: since it decreases on every epoch, it can quickly become so small that learning stops

13
Q

How can we keep the learning rate from decaying to zero too quickly?

A

Add fixed-interval scheduling: in the same decay formula, divide epochNum by a time interval, so the learning rate only drops once per interval instead of on every epoch
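A sketch of that fix, building on the same decay formula (the interval value of 100 is illustrative):

```python
def scheduled_learning_rate(alpha0, epoch_num, decay_rate=1.0, interval=100):
    """Fixed-interval scheduling: integer-divide epoch_num by the interval,
    so the rate stays constant within each interval and only steps down
    between intervals."""
    return alpha0 / (1 + decay_rate * (epoch_num // interval))
```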
