What is stride?
Stride is how many pixels along we move the filter each time.
A stride of 1 means the filter moves 1 pixel at a time, horizontally and vertically.
What is the formula for the output size of an image?
What is the size of the following:
1. 7x7 image, filter = 3x3, stride = 1
2. 7x7 image, filter = 3x3, stride = 2
3. 7x7 image, filter = 3x3, stride = 3
Output size = ((N - F) / stride) + 1, where N is the image size and F is the filter size
stride 1 => ((7 - 3) / 1) + 1 = 5
stride 2 => ((7 - 3) / 2) + 1 = 3
stride 3 => ((7 - 3) / 3) + 1 = 2.33, which is not recommended because it is not an integer value: the filter does not fit evenly across the image.
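The three cases above can be checked with a small sketch. The function name `conv_output_size` is made up for these notes; it just evaluates ((N - F) / stride) + 1 and reports whether the filter fits evenly.

```python
def conv_output_size(n, f, stride):
    """Return ((N - F) / stride) + 1 and whether the filter fits evenly."""
    span = n - f                 # distance the filter can travel
    fits = span % stride == 0    # True when the result is an integer
    return span / stride + 1, fits


for s in (1, 2, 3):
    size, fits = conv_output_size(7, 3, s)
    label = "fits" if fits else "does not fit"
    print(f"stride {s}: {size:.2f} ({label})")
```

With a 7x7 image and a 3x3 filter, only stride 3 is reported as not fitting.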
What is the motivation of padding?
To obtain an output size that is the same as the input image size (at stride 1). Without padding, each convolution shrinks the image.
What formula do we use to calculate how much padding to add?
How much should we pad for filter size:
- 3
- 5
- 7
Padding = (F - 1) / 2 (for stride 1 and an odd filter size F)
(3-1)/2 = 1 (padding)
(5-1)/2 = 2 (padding)
(7-1)/2 = 3 (padding)
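A quick sketch to confirm that this padding really keeps the size unchanged. It assumes stride 1 and an odd filter size, and the helper name `same_padding` is only illustrative. The check uses the padded output formula ((N - F + 2P) / stride) + 1.

```python
def same_padding(f):
    """Padding per side that keeps output size == input size at stride 1."""
    if f % 2 == 0:
        raise ValueError("this formula assumes an odd filter size")
    return (f - 1) // 2


n = 7  # example image size
for f in (3, 5, 7):
    p = same_padding(f)
    out = (n - f + 2 * p) // 1 + 1  # padded output size at stride 1
    print(f"filter {f}: pad {p}, output {out}x{out}")
```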
How do we deal with images that have a depth greater than 1, e.g. RGB images with a depth of 3?
We must use a filter with a depth that matches the input image depth. E.g. for an RGB image of depth 3, the filter must have depth 3.
- We calculate the dot product between the filter and the image region across all 3 channels: perform the convolution on each channel, then add the 3 values to get 1 output value.
What happens if the image depth and filter depth aren’t the same value?
We can’t calculate their dot product.
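A minimal NumPy sketch of one filter position on an RGB patch. The random values and variable names are assumptions for illustration only. It shows that summing the three per-channel results is the same as one dot product over all 27 values, and that a depth mismatch makes the dot product fail.

```python
import numpy as np

rng = np.random.default_rng(0)
patch = rng.integers(0, 10, size=(3, 3, 3))    # 3x3 image region, depth 3 (R, G, B)
kernel = rng.integers(-1, 2, size=(3, 3, 3))   # 3x3 filter with matching depth 3

# Convolve each channel separately, then add the 3 values to get 1 output.
per_channel = (patch * kernel).sum(axis=(0, 1))  # shape (3,)
merged = per_channel.sum()

# Equivalent to a single dot product over all 27 values.
assert merged == np.dot(patch.ravel(), kernel.ravel())

# A filter of depth 2 cannot be applied to a patch of depth 3.
bad_kernel = np.ones((3, 3, 2))
mismatch_raised = False
try:
    np.dot(patch.ravel(), bad_kernel.ravel())
except ValueError:
    mismatch_raised = True
print("depth mismatch raised an error:", mismatch_raised)
```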
We may want to use multiple filters. How does this affect the output size?
The number of filters decides the depth of the output: each filter produces one activation map, so the output depth equals the number of activation maps.
Convolutional Neural Networks (CNN) size
In one convolution layer, we have 128 filters of 3x3x3 applied to input volume 128x128x3 with stride 2 and pad 1. What is the size of the output volume? Give details of how you calculate the size of the output volume.
((N - F + (2*P)) / stride) + 1
((128 - 3 + (2*1)) / 2) + 1 = 64.5
this is not an integer, so round down/floor to 64
so output is 64x64x128
- number of filters is always the same as the number of output activation maps
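To see where the output shape comes from, here is a deliberately slow, loop-based convolution sketch in NumPy. The function `conv2d_naive` and the array layout (height, width, depth) are assumptions made for these notes, not a library API. A small 8x8x3 input with 4 filters is used so it runs instantly; stride 2 and pad 1 match the question above.

```python
import numpy as np


def conv2d_naive(x, filters, stride, pad):
    """x: (H, W, D); filters: (K, F, F, D). Returns an array of shape (H2, W2, K)."""
    h, w, d = x.shape
    k, f, _, fd = filters.shape
    if fd != d:
        raise ValueError("filter depth must match input depth")
    # Zero-pad height and width only, never the depth axis.
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    h2 = (h - f + 2 * pad) // stride + 1
    w2 = (w - f + 2 * pad) // stride + 1
    out = np.zeros((h2, w2, k))
    for i in range(h2):
        for j in range(w2):
            r, c = i * stride, j * stride
            region = xp[r:r + f, c:c + f, :]
            for n in range(k):
                # One filter gives one value per position: one activation map.
                out[i, j, n] = np.sum(region * filters[n])
    return out


x = np.ones((8, 8, 3))
filters = np.ones((4, 3, 3, 3))   # 4 filters of 3x3x3
out = conv2d_naive(x, filters, stride=2, pad=1)
print(out.shape)  # (4, 4, 4): floor((8 - 3 + 2) / 2) + 1 = 4, depth = 4 filters
```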
What are the formulas for calculating the output image's height, width and depth:
Input: a volume of size W1 x H1 x D1
Four hyperparameters are required: Number of filters K, Filter size F, stride S, amount of zero padding P
When W2 and H2 are integers:
Next layer: a volume of size W2 x H2 x D2
W2 = (W1 - F + 2P) / S + 1
H2 = (H1 - F + 2P) / S + 1
D2 = K
When W2 and H2 are not integers:
Next layer: a volume of size W2 x H2 x D2
W2 = floor((W1 - F + 2P) / S) + 1
H2 = floor((H1 - F + 2P) / S) + 1
D2 = K
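The formulas can be wrapped in one small helper. This is a sketch with an invented name, `conv_output_volume`; using integer division covers both the integer and the non-integer case, because floor leaves an exact result unchanged.

```python
def conv_output_volume(w1, h1, d1, k, f, s, p):
    """Return (W2, H2, D2) for K filters of size F x F x D1, stride S, padding P."""
    w2 = (w1 - f + 2 * p) // s + 1   # floor division handles non-integer cases
    h2 = (h1 - f + 2 * p) // s + 1
    d2 = k                           # one activation map per filter; D1 does not affect D2
    return w2, h2, d2


# The 128 filters of 3x3x3 on a 128x128x3 input, stride 2, pad 1:
print(conv_output_volume(128, 128, 3, k=128, f=3, s=2, p=1))  # (64, 64, 128)
```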
Explain the role of the pooling layer in CNN:
List two methods of down-sampling/pooling layers:
Max pooling: given a subregion, take the max value, then move the window by the stride to the next subregion. E.g. max(2, 9, 4, 5) = 9
Average pooling: given a subregion, calculate the average of the pixels, then move the window by the stride to the next subregion. E.g. avg(2, 9, 4, 5) = 20/4 = 5
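Both methods can be sketched in NumPy on a 4x4 image with a 2x2 window and stride 2. The top-left region holds the values 2, 9, 4, 5 from the examples above; the other values are made up. The reshape trick only works when the window and stride are equal and divide the image size exactly.

```python
import numpy as np

x = np.array([[2, 9, 1, 3],
              [4, 5, 0, 6],
              [7, 1, 8, 2],
              [3, 3, 4, 6]])

# Split into non-overlapping 2x2 blocks: (block_row, block_col, row_in_block, col_in_block)
blocks = x.reshape(2, 2, 2, 2).swapaxes(1, 2)

max_pooled = blocks.max(axis=(2, 3))    # max of each 2x2 region
avg_pooled = blocks.mean(axis=(2, 3))   # average of each 2x2 region

print(max_pooled)   # top-left entry is max(2, 9, 4, 5) = 9
print(avg_pooled)   # top-left entry is 20 / 4 = 5.0
```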
What is a drawback of max and average pooling:
Calculate the output size of the pooling layer:
Give the output size for this question:
image = 128x128x3, Filter/pooling size = 2x2, stride = 2, no padding
image: W x H x D, filter F, stride S
- output depth is the same as the input depth D
- output W = floor((W - F)/S) + 1
- output H = floor((H - F)/S) + 1
D = 3
W = ((128-2)/2)+1 = 64
H = ((128-2)/2)+1 = 64
output = 64x64x3
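The same calculation as a short helper, assuming no padding. The name `pool_output_volume` is invented for these notes. The second call shows a case where the floor actually matters.

```python
def pool_output_volume(w, h, d, f, s):
    """Output (W, H, D) of a pooling layer with window F, stride S, no padding."""
    return (w - f) // s + 1, (h - f) // s + 1, d  # depth is unchanged


print(pool_output_volume(128, 128, 3, f=2, s=2))  # (64, 64, 3)
print(pool_output_volume(7, 7, 3, f=2, s=2))      # floor(2.5) + 1 = 3, so (3, 3, 3)
```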
Explain the role of the fully connected layer in CNN:
Explain the process of the fully connected layer:
What is the output size of a fully connected layer with input image 32x32x3 and 10 categories?
How many parameters would there be:
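step 1) stretch the 32x32x3 image into a 1D vector of 32 * 32 * 3 = 3072 values
step 2) connect each input node to each output node
step 3) y = Wx (ignoring bias), where W is 10 x 3072, so the output is a vector of 10 values (one per category)
there would be 3072 * 10 = 30720 parameters without bias (30730 with bias, one bias per output).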
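The steps can be traced in NumPy. This is a sketch with random placeholder values, not trained weights; the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                 # placeholder input image

x = image.reshape(-1)                           # step 1: stretch into a 1D vector
W = rng.standard_normal((10, x.size)) * 0.01    # step 2: one weight per input-output pair
b = np.zeros(10)                                # optional bias, one per output
y = W @ x + b                                   # step 3: y = Wx (+ b)

n_params_no_bias = W.size
n_params_with_bias = W.size + b.size
print(x.shape, y.shape)                         # (3072,) (10,)
print(n_params_no_bias, n_params_with_bias)     # 30720 30730
```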
What is the role of the activation function in CNN:
What is the role of the BatchNorm layer in CNN?
When is the BatchNorm layer applied:
After the convolution layer but before the activation function
What happens if we don’t perform batchNorm:
The distribution of the values going into each layer is unknown and changes at every iteration, because the weights of the earlier layers keep changing.
Why do we forward batches through model instead of single images:
This creates a more stable network: statistics computed over a batch are less noisy than those from a single image.
Why does batchNorm have two learnable parameters?
Why is the training process more stable using batchNorm?
Because each layer's inputs are normalized with the batch mean and standard deviation, and a moving average of the mean and standard deviation is kept and updated at every iteration.
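A minimal NumPy sketch of a BatchNorm forward pass, assuming the common formulation: normalize with the batch mean and variance, scale and shift with two learnable parameters (called `gamma` and `beta` here), and keep moving averages for use outside training. The class name, the `momentum` value and the random input are assumptions for illustration, and this is not any library's implementation.

```python
import numpy as np


class BatchNormSketch:
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)    # learnable scale
        self.beta = np.zeros(num_features)    # learnable shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def forward(self, x, training=True):
        if training:
            mean = x.mean(axis=0)             # statistics over the batch
            var = x.var(axis=0)
            # Moving averages, updated at every training iteration.
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta


rng = np.random.default_rng(0)
# Stand-in for the output of a convolution layer: batch of 64, 4 features.
pre_activation = rng.normal(loc=5.0, scale=3.0, size=(64, 4))

bn = BatchNormSketch(4)
normalized = bn.forward(pre_activation, training=True)
activated = np.maximum(normalized, 0.0)   # activation (ReLU) comes after BatchNorm

print(normalized.mean(axis=0).round(3), normalized.std(axis=0).round(3))
```

With `gamma` at 1 and `beta` at 0, each feature comes out with roughly zero mean and unit standard deviation, whatever the scale of the input was.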
What are the differences between a standard network and a network that uses batchNorm?