Deep Learning Flashcards

Question

How does attention work, at a high level, for an encoder-decoder where both are RNNs?

Answer 1

The encoder passes *all* of its hidden states, from every time step, to the decoder, rather than just the last one. This is great: the decoder has access to a hidden state for each individual part of the input, so it can sort of understand all parts of the input equally well. Then, at each time step in the decoder (i.e. at each word it's trying to produce), it focuses on the most important parts of the input. It learns parameters during training that figure out which parts of an input are important to focus on based on what part of the output it's trying to produce. It does this basically by learning to, at a given time step, assign each of the encoder's hidden states a weight based on how important it is, and then it calculates a "context vector" which is just the weighted sum of all the encoder's hidden states. I'm not gonna get more in the weeds than that.
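
A minimal sketch of the weighting step for one decoder time step. Scoring with a plain dot product is an assumption made for brevity (real models usually learn the scoring function), and the names `attention_context`, `encoder_states`, and `decoder_state` are hypothetical.

```python
import math

def attention_context(decoder_state, encoder_states):
    # Score each encoder hidden state against the current decoder state.
    # Assumption: a plain dot product stands in for the learned scoring function.
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    # Softmax turns the scores into importance weights that sum to 1.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # Context vector: the weighted sum of all encoder hidden states.
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one hidden state per input step
decoder_state = [2.0, 0.0]
weights, context = attention_context(decoder_state, encoder_states)
```

The encoder states that line up with the decoder state get the larger weights, so they dominate the context vector.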

Answer 2

The decoder figures out where in the image is relevant to the particular part of the description it's currently writing. *Awesome.*

Answer 3

Just training. This makes some intuitive sense: if you had regularization as a part of your objective function rather than through dropout, you would want to penalize the model for complex weights during training, but when examining the validation set you really just wanna see how good your predictions are regardless of how complex the model parameters are. So I suppose the analog is also true for dropout.

Answer 4

Normalizing the input, as usual! Subtract mean, divide by std dev. (There are some tricks to quickly approximate this process that probably aren't important rn.) This way, the network is receiving a standardized distribution of pixel values regardless of the input image, which helps training. Otherwise dim images vs bright images would be hard to treat similarly, for example.
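
A small sketch of the subtract-mean, divide-by-std step, assuming per-image statistics (in practice people often use dataset-wide statistics instead). `normalize_image` is a hypothetical helper name.

```python
import statistics

def normalize_image(pixels):
    # Flatten the grid so we can compute one mean and one std for the image.
    flat = [p for row in pixels for p in row]
    mean = statistics.fmean(flat)
    std = statistics.pstdev(flat)
    # The small constant guards against dividing by zero on a flat image.
    return [[(p - mean) / (std + 1e-8) for p in row] for row in pixels]

dim_image = [[10, 20], [30, 40]]
bright_image = [[110, 120], [130, 140]]  # same picture, 100 units brighter
dim_norm = normalize_image(dim_image)
bright_norm = normalize_image(bright_image)
```

After normalizing, the dim and bright versions of the same picture come out (almost exactly) identical.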

Answer 5

1. Fewer parameters: the same set of parameters is applied again and again, making the network simpler and probably decreasing overfitting. 2. Because it's using the same weights in different places, learning from one place can be applied elsewhere. A bird will look the same in the top right vs bottom left; with an MLP the network would have to re-learn that for every location in the image, but the CNN can learn it once and apply it elsewhere. In that way it's like an RNN: a word means the same thing at the beginning or end of a sentence. 3. Because we're using a square convolution, it uses spatial information more intuitively and much better than an MLP, which would receive the input flattened into one long vector presented one row of pixels at a time.

Answer 6

The convolutional layer is going to have a convolution, or 'filter', which is a 3x3 array of learned weights. To perform a convolution, you apply it to a part of the grid by multiplying the pixel values by the corresponding weight, then summing the results, and then passing the sum through an activation function. You do that for all parts of the image (depending on stride and padding and such, but ignore that for now): you scan across the image continually applying the convolution to form the output of the layer, which is still square.
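
The scan described above can be sketched in plain Python, assuming a square image, stride 1, no padding, and ReLU as the activation (all assumptions chosen for brevity). `conv2d` is a hypothetical helper, not a library call.

```python
def relu(x):
    return max(0.0, x)

def conv2d(image, kernel):
    k = len(kernel)
    out_size = len(image) - k + 1  # no padding, stride 1
    output = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            # Multiply each pixel in the window by its weight, then sum.
            total = sum(image[r + i][c + j] * kernel[i][j]
                        for i in range(k) for j in range(k))
            # Pass the sum through the activation function.
            row.append(relu(total))
        output.append(row)
    return output

image = [[1, 2, 3, 0],
         [4, 5, 6, 0],
         [7, 8, 9, 0],
         [0, 0, 0, 0]]
# A kernel that just copies the center pixel of each 3x3 window.
kernel = [[0, 0, 0],
          [0, 1, 0],
          [0, 0, 0]]
result = conv2d(image, kernel)
```

With this kernel the 4x4 input shrinks to a 2x2 output holding the center pixel of each window.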

Answer 7

I'm pretty sure the most common is padding. It feels common.

Answer 8

Say a greyscale image is 28x28 pixels. It is represented by a 28x28 grid of scalar values between 0 and 255, denoting brightness at a given pixel. RGB images need to keep track of not just one color (and not just one "brightness level"), but three: red, green and blue. So it is represented by *three* 28x28 grids of scalar values between 0 and 255, with one pertaining to the "red brightness", one to blue, and one to green. So the greyscale image is represented as a (28,28) matrix. The RGB is a (28,28,3) matrix: **it has three "channels", and is referred to as having a "depth" of 3**: its width and height are 28, and its depth is 3.

Answer 9

The amount of pixels it moves at a time. If it's one, it scans one pixel at a time. If it's 2, it skips every other pixel. And so on.

Answer 10

3x3: On every side of the image, there will be one row/column that we can't center the kernel on, so each dimension decreases in size by 2. The output is (N-2)x(N-2). 5x5: now each side loses 2 rows/columns, so each dimension decreases by 4 and the output is (N-4)x(N-4).

Answer 11

If the input is NxN, it'll become (N/2)x(N/2), because for every row and column, a pixel is only being formed in the output for every other pixel in the input.
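
The kernel-size and stride effects on output size can be checked with the standard output-size formula. This is a sketch for square inputs; `conv_output_size` is a hypothetical helper name.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    # Standard formula: floor((n + 2*padding - kernel) / stride) + 1
    return (n + 2 * padding - kernel) // stride + 1

size_3x3 = conv_output_size(28, kernel=3)                       # N - 2
size_5x5 = conv_output_size(28, kernel=5)                       # N - 4
size_stride2 = conv_output_size(28, kernel=3, stride=2, padding=1)  # N / 2
```

With no padding, a 3x3 kernel gives N-2 and a 5x5 kernel gives N-4; with a stride of 2 (and one pixel of padding for a 3x3 kernel) the output is N/2.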

Answer 12

An RGB image has 3 channels, so its shape is something like (28,28,3). In a normal 28x28 image, we'd have a filter like a 5x5 array of weights, and we'd apply it at a point by multiplying the weights by the corresponding pixels, summing all the resulting numbers, and passing through an activation. The RGB case is similar, except now the filter is 5x5x3. The height and width can be whatever, but **the depth of the kernel will equal the depth of the image, so we can learn about each of the input channels.** This way we basically have three 5x5 kernels being applied to the image: one to the red values, one to blue, and one to green. Then all 5x5x3=75 results are added together, across all 3 channels, and then passed through an activation function. So conceptually, an edge detector could learn how to detect edges separately in each of the 3 colors, having one detector for each color. For example. Below is a *great* image.

Answer 13

Say we use a 5x5x3 filter, and we pad the image such that with a stride of 1, the outcome will be 28x28x\_. What will the depth be? Well we scan the width and height of the image, and at each point we apply our "three separate 5x5 filters" to the three channels, sum the 75 outputs across all 3 channels into one scalar, and pass through activation. So we’re getting one scalar at each point. That means the output is depth 1: 28x28x1. So how do we get 28x28x4? **We learn 4 *different, separate* 5x5x3 filters**. Each will result in its own 28x28x1 output, yielding a 28x28x4 output.
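
A rough sketch of why the output depth equals the number of filters. For brevity it assumes no padding (so height and width shrink, unlike the padded example above), tiny 2x2 kernels, and ReLU; `conv_multichannel` is a hypothetical helper.

```python
def conv_multichannel(image, filters):
    # image: list of channels, each an HxW grid.
    # filters: list of filters; each filter has one KxK grid per input channel.
    height, width = len(image[0]), len(image[0][0])
    k = len(filters[0][0])
    output = []
    for f in filters:
        channel_out = []
        for r in range(height - k + 1):
            row = []
            for c in range(width - k + 1):
                # Sum over ALL input channels and the whole KxK window: one scalar.
                total = sum(image[ch][r + i][c + j] * f[ch][i][j]
                            for ch in range(len(image))
                            for i in range(k) for j in range(k))
                row.append(max(0.0, total))
            channel_out.append(row)
        # Each filter contributes exactly one output channel.
        output.append(channel_out)
    return output

image = [[[1.0] * 4 for _ in range(4)] for _ in range(3)]    # 3 channels of 4x4
filters = [[[[1.0] * 2 for _ in range(2)] for _ in range(3)]  # each filter is 2x2x3
           for _ in range(4)]                                  # 4 separate filters
out = conv_multichannel(image, filters)
```

Four filters in, four output channels out, each value being the sum of 2x2x3 = 12 products.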

Answer 14

Each filter can learn something different about the input! Maybe one detects edges, one records how bright it is, one checks if the dominant color is red, etc. Or maybe they just all detect *different* types of edges. One filter can only really learn one thing, but using multiple allows us to learn more complex and varied information during each layer.

Answer 15

Something like (5x5x25)! Whether it's an input layer or not, all we need is for the depth in the kernel to match the depth of the input, so we can apply a 2d filter to each of the channels, and learn about all the channels.

Answer 16

It needs 45 different filters of a shape like 5x5x32. Each filter's depth needs to match the depth of the input so it can look at all the channels in the input, and each filter will produce one NxNx1 output, so we need as many filters as we want output channels.
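
A quick sketch of the bookkeeping for this layer, assuming one bias per filter (an assumption; the card only talks about weights). `conv_layer_shapes` is a hypothetical helper.

```python
def conv_layer_shapes(in_channels, out_channels, kernel_size):
    # Each filter's depth matches the depth of the input.
    filter_shape = (kernel_size, kernel_size, in_channels)
    # One filter per output channel.
    weights = out_channels * kernel_size * kernel_size * in_channels
    biases = out_channels  # assumption: one bias per filter
    return filter_shape, weights + biases

filter_shape, num_params = conv_layer_shapes(in_channels=32, out_channels=45, kernel_size=5)
```

Here that gives 45 filters of shape 5x5x32, for 45 * 800 weights plus 45 biases.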

Answer 17

Decrease the height and width dimensions of the tensor, so the size of the activations (and the number of parameters in later layers) doesn't explode as we slowly increase channels.

Answer 18

A 2x2 max pooling layer will decrease the height and width of an input by half, so the output will be 14x14x3. It does this simply enough by, within each channel, looking at each 2x2 grid within the channel, and outputting the max of those 4 values. An average pooling layer does the exact same thing, but instead of outputting the max of the 4 numbers, it outputs their mean.
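
A minimal sketch of 2x2 pooling on a single channel, assuming even height and width and non-overlapping windows. `pool2x2` and `mean` are hypothetical helper names.

```python
def mean(values):
    return sum(values) / len(values)

def pool2x2(channel, reduce_fn=max):
    # Walk over non-overlapping 2x2 windows and reduce each one to a single value.
    return [[reduce_fn([channel[r][c], channel[r][c + 1],
                        channel[r + 1][c], channel[r + 1][c + 1]])
             for c in range(0, len(channel[0]) - 1, 2)]
            for r in range(0, len(channel) - 1, 2)]

channel = [[1, 2, 3, 4],
           [5, 6, 7, 8],
           [9, 10, 11, 12],
           [13, 14, 15, 16]]
max_pooled = pool2x2(channel)            # max pooling
avg_pooled = pool2x2(channel, mean)      # average pooling
```

Both versions halve the height and width; applying the same function to every channel leaves the depth unchanged.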

Answer 19

Often, we want to apply many filters to a given layer, meaning the outputs of our convolutional layers can be very large and require many parameters, which could lead to overfitting. For example, say we’re applying 1000 filters; that will add up fast. If we decrease the height and width by 2 every so often, we can offset this growing of the parameters: as we increase depth, we can decrease height and width.

Answer 20

Generally no. You would think they could, because you just take the filter and continually slide it over the input regardless of size, but the issue is that the size of the activation matrix would continue to be different throughout the network, and eventually you'd typically flatten the matrix into a 1d vector that you pass to a normal dense layer; but dense layers can only take inputs matching their exact length. So you need the same sized images. (I suppose you could get away with it if all you use is conv and pooling layers, and at the end your loss function can be applied to a variable-size output.) (Future Drew here: I suppose this could be one thing global pooling layers are for?)

Answer 21

To start the network, there would be several blocks of one or more convolutional layers followed by a max pooling layer. Each block's convolutional layers will typically use padding so the height and width of the image don't change, and thus they only change when the max pooling layers decrease them by half. We will continually learn more and more 'features', or 'channels', about the input as we go, so when combined with the max pooling layers decreasing width and height, we will go from a representation whose height and width are much larger than its depth, to the other way around: a very deep representation with small width and height. After we achieve this through several conv-pooling blocks, we'll flatten the resulting matrix out into a 1d vector, and pass it through a few simple dense layers before outputting our prediction. The below image doesn't show the dense layers at the end.

Answer 22

We're learning more and more features/pieces of information about the input. And of course as we get deeper in the network, those features/insights become more complex, as they are supposed to with simple MLPs as well.

Answer 23

1. The number of input channels, and the number of output channels. 2. The height and width of one of the kernels (their depth will equal the number of input channels). 3. The stride and padding information.

Answer 24

Just the height/width and the stride

Answer 25

From pytorch docs: “At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels and producing half the output channels, and both subsequently concatenated.” So yeah. Basically it's a way of decreasing the number of connections in a conv layer mapping in\_channels to out\_channels, by having each out channel only consider a subset of the input channels. The higher the groups, the fewer input channels are considered by a given output channel.
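
To make the "fewer connections" point concrete, here is a sketch that counts the weights in a grouped conv layer (biases ignored). `grouped_conv_weights` is a hypothetical helper, not a PyTorch function.

```python
def grouped_conv_weights(in_channels, out_channels, kernel_size, groups=1):
    # Both channel counts must split evenly across the groups.
    assert in_channels % groups == 0 and out_channels % groups == 0
    # Each output channel only sees the input channels in its own group.
    channels_seen_per_output = in_channels // groups
    return out_channels * channels_seen_per_output * kernel_size * kernel_size

full = grouped_conv_weights(32, 64, 3, groups=1)
halved = grouped_conv_weights(32, 64, 3, groups=2)
one_per_group = grouped_conv_weights(32, 64, 3, groups=32)
```

Doubling the groups halves the weight count; at groups equal to the number of input channels, each output channel looks at a single input channel.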

Answer 26

You can apply a bunch of different rotations, zooms, crops, reflections, random noise, color distortion, etc to your input images, resulting in lots more slightly different images. This just makes your CNN more robust to picking up the features of the image when those features are in different sizes, orientations, etc. If you're recognizing cats, this helps you identify big cats, small cats, cats on their sides, upside-down cats, etc.
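
Two of the simplest augmentations, sketched on a tiny grid of pixel values. Real pipelines would typically use a library and randomize these; the helper names here are hypothetical.

```python
def flip_horizontal(image):
    # Mirror each row left-to-right.
    return [row[::-1] for row in image]

def rotate_90_clockwise(image):
    # Reverse the row order, then transpose.
    return [list(col) for col in zip(*image[::-1])]

image = [[1, 2],
         [3, 4]]
flipped = flip_horizontal(image)
rotated = rotate_90_clockwise(image)
```

Each transformed copy can be added to the training set with the same label as the original.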

Answer 27

1. **The size of your dataset:** the smaller it is, the less capable you are of retraining a gigantic network. 2. **The quality of the match between their task and yours**: if your image classification task is super similar to theirs, you need less retraining or structural alteration than otherwise. It's easier to move from dogs to wolves than from dogs to cancer detection.

Answer 28

There are already giant, well-trained networks that can classify wide arrays of images, like those trained on ImageNet. These can often be slightly reworked to transfer all that big-data learning to your small-data task, or the parameter initializations can serve as a great starting point before fine-tuning if you have enough data to do so. (Ah how times change. NLP too now!)

Answer 29

**Little and similar**: you can use most of the layers and can't do much retraining, so maybe just replace a couple of the fully connected layers at the end, leaving all the layers before them fixed and not backpropagating through them. **Lots and similar:** because you have lots of data, you should still fine-tune the network, but the parameters learned from the similar task are a good initialization. You'll of course swap out a layer or two at the end (as with any of these cases) just so your # of output classes is correct. **Little and not similar:** Here, overfitting to our small dataset is still an issue, so we will hold the parameters from the original network as constant. But now because the datasets are different, task-specific features that the original network learns in later layers will not be useful. We can, however, still use the more general, low-level features from earlier layers, like textures and edge detection. So we remove most of the original layers, leaving only the beginning layers that extract more general image features. Then we add a few new layers and only backpropagate through the new ones. **Lots and not similar**: you might fine-tune the parameters from the original, or you might just totally retrain it and just use the original network's hyperparameters, like number and size of layers, as a starting place.

Answer 30

Autoencoders are a compression algorithm, or a learned means of dimension-reducing an input, then scaling it back up to its original size with as little lost information as possible. An autoencoder is basically a neural network made up of two sub-neural-networks: the encoder and the decoder. The encoder takes the input and maps it to a low-dimensional representation, and the decoder takes that low-dimensional representation and maps it back up to the original size of the image. That low-dimensional representation in the middle there is the compressed form, and the goal of the network is to get that as good as possible. The loss function is simply to compare the input to the reconstructed input: if the input was an image, you just find the pixel-level MSE between the two, so the network aims to make the output as similar to the input as it can.

Answer 31

In autoencoders, the decoder needs to take a low-dimensional vector and **upsample it into an image**. Similarly, in a GAN generating images, the generator needs to take a small vector of noise and upsample it into an image. These networks start to look like reverse CNNs: CNNs slowly go from large height/width and few channels to small height/width and many channels. So these upsampling networks need to do the opposite, slowly increasing the height and width. This is what reverse (transposed) convolutional layers do: they **basically apply a filter which is larger than the area it's being applied to, with a high stride like 2, so the output height and width are larger than the input's.**
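
The growth in height and width can be checked with the usual output-size formula for a transposed convolution, here assuming no dilation and no extra output padding. `transposed_conv_output_size` is a hypothetical helper.

```python
def transposed_conv_output_size(n, kernel, stride=1, padding=0):
    # Output-size formula for a transposed convolution
    # (assumption: dilation of 1 and no output padding).
    return (n - 1) * stride - 2 * padding + kernel

doubled_a = transposed_conv_output_size(14, kernel=2, stride=2)
doubled_b = transposed_conv_output_size(14, kernel=4, stride=2, padding=1)
```

Both settings take a 14x14 input up to 28x28, the reverse of what a stride-2 convolution or 2x2 pooling layer does.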

Answer 32

A GAP layer, if included, would be included after all the blocks of conv/pooling layers, before the fully connected layers. It is basically an extreme version of a pooling layer: it maps every channel to one scalar, which is the mean of all the values in the channel. Often, the inputs to the dense layers are so large (so many channels of substantive height and width), that there are just too many parameters in the final dense layers, which can cause issues with overfitting. This is a way to combat that overfitting: it drastically decreases the size of the input to the dense layers, thus decreasing the number of parameters they need to have. Because of the nature of conv layers vs dense layers, oftentimes most of a network's parameters can be in those final few dense layers, so this can be a very effective way to decrease the # of params and combat overfitting.
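
A tiny sketch of global average pooling on a channels-first list of grids; `global_average_pool` is a hypothetical helper name.

```python
def global_average_pool(feature_maps):
    # Map every channel to one scalar: the mean of all values in that channel.
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in feature_maps]

feature_maps = [[[1, 2], [3, 4]],        # channel 0
                [[10, 10], [10, 10]]]    # channel 1
pooled = global_average_pool(feature_maps)
```

A CxHxW tensor becomes a vector of length C, so the first dense layer needs C input weights per unit instead of C*H*W.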

Answer 33

In momentum, rather than taking a step in the direction of the current gradient, you take a step in the direction of **an exponentially decaying weighted sum of all past gradients**. The hope is that it helps you "power through" local minima to reach global minima. So for example, if you got to the bottom of this local minimum, the current gradient would be zero, but the previous ones are still pointing right and would carry you through. Another benefit is that momentum helps decrease jagged training. If the objective function is pointing in a consistent direction in one dimension (long side of ovals below), but prone to jumping around on another one (short side of ovals below), momentum smoothes this out.

Answer 34

Because it takes an average of all previous gradients, with a focus on recent gradients, it can smooth out learning if it happens to be jagged. If the gradients keep jumping back and forth in one of the dimensions, those will average out to about zero, and the descent will stop taking big steps in those dimensions and focus on the dimensions with a consistent direction. This is illustrated in the following picture. These are contour lines, and each line is at a constant level in the z direction with respect to itself. So this means the slopes are way more steep in the y direction than x, because height is changing over a much smaller horizontal distance. This may be easier to see if you envision it as maximizing over a hill rather than minimizing over a valley. So in this context, you can see (in blue) how the baseline slopes will be much larger in the y direction than x, causing most of each step to be a jagged movement in the y direction rather than a productive movement in the more subtle x direction. Momentum (in red) smoothes this out. (The red drawing over-exaggerates the size of the steps in that direction, but it’s just meant to illustrate how the jagged y-direction movement is decreased.)

Answer 35

We want to speed up training, and to increase learning rate to do so (maybe even beyond what we can theoretically guarantee will converge). But we can't just wantonly increase learning rate, or learning becomes jagged and bad. So we essentially **adaptively use a different learning rate based on the "terrain".** In jagged areas the **learning rate ends up being low, in effect** (because when we sum past gradients, they're not similar and cancel each other out, decreasing the effective learning rate). In smooth areas, the summed gradients instead synergize, **effectively increasing the learning rate.** So it's basically a form of adaptive learning rate tuning!
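
A small numeric sketch of that effect, using the accumulated "velocity" form of momentum (velocity = beta * velocity + gradient). The value beta = 0.9 and the helper name `final_velocity` are assumptions for illustration.

```python
def final_velocity(grads, beta=0.9):
    # Exponentially decaying sum of past gradients.
    velocity = 0.0
    for g in grads:
        velocity = beta * velocity + g
    return velocity

smooth = final_velocity([1.0] * 200)          # gradients always agree
jagged = final_velocity([1.0, -1.0] * 100)    # gradients keep flipping sign
```

With consistent gradients the velocity builds up to about 1 / (1 - beta) = 10 times a single gradient; with sign-flipping gradients it stays smaller than a single gradient.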

Answer 36

Maybe your objective function is way steeper in one dimension than in others. In this case (as well as in similar cases where some dimensions are out of whack), it'd be great to scale your gradient on an entry-by-entry basis, where you take the big gradients and chill them out a bit, or take the tiny gradients and amp them up a bit. This is what adagrad does: it stores historical gradient info, looks at which entries are usually really big or small, and uses that to scale each entry of the update accordingly. **You'll notice that this is very similar to what momentum does!** They're two different ways of accomplishing a similar conceptual goal, it seems.

Answer 37

The goal, similar to momentum, is to combat jagged and inefficient learning by smoothing out our steps in the direction of the gradient. The general idea of RMSprop is it keeps track of which dimensions keep having large steps and which keep having small ones, and uses that information to smooth out training by decreasing the relative size of the large ones and increasing the relative size of the small ones. It does this by dividing the size of the step in each direction by a weighted average of recent derivatives in that direction.

Answer 38

Again, the goal, similar to momentum, is to combat jagged and inefficient learning by smoothing out our steps in the direction of the gradient. In momentum, you keep an exponentially decaying weighted average of the gradients, and each iteration you take a step in the direction of that weighted average. In RMSprop, you instead keep an exponentially weighted average of the squares of each of the partial derivatives, and then to construct the “gradient” which is the direction you want to move, for each dimension you take the current partial derivative, and divide it by the square root of the exponentially decaying average of the squared derivatives. That’s pretty complicated, but here’s the intuition: because we’re squaring the derivatives in the sum, they’re always positive; so **the bigger the past derivatives, the bigger the value by which we’re *dividing* the size of our current step.** Steep dimensions where learning is jagged will have large gradients, so we’ll be dividing by a large value and decreasing the size of the step; conversely, not-steep dimensions with slow learning will now have relatively larger steps, so we make proportionally more progress in that direction. This can be used to increase the learning rate, and overall learn faster. The key parts of this picture are the **top**, showing the image where each fault line has a consistent height and thus the y axis is much steeper, and the **bottom**, showing how the update is the derivative, divided by sqrt(weighted sum of squares of derivatives).
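
A minimal sketch of one RMSprop update on a list of parameters. The default values for `lr`, `decay`, and `eps` are assumptions, and `rmsprop_step` is a hypothetical helper.

```python
import math

def rmsprop_step(params, grads, sq_avg, lr=0.01, decay=0.9, eps=1e-8):
    new_params, new_sq_avg = [], []
    for p, g, s in zip(params, grads, sq_avg):
        # Exponentially weighted average of the squared partial derivative.
        s = decay * s + (1 - decay) * g * g
        # Divide the current derivative by the square root of that average.
        new_params.append(p - lr * g / (math.sqrt(s) + eps))
        new_sq_avg.append(s)
    return new_params, new_sq_avg

# One very steep dimension and one very flat dimension.
params, sq_avg = rmsprop_step([0.0, 0.0], [100.0, 0.01], [0.0, 0.0])
```

Even though the two gradients differ by a factor of 10,000, the two parameters move by almost the same amount after the division.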

Answer 39

They're basically the same thing; RMSProp is just an optimized version. They can be discussed and conceptualized very similarly. Source: https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c

Answer 40

It takes momentum and RMSprop and puts them together. (These are both effective means of smoothing out training and making it more consistently move in the right direction in a non-jagged fashion, so this is a good idea!)

Answer 41

Adam combines momentum and RMSprop. So, ignoring an optimization or two that isn't that important for conceptual understanding, basically what you do is: 1. Keep track of an exponentially weighted sum of the past gradients (for momentum), as well as an exponentially weighted sum of the *squares* of the past gradients (for RMSprop). 2. Then your update at a given time step will be the current exponentially weighted sum of the gradients (as with momentum) *divided by* the square root of the weighted sum of the squares of the gradients (as with RMSprop).
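
A sketch of a single Adam update for one scalar parameter. Unlike the simplified description above, it includes the bias-correction step (the "optimization" being glossed over); the default hyperparameter values are the commonly used ones, and `adam_step` is a hypothetical helper.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum part: exponentially weighted average of gradients.
    m = beta1 * m + (1 - beta1) * grad
    # RMSprop part: exponentially weighted average of squared gradients.
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction, which matters during the first few steps (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Momentum-style numerator divided by RMSprop-style denominator.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

param, m, v = adam_step(param=1.0, grad=5.0, m=0.0, v=0.0, t=1)
```

On the first step the update size is roughly the learning rate itself, regardless of how large the gradient is.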

Answer 42

Learning rate, the size of the step. Obviously important and needs to be tuned. Then there is Beta1, which determines the rate at which the exponentially weighted sum of the gradients drops off (i.e. how quickly old values disappear towards zero), and Beta2, which is the same thing for the weighted sum of the squares of gradients used in RMSprop. Beta1 and Beta2 are more commonly not messed with and just left to the default values.

Answer 43

A GAN, or generative adversarial network, is a model essentially composed of two separate neural networks: a generator and a discriminator. The generator receives as input a vector of random noise and transforms it into a generated fake member of a dataset; for example, maybe it turns it into a picture of a face. The discriminator receives either a real member of a dataset, or a fake one made by the generator, and it predicts the probability that the input is real. To train this joint network, the generator tries to maximize the discriminator's predictions on its fake inputs, and the discriminator tries to minimize those predictions and maximize predictions on the real inputs. Hence adversarial.
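
The two opposing objectives can be sketched as loss functions on the discriminator's predicted probabilities. Using log-loss, and the "non-saturating" form for the generator, is an assumption (one common choice); the function names are hypothetical.

```python
import math

def discriminator_loss(pred_real, pred_fake):
    # Low when the discriminator says "real" for real inputs and "fake" for fakes.
    return -(math.log(pred_real) + math.log(1.0 - pred_fake))

def generator_loss(pred_fake):
    # Low when the discriminator is fooled into predicting "real" for a fake.
    return -math.log(pred_fake)

fooled = generator_loss(0.9)       # discriminator thinks the fake is real
caught = generator_loss(0.1)       # discriminator spots the fake
sharp = discriminator_loss(0.9, 0.1)
unsure = discriminator_loss(0.5, 0.5)
```

The same prediction on a fake input pulls the two losses in opposite directions, which is the adversarial part.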

Answer 44

1. Creating new data, or new fake members of a dataset, such as images or videos. 2. Transferring aspects of a dataset onto another: making a video of a horse look like a zebra, making a photo in the style of a certain artist, turning a rough sketch of an object into a much more detailed sketch, deep fakes, etc.

Answer 45

Applying batch normalization to a layer simply means normalizing the layer’s outputs to have mean 0 and std 1, by subtracting the batch mean and dividing by the batch standard deviation. So we’re normalizing with respect to the current batch, not the whole dataset. So similar to how we normalize the inputs to a model, we can also normalize the inputs to layer n+1 by applying batch normalization to layer n. It’s helpful to normalize some layers within the NN, not just normalize the inputs, because in the same way that the consistency of inputs to a network helps it learn, consistent inputs to any given layer help it learn more easily, quickly, and consistently. Any layer can be thought of as “the input layer in the remaining sub-network”, and having consistent inputs to that sub-network will be helpful!

Answer 46

Basically how it's explained: subtract the batch's mean and divide by its standard deviation (the square root of its variance). The only deviation from this is that we actually add a small epsilon to the variance in practice, before taking the square root. This is partially to avoid a variance of zero, and partially because we’re really trying to estimate overall population variance, which is typically a little higher than a batch’s variance.
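
A minimal sketch of the normalization step for one neuron's outputs across a batch. It leaves out the learnable scale and shift that batch norm layers usually apply afterwards, and the `eps` value is an assumption; `batch_norm` is a hypothetical helper.

```python
import math

def batch_norm(values, eps=1e-5):
    # Statistics come from the current batch only, not the whole dataset.
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    # Epsilon is added to the variance before the square root.
    return [(v - mean) / math.sqrt(var + eps) for v in values]

normalized = batch_norm([2.0, 4.0, 6.0, 8.0])
```

The result has mean 0 and a standard deviation just under 1 (slightly under because of the epsilon).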

Answer 47

The primary purpose and benefit of batch norm is faster training (which could potentially allow for more complex models, or larger learning rates, etc). A possible secondary benefit is thus that maybe we can get better accuracies. There are lots of other potential small benefits (very light regularization, potentially allowing for a wider range of activations, weight initializations are less important, can help with vanishing gradients) that probably aren't as key.

Answer 48

For both, I think the general answer is that there are different ways of doing it. The general idea of normalization is you want all of the outputs from a particular layer to follow the same (unit normal) distribution, so that the same learning rate can be effectively applied to all outputs. Out of this comes the fact that the most theoretically correct way to do normalization is to find the mean and stdev for every neuron, and normalize them all the same, so they're all output with the same distribution. (See DeepLearningAI's normalization youtube video.) That said, this is not always what people do in practice. Empirically, dumber/less theoretically sensible things can work fine. For example, at Aurora, we normalized our images using just a single mean and stdev calculated across the whole tensor and the whole dataset, because that was simple and sufficient. Some people even go simpler and just divide an mnist image by 255, for example. There are also intermediate solutions, where you pick a particular dimension and calculate a mean and stdev for each of those dimensions, then use the same summary stats for every neuron that's part of the same entry in that particular dimension. So for example, take the BatchNorm2d layer in pytorch, which normalizes a batch of CxHxW image representations (so input is NxCxHxW). Based on this layer's description, I'm pretty sure for a given batch, it computes summary statistics for each channel, then uses the same stats for each entry in a given channel.

Answer 49

An RNN has a **common "structure" or "hidden layer"** that is applied sequentially to each time step in the input structure; for example, words in a sentence. At each application, there are basically 2 things that determine the activations 's' of the hidden layer: the input, and the weight matrix Wx that connects the input to the hidden layer; and **the activations of the previous application of the hidden layer**, and the weight matrix Ws that connects the previous activations to the hidden layer. The two matrix-vector products are calculated, summed, and passed through the activation function. Then, a third matrix is learned which connects the activations of the hidden layer to the output layer. (Then maybe there's an activation like sigmoid.) Depending on the architecture, this might be evaluated at every step, or at just the last step, etc. Here are 3 different ways of showing the same basic network; the first shows the key formulas.

Answer 50

In an RNN, both the input vector x and the previous activation vector s are multiplied by their own respective weight matrices to get new vectors of the same length, which are summed to get the final value. This is where the formula activation([Wx]\*x + [Ws]s\_t-1) comes from. But this is not actually that weird, because it can just be thought of as x and s\_t-1 being lined up into one vector and multiplied by one weight matrix!
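
A small check of that equivalence in plain Python, assuming tanh as the activation and tiny made-up weight values. All names here are hypothetical.

```python
import math

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def rnn_step_two_matrices(Wx, Ws, x, s_prev):
    # activation(Wx x + Ws s_prev), computed as two separate products.
    a, b = matvec(Wx, x), matvec(Ws, s_prev)
    return [math.tanh(p + q) for p, q in zip(a, b)]

def rnn_step_one_matrix(Wx, Ws, x, s_prev):
    # Line up Wx and Ws side by side, and x and s_prev end to end.
    W = [row_x + row_s for row_x, row_s in zip(Wx, Ws)]
    return [math.tanh(v) for v in matvec(W, x + s_prev)]

Wx = [[0.5, -1.0], [0.25, 0.75]]   # input -> hidden
Ws = [[0.1, 0.2], [-0.3, 0.4]]     # previous hidden -> hidden
x, s_prev = [1.0, 2.0], [0.5, -0.5]
two = rnn_step_two_matrices(Wx, Ws, x, s_prev)
one = rnn_step_one_matrix(Wx, Ws, x, s_prev)
```

Both versions give the same new hidden state, up to floating-point rounding.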

Answer 51

Vanishing gradients lead to "bad long-term memory". Basically, when we're trying to update Wx or Ws based on the output at a very late time step, we find the derivative of the contribution of that matrix W from all previous time steps. But the derivative for early time steps is a *lot* of derivatives chained together, leading to vanishing gradients. So when we're updating based on a late outcome, the contribution of an earlier part of the input vanishes. This can be bad: words at the beginning of a sentence can have a big impact on the meaning of the end of the sentence, for example. It's intuitive that vanishing gradients are especially bad for RNNs: they're bad when a bunch of layers are applied in a row, because all of the potentially small gradients are getting multiplied together. Well, an RNN has a layer that can easily be applied 50 or 100+ times if the input x has 100+ time steps. The solution is LSTMs.
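
A toy illustration of why chaining many derivatives shrinks the signal, assuming every time step contributes the same factor of 0.9 (a made-up value). `chained_gradient` is a hypothetical helper.

```python
def chained_gradient(per_step_derivative, num_steps):
    # The chain rule multiplies one factor per time step.
    gradient = 1.0
    for _ in range(num_steps):
        gradient *= per_step_derivative
    return gradient

recent = chained_gradient(0.9, 5)      # input from 5 steps ago
distant = chained_gradient(0.9, 100)   # input from 100 steps ago
```

A factor only slightly below 1 still leaves most of the gradient after 5 steps, but almost nothing after 100.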

Answer 52

Long term memory! RNNs have a mechanism for using short term memory, but due to the vanishing gradients problem, they can't really effectively retain info from very long ago. LSTMs add a path to pass along and retain long term memory in addition to the short term memory pathway. I'm not gonna get into memorizing gates and architecture and stuff; as Dr. Kolter said, a lot of that is hand waving. There are 4 "gates" for bringing in and interpreting old and new information, and reforming it into new long and short term memories.

Answer 53

You put the word embedding layer(s) between the input and the "main RNN cell", so from the perspective of the cell, the embedding *is* the input. If you're learning a custom embedding scheme while training the RNN, you can have backprop-through-time go through the embedding layer. (In this picture sigmoid just refers to an FC layer with a sigmoid activation)

Answer 54

It's pretty simple: there are basically "2 RNNs", one that iterates over the input from front to back, and another that goes from back to front. So 2 hidden states are learned for each input: one that uses that input and all the earlier context, and another that uses that input and all the later context. Then both of those hidden states are concatenated and passed through a single dense layer to get the prediction, which in this case is the probability that a word is a name.

(79 cards)