Def: First-order Markov models vs higher-order (e.g. second-order) Markov models
A first-order Markov chain of observations {x_t}, in which the distribution of a particular observation x_t is conditioned on the value of the previous observation x_{t-1}
A second-order Markov chain, in which the conditional distribution of a particular observation x_t depends on the values of the two previous observations x_{t-1} and x_{t-2}
What is the process of learning Markov models?
You just count the transitions observed in the training data and normalize the counts into transition probabilities
Markov Models: If you apply a 2nd-order Markov model to predict the next English word, there are many three-word sequences that never appear in the training data but do appear at test time
During training you need to reserve some probability mass for unseen cases, which is called ______.
smoothing
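A minimal sketch of the count-and-normalize idea with add-one (Laplace) smoothing; the toy corpus, state names, and the `transition_probs` helper are made up for illustration:

```python
from collections import Counter

def transition_probs(sequence, states, alpha=1.0):
    """Estimate first-order transition probabilities by counting
    consecutive pairs, with add-alpha (Laplace) smoothing so that
    transitions never seen in training still get nonzero probability."""
    counts = Counter(zip(sequence, sequence[1:]))
    probs = {}
    for s in states:
        total = sum(counts[(s, s2)] for s2 in states) + alpha * len(states)
        for s2 in states:
            probs[(s, s2)] = (counts[(s, s2)] + alpha) / total
    return probs

# Toy training sequence; 'sunny' -> 'snow' is never observed,
# but smoothing still assigns it a small probability.
seq = ["sunny", "sunny", "rain", "sunny", "rain", "rain", "snow"]
p = transition_probs(seq, ["sunny", "rain", "snow"])
```

With alpha=0 this reduces to the plain maximum-likelihood counts, and unseen transitions get probability zero.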
What are the following Markov model parameters?
Start/ Initial Probability
State transition probability
Emission probability
Start/ Initial (self-explanatory: the probability of starting in each hidden state)
State transition - the probability of switching from one hidden state to another
Emission - the probability of observing x in a given hidden state (e.g. for the dice room, the probability of rolling a 1 with the fair vs. the loaded die)
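The three parameter groups can be written down concretely; the numbers below are hypothetical values for the fair/loaded dice example, not from the source:

```python
import numpy as np

states = ["fair", "loaded"]

pi = np.array([0.5, 0.5])          # start/initial probability for each hidden state
A = np.array([[0.95, 0.05],        # transition: P(next state | current state)
              [0.10, 0.90]])
E = np.array([[1/6] * 6,           # emission: fair die is uniform over faces 1..6
              [0.1] * 5 + [0.5]])  # loaded die favors rolling a 6

# Each of pi, and each row of A and E, is a probability distribution.
```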
What is the key assumption of Hidden Markov Models?
The key independence property is that the hidden variables z_{t-1} and z_{t+1} are conditionally independent given z_t
In hidden Markov Models, what happens if hidden states are not given?
Unlike in (non-hidden) Markov models, observation x_t depends on all previous observations x_i (i < t) once the hidden states are marginalized out
What are the three main problems of HMMs (unsupervised)?
Decoding: find the most likely hidden state path z* = argmax_z p(x, z | θ)
Evaluation: compute the likelihood p(x | θ)
Learning: find the parameters θ that maximize p(x | θ)
HMM: What is the naive method of finding z* and what is the problem with this method?
The naïve method of finding z* is to compute p(x, z | θ) for every value that z can take
Problem: the number of possible paths z is exponential in the sequence length (K^n paths for K states)
What is the main idea of Decoding called, and what kind of programming is this?
“Bugs on a Grid”
Dynamic programming
What are the “Bugs on a Grid” steps for decoding?
What is the time complexity and space (big O) of the Viterbi Algorithm?
Time complexity: O((K^2)n) Space O(Kn)
Underflows are a significant problem of the Viterbi algorithm. What is the solution?
p(x1, …, xn, z1, …, zn) is a product of many probabilities, so these numbers become extremely small - underflow
Solution: Take the logs of all values
V_{k'}(t+1) = log e_{k', x_{t+1}} + max_k ( V_k(t) + log a_{k,k'} )
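A log-space Viterbi sketch of this recurrence; the `viterbi_log` helper and the fair/loaded dice parameters below are illustrative assumptions, not from the source:

```python
import numpy as np

def viterbi_log(x, pi, A, E):
    """Log-space Viterbi: V_k'(t+1) = log e_{k',x_{t+1}} + max_k (V_k(t) + log a_{k,k'}).
    Taking logs turns products into sums and avoids underflow.
    Time O(K^2 n), space O(K n)."""
    K, n = A.shape[0], len(x)
    logA, logE = np.log(A), np.log(E)
    V = np.zeros((K, n))
    back = np.zeros((K, n), dtype=int)
    V[:, 0] = np.log(pi) + logE[:, x[0]]
    for t in range(1, n):
        for k2 in range(K):
            scores = V[:, t - 1] + logA[:, k2]
            back[k2, t] = int(np.argmax(scores))
            V[k2, t] = logE[k2, x[t]] + scores[back[k2, t]]
    z = [int(np.argmax(V[:, n - 1]))]          # best final state
    for t in range(n - 1, 0, -1):              # trace the best path backwards
        z.append(int(back[z[-1], t]))
    return z[::-1]

# Hypothetical fair/loaded dice parameters; faces are 0-indexed (face 5 = "six")
pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.10, 0.90]])
E = np.array([[1/6] * 6, [0.1] * 5 + [0.5]])
path = viterbi_log([5] * 10, pi, A, E)         # ten sixes in a row
```

On ten consecutive sixes the best path stays in the loaded state (index 1) throughout, while a mixed run of faces is best explained by the fair die.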
What are the steps that are different when doing the “Bugs on a Grid” algorithm for Evaluation? What is another name for this algorithm?
Steps 3 and 5 change: the max over previous states is replaced by a sum, so we sum probabilities over all paths instead of keeping only the best one
AKA “forward algorithm”
What is the running time, and space required, for Forward and Backward algorithms?
Time complexity: O((K^2)n) Space O(Kn)
What does the posterior decoding calculation compute?
The most likely state at time t given sequence x:
argmax_k p(z_t = k | x)
(the forward and backward algorithms allow us to calculate this)
p(z_t = k | x) = f_k(t) b_k(t) / p(x)
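A sketch of posterior decoding via the forward and backward passes, kept in plain probability space for readability; the `posterior` helper and the dice parameters are illustrative assumptions:

```python
import numpy as np

def posterior(x, pi, A, E):
    """Posterior decoding: p(z_t = k | x) = f_k(t) b_k(t) / p(x).
    Plain probability space for clarity; a real implementation would
    rescale each column (or work in logs) to avoid underflow."""
    K, n = A.shape[0], len(x)
    f = np.zeros((K, n))
    b = np.zeros((K, n))
    f[:, 0] = pi * E[:, x[0]]
    for t in range(1, n):                      # forward pass
        f[:, t] = E[:, x[t]] * (A.T @ f[:, t - 1])
    b[:, n - 1] = 1.0
    for t in range(n - 2, -1, -1):             # backward pass
        b[:, t] = A @ (E[:, x[t + 1]] * b[:, t + 1])
    px = f[:, n - 1].sum()                     # p(x) = sum_k f_k(n)
    return f * b / px                          # column t is the posterior at time t

# Hypothetical fair/loaded dice parameters, faces 0-indexed
pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.10, 0.90]])
E = np.array([[1/6] * 6, [0.1] * 5 + [0.5]])
post = posterior([5, 5, 5, 0, 1], pi, A, E)    # each column sums to 1
```

Taking argmax over each column of `post` gives the posterior-decoded state at every time step.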
Problem 3 of HMM Learning requires finding the parameters θ that maximize p(x|θ). How can we estimate these parameters when the hidden state path is unknown?
Use the EM algorithm
How do we do the EM algorithm efficiently?
Use the “Bugs on a Grid” dynamic programming trick
- Initialize HMM parameters θ =(pi,A,E)
- given a training sequence x = {x1, x2, …, xn}
E STEP: Compute useful distribution about hidden states that will be used to estimate model parameters later in the M step
1. Compute p(z_t = k | x) for all k = 1, …, K (calculated using the forward-backward algorithm)
2. Compute p(z_t = k, z_{t+1} = k' | x) = f_k(t) a_{k,k'} e_{k', x_{t+1}} b_{k'}(t+1) / p(x)
M STEP: Compute max likelihood estimates for model parameters, given hidden state distribution is known
1. Estimate initial probability
2. Estimate Transition probability
3. Estimate Emission Probability
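A sketch of the M-step, assuming the E-step already produced the two posteriors above as arrays `gamma[k, t] = p(z_t = k | x)` and `xi[k, k', t] = p(z_t = k, z_{t+1} = k' | x)`; the `m_step` helper and the array shapes are illustrative assumptions:

```python
import numpy as np

def m_step(gamma, xi, x, n_symbols):
    """M-step of EM for an HMM: each parameter is an expected count
    under the hidden-state posterior, normalized into a distribution.
    gamma has shape (K, n); xi has shape (K, K, n-1)."""
    pi = gamma[:, 0]                             # 1. initial probability
    A = xi.sum(axis=2)                           # 2. expected transition counts
    A /= A.sum(axis=1, keepdims=True)            #    normalize each row
    K = gamma.shape[0]
    E = np.zeros((K, n_symbols))
    for t, sym in enumerate(x):                  # 3. expected emission counts
        E[:, sym] += gamma[:, t]
    E /= E.sum(axis=1, keepdims=True)            #    normalize each row
    return pi, A, E
```

Alternating this M-step with the forward-backward E-step and iterating to convergence is the EM procedure for HMMs.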
Define: Deep Learning
A set of machine learning algorithms that model high-level abstractions in data using artificial neural networks
Why deep learning now?
Computers finally have enough computational power
Neural net algorithms are inspired by ___
the brain
What does a Linear Neuron calculate?
A linear neuron just computes a weighted sum of input features
What is a multi-layer network of linear neurons?
Still a linear model
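A quick numerical check of why stacking linear neurons stays linear: two stacked weight matrices collapse into a single linear map, so depth adds no expressive power without a nonlinearity in between (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # first linear layer: 3 inputs -> 4 units
W2 = rng.standard_normal((2, 4))   # second linear layer: 4 units -> 2 outputs
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x)         # pass x through both layers
one_layer = (W2 @ W1) @ x          # equivalent single linear layer
assert np.allclose(two_layers, one_layer)
```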
What does a Binary Threshold Neuron do?
- first computes a weighted sum of its inputs
- then sends out a fixed-size spike of activity if the weighted sum exceeds a threshold
What does a Rectified Linear Neuron calculate?
They compute a weighted sum of their inputs, then pass it through a rectified linear unit: output = max(0, z). (Sometimes called linear threshold neurons.)
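Minimal sketches of the two neuron types from the cards above (the helper names are hypothetical):

```python
import numpy as np

def relu_neuron(w, x, b=0.0):
    """Rectified linear neuron: weighted sum z, then output max(0, z)."""
    z = np.dot(w, x) + b
    return max(0.0, z)

def binary_threshold_neuron(w, x, threshold=0.0):
    """Binary threshold neuron: a fixed-size spike (1) only if the
    weighted sum exceeds the threshold, otherwise nothing (0)."""
    return 1 if np.dot(w, x) > threshold else 0
```

The only difference is the output nonlinearity: the threshold unit emits an all-or-nothing spike, while the rectified unit passes positive sums through unchanged.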