Neural nets and deep learning Flashcards

(12 cards)

1
Q

What is MNIST an abbreviation of?

A

Modified National Institute of Standards and Technology database

2
Q

What does “None” indicate in “Output shape: (None, 784)” after model definition in Keras?

A

Shapes are written as (batch_size, …features…). When the model is defined, Keras doesn’t lock in a batch size.

It is Keras’s way of saying that the batch size will be supplied later, at runtime.

None isn’t a missing value; it’s a placeholder for a variable batch size, carried unchanged through layers that operate per example.
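A quick way to see this, sketched here in NumPy rather than Keras so it stays self-contained: the same fixed weights work unchanged for any leading batch dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(784, 10))  # fixed weights, independent of batch size

def forward(images):
    # images: (batch_size, 28, 28) -> flatten to (batch_size, 784) -> project
    flat = images.reshape(images.shape[0], -1)
    return flat @ W

for batch_size in (1, 32, 256):
    out = forward(rng.normal(size=(batch_size, 28, 28)))
    print(out.shape)  # only the first axis varies with the batch
```

The "None" in the Keras summary plays exactly the role of the unconstrained first axis here.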

3
Q

What is Glorot (Xavier) initialization in neural networks?

A

A weight initialization method proposed by Xavier Glorot and Yoshua Bengio. It aims to keep the variance of activations and gradients roughly constant across layers during forward and backward passes.

W ∼ U(−a, a), a = √(6 / (fan_in + fan_out))

fan_in = n_{l−1}, the number of neurons in the previous layer
fan_out = n_l, the number of neurons in the current layer
(l is the layer index)

This initialization is widely used for networks with activation functions like tanh or sigmoid.
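The formula can be checked numerically. A minimal NumPy sketch (the fan sizes are chosen arbitrarily): the variance of U(−a, a) is a²/3, which for the Glorot limit works out to 2 / (fan_in + fan_out).

```python
import numpy as np

fan_in, fan_out = 300, 100
a = np.sqrt(6.0 / (fan_in + fan_out))           # uniform limit
rng = np.random.default_rng(42)
W = rng.uniform(-a, a, size=(fan_in, fan_out))  # W ~ U(-a, a)

# sample variance should be close to the theoretical 2 / (fan_in + fan_out)
print(float(W.var()), 2.0 / (fan_in + fan_out))
```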

4
Q

What does L stand for in L1/L2 regularization?

L1 and L2 are names for standard norms from analysis (L1 and L2 spaces), not “Loss” or “Lambda”.

A

The “L” refers to the Lp norms from Lebesgue spaces in functional analysis.

(named after Henri Lebesgue)

L1 regularization uses the L1 norm (sum of absolute values), and L2 regularization uses the L2 norm (square root of sum of squared values).
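The two norms, written out in NumPy for a toy weight vector (the weights and λ here are made up for illustration):

```python
import numpy as np

w = np.array([3.0, -4.0, 0.0])

l1_norm = np.sum(np.abs(w))        # |3| + |-4| + |0| = 7
l2_norm = np.sqrt(np.sum(w ** 2))  # sqrt(9 + 16) = 5

# In practice the penalty added to the loss is usually lambda * sum(|w|)
# for L1 and lambda * sum(w^2) (the squared norm) for L2.
lam = 0.01
l1_penalty = lam * np.sum(np.abs(w))
l2_penalty = lam * np.sum(w ** 2)
```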

5
Q

What happens in the regularization technique called dropout?

A

randomly selected neurons are ignored during the training phase

These might be activated again at other steps of the training procedure.

Since each neuron can be either present or absent, the number of possible networks grows exponentially. These networks are not independent, because they share a large number of weights, but they nevertheless differ from one another. The resulting neural network can therefore be treated as an averaged ensemble of all these different networks.
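A minimal NumPy sketch of the mechanism, assuming the common "inverted dropout" convention: each activation is zeroed with probability `rate` during training, survivors are scaled by 1/(1 − rate) so the expected activation is unchanged, and at test time the layer is a no-op.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    if not training or rate == 0.0:
        return x  # inference: all neurons active, no scaling needed
    mask = rng.random(x.shape) >= rate       # True = neuron kept this step
    return np.where(mask, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(0)
x = np.ones((4, 5))
y = dropout(x, rate=0.5, rng=rng)
# roughly half the entries are 0, the survivors are scaled to 2.0
```

Because the mask is redrawn every step, a neuron dropped now "might be activated again" later, as the card says.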

6
Q

What does “elu” stand for in

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])

?

A

Exponential Linear Unit

ELU(𝑥)=𝑥 if 𝑥≥0
ELU(𝑥)=𝛼(exp(𝑥)−1) if 𝑥<0

α=1 by default

Like ReLU it keeps positive values linear, but for negatives it smoothly saturates to −α, which helps keep activations closer to zero and can speed training. It’s smooth at 0 (unlike ReLU), which some optimizers like.
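The definition above, as a small NumPy function (α = 1, matching the default):

```python
import numpy as np

def elu(x, alpha=1.0):
    # x >= 0: identity; x < 0: smooth saturation toward -alpha
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

print(elu(np.array([-5.0, -1.0, 0.0, 2.0])))
# large negatives approach -alpha, non-negatives pass through unchanged
```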

7
Q

What is z in formulas such as z_hi = w_hi · x and o = σ(z_hi)?

A

z is the neuron’s pre-activation: the linear combination that hits the activation function.

(also called the net input or logit)

For a tiny network with only one neuron per layer, z_hi = w_hi · x is the input to the hidden neuron and o = σ(z_hi) is the output of the hidden neuron.
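The same one-neuron example in NumPy (the values of x and w_hi are made up; names follow the card's notation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = 2.0       # input
w_hi = 0.5    # weight into the hidden neuron

z_hi = w_hi * x    # pre-activation ("net input"): the linear part
o = sigmoid(z_hi)  # post-activation: what the neuron actually emits
```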

8
Q

What does the prime symbol denote in σ′(z_hi)?

A

the derivative of σ evaluated at z_hi
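For the logistic sigmoid this derivative has a closed form, σ′(z) = σ(z)(1 − σ(z)). A small NumPy check against a finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # closed-form derivative of the sigmoid

z = 0.3
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
print(sigmoid_prime(z), numeric)  # the two agree closely
```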

9
Q

What is rmsprop in
network.compile(optimizer='rmsprop', metrics=['acc'], loss='binary_crossentropy')
?

A

root mean square propagation

an extension of gradient descent that divides each parameter’s step by a running average of its recent gradient magnitudes, giving every parameter its own adaptive learning rate
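A hedged sketch of the update rule itself (not TensorFlow's implementation; the hyperparameter names and defaults follow common convention): keep an exponential moving average of squared gradients and scale the learning rate by its inverse square root.

```python
import numpy as np

def rmsprop_step(w, grad, avg_sq, lr=0.01, rho=0.9, eps=1e-7):
    # exponential moving average of the squared gradient
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2
    # per-parameter adaptive step: big recent gradients -> smaller step
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)
    return w, avg_sq

# minimize f(w) = w^2 (gradient 2w) starting from w = 5
w, avg_sq = 5.0, 0.0
for _ in range(2000):
    w, avg_sq = rmsprop_step(w, 2.0 * w, avg_sq)
print(w)  # close to 0
```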

10
Q

What is h and c in the context of LSTMs in deep learning?

A

h (e.g., ℎ𝑡) is the hidden state (also the block’s output) — the short-term, exposed memory the rest of the network sees at time 𝑡

c (e.g., 𝑐𝑡) is the cell state — the long-term “memory” that mostly flows along the top highway and gets updated by the forget/input gates
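A minimal single-timestep LSTM cell in NumPy (random weights, biases omitted for brevity) showing where h and c come from. This is an illustrative sketch of the standard forget/input/output gate equations, not a framework implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
Wf, Wi, Wo, Wc = (rng.normal(size=(n_hid, n_in + n_hid)) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])  # gates see input and previous hidden state
    f = sigmoid(Wf @ z)              # forget gate
    i = sigmoid(Wi @ z)              # input gate
    o = sigmoid(Wo @ z)              # output gate
    c_tilde = np.tanh(Wc @ z)        # candidate cell update
    c = f * c_prev + i * c_tilde     # cell state c_t: the long-term "highway"
    h = o * np.tanh(c)               # hidden state h_t: the exposed output
    return h, c

h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid))
print(h.shape, c.shape)
```

Note that c is updated additively through the forget/input gates, while h is what the rest of the network sees at time t.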

11
Q

What does z denote in the context of the gated recurrent unit (GRU) networks?

A

the update gate

a vector deciding how much of h_{t−1} to keep vs. how much new information to take in

z_t = σ(W_z·x_t + U_z·h_{t−1} + b_z)

if z_t ≈ 1: keep the old memory h_{t−1} (little update)

if z_t ≈ 0: overwrite with the candidate state h̃_t (big update)
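The interpolation the gate performs, sketched with made-up numbers. This follows the card's convention, h_t = z_t · h_{t−1} + (1 − z_t) · h̃_t, where z ≈ 1 keeps the old state; note that some texts (including the original GRU paper) swap the roles of z and 1 − z.

```python
import numpy as np

h_prev = np.array([1.0, -1.0])   # previous hidden state h_{t-1}
h_tilde = np.array([0.0, 0.5])   # candidate state h~_t

z = np.array([0.9, 0.1])         # per-unit update-gate values in (0, 1)
h = z * h_prev + (1.0 - z) * h_tilde
print(h)  # unit 0 mostly keeps h_prev, unit 1 mostly takes h_tilde
```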

12
Q

What do Q, K and V stand for in the general definition of attention

S = (Q Kᵀ) / √(d_k)
A = softmax(S)
O = A V

in the context of encoder-decoder transformers?

A

Query: what each position is looking for
Key: what each position offers to be matched against
Value: the content that gets mixed into the output
