Neural nets and deep learning Flashcards

(12 cards)

1
Q

What is MNIST an abbreviation of?

A

Modified National Institute of Standards and Technology database

2
Q

What does “None” indicate in “Output shape: (None, 784)” after model definition in Keras?

A

Shapes are written as (batch_size, …features…). When the model is defined, Keras doesn’t lock in a batch size.

It is Keras’s way of saying that the batch size will be supplied later, at runtime.

None isn’t a missing value; it’s a placeholder for a variable batch size, carried unchanged through layers that operate per example.
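A quick way to see this, sketched here in NumPy rather than Keras so it stays self-contained: the same fixed weights work unchanged for any leading batch dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(784, 10))  # fixed weights, independent of batch size

def forward(images):
    # images: (batch_size, 28, 28) -> flatten to (batch_size, 784) -> project
    flat = images.reshape(images.shape[0], -1)
    return flat @ W

for batch_size in (1, 32, 256):
    out = forward(rng.normal(size=(batch_size, 28, 28)))
    print(out.shape)  # only the first axis varies with the batch
```

The "None" in the Keras summary plays exactly the role of the unconstrained first axis here.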

3
Q

What is Glorot (Xavier) initialization in neural networks?

A

A weight initialization method proposed by Xavier Glorot and Yoshua Bengio. It aims to keep the variance of activations and gradients roughly constant across layers during forward and backward passes.

W ∼ U(−a, a), a = √(6 / (fan_in + fan_out))

fan_in = n_{l−1}, the number of neurons in the previous layer
fan_out = n_l, the number of neurons in the current layer
(l is the layer index)

This initialization is widely used for networks with activation functions like tanh or sigmoid.
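The formula can be checked numerically. A minimal NumPy sketch (the fan sizes are chosen arbitrarily): the variance of U(−a, a) is a²/3, which for the Glorot limit works out to 2 / (fan_in + fan_out).

```python
import numpy as np

fan_in, fan_out = 300, 100
a = np.sqrt(6.0 / (fan_in + fan_out))           # uniform limit
rng = np.random.default_rng(42)
W = rng.uniform(-a, a, size=(fan_in, fan_out))  # W ~ U(-a, a)

# sample variance should be close to the theoretical 2 / (fan_in + fan_out)
print(float(W.var()), 2.0 / (fan_in + fan_out))
```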

4
Q

What does L stand for in L1/L2 regularization?

L1 and L2 are names for standard norms from analysis (L1 and L2 spaces), not “Loss” or “Lambda”.

A

The “L” refers to the Lp norms from Lebesgue spaces in functional analysis.

(named after Henri Lebesgue)

L1 regularization uses the L1 norm (sum of absolute values), and L2 regularization uses the L2 norm (square root of sum of squared values).
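The two norms, written out in NumPy for a toy weight vector (the weights and λ here are made up for illustration):

```python
import numpy as np

w = np.array([3.0, -4.0, 0.0])

l1_norm = np.sum(np.abs(w))        # |3| + |-4| + |0| = 7
l2_norm = np.sqrt(np.sum(w ** 2))  # sqrt(9 + 16) = 5

# In practice the penalty added to the loss is usually lambda * sum(|w|)
# for L1 and lambda * sum(w^2) (the squared norm) for L2.
lam = 0.01
l1_penalty = lam * np.sum(np.abs(w))
l2_penalty = lam * np.sum(w ** 2)
```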

5
Q

What happens in the regularization technique called dropout?

A

randomly selected neurons are ignored during the training phase

These might be activated again at other steps of the training procedure.

Since each neuron can be either present or absent, the number of possible networks grows exponentially. These networks are not independent, because they share a large number of weights, but they nevertheless differ from one another. The resulting neural network can therefore be treated as an averaged ensemble of all these different networks.
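A minimal NumPy sketch of the mechanism, assuming the common "inverted dropout" convention: each activation is zeroed with probability `rate` during training, survivors are scaled by 1/(1 − rate) so the expected activation is unchanged, and at test time the layer is a no-op.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    if not training or rate == 0.0:
        return x  # inference: all neurons active, no scaling needed
    mask = rng.random(x.shape) >= rate       # True = neuron kept this step
    return np.where(mask, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(0)
x = np.ones((4, 5))
y = dropout(x, rate=0.5, rng=rng)
# roughly half the entries are 0, the survivors are scaled to 2.0
```

Because the mask is redrawn every step, a neuron dropped now "might be activated again" later, as the card says.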

6
Q

What does “elu” stand for in

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])

?

A

Exponential Linear Unit

ELU(𝑥)=𝑥 if 𝑥≥0
ELU(𝑥)=𝛼(exp(𝑥)−1) if 𝑥<0

α=1 by default

Like ReLU it keeps positive values linear, but for negatives it smoothly saturates to −α, which helps keep activations closer to zero and can speed training. It’s smooth at 0 (unlike ReLU), which some optimizers like.
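The definition above, as a small NumPy function (α = 1, matching the default):

```python
import numpy as np

def elu(x, alpha=1.0):
    # x >= 0: identity; x < 0: smooth saturation toward -alpha
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

print(elu(np.array([-5.0, -1.0, 0.0, 2.0])))
# large negatives approach -alpha, non-negatives pass through unchanged
```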

7
Q

What is z in formulas such as z_hi = w_hi · x and o = σ(z_hi)?

A

z is the neuron’s pre-activation: the linear combination that hits the activation function.

(also called the net input or logit)

For a tiny network with only one neuron per layer, z_hi = w_hi · x is the input to the hidden neuron and o = σ(z_hi) is the output of the hidden neuron.
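The same one-neuron example in NumPy (the values of x and w_hi are made up; names follow the card's notation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = 2.0       # input
w_hi = 0.5    # weight into the hidden neuron

z_hi = w_hi * x    # pre-activation ("net input"): the linear part
o = sigmoid(z_hi)  # post-activation: what the neuron actually emits
```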

8
Q

What does the prime symbol denote in σ′(z_hi)?

A

the derivative of σ evaluated at z_hi
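For the logistic sigmoid this derivative has a closed form, σ′(z) = σ(z)(1 − σ(z)). A small NumPy check against a finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # closed-form derivative of the sigmoid

z = 0.3
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
print(sigmoid_prime(z), numeric)  # the two agree closely
```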

9
Q

What is rmsprop in
network.compile(optimizer='rmsprop', metrics=['acc'], loss='binary_crossentropy')
?

A

root mean square propagation

an extension of gradient descent that divides each parameter’s step by a running average of its recent gradient magnitudes, giving every parameter its own adaptive learning rate
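A hedged sketch of the update rule itself (not TensorFlow's implementation; the hyperparameter names and defaults follow common convention): keep an exponential moving average of squared gradients and scale the learning rate by its inverse square root.

```python
import numpy as np

def rmsprop_step(w, grad, avg_sq, lr=0.01, rho=0.9, eps=1e-7):
    # exponential moving average of the squared gradient
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2
    # per-parameter adaptive step: big recent gradients -> smaller step
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)
    return w, avg_sq

# minimize f(w) = w^2 (gradient 2w) starting from w = 5
w, avg_sq = 5.0, 0.0
for _ in range(2000):
    w, avg_sq = rmsprop_step(w, 2.0 * w, avg_sq)
print(w)  # close to 0
```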

10
Q

What is h and c in the context of LSTMs in deep learning?

A

h (e.g., ℎ𝑡) is the hidden state (also the block’s output) — the short-term, exposed memory the rest of the network sees at time 𝑡

c (e.g., 𝑐𝑡) is the cell state — the long-term “memory” that mostly flows along the top highway and gets updated by the forget/input gates
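A minimal single-timestep LSTM cell in NumPy (random weights, biases omitted for brevity) showing where h and c come from. This is an illustrative sketch of the standard forget/input/output gate equations, not a framework implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
Wf, Wi, Wo, Wc = (rng.normal(size=(n_hid, n_in + n_hid)) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])  # gates see input and previous hidden state
    f = sigmoid(Wf @ z)              # forget gate
    i = sigmoid(Wi @ z)              # input gate
    o = sigmoid(Wo @ z)              # output gate
    c_tilde = np.tanh(Wc @ z)        # candidate cell update
    c = f * c_prev + i * c_tilde     # cell state c_t: the long-term "highway"
    h = o * np.tanh(c)               # hidden state h_t: the exposed output
    return h, c

h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid))
print(h.shape, c.shape)
```

Note that c is updated additively through the forget/input gates, while h is what the rest of the network sees at time t.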

11
Q

What does z denote in the context of the gated recurrent unit (GRU) networks?

A

the update gate

a vector deciding how much of h_{t−1} to keep vs. how much new information to take in

z_t = σ(W_z·x_t + U_z·h_{t−1} + b_z)

if z_t ≈ 1: keep the old memory h_{t−1} (little update)

if z_t ≈ 0: overwrite with the candidate state h̃_t (big update)
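The interpolation the gate performs, sketched with made-up numbers. This follows the card's convention, h_t = z_t · h_{t−1} + (1 − z_t) · h̃_t, where z ≈ 1 keeps the old state; note that some texts (including the original GRU paper) swap the roles of z and 1 − z.

```python
import numpy as np

h_prev = np.array([1.0, -1.0])   # previous hidden state h_{t-1}
h_tilde = np.array([0.0, 0.5])   # candidate state h~_t

z = np.array([0.9, 0.1])         # per-unit update-gate values in (0, 1)
h = z * h_prev + (1.0 - z) * h_tilde
print(h)  # unit 0 mostly keeps h_prev, unit 1 mostly takes h_tilde
```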

12
Q

What do Q, K and V stand for in the general definition of attention

S = (Q Kᵀ) / √(d_k)
A = softmax(S)
O = A V

in the context of encoder-decoder transformers?

A

Query: what each position is looking for
Key: what each position offers to be matched against
Value: the content that gets mixed into the output
