week 6 Flashcards

(47 cards)

1
Q

What is a deep neural network (DNN)?

A

A composition of layers f_θ = f_L ∘ f_{L−1} ∘ … ∘ f_1 that maps the input x through multiple transformations.

2
Q

What is the forward pass through layer l?

A

z_{l+1} = σ_l(W_l z_l + b_l).
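
The layer recursion can be sketched in NumPy. The two-layer setup, layer sizes, and ReLU choice below are illustrative assumptions, not from the cards:

```python
import numpy as np

def relu(u):
    # sigma(u) = max(0, u), applied elementwise
    return np.maximum(0.0, u)

def forward(x, weights, biases):
    """Forward pass: z_{l+1} = sigma_l(W_l z_l + b_l), layer by layer."""
    z = x
    for W, b in zip(weights, biases):
        a = W @ z + b   # pre-activation a_l
        z = relu(a)     # activation z_{l+1}
    return z

# Toy network with layer sizes 3 -> 4 -> 2
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [np.zeros(4), np.zeros(2)]
out = forward(rng.standard_normal(3), weights, biases)
```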

3
Q

What is a shallow neural network?

A

A neural network with one hidden layer: fθ(x)=ηᵀ σ(Wx+b).

4
Q

What are W and b in a neural network layer?

A

W are weights and b are biases for that layer.

5
Q

What are activations?

A

The outputs after applying the activation function: z_l = σ(W_{l−1} z_{l−1} + b_{l−1}).

6
Q

What are pre-activations?

A

The linear part before the activation function: a_l = W_{l−1} z_{l−1} + b_{l−1}.

7
Q

What is a neuron?

A

A single hidden unit computing σ(wᵀx + b).

8
Q

What is network depth?

A

The number of hidden layers (sometimes counted with +1 to include the output layer).

9
Q

What is network width?

A

Number of neurons in a layer.

10
Q

What is network capacity?

A

Total number of neurons across the network.

11
Q

What is the sigmoid activation?

A

σ(u) = 1 / (1 + e^(−u)).

12
Q

What is the ReLU activation?

A

σ(u) = max(0,u).

13
Q

What is tanh activation?

A

σ(u) = tanh(u).

14
Q

What is the step activation?

A

σ(u) = 1_{u > 0}.

15
Q

What is the Swish activation?

A

σ(u) = u / (1 + e^(−βu)).
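
The five activation formulas from cards 11–15 can be collected into one pointwise sketch; the Python definitions below simply restate them:

```python
import math

# Pointwise definitions of the five activations from the cards
def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))       # 1 / (1 + e^(-u))

def relu(u):
    return max(0.0, u)                      # max(0, u)

def tanh(u):
    return math.tanh(u)

def step(u):
    return 1.0 if u > 0 else 0.0            # indicator 1_{u > 0}

def swish(u, beta=1.0):
    return u / (1.0 + math.exp(-beta * u))  # equals u * sigmoid(beta * u)
```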

16
Q

What property do neural networks share with RBF kernels?

A

They are universal approximators: they can approximate any continuous function on a compact set to arbitrary accuracy.

17
Q

Why use deep instead of shallow networks?

A

Deep networks can approximate many functions with far fewer parameters than shallow networks would need.

18
Q

What is hierarchical feature learning?

A

Hidden layers learn progressively more abstract features (edges → shapes → object parts → concepts).

19
Q

Why does the last layer behave like a shallow model?

A

It is linear in its parameters: fθ(x)=ηᵀz where z is the final learned feature vector.

20
Q

How can we inspect what features a DNN learns?

A

By analysing which input patterns strongly activate specific neurons.

21
Q

What is data-driven feature learning?

A

Learning features automatically from data rather than manually designing them.

22
Q

Why are GPUs used for DNN training?

A

GPUs perform thousands of operations in parallel, enabling fast training of large models.

23
Q

What is a spurious feature?

A

A feature correlated with the label in training data but not truly relevant to the task.

24
Q

Example of spurious features?

A

Skin lesion classification using presence of a ruler as a signal for malignancy.

25
Q

What is an adversarial example?

A

An input with imperceptible noise added that causes the DNN to misclassify.

26
Q

Why are adversarial examples dangerous?

A

They can fool autonomous vehicles, security systems, spam filters, etc.

27
Q

Why are adversarial features surprising?

A

They reveal that the model relies on patterns humans would never use.

28
Q

How does explicit regularisation work in DNNs?

A

Add penalties (e.g., L2) to the loss or use early stopping to avoid overfitting.

29
Q

What is early stopping?

A

Stop training when the validation loss starts increasing; this acts as implicit regularisation.
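
A minimal sketch of the early-stopping rule, assuming a patience-based criterion; the helper names, patience count, and synthetic losses are illustrative, not from the cards:

```python
def train_with_early_stopping(train_epoch, val_loss, patience=3, max_epochs=100):
    """Stop once validation loss has failed to improve for `patience`
    consecutive epochs; return epochs run and the best loss seen."""
    best, since_best = float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch()
        loss = val_loss()
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return epoch + 1, best

# Synthetic validation losses: improvement stalls after epoch 3
losses = iter([5.0, 4.0, 3.0, 3.5, 3.6, 3.7, 2.0])
epochs, best = train_with_early_stopping(lambda: None, lambda: next(losses))
```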
30
Q

What is data augmentation?

A

Generate modified training samples (rotations, crops, synonym replacement, back-translation).
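
A toy sketch of input-side augmentation for image-like arrays; flips and 90° rotations stand in for the richer transforms listed above:

```python
import numpy as np

def augment(img, rng):
    """Return a randomly flipped and rotated copy of a 2-D image array.
    Real pipelines add crops, colour jitter, etc.; this is a minimal stand-in."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                    # random horizontal flip
    return np.rot90(img, k=int(rng.integers(0, 4)))

img = np.arange(16, dtype=float).reshape(4, 4)
aug = augment(img, np.random.default_rng(0))
```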
31
Q

Why is data augmentation useful?

A

It makes the model more robust and reduces overfitting.

32
Q

What is dropout?

A

Randomly deactivating (e.g., 50% of) neurons during training.

33
Q

Why does dropout help?

A

It prevents reliance on specific neurons and effectively trains many subnetworks.

34
Q

Are neurons dropped at test time?

A

No, all neurons are active at test time.
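
A sketch of dropout in NumPy. The cards only say neurons are dropped in training and all are active at test time; the 1/(1 − p) rescaling below is the common "inverted dropout" convention, an assumption beyond the cards:

```python
import numpy as np

def dropout(z, p_drop=0.5, train=True, rng=None):
    """Zero each unit with probability p_drop during training; at test
    time return the activations unchanged (all neurons active)."""
    if not train:
        return z
    rng = rng or np.random.default_rng()
    mask = rng.random(z.shape) >= p_drop  # keep each unit with prob 1 - p_drop
    return z * mask / (1.0 - p_drop)      # inverted-dropout rescaling

z = np.ones(10_000)
z_train = dropout(z, p_drop=0.5, train=True, rng=np.random.default_rng(0))
```

The rescaling keeps the expected activation the same in training and at test time, so no change is needed when dropout is switched off.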
35
Q

How do we train DNNs?

A

Minimise the empirical risk using mini-batch SGD or Adam, with gradients computed by backpropagation.

36
Q

Why is backpropagation needed?

A

To compute gradients efficiently through the chain rule in deeply composed functions.

37
Q

What is the chain rule for a 1-layer network?

A

∂l/∂w = (∂l/∂z)(∂z/∂a)(∂a/∂w), with a = wx + b and z = σ(a).
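
The three chain-rule factors can be checked numerically; the squared loss below is an illustrative choice, since the card does not fix a loss:

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def grad_w(w, b, x, y):
    """dl/dw = (dl/dz)(dz/da)(da/dw) for a = w*x + b, z = sigma(a),
    with illustrative loss l = (z - y)^2."""
    a = w * x + b
    z = sigmoid(a)
    dl_dz = 2.0 * (z - y)   # derivative of the squared loss
    dz_da = z * (1.0 - z)   # sigmoid'(a) = sigma(a) * (1 - sigma(a))
    da_dw = x
    return dl_dz * dz_da * da_dw

# Finite-difference check of the analytic gradient
w, b, x, y = 0.3, -0.1, 2.0, 1.0
eps = 1e-6
loss = lambda w_: (sigmoid(w_ * x + b) - y) ** 2
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)
```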
38
Q

Why is backprop efficient?

A

It reuses intermediate gradients instead of recomputing long chain-rule products.

39
Q

What problem occurs without good initialisation?

A

Vanishing or exploding gradients.

40
Q

What is He/Kaiming initialisation?

A

Draw weights W_ij ∼ N(0, 2/d_l) (fan-in) or N(0, 4/(d_l + d_{l+1})) to stabilise activations and gradients.
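
The fan-in variant can be sampled directly; the 1000 × 1000 sizes below are illustrative:

```python
import numpy as np

def he_init(d_in, d_out, rng):
    """He/Kaiming (fan-in) initialisation: entries W_ij ~ N(0, 2/d_in)."""
    return rng.standard_normal((d_out, d_in)) * np.sqrt(2.0 / d_in)

W = he_init(1000, 1000, np.random.default_rng(0))
# Empirical variance of the entries should sit near 2/1000 = 0.002
```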
41
Q

Why does He initialisation work?

A

It keeps variances of activations and gradients consistent across layers.

42
Q

What optimiser is typically used?

A

Adam or mini-batch SGD.

43
Q

What are typical hyperparameters in DNNs?

A

Number of layers, width, activation function, optimiser settings, learning rate, batch size.

44
Q

What are residual layers?

A

Layers of the form z_{l+1} = z_l + hθ(z_l).
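
A residual layer in NumPy; the single affine-plus-ReLU choice of hθ is illustrative. When hθ outputs zero, the skip connection makes the layer an exact identity, which is what lets gradients pass through unchanged:

```python
import numpy as np

def residual_layer(z, W, b):
    """z_{l+1} = z_l + h_theta(z_l), with h_theta a small ReLU block."""
    h = np.maximum(0.0, W @ z + b)  # h_theta(z_l)
    return z + h                    # skip connection adds the input back

z = np.array([1.0, -2.0, 3.0])
# With W = 0, b = 0 the residual branch vanishes and the layer is the identity
out = residual_layer(z, np.zeros((3, 3)), np.zeros(3))
```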
45
Q

Why do residual layers help?

A

They smooth the loss surface and enable training of very deep networks (e.g., 1000+ layers).

46
Q

What trick helps residual networks avoid exploding gradients?

A

Batch Normalisation (instead of relying on He initialisation alone).

47
Q

What is the purpose of a skip connection?

A

It allows gradients to flow easily across layers and improves training stability.