Multilayer perceptron after input and before the output layer, how are hidden layers calculated?
Each hidden unit computes ∑(wi·xi) + b over the previous layer's outputs, then passes the result through an activation function; those activations become the inputs to the next layer.
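The per-unit computation above can be sketched in plain Python. The weights, biases, and inputs below are made-up illustrative values, not from the notes:

```python
def relu(x):
    # activation function: max(0, x)
    return max(0.0, x)

def layer(inputs, weights, biases, activation):
    # each unit j computes sum_i(w[j][i] * x[i]) + b[j],
    # then applies the activation function
    return [activation(sum(w_i * x_i for w_i, x_i in zip(w, inputs)) + b)
            for w, b in zip(weights, biases)]

x = [1.0, 2.0]                 # input vector (hypothetical values)
W = [[0.5, -1.0], [1.0, 1.0]]  # one row of weights per hidden unit
b = [0.0, -1.0]
hidden = layer(x, W, b, relu)  # -> [0.0, 2.0]
```

The output of `layer` would be fed to the next `layer` call, which is all an MLP forward pass is.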
What activation function is used in logistic regression?
Sigmoid (logistic function): sigma(x) = 1/(1 + exp(−x))
Detail the Step Activation function
sigma(x) = 1 if x ≥ 0; sigma(x) = 0 if x < 0
Detail the Sigmoid and Tanh Activation functions
• Sigmoid: sigma(x) = 1/(1 + exp(−ax)) (centered at 0.5) • Tanh: sigma(x) = tanh(x) (centered at 0) • Both differentiable, but gradients are killed when |x| is large • Also expensive to compute
Detail the ReLU Activation function
sigma(x) = max(0, x)
Pros: • Gradients don’t die in the positive region • Computationally efficient • Experimentally: convergence is faster Cons: • Kills gradients in the negative region (gradient is 0 for x < 0) • Not zero-centered
Detail the Softplus Activation function
sigma(x) = ln(1 + exp(x))
Pros: • Differentiable everywhere, and the gradient is never exactly zero Cons: • Gradients still vanish in the negative region when |x| is large (the gradient, sigmoid(x), approaches 0) • Not zero-centered • Computationally expensive
Detail Leaky ReLU and Parametric ReLU Activation functions
sigma(x) = max(0.01x, x) (Leaky ReLU); more generally, sigma(x) = max(ax, x) (Parametric ReLU) Pros: • Gradients don’t die in either the positive or the negative region • Computationally efficient • Experimentally: convergence is faster Cons: • Need to choose a (a hyper-parameter in Leaky ReLU; learned in Parametric ReLU) • Not zero-centered
Detail the Exponential Linear Units
sigma(x) = x if x > 0; a(exp(x) − 1) if x ≤ 0 Pros: • Gradients don’t die in either the positive or the negative region • Experimentally: convergence is faster • Closer to zero-mean outputs Cons: • Expensive to compute (exp)
Detail the Maxout Neuron
sigma(x) = max(w1·x, w2·x)
Pros: • Generalizes Parametric ReLU • Provides more flexibility by allowing different w1 and w2 • Gradients don’t die Cons: • Doubles the number of parameters
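The activation functions above can be written as one-liners. This is a minimal stdlib-only sketch following the formulas in the cards (the default a values are illustrative, not prescribed by the notes):

```python
import math

def step(x):
    # 1 if x >= 0, else 0; not differentiable at 0
    return 1.0 if x >= 0 else 0.0

def sigmoid(x):
    # 1 / (1 + exp(-x)), centered at 0.5
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    # ln(1 + exp(x)), a smooth approximation of ReLU
    return math.log(1.0 + math.exp(x))

def relu(x):
    # max(0, x)
    return max(0.0, x)

def leaky_relu(x, a=0.01):
    # max(a*x, x); a > 0 keeps a small gradient for x < 0
    return max(a * x, x)

def elu(x, a=1.0):
    # x for x > 0, a*(exp(x) - 1) for x <= 0
    return x if x > 0 else a * (math.exp(x) - 1.0)
```

Tanh is available directly as `math.tanh`. Evaluating each function at a few points (e.g. x = −2, 0, 2) makes the "gradients die in the negative region" contrast between ReLU and Leaky ReLU concrete.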
Can we use a perceptron to model an XOR function?
No
Can we use a perceptron to model an OR function?
Yes
What can a (single-layer) perceptron do to data points?
A (single-layer) perceptron can only separate linearly separable data points.
An MLP with one hidden layer is known as?
A universal approximator (with a sufficiently wide hidden layer, it can approximate any continuous function).
Name three loss functions and how to calculate them.
Square loss (MSE) = (1/n) ∑(j) (actualValue(j) − predictedValue(j))²
Cross Entropy loss = − ∑(j) actualValue(j) log(predictedValue(j))
Hinge loss = ∑(j ≠ t) max(0, predictedValue(j) − predictedValue(t) + 1), where t is the index
of the ‘hot’ bit (the correct class) in the target.
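The three loss functions can be sketched directly from the formulas above; this stdlib-only version assumes list inputs and, for cross entropy, a one-hot target with predicted probabilities:

```python
import math

def square_loss(y_true, y_pred):
    # mean squared error over all outputs
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_entropy(y_true, y_pred):
    # y_true is one-hot, y_pred holds predicted probabilities
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

def hinge_loss(scores, t):
    # multiclass hinge: t is the index of the correct class ('hot' bit)
    return sum(max(0.0, s - scores[t] + 1.0)
               for j, s in enumerate(scores) if j != t)

# hypothetical predictions for a 3-class example with correct class 0
print(hinge_loss([2.0, 1.0, 3.0], 0))  # -> 2.0
```

Note hinge loss operates on raw scores, while cross entropy expects probabilities (typically the output of a softmax).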
Give some examples of advanced optimizers for neural networks.
• Stochastic gradient descent (SGD) has troubles:
• Ravines: areas where the surface is much steeper in one dimension than in the others, which are common around local optima
• Saddle points (i.e. points where one dimension slopes up and another slopes down)
are usually surrounded by a plateau of the same error, which makes it notoriously
hard for SGD to escape, as the gradient is close to zero in all dimensions.
• More advanced optimizers have been proposed:
• RMSProp, AdaGrad, AdaDelta, Adam, Nadam
• These methods usually train faster than SGD, but the solutions they find are
often not as good as those found by SGD
• Performance of SGD is very much reliant on a robust initialization and annealing schedule
• Possible solution: First train with Adam, fine-tune with SGD
• Shuffling and Curriculum Learning
• Shuffling: avoid providing the training examples in a meaningful order to our model
as this may bias the optimization algorithm
• Curriculum learning: for some cases where we aim to solve progressively harder
problems, supplying the training examples in a meaningful order may lead to
improved performance and better convergence
• Batch normalization
• For each mini-batch, normalize the layer's activations to an ideal range (e.g. zero mean and
unit variance) to ensure the gradients do not vanish or explode
• Early stopping
• monitor error on a validation set during training and stop (with some patience) if the
validation error does not improve enough
These techniques can be used alongside each other
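Of the techniques above, early stopping is easy to make concrete. This is a minimal sketch of a patience-based stopping loop; `val_error_fn` stands in for whatever computes the validation error after each epoch (a hypothetical callback, not from the notes):

```python
def train_with_early_stopping(epochs, val_error_fn, patience=3):
    # stop when the validation error has not improved for `patience` epochs
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch in range(epochs):
        err = val_error_fn(epoch)  # train one epoch, then measure val error
        if err < best:
            best, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # patience exhausted: stop training
    return best_epoch, best
```

In practice one would also checkpoint the model weights at `best_epoch` and restore them after stopping.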
What is Dropout?
By dropping a unit out, we mean
temporarily removing it from the network,
along with all its incoming and outgoing connections.
Dropout is a regularization method that
approximates training a large number of neural networks with different
architectures in parallel.
Dropout has the effect of making the
training process noisy, forcing nodes
within a layer to probabilistically take on
more or less responsibility for the inputs.
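The dropping-out of units can be sketched as "inverted dropout", the variant commonly used in practice (and by PyTorch): survivors are scaled at training time so the expected activation is unchanged, and evaluation mode is a no-op:

```python
import random

def dropout(activations, rate, training=True):
    # zero each unit with probability `rate` during training,
    # scaling survivors by 1/(1 - rate) to keep the expected value unchanged
    if not training or rate == 0.0:
        return list(activations)  # evaluation mode: identity
    keep = 1.0 - rate
    return [a / keep if random.random() < keep else 0.0
            for a in activations]
```

With `rate=0.5`, each surviving activation is doubled and roughly half are zeroed, which is the source of the training noise described above.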
How can the values of dropout be set and what architectures can it be used with?
• Dropout can be used with most types of neural architectures, such as
dense fully connected layers, CNNs and RNNs
• Dropout rate (PyTorch): the probability of dropping out a node, where 0.0 means no dropout, and 1.0 means drop all nodes. A good value for dropout
in a hidden layer is between 0.2 and 0.5.
• Caveat: in some papers/blogs ‘dropout rate’ instead means the probability of a node
being retained
• Use larger network: a network with 100 nodes and a proposed dropout rate of 0.5 will require 200 nodes (100 / 0.5) when using dropout.
• Weight constraint: Large weight values can be a sign of an unstable network. To counter this effect a weight constraint can be imposed to force
the norm (magnitude) of all weights in a layer to be below a specified value
(e.g. 3-4)
How should you initialize weights?
Glorot (Xavier) initialization - a reasonable default (derived assuming linear activations, works for tanh; needs adaptation for ReLU, e.g. He initialization, because ReLU is not zero-centered)
Why shouldn’t you use large random numbers to initialize weights?
• Large weights push activations into the saturated regions of sigmoid/tanh, where gradients are near zero, so learning stalls; with ReLU-like activations, the activations (and gradients) can explode instead
Why shouldn’t you use small random numbers to initialize weights?
• E.g. real numbers uniformly randomly drawn from [-0.01, 0.01]
• Works OK, but only for small networks (a few layers, each with a few
activations)
• In deeper networks, activations become very close to zero in deeper
layers (i.e. layers far from the input, close to the output layer)
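The two initializations mentioned above can be sketched with the stdlib. The limit sqrt(6/(fan_in + fan_out)) is the standard Glorot-uniform bound, and the He variant is the usual ReLU adaptation; the layer sizes below are arbitrary examples:

```python
import math
import random

def glorot_uniform(fan_in, fan_out):
    # sample weights uniformly from [-limit, limit],
    # with limit = sqrt(6 / (fan_in + fan_out))
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[random.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

def he_normal(fan_in, fan_out):
    # He initialization (the ReLU adaptation): zero-mean Gaussian
    # with standard deviation sqrt(2 / fan_in)
    std = math.sqrt(2.0 / fan_in)
    return [[random.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]

W = glorot_uniform(100, 50)  # weights for a 100 -> 50 layer
```

Both schemes keep the variance of activations roughly constant across layers, which is exactly what avoids the vanishing activations described above.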