Linear regression
Y = Xβ + ε, where Y is the vector of observed values, X the feature (design) matrix, β the vector of regression coefficients, and ε the vector of errors (residuals)
We can solve this for β via the normal equation: β = (XᵀX)⁻¹XᵀY, which requires XᵀX to be invertible.
If it is not invertible, we can use ridge regression or the Moore-Penrose pseudo-inverse instead.
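A minimal numpy sketch of both routes (the toy data, `beta_true`, and the noise scale are made up for illustration):

```python
import numpy as np

# Toy data (made up): 100 samples, 3 features, known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.01 * rng.normal(size=100)

# Normal equation: beta = (X^T X)^(-1) X^T y (requires X^T X invertible)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fallback when X^T X is singular: the Moore-Penrose pseudo-inverse
# gives the minimum-norm least-squares solution
beta_pinv = np.linalg.pinv(X) @ y
```

For a full-rank X both routes agree; the pseudo-inverse additionally handles the rank-deficient case.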
What does backpropagation do?
Backprop computes ∇θL efficiently by applying the chain rule on a computation graph and reusing intermediate results.
Forward pass: compute activations and cache needed intermediates.
Backward pass: propagate upstream gradients δv = ∂L/∂v from output back to parameters.
Key rule: if y = f(x), then δx = δy · (∂y/∂x)
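The forward/backward pattern can be checked numerically on a tiny hand-built graph (the function z = w·x + b, y = tanh(z), L = y² is an arbitrary example):

```python
import math

# Tiny computation graph: z = w*x + b, y = tanh(z), L = y^2
# Forward pass computes and caches the intermediates z and y.
def forward(w, b, x):
    z = w * x + b
    y = math.tanh(z)
    L = y ** 2
    return z, y, L

w, b, x = 0.5, -0.2, 1.5
z, y, L = forward(w, b, x)

# Backward pass: delta_v = dL/dv, propagated by the chain rule,
# reusing the cached intermediate y
dL_dy = 2 * y                 # L = y^2
dL_dz = dL_dy * (1 - y ** 2)  # dy/dz = 1 - tanh(z)^2
dL_dw = dL_dz * x             # dz/dw = x
dL_db = dL_dz * 1.0           # dz/db = 1

# Sanity check against a central finite-difference approximation
eps = 1e-6
num_dw = (forward(w + eps, b, x)[2] - forward(w - eps, b, x)[2]) / (2 * eps)
```

The analytic gradient and the finite-difference estimate should match to high precision.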
Model training code in PyTorch
num_epochs = 1000
for epoch in range(num_epochs):
    predictions = model(X)
    MSE = loss(predictions, y)
    MSE.backward()
    optimizer.step()
    optimizer.zero_grad()
Model evaluation code in PyTorch
model.eval()
with torch.no_grad():
    predictions = model(X_test)
    test_MSE = loss(predictions, y_test)
Random forest
How to calculate Entropy (Information Gain) for example for random forests?
H(X) = −∑_{i=1}^{c} p_i log₂(p_i), where c is the number of classes and p_i the fraction of samples in class i
→ the goal is low entropy in the child nodes
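A small Python helper for this formula (the example distributions below are illustrative):

```python
import math

def entropy(probs):
    # H(X) = -sum_i p_i * log2(p_i); terms with p_i = 0 contribute 0
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

A 50/50 split gives 1 bit of entropy, a pure node gives 0, and a uniform 4-class node gives 2 bits.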
How to calculate Gini Impurity?
Gini(X) = 1 − ∑_i p_i²
Gini(0, 1) = 0
Gini(0.5, 0.5) = 0.5
Gini(0.3, 0.7) = 0.42
The more unequal the class distribution, the more certain the node's prediction and the lower the impurity.
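The values above can be checked with a short helper:

```python
def gini(probs):
    # Gini(X) = 1 - sum_i p_i^2
    return 1.0 - sum(p * p for p in probs)
```

A pure node has impurity 0; a 50/50 binary node reaches the binary maximum of 0.5.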
What are Boosted trees?
An ensemble built sequentially: each new learner is trained to emphasize the samples its predecessors misclassified, e.g. AdaBoost
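A sketch of a single AdaBoost-style reweighting step in numpy (the labels, predictions, and uniform initial weights are made-up toy values); it shows how misclassified samples gain weight for the next round:

```python
import numpy as np

# Toy binary labels and one weak learner's predictions (samples 1 and 3 wrong)
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, 1, 1])
w = np.full(5, 0.2)                    # uniform initial sample weights

miss = (y_pred != y_true)
err = np.sum(w[miss])                  # weighted error rate of this learner
alpha = 0.5 * np.log((1 - err) / err)  # learner's vote weight
# Misclassified samples are scaled up by e^alpha, correct ones down by e^-alpha
w = w * np.exp(alpha * np.where(miss, 1.0, -1.0))
w = w / w.sum()                        # renormalize to a distribution
```

A known property of this update: after renormalization, the previous learner's mistakes carry exactly half of the total weight, so the next learner must focus on them.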
What is CrossEntropy?
L_MLM = −(1/|B|) ∑_{s∈B} ∑_{i∈M} log p(s_i | s_{Mᶜ})
where B is the batch, M the set of masked token positions, and s_{Mᶜ} the unmasked (context) tokens
What is a support vector machine?
A classifier that finds the maximum-margin hyperplane between classes; only the samples on the margin (the support vectors) determine it, and kernel functions allow non-linear decision boundaries.
What is a singular value decomposition?
A = UΣVᵀ: any matrix factors into orthogonal matrices U and V and a diagonal matrix Σ of non-negative singular values.
What is UMAP?
“Uniform Manifold Approximation and Projection”
1. Construct high-dimensional graph: compute distances between data points under a chosen metric, find each point's k nearest neighbors, and connect them with edge probability e^(−d(x,y)/σᵢ)
2. Optimization process: find a low-dimensional representation of the graph by minimizing the cross-entropy between the two graphs
3. Attractive and repulsive forces: pull and push points so that low-dimensional distances match those in high dimension
4. The optimization is carried out with stochastic gradient descent (SGD)
What is PCA?
“Principal component analysis”
- converts a set of observations into a set of linearly uncorrelated variables
1. Standardization: zᵢ = (xᵢ − μᵢ)/σᵢ
2. Covariance matrix: C = 1/(n-1)ZTZ
3. Eigenvalue decomposition: solve Cv = λv
4. Sort eigenvectors by eigenvalue: λ₁ ≥ … ≥ λₖ, where λ₁ is the largest eigenvalue and v₁, …, vₖ are the corresponding eigenvectors
5. Project data onto principal components: T = ZVk, where T is the transformed data with the rows containing the observations and the columns the principal components
- the total variance is then given by ∑ λi
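Steps 1-5 can be sketched in numpy (the anisotropic toy data is made up; for simplicity the data is only centered here rather than fully z-scored):

```python
import numpy as np

# Toy data (made up): 3 features with very different variances
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])

# 1. Center the data (full standardization would also divide by sigma)
Z = X - X.mean(axis=0)
# 2. Covariance matrix: C = Z^T Z / (n - 1)
C = (Z.T @ Z) / (Z.shape[0] - 1)
# 3. Eigendecomposition (eigh: C is symmetric)
eigvals, eigvecs = np.linalg.eigh(C)
# 4. Sort eigenpairs by descending eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 5. Project onto the first k principal components: T = Z V_k
k = 2
T = Z @ eigvecs[:, :k]
```

The variance of each projected column equals its eigenvalue, and the eigenvalues sum to the total variance (the trace of C).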
Normal distribution
f(x) = 1/(σ√(2π)) · e^(−½((x−μ)/σ)²)
where σ is the standard deviation
the variance is σ2.
For the standard normal distribution: μ=0, σ=1
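A direct transcription of the density into Python:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # f(x) = 1/(sigma*sqrt(2*pi)) * exp(-0.5*((x - mu)/sigma)^2)
    coef = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coef * math.exp(-0.5 * ((x - mu) / sigma) ** 2)
```

The standard normal peaks at x = 0 with height 1/√(2π) ≈ 0.399, and the density is symmetric about μ.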
Recurrent layers
Network architecture designed to process sequential data; the recurrent (feedback) connections give it a potentially infinite impulse response.
Hidden state
h_t = f_h(W_hx·x_t + W_hh·h_{t−1} + b_h)
Output
y_t = f_y(W_yh·h_t + b_y)
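A minimal numpy sketch of these two equations unrolled over a toy sequence (the sizes and random weights are arbitrary; f_h = tanh and f_y = identity here):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 3, 2
W_hx = rng.normal(size=(d_h, d_in)) * 0.1
W_hh = rng.normal(size=(d_h, d_h)) * 0.1
W_yh = rng.normal(size=(d_out, d_h)) * 0.1
b_h = np.zeros(d_h)
b_y = np.zeros(d_out)

def rnn_step(x_t, h_prev):
    # h_t = f_h(W_hx x_t + W_hh h_{t-1} + b_h), with f_h = tanh
    h_t = np.tanh(W_hx @ x_t + W_hh @ h_prev + b_h)
    # y_t = f_y(W_yh h_t + b_y), with f_y = identity
    y_t = W_yh @ h_t + b_y
    return h_t, y_t

# Unroll over a sequence: the SAME weights are reused at every time step
h = np.zeros(d_h)
seq = rng.normal(size=(5, d_in))
outputs = []
for x_t in seq:
    h, y_t = rnn_step(x_t, h)
    outputs.append(y_t)
```

The hidden state carries information forward between steps, which is what makes the impulse response potentially infinite.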
P-value
the probability of obtaining data at least as extreme as the observed data, assuming the null hypothesis is true
Pearson correlation
ρ_X,Y = Cov(X,Y)/(σ_X σ_Y), where σ is the standard deviation
Covariance
Cov(X,Y) = 1/n ∑ᵢ(xᵢ − E(X))(yᵢ − E(Y)), where E(X) is the expected value (use 1/(n−1) for the unbiased sample covariance)
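Both definitions in numpy, on a toy pair of perfectly correlated vectors (y = 2x is a made-up example):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # y = 2x: perfectly correlated

# Population covariance: mean of the products of deviations
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
# Pearson correlation: covariance normalized by the standard deviations
# (np.std defaults to the population form, matching the 1/n covariance)
rho = cov_xy / (x.std() * y.std())
```

Because y is an exact linear function of x with positive slope, ρ comes out as 1.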
How to prevent overfitting? List 7 methods
1. More training data / data augmentation
2. Regularization (L1/L2 weight penalties)
3. Dropout
4. Early stopping
5. Reducing model complexity
6. Cross-validation for model and hyperparameter selection
7. Ensembling (e.g. bagging)
How can you optimize model architecture and execution?
Transformer architecture
Input sequence -> Tokenizer -> Input embedding layer + positional encoding
Encoder stack:
- multi-head self-attention
- position-wise fully connected feed forward
Decoder stack:
- masked self-attention over the decoder input
- cross-attention over the encoder output: Q from the decoder, K and V from the encoder
- feed-forward
Output layer:
- linear transformation
- softmax: to create output probabilities
Logit function
logit(p) = σ⁻¹(p) = ln(p/(1−p)) for p in (0, 1)
- maps (0,1) to (-infinity,infinity)
- inverse of the sigmoid function
- logit(0.5) = 0
“logit = logistic unit”
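A quick check of these properties in Python:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    # logit(p) = ln(p / (1 - p)): maps (0, 1) to (-inf, inf)
    return math.log(p / (1.0 - p))
```

logit(0.5) = 0, and logit inverts the sigmoid: logit(sigmoid(x)) recovers x.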
Attention layers
Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V
W_q, W_k, W_v are the learned parameter matrices
rows of Q, K, V are tokens
columns are the embedding dimensions of the tokens
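A minimal numpy sketch of scaled dot-product attention (single head; the token count, model width, and random projection matrices are arbitrary toy values):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row: one query's mix over all keys
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model = 4, 8
X = rng.normal(size=(n_tokens, d_model))        # rows = tokens, cols = embedding dims
Wq = rng.normal(size=(d_model, d_model)) * 0.1  # learned projections (random here)
Wk = rng.normal(size=(d_model, d_model)) * 0.1
Wv = rng.normal(size=(d_model, d_model)) * 0.1
out, weights = attention(X @ Wq, X @ Wk, X @ Wv)
```

Each output row is a convex combination of the value rows, so every row of the attention-weight matrix sums to 1.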
PyTorch Model
torch.manual_seed(42)
model = nn.Sequential(
    nn.Linear(3, 8),
    nn.ReLU(),
    nn.Linear(8, 4),
    nn.Sigmoid(),
    nn.Linear(4, 1))
model(X)