Physical study cards Flashcards

(43 cards)

1
Q

Linear regression

A

Y = Xβ + ε, where Y is the vector of observed values, X the feature matrix, β the vector of regression coefficients, and ε the vector of errors (residuals).
We can solve this for β via the normal equations: β = (XᵀX)⁻¹XᵀY, which requires XᵀX to be invertible.
If XᵀX is not invertible, we can do ridge regression instead or use the pseudo-inverse.
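As a minimal NumPy sketch of the closed-form solution (the toy data below is an arbitrary assumption of the example, not from the card):

```python
import numpy as np

# toy, noise-free data generated from known coefficients
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
beta_true = np.array([2.0, 3.0])
Y = X @ beta_true

# normal equations: beta = (X^T X)^{-1} X^T Y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y

# the pseudo-inverse gives the same result and also covers the singular case
beta_pinv = np.linalg.pinv(X) @ Y

print(beta_hat)   # close to [2. 3.]
```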

2
Q

What does backpropagation do?

A

Backprop computes ∇θL efficiently by applying the chain rule on a computation graph and reusing intermediate results.
Forward pass: compute activations and cache needed intermediates.
Backward pass: propagate upstream gradients δv = ∂L/∂v from output back to parameters.
Key rule: if y = f(x), then δx = δy · (∂y/∂x)

3
Q

Model training code in PyTorch

A
num_epochs = 1000
for epoch in range(num_epochs):
    predictions = model(X)        # forward pass
    MSE = loss(predictions, y)    # e.g. loss = nn.MSELoss()
    MSE.backward()                # backpropagate gradients
    optimizer.step()              # update parameters
    optimizer.zero_grad()         # clear gradients for the next step
4
Q

Model evaluation code in PyTorch

A
model.eval()                      # e.g. disables dropout, freezes batch-norm statistics
with torch.no_grad():             # no gradient tracking needed for evaluation
    predictions = model(X_test)
    test_MSE = loss(predictions, y_test)
5
Q

Random forest

A
  • ML model consisting of an ensemble of decision trees
  • Randomness between the trees is introduced by:
    1. Bootstrap aggregation (bagging): each tree is trained on a random sample of the training data, drawn with replacement
    2. Random feature selection: at each node, only a random subset of the features is considered for splitting
6
Q

How to calculate entropy (for information gain), e.g. in random forests?

A

H(X) = − ∑ᵢ₌₁ᶜ pᵢ·log₂(pᵢ), where c is the number of classes and pᵢ the fraction of samples in class i
→ the goal is low entropy in the child nodes

7
Q

How to calculate Gini Impurity?

A

Gini(X) = 1 − ∑ pᵢ²
Gini(0, 1) = 0
Gini(0.5, 0.5) = 0.5
Gini(0.3, 0.7) = 1 − 0.09 − 0.49 = 0.42
A more unequal distribution means more certainty and therefore lower impurity.
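A minimal sketch computing both impurity measures from class probabilities (plain NumPy, not tied to any particular library):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H = -sum(p_i * log2(p_i)), ignoring zero probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0 * log(0) is treated as 0
    return -np.sum(p * np.log2(p))

def gini(p):
    """Gini impurity G = 1 - sum(p_i^2)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

print(entropy([0.5, 0.5]))            # 1.0 (maximally uncertain, 2 classes)
print(gini([0.5, 0.5]))               # 0.5
print(round(gini([0.3, 0.7]), 2))     # 0.42
print(gini([0.0, 1.0]))               # 0.0 (pure node)
```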

8
Q

What are Boosted trees?

A

An ensemble built sequentially: each new tree is trained to emphasize the samples that the previous trees mismodeled, e.g. AdaBoost.

9
Q

What is the cross-entropy loss for masked language modeling (MLM)?

A

L_MLM = −(1/|B|) ∑_{s∈B} ∑_{i∈M} log p(sᵢ | s_{Mᶜ})
where B is the batch, M the set of masked token positions, and s_{Mᶜ} the sequence with the masked positions hidden

10
Q

What is a support vector machine?

A
  • the model finds the hyperplane that best separates the classes
  • objective: minimize ½‖w‖² subject to the constraints yᵢ(w·xᵢ + b) ≥ 1, where the labels yᵢ ∈ {−1, 1}
  • for non-linear cases the kernel trick can be applied
11
Q

What is a singular value decomposition?

A
  • a matrix decomposition used for dimensionality reduction (e.g. in PCA) and for data compression
  • A = UΣVᵀ, where U and V are orthogonal matrices and Σ is diagonal
  • orthogonal means QQᵀ = QᵀQ = I, i.e. a linear transformation that preserves vector norms
    It is calculated as follows:
  • U has the eigenvectors of AAᵀ as columns
  • V has the eigenvectors of AᵀA as columns
  • Σ has the singular values (the square roots of the eigenvalues of AᵀA) as diagonal elements, in descending order
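A minimal NumPy sketch verifying these properties (the concrete matrix is an arbitrary example):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# reconstruction: A = U @ diag(s) @ V^T
assert np.allclose(U @ np.diag(s) @ Vt, A)

# singular values = square roots of the eigenvalues of A^T A
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]   # sorted descending
assert np.allclose(s, np.sqrt(eigvals))

# columns of U (and rows of V^T) are orthonormal
assert np.allclose(U.T @ U, np.eye(2))
print(s)   # singular values in descending order
```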
12
Q

What is UMAP?

A

“Uniform Manifold Approximation and Projection”
1. Construct a high-dimensional graph: compute distances between data points using a metric, find the k nearest neighbors, and create a graph where the probability of two points being connected decays as e^(−d(x,y)/σᵢ)
2. Optimization: find a low-dimensional representation of the graph by minimizing the cross-entropy between the two graphs
3. Attractive and repulsive forces make the low-dimensional distances match those in high dimension
4. The optimization uses stochastic gradient descent (SGD)

13
Q

What is PCA?

A

“Principal component analysis”
- converts a set of observations into a set of linearly uncorrelated variables
1. Standardization: zᵢ = (xᵢ − μᵢ) / σᵢ
2. Covariance matrix: C = 1/(n−1)·ZᵀZ
3. Eigenvalue decomposition: solve Cv = λv
4. Sort the eigenpairs in descending order: λ₁ ≥ … ≥ λₖ, where λ₁ is the largest eigenvalue and v₁, …, vₖ are the corresponding eigenvectors
5. Project the data onto the principal components: T = ZVₖ, where T is the transformed data with the rows containing the observations and the columns the principal components
- the total variance is then given by ∑ λᵢ
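The steps above can be sketched in NumPy (the toy data is an arbitrary assumption of the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.0, 0.1]])

# 1. standardize
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. covariance matrix
C = (Z.T @ Z) / (len(Z) - 1)

# 3./4. eigendecomposition, sorted by descending eigenvalue
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. project onto the first k principal components
k = 2
T = Z @ eigvecs[:, :k]

print(T.shape)        # (100, 2)
print(eigvals.sum())  # total variance, ≈ 3 for standardized 3-feature data
```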

14
Q

Normal distribution

A

f(x) = 1/(σ√(2π)) · e^(−½((x−μ)/σ)²)
where μ is the mean and σ the standard deviation;
the variance is σ².
For the standard normal distribution: μ = 0, σ = 1.

15
Q

Recurrent layers

A

Network architecture designed to process sequential data; has the potential for an infinite impulse response.
Hidden state:
h_t = f_h(W_hx·x_t + W_hh·h_{t−1} + b_h)
Output:
y_t = f_y(W_yh·h_t + b_y)
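The recurrence can be sketched in NumPy; the dimensions, random weights and the choice of tanh as hidden activation (with an identity output) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 2

W_hx = rng.normal(scale=0.1, size=(d_hidden, d_in))
W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
W_yh = rng.normal(scale=0.1, size=(d_out, d_hidden))
b_h = np.zeros(d_hidden)
b_y = np.zeros(d_out)

def rnn_forward(xs):
    """Run h_t = tanh(W_hx x_t + W_hh h_{t-1} + b_h) over a sequence."""
    h = np.zeros(d_hidden)
    ys = []
    for x in xs:
        h = np.tanh(W_hx @ x + W_hh @ h + b_h)   # hidden-state update
        ys.append(W_yh @ h + b_y)                # output y_t
    return np.array(ys), h

xs = rng.normal(size=(5, d_in))   # sequence of 5 inputs
ys, h_final = rnn_forward(xs)
print(ys.shape)   # (5, 2)
```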

16
Q

P-value

A

probability of obtaining data at least as extreme as the observed data, assuming the null hypothesis is true

17
Q

Pearson correlation

A

ρ_X,Y = Cov(X, Y) / (σ_X·σ_Y), where σ is the standard deviation

18
Q

Covariance

A

Cov(X, Y) = 1/n ∑(xᵢ − E(X))(yᵢ − E(Y)), where E(X) is the expected value

19
Q

How to prevent overfitting? List 7 methods

A
  • Cross-validation
  • Regularization (L1/L2)
  • Pruning (e.g. in decision trees)
  • Early stopping
  • Dropout
  • Simpler model with fewer parameters
  • Ensemble methods
20
Q

How can you optimize model architecture and execution?

A
  • Mixed precision training: single precision (32-bit) for weights, biases and losses; half precision (16-bit) for activations and gradients. This increases speed and decreases memory requirements.
  • Gradient checkpointing: only store activations at checkpoint layers and recompute the others during the backward pass (activations are computed in the forward pass and needed again for the backward gradient computation)
  • Model simplification:
    a. pruning: remove unimportant parameters
    b. quantization: reduce precision
    c. knowledge distillation: smaller model trained to mimic larger model
21
Q

Transformer architecture

A

Input sequence -> Tokenizer -> Input embedding layer + positional encoding
Encoder stack:
- multi-head self-attention
- position-wise fully connected feed forward
Decoder stack:
- masked self-attention over the decoder input
- cross-attention: Q from the decoder, K and V from the encoder output
- feed-forward
Output layer:
- linear transformation
- softmax: to create output probabilities

22
Q

Logit function

A

logit(p) = σ⁻¹(p) = ln(p / (1 − p)) for p ∈ (0, 1)
- maps (0, 1) to (−∞, ∞)
- inverse of the sigmoid function
- logit(0.5) = 0
“logit = logarithm + unit”

23
Q

Attention layers

A

Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V

  • queries Q = W_q·X + b_q·1ᵀ
  • keys K = W_k·X + b_k·1ᵀ
  • values V = W_v·X + b_v·1ᵀ

W_q, W_k, W_v are the parameter matrices

rows are tokens
columns are the embedding dimensions of the tokens
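A minimal NumPy sketch of scaled dot-product attention (rows are tokens, as above; the concrete sizes and random inputs are arbitrary assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # numerically stabilized
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_tokens, n_tokens) similarities
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out = attention(Q, K, V)
print(out.shape)   # (4, 8)
```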

24
Q

PyTorch Model

A
torch.manual_seed(42)
model = nn.Sequential(
    nn.Linear(3, 8),
    nn.ReLU(),
    nn.Linear(8, 4),
    nn.Sigmoid(),
    nn.Linear(4, 1))
model(X)
25
Q

Confusion matrix

A

TP FP
FN TN
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + FP + TN + FN)
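A minimal sketch computing these metrics from raw counts (the concrete counts are hypothetical):

```python
def confusion_metrics(tp, fp, fn, tn):
    """Precision, recall and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# hypothetical counts: 80 TP, 10 FP, 20 FN, 90 TN
p, r, a = confusion_metrics(tp=80, fp=10, fn=20, tn=90)
print(round(p, 3), round(r, 3), round(a, 3))   # 0.889 0.8 0.85
```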
26
Q

L1-Regularization

A

"Lasso regularization"
loss = 1/N ∑(ŷᵢ − yᵢ)² + λ ∑|wⱼ|
- leads to sparse models
- effectively performs feature selection (useful when some features might be irrelevant)
27
Q

Batch Normalization

A

y = γ·(x − μ_B) / √(σ_B² + ε) + β
where μ_B is the mini-batch mean (over the xᵢ), σ_B² the mini-batch variance, γ and β are learnable parameters, and ε is added for numerical stability.
- adjusts layer outputs to follow the same distribution regardless of noise in the inputs
- addresses internal covariate shift, where layers constantly have to adjust to changing input distributions
28
Q

Cosine similarity

A

cosine similarity(u, v) = cos(θ) = u·v / (‖u‖‖v‖) = (u₁v₁ + u₂v₂ + u₃v₃) / (‖u‖‖v‖) for 3D vectors
29
Q

ROC curve

A

- plot of the true positive rate against the false positive rate at varying classification thresholds
- a random classifier corresponds to the diagonal
- TPR = TP / (TP + FN) → probability that a true positive tests positive
- FPR = FP / (FP + TN) → probability that a true negative tests positive
30
Q

Tanh / Hyperbolic Tangent

A

tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
tanh′(x) = 1 − tanh²(x)
- maps the real numbers to (−1, 1)
- zero-centered: tanh(0) = 0
31
Q

Sigmoid function

A

σ(x) = 1 / (1 + e⁻ˣ)
σ′(x) = σ(x)(1 − σ(x))
- maps the real numbers to (0, 1)
- S-shaped with σ(0) = 0.5
32
Q

Absolute positional embedding

A

- PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
- PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
- d_model = dimensionality of the embedding
- i adjusts the frequency (higher i = lower frequency)
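A minimal NumPy sketch of this sinusoidal encoding (the sequence length and d_model below are arbitrary):

```python
import numpy as np

def positional_encoding(n_pos, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(n_pos)[:, None]         # (n_pos, 1)
    i = np.arange(d_model // 2)[None, :]    # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)            # even dimensions
    pe[:, 1::2] = np.cos(angles)            # odd dimensions
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)    # (50, 16)
print(pe[0, :4])   # at position 0: sin terms are 0, cos terms are 1
```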
33
Q

Adam optimizer

A

- first moment vector m: moving average of the gradients
- second moment vector v: moving average of the squared gradients
**1. Initialization:** m₀ = 0, v₀ = 0, t = 0
**2. Gradient:** g_t = ∇f(θ_{t−1}), where f is the loss function
**3.** m_t = β₁·m_{t−1} + (1 − β₁)·g_t
**4.** v_t = β₂·v_{t−1} + (1 − β₂)·g_t²
**5. Correct the bias toward 0 from the initialization:** m̂_t = m_t/(1 − β₁ᵗ), v̂_t = v_t/(1 − β₂ᵗ)
**6. Update parameters:** θ_t = θ_{t−1} − lr · m̂_t / (√v̂_t + ε)
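The update rule can be sketched in NumPy; the hyperparameters β₁ = 0.9, β₂ = 0.999, ε = 1e-8 are the usual defaults (an assumption of the sketch), and the toy objective f(θ) = ‖θ‖² is chosen only for illustration:

```python
import numpy as np

def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, n_steps=500):
    m = np.zeros_like(theta)   # first moment (average of gradients)
    v = np.zeros_like(theta)   # second moment (average of squared gradients)
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# toy problem: f(theta) = ||theta||^2, gradient 2*theta, minimum at 0
theta = adam_minimize(lambda th: 2 * th, np.array([3.0, -2.0]))
print(theta)   # close to [0, 0]
```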
34
Q

L2-Regularization

A

"Ridge regression"
loss = 1/N ∑(ŷᵢ − yᵢ)² + λ ∑wⱼ²
→ encourages smaller, more evenly distributed weights
35
Q

Logarithm rules

A

Definition: a^(log_a(b)) = b
Rules:
- log_b(xy) = log_b(x) + log_b(y)
- log_b(x/y) = log_b(x) − log_b(y)
- log_b(xⁿ) = n·log_b(x)
- log_b(1) = 0
- as x → 0⁺, log(x) → −∞
36
Q

Matthews correlation coefficient (MCC)

A

MCC = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
- similar to the F1-score, but even more robust, since it incorporates all parts of the confusion matrix
37
Q

F1-score

A

F1 = 2·(precision·recall) / (precision + recall)
- with an imbalanced class distribution, precision and recall are the most informative metrics
38
Q

What is the kernel trick for SVM?

A

A trick to apply an SVM to non-linearly separable data: the data is implicitly mapped to a higher-dimensional space.
- Linear kernel: K(xᵢ, xⱼ) = xᵢ·xⱼ
- Radial basis function: K(xᵢ, xⱼ) = exp(−γ‖xᵢ − xⱼ‖²)
39
Q

T-test

A

Assumptions: normality, independence
t = (x̄ − μ₀) / (s/√n), where x̄ is the sample mean, μ₀ the population mean under the null hypothesis, s the sample standard deviation and n the sample size.
The test statistic is compared to a t-distribution with n − 1 degrees of freedom; if the resulting p-value is smaller than α, the null hypothesis is rejected.
40
Q

Rotary Positional Embedding (RoPE)

A

Instead of adding a positional embedding to the token embedding, pairs of embedding dimensions are rotated by an angle that depends on the token position.
1. Pair up dimensions: group the dimensions of a query or key vector into pairs (x₁, x₂), (x₃, x₄), …; each pair is treated as a 2D vector.
2. Assign a rotation angle per position: for token position p, pair i gets the angle θ_{p,i} = p·ωᵢ, where ωᵢ is a fixed frequency that differs per dimension pair. Low-index pairs rotate quickly, high-index pairs slowly.
3. Rotate the vector: each 2D pair is rotated with a standard rotation matrix:
x′_{2i} = cos(θ_{p,i})·x_{2i} − sin(θ_{p,i})·x_{2i+1}
x′_{2i+1} = sin(θ_{p,i})·x_{2i} + cos(θ_{p,i})·x_{2i+1}
This is applied to queries and keys, but not to values.
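A minimal NumPy sketch of the rotation; the frequency schedule ωᵢ = 10000^(−2i/d) follows the common convention and is an assumption of the sketch:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate pairs of dimensions of x by position-dependent angles."""
    d = x.shape[-1]
    i = np.arange(d // 2)
    theta = pos * base ** (-2 * i / d)   # one angle per dimension pair
    x1, x2 = x[0::2], x[1::2]            # split into (even, odd) pairs
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

q = np.array([1.0, 0.0, 1.0, 0.0])
print(rope(q, pos=0))                    # position 0: no rotation
print(np.linalg.norm(rope(q, pos=7)))    # rotation preserves the norm
```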
41
Q

GeLU function

A

GELU(x) = x·Φ(x), where Φ(x) is the cumulative distribution function of the standard normal distribution
- activation function that is an alternative to ReLU
- gives a smooth transition instead of a hard kink at 0
42
Q

What’s the backprop rule at a node with multiple children?

A

If v influences multiple downstream nodes u₁, …, uₖ, then the gradients add:
∂L/∂v = ∑ᵢ (∂L/∂uᵢ)(∂uᵢ/∂v)
Interpretation: sum the contributions from every outgoing path.
43
Q

What is the formula for binary cross entropy?

A

BCE = −[y·log(q) + (1 − y)·log(1 − q)]
with target y ∈ {0, 1} and predicted probability q = p̂(y = 1)
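A minimal sketch of the formula in plain Python (the clipping epsilon is an assumption added to avoid log(0)):

```python
import math

def bce(y, q, eps=1e-12):
    """Binary cross entropy for one target y in {0,1} and predicted q."""
    q = min(max(q, eps), 1 - eps)   # clip to avoid log(0)
    return -(y * math.log(q) + (1 - y) * math.log(1 - q))

print(round(bce(1, 0.9), 4))   # 0.1054 (confident and correct: small loss)
print(round(bce(1, 0.1), 4))   # 2.3026 (confident and wrong: large loss)
```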