Physical study cards Flashcards

(43 cards)

1
Q

Linear regression

A

Y = Xβ + ε, where Y is the vector of observed values, X the feature matrix, β the vector of regression coefficients, and ε the vector of errors (residuals).
We can solve this for β via the normal equations: β = (XᵀX)⁻¹XᵀY, which requires XᵀX to be invertible.
If XᵀX is not invertible, we can do ridge regression instead or use the pseudo-inverse.
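As a minimal NumPy sketch of the closed-form solution (the toy data below is an arbitrary assumption of the example, not from the card):

```python
import numpy as np

# toy, noise-free data generated from known coefficients
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
beta_true = np.array([2.0, 3.0])
Y = X @ beta_true

# normal equations: beta = (X^T X)^{-1} X^T Y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y

# the pseudo-inverse gives the same result and also covers the singular case
beta_pinv = np.linalg.pinv(X) @ Y

print(beta_hat)   # close to [2. 3.]
```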

2
Q

What does backpropagation do?

A

Backprop computes ∇θL efficiently by applying the chain rule on a computation graph and reusing intermediate results.
Forward pass: compute activations and cache needed intermediates.
Backward pass: propagate upstream gradients δv = ∂L/∂v from output back to parameters.
Key rule: if y = f(x), then δx = δy · (∂y/∂x)

3
Q

Model training code in PyTorch

A
num_epochs = 1000
for epoch in range(num_epochs):
    predictions = model(X)        # forward pass
    MSE = loss(predictions, y)    # e.g. loss = nn.MSELoss()
    MSE.backward()                # backpropagate gradients
    optimizer.step()              # update parameters
    optimizer.zero_grad()         # clear gradients for the next step
4
Q

Model evaluation code in PyTorch

A
model.eval()                      # e.g. disables dropout, freezes batch-norm statistics
with torch.no_grad():             # no gradient tracking needed for evaluation
    predictions = model(X_test)
    test_MSE = loss(predictions, y_test)
5
Q

Random forest

A
  • ML model consisting of an ensemble of decision trees
  • Randomness between the trees is introduced by:
    1. Bootstrap aggregation (bagging): each tree is trained on a random sample of the training data, drawn with replacement
    2. Random feature selection: at each node, only a random subset of the features is considered for splitting
6
Q

How to calculate entropy (for information gain), e.g. in random forests?

A

H(X) = − ∑ᵢ₌₁ᶜ pᵢ·log₂(pᵢ), where c is the number of classes and pᵢ the fraction of samples in class i
→ the goal is low entropy in the child nodes

7
Q

How to calculate Gini Impurity?

A

Gini(X) = 1 − ∑ pᵢ²
Gini(0, 1) = 0
Gini(0.5, 0.5) = 0.5
Gini(0.3, 0.7) = 1 − 0.09 − 0.49 = 0.42
A more unequal distribution means more certainty and therefore lower impurity.
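A minimal sketch computing both impurity measures from class probabilities (plain NumPy, not tied to any particular library):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H = -sum(p_i * log2(p_i)), ignoring zero probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0 * log(0) is treated as 0
    return -np.sum(p * np.log2(p))

def gini(p):
    """Gini impurity G = 1 - sum(p_i^2)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

print(entropy([0.5, 0.5]))            # 1.0 (maximally uncertain, 2 classes)
print(gini([0.5, 0.5]))               # 0.5
print(round(gini([0.3, 0.7]), 2))     # 0.42
print(gini([0.0, 1.0]))               # 0.0 (pure node)
```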

8
Q

What are Boosted trees?

A

An ensemble built sequentially: each new tree is trained to emphasize the samples that the previous trees mismodeled, e.g. AdaBoost.

9
Q

What is the cross-entropy loss for masked language modeling (MLM)?

A

L_MLM = −(1/|B|) ∑_{s∈B} ∑_{i∈M} log p(sᵢ | s_{Mᶜ})
where B is the batch, M the set of masked token positions, and s_{Mᶜ} the sequence with the masked positions hidden

10
Q

What is a support vector machine?

A
  • the model finds the hyperplane that best separates the classes
  • objective: minimize ½‖w‖² subject to the constraints yᵢ(w·xᵢ + b) ≥ 1, where the labels yᵢ ∈ {−1, 1}
  • for non-linear cases the kernel trick can be applied
11
Q

What is a singular value decomposition?

A
  • a matrix decomposition used for dimensionality reduction (e.g. in PCA) and for data compression
  • A = UΣVᵀ, where U and V are orthogonal matrices and Σ is diagonal
  • orthogonal means QQᵀ = QᵀQ = I, i.e. a linear transformation that preserves vector norms
    It is calculated as follows:
  • U has the eigenvectors of AAᵀ as columns
  • V has the eigenvectors of AᵀA as columns
  • Σ has the singular values (the square roots of the eigenvalues of AᵀA) as diagonal elements, in descending order
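A minimal NumPy sketch verifying these properties (the concrete matrix is an arbitrary example):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# reconstruction: A = U @ diag(s) @ V^T
assert np.allclose(U @ np.diag(s) @ Vt, A)

# singular values = square roots of the eigenvalues of A^T A
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]   # sorted descending
assert np.allclose(s, np.sqrt(eigvals))

# columns of U (and rows of V^T) are orthonormal
assert np.allclose(U.T @ U, np.eye(2))
print(s)   # singular values in descending order
```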
12
Q

What is UMAP?

A

“Uniform Manifold Approximation and Projection”
1. Construct a high-dimensional graph: compute distances between data points using a metric, find the k nearest neighbors, and create a graph where the probability of two points being connected decays as e^(−d(x,y)/σᵢ)
2. Optimization: find a low-dimensional representation of the graph by minimizing the cross-entropy between the two graphs
3. Attractive and repulsive forces make the low-dimensional distances match those in high dimension
4. The optimization uses stochastic gradient descent (SGD)

13
Q

What is PCA?

A

“Principal component analysis”
- converts a set of observations into a set of linearly uncorrelated variables
1. Standardization: zᵢ = (xᵢ − μᵢ) / σᵢ
2. Covariance matrix: C = 1/(n−1)·ZᵀZ
3. Eigenvalue decomposition: solve Cv = λv
4. Sort the eigenpairs in descending order: λ₁ ≥ … ≥ λₖ, where λ₁ is the largest eigenvalue and v₁, …, vₖ are the corresponding eigenvectors
5. Project the data onto the principal components: T = ZVₖ, where T is the transformed data with the rows containing the observations and the columns the principal components
- the total variance is then given by ∑ λᵢ
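The steps above can be sketched in NumPy (the toy data is an arbitrary assumption of the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.0, 0.1]])

# 1. standardize
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. covariance matrix
C = (Z.T @ Z) / (len(Z) - 1)

# 3./4. eigendecomposition, sorted by descending eigenvalue
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. project onto the first k principal components
k = 2
T = Z @ eigvecs[:, :k]

print(T.shape)        # (100, 2)
print(eigvals.sum())  # total variance, ≈ 3 for standardized 3-feature data
```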

14
Q

Normal distribution

A

f(x) = 1/(σ√(2π)) · e^(−½((x−μ)/σ)²)
where μ is the mean and σ the standard deviation;
the variance is σ².
For the standard normal distribution: μ = 0, σ = 1.

15
Q

Recurrent layers

A

Network architecture designed to process sequential data; has the potential for an infinite impulse response.
Hidden state:
h_t = f_h(W_hx·x_t + W_hh·h_{t−1} + b_h)
Output:
y_t = f_y(W_yh·h_t + b_y)
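The recurrence can be sketched in NumPy; the dimensions, random weights and the choice of tanh as hidden activation (with an identity output) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 2

W_hx = rng.normal(scale=0.1, size=(d_hidden, d_in))
W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
W_yh = rng.normal(scale=0.1, size=(d_out, d_hidden))
b_h = np.zeros(d_hidden)
b_y = np.zeros(d_out)

def rnn_forward(xs):
    """Run h_t = tanh(W_hx x_t + W_hh h_{t-1} + b_h) over a sequence."""
    h = np.zeros(d_hidden)
    ys = []
    for x in xs:
        h = np.tanh(W_hx @ x + W_hh @ h + b_h)   # hidden-state update
        ys.append(W_yh @ h + b_y)                # output y_t
    return np.array(ys), h

xs = rng.normal(size=(5, d_in))   # sequence of 5 inputs
ys, h_final = rnn_forward(xs)
print(ys.shape)   # (5, 2)
```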

16
Q

P-value

A

probability of obtaining data at least as extreme as the observed data, assuming the null hypothesis is true

17
Q

Pearson correlation

A

ρ_X,Y = Cov(X, Y) / (σ_X·σ_Y), where σ is the standard deviation

18
Q

Covariance

A

Cov(X, Y) = 1/n ∑(xᵢ − E(X))(yᵢ − E(Y)), where E(X) is the expected value

19
Q

How to prevent overfitting? List 7 methods

A
  • Cross-validation
  • Regularization (L1/L2)
  • Pruning (e.g. in decision trees)
  • Early stopping
  • Dropout
  • Simpler model with fewer parameters
  • Ensemble methods
20
Q

How can you optimize model architecture and execution?

A
  • Mixed precision training: single precision (32-bit) for weights, biases and losses; half precision (16-bit) for activations and gradients. This increases speed and decreases memory requirements.
  • Gradient checkpointing: only store activations at checkpoint layers and recompute the others during the backward pass (activations are computed in the forward pass and needed again for the backward gradient computation)
  • Model simplification:
    a. pruning: remove unimportant parameters
    b. quantization: reduce precision
    c. knowledge distillation: smaller model trained to mimic larger model
21
Q

Transformer architecture

A

Input sequence -> Tokenizer -> Input embedding layer + positional encoding
Encoder stack:
- multi-head self-attention
- position-wise fully connected feed forward
Decoder stack:
- masked self-attention over the decoder input
- cross-attention: Q from the decoder, K and V from the encoder output
- feed-forward
Output layer:
- linear transformation
- softmax: to create output probabilities

22
Q

Logit function

A

logit(p) = σ⁻¹(p) = ln(p / (1 − p)) for p ∈ (0, 1)
- maps (0, 1) to (−∞, ∞)
- inverse of the sigmoid function
- logit(0.5) = 0
“logit = logarithm + unit”

23
Q

Attention layers

A

Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V

  • queries Q = W_q·X + b_q·1ᵀ
  • keys K = W_k·X + b_k·1ᵀ
  • values V = W_v·X + b_v·1ᵀ

W_q, W_k, W_v are the parameter matrices

rows are tokens
columns are the embedding dimensions of the tokens
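A minimal NumPy sketch of scaled dot-product attention (rows are tokens, as above; the concrete sizes and random inputs are arbitrary assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # numerically stabilized
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_tokens, n_tokens) similarities
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out = attention(Q, K, V)
print(out.shape)   # (4, 8)
```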

24
Q

PyTorch Model

A
torch.manual_seed(42)
model = nn.Sequential(
    nn.Linear(3, 8),
    nn.ReLU(),
    nn.Linear(8, 4),
    nn.Sigmoid(),
    nn.Linear(4, 1))
model(X)
25
Q

Confusion matrix

A

TP FP
FN TN
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + FP + TN + FN)
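A minimal sketch computing these metrics from raw counts (the concrete counts are hypothetical):

```python
def confusion_metrics(tp, fp, fn, tn):
    """Precision, recall and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# hypothetical counts: 80 TP, 10 FP, 20 FN, 90 TN
p, r, a = confusion_metrics(tp=80, fp=10, fn=20, tn=90)
print(round(p, 3), round(r, 3), round(a, 3))   # 0.889 0.8 0.85
```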
26
Q

L1-Regularization

A

"Lasso regularization"
loss = 1/N ∑(ŷᵢ − yᵢ)² + λ ∑|wⱼ|
- leads to sparse models
- effectively performs feature selection (useful when some features might be irrelevant)
27
Q

Batch Normalization

A

y = γ·(x − μ_B) / √(σ_B² + ε) + β
where μ_B is the mini-batch mean (over the xᵢ), σ_B² the mini-batch variance, γ and β are learnable parameters, and ε is added for numerical stability.
- adjusts layer outputs to follow the same distribution regardless of noise in the inputs
- addresses internal covariate shift, where layers constantly have to adjust to changing input distributions
28
Q

Cosine similarity

A

cosine similarity(u, v) = cos(θ) = u·v / (‖u‖‖v‖) = (u₁v₁ + u₂v₂ + u₃v₃) / (‖u‖‖v‖) for 3D vectors
29
Q

ROC curve

A

- plot of the true positive rate against the false positive rate at varying classification thresholds
- a random classifier corresponds to the diagonal
- TPR = TP / (TP + FN) → probability that a true positive tests positive
- FPR = FP / (FP + TN) → probability that a true negative tests positive
30
Q

Tanh / Hyperbolic Tangent

A

tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
tanh′(x) = 1 − tanh²(x)
- maps the real numbers to (−1, 1)
- zero-centered: tanh(0) = 0
31
Q

Sigmoid function

A

σ(x) = 1 / (1 + e⁻ˣ)
σ′(x) = σ(x)(1 − σ(x))
- maps the real numbers to (0, 1)
- S-shaped with σ(0) = 0.5
32
Q

Absolute positional embedding

A

- PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
- PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
- d_model = dimensionality of the embedding
- i adjusts the frequency (higher i = lower frequency)
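A minimal NumPy sketch of this sinusoidal encoding (the sequence length and d_model below are arbitrary):

```python
import numpy as np

def positional_encoding(n_pos, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(n_pos)[:, None]         # (n_pos, 1)
    i = np.arange(d_model // 2)[None, :]    # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)            # even dimensions
    pe[:, 1::2] = np.cos(angles)            # odd dimensions
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)    # (50, 16)
print(pe[0, :4])   # at position 0: sin terms are 0, cos terms are 1
```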
33
Q

Adam optimizer

A

- first moment vector m: moving average of the gradients
- second moment vector v: moving average of the squared gradients
**1. Initialization:** m₀ = 0, v₀ = 0, t = 0
**2. Gradient:** g_t = ∇f(θ_{t−1}), where f is the loss function
**3.** m_t = β₁·m_{t−1} + (1 − β₁)·g_t
**4.** v_t = β₂·v_{t−1} + (1 − β₂)·g_t²
**5. Correct the bias toward 0 from the initialization:** m̂_t = m_t/(1 − β₁ᵗ), v̂_t = v_t/(1 − β₂ᵗ)
**6. Update parameters:** θ_t = θ_{t−1} − lr · m̂_t / (√v̂_t + ε)
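The update rule can be sketched in NumPy; the hyperparameters β₁ = 0.9, β₂ = 0.999, ε = 1e-8 are the usual defaults (an assumption of the sketch), and the toy objective f(θ) = ‖θ‖² is chosen only for illustration:

```python
import numpy as np

def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, n_steps=500):
    m = np.zeros_like(theta)   # first moment (average of gradients)
    v = np.zeros_like(theta)   # second moment (average of squared gradients)
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# toy problem: f(theta) = ||theta||^2, gradient 2*theta, minimum at 0
theta = adam_minimize(lambda th: 2 * th, np.array([3.0, -2.0]))
print(theta)   # close to [0, 0]
```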
34
Q

L2-Regularization

A

"Ridge regression"
loss = 1/N ∑(ŷᵢ − yᵢ)² + λ ∑wⱼ²
→ encourages smaller, more evenly distributed weights
35
Q

Logarithm rules

A

Definition: a^(log_a(b)) = b
Rules:
- log_b(xy) = log_b(x) + log_b(y)
- log_b(x/y) = log_b(x) − log_b(y)
- log_b(xⁿ) = n·log_b(x)
- log_b(1) = 0
- as x → 0⁺, log(x) → −∞
36
Q

Matthews correlation coefficient (MCC)

A

MCC = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
- similar to the F1-score, but even more robust, since it incorporates all parts of the confusion matrix
37
Q

F1-score

A

F1 = 2·(precision·recall) / (precision + recall)
- with an imbalanced class distribution, precision and recall are the most informative metrics
38
Q

What is the kernel trick for SVM?

A

A trick to apply an SVM to non-linearly separable data: the data is implicitly mapped to a higher-dimensional space.
- Linear kernel: K(xᵢ, xⱼ) = xᵢ·xⱼ
- Radial basis function: K(xᵢ, xⱼ) = exp(−γ‖xᵢ − xⱼ‖²)
39
Q

T-test

A

Assumptions: normality, independence
t = (x̄ − μ₀) / (s/√n), where x̄ is the sample mean, μ₀ the population mean under the null hypothesis, s the sample standard deviation and n the sample size.
The test statistic is compared to a t-distribution with n − 1 degrees of freedom; if the resulting p-value is smaller than α, the null hypothesis is rejected.
40
Q

Rotary Positional Embedding (RoPE)

A

Instead of adding a positional embedding to the token embedding, pairs of embedding dimensions are rotated by an angle that depends on the token position.
1. Pair up dimensions: group the dimensions of a query or key vector into pairs (x₁, x₂), (x₃, x₄), …; each pair is treated as a 2D vector.
2. Assign a rotation angle per position: for token position p, pair i gets the angle θ_{p,i} = p·ωᵢ, where ωᵢ is a fixed frequency that differs per dimension pair. Low-index pairs rotate quickly, high-index pairs slowly.
3. Rotate the vector: each 2D pair is rotated with a standard rotation matrix:
x′_{2i} = cos(θ_{p,i})·x_{2i} − sin(θ_{p,i})·x_{2i+1}
x′_{2i+1} = sin(θ_{p,i})·x_{2i} + cos(θ_{p,i})·x_{2i+1}
This is applied to queries and keys, but not to values.
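A minimal NumPy sketch of the rotation; the frequency schedule ωᵢ = 10000^(−2i/d) follows the common convention and is an assumption of the sketch:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate pairs of dimensions of x by position-dependent angles."""
    d = x.shape[-1]
    i = np.arange(d // 2)
    theta = pos * base ** (-2 * i / d)   # one angle per dimension pair
    x1, x2 = x[0::2], x[1::2]            # split into (even, odd) pairs
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

q = np.array([1.0, 0.0, 1.0, 0.0])
print(rope(q, pos=0))                    # position 0: no rotation
print(np.linalg.norm(rope(q, pos=7)))    # rotation preserves the norm
```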
41
Q

GeLU function

A

GELU(x) = x·Φ(x), where Φ(x) is the cumulative distribution function of the standard normal distribution
- activation function that is an alternative to ReLU
- gives a smooth transition instead of a hard kink at 0
42
Q

What’s the backprop rule at a node with multiple children?

A

If v influences multiple downstream nodes u₁, …, uₖ, then the gradients add:
∂L/∂v = ∑ᵢ (∂L/∂uᵢ)(∂uᵢ/∂v)
Interpretation: sum the contributions from every outgoing path.
43
Q

What is the formula for binary cross entropy?

A

BCE = −[y·log(q) + (1 − y)·log(1 − q)]
with target y ∈ {0, 1} and predicted probability q = p̂(y = 1)
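A minimal sketch of the formula in plain Python (the clipping epsilon is an assumption added to avoid log(0)):

```python
import math

def bce(y, q, eps=1e-12):
    """Binary cross entropy for one target y in {0,1} and predicted q."""
    q = min(max(q, eps), 1 - eps)   # clip to avoid log(0)
    return -(y * math.log(q) + (1 - y) * math.log(1 - q))

print(round(bce(1, 0.9), 4))   # 0.1054 (confident and correct: small loss)
print(round(bce(1, 0.1), 4))   # 2.3026 (confident and wrong: large loss)
```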