What is the geometric interpretation of a matrix multiplying a vector?
A linear transformation that can scale, rotate, shear, reflect, or project the vector in space.
What determines whether a matrix is invertible?
It must be square.
Its determinant must be non-zero.
Its columns/rows must be linearly independent.
What does it mean for vectors to be linearly independent?
No vector can be written as a linear combination of the others.
What is the rank of a matrix, and why is it important in ML?
Rank = number of linearly independent columns.
In ML it affects:
Feature redundancy
Identifiability of solutions
Behavior of least-squares
Condition number (stability)
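A quick NumPy check of rank-revealing redundancy (matrix values are illustrative):

```python
import numpy as np

# A 3x3 matrix with a redundant feature: col3 = col1 + col2,
# so the rank is 2, not 3.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [2.0, 3.0, 5.0]])
print(np.linalg.matrix_rank(A))  # 2
```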
What is the null space of a matrix?
Set of all vectors x such that Ax = 0.
In ML, it identifies feature directions that have no effect on predictions.
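A minimal sketch of finding a null-space direction via the SVD (matrix chosen for illustration):

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])   # rank 2, so a 1-D null space in R^3

# Null-space basis = right singular vectors whose singular values are zero.
_, s, Vt = np.linalg.svd(A)
x = Vt[-1]                        # direction with singular value 0
print(np.allclose(A @ x, 0))      # True: A maps x to the zero vector
```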
What is the determinant intuitively?
A scaling factor for volume under the matrix transformation.
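For example, a diagonal scaling matrix multiplies areas by the product of its diagonal entries:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])        # scales x by 2 and y by 3
# The unit square maps to a 2x3 rectangle: area scales by |det A| = 6.
print(np.linalg.det(A))           # 6.0
```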
Why is linear algebra foundational to gradient-based optimization?
Gradients and Hessians are vectors/matrices; optimization depends on:
Vector calculus
Matrix multiplications
Eigenvalues of Hessian → curvature
Why is linear regression often solved using QR decomposition instead of the normal equation?
Because QR is numerically more stable;
forming AᵀA in the normal equation squares the condition number (κ(AᵀA) = κ(A)²), amplifying conditioning issues.
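A sketch comparing the two solvers on a random (well-conditioned) problem:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
b = rng.normal(size=100)

# QR-based least squares: solve R x = Q^T b instead of (A^T A) x = A^T b.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# Normal-equation solution for comparison (squares the condition number).
x_ne = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(x_qr, x_ne))   # True here; they diverge as A becomes ill-conditioned
```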
What is the connection between PCA and the SVD?
PCA = SVD of centered data matrix:
X = U Σ Vᵀ
Principal components = columns of V.
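A minimal sketch of PCA via the SVD, cross-checked against the covariance eigendecomposition (random data for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)                 # center the data first

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt                          # principal directions (rows of V^T)
explained_var = S**2 / (len(X) - 1)      # variance along each component

# Cross-check against the eigendecomposition of the covariance matrix.
cov = Xc.T @ Xc / (len(X) - 1)
eigvals = np.linalg.eigvalsh(cov)[::-1]  # sorted descending
print(np.allclose(explained_var, eigvals))  # True
```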
Why do eigenvalues matter for training stability in deep learning?
Hessian eigenvalues indicate curvature:
Very large → exploding gradients
Very small → vanishing gradients
Spread → ill-conditioned optimization
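The conditioning point can be seen with gradient descent on a toy quadratic whose Hessian has a wide eigenvalue spread (values are illustrative):

```python
import numpy as np

# Gradient descent on f(x) = 0.5 x^T H x; stability requires lr < 2 / lambda_max.
H = np.diag([100.0, 1.0])        # ill-conditioned: eigenvalue spread 100:1
x = np.array([1.0, 1.0])
lr = 1.9 / 100.0                  # just under the stability threshold
for _ in range(1000):
    x = x - lr * (H @ x)
# Converges, but progress along the small-eigenvalue direction is very slow.
print(np.linalg.norm(x) < 1e-3)   # True
```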
In vectorized ML code, why is broadcasting important?
Allows computation across batches or dimensions without explicit loops → faster GPU/TPU execution.
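A minimal broadcasting example, adding a per-feature bias across a whole batch with no loop:

```python
import numpy as np

batch = np.ones((32, 128))   # batch of 32 vectors, feature dim 128
bias = np.arange(128.0)      # shape (128,)

# Broadcasting stretches `bias` across the batch dimension automatically.
out = batch + bias           # shape (32, 128), no explicit Python loop
print(out.shape)             # (32, 128)
```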
How does matrix multiplication underpin transformer attention?
Attention scores = QKᵀ/√d_k compute pairwise similarities.
Then apply a row-wise softmax and multiply by V.
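A minimal NumPy sketch of scaled dot-product attention (shapes are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one output vector per query token
```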
Why is the softmax attention matrix low-rank in many real tasks?
Tokens often lie in a lower-dimensional semantic subspace.
This low-rank structure motivates techniques like:
Low-rank attention approximations (e.g., Linformer)
Low-rank adapters (LoRA)
How does LoRA use linear algebra to reduce fine-tuning cost?
LoRA factorizes the weight update as a low-rank product:
ΔW = BA
where B (d × r) and A (r × d) are small matrices with rank r ≪ d.
Reduces trainable parameters by orders of magnitude.
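A quick parameter-count sketch (the layer width d and rank r below are illustrative, not from any specific model):

```python
import numpy as np

d, r = 512, 8                      # hypothetical layer width and LoRA rank
full_params = d * d                # parameters in a full d x d update
lora_params = 2 * d * r            # B is d x r, A is r x d

print(full_params // lora_params)  # 32x fewer trainable parameters

# The product reconstructs a d x d update of rank at most r:
B = np.random.randn(d, r)
A = np.random.randn(r, d)
print(np.linalg.matrix_rank(B @ A) <= r)  # True
```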
In agentic workflows, why is vector-space embedding similarity critical?
Agents must:
Retrieve memory
Plan next steps
Rank tools/skills
Route tasks between models
Similarity computations reduce to dot products (cosine similarity = dot product of unit-normalized vectors).
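A minimal cosine-similarity helper:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity = dot product divided by the vector norms.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
print(round(cosine_sim(a, b), 4))  # 0.7071 (45-degree angle)
```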
How does linear algebra support vector databases used in agent memory?
Vector search = nearest neighbors in high-dimensional space using:
L2 distance
Inner product
Approximate search using low-rank projections
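A brute-force inner-product search sketch, the exact baseline that approximate indexes speed up (data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 64))               # stored embeddings
db /= np.linalg.norm(db, axis=1, keepdims=True)  # unit-normalize rows

query = db[42] + 0.01 * rng.normal(size=64)    # noisy copy of entry 42

# Exact nearest neighbor by inner product (= cosine, since rows are unit norm).
scores = db @ query
print(int(np.argmax(scores)))                  # 42
```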
What is the Moore-Penrose pseudoinverse?
Generalized inverse for non-square or rank-deficient matrices.
Used for:
Linear regression
Solving Ax = b when no exact solution exists
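A minimal example of the pseudoinverse on an overdetermined system (values are illustrative):

```python
import numpy as np

# Overdetermined system: 3 equations, 2 unknowns, no exact solution.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 1.0, 0.0])

x = np.linalg.pinv(A) @ b   # least-squares solution via the pseudoinverse
# Matches the dedicated least-squares solver:
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```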
Explain how power iteration finds dominant eigenvalues.
Repeatedly apply A to a vector (renormalizing each step); it converges to the eigenvector of the largest-magnitude eigenvalue.
Useful in:
PageRank
Large-scale graph analysis
Real-time agent routing
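A minimal power-iteration sketch on a small symmetric matrix with known eigenvalues:

```python
import numpy as np

def power_iteration(A, iters=100):
    # Repeatedly apply A and renormalize; v aligns with the dominant eigenvector.
    v = np.ones(A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return v @ A @ v              # Rayleigh quotient = dominant eigenvalue

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])        # eigenvalues are 3 and 1
print(round(power_iteration(A), 6))  # 3.0
```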
Why do RNNs suffer from vanishing gradients in terms of linear algebra?
Repeated multiplication by weight matrix W:
If all eigenvalues of W have magnitude < 1 (spectral radius < 1), repeated products shrink exponentially → gradients vanish.
The network can't learn long-term dependencies because signals from early timesteps are lost as gradients vanish.
If the spectral radius is > 1 → gradients explode.
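A quick numerical sketch of the shrinking effect: repeated multiplication by a matrix with spectral radius 0.9 (stand-in for backprop through many timesteps; values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 10))
W = 0.9 * W / np.max(np.abs(np.linalg.eigvals(W)))  # rescale to spectral radius 0.9

g = np.ones(10)                   # stand-in for a gradient signal
for _ in range(200):              # 200 "timesteps" of backprop
    g = W.T @ g
print(np.linalg.norm(g))          # tiny: the signal has decayed exponentially
```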
What is the Kronecker product and where is it used?
Block matrix product: A ⊗ B replaces each entry aᵢⱼ of A with the block aᵢⱼ·B.
Used in:
Gaussian processes
Structured neural networks
Fourier features
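A minimal example, including the mixed-product identity that makes Kronecker structure cheap to work with in Gaussian processes:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.eye(2, dtype=int)

K = np.kron(A, B)   # each entry a_ij of A becomes the block a_ij * B
print(K.shape)      # (4, 4): shapes multiply, (2*2, 2*2)

# Mixed-product identity: (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD)
print(np.allclose(np.kron(A, B) @ np.kron(A, B), np.kron(A @ A, B @ B)))  # True
```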