What is the goal of dimensionality reduction?
To reduce the number of features while retaining as much information (variance, structure, or signal) as possible. Benefits include:
Visualization
Noise reduction
Faster computation
Avoiding the curse of dimensionality
Difference between feature selection and feature extraction?
Selection: choose a subset of original features
Extraction: create new features (linear/nonlinear combinations) from original features
What is the “curse of dimensionality”?
As dimensions increase, data becomes sparse, distance metrics lose meaning, and models overfit → DR mitigates this.
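The sparsity effect can be shown numerically; a small sketch (sample sizes and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
contrasts = {}
# As dimension grows, the gap between the nearest and farthest point
# shrinks relative to the nearest distance ("relative contrast").
for d in (2, 100, 10_000):
    X = rng.random((500, d))              # 500 uniform points in [0, 1]^d
    q = rng.random(d)                     # a query point
    dists = np.linalg.norm(X - q, axis=1)
    contrasts[d] = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:>6}  relative contrast={contrasts[d]:.3f}")
```

The contrast collapses toward 0 as d grows, which is why nearest-neighbor distances stop being informative in high dimensions.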
Why is dimensionality reduction important for LLM embeddings?
Reduce 1536/4096-dim embeddings → faster retrieval and similarity search
Noise reduction → improves clustering & vector search ranking
Visualizing embedding space to debug retrieval or tool routing
What is PCA?
PCA is a linear DR method that projects data onto orthogonal directions (principal components) that maximize variance.
How is PCA computed?
Center data
Compute covariance matrix
Eigen decomposition of covariance matrix
Select top-k eigenvectors → project data
OR via SVD:
X = U Σ V^T  (on the centered data; columns of V are the principal directions)
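The steps above can be sketched with NumPy (a minimal illustration, not a production implementation):

```python
import numpy as np

def pca_svd(X, k):
    """PCA of X (n_samples x n_features) via SVD, keeping the top-k components."""
    Xc = X - X.mean(axis=0)                             # 1. center
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)   # 2. X_c = U S V^T
    components = Vt[:k]                                 # rows = principal directions
    explained_var = S[:k] ** 2 / (len(X) - 1)           # eigenvalues of the covariance
    return Xc @ components.T, components, explained_var

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Z, comps, ev = pca_svd(X, k=2)
print(Z.shape)  # (200, 2)
```

The squared singular values divided by n−1 equal the covariance-matrix eigenvalues, so both routes give the same components.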
What do eigenvalues in PCA represent?
Variance explained by each principal component. Larger eigenvalue → more variance captured.
How do you choose the number of components in PCA?
Cumulative explained variance (e.g., 90–95%)
Scree plot (elbow method)
Downstream task performance
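A sketch of the cumulative-variance rule, assuming scikit-learn is available (the data here is synthetic):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20)) @ rng.normal(size=(20, 20))  # correlated features

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumvar, 0.95) + 1)   # smallest k reaching 95% variance
print(k, round(cumvar[k - 1], 3))
```

The scree-plot elbow is the same information read visually from `pca.explained_variance_`.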
What preprocessing is required before PCA?
Centering (subtract mean)
Standardization (divide by std) if features have different scales
Optional whitening for decorrelated components
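A sketch of standardize-then-PCA, assuming scikit-learn (the feature scales are deliberately contrived):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two independent features on wildly different scales: without standardization,
# the large-scale feature would dominate the first principal component.
X = np.column_stack([rng.normal(0, 1, 500), rng.normal(0, 1000, 500)])

pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
Z = pipe.fit_transform(X)
print(pipe.named_steps["pca"].explained_variance_ratio_)  # roughly balanced after scaling
```

Without the scaler, the first component would explain essentially all the variance purely because of units.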
What are limitations of PCA?
Linear assumption → cannot capture nonlinear manifolds
Sensitive to outliers
Components may be hard to interpret
Variance ≠ predictive power
When would you use Kernel PCA?
Data lies on a nonlinear manifold
Classical PCA fails to capture structure
Kernel trick maps to higher-dim space before PCA
What are t-SNE and UMAP?
t-SNE: nonlinear DR for visualization, preserves local neighborhood structure
UMAP: faster nonlinear DR, preserves local + some global structure, scalable
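A minimal t-SNE sketch on a digits subsample, assuming scikit-learn (UMAP lives in the third-party `umap-learn` package):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)   # 64-dimensional digit images
X = X[:500]                           # subsample to keep the run fast
Z = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
print(Z.shape)  # (500, 2)
```

UMAP has a similar fit_transform API (`umap.UMAP(n_neighbors=15, min_dist=0.1)`); its `n_neighbors` parameter trades local against global structure.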
What is a risk of using t-SNE for downstream ML tasks?
t-SNE is stochastic → inconsistent embeddings
Distances are not globally meaningful
Mainly for visualization, not feature extraction for classifiers
How are autoencoders used for DR?
Neural networks learn bottleneck representation (compressed latent space)
Nonlinear DR → capture complex manifolds
Can reconstruct original features → minimize information loss
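A minimal linear-autoencoder sketch in plain NumPy (synthetic low-rank data; learning rate and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with intrinsic dimension 2 embedded in 10-D, plus noise.
Z_true = rng.normal(size=(500, 2)) * np.array([3.0, 1.0])
basis, _ = np.linalg.qr(rng.normal(size=(10, 2)))   # orthonormal 10-D directions
X = Z_true @ basis.T + 0.05 * rng.normal(size=(500, 10))
X -= X.mean(axis=0)

# Encoder (10 -> 2) and decoder (2 -> 10), trained by full-batch
# gradient descent on reconstruction MSE.
W_enc = 0.1 * rng.normal(size=(10, 2))
W_dec = 0.1 * rng.normal(size=(2, 10))
lr = 0.02
losses = []
for _ in range(3000):
    Z = X @ W_enc                    # bottleneck (latent) codes
    err = Z @ W_dec - X              # reconstruction error
    losses.append(np.mean(err ** 2))
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print(f"reconstruction MSE: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

With purely linear layers the bottleneck recovers the PCA subspace; adding nonlinear activations (as in a real autoencoder) lets it capture curved manifolds.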
What is whitening?
Scaling PCA components so they have unit variance → decorrelated features, used in some preprocessing pipelines.
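A sketch of PCA whitening, assuming scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))   # correlated features

Z = PCA(whiten=True).fit_transform(X)
print(np.cov(Z.T).round(2))  # ≈ identity: decorrelated, unit variance
```

The cost of whitening is that relative variance information across components is discarded.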
Why is SVD useful for DR?
Decomposes data into orthogonal modes → captures variance
Can compute PCA efficiently via SVD
Handles rectangular matrices (more samples than features or vice versa)
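A truncated-SVD sketch in NumPy; by the Eckart–Young theorem the rank-k reconstruction is the best possible in Frobenius norm:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 30))            # rectangular matrix, no covariance needed

U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 10
A_k = U[:, :k] * S[:k] @ Vt[:k]           # best rank-k approximation

# Relative error is governed entirely by the discarded singular values.
rel_err = np.linalg.norm(A - A_k) / np.linalg.norm(A)
print(f"rank-{k} relative error: {rel_err:.3f}")
```

For large sparse matrices, randomized or truncated solvers (e.g. scikit-learn's TruncatedSVD) avoid computing the full decomposition.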
When would you prefer feature selection over extraction?
When interpretability is important
When computation is cheap and features are meaningful individually
How do you evaluate PCA or DR performance?
Explained variance ratio
Downstream task accuracy
Reconstruction error (for autoencoders)
Visual inspection for clustering or separation
When would you prefer feature extraction?
High-dimensional data (text, images, embeddings)
Reduce noise, redundancy
Downstream ML benefits from compressed representation
How do you decide between linear PCA and nonlinear methods?
Start with PCA for interpretability and speed
Use t-SNE, UMAP, or autoencoders if linear PCA fails to capture important structure
How is DR evaluated in retrieval/RAG pipelines?
Embedding compression → check retrieval recall@k
Clustering quality in latent space → silhouette score or kNN accuracy
Speed vs quality trade-off in vector search
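A sketch of measuring recall@k under PCA compression, using random stand-in embeddings (all sizes and the 256 → 32 reduction are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 256))                      # stand-in document embeddings
queries = docs[:50] + 0.1 * rng.normal(size=(50, 256))   # noisy queries

def topk(Q, D, k=10):
    # cosine similarity -> indices of the top-k documents per query
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
    return np.argsort(-Qn @ Dn.T, axis=1)[:, :k]

pca = PCA(n_components=32).fit(docs)
full = topk(queries, docs)
comp = topk(pca.transform(queries), pca.transform(docs))

# recall@10: fraction of the full-space top-10 retained after compression
recall = np.mean([len(set(a) & set(b)) / 10 for a, b in zip(full, comp)])
print(f"recall@10 after 256 -> 32 dims: {recall:.2f}")
```

Treating the full-dimensional top-k as ground truth lets you sweep `n_components` against recall to find the speed/quality knee.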
Your linear regression on PCA components fails. Why?
PCA maximizes variance, not predictive power
Some important low-variance features may be lost
Consider supervised DR (PLS, LDA, or autoencoder with supervised loss)
You reduce embeddings from 4096 → 128 dims via PCA. Recall@10 in retrieval drops slightly. What do you do?
Check explained variance → increase components
Try whitening/scaling
Consider nonlinear DR (autoencoder, UMAP)
Evaluate reconstruction or downstream metrics
t-SNE shows clusters but nearest neighbors in original space don’t match. Is this a problem?
Not necessarily a problem: t-SNE preserves local structure only approximately, distorts distances, and is stochastic → it is not suitable for quantitative nearest-neighbor tasks, only for qualitative visualization.