What does the cost function measure
Mismatch between model and data
What are the two types of supervised learning
Regression
Classification
What is the core assumption behind k-NN (smoothness assumption)
If observations are close in the input space, they are also close in the output space
What are the steps in k-NN
Compute distances from new point to all training points.
Pick the k closest.
Classification → majority vote
Regression → average their outputs
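The steps above could be sketched in pure Python as follows (the function name `knn_predict` and the choice of Euclidean distance are illustrative assumptions, not from the source):

```python
from collections import Counter
import math

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points.
    Illustrative sketch: assumes Euclidean distance and points as tuples."""
    # 1. Compute distances from the new point to all training points.
    dists = [(math.dist(x, x_new), y) for x, y in zip(X_train, y_train)]
    # 2. Pick the k closest.
    nearest = sorted(dists)[:k]
    # 3. Classification: take the majority vote of their labels.
    labels = [y for _, y in nearest]
    return Counter(labels).most_common(1)[0][0]

# Two well-separated classes
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(X, y, (0.5, 0.5), k=3))  # → a
```

For regression, the last step would instead average the outputs of the k neighbours.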
What is overfitting
Tuning the model parameters too closely to the noise in the measurements, which prevents the model from generalising well
Low training error, high test error
What is underfitting
Using a model that is too simple (e.g. a straight line for non-linear data) to make good predictions
High training error, high test error
How do we test how well a model generalises
The model is trained on the training set, and its performance is evaluated by how well it predicts outputs for the test set.
Low test error = good generalisation
What is Ockham’s razor
A principle suggesting the simplest solution is usually the best
What is the goal of unsupervised learning
Goal is not prediction, but gaining insight into the phenomenon itself by finding internal relationships and patterns
What is clustering
Finding groups of similar observations
What is dimensionality reduction
Compressing data while preserving structure
K-Means clustering process
Choose the number of clusters (K).
Select K initial “centroids”.
Assign points to the nearest centroid.
Update centroid positions.
Repeat until convergence.
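The five steps above could be implemented in pure Python roughly as follows (a sketch for 2-D points; the function name and random initial-centroid choice are illustrative assumptions):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-Means sketch on 2-D points given as (x, y) tuples."""
    rng = random.Random(seed)
    # 1-2. Choose K and select K initial centroids from the data.
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # 3. Assign each point to its nearest centroid (squared distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2
                                          + (p[1] - centroids[j][1]) ** 2)
            clusters[i].append(p)
        # 4. Update each centroid to the mean of its assigned points.
        new = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               if c else centroids[j] for j, c in enumerate(clusters)]
        # 5. Repeat until convergence (centroids stop moving).
        if new == centroids:
            break
        centroids = new
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)
```

On these two well-separated groups, the algorithm recovers two clusters of three points each.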
What does supervised learning mean
The dataset contains inputs and correct outputs
How to find test error
∑(y_test − y_pred)²
What are the sources of bias in machine learning
Biased datasets
Wrong labels
Missing data
Unbalanced data
Poor feature choice
What matters more, data quality or quantity
Quality
Supervised vs Unsupervised Learning
Supervised - Have labels, predict target
Unsupervised - No labels, understand data structure