Supervised learning: concepts and applications Flashcards

(34 cards)

1
Q

What is machine learning

A

a set of methods that can automatically detect patterns in data

2
Q

What does machine learning do with these patterns

A

they are used to predict future data or to perform other kinds of decision making under uncertainty

3
Q

What is a key premise of machine learning

A

the learning problem

4
Q

what is the learning problem

A

learning from data is used in situations where we don’t have an analytic solution, but we do have data from which we can construct an empirical solution

5
Q

what does the learning problem use

A

a machine learning method

6
Q

What kind of inputs does a model have

A

feature, attribute, predictor, independent variable

7
Q

What kind of outputs does a model have

A

response, dependent variable, label

8
Q

When is regression used in supervised learning

A

where Y is continuous (quantitative)

9
Q

Where is classification used in supervised learning

A

covers situations where Y is categorical

10
Q

How do you minimise the least squares error

A

the gradient descent method

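Card 10 in code: a minimal pure-Python sketch of gradient descent minimising the least squares error for a straight-line fit. The data, learning rate, and step count are illustrative assumptions, not from the cards.

```python
# Minimal gradient descent for a straight-line least-squares fit.
# The data, learning rate and step count below are made-up illustrations.
def fit_line_gd(xs, ys, lr=0.01, steps=5000):
    """Fit y ~ w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = (2 / n) * sum(e * x for e, x in zip(errs, xs))
        grad_b = (2 / n) * sum(errs)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]     # data generated from the line y = 2x + 1
w, b = fit_line_gd(xs, ys)
print(round(w, 2), round(b, 2))  # recovers slope ~2 and intercept ~1
```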
11
Q

what are the main two types of supervised learning

A

regression and classification

12
Q

Give an example of regression

A

predicting house prices based on size, location and number of bedrooms

13
Q

Give an example of classification

A

spam/non-spam

14
Q

What is classification used for

A

assigning instances to discrete categories

15
Q

How is the “best fit” defined

A

the line that minimises the sum of squared errors between actual and predicted values

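Card 15's definition of "best fit" can be checked directly: of two candidate lines, the better one has the smaller sum of squared errors. The numbers below are made up for illustration.

```python
# Made-up data lying exactly on the line y = 2x.
x = [1, 2, 3]
y = [2, 4, 6]

def sse(slope, intercept):
    """Sum of squared errors between actual and predicted values."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

print(sse(2, 0))  # the true line y = 2x: SSE = 0 (perfect fit)
print(sse(1, 1))  # a worse line y = x + 1: SSE = 5
```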
16
Q

Which supervised algorithms can solve both regression and classification problems

A

Decision trees
Random Forest
K-nearest neighbours

17
Q

What are strengths of linear regression

A
  • Simple and easy to implement
  • Works well when the relationship between features and target is linear
  • Computationally efficient
18
Q

What are the limitations of linear regression

A
  • Assumes linearity between features and the target variable
  • Sensitive to outliers
  • Struggles with multicollinearity
  • Poor performance with complex, non-linear data
19
Q

What is the process of the decision tree

A
  • Determine which attribute to select as the root, according to some goodness measure (e.g. information gain), to split on
  • Partition the input examples into subsets according to the values of the root attribute
  • Construct a DT recursively for each subset
  • Connect the roots of the subtrees to the root of the whole tree via labelled links
20
Q

What is entropy

A

a measure of uncertainty or impurity of a dataset

21
Q

What does information gain measure?

A

The reduction in entropy after a dataset is split on an attribute.

22
Q

Why Is information gain important

A

It determines the best attribute to split on.
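Cards 20–22 in code: a small sketch of entropy and information gain. The spam/ham labels and the split are an invented example.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits): the uncertainty/impurity of a set of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, subsets):
    """Reduction in entropy after splitting `labels` into `subsets`."""
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets)

# An invented split: 4 spam / 4 ham, separated into two pure subsets.
labels = ["spam"] * 4 + ["ham"] * 4
split = [["spam"] * 4, ["ham"] * 4]
print(entropy(labels))                  # 1.0 bit: maximally impure
print(information_gain(labels, split))  # 1.0: the split removes all uncertainty
```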

23
Q

What are the strengths of the decision tree

A
  • Easy to visualise and interpret
  • Handles non-linear relationships well
  • No need for data normalisation or scaling
  • Can handle categorical and continuous data
24
Q

What are the limitations of the decision tree

A
  • Prone to overfitting (without pruning or max-depth constraints)
  • Sensitive to small changes in the data (instability)
25

Q

How do you build a random forest

A

1. Train each tree on a random subset of the data, sampled with replacement (Bootstrapping)
2. Use a random subset of features at each split (Feature Subsampling)
3. Grow individual decision trees independently (Tree Building)
4. Combine results using majority vote (classification) or averaging (regression) (Aggregation)
5. Output the aggregated result from all trees (Final Prediction)
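The aggregation step can be sketched in a few lines; the tree outputs below are invented stand-ins for trees grown on separate bootstrap samples.

```python
from collections import Counter

def aggregate_classification(tree_predictions):
    """Majority vote across the trees' class predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

def aggregate_regression(tree_predictions):
    """Average across the trees' numeric predictions."""
    return sum(tree_predictions) / len(tree_predictions)

# Invented outputs from three trees:
print(aggregate_classification(["spam", "spam", "ham"]))  # majority vote: spam
print(aggregate_regression([200_000, 210_000, 250_000]))  # average: 220000.0
```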
26

Q

What is bootstrapping

A

Each decision tree is trained on a random sample of the dataset:
  • Sampling is with replacement
  • So some data points appear multiple times
  • Some data points are left out
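Sampling with replacement can be seen directly; the dataset and seed below are arbitrary choices for illustration.

```python
import random
random.seed(42)  # arbitrary seed, for repeatability only

data = list(range(10))                         # an invented dataset of 10 points
sample = [random.choice(data) for _ in data]   # same size, drawn with replacement
out_of_bag = set(data) - set(sample)           # points that were never drawn

print(sample)       # some points typically appear more than once...
print(out_of_bag)   # ...and these were left out of this tree's sample
```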
27

Q

Give an example of how a random forest makes a prediction

A

each tree (trained on its own bootstrap sample) makes a prediction for the input; the votes from all the trees are counted, and the majority class is the output, the aggregated result from all trees
28

Q

What is the definition of a random forest

A

A Random Forest trains many different decision trees, lets each one make a prediction, and then combines their answers to get a more reliable final result.
29

Q

What are the strengths of random forest

A
  • Reduces overfitting compared to individual decision trees, since each tree only sees a subset of the data
  • Handles high-dimensional data effectively
30

Q

What are the limitations of random forest

A
  • Less interpretable than individual trees
  • Computationally intensive for large datasets
  • May require tuning of hyperparameters (e.g. number of trees)
31

Q

What is the nearest neighbour classification procedure

A

1. The user inputs k: an integer representing the number of nearest neighbours (instances) to search for
2. For each unlabelled instance: calculate the distance between it and all the instances in the dataset
3. Find the k nearest neighbours
4. Count the assigned class labels among the k nearest neighbours, for each class
5. The class with the highest count (majority vote) is the output
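The nearest neighbour procedure, sketched in Python with invented 2-D data (Euclidean distance assumed):

```python
import math
from collections import Counter

def knn_predict(train, new_point, k):
    """train: list of (features, label) pairs. Returns the majority label
    among the k training instances closest to new_point."""
    # Step 2: distance from the new point to every instance in the dataset.
    dists = sorted((math.dist(x, new_point), label) for x, label in train)
    # Steps 3-4: take the k nearest and count their class labels.
    k_labels = [label for _, label in dists[:k]]
    # Step 5: the class with the highest count (majority vote) is the output.
    return Counter(k_labels).most_common(1)[0][0]

# Invented data: two well-separated clusters.
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_predict(train, (0.5, 0.5), k=3))  # lands in the "A" cluster
print(knn_predict(train, (5.5, 5.5), k=3))  # lands in the "B" cluster
```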
32

Q

What are the strengths of k-nearest neighbours

A
  • Simple to implement and understand
  • Makes no assumptions about data distribution
  • Effective for non-linear and multi-class problems
  • Adaptive to changes in the dataset
33

Q

What are the limitations of K-nearest Neighbours

A
  • Computationally expensive during prediction (lazy learning)
  • Performance depends heavily on the choice of k and distance metric
  • Sensitive to irrelevant or noisy features
34

Q

What are the key issues of nearest neighbour classification

A
  • No model is built; all the data is retained
  • Prediction could take up to O(np) per observation
  • Will not do well when the number of features is large