Discretisation / Modelling Continuous Data Flashcards

Question 1

Q

What is Discretisation, and where might it be used?

Answer

A

Discretisation = The translation of continuous attributes into nominal attributes.

It might be used with learners such as Decision Trees, which generally work better with nominal attributes.

Question 2

Q

Summarise some approaches to supervised discretisation

Answer

A

Naïve Supervised Discretisation
Information-Based Supervised Discretisation
The general idea is to sort the values and create one nominal value for each region where most of the instances share the same label, i.e. group them up.

Question 3

Q

What is Equal Width?

Answer

A

Equal width is an unsupervised method

Divides the range of possible values seen in the training set into equally-sized sub-divisions, regardless of the number of instances (sometimes 0) in each division

1) max value - min value = range
2) range / number of buckets = width of each bucket
3) place boundaries at min + width, min + 2 × width, …, until max is reached
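
The three steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names `equal_width_edges` and `bucket_of` are made up for this card, and sending a value that sits exactly on a boundary to the upper bucket is an assumption.

```python
def equal_width_edges(values, num_buckets):
    """Return the interior boundaries for equal-width buckets."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets          # steps 1 and 2
    # step 3: min + width, min + 2 * width, ...
    return [lo + i * width for i in range(1, num_buckets)]


def bucket_of(value, edges):
    """Index of the bucket a value falls into (boundary values go up; an assumption)."""
    return sum(value >= edge for edge in edges)


values = [0, 1, 2, 3, 30]
edges = equal_width_edges(values, 3)          # [10.0, 20.0]
buckets = [bucket_of(v, edges) for v in values]
# buckets is [0, 0, 0, 0, 2]: the middle bucket ends up with 0 instances
```

The example data is chosen so that one bucket stays empty, which is the situation the card warns about.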

Question 4

Q

What is Equal Frequency?

Answer

A

Equal frequency is an unsupervised method

Divides the range of possible values seen in the training set, such that roughly the same number of instances appear in each bucket

1) for a specific attribute, sort the instances in ascending order
2) split according to how many buckets we want
3) if we need to transform new data that is added later, fix each dividing point at the median (midpoint) of the two values on either side of the split
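
A rough Python sketch of these steps is below. The names `equal_frequency_edges` and `bucket_of` are hypothetical, and placing each dividing point halfway between the two values on either side of the split is one reading of step 3, not the only possible one. The sketch also assumes there are at least as many values as buckets.

```python
def equal_frequency_edges(values, num_buckets):
    """Return dividing points so each bucket holds roughly the same number of instances."""
    ordered = sorted(values)                  # step 1: sort ascending
    n = len(ordered)
    edges = []
    for b in range(1, num_buckets):           # step 2: split into num_buckets groups
        cut = b * n // num_buckets            # index of the first value in bucket b
        # step 3 (assumed reading): dividing point halfway between the neighbours
        edges.append((ordered[cut - 1] + ordered[cut]) / 2)
    return edges


def bucket_of(value, edges):
    """Index of the bucket a (possibly new) value falls into."""
    return sum(value > edge for edge in edges)


values = [30, 0, 3, 1, 31, 2]
edges = equal_frequency_edges(values, 3)      # [1.5, 16.5]
```

Because the dividing points are stored as numbers, unseen values can be bucketed later with `bucket_of` without re-sorting the training data. Repeated values around a split can make bucket sizes uneven.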

Question 5

Q

What is k-means in the context of discretisation?

Answer

A

K-means is a “clustering” approach, but it can work well in the context of discretisation.

If we want k buckets, we randomly select k points to act as seeds.
We then iterate until the assignments stop changing:
- assign each instance to the bucket of the closest seed (centroid)
- update the “centroid” of each bucket to the mean of the values currently assigned to it
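
A minimal one-dimensional sketch of this loop in Python follows. The function name `kmeans_1d`, the `seed` and `max_iter` parameters, and stopping once the assignments no longer change are all assumptions made for illustration.

```python
import random


def kmeans_1d(values, k, seed=0, max_iter=100):
    """Cluster single-attribute values into k buckets; returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(list(values), k)   # k randomly chosen seed points
    assignment = None
    for _ in range(max_iter):
        # assign each instance to the bucket of the closest centroid
        new_assignment = [
            min(range(k), key=lambda j: abs(v - centroids[j])) for v in values
        ]
        if new_assignment == assignment:      # nothing moved: stop (assumed criterion)
            break
        assignment = new_assignment
        # update each centroid to the mean of the values currently in its bucket
        for j in range(k):
            members = [v for v, a in zip(values, assignment) if a == j]
            if members:                       # leave an empty bucket's centroid alone
                centroids[j] = sum(members) / len(members)
    return centroids, assignment


centroids, assignment = kmeans_1d([1, 2, 3, 10, 11, 12], k=2)
```

On this small, well-separated example the centroids settle at 2.0 and 11.0 whichever seeds are drawn; on messier data a different `seed` can give different buckets.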

Question 6

Q

How to calculate the (sample) mean?

Answer

A

mean of a specific attribute c = (1/N) × Σ c_i, summed over all N values c_i of that attribute

Question 7

Q

How to calculate the standard deviation?

Answer

A

1) Sum the squared differences between each attribute value and the sample mean: Σ (c_i - mean_c)²
2) Divide by one less than the number of values (N - 1)
3) Take the positive square root
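
As a quick check of these steps, here is a small Python sketch that computes the sample mean and the sample standard deviation by hand. The function names and the example values are made up for illustration.

```python
import math


def sample_mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation, dividing by N - 1."""
    mean = sample_mean(values)
    squared_diffs = sum((v - mean) ** 2 for v in values)   # step 1
    variance = squared_diffs / (len(values) - 1)           # step 2
    return math.sqrt(variance)                             # step 3


values = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sample_mean(values)   # 5.0
std = sample_std(values)     # sqrt(32 / 7), roughly 2.14
```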

Question 8

Q

How could we use the MEAN and STANDARD DEVIATION when building a classifier?

Answer

A

We could fit a Gaussian probability density function to the attribute, which lets us estimate how likely any given value is, based on the number of standard deviations it lies from the mean (its z-score)
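
A small Python sketch of that idea, assuming the mean and standard deviation have already been computed for the attribute. The names `z_score` and `gaussian_pdf` are illustrative. Note that the result is a probability density, not a probability, so it can exceed 1 when the standard deviation is small.

```python
import math


def z_score(x, mean, std):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std


def gaussian_pdf(x, mean, std):
    """Gaussian probability density at x for the given mean and standard deviation."""
    z = z_score(x, mean, std)
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


z = z_score(7, mean=5, std=2)            # 1.0: one standard deviation above the mean
density = gaussian_pdf(7, mean=5, std=2)
```

In a classifier, one such density would typically be fitted per class, and the densities compared to see which class makes the observed value most plausible.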

Question 9

Q

What is a z-score?

Question 10

Q

What is a hyperparameter? What does it mean for the model to parametrise the data? How do these relate to the model being non-parametric / parametric?

Question 11

Q

What are the general two steps in discretisation?

Answer

A

1) Decide how many values (= intervals/buckets) to map the features on to
2) Map each continuous value onto a discrete value

Question 12

Q

Pros and Cons of K-means clustering

Answer

A

Pros
- Efficient: O(tkn), where
  n = # of instances
  k = # of clusters
  t = # of iterations
  and normally k, t << n

Cons

- Tends to converge to a local minimum; sensitive to the choice of seed instances
- Need to specify k in advance
- Not able to handle non-convex clusters

Question 13

Q

Information-Based supervised discretisation

Answer

A

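Cluster the values into two intervals, choosing the split that minimises the entropy

1) sort the values
2) calculate the mean information (the size-weighted average entropy of the two resulting intervals) at each breakpoint in class membership
3) split at the breakpoint with the lowest mean information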
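
The procedure can be sketched in Python as below. This is only an illustration: `entropy` and `best_split` are hypothetical names, "mean information" is taken to be the size-weighted average entropy of the two intervals, and breakpoints between instances with identical values are simply skipped.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (cut point, mean information) for the lowest-entropy two-way split."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])   # sort the values
    n = len(pairs)
    best_cut, best_info = None, float("inf")
    for i in range(1, n):
        (v_prev, y_prev), (v, y) = pairs[i - 1], pairs[i]
        if y == y_prev or v == v_prev:
            continue                     # only consider breakpoints in class membership
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        info = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if info < best_info:
            best_cut, best_info = (v_prev + v) / 2, info
    return best_cut, best_info


cut, info = best_split([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
# cut is 6.5 and info is 0.0: both intervals are pure
```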

Question 14

Q

Naïve Supervised Discretisation

Answer

A

“cluster” values into class-contiguous intervals

1) sort the values and identify breakpoints in class membership
2) reposition any breakpoint that falls between instances with the same numeric value, since no threshold can separate them
3) set each breakpoint midway between the neighbouring values

Pro: simple to implement

Con: leads to overfitting
- to avoid overfitting, delay inserting a breakpoint until each cluster contains at least n instances
- or merge neighbouring clusters until they reach a certain size, or at least n instances of the majority class
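
The basic procedure on this card (sort, find class changes, reposition, take midpoints) can be sketched in Python as below, without the overfitting fixes. The function name is hypothetical, and moving a blocked breakpoint forward to the next change in numeric value is one assumed way of repositioning it.

```python
def naive_supervised_breakpoints(values, labels):
    """Return cut points for class-contiguous intervals."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])   # sort by value
    cuts = []
    pending = False   # a class change is waiting for a gap in the numeric values
    for (v_prev, y_prev), (v, y) in zip(pairs, pairs[1:]):
        if y != y_prev:
            pending = True                    # breakpoint in class membership
        if pending and v != v_prev:
            cuts.append((v_prev + v) / 2)     # midway between the neighbouring values
            pending = False
        # if v == v_prev the breakpoint is carried forward (assumed repositioning rule)
    return cuts


values = [64, 65, 68, 69, 70, 71, 72, 72, 75]
labels = ["yes", "no", "yes", "yes", "yes", "no", "no", "yes", "yes"]
cuts = naive_supervised_breakpoints(values, labels)
# cuts is [64.5, 66.5, 70.5, 73.5]; the class change between the two 72s moved to 73.5
```

Even on nine instances this produces four cut points, which shows how readily the method overfits.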

Question 15

Q

What is Gaussian Distribution (aka normal distribution)

Answer

A

The Gaussian (normal) distribution is a bell-shaped distribution fully described by its mean and standard deviation. Given those two values, the Gaussian density function gives an estimate of the probability density at any x.

Question 16

Q

Why is smoothing important in Naive Bayes (NB)?

Answer

A

Naive Bayes multiplies probabilities together, so a single 0 probability would wipe out the whole product for that class. Smoothing prevents this.
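
A tiny Python illustration of the problem and one common fix. The card does not say which smoothing method is meant; add-one (Laplace) smoothing is used here purely as an assumed example, and the function names and numbers are made up.

```python
def class_score(prior, likelihoods):
    """Naive Bayes score: the class prior times each attribute likelihood."""
    score = prior
    for p in likelihoods:
        score *= p
    return score


def smoothed_probability(count, total, num_values, alpha=1):
    """Add-alpha estimate of a likelihood (alpha=1 is add-one smoothing; assumed method)."""
    return (count + alpha) / (total + alpha * num_values)


# One attribute value never seen with this class zeroes the whole score
unsmoothed = class_score(0.5, [0.9, 0.0, 0.8])           # 0.0

# With smoothing, the unseen value gets a small non-zero probability instead
p_unseen = smoothed_probability(count=0, total=10, num_values=3)
smoothed = class_score(0.5, [0.9, p_unseen, 0.8])        # greater than 0
```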

(16 cards)