Give two reasons why unsupervised learning is often more challenging than supervised learning
Describe how principal components analysis works
PCA transforms a high-dimensional dataset into a smaller, more manageable set of representative variables (the principal components) that capture most of the information in the original dataset; it is especially useful when the variables are highly correlated
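A minimal numpy sketch of this idea: center the data, take the SVD, and project onto the top principal component loadings. The toy data, the artificially correlated column, and the choice of k = 2 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = 2 * X[:, 0] + 0.1 * X[:, 1]   # make two columns highly correlated (assumed toy data)

Xc = X - X.mean(axis=0)                  # center each variable
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
scores = Xc @ Vt[:k].T                   # scores on the first k principal components
pve = S**2 / np.sum(S**2)                # proportion of variance explained by each PC
```

Because the singular values are sorted in decreasing order, `pve` is decreasing, and summing its first few entries shows how much information a low-dimensional representation retains.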
Describe how centering and scaling the variables will affect the results of principal components analysis
Describe the drawbacks (or limitations) of principal components analysis
Explain how K-means clustering works
K-means clustering assigns each observation in a dataset to one of K relatively homogeneous clusters, where K is specified up front.
First, we randomly assign K points to be the initial cluster centers. Then we iterate:
1. Assign each observation to the closest cluster center based on Euclidean distance
2. Recalculate the center of each of the K clusters
3. Repeat until the cluster assignments no longer change
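The steps above can be sketched in numpy as follows; the two-blob toy data and K = 2 are assumptions for illustration.

```python
import numpy as np

# Assumed toy data: two well-separated groups of points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
K = 2

# Randomly pick K observations as initial cluster centers.
centers = X[rng.choice(len(X), K, replace=False)]
labels = np.zeros(len(X), dtype=int)
while True:
    # Step 1: assign each observation to the nearest center (Euclidean distance).
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    new_labels = dists.argmin(axis=1)
    # Step 3: stop when the assignments no longer change.
    if np.array_equal(new_labels, labels):
        break
    labels = new_labels
    # Step 2: recompute each center as the mean of its members
    # (keeping the old center if a cluster happens to be empty).
    centers = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                        else centers[k] for k in range(K)])
```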
Explain what the term “K-means” refers to
The algorithm iteratively calculates the K means (centers) of the clusters, hence the name
Explain why it is desirable to run a K-means clustering algorithm multiple times
This is because the K-means algorithm is guaranteed to converge to a local, but not necessarily the global, optimum. The initial cluster assignments determine which local optimum is reached, so running the algorithm multiple times (say, 20 to 50) with different random initializations increases the chance of identifying the global optimum and getting a representative cluster grouping
Explain how the elbow method can be used to select the value of K
Plot the proportion of variance explained (equal to the between-cluster variation divided by the total variation in the data) against the number of clusters K. The elbow of this plot marks the point where the proportion of variance explained begins to plateau; clusters added beyond the elbow yield little improvement, so choose K at the elbow.
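A sketch of the elbow computation on toy data with three well-separated groups. The hand-rolled k-means helper (with a few random restarts, reflecting the multiple-runs advice above) is an assumption for illustration, not a specific library routine.

```python
import numpy as np

# Assumed toy data: three well-separated groups.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 0.4, (40, 2)) for m in (0, 4, 8)])

def kmeans(X, K, iters=25, restarts=10):
    # Simple Lloyd's iterations with random restarts (illustrative helper).
    best_within, best_labels = np.inf, None
    for _ in range(restarts):
        centers = X[rng.choice(len(X), K, replace=False)]
        for _ in range(iters):
            labels = np.linalg.norm(X[:, None] - centers, axis=2).argmin(axis=1)
            centers = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                                else centers[k] for k in range(K)])
        within = sum(((X[labels == k] - centers[k]) ** 2).sum() for k in range(K))
        if within < best_within:
            best_within, best_labels = within, labels
    return best_within, best_labels

total_ss = ((X - X.mean(axis=0)) ** 2).sum()
# PVE = between-cluster SS / total SS = 1 - within-cluster SS / total SS
pve = [1 - kmeans(X, K)[0] / total_ss for K in range(1, 7)]
# pve rises steeply up to K = 3, then plateaus: the elbow suggests K = 3.
```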
Explain how hierarchical clustering works
Hierarchical clustering consists of a series of successive fusions. It is a bottom-up (agglomerative) method that starts with each observation treated as its own cluster, then repeatedly fuses the closest pair of clusters, one pair at a time. The process iterates until all observations are merged into a single cluster
Explain the difference between average linkage and centroid linkage
Explain the two differences between K-means clustering and hierarchical clustering
K-means
* Randomization is needed to determine initial cluster centers
* Number of clusters is pre-specified
* Clusters are not nested
Hierarchical clustering
* Randomization is not needed
* Number of clusters is not pre-specified
* Clusters are nested
Explain how scaling the variables will affect the results of hierarchical clustering
Without scaling: variables measured on larger scales dominate the distance calculations and exert a disproportionate impact on cluster assignments
With scaling: equal importance is attached to each feature when performing distance calculations
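A small numeric illustration of this point; the income/age toy values are assumptions. Unscaled, the large-scale variable (income) drives the distances; after standardizing, both variables contribute comparably and the nearest neighbor can change.

```python
import numpy as np

# Assumed toy example: income (large scale) vs. age (small scale).
X = np.array([[30000.0, 25.0],
              [31000.0, 60.0],
              [90000.0, 26.0]])

# Unscaled: income dominates, so rows 0 and 1 look "close"
# despite a 35-year age gap.
d_unscaled = (np.linalg.norm(X[0] - X[1]), np.linalg.norm(X[0] - X[2]))

# Standardize each variable, giving both equal weight in the distance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
d_scaled = (np.linalg.norm(Z[0] - Z[1]), np.linalg.norm(Z[0] - Z[2]))
# After scaling, row 0 is closer to row 2 (similar age) than to row 1.
```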
Explain two ways in which clustering can be used to generate features for predicting a target variable
Generates features in two ways
1. Cluster groups: the group assignments produced by clustering form a factor variable that can be used as a feature to predict the target variable
2. Cluster centers: replace the original variables with their cluster centers, which serve as numeric features. Two advantages of this approach:
* Interpretation: cluster centers provide a numeric summary of the characteristics of the observations in each cluster
* Prediction: cluster centers retain the numeric characteristics of the observations, and these summarized characteristics can help produce better predictions
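Both feature types can be built in a couple of lines; here the cluster labels come from a hand-rolled nearest-center assignment on toy data, an assumption for illustration.

```python
import numpy as np

# Assumed toy data: two groups, with centers taken as given for illustration.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
centers = np.array([[0.0, 0.0], [3.0, 3.0]])
labels = np.linalg.norm(X[:, None] - centers, axis=2).argmin(axis=1)

# 1. Cluster groups: the label is a factor feature (here one-hot encoded).
group_feature = np.eye(len(centers))[labels]   # shape (60, 2)

# 2. Cluster centers: replace each observation with its cluster's center,
#    keeping a numeric summary of that cluster's characteristics.
center_features = centers[labels]              # shape (60, 2)
```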
What are the properties of principal components?
What are two applications of PCA?
What is the tradeoff of increasing the number of PCs (M) to use?
As M increases:
* cumulative PVE increases
* dimension increases
* (if a target variable y exists) model complexity increases
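The cumulative PVE side of this tradeoff can be computed directly; the toy low-rank data and the 90% cutoff below are assumptions (a common rule of thumb, not a fixed rule).

```python
import numpy as np

# Assumed toy data: 8 variables driven by ~3 underlying dimensions.
rng = np.random.default_rng(4)
A = rng.normal(size=(200, 3))
X = A @ rng.normal(size=(3, 8))

Xc = X - X.mean(axis=0)
S = np.linalg.svd(Xc, compute_uv=False)
cum_pve = np.cumsum(S**2) / np.sum(S**2)       # cumulative PVE as M grows

# Smallest M whose cumulative PVE reaches 90% (assumed threshold).
M = int(np.searchsorted(cum_pve, 0.90)) + 1
```

Because the data are effectively rank 3, the cumulative PVE saturates by the third component, so increasing M beyond that adds dimension (and, with a target, complexity) without capturing more variance.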
How can we choose the number of principal components (M) to use?
Why might we use complete and average linkage over single and centroid?
List two ways you can perform feature generation using PCA