Unsupervised Machine Learning Flashcards by Unknown Unknown

Unlabeled Data

Data that comes without predefined labels (target values). It is not organized in an easily identifiable manner, so any structure has to be discovered from the data itself.

Goals of Unsupervised Learning

The goal is to learn about the data’s underlying structure and find out how different features relate to each other.

Name 2 methodologies of unsupervised learning

Recommendation systems
K-means models

Briefly describe a Recommendation system

Recommendation systems are a subclass of machine learning algorithms that offer relevant suggestions to users. They can be either supervised or unsupervised.

What is the goal of a recommendation system?

To quantify how similar one thing is to another, and to use this information to suggest a closely related option.

What is content-based filtering?

Content-based filtering is a type of recommendation system where comparisons are made based on the attributes of the content itself.

For example, attributes of a song you played are compared to attributes of other songs to determine similarity.
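
The song comparison can be sketched in a few lines of Python. This is a minimal illustration, not a production recommender: the song names, the three attributes (tempo, energy, acousticness), and the choice of cosine similarity as the similarity measure are all assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no direction to compare
    return dot / (norm_a * norm_b)

# Hypothetical song attributes: (tempo, energy, acousticness), each scaled to 0-1.
songs = {
    "played_song": [0.80, 0.90, 0.10],
    "candidate_a": [0.75, 0.85, 0.15],
    "candidate_b": [0.20, 0.10, 0.90],
}

def most_similar(target, catalog):
    """Return the catalog item whose attributes best match the target item."""
    scores = {
        name: cosine_similarity(catalog[target], attrs)
        for name, attrs in catalog.items()
        if name != target
    }
    return max(scores, key=scores.get)

recommendation = most_similar("played_song", songs)
```

Only the attributes of the content are used here; no information about other users is needed.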

What are some benefits of content-based filtering?

The benefits include being easy to understand, recommending more of what a user likes, not needing other users’ information to work, and being able to map users and items in the same space to recommend things that are closest to a user’s typical preferences.

What are some drawbacks of content-based filtering?

Always recommends more of the same
Requires manual input of attributes
Cannot recommend across content types
Limited use cases

What is collaborative filtering?

Collaborative filtering is a type of recommendation system that uses the likes and dislikes of users to make recommendations.

It does not need to know anything about the content itself. All that matters is if the user liked it.
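
A minimal sketch of the idea, assuming a tiny hypothetical like/dislike history and a simple "agreement" similarity between users (the fraction of co-rated items they rated the same way). Real systems use more robust similarity measures; the user names, items, and the `agreement` and `recommend` helpers are illustrative only.

```python
# Hypothetical history: 1 = liked, 0 = disliked, missing key = not seen yet.
ratings = {
    "ana":  {"song_x": 1, "movie_y": 1, "book_z": 0},
    "ben":  {"song_x": 1, "movie_y": 1, "podcast_w": 1},
    "cara": {"song_x": 0, "book_z": 1},
}

def agreement(user_a, user_b):
    """Fraction of co-rated items on which two users gave the same rating."""
    shared = set(ratings[user_a]) & set(ratings[user_b])
    if not shared:
        return 0.0
    same = sum(ratings[user_a][item] == ratings[user_b][item] for item in shared)
    return same / len(shared)

def recommend(user):
    """Suggest unseen items that the most similar other user liked."""
    others = [u for u in ratings if u != user]
    neighbor = max(others, key=lambda u: agreement(user, u))
    return sorted(
        item
        for item, liked in ratings[neighbor].items()
        if liked == 1 and item not in ratings[user]
    )

suggestions = recommend("ana")
```

Nothing about the items themselves is inspected, which is why a podcast can be suggested on the basis of shared taste in songs and movies.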

What are some benefits of collaborative filtering?

The benefits include
- recommending across content types,
- finding hidden correlations in the data, and
- not requiring tedious manual mapping.

What are some drawbacks of collaborative filtering?

Drawbacks include
- needing lots of data to even start getting useful results,
- requiring every user to give the system lots of data, and
- dealing with sparse data that has a lot of missing values.

What type of model is K-means and what does it do?

K-means is an unsupervised learning model. It is a partitioning algorithm that organizes unlabeled data into clusters.

What is a Centroid?

The central point of a cluster, calculated as the mathematical mean of all points in that cluster.

List the 4 steps to build a K-means model

1. Initialize k centroids.
2. Assign all points to their nearest centroid.
3. Recalculate the centroid of each cluster.
4. Repeat steps 2 and 3 until the algorithm converges.
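
The four steps can be sketched in plain Python. This is an illustrative version under stated assumptions: the centroids are initialized from the first k points (libraries typically use smarter or random initialization), distances are Euclidean, and convergence means the centroids stop changing. The `kmeans` function name and the sample data are hypothetical.

```python
import math

def kmeans(points, k, max_iter=100):
    """Plain K-means on a list of coordinate tuples."""
    # Step 1: initialize k centroids (assumption: just take the first k points).
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 3: recalculate each centroid as the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep old centroid
        # Step 4: repeat until the centroids stop moving (convergence).
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

data = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(data, k=2)
```

Every point ends up with a label, which reflects that K-means is a partitioning algorithm.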

What is the difference between clustering and partitioning algorithms?

Clustering algorithms: outlying points can exist outside of the clusters.

Partitioning algorithms: all points must be assigned to a cluster.

In other words, K-means does not allow unassigned outliers.

What is k in the “Initialize k centroids” step?

K = the number of centroids in your model, which is how many clusters you’ll have.

Who decides the value of k?

You do. k is a hyperparameter that you set before fitting the model.

How do you choose the value of k?

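Sometimes k is known in advance. For instance, if there are 3 species of beetle to cluster, then k = 3. Sometimes it is unknown and has to be estimated, for example by comparing inertia or silhouette scores across several values of k.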

Name 2 other clustering methodologies

DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups data points together based on their density.
Agglomerative clustering: Creates a hierarchy of clusters by merging data points or clusters iteratively

What cluster shape does K-means work best with?

Round (roughly spherical) clusters.

DBSCAN (density-based spatial clustering of applications with noise)

Searches your data space for continuous regions of high density. Because clusters are found based on density, the shape of the cluster isn’t as important as it is for K-means.

DBSCAN Hyperparameters

eps (epsilon, ε) and min_samples

DBSCAN: eps, Epsilon (ε)

The radius of your search area from any given point

DBSCAN: min_samples

The minimum number of samples in an ε-neighborhood for a point to be considered a core point (including the point itself).
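
How eps and min_samples interact can be illustrated with the core-point test alone. The sketch below is not a full DBSCAN implementation (it does not expand clusters or label noise); it only counts neighbors within the radius, assuming Euclidean distance. The `core_points` helper and the sample data are hypothetical.

```python
import math

def core_points(points, eps, min_samples):
    """Return the indices of points that have at least min_samples points
    (counting themselves) within a radius of eps."""
    cores = []
    for i, p in enumerate(points):
        # The neighborhood includes the point itself (distance 0).
        neighbors = sum(1 for q in points if math.dist(p, q) <= eps)
        if neighbors >= min_samples:
            cores.append(i)
    return cores

data = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (5.0, 5.0)]
cores = core_points(data, eps=1.0, min_samples=3)
```

The three points near the origin each have three points in their ε-neighborhood, so they are core points; the isolated point at (5.0, 5.0) is not.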

Agglomerative clustering

works by first assigning every point to its own cluster, then progressively combining clusters based on intercluster distance.

Agglomerative clustering requirement

You must specify either a desired number of clusters or a distance threshold, which is the linkage distance above which clusters will not be merged.

Agglomerative clustering: Linkage

different ways to measure the distances that determine whether or not to merge the clusters.

Common Linkages

Single: The minimum pairwise distance between clusters.
Complete: The maximum pairwise distance between clusters.
Average: The average of all pairwise distances between the points of two clusters.
Ward: This is not a distance measurement. Instead, it merges the two clusters whose merging will result in the lowest inertia.
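
The three distance-based linkages can be compared on a small example. This is a sketch assuming Euclidean distance; the function names and the two sample clusters are made up for illustration, and Ward linkage is left out because it is not a pairwise distance.

```python
import math
from itertools import product

def pairwise(cluster_a, cluster_b):
    """All point-to-point distances between two clusters."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

def single_linkage(a, b):
    return min(pairwise(a, b))      # closest pair of points

def complete_linkage(a, b):
    return max(pairwise(a, b))      # farthest pair of points

def average_linkage(a, b):
    distances = pairwise(a, b)
    return sum(distances) / len(distances)  # mean over all pairs

left = [(0.0, 0.0), (1.0, 0.0)]
right = [(3.0, 0.0), (5.0, 0.0)]

single = single_linkage(left, right)      # 2.0
complete = complete_linkage(left, right)  # 5.0
average = average_linkage(left, right)    # 3.5
```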

When does Agglomerative clustering stop?

1. You reach a specified number of clusters.
2. You reach an intercluster distance threshold (clusters that are separated by more than this distance are too far from each other and will not be merged).
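
The merge loop and its two stopping rules can be sketched as follows. This is a naive illustration, assuming single linkage (minimum pairwise Euclidean distance) as the intercluster distance; the `agglomerate` function is hypothetical and far slower than a library implementation.

```python
import math

def agglomerate(points, n_clusters=None, distance_threshold=None):
    """Naive agglomerative clustering with single linkage."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    target = n_clusters if n_clusters is not None else 1
    while len(clusters) > target:     # stop rule 1: desired cluster count reached
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Stop rule 2: the closest pair of clusters is farther apart than the threshold.
        if distance_threshold is not None and d > distance_threshold:
            break
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

data = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
by_count = agglomerate(data, n_clusters=2)
by_distance = agglomerate(data, distance_threshold=2.0)
```

With `n_clusters=2` the loop keeps merging until two clusters remain; with `distance_threshold=2.0` it stops at three clusters because the remaining clusters are more than 2.0 apart.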

Agglomerative clustering: Hyperparameters

n_clusters: the number of clusters you want in your final model.
linkage: the linkage method used to determine which clusters to merge.
affinity: the metric used to calculate the distance between clusters (newer scikit-learn versions call this parameter metric). Default = euclidean distance.
distance_threshold: the distance above which clusters will not be merged.

Agglomerative clustering PROs

scales reasonably well, can detect clusters of various shapes.

What is considered a good clustering model?

1. **Clearly identifiable** clusters (within each cluster, or intracluster, the **points are close to each other**).
2. Each cluster is **well separated** from **other clusters** (between the clusters themselves, or intercluster, you want **lots of empty space**).

K-means: metrics to evaluate good clusters

1. Inertia
2. Silhouette Score

K-means: Inertia

Inertia is a metric used in K-means clustering to **measure the quality** of the clustering. It is the sum of the squared distances between **each data point** and its **assigned cluster centroid**.

**Lower inertia, better clustering.** The goal of K-means is to minimize inertia. For a given k, a lower inertia indicates that the data points are more **tightly clustered around their respective centroids**, suggesting a better clustering solution.
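
As a worked illustration, inertia can be computed directly from its definition. The sketch assumes Euclidean distance; the `inertia` helper, the points, and the hand-picked centroids are hypothetical.

```python
import math

def inertia(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        math.dist(p, centroids[lab]) ** 2
        for p, lab in zip(points, labels)
    )

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]

# Two tight clusters, each centroid at the mean of its two points.
two_clusters = inertia(points, [(1.0, 0.0), (11.0, 0.0)], [0, 0, 1, 1])

# A single cluster whose centroid is the mean of all four points.
one_cluster = inertia(points, [(6.0, 0.0)], [0, 0, 0, 0])
```

Here `two_clusters` is 4.0 and `one_cluster` is 104.0, so the two-cluster solution fits this data far more tightly.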

K-means: Silhouette Score

A **more precise** evaluation metric than inertia because it also takes into account the separation between clusters. The silhouette score is defined as the mean of the silhouette coefficients of all the observations in the model. It provides insight into what the **optimal value for k** should be, and uses both **intracluster and intercluster** measurements in its calculation.
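
A minimal sketch of the calculation, using the standard silhouette coefficient s = (b - a) / max(a, b), where a is an observation's mean distance to the other points in its own cluster and b is its mean distance to the points of the nearest other cluster. The helper names and sample data are hypothetical, and the code assumes at least two clusters and Euclidean distance.

```python
import math

def mean_distance(i, points, members):
    """Mean distance from point i to the points at the given indices."""
    return sum(math.dist(points[i], points[j]) for j in members) / len(members)

def silhouette_coefficient(i, points, labels):
    own = [j for j, lab in enumerate(labels) if lab == labels[i] and j != i]
    if not own:
        return 0.0  # common convention for a cluster with a single point
    a = mean_distance(i, points, own)  # intracluster distance
    b = min(                           # distance to the nearest other cluster
        mean_distance(i, points, [j for j, lab in enumerate(labels) if lab == other])
        for other in set(labels)
        if other != labels[i]
    )
    return (b - a) / max(a, b)

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all observations."""
    n = len(points)
    return sum(silhouette_coefficient(i, points, labels) for i in range(n)) / n

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score(points, [0, 0, 1, 1])  # sensible grouping, close to 1
bad = silhouette_score(points, [0, 1, 0, 1])   # mixed-up grouping, below 0
```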

Inertia Score

- Lower = better (less distance between each observation and its nearest centroid).
- 0 = useless (every point sits exactly on its centroid, for example when all points are overlapping each other in the center).

Inertia Score PROs

- Helps us decide on the **optimal k value.**
- We do this by using the **elbow method.**

Elbow Method

A plot of **inertia vs. k values** (1, 2, 3, etc.).
- A good way of choosing an **optimal k value** is to find the elbow of the curve.
- This is the value of k at which the decrease in inertia starts to level off.

Explain Silhouette Scores (-1,0,1)

1 = optimal (an observation sits nicely within its own cluster and is well separated from other clusters).
0 = an observation is on the boundary between clusters.
-1 = an observation is in the wrong cluster.

Unsupervised Machine Learning Flashcards

(39 cards)