DBSCAN Flashcards

(43 cards)

1
Q

What does the acronym DBSCAN stand for?

A

Density-based spatial clustering of applications with noise.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Who proposed the DBSCAN algorithm and in what year?

A

Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in 1996.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is the fundamental principle of the DBSCAN algorithm?

A

It groups together points that are closely packed and marks points that lie alone in low-density regions as outliers.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

In DBSCAN, what does the parameter $\epsilon$ (epsilon) define?

A

It defines the radius of the neighborhood around each point to be considered for density.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What does the minPts parameter in DBSCAN specify?

A

It specifies the minimum number of points required within a point’s $\epsilon$-radius to form a dense region.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

How is a ‘core point’ defined in the DBSCAN algorithm?

A

A point is a core point if at least minPts points (including itself) are within its $\epsilon$-radius.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

In DBSCAN, a point ‘q’ is ‘directly reachable’ from point ‘p’ if ‘q’ is within distance $\epsilon$ from ‘p’, and ‘p’ is a _____ point.

A

core

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What does it mean for a point ‘q’ to be ‘reachable’ from a point ‘p’ in DBSCAN?

A

There is a path of points from ‘p’ to ‘q’ where each subsequent point is directly reachable from the previous one.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

In DBSCAN, what are points that are not reachable from any other point called?

A

Outliers or noise points.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

How is a ‘border point’ characterized in DBSCAN?

A

It is a non-core point that is reachable from a core point, effectively forming the ‘edge’ of a cluster.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

According to DBSCAN’s logic, a core point forms a cluster with all other points that are _____ from it.

A

reachable

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Is the ‘reachability’ relationship in DBSCAN symmetric? Why or why not?

A

No, because by definition only core points can reach non-core points, but the reverse is not true.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

When are two points ‘p’ and ‘q’ considered ‘density-connected’ in DBSCAN?

A

They are density-connected if there is a core point ‘o’ from which both ‘p’ and ‘q’ are reachable.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is a major advantage of DBSCAN over an algorithm like k-means regarding the number of clusters?

A

DBSCAN does not require the user to specify the number of clusters a priori.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What kind of cluster shapes can DBSCAN identify that many other algorithms cannot?

A

DBSCAN can find arbitrarily-shaped clusters, including a cluster completely surrounded by another.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

How does the MinPts parameter help reduce the ‘single-link effect’ in DBSCAN?

A

It prevents different clusters from being connected by just a thin line of points, requiring a certain density to bridge them.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

What is a primary limitation of DBSCAN when dealing with data containing clusters of varying densities?

A

A single combination of minPts and $\epsilon$ cannot be chosen appropriately for all clusters simultaneously.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

In what specific situation is the DBSCAN algorithm not entirely deterministic?

A

Border points that are reachable from more than one cluster can be assigned to either, depending on the data processing order.

19
Q

The effectiveness of DBSCAN in high-dimensional data can be hindered by the so-called ‘_____ of dimensionality’.

20
Q

What graphical method is commonly used to help choose an appropriate value for the $\epsilon$ parameter?

A

A k-distance graph, plotting the distance to the k-th nearest neighbor (where k = minPts - 1).

21
Q

On a k-distance graph for DBSCAN, a good value for $\epsilon$ is typically found where the plot shows an ‘_____’.

22
Q

What is a common rule of thumb for setting the minPts parameter based on the number of dimensions, D?

A

Set minPts to be greater than or equal to D + 1, or often minPts = 2 * D.

23
Q

Why is setting minPts = 1 not sensible for DBSCAN?

A

Because every point would be defined as a core point, making every point a tiny cluster.

24
Q

If minPts is set to 2 or less, DBSCAN’s result becomes equivalent to what other clustering method?

A

Hierarchical clustering with the single link metric.

25
With an accelerating index structure, what is the average runtime complexity of DBSCAN?
$O(n \log n)$
26
What is the worst-case runtime complexity of DBSCAN without an accelerating index structure?
$O(n^2)$
27
What is the memory complexity of a standard, non-matrix-based implementation of DBSCAN?
$O(n)$
28
What is HDBSCAN*?
A hierarchical version of DBSCAN that produces a hierarchical clustering and no longer has the concept of border points.
29
Since DBSCAN is an unsupervised algorithm, what evaluation method can be used to measure how well-formed its clusters are?
Silhouette analysis, which measures cluster cohesion versus separation.
30
In silhouette analysis for evaluating clustering, what does a higher score (closer to 1) indicate?
It indicates that the clusters are well-defined, dense, and well-separated from each other.
31
In the silhouette score formula, $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$, what does $a(i)$ represent?
The average distance from point $i$ to all other points within the same cluster (cohesion).
32
In the silhouette score formula, $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$, what does $b(i)$ represent?
The average distance from point $i$ to all points in the nearest neighboring cluster (separation).
33
The _____ algorithm can be seen as a generalization of DBSCAN that replaces the $\epsilon$ parameter with a maximum search radius.
OPTICS
34
What does GDBSCAN (Generalized DBSCAN) generalize from the original algorithm?
It generalizes the concepts of 'neighborhood' and 'dense' into arbitrary predicates, moving beyond the fixed $\epsilon$ and `minPts` parameters.
35
What is the typical input for the DBSCAN algorithm?
A dataset where each data point is represented by a set of numerical features or attributes.
36
What is the output of the DBSCAN algorithm?
Cluster assignments in the form of a label for each data point, including a label for noise.
37
DBSCAN is designed to be used with databases that can accelerate _____ queries, for example by using an R* tree.
region
38
Name one of the use cases for DBSCAN mentioned in the source material.
Customer Segmentation, Anomaly Detection, Image Compression, or Document Clustering.
39
The variation DBSCAN* achieves a fully deterministic result by treating _____ as noise.
border points
40
Does DBSCAN guarantee that every cluster will contain at least `minPts` points? Why or why not?
No, because a cluster only needs one core point, and border points might be shared between clusters.
41
What is the first step in the abstract, high-level DBSCAN algorithm?
Find all points in the $\epsilon$-neighborhood of every point and identify the core points.
42
After identifying core points, what is the second step in the abstract DBSCAN algorithm?
Find the connected components of core points on the neighbor graph, ignoring all non-core points.
43
What is the final step in the abstract DBSCAN algorithm concerning non-core points?
Assign each non-core point to a nearby cluster if it's an $\epsilon$-neighbor, otherwise assign it to noise.