Week 7 - Vector Databases Flashcards

(15 cards)

1
Q

Which of the following is a primary use case for vector databases in cloud computing?

A. Storing transactional data.
B. Storing and querying high-dimensional vectors for similarity search.
C. Managing relational data.
D. Performing batch processing of structured data.

A

B. Storing and querying high-dimensional vectors for similarity search.

2
Q

Which of the following is a key characteristic of a vector database?

A. Support for SQL queries.
B. Handling ACID transactions.
C. Optimized for similarity searches in high-dimensional spaces.
D. Designed for data warehousing.

A

C. Optimized for similarity searches in high-dimensional spaces.

3
Q

Which of the following Faiss Index types is most suitable for brute-force search in a
small dataset?

A. IndexIVFFlat.
B. IndexFlatL2.
C. IndexIVFPQ.
D. IndexHNSW.

A

B. IndexFlatL2.

4
Q

In Faiss, which index is designed to handle very large datasets by partitioning the data
into smaller subsets, each indexed independently?

A. IndexIVFFlat.
B. IndexFlatIP.
C. IndexHNSW.
D. IndexLSH.

A

A. IndexIVFFlat.

5
Q

In Faiss, an _____ is constructed by dividing the set of vectors into k Voronoi partitions.

A. Locality Sensitive Hashing Index
B. Inverted File Index
C. Hierarchical Navigable Small Worlds Index
D. Flat Index

A

B. Inverted File Index
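The Voronoi-partition idea behind the Inverted File Index can be sketched in plain NumPy. Everything here (sizes, names, the `ivf_search` helper) is illustrative, not the Faiss API:

```python
import numpy as np

# Toy sketch of the inverted-file idea: divide the vectors into k Voronoi
# partitions with k-means, then search only the nearest partition(s).
rng = np.random.default_rng(0)
data = rng.standard_normal((200, 8)).astype("float32")
k = 4  # number of Voronoi partitions

# A few iterations of plain k-means to place the partition centroids.
centroids = data[rng.choice(len(data), k, replace=False)]
for _ in range(10):
    assign = np.argmin(((data[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    for i in range(k):
        if (assign == i).any():
            centroids[i] = data[assign == i].mean(axis=0)

# The "inverted file" maps each partition id to the vector ids it holds.
inverted_file = {i: np.where(assign == i)[0] for i in range(k)}

def ivf_search(query, nprobe=1):
    # Probe only the nprobe nearest partitions instead of the whole set.
    order = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
    ids = np.concatenate([inverted_file[i] for i in order])
    dists = ((data[ids] - query) ** 2).sum(-1)
    return int(ids[np.argmin(dists)])

print(ivf_search(data[7] + 0.01, nprobe=2))
```

Faiss's IndexIVFFlat implements this idea (with a training step and an `nprobe` search parameter) far more efficiently.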

6
Q

In the Inverted File Product Quantization Index (IVFPQ), the ______ algorithm is run on
vectors from all the partitions:

A. K-means clustering
B. Principal Component Analysis
C. Product Quantization
D. Singular Value Decomposition

A

C. Product Quantization

7
Q

In the Inverted File Product Quantization Index (IVFPQ), sub-vectors are quantized into
a finite number of _____, each represented by ______.

A. clusters; a centroid
B. bits; a hash code
C. partitions; a mean value
D. blocks; an eigenvector

A

A. clusters; a centroid

8
Q

What is a vector database?

A

A vector database is a type of database designed to efficiently store, manage and query high-dimensional vector data. Commonly used in machine learning (ML) and natural language processing (NLP), it is optimised for operations like similarity search, where the goal is to find the nearest vectors to a given query vector. Vector databases enable scalable and efficient handling of complex data.
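As a toy illustration of the similarity-search operation described above (the vectors, ids and labels are made up, not real model embeddings):

```python
import numpy as np

# Minimal sketch of the core operation a vector database optimises:
# given a query vector, find the stored vector most similar to it.
embeddings = np.array([
    [0.9, 0.1, 0.0],   # id 0: "cat"
    [0.8, 0.2, 0.1],   # id 1: "kitten"
    [0.0, 0.1, 0.9],   # id 2: "car"
], dtype="float32")

def nearest(query):
    # Cosine similarity: dot product of L2-normalised vectors.
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return int(np.argmax(e @ q))

print(nearest(np.array([0.85, 0.15, 0.05])))  # "cat" (id 0) is the closest
```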

9
Q

What are the 3 foundations of vector databases?

A

Vector Embeddings
Similarity Search
Indexing

10
Q

What are the 4 FAISS indexes?

A
  1. Flat Index
  2. Hierarchical Navigable Small Worlds Index
  3. Inverted File Product Quantization Index
  4. Locality Sensitive Hashing Index
11
Q

Explain the Flat Index, its usage, Advantages and Disadvantages

A

The simplest index in Faiss: all vectors are stored in memory and compared in a brute-force manner. During a query, the index computes the distance from the query vector to every vector in the dataset and returns the closest matches.

Usage: Best for small datasets where computational resources and time are not major concerns, as the method involves a complete search through all vectors. Highly accurate but potentially quite slow; it has the best accuracy of all the indexing methods.

Adv: High accuracy; simple to implement.

Disadv: Computationally expensive for large datasets; requires significant memory to store all vectors in RAM.
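The brute-force behaviour of a Flat index can be sketched in pure NumPy. The class and its add/search interface loosely mirror Faiss's IndexFlatL2, but this is an illustration, not the real API:

```python
import numpy as np

# Pure-NumPy sketch of what a Flat index does: brute-force L2 search
# over every stored vector (exact, but O(n) per query).
class FlatL2:
    def __init__(self, dim):
        self.vectors = np.empty((0, dim), dtype="float32")

    def add(self, xs):
        # A Flat index simply keeps every vector in memory as-is.
        self.vectors = np.vstack([self.vectors, xs])

    def search(self, query, k):
        # Compare the query against *all* stored vectors.
        dists = ((self.vectors - query) ** 2).sum(axis=1)
        ids = np.argsort(dists)[:k]
        return dists[ids], ids

index = FlatL2(4)
index.add(np.eye(4, dtype="float32"))
dists, ids = index.search(np.array([1.0, 0.1, 0.0, 0.0], dtype="float32"), k=2)
print(ids)  # the first basis vector is the closest match
```

The real faiss.IndexFlatL2 follows the same add/search pattern, but searches whole batches of queries at once.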

12
Q

Explain Hierarchical Navigable Small Worlds Index (HNSW)

A

A graph-based index that constructs a graph of vectors in which each vector is connected to its nearest neighbors. The search process starts from an entry point in the graph and navigates towards the closest neighbors of the query vector.

Usage: Designed for approximate nearest neighbor search; suitable for very large datasets where an exact search may not be feasible. Used where low latency and real-time performance are critical.

Adv: Much faster than brute force, especially for large datasets; provides a good trade-off between accuracy and performance; scalable to large datasets, as it only navigates through the relevant parts of the graph.
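A toy, single-layer sketch of the graph-navigation idea (real HNSW adds a hierarchy of layers and richer candidate lists, so this is a simplification, not the actual algorithm):

```python
import numpy as np

# Build a nearest-neighbour graph, then greedily walk from an entry
# point towards the query, stopping at a local minimum (approximate).
rng = np.random.default_rng(1)
data = rng.standard_normal((100, 6)).astype("float32")

# Connect each vector to its M nearest neighbours.
M = 8
d2 = ((data[:, None] - data[None]) ** 2).sum(-1)
neighbours = np.argsort(d2, axis=1)[:, 1 : M + 1]  # column 0 is the vector itself

def greedy_search(query, entry=0):
    current = entry
    while True:
        # Move to whichever candidate (a neighbour, or where we are now)
        # is closest to the query; stop when no neighbour improves.
        candidates = np.append(neighbours[current], current)
        dists = ((data[candidates] - query) ** 2).sum(-1)
        best = int(candidates[np.argmin(dists)])
        if best == current:
            return current
        current = best

print(greedy_search(data[42] + 0.01))
```

Because only a small part of the graph is visited, each query is far cheaper than a brute-force scan, at the cost of occasionally stopping at a local minimum.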

13
Q

What are the types of vector databases?

A

In-memory vector databases
e.g. RedisAI, TorchServe

Store vectors in memory, enabling swift read/write operations and supporting real-time analytics.

Disk-based vector databases
e.g. Annoy, Milvus, ScaNN

Store vectors on disk; suitable for large datasets, using indexing and compression techniques.

Distributed vector databases
e.g. FAISS, Dask-ML

Spread vector data across multiple nodes or servers, providing horizontal scalability and fault tolerance; suitable for managing massive datasets and high-throughput tasks.

14
Q

What is RAG and what is the generalized RAG approach?

A

RAG (Retrieval-Augmented Generation) is a method that combines retrieval-based models with generative models to enhance the generation of contextually relevant and accurate responses.

It uses a retrieval mechanism to fetch relevant documents or pieces of information, which are then used to condition the generative model.

  1. User sends a query.
  2. App forwards the query.
  3. RAG retrieves context and generates a response.
  4. LLM returns the response.
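The generalised RAG loop can be sketched as below. The corpus, the word-overlap "retriever", and the template "generator" are all stand-ins: a real system would use embedding similarity (e.g. a vector database) and an actual LLM call.

```python
# Minimal, self-contained sketch of a RAG pipeline.
corpus = [
    "Faiss is a library for efficient similarity search.",
    "HNSW is a graph-based approximate nearest neighbour index.",
    "Product quantization compresses vectors into centroid codes.",
]

def retrieve(query, k=1):
    # Steps 2-3: fetch the k documents most relevant to the query,
    # scored here by simple word overlap (a stand-in for vector search).
    q = set(query.lower().strip("?").split())
    scores = [len(q & set(doc.lower().strip(".").split())) for doc in corpus]
    top = sorted(range(len(corpus)), key=lambda i: -scores[i])[:k]
    return [corpus[i] for i in top]

def generate(query, context):
    # Steps 3-4: a real system would prompt an LLM with the retrieved
    # context; here we just template the answer.
    return f"Based on {context[0]!r}, answering: {query!r}"

query = "What is product quantization?"
print(generate(query, retrieve(query)))
```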
15
Q

Explain Inverted File Product Quantization Index (IVFPQ)

A

A two-stage index that first partitions the dataset into multiple clusters (the inverted file) and then applies product quantization within each cluster to reduce memory usage and search time. Vectors are assigned to a cluster, and only a subset of clusters is searched based on the query vector, reducing the number of distance computations.

Usage: Good for large-scale approximate nearest neighbor search where memory efficiency and search speed are critical. Common in large image-retrieval systems where exact results aren't necessary.

Adv: Scalable to very large datasets; product quantization reduces memory usage while retaining much of the search accuracy; efficient and fast, especially for large datasets.

Disadv: Search is more approximate and might not always return the exact nearest neighbor; more complex to implement and tune compared to the Flat Index.
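The product-quantization stage can be sketched in pure NumPy. All sizes here are illustrative, and this omits the inverted-file stage and Faiss's optimised training:

```python
import numpy as np

# Split each vector into m sub-vectors, run k-means in each sub-space,
# and store only the id of the nearest centroid per sub-space.
rng = np.random.default_rng(0)
data = rng.standard_normal((500, 8)).astype("float32")
m, ksub = 4, 16  # 4 sub-vectors of dimension 2, 16 centroids per sub-space
subdim = data.shape[1] // m

codebooks = []
codes = np.empty((len(data), m), dtype=np.int64)
for j in range(m):
    sub = data[:, j * subdim : (j + 1) * subdim]
    # A few iterations of plain k-means in this sub-space.
    cents = sub[rng.choice(len(sub), ksub, replace=False)]
    for _ in range(10):
        assign = np.argmin(((sub[:, None] - cents[None]) ** 2).sum(-1), axis=1)
        for i in range(ksub):
            if (assign == i).any():
                cents[i] = sub[assign == i].mean(axis=0)
    codebooks.append(cents)
    # Each sub-vector is replaced by the id of its nearest centroid.
    codes[:, j] = np.argmin(((sub[:, None] - cents[None]) ** 2).sum(-1), axis=1)

# Each vector is now m small integers (here 4 x 4 bits) instead of
# 8 float32 values (32 bytes), at the cost of approximation error.
def reconstruct(i):
    return np.concatenate([codebooks[j][codes[i, j]] for j in range(m)])

err = float(np.mean((data[0] - reconstruct(0)) ** 2))
print(codes[0], round(err, 3))
```

Search in a PQ index then compares the query against these compact codes via the codebooks, rather than against the full vectors.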
