Which of the following is a primary use case for vector databases in cloud computing?
A. Storing transactional data.
B. Storing and querying high-dimensional vectors for similarity search.
C. Managing relational data.
D. Performing batch processing of structured data.
B. Storing and querying high-dimensional vectors for similarity search.
Which of the following is a key characteristic of a vector database?
A. Support for SQL queries.
B. Handling ACID transactions.
C. Optimized for similarity searches in high-dimensional spaces.
D. Designed for data warehousing.
C. Optimized for similarity searches in high-dimensional spaces.
Which of the following Faiss Index types is most suitable for brute-force search in a
small dataset?
A. IndexIVFFlat.
B. IndexFlatL2.
C. IndexIVFPQ.
D. IndexHNSW.
B. IndexFlatL2.
In Faiss, which index is designed to handle very large datasets by partitioning the data
into smaller subsets, each indexed independently?
A. IndexIVFFlat.
B. IndexFlatIP.
C. IndexHNSW.
D. IndexLSH.
A. IndexIVFFlat.
In Faiss, an _____ is constructed by dividing the set of vectors into k Voronoi partitions.
A. Locality Sensitive Hashing Index
B. Inverted File Index
C. Hierarchical Navigable Small Worlds Index
D. Flat Index
B. Inverted File Index
In the Inverted File Product Quantization Index (IVFPQ), the ______ algorithm is run on
vectors from all the partitions:
A. K-means clustering
B. Principal Component Analysis
C. Product Quantization
D. Singular Value Decomposition
C. Product Quantization
In the Inverted File Product Quantization Index (IVFPQ), sub-vectors are quantized into
a finite number of _____, each represented by ______.
A. clusters; a centroid
B. bits; a hash code
C. partitions; a mean value
D. blocks; an eigenvector
A. clusters; a centroid
What is a vector database?
A vector database is a type of database designed to efficiently store, manage, and query high-dimensional vector data. It is commonly used in machine learning and NLP and is optimized for operations like similarity search, where the goal is to find the vectors nearest to a given query vector. Vector databases enable scalable and efficient handling of complex data.
What are the 3 foundations of vector databases?
Vector Embeddings
Similarity Search
Indexing
What are the 4 FAISS indexes?
Explain the Flat Index, its usage, advantages and disadvantages
The simplest index in Faiss. All vectors are stored in memory and compared in a brute-force manner. During a query, the index computes the distance from the query vector to every vector in the dataset and returns the closest matches.
Usage: Best for small datasets where computational resources and time are not major concerns, since the method searches through all vectors. Highly accurate but potentially quite slow. Best accuracy of all indexing methods.
Adv: High accuracy, simple to implement.
Disadv: Computationally expensive for large datasets; requires significant memory to store all vectors in RAM.
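The brute-force behaviour described above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not Faiss's implementation; the function names `l2_distance` and `flat_search` and the sample vectors are made up for the example.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def flat_search(vectors, query, k):
    """Compare the query against every stored vector and return the
    k closest as (index, distance) pairs, nearest first."""
    scored = [(i, l2_distance(v, query)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

# Hypothetical sample data: four 2-dimensional vectors.
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
results = flat_search(vectors, [1.0, 1.0], k=2)
print(results)
```

Every query touches every stored vector, which is why the result is exact and why the cost grows linearly with the size of the dataset.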
Explain Hierarchical Navigable Small Worlds Index (HNSW)
A graph-based index that constructs a graph of vectors in which each vector is connected to its nearest neighbors. The search starts from an entry point in the graph and navigates toward the closest neighbors of the query vector.
Usage: Designed for approximate nearest neighbor search; suitable for very large datasets where exact search may not be feasible. Used where low latency and real-time performance are critical.
Adv: Much faster than brute force, especially for large datasets.
Provides a good tradeoff between accuracy and performance.
Scalable to large datasets, as it only navigates through the relevant parts of the graph.
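The graph navigation idea can be illustrated with a heavily simplified sketch. It assumes a single graph layer built by brute force and a purely greedy walk; a real HNSW index adds a hierarchy of layers, incremental insertion, and a candidate list during search. The names `build_knn_graph` and `greedy_search` are hypothetical.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_knn_graph(vectors, m):
    """Link every vector to its m nearest neighbors.
    (Brute-force build for illustration only.)"""
    graph = {}
    for i, v in enumerate(vectors):
        others = sorted((j for j in range(len(vectors)) if j != i),
                        key=lambda j: dist(v, vectors[j]))
        graph[i] = others[:m]
    return graph

def greedy_search(vectors, graph, query, entry):
    """Start at the entry node and keep moving to the neighbor closest to
    the query; stop when no neighbor improves on the current node."""
    current = entry
    current_dist = dist(vectors[current], query)
    path = [current]
    while True:
        best, best_dist = current, current_dist
        for n in graph[current]:
            d = dist(vectors[n], query)
            if d < best_dist:
                best, best_dist = n, d
        if best == current:
            return current, path
        current, current_dist = best, best_dist
        path.append(current)

# Hypothetical data: ten points spaced evenly along a line.
vectors = [[float(i), 0.0] for i in range(10)]
graph = build_knn_graph(vectors, m=2)
nearest_id, path = greedy_search(vectors, graph, [7.2, 0.0], entry=0)
print(nearest_id, path)
```

The walk only evaluates the neighbors of the nodes it passes through, which is where the speed-up over brute force comes from. A greedy walk can also stop at a local minimum, which is one reason the search is approximate.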
What are the types of vector databases?
In-memory vector databases
e.g. RedisAI, TorchServe
Store vectors in memory, enable swift reads and writes, and support real-time analytics.
Disk-based vector databases
e.g. Annoy, Milvus, ScaNN
Store vectors on disk; suitable for large datasets; use indexing and compression techniques.
Distributed vector databases
e.g. FAISS, Dask-ML
Spread vector data across multiple nodes or servers; provide horizontal scalability and fault tolerance; suitable for managing massive datasets and high-throughput tasks.
What is RAG and what is the generalized RAG approach?
RAG (retrieval-augmented generation) is a method that combines retrieval-based models with generative models to enhance the generation of contextually relevant and accurate responses.
It uses a retrieval mechanism to fetch relevant documents or pieces of information, which are then used to condition the generative model.
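The retrieve-then-condition flow can be sketched as follows. This is a toy example under stated assumptions: the "embedding" is a simple word count rather than a learned model, the documents are invented, and the generative model is left out entirely, so the sketch stops at the prompt that would be passed to it. The names `embed`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts. A real system would use a learned
    embedding model here."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(question, documents, k):
    """Retrieval step: rank documents by similarity to the question."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_docs):
    """Conditioning step: place the retrieved text in front of the question."""
    context = "\n".join("- " + d for d in context_docs)
    return ("Answer using only this context:\n" + context
            + "\n\nQuestion: " + question)

documents = [
    "faiss is a library for similarity search over dense vectors",
    "hnsw builds a graph that links each vector to its nearest neighbors",
    "product quantization compresses vectors into short codes",
]
question = "how does hnsw find nearest neighbors"
top_docs = retrieve(question, documents, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)
```

In a full pipeline the retrieval step would typically query a vector index, and the resulting prompt would be sent to the generative model.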
Explain Inverted File Product Quantization Index (IVFPQ)
A two-stage index that first partitions the dataset into multiple clusters (the inverted file) and then applies product quantization within each cluster to reduce memory usage and search time. Each vector is assigned to a cluster, and only a subset of clusters is searched based on the query vector, reducing the number of distance computations.
Usage: Good for large-scale approximate nearest neighbor search where memory efficiency and search speed are critical.
Common in large image retrieval systems where exact results aren't necessary.
Adv: Scalable to very large datasets.
Product quantization reduces memory usage while retaining much of the search accuracy.
Efficient and fast, especially for large datasets.
Disadv: Search is more approximate and might not always return the exact nearest neighbor.
More complex to implement and tune compared to the flat index.
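The two stages can be sketched in plain Python as below. This is a minimal illustration, not Faiss's implementation: the class `TinyIVFPQ` and its helpers are hypothetical, the k-means seeding is deliberately simplistic, and raw vectors are encoded, whereas production implementations commonly encode the residual relative to the cluster centroid. The parameter names `nlist`, `m`, `ksub`, and `nprobe` follow common Faiss terminology.

```python
def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centroids, v):
    return min(range(len(centroids)), key=lambda c: dist2(centroids[c], v))

def kmeans(points, k, iters=10):
    """Plain Lloyd's k-means, seeded with evenly spaced points so the
    example is deterministic."""
    step = len(points) // k
    centroids = [list(points[i * step]) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centroids, p)].append(p)
        for c, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(g) for col in zip(*g)]
    return centroids

def split(v, m):
    """Cut a vector into m equal-length sub-vectors."""
    size = len(v) // m
    return [v[i * size:(i + 1) * size] for i in range(m)]

class TinyIVFPQ:
    def __init__(self, nlist, m, ksub):
        self.nlist, self.m, self.ksub = nlist, m, ksub

    def train(self, vectors):
        # Stage 1: coarse partitioning into nlist clusters.
        self.coarse = kmeans(vectors, self.nlist)
        # Stage 2: one codebook of ksub centroids per sub-vector position.
        self.codebooks = [
            kmeans([split(v, self.m)[j] for v in vectors], self.ksub)
            for j in range(self.m)
        ]
        self.lists = [[] for _ in range(self.nlist)]

    def add(self, vectors):
        for i, v in enumerate(vectors):
            # Each sub-vector is replaced by the id of its nearest centroid.
            code = [nearest(self.codebooks[j], s)
                    for j, s in enumerate(split(v, self.m))]
            self.lists[nearest(self.coarse, v)].append((i, code))

    def search(self, query, k, nprobe):
        # Visit only the nprobe clusters closest to the query.
        probe = sorted(range(self.nlist),
                       key=lambda c: dist2(self.coarse[c], query))[:nprobe]
        qsubs = split(query, self.m)
        # table[j][c]: squared distance from query sub-vector j to centroid c.
        table = [[dist2(qsubs[j], cen) for cen in self.codebooks[j]]
                 for j in range(self.m)]
        scored = []
        for c in probe:
            for vid, code in self.lists[c]:
                approx = sum(table[j][code[j]] for j in range(self.m))
                scored.append((approx, vid))
        scored.sort()
        return [vid for _, vid in scored[:k]]

# Hypothetical data: two well-separated groups of 4-dimensional vectors.
vectors = ([[i * 0.1] * 4 for i in range(8)]
           + [[10 + i * 0.1] * 4 for i in range(8)])
index = TinyIVFPQ(nlist=2, m=2, ksub=4)
index.train(vectors)
index.add(vectors)
hits = index.search([10.2] * 4, k=3, nprobe=1)
print(hits)
```

Only the codes and centroids are stored, not the original vectors, so distances are approximate. Raising `nprobe` searches more clusters, trading speed for recall.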