Should you compare relational or non-relational dbs in an interview?
Most interviewers don’t need an explicit comparison of SQL and NoSQL databases in your session, and it’s a pothole you should avoid entirely. Instead, talk about what you know about the database you’re using and how it will help you solve the problem at hand.
How are indexes typically implemented?
B-Tree or Hashtable
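To make the trade-off concrete, here is a minimal sketch that assumes a tiny in-memory table. A Python dict stands in for a hash index and a sorted list searched with `bisect` stands in for a B-tree; real databases use on-disk structures, and the `rows`, `lookup`, and `range_scan` names are made up for illustration.

```python
import bisect

rows = [
    {"id": 3, "name": "carol"},
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
    {"id": 5, "name": "erin"},
]

# Hash index: key -> row position. Fast point lookups, but no ordering.
hash_index = {row["id"]: pos for pos, row in enumerate(rows)}

# Ordered index: sorted (key, row position) pairs, standing in for a B-tree.
ordered_index = sorted((row["id"], pos) for pos, row in enumerate(rows))
ordered_keys = [key for key, _ in ordered_index]


def lookup(key):
    """Point lookup through the hash index."""
    pos = hash_index.get(key)
    return rows[pos] if pos is not None else None


def range_scan(low, high):
    """Return rows with low <= id <= high, in key order, via the ordered index."""
    start = bisect.bisect_left(ordered_keys, low)
    end = bisect.bisect_right(ordered_keys, high)
    return [rows[pos] for _, pos in ordered_index[start:end]]
```

The hash index answers "give me id 2" directly, while only the ordered index can answer "give me ids 2 through 4" without scanning every row.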
What properties of relational DBs make them an appropriate choice?
The great thing about relational databases is (a) their support for arbitrarily many indexes, which allows you to optimize for different queries and (b) their support for multi-column and specialized indexes (e.g. geospatial indexes, full-text indexes).
ACID, transactional isolation, well defined schema.
What are transactions in a relational DB?
Transactions are a way of grouping multiple operations together into a single atomic operation: either all of them take effect or none do.
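As a rough illustration, the sketch below uses Python's built-in `sqlite3` module and an in-memory database. The `accounts` table and the `transfer` and `balances` helpers are assumptions made up for this example, not part of any particular system.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return {account_id: balance} for every account."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()
```

Calling `transfer(conn, 1, 2, 500)` fails the balance check, so the rollback undoes both updates and the balances are left unchanged.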
With regard to ACID compliance, how do RDBMS and non-relational systems compare?
Relational (SQL): strict ACID compliance is the default.
Non-Relational (NoSQL): Default is usually “Eventual Consistency” (speed over strictness), but ACID is often available as an optional feature if you are willing to take a performance hit. They are BASE compliant by default.
What kinds of data models do non-relational DBs support?
Key value stores
- fast access
- simple model
Document stores
- flexible
- schemaless
Column family stores
- scalable
- high performance for writes
Graph DBs
- relationships
- efficient retrieval
What are compelling reasons for using a non-relational database?
Flexible Data Models: Your data model is evolving or you need to store different types of data structures without a fixed schema.
Scalability: Your application needs to scale horizontally (across many servers) to accommodate large amounts of data or high user loads.
Handling Big Data and Real-Time Web Apps: You have applications dealing with large volumes of data, especially unstructured data, or applications requiring real-time data processing and analytics.
Do NoSQL dbs support indexing?
Yes
What should you know about blob storage?
Durability: Blob storage services are designed to be incredibly durable.
Scalability: Hosted blob storage solutions like AWS S3 can be considered infinitely scalable.
Cost: Blob storage services are designed to be cost effective.
Security: Blob storage services have built-in security features like encryption at rest and in transit.
Upload and Download Directly from the Client: Blob storage services allow you to upload and download blobs directly from the client.
Chunking: Modern blob storage services like S3 support chunking out of the box via the multipart upload API.
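The chunking idea can be sketched without any cloud SDK. The helpers below are hypothetical and do not call the S3 API; they only show how a large blob might be split into numbered parts that can be uploaded independently and reassembled in order.

```python
def split_into_parts(data: bytes, part_size: int):
    """Split `data` into (part_number, chunk) pairs of at most `part_size` bytes."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    return [
        (number, data[offset:offset + part_size])
        for number, offset in enumerate(range(0, len(data), part_size), start=1)
    ]


def reassemble(parts):
    """Concatenate parts in part-number order, regardless of arrival order."""
    return b"".join(chunk for _, chunk in sorted(parts))
```

Because each part carries its own number, parts can be sent in parallel or retried individually, and a failed part does not force the whole upload to restart.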
What is a search optimized database and when should you use it?
Search optimized databases are specifically designed to handle full-text search. They use techniques like indexing, tokenization, and stemming to make search queries fast and efficient. Reach for one when full-text search is a core requirement of your system.
How do search optimized databases work?
They work by building what are called inverted indexes. Inverted indexes are a data structure that maps from words to the documents that contain them.
{
"word1": [doc1, doc2, doc3],
"word2": [doc2, doc3, doc4],
"word3": [doc1, doc3, doc4]
}
What should you know about inverted indexes?
Inverted Indexes: Search optimized databases use inverted indexes to make search queries fast and efficient.
Tokenization: Tokenization is the process of breaking a piece of text into individual words.
Stemming: Stemming is the process of reducing words to their root form. This allows you to match different forms of the same word. For example, “running” and “runs” would both be reduced to “run”.
Fuzzy Search: Fuzzy search is the ability to find results that are similar to a given search term. This is achieved through techniques like edit distance calculation, which measures how many letters need to be changed, added, or removed to transform one word into another.
Scaling: Just like traditional databases, search optimized databases scale by adding more nodes to a cluster and sharding data across those nodes.
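The indexing, tokenization, and stemming steps above can be sketched in a few lines. This is a toy, assuming a deliberately crude suffix-stripping `stem` function and made-up documents; real search engines use far more careful analyzers.

```python
import re
from collections import defaultdict


def stem(word):
    """Toy stemmer: strip a few common suffixes. Not a real stemming algorithm."""
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase the text, split it into words, and stem each word."""
    return [stem(token) for token in re.findall(r"[a-z0-9]+", text.lower())]


def build_index(docs):
    """Map each stemmed term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


docs = {
    "doc1": "She runs every day",
    "doc2": "Running shoes on sale",
    "doc3": "Coffee every day",
}
index = build_index(docs)
```

Because both documents and queries pass through the same `tokenize` function, a query for "run" matches documents containing "runs" or "Running".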
What are popular search optimized DBs?
Elasticsearch is the winner here.
Postgres has GIN indexes, which support full-text search, and Redis has a (in my opinion, quite immature and bad) full-text search capability.
In a systems design interview, when should you include an API gateway?
In nearly all product design style system design interviews, it is a good idea to include an API gateway in your design as the first point of contact for your clients.
What are the most common API gateways?
AWS API Gateway, Kong, and Apigee.
It’s also not uncommon to have an nginx or Apache webserver as your API gateway (in the early days of Amazon, a gigantic fleet of Apache webservers served this purpose).
When should you use an L4 loadbalancer vs L7 loadbalancer?
If you have persistent connections like websockets, you’ll likely want to use an L4 (transport) load balancer. Otherwise, use an L7 (application) load balancer.
What are the most popular load balancers?
The most common load balancers are AWS Elastic Load Balancer (a hosted offering from AWS), NGINX (an open-source webserver frequently used as a load balancer), and HAProxy (a popular open-source load balancer).
What are common use cases for queues?
Buffer for Bursty Traffic: In a ride-sharing application like Uber, queues can be used to manage sudden surges in ride requests. During peak hours or special events, ride requests can spike massively. A queue buffers these incoming requests, allowing the system to process them at a manageable rate without overloading the server or degrading the user experience.
Distribute Work Across a System: In a cloud-based photo processing service, queues can be used to distribute expensive image processing tasks. When a user uploads photos for editing or filtering, these tasks are placed in a queue. Different worker nodes then pull tasks from the queue, ensuring even distribution of workload and efficient use of computing resources.
What should you know about queues for your interview?
Message Ordering: Most queues are FIFO (first in, first out), meaning that messages are processed in the order they were received. However, ordering guarantees vary by system: Kafka, for example, guarantees order only within a partition, while some other queues support ordering based on a specified priority or time.
Retry Mechanisms: Many queues have built-in retry mechanisms.
Dead Letter Queues: Dead letter queues are used to store messages that cannot be processed. They’re useful for debugging and auditing, as they allow you to inspect messages that failed to be processed and understand why they failed.
Scaling with Partitions: Queues can be partitioned across multiple servers so that they can scale to handle more messages.
Backpressure: The biggest problem with queues is they make it easy to overwhelm your system. If my system supports 200 requests per second but I’m receiving 300 requests per second, I’ll never finish them! A queue is just obscuring the problem that I don’t have enough capacity. The answer is backpressure. Backpressure is a way of slowing down the production of messages when the queue is overwhelmed. This helps prevent the queue from becoming a bottleneck in your system. For example, if a queue is full, you might want to reject new messages or slow down the rate at which new messages are accepted, potentially returning an error to the user or producer.
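One simple form of backpressure is a bounded queue that rejects new work when full. The sketch below uses Python's standard `queue.Queue` with a `maxsize`; the `BoundedIntake` class and its method names are made up for illustration.

```python
import queue


class BoundedIntake:
    """Accepts messages until `capacity` is reached, then pushes back."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)

    def submit(self, message):
        """Return True if accepted, False if the producer should back off."""
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            return False

    def take(self):
        """Remove and return the oldest message (raises queue.Empty if none)."""
        return self._queue.get_nowait()
```

A `False` from `submit` is the signal a producer would translate into an error response or a retry with a delay, rather than letting the backlog grow without bound.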
What is event sourcing?
Event sourcing is a technique where changes in application state are stored as a sequence of events. These events can be replayed to reconstruct the application’s state at any point in time.
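As a minimal sketch of the idea, assume a bank account whose only state is a balance. The `Event` type, the event kinds, and the `replay` helper are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "deposited" or "withdrew"
    amount: int


def apply(balance, event):
    """Return the new balance after applying a single event."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrew":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events, upto=None):
    """Rebuild the balance from the first `upto` events (all events if None)."""
    balance = 0
    for event in events[:upto]:
        balance = apply(balance, event)
    return balance


log = [Event("deposited", 100), Event("withdrew", 30), Event("deposited", 5)]
```

The balance is never stored directly. `replay(log)` derives the current state, and `replay(log, 2)` derives the state as it was after the second event.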
Difference between stream and message queue?
Streams retain events for a configurable period and allow multiple consumers to replay data independently, while message queues delete messages after consumption and are designed for one-time task processing.
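The difference can be sketched with two toy in-memory classes. Both are assumptions for illustration only: `MessageQueue` removes a message once it is consumed, while `Stream` keeps an append-only log and tracks a separate read offset per consumer (retention expiry is left out for brevity).

```python
from collections import deque


class MessageQueue:
    """Each message is handed to one consumer and then removed."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def consume(self):
        """Return the oldest message, or None if the queue is empty."""
        return self._messages.popleft() if self._messages else None


class Stream:
    """Append-only log; each consumer tracks its own offset."""

    def __init__(self):
        self._log = []
        self._offsets = {}

    def publish(self, event):
        self._log.append(event)

    def read(self, consumer, max_events=10):
        """Return the next batch for `consumer` and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._log[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch

    def rewind(self, consumer, offset=0):
        """Move a consumer back so it can replay earlier events."""
        self._offsets[consumer] = offset
```

Two consumers reading the same `Stream` each see every event, and either one can rewind and replay without affecting the other.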
When should you use streams?
When you need to process large amounts of data in real-time.
- e.g. - real-time analytics of user engagements (likes, comments, shares) on posts.
- high volumes of engagement events generated by users across the globe. A stream processing system
When you need to support complex processing scenarios like event sourcing.
- Consider a banking system where every transaction (deposits, withdrawals, transfers) needs to be recorded and could affect multiple accounts.
When you need to support multiple consumers reading from the same stream.
- In a real-time chat application, when a user sends a message, it’s published to a stream associated with the chat room. This stream acts as a centralized channel where all chat participants are subscribers. This is a great example of a publish-subscribe pattern, which is a common use case for streams.
Common streaming technologies
Kafka, Flink, and Kinesis
Things you should know about streams for your interview
Scaling with Partitioning: In order to scale streams, they can be partitioned across multiple servers. Each partition can be processed by a different consumer, allowing for horizontal scaling. Just like databases, you will need to specify a partition key to ensure that related events are stored in the same partition.
Multiple Consumer Groups: Streams can support multiple consumer groups, allowing different consumers to read from the same stream independently.
Replication: In order to support fault tolerance, just like databases, streams can replicate data across multiple servers. This ensures that if a server fails, the data can still be read from another server.
Windowing: Streams can support windowing, which is a way of grouping events together based on time or count.
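Time-based windowing can be sketched as a tumbling window, meaning fixed-size windows that do not overlap. The function name and the `(timestamp_seconds, payload)` event shape below are assumptions for illustration; stream processors offer other window types as well.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping time window.

    `events` is an iterable of (timestamp_seconds, payload) pairs.
    Returns {window_start_seconds: event_count}.
    """
    counts = Counter()
    for timestamp, _payload in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)


events = [(0, "like"), (12, "comment"), (59, "like"), (60, "share"), (125, "like")]
```

With a 60-second window, the first three events fall into the window starting at 0, the fourth into the window starting at 60, and the last into the window starting at 120.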