Key System Design Technologies Flashcards by Bo Guthrie

Should you compare relational and non-relational databases in an interview?

Most interviewers don’t need an explicit comparison of SQL and NoSQL databases in your session, and it’s a pothole you should avoid entirely. Instead, talk about what you know about the database you’re using and how it will help you solve the problem at hand.

How are indexes typically implemented?

Most commonly as a B-tree or a hash table.
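
The trade-off between the two structures can be sketched in a few lines. This is only an illustration: the `rows` data and helper names are hypothetical, a Python dict stands in for a hash index, and a sorted list searched with `bisect` stands in for a B-tree (a real B-tree is a balanced on-disk tree, not a flat list).

```python
import bisect

# Hypothetical table rows: (user_id, age)
rows = [(1, 34), (2, 27), (3, 41), (4, 27), (5, 19)]

# Hash-style index: fast exact-match lookups, but no ordering.
hash_index = {}
for position, (_, age) in enumerate(rows):
    hash_index.setdefault(age, []).append(position)

# Sorted index (stand-in for a B-tree): keeps keys ordered for range scans.
sorted_index = sorted((age, position) for position, (_, age) in enumerate(rows))


def find_equal(age):
    """Exact-match lookup through the hash index."""
    return [rows[p] for p in hash_index.get(age, [])]


def find_range(low, high):
    """Range lookup (low <= age <= high) through the sorted index."""
    start = bisect.bisect_left(sorted_index, (low, -1))
    end = bisect.bisect_right(sorted_index, (high, len(rows)))
    return [rows[p] for _, p in sorted_index[start:end]]
```

The hash index answers "age equals 27" directly, while only the ordered structure can answer "age between 20 and 35" without scanning every row.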

What properties of relational DBs make them an appropriate choice?

The great things about relational databases are (a) their support for arbitrarily many indexes, which allows you to optimize for different queries, and (b) their support for multi-column and specialized indexes (e.g. geospatial indexes, full-text indexes).
They also offer ACID guarantees, transaction isolation, and a well-defined schema.

What are transactions in a relational DB?

Transactions are a way of grouping multiple operations together into a single atomic operation: either all of them take effect or none do.
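
A minimal sketch of that atomicity, using Python's built-in `sqlite3` module. The `accounts` table, the `CHECK` constraint, and the `transfer` helper are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move money between accounts; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint, so the credit was undone too.
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

If the debit fails because the sender would go negative, the earlier credit is rolled back as well, so the receiver never ends up with money that did not leave the sender's account.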

With regard to ACID compliance, how do RDBMS and non-relational systems compare?

Relational (SQL): strict ACID compliance is the default.

Non-Relational (NoSQL): The default is usually “Eventual Consistency” (speed over strictness), but ACID transactions are often available as an optional feature if you are willing to take a performance hit. By default these systems follow the BASE model (Basically Available, Soft state, Eventually consistent).

What kinds of data models do non-relational DBs support?

Key value stores
- fast access
- simple model

Document stores
- flexible
- schemaless

Column family stores
- scalable
- high performance for writes

Graph DBs
- relationships
- efficient retrieval

What are compelling reasons for using a non-relational database?

Flexible Data Models: Your data model is evolving or you need to store different types of data structures without a fixed schema.

Scalability: Your application needs to scale horizontally (across many servers) to accommodate large amounts of data or high user loads.

Handling Big Data and Real-Time Web Apps: You have applications dealing with large volumes of data, especially unstructured data, or applications requiring real-time data processing and analytics.

Do NoSQL dbs support indexing?

Yes

What should you know about blob storage?

Durability: Blob storage services are designed to be incredibly durable.

Scalability: Hosted blob storage solutions like AWS S3 can be considered infinitely scalable.

Cost: Blob storage services are designed to be cost effective.

Security: Blob storage services have built-in security features like encryption at rest and in transit.

Upload and Download Directly from the Client: Blob storage services allow you to upload and download blobs directly from the client.

Chunking: Modern blob storage services like S3 support chunking out of the box via the multipart upload API.

What is a search optimized database and when should you use it?

Search optimized databases are specifically designed to handle full-text search. They use techniques like indexing, tokenization, and stemming to make search queries fast and efficient. Use one when your system needs full-text search.

How do search optimized databases work?

They work by building what are called inverted indexes. An inverted index is a data structure that maps words to the documents that contain them.

{
  "word1": [doc1, doc2, doc3],
  "word2": [doc2, doc3, doc4],
  "word3": [doc1, doc3, doc4]
}
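
A mapping like the one above can be built and queried as follows. This is a simplified sketch: the sample `documents`, the regex-based `tokenize` helper, and the AND-only `search` function are assumptions for illustration, and real search engines also store term positions and relevance scores.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


documents = {
    "doc1": "Redis is an in-memory cache",
    "doc2": "Elasticsearch builds an inverted index",
    "doc3": "An inverted index maps words to documents",
}
index = build_inverted_index(documents)
```

A query only touches the entries for its own terms, so the cost depends on how many documents contain those words rather than on the total number of documents.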

What should you know about search optimized databases?

Inverted Indexes: Search optimized databases use inverted indexes to make search queries fast and efficient.

Tokenization: Tokenization is the process of breaking a piece of text into individual words.

Stemming: Stemming is the process of reducing words to their root form. This allows you to match different forms of the same word. For example, “running” and “runs” would both be reduced to “run”.

Fuzzy Search: Fuzzy search is the ability to find results that are similar to a given search term. This is achieved through techniques like edit distance calculation, which measures how many letters need to be changed, added, or removed to transform one word into another.

Scaling: Just like traditional databases, search optimized databases scale by adding more nodes to a cluster and sharding data across those nodes.
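
The edit distance mentioned under fuzzy search can be computed with the classic Levenshtein dynamic program. The sketch below is an illustration only: `fuzzy_match` and its `max_distance` parameter are hypothetical names, and production search engines use much more efficient structures than comparing against every word.

```python
def edit_distance(a, b):
    """Minimum number of single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]


def fuzzy_match(term, vocabulary, max_distance=1):
    """Return vocabulary words within max_distance edits of the search term."""
    return [word for word in vocabulary if edit_distance(term, word) <= max_distance]
```

For example, `fuzzy_match("cach", ["cache", "queue"])` keeps only `"cache"`, because one inserted letter turns the typo into the indexed word.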

What are popular search optimized DBs?

Elasticsearch is the winner here.

Postgres has GIN indexes, which support full-text search, and Redis has a (in my opinion, quite immature and bad) full-text search capability.

In a system design interview, when should you include an API gateway?

In nearly all product-design-style system design interviews, it is a good idea to include an API gateway in your design as the first point of contact for your clients.

What are the most common API gateways?

AWS API Gateway, Kong, and Apigee.

It’s also not uncommon to have an nginx or Apache webserver as your API gateway (in the early days of Amazon, a gigantic fleet of Apache webservers served this purpose).

When should you use an L4 load balancer vs an L7 load balancer?

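If you have persistent connections like websockets, you’ll likely want to use an L4 (transport layer) load balancer. Otherwise, use an L7 (application layer) load balancer, which can route based on request content such as the URL path or headers.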

What are the most popular load balancers?

The most common load balancers are AWS Elastic Load Balancer (a hosted offering from AWS), NGINX (an open-source webserver frequently used as a load balancer), and HAProxy (a popular open-source load balancer).

What are common use cases for queues?

Buffer for Bursty Traffic: In a ride-sharing application like Uber, queues can be used to manage sudden surges in ride requests. During peak hours or special events, ride requests can spike massively. A queue buffers these incoming requests, allowing the system to process them at a manageable rate without overloading the server or degrading the user experience.

Distribute Work Across a System: In a cloud-based photo processing service, queues can be used to distribute expensive image processing tasks. When a user uploads photos for editing or filtering, these tasks are placed in a queue. Different worker nodes then pull tasks from the queue, ensuring even distribution of workload and efficient use of computing resources.
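
The work-distribution use case can be sketched in a single process with Python's `queue` and `threading` modules. The `run_workers` helper and the `None` sentinel are assumptions for this example, and threads stand in for the separate worker nodes that would pull from a real message broker.

```python
import queue
import threading


def run_workers(tasks, handler, worker_count=3):
    """Push tasks onto a shared queue and let several workers pull from it."""
    task_queue = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            task = task_queue.get()
            if task is None:  # sentinel: no more work for this worker
                task_queue.task_done()
                return
            outcome = handler(task)
            with results_lock:
                results.append(outcome)
            task_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for task in tasks:
        task_queue.put(task)
    for _ in threads:
        task_queue.put(None)  # one sentinel per worker
    task_queue.join()
    for thread in threads:
        thread.join()
    return results
```

Whichever worker is free takes the next task, so results arrive in no particular order; callers that need ordering have to sort or tag the output themselves.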

What should you know about queues for your interview?

Message Ordering: Most queues are FIFO (first in, first out), meaning that messages are processed in the order they were received. Ordering guarantees vary by technology, however: Kafka, for example, guarantees ordering only within a partition, while some other systems support ordering based on a specified priority.

Retry Mechanisms: Many queues have built-in retry mechanisms.

Dead Letter Queues: Dead letter queues are used to store messages that cannot be processed. They’re useful for debugging and auditing, as they allow you to inspect messages that failed to be processed and understand why they failed.

Scaling with Partitions: Queues can be partitioned across multiple servers so that they can scale to handle more messages.

Backpressure: The biggest problem with queues is they make it easy to overwhelm your system. If my system supports 200 requests per second but I’m receiving 300 requests per second, I’ll never finish them! A queue is just obscuring the problem that I don’t have enough capacity. The answer is backpressure. Backpressure is a way of slowing down the production of messages when the queue is overwhelmed. This helps prevent the queue from becoming a bottleneck in your system. For example, if a queue is full, you might want to reject new messages or slow down the rate at which new messages are accepted, potentially returning an error to the user or producer.
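
One simple form of backpressure is a bounded queue that rejects new messages when full. The sketch below uses Python's `queue.Queue`; the `BoundedIngest` class and its method names are hypothetical, and how a rejection is surfaced (an error response, a retry-later signal) is left to the caller.

```python
import queue


class BoundedIngest:
    """Accepts messages only while there is spare capacity."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)

    def submit(self, message):
        """Return True if accepted, False if the producer should back off."""
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            # Rejecting here is the backpressure: the producer learns
            # immediately that the system is over capacity.
            return False

    def take(self):
        """Remove and return the oldest message (raises queue.Empty if none)."""
        return self._queue.get_nowait()
```

With a capacity of 2, a third `submit` returns `False` until a consumer calls `take`, which keeps the backlog bounded instead of letting it grow without limit.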

What is event sourcing?

Event sourcing is a technique where changes in application state are stored as a sequence of events. These events can be replayed to reconstruct the application’s state at any point in time.
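
A minimal sketch of the idea, using a bank balance as the application state. The `Event` type, the two event kinds, and the `replay` helper are assumptions chosen for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "deposited" or "withdrawn" (hypothetical event kinds)
    amount: int


def apply(balance, event):
    """Return the new balance after applying a single event."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events, upto=None):
    """Rebuild the balance from the first `upto` events (all events if None)."""
    balance = 0
    for event in events[:upto]:
        balance = apply(balance, event)
    return balance
```

The event log is the source of truth: the current balance is `replay(log)`, and the balance after the first two events is `replay(log, upto=2)`, with no separate history table needed.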

What is the difference between a stream and a message queue?

Streams retain events for a configurable period and allow multiple consumers to replay data independently, while message queues delete messages after consumption and are designed for one-time task processing.
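
The contrast can be modeled with two toy classes. Both are in-memory sketches with hypothetical names; retention limits, persistence, and acknowledgements are deliberately left out.

```python
from collections import deque


class MessageQueue:
    """A message is gone once any consumer has taken it."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def consume(self):
        return self._messages.popleft() if self._messages else None


class Stream:
    """Events stay in the log; each consumer group tracks its own offset."""

    def __init__(self):
        self._log = []
        self._offsets = {}

    def publish(self, event):
        self._log.append(event)

    def consume(self, group):
        offset = self._offsets.get(group, 0)
        if offset >= len(self._log):
            return None
        self._offsets[group] = offset + 1
        return self._log[offset]

    def rewind(self, group, offset=0):
        """Move a group back so it can replay earlier events."""
        self._offsets[group] = offset
```

Two groups reading the same `Stream` each see every event and can rewind independently, whereas two consumers of the `MessageQueue` split the messages between them.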

When should you use streams?

When you need to process large amounts of data in real-time.
- e.g. - real-time analytics of user engagements (likes, comments, shares) on posts.
- high volumes of engagement events generated by users across the globe. A stream processing system

When you need to support complex processing scenarios like event sourcing.
- Consider a banking system where every transaction (deposits, withdrawals, transfers) needs to be recorded and could affect multiple accounts.

When you need to support multiple consumers reading from the same stream.
- In a real-time chat application, when a user sends a message, it’s published to a stream associated with the chat room. This stream acts as a centralized channel where all chat participants are subscribers. This is a great example of a publish-subscribe pattern, which is a common use case for streams.

Common streaming technologies

Kafka, Flink, and Kinesis

Things you should know about streams for your interview

Scaling with Partitioning: In order to scale streams, they can be partitioned across multiple servers. Each partition can be processed by a different consumer, allowing for horizontal scaling. Just like databases, you will need to specify a partition key to ensure that related events are stored in the same partition.

Multiple Consumer Groups: Streams can support multiple consumer groups, allowing different consumers to read from the same stream independently.

Replication: In order to support fault tolerance, just like databases, streams can replicate data across multiple servers. This ensures that if a server fails, the data can still be read from another server.

Windowing: Streams can support windowing, which is a way of grouping events together based on time or count.
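
Two of the ideas above, partition keys and time-based windows, can be sketched briefly. The function names are hypothetical, CRC32 is just one convenient stable hash chosen for this example (real systems pick their own partitioner), and events are assumed to be `(timestamp_seconds, payload)` tuples.

```python
import zlib
from collections import Counter


def partition_for(key, partition_count):
    """Map a partition key to a partition; the same key always maps to the same one."""
    return zlib.crc32(key.encode("utf-8")) % partition_count


def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping time window.

    Returns a dict mapping each window's start time to its event count.
    """
    counts = Counter()
    for timestamp, _payload in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))
```

Using a chat room id as the partition key keeps every message for that room in one partition, and a 60-second tumbling window turns a raw event stream into per-minute counts.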

When should you use a distributed lock?

Distributed locks are perfect for situations where you need to lock something across different systems or processes for a reasonable period of time.

How do distributed locks work?

The basic idea is that you can use a key-value store to store a lock and then use the atomicity of the key-value store to ensure that only one process can acquire the lock at a time.
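
The sketch below illustrates that idea with an in-memory stand-in for the key-value store. Everything here is an assumption for illustration: `InMemoryKeyValueStore` simulates atomic "set if absent with expiry" and "delete if value matches" operations with a mutex, whereas a real deployment would rely on the equivalent atomic commands of a shared store such as Redis.

```python
import threading
import time
import uuid


class InMemoryKeyValueStore:
    """Stand-in for a shared store; a mutex simulates atomic commands."""

    def __init__(self, clock=time.monotonic):
        self._data = {}  # key -> (value, expires_at)
        self._mutex = threading.Lock()
        self._clock = clock

    def set_if_absent(self, key, value, ttl_seconds):
        with self._mutex:
            current = self._data.get(key)
            if current is not None and current[1] > self._clock():
                return False  # someone else holds an unexpired lock
            self._data[key] = (value, self._clock() + ttl_seconds)
            return True

    def delete_if_equal(self, key, value):
        with self._mutex:
            current = self._data.get(key)
            if current is not None and current[0] == value and current[1] > self._clock():
                del self._data[key]
                return True
            return False


def acquire(store, resource, ttl_seconds=10.0):
    """Return a unique token if the lock was acquired, otherwise None."""
    token = uuid.uuid4().hex
    if store.set_if_absent(f"lock:{resource}", token, ttl_seconds):
        return token
    return None


def release(store, resource, token):
    """Release the lock only if this token still owns it."""
    return store.delete_if_equal(f"lock:{resource}", token)
```

The expiry prevents a crashed holder from blocking everyone forever, and the token check stops a slow process from releasing a lock that has since expired and been acquired by someone else.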

What is Apache ZooKeeper?

ZooKeeper is a "traffic cop" for distributed systems. It is primarily used for:
- **Leader Election**: Deciding which server is the "boss."
- **Configuration Management**: Ensuring all servers have the same settings.
- **Service Discovery**: Helping servers find each other.

What are example use cases for a distributed lock?

- E-Commerce Checkout System
- Ride-Sharing Matchmaking
- Distributed Cron Jobs
- Online Auction Bidding System

What should you know about distributed locks?

**Locking Mechanisms**: There are different ways to implement distributed locks. One common implementation uses Redis and is called Redlock. Redlock uses multiple Redis instances to ensure that a lock is acquired and released in a safe and consistent manner.

**Lock Expiry**: Distributed locks can be set to expire after a certain amount of time.

**Locking Granularity**: Distributed locks can be used to lock a single resource or a group of resources.

**Deadlocks**: Deadlocks can occur when two or more processes are waiting for each other to release a lock.

How do tools handle deadlocks?

- **Prevention** (set strict rules)
- **Avoidance** (smart planning)
- **Detection & Recovery** (fixing it)

What kind of data can you store in a cache?

**Simple Key-Value Pairs (Strings)**: Used for session tokens, HTML fragments, or basic configuration settings.

**Lists**: Perfect for maintaining "latest items" feeds, message queues, or activity streams where the order of insertion matters.

**Sets**: Useful for storing unique items like unique visitor IDs, tags for a blog post, or IP addresses for rate limiting.

**Sorted Sets**: Ideal for leaderboards, popular event lists, or priority queues where items need to be ranked by a score (like "likes" or "timestamps").

**Hashes**: Used to store objects with multiple fields, such as a User Profile (storing name, email, and age under a single key).

**Bitmaps**: Highly efficient for tracking binary states over time, such as "Is the user online today?" or "Has this user completed the tutorial?"

**HyperLogLogs**: Used for estimating the cardinality of very large sets (e.g., counting unique search queries) using minimal memory.

**Geospatial Indexes**: Storing coordinates (latitude/longitude) to perform "find nearby" queries for apps like ride-sharing or restaurant discovery.

**Streams**: Used for high-speed activity tracking and log processing, similar to a lightweight Kafka.

What is cardinality?

In databases, cardinality refers to the uniqueness of data in a column (low/high) or the relationship count between tables (1:1, 1:N, M:N).

What should you know about CDNs?

**CDNs are not just for static assets**. While CDNs are often used to cache static assets like images, videos, and JavaScript files, they can also be used to cache dynamic content. This is especially useful for content that is accessed frequently but changes infrequently. For example, a blog post that is updated once a day can be cached by a CDN.

**CDNs can be used to cache API responses**. If you have an API that is accessed frequently, you can use a CDN to cache the responses. This can help reduce the load on your servers and improve the performance of your API.

**Eviction policies**. Like other caches, CDNs have eviction policies that determine when cached content is removed. For example, you can set a time-to-live (TTL) for cached content, or you can use a cache invalidation mechanism to remove content from the cache when it changes.
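
The TTL and invalidation behavior can be sketched with a small in-process cache. This is only an analogy for what a CDN edge does: the `TTLCache` class, the `fetch_from_origin` callback, and the injectable `clock` are hypothetical names for this example.

```python
import time


class TTLCache:
    """Caches origin responses until they expire or are invalidated."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key, fetch_from_origin):
        """Return the cached value, calling the origin only on a miss or expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]
        value = fetch_from_origin(key)
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        """Drop a cached entry so the next request goes back to the origin."""
        self._entries.pop(key, None)
```

A longer TTL means fewer trips to the origin but staler content; explicit invalidation is the escape hatch when content changes before its TTL runs out.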

Key System Design Technologies Flashcards

(33 cards)