Databases Flashcards

(22 cards)

1
Q

Atomicity

A

From ACID

Either all operations in the transaction are executed, or none are and the state is rolled back.
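A minimal sketch using Python's built-in sqlite3 (the accounts table and transfer are made up): if anything fails mid-transaction, rollback restores the prior state, so no partial update survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    # Transfer 50 from alice to bob -- both updates must apply, or neither.
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
    raise RuntimeError("crash between the two updates")  # simulate a failure
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
    conn.commit()
except RuntimeError:
    conn.rollback()  # atomicity: the first update is undone too

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0}
```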

2
Q

Isolation

A

From ACID

A transaction is isolated if it cannot be affected by the actions of other concurrent transactions.

Levels: Read Uncommitted (dirty and phantom reads possible), Read Committed (no dirty reads; non-repeatable and phantom reads possible), Repeatable Read (no non-repeatable reads; phantom reads still possible), Serializable (no phantom reads; transactions behave as if executed one at a time)

Phantom read = when a transaction re-runs a query and sees rows that another transaction inserted in the meantime (or no longer sees rows that were deleted).

Trade-off:
The trade-off with isolation is performance: higher isolation levels (especially Serializable) reduce concurrency, as transactions may need to wait for others to complete before proceeding.
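As a sketch, the isolation level is chosen per transaction; the exact syntax and placement vary by engine (this is the MySQL form, and `orders` is a made-up table):

```sql
-- Applies to the next transaction started on this session (MySQL;
-- in PostgreSQL, run SET TRANSACTION inside the transaction instead).
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
START TRANSACTION;
SELECT COUNT(*) FROM orders WHERE total > 100;
-- A concurrent INSERT matching this predicate cannot cause a
-- phantom read before this transaction commits.
COMMIT;
```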

3
Q

Redshift

A
4
Q

Dynamo DB

A

Primary key is partition key + an optional sort key (the sort key doesn’t have to be specified).

“Global tables” will replicate your table to multiple regions for lower latency if needed.
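For illustration, a key schema as passed to the DynamoDB `create-table` API (table and attribute names are made up); `HASH` marks the partition key and `RANGE` the optional sort key:

```json
{
  "TableName": "Orders",
  "KeySchema": [
    { "AttributeName": "customer_id", "KeyType": "HASH" },
    { "AttributeName": "order_date", "KeyType": "RANGE" }
  ]
}
```

(A full `create-table` call also needs `AttributeDefinitions` and throughput settings; this fragment only shows the key structure.)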

5
Q

Cassandra

A

NoSQL wide-column database.

Known for horizontal scaling (near-linear performance improvement). While scaling out, performance may degrade temporarily, but the cluster remains functional.

Write optimized: < 1 ms write latency. Uses an append-only (log-structured) write model, which avoids locking/blocking.

Reads = a few ms to tens of ms, depending on consistency level and workload.

Uses “wide rows”

Works via a partition key and clustering keys. Partition key = which node the data lives on; clustering key = order of rows within the partition. Together they form the primary key (a partition key plus clustering keys is called a compound primary key).

Doesn’t support joins; join-like access patterns are handled by denormalizing data into query-specific tables.

Architecture: peer-to-peer; no leader node.

Pros:
- Can handle millions of TPS when scaled. Known for linear horizontal scaling.
- Write optimized. Read latency not bad with right querying patterns.

Cons:
- Bad at range queries, even with secondary indices. Secondary indices are stored per partition, so a query that doesn’t include the partition key requires searching all nodes; the index speeds up the per-node search, but it’s still not ideal.
- Doesn’t support joins or aggregations. Requires tools like Spark to do this.
- Need primary key specified for good performance.
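A sketch in CQL (table and column names are made up): `sensor_id` is the partition key (which node stores the data) and `reading_time` is the clustering key (row order within the partition); together they form the primary key.

```sql
CREATE TABLE readings (
    sensor_id    text,
    reading_time timestamp,
    value        double,
    PRIMARY KEY ((sensor_id), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);

-- Efficient: the partition key is included, so only one
-- partition on one replica set is touched.
SELECT * FROM readings
WHERE sensor_id = 'a1' AND reading_time > '2024-01-01';
```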

6
Q

Redis

A
7
Q

SQL

A
8
Q

Mongo DB

A
9
Q

Clustered index

A

In SQL, the primary key is used as the clustered index (by default in engines like SQL Server and MySQL/InnoDB). The primary key can span more than one column, which is called a composite clustered index.

When you have a clustered index on a column (or set of columns), the rows are physically stored in the order of that index. The index is a tree structure (more than one child per node) that lets you search for the rows you want efficiently; the actual row data is stored at the leaf nodes.

There can only be 1 clustered index per table, because the data can only be physically stored in one order.
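For example, in SQL Server syntax (made-up tables): the PRIMARY KEY constraint creates the clustered index implicitly, or a composite clustered index can be created explicitly.

```sql
-- Implicit: the primary key becomes the clustered index.
CREATE TABLE orders (
    order_id    INT PRIMARY KEY,   -- clustered by default
    customer_id INT,
    total       DECIMAL(10, 2)
);

-- Explicit composite clustered index on another table:
CREATE CLUSTERED INDEX ix_hist_cust_date
    ON order_history (customer_id, order_date);
```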

10
Q

Non clustered index

A

Indices on columns other than the primary key are always non-clustered indices. A non-clustered index can cover more than one column, which is called a composite non-clustered index.

It works by storing a separate structure containing only the indexed column’s values, sorted, plus each row’s primary key. If you query on an indexed column, the database searches the index for the entries matching the value you’re looking for; those entries hold the primary key values, which it then uses with the clustered index to fetch the full rows your query needs.
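A sketch of that lookup path (made-up table and index names):

```sql
CREATE INDEX ix_users_email ON users (email);

-- 1. Search ix_users_email for 'a@b.com' -> yields the row's primary key.
-- 2. Use the clustered index (primary key) to fetch the full row.
SELECT * FROM users WHERE email = 'a@b.com';
```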

11
Q

Range query

A

Ex:
SELECT * FROM table WHERE col > 5 AND col < 5000;

12
Q

B trees vs B+ trees

A

B+ trees don’t store data in intermediate nodes, just the leaves.
B trees store data in intermediate and leaf nodes.

B+ trees also link the leaf nodes (the leaves form an ordered linked list). This is really good for range queries: find the first leaf entry in the range, then follow the pointers until an entry falls outside the range.

B+ trees better for:
- disk access situations (minimize it)
- speed in general
- range queries

B trees better for:
- really just simplicity. If you’re in a situation where you don’t need the B+ tree benefits, you could implement a B tree just because it’s easier (you don’t have to link the leaf nodes separately).

13
Q

Dynamo Global Secondary Index (GSI)

A

https://www.youtube.com/watch?v=ihMOlb8EZKE

Effectively copies a (configurable) subset of the table’s attributes into a second table and keeps the two in sync. Every write to the base table is repeated on every GSI, so if you have 5 GSIs, 1 write is actually 6 writes.

GSIs do not improve the performance of scans.

Gotchas:
- Doubles the cost of writing (one extra write per GSI)
- Has its own read and write throttle limits, which need to be configured
- Is eventually consistent: reads from a GSI may not yet reflect recent writes to the base table or other indices
- Throttle limits need to be >= those of the main table; otherwise writes to the main table will be throttled waiting on the GSI

14
Q

Dynamo Local Secondary Index (LSI)

A

Basically just an additional sort key. It changes the way the data is stored and lets you query on the partition key + the LSI’s sort key. Queries MUST still include the partition key, though.

LSIs do not speed up scans.

Needs to be created at table creation time.

Limited to 5 per table (a hard limit).

15
Q

Choosing Dynamo vs Sql

A

Dynamo:
- < 10 ms latency (extremely low latency)
- 10,000+ requests per second (high throughput), because it scales horizontally on its own
- Handles bursty requests (auto horizontal scaling, though it may throttle before new capacity is ready, which is both good and bad)
- Cheaper than SQL for low or bursty traffic because of the pay-per-request model
- More expensive at sustained high traffic, because storage costs more and more requests = more money

SQL:
- 10-100 ms latency
- 1,000-10,000 requests per second (moderate throughput)
- Harder to scale horizontally, although managed services like Amazon RDS make scaling easier
- Cheaper than Dynamo with larger, more consistent traffic, because you pay per server and rarely scale (if you’re scaling that often, maybe you shouldn’t use SQL)
- More expensive at low traffic because you’re still paying for the same servers

16
Q

Choosing Redshift vs Sql

A

17
Q

Partitioning

A

Partitioning can be within a single node or across multiple nodes.

If across multiple nodes, the term is basically interchangeable with “sharding,” except sharding is more often used in horizontal-scaling contexts.

On its own, “partitioning” usually refers to partitioning within a single node.

Within a single node: a table is broken up into multiple partitions, and a query is applied only to the partitions that are relevant. Each partition is managed independently.
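As an example in PostgreSQL syntax (made-up table): a query filtered on the partitioning column only touches the relevant partition.

```sql
CREATE TABLE events (
    id         BIGINT,
    created_at DATE NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- Only the events_2024 partition is scanned.
SELECT * FROM events
WHERE created_at BETWEEN '2024-03-01' AND '2024-04-01';
```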

18
Q

Choosing Cassandra vs Dynamo

A

Both can handle millions of TPS if scaled, configured, and used properly

19
Q

Sql benefits over Dynamo / Cassandra

A

More flexible querying
- can do “order by” (Dynamo queries can only order by the sort key)
- can do aggregation (sum, avg, count, group by, etc.); Dynamo requires aggregations to be done in your app, which means transferring large amounts of data to the app first
- can do joins

This all means SQL allows you to store data in a normalized way and query it how you want. Dynamo/Cassandra require you to store it in the way it will be queried (but they’re faster at using the data in that way).

20
Q

Normalization

A

Reducing the redundant storage of data.

“Normalization is the process of organizing data in a database to minimize redundancy and dependency, ensuring data integrity and efficiency”

Pros
- less storage space used
- fewer writes (each fact is written once)
- reduces the chance that the same data is stored with different values in different places, because it is only stored once

Cons:
- requires joins, so queries are slower
- schema can be confusing when highly normalized

Good for:
- Transaction processing (OLTP), because it consists of many small operations (insert, update, delete).

Bad for:
- Analytics processing (OLAP) (large SELECT queries), because these touch large amounts of data and require joins, so they’re slow.
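A sketch of a normalized schema (made-up tables): the customer’s name is stored once, orders only reference it, and reads pay for that with a join.

```sql
-- Normalized: customer data stored once.
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name        TEXT
);

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT REFERENCES customers (customer_id),
    total       DECIMAL(10, 2)
);

-- Reads now need a join:
SELECT c.name, SUM(o.total)
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
GROUP BY c.name;
```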

21
Q

Redis Hybrid Persistence

A

Loads the most recent RDB snapshot (Redis Database file) and then applies only the AOF (append-only file) commands written since that snapshot.
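In `redis.conf` (Redis 4.0+), hybrid persistence is the combination of AOF with the RDB preamble:

```
appendonly yes
aof-use-rdb-preamble yes
```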

22
Q

Current max RAM on Linux or Windows

A

128 TB

A 64-bit OS can theoretically address 16 exabytes (2^64 bytes), but the OS limits it.