Databases & Warehousing Flashcards

(44 cards)

1
Q

What is the primary purpose of an OLTP database?

A

To support high-concurrency, low-latency transactional workloads such as inserts, updates, and small, selective queries.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is the primary purpose of an OLAP database or data warehouse?

A

To support analytical workloads that scan and aggregate large volumes of data for reporting and decision-making.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

How do OLTP and OLAP systems typically differ in schema design?

A

OLTP systems are usually highly normalized to avoid redundancy; OLAP systems are often denormalized into star or snowflake schemas for fast reads.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is normalization in relational databases?

A

The process of structuring tables to reduce redundancy and dependency anomalies by dividing data into related tables.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Why is normalization beneficial for OLTP systems?

A

It reduces duplication, simplifies updates, and helps maintain data integrity across many small transactions.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Why is denormalization common in data warehouses?

A

It reduces the number of joins needed for analytical queries, trading some redundancy for simpler and faster reads.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What is a relational database management system (RDBMS)?

A

A database system that stores data in tables with rows and columns and uses SQL and relational principles for querying and integrity.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What are common examples of OLTP RDBMS engines?

A

MySQL, PostgreSQL, SQL Server, and Oracle Database.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is a distributed MPP data warehouse?

A

A massively parallel processing system where data is distributed across nodes and queries are executed in parallel for scalability.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What are common examples of data warehouse engines?

A

Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse, and distributed columnar systems like ClickHouse.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is a primary key in a relational table?

A

A column or set of columns that uniquely identifies each row and is used to enforce uniqueness and relationships.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is a foreign key?

A

A column in one table that references a primary key in another table, representing a relationship between the tables.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What is an index in a database?

A

A data structure that accelerates lookups and joins on specific columns at the cost of extra storage and write overhead.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Why should indexes be used carefully in write-heavy systems?

A

Indexes speed up reads but slow down inserts, updates, and deletes because they must be maintained with each write.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is a clustered index conceptually?

A

An index that defines the physical order of rows in the table, often on the primary key in many systems.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What is a nonclustered index conceptually?

A

A separate structure that references table rows without changing their physical order, used to speed up queries on other columns.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

What is a transaction in database systems?

A

A unit of work that is executed with ACID properties: atomicity, consistency, isolation, and durability.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

What does isolation mean in the context of transactions?

A

The degree to which concurrent transactions are prevented from interfering with each other’s intermediate changes.

19
Q

What is eventual consistency?

A

A model where all replicas will converge to the same value over time, but reads may temporarily see stale data after writes.

20
Q

Why is eventual consistency common in distributed NoSQL systems?

A

It allows higher availability and partition tolerance by relaxing strict transactional guarantees.

21
Q

What are the main categories of NoSQL databases?

A

Key-value stores, document stores, column-family stores, and graph databases.

22
Q

When are key-value or document stores typically preferred over relational databases?

A

When flexible schemas, high write throughput, or low-latency simple access patterns are more important than complex joins and transactions.

23
Q

Why are relational warehouses usually columnar?

A

Columnar storage matches analytical access patterns, enabling efficient scans, compression, and aggregation on selected columns.

24
Q

What is a star schema in data warehousing?

A

A schema with a central fact table joined to multiple dimension tables, forming a star-like structure.

25
What is a snowflake schema?
A star schema variant where dimensions are further normalized into additional related tables, forming a more complex structure.
26
What is a fact table’s grain?
The level of detail represented by each row in the fact table, such as one row per order line or per daily metric per customer.
27
Why is grain definition critical when designing fact tables?
It determines how data can be aggregated and prevents ambiguity when joining or combining facts.
28
What is a slowly changing dimension (SCD)?
A dimension where attribute values change over time and must be tracked, typically handled with SCD Type 1 or Type 2 patterns.
29
What is an SCD Type 2 pattern?
A pattern where changes are recorded by adding new rows with effective dates or version indicators to preserve history.
30
What is a data mart?
A subject- or department-specific subset of the data warehouse designed for a particular analytical domain.
31
Why do organizations create data marts?
To provide targeted, simplified schemas for specific teams and use cases while still sourcing from a central warehouse.
32
What is a materialized view?
A precomputed, stored result of a query that can be refreshed periodically or incrementally to speed up repeated analytical queries.
33
How do materialized views differ from regular views?
Regular views are just stored query definitions; materialized views persist the query results physically.
34
What is a column statistics or histogram used for in query planning?
They provide information about data distribution, helping the optimizer choose efficient join orders and access paths.
35
Why can poor statistics lead to bad query plans?
The optimizer may underestimate or overestimate row counts, leading to suboptimal join strategies and high execution time.
36
What is partitioning in relational or warehouse tables?
Dividing a table into segments based on partition keys so queries can scan only relevant segments and improve performance.
37
Why should partition keys for warehouse tables often be time-based?
Most analytical queries filter by time; time-based partitioning supports pruning and lifecycle management.
38
What is sharding in distributed databases?
Splitting data across multiple nodes or clusters based on a shard key to scale out storage and throughput.
39
Why is choosing a good shard key important?
A poor shard key can cause hotspots or uneven load, undermining the benefits of horizontal scaling.
40
What is a query federation or virtual warehouse conceptually?
Executing queries that span multiple underlying data sources or compute clusters while presenting a unified logical schema.
41
Why is separating compute clusters for different workloads beneficial in a warehouse environment?
It isolates performance and allows independent scaling of ETL, BI, and ad hoc workloads without resource contention.
42
What is the role of a query engine in a modern warehouse stack?
To parse, optimize, and execute SQL queries against one or more storage layers, often abstracting away physical details.
43
Why is concurrency an important consideration in warehouse design?
Many users and jobs may run queries simultaneously; the system must handle concurrency without excessive queuing or degradation.
44
What is a good one-sentence mental model for databases vs warehouses?
Databases are optimized for transactional correctness and small, selective queries, while warehouses are optimized for large, read-heavy analytical workloads over well-modeled historical data.