Advanced SQL Patterns Data Engineers Use Daily Flashcards

(30 cards)

1
Q

What is a window function pattern used for ranking rows?

A

A ranking window function like ROW_NUMBER or RANK lets you order rows inside groups without collapsing them. It is like sorting students within each class and numbering them.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

How do data engineers use ROW_NUMBER() to remove duplicates?

A

You group similar rows and give them numbers using ROW_NUMBER. Then you keep only row number 1. This removes duplicates by keeping the best or most recent row.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is a deduplication query?

A

A query used to remove duplicate rows from a dataset while keeping the most relevant record. It usually uses ROW_NUMBER with PARTITION BY.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What does PARTITION BY help achieve in analytics queries?

A

It divides data into groups so calculations run separately inside each group

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is a Slowly Changing Dimension (SCD)?

A

A method for tracking changes in data over time. Instead of overwriting old values you keep historical records so you know what the data looked like in the past.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is SCD Type 1?

A

Type 1 simply overwrites old data. It keeps only the most recent value and does not track history.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What is SCD Type 2?

A

Type 2 keeps historical versions of rows. When a value changes a new row is created with timestamps so you can see past states of the data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Why do data warehouses use SCD Type 2?

A

Because analysts often need to know what the data looked like at a specific time in history

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is an incremental data pipeline?

A

Instead of reprocessing all data every time you only process new or changed records since the last run.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Why are incremental pipelines important?

A

They make large systems scalable because processing millions of old records repeatedly would waste time and resources.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is a time-series query?

A

A query that analyzes data over time

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is a rolling average query?

A

A rolling average calculates an average over a moving window of rows

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What problem does a rolling window solve?

A

It smooths out fluctuations so trends become easier to see.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is a star schema?

A

A data warehouse design where a central fact table connects to multiple dimension tables like a star shape.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is a fact table?

A

A table that stores measurable events such as sales

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What is a dimension table?

A

A table that stores descriptive information such as customers

17
Q

Why do data warehouses separate fact and dimension tables?

A

Because it organizes data for faster analytical queries and clearer relationships.

18
Q

What is a surrogate key?

A

A generated ID used as the primary key in warehouse tables instead of natural identifiers.

19
Q

Why are surrogate keys useful?

A

Because natural identifiers can change

20
Q

What is a staging table?

A

A temporary table used to hold raw data before it is cleaned or transformed.

21
Q

Why do pipelines use staging tables?

A

They allow data engineers to validate and transform raw data safely before loading it into production tables.

22
Q

What is a data pipeline?

A

A system that automatically moves and transforms data from source systems to storage or analytics systems.

23
Q

What is batch processing?

A

Processing large groups of data at scheduled intervals rather than in real time.

24
Q

What is streaming processing?

A

Processing data continuously as it arrives rather than waiting for batches.

25
What is a data warehouse?
A database optimized for analyzing large volumes of structured data.
26
What is a data lake?
A storage system that holds large amounts of raw data in many formats before it is structured.
27
What is schema-on-read?
Data is stored raw and structured only when queries are run.
28
What is schema-on-write?
Data must be structured before being stored in the database.
29
What is data lineage?
The ability to trace where data came from and how it was transformed along the pipeline.
30
What is a data quality check?
A validation step that ensures data is accurate