Advanced SQL Patterns Data Engineers Use Daily Flashcards by Play Like A Pro

What is a window function pattern used for ranking rows?

A ranking window function such as ROW_NUMBER or RANK numbers rows inside groups without collapsing them, unlike GROUP BY. It is like sorting the students within each class and numbering them.
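
As a rough illustration of the class-and-students analogy, the sketch below runs ROW_NUMBER and RANK side by side. It uses Python's built-in sqlite3 module only as a convenient way to execute the SQL (window functions need SQLite 3.25 or newer); the `scores` table, its columns, and the sample rows are made up for this example.

```python
import sqlite3

# Hypothetical data: (class, student, score). All names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (class TEXT, student TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("A", "Ana", 90), ("A", "Ben", 90), ("A", "Cy", 80), ("B", "Dee", 70)],
)

ranked = conn.execute(
    """
    SELECT class, student, score,
           -- ROW_NUMBER always gives distinct numbers; student breaks ties.
           ROW_NUMBER() OVER (PARTITION BY class ORDER BY score DESC, student) AS row_num,
           -- RANK gives tied rows the same number and then skips ahead.
           RANK() OVER (PARTITION BY class ORDER BY score DESC) AS rnk
    FROM scores
    ORDER BY class, row_num
    """
).fetchall()

for row in ranked:
    print(row)
```

Every input row is still present in the result; the two tied students in class A get different row numbers but the same rank.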

How do data engineers use ROW_NUMBER() to remove duplicates?

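You partition rows that share the same key and number them with ROW_NUMBER, ordered so that the preferred row (for example, the most recent) comes first. Then you keep only the rows numbered 1. This removes duplicates while keeping the best or most recent row for each key.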
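
A minimal sketch of this pattern, assuming a hypothetical `customers_raw` table where `customer_id` identifies duplicates and `updated_at` decides which copy to keep. Python's sqlite3 module is used only to run the SQL (SQLite 3.25 or newer for window functions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical raw table that contains two versions of customer 1.
conn.execute(
    "CREATE TABLE customers_raw (customer_id INTEGER, email TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO customers_raw VALUES (?, ?, ?)",
    [
        (1, "old@example.com", "2024-01-01"),
        (1, "new@example.com", "2024-03-01"),
        (2, "solo@example.com", "2024-02-15"),
    ],
)

deduped = conn.execute(
    """
    SELECT customer_id, email, updated_at
    FROM (
        SELECT customer_id, email, updated_at,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id      -- one group per duplicate key
                   ORDER BY updated_at DESC      -- most recent row gets number 1
               ) AS rn
        FROM customers_raw
    )
    WHERE rn = 1
    ORDER BY customer_id
    """
).fetchall()

print(deduped)
```

The ORDER BY inside the window is what defines "best"; changing it changes which duplicate survives.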

What is a deduplication query?

A query that removes duplicate rows from a dataset while keeping the most relevant record for each key. It usually uses ROW_NUMBER with PARTITION BY.

What does PARTITION BY help achieve in analytics queries?

It divides rows into groups so that a window calculation runs separately inside each group, while every row is still returned.
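
To make the grouping concrete, here is a small sketch that attaches a per-region total to every row. The `sales` table and its values are hypothetical, and sqlite3 (SQLite 3.25 or newer) is used only to run the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("east", 30), ("west", 5)],
)

with_totals = conn.execute(
    """
    SELECT region, amount,
           -- The sum restarts for each region, but no rows are collapsed.
           SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales
    ORDER BY region, amount
    """
).fetchall()

print(with_totals)
```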

What is a Slowly Changing Dimension (SCD)?

A method for handling changes in dimension data over time. Depending on the type, you either overwrite old values or keep historical records so you know what the data looked like in the past.

What is SCD Type 1?

Type 1 simply overwrites old data. It keeps only the most recent value and does not track history.
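
A Type 1 change can be sketched as a plain UPDATE. The `dim_customer` table and the city values below are hypothetical; sqlite3 is used only to run the statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, city TEXT)")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Lisbon')")

# Type 1: overwrite in place. The previous value ('Lisbon') is lost.
conn.execute(
    "UPDATE dim_customer SET city = ? WHERE customer_id = ?",
    ("Porto", 1),
)

rows = conn.execute("SELECT customer_id, city FROM dim_customer").fetchall()
print(rows)
```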

What is SCD Type 2?

Type 2 keeps historical versions of rows. When a value changes, a new row is created with validity timestamps so you can see past states of the data.
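
One common way to sketch Type 2 is with `valid_from` and `valid_to` columns, where an open-ended `valid_to` (NULL here) marks the current version. The column names, the `apply_change` helper, and the sample data are assumptions for this example, not a fixed standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dim_customer (
        customer_sk INTEGER PRIMARY KEY,  -- generated key, one per version
        customer_id INTEGER,              -- identifier from the source system
        city        TEXT,
        valid_from  TEXT,
        valid_to    TEXT                  -- NULL means "current version"
    )
    """
)
conn.execute(
    "INSERT INTO dim_customer (customer_id, city, valid_from, valid_to) "
    "VALUES (1, 'Lisbon', '2024-01-01', NULL)"
)

def apply_change(conn, customer_id, new_city, change_date):
    """Hypothetical helper: close the current version, then add a new one."""
    conn.execute(
        "UPDATE dim_customer SET valid_to = ? "
        "WHERE customer_id = ? AND valid_to IS NULL",
        (change_date, customer_id),
    )
    conn.execute(
        "INSERT INTO dim_customer (customer_id, city, valid_from, valid_to) "
        "VALUES (?, ?, ?, NULL)",
        (customer_id, new_city, change_date),
    )

apply_change(conn, 1, "Porto", "2024-06-01")

history = conn.execute(
    "SELECT city, valid_from, valid_to FROM dim_customer "
    "WHERE customer_id = 1 ORDER BY valid_from"
).fetchall()

# Point-in-time lookup: which city applied on 2024-03-15?
as_of = conn.execute(
    "SELECT city FROM dim_customer "
    "WHERE customer_id = ? AND valid_from <= ? "
    "AND (valid_to IS NULL OR valid_to > ?)",
    (1, "2024-03-15", "2024-03-15"),
).fetchone()

print(history, as_of)
```

The dates are stored as ISO-8601 text so that string comparison matches chronological order.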

Why do data warehouses use SCD Type 2?

Because analysts often need to know what the data looked like at a specific time in history.

What is an incremental data pipeline?

A pipeline that, instead of reprocessing all data on every run, processes only the records that are new or changed since the last run.
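
One way to sketch this is a "high-water mark": remember the latest `updated_at` already loaded and pull only rows newer than it. The tables, the `watermark` table, and the `run_incremental` helper are hypothetical, and `INSERT OR REPLACE` is SQLite-specific syntax used here in place of a MERGE or upsert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_events (id INTEGER, payload TEXT, updated_at TEXT)")
conn.execute(
    "CREATE TABLE target_events (id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT)"
)
conn.execute("CREATE TABLE watermark (last_loaded TEXT)")
conn.execute("INSERT INTO watermark VALUES ('1970-01-01')")

def run_incremental(conn):
    """Load only rows changed since the stored watermark; return rows loaded."""
    (last_loaded,) = conn.execute("SELECT last_loaded FROM watermark").fetchone()
    new_rows = conn.execute(
        "SELECT id, payload, updated_at FROM source_events "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_loaded,),
    ).fetchall()
    # Upsert so a changed record replaces its earlier version.
    conn.executemany("INSERT OR REPLACE INTO target_events VALUES (?, ?, ?)", new_rows)
    if new_rows:
        # Advance the watermark to the newest timestamp just loaded.
        conn.execute("UPDATE watermark SET last_loaded = ?", (new_rows[-1][2],))
    return len(new_rows)

conn.executemany(
    "INSERT INTO source_events VALUES (?, ?, ?)",
    [(1, "a", "2024-01-01"), (2, "b", "2024-01-02")],
)
first_run = run_incremental(conn)

conn.execute("INSERT INTO source_events VALUES (3, 'c', '2024-01-03')")
second_run = run_incremental(conn)

total = conn.execute("SELECT COUNT(*) FROM target_events").fetchone()[0]
print(first_run, second_run, total)
```

The second run touches only the one new record rather than rescanning everything already loaded.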

Why are incremental pipelines important?

They make large systems scalable, because repeatedly reprocessing millions of unchanged records would waste time and resources.

What is a time-series query?

A query that analyzes data over time, typically by grouping or ordering rows on a date or timestamp column.
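
A simple instance is bucketing events by day. The sketch below assumes a hypothetical `orders` table with ISO-8601 timestamps stored as text and uses SQLite's `date()` function; other databases have their own truncation functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (ts TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("2024-01-01 09:00:00", 10),
        ("2024-01-01 17:30:00", 5),
        ("2024-01-02 08:15:00", 7),
    ],
)

daily_totals = conn.execute(
    """
    SELECT date(ts) AS day, SUM(amount) AS total
    FROM orders
    GROUP BY date(ts)   -- one output row per calendar day
    ORDER BY day
    """
).fetchall()

print(daily_totals)
```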

What is a rolling average query?

A rolling average calculates the average over a moving window of rows, such as the current row and the rows just before it.
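
As a sketch, a three-row rolling average can be written with a window frame. The `daily_sales` table and its values are hypothetical; sqlite3 (SQLite 3.25 or newer) is used only to run the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [("2024-01-01", 10), ("2024-01-02", 20), ("2024-01-03", 30), ("2024-01-04", 40)],
)

rolling = conn.execute(
    """
    SELECT day,
           AVG(amount) OVER (
               ORDER BY day
               -- Window = this row plus up to two rows before it.
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS avg_3_rows
    FROM daily_sales
    ORDER BY day
    """
).fetchall()

print(rolling)
```

The first two rows average over fewer than three values because the window has not filled yet.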

What problem does a rolling window solve?

It smooths out fluctuations so trends become easier to see.

What is a star schema?

A data warehouse design where a central fact table connects to multiple dimension tables, forming a star shape.
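
A tiny star schema might look like the sketch below: one fact table joined to two dimension tables. All table names, columns, and rows are invented for illustration; sqlite3 is used only to run the SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Dimension tables hold descriptive attributes.
conn.execute("CREATE TABLE dim_customer (customer_sk INTEGER PRIMARY KEY, country TEXT)")
conn.execute("CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, name TEXT)")
# The fact table holds measurements plus a key into each dimension.
conn.execute(
    "CREATE TABLE fact_sales (customer_sk INTEGER, product_sk INTEGER, amount INTEGER)"
)

conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", [(1, "PT"), (2, "US")])
conn.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "Book"), (2, "Pen")])
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(1, 1, 20), (1, 2, 5), (2, 1, 20)],
)

sales_by_country_product = conn.execute(
    """
    SELECT c.country, p.name, SUM(f.amount) AS total
    FROM fact_sales AS f
    JOIN dim_customer AS c ON c.customer_sk = f.customer_sk
    JOIN dim_product  AS p ON p.product_sk  = f.product_sk
    GROUP BY c.country, p.name
    ORDER BY c.country, p.name
    """
).fetchall()

print(sales_by_country_product)
```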

What is a fact table?

A table that stores measurable events, such as sales.

What is a dimension table?

A table that stores descriptive information, such as customers.

Why do data warehouses separate fact and dimension tables?

Because it organizes data for faster analytical queries and clearer relationships.

What is a surrogate key?

A generated ID used as the primary key in warehouse tables instead of natural identifiers.

Why are surrogate keys useful?

Because natural identifiers can change, while a generated key stays stable.
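
The sketch below shows the idea with a hypothetical customer whose email (the natural identifier) changes. Because the fact row points at the generated `customer_sk`, the join still works afterwards. In SQLite, an `INTEGER PRIMARY KEY` column is filled in automatically when omitted from the insert; other databases use sequences or identity columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_sk INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE TABLE fact_orders (customer_sk INTEGER, amount INTEGER)")

# The database generates the surrogate key; we never derive it from the email.
cur = conn.execute("INSERT INTO dim_customer (email) VALUES ('ana@old.example')")
customer_sk = cur.lastrowid
conn.execute("INSERT INTO fact_orders VALUES (?, ?)", (customer_sk, 50))

# The natural identifier changes; the surrogate key does not.
conn.execute(
    "UPDATE dim_customer SET email = 'ana@new.example' WHERE customer_sk = ?",
    (customer_sk,),
)

joined = conn.execute(
    """
    SELECT d.email, f.amount
    FROM fact_orders AS f
    JOIN dim_customer AS d ON d.customer_sk = f.customer_sk
    """
).fetchall()

print(joined)
```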

What is a staging table?

An intermediate, often short-lived table used to hold raw data before it is cleaned or transformed.

Why do pipelines use staging tables?

They allow data engineers to validate and transform raw data safely before loading it into production tables.
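
A minimal sketch of that flow, assuming a hypothetical `stg_orders` table that accepts everything as text and a stricter `orders` table. The validation rule (skip rows with an empty amount) is only an example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Staging: loose types, no constraints, so raw loads never fail.
conn.execute("CREATE TABLE stg_orders (order_id TEXT, amount TEXT)")
# Production: typed and constrained.
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)")

conn.executemany(
    "INSERT INTO stg_orders VALUES (?, ?)",
    [("1", "19.5"), ("2", ""), ("3", "5")],  # order 2 has a missing amount
)

# Validate and transform while moving rows from staging to production.
conn.execute(
    """
    INSERT INTO orders (order_id, amount)
    SELECT CAST(order_id AS INTEGER), CAST(amount AS REAL)
    FROM stg_orders
    WHERE amount <> ''
    """
)

loaded = conn.execute("SELECT order_id, amount FROM orders ORDER BY order_id").fetchall()
rejected = conn.execute("SELECT order_id FROM stg_orders WHERE amount = ''").fetchall()
print(loaded, rejected)
```

The bad row stays visible in staging for inspection instead of breaking the production load.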

What is a data pipeline?

A system that automatically moves and transforms data from source systems to storage or analytics systems.

What is batch processing?

Processing large groups of data at scheduled intervals rather than in real time.

What is streaming processing?

Processing data continuously as it arrives rather than waiting for batches.

What is a data warehouse?

A database optimized for analyzing large volumes of structured data.

What is a data lake?

A storage system that holds large amounts of raw data in many formats before it is structured.

What is schema-on-read?

Data is stored raw, and a structure is applied only when a query reads it.
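
As a rough sketch, raw JSON text can be stored untouched and interpreted only at read time. The `raw_events` table and the field names are hypothetical, and the parsing is done in Python with the standard `json` module rather than in SQL.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# No schema is imposed on the payload: each row is just a blob of text.
conn.execute("CREATE TABLE raw_events (body TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?)",
    [('{"user": "ana", "clicks": 3}',), ('{"user": "ben"}',)],
)

# The structure (which fields exist, what the defaults are) is decided when reading.
parsed = [
    json.loads(body)
    for (body,) in conn.execute("SELECT body FROM raw_events ORDER BY rowid")
]
clicks = [(event["user"], event.get("clicks", 0)) for event in parsed]

print(clicks)
```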

What is schema-on-write?

Data must conform to a defined schema before it is stored in the database.
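
In a relational database this shows up as constraints enforced at insert time. The sketch below uses a hypothetical `users` table with a NOT NULL column; the row that violates the schema is rejected before it is stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.execute("INSERT INTO users VALUES (1, 'ana@example.com')")

try:
    # Violates the NOT NULL constraint, so the write itself fails.
    conn.execute("INSERT INTO users VALUES (2, NULL)")
    was_rejected = False
except sqlite3.IntegrityError:
    was_rejected = True

stored = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(was_rejected, stored)
```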

What is data lineage?

The ability to trace where data came from and how it was transformed along the pipeline.

What is a data quality check?

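A validation step that checks data against expected rules, such as required values being present and keys being unique, before the data is used downstream.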
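
Two common checks, missing values and duplicate keys, can be sketched as simple queries. The `orders` table, the sample rows, and the pass/fail rule are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (2, None), (2, 7.5)],  # one missing amount, one duplicated id
)

# Check 1: required column must not be NULL.
null_amounts = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE amount IS NULL"
).fetchone()[0]

# Check 2: order_id must be unique.
duplicate_ids = conn.execute(
    "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1"
).fetchall()

checks_passed = null_amounts == 0 and not duplicate_ids
print(null_amounts, duplicate_ids, checks_passed)
```

A pipeline would typically stop or raise an alert when `checks_passed` is false.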

(30 cards)