Data Transformation Flashcards

Interview questions - basic transformation (30 cards)

1
Q

What is data transformation?

A

Data transformation is the process of converting raw data into a clean, structured, and usable format.

Example (Python – change column format):

import pandas as pd

df = pd.DataFrame({'date': ['2025-01-01', '2025-01-02']})
df['date'] = pd.to_datetime(df['date'])

2
Q

Why is data transformation important?

A

It ensures data consistency, quality, and compatibility for analysis and reporting.

Practical example:
Converting text-based dates into date format for reporting.

3
Q

What are common types of data transformation?

A
  • Filtering
  • Aggregation
  • Normalization
  • Standardization
  • Format conversion

These types help in preparing data for analysis.

4
Q

What is the difference between data transformation and data cleansing?

A
  • Cleansing: Fixing errors (duplicates, missing data)
  • Transformation: Changing structure or format

Example (SQL – cleansing + transformation):

SELECT DISTINCT UPPER(name) AS name
FROM customers
WHERE name IS NOT NULL;

5
Q

When is data transformation performed?

A

After data collection and before data analysis; in an ETL process it is the Transform step between Extract and Load.

This step is crucial for ensuring data is ready for analysis.

6
Q

What is ETL?

A

ETL stands for Extract, Transform, Load.

Azure Example:
Extract: Azure Data Factory pulls data from SQL Server
Transform: Mapping Data Flow cleans data
Load: Store data in Azure Data Lake or Synapse.
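
The three stages can also be sketched in plain Python, without any Azure services. This is a minimal illustration, not an Azure Data Factory pipeline: the CSV text, the `sales` table, and the function names are hypothetical, and an in-memory SQLite database stands in for the data lake or warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; a real pipeline would read from a source system.
RAW_CSV = """id,amount
1, 10.5
2,
3, 4.0
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with no amount and cast types."""
    cleaned = []
    for row in rows:
        amount = (row["amount"] or "").strip()
        if amount:
            cleaned.append((int(row["id"]), float(amount)))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into a target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT id, amount FROM sales ORDER BY id").fetchall()
print(loaded)
```

Keeping each stage in its own function makes it possible to test the transform step without touching the source or the target.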

7
Q

What is data normalization?

A

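Scaling numeric values to a standard range (for example, 0 to 1) so that columns are comparable. In database design, the same term means restructuring tables to reduce redundancy.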

Python Example (Min-Max scaling):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['salary']] = scaler.fit_transform(df[['salary']])

8
Q

What is data aggregation?

A

Summarizing data (sum, count, average).

SQL Example:
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;

9
Q

What is data filtering?

A

Selecting only the rows (or columns) that meet a given condition.

SQL Example:
SELECT * FROM orders
WHERE order_date >= '2025-01-01';

10
Q

What is data mapping?

A

Matching fields from source to destination.

Example:
cust_id → customer_id, dob → birth_date.
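
A field mapping like this is often held as a dictionary and applied to each record. The sketch below assumes records are plain Python dicts; `FIELD_MAP`, `apply_mapping`, and the sample row are hypothetical names used only for illustration.

```python
# Hypothetical source-to-destination field mapping, taken from the example above.
FIELD_MAP = {"cust_id": "customer_id", "dob": "birth_date"}

def apply_mapping(record, field_map):
    """Rename keys found in field_map; pass other keys through unchanged."""
    return {field_map.get(key, key): value for key, value in record.items()}

source_row = {"cust_id": 42, "dob": "1990-05-17", "name": "Ana"}
mapped_row = apply_mapping(source_row, FIELD_MAP)
print(mapped_row)
```

With pandas, the same dictionary can be passed to `DataFrame.rename(columns=FIELD_MAP)`.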

11
Q

What is data format conversion?

A

Changing data from one format to another.

Python Example:
df['phone'] = df['phone'].astype(str)

12
Q

What is schema transformation?

A

Changing table structure (columns, data types).

SQL Example:
ALTER TABLE employees
ADD email VARCHAR(100);

13
Q

What is the difference between structured and unstructured data transformation?

A
  • Structured: Tables (SQL)
  • Unstructured: Text, images (Python, AI tools)

This distinction affects how data is processed.
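
One common way to transform unstructured text is to extract fields into structured records. The sketch below is one possible approach using a regular expression; the sample lines and the pattern are assumptions about the input shape, not a general-purpose parser.

```python
import re

# Hypothetical free-text lines; the pattern below is an assumption about their shape.
LINES = [
    "2025-01-01 ERROR payment failed for order 1001",
    "2025-01-02 INFO order 1002 shipped",
    "not a log line",
]

PATTERN = re.compile(r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<message>.+)$")

def parse_line(line):
    """Return a dict of fields, or None when the line does not match."""
    match = PATTERN.match(line)
    return match.groupdict() if match else None

# Keep only lines that could be parsed into structured records.
records = [rec for rec in map(parse_line, LINES) if rec is not None]
print(records)
```

Once the text is in this tabular shape, the structured techniques on the other cards (filtering, aggregation) apply to it as well.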

14
Q

What is data standardization?

A

Converting values to one consistent format or convention, such as a single spelling for each country.

SQL Example:
UPDATE customers
SET country = 'USA'
WHERE country IN ('US', 'United States');

15
Q

What is data enrichment?

A

Adding extra information to existing data.

Example:
Adding city names from ZIP codes via an external API.
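
The idea can be sketched with a local lookup table standing in for the external service. `ZIP_TO_CITY`, `enrich_with_city`, and the sample customers are hypothetical; a real job would probably replace the dictionary lookup with an API call and handle failures.

```python
# Hypothetical ZIP-to-city reference data; a real job might call an external API instead.
ZIP_TO_CITY = {"10001": "New York", "94105": "San Francisco"}

def enrich_with_city(records, lookup):
    """Return new records with a 'city' field; unknown ZIP codes get None."""
    return [{**rec, "city": lookup.get(rec["zip"])} for rec in records]

customers = [
    {"name": "Ana", "zip": "10001"},
    {"name": "Ben", "zip": "99999"},
]
enriched = enrich_with_city(customers, ZIP_TO_CITY)
print(enriched)
```

Returning new dicts instead of mutating the input keeps the original records available for comparison or reprocessing.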

16
Q

What are tools used for data transformation?

A
  • SQL
  • Python
  • Azure Data Factory
  • Azure Synapse
  • Databricks

These tools facilitate various transformation tasks.

17
Q

What is the role of SQL in data transformation?

A

SQL is used to filter, join, and aggregate data.

Example:
SELECT c.name, o.total
FROM customers c
JOIN orders o ON c.id = o.customer_id;

18
Q

What is the difference between batch and real-time transformation?

A
  • Batch: Large data at scheduled times
  • Real-time: Continuous data (streaming)

Azure Example:
Azure Stream Analytics for real-time data.
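
The contrast can be illustrated in plain Python, independent of any Azure service. In this sketch a list stands in for a batch and an iterator stands in for a stream; the function names and the sample readings are hypothetical.

```python
def batch_average(values):
    """Batch: the full dataset is available before processing starts."""
    return sum(values) / len(values)

def streaming_average(stream):
    """Streaming: update the result incrementally as each value arrives."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count

readings = [10.0, 20.0, 30.0]
batch_result = batch_average(readings)
stream_results = list(streaming_average(iter(readings)))
print(batch_result, stream_results)
```

The streaming version never needs the whole dataset in memory, which is why it suits continuous sources.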

19
Q

What is cloud-based data transformation?

A

Transformation done using cloud services.

Azure Example:
Azure Data Factory Mapping Data Flow.

20
Q

What is a data pipeline?

A

Automated flow of data from source to destination.

Azure Example:
Source → ADF → Data Lake → Synapse.

21
Q

How does transformation affect data quality?

A

Done correctly, it improves accuracy, consistency, and usability.

This is essential for reliable analysis.

22
Q

What are challenges in data transformation?

A
  • Large data size
  • Performance issues
  • Incorrect mappings

These challenges can complicate the transformation process.

23
Q

How do you handle missing values?

A
  • Remove the affected rows or columns
  • Replace with a default value or an average

Python Example:
df.fillna(0, inplace=True)
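
Both options can also be shown without pandas. This is a minimal sketch that assumes missing values are represented as `None`; the helper names and sample salaries are hypothetical.

```python
from statistics import mean

def drop_missing(values):
    """Option 1: remove missing entries."""
    return [v for v in values if v is not None]

def fill_with_mean(values):
    """Option 2: replace missing entries with the mean of the known values."""
    known = drop_missing(values)
    if not known:
        return values[:]  # nothing to average; return an unchanged copy
    avg = mean(known)
    return [avg if v is None else v for v in values]

salaries = [100.0, None, 300.0]
dropped = drop_missing(salaries)
filled = fill_with_mean(salaries)
print(dropped, filled)
```

Which option is appropriate depends on the data: dropping loses rows, while filling with an average can hide how much data was actually missing.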

24
Q

What is data validation?

A

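Checking that transformed data meets defined rules. The query below returns the rows that break the rule that a salary cannot be negative.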

SQL Example:
SELECT * FROM employees
WHERE salary < 0;
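
The same kind of rule check can be sketched in Python. The rules here (salary present and non-negative, name non-empty) and the sample rows are assumptions chosen for illustration.

```python
def validate_employee(row):
    """Return a list of rule violations for one row (empty means valid)."""
    errors = []
    if row.get("salary") is None:
        errors.append("salary is missing")
    elif row["salary"] < 0:
        errors.append("salary is negative")
    if not row.get("name"):
        errors.append("name is empty")
    return errors

employees = [
    {"name": "Ana", "salary": 5000},
    {"name": "", "salary": -10},
]
# Map each row index to its list of violations.
report = {i: validate_employee(row) for i, row in enumerate(employees)}
print(report)
```

Collecting all violations per row, instead of stopping at the first one, gives a fuller picture when fixing the upstream transformation.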

25
Q

How do you optimize transformation?

A
  • Indexing
  • Parallel processing
  • Cloud scaling

These methods enhance performance.
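
Parallel processing can be sketched by splitting the data into chunks and transforming them concurrently. This example uses the standard-library `ThreadPoolExecutor`; the chunk size, the worker count, and the cents-to-dollars transformation are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def transform_chunk(chunk):
    """Hypothetical per-chunk transformation: convert cents to dollars."""
    return [cents / 100 for cents in chunk]

amounts_cents = [1050, 2000, 399, 125, 7600]

# executor.map preserves input order, so results line up with the chunks.
with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(transform_chunk, chunked(amounts_cents, 2)))

# Flatten the per-chunk results back into one list.
amounts_dollars = [value for chunk in results for value in chunk]
print(amounts_dollars)
```

In CPython, threads mainly help when each chunk waits on I/O; CPU-bound work usually needs `ProcessPoolExecutor` or a distributed engine to see a speedup.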

26
Q

What are the steps to transform raw data?

A
  • Collect
  • Clean
  • Transform
  • Validate
  • Store

Following these steps ensures effective data processing.
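
The five steps can be sketched as one small function each. The sample records, the rules, and the list standing in for a storage target are all hypothetical.

```python
def collect():
    """Collect: hypothetical raw records as they might arrive from a source."""
    return [" Alice ,30", "bob,", "Cara,41"]

def clean(lines):
    """Clean: trim whitespace and drop records with a missing age."""
    rows = [[part.strip() for part in line.split(",")] for line in lines]
    return [row for row in rows if len(row) == 2 and row[1]]

def transform(rows):
    """Transform: normalize name casing and cast age to int."""
    return [{"name": name.title(), "age": int(age)} for name, age in rows]

def validate(records):
    """Validate: fail fast if any record breaks a simple rule."""
    for rec in records:
        if rec["age"] < 0:
            raise ValueError(f"invalid age in {rec}")
    return records

def store(records, target):
    """Store: append to a target list standing in for a table or file."""
    target.extend(records)

warehouse = []
store(validate(transform(clean(collect()))), warehouse)
print(warehouse)
```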

27
Q

How to ensure accuracy during transformation?

A
  • Validation checks
  • Sample testing
  • Logging

These practices help maintain data integrity.
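
The three practices can be combined in a single wrapper around a transformation. This is a sketch under stated assumptions: `to_upper_names` is a hypothetical transformation, and the checks shown (row-count reconciliation and a first-record spot check) are only two of many possible ones.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("transform")

def to_upper_names(rows):
    """Hypothetical transformation under test: upper-case the name field."""
    return [{**row, "name": row["name"].upper()} for row in rows]

def run_with_checks(rows):
    """Run the transformation, log row counts, and apply basic checks."""
    result = to_upper_names(rows)
    logger.info("rows in=%d, rows out=%d", len(rows), len(result))
    # Validation check: no rows should be lost or duplicated.
    if len(result) != len(rows):
        raise ValueError("row count changed during transformation")
    # Sample test: spot-check the first record against a known expectation.
    if rows and result[0]["name"] != rows[0]["name"].upper():
        raise ValueError("sample record did not transform as expected")
    return result

checked = run_with_checks([{"name": "ana"}, {"name": "ben"}])
print(checked)
```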

28
Q

What happens if transformation is incorrect?

A

Wrong analysis and poor business decisions.

This highlights the importance of accurate transformation.

29
Q

What is a real-world example of data transformation?

A

Converting daily sales CSV files into summarized monthly reports using SQL and Python.

This illustrates practical application in business.
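
The Python side of such a job might look like the sketch below. The column names (`date`, `amount`) and the inline sample data are assumptions; real input would be read from the daily CSV files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical daily sales data; real input would be read from CSV files on disk.
DAILY_SALES_CSV = """date,amount
2025-01-01,100.0
2025-01-15,50.0
2025-02-03,75.5
"""

def monthly_totals(csv_text):
    """Sum amounts per YYYY-MM month key."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        month = row["date"][:7]  # "2025-01-15" -> "2025-01"
        totals[month] += float(row["amount"])
    return dict(totals)

summary = monthly_totals(DAILY_SALES_CSV)
print(summary)
```

The equivalent SQL would group by the month part of the date and sum the amount, much like the aggregation card's `GROUP BY` example.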

30
Q

What is an Azure real-world scenario (short)?

A

Azure Data Factory pulls data → cleans nulls → aggregates sales → loads into Azure Synapse for Power BI reporting.

This showcases a typical ETL process in Azure.