Chapter 3 - Designing and Implementing the Data Exploration Layer Flashcards

Question 1

Q

How are dedicated SQL pools (Synapse) and clusters (Databricks) billed and allocated?

Answer

A

They are pre-allocated compute resources that are always available and stay on unless paused or turned off. As a result, storage is persistent and billing continues regardless of use. This gives highly predictable and consistent performance and billing, which suits an application serving layer.

Question 2

Q

How does the Synapse serverless SQL pool work?

Answer

A

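Synapse provides a built-in endpoint to execute queries and automatically allocates resources for each query, so there is nothing to provision. Cost is based on consumption (the amount of data each query processes). It does not store data long term; it queries data where it already sits in external storage. It should not be used for transactional data.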
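
To make the contrast between provisioned and consumption-based billing concrete, here is a minimal sketch. Both rates are hypothetical placeholders chosen for illustration, not real Azure prices, and the function names are invented for this example.

```python
# All prices below are hypothetical placeholders, not real Azure rates.
DEDICATED_RATE_PER_HOUR = 1.50   # assumed hourly compute rate while the pool is running
SERVERLESS_RATE_PER_TB = 5.00    # assumed price per terabyte of data processed


def dedicated_cost(hours_running: float) -> float:
    """Provisioned compute: billed for every hour it runs, used or idle."""
    return hours_running * DEDICATED_RATE_PER_HOUR


def serverless_cost(tb_processed_per_query: list[float]) -> float:
    """Consumption-based: billed only for the data each query processes."""
    return sum(tb_processed_per_query) * SERVERLESS_RATE_PER_TB


# A 30-day month with the dedicated pool never paused.
always_on = dedicated_cost(hours_running=24 * 30)

# The same month with three ad hoc exploratory queries on the serverless pool.
ad_hoc = serverless_cost([0.2, 0.05, 1.0])

print(f"dedicated:  {always_on:.2f}")
print(f"serverless: {ad_hoc:.2f}")
```

The dedicated figure depends only on running time, which is why it is predictable; the serverless figure depends only on how much data the queries touch, which is why it suits occasional exploration.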

Question 3

Q

What does running Spark mean in practice?

Answer

A

On Azure, users choose a Spark service: HDInsight (flexible Apache ecosystem), Databricks (popular, fully managed Spark), or Synapse Spark (integrated Spark pools).

First, a cluster is created (distributed computing). Then a notebook written in Python, SQL, or another supported language is run against the cluster. Spark splits the data and the tasks across the cluster nodes and consolidates the results, handling data partitioning, task scheduling, and result aggregation automatically and hiding much of the underlying complexity.
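
The split, process, and consolidate pattern described above can be sketched in plain Python. This is an analogy only, not Spark code: threads stand in for cluster nodes, and the function names are hypothetical. In Spark, the partitioning and scheduling shown here by hand would happen automatically.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data: list[int], num_partitions: int) -> list[list[int]]:
    """Split the data into roughly equal chunks, one per (pretend) worker node."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def partial_sum_of_squares(chunk: list[int]) -> int:
    """The task each worker runs on its own partition."""
    return sum(x * x for x in chunk)


def distributed_sum_of_squares(data: list[int], num_partitions: int = 4) -> int:
    chunks = partition(data, num_partitions)
    # Threads stand in for cluster nodes; Spark would schedule these tasks itself.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    # Consolidate the per-partition results into one answer.
    return sum(partials)


print(distributed_sum_of_squares(list(range(1, 101))))
```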

Question 4

Q

How does the data catalogue work in Microsoft Purview?

Answer

A

It allows users to sort and filter objects by type (e.g. table, data share, data pipeline, report, folder) or by tag (e.g. Dev, Test, Prod).

It helps classify and discover data objects based on their metadata.
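
As a rough illustration of filtering catalogue entries by type and tag, the sketch below uses an in-memory list. The `CatalogAsset` class, the sample assets, and `find_assets` are all hypothetical and are not part of the Purview API; they only model the idea of metadata-driven discovery.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CatalogAsset:
    """Hypothetical catalogue entry: a name plus the metadata used for discovery."""
    name: str
    asset_type: str
    tags: set[str] = field(default_factory=set)


# Invented sample entries, for illustration only.
CATALOG = [
    CatalogAsset("sales_orders", "table", {"Prod"}),
    CatalogAsset("sales_orders_staging", "table", {"Dev"}),
    CatalogAsset("nightly_load", "data pipeline", {"Prod"}),
    CatalogAsset("revenue_dashboard", "report", {"Test"}),
]


def find_assets(
    catalog: list[CatalogAsset],
    asset_type: Optional[str] = None,
    tag: Optional[str] = None,
) -> list[CatalogAsset]:
    """Return assets matching the given type and/or tag; None means 'any'."""
    return [
        asset
        for asset in catalog
        if (asset_type is None or asset.asset_type == asset_type)
        and (tag is None or tag in asset.tags)
    ]


for asset in find_assets(CATALOG, asset_type="table", tag="Prod"):
    print(asset.name)
```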

Question 5

Q

What role does pushing data lineage to Microsoft Purview play in a data exploration layer?

Answer

A

It records the data’s origin, movement, and transformation processes, enabling better traceability, governance, and trust in the data discovered during exploration.
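
Lineage can be thought of as a graph linking each asset to the assets it was derived from. The sketch below walks such a graph to find everything upstream of a report. The asset names and the `upstream_sources` helper are hypothetical and do not reflect how Purview stores or exposes lineage.

```python
# Hypothetical lineage edges: each asset maps to the assets it was derived from.
LINEAGE = {
    "sales_report": ["sales_curated"],
    "sales_curated": ["sales_raw", "currency_rates"],
    "sales_raw": [],
    "currency_rates": [],
}


def upstream_sources(asset: str, lineage: dict[str, list[str]]) -> set[str]:
    """Return every asset that directly or indirectly feeds the given asset."""
    seen: set[str] = set()
    stack = list(lineage.get(asset, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already traced; also guards against cycles
        seen.add(current)
        stack.extend(lineage.get(current, []))
    return seen


print(sorted(upstream_sources("sales_report", LINEAGE)))
```

Tracing upstream in this way is what lets someone exploring a report judge where its numbers came from.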

Question 6

Q

What is the purpose of a Spark cluster within the context of data exploration?

Answer

A

A Spark cluster allows for distributed data processing and analytics, enabling large-scale data exploration, transformation, and machine learning tasks without heavy manual configuration.

Question 7

Q

How can pushing data lineage to Microsoft Purview impact compliance and auditing processes in regulated industries?

Answer

A

Detailed data lineage records provide transparency into data provenance, transformations, and usage, aiding compliance with regulations by offering auditors clear evidence of data handling processes, thus reducing risk and streamlining audit preparations.

(7 cards)