What is data partitioning?
The process of dividing data and storing it in physically different locations.
Involves vertical splitting of data across different files within the same machine.
Same machine ⇒ parallelism relies on local CPU cores and memory. (Slower)
What is data distribution?
Horizontal splitting of data across different machines.
Different machines ⇒ parallelism scales out to a cluster, letting each node handle its own chunk of the data independently. (Faster)
Describe the benefits of data partitioning?
What are the best practices for horizontal scaling in Azure?
Examples of enhancing security on Azure Data Analytics platform via partitioning?
Dynamic Data Masking: Masks sensitive data in query results by replacing actual data with obfuscated values (e.g., asterisks) based on defined rules. Azure SQL Database mask sensitive data in query results.
What are five benefits of partitioning data?
What type of storage does an Azure Storage Account support?
An Azure storage account contains all of your Azure Storage data objects: blobs, files, queues, and tables.
The storage account provides a unique namespace for your Azure Storage data that’s accessible from anywhere in the world over HTTP or HTTPS. Data in your storage account is durable and highly available, secure, and scalable.
How does Azure Storage store and manage blobs?
Azure Storage uses <account name + container name + blob name>
It continues to store blobs in the same partition until is reaches the internal limit, then is repartitions and rebalances data amongst partitions automatically. Adding a 3-digit hash to filenames to improve rebalancing.
What does ADLS G2 Hierarchical Namespace option do?
Provides folder indexing and security (ability to create Access Control Lists at the folder / file level).
What is a horizontal partition?
Divided data table with subsets of rows stored in different data stores (same schema as the parent table) is stored in different database instances (i.e. index block 1000-1999, 2000-2999).
What is a vertical partition?
Retain the primary key, split the data by most utilized columns to make reading the row faster.
What is a functional partition?
Separating data based on business sense, such as sensitivity (CustomerID, CustomerName, etc.).
What is Azure Event Hub?
Scalable service that processes events in real time, streaming data is partitioned based on user discretion.
What is Azure Stream Analytics?
Real-time analytics service designed to help analyze and visualize streaming data in real time. Takes partitioned data for processing to monitor, trigger alerts, or provide real-time reporting.
Receives data from Azure Event Hub.
What is Azure Databricks?
“Spark on Azure” streamlines big-data analytics and machine learning.
Collaborative notebooks for data science and data engineering.
Auto-scaling Spark clusters.
Integration with Azure Data Lake (Gen2), Cosmos DB (NoSQL), Synapse, and other Azure services.
List the various services that Azure Synapse Analytics contain.
SQL Data Warehouse (Serverless or Dedicated Pool), Synapse Spark, Azure Data Files (SMB, NFS protocols), Cosmos DB (NoSQL), Azure AD (IAM), Azure Purview (Governance)
Define a dedicated SQL Pool.
Dedicated SQL pool is a Massively Parallel Processing (MPP) system that splits queries into 60 parallel queries. Each smaller query runs on a “DISTRIBUTION,” which acts as the basic unit of processing and storage for the dedicated SQL pool. This architecture allows for high throughput and scalable performance by distributing data and workloads across multiple distributions. It also ensures efficient resource utilization and faster query results by processing data in parallel across many nodes.
What are the three different distributions (shards)?
______ is the rate at which you ingest data into an Azure Storage systems and the rate at which you move the data out of the Azure Storage system is called the ______.
ingress, egress
What are the three types of tables supported by a dedicated SQL pool?
Clustered Columnstore Tables:
Stores data in a columnar format, placing each column’s data contiguously to optimize compression and speed up analytics queries by scanning only necessary columns.
Ideal for data warehousing, big data analytics, and read-heavy operations where you perform aggregations, filtering, or scans on large datasets but don’t require frequent individual row updates. It excels in scenarios requiring high compression and fast query performance over many rows and columns.
Clustered Index Tables:
Organizes data rows according to a clustered index, physically ordering rows by key to facilitate fast row-based lookups at the cost of potentially slower inserts/updates.
Best for transactional systems or operational databases where quick retrieval of rows by key is critical. Use when you have frequent single-row reads, updates, or deletes based on indexed columns, and maintaining sorted order can greatly improve performance.
HEAP Tables:
Stores data in unstructured pages without a specific order, enabling fast bulk inserts but resulting in slower lookups and updates due to lack of organization.
Suitable for staging tables, temporary data storage, or scenarios where rapid bulk inserts are needed without immediate query performance requirements. Use when data organization isn’t a priority or when you plan to reorganize or index the data later after bulk loading.
What is range partitioning?
Range partitioning involves organizing data into partitions based on specific value intervals (e.g., all records from January 2023 in one partition, February 2023 in another), which is a different approach than grouping by alphabetical order.
What are three key considerations for implementing a partitioning strategy for analytical workloads?
What Azure services can you utilize to implement a partition strategy for streaming workloads?
What are the default limits for the number of storage accounts and storage account capacity in Azure?