Get Started with Databricks for Data Engineering Flashcards

In this course, you will learn basic skills for using the Databricks Data Intelligence Platform to perform a simple data engineering workflow and support data warehousing endeavors. You will be given a tour of the workspace and shown how to work with objects in Databricks such as catalogs, schemas, volumes, tables, compute clusters, and notebooks. You will then follow a basic data engineering workflow to perform tasks such as creating and working with tables, ingesting data… (37 cards)

1
Q

Open-source storage framework for reading and writing data files in cloud storage.

A

Delta Lake

The default format for tables created in Databricks

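A quick way to see the default format in practice (the table name here is illustrative): creating a table without specifying a format produces a Delta table.

```sql
-- "USING DELTA" is the default in Databricks and can be omitted
CREATE TABLE demo_table (id INT, name STRING);

-- DESCRIBE DETAIL reports table metadata, including the format ("delta")
DESCRIBE DETAIL demo_table;
```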
2
Q

Either the entire transaction completes, or none of it does

A

Atomicity

3
Q

Data must satisfy defined rules, or the transaction is rolled back

A

Consistency

4
Q

Concurrent transactions behave as if each one completes before the next begins

A

Isolation

5
Q

Data is saved in a persistent state once completed

A

Durability

6
Q

Automatically adjusts the schema of your Delta table as your data changes

A

Schema Evolution

7
Q

Ensures that any data written to the Delta table matches the table’s defined schema

A

Schema Enforcement

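The two behaviors can be sketched together. Table and column names below are illustrative, and the `autoMerge` setting shown is one common way to opt in to schema evolution for MERGE statements:

```sql
-- Schema enforcement: this INSERT fails if the incoming columns
-- don't match the table's defined schema
INSERT INTO events SELECT id, ts FROM staging_events;

-- Schema evolution: opt in so new columns found in the source
-- are added to the target table during MERGE
SET spark.databricks.delta.schema.autoMerge.enabled = true;

MERGE INTO events t
USING staging_events s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```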
8
Q

Creates a table by selecting data from an existing table or data source

A

CREATE TABLE AS SELECT (CTAS)

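A minimal CTAS sketch (table and column names are illustrative): the new table's schema is inferred from the query, so no separate CREATE plus INSERT is needed.

```sql
-- Create and populate a table in one statement
CREATE TABLE sales_summary AS
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;
```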
9
Q

Provides a point-and-click interface to upload files and create tables

A

Upload UI

10
Q

Incrementally (streaming) processes new data files as they arrive in cloud storage

A

Auto Loader

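In Databricks SQL and pipelines, Auto Loader is surfaced through `read_files` on a streaming table; a sketch under that assumption (the volume path is a placeholder):

```sql
-- Streaming table that incrementally picks up new JSON files
-- as they land in the given cloud storage location
CREATE OR REFRESH STREAMING TABLE raw_events
AS SELECT *
FROM STREAM read_files('/Volumes/main/default/landing/events', format => 'json');
```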
11
Q

What level of the Medallion Architecture is this?

  • Dumping ground for raw data from external source systems
  • Often with long retention (years)
  • Data as it originally existed
A

Bronze

12
Q

What level of the Medallion Architecture is this?

  • Filter, cleanse, join and enrich the data
  • Define structure and enforce or evolve schema
  • Single source of truth
A

Silver

13
Q

What level of the Medallion Architecture is this?

  • Clean data, ready for consumption
  • Can be business-level aggregates of data
  • Delivered downstream to users and applications
A

Gold

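The three layers above can be sketched as a chain of tables (all table and column names are illustrative; the bronze table is assumed to hold raw ingested data):

```sql
-- Silver: filter, cleanse, and conform the raw bronze data
CREATE OR REPLACE TABLE silver_orders AS
SELECT DISTINCT order_id, CAST(amount AS DOUBLE) AS amount, order_date
FROM bronze_orders
WHERE order_id IS NOT NULL;

-- Gold: business-level aggregates, ready for downstream consumption
CREATE OR REPLACE TABLE gold_daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM silver_orders
GROUP BY order_date;
```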
14
Q

Which statement describes the Databricks workspace?

A) It is a classroom setup for running Databricks lessons and exercises.

B) It is a mechanism for cleaning up lesson-specific assets created during a learning session.

C) It is a set of predefined tables and path variables within Databricks.

D) It is a solution for organizing assets within Databricks

A

D) It is a solution for organizing assets within Databricks

15
Q

What assets can be accessed from and organized within the Databricks workspace?

A) Virtual machine configurations for clusters

B) Machine learning models and algorithms

C) Notebooks and files

D) Cloud storage accounts

A

C) Notebooks and files

16
Q

Which statement describes Databricks Repos?

A) A feature for scheduling and orchestrating data pipelines within Databricks

B) A tool for managing virtual environments and dependencies in Databricks

C) A capability centered around continuous integration of assets in Databricks and external Git repositories.

D) An integrated development environment (IDE) specifically designed for Databricks notebooks

A

C) A capability centered around continuous integration of assets in Databricks and external Git repositories.

17
Q

What is the basic cloud-based compute structure of Databricks?

A) Data Nodes

B) Data Warehouses

C) Databricks Clusters

D) Databricks Instances

A

C) Databricks Clusters

18
Q

As a Data Engineer, which of the following would you use to orchestrate data tasks?

A) Databricks AI Library

B) Databricks Academy

C) Spark MLlib

D) LakeFlow Jobs

A

D) LakeFlow Jobs

19
Q

How do clusters and warehouses differ in their roles?

A) Clusters are designed for data visualization, while SQL warehouses execute SQL queries

B) Clusters provide compute resources for running notebooks, while SQL warehouses work specifically with SQL queries

C) Clusters offer storage optimization, while SQL warehouses provide data replication

D) Clusters handle machine learning tasks, while SQL warehouses focus on data processing

A

B) Clusters provide compute resources for running notebooks, while SQL warehouses work specifically with SQL queries

20
Q

What are the high-level configuration options available when setting up a cluster?

A) Data Transformation Pipelines, Machine Learning Models, and Data Visualization

B) Data Replication, Disk Encryption, and Data Partitioning.

C) Autoscaling Options, Access Mode, and Cluster Name

D) Notebook Sharing, Version Control, and User Permissions.

A

C) Autoscaling Options, Access Mode, and Cluster Name

21
Q

What are the primary high-level configuration options available when setting up a warehouse?

A) Compute Cluster Size, Auto-stop Timer, and Scaling Parameters

B) Query Execution Speed, Access Mode, and Visualization Mode

C) Data Compression, Cluster Name, and Query Optimization.

D) Data Replication, Notebook Sharing, and Data Partitioning

A

A) Compute Cluster Size, Auto-stop Timer, and Scaling Parameters

22
Q

What are the benefits of using the available serverless compute features?

A) Cost efficiency, scalability, and simplified management

B) Enhanced query performance for all workloads.

C) Fixed and predetermined billing structure.

D) Manual adjustment of resource allocation.

A

A) Cost efficiency, scalability, and simplified management

23
Q

What is the primary interface used by data engineers when working with Databricks?

A) Visual Studio Code

B) Command Line Interface

C) Databricks Notebooks

D) Data Dashboards

A

C) Databricks Notebooks

24
Q

What are the common use cases for data engineers when working with Notebooks?

A) Writing Research Papers

B) Creating Mobile Apps

C) Data Exploration, Reporting, and Dashboarding

D) Playing Online Games

A

C) Data Exploration, Reporting, and Dashboarding

25
Q

How does Databricks store data?

A) Data is stored on physical servers

B) Data is stored on local computers

C) Data is stored in cloud object storage locations and accessed via Databricks

D) Data is stored in cloud-based web servers

A

C) Data is stored in cloud object storage locations and accessed via Databricks
26
Q

What are the benefits of data storage in the data lakehouse architecture across roles and Databricks services?

A) Increased code complexity for data engineers

B) Faster data visualization for analysts

C) Simplified ETL processing and guaranteed data integrity

D) Enhanced security for data scientists

A

C) Simplified ETL processing and guaranteed data integrity
27
Q

What is the optimized storage layer that serves as the foundation for data storage in a data lakehouse architecture?

A) Apache Spark

B) MongoDB

C) Delta Lake

D) Apache Parquet

A

C) Delta Lake
28
Q

What is the default table type for all tables in Databricks?

A) Delta tables

B) Temporary tables

C) CSV tables

D) External tables

A

A) Delta tables
29
Q

What does Delta Lake include to improve performance?

A) Built-in and easy optimizations

B) Data compression

C) External data sources

D) Real-time streaming

A

A) Built-in and easy optimizations
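Two of those built-in commands can be sketched as follows (table and column names are illustrative):

```sql
-- Compact small files and co-locate related data for faster reads
OPTIMIZE sales ZORDER BY (customer_id);

-- Remove data files no longer referenced by the table
VACUUM sales;
```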
30
Q

What is the purpose of Unity Catalog in Databricks?

A) Machine learning platform

B) Distributed storage system

C) Centralized governance solution

D) Real-time data processing

A

C) Centralized governance solution
31
Q

What is the structure of the three-tier namespace?

A) Source, Transform, Load

B) Database, Collection, File

C) Catalog, Schema, Table

D) Data, Analysis, Visualization

A

C) Catalog, Schema, Table
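The three-tier namespace in use (catalog, schema, and table names are illustrative):

```sql
-- Fully qualified reference: catalog.schema.table
SELECT * FROM main.sales.orders;

-- Or set defaults first, then use the bare table name
USE CATALOG main;
USE SCHEMA sales;
SELECT * FROM orders;
```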
32
Q

What is the purpose of LakeFlow Jobs?

A) To monitor real-time data streams

B) To visualize data pipelines graphically

C) To create interactive notebooks for data analysis

D) To automate and orchestrate data workflows

A

D) To automate and orchestrate data workflows
33
Q

What is the primary purpose of LakeFlow Jobs?

A) Collaborative data analysis and exploration

B) Scheduling and automating tasks

C) Managing data pipelines and ETL processes

D) Enabling complex data transformations

A

B) Scheduling and automating tasks
34
Q

Which of the following types of assets can be automated using LakeFlow Jobs?

A) Partner integrations

B) BI Connectors

C) MLFlow

D) Notebooks, ETL pipelines, and ML model training

A

D) Notebooks, ETL pipelines, and ML model training
35
Q

What solution is designed for building and running robust data pipelines?

A) Delta Live Systems

B) Delta Live Streams

C) DLT

D) Delta Live Networks

A

C) DLT
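A minimal DLT-style sketch with a data-quality expectation (table, constraint, and column names are illustrative):

```sql
-- Rows failing the expectation are dropped rather than loaded
CREATE OR REFRESH STREAMING TABLE clean_orders (
  CONSTRAINT valid_order EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(LIVE.raw_orders);
```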
36
Q

What is the purpose of Databricks SQL for analysts and engineers working within the Databricks ecosystem?

A) Serving as a data warehousing solution

B) Providing graphic design tools

C) Managing social media campaigns

D) Offering fitness tracking features

A

A) Serving as a data warehousing solution
37
Q

What are common use cases for data engineers when working with Databricks SQL?

A) Generating random data samples

B) Designing mobile applications

C) Determining data quality

D) Writing machine learning algorithms

A

C) Determining data quality
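Simple data-quality probes of the kind run from Databricks SQL (table and column names are illustrative):

```sql
-- Count rows missing a required key
SELECT COUNT(*) AS missing_ids
FROM orders
WHERE order_id IS NULL;

-- Find duplicate keys
SELECT order_id, COUNT(*) AS dupes
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
```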