Get Started with Databricks for Data Engineering Flashcards

In this course, you will learn basic skills for using the Databricks Data Intelligence Platform to perform a simple data engineering workflow and support data warehousing endeavors. You will be given a tour of the workspace and shown how to work with objects in Databricks such as catalogs, schemas, volumes, tables, compute clusters, and notebooks. You will then follow a basic data engineering workflow to perform tasks such as creating and working with tables, ingesting data… (37 cards)

1
Q

Open-source storage framework for reading and writing data files in cloud storage.

A

Delta Lake

The default format for tables created in Databricks

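A quick way to see the default format in practice (the table name here is illustrative): creating a table without specifying a format produces a Delta table.

```sql
-- "USING DELTA" is the default in Databricks and can be omitted
CREATE TABLE demo_table (id INT, name STRING);

-- DESCRIBE DETAIL reports table metadata, including the format ("delta")
DESCRIBE DETAIL demo_table;
```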
2
Q

Either the entire transaction completes, or none of it does

A

Atomicity

3
Q

Data must satisfy defined rules, or the transaction is rolled back

A

Consistency

4
Q

Concurrent transactions behave as if each one completes before the next begins

A

Isolation

5
Q

Data is saved in a persistent state once completed

A

Durability

6
Q

Automatically adjusts the schema of your Delta table as your data changes

A

Schema Evolution

7
Q

Ensures that any data written to the Delta table matches the table’s defined schema

A

Schema Enforcement

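The two behaviors can be sketched together. Table and column names below are illustrative, and the `autoMerge` setting shown is one common way to opt in to schema evolution for MERGE statements:

```sql
-- Schema enforcement: this INSERT fails if the incoming columns
-- don't match the table's defined schema
INSERT INTO events SELECT id, ts FROM staging_events;

-- Schema evolution: opt in so new columns found in the source
-- are added to the target table during MERGE
SET spark.databricks.delta.schema.autoMerge.enabled = true;

MERGE INTO events t
USING staging_events s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```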
8
Q

Creates a table by selecting data from an existing table or data source

A

CREATE TABLE AS SELECT (CTAS)

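A minimal CTAS sketch (table and column names are illustrative): the new table's schema is inferred from the query, so no separate CREATE plus INSERT is needed.

```sql
-- Create and populate a table in one statement
CREATE TABLE sales_summary AS
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;
```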
9
Q

Provides a point-and-click interface to upload files and create tables

A

Upload UI

10
Q

Incrementally (streaming) processes new data files as they arrive in cloud storage

A

Auto Loader

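In Databricks SQL and pipelines, Auto Loader is surfaced through `read_files` on a streaming table; a sketch under that assumption (the volume path is a placeholder):

```sql
-- Streaming table that incrementally picks up new JSON files
-- as they land in the given cloud storage location
CREATE OR REFRESH STREAMING TABLE raw_events
AS SELECT *
FROM STREAM read_files('/Volumes/main/default/landing/events', format => 'json');
```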
11
Q

What level of the Medallion Architecture is this?

  • Dumping ground for raw data from external source systems
  • Often with long retention (years)
  • Data as it originally existed
A

Bronze

12
Q

What level of the Medallion Architecture is this?

  • Filter, cleanse, join and enrich the data
  • Define structure and enforce or evolve schema
  • Single source of truth
A

Silver

13
Q

What level of the Medallion Architecture is this?

  • Clean data, ready for consumption
  • Can be business-level aggregates of data
  • Delivered downstream to users and applications
A

Gold

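The three layers above can be sketched as a chain of tables (all table and column names are illustrative; the bronze table is assumed to hold raw ingested data):

```sql
-- Silver: filter, cleanse, and conform the raw bronze data
CREATE OR REPLACE TABLE silver_orders AS
SELECT DISTINCT order_id, CAST(amount AS DOUBLE) AS amount, order_date
FROM bronze_orders
WHERE order_id IS NOT NULL;

-- Gold: business-level aggregates, ready for downstream consumption
CREATE OR REPLACE TABLE gold_daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM silver_orders
GROUP BY order_date;
```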
14
Q

Which statement describes the Databricks workspace?

A) It is a classroom setup for running Databricks lessons and exercises.

B) It is a mechanism for cleaning up lesson-specific assets created during a learning session.

C) It is a set of predefined tables and path variables within Databricks.

D) It is a solution for organizing assets within Databricks

A

D) It is a solution for organizing assets within Databricks

15
Q

What assets can be accessed from and organized within the Databricks workspace?

A) Virtual machine configurations for clusters

B) Machine learning models and algorithms

C) Notebooks and files

D) Cloud storage accounts

A

C) Notebooks and files

16
Q

Which statement describes Databricks Repos?

A) A feature for scheduling and orchestrating data pipelines within Databricks

B) A tool for managing virtual environments and dependencies in Databricks

C) A capability centered around continuous integration of assets in Databricks and external Git repositories.

D) An integrated development environment (IDE) specifically designed for Databricks notebooks

A

C) A capability centered around continuous integration of assets in Databricks and external Git repositories.

17
Q

What is the basic cloud-based compute structure of Databricks?

A) Data Nodes

B) Data Warehouses

C) Databricks Clusters

D) Databricks Instances

A

C) Databricks Clusters

18
Q

As a Data Engineer, which of the following would you use to orchestrate data tasks?

A) Databricks AI Library

B) Databricks Academy

C) Spark MLlib

D) LakeFlow Jobs

A

D) LakeFlow Jobs

19
Q

How do clusters and warehouses differ in their roles?

A) Clusters are designed for data visualization, while SQL warehouses execute SQL queries

B) Clusters provide compute resources for running notebooks, while SQL warehouses work specifically with SQL queries

C) Clusters offer storage optimization, while SQL warehouses provide data replication

D) Clusters handle machine learning tasks, while SQL warehouses focus on data processing

A

B) Clusters provide compute resources for running notebooks, while SQL warehouses work specifically with SQL queries

20
Q

What are the high-level configuration options available when setting up a cluster?

A) Data Transformation Pipelines, Machine Learning Models, and Data Visualization

B) Data Replication, Disk Encryption, and Data Partitioning.

C) Autoscaling Options, Access Mode, and Cluster Name

D) Notebook Sharing, Version Control, and User Permissions.

A

C) Autoscaling Options, Access Mode, and Cluster Name

21
Q

What are the primary high-level configuration options available when setting up a warehouse?

A) Compute Cluster Size, Auto-stop Timer, and Scaling Parameters

B) Query Execution Speed, Access Mode, and Visualization Mode

C) Data Compression, Cluster Name, and Query Optimization.

D) Data Replication, Notebook Sharing, and Data Partitioning

A

A) Compute Cluster Size, Auto-stop Timer, and Scaling Parameters

22
Q

What are the benefits of using the available serverless compute features?

A) Cost efficiency, scalability, and simplified management

B) Enhanced query performance for all workloads.

C) Fixed and predetermined billing structure.

D) Manual adjustment of resource allocation.

A

A) Cost efficiency, scalability, and simplified management

23
Q

What is the primary interface used by data engineers when working with Databricks?

A) Visual Studio Code

B) Command Line Interface

C) Databricks Notebooks

D) Data Dashboards

A

C) Databricks Notebooks

24
Q

What are the common use cases for data engineers when working with Notebooks?

A) Writing Research Papers

B) Creating Mobile Apps

C) Data Exploration, Reporting, and Dashboarding

D) Playing Online Games

A

C) Data Exploration, Reporting, and Dashboarding

25
Q

How does Databricks store data?

A) Data is stored on physical servers

B) Data is stored on local computers

C) Data is stored in cloud object storage locations and accessed via Databricks

D) Data is stored in cloud-based web servers

A

C) Data is stored in cloud object storage locations and accessed via Databricks
26
Q

What are the benefits of data storage in the data lakehouse architecture across roles and Databricks services?

A) Increased code complexity for data engineers

B) Faster data visualization for analysts

C) Simplified ETL processing and guaranteed data integrity

D) Enhanced security for data scientists

A

C) Simplified ETL processing and guaranteed data integrity
27
Q

What is the optimized storage layer that serves as the foundation for data storage in a data lakehouse architecture?

A) Apache Spark

B) MongoDB

C) Delta Lake

D) Apache Parquet

A

C) Delta Lake
28
Q

What is the default table type for all tables in Databricks?

A) Delta tables

B) Temporary tables

C) CSV tables

D) External tables

A

A) Delta tables
29
Q

What does Delta Lake include to improve performance?

A) Built-in and easy optimizations

B) Data compression

C) External data sources

D) Real-time streaming

A

A) Built-in and easy optimizations
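Two of those built-in commands can be sketched as follows (table and column names are illustrative):

```sql
-- Compact small files and co-locate related data for faster reads
OPTIMIZE sales ZORDER BY (customer_id);

-- Remove data files no longer referenced by the table
VACUUM sales;
```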
30
Q

What is the purpose of Unity Catalog in Databricks?

A) Machine learning platform

B) Distributed storage system

C) Centralized governance solution

D) Real-time data processing

A

C) Centralized governance solution
31
Q

What is the structure of the three-tier namespace?

A) Source, Transform, Load

B) Database, Collection, File

C) Catalog, Schema, Table

D) Data, Analysis, Visualization

A

C) Catalog, Schema, Table
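The three-tier namespace in use (catalog, schema, and table names are illustrative):

```sql
-- Fully qualified reference: catalog.schema.table
SELECT * FROM main.sales.orders;

-- Or set defaults first, then use the bare table name
USE CATALOG main;
USE SCHEMA sales;
SELECT * FROM orders;
```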
32
Q

What is the purpose of LakeFlow Jobs?

A) To monitor real-time data streams

B) To visualize data pipelines graphically

C) To create interactive notebooks for data analysis

D) To automate and orchestrate data workflows

A

D) To automate and orchestrate data workflows
33
Q

What is the primary purpose of LakeFlow Jobs?

A) Collaborative data analysis and exploration

B) Scheduling and automating tasks

C) Managing data pipelines and ETL processes

D) Enabling complex data transformations

A

B) Scheduling and automating tasks
34
Q

Which of the following types of assets can be automated using LakeFlow Jobs?

A) Partner integrations

B) BI Connectors

C) MLFlow

D) Notebooks, ETL pipelines, and ML model training

A

D) Notebooks, ETL pipelines, and ML model training
35
Q

What solution is designed for building and running robust data pipelines?

A) Delta Live Systems

B) Delta Live Streams

C) DLT

D) Delta Live Networks

A

C) DLT
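A minimal DLT-style sketch with a data-quality expectation (table, constraint, and column names are illustrative):

```sql
-- Rows failing the expectation are dropped rather than loaded
CREATE OR REFRESH STREAMING TABLE clean_orders (
  CONSTRAINT valid_order EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(LIVE.raw_orders);
```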
36
Q

What is the purpose of Databricks SQL for analysts and engineers working within the Databricks ecosystem?

A) Serving as a data warehousing solution

B) Providing graphic design tools

C) Managing social media campaigns

D) Offering fitness tracking features

A

A) Serving as a data warehousing solution
37
Q

What are common use cases for data engineers when working with Databricks SQL?

A) Generating random data samples

B) Designing mobile applications

C) Determining data quality

D) Writing machine learning algorithms

A

C) Determining data quality
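Simple data-quality probes of the kind run from Databricks SQL (table and column names are illustrative):

```sql
-- Count rows missing a required key
SELECT COUNT(*) AS missing_ids
FROM orders
WHERE order_id IS NULL;

-- Find duplicate keys
SELECT order_id, COUNT(*) AS dupes
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
```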