Databricks Interview Prep - Unity Catalog & Governance Flashcards

(25 cards)

1
Q

What is Unity Catalog?

A

Unity Catalog is a centralized governance layer for data and AI assets:
* Manages access control across workspaces
* Provides data lineage
* Enables fine-grained security (row/column level)
👉 It replaces the legacy Hive Metastore with enterprise-grade governance.

2
Q

What is the object hierarchy in Unity Catalog?

A

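Metastore → Catalog → Schema → Table/View (schemas can also hold volumes, functions, and models)
Example:
catalog = finance
schema = transactions
table = payments
Referenced as: finance.transactions.payments
👉 This three-level namespace supports multi-tenant environments.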
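A minimal Python sketch of how the three levels combine into one table reference. The helper name `qualified_name` is hypothetical and only builds a string; it does not call any Databricks API.

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Join the three levels into a backtick-quoted, dot-separated name."""
    parts = (catalog, schema, table)
    for part in parts:
        # Reject empty names and names that would break the quoting.
        if not part or "`" in part:
            raise ValueError(f"invalid identifier: {part!r}")
    return ".".join(f"`{part}`" for part in parts)


name = qualified_name("finance", "transactions", "payments")
query = f"SELECT * FROM {name} LIMIT 10"
print(query)
```

The same pattern applies to any object in a schema: catalog first, then schema, then the object name.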

3
Q

What are the key differences between Unity Catalog and Hive Metastore?

A

Unity Catalog:
* Centralized across workspaces
* Fine-grained access control
* Built-in lineage

Hive Metastore:
* Workspace-level
* Limited governance
* No native lineage
👉 Unity Catalog is designed for enterprise governance at scale.

4
Q

What types of access control does Unity Catalog support?

A

* Object-level privileges (catalog, schema, table, view)
* Column-level (column masks)
* Row-level (row filters)
👉 Enables:
* Data masking
* Compliance (PII protection)

5
Q

How does row-level security work in Unity Catalog?

A
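  • Applies filters based on user identity or group membership
  • Users see only permitted rows
  • Typically implemented as a row filter (a SQL function attached to the table) or a dynamic view
👉 Example:
  • A sales manager sees only the rows for their own region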
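The filtering idea can be simulated in plain Python. This is a sketch of the logic only, not the Unity Catalog API; `USER_REGIONS`, `SALES`, and `visible_rows` are hypothetical names, and the users and rows are made up for illustration.

```python
# Hypothetical mapping from user identity to the region that user may see.
USER_REGIONS = {"alice@example.com": "EMEA", "bob@example.com": "APAC"}

SALES = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0},
    {"order_id": 2, "region": "APAC", "amount": 75.5},
    {"order_id": 3, "region": "EMEA", "amount": 40.0},
]


def visible_rows(user, rows):
    """Return only the rows whose region matches the user's assigned region."""
    region = USER_REGIONS.get(user)
    if region is None:
        return []  # unknown users see nothing (deny by default)
    return [row for row in rows if row["region"] == region]


print([row["order_id"] for row in visible_rows("alice@example.com", SALES)])
```

Denying by default for unknown users mirrors the least-privilege stance: a missing mapping should never widen access.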
6
Q

How is column-level security implemented?

A
  • Mask or restrict access to specific columns
  • Typically implemented as a column mask (a SQL function attached to the column) or a dynamic view
👉 Example:
  • Hide salary or SSN from unauthorized users
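A small Python simulation of restricting columns for unauthorized readers. It illustrates the behavior only; `RESTRICTED_COLUMNS` and `project_row` are hypothetical names, and the employee record is invented.

```python
RESTRICTED_COLUMNS = {"salary", "ssn"}  # hypothetical sensitive columns


def project_row(row, is_authorized):
    """Null out restricted column values unless the caller is authorized."""
    if is_authorized:
        return dict(row)
    return {
        key: (None if key in RESTRICTED_COLUMNS else value)
        for key, value in row.items()
    }


employee = {"name": "Dana", "department": "Ops", "salary": 98000, "ssn": "123-45-6789"}
print(project_row(employee, is_authorized=False))
```

Returning `None` rather than dropping the key keeps the row shape stable, so downstream queries do not break when a reader lacks access.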
7
Q

What is data lineage and why is it important?

A

Tracks:
* Data origin
* Transformations
* Downstream dependencies

👉 Helps with:
* Debugging
* Impact analysis
* Compliance audits
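Impact analysis amounts to walking the lineage graph downstream from a changed table. A minimal sketch, assuming lineage is available as a simple adjacency map; the `DOWNSTREAM` edges and table names are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each table maps to the tables built from it.
DOWNSTREAM = {
    "raw.payments": ["silver.payments_clean"],
    "silver.payments_clean": ["gold.daily_revenue", "gold.fraud_features"],
    "gold.daily_revenue": ["reports.finance_dashboard"],
}


def impacted_by(table):
    """Breadth-first walk returning every table downstream of `table`."""
    seen, queue, order = set(), deque([table]), []
    while queue:
        current = queue.popleft()
        for child in DOWNSTREAM.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(impacted_by("silver.payments_clean"))
```

Anything in the returned list should be checked before changing the schema or logic of the source table.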

8
Q

What is an external location in Unity Catalog?

A
  • A securable object that pairs a cloud storage path (e.g., ADLS, S3) with a storage credential
👉 Controls access to:
  • Files
  • Tables stored outside managed storage
9
Q

What are storage credentials in Unity Catalog?

A
  • A securable object that wraps a cloud identity (e.g., an IAM role or managed identity) used to authenticate to cloud storage
👉 Separates:
  • Access control from compute
  • Credentials from code
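A sketch of how a storage credential is tied to a storage path through an external location. The Python helper below only builds the SQL text and executes nothing; the helper name, the location name, the bucket URL, and the credential name are all hypothetical, and the statement shape should be checked against the current Databricks SQL reference before use.

```python
def create_external_location_sql(name, url, credential):
    """Build a CREATE EXTERNAL LOCATION statement (string only, nothing is executed)."""
    for value in (name, credential):
        # Allow only letters, digits, and underscores in identifiers.
        if not value.replace("_", "").isalnum():
            raise ValueError(f"unexpected identifier: {value!r}")
    if "'" in url:
        raise ValueError("URL must not contain single quotes")
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {name} "
        f"URL '{url}' "
        f"WITH (STORAGE CREDENTIAL {credential})"
    )


stmt = create_external_location_sql(
    "finance_landing", "s3://example-bucket/landing", "finance_cred"
)
print(stmt)
```

Note that the code never sees a secret: it refers to the credential by name, which is the separation of credentials from code described above.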
10
Q

What is the difference between managed and external tables?

A
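  • Managed: Databricks (Unity Catalog) controls both data and metadata; dropping the table also removes the underlying data
  • External: data lives at a storage path you specify and only the metadata is managed; dropping the table leaves the files in place
👉 Managed = easier
👉 External = more control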
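In DDL terms the difference usually comes down to whether a `LOCATION` clause is present. A minimal sketch that builds both statements as strings; `create_table_sql`, the column list, and the bucket path are hypothetical.

```python
def create_table_sql(full_name, columns, location=None):
    """Build a CREATE TABLE statement; a LOCATION clause makes the table external."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    stmt = f"CREATE TABLE {full_name} ({cols})"
    if location is not None:
        stmt += f" LOCATION '{location}'"
    return stmt


columns = [("id", "BIGINT"), ("amount", "DOUBLE")]
managed = create_table_sql("finance.transactions.payments", columns)
external = create_table_sql(
    "finance.transactions.payments_ext", columns, "s3://example-bucket/payments"
)
print(managed)
print(external)
```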
11
Q

How does Unity Catalog enable data sharing?

A
  • Secure sharing across workspaces or organizations
  • No need to duplicate data
👉 Often uses Delta Sharing.
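A rough sketch of the usual Delta Sharing flow: create a share, add a table to it, create a recipient, then grant the recipient read access. The helper only assembles SQL strings; `sharing_statements`, the share name, and the recipient name are hypothetical, and the exact statement options should be confirmed in the Databricks SQL reference.

```python
def sharing_statements(share, table, recipient):
    """Return the SQL statements for a basic share-and-grant flow (strings only)."""
    return [
        f"CREATE SHARE IF NOT EXISTS {share}",
        f"ALTER SHARE {share} ADD TABLE {table}",
        f"CREATE RECIPIENT IF NOT EXISTS {recipient}",
        f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}",
    ]


steps = sharing_statements(
    "finance_share", "finance.transactions.payments", "partner_org"
)
for step in steps:
    print(step)
```

The recipient reads the table in place, which is why no copy of the data is needed.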
12
Q

What is the principle of least privilege and why is it important?

A
  • Users get only the access they need
👉 Reduces:
  • Security risks
  • Accidental data exposure
13
Q

How is RBAC implemented in Unity Catalog?

A
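  • Grant privileges to groups, which act as roles (Unity Catalog principals are users, groups, and service principals)
  • Add users to the appropriate groups instead of granting to each user
👉 Easier to manage than user-level permissions.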
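A minimal sketch of group-based grants for read access to one table. Reading a table generally requires a privilege at each level of the hierarchy, so the helper emits three statements. It only builds strings; `grants_for_reader` and the group name are hypothetical.

```python
def grants_for_reader(group, catalog, schema, table):
    """Return the GRANT statements a group needs to read a single table."""
    principal = f"`{group}`"  # backticks allow names with spaces or dashes
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {principal}",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO {principal}",
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO {principal}",
    ]


grants = grants_for_reader("finance-analysts", "finance", "transactions", "payments")
for grant in grants:
    print(grant)
```

Onboarding a new analyst then means adding them to the group, with no new grants to write.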
14
Q

What is data masking and when should it be used?

A
  • Obscures sensitive data
👉 Example:
  • Show only the last 4 digits of a credit card number
Used for:
  • Compliance (GDPR, HIPAA)
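The last-4-digits example can be sketched as a small Python function. It is an illustration of the masking rule, not a Databricks API; `mask_card_number` is a hypothetical name and the card number is a made-up test value.

```python
def mask_card_number(card_number):
    """Keep the last 4 digits and replace every other digit with '*'."""
    digits = [ch for ch in card_number if ch.isdigit()]
    if len(digits) <= 4:
        # Too short to reveal anything safely, so mask everything.
        return "*" * len(digits)
    return "*" * (len(digits) - 4) + "".join(digits[-4:])


print(mask_card_number("4111 1111 1111 1234"))
```

Spaces and dashes are dropped before masking, so the output length depends only on the number of digits.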
15
Q

How does Unity Catalog handle multiple workspaces?

A
  • Central governance across all workspaces attached to the same metastore
👉 Ensures:
  • Consistent policies
  • No duplication of rules
16
Q

Why is auditing important in data governance?

A
  • Tracks who accessed what data and when
👉 Required for:
  • Compliance
  • Security monitoring
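A minimal sketch of answering "who accessed this table, and when" from audit records. The record layout, `AUDIT_LOG`, and `accesses` are hypothetical; real audit logs have a richer schema.

```python
from datetime import datetime

# Hypothetical audit records: who did what to which table, and when.
AUDIT_LOG = [
    {"user": "alice@example.com", "table": "finance.transactions.payments",
     "action": "SELECT", "at": datetime(2024, 5, 1, 9, 30)},
    {"user": "bob@example.com", "table": "finance.transactions.payments",
     "action": "SELECT", "at": datetime(2024, 5, 3, 14, 0)},
    {"user": "alice@example.com", "table": "finance.transactions.refunds",
     "action": "SELECT", "at": datetime(2024, 5, 4, 8, 15)},
]


def accesses(table, since):
    """Return (user, action, timestamp) for events on `table` at or after `since`."""
    return [
        (event["user"], event["action"], event["at"])
        for event in AUDIT_LOG
        if event["table"] == table and event["at"] >= since
    ]


recent = accesses("finance.transactions.payments", datetime(2024, 5, 2))
print(recent)
```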
17
Q

Why is defining data ownership important?

A
  • Ensures accountability
  • Improves data quality
  • Clarifies responsibility
👉 Each dataset should have a clear owner.
18
Q

What is the tradeoff between governance and flexibility?

A
  • Strong governance → secure but restrictive
  • High flexibility → agile but risky
👉 Balance is key in enterprise environments.
19
Q

How would you secure PII data in Databricks?

A
  • Use column-level masking
  • Apply row-level filters
  • Restrict access via RBAC
👉 Combine multiple controls for strong security.
20
Q

Why is it important to understand data access patterns?

A
  • Helps define permissions
  • Improves performance
  • Avoids overexposure
👉 Governance should align with usage.
21
Q

How do you enforce governance in data pipelines?

A
  • Apply permissions at table level
  • Validate data quality
  • Track lineage
👉 Governance is not just access control; it is end-to-end.
22
Q

What are common mistakes in data governance?

A
  • Over-permissioning users
  • No data ownership
  • Ignoring lineage
  • Poor documentation
👉 Leads to “data chaos”.
23
Q

How does Unity Catalog integrate with Delta Lake?

A
  • Delta provides data reliability
  • Unity Catalog provides governance
👉 Together:
  • Reliable + secure data platform
24
Q

How would you handle a request for sensitive data access?

A
  • Validate business need
  • Apply least privilege
  • Grant temporary or scoped access
👉 Always audit and monitor usage.
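Temporary access needs an expiry check somewhere in the process. A minimal sketch, assuming the expiry is tracked outside Unity Catalog (for example by a scheduled job that revokes lapsed grants); `is_grant_active` and the dates are hypothetical.

```python
from datetime import datetime, timedelta


def is_grant_active(granted_at, ttl, now):
    """True while `now` falls inside the window [granted_at, granted_at + ttl)."""
    return granted_at <= now < granted_at + ttl


granted_at = datetime(2024, 5, 1)
ttl = timedelta(days=7)

print(is_grant_active(granted_at, ttl, datetime(2024, 5, 5)))
print(is_grant_active(granted_at, ttl, datetime(2024, 5, 8)))
```

Once the check returns false, the process would revoke the grant and record the revocation for audit.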
25
Q

What does a good governance strategy look like in Databricks?

A
  • Centralized control (Unity Catalog)
  • Fine-grained permissions
  • Data lineage tracking
  • Clear ownership
👉 Goal:
  • Secure + scalable + compliant data platform