Orchestration & Workflow Design Flashcards by O Cam

What is orchestration in data engineering?

Coordinating the execution order, dependencies, scheduling, and error handling of multiple data processing tasks and pipelines.

How does orchestration differ from simple job scheduling?

Scheduling runs jobs at specific times, while orchestration manages dependencies, conditional logic, retries, and end-to-end workflows.

What is a DAG (Directed Acyclic Graph) in workflow tools?

A graph where nodes are tasks and edges represent dependencies, with no cycles, defining a valid execution order.
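
As a minimal sketch of this idea, the following uses Python's standard-library graphlib module to derive one valid execution order and to reject a graph that contains a cycle. The task names and the execution_order helper are hypothetical, chosen only for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
}


def execution_order(deps):
    """Return one valid execution order, or raise ValueError on a cycle."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        # The second element of args holds the tasks that form the cycle.
        raise ValueError(f"not a DAG, cycle found: {exc.args[1]}") from exc


order = execution_order(dependencies)
print(order)
```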

Why are DAGs used to represent data pipelines?

They clearly encode dependencies and prevent circular flows, allowing the orchestrator to determine safe parallel execution and retries.
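
To illustrate how an orchestrator might find tasks that can safely run in parallel, this sketch groups a DAG into batches whose members have no unfinished upstream tasks. It relies on graphlib.TopologicalSorter from the standard library; the task names and the parallel_batches helper are assumptions made for the example, and a real orchestrator would dispatch each batch to workers rather than just list it.

```python
from graphlib import TopologicalSorter


def parallel_batches(deps):
    """Group tasks into batches; tasks within a batch do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose upstreams are done
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock downstream tasks
    return batches


# Hypothetical pipeline: two independent loads feed a join.
deps = {
    "load_orders": set(),
    "load_users": set(),
    "join": {"load_orders", "load_users"},
    "publish": {"join"},
}
batches = parallel_batches(deps)
print(batches)
```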

What is a task in the context of orchestration?

A unit of work, such as a script invocation, SQL query, or API call, that the orchestrator can schedule and monitor.

What is a dependency between tasks?

A rule that one task must complete successfully before another can start, enforcing correct ordering of steps.

What is the difference between schedule-based and event-based triggering?

Schedule-based triggering runs tasks at defined times, while event-based triggering runs tasks when specific events occur, such as file arrivals or upstream completions.

Why are retries important in orchestrated pipelines?

Transient failures such as network issues or temporary service unavailability can be resolved by rerunning tasks without manual intervention.

What is exponential backoff for retries?

A retry strategy that increases the delay between retries after each failure, reducing pressure on failing systems.
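
A rough sketch of the strategy, assuming the delay doubles after each failed attempt and is capped at a maximum. The function name, its parameters, and the small random jitter are illustrative choices, not taken from any particular orchestrator.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call func until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # 1s, 2s, 4s, ... capped at max_delay
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter so many failing tasks do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The sleep parameter exists so the waiting can be replaced in tests; in normal use a call such as retry_with_backoff(fetch_data) would be enough, where fetch_data stands for any callable that may fail transiently.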

What is a backfill in the context of orchestration?

Running a pipeline over historical date ranges to populate or repair data for past periods.
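
One simple way to picture a backfill is a loop that calls the pipeline once per historical day. In this sketch, daily_partitions, backfill, and run_for_date are hypothetical names; run_for_date stands in for whatever processes a single day's data.

```python
from datetime import date, timedelta


def daily_partitions(start, end):
    """Yield each date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def backfill(start, end, run_for_date):
    """Run the pipeline once per historical day, oldest first."""
    return [run_for_date(day) for day in daily_partitions(start, end)]


# Stand-in pipeline run that only reports which day it handled.
processed = backfill(date(2024, 1, 30), date(2024, 2, 1),
                     lambda day: day.isoformat())
print(processed)
```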

Why must backfills be handled carefully?

They can generate heavy load, interfere with regular runs, and require idempotent logic to avoid duplicating data.

What is a workflow’s SLA (Service Level Agreement)?

An agreed target for freshness or completion time, such as having daily data ready by 7 AM each morning.

Why are SLAs important for orchestration design?

They influence scheduling, resource allocation, alerting, and prioritization of jobs to meet business deadlines.

What is the critical path in a DAG?

The longest chain of dependent tasks, measured by total duration, which sets the minimum possible completion time of the workflow.
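
A small sketch of how the critical path could be computed, assuming each task has a known duration and unlimited parallelism is available. The critical_path helper, the task names, and the durations (in minutes) are invented for the example.

```python
from graphlib import TopologicalSorter


def critical_path(durations, deps):
    """Return (tasks on the critical path, total duration of that path)."""
    finish = {}    # earliest finish time of each task
    previous = {}  # the upstream task that finishes last, if any
    for task in TopologicalSorter(deps).static_order():
        upstream = deps.get(task, ())
        start = max((finish[u] for u in upstream), default=0)
        previous[task] = max(upstream, key=lambda u: finish[u], default=None)
        finish[task] = start + durations[task]
    # Walk back from the task that finishes last.
    task = max(finish, key=finish.get)
    path = []
    while task is not None:
        path.append(task)
        task = previous[task]
    return list(reversed(path)), max(finish.values())


durations = {"extract": 10, "clean": 5, "enrich": 20, "report": 3}
deps = {
    "extract": set(),
    "clean": {"extract"},
    "enrich": {"extract"},
    "report": {"clean", "enrich"},
}
path, total = critical_path(durations, deps)
print(path, total)
```

Here "clean" is off the critical path, so making it faster would not change the 33-minute total.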

Why is identifying the critical path useful?

Speeding up tasks on the critical path has the greatest effect on end-to-end latency and SLA adherence.

What is a sensor task or dependency check?

A task that waits for a condition to be met, such as file presence or upstream system completion, before allowing downstream tasks to run.
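
A sensor can be approximated by a polling loop with a timeout. The sketch below is a generic version, not the API of any specific orchestrator; wait_for and its parameters are assumed names, and the clock and sleep arguments exist only so the loop can be tested without real waiting.

```python
import time


def wait_for(condition, timeout=60.0, poke_interval=5.0,
             clock=time.monotonic, sleep=time.sleep):
    """Poll condition() until it returns True or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        if condition():
            return True   # condition met: downstream tasks may run
        if clock() >= deadline:
            return False  # timed out: treat as a sensor failure
        sleep(poke_interval)
```

A file-arrival check could then be written as wait_for(lambda: Path("incoming/orders.csv").exists()), where the path is purely illustrative and Path comes from pathlib.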

What is a failure handler or on-failure hook?

Logic that runs when a task or workflow fails, such as sending alerts, triggering compensating actions, or creating tickets.
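
In its simplest form, an on-failure hook is a callback invoked before the error propagates. The names run_with_failure_hook and send_alert below are hypothetical; a real hook would more likely page someone or open a ticket than print.

```python
def run_with_failure_hook(task_name, func, on_failure):
    """Run func; on any exception, call the hook and then re-raise."""
    try:
        return func()
    except Exception as exc:
        on_failure(task_name, exc)
        raise  # keep the task marked as failed after the hook has run


def send_alert(task_name, exc):
    # Stand-in for a real notification channel.
    print(f"ALERT: task {task_name!r} failed: {exc}")
```

Re-raising after the hook matters: the orchestrator still needs to see the failure so that retries and downstream skipping behave as usual.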

Why is idempotency critical for orchestrated tasks?

Orchestration systems may rerun tasks after failures or during backfills; idempotent tasks ensure that repeated runs do not corrupt or duplicate data.
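
One common way to make a load idempotent is to replace a whole partition inside a single transaction instead of appending to it. The sketch below uses an in-memory SQLite database; the sales table and the load_partition helper are made up for illustration.

```python
import sqlite3


def load_partition(conn, day, amounts):
    """Replace all rows for one day so a rerun leaves the same end state."""
    with conn:  # commits on success, rolls back if an error is raised
        conn.execute("DELETE FROM sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO sales (day, amount) VALUES (?, ?)",
            [(day, amount) for amount in amounts],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, amount REAL)")

# Running the same load twice, as a retry or backfill might, does not duplicate rows.
load_partition(conn, "2024-01-01", [10.0, 20.0])
load_partition(conn, "2024-01-01", [10.0, 20.0])
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(row_count)
```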

What metadata should be logged for each workflow run?

Start and end times, status, parameters, input ranges, row counts, errors, and upstream/downstream relationships.

Why is parameterization important in workflows?

It allows the same DAG and code to run for different dates, environments, or customers without duplicating logic.
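
As a small illustration, run parameters can be gathered into one object that every task receives, so nothing is hardcoded per date or environment. RunParams, output_path, and the bucket naming scheme are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RunParams:
    run_date: date
    environment: str  # e.g. "dev", "test", or "prod"
    customer: str


def output_path(params):
    """Derive the output location from the parameters alone (made-up layout)."""
    return (
        f"s3://example-{params.environment}/orders/"
        f"customer={params.customer}/date={params.run_date.isoformat()}/"
    )


# The same code serves a production run and a dev run for another customer.
prod_path = output_path(RunParams(date(2024, 5, 1), "prod", "acme"))
dev_path = output_path(RunParams(date(2024, 5, 1), "dev", "globex"))
print(prod_path)
print(dev_path)
```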

What is environment separation in orchestration (dev/test/prod)?

Running pipelines in distinct environments or accounts to test changes safely before they affect production data.

Why should deployments of orchestration code be version-controlled?

Version control enables rollback, code review, and clear tracking of which version introduced a change or incident.

What is a blue/green or canary deployment in pipeline changes?

Deploying new versions alongside existing ones, routing a subset of traffic or data to them before full cutover, to reduce risk.

What is dependency inversion at the orchestration level?

Designing pipelines so they depend on stable interfaces or contracts between stages rather than hardcoded details of internal implementations.

Why is it useful to break large workflows into smaller sub-DAGs or modular pipelines?

Smaller workflows are easier to understand, test, reuse, and operate, and can be composed into larger end-to-end processes.

What is a common anti-pattern in orchestration design?

Placing too much logic directly in the orchestrator (e.g., large scripts or SQL embedded inline) instead of calling tested, versioned code modules.

Why should the orchestrator be treated as a coordinator, not a compute engine?

Orchestrators should trigger external compute (warehouses, Spark jobs) and manage state rather than doing heavy data processing themselves.

What is a cron expression used for?

Defining recurring schedules in a compact five-field syntax (minute, hour, day of month, month, day of week), such as 'every weekday at 2 AM', written 0 2 * * 1-5.
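
To make the five-field layout concrete, here is a deliberately simplified matcher that checks whether a given moment satisfies an expression such as 0 2 * * 1-5 (every weekday at 2 AM). It is a sketch, not a full cron implementation: it handles only *, single numbers, ranges, and comma lists, ignores step values and names, and requires every field to match, whereas standard cron treats day-of-month and day-of-week as alternatives when both are restricted. The function names are assumptions.

```python
from datetime import datetime


def _field_matches(field, value):
    """Match one cron field that uses only '*', numbers, ranges, and lists."""
    for part in field.split(","):
        if part == "*":
            return True
        if "-" in part:
            low, high = part.split("-")
            if int(low) <= value <= int(high):
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expression, moment):
    """Return True if moment falls on the schedule described by expression."""
    minute, hour, day_of_month, month, day_of_week = expression.split()
    # Python counts Monday as 0; cron counts Sunday as 0 and Monday as 1.
    cron_weekday = (moment.weekday() + 1) % 7
    return (
        _field_matches(minute, moment.minute)
        and _field_matches(hour, moment.hour)
        and _field_matches(day_of_month, moment.day)
        and _field_matches(month, moment.month)
        and _field_matches(day_of_week, cron_weekday)
    )


weekday_2am = "0 2 * * 1-5"
print(cron_matches(weekday_2am, datetime(2024, 3, 11, 2, 0)))  # a Monday
print(cron_matches(weekday_2am, datetime(2024, 3, 10, 2, 0)))  # a Sunday
```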

Why must timezone considerations be explicit in scheduling?

Daylight saving time and regional differences can cause jobs to run at unexpected times if timezones are not clearly defined.
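
The effect is easy to see by converting the same local run time to UTC on either side of a clock change. This sketch uses the standard-library zoneinfo module, which needs a timezone database on the system or the tzdata package; local_run_in_utc is a hypothetical helper, and New York is just one example region (its clocks moved forward on 10 March 2024).

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def local_run_in_utc(year, month, day, hour, tz_name):
    """Convert a local wall-clock run time to the equivalent UTC time."""
    local = datetime(year, month, day, hour, tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc)


# The same "2 AM New York" schedule, the day before and the day after the change.
before = local_run_in_utc(2024, 3, 9, 2, "America/New_York")
after = local_run_in_utc(2024, 3, 11, 2, "America/New_York")
print(before.isoformat())
print(after.isoformat())
```

The job fires at 07:00 UTC before the change and at 06:00 UTC after it, which is why a schedule should state its timezone explicitly instead of relying on the server default.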

What is monitoring in orchestration?

Tracking the status, duration, and failures of tasks and workflows over time to detect anomalies and performance issues.

What is alerting in orchestration?

Notifying humans or systems when workflows violate SLAs, fail, or exhibit abnormal behavior, so issues can be investigated promptly.

Why is it useful to have success notifications in some cases?

They confirm that critical workflows completed as expected and can be used to trigger downstream processes or communication.

What is the role of tagging or labeling workflows and tasks?

Tags help group and filter workflows by team, domain, environment, or priority for reporting and management.

Why should orchestrated workflows be tested with realistic data and scenarios?

To ensure that dependencies, error handling, and performance behave correctly under real-world conditions, not just toy examples.

What is the difference between orchestration and choreography in distributed systems?

Orchestration uses a central controller to manage interactions, while choreography relies on decentralized components reacting to events and contracts.

When might an event-driven, choreographed architecture be preferable?

When systems need to be loosely coupled, independently deployable, and responsive to events without a central orchestrator.

Why do many organizations use both orchestration and choreography?

Orchestration suits complex, tightly controlled workflows; choreography suits scalable, decentralized integrations; both patterns often coexist.

What is a good one-sentence mental model for orchestration in data engineering?

Use a central, versioned DAG to coordinate small, idempotent tasks that move and transform data in a predictable, observable, and SLA-respecting way.

Orchestration & Workflow Design Flashcards

(38 cards)