What is orchestration in data engineering?
Coordinating the execution order, dependencies, scheduling, and error handling of multiple data processing tasks and pipelines.
How does orchestration differ from simple job scheduling?
Scheduling runs jobs at specific times, while orchestration manages dependencies, conditional logic, retries, and end-to-end workflows.
What is a DAG (Directed Acyclic Graph) in workflow tools?
A graph where nodes are tasks and edges represent dependencies, with no cycles, so that at least one valid execution order always exists.
Why are DAGs used to represent data pipelines?
They clearly encode dependencies and prevent circular flows, allowing the orchestrator to determine safe parallel execution and retries.
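As a rough illustration of how an orchestrator might derive execution order from a DAG, the sketch below uses Python's standard `graphlib` module. The task names and the `execution_batches` helper are hypothetical; real orchestrators add scheduling, state, and retries on top of this idea.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join": {"extract_orders", "extract_customers"},
    "load": {"join"},
}

def execution_batches(deps):
    """Group tasks into batches; tasks in one batch could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the graph contains a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all tasks whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)
    return batches

batches = execution_batches(dependencies)
```

Here the two extract tasks land in the same batch because neither depends on the other, while a circular definition is rejected before anything runs.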
What is a task in the context of orchestration?
A unit of work, such as a script invocation, SQL query, or API call, that the orchestrator can schedule and monitor.
What is a dependency between tasks?
A rule that one task must complete successfully before another can start, enforcing correct ordering of steps.
What is the difference between schedule-based and event-based triggering?
Schedule-based triggering runs tasks at defined times, while event-based triggering runs tasks when specific events occur, such as file arrivals or upstream completions.
Why are retries important in orchestrated pipelines?
Transient failures such as network issues or temporary service unavailability can be resolved by rerunning tasks without manual intervention.
What is exponential backoff for retries?
A retry strategy that multiplies the delay between attempts after each failure (for example 1 s, 2 s, 4 s), usually up to a maximum delay and often with random jitter, reducing pressure on failing systems.
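A minimal sketch of retries with exponential backoff, assuming a doubling factor, a 60-second cap, and optional jitter. The function names and default values are illustrative choices, not taken from any particular orchestrator.

```python
import random
import time

def backoff_delays(retries, base=1.0, factor=2.0, cap=60.0):
    """Delay before each retry: base * factor**attempt, capped at `cap` seconds."""
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]

def run_with_retries(task, retries=3, base=1.0, jitter=0.0, sleep=time.sleep):
    """Call `task`; on failure wait with exponential backoff, then try again."""
    delays = backoff_delays(retries, base=base)
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Jitter spreads out retries from many tasks failing at once.
            sleep(delays[attempt] + random.uniform(0, jitter))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.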
What is a backfill in the context of orchestration?
Running a pipeline over historical date ranges to populate or repair data for past periods.
Why must backfills be handled carefully?
They can generate heavy load, interfere with regular runs, and require idempotent logic to avoid duplicating data.
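One way a backfill loop might look, sketched under the assumption that the pipeline is partitioned by day. `run_for_date` is a hypothetical callable for a single day's run, and `max_runs` is an illustrative cap to limit load.

```python
from datetime import date, timedelta

def backfill_dates(start, end):
    """Yield each logical date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)

def run_backfill(run_for_date, start, end, max_runs=None):
    """Run the pipeline once per date, oldest first, optionally capping the count."""
    done = []
    for logical_date in backfill_dates(start, end):
        if max_runs is not None and len(done) >= max_runs:
            break  # stop early so the backfill does not swamp regular runs
        run_for_date(logical_date)
        done.append(logical_date)
    return done
```

If `run_for_date` overwrites the partition for its date rather than appending to it, repeating the same backfill leaves the data unchanged.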
What is a workflow’s SLA (Service Level Agreement)?
An agreed target for freshness or completion time, such as having daily data ready by 7 AM each morning.
Why are SLAs important for orchestration design?
They influence scheduling, resource allocation, alerting, and prioritization of jobs to meet business deadlines.
What is the critical path in a DAG?
The chain of dependent tasks with the longest total duration, which sets the minimum possible completion time of the workflow.
Why is identifying the critical path useful?
Speeding up tasks on the critical path has the greatest impact on end-to-end latency and SLA adherence, whereas speeding up tasks off the critical path does not shorten the workflow at all.
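The critical path can be computed as the longest-duration path through the DAG. The sketch below assumes hypothetical task names and durations in minutes, and ignores resource limits and queueing delays.

```python
from functools import lru_cache

# Hypothetical task durations (minutes) and upstream dependencies.
durations = {"extract": 10, "clean": 5, "enrich": 20, "aggregate": 15, "publish": 2}
upstream = {
    "extract": [],
    "clean": ["extract"],
    "enrich": ["extract"],
    "aggregate": ["clean", "enrich"],
    "publish": ["aggregate"],
}

def critical_path(durations, upstream):
    """Return (total_duration, path) for the longest-duration dependency chain."""
    @lru_cache(maxsize=None)
    def longest_to(task):
        # Longest chain ending at `task`: its duration plus the longest upstream chain.
        best = max((longest_to(u) for u in upstream[task]), default=(0, ()))
        return best[0] + durations[task], best[1] + (task,)
    total, path = max(longest_to(t) for t in durations)
    return total, list(path)

total, path = critical_path(durations, upstream)
```

In this example `clean` is off the critical path, so making it faster would not shorten the run; shortening `enrich` would.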
What is a sensor task or dependency check?
A task that waits for a condition to be met, such as file presence or upstream system completion, before allowing downstream tasks to run.
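A sensor can be approximated as a polling loop with a timeout. In this sketch `wait_for`, `timeout`, and `poke_interval` are illustrative names, and the condition is any zero-argument callable returning a boolean.

```python
import time

def wait_for(condition, timeout=60.0, poke_interval=5.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False  # caller decides whether to fail or skip downstream tasks
        sleep(poke_interval)
```

For a file-arrival check the condition could be something like `lambda: os.path.exists(path)`, with `path` supplied by the pipeline.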
What is a failure handler or on-failure hook?
Logic that runs when a task or workflow fails, such as sending alerts, triggering compensating actions, or creating tickets.
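A failure hook can be as simple as a callback invoked with context about the failed task. The `run_task` wrapper and the shape of the context dictionary below are assumptions made for illustration.

```python
def run_task(name, fn, on_failure=None):
    """Run `fn`; if it raises, call the on_failure hook with context, then re-raise."""
    try:
        return fn()
    except Exception as exc:
        if on_failure is not None:
            # The hook might send an alert, open a ticket, or start a cleanup step.
            on_failure({"task": name, "error": repr(exc)})
        raise  # the run is still marked failed after the hook has fired
```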
Why is idempotency critical for orchestrated tasks?
Orchestration systems may rerun tasks after failures or during backfills; idempotent tasks ensure that repeated runs do not duplicate or corrupt data.
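One common way to make a load step idempotent is to replace a whole partition inside a single transaction. The sketch below uses an in-memory SQLite database and a hypothetical `sales` table to show that a rerun leaves the same rows.

```python
import sqlite3

def load_partition(conn, day, amounts):
    """Replace all rows for `day` in one transaction so reruns give the same result."""
    with conn:  # commits on success, rolls back if an error is raised
        conn.execute("DELETE FROM sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO sales (day, amount) VALUES (?, ?)",
            [(day, amount) for amount in amounts],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, amount REAL)")
load_partition(conn, "2024-01-01", [10.0, 20.0])
load_partition(conn, "2024-01-01", [10.0, 20.0])  # rerun, e.g. after a retry
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

A plain `INSERT` without the preceding `DELETE` would have doubled the rows on the second run.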
What metadata should be logged for each workflow run?
Start and end times, status, parameters, input ranges, row counts, errors, and upstream/downstream relationships.
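The fields listed above could be captured in a small run record and emitted as JSON. The `RunRecord` class and its field names are a hypothetical layout, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional

def _now():
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()

@dataclass
class RunRecord:
    workflow: str
    params: dict
    input_range: str
    upstream: list = field(default_factory=list)
    status: str = "running"
    started_at: str = field(default_factory=_now)
    ended_at: Optional[str] = None
    row_count: Optional[int] = None
    error: Optional[str] = None

    def finish(self, status, row_count=None, error=None):
        """Record the final status, row count, and error message, and stamp the end time."""
        self.status = status
        self.row_count = row_count
        self.error = error
        self.ended_at = _now()

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)

record = RunRecord("daily_sales", {"env": "dev"}, "2024-01-01/2024-01-02", ["extract_orders"])
record.finish("success", row_count=42)
```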
Why is parameterization important in workflows?
It allows the same DAG and code to run for different dates, environments, or customers without duplicating logic.
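A small sketch of parameterization, assuming the run is described by a date, an environment, and a customer. The `RunParams` class and the path layout are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class RunParams:
    logical_date: date
    env: str = "dev"
    customer: str = "all"

def output_path(p):
    """Derive the output location from parameters instead of hardcoding it."""
    return (
        f"warehouse-{p.env}/orders/"
        f"customer={p.customer}/dt={p.logical_date.isoformat()}"
    )

dev_path = output_path(RunParams(date(2024, 1, 1)))
prod_path = output_path(RunParams(date(2024, 1, 1), env="prod", customer="acme"))
```

The same function serves a dev run and a prod run for a specific customer; only the parameters differ.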
What is environment separation in orchestration (dev/test/prod)?
Running pipelines in distinct environments or accounts to test changes safely before they affect production data.
Why should deployments of orchestration code be version-controlled?
Version control enables rollback, code review, and clear tracking of which version introduced a change or incident.
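What is a blue/green or canary deployment for pipeline changes?
Both deploy a new version alongside the existing one to reduce risk: blue/green switches all traffic or data to the new version at once after validation, while canary routes a small subset to it first and expands gradually before full cutover.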
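Routing a subset of data to a new version is often done with a stable hash, so the same key always takes the same path. This is a minimal sketch; the `use_canary` function and the choice of SHA-256 are assumptions, not a prescribed method.

```python
import hashlib

def use_canary(key, percent):
    """Deterministically route roughly `percent`% of keys to the new version."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < percent

# Hypothetical customer IDs routed at a 10% canary share.
customers = [f"customer-{i}" for i in range(1000)]
canary_share = sum(use_canary(c, 10) for c in customers) / len(customers)
```

Raising `percent` step by step widens the rollout, and setting it to 100 completes the cutover.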
What is dependency inversion at the orchestration level?
Designing pipelines so they depend on stable interfaces or contracts between stages rather than hardcoded details of internal implementations.
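In code, a stage can depend on a small interface rather than on a concrete upstream implementation. The `OrderSource` protocol and `InMemorySource` class below are hypothetical names used to illustrate the idea.

```python
from typing import Iterable, Protocol

class OrderSource(Protocol):
    """Contract between stages: anything with a matching `read` method qualifies."""
    def read(self, day: str) -> Iterable[dict]: ...

class InMemorySource:
    """One possible implementation; a warehouse- or file-backed source could replace it."""
    def __init__(self, rows):
        self.rows = rows

    def read(self, day):
        return [r for r in self.rows if r["day"] == day]

def daily_total(source: OrderSource, day: str) -> float:
    # The downstream stage only relies on the contract, not on how rows are stored.
    return sum(r["amount"] for r in source.read(day))

rows = [
    {"day": "2024-01-01", "amount": 10.0},
    {"day": "2024-01-01", "amount": 20.0},
    {"day": "2024-01-02", "amount": 5.0},
]
total_jan_1 = daily_total(InMemorySource(rows), "2024-01-01")
```

Swapping the source implementation requires no change to `daily_total`, as long as the new source honors the same contract.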