Data engineering Flashcards

Assist learner to become knowledgeable in data engineering (95 cards)

1
Q

What is an offset in Apache Kafka?

A

A sequential integer that uniquely identifies a record's position within a partition. Consumers use offsets to track how far they have read.
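
A minimal sketch of the idea, assuming a toy in-memory partition. The `Partition` class is hypothetical and not part of any Kafka client API; it only shows that each appended record gets the next sequential offset and that a reader can resume from a stored offset.

```python
class Partition:
    """Toy append-only log; hypothetical, not a Kafka client class."""

    def __init__(self):
        self._records = []

    def append(self, value):
        """Store a record and return its offset (its position in the log)."""
        self._records.append(value)
        return len(self._records) - 1

    def read_from(self, offset):
        """Return (offset, value) pairs starting at the given offset."""
        return list(enumerate(self._records[offset:], start=offset))


partition = Partition()
offsets = [partition.append(v) for v in ("a", "b", "c")]
resumed = partition.read_from(1)  # a consumer resuming at offset 1
```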

2
Q

What are DStreams built as in Spark Streaming?

A

a continuous stream of RDDs

3
Q

What is the main advantage of the Kappa architecture compared to the Lambda architecture?

A

Kappa removes the separate batch and speed layers and keeps a single streaming pipeline built on an immutable event log (for example, Kafka).

This simplifies the overall system design, codebase, and operations.
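
A rough sketch of why a single log-based pipeline is enough, assuming a hypothetical event log of `(account, delta)` pairs: the same function serves both live processing and reprocessing, because reprocessing is just replaying the log from the start.

```python
def build_view(event_log):
    """Fold an immutable event log into a current-balance view."""
    balances = {}
    for account, delta in event_log:
        balances[account] = balances.get(account, 0) + delta
    return balances


# Hypothetical events; a tuple is used to stress that the log is never mutated.
event_log = (("alice", 10), ("bob", 5), ("alice", -3))

view = build_view(event_log)
replayed = build_view(event_log)  # reprocessing = replaying the same log
```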

4
Q

Which one is a simpler data engineering architecture, kappa or lambda?

A

kappa

5
Q

What does Docker use as the basis for creating reproducible environments?

A

Dockerfiles

6
Q

What process is responsible for building, running, and distributing Docker containers?

A

Docker daemon

7
Q

What happens when the docker build command is run from a bash terminal?

A

Docker follows the instructions in the Dockerfile found in the build context (for example, the current directory when the command is run as docker build .) and creates an image.

8
Q

What do we specify in Dockerfiles?

A
  • env vars
  • file locations
  • language
  • network ports
  • OS the container will run on
  • what to do when executed
9
Q

In Docker, what is the key abstraction that enables communication between containers and between containers and the outside world?

A

Docker networks

(such as bridge, host, overlay, macvlan)

virtual networks created and managed by Docker’s network drivers that connect containers to each other and to external networks

10
Q

What is iptables?

A

a Linux command-line firewall utility that allows or blocks traffic based on policy chains

11
Q

What are the two types of nodes that make up a Docker swarm?

A

managers and workers

12
Q

What does

# syntax=docker/dockerfile:1

as the first line of a Dockerfile tell the Docker builder?

A

which Dockerfile syntax to use

the latest release of the version 1 syntax

13
Q

What do anti-affinity rules do?

A

Anti-affinity rules ensure that selected virtual machines are not placed on the same host.

They spread VMs across different hosts so that a single host failure does not take them all down at once.

14
Q

What is Kubernetes?

A

a container management and orchestration tool

κυβερνήτης (Greek) = helmsman

15
Q

What is the core function of Kubernetes?

A

to run and coordinate containerized applications across a cluster of machines

16
Q

What happens to the pods of a deployment during a rolling update in Kubernetes?

A

During a rolling update, Kubernetes gradually replaces the existing pods with new pods, creating new ones and terminating old ones in small batches so the application remains available.

17
Q

What is the main responsibility of the kubelet in Kubernetes?

A

The kubelet runs on each node and monitors the pods scheduled to that node. It continuously checks their actual state and works to keep them matching their desired state (for example, by starting or restarting containers when needed).

When a node stops sending kubelet heartbeats, the control plane marks the node as NotReady, and controllers recreate the affected Pods on other healthy nodes.
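
The desired-versus-actual comparison can be sketched as follows. This is only an illustration of the reconciliation idea, not how the kubelet is implemented; the `reconcile` function and the container names are hypothetical.

```python
def reconcile(desired, actual):
    """Return the actions needed to move the actual state toward the desired state.

    Both arguments are sets of container names (a deliberately simplified model).
    """
    return {
        "start": sorted(desired - actual),  # wanted but not running
        "stop": sorted(actual - desired),   # running but no longer wanted
    }


desired_state = {"web", "cache"}
actual_state = {"web", "old-job"}
actions = reconcile(desired_state, actual_state)
```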

18
Q

In Kubernetes, by acting as the brain and main gateway, what are the primary responsibilities of the control plane (legacy term: master node)?

A

The control plane exposes the Kubernetes API, stores and manages the cluster state, schedules pods onto worker nodes, and runs controllers that monitor node and workload health and automatically react to changes (for example, replacing failed pods or handling node failures).

19
Q

In the context of Kubernetes, what is etcd and what is it used for?

A

etcd is a persistent, strongly consistent, distributed key–value store used by Kubernetes as the primary data store for all cluster configuration and state.

20
Q

With which command-line client do you typically access and manage a Kubernetes cluster?

A

kubectl

the Kubernetes command-line client

to access and manage a cluster from a local machine

21
Q

In Kubernetes, what are worker nodes responsible for?

A

Worker nodes host and run pods (the application workloads) and provide the local services (kubelet, container runtime, kube-proxy) that communicate with the control plane and handle node-level networking.

22
Q

Which architecture pattern does Docker use?
a) Master-slave
b) Event-bus
c) Client-server
d) Model-view-controller

A

The correct answer is (c) Client-server.

Docker is built on a client-server architecture.

server: the Docker daemon, the persistent background process (often named dockerd)

client: the primary way most users interact with Docker, the Docker CLI or docker command
(When we type a command like docker run or docker build, the client translates that command into a REST API request and sends it to the Docker daemon.)

23
Q

The control plane in a Kubernetes architecture is also known as…
a) …pod
b) …etcd
c) …API Server
d) …master node

A

The correct answer is (d) …master node.

The term “master node” is the older terminology used in Kubernetes.

The control plane refers to the set of components that act as the “brain” of the Kubernetes cluster. It is responsible for making global decisions about the cluster (like scheduling pods) and maintaining the cluster’s desired state.

24
Q

What does the term overhead refer to in data engineering?

A

the extra resources (time, memory, compute, bandwidth) consumed beyond what the actual task strictly requires

25
"No one shall be subjected to arbitrary interference with his privacy, family, home or correspondence, nor to attacks upon his honour and reputation. Everyone has the right to the protection of the law against such interference or attacks." | Where does the quote originate from?
Article 12 of the Universal Declaration of Human Rights (1948)
26
Who does the GDPR (2018) apply to?
all organizations in the EU and those that offer goods or services in the EU, or collect and analyze data related to EU residents, regardless of their geographic locations
27
According to the GDPR, what are the main data protection principles that processors should adhere to?
– accuracy – data minimization – **integrity** and confidentiality – lawfulness, fairness, and transparency – purpose limitation – **storage** limitation
28
What does the GDPR prescribe with respect to the principle of data minimization?
That personal data shall be adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed.
29
Is the processing of personal data allowed per the GDPR?
Only if there is a lawful basis under Article 6: the data subject's consent, performance of a contract, a legal obligation, vital interests, a task carried out in the public interest, or the controller's legitimate interests.
30
What does the GDPR state on the topic of up-to-date & accurate personal information?
All appropriate measures are required to ensure inaccurate personal information is erased or corrected immediately, considering the nature of the processing.
31
What are the purposes for which personal data may be stored for longer periods?
public interest, science, historical research, or statistical research, following fair, lawful, and transparent practices
32
The data subjects have the right to access the data that have been collected about them by the controllers. What is the time window data controllers have at their disposal to respond to requests from data subjects?
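One month from receipt of the request; this can be extended by two further months for complex or numerous requests.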
33
What are the main rights of data subjects regarding their personal information?
– Right of access – Right of **transparent** **information**, communication, and modalities – Right to be notified – Right to data portability – Right to erasure or to be forgotten – Right to know the source of personal data – Right to object – Right to rectification – Right to reject automated individual decision-making – Right to restriction of processing
34
In which cases do the rights to restrict data processing not apply?
When data are processed to prevent, investigate, detect, or prosecute criminal offenses, or to prevent public security threats.
35
What technologies does the GDPR endorse for data protection?
None; the regulation is technology-agnostic.
36
What software platforms can help manage GDPR compliance?
**OneTrust** – comprehensive privacy management; **BigID** – AI-powered data discovery and privacy compliance; **Ketch** – consent and data rights automation; **PrivacyPerfect** – EU-based SaaS for DPOs and privacy officers; **SAI360** – integrated GRC with GDPR modules; **Securiti** – unified data controls and privacy automation; **Transcend** – automated DSARs and consent management; **TrustArc** – privacy management and assessments
37
Broadly speaking, what are the three main types of security breaches in data engineering?
– incorrect data modification – unauthorized data observation – unavailability of data
38
During the infamous 2012 data breach of LinkedIn, how many users were affected by the exposure of their PI?
165 million
39
What types of user data were involved in the data breach of Yahoo in 2013, which affected 3 billion accounts?
– birth date – email address – encrypted & unencrypted security questions – hashed password – name – telephone number
40
During the first infamous Facebook data policy bug in the early 2010s, for how long had it exposed the personal data of 6 million users to unauthorized viewers?
for over a year
41
In September 2018, how many FB user accounts were discovered to have been hacked?
between 50 and 90 million
42
What is the part of requirements engineering called that is critical for developing secure software systems?
security requirements engineering
43
List example approaches to security requirements engineering.
– **CLASP** (Comprehensive, Lightweight Application Security Process) – **Core** security requirements artifacts – **SQUARE** (Security Quality Requirements Engineering)
44
What are the steps of the SQUARE process framework for security requirements engineering?
Step 1: Agree on definitions; Step 2: Identify assets and security goals; Step 3: Develop **artifacts**; Step 4: Perform risk assessment; Step 5: Select elicitation technique; Step 6: Elicit security requirements; Step 7: Categorize requirements; Step 8: Prioritize requirements; Step 9: Inspect requirements
45
What are the most widely used security patterns for recurring problems in software development?
– **Authorization** Pattern – **Multilevel** Security Pattern – RBAC pattern – Single Access Point Pattern
46
What are the main activities in data governance?
Discovering the appropriate data, accessing them, understanding their meaning, and determining how they can be used in a compliant way.
47
What elements should be included in a framework that is designed to implement and assess data governance readiness and maturity?
– a set of **processes** and **activities** that involve identifying, defining, monitoring, and enforcing data quality policies – data policies and data governance rules – **roles** and **responsibilities** of people involved in the data governance life cycle – **technology** for implementing data governance
48
What are five key objectives of a data governance program?
1. Ensuring data is easy to find and access for authorized users. 2. Improving shared understanding of data through **clear definitions and metadata**. 3. Maintaining and enhancing data quality and trustworthiness. 4. Enabling **controlled self-service** use of data by business users. 5. Protecting data through strong privacy and security controls.
49
What are the main roles typically used in data governance projects and processes?
– CDO – Data governance council – Data owners – Data stewards
50
What are the main types of data governance policies?
– Data **access** policy – Data **governance** **structure** policy – Data **integrity** policy – Data **usage** policy
51
What are the 5 levels of data quality capability of the ISO 8000-61 process model?
Level 1: reliable data processing; Level 2: **controlling** the data processing; Level 3: data quality **planning**; Level 4: system-level **data quality assurance**; Level 5: **root cause analyses** of issues and reflection on the overall data approach
52
To overcome the communication issues when there are stakeholders with diverse cultural backgrounds involved, which three elicitation techniques can be used?
– **Accelerated Requirements** Method – **Joint Application** Design – Structured interviews
53
Please list five key goals of data governance.
– Making data more **findable** and accessible – Making data more **understandable** – Making data more **trustworthy** and improving the quality of data – Empowering data users with **self-service** – Ensuring data **privacy** and security
54
Under the GDPR, what type of mechanism is required to obtain valid electronic consent from a data subject?
Opt-in — the data subject must take a clear, affirmative action (e.g., ticking an unchecked box) to indicate consent. | Pre-checked boxes or implied consent are not valid.
55
During data collection on children, in the context of the GDPR, what is the controller's responsibility when it comes to obtaining consent?
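For online services offered directly to children below the age of digital consent (16 by default; member states may lower it to 13), the controller must obtain consent from the holder of parental responsibility and make reasonable efforts to verify it.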
56
What does Capability Level 1 in the ISO 8000-61 emphasize?
reliable data processing
57
Role-Based Access Control (RBAC) is classified as what type of access control model?
non-discretionary access control | non-DAC ## Footnote The Access Control Survey classifies traditional access control models into two primary categories: 1. DAC (Discretionary Access Control) 2. Non-DAC, which further divides into MAC, RBAC, and ABAC
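A minimal sketch of the RBAC idea, assuming hypothetical role and user tables: permissions are attached to roles, users are assigned roles, and an access check goes through the role rather than through per-user grants.

```python
# Hypothetical role and user assignments, for illustration only.
ROLE_PERMISSIONS = {
    "analyst": {"read"},
    "engineer": {"read", "write"},
}
USER_ROLES = {
    "dana": {"analyst"},
    "eli": {"engineer"},
}


def is_allowed(user, permission):
    """Grant access only if one of the user's roles carries the permission."""
    roles = USER_ROLES.get(user, set())
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)
```

Changing what analysts may do then means editing one role entry instead of touching every user.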
58
What does a lean manufacturing operation aim for?
increase productivity while reducing waste
59
What is Terraform?
a CLI-based infrastructure-as-code (IaC) tool for provisioning and managing cloud services (open source until 2023, now distributed under the Business Source License)
60
What kind of configuration files does the Terraform IaC tool expect?
plain text files with an extension of .tf
61
What is ARM in the context of Microsoft?
Azure Resource Manager | the deployment and management service for Azure ## Footnote It provides a management layer that helps us create, update, and delete resources in an Azure account.
62
How many imperative commands do we have to use to deploy an ARM template?
one
63
In the context of GitHub Actions, workflows are configurable automated processes that run multiple jobs. What file extensions do we have to use to define them in a GitHub repository?
YAML (.yml or .yaml), stored in the repository's .github/workflows directory, e.g. "acidvuca\vlc\.github\workflows\test.yml"
64
When do developers typically open a pull request?
After pushing their changes to a branch (or fork), when the work is ready for review and potential merge into a target branch (for example, main).
65
What kind of plugins does the Java-based open-source software development tool Jenkins include?
continuous integration (CI) plugins
66
What development lifecycle processes does Jenkins support?
- build - deploy - document - package - stage - static analysis - test
67
What are the key architectural components of Jenkins?
Jenkins **controller** (manages configuration, schedules builds, serves the UI) and Jenkins **agents** (execute the actual build jobs, can be static nodes or dynamic cloud-provisioned instances).
68
What are typical phases of a data pipeline?
♥ data extraction ♥ data ingestion ♥ data preprocessing ♥ data validation ♥ training machine learning algorithms ♥ reporting
69
Give examples of platforms for orchestrating data pipelines.
Azure Data Factory, Apache Airflow, Apache NiFi, Luigi, AWS Glue
70
What are the two principles that the expectation of replayability in data pipelines is based on?
immutability of the data and idempotency of the processing steps
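A small sketch of what idempotency buys during a replay, assuming a hypothetical `load` step that upserts records by `id` into a dictionary standing in for the target table: running the same batch twice leaves the target unchanged, so a failed run can simply be repeated.

```python
def load(target, batch):
    """Idempotent upsert keyed by record id; the input batch is not modified."""
    for record in batch:
        target[record["id"]] = dict(record)
    return target


# Hypothetical input batch; a tuple signals that the source data is immutable.
batch = ({"id": 1, "value": "a"}, {"id": 2, "value": "b"})

table = {}
load(table, batch)
after_first_run = dict(table)
load(table, batch)  # replaying the same batch
```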
71
What are the main expectations we need to adhere to when orchestrating data pipelines?
☻ replayability ☻ auditability ☻ scalability ☻ reliability ☻ security
72
What is a directed acyclic graph (DAG) in the context of data pipeline orchestration?
A DAG models a pipeline as a graph of tasks with two constraints. Directed means each edge (dependency) flows one way: a task only runs once all of its upstream dependencies have completed. Acyclic means the dependency chain cannot loop back on itself — if A feeds B and B feeds C, then C cannot feed back into A. Together, these properties guarantee a deterministic, topologically sortable execution order with no deadlocks or infinite loops.
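Both properties can be illustrated with Python's standard-library `graphlib` module (Python 3.9+). The task names are hypothetical; each task maps to the set of upstream tasks it depends on.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: extract -> transform -> load.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}
order = list(TopologicalSorter(pipeline).static_order())


def has_cycle(graph):
    """Return True if the dependency graph loops back on itself."""
    try:
        # static_order() is lazy, so it must be consumed to trigger the check.
        list(TopologicalSorter(graph).static_order())
    except CycleError:
        return True
    return False


# A feeds B, B feeds C, and C feeds back into A: not a valid DAG.
looped = {"a": {"c"}, "b": {"a"}, "c": {"b"}}
```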
73
What is Azure Data Factory, part of Microsoft's cloud offering?
a data pipeline orchestrator and an ETL tool
74
What are the main components of a pipeline in Airflow?
♥ **Operators**, predefined tasks that can be linked together ♥ **Sensors**, operators that wait for external events to occur ♥ **TaskFlow**, a Python function that is packaged as an Airflow pipeline task
75
What can we use as source in Apache NiFi?
A wide variety of data formats can be used, including logs, geolocation information, and social media feeds. | Additionally, SFTP, HDFS, and Kafka are supported.
76
What are the four key DevOps/DataOps metrics, which can easily be captured and used in practice to evaluate the project progress in terms of DevOps/DataOps principles?
♦ deployment frequency ♦ **lead time for changes** ♦ change failure rate ♦ mean time to restore service
77
What is dead code?
Dead code is any section of source code that can never be executed, regardless of input, because no control flow path reaches it — for example, code after an unconditional return, or a branch guarded by a condition that is always false. | Compilers typically detect and eliminate it during optimization. ## Footnote The related term dead store refers to code that does execute but whose result is never subsequently used.
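A tiny illustration in Python, using a hypothetical `absolute` function: the final statement follows an unconditional return, so no control flow path can reach it.

```python
def absolute(x):
    """Return the absolute value of x."""
    if x < 0:
        return -x
    return x
    print("never reached")  # dead code: follows an unconditional return
```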
78
How is the defect density of a software application defined?
the number of confirmed bugs during the development period divided by the software size
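As a worked sketch, assuming software size is measured in thousands of lines of code (KLOC), which is one common choice but not the only one:

```python
def defect_density(confirmed_defects, size_kloc):
    """Confirmed defects per thousand lines of code (KLOC is an assumed unit)."""
    if size_kloc <= 0:
        raise ValueError("software size must be positive")
    return confirmed_defects / size_kloc


density = defect_density(confirmed_defects=30, size_kloc=15)
```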
79
What are different metrics of line counts in IT?
* (raw) lines of code (LOC) * percentage of comments (perCOM) * source lines of code (SLOC) * logical lines of code (LLOC)
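A rough sketch of how three of these counts could be derived, under simplifying assumptions: comments are whole lines starting with `#`, and SLOC is taken as raw lines minus blank and comment-only lines. Real tools differ in the exact definitions, and LLOC needs a parser, so it is left out here.

```python
def line_counts(source):
    """Return LOC, SLOC, and perCOM for a source string (simplified rules)."""
    lines = source.splitlines()
    loc = len(lines)
    comments = sum(1 for line in lines if line.strip().startswith("#"))
    blanks = sum(1 for line in lines if not line.strip())
    per_com = 100 * comments / loc if loc else 0.0
    return {"LOC": loc, "SLOC": loc - comments - blanks, "perCOM": per_com}


sample = "# add one\n\nx = 1\nx = x + 1\n"
counts = line_counts(sample)
```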
80
On what do the Halstead metrics depend?
the number of operators and operands
81
How do we calculate the Halstead Volume (HV)?
$HV = LTH \cdot \log_2(VOC)$ | $LTH = OP + OD$ (program length), $VOC = UOP + UOD$ (program vocabulary) ## Footnote OP = total number of operators; OD = total number of operands; UOP = number of distinct operators; UOD = number of distinct operands
81
What are the operators and the operands in "int x = x + 1" per the Halstead metrics?
operators: =, + | operands: int, x, 1
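Taking the classification on this card as given, the Halstead Volume of the statement can be worked out as a sketch: 2 operators (2 distinct), 4 operands (`int`, `x`, `x`, `1`; 3 distinct), so the length is 6 and the vocabulary is 5. The function name is hypothetical.

```python
import math


def halstead_volume(operators, operands):
    """Volume = program length * log2(program vocabulary)."""
    length = len(operators) + len(operands)                # OP + OD
    vocabulary = len(set(operators)) + len(set(operands))  # UOP + UOD
    return length * math.log2(vocabulary)


# "int x = x + 1", split the way the card does it.
volume = halstead_volume(["=", "+"], ["int", "x", "x", "1"])
```

With these counts the volume is 6 · log2(5), roughly 13.9.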
82
What does cyclomatic complexity (CC) measure?
the number of linearly independent paths through the code's control flow, used as an indicator of its understandability, maintainability, and testability
83
How is cyclomatic complexity (CC) calculated?
CC = E − N + 2, where E is the number of edges and N the number of nodes in the control-flow graph
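A worked sketch for a single if/else, assuming a hand-written control-flow graph with hypothetical node names: four nodes and four edges give a complexity of 2, matching the two possible paths.

```python
def cyclomatic_complexity(edges, nodes):
    """CC = E - N + 2 for a single connected control-flow graph."""
    return len(edges) - len(nodes) + 2


# Control-flow graph of one if/else statement.
nodes = {"entry", "then", "else", "exit"}
edges = {
    ("entry", "then"),
    ("entry", "else"),
    ("then", "exit"),
    ("else", "exit"),
}
cc = cyclomatic_complexity(edges, nodes)
```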
84
What is the relation between the number of tests for full coverage and cyclomatic complexity?
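Cyclomatic complexity equals the number of test cases needed to cover every linearly independent path (basis path coverage), and it is an upper bound on the number of tests needed for full branch coverage.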
85
What does the Maintainability Index (MI) measure?
how easy it is to support and change the code
86
What is the original formula for the Maintainability Index (MI) in software engineering?
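MI = 171 − 5.2 · ln(aveHV) − 0.23 · aveCC − 16.2 · ln(aveLOC) + 50 · sin(√(2.4 · perCOM)), where aveHV is the average Halstead Volume, aveCC the average cyclomatic complexity, aveLOC the average lines of code, and perCOM the percentage of comments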
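The formula translates directly into code. This is a sketch under two assumptions: `ln` is the natural logarithm, and the comment measure is passed to the sine term exactly as given (tools differ on whether it is a fraction or a percentage). The parameter names are hypothetical.

```python
import math


def maintainability_index(ave_hv, ave_cc, ave_loc, per_com):
    """Original four-metric Maintainability Index, as on the card."""
    return (
        171
        - 5.2 * math.log(ave_hv)
        - 0.23 * ave_cc
        - 16.2 * math.log(ave_loc)
        + 50 * math.sin(math.sqrt(2.4 * per_com))
    )


# Degenerate inputs make every term except the constant vanish.
baseline = maintainability_index(ave_hv=1, ave_cc=0, ave_loc=1, per_com=0)
```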
87
What are burndown charts used for in software development?
to track incomplete work or backlogs over a given period, as well as to measure team progress
88
Give examples of application monitoring dashboards.
- **Jaeger** (for microservices) - **OpenCensus** (data viewing on the host + exporting the data to central aggregators) - Prometheus (for containerized applications) - Grafana (to visualize event-driven metrics such as response time, request volume, workloads, network traffic flow etc.)
89
Name three operations that processors in an Apache NiFi flow perform on FlowFiles.
CREATE, CLONE, RECEIVE
90
How do we calculate the average lead time for production changes (such as change orders or revisions)?
For each change, measure the time between when the change is initiated (or approved) and when it is fully implemented in production. Then calculate the average of these lead times across all changes. | $\text{lead\_time}_i = t_{\text{finish},i} - t_{\text{start},i}$ ## Footnote average lead time: $\bar{L} = \frac{1}{N} \sum_{i=1}^{N} \left( t_{\text{finish},i} - t_{\text{start},i} \right)$
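The calculation can be sketched with the standard `datetime` module. The timestamps are made up for illustration, and `average_lead_time` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def average_lead_time(changes):
    """changes: iterable of (start, finish) datetime pairs, one per change."""
    lead_times = [finish - start for start, finish in changes]
    if not lead_times:
        raise ValueError("at least one change is required")
    return sum(lead_times, timedelta()) / len(lead_times)


changes = [
    (datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 11)),  # 2 hours
    (datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 13)),  # 4 hours
]
average = average_lead_time(changes)
```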
91
In DORA/Four Keys, what counts as a “failed change” (a change failure)?
A failed change / change failure is a production deployment that results in degraded service or an incident and requires immediate remediation, such as a hotfix, rollback, or fix-forward.
92
In the context of DevOps/DORA metrics, what determines the Change Failure Rate of a software project?
The Change Failure Rate is determined by how many production deployments fail—that is, the number of unsuccessful deployments in production (those that cause incidents or require remediation) relative to the total number of deployments.
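As a small sketch, assuming each production deployment is recorded as a boolean (`True` meaning it failed and needed remediation):

```python
def change_failure_rate(deployment_failed):
    """Share of production deployments that failed (True = failed)."""
    outcomes = list(deployment_failed)
    if not outcomes:
        raise ValueError("at least one deployment is required")
    return sum(outcomes) / len(outcomes)


# Hypothetical history: one failure in four deployments.
rate = change_failure_rate([False, True, False, False])
```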
93
In the context of continuous integration, which open-source automation server is widely used to orchestrate builds, tests, and deployments?
**Jenkins** is the open-source automation server commonly used for continuous integration and continuous delivery, allowing teams to automate building, testing, and deploying their software.
94
A score of less than or equal to 65 for a code MI refers to which maintainability category?
difficult to maintain