Big Data Technologies Flashcards

to study big data technologies (225 cards)

1
Q

the 4Vs of data

A

volume, velocity, variety, veracity

(how much data is stored, how fast can it be accessed, what kind of data is stored, what is the quality and accuracy of the data)

2
Q

volume of big data

A

the size of the data that is stored and available for access and processing

3
Q

max. handling capacity of a single server in 2025

A

up to around one Petabyte

(anything above this needs to be stored on a distributed system)

4
Q

abbreviation and storage space of Kilobyte

A

KB, 1000 B, 8000 bits

1 kibibyte (KiB) is 1024 B (2¹⁰)

5
Q

units of data volume from and above gigabyte

A

gigabyte, terabyte, petabyte, exabyte, zettabyte, yottabyte (in ascending order, each decimal step ×1,000; the binary units GiB, TiB, etc. step by ×1,024)
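As an illustrative sketch (assuming decimal 1,000-byte steps; the `human` helper is hypothetical), the unit ladder can be walked in Python:

```python
# Convert a raw byte count to a human-readable unit label.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human(n_bytes, step=1000):
    i = 0
    while n_bytes >= step and i < len(UNITS) - 1:
        n_bytes /= step
        i += 1
    return f"{n_bytes:.1f} {UNITS[i]}"

print(human(3_500_000_000_000))  # 3.5 TB
```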

6
Q

transfer speed (definition, unit)

A

a measure of how much data is transferred per time unit, e.g. per second; bit/s (bps), Kbit/s (Kbps), Mbit/s (Mbps), Gbit/s (Gbps)
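A back-of-the-envelope sketch (numbers are hypothetical) shows how size and speed combine into transfer time:

```python
# Transfer time = data size / transfer speed (decimal units assumed).
size_bits = 500 * 10**9 * 8   # 500 GB expressed in bits
speed_bps = 1 * 10**9         # 1 Gbps link
seconds = size_bits / speed_bps
print(seconds, "s =", round(seconds / 3600, 2), "h")  # 4000.0 s = 1.11 h
```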

7
Q

What can we do when the slowest component of the server (usually the disk) reaches its transfer speed limit?

A

We can accelerate the workload by implementing a distributed system, so we can have data written to many servers across a server pool.

8
Q

response time (definition, unit)

A

the time it takes for a database to respond to an access or storage request, ms

9
Q

What is the use of messages in big data?

A

They are used for transmitting data in low-velocity IoT applications before it is ingested into DB systems.

(Messaging systems come with a queue to offer support for scenarios with unstable internet connection. Newly accumulated data is added to the queue, where it is stored until the device comes back online.)

10
Q

FIFO

A

first-in, first-out methodology
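A minimal FIFO sketch with Python's standard-library deque (message names are illustrative):

```python
from collections import deque

q = deque()
q.append("msg-1")    # enqueue: newest message goes to the back
q.append("msg-2")
first = q.popleft()  # dequeue: oldest message leaves first
print(first)         # msg-1
```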

11
Q

eventual consistency

A

A concept of improving DB read and write performance. When it is applied, the data is initially only written to one node, then replicated across others.

<-> strong consistency

-> Data can’t be assumed to be up to date in a distributed DB.

12
Q

What does variety describe in big data?

A

the different types of data present: structured, semi-structured, unstructured

13
Q

What kind of data can cause veracity issues in the field of big data?

A

inconsistent, untrusted, raw/uncleansed, biased, incomplete etc.

14
Q

What dimensions do extensions to the 4Vs contain?

A

variability, exhaustivity, fine-grained, relationality, resolution & indexicality, extensionality & scalability, value

15
Q

data mining in big data

A

the process of finding, extracting and processing data

16
Q

psycopg

A

the most popular PostgreSQL adapter for the Python programming language

17
Q

tweepy

A

a Python library that enables easy Twitter API access

18
Q

Twitter OAuthing in Python

A
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

After the first line, the auth variable refers to an instance of the class tweepy.OAuthHandler.

19
Q

What can we use the api object resulting from api = tweepy.API(auth) in Python for?

A

to authenticate search requests against the API

20
Q

How do I paginate through tweets by hashtag and start date using tweepy v2?

A

from datetime import datetime, timezone
import os, tweepy, mysql.connector

# ---------- 1. Authenticate ----------------------------------------
client = tweepy.Client(
    bearer_token=os.environ["X_BEARER_TOKEN"],
    wait_on_rate_limit=True,
)

# ---------- 2. Build query ------------------------------------------
query = "#InterestingHashtag lang:en -is:retweet"

# Earliest date wanted, inclusive (UTC)
start_time = datetime(2024, 1, 1, tzinfo=timezone.utc)

# ---------- 3. Choose endpoint by access tier -----------------------
# search_all_tweets    -> Full Archive (Academic -- effectively dead)
# search_recent_tweets -> Last 7 days (Basic $100/mo+, max 100/page)
search_method = client.search_recent_tweets

# ---------- 4. Iterate with Paginator -------------------------------
db_params = dict(host="localhost", database="mydb",
                 user="root", password="secret")
sql = "INSERT IGNORE INTO tweets(id, text) VALUES (%s, %s)"
BATCH = 200

cxn = mysql.connector.connect(**db_params)
cur = cxn.cursor()
try:
    paginator = tweepy.Paginator(
        search_method,
        query=query,
        start_time=start_time.isoformat(),  # YYYY-MM-DDTHH:MM:SS+00:00
        tweet_fields=["id", "text"],
        max_results=100,  # 10-100 for recent search
    )
    count = 0
    for tweet in paginator.flatten(limit=None):
        cur.execute(sql, (tweet.id, tweet.text))
        count += 1
        if count % BATCH == 0:
            cxn.commit()
    cxn.commit()  # flush remainder
finally:
    cur.close()
    cxn.close()

21
Q

pub/sub

A

a system of publishers (IoT edge devices) and subscribers (brokers that make data available to clients)

22
Q

What does a message broker do in Industry 4.0?

A

It handles delivery of messages after periods of downtime on the subscriber’s end.

23
Q

What are the components of the pub/sub pattern?

A

IoT machinery -> message broker -> DB

24
Q

SWIFT

A

Society for Worldwide Interbank Financial Telecommunication

(payment processing system acting as a messaging service)

25
Data storage need of CT scan results
Ca. 22 MB per uncompressed image file. If 1000 slices are captured, we will have 22 GB per CT scan. If we do high-res scans, it is up to 350 GB per CT scan.
26
What additional traits do airborne and satellite imagery have in terms of big data when compared with CT imagery?
time slices, geospatial references
27
What does INSPIRE abbreviate in the context of geospatial data?
Infrastructure for Spatial Information in Europe
28
difference between flat and hierarchical data
flat is tablelike (spreadsheet), hierarchical is treelike (there are relationships represented between records)
29
velocity of streamed and batch-processed data
streamed is high, batch-processed is low
30
What is a spatial join used for?
to determine the relationship between geodatasets ## Footnote (e. g. the latitude and longitude of a phone's location is joined w/ a polygon that represents a tower's coverage area)
31
examples for structured data
numeric sensor data, birthdates in a customer DB, addresses, names, e-mail addresses, spatial coordinates, phone numbers
32
What are characters encoded into in UTF-8 and ASCII?
sequences of 8-bit bytes
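A quick sketch of the difference: ASCII covers only 7-bit characters, while UTF-8 spends extra bytes on anything beyond them (the example string is arbitrary):

```python
text = "héllo"
ascii_ok = "hello".encode("ascii")  # pure ASCII: one byte per character
utf8 = text.encode("utf-8")         # 'é' needs two bytes in UTF-8
print(len(text), len(utf8))         # 5 6
```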
33
binary encoding values for an 8-bit integer (index, value)
(1, 128), (2, 64), …, (7, 2), (8, 1)
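These positional weights can be generated and applied in Python (the bit pattern is chosen arbitrarily for illustration):

```python
weights = [2 ** (8 - i) for i in range(1, 9)]
print(weights)  # [128, 64, 32, 16, 8, 4, 2, 1]

bits = [1, 0, 1, 0, 0, 0, 1, 1]  # 0b10100011
value = sum(w for w, bit in zip(weights, bits) if bit)
print(value)  # 163
```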
34
How to add type hint 'float' for a temperature variable in Python?
temperature: float = 62.8
35
What is required in Java regarding the data type when a new variable is defined?
It needs to be set explicitly.
36
data types in PostgreSQL
text, integer, decimal, bool, date, money, double precision, char (a single character) + geodata, IP address, UUID, CIDR, MAC address etc.
37
What is the max storage size of a field of data type text in PostgreSQL?
1GB
38
examples of semi-structured data
XML, emails, files in markup languages, HTML data
39
examples of unstructured data
image, video, audio, binary
40
Why were NoSQL DBs developed?
Because SQL-based DBs cannot handle unstructured data.
41
What is GridFS?
a specification used by MongoDB to store files larger than 16MB ## Footnote GridFS is implemented entirely inside MongoDB: each file is sliced into 255 kB chunks that are written as separate BSON documents in a collection called fs.chunks; per-file metadata lives in fs.files. MongoDB’s drivers then give us helper methods that feel like open(), read(), seek(), write(), but under the hood we are still doing normal database reads and writes.
42
Which method of storing unstructured data in a DB is less error-prone, storing it as a BLOB or using file system path?
BLOB
43
How do we create the MongoClient instance in Python?
from pymongo import MongoClient
client = MongoClient('<>')
## Footnote the MongoDB URL is a connection string, something like "mongodb://[username:password@]host1[:port1][,host2[:port2],...[,hostN[:portN]]][/[database][?options]]"
44
How do we create a MongoDB database object in Python to reference the DB?
from pymongo import MongoClient
client = MongoClient('<>')
database = client.emails
45
Exemplify a .JSON-like object representation of an email message to be stored in MongoDB.
``` email_message = { 'to_line': 'example@iu.org', 'cc_line': '', 'from': 'example2@mit.org', 'subject': 'Important', 'body': 'anything you wish here to be added.', 'attachments': [ { 'filename': 'catphoto', 'extension': 'jpg', 'data': '…' }, { 'filename': 'dogphoto', 'extension': 'jpg', 'data': '…' } ] } ```
46
What could be Python code to find and delete all emails sent to 'example@iu.org' stored in MongoDB If we stored emails with saved = database.emaildata.insert_one(email_message)?
We can locate emails sent to the specified address with `example_emails = database.emaildata.find({'to_line': 'example@iu.org'})`. Then `deleted = database.emaildata.delete_many({'to_line': 'example@iu.org'})`
47
What is the definition of Big Data?
the creation, storage, and computation of huge amounts of data that frequently needs to be stored in distributed systems capable of handling a very large number of parallel read and write requests
48
What does veracity measure?
It measures how inconsistent, untrusted, raw/uncleansed, biased and incomplete the data might be. ## Footnote Because veracity is a quality continuum, we can equally talk about the high end of the scale: consistent (values agree across sources and over time), trusted (provenance is known; integrity checks pass), processed/cleansed (errors removed, formats normalised), unbiased (sampling and recording processes minimise systematic error), complete (few or no missing or truncated values) The underlying dimension being measured is still the same: How much can I rely on this dataset to reflect reality?
49
What does MAC stand for in MAC address?
medium/media access control
50
What needs to be done with data if we are to store it in a database?
It needs to be normalized to make sure it conforms to a standard.
51
How can we use the "with" statement in Python to open a text file with the open() function, and extract its content using the read() method, and print it out?
with open('example.txt') as f:
    content = f.read()
print(content)
52
What is pandas the abbreviation of?
panel data
53
How do we load the data in Islands.csv into a pandas df?
import pandas as pd
data = pd.read_csv("Islands.csv", sep=";")
54
What does pandas' read_csv function do with the first row?
It defines it as the header by default.
55
How do we add the column names manually to a headerless Islands_noheader.csv when extracting its data to a pandas df?
columns = ["Island", "Year", "Residents", "Capital", "Continent"]
data = pd.read_csv("Islands_noheader.csv", names=columns, sep=";")
56
How to read Islands_meta.csv to a pd df skipping the first four metadata rows, also making sure that the data is read as utf-8?
`data = pd.read_csv("Islands_meta.csv", sep=";", skiprows=range(0,4), encoding="utf-8")`
57
How to save a pd df as a csv?
data.to_csv("example.csv", sep=";")
58
What separates records from each other in JSON?
comma
59
How do we read a JSON file with the open() and read() functions, and load the string json_data to a Python dictionary?
import json
json_data = open('example.json').read()
data = json.loads(json_data)
60
Imagining that a JSON data file is in a nested structure, we decide to only access elements for the top-level key "main", for which all elements are in a flat format. How do we do that?
data["main"]
61
How do we normalize the elements of the data["main"] level to a pd?
pd.json_normalize(data["main"])
62
What does XML abbreviate?
eXtensible Markup Language
63
How do we specify the XML version used at the top of an XML file?
`<?xml version="1.0" encoding="UTF-8"?>`
64
Using Python's xml.etree.ElementTree, how do we print the name of every country in data.xml, given that each direct child of the root holds the country name in its second sub-element?
``` import xml.etree.ElementTree as ET tree = ET.parse('data.xml') root = tree.getroot() for child in root: print(child[1].text) ```
65
What does HDF abbreviate?
Hierarchical Data Format
66
What does an HDF5 file contain?
metadata, groups, datasets
67
How can we create an iu.h5 file with a new dataset in the shape of 4, 6 in Python?
import h5py
file = h5py.File('iu.h5', 'w')
dataset = file.create_dataset("iu", (4, 6))
68
How to query the shape, name and parent of a dataset with h5py?
dataset.shape, dataset.name, dataset.parent
69
How to generate random data in the shape of (4, 6) to be added to iu.h5 as iu_numbers using the h5py module?
``` import h5py import numpy as np with h5py.File('iu.h5', 'r+') as f: if "iu_numbers" not in f: dset = f.create_dataset("iu_numbers", (4, 6)) else: dset = f["iu_numbers"] data = np.random.rand(4, 6).round(2) dset[...] = data ```
70
How can I add custom metadata (attributes) to an h5py dataset and read it back?
In h5py we store metadata as attributes on files, groups, or datasets using the .attrs mapping: | *.attrs[...] is a dict-like interface for for metadata of HDF5 objects ## Footnote ``` import h5py import numpy as np with h5py.File("example.h5", "w") as f: data = np.arange(10) dset = f.create_dataset("mydata", data=data) # Add metadata (attribute) dset.attrs["user"] = "T" dset.attrs["description"] = "Toy example dataset" with h5py.File("example.h5", "r") as f: dset = f["mydata"] # Read a single attribute print(dset.attrs["user"]) # T # Or iterate over all attributes for key, value in dset.attrs.items(): print(key, value) ```
71
What does RLE stand for in the context of Parquet?
run length encoding
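A toy sketch of run length encoding itself, using only the standard library (this is the general idea, not Parquet's internal implementation):

```python
from itertools import groupby

def rle(values):
    # Collapse each run of equal values into a (value, run_length) pair.
    return [(v, len(list(g))) for v, g in groupby(values)]

print(rle(["a", "a", "a", "b", "b", "a"]))  # [('a', 3), ('b', 2), ('a', 1)]
```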
72
What are two configurations in Parquet that enable the optimization of files?
1. **row group size**, which allows the data to be chunked 2. **data page size**, which enables a single row lookup
73
How to create a pd df w/ 100 random numbers, convert it to a parquet table w/ 4 columns and write it to a parquet file?
``` import pyarrow.parquet as pq import pyarrow as pa import pandas as pd import numpy as np df = pd.DataFrame(np.random.rand(100).reshape(25, 4), columns=["one", "two", "three", "four"]) tableToWrite = pa.Table.from_pandas(df) pq.write_table(tableToWrite, "myPQFile.parquet") ``` | pa.Table.from_pandas(df) copies the df labels into the Arrow Schema
74
How to get a parquet file's content to a pd df?
``` import pyarrow.parquet as pq tableToRead = pq.read_table("example.parquet") tableToRead.to_pandas() ``` | pandas.read_parquet(path, columns=None, storage_options=None, **kwargs)
75
How to query metadata for a parquet file?
file = pq.ParquetFile("example.parquet")
file.metadata
file.schema
76
Why was Arrow created?
Arrow was built for in-memory analytics to enable the processing and moving of information in a big data environment with low overhead.
77
What is Cassandra?
a DBMS
78
Give examples of systems Arrow can connect.
Spark, Drill, Pandas, Impala, Parquet, HBase, Cassandra, Kudu
79
Who is the publisher of Arrow?
the Apache Software Foundation
80
Where is the data when we use Arrow?
in memory
81
What kind of storage model does Arrow use?
Apache columnar storage model
82
How can we load a 3-column pd df data to Arrow?
``` import pyarrow as pa import pandas as pd import numpy as np df = pd.DataFrame({ "one": [20, np.nan, 2.5], "two": ["january", "february", "march"], "three": [True, False, True]}, index = list('abc')) table = pa.Table.from_pandas(df) ``` ## Footnote When we convert the df to an Apache Arrow table (pa.Table.from_pandas(df)), pandas includes the index as metadata (or as a column if we set preserve_index=True). Either way, the row labels remain part of the dataset’s identity. In short, index=list('abc') is a neat shorthand for giving the three rows the labels 'a', 'b', and 'c'.
83
How do we get a pd df from an Arrow table read into memory from data/sales_2025-06.parquet?
`file_path = Path("data/sales_2025-06.parquet") sales_tbl: pa.Table = pq.read_table(file_path) sales_df: pd.DataFrame = sales_tbl.to_pandas( types_mapper=pd.ArrowDtype, preserve_index=False )`
84
What are the advantages of Arrow over dfs?
1. It contains extensible metadata information of the flat and nested types with the option to create user-defined types. 2. Better performance with DB or flatfile ingestion and export. 3. It is more efficient in processing very large datasets.
85
What kind of characters are contained in ASCII?
alphabetic, numeric, collating and non-printing
86
What language can you use Arrow with?
Arrow is language-agnostic
87
What does ASCII do?
It translates 0s and 1s to numbers between 0 and 255. ## Footnote Plain-vanilla ASCII maps 7-bit codes (0–127) to printable and control characters; the extra 128 possible values in an 8-bit byte belong to various extended or vendor-specific encodings, not to ASCII itself.
88
What are the 3 data models that dominated the 20th century?
relational, hierarchical, network
89
What structure does the hierarchical data model provide?
treelike ## Footnote “Tree-like” is the textbook keyword because it pinpoints the idea of a single root with parent-child branches—the defining property of the hierarchical data model.
90
What issue does the network model solve?
It is a generalization of the hierarchical model, and makes it easy to represent multiple parent nodes, which addresses the M:N relationship issues of the hierarchical model.
91
How do we access a node in the network data model?
We follow a path called the access path. | The model allows more than one parent (owner) per record. ## Footnote In the network model a node is accessed by navigating an access path of owner–member links that starts at a chosen root record.
92
Where does the relational data model place data?
In tables, simple structures comprising rows with each row having a set of attributes.
93
What are the main features of the relational data model?
PKs, FKs, table schemas
94
What issues w/ the relational data model give rise to NoSQL DBs in the early 2000s?
Tight and inflexible schema, issues with mapping between DB table column values and OOP objects, limited scalability options, and expensive operations to join tables.
95
the definition of index in the context of the relational data model
In a relational database, an index is a mapping (stored in a lookup structure) from the value of an indexed column (or column combination) to the physical location(s) of the corresponding row(s), allowing the DBMS to find those rows quickly without scanning the whole table. ## Footnote Under the hood an index—whether a B-tree, hash table, or another structure—stores two essential pieces of information: Key: the value(s) taken from the indexed column(s) Pointer: a reference to the physical row (or to a list of rows when duplicates exist)
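The key-to-row-location mapping can be mimicked with a plain dictionary (toy rows, not a real B-tree):

```python
rows = [("alice", 34), ("bob", 28), ("carol", 34)]

# Index on the age column: value -> list of row positions.
age_index = {}
for pos, (name, age) in enumerate(rows):
    age_index.setdefault(age, []).append(pos)

print([rows[i][0] for i in age_index[34]])  # ['alice', 'carol']
```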
96
With what promise did NoSQL emerge?
to address the weaknesses of the relational model, especially its inflexibility and low scalability
97
What does it mean that, in NoSQL, records are self-contained?
That it is possible to split the DB among a number of servers, if necessary, to balance the load and scale vertically and/or horizontally.
98
difference between vertical and horizontal scaling
vertical = adding resources to a single machine or server to cope with increased demand horizontal = adding more machines to the infrastructure to cope with high demand
99
What are the 4 main categories of NoSQL DBs?
key-value datastore, document datastore, columnar datastore, graph datastore
100
What is a hashmap?
a structure that stores several entries in which a value is mapped to a key ## Footnote Strictly speaking, it is a hash-based associative array: it hashes each key to an index, stores key–value pairs in buckets, and resolves collisions, giving O(1) expected-time lookups.
101
What does a key-value datastore look like?
{key-1} -> {value-1} {key-2} -> {value-2} … {key-n} -> {value-n}
102
What can we do in a key-value datastore?
Read, update and delete a value addressable by a key
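Sketched with a Python dict standing in for the datastore (keys and values are made up):

```python
store = {}
store["user:42"] = '{"name": "Ada"}'  # create / update a value under a key
value = store["user:42"]              # read it back by key
del store["user:42"]                  # delete it
print("user:42" in store)             # False
```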
103
What is the difference between relational and columnar DBs?
Relational DBs are optimized for row operations, while columnar DBs are optimized for column operations.
104
What is the difference between a relational and columnar DB when we want to query only the age attribute of records?
In a relational DB, the engine will run over all rows while scanning different fields of different dtypes, and will filter the results per the query. In a columnar DB, the query will only be run on the table or table family containing the attribute of one dtype.
105
What is Apache Cassandra?
- a columnar data store - managed and licensed under the Apache Foundation umbrella - originated at Facebook - open-source project - written in Java - largely distributed NoSQL datastore
106
What is a DB called in Cassandra?
keyspace | space of partition-keys + the rules for their replication and placement ## Footnote That emphasis on keys (how rows are located) and space (the entire ring of nodes they inhabit) is why the Cassandra authors—and later the CQL spec—stuck with keyspace rather than re-using the overloaded term database.
107
What are the two types of information present in graph datastores?
nodes and edges
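A minimal adjacency-list sketch of nodes and directed edges (the example edges are invented):

```python
edges = [("alice", "bob"), ("bob", "carol")]

graph = {}  # node -> list of neighbouring nodes
for src, dst in edges:
    graph.setdefault(src, []).append(dst)

print(graph.get("bob"))  # ['carol']
```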
108
What is an example of a graph data store?
neo4j
109
How to import the neo4j module in Python?
from neo4j import GraphDatabase
110
What are graph DBs suitable for?
highly interconnected data
111
Give 2 examples for each NoSQL DB type.
key-value: Redis, DynamoDB; document: MongoDB, CouchDB; columnar: Cassandra, CosmosDB; graph: neo4j, InfoGrid
112
In what type of environment can NoSQL DBs specifically support developers?
where requirements change frequently
113
3 most popular relational DBs per db-engines.com in June 2025
Oracle, MySQL, Microsoft SQL Server
114
3 most popular multi-model DBs per db-engines.com in June 2025
Oracle (relational, document, graph, RDF, spatial, vector DBMS), MySQL (relational, document, spatial), Microsoft SQL Server (relational, document, graph, spatial)
115
3 most popular document stores per db-engines.com in November 2025
* MongoDB * Databricks * Amazon DynamoDB
116
3 most popular key-value stores per db-engines.com in June 2025
Redis, Memcached, etcd
117
3 most popular wide column stores per db-engines.com in June 2025
Apache Cassandra, Apache HBase, ScyllaDB
118
3 most popular dedicated Search Engines (NoSQL database management systems dedicated to the search for data content) per db-engines.com in November 2025
Splunk, Apache Solr, Algolia
119
3 most popular DBMSs with a database model of **graph** per db-engines.com in February 2025
Neo4j, Microsoft Azure Cosmos DB, Aerospike
120
3 most popular time series DBMS per db-engines.com in June 2025
InfluxDB, Prometheus, Graphite
121
3 most popular spatial DBMS per db-engines.com in June 2025
PostGIS, SpatiaLite, GeoMesa
122
3 most popular vector DBMS per db-engines.com in June 2025
* Pinecone – proprietary vector DBMS: plug-and-play cloud service that scales embeddings search
* Milvus – Apache 2.0 vector DBMS: open-source muscle that powers billion-vector AI workloads
* Qdrant – Apache 2.0 vector DBMS: quick Rust engine for semantic search with filter-friendly payloads
123
3 most popular multivalue DBMSs (systems which store data in tables, however they can assign more than one value to a record's attribute) per db-engines.com in June 2025
Adabas | UniData, UniVerse ## Footnote D3
124
What does HDFS do?
It manages the distribution and storage of data on its various nodes.
125
What is MapReduce suited for in the context of the Hadoop ecosystem?
retrieving and filtering data stored across data nodes, as well as performing other processing tasks on the data
126
Name elements of the Hadoop ecosystem.
HDFS; MapReduce; YARN; HBASE; Sqoop; Storm; Pig, Hive, Spark for distributed programming; Oozie for scheduling; Apache Ambari for system deployment; and Zookeeper
127
What is HDFS connected to in the Hadoop ecosystem?
DB, Linux/Windows file system, HBASE, MapReduce – YARN
128
What is HDFS made up of?
several slave nodes and one master node
129
What does the map function do?
It accepts a key/value pair as input, and outputs a list of key/value pairs.
130
What does the reduce function do?
Takes key/value pairs created by e. g. a map function, and aggregates the values for each key, for example by taking the average, the sum or the max of the values per key.
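The two functions above can be sketched in plain Python as a word count, with the shuffle step simulated by grouping (a simplified model, not Hadoop's actual API):

```python
from collections import defaultdict

def map_fn(key, line):
    # One (key, value) pair in, a list of (word, 1) pairs out.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Aggregate all values collected for one key.
    return key, sum(values)

lines = ["big data", "big systems"]
grouped = defaultdict(list)           # simulated shuffle: group by key
for i, line in enumerate(lines):
    for k, v in map_fn(i, line):
        grouped[k].append(v)

result = dict(reduce_fn(k, vs) for k, vs in grouped.items())
print(result)  # {'big': 2, 'data': 1, 'systems': 1}
```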
131
What is YARN an abbreviation for?
Yet Another Resource Negotiator
132
What does YARN do in Hadoop?
It allocates resources to different applications, manages the data flow between applications, and monitors the overall health of the system.
133
What is Tez?
The default compute engine of Hadoop since version 2.
134
What is Apache Pig in the context of the Hadoop ecosystem?
A now-retired high-level platform that used its own scripting language, Pig Latin, to express data transformations compiled into MapReduce (or later Tez) jobs on Hadoop/YARN. | It simplified batch ETL for developers who found raw MapReduce verbose. ## Footnote Largely superseded by Spark and Hive.
135
What is the language used to write Pig scripts?
Pig Latin
136
What is Hive designed to do in the Hadoop stack?
to enable the querying of data in HDFS with SQL statements
137
What is HBASE?
a column-oriented NoSQL DB
138
What is Apache Kafka?
a message broker that helps to manage large volumes of streaming data in a Pub/Sub system
139
What does Kafka use to ensure that messages are not lost in transaction?
an immutable commit log
140
What is Ambari?
a management platform w/ a graphical interface for Hadoop
141
What does Ambari allow to monitor?
the health of the Hadoop system, such as individual nodes, the storage capacity of HDFS, the compute resource usage of YARN and the ZooKeeper status
142
What does HDFS do with the collected data?
It splits the data into many bricks or blocks, replicates them n times, and distributes them across several cluster nodes for parallel processing.
143
What is the job of the active NameNode in HDFS?
It establishes and maintains a distribution map of all files stored in HDFS.
144
How does the active NameNode keep up-to-date information on the HDFS file system?
Each DataNode keeps the active NameNode informed of its state.
145
When is it said that a system has a single point of failure?
If the failure of a single component of a system results in the failure of the entire system.
146
Where is the metadata of the active NameNode regularly saved to achieve high fault tolerance?
on a local file system as well as on a remote system
147
What is a standby NameNode in HDFS?
a duplicate of the active NameNode that receives the same metadata and logs, but does not act on the data nodes
148
What are the two core classes in Hadoop's Java API (org.apache.hadoop.fs) for working with HDFS?
FileSystem — abstract class for filesystem operations (open, create, list, delete, etc.) FileStatus — metadata container for files and directories (size, permissions, replication, timestamps)
149
What is the name of the class in the Java API Hadoop offers which informs the FileSystem and FileStatus classes on the configuration of the HDFS cluster?
Configuration
150
What objects take part in the HDFS reading mechanism?
* DataNode * DistributedFileSystem Object * FSDataInputStream Object * HDFS client * NameNode
151
What is a stub in programming?
an object or a short piece of substitute code that holds predefined data or functionality and uses it to answer calls during tests instead of using the more complex system it represents
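A minimal stub sketch (the class and values are invented for illustration):

```python
class WeatherApiStub:
    """Stands in for a real HTTP client during tests."""
    def get_temperature(self, city):
        return 21.5  # canned answer, no network call

def report(api, city):
    return f"{city}: {api.get_temperature(city)} °C"

print(report(WeatherApiStub(), "Berlin"))  # Berlin: 21.5 °C
```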
152
What is the first step of writing a file in HDFS?
The client creates a local **DistributedFileSystem** **object** that acts as a stub or proxy to the HDFS.
153
What does "ack" stand for in the context of the pipeline mechanism in writing to HDFS?
acknowledge
154
What happens next in the HDFS write pipeline when the writing on the last replica is finished?
An acknowledgment message goes up to the node of the previous replica until reaching the node of the first replica.
155
What happens next in the HDFS write pipeline after the acknowledgment message has reached the node of the first replica?
This node then returns an acknowledgment to the specialized writer.
156
What to import in Python to work with HDFS as the file store?
hdfs, pyarrow.fs, hdfs3, or fsspec for day-to-day HDFS work in Python
157
Name three scenarios when the need for data replication can arise.
Replication Factor is changed, Data Blocks get corrupted, DataNode goes down
158
What is Spark used for?
for processing big data
159
Why is Spark better than YARN?
1. It is a standalone program, not a cluster management tool. 2. It is much faster when starting up, because it does not need to initialize the entire cluster. 3. It has a more user-friendly interface.
160
What does RDD stand for in the context of Spark?
Resilient Distributed Datasets
161
What are the two ways the Spark API provides to operate on RDDs?
1. transformations (operations that create a new RDD from an existing one) 2. actions (operations that return a result to the caller)
162
What is the master node responsible for in a Spark cluster?
managing the Spark cluster and distributing the workload across the slave nodes
163
What are elements of the Spark architecture?
- driver (SparkContext) - cluster manager - workers (Executors)
164
What are the key components of Spark?
Spark core, Spark SQL, Spark Streaming, MLLib and GraphX
165
What are the two ways we can process data with Spark Streaming?
in mini-batches or as it arrives
166
What is GraphX in the context of Spark?
a library for manipulating graphs and performing graph analytics in Spark
167
What are the advantages of Spark over other big data processing systems?
1. Spark can be run on a single machine or on a cluster. 2. It can process data in memory or on disk. 3. It has built-in libraries for data processing. 4. It is designed to be compatible with Hadoop. 5. It can read and write data in HDFS, and it can run on the same clusters as Hadoop. 6. It can process data faster than Hadoop MapReduce, and it can use more memory than Hadoop.
168
What can Spark be used in Python for?
data analysis, machine learning, streaming applications
169
What is the Spark Shell?
an interactive tool that allows us to run Spark jobs
170
How to read a simple text file into an RDD with PySpark?
RDD = sc.textFile("hdfs:/share/data.txt")
171
What is each line of a read file called in MapReduce and PySpark?
record | talking about classic MapReduce jobs or low-level RDD code (key/value) ## Footnote When we are talking about Spark SQL, DataFrames or anything that ends up looking like a table: we say row (it emphasises columns and schema).
172
What is an action in the context of PySpark?
an operation on an RDD/DataFrame that triggers the computation of its lineage and either (a) returns a result to the driver program or (b) persists the result externally (e.g., writing to storage)
173
How to apply an aggregation function on an RDD in PySpark?
`result = RDD.reduce(lambda a, b: a + b)` | reduce() requires a commutative and associative binary function as its argument
174
What is the difference between the map() function's implementation in plain Python and PySpark?
In plain Python, map() is a free function that takes a function and an iterable, while in PySpark it is a method of the RDD class. PLAIN PYTHON ``` nums = [1, 2, 3, 4]
doubled = list(map(lambda n: n*2, nums)) ``` PYSPARK ``` nums = sc.parallelize([1, 2, 3, 4])
doubled = nums.map(lambda n: n*2) ```
175
How do we group the values of a pair RDD by key, apply a function to each group, and display the result in PySpark?
`print(RDD.reduceByKey(lambda a, b: a + "-" + b).collect())` ## Footnote collect() returns data; print() displays it
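Conceptually, reduceByKey groups values by key and folds each group with the given function. A plain-Python sketch of those semantics (real Spark additionally shuffles the pairs across partitions):

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    groups = defaultdict(list)
    for key, value in pairs:          # group all values that share a key
        groups[key].append(value)
    return [(k, reduce(func, vs)) for k, vs in groups.items()]

pairs = [("ICE", "Berlin"), ("RE", "Bonn"), ("ICE", "Hamburg")]
print(reduce_by_key(pairs, lambda a, b: a + "-" + b))
# [('ICE', 'Berlin-Hamburg'), ('RE', 'Bonn')]
```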
176
What is DASK?
a distributed computing system that helps analyze and process large data sets
177
What are the two libraries DASK is built on?
Dask.distributed and Dask.array
178
What does the dask.distributed library provide?
the basic infrastructure for distributed computing | (communication between nodes, scheduling of tasks, fault tolerance)
179
How can we parallelize the computation of a classifier ML model using Dask in Python?
``` # sketch — assumes X_train, y_train, X_test, y_test already exist
from dask.distributed import Client
import joblib
from sklearn.linear_model import LogisticRegression

client = Client(n_workers=4)           # local Dask cluster with 4 workers
model = LogisticRegression()
with joblib.parallel_backend("dask"):  # scikit-learn parallelism runs on Dask
    model.fit(X_train, y_train)
score = model.score(X_test, y_test) ```
180
What are streaming frameworks applied for?
the processing of data streams
181
What does processing in batches imply?
That data are collected over a given time window, or up to a given file size, before they are processed.
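A minimal plain-Python sketch of size-based batching (hypothetical `batches` helper): records accumulate until the batch is full, then the whole batch is handed off for processing.

```python
# Collect records into fixed-size batches; flush the final partial batch.
def batches(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch               # one full batch ready for processing
            batch = []
    if batch:                         # leftover records at end of stream
        yield batch

print(list(batches(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

A time-window variant would flush on a timer instead of on `len(batch)`.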
182
What risk does processing in batches bear?
that the data may already be outdated by the time they are processed
183
What is Apache Spark?
a multi-language engine for executing data engineering, data science and machine learning tasks on single-node machines or clusters
184
How does Spark Streaming achieve near-real time analysis?
Instead of large batch jobs, micro-batches are processed, mimicking the continuous stream of data.
185
What is an extra abstraction Spark Streaming adds to the functionalities of Spark Core?
the Discretized Stream (DStream), which is a programming model to operate on the data present in the stream | Internally, a DStream is implemented as an ordered sequence of RDDs, one per batch interval ## Footnote DStream is literally a class, but the term 'programming model' is emphasised to focus on thinking and coding in streaming terms while Spark silently handles micro-batch creation, scheduling, checkpointing, and fault-recovery. That combination of abstraction + operators + guarantees is what we usually mean by a programming model in distributed-systems literature.
186
What are resilient distributed datasets?
RDDs are Apache Spark's foundational data abstraction: immutable, fault-tolerant collections of objects, partitioned across cluster nodes for parallel in-memory processing. | Resilient: lost partitions are recomputed from the lineage graph (no replication needed) ## Footnote Elements can be of any type. Transformations (map, filter, join) are lazy; they only execute when an action (count, collect) triggers computation.
187
What is PySpark?
an interface in Python that enables us to write Spark applications with Python APIs | a Python library that exposes the Spark programming model through Python ## Footnote Calling it an interface (instead of a library/module) is the more precise description, because the heavy lifting happens inside the Apache Spark engine that runs in the Java Virtual Machine (JVM).
188
How do we create an RDD in PySpark from data that exists in a text file?
``` from pyspark import SparkContext

sc = SparkContext('local', 'MyFirstExample')
trainData = "./data/example.txt"
trains = sc.textFile(trainData) ``` | or trainData = "hdfs:///user/alice/train/*.txt" ## Footnote 'local' is the master URL that tells Spark to start in local mode instead of talking to a real cluster manager (YARN, Kubernetes, …)
189
How do we construct a filter function for the trains RDD in PySpark, if we want a subset of all trains that are not ICE?
`trains.filter(lambda x: 'ICE' not in x)`
190
How do we print the data from the trains RDD?
- trains.collect() – should be used with care - trains.take(3) | collect() ends the distributed computation and pulls all the data into the single driver process. ## Footnote Need a peek during development? Use trains.take(10) or pprint(10) on the DStream equivalent. Need the whole dataset locally? Only collect() if you are sure it fits in driver memory. In production streaming jobs always favour foreachRDD or Structured Streaming writers so you can push data to an external sink instead of clogging the driver logs.
191
What is the relationship between the object types DStream and RDD?
A DStream equals a sequence of RDDs.
192
How do we set up a filtered myTrains DStream in PySpark?
``` from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext('local', 'MyFirstExample')
ssc = StreamingContext(sc, 1)
myDStream = ssc.textFileStream('./data')
myTrains = myDStream.filter(lambda x: 'ICE' not in x) ``` ## Footnote The value of 1 here means “build and process a new RDD every one second.” In the example the directory ./data will be polled once a second and every file that landed since the previous poll becomes part of that batch.
193
What are the two main categories of sources in Spark Streaming?
basic sources (file systems, socket connections) and advanced sources (Kafka, Kinesis) | sockets: live app logs, network telemetry, IoT gateway, chatlike feed ## Footnote We choose socketTextStream for quick, low-volume prototypes or when an existing system already speaks raw TCP.
194
How does the socket source behave in Spark Streaming?
Like a TCP client, and it is implemented as a receiver-based source. It connects to a TCP server on a network location that can be identified by its host-ip:port combination.
195
Provide PySpark code for creating Kafka direct stream.
``` # PySpark 3.5+ — Structured Streaming Kafka source ("direct" stream)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

spark = (SparkSession.builder
    .appName("OrdersFromKafka")
    .config("spark.sql.shuffle.partitions", "4")  # tune
    .getOrCreate())

# 1. Create the direct Kafka stream as a streaming DataFrame
kafka_raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "orders-v1")        # or subscribePattern
    .option("startingOffsets", "earliest")   # or "latest"
    .load())

# 2. Deserialize the key/value and apply any business logic
schema = (StructType()
    .add("customer_id", StringType())
    .add("amount", DoubleType())
    .add("ts", StringType()))

orders = (kafka_raw
    .selectExpr("CAST(key AS STRING) AS order_id",
                "CAST(value AS STRING) AS json")
    .select("order_id", from_json(col("json"), schema).alias("data"))
    .select("order_id", "data.*"))

# 3. Write the results somewhere (console here; could be Delta, Kafka, etc.)
query = (orders.writeStream
    .format("console")                       # change to "kafka", "delta", …
    .option("truncate", "false")
    .option("checkpointLocation", "/tmp/ckpt-orders-v1")
    .outputMode("append")
    .start())

query.awaitTermination() ```
196
What was Apache Bahir?
a repository providing extensions such as streaming connectors as well as SQL data sources
197
Name 3 examples of built-in output operations Spark Streaming provides for DStream objects.
1. DStream.pprint(num=10) – prints the first 10 elements of the DStream at every streaming interval 2. DStream.saveAsTextFiles(prefix, suffix) – outputs to a file-based sink with the pattern prefix-&lt;time in ms&gt;.suffix 3. DStream.foreachRDD(func) – performs the function provided as an argument on every RDD of the stream
198
What can we replace the map-reduce function of Hadoop with?
for example Apache Spark batch-processing, or micro-batching with Spark Streaming
199
Name 10 elements of a possible big data architecture.
data sources, Kafka, Spark Streaming, Apache Spark, Presto, Hive, YARN, HDFS, Job Scheduler, ERP, CRM, DWH, Tableau
200
What layers are present in the exemplary big data architecture?
data sources, messaging layer, Hadoop framework, export functions (dashboards etc.)
201
What areas does an organization need to decide on when building a data lake using Spark?
1. storage system – HDFS or cloud object store (e. g. S3) 2. file format – structured (Parquet, ORC), semistructured (JSON), unstructured (audio, video)
202
At which company was Kafka developed?
LinkedIn
203
What is Kafka at its core?
a message queue system
204
What are the main high-level use cases of Kafka?
1. moving all occurring event data to a CDWH 2. a central hub for working w/ data in general making use of the persisting of data 3. an interface for software services to communicate with each other
205
Compare the key features of Spark Streaming and Kafka: data store, processing type, latency, supported languages.
1. SS is based on Spark clusters, HDFS or similar. <-> Kafka offers a Java library, Kafka Streams. 2. SS is micro-batching, Kafka is event-at-a-time processing. 3. SS is higher latency, Kafka is low. 4. SS supports Python, Java, Scala and R; Kafka Streams is limited to Java and Scala, but a REST Proxy connection lets a large number of clients use it.
206
What are the main high-level components of a Kafka architecture?
producers, Kafka cluster w/ brokers and an optional ZooKeeper, consumers
207
What are messages sorted by in Kafka?
topics
208
What do we see inside a Kafka cluster?
brokers (controllers, leads, followers), topics, partitions, replications, an optional zookeeper
209
What is ZooKeeper?
a replicated coordination service that offers configuration, naming, synchronization, and group-membership primitives — delivered through a simple, strongly-consistent data tree API
210
What is an advantage of Kafka's structured immutable commit log?
Old messages remain readable even after updates to the way data is structured in the system.
211
What is offset in the context of Apache Kafka?
a sequential number assigned to each new message appended to the end of a topic partition | 64-bit signed integer that the broker assigns to every record ## Footnote Kafka’s built-in pointer that preserves order, enables fault-tolerant progress tracking, and underpins every consumption guarantee the platform offers
212
What does the storage of the current offset number of each consumer allow in Kafka?
for the consumers to be able to continue working from the last message on, so no data is lost, and info is not skipped even if the connection is interrupted
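A toy sketch of this mechanism (hypothetical `ToyTopic`/`ToyConsumer` classes, not the real Kafka client API): the log is append-only, and the consumer stores the offset of the next message to read, so it can resume after a disconnect without losing or skipping data.

```python
class ToyTopic:
    def __init__(self):
        self.log = []                      # append-only commit log

    def produce(self, message):
        self.log.append(message)
        return len(self.log) - 1           # offset assigned to the message

class ToyConsumer:
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0                    # next offset to read

    def poll(self):
        messages = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)  # commit progress
        return messages

topic = ToyTopic()
consumer = ToyConsumer(topic)
topic.produce("a"); topic.produce("b")
print(consumer.poll())   # ['a', 'b']
topic.produce("c")       # arrives while the consumer is "offline"
print(consumer.poll())   # ['c'] — resumes from the stored offset
```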
213
What do producers and consumers do in the Kafka architecture?
Producers write, while consumers read.
214
What are the main components/frameworks/libraries of the Kafka ecosystem?
Apache Kafka Connect, Apache Kafka Streams, Confluent Schema Registry, Confluent REST Proxy
215
What does the Confluent Schema Registry do in the Kafka ecosystem?
It decouples producers and consumers at the data level. A consumer can use the registry to retrieve the schema before the data is processed, in order to validate it. Schemas are defined in JSON.
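A hand-rolled sketch of the idea (assumed field names; the real Schema Registry serves schemas over HTTP and commonly stores Avro schemas, which are themselves written in JSON):

```python
import json

# Hypothetical JSON-encoded schema; a consumer validates each record
# against it before processing.
schema_json = '{"fields": {"customer_id": "str", "amount": "float"}}'
schema = json.loads(schema_json)
TYPES = {"str": str, "float": float}

def is_valid(record, schema):
    fields = schema["fields"]
    return set(record) == set(fields) and all(
        isinstance(record[name], TYPES[type_name])
        for name, type_name in fields.items()
    )

print(is_valid({"customer_id": "42", "amount": 9.99}, schema))  # True
print(is_valid({"customer_id": "42"}, schema))                  # False
```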
216
What is being done over the HTTP protocol in Confluent REST Proxy?
Producers and consumers are sending REST commands to the proxy, which converts them into Kafka commands, and sends them to Kafka.
217
How to create a topic in a shell with Kafka?
`./bin/kafka-topics.sh --create --topic bigdata --partitions 1 --replication-factor 1 --bootstrap-server localhost:9092`
218
With which flag can we look at a topic in Kafka?
`--describe` | e. g. ## Footnote ``` ./bin/kafka-topics.sh \ --bootstrap-server broker1:9092,broker2:9093,broker3:9094 \ --describe \ --topic orders-v1 ```
219
How do we write a message to the topic bigdata in Kafka?
echo "Helló-belló" | ./kafka-console-producer.sh --topic bigdata --bootstrap-server localhost:9092
220
How do we consume messages in Kafka for example?
``` timeout 10 ./bin/kafka-console-consumer.sh \ --topic bigdata \ --from-beginning \ --bootstrap-server localhost:9092 ``` | --from-beginning = start reading at offset 0 for every partition ## Footnote The little timeout 10… wrapper is there to keep a demo or self-test from hanging forever once it has been proven that the consumer actually works.
221
What are the two main modes to consume messages in Kafka?
1. with timeout - output a certain batch or all 2. without timeout - printing out messages once they are produced
222
What are advantages of Spark Streaming over Kafka?
1. SS is optimized for a wide range of applications 2. can be integrated w/ different frameworks and connected to Spark 3. SS's DStreams can be used in parallel to batch processing w/ Apache Spark in a Hadoop env
223
How can data in a key-value datastore be aggregated in a straightforward way?
Data aggregations in these datastores are straightforward because they are operations applied to all the values that share a common key.

| Key | Value type | What’s stored | Example aggregates |
| --- | --- | --- | --- |
| `login:42` | Integer counter | Total log-ins | `INCR login:42` → count |
| `login:42:ts` | List | All timestamps | `LLEN` (count), `LRANGE -1 -1` (latest) |
| `login:42:ip` | Set | Distinct IPs | `SCARD` (distinct count), `SISMEMBER` |
| `login:42:geo` | Sorted set (score = timestamp) | Geo-hashes over time | `ZCOUNT` between two dates |
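The same idea in plain Python, one dict per aggregate (hypothetical login data; the Redis commands in the table map directly onto these operations):

```python
from collections import defaultdict

# Each event is (key, timestamp); every aggregate is keyed the same way.
logins = [("user:42", "10:01"), ("user:7", "10:02"), ("user:42", "10:05")]

counts = defaultdict(int)       # like INCR per key
timestamps = defaultdict(list)  # like a list value per key
for key, ts in logins:
    counts[key] += 1
    timestamps[key].append(ts)

print(counts["user:42"])          # 2
print(timestamps["user:42"][-1])  # '10:05' — most recent login
```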
224
How is big data relevant to the financial industry?
The global financial markets produce a very large amount of mostly numeric data every day. This includes stock market prices at various stock exchanges around the globe, as well as transaction systems such as SWIFT. Peaks and valleys in system load exist due to markets being closed for hours each day. Cryptocurrencies are a more recent appearance that rely on blockchain technologies, where each transaction is stored as part of the blockchain as immutable information.
225
Step-by-step explain what the following command does when entered into the cloud console of neo4j: MATCH p=(n:Ticket)-[]-() RETURN p
The command matches every path p from a node with the label 'Ticket', via a relationship of any type (the empty square brackets), to any other node (the empty round parentheses), and returns the matched paths. In the neo4j cloud console, this result is rendered as a visual graph of the 'Ticket' nodes and all their relationships.