Big Data Technologies Flashcards

to study big data technologies (225 cards)

1
Q

the 4Vs of data

A

volume, velocity, variety, veracity

(how much data is stored, how fast can it be accessed, what kind of data is stored, what is the quality and accuracy of the data)

2
Q

volume of big data

A

the size of the data that is stored and available for access and processing

3
Q

max. handling capacity of a single server in 2025

A

up to around one Petabyte

(anything above this needs to be stored on a distributed system)

4
Q

abbreviation and storage space of Kilobyte

A

KB, 1000 B, 8000 bits

1 kibibyte (KiB) is 1024 B (2¹⁰)

5
Q

units of data volume from and above gigabyte

A

gigabyte, terabyte, petabyte, exabyte, zettabyte, yottabyte (in ascending order, each decimal step ×1,000; the binary units GiB, TiB, etc. step by ×1,024)
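As an illustrative sketch (assuming decimal 1,000-byte steps; the `human` helper is hypothetical), the unit ladder can be walked in Python:

```python
# Convert a raw byte count to a human-readable unit label.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human(n_bytes, step=1000):
    i = 0
    while n_bytes >= step and i < len(UNITS) - 1:
        n_bytes /= step
        i += 1
    return f"{n_bytes:.1f} {UNITS[i]}"

print(human(3_500_000_000_000))  # 3.5 TB
```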

6
Q

transfer speed (definition, unit)

A

a measure of how much data is transferred per time unit, e.g. per second; bit/s (bps), Kbit/s (Kbps), Mbit/s (Mbps), Gbit/s (Gbps)
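A back-of-the-envelope sketch (numbers are hypothetical) shows how size and speed combine into transfer time:

```python
# Transfer time = data size / transfer speed (decimal units assumed).
size_bits = 500 * 10**9 * 8   # 500 GB expressed in bits
speed_bps = 1 * 10**9         # 1 Gbps link
seconds = size_bits / speed_bps
print(seconds, "s =", round(seconds / 3600, 2), "h")  # 4000.0 s = 1.11 h
```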

7
Q

What can we do when the slowest component of the server (usually the disk) reaches its transfer speed limit?

A

We can accelerate the workload by implementing a distributed system, so we can have data written to many servers across a server pool.

8
Q

response time (definition, unit)

A

the time it takes for a database to respond to an access or storage request, ms

9
Q

What is the use of messages in big data?

A

They are used for transmitting data in low-velocity IoT applications before it is ingested into DB systems.

(Messaging systems come with a queue to offer support for scenarios with unstable internet connection. Newly accumulated data is added to the queue, where it is stored until the device comes back online.)

10
Q

FIFO

A

first-in, first-out methodology
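A minimal FIFO sketch with Python's standard-library deque (message names are illustrative):

```python
from collections import deque

q = deque()
q.append("msg-1")    # enqueue: newest message goes to the back
q.append("msg-2")
first = q.popleft()  # dequeue: oldest message leaves first
print(first)         # msg-1
```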

11
Q

eventual consistency

A

A concept of improving DB read and write performance. When it is applied, the data is initially only written to one node, then replicated across others.

<-> strong consistency

-> Data can’t be assumed to be up to date in a distributed DB.

12
Q

What does variety describe in big data?

A

the different types of data present: structured, semi-structured, unstructured

13
Q

What kind of data can cause veracity issues in the field of big data?

A

inconsistent, untrusted, raw/uncleansed, biased, incomplete etc.

14
Q

What dimensions do extensions to the 4Vs contain?

A

variability, exhaustivity, fine-grained, relationality, resolution & indexicality, extensionality & scalability, value

15
Q

data mining in big data

A

the process of finding, extracting and processing data

16
Q

psycopg

A

the most popular PostgreSQL adapter for the Python programming language

17
Q

tweepy

A

a Python library that enables easy Twitter API access

18
Q

Twitter OAuthing in Python

A
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

After the first line, the auth variable refers to an instance of the class tweepy.OAuthHandler.

19
Q

What can we use the api object resulting from api = tweepy.API(auth) in Python for?

A

to authenticate search requests against the API

20
Q

How do I paginate through tweets by hashtag and start date using tweepy v2?

A

from datetime import datetime, timezone
import os, tweepy, mysql.connector

# ---------- 1. Authenticate ----------------------------------------
client = tweepy.Client(
    bearer_token=os.environ["X_BEARER_TOKEN"],
    wait_on_rate_limit=True,
)

# ---------- 2. Build query ------------------------------------------
query = "#InterestingHashtag lang:en -is:retweet"

# Earliest date wanted, inclusive (UTC)
start_time = datetime(2024, 1, 1, tzinfo=timezone.utc)

# ---------- 3. Choose endpoint by access tier -----------------------
# search_all_tweets    -> Full Archive (Academic -- effectively dead)
# search_recent_tweets -> Last 7 days (Basic $100/mo+, max 100/page)
search_method = client.search_recent_tweets

# ---------- 4. Iterate with Paginator -------------------------------
db_params = dict(host="localhost", database="mydb",
                 user="root", password="secret")
sql = "INSERT IGNORE INTO tweets(id, text) VALUES (%s, %s)"
BATCH = 200

cxn = mysql.connector.connect(**db_params)
cur = cxn.cursor()
try:
    paginator = tweepy.Paginator(
        search_method,
        query=query,
        start_time=start_time.isoformat(),  # YYYY-MM-DDTHH:MM:SS+00:00
        tweet_fields=["id", "text"],
        max_results=100,  # 10-100 for recent search
    )
    count = 0
    for tweet in paginator.flatten(limit=None):
        cur.execute(sql, (tweet.id, tweet.text))
        count += 1
        if count % BATCH == 0:
            cxn.commit()
    cxn.commit()  # flush remainder
finally:
    cur.close()
    cxn.close()

21
Q

pub/sub

A

a system of publishers (IoT edge devices) and subscribers (brokers that make data available to clients)

22
Q

What does a message broker do in Industry 4.0?

A

It handles delivery of messages after periods of downtime on the subscriber’s end.

23
Q

What are the components of the pub/sub pattern?

A

IoT machinery -> message broker -> DB

24
Q

SWIFT

A

Society for Worldwide Interbank Financial Telecommunication

(payment processing system acting as a messaging service)

25
Data storage need of CT scan results
Ca. 22 MB per uncompressed image file. If 1000 slices are captured, we will have 22 GB per CT scan. If we do high-res scans, it is up to 350 GB per CT scan.
26
What additional traits do airborne and satellite imagery have in terms of big data when compared with CT imagery?
time slices, geospatial references
27
What does INSPIRE abbreviate in the context of geospatial data?
Infrastructure for Spatial Information in Europe
28
difference between flat and hierarchical data
flat is tablelike (spreadsheet), hierarchical is treelike (there are relationships represented between records)
29
velocity of streamed and batch-processed data
streamed is high, batch-processed is low
30
What is a spatial join used for?
to determine the relationship between geodatasets ## Footnote (e. g. the latitude and longitude of a phone's location is joined w/ a polygon that represents a tower's coverage area)
31
examples for structured data
numeric sensor data, birthdates in a customer DB, addresses, names, e-mail addresses, spatial coordinates, phone numbers
32
What are characters encoded into in UTF-8 and ASCII?
sequences of 8-bit bytes
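A quick sketch of the difference: ASCII covers only 7-bit characters, while UTF-8 spends extra bytes on anything beyond them (the example string is arbitrary):

```python
text = "héllo"
ascii_ok = "hello".encode("ascii")  # pure ASCII: one byte per character
utf8 = text.encode("utf-8")         # 'é' needs two bytes in UTF-8
print(len(text), len(utf8))         # 5 6
```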
33
binary encoding values for an 8-bit integer (index, value)
(1, 128), (2, 64), …, (7, 2), (8, 1)
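These positional weights can be generated and applied in Python (the bit pattern is chosen arbitrarily for illustration):

```python
weights = [2 ** (8 - i) for i in range(1, 9)]
print(weights)  # [128, 64, 32, 16, 8, 4, 2, 1]

bits = [1, 0, 1, 0, 0, 0, 1, 1]  # 0b10100011
value = sum(w for w, bit in zip(weights, bits) if bit)
print(value)  # 163
```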
34
How to add type hint 'float' for a temperature variable in Python?
temperature: float = 62.8
35
What is required in Java regarding the data type when a new variable is defined?
It needs to be set explicitly.
36
data types in PostgreSQL
text, integer, decimal, bool, date, money, double precision, char (a single character) + geodata, IP address, UUID, CIDR, MAC address etc.
37
What is the max storage size of a field of data type text in PostgreSQL?
1GB
38
examples of semi-structured data
XML, emails, files in markup languages, HTML data
39
examples of unstructured data
image, video, audio, binary
40
Why were NoSQL DBs developed?
Because SQL-based DBs cannot handle unstructured data.
41
What is GridFS?
a specification used by MongoDB to store files larger than 16MB ## Footnote GridFS is implemented entirely inside MongoDB: each file is sliced into 255 kB chunks that are written as separate BSON documents in a collection called fs.chunks; per-file metadata lives in fs.files. MongoDB’s drivers then give us helper methods that feel like open(), read(), seek(), write(), but under the hood we are still doing normal database reads and writes.
42
Which method of storing unstructured data in a DB is less error-prone, storing it as a BLOB or using file system path?
BLOB
43
How do we create the MongoClient instance in Python?
from pymongo import MongoClient
client = MongoClient('<>')
## Footnote the MongoDB URL is a connection string, something like "mongodb://[username:password@]host1[:port1][,host2[:port2],...[,hostN[:portN]]][/[database][?options]]"
44
How do we create a MongoDB database object in Python to reference the DB?
from pymongo import MongoClient
client = MongoClient('<>')
database = client.emails
45
Exemplify a .JSON-like object representation of an email message to be stored in MongoDB.
``` email_message = { 'to_line': 'example@iu.org', 'cc_line': '', 'from': 'example2@mit.org', 'subject': 'Important', 'body': 'anything you wish here to be added.', 'attachments': [ { 'filename': 'catphoto', 'extension': 'jpg', 'data': '…' }, { 'filename': 'dogphoto', 'extension': 'jpg', 'data': '…' } ] } ```
46
What could be Python code to find and delete all emails sent to 'example@iu.org' stored in MongoDB If we stored emails with saved = database.emaildata.insert_one(email_message)?
We can locate emails sent to the specified address with `example_emails = database.emaildata.find({'to_line': 'example@iu.org'})`. Then `deleted = database.emaildata.delete_many({'to_line': 'example@iu.org'})`
47
What is the definition of Big Data?
the creation, storage, and computation of huge amounts of data that frequently needs to be stored in distributed systems capable of handling a very large number of parallel read and write requests
48
What does veracity measure?
It measures how inconsistent, untrusted, raw/uncleansed, biased and incomplete the data might be. ## Footnote Because veracity is a quality continuum, we can equally talk about the high end of the scale: consistent (values agree across sources and over time), trusted (provenance is known; integrity checks pass), processed/cleansed (errors removed, formats normalised), unbiased (sampling and recording processes minimise systematic error), complete (few or no missing or truncated values) The underlying dimension being measured is still the same: How much can I rely on this dataset to reflect reality?
49
What does MAC stand for in MAC address?
medium/media access control
50
What needs to be done with data if we are to store it in a database?
It needs to be normalized to make sure it conforms to a standard.
51
How can we use the "with" statement in Python to open a text file with the open() function, and extract its content using the read() method, and print it out?
with open('example.txt') as f:
    content = f.read()
print(content)
52
What is pandas the abbreviation of?
panel data
53
How do we load the data in Islands.csv into a pandas df?
import pandas as pd
data = pd.read_csv("Islands.csv", sep=";")
54
What does pandas' read_csv function do with the first row?
It defines it as the header by default.
55
How do we add the column names manually to a headerless Islands_noheader.csv when extracting its data to a pandas df?
columns = ["Island", "Year", "Residents", "Capital", "Continent"]
data = pd.read_csv("Islands_noheader.csv", names=columns, sep=";")
56
How to read Islands_meta.csv to a pd df skipping the first four metadata rows, also making sure that the data is read as utf-8?
`data = pd.read_csv("Islands_meta.csv", sep=";", skiprows=range(0,4), encoding="utf-8")`
57
How to save a pd df as a csv?
data.to_csv("example.csv", sep=";")
58
What separates records from each other in JSON?
comma
59
How do we read a JSON file with the open() and read() functions, and load the string json_data to a Python dictionary?
import json
json_data = open('example.json').read()
data = json.loads(json_data)
60
Imagining that a JSON data file is in a nested structure, we decide to only access elements for the top-level key "main", for which all elements are in a flat format. How do we do that?
data["main"]
61
How do we normalize the elements of the data["main"] level to a pd?
pd.json_normalize(data["main"])
62
What does XML abbreviate?
eXtensible Markup Language
63
How do we specify the XML version used at the top of an XML file?
`<?xml version="1.0" encoding="UTF-8"?>`
64
Using Python's xml.etree.ElementTree, how do we print the name of every country in data.xml, given that each direct child of the root holds the country name in its second sub-element?
``` import xml.etree.ElementTree as ET tree = ET.parse('data.xml') root = tree.getroot() for child in root: print(child[1].text) ```
65
What does HDF abbreviate?
Hierarchical Data Format
66
What does an HDF5 file contain?
metadata, groups, datasets
67
How can we create an iu.h5 file with a new dataset in the shape of 4, 6 in Python?
import h5py
file = h5py.File('iu.h5', 'w')
dataset = file.create_dataset("iu", (4, 6))
68
How to query the shape, name and parent of a dataset with h5py?
dataset.shape, dataset.name, dataset.parent
69
How to generate random data in the shape of (4, 6) to be added to iu.h5 as iu_numbers using the h5py module?
``` import h5py import numpy as np with h5py.File('iu.h5', 'r+') as f: if "iu_numbers" not in f: dset = f.create_dataset("iu_numbers", (4, 6)) else: dset = f["iu_numbers"] data = np.random.rand(4, 6).round(2) dset[...] = data ```
70
How can I add custom metadata (attributes) to an h5py dataset and read it back?
In h5py we store metadata as attributes on files, groups, or datasets using the .attrs mapping: | *.attrs[...] is a dict-like interface for for metadata of HDF5 objects ## Footnote ``` import h5py import numpy as np with h5py.File("example.h5", "w") as f: data = np.arange(10) dset = f.create_dataset("mydata", data=data) # Add metadata (attribute) dset.attrs["user"] = "T" dset.attrs["description"] = "Toy example dataset" with h5py.File("example.h5", "r") as f: dset = f["mydata"] # Read a single attribute print(dset.attrs["user"]) # T # Or iterate over all attributes for key, value in dset.attrs.items(): print(key, value) ```
71
What does RLE stand for in the context of Parquet?
run length encoding
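A toy sketch of run length encoding itself, using only the standard library (this is the general idea, not Parquet's internal implementation):

```python
from itertools import groupby

def rle(values):
    # Collapse each run of equal values into a (value, run_length) pair.
    return [(v, len(list(g))) for v, g in groupby(values)]

print(rle(["a", "a", "a", "b", "b", "a"]))  # [('a', 3), ('b', 2), ('a', 1)]
```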
72
What are two configurations in Parquet that enable the optimization of files?
1. **row group size**, which allows the data to be chunked 2. **data page size**, which enables a single row lookup
73
How to create a pd df w/ 100 random numbers, convert it to a parquet table w/ 4 columns and write it to a parquet file?
``` import pyarrow.parquet as pq import pyarrow as pa import pandas as pd import numpy as np df = pd.DataFrame(np.random.rand(100).reshape(25, 4), columns=["one", "two", "three", "four"]) tableToWrite = pa.Table.from_pandas(df) pq.write_table(tableToWrite, "myPQFile.parquet") ``` | pa.Table.from_pandas(df) copies the df labels into the Arrow Schema
74
How to get a parquet file's content to a pd df?
``` import pyarrow.parquet as pq tableToRead = pq.read_table("example.parquet") tableToRead.to_pandas() ``` | pandas.read_parquet(path, columns=None, storage_options=None, **kwargs)
75
How to query metadata for a parquet file?
file = pq.ParquetFile("example.parquet")
file.metadata
file.schema
76
Why was Arrow created?
Arrow was built for in-memory analytics to enable the processing and moving of information in a big data environment with low overhead.
77
What is Cassandra?
a DBMS
78
Give examples of systems Arrow can connect.
Spark, Drill, Pandas, Impala, Parquet, HBase, Cassandra, Kudu
79
Who is the publisher of Arrow?
the Apache Software Foundation
80
Where is the data when we use Arrow?
in memory
81
What kind of storage model does Arrow use?
Apache columnar storage model
82
How can we load a 3-column pd df data to Arrow?
``` import pyarrow as pa import pandas as pd import numpy as np df = pd.DataFrame({ "one": [20, np.nan, 2.5], "two": ["january", "february", "march"], "three": [True, False, True]}, index = list('abc')) table = pa.Table.from_pandas(df) ``` ## Footnote When we convert the df to an Apache Arrow table (pa.Table.from_pandas(df)), pandas includes the index as metadata (or as a column if we set preserve_index=True). Either way, the row labels remain part of the dataset’s identity. In short, index=list('abc') is a neat shorthand for giving the three rows the labels 'a', 'b', and 'c'.
83
How do we get a pd df from an Arrow table read into memory from data/sales_2025-06.parquet?
`file_path = Path("data/sales_2025-06.parquet") sales_tbl: pa.Table = pq.read_table(file_path) sales_df: pd.DataFrame = sales_tbl.to_pandas( types_mapper=pd.ArrowDtype, preserve_index=False )`
84
What are the advantages of Arrow over dfs?
1. It contains extensible metadata information of the flat and nested types with the option to create user-defined types. 2. Better performance with DB or flatfile ingestion and export. 3. It is more efficient in processing very large datasets.
85
What kind of characters are contained in ASCII?
alphabetic, numeric, collating and non-printing
86
What language can you use Arrow with?
Arrow is language-agnostic
87
What does ASCII do?
It translates 0s and 1s to numbers between 0 and 255. ## Footnote Plain-vanilla ASCII maps 7-bit codes (0–127) to printable and control characters; the extra 128 possible values in an 8-bit byte belong to various extended or vendor-specific encodings, not to ASCII itself.
88
What are the 3 data models that dominated the 20th century?
relational, hierarchical, network
89
What structure does the hierarchical data model provide?
treelike ## Footnote “Tree-like” is the textbook keyword because it pinpoints the idea of a single root with parent-child branches—the defining property of the hierarchical data model.
90
What issue does the network model solve?
It is a generalization of the hierarchical model, and makes it easy to represent multiple parent nodes, which addresses the M:N relationship issues of the hierarchical model.
91
How do we access a node in the network data model?
We follow a path called the access path. | The model allows more than one parent (owner) per record. ## Footnote In the network model a node is accessed by navigating an access path of owner–member links that starts at a chosen root record.
92
Where does the relational data model place data?
In tables, simple structures comprising rows with each row having a set of attributes.
93
What are the main features of the relational data model?
PKs, FKs, table schemas
94
What issues w/ the relational data model give rise to NoSQL DBs in the early 2000s?
Tight and inflexible schema, issues with mapping between DB table column values and OOP objects, limited scalability options, and expensive operations to join tables.
95
the definition of index in the context of the relational data model
In a relational database, an index is a mapping (stored in a lookup structure) from the value of an indexed column (or column combination) to the physical location(s) of the corresponding row(s), allowing the DBMS to find those rows quickly without scanning the whole table. ## Footnote Under the hood an index—whether a B-tree, hash table, or another structure—stores two essential pieces of information: Key: the value(s) taken from the indexed column(s) Pointer: a reference to the physical row (or to a list of rows when duplicates exist)
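The key-to-row-location mapping can be mimicked with a plain dictionary (toy rows, not a real B-tree):

```python
rows = [("alice", 34), ("bob", 28), ("carol", 34)]

# Index on the age column: value -> list of row positions.
age_index = {}
for pos, (name, age) in enumerate(rows):
    age_index.setdefault(age, []).append(pos)

print([rows[i][0] for i in age_index[34]])  # ['alice', 'carol']
```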
96
With what promise did NoSQL emerge?
to address the weaknesses of the relational model, especially its inflexibility and low scalability
97
What does it mean that, in NoSQL, records are self-contained?
That it is possible to split the DB among a number of servers, if necessary, to balance the load and scale vertically and/or horizontally.
98
difference between vertical and horizontal scaling
vertical = adding resources to a single machine or server to cope with increased demand horizontal = adding more machines to the infrastructure to cope with high demand
99
What are the 4 main categories of NoSQL DBs?
key-value datastore, document datastore, columnar datastore, graph datastore
100
What is a hashmap?
a structure that stores several entries in which a value is mapped to a key ## Footnote Strictly speaking, it is a hash-based associative array: it hashes each key to an index, stores key–value pairs in buckets, and resolves collisions, giving O(1) expected-time lookups.
101
What does a key-value datastore look like?
{key-1} -> {value-1} {key-2} -> {value-2} … {key-n} -> {value-n}
102
What can we do in a key-value datastore?
Read, update and delete a value addressable by a key
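Sketched with a Python dict standing in for the datastore (keys and values are made up):

```python
store = {}
store["user:42"] = '{"name": "Ada"}'  # create / update a value under a key
value = store["user:42"]              # read it back by key
del store["user:42"]                  # delete it
print("user:42" in store)             # False
```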
103
What is the difference between relational and columnar DBs?
Relational DBs are optimized for row operations, while columnar DBs are optimized for column operations.
104
What is the difference between a relational and columnar DB when we want to query only the age attribute of records?
In a relational DB, the engine will run over all rows while scanning different fields of different dtypes, and will filter the results per the query. In a columnar DB, the query will only be run on the table or table family containing the attribute of one dtype.
105
What is Apache Cassandra?
- a columnar data store - managed and licensed under the Apache Foundation umbrella - originated at Facebook - open-source project - written in Java - largely distributed NoSQL datastore
106
What is a DB called in Cassandra?
keyspace | space of partition-keys + the rules for their replication and placement ## Footnote That emphasis on keys (how rows are located) and space (the entire ring of nodes they inhabit) is why the Cassandra authors—and later the CQL spec—stuck with keyspace rather than re-using the overloaded term database.
107
What are the two types of information present in graph datastores?
nodes and edges
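A minimal adjacency-list sketch of nodes and directed edges (the example edges are invented):

```python
edges = [("alice", "bob"), ("bob", "carol")]

graph = {}  # node -> list of neighbouring nodes
for src, dst in edges:
    graph.setdefault(src, []).append(dst)

print(graph.get("bob"))  # ['carol']
```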
108
What is an example of a graph data store?
neo4j
109
How to import the neo4j module in Python?
from neo4j import GraphDatabase
110
What are graph DBs suitable for?
highly interconnected data
111
Give 2 examples for each NoSQL DB type.
key-value: Redis, DynamoDB; document: MongoDB, CouchDB; columnar: Cassandra, CosmosDB; graph: neo4j, InfoGrid
112
In what type of environment can NoSQL DBs specifically support developers?
where requirements change frequently
113
3 most popular relational DBs per db-engines.com in June 2025
Oracle, MySQL, Microsoft SQL Server
114
3 most popular multi-model DBs per db-engines.com in June 2025
Oracle (relational, document, graph, RDF, spatial, vector DBMS), MySQL (relational, document, spatial), Microsoft SQL Server (relational, document, graph, spatial)
115
3 most popular document stores per db-engines.com in November 2025
* MongoDB * Databricks * Amazon DynamoDB
116
3 most popular key-value stores per db-engines.com in June 2025
Redis, Memcached, etcd
117
3 most popular wide column stores per db-engines.com in June 2025
Apache Cassandra, Apache HBase, ScyllaDB
118
3 most popular dedicated Search Engines (NoSQL database management systems dedicated to the search for data content) per db-engines.com in November 2025
Splunk, Apache Solr, Algolia
119
3 most popular DBMSs with a database model of **graph** per db-engines.com in February 2025
Neo4j, Microsoft Azure Cosmos DB, Aerospike
120
3 most popular time series DBMS per db-engines.com in June 2025
InfluxDB, Prometheus, Graphite
121
3 most popular spatial DBMS per db-engines.com in June 2025
PostGIS, SpatiaLite, GeoMesa
122
3 most popular vector DBMS per db-engines.com in June 2025
* Pinecone – proprietary vector DBMS: plug-and-play cloud service that scales embeddings search
* Milvus – Apache 2.0 vector DBMS: open-source muscle that powers billion-vector AI workloads
* Qdrant – Apache 2.0 vector DBMS: quick Rust engine for semantic search with filter-friendly payloads
123
3 most popular multivalue DBMSs (systems which store data in tables, however they can assign more than one value to a record's attribute) per db-engines.com in June 2025
Adabas | UniData, UniVerse ## Footnote D3
124
What does HDFS do?
It manages the distribution and storage of data on its various nodes.
125
What is MapReduce suited for in the context of the Hadoop ecosystem?
retrieving and filtering data stored across data nodes, as well as performing other processing tasks on the data
126
Name elements of the Hadoop ecosystem.
HDFS; MapReduce; YARN; HBASE; Sqoop; Storm; Pig, Hive, Spark for distributed programming; Oozie for scheduling; Apache Ambari for system deployment; and Zookeeper
127
What is HDFS connected to in the Hadoop ecosystem?
DB, Linux/Windows file system, HBASE, MapReduce – YARN
128
What is HDFS made up of?
several slave nodes and one master node
129
What does the map function do?
It accepts a key/value pair as input, and outputs a list of key/value pairs.
130
What does the reduce function do?
Takes key/value pairs created by e. g. a map function, and aggregates the values for each key, for example by taking the average, the sum or the max of the values per key.
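The two functions above can be sketched in plain Python as a word count, with the shuffle step simulated by grouping (a simplified model, not Hadoop's actual API):

```python
from collections import defaultdict

def map_fn(key, line):
    # One (key, value) pair in, a list of (word, 1) pairs out.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Aggregate all values collected for one key.
    return key, sum(values)

lines = ["big data", "big systems"]
grouped = defaultdict(list)           # simulated shuffle: group by key
for i, line in enumerate(lines):
    for k, v in map_fn(i, line):
        grouped[k].append(v)

result = dict(reduce_fn(k, vs) for k, vs in grouped.items())
print(result)  # {'big': 2, 'data': 1, 'systems': 1}
```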
131
What is YARN an abbreviation for?
Yet Another Resource Negotiator
132
What does YARN do in Hadoop?
It allocates resources to different applications, manages the data flow between applications, and monitors the overall health of the system.
133
What is Tez?
The default compute engine of Hadoop since version 2.
134
What is Apache Pig in the context of the Hadoop ecosystem?
A now-retired high-level platform that used its own scripting language, Pig Latin, to express data transformations compiled into MapReduce (or later Tez) jobs on Hadoop/YARN. | It simplified batch ETL for developers who found raw MapReduce verbose. ## Footnote Largely superseded by Spark and Hive.
135
What is the language used to write Pig scripts?
Pig Latin
136
What is Hive designed to do in the Hadoop stack?
to enable the querying of data in HDFS with SQL statements
137
What is HBASE?
a column-oriented NoSQL DB
138
What is Apache Kafka?
a message broker that helps to manage large volumes of streaming data in a Pub/Sub system
139
What does Kafka use to ensure that messages are not lost in transaction?
an immutable commit log
140
What is Ambari?
a management platform w/ a graphical interface for Hadoop
141
What does Ambari allow to monitor?
the health of the Hadoop system, such as individual nodes, the storage capacity of HDFS, the compute resource usage of YARN and the ZooKeeper status
142
What does HDFS do with the collected data?
It splits the data into many bricks or blocks, replicates them n times, and distributes them across several cluster nodes for parallel processing.
143
What is the job of the active NameNode in HDFS?
It establishes and maintains a distribution map of all files stored in HDFS.
144
How does the active NameNode keep up-to-date information on the HDFS file system?
Each DataNode keeps the active NameNode informed of its state.
145
When is it said that a system has a single point of failure?
If the failure of a single component of a system results in the failure of the entire system.
146
Where is the metadata of the active NameNode regularly saved to achieve high fault tolerance?
on a local file system as well as on a remote system
147
What is a standby NameNode in HDFS?
a duplicate of the active NameNode that receives the same metadata and logs, but does not act on the data nodes
148
What are the two core classes in Hadoop's Java API (org.apache.hadoop.fs) for working with HDFS?
FileSystem — abstract class for filesystem operations (open, create, list, delete, etc.) FileStatus — metadata container for files and directories (size, permissions, replication, timestamps)
149
What is the name of the class in the Java API Hadoop offers which informs the FileSystem and FileStatus classes on the configuration of the HDFS cluster?
Configuration
150
What objects take part in the HDFS reading mechanism?
* DataNode * DistributedFileSystem Object * FSDataInputStream Object * HDFS client * NameNode
151
What is a stub in programming?
an object or a short piece of substitute code that holds predefined data or functionality and uses it to answer calls during tests instead of using the more complex system it represents
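A minimal stub sketch (the class and values are invented for illustration):

```python
class WeatherApiStub:
    """Stands in for a real HTTP client during tests."""
    def get_temperature(self, city):
        return 21.5  # canned answer, no network call

def report(api, city):
    return f"{city}: {api.get_temperature(city)} °C"

print(report(WeatherApiStub(), "Berlin"))  # Berlin: 21.5 °C
```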
152
What is the first step of writing a file in HDFS?
The client creates a local **DistributedFileSystem** **object** that acts as a stub or proxy to the HDFS.
153
What does "ack" stand for in the context of the pipeline mechanism in writing to HDFS?
acknowledge
154
What happens next in the HDFS write pipeline when the writing on the last replica is finished?
An acknowledgment message goes up to the node of the previous replica until reaching the node of the first replica.
155
What happens next in the HDFS write pipeline after the acknowledgment message has reached the node of the first replica?
This node then returns an acknowledgment to the specialized writer.
156
What to import in Python to work with HDFS as the file store?
hdfs, pyarrow.fs, hdfs3, or fsspec for day-to-day HDFS work in Python
157
Name three scenarios when the need for data replication can arise.
Replication Factor is changed, Data Blocks get corrupted, DataNode goes down
158
What is Spark used for?
for processing big data
159
Why is Spark better than YARN?
1. It is a standalone program, not a cluster management tool. 2. It is much faster when starting up, because it does not need to initialize the entire cluster. 3. It has a more user-friendly interface.
160
What does RDD stand for in the context of Spark?
Resilient Distributed Datasets
161
What are the two ways the Spark API provides to operate on RDDs?
1. transformations (operations that create a new RDD from an existing one) 2. actions (operations that return a result to the caller)
162
What is the master node responsible for in a Spark cluster?
managing the Spark cluster and distributing the workload across the slave nodes
163
What are elements of the Spark architecture?
- driver (SparkContext) - cluster manager - workers (Executors)
164
What are the key components of Spark?
Spark core, Spark SQL, Spark Streaming, MLLib and GraphX
165
What are the two ways we can process data with Spark Streaming?
in mini-batches or as it arrives
166
What is GraphX in the context of Spark?
a library for manipulating graphs and performing graph analytics in Spark
167
What are the advantages of Spark over other big data processing systems?
1. Spark can be run on a single machine or on a cluster. 2. It can process data in memory or on disk. 3. It has built-in libraries for data processing. 4. It is designed to be compatible with Hadoop. 5. It can read and write data in HDFS, and it can run on the same clusters as Hadoop. 6. It can process data faster than Hadoop MapReduce, and it can use more memory than Hadoop.
168
What can Spark be used in Python for?
data analysis, machine learning, streaming applications
169
What is the Spark Shell?
an interactive tool that allows us to run Spark jobs
170
How to read a simple text file into an RDD with PySpark?
RDD = sc.textFile("hdfs:/share/data.txt")
171
What is each line of a read file called in MapReduce and PySpark?
record | talking about classic MapReduce jobs or low-level RDD code (key/value) ## Footnote When we are talking about Spark SQL, DataFrames or anything that ends up looking like a table: we say row (it emphasises columns and schema).
172
What is an action in the context of PySpark?
an operation on an RDD/DataFrame that triggers the computation of its lineage and either (a) returns a result to the driver program or (b) persists the result externally (e.g., writing to storage)
173
How to apply an aggregation function on an RDD in PySpark?
`result = RDD.reduce(lambda a, b: a + b)` | reduce() requires a commutative and associative binary function as its argument
174
What is the difference between the map() function's implementation in plain Python and PySpark?
In plain Python, map() is a free function that takes a function and an iterable, while in PySpark it is a method of the RDD class. PLAIN PYTHON ``` nums = [1, 2, 3, 4]
doubled = list(map(lambda n: n*2, nums)) ``` PYSPARK ``` nums = sc.parallelize([1, 2, 3, 4])
doubled = nums.map(lambda n: n*2) ```
175
How do we group the values of a pair RDD by key, apply a function to each group, and display the result in PySpark?
`print(RDD.reduceByKey(lambda a, b: a + "-" + b).collect())` ## Footnote collect() returns data; print() displays it
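Conceptually, reduceByKey groups values by key and folds each group with the given function. A plain-Python sketch of those semantics (real Spark additionally shuffles the pairs across partitions):

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    groups = defaultdict(list)
    for key, value in pairs:          # group all values that share a key
        groups[key].append(value)
    return [(k, reduce(func, vs)) for k, vs in groups.items()]

pairs = [("ICE", "Berlin"), ("RE", "Bonn"), ("ICE", "Hamburg")]
print(reduce_by_key(pairs, lambda a, b: a + "-" + b))
# [('ICE', 'Berlin-Hamburg'), ('RE', 'Bonn')]
```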
176
What is DASK?
a distributed computing system that helps analyze and process large data sets
177
What are the two libraries DASK is built on?
Dask.distributed and Dask.array
178
What does the dask.distributed library provide?
the basic infrastructure for distributed computing | (communication between nodes, scheduling of tasks, fault tolerance)
179
How can we parallelize the computation of a classifier ML model using Dask in Python?
``` # sketch — assumes X_train, y_train, X_test, y_test already exist
from dask.distributed import Client
import joblib
from sklearn.linear_model import LogisticRegression

client = Client(n_workers=4)           # local Dask cluster with 4 workers
model = LogisticRegression()
with joblib.parallel_backend("dask"):  # scikit-learn parallelism runs on Dask
    model.fit(X_train, y_train)
score = model.score(X_test, y_test) ```
180
What are streaming frameworks applied for?
the processing of data streams
181
What does processing in batches imply?
That data are collected over a given time window, or up to a given file size, before they are processed.
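A minimal plain-Python sketch of size-based batching (hypothetical `batches` helper): records accumulate until the batch is full, then the whole batch is handed off for processing.

```python
# Collect records into fixed-size batches; flush the final partial batch.
def batches(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch               # one full batch ready for processing
            batch = []
    if batch:                         # leftover records at end of stream
        yield batch

print(list(batches(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

A time-window variant would flush on a timer instead of on `len(batch)`.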
182
What risk does processing in batches bear?
that the data may already be outdated by the time they are processed
183
What is Apache Spark?
a multi-language engine for executing data engineering, data science and machine learning tasks on single-node machines or clusters
184
How does Spark Streaming achieve near-real time analysis?
Instead of large batch jobs, micro-batches are processed, mimicking the continuous stream of data.
185
What is an extra abstraction Spark Streaming adds to the functionalities of Spark Core?
the Discretized Stream (DStream), which is a programming model to operate on the data present in the stream | Internally, a DStream is implemented as an ordered sequence of RDDs, one per batch interval ## Footnote DStream is literally a class, but the term 'programming model' is emphasised to focus on thinking and coding in streaming terms while Spark silently handles micro-batch creation, scheduling, checkpointing, and fault-recovery. That combination of abstraction + operators + guarantees is what we usually mean by a programming model in distributed-systems literature.
186
What are resilient distributed datasets?
RDDs are Apache Spark's foundational data abstraction: immutable, fault-tolerant collections of objects, partitioned across cluster nodes for parallel in-memory processing. | Resilient: lost partitions are recomputed from the lineage graph (no replication needed) ## Footnote Elements can be of any type. Transformations (map, filter, join) are lazy; they only execute when an action (count, collect) triggers computation.
187
What is PySpark?
an interface in Python that enables us to write Spark applications with Python APIs | a Python library that exposes the Spark programming model through Python ## Footnote Calling it an interface (instead of a library/module) is the more precise description, because the heavy lifting happens inside the Apache Spark engine that runs in the Java Virtual Machine (JVM).
188
How do we create an RDD in PySpark from data that exists in a text file?
``` from pyspark import SparkContext

sc = SparkContext('local', 'MyFirstExample')
trainData = "./data/example.txt"
trains = sc.textFile(trainData) ``` | or trainData = "hdfs:///user/alice/train/*.txt" ## Footnote 'local' is the master URL that tells Spark to start in local mode instead of talking to a real cluster manager (YARN, Kubernetes, …)
189
How do we construct a filter function for the trains RDD in PySpark, if we want a subset of all trains that are not ICE?
`trains.filter(lambda x: 'ICE' not in x)`
190
How do we print the data from the trains RDD?
- trains.collect() – should be used with care - trains.take(3) | collect() ends the distributed computation and pulls all the data into the single driver process. ## Footnote Need a peek during development? Use trains.take(10) or pprint(10) on the DStream equivalent. Need the whole dataset locally? Only collect() if you are sure it fits in driver memory. In production streaming jobs always favour foreachRDD or Structured Streaming writers so you can push data to an external sink instead of clogging the driver logs.
191
What is the relationship between the object types DStream and RDD?
A DStream equals a sequence of RDDs.
192
How do we set up a filtered myTrains DStream in PySpark?
``` from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext('local', 'MyFirstExample')
ssc = StreamingContext(sc, 1)
myDStream = ssc.textFileStream('./data')
myTrains = myDStream.filter(lambda x: 'ICE' not in x) ``` ## Footnote The value of 1 here means “build and process a new RDD every one second.” In the example the directory ./data will be polled once a second and every file that landed since the previous poll becomes part of that batch.
193
What are the two main categories of sources in Spark Streaming?
basic sources (file systems, socket connections) and advanced sources (Kafka, Kinesis) | sockets: live app logs, network telemetry, IoT gateway, chatlike feed ## Footnote We choose socketTextStream for quick, low-volume prototypes or when an existing system already speaks raw TCP.
194
How does the socket source behave in Spark Streaming?
Like a TCP client, and it is implemented as a receiver-based source. It connects to a TCP server on a network location that can be identified by its host-ip:port combination.
195
Provide PySpark code for creating Kafka direct stream.
``` # PySpark 3.5+ — Structured Streaming Kafka source ("direct" stream)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

spark = (SparkSession.builder
    .appName("OrdersFromKafka")
    .config("spark.sql.shuffle.partitions", "4")  # tune
    .getOrCreate())

# 1. Create the direct Kafka stream as a streaming DataFrame
kafka_raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "orders-v1")        # or subscribePattern
    .option("startingOffsets", "earliest")   # or "latest"
    .load())

# 2. Deserialize the key/value and apply any business logic
schema = (StructType()
    .add("customer_id", StringType())
    .add("amount", DoubleType())
    .add("ts", StringType()))

orders = (kafka_raw
    .selectExpr("CAST(key AS STRING) AS order_id",
                "CAST(value AS STRING) AS json")
    .select("order_id", from_json(col("json"), schema).alias("data"))
    .select("order_id", "data.*"))

# 3. Write the results somewhere (console here; could be Delta, Kafka, etc.)
query = (orders.writeStream
    .format("console")                       # change to "kafka", "delta", …
    .option("truncate", "false")
    .option("checkpointLocation", "/tmp/ckpt-orders-v1")
    .outputMode("append")
    .start())

query.awaitTermination() ```
196
What was Apache Bahir?
a repository providing extensions such as streaming connectors as well as SQL data sources
197
Name 3 examples of built-in output operations Spark Streaming provides for DStream objects.
1. DStream.pprint(num=10) – prints the first 10 elements of the DStream at every streaming interval 2. DStream.saveAsTextFiles(prefix, suffix) – outputs to a file-based sink with the pattern prefix-&lt;time in ms&gt;.suffix 3. DStream.foreachRDD(func) – performs the function provided as an argument on every RDD of the stream
198
What can we replace the map-reduce function of Hadoop with?
for example Apache Spark batch-processing, or micro-batching with Spark Streaming
199
Name 10 elements of a possible big data architecture.
data sources, Kafka, Spark Streaming, Apache Spark, Presto, Hive, YARN, HDFS, Job Scheduler, ERP, CRM, DWH, Tableau
200
What layers are present in the exemplary big data architecture?
data sources, messaging layer, Hadoop framework, export functions (dashboards etc.)
201
What areas does an organization need to decide on when building a data lake using Spark?
1. storage system – HDFS or cloud object store (e. g. S3) 2. file format – structured (Parquet, ORC), semistructured (JSON), unstructured (audio, video)
202
At which company was Kafka developed?
LinkedIn
203
What is Kafka at its core?
a message queue system
204
What are the main high-level use cases of Kafka?
1. moving all occurring event data to a CDWH 2. a central hub for working w/ data in general making use of the persisting of data 3. an interface for software services to communicate with each other
205
Compare the key features of Spark Streaming and Kafka: data store, processing type, latency, supported languages.
1. SS is based on Spark clusters, HDFS or similar. <-> Kafka offers a Java library, Kafka Streams. 2. SS is micro-batching, Kafka is event-at-a-time processing. 3. SS is higher latency, Kafka is low. 4. SS supports Python, Java, Scala and R; Kafka Streams is limited to Java and Scala, but a REST Proxy connection lets a large number of clients use it.
206
What are the main high-level components of a Kafka architecture?
producers, Kafka cluster w/ brokers and an optional ZooKeeper, consumers
207
What are messages sorted by in Kafka?
topics
208
What do we see inside a Kafka cluster?
brokers (controllers, leads, followers), topics, partitions, replications, an optional zookeeper
209
What is ZooKeeper?
a replicated coordination service that offers configuration, naming, synchronization, and group-membership primitives — delivered through a simple, strongly-consistent data tree API
210
What is an advantage of Kafka's structured immutable commit log?
Old messages remain readable even after updates to the way data is structured in the system.
211
What is offset in the context of Apache Kafka?
a sequential number assigned to each new message appended to the end of a topic partition | 64-bit signed integer that the broker assigns to every record ## Footnote Kafka’s built-in pointer that preserves order, enables fault-tolerant progress tracking, and underpins every consumption guarantee the platform offers
212
What does the storage of the current offset number of each consumer allow in Kafka?
for the consumers to be able to continue working from the last message on, so no data is lost, and info is not skipped even if the connection is interrupted
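A toy sketch of this mechanism (hypothetical `ToyTopic`/`ToyConsumer` classes, not the real Kafka client API): the log is append-only, and the consumer stores the offset of the next message to read, so it can resume after a disconnect without losing or skipping data.

```python
class ToyTopic:
    def __init__(self):
        self.log = []                      # append-only commit log

    def produce(self, message):
        self.log.append(message)
        return len(self.log) - 1           # offset assigned to the message

class ToyConsumer:
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0                    # next offset to read

    def poll(self):
        messages = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)  # commit progress
        return messages

topic = ToyTopic()
consumer = ToyConsumer(topic)
topic.produce("a"); topic.produce("b")
print(consumer.poll())   # ['a', 'b']
topic.produce("c")       # arrives while the consumer is "offline"
print(consumer.poll())   # ['c'] — resumes from the stored offset
```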
213
What do producers and consumers do in the Kafka architecture?
Producers write, while consumers read.
214
What are the main components/frameworks/libraries of the Kafka ecosystem?
Apache Kafka Connect, Apache Kafka Streams, Confluent Schema Registry, Confluent REST Proxy
215
What does the Confluent Schema Registry do in the Kafka ecosystem?
It decouples producers and consumers at the data level. A consumer can use the registry to retrieve the schema before the data is processed, in order to validate it. Schemas are defined in JSON.
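A hand-rolled sketch of the idea (assumed field names; the real Schema Registry serves schemas over HTTP and commonly stores Avro schemas, which are themselves written in JSON):

```python
import json

# Hypothetical JSON-encoded schema; a consumer validates each record
# against it before processing.
schema_json = '{"fields": {"customer_id": "str", "amount": "float"}}'
schema = json.loads(schema_json)
TYPES = {"str": str, "float": float}

def is_valid(record, schema):
    fields = schema["fields"]
    return set(record) == set(fields) and all(
        isinstance(record[name], TYPES[type_name])
        for name, type_name in fields.items()
    )

print(is_valid({"customer_id": "42", "amount": 9.99}, schema))  # True
print(is_valid({"customer_id": "42"}, schema))                  # False
```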
216
What is being done over the HTTP protocol in Confluent REST Proxy?
Producers and consumers are sending REST commands to the proxy, which converts them into Kafka commands, and sends them to Kafka.
217
How to create a topic in a shell with Kafka?
`./bin/kafka-topics.sh --create --topic bigdata --partitions 1 --replication-factor 1 --bootstrap-server localhost:9092`
218
With which flag can we look at a topic in Kafka?
`--describe` | e. g. ## Footnote ``` ./bin/kafka-topics.sh \ --bootstrap-server broker1:9092,broker2:9093,broker3:9094 \ --describe \ --topic orders-v1 ```
219
How do we write a message to the topic bigdata in Kafka?
echo "Helló-belló" | ./kafka-console-producer.sh --topic bigdata --bootstrap-server localhost:9092
220
How do we consume messages in Kafka for example?
``` timeout 10 ./bin/kafka-console-consumer.sh \ --topic bigdata \ --from-beginning \ --bootstrap-server localhost:9092 ``` | --from-beginning = start reading at offset 0 for every partition ## Footnote The little timeout 10… wrapper is there to keep a demo or self-test from hanging forever once it has been proven that the consumer actually works.
221
What are the two main modes to consume messages in Kafka?
1. with timeout - output a certain batch or all 2. without timeout - printing out messages once they are produced
222
What are advantages of Spark Streaming over Kafka?
1. SS is optimized for a wide range of applications 2. can be integrated w/ different frameworks and connected to Spark 3. SS's DStreams can be used in parallel to batch processing w/ Apache Spark in a Hadoop env
223
How can data in a key-value datastore be aggregated in a straightforward way?
Data aggregations in these datastores are straightforward because they are operations applied to all the values that share a common key.

| Key | Value type | What’s stored | Example aggregates |
| --- | --- | --- | --- |
| `login:42` | Integer counter | Total log-ins | `INCR login:42` → count |
| `login:42:ts` | List | All timestamps | `LLEN` (count), `LRANGE -1 -1` (latest) |
| `login:42:ip` | Set | Distinct IPs | `SCARD` (distinct count), `SISMEMBER` |
| `login:42:geo` | Sorted set (score = timestamp) | Geo-hashes over time | `ZCOUNT` between two dates |
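The same idea in plain Python, one dict per aggregate (hypothetical login data; the Redis commands in the table map directly onto these operations):

```python
from collections import defaultdict

# Each event is (key, timestamp); every aggregate is keyed the same way.
logins = [("user:42", "10:01"), ("user:7", "10:02"), ("user:42", "10:05")]

counts = defaultdict(int)       # like INCR per key
timestamps = defaultdict(list)  # like a list value per key
for key, ts in logins:
    counts[key] += 1
    timestamps[key].append(ts)

print(counts["user:42"])          # 2
print(timestamps["user:42"][-1])  # '10:05' — most recent login
```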
224
How is big data relevant to the financial industry?
The global financial markets produce a very large amount of mostly numeric data every day. This includes stock market prices at various stock exchanges around the globe, as well as transaction systems such as SWIFT. Peaks and valleys in system load exist due to markets being closed for hours each day. Cryptocurrencies are a more recent appearance that rely on blockchain technologies, where each transaction is stored as part of the blockchain as immutable information.
225
Step-by-step explain what the following command does when entered into the cloud console of neo4j: MATCH p=(n:Ticket)-[]-() RETURN p
The command matches every path p from a node with the label 'Ticket', via a relationship of any type (the empty square brackets), to any other node (the empty round parentheses), and returns the matched paths. In the neo4j cloud console, this result is rendered as a visual graph of the 'Ticket' nodes and all their relationships.