Module 1 - Core data concepts Flashcards by Zsolt Molnár

Data classifications (3)

Structured
Semi-Structured
Unstructured

Structured Data

Has a predefined schema: every record has the same fields and properties.
Tabular format:
rows are instances
columns are attributes

Semi-structured data

Has some structure, but allows flexibility and variation across different instances.
Common formats:
JSON
XML
CSV files

Unstructured data

Lacks a predefined model or consistent organizational structure.
- E.g.:
- Text documents, images, videos…

Categories of data stores (2)

File stores
Databases

File Storage explained

Can be local or central (cloud)
The choice of file format depends on:
The type of data being stored
The applications or services that will access it
Whether it needs to be human-readable or optimized for storage

Delimited Text Files (File types)

Plain text, good for structured data
CSV
commas separate the fields
rows are terminated by a carriage return or a new line
Others:
TSV, Space-delimited, Fixed-width data
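
As a small illustration of the delimited formats above, the sketch below parses the same made-up records once as CSV and once as TSV using Python's standard `csv` module. The field names and values are invented for the example.

```python
import csv
import io

# Hypothetical CSV content: a header row, then one record per line.
text = "id,name,city\n1,Ana,Lisbon\n2,Ben,Oslo\n"

# csv.DictReader splits each line on commas and uses the first row as field names.
rows = list(csv.DictReader(io.StringIO(text)))

# The same reader handles tab-separated values (TSV) once the delimiter is changed.
tsv_text = text.replace(",", "\t")
tsv_rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
```

Note that every value comes back as text; converting `id` to a number is left to the application reading the file.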

JSON (File types)

Layered format, with objects that have multiple attributes
Objects are enclosed in curly braces
Collections or arrays are enclosed in square brackets
Attributes are expressed as name-value pairs, separated by commas
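
The rules above can be seen in a short Python sketch that parses a made-up JSON document with the standard `json` module. The customer and order fields are assumptions chosen only for illustration.

```python
import json

# Hypothetical document: one object (curly braces) holding an array (square brackets).
text = '{"customer": "Ana", "orders": [{"id": 1, "total": 9.5}, {"id": 2, "total": 20.0}]}'

doc = json.loads(text)

# JSON objects become Python dicts; JSON arrays become Python lists.
order_ids = [order["id"] for order in doc["orders"]]
total = sum(order["total"] for order in doc["orders"])
```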

XML (File types)

Uses tags (in angle brackets) to define elements and attributes; indentation only makes the nesting easier to read
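
A minimal sketch of reading such a file with Python's standard `xml.etree.ElementTree` module follows. The `customers` document, its element names, and the `id` attribute are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML: <customer> elements, each with an "id" attribute and a <name> child.
text = """
<customers>
  <customer id="1"><name>Ana</name></customer>
  <customer id="2"><name>Ben</name></customer>
</customers>
""".strip()

root = ET.fromstring(text)

# Attributes are read with .get(); child elements are located with .find().
names = {c.get("id"): c.find("name").text for c in root.findall("customer")}
```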

Binary Large Object (BLOB) (File types)

For unstructured data
Data is stored as raw binary values that the reading application must interpret
Images, videos, audio …
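
To show what "interpreting" raw binary means, the sketch below reads the same four bytes in two different ways. The byte values are the first four bytes of the PNG file signature; everything else is illustrative.

```python
import struct

# Four raw bytes, as they might sit at the start of a binary file.
blob = bytes([0x89, 0x50, 0x4E, 0x47])

# Interpretation 1: a big-endian unsigned 32-bit integer.
as_int = struct.unpack(">I", blob)[0]

# Interpretation 2: the start of a PNG image (byte 0x89 followed by the letters "PNG").
looks_like_png = blob[1:4] == b"PNG"
```

The bytes themselves carry no label; it is the application that decides whether they are a number, part of an image, or something else.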

(Storage) Optimized File Formats

Designed for efficient storage and processing.
Can be compressed, indexed and efficiently stored and retrieved.
Types: Avro, ORC, Parquet

Avro (Optimized File Formats)

Avro records are made up of two parts:
The header is in JSON and defines the structure of the data in the record.
The data itself is in binary.
Good for compression and reduced storage or network bandwidth usage.
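
The header-plus-binary idea can be sketched with the standard library alone. This is not the real Avro encoding and does not use an Avro library: the schema layout, the `fmt` codes, and the `encode`/`decode` helpers are all hypothetical, chosen only to show a JSON description travelling alongside compact binary data.

```python
import json
import struct

# Hypothetical schema: field names plus struct format codes (not Avro's real type system).
schema = {"fields": [{"name": "id", "fmt": "i"}, {"name": "score", "fmt": "d"}]}

def encode(record):
    """Pack a record as: 4-byte header length, JSON header, binary field values."""
    header = json.dumps(schema).encode("utf-8")
    fmt = "<" + "".join(f["fmt"] for f in schema["fields"])
    body = struct.pack(fmt, *(record[f["name"]] for f in schema["fields"]))
    # Prefix the header with its length so a reader knows where the binary data starts.
    return struct.pack("<I", len(header)) + header + body

def decode(blob):
    """Read the JSON header first, then use it to interpret the binary values."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    header = json.loads(blob[4:4 + header_len].decode("utf-8"))
    fmt = "<" + "".join(f["fmt"] for f in header["fields"])
    values = struct.unpack_from(fmt, blob, 4 + header_len)
    return dict(zip((f["name"] for f in header["fields"]), values))

packed = encode({"id": 7, "score": 0.5})
restored = decode(packed)
```

The two values take 12 bytes in binary here, and the reader needs nothing but the embedded header to decode them.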

ORC (Optimized File Formats)

Columnar layout, divided into stripes
One stripe contains data for a specific column or group of columns.
One stripe has:
Index (Access to records in stripe)
Actual data
Footer (Statistical data for column)

Parquet (Optimized File Formats)

Columnar layout
Supports compact storage, high-performance read and write operations, and works well with complex data types.
Organized into row groups
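
A rough picture of "columnar layout organized into row groups" is sketched below with plain Python lists. It does not read or write real Parquet files; the sample rows and the `to_row_groups` helper are assumptions made for illustration.

```python
# Hypothetical records, as they would look in a row-oriented layout.
rows = [
    {"id": 1, "city": "Oslo", "amount": 10.0},
    {"id": 2, "city": "Lisbon", "amount": 25.0},
    {"id": 3, "city": "Oslo", "amount": 5.0},
    {"id": 4, "city": "Rome", "amount": 40.0},
]

def to_row_groups(rows, group_size):
    """Split rows into groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        groups.append({name: [r[name] for r in chunk] for name in chunk[0]})
    return groups

row_groups = to_row_groups(rows, group_size=2)

# Summing one column touches only that column's values in each group.
total_amount = sum(sum(group["amount"]) for group in row_groups)
```

Keeping each column's values together is what lets analytical queries skip the columns they do not need, and similar values stored side by side tend to compress well.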

Databases explained

A centralized, specialized system for storing data and querying it.

Relational Databases

Designed to store and query structured data.
Organized in tables; each instance (row) is given a primary key so that it can be referred to.
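
The sketch below uses Python's built-in `sqlite3` module to show a primary key being used to refer to a row from another table. The `customer` and `purchase` tables and their columns are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each table gets a primary key; purchase.customer_id refers back to customer.id.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE purchase ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL REFERENCES customer(id), "
    "total REAL NOT NULL)"
)
conn.executemany("INSERT INTO customer (id, name) VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
conn.executemany(
    "INSERT INTO purchase (customer_id, total) VALUES (?, ?)",
    [(1, 9.5), (1, 20.0), (2, 3.0)],
)

# The key makes it possible to join the two tables in a query.
totals = conn.execute(
    "SELECT c.name, SUM(p.total) FROM customer c "
    "JOIN purchase p ON p.customer_id = c.id "
    "GROUP BY c.id ORDER BY c.id"
).fetchall()
```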

Non-Relational Databases (NoSQL Databases)

They do not enforce a relational schema on the data they store.

Key-value (NoSQL Databases categories)

- One record is made up of:
- a unique key
- an associated value that can be of any format
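
As a toy model of this idea, the sketch below uses a plain Python dict as the store. The `put` and `get` helpers and the key names are hypothetical; a real key-value database would add persistence and distribution on top.

```python
# A plain dict stands in for the store.
store = {}

def put(key, value):
    """Save a value under a unique key, replacing any earlier value for that key."""
    store[key] = value

def get(key, default=None):
    """Look a value up by its key; the store never inspects the value itself."""
    return store.get(key, default)

put("session:42", b"\x00\x01\x02")        # raw bytes
put("user:7", '{"name": "Ana"}')          # JSON text
put("cart:7", ["book", "pen"])            # a list
put("cart:7", ["book", "pen", "lamp"])    # same key again: the old value is replaced
```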

Document (NoSQL Databases categories)

- One record is made up of:
- a unique key
- a JSON document
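
Because the value is a JSON document, it can be parsed and searched by its fields. The sketch below illustrates this with a dict of JSON strings; the documents and the `find` helper are invented for the example and do not represent any particular database's API.

```python
import json

# Hypothetical collection: unique key -> JSON document (documents need not share fields).
documents = {
    "c1": json.dumps({"name": "Ana", "city": "Oslo"}),
    "c2": json.dumps({"name": "Ben", "city": "Rome", "vip": True}),
}

def find(field, value):
    """Return the keys of all documents whose given field equals the given value."""
    return [key for key, raw in documents.items() if json.loads(raw).get(field) == value]

in_rome = find("city", "Rome")
vips = find("vip", True)
```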

Column Family (NoSQL Databases categories)

Tabular structure
Columns are grouped into column families; each family holds a set of logically related columns.
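
One way to picture this structure is a nested mapping of row key, then column family, then column. The sketch below uses made-up customer data and a hypothetical `read_family` helper; it is only a model of the layout, not of any real column-family database.

```python
# Row key -> column family -> column -> value.
table = {
    "cust-001": {
        "identity": {"first_name": "Ana", "last_name": "Silva"},
        "contact": {"email": "ana@example.com"},
    },
    "cust-002": {
        "identity": {"first_name": "Ben"},
        "contact": {"email": "ben@example.com", "phone": "555-0100"},
    },
}

def read_family(row_key, family):
    """Fetch all related columns of one family for one row."""
    return table[row_key].get(family, {})

contact = read_family("cust-002", "contact")
```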

Graph (NoSQL Databases categories)

Data is represented as nodes (entities) connected by edges (links) that define the relationships between them.
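
A minimal sketch of nodes and edges in plain Python follows. The employees, the department, the relationship names, and the two helper functions are all assumptions made for illustration.

```python
# Nodes are entities (with a type); edges are (source, relationship, target) triples.
nodes = {"ana": "Employee", "ben": "Employee", "sales": "Department"}
edges = [
    ("ana", "works_in", "sales"),
    ("ben", "works_in", "sales"),
    ("ben", "reports_to", "ana"),
]

def outgoing(node, relation):
    """Follow edges of one relationship type starting at a node."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]

def incoming(node, relation):
    """Find the nodes that point at a node through one relationship type."""
    return [src for src, rel, dst in edges if dst == node and rel == relation]

sales_staff = incoming("sales", "works_in")
bens_manager = outgoing("ben", "reports_to")
```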

Transactional data processing

Systems that capture transactions: events that need to be tracked
This kind of work is commonly called Online Transactional Processing (OLTP)

OLTP systems (Transactional data processing)

Depend on optimal read and write performance
Support for standard CRUD operations
Require transactions to follow ACID properties
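
The four CRUD operations (create, read, update, delete) are sketched below against an in-memory SQLite database. The `account` table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL NOT NULL)")

# Create
conn.execute("INSERT INTO account (id, balance) VALUES (?, ?)", (1, 100.0))

# Read
before = conn.execute("SELECT balance FROM account WHERE id = ?", (1,)).fetchone()[0]

# Update
conn.execute("UPDATE account SET balance = ? WHERE id = ?", (75.0, 1))
after = conn.execute("SELECT balance FROM account WHERE id = ?", (1,)).fetchone()[0]

# Delete
conn.execute("DELETE FROM account WHERE id = ?", (1,))
remaining = conn.execute("SELECT COUNT(*) FROM account").fetchone()[0]
```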

ACID properties (Transactional data processing)

Atomicity
Consistency
Isolation
Durability

Atomicity (ACID properties)

Ensures that each transaction is an indivisible unit of work: it either completes in full or fails entirely.
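
A money transfer is the usual example: the credit and the debit must both happen or neither must. The sketch below shows this with Python's `sqlite3` module, where using the connection as a context manager commits on success and rolls back on an exception. The `account` table, the non-negative balance rule, and the `transfer` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

def transfer(src, dst, amount):
    """Move money between accounts; both updates commit together or not at all."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(1, 2, 30.0)         # both updates succeed and are committed
failed = transfer(1, 2, 500.0)    # the debit would overdraw account 1, so the credit is undone too
balances = conn.execute("SELECT balance FROM account ORDER BY id").fetchall()
```

In the second call the credit to account 2 is applied first and then rolled back when the debit violates the balance rule, so no half-finished transfer is ever stored.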

Consistency (ACID properties)

Transactions can only move the database from one valid state to another.

Isolation (ACID properties)

Transactions running at the same time do not interfere with one another.

Durability (ACID properties)

Guarantees that once a transaction is committed, it is permanent and will remain committed.

Analytical data processing

Relies on read-only (or read-mostly) systems that store large amounts of historical information or business metrics

The most common architecture for enterprise-scale analytics

1 - Operational data is extracted, transformed, and loaded (ETL) into a data lake to be analyzed
2 - Data is then loaded into a set of tables. It might be a lakehouse or data warehouse.
3 - Data is then further aggregated and loaded into an OLAP model
4 - Now the stored data can be queried to generate reports, dashboards and visualizations
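
The four steps can be mimicked in miniature with plain Python. This is only a sketch: the CSV export, the field names, and the in-memory list and dict standing in for the warehouse table and the OLAP model are all hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical operational export, as CSV text.
operational_export = "order_id,region,amount\n1,north,10.50\n2,south,4.00\n3,north,6.25\n"

def extract(text):
    """Step 1 (extract): read the raw records."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Step 1 (transform): convert text fields to proper types and normalize region names."""
    return [
        {"order_id": int(r["order_id"]), "region": r["region"].upper(), "amount": float(r["amount"])}
        for r in rows
    ]

# Step 2: load into a table-like structure (a list standing in for a warehouse table).
warehouse_table = transform(extract(operational_export))

# Step 3: aggregate into a summary, loosely playing the role of an OLAP model.
sales_by_region = defaultdict(float)
for row in warehouse_table:
    sales_by_region[row["region"]] += row["amount"]

# Step 4: query the aggregate to produce a small report.
report = sorted(sales_by_region.items())
```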

Data warehouse explained

- For storing data in a relational schema
- Read-heavy workloads (queries)
Structured, relational storage optimized for fast analytical queries and reporting.

Data lake explained

- Large-scale analytical processing scenarios where file-based data needs to be gathered and analyzed.
Stores large volumes of raw data (structured and unstructured) as files.

Data lakehouse explained

- Newer architecture that combines the data lake and the data warehouse.
- Scalable, flexible -> lake
- Relational querying features -> warehouse

Module 1 - Core data concepts Flashcards

(33 cards)