Module 1 - Core data concepts Flashcards

(33 cards)

1
Q

Data classifications (3)

A

Structured
Semi-Structured
Unstructured

2
Q

Structured Data

A
  • Has a predefined schema: every record has the same fields and properties.
  • Tabular format:
  • rows are instances
  • columns are attributes
3
Q

Semi-structured data

A
  • Has some structure, but allows flexibility and variation across different instances.
  • Common formats:
  • JSON
  • XML
  • CSV files
4
Q

Unstructured data

A

Lacks a predefined model or consistent organizational structure.
- Examples: text documents, images, videos…

5
Q

Categories of data stores (2)

A
  • File stores
  • Databases
6
Q

File Storage explained

A
  • Can be local or central (cloud)
  • The file format to use depends on:
  • The type of data being stored
  • The applications or services that will access it
  • Whether it needs to be human-readable or optimized for storage
7
Q

Delimited Text Files (File types)

A
  • Plain text, good for structured data
  • CSV
  • commas separate the fields
  • rows are terminated by a carriage return or a new line
  • Others:
  • TSV, Space-delimited, Fixed-width data
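
The CSV rules above can be illustrated with a minimal sketch using Python's standard `csv` module. The sample data and field names are made up for illustration.

```python
import csv
import io

# Hypothetical sample: a header row followed by two records,
# fields separated by commas, rows terminated by newlines.
csv_text = "id,name,city\n1,Ana,Lisbon\n2,Ben,Oslo\n"

# DictReader splits each row on commas and maps the values
# to the field names taken from the header row.
rows = list(csv.DictReader(io.StringIO(csv_text)))

for row in rows:
    print(row["id"], row["name"], row["city"])
```

Note that every value comes back as a string; a delimited text file carries no type information.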
8
Q

JSON (File types)

A
  • Layered format, with objects that have multiple attributes
  • Objects are enclosed in curly braces
  • Collections or arrays are enclosed in square brackets
  • Attributes are expressed as name-value pairs, separated by commas
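
A small sketch of these rules, parsed with Python's standard `json` module. The customer data is invented for illustration; the second object deliberately omits an attribute to show the flexibility of the format.

```python
import json

# Hypothetical document: one object (curly braces) holding an
# array (square brackets) of objects with name-value pairs.
json_text = """
{
  "customers": [
    {"id": 1, "name": "Ana", "emails": ["ana@example.com"]},
    {"id": 2, "name": "Ben"}
  ]
}
"""

data = json.loads(json_text)

names = [c["name"] for c in data["customers"]]
# The second customer has no "emails" attribute, so fall back to an empty list.
email_counts = [len(c.get("emails", [])) for c in data["customers"]]

print(names, email_counts)
```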
9
Q

XML (File types)

A
  • Uses tags enclosed in angle brackets to define elements and attributes (indentation is only for readability)
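
As a rough illustration, the sketch below parses a tiny XML fragment with Python's standard `xml.etree.ElementTree`. The element and attribute names are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: <customer> elements carry an "id" attribute
# and contain a nested <name> element.
xml_text = """
<customers>
  <customer id="1"><name>Ana</name></customer>
  <customer id="2"><name>Ben</name></customer>
</customers>
""".strip()

root = ET.fromstring(xml_text)

# Collect (attribute value, child element text) for each customer.
customers = [(c.get("id"), c.findtext("name")) for c in root.findall("customer")]
print(customers)
```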
10
Q

Binary Large Object (BLOB) (File types)

A
  • For unstructured data
  • Data is stored as raw binary values that the reading application must interpret
  • Examples: images, videos, audio …
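
To show what "must be interpreted" means, here is a minimal sketch that treats a value as opaque bytes and checks whether it starts with the 8-byte PNG file signature. The helper name `looks_like_png` is hypothetical.

```python
# The first 8 bytes of every PNG file (the PNG signature).
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def looks_like_png(data: bytes) -> bool:
    """Return True if the raw bytes begin with the PNG signature."""
    return data.startswith(PNG_SIGNATURE)

# A BLOB is just bytes; the store does not know what they mean.
blob = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]) + b"...payload..."

print(looks_like_png(blob))
print(looks_like_png(b"plain text"))
```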
11
Q

(Storage) Optimized File Formats

A
  • Designed for efficient storage and processing.
  • Can be compressed, indexed and efficiently stored and retrieved.
  • Types: Avro, ORC, Parquet
12
Q

Avro (Optimized File Formats)

A
  • Avro records are made up of two parts
  • Header is in JSON and defines the structure of the data in the record.
  • The data itself is in binary
  • Good for compression and reduced storage or network bandwidth usage.
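
The idea of a JSON header plus a binary body can be sketched with the standard library alone. This is not the real Avro encoding or an Avro library; the layout (a length prefix, a JSON schema, then packed values) and the field names are assumptions chosen only to illustrate the two-part structure.

```python
import json
import struct

# Hypothetical schema, written as JSON: it describes the binary body.
schema = {"fields": [{"name": "id", "type": "int"},
                     {"name": "score", "type": "double"}]}
header = json.dumps(schema).encode("utf-8")

# Body: a 4-byte int and an 8-byte double, little-endian, no padding.
body = struct.pack("<id", 7, 9.5)

# Illustrative layout: 4-byte header length, JSON header, binary body.
record = struct.pack("<I", len(header)) + header + body

# Reading: use the header to learn how to interpret the body.
(header_len,) = struct.unpack_from("<I", record, 0)
decoded_schema = json.loads(record[4:4 + header_len])
values = struct.unpack_from("<id", record, 4 + header_len)
decoded = dict(zip([f["name"] for f in decoded_schema["fields"]], values))

print(decoded)
```

The binary body takes 12 bytes here, while the same values written as JSON text would take more, which hints at why such formats help with storage and bandwidth.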
13
Q

ORC (Optimized File Formats)

A
  • Columnar layout, divided into stripes
  • One stripe contains data for a specific column or group of columns.
  • One stripe has:
  • Index (Access to records in stripe)
  • Actual data
  • Footer (Statistical data for column)
14
Q

Parquet (Optimized File Formats)

A
  • Columnar layout
  • Supports compact storage, high-performance read and write operations, and works well with complex data types.
  • Organized into row groups
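
A conceptual sketch of "columnar layout organized into row groups", in plain Python. It does not read or write real Parquet files; the function `to_row_groups` and the sample rows are hypothetical and only show how values of one column end up stored together within each group.

```python
# Hypothetical row-oriented input.
rows = [
    {"id": 1, "city": "Lisbon"},
    {"id": 2, "city": "Oslo"},
    {"id": 3, "city": "Lima"},
    {"id": 4, "city": "Cairo"},
]

def to_row_groups(rows, group_size):
    """Split rows into groups, storing each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # Within a group, all values of a column sit next to each other.
        groups.append({key: [r[key] for r in chunk] for key in chunk[0]})
    return groups

row_groups = to_row_groups(rows, group_size=2)

# Reading a single column only touches that column's values in each group.
all_cities = [city for group in row_groups for city in group["city"]]
print(row_groups[0], all_cities)
```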
15
Q

Databases explained

A

A centralized, specialized system that stores data and enables querying.

16
Q

Relational Databases

A
  • Designed to store and query structured data.
  • Organized in tables; each instance (row) is given a primary key so that it can be referred to
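
A minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The table and column names are invented; the point is that `orders` rows refer to `customer` rows through the primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each table has a primary key; orders reference customers by that key.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customer(id), "
    "total REAL)"
)

conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 5.0), (12, 2, 8.5)],
)

# Join the tables on the key and total the orders per customer.
result = conn.execute(
    "SELECT c.name, SUM(o.total) FROM customer c "
    "JOIN orders o ON o.customer_id = c.id "
    "GROUP BY c.id ORDER BY c.id"
).fetchall()
print(result)
```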
17
Q

Non-Relational Databases (NoSQL Databases)

A

They do not enforce a relational schema on the data they store.

18
Q

Key-value (NoSQL Databases categories)

A

- One record is made up of:
- a unique key
- an associated value that can be of any format

19
Q

Document (NoSQL Databases categories)

A

- One record is made up of:
- a unique key
- a JSON document
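
A toy sketch of the idea, using a Python dictionary as a stand-in for a document store. The `put` and `get` helpers and the key format are hypothetical; a real document database adds indexing and querying on document fields.

```python
import json

# Stand-in for the database: unique key -> JSON document (stored as text).
store = {}

def put(key, document):
    """Serialize the document to JSON and store it under its key."""
    store[key] = json.dumps(document)

def get(key):
    """Look the document up by key and parse it back."""
    return json.loads(store[key])

# Documents under different keys do not need the same fields.
put("customer:1", {"name": "Ana", "emails": ["ana@example.com"]})
put("customer:2", {"name": "Ben", "address": {"city": "Oslo"}})

city = get("customer:2")["address"]["city"]
print(city)
```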

20
Q

Column Family (NoSQL Databases categories)

A
  • Tabular structure
  • Columns can be divided into groups called column families; each family holds logically related columns.
21
Q

Graph (NoSQL Databases categories)

A

Data is represented as nodes (entities), with links (edges) that define the relationships between them.
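
A minimal sketch of nodes and edges in plain Python. The entities, relationship names, and the `neighbors` helper are made up for illustration.

```python
# Each edge is (source node, relationship, target node).
edges = [
    ("Ana", "manages", "Ben"),
    ("Ben", "works_in", "Sales"),
    ("Ana", "works_in", "Sales"),
]

def neighbors(node, relation):
    """Return the nodes reached from `node` over edges of type `relation`."""
    return [target for source, rel, target in edges
            if source == node and rel == relation]

def sources(node, relation):
    """Return the nodes that point at `node` over edges of type `relation`."""
    return [source for source, rel, target in edges
            if target == node and rel == relation]

print(neighbors("Ana", "manages"))
print(sources("Sales", "works_in"))
```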

22
Q

Transactional data processing

A
  • Systems that capture transactions, that is, events that need to be tracked
  • This work is commonly called Online Transactional Processing (OLTP)
23
Q

OLTP systems (Transactional data processing)

A
  • Depend on optimal read and write performance
  • Support for standard CRUD operations
  • Require transactions to follow ACID properties
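
The four CRUD operations can be sketched with Python's built-in `sqlite3` module. The `product` table is a made-up example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, stock INTEGER)")

# Create
conn.execute("INSERT INTO product (id, name, stock) VALUES (1, 'Widget', 10)")

# Read
stock_before = conn.execute("SELECT stock FROM product WHERE id = 1").fetchone()[0]

# Update
conn.execute("UPDATE product SET stock = stock - 1 WHERE id = 1")
stock_after = conn.execute("SELECT stock FROM product WHERE id = 1").fetchone()[0]

# Delete
conn.execute("DELETE FROM product WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM product").fetchone()[0]

print(stock_before, stock_after, remaining)
```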
24
Q

ACID properties (Transactional data processing)

A
  • Atomicity
  • Consistency
  • Isolation
  • Durability
25
Atomicity (ACID properties)
Ensures that each transaction is an indivisible unit of work: it either completes in full or fails entirely.
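
A small sketch of atomicity using `sqlite3`. The `account` table, the `CHECK` constraint, and the `transfer` helper are assumptions for illustration: the transfer credits one account and then debits the other, and if the debit fails, the credit is rolled back too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates apply or neither does."""
    try:
        # `with conn` commits on success and rolls back on an exception.
        with conn:
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint; the credit was undone as well.
        return False

def balances(conn):
    return conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall()

failed = transfer(conn, 1, 2, 500)
after_failed = balances(conn)
succeeded = transfer(conn, 1, 2, 30)
after_success = balances(conn)
print(failed, after_failed, succeeded, after_success)
```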
26
Consistency (ACID properties)
Transactions can only move the database from one valid state to another.
27
Isolation (ACID properties)
Transactions running at the same time do not interfere with one another.
28
Durability (ACID properties)
Guarantees that once a transaction is committed, it is permanent and will remain committed.
29
Analytical data processing
Relies on read-only systems that store large amounts of historical information or business metrics.
30
The most common architecture for enterprise-scale analytics
1 - Operational data is extracted, transformed, and then loaded (ETL) into a data lake to be analyzed
2 - Data is then loaded into a set of tables. It might be a lakehouse or data warehouse.
3 - Data is then further aggregated and loaded into an OLAP model
4 - Now the stored data can be queried to generate reports, dashboards and visualizations
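
The flow above can be sketched end to end in plain Python. Everything here is a stand-in: the CSV text plays the operational source, a list plays the warehouse table, and a dictionary of totals plays the aggregated model. Names and data are invented.

```python
import csv
import io
from collections import defaultdict

# Hypothetical operational data.
operational_csv = "order_id,region,amount\n1,north,10.50\n2,south,4.00\n3,north,2.25\n"

# Extract: read the raw records.
raw = list(csv.DictReader(io.StringIO(operational_csv)))

# Transform: convert types and normalize values.
cleaned = [
    {"order_id": int(r["order_id"]),
     "region": r["region"].upper(),
     "amount": float(r["amount"])}
    for r in raw
]

# Load: store the cleaned rows in a table (a list stands in for it here).
warehouse = list(cleaned)

# Aggregate: pre-compute totals per region, as an analytical model would.
totals = defaultdict(float)
for row in warehouse:
    totals[row["region"]] += row["amount"]

# Query: a report reads the aggregated values.
print(dict(totals))
```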
31
Data warehouse explained
- For storing data in a relational schema
- Read-heavy workloads (queries)
- Structured, relational storage optimized for fast analytical queries and reporting.
32
Data lake explained
- Used in large-scale analytical processing scenarios where file-based data needs to be gathered and analyzed.
- Stores large volumes of raw data (structured and unstructured) as files.
33
Data lakehouse explained
- New architecture that combines the data lake and the data warehouse.
- Scalable, flexible -> lake
- Relational querying features -> warehouse