Data lake
A database that holds raw data in its native format. The data lake approach copies all the data that might be relevant. Whereas the data warehouse approach cleans the data before storing it, the data lake approach does not clean the data. Mostly, the two exist next to each other because different employees have different needs. This approach is also called schema on read.
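A minimal sketch of schema on read, using hypothetical JSON records: the raw lines are stored untouched, and a schema is only imposed at query time.

```python
import json

# Raw events land in the lake untouched (schema on write is skipped).
raw_records = [
    '{"user": "alice", "amount": 19.99, "ts": "2024-01-05"}',
    '{"user": "bob", "amount": 5.50}',  # a missing field is fine at write time
]

# The schema is applied only when the data is read for analysis.
def read_with_schema(lines):
    for line in lines:
        rec = json.loads(line)
        yield {
            "user": rec["user"],
            "amount": float(rec["amount"]),
            "ts": rec.get("ts", "unknown"),  # tolerate missing fields on read
        }

parsed = list(read_with_schema(raw_records))
```

The field names and the "unknown" default are illustrative, not a fixed convention.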
ETL
Data warehousing process that consists of:
Extraction
Extracting data from a production database.
Transformation
Converting the extracted data from its source form into its target form, so that it can be placed into a data warehouse. Transformation activities: transfer, cleaning, integration and extraction.
Load
Putting the data into the warehouse.
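The three ETL steps can be sketched over hypothetical in-memory tables; the table contents and field names are made up for illustration.

```python
# A toy production source with messy values.
production_db = [
    {"id": 1, "name": " Alice ", "sales": "120"},
    {"id": 2, "name": "BOB", "sales": "80"},
]
warehouse = {}  # the data warehouse, keyed by record id

def extract(db):
    # pull raw rows out of the production database
    return list(db)

def transform(rows):
    # cleaning (trim whitespace, fix casing) and conversion (string -> int)
    return [{"id": r["id"],
             "name": r["name"].strip().title(),
             "sales": int(r["sales"])} for r in rows]

def load(rows, dw):
    # put the transformed data into the warehouse
    for r in rows:
        dw[r["id"]] = r

load(transform(extract(production_db)), warehouse)
```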
ETL data conversion
ETL cleaning
ETL load
Integral load:
- All the records are taken and processed by comparing them with the records in the data warehouse.
- Extreme case: load deletes DW tables and replaces them with complete new versions (= full load).
- When no time stamps are present, it is difficult to do an incremental load.
Delta/Incremental load:
- Only the changed records are taken and processed systematically in the DW.
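The contrast between the two load strategies can be sketched as follows; the record layout and the `updated` timestamp field are assumptions for illustration, and the timestamp is exactly what makes the delta load possible.

```python
from datetime import date

dw = {}  # data warehouse, keyed by record id

source = [
    {"id": 1, "value": 10, "updated": date(2024, 1, 1)},
    {"id": 2, "value": 20, "updated": date(2024, 1, 3)},
]

def full_load(rows, dw):
    # extreme case of an integral load: drop the DW tables and rebuild them
    dw.clear()
    dw.update({r["id"]: r for r in rows})

def delta_load(rows, dw, since):
    # only records changed after the last load are processed;
    # without the 'updated' time stamp this selection would be impossible
    for r in rows:
        if r["updated"] > since:
            dw[r["id"]] = r

full_load(source, dw)
delta_load([{"id": 2, "value": 25, "updated": date(2024, 1, 5)}],
           dw, since=date(2024, 1, 4))
```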
ETL tools
ETL tool vendors
BI front-end applications
Query and reporting
Methods of reporting
OLAP
Multi-dimensional data
Data can be classified into measures and dimensions. Aggregating measures up to certain dimensions creates a multi-dimensional view on the data (= data cube).
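A minimal sketch of building data-cube cells, assuming hypothetical dimensions (region, month) and one summable measure (sales): aggregating the measure up to the dimensions yields the cells of a 2-dimensional cube.

```python
from collections import defaultdict

# Each row: two dimension values plus a summable measure.
rows = [
    ("North", "Jan", 100), ("North", "Feb", 150),
    ("South", "Jan", 80),  ("North", "Jan", 50),
]

# Summing the measure per (region, month) combination produces
# the cells of a 2-dimensional data cube.
cube = defaultdict(int)
for region, month, sales in rows:
    cube[(region, month)] += sales
```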
Measures
Summable information concerning a business process.
Dimensions
Represent different perspectives from which measures can be viewed. Dimensions are organized hierarchically; each level of the hierarchy represents a different level of aggregation.
Fact
A cell in a data cube that actually holds a value.
OLAP operators
Data mining models
Modeling OLAP databases, define
Star schema
A fact table in the middle connected to a set of dimension tables. It is a simple model that limits the number of joins for a better query performance.
In a star schema the primary key of the fact table is composed of the primary keys of the dimension tables (as foreign keys).
There is data redundancy due to denormalization (a lot of data is put into one table with many rows). This improves query performance, because fewer joins are needed.
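A star schema can be sketched with plain dictionaries; the tables and attributes below are invented for illustration. The fact table carries foreign keys and a measure, and one join per dimension turns a fact row into a readable record.

```python
# Dimension tables: descriptive attributes, keyed by primary key.
dim_product = {1: {"name": "Widget", "category": "Tools"}}
dim_store   = {10: {"city": "Leuven"}}

# Fact table: foreign keys into the dimensions plus a measure.
fact_sales = [
    {"product_id": 1, "store_id": 10, "amount": 200},
    {"product_id": 1, "store_id": 10, "amount": 50},
]

# One join per dimension resolves each fact row.
report = [
    {"product": dim_product[f["product_id"]]["name"],
     "city": dim_store[f["store_id"]]["city"],
     "amount": f["amount"]}
    for f in fact_sales
]
```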
Snowflake schema (normalized)
A refinement of the star schema in which some dimension hierarchies are normalized into a set of smaller dimension tables, forming a shape similar to a snowflake. A snowflake schema is used when the data volume grows too large.
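The normalization step can be sketched with a hypothetical product dimension: in the star version the category is repeated on every product row; in the snowflake version that level of the hierarchy is split into its own table, removing the redundancy at the cost of an extra join.

```python
# Star: one denormalized product dimension (category repeated per row).
dim_product_star = {
    1: {"name": "Hammer", "category": "Tools"},
    2: {"name": "Drill",  "category": "Tools"},
}

# Snowflake: the category level is normalized into its own table.
dim_category = {100: {"category": "Tools"}}
dim_product_snow = {
    1: {"name": "Hammer", "category_id": 100},
    2: {"name": "Drill",  "category_id": 100},
}

# The extra join needed to recover the category of product 1:
cat = dim_category[dim_product_snow[1]["category_id"]]["category"]
```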
Sparsity
Arises when not every cell in the cube is filled with data.
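Sparsity can be quantified as the share of possible cube cells that hold no fact; the dimension values and filled cells below are made up for illustration.

```python
from itertools import product

regions = ["North", "South", "East"]
months  = ["Jan", "Feb"]

# Only these cells hold a fact; the rest of the cube is empty.
filled = {("North", "Jan"): 100, ("South", "Feb"): 40}

total_cells = len(list(product(regions, months)))  # 3 regions * 2 months = 6
sparsity = 1 - len(filled) / total_cells           # share of empty cells
```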