What is data volume?
The amount or size of data that organisations are dealing with at a given time
What is the variety of data?
Descriptors of the data itself, e.g. is it structured, unstructured, sensitive, not sensitive
What are 2 key differences between data lakes and data warehouses?
Where are data lakes often more efficient than data warehouses?
For cost effectively storing data, for advanced analytics, ML or data discovery
What is a data lakehouse?
A hybrid of data lakes and data warehouses.
The idea is that is the best of both worlds - supports schema on read and schema on write, structured and unstructured data, and can do both advanced analytics and ML tasks
Where are data warehouses typically more efficient than data lakes?
For advanced querying of massive amounts of data, running analytics, business intelligence
What is a data mesh?
A data set up where individual teams own their own data products which can each serve various use cases around the organisation
How is governance and security implemented in a data mesh?
Each domain handles the governance and security for itself
What is data lineage?
A visual representation of the flow and transformation of your data
Why is data lineage useful?
Helps to track errors back to their source and also with compliance
What is schema evolution?
The ability to adapt the schema of a dataset over time without disrupting existing systems
What service can help manage schema evolution and manage different schema versions?
Glue Schema Registry
What are 3 ways to optimise the performance of your database?
1/ Create indices to avoid full table scans
2/ Partition your data to enable parallel processing
3/ Compress data to speed up its transfer, reduce storage and disk reads
What is data skew in a database?
The unequal distribution of data across partitions
What is temporal skew?
If you are partitioning by time and the volume of data you are collecting increases quickly (e.g. loads of new users join), this can cause temporal skew since the amount of new data will be a lot more than the amount of old data
What are 2 ways to fix data skew (5 listed)?
What is data completeness?
The amount of missing values, null counts or percentage of populated fields
What is data consistency?
E.g. rating data - ratings should be out of 5 across the datasets you are combining
What does CASE do in SQL?
Allows you to apply filters to what you are aggregating whilst you are doing it
What does WHERE do in SQL?
Allows you to apply one filter to one object at a time
What is the difference between INNER JOIN and FULL OUTER JOIN in SQL?
INNER JOIN is the default and will join all the elements that are common between the two columns.
FULL OUTER JOIN will join all the data from all the tables.
What does “~!” do in SQL?
This means “does not match” whatever expression