Data Engineering Fundamentals Flashcards

(22 cards)

1
Q

What is data volume?

A

The amount or size of data that organisations are dealing with at a given time

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is the variety of data?

A

Descriptors of the data itself, e.g. is it structured, unstructured, sensitive, not sensitive

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What are 2 key differences between data lakes and data warehouses?

A
  • Data lakes hold data in its native format be it structured, semi-structured or unstructured
  • Data lakes perform schema-on-read as opposed to schema-on-write
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Where are data lakes often more efficient than data warehouses?

A

For cost effectively storing data, for advanced analytics, ML or data discovery

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is a data lakehouse?

A

A hybrid of data lakes and data warehouses.
The idea is that is the best of both worlds - supports schema on read and schema on write, structured and unstructured data, and can do both advanced analytics and ML tasks

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Where are data warehouses typically more efficient than data lakes?

A

For advanced querying of massive amounts of data, running analytics, business intelligence

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What is a data mesh?

A

A data set up where individual teams own their own data products which can each serve various use cases around the organisation

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

How is governance and security implemented in a data mesh?

A

Each domain handles the governance and security for itself

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is data lineage?

A

A visual representation of the flow and transformation of your data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Why is data lineage useful?

A

Helps to track errors back to their source and also with compliance

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is schema evolution?

A

The ability to adapt the schema of a dataset over time without disrupting existing systems

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What service can help manage schema evolution and manage different schema versions?

A

Glue Schema Registry

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What are 3 ways to optimise the performance of your database?

A

1/ Create indices to avoid full table scans
2/ Partition your data to enable parallel processing
3/ Compress data to speed up its transfer, reduce storage and disk reads

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is data skew in a database?

A

The unequal distribution of data across partitions

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is temporal skew?

A

If you are partitioning by time and the volume of data you are collecting increases quickly (e.g. loads of new users join), this can cause temporal skew since the amount of new data will be a lot more than the amount of old data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What are 2 ways to fix data skew (5 listed)?

A
  • Adaptive partitioning (re-partition based on data characteristics, potentially in real time)
  • Salting
  • Repartitioning (regularly redistributing the data based on its current characteristics)
  • Sampling (using a sample of the data to determine the distribution and adjust the processing strategy accordingly)
  • Custom partitioning based on rules you define
17
Q

What is data completeness?

A

The amount of missing values, null counts or percentage of populated fields

18
Q

What is data consistency?

A

E.g. rating data - ratings should be out of 5 across the datasets you are combining

19
Q

What does CASE do in SQL?

A

Allows you to apply filters to what you are aggregating whilst you are doing it

20
Q

What does WHERE do in SQL?

A

Allows you to apply one filter to one object at a time

21
Q

What is the difference between INNER JOIN and FULL OUTER JOIN in SQL?

A

INNER JOIN is the default and will join all the elements that are common between the two columns.
FULL OUTER JOIN will join all the data from all the tables.

22
Q

What does “~!” do in SQL?

A

This means “does not match” whatever expression