Chapter 3 Data Pipelines Flashcards

(23 cards)

1
Q

Explain the acronym “DAG”

A

Directed acyclic graph: a graph with directed edges and no cycles, used to model dependencies between pipeline tasks.
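The idea can be illustrated with a toy example. This is a minimal sketch (not from the source) that models the four pipeline stages as a DAG and derives a valid execution order with a topological sort; the stage names mirror the stages card below.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each stage maps to the set of stages it depends on (its predecessors).
dag = {
    "ingestion": set(),
    "transformation": {"ingestion"},
    "storage": {"transformation"},
    "analysis": {"storage"},
}

# A topological sort yields an order that respects every dependency.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['ingestion', 'transformation', 'storage', 'analysis']
```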

2
Q

What are the data pipeline stages?

A
  • Ingestion
  • Transformation
  • Storage
  • Analysis
3
Q

How does the data pipeline “Ingestion” stage work?

A

Ingestion is the process of bringing data into the GCP environment.

4
Q

How does the data pipeline “Transformation” stage work?

A

Transformation is the process of mapping data from the structure used in the source system to the structure used in the storage and analysis stages of the data pipeline.
Transformation includes:
- trimming records
- filtering
- joins
- other operations on data

5
Q

How does the data pipeline “Storage” stage work?

A

Storage is the process of persisting ingested and transformed data so that it is available to the analysis stage, for example in Cloud Storage, BigQuery, or Cloud Bigtable.
6
Q

How does the data pipeline “Analysis” stage work?

A

Analysis can take several forms, including:
- SQL querying
- report generation
- machine learning model training
- data science analysis

7
Q

What are the types of data pipelines?

A
  • Data warehousing pipelines
  • Stream processing pipelines
  • Machine learning pipelines
8
Q

What is ETL and how does it work?

A

Extract, transform, and load (ETL). In an ETL process, data is transformed before it is loaded into the target database.

9
Q

What is ELT and how does it work?

A

Extract, load, and transform (ELT).
In an ELT process, data is loaded into a database before transforming the data.
(This is how we do it in the data hub, where the raw data is published to a topic.)
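The difference between the two approaches is only the ordering of steps. A minimal sketch, using in-memory lists to stand in for the source and target systems (all names and data are illustrative):

```python
# Hypothetical source records: raw rows with inconsistent formatting.
source = [{"name": " Alice "}, {"name": "BOB"}]

def transform(row):
    # Normalize the name field.
    return {"name": row["name"].strip().title()}

# ETL: transform first, then load into the target.
etl_target = [transform(row) for row in source]

# ELT: load the raw rows first, transform later inside the target.
elt_target = list(source)                            # raw data lands as-is
elt_target = [transform(row) for row in elt_target]  # transformed afterwards

print(etl_target == elt_target)  # True — same result, different ordering
```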

10
Q

What is change data capture and how does it work?

A

In a change data capture approach, each change in a source system is captured and recorded in a data store. This is helpful in cases where it is important to know all changes over time and not just the state of the database at the time of data extraction.
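A toy illustration of the idea: instead of overwriting the latest state, every change is appended to a log, so both the current state and the full history can be reconstructed. All names are made up for the sketch.

```python
change_log = []  # append-only record of every change

def capture(key, value):
    """Record a change instead of overwriting state."""
    change_log.append({"key": key, "value": value})

def current_state():
    """Replay the log to derive the latest value per key."""
    state = {}
    for change in change_log:
        state[change["key"]] = change["value"]
    return state

capture("account_balance", 100)
capture("account_balance", 75)
capture("account_balance", 120)

print(current_state())  # {'account_balance': 120}
print(len(change_log))  # 3 — intermediate changes are preserved, not lost
```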

11
Q

What are stream processing pipelines and how do they work?

A

Streams are unending, continuous sources of data. Stream processing pipelines operate on this data as it arrives, rather than on fixed batches.

12
Q

Explain Sliding and Tumbling Windows

A

Sliding windows contain data that overlaps with other windows (e.g., “the last 3 events”).
Tumbling windows contain data that does not overlap with other windows (e.g., fixed 15-minute windows).
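The difference can be sketched in plain Python over a list of timestamped readings. The 15-minute tumbling size and 3-element sliding size mirror the examples above; the data is made up.

```python
# (timestamp_in_minutes, value) pairs.
events = [(0, 10), (5, 20), (14, 30), (16, 40), (29, 50), (31, 60)]

def tumbling(events, size):
    """Non-overlapping fixed windows: each event lands in exactly one window."""
    windows = {}
    for ts, value in events:
        windows.setdefault(ts // size, []).append(value)
    return windows

def sliding_last_n(values, n):
    """Overlapping 'last n' windows: consecutive windows share elements."""
    return [values[max(0, i - n + 1):i + 1] for i in range(len(values))]

print(tumbling(events, 15))
# {0: [10, 20, 30], 1: [40, 50], 2: [60]}
print(sliding_last_n([v for _, v in events], 3))
# [[10], [10, 20], [10, 20, 30], [20, 30, 40], [30, 40, 50], [40, 50, 60]]
```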

13
Q

What GCP service would you recommend for ingesting IoT data?

A

Cloud Pub/Sub, which is a scalable, managed messaging queue
that is typically used for ingesting high-volume streaming data

14
Q

We have defined a set of rules for filtering out bad data before it gets into the data mart. At what stage of the data pipeline would you implement those rules?

A

Transformation

15
Q

A team of data warehouse developers is migrating a set of legacy Python scripts that have
been used to transform data as part of an ETL process. They would like to use a service
that allows them to use Python and requires minimal administration and operations support.
Which GCP service would you recommend?

A

Cloud Dataflow

16
Q

You are using Cloud Pub/Sub to buffer records from an application that generates a stream of data based on user interactions with a website. The messages are read by another service that transforms the data and sends it to a machine learning model that will use it for training. A developer has just released some new code, and you notice that messages are sent repeatedly at 10-minute intervals. What might be the cause of this problem?

A

The new code disabled acknowledgments from the consumer.
That caused Cloud Pub/Sub to consider the message outstanding for up to the duration of the acknowledgment wait time and then resend the message.
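The redelivery behavior can be mimicked with a tiny in-memory queue: a message that is delivered but never acknowledged becomes eligible for redelivery once its acknowledgment deadline passes. This is a simplified sketch of the mechanism, not the actual Pub/Sub client API.

```python
class FakeQueue:
    def __init__(self, ack_deadline):
        self.ack_deadline = ack_deadline
        self.messages = {}  # msg_id -> time it becomes deliverable

    def publish(self, msg_id):
        self.messages[msg_id] = 0  # deliverable immediately

    def pull(self, now):
        """Deliver due messages; they stay outstanding until acked."""
        due = [m for m, t in self.messages.items() if t <= now]
        for m in due:
            self.messages[m] = now + self.ack_deadline  # schedule redelivery
        return due

    def ack(self, msg_id):
        self.messages.pop(msg_id, None)  # acked: never redelivered

q = FakeQueue(ack_deadline=10)
q.publish("m1")

print(q.pull(now=0))   # ['m1'] — first delivery
print(q.pull(now=5))   # []    — outstanding, deadline not yet reached
print(q.pull(now=10))  # ['m1'] — never acked, so it is redelivered

q.ack("m1")
print(q.pull(now=20))  # []    — acknowledged messages stay gone
```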

17
Q

It is considered a good practice to make your processing logic idempotent when consuming
messages from a Cloud Pub/Sub topic. Why is that?

A

Messages may be delivered multiple times and therefore
processed multiple times. If the logic were not idempotent, it could leave the application in
an incorrect state, such as that which could occur if you counted the same message multiple
times.
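A minimal sketch of idempotent handling: track already-processed message IDs so that a redelivered message has no extra effect. The counter scenario mirrors the answer above; the names are illustrative.

```python
processed_ids = set()
event_count = 0

def handle(msg_id):
    """Idempotent consumer: a duplicate delivery changes nothing."""
    global event_count
    if msg_id in processed_ids:
        return  # already processed; safely ignore the redelivery
    processed_ids.add(msg_id)
    event_count += 1

handle("msg-1")
handle("msg-2")
handle("msg-1")  # duplicate delivery from the message queue

print(event_count)  # 2 — the duplicate was not counted twice
```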

18
Q

A group of IoT sensors is sending streaming data to a Cloud Pub/Sub topic. A Cloud Dataflow
service pulls messages from the topic and reorders the messages sorted by event time.
A message is expected from each sensor every minute. If a message is not received from a
sensor, the stream processing application should use the average of the values in the last
four messages. What kind of window would you use to implement the missing data logic?

A

Sliding window
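The missing-data logic could be sketched like this: keep the last four values in a sliding buffer and substitute their average when a reading is absent. A single sensor's stream is shown; it assumes at least one prior reading exists before the first gap.

```python
from collections import deque

recent = deque(maxlen=4)  # sliding window of the last four readings

def next_value(reading):
    """Return the reading, or the average of the last 4 if it is missing."""
    if reading is None:
        return sum(recent) / len(recent)  # fill the gap from the window
    recent.append(reading)
    return reading

stream = [10, 12, 14, 16, None, 18]  # None marks a missed sensor message
out = [next_value(r) for r in stream]
print(out)  # [10, 12, 14, 16, 13.0, 18]
```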

19
Q

You are tasked with designing a way to share data from on-premises pipelines that use
Kafka with GCP data pipelines that use Cloud Pub/Sub. How would you do that?

A

You should use CloudPubSubConnector and Kafka Connect.
The connector is developed and maintained by the Cloud Pub/Sub team for this purpose.

20
Q

A team of developers wants to create standardized patterns for processing IoT data. Several
teams will use these patterns. The developers would like to support collaboration and facilitate the use of patterns for building streaming data pipelines. What component should they use?

A

Cloud Dataflow templates

21
Q

You need to run several MapReduce jobs on Hadoop along with one Pig job and four PySpark jobs. When you ran the jobs on premises, you used the department’s Hadoop cluster. Now you are running the jobs in GCP. What configuration for running these jobs would you recommend?

A

Create one cluster for each job and shut down the cluster when the job completes.

22
Q

You are working with a group of genetics researchers analyzing data generated by gene
sequencers. The data is stored in Cloud Storage. The analysis requires running a series of
six programs, each of which will output data that is used by the next process in the pipeline.
The final result set is loaded into BigQuery. What tool would you recommend for
orchestrating this workflow?

A

Cloud Composer

23
Q

The business owners of a data warehouse have determined that the current design of the
data warehouse is not meeting their needs. In addition to having data about the state of
systems at certain points in time, they need to know about all the times that data changed
between those points in time. What kind of data warehousing pipeline should be used to
meet this new requirement?

A

Change data capture