Explain the acronym “DAG”
Directed acyclic graph: a graph whose edges have a direction and that contains no cycles. Data pipelines are commonly modeled as DAGs, with each node a processing step and each edge a dependency.
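As a quick illustration, a pipeline DAG can be sketched with the standard library's `graphlib`; the stage names below are made up for the sketch:

```python
from graphlib import TopologicalSorter

# Each node maps to the set of stages it depends on (its predecessors).
dag = {
    "ingest":    set(),
    "transform": {"ingest"},     # transform runs after ingest
    "store":     {"transform"},
    "analyze":   {"store"},
}

# Because the graph is acyclic, a valid execution order exists.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['ingest', 'transform', 'store', 'analyze']
```

Orchestrators such as Cloud Composer (Apache Airflow) use exactly this idea: tasks form a DAG, and the scheduler runs them in a dependency-respecting order.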
What are the data pipeline stages?
Ingestion, transformation, storage, and analysis.
How does the data pipeline “Ingestion” stage work?
Ingestion is the process of bringing data into the GCP environment.
How does the data pipeline “Transformation” stage work?
Transformation is the process of mapping data from the structure used in the source system to the structure used in the storage and analysis stages of the data pipeline.
Transformation includes
- trimming records
- filtering
- joins
- and other operations on the data
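A minimal sketch of those operations in plain Python; the field names and validation rule are made up for illustration:

```python
# Raw records as they might arrive from a source system.
raw = [
    {"id": 1, "name": "  Alice ", "country": "PL"},
    {"id": 2, "name": "Bob",      "country": ""},    # bad record: no country
    {"id": 3, "name": " Carol",   "country": "DE"},
]
regions = {"PL": "EMEA", "DE": "EMEA"}  # lookup table used for the join

# Trim: strip whitespace from string fields.
trimmed = [{**r, "name": r["name"].strip()} for r in raw]
# Filter: drop records that fail a validation rule.
valid = [r for r in trimmed if r["country"]]
# Join: enrich each record from the lookup table.
joined = [{**r, "region": regions[r["country"]]} for r in valid]

print(joined)
```

In a real pipeline these steps would typically run in a managed service such as Cloud Dataflow rather than in a single script.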
How does the data pipeline “Storage” stage work?
Storage persists the ingested and transformed data so that it is available for analysis, for example in Cloud Storage or BigQuery.
How does the data pipeline “Analysis” stage work?
Analysis can take several forms, including
- SQL querying
- report generation
- machine learning model training
- data science analysis
What are the types of data pipelines?
ETL, ELT, change data capture, and stream processing pipelines.
What is ETL and how does it work?
Extraction, transformation, and load (ETL): data is transformed before it is loaded into the target data store.
What is ELT and how does it work?
Extract, load, and transformation (ELT).
In an ELT process, data is loaded into a database before the data is transformed.
(That is how we do it: in the data hub, the raw data is published to a topic.)
What is change data capture and how does it work?
In a change data capture approach, each change in a source system is captured and recorded in a data store. This is helpful in cases where it is important to know all changes over time and not just the state of the database at the time of data extraction.
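A toy sketch of the idea: instead of only overwriting state, every change is appended to a log, so the full history can be replayed later (the names here are illustrative, not any CDC tool's API):

```python
change_log = []  # append-only record of every change

def apply_change(state, key, value):
    """Record the change, then apply it to the current state."""
    change_log.append({"key": key, "value": value})
    state[key] = value

state = {}
apply_change(state, "balance", 100)
apply_change(state, "balance", 80)
apply_change(state, "balance", 95)

# A plain extract would only see the final state...
print(state)            # {'balance': 95}
# ...while the change log preserves all intermediate values.
print(len(change_log))  # 3
```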
What are stream processing pipelines and how do they work?
Streams are unending, continuous sources of data; stream processing pipelines operate on the data as it arrives rather than in batches.
Explain Sliding and Tumbling Windows
Sliding windows contain data that overlaps with neighboring windows (e.g., “the last 3 values”).
Tumbling windows contain data that does not overlap (e.g., fixed 15-minute windows).
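The difference can be sketched over a simple list of values (the window size of 3 is arbitrary):

```python
values = [1, 2, 3, 4, 5, 6]

# Sliding window of size 3: consecutive windows overlap.
sliding = [values[i:i + 3] for i in range(len(values) - 2)]
# -> [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]

# Tumbling window of size 3: windows partition the data, no overlap.
tumbling = [values[i:i + 3] for i in range(0, len(values), 3)]
# -> [[1, 2, 3], [4, 5, 6]]

print(sliding)
print(tumbling)
```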
What GCP service would you recommend for ingesting IoT data?
Cloud Pub/Sub, which is a scalable, managed messaging queue
that is typically used for ingesting high-volume streaming data
We have defined a set of rules for filtering out bad data before it gets into the data mart. At what stage of the data pipeline would you implement those rules?
Transformation
A team of data warehouse developers is migrating a set of legacy Python scripts that have
been used to transform data as part of an ETL process. They would like to use a service
that allows them to use Python and requires minimal administration and operations support.
Which GCP service would you recommend?
Cloud Dataflow
You are using Cloud Pub/Sub to buffer records from an application that generates a stream of data based on user interactions with a website. The messages are read by another service that transforms the data and sends it to a machine learning model that will use it for training. A developer has just released some new code, and you notice that messages are sent repeatedly at 10-minute intervals. What might be the cause of this problem?
The new code disabled acknowledgments from the consumer.
That caused Cloud Pub/Sub to consider the message outstanding for up to the duration of the acknowledgment wait time and then resend the message.
It is considered a good practice to make your processing logic idempotent when consuming
messages from a Cloud Pub/Sub topic. Why is that?
Messages may be delivered multiple times and therefore
processed multiple times. If the logic were not idempotent, it could leave the application in
an incorrect state, such as that which could occur if you counted the same message multiple
times.
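One common way to make a consumer idempotent is to track already-processed message IDs, so a redelivered message has no additional effect. This is a sketch of the pattern, not Pub/Sub client code:

```python
processed_ids = set()
counter = 0

def handle(message_id, amount):
    """Process a message at most once, even if it is delivered twice."""
    global counter
    if message_id in processed_ids:
        return  # duplicate delivery: safely ignored
    processed_ids.add(message_id)
    counter += amount

handle("msg-1", 10)
handle("msg-2", 5)
handle("msg-1", 10)  # Pub/Sub redelivers msg-1

print(counter)  # 15, not 25: the duplicate did not double-count
```

In production the set of processed IDs would live in a durable store shared by all consumer instances, not in process memory.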
A group of IoT sensors is sending streaming data to a Cloud Pub/Sub topic. A Cloud Dataflow
service pulls messages from the topic and sorts the messages by event time.
A message is expected from each sensor every minute. If a message is not received from a
sensor, the stream processing application should use the average of the values in the last
four messages. What kind of window would you use to implement the missing data logic?
Sliding window
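The missing-value rule from the question can be sketched like this: when a reading is absent (`None` here), substitute the average of the last four received values:

```python
readings = [10.0, 12.0, 11.0, 13.0, None, 14.0]  # None = missing minute

filled = []
recent = []  # sliding window over the last four received values
for r in readings:
    if r is None:
        r = sum(recent) / len(recent)  # average of the last four messages
    else:
        recent.append(r)
        recent = recent[-4:]  # keep only the four most recent values
    filled.append(r)

print(filled)  # the None became (10 + 12 + 11 + 13) / 4 = 11.5
```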
You are tasked with designing a way to share data from on-premises pipelines that use
Kafka with GCP data pipelines that use Cloud Pub/Sub. How would you do that?
You should use CloudPubSubConnector and Kafka Connect.
The connector is developed and maintained by the Cloud Pub/Sub team for this purpose.
A team of developers wants to create standardized patterns for processing IoT data. Several
teams will use these patterns. The developers would like to support collaboration and facilitate the use of patterns for building streaming data pipelines. What component should they use?
Cloud Dataflow templates
You need to run several MapReduce jobs on Hadoop along with one Pig job and four PySpark jobs. When you ran the jobs on premises, you used the department’s Hadoop cluster. Now you are running the jobs in GCP. What configuration for running these jobs would you recommend?
Create one cluster for each job and shut down the cluster when the job completes.
You are working with a group of genetics researchers analyzing data generated by gene
sequencers. The data is stored in Cloud Storage. The analysis requires running a series of
six programs, each of which will output data that is used by the next process in the pipeline.
The final result set is loaded into BigQuery. What tool would you recommend for
orchestrating this workflow?
Cloud Composer
The business owners of a data warehouse have determined that the current design of the
data warehouse is not meeting their needs. In addition to having data about the state of
systems at certain points in time, they need to know about all the times that data changed
between those points in time. What kind of data warehousing pipeline should be used to
meet this new requirement?
Change data capture