Explain the acronym “DAG”
Directed acyclic graph: a graph whose edges have a direction and that contains no cycles. Data pipelines are commonly modeled as DAGs, with each node a processing step and each edge a dependency.
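As a quick illustration, a pipeline DAG can be sketched with the standard library's `graphlib`; the stage names below are made up for the sketch:

```python
from graphlib import TopologicalSorter

# Each node maps to the set of stages it depends on (its predecessors).
dag = {
    "ingest":    set(),
    "transform": {"ingest"},     # transform runs after ingest
    "store":     {"transform"},
    "analyze":   {"store"},
}

# Because the graph is acyclic, a valid execution order exists.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['ingest', 'transform', 'store', 'analyze']
```

Orchestrators such as Cloud Composer (Apache Airflow) use exactly this idea: tasks form a DAG, and the scheduler runs them in a dependency-respecting order.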
What are the data pipeline stages?
Ingestion, transformation, storage, and analysis.
How does the data pipeline “Ingestion” stage work?
Ingestion is the process of bringing data into the GCP environment.
How does the data pipeline “Transformation” stage work?
Transformation is the process of mapping data from the structure used in the source system to the structure used in the storage and analysis stages of the data pipeline.
Transformation includes
- trimming records
- filtering
- joins
- and other operations on the data
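A minimal sketch of those operations in plain Python; the field names and validation rule are made up for illustration:

```python
# Raw records as they might arrive from a source system.
raw = [
    {"id": 1, "name": "  Alice ", "country": "PL"},
    {"id": 2, "name": "Bob",      "country": ""},    # bad record: no country
    {"id": 3, "name": " Carol",   "country": "DE"},
]
regions = {"PL": "EMEA", "DE": "EMEA"}  # lookup table used for the join

# Trim: strip whitespace from string fields.
trimmed = [{**r, "name": r["name"].strip()} for r in raw]
# Filter: drop records that fail a validation rule.
valid = [r for r in trimmed if r["country"]]
# Join: enrich each record from the lookup table.
joined = [{**r, "region": regions[r["country"]]} for r in valid]

print(joined)
```

In a real pipeline these steps would typically run in a managed service such as Cloud Dataflow rather than in a single script.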
How does the data pipeline “Storage” stage work?
Storage persists the ingested and transformed data so that it is available for analysis, for example in Cloud Storage or BigQuery.
How does the data pipeline “Analysis” stage work?
Analysis can take several forms, including
- SQL querying
- report generation
- machine learning model training
- data science analysis
What are the types of data pipelines?
ETL, ELT, change data capture, and stream processing pipelines.
What is ETL and how does it work?
Extraction, transformation, and load (ETL): data is transformed before it is loaded into the target data store.
What is ELT and how does it work?
Extract, load, and transformation (ELT).
In an ELT process, data is loaded into a database before the data is transformed.
(That is how we do it: in the data hub, the raw data is published to a topic.)
What is change data capture and how does it work?
In a change data capture approach, each change in a source system is captured and recorded in a data store. This is helpful in cases where it is important to know all changes over time and not just the state of the database at the time of data extraction.
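A toy sketch of the idea: instead of only overwriting state, every change is appended to a log, so the full history can be replayed later (the names here are illustrative, not any CDC tool's API):

```python
change_log = []  # append-only record of every change

def apply_change(state, key, value):
    """Record the change, then apply it to the current state."""
    change_log.append({"key": key, "value": value})
    state[key] = value

state = {}
apply_change(state, "balance", 100)
apply_change(state, "balance", 80)
apply_change(state, "balance", 95)

# A plain extract would only see the final state...
print(state)            # {'balance': 95}
# ...while the change log preserves all intermediate values.
print(len(change_log))  # 3
```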
What are stream processing pipelines and how do they work?
Streams are unending, continuous sources of data; stream processing pipelines operate on the data as it arrives rather than in batches.
Explain Sliding and Tumbling Windows
Sliding windows contain data that overlaps with neighboring windows (e.g., “the last 3 values”).
Tumbling windows contain data that does not overlap (e.g., fixed 15-minute windows).
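The difference can be sketched over a simple list of values (the window size of 3 is arbitrary):

```python
values = [1, 2, 3, 4, 5, 6]

# Sliding window of size 3: consecutive windows overlap.
sliding = [values[i:i + 3] for i in range(len(values) - 2)]
# -> [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]

# Tumbling window of size 3: windows partition the data, no overlap.
tumbling = [values[i:i + 3] for i in range(0, len(values), 3)]
# -> [[1, 2, 3], [4, 5, 6]]

print(sliding)
print(tumbling)
```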
What GCP service would you recommend for ingesting IoT data?
Cloud Pub/Sub, which is a scalable, managed messaging queue
that is typically used for ingesting high-volume streaming data
We have defined a set of rules for filtering out bad data before it gets into the data mart. At what stage of the data pipeline would you implement those rules?
Transformation
A team of data warehouse developers is migrating a set of legacy Python scripts that have
been used to transform data as part of an ETL process. They would like to use a service
that allows them to use Python and requires minimal administration and operations support.
Which GCP service would you recommend?
Cloud Dataflow
You are using Cloud Pub/Sub to buffer records from an application that generates a stream of data based on user interactions with a website. The messages are read by another service that transforms the data and sends it to a machine learning model that will use it for training. A developer has just released some new code, and you notice that messages are sent repeatedly at 10-minute intervals. What might be the cause of this problem?
The new code disabled acknowledgments from the consumer.
That caused Cloud Pub/Sub to consider the message outstanding for up to the duration of the acknowledgment wait time and then resend the message.
It is considered a good practice to make your processing logic idempotent when consuming
messages from a Cloud Pub/Sub topic. Why is that?
Messages may be delivered multiple times and therefore
processed multiple times. If the logic were not idempotent, it could leave the application in
an incorrect state, such as that which could occur if you counted the same message multiple
times.
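One common way to make a consumer idempotent is to track already-processed message IDs, so a redelivered message has no additional effect. This is a sketch of the pattern, not Pub/Sub client code:

```python
processed_ids = set()
counter = 0

def handle(message_id, amount):
    """Process a message at most once, even if it is delivered twice."""
    global counter
    if message_id in processed_ids:
        return  # duplicate delivery: safely ignored
    processed_ids.add(message_id)
    counter += amount

handle("msg-1", 10)
handle("msg-2", 5)
handle("msg-1", 10)  # Pub/Sub redelivers msg-1

print(counter)  # 15, not 25: the duplicate did not double-count
```

In production the set of processed IDs would live in a durable store shared by all consumer instances, not in process memory.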
A group of IoT sensors is sending streaming data to a Cloud Pub/Sub topic. A Cloud Dataflow
service pulls messages from the topic and sorts the messages by event time.
A message is expected from each sensor every minute. If a message is not received from a
sensor, the stream processing application should use the average of the values in the last
four messages. What kind of window would you use to implement the missing data logic?
Sliding window
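The missing-value rule from the question can be sketched like this: when a reading is absent (`None` here), substitute the average of the last four received values:

```python
readings = [10.0, 12.0, 11.0, 13.0, None, 14.0]  # None = missing minute

filled = []
recent = []  # sliding window over the last four received values
for r in readings:
    if r is None:
        r = sum(recent) / len(recent)  # average of the last four messages
    else:
        recent.append(r)
        recent = recent[-4:]  # keep only the four most recent values
    filled.append(r)

print(filled)  # the None became (10 + 12 + 11 + 13) / 4 = 11.5
```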
You are tasked with designing a way to share data from on-premises pipelines that use
Kafka with GCP data pipelines that use Cloud Pub/Sub. How would you do that?
You should use CloudPubSubConnector and Kafka Connect.
The connector is developed and maintained by the Cloud Pub/Sub team for this purpose.
A team of developers wants to create standardized patterns for processing IoT data. Several
teams will use these patterns. The developers would like to support collaboration and facilitate the use of patterns for building streaming data pipelines. What component should they use?
Cloud Dataflow templates
You need to run several MapReduce jobs on Hadoop along with one Pig job and four PySpark jobs. When you ran the jobs on premises, you used the department’s Hadoop cluster. Now you are running the jobs in GCP. What configuration for running these jobs would you recommend?
Create one cluster for each job and shut down the cluster when the job completes.
You are working with a group of genetics researchers analyzing data generated by gene
sequencers. The data is stored in Cloud Storage. The analysis requires running a series of
six programs, each of which will output data that is used by the next process in the pipeline.
The final result set is loaded into BigQuery. What tool would you recommend for
orchestrating this workflow?
Cloud Composer
The business owners of a data warehouse have determined that the current design of the
data warehouse is not meeting their needs. In addition to having data about the state of
systems at certain points in time, they need to know about all the times that data changed
between those points in time. What kind of data warehousing pipeline should be used to
meet this new requirement?
Change data capture