Practice Questions Flashcards

(34 cards)

1
Q

A company wants to automatically convert streaming JSON data into Apache Parquet before storing it in an S3 bucket

A
  • Use Amazon Data Firehose (formerly Amazon Kinesis Data Firehose)
  • Firehose has built-in record format conversion that can transform incoming JSON into Apache Parquet or ORC before delivery to S3
  • Simpler alternative to Kinesis Data Streams
2
Q

A company uses Amazon EMR for its ETL processes. The company is looking for an alternative with a lower operational overhead

A

Run the ETL jobs using AWS Glue

3
Q

Which service should you use to deliver streaming data from Amazon MSK to a Redshift cluster with low latency?

A

Redshift Streaming Ingestion

4
Q

A data engineer is building a pipeline for streaming data. The data will be fetched from various sources.

A

Create an application that uses the Kinesis Producer Library (KPL) to load streaming data from the various sources into a Kinesis data stream.

5
Q

A company wants to set up a data lake on Amazon S3. The data will be sourced from S3 buckets located in different AWS accounts. Which service can simplify the implementation of the data lake?

A

AWS Lake Formation

6
Q

An image classifier is getting high accuracy on the validation dataset. However, the accuracy significantly dropped when tested against real data. How can you improve the model’s performance?

A

Take existing images from the training data, apply data augmentation techniques (e.g., flipping, rotating, adjusting brightness), add the augmented images to the training data, and retrain the model

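A minimal sketch of such augmentations (using NumPy on a toy array; real pipelines would typically use an image library, which is an assumption beyond the card):

```python
import numpy as np

def augment(image: np.ndarray) -> list[np.ndarray]:
    """Produce simple augmented variants of an H x W image array."""
    flipped_lr = np.fliplr(image)             # horizontal flip
    flipped_ud = np.flipud(image)             # vertical flip
    rotated = np.rot90(image)                 # 90-degree rotation
    brighter = np.clip(image * 1.2, 0, 255)   # brightness adjustment
    return [flipped_lr, flipped_ud, rotated, brighter]

img = np.arange(12, dtype=float).reshape(3, 4)  # toy "image"
variants = augment(img)  # four extra training examples from one original
```
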
7
Q

What methods can a machine learning engineer use to reduce the size of a large dataset while retaining only relevant features?

A
  1. Principal Component Analysis (PCA)
  2. t-Distributed Stochastic Neighbor Embedding (t-SNE)
8
Q

Principal Component Analysis

A

Principal Component Analysis (PCA) is a dimensionality reduction technique used to simplify complex datasets by transforming them into a new set of uncorrelated variables called principal components.

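A sketch of PCA in practice (using scikit-learn; the synthetic dataset below is illustrative, not from the card):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples, 5 features, but the signal lives in a 2-D subspace
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))
X += 0.01 * rng.normal(size=(200, 5))  # small noise

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)  # project onto the top-2 principal components

print(X_reduced.shape)                      # (200, 2)
print(pca.explained_variance_ratio_.sum())  # near 1.0: two components capture almost all variance
```
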
9
Q

t-SNE

A
  • t-distributed Stochastic Neighbor Embedding
  • used for dimensionality reduction, particularly for visualizing high-dimensional data in lower dimensions (like 2D or 3D)
10
Q

How does t-SNE work?

A
  • Maps high-dimensional data points to a lower-dimensional space (e.g., 2D or 3D) while trying to preserve the relative distances between them.
  • It focuses on keeping nearby points close together in the lower-dimensional representation, making it useful for revealing clusters and structures within the data.
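
A sketch of that mapping (using scikit-learn; the two synthetic clusters are illustrative, not from the card):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# two well-separated clusters in 50 dimensions
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 50)),
    rng.normal(8.0, 1.0, size=(50, 50)),
])

# perplexity must be smaller than the number of samples
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)  # (100, 2): each point mapped to 2-D, ready for a scatter plot
```
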
11
Q

A dataset contains a mixture of categorical and numerical features. What feature engineering method should be done to prepare the data for training?

A

One-hot encoding

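A sketch of one-hot encoding the categorical columns while leaving numerics untouched (using pandas; the tiny DataFrame is illustrative, not from the card):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "green", "red"],  # categorical feature
    "size_cm": [10.0, 12.5, 9.0],      # numerical feature, passed through unchanged
})

# Encode only the categorical column; each category becomes a 0/1 indicator column
encoded = pd.get_dummies(df, columns=["color"])
print(list(encoded.columns))  # ['size_cm', 'color_green', 'color_red']
```
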
12
Q

X and Y variables have a Pearson correlation coefficient of -0.98. What does it indicate?

A

Very strong negative correlation

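A quick numerical check of what a coefficient near -1 looks like (using NumPy; the data points are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = -2.0 * x + np.array([0.1, -0.1, 0.05, -0.05, 0.0])  # y falls almost exactly as x rises

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
print(r < -0.95)  # True: a very strong negative correlation
```
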
13
Q

A machine learning engineer handles a small dataset with missing values. What should they do to ensure no data points are lost?

A

Use imputation techniques to fill in missing values

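A sketch of mean imputation (using scikit-learn's `SimpleImputer`; the tiny matrix is illustrative, not from the card):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [7.0, np.nan],
])

# Replace each NaN with its column mean, so no rows are dropped
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)
# [[1. 2.]
#  [4. 4.]
#  [7. 3.]]
```
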
14
Q

An ML engineer wants to evaluate the performance of a binary classification model visually. What visualization technique should be used?

A

Confusion matrix

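A sketch of building one (using scikit-learn; the labels below are illustrative, not from the card):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

cm = confusion_matrix(y_true, y_pred)
print(cm)
# Rows = actual class, columns = predicted class:
# [[3 1]   3 true negatives, 1 false positive
#  [1 3]]  1 false negative, 3 true positives
```
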
15
Q

An ML engineer wants to discover topics available within a large text dataset. Which algorithm should the engineer train the model on?

A

Latent Dirichlet Allocation (LDA) algorithm

16
Q

A SageMaker Object2vec model is overfitting on a validation dataset. How do you solve this problem?

A

Use regularization; in this case, adjust the value of the dropout parameter.

17
Q

A neural network model is being trained using a large dataset in batches. As the training progresses, the loss function begins to oscillate. Which could be the cause?

A

The learning rate is too high

18
Q

What SageMaker built-in algorithm is suitable for predicting click-through rate (CTR) patterns?

A

Factorization machines

19
Q

An ML engineer wants to auto-scale the instances behind a SageMaker endpoint according to the volume of incoming requests. Which metric should this scaling be based on?

A

InvocationsPerInstance

20
Q

Which AWS service can you use to convert audio into text?

A

Amazon Transcribe

21
Q

An ML engineer is training a model on a cluster of SageMaker instances. The traffic between the instances must be encrypted.

A

Enable inter-container traffic encryption

22
Q

A company wants to use Amazon SageMaker to deploy various ML models in a cost-effective way.

A
  • Use a multi-model endpoint (MME)
  • MME allows multiple models to share the same compute instance(s), significantly reducing infrastructure costs, particularly for scenarios involving thousands of models.
23
Q

What AWS service can help you build an AI-powered chatbot that can interact with customers?

A

Amazon Lex

24
Q

Word2Vec

A
  • Provides embeddings for words
  • Used in sentiment analysis, document classification, and natural language understanding
25
Q

Object2Vec

A

Generates embeddings of more general-purpose objects such as sentences, customers, and products.

26
Q

Factorization Machines (FM)

A
  • A versatile supervised learning algorithm
  • Particularly effective in recommendation systems and other tasks dealing with **sparse, high-dimensional data**

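
A sketch of the degree-2 FM prediction, y = w0 + Σᵢ wᵢxᵢ + Σᵢ<ⱼ ⟨Vᵢ,Vⱼ⟩xᵢxⱼ, using the fast pairwise-interaction identity (NumPy; the random weights are illustrative, not a trained model):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Degree-2 factorization machine prediction.
    The pairwise term uses the O(k*n) identity:
    0.5 * sum_f ((V[:, f] @ x)^2 - (V[:, f]**2 @ x**2))
    """
    linear = w0 + w @ x
    interactions = 0.5 * np.sum((x @ V) ** 2 - (x ** 2) @ (V ** 2))
    return linear + interactions

n, k = 4, 2                            # features, latent factors
rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 1.0, 0.0])     # sparse, one-hot-style input
w0, w, V = 0.1, rng.normal(size=n), rng.normal(size=(n, k))
y = fm_predict(x, w0, w, V)
```

The latent factors `V` let the model estimate interaction weights even for feature pairs that never co-occur in training, which is why FMs work well on sparse CTR-style data.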
27
Q

What are hyperparameters?

A

Hyperparameters are configuration variables external to the model that are set before training begins and control how the model learns.

28
Q

What is dropout?

A

Dropout is a regularization technique used to prevent overfitting in neural networks.

29
Q

How does dropout work?

A

During training, dropout randomly "drops out" (sets to zero) a certain percentage of neurons in a layer at each training step. The dropped-out neurons do not contribute to the forward pass or backpropagation for that training example.

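
A sketch of that mechanism on a single activation vector (NumPy, inverted-dropout style; the sizes are illustrative):

```python
import numpy as np

def dropout(activations: np.ndarray, rate: float, rng) -> np.ndarray:
    """Inverted dropout: zero out ~`rate` of units and scale survivors
    by 1/(1-rate) so the expected activation is unchanged at inference."""
    keep = rng.random(activations.shape) >= rate  # True for surviving neurons
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones(10_000)
dropped = dropout(a, rate=0.5, rng=rng)
print((dropped == 0).mean())  # roughly 0.5 of the units are zeroed
print(dropped.mean())         # close to 1.0 thanks to the rescaling
```
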
30
Q

What is the typical range of the dropout rate?

A

0.2 to 0.5

31
Q

What is the effect of dropout?

A
  • Forces the network to learn more robust features that are less reliant on any single neuron
  • Encourages the model to learn redundant representations
  • Effectively trains an ensemble of smaller, different networks at each iteration, leading to improved generalization and reduced overfitting

32
Q

Overfitting

A

When a model learns the training data too well, including noise and patterns specific to it that do not generalize to unseen data.

33
Q

Naive Bayesian vs. Full Bayesian

A

Naive Bayes relies heavily on the assumption that the predictors are independent of one another. If the dataset contains highly correlated features, use Full Bayesian instead.

34
Q

Best solution for image content moderation

A

Amazon Rekognition