Practice Questions Flashcards

(34 cards)

1
Q

A company wants to automatically convert streaming JSON data into Apache Parquet before storing it in an S3 bucket

A
  • Use Amazon Data Firehose (formerly Amazon Kinesis Data Firehose)
  • Firehose has built-in record format conversion that can transform incoming JSON into Apache Parquet or ORC before delivery to S3
  • Simpler alternative to Kinesis Data Streams
2
Q

A company uses Amazon EMR for its ETL processes. The company is looking for an alternative with a lower operational overhead

A

Run the ETL jobs using AWS Glue

3
Q

Which service should you use to deliver streaming data from Amazon MSK to a Redshift cluster with low latency?

A

Redshift Streaming Ingestion

4
Q

A data engineer is building a pipeline for streaming data. The data will be fetched from various sources.

A

Create an application that uses the Kinesis Producer Library (KPL) to load streaming data from the various sources into a Kinesis data stream.

5
Q

A company wants to set up a data lake on Amazon S3. The data will be sourced from S3 buckets located in different AWS accounts. Which service can simplify the implementation of the data lake?

A

AWS Lake Formation

6
Q

An image classifier is getting high accuracy on the validation dataset. However, the accuracy significantly dropped when tested against real data. How can you improve the model’s performance?

A

Take existing images from the training data, apply data augmentation techniques (e.g., flipping, rotating, adjusting brightness), add the augmented images to the training data, and retrain the model

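A minimal sketch of such augmentations (using NumPy on a toy array; real pipelines would typically use an image library, which is an assumption beyond the card):

```python
import numpy as np

def augment(image: np.ndarray) -> list[np.ndarray]:
    """Produce simple augmented variants of an H x W image array."""
    flipped_lr = np.fliplr(image)             # horizontal flip
    flipped_ud = np.flipud(image)             # vertical flip
    rotated = np.rot90(image)                 # 90-degree rotation
    brighter = np.clip(image * 1.2, 0, 255)   # brightness adjustment
    return [flipped_lr, flipped_ud, rotated, brighter]

img = np.arange(12, dtype=float).reshape(3, 4)  # toy "image"
variants = augment(img)  # four extra training examples from one original
```
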
7
Q

What methods can a machine learning engineer use to reduce the size of a large dataset while retaining only relevant features?

A
  1. Principal Component Analysis (PCA)
  2. t-Distributed Stochastic Neighbor Embedding (t-SNE)
8
Q

Principal Component Analysis

A

Principal Component Analysis (PCA) is a dimensionality reduction technique used to simplify complex datasets by transforming them into a new set of uncorrelated variables called principal components.

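A sketch of PCA in practice (using scikit-learn; the synthetic dataset below is illustrative, not from the card):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples, 5 features, but the signal lives in a 2-D subspace
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))
X += 0.01 * rng.normal(size=(200, 5))  # small noise

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)  # project onto the top-2 principal components

print(X_reduced.shape)                      # (200, 2)
print(pca.explained_variance_ratio_.sum())  # near 1.0: two components capture almost all variance
```
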
9
Q

t-SNE

A
  • t-distributed Stochastic Neighbor Embedding
  • used for dimensionality reduction, particularly for visualizing high-dimensional data in lower dimensions (like 2D or 3D)
10
Q

How does t-SNE work?

A
  • Maps high-dimensional data points to a lower-dimensional space (e.g., 2D or 3D) while trying to preserve the relative distances between them.
  • It focuses on keeping nearby points close together in the lower-dimensional representation, making it useful for revealing clusters and structures within the data.
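
A sketch of that mapping (using scikit-learn; the two synthetic clusters are illustrative, not from the card):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# two well-separated clusters in 50 dimensions
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 50)),
    rng.normal(8.0, 1.0, size=(50, 50)),
])

# perplexity must be smaller than the number of samples
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)  # (100, 2): each point mapped to 2-D, ready for a scatter plot
```
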
11
Q

A dataset contains a mixture of categorical and numerical features. What feature engineering method should be done to prepare the data for training?

A

One-hot encoding

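A sketch of one-hot encoding the categorical columns while leaving numerics untouched (using pandas; the tiny DataFrame is illustrative, not from the card):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "green", "red"],  # categorical feature
    "size_cm": [10.0, 12.5, 9.0],      # numerical feature, passed through unchanged
})

# Encode only the categorical column; each category becomes a 0/1 indicator column
encoded = pd.get_dummies(df, columns=["color"])
print(list(encoded.columns))  # ['size_cm', 'color_green', 'color_red']
```
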
12
Q

X and Y variables have a Pearson correlation coefficient of -0.98. What does it indicate?

A

Very strong negative correlation

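A quick numerical check of what a coefficient near -1 looks like (using NumPy; the data points are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = -2.0 * x + np.array([0.1, -0.1, 0.05, -0.05, 0.0])  # y falls almost exactly as x rises

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
print(r < -0.95)  # True: a very strong negative correlation
```
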
13
Q

A machine learning engineer handles a small dataset with missing values. What should they do to ensure no data points are lost?

A

Use imputation techniques to fill in missing values

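A sketch of mean imputation (using scikit-learn's `SimpleImputer`; the tiny matrix is illustrative, not from the card):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [7.0, np.nan],
])

# Replace each NaN with its column mean, so no rows are dropped
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)
# [[1. 2.]
#  [4. 4.]
#  [7. 3.]]
```
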
14
Q

An ML engineer wants to evaluate the performance of a binary classification model visually. What visualization technique should be used?

A

Confusion matrix

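A sketch of building one (using scikit-learn; the labels below are illustrative, not from the card):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

cm = confusion_matrix(y_true, y_pred)
print(cm)
# Rows = actual class, columns = predicted class:
# [[3 1]   3 true negatives, 1 false positive
#  [1 3]]  1 false negative, 3 true positives
```
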
15
Q

An ML engineer wants to discover topics available within a large text dataset. Which algorithm should the engineer train the model on?

A

Latent Dirichlet Allocation (LDA) algorithm

16
Q

A SageMaker Object2vec model is overfitting on a validation dataset. How do you solve this problem?

A

Use regularization; in this case, adjust the value of the dropout parameter.

17
Q

A neural network model is being trained using a large dataset in batches. As the training progresses, the loss function begins to oscillate. Which could be the cause?

A

The learning rate is too high

18
Q

What SageMaker built-in algorithm is suitable for predicting click-through rate (CTR) patterns?

A

Factorization machines

19
Q

An ML engineer wants to auto-scale the instances behind a SageMaker endpoint according to the volume of incoming requests. Which metric should this scaling be based on?

A

InvocationsPerInstance

20
Q

Which AWS service can you use to convert audio into text?

A

Amazon Transcribe

21
Q

An ML engineer is training a model on a cluster of SageMaker instances. The traffic between the instances must be encrypted.

A

Enable inter-container traffic encryption

22
Q

A company wants to use Amazon SageMaker to deploy various ML models in a cost-effective way.

A
  • Use a multi-model endpoint (MME)
  • MME allows multiple models to share the same compute instance(s), significantly reducing infrastructure costs, particularly for scenarios involving thousands of models.
23
Q

What AWS service can help you build an AI-powered chatbot that can interact with customers?

A

Amazon Lex

24
Q

Word2Vec

A
  • Provides embeddings for words
  • Used in sentiment analysis, document classification, and natural language understanding
25
Q

Object2Vec

A

Generates embeddings of more general-purpose objects such as sentences, customers, and products.

26
Q

Factorization Machines (FM)

A
  • A versatile supervised learning algorithm
  • Particularly effective in recommendation systems and other tasks dealing with **sparse, high-dimensional data**

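
A sketch of the degree-2 FM prediction, y = w0 + Σᵢ wᵢxᵢ + Σᵢ<ⱼ ⟨Vᵢ,Vⱼ⟩xᵢxⱼ, using the fast pairwise-interaction identity (NumPy; the random weights are illustrative, not a trained model):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Degree-2 factorization machine prediction.
    The pairwise term uses the O(k*n) identity:
    0.5 * sum_f ((V[:, f] @ x)^2 - (V[:, f]**2 @ x**2))
    """
    linear = w0 + w @ x
    interactions = 0.5 * np.sum((x @ V) ** 2 - (x ** 2) @ (V ** 2))
    return linear + interactions

n, k = 4, 2                            # features, latent factors
rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 1.0, 0.0])     # sparse, one-hot-style input
w0, w, V = 0.1, rng.normal(size=n), rng.normal(size=(n, k))
y = fm_predict(x, w0, w, V)
```

The latent factors `V` let the model estimate interaction weights even for feature pairs that never co-occur in training, which is why FMs work well on sparse CTR-style data.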
27
Q

What are hyperparameters?

A

Hyperparameters are configuration variables external to the model that are set before training begins and control how the model learns.

28
Q

What is dropout?

A

Dropout is a regularization technique used to prevent overfitting in neural networks.

29
Q

How does dropout work?

A

During training, dropout randomly "drops out" (sets to zero) a certain percentage of neurons in a layer at each training step. The dropped-out neurons do not contribute to the forward pass or backpropagation for that training example.

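
A sketch of that mechanism on a single activation vector (NumPy, inverted-dropout style; the sizes are illustrative):

```python
import numpy as np

def dropout(activations: np.ndarray, rate: float, rng) -> np.ndarray:
    """Inverted dropout: zero out ~`rate` of units and scale survivors
    by 1/(1-rate) so the expected activation is unchanged at inference."""
    keep = rng.random(activations.shape) >= rate  # True for surviving neurons
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones(10_000)
dropped = dropout(a, rate=0.5, rng=rng)
print((dropped == 0).mean())  # roughly 0.5 of the units are zeroed
print(dropped.mean())         # close to 1.0 thanks to the rescaling
```
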
30
Q

What is the typical range of the dropout rate?

A

0.2 to 0.5

31
Q

What is the effect of dropout?

A
  • Forces the network to learn more robust features that are less reliant on any single neuron
  • Encourages the model to learn redundant representations
  • Effectively trains an ensemble of smaller, different networks at each iteration, leading to improved generalization and reduced overfitting

32
Q

Overfitting

A

When a model learns the training data too well, including noise and patterns specific to it that do not generalize to unseen data.

33
Q

Naive Bayesian vs. Full Bayesian

A

Naive Bayes relies heavily on the assumption that the predictors are independent of one another. If the dataset contains highly correlated features, use Full Bayesian instead.

34
Q

Best solution for image content moderation

A

Amazon Rekognition