General Flashcards

Question 1

Q

OpenSearch Service

Answer

A

can be used as a vector database.
stores and retrieves vectors as high-dimensional points.
includes capabilities for fast, efficient lookup of nearest neighbors in N-dimensional space.
is suitable for storing information for retrieval-augmented generation (RAG) use cases.
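
As a rough illustration of nearest-neighbor lookup, the sketch below ranks stored vectors by cosine similarity to a query vector. It is a brute-force toy, not how OpenSearch Service implements search (a production vector store typically relies on approximate indexes). The document keys and vectors are made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, vectors, k=2):
    """Return the keys of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, v), key) for key, v in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]

# Hypothetical 3-dimensional "embeddings" for three documents.
docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
top = nearest_neighbors([1.0, 0.0, 0.0], docs, k=2)
print(top)
```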

Question 2

Q

K-means clustering

Answer

A

is a popular unsupervised machine learning algorithm that partitions a dataset into a pre-defined number of clusters (k) by assigning each point to its nearest cluster centroid.
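
A minimal sketch of the usual iterative procedure (assign each point to the nearest centroid, then move each centroid to the mean of its members). The initialization below, which takes the first k points as centroids, is a simplifying assumption for the example; real implementations use smarter seeding.

```python
def kmeans(points, k, iterations=10):
    """Partition points (tuples of floats) into k clusters."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, clusters

points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
centroids, clusters = kmeans(points, k=2)
print(clusters)
```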

Question 3

Q

Pre-training bias metrics

Answer

A

Class Imbalance (CI)
Difference in Proportions of Labels (DPL)
Kullback-Leibler Divergence (KL)
Jensen-Shannon Divergence (JS)
Lp-norm (LP)
Total Variation Distance (TVD)
Kolmogorov-Smirnov (KS)
Conditional Demographic Disparity (CDD)

https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bia

Question 4

Q

Post-training bias metrics

Answer

A

Difference in Positive Proportions in Predicted Labels (DPPL)
Disparate Impact (DI)
Difference in Conditional Acceptance (DCAcc)
Difference in Conditional Rejection (DCR)
Specificity difference (SD)
Recall Difference (RD)
Difference in Acceptance Rates (DAR)
Difference in Rejection Rates (DRR)
Accuracy Difference (AD)
Treatment Equality (TE)
Conditional Demographic Disparity in Predicted Labels (CDDPL)
Counterfactual Fliptest (FT)
Generalized entropy (GE)

https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-post-tra

Question 5

Q

Partial dependence plots (PDP)

Answer

A

show the dependence of the predicted target response on a set of input features of interest.
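
One way to see what a PDP plots: for each candidate value of the feature of interest, overwrite that feature in every row, predict, and average the predictions. The sketch below does this by hand; the `model` function and the data rows are hypothetical stand-ins, not part of any library.

```python
def partial_dependence(model, rows, feature_index, grid):
    """Average prediction when the chosen feature is forced to each grid value."""
    curve = []
    for value in grid:
        predictions = []
        for row in rows:
            modified = list(row)
            modified[feature_index] = value  # override the feature of interest
            predictions.append(model(modified))
        curve.append(sum(predictions) / len(predictions))
    return curve

# Hypothetical model: prediction = 2 * feature0 + feature1.
model = lambda row: 2 * row[0] + row[1]
rows = [[0, 1], [0, 3]]
curve = partial_dependence(model, rows, feature_index=0, grid=[0, 1, 2])
print(curve)
```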

Question 6

Q

Shapley values

Answer

A

determine the contribution that each feature made to a model's predictions.
come from cooperative game theory, where they are a solution concept for fairly distributing the total gains or costs among a group of players who have collaborated.
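
To make the game-theory definition concrete, the sketch below computes exact Shapley values for a tiny two-player game by averaging each player's marginal contribution over every ordering. The payoff table is invented for illustration; in model explainability the "players" would be features, and exact enumeration is only feasible for very small sets.

```python
from itertools import permutations

def shapley_values(players, value):
    """Average marginal contribution of each player over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}

# Hypothetical payoffs for each coalition of players A and B.
payoffs = {
    frozenset(): 0,
    frozenset({"A"}): 10,
    frozenset({"B"}): 20,
    frozenset({"A", "B"}): 50,
}
result = shapley_values(["A", "B"], lambda coalition: payoffs[coalition])
print(result)
```

The values sum to the payoff of the full coalition, which is the "fair distribution" property the card refers to.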

Question 7

Q

The difference in proportions of labels (DPL)

Answer

A

compares the proportion of observed outcomes with positive labels for facet d with the proportion of observed outcomes with positive labels for facet a in a training dataset.
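
A small sketch of the comparison, assuming binary labels where 1 is the positive outcome and the convention DPL = q_a - q_d, with q_a and q_d the proportions of positive labels in facets a and d. The function name and the sample data are illustrative only.

```python
def difference_in_proportions_of_labels(labels, facets, facet_a="a", facet_d="d"):
    """DPL = q_a - q_d, where q_x is the share of positive labels in facet x."""
    def positive_share(facet):
        members = [label for label, f in zip(labels, facets) if f == facet]
        return sum(members) / len(members)
    return positive_share(facet_a) - positive_share(facet_d)

# Hypothetical dataset: facet a has 3 of 4 positives, facet d has 1 of 4.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
facets = ["a", "a", "a", "a", "d", "d", "d", "d"]
dpl = difference_in_proportions_of_labels(labels, facets)
print(dpl)
```

A value near zero indicates similar proportions of positive labels in both facets; a value far from zero indicates a label imbalance between them.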

Question 8

Q

Weight

Answer

A

Multiplies the input value, controlling its influence on the output.

Question 9

Q

Bias

Answer

A

Adds a constant term, allowing the model to fit the data better by shifting the activation function.
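
The roles of the weight and the bias can be seen in a single artificial neuron. The sketch below assumes a sigmoid activation, which is one common choice and not something the cards specify.

```python
import math

def neuron(x, weight, bias):
    """Single-input neuron: sigmoid(weight * x + bias)."""
    z = weight * x + bias          # weight scales the input, bias shifts it
    return 1 / (1 + math.exp(-z))  # sigmoid activation (assumed for this sketch)

no_bias = neuron(0.0, weight=2.0, bias=0.0)
with_bias = neuron(0.0, weight=2.0, bias=1.0)
print(no_bias, with_bias)
```

With a zero input the weight has no effect, so any change in the output comes from the bias shifting the activation.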

Question 10

Q

Text embeddings

Answer

A

are meaningful vector representations of unstructured text such as documents, paragraphs, and sentences. You input a body of text and the output is a (1 x n) vector. You can use embedding vectors for a wide variety of applications, such as semantic search and clustering.

Question 11

Q

Amazon Fraud Detector

Answer

A

is a fully managed service that you can use to detect fraudulent activities. Examples of fraudulent activities include fraudulent transactions or the creation of fake accounts.

Question 12

Q

Underfitting

Answer

A

leads to poor performance on both the training and test datasets.

Question 13

Q

A high bias

Answer

A

indicates underfitting, where the model is too simplistic to capture the patterns in the data.

Question 14

Q

Low variance

Answer

A

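means the model's predictions change little across different training sets. Combined with high bias, it suggests that the model doesn’t capture the complexity of the data.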

Question 15

Q

Overfitting

Answer

A

occurs when the model performs well on the training data but poorly on unseen data, such as the validation and test sets. This happens because the model learns the noise and intricate details of the training data, reducing its generalizability.

Question 16

Q

Grid Search

Answer

A

enumerates every possible combination from a pre-defined “grid” of hyperparameter values.
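
A minimal sketch of that enumeration using `itertools.product`. The hyperparameter names, grid values, and the `toy_score` function are hypothetical; in practice the score would come from training and validating a model.

```python
from itertools import product

# Hypothetical grid of hyperparameter values.
grid = {"learning_rate": [0.01, 0.1, 1.0], "max_depth": [2, 4]}

def toy_score(params):
    """Stand-in for a validation score; peaks at learning_rate=0.1, max_depth=4."""
    return -(abs(params["learning_rate"] - 0.1) + abs(params["max_depth"] - 4))

# Every combination of values: 3 learning rates x 2 depths = 6 candidates.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
best = max(candidates, key=toy_score)
print(len(candidates), best)
```

The number of candidates is the product of the list lengths, which is why grid search becomes costly as hyperparameters are added.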

Question 17

Q

Bayesian Optimization is valuable when

Answer

A

each individual evaluation is expensive and you want to minimize the number of trials.

Question 18

Q

Random Search

Answer

A

samples from the search space rather than covering it completely. It’s typically a better choice than grid search when the search space is large.
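
A small sketch of the idea: draw a fixed number of random configurations and keep the best. The ranges, the log-uniform sampling of the learning rate, and the `toy_score` function are assumptions made for the example.

```python
import random

def toy_score(params):
    """Stand-in for a validation score (higher is better)."""
    return -(abs(params["learning_rate"] - 0.1) + abs(params["max_depth"] - 4))

def random_search(score, n_trials, seed=0):
    """Evaluate n_trials randomly sampled configurations and return the best."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform in [0.001, 1]
            "max_depth": rng.randint(2, 10),            # inclusive integer range
        }
        trials.append((score(params), params))
    return max(trials, key=lambda trial: trial[0])

best_score, best_params = random_search(toy_score, n_trials=20)
print(best_score, best_params)
```

The cost is fixed by `n_trials` regardless of how many hyperparameters or values the search space contains.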

Question 19

Q

Hyperband

Answer

A

is designed for large search spaces and relies on early stopping. It’s not intended to fully enumerate the space; its strength is in pruning unpromising configurations quickly.
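
The pruning idea can be sketched with successive halving, the building block that Hyperband runs repeatedly with different trade-offs between the number of configurations and the budget per configuration. This is a simplified illustration, not the full Hyperband algorithm; the `evaluate` function and the candidate values are hypothetical.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Repeatedly keep the top 1/eta of configs while multiplying the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[:max(1, len(ranked) // eta)]  # prune the weakest
        budget *= eta                                    # survivors get more resources
    return survivors[0]

def evaluate(config, budget):
    """Toy score: best at config == 0.3; the penalty shrinks as budget grows."""
    return -abs(config - 0.3) * (1 + 1 / budget)

winner = successive_halving([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8], evaluate)
print(winner)
```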

Question 20

Q

Dimensionality reduction

Answer

A

is a technique that simplifies datasets by reducing the number of input variables or features. This simplification can improve computational efficiency and model performance, especially as datasets grow in size and complexity.

Question 21

Q

SMOTE

Answer

A

stands for Synthetic Minority Oversampling Technique, a technique used to address class imbalance in datasets. It works by generating synthetic data points for the minority class, which helps balance the dataset.
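
A simplified sketch of how the synthetic points can be generated: pick a minority sample, pick one of its nearest minority neighbors, and interpolate at a random position between them. This is a toy version for intuition, not a replacement for a library implementation; the sample points are invented.

```python
import random

def smote(minority, n_synthetic, k=2, seed=0):
    """Create synthetic minority points by interpolating toward near neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen base point (excluding itself).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic

minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote(minority, n_synthetic=5)
print(new_points)
```

Because every synthetic point lies on a segment between two real minority samples, the new data stays inside the region the minority class already occupies.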