General Flashcards

(24 cards)

1
Q

OpenSearch Service

A
  • A vector database.
  • Stores and retrieves vectors as high-dimensional points.
  • Includes capabilities for efficient, fast nearest-neighbor lookup in N-dimensional space.
  • Suitable for storing information for RAG (Retrieval-Augmented Generation) use cases.
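The nearest-neighbor lookup at the heart of a vector database can be sketched in a few lines. This is an illustrative brute-force version only; real services such as OpenSearch use approximate indexes (e.g. HNSW) to stay fast at scale, and the stored points below are made up for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(query, vectors, k=2):
    """Return the k stored vectors closest to the query point."""
    return sorted(vectors, key=lambda v: euclidean(query, v))[:k]

store = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.1)]
print(nearest_neighbors((1.0, 1.0), store, k=2))  # [(1.0, 1.0), (0.9, 1.1)]
```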
2
Q

K-means clustering

A

is a popular unsupervised machine learning algorithm used for partitioning a dataset into a pre-defined number (k) of clusters.
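The two alternating steps of the algorithm (assign points to the nearest centroid, then move each centroid to its cluster's mean) can be sketched on 1-D data. This is a toy illustration with made-up numbers; in practice you would use a library implementation such as scikit-learn's KMeans.

```python
def kmeans_1d(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

print(kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.4, 8.6], centroids=[0.0, 10.0]))
# [1.0, 9.0] -- the centroids settle on the two obvious groups
```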

3
Q

Pre-training bias metrics

A
  • Class Imbalance (CI)
  • Difference in Proportions of Labels (DPL)
  • Kullback-Leibler Divergence (KL)
  • Jensen-Shannon Divergence (JS)
  • Lp-norm (LP)
  • Total Variation Distance (TVD)
  • Kolmogorov-Smirnov (KS)
  • Conditional Demographic Disparity (CDD)

https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bia
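Two of these metrics are simple enough to compute by hand, following the formulas in the Clarify docs: CI = (n_a - n_d) / (n_a + n_d), where n_a and n_d are the facet sizes, and DPL = q_a - q_d, where q is each facet's proportion of positive labels. The counts and labels below are made up for illustration.

```python
def class_imbalance(n_a, n_d):
    """CI: how under-represented facet d is relative to facet a."""
    return (n_a - n_d) / (n_a + n_d)

def dpl(labels_a, labels_d):
    """DPL: difference in the facets' proportions of positive (1) labels."""
    q_a = sum(labels_a) / len(labels_a)
    q_d = sum(labels_d) / len(labels_d)
    return q_a - q_d

print(class_imbalance(n_a=80, n_d=20))   # 0.6 -> facet d is under-represented
print(dpl([1, 1, 1, 0], [1, 0, 0, 0]))   # 0.75 - 0.25 = 0.5
```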

4
Q

Post-training bias metrics

A
  • Difference in Positive Proportions in Predicted Labels (DPPL)
  • Disparate Impact (DI)
  • Difference in Conditional Acceptance (DCAcc)
  • Difference in Conditional Rejection (DCR)
  • Specificity difference (SD)
  • Recall Difference (RD)
  • Difference in Acceptance Rates (DAR)
  • Difference in Rejection Rates (DRR)
  • Accuracy Difference (AD)
  • Treatment Equality (TE)
  • Conditional Demographic Disparity in Predicted Labels (CDDPL)
  • Counterfactual Fliptest (FT)
  • Generalized entropy (GE)

https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-post-tra
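The first two metrics on this list are easy to sketch from the Clarify formulas: DPPL = q'_a - q'_d and DI = q'_d / q'_a, where q' is each facet's proportion of positive *predicted* labels. The predictions below are made up for illustration; a DI below about 0.8 is the conventional "four-fifths rule" warning sign.

```python
def dppl(preds_a, preds_d):
    """Difference in positive proportions in predicted labels."""
    return sum(preds_a) / len(preds_a) - sum(preds_d) / len(preds_d)

def disparate_impact(preds_a, preds_d):
    """Ratio of facet d's positive-prediction rate to facet a's."""
    return (sum(preds_d) / len(preds_d)) / (sum(preds_a) / len(preds_a))

preds_a = [1, 1, 1, 0]   # model predictions for facet a (1 = positive)
preds_d = [1, 0, 0, 0]   # model predictions for facet d
print(dppl(preds_a, preds_d))              # 0.5
print(disparate_impact(preds_a, preds_d))  # 0.25 / 0.75, roughly 0.33
```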

5
Q

Partial dependence plots (PDP)

A

show the dependence of the predicted target response on a set of input features of interest.
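Partial dependence can be computed by hand: fix the feature of interest at each grid value, substitute that value into every row, and average the model's outputs. The toy `model` and data below are assumptions for illustration.

```python
def model(x):
    """Toy model that depends on both features."""
    return 2.0 * x[0] + x[1]

def partial_dependence(model, rows, feature_idx, grid):
    curve = []
    for value in grid:
        preds = []
        for row in rows:
            patched = list(row)
            patched[feature_idx] = value   # override the feature of interest
            preds.append(model(patched))
        curve.append(sum(preds) / len(preds))
    return curve

rows = [(0.0, 1.0), (0.0, 3.0)]
print(partial_dependence(model, rows, feature_idx=0, grid=[0.0, 1.0, 2.0]))
# [2.0, 4.0, 6.0] -- the average prediction as feature 0 sweeps the grid
```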

6
Q

Shapley values

A
  • Determine the contribution that each feature made to a model's predictions.
  • A method (solution concept) from cooperative game theory for fairly distributing the total gains or costs among a group of players who have collaborated.
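Exact Shapley values for a tiny coalition game can be computed by averaging each player's marginal contribution over all orderings. Feature attribution works the same way, with the payoff function v() scoring the model on a subset of features; the two-player game below is made up for illustration.

```python
from itertools import permutations

def shapley(players, v):
    """Average each player's marginal contribution over all join orders."""
    values = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            before = v(frozenset(coalition))
            coalition.add(p)
            values[p] += v(frozenset(coalition)) - before
    return {p: total / len(orderings) for p, total in values.items()}

# Toy game: one player alone earns 1, the pair together earns 4.
def v(coalition):
    return {0: 0, 1: 1, 2: 4}[len(coalition)]

print(shapley(["a", "b"], v))  # {'a': 2.0, 'b': 2.0} -- the gain of 4 split fairly
```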
7
Q

The difference in proportions of labels (DPL)

A

compares the proportion of observed outcomes with positive labels for facet d with the proportion of observed outcomes with positive labels for facet a in a training dataset.

8
Q

Weight

A

Multiplies the input value, controlling its influence on the output.

9
Q

Bias

A

Adds a constant term, allowing the model to fit the data better by shifting the activation function.
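The roles of weight and bias in the two cards above can be shown with a single sigmoid neuron computing activation(weight * input + bias); the numbers are made up for illustration.

```python
import math

def neuron(x, weight, bias):
    z = weight * x + bias               # weight scales the input, bias shifts it
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid activation

print(neuron(0.0, weight=2.0, bias=0.0))  # 0.5: zero input, no shift
print(neuron(0.0, weight=2.0, bias=3.0))  # bias alone pushes the output toward 1
```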

10
Q

Text embeddings

A

represent meaningful vector representations of unstructured text such as documents, paragraphs, and sentences. You input a body of text and the output is a (1 x n) vector. You can use embedding vectors for a wide variety of applications.
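Embedding vectors are typically compared by angle rather than by raw distance, using cosine similarity. The three vectors below are made-up stand-ins for real text embeddings.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

cat    = [0.9, 0.1, 0.4]
kitten = [0.85, 0.15, 0.5]
car    = [0.1, 0.9, 0.0]
# Semantically close texts should score higher than unrelated ones:
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))  # True
```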

11
Q

Amazon Fraud Detector

A

is a fully managed service that you can use to detect fraudulent activities. Examples of fraudulent activities include fraudulent transactions or the creation of fake accounts.

12
Q

Underfitting

A

leads to poor performance on both the training and test datasets.

13
Q

A high bias

A

indicates underfitting, where the model is too simplistic to capture the underlying patterns in the data.

14
Q

Low variance

A

suggests that the model's predictions are stable across training sets; combined with high bias, it is a sign that the model doesn't capture the complexity of the data.

15
Q

Overfitting

A

occurs when the model performs well on the training data but poorly on unseen data, such as the validation and test sets. This happens because the model learns the noise and intricate details of the training data, reducing its generalizability.

16
Q

Grid Search

A

enumerates every possible combination from a pre-defined “grid” of hyperparameter values.
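Grid search is a direct enumeration, which `itertools.product` expresses naturally. The hyperparameter names and toy `score` function below are assumptions for illustration.

```python
from itertools import product

def score(lr, depth):
    """Pretend validation score, peaking at lr=0.1, depth=4."""
    return -abs(lr - 0.1) - abs(depth - 4)

grid = {"lr": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
best = max(product(grid["lr"], grid["depth"]),
           key=lambda combo: score(*combo))
print(best)  # (0.1, 4) -- every one of the 3 x 3 combinations was evaluated
```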

17
Q

Bayesian Optimization is valuable when

A

each individual evaluation is expensive and you want to minimize the number of trials.

18
Q

Random Search

A

samples hyperparameter combinations from the search space rather than covering it completely. It's typically better when the search space is large.
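Contrasted with grid search, random search just draws combinations from the space. The toy `score` function and the ranges below are assumptions for illustration; with a large space, a modest number of draws often finds a near-best setting far sooner than full enumeration would.

```python
import random

def score(lr, depth):
    """Pretend validation score, peaking at lr=0.1, depth=4."""
    return -abs(lr - 0.1) - abs(depth - 4)

random.seed(0)
samples = [(random.uniform(0.001, 1.0), random.randint(1, 16))
           for _ in range(20)]        # 20 draws instead of a full sweep
best = max(samples, key=lambda combo: score(*combo))
print(best)
```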

19
Q

Hyperband

A

is designed for larger search spaces and relies on early stopping. It's not specifically intended to fully enumerate the space; its strength is in pruning unpromising configurations quickly.

20
Q

Dimensionality reduction

A

is a technique that simplifies datasets by reducing the number of input variables or features. This simplification enhances computational efficiency and model performance, especially as datasets grow in size and complexity.
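A minimal form of dimensionality reduction is dropping near-constant features (variance thresholding); PCA or t-SNE would be typical real choices. The data and threshold below are made up for illustration.

```python
def variance(col):
    mean = sum(col) / len(col)
    return sum((v - mean) ** 2 for v in col) / len(col)

def drop_low_variance(rows, threshold=0.01):
    """Keep only the columns whose variance exceeds the threshold."""
    cols = list(zip(*rows))                      # column-wise view of the data
    keep = [i for i, c in enumerate(cols) if variance(c) > threshold]
    return [[row[i] for i in keep] for row in rows], keep

rows = [[1.0, 0.5, 3.0],
        [2.0, 0.5, 1.0],
        [3.0, 0.5, 2.0]]
reduced, kept = drop_low_variance(rows)
print(kept)  # [0, 2]: the constant middle feature carried no information
```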

21
Q

SMOTE

A

Synthetic Minority Oversampling Technique; it addresses class imbalance in datasets by generating synthetic data points for the minority class, effectively balancing the dataset.
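The core SMOTE step is simple interpolation: a synthetic minority sample is placed at a random point on the segment between a minority sample and one of its minority-class nearest neighbors, x_new = x + lam * (neighbor - x). The two points below are made up for illustration; a library such as imbalanced-learn handles neighbor selection and the full loop.

```python
import random

def smote_point(x, neighbor, rng=random):
    """One synthetic sample on the segment between x and its neighbor."""
    lam = rng.random()                 # lam drawn from [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

random.seed(1)
new = smote_point([1.0, 2.0], [3.0, 4.0])
print(new)  # lies between the two minority samples, coordinate by coordinate
```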

22
Q

The primary goal of weighted loss functions is

A

to address class imbalance or to prioritize certain data points during training, leading to a more robust and accurate model.
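A weighted binary cross-entropy sketch shows the mechanism: errors on the rare positive class are multiplied by a larger weight, pushing the model to get them right. The weight of 5.0 and the probabilities below are made-up values for illustration.

```python
import math

def weighted_bce(y_true, y_pred, pos_weight=5.0):
    """Binary cross-entropy with the positive class weighted up."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        if y == 1:
            total += -pos_weight * math.log(p)   # rare class: weighted error
        else:
            total += -math.log(1.0 - p)          # common class: weight 1
    return total / len(y_true)

# The same-sized mistake costs 5x as much on the minority class:
print(weighted_bce([1], [0.6]))  # 5 * -log(0.6)
print(weighted_bce([0], [0.4]))  #     -log(0.6)
```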

23
Q

The Random Cut Forest (RCF) algorithm is used for

A

anomaly detection, particularly in large datasets and streaming data.

24
Q

Amazon Forecast

A

A fully managed service that uses statistical and machine learning algorithms to deliver highly accurate time-series forecasts.