OpenSearch Service
K-means clustering
is a popular unsupervised machine learning algorithm that partitions a dataset into a predefined number of clusters (k).
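As an illustration of the partitioning idea, here is a minimal sketch of the standard Lloyd's iteration for k-means, in plain Python on 2-D points. The function name `kmeans`, the toy points, and the fixed iteration count are assumptions made for this example, not part of any particular library.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Cluster 2-D points with Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, labels

# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
```

The number of clusters `k` has to be chosen up front, which is the "predefined" part of the definition.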
Pre-training bias metrics
https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bia
Post-training bias metrics
https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-post-tra
Partial dependence plots (PDP)
show the dependence of the predicted target response on a set of input features of interest.
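A minimal sketch of how a one-feature partial dependence curve can be computed: fix the feature of interest at each grid value for every row, then average the model's predictions. The `predict` function and the toy rows here are hypothetical stand-ins for a trained model and a real dataset.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction with one feature forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)             # copy so the data is untouched
            modified[feature_index] = value  # override the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

# Hypothetical model: prediction = 2 * feature0 + feature1.
def predict(row):
    return 2 * row[0] + row[1]

rows = [[0, 1], [5, 3], [9, 5]]
curve = partial_dependence(predict, rows, feature_index=0, grid=[0, 1, 2])
```

For this linear toy model the curve rises by 2 for each unit of the first feature, which matches that feature's coefficient.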
Shapley values
The difference in proportions of labels (DPL)
compares the proportion of observed outcomes with positive labels for facet d with the proportion of observed outcomes with positive labels for facet a in a training dataset.
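A small sketch of the DPL calculation, assuming the sign convention DPL = q_a - q_d, where q_a and q_d are the proportions of positive labels in facets a and d. The function name and the toy labels are illustrative only.

```python
def difference_in_proportions_of_labels(labels, facets, facet_a, facet_d, positive=1):
    """Return q_a - q_d, the gap in positive-label rates between two facets."""
    def positive_rate(facet):
        group = [y for y, f in zip(labels, facets) if f == facet]
        return sum(1 for y in group if y == positive) / len(group)

    return positive_rate(facet_a) - positive_rate(facet_d)

# Toy dataset: facet "a" has 3 of 4 positive labels, facet "d" has 1 of 4.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
facets = ["a", "a", "a", "a", "d", "d", "d", "d"]
dpl = difference_in_proportions_of_labels(labels, facets, facet_a="a", facet_d="d")
```

A value near zero indicates similar positive-label rates across the two facets; under this convention a positive value means facet a has the higher rate.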
Weight
Multiplies the input value, controlling its influence on the output.
Bias
Adds a constant term, allowing the model to fit the data better by shifting the activation function.
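To make the roles of weight and bias concrete, here is a sketch of a single artificial neuron. The sigmoid activation and the example numbers are assumptions chosen for illustration.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 / (1 + math.exp(-z))

inputs = [1.0, 2.0]
weights = [0.5, -0.25]  # each weight scales the influence of one input

without_bias = neuron(inputs, weights, bias=0.0)  # weighted sum is exactly 0
with_bias = neuron(inputs, weights, bias=2.0)     # the bias shifts the activation
```

With these inputs the weighted sum is zero, so the bias alone decides how far the output moves away from the sigmoid midpoint.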
Text embeddings
are meaningful vector representations of unstructured text such as documents, paragraphs, and sentences. You input a body of text and the output is a (1 x n) vector. You can use embedding vectors for a wide variety of applications.
Amazon Fraud Detector
is a fully managed service that you can use to detect fraudulent activities. Examples of fraudulent activities include fraudulent transactions or the creation of fake accounts.
Underfitting
leads to poor performance on both the training and test datasets.
A high bias
indicates underfitting, where the model is too simplistic to capture the underlying patterns in the data.
Low variance
suggests, when paired with high bias, that the model doesn’t capture the complexity of the data.
Overfitting
occurs when the model performs well on the training data but poorly on unseen data, such as the validation and test sets. This happens because the model learns the noise and intricate details of the training data, reducing its generalizability.
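The underfitting and overfitting descriptions above can be turned into a rough diagnostic that compares training and validation error. This is only a sketch: the function name and both thresholds (`high_error`, `max_gap`) are arbitrary assumptions that would need tuning for a real problem.

```python
def diagnose_fit(train_error, validation_error, high_error=0.2, max_gap=0.1):
    """Label a model from its error rates; both thresholds are illustrative."""
    if train_error > high_error:
        # Poor even on the training data: the model is too simple.
        return "underfitting"
    if validation_error - train_error > max_gap:
        # Good on training data, much worse on unseen data.
        return "overfitting"
    return "reasonable fit"

print(diagnose_fit(0.35, 0.37))
print(diagnose_fit(0.02, 0.30))
print(diagnose_fit(0.08, 0.10))
```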
Grid Search
enumerates every possible combination from a pre-defined “grid” of hyperparameter values.
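A minimal sketch of grid search using `itertools.product` to enumerate every combination. The grid values and `toy_score` are hypothetical; in practice the score would come from training and validating a model.

```python
from itertools import product

def grid_search(grid, score):
    """Evaluate every combination in the grid; return the best one found."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((score(params), params))
    best_score, best_params = max(results, key=lambda r: r[0])
    return best_params, best_score, len(results)

# Hypothetical grid: 3 x 3 = 9 combinations.
grid = {"learning_rate": [0.01, 0.1, 1.0], "max_depth": [2, 4, 6]}

def toy_score(params):
    # Stand-in for a validation metric; peaks at learning_rate=0.1, max_depth=4.
    return -(params["learning_rate"] - 0.1) ** 2 - (params["max_depth"] - 4) ** 2

best_params, best_score, n_evaluated = grid_search(grid, toy_score)
```

The number of evaluations is the product of the list lengths, so the cost grows quickly as hyperparameters are added.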
Bayesian Optimization is valuable when
each individual evaluation is expensive and you want to minimize the number of trials.
Random Search
samples configurations at random from the search space rather than covering it completely. It’s typically the better choice when the search space is large.
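A sketch of random search over continuous ranges, assuming uniform sampling and a fixed trial budget. The search space and `toy_score` are made up for the example.

```python
import random

def random_search(space, score, n_trials, seed=0):
    """Sample n_trials random configurations; return the best and all trials."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        # Draw each hyperparameter uniformly from its (low, high) range.
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        trials.append((score(params), params))
    return max(trials, key=lambda t: t[0]), trials

# Hypothetical search space given as (low, high) ranges.
space = {"learning_rate": (0.001, 1.0), "dropout": (0.0, 0.5)}

def toy_score(params):
    # Stand-in for a validation metric.
    return -(params["learning_rate"] - 0.1) ** 2 - (params["dropout"] - 0.2) ** 2

(best_score, best_params), trials = random_search(space, toy_score, n_trials=25)
```

Unlike a grid, the trial budget here is independent of how many hyperparameters or values the space contains.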
Hyperband
is designed for larger search spaces and relies on early stopping. It’s not specifically intended to fully enumerate the space; its strength is in pruning unpromising configurations quickly.
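The pruning idea can be sketched with successive halving, the subroutine that Hyperband builds on. This is not a full Hyperband implementation: it runs a single bracket, and the `evaluate` function, the candidate learning rates, and `eta=3` are assumptions for illustration.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Repeatedly keep the top 1/eta of configs while growing the budget."""
    survivors = list(configs)
    budget = min_budget
    rounds = []
    while len(survivors) > 1:
        # Score every surviving config with the current (small) budget.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        rounds.append((budget, len(ranked)))
        # Stop the unpromising ones early; give the rest more budget.
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0], rounds

# Hypothetical objective: best at learning rate 0.3, improves with budget.
def evaluate(learning_rate, budget):
    return (1 - abs(learning_rate - 0.3)) * (1 - 1 / (budget + 1))

candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
best, rounds = successive_halving(candidates, evaluate)
```

Most candidates only ever receive the smallest budget, which is where the savings over full evaluation come from.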
Dimensionality reduction
is a technique that simplifies datasets by reducing the number of input variables or features. This simplification can improve computational efficiency and model performance, especially as datasets grow in size and complexity.
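One common dimensionality reduction method is principal component analysis (PCA). The sketch below, which is only one of several possible approaches, finds the first principal component by power iteration on the covariance matrix and projects each row onto it. The function name and toy data are assumptions for the example.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return the top principal direction and each row's 1-D projection."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    projected = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projected

# Toy data lying exactly on the line y = 2x: two features, one real dimension.
rows = [[1, 2], [2, 4], [3, 6], [4, 8]]
direction, projected = first_principal_component(rows)
```

Here two input features collapse to a single value per row with no information lost, because the second feature is fully determined by the first.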
SMOTE
(Synthetic Minority Oversampling Technique) is a technique used to address class imbalance in datasets. It works by generating synthetic data points for the minority class, typically by interpolating between existing minority samples and their nearest neighbors, which helps balance the dataset.
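A simplified sketch of the SMOTE idea: pick a minority sample, pick one of its nearest minority neighbors, and create a new point somewhere on the line between them. The function name, `k=2`, and the toy points are assumptions; a production implementation would handle more cases.

```python
import random

def smote(minority, n_synthetic, k=2, seed=0):
    """Generate synthetic minority points by interpolating toward neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # The k nearest other minority points, by squared Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic

# Toy minority class with four 2-D samples.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]
new_points = smote(minority, n_synthetic=6)
```

Because each new point lies between two real minority samples, the synthetic data stays inside the region the minority class already occupies.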
The primary goal of weighted loss functions is
to address class imbalance or to prioritize certain data points during training, leading to a more robust and accurate model.
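As one example, a weighted binary cross-entropy multiplies the loss for each class by a chosen weight. The sketch below assumes per-class weights `pos_weight` and `neg_weight`; the names and the toy predictions are illustrative.

```python
import math

def weighted_bce(y_true, y_prob, pos_weight=1.0, neg_weight=1.0, eps=1e-12):
    """Mean binary cross-entropy with separate weights for each class."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= pos_weight * y * math.log(p) + neg_weight * (1 - y) * math.log(1 - p)
    return total / len(y_true)

# Imbalanced toy batch: one positive (predicted poorly) and three negatives.
y_true = [1, 0, 0, 0]
y_prob = [0.2, 0.1, 0.1, 0.1]

plain = weighted_bce(y_true, y_prob)
weighted = weighted_bce(y_true, y_prob, pos_weight=3.0)
```

Raising `pos_weight` makes the missed positive contribute more to the loss, so training is pushed harder to correct errors on the rare class.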
The Random Cut Forest (RCF) algorithm is used for
anomaly detection, particularly in large datasets and streaming data.
Amazon Forecast
A fully managed service that uses statistical and machine learning algorithms to deliver highly accurate time-series forecasts.