AI & ML Flashcards

(201 cards)

1
Q

What is the difference between Supervised and Unsupervised Machine Learning?

A

Supervised Learning - Utilises labeled input and output data.

Unsupervised Learning - Discovers hidden patterns in data without any human-provided labels.

2
Q

What are some of the benefits and drawbacks of Supervised and Unsupervised machine learning when compared to one another?

A

Supervised Learning:
Tends to be more accurate than unsupervised models.
Requires historical data, or humans to manually label data.

Unsupervised Learning:
Does not predict; it simply groups data together.

3
Q

Within supervised machine learning models, what are features vs labels?

A

Supervised Machine learning models “learn” the association between known features and unknown labels.

Each column of data that will help us determine the outcome (win or loss for a tournament game) is called a feature.

The column of data that you are trying to predict is called the label. Machine learning models “learn” the association between the features and the label in order to predict the label for new data.

4
Q

If we were training a classification supervised machine learning model (e.g. a logistic regression) on historic team sports results to predict the outcome of future games, why should we NOT use the points scored (win_pts or lose_pts) as a feature in our training dataset, even though we have the data available?

A

This feature is only available at the END of the game and for future games we are making predictions before a game begins.

This is called data leakage.

5
Q

Define Responsible AI

A

The development and use of AI in a way that prioritises ethical considerations, fairness, accountability, safety and transparency.

6
Q

What is a TPU?

A

A Tensor Processing Unit (TPU) is Google’s custom-developed application-specific integrated circuit (ASIC), designed to accelerate AI workloads (such as training and inference) at scale.

7
Q

What are the 4 storage classes in GCS?

A

1) Standard Storage - Hot data, accessed in real time.

2) Nearline Storage - Data accessed less than once per month.

3) Coldline Storage - Data accessed less than once every 90 days.

4) Archive Storage - Data accessed less than once a year.

8
Q

On Cloud Storage, which data storage class is best for storing data that needs to be accessed less than once a year?

A

Archive Storage

9
Q

What are the 4 products that should be considered in the Data Ingestion & Process phase of the Data-to-AI Workflow?

A

Pub/Sub
Dataflow
Dataproc
Cloud Data Fusion

10
Q

What are the 6 products that should be considered in the Data Storage phase of the Data-to-AI Workflow?

A

Cloud Storage
BigQuery
Cloud SQL
Cloud Spanner
Cloud Bigtable
Cloud Firestore

11
Q

What are the 2 products that should be considered in the Data Analytics phase of the Data-to-AI Workflow?

A

BigQuery - Fully Managed Data Warehouse solution
Looker - BI layer for visualising and governing data across your organisation

12
Q

What makes a Machine Learning model a Deep Learning model?

A

Deep Learning is a subset of machine learning that adds layers in between input data and output results to make a machine learn at more depth. This is usually in the form of a neural-network architecture.

13
Q

You want to use machine learning to discover the underlying pattern and group a collection of unlabeled photos into different sets. Which should you use?

A

Unsupervised Learning - Cluster Analysis

14
Q

Which SQL command would you use to create an ML model in BigQuery ML?

A

CREATE OR REPLACE MODEL

15
Q

Describe the Machine Learning / MLOPs workflow as though you were comparing it to running a restaurant.

A

1) Data Preparation - Prepare your ingredients
a) Data Import - Batch vs Streaming. Structured vs Unstructured.
b) Feature Engineering - Chopping the onions, peeling the carrots etc. before you start cooking.

2) Model Development - Experiment with different recipes. Loop: Train the Model –> Evaluate the model

3) Model Serving - Finalise and iterate on the menu to meet customers’ changing needs
a) Deploy the Model
b) Monitor the Model

16
Q

What are the 3 stages of maturity in MLOps?

A

MLOps Level 0: Manual Process
At this initial level, the workflow for building and deploying models is entirely manual, script-driven, and interactive. It is characterized by a disconnect between data scientists and operations, infrequent release iterations, and a lack of active performance monitoring or CI/CD practices.

​MLOps Level 1: ML Pipeline Automation
This level focuses on performing continuous training (CT) of the model by automating the machine learning pipeline itself. It enables rapid experimentation and the continuous delivery of fresh prediction models trained on new live data, often employing triggers like data validation or drift detection.

​MLOps Level 2: CI/CD Pipeline Automation
The most mature level introduces a robust CI/CD system to automatically test and deploy new implementations of the ML pipelines themselves, not just the models. This allows organizations to reliably update pipeline architecture and code in production, enabling them to cope quickly with changing data and business environments.

17
Q

What are the 4 types of Machine Learning options for model development and usage via Google Cloud?

A

1) Pre-Trained Models
2) BigQuery ML
3) AutoML
4) Custom Training

18
Q

What are the four phases of the AutoML Pipeline?

A

Phase 1 - Automatic data pre-processing using TensorFlow Transform

Phase 2 - Architecture Search, Selection & Tuning

Phase 3 - Cross Validation & Bagging Ensemble

Phase 4 - Deploy & Predict

19
Q

What 2 critical technologies support auto search and architecture selection for AutoML?

A

Neural Architecture Search - Helps search the best models and tune the parameters automatically.

Transfer Learning - AutoML has already trained many different models with large amounts of data. These trained models can be used as foundational models to reach higher accuracy with much less data and computational training time. This allows you to train models with smaller datasets by leveraging inherent knowledge within models that were trained on larger datasets.

20
Q

What is the purpose of Phase 3: Bagging Ensemble within AutoML?

A

AutoML does not rely on one single model, but on the top models selected during Phase 2. The number of models depends on the training budget, but is typically around ten.

The ensemble can be as simple as averaging the predictions of those top models. Relying on multiple top models instead of one greatly improves the accuracy of prediction.

21
Q

When would you choose to use Colab Enterprise over Vertex Workbench?

A

1) When you want to avoid managing compute.

2) When your logic can be housed within a single notebook.

3) Collaboration - When you don’t want to worry about utilising Git, as there are built in version control and sharing capabilities.

22
Q

When would you choose to use Vertex Workbench over Colab Enterprise?

A

1) When you’re migrating an existing Jupyter notebook from your local environment to the cloud.

2) Vertex Workbench is better for complex projects that span over multiple files and directories.

3) When you need native support for GitHub.

23
Q

What is the benefit of using Vertex AI Workbench over a Jupyter Notebook run locally?

A

1) Scalability and Performance - Releasing yourself from the resource constraints of your local machine

2) Collaboration & Reproducibility - Shared Environment

3) Scheduled Executions

4) Seamless Integration with GCP Services

24
Q

What happens when you execute a cell on Colab Enterprise?

A

Colab Enterprise connects to a Python kernel on a runtime, and the code is executed by that kernel.

25
You have several complex projects spanning multiple files, with complex dependencies. You also need to collaborate and share notebooks. Which notebook solution offers the best option?
Vertex Workbench
26
You have an existing Workbench instance and want to add a GPU to the instance. How can you modify a Workbench instance configuration after it has been created?
1) Stop the instance 2) Modify the hardware configuration 3) Click the submit button.
27
Colab Enterprise provides a default runtime and runs your code on it. To configure a runtime for specific needs, you must:
Use Runtime Templates. Create a runtime template with the configuration that you need, then create a runtime based on that template, then connect to the runtime from your notebook and run your code.
28
What are the 3 primary benefits of VertexAI?
1) Fast experimentation 2) Accelerated deployment 3) Simplified model management to achieve your ML goals.
29
What are the primary categories of Supervised Learning?
Regression - Continuous labels. Classification - Discrete labels (categories).
30
What are some different types of Regression models?
Linear Regression - Predicts a continuous output from a continuous input by attempting to model the line of best fit.
Polynomial Regression - Captures non-linear relationships between features and labels within a continuous dataset.
Time Series Regression / AutoRegressive Integrated Moving Average (ARIMA) - Predicts future values in a time-dependent dataset; often employed to forecast future values based on past observations.
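The linear-regression case above can be sketched with a closed-form ordinary-least-squares fit (a minimal illustration with made-up data points, not a production implementation):

```python
# Minimal ordinary-least-squares sketch: fit y = slope * x + intercept.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Points lying exactly on y = 2x + 1 recover slope 2 and intercept 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```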
31
What are some use cases of Time Series Regression / AutoRegressive Integrated Moving Average (ARIMA)?
Sales Forecasting, Inventory Forecasting, Stock Market Analysis
32
What are some different types of Classification models?
Logistic Regression - Although Logistic Regression uses regression techniques, the outcome is actually binary classification.
Random Forest & Gradient Boosted Trees - Use a concept called collective intelligence: build many independent decision trees and aggregate their results into a single result, usually a score.
K-Nearest Neighbors (KNN) - Classifies data points based on their nearest neighbours; essentially K-Means without centroids. KNN is useful for pre-processing and populating labels on data points that have not yet been classified before other machine learning techniques are applied.
Support Vector Machine (SVM) - Attempts to draw an optimal separating boundary (hyperplane) between classes in high-dimensional datasets.
33
What is the classification threshold within Logistic Regression?
A human-set threshold that determines where on the logistic regression curve the classifier will predict a positive value in favour of a negative value.
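The threshold idea can be shown with a few lines of Python (illustrative probabilities and a hypothetical `classify` helper, not a library API):

```python
# Apply a human-set classification threshold to predicted probabilities.
def classify(probability, threshold=0.5):
    return "positive" if probability >= threshold else "negative"

# The same predicted score flips class as the threshold moves.
labels_low = [classify(p, threshold=0.3) for p in (0.2, 0.4, 0.9)]
labels_high = [classify(p, threshold=0.7) for p in (0.2, 0.4, 0.9)]
```

With the lower threshold, 0.4 is counted as positive; with the higher threshold it is not.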
34
What are the primary categories of Unsupervised Learning?
Clustering - Involves grouping data points together so that objects in the same group (cluster) are more similar to each other than to those in other groups.
Association - Attempts to learn the relationship between variables and objects within a dataset.
Dimensionality Reduction / Matrix Factorisation - An unsupervised learning technique that reduces the number of features, or dimensions, in a dataset for better visualisation.
35
What use case is Matrix Factorisation good for?
Recommendations
36
Describe how K-Means Clustering works?
1) Place centroids randomly within the data space.
2) Measure the distance between each data point and each cluster centre using the Euclidean distance.
3) Assign each data point to its nearest centroid.
4) Recalculate the centroids by taking the mean of all data values within each cluster.
5) Repeat until every data point remains in the same cluster as in the previous iteration.
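The steps above can be sketched in plain Python (a toy 1-D version with made-up points; real workloads would use a library implementation):

```python
import random

# Toy 1-D K-Means following the steps above.
def k_means(points, k, iterations=100):
    centroids = random.sample(points, k)            # 1) random centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                            # 2-3) assign to nearest
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]   # 4) recompute
        if new_centroids == centroids:              # 5) stop when stable
            break
        centroids = new_centroids
    return sorted(centroids)

random.seed(0)
# Two obvious groups around 1.0 and 10.0.
centers = k_means([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], k=2)
```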
37
What are some use cases for K-Means Clustering?
Market & Customer Segmentation, Computer Vision, Fraud Detection
38
What are some of the use cases for Association Training?
Market basket analysis, Stock Analysis, anomaly detection, Medical diagnosis
39
What are some of the use cases for Dimensionality Reduction?
Data Visualisation, Image Compression & Noise Filtering
40
What is Topic Modelling?
Topic modelling is an unsupervised learning technique for discovering abstract "topics" hidden within a large collection of text documents.
41
What is Tensorflow?
TensorFlow is an end-to-end, open-source platform for machine learning that provides a comprehensive ecosystem for building and deploying models at scale. It is renowned for its production readiness.
42
Match the three types of data ingest with an appropriate source of training data. 1) Streaming batch (Dataflow), structured batch (BigQuery), stochastic (App Engine) 2) Streaming (BigQuery), structured batch (Pub/Sub), unstructured batch (Cloud Storage) 3) Streaming (Pub/Sub), structured batch (BigQuery), unstructured batch (Cloud Storage)
Streaming (Pub/Sub), structured batch (BigQuery), unstructured batch (Cloud Storage)
43
What are training checkpoints and what do they capture within Vertex?
Training checkpoints are snapshots of a model's state at specific points during the training process. Checkpoints enable model state to be persisted even if a training job is interrupted / fails. Checkpoints capture essential information like model weights, optimizer states, and the current training epoch or step.
44
What is a feature in Machine Learning?
A feature refers to a factor that contributes to the prediction. This is like an independent variable in statistics, or a column in a table.
45
What is the difference between Univariate Analysis and Bivariate Analysis within data pre-processing?
Univariate Analysis - Each of the features is analysed independently. Bivariate Analysis - We compare 2 features to identify correlations.
46
How would you go about removing skewness from your feature data distribution within feature engineering?
By applying transformations such as a log transformation and normalisation, you can bring a skewed distribution closer to a normal distribution and reduce the influence of outliers.
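As an illustration (made-up income values, not from the source), a log transformation compresses the long tail of a skewed feature:

```python
import math

# One extreme outlier dominates the raw scale of this feature.
incomes = [20_000, 35_000, 50_000, 80_000, 5_000_000]
log_incomes = [math.log(x) for x in incomes]

# Spread on the raw scale vs the log scale.
raw_spread = max(incomes) / min(incomes)            # 250x
log_spread = max(log_incomes) / min(log_incomes)    # under 2x
```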
47
How would you go about removing skewness within your target variable?
By resampling: undersampling, oversampling, or the Synthetic Minority Oversampling Technique (SMOTE). SMOTE does not create duplicates. Instead, it uses linear interpolation to create new data points between existing minority samples. It relies on the k-Nearest Neighbors (k-NN) algorithm.
48
When would you use Log Scaling over Scaling in feature engineering?
Log scaling is used when some of the data samples follow a power law, or are very large (a heavily skewed distribution with a long tail), for example annual income.
49
How do you calculate Z-Score when using it for scaling in feature engineering?
Scaled value = (value − mean) / stddev
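The formula above can be sketched directly (a minimal helper, with made-up values):

```python
import math

# Z-score scaling: scaled = (value - mean) / stddev.
def z_score_scale(values):
    mean = sum(values) / len(values)
    stddev = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / stddev for v in values]

# After scaling, the feature has mean 0 and standard deviation 1.
scaled = z_score_scale([10, 20, 30, 40, 50])
```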
50
What is clipping within feature engineering?
Capping all feature values above or below a certain fixed value to remove outliers.
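A minimal sketch of clipping (illustrative bounds, chosen for the example):

```python
# Clip caps values outside fixed bounds to tame outliers.
def clip(value, low, high):
    return max(low, min(high, value))

clipped = [clip(v, 0, 100) for v in [-5, 42, 250]]  # → [0, 42, 100]
```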
51
What are Scaling, Log-Scaling, Z-Score and Clipping all forms of within Feature Engineering?
Normalisation
52
What are the 2 reasons you would perform normalisation on a feature?
1) Numeric features that have distinctly different ranges (for example, age and income) 2) Numeric features that cover a wide range such as a city
53
What is undersampling / downsampling within Feature Engineering?
This technique involves reducing the number of examples in the majority class to match the size of the minority class to reduce bias. If you have 1,000 "Normal" transactions and 100 "Fraud" transactions:
1) You keep all 100 "Fraud" cases.
2) You randomly select only 100 "Normal" cases from the original 1,000.
3) You discard the remaining 900 "Normal" cases.
4) Result: A balanced dataset of 200 total rows.
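The 1,000/100 example above can be sketched with random sampling (toy rows, stdlib only):

```python
import random

# Undersampling: keep all 100 "Fraud" rows, randomly keep 100 of the
# 1,000 "Normal" rows, discard the rest.
random.seed(42)
normal = [("Normal", i) for i in range(1000)]
fraud = [("Fraud", i) for i in range(100)]

balanced = fraud + random.sample(normal, len(fraud))
```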
54
What is a downside of undersampling / downsampling within Feature Engineering?
Data Loss: You are throwing away potentially valuable information. The model might miss important patterns present in the discarded examples.
55
What is oversampling within Feature Engineering?
Balancing an imbalanced dataset by interpolating new samples for the minority class.
56
What is the number 1 mistake people make when feature engineering?
The number one mistake beginners make with feature engineering is applying preprocessing techniques (normalising features, removing outliers etc.) before splitting the data. For example, if you oversample the entire dataset first, and then split it into Training and Testing sets, the synthetic samples created in the training set will be based on—and therefore extremely similar to, or even near-identical to—original data points that end up in the test set. This is called data leakage and impacts model evaluation and production quality.
57
What is Upweighting within Feature Engineering?
Also known as Class Weighting or Cost-Sensitive Learning, this technique does not change the dataset size. Instead, it tells the model that errors made on the minority class are more expensive than errors made on the majority class. This can be useful to remove bias when training on an unbalanced dataset.
58
What is Dimensionality Reduction in Feature Extraction?
Dimensionality Reduction is the process by which an initial set of raw data is reduced to more manageable groups for training. In technical terms, you want to reduce the dimension of your feature space. By reducing the dimension of your feature space, you have fewer relationships between variables to consider, and you are less likely to overfit your model.
59
What is One-Hot Encoding and give an example?
Converting categorical data into a numerical format that can be fed into machine learning algorithms to improve prediction accuracy. For example, if you have 3 employee IDs: 101, 113 and 129, you would split the employee ID column into 3 separate features during feature engineering and represent the employee who was active for that line of data with a 1 and the rest with a 0. This increases the sparsity of your dataset.
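The employee-ID example above can be sketched with a tiny helper (hypothetical `one_hot` function, not a library API):

```python
# One-hot encode a categorical value against a known category list.
def one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]

ids = [101, 113, 129]
encoded = one_hot(113, ids)  # → [0, 1, 0]
```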
60
What is Feature Hashing?
Feature hashing is a technique in Feature Engineering used to turn categorical features into a vector of a fixed size. It is essentially a space-efficient alternative to One-Hot Encoding, so useful for categories with lots of values.
61
What is bucketised within Feature Engineering?
Bucketising converts continuous numeric data into discrete intervals (categorical string features), which can then be one-hot encoded. For example, you could set age ranges of 18-25, 26-30, 31-40, etc.
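A minimal sketch of the age-range example (the bucket boundaries are the illustrative ones above):

```python
# Map a continuous age to a discrete bucket label.
def bucketise(age):
    if 18 <= age <= 25:
        return "18-25"
    if 26 <= age <= 30:
        return "26-30"
    if 31 <= age <= 40:
        return "31-40"
    return "other"

buckets = [bucketise(a) for a in (19, 28, 35)]
```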
62
What are the 4 characteristics to look out for with regards to Good Feature Identification?
1) Be related to the objective 2) Be known at prediction-time 3) Be numeric with meaningful magnitude 4) Have enough examples
63
In Feature Engineering, what do we mean by “Collapsing the Long-Tail”?
Group categories in the long tail (bucketisation) where data variables aren’t continuous to avoid outliers overfitting the model.
64
What is Vertex AI Feature Store?
A central store to aggregate, manage, serve and share features. These features are stored as a time-series, allowing traceability and searchability of features over time.
65
What are the 4 benefits of Feature Store?
Features are shareable for training and serving.
Features are reusable, reducing duplicate effort.
Features are scalable - fully managed, and served at low latency.
Mitigate training-serving skew - track and monitor for drift between training and serving.
66
What is a Feature View within Vertex Feature Store?
A FeatureView is a logical grouping of features from a BigQuery table or view that you want to serve together. For example, if features are spread across multiple entity types, you can retrieve them in a single request that you can feed to a machine learning or batch prediction request. For the exam, this may be called an EntityType, as this is what it was previously known as.
67
What is an Offline Store in Vertex Feature Store?
Your BigQuery table or view is the offline store. This eliminates data duplication and allows you to use the full power of BigQuery for feature engineering, analysis, and batch serving.
68
How is Offline Serving used for training models via Vertex Feature Store?
Offline serving is done directly from your BigQuery tables using standard BigQuery APIs and capabilities. This provides more flexibility and control over data access.
69
What are the 2 options for Online Store in Vertex Feature Store?
1) Optimised online serving (for ultra-low latency scenarios and embedding management). 2) Bigtable online serving (for large data volumes, similar to the legacy online store)
70
What are the 4 levels of Feature Store hierarchy?
Level 1: Data Source - Traditionally BigQuery, especially for offline serving for model training.
Level 2: Feature Registry - For management & governance. The feature registry contains feature groups, each corresponding to a BigQuery source table or view containing feature data.
Level 3: Online Feature Store - Stores a copy of the latest feature values for low-latency serving.
Level 4: Feature View - Configures which features from your BigQuery source should be regularly synchronised and made available in a specific online feature store.
71
How does Feature Store maintain traceability of features over time?
Feature Store uses a time series data model to store a series of values for features. This model enables Feature Store to maintain feature values as they change over time.
72
For maximum speed, how do we want to store data on BigQuery for a Vertex AI workload?
For maximum speed, it's better to store materialized data instead of using views or subqueries for training data.
73
What is a Baseline Model within BigQuery ML and how do you create one?
A baseline model is a solution to a problem without applying any machine learning techniques. You can get started with a baseline model using a simple “CREATE OR REPLACE MODEL” statement.
74
When training a model in BigQueryML, why do we need to convert numeric values (such as DayofWeek, or employee ID) to string before training?
BQML by default assumes that numbers are numeric features and strings are categorical features. The model (e.g. a neural network) will automatically treat any integer as a numerical value rather than a categorical value, meaning it will carry meaningful magnitude rather than being one-hot encoded.
75
What is the tradeoff between Static and Dynamic Training?
Static training is easier to build and test, but is likely to become stale quickly in high data drift environments. Dynamic is harder to build and test but will adapt to changes as you’re constantly retraining the model based on live usage data.
76
What’s the primary question you need to ask when making the decision over whether to embrace Static or Dynamic training?
Do the model features change like Science (slowly - Static) or fashion (quickly - dynamic)?
77
Which Google Vertex AI Service provides a toolkit to automate, monitor, and govern machine learning systems by orchestrating the workflow in a serverless manner?
Vertex AI Pipelines
78
Which type of logging should be enabled in Vertex AI Online Prediction that logs the stderr and stdout streams from your prediction nodes to Cloud Logging and can be useful for debugging?
Container Logging - Specifically designed to log the standard output (stdout) and standard error (stderr) streams from the containers running your model on the prediction nodes.
79
Vertex AI has a unified data preparation tool that supports image, tabular, text, and video content. Where are uploaded datasets stored in Vertex AI?
A Google Cloud Storage bucket that acts as an input for both AutoML and custom training jobs.
80
What are the three primary measurements for classification models in Vertex AI?
1) Confusion Matrix - Recall vs Precision 2) Precision/Recall Curve 3) Feature Importance - Bar chart to illustrate the feature attribution to a prediction
81
Why does Model Evaluation matter?
Performance - Assessing accuracy and alignment to business objectives.
Generalisation - Ensuring the model works on new, unseen data and isn’t overfitting.
Model Selection - When there are multiple models to choose from, we use evaluation to select how much we rely on each model to solve a problem.
Improvement - Performance can be tracked after deployment to identify when retraining the model is necessary.
82
Where can Model Evaluation go wrong?
Overfitting - A model performs exceptionally well on its specific training data but struggles to generalise to new, unseen data. Data validation & splitting, such as stratified sampling and cross-validation, can be used to mitigate this.
Data or Concept Drift - When the distribution of real-world data changes over time. Continuous monitoring and deployment can be used to mitigate drift.
Metric Choice - Relying on one metric alone, or on metrics that do not relate to the project goals, can make evaluation a bottleneck.
83
What is Tensorboard Profiler?
A tool to identify performance bottlenecks and optimize hardware resource utilization across CPUs, GPUs, and TPUs.
83
What is the difference between Training, test and Validation data within Machine Learning?
Training data is used to fit the model's parameters.
Validation data is used to tune hyperparameters and monitor for overfitting during development.
Test data is reserved for the final, unbiased evaluation of performance on unseen examples.
84
Define the quadrants of a Confusion Matrix in model evaluation?
True Positive - The model correctly predicted the positive outcome.
False Positive (Type 1 Error) - The model incorrectly predicted the positive outcome, when the outcome should have been negative.
False Negative (Type 2 Error) - The model incorrectly predicted the negative outcome, when the outcome should have been positive.
True Negative - The model correctly predicted the negative outcome.
85
What are some metrics used to evaluation Regression Models?
Mean Absolute Error (MAE) - The average distance from the line of best fit.
Mean Squared Error (MSE) - Like MAE, but punishes larger errors more.
R2 Score - Determines how well the model predicts the actual data. An R2 of 1 means the model perfectly fits the data, while an R2 of 0 indicates the model explains none of the variability around the mean. For example, an R2 value of 0.85 for exam scores might indicate that “hours studied” has a strong relationship to the final score.
86
What is R2 within model evaluation?
R2 represents the proportion of the variance in the label that can be predicted from the features. An R2 of 1 means the model perfectly fits the data, while an R2 of 0 indicates the model explains none of the variability around the mean. A negative R2 means the model fits worse than using the average.
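R² can be computed as 1 minus the ratio of residual to total sum of squares (a minimal sketch with toy values):

```python
# R² = 1 - (residual sum of squares / total sum of squares).
def r_squared(actual, predicted):
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

perfect = r_squared([1, 2, 3], [1, 2, 3])    # perfect fit → 1.0
baseline = r_squared([1, 2, 3], [2, 2, 2])   # predicting the mean → 0.0
```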
87
What is Precision in model evaluation and how would you calculate it?
“What percentage of everything we caught in the ocean were fish?” A metric for classification models that answers the question: out of all the labels marked as positive by the model, how many were correct? Precision = True Positives / (True Positives + False Positives)
88
What is Recall in model evaluation and how would you calculate it?
“What percentage of fish in the entire ocean did we catch?” A metric for classification models that answers the question: out of all the possible positive labels, how many did the model correctly identify? Recall = True Positives / (True Positives + False Negatives)
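Both formulas can be sketched from confusion-matrix counts (illustrative numbers, not from the source):

```python
# Precision: of everything we caught, how much was actually fish?
def precision(tp, fp):
    return tp / (tp + fp)

# Recall: of all the fish in the ocean, how many did we catch?
def recall(tp, fn):
    return tp / (tp + fn)

p = precision(tp=80, fp=20)   # 80 / 100 → 0.8
r = recall(tp=80, fn=120)     # 80 / 200 → 0.4
```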
89
What are the different goals of precision and recall in model evaluation?
Precision and Recall are often a tradeoff, and given your use case you may wish to optimise for one or the other. Consider the binary classification use case of Gmail separating mail into “Spam” and “Not Spam”. If the goal is to catch as many spam emails as possible, then it may want to prioritise Recall. In contrast, if the goal is to only catch the messages that are DEFINITELY spam, then it may want to prioritise Precision.
90
What does the Confidence Threshold determine on the Precision/Recall curve when evaluating a model?
The confidence threshold determines how a ML model counts the positive cases. A higher threshold increases the precision, but decreases recall. A lower threshold decreases the precision, but increases recall. You can manually adjust the threshold to observe its impact on precision and recall and find the best tradeoff point between the two to meet your business needs.
91
How do you increase precision via the Classification Threshold?
If you raise the Classification Threshold, then it increases precision. For example, in detecting Spam where Spam is your positive classifier, you would increase the classification threshold so that more emails get identified as “Not Spam”. Conversely, if you want to increase Recall, you would lower the Classification Threshold.
92
What is the F1 score within Model Evaluation?
A measure of the accuracy of classification models. The f1 score is the harmonic average of the precision and recall. An f1 score's best value is 1. The worst value is 0.
93
What is Accuracy within Model Evaluation and how would you calculate it?
A metric for classification models that measures the percentage of all predictions (positive & negative) that the model got correct. Accuracy = (TP + TN) / (TP + FP + TN + FN)
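Accuracy and the F1 score (the harmonic mean of precision and recall, from the earlier card) can be sketched together (illustrative counts):

```python
# Accuracy: fraction of all predictions the model got correct.
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

# F1: harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

acc = accuracy(tp=80, fp=20, tn=880, fn=20)  # 960 / 1000 → 0.96
score = f1(0.8, 0.8)                         # → 0.8
```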
94
What does AUC-ROC stand for and what does it do?
Area Under Curve - Receiver Operating Characteristic Plots the True Positive Rate (Recall) against the False Positive Rate. AUC-ROC measures: How well does the model separate the two classes?
95
What does AUC-PR stand for and what does it do?
Area Under Curve - Precision-Recall Plots Precision against Recall. AUC-PR measure: How well does the model find the positives without false alarms?
96
Which stage of the Machine Learning lifecycle is Feature Importance used?
Explainability
97
When should you use AUC-PR vs AUC-ROC within evaluation of classification models?
Use AUC-ROC when your classes are balanced, i.e. you care equally about positive and negative classes. Use AUC-PR when you have a "needle in a haystack" problem (highly imbalanced data). For example, spam filtering may have a 1:1000 positive-to-negative ratio, making AUC-PR the preferred option.
98
What can the feature importance be used for when evaluating your trained model?
The feature importance values could be used to help you improve your model and have more confidence in its predictions. You might decide to remove the least important features next time you train a model or to combine two of the more significant features into a feature cross to see if this improves model performance.
99
Within BigQuery ML, what does the ML.EVALUATE function do?
The ML.EVALUATE function calculates evaluation metrics against your model type, given a dataset you pass to it.
100
What are Feature Attributions within Explainable AI?
Feature attributions are an explainability method that indicate how much each input feature contributed to your model’s predictions and to the model’s overall predictive power.
101
What is the difference between drift and skew within Machine Learning?
Drift is a function of time (the world changes), for example changing consumer behaviours. Skew is a function of data composition (the representation is distorted). For example, you train a self-driving car in California, but it fails in London because it needs to work on the other side of the road and with different road signs.
102
What are TFRecords?
TFRecords are primarily used to optimize the data input pipeline in machine learning models. They are designed to solve the "I/O bottleneck"—the problem where your powerful GPU or TPU sits idle, waiting for the hard drive to find and read thousands of individual files. You host these in GCS before loading them into a training pipeline. TFRecords are effectively "containers" that hold serialized data. While the outer shell is structured (following a Protocol Buffer format), the content inside can be raw binary blobs, such as JPEG or PNG data.
103
What are the 3 most popular Feature Attribution methods within Vertex AI and when should you use them?
1) Shapley (SHAP): Classification and regression on tabular data 2) Integrated Gradients (IG): Classification and regression on tabular data. Classification on image data 3) XRAI (eXplanation with Ranked Area Integrals): Classification on image data
104
What are the different ways you can measure data bias and fairness within Explainable AI?
1) Requesting feature attributions as part of your prediction request. 2) Using the What-If Tool
105
When calculating loss, when should you use Categorical Cross-Entropy vs Sparse Categorical Cross-Entropy?
Mathematically, they calculate the exact same loss value. The only difference is how you feed the ground truth labels into the function. Use Categorical Cross-Entropy if your labels are one-hot encoded vectors (e.g., [0, 1, 0]). Use Sparse Categorical Cross-Entropy if your labels are provided as simple integers (e.g., 1, 3, 14), which is generally the more memory-efficient option.
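A minimal pure-Python sketch (with illustrative probabilities) showing the two losses agree when the labels encode the same class:

```python
import math

def categorical_ce(one_hot, probs):
    # expects a one-hot label vector, e.g. [0, 1, 0]
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs))

def sparse_categorical_ce(label, probs):
    # expects the label as a plain integer index, e.g. 1
    return -math.log(probs[label])

probs = [0.1, 0.7, 0.2]
print(categorical_ce([0, 1, 0], probs))  # ~0.3567
print(sparse_categorical_ce(1, probs))   # ~0.3567 (same value)
```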
106
What is feature-cross in feature engineering?
Creating a synthetic feature by combining two variables into a co-dependent feature. For example, hour of day and day of week might be two separate features that have a large impact on a demand forecast model. Therefore you may wish to feature cross them into monday-9am, thursday-5pm etc.
107
What operation does feature-crossing use?
Multiplication. E.g: [A X B]: a feature cross formed by multiplying the values of two features. [A x B x C x D x E]: a feature cross formed by multiplying the values of five features. [A x A]: a feature cross formed by squaring a single feature.
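The multiplication view can be sketched with one-hot vectors. A minimal illustration, with the 7×24 day/hour sizes chosen as an example:

```python
def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]

def feature_cross(a, b):
    # every pairwise multiplication of the two one-hot vectors (an outer product)
    return [x * y for x in a for y in b]

day = one_hot(0, 7)    # e.g. Monday
hour = one_hot(9, 24)  # e.g. 9am
crossed = feature_cross(day, hour)
# exactly one of the 7 * 24 = 168 crossed features fires: "monday-9am"
print(len(crossed), sum(crossed))  # 168 1
```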
108
What is the difference between Spatial and Temporal Functions in Feature Engineering? What’s a good analogy?
Spatial Functions deal with space and geography, whereas temporal functions deal with time-series data. Think of it like a football game: Spatial analysis would look at a single moment in time. You would analyze the positions of all the players on the field to understand the team's formation, how far apart they are, and which players are near the ball. This is a snapshot in space. Temporal analysis would look at a single player over the entire game. You would analyze their movement, their speed over time, and the sequence of their actions (e.g., a pass followed by a run). This is a progression over time.
109
A hospital uses the machine learning technology of Google to help pre-diagnose cancer by feeding historical patient medical data to the model. The goal is to identify as many potential cases as possible. Which metric should the model focus on?
Recall
110
A farm uses the machine learning technology of Google to detect defective apples in their crop, like those with irregular sizes or scratches. The goal is to identify only the apples that are actually bad so that no good apples are wasted. Which metric should the model focus on?
Precision
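For reference, both metrics come straight from confusion-matrix counts (a minimal sketch; the counts are made up):

```python
def recall(tp, fn):
    # of all actual positives, how many did we catch? (the hospital example)
    return tp / (tp + fn)

def precision(tp, fp):
    # of everything we flagged positive, how much was right? (the apple example)
    return tp / (tp + fp)

print(recall(80, 20))     # 0.8
print(precision(90, 10))  # 0.9
```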
111
What are the two frameworks supported by Vertex AI Pipelines?
Kubeflow Pipelines (KFP) TensorFlow Extended (TFX)
112
What TensorFlow Extended libraries are there for supporting each stage of the ML pipeline?
Pre-Processing - Tensorflow Data Validation Feature Engineering - Tensorflow Transform Model Evaluation - Tensorflow Model Analysis Serving - Tensorflow Serving
113
What is the difference between the Primary Goals of Vertex AI Pipelines and Ray on Vertex AI?
Vertex AI Pipelines - ML Workflow Orchestration & Automation (e.g. MLOps). This is used for the entire ML Lifecycle steps (data prep, train, deploy). You can view it as the Assembly Line in a factory. Ray on Vertex AI - Distributed Python & ML Compute. This is used for computationally intensive tasks WITHIN the ML lifecycle defined by Vertex AI Pipelines. You can view it as a single powerful multi-worker station on the assembly line.
114
What are the different types of components within Vertex AI Pipelines?
Function Components - Simply write a python function and add the @component decorator Container-Based Components - Anything that can be packed into a Docker container can be orchestrated.
115
What are parameters used for in Vertex AI Pipelines?
Passing data between components
116
What are artifacts used for in Vertex AI Pipelines?
To pass larger datasets between components, such as training data, that cannot be handled by parameters alone.
117
What are conditions used for in Vertex AI Pipelines?
To set rules where a component only runs if certain conditions are met. For example, only deploying the model if certain thresholds are met.
118
What decorators and compile step are required in order to create a Vertex AI Pipeline using code?
You create your function-based and container-based components with the @component decorator. You string together your components in the order in which you want them using the @pipeline decorator. You then use the compiler (kfp.compiler.Compiler) to take your pipeline function and compile it into a pipeline specification as a JSON file. This JSON file can be used to execute the pipeline.
119
What is cross-validation?
Cross-validation ensures that every single data point gets a chance to be in the "test set" at least once. In simple terms, instead of training your model once on one set of data and testing it once on another, cross-validation repeats the training and testing process multiple times on different subsets of your data to ensure your results aren't just a fluke. It operates across full training runs, not within each epoch.
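A minimal sketch of generating k-fold splits by hand (in practice a library helper such as sklearn's KFold does this; the sketch assumes n_samples divides evenly by k):

```python
def k_fold_splits(n_samples, k):
    # yields (train, test) index lists so that every sample
    # lands in exactly one test fold across the k runs
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for fold in range(k):
        test = indices[fold * fold_size:(fold + 1) * fold_size]
        train = [i for i in indices if i not in test]
        yield train, test

for train, test in k_fold_splits(10, 5):
    print(len(train), test)  # 8 samples to train on, 2 held out each run
```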
120
From a folder structure perspective, what do you need to do differently when running a training pipeline against a Custom Container instead of a Pre-Built container?
For Pre-Built Containers, you use a setup.py file to specify all your libraries and dependencies before submitting a training pipeline, using Google’s pre-built containers for PyTorch, Tensorflow, Scikit and XGBoost in Artifact Registry. For Custom Containers, you use a Dockerfile to specify your dependencies. This Dockerfile is pushed to Artifact Registry before you reference this container within your training pipeline.
121
In a neural network, what are the parameters that are learned by the model during training?
Weights and Biases. When we say we have a 10 billion parameter model, we are literally counting the number of weights and biases.
122
What are the hyper-parameters in a Neural Network that a human can decide before training?
1) Layers and neurons 2) Activation functions 3) Learning rate 4) Epochs
123
What is the difference between Vertex AI Hyperparameter Tuning Job vs Vertex AI Vizier?
Vertex AI Hyperparameter Tuning Job is a wrapper around your training code. You give it a Docker container (your model code) and say, "Maximize accuracy by changing the learning rate between 0.01 and 0.1." It spins up the infrastructure, runs the trials, and shuts them down. Vertex AI Vizier is a standalone "optimization engine" (API). It tells you what parameters to try next, but you are responsible for actually running the trial and reporting the result back. This can also therefore be used for non-ML use cases.
124
What are the 3 concepts in training Neural Networks that are utilised for amending the weights and biases?
1) Backpropagation - Modify the weights and biases if the difference is significant. 2) Cost or Loss Functions - Measure the distance between the predicted and actual value. 3) Gradient Descent - Decide how to tune the weights, and when to stop, once the data point reaches the base of the curve.
125
What is the difference between a Convolutional Neural Network and a Recurrent Neural Network?
CNNs process Space, making them good for Image Classification. RNNs process Time, constantly relying on their memory of what just happened to understand the present. This makes them good for NLP, where large sequences of text are provided as input to the model. They are also powerful in Time-Series Forecasting and Speech Recognition.
126
Within Neural Networks, what is a signal of overfitting and how would you go about remedying overfitting?
Over epochs, the evaluation loss should ideally decrease as the training loss decreases. However, if the evaluation loss starts to increase while the training loss continues to decrease, it's a sign of overfitting. Regularisation (through L1 and L2) can be used to help reduce the model’s complexity, making it better at generalising.
127
What is Transfer Learning?
One or more layers from a previously trained model are lifted into a new model that will be used as a starting point for training a new model. For example, knowledge gained while learning to recognise cars could apply when trying to recognise trucks.
128
What are the 2 benefits of Transfer Learning?
1) You can use an available pretrained model, which can be used as a starting point for training your own model. 2) Transfer learning can enable you to develop models even for problems where you may not have very much data.
129
What are some ways to combat model underfitting?
1) Increase model complexity 2) Increase the number of features by performing feature engineering. 3) Remove noise from the data 4) Increase the number of epochs or increase the duration of training to get better results
130
What are some ways to combat model overfitting?
Regularization technique Dropout: Probabilistically remove inputs during training Noise: Add statistical noise to inputs during training Early stopping: Monitor model performance on a validation set and stop training when performance degrades. Data augmentation. Cross‐validation.
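Of these, dropout is easy to sketch in pure Python. A minimal "inverted dropout" variant, with a hypothetical seed parameter for reproducibility:

```python
import random

def dropout(inputs, rate, seed=None):
    # probabilistically zero inputs during training; survivors are scaled
    # by 1/keep so the expected activation magnitude is unchanged
    rng = random.Random(seed)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in inputs]

print(dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, seed=42))
```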
131
In the context of the bias-variance trade-off, what does an underfit model have?
High Bias and Low Variance.
132
In the context of the bias-variance trade-off, what does an overfit model have?
Low bias and high variance
133
What is regularization in model training?
Regularisation is a part of the loss function intended to keep model weights close to zero, ensuring no single weight overpowers the rest in any layer of the neural network. This enables models to generalise better and helps reduce overfitting.
134
What’s the difference between L1 regularisation and L2 regularisation?
L1 Regularisation tends to produce sparse weights, meaning many of the weights become exactly zero. L2 Regularisation prefers to keep all weights small but not exactly zero, maintaining density of weights and leading to a more balanced, but more complex, solution.
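The two penalties are simple to write down. A minimal sketch, where lam is the hypothetical regularisation strength added to the loss:

```python
def l1_penalty(weights, lam):
    # sum of absolute values -> gradient pushes weights to exactly zero (sparse)
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    # sum of squares -> keeps weights small, but rarely exactly zero (dense)
    return lam * sum(w * w for w in weights)

weights = [0.5, -2.0, 0.0, 1.0]
print(l1_penalty(weights, 0.1))  # ~0.35  (0.1 * 3.5)
print(l2_penalty(weights, 0.1))  # ~0.525 (0.1 * 5.25)
```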
135
Simply put, what does Gradient measure?
A gradient simply measures the change in all weights with regard to the change in error
136
What is Batch Normalisation and what is it used for?
Batch Normalisation is used in neural networks in order to force the input of every layer to have a mean of 0 and a variance of approx. 1. By normalizing the activations at every step, the weights don't need to be extremely small or large to handle the data, keeping gradients stable.
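The core idea in a minimal sketch (omitting the learnable scale and shift parameters that the full technique also includes):

```python
def batch_norm(batch, eps=1e-5):
    # normalise a batch of activations to mean 0 and variance ~1
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / (var + eps) ** 0.5 for x in batch]

out = batch_norm([10.0, 20.0, 30.0, 40.0])
print(sum(out) / len(out))  # ~0.0 (the batch is now mean-centred)
```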
137
What is a computational graph in machine learning?
It is a Blueprint of Operations. At its core, a computational graph is a way to represent a series of mathematical operations. Think of it as a flowchart for your machine learning model to follow during training.
138
What are the 2 primary characteristics of a computational graph?
Nodes: These represent the operations themselves, like addition, multiplication, or more complex functions. Edges: These represent the flow of data (tensors) between the operations.
139
What are the 2 approaches to Distributed Training architectures?
Data Parallelism - the dataset is divided into smaller chunks, and each worker node processes a different subset of the data. Model Parallelism - Model parallelism is employed when a model is too large to fit into the memory of a single worker node. In this approach, the model itself is partitioned at particular layers (say 1 GPU handles Layer 1-5, the next GPU handles Layer 6-10 etc.), and different parts of the model are placed on different workers. All workers process the same batch of data.
140
In Distributed Training architectures, what is the Tensorflow command for Model Parallelism?
tf.distribute.Strategy
141
In Distributed Training architectures, what are the steps undertaken when utilising Data Parallelism?
1) Replicate the Model: An identical copy of the model is loaded onto each worker node. 2) Split the Data: The training dataset is partitioned, and each worker receives a unique portion. 3) Parallel Processing: Each worker independently computes the forward and backward passes on its data subset to calculate the gradients. 4) Gradient Synchronisation: The gradients from all workers are aggregated to update the model's parameters.
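Step 4 (gradient synchronisation) in synchronous data parallelism reduces, conceptually, to averaging. A minimal sketch with made-up gradients:

```python
def average_gradients(worker_grads):
    # aggregate the gradients computed by each worker so all model
    # replicas apply the same parameter update (an all-reduce, conceptually)
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / n_workers for i in range(n_params)]

# two workers, each reporting gradients for the same two parameters
print(average_gradients([[0.2, -0.4], [0.4, -0.2]]))  # ~[0.3, -0.3]
```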
142
In Distributed Training architectures, what are the 2 approaches to Data Parallelism? What is the primary benefit and what is the primary drawback of each?
Synchronous Training: All workers must finish processing their data batch and report their gradients before the model parameters are updated. This ensures consistency but can lead to bottlenecks if some workers are slower than others. Asynchronous Training: Each worker updates the model parameters independently without waiting for the others. This can lead to faster training times but may result in less stable convergence as some workers might be using stale model parameters.
143
What are the 3 principles of Machine Learning in a Hybrid Environment?
Composability, Portability and Scalability
144
What is Tensorflow Lite?
TensorFlow Lite is a specialised version of Google's open-source machine learning framework designed to run machine learning models on mobile and embedded devices.
145
What is Quantisation in Machine Learning?
In the realm of machine learning, quantisation is a technique used to reduce the computational and memory costs of running models. It achieves this by converting the numerical precision of a model's parameters (weights) and activations from high-precision floating-point numbers to lower-precision data types, such as 8-bit integers. This process makes models smaller, faster, and more energy-efficient, with a minimal impact on accuracy.
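A minimal sketch of symmetric 8-bit quantisation with one global scale factor (real schemes are usually per-channel and more careful):

```python
def quantise(values, num_bits=8):
    # map floats onto signed integers; keep one float 'scale' for dequantising
    qmax = 2 ** (num_bits - 1) - 1           # 127 for int8
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale

def dequantise(quantised, scale):
    return [q * scale for q in quantised]

weights = [0.5, -1.0, 0.25]
q, scale = quantise(weights)
print(q)                     # [64, -127, 32]
print(dequantise(q, scale))  # close to the originals, small rounding error
```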
146
What are the 3 forms of prompt design for guiding the output of a model?
Zero-Shot prompting - Providing a single command to the LLM without any examples One-Shot Prompting - Providing a single example of the task to the LLM. Few-Shot Prompting - Providing a few examples of the task to the LLM.
147
What is the Prompt Gallery in Vertex AI?
A curated collection of sample prompts that show how generative AI models can work for a variety of use cases.
148
What is temperature within generative models?
The degree of randomness in token selection. A temperature of 0 is deterministic, always selecting the highest-probability token, whereas a temperature of 1 introduces more creativity…but also holds greater risk of unexpectedness and hallucination.
149
What is Top-K within generative models?
A top-k of 1 means the selected token is the most probable among all tokens in the model’s vocabulary (also called greedy decoding), while a top-k of 40 means that the next token is selected from among the 40 most probable tokens (using temperature).
150
What is Top-P within generative models?
Tokens are selected from a set of tokens with the sum of the probabilities not exceeding P. For example, if tokens A, B, and C have a probability of .3, .2, and .1 and the top-p value is .5, then the model will select either A or B as the next token (using temperature) as their probability totals up to .5. This prevents the model from returning a response with extremely low probability, even when you want the range of responses to be high through Top-K.
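The filtering step can be sketched in a few lines (reusing the card's A/B/C probabilities):

```python
def top_p_tokens(token_probs, p):
    # keep the most probable tokens until their cumulative probability
    # reaches p; sampling (with temperature) then happens within this set
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        cumulative += prob
        if cumulative >= p:
            break
    return kept

print(top_p_tokens({"A": 0.3, "B": 0.2, "C": 0.1}, p=0.5))  # ['A', 'B']
```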
151
What are the 4 umbrella methodologies for improving a Generative AI model’s performance?
1) Prompt Design 2) Fine Tuning 3) Reinforcement Learning 4) Distilling
152
What are the benefits of using prompt design to improve Generative AI performance as opposed to more technical methods like fine-tuning, reinforcement or distillation?
Inexpensive vs more training runs Allows fast experimentation and customisation Doesn’t require ML background or complex technical skills
153
What is the difference between Fine-Tuning and Parameter-Efficient Tuning?
Fine-Tuning updates all the parameters of a pre-trained model for a new task, which is computationally expensive but can achieve maximum performance. In contrast, Parameter-Efficient Tuning (PEFT) freezes most of the model's original weights and only trains a very small fraction of new or existing parameters. This makes PEFT vastly more efficient in terms of computational cost and storage, while still delivering strong, competitive results.
154
What are the benefits of Parameter-Efficient Fine-Tuning?
1) Aims to reduce the challenges of fine-tuning 2) Only trains a subset of parameters of a much larger foundational model, making it less computationally expensive 3) Can require smaller datasets, making it more accessible
155
What are the supervised and unsupervised versions of Parameter-Efficient Fine-Tuning?
Adapter-Tuning - involves inserting small, fully-connected neural network layers, called adapter modules, between the existing layers of a pre-trained model. Only the parameters of these new adapter layers are trained, while the original model remains unchanged. Reinforcement - Unsupervised reinforcement learning with human feedback.
156
What is the downside of Adapter-Tuning?
It can introduce latency as you are adding more layers to the neural network.
157
What is distillation in model training?
Transferring knowledge from a larger model to a smaller model to optimise performance, latency and cost.
158
What is the difference between Transfer Learning and Distillation?
Transfer Learning is about Adaptation. You want to take a smart model and teach it a new task. Distillation is about Compression. You want to make a heavy model smaller and faster while keeping the same task.
159
What are the three types of models you can interact with in Model Garden?
Pre-Trained Models - From Google, third party and open source Task Specific Models - Like Entity Extraction, Sentiment Analysis etc. Fine-Tunable Models - Mostly open source
160
What is Vertex AI Studio?
A tool that lets you quickly test and customise generative AI models so you can leverage their capabilities in your applications.
161
What type of data does Vertex AI Metadata track?
Metadata produced by your machine learning (ML) systems, such as parameters, artifacts (like datasets and models), metrics, and the lineage of components, so that you can track and analyse it.
162
What is Model Registry?
A centralised tracking system that stores lineage, versioning and related metadata for published machine learning models.
163
What are the 2 types of serving on Vertex Endpoints?
Online Predictions - Synchronous requests. When a request is sent, the service processes it immediately and returns the prediction in the same response. Batch Predictions - Asynchronous requests. You submit a "job" with a large dataset (from Cloud Storage or BigQuery). Vertex AI processes the data in bulk, and the results are written to a specified output location (e.g., Cloud Storage or BigQuery).
164
When should you use Batch vs Online predictions for Vertex Endpoints?
Online Predictions - Interactive applications, Mobile app backend, or when predictions are generated in a one-by-one workflow. Batch Predictions - Offline analysis, scoring an entire dataset at a point in time and / or scheduled workloads
165
What are Vertex AI Private Endpoints?
Vertex AI Private Endpoints provide a way to access Vertex AI online prediction services using private IP addresses within your Virtual Private Cloud (VPC) network. Instead of sending prediction requests over the public internet, traffic remains within the Google Cloud network, enhancing security and potentially reducing latency. Vertex AI Private Endpoints can be good to use when your backend client that is calling your endpoint is hosted in GCP.
166
How do we ensure that a Model state is not lost by interruptions / failures during a training job?
Applying Training Checkpoints
167
What 2 ways does Vertex AI provide for you to monitor your ML models?
Skew detection: This approach looks for the degree of distortion between your model training and production data Drift detection: In this type of monitoring, you're looking for drift in your production data. Drift occurs when the statistical properties of the inputs and the target, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions could become less accurate as time passes.
168
In a perfect scenario, would you prefer to monitor for skew detection or drift detection and why?
As much as possible, use skew detection, because knowing that your production data has deviated from your training data is a strong indicator that your model isn't performing as expected in production. If you don't have access to the training data, turn on drift detection so that you'll know when the inputs change over time. For drift detection, enable the features you want to monitor and the corresponding thresholds to trigger an alert.
169
How do you enable skew detection on your Vertex AI Monitoring service?
For skew detection, set up the model monitoring job by providing a pointer to the training data that you used to train your model.
170
When it comes to adaptive models, what changing variables are you attempting to mitigate risk against?
1) An upstream model changing 2) A data source maintained by another team changing 3) Data drift - the relationship between features and labels changing
171
What are the different types of data drift?
Changes in label distribution - E.g. A model that predicts how long humans live for has a label that has inherently increased over time. Changes in feature distribution - E.g. a model that predicts population movement patterns using postal code as a feature. Postal codes aren’t fixed and can therefore drift.
172
When predictions are made by a machine learning model, what is meant by extrapolation?
Extrapolation is the process of making predictions on new data that falls outside the range of the training data. The model has to make assumptions about how the patterns it learned from the training data will continue in uncharted territory.
173
When predictions are made by a machine learning model, what is meant by interpolation?
Interpolation is the process of making predictions on new, unseen data that falls WITHIN the boundaries of the training data. For example, if you have a model trained to predict house prices based on square footage, and the training data includes houses between 1,000 and 3,000 square feet, predicting the price of a 2,200 square foot house would be interpolation.
174
What is the difference between Data Drift and Model / Concept Drift?
Data Drift (the features change) - The inputs into the model change. For example, an IoT device changes reporting from degrees Fahrenheit to degrees Celsius. Model / Concept Drift (the relationship between features and labels changes) - Occurs when the relationship between the input data and the output (the target variable) changes. The features themselves might not have changed, but what they mean in relation to the prediction has. For example, say you build a model to classify positive and negative sentiment of a Reddit feed around certain topics. Over time, people's sentiments about these topics change. Or in email spam, malicious actors will adapt their language to bypass spam filters, causing concept drift for your spam detection model.
175
What is Training-Serving Skew?
Any scenario in which the training data is generated differently from how the data is generated / collected in production.
176
What is Ablation Analysis?
Ablation analysis is the process of systematically removing parts of a machine learning model or algorithm to understand the contribution of each component.
177
How do we protect from changing distributions / data drift in Machine Learning?
Monitoring - Look at the descriptive summaries of your inputs and compare them to what the model has seen. For example, if the mean or the variance has changed substantially, then you can analyse this new segment of the input space, to see if the relationships learned still hold. Monitor Residuals - Residuals are the difference between the predictions and the labels. If errors are increasing, or have moved to a different area of the curve, this could be evidence of a change in relationship between features & labels. Custom Loss Function - To emphasise data recency. Regularly Retrain Models - Applying a Dynamic Training principle to retrain at intervals or when data drift is detected by thresholds breaking.
178
What is Data Leakage?
Where data from outside the training dataset is improperly used to create the model. The model essentially learns from information it wouldn't have access to in a real-world scenario, such as averages from its testing dataset or variables that are derived from the label.
179
What are the two types of Data Leakage?
Target Leakage - When training data includes features that are "contaminated" with information about the target variable, but this information won't be available when you actually need to make a prediction. E.g. "Weekly Wages" being a feature of an "Annual Wages" label. Train-Test Contamination - This occurs when you don't properly separate your training and testing datasets before pre-processing / feature engineering. E.g. calculating averages used to impute missing values across both the training and test sets.
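A minimal sketch of avoiding train-test contamination when imputing missing values (None marks a missing entry; the function name is illustrative):

```python
def impute_with_train_mean(train, test):
    # the imputation statistic is computed on the TRAINING split only,
    # then re-used unchanged on the test split -- never on the combined data
    observed = [v for v in train if v is not None]
    mean = sum(observed) / len(observed)
    fill = lambda split: [mean if v is None else v for v in split]
    return fill(train), fill(test)

train, test = impute_with_train_mean([1.0, None, 3.0], [None, 5.0])
print(train)  # [1.0, 2.0, 3.0]
print(test)   # [2.0, 5.0]  (filled with the *training* mean)
```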
180
What is VM Cohosting?
By default, a Vertex AI model is deployed to its own virtual machine (VM) instance. Cohosting enables models to share resources so that CPU, GPU and memory are fully utilised across traffic.
181
What are the benefits of VM Cohosting?
Resource sharing across multiple deployments. Cost-effective model serving. Improved utilization of memory and computational resources.
182
What is the shift in the actual relationship between the model inputs and the output called?
Concept / Model Drift
183
Where does the name Apache Beam come from?
The name Beam comes from a combination of the words “Batch” and “Stream”.
184
What is one key advantage of preprocessing your ML features using Apache Beam?
The same code you use to preprocess features in training and evaluation can also be used in serving.
185
What are the different TPU configurations?
A single TPU device A TPU Pod (a group of TPU devices connected by high‐speed interconnects) A TPU slice (a subdivision of a TPU Pod) A TPU VM
186
What is a tensor?
A tensor is a multi-dimensional array or an N-dimensional list of numbers.
187
What are the different types of tensor?
Scalar (a 0-dimensional array): A single value, like the number 7, has a shape of (), making it a 0-dimensional tensor. Vector (a 1D array): A list of numbers, such as [1, 2, 3], has one dimension and a length (e.g., 3), making it a 1-dimensional tensor. Matrix (a 2D array): A grid of numbers, like a spreadsheet with rows and columns, has two dimensions (rows and columns). For example, a 2x3 matrix is a 2-dimensional tensor with a shape of (2, 3). Higher-Dimensional Tensors: Any array with three or more dimensions is a tensor. For example, a 3D tensor could represent a grayscale video (with dimensions for frames, height, and width), while a 4D tensor could represent a batch of color images.
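These ranks can be checked with a small helper over nested Python lists (a sketch assuming regular, non-ragged nesting):

```python
def shape(tensor):
    # derive the shape of a regularly nested list, stopping at non-lists
    if not isinstance(tensor, list):
        return ()
    return (len(tensor),) + shape(tensor[0])

print(shape(7))                       # ()     -> scalar, 0-D
print(shape([1, 2, 3]))               # (3,)   -> vector, 1-D
print(shape([[1, 2, 3], [4, 5, 6]]))  # (2, 3) -> matrix, 2-D
```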
188
What is tf.Transform a hybrid of?
1) Apache Beam / Dataflow 2) Tensorflow
189
Which Tensorflow component identifies anomalies between training and serving data and can automatically create a schema by examining the data?
Data Validation
190
What is Tensorflow Extended - Data Validation used for?
Can be used for generating schemas and statistics about the distribution of every feature in the dataset. This can be used to ensure that the data used for training a model is consistent with the data the model will see in production.
191
What is the usual workflow when working with Tensorflow Extended - Data Validation?
1) StatisticsGen - Generate statistics for the data 2) SchemaGen - Use those statistics to generate a schema for each feature 3) Visualise the schema and statistics and manually inspect them 4) Update the schema if needed
192
What is the difference between Pairwise and Pointwise evaluation within Generative AI evaluation?
The difference lies in their scope. In pointwise evaluation, the evaluator looks at a single prompt and a single response, and assigns a score based on a specific rubric or scale (e.g., a 1–5 Likert scale, Pass/Fail, or binary Correct/Incorrect). In pairwise evaluation, the evaluator is presented with one prompt and two different responses and picks the better one (like the LMSYS Chatbot Arena).
193
What are Rubrics within GenAI Evaluation?
Rubrics are a set of instructions and criteria given to the evaluation (whether a human or an LLM-as-a-Judge) to define exactly how to score a response. Without a rubric, evaluation is subjective "vibes-based" checking. With a rubric, evaluation becomes a measurable, reproducible metric.
194
What are metrics within GenAI Evaluation?
A score that measures the model output against the rating rubrics.
195
What do BLEU, ROUGE and METEOR focus on within Generative AI evaluation?
ROUGE - Focuses on recall BLEU - Focuses on precision METEOR - Focuses on precision and recall
196
Within GenAI Evaluation, what does perplexity measure?
Quantifies how well the language model predicts the next word in a sequence.
197
What is the difference between Data Engineering and Feature Engineering?
Preprocessing the data for ML involves both data engineering and feature engineering. Data engineering is the process of converting raw data into prepared data. Feature engineering then tunes the prepared data to create the features that are expected by the ML model.
198
What is PCA in feature engineering?
Principal Component Analysis. It is a statistical technique used in feature engineering for dimensionality reduction. Essentially, it simplifies complex data by reducing the number of variables (features) while retaining as much of the original information (variance) as possible.
199
What is MinMax Scaling also known as?
MinMax Scaling (also commonly referred to as Normalization) is a feature scaling technique that shifts and rescales the values of a numeric feature so they end up ranging between two fixed numbers, typically 0 and 1.
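A minimal sketch of the formula (in practice sklearn's MinMaxScaler is the usual tool):

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    # shift and rescale so the feature spans [new_min, new_max]
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

print(min_max_scale([10, 20, 30, 50]))  # [0.0, 0.25, 0.5, 1.0]
```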
200
If I am using TensorFlow Transform, do I need to deploy pre-processing in front of my model endpoint at serving time?
No, you generally do not need to deploy a separate pre-processing service in front of your model endpoint. One of the primary value propositions of TensorFlow Transform (TFT) is that it allows you to "bake" your pre-processing logic directly into the model graph that you export for serving. This ensures that the exact same transformations used during training are applied during serving, eliminating training-serving skew.