Implement AI Model Flashcards

(112 cards)

1
Q

What does NLP stand for?

A

Natural Language Processing

NLP is the process of turning messy human language into something machines can analyze and learn from.

2
Q

Why is NLP considered challenging?

A
  • Language is unstructured
  • Meaning includes context, tone, sarcasm, and word combinations

These factors make it difficult for machines to understand human language accurately.

3
Q

What is the Bag of Words / TF-IDF technique used for in NLP?

A

Counts how often words appear

Common words like ‘and’ are less important, while rare words are more significant.

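The counting idea behind Bag of Words can be sketched in a few lines of plain Python (a toy illustration; real projects use scikit-learn's CountVectorizer, and TF-IDF adds the down-weighting step):

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Bag of Words: one word-frequency count per document
bags = [Counter(doc.split()) for doc in docs]
print(bags[0]["the"])  # → 2

# Words shared by every document (like 'the') carry little signal;
# TF-IDF would down-weight them and boost rarer words like 'cat'.
shared = set(bags[0]) & set(bags[1])
print(sorted(shared))  # → ['on', 'sat', 'the']
```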
4
Q

What does tokenization do?

A

Splits text into words or sentences and removes punctuation

Naive splitting can break meaning, for example in multi-word names.

5
Q

What is the purpose of stop word removal?

A

Removes boring words like ‘the’, ‘and’, ‘to’

This makes text smaller and faster to process, but care must be taken not to remove meaning.

6
Q

What is stemming in NLP?

A

Chops words down to a base form

For example, ‘faster’ becomes ‘fast’.

7
Q

What is the difference between stemming and lemmatization?

A
  • Stemming: Chops words to a base form
  • Lemmatization: Maps words to their real root

Lemmatization keeps meaning better, e.g., ‘coding’, ‘coded’, ‘codes’ → ‘code’.

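The stemming-vs-lemmatization contrast can be shown with a toy sketch (real work uses NLTK or spaCy; the crude suffix-stripper and the tiny lookup table below are hypothetical stand-ins):

```python
def crude_stem(word):
    # Toy stemmer: blindly chop common suffixes (can mangle words)
    for suffix in ("ing", "ed", "es", "s", "er"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lemmatization instead maps each form to its real dictionary root
LEMMAS = {"coding": "code", "coded": "code", "codes": "code",
          "better": "good", "was": "be"}

def crude_lemma(word):
    return LEMMAS.get(word, word)

for w in ["coding", "coded", "codes", "better"]:
    print(w, "->", crude_stem(w), "/", crude_lemma(w))
# Stemming yields chopped stems like 'cod' and 'bett';
# lemmatization keeps meaning: 'code' and 'good'.
```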
8
Q

What do recommender systems do?

A

Suggest things you’ll probably like

They are used by platforms like Netflix, Amazon, and Spotify.

9
Q

What is Collaborative Filtering?

A

Based on your past behavior

Example: ‘You liked X, so try Y’. It improves over time but struggles with new users.

10
Q

What is Content-Based Filtering?

A

Based on who you are and what the item is

It uses factors like age, gender, preferences, and item features.

11
Q

True or false: NLP helps machines extract meaning from human language.

A

TRUE

It involves both simple methods that count words and advanced methods that try to understand context.

12
Q

What is the final takeaway regarding NLP and recommender systems?

A

NLP helps machines extract meaning; cleaned text feeds into ML models for recommendations

Human-level language understanding remains a significant AI research goal.

13
Q

What is the definition of Machine learning?

A

A data-driven approach that uses algorithms to learn patterns and relationships from data without being explicitly programmed

Machine learning enables systems to improve their performance on tasks through experience.

14
Q

In machine learning, what does the developer provide to the algorithm?

A
  • Data
  • An objective

The algorithm uses this information to learn and create a model.

15
Q

What is created after the algorithm is trained on the provided data?

A

A model

The trained model is used for predicting behaviors and outputs.

16
Q

True or false: The trained model in machine learning can be used for decision-making on unseen data.

A

TRUE

This capability allows for predictions based on new inputs.

17
Q

List some practical applications of machine learning.

A
  • E-mail spam detection
  • Customer Churn
  • Text Sentiment Analysis
  • Fraud Detection
  • Real-time Ads
  • Recommendation Engine

These applications demonstrate the versatility of machine learning across various industries.

18
Q

What is the first step in the supervised ML project workflow in scikit-learn?

A

Data

The workflow follows a loop: Data → split → fit model → predict → (evaluate) → save model.

19
Q

What libraries are imported to handle data in a supervised ML project?

A
  • numpy
  • pandas

These libraries are essential for data manipulation and analysis.

20
Q

In the context of supervised ML, what does X represent?

A

Features (inputs)

X = df.drop('species', axis=1) represents measurements of the flower.

21
Q

In the context of supervised ML, what does y represent?

A

Target (what we predict)

y = df['species'] indicates the flower species being predicted.

22
Q

What is the purpose of the train_test_split function?

A
  • Model learns from train set
  • Model is tested on unseen data from test set

It is crucial for evaluating model performance.

23
Q

What does test_size=0.2 indicate in the train_test_split function?

A

80% train, 20% test

This defines the proportion of data used for training versus testing.

24
Q

What does random_state=101 ensure in the train_test_split function?

A

Reproducibility

It allows for the same split every time, yielding consistent results.

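The full loop the preceding cards describe (data → split → fit → predict → save) fits in a short script. A hedged sketch using scikit-learn's built-in iris data in place of a DataFrame with a 'species' column:

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Data: X = features (flower measurements), y = target (species)
X, y = load_iris(return_X_y=True)

# Split: 80% train, 20% test; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)

model = DecisionTreeClassifier()     # choose the algorithm (no training yet)
model.fit(X_train, y_train)          # training happens here
preds = model.predict(X_test)        # array of predicted species labels
probs = model.predict_proba(X_test)  # confidence for each class

print("accuracy:", model.score(X_test, y_test))

# Save the trained model so we don't retrain every time
joblib.dump(model, "my_first_ml_model.pkl")
```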
25
Q

What is created when you execute **model = DecisionTreeClassifier()**?

A

A model object

This does NOT train the model yet; it merely specifies the algorithm to be used.

26
Q

What happens during the **fit** step of the model?

A

Model learns patterns in X_train

This is where the actual training occurs using the training data.

27
Q

What does the **predict** method do in a trained model?

A

Predicts labels

It outputs an array of predicted species based on the test data.

28
Q

What is the output of **model.predict_proba(X_test)**?

A

Confidence for each class

It provides probabilities for each predicted class.

29
Q

What is the **golden rule** for live data when using a model?

A

Same columns, same format, same preprocessing

This ensures compatibility with the trained model.

30
Q

What command is used to **save the model** after training?

A

joblib.dump(model, 'my_first_ml_model.pkl')

This allows for future use without needing to retrain.

31
Q

True or false: The lesson covers evaluation metrics, hyperparameter tuning, and data cleaning.

A

FALSE

These topics are mentioned as aspects that will be covered later.

32
Q

What is the **standard scikit-learn supervised ML workflow**?

A
  • Split data
  • Fit a model
  • Predict
  • Save the model

This pattern repeats forever in ML.

33
Q

What is the purpose of **importing packages** in the ML workflow?

A
  • Tools for data handling
  • Visuals
  • ML models + utilities

This step is considered boilerplate and does not involve learning yet.

34
Q

What is the **rule** for deciding features vs target in a dataset?

A
  • X = what the model sees
  • y = what the model learns to predict

This decision depends on the business/problem goal, not the dataset.

35
Q

What is the purpose of the **train_test_split** function in supervised learning?

A
  • Train set: model learns patterns
  • Test set: model is tested on unseen data

Key parameters include test_size=0.2 and random_state=101 for reproducibility.

36
Q

What does creating the model with **model = DecisionTreeClassifier()** signify?

A

This does NOT train anything

You're just choosing which algorithm to use.

37
Q

What happens during the **fit** step in the ML workflow?

A

The model sees feature patterns and learns how they map to species

This is the actual training step.

38
Q

What does **model.predict(X_test)** output?

A

An array of predicted species

model.predict_proba(X_test) additionally gives confidence for each class, e.g. [0.0, 1.0, 0.0] indicates 100% confidence for versicolor.

39
Q

What is the **golden rule** when predicting on live data?

A
  • Same columns
  • Same order
  • Same preprocessing as training data

This ensures the model works correctly with new inputs.

40
Q

What is the purpose of **saving the model** using joblib?

A
  • Don't retrain every time
  • Use the model in apps, APIs, dashboards

Load later with loaded_model = joblib.load('my_first_ml_model.pkl').
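The save/load round trip and the golden rule can be sketched together (a minimal example, assuming a model trained on the 4-column iris data):

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier().fit(X, y)

# Save once, load anywhere (app, API, dashboard) without retraining
joblib.dump(model, "my_first_ml_model.pkl")
loaded_model = joblib.load("my_first_ml_model.pkl")

# Golden rule: live data must have the same columns, same order and
# same preprocessing as the training data (here: 4 iris measurements)
live_row = [[5.1, 3.5, 1.4, 0.2]]
print(loaded_model.predict(live_row))
```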
41
Q

What are **Parameters** in machine learning?

A
  • Learned from data
  • Examples: tree splits, weights, coefficients

Parameters are the internal variables that the model learns during training.

42
Q

What are **Hyperparameters** in machine learning?

A
  • Set by YOU
  • Examples: n_estimators, max_depth

Hyperparameters are external configurations that govern the training process.

43
Q

What does **GridSearchCV** do?

A

Tunes hyperparameters

It automates the process of finding the best hyperparameter combinations using cross-validation.

44
Q

Before using GridSearchCV, what was the common issue with training models?

A

Trained one model with default settings

There was uncertainty about whether the default settings were optimal.

45
Q

After implementing GridSearchCV, how do you train models?

A
  • Train many models
  • Different hyperparameter combinations
  • Using cross-validation

This approach allows for a systematic evaluation of model performance.

46
Q

What is the **test set** in machine learning?

A

NEVER touched during tuning

The test set is used solely for evaluating the final model's performance.

47
Q

What is the **workflow** when using GridSearchCV?

A
  • Split data
  • Define pipeline
  • Define hyperparameter grid
  • GridSearchCV (fit MANY models)
  • Select best model
  • Evaluate on test set

This workflow includes an additional step for hyperparameter tuning.

48
Q

What is the purpose of a **pipeline** in machine learning?

A
  • Ensures no data leakage
  • Makes GridSearchCV possible
  • Treats preprocessing + model as ONE object

Pipelines streamline the process and maintain data integrity.

49
Q

What does the **param_grid** define in GridSearchCV?

A

Hyperparameter values to try

It specifies the hyperparameters and their respective values for tuning.

50
Q

What does the **cv** parameter represent in GridSearchCV?

A

Number of cross-validation folds

It determines how many subsets the training data will be divided into for validation.

51
Q

What happens when you call **grid.fit(X_train, y_train)**?

A
  • Trains models
  • Validates models
  • Scores models
  • Averages results

This process involves multiple iterations based on the defined hyperparameters and cross-validation.

52
Q

What does **grid.cv_results_** provide?

A

A dictionary of results

It contains information about hyperparameters tried, mean validation scores, and rankings.

53
Q

How do you obtain the **best hyperparameters** from GridSearchCV?

A

grid.best_params_

This returns the hyperparameter combination that yielded the best performance.

54
Q

What does **grid.best_estimator_** return?

A

The entire trained pipeline

This includes the model with the best hyperparameters, ready for deployment or evaluation.

55
Q

What does a big train–test performance gap indicate?

A

Overfitting

This suggests that the model learned the training data very well but generalizes poorly to unseen data.

56
Q

True or false: GridSearchCV automates hyperparameter tuning using cross-validation.

A

TRUE

It simplifies the process of optimizing model performance through systematic evaluation.
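The GridSearchCV workflow the cards above describe can be sketched end to end (a minimal example on the iris data; the step names "scaler"/"model" and the grid values are illustrative choices, not the notebook's):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)

# Pipeline: preprocessing + model as ONE object (prevents data leakage)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=101)),
])

# param_grid: hyperparameter values to try, addressed as <step>__<param>
param_grid = {
    "model__n_estimators": [10, 50],
    "model__max_depth": [2, None],
}

grid = GridSearchCV(pipe, param_grid, cv=5)  # cv = number of CV folds
grid.fit(X_train, y_train)  # trains, validates, scores, averages

print(grid.best_params_)               # best hyperparameter combination
best_pipeline = grid.best_estimator_   # entire trained pipeline
print(best_pipeline.score(X_test, y_test))  # test set: only touched now
```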
57
Q

What is the objective of using **GridSearchCV** in this context?

A

Use GridSearchCV with a custom scoring metric to optimise a binary classification model based on business priorities, not just accuracy

Focus on recall for malignant cancer.

58
Q

What are the key **differences** between regression and classification metrics?

A
  • Regression: R², MAE, MSE; Classification: accuracy, recall, precision, F1
  • Regression: one output; Classification: multiple classes
  • Regression: no class priority; Classification: class priority matters

In classification, you must decide what matters more.

59
Q

In the business context, what does **Class 0** represent?

A

malignant

Class 1 represents benign.

60
Q

What is the client decision regarding **malignant tumors**?

A

Missing a malignant tumour is worse than falsely flagging a benign one

This decision drives the optimization for recall on class 0.

61
Q

What is the first step in the **workflow** for using GridSearchCV?

A

Load & split data

Use the breast cancer dataset for binary classification.

62
Q

What does the **pipeline** in GridSearchCV help prevent?

A

Data leakage

It keeps preprocessing and model together.

63
Q

Why is **accuracy** not enough as a scoring metric in classification?

A

Accuracy can lie

A model could predict 'benign' always and still achieve high accuracy while failing to identify malignant cases.

64
Q

What is the purpose of using **make_scorer** in GridSearchCV?

A

To define a custom scoring metric focused on recall for malignant cases

Example: scoring = make_scorer(recall_score, pos_label=0).

65
Q

What happens when you run **GridSearchCV** with the defined parameters?

A

Multiple models are trained and recall is computed for malignant cases

Scores are averaged per hyperparameter.

66
Q

What does the output of GridSearchCV provide regarding **n_estimators**?

A

Mean recall scores for different n_estimators

Example output: n_estimators=50 → mean ≈ 0.86.

67
Q

How do you obtain the **best model** from GridSearchCV?

A

pipeline = grid.best_estimator_

This gives you the full pipeline with the best hyperparameters.

68
Q

What should you evaluate after obtaining the best model?

A

Confusion matrix + report

Check recall for malignant and precision trade-off.

69
Q

What defines **model success** in this context?

A

Business rules, not code

Example: If client threshold ≥ 90% → acceptable.

70
Q

What is the final mental model for **GridSearchCV** in classification?

A

Same workflow as regression, but scoring metric defines what 'best' means

High recall often lowers precision (trade-off).

71
Q

What are some **real-life applications** of this approach?

A
  • Build medical ML systems
  • Detect fraud
  • Catch churn
  • Flag risky users

Focus on optimizing decisions rather than just training models.
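Putting the custom-scoring cards together, a hedged sketch of tuning for recall on the malignant class (class 0) of the breast cancer dataset; the grid values and the RandomForest choice are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Binary target: class 0 = malignant, class 1 = benign
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)

# Pipeline keeps preprocessing and model together (no data leakage)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=101)),
])

# Business priority: don't miss malignant tumours, so optimise recall
# for class 0 instead of plain accuracy
scoring = make_scorer(recall_score, pos_label=0)

grid = GridSearchCV(pipe, {"model__n_estimators": [10, 50]},
                    cv=5, scoring=scoring)
grid.fit(X_train, y_train)

pipeline = grid.best_estimator_
malignant_recall = recall_score(y_test, pipeline.predict(X_test), pos_label=0)
print(grid.best_params_)
print("malignant recall:", malignant_recall)
```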
72
Q

What is the first step in the **workflow** for binary classification?

A

Split data

This initiates the process of preparing data for model training and evaluation.

73
Q

What is the structure of a **pipeline** in the workflow?

A
  • Scaling
  • Feature selection
  • Model (e.g., RandomForestClassifier)

The pipeline organizes the steps for data processing and model training.

74
Q

What is included in the **hyperparameter grid**?

A
  • Parameters to tune
  • Example: n_estimators = [10, 20]

Hyperparameter tuning is essential for optimizing model performance.

75
Q

How does scoring differ between **binary** and **multiclass** classification?

A
  • Binary: make_scorer(metric, pos_label=…)
  • Multiclass: make_scorer(metric, labels=[class_of_interest], average=None)

Different scoring methods are required to evaluate models based on the type of classification.

76
Q

In the context of **GridSearchCV**, what does the workflow involve?

A
  • Train multiple CV folds for each hyperparameter combination
  • Compute metric for the target class
  • Pick combination with highest score

This process ensures thorough evaluation of hyperparameter settings.

77
Q

What do **grid.best_params_** and **grid.best_estimator_** represent?

A
  • grid.best_params_: best hyperparameters
  • grid.best_estimator_: trained best pipeline

These attributes provide access to the optimal settings and model after grid search.

78
Q

What is crucial for evaluating results in terms of business requirements?

A
  • Precision
  • Recall
  • F1

The chosen metric should align with the client's specific needs for a particular class.
79
Q

What does **PCA** stand for?

A

Principal Component Analysis

PCA is a technique used to reduce the dimensionality of data while preserving as much variance as possible.

80
Q

What is the **big idea** behind PCA?

A

Squish many features into fewer, smarter features while keeping as much important information (variance) as possible

This allows for a more manageable dataset while retaining essential information.

81
Q

What is the trade-off when using PCA?

A
  • Fewer dimensions
  • Faster models
  • Better visuals
  • Components are not directly interpretable

While PCA simplifies data, it can make interpretation of components challenging.

82
Q

When should you **use PCA**?

A
  • Lots of numeric features
  • Features are correlated
  • Visualise high-dimensional data
  • Speed up models
  • Reduce noise
  • Help clustering or classification

PCA is beneficial in scenarios where data complexity needs to be managed.

83
Q

When should you **avoid PCA**?

A
  • Must explain features to stakeholders
  • Feature meaning matters more than performance

In cases where interpretability is crucial, PCA may not be suitable.

84
Q

What is the first step in the PCA process outlined in the notebook?

A

Load data

The dataset used is the breast cancer dataset with 30 numeric features.

85
Q

What is the target variable in the breast cancer dataset?

A
  • 0 = malignant
  • 1 = benign

The target variable indicates the diagnosis of the cancer.

86
Q

Why is it important to **clean and scale** the data before applying PCA?

A

PCA is distance-based; different scales can break PCA

Cleaning and scaling ensure that PCA functions correctly.

87
Q

What is the result of cleaning and scaling the breast cancer dataset?

A

341 rows × 30 features → NumPy array

This prepares the data for PCA application.

88
Q

What is the key result when testing PCA with 30 components?

A
  • Component 0 → 43.7%
  • Component 1 → 18.5%
  • Component 2 → 10.3%

The first three components explain approximately 72% of the information.

89
Q

What is the rule of thumb for deciding how many components to keep in PCA?

A

80–95% variance is usually enough

This guideline helps in determining the optimal number of components.

90
Q

What is the final dimensionality reduction result of the breast cancer dataset using PCA?

A

30 features → 7 components

This reduction retains 91.37% variance.

91
Q

What does the output of PCA look like after dimensionality reduction?

A

x_PCA shape: (341, 7)

Each row represents the same observation in a new coordinate system.

92
Q

What is the **visualisation magic** achieved after applying PCA?

A

Much clearer separation with just 2 numbers containing 62% of the data's info

PCA allows for better visual analysis of data.

93
Q

What is the mental model to remember about PCA?

A

PCA = rotate + compress

Components represent directions of maximum variation, and explained variance indicates usefulness.

94
Q

What are the amazing benefits of PCA?

A
  • Reduces dimensionality
  • Requires scaled numeric data
  • Outputs components, not features
  • Useful for visualisation, preprocessing, noise reduction, ML performance

PCA is a powerful tool in data analysis and machine learning.
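The PCA recipe from the cards above (scale → inspect explained variance → keep ~90%) can be sketched like this; note it uses the full 569-row scikit-learn breast cancer dataset, whereas the notebook's cleaned version had 341 rows:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # 30 numeric features

# PCA is distance-based, so scale the features first
X_scaled = StandardScaler().fit_transform(X)

# Fit with all components to inspect explained variance per component
pca = PCA(n_components=30).fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Rule of thumb: keep enough components for roughly 80-95% variance
n_keep = int(np.argmax(cumulative >= 0.90)) + 1
print(n_keep, "components explain", round(cumulative[n_keep - 1], 4))

# Rotate + compress: same observations in a new coordinate system
x_pca = PCA(n_components=n_keep).fit_transform(X_scaled)
print("x_pca shape:", x_pca.shape)  # 30 features -> n_keep components
```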
95
Q

In **NLP**, what is the key shift compared to normal ML?

A

Features = text

Text is not numbers; ML models only understand numbers.

96
Q

What is the main goal of **NLP**?

A

Turning messy human language into clean numeric features

Everything else (pipelines, CV, GridSearch) stays familiar.

97
Q

What type of classifier is being built in this notebook?

A

A spam classifier for SMS messages

Input: 'WIN a free ticket now!!!' Output: spam or ham.

98
Q

What are the steps in the **NLP workflow**?

A
  • Load data
  • Split train/test
  • Clean text
  • Convert text → numbers
  • Train models
  • Compare performance

The magic happens in steps 3 & 4.

99
Q

What are the columns in the **dataset**?

A
  • message → feature
  • label → target (spam/ham)

After sampling: smaller dataset for faster learning.

100
Q

In the **train/test split**, what do X_train and y_train represent?

A
  • X_train → messages
  • y_train → labels

Same logic as usual. No surprises here.

101
Q

Why is **text cleaning** necessary in NLP?

A

Raw text is chaotic

Issues include uppercase/lowercase differences, punctuation noise, and inconsistent formatting.

102
Q

What does the custom transformer **text_cleaning** do?

A
  • Lowercases text
  • Removes punctuation

Normalizes language so the model doesn't panic.

103
Q

What is the core NLP trick for converting text to numbers?

A

Feature extraction

This includes CountVectorizer and TF-IDF Transformer.

104
Q

What does **CountVectorizer** do?

A
  • Tokenises text into words
  • Counts word frequency
  • Removes English stop words

Example: 'I love free pizza' → ['love', 'free', 'pizza'] → [1, 1, 1].

105
Q

What is the purpose of the **TF-IDF Transformer**?

A

Penalises common words and boosts rare, meaningful words

Makes spam words pop (free, win, cash).

106
Q

Why does **TF-IDF** beat raw counts?

A

'the' ≠ useful; 'win' = very useful

It helps identify important words in messages.

107
Q

What are the steps in the **full NLP pipeline**?

A
  • Text cleaning
  • Tokenisation (CountVectorizer)
  • TF-IDF weighting
  • Model

This pipeline allows passing raw text to get predictions without manual preprocessing.

108
Q

Which models were tested in the notebook?

A
  • SGDClassifier
  • LinearSVC

These models are designed for high-dimensional sparse data and work well with TF-IDF.

109
Q

Why are tree models not used in NLP?

A

Trees hate sparse text matrices

This makes them less effective for text classification.

110
Q

What does **HyperparameterOptimizationSearch** do?

A

Builds multiple pipelines and runs GridSearchCV

Here, only default hyperparameters are tested for model comparison.

111
Q

What was observed in the **results interpretation**?

A
  • SGDClassifier performed best
  • LinearSVC very close behind

Both are linear models that love TF-IDF features.

112
Q

What is the core takeaway regarding **NLP success**?

A

NLP success depends more on feature extraction than fancy models

Key tools include CountVectorizer and TfidfTransformer.
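The full NLP pipeline from the cards above can be sketched in a few lines; the eight-message corpus below is a toy stand-in for the SMS dataset, and lowercasing is delegated to CountVectorizer rather than a custom text_cleaning transformer:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Tiny toy corpus standing in for the SMS dataset
messages = [
    "win a free ticket now", "free cash prize win now",
    "claim your free prize", "win cash now",
    "are we still on for lunch", "see you at the meeting",
    "can you call me later", "lunch at noon works for me",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Raw text in -> prediction out: tokenising + counting (CountVectorizer),
# TF-IDF weighting, then a linear model that likes sparse features
pipe = Pipeline([
    ("counts", CountVectorizer(lowercase=True, stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("model", SGDClassifier(random_state=101)),
])
pipe.fit(messages, labels)

print(pipe.predict(["WIN a free ticket now!!!"]))  # predicts 'spam'
```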