Implement AI Model Flashcards

(112 cards)

1
Q

What does NLP stand for?

A

Natural Language Processing

NLP is the process of turning messy human language into something machines can analyze and learn from.

2
Q

Why is NLP considered challenging?

A
  • Language is unstructured
  • Meaning includes context, tone, sarcasm, and word combinations

These factors make it difficult for machines to understand human language accurately.

3
Q

What is the Bag of Words / TF-IDF technique used for in NLP?

A

Counts how often words appear

Common words like ‘and’ are less important, while rare words are more significant.

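The counting idea behind Bag of Words can be sketched in a few lines of plain Python (a toy illustration; real projects use scikit-learn's CountVectorizer, and TF-IDF adds the down-weighting step):

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Bag of Words: one word-frequency count per document
bags = [Counter(doc.split()) for doc in docs]
print(bags[0]["the"])  # → 2

# Words shared by every document (like 'the') carry little signal;
# TF-IDF would down-weight them and boost rarer words like 'cat'.
shared = set(bags[0]) & set(bags[1])
print(sorted(shared))  # → ['on', 'sat', 'the']
```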
4
Q

What does tokenization do?

A

Splits text into words or sentences and removes punctuation

Naive splitting can break meaning, for example in multi-word names.

5
Q

What is the purpose of stop word removal?

A

Removes boring words like ‘the’, ‘and’, ‘to’

This makes text smaller and faster to process, but care must be taken not to remove meaning.

6
Q

What is stemming in NLP?

A

Chops words down to a base form

For example, ‘faster’ becomes ‘fast’.

7
Q

What is the difference between stemming and lemmatization?

A
  • Stemming: Chops words to a base form
  • Lemmatization: Maps words to their real root

Lemmatization keeps meaning better, e.g., ‘coding’, ‘coded’, ‘codes’ → ‘code’.

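The stemming-vs-lemmatization contrast can be shown with a toy sketch (real work uses NLTK or spaCy; the crude suffix-stripper and the tiny lookup table below are hypothetical stand-ins):

```python
def crude_stem(word):
    # Toy stemmer: blindly chop common suffixes (can mangle words)
    for suffix in ("ing", "ed", "es", "s", "er"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lemmatization instead maps each form to its real dictionary root
LEMMAS = {"coding": "code", "coded": "code", "codes": "code",
          "better": "good", "was": "be"}

def crude_lemma(word):
    return LEMMAS.get(word, word)

for w in ["coding", "coded", "codes", "better"]:
    print(w, "->", crude_stem(w), "/", crude_lemma(w))
# Stemming yields chopped stems like 'cod' and 'bett';
# lemmatization keeps meaning: 'code' and 'good'.
```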
8
Q

What do recommender systems do?

A

Suggest things you’ll probably like

They are used by platforms like Netflix, Amazon, and Spotify.

9
Q

What is Collaborative Filtering?

A

Based on your past behavior

Example: ‘You liked X, so try Y’. It improves over time but struggles with new users.

10
Q

What is Content-Based Filtering?

A

Based on who you are and what the item is

It uses factors like age, gender, preferences, and item features.

11
Q

True or false: NLP helps machines extract meaning from human language.

A

TRUE

It involves both simple methods that count words and advanced methods that try to understand context.

12
Q

What is the final takeaway regarding NLP and recommender systems?

A

NLP helps machines extract meaning; cleaned text feeds into ML models for recommendations

Human-level language understanding remains a significant AI research goal.

13
Q

What is the definition of Machine learning?

A

A data-driven approach that uses algorithms to learn patterns and relationships from data without being explicitly programmed

Machine learning enables systems to improve their performance on tasks through experience.

14
Q

In machine learning, what does the developer provide to the algorithm?

A
  • Data
  • An objective

The algorithm uses this information to learn and create a model.

15
Q

What is created after the algorithm is trained on the provided data?

A

A model

The trained model is used for predicting behaviors and outputs.

16
Q

True or false: The trained model in machine learning can be used for decision-making on unseen data.

A

TRUE

This capability allows for predictions based on new inputs.

17
Q

List some practical applications of machine learning.

A
  • E-mail spam detection
  • Customer Churn
  • Text Sentiment Analysis
  • Fraud Detection
  • Real-time Ads
  • Recommendation Engine

These applications demonstrate the versatility of machine learning across various industries.

18
Q

What is the first step in the supervised ML project workflow in scikit-learn?

A

Data

The workflow follows a loop: Data → split → fit model → predict → (evaluate) → save model.

19
Q

What libraries are imported to handle data in a supervised ML project?

A
  • numpy
  • pandas

These libraries are essential for data manipulation and analysis.

20
Q

In the context of supervised ML, what does X represent?

A

Features (inputs)

X = df.drop('species', axis=1) represents measurements of the flower.

21
Q

In the context of supervised ML, what does y represent?

A

Target (what we predict)

y = df['species'] indicates the flower species being predicted.

22
Q

What is the purpose of the train_test_split function?

A
  • Model learns from train set
  • Model is tested on unseen data from test set

It is crucial for evaluating model performance.

23
Q

What does test_size=0.2 indicate in the train_test_split function?

A

80% train, 20% test

This defines the proportion of data used for training versus testing.

24
Q

What does random_state=101 ensure in the train_test_split function?

A

Reproducibility

It allows for the same split every time, yielding consistent results.

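The full loop the preceding cards describe (data → split → fit → predict → save) fits in a short script. A hedged sketch using scikit-learn's built-in iris data in place of a DataFrame with a 'species' column:

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Data: X = features (flower measurements), y = target (species)
X, y = load_iris(return_X_y=True)

# Split: 80% train, 20% test; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)

model = DecisionTreeClassifier()     # choose the algorithm (no training yet)
model.fit(X_train, y_train)          # training happens here
preds = model.predict(X_test)        # array of predicted species labels
probs = model.predict_proba(X_test)  # confidence for each class

print("accuracy:", model.score(X_test, y_test))

# Save the trained model so we don't retrain every time
joblib.dump(model, "my_first_ml_model.pkl")
```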
25
Q

What is created when you execute **model = DecisionTreeClassifier()**?

A

A model object

This does NOT train the model yet; it merely specifies the algorithm to be used.

26
Q

What happens during the **fit** step of the model?

A

Model learns patterns in X_train

This is where the actual training occurs using the training data.

27
Q

What does the **predict** method do in a trained model?

A

Predicts labels

It outputs an array of predicted species based on the test data.

28
Q

What is the output of **model.predict_proba(X_test)**?

A

Confidence for each class

It provides probabilities for each predicted class.

29
Q

What is the **golden rule** for live data when using a model?

A

Same columns, same format, same preprocessing

This ensures compatibility with the trained model.

30
Q

What command is used to **save the model** after training?

A

joblib.dump(model, 'my_first_ml_model.pkl')

This allows for future use without needing to retrain.

31
Q

True or false: The lesson covers evaluation metrics, hyperparameter tuning, and data cleaning.

A

FALSE

These topics are mentioned as aspects that will be covered later.

32
Q

What is the **standard scikit-learn supervised ML workflow**?

A
  • Split data
  • Fit a model
  • Predict
  • Save the model

This pattern repeats forever in ML.

33
Q

What is the purpose of **importing packages** in the ML workflow?

A
  • Tools for data handling
  • Visuals
  • ML models + utilities

This step is considered boilerplate and does not involve learning yet.

34
Q

What is the **rule** for deciding features vs target in a dataset?

A
  • X = what the model sees
  • y = what the model learns to predict

This decision depends on the business/problem goal, not the dataset.

35
Q

What is the purpose of the **train_test_split** function in supervised learning?

A
  • Train set: model learns patterns
  • Test set: model is tested on unseen data

Key parameters include test_size=0.2 and random_state=101 for reproducibility.

36
Q

What does creating the model with **model = DecisionTreeClassifier()** signify?

A

This does NOT train anything

You're just choosing which algorithm to use.

37
Q

What happens during the **fit** step in the ML workflow?

A

The model sees feature patterns and learns how they map to species

This is the actual training step.

38
Q

What does **model.predict(X_test)** output?

A

An array of predicted species

model.predict_proba(X_test) additionally gives confidence for each class, e.g. [0.0, 1.0, 0.0] indicates 100% confidence for versicolor.

39
Q

What is the **golden rule** when predicting on live data?

A
  • Same columns
  • Same order
  • Same preprocessing as training data

This ensures the model works correctly with new inputs.

40
Q

What is the purpose of **saving the model** using joblib?

A
  • Don't retrain every time
  • Use the model in apps, APIs, dashboards

Load later with loaded_model = joblib.load('my_first_ml_model.pkl').
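The save/load round trip and the golden rule can be sketched together (a minimal example, assuming a model trained on the 4-column iris data):

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier().fit(X, y)

# Save once, load anywhere (app, API, dashboard) without retraining
joblib.dump(model, "my_first_ml_model.pkl")
loaded_model = joblib.load("my_first_ml_model.pkl")

# Golden rule: live data must have the same columns, same order and
# same preprocessing as the training data (here: 4 iris measurements)
live_row = [[5.1, 3.5, 1.4, 0.2]]
print(loaded_model.predict(live_row))
```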
41
Q

What are **Parameters** in machine learning?

A
  • Learned from data
  • Examples: tree splits, weights, coefficients

Parameters are the internal variables that the model learns during training.

42
Q

What are **Hyperparameters** in machine learning?

A
  • Set by YOU
  • Examples: n_estimators, max_depth

Hyperparameters are external configurations that govern the training process.

43
Q

What does **GridSearchCV** do?

A

Tunes hyperparameters

It automates the process of finding the best hyperparameter combinations using cross-validation.

44
Q

Before using GridSearchCV, what was the common issue with training models?

A

Trained one model with default settings

There was uncertainty about whether the default settings were optimal.

45
Q

After implementing GridSearchCV, how do you train models?

A
  • Train many models
  • Different hyperparameter combinations
  • Using cross-validation

This approach allows for a systematic evaluation of model performance.

46
Q

What is the **test set** in machine learning?

A

NEVER touched during tuning

The test set is used solely for evaluating the final model's performance.

47
Q

What is the **workflow** when using GridSearchCV?

A
  • Split data
  • Define pipeline
  • Define hyperparameter grid
  • GridSearchCV (fit MANY models)
  • Select best model
  • Evaluate on test set

This workflow includes an additional step for hyperparameter tuning.

48
Q

What is the purpose of a **pipeline** in machine learning?

A
  • Ensures no data leakage
  • Makes GridSearchCV possible
  • Treats preprocessing + model as ONE object

Pipelines streamline the process and maintain data integrity.

49
Q

What does the **param_grid** define in GridSearchCV?

A

Hyperparameter values to try

It specifies the hyperparameters and their respective values for tuning.

50
Q

What does the **cv** parameter represent in GridSearchCV?

A

Number of cross-validation folds

It determines how many subsets the training data will be divided into for validation.

51
Q

What happens when you call **grid.fit(X_train, y_train)**?

A
  • Trains models
  • Validates models
  • Scores models
  • Averages results

This process involves multiple iterations based on the defined hyperparameters and cross-validation.

52
Q

What does **grid.cv_results_** provide?

A

A dictionary of results

It contains information about hyperparameters tried, mean validation scores, and rankings.

53
Q

How do you obtain the **best hyperparameters** from GridSearchCV?

A

grid.best_params_

This returns the hyperparameter combination that yielded the best performance.

54
Q

What does **grid.best_estimator_** return?

A

The entire trained pipeline

This includes the model with the best hyperparameters, ready for deployment or evaluation.

55
Q

What does a big train–test performance gap indicate?

A

Overfitting

This suggests that the model learned the training data very well but generalizes poorly to unseen data.

56
Q

True or false: GridSearchCV automates hyperparameter tuning using cross-validation.

A

TRUE

It simplifies the process of optimizing model performance through systematic evaluation.
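The GridSearchCV workflow the cards above describe can be sketched end to end (a minimal example on the iris data; the step names "scaler"/"model" and the grid values are illustrative choices, not the notebook's):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)

# Pipeline: preprocessing + model as ONE object (prevents data leakage)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=101)),
])

# param_grid: hyperparameter values to try, addressed as <step>__<param>
param_grid = {
    "model__n_estimators": [10, 50],
    "model__max_depth": [2, None],
}

grid = GridSearchCV(pipe, param_grid, cv=5)  # cv = number of CV folds
grid.fit(X_train, y_train)  # trains, validates, scores, averages

print(grid.best_params_)               # best hyperparameter combination
best_pipeline = grid.best_estimator_   # entire trained pipeline
print(best_pipeline.score(X_test, y_test))  # test set: only touched now
```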
57
Q

What is the objective of using **GridSearchCV** in this context?

A

Use GridSearchCV with a custom scoring metric to optimise a binary classification model based on business priorities, not just accuracy

Focus on recall for malignant cancer.

58
Q

What are the key **differences** between regression and classification metrics?

A
  • Regression: R², MAE, MSE; Classification: accuracy, recall, precision, F1
  • Regression: one output; Classification: multiple classes
  • Regression: no class priority; Classification: class priority matters

In classification, you must decide what matters more.

59
Q

In the business context, what does **Class 0** represent?

A

malignant

Class 1 represents benign.

60
Q

What is the client decision regarding **malignant tumors**?

A

Missing a malignant tumour is worse than falsely flagging a benign one

This decision drives the optimization for recall on class 0.

61
Q

What is the first step in the **workflow** for using GridSearchCV?

A

Load & split data

Use the breast cancer dataset for binary classification.

62
Q

What does the **pipeline** in GridSearchCV help prevent?

A

Data leakage

It keeps preprocessing and model together.

63
Q

Why is **accuracy** not enough as a scoring metric in classification?

A

Accuracy can lie

A model could predict 'benign' always and still achieve high accuracy while failing to identify malignant cases.

64
Q

What is the purpose of using **make_scorer** in GridSearchCV?

A

To define a custom scoring metric focused on recall for malignant cases

Example: scoring = make_scorer(recall_score, pos_label=0).

65
Q

What happens when you run **GridSearchCV** with the defined parameters?

A

Multiple models are trained and recall is computed for malignant cases

Scores are averaged per hyperparameter.

66
Q

What does the output of GridSearchCV provide regarding **n_estimators**?

A

Mean recall scores for different n_estimators

Example output: n_estimators=50 → mean ≈ 0.86.

67
Q

How do you obtain the **best model** from GridSearchCV?

A

pipeline = grid.best_estimator_

This gives you the full pipeline with the best hyperparameters.

68
Q

What should you evaluate after obtaining the best model?

A

Confusion matrix + report

Check recall for malignant and precision trade-off.

69
Q

What defines **model success** in this context?

A

Business rules, not code

Example: If client threshold ≥ 90% → acceptable.

70
Q

What is the final mental model for **GridSearchCV** in classification?

A

Same workflow as regression, but scoring metric defines what 'best' means

High recall often lowers precision (trade-off).

71
Q

What are some **real-life applications** of this approach?

A
  • Build medical ML systems
  • Detect fraud
  • Catch churn
  • Flag risky users

Focus on optimizing decisions rather than just training models.
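Putting the custom-scoring cards together, a hedged sketch of tuning for recall on the malignant class (class 0) of the breast cancer dataset; the grid values and the RandomForest choice are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Binary target: class 0 = malignant, class 1 = benign
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)

# Pipeline keeps preprocessing and model together (no data leakage)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=101)),
])

# Business priority: don't miss malignant tumours, so optimise recall
# for class 0 instead of plain accuracy
scoring = make_scorer(recall_score, pos_label=0)

grid = GridSearchCV(pipe, {"model__n_estimators": [10, 50]},
                    cv=5, scoring=scoring)
grid.fit(X_train, y_train)

pipeline = grid.best_estimator_
malignant_recall = recall_score(y_test, pipeline.predict(X_test), pos_label=0)
print(grid.best_params_)
print("malignant recall:", malignant_recall)
```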
72
Q

What is the first step in the **workflow** for binary classification?

A

Split data

This initiates the process of preparing data for model training and evaluation.

73
Q

What is the structure of a **pipeline** in the workflow?

A
  • Scaling
  • Feature selection
  • Model (e.g., RandomForestClassifier)

The pipeline organizes the steps for data processing and model training.

74
Q

What is included in the **hyperparameter grid**?

A
  • Parameters to tune
  • Example: n_estimators = [10, 20]

Hyperparameter tuning is essential for optimizing model performance.

75
Q

How does scoring differ between **binary** and **multiclass** classification?

A
  • Binary: make_scorer(metric, pos_label=…)
  • Multiclass: make_scorer(metric, labels=[class_of_interest], average=None)

Different scoring methods are required to evaluate models based on the type of classification.

76
Q

In the context of **GridSearchCV**, what does the workflow involve?

A
  • Train multiple CV folds for each hyperparameter combination
  • Compute metric for the target class
  • Pick combination with highest score

This process ensures thorough evaluation of hyperparameter settings.

77
Q

What do **grid.best_params_** and **grid.best_estimator_** represent?

A
  • grid.best_params_: best hyperparameters
  • grid.best_estimator_: trained best pipeline

These attributes provide access to the optimal settings and model after grid search.

78
Q

What is crucial for evaluating results in terms of business requirements?

A
  • Precision
  • Recall
  • F1

The chosen metric should align with the client's specific needs for a particular class.
79
Q

What does **PCA** stand for?

A

Principal Component Analysis

PCA is a technique used to reduce the dimensionality of data while preserving as much variance as possible.

80
Q

What is the **big idea** behind PCA?

A

Squish many features into fewer, smarter features while keeping as much important information (variance) as possible

This allows for a more manageable dataset while retaining essential information.

81
Q

What is the trade-off when using PCA?

A
  • Fewer dimensions
  • Faster models
  • Better visuals
  • Components are not directly interpretable

While PCA simplifies data, it can make interpretation of components challenging.

82
Q

When should you **use PCA**?

A
  • Lots of numeric features
  • Features are correlated
  • Visualise high-dimensional data
  • Speed up models
  • Reduce noise
  • Help clustering or classification

PCA is beneficial in scenarios where data complexity needs to be managed.

83
Q

When should you **avoid PCA**?

A
  • Must explain features to stakeholders
  • Feature meaning matters more than performance

In cases where interpretability is crucial, PCA may not be suitable.

84
Q

What is the first step in the PCA process outlined in the notebook?

A

Load data

The dataset used is the breast cancer dataset with 30 numeric features.

85
Q

What is the target variable in the breast cancer dataset?

A
  • 0 = malignant
  • 1 = benign

The target variable indicates the diagnosis of the cancer.

86
Q

Why is it important to **clean and scale** the data before applying PCA?

A

PCA is distance-based; different scales can break PCA

Cleaning and scaling ensure that PCA functions correctly.

87
Q

What is the result of cleaning and scaling the breast cancer dataset?

A

341 rows × 30 features → NumPy array

This prepares the data for PCA application.

88
Q

What is the key result when testing PCA with 30 components?

A
  • Component 0 → 43.7%
  • Component 1 → 18.5%
  • Component 2 → 10.3%

The first three components explain approximately 72% of the information.

89
Q

What is the rule of thumb for deciding how many components to keep in PCA?

A

80–95% variance is usually enough

This guideline helps in determining the optimal number of components.

90
Q

What is the final dimensionality reduction result of the breast cancer dataset using PCA?

A

30 features → 7 components

This reduction retains 91.37% variance.

91
Q

What does the output of PCA look like after dimensionality reduction?

A

x_PCA shape: (341, 7)

Each row represents the same observation in a new coordinate system.

92
Q

What is the **visualisation magic** achieved after applying PCA?

A

Much clearer separation with just 2 numbers containing 62% of the data's info

PCA allows for better visual analysis of data.

93
Q

What is the mental model to remember about PCA?

A

PCA = rotate + compress

Components represent directions of maximum variation, and explained variance indicates usefulness.

94
Q

What are the amazing benefits of PCA?

A
  • Reduces dimensionality
  • Requires scaled numeric data
  • Outputs components, not features
  • Useful for visualisation, preprocessing, noise reduction, ML performance

PCA is a powerful tool in data analysis and machine learning.
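The PCA recipe from the cards above (scale → inspect explained variance → keep ~90%) can be sketched like this; note it uses the full 569-row scikit-learn breast cancer dataset, whereas the notebook's cleaned version had 341 rows:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # 30 numeric features

# PCA is distance-based, so scale the features first
X_scaled = StandardScaler().fit_transform(X)

# Fit with all components to inspect explained variance per component
pca = PCA(n_components=30).fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Rule of thumb: keep enough components for roughly 80-95% variance
n_keep = int(np.argmax(cumulative >= 0.90)) + 1
print(n_keep, "components explain", round(cumulative[n_keep - 1], 4))

# Rotate + compress: same observations in a new coordinate system
x_pca = PCA(n_components=n_keep).fit_transform(X_scaled)
print("x_pca shape:", x_pca.shape)  # 30 features -> n_keep components
```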
95
Q

In **NLP**, what is the key shift compared to normal ML?

A

Features = text

Text is not numbers; ML models only understand numbers.

96
Q

What is the main goal of **NLP**?

A

Turning messy human language into clean numeric features

Everything else (pipelines, CV, GridSearch) stays familiar.

97
Q

What type of classifier is being built in this notebook?

A

A spam classifier for SMS messages

Input: 'WIN a free ticket now!!!' Output: spam or ham.

98
Q

What are the steps in the **NLP workflow**?

A
  • Load data
  • Split train/test
  • Clean text
  • Convert text → numbers
  • Train models
  • Compare performance

The magic happens in steps 3 & 4.

99
Q

What are the columns in the **dataset**?

A
  • message → feature
  • label → target (spam/ham)

After sampling: smaller dataset for faster learning.

100
Q

In the **train/test split**, what do X_train and y_train represent?

A
  • X_train → messages
  • y_train → labels

Same logic as usual. No surprises here.

101
Q

Why is **text cleaning** necessary in NLP?

A

Raw text is chaotic

Issues include uppercase/lowercase differences, punctuation noise, and inconsistent formatting.

102
Q

What does the custom transformer **text_cleaning** do?

A
  • Lowercases text
  • Removes punctuation

Normalizes language so the model doesn't panic.

103
Q

What is the core NLP trick for converting text to numbers?

A

Feature extraction

This includes CountVectorizer and TF-IDF Transformer.

104
Q

What does **CountVectorizer** do?

A
  • Tokenises text into words
  • Counts word frequency
  • Removes English stop words

Example: 'I love free pizza' → ['love', 'free', 'pizza'] → [1, 1, 1].

105
Q

What is the purpose of the **TF-IDF Transformer**?

A

Penalises common words and boosts rare, meaningful words

Makes spam words pop (free, win, cash).

106
Q

Why does **TF-IDF** beat raw counts?

A

'the' ≠ useful; 'win' = very useful

It helps identify important words in messages.

107
Q

What are the steps in the **full NLP pipeline**?

A
  • Text cleaning
  • Tokenisation (CountVectorizer)
  • TF-IDF weighting
  • Model

This pipeline allows passing raw text to get predictions without manual preprocessing.

108
Q

Which models were tested in the notebook?

A
  • SGDClassifier
  • LinearSVC

These models are designed for high-dimensional sparse data and work well with TF-IDF.

109
Q

Why are tree models not used in NLP?

A

Trees hate sparse text matrices

This makes them less effective for text classification.

110
Q

What does **HyperparameterOptimizationSearch** do?

A

Builds multiple pipelines and runs GridSearchCV

Here, only default hyperparameters are tested for model comparison.

111
Q

What was observed in the **results interpretation**?

A
  • SGDClassifier performed best
  • LinearSVC very close behind

Both are linear models that love TF-IDF features.

112
Q

What is the core takeaway regarding **NLP success**?

A

NLP success depends more on feature extraction than fancy models

Key tools include CountVectorizer and TfidfTransformer.
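The full NLP pipeline from the cards above can be sketched in a few lines; the eight-message corpus below is a toy stand-in for the SMS dataset, and lowercasing is delegated to CountVectorizer rather than a custom text_cleaning transformer:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Tiny toy corpus standing in for the SMS dataset
messages = [
    "win a free ticket now", "free cash prize win now",
    "claim your free prize", "win cash now",
    "are we still on for lunch", "see you at the meeting",
    "can you call me later", "lunch at noon works for me",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Raw text in -> prediction out: tokenising + counting (CountVectorizer),
# TF-IDF weighting, then a linear model that likes sparse features
pipe = Pipeline([
    ("counts", CountVectorizer(lowercase=True, stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("model", SGDClassifier(random_state=101)),
])
pipe.fit(messages, labels)

print(pipe.predict(["WIN a free ticket now!!!"]))  # predicts 'spam'
```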