How to handle imbalanced datasets
1. Change the performance metric
As we saw above, accuracy is not the best metric to use when evaluating imbalanced datasets as it can be very misleading. Metrics that can provide better insight include:
Confusion Matrix: a table showing correct predictions and types of incorrect predictions.
Precision: the number of true positives divided by all positive predictions. Precision is also called Positive Predictive Value. It is a measure of a classifier’s exactness. Low precision indicates a high number of false positives.
Recall: the number of true positives divided by the number of positive values in the test data. Recall is also called Sensitivity or the True Positive Rate. It is a measure of a classifier’s completeness. Low recall indicates a high number of false negatives.
F1 Score: the harmonic mean of precision and recall, combining both into a single metric.
Let’s see what happens when we apply the F1 and recall scores to our logistic regression from above.
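Before applying these metrics to a real model, it helps to see how they relate. A minimal sketch, using hypothetical confusion-matrix counts rather than our actual model's output:

```python
# Hypothetical confusion-matrix counts for an imbalanced problem:
# the model finds 30 of 50 actual positives and raises 10 false alarms.
tp, fp, fn = 30, 10, 20

precision = tp / (tp + fp)   # exactness: 30 / 40 = 0.75
recall = tp / (tp + fn)      # completeness: 30 / 50 = 0.60

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
```

With many false negatives (low recall), F1 drops even though precision looks healthy, which is exactly the behavior that makes it more informative than accuracy here.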
2. Change the algorithm
While it’s a good rule of thumb in every machine learning problem to try a variety of algorithms, doing so can be especially beneficial with imbalanced datasets. Decision trees frequently perform well on imbalanced data. They work by learning a hierarchy of if/else questions, and this can force both classes to be addressed.
3. Resampling Techniques — Oversample minority class
Our next few methods are resampling techniques.
Oversampling can be defined as adding more copies of the minority class. Oversampling can be a good choice when you don’t have a ton of data to work with.
We will use the resampling module from Scikit-Learn to randomly replicate samples from the minority class.
Important Note
Always split into test and train sets BEFORE trying oversampling techniques! Oversampling before splitting the data can allow the exact same observations to be present in both the test and train sets. This can allow our model to simply memorize specific data points and cause overfitting and poor generalization to the test data.
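A minimal sketch of random oversampling with scikit-learn's resample utility, on a hypothetical toy dataset (8 majority rows, 2 minority rows), applied after the train/test split as the note above requires:

```python
from sklearn.utils import resample

# hypothetical training data: label 0 is the majority class, label 1 the minority
majority = [(x / 10, 0) for x in range(8)]   # 8 majority-class rows
minority = [(0.9, 1), (1.0, 1)]              # 2 minority-class rows

# randomly replicate minority rows (sampling WITH replacement)
# until the classes are balanced
minority_upsampled = resample(minority,
                              replace=True,
                              n_samples=len(majority),
                              random_state=42)

balanced = majority + list(minority_upsampled)
```

Every row in `minority_upsampled` is a copy of one of the original minority rows; no new information is created, the minority class just gets more weight.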
4. Resampling techniques — Undersample majority class
Undersampling can be defined as removing some observations of the majority class. Undersampling can be a good choice when you have a ton of data (think millions of rows). But a drawback is that we are removing information that may be valuable. This could lead to underfitting and poor generalization to the test set.
We will again use the resampling module from Scikit-Learn to randomly remove samples from the majority class.
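A minimal sketch of random undersampling on the same kind of hypothetical toy data; this uses the standard library's random.sample, which mirrors what scikit-learn's resample(..., replace=False) does:

```python
import random

random.seed(42)

majority = [(x / 10, 0) for x in range(8)]   # 8 majority-class rows
minority = [(0.9, 1), (1.0, 1)]              # 2 minority-class rows

# randomly drop majority rows (sampling WITHOUT replacement)
# until the classes are balanced
majority_downsampled = random.sample(majority, k=len(minority))

balanced = majority_downsampled + minority
```

The balanced set is much smaller than the original, which is why this option only makes sense when data is plentiful.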
5. Generate synthetic samples
A technique similar to upsampling is to create synthetic samples. Here we will use imblearn’s SMOTE, or Synthetic Minority Oversampling Technique. SMOTE uses a nearest-neighbors algorithm to generate new, synthetic samples we can use for training our model.
Again, it’s important to generate the new samples only in the training set to ensure our model generalizes well to unseen data.
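imblearn's SMOTE handles all of this for us, but the core idea is simple enough to sketch in a few lines: pick a minority sample, pick one of its minority-class neighbours, and interpolate a new point somewhere on the line segment between them. The data points and neighbour choice below are hypothetical:

```python
import random

random.seed(0)

# hypothetical minority-class points in 2-D feature space
minority = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8)]

def smote_like_sample(a, b):
    """Interpolate a synthetic point between two minority neighbours."""
    gap = random.random()  # position along the segment, in [0, 1)
    return tuple(ai + gap * (bi - ai) for ai, bi in zip(a, b))

# generate one synthetic sample between the first point and a neighbour
synthetic = smote_like_sample(minority[0], minority[1])
```

Each coordinate of the synthetic point lies between the corresponding coordinates of its two parents, so the new sample stays inside the minority region rather than being an exact copy.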
Other Methods
https://towardsdatascience.com/the-5-most-useful-techniques-to-handle-imbalanced-datasets-6cdba096d55a
AI vs. ML
What is artificial intelligence?
Artificial intelligence is a broad field, which refers to the use of technologies to build machines and computers that have the ability to mimic cognitive functions associated with human intelligence, such as being able to see, understand, and respond to spoken or written language, analyze data, make recommendations, and more.
Although artificial intelligence is often thought of as a system in itself, it is a set of technologies implemented in a system to enable it to reason, learn, and act to solve a complex problem.
What is machine learning?
Machine learning is a subset of artificial intelligence that automatically enables a machine or system to learn and improve from experience. Instead of explicit programming, machine learning uses algorithms to analyze large amounts of data, learn from the insights, and then make informed decisions.
Machine learning algorithms improve performance over time as they are trained—exposed to more data. Machine learning models are the output, or what the program learns from running an algorithm on training data. The more data used, the better the model will get.
Differences between AI and ML
Now that you understand how they are connected, what is the main difference between AI and ML?
While artificial intelligence encompasses the idea of a machine that can mimic human intelligence, machine learning does not. Machine learning aims to teach a machine how to perform a specific task and provide accurate results by identifying patterns.
Let’s say you ask your Google Nest device, “How long is my commute today?” In this case, you ask a machine a question and receive an answer about the estimated time it will take you to drive to your office. Here, the overall goal is for the device to perform a task successfully—a task that you would generally have to do yourself in a real-world environment (for example, research your commute time).
In the context of this example, the goal of using ML in the overall system is not to have it perform the task itself, but to learn from data. For instance, you might train algorithms to analyze live transit and traffic data to forecast the volume and density of traffic flow. However, the scope is limited to identifying patterns in the data, measuring how accurate the predictions are, and learning from that data to maximize performance on that specific task.
Explanatory Algorithms
One of the biggest problems in machine learning is understanding how various models get to their end predictions. We often know the “what” but struggle to explain the “why”.
Explanatory algorithms help us identify the variables that have a meaningful impact on the outcome we are interested in. These algorithms allow us to understand the relationships between the variables in the model, rather than just using the model to make predictions about the outcome.
Algorithms
Linear/Logistic Regression: a statistical method for modeling the linear relationship between a dependent variable and one or more independent variables. The relationships between variables can be understood from the coefficients and their t-tests.
Decision Trees: a type of machine learning algorithm that creates a tree-like model of decisions and their possible consequences. They are useful for understanding the relationships between variables by looking at the rules that split the branches.
Principal Component Analysis (PCA): a dimensionality reduction technique that projects the data onto a lower-dimensional space while retaining as much variance as possible. PCA can be used to simplify the data or to determine feature importance.
Local Interpretable Model-Agnostic Explanations (LIME): an algorithm that explains the predictions of any machine learning model by approximating the model locally around the prediction, constructing a simpler surrogate model using techniques such as linear regression or decision trees.
Shapley values: a method from cooperative game theory that explains the predictions of any machine learning model by computing each feature’s “marginal contribution” to the prediction. Exact Shapley values are accurate but expensive to compute.
SHapley Additive exPlanations (SHAP): a method that estimates the importance of each feature in a prediction by efficiently approximating Shapley values; it is generally much faster than computing them exactly.
Pattern Mining Algorithms
Pattern mining algorithms are data mining techniques used to identify patterns and relationships within a dataset. They can be used for a variety of purposes, such as identifying customer buying patterns in a retail context, understanding common user behaviour sequences for a website or app, or finding relationships between different variables in a scientific study.
Pattern mining algorithms typically work by analyzing large datasets and looking for repeated patterns or associations between variables. Once these patterns have been identified, they can be used to make predictions about future trends or outcomes or to understand the underlying relationships within the data.
Algorithms
Apriori algorithm: an algorithm for finding frequent item sets in a transactional database. It’s efficient and widely used for association rule mining tasks.
Recurrent Neural Network (RNN): a type of neural network designed to process sequential data, as it is able to capture temporal dependencies in the data.
Long Short-Term Memory (LSTM): a type of recurrent neural network designed to remember information for longer periods of time. LSTMs can capture longer-term dependencies in the data and are often used for tasks such as language translation and language generation.
Sequential Pattern Discovery Using Equivalence Classes (SPADE): a method for finding frequent patterns in sequential data by grouping together items that are equivalent in some sense. It can handle large datasets and is relatively efficient, but may not work well with sparse data.
PrefixSpan: an algorithm for finding frequent patterns in sequential data by constructing a prefix tree and pruning infrequent items. PrefixSpan can handle large datasets and is relatively efficient, but may not work well with sparse data.
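As a toy illustration of the first passes of Apriori (counting itemsets against a minimum support threshold, and only building candidate pairs from items that survived the first pass) on a small hypothetical transaction database:

```python
from collections import Counter
from itertools import combinations

# hypothetical retail baskets
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]
min_support = 2  # keep itemsets appearing in at least 2 transactions

# pass 1: frequent single items
item_counts = Counter(item for t in transactions for item in t)
frequent_items = {i for i, c in item_counts.items() if c >= min_support}

# pass 2: candidate pairs built only from frequent items (Apriori pruning)
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t & frequent_items), 2):
        pair_counts[pair] += 1
frequent_pairs = {p for p, c in pair_counts.items() if c >= min_support}
```

The pruning step is the heart of Apriori: a pair can only be frequent if both of its items are frequent on their own, so infrequent items never generate candidates.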
Ensemble Learning
Ensemble algorithms are machine learning techniques that combine the predictions of multiple models in order to make more accurate predictions than any of the individual models. There are several reasons why ensemble algorithms can outperform traditional machine learning algorithms:
Diversity: by combining the predictions of multiple models, ensemble algorithms can capture a wider range of patterns within the data.
Robustness: ensemble algorithms are generally less sensitive to noise and outliers in the data, which can lead to more stable and reliable predictions.
Reducing overfitting: by averaging the predictions of multiple models, ensemble algorithms can reduce the tendency of individual models to overfit the training data, which can lead to improved generalization to new data.
Improved accuracy: ensemble algorithms have been shown to consistently outperform traditional machine learning algorithms in a variety of contexts.
Algorithms
Random Forest: a machine learning algorithm that creates an ensemble of decision trees and makes predictions based on the majority vote of the trees.
XGBoost: a gradient boosting algorithm that uses decision trees as its base model and is known to be one of the strongest ML algorithms for predictions.
LightGBM: another gradient boosting algorithm, designed to be faster and more efficient than other boosting algorithms.
CatBoost: a gradient boosting algorithm that is specifically designed to handle categorical variables well.
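The core combining step, majority voting as used by a random forest classifier, can be sketched in a few lines over the predictions of three hypothetical weak learners:

```python
from collections import Counter

# hypothetical class predictions from three weak learners on five samples
preds_model_a = [1, 0, 1, 1, 0]
preds_model_b = [1, 1, 1, 0, 0]
preds_model_c = [0, 0, 1, 1, 0]

def majority_vote(*model_preds):
    """Combine per-sample predictions by taking the most common label."""
    return [Counter(sample).most_common(1)[0][0]
            for sample in zip(*model_preds)]

ensemble_preds = majority_vote(preds_model_a, preds_model_b, preds_model_c)
```

Even though each individual model disagrees with the others somewhere, the vote smooths out their individual mistakes, which is the diversity and robustness argument from above in miniature.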
Clustering
Clustering algorithms are an unsupervised learning task and are used to group data into “clusters”. In contrast to supervised learning, where the target variable is known, there is no target variable in clustering.
This technique is useful for finding natural patterns and trends in data and is often used during the exploratory data analysis phase to gain further understanding of the data. Additionally, clustering can be used to divide a dataset into distinct segments based on various variables. A common application of this is in segmenting customers or users.
Algorithms
K-modes clustering: a clustering algorithm that is specifically designed for categorical data. It handles high-dimensional categorical data well and is relatively simple to implement.
DBSCAN: a density-based clustering algorithm that can identify clusters of arbitrary shape. It is relatively robust to noise and can identify outliers in the data.
Spectral clustering: a clustering algorithm that uses the eigenvectors of a similarity matrix to group data points into clusters. It can handle non-linearly separable data and is relatively efficient.
What is Gradient Descent
Gradient Descent (GD) is a popular optimization algorithm used in machine learning to minimize the cost function of a model. It works by iteratively adjusting the weights or parameters of the model in the direction of the negative gradient of the cost function, until the minimum of the cost function is reached.
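A minimal sketch of the idea on a one-parameter cost function, J(w) = (w − 3)², whose gradient is 2(w − 3); the starting point and learning rate are arbitrary choices for illustration:

```python
def gradient(w):
    # derivative of the cost J(w) = (w - 3)**2
    return 2 * (w - 3)

w = 0.0             # initial parameter value
learning_rate = 0.1

for _ in range(100):
    w -= learning_rate * gradient(w)  # step opposite the gradient

# w converges toward the minimum of J at w = 3
```

Each iteration moves the parameter a fraction of the way toward the minimum; too large a learning rate would overshoot and diverge, too small a rate would converge slowly.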
What is the Curse of Dimensionality
As the dimensionality increases, the number of data points required for good performance of any machine learning algorithm increases exponentially. The reason is that the number of possible combinations of feature values grows exponentially with the number of features, and a model needs enough data points covering those combinations to be valid.
What is Cross-Validation
Cross-validation (CV) is a technique used to test the effectiveness of machine learning models; it is also a resampling procedure used to evaluate a model when we have limited data. To perform CV, we keep aside a sample or portion of the data that is not used to train the model, and later use that sample for testing and validation.
K-Folds Cross Validation
The K-folds technique is popular and easy to understand, and it generally results in a less biased model than other methods, because it ensures that every observation from the original dataset has a chance of appearing in both the training and test sets. It is one of the best approaches when we have limited input data. The method follows these steps.
1) Split the entire dataset randomly into K folds (K shouldn’t be too small or too large; ideally we choose 5 to 10 depending on the data size). A higher value of K leads to a less biased model (though large variance might lead to overfitting), whereas a lower value of K approaches the train-test split we saw before.
2) Fit the model using K-1 folds and validate it using the remaining Kth fold. Note down the score or error.
3) Repeat this process until every fold has served as the test set, then take the average of the recorded scores. That average is the performance metric for the model.
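The fold bookkeeping in the steps above can be sketched directly; this index logic mirrors what scikit-learn's KFold does (a real implementation would also shuffle the data first):

```python
def k_fold_indices(n_samples, k):
    """Split sample indices into k folds and yield (train, test) pairs."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test

# 10 samples, 5 folds: each fold of 2 samples serves as the test set once
splits = list(k_fold_indices(10, 5))
```

Across the 5 splits, every sample appears in exactly one test fold and in the training portion of the other four, which is what makes the averaged score an honest estimate.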
How to Prevent Overfitting
Early stopping
Early stopping pauses the training phase before the machine learning model learns the noise in the data. However, getting the timing right is important: stop too late and the model overfits; stop too early and it will still not give accurate results.
Pruning
You might identify several features or parameters that impact the final prediction when you build a model. Feature selection—or pruning—identifies the most important features within the training set and eliminates irrelevant ones. For example, to predict if an image is an animal or human, you can look at various input parameters like face shape, ear position, body structure, etc. You may prioritize face shape and ignore the shape of the eyes.
Regularization
Regularization is a collection of training/optimization techniques that seek to reduce overfitting. These methods try to eliminate those factors that do not impact the prediction outcomes by grading features based on importance. For example, mathematical calculations apply a penalty value to features with minimal impact. Consider a statistical model attempting to predict the housing prices of a city in 20 years. Regularization would give a lower penalty value to features like population growth and average annual income but a higher penalty value to the average annual temperature of the city.
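For a single-feature linear model with no intercept, the effect of an L2 (ridge) penalty can be seen in closed form: the ordinary least-squares slope Σxy/Σx² becomes Σxy/(Σx² + λ), so a larger penalty λ shrinks the coefficient toward zero. A toy sketch with hypothetical data:

```python
# hypothetical data with a roughly linear relationship y ≈ 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

def ridge_slope(xs, ys, lam):
    """Closed-form slope of y = w*x under the penalized loss
    sum((y - w*x)**2) + lam * w**2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

w_ols = ridge_slope(xs, ys, lam=0.0)     # no regularization
w_ridge = ridge_slope(xs, ys, lam=10.0)  # heavily regularized
```

The penalized coefficient is pulled toward zero relative to the unregularized fit; with many features, this shrinkage is what suppresses the low-impact ones.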
Ensembling
Ensembling combines predictions from several separate machine learning algorithms. Some models are called weak learners because their results are often inaccurate. Ensemble methods combine all the weak learners to get more accurate results. They use multiple models to analyze sample data and pick the most accurate outcomes. The two main ensemble methods are bagging and boosting. Boosting trains different machine learning models one after another to get the final result, while bagging trains them in parallel.
Data augmentation
Data augmentation is a machine learning technique that changes the sample data slightly every time the model processes it. You can do this by changing the input data in small ways. When done in moderation, data augmentation makes the training samples appear unique to the model and prevents it from memorizing their characteristics. Examples include applying transformations such as translation, flipping, and rotation to input images.
Bias Vs. Variance
What is bias?
Bias is the difference between the average prediction of our model and the correct value we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the problem. It always leads to high error on both training and test data.
What is variance?
Variance is the variability of a model’s prediction for a given data point; it tells us how spread out the predictions are. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn’t seen before. As a result, such models perform very well on training data but have high error rates on test data.
Can you explain the difference between deep learning and machine learning?
Deep learning is a subfield of machine learning that uses artificial neural networks with multiple layers to learn and represent complex patterns and relationships in data. In contrast, machine learning is a broader field that includes a variety of algorithms and techniques for training models to make predictions or decisions based on data.
The key difference between deep learning and traditional machine learning is the complexity and flexibility of the models.
Traditional machine learning algorithms, such as decision trees, support vector machines, and linear regression, typically rely on handcrafted features and are limited in their ability to handle large amounts of data or complex relationships between variables.
Deep learning, on the other hand, can automatically learn hierarchical representations of features from raw data, and can model highly nonlinear relationships between variables. Deep learning algorithms, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have been successfully applied to a wide range of tasks, such as image and speech recognition, natural language processing, and game playing.
In summary, deep learning is a subset of machine learning that uses deep neural networks with multiple layers to learn complex patterns in data. While traditional machine learning focuses on handcrafted features and simpler models, deep learning can automatically learn complex representations of data and is particularly effective in applications that require handling large amounts of data and complex relationships between variables.
Can you describe the steps involved in building a machine learning pipeline?
Building a machine learning pipeline involves several key steps, which can vary depending on the specific application and the data available. Here are some common steps involved in building a machine learning pipeline:
Data collection: gather and consolidate the raw data relevant to the problem.
Data cleaning and preprocessing: handle missing values, outliers, and inconsistencies, and transform features into a usable form.
Feature engineering: create, select, and scale the features the model will learn from.
Model selection and training: choose candidate algorithms and fit them to the training data.
Evaluation: measure performance on held-out data using appropriate metrics.
Deployment and monitoring: put the model into production and track its performance over time.
Overall, building a machine learning pipeline requires a combination of domain knowledge, technical skills, and experience with data analysis and modeling techniques. Effective machine learning pipelines require careful attention to each of these steps, as well as ongoing refinement and optimization over time.
How do you deal with categorical variables in a dataset when building a machine learning model
Categorical variables are variables that take on a finite number of discrete values or categories, such as colors, types of objects, or labels. Handling categorical variables is an important part of building a machine learning model, as these variables may provide important information for predicting the target variable. Here are some common techniques for dealing with categorical variables:
One-hot encoding: create a binary indicator column for each category.
Label or ordinal encoding: map each category to an integer, appropriate when the categories have a natural order.
Target encoding: replace each category with a statistic (such as the mean) of the target variable for that category.
Embeddings: learn dense vector representations of categories, useful for high-cardinality variables in neural networks.
Overall, the choice of technique for handling categorical variables depends on the specific problem and the characteristics of the data. A combination of techniques may be appropriate for different variables or subsets of the data. It is important to carefully preprocess the data to ensure that the categorical variables are properly represented and that the model can effectively capture the relationships between the variables and the target variable.
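One-hot encoding, one of the most common of these techniques, is easy to sketch without any libraries (in practice, pandas get_dummies or scikit-learn's OneHotEncoder do this); the column values below are hypothetical:

```python
def one_hot(values):
    """Map each categorical value to a binary indicator vector."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

colors = ["red", "green", "blue", "green"]
encoded = one_hot(colors)
# categories sorted alphabetically: ["blue", "green", "red"]
```

Each row now has exactly one 1, so the model sees the categories as independent binary features rather than as arbitrary integers with a spurious ordering.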
How do you choose the appropriate algorithm for a machine learning problem?
Choosing the appropriate algorithm for a machine learning problem is a crucial step in building an effective model. The choice of algorithm depends on several factors, including the nature of the data, the type of problem (classification, regression, clustering, etc.), the size of the dataset, the performance requirements, and the available computational resources. Here are some general guidelines for choosing an appropriate algorithm:
For labeled data, use supervised methods (classification or regression); for unlabeled data, consider unsupervised methods such as clustering or dimensionality reduction.
Start with simple, interpretable models (linear/logistic regression, decision trees) as baselines before moving to more complex ones.
For large datasets with complex, nonlinear relationships, consider ensemble methods or neural networks.
Weigh accuracy against interpretability, training time, and the computational resources available.
Overall, the choice of algorithm for a machine learning problem is a complex decision that requires careful consideration of multiple factors. It is important to choose an algorithm that is appropriate for the problem and that can effectively capture the relationships in the data.
How do you optimize the hyperparameters of a machine learning model?
Hyperparameter optimization is an important step in building a machine learning model, as it involves tuning the parameters of the model to achieve the best performance on the validation data. Here are some common approaches to hyperparameter optimization:
Grid search: exhaustively evaluate every combination of values in a predefined hyperparameter grid.
Random search: sample random combinations from the hyperparameter space, often more efficient than grid search when there are many hyperparameters.
Bayesian optimization: build a probabilistic model of the objective function and use it to choose the most promising hyperparameters to try next.
In general, the choice of hyperparameter optimization technique depends on the complexity of the hyperparameter space, the available computational resources, and the performance requirements of the model. It is often a good idea to combine multiple techniques and compare their performance on the validation data.
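The simplest of these approaches, grid search, can be sketched as a loop over every combination in a small grid, keeping the combination with the best validation score. The scoring function below is a hypothetical stand-in for training a real model and evaluating it on held-out data:

```python
from itertools import product

# hypothetical validation score as a function of two hyperparameters;
# in practice this would train a model and score it on validation data
def validation_score(learning_rate, max_depth):
    return 1.0 - abs(learning_rate - 0.1) - 0.01 * abs(max_depth - 5)

grid = {
    "learning_rate": [0.01, 0.1, 1.0],
    "max_depth": [3, 5, 7],
}

best_params, best_score = None, float("-inf")
for lr, depth in product(grid["learning_rate"], grid["max_depth"]):
    score = validation_score(lr, depth)
    if score > best_score:
        best_params = {"learning_rate": lr, "max_depth": depth}
        best_score = score
```

The cost is one full training run per grid cell, which is why random search and Bayesian optimization become attractive as the number of hyperparameters grows.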
How do you evaluate the performance of a machine learning model?
Evaluating the performance of a machine learning model is a crucial step in determining its effectiveness and suitability for the problem at hand. Here are some common techniques for evaluating model performance:
Train/test split: hold out a portion of the data and measure performance on it.
Cross-validation: repeatedly train and evaluate on different folds of the data for a more robust estimate.
Appropriate metrics: accuracy, precision, recall, F1, and ROC-AUC for classification; MSE, MAE, and R² for regression.
Overall, evaluating the performance of a machine learning model requires careful consideration of multiple factors, including the choice of evaluation technique, the choice of metric, and the performance requirements of the problem. It is often a good idea to use multiple techniques and metrics and compare their performance to get a better understanding of the model’s effectiveness.
What are some common challenges you might face when building a machine learning model, and how do you overcome them?
Building a machine learning model is a complex process that involves many challenges. Here are some common challenges that you might face when building a machine learning model, and how to overcome them:
Poor data quality: missing values, noise, and errors; address them with careful cleaning and preprocessing.
Imbalanced data: resample the data or choose metrics that account for the imbalance.
Overfitting: use regularization, cross-validation, or simpler models.
Limited computational resources: subsample the data, simplify the model, or use more efficient algorithms.
Overall, building a machine learning model requires careful consideration of many factors, including data quality, model complexity, regularization, hyperparameter tuning, and computational resources. By understanding these challenges and how to overcome them, you can build effective machine learning models that are accurate, reliable, and generalizable to new data.
Can you explain the difference between batch and online learning?
Batch learning and online learning are two different approaches to building machine learning models. Here’s how they differ:
Batch Learning:
In batch learning, the machine learning algorithm is trained on a fixed dataset, also known as a training set. The training process is performed offline, which means the algorithm processes the entire dataset at once, and the model parameters are updated based on the gradients calculated over the entire dataset. Batch learning is commonly used for supervised learning tasks, such as classification and regression, where the dataset is relatively small and can fit into memory. Once the model is trained, it can be used to make predictions on new data.
Online Learning:
In online learning, the machine learning algorithm is trained on data that arrives in a continuous stream, with no fixed dataset. The model parameters are updated in real-time as new data becomes available. This means the algorithm learns incrementally, adjusting the model parameters each time a new data point is processed. Online learning is commonly used in scenarios where the data is too large to fit into memory or when the data is constantly changing. It is often used for applications such as online advertising, fraud detection, and recommendation systems.
The key differences between batch and online learning are how the data is presented (a fixed dataset versus a continuous stream), when the parameters are updated (after processing the full dataset versus incrementally per example), and the memory footprint required.
Both batch and online learning have their advantages and disadvantages, and the choice of approach depends on the specific problem and the available resources. Batch learning is typically more stable and accurate but requires a fixed dataset and enough memory to hold it. Online learning is more flexible and can handle dynamic, large-scale data, but its incremental updates are noisier and require careful tuning and monitoring as the data evolves.
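The online flavour can be sketched as a perceptron whose weights are updated one example at a time as the stream arrives (scikit-learn's SGDClassifier offers the same incremental pattern via partial_fit). The streamed data below is hypothetical, labeled 1 when the two features sum to more than 1:

```python
# hypothetical stream of (features, label) pairs
stream = [((0.9, 0.9), 1), ((0.1, 0.2), 0), ((0.8, 0.7), 1),
          ((0.2, 0.1), 0), ((0.7, 0.9), 1), ((0.0, 0.3), 0)]

w = [0.0, 0.0]
b = 0.0
learning_rate = 0.5

def predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# online learning: one update per arriving example, no fixed dataset
for x, y in stream:
    error = y - predict(x)          # 0 when correct, +/-1 when wrong
    w[0] += learning_rate * error * x[0]
    w[1] += learning_rate * error * x[1]
    b += learning_rate * error
```

Nothing is ever revisited: each example adjusts the weights once and is discarded, which is what lets online learners run on streams too large to hold in memory.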
Can you explain the difference between batch gradient descent, stochastic gradient descent, and mini-batch gradient descent, and when to use each?
Batch gradient descent, stochastic gradient descent, and mini-batch gradient descent are different optimization algorithms used in machine learning for updating the parameters of a model during training. Here’s how they differ:
When to use each:
- Batch gradient descent computes the gradient over the entire training set before each parameter update. It is suitable for small datasets where the entire dataset can fit into memory and computational resources are not a problem, and is useful when the training process requires high precision.
- Stochastic gradient descent updates the parameters after each individual training example. It is suitable for very large datasets and online settings, though its noisy updates make convergence less smooth.
- Mini-batch gradient descent updates the parameters after each small batch of examples, combining the efficiency of batch updates with the robustness of stochastic updates. It is the default choice in most deep learning applications.
In summary, the choice of gradient descent algorithm depends on the specific problem, the size of the dataset, and the available computational resources.
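The three variants differ only in how many examples feed each parameter update, so they can be sketched with a single loop whose batch size varies; the toy task here is fitting y = w·x to hypothetical data following y = 2x:

```python
# hypothetical data following y = 2x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

def fit(batch_size, epochs=200, lr=0.01):
    """Gradient descent on J(w) = mean((w*x - y)**2), varying batch size."""
    w = 0.0
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # gradient of the squared error, averaged over the batch
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

w_batch = fit(batch_size=len(data))  # batch GD: whole dataset per update
w_sgd = fit(batch_size=1)            # stochastic GD: one example per update
w_mini = fit(batch_size=2)           # mini-batch GD: the usual compromise
```

All three converge toward w = 2 on this clean toy problem; on noisy real data the stochastic variants trade per-update accuracy for many more, cheaper updates.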
Convolutional Neural Network
A Convolutional Neural Network, or CNN, is built for processing spatial data, especially images. Unlike a fully connected network, it starts with convolutional layers that apply small filters, such as 3×3 or 5×5 kernels, across the input to detect local features like edges, textures, or shapes. Each filter produces a feature map that highlights where certain patterns occur in the image. Pooling layers, such as max pooling, are then used to downsample these feature maps—this reduces computation, helps prevent overfitting, and makes the model more robust to small shifts or distortions. Toward the end of the network, the feature maps are flattened and passed into fully connected layers, which combine the extracted features to make the final classification or prediction. CNNs are widely used in image classification, object detection, and OCR.
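The convolution and pooling steps can be sketched in plain Python on a tiny grayscale image: slide a small filter across the input to build a feature map, then max-pool it. The image and the 2×2 filter values below are hypothetical (and the sliding-window operation shown is the cross-correlation CNN layers actually compute):

```python
# hypothetical 4x4 grayscale image with a vertical edge down the middle
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [[-1, 1],
          [-1, 1]]  # responds strongly to left-to-right brightness jumps

def convolve(img, k):
    """'Valid' convolution: apply the kernel at every position it fits."""
    kh, kw = len(k), len(k[0])
    out = []
    for i in range(len(img) - kh + 1):
        row = []
        for j in range(len(img[0]) - kw + 1):
            row.append(sum(k[a][b] * img[i + a][j + b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

def max_pool(fm, size=2):
    """Downsample by keeping the max of each size x size tile."""
    return [[max(fm[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fm[0]) - size + 1, size)]
            for i in range(0, len(fm) - size + 1, size)]

feature_map = convolve(image, kernel)  # 3x3 map, peaks where the edge is
pooled = max_pool(feature_map)         # keeps the strongest response
```

The feature map lights up only in the column where the edge occurs, and pooling then keeps that strong response while shrinking the map, which is the robustness-to-small-shifts property described above.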
Recurrent Neural Network
A Recurrent Neural Network, or RNN, is a type of neural network specifically designed for sequential or time-dependent data, such as text, speech, or time series. Unlike feedforward networks that process inputs independently, RNNs process data one step at a time, passing information from one step to the next through a hidden state. At each time step, the RNN takes the current input and the previous hidden state to produce a new hidden state, allowing it to “remember” context from earlier in the sequence.
However, basic RNNs struggle with long-term dependencies due to vanishing or exploding gradients during training. To address this, more advanced architectures like LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units) use gating mechanisms to control what information is kept, forgotten, or updated.
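The recurrence itself is compact enough to sketch as a single-unit RNN with scalar weights (all values hypothetical and untrained): each step mixes the current input with the previous hidden state.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One recurrent step: new hidden state from input and previous state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

# process a short sequence, carrying the hidden state forward
h = 0.0
for x in [1.0, 0.0, -1.0]:
    h = rnn_step(x, h)
# h now summarizes the whole sequence, not just the last input
```

Because each state is squashed through tanh and multiplied by the same recurrent weight at every step, gradients flowing back through many steps shrink or blow up, which is exactly the vanishing/exploding gradient problem noted above.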
Long Short-Term Memory Neural Network
An LSTM, or Long Short-Term Memory network, is a variant of a Recurrent Neural Network that’s designed to capture long-term dependencies in sequential data. While basic RNNs pass a single hidden state between time steps, LSTMs introduce two key components: a cell state (which acts as the memory of the network) and gates (which control what information to keep, forget, or output).
Forget Gate: Decides what information from the previous cell state to remove.
Input Gate: Determines what new information to store in the cell state.
Output Gate: Controls what part of the cell state to output as the hidden state for the next time step.
This gating mechanism allows LSTMs to maintain relevant information for many time steps, mitigating the vanishing gradient problem common in standard RNNs. They are widely used for tasks where context over long sequences matters, such as machine translation, speech recognition, text generation, and long-term time series prediction.
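The three gates can be sketched as a single scalar LSTM cell step; all the weights below are hypothetical and untrained, chosen only to show the data flow:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev):
    """One LSTM step with toy scalar weights for each gate."""
    f = sigmoid(0.5 * x + 0.5 * h_prev)    # forget gate: keep old memory?
    i = sigmoid(0.6 * x + 0.4 * h_prev)    # input gate: admit new info?
    g = math.tanh(0.9 * x + 0.1 * h_prev)  # candidate cell content
    o = sigmoid(0.7 * x + 0.3 * h_prev)    # output gate: expose memory?
    c = f * c_prev + i * g                 # new cell state (the memory)
    h = o * math.tanh(c)                   # new hidden state
    return h, c

h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:
    h, c = lstm_step(x, h, c)
```

The key line is the cell-state update c = f*c_prev + i*g: because old memory passes through a multiplicative gate rather than a repeated squashing, gradients along the cell state survive many more time steps than in a plain RNN.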