A/B Testing
A/B testing is statistical hypothesis testing for a randomized experiment with two variants, A and B. It can be used to compare two models that use different predictor variables in order to check which one performs better on a given sample of data.
Consider a scenario where you’ve created two models (using different predictor variables) that can be used to recommend products for an e-commerce platform.
A/B Testing can be used to compare these two models to check which one best recommends products to a customer.
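As a rough illustration, the comparison can be framed as a two-proportion z-test on conversion counts. The sketch below assumes each model was shown to its own random group of customers and that we count how many of them bought a recommended product. The counts and the function name two_proportion_z_test are made up for illustration and are not part of these notes.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: purchases out of customers shown each model.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference between the two models is unlikely to be due to chance alone. The significance threshold (often 0.05) is a choice made before the experiment.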
Bagging vs Boosting

Classification vs Regression
Classification:
Regression:

Cluster Sampling
Cluster sampling is the process of randomly selecting intact groups (clusters) from a defined population, where the groups share similar characteristics.
A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements.
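A minimal sketch of the idea, assuming the population is already grouped into named clusters. The school data and the helper cluster_sample are hypothetical. Whole clusters are drawn at random, and every element of a chosen cluster enters the sample.

```python
import random


def cluster_sample(clusters, k, seed=None):
    """Pick k whole clusters at random and return every element in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), k)
    sample = [element for name in chosen for element in clusters[name]]
    return chosen, sample


# Hypothetical population: students grouped by school.
schools = {
    "school_a": ["a1", "a2", "a3"],
    "school_b": ["b1", "b2"],
    "school_c": ["c1", "c2", "c3", "c4"],
    "school_d": ["d1", "d2"],
}
chosen, sample = cluster_sample(schools, k=2, seed=0)
print(chosen, sample)
```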
Collinearity and Multicollinearity
Collinearity occurs when two predictor variables (e.g., x1 and x2) in a multiple regression are correlated with each other.
Multicollinearity occurs when more than two predictor variables (e.g., x1, x2, and x3) are inter-correlated.
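A quick first check for collinearity is the pairwise Pearson correlation between predictors. The sketch below uses made-up values for x1, x2, and x3, and the helper pearson is written out by hand so that nothing outside the standard library is assumed.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical predictors: x2 is nearly a multiple of x1, x3 is unrelated.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]
x3 = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]

print(f"corr(x1, x2) = {pearson(x1, x2):.3f}")  # close to 1: collinear
print(f"corr(x1, x3) = {pearson(x1, x3):.3f}")  # close to 0
```

Pairwise correlation is only a first check. Multicollinearity among three or more predictors can exist even when no single pair is strongly correlated, which is why measures such as the variance inflation factor are often used as well.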
Confusion Matrix
A confusion matrix, or error matrix, is a table that summarizes the performance of a classification algorithm by counting its predictions against the actual labels.
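For a binary classifier the table has four cells. The sketch below counts them from two label lists. The labels are invented for illustration and 1 is assumed to be the positive class.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Count the four cells of a binary confusion matrix."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical labels.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
cells = confusion_matrix(actual, predicted)
print(cells)
```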

Gini Impurity vs Entropy in a Decision Tree
Gini impurity is the probability that a randomly chosen sample would be classified incorrectly if you labeled it at random according to the distribution of labels in the branch.
Entropy measures the lack of information (uncertainty) in a set of labels. By making a split you can calculate the Information Gain, which is the difference between the entropy before and after the split. This measure helps to reduce the uncertainty about the output label.
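Both measures can be computed directly from the label counts in a branch. The sketch below uses the standard formulas: Gini impurity as 1 minus the sum of squared class proportions, and entropy as the sum of p * log2(1/p) over the classes. The example labels are made up.

```python
from collections import Counter
from math import log2


def gini_impurity(labels):
    """1 - sum(p_i^2) over the class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """sum(p_i * log2(1 / p_i)) over the class proportions p_i, in bits."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())


mixed = ["yes", "yes", "no", "no"]   # evenly mixed branch
pure = ["yes", "yes", "yes", "yes"]  # pure branch

print(gini_impurity(mixed), entropy(mixed))
print(gini_impurity(pure), entropy(pure))
```

Both measures are zero for a pure branch and largest when the classes are evenly mixed.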
How a Decision Tree node is split
Entropy vs Information Gain
Entropy is an indicator of how messy (mixed) the class labels in your data are. It tends to decrease as you move closer to the leaf nodes.
Information Gain is the decrease in entropy after a dataset is split on an attribute. At each node, the split with the highest Information Gain is preferred.
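Information Gain can be sketched as the parent's entropy minus the size-weighted average entropy of the child branches. The labels and the two candidate splits below are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5
perfect_split = [["yes"] * 5, ["no"] * 5]
poor_split = [
    ["yes", "yes", "yes", "no", "no"],
    ["yes", "yes", "no", "no", "no"],
]

print(f"perfect split: {information_gain(parent, perfect_split):.3f}")
print(f"poor split:    {information_gain(parent, poor_split):.3f}")
```

A split that separates the classes cleanly removes all the uncertainty in the parent, while a split that leaves both branches mixed gains almost nothing.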
Eigenvectors and Eigenvalues
Eigenvectors: Vectors whose direction remains unchanged (they are only scaled, or flipped when the eigenvalue is negative) when a linear transformation is applied to them.
Eigenvalues: The eigenvalue is the scalar by which the transformation scales the corresponding eigenvector.
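The defining relation is A v = lambda v. The sketch below checks it by hand for a small 2x2 matrix chosen for illustration. The helper mat_vec is a plain matrix-vector product, so no linear algebra library is assumed.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


A = [[2.0, 1.0],
     [1.0, 2.0]]

v1 = [1.0, 1.0]   # eigenvector of A with eigenvalue 3
v2 = [1.0, -1.0]  # eigenvector of A with eigenvalue 1
w = [1.0, 0.0]    # not an eigenvector: A changes its direction

print(mat_vec(A, v1))  # same direction as v1, scaled by 3
print(mat_vec(A, v2))  # same direction as v2, scaled by 1
print(mat_vec(A, w))   # points in a different direction than w
```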
Ensemble learning
Ensemble learning is a technique that is used to create multiple Machine Learning models, which are then combined to produce more accurate results. A general Machine Learning model is built by using the entire training data set.
However, in Ensemble Learning the training data set is split into multiple subsets, wherein each subset is used to build a separate model. After the models are trained, they are then combined to predict an outcome in such a way that the variance in the output is reduced.
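One common way to do this is bagging, where each model is trained on a bootstrap sample (drawn with replacement) rather than a disjoint subset, and the models then vote. The sketch below assumes a toy 1-D data set and a very simple base model (a threshold "stump"). The names train_stump and bagged_predict are hypothetical and only meant to show the shape of the technique.

```python
import random
from collections import Counter


def train_stump(points):
    """Fit a 1-D threshold classifier: predict 1 when x >= threshold."""
    best_threshold, best_correct = None, -1
    for threshold, _ in points:
        correct = sum((x >= threshold) == bool(y) for x, y in points)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def bagged_predict(thresholds, x):
    """Majority vote over all stumps in the ensemble."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Toy data (assumed): label is 1 when x >= 5.
data = [(x, int(x >= 5)) for x in range(10)]

rng = random.Random(0)
# Train each stump on its own bootstrap sample of the training data.
thresholds = [train_stump(rng.choices(data, k=len(data))) for _ in range(25)]

print(bagged_predict(thresholds, 9), bagged_predict(thresholds, 0))
```

Individual stumps may learn slightly different thresholds depending on their sample, and the vote smooths out those differences.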
True Positive
False Positive
False Negative
True Negative
Which is better: False Positives or False Negatives?
It depends on the question as well as on the domain for which we are trying to solve the problem.
Inductive vs Deductive learning
Inductive learning is the process of using observations to draw conclusions.
Deductive learning is the process of using conclusions to form observations.

KNN vs K-Means
KNN:
K-Means:

Python libraries for Data Analysis
Deep Learning vs Machine Learning
Types of Machine Learning

Dealing with Missing Value
4 ways:
Suppose you are given a data set in which the missing values are spread within 1 standard deviation of the median. What percentage of the data would remain unaffected, and why?
Since the data is spread around the median, let’s assume it’s a normal distribution.
In a normal distribution, ~68% of the data lies within 1 standard deviation of the mean (which coincides with the median and the mode). Therefore, ~32% of the data would remain unaffected by missing values.
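The ~68% and ~32% figures can be checked against the standard normal distribution, assuming (as the answer does) that the data really is normally distributed.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Probability mass within 1 standard deviation of the mean.
within_one_sd = standard_normal.cdf(1.0) - standard_normal.cdf(-1.0)
# Everything else lies outside that band.
outside_one_sd = 1.0 - within_one_sd

print(f"within 1 SD: {within_one_sd:.4f}, outside: {outside_one_sd:.4f}")
```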
You are given a cancer detection data set. Let’s suppose when you build a classification model you achieved an accuracy of 96%. Why shouldn’t you be happy with your model performance? What can you do about it?
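The model might not have been trained properly. Cancer detection data sets are usually imbalanced, so an accuracy of 96% may come from predicting only the majority (non-cancer) class correctly while missing the minority class, which is the one we actually care about.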
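A small sketch of why accuracy alone can mislead here, assuming cancer cases are rare in the data (the question does not give the class balance, so the 4-in-100 split below is made up). A model that never predicts cancer still reaches 96% accuracy.

```python
# Hypothetical screening data: 4 cancer cases (1) among 100 patients.
actual = [1] * 4 + [0] * 96
# A useless model that predicts "no cancer" (0) for everyone.
predicted = [0] * 100

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
# Recall: share of actual cancer cases the model found.
recall = true_positives / sum(actual)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

Metrics that focus on the positive class, such as recall (sensitivity) and precision, expose the problem that accuracy hides.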
You can do the following:
Suppose you found that your model is suffering from low bias and high variance. Which algorithm do you think could tackle this situation, and why?
Type 1: How to tackle high variance?
Type 2: How to tackle high variance?
How do you map nicknames (Pete, Andy, Nick, Rob, etc) to real names?
You’re asked to build a random forest model with 10000 trees. During its training, you got training error as 0.00. But, on testing the validation error was 34.23. What is going on? Haven’t you trained your model perfectly?