What are the differences between supervised and unsupervised learning?
Supervised Learning:
- Uses known and labeled data as input
- Supervised learning has a feedback mechanism
- The most commonly used supervised learning algorithms are decision trees, logistic regression, and support vector machines
Unsupervised Learning:
- Uses unlabeled data as input
- Unsupervised learning has no feedback mechanism
- The most commonly used unsupervised learning algorithms are k-means clustering, hierarchical clustering, and the Apriori algorithm
How is logistic regression done?
Logistic regression measures the relationship between the dependent variable (our label of what we want to predict) and one or more independent variables (our features) by estimating probability using its underlying logistic function (sigmoid).
p(x) = 1 / (1 + exp(-\theta^T x)) [with Bernoulli likelihood]
Sigmoid:
S(x) = 1 / (1 + exp(-x))
S’(x) = S(x) * (1 - S(x))
Sigmoid generalisation: Softmax function
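The sigmoid and its derivative above can be sketched directly (a minimal illustration; the numerical check is my own addition):

```python
import math

def sigmoid(x):
    """S(x) = 1 / (1 + exp(-x)): maps any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """S'(x) = S(x) * (1 - S(x)), per the identity above."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The sigmoid is 0.5 at the origin, where its slope peaks at 0.25.
print(sigmoid(0.0))             # 0.5
print(sigmoid_derivative(0.0))  # 0.25
```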
How is logistic regression done? [ICL notes]
ICL notes:
- Binary classification problems
- Linear model with non-Gaussian likelihood
- Implicit modeling assumptions
- Parameter estimation (MLE, MAP) no longer in closed form
- Bayesian logistic regression with Laplace approximation of the posterior
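Because the MLE is no longer in closed form, the parameters are found iteratively. A minimal gradient-ascent sketch (the toy data, learning rate, and step count are illustrative assumptions):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Maximise the Bernoulli log-likelihood by gradient ascent.

    The gradient w.r.t. theta is sum_i (y_i - p_i) * x_i, which cannot
    be set to zero analytically, unlike ordinary least squares.
    """
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            p = sigmoid(sum(t * x for t, x in zip(theta, xi)))
            for j, xj in enumerate(xi):
                grad[j] += (yi - p) * xj
        theta = [t + lr * g / len(X) for t, g in zip(theta, grad)]
    return theta

# Toy 1-D problem with an intercept column: label is 1 when x > 0.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
theta = fit_logistic(X, y)
```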
Explain the steps in making a decision tree
Bias vs Variance
What are the three commonly used methods for finding the sweet spot between a simple and complicated model?
How does Random Forest work?
How does AdaBoost work?
How does GradientBoost work?
How does LightGBM work?
How do you build a random forest model?
Steps to build a random forest model:
(1) Randomly select ‘k’ features from a total of ‘m’ features, where k << m
(2) Among the ‘k’ features, calculate the node D using the best split point
(3) Split the node into daughter nodes using the best split
(4) Repeat steps two and three until leaf nodes are finalised
(5) Build forest by repeating steps one to four for ‘n’ times to create ‘n’ number of trees
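The five steps above can be sketched in plain Python; for brevity each tree here is a one-level stump rather than a fully grown tree, and the Gini criterion is assumed as the split-quality measure:

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_stump(X, y, feat_idx):
    """Steps (2)-(3): among the sampled features, find the best split point."""
    best = None
    for j in feat_idx:
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            score = (gini(left) * len(left) + gini(right) * len(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    return best

def build_forest(X, y, n_trees=51, k=2, seed=0):
    """Step (5): repeat steps (1)-(4) n_trees times."""
    rng = random.Random(seed)
    m = len(X[0])
    forest = []
    for _ in range(n_trees):
        # bootstrap sample of the rows, then step (1): 'k' of the 'm' features
        rows = [rng.randrange(len(X)) for _ in range(len(X))]
        Xb, yb = [X[i] for i in rows], [y[i] for i in rows]
        stump = best_stump(Xb, yb, rng.sample(range(m), k))
        if stump is not None:
            forest.append(stump)
    return forest

def predict(forest, x):
    """Classify by majority vote over all trees in the forest."""
    votes = [left if x[j] <= t else right for _, j, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]
```

Libraries such as scikit-learn implement the full version of this procedure (deep trees, per-node feature sampling) as `RandomForestClassifier`.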
How can you avoid overfitting your model?
Overfitting refers to a model that fits its training data too closely, capturing noise rather than the bigger picture, and therefore generalises poorly. The main methods to avoid overfitting are:
(1) Keep the model simple—take fewer variables into account, thereby removing some of the noise in the training data
(2) Use cross-validation techniques, such as k-fold cross-validation
(3) Use regularisation techniques, such as LASSO, that penalise certain model parameters if they’re likely to cause overfitting
(4) Use ensemble techniques, such as bagging and boosting
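Method (2) can be illustrated with a hand-rolled k-fold split (a minimal sketch; libraries such as scikit-learn provide this as `KFold`):

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Shuffle the n sample indices, then deal them into k folds.

    Each fold is held out once as a validation set while the model
    is trained on the remaining k-1 folds; the k scores are averaged.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(10, k=5)
for held_out in folds:
    train = [i for fold in folds if fold is not held_out for i in fold]
    # fit on `train`, evaluate on `held_out` here
    assert sorted(train + held_out) == list(range(10))  # folds partition the data
```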
Differentiate between univariate, bivariate, and multivariate analysis
Univariate data contains only one variable. The purpose of univariate analysis is to describe the data and find patterns that exist within it.
Example height of students: The patterns can be studied by drawing conclusions using mean, median, mode, dispersion or range, minimum, maximum, etc.
Bivariate data involves two different variables. The analysis of this type of data deals with causes and relationships and the analysis is done to determine the relationship between the two variables.
Example temperature and ice cream sales in the summer season: the data shows that temperature and sales are directly proportional to each other. The hotter the temperature, the better the sales.
Data involving three or more variables is categorised as multivariate. It is similar to bivariate analysis but contains more than one dependent variable.
Example data for house price prediction: patterns can be studied by drawing conclusions using mean, median, and mode, dispersion or range, minimum, maximum, etc. You can start describing the data and using it to guess what the price of the house will be.
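The univariate and bivariate examples above can be reproduced with the standard library (the numbers are made up for illustration):

```python
import statistics

# Univariate: describe one variable (student heights in cm)
heights = [150, 152, 155, 155, 160, 162, 165, 170]
summary = {
    "mean": statistics.mean(heights),
    "median": statistics.median(heights),
    "mode": statistics.mode(heights),
    "range": max(heights) - min(heights),
}

# Bivariate: Pearson correlation between temperature and ice cream sales
temps = [20, 25, 30, 35]
sales = [10, 18, 26, 34]
mt, ms = statistics.mean(temps), statistics.mean(sales)
cov = sum((t - mt) * (s - ms) for t, s in zip(temps, sales))
r = cov / (sum((t - mt) ** 2 for t in temps)
           * sum((s - ms) ** 2 for s in sales)) ** 0.5
# r close to 1 => directly proportional, as in the example above
```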
What are the feature selection methods used to select the right variables?
There are two main methods for feature selection: filter methods and wrapper methods.
Filter Methods:
- Linear discriminant analysis (LDA)
- PCA [X^TX feature vs feature, added myself]
- ANOVA [analysis of variance, SST = SSB + SSW]
- Chi-Square [test for mutually independent features, e.g. reject if significance level is below 5%]
Wrapper Methods:
- Forward Selection: We test one feature at a time and keep adding them until we get a good fit
- Backward Selection: We test all the features and start removing them to see what works better
- Recursive Feature Elimination: Recursively looks through all the different features and how they pair together
Note: Wrapper methods are very compute-intensive; because they repeatedly retrain the model, substantial computing power is needed when a lot of data is analysed this way.
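Forward selection can be sketched generically; here `score_fn` is a stand-in assumption for whatever model-quality metric is used (cross-validated accuracy, for instance):

```python
def forward_selection(features, score_fn, min_gain=1e-6):
    """Greedily add the single feature that most improves score_fn(subset),
    stopping when no candidate improves the score by at least min_gain."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining:
        # evaluate every remaining feature added to the current subset
        score, feat = max((score_fn(selected + [f]), f) for f in remaining)
        if score < best_score + min_gain:
            break
        selected.append(feat)
        remaining.remove(feat)
        best_score = score
    return selected
```

Backward selection is the mirror image: start from the full feature set and greedily drop the feature whose removal hurts the score least.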
Describe ANOVA
ANOVA = analysis of variance
SST = SSB + SSW,
where
SST = sum of squares total
SSB = … between
SSW = … within
X \in R^(m x n)
F-statistic = (SSB/(m-1)) / (SSW/(m*(n-1))) >> 1 => highly different features
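The decomposition and F-statistic can be checked numerically (the two toy groups are my own example; m groups of n samples each, matching the X in R^(m x n) layout above):

```python
def one_way_anova(groups):
    """Return (SST, SSB, SSW, F) for m equal-sized groups of n samples."""
    m, n = len(groups), len(groups[0])
    grand = sum(x for g in groups for x in g) / (m * n)
    means = [sum(g) / n for g in groups]
    ssb = n * sum((mu - grand) ** 2 for mu in means)                     # between
    ssw = sum((x - mu) ** 2 for g, mu in zip(groups, means) for x in g)  # within
    sst = sum((x - grand) ** 2 for g in groups for x in g)               # total
    f = (ssb / (m - 1)) / (ssw / (m * (n - 1)))
    return sst, ssb, ssw, f

# Two well-separated groups => SSB dominates SSW and F >> 1
sst, ssb, ssw, f = one_way_anova([[1, 2, 3], [11, 12, 13]])
```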
Describe Linear Discriminant Analysis
d^2 / (s_1^2 + s_2^2) = ideally large / small,
where d is the distance between the category means and s_i is the sample standard deviation within category i
(1) Maximise the distance between means
(2) Minimise the variation (scatter) within each category
LDA vs PCA:
- both try to reduce dimensions
- PCA looks at the features with the most variation, hence X^TX
- LDA tries to maximise the separation of categories and minimises variation
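The LDA criterion above, for a single feature and two categories, can be computed directly (the toy numbers are my own):

```python
import statistics

def fisher_criterion(class_a, class_b):
    """(distance between means)^2 / (s_a^2 + s_b^2): large when the
    categories are far apart (goal 1) and tight within themselves (goal 2)."""
    d = statistics.mean(class_a) - statistics.mean(class_b)
    return d ** 2 / (statistics.variance(class_a) + statistics.variance(class_b))

# Well-separated categories score much higher than overlapping ones
good = fisher_criterion([1, 2, 3], [11, 12, 13])
bad = fisher_criterion([1, 5, 9], [2, 6, 10])
```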