What are heuristics and why are they used?
Shortcuts to solutions. Rules-based approaches. An approximate solution instead of an exact solution.
They are used because some problems either can’t be solved exactly or require too much time or processing power to be practical for the problem at hand, e.g. you don’t want a chess engine taking forever to assess the best move.
Nearest neighbour heuristic –> Ask the computer to figure out the closest city that hasn’t been visited yet by the salesperson and make that the next stop. Doesn’t consider future moves, so it’s not the most effective.
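The greedy idea above can be sketched in a few lines. This is a minimal illustration, assuming made-up city names and coordinates:

```python
import math

# Hypothetical cities with (x, y) coordinates -- made up for illustration.
cities = {
    "A": (0, 0),
    "B": (1, 5),
    "C": (5, 2),
    "D": (6, 6),
}

def nearest_neighbour_tour(start, cities):
    """Greedily visit the closest unvisited city at each step (no look-ahead)."""
    tour = [start]
    unvisited = set(cities) - {start}
    current = start
    while unvisited:
        # Pick the closest unvisited city; future moves are never considered.
        nxt = min(unvisited, key=lambda c: math.dist(cities[current], cities[c]))
        tour.append(nxt)
        unvisited.remove(nxt)
        current = nxt
    return tour

print(nearest_neighbour_tour("A", cities))  # ['A', 'B', 'C', 'D']
```

Because each step only looks one move ahead, the tour it returns can be noticeably longer than the optimal one.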
Alpha-beta pruning (games) –> explores possible next moves, but stops evaluating a branch as soon as it is proven worse than a previously considered move.
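A minimal sketch of that pruning idea on a hand-made depth-2 game tree (the tree and leaf scores are invented for illustration):

```python
# Leaves are integer scores; internal nodes are lists of children.
def alphabeta(node, alpha, beta, maximizing):
    if isinstance(node, int):        # leaf: return its score
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:        # branch can't beat an already-known option
                break                # prune: skip remaining children
        return value
    else:
        value = float("inf")
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:
                break
        return value

# Root is a maximizing node over three minimizing nodes.
tree = [[3, 5], [2, 9], [0, 7]]
print(alphabeta(tree, float("-inf"), float("inf"), True))  # 3
```

In the second and third subtrees the leaves 9 and 7 are never evaluated: once a child scores below the best option already found (3), the rest of that branch can't matter.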
What is AI?
A broad category that involves teaching computers to solve problems using algorithms. These systems mimic aspects of human intelligence.
This can be done by processing a set of complex rules or by training machine learning models.
What are the challenges with ML?
What is ML?
Machine learning is about training a machine (a set of mathematical models) on a historical dataset so that it can make predictions on unseen data.
The key property of machine learning systems is that their performance improves with new data (experience).
What is deep learning?
Deep learning problems form a subset of machine learning problems.
Deep learning is the branch of machine learning that mimics the human brain, learning from a dataset and predicting outcomes on unseen data.
In deep learning models, the features are learned automatically by the optimisation algorithm.
When should one use deep learning?
How do you identify whether a problem is a machine learning one?
What are the different kinds of ML problems?
Three most common…
Supervised learning problems can be categorized into the following different types:
- Regression (Predict the numerical value given the data set)
- Classification (Predict the class or the label of the dataset)
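The two task types above can be contrasted with a toy sketch; all data here is made up for illustration:

```python
# Regression: predict a numerical value (e.g. house price from size).
sizes = [50, 70, 100]        # square metres (made-up data)
prices = [100, 140, 200]     # price in thousands

# Least-squares slope through the origin: price ~= slope * size
slope = sum(s * p for s, p in zip(sizes, prices)) / sum(s * s for s in sizes)
print(round(slope * 80))     # predicted price for an 80 m^2 house -> 160

# Classification: predict a label (e.g. spam vs not spam from a score).
def classify(spam_score, threshold=0.5):
    return "spam" if spam_score >= threshold else "not spam"

print(classify(0.9))         # spam
```

Same supervised setup in both cases (inputs paired with known answers); the difference is whether the answer is a number or a category.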
What is feature engineering and what role do product managers play?
Feature engineering is one of the key stages of the machine learning model development lifecycle. It can be defined as the process of identifying the most important features that can be used to train a machine learning model which generalises well to an unseen dataset (the larger population). You need to clearly understand the concept of features.
Feature engineering comprises the following tasks:
As a product manager, you play a key role in helping data scientists identify raw features and derived features. The rest is the work of the data scientist.
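The raw-vs-derived distinction can be sketched as follows. The record fields and the derived features are hypothetical examples, not a prescribed set:

```python
from datetime import date

# Hypothetical raw transaction records (field names are made up).
raw = [
    {"amount": 120.0, "timestamp": date(2024, 1, 6)},   # a Saturday
    {"amount": 40.0,  "timestamp": date(2024, 1, 8)},   # a Monday
]

def engineer(record):
    return {
        # Raw feature: used as-is from the source data.
        "amount": record["amount"],
        # Derived features: computed from raw fields using domain knowledge.
        "is_weekend": record["timestamp"].weekday() >= 5,
        "amount_bucket": "high" if record["amount"] > 100 else "low",
    }

features = [engineer(r) for r in raw]
print(features[0]["is_weekend"], features[1]["is_weekend"])  # True False
```

The PM's contribution is typically the domain insight ("weekend transactions behave differently"); turning it into robust pipeline code is the data scientist's work.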
What is your approach towards model governance/monitoring?
Model performance can be classified into three categories, namely, the green zone, the yellow zone, and the red zone. One needs to identify thresholds for putting the model performance in the green, yellow, and red zones. Based on which zone the model's performance falls into, the model is scheduled for retraining.
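A minimal sketch of that zone logic. The metric (accuracy) and the threshold values are made-up assumptions; in practice each model gets its own metric and thresholds:

```python
# Hypothetical thresholds: >= 0.90 is green, >= 0.80 is yellow, below is red.
def performance_zone(accuracy, green_min=0.90, yellow_min=0.80):
    if accuracy >= green_min:
        return "green"    # performing as expected
    if accuracy >= yellow_min:
        return "yellow"   # watch closely, consider retraining
    return "red"          # schedule retraining now

print(performance_zone(0.93), performance_zone(0.85), performance_zone(0.70))
# green yellow red
```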
What is accuracy and how to best use it?
How do you use other metrics?
Accuracy = total correct predictions / total predictions
Measures the overall proportion of correct predictions made by the model.
Can be misleading if the dataset is imbalanced (the majority comprised of one of the labels). Fine to use if the dataset is balanced.
We use precision when we want the predictions of 1 to be as correct as possible, and we use recall when we want our model to spot as many real 1s as possible.
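A quick sketch of the accuracy formula, plus why it misleads on imbalanced data (all labels made up):

```python
# Accuracy on a balanced, made-up set of predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true))  # 6/8 -> 0.75

# Why accuracy misleads when imbalanced: always predicting the majority
# class "0" scores 90% on a 90/10 split while never finding a single 1.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.9
```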
Why is testing ML projects challenging?
How do you handle data science uncertainty when planning?
How do you help the team manage and prioritise BAU model improvements?
What is precision and how to best use it?
Give an example.
Precision = true positive / total predicted positive (true and false positive)
Precision is a good measure when the cost of a wrong positive prediction is much higher than the cost of missing a true positive, e.g. email spam detection. If precision is low, there are too many false positives and important email goes to the spam folder.
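The precision formula on a made-up spam-detection run (1 = spam, 0 = not spam):

```python
y_true = [1, 0, 1, 0, 0, 1, 0, 0]   # made-up ground truth
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]   # made-up model output

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
precision = tp / (tp + fp)
print(precision)  # 2 correct out of 3 predicted spam -> 0.666...
```

The one false positive here is the costly case: a legitimate email landing in the spam folder.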
What is recall and how to best use it?
Give an example.
Recall = true positive / total actual positive (true positive and false negative)
The model should capture as many examples of the class as possible.
Best used when the cost of missing a positive is much higher than the cost of a wrong positive prediction, e.g. a fraudulent bank transaction predicted as non-fraudulent, sick patients predicted as healthy, or airport detectors missing actual bombs/dangerous items. High coverage!
What is an F1 score and how to best use it?
F1-Score = 2 x (precision*recall)/(precision + recall) –> seeking a balance between precision and recall
Best used when the data is imbalanced.
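The harmonic-mean behaviour is the point: F1 is pulled toward the weaker of the two metrics, so a model can't hide a bad recall behind a good precision. A sketch with made-up metric values:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(f1(0.8, 0.5))   # 0.615..., closer to the weaker metric (0.5)
print(f1(0.65, 0.65)) # 0.65 -- equal inputs give the same value back
```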
What is the difference between supervised and unsupervised methods?
Can you provide some examples?
Supervised: Need labels, answers questions in pre-defined categories (e.g. email classification - classifying the email exactly)
Unsupervised: No need for labels, good for exploring, can visualise well (e.g. figuring out how many email classes there are/exploration)
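The contrast above can be sketched on made-up 1-D "email length" data: with labels we learn a decision rule; without labels, clustering discovers the groups itself. The threshold and the tiny 2-means routine are illustrative assumptions, not a production approach:

```python
lengths = [5, 6, 7, 50, 52, 55]   # made-up email lengths

# Supervised: labels are given, so a simple decision rule can be learned.
labels = ["short", "short", "short", "long", "long", "long"]
threshold = 25                    # e.g. learned from the labelled examples
def classify(x):
    return "long" if x > threshold else "short"
print(classify(8))                # short

# Unsupervised: no labels; 2-means clustering finds the groups on its own.
def two_means(xs, iters=10):
    c1, c2 = min(xs), max(xs)     # initialise centres at the extremes
    for _ in range(iters):
        a = [x for x in xs if abs(x - c1) <= abs(x - c2)]
        b = [x for x in xs if abs(x - c1) > abs(x - c2)]
        c1, c2 = sum(a) / len(a), sum(b) / len(b)  # recompute centres
    return sorted([c1, c2])
print(two_means(lengths))         # centres near 6 and 52.3
```

Note the clustering never names its groups "short" or "long"; interpreting the clusters is left to a human, which is why unsupervised methods suit exploration.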
Why is versioning important?
Model, data and code
What product metrics would you use for a chatbot?
User engagement: qualitative and quantitative
Performance: Model metrics and response times
Stability: SLA, bugs
How would you design for feedback capture? Can you give an example?
(Does no feedback mean positive feedback?)
Ensuring that the design of the UI makes it very clear to the user what action they are taking next so that they don’t click without a clear intent.
I would make sure to user test this multiple times to ensure the mentality is ingrained.
This gives data science teams reassurance on the other end.
Example: Reconciliations with accept or decline buttons. A decline carried meaning, so we had to stop people from progressing until they clicked something.