We’re given two tables: a table of notification deliveries and a table of users with created and purchase-conversion dates. If the user hasn’t purchased, the conversion_date column is NULL.
`notification_deliveries` table:
column type
notification varchar
user_id int
created_at datetime

`users` table:
column type
user_id int
created_at datetime
conversion_date datetime
Example Output:
notification conversion_rate
activate_premium 0.05
try_premium 0.03
free_trial 0.11
Problem 1 Possible Solutions:
select ct, count(*)
from (
    select n.user_id, count(notification) as ct
    from notification_deliveries n
    join users u on n.user_id = u.user_id
    where conversion_date is not null
      and n.created_at < u.conversion_date
    group by n.user_id
) temp
group by ct
select pushes, count(*)
from (
    select t1.user_id, count(*) as pushes
    from users t1
    left join notification_deliveries t2
      on t1.user_id = t2.user_id
     and t1.conversion_date >= t2.created_at
    where conversion_date is not null
    group by 1
) tmp2
group by 1
Problem 2 Possible Solutions
select notification, avg(converted)
from (
    select *, (case when conversion_date is not null then 1 else 0 end) as converted
    from notification_deliveries n
    join users u on n.user_id = u.user_id
) temp
group by notification
select notification,
       sum(case when conversion_date is not null then 1 else 0 end) * 1.0 / count(*) as conversion_rate
from users t1
left join notification_deliveries t2
  on t1.user_id = t2.user_id
 and t1.conversion_date >= t2.created_at
group by 1
A dating website’s schema is represented by a table of people who like other people. The table has three columns: user_id (the user being liked), liker_id (the user_id of the user doing the liking), and the datetime the like occurred.
Write a query to count the number of each liker’s likers (the users that like the likers), if the liker has any.
likes table:
column type
user_id int
created_at datetime
liker_id int
input:
user liker
A B
B C
B D
D E
output:
user count
B 2
D 1
select user_id, count(liker_id) as count from likes where user_id in ( select liker_id from likes group by liker_id) group by user_id
select user_id, count(liker_id)
from likes
where user_id in (select distinct liker_id from likes)
group by user_id
order by user_id
Suppose we have a binary classification model that classifies whether or not an applicant should be qualified to get a loan. Because we are a financial company, we have to provide each rejected applicant with a reason why.
Given we don’t have access to the feature weights, how would we give each rejected applicant a reason why they got rejected?
Given we do not have access to the feature weights, we are unable to tell each applicant which were the highest contributing factors to their application rejection. However, if we have enough results, we can start to build a sample distribution of application outcomes, and then map them to the particular characteristics of each rejection.
For example, if a rejected applicant had a recurring outstanding credit card balance of 10% of their monthly take-home income: if we know that the percentile of this data point falls within the middle of the distribution of rejected applicants, we can be fairly certain it is at least correlated with their rejection outcome. With this methodology, we can outline a few standard factors that may have led to the decision.
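The distribution-mapping idea above can be sketched in a few lines. Everything here is an assumption for illustration: the feature names, the cutoff, and the synthetic population; this variant measures how extreme each feature is relative to accepted applicants, the same percentile idea from the other direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical historical data: one row per accepted applicant, one column
# per feature. Feature names and distributions are made up for illustration.
features = ["credit_utilization", "debt_to_income", "late_payments"]
accepted = rng.normal(loc=[0.2, 0.25, 0.5], scale=[0.1, 0.1, 1.0], size=(5000, 3))

def rejection_reasons(applicant, accepted, names, cutoff=0.95):
    """Rank features by how extreme the applicant looks vs. accepted applicants.

    For each feature, compute the fraction of accepted applicants whose value
    is below the applicant's; features above `cutoff` are candidate reasons.
    """
    percentiles = (accepted < applicant).mean(axis=0)
    ranked = sorted(zip(names, percentiles), key=lambda t: -t[1])
    return [(n, p) for n, p in ranked if p > cutoff]

# applicant with unusually high credit utilization, otherwise typical
applicant = np.array([0.55, 0.27, 0.4])
print(rejection_reasons(applicant, accepted, features))
```

With enough historical outcomes, the flagged percentiles can be translated into the "standard factors" mentioned above.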
Given a list of timestamps in sequential order, return a list of lists grouped by week (7 days) using the first timestamp as the starting point.
Example:
ts = [
'2019-01-01',
'2019-01-02',
'2019-01-08',
'2019-02-01',
'2019-02-02',
'2019-02-05',
]
output = [
['2019-01-01', '2019-01-02'],
['2019-01-08'],
['2019-02-01', '2019-02-02'],
['2019-02-05'],
]

from datetime import datetime as dt
from itertools import groupby
inp = ['2019-01-01', '2019-01-02', '2019-01-08', '2019-02-01', '2019-02-02', '2019-02-05']
first = dt.strptime(inp[0], "%Y-%m-%d")
out = []
for k, g in groupby(inp, key=lambda d: (dt.strptime(d, "%Y-%m-%d") - first).days // 7):
    out.append(list(g))
print(out)
from collections import defaultdict
from datetime import datetime as dt

ts = ['2019-01-01', '2019-01-02', '2019-01-08', '2019-02-01', '2019-02-02', '2019-02-05']
first = dt.strptime(ts[0], '%Y-%m-%d')
dic = defaultdict(list)
for i in ts:
    # bucket each timestamp by the number of whole weeks since the first one
    idx = (dt.strptime(i, '%Y-%m-%d') - first).days // 7
    dic[idx].append(i)
print(list(dic.values()))

Explain what regularization is and why it is useful
How do you solve for multicollinearity?
-multicollinearity occurs when independent variables in a model are correlated with one another
-common fixes: drop one of the correlated variables, combine them into a single feature, use regularization (e.g. ridge regression), or apply dimensionality reduction such as PCA
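A standard diagnostic for multicollinearity is the variance inflation factor (VIF); a minimal numpy-only sketch on synthetic data (the variables and the usual 5-10 rule of thumb are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def vif(X):
    """Variance inflation factor for each column of X.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on the remaining columns (plus an intercept). Values above roughly
    5-10 are a common rule of thumb for problematic multicollinearity.
    """
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)   # nearly collinear with x1
x3 = rng.normal(size=500)                   # independent of both
print(vif(np.column_stack([x1, x2, x3])))   # large, large, near 1
```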
Explain the difference between generative and discriminative algorithms
Suppose we have a dataset with training input x and labels y.
Generative model: explicitly models the actual distribution of each class (effectively the joint probability p(x, y)); Naive Bayes is a classic example.
Discriminative model: learns the conditional probability distribution p(y|x) or a direct mapping from inputs x to the class labels y; logistic regression is a classic example.
Explain the bias-variance tradeoff
bias: error caused by oversimplifying your model (underfitting)
variance: error caused by having too complex a model (overfitting)
is more data always better?
no. related to "Big Data hubris": the idea that big data is a substitute for, rather than a supplement to, traditional data collection and analysis. More low-quality or biased data can make a model worse, not better.
what are feature vectors?
feature vector: an n-dimensional vector of numerical features that represents some object and can be represented as a point in n-dimensional space
how do you know if one algorithm is better than others?
better can mean a lot of different things:
this answer depends on the problem, goal, and constraints
explain the difference between supervised and unsupervised machine learning
supervised machine learning algorithms: we provide labeled data (e.g. spam or not spam, cat or not cat) so the model can learn the mapping from inputs to labeled outputs.
unsupervised learning: we don’t need labeled data; the goal is to detect patterns or learn representations of the data
-example: detecting anomalies or finding similar groupings of customers
what is the difference between convex and non-convex functions?
convex: a single minimum (any local minimum is the global minimum)
- important: an optimization algorithm (like gradient descent) won’t get stuck in a local minimum
non-convex: has valleys (local minima) that aren’t as low as the lowest point overall (the global minimum)
-optimization algorithms can get stuck in a local minimum, and it can be hard to tell when this happens
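A small sketch of getting stuck, using a made-up one-dimensional non-convex function with two valleys of different depths:

```python
# made-up non-convex function with two valleys; the left valley is deeper
def f(x):
    return x**4 - 2 * x**2 + 0.5 * x

def grad(x):
    return 4 * x**3 - 4 * x + 0.5

def gradient_descent(x, lr=0.01, steps=5000):
    """Plain gradient descent from a given starting point."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

left = gradient_descent(-1.0)   # converges to the deeper (global) minimum
right = gradient_descent(1.0)   # gets stuck in the shallower local minimum

# same algorithm, different starting points, different answers
print(f(left), f(right))
```

Both runs satisfy the "gradient is (near) zero" stopping condition, which is why it is hard to tell from inside the algorithm that the second run found only a local minimum.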
2. what is the difference between local and global optimum
suppose you have the following two lists:
a = [42, 84, 3528, 1764]
b = [42, 42, 42, 42]
What does the following piece of code do? How can you make it run faster?
>>> total = 0
>>> for idx, val in enumerate(a):
>>>     total += a[idx] * b[idx]
>>> return total
Essentially the dot product between two 1-dimensional vectors. Can use np.dot(np.array(a), np.array(b)) instead.
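A quick check that the loop and np.dot agree (the loop is wrapped in a function so the `return` is valid):

```python
import numpy as np

a = [42, 84, 3528, 1764]
b = [42, 42, 42, 42]

def dot_loop(a, b):
    """The loop from the question, wrapped in a function."""
    total = 0
    for idx, val in enumerate(a):
        total += a[idx] * b[idx]
    return total

print(dot_loop(a, b))                    # 227556
print(np.dot(np.array(a), np.array(b)))  # 227556, vectorized in C
```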
Define the Central Limit Theorem and its importance
CLT: repeatedly take independent random samples of size n from a population (for both normal and non-normal data)
-when n is large, the distribution of the sample means will approach a normal distribution
Importance: lets us use normal-theory inference (confidence intervals, hypothesis tests) for sample means even when the population itself is not normal
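The CLT is easy to see in simulation, drawing sample means from a clearly non-normal (exponential) population:

```python
import numpy as np

rng = np.random.default_rng(42)

# 10,000 independent samples of size n = 50 from an exponential population,
# which is strongly right-skewed (very non-normal)
n, trials = 50, 10_000
sample_means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

# exponential(scale=1) has mean 1 and std 1, so the CLT predicts the sample
# means are approximately normal with mean 1 and std 1/sqrt(50) ≈ 0.141
print(sample_means.mean())  # close to 1
print(sample_means.std())   # close to 0.141
```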
Define Law of large numbers and its importance
LLN states that if an experiment is repeated independently a large number of times and you take the average of the results
-the average should be close to the expected value, and gets closer as the number of trials grows
example: the fraction of heads in a long run of fair coin flips approaches 0.5
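As a quick simulation (a fair-coin example): the running average of coin flips converges to the expected value 0.5:

```python
import numpy as np

rng = np.random.default_rng(7)

# 100,000 fair coin flips (1 = heads); the running average of the results
# should approach the expected value 0.5 as the number of flips grows
flips = rng.integers(0, 2, size=100_000)
running_avg = flips.cumsum() / np.arange(1, flips.size + 1)

print(running_avg[9])    # after 10 flips: can be far from 0.5
print(running_avg[-1])   # after 100,000 flips: very close to 0.5
```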
what is the normal distribution? What are some examples of data that follow it?
-also known as gaussian distribution
-allows us to perform parametric hypothesis testing
most of the observations cluster around the mean:
-68% within one standard deviation
-95.4% within two standard deviations
-99.7% within three standard deviations
examples: height, weight, shoe size, test scores, blood pressure, daily return of stocks
how do you check if a distribution is close to normal
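One rough numeric check (standard tools are Q-Q plots and formal tests such as Shapiro-Wilk): compare the empirical 1/2/3 standard-deviation coverage against the 68/95.4/99.7 rule above. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)

def coverage(data):
    """Fraction of points within 1, 2, and 3 standard deviations of the mean."""
    mu, sd = data.mean(), data.std()
    return [float(np.mean(np.abs(data - mu) < k * sd)) for k in (1, 2, 3)]

normal_ish = rng.normal(size=10_000)   # should match ~[0.68, 0.954, 0.997]
skewed = rng.exponential(size=10_000)  # clearly deviates, especially at 1 sd

print(coverage(normal_ish))
print(coverage(skewed))
```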
2. what are some examples of data that follow it? Why is it important in machine learning?
long-tailed distribution (or Pareto): data is clustered around the head and gradually levels off to zero
examples: frequency of earthquakes (a large number of small-magnitude earthquakes, few large-magnitude ones), search engines (a few keywords are searched for constantly, most are rare)
2. important in ML: applied by saying that 20% of data might be useful or that 80% of your time will be spent on one part of the data science project (usually data cleaning)
2. how does one solve it
how does one multiply matrices?
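As a refresher, the (i, j) entry of the product AB is the dot product of row i of A with column j of B; a naive implementation, checked against numpy:

```python
import numpy as np

def matmul(A, B):
    """Naive matrix product: C[i][j] is the dot product of row i of A and column j of B."""
    n, k = len(A), len(A[0])
    assert k == len(B), "inner dimensions must match"
    m = len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))               # [[19, 22], [43, 50]]
print(np.array(A) @ np.array(B))  # same values via numpy
```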
what are:
a. eigenvalues
b. eigenvectors
A scalar λ is called an eigenvalue of an n x n matrix A if there is a nontrivial solution x of Ax = λx; such an x is the eigenvector corresponding to the eigenvalue λ.
Eigenvectors tell you the directions in which the linear transformation represented by A acts like scalar multiplication.
Eigenvalues are the amounts by which those eigenvectors are scaled.
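The defining property Ax = λx can be checked directly with numpy, using a simple diagonal matrix where the answer is known in advance:

```python
import numpy as np

# a diagonal matrix scales each coordinate axis, so its eigenvalues are the
# diagonal entries and its eigenvectors are the axes themselves
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

vals, vecs = np.linalg.eig(A)

# verify the defining property A @ x = lambda * x for each eigenpair
# (eigenvectors are the *columns* of vecs)
for lam, x in zip(vals, vecs.T):
    assert np.allclose(A @ x, lam * x)

print(np.sort(vals.real))  # the two eigenvalues, 2 and 3
```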