Notebook 8 Flashcards

(72 cards)

1
Q

What are Categorical Variables?

A
2
Q

What import statement is needed for OneHotEncoder?

A
3
Q

What is Ordinal encoding?

A
4
Q

What is one-hot and dummy encoding?

A
5
Q

How do we use pandas to apply one-hot or dummy encoding to a DataFrame?

A
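The answer above is blank; a minimal sketch with made-up data, using pandas' get_dummies:

```python
import pandas as pd

# Made-up example data with one categorical column
df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df, columns=["colour"])

# Dummy encoding: drop the first category to avoid a redundant column
dummy = pd.get_dummies(df, columns=["colour"], drop_first=True)
```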
6
Q

How do we change this one-hot encoding to dummy encoding?

A
7
Q

How do we use scikit-learn for one-hot encoding of a pandas DataFrame? How is it different from the pandas approach?

A

Unless sparse_output=False is set, the encoder returns a sparse matrix, and this is what is printed:

<Compressed Sparse Row sparse matrix of dtype 'float64'
with 5 stored elements and shape (5, 3)>

8
Q

How do we use scikit-learn's encoder to get dummy encoding?

A
9
Q

When do you use dummy encoding vs. one-hot encoding?

A
10
Q

Say you have used scikit-learn for one-hot encoding; how do we convert the result back into a DataFrame? What do we need to be mindful of?

A
11
Q

What are mixed or heterogeneous data types?

A

Datasets that have both numerical and categorical features, that is, mixed data types; also called multiple variable types or heterogeneous data.

12
Q

How would you code to perform one-hot and dummy encoding on a mixed dataset, laptop_price.csv?

A
13
Q

For a mixed dataset, how could you code to explicitly specify the categorical features for one-hot encoding?

How could you separate out the numerical and categorical columns?

A
14
Q

How is inconsistent preprocessing a common pitfall?

A
15
Q

Common pitfalls: What is Data leakage?

A
16
Q

How can we avoid data leakage when pre-processing?

A
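The answer above is blank; a sketch of the usual rule (with made-up data): fit any preprocessing on the training split only, then reuse the fitted transformer on the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for a real dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the scaler on the training split ONLY, then reuse it on the test split;
# fitting on all the data would leak test-set statistics into training
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```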
17
Q

What preprocessing could we perform that would lead to data leakage?

A
18
Q

How can we correct for this preprocessing error that is causing data leakage?

A
19
Q

What property do some scikit-learn objects have inherently? How does this end up leading to preprocessing errors?

A
20
Q

What does the random_state parameter determine?

A
21
Q

What happens to our estimators if we pass RandomState instances as random_state?

A

MIGHT NEED ADJUSTING

22
Q

What are CV Splitters?

A
23
Q

When do we want to pass an integer vs. a RandomState instance to an estimator?

A
24
Q

When RandomState instances are passed to CV splitters, what occurs?

A
25
General recommendation: Getting reproducible results across multiple executions?
26
General recommendation: Robustness of cross-validation results?
27
For the mixed dataset laptop_price, how would I create test and training data for the target y from column Price (Euro) with test size = 0.2? How would I pull out from the design matrix X the categorical and numerical columns?
28
For our laptop_price dataset, what function can I use to pre-process our data? Generally, what will it do to our different data types, and how do I code this?
29
How can we make the code more readable and slightly more sophisticated?
30
Using this information, can you create test and training data for the design matrix X of the laptop_price dataset, then preprocess the data before we test it, noting this is a mixed dataset?
31
What is imputation? What does the SimpleImputer function do? Apply it to this dataset?
You can change the 'strategy' to 'median' to replace missing values with the median instead.
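A minimal sketch of SimpleImputer on made-up data (the card's real dataset is not shown here):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data with missing entries (NaN)
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, np.nan]])

# Replace each NaN with its column mean; strategy='median' uses the median
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
```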
32
How does the SimpleImputer work (scikit-learn)?
33
What are some more sophisticated scaling methods?
34
What are the MinMaxScaler and MaxAbsScaler functions?
Scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved using MinMaxScaler or MaxAbsScaler, respectively. This can be useful if you know that your data is bounded in some range and is not Gaussian, and for preserving zero entries in sparse data.
35
What does the RobustScaler function do?
If your data contains many outliers, scaling using the mean and variance of the data is likely to not work very well. For such data you want to use RobustScaler, which scales data using the median and interquartile range (IQR), making it less sensitive to outliers. Typical examples of such data are housing prices and income data, where outliers are common.
36
Example of MinMaxScaler on a dataset?
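The answer above is blank; a minimal sketch with a made-up single-feature dataset:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data
X = np.array([[1.0], [3.0], [5.0]])

# Rescale the feature linearly onto the [0, 1] range
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
```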
37
Why are duplicated rows a problem and how do we deal with them?
38
What are the import statements for all our scoring and performance measures, including all their prerequisites, that would probably be needed in machine learning modelling?
Add which import corresponds to which measure.
39
For the label vector 'Outcome' of the diabetes dataset: load the dataset, train_test_split the data, and scale the data?
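The answer above is blank; a sketch of the usual steps. The column names and values below are made-up stand-ins for loading the real file (presumably something like pd.read_csv("diabetes.csv")):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Tiny made-up stand-in for the real diabetes dataset
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116, 78, 115],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3],
    "Outcome": [1, 0, 1, 0, 1, 0, 1, 0],
})

X = df.drop(columns=["Outcome"])  # design matrix
y = df["Outcome"]                 # label vector

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Scale using statistics from the training split only
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```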
40
How do we find the best hyperparameter for the SVM Classifier for this dataset?
41
Once my SVM Classifier has been created and the best parameters are found, how do we compute our performance measures?
Check the accuracy by hand and verify that the precision and recall for the 1 class (the positive class) are the values printed here.
42
What is an important message to keep in mind about our performance measures?
43
How do we create a hyperparameter grid search for a logistic regression?
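The answer above is blank; a minimal sketch using GridSearchCV on a made-up dataset (the parameter grid here is an illustrative choice, not the card's):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Made-up data standing in for the real dataset
X, y = make_classification(n_samples=100, random_state=0)

# Search over the inverse regularisation strength C with 5-fold CV
param_grid = {"C": [0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X, y)
```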
44
How do we test the performance of this regression?
45
How do we generate the ROC curve and AUC?
What does this mean though?
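A minimal sketch with made-up labels and scores: roc_curve gives the points of the curve, roc_auc_score the area under it (1.0 is perfect, 0.5 is random guessing).

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up labels and predicted scores (e.g. predict_proba[:, 1])
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# ROC curve: false positive rate vs. true positive rate over thresholds
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# AUC: the area under that curve
auc = roc_auc_score(y_true, y_score)
```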
46
DO YOU ADD THE SECTION 3.4.3 SCORING STUFF??
47
How do we create a 2D grid?
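The answer above is blank; a minimal sketch using np.meshgrid (the function evaluated is a made-up example):

```python
import numpy as np

# Two 1-D coordinate axes
x = np.linspace(-2, 2, 5)
y = np.linspace(-2, 2, 5)

# meshgrid turns them into two 2-D coordinate arrays
X, Y = np.meshgrid(x, y)

# Evaluate a function at every grid point at once
Z = X**2 + Y**2
```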
48
When creating a 2-D grid, why do we separate the arrays? How do we evaluate a function on this grid?
We now have three 2D arrays: the `X` and `Y` arrays with the grid coordinates, and the array `Z` containing the value of the function at each grid point.
49
50
How do you plot contour lines for this function on our 2-D grid?
51
How do you plot filled contour regions for this function on our 2-D grid?
LOOK UP THE PROPER WAY TO FIND THE CRITICAL POINTS
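A minimal sketch of filled contours with lines overlaid (the plotted function is a made-up example with its minimum at the origin):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs headlessly
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-2, 2, 50)
X, Y = np.meshgrid(x, x)
Z = X**2 + Y**2  # made-up function to contour

fig, ax = plt.subplots()
cf = ax.contourf(X, Y, Z, levels=10, cmap="viridis")  # filled regions
cs = ax.contour(X, Y, Z, levels=10, colors="k")       # contour lines on top
fig.colorbar(cf)
```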
52
How do we add more contour levels to the plt.contourf and plt.contour functions?
use the kwarg: levels = int
53
How do you specify the levels of contours to plot? What about the linestyles themselves?
54
What does the cmap kwarg do in our contour plots?
Other cmaps: 'magma', 'plasma', 'Reds', 'jet'. Adding alpha (0 to 1) will change the opacity: the closer to 1, the more opaque it will be.
55
How do we code the decision boundaries here?
56
How do you plot the contourf and contour on the same graph?
57
58
59
Walk me through on a high level what each line of this code is doing?
- A Python list with classifiers and hyperparameter settings is created.
- Three datasets are created and put into a list.
- The lists are looped over to generate the comparison grid.
- Within the outer loop (dataset loop) you should recognise where the data is scaled and the train_test split is made.
- There is then code to plot the data.
- Within the inner loop (classifier loop) you should recognise the fitting of the training data and the computation of the score from the test data.
- There is then code to plot the results. You should recognise the functions contourf and scatter.

While more involved and sophisticated, this code more-or-less corresponds to the example in the Contour_plot notebook.
60
Locate where the code creates the linearly separable dataset. Where can you set the added noise to zero? What about when it is 20? What happens to the score when you do this?
X += 2 * rng.uniform(size=X.shape). The score decreases by about 0.5 when increasing the noise.
61
What do these lines of code generate on a graph?
- make_moons: creates two interleaving half circles
  - shuffle: whether to shuffle the samples
  - noise: standard deviation of Gaussian noise added to the data
- make_circles: makes a larger circle containing a smaller circle
  - shuffle: whether to shuffle the samples
  - noise: standard deviation of Gaussian noise added to the data
  - factor: the scale factor between the inner and outer circles, in the range [0, 1)

More noise generally makes the score worse. For make_circles the score is low for the linear SVM, as the data is not linearly separable even with soft margins.
62
For binary classification problems, what will always result in a score of about 0.5?
Either randomly guessing the class in the binary classification problem, or setting all the test data to one of the classes (in our case we can just set y_test = 0 * y_test).
63
What is the difference between setting the kernel to linear or RBF?

- Linear uses the dot product of the input samples, which creates our usual hyperplane split with straight lines.
- The Radial Basis Function (RBF), also known as the Gaussian kernel, measures the similarity between two data points in an (implicitly) infinite-dimensional feature space and then approaches classification by majority vote. The kernel function is: K(x1, x2) = exp(-γ ||x1 - x2||²)
64
What values does gamma take for RBF SVM classification?
65
How would you update this to add a Poly SVM?
Setting degree = 1 for the poly kernel should reproduce the linear case. Note that the circles data is approximately rotationally symmetric, so linear classifiers will have a hard time agreeing on an orientation. As already discussed, the scores are bad even for noise-free circles data with linear classifiers.
66
import libraries for 3D plotting?
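The answer above is blank; a sketch of one common route. In modern matplotlib, requesting projection="3d" is enough (older code also imported Axes3D from mpl_toolkits.mplot3d); the surface plotted here is a made-up example:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs headlessly
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-2, 2, 40)
X, Y = np.meshgrid(x, x)
Z = X**2 + Y**2  # made-up function to plot as a surface

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # this is what enables 3D axes
ax.plot_surface(X, Y, Z, cmap="viridis")
ax.view_init(elev=30, azim=-60)  # elevation and azimuth viewing angles
```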
67
How do we plot this using 3D plotting?
You should see a graph of the function z = f(x, y) as a surface plot in three dimensions. Surface plots make local and global minima very obvious. You should also see faint lines on the surface showing the grid. Before discussing the Python code, it is worth changing the colouring of the surface: you can add cmap='plasma', 'viridis', or 'magma'.
68
Why do we need these when making 3D plots?
69
What does this function do?
elev rotates the view above the x-y plane, e.g. elev=90 looks at the plot from above, whereas -30 looks at it from 30 degrees below. Changing the azim (azimuth) angle rotates the view of the surface around the z-axis.
70
What does adding edgecolor='k' to the ax.plot_surface function do?
71
Plotting a scatter plot in 3D?
72
What does changing the point size to 50*z_points do? What about 500*np.abs(z)? What about changing the colours to colours = y_points + 10? What about to -y_points?
500*np.abs(z) just makes the scatter points larger.