Dataset Preparation Flashcards

Using Turbo learn AI (53 cards)

1
Q

Missing Data Imputation Strategy: Mode

A

Replace empty values with the most frequent value in the same column.
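A minimal sketch of mode imputation with pandas. The `Color` column and its values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: one missing entry in a categorical column.
df = pd.DataFrame({"Color": ["red", "blue", "red", None, "red"]})

# mode() returns a Series because ties are possible, so take the first entry.
most_frequent = df["Color"].mode()[0]
df["Color"] = df["Color"].fillna(most_frequent)

print(df["Color"].tolist())
```

Mode imputation is usually the choice for categorical columns, where a mean or median is not defined.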

2
Q

Train-Test Split

A

Dividing a dataset into two separate sets: one for training a model and another for evaluating its performance.

3
Q

Data Imputation

A

Replacing missing entries with estimated values based on statistical measures or model predictions.

4
Q

Algorithms that DON’T Need Normalization

A

Decision Trees and Random Forest.

5
Q

Inspect Dataset Structure and Statistics: Steps

A

Load the Data. Review Shape of Data. Display Data Types. Look at Basic Stats to Grasp Information Available.
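The four steps above can be sketched as follows. The CSV content is a hypothetical stand-in read from memory; with a real dataset, a file path would be passed to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a file on disk.
csv_text = "Age,Size,Price\n25,S,10.5\n32,M,12.0\n47,L,9.5\n"

df = pd.read_csv(io.StringIO(csv_text))  # 1. load the data
print(df.shape)                          # 2. (rows, columns)
print(df.dtypes)                         # 3. data type of each column
print(df.describe())                     # 4. basic statistics for numeric columns
```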

6
Q

Procedure to get the data types and non-null counts of a Pandas Dataframe

A

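Use the command $df.info()$ (it prints its summary directly and returns None, so wrapping it in $print()$ also prints a stray "None").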

7
Q

Goal of Feature Selection

A

Keeping only the most informative features and discarding irrelevant or redundant ones to improve model performance.

8
Q

Normalization Range

A

Scales data to a fixed interval, typically [0,1].

9
Q

Formula for calculating normalized value using min-max scaling

A

$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}$
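A small worked example of the formula in plain Python. The function name `min_max_scale` and the handling of a constant column are assumptions made for this sketch.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] with min-max scaling."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Assumption: a constant column maps to all zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(x - x_min) / (x_max - x_min) for x in values]


scaled = min_max_scale([10, 20, 30, 50])
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

Here the minimum (10) maps to 0, the maximum (50) maps to 1, and 20 maps to (20 - 10) / (50 - 10) = 0.25.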

10
Q

Cross-Validation

A

A technique for assessing how well a model generalizes to an independent dataset by training and validating on different subsets of the data.

11
Q

Algorithms that Need Normalization

A

Algorithms that rely on distance metrics (KNN, K-Means) and models trained with gradient descent (for example linear models and neural networks).

12
Q

Missing Data Imputation Strategy: Mean

A

Replace empty values with the average of the other values in the same column.

13
Q

When to Remove Duplicate Rows

A

Unless duplicates represent meaningful replications, get rid of them.

14
Q

Python library commonly used for data manipulation and analysis

A

Pandas

15
Q

Label Encoding

A

Converting categorical labels into numerical values, assigning a unique integer to each category.

16
Q

Encode Categorical Features and Scale Numeric Ones: Steps

A

Encode Categorical Columns. Apply Normalization or Standardization to Numeric Columns.

17
Q

Duplicate Detection

A

The process of identifying and locating identical rows within a dataset that could potentially skew the analysis.

18
Q

Duplicate CustomerID Rows

A

Remove them.

19
Q

Procedure to display the number of rows and columns of a Pandas Dataframe

A

Use the command $print(df.shape)$

20
Q

One-Hot Encoding

A

Converting categorical data into a binary matrix, creating a new column for each category.

21
Q

Variance Thresholding Purpose

A

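Removes variables whose variance falls below a chosen threshold, since near-constant variables carry little information for a model.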
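One way to apply this is scikit-learn's `VarianceThreshold`. The sketch below assumes a small hypothetical feature matrix whose first column is constant.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical feature matrix: the first column is constant (variance 0).
X = np.array([[1, 2.0, 10.0],
              [1, 4.0, 20.0],
              [1, 6.0, 30.0]])

selector = VarianceThreshold(threshold=0.0)  # drops zero-variance features
X_reduced = selector.fit_transform(X)
kept = selector.get_support()  # boolean mask of the retained columns

print(kept)
print(X_reduced.shape)
```

Variance depends on the scale of each column, so a threshold above zero is only comparable across features that share similar units or have been scaled first.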

22
Q

Python library commonly used for machine learning algorithms

A

Scikit-learn

23
Q

Missing Data Imputation Strategy: Median

A

Replace empty values with the middle value of the other values in the same column.

24
Q

Procedure to view the first few rows of a Pandas Dataframe

A

Use the command $print(df.head())$

25
Standardization (Z-Score Scaling) Formula
$z = \frac{x - \mu}{\sigma}$
26
Standardization
Transforming data so it has a mean of 0 and a standard deviation of 1.
27
Scikit-learn
A Python library providing machine learning algorithms and tools for model training, evaluation, and more.
28
Formula for calculating standardized value using Z-score scaling
$z = \frac{x - \mu}{\sigma}$
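A worked example of the formula using only the standard library. The sample values are made up, and the sketch assumes the population standard deviation for σ, which is also what scikit-learn's `StandardScaler` uses; pandas' `Series.std()` defaults to the sample standard deviation instead.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]
mu = statistics.fmean(values)      # mean: 5.0
sigma = statistics.pstdev(values)  # population standard deviation: 2.0

z_scores = [(x - mu) / sigma for x in values]
print(z_scores)
```

After the transformation the values have mean 0 and standard deviation 1; for instance, 9 becomes (9 - 5) / 2 = 2.0.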
29
Normalization (Min-Max Scaling) Formula
$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}$
30
Procedure to perform cross-validation
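Use the command $scores = cross_val_score(rf, X, y, cv=5)$, where $rf$ is a scikit-learn estimator (for example a random forest), $X$ the features, and $y$ the target.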
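A self-contained sketch of the command above. The synthetic dataset and the choice of `RandomForestClassifier` for `rf` are assumptions made so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a real feature matrix X and target y.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(rf, X, y, cv=5)  # one score per fold (accuracy by default)

print(scores.mean(), scores.std())
```

Reporting the mean together with the spread across folds gives a steadier estimate than a single train-test split.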
31
Procedure to get summary statistics for numeric columns of a Pandas Dataframe
Use the command $print(df.describe())$
32
Procedure to drop duplicate rows in Pandas Dataframe
Use the command $df_clean = df.drop_duplicates()$
33
Feature Importance
A measure of how significant each variable is in predicting the target variable; often visualized with charts.
34
Procedure to count the number of missing values per column of a Pandas Dataframe
Use the command $print(df.isnull().sum())$
35
Procedure to perform label encoding of a categorical column 'Size' in Pandas Dataframe
Use the command $df['Size_encoded'] = le.fit_transform(df['Size'])$, where $le$ is a scikit-learn $LabelEncoder()$ instance.
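A runnable sketch of this command, assuming `le` is scikit-learn's `LabelEncoder` and using hypothetical `Size` values.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Size": ["M", "S", "L", "M"]})

le = LabelEncoder()
df["Size_encoded"] = le.fit_transform(df["Size"])

# Classes are sorted alphabetically, so the integers do not follow S < M < L.
print(list(le.classes_))            # ['L', 'M', 'S']
print(df["Size_encoded"].tolist())  # [1, 2, 0, 1]
```

Because the integers follow alphabetical order rather than the natural size order, an explicit mapping such as `{"S": 0, "M": 1, "L": 2}` may be preferable when the order matters to the model.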
36
Overfitting
A situation where a model learns the training data too well, resulting in poor performance on new, unseen data.
37
Procedure to Load a CSV file with Pandas
Use the command $df = pd.read_csv('your_dataset.csv')$
38
Procedure to perform one-hot encoding of a categorical column 'Gender' in Pandas Dataframe
Use the command $df_onehot = pd.get_dummies(df, columns=['Gender'])$
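A small sketch of what this command produces, using a hypothetical two-column DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["F", "M", "F"], "Age": [31, 45, 28]})

# The 'Gender' column is replaced by one indicator column per category.
df_onehot = pd.get_dummies(df, columns=["Gender"])

print(df_onehot.columns.tolist())  # ['Age', 'Gender_F', 'Gender_M']
```

The original `Gender` column is dropped and the untouched columns are kept as they were.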
39
Feature Selection
The process of choosing a subset of relevant variables for use in model construction.
40
Procedure to fill the missing values of 'Age' column with median
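Use the command $df['Age'] = df['Age'].fillna(df['Age'].median())$ (assigning the result back is more reliable than calling $fillna(..., inplace=True)$ on a selected column, which recent pandas versions warn about).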
41
Pandas
A Python library providing data structures and data analysis tools.
42
Missing 'Age' Value Best Approach
Impute it with the median or mean; the median is more robust when the column is skewed or contains outliers.
43
Target Variable
The column in a supervised learning dataset that the model aims to predict.
44
Why Split Data Into Training and Test Sets
To evaluate the model on unseen data, which reveals overfitting that the training-set score alone would hide.
45
Dataset Inspection
The initial process of examining data to understand its structure, types, and basic statistical properties.
46
Dataframe
A two-dimensional labeled data structure with columns of potentially different types.
47
Handle Missing Values and Duplicate Rows: Steps
Identify Missing Values. Impute Missing Values. Locate Duplicates. Remove Duplicates.
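The four steps can be sketched with pandas as follows. The `CustomerID` and `Age` values are hypothetical, and median imputation is only one of the possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing Age and one fully duplicated row.
df = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3, 4],
    "Age": [30.0, 40.0, 40.0, np.nan, 50.0],
})

missing_per_column = df.isnull().sum()            # 1. identify missing values
df["Age"] = df["Age"].fillna(df["Age"].median())  # 2. impute with the median
duplicate_rows = df[df.duplicated()]              # 3. locate duplicates
df_clean = df.drop_duplicates()                   # 4. remove duplicates

print(missing_per_column)
print(duplicate_rows)
print(df_clean)
```

The order follows the card; removing duplicates before imputing would keep repeated rows from influencing the median.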
48
Standardization Values
Transforms data to have a mean of 0 and a standard deviation of 1.
49
Procedure to split the data into training and test sets
Use the command $X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)$
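A self-contained sketch of this command, assuming `train_test_split` comes from `sklearn.model_selection` and using a hypothetical 10-sample dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

`test_size=0.2` holds out 20% of the rows, and fixing `random_state` makes the shuffle reproducible.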
50
Procedure to view duplicate rows in Pandas Dataframe
Use the command $print(df[df.duplicated()])$
51
Normalization
Rescaling data values to fit within a specific range, often between 0 and 1.
52
Drawback of Many Irrelevant Variables
Increases computational cost and may hurt accuracy.
53
Visualize Variable Importance with Random Forests: Steps
Train a Random Forest. Extract Variable Importances. Plot Importances to Rank Variables.
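A minimal sketch of these steps with scikit-learn. The synthetic dataset and feature names are hypothetical, and a text bar per variable stands in for a real chart (a plotting library such as matplotlib would normally be used).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; the feature names are hypothetical placeholders.
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]

# 1. Train a random forest.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# 2. Extract the impurity-based importances (they sum to 1) and rank them.
ranking = sorted(zip(names, rf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)

# 3. "Plot" the ranking as text bars, most important variable first.
for name, score in ranking:
    print(f"{name}: {score:.3f} " + "#" * int(score * 40))
```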