Dataset Preparation Flashcards by Mohammed Ammar

Missing Data Imputation Strategy: Mode

Replace empty values with the most frequent value in the same column.

Train-Test Split

Dividing a dataset into two separate sets: one for training a model and another for evaluating its performance.

Data Imputation

Replacing missing entries with estimated values based on statistical measures or model predictions.
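
As a rough illustration of the statistical strategies in this deck (mean, median, mode), the sketch below fills one gap per column in a small made-up table. The column names and values are assumptions chosen only for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing entry per column.
df = pd.DataFrame({
    "Age": [22.0, 35.0, np.nan, 41.0, 30.0],
    "Income": [40000.0, np.nan, 52000.0, 61000.0, 100000.0],
    "City": ["Paris", "Lyon", "Paris", None, "Paris"],
})

# Mean: average of the observed values in the column.
df["Age"] = df["Age"].fillna(df["Age"].mean())
# Median: middle observed value, less sensitive to outliers than the mean.
df["Income"] = df["Income"].fillna(df["Income"].median())
# Mode: most frequent value; mode() returns a Series, so take its first entry.
df["City"] = df["City"].fillna(df["City"].mode()[0])

print(df)
```

The mode is the usual choice for categorical columns, where a mean or median is not defined.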

Algorithms that DON’T Need Normalization

Tree-based models such as Decision Trees and Random Forests, since their splits depend only on the ordering of feature values.

Inspect Dataset Structure and Statistics: Steps

Load the Data. Review Shape of Data. Display Data Types. Look at Basic Stats to Grasp Information Available.
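
The four steps above can be sketched as follows. To keep the example self-contained, the CSV is read from an in-memory string rather than a file; its columns and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """CustomerID,Age,Gender,Spend
1,34,F,120.5
2,,M,80.0
3,45,F,
4,29,M,60.25
"""

df = pd.read_csv(io.StringIO(csv_text))  # 1. load the data

print(df.shape)       # 2. (rows, columns)
print(df.dtypes)      # 3. data type of each column
df.info()             #    dtypes plus non-null counts (prints directly)
print(df.describe())  # 4. summary statistics for numeric columns
print(df.head())      #    first few rows
```

With a real file, the only change would be passing its path to pd.read_csv.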

Procedure to get the data types and non-null counts of a Pandas Dataframe

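Use the command $df.info()$ (it prints its report directly and returns None, so wrapping it in print() only adds a stray "None" line).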

Goal of Feature Selection

Keeping only the most informative features and discarding irrelevant or redundant ones to improve model performance.

Normalization Range

Scales data to a fixed interval, typically [0,1].

Formula for calculating normalized value using min-max scaling

$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}$
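
A small worked example may help: the sketch below applies the formula to four made-up values with pandas.

```python
import pandas as pd

# Hypothetical values to rescale.
ages = pd.Series([20.0, 30.0, 40.0, 60.0])

# Apply x_norm = (x - x_min) / (x_max - x_min) element-wise.
ages_norm = (ages - ages.min()) / (ages.max() - ages.min())

print(ages_norm.tolist())  # [0.0, 0.25, 0.5, 1.0]
```

The minimum maps to 0 and the maximum to 1. Scikit-learn's MinMaxScaler performs the same computation column by column.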

Cross-Validation

A technique for assessing how well a model generalizes to an independent dataset by training and validating on different subsets of the data.

Algorithms that Need Normalization

Algorithms that rely on distance metrics (KNN, K-Means) and models trained with gradient descent.

Missing Data Imputation Strategy: Mean

Replace empty values with the average of the other values in the same column.

When to Remove Duplicate Rows

Unless duplicates represent meaningful replications, get rid of them.

Python library commonly used for data manipulation and analysis

Pandas

Label Encoding

Converting categorical labels into numerical values, assigning a unique integer to each category.

Encode Categorical Features and Scale Numeric Ones: Steps

Encode Categorical Columns. Apply Normalization or Standardization to Numeric Columns.
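
One possible way to carry out both steps in a single pass is scikit-learn's ColumnTransformer. This is a sketch under assumptions: the column names and values are hypothetical, and one-hot encoding plus standardization are just one of several valid combinations.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with one categorical and two numeric columns.
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "M"],
    "Age": [34.0, 29.0, 45.0, 52.0],
    "Spend": [120.5, 80.0, 60.25, 200.0],
})

preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(), ["Gender"]),         # encode categorical columns
        ("num", StandardScaler(), ["Age", "Spend"]),  # scale numeric columns
    ],
    sparse_threshold=0,  # always return a dense array
)

X_prepared = preprocess.fit_transform(df)
print(X_prepared.shape)  # two one-hot columns plus two scaled columns
```

Swapping StandardScaler for MinMaxScaler would apply normalization instead of standardization.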

Duplicate Detection

The process of identifying and locating identical rows within a dataset that could potentially skew the analysis.

Duplicate CustomerID Rows

Remove them.

Procedure to display the number of rows and columns of a Pandas Dataframe

Use the command $print(df.shape)$

One-Hot Encoding

Converting categorical data into a binary matrix, creating a new column for each category.

Variance Thresholding Purpose

Removes features whose variance falls below a chosen threshold, since near-constant features carry little information.
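
A minimal sketch of variance thresholding with scikit-learn's VarianceThreshold follows. The feature matrix is made up, and the threshold of 0.0 (drop only constant columns) is an assumption; a real project would tune it.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical feature matrix: the second column is constant.
X = np.array([
    [1.0, 5.0, 0.1],
    [2.0, 5.0, 0.9],
    [3.0, 5.0, 0.4],
    [4.0, 5.0, 0.7],
])

selector = VarianceThreshold(threshold=0.0)  # drop zero-variance features
X_reduced = selector.fit_transform(X)

print(selector.get_support())  # boolean mask of the columns that were kept
print(X_reduced.shape)
```

Because variance depends on the scale of each feature, thresholds above zero are easier to interpret when the features share comparable units.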

Python library commonly used for machine learning algorithms

Scikit-learn

Missing Data Imputation Strategy: Median

Replace empty values with the middle value of the other values in the same column.

Procedure to view the first few rows of a Pandas Dataframe

Use the command $print(df.head())$

Standardization (Z-Score Scaling) Formula

$z = \frac{x - \mu}{\sigma}$

Standardization

Transforming data so it has a mean of 0 and a standard deviation of 1.

Scikit-learn

A Python library providing machine learning algorithms and tools for model training, evaluation, and more.

Formula for calculating standardized value using Z-score scaling

$z = \frac{x - \mu}{\sigma}$
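
To make the formula concrete, the sketch below standardizes a small made-up column by hand and compares the result with scikit-learn's StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data.
X = np.array([[10.0], [20.0], [30.0], [40.0]])

# Manual z-score using the population standard deviation (ddof=0).
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler applies the same formula column by column.
z_sklearn = StandardScaler().fit_transform(X)

print(z_manual.ravel())
print(np.allclose(z_manual, z_sklearn))
```

Note that pandas' .std() defaults to the sample standard deviation (ddof=1), so a pandas-based version gives slightly different values unless ddof=0 is passed.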

Normalization (Min-Max Scaling) Formula

$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}$

Procedure to perform cross-validation

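Use the command $scores = cross_val_score(rf, X, y, cv=5)$ after importing cross_val_score from sklearn.model_selection; rf is the model to evaluate, and cv=5 requests 5-fold cross-validation.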
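
A self-contained version of this command might look like the sketch below. The synthetic dataset and the Random Forest settings are assumptions made so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, used only so the example is self-contained.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

rf = RandomForestClassifier(n_estimators=50, random_state=42)

# cv=5: train on four folds, validate on the fifth, repeat five times.
scores = cross_val_score(rf, X, y, cv=5)

print(scores)         # one accuracy score per fold
print(scores.mean())  # average across folds
```

Reporting the mean together with the spread of the fold scores gives a steadier picture than a single train-test split.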

Procedure to get summary statistics for numeric columns of a Pandas Dataframe

Use the command $print(df.describe())$

Procedure to drop duplicate rows in Pandas Dataframe

Use the command $df_clean = df.drop_duplicates()$

Feature Importance

A measure of how significant each variable is in predicting the target variable; often visualized with charts.

Procedure to count the number of missing values per column of a Pandas Dataframe

Use the command $print(df.isnull().sum())$

Procedure to perform label encoding of a categorical column 'Size' in Pandas Dataframe

Use the command $df['Size_encoded'] = le.fit_transform(df['Size'])$, where $le = LabelEncoder()$ from sklearn.preprocessing.
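
The command above assumes an encoder object named le already exists. A runnable sketch with made-up 'Size' values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column.
df = pd.DataFrame({"Size": ["Small", "Large", "Medium", "Small"]})

le = LabelEncoder()
df["Size_encoded"] = le.fit_transform(df["Size"])

print(list(le.classes_))  # categories in sorted order
print(df)
```

LabelEncoder assigns integers in sorted (alphabetical) order, so here Large=0, Medium=1, Small=2, which does not reflect the natural size ordering. When the order matters, an explicit mapping may be a better fit. The scikit-learn documentation also describes LabelEncoder as intended for target labels rather than input features.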

Overfitting

A situation where a model learns the training data too well, resulting in poor performance on new, unseen data.

Procedure to Load a CSV file with Pandas

Use the command $df = pd.read_csv('your_dataset.csv')$

Procedure to perform one-hot encoding of a categorical column 'Gender' in Pandas Dataframe

Use the command $df_onehot = pd.get_dummies(df, columns=['Gender'])$
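
Applied to a small made-up table, the command behaves as sketched below.

```python
import pandas as pd

# Hypothetical data with one categorical column.
df = pd.DataFrame({
    "Gender": ["F", "M", "F"],
    "Age": [34, 29, 45],
})

# Each category of 'Gender' becomes its own indicator column.
df_onehot = pd.get_dummies(df, columns=["Gender"])

print(df_onehot.columns.tolist())  # ['Age', 'Gender_F', 'Gender_M']
print(df_onehot)
```

The original 'Gender' column is replaced by one indicator column per category, and untouched columns such as 'Age' are kept as they were.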

Feature Selection

The process of choosing a subset of relevant variables for use in model construction.

Procedure to fill the missing values of 'Age' column with median

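Use the command $df['Age'] = df['Age'].fillna(df['Age'].median())$ (assigning the result back avoids calling inplace=True on a column selection, which recent pandas versions warn about).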

Pandas

A Python library providing data structures and data analysis tools.

Missing 'Age' Value Best Approach

Impute it with the median or mean.

Target Variable

The column in a supervised learning dataset that the model aims to predict.

Why Split Data Into Training and Test Sets

Evaluating the model on data it did not see during training gives an honest estimate of performance and reveals overfitting.

Dataset Inspection

The initial process of examining data to understand its structure, types, and basic statistical properties.

Dataframe

A two-dimensional labeled data structure with columns of potentially different types.

Handle Missing Values and Duplicate Rows: Steps

Identify Missing Values. Impute Missing Values. Locate Duplicates. Remove Duplicates.
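
The four steps above can be sketched in order on a small made-up table. The 'CustomerID' and 'Age' columns and the choice of median imputation are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing 'Age' and one fully duplicated row.
df = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3, 4],
    "Age": [34.0, 29.0, 29.0, np.nan, 45.0],
})

print(df.isnull().sum())                          # 1. identify missing values
df["Age"] = df["Age"].fillna(df["Age"].median())  # 2. impute missing values
print(df[df.duplicated()])                        # 3. locate duplicates
df_clean = df.drop_duplicates().reset_index(drop=True)  # 4. remove duplicates

print(df_clean)
```

Following this order, the duplicated row still counts toward the median used for imputation. Depending on the dataset, removing duplicates first may be preferable.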

Standardization Values

Transforms data to have a mean of 0 and a standard deviation of 1.

Procedure to split the data into training and test sets

Use the command $X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)$
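
A self-contained sketch of the same call, using made-up arrays so the resulting sizes can be checked:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 samples, 2 features, binary target.
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)
```

For imbalanced classification targets, passing stratify=y keeps the class proportions similar in both sets.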

Procedure to view duplicate rows in Pandas Dataframe

Use the command $print(df[df.duplicated()])$

Normalization

Rescaling data values to fit within a specific range, often between 0 and 1.

Drawback of Many Irrelevant Variables

Increases computational cost and may hurt accuracy.

Visualize Variable Importance with Random Forests: Steps

Train a Random Forest. Extract Variable Importances. Plot Importances to Rank Variables.
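
The steps above can be sketched as follows. The dataset is synthetic and the feature names are hypothetical. Plotting is left as a comment so the example does not depend on matplotlib.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only so the example is self-contained.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

# 1. Train a Random Forest.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# 2. Extract variable importances (they sum to 1).
importances = pd.Series(rf.feature_importances_, index=feature_names)

# 3. Rank them; ranked.plot.barh() would draw the chart if matplotlib is installed.
ranked = importances.sort_values(ascending=False)
print(ranked)
```

These impurity-based importances are computed on the training data, so they are best read as a rough ranking rather than an exact measure of each variable's effect.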

Dataset Preparation Flashcards

Using Turbo learn AI (53 cards)