Missing Data Imputation Strategy: Mode
Replace missing values with the most frequent value in the same column.
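Example: a minimal sketch of mode imputation with pandas. The `color` column and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical column with one missing entry.
df = pd.DataFrame({"color": ["red", "blue", "red", None]})

# mode() ignores missing values and returns a Series (ties are possible),
# so take the first entry as the fill value.
most_frequent = df["color"].mode()[0]
df["color"] = df["color"].fillna(most_frequent)

print(df)
```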
Train-Test Split
Dividing a dataset into two separate sets: one for training a model and another for evaluating its performance.
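Example: one common way to do this is scikit-learn's `train_test_split`. The toy data, the 80/20 ratio, and the seed below are assumptions chosen for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with one feature each, and alternating labels.
X = [[i] for i in range(10)]
y = [0, 1] * 5

# Hold out 20% of the rows for evaluation; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```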
Data Imputation
Replacing missing entries with estimated values based on statistical measures or model predictions.
Algorithms that DON’T Need Normalization
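Tree-based models such as Decision Trees and Random Forests, because they split on feature thresholds and are unaffected by feature scale.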
Inspect Dataset Structure and Statistics: Steps
Load the Data. Review Shape of Data. Display Data Types. Look at Basic Stats to Grasp Information Available.
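Example: the steps above can be sketched with pandas as follows. The small in-memory table stands in for a real file; in practice the data would likely come from something like `pd.read_csv(...)` with your own path.

```python
import pandas as pd

# Stand-in for loading a file; the columns and values are hypothetical.
df = pd.DataFrame({"age": [25, 32, 47], "city": ["Oslo", "Lima", "Pune"]})

shape = df.shape        # (rows, columns)
dtypes = df.dtypes      # data type of each column
stats = df.describe()   # count, mean, std, min, quartiles, max for numeric columns

print(shape)
print(dtypes)
print(stats)
print(df.head())        # first few rows
```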
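Procedure to get the data types and non-null counts of a pandas DataFrame
Use the command $df.info()$. It prints its summary directly and returns None, so wrapping it in $print()$ adds a stray "None" line to the output.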
Goal of Feature Selection
Keeping only the most informative features and discarding irrelevant or redundant ones to improve model performance.
Normalization Range
Scales data to a fixed interval, typically [0,1].
Formula for calculating normalized value using min-max scaling
$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}$
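Example: a worked instance of the formula in plain Python, using made-up values. It assumes the column is not constant; if $x_{max} = x_{min}$ the denominator is zero.

```python
# Hypothetical column values.
values = [10.0, 20.0, 40.0]

x_min, x_max = min(values), max(values)

# Apply (x - x_min) / (x_max - x_min) to every value.
# Assumes x_max != x_min; a constant column would divide by zero.
normalized = [(x - x_min) / (x_max - x_min) for x in values]

print(normalized)
```

The smallest value maps to 0, the largest to 1, and 20 maps to (20 - 10) / (40 - 10) = 1/3.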
Cross-Validation
A technique for assessing how well a model generalizes to an independent dataset by training and validating on different subsets of the data.
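Example: a sketch of 5-fold cross-validation with scikit-learn. The iris dataset, the decision tree, and the fold count are illustrative choices, not part of the definition.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Train on 4 folds and validate on the remaining one, rotating 5 times.
scores = cross_val_score(model, X, y, cv=5)

print(scores)          # one accuracy score per fold
print(scores.mean())   # average across folds
```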
Algorithms that Need Normalization
Algorithms that rely on distance metrics (KNN, K-Means) and models trained with gradient descent.
Missing Data Imputation Strategy: Mean
Replace missing values with the mean (average) of the non-missing values in the same column.
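Example: a minimal sketch of mean imputation with pandas. The `income` column is hypothetical.

```python
import pandas as pd

# Hypothetical numeric column with one missing entry.
df = pd.DataFrame({"income": [40.0, 60.0, None, 80.0]})

# mean() skips missing values by default, so this averages 40, 60 and 80.
column_mean = df["income"].mean()
df["income"] = df["income"].fillna(column_mean)

print(df)
```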
When to Remove Duplicate Rows
Remove duplicate rows unless they represent meaningful repeated observations.
Python library commonly used for data manipulation and analysis
Pandas
Label Encoding
Converting categorical labels into numerical values, assigning a unique integer to each category.
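Example: a sketch using scikit-learn's `LabelEncoder` on made-up size labels. The integers follow the sorted order of the category names, which may not match any real-world ordering.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical labels.
sizes = ["small", "large", "medium", "small"]

encoder = LabelEncoder()
codes = encoder.fit_transform(sizes)

# classes_ holds the categories in sorted order; each code is an index into it.
print(list(encoder.classes_))
print(list(codes))
```

Here "large" sorts first, so it receives the lowest integer even though it is not the smallest size.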
Encode Categorical Features and Scale Numeric Ones: Steps
Encode Categorical Columns. Apply Normalization or Standardization to Numeric Columns.
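Example: one possible way to carry out both steps at once is scikit-learn's `ColumnTransformer`. The column names, the choice of one-hot encoding, and the choice of standardization are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical table with one categorical and one numeric column.
df = pd.DataFrame({"city": ["Oslo", "Lima", "Oslo"], "age": [20.0, 30.0, 40.0]})

preprocessor = ColumnTransformer(
    [
        ("cat", OneHotEncoder(), ["city"]),   # encode the categorical column
        ("num", StandardScaler(), ["age"]),   # standardize the numeric column
    ],
    sparse_threshold=0,  # always return a dense array
)

transformed = preprocessor.fit_transform(df)
print(transformed)
```

The result has two one-hot columns for `city` followed by one standardized `age` column.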
Duplicate Detection
The process of identifying and locating identical rows within a dataset that could potentially skew the analysis.
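Example: a sketch of detecting fully identical rows with pandas. The table is made up.

```python
import pandas as pd

# Hypothetical table in which the third row repeats the second.
df = pd.DataFrame({"id": [1, 2, 2, 3], "score": [90, 85, 85, 70]})

# duplicated() marks every repeat after the first occurrence as True.
dup_mask = df.duplicated()
n_duplicates = int(dup_mask.sum())

# drop_duplicates() keeps the first occurrence of each row.
deduped = df.drop_duplicates()

print(n_duplicates)
print(deduped)
```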
Duplicate CustomerID Rows
Remove them so that each CustomerID appears only once.
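Example: a sketch of dropping rows that repeat a `CustomerID` even when other columns differ. The `plan` column and the choice to keep the first occurrence are assumptions; which row to keep depends on the data.

```python
import pandas as pd

# Hypothetical table: CustomerID 102 appears twice with different plans.
df = pd.DataFrame(
    {"CustomerID": [101, 102, 102, 103], "plan": ["basic", "pro", "basic", "pro"]}
)

# Compare rows on CustomerID only and keep the first row for each ID.
unique_customers = df.drop_duplicates(subset="CustomerID", keep="first")

print(unique_customers)
```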
Procedure to display the number of rows and columns of a pandas DataFrame
Use the command $print(df.shape)$
One-Hot Encoding
Converting categorical data into a binary matrix, creating a new column for each category.
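Example: a sketch using `pd.get_dummies` on a made-up `color` column.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"color": ["red", "green", "red"]})

# One new indicator column per category, named <column>_<category>.
encoded = pd.get_dummies(df, columns=["color"])

print(encoded)
```

Each row has exactly one indicator set, marking the category that row belonged to.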
Variance Thresholding Purpose
Removes features whose variance falls below a chosen threshold, since near-constant features carry little information.
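Example: a sketch with scikit-learn's `VarianceThreshold`. The toy matrix and the threshold of 0.0 (drop only constant columns) are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: the first column is constant, the second varies.
X = np.array([[0.0, 1.0], [0.0, 2.0], [0.0, 3.0]])

selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

# get_support() reports which original columns were kept.
kept = selector.get_support()

print(kept)
print(X_reduced)
```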
Python library commonly used for machine learning algorithms
Scikit-learn
Missing Data Imputation Strategy: Median
Replace missing values with the median (the middle value when sorted) of the non-missing values in the same column.
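Example: a sketch of median imputation using scikit-learn's `SimpleImputer`. The values are made up and include one outlier to show that the median is less affected by it than the mean would be.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature data with one missing entry and one outlier (100).
X = np.array([[1.0], [2.0], [np.nan], [100.0]])

imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)

# The median of 1, 2 and 100 is 2, so the gap is filled with 2.
print(X_filled)
```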
Procedure to view the first few rows of a pandas DataFrame
Use the command $print(df.head())$