Why do you typically not need to normalize the data when using a decision tree algorithm?
Decision trees don’t rely on distance metrics, so normalization is generally not necessary.
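A quick pure-Python sanity check of why: a tree split is just a threshold comparison, so a monotonic rescaling such as min-max normalization (with the threshold rescaled the same way) leaves the partition of the data unchanged. The numbers below are invented for illustration.

```python
# A decision-tree split is a threshold comparison, so min-max scaling
# the feature and the threshold together yields the same partition.
raw = [2.0, 10.0, 50.0, 200.0]

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max(raw)

# Split the raw feature at 30.0; rescale the threshold the same way.
raw_threshold = 30.0
scaled_threshold = (raw_threshold - min(raw)) / (max(raw) - min(raw))

left_raw = [v <= raw_threshold for v in raw]
left_scaled = [v <= scaled_threshold for v in scaled]
print(left_raw == left_scaled)  # the partition is identical
```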
What are the steps for preparing the dataset for analysis and modeling?
1- Understand the Dataset
2- Handle Missing Data
3- Handle Duplicates
4- Data Transformation
5- Splitting The Dataset
6- Feature Selection
What are the substeps of “Understand the Dataset”?
1- Inspect the Data: Load your dataset and understand its structure, including the columns, data types, and overall format.
2- Identify the target variable: If you’re doing supervised learning, ensure you know which column is your target or label.
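These two substeps might look like the following with pandas; the frame and column names here, including the `churned` target, are invented for illustration.

```python
import pandas as pd

# A tiny example frame standing in for your loaded dataset.
df = pd.DataFrame({
    "age": [25, 32, 47],
    "city": ["Cairo", "Lagos", "Nairobi"],
    "churned": [0, 1, 0],  # suppose this is the target for supervised learning
})

print(df.shape)   # (rows, columns)
print(df.dtypes)  # data type of each column
print(df.head())  # first few rows to check the overall format
```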
If unsupervised learning has no predefined target variable, how does it operate?
The model finds patterns, clusters, or relationships within the data without being told what to predict. At most, a pseudo-target (e.g., a cluster assignment) is derived from the data itself.

What are the substeps of handling missing data?
1- Identify missing values.
2- Decide how to handle the missing data (techniques).
What are the techniques for handling missing data?
1- Imputation: Replacing missing values with means (for continuous data), medians (for continuous or ordinal data), or modes (for categorical or discrete data).
2- Prediction: Using machine learning models to predict missing values.
3- Dropping: Removing rows or columns with missing data (when appropriate).
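A minimal pandas sketch of imputation and dropping (the columns here are invented; prediction-based imputation would need a fitted model and is omitted):

```python
import pandas as pd

df = pd.DataFrame({
    "income": [40_000.0, None, 55_000.0, 60_000.0],  # continuous
    "grade":  ["B", "A", None, "B"],                 # categorical
})

# 1) Imputation: mean for the continuous column, mode for the categorical one.
df["income"] = df["income"].fillna(df["income"].mean())
df["grade"] = df["grade"].fillna(df["grade"].mode()[0])

# 3) Dropping (alternative): remove any rows that still contain missing values.
df_clean = df.dropna()
print(df.isna().sum().sum())  # 0 missing values remain after imputation
```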
How to handle duplicate data in a dataset?
Detect and remove duplicate rows: Ensure there are no repeated entries in your dataset unless duplicates are meaningful for your analysis.
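With pandas, detection and removal could look like this (toy data for illustration):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3], "value": ["a", "b", "b", "c"]})

n_dupes = df.duplicated().sum()  # detect: count fully repeated rows
df = df.drop_duplicates()        # remove them, keeping the first occurrence
print(n_dupes, len(df))
```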
What are data transformations?
The process of converting data from one format or structure into another.
What are the techniques used in data transformations?
1- Normalization
2- Standardization
3- Encoding Categorical Variables
What is normalization?
Rescaling data to a specific range (e.g., 0 to 1).
What is standardization?
Transforming data to have a mean of 0 and a standard deviation of 1.
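Both transformations can be sketched in a few lines of plain Python (using the standard `statistics` module; in practice a library such as scikit-learn would typically handle this):

```python
import statistics

def normalize(values):
    """Min-max normalization: rescale values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score standardization: shift to mean 0, scale to std dev 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / sd for v in values]

data = [10, 20, 30, 40]
print(normalize(data))     # first element is 0.0, last is 1.0
z = standardize(data)
print(statistics.mean(z))  # ~0.0
```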
What are the types of encoding categorical variables?
1- One-Hot Encoding: used for nominal (non-ordinal) categorical data, where the categories have no inherent order.
2- Label Encoding: used for ordinal categorical data, where the categories have a natural order (e.g., small < medium < large).
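A short pandas sketch of both encodings, assuming made-up `color` (nominal) and `size` (ordinal) columns:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red"],        # nominal: no order
    "size":  ["small", "large", "medium"],  # ordinal: small < medium < large
})

# One-hot encoding for the nominal column: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label (ordinal) encoding for the ordered column, preserving the order.
order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(order)

print(one_hot.columns.tolist())     # ['color_blue', 'color_red']
print(df["size_encoded"].tolist())  # [0, 2, 1]
```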
Why standardize or normalize data?
To ensure that each feature contributes equally to the model, regardless of its original scale.
What is data manipulation?
The process of adjusting, organizing, or modifying data to make it more suitable for analysis.
What are the common methods in data manipulation?
1- Filtering
2- Joining
3- Sorting
4- Aggregation
5- Reshaping
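One toy pandas example touching each of the five methods (the `sales` and `stores` frames are invented):

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "revenue": [100, 120, 80, 90],
})
stores = pd.DataFrame({"store": ["A", "B"], "city": ["Cairo", "Giza"]})

filtered = sales[sales["revenue"] > 85]                  # 1) filtering
joined = sales.merge(stores, on="store")                 # 2) joining
by_rev = sales.sort_values("revenue", ascending=False)   # 3) sorting
totals = sales.groupby("store")["revenue"].sum()         # 4) aggregation
wide = sales.pivot(index="store", columns="month",
                   values="revenue")                     # 5) reshaping
print(totals.to_dict())  # {'A': 220, 'B': 170}
```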
Why do we need standardization?
1- Equal Contribution of Features: Prevents features with larger ranges from dominating model training.
2- Distance-Based Algorithms: Essential for KNN, SVM, and other algorithms that rely on distance calculations.
3- Handling Different Units: Standardization allows comparison across features with different units (e.g., cm vs. kg).
What is the purpose of normalization?
To ensure that all features contribute equally to the model and improve algorithm performance.
What are the steps for normalizing an image dataset?
1- Understand the Pixel Value Range.
2- Normalize by scaling to [0, 1].
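For 8-bit images, pixel values span 0 to 255, so step 2 is just a division by 255.0. A tiny stand-in example:

```python
# 8-bit grayscale pixels range from 0 to 255; dividing by 255.0
# rescales every pixel to the [0, 1] range.
image = [[0, 64], [128, 255]]  # a 2x2 stand-in image

normalized = [[px / 255.0 for px in row] for row in image]
print(normalized[1][1])  # 1.0
```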
What are the types of splitting the dataset?
1- Train-Test Split: Divide your dataset into training and testing subsets, typically a 70-30 or 80-20 ratio.
2- Cross-Validation: Consider using cross-validation for more robust evaluation.
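A minimal train-test split can be sketched in plain Python (in practice a library helper such as scikit-learn's splitter is typical; the 80-20 ratio below matches the text):

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle, then hold out the last test_ratio fraction for testing."""
    rows = rows[:]  # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

data = list(range(10))
train, test = train_test_split(data, test_ratio=0.2)
print(len(train), len(test))  # 8 2
```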
How do you select features?
Use techniques like correlation analysis, variance thresholding, or feature importance from models (like Random Forest) to identify the most relevant features and reduce dimensionality.
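Variance thresholding, the simplest of these techniques, can be sketched in plain Python (the feature names and values are invented):

```python
import statistics

def variance_threshold(features, threshold=0.0):
    """Keep only feature columns whose variance exceeds the threshold."""
    return {
        name: column
        for name, column in features.items()
        if statistics.pvariance(column) > threshold
    }

features = {
    "constant": [1, 1, 1, 1],  # zero variance: carries no information
    "useful":   [3, 7, 1, 9],
}
kept = variance_threshold(features)
print(list(kept))  # ['useful']
```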
What is the benefit of feature selection?
It removes irrelevant or redundant features that don’t contribute to the model’s performance.