Why do you typically not need to normalize the data when using a decision tree algorithm?
Decision trees don’t rely on distance metrics, so normalization is generally not necessary.
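A quick pure-Python sanity check of why: a tree split is just a threshold comparison, so a monotonic rescaling such as min-max normalization (with the threshold rescaled the same way) leaves the partition of the data unchanged. The numbers below are invented for illustration.

```python
# A decision-tree split is a threshold comparison, so min-max scaling
# the feature and the threshold together yields the same partition.
raw = [2.0, 10.0, 50.0, 200.0]

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max(raw)

# Split the raw feature at 30.0; rescale the threshold the same way.
raw_threshold = 30.0
scaled_threshold = (raw_threshold - min(raw)) / (max(raw) - min(raw))

left_raw = [v <= raw_threshold for v in raw]
left_scaled = [v <= scaled_threshold for v in scaled]
print(left_raw == left_scaled)  # the partition is identical
```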
What are the steps for preparing the dataset for analysis and modeling?
1- Understand the Dataset
2- Handle Missing Data
3- Handle Duplicates
4- Data Transformation
5- Splitting The Dataset
6- Feature Selection
What are the substeps of “Understand the Dataset”?
1- Inspect the Data: Load your dataset and understand its structure, including the columns, data types, and overall format.
2- Identify the target variable: If you’re doing supervised learning, ensure you know which column is your target or label.
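These two substeps might look like the following with pandas; the frame and column names here, including the `churned` target, are invented for illustration.

```python
import pandas as pd

# A tiny example frame standing in for your loaded dataset.
df = pd.DataFrame({
    "age": [25, 32, 47],
    "city": ["Cairo", "Lagos", "Nairobi"],
    "churned": [0, 1, 0],  # suppose this is the target for supervised learning
})

print(df.shape)   # (rows, columns)
print(df.dtypes)  # data type of each column
print(df.head())  # first few rows to check the overall format
```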
If unsupervised learning has no predefined target variable, how does it operate?
The model finds patterns, clusters, or relationships within the data without being told what to predict. At most, a pseudo-target (e.g., a cluster assignment) is derived from the data itself.

What are the substeps of handling missing data?
1- Identify missing values.
2- Decide how to handle the missing data (techniques).
What are the techniques for handling missing data?
1- Imputation: Replacing missing values with means (for continuous data), medians (for continuous or ordinal data), or modes (for categorical or discrete data).
2- Prediction: Using machine learning models to predict missing values.
3- Dropping: Removing rows or columns with missing data (when appropriate).
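A minimal pandas sketch of imputation and dropping (the columns here are invented; prediction-based imputation would need a fitted model and is omitted):

```python
import pandas as pd

df = pd.DataFrame({
    "income": [40_000.0, None, 55_000.0, 60_000.0],  # continuous
    "grade":  ["B", "A", None, "B"],                 # categorical
})

# 1) Imputation: mean for the continuous column, mode for the categorical one.
df["income"] = df["income"].fillna(df["income"].mean())
df["grade"] = df["grade"].fillna(df["grade"].mode()[0])

# 3) Dropping (alternative): remove any rows that still contain missing values.
df_clean = df.dropna()
print(df.isna().sum().sum())  # 0 missing values remain after imputation
```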
How to handle duplicate data in a dataset?
Detect and remove duplicate rows: Ensure there are no repeated entries in your dataset unless duplicates are meaningful for your analysis.
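With pandas, detection and removal could look like this (toy data for illustration):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3], "value": ["a", "b", "b", "c"]})

n_dupes = df.duplicated().sum()  # detect: count fully repeated rows
df = df.drop_duplicates()        # remove them, keeping the first occurrence
print(n_dupes, len(df))
```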
What are data transformations?
The process of converting data from one format or structure into another.
What are the techniques used in data transformations?
1- Normalization
2- Standardization
3- Encoding Categorical Variables
What is normalization?
Rescaling data to a specific range (e.g., 0 to 1).
What is standardization?
Transforming data to have a mean of 0 and a standard deviation of 1.
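Both transformations can be sketched in a few lines of plain Python (using the standard `statistics` module; in practice a library such as scikit-learn would typically handle this):

```python
import statistics

def normalize(values):
    """Min-max normalization: rescale values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score standardization: shift to mean 0, scale to std dev 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / sd for v in values]

data = [10, 20, 30, 40]
print(normalize(data))     # first element is 0.0, last is 1.0
z = standardize(data)
print(statistics.mean(z))  # ~0.0
```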
What are the types of encoding categorical variables?
1- One-Hot Encoding: used for nominal (non-ordinal) categorical data, where the categories have no inherent order.
2- Label Encoding: used for ordinal categorical data, where the categories have a natural order (e.g., small < medium < large).
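A short pandas sketch of both encodings, assuming made-up `color` (nominal) and `size` (ordinal) columns:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red"],        # nominal: no order
    "size":  ["small", "large", "medium"],  # ordinal: small < medium < large
})

# One-hot encoding for the nominal column: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label (ordinal) encoding for the ordered column, preserving the order.
order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(order)

print(one_hot.columns.tolist())     # ['color_blue', 'color_red']
print(df["size_encoded"].tolist())  # [0, 2, 1]
```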
Why standardize or normalize data?
To ensure that each feature contributes equally to the model, regardless of its original scale.
What is data manipulation?
The process of adjusting, organizing, or modifying data to make it more suitable for analysis.
What are the common methods in data manipulation?
1- Filtering
2- Joining
3- Sorting
4- Aggregation
5- Reshaping
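One toy pandas example touching each of the five methods (the `sales` and `stores` frames are invented):

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "revenue": [100, 120, 80, 90],
})
stores = pd.DataFrame({"store": ["A", "B"], "city": ["Cairo", "Giza"]})

filtered = sales[sales["revenue"] > 85]                  # 1) filtering
joined = sales.merge(stores, on="store")                 # 2) joining
by_rev = sales.sort_values("revenue", ascending=False)   # 3) sorting
totals = sales.groupby("store")["revenue"].sum()         # 4) aggregation
wide = sales.pivot(index="store", columns="month",
                   values="revenue")                     # 5) reshaping
print(totals.to_dict())  # {'A': 220, 'B': 170}
```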
Why do we need standardization?
1- Equal Contribution of Features: Prevents features with larger ranges from dominating model training.
2- Distance-Based Algorithms: Essential for KNN, SVM, and other algorithms that rely on distance calculations.
3- Handling Different Units: Standardization allows comparison across features with different units (e.g., cm vs. kg).
What is the purpose of normalization?
To ensure that all features contribute equally to the model and improve algorithm performance.
What are the steps for normalizing an image dataset?
1- Understand the Pixel Value Range.
2- Normalize by scaling to [0, 1].
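For 8-bit images, pixel values span 0 to 255, so step 2 is just a division by 255.0. A tiny stand-in example:

```python
# 8-bit grayscale pixels range from 0 to 255; dividing by 255.0
# rescales every pixel to the [0, 1] range.
image = [[0, 64], [128, 255]]  # a 2x2 stand-in image

normalized = [[px / 255.0 for px in row] for row in image]
print(normalized[1][1])  # 1.0
```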
What are the types of splitting the dataset?
1- Train-Test Split: Divide your dataset into training and testing subsets, typically a 70-30 or 80-20 ratio.
2- Cross-Validation: Consider using cross-validation for more robust evaluation.
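A minimal train-test split can be sketched in plain Python (in practice a library helper such as scikit-learn's splitter is typical; the 80-20 ratio below matches the text):

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle, then hold out the last test_ratio fraction for testing."""
    rows = rows[:]  # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

data = list(range(10))
train, test = train_test_split(data, test_ratio=0.2)
print(len(train), len(test))  # 8 2
```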
How do you select features?
Use techniques like correlation analysis, variance thresholding, or feature importance from models (like Random Forest) to identify the most relevant features and reduce dimensionality.
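Variance thresholding, the simplest of these techniques, can be sketched in plain Python (the feature names and values are invented):

```python
import statistics

def variance_threshold(features, threshold=0.0):
    """Keep only feature columns whose variance exceeds the threshold."""
    return {
        name: column
        for name, column in features.items()
        if statistics.pvariance(column) > threshold
    }

features = {
    "constant": [1, 1, 1, 1],  # zero variance: carries no information
    "useful":   [3, 7, 1, 9],
}
kept = variance_threshold(features)
print(list(kept))  # ['useful']
```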
What is the benefit of feature selection?
It removes irrelevant or redundant features that don’t contribute to the model’s performance.