Steps for Preparing Dataset Flashcards

(21 cards)

1
Q

Why do you typically not need to normalize the data when using a decision tree algorithm?

A

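Decision trees split on one feature at a time by comparing it to a threshold; they don’t rely on distance metrics. Rescaling a feature therefore leaves the splits unchanged, so normalization is generally not necessary.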
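
A minimal sketch of why scale does not matter for a threshold split. It is not a full decision tree, only a single split chosen by fewest misclassified rows; the function name and the toy income data are assumptions made for illustration.

```python
def best_split_partition(values, labels):
    """Return (left, right) row-index sets for the threshold split
    that misclassifies the fewest rows (binary 0/1 labels)."""
    order = sorted(range(len(values)), key=lambda i: values[i])

    def errors(idx):
        # Rows misclassified when predicting the majority label of this side.
        ones = sum(labels[i] for i in idx)
        return min(ones, len(idx) - ones)

    best_cut, best_err = None, None
    for cut in range(1, len(order)):
        err = errors(order[:cut]) + errors(order[cut:])
        if best_err is None or err < best_err:
            best_cut, best_err = cut, err
    return frozenset(order[:best_cut]), frozenset(order[best_cut:])


incomes = [20_000, 35_000, 50_000, 80_000, 120_000]  # toy feature
labels = [0, 0, 0, 1, 1]                             # toy target

# Min-max scale the feature to [0, 1]; the ordering of rows is preserved.
lo, hi = min(incomes), max(incomes)
scaled = [(v - lo) / (hi - lo) for v in incomes]

partition_raw = best_split_partition(incomes, labels)
partition_scaled = best_split_partition(scaled, labels)
print(partition_raw == partition_scaled)
```

Because the split depends only on how the rows are ordered, any rescaling that keeps that order yields the same partition.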

2
Q

What are the steps for preparing the dataset for analysis and modeling?

A

1- Understand the Dataset
2- Handle Missing Data
3- Handle Duplicates
4- Data Transformation
5- Splitting The Dataset
6- Feature Selection

3
Q

What are the substeps of “Understand the Dataset”?

A

1- Inspect the Data: Load your dataset and understand its structure, including the columns, data types, and overall format.
2- Identify the target variable: If you’re doing supervised learning, ensure you know which column is your target or label.
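
A small sketch of both substeps using only the standard library. The inline CSV, its column names, and the choice of "purchased" as the target are hypothetical.

```python
import csv
import io

# Hypothetical dataset; in practice this would come from a file.
raw = """age,city,income,purchased
34,Paris,52000,yes
29,Lyon,,no
41,Paris,61000,yes
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Inspect the data: column names, number of rows, and a sample row.
columns = list(rows[0].keys())
print("columns:", columns)
print("row count:", len(rows))
print("first row:", rows[0])  # csv reads every value as a string

# Identify the target variable and separate it from the features.
target = "purchased"
features = [c for c in columns if c != target]
print("features:", features)
```

Note that the csv module returns strings, so numeric columns still need converting before modeling.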

4
Q

Unsupervised learning doesn’t have a predefined target variable, so how does it operate?

A

The model tries to find patterns, clusters, or relationships within the data without being told what to predict. The structure it finds (for example, cluster assignments) can then serve as a pseudo-target.

5
Q

What are the substeps of handling missing data?

A

1- Identify missing values.
2- Decide how to handle the missing data (techniques).
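
A sketch of the first substep: counting missing values per column. Treating both None and the empty string as missing is an assumption, and the rows are made up.

```python
rows = [
    {"age": 34, "city": "Paris", "income": 52000},
    {"age": None, "city": "Lyon", "income": None},
    {"age": 41, "city": "", "income": 61000},
    {"age": 25, "city": "Nice", "income": None},
]


def is_missing(value):
    # Assumption: None and "" both mean "missing" in this dataset.
    return value is None or value == ""


missing_counts = {
    column: sum(is_missing(row[column]) for row in rows)
    for column in rows[0]
}
print(missing_counts)
```

The counts per column then inform the second substep: a column that is mostly empty may be dropped, while a few gaps may be imputed.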

6
Q

What are the techniques for handling missing data?

A

1- Imputation: Replacing missing values with the mean (for continuous data), the median (for continuous or ordinal data), or the mode (for categorical or discrete data).
2- Prediction: Using machine learning models to predict missing values.
3- Dropping: Removing rows or columns with missing data (when appropriate).
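
A minimal sketch of imputation and dropping with the standard library. The helper names and sample columns are assumptions; missing values are represented as None.

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Replace None with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


def drop_missing(values):
    """Dropping: keep only the observed values."""
    return [v for v in values if v is not None]


heights = [170.0, None, 180.0, 175.0]   # continuous -> mean
ratings = [1, 5, None, 2, 2]            # ordinal -> median
colors = ["red", None, "blue", "red"]   # categorical -> mode

print(impute(heights, mean))
print(impute(ratings, median))
print(impute(colors, mode))
print(drop_missing(heights))
```

Prediction-based imputation follows the same pattern, with the fill value coming from a model trained on the rows where the column is present.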

7
Q

How do you handle duplicate data in a dataset?

A

Detect and remove duplicate rows: Ensure there are no repeated entries in your dataset unless duplicates are meaningful for your analysis.
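
A sketch of detecting and removing exact duplicate rows while keeping the first occurrence. The function name and sample rows are assumptions, and it assumes every cell value is hashable.

```python
def drop_duplicates(rows):
    """Return rows with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        # A row's identity is its full set of (column, value) pairs.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"id": 1, "city": "Paris"},
    {"id": 2, "city": "Lyon"},
    {"id": 1, "city": "Paris"},  # exact duplicate of the first row
]

deduplicated = drop_duplicates(rows)
print(len(rows) - len(deduplicated), "duplicate row(s) removed")
```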

8
Q

What are data transformations?

A

The process of converting data from one format or structure into another.

9
Q

What are the techniques used in data transformations?

A

1- Normalization
2- Standardization
3- Encoding Categorical Variables

10
Q

What is normalization?

A

Rescaling data to a specific range (e.g., 0 to 1).
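
A minimal sketch of min-max normalization, which is one common way to rescale to a range. The function name and the handling of a constant column are assumptions.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values so the smallest becomes new_min and the largest new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column maps to new_min to avoid dividing by zero.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


print(min_max_scale([10, 20, 30, 50]))
```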

11
Q

What is standardization?

A

Transforming data to have a mean of 0 and a standard deviation of 1.
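
A minimal sketch of standardization as a z-score, subtracting the mean and dividing by the standard deviation. Using the population standard deviation and mapping a constant column to zeros are assumptions.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # Assumption: a constant column becomes all zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


z_scores = standardize([2, 4, 4, 4, 5, 5, 7, 9])
print(z_scores)
```

The resulting list has mean 0 and standard deviation 1, unlike min-max scaling, which fixes the range instead.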

12
Q

What are the types of encoding categorical variables?

A

1- One-Hot Encoding: used for non-ordinal (nominal) categorical data, where the categories have no inherent order.
2- Label Encoding: used for ordinal categorical data, where the categories have an inherent order.
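
A sketch of both encodings in plain Python. The function names are assumptions, and for the ordinal case the category order is passed in explicitly so the integer codes respect it.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def label_encode(values, order):
    """Map each category to its position in the given order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


# Nominal data: no order among colors.
categories, encoded_colors = one_hot(["red", "blue", "red"])
print(categories, encoded_colors)

# Ordinal data: low < medium < high.
encoded_sizes = label_encode(["low", "high", "medium"], ["low", "medium", "high"])
print(encoded_sizes)
```

Using integer codes on nominal data would imply an order that does not exist, which is why one-hot encoding is preferred there.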

13
Q

Why standardize or normalize data?

A

To put features on a comparable scale so that each contributes equally to the model, regardless of its original scale.

14
Q

What is data manipulation?

A

The process of adjusting, organizing, or modifying data to make it more suitable for analysis.

15
Q

What are the common methods in data manipulation?

A

1- Filtering
2- Joining
3- Sorting
4- Aggregation
5- Reshaping
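
One short illustration of each method on a list of dictionaries. The order and customer data are hypothetical, and the reshaping shown (rows to columns) is only one of several possible reshapes.

```python
from collections import defaultdict

orders = [
    {"order_id": 1, "customer_id": "a", "amount": 30},
    {"order_id": 2, "customer_id": "b", "amount": 50},
    {"order_id": 3, "customer_id": "a", "amount": 20},
]
customer_names = {"a": "Ana", "b": "Ben"}

# 1- Filtering: keep rows that satisfy a condition.
large_orders = [o for o in orders if o["amount"] >= 30]

# 2- Joining: attach the customer name by matching on customer_id.
joined = [{**o, "name": customer_names[o["customer_id"]]} for o in orders]

# 3- Sorting: order rows by amount, largest first.
by_amount = sorted(orders, key=lambda o: o["amount"], reverse=True)

# 4- Aggregation: total amount per customer.
totals = defaultdict(int)
for o in orders:
    totals[o["customer_id"]] += o["amount"]

# 5- Reshaping: turn a list of rows into a dict of columns.
columns = {key: [o[key] for o in orders] for key in orders[0]}

print(dict(totals))
print(columns)
```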

16
Q

Why do we need standardization?

A

1- Equal Contribution of Features: Prevents features with larger ranges from dominating model training.
2- Distance-Based Algorithms: Essential for KNN, SVM, and other algorithms that rely on distance calculations.
3- Handling Different Units: Standardization allows comparison across features with different units (e.g., cm vs. kg).
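
A sketch of the distance problem with made-up data: height in cm and income in currency units. On raw values the income column dominates the Euclidean distance, so the nearest neighbor changes once both columns are standardized. All names and numbers here are assumptions.

```python
import math
from statistics import mean, pstdev

# (height_cm, income): income has a far larger range than height.
points = [(150, 50_500), (181, 52_000), (160, 20_000), (190, 90_000)]
query = (180, 50_000)


def nearest(target, candidates):
    """Index of the candidate with the smallest Euclidean distance to target."""
    return min(range(len(candidates)), key=lambda i: math.dist(target, candidates[i]))


# Per-column mean and standard deviation, computed from the dataset.
stats = [(mean(col), pstdev(col)) for col in zip(*points)]


def standardize(point):
    return tuple((v - mu) / sigma for v, (mu, sigma) in zip(point, stats))


nearest_raw = nearest(query, points)
nearest_std = nearest(standardize(query), [standardize(p) for p in points])

print("nearest on raw data:", points[nearest_raw])
print("nearest after standardization:", points[nearest_std])
```

On raw data the closest point is the one with a similar income but a 30 cm height difference; after standardization it is the point that is close on both features.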

17
Q

What is the purpose of normalization?

A

To put all features on a comparable scale so that none dominates simply because of its units or range, which can improve the performance of scale-sensitive algorithms.

18
Q

What are the steps for normalizing an image dataset?

A

1- Understand the Pixel Value Range.
2- Normalize by scaling to [0, 1].
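
A minimal sketch of both steps for a tiny grayscale image stored as nested lists. It assumes 8-bit pixels, so the maximum value is 255; the image itself is made up.

```python
# Hypothetical 2x3 grayscale image with 8-bit pixel values.
image = [
    [0, 128, 255],
    [64, 192, 32],
]

# 1- Understand the pixel value range.
pixels = [p for row in image for p in row]
print("pixel range:", min(pixels), "to", max(pixels))

# 2- Normalize by scaling to [0, 1].
MAX_PIXEL = 255  # assumption: 8-bit images
normalized = [[p / MAX_PIXEL for p in row] for row in image]
print(normalized)
```

Dividing by the format's maximum rather than by the largest pixel in one image keeps the scale consistent across the whole dataset.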

19
Q

What are the ways of splitting the dataset?

A

1- Train-Test Split: Divide your dataset into training and testing subsets, typically in a 70-30 or 80-20 ratio.
2- Cross-Validation: Rotate the held-out subset across several folds (e.g., k-fold) and average the results for a more robust evaluation.
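
A sketch of both approaches using only the standard library. The function names, the fixed seed, and the way folds are assigned (every k-th row) are assumptions chosen to keep the example short.

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=0):
    """Shuffle row indices reproducibly and hold out test_ratio of them."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_ratio)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) so each row is tested exactly once."""
    folds = [list(range(start, n_rows, k)) for start in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


rows = list(range(10))

train, test = train_test_split(rows, test_ratio=0.2)  # 80-20 split
print(len(train), "training rows,", len(test), "test rows")

folds = list(k_fold_indices(len(rows), k=5))
for train_idx, test_idx in folds:
    print("test fold:", test_idx)
```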

20
Q

How do you select features?

A

Use techniques like correlation analysis, variance thresholding, or feature importance from models (like Random Forest) to identify the most relevant features and reduce dimensionality.
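
A sketch of two of these techniques, variance thresholding followed by correlation with the target. The feature values and both cutoffs (zero variance, absolute correlation of 0.8) are assumptions for illustration, not recommended defaults.

```python
import math
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


features = {
    "constant": [1, 1, 1, 1, 1],        # carries no information
    "size": [50, 60, 80, 100, 120],     # closely tracks the target
    "noise": [3, 1, 4, 1, 5],           # weakly related to the target
}
target = [150, 180, 240, 310, 360]

# Variance thresholding: drop features that never change.
varying = {name: col for name, col in features.items() if pvariance(col) > 0}

# Correlation analysis: keep features strongly correlated with the target.
selected = [name for name, col in varying.items() if abs(pearson(col, target)) >= 0.8]
print(selected)
```

The variance filter runs first so that the correlation step never divides by a zero standard deviation.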

21
Q

What is the benefit of feature selection?

A

It removes irrelevant or redundant features that don’t contribute to the model’s performance, which reduces dimensionality and can lower training cost and the risk of overfitting.