How do you clean and preprocess data in Python? (AI) Flashcards

(21 cards)

1
Q

Front (Question/Concept)

A

Back (Answer/Explanation)

2
Q

Which Python library is used for the majority of data manipulation tasks, including loading, filtering, and cleaning?

A

pandas (DataFrames)

3
Q

What is the primary library used for numerical operations (like array handling)?

A

numpy

4
Q

What is the pandas code to load a CSV file named data.csv?

A

data = pd.read_csv('data.csv')
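
As a minimal sketch of the load step, the example below reads CSV text with pd.read_csv and checks what arrived. The column names and values are made up for illustration, and an in-memory buffer stands in for the data.csv file so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice you would pass a path such as 'data.csv'.
csv_text = "name,age,city\nAna,34,Lisbon\nBen,,Oslo\nCy,29,\n"

# read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)         # number of rows and columns
print(data.isna().sum())  # count of missing values per column
```

Checking the shape and the missing-value counts right after loading is a common first step before any cleaning.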

5
Q

What is the simplest pandas method to remove records (rows) with missing values?

A

.dropna() (returns a new DataFrame without the incomplete rows; pass inplace=True to modify the original instead)
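
A short sketch of dropping rows with missing values, using a made-up DataFrame. It shows the default behavior (drop a row if any value is missing) and the subset parameter, which limits the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Illustrative data with gaps in both columns.
data = pd.DataFrame({"age": [34, np.nan, 29], "city": ["Lisbon", "Oslo", None]})

# Drop every row that has at least one missing value.
cleaned = data.dropna()

# Drop rows only when 'age' is missing; gaps in other columns are kept.
age_known = data.dropna(subset=["age"])

print(len(data), len(cleaned), len(age_known))
```

Assigning the result to a new name leaves the original DataFrame untouched, which makes it easier to compare before and after.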

6
Q

What is the pandas method to replace missing values with a calculated statistic like the mean?

A

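.fillna(data.mean(numeric_only=True), inplace=True) (This is a form of imputation. numeric_only=True restricts the mean to numeric columns, which avoids an error in recent pandas versions when the DataFrame also holds text columns.)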
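
A minimal sketch of mean imputation on a made-up DataFrame with one numeric and one text column. Passing numeric_only=True is an assumption about the data: it is needed whenever non-numeric columns are present.

```python
import numpy as np
import pandas as pd

# Illustrative data: one missing age, no missing cities.
data = pd.DataFrame({
    "age": [30.0, np.nan, 40.0],
    "city": ["Lisbon", "Oslo", "Rome"],
})

# Column means for numeric columns only; text columns are skipped.
column_means = data.mean(numeric_only=True)

# fillna with a Series fills each column using the value at the matching label.
filled = data.fillna(column_means)

print(filled)
```

The missing age is replaced by the mean of the observed ages, while the text column is left as it was.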

7
Q

What is the goal of Standardization (scaling)?

A

Transforms each feature to have a mean of 0 and a standard deviation of 1.

8
Q

Which sklearn class performs Standardization?

A

StandardScaler
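
A small sketch of standardization with StandardScaler on a made-up two-column array. After the transform, each column should have a mean of about 0 and a standard deviation of about 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative feature matrix: 3 samples, 2 features on different scales.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler()
# fit learns each column's mean and standard deviation; transform applies them.
X_std = scaler.fit_transform(X)

print(X_std.mean(axis=0))  # per-column means, close to 0
print(X_std.std(axis=0))   # per-column standard deviations, close to 1
```

In a modeling workflow the scaler is usually fitted on the training data only and then reused to transform the test data.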

9
Q

What is the goal of Normalization (scaling)?

A

Rescales each feature to a fixed range (usually 0 to 1).

10
Q

Which sklearn class performs Normalization?

A

MinMaxScaler
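
A small sketch of min-max normalization with MinMaxScaler on a made-up array. With the default feature_range of (0, 1), each column's minimum maps to 0 and its maximum to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative feature matrix: 3 samples, 2 features on different scales.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = MinMaxScaler()  # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X)

print(X_scaled.min(axis=0))  # per-column minimum after scaling
print(X_scaled.max(axis=0))  # per-column maximum after scaling
```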

11
Q

What technique groups numerical data into categories (e.g., ‘Low’, ‘High’)?

A

Data Binning/Categorization

12
Q

What is the pandas function used for Data Binning?

A

pd.cut()
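
A minimal sketch of binning with pd.cut. The scores, bin edges, and labels below are invented for illustration; by default each bin includes its right edge and excludes its left edge.

```python
import pandas as pd

# Illustrative numeric values to group.
scores = pd.Series([12, 45, 78, 51, 90])

# Assumed bin edges: (0, 33], (33, 66], (66, 100].
bins = [0, 33, 66, 100]
labels = ["Low", "Medium", "High"]

groups = pd.cut(scores, bins=bins, labels=labels)

print(groups.tolist())
```

The result is a categorical Series, so the labels keep their Low < Medium < High order.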

13
Q

What technique converts categorical data to numerical format by creating binary columns for each category?

A

One-Hot Encoding

14
Q

What is the pandas function used for One-Hot Encoding?

A

pd.get_dummies(data, columns=['...'])
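
A short sketch of one-hot encoding with pd.get_dummies on a made-up DataFrame. The 'color' column is replaced by one indicator column per category, named with the original column name as a prefix.

```python
import pandas as pd

# Illustrative data: one categorical column and one numeric column.
data = pd.DataFrame({"color": ["red", "green", "red"], "size": [1, 2, 3]})

# Encode only the listed columns; other columns pass through unchanged.
encoded = pd.get_dummies(data, columns=["color"])

print(encoded.columns.tolist())
print(encoded)
```

Depending on the pandas version, the indicator columns hold booleans or 0/1 integers; casting with astype(int) gives 0/1 in either case.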

15
Q

What technique converts categorical data by assigning an integer to each category?

A

Label Encoding

16
Q

Which sklearn class performs Label Encoding?

A

LabelEncoder (from sklearn.preprocessing)

17
Q

Why might Label Encoding be problematic for non-ordinal categorical features?

A

The model may mistakenly assume a hierarchical order or magnitude relationship between the assigned integers.
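
The ordering issue can be seen in a small sketch using scikit-learn's LabelEncoder on a made-up list of colors. The encoder assigns integers in sorted order of the category names, so the resulting numbers imply an order that has no meaning for colors.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative non-ordinal categories.
colors = ["red", "green", "blue", "green"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ lists the categories in the order of their integer codes.
print(list(encoder.classes_))
print(codes.tolist())
```

Here 'red' receives a larger code than 'blue' purely because of alphabetical sorting. Note that scikit-learn documents LabelEncoder as intended for target labels; for input features, OrdinalEncoder or one-hot encoding is the usual choice.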

18
Q

What is the goal of Feature Engineering?

A

To create new features (e.g., data['new_feature'] = data['f1'] * data['f2']) that provide better insight for the model.
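
As a minimal sketch of the pattern in the answer above, the example below derives a new column from two existing ones. The column names (width, height, area) are hypothetical and chosen only to make the product meaningful.

```python
import pandas as pd

# Illustrative data with two raw features.
data = pd.DataFrame({"width": [2.0, 3.0], "height": [4.0, 5.0]})

# New feature computed element-wise from the existing columns.
data["area"] = data["width"] * data["height"]

print(data)
```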

19
Q

What is the sklearn function used to split data into training and testing sets?

A

train_test_split (from sklearn.model_selection).
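
A short sketch of a train/test split on made-up arrays. The 80/20 ratio and the random_state value are assumptions for illustration; fixing random_state only makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 10 samples with 2 features each, plus one target per sample.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```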

20
Q

What common statistical methods are used for Outlier Detection?

A

Z-scores or Interquartile Range (IQR) methods.
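
Both methods can be sketched on a made-up Series with one extreme value. The 1.5 multiplier for the IQR fences is the conventional choice; the z-score cutoff of 2.0 is an assumption chosen for this tiny sample (3 is more common on larger data, since a single extreme value inflates the standard deviation in small samples).

```python
import pandas as pd

# Illustrative values with one obvious outlier.
values = pd.Series([10, 12, 11, 13, 12, 11, 100])

# IQR method: flag points beyond 1.5 * IQR from the quartiles.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Z-score method: flag points far from the mean in standard-deviation units.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > 2.0]  # assumed cutoff for this small sample

print(iqr_outliers.tolist())
print(z_outliers.tolist())
```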

21
Q

What is the pandas method used to remove duplicated rows?

A

.drop_duplicates(inplace=True)
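
A minimal sketch of removing duplicate rows from a made-up DataFrame. By default, drop_duplicates compares all columns and keeps the first occurrence of each repeated row.

```python
import pandas as pd

# Illustrative data: the row (2, 'b') appears twice.
data = pd.DataFrame({"id": [1, 2, 2, 3], "name": ["a", "b", "b", "c"]})

# Returns a new DataFrame; the original is left unchanged.
deduped = data.drop_duplicates()

print(len(data), len(deduped))
```

The surviving rows keep their original index labels; reset_index(drop=True) can renumber them if needed.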