Front (Question/Concept)
Back (Answer/Explanation)
Which Python library is used for the majority of data manipulation tasks, including loading, filtering, and cleaning?
pandas (DataFrames)
What is the primary library used for numerical operations (like array handling)?
numpy
What is the pandas code to load a CSV file named data.csv?
data = pd.read_csv('data.csv') (requires import pandas as pd)
What is the simplest pandas method to remove records (rows) with missing values?
.dropna(inplace=True)
What is the pandas method to replace missing values with a calculated statistic like the mean?
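.fillna(data.mean(numeric_only=True), inplace=True) (This is a form of imputation. numeric_only=True restricts the mean to numeric columns; without it, recent pandas versions can raise a TypeError when the DataFrame contains text columns.)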
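Example (a minimal sketch of the two missing-value cards above; the DataFrame and its column names are made up for illustration, and the non-inplace forms are used so both results can be compared):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in each numeric column.
data = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "income": [100.0, 200.0, np.nan],
    "city": ["A", "B", "C"],
})

# Option 1: drop every row that contains at least one missing value.
dropped = data.dropna()

# Option 2: impute numeric columns with their column means.
# numeric_only=True skips the text column "city".
imputed = data.fillna(data.mean(numeric_only=True))

print(dropped)
print(imputed)
```

Dropping is simplest but discards whole rows; imputation keeps every row at the cost of introducing estimated values.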
What is the goal of Standardization (scaling)?
To transform each feature so that it has a mean of 0 and a standard deviation of 1.
Which sklearn class performs Standardization?
StandardScaler
What is the goal of Normalization (scaling)?
To rescale each feature to a fixed range (usually 0 to 1).
Which sklearn class performs Normalization?
MinMaxScaler
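Example (a minimal sketch comparing the two scalers on a made-up single-feature array; the variable names are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature column with five samples.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Standardization: subtract the mean, divide by the standard deviation.
standardized = StandardScaler().fit_transform(X)

# Normalization: map the minimum to 0 and the maximum to 1.
normalized = MinMaxScaler().fit_transform(X)

print(standardized.ravel())
print(normalized.ravel())
```

In a real workflow the scaler would typically be fitted on the training set only and then applied to the test set with transform().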
What technique groups numerical data into categories (e.g., ‘Low’, ‘High’)?
Data Binning/Categorization
What is the pandas function used for Data Binning?
pd.cut()
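Example (a minimal sketch of binning with pd.cut; the bin edges and the 'Low'/'Medium'/'High' labels are assumptions chosen for illustration):

```python
import pandas as pd

# Hypothetical scores on a 0-100 scale.
scores = pd.Series([5, 35, 65, 95])

# Intervals are right-inclusive by default: (0, 33], (33, 66], (66, 100].
levels = pd.cut(scores, bins=[0, 33, 66, 100], labels=["Low", "Medium", "High"])

print(levels.tolist())
```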
What technique converts categorical data to numerical format by creating binary columns for each category?
One-Hot Encoding
What is the pandas function used for One-Hot Encoding?
pd.get_dummies(data, columns=['...'])
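Example (a minimal sketch of one-hot encoding; the 'color' and 'size' columns are made up for illustration):

```python
import pandas as pd

# Hypothetical data with one categorical column.
data = pd.DataFrame({"color": ["red", "green", "red"], "size": [1, 2, 3]})

# Each distinct value of "color" becomes its own binary column.
encoded = pd.get_dummies(data, columns=["color"])

print(encoded.columns.tolist())
```

The original 'color' column is replaced by one column per category, named with the 'color_' prefix.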
What technique converts categorical data by assigning an integer to each category?
Label Encoding
Which sklearn class performs Label Encoding?
LabelEncoder (designed for target labels; OrdinalEncoder is the equivalent for feature columns)
Why might Label Encoding be problematic for non-ordinal categorical features?
The model may mistakenly assume a hierarchical order or magnitude relationship between the assigned integers.
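Example (a minimal sketch showing how the assigned integers imply an order that the categories do not have; the city names are made up for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical non-ordinal category values.
cities = ["Paris", "Berlin", "Tokyo", "Berlin"]

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# Classes are sorted alphabetically, so Berlin=0, Paris=1, Tokyo=2.
print(list(encoder.classes_))
print(list(codes))
```

Nothing about the cities justifies Tokyo (2) being "greater" than Paris (1), yet a model that treats the column as numeric may read it that way.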
What is the goal of Feature Engineering?
To create new features (e.g., data['new_feature'] = data['f1'] * data['f2']) that expose patterns the raw features do not capture on their own, so the model can learn from them more easily.
What is the sklearn function used to split data into training and testing sets?
train_test_split (from sklearn.model_selection).
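Example (a minimal sketch of an 80/20 split on made-up arrays; test_size=0.2 and random_state=42 are illustrative choices, not requirements):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 10 samples, 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the samples for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```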
Which common statistical methods are used for Outlier Detection?
Z-scores or Interquartile Range (IQR) methods.
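Example (a minimal sketch of both methods on made-up values; the 1.5 x IQR fences are the usual convention, while the z-score cutoff of 2 is an assumption chosen because very small samples rarely reach the more common cutoff of 3):

```python
import pandas as pd

# Hypothetical measurements with one obvious outlier.
values = pd.Series([10, 12, 11, 13, 12, 11, 100])

# IQR method: flag points beyond 1.5 * IQR from the quartiles.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Z-score method: flag points far from the mean in standard-deviation units.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > 2]

print(iqr_outliers.tolist())
print(z_outliers.tolist())
```

The IQR method relies on quartiles, so it is less affected by the outlier itself than the z-score, whose mean and standard deviation are both pulled by extreme values.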
What is the pandas method used to remove duplicated rows?
.drop_duplicates(inplace=True)
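Example (a minimal sketch of duplicate removal; the DataFrame is made up for illustration, and the non-inplace form is used so the original stays available for comparison):

```python
import pandas as pd

# Hypothetical data in which the row with id 2 appears twice.
data = pd.DataFrame({"id": [1, 2, 2, 3], "value": ["a", "b", "b", "c"]})

# Keep the first occurrence of each fully identical row, then renumber the index.
deduped = data.drop_duplicates().reset_index(drop=True)

print(deduped)
```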