Cleaning - examples
o Convert fields with text into numeric form
o Remove rows with missing / faulty data
o Fix rows with faulty datapoints
o Drop unnecessary variables (columns)
o Scaling or normalizing data
o Creating new variables
o Renaming variables (remove capitalization, spaces)
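A minimal sketch of several of these cleaning steps using pandas. The data frame, its column names and its values are made up for illustration, and treating non-positive masses as faulty is an assumption.

```python
import pandas as pd

# Hypothetical raw data; names and values are invented for this example.
raw = pd.DataFrame({
    "Body Mass G": ["3750", "3800", None, "-1", "4200"],
    "Species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "Notes": ["", "", "", "", ""],
})

df = raw.copy()

# Rename variables: lower case, underscores instead of spaces.
df.columns = df.columns.str.lower().str.replace(" ", "_")

# Convert a text field into numeric form; unparseable values become NaN.
df["body_mass_g"] = pd.to_numeric(df["body_mass_g"], errors="coerce")

# Remove rows with missing data, then rows with faulty (non-positive) values.
df = df.dropna(subset=["body_mass_g"])
df = df[df["body_mass_g"] > 0]

# Drop an unnecessary variable (column).
df = df.drop(columns=["notes"])

# Create a new variable from an existing one.
df = df.assign(body_mass_kg=df["body_mass_g"] / 1000)

print(df)
```

Here faulty rows are removed rather than fixed; fixing them (e.g. replacing the value) would be an alternative when the correct value can be recovered.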
Methods for converting categorical (& ordinal) values to numerical values
One-hot encoding
= transforming each possible value of a categorical variable into a new binary (dummy-variable) column (e.g. true = 1, false = 0)
Label encoding
= transforming each possible value of a categorical variable into a unique integer value
! Issue: Algorithms might infer an order / relationship between the values that does not exist in the data
Ordinal encoding
= transforming each possible value of a categorical variable into a unique integer value where the integers preserve the order of the categories (e.g. size classes based on body mass in g)
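The three encodings can be compared side by side in a small plain-Python sketch. The example values and the size order (small < medium < large) are assumptions chosen for illustration.

```python
# Hypothetical example values, not taken from a real dataset.
species = ["Adelie", "Gentoo", "Adelie", "Chinstrap"]
sizes = ["small", "large", "medium", "small"]

# One-hot encoding: one binary column per possible category.
categories = sorted(set(species))  # ['Adelie', 'Chinstrap', 'Gentoo']
one_hot = [[int(value == cat) for cat in categories] for value in species]

# Label encoding: an arbitrary unique integer per category.
# The integers carry no meaning, but an algorithm may read them as ordered.
label_map = {cat: i for i, cat in enumerate(categories)}
labels = [label_map[value] for value in species]

# Ordinal encoding: integers follow an order that is stated explicitly.
size_order = ["small", "medium", "large"]
ordinal_map = {cat: i for i, cat in enumerate(size_order)}
ordinals = [ordinal_map[value] for value in sizes]

print(one_hot)   # [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
print(labels)    # [0, 2, 0, 1]
print(ordinals)  # [0, 2, 1, 0]
```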
EDA
2-D Histogram
= dividing points among 2D bins and counting the points in each bin
Skewness
= measure of the asymmetry of a variable's probability distribution about its mean
Boxplot
= method for graphically depicting groups of numerical data through their quartiles
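A rough sketch of the numbers behind these three EDA tools, using only the Python standard library. The sample values, the bin width of 1.0 and the choice of the population form of skewness (third standardized moment) are assumptions; plotting libraries may use slightly different conventions.

```python
import statistics
from collections import Counter

# Hypothetical sample with a long right tail.
values = [1, 2, 2, 3, 3, 3, 4, 4, 10]

# Skewness as the third standardized moment (population form).
mean = statistics.fmean(values)
sd = statistics.pstdev(values)
skew = sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Quartiles: the three cut points a boxplot draws as box edges and median line.
q1, median, q3 = statistics.quantiles(values, n=4)

# 2-D histogram: assign each point to a square bin and count per bin.
points = [(0.5, 0.5), (0.7, 1.5), (1.2, 0.1), (1.9, 1.9), (0.1, 0.9)]
bin_width = 1.0
counts = Counter((int(x // bin_width), int(y // bin_width)) for x, y in points)

print(skew > 0)         # True: the tail is on the right
print(q1, median, q3)   # 2.0 3.0 4.0
print(counts[(0, 0)])   # 2
```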
Scaling & Normalizing - Relevance
Mathematical Transformation
Standardization
= data points expressed as the number of standard deviations (SD) from the mean
Makes the mean 0 and the variance 1
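Standardization is commonly written as z = (x - mean) / SD. A minimal sketch with the standard library, assuming the population standard deviation and made-up sample values:

```python
import statistics

# Hypothetical values, chosen so that mean = 5 and population SD = 2.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.fmean(values)
sd = statistics.pstdev(values)

# Each z-score is the distance from the mean, measured in SDs.
z_scores = [(v - mean) / sd for v in values]

print(z_scores)                        # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(statistics.fmean(z_scores))      # 0.0
print(statistics.pvariance(z_scores))  # 1.0
```

In practice the mean and SD would typically be computed on the training data only and then reused for validation and test data.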
Scaling
Pipelines
Splitting of data
Validation set - purpose
Normalizing