What do we define as outliers?
Data points that differ significantly from others
What kind of impact can outliers have in machine learning?
KNN: outliers distort the partition of the space, leading to overfitting
Linear regression: very sensitive to outliers, which decrease the quality of the fit
Decision trees: more robust to outliers, but they can still introduce overfitting
How can we deal with outliers?
Trimming / truncation:
drop values above/below a certain value or percentile
Winsorizing / winsorization:
setting values above/below a certain value or percentile to the closest retained value (i.e., clipping them to the bound)
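A minimal sketch of both techniques with NumPy; the data and the 5th/95th percentile bounds are illustrative assumptions:

```python
import numpy as np

data = np.array([1.0, 2.0, 2.5, 3.0, 100.0])  # 100.0 is an outlier

# Percentile bounds (chosen here for illustration)
lo, hi = np.percentile(data, [5, 95])

# Trimming: drop values outside the bounds
trimmed = data[(data >= lo) & (data <= hi)]

# Winsorizing: clip values to the nearest bound instead of dropping them
winsorized = np.clip(data, lo, hi)
```

Note the difference: trimming shrinks the dataset, winsorizing keeps its size but caps the extremes.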
What is an isolation tree?
A simple yet effective anomaly detection method
* Uses a principle similar to Decision Trees
* Algorithm:
* Pick a random split between min and max
* Count how many points are in each partition
* Points in singleton partitions are marked with the
number of splits done so far
* Repeat the split on each partition
* Stop when all points are in singleton partitions
* A point’s number of splits is its anomaly score
* The score needs normalization (see original paper)
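The steps above can be sketched for 1-D data as follows; this is a toy implementation of the raw split-depth score, not the normalized score from the paper:

```python
import random

def isolation_depths(points, depth=0, depths=None):
    """Recursively split 1-D points at random thresholds and record
    the number of splits at which each point becomes isolated."""
    if depths is None:
        depths = {}
    if len(points) <= 1 or len(set(points)) == 1:
        for p in points:
            depths[p] = depth       # singleton: mark with splits so far
        return depths
    lo, hi = min(points), max(points)
    split = random.uniform(lo, hi)  # random split between min and max
    left = [p for p in points if p < split]
    right = [p for p in points if p >= split]
    isolation_depths(left, depth + 1, depths)
    isolation_depths(right, depth + 1, depths)
    return depths

random.seed(42)
depths = isolation_depths([1.0, 1.1, 1.2, 1.3, 10.0])
# The outlier (10.0) tends to be isolated after fewer splits
```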
How do we extend isolation trees to n dimensions (Isolation Forest)?
* At each split, pick a random feature, then a random split value between that feature's min and max
* Build an ensemble of such trees (a forest) and average each point's path length across trees
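A sketch of the n-dimensional case with NumPy: at each node pick a random feature, then a random threshold between that feature's min and max. Function name and toy data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def iso_depths_nd(X, idx=None, depth=0, out=None):
    """Isolation-tree splitting in n dimensions; returns, for each row
    of X, the number of splits at which it became isolated."""
    n, d = X.shape
    if idx is None:
        idx = np.arange(n)
        out = np.zeros(n, dtype=int)
    if len(idx) <= 1:
        out[idx] = depth
        return out
    col = X[idx]
    feat = rng.integers(d)                       # random dimension
    lo, hi = col[:, feat].min(), col[:, feat].max()
    if lo == hi:                                 # cannot split further
        out[idx] = depth
        return out
    split = rng.uniform(lo, hi)                  # random threshold
    mask = col[:, feat] < split
    iso_depths_nd(X, idx[mask], depth + 1, out)
    iso_depths_nd(X, idx[~mask], depth + 1, out)
    return out

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0]])
depths = iso_depths_nd(X)
```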
Why should we remember to handle outliers carefully?
Outliers can be:
* Noise, which we want to remove
* Originated by processes we are not interested in modeling (e.g., Twitter bots)
* Originated by processes that differ from the main one we are focusing on, but are still relevant to the analysis (e.g., power users)
How to handle missing data?
What are the three types of missing data?
* Missing Completely At Random (MCAR)
* Missing At Random (MAR)
* Missing Not At Random (MNAR)
How can we deal with missing data?
Deletion
* Delete records with missing values
* Suitable when
* Data is Missing Completely At Random – else, bias
is introduced
* A small % of records is missing
* There is no reliable way to infer the value (see next)
* Deletion of entire column when
* A large % of records is missing
* Column is not crucial for the analysis
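Both deletion strategies can be sketched with NumPy; the 50% column-missingness threshold is an illustrative assumption:

```python
import numpy as np

X = np.array([[1.0,    2.0],
              [np.nan, 3.0],
              [4.0,    5.0]])

# Record deletion: keep only rows with no missing values
complete = X[~np.isnan(X).any(axis=1)]

# Column deletion: drop columns where a large share of values
# is missing (threshold chosen here for illustration)
missing_frac = np.isnan(X).mean(axis=0)
kept_cols = X[:, missing_frac < 0.5]
```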
Single imputation – average
* Filling missing values with some global criterion
* Default value
* Fill with average/median/mode
* Suitable when
* Data is Missing Completely At Random
* A small % of records is missing
* Introduces bias in the distribution
* Reduces variability
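A minimal mean-imputation sketch with NumPy, which also makes the reduced variability visible (toy data):

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan, 4.0, np.nan])

# Fill missing entries with the mean of the observed values
mean = np.nanmean(x)
imputed = np.where(np.isnan(x), mean, x)

# Imputed values sit exactly at the mean, so they add no spread:
# the variance of `imputed` is smaller than that of the observed values
```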
Single imputation – local average
* Filling missing values using local information
* Applicable to data in which records are not independent,
and for which the assumption “what is close is similar”
holds
* Time series
* Audio
* Spatial data
* Images
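For time series, the "what is close is similar" assumption can be sketched as linear interpolation from neighboring observations (toy data; other local schemes, such as rolling means, follow the same idea):

```python
import numpy as np

t = np.arange(6, dtype=float)                       # time index
y = np.array([0.0, 1.0, np.nan, 3.0, np.nan, 5.0])  # series with gaps

# Local imputation: fill each gap from its observed neighbors
missing = np.isnan(y)
y_filled = y.copy()
y_filled[missing] = np.interp(t[missing], t[~missing], y[~missing])
```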