Exploratory data analysis
Getting a feel for the data, which makes it easier to spot mistakes and outliers and to work out what actually happened.
- Understand and gain insights into the data before selecting analysis techniques.
- Approach data without assumptions, often using visual methods.
We need to get to know the data
We can ask questions
Comparison with Hypothesis Testing
Systematic Process
Descriptive Statistics
Quantitatively describe main features of the data. Main data features:
- Measures of central tendency represent a center around which measurements are distributed (mean, median, mode)
- Measures of variability represent the spread of data from the center (variance, standard deviation)
- Measures of relative standing represent the ‘relative position’ of specific measurements in data (quantiles)
The mean
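The arithmetic average: the sum of the values divided by how many there are. It is sensitive to outliers, so it can be a misleading measure of central tendency when the data are skewed or contain extreme values.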
The median
The middle value when the values are ranked in order, splitting the data into two halves. AKA the 50th percentile. It is robust to outliers, which often makes it a better measure of central tendency for skewed data. In skewed data, the mean is pulled further towards the long tail than the median.
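To make the outlier effect concrete, here is a minimal sketch using Python's standard `statistics` module. The sample values are made up purely for illustration.

```python
from statistics import mean, median

# Hypothetical sample: five typical values, then the same sample plus one outlier.
values = [10, 12, 11, 13, 12]
with_outlier = values + [100]

mean_before = mean(values)              # 11.6
mean_after = mean(with_outlier)         # about 26.3, dragged up by the outlier
median_before = median(values)          # 12
median_after = median(with_outlier)     # 12.0, unchanged by the outlier

print(mean_before, mean_after)
print(median_before, median_after)
```

A single extreme value more than doubles the mean here, while the median stays at 12.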
The mode
The most common value. A data set may have more than one mode.
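As a small illustration of a data set with more than one mode, the sketch below uses `statistics.multimode` (available from Python 3.8). The data are invented for the example.

```python
from statistics import multimode

# Hypothetical data in which 2 and 3 both occur twice.
data = [1, 2, 2, 3, 3, 4]

# multimode returns every value tied for the highest count,
# in the order each was first encountered.
modes = multimode(data)
print(modes)  # [2, 3]
```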
Variance
The average squared deviation from the mean, measuring the spread around the mean. The lower the variance, the more consistent the data.
Standard Deviation
The square root of the variance, expressed in the same units as the data. A high standard deviation means more spread, less consistency and less clustering around the mean.
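The sketch below computes both measures with the standard `statistics` module on a made-up sample. It assumes the distinction between the population versions (divide by n) and the sample versions (divide by n - 1); these notes do not say which one is intended.

```python
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical sample with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Population versions: average squared deviation from the mean (divide by n).
pop_var = pvariance(data)   # 4
pop_std = pstdev(data)      # 2.0, the square root of the variance

# Sample versions: divide by n - 1 instead, giving slightly larger values.
samp_var = variance(data)   # 32 / 7, about 4.57
samp_std = stdev(data)

print(pop_var, pop_std, samp_var, samp_std)
```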
Quartiles
A quartile is a value that marks one of the three divisions that break a ranked series of values into four equal parts. The median is the 2nd quartile and divides the data in half.
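A minimal sketch of computing quartiles with `statistics.quantiles` follows. The data are hypothetical, and the choice of `method="inclusive"` is an assumption: several interpolation conventions exist, and they can give slightly different quartile values for the same data.

```python
from statistics import median, quantiles

# Hypothetical ranked data.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

print(q1, q2, q3)          # 3.0 5.0 7.0
print(q2 == median(data))  # True: the 2nd quartile is the median
```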
Common Visualizations
Histograms/Bar Charts
Used to display a frequency distribution: counts of data falling in various ranges. A histogram is used for numeric data and a bar chart for categorical data. The bin width selection is important: if too narrow, the plot may show false patterns; if too wide, it may hide important patterns. Several variations are possible, such as plotting relative frequencies instead of raw frequencies, or making the height of each bar equal to the relative frequency divided by the bin width, so that the bar areas sum to 1.
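The three variations (raw counts, relative frequencies, and relative frequency divided by bin width) can be sketched without a plotting library. The `histogram` helper, the sample values and the bin edges below are all hypothetical, chosen only for illustration; unequal bin widths are used so that the last variation differs visibly from the second.

```python
def histogram(values, bin_edges):
    """Return (counts, relative_frequencies, densities) for the given bin edges.

    Bins are half-open [lo, hi), except the last bin, which also includes
    its upper edge. Values outside the edges are ignored.
    """
    n_bins = len(bin_edges) - 1
    counts = [0] * n_bins
    for v in values:
        for i in range(n_bins):
            lo, hi = bin_edges[i], bin_edges[i + 1]
            is_last = i == n_bins - 1
            if lo <= v < hi or (is_last and v == hi):
                counts[i] += 1
                break
    total = sum(counts)
    rel = [c / total for c in counts]
    # Density: relative frequency divided by bin width, so bar areas sum to 1.
    dens = [r / (bin_edges[i + 1] - bin_edges[i]) for i, r in enumerate(rel)]
    return counts, rel, dens


values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges = [0, 2, 4, 10]  # deliberately unequal bin widths

counts, rel, dens = histogram(values, edges)
widths = [edges[i + 1] - edges[i] for i in range(len(edges) - 1)]
total_area = sum(d * w for d, w in zip(dens, widths))

print(counts)  # [1, 5, 4]
print(rel)     # [0.1, 0.5, 0.4]
print(dens)
```

With the wide last bin, the raw count of 4 looks comparable to the middle bin, while the density shows that values there are much more thinly spread.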
Box plots
A plot of the five-number summary of the data: minimum, 1st quartile, median, 3rd quartile and maximum. Often used alongside a histogram in EDA.
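The five values a box plot draws can be computed directly. The sketch below uses a hypothetical `five_number_summary` helper and made-up data, and assumes the "inclusive" quartile method; plotting libraries may use a different quartile convention, and many draw the whiskers at 1.5 times the interquartile range rather than at the minimum and maximum.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


# Hypothetical data with one low value far from the rest.
data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]

summary = five_number_summary(data)
print(summary)  # (7, 36.75, 40.5, 42.75, 49)
```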
Scatterplots
2D graphs, useful for understanding the relationship between two attributes. Features of the relationship are described by: strength, shape, direction, presence of outliers.
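One common way to put numbers on the strength and direction of a relationship is the Pearson correlation coefficient. It is not named in these notes, so treat it as an illustrative choice. It only captures linear relationships, so it says nothing about shape, and it can be distorted by outliers. The `pearson_r` helper and the data below are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: sign gives direction, magnitude gives linear strength."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # perfectly linear, increasing
falling = [10, 8, 6, 4, 2]   # perfectly linear, decreasing
noisy = [2, 1, 4, 3, 5]      # increasing overall, but weaker

r_rising = pearson_r(xs, rising)    # close to +1
r_falling = pearson_r(xs, falling)  # close to -1
r_noisy = pearson_r(xs, noisy)      # positive, but below 1

print(r_rising, r_falling, r_noisy)
```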
Models Definition & Purpose
Philosophies of Models
Occam’s Razor
Bias-Variance Trade-Off
Principles of Good Models
Baseline Models Purpose
Classification Baselines
Prediction Baselines