What is EDA?
Why EDA?
IMPORTANCE OF EDA
STEPS IN EDA
1) Data Collection: Gathering relevant data from various sources (last time)
2) Data Cleaning: Preparing the data for analysis (last time)
3) Data Visualization: Using plots and charts to understand data (today)
4) Statistical Analysis: Applying stats to derive insights (next time).
1) DATA COLLECTION (RECAP)
2) DATA CLEANING (HANDLING MISSING VALUES)
R Code:
# Identify missing values
missing_values <- is.na(data)
# Remove rows with missing values
cleaned_data <- na.omit(data)
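A minimal sketch of the two calls above on a toy data frame. The data frame and its columns `age` and `income` are made up for illustration. `colSums()` is added here as one common way to count missing values per column, and the median imputation at the end is only an alternative to dropping rows, not something these slides prescribe.

```r
# Toy data frame (hypothetical values, for illustration only)
data <- data.frame(
  age    = c(23, NA, 31, 45, NA),
  income = c(50000, 62000, NA, 58000, 61000)
)

# Logical matrix: TRUE wherever a value is missing
missing_values <- is.na(data)

# Count missing values per column
missing_per_column <- colSums(missing_values)

# Option 1: drop every row that contains at least one NA
cleaned_data <- na.omit(data)

# Option 2 (an alternative): fill NAs in a column with that column's median
imputed_data <- data
imputed_data$age[is.na(imputed_data$age)] <- median(data$age, na.rm = TRUE)

print(missing_per_column)
print(nrow(cleaned_data))
```

Dropping rows is simple but discards the non-missing values in those rows, so it is worth checking how many rows survive before choosing between the two options.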
2) DATA CLEANING (DEALING WITH OUTLIERS)
Definition: Data points that differ significantly from others.
Types: Point outliers, contextual outliers, and collective outliers.
R Code:
# Boxplot to visualize outliers
boxplot(data$column_name)
# IQR method to identify outliers
iqr_value <- IQR(data$column_name)
upper_bound <- quantile(data$column_name, 0.75) + 1.5 * iqr_value
lower_bound <- quantile(data$column_name, 0.25) - 1.5 * iqr_value
outliers <- data$column_name[data$column_name > upper_bound |
                             data$column_name < lower_bound]
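The IQR rule can be tried end to end on a small vector. This is a sketch with made-up numbers; the vector `x` and the final filtering step are illustrative additions, and whether flagged values should actually be removed depends on the data.

```r
# Toy vector with one obviously extreme value (hypothetical data)
x <- c(10, 12, 11, 13, 12, 14, 11, 95)

# Quartiles and interquartile range
q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
iqr_value <- IQR(x)

# Fences: 1.5 * IQR beyond the first and third quartiles
lower_bound <- q1 - 1.5 * iqr_value
upper_bound <- q3 + 1.5 * iqr_value

# Values outside the fences are flagged as outliers
is_outlier <- x < lower_bound | x > upper_bound
outliers <- x[is_outlier]

# One possible follow-up: keep only the values that were not flagged
x_without_outliers <- x[!is_outlier]

print(outliers)
```

These are the same fences that `boxplot()` uses by default to decide which points to draw individually beyond the whiskers.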
2) DATA TRANSFORMATION AND NORMALIZATION
Why Transform Data? Enhance model performance; meet assumptions of certain algorithms.
Common Transformations: Log, square root, z-score.
R Code:
# Log transformation (e.g., reduce right skew so the data look closer to normal)
log_data <- log(data$column_name)
# Square root transformation (e.g., reduce skewness in count data for a more even spread)
sqrt_data <- sqrt(data$column_name)
# Z-score normalization (rescale to mean = 0 and sd = 1)
z_score <- scale(data$column_name)
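To see what the three transformations do, here is a small sketch on a right-skewed toy vector. The values are invented for illustration. The manual z-score is included only to show what `scale()` computes with its default arguments.

```r
# Right-skewed toy vector (hypothetical values, all strictly positive)
x <- c(1, 2, 2, 3, 4, 5, 8, 13, 40, 120)

# Log transformation: compresses large values; needs x > 0
log_x <- log(x)

# Square root transformation: milder compression; needs x >= 0
sqrt_x <- sqrt(x)

# Z-score: subtract the mean, divide by the standard deviation.
# scale() returns a one-column matrix, so as.numeric() flattens it.
z_x <- as.numeric(scale(x))

# The same z-score computed by hand
z_manual <- (x - mean(x)) / sd(x)

print(range(log_x))
print(range(sqrt_x))
print(all.equal(z_x, z_manual))
```

If the data contain zeros, `log()` returns `-Inf`; `log1p()`, which computes log(1 + x), is a common workaround in that case.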
3) DATA VISUALIZATION
Histograms and Box Plots: Understand data distribution.
Scatter Plots: Visualize bivariate relationships.
Heatmaps: Show correlations.
R Code (for a simple scatter plot):
plot(data$column1, data$column2, main = "Scatter Plot of Column1 vs Column2",
     xlab = "Column1", ylab = "Column2")
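The histogram and box plot mentioned above can be drawn with base R in the same style as the scatter plot. This sketch assumes the built-in `mtcars` dataset as a stand-in for `data`; the column choices (`mpg`, `cyl`) are illustrative.

```r
# Built-in dataset used as a stand-in for `data` (an assumption of this sketch)
data <- mtcars

# Histogram: shape of a single numeric variable
h <- hist(data$mpg, breaks = 10,
          main = "Histogram of mpg", xlab = "mpg")

# Box plot: median, quartiles, and potential outliers, split by group
b <- boxplot(mpg ~ cyl, data = data,
             main = "mpg by number of cylinders",
             xlab = "cyl", ylab = "mpg")

# Both functions also return the numbers behind the picture
print(h$counts)  # observations per histogram bin
print(b$stats)   # five-number summary per group (one column per cyl value)
```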
4) STATISTICAL ANALYSIS (NEXT TIME)
DATA VISUALIZATION
LINE PLOT
library(ggplot2)
ggplot(data = df, aes(x = date, y = unemploy)) +
  geom_line()
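A runnable version of the line plot. The column names `date` and `unemploy` match the `economics` dataset that ships with ggplot2, so this sketch assumes that is the `df` intended here; the slides do not say so explicitly. The title and axis labels are illustrative additions.

```r
library(ggplot2)

# Assumption: `df` is ggplot2's built-in `economics` data
df <- economics

p <- ggplot(data = df, aes(x = date, y = unemploy)) +
  geom_line() +
  labs(title = "Unemployment over time",
       x = "Date", y = "Unemployed (thousands)")

print(p)
```

Other built-in datasets have columns matching the later plots (for example `mpg` has `class`, `cars` has `speed` and `dist`, `mtcars` has `cyl`, `diamonds` has `cut`), though the slides do not name their data.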
BAR PLOT
ggplot(data = df, aes(x = class)) +
  geom_bar()
BOX PLOT
# The quoted string is a fixed label, so all values fall into a single box
ggplot(data = df, aes(x = 'Distance measure', y = temperature)) +
  geom_boxplot()
DENSITY PLOT
ggplot(data = df, aes(x = X, fill = cut)) +
geom_density(alpha = 0.5)
SCATTER PLOT
ggplot(data = df, aes(x = dist, y = speed)) +
  geom_point()
WORD CLOUDS
library(wordcloud)     # wordcloud()
library(RColorBrewer)  # brewer.pal()
wordcloud(words = df$word, freq = df$freq,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
PIE CHART
ggplot(data = df, aes(x = factor(1), fill = as.factor(cyl))) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")  # wrap the stacked bar into a pie
RAINCLOUD PLOT
HEATMAP
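One common way to draw a correlation heatmap with ggplot2 is `geom_tile()` on a long-format correlation table. This is a sketch under assumptions: `mtcars` stands in for the data, the selected columns are arbitrary, and the colour scale is one choice among many.

```r
library(ggplot2)

# Assumption: a few numeric columns of mtcars stand in for the data
num_data <- mtcars[, c("mpg", "hp", "wt", "disp", "qsec")]

# Pairwise Pearson correlations
corr <- cor(num_data)

# Reshape the matrix to long format: one row per (var1, var2) pair
corr_long <- as.data.frame(as.table(corr))
names(corr_long) <- c("var1", "var2", "correlation")

p <- ggplot(corr_long, aes(x = var1, y = var2, fill = correlation)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1)) +
  labs(title = "Correlation heatmap", x = NULL, y = NULL)

print(p)
```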
ANIMATED PLOTS
INTERACTIVE PLOTS
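One way to make an existing ggplot interactive is the `ggplotly()` function from the plotly package, which converts a ggplot object into a widget with hover, zoom, and pan. This sketch assumes plotly is installed and uses the built-in `cars` dataset; the slides do not state which package they rely on.

```r
library(ggplot2)
library(plotly)

# Static scatter plot, built as on the earlier slide
# (assumption: the built-in `cars` data stands in for `df`)
p <- ggplot(data = cars, aes(x = dist, y = speed)) +
  geom_point()

# Convert the static plot into an interactive widget
interactive_p <- ggplotly(p)

# Printing the object opens it in the viewer or browser
interactive_p
```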