What is Visualisation?
The use of graphics to examine data.
What is William Cleveland’s graphic philosophy?
A fine balancing act.
- A graphic should display as much information as it can, with the lowest possible cognitive strain on the viewer.
===
Strive for clarity.
- Make the data stand out. Specific tips for increasing clarity include:
- Avoid too many superimposed elements, such as too many curves in the same graphing space.
- Find the right aspect ratio and scaling to properly bring out the details of the data.
- Avoid having the data all skewed to one side or the other of your graph.
===
Visualisation is an iterative process.
- Its purpose is to answer questions about the data.
- Different graphics are best suited for answering different questions.
What is exploratory data analysis?
Exploratory data analysis, or EDA for short, is a task that uses
visualisation and transformation to explore your data in a systematic way.
What is the EDA cycle?
EDA is an iterative cycle that involves:
Generating questions about your data.
Searching for answers by visualising, transforming, and modelling your data.
Using what you learn to refine your questions and/or generate new questions.
How to ask good questions?
Like most creative processes, the key to asking quality questions is to
generate a large quantity of questions.
Two types of questions will always be useful for making discoveries within your data:
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
What is a variable?
A variable is a quantity, quality, or property that you can measure.
What is a value?
A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
What is an observation?
An observation is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object).
- An observation will contain several values, each associated with a different variable.
- An observation is also referred to as a data point.
What to look for in histograms and bar charts?
In both bar charts and histograms, tall bars show the common values of a variable, and shorter bars show less-common values.
Places that do not have bars reveal values that were not seen in your data.
===
To turn this information into useful questions, look for anything unexpected:
Which values are the most common? Why?
Which values are rare? Why? Does that match your expectations?
Can you see any unusual patterns? What might explain them?
What is tabular data?
Tabular data is a set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.
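A minimal sketch of tidy tabular data in R. The `plants` data frame and its values are made up for illustration; they are not from the slides.

```r
# Hypothetical measurements: two plants, each measured for height and leaf count.
plants <- data.frame(
  plant_id   = c("a", "b"),
  height_cm  = c(12.5, 9.8),
  leaf_count = c(6L, 4L)
)

# Each row is one observation, each column one variable, each cell one value.
nrow(plants)          # number of observations
names(plants)         # the variables
plants$height_cm[2]   # a single value: the height of plant "b"
```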
Does the data form subgroups?
Clusters of similar values suggest that subgroups exist in your data. To understand the subgroups, ask:
How are the observations within each cluster similar to each other?
How are the observations in separate clusters different from each other?
How can you explain or describe the clusters?
Why might the appearance of clusters be misleading?
How to compare two or more variables?
Covariation is the tendency for the values of two or more variables to vary together in a related way.
The best way to spot covariation is to visualise the relationship between two or more variables. How you do that depends on the types of variables involved. If you have
a continuous variable and a categorical variable – map the categorical variable to an aesthetic such as colour (which produces a legend), or draw one box per category with geom_boxplot;
two categorical variables – try geom_count or geom_tile;
two continuous variables – try geom_point, or geom_bin2d or geom_hex when there are many points. geom_boxplot also works if you first bin one of the variables.
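The variable-type cases above can be sketched as follows, assuming ggplot2 is installed. The `mpg` dataset ships with ggplot2; the choice of columns is only an illustration.

```r
library(ggplot2)

# Continuous + categorical: highway mileage (hwy) for each vehicle class.
p_box <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

# Two categorical variables: count each class / drive-train combination.
p_count <- ggplot(mpg, aes(x = class, y = drv)) +
  geom_count()

# Two continuous variables: engine displacement against highway mileage.
p_point <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# With many points, rectangular 2D bins reduce overplotting.
p_bin <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_bin2d()
```

Printing any of these objects, for example `print(p_box)`, draws the plot.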
What is variation and covariation?
Variation describes the behaviour within a single variable.
Covariation describes the behaviour between variables.
What questions can you ask for covariation?
Patterns in your data provide clues about relationships. If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:
Could this pattern be due to coincidence (i.e. random chance)?
How can you describe the relationship implied by the pattern?
How strong is the relationship implied by the pattern?
What other variables might affect the relationship?
Does the relationship change if you look at individual subgroups of the data?
How to code histograms?
In ggplot2 we use geom_histogram; in base R we use hist.
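A minimal sketch of both approaches, assuming ggplot2 is installed. The built-in `faithful` dataset and the 5-minute bin width are illustrative choices, not taken from the slides.

```r
library(ggplot2)

# Base R: histogram of waiting times between eruptions (built-in `faithful` data).
hist(faithful$waiting, main = "Waiting time", xlab = "Minutes")

# ggplot2: the same variable; binwidth is given in the units of x (minutes here).
p_hist <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 5)
```

`hist` draws immediately, while the ggplot2 object is drawn when printed.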
How to read a histogram?
Values that sit in a long tail, far from the main cluster of bars, tend to be outliers or anomalies.
REFER TO SLIDES
What are the key advantages and disadvantages of histograms?
Advantages:
With the proper binwidth, histograms visually highlight where the data is concentrated, and point out the presence of potential outliers and anomalies.
===
Disadvantages:
- The primary disadvantage of histograms is that you must decide ahead of time how wide the bins are:
- Bins too wide – you can lose information about the shape of the distribution.
- Bins too narrow – the histogram can look too noisy to read easily.
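The effect of bin width can be checked without drawing anything. This is a rough sketch in base R; `bin_counts` is a hypothetical helper written for this example, and the three widths are arbitrary.

```r
x <- faithful$waiting  # built-in dataset; waiting times in minutes

# Hypothetical helper: bin x with a fixed width and return the count per bin.
bin_counts <- function(x, width) {
  breaks <- seq(floor(min(x) / width) * width,
                ceiling(max(x) / width) * width,
                by = width)
  hist(x, breaks = breaks, plot = FALSE)$counts
}

wide   <- bin_counts(x, 30)  # very few bars: little shape left to see
medium <- bin_counts(x, 5)   # a moderate number of bars
narrow <- bin_counts(x, 1)   # many bars, with noisy bar-to-bar jumps

# Number of bars produced by each width.
c(wide = length(wide), medium = length(medium), narrow = length(narrow))
```

Every version counts the same observations; only how they are grouped changes, which is why the choice of width shapes what you see.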
What is a density plot?
You can think of a density plot as a continuous histogram of a variable, except that the area under the density plot is rescaled to equal one.
A point on a density plot corresponds approximately to the fraction of data (or the percentage of data, divided by 100) that falls near a particular value, per unit of the x-axis.
This fraction is usually very small. Strictly, the fraction of data in any range is the area under the curve over that range.
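A small base R sketch that checks the "area equals one" property numerically. It uses `density` with its default kernel and bandwidth on the built-in `faithful` data; both are illustrative choices.

```r
x <- faithful$waiting

# Kernel density estimate with the default bandwidth and kernel.
d <- density(x)

# Approximate the area under the curve with the trapezoid rule.
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area       # close to 1

# The heights are small because they are densities per minute, not counts.
max(d$y)

plot(d, main = "Density of waiting times")
```

In ggplot2 the corresponding layer is geom_density.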
What are we interested in when looking at a density plot?
When looking at a density plot, we should be more interested in the overall shape of the curve than the actual values on the y-axis.
REFER TO SLIDES FOR CODE
How to read a density plot?
The peaks typically show where most of the data is concentrated. The lower and wider the curve, the more spread out the data is. The wider it is, especially with a long tail, the more likely you are to need a rescaled axis (such as a log scale) to see the detail.
REFER TO SLIDES
When should we use a logarithmic scale?
One should use a logarithmic scale when percent change or change in orders of magnitude is more important than changes in absolute units.
A log scale is also useful for visualising data that is heavily skewed.
REFER TO SLIDES FOR CODE
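A rough base R sketch of how a log scale changes the picture for skewed data. The built-in `islands` dataset is an illustrative choice; in ggplot2 the same effect comes from adding scale_x_log10() to a plot.

```r
# `islands` is a built-in vector of landmass areas (thousands of square miles);
# a few continents dwarf the many small islands.
areas <- as.numeric(islands)

# A mean far above the median is one rough sign of right skew.
c(mean = mean(areas), median = median(areas))

# On a log10 scale, equal distances correspond to equal ratios.
log_areas <- log10(areas)
c(mean = mean(log_areas), median = median(log_areas))

# Side-by-side histograms: raw values versus log10 values.
op <- par(mfrow = c(1, 2))
hist(areas, main = "Raw scale")
hist(log_areas, main = "log10 scale")
par(op)
```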
How to code a boxplot?
In ggplot2 we can use geom_boxplot; in base R we can use boxplot.
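A minimal sketch of both, assuming ggplot2 is installed. The built-in `mtcars` dataset and the grouping by cylinder count are illustrative choices.

```r
library(ggplot2)

# Base R: formula interface, one box per cylinder count (built-in `mtcars`).
boxplot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "Miles per gallon")

# ggplot2: x should be discrete, so cyl is converted to a factor.
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()
```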
What are the components of a box plot?
Median
- The median (middle quartile) marks the mid-point of the data and is shown by the line that divides the box into two parts. Half of the data points are greater than or equal to this value and half are less.
Inter-quartile range
- The middle “box” represents the middle 50% of data points for the group. The range from lower to upper quartile is referred to as the inter-quartile range (IQR).
Upper quartile
- Seventy-five percent of the data points fall below the upper quartile.
Lower quartile
- Twenty-five percent of the data points fall below the lower quartile.
Whiskers
- The upper and lower whiskers represent data outside the middle 50%.
- Whiskers often (but not always) stretch over a wider range than the middle quartile groups.
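How to interpret a boxplot?
Between the minimum and Q1 is roughly the bottom 25% of the data (lower whisker).
Between Q1 and Q3 is the middle 50% (IQR).
Between Q3 and the maximum is roughly the top 25% (upper whisker).
The line inside the box is the median; it is not necessarily in the middle of the box.
Here “minimum” and “maximum” mean the ends of the whiskers, which in R's boxplot and geom_boxplot extend by default to the most extreme values within 1.5 × IQR of the box. Values beyond the whiskers are drawn as individual points and treated as potential outliers.
The longer the whiskers, the more spread there is between the values.
The shorter the whiskers, the more clustered the values are.
Whiskers of unequal length suggest a skewed distribution.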
REFER TO SLIDES
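The numbers behind a boxplot can be inspected with base R's `boxplot.stats`. This is a sketch on made-up values; note that R computes the box edges as hinges, which can differ slightly from the quartiles returned by `quantile`.

```r
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 30)   # made-up values; 30 is deliberately extreme

s <- boxplot.stats(x)   # default coef = 1.5, i.e. whiskers reach 1.5 x IQR at most

# Lower whisker end, lower hinge, median, upper hinge, upper whisker end.
s$stats   # 2 4 6 8 9

# Points beyond the whiskers, drawn individually as potential outliers.
s$out     # 30

# The upper whisker cannot extend past this fence.
iqr <- s$stats[4] - s$stats[2]
upper_fence <- s$stats[4] + 1.5 * iqr   # 8 + 1.5 * 4 = 14
```

The upper whisker stops at 9, the largest value below the fence of 14, so 30 is reported separately instead of stretching the whisker.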