Data Visualization
the graphic representation and presentation of data
McCandless Method, 4 Elements of Good Data Visualization
1) Information: The data with which you’re working
2) Story: a clear and compelling narrative or concept
3) Goal: a specific objective or function for the visual
4) Visual Form: An effective use of metaphor or visual expression
Kaiser Fung’s Junk Chart Trifecta Checkup
A well-designed visual answers all three of the following questions at once (i.e., same answer):
1) What is the practical question?
2) What does the data say?
3) What does the visual say?
Marks
Basic visual objects such as points, lines, and shapes. Every mark can be broken down into 4 qualities:
1) Position: Where is a specific mark in space relative to a scale or to other marks? (e.g., if you look at two trends, position lets you compare the pattern of one element to another.)
2) Size: How big, small, long, or tall is the mark? Comparison of object size can be an easy visual interpretation, but problems arise when the human eye inadvertently interprets some objects as appearing to be the same size when they’re not. Controlling the scale of a visual is important even when comparative sizes are not intended to offer info.
3) Shape: Does the shape of a specific object communicate something about it? Rather than using dots or lines, a bit of creativity can enhance how quickly people are able to interpret a visual by using shapes that align with a given application (e.g., instead of dots use person-shaped figures).
4) Color: What color is a mark? Colors can be used either as a simple differentiator of groupings or as a way to communicate other concepts, such as profitable versus unprofitable or hot versus cold.
Channels
Visual aspects or variables that represent characteristics in data (i.e., specialized marks that have been used to visualize data).
1) Accuracy: Are the channels helpful in accurately estimating the values being represented? (e.g., color works well when communicating categorical differences like apples and oranges, but it is less effective when distinguishing quantitative data such as 5 from 5.5.)
2) Popout: How easy is it to distinguish certain values from others?
There are many ways to draw attention to specific parts of a visual (e.g., line length, size, line width, shape, enclosure, hue, and intensity).
3) Grouping: How effective is a channel at communicating groups that exist in data?
Consider the proximity, similarity, enclosure, connectedness, and continuity of the channel.
Remember: The more you emphasize one single thing, the more it stands out. Emphasis diminishes with each additional item you emphasize because the items begin to compete with one another.
Bar Graph
Use size contrast to compare two or more values
x axis: horizontal line
y axis: vertical line
Line Graph
Help your audience understand shifts or changes in your data
Often used to track changes over time.
Pie Chart
Shows how much each part of something makes up the whole.
Maps
Help organize data geographically.
x-axis vs y-axis
x-axis: Horizontal line used to represent categories, time periods, or other variables.
y-axis: Vertical line that usually has a scale of values for variables.
Name types of data visualizations
(https://datavizcatalogue.com/#google_vignette)
Histogram
A chart that shows how often data values fall into certain ranges.
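A minimal sketch of the counting behind a histogram, assuming equal-width ranges (bins). The function name histogram_counts and the exam scores are hypothetical, invented for illustration only.

```python
def histogram_counts(values, bin_start, bin_width, num_bins):
    """Count how many values fall into each equal-width bin.

    Bin i covers [bin_start + i * bin_width, bin_start + (i + 1) * bin_width).
    Values outside all bins are ignored.
    """
    counts = [0] * num_bins
    for v in values:
        index = int((v - bin_start) // bin_width)
        if 0 <= index < num_bins:
            counts[index] += 1
    return counts


# Hypothetical exam scores, made up for this example.
scores = [52, 58, 61, 64, 67, 70, 71, 73, 75, 78, 82, 85, 88, 91, 97]
counts = histogram_counts(scores, bin_start=50, bin_width=10, num_bins=5)

# Print a simple text histogram: one '#' per value in the range.
for i, count in enumerate(counts):
    low = 50 + i * 10
    print(f"{low}-{low + 9}: {'#' * count}")
```

Each row of '#' characters plays the role of a bar: the longer the row, the more values fall into that range.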
Correlation charts
Show relationships among data.
But use these with caution as they can cause viewers to think they show causation.
Correlation
Negative Correlation
Positive Correlation
No Correlation
Correlation in statistics is the measure of the degree to which two variables move in relationship to each other.
An example of correlation is the idea that “As the temperature goes up, ice cream sales also go up.”
It is important to remember that correlation doesn’t mean that one event causes another. But, it does indicate that they have a pattern with or a relationship to each other.
Negative Correlation: If one variable goes up and the other variable goes down, it is a negative or inverse correlation.
Positive Correlation: If one variable goes up and the other variable also goes up, it is a positive correlation.
No Correlation: If one variable goes up and the other variable stays about the same, there is no correlation.
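One common way to put a number on these three cases is the Pearson correlation coefficient, which ranges from -1 (strong negative) through 0 (none) to +1 (strong positive). The notes above do not name a specific measure, so treat this as one possible sketch. All of the data below is hypothetical and invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily observations, made up for this example.
temperature = [20, 22, 25, 27, 30, 33]
ice_cream_sales = [110, 125, 150, 160, 190, 210]   # rises with temperature
hot_drink_sales = [200, 185, 160, 150, 120, 100]   # falls as temperature rises
bus_tickets = [80, 82, 81, 81, 82, 80]             # stays about the same

r_positive = pearson_r(temperature, ice_cream_sales)
r_negative = pearson_r(temperature, hot_drink_sales)
r_none = pearson_r(temperature, bus_tickets)

print(f"positive: {r_positive:.2f}")
print(f"negative: {r_negative:.2f}")
print(f"none:     {r_none:.2f}")
```

A coefficient near +1 or -1 only describes how closely the two variables move together. It says nothing about whether one causes the other.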
Dangers of assuming a causal relationship
When you make conclusions from data analysis, you need to make sure that you don’t assume a causal relationship between elements of your data when there is only a correlation.
Examples:
Cause of disease
For example, pellagra is a disease with symptoms of dizziness, sores, vomiting, and diarrhea. In the early 1900s, people thought that the disease was caused by unsanitary living conditions. Most people who got pellagra also lived in unsanitary environments. But, a closer examination of the data showed that pellagra was the result of a lack of niacin (Vitamin B3). Unsanitary conditions were related to pellagra because most people who couldn’t afford to purchase niacin-rich foods also couldn’t afford to live in more sanitary conditions. But, dirty living conditions turned out to be a correlation only.
Distribution of aid
Here is another example. Suppose you are working for a government agency that provides SNAP benefits. You noticed from the agency’s Google Analytics that people who qualify for the benefits are browsing the official website, but they are leaving the site without signing up for benefits. You think that the people visiting the site are leaving because they aren’t finding the information they need to sign up for SNAP benefits. Google Analytics can help you find clues (correlations), like the same people coming back many times or how quickly people leave the page. One of those correlations might lead you to the actual cause, but you will need to collect additional data, like in a survey, to know exactly why people coming to the site aren’t signing up for SNAP benefits. Only then can you figure out how to increase the sign-up rate.
To avoid mistaking correlation for causation, always:
Critically analyze any correlations that you find
Examine the data’s context to determine whether causation makes sense (and can be supported by all of the data)
Understand the limitations of the tools that you use for analysis
Reverse causality (error)
Is Y causing X rather than X causing Y?
Sample selection error
Who is missing?
Measurement error
How easy is it to measure X and Y?
Omitted variables (error)
Are we forgetting about any variables Z that affect both X and Y?
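The omitted-variable question can be illustrated with a small simulation. This is a sketch under stated assumptions, not a general method: the data is synthetic, Z is built to drive both X and Y, and X and Y never influence each other. Because the data is constructed, Z's contribution is known exactly and can be subtracted out, which is rarely possible with real data.

```python
import math
import random


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


random.seed(0)  # fixed seed so the run is repeatable

# Z is the omitted variable. X and Y each depend on Z plus their own noise.
z = [random.gauss(0, 1) for _ in range(2000)]
x = [zi + random.gauss(0, 0.5) for zi in z]
y = [zi + random.gauss(0, 0.5) for zi in z]

# X and Y look strongly related even though neither causes the other.
r_xy = pearson_r(x, y)

# Subtract Z's known contribution; only the independent noise remains.
x_without_z = [xi - zi for xi, zi in zip(x, z)]
y_without_z = [yi - zi for yi, zi in zip(y, z)]
r_without_z = pearson_r(x_without_z, y_without_z)

print(f"correlation of X and Y:            {r_xy:.2f}")
print(f"correlation after accounting for Z: {r_without_z:.2f}")
```

Once Z is accounted for, the apparent relationship between X and Y largely disappears. This mirrors the pellagra example, where poverty affected both living conditions and diet.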
Static visualizations vs Dynamic visualizations
static visualizations: do not change over time unless they’re edited.
Useful when you want to control your data and the data story.
dynamic visualizations: interactive or change over time.
Helpful if stakeholders want to be able to adjust what they’re able to view.
Line Chart
Used to track changes over short and long periods of time. When smaller changes exist, line charts are better to use than bar graphs. Line charts can also be used to compare changes over the same period of time for more than one group.
Column charts
Use size to contrast and compare two or more values, using height or length to represent the specific values.
Heatmap
Use color to compare categories in a data set. Heatmaps are mainly used to show relationships between two variables and use a system of color-coding to represent different values.
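The color-coding step of a heatmap can be sketched as mapping each value to one entry of an ordered palette. The function name color_bucket, the three-level palette, and the visit counts are all hypothetical, chosen only to illustrate the idea. Real charting tools typically use continuous color scales instead.

```python
def color_bucket(value, low, high, palette):
    """Map a value in [low, high] to one entry of an ordered palette."""
    if high <= low:
        raise ValueError("high must be greater than low")
    # Scale the value to 0..1, clamping anything outside the range.
    fraction = min(max((value - low) / (high - low), 0.0), 1.0)
    # Pick the palette entry; the top of the range goes in the last entry.
    index = min(int(fraction * len(palette)), len(palette) - 1)
    return palette[index]


# Hypothetical site visits: rows are days, columns are times of day.
days = ["Mon", "Tue", "Wed"]
times = ["morning", "afternoon", "evening"]
visits = [
    [3, 12, 25],
    [8, 18, 30],
    [1, 15, 22],
]

palette = ["light", "medium", "dark"]  # ordered from low to high values
grid = [[color_bucket(v, 0, 30, palette) for v in row] for row in visits]

# Print the grid: each cell shows the color that would be drawn.
for day, row in zip(days, grid):
    print(day, row)
```

The two variables here are day and time of day. The color in each cell stands in for the third quantity, the number of visits.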