Lecture 3 - Exploratory Data Flashcards by Jeremy Robertson

What is Visualisation?

The use of graphics to examine data

What is William Cleveland’s graphic philosophy?

A fine balancing act.
- A graphic should display as much information as it can, with the lowest possible cognitive strain to the viewer.
===
Strive for clarity.
- Make the data stand out. Specific tips for increasing clarity include:
- Avoid too many superimposed elements, such as too many curves in the same graphing space.
- Find the right aspect ratio and scaling to properly bring out the details of the data.
- Avoid having the data all skewed to one side or the other of your graph.
===
Visualization is an iterative process.
- Its purpose is to answer questions about the data.
- Different graphics are best suited for answering different questions.

What is exploratory data analysis?

Exploratory data analysis, or EDA for short, is a task that uses
visualisation and transformation to explore your data in a systematic way.

What is the EDA cycle?

EDA is an iterative cycle that involves:
- Generating questions about your data.
- Searching for answers by visualising, transforming, and modelling your data.
- Using what you learn to refine your questions and/or generate new questions.

How to ask good questions?

Like most creative processes, the key to asking quality questions is to
generate a large quantity of questions.

Two types of questions will always be useful for making discoveries within your data:
- What type of variation occurs within my variables?
- What type of covariation occurs between my variables?

What is a variable?

A variable is a quantity, quality, or property that you can measure.

What is a value?

A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.

What is an observation?

An observation is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object).
- An observation will contain several values, each associated with a different variable.
- An observation is also referred to as a data point.

What to look for in histograms and bar charts?

In both bar charts and histograms, tall bars show the common values of a variable, and shorter bars show less-common values.
Places that do not have bars reveal values that were not seen in your data.
===
To turn this information into useful questions, look for anything unexpected:
Which values are the most common? Why?
Which values are rare? Why? Does that match your expectations?
Can you see any unusual patterns? What might explain them?

What is tabular data?

Tabular data is a set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.
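
As a minimal sketch of these terms in R: the data frame below is tidy, with one row per observation and one column per variable. The `patients` name, its columns, and its values are invented purely for illustration.

```r
# Hypothetical measurements: one row per observation, one column per variable.
patients <- data.frame(
  id        = c(1, 2, 3),
  height_cm = c(172, 165, 180),
  smoker    = c("no", "yes", "no")
)

nrow(patients)         # number of observations
ncol(patients)         # number of variables
patients[2, ]          # one observation: a value for every variable
patients$height_cm[2]  # one value: a single cell
```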

Does the data form subgroups?

Clusters of similar values suggest that subgroups exist in your data. To understand the subgroups, ask:

How are the observations within each cluster similar to each other?

How are the observations in separate clusters different from each other?

How can you explain or describe the clusters?

Why might the appearance of clusters be misleading?

How to compare two or more variables?

Covariation is the tendency for the values of two or more variables to vary together in a related way.

The best way to spot covariation is to visualise the relationship between two or more variables. How you do that depends on the types of variables involved. If you have

a continuous variable and a categorical variable – map the categorical variable to an aesthetic such as colour, so that it appears in the legend, or try geom_boxplot;
two categorical variables – try geom_count or geom_tile;
two continuous variables – try geom_point, or geom_bin2d or geom_hex for large datasets; geom_boxplot also works if you first bin one of the variables.
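
The three cases can be sketched as follows. This assumes the ggplot2 package is installed and uses its bundled `diamonds` dataset, which is my choice of example rather than something the lecture specifies; the plot object names are hypothetical.

```r
# Only runs if ggplot2 is available; diamonds ships with the package.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Continuous vs categorical: one boxplot of price per cut.
  p_cont_cat <- ggplot(diamonds, aes(x = cut, y = price)) +
    geom_boxplot()

  # Two categorical variables: point size shows how many rows share each pair.
  p_cat_cat <- ggplot(diamonds, aes(x = cut, y = color)) +
    geom_count()

  # Two continuous variables: 2D bins reduce overplotting in a large dataset.
  p_cont_cont <- ggplot(diamonds, aes(x = carat, y = price)) +
    geom_bin2d()
}
```

Calling `print()` on any of these objects (or typing its name at the console) draws the plot.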

What is variation and covariation?

Variation describes the behavior within a variable.
Covariation describes the behavior between variables.

What questions can you ask for covariation?

Patterns in your data provide clues about relationships. If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:

Could this pattern be due to coincidence (i.e. random chance)?

How can you describe the relationship implied by the pattern?

How strong is the relationship implied by the pattern?

What other variables might affect the relationship?

Does the relationship change if you look at individual subgroups of the data?

How to code histograms

In ggplot2 we use geom_histogram; in base R we use hist.
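
A minimal sketch of both approaches, using the built-in `faithful` dataset as a stand-in for the lecture data. The number of breaks and the binwidth are illustrative guesses, not values from the slides.

```r
# faithful is a built-in dataset; eruptions is the eruption duration in minutes.
x <- faithful$eruptions

# Base R: hist() draws the plot and invisibly returns the bin counts.
h <- hist(x, breaks = 20, main = "Eruption duration", xlab = "Minutes")
sum(h$counts)  # every observation lands in exactly one bin

# ggplot2 equivalent (only runs if the package is installed).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p_hist <- ggplot(faithful, aes(x = eruptions)) +
    geom_histogram(binwidth = 0.25)
}
```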

How to read a histogram?

Values that sit in a long tail, away from the main cluster, tend to be outliers or anomalies.
REFER TO SLIDES

What are the key advantages and disadvantages of histograms?

Advantages:
With the proper binwidth, histograms visually highlight where the data is concentrated, and point out the presence of potential outliers and anomalies.
===
Disadvantages:
- The primary disadvantage of histograms is that you must decide ahead of time how wide the bins are:
- Bins too wide – you can lose information about the shape of the distribution.
- Bins too narrow – the histogram can look too noisy to read easily.
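
To see the bin-width trade-off, the sketch below bins the same variable three ways. The built-in `faithful` data and the three widths are my own illustrative choices; `plot = FALSE` just returns the counts so they can be compared.

```r
x <- faithful$eruptions  # eruption durations, roughly 1.6 to 5.1 minutes

# Same data, three bin widths. plot = FALSE returns the bins without drawing.
wide   <- hist(x, breaks = seq(1.5, 5.5, by = 2),    plot = FALSE)
medium <- hist(x, breaks = seq(1.5, 5.5, by = 0.25), plot = FALSE)
narrow <- hist(x, breaks = seq(1.5, 5.5, by = 0.05), plot = FALSE)

length(wide$counts)    # very few bins: most of the shape is lost
length(medium$counts)  # a moderate number of bins
length(narrow$counts)  # many bins: counts per bin become noisy
```

Dropping `plot = FALSE` draws each histogram, which makes the difference visible directly.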

What is a density plot?

You can think of a density plot as a continuous histogram of a variable, except that the area under the density plot is rescaled to equal one.

Loosely, a point on a density plot corresponds to the fraction of data (or the percentage of data, divided by 100) that takes on a particular value; more precisely, the area under the curve over a range of values gives the fraction of data falling in that range.

This fraction is usually very small.

What are we interested in when looking at a density plot?

When looking at a density plot, we should be more interested in the overall shape of the curve than the actual values on the y-axis.
REFER TO SLIDES FOR CODE
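
The slides are not reproduced here, so the following is only a sketch of one way to draw a density plot in base R, again using the built-in `faithful` data as a stand-in. In ggplot2 the corresponding geom is geom_density.

```r
x <- faithful$eruptions

# Kernel density estimate evaluated on an evenly spaced grid of x values.
d <- density(x)
plot(d, main = "Density of eruption duration")

# Approximate the area under the curve with a Riemann sum; it is close to 1.
area <- sum(d$y) * diff(d$x[1:2])
area
```

The y-axis values themselves are rarely interesting; the check on `area` only illustrates the rescaling mentioned above.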

How to read a density plot?

The peaks typically tell you where most of the data is concentrated. The lower and wider the curve, the more spread out the data is, and the wider it is, the more likely you are to need a transformed scale (for example, a log scale) to read it.

REFER TO SLIDES

When should we use a logarithmic scale?

One should use a logarithmic scale when percent change or change in orders of magnitude is more important than changes in absolute units.

In other words, a log scale should be used to better visualize data that is heavily skewed.
REFER TO SLIDES FOR CODE
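
As a rough illustration (not the code from the slides), the sketch below uses made-up, right-skewed incomes to show why a log scale suits ratio-type changes: equal ratios become equal distances.

```r
# Hypothetical, heavily right-skewed incomes.
income <- c(12000, 25000, 40000, 65000, 150000, 2000000)

# On the raw scale the largest value dominates the axis.
range(income)

# On a log10 scale, doubling from 10,000 and doubling from 1,000,000
# cover the same distance.
step_small <- log10(20000) - log10(10000)
step_large <- log10(2000000) - log10(1000000)
all.equal(step_small, step_large)

# Base R can draw a log axis directly; in ggplot2, add scale_x_log10().
plot(income, seq_along(income), log = "x", ylab = "Index")
```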

How to code a boxplot

In ggplot2 we can use geom_boxplot; in base R we can use boxplot.
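
A minimal sketch of both, assuming the built-in `mtcars` data as an example (fuel economy grouped by number of cylinders); the variable choice is mine, not the lecture's.

```r
# Base R: the formula mpg ~ cyl draws one box of mpg per value of cyl.
b <- boxplot(mpg ~ cyl, data = mtcars,
             xlab = "Cylinders", ylab = "Miles per gallon")
b$names  # one box per group

# ggplot2 equivalent (only runs if the package is installed).
# cyl is numeric, so it is converted to a factor to get one box per group.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
    geom_boxplot()
}
```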

What are the components of a box plot?

Median
- The median (middle quartile) marks the mid-point of the data and is shown by the line that divides the box into two parts. Half of the data points are greater than or equal to this value and half are less.
Inter-quartile range
- The middle “box” represents the middle 50% of data points for the group. The range from lower to upper quartile is referred to as the inter-quartile range (IQR).
Upper quartile
- Seventy-five percent of the data points fall below the upper quartile.
Lower quartile
- Twenty-five percent of the data points fall below the lower quartile.
Whiskers
- The upper and lower whiskers represent data outside the middle 50%.
- Whiskers often (but not always) stretch over a wider range than the middle quartile groups.

How to interpret a boxplot

Between the minimum and Q1 is the bottom 25% (lower whisker)
Between Q1 and Q3 is the middle 50% (the IQR)
Between Q3 and the maximum is the top 25% (upper whisker)
The line inside the box is the median (it need not sit in the exact middle of the box)

Values beyond the whisker ends are outliers: by default R draws each whisker only as far as the most extreme point within 1.5 × IQR of the box, and plots anything further out individually

The longer the whiskers, the more spread out the values are
The shorter the whiskers, the more clustered the values are
Uneven whiskers suggest a skewed distribution

REFER TO SLIDES
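
The numbers behind a boxplot can be inspected with base R's boxplot.stats. The sketch below uses a small made-up vector with one extreme value; note that R reports "hinges", which are close to but not always identical to Q1 and Q3.

```r
# Made-up values with one extreme point.
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 30)

s <- boxplot.stats(x)
s$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out    # points beyond the whiskers, drawn individually as outliers
```

Here the upper whisker stops at 9 rather than at the maximum, because 30 lies more than 1.5 × IQR above the box.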

What are the mean and standard deviation?

The mean is the average of the values. The standard deviation (SD) is a measure that summarises how much the values in a dataset vary from the mean.
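
A short sketch on made-up numbers, computing both by hand and with the built-in functions. Note that R's sd() uses the sample formula, dividing by n - 1.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up values

# Mean: sum of the values divided by how many there are.
m <- sum(x) / length(x)

# Sample standard deviation: square root of the average squared
# deviation from the mean, dividing by n - 1.
s <- sqrt(sum((x - m)^2) / (length(x) - 1))

all.equal(m, mean(x))  # matches the built-in mean()
all.equal(s, sd(x))    # matches the built-in sd()
```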

Why are the median and IQR better than the mean and standard deviation?

The median and IQR measure central tendency and spread, respectively, just as the mean and standard deviation do, but they are robust against outliers and non-normal data.
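
A small demonstration of that robustness on invented numbers: adding one extreme value moves the mean and standard deviation a long way, while the median and IQR barely change.

```r
x <- c(10, 12, 13, 15, 16)  # made-up values
y <- c(x, 500)              # the same values plus one extreme point

# Centre: the mean jumps, the median barely moves.
c(mean(x), mean(y))
c(median(x), median(y))

# Spread: the standard deviation explodes, the IQR barely moves.
c(sd(x), sd(y))
c(IQR(x), IQR(y))
```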

What are the advantages of median and IQR?

Outlier identification: the IQR makes it easy to do an initial estimate of outliers.
Skewness: comparing the median to the quartile values shows whether the data is skewed.
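
Both ideas can be sketched in a few lines. The data is made up, and the 1.5 × IQR cutoff is the common rule of thumb (the same one R's boxplots use by default), assumed here rather than taken from the lecture.

```r
x <- c(1, 2, 2, 3, 3, 4, 5, 8, 20)  # made-up values

q   <- unname(quantile(x, c(0.25, 0.5, 0.75)))  # Q1, median, Q3
iqr <- q[3] - q[1]

# Outlier identification: flag points more than 1.5 * IQR outside the quartiles.
lower <- q[1] - 1.5 * iqr
upper <- q[3] + 1.5 * iqr
outliers <- x[x < lower | x > upper]
outliers

# Skewness: compare the distance from the median to each quartile.
# A longer upper half suggests right skew.
upper_half <- q[3] - q[2]
lower_half <- q[2] - q[1]
upper_half > lower_half
```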

What are the three data types?

Dichotomous: data which has two states, such as anyone in a university being either “staff” or “student”.
===
Nominal: data which has multiple specific values, such as marital status, which can be “married”, “divorced”, “single”, or “separated”.
===
Ordinal: data which has multiple values and an order, such as customer satisfaction feedback, which can be “very unhappy”, “unhappy”, “neutral”, “happy”, or “very happy” (Likert scaling).

What is a factor in R?

A factor stores nominal values as a vector of integers in the range [1...k] (where k is the number of unique values in the nominal variable), together with an internal vector of character strings (the original values) mapped to these integers.
REFER TO SLIDES
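
A minimal sketch of that integer-plus-labels storage, using made-up marital status and feedback values. The second half shows an ordered factor for ordinal data, which is my addition for illustration.

```r
# Nominal data stored as a factor.
status <- c("single", "married", "single", "divorced", "married")
f <- factor(status)

levels(f)      # the k unique values, sorted alphabetically by default
as.integer(f)  # the integer code (1..k) stored for each element

# Ordinal data: give the levels explicitly and mark the factor as ordered.
feedback <- factor(
  c("happy", "neutral", "very happy", "unhappy"),
  levels  = c("very unhappy", "unhappy", "neutral", "happy", "very happy"),
  ordered = TRUE
)
feedback[1] > feedback[2]  # comparisons follow the level order
```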

What is a bar chart?

A bar chart is a histogram for discrete data: it records the frequency of every value of a categorical variable.

How to code a bar chart?

Using ggplot2, we can use geom_bar, and add coord_flip() on top to make the bars horizontal.
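
A sketch of this, assuming the built-in `mtcars` data as an example, with a base R equivalent alongside for comparison.

```r
# Frequency of each value of a categorical variable (number of cylinders).
counts <- table(mtcars$cyl)
counts

# Base R: horizontal bars from the frequency table.
barplot(counts, horiz = TRUE, xlab = "Number of cars", ylab = "Cylinders")

# ggplot2: geom_bar() counts the rows itself; coord_flip() turns the bars sideways.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p_bar <- ggplot(mtcars, aes(x = factor(cyl))) +
    geom_bar() +
    coord_flip()
}
```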

How to sort a bar chart, and why is it done?

REFER TO SLIDES for how to sort a bar chart.
Sorting is done to gain better insight and to make the chart quicker to read (this also applies to dot plots).
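
The slides' method is not shown here, so the following is only one possible approach, sketched on the built-in `mtcars` data: sort the frequency table before plotting in base R, or use reorder() inside the aesthetic mapping in ggplot2.

```r
# Frequencies of a categorical variable (number of carburettors).
counts <- table(mtcars$carb)

# Base R: sort the table so the bars appear from most to least common.
sorted <- sort(counts, decreasing = TRUE)
barplot(sorted, xlab = "Carburettors", ylab = "Number of cars")

# ggplot2: reorder() orders the categories by their frequency.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  df <- as.data.frame(counts)  # columns Var1 (category) and Freq (count)
  p_sorted <- ggplot(df, aes(x = reorder(Var1, Freq), y = Freq)) +
    geom_col() +
    coord_flip()
}
```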

Why are dot plots in some cases better than bar charts (Cleveland)?

Bars are perceptually misleading.
- Bars are two dimensional; a difference in counts looks like a difference in bar areas, rather than merely in bar heights.
- The dot-and-line of a dot plot is not two dimensional; the viewer considers only the height difference when comparing two quantities, as they should.
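
Base R provides dotchart for Cleveland-style dot plots. A minimal sketch, assuming the built-in `mtcars` data as an example and sorting the counts first:

```r
# Sorted frequencies of a categorical variable (number of carburettors).
counts <- sort(table(mtcars$carb))

# One dot per category; position along the x-axis encodes the count.
dotchart(as.numeric(counts), labels = names(counts),
         xlab = "Number of cars", ylab = "Carburettors")
```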

(33 cards)