Some data viz advice

Daniel Anderson

Wednesday, April 8, 2020

1 / 105

#whoami

  • Research Assistant Professor: Behavioral Research and Teaching, University
  • Dad (two daughters: (nearly) 8 and 5 (almost 6))
  • Primary areas of interest
    • 💗💗R💗💗 and computational research
    • Open data, open science, and reproducible workflows
    • Growth modeling, achievement gaps, and variance between educational institutions (particularly spatially)

2 / 105

Resources (free)

3 / 105

Other Resources

  • My classes!
  • Sequence
    • EDLD 651: Introductory Educational Data Science (EDS)
    • EDLD 652: Data Visualization for EDS
    • EDLD 653: Functional Programming for EDS
    • EDLD 654: Machine Learning for EDS
    • Capstone
4 / 105

Where to start?

  • I really recommend moving to R as quickly as possible
5 / 105

ggplot2!

Third edition in progress!

6 / 105

Last note before we really start

  • These slides were produced with R

  • See the source code here

  • The focus of this particular talk is not on the code itself

7 / 105

Different ways of encoding data

8 / 105

Other elements to consider

  • Text

    • How is the text displayed (e.g., font, face, location)?

    • What is the purpose of the text?

  • Transparency

    • Are there overlapping pieces?

    • Can transparency help?

  • Type of data
    • Continuous/categorical
    • Which can be mapped to each aesthetic?
    • e.g., shape and line type can only be mapped to categorical data, whereas color and size can be mapped to either.
9 / 105

Basic Scales

10 / 105

Talk with a neighbor

How would you encode these data into a display?

Month  Day  Location      Temperature
Jan    1    Chicago       25.6
Jan    1    San Diego     55.2
Jan    1    Houston       53.9
Jan    1    Death Valley  51.0
Jan    2    Chicago       25.5
Jan    2    San Diego     55.3
Jan    2    Houston       53.8
Jan    2    Death Valley  51.2
Jan    3    Chicago       25.3
11 / 105

Putting it to practice

12 / 105

Alternative representation

13 / 105

Comparison

  • Both represent three scales

    • Two position scales (x/y axis)
    • One color scale (categorical for the first, continuous for the second)
14 / 105

More scales are possible

15 / 105

Additional scales can become lost without high structure in the data

16 / 105

Thinking more about color

Three fundamental uses

  1. Distinguish groups from each other

  2. Represent data values

  3. Highlight

17 / 105

Discrete items

  • Often no intrinsic order

Qualitative color scale

  • Finite number of colors
    • Chosen to maximize distinctness, while also being equivalent
    • Equivalent
      • No color should stand out
      • No impression of order
18 / 105

Some examples

See more about the Okabe Ito palette origins here: http://jfly.iam.u-tokyo.ac.jp/color/
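A minimal sketch of using a qualitative palette in practice, assuming the commonly published Okabe-Ito hex values. The helper `assign_colors` is a hypothetical name, not part of any library, and the city names come from the temperature table earlier in the deck. It refuses to hand out colors once the palette runs out, because recycling colors would break distinctness.

```python
# Okabe-Ito qualitative palette (hex values as commonly published).
OKABE_ITO = {
    "orange": "#E69F00",
    "sky blue": "#56B4E9",
    "bluish green": "#009E73",
    "yellow": "#F0E442",
    "blue": "#0072B2",
    "vermillion": "#D55E00",
    "reddish purple": "#CC79A7",
    "black": "#000000",
}


def assign_colors(categories):
    """Map each distinct category to one palette color, in order of first appearance."""
    distinct = list(dict.fromkeys(categories))  # de-duplicate while keeping order
    if len(distinct) > len(OKABE_ITO):
        raise ValueError("more categories than palette colors")
    return dict(zip(distinct, OKABE_ITO.values()))


cities = ["Chicago", "San Diego", "Houston", "Death Valley", "Chicago"]
city_colors = assign_colors(cities)
```

Here `city_colors` holds four entries, one per distinct city, and no city shares a color with another.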

19 / 105

Sequential scale examples

Colors to represent continuous values

20 / 105

Diverging palettes

21 / 105

Earth palette

22 / 105

23 / 105

Common problems with color

Too many

More than 5-ish categories generally become too difficult to track

24 / 105

Use labels

still too many...

25 / 105

Better

Get a subset
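One possible way to get a subset, sketched under an assumption: the slides do not say how the subset was chosen, so this example simply keeps the rows belonging to the `n` most frequent categories. The function name `subset_top_categories` and the toy rows are hypothetical.

```python
from collections import Counter


def subset_top_categories(rows, n=5):
    """Keep only rows whose category is among the n most frequent categories.

    rows is a list of (category, value) pairs.
    """
    counts = Counter(category for category, _ in rows)
    keep = {category for category, _ in counts.most_common(n)}
    return [(category, value) for category, value in rows if category in keep]


# Hypothetical data: category "a" appears 3 times, "b" twice, "c" once.
rows = [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5), ("c", 6)]
subset = subset_top_categories(rows, n=2)
```

With `n=2`, the single row for category "c" is dropped and the rest keep their original order. Other selection rules (largest mean, categories of substantive interest) may suit a given plot better.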

26 / 105

Best

(but could still be improved)

27 / 105

Problem with default ggplot2 palette

28 / 105

Alternative: viridis

29 / 105

Revised version

30 / 105

Last few notes on palettes

  • Do some research, find what you like and what tends to work well

  • Check for colorblindness

  • Look into http://colorbrewer2.org/

31 / 105

Data ink ratio

32 / 105

What is it?

Above all else, show the data


-Edward Tufte

  • Data-Ink Ratio = Ink devoted to the data / total ink used to produce the figure

  • Common goal: Maximize the data-ink ratio
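The ratio above is plain arithmetic, which a short sketch can make concrete. "Ink" is measured here in arbitrary units (for instance, counts of non-background pixels), which is an assumption of this example rather than something the slides specify. The function name and the numbers are hypothetical.

```python
def data_ink_ratio(data_ink, total_ink):
    """Share of a figure's ink that is devoted to the data (between 0 and 1)."""
    if total_ink <= 0:
        raise ValueError("total_ink must be positive")
    if not 0 <= data_ink <= total_ink:
        raise ValueError("data_ink must be between 0 and total_ink")
    return data_ink / total_ink


# Hypothetical figure: the data marks stay the same, the decoration is trimmed.
before = data_ink_ratio(data_ink=300, total_ink=1200)  # heavy gridlines, borders
after = data_ink_ratio(data_ink=300, total_ink=400)    # most non-data ink removed
```

Removing non-data ink raises the ratio from 0.25 to 0.75 without touching the data marks themselves.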

33 / 105

Example

  • First thought might be... Cool!
34 / 105
35 / 105

Minimize cognitive load

  • Empirically, Tufte's plot was the most difficult for viewers to interpret.

  • Visual cues (labels, gridlines) reduce the data-ink ratio, but can also reduce cognitive load.

36 / 105

An example

Which do you prefer?

37 / 105

Advice from Wilke

Whenever possible, visualize your data with solid, colored shapes rather than with lines that outline those shapes. Solid shapes are more easily perceived, are less likely to create visual artifacts or optical illusions, and do more immediately convey amounts than do outlines.

emphasis added

38 / 105

Another example

39 / 105

40 / 105

Labels in place of legends

The prior slide is a great example of when annotations can be used in place of a legend to:

  • reduce cognitive load
  • increase clarity
  • increase beauty
  • maximize the figure size
41 / 105

Practical advice so far

Avoid line drawings

Maximize the data-ink ratio within reason (but prioritize reducing cognitive load)

Use color to your advantage (and think critically about the palettes you choose)

Consider plot annotations over legends

42 / 105

Grouped data

Distributions

How do we display more than one distribution at a time?

43 / 105

Boxplots

44 / 105

Violin plots

45 / 105

Jittered points

46 / 105

Sina plots

47 / 105

Stacked histograms

48 / 105

Overlapping densities

49 / 105

Ridgeline densities

50 / 105

Quick empirical examples

51 / 105

Titanic data

## # A tibble: 1,313 x 5
##   name                                          class   age sex    survived
##   <chr>                                         <chr> <dbl> <chr>     <int>
## 1 Allen, Miss Elisabeth Walton                  1st   29    female        1
## 2 Allison, Miss Helen Loraine                   1st    2    female        0
## 3 Allison, Mr Hudson Joshua Creighton           1st   30    male          0
## 4 Allison, Mrs Hudson JC (Bessie Waldo Daniels) 1st   25    female        0
## 5 Allison, Master Hudson Trevor                 1st    0.92 male          1
## 6 Anderson, Mr Harry                            1st   47    male          1
## # … with 1,307 more rows
52 / 105

Boxplots

53 / 105

Violin plots

54 / 105

Jittered point plots

55 / 105

Sina plot

56 / 105

Stacked histogram

🤨

57 / 105

Dodged

58 / 105

Better

59 / 105

Overlapping densities

Note the default colors really don't work well in most of these

60 / 105

61 / 105

Ridgeline densities

62 / 105

Visualizing amounts

63 / 105

Bar plots

64 / 105

Flipped bars

65 / 105

Dotplot

66 / 105

Heatmap

67 / 105

A short journey

How much does college cost?

68 / 105

Tuition data

## # A tibble: 6 x 13
##   State `2004-05` `2005-06` `2006-07` `2007-08` `2008-09` `2009-10` `2010-11`
##   <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 Alab…  5682.838  5840.550  5753.496  6008.169  6475.092  7188.954  8071.134
## 2 Alas…  4328.281  4632.623  4918.501  5069.822  5075.482  5454.607  5759.153
## 3 Ariz…  5138.495  5415.516  5481.419  5681.638  6058.464  7263.204  8839.605
## 4 Arka…  5772.302  6082.379  6231.977  6414.900  6416.503  6627.092  6900.912
## 5 Cali…  5285.921  5527.881  5334.826  5672.472  5897.888  7258.771  8193.739
## 6 Colo…  4703.777  5406.967  5596.348  6227.002  6284.137  6948.473  7748.201
## # … with 5 more variables: `2011-12` <dbl>, `2012-13` <dbl>, `2013-14` <dbl>,
## # `2014-15` <dbl>, `2015-16` <dbl>
69 / 105

By state: 2015-16

🤮🤮🤮

70 / 105

Two puke emoji version

🤮🤮

71 / 105

One puke emoji version

🤮

72 / 105

Kinda smiley version

😏

73 / 105

Highlight Oregon

🙂

74 / 105

Not always good to sort

75 / 105

Much better

76 / 105

Heatmap

77 / 105

Better heatmap

78 / 105

Even better heatmap

79 / 105
80 / 105

Quick aside

  • Think about the data you have
  • Given that these are state-level data, they have a geographic component
81 / 105
82 / 105

Some things to avoid

83 / 105

Line drawings

As discussed earlier

😫

Change the fill

84 / 105

85 / 105

Much worse

Unnecessary 3D

86 / 105

Much worse

Unnecessary 3D

87 / 105

Horrid example

Used relatively regularly

88 / 105

Pie charts

Especially with lots of categories

89 / 105

Alternative representation

90 / 105

A case for pie charts

  • The number of categories is low
  • Differences are relatively large
  • Familiar for some audiences

91 / 105

The anatomy of a pie chart

Pie charts are just stacked bar charts with a radial coordinate system
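A small sketch of that equivalence, with hypothetical values and function names: first lay the categories out as segments of one stacked bar scaled to the range 0 to 1, then bend the bar into a circle by turning each position into an angle.

```python
def stacked_segments(values):
    """Return (start, end) positions of each segment of one stacked bar, scaled to 0-1."""
    total = sum(values)
    segments = []
    start = 0.0
    for value in values:
        end = start + value / total
        segments.append((start, end))
        start = end
    return segments


def to_angles(segments):
    """Radial coordinates: a position along the bar becomes an angle in degrees."""
    return [(start * 360, end * 360) for start, end in segments]


values = [50, 25, 25]            # hypothetical category totals
bar = stacked_segments(values)   # segments of the stacked bar
pie = to_angles(bar)             # the same segments as pie wedges
```

The wedges carry exactly the same proportions as the bar segments; only the coordinate system changes. The first category fills half the bar and therefore spans 180 degrees of the pie.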

92 / 105

Horizontal

93 / 105

My preference

94 / 105

Dual axes

  • One exception - if second axis is a direct transformation of the first
    • e.g., Miles/Kilometers, Fahrenheit/Celsius
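To illustrate the exception, here is a sketch of a second axis that is a direct transformation of the first: the Celsius labels are computed from the Fahrenheit tick positions, so both axes describe the same data and cannot be rescaled independently. The tick values are hypothetical.

```python
def fahrenheit_to_celsius(fahrenheit):
    """Standard conversion from degrees Fahrenheit to degrees Celsius."""
    return (fahrenheit - 32) * 5 / 9


# Tick positions on the primary (Fahrenheit) axis.
primary_ticks_f = [32, 50, 68, 86, 104]

# Labels for the secondary axis, drawn at the very same positions.
secondary_labels_c = [round(fahrenheit_to_celsius(f)) for f in primary_ticks_f]
```

Each Fahrenheit tick gets one Celsius label (here 0, 10, 20, 30, 40). Contrast this with two unrelated variables on two axes, where the relative scaling is an arbitrary choice that can make any pair of series appear to move together.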

See many examples here: http://www.tylervigen.com/spurious-correlations

95 / 105

Truncated axes

96 / 105

97 / 105

Not always a bad thing

It is tempting to lay down inflexible rules about what to do in terms of producing your graphs, and to dismiss people who don’t follow them as producing junk charts or lying with statistics. But being honest with your data is a bigger problem than can be solved by rules of thumb about making graphs. In this case there is a moderate level of agreement that bar charts should generally include a zero baseline (or equivalent) given that bars encode their variables as lengths. But it would be a mistake to think that a dot plot was by the same token deliberately misleading, just because it kept itself to the range of the data instead.
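The point about bars encoding values as lengths can be checked with a little arithmetic. This sketch uses hypothetical values and a hypothetical helper, `bar_length_ratio`, to compare how much longer one bar looks than another with and without a zero baseline.

```python
def bar_length_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar lengths for values a and b when the axis starts at baseline."""
    if baseline >= min(a, b):
        raise ValueError("baseline must be below both values")
    return (a - baseline) / (b - baseline)


# Hypothetical values: the first is 1.5 times the second.
zero_based = bar_length_ratio(60, 40)              # axis starts at 0
truncated = bar_length_ratio(60, 40, baseline=30)  # axis starts at 30
```

With a zero baseline the bars have the true 1.5 to 1 ratio; starting the axis at 30 makes the first bar three times as long as the second. A dot plot encodes position rather than length, which is why restricting it to the range of the data does not distort in the same way.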

98 / 105

99 / 105

100 / 105

Scaling issues

101 / 105

Poor binning choices
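A quick sketch of why bin width matters, using hypothetical ages and a hypothetical helper, `histogram_counts`. The same six values produce very different pictures depending on the width chosen.

```python
def histogram_counts(values, bin_width, start=0):
    """Count values per bin; keys are the left edges of the non-empty bins."""
    counts = {}
    for value in values:
        left_edge = start + ((value - start) // bin_width) * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items()))


ages = [1, 2, 25, 29, 30, 47]  # hypothetical ages

too_wide = histogram_counts(ages, bin_width=50)    # everything in a single bar
reasonable = histogram_counts(ages, bin_width=10)  # some shape is visible
too_narrow = histogram_counts(ages, bin_width=1)   # one bar per observation
```

Bins that are too wide hide the shape of the distribution, and bins that are too narrow show only noise. Trying several widths before settling on one is usually worthwhile.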

102 / 105

Conclusions

Practical takeaways to make better visualizations

  1. Avoid line drawings

  2. Sort bar charts in ascending/descending order as long as the other axis does not have implicit meaning

  3. Consider dropping legends and using annotations, when possible

  4. Use color to your advantage, but be sensitive to color-blindness, and use the right kind of palette

  5. Consider double-encoding data (shapes and color)

  6. Make your labels bigger! We didn't talk about this one much, but labels that are too small are a super common problem, and fixing them is really important
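The sorting advice in item 2 can be sketched as a small rule: sort bars by value, unless the categories carry their own order (such as academic years), in which case keep that order. The function name `order_bars` and all values below are hypothetical.

```python
def order_bars(values_by_category, ordered=False, descending=True):
    """Return (category, value) pairs in the order the bars should be drawn.

    ordered=True means the categories have implicit meaning (e.g., time),
    so the given order is kept instead of sorting by value.
    """
    items = list(values_by_category.items())
    if ordered:
        return items
    return sorted(items, key=lambda item: item[1], reverse=descending)


# Unordered categories: sort by value.
by_group = order_bars({"A": 3, "B": 7, "C": 5})

# Ordered categories (academic years): keep the given order.
by_year = order_bars({"2013-14": 9, "2014-15": 8, "2015-16": 10}, ordered=True)
```

`by_group` comes back as B, C, A (largest first), while `by_year` stays in chronological order even though its values are not monotonic.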

103 / 105

Some things to avoid

  • Essentially never

    • Use dual axes (produce separate plots instead)

    • Use 3D unnecessarily

  • Be wary of

    • Truncated axes

    • Pie charts (particularly with lots of categories)

104 / 105
