This article is a how-to-plot page covering the most frequently used charts, styled with the Profinit color theme. We start with distributions, then move on to proportions and relations. Each topic has an initial setup followed by a couple of collapsed sections describing various use-cases. Code for ggplot2 is provided throughout; some of the charts are also covered by base R graphics code.

If you find a bug or want to suggest an edit or contribution, feel free to create a pull request or raise an issue in the issue tracker.

It is not the purpose of this page to cover all use-cases, though. For a more detailed guide on how to design a good chart, take a look at Fundamentals of Data Visualization (available online or in Profinit’s library).

Setup

As a toy dataset, let’s use the dplyr::starwars dataset of Star Wars characters. Beware, it contains information from the first 7 films in the series.

# load packages
library(tidyverse)
library(profiplots)
library(ggalluvial)
library(ggrepel)

# set the aesthetics (theme) of plots
profiplots::set_theme(pal_name = "blue-red", pal_name_discrete="discrete")

movie_series <- c(
  "The Phantom Menace",
  "Attack of the Clones",
  "Revenge of the Sith",
  "A New Hope",
  "The Empire Strikes Back",
  "Return of the Jedi",
  "The Force Awakens"
)

get_movie_order <- function(movie_names) {
  purrr::map_dbl(movie_names, function(mn) which(mn == movie_series))
}

# prepare dataset: Star Wars characters
sw <- 
  dplyr::starwars %>% 
  mutate(
    bmi = mass/(height/100)^2,
    is_droid = forcats::fct_explicit_na(if_else(sex == "none", "Droid", "Other"), "N/A"),
    first_film = purrr::map_chr(films, function(movies) {
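      movie_ord <- get_movie_order(movies)
      movies[which.min(movie_ord)]
    }),
    # `levels` (not `labels`) keeps each film name attached to its own value and sets the series order
    first_film = factor(first_film, levels = movie_series, ordered = TRUE),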
    been_in_jedi = purrr::map_lgl(films, ~"Return of the Jedi" %in% .),
    n_films = purrr::map_dbl(films, length)
  )
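
Later snippets reference `is_droid_color_mapping`, which is not defined in the setup shown here. A minimal sketch of what such a mapping might look like, assuming it is a named character vector keyed by the levels of `is_droid`. The hex codes are placeholders chosen for illustration, not official Profinit colors; with profiplots loaded you would probably build it from `profinit_cols()` instead.

```r
# Hypothetical definition: the name comes from the later snippets,
# the values are an assumption made for this sketch.
is_droid_color_mapping <- c(
  "Droid" = "#C8102E",  # highlighted category (placeholder red)
  "Other" = "#4A5A9B",  # baseline category (placeholder blue)
  "N/A"   = "#9E9E9E"   # unknown status (placeholder grey)
)
```

Because the vector is named, `scale_fill_manual(values = is_droid_color_mapping)` matches colors to levels by name, so the mapping stays the same across charts regardless of level order.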

Distributions

Barplot

Use-case: Visualizing the distribution of a discrete variable.

ggplot

plt <- 
  sw %>% 
  mutate(
    gender = forcats::fct_explicit_na(gender),  # Make the NA's be obvious (new level)
    gender = forcats::fct_infreq(gender),       # in case of `nominal` values, sort according to frequency
  ) %>% 
  ggplot(aes(x = gender)) + 
  stat_count(geom = "bar") + 
  labs(
    x = "Character gender",
    y = "Count",
    title = "Gender distribution among StarWars characters"
  )
plt

baseR

# TODO
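
Until the base R version is filled in, here is one possible sketch of the same chart using `barplot()`. It reads `dplyr::starwars` directly so that it runs on its own, and it uses a placeholder color; with profiplots loaded, `profinit_cols("blue")` would be the natural replacement.

```r
# Replace missing genders with an explicit label so they get their own column
gender <- dplyr::starwars$gender
gender[is.na(gender)] <- "(Missing)"

# Nominal variable: order the columns by frequency
gender_counts <- sort(table(gender), decreasing = TRUE)

barplot(
  gender_counts,
  col = "steelblue",   # placeholder; e.g. profinit_cols("blue") with profiplots loaded
  border = NA,         # no column borders, keep the chart simple
  main = "Gender distribution among StarWars characters",
  xlab = "Character gender",
  ylab = "Count"
)
```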

Tips

Colors
- Avoid rainbow, gradient and other multicolored versions of a barplot. Typically, there is no need to map the same variable to both the x and fill scales.
- Use Profinit’s grey, blue or red (depending on the report color theme; be consistent).
Ordering
- Keep the natural level order for ordinal values (e.g., age group, education).
- Use frequency ordering for nominal values (e.g., gender).
Long labels
- Do not rotate the column labels. Instead, you can either:
  - Use a horizontal version of the barplot,
  - Wrap the category labels (e.g., stringr::str_wrap()),
  - Use labels within the plot.
Too high values
- Do not truncate the columns!
- If you need a truncated y axis, use, e.g., dots (geom = "point") instead. Be aware that a truncated axis can still mislead.
- For better axis annotation (numbers), you can use the scales::number formatter.
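
The long-labels advice above can be sketched as follows: a horizontal barplot with wrapped category labels. This is only an illustration; the choice of `homeworld`, the cut-off of eight categories and the wrap width of 10 characters are assumptions made for the example.

```r
library(dplyr)
library(ggplot2)

# Eight most common homeworlds; `label` wraps long names onto multiple lines
homeworld_counts <-
  dplyr::starwars %>%
  filter(!is.na(homeworld)) %>%
  count(homeworld, sort = TRUE) %>%
  slice_head(n = 8) %>%
  mutate(label = stringr::str_wrap(homeworld, width = 10))

plt <-
  homeworld_counts %>%
  ggplot(aes(x = n, y = reorder(label, n))) +  # categories on y give horizontal bars
  geom_col() +
  labs(
    x = "Count",
    y = "Homeworld",
    title = "Most common homeworlds of StarWars characters"
  )
plt
```

Putting the categories on the y axis leaves the labels horizontal, so there is no need to rotate them.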

Histogram

Use-case: Visualizing the distribution of a continuous variable.

ggplot

Single population distribution:

plt <- 
  sw %>% 
  ggplot(aes(x = height)) + 
  stat_bin(geom = "bar", bins = 20) +  
  labs(
    x = "Height [cm]",
    y = "Count",
    title = "Height distribution of StarWars characters"
  )
plt

baseR

hist(
  x = sw$height,
  breaks = 20,                 # (optional) tweak default setting of bins number
  border = NA,                 # bins border color, NA to turn it off
  col = profinit_cols("blue"), # bins fill color; prefer `profinit_cols()` "blue", "red" or "grey"
  main = "Distribution of heights of StarWars characters",
  xlab = "Height [cm]",        # do not forget to mention units
  ylab = "Count"
  # TODO: change axes style
  # TODO: add grid
)

Tips

Colors:
- Use either the default color, profinit_cols("blue") or profinit_cols("red") (depending on your report’s color palette).
- You may use another color (preferably from the profinit_cols() palette) to stay consistent with a subpopulation color mapping used elsewhere (e.g., when drawing a single subpopulation).
- In general, avoid:
  - Rainbow, gradient etc. (there is no need to map any variable to color at all).
  - Edge color mapping. Keep the chart as simple as possible.
Bins:
- Bin sizes (and position) might affect the appearance significantly. Tweak the
Subgroups
- Avoid stacking (subgroups on top of each other); the groups on top are hard to read.
- Avoid dodging, the x axis is continuous & subgroups would
- Use either
  - KDE plot (with transparency)
  - overlapping histograms + transparency
  - lineplot
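
One way to draw the overlapping-histograms option is sketched below. Splitting by `gender` and using `alpha = 0.5` are assumptions made for the example; any subgroup variable works the same way.

```r
library(dplyr)
library(ggplot2)

plt <-
  dplyr::starwars %>%
  filter(!is.na(gender), !is.na(height)) %>%
  ggplot(aes(x = height, fill = gender)) +
  # position = "identity" overlays the subgroups instead of stacking them
  geom_histogram(position = "identity", alpha = 0.5, bins = 20) +
  labs(
    x = "Height [cm]",
    y = "Count",
    fill = "Gender",
    title = "Height distribution by gender",
    caption = "Characters with known gender and height only"
  )
plt
```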

KDE

Use-case: Visualizing the distribution of a continuous variable for a skilled audience. Especially useful when multiple subgroups are plotted on one chart.

ggplot

plt <- 
  sw %>% 
  filter(mass < 1000) %>% 
  ggplot(aes(x = bmi)) + 
  stat_density() +  
  labs(
    x = "Body mass index",
    y = "Density",
    title = "BMI distribution of StarWars characters",
    caption = "Characters under 1000 kg only."
  )
plt

baseR

# TODO: provide more straightforward approach

height_density <- density(sw$height, na.rm = TRUE)
plot(
  height_density, 
  col = NA,                      # hide the outline; the area is filled by polygon() below
  main = "Height distribution of StarWars characters",
  xlab = "Height [cm]",
  ylab = "Density"
)
polygon(
  x = height_density,
  col = profinit_cols("blue"), 
  border = NA
)

Tips

To visualize a single continuous variable (that is, without any subgroups), we prefer histograms (people are more familiar with them).
Make sure your readers can read a KDE plot before you include one in your report.
Avoid stacking densities unless the cumulative density is what you want to show.
- Set some transparency in case of overlapping densities (subgroups).
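
A minimal sketch of overlapping (not stacked) densities with transparency. The split by `gender` and the alpha value of 0.4 are assumptions made for the example.

```r
library(dplyr)
library(ggplot2)

plt <-
  dplyr::starwars %>%
  filter(!is.na(gender), !is.na(height)) %>%
  ggplot(aes(x = height, fill = gender)) +
  # geom_density() uses position = "identity" by default, so the curves overlap
  geom_density(alpha = 0.4, color = NA) +
  labs(
    x = "Height [cm]",
    y = "Density",
    fill = "Gender",
    title = "Height distribution by gender",
    caption = "Characters with known gender and height only"
  )
plt
```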

Proportions

Single Variable

Use-case: Visualizing proportions of category levels (while avoiding a pie chart).

ggplot

Use stat_count to get summary stats out of the raw dataset.
- Otherwise, aggregate the dataset upfront & use stat_identity.
Use the bar geom (it is the default, therefore I’m not specifying it here).
Get rid of the x axis (it is redundant). There still needs to be a mapping, so we use x = 1 here.
Use color mapping.
- Here we’re using a droid color mapping introduced above (for consistency reasons).
You can annotate the sections directly (shown in one of the use-cases below), in which case you can get rid of the legend completely.
Again, you can highlight just one level etc. (see the sections above for more customization ideas).

plt <- 
  sw %>% 
  filter(is_droid != "N/A") %>% 
  ggplot() +
  aes(fill = is_droid, x = 1, y = after_stat(count)) +         # after_stat() replaces the deprecated ..count.. notation
  stat_count(position = "stack") + 
  guides(x = "none") + 
  scale_fill_manual(values = is_droid_color_mapping) + 
  scale_y_continuous(breaks = seq(0, 100, 10)) +              # customize Y axis ticks position
  labs(
    x = NULL,
    y = "Character count",
    fill = "Character type",
    title = "Proportion of droids among SW characters",
    subtitle = "Based on dplyr::starwars dataset",
    caption = "Characters with known status only"
  ) + 
  theme(
    legend.position = "bottom"
  )

plt
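
The alternative mentioned in the notes (aggregate upfront, then plot with the identity stat) might look like the sketch below. It rebuilds the droid flag from `dplyr::starwars` so that it runs on its own; `droid_counts` is a name introduced for this example, and the default fill colors are left in place instead of the droid color mapping.

```r
library(dplyr)
library(ggplot2)

# Aggregate first: one row per category with its count
droid_counts <-
  dplyr::starwars %>%
  filter(!is.na(sex)) %>%
  mutate(is_droid = if_else(sex == "none", "Droid", "Other")) %>%
  count(is_droid, name = "n")

plt <-
  ggplot(droid_counts, aes(x = 1, y = n, fill = is_droid)) +
  geom_col(position = "stack") +   # geom_col() is the bar geom with the identity stat
  guides(x = "none") +
  labs(
    x = NULL,
    y = "Character count",
    fill = "Character type",
    caption = "Characters with known status only"
  )
plt
```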

base R

TODO

Tips

Avoid pie charts!
For proportions of one variable among levels of another category, see the Relations section.
For nested proportions, see the Fundamentals of Data Visualization chapter with tips on how to do that (code not available).

Two variables, proportion of two categories

Use-case: Visualizing proportions of category levels in different subgroups based on another variable.

In this case, the best way is to use side-by-side stacked barplots (with fill option).

ggplot

sw %>% 
  filter(!is.na(gender)) %>% 
  mutate(is_droid = forcats::fct_rev(is_droid)) %>% 
  ggplot() + 
  aes(x = gender, fill = is_droid) + 
  stat_count(position = position_fill()) + 
  scale_y_continuous(breaks = seq(0, 1, .1), labels = scales::percent) + 
  scale_fill_manual(values = is_droid_color_mapping) + 
  labs(
    title = "Droid proportion is the same across Gender",
    x = "Gender",
    y = "Proportion of droids",
    fill = "Character type",
    caption = "Characters with known Gender only"
  )

Add horizontal line

E.g., to highlight the population mean.

droid_prop_overall <- mean(sw$is_droid == "Droid", na.rm = TRUE)
droid_prop_overall_label <- paste0("Overall mean: ", scales::percent(droid_prop_overall, accuracy = .01))

sw %>% 
  filter(!is.na(gender)) %>% 
  mutate(is_droid = forcats::fct_rev(is_droid)) %>% 
  ggplot() + 
  aes(x = gender, fill = is_droid) + 
  stat_count(position = position_fill()) + 
  geom_hline(yintercept = droid_prop_overall, linetype = "dashed", color = profinit_cols("grey")) +
  annotate(x = 2.1, y = droid_prop_overall - .01, geom = "text", label = droid_prop_overall_label, size = 2.5) + 
  scale_y_continuous(breaks = seq(0, 1, .1), labels = scales::percent) + 
  scale_fill_manual(values = is_droid_color_mapping) + 
  labs(
    title = "Droid proportion is the same across Gender",
    x = "Gender",
    y = "Proportion of droids",
    fill = "Character type",
    caption = "Characters with known Gender only"
  )

base R

TODO

Tips

Reorder the levels so that the most important category is at the bottom of the plot (let it come first in the input data in ggplot) to ease the comparison.

Two variables, proportion of 3+ categories

ggplot

Use position_dodge with preserve = "single" to have the same proportion of column widths even if a level is missing.
Use position_dodge2 if you prefer to have spaces between columns.
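
A minimal sketch of the dodged barplot described above. The variables are chosen for illustration only: `gender` goes on the x axis and `sex` (which has more than two levels) on fill.

```r
library(dplyr)
library(ggplot2)

plt <-
  dplyr::starwars %>%
  filter(!is.na(gender), !is.na(sex)) %>%
  ggplot(aes(x = gender, fill = sex)) +
  # preserve = "single" keeps every column the same width,
  # even when some sex levels are missing for a given gender
  geom_bar(position = position_dodge(preserve = "single")) +
  labs(
    x = "Gender",
    y = "Count",
    fill = "Sex",
    caption = "Characters with known gender and sex only"
  )
```

Swapping in `position_dodge2(preserve = "single")` gives the same layout with small gaps between the columns.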

plt

base R

TODO

Tips

Do not use stacked barplots (optionally with the fill position) to compare proportions of more than two levels among multiple categories.
If the x-variable is continuous, you can use a stacked density plot.

Relations

Scatterplot

Use-case: Visualizing the relationship of two numeric variables. Visualizing a trend (target ~ regressor).

ggplot

plt <-
  sw %>% 
  filter(mass < 1e3) %>% 
  ggplot(aes(x = height, y = mass)) + 
  geom_point() + 
  labs(
    x = "Height [cm]",
    y = "Weight [kg]",
    title = "Height ~ weight relation of StarWars characters",
    caption = "Characters weighing less than 1 t"         # Indicate population filters!
  )
plt

base R

# TODO
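
Until the base R version is filled in, one possible sketch with `plot()` is shown below. It reads `dplyr::starwars` directly and uses a placeholder color; `sw_light` is a name introduced for this example.

```r
# Keep characters with known height and mass, lighter than 1000 kg
sw_light <- subset(
  dplyr::starwars,
  !is.na(mass) & !is.na(height) & mass < 1000
)

plot(
  x = sw_light$height,
  y = sw_light$mass,
  pch = 16,                                       # filled circles
  col = adjustcolor("steelblue", alpha.f = 0.6),  # placeholder color with transparency
  main = "Height ~ weight relation of StarWars characters",
  xlab = "Height [cm]",
  ylab = "Weight [kg]"
)
```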

Tips

In case of too many points, you can either:
- Use smaller symbols
- Use transparency
- Use 2D density plot (see below)
To annotate points, you can use geom_text_repel (and geom_label_repel) from the ggrepel package. This geom automatically tries to resolve overlapping for you.
To highlight a trend, you can add a model fitted line (geom_smooth) or an arbitrary line (geom_abline, geom_vline and geom_hline).
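
The tips above can be combined in one chart, sketched here under a few assumptions: transparency for the points, a linear fit as the trend line, and labels for only the three heaviest characters. `sw_light` and `heaviest` are names introduced for this example.

```r
library(dplyr)
library(ggplot2)

sw_light <- dplyr::starwars %>% filter(mass < 1e3, !is.na(height))
heaviest <- sw_light %>% slice_max(mass, n = 3)   # points to annotate

plt <-
  ggplot(sw_light, aes(x = height, y = mass)) +
  geom_point(alpha = 0.6) +                                   # transparency helps with overlaps
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +   # fitted linear trend
  ggrepel::geom_text_repel(data = heaviest, aes(label = name)) +
  labs(
    x = "Height [cm]",
    y = "Weight [kg]",
    caption = "Characters weighing less than 1 t"
  )
```

Labelling only a handful of points keeps the chart readable; `geom_text_repel()` then has enough room to place the labels without overlaps.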

2D Density

Use-case: Visualizing relationship of two numeric variables with too many observations.

With too many observations, the details get lost in a mass of overlapping points. You can try setting the alpha low enough and use a scatterplot anyway (see above), but it is often more convenient to rely on a 2D density plot.

ggplot

plt <- 
  sw %>% 
  filter(mass < 1000) %>% 
  ggplot(aes(x = height, y = mass)) +
  stat_density2d_filled() +  
  scale_fill_profinit("blues", reverse = TRUE) + 
  labs(
    x = "Height [cm]",
    y = "Mass [kg]",
    caption = "Characters below 1000kg only",
    title = "Height ~ Mass relationship among SW Characters"
  )

base R

# TODO

Tips

Colors
- Use a light color for low values so that they do not clutter the graph and the high values stand out.
- Use a bright (but not too light) color for high values.
You can discretize the variables first & use the bin_2d geom (geom_bin_2d).
You may try the hexbin geom (geom_hex) as well. It should be a bit more appealing, but it requires an extra package (hexbin) to be installed.
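
A sketch of the binned alternative with `geom_bin_2d()`, following the color advice above (light for low counts, darker for high counts). The two hex colors and the number of bins are placeholders picked for the example, not Profinit palette values.

```r
library(dplyr)
library(ggplot2)

sw_light <- dplyr::starwars %>% filter(mass < 1000, !is.na(height))

plt <-
  ggplot(sw_light, aes(x = height, y = mass)) +
  geom_bin_2d(bins = 10) +                                   # count observations per rectangular bin
  scale_fill_gradient(low = "#DCE6F5", high = "#1F3A8A") +   # placeholder light-to-dark blues
  labs(
    x = "Height [cm]",
    y = "Mass [kg]",
    fill = "Count",
    caption = "Characters below 1000kg only"
  )
```

Replacing `geom_bin_2d()` with `geom_hex()` gives hexagonal bins, provided the hexbin package is installed.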

Heatmap

Use-case: Visualizing the relationship of two discrete (or discretized) variables, e.g., counts or another summary per combination of levels.

ggplot

plt <- sw %>% 
  group_by(first_film, gender) %>% 
  summarise(n = n(), .groups = "drop") %>%   # drop the grouping explicitly to avoid the summarise() message
  ggplot(aes(x = gender, y = first_film, fill = n)) + 
  stat_identity(geom = "tile") + 
  scale_fill_profinit_c("blues", reverse = TRUE) + 
  labs(
    x = "Character gender",
    y = "First film of the character",
    fill = "Count", 
    title = "Where do characters of a given gender mostly start?"
  )

base R

# TODO

Tips

For counts, use very light colors for low values (so that they look similar to the transparent ‘No obs.’ level).
If there is a known baseline (e.g., odds, ratios), use a diverging color palette with three colors (middle: neutral), e.g., a blue-white-red palette.
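
To illustrate the diverging-palette tip, the sketch below colors each cell by how far its count is from what independence of the two variables would predict (log2 of observed / expected, so the baseline is 0). This is an illustrative choice of a baseline quantity, not an odds ratio; the `in_jedi` flag and the plain "blue"/"white"/"red" colors are assumptions made for the example.

```r
library(dplyr)
library(ggplot2)

cells <-
  dplyr::starwars %>%
  filter(!is.na(gender)) %>%
  mutate(in_jedi = vapply(films, function(f) "Return of the Jedi" %in% f, logical(1))) %>%
  count(gender, in_jedi) %>%
  group_by(gender) %>% mutate(row_total = sum(n)) %>%
  group_by(in_jedi) %>% mutate(col_total = sum(n)) %>%
  ungroup() %>%
  mutate(
    expected = row_total * col_total / sum(n),  # count expected under independence
    log2_ratio = log2(n / expected)             # 0 = exactly as expected
  )

plt <-
  ggplot(cells, aes(x = gender, y = in_jedi, fill = log2_ratio)) +
  geom_tile() +
  # diverging scale: the neutral middle color sits on the baseline
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  labs(
    x = "Character gender",
    y = "Appears in Return of the Jedi",
    fill = "log2(observed / expected)"
  )
```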

Extra: Odds ratio visualization

ggplot

plt <- ggplot()

base R

# TODO

Tips

TODO
