The grammar of graphics and ggplot2


36-315: Statistical Graphics and Visualization, Summer 2026

The data science workflow


Source: R for Data Science (2e)

“The simple graph has brought more information to the data analyst’s mind than any other device.”
— John Tukey

Exploratory data analysis (EDA)

  • In statistics/data science, data visualization is part of the EDA process

  • Goal of EDA: perform initial explorations in order to better understand the data, discover trends/patterns, spot anomalies, etc.

  • Data can be explored

    • numerically (tables, descriptive statistics,…)

    • visually (graphics)

  • EDA complements statistical inference and modeling

  • EDA is an important and necessary step to build intuition

  • EDA is NOT a replacement for statistical inference and modeling

ALWAYS visualize your data

Anscombe’s quartet


panel  mean(x)  var(x)  mean(y)  var(y)  cor(x,y)
1      9        11      7.5      4.13    0.82
2      9        11      7.5      4.13    0.82
3      9        11      7.5      4.12    0.82
4      9        11      7.5      4.12    0.82
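The summaries in the table can be reproduced from the anscombe data frame that ships with base R (columns x1 to x4 and y1 to y4). The following is a minimal sketch, not necessarily how the slide was produced; the object names quartet, quartet_summary, and quartet_plot are made up for illustration, and facet_wrap() is only introduced later in these slides.

```r
library(ggplot2)

# stack the four panels into one tidy data frame (one row per point)
quartet <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(panel = i,
             x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
}))

# numerical summaries, one row per panel
quartet_summary <- do.call(rbind, lapply(split(quartet, quartet$panel), function(d) {
  data.frame(panel = d$panel[1],
             mean_x = mean(d$x), var_x = var(d$x),
             mean_y = mean(d$y), var_y = var(d$y),
             cor_xy = cor(d$x, d$y))
}))
print(round(quartet_summary, 2))

# visual check: one scatterplot per panel
quartet_plot <- ggplot(quartet, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ panel)
quartet_plot
```

The numerical summaries are nearly identical across panels, while the four scatterplots look very different.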

ALWAYS visualize your data

The Datasaurus dozen: each of these has the same mean, variance, and correlation

3D? Pie? How about 3D Pie? Ugh…


Don’t be misleading

The simpler the better

The grammar of graphics

  • Key idea: specify plotting “layers” and combine them to produce a graphic
  • ggplot2 implements the grammar of graphics; a plot is built from the following components:
  1. data: one or more datasets (in tidy tabular format)
  2. geom: geometric objects to visually represent the data (e.g. points, lines, bars, etc.)
  3. aes: mappings of variables to visual properties (i.e. aesthetics) of the geometric objects
  4. scale: one scale for each variable displayed (e.g. axis limits, log scale, colors, etc.)
  5. facet: similar subplots (i.e. facets) for subsets of the same data using a conditioning variable
  6. stat: statistical transformations and summaries (e.g. identity, count, smooth, quantile, etc.)
  7. coord: one or more coordinate systems (e.g. cartesian, polar, map projection)
  8. labs: labels/guides for each variable and other parts of the plot (e.g. title, subtitle, caption, etc.)
  9. theme: customization of plot layout (e.g. text size, alignment, legend position, etc.)
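To make the list concrete, here is a sketch of a single plot in which each component is written out explicitly. Most of these have defaults and are usually omitted. The particular choices (faceting by drv, the x-axis limits) are assumptions made only for illustration.

```r
library(ggplot2)

p <- ggplot(data = mpg,                              # 1. data
            mapping = aes(x = displ, y = hwy)) +     # 3. aes
  geom_point(stat = "identity") +                    # 2. geom, 6. stat
  scale_x_continuous(limits = c(1, 7)) +             # 4. scale
  facet_wrap(~ drv) +                                # 5. facet
  coord_cartesian() +                                # 7. coord
  labs(x = "Engine displacement (liters)",           # 8. labs
       y = "Highway fuel economy (mpg)") +
  theme_light()                                      # 9. theme
p
```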

First example: fuel economy

  • Data: mpg (for more information, type help(mpg) or ?mpg)
library(ggplot2)
dim(mpg)
[1] 234  11
names(mpg)
 [1] "manufacturer" "model"        "displ"        "year"         "cyl"         
 [6] "trans"        "drv"          "cty"          "hwy"          "fl"          
[11] "class"       
Variable  Description                              Type
hwy       highway fuel economy (miles per gallon)  quantitative
displ     engine displacement (liters)             quantitative
class     car type (compact, minivan, etc.)        categorical
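A quick numerical look at these three variables might go as follows. This is a sketch; the name mpg_sub is made up for illustration.

```r
library(ggplot2)

# keep only the three variables described above
mpg_sub <- mpg[, c("hwy", "displ", "class")]

summary(mpg_sub$hwy)     # five-number summary and mean of a quantitative variable
summary(mpg_sub$displ)
table(mpg_sub$class)     # counts per level of a categorical variable
```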

Tidy data

head(mpg)
# A tibble: 6 × 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

In tidy data:

  • Each row is an observation (car)

  • Each column is a variable/measurement about each observation
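For contrast, here is a sketch of a table that is not tidy and one way to reshape it. The data frame wide is hypothetical, with made-up values, and the tidyr package is an assumption (it is not used elsewhere in these slides).

```r
library(tidyr)

# hypothetical wide table: one column per year, so "year" is hidden in the column names
wide <- data.frame(model = c("a", "b"),
                   `1999` = c(29, 26),
                   `2008` = c(31, 27),
                   check.names = FALSE)

# tidy version: one row per (model, year) observation
tidy <- wide |>
  pivot_longer(cols = c(`1999`, `2008`),
               names_to = "year",
               values_to = "hwy")
tidy
```

After reshaping, year and hwy are ordinary columns, so they can be mapped to aesthetics like any other variable.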

How do we convert data into visualizations?

Starting with the data


ggplot(data = mpg)


or equivalently, using |>

mpg |> 
  ggplot()


So far, only an empty plot is displayed: no variables have been mapped and no geometric objects have been added
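One way to see why the plot is still empty is to store it in an object and inspect it. This is a sketch that peeks at the plot object's layers element, which is an implementation detail and may differ across ggplot2 versions.

```r
library(ggplot2)

# a plot object with data attached but nothing to draw yet
p <- ggplot(data = mpg)

# number of geom layers added so far
length(p$layers)
```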

Specifying variables and geometric object

mpg |> 
  ggplot(aes(x = displ, y = hwy)) +
  geom_point()
  • Adding (+) a geometric layer of points to the plot

  • Aesthetic mapping: displ to x-axis and hwy to y-axis via aes()

Adding additional variables

mpg |> 
  ggplot(aes(x = displ, y = hwy, color = class)) +
  geom_point()
  • To add additional variables to a plot, use other aesthetics like color, shape, etc.

  • Color each point according to its class (car type); points in the same class share a color
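Several aesthetics can be combined. As a sketch, the plot below also maps drv (the drive train variable in mpg) to point shape; the choice of drv is an assumption made for illustration.

```r
library(ggplot2)

p <- mpg |>
  # color encodes car type, shape encodes drive train
  ggplot(aes(x = displ, y = hwy, color = class, shape = drv)) +
  geom_point()
p
```

Each mapped aesthetic gets its own legend.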

Scaling mapped variables

Remember: one scale for each mapped variable


mpg |> 
  ggplot(aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  scale_x_reverse() + 
  scale_y_continuous(breaks = seq(0, 45, 15),
                     limits = c(0, 50)) +
  scale_color_discrete(palette = "Dark2")
  • Flip the x-axis scale

  • Change y-axis breaks and limits

  • Change color palette
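Other scale functions follow the same pattern of one scale per mapped aesthetic. The sketch below swaps in scale_x_log10() and scale_color_brewer(), which are different functions from the ones used on the slide, and maps color to drv to keep the legend short.

```r
library(ggplot2)

p <- mpg |>
  ggplot(aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  # log10-transform the x-axis instead of reversing it
  scale_x_log10() +
  # ColorBrewer palette for a discrete variable
  scale_color_brewer(palette = "Dark2")
p
```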

Always label your plots! (seriously…)


mpg |> 
  ggplot(aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  scale_x_reverse() + 
  scale_y_continuous(breaks = seq(0, 45, 15),
                     limits = c(0, 50)) +
  scale_color_discrete(palette = "Dark2") +
  labs(
    x = "Engine displacement (liters)", 
    y = "Fuel economy (mpg)", 
    color = "Car type",
    title = "Highway fuel economy vs. engine displacement"
  )
  • Each mapped aesthetic can be labeled
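labs() also accepts a subtitle and a caption. A minimal sketch follows; the subtitle and caption wording is made up for illustration.

```r
library(ggplot2)

p <- mpg |>
  ggplot(aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(
    x = "Engine displacement (liters)",
    y = "Highway fuel economy (mpg)",
    color = "Car type",
    title = "Highway fuel economy vs. engine displacement",
    subtitle = "Larger engines tend to get fewer miles per gallon",
    caption = "Data: mpg dataset in ggplot2"
  )
p
```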

Changing aesthetics without mapping variables

mpg |> 
  ggplot(aes(x = displ, y = hwy)) + 
  geom_point(color = "darkblue", size = 2, alpha = 0.4)
  • Manually set the color, size, and alpha (transparency) of the point layer

  • Important notes:

    • Inside aes(): map a visual property (e.g., color, size, shape) to a variable in the data

    • Outside aes(): set a visual property to a constant value for all elements in a layer (e.g., color = "darkblue")
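A common mistake is to put a constant inside aes(). The sketch below contrasts the two cases; the object names p_set and p_mapped are made up for illustration.

```r
library(ggplot2)

# constant set outside aes(): every point is drawn in dark blue, no legend
p_set <- mpg |>
  ggplot(aes(x = displ, y = hwy)) +
  geom_point(color = "darkblue")

# constant placed inside aes(): "darkblue" is treated as a one-level variable,
# so points get the default discrete color and a legend appears
p_mapped <- mpg |>
  ggplot(aes(x = displ, y = hwy)) +
  geom_point(aes(color = "darkblue"))

p_set
p_mapped
```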

Global and local aesthetics

mpg |>
  # x and y are shared across layers
  ggplot(aes(x = displ, y = hwy)) + 
  # geom_point() uses shared x and y and local color
  geom_point(aes(color = class)) +
  # geom_smooth() uses only shared x and y
  geom_smooth()
  • Set x and y globally inside ggplot()

    • geom_point() and geom_smooth() inherit global aesthetics
  • Set color locally at the geom_point() layer

Global and local aesthetics

  • Global aesthetics: ggplot(aes(...))

    • Scope: applies to all layers

    • Best for: variables used by multiple layers

  • Local aesthetics: geom_*(aes(...))

    • Scope: applies only to a specific geom_* layer

    • Best for: mappings specific to one layer, or overriding a global aesthetic
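The sketch below shows what overriding looks like in practice, assuming color is mapped to drv globally. Setting an aesthetic to NULL in a layer's aes() removes the inherited mapping for that layer only. The object names are made up for illustration.

```r
library(ggplot2)

# x, y, and color are global: every layer inherits them
base <- mpg |>
  ggplot(aes(x = displ, y = hwy, color = drv)) +
  geom_point()

# geom_smooth() inherits color, so one line is fitted per drv group
p_by_group <- base + geom_smooth(method = "lm")

# local override: drop the color mapping, so one line is fitted to all points
p_overall <- base + geom_smooth(aes(color = NULL), method = "lm")

p_by_group
p_overall
```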

Faceting

mpg |> 
  ggplot(aes(x = displ, y = hwy)) + 
  geom_point() + 
  facet_wrap(~ class)
  • Create a multi-panel plot faceted by a conditioning variable (class)

  • These facet/panel plots are sometimes called lattice plots or trellis plots
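Two variations on faceting, as a sketch: fixing the number of columns in facet_wrap(), and using facet_grid() to condition on two variables at once. The choice of drv and cyl is an assumption made for illustration.

```r
library(ggplot2)

# facet_wrap() with a fixed number of columns
p_wrap <- mpg |>
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, ncol = 4)

# facet_grid(): rows by drive train, columns by number of cylinders
p_grid <- mpg |>
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)

p_wrap
p_grid
```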

Themes

mpg |> 
  ggplot(aes(x = displ, y = hwy)) + 
  geom_point() + 
  theme_light()
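ggplot2 ships with several other complete themes, and a default can be set for a whole session. A minimal sketch follows; the object names are made up for illustration.

```r
library(ggplot2)

base_plot <- mpg |>
  ggplot(aes(x = displ, y = hwy)) +
  geom_point()

# other built-in complete themes
p_bw      <- base_plot + theme_bw()
p_minimal <- base_plot + theme_minimal(base_size = 14)  # larger base font size

# set a default theme for all later plots; theme_set() returns the previous one
old_theme <- theme_set(theme_light())
base_plot               # now drawn with theme_light()
theme_set(old_theme)    # restore the previous default
```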

More customization

mpg |> 
  ggplot(aes(x = displ, y = hwy)) +
  # adjust color, size, alpha of points
  geom_point(color = "darkblue", size = 2, alpha = 0.4) +
  # add fitted linear regression line
  geom_smooth(method = "lm") +
  # change x-axis scale to reverse
  scale_x_reverse() +
  # change y-axis breaks and limits
  scale_y_continuous(breaks = seq(0, 45, 15), limits = c(0, 50)) +
  # change title and axes labels
  labs(x = "Engine displacement (liters)", 
       y = "Fuel economy (mpg)", 
       title = "Highway fuel economy vs. engine displacement") + 
  # change theme
  theme_light() +
  # update text justification and font face
  theme(axis.title = element_text(face = "bold"),
        plot.title = element_text(hjust = 0.5, face = "bold"))
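The same theme() mechanism controls the legend. The sketch below assumes color is mapped to class so that a legend exists, and it illustrates an ordering detail: a complete theme such as theme_light() replaces earlier theme settings, so theme() tweaks go after it.

```r
library(ggplot2)

p <- mpg |>
  ggplot(aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(color = "Car type") +
  theme_light() +
  # tweaks come after the complete theme so they are not overwritten
  theme(legend.position = "bottom",
        legend.title = element_text(face = "bold"))
p
```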

More customization