ggplot2
36-315: Statistical Graphics and Visualization, Summer 2026
“The simple graph has brought more information to the data analyst’s mind than any other device.”
— John Tukey
In statistics and data science, data visualization is part of the exploratory data analysis (EDA) process
Goal of EDA: perform initial explorations in order to better understand the data, discover trends/patterns, spot anomalies, etc.
Data can be explored
numerically (tables, descriptive statistics,…)
visually (graphics)
EDA complements statistical inference and modeling
EDA is an important and necessary step to build intuition
EDA is NOT a replacement for statistical inference and modeling
Anscombe’s quartet
| panel | mean(x) | var(x) | mean(y) | var(y) | cor(x,y) |
|---|---|---|---|---|---|
| 1 | 9 | 11 | 7.5 | 4.13 | 0.82 |
| 2 | 9 | 11 | 7.5 | 4.13 | 0.82 |
| 3 | 9 | 11 | 7.5 | 4.12 | 0.82 |
| 4 | 9 | 11 | 7.5 | 4.12 | 0.82 |
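The summary statistics in the table can be reproduced from the `anscombe` data frame that ships with base R (columns `x1` to `x4` and `y1` to `y4`). A minimal sketch; the helper name `summarize_panel` is made up for illustration and is not part of any package:

```r
# anscombe is built into R (datasets package): columns x1..x4 and y1..y4
summarize_panel <- function(i) {  # hypothetical helper, for illustration only
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    panel  = i,
    mean_x = mean(x),
    var_x  = var(x),
    mean_y = mean(y),
    var_y  = var(y),
    cor_xy = cor(x, y)
  )
}

# one row per panel, rounded to two decimals like the table above
anscombe_stats <- do.call(rbind, lapply(1:4, summarize_panel))
print(round(anscombe_stats, 2))
```

The four rows agree almost exactly, which is the point: the differences between the panels only show up once the data are plotted.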
The Datasaurus dozen: each of these datasets has nearly identical means, variances, and correlation, yet looks completely different when plotted
#CrookedHillary = Obama's third term, which would be terrible news for our economic growth - seen below. pic.twitter.com/y9WJoUaaql
— Donald J. Trump (@realDonaldTrump) July 30, 2016
ggplot2 provides an implementation of the grammar of graphics, with the following layers:
data: one or more datasets (in tidy tabular format)
geom: geometric objects to visually represent the data (e.g. points, lines, bars, etc.)
aes: mappings of variables to visual properties (i.e. aesthetics) of the geometric objects
scale: one scale for each variable displayed (e.g. axis limits, log scale, colors, etc.)
facet: similar subplots (i.e. facets) for subsets of the same data using a conditioning variable
stat: statistical transformations and summaries (e.g. identity, count, smooth, quantile, etc.)
coord: one or more coordinate systems (e.g. cartesian, polar, map projection)
labs: labels/guides for each variable and other parts of the plot (e.g. title, subtitle, caption, etc.)
theme: customization of plot layout (e.g. text size, alignment, legend position, etc.)
mpg (for more information, type help(mpg) or ?mpg)
[1] 234 11
[1] "manufacturer" "model" "displ" "year" "cyl"
[6] "trans" "drv" "cty" "hwy" "fl"
[11] "class"
| Variable | Description | Variable Type |
|---|---|---|
| hwy | highway fuel economy (miles per gallon) | quantitative |
| displ | engine displacement (liters) | quantitative |
| class | car type (compact, minivan, etc.) | categorical |
# A tibble: 6 × 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
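The printed output above (dimensions, column names, first six rows) corresponds to a few standard inspection calls. A sketch, assuming ggplot2 is installed so that the `mpg` dataset is available; the object name `mpg_small` is only illustrative:

```r
library(ggplot2)  # provides the mpg dataset

dim(mpg)    # number of rows and columns
names(mpg)  # column names
head(mpg)   # first six rows

# keep only the three variables used in the plots: hwy, displ, class
mpg_small <- mpg[, c("hwy", "displ", "class")]
str(mpg_small)
```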
In tidy data:
Each row is an observation (car)
Each column is a variable/measurement about each observation
How do we convert data into visualizations?
Remember: one scale for each mapped variable
mpg |>
  ggplot(aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  scale_x_reverse() +
  scale_y_continuous(breaks = seq(0, 45, 15),
                     limits = c(0, 50)) +
  scale_color_discrete(palette = "Dark2") +
  labs(
    x = "Engine displacement (liters)",
    y = "Fuel economy (mpg)",
    color = "Car type",
    title = "Highway fuel economy vs. engine displacement"
  )
Manually set the color, size, and alpha (transparency) of the point layer
Important notes:
Inside aes(): map a visual property (e.g., color, size, shape) to a variable in the data
Outside aes(): set a visual property to a constant value for all elements in a layer (e.g., color = "darkblue")
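The difference between mapping and setting is easiest to see side by side. A short sketch using `mpg`; the object names `p_mapped`, `p_set`, and `p_slip` are illustrative only:

```r
library(ggplot2)

# Inside aes(): color is mapped to the variable class, so ggplot2
# builds a color scale and draws a legend
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class))

# Outside aes(): color is set to one constant for every point,
# so no color scale or legend is created
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "darkblue")

# Common slip: a constant inside aes() is treated as a one-level variable
# named "darkblue", so the points get a default palette color instead
p_slip <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = "darkblue"))
```

Printing `p_slip` typically shows points in the default palette color with a legend entry labeled "darkblue", which is a quick way to recognize this mistake.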
Set x and y globally inside ggplot()
geom_point() and geom_smooth() inherit global aesthetics
Set color locally at the geom_point() layer
Global aesthetics ggplot(aes(...))
Scope: applies to all layers
Best for: variables used by multiple layers
Local aesthetics geom_*(aes(...))
Scope: applies only to a specific geom_* layer
Best for: unique mappings (also overriding shared aesthetics)
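Where the color mapping is placed changes what geom_smooth() does. A sketch under the assumption that the `drv` column of `mpg` (drive type) is used for color; `p_global` and `p_local` are illustrative names:

```r
library(ggplot2)

# Global color: both layers inherit it, so geom_smooth() fits
# one regression line per drive type
p_global <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Local color: only the points are colored; geom_smooth() inherits
# just x and y and fits a single line through all cars
p_local <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +
  geom_smooth(method = "lm", formula = y ~ x)
```

Choosing between the two is a modeling decision as much as a styling one: the global version shows a separate trend within each group, the local version shows the overall trend.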
mpg |>
  ggplot(aes(x = displ, y = hwy)) +
  # adjust color, size, alpha of points
  geom_point(color = "darkblue", size = 2, alpha = 0.4) +
  # add smooth regression line
  geom_smooth(method = "lm") +
  # change x-axis scale to reverse
  scale_x_reverse() +
  # change y-axis breaks and limits
  scale_y_continuous(breaks = seq(0, 45, 15), limits = c(0, 50)) +
  # change title and axes labels
  labs(x = "Engine displacement (liters)",
       y = "Fuel economy (mpg)",
       title = "Highway fuel economy vs. engine displacement") +
  # change theme
  theme_light() +
  # update text justification and font face
  theme(axis.title = element_text(face = "bold"),
        plot.title = element_text(hjust = 0.5, face = "bold"))