cor(penguins$bill_dep, penguins$bill_len, use = "complete.obs")[1] -0.2350529
36-315: Statistical Graphics and Visualization, Summer 2026
Three aspects:
Direction/trend (positive, negative)
Strength of the relationship (strong, moderate, weak)
Linearity (linear, non-linear)
Goals:
describe the relationships between two variables \(X\) and \(Y\)
describe the conditional distribution \(Y \mid X\) via regression analysis
describe the joint distribution \(X, Y\) via contours, heatmaps, etc.
Big picture ideas:
Scatterplots are by far the most common visual
Regression analysis is by far the most popular analysis
Relationships may vary across other variables, e.g., categorical variables
Note: By default, cor() returns Pearson correlation
Other correlations:
Display regression line forbody_mass ~ flipper_len
95% confidence band by default
Estimating the conditional expectation ofbody_mass | flipper_len
body_mass \(\mid\) flipper_len \(]\)Data: shot attempts by the Sabrina Ionescu in the 2023 WNBA season
Each row is a shot attempt by Caitlin Clark in the 2024 WNBA season
Categorical variables: scoring_play, shot_type
Quantitative variables: shot_x, shot_y, shot_distance, score_value
Rows: 614
Columns: 6
$ shot_x <dbl> -10, 11, 2, -13, -18, 19, -1, 0, -1, -12, -1, -14, -7, -…
$ shot_y <dbl> 24, 23, 2, 21, 19, 16, 3, 2, 4, 5, 2, 21, 26, 14, 7, 6, …
$ shot_distance <dbl> 26.000000, 25.495098, 2.828427, 24.698178, 26.172505, 24…
$ shot_type <chr> "Pullup Jump Shot", "Running Jump Shot", "Cutting Layup …
$ scoring_play <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FAL…
$ score_value <dbl> 0, 3, 0, 0, 3, 0, 2, 0, 0, 0, 0, 3, 0, 2, 0, 3, 3, 0, 0,…
By default, the bins are centered at the integers
Intervals are left-closed, right-open: -0.5 to 0.5, etc.
In 1D: estimate density \(f(x)\), assuming that \(f(x)\) is smooth:
\[ \hat{f}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K_h(x - x_i) \]
In 2D: estimate joint density \(f(x_1, x_2)\)
\[\hat{f}(x_1, x_2) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h_1h_2} K\left(\frac{x_1 - x_{i1}}{h_1}\right) K\left(\frac{x_2 - x_{i2}}{h_2}\right)\]
For 2D: there are 2 bandwidths
Gaussian kernels are still a popular choice
Best known in topography: outlines (contours) denote levels of elevation
Use geom_density2d() to display contour lines
Extend KDE for joint density estimates in 2D
Inner lines denote “peaks”
coord_fixed() forced a fixed ratio
Use stat_density_2d() and the after_stat() function to make 2D KDE heatmaps
May be easier to read than nested lines with color
Use geom_hex() to make hexagonal heatmaps
Need to have the hexbin package installed
2D version of histogram
stat_summary_hex()Assume a linear relationship for \(Y = f(X)\) \[Y_{i}=\beta_{0}+\beta_{1} X_{i}+\epsilon_{i}, \quad \text { for } i=1,2, \dots, n\]
\(Y_i\) is the \(i\)th value for the response variable
\(X_i\) is the \(i\)th value for the predictor variable
\(\beta_0\) is an unknown, constant intercept
\(\beta_1\) is an unknown, constant slope
\(\epsilon_i\) is the random noise
assume independent, identically distributed (iid) from a normal distribution
\(\epsilon_i \overset{iid}{\sim}N(0, \sigma^2)\) with constant variance \(\sigma^2\)
\[\text{RSS} \left(\beta_{0}, \beta_{1}\right)=\sum_{i=1}^{n}\left[Y_{i}-\left(\beta_{0}+\beta_{1} X_{i}\right)\right]^{2}=\sum_{i=1}^{n}\left(Y_{i}-\beta_{0}-\beta_{1} X_{i}\right)^{2}\]
\[\widehat{\beta}_{1}=\frac{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)}{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2}} = \frac{\textsf{Cov}(X,Y)}{\textsf{Var}(X)} \quad \text{ and } \quad \widehat{\beta}_{0}=\bar{Y}-\widehat{\beta}_{1} \bar{X}\]
Goal: find a “best fit” line that minimizes the vertical distance between the line and the data points
R
Call:
lm(formula = body_mass ~ flipper_len, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-1058.80 -259.27 -26.88 247.33 1288.69
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5780.831 305.815 -18.90 <2e-16 ***
flipper_len 49.686 1.518 32.72 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 394.3 on 340 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.759, Adjusted R-squared: 0.7583
F-statistic: 1071 on 1 and 340 DF, p-value: < 2.2e-16
Simple linear regression assumes \(Y_i \overset{\textsf{iid}}{\sim} N(\beta_0 + \beta_1 X_i, \sigma^2)\)
Plot residuals against fitted values \(\hat{Y}_i\)
Assess linearity: any divergence from mean 0
Asess equal variance: if \(\sigma^2\) is homogeneous across predictions/fits \(\hat{Y}_i\)
More difficult to assess independence and fixed \(X\) assumptions
Conditional on the predicted values, residuals should have a mean of zero
Residuals should not display any pattern
Two things to look for:
Any trend around horizontal reference line?
Equal vertical spread?
Example: Penguins species (3 levels: Adelie, Chinstrap, Gentoo)
Indicators for Chinstrap and Gentoo: \(I_C\) and \(I_G\)
If \(I_C = I_G = 0\), then the species must be Adelie (reference/baseline level)
Model: \(Y_i \overset{\textsf{iid}}{\sim} N(\beta_0 + \beta_C I_C + \beta_G I_G, \sigma^2)\)
\(\beta_0\): mean for Adelie (reference level)
\(\beta_0 + \beta_C\): mean for Chinstrap
\(\beta_0 + \beta_G\): mean for Gentoo
Significant \(\beta_C\): Chinstrap and Adelie are different
Significant \(\beta_G\): Gentoo and Adelie are different
What about comparing Chinstrap and Gentoo? Need to fit a different model
bill_dep) and 2 predictors \(X\) (bill_len, quantitative) and species (categorical)Model 1 (additive, no interaction): \(Y_i \overset{\textsf{iid}}{\sim} N(\beta_0 + \beta_X X + \beta_C I_C + \beta_G I_G, \sigma^2)\)
(Formula: bill_dep ~ bill_len + species)
The intercept for Adelie: \(\beta_0\); for Chinstrap: \(\beta_0 + \beta_C\); for Gentoo: \(\beta_0 + \beta_G\)
The slope for all species: \(\beta_X\)
Model 2 (with interaction): \(Y_i \overset{\textsf{iid}}{\sim} N(\beta_0 + \beta_X X + \beta_C I_C + \beta_G I_G + \beta_{CX} I_C X + \beta_{GX} I_G X, \sigma^2)\)
(Formula: bill_dep ~ bill_len + species + bill_len:species)
The intercept for Adelie: \(\beta_0\); for Chinstrap: \(\beta_0 + \beta_C\); for Gentoo: \(\beta_0 + \beta_G\)
The slope for Adelie: \(\beta_X\); for Chinstrap: \(\beta_X + \beta_{CX}\); for Gentoo: \(\beta_X + \beta_{GX}\)
Significant coefficient for categorical variables by themselves \(\rightarrow\) significantly different intercepts
Significant coefficient for interactions with categorical variables \(\rightarrow\) significantly different slopes
Model: bill_dep ~ bill_len + species
Example of Simpson’s paradox: negative linear relationship between bill depth and bill length overall, but positive linear relationship within species
additive_lm <- lm(bill_dep ~ bill_len + species, data = penguins)
intercepts <- c(
coef(additive_lm)["(Intercept)"],
coef(additive_lm)["(Intercept)"] + coef(additive_lm)["speciesChinstrap"],
coef(additive_lm)["(Intercept)"] + coef(additive_lm)["speciesGentoo"]
)
lines_tbl <- tibble(intercept = intercepts,
slope = rep(coef(additive_lm)["bill_len"], 3),
species = levels(penguins$species))
penguins |>
ggplot(aes(x = bill_len, y = bill_dep)) +
geom_point(alpha = 0.5) +
geom_abline(data = lines_tbl, linewidth = 1.5,
aes(intercept = intercept, slope = slope, color = species))
Call:
lm(formula = bill_dep ~ bill_len + species, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-2.4529 -0.6864 -0.0508 0.5519 3.5915
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.59218 0.68302 15.508 < 2e-16 ***
bill_len 0.19989 0.01749 11.427 < 2e-16 ***
speciesChinstrap -1.93319 0.22416 -8.624 2.55e-16 ***
speciesGentoo -5.10602 0.19142 -26.674 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9533 on 338 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.769, Adjusted R-squared: 0.7669
F-statistic: 375.1 on 3 and 338 DF, p-value: < 2.2e-16
Model: bill_dep ~ bill_len + species + bill_len:species
Map species to the color aesthetic for geom_smooth() layer
Call:
lm(formula = bill_dep ~ bill_len * species, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-2.6574 -0.6675 -0.0524 0.5383 3.5032
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.40912 1.13812 10.025 < 2e-16 ***
bill_len 0.17883 0.02927 6.110 2.76e-09 ***
speciesChinstrap -3.83998 2.05398 -1.870 0.062419 .
speciesGentoo -6.15812 1.75451 -3.510 0.000509 ***
bill_len:speciesChinstrap 0.04338 0.04558 0.952 0.341895
bill_len:speciesGentoo 0.02601 0.04054 0.642 0.521590
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9548 on 336 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.7697, Adjusted R-squared: 0.7662
F-statistic: 224.5 on 5 and 336 DF, p-value: < 2.2e-16
Simpson’s paradox
A particular trend appears in several subgroups but reverses (or disappears) when the subgroups are combined
In this case, subgroup analysis is especially important
Is the intercept meaningful?
Interpolation versus extrapolation
Interpolation: prediction within the range of a variable
Extrapolation: prediction outside the range of a variable
Generally speaking, interpolation is more reliable than extrapolation (less sensitive to model misspecification)