Visualizing 2D quantitative data and linear regression


36-315: Statistical Graphics and Visualization, Summer 2026

Summarizing 2D quantitative data

Three aspects:

  • Direction/trend (positive, negative)

  • Strength of the relationship (strong, moderate, weak)

  • Linearity (linear, non-linear)

Goals:

  • describe the relationships between two variables \(X\) and \(Y\)

  • describe the conditional distribution \(Y \mid X\) via regression analysis

  • describe the joint distribution \(X, Y\) via contours, heatmaps, etc.

Big picture ideas:

  • Scatterplots are by far the most common visual

  • Regression analysis is by far the most popular analysis

  • Relationships may vary across other variables, e.g., categorical variables

Summarizing 2D quantitative data

  • Numerical summary: correlation coefficient
cor(penguins$bill_dep, penguins$bill_len, use = "complete.obs")
[1] -0.2350529

Making scatterplots

  • Use geom_point()

  • Displaying the joint (bivariate) distribution

  • What is the obvious flaw with this plot?

library(tidyverse)
theme_set(theme_light())
penguins |> 
  ggplot(aes(x = bill_dep, y = bill_len)) +
  geom_point(color = "darkred")

Making scatterplots: adjust the transparency

  • Adjust the transparency of points via alpha to visualize overlap

  • Provide better understanding of joint frequency

  • Especially important with larger datasets

penguins |> 
  ggplot(aes(x = bill_dep, y = bill_len)) +
  geom_point(color = "darkred", alpha = 0.5)

Point color by a categorical variable

  • Map a categorical variable to color
penguins |> 
  ggplot(aes(x = bill_dep, y = bill_len, color = species)) +
  geom_point(alpha = 0.5)

Point color by a continuous variable

  • Map a continuous variable to color
penguins |> 
  ggplot(aes(x = bill_dep, y = bill_len, color = body_mass)) +
  geom_point(alpha = 0.5)

Color gradient for a continuous variable

  • Map a continuous variable to color
penguins |> 
  ggplot(aes(x = bill_dep, y = bill_len, color = body_mass)) +
  geom_point(alpha = 0.5) +
  scale_color_gradient(low = "darkblue", high = "darkorange")

Point size by a continuous variable

  • Map a continuous variable to size
penguins |> 
  ggplot(aes(x = bill_dep, y = bill_len, size = body_mass)) +
  geom_point(alpha = 0.5)

Point shape by a categorical variable

  • Map a categorical variable to shape
penguins |> 
  ggplot(aes(x = bill_dep, y = bill_len, shape = species)) +
  geom_point(alpha = 0.5)

Mapping multiple aesthetics to different variables

  • Each of color, shape, and size
    is mapped to a different variable

  • 5 dimensions at once!

  • Be careful about this…

penguins |> 
  ggplot(aes(x = bill_dep, y = bill_len,
             color = species, shape = island, size = body_mass)) +
  geom_point(alpha = 0.5)

Displaying trend line: linear regression

  • Display regression line for
    body_mass ~ flipper_len

  • 95% confidence band by default

  • Estimating the conditional expectation of
    body_mass | flipper_len

    • i.e., \(\mathbb{E}[\) body_mass \(\mid\) flipper_len \(]\)
penguins |> 
  ggplot(aes(x = flipper_len, y = body_mass)) +
  geom_point(color = "darkred", alpha = 0.5) +
  geom_smooth(method = "lm")

Visualizing joint density

Data: shot attempts by the Sabrina Ionescu in the 2023 WNBA season

  • Each row is a shot attempt by Caitlin Clark in the 2024 WNBA season

  • Categorical variables: scoring_play, shot_type

  • Quantitative variables: shot_x, shot_y, shot_distance, score_value

ionescu_shots <- read_csv("https://raw.githubusercontent.com/qntkhvn/36-315-summer26/refs/heads/master/data/ionescu-shots.csv") |> 
  filter(shot_y < 35) # remove some outliers
glimpse(ionescu_shots)
Rows: 614
Columns: 6
$ shot_x        <dbl> -10, 11, 2, -13, -18, 19, -1, 0, -1, -12, -1, -14, -7, -…
$ shot_y        <dbl> 24, 23, 2, 21, 19, 16, 3, 2, 4, 5, 2, 21, 26, 14, 7, 6, …
$ shot_distance <dbl> 26.000000, 25.495098, 2.828427, 24.698178, 26.172505, 24…
$ shot_type     <chr> "Pullup Jump Shot", "Running Jump Shot", "Cutting Layup …
$ scoring_play  <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FAL…
$ score_value   <dbl> 0, 3, 0, 0, 3, 0, 2, 0, 0, 0, 0, 3, 0, 2, 0, 3, 3, 0, 0,…

A subtle point about histogram

By default, the bins are centered at the integers
Intervals are left-closed, right-open: -0.5 to 0.5, etc.

ionescu_shots |>
  ggplot(aes(x = shot_distance)) +
  geom_histogram(binwidth = 1)

Specify the center of one bin (e.g. 0.5)
and use closed = "left"

ionescu_shots |>
  ggplot(aes(x = shot_distance)) +
  geom_histogram(binwidth = 1, center = 0.5, closed = "left")

Visualizing joint density

  • Visualize joint distribution with a scatterplot
ionescu_shots |>
  ggplot(aes(x = shot_x, y = shot_y)) +
  geom_point(alpha = 0.3)

Going from 1D to 2D density estimation

In 1D: estimate density \(f(x)\), assuming that \(f(x)\) is smooth:

\[ \hat{f}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K_h(x - x_i) \]

In 2D: estimate joint density \(f(x_1, x_2)\)

\[\hat{f}(x_1, x_2) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h_1h_2} K\left(\frac{x_1 - x_{i1}}{h_1}\right) K\left(\frac{x_2 - x_{i2}}{h_2}\right)\]

For 2D: there are 2 bandwidths

  • \(h_1\): controls smoothness as \(X_1\) changes, holding \(X_2\) fixed
  • \(h_2\): controls smoothness as \(X_2\) changes, holding \(X_1\) fixed

Gaussian kernels are still a popular choice

Display densities for 2D data: contour plots

Best known in topography: outlines (contours) denote levels of elevation

Create contours of 2D kernel density estimate

ionescu_shots |>
  ggplot(aes(x = shot_x, y = shot_y)) + 
  geom_point(alpha = 0.3) + 
  geom_density2d() +
  coord_fixed()
  • Use geom_density2d() to display contour lines

  • Extend KDE for joint density estimates in 2D

  • Inner lines denote “peaks”

  • coord_fixed() forced a fixed ratio

Create contours of 2D kernel density estimate

  • Modify the binwidth
ionescu_shots |>
  ggplot(aes(x = shot_x, y = shot_y)) + 
  geom_point(alpha = 0.3) + 
  geom_density2d(binwidth = 0.0001) + 
  coord_fixed()

Contours are difficult… what about heatmaps?

ionescu_shots |>
  ggplot(aes(x = shot_x, y = shot_y)) + 
  stat_density2d(aes(fill = after_stat(level)),
                 geom = "polygon") +
  scale_fill_gradient(low = "darkblue", high = "gold") +
  coord_fixed()

Use tiles instead

  • Divide the space into a grid and color the grid according to high/low values
ionescu_shots |>
  ggplot(aes(x = shot_x, y = shot_y)) + 
  stat_density2d(aes(fill = after_stat(density)),
                 geom = "tile", contour = FALSE) +
  scale_fill_gradient(low = "darkblue", high = "gold") +
  coord_fixed()

Hexagonal binning

  • Use geom_hex() to make hexagonal heatmaps

  • Need to have the hexbin package installed

  • 2D version of histogram

ionescu_shots |>
  ggplot(aes(x = shot_x, y = shot_y)) + 
  geom_hex() +
  scale_fill_gradient(low = "darkblue", high = "gold") + 
  coord_fixed()

What about shooting efficiency?

ionescu_shots |>
  ggplot(aes(x = shot_x, y = shot_y, 
             z = scoring_play, group = -1)) +
  stat_summary_hex(binwidth = c(2, 2), 
                   fun = mean, color = "black") +
  scale_fill_gradient(low = "darkblue", high = "gold") + 
  coord_fixed()

Linear regression

Assume a linear relationship for \(Y = f(X)\) \[Y_{i}=\beta_{0}+\beta_{1} X_{i}+\epsilon_{i}, \quad \text { for } i=1,2, \dots, n\]

  • \(Y_i\) is the \(i\)th value for the response variable

  • \(X_i\) is the \(i\)th value for the predictor variable

  • \(\beta_0\) is an unknown, constant intercept

    • average value for \(Y\) if \(X = 0\) (be careful sometimes…)
  • \(\beta_1\) is an unknown, constant slope

    • change in average value for \(Y\) for each one-unit increase in \(X\)
  • \(\epsilon_i\) is the random noise

    • assume independent, identically distributed (iid) from a normal distribution

    • \(\epsilon_i \overset{iid}{\sim}N(0, \sigma^2)\) with constant variance \(\sigma^2\)

Linear regression estimation

  • Estimate the conditional expection for \(Y\) (i.e., average value for \(Y\) given the value for \(X\)) \[\mathbb{E}[Y_i \mid X_i] = \beta_0 + \beta_1X_i\]
  • Estimate the best fitted line with ordinary least squares (OLS) by minimizing the residual sum of squares (RSS)

\[\text{RSS} \left(\beta_{0}, \beta_{1}\right)=\sum_{i=1}^{n}\left[Y_{i}-\left(\beta_{0}+\beta_{1} X_{i}\right)\right]^{2}=\sum_{i=1}^{n}\left(Y_{i}-\beta_{0}-\beta_{1} X_{i}\right)^{2}\]

\[\widehat{\beta}_{1}=\frac{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)}{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2}} = \frac{\textsf{Cov}(X,Y)}{\textsf{Var}(X)} \quad \text{ and } \quad \widehat{\beta}_{0}=\bar{Y}-\widehat{\beta}_{1} \bar{X}\]

Visual explanation of linear regression in 2D

  • Goal: find a “best fit” line that minimizes the vertical distance between the line and the data points

    • The vertical distance is known as a residual (difference between the actual y-value of a data point and the predicted y-value on the regression line)

Linear regression in R

penguins_lm <- lm(body_mass ~ flipper_len, data = penguins)
summary(penguins_lm)

Call:
lm(formula = body_mass ~ flipper_len, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1058.80  -259.27   -26.88   247.33  1288.69 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5780.831    305.815  -18.90   <2e-16 ***
flipper_len    49.686      1.518   32.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 394.3 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.759, Adjusted R-squared:  0.7583 
F-statistic:  1071 on 1 and 340 DF,  p-value: < 2.2e-16

Assessing assumptions of linear regression

Simple linear regression assumes \(Y_i \overset{\textsf{iid}}{\sim} N(\beta_0 + \beta_1 X_i, \sigma^2)\)

  • If this is true, then \(Y_i - \hat{Y}_i \overset{\textsf{iid}}{\sim} N(0, \sigma^2)\)

Plot residuals against fitted values \(\hat{Y}_i\)

  • Assess linearity: any divergence from mean 0

  • Asess equal variance: if \(\sigma^2\) is homogeneous across predictions/fits \(\hat{Y}_i\)

More difficult to assess independence and fixed \(X\) assumptions

  • Make these assumptions based on subject-matter knowledge

Plot residuals against predicted values

tibble(fitted = fitted(penguins_lm),
       resid = residuals(penguins_lm)) |> 
  ggplot(aes(x = fitted, y = resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red")
  • Conditional on the predicted values, residuals should have a mean of zero

  • Residuals should not display any pattern

  • Two things to look for:

    • Any trend around horizontal reference line?

    • Equal vertical spread?

Linear regression with categorical variables

  • A categorical variable of \(k\) levels is encoded into \(k-1\) indicator/dummy variables
  • Example: Penguins species (3 levels: Adelie, Chinstrap, Gentoo)

  • Indicators for Chinstrap and Gentoo: \(I_C\) and \(I_G\)

  • If \(I_C = I_G = 0\), then the species must be Adelie (reference/baseline level)

  • Model: \(Y_i \overset{\textsf{iid}}{\sim} N(\beta_0 + \beta_C I_C + \beta_G I_G, \sigma^2)\)

    • \(\beta_0\): mean for Adelie (reference level)

    • \(\beta_0 + \beta_C\): mean for Chinstrap

    • \(\beta_0 + \beta_G\): mean for Gentoo

    • Significant \(\beta_C\): Chinstrap and Adelie are different

    • Significant \(\beta_G\): Gentoo and Adelie are different

    • What about comparing Chinstrap and Gentoo? Need to fit a different model

Linear regression with interactions

  • Setup: response \(Y\) (bill_dep) and 2 predictors \(X\) (bill_len, quantitative) and species (categorical)
  • Model 1 (additive, no interaction): \(Y_i \overset{\textsf{iid}}{\sim} N(\beta_0 + \beta_X X + \beta_C I_C + \beta_G I_G, \sigma^2)\)
    (Formula: bill_dep ~ bill_len + species)

    • The intercept for Adelie: \(\beta_0\); for Chinstrap: \(\beta_0 + \beta_C\); for Gentoo: \(\beta_0 + \beta_G\)

    • The slope for all species: \(\beta_X\)

  • Model 2 (with interaction): \(Y_i \overset{\textsf{iid}}{\sim} N(\beta_0 + \beta_X X + \beta_C I_C + \beta_G I_G + \beta_{CX} I_C X + \beta_{GX} I_G X, \sigma^2)\)
    (Formula: bill_dep ~ bill_len + species + bill_len:species)

    • The intercept for Adelie: \(\beta_0\); for Chinstrap: \(\beta_0 + \beta_C\); for Gentoo: \(\beta_0 + \beta_G\)

    • The slope for Adelie: \(\beta_X\); for Chinstrap: \(\beta_X + \beta_{CX}\); for Gentoo: \(\beta_X + \beta_{GX}\)

    • Significant coefficient for categorical variables by themselves \(\rightarrow\) significantly different intercepts

    • Significant coefficient for interactions with categorical variables \(\rightarrow\) significantly different slopes

Visualizing additive model (no interaction)

Model: bill_dep ~ bill_len + species

Example of Simpson’s paradox: negative linear relationship between bill depth and bill length overall, but positive linear relationship within species

Code
additive_lm <- lm(bill_dep ~ bill_len + species, data = penguins)

intercepts <- c(
  coef(additive_lm)["(Intercept)"],
  coef(additive_lm)["(Intercept)"] + coef(additive_lm)["speciesChinstrap"],
  coef(additive_lm)["(Intercept)"] + coef(additive_lm)["speciesGentoo"]
)

lines_tbl <- tibble(intercept = intercepts,
                    slope = rep(coef(additive_lm)["bill_len"], 3),
                    species = levels(penguins$species))
penguins |> 
  ggplot(aes(x = bill_len, y = bill_dep)) +
  geom_point(alpha = 0.5) +
  geom_abline(data = lines_tbl, linewidth = 1.5,
              aes(intercept = intercept, slope = slope, color = species))

Summarizing additive model

summary(lm(bill_dep ~ bill_len + species, data = penguins))

Call:
lm(formula = bill_dep ~ bill_len + species, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4529 -0.6864 -0.0508  0.5519  3.5915 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      10.59218    0.68302  15.508  < 2e-16 ***
bill_len          0.19989    0.01749  11.427  < 2e-16 ***
speciesChinstrap -1.93319    0.22416  -8.624 2.55e-16 ***
speciesGentoo    -5.10602    0.19142 -26.674  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9533 on 338 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.769, Adjusted R-squared:  0.7669 
F-statistic: 375.1 on 3 and 338 DF,  p-value: < 2.2e-16

Visualizing model with interaction

Model: bill_dep ~ bill_len + species + bill_len:species

Map species to the color aesthetic for geom_smooth() layer

Code
penguins |>
  ggplot(aes(x = bill_len, y = bill_dep, color = species)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE)

Summarizing model with interaction

summary(lm(bill_dep ~ bill_len * species, data = penguins))

Call:
lm(formula = bill_dep ~ bill_len * species, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6574 -0.6675 -0.0524  0.5383  3.5032 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)               11.40912    1.13812  10.025  < 2e-16 ***
bill_len                   0.17883    0.02927   6.110 2.76e-09 ***
speciesChinstrap          -3.83998    2.05398  -1.870 0.062419 .  
speciesGentoo             -6.15812    1.75451  -3.510 0.000509 ***
bill_len:speciesChinstrap  0.04338    0.04558   0.952 0.341895    
bill_len:speciesGentoo     0.02601    0.04054   0.642 0.521590    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9548 on 336 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.7697,    Adjusted R-squared:  0.7662 
F-statistic: 224.5 on 5 and 336 DF,  p-value: < 2.2e-16

Linear regression warnings

Simpson’s paradox

  • A particular trend appears in several subgroups but reverses (or disappears) when the subgroups are combined

    • e.g., negative relationship between two variables overall but positive relationship within every subgroup
  • In this case, subgroup analysis is especially important

Is the intercept meaningful?

  • Think about whether \(X = 0\) makes sense for a particular variable before interpreting the intercept

Interpolation versus extrapolation

  • Interpolation: prediction within the range of a variable

  • Extrapolation: prediction outside the range of a variable

  • Generally speaking, interpolation is more reliable than extrapolation (less sensitive to model misspecification)

Extrapolation example

Extrapolation example

Extrapolation example