Visualizing 2D quantitative data and linear regression

36-315: Statistical Graphics and Visualization, Summer 2026

Summarizing 2D quantitative data

Three aspects:

Direction/trend (positive, negative)
Strength of the relationship (strong, moderate, weak)
Linearity (linear, non-linear)

Goals:

describe the relationships between two variables \(X\) and \(Y\)
describe the conditional distribution \(Y \mid X\) via regression analysis
describe the joint distribution \(X, Y\) via contours, heatmaps, etc.

Big picture ideas:

Scatterplots are by far the most common visual
Regression analysis is by far the most popular analysis
Relationships may vary across other variables, e.g., categorical variables

Summarizing 2D quantitative data

Numerical summary: correlation coefficient

cor(penguins$bill_dep, penguins$bill_len, use = "complete.obs")

[1] -0.2350529

Note: By default, cor() returns Pearson correlation
Other correlations:
- Spearman’s correlation
- Kendall rank correlation coefficient
- and more

Making scatterplots

Use geom_point()
Displaying the joint (bivariate) distribution
What is the obvious flaw with this plot?

library(tidyverse)
theme_set(theme_light())

penguins |> 
  ggplot(aes(x = bill_dep, y = bill_len)) +
  geom_point(color = "darkred")

Making scatterplots: adjust the transparency

Adjust the transparency of points via alpha to visualize overlap
Provide better understanding of joint frequency
Especially important with larger datasets

penguins |> 
  ggplot(aes(x = bill_dep, y = bill_len)) +
  geom_point(color = "darkred", alpha = 0.5)

Point color by a categorical variable

Map a categorical variable to color

penguins |> 
  ggplot(aes(x = bill_dep, y = bill_len, color = species)) +
  geom_point(alpha = 0.5)

Point color by a continuous variable

Map a continuous variable to color

penguins |> 
  ggplot(aes(x = bill_dep, y = bill_len, color = body_mass)) +
  geom_point(alpha = 0.5)

Color gradient for a continuous variable

Map a continuous variable to color

penguins |> 
  ggplot(aes(x = bill_dep, y = bill_len, color = body_mass)) +
  geom_point(alpha = 0.5) +
  scale_color_gradient(low = "darkblue", high = "darkorange")

Point size by a continuous variable

Map a continuous variable to size

penguins |> 
  ggplot(aes(x = bill_dep, y = bill_len, size = body_mass)) +
  geom_point(alpha = 0.5)

Point shape by a categorical variable

Map a categorical variable to shape

penguins |> 
  ggplot(aes(x = bill_dep, y = bill_len, shape = species)) +
  geom_point(alpha = 0.5)

Mapping multiple aesthetics to different variables

Each of color, shape, and size
is mapped to a different variable
5 dimensions at once!
Be careful about this…

penguins |> 
  ggplot(aes(x = bill_dep, y = bill_len,
             color = species, shape = island, size = body_mass)) +
  geom_point(alpha = 0.5)

Displaying trend line: linear regression

Display regression line for
body_mass ~ flipper_len
95% confidence band by default
Estimating the conditional expectation of
body_mass | flipper_len
- i.e., \(\mathbb{E}[\) body_mass \(\mid\) flipper_len \(]\)

penguins |> 
  ggplot(aes(x = flipper_len, y = body_mass)) +
  geom_point(color = "darkred", alpha = 0.5) +
  geom_smooth(method = "lm")

Visualizing joint density

Data: shot attempts by the Sabrina Ionescu in the 2023 WNBA season

Each row is a shot attempt by Caitlin Clark in the 2024 WNBA season
Categorical variables: scoring_play, shot_type
Quantitative variables: shot_x, shot_y, shot_distance, score_value

ionescu_shots <- read_csv("https://raw.githubusercontent.com/qntkhvn/36-315-summer26/refs/heads/master/data/ionescu-shots.csv") |> 
  filter(shot_y < 35) # remove some outliers
glimpse(ionescu_shots)

Rows: 614
Columns: 6
$ shot_x        <dbl> -10, 11, 2, -13, -18, 19, -1, 0, -1, -12, -1, -14, -7, -…
$ shot_y        <dbl> 24, 23, 2, 21, 19, 16, 3, 2, 4, 5, 2, 21, 26, 14, 7, 6, …
$ shot_distance <dbl> 26.000000, 25.495098, 2.828427, 24.698178, 26.172505, 24…
$ shot_type     <chr> "Pullup Jump Shot", "Running Jump Shot", "Cutting Layup …
$ scoring_play  <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FAL…
$ score_value   <dbl> 0, 3, 0, 0, 3, 0, 2, 0, 0, 0, 0, 3, 0, 2, 0, 3, 3, 0, 0,…

A subtle point about histogram

By default, the bins are centered at the integers
Intervals are left-closed, right-open: -0.5 to 0.5, etc.

ionescu_shots |>
  ggplot(aes(x = shot_distance)) +
  geom_histogram(binwidth = 1)

Specify the center of one bin (e.g. 0.5)
and use closed = "left"

ionescu_shots |>
  ggplot(aes(x = shot_distance)) +
  geom_histogram(binwidth = 1, center = 0.5, closed = "left")

Visualizing joint density

Visualize joint distribution with a scatterplot

ionescu_shots |>
  ggplot(aes(x = shot_x, y = shot_y)) +
  geom_point(alpha = 0.3)

Going from 1D to 2D density estimation

In 1D: estimate density \(f(x)\), assuming that \(f(x)\) is smooth:

\[ \hat{f}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K_h(x - x_i) \]

In 2D: estimate joint density \(f(x_1, x_2)\)

\[\hat{f}(x_1, x_2) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h_1h_2} K\left(\frac{x_1 - x_{i1}}{h_1}\right) K\left(\frac{x_2 - x_{i2}}{h_2}\right)\]

For 2D: there are 2 bandwidths

\(h_1\): controls smoothness as \(X_1\) changes, holding \(X_2\) fixed
\(h_2\): controls smoothness as \(X_2\) changes, holding \(X_1\) fixed

Gaussian kernels are still a popular choice

Display densities for 2D data: contour plots

Best known in topography: outlines (contours) denote levels of elevation

Create contours of 2D kernel density estimate

ionescu_shots |>
  ggplot(aes(x = shot_x, y = shot_y)) + 
  geom_point(alpha = 0.3) + 
  geom_density2d() +
  coord_fixed()

Use geom_density2d() to display contour lines
Extend KDE for joint density estimates in 2D
Inner lines denote “peaks”
coord_fixed() forced a fixed ratio

Create contours of 2D kernel density estimate

Modify the binwidth

ionescu_shots |>
  ggplot(aes(x = shot_x, y = shot_y)) + 
  geom_point(alpha = 0.3) + 
  geom_density2d(binwidth = 0.0001) + 
  coord_fixed()

Contours are difficult… what about heatmaps?

Use stat_density_2d() and the after_stat() function to make 2D KDE heatmaps
May be easier to read than nested lines with color

ionescu_shots |>
  ggplot(aes(x = shot_x, y = shot_y)) + 
  stat_density2d(aes(fill = after_stat(level)),
                 geom = "polygon") +
  scale_fill_gradient(low = "darkblue", high = "gold") +
  coord_fixed()

Use tiles instead

Divide the space into a grid and color the grid according to high/low values

ionescu_shots |>
  ggplot(aes(x = shot_x, y = shot_y)) + 
  stat_density2d(aes(fill = after_stat(density)),
                 geom = "tile", contour = FALSE) +
  scale_fill_gradient(low = "darkblue", high = "gold") +
  coord_fixed()

Hexagonal binning

Use geom_hex() to make hexagonal heatmaps
Need to have the hexbin package installed
2D version of histogram

ionescu_shots |>
  ggplot(aes(x = shot_x, y = shot_y)) + 
  geom_hex() +
  scale_fill_gradient(low = "darkblue", high = "gold") + 
  coord_fixed()

What about shooting efficiency?

Compute a function of another variable inside hexagons with stat_summary_hex()

ionescu_shots |>
  ggplot(aes(x = shot_x, y = shot_y, 
             z = scoring_play, group = -1)) +
  stat_summary_hex(binwidth = c(2, 2), 
                   fun = mean, color = "black") +
  scale_fill_gradient(low = "darkblue", high = "gold") + 
  coord_fixed()

Linear regression

Assume a linear relationship for \(Y = f(X)\) \[Y_{i}=\beta_{0}+\beta_{1} X_{i}+\epsilon_{i}, \quad \text { for } i=1,2, \dots, n\]

\(Y_i\) is the \(i\)th value for the response variable
\(X_i\) is the \(i\)th value for the predictor variable

\(\beta_0\) is an unknown, constant intercept
- average value for \(Y\) if \(X = 0\) (be careful sometimes…)
\(\beta_1\) is an unknown, constant slope
- change in average value for \(Y\) for each one-unit increase in \(X\)

\(\epsilon_i\) is the random noise
- assume independent, identically distributed (iid) from a normal distribution
- \(\epsilon_i \overset{iid}{\sim}N(0, \sigma^2)\) with constant variance \(\sigma^2\)

Linear regression estimation

Estimate the conditional expection for \(Y\) (i.e., average value for \(Y\) given the value for \(X\)) \[\mathbb{E}[Y_i \mid X_i] = \beta_0 + \beta_1X_i\]

Estimate the best fitted line with ordinary least squares (OLS) by minimizing the residual sum of squares (RSS)

\[\text{RSS} \left(\beta_{0}, \beta_{1}\right)=\sum_{i=1}^{n}\left[Y_{i}-\left(\beta_{0}+\beta_{1} X_{i}\right)\right]^{2}=\sum_{i=1}^{n}\left(Y_{i}-\beta_{0}-\beta_{1} X_{i}\right)^{2}\]

\[\widehat{\beta}_{1}=\frac{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)}{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2}} = \frac{\textsf{Cov}(X,Y)}{\textsf{Var}(X)} \quad \text{ and } \quad \widehat{\beta}_{0}=\bar{Y}-\widehat{\beta}_{1} \bar{X}\]

Visual explanation of linear regression in 2D

Goal: find a “best fit” line that minimizes the vertical distance between the line and the data points
- The vertical distance is known as a residual (difference between the actual y-value of a data point and the predicted y-value on the regression line)

Linear regression in `R`

penguins_lm <- lm(body_mass ~ flipper_len, data = penguins)
summary(penguins_lm)


Call:
lm(formula = body_mass ~ flipper_len, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1058.80  -259.27   -26.88   247.33  1288.69 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5780.831    305.815  -18.90   <2e-16 ***
flipper_len    49.686      1.518   32.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 394.3 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.759, Adjusted R-squared:  0.7583 
F-statistic:  1071 on 1 and 340 DF,  p-value: < 2.2e-16

Assessing assumptions of linear regression

Simple linear regression assumes \(Y_i \overset{\textsf{iid}}{\sim} N(\beta_0 + \beta_1 X_i, \sigma^2)\)

If this is true, then \(Y_i - \hat{Y}_i \overset{\textsf{iid}}{\sim} N(0, \sigma^2)\)

Plot residuals against fitted values \(\hat{Y}_i\)

Assess linearity: any divergence from mean 0
Asess equal variance: if \(\sigma^2\) is homogeneous across predictions/fits \(\hat{Y}_i\)

More difficult to assess independence and fixed \(X\) assumptions

Make these assumptions based on subject-matter knowledge

Plot residuals against predicted values

tibble(fitted = fitted(penguins_lm),
       resid = residuals(penguins_lm)) |> 
  ggplot(aes(x = fitted, y = resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red")

Conditional on the predicted values, residuals should have a mean of zero
Residuals should not display any pattern
Two things to look for:
- Any trend around horizontal reference line?
- Equal vertical spread?

Linear regression with categorical variables

A categorical variable of \(k\) levels is encoded into \(k-1\) indicator/dummy variables

Example: Penguins species (3 levels: Adelie, Chinstrap, Gentoo)
Indicators for Chinstrap and Gentoo: \(I_C\) and \(I_G\)
If \(I_C = I_G = 0\), then the species must be Adelie (reference/baseline level)

Model: \(Y_i \overset{\textsf{iid}}{\sim} N(\beta_0 + \beta_C I_C + \beta_G I_G, \sigma^2)\)
- \(\beta_0\): mean for Adelie (reference level)
- \(\beta_0 + \beta_C\): mean for Chinstrap
- \(\beta_0 + \beta_G\): mean for Gentoo
- Significant \(\beta_C\): Chinstrap and Adelie are different
- Significant \(\beta_G\): Gentoo and Adelie are different
- What about comparing Chinstrap and Gentoo? Need to fit a different model

Linear regression with interactions

Setup: response \(Y\) (bill_dep) and 2 predictors \(X\) (bill_len, quantitative) and species (categorical)

Model 1 (additive, no interaction): \(Y_i \overset{\textsf{iid}}{\sim} N(\beta_0 + \beta_X X + \beta_C I_C + \beta_G I_G, \sigma^2)\)
(Formula: bill_dep ~ bill_len + species)
- The intercept for Adelie: \(\beta_0\); for Chinstrap: \(\beta_0 + \beta_C\); for Gentoo: \(\beta_0 + \beta_G\)
- The slope for all species: \(\beta_X\)

Model 2 (with interaction): \(Y_i \overset{\textsf{iid}}{\sim} N(\beta_0 + \beta_X X + \beta_C I_C + \beta_G I_G + \beta_{CX} I_C X + \beta_{GX} I_G X, \sigma^2)\)
(Formula: bill_dep ~ bill_len + species + bill_len:species)
- The intercept for Adelie: \(\beta_0\); for Chinstrap: \(\beta_0 + \beta_C\); for Gentoo: \(\beta_0 + \beta_G\)
- The slope for Adelie: \(\beta_X\); for Chinstrap: \(\beta_X + \beta_{CX}\); for Gentoo: \(\beta_X + \beta_{GX}\)
- Significant coefficient for categorical variables by themselves \(\rightarrow\) significantly different intercepts
- Significant coefficient for interactions with categorical variables \(\rightarrow\) significantly different slopes

Visualizing additive model (no interaction)

Model: bill_dep ~ bill_len + species

Example of Simpson’s paradox: negative linear relationship between bill depth and bill length overall, but positive linear relationship within species

Code

additive_lm <- lm(bill_dep ~ bill_len + species, data = penguins)

intercepts <- c(
  coef(additive_lm)["(Intercept)"],
  coef(additive_lm)["(Intercept)"] + coef(additive_lm)["speciesChinstrap"],
  coef(additive_lm)["(Intercept)"] + coef(additive_lm)["speciesGentoo"]
)

lines_tbl <- tibble(intercept = intercepts,
                    slope = rep(coef(additive_lm)["bill_len"], 3),
                    species = levels(penguins$species))
penguins |> 
  ggplot(aes(x = bill_len, y = bill_dep)) +
  geom_point(alpha = 0.5) +
  geom_abline(data = lines_tbl, linewidth = 1.5,
              aes(intercept = intercept, slope = slope, color = species))

Summarizing additive model

summary(lm(bill_dep ~ bill_len + species, data = penguins))


Call:
lm(formula = bill_dep ~ bill_len + species, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4529 -0.6864 -0.0508  0.5519  3.5915 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      10.59218    0.68302  15.508  < 2e-16 ***
bill_len          0.19989    0.01749  11.427  < 2e-16 ***
speciesChinstrap -1.93319    0.22416  -8.624 2.55e-16 ***
speciesGentoo    -5.10602    0.19142 -26.674  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9533 on 338 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.769, Adjusted R-squared:  0.7669 
F-statistic: 375.1 on 3 and 338 DF,  p-value: < 2.2e-16

Visualizing model with interaction

Model: bill_dep ~ bill_len + species + bill_len:species

Map species to the color aesthetic for geom_smooth() layer

Code

penguins |>
  ggplot(aes(x = bill_len, y = bill_dep, color = species)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE)

Summarizing model with interaction

summary(lm(bill_dep ~ bill_len * species, data = penguins))


Call:
lm(formula = bill_dep ~ bill_len * species, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6574 -0.6675 -0.0524  0.5383  3.5032 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)               11.40912    1.13812  10.025  < 2e-16 ***
bill_len                   0.17883    0.02927   6.110 2.76e-09 ***
speciesChinstrap          -3.83998    2.05398  -1.870 0.062419 .  
speciesGentoo             -6.15812    1.75451  -3.510 0.000509 ***
bill_len:speciesChinstrap  0.04338    0.04558   0.952 0.341895    
bill_len:speciesGentoo     0.02601    0.04054   0.642 0.521590    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9548 on 336 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.7697,    Adjusted R-squared:  0.7662 
F-statistic: 224.5 on 5 and 336 DF,  p-value: < 2.2e-16

Linear regression warnings

Simpson’s paradox

A particular trend appears in several subgroups but reverses (or disappears) when the subgroups are combined
- e.g., negative relationship between two variables overall but positive relationship within every subgroup
In this case, subgroup analysis is especially important

Is the intercept meaningful?

Think about whether \(X = 0\) makes sense for a particular variable before interpreting the intercept

Interpolation versus extrapolation

Interpolation: prediction within the range of a variable
Extrapolation: prediction outside the range of a variable
Generally speaking, interpolation is more reliable than extrapolation (less sensitive to model misspecification)

Visualizing 2D quantitative data and linear regression

Summarizing 2D quantitative data

Summarizing 2D quantitative data

Making scatterplots

Making scatterplots: adjust the transparency

Point color by a categorical variable

Point color by a continuous variable

Color gradient for a continuous variable

Point size by a continuous variable

Point shape by a categorical variable

Mapping multiple aesthetics to different variables

Displaying trend line: linear regression

Visualizing joint density

A subtle point about histogram

Visualizing joint density

Going from 1D to 2D density estimation

Display densities for 2D data: contour plots

Create contours of 2D kernel density estimate

Create contours of 2D kernel density estimate

Contours are difficult… what about heatmaps?

Use tiles instead

Hexagonal binning

What about shooting efficiency?

Linear regression

Linear regression estimation

Visual explanation of linear regression in 2D

Linear regression in `R`

Assessing assumptions of linear regression

Plot residuals against predicted values

Linear regression with categorical variables

Linear regression with interactions

Visualizing additive model (no interaction)

Summarizing additive model

Visualizing model with interaction

Summarizing model with interaction

Linear regression warnings

Extrapolation example

Extrapolation example

Extrapolation example

Visualizing 2D quantitative data and linear regression

Summarizing 2D quantitative data

Summarizing 2D quantitative data

Making scatterplots

Making scatterplots: adjust the transparency

Point color by a categorical variable

Point color by a continuous variable

Color gradient for a continuous variable

Point size by a continuous variable

Point shape by a categorical variable

Mapping multiple aesthetics to different variables

Displaying trend line: linear regression

Visualizing joint density

A subtle point about histogram

Visualizing joint density

Going from 1D to 2D density estimation

Display densities for 2D data: contour plots

Create contours of 2D kernel density estimate

Create contours of 2D kernel density estimate

Contours are difficult… what about heatmaps?

Use tiles instead

Hexagonal binning

What about shooting efficiency?

Linear regression

Linear regression estimation

Visual explanation of linear regression in 2D

Linear regression in R

Assessing assumptions of linear regression

Plot residuals against predicted values

Linear regression with categorical variables

Linear regression with interactions

Visualizing additive model (no interaction)

Summarizing additive model

Visualizing model with interaction

Summarizing model with interaction

Linear regression warnings

Extrapolation example

Extrapolation example

Extrapolation example

Linear regression in `R`