Comparing distributions

36-315: Statistical Graphics and Visualization, Summer 2026

Review: density estimation

Smoothed densities are good for visualizing the shape of a distribution

There are two ways to produce smoothed densities:

Nonparametric (previous lecture)
- Pick a bandwidth and kernel, then fit a kernel-smoothed density
Parametric (this lecture)
- Specify a distribution (e.g., normal), estimate the parameters, and fit the corresponding pdf

Review: display full distribution with ECDF plots

Pros:

Displays all of the data (sorted)
Does NOT require any parameters to adjust
As \(n \rightarrow \infty\), the ECDF \(\hat F_n(x)\) converges to
the true CDF \(F(x)\)

Cons:

What are the cons?

penguins |> 
  ggplot(aes(x = flipper_len)) +
  stat_ecdf()

What is the relationship between these two figures?

Probability distributions

A distribution is a mathematical function \(f(x \mid \theta)\) where

\(x\) may take on continuous or discrete values over the domain (i.e. all possible inputs) of \(f(x \mid \theta)\)
\(\theta\) is a set of parameters governing the shape of the distribution
- e.g. \(\theta = \{\mu, \sigma ^2 \}\) for a normal/Gaussian distribution
the \(\mid\) symbol means that the shape of the distribution is conditional on the values of \(\theta\)

Let \(f\) denote the distribution for its

probability density function (PDF) if \(x\) is continuous
probability mass function (PMF) if \(x\) is discrete

Note:

\(f(x \mid \theta) \geq 0\) for all \(x\)
\(\displaystyle \sum_x f(x \mid \theta) = 1\) (discrete case) or \(\displaystyle \int_x f(x \mid \theta) = 1\) (continuous case)

Example: Normal distribution

Given \(X \sim N(\mu, \sigma^2)\)
PDF: \(\displaystyle{f}(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp \left\{ -\frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2 \right\}\) for \(x \in (- \infty, \infty)\)
Standard normal distribution: \(N(0, 1)\)

tibble(x = c(-5, 5)) |> 
  ggplot(aes(x)) +
  stat_function(fun = dnorm, color = "blue",
                args = list(mean = 0, sd = 1)) +
  stat_function(fun = dnorm, color = "red",
                args = list(mean = -2, sd = sqrt(0.5)))

Distribution functions in `R`

Example: normal distribution

dnorm(): normal density function
pnorm(): normal cumulative distribution function (fraction of values smaller than)
qnorm(): normal quantile function (inverse of cumulative distribution)
rnorm(): generate normal random variables

Note: Replace “norm” with the name of another distribution, all the same functions apply

e.g., “t”, “exp”, “gamma”, “chisq”, “binom”, “pois”, “unif”, etc.

See this manual for more details

Parametric density estimation

Instead of trying to estimate the whole \(f(x)\) non-parametrically, assume a particular distribution \(f(x)\) and estimate its parameters
For example, assume \(X_i \sim N(\mu, \sigma^2)\). Then use the observed data to estimate the parameters \[\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2\]
The (plug-in) density estimate is \[\hat f(x) = \frac{1}{\hat\sigma\sqrt{2\pi}} \exp \left\{ -\frac{1}{2} \left(\frac{x-\hat\mu}{\hat\sigma}\right)^2 \right\}\]

One-sample Kolmogorov-Smirnov test

Compare the ECDF \(\hat{F}(x)\) to the CDF \(F(x)\) of a theoretical distribution
Null hypothesis: the observed data follow a particular theoretical distribution
Test statistic: \(\displaystyle \quad \max_x |\hat{F}(x) - F(x)|\)
If \(\hat{F}(x)\) is far away from \(F(x)\), reject the null hypothesis

One-sample Kolmogorov-Smirnov test example

What if we assume flipper_len follows Normal distribution? (i.e., \(H_0:\) flipper_len \(\sim N(\mu, \sigma^2)\))

Need estimates for mean \(\mu\) and standard deviation \(\sigma\):

fl_mean <- mean(penguins$flipper_len, na.rm = TRUE)
fl_mean

[1] 200.9152

fl_sd <- sd(penguins$flipper_len, na.rm = TRUE)
fl_sd

[1] 14.06171

One-sample KS test: There is strong evidence that flipper_len is not normally distributed

ks.test(x = penguins$flipper_len, y = "pnorm", mean = fl_mean, sd = fl_sd)


    Asymptotic one-sample Kolmogorov-Smirnov test

data:  penguins$flipper_len
D = 0.12428, p-value = 5.163e-05
alternative hypothesis: two-sided

Visualizing empirical and parametric densities

penguins |> 
  ggplot(aes(x = flipper_len)) + 
  geom_density(color = "blue") +
  stat_function(fun = dnorm, linetype = "dashed", color = "black",
                args = list(mean = fl_mean, sd = fl_sd))

Visualizing empirical and parametric CDF

See Appendix for code

Comparing multiple empirical distributions

Goal: determine whether a quantitative variable depends on a categorical variable

Examples:
- clinical trials with multiple treatments
- assessing differences across race, gender, socioeconomic status, etc.
- industrial experiments, A/B testing

Graphical comparison: overlayed densities, side-by-side violin plots, faceted histograms, etc.

Remember:
- It’s useful to visualize and compare conditional distributions
- But when are differences in a graphic statistically significant?
- We need formal statistical inference (e.g., hypothesis tests)

Example: Spotify songs duration by genre

Do rap and rock songs differ in duration?

Two-sample Kolmogorov-Smirnov test

Use a two-sample KS to compare two empirical distributions \(\hat{F}_A(x)\) and \(\hat{F}_B\)
Null hypothesis: two samples \(A\) and \(B\) follow the same distribution
Test statistic: \(\displaystyle \quad \max_x |\hat{F}_A(x) - \hat{F}_B(x)|\)
If \(\hat F_A\) and \(\hat F_B\) are far away from each other, reject the null hypothesis

Two-sample Kolmogorov-Smirnov test example

spotify_songs <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv") |> 
  mutate(duration = duration_ms / 60000)

rap_duration <- spotify_songs |> filter(playlist_genre == "rap") |> pull(duration)
rock_duration <- spotify_songs |> filter(playlist_genre == "rock") |> pull(duration)

ks.test(rap_duration, y = rock_duration)


    Asymptotic two-sample Kolmogorov-Smirnov test

data:  rap_duration and rock_duration
D = 0.22386, p-value < 2.2e-16
alternative hypothesis: two-sided

We have strong evidence that the duration distributions differ between rap and rock songs.

Statistical tests for comparing distributions

Any difference at all: two-sample KS test
Difference in means
- Null hypothesis: \(H_0: \mu_1 = \mu_2 = \cdots = \mu_K\) (use t.test() or oneway.test())
- Can assume the variances are all the same or differ
- If \(H_0\) is rejected, can only conclude not all means are equal
Difference in variances
- Null hypothesis: \(H_0: \sigma^2_1 = \sigma^2_2 = \cdots = \sigma^2_K\) (use bartlett.test())
- If \(H_0\) is rejected, can only conclude not all variances are equal

Note: unlike the KS test, difference in means and variances are sensitive to non-normality

Different distributions can yield insignificant results

What about the difference between pop and rap?

Comparison using full data

table(spotify_songs$playlist_genre)


  edm latin   pop   r&b   rap  rock 
 6043  5155  5507  5431  5746  4951

rap_duration <- spotify_songs |> filter(playlist_genre == "rap") |> pull(duration)
pop_duration <- spotify_songs |> filter(playlist_genre == "pop") |> pull(duration)

ks.test(rap_duration, y = pop_duration)


    Asymptotic two-sample Kolmogorov-Smirnov test

data:  rap_duration and pop_duration
D = 0.14569, p-value < 2.2e-16
alternative hypothesis: two-sided

There is a statistically significant difference in duration between rap and pop songs (given large sample size)

Comparison using a smaller subset of data

set.seed(2017)
subset_songs <- spotify_songs |>
  group_by(playlist_genre) |> 
  slice_sample(n = 100)

table(subset_songs$playlist_genre)


  edm latin   pop   r&b   rap  rock 
  100   100   100   100   100   100

subset_rap_duration <- subset_songs |> filter(playlist_genre == "rap") |> pull(duration)
subset_pop_duration <- subset_songs |> filter(playlist_genre == "pop") |> pull(duration)

ks.test(subset_rap_duration, y = subset_pop_duration)


    Asymptotic two-sample Kolmogorov-Smirnov test

data:  subset_rap_duration and subset_pop_duration
D = 0.16, p-value = 0.1545
alternative hypothesis: two-sided

There is NO statistically significant difference in duration between rap and pop songs

But it still looks different visually???

Test for difference in means

Using a two-sample \(t\)-test

Null hypothesis: the means of two groups (populations) are equal

t.test(subset_rap_duration, subset_pop_duration)


    Welch Two Sample t-test

data:  subset_rap_duration and subset_pop_duration
t = 0.83091, df = 172.78, p-value = 0.4072
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.1461026  0.3585440
sample estimates:
mean of x mean of y 
 3.694096  3.587875

Test for difference in variances

Using a Bartlett test

Null hypothesis: all variances are equal

bartlett.test(list(subset_rap_duration, subset_pop_duration))


    Bartlett test of homogeneity of variances

data:  list(subset_rap_duration, subset_pop_duration)
Bartlett's K-squared = 15.54, df = 1, p-value = 8.08e-05

Rejects at \(\alpha = 0.05\) even with this smaller sample size!

Why did the KS test reveals no differences while the graphs are clearly different? Two possible reasons:

The sample size might be too small to detect a difference
The KS test is known to have low power

Statistical power

Statistical power is key to really understanding graphics: whether we’re seeing real effects versus noise

Definition: \(\quad \textsf{power} = P(p\text{-value} \leq \alpha \mid H_0 \text{ is false})\)
- In words: the probability that null hypothesis is rejected when it is false

Things that affect statistical power:
- Larger differences in the data \(\rightarrow\) more power
- Smaller variance/error in differences \(\rightarrow\) more power
- Larger sample size \(\rightarrow\) more power
- More appropriate statistical test \(\rightarrow\) more power

Remember: Type 1 error is falsely rejecting; Type 2 error is falsely failing to reject

Toy example: power of a \(t\)-test

Consider two samples \(\mathbf{X} = (X_1,\dots,X_n) \sim N(0, 1)\) and \(\mathbf{Y} = (Y_1,\dots,Y_n) \sim N(\delta, 1)\)
Use a \(t\)-test for difference in means between \(\mathbf{X}\) and \(\mathbf{Y}\)
Simulate \(\mathbf{X}\) and \(\mathbf{Y}\) 1000 times for some \(n\) and \(\delta > 0\)
Count the number of rejections

\[ \begin{aligned} \textsf{power} &= P(p\text{-value} \leq \alpha \mid H_0 \text{ is false}) \\ &= P(p\text{-value} \leq \alpha \mid \delta > 0) \\ &\approx \frac{\text{# rejections}}{1000} \end{aligned} \]

Consider \(n = 10, 20, \dots, 1000\) and \(\delta = 0.1\) or \(\delta = 0.25\)

Toy example: power of a \(t\)-test

Another toy example: power for different tests

Consider two samples: \(\mathbf{X} = (X_1,\dots,X_n) \sim N(0, 1)\) and \(\mathbf{Y} = (Y_1,\dots,Y_n) \sim N(0, 1.5)\)
Consider three ways to test differences between \(\mathbf{X}\) and \(\mathbf{Y}\)
1. \(t\)-test
2. Bartlett test
3. KS test
Simulate \(\mathbf{X}\) and \(\mathbf{Y}\) 1000 times for samples sizes \(n = 10, 20, \dots, 1000\)

What do you think the power curves will look like for these methods?

Another toy example: power for different tests

Takeaways

Graphics should be paired with statistical analyses to determine if a true effect versus noise is displayed
Even if there is a true effect, there may be limited power to detect it (some effects are easier to detect than others)
Remember: Power is the probability of rejecting when the null is false
Things that increase statistical power:
- Increase sample size
- Reduce variance/error
- Increase differences/effects
- Choose appropriate tests!

Appendix: visualizing empirical and parametric CDF

# create the ECDF function
fl_ecdf <- ecdf(penguins$flipper_len)

# compute absolute value of the differences between ECDF and theoretical (normal distribution)
abs_ecdf_diff <- abs(fl_ecdf(penguins$flipper_len) - pnorm(penguins$flipper_len, mean = fl_mean, sd = fl_sd))

# find the maximum difference value
max_fl_diff <- penguins$flipper_len[which.max(abs_ecdf_diff)]

penguins |>
  ggplot(aes(x = flipper_len)) +
  stat_ecdf(color = "darkblue") +
  # display normal ECDF
  stat_function(fun = pnorm, args = list(mean = fl_mean, sd = fl_sd), color = "black", linetype = "dashed") +
  # display KS test line
  geom_vline(xintercept = max_fl_diff, color = "red") +
  # add text with the test results
  annotate(geom = "text", x = 215, y = 0.25, label = "KS test stat = 0.12428\np-value = 5.163e-05") +
  labs(x = "Flipper length (mm)", y = "Fn(x)")

Comparing distributions

Review: density estimation

Review: display full distribution with ECDF plots

What is the relationship between these two figures?

Probability distributions

Example: Normal distribution

Distribution functions in R

Parametric density estimation

One-sample Kolmogorov-Smirnov test

One-sample Kolmogorov-Smirnov test example

Visualizing empirical and parametric densities

Visualizing empirical and parametric CDF

Comparing multiple empirical distributions

Example: Spotify songs duration by genre

Example: Spotify songs duration by genre

Do rap and rock songs differ in duration?

Two-sample Kolmogorov-Smirnov test

Two-sample Kolmogorov-Smirnov test example

Statistical tests for comparing distributions

What about the difference between pop and rap?

Comparison using full data

Comparison using a smaller subset of data

But it still looks different visually???

Test for difference in means

Test for difference in variances

Statistical power

Toy example: power of a \(t\)-test

Toy example: power of a \(t\)-test

Another toy example: power for different tests

Another toy example: power for different tests

Takeaways

Appendix: visualizing empirical and parametric CDF

Distribution functions in `R`