Comparing distributions


36-315: Statistical Graphics and Visualization, Summer 2026

Review: density estimation

Smoothed densities are good for visualizing the shape of a distribution

There are two ways to produce smoothed densities:

  • Nonparametric (previous lecture)

    • Pick a bandwidth and kernel, then fit a kernel-smoothed density
  • Parametric (this lecture)

    • Specify a distribution (e.g., normal), estimate the parameters, and fit the corresponding pdf

Review: display full distribution with ECDF plots

Pros:

  • Displays all of the data (sorted)

  • Does NOT require any parameters to adjust

  • As \(n \rightarrow \infty\), the ECDF \(\hat F_n(x)\) converges to
    the true CDF \(F(x)\)

Cons:

  • What are the cons?
penguins |> 
  ggplot(aes(x = flipper_len)) +
  stat_ecdf()

What is the relationship between these two figures?

Probability distributions

A distribution is a mathematical function \(f(x \mid \theta)\) where

  • \(x\) may take on continuous or discrete values over the domain (i.e. all possible inputs) of \(f(x \mid \theta)\)
  • \(\theta\) is a set of parameters governing the shape of the distribution
    • e.g. \(\theta = \{\mu, \sigma ^2 \}\) for a normal/Gaussian distribution
  • the \(\mid\) symbol means that the shape of the distribution is conditional on the values of \(\theta\)

Let \(f\) denote the distribution for its

  • probability density function (PDF) if \(x\) is continuous
  • probability mass function (PMF) if \(x\) is discrete

Note:

  • \(f(x \mid \theta) \geq 0\) for all \(x\)
  • \(\displaystyle \sum_x f(x \mid \theta) = 1\) (discrete case) or \(\displaystyle \int_x f(x \mid \theta) = 1\) (continuous case)

Example: Normal distribution

  • Given \(X \sim N(\mu, \sigma^2)\)

  • PDF: \(\displaystyle{f}(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp \left\{ -\frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2 \right\}\) for \(x \in (- \infty, \infty)\)

  • Standard normal distribution: \(N(0, 1)\)

tibble(x = c(-5, 5)) |> 
  ggplot(aes(x)) +
  stat_function(fun = dnorm, color = "blue",
                args = list(mean = 0, sd = 1)) +
  stat_function(fun = dnorm, color = "red",
                args = list(mean = -2, sd = sqrt(0.5)))

Distribution functions in R

Example: normal distribution

  • dnorm(): normal density function

  • pnorm(): normal cumulative distribution function (fraction of values smaller than)

  • qnorm(): normal quantile function (inverse of cumulative distribution)

  • rnorm(): generate normal random variables

Note: Replace “norm” with the name of another distribution, all the same functions apply

  • e.g., “t”, “exp”, “gamma”, “chisq”, “binom”, “pois”, “unif”, etc.

See this manual for more details

Parametric density estimation

  • Instead of trying to estimate the whole \(f(x)\) non-parametrically, assume a particular distribution \(f(x)\) and estimate its parameters

  • For example, assume \(X_i \sim N(\mu, \sigma^2)\). Then use the observed data to estimate the parameters \[\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2\]

  • The (plug-in) density estimate is \[\hat f(x) = \frac{1}{\hat\sigma\sqrt{2\pi}} \exp \left\{ -\frac{1}{2} \left(\frac{x-\hat\mu}{\hat\sigma}\right)^2 \right\}\]

One-sample Kolmogorov-Smirnov test

  • Compare the ECDF \(\hat{F}(x)\) to the CDF \(F(x)\) of a theoretical distribution

  • Null hypothesis: the observed data follow a particular theoretical distribution

  • Test statistic: \(\displaystyle \quad \max_x |\hat{F}(x) - F(x)|\)

  • If \(\hat{F}(x)\) is far away from \(F(x)\), reject the null hypothesis

One-sample Kolmogorov-Smirnov test example

  • What if we assume flipper_len follows Normal distribution? (i.e., \(H_0:\) flipper_len \(\sim N(\mu, \sigma^2)\))
  • Need estimates for mean \(\mu\) and standard deviation \(\sigma\):
fl_mean <- mean(penguins$flipper_len, na.rm = TRUE)
fl_mean
[1] 200.9152
fl_sd <- sd(penguins$flipper_len, na.rm = TRUE)
fl_sd
[1] 14.06171
  • One-sample KS test: There is strong evidence that flipper_len is not normally distributed
ks.test(x = penguins$flipper_len, y = "pnorm", mean = fl_mean, sd = fl_sd)

    Asymptotic one-sample Kolmogorov-Smirnov test

data:  penguins$flipper_len
D = 0.12428, p-value = 5.163e-05
alternative hypothesis: two-sided

Visualizing empirical and parametric densities

penguins |> 
  ggplot(aes(x = flipper_len)) + 
  geom_density(color = "blue") +
  stat_function(fun = dnorm, linetype = "dashed", color = "black",
                args = list(mean = fl_mean, sd = fl_sd))

Visualizing empirical and parametric CDF

See Appendix for code

Comparing multiple empirical distributions

  • Goal: determine whether a quantitative variable depends on a categorical variable
  • Examples:

    • clinical trials with multiple treatments

    • assessing differences across race, gender, socioeconomic status, etc.

    • industrial experiments, A/B testing

  • Graphical comparison: overlayed densities, side-by-side violin plots, faceted histograms, etc.
  • Remember:

    • It’s useful to visualize and compare conditional distributions

    • But when are differences in a graphic statistically significant?

    • We need formal statistical inference (e.g., hypothesis tests)

Example: Spotify songs duration by genre

Example: Spotify songs duration by genre

Do rap and rock songs differ in duration?

Two-sample Kolmogorov-Smirnov test

  • Use a two-sample KS to compare two empirical distributions \(\hat{F}_A(x)\) and \(\hat{F}_B\)

  • Null hypothesis: two samples \(A\) and \(B\) follow the same distribution

  • Test statistic: \(\displaystyle \quad \max_x |\hat{F}_A(x) - \hat{F}_B(x)|\)

  • If \(\hat F_A\) and \(\hat F_B\) are far away from each other, reject the null hypothesis

Two-sample Kolmogorov-Smirnov test example

spotify_songs <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv") |> 
  mutate(duration = duration_ms / 60000)

rap_duration <- spotify_songs |> filter(playlist_genre == "rap") |> pull(duration)
rock_duration <- spotify_songs |> filter(playlist_genre == "rock") |> pull(duration)

ks.test(rap_duration, y = rock_duration)

    Asymptotic two-sample Kolmogorov-Smirnov test

data:  rap_duration and rock_duration
D = 0.22386, p-value < 2.2e-16
alternative hypothesis: two-sided
  • We have strong evidence that the duration distributions differ between rap and rock songs.

Statistical tests for comparing distributions

  • Any difference at all: two-sample KS test

  • Difference in means

    • Null hypothesis: \(H_0: \mu_1 = \mu_2 = \cdots = \mu_K\) (use t.test() or oneway.test())

    • Can assume the variances are all the same or differ

    • If \(H_0\) is rejected, can only conclude not all means are equal

  • Difference in variances

    • Null hypothesis: \(H_0: \sigma^2_1 = \sigma^2_2 = \cdots = \sigma^2_K\) (use bartlett.test())

    • If \(H_0\) is rejected, can only conclude not all variances are equal

Note: unlike the KS test, difference in means and variances are sensitive to non-normality

  • Different distributions can yield insignificant results

What about the difference between pop and rap?

Comparison using full data

table(spotify_songs$playlist_genre)

  edm latin   pop   r&b   rap  rock 
 6043  5155  5507  5431  5746  4951 
rap_duration <- spotify_songs |> filter(playlist_genre == "rap") |> pull(duration)
pop_duration <- spotify_songs |> filter(playlist_genre == "pop") |> pull(duration)

ks.test(rap_duration, y = pop_duration)

    Asymptotic two-sample Kolmogorov-Smirnov test

data:  rap_duration and pop_duration
D = 0.14569, p-value < 2.2e-16
alternative hypothesis: two-sided

There is a statistically significant difference in duration between rap and pop songs (given large sample size)

Comparison using a smaller subset of data

set.seed(2017)
subset_songs <- spotify_songs |>
  group_by(playlist_genre) |> 
  slice_sample(n = 100)

table(subset_songs$playlist_genre)

  edm latin   pop   r&b   rap  rock 
  100   100   100   100   100   100 
subset_rap_duration <- subset_songs |> filter(playlist_genre == "rap") |> pull(duration)
subset_pop_duration <- subset_songs |> filter(playlist_genre == "pop") |> pull(duration)

ks.test(subset_rap_duration, y = subset_pop_duration)

    Asymptotic two-sample Kolmogorov-Smirnov test

data:  subset_rap_duration and subset_pop_duration
D = 0.16, p-value = 0.1545
alternative hypothesis: two-sided

There is NO statistically significant difference in duration between rap and pop songs

But it still looks different visually???

Test for difference in means

Using a two-sample \(t\)-test

Null hypothesis: the means of two groups (populations) are equal

t.test(subset_rap_duration, subset_pop_duration)

    Welch Two Sample t-test

data:  subset_rap_duration and subset_pop_duration
t = 0.83091, df = 172.78, p-value = 0.4072
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.1461026  0.3585440
sample estimates:
mean of x mean of y 
 3.694096  3.587875 

Test for difference in variances

Using a Bartlett test

Null hypothesis: all variances are equal

bartlett.test(list(subset_rap_duration, subset_pop_duration))

    Bartlett test of homogeneity of variances

data:  list(subset_rap_duration, subset_pop_duration)
Bartlett's K-squared = 15.54, df = 1, p-value = 8.08e-05

Rejects at \(\alpha = 0.05\) even with this smaller sample size!


Why did the KS test reveals no differences while the graphs are clearly different? Two possible reasons:

  • The sample size might be too small to detect a difference

  • The KS test is known to have low power

Statistical power

  • Statistical power is key to really understanding graphics: whether we’re seeing real effects versus noise
  • Definition: \(\quad \textsf{power} = P(p\text{-value} \leq \alpha \mid H_0 \text{ is false})\)

    • In words: the probability that null hypothesis is rejected when it is false
  • Things that affect statistical power:

    • Larger differences in the data \(\rightarrow\) more power

    • Smaller variance/error in differences \(\rightarrow\) more power

    • Larger sample size \(\rightarrow\) more power

    • More appropriate statistical test \(\rightarrow\) more power

  • Remember: Type 1 error is falsely rejecting; Type 2 error is falsely failing to reject

Toy example: power of a \(t\)-test

  • Consider two samples \(\mathbf{X} = (X_1,\dots,X_n) \sim N(0, 1)\) and \(\mathbf{Y} = (Y_1,\dots,Y_n) \sim N(\delta, 1)\)

  • Use a \(t\)-test for difference in means between \(\mathbf{X}\) and \(\mathbf{Y}\)

  • Simulate \(\mathbf{X}\) and \(\mathbf{Y}\) 1000 times for some \(n\) and \(\delta > 0\)

  • Count the number of rejections

\[ \begin{aligned} \textsf{power} &= P(p\text{-value} \leq \alpha \mid H_0 \text{ is false}) \\ &= P(p\text{-value} \leq \alpha \mid \delta > 0) \\ &\approx \frac{\text{# rejections}}{1000} \end{aligned} \]

  • Consider \(n = 10, 20, \dots, 1000\) and \(\delta = 0.1\) or \(\delta = 0.25\)

Toy example: power of a \(t\)-test

Another toy example: power for different tests

  • Consider two samples: \(\mathbf{X} = (X_1,\dots,X_n) \sim N(0, 1)\) and \(\mathbf{Y} = (Y_1,\dots,Y_n) \sim N(0, 1.5)\)

  • Consider three ways to test differences between \(\mathbf{X}\) and \(\mathbf{Y}\)

    1. \(t\)-test

    2. Bartlett test

    3. KS test

  • Simulate \(\mathbf{X}\) and \(\mathbf{Y}\) 1000 times for samples sizes \(n = 10, 20, \dots, 1000\)

What do you think the power curves will look like for these methods?

Another toy example: power for different tests

Takeaways

  • Graphics should be paired with statistical analyses to determine if a true effect versus noise is displayed

  • Even if there is a true effect, there may be limited power to detect it (some effects are easier to detect than others)

  • Remember: Power is the probability of rejecting when the null is false

  • Things that increase statistical power:

    • Increase sample size

    • Reduce variance/error

    • Increase differences/effects

    • Choose appropriate tests!

Appendix: visualizing empirical and parametric CDF

# create the ECDF function
fl_ecdf <- ecdf(penguins$flipper_len)

# compute absolute value of the differences between ECDF and theoretical (normal distribution)
abs_ecdf_diff <- abs(fl_ecdf(penguins$flipper_len) - pnorm(penguins$flipper_len, mean = fl_mean, sd = fl_sd))

# find the maximum difference value
max_fl_diff <- penguins$flipper_len[which.max(abs_ecdf_diff)]

penguins |>
  ggplot(aes(x = flipper_len)) +
  stat_ecdf(color = "darkblue") +
  # display normal ECDF
  stat_function(fun = pnorm, args = list(mean = fl_mean, sd = fl_sd), color = "black", linetype = "dashed") +
  # display KS test line
  geom_vline(xintercept = max_fl_diff, color = "red") +
  # add text with the test results
  annotate(geom = "text", x = 215, y = 0.25, label = "KS test stat = 0.12428\np-value = 5.163e-05") +
  labs(x = "Flipper length (mm)", y = "Fn(x)")