Statistical inference for 1D categorical data


36-315: Statistical Graphics and Visualization, Summer 2026

What does a bar chart show?

library(tidyverse)
theme_set(theme_light())

penguins |>
  ggplot(aes(y = species)) +
  geom_bar()

New discovery

Not only messing up ornithology, but also making a mockery of R tutorials. #Rstats

— brucy (@realbrucy.bsky.social) 10:31 AM · May 14, 2026

What does a bar chart show?

Marginal distribution: probability that a categorical variable \(X\) (e.g., species) takes each particular category \(x\) (Adelie, Chinstrap, Gentoo)

  • Frequency bar charts display the number of observations (count) in each category
table(penguins$species)

   Adelie Chinstrap    Gentoo 
      152        68       124 
  • Proportion/percent bar charts display the estimated probability of each category

    • e.g., \(P(\)species \(=\) Adelie\()\)
prop.table(table(penguins$species))

   Adelie Chinstrap    Gentoo 
0.4418605 0.1976744 0.3604651 

Proportion bar charts

  • Compute proportions “by hand” with count() and mutate()

  • Use the pipe operator |> to perform multiple operations

  • Use geom_col(), since we want the bar length to represent values in the data

penguins |>
  count(species) |> 
  mutate(prop = n / sum(n))
    species   n      prop
1    Adelie 152 0.4418605
2 Chinstrap  68 0.1976744
3    Gentoo 124 0.3604651
penguins |>
  count(species) |> 
  mutate(prop = n / sum(n)) |> 
  ggplot(aes(x = prop, y = species)) +
  geom_col()

Statistical inference for proportions

  • Estimate \(P(\)species = \(C_j\) ) with \(\hat p_j\) for each category \(C_j\) ( \(\hat p_\texttt{Adelie}\), \(\hat p_\texttt{Chinstrap}\), \(\hat p_\texttt{Gentoo}\) )
prop.table(table(penguins$species))

   Adelie Chinstrap    Gentoo 
0.4418605 0.1976744 0.3604651 
  • Quantify uncertainty for \(\displaystyle \hat p_j = \frac{n_j}{n}\) with the standard error \[\textsf{se}(\hat{p}_j) = \sqrt{\frac{\hat{p}_j(1 - \hat{p}_j)}{n}}\]

  • Compute the \((1 - \alpha)\)-level confidence interval \(\hat{p}_j \pm z_{1 - \alpha / 2} \cdot \textsf{se}(\hat{p}_j)\)

  • Good rule-of-thumb: construct 95% confidence interval using \(\hat{p}_j \pm 2 \cdot \textsf{se}(\hat{p}_j)\)
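As a worked example of the formulas above, the following base R sketch computes the standard errors and 95% intervals from the species counts shown earlier (152, 68, 124). The variable names are illustrative and not part of the original slides. It also compares the exact normal quantile with the rule-of-thumb multiplier 2.

```r
# Counts taken from table(penguins$species) above
counts <- c(Adelie = 152, Chinstrap = 68, Gentoo = 124)
n <- sum(counts)                      # total sample size (344)

p_hat <- counts / n                   # estimated proportions
se <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of each proportion

alpha <- 0.05
z <- qnorm(1 - alpha / 2)             # exact normal quantile, about 1.96

# 95% confidence intervals: exact quantile vs. rule-of-thumb multiplier 2
ci_exact <- cbind(lower = p_hat - z * se, upper = p_hat + z * se)
ci_rule  <- cbind(lower = p_hat - 2 * se, upper = p_hat + 2 * se)

round(ci_exact, 3)
round(ci_rule, 3)
```

Because 2 is slightly larger than the exact quantile, the rule-of-thumb intervals are marginally wider.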

Adding confidence intervals to bar chart


penguins |> 
  count(species) |> 
  mutate(prop = n / sum(n),
         se = sqrt(prop * (1 - prop) / sum(n)),
         lower = prop - 2 * se,
         upper = prop + 2 * se) |> 
  ggplot(aes(x = prop, y = species)) +
  geom_col() +
  geom_errorbar(aes(xmin = lower, xmax = upper), 
                color = "blue",  width = 0.2, linewidth = 1)

Ordering factors in a bar chart

Order the bars by proportion

penguins |> 
  count(species) |> 
  mutate(prop = n / sum(n),
         se = sqrt(prop * (1 - prop) / sum(n)),
         lower = prop - 2 * se,
         upper = prop + 2 * se,
         species = fct_reorder(species, prop)) |> 
  ggplot(aes(x = prop, y = species)) +
  geom_col() +
  geom_errorbar(aes(xmin = lower, xmax = upper), 
                color = "blue", width = 0.2, linewidth = 1)

Don’t do this… (Look closely…)

Hypothesis testing in general

  • Define the null and alternative hypotheses
  • Construct the test statistic
  • Compute the \(p\)-value

    • The \(p\)-value is the probability of observing a test statistic at least as extreme as the observed statistic, under the assumption that the null is true

    • Is the test statistic “unusual” compared to what we would expect under the null?

  • Decide whether to reject the null hypothesis

    • Compare \(p\)-value to the target error rate (or significance level) \(\alpha\)

    • Typically choose \(\alpha = 0.05\) (the origins of 0.05)

    • In other words, if \(H_0\) is true and we test at \(\alpha = 0.05\), there is a 5% chance that we reject it anyway: a false positive (also known as a Type I error)
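The meaning of the significance level can be checked by simulation. The sketch below is an illustration under assumed settings (the seed, the number of simulations, and the sample size are arbitrary choices, not from the slides): it generates data for which the null hypothesis of equally likely categories is true, runs a chi-squared test on each data set, and records how often the null is rejected at \(\alpha = 0.05\).

```r
set.seed(315)                # arbitrary seed, for reproducibility only

n_sims <- 2000               # number of simulated data sets (hypothetical)
n <- 300                     # sample size per data set (hypothetical)
K <- 3                       # number of categories
alpha <- 0.05

# Simulate counts under H0 (all K categories equally likely);
# each column of sim_counts is one simulated data set
sim_counts <- rmultinom(n_sims, size = n, prob = rep(1 / K, K))

# Run a chi-squared test on each data set and keep the p-value
p_values <- apply(sim_counts, 2, function(x) chisq.test(x)$p.value)

# Proportion of false positives; this should be close to alpha
type1_rate <- mean(p_values <= alpha)
type1_rate
```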

Chi-squared test for 1D categorical data

  • Null hypothesis: \(H_0\): \(p_1 = p_2 = \cdots = p_K\)

  • Test statistic: \(\displaystyle \chi^2 = \sum_{j=1}^K \frac{(O_j - E_j)^2}{E_j}\), where

    • \(O_j\): observed counts in category \(j\)

    • \(E_j\) : expected counts under \(H_0\)
      (each category is equally likely to occur, with probability \(1/K = p_1 = p_2 = \cdots = p_K\), so \(E_j = n/K\))


chisq.test(table(penguins$species))

    Chi-squared test for given probabilities

data:  table(penguins$species)
X-squared = 31.907, df = 2, p-value = 1.179e-07
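To connect the formula to the output of chisq.test(), the statistic can be reproduced by hand. This is a minimal base R sketch using the species counts from earlier. The variable names are illustrative, and the \(p\)-value uses the chi-squared distribution with \(K - 1\) degrees of freedom, matching the df = 2 reported above.

```r
# Observed counts from table(penguins$species)
observed <- c(Adelie = 152, Chinstrap = 68, Gentoo = 124)
n <- sum(observed)
K <- length(observed)

# Expected counts under H0: every category has probability 1 / K
expected <- rep(n / K, K)

# Chi-squared statistic: sum of (O - E)^2 / E over the categories
chi_sq <- sum((observed - expected)^2 / expected)

# Upper-tail probability of a chi-squared distribution with K - 1 df
p_value <- pchisq(chi_sq, df = K - 1, lower.tail = FALSE)

c(statistic = chi_sq, df = K - 1, p_value = p_value)
```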

Graphics versus statistical inference

Reminder: Anscombe’s Quartet


  • Statistical inference is the same,
    but the graphics are very different

  • The opposite can be true!

  • Graphics can be the same,
    but statistical inference is very different

Toy example: 3 categories, \(p_1 = 1/2,\ p_2 = p_3 = 1/4\)


Test results for different sample sizes
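A rough sketch of this idea, assuming data whose sample proportions are exactly 1/2, 1/4, 1/4 (so the bar charts look identical) at three hypothetical sample sizes chosen only for illustration:

```r
# Same sample proportions at three hypothetical sample sizes
props <- c(1/2, 1/4, 1/4)
sample_sizes <- c(20, 40, 200)

# Chi-squared test of H0: p1 = p2 = p3 for each sample size
results <- sapply(sample_sizes, function(n) {
  test <- chisq.test(n * props)      # observed counts are n * props
  c(n = n, statistic = unname(test$statistic), p_value = test$p.value)
})

# One row per sample size
t(results)
```

With these counts the statistic works out to \(n/8\), so it grows linearly with the sample size even though the proportions never change. At the 5% level, only the largest of the three samples rejects the null.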

How do we combine graphics with inference?

  • Simply add \(p\)-values (or other info) to graph via text

  • Add confidence intervals to the graph

    • Need to remember what each CI is for

    • The CIs on previous slides are for each \(\hat{p}_j\) marginally, NOT jointly

    • Have to be careful with multiple testing

Confidence intervals visually capture uncertainty

(Rough) Rules-of-thumb for comparing CIs on bar charts

Comparing overlap between two CIs is NOT exactly the same as directly testing for a significant difference

  • What we really want is a CI for the difference \(\textsf{CI}(\hat{p}_1 - \hat{p}_2)\), rather than \(\textsf{CI}(\hat{p}_1)\) and \(\textsf{CI}(\hat{p}_2)\) separately

  • If \(\textsf{CI}(\hat{p}_1)\) and \(\textsf{CI}(\hat{p}_2)\) do not overlap, then \(0 \notin\) \(\textsf{CI}(\hat{p}_1 - \hat{p}_2)\)

  • However, \(\textsf{CI}(\hat{p}_1)\) and \(\textsf{CI}(\hat{p}_2)\) overlapping does not necessarily imply that \(0 \in\) \(\textsf{CI}(\hat{p}_1 - \hat{p}_2)\)
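As an illustration, the sketch below compares the two individual intervals for Adelie and Gentoo with an interval for their difference. It relies on an assumption not stated in the slides: under multinomial sampling, two sample proportions have covariance \(-p_1 p_2 / n\), which gives \(\textsf{se}(\hat{p}_1 - \hat{p}_2) = \sqrt{\{\hat{p}_1(1 - \hat{p}_1) + \hat{p}_2(1 - \hat{p}_2) + 2\hat{p}_1\hat{p}_2\}/n}\). Variable names are illustrative.

```r
counts <- c(Adelie = 152, Chinstrap = 68, Gentoo = 124)
n <- sum(counts)
p_hat <- counts / n

# Individual 95% CIs (rule-of-thumb multiplier 2)
se <- sqrt(p_hat * (1 - p_hat) / n)
ci_adelie <- p_hat[["Adelie"]] + c(-2, 2) * se[["Adelie"]]
ci_gentoo <- p_hat[["Gentoo"]] + c(-2, 2) * se[["Gentoo"]]

# CI for the difference p_Adelie - p_Gentoo.
# Assumption (not from the slides): multinomial sampling, so the two
# sample proportions have covariance -p1 * p2 / n.
p1 <- p_hat[["Adelie"]]
p2 <- p_hat[["Gentoo"]]
se_diff <- sqrt((p1 * (1 - p1) + p2 * (1 - p2) + 2 * p1 * p2) / n)
ci_diff <- (p1 - p2) + c(-2, 2) * se_diff

ci_adelie
ci_gentoo
ci_diff
```

For these counts the two individual intervals overlap and the interval for the difference contains 0, so both views agree here. In general, only the interval for the difference answers the question directly.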

Roughly speaking:

  • If CIs do not overlap: evidence of a significant difference

  • If CIs overlap slightly: ambiguous

  • If CIs overlap substantially: likely no significant difference

If we compare more than two CIs simultaneously, we must account for multiple testing

  • Looking for all non-overlapping CIs among \(K\) groups implicitly involves \(\displaystyle \binom{K}{2}\) pairwise comparisons

Corrections for multiple testing

  • In those bar plots, when we determine whether CIs overlap, we make 3 comparisons:

    1. A vs B

    2. A vs C

    3. B vs C

  • This is a multiple testing (or multiple comparison) problem
  • Making multiple comparisons increases the probability of a Type I error beyond 5%

    • Type I error: rejecting \(H_0\) when \(H_0\) is true

    • Example: concluding that A and B differ because their CIs don’t overlap, although \(H_0: p_A = p_B\) is true

  • If we are only interested in comparing A vs B, then we can just construct a 95% CI for the difference between A and B, which controls the error rate at 5%

  • However, if we perform several comparisons simultaneously, the overall probability of making at least one Type I error becomes greater than 5%.
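A quick calculation shows how fast the error rate grows. This sketch assumes, purely for illustration, that the three comparisons are independent. That is not true for pairwise comparisons among the same groups, so the first number is only indicative. The second number is an upper bound that holds regardless of dependence.

```r
alpha <- 0.05
m <- 3   # number of pairwise comparisons: A vs B, A vs C, B vs C

# Assumption for illustration only: the m tests are independent.
# (Pairwise comparisons of the same groups are not independent in practice.)
fwer_independent <- 1 - (1 - alpha)^m

# Upper bound that holds without any independence assumption
fwer_bound <- min(1, m * alpha)

c(independent = fwer_independent, bound = fwer_bound)
```

Under the independence assumption, the probability of at least one Type I error is about 14%, nearly three times the nominal 5%.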

Corrections for multiple testing

Basic idea:

  • Multiple testing corrections make hypothesis tests more conservative (e.g., make \(p\)-values larger)

  • Equivalently, they produce wider CIs

  • Goal: control the overall Type I error rate (the probability of at least one false positive across all comparisons) at \(\leq 5\%\)

Bonferroni correction: simple, easy to implement, but most conservative

  • Normally, we reject \(H_0\) when \(p\)-value \(\leq 0.05\)

  • If making \(K\) comparisons, the Bonferroni correction rejects only if \(\displaystyle p\text{-value} \leq \frac{0.05}{K}\)

  • Equivalently, instead of plotting 95% CIs, we plot \(\displaystyle\left(1 - \frac{0.05}{K}\right) \times 100 \%\) CIs

    • e.g., when \(K = 3\), use 98.33% CIs
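The effect of the correction on interval width can be sketched with the species counts from earlier. The example below is illustrative only: the variable names and the three raw \(p\)-values passed to p.adjust() are hypothetical, chosen to show how the adjustment works.

```r
counts <- c(Adelie = 152, Chinstrap = 68, Gentoo = 124)
n <- sum(counts)
p_hat <- counts / n
se <- sqrt(p_hat * (1 - p_hat) / n)

alpha <- 0.05
K <- 3                                   # number of comparisons
z_unadj <- qnorm(1 - alpha / 2)          # about 1.96, for 95% CIs
z_bonf  <- qnorm(1 - (alpha / K) / 2)    # larger multiplier, for 98.33% CIs

ci_unadj <- cbind(lower = p_hat - z_unadj * se, upper = p_hat + z_unadj * se)
ci_bonf  <- cbind(lower = p_hat - z_bonf * se,  upper = p_hat + z_bonf * se)

round(ci_unadj, 3)
round(ci_bonf, 3)

# Bonferroni-adjusted p-values: multiply each raw p-value by K, cap at 1
# (the raw p-values here are hypothetical)
p_adj <- p.adjust(c(0.01, 0.02, 0.04), method = "bonferroni")
p_adj
```

Comparing an adjusted \(p\)-value to 0.05 is equivalent to comparing the raw \(p\)-value to \(0.05 / K\).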

Impact of Bonferroni correction on confidence intervals

Takeaways

  • Graphics for 1D categorical data (e.g., bar charts) show the empirical distribution
    of the categorical variable ( \(\hat{p}_1, \dots, \hat{p}_K\) )

  • Chi-squared test is a common test for 1D categorical data, testing \(H_0 : p_1 = \cdots = p_K\)

  • However, from this (global) test alone, we can’t tell which probabilities differ

  • We can compute individual confidence intervals for each of \(\hat{p}_1, \dots, \hat{p}_K\)

    • Allows for easy visualization

    • But can be complicated, especially with respect to multiple testing

  • Graphs with the same trends can display very different statistical significance (largely due to sample size)

Do it live: data viz replication

nwsl <- read_csv("https://raw.githubusercontent.com/qntkhvn/36-315-summer26/refs/heads/master/data/nwsl.csv")