Statistical inference for 1D categorical data

36-315: Statistical Graphics and Visualization, Summer 2026

What does a bar chart show?

library(tidyverse)
theme_set(theme_light())

penguins |>
  ggplot(aes(y = species)) +
  geom_bar()

New discovery

Not only messing up orinithology, but also making a mockery of R tutorials. #Rstats

— brucy (@realbrucy.bsky.social) 10:31 AM · May 14, 2026

What does a bar chart show?

Marginal distribution: probability that a categorical variable \(X\) (e.g., species) takes each particular category \(x\) (Adelie, Chinstrap, Gentoo)

Frequency bar charts display counts: the number of observations in each category

table(penguins$species)


   Adelie Chinstrap    Gentoo 
      152        68       124

Proportion/percent bar charts display class probabilities
- e.g., \(P(\)species \(=\) Adelie\()\)

prop.table(table(penguins$species))


   Adelie Chinstrap    Gentoo 
0.4418605 0.1976744 0.3604651

Proportion bar charts

Compute proportions “by hand” with count() and mutate()
Use the pipe operator |> to perform multiple operations
Use geom_col(), since we want the bar length to represent values in the data

penguins |>
  count(species) |> 
  mutate(prop = n / sum(n))

    species   n      prop
1    Adelie 152 0.4418605
2 Chinstrap  68 0.1976744
3    Gentoo 124 0.3604651

penguins |>
  count(species) |> 
  mutate(prop = n / sum(n)) |> 
  ggplot(aes(x = prop, y = species)) +
  geom_col()

Statistical inference for proportions

Estimate \(P(\)species = \(C_j\) ) with \(\hat p_j\) for each category \(C_j\) ( \(\hat p_\texttt{Adelie}\), \(\hat p_\texttt{Chinstrap}\), \(\hat p_\texttt{Gentoo}\) )

prop.table(table(penguins$species))


   Adelie Chinstrap    Gentoo 
0.4418605 0.1976744 0.3604651

Quantify uncertainty for \(\displaystyle \hat p_j = \frac{n_j}{n}\) with the standard error \[\textsf{se}(\hat{p}_j) = \sqrt{\frac{\hat{p}_j(1 - \hat{p}_j)}{n}}\]
Compute the \((1 - \alpha)\)-level confidence interval \(\hat{p}_j \pm z_{1 - \alpha / 2} \cdot \textsf{se}(\hat{p}_j)\)
Good rule-of-thumb: construct a 95% confidence interval using \(\hat{p}_j \pm 2 \cdot \textsf{se}(\hat{p}_j)\), since \(z_{0.975} \approx 1.96\)
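As a worked example of the formulas above, here is a minimal base R sketch. The counts are copied from the `table(penguins$species)` output; the variable names are illustrative only.

```r
# Species counts copied from the table(penguins$species) output
counts <- c(Adelie = 152, Chinstrap = 68, Gentoo = 124)
n <- sum(counts)

p_hat <- counts / n
se <- sqrt(p_hat * (1 - p_hat) / n)

# Normal quantile for a 95% interval (alpha = 0.05)
alpha <- 0.05
z <- qnorm(1 - alpha / 2)  # about 1.96, which the rule of thumb rounds up to 2

ci_exact <- cbind(lower = p_hat - z * se, upper = p_hat + z * se)
ci_rule  <- cbind(lower = p_hat - 2 * se, upper = p_hat + 2 * se)

round(cbind(p_hat, se, ci_exact), 3)
```

Because 2 is slightly larger than the exact quantile, the rule-of-thumb intervals in `ci_rule` are slightly wider than those in `ci_exact`.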

Adding confidence intervals to bar chart

penguins |> 
  count(species) |> 
  mutate(prop = n / sum(n),
         se = sqrt(prop * (1 - prop) / sum(n)),
         lower = prop - 2 * se,
         upper = prop + 2 * se) |> 
  ggplot(aes(x = prop, y = species)) +
  geom_col() +
  geom_errorbar(aes(xmin = lower, xmax = upper), 
                color = "blue",  width = 0.2, linewidth = 1)

Ordering factors in a bar chart

Order the bars by proportion

penguins |> 
  count(species) |> 
  mutate(prop = n / sum(n),
         se = sqrt(prop * (1 - prop) / sum(n)),
         lower = prop - 2 * se,
         upper = prop + 2 * se,
         species = fct_reorder(species, prop)) |> 
  ggplot(aes(x = prop, y = species)) +
  geom_col() +
  geom_errorbar(aes(xmin = lower, xmax = upper), 
                color = "blue", width = 0.2, linewidth = 1)

Don’t do this… (Look closely…)

Hypothesis testing in general

Define the null and alternative hypotheses

Construct the test statistic

Compute the \(p\)-value
- The \(p\)-value is the probability of observing a test statistic at least as extreme as the observed statistic, under the assumption that the null is true
- Is the test statistic “unusual” compared to what we would expect under the null?

Decide whether to reject the null hypothesis
- Compare \(p\)-value to the target error rate (or significance level) \(\alpha\)
- Typically choose \(\alpha = 0.05\) (the origins of 0.05)
- In other words, if \(H_0\) is true and we test at \(\alpha = 0.05\), there is a 5% chance that we reject it anyway (a false positive, also known as a Type I error)
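To make the meaning of \(\alpha\) concrete, the sketch below simulates data for which the null hypothesis is true by construction and records how often a test rejects at \(\alpha = 0.05\). It uses `chisq.test()` (covered in the next section) as the test; the seed, number of simulations, and sample size are arbitrary illustrative choices.

```r
set.seed(315)    # arbitrary seed, for reproducibility only

n_sims <- 2000   # number of simulated data sets (illustrative choice)
n_obs  <- 300    # observations per data set (illustrative choice)
alpha  <- 0.05

# Draw counts from three equally likely categories, so H0 holds by construction
p_values <- replicate(n_sims, {
  counts <- rmultinom(1, size = n_obs, prob = rep(1 / 3, 3))[, 1]
  chisq.test(counts)$p.value
})

# Fraction of simulations that (wrongly) reject H0
type1_rate <- mean(p_values <= alpha)
type1_rate
```

The rejection rate should land near 0.05, up to simulation noise and the approximation error of the test.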

Chi-squared test for 1D categorical data

Null hypothesis: \(H_0\): \(p_1 = p_2 = \cdots = p_K\)
Test statistic: \(\displaystyle \chi^2 = \sum_{j=1}^K \frac{(O_j - E_j)^2}{E_j}\), where
- \(O_j\): observed counts in category \(j\)
- \(E_j\) : expected counts under \(H_0\)
  (each category is equally likely to occur, with probability \(1/K = p_1 = p_2 = \cdots = p_K\), so \(E_j = n/K\))

chisq.test(table(penguins$species))


    Chi-squared test for given probabilities

data:  table(penguins$species)
X-squared = 31.907, df = 2, p-value = 1.179e-07
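The same result can be reproduced by hand, which makes the roles of \(O_j\) and \(E_j\) explicit. This is a sketch in base R using the counts from the `table(penguins$species)` output; the variable names are illustrative.

```r
# Observed counts O_j, copied from the table(penguins$species) output
counts <- c(Adelie = 152, Chinstrap = 68, Gentoo = 124)
n <- sum(counts)
K <- length(counts)

expected <- rep(n / K, K)                        # E_j = n / K under H0
chi_sq <- sum((counts - expected)^2 / expected)  # test statistic

# Upper-tail probability of a chi-squared distribution with K - 1 degrees of freedom
p_value <- pchisq(chi_sq, df = K - 1, lower.tail = FALSE)

c(statistic = chi_sq, df = K - 1, p_value = p_value)
```

The statistic and \(p\)-value agree with the `chisq.test()` output above.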

Graphics versus statistical inference

Reminder: Anscombe’s Quartet

Statistical inference is the same, but the graphics are very different
The opposite can be true! Graphics can be the same, but statistical inference is very different

Toy example: 3 categories, \(p_1 = 1/2,\ p_2 = p_3 = 1/4\)

Test results for different sample sizes
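A sketch of this comparison, assuming idealized counts that match the toy proportions exactly. The three sample sizes are illustrative choices, not necessarily the ones used for the slide's figure.

```r
p_toy <- c(1 / 2, 1 / 4, 1 / 4)    # proportions from the toy example
sample_sizes <- c(20, 200, 2000)   # illustrative sample sizes (an assumption)

results <- t(sapply(sample_sizes, function(n) {
  counts <- n * p_toy              # idealized counts with exactly these proportions
  test <- chisq.test(counts)       # H0: all three categories equally likely
  c(n = n, statistic = unname(test$statistic), p_value = test$p.value)
}))

results
```

The bar chart of proportions looks identical for every \(n\), yet the test statistic grows in proportion to \(n\), so the \(p\)-value shrinks from well above 0.05 to far below it.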

How do we combine graphics with inference?

Simply add \(p\)-values (or other info) to graph via text
Add confidence intervals to the graph
- Need to remember what each CI is for
- The CIs on previous slides are for each \(\hat{p}_j\) marginally, NOT jointly
- Have to be careful with multiple testing
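One possible way to attach a test result to a plot is through the subtitle. The sketch below assumes the species counts from the `table(penguins$species)` output and uses `labs(subtitle = ...)`; `annotate("text", ...)` would be another option. The data frame and variable names are illustrative.

```r
library(ggplot2)

# Species counts copied from the table(penguins$species) output
species_counts <- data.frame(
  species = c("Adelie", "Chinstrap", "Gentoo"),
  n = c(152, 68, 124)
)
species_counts$prop <- species_counts$n / sum(species_counts$n)

# Chi-squared test of equal proportions, reported in the subtitle
test <- chisq.test(species_counts$n)
subtitle_text <- paste0(
  "Chi-squared test of equal proportions: p-value = ",
  format.pval(test$p.value, digits = 3)
)

p <- ggplot(species_counts, aes(x = prop, y = species)) +
  geom_col() +
  labs(subtitle = subtitle_text)
p
```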

Confidence intervals visually capture uncertainty

(Rough) Rules-of-thumb for comparing CIs on bar charts

Comparing overlap between two CIs is NOT exactly the same as directly testing for a significant difference

What we really want is a CI for the difference \(\textsf{CI}(\hat{p}_1 - \hat{p}_2)\), rather than \(\textsf{CI}(\hat{p}_1)\) and \(\textsf{CI}(\hat{p}_2)\) separately
If \(\textsf{CI}(\hat{p}_1)\) and \(\textsf{CI}(\hat{p}_2)\) do not overlap, then \(0 \notin\) \(\textsf{CI}(\hat{p}_1 - \hat{p}_2)\)
However, \(\textsf{CI}(\hat{p}_1)\) and \(\textsf{CI}(\hat{p}_2)\) overlapping does not necessarily imply that \(0 \in\) \(\textsf{CI}(\hat{p}_1 - \hat{p}_2)\)
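The last point can be illustrated with made-up numbers. The counts below are hypothetical, chosen only so that the two separate intervals overlap while the interval for the difference excludes 0. The standard error of the difference assumes multinomial sampling, under which \(\textsf{Cov}(\hat{p}_1, \hat{p}_2) = -p_1 p_2 / n\).

```r
# Hypothetical counts for three categories (made up for illustration)
counts <- c(A = 300, B = 250, C = 450)
n <- sum(counts)
p_hat <- counts / n
se <- sqrt(p_hat * (1 - p_hat) / n)

# Separate (marginal) intervals for A and B, using the +/- 2 se rule
ci_A <- unname(p_hat["A"] + c(-2, 2) * se["A"])
ci_B <- unname(p_hat["B"] + c(-2, 2) * se["B"])
overlap <- ci_A[1] <= ci_B[2] && ci_B[1] <= ci_A[2]

# Interval for the difference; the negative covariance between the two
# estimates contributes the extra 2 * p_A * p_B term to the variance
diff_AB <- unname(p_hat["A"] - p_hat["B"])
se_diff <- unname(sqrt((p_hat["A"] * (1 - p_hat["A"]) +
                        p_hat["B"] * (1 - p_hat["B"]) +
                        2 * p_hat["A"] * p_hat["B"]) / n))
ci_diff <- diff_AB + c(-2, 2) * se_diff

overlap          # do the separate intervals overlap?
ci_diff[1] > 0   # does the interval for the difference exclude 0?
```

Both checks come out `TRUE` here: the separate intervals overlap slightly, but the interval for the difference still excludes 0.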

Roughly speaking:

If CIs do not overlap: evidence of a significant difference
If CIs overlap slightly: ambiguous
If CIs overlap substantially: likely no significant difference

If we compare more than two CIs simultaneously, we must account for multiple testing

Looking for all non-overlapping CIs among \(K\) groups implicitly involves \(\displaystyle \binom{K}{2}\) pairwise comparisons

Corrections for multiple testing

In those bar plots, when we determine whether CIs overlap, we make 3 comparisons:
1. A vs B
2. A vs C
3. B vs C

This is a multiple testing (or multiple comparison) problem

Making multiple comparisons increases the probability of a Type I error beyond 5%
- Type I error: rejecting \(H_0\) when \(H_0\) is true
- Example: concluding that A and B differ because their CIs don’t overlap, although \(H_0: p_A = p_B\) is true

If we are only interested in comparing A vs B, then just construct a 95% CI for A vs B and control the error rate at 5%
However, if we perform several comparisons simultaneously, the overall probability of making at least one Type I error becomes greater than 5%.
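A rough calculation shows how quickly this probability grows. The sketch assumes the tests are independent, which is a simplification (pairwise comparisons among the same groups are not independent), and also shows the union bound \(m \alpha\), which holds regardless of dependence. Here `m` is an illustrative name for the number of comparisons.

```r
alpha <- 0.05
m <- 1:6   # number of comparisons

# Probability of at least one Type I error if the m tests were independent
# (an assumption made for illustration)
fwer_independent <- 1 - (1 - alpha)^m

# Upper bound that holds whether or not the tests are independent
fwer_bound <- pmin(1, m * alpha)

round(data.frame(m, fwer_independent, fwer_bound), 3)
```

With 3 comparisons, the probability under independence is already about 14%, and the bound is 15%.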

Corrections for multiple testing

Basic idea:

Multiple testing corrections make hypothesis tests more conservative (e.g., make \(p\)-values larger)
Equivalently, they produce wider CIs
Goal: control the overall (family-wise) Type I error rate at \(\leq 5\%\)

Bonferroni correction: simple, easy to implement, but most conservative

Normally, we reject \(H_0\) when \(p\)-value \(\leq 0.05\)
If making \(K\) comparisons (here \(K\) counts comparisons, not categories), the Bonferroni correction rejects only if \(\displaystyle p\text{-value} \leq \frac{0.05}{K}\)
Equivalently, instead of plotting 95% CIs, we plot \(\displaystyle\left(1 - \frac{0.05}{K}\right) \times 100 \%\) CIs
- e.g., when \(K = 3\), use 98.33% CIs
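A sketch of the adjustment for the penguin proportions, assuming \(K = 3\) comparisons and the counts from the `table(penguins$species)` output. The only change from the unadjusted interval is the normal quantile; the column names are illustrative.

```r
# Species counts copied from the table(penguins$species) output
counts <- c(Adelie = 152, Chinstrap = 68, Gentoo = 124)
n <- sum(counts)
p_hat <- counts / n
se <- sqrt(p_hat * (1 - p_hat) / n)

alpha <- 0.05
K <- 3   # number of comparisons

z_unadjusted <- qnorm(1 - alpha / 2)        # 95% intervals
z_bonferroni <- qnorm(1 - alpha / (2 * K))  # (1 - alpha / K) * 100% intervals

ci_table <- data.frame(
  lower_95   = p_hat - z_unadjusted * se,
  upper_95   = p_hat + z_unadjusted * se,
  lower_bonf = p_hat - z_bonferroni * se,
  upper_bonf = p_hat + z_bonferroni * se
)

round(ci_table, 3)
```

The Bonferroni quantile is about 2.39 instead of 1.96, so every interval widens by roughly 22%. For \(p\)-values rather than intervals, base R's `p.adjust(p, method = "bonferroni")` applies the equivalent correction by multiplying each \(p\)-value by the number of tests (capped at 1).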

Impact of Bonferroni correction on confidence intervals

Takeaways

Graphics for 1D categorical data (e.g., bar charts) show the empirical distribution
of the categorical variable ( \(\hat{p}_1, \dots, \hat{p}_K\) )
Chi-squared test is a common test for 1D categorical data, testing \(H_0 : p_1 = \cdots = p_K\)
However, from this (global) test alone, we can’t tell which probabilities differ
We can compute individual confidence intervals for each of \(\hat{p}_1, \dots, \hat{p}_K\)
- Allows for easy visualization
- But can be complicated, especially with respect to multiple testing
Graphs with the same trends can display very different statistical significance (largely due to sample size)

Do it live: data viz replication

nwsl <- read_csv("https://raw.githubusercontent.com/qntkhvn/36-315-summer26/refs/heads/master/data/nwsl.csv")