Visualizations and inference for 2D categorical data

36-315: Statistical Graphics and Visualization, Summer 2026

Summarizing 2D categorical data

library(tidyverse)
theme_set(theme_light())

titanic <- read_csv("https://raw.githubusercontent.com/qntkhvn/36-315-summer26/refs/heads/master/data/titanic.csv") |> 
  filter(!is.na(Survived) & !is.na(Pclass)) |> 
  mutate(Survived = factor(Survived),
         Pclass = factor(Pclass))

Using the titanic data, consider:

Survived: survival outcome (0/1)

table(titanic$Survived)


  0   1 
424 288

Pclass: cabin class (1st/2nd/3rd)

table(titanic$Pclass)


  1   2   3 
184 173 355

Summarizing 2D categorical data

Two-way table (also called a contingency table, cross tabulation, or crosstab)

table("Survived" = titanic$Survived, "Pclass" = titanic$Pclass)

        Pclass
Survived   1   2   3
       0  64  90 270
       1 120  83  85

xtabs(~ Survived + Pclass, data = titanic)

        Pclass
Survived   1   2   3
       0  64  90 270
       1 120  83  85

Joint, marginal, and conditional distributions

xtabs(~ Survived + Pclass, data = titanic) |> 
  addmargins()

        Pclass
Survived   1   2   3 Sum
     0    64  90 270 424
     1   120  83  85 288
     Sum 184 173 355 712

Column and row sums: marginal distributions

Values within rows (divided by the row total): conditional distribution for Pclass given Survived

Values within columns (divided by the column total): conditional distribution for Survived given Pclass

Values within cells (divided by the grand total): joint distribution for Survived and Pclass

Bottom right: total number of observations

Joint, marginal, and conditional probabilities

Joint distribution: the intersection
- e.g., \(P(\texttt{Survived} = 1, \texttt{Pclass} = 3)\)

Marginal distribution: row sums or column sums
- e.g., \(P(\texttt{Survived} = 1)\), \(P(\texttt{Pclass} = 3)\)

Conditional distribution:
probability of \(\texttt{Survived}\) given \(\texttt{Pclass}\)
- e.g., \(P(\texttt{Survived} = 1 \mid \texttt{Pclass} = 3)\)
  
  \(\displaystyle \qquad \quad = \frac{P(\texttt{Survived} = 1, \texttt{Pclass} = 3)}{P(\texttt{Pclass} = 3)}\)

table(titanic$Survived, titanic$Pclass)

   
      1   2   3
  0  64  90 270
  1 120  83  85

prop.table(table(titanic$Survived, titanic$Pclass))

   
             1          2          3
  0 0.08988764 0.12640449 0.37921348
  1 0.16853933 0.11657303 0.11938202
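
As a minimal sketch, all three kinds of probabilities can be read off the same table with `prop.table()` and its `margin` argument. The counts are typed in from the two-way table above so the block runs without downloading the data; the object names (`tab`, `joint`, and so on) are illustrative only.

```r
# Counts copied from the two-way table (rows: Survived, columns: Pclass)
tab <- as.table(matrix(c(64, 120, 90, 83, 270, 85), nrow = 2,
                       dimnames = list(Survived = c("0", "1"),
                                       Pclass = c("1", "2", "3"))))

# Joint distribution: every cell divided by the grand total
joint <- prop.table(tab)

# Marginal distributions: row sums and column sums of the joint table
marg_survived <- rowSums(joint)
marg_pclass <- colSums(joint)

# Conditional distribution of Survived given Pclass: each column sums to 1
cond_survived_given_pclass <- prop.table(tab, margin = 2)

# P(Survived = 1 | Pclass = 3), computed as joint / marginal and then directly
joint["1", "3"] / marg_pclass["3"]
cond_survived_given_pclass["1", "3"]
```

Both expressions equal 85 / 355, roughly 0.239. Using `margin = 1` instead would give the conditional distribution of Pclass given Survived, with each row summing to 1.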

Connecting distributions to visualizations

For two categorical variables \(A\) and \(B\)
- Marginal distributions: \(P(A)\) and \(P(B)\)
- Conditional distributions: \(P(A \mid B)\) and \(P(B \mid A)\)
- Joint distribution: \(P(A, B)\)

We use bar charts to visualize marginal distributions for categorical variables…

And we will use more bar charts (stacked and side-by-side) to visualize conditional and joint distributions

Stacked bar charts

Stacked bar chart: a bar chart of spine charts
Easy to see marginal of Survived
- i.e., \(P(\texttt{x})\)
Can see conditional of Pclass | Survived
- i.e., \(P(\texttt{fill} \mid \texttt{x})\)
Harder to see conditional of Survived | Pclass
- i.e., \(P(\texttt{x} \mid \texttt{fill})\)

titanic |> 
  ggplot(aes(x = Survived, fill = Pclass)) +
  geom_bar()

Side-by-side bar charts

Side-by-side (grouped/dodged) bar chart: a bar chart of bar charts
Easy to see conditional of Pclass | Survived
- i.e., \(P(\texttt{fill} \mid \texttt{x})\)
Can see conditional of Survived | Pclass
- i.e., \(P(\texttt{x} \mid \texttt{fill})\)
Harder to see marginals…

titanic |> 
  ggplot(aes(x = Survived, fill = Pclass)) +
  geom_bar(position = "dodge")
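
One further variant worth comparing, sketched here as an option rather than something the slides use: `position = "fill"` rescales every bar to height 1, so the segments show conditional proportions directly. The data frame `counts` is a hypothetical stand-in built from the two-way table earlier so the block runs on its own; with the full data, the same idea would be `geom_bar(position = "fill")`.

```r
library(ggplot2)

# Hypothetical stand-in for the titanic data: one row per cell of the two-way table
counts <- data.frame(
  Survived = factor(rep(c(0, 1), times = 3)),
  Pclass = factor(rep(1:3, each = 2)),
  n = c(64, 120, 90, 83, 270, 85)
)

# Each bar is rescaled to height 1, so segments show P(Survived | Pclass)
p <- ggplot(counts, aes(x = Pclass, y = n, fill = Survived)) +
  geom_col(position = "fill") +
  labs(y = "Proportion within class")

p
```

The trade-off is that the marginal distribution of the x variable is no longer visible, since all bars have the same height.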

Which one do you prefer?

Categorical heatmaps

Use geom_tile() to display the joint distribution of two categorical variables
Annotate tiles with percentage labels using geom_text()

titanic |>
  group_by(Survived, Pclass) |>
  summarize(n = n(), 
            joint = n / nrow(titanic),
            # paste0() avoids the stray space that paste() puts before "%"
            txt = paste0(round(100 * joint, 2), "%"),
            .groups = "drop") |> 
  ggplot(aes(x = Survived, y = Pclass)) +
  geom_tile(aes(fill = n), color = "white") +
  geom_text(aes(label = txt)) +
  scale_fill_gradient2()

Inference for 2D categorical data: Chi-squared test

Null hypothesis (\(H_0\)): variables \(A\) (rows) and \(B\) (columns) are independent of each other
- e.g., no association between Survived and Pclass

Test statistic: \(\displaystyle \chi^2 = \sum_{i=1}^{k_1} \sum_{j=1}^{k_2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\)
- \(O_{ij}\): observed cell counts in two-way table
- \(E_{ij}\): expected cell counts under \(H_0\), where \[E_{ij} = n \cdot P(A = a_i, B = b_j) = n \cdot P(A = a_i) P(B = b_j) = n \cdot \left( \frac{n_{i \cdot}}{n} \right) \left( \frac{ n_{\cdot j}}{n} \right)\]

chisq.test(table(titanic$Survived, titanic$Pclass))


    Pearson's Chi-squared test

data:  table(titanic$Survived, titanic$Pclass)
X-squared = 91.081, df = 2, p-value < 2.2e-16
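
To connect the formula to the output, here is a minimal sketch that computes the expected counts and the test statistic by hand. The counts are typed in from the two-way table so the block is self-contained, and the names `expected`, `chi_sq`, and `dof` are illustrative.

```r
# Observed counts (rows: Survived, columns: Pclass)
tab <- matrix(c(64, 120, 90, 83, 270, 85), nrow = 2,
              dimnames = list(Survived = c("0", "1"),
                              Pclass = c("1", "2", "3")))
n <- sum(tab)

# Expected counts under independence: (row total) * (column total) / n
expected <- outer(rowSums(tab), colSums(tab)) / n

# Chi-squared statistic: sum over all cells of (O - E)^2 / E
chi_sq <- sum((tab - expected)^2 / expected)

# Degrees of freedom: (number of rows - 1) * (number of columns - 1)
dof <- (nrow(tab) - 1) * (ncol(tab) - 1)

# Upper-tail p-value from the chi-squared distribution
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

round(chi_sq, 3)
dof
p_value
```

This should reproduce the statistic and degrees of freedom that `chisq.test()` reports above. No continuity correction is involved here, since `chisq.test()` applies one only to 2 x 2 tables.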

Visualizing independence with mosaic plots (marimekko)

Two variables are independent if knowing the level of one tells us nothing about the other
i.e., \(P(A \mid B) = P(A)\), and \(P(A, B) = P(A) \times P(B)\)

Mosaic plots: a spine chart of spine charts
width: marginal distribution of Survived (columns)
height: conditional of Pclass | Survived
(rows | columns)
area: joint distribution

Use a mosaic plot to visually check for independence: whether all proportions are the same
(the boxes line up in a grid)

mosaicplot(table(titanic$Survived, titanic$Pclass),
           main = "Relationship between survival and cabin class")
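
To make the "boxes line up in a grid" idea concrete, the sketch below compares the conditional distributions in the observed table with those in the table of counts expected under independence. The counts are typed in from the two-way table earlier, and `expected` is an illustrative name.

```r
# Observed counts (rows: Survived, columns: Pclass)
tab <- matrix(c(64, 120, 90, 83, 270, 85), nrow = 2,
              dimnames = list(Survived = c("0", "1"),
                              Pclass = c("1", "2", "3")))

# Counts we would expect if Survived and Pclass were independent
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Observed: P(Pclass | Survived) differs between the two rows
obs_cond <- prop.table(tab, margin = 1)
obs_cond

# Under independence: both rows share one conditional distribution...
exp_cond <- prop.table(expected, margin = 1)
exp_cond

# ...and it equals the marginal distribution of Pclass
marg_pclass <- colSums(tab) / sum(tab)
marg_pclass
```

A mosaic plot of `expected` would therefore have box heights that match across columns, whereas the heights in the observed plot differ between survivors and non-survivors.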

Residuals

Recall the test statistic: \(\displaystyle \chi^2 = \sum_{i=1}^{k_1} \sum_{j=1}^{k_2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\)

Define the Pearson residuals: \(\displaystyle r_{ij} = \frac{O_{ij} - E_{ij}}{\sqrt{E_{ij}}}\)

Some rules of thumb:
- \(r_{ij} \approx 0\): observed counts are close to expected counts
- \(|r_{ij}| > 2\): significant at \(\alpha = 0.05\)
- very positive \(r_{ij}\): higher than expected
- very negative \(r_{ij}\): lower than expected
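
As a small sketch of these definitions, the Pearson residuals can be computed by hand and checked against the `residuals` component returned by `chisq.test()`. The counts are typed in from the two-way table earlier; `fit` and `resid_manual` are illustrative names.

```r
# Observed counts (rows: Survived, columns: Pclass)
tab <- matrix(c(64, 120, 90, 83, 270, 85), nrow = 2,
              dimnames = list(Survived = c("0", "1"),
                              Pclass = c("1", "2", "3")))

fit <- chisq.test(tab)

# Pearson residuals by hand: (observed - expected) / sqrt(expected)
resid_manual <- (fit$observed - fit$expected) / sqrt(fit$expected)

# chisq.test() stores the same quantity in $residuals
round(fit$residuals, 2)

# Rule of thumb: flag the cells with |r_ij| > 2
abs(fit$residuals) > 2

# The squared residuals add up to the chi-squared statistic
sum(fit$residuals^2)
```

For this table, the first- and third-class cells have residuals beyond ±2 (more first-class survivors and more third-class non-survivors than expected), while the second-class cells stay within ±2.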

Mosaic plots with residuals

From the Chi-squared test earlier: Survived and Pclass are associated. But how?
Boxes are shaded by Pearson residuals (shade = TRUE), revealing which combinations of
the 2 categorical variables (cells) have higher or lower counts than expected under independence

mosaicplot(table(titanic$Survived, titanic$Pclass), shade = TRUE, main = "Relationship between survival and cabin class")

Mosaic plots with residuals

mosaicplot(table(titanic$Sex, titanic$Embarked), shade = TRUE, main = "Passenger sex appears independent of embarkation point")

Beyond 2D: faceting

titanic |> 
  ggplot(aes(x = Survived, fill = Pclass)) + 
  geom_bar(position = "dodge") +
  facet_wrap(~ Sex)

Mosaic plots in the wild

1000 songs to hear before you die (Source: The Guardian)

Do it live: data viz replication

songs <- read_csv("https://raw.githubusercontent.com/qntkhvn/36-315-summer26/refs/heads/master/data/songs.csv") |> 
  mutate(decade = case_when(YEAR <= 1959 ~ "1910s-50s",
                            YEAR %in% 1960:1969 ~ "1960s",
                            YEAR %in% 1970:1979 ~ "1970s",
                            YEAR %in% 1980:1989 ~ "1980s",
                            YEAR %in% 1990:1999 ~ "1990s",
                            YEAR >= 2000 ~ "2000s"))

With base R
With ggplot2
Check out this blog post
See also: ggmosaic (has recurring issues) and marimekko

Do it live: data viz replication

bechdel <- read_csv("https://raw.githubusercontent.com/qntkhvn/36-315-summer26/refs/heads/master/data/bechdel.csv")