Visualizations and inference for 2D categorical data


36-315: Statistical Graphics and Visualization, Summer 2026

Summarizing 2D categorical data

library(tidyverse)
theme_set(theme_light())

titanic <- read_csv("https://raw.githubusercontent.com/qntkhvn/36-315-summer26/refs/heads/master/data/titanic.csv") |> 
  filter(!is.na(Survived) & !is.na(Pclass)) |> 
  mutate(Survived = factor(Survived),
         Pclass = factor(Pclass))


Using the titanic data, consider:

  • Survived: survival outcome (0 = died, 1 = survived)
table(titanic$Survived)

  0   1 
424 288 
  • Pclass: cabin class (1 = 1st, 2 = 2nd, 3 = 3rd)
table(titanic$Pclass)

  1   2   3 
184 173 355 

Summarizing 2D categorical data

Two-way table (or contingency table, cross tabulation, crosstab)


table("Survived" = titanic$Survived, "Pclass" = titanic$Pclass)
        Pclass
Survived   1   2   3
       0  64  90 270
       1 120  83  85


xtabs(~ Survived + Pclass, data = titanic)
        Pclass
Survived   1   2   3
       0  64  90 270
       1 120  83  85

Joint, marginal, and conditional distributions

xtabs(~ Survived + Pclass, data = titanic) |> 
  addmargins()
        Pclass
Survived   1   2   3 Sum
     0    64  90 270 424
     1   120  83  85 288
     Sum 184 173 355 712
  • Column and row sums: marginal distributions (as counts)
  • Values within a row, relative to the row total: conditional distribution of Pclass given Survived
  • Values within a column, relative to the column total: conditional distribution of Survived given Pclass
  • Values within cells, relative to the grand total: joint distribution of Survived and Pclass
  • Bottom right: total number of observations
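
As a small illustration of the marginal distributions described above, the sketch below rebuilds the two-way table from the counts printed on this slide (rather than re-reading the CSV) and divides the row and column sums by the total. The object names `tab`, `surv_marginal`, and `class_marginal` are illustrative and not part of the original slides.

```r
# Counts copied from the two-way table above (filled column by column)
tab <- as.table(matrix(c(64, 120, 90, 83, 270, 85), nrow = 2,
                       dimnames = list(Survived = c("0", "1"),
                                       Pclass = c("1", "2", "3"))))

n <- sum(tab)                       # total number of observations

# Marginal distributions: row and column sums divided by the total
surv_marginal  <- rowSums(tab) / n  # P(Survived)
class_marginal <- colSums(tab) / n  # P(Pclass)

surv_marginal
class_marginal
```

Each marginal sums to 1, and the same numbers appear (as counts) in the `Sum` row and column of the `addmargins()` output.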

Joint, marginal, and conditional probabilities

  • Joint distribution: probability that both events occur (the intersection)

    • e.g., \(P(\texttt{Survived} = 1, \texttt{Pclass} = 3)\)


  • Marginal distribution: row sums or column sums of the joint probabilities

    • e.g., \(P(\texttt{Survived} = 1)\), \(P(\texttt{Pclass} = 3)\)


  • Conditional distribution:
    probability of \(\texttt{Survived}\) given \(\texttt{Pclass}\)

    • e.g., \(P(\texttt{Survived} = 1 \mid \texttt{Pclass} = 3)\)

      \(\displaystyle \qquad \quad = \frac{P(\texttt{Survived} = 1, \texttt{Pclass} = 3)}{P(\texttt{Pclass} = 3)}\)


table(titanic$Survived, titanic$Pclass)
   
      1   2   3
  0  64  90 270
  1 120  83  85


prop.table(table(titanic$Survived, titanic$Pclass))
   
             1          2          3
  0 0.08988764 0.12640449 0.37921348
  1 0.16853933 0.11657303 0.11938202
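
To make the conditional-probability formula above concrete, here is a minimal sketch that computes \(P(\texttt{Survived} = 1 \mid \texttt{Pclass} = 3)\) in two ways: from the definition (joint divided by marginal) and with `prop.table(..., margin = 2)`, which normalizes each column. The table is rebuilt from the counts shown above; the object names are illustrative.

```r
# Counts copied from the two-way table above (filled column by column)
tab <- as.table(matrix(c(64, 120, 90, 83, 270, 85), nrow = 2,
                       dimnames = list(Survived = c("0", "1"),
                                       Pclass = c("1", "2", "3"))))

joint <- prop.table(tab)               # P(Survived, Pclass)
p_class3 <- sum(joint[, "3"])          # P(Pclass = 3), a column sum

# Definition: joint probability divided by the marginal
cond_manual <- joint["1", "3"] / p_class3

# Shortcut: normalize within columns, so each column sums to 1
cond_by_col <- prop.table(tab, margin = 2)

cond_manual
cond_by_col["1", "3"]
```

Both values equal 85 / 355, the share of third-class passengers who survived. Using `margin = 1` instead would give the conditional distribution of Pclass given Survived.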

Connecting distributions to visualizations

  • For two categorical variables \(A\) and \(B\)

    • Marginal distributions: \(P(A)\) and \(P(B)\)

    • Conditional distributions: \(P(A \mid B)\) and \(P(B \mid A)\)

    • Joint distribution: \(P(A, B)\)

  • We use bar charts to visualize marginal distributions for categorical variables…
  • And we will use more bar charts (stacked and side-by-side) to visualize conditional and joint distributions

Stacked bar charts

  • Stacked bar chart: a bar chart of spine charts

  • Easy to see marginal of Survived

    • i.e., \(P(\texttt{x})\)
  • Can see conditional of Pclass | Survived

    • i.e., \(P(\texttt{fill} \mid \texttt{x})\)
  • Harder to see conditional of Survived | Pclass

    • i.e., \(P(\texttt{x} \mid \texttt{fill})\)
titanic |> 
  ggplot(aes(x = Survived, fill = Pclass)) +
  geom_bar() 
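
One common variation, not shown in the original slides, is to rescale every bar to height 1 with `position = "fill"`, so the segments display \(P(\texttt{fill} \mid \texttt{x})\) directly, at the cost of hiding the marginal of x. The sketch below is a minimal example under the assumption that the data are supplied as pre-computed counts (copied from the two-way table earlier), which is why it uses `geom_col()` rather than `geom_bar()`. The names `titanic_counts` and `p_fill` are illustrative.

```r
library(ggplot2)

# Cell counts copied from the two-way table, in long format
titanic_counts <- data.frame(
  Survived = factor(rep(c(0, 1), times = 3)),
  Pclass   = factor(rep(1:3, each = 2)),
  n        = c(64, 120, 90, 83, 270, 85)
)

# position = "fill" rescales each bar to height 1,
# so segment heights are proportions of Pclass within each Survived level
p_fill <- ggplot(titanic_counts, aes(x = Survived, y = n, fill = Pclass)) +
  geom_col(position = "fill") +
  labs(y = "Proportion within Survived")

p_fill
```

With the full `titanic` data frame, the equivalent call would be `geom_bar(position = "fill")` in place of `geom_bar()`.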

Side-by-side bar charts

  • Side-by-side (grouped/dodged) bar chart: a bar chart of bar charts

  • Easy to see conditional of Pclass | Survived

    • i.e., \(P(\texttt{fill} \mid \texttt{x})\)
  • Can see conditional of Survived | Pclass

    • i.e., \(P(\texttt{x} \mid \texttt{fill})\)
  • Harder to see marginals…

titanic |> 
  ggplot(aes(x = Survived, fill = Pclass)) +
  geom_bar(position = "dodge")

Which one do you prefer?

Categorical heatmaps

  • Use geom_tile() to display the joint distribution of two categorical variables

  • Annotate tiles with labels of percentages using geom_text()

titanic |>
  group_by(Survived, Pclass) |>
  summarize(n = n(),
            joint = n() / nrow(titanic),               # joint proportion per cell
            txt = paste0(round(100 * joint, 2), "%"),  # label without a space before "%"
            .groups = "drop") |>                       # return an ungrouped data frame
  ggplot(aes(x = Survived, y = Pclass)) +
  geom_tile(aes(fill = n), color = "white") +
  geom_text(aes(label = txt)) +
  scale_fill_gradient2()

Inference for 2D categorical data: Chi-squared test

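  • Null hypothesis (\(H_0\)): variables \(A\) (rows) and \(B\) (columns) are independent of each other

    • e.g., no association between Survived and Pclass
  • Test statistic: \(\displaystyle \chi^2 = \sum_{i=1}^{k_1} \sum_{j=1}^{k_2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\), where \(k_1\) and \(k_2\) are the numbers of levels of \(A\) and \(B\)

    • \(O_{ij}\): observed cell counts in two-way table

    • \(E_{ij}\): expected cell counts under \(H_0\), where \[E_{ij} = n \cdot P(A = a_i, B = b_j) = n \cdot P(A = a_i) P(B = b_j) = n \cdot \left( \frac{n_{i \cdot}}{n} \right) \left( \frac{ n_{\cdot j}}{n} \right) = \frac{n_{i \cdot} \, n_{\cdot j}}{n}\] with \(n_{i \cdot}\) and \(n_{\cdot j}\) the row and column totals (the marginal probabilities are estimated by sample proportions)

    • Under \(H_0\), \(\chi^2\) approximately follows a chi-squared distribution with \((k_1 - 1)(k_2 - 1)\) degrees of freedom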

chisq.test(table(titanic$Survived, titanic$Pclass))

    Pearson's Chi-squared test

data:  table(titanic$Survived, titanic$Pclass)
X-squared = 91.081, df = 2, p-value < 2.2e-16
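
To connect the formula to the `chisq.test()` output above, the following sketch recomputes the expected counts and the test statistic by hand. It rebuilds the table from the printed counts instead of re-reading the data; the object names (`expected`, `chi_sq`, `p_value`) are illustrative.

```r
# Counts copied from the two-way table (filled column by column)
tab <- as.table(matrix(c(64, 120, 90, 83, 270, 85), nrow = 2,
                       dimnames = list(Survived = c("0", "1"),
                                       Pclass = c("1", "2", "3"))))
n <- sum(tab)

# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(tab), colSums(tab)) / n

# Pearson chi-squared statistic and its degrees of freedom
chi_sq <- sum((tab - expected)^2 / expected)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)

# Upper-tail p-value from the chi-squared reference distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

c(statistic = chi_sq, df = df, p_value = p_value)
```

The hand-computed statistic and degrees of freedom should agree with the `X-squared` and `df` values reported by `chisq.test()`; the expected counts are also available as `chisq.test(tab)$expected`.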

Visualizing independence with mosaic plots (marimekko)

Two variables are independent if knowing the level of one tells us nothing about the other,
i.e., \(P(A \mid B) = P(A)\), or equivalently \(P(A, B) = P(A) \times P(B)\)

  • Mosaic plots: a spine chart of spine charts

  • width: marginal distribution of Survived (columns)

  • height: conditional distribution of Pclass | Survived
    (rows | columns)

  • area: joint distribution

Use a mosaic plot to visually check for independence: under independence, the conditional proportions are the same in every column,
so the boxes line up in a grid

mosaicplot(table(titanic$Survived, titanic$Pclass),
           main = "Relationship between survival and cabin class")

Residuals

  • Recall the test statistic: \(\displaystyle \chi^2 = \sum_{i=1}^{k_1} \sum_{j=1}^{k_2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\)
  • Define the Pearson residuals: \(\displaystyle r_{ij} = \frac{O_{ij} - E_{ij}}{\sqrt{E_{ij}}}\), so that \(\chi^2 = \sum_{i} \sum_{j} r_{ij}^2\)
  • Some rules of thumb:

    • \(r_{ij} \approx 0\): observed counts are close to expected counts

    • \(|r_{ij}| > 2\): roughly significant at \(\alpha = 0.05\) (treating \(r_{ij}\) as approximately standard normal under \(H_0\))

    • very positive \(r_{ij}\): higher than expected

    • very negative \(r_{ij}\): lower than expected
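
As a quick check of the definition above, this sketch computes the Pearson residuals by hand and compares them with the `residuals` component returned by `chisq.test()`. The table is rebuilt from the counts printed earlier; the object names are illustrative.

```r
# Counts copied from the two-way table (filled column by column)
tab <- as.table(matrix(c(64, 120, 90, 83, 270, 85), nrow = 2,
                       dimnames = list(Survived = c("0", "1"),
                                       Pclass = c("1", "2", "3"))))

# Expected counts under independence
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson_res <- (tab - expected) / sqrt(expected)
round(pearson_res, 2)

# The same residuals as stored by chisq.test()
round(chisq.test(tab)$residuals, 2)

# Cells flagged by the |r| > 2 rule of thumb
abs(pearson_res) > 2
```

The squared residuals sum to the chi-squared statistic, so the cells with the largest absolute residuals are the ones contributing most to the evidence against independence.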

Mosaic plots with residuals

  • From the Chi-squared test earlier: Survived and Pclass are associated. But how?

  • Setting shade = TRUE shades the boxes by their Pearson residuals, revealing which combinations of
    the 2 categorical variables (cells) have higher or lower counts than expected under independence

mosaicplot(table(titanic$Survived, titanic$Pclass),
           shade = TRUE,
           main = "Relationship between survival and cabin class")

Mosaic plots with residuals

mosaicplot(table(titanic$Sex, titanic$Embarked),
           shade = TRUE,
           main = "Passenger sex appears independent of embarkation point")

Beyond 2D: faceting

titanic |> 
  ggplot(aes(x = Survived, fill = Pclass)) + 
  geom_bar(position = "dodge") +
  facet_wrap(~ Sex)

Mosaic plots in the wild

1000 songs to hear before you die (Source: The Guardian)

Do it live: data viz replication

songs <- read_csv("https://raw.githubusercontent.com/qntkhvn/36-315-summer26/refs/heads/master/data/songs.csv") |> 
  mutate(decade = case_when(YEAR <= 1959 ~ "1910s-50s",
                            YEAR %in% 1960:1969 ~ "1960s",
                            YEAR %in% 1970:1979 ~ "1970s",
                            YEAR %in% 1980:1989 ~ "1980s",
                            YEAR %in% 1990:1999 ~ "1990s",
                            YEAR >= 2000 ~ "2000s"))

Do it live: data viz replication

bechdel <- read_csv("https://raw.githubusercontent.com/qntkhvn/36-315-summer26/refs/heads/master/data/bechdel.csv")