Visualizing 1D categorical data


36-315: Statistical Graphics and Visualization, Summer 2026

Motivating data: Palmer Penguins

head(penguins)
  species    island bill_len bill_dep flipper_len body_mass    sex year
1  Adelie Torgersen     39.1     18.7         181      3750   male 2007
2  Adelie Torgersen     39.5     17.4         186      3800 female 2007
3  Adelie Torgersen     40.3     18.0         195      3250 female 2007
4  Adelie Torgersen       NA       NA          NA        NA   <NA> 2007
5  Adelie Torgersen     36.7     19.3         193      3450 female 2007
6  Adelie Torgersen     39.3     20.6         190      3650   male 2007
  • Reminder: tidy data

    • Each row is an observation (penguin)

    • Each column is a variable/measurement about each observation

  • Two main variable types: quantitative and categorical

Data wrangling with dplyr

  • dplyr is a package within the tidyverse with functions for data wrangling
library(tidyverse)
  • The main dplyr verbs for manipulating data

    • select()

    • filter()

    • arrange()

    • mutate()

    • group_by()

    • summarize()

select()

  • Use select() to extract COLUMNS (variables) of interest

  • Simply specify the column names…

penguins |> 
  select(species, island, body_mass)
# A tibble: 344 × 3
   species island    body_mass
   <fct>   <fct>         <int>
 1 Adelie  Torgersen      3750
 2 Adelie  Torgersen      3800
 3 Adelie  Torgersen      3250
 4 Adelie  Torgersen        NA
 5 Adelie  Torgersen      3450
 6 Adelie  Torgersen      3650
 7 Adelie  Torgersen      3625
 8 Adelie  Torgersen      4675
 9 Adelie  Torgersen      3475
10 Adelie  Torgersen      4250
# ℹ 334 more rows
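
Besides bare column names, select() accepts helper functions and negation. The sketch below uses a small made-up data frame (toy_penguins and its values are hypothetical and only borrow the column names shown above), and loads dplyr directly rather than the full tidyverse:

```r
library(dplyr)

# Hypothetical toy data that reuses a few penguins column names
toy_penguins <- data.frame(
  species   = c("Adelie", "Gentoo", "Chinstrap"),
  island    = c("Torgersen", "Biscoe", "Dream"),
  bill_len  = c(39.1, 49.2, 46.5),
  bill_dep  = c(18.7, 15.2, 17.9),
  body_mass = c(3750, 6300, 3500)
)

# Keep every column whose name starts with "bill"
bill_cols <- toy_penguins |>
  select(starts_with("bill"))

# Drop a column by negating it
no_island <- toy_penguins |>
  select(-island)

# Keep only the numeric columns
numeric_cols <- toy_penguins |>
  select(where(is.numeric))
```

The same calls should work on the full penguins data, since they depend only on column names and types.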

filter()

  • Use filter() to extract ROWS (observations) that meet certain conditions
  • Need to specify a logical condition (aka boolean expression)
  • x < y: less than
  • x <= y: less than or equal to
  • x == y: equal to
  • x != y: not equal to
  • x > y: greater than
  • x >= y: greater than or equal to
  • x %in% y: whether the value is present in a given vector
  • is.na(x): is missing
  • !is.na(x): is not missing
  • x & y: and
  • x | y: or
  • !x: not
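
A few of these operators in action, as a minimal sketch on a made-up data frame (toy_penguins and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data with some missing values
toy_penguins <- data.frame(
  species   = c("Adelie", "Gentoo", "Chinstrap", "Adelie"),
  sex       = c("male", NA, "female", "female"),
  body_mass = c(3750, 5000, 3500, NA)
)

# %in%: keep rows whose species is one of several values
adelie_or_gentoo <- toy_penguins |>
  filter(species %in% c("Adelie", "Gentoo"))

# !is.na(): keep rows where body_mass was recorded
has_mass <- toy_penguins |>
  filter(!is.na(body_mass))

# Conditions separated by commas are combined with "and", like &
heavy_known_sex <- toy_penguins |>
  filter(body_mass > 3600, !is.na(sex))
```

Note that filter() drops rows where the condition evaluates to NA, so the penguin with a missing body_mass is excluded by body_mass > 3600 as well.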

filter()

Example: Extract features for female Adelie penguins only

penguins |> 
  filter(species == "Adelie" & sex == "female")
# A tibble: 73 × 8
   species island    bill_len bill_dep flipper_len body_mass sex     year
   <fct>   <fct>        <dbl>    <dbl>       <int>     <int> <fct>  <int>
 1 Adelie  Torgersen     39.5     17.4         186      3800 female  2007
 2 Adelie  Torgersen     40.3     18           195      3250 female  2007
 3 Adelie  Torgersen     36.7     19.3         193      3450 female  2007
 4 Adelie  Torgersen     38.9     17.8         181      3625 female  2007
 5 Adelie  Torgersen     41.1     17.6         182      3200 female  2007
 6 Adelie  Torgersen     36.6     17.8         185      3700 female  2007
 7 Adelie  Torgersen     38.7     19           195      3450 female  2007
 8 Adelie  Torgersen     34.4     18.4         184      3325 female  2007
 9 Adelie  Biscoe        37.8     18.3         174      3400 female  2007
10 Adelie  Biscoe        35.9     19.2         189      3800 female  2007
# ℹ 63 more rows

arrange()

  • Arrange observations (rows) by variables (columns)

    • Ascending order is the default (low to high for numeric columns, alphabetical order for character columns)

Example: Sort penguins by body mass (heaviest first)

penguins |> 
  arrange(desc(body_mass)) # desc() for descending order
# A tibble: 344 × 8
   species island bill_len bill_dep flipper_len body_mass sex    year
   <fct>   <fct>     <dbl>    <dbl>       <int>     <int> <fct> <int>
 1 Gentoo  Biscoe     49.2     15.2         221      6300 male   2007
 2 Gentoo  Biscoe     59.6     17           230      6050 male   2007
 3 Gentoo  Biscoe     51.1     16.3         220      6000 male   2008
 4 Gentoo  Biscoe     48.8     16.2         222      6000 male   2009
 5 Gentoo  Biscoe     45.2     16.4         223      5950 male   2008
 6 Gentoo  Biscoe     49.8     15.9         229      5950 male   2009
 7 Gentoo  Biscoe     48.4     14.6         213      5850 male   2007
 8 Gentoo  Biscoe     49.3     15.7         217      5850 male   2007
 9 Gentoo  Biscoe     55.1     16           230      5850 male   2009
10 Gentoo  Biscoe     49.5     16.2         229      5800 male   2008
# ℹ 334 more rows

arrange()

  • Arrange by multiple columns (variable order matters)

Example: Sort penguins by bill length (low to high, first sort), then flipper length (high to low, second sort)

penguins |> 
  arrange(bill_len, desc(flipper_len))
# A tibble: 344 × 8
   species island    bill_len bill_dep flipper_len body_mass sex     year
   <fct>   <fct>        <dbl>    <dbl>       <int>     <int> <fct>  <int>
 1 Adelie  Dream         32.1     15.5         188      3050 female  2009
 2 Adelie  Dream         33.1     16.1         178      2900 female  2008
 3 Adelie  Torgersen     33.5     19           190      3600 female  2008
 4 Adelie  Dream         34       17.1         185      3400 female  2008
 5 Adelie  Torgersen     34.1     18.1         193      3475 <NA>    2007
 6 Adelie  Torgersen     34.4     18.4         184      3325 female  2007
 7 Adelie  Biscoe        34.5     18.1         187      2900 female  2008
 8 Adelie  Torgersen     34.6     21.1         198      4400 male    2007
 9 Adelie  Torgersen     34.6     17.2         189      3200 female  2008
10 Adelie  Biscoe        35       17.9         192      3725 female  2009
# ℹ 334 more rows

mutate()

  • Use mutate() to create new variables

  • New variables created via mutate() are usually based on existing variables

    • Make sure to give your new variable a name

    • Note that naming the new variable the same as the existing variable will overwrite the original column
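
The difference between adding and overwriting a column can be sketched on a tiny made-up data frame (toy_penguins and body_mass_kg are hypothetical names used only for illustration):

```r
library(dplyr)

# Hypothetical toy data; body_mass is in grams
toy_penguins <- data.frame(
  species   = c("Adelie", "Gentoo"),
  body_mass = c(3750, 6300)
)

# New name: adds a column and leaves body_mass unchanged
added <- toy_penguins |>
  mutate(body_mass_kg = body_mass / 1000)

# Existing name: replaces body_mass in the result
overwritten <- toy_penguins |>
  mutate(body_mass = body_mass / 1000)
```

In both cases toy_penguins itself is unchanged; the result only persists if it is assigned to a name.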

mutate()

Example: Create a new column for bill ratio (bill length / bill depth), and a new indicator column that flags penguins as “large” (1) if they weigh more than 4.5 kg (4500 g) and 0 otherwise

penguins |> 
  mutate(bill_ratio = bill_len / bill_dep,
         is_large = ifelse(body_mass > 4500, 1, 0))
# A tibble: 344 × 10
   species island bill_len bill_dep flipper_len body_mass sex    year bill_ratio
   <fct>   <fct>     <dbl>    <dbl>       <int>     <int> <fct> <int>      <dbl>
 1 Adelie  Torge…     39.1     18.7         181      3750 male   2007       2.09
 2 Adelie  Torge…     39.5     17.4         186      3800 fema…  2007       2.27
 3 Adelie  Torge…     40.3     18           195      3250 fema…  2007       2.24
 4 Adelie  Torge…     NA       NA            NA        NA <NA>   2007      NA   
 5 Adelie  Torge…     36.7     19.3         193      3450 fema…  2007       1.90
 6 Adelie  Torge…     39.3     20.6         190      3650 male   2007       1.91
 7 Adelie  Torge…     38.9     17.8         181      3625 fema…  2007       2.19
 8 Adelie  Torge…     39.2     19.6         195      4675 male   2007       2   
 9 Adelie  Torge…     34.1     18.1         193      3475 <NA>   2007       1.88
10 Adelie  Torge…     42       20.2         190      4250 <NA>   2007       2.08
# ℹ 334 more rows
# ℹ 1 more variable: is_large <dbl>

summarize() (by itself)

  • Use summarize() to collapse the data down to a single row (per group)
    by aggregating variables into single values

  • Useful for computing summaries (e.g., mean, median, max, min, correlation)

penguins |> 
  summarize(median_bill_len = median(bill_len, na.rm = TRUE))
# A tibble: 1 × 1
  median_bill_len
            <dbl>
1            44.4

group_by() and summarize()

  • group_by() converts the data into a grouped format where operations are performed by group

  • A group can be defined by one or more variables (columns)

  • group_by() becomes powerful when combined with summarize()

  • Use the pipe operator |> to perform multiple operations

Example: How many male and female Adelie penguins are on each island (Biscoe, Dream, Torgersen)?

penguins |> 
  filter(species == "Adelie", !is.na(sex)) |> 
  group_by(sex, island) |> 
  summarize(count = n())
# A tibble: 6 × 3
# Groups:   sex [2]
  sex    island    count
  <fct>  <fct>     <int>
1 female Biscoe       22
2 female Dream        27
3 female Torgersen    24
4 male   Biscoe       22
5 male   Dream        28
6 male   Torgersen    23
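
The "# Groups: sex [2]" line appears because summarize() removes only the last grouping variable by default. A minimal sketch of two ways to get an ungrouped result, using a made-up data frame (toy_penguins is hypothetical):

```r
library(dplyr)

# Hypothetical toy data: one row per penguin
toy_penguins <- data.frame(
  sex    = c("female", "female", "male", "male", "male"),
  island = c("Biscoe", "Dream", "Biscoe", "Biscoe", "Dream")
)

# group_by() + summarize(), dropping all grouping afterwards
by_hand <- toy_penguins |>
  group_by(sex, island) |>
  summarize(count = n(), .groups = "drop")

# count() groups, counts, and ungroups in one step;
# name = sets the name of the count column (the default is n)
shortcut <- toy_penguins |>
  count(sex, island, name = "count")
```

Both results have one row per sex and island combination with the same counts.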

Variable types

Most visualizations are about understanding the distribution of different variables (columns)

The variable type usually dictates the type of graphs you should make

There are two main types of variables:

Quantitative

  • Discrete (i.e., counts, usually recorded as whole numbers)
    Examples: number of likes/retweets, number of times word is used

  • Continuous (any real number)
    Examples: income, age, miles run, heart rate

Categorical

  • Focus of today’s lecture

Categorical data

Two different versions of categorical data:

Nominal: categorical variables having unordered scales

  • Examples: race, gender, species, etc.

Ordinal: ordered categories; levels with a meaningful order

  • Examples: education level, grades, ranks

Factors in R

  • In R, factors are used to work with categorical variables
  • R stores factor levels in a fixed order, which defaults to alphabetical (factors are treated as unordered unless declared ordered)

    • May need to manually define the factor levels (e.g., the reference level)
class(penguins$species)
[1] "factor"
levels(penguins$species)
[1] "Adelie"    "Chinstrap" "Gentoo"   
  • See the forcats package (automatically loaded with tidyverse)
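
Setting factor levels by hand can be sketched in base R as follows. The size vector is a made-up example, not a column of penguins:

```r
# Hypothetical character vector of categories
size <- c("small", "large", "medium", "small", "large")

# Default: levels are sorted alphabetically
f_default <- factor(size)
levels(f_default)

# Set the level order explicitly
f_manual <- factor(size, levels = c("small", "medium", "large"))
levels(f_manual)

# Change only the reference (first) level
f_ref <- relevel(f_default, ref = "small")
levels(f_ref)

# Declare an ordinal variable, which allows comparisons such as <
f_ordered <- factor(size, levels = c("small", "medium", "large"), ordered = TRUE)
f_ordered < "large"
```

The level order matters later because ggplot2 draws the categories of a factor in level order.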

Summarizing 1D categorical data

  • Setup: a single column of categorical data (i.e., 1D categorical data)
  • Frequency tables (counts) are the most common form of non-graphical EDA
table(penguins$species)

   Adelie Chinstrap    Gentoo 
      152        68       124 
  • Proportion tables
prop.table(table(penguins$species))

   Adelie Chinstrap    Gentoo 
0.4418605 0.1976744 0.3604651 
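
The same summaries can also be kept in a data frame, which is convenient for plotting later. A minimal sketch with dplyr, where the counts are copied from the frequency table above and the names species_counts, prop, and pct are chosen here for illustration:

```r
library(dplyr)

# Species counts copied from the frequency table above
species_counts <- data.frame(
  species = c("Adelie", "Chinstrap", "Gentoo"),
  n       = c(152, 68, 124)
)

# Add proportions and percentages, then sort from most to least common
species_summary <- species_counts |>
  mutate(prop = n / sum(n),
         pct  = round(100 * prop, 1)) |>
  arrange(desc(n))
```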

Visualizing 1D categorical data: Area plots

  • Each area corresponds to one categorical level

  • Area is proportional to counts/frequencies/percentages

  • Differences between areas correspond to differences between counts/frequencies/percentages

Bar charts

Create a bar chart with geom_bar()

  • Map species to the x-axis

  • Counts of each category are displayed on the y-axis

library(tidyverse)
theme_set(theme_light())
penguins |> 
  ggplot(aes(x = species)) +
  geom_bar()

Behind the scenes of geom_bar()

  • Start with the data

  • Aggregate and count the number of observations in each bar

  • Map to plot aesthetics
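
These steps can be reproduced by hand: count first, then draw the precomputed counts with geom_col(). This is a sketch on a made-up data frame (toy_penguins is hypothetical), not a description of ggplot2's internal code:

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: one row per penguin
toy_penguins <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap")
)

# Start with the data, then count the observations in each category
species_counts <- toy_penguins |>
  count(species)

# Map the precomputed counts to bar heights with geom_col()
p_col <- species_counts |>
  ggplot(aes(x = species, y = n)) +
  geom_col()

# Same chart, letting geom_bar() do the counting
p_bar <- toy_penguins |>
  ggplot(aes(x = species)) +
  geom_bar()
```

Use geom_bar() when the data have one row per observation and geom_col() when the counts are already in a column.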

Flip your bar charts!

Simply replace x with y (Quang prefers this way)

penguins |>
  ggplot(aes(y = species)) +
  geom_bar()

Or use coord_flip()

penguins |> 
  ggplot(aes(x = species)) +
  geom_bar() +
  coord_flip()

Crimes against bar charts

Statistical inference for 1D categorical data

  • Chi-squared test for 1D categorical data

  • Null hypothesis (\(H_0\)): all categories have equal proportions (i.e., \(p_\texttt{Adelie} = p_\texttt{Chinstrap} = p_\texttt{Gentoo}\))

chisq.test(table(penguins$species))

    Chi-squared test for given probabilities

data:  table(penguins$species)
X-squared = 31.907, df = 2, p-value = 1.179e-07
  • Since \(p\)-value \(< 0.05\) (or any other reasonable significance level), we reject the null hypothesis at \(\alpha = 0.05\)

  • We have strong evidence that the three species do not all occur in equal proportions
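
To see where the test statistic comes from, it can be computed by hand. The sketch below hard-codes the species counts from the frequency table earlier in these notes; the variable names are chosen here for illustration:

```r
# Observed species counts, copied from the frequency table
observed <- c(Adelie = 152, Chinstrap = 68, Gentoo = 124)

n <- sum(observed)     # total number of penguins
k <- length(observed)  # number of categories

# Under H0, every category has the same expected count n / k
expected <- rep(n / k, k)

# Chi-squared statistic: sum of (observed - expected)^2 / expected
x_squared <- sum((observed - expected)^2 / expected)

# Upper-tail p-value from a chi-squared distribution with k - 1 degrees of freedom
p_value <- pchisq(x_squared, df = k - 1, lower.tail = FALSE)

# Built-in test for comparison
test_result <- chisq.test(observed)
```

Here x_squared and p_value should agree with the X-squared and p-value reported by chisq.test() above.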

Spine charts

A single bar, with height/width divided into different categories

  • Height is proportional to counts (proportions)
penguins |> 
  ggplot(aes(fill = species, x = "")) +
  geom_bar()

  • Width is proportional to counts (proportions)
penguins |> 
  ggplot(aes(fill = species, y = "")) +
  geom_bar()

Pie charts

  • Circle is divided up into (pie) slices

  • One slice for each category

  • \(\text{Area}_{\text{ total}}= \pi r^2\)

  • \(\displaystyle \text{Area}_{\text{ slice}}= \frac{\pi r^2 \theta}{360^\circ}\)

  • Angle \(\theta\) is proportional to counts (proportions)

  • What about radius?
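
One way to explore the radius question is to plug numbers into the slice-area formula. The sketch below assumes the species counts from the frequency table earlier in these notes; slice_area() is a helper defined here for illustration:

```r
# Observed species counts, copied from the frequency table
observed <- c(Adelie = 152, Chinstrap = 68, Gentoo = 124)

# Slice angle in degrees: proportion of the total times 360
theta <- 360 * observed / sum(observed)

# Slice area for a given radius; r is the same for every slice in a pie
slice_area <- function(theta, r) pi * r^2 * theta / 360

# Each slice's share of the total pie area, for two different radii
share_r1 <- slice_area(theta, r = 1) / (pi * 1^2)
share_r5 <- slice_area(theta, r = 5) / (pi * 5^2)
```

Both share_r1 and share_r5 equal the species proportions, which suggests that the radius changes the overall size of the pie but not the comparison between slices.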

penguins |> 
  ggplot(aes(fill = species, x = "")) +
  geom_bar() +
  coord_polar(theta = "y")

Friends Don’t Let Friends Make Pie Charts

Ugh…

:)

Rose diagrams

  • Circle sections for each category (like pie charts)

  • All sections (“petals”) have the same width/arc/angle

  • Radius of each section is proportional to category frequency

  • Made popular by Florence Nightingale

penguins |> 
  ggplot(aes(x = species)) +
  geom_bar(fill = "midnightblue") +
  coord_polar()

Florence Nightingale’s rose diagram

Tips for visualizing 1D categorical data

  • You should pretty much always just make a bar chart

  • Spine charts will be more useful with more variables

  • Only use circular plots if circular/temporal context has actual meaning

Waffle charts are cooler anyway…

# install.packages("remotes")
# remotes::install_github("hrbrmstr/waffle")
library(waffle)
penguins |>
  count(species) |> 
  ggplot(aes(fill = species, values = n)) +
  geom_waffle(n_rows = 20, color = "white") +
  coord_equal() +
  theme_void()

Graphic critique (if time permits)

https://www.nytimes.com/interactive/2024/05/13/climate/home-insurance-profit-us-states-weather.html