---
title: "Homework 2"
subtitle: "36-315: Statistical Graphics and Visualization, Summer 2026"
author: "YOUR NAME HERE"
toc: true
fontsize: 10pt
geometry: margin=0.9in
format:
  pdf:
    colorlinks: true
execute:
  warning: false
  message: false
---

\newpage

# Problem 1: Statistical tests for categorical data (26 points)

For this problem, we will use the same Titanic dataset from Homework 1.

```{r}
library(tidyverse)
titanic <- read_csv("https://raw.githubusercontent.com/qntkhvn/36-315-summer26/refs/heads/master/data/titanic.csv")
```

## A (5pts) 

First, create a graph that shows the marginal distribution of the `Pclass` (cabin class) variable. Be sure the graph is properly titled and labeled, and choose a non-default color.

```{r}
# YOUR CODE HERE
```

Next, describe the marginal distribution of `Pclass` in 1--2 sentences. Be sure to mention the actual meaning of the values "1", "2", and "3". Figure out what these categories mean using the data dictionary [here](https://www.kaggle.com/c/titanic/data).

**YOUR ANSWER HERE**

## B (5pts) 

Professor Ph.Dizzle looks at your graph in Part A and says: "The passengers were equally likely to belong to any of the three classes." You obviously disagree. You tell Professor Ph.Dizzle to just look at the graph, but they are still unconvinced (and appear to be getting angry), arguing that what you see in the graph could just be due to random noise and may not reflect a statistically significant difference.

From the graph alone, are you able to definitively state whether or not there is a statistically significant difference among the proportions of classes? State yes or no, and then provide a one-sentence explanation for your answer.

**YOUR ANSWER HERE**

## C (8pts) 

To provide more explicit evidence of statistical significance, what kind of statistical test could be used to show that the passengers are not equally likely to belong to any of the three classes?

**YOUR ANSWER HERE**

Next, write code to perform the statistical test.

```{r}
# YOUR CODE HERE
```

Interpret the test results. In particular, state the p-value and the formal conclusion (in context) in 1--3 sentences.

**YOUR ANSWER HERE**

## D (8pts) 

Using the code from Part A, make the following two changes:

* Annotate the p-value from Part C somewhere on the graph other than the title. If the p-value is extremely small, it is fine to put "Chi-squared test p-value approximately zero."

* Change the labels for `Pclass` to "Upper", "Middle", "Lower", which are the proper class names.
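The two changes above can be sketched on a toy example. The data frame `toy`, its counts, and the labels "First", "Second", "Third" are made up for illustration and are not the Titanic data; the sketch assumes `annotate()` and `scale_x_discrete()` are the tools you want, though other approaches also work.

```r
library(ggplot2)

# Made-up counts for three hypothetical groups (NOT the Titanic data)
toy <- data.frame(group = c("1", "2", "3"), count = c(30, 50, 20))

# Chi-squared goodness-of-fit test; by default the null is equal proportions
toy_p <- chisq.test(toy$count)$p.value

toy_plot <- ggplot(toy, aes(x = group, y = count)) +
  geom_col(fill = "steelblue") +
  # relabel the categories shown on the axis without changing the data
  scale_x_discrete(labels = c("1" = "First", "2" = "Second", "3" = "Third")) +
  # place the p-value as text inside the plotting area
  annotate("text", x = "3", y = 48,
           label = paste("p-value =", signif(toy_p, 2))) +
  labs(title = "Toy counts by group", x = "Group", y = "Count")

toy_plot
```

The text position (`x = "3"`, `y = 48`) is chosen by eye for this toy plot; you would pick coordinates that do not overlap the bars in your own graph.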

```{r}
# YOUR CODE HERE
```

\newpage

# Problem 2: Visualizing uncertainty (49 points)

In this problem, we will work with a dataset of cereals manufactured in the United States. For a description of the variables, see [here](https://cran.r-project.org/web/packages/MASS/refman/MASS.html#UScereal).

```{r}
cereal <- read_csv("https://raw.githubusercontent.com/qntkhvn/36-315-summer26/refs/heads/master/data/cereal.csv")
```

## A (5pts) 

First, produce a frequency table for the cereal manufacturers (`mfr`).

```{r}
# YOUR CODE HERE
```

Which manufacturers appear most and least frequently in the data? Be sure to mention the full names of the actual manufacturers, rather than just the initials.

**YOUR ANSWER HERE**

## B (5pts) 

Create a frequency bar chart for `mfr`. Make the color of the bars something other than gray. Add appropriate labels and a title to your plot. Also, center the title by adding `theme(plot.title = element_text(hjust = 0.5))`.

```{r}
# YOUR CODE HERE
```

Next, summarize the plot in 1--2 sentences.

**YOUR ANSWER HERE**

## C (10pts) 

Now let's add confidence intervals to the plot in Part B.

Write a function called `get_prop_ci()` that computes the 95% confidence intervals for the *proportion* of each manufacturer. The function should take in a table object `freq_table` produced by the `table()` function.

```{r}
freq_table <- table(cereal$mfr)
```

Fill in the template code below for `get_prop_ci()` to do the following:

* Compute the sample size

* Compute the proportions of each category of `mfr`

* Compute the standard error of the proportions
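As a reminder, and assuming the usual Normal-approximation (Wald) interval that the `qnorm()`-based template is built around, the quantities in the steps above are typically written as follows, where $n_j$ is the count in category $j$ and $n$ is the total sample size. Check this against the lecture notes before relying on it.

$$
\hat{p}_j = \frac{n_j}{n}, \qquad
\widehat{\mathrm{SE}}(\hat{p}_j) = \sqrt{\frac{\hat{p}_j \, (1 - \hat{p}_j)}{n}}, \qquad
\hat{p}_j \pm z_{1 - \alpha/2} \, \widehat{\mathrm{SE}}(\hat{p}_j)
$$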

```{r}
get_prop_ci <- function(freq_table, alpha = 0.05) {
  # compute sample size
  # n <-  
  # compute proportions
  # prop <- 
  # compute standard error for proportions
  # se <-  
  # compute confidence intervals
  # ci_lower <- prop - qnorm(1 - alpha / 2) * se
  # ci_upper <- prop + qnorm(1 - alpha / 2) * se
  # ci <- rbind(ci_lower, ci_upper)
  # return(ci)
}
```

Finally, uncomment the following line of code to display confidence intervals.

```{r}
# get_prop_ci(freq_table)
```

Is there anything non-intuitive about any of the confidence intervals above? State yes or no, and give a 1--2 sentence explanation.

**YOUR ANSWER HERE**

## D (3pts) 

Based on Part C, write one line of code that displays the 95% confidence intervals for the **frequency** of each `mfr`.

(**Hint**: The proportion for any category $j$ is defined as $p_j = n_j / n$, where $n_j$ is the number of observations in category $j$ and $n$ is the total sample size.)

```{r}
# YOUR CODE HERE
```

## E (10pts) 

Create a bar chart with the 95% confidence intervals, which ultimately displays four pieces of information for each category: name, frequency, lower and upper bounds for the 95% confidence interval.

First, create a new table that summarizes the data for the desired plot by filling the `?` below.

Note: Don't just hard code the name and frequency of each manufacturer. Figure out a way to use the code from earlier to define these variables. Because there are six manufacturers in the data, `cereal_ci` should only have six rows, one for each manufacturer.

(**Hint**: `mfr` and `frequency` can be obtained from Part A, and `lower` and `upper` can be obtained from Part D.)
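If it is unclear how to pull names, counts, and interval bounds out of R objects without hard coding, here is a minimal sketch on made-up inputs. The objects `toy_tab` and `toy_ci` and all their values are hypothetical; `toy_ci` only mimics the shape of a matrix built with `rbind()`, with one named row per bound.

```r
# Made-up categorical data (NOT the cereal data)
toy_tab <- table(c("a", "b", "a"))

toy_names  <- names(toy_tab)      # category labels as a character vector
toy_counts <- as.vector(toy_tab)  # counts as a plain integer vector

# A made-up 2-row matrix shaped like the output of rbind(ci_lower, ci_upper)
toy_ci <- rbind(ci_lower = c(a = 0.1, b = 0.2),
                ci_upper = c(a = 0.3, b = 0.4))

toy_lower <- toy_ci["ci_lower", ]  # one row, extracted by row name
toy_upper <- toy_ci["ci_upper", ]
```

Extracting rows by name rather than by position keeps the code readable and guards against mixing up the lower and upper bounds.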

```{r}
# store your answer to part D
# freq_ci <- ?

# cereal_ci <- data.frame(mfr = ?, # name of each manufacturer
#                         frequency = ?, # frequency for each manufacturer
#                         lower = ?, # lower bound of 95% CI
#                         upper = ?) # upper bound of 95% CI
```

Once `cereal_ci` is defined, modify the following code to produce the desired bar chart. Also, add color, appropriate labels, and a title to the bar chart.

```{r}
# cereal_ci |>
#   ggplot(aes(x = ?, y = ?)) +
#   geom_col() +
#   geom_errorbar(aes(xmin = ?, xmax = ?))
```

## F (6pts) 

To interpret the confidence intervals visualized in Part E, answer the following questions. For now, ignore issues of multiple testing. For each question, explain your reasoning in 1--3 sentences.

* Does it appear that there are any significant differences in frequency among manufacturers that *are not* General Mills or Kellogg's?

**YOUR ANSWER HERE**

* From which manufacturers does General Mills appear to differ significantly in frequency?

**YOUR ANSWER HERE**

## G (10pts) 

Suppose we want to make pairwise comparisons between one manufacturer and another. With six manufacturers, there are $\binom{6}{2} = 15$ total comparisons to be made from the bar chart above. If the 95% confidence intervals are used for all of these pairwise comparisons, the chance of making at least one Type I error is greater than 5%.

To address this issue, create a bar chart similar to Part E, but with confidence intervals that incorporate the Bonferroni correction for 15 comparisons. In other words, the graph here should be exactly the same as the graph in Part E, but with wider confidence intervals due to Bonferroni correction.

**Hint**: In general, $100(1-\alpha)\%$ confidence intervals are constructed using the Normal quantile $z_{1-\alpha/2}$. For 95% confidence intervals, $\alpha = 0.05$, so $z_{1 - \alpha/2} = z_{0.975}$ is used (computed with `qnorm(0.975)`). This is exactly what we did with `get_prop_ci()` in Part C, where `alpha = 0.05` by default. With the Bonferroni correction, $\alpha$ is adjusted to $0.05/k$, where $k$ is the number of comparisons. **For this part, use `get_prop_ci()` with the Bonferroni-adjusted $\alpha$ and create a new graph.**
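To see why the adjusted intervals are wider, the short sketch below compares the Normal quantile before and after the Bonferroni adjustment. The variable names are illustrative. The last line treats the 15 comparisons as independent, which is an assumption made only to give a rough sense of scale; the comparisons in this problem are not actually independent.

```r
k <- choose(6, 2)   # number of pairwise comparisons among 6 manufacturers
alpha <- 0.05

# Quantile used for ordinary 95% intervals (approximately 1.96)
z_unadjusted <- qnorm(1 - alpha / 2)

# Quantile after dividing alpha by the number of comparisons
alpha_bonf <- alpha / k
z_bonferroni <- qnorm(1 - alpha_bonf / 2)

# Rough chance of at least one Type I error across k comparisons,
# ASSUMING independence (illustration only)
fwer_if_independent <- 1 - (1 - alpha)^k

c(k = k, z_unadjusted = z_unadjusted, z_bonferroni = z_bonferroni,
  fwer_if_independent = fwer_if_independent)
```

Because `z_bonferroni` is larger than `z_unadjusted`, every interval built with the adjusted $\alpha$ extends further on both sides of the same point estimate.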

```{r}
# YOUR CODE HERE
```

Finally, based on the Bonferroni-adjusted confidence intervals, answer the same two questions from Part F.

**YOUR ANSWER HERE**

**YOUR ANSWER HERE**

\newpage

# Problem 3: Resisting the first order for categories (25 points)

For this problem, we'll work with the [`starwars`](https://dplyr.tidyverse.org/reference/starwars.html) dataset, which is included in the `dplyr` package.

The code below loads the data, then uses the [`unnest`](https://tidyr.tidyverse.org/reference/nest.html) function to make a new dataset `starwars_character_films` where each row corresponds to a character appearance in a film. Type `help(starwars)` to view more information about the `starwars` dataset.

```{r}
starwars_character_films <- starwars |>
  select(name, films) |>
  unnest(films)
```
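For readers unfamiliar with list-columns, here is a minimal sketch of what `unnest()` does, using a made-up tibble `toy` rather than the `starwars` data. Each element of the list-column `films` is expanded into its own row, and the other columns are repeated as needed.

```r
library(tibble)
library(tidyr)

# Made-up data: "films" is a list-column holding a vector per character
toy <- tibble(
  name  = c("Character A", "Character B"),
  films = list(c("Film X", "Film Y"), "Film Z")
)

# One row per (name, film) pair after unnesting
toy_long <- unnest(toy, films)
toy_long
```

Here `toy` has two rows but `toy_long` has three, because "Character A" appears in two films.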

We will focus on the `films` variable, which indicates the film a particular character appeared in. Specifically, we will demonstrate how to reorder categories on graphs in `R`. This is particularly useful when visualizing categorical variables, because the plotting order that `R` chooses by default may not be the best choice for your graphs.

## A (5pts)

First, make a bar chart that displays the frequency on the x-axis and `films` on the y-axis. Make sure the plot is properly titled and labeled, and choose a non-default color.

```{r}
# YOUR CODE HERE
```

What is the default plotting order for categorical ("character" or "factor") variables in `R`? (Look at the order of the categories in the bar chart, which starts from the bottom.)

**YOUR ANSWER HERE**

## B (5pts)

Read this [introduction](https://forcats.tidyverse.org/) to the `forcats` package, which was designed to work with categorical data in `R`. The cheatsheet on that page will be particularly useful.

* Which function in the `forcats` package can be used to reorder the levels of a factor in any order you want?

**YOUR ANSWER HERE**

* Which function in the `forcats` package can be used to reorder the categories from most frequent to least frequent?

**YOUR ANSWER HERE**

* Which combination of functions in the `forcats` package can be used to reorder the categories from least frequent to most frequent?

**YOUR ANSWER HERE**

## C (5pts) 

Recreate the plot in Part A, but order the films either from least to most frequent or from most to least frequent.

```{r}
# YOUR CODE HERE
```

## D (5pts)

Recreate the plot in Part A, but order the films based on episode order (The Phantom Menace, Attack of the Clones, ..., The Force Awakens). See the Star Wars [Wikipedia page](https://en.wikipedia.org/wiki/Star_Wars#The_Skywalker_saga) for the episode and release orders.

```{r}
# YOUR CODE HERE
```

## E (5pts) 

Recreate the plot from Part D, but rename the categories so that they use the following film abbreviations: I, II, III, IV, V, VI, VII. Again, see the Star Wars [Wikipedia page](https://en.wikipedia.org/wiki/Star_Wars#The_Skywalker_saga) for more information.

```{r}
# YOUR CODE HERE
```