Institution reports

In this section, we present a potential future use for the EADA data that is well-suited for a cloud computing environment, has high potential value for academic institutions, and is remarkably efficient to execute.

Concept

The institution-level structure of the EADA files is ideal for generating standardized, institution-specific reports that provide a snapshot of the sports equity landscape at the institution. Because multiple years of data are available, it is also highly feasible to generate cross-year institution comparison reports, either for a particular institution or for the entire institution universe. Because each file contains 1,800+ institutions and reports are not dependent on one another, this is an ideal task to parallelize in a cloud environment.

Report production walk-through

In this section, we present a tutorial on how to prepare year-specific snapshot files for a set of institutions using R. For the purpose of this tutorial, we deployed these steps locally on a small subset of institutions. However, in the final section, we will discuss thoughts for scaling this task up for execution in the cloud.

1. Data structure

Below, we illustrate the structure of the EADA data for five institutions from the 2022-23 EADA file (with the first 30 variables).

unitid	OPEID	institution_name	addr1_txt	addr2_txt	city_txt	state_cd	zip_text	ClassificationCode	classification_name	ClassificationOther	EFMaleCount	EFFemaleCount	EFTotalCount	sector_cd	sector_name	STUDENTAID_MEN	STUDENTAID_WOMEN	STUDENTAID_COED	STUDENTAID_TOTAL	STUAID_MEN_RATIO	STUAID_WOMEN_RATIO	RECRUITEXP_MEN	RECRUITEXP_WOMEN	RECRUITEXP_COED	RECRUITEXP_TOTAL	HDCOACH_SALARY_MEN	HDCOACH_SALARY_WOMEN	HDCOACH_SALARY_COED
131496	00144500	Georgetown University	37th and O St NW	Georgetown University	Washington	DC	200570001	2	NCAA Division I-FCS	NA	2995	4153	7148	2	Private nonprofit, 4-year or above	5559743	5871423	2478	11433644	49	51	997498	153634	7467	1158599	580565	103748	109326
170976	00232500	University of Michigan-Ann Arbor	1109 Geddes Ave	NA	Ann Arbor	MI	48109	1	NCAA Division I-FBS	NA	14705	16443	31148	1	Public, 4-year or above	17929935	13947339	NA	31877274	56	44	3876973	940960	NA	4817933	1086369	258780	NA
198419	00292000	Duke University	103 Allen Bldg	NA	Durham	NC	27708	1	NCAA Division I-FBS	NA	3114	3429	6543	2	Private nonprofit, 4-year or above	14309544	13291616	NA	27601160	52	48	2459555	574848	NA	3034403	687176	209582	NA
199120	00297400	University of North Carolina at Chapel Hill	103 South Bldg Cb 9100	NA	Chapel Hill	NC	27599	1	NCAA Division I-FBS	NA	7532	11630	19162	1	Public, 4-year or above	8398938	6738951	NA	15137889	55	45	2155890	848579	NA	3004469	747659	255997	NA
218070	00343400	Furman University	3300 Poinsett Highway	NA	Greenville	SC	296136162	2	NCAA Division I-FCS	NA	928	1331	2259	2	Private nonprofit, 4-year or above	7852046	4732379	NA	12584425	62	38	263225	141196	NA	404421	196422	99950	NA

2. Report template

Next, we identify key statistics we want each report to contain. We will present:

A table with the total number of sports offered, by gender
A table with the number of students participating in each sport, by gender
A figure with average full−time coaching salary, by team gender and coach type (head vs. assistant)

In the following sections, we will discuss key code chunks from our reports_template.qmd file. This is the script that provides the template for our PDF reports. The full script is available on our GitHub repository here.

a. YAML header

Importantly, the document’s YAML header should include the parameters that vary during report production. In this case, these are:

Institution name
School year (spring)

We initialize our default parameter values as Georgetown University for 2022-23.

---
title: "Annual institutional sports equity report"
format: pdf
theme: litera
toc: FALSE
fontsize: 12pt
embed-resources: TRUE
page-layout: full
params:
  inst: Georgetown University
  year: 2023
geometry:
      - top=2cm
---

b. Using parameters

Next, we use our parameter values to import the proper year of data and subset the data to our institution of interest.

Code

```{r}
#| eval: false
#| code-fold: true

library(tidyverse)
library(lubridate)
library(assertr)
library(quarto)
library(readxl)
library(glue)
library(kableExtra)
library(janitor)

root <- getwd()
```

```{r}
#| eval: false

# place parameters into variables
curr_year <- params$year
inst      <- params$inst

# import year-specific data and subset to specific institution
dta <- read_excel(file.path(root,glue("EADA_{curr_year}.xlsx")), sheet = 1) %>%
  filter(institution_name == inst) %>%
  verify(nrow(.) == 1)

# identify city and state and institution
location <- paste0(dta %>% pull(city_txt),", ", dta %>% pull(state_cd))

# format academic year
sy <- paste0(params$year - 1,"-",params$year)
```

c. Displaying basic institution information

We then employ simple in-line code with markdown formatting to display basic institutional information at the top of the report. We can capitalize on available variables such as NCAA classification (classification_name) to easily display relevant information about the institution.

**Year:** `r sy`

**Institution:** `r dta %>% pull(institution_name)`

**Location:** `r location`

**Institution type:** `r dta %>% pull(sector_name)`

**NCAA classification:** `r dta %>% pull(classification_name)`

**Total student body size:** `r format(dta %>% pull(EFTotalCount), big.mark = ",")`

d. Tables

Next, we create our two tables using tidyverse and kableExtra. This is simple data cleaning and table preparation code that is not specific to the parameterized report framework. Thus, the code is available below but hidden by default.

Code

```{r}
#| eval: false
#| code-fold: true

all_sports <- dta %>%
  select(starts_with(c("PARTIC_WOMEN","PARTIC_MEN"))) %>%
  pivot_longer(cols = everything(), 
               names_to = c("gender", "Sport"), 
               values_to = "participants",
               names_pattern = "PARTIC_(\\w+)_(.*)") %>%
  filter(!is.na(participants) & participants > 0) %>%
  mutate(gender = str_to_title(gender),
         Sport  = case_when(Sport == "BchVoll"    ~ "Beach Volleyball",
                            Sport == "Bskball"    ~ "Basketball",
                            Sport == "Eqstrian"   ~ "Equestrian",
                            Sport == "FldHcky"    ~ "Field Hockey",
                            Sport == "Gymn"       ~ "Gymnastics",
                            Sport == "IceHcky"    ~ "Ice Hockey",
                            Sport == "Lacrsse"    ~ "Lacrosse",
                            Sport == "OthSpts"    ~ "other Sports",
                            Sport == "SwimDivng"  ~ "Swimming \\& Diving",
                            Sport == "SynSwim"    ~ "Synchronized Swimming",
                            Sport == "TblTennis"  ~ "Table Tennis",
                            Sport == "Trckcomb"   ~ "Track \\& Field (Combined)",
                            Sport == "TrkFldIn"   ~ "Indoor Track \\& Field",
                            Sport == "TrkFldOut"  ~ "Outdoor Track \\& Field",
                            Sport == "Vollball"   ~ "Volleyball",
                            Sport == "WaterPolo"  ~ "Water Polo",
                            Sport == "WgtLift"    ~ "Weightlifting",
                            Sport == "XCountry"   ~ "Cross country",
                            TRUE                  ~ Sport))

total_sports <- all_sports %>%
  group_by(gender) %>%
  summarise(n_sports = n()) %>%
  pivot_wider(names_from = gender, values_from = n_sports) %>%
  ungroup() %>%
  select(Women,Men)

sports_by_gender <- all_sports %>%
  pivot_wider(names_from = gender, values_from = participants) %>%
  arrange(Sport)

total_sports %>%
  kable(escape = F, booktabs = T, format = "latex", align = "c", caption = "Number of total sports offered, by gender") %>%
  kable_styling(latex_options = c("HOLD_position","repeat_header"), full_width = TRUE) 

sports_by_gender %>%
  adorn_totals() %>%
  mutate(across(where(is.numeric), ~as.character(.x))) %>%
  mutate(across(everything(), ~ifelse(is.na(.x), "\u2013",.x))) %>%
  kable(escape = F, booktabs = T, format = "latex", align = "c", caption = "Number of students participating in sports, by gender") %>%
  kable_styling(latex_options = c("HOLD_position","repeat_header"), full_width = TRUE) %>%
  row_spec(nrow(sports_by_gender), hline_after = T, extra_latex_after = "%") %>%
  row_spec(nrow(sports_by_gender)+1,bold=T) 
```

Table 1

Table 2

e. Figure

We create our figure using ggplot2. Again, this is simple data reshaping and figure preparation code that is not specific to the parameterized report framework. Thus, the code is available below but hidden by default.

Code

```{r}
#| eval: false
#| code-fold: true

plot_dta <- dta %>%
  select(starts_with(c("STUDENTAID","RECRUITEXP","ASCOACH_SAL_FTE","HDCOACH_SAL_FTE")) & !ends_with(c("COED","TOTAL"))) %>%
  pivot_longer(
    cols = everything(),
    names_to = c("type","gender"),
    values_to = "expense",
    names_pattern = "(STUDENTAID|RECRUITEXP|HDCOACH_SALARY|ASCOACH_SAL_FTE|HDCOACH_SAL_FTE)_(.*)"
  ) 

coach_sal <- plot_dta %>%
  filter(grepl("SAL_FTE",type, fixed = TRUE)) %>%
  mutate(type = case_when(grepl("ASCOACH",type,fixed = TRUE) ~ "Assistant coach",
                          TRUE                               ~ "Head coach"),
         gender = case_when(gender == "WOMN" ~ "Women's teams",
                            TRUE             ~ "Men's teams")) %>%
  mutate(type = factor(type, levels = c("Head coach","Assistant coach")))
  
ggplot(data = coach_sal, aes(x = type, y = expense, fill = gender)) +
  geom_bar(stat = "identity", position = "dodge", color = "black") +
  geom_text(aes(label = scales::dollar(expense)), position = position_dodge(width = 0.9), vjust = -1) +
  theme_minimal() +
  labs(x = "",
       y = "Average full-time salary") +
  ggtitle("Figure 1: Average full-time coaching salary, \nby team gender and coach type") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.title = element_blank(),
        legend.key = element_blank(),
        legend.background = element_rect(color = NA),
        legend.position="bottom") +
  scale_fill_brewer(palette = "Accent") +
  scale_y_continuous(labels = scales::dollar_format(prefix="$"), expand = c(0.2,0))
```

f. Footnotes

Last, we’ll use our year-specific parameter and inline code to add the citation to identify where we obtained the data for these reports.

`r glue("U.S. Department of Education, Office of Postsecondary Education, Equity in Athletics Disclosure Act (EADA) survey, {sy}.")`

3. Running the reports

In the following section, we present the run_reports.R file to execute our reports. This script is also available on our GitHub repository here.

In the inst vector, we specify the institutions for which we want to run reports. In the year vector, we identify the year of data to use to prepare the reports. The remaining code tells R to iterate over these vectors and for each institution-year combination call our report_template.qmd, substitute the template parameters with the respective values, and export a PDF.

Note: We obtained the basis for the code below from R for the Rest of Us (Keyes, 2024).

```{r}
#| eval: false

library(quarto)
library(tidyverse)

inst <- c("Georgetown University",
          "Duke University",
          "Furman University",
          "University of Michigan-Ann Arbor",
          "University of North Carolina at Chapel Hill")
year <- c(2023)

reports <-
  tibble(
    input = "report_template.qmd",
    output_file = str_glue("{inst} {year} annual sports equity report.pdf"),
    execute_params = map2(inst, year, ~ list(inst = .x, year = .y))
  )

pwalk(reports, quarto_render)
```

4. Reports

We now have standardized reports for our 5 institutions - scroll through each report below!

Georgetown University

University of North Carolina at Chapel Hill

Duke University

Furman University

University of Michigan-Ann Arbor

Scaling up

In a cloud environment with distributed computing, we could imagine many use cases for this reporting pipeline. For example, we could efficiently:

Produce these standardized reports for all institutions for all available years
Create a much more comprehensive report template by including a far greater number of tables and figures
Capitalize on two or more years of data for the same institution to create cross-time reports
Utilize the results of the K-means clustering analysis to identify similar institutions and create a unified PDF report with the snapshot for each institution

The opportunities are endless!