Designing a data pipeline to analyze gender equity in college sports

Spring 2025 Massive Data Fundamentals final project

Institution reports

In this section, we present a potential future use for the EADA data that is well-suited for a cloud computing environment, has high potential value for academic institutions, and is remarkably efficient to execute.

Concept

The institution-level structure of the EADA files is ideal for generating standardized, institution-specific reports that provide a snapshot of the sports equity landscape at the institution. Because multiple years of data are available, it is also highly feasible to generate cross-year institution comparison reports, either for a particular institution or for the entire institution universe. Because each file contains 1,800+ institutions and reports are not dependent on one another, this is an ideal task to parallelize in a cloud environment.

Report production walk-through

In this section, we present a tutorial on how to prepare year-specific snapshot files for a set of institutions using R. For the purpose of this tutorial, we deployed these steps locally on a small subset of institutions. However, in the final section, we will discuss thoughts for scaling this task up for execution in the cloud.

1. Data structure

Below, we illustrate the structure of the EADA data for five institutions from the 2022-23 EADA file (with the first 30 variables).

unitid OPEID institution_name addr1_txt addr2_txt city_txt state_cd zip_text ClassificationCode classification_name ClassificationOther EFMaleCount EFFemaleCount EFTotalCount sector_cd sector_name STUDENTAID_MEN STUDENTAID_WOMEN STUDENTAID_COED STUDENTAID_TOTAL STUAID_MEN_RATIO STUAID_WOMEN_RATIO STUAID_COED_RATIO RECRUITEXP_MEN RECRUITEXP_WOMEN RECRUITEXP_COED RECRUITEXP_TOTAL HDCOACH_SALARY_MEN HDCOACH_SALARY_WOMEN HDCOACH_SALARY_COED
131496 00144500 Georgetown University 37th and O St NW Georgetown University Washington DC 200570001 2 NCAA Division I-FCS NA 2995 4153 7148 2 Private nonprofit, 4-year or above 5559743 5871423 2478 11433644 49 51 0 997498 153634 7467 1158599 580565 103748 109326
170976 00232500 University of Michigan-Ann Arbor 1109 Geddes Ave NA Ann Arbor MI 48109 1 NCAA Division I-FBS NA 14705 16443 31148 1 Public, 4-year or above 17929935 13947339 NA 31877274 56 44 0 3876973 940960 NA 4817933 1086369 258780 NA
198419 00292000 Duke University 103 Allen Bldg NA Durham NC 27708 1 NCAA Division I-FBS NA 3114 3429 6543 2 Private nonprofit, 4-year or above 14309544 13291616 NA 27601160 52 48 0 2459555 574848 NA 3034403 687176 209582 NA
199120 00297400 University of North Carolina at Chapel Hill 103 South Bldg Cb 9100 NA Chapel Hill NC 27599 1 NCAA Division I-FBS NA 7532 11630 19162 1 Public, 4-year or above 8398938 6738951 NA 15137889 55 45 0 2155890 848579 NA 3004469 747659 255997 NA
218070 00343400 Furman University 3300 Poinsett Highway NA Greenville SC 296136162 2 NCAA Division I-FCS NA 928 1331 2259 2 Private nonprofit, 4-year or above 7852046 4732379 NA 12584425 62 38 0 263225 141196 NA 404421 196422 99950 NA

2. Report template

Next, we identify key statistics we want each report to contain. We will present:

  • A table with the total number of sports offered, by gender

  • A table with the number of students participating in each sport, by gender

  • A figure with average full−time coaching salary, by team gender and coach type (head vs. assistant)

In the following sections, we will discuss key code chunks from our reports_template.qmd file. This is the script that provides the template for our PDF reports. The full script is available on our GitHub repository here.

a. YAML header

Importantly, the document’s YAML header should include the parameters that vary during report production. In this case, these are:

  • Institution name

  • School year (spring)

We initialize our default parameter values as Georgetown University for 2022-23.

---
title: "Annual institutional sports equity report"
format: pdf
theme: litera
toc: FALSE
fontsize: 12pt
embed-resources: TRUE
page-layout: full
params:
  inst: Georgetown University
  year: 2023
geometry:
      - top=2cm
---

b. Using parameters

Next, we use our parameter values to import the proper year of data and subset the data to our institution of interest.

Code
```{r}
#| eval: false
#| code-fold: true

library(tidyverse)
library(lubridate)
library(assertr)
library(quarto)
library(readxl)
library(glue)
library(kableExtra)
library(janitor)

root <- getwd()
```
```{r}
#| eval: false

# place parameters into variables
curr_year <- params$year
inst      <- params$inst

# import year-specific data and subset to specific institution
dta <- read_excel(file.path(root,glue("EADA_{curr_year}.xlsx")), sheet = 1) %>%
  filter(institution_name == inst) %>%
  verify(nrow(.) == 1)

# identify city and state and institution
location <- paste0(dta %>% pull(city_txt),", ", dta %>% pull(state_cd))

# format academic year
sy <- paste0(params$year - 1,"-",params$year)
```

c. Displaying basic institution information

We then employ simple in-line code with markdown formatting to display basic institutional information at the top of the report. We can capitalize on available variables such as NCAA classification (classification_name) to easily display relevant information about the institution.

**Year:** `r sy`

**Institution:** `r dta %>% pull(institution_name)`

**Location:** `r location`

**Institution type:** `r dta %>% pull(sector_name)`

**NCAA classification:** `r dta %>% pull(classification_name)`

**Total student body size:** `r format(dta %>% pull(EFTotalCount), big.mark = ",")`

d. Tables

Next, we create our two tables using tidyverse and kableExtra. This is simple data cleaning and table preparation code that is not specific to the parameterized report framework. Thus, the code is available below but hidden by default.

Code
```{r}
#| eval: false
#| code-fold: true

all_sports <- dta %>%
  select(starts_with(c("PARTIC_WOMEN","PARTIC_MEN"))) %>%
  pivot_longer(cols = everything(), 
               names_to = c("gender", "Sport"), 
               values_to = "participants",
               names_pattern = "PARTIC_(\\w+)_(.*)") %>%
  filter(!is.na(participants) & participants > 0) %>%
  mutate(gender = str_to_title(gender),
         Sport  = case_when(Sport == "BchVoll"    ~ "Beach Volleyball",
                            Sport == "Bskball"    ~ "Basketball",
                            Sport == "Eqstrian"   ~ "Equestrian",
                            Sport == "FldHcky"    ~ "Field Hockey",
                            Sport == "Gymn"       ~ "Gymnastics",
                            Sport == "IceHcky"    ~ "Ice Hockey",
                            Sport == "Lacrsse"    ~ "Lacrosse",
                            Sport == "OthSpts"    ~ "other Sports",
                            Sport == "SwimDivng"  ~ "Swimming \\& Diving",
                            Sport == "SynSwim"    ~ "Synchronized Swimming",
                            Sport == "TblTennis"  ~ "Table Tennis",
                            Sport == "Trckcomb"   ~ "Track \\& Field (Combined)",
                            Sport == "TrkFldIn"   ~ "Indoor Track \\& Field",
                            Sport == "TrkFldOut"  ~ "Outdoor Track \\& Field",
                            Sport == "Vollball"   ~ "Volleyball",
                            Sport == "WaterPolo"  ~ "Water Polo",
                            Sport == "WgtLift"    ~ "Weightlifting",
                            Sport == "XCountry"   ~ "Cross country",
                            TRUE                  ~ Sport))

total_sports <- all_sports %>%
  group_by(gender) %>%
  summarise(n_sports = n()) %>%
  pivot_wider(names_from = gender, values_from = n_sports) %>%
  ungroup() %>%
  select(Women,Men)

sports_by_gender <- all_sports %>%
  pivot_wider(names_from = gender, values_from = participants) %>%
  arrange(Sport)

total_sports %>%
  kable(escape = F, booktabs = T, format = "latex", align = "c", caption = "Number of total sports offered, by gender") %>%
  kable_styling(latex_options = c("HOLD_position","repeat_header"), full_width = TRUE) 

sports_by_gender %>%
  adorn_totals() %>%
  mutate(across(where(is.numeric), ~as.character(.x))) %>%
  mutate(across(everything(), ~ifelse(is.na(.x), "\u2013",.x))) %>%
  kable(escape = F, booktabs = T, format = "latex", align = "c", caption = "Number of students participating in sports, by gender") %>%
  kable_styling(latex_options = c("HOLD_position","repeat_header"), full_width = TRUE) %>%
  row_spec(nrow(sports_by_gender), hline_after = T, extra_latex_after = "%") %>%
  row_spec(nrow(sports_by_gender)+1,bold=T) 
```

Table 1

Table 2

e. Figure

We create our figure using ggplot2. Again, this is simple data reshaping and figure preparation code that is not specific to the parameterized report framework. Thus, the code is available below but hidden by default.

Code
```{r}
#| eval: false
#| code-fold: true

plot_dta <- dta %>%
  select(starts_with(c("STUDENTAID","RECRUITEXP","ASCOACH_SAL_FTE","HDCOACH_SAL_FTE")) & !ends_with(c("COED","TOTAL"))) %>%
  pivot_longer(
    cols = everything(),
    names_to = c("type","gender"),
    values_to = "expense",
    names_pattern = "(STUDENTAID|RECRUITEXP|HDCOACH_SALARY|ASCOACH_SAL_FTE|HDCOACH_SAL_FTE)_(.*)"
  ) 

coach_sal <- plot_dta %>%
  filter(grepl("SAL_FTE",type, fixed = TRUE)) %>%
  mutate(type = case_when(grepl("ASCOACH",type,fixed = TRUE) ~ "Assistant coach",
                          TRUE                               ~ "Head coach"),
         gender = case_when(gender == "WOMN" ~ "Women's teams",
                            TRUE             ~ "Men's teams")) %>%
  mutate(type = factor(type, levels = c("Head coach","Assistant coach")))
  
ggplot(data = coach_sal, aes(x = type, y = expense, fill = gender)) +
  geom_bar(stat = "identity", position = "dodge", color = "black") +
  geom_text(aes(label = scales::dollar(expense)), position = position_dodge(width = 0.9), vjust = -1) +
  theme_minimal() +
  labs(x = "",
       y = "Average full-time salary") +
  ggtitle("Figure 1: Average full-time coaching salary, \nby team gender and coach type") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.title = element_blank(),
        legend.key = element_blank(),
        legend.background = element_rect(color = NA),
        legend.position="bottom") +
  scale_fill_brewer(palette = "Accent") +
  scale_y_continuous(labels = scales::dollar_format(prefix="$"), expand = c(0.2,0))
```

f. Footnotes

Last, we’ll use our year-specific parameter and inline code to add the citation to identify where we obtained the data for these reports.

`r glue("U.S. Department of Education, Office of Postsecondary Education, Equity in Athletics Disclosure Act (EADA) survey, {sy}.")`

3. Running the reports

In the following section, we present the run_reports.R file to execute our reports. This script is also available on our GitHub repository here.

In the inst vector, we specify the institutions for which we want to run reports. In the year vector, we identify the year of data to use to prepare the reports. The remaining code tells R to iterate over these vectors and for each institution-year combination call our report_template.qmd, substitute the template parameters with the respective values, and export a PDF.

Note: We obtained the basis for the code below from R for the Rest of Us (Keyes, 2024).

```{r}
#| eval: false

library(quarto)
library(tidyverse)

inst <- c("Georgetown University",
          "Duke University",
          "Furman University",
          "University of Michigan-Ann Arbor",
          "University of North Carolina at Chapel Hill")
year <- c(2023)

reports <-
  tibble(
    input = "report_template.qmd",
    output_file = str_glue("{inst} {year} annual sports equity report.pdf"),
    execute_params = map2(inst, year, ~ list(inst = .x, year = .y))
  )

pwalk(reports, quarto_render)
```

4. Reports

We now have standardized reports for our 5 institutions - scroll through each report below!

Georgetown University

Download PDF file.

University of North Carolina at Chapel Hill

Download PDF file.

Duke University

Download PDF file.

Furman University

Download PDF file.

University of Michigan-Ann Arbor

Download PDF file.

Scaling up

In a cloud environment with distributed computing, we could imagine many use cases for this reporting pipeline. For example, we could efficiently:

  • Produce these standardized reports for all institutions for all available years

  • Create a much more comprehensive report template by including a far greater number of tables and figures

  • Capitalize on two or more years of data for the same institution to create cross-time reports

  • Utilize the results of the K-means clustering analysis to identify similar institutions and create a unified PDF report with the snapshot for each institution

The opportunities are endless!