Parsing Excel Files with R

How to parse out data into multiple excel files using R openxlsx. Convenient way to save time and automate a painfully monotonous task.
R
workflow automation
Excel
Published

January 4, 2024

Intro

I remember a moment back in my fledgling analyst days where I was assigned a Herculean task to separate a giant Excel workbook into multiple files for different clients. It took me several days to filter and copy/paste each data subset into its own workbook. And it brought me absolutely no joy. In the back of my mind I thought there has to be a better way to do this.

the thought of doing this assignment again manually

Yes - there IS a better way. But it wasn’t until much later in my career that I learned tools that could help me with worklow automation. If you know a little bit of R, you can 100% automate this task and do it in less than 10 minutes! I’ll walk you through a scenario in this post.

Scenario

Let’s pretend we’re an analyst working for a sports analytics firm. Your manager, Chris, has a list of all the latest NFL team rosters. Chris manages a team of experts that focus on player performance for each team within the NFL. He’s asked for your help to create a report for each expert listing all players for their respective team.

Here’s a quick preview of the data:

library(nflreadr)
library(kableExtra)
library(dplyr)

#load in data form nflreadr to see team rosters
roster<-nflreadr::load_rosters()|>
#subset data with dplyr to select specific columns
  dplyr::select(last_name, first_name, status, position, jersey_number, team)

#preview first 5 rows of data
kableExtra::kable(head(roster,5))
last_name first_name status position jersey_number team
Williams Bernard CUT OL 74 PHI
Peters Jason INA OL 70 SEA
Rodgers Aaron INA QB 8 NYJ
Prater Matt ACT K 5 ARI
Lewis Marcedes ACT TE 84 CHI

And here’s a look at how many team files we’d need…

teams = unique(roster$team)

teams
 [1] "PHI" "SEA" "NYJ" "ARI" "CHI" "TEN" "NYG" "CAR" "CLE" "ATL" "NE"  "BAL"
[13] "LA"  "LV"  "HOU" "IND" "NO"  "SF"  "DET" "MIA" "BUF" "PIT" "DAL" "KC" 
[25] "LAC" "CIN" "DEN" "TB"  "MIN" "JAX" "GB"  "WAS"

Creating 32 files by hand would be tedious. But luckily, R is going to do all the heavy lifting for us.

Writing Data to Excel

There are a few different libraries that can help us write data to Excel. In this tutorial, we’ll use openxlsx to create Excel workbooks, write data as Excel tables, and save them.

There’s a lot more to this powerful library that goes beyond this tutorial, for a full run-down of openxlsx’s functions, be sure to check out the CRAN package doc.

Let’s try it out with the roster data to familiarize ourselves with a few basic functions.

library(openxlsx)

#initialize a workbook
wb <- createWorkbook()

#add a worksheet in workbook
addWorksheet(wb, sheet = "Roster")

#write data with filters to workbook
writeData(wb, sheet="Roster", roster, withFilter = TRUE)

#save and write workbook 
saveWorkbook(wb, file = "roster.xlsx", overwrite = TRUE)

Alternatively, if we want to write and store data as an Excel table, openxlsx also has a writeDataTable function that works similarly to writeData.

Power of For Loops

Of course the main task at hand is not writing one Excel workbook, but creating a workbook per team. This is a good use case to use for loops to help us repeat the task for each data subset.

First, a quick example of how to filter data with dplyr to create our data subset by team. The example below

#filter roster to see players for Tampa Bay, TB
tb_roster <- roster|>filter(team=='TB')

#preview roster subset
kable(head(tb_roster,5))
last_name first_name status position jersey_number team
David Lavonte ACT LB 54 TB
Jensen Ryan RES OL 66 TB
Gholston William ACT DL 92 TB
Barrett Shaquil ACT LB 7 TB
Feiler Matt ACT OL 71 TB

Now, let’s combine what we’ve learned from creating workbooks and subsetting data into a for loop like this. Before running the code, we create a new directory (folder) to save all of our future rosters to. We aptly call this directly “team_rosters.” This is important to note for our workbook file path!

#for each team, aka t, in teams
for(t in teams){
  
  #filter data for team, bonus arrange by last name
  team_roster = roster|>filter(team==t)|>arrange(last_name)
  
  #create new workbook
  wb <- createWorkbook()
  
  #create a worksheet, Roster, in workbook wb
  addWorksheet(wb, "Roster")
  
  #write a data table to sheet Roster in wb
  writeDataTable(wb, sheet="Roster", team_roster, withFilter = TRUE, tableName="roster")
  
  #create a new workbook title, include path if applicable
  wb_file_path = paste("team_rosters/", tolower(t),"_roster.xlsx", sep='')
  
  #save file 
  saveWorkbook(wb, file = wb_file_path, overwrite = TRUE)
}

Once the code runs, we excitedly rush to open our directory, to find 32 glorious files - one for each team. We open one to inspect, and see the list of players in a sheet called “Rosters”, formatted in a basic Excel table.

With the job done, we send the files off to Chris to distribute to his team. We look at the clock, it’s time for lunch - we log off for a well deserved lunch break.

But There’s More

Chris is impresesed we got the files back to him so quickly. You come back from lunch and see an email in your inbox with his latest request.

Only after the fact, Chris realizes it would be more useful if each expert could see the list split out for active players vs. inactive players (he doesn’t trust the experts to use the table filters…their Excel skills pale in comparison to your prowess).

In the example below, we’ll make the fix to create TWO sheets per team excel file. One with active players, one with inactive players. To help differentiate the tables, we can also use the tableStyle argument in writeDataTable to change the table style.

for(t in teams){
  team_roster = roster|>filter(team==t)|>arrange(last_name)
  
  wb <- createWorkbook()
  
  #set up two worksheets in the workbook
  addWorksheet(wb, "Active")
  addWorksheet(wb, "Not Active")
  
  #store active players "ACT" as a table in the Active tab
  writeDataTable(wb, "Active", team_roster[team_roster$status=="ACT", ], withFilter = TRUE, tableName="active")
  
  #store iactive players (not equal to "ACT") as a table in the Active tab, change table style
  writeDataTable(wb, "Not Active", team_roster[team_roster$status!="ACT", ], withFilter = TRUE, tableName="inactive", tableStyle = "TableStyleLight10")
  
  
    wb_file_path = paste("team_rosters/", tolower(t),"_roster.xlsx", sep='')
  
  saveWorkbook(wb, file = wb_file_path, overwrite = TRUE)
}

The code runs in less than 10 seconds, and the 32 excel files resave to our director. We rush to inspect the first one, and fist pump triumphantly to see two worksheets with the roster now split into active and inactive players.

We save the files to the company OneDrive and email Chris the update an hour later (you decide you can’t make it look too easy…64 tables in less than a minute is record speed). All done with plenty of time to spare for your other projects!


Thank you for checking out my blog, let me know if you have questions!