SwimmeR

library(SwimmeR)
library(rvest)
library(dplyr)
library(ggplot2)
library(scales)

SwimmeR was developed to work with results from swimming competitions. Results are often shared as web pages (.html) or PDF documents, which are nice to read but make the underlying data difficult to access.

SwimmeR solves this problem by importing and cleaning .html and .pdf files containing swimming results and returning a tidy dataframe.

Importing is performed by read_results, which takes a file path (or URL) as file and, for .html sources only, a node.

In addition to this vignette I do a lot of demos on how to use SwimmeR at my blog Swimming + Data Science.

Reading PDF Results

SwimmeR includes Texas-Florida-Indiana.pdf, results from a tri-meet between the three schools. It can be read in as follows:

file_path <- system.file("extdata", "Texas-Florida-Indiana.pdf", package = "SwimmeR")

file_read <- read_results(file = file_path)
file_read[294:303]
#>  [1] "\nEvent 7 Women 100 Yard Breaststroke"                                                                                                   
#>  [2] "\n                         58.79 A"                                                                                                      
#>  [3] "\n                       1:01.84 B"                                                                                                      
#>  [4] "\n        Name                                  Age School                               Seed Time       Finals Time              Points"
#>  [5] "\n    1 Lilly King                                21 Indiana University                         NT               59.46   B"              
#>  [6] "\n               r:+0.70 27.62         59.46 (31.84)"                                                                                    
#>  [7] "\n    2 Olivia Anderson                           21 Texas, University of                       NT            1:01.88"                   
#>  [8] "\n               r:+0.74 29.13       1:01.88 (32.75)"                                                                                    
#>  [9] "\n    3 Noelle Peplowski                          18 Indiana University                         NT            1:02.02"                   
#> [10] "\n               r:+0.71 29.06       1:02.02 (32.96)"

Here we see a subsection of the meet - the top three finishers in the Women’s 100 Yard Breaststroke featuring Olympic gold medalist Lilly King.

The next step is to convert this data to a dataframe using swim_parse. Because swim_parse works on text strings it is very sensitive to typos and/or nonstandard naming conventions. “Texas-Florida-Indiana.pdf” has two examples of these potential problems.

The first is that Indiana University is sometimes entered as “Indiana  University”, with two spaces between Indiana and University. This is a problem because swim_parse interprets two spaces as a column separator, and so will not properly capture “Indiana  University” (two spaces) as a team name.

The second issue is that Texas and Florida are styled as “Texas, University of” and “Florida, University of”, which will cause swim_parse to interpret them as Lastname, Firstname.

Both of these issues can be fixed with the typo and replacement arguments to swim_parse. Elements of typo are replaced by the element of replacement with which they share an index, so all instances of the first element of typo are replaced by the first element of replacement, and so on. Not specifying typo or replacement will not produce an error, but might negatively impact the results. If your results look strange, or are missing values, look for typos related to those swims.
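The index pairing described above can be sketched in base R. This is only an illustration of the idea, not SwimmeR's internal code: the loop and the sample lines (adapted from the raw output shown earlier) are assumptions made for the example.

```r
# Illustrative sketch of index-paired replacement (not SwimmeR's own implementation)
raw_lines <- c(
  "    1 Lilly King         21 Indiana  University      NT      59.46",
  "    2 Olivia Anderson    21 Texas, University of     NT    1:01.88"
)

typo        <- c("Indiana  University", ", University of")
replacement <- c("Indiana University", "")

fixed_lines <- raw_lines
for (i in seq_along(typo)) {
  # fixed = TRUE treats typo[i] as a literal string rather than a regular expression
  fixed_lines <- gsub(typo[i], replacement[i], fixed_lines, fixed = TRUE)
}

fixed_lines
```

Because each element of typo is matched with the element of replacement at the same position, the two vectors should have the same length.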

There is a third argument to swim_parse, called avoid, which will be addressed in the section on reading in html results below.

df <-
  swim_parse(
    file = file_read,
    typo = c("Indiana  University", ", University of"),
    replacement = c("Indiana University", "")
  )

Here are those same Women’s 100 Breaststroke results, as a dataframe in tidy format:

df[100:102,]
#>                 Name Place Grade             School Prelims_Time Finals_Time
#> 100       Lilly King     1    21 Indiana University         <NA>       59.46
#> 101  Olivia Anderson     2    21              Texas         <NA>     1:01.88
#> 102 Noelle Peplowski     3    18 Indiana University         <NA>     1:02.02
#>     Points Exhibition DQ                       Event
#> 100   <NA>          0  0 Women 100 Yard Breaststroke
#> 101   <NA>          0  0 Women 100 Yard Breaststroke
#> 102   <NA>          0  0 Women 100 Yard Breaststroke

Please note that SwimmeR does not capture split times.

Reading HTML Results

Reading html results is very similar to reading pdf results, but a value must be passed to node specifying which CSS node read_results should search for results. Here results from the New York State 2003 Girls Championship meet are read in from the “pre” node.
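To make the idea of a CSS node concrete, here is a small sketch using rvest directly on a made-up page. The HTML string and its contents are invented for illustration, and this is not necessarily how read_results works internally.

```r
library(rvest)

# A tiny made-up page: the results sit inside a <pre> node (illustrative only)
page <- read_html(
  "<html><body><h1>Meet Results</h1><pre>Event 1 Girls 200 Yard Medley Relay\n  1 Example High School   1:50.00</pre></body></html>"
)

# Select only the "pre" node and pull out its text, ignoring the rest of the page
pre_text <- html_text(html_nodes(page, "pre"))

# Split the text into one string per line of results
result_lines <- strsplit(pre_text, "\n", fixed = TRUE)[[1]]
result_lines
```

Inspecting a results page in a browser is usually the quickest way to find out which node holds the results.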

url <- "http://www.nyhsswim.com/Results/Girls/2003/NYS/Single.htm"
url_read <- read_results(file = url, node = "pre")
url_read[587:598]
#>  [1] "\n==============================================================================="
#>  [2] "\nNY State Rcd: S 54.35  1990      Richelle Depold, Scotia"                       
#>  [3] "\n    Name                    Year School               Prelims     Finals"       
#>  [4] "\n==============================================================================="
#>  [5] "\nNYSPHSAA 2003 Federation Championship"                                          
#>  [6] "\nA - Final"                                                                      
#>  [7] "\n  1 Bridget O'Connor          12 1-Scarsdale            56.16      55.42"       
#>  [8] "\n      26.12   29.30"                                                            
#>  [9] "\n  2 Lauren Bonfe              12 5-Alfred-Almond        56.37      56.93"       
#> [10] "\n      26.18   30.75"                                                            
#> [11] "\n  3 Christa Narus             11 11-Ward Melville       58.67      57.94"       
#> [12] "\n      27.19   30.75"

Looking at the raw results above, line 2 is a header containing NY State Rcd:, which shows the New York State record. Lines of this type are a common feature of swimming results, but because they contain a recognizable swimming time without being a result per se, they can cause problems for swim_parse. Like typos, these will not cause an error, but might produce nonsense rows in the resulting dataframe. swim_parse handles strings that should not be included in results with the avoid argument. By default avoid contains many common formulations of these header items, stored as avoid_default. You can create your own list of strings and pass it to avoid, or add to avoid_default via avoid_new <- c(avoid_default, "your string here"). avoid should also include "r\\:" if your results have reaction times (avoid_default already includes "r\\:").
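The effect of avoid can be sketched in base R as dropping any line that matches one of the supplied strings. This is an assumption about the general approach rather than a copy of swim_parse's internals, and the variable names are made up for the example.

```r
# Illustrative sketch: drop lines matching any "avoid" string (not swim_parse itself)
raw_lines <- c(
  "NY State Rcd: S 54.35  1990      Richelle Depold, Scotia",
  "  1 Bridget O'Connor          12 1-Scarsdale            56.16      55.42",
  "  2 Lauren Bonfe              12 5-Alfred-Almond        56.37      56.93"
)

avoid <- c("NY State Rcd:", "r\\:")

# Combine the strings into one pattern; a line matching any of them is excluded
is_avoided   <- grepl(paste(avoid, collapse = "|"), raw_lines, perl = TRUE)
result_lines <- raw_lines[!is_avoided]
result_lines
```

Only the two swimmer lines remain; the record line, which contains a time but is not a result, is gone.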

df_1 <- swim_parse(file = url_read, avoid = c("NY State Rcd:"))
df_1[358:360,]
#>                 Name Place Grade           School Prelims_Time Finals_Time
#> 358 Bridget O'Connor     1    12      1-Scarsdale        56.16       55.42
#> 359     Lauren Bonfe     2    12  5-Alfred-Almond        56.37       56.93
#> 360    Christa Narus     3    11 11-Ward Melville        58.67       57.94
#>     Points Exhibition DQ                    Event
#> 358   <NA>          0  0 Girls 100 Yard Butterfly
#> 359   <NA>          0  0 Girls 100 Yard Butterfly
#> 360   <NA>          0  0 Girls 100 Yard Butterfly

Formatting Swimming Times

Once results are captured in R as tidy dataframes the real fun can begin - but there’s another problem. Times in swimming are recorded as minutes:seconds.hundredth. This is fine when a time is less than a minute, because 59.99 can be of class numeric in R, but times greater than or equal to a minute (e.g. 1:00.00) are stuck as class character. SwimmeR provides two functions, sec_format and mmss_format, to convert between times as seconds (for doing math) and times as minutes:seconds.hundredth (for swimming-specific display).
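The arithmetic behind such a conversion can be sketched in base R. The helpers to_seconds and to_swim_time below are hypothetical names written for this illustration; they are not SwimmeR functions and may differ from how sec_format and mmss_format are implemented.

```r
# Hypothetical helper: "m:ss.hh" or "ss.hh" (character) to seconds (numeric)
to_seconds <- function(x) {
  vapply(strsplit(x, ":", fixed = TRUE), function(parts) {
    parts <- as.numeric(parts)
    # Two pieces means minutes and seconds; one piece is already seconds
    if (length(parts) == 2) parts[1] * 60 + parts[2] else parts[1]
  }, numeric(1))
}

# Hypothetical helper: seconds (numeric) back to swimming-style character times
to_swim_time <- function(sec) {
  sec       <- round(sec, 2)
  minutes   <- sec %/% 60
  remainder <- sec - minutes * 60
  ifelse(minutes > 0,
         sprintf("%d:%05.2f", as.integer(minutes), remainder),
         sprintf("%.2f", sec))
}

to_seconds(c("59.46", "1:01.88"))
to_swim_time(c(59.46, 122.6))
```

Keeping the numeric seconds alongside the display string means math and plotting use the former while tables and labels use the latter.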

data(King200Breast)
King200Breast
#> # A tibble: 50 x 4
#>    Event      Year      Time    Date      
#>    <chr>      <chr>     <chr>   <date>    
#>  1 200 Breast Junior    2:02.60 2018-03-17
#>  2 200 Breast Senior    2:02.90 2019-03-23
#>  3 200 Breast Sophomore 2:03.18 2017-03-18
#>  4 200 Breast Freshman  2:03.59 2016-03-19
#>  5 200 Breast Senior    2:03.60 2018-11-17
#>  6 200 Breast Sophomore 2:04.03 2017-02-18
#>  7 200 Breast Junior    2:04.68 2018-02-17
#>  8 200 Breast Senior    2:05.14 2019-02-23
#>  9 200 Breast Junior    2:05.49 2018-03-17
#> 10 200 Breast Freshman  2:05.58 2016-02-20
#> # ... with 40 more rows

Included in SwimmeR is King200Breast, containing all of Lilly King’s 200 Breaststroke times from her NCAA career. Times are recorded as character values, in standard minutes:seconds.hundredth format. We can use sec_format to format them as seconds, and mmss_format to go back to minutes:seconds.hundredth. Both functions work well with the tidyverse packages.

King200Breast <- King200Breast %>% 
  mutate(Time_sec = sec_format(Time),
         Time_swim_2 = mmss_format(Time_sec))
King200Breast
#> # A tibble: 50 x 6
#>    Event      Year      Time    Date       Time_sec Time_swim_2
#>    <chr>      <chr>     <chr>   <date>        <dbl> <chr>      
#>  1 200 Breast Junior    2:02.60 2018-03-17     123. 2:02.60    
#>  2 200 Breast Senior    2:02.90 2019-03-23     123. 2:02.90    
#>  3 200 Breast Sophomore 2:03.18 2017-03-18     123. 2:03.18    
#>  4 200 Breast Freshman  2:03.59 2016-03-19     124. 2:03.59    
#>  5 200 Breast Senior    2:03.60 2018-11-17     124. 2:03.60    
#>  6 200 Breast Sophomore 2:04.03 2017-02-18     124. 2:04.03    
#>  7 200 Breast Junior    2:04.68 2018-02-17     125. 2:04.68    
#>  8 200 Breast Senior    2:05.14 2019-02-23     125. 2:05.14    
#>  9 200 Breast Junior    2:05.49 2018-03-17     125. 2:05.49    
#> 10 200 Breast Freshman  2:05.58 2016-02-20     126. 2:05.58    
#> # ... with 40 more rows

This is useful for comparing times or plotting:

King200Breast %>% 
  ggplot(aes(x = Date, y = Time_sec)) +
  geom_point() +
  scale_y_continuous(labels = scales::trans_format("identity", mmss_format)) +
  theme_classic() +
  labs(y= "Time",
       title = "Lilly King NCAA 200 Breaststroke")

Using get_mode to clean swimming data

Swim teams often have abbreviations. For example, Lilly King swam for Indiana University, and sometimes “Indiana University” was listed as her team name. Other times, though, the team might be listed as “IU” or “IUWSD”. James (Sulley) Sullivan swam (probably) for Monsters University, or MU. Regularizing these names is a useful part of cleaning data.

Name <- c(rep("Lilly King", 5), rep("James Sullivan", 3))
Team <- c(rep("IU", 2), "Indiana", "IUWSD", "Indiana University", rep("Monsters University", 2), "MU")
df <- data.frame(Name, Team, stringsAsFactors = FALSE)
df
#>             Name                Team
#> 1     Lilly King                  IU
#> 2     Lilly King                  IU
#> 3     Lilly King             Indiana
#> 4     Lilly King               IUWSD
#> 5     Lilly King  Indiana University
#> 6 James Sullivan Monsters University
#> 7 James Sullivan Monsters University
#> 8 James Sullivan                  MU

Lilly is listed under four different team names, but all of them refer to the same team. Similarly, Sulley is listed under two team names, but actually swam for only one team. Using get_mode to return the most frequently occurring team for each swimmer is easier than manually specifying every swimmer’s team.

df <- df %>% 
  group_by(Name) %>% 
  mutate(Team = get_mode(Team))
df
#> # A tibble: 8 x 2
#> # Groups:   Name [2]
#>   Name           Team               
#>   <chr>          <chr>              
#> 1 Lilly King     IU                 
#> 2 Lilly King     IU                 
#> 3 Lilly King     IU                 
#> 4 Lilly King     IU                 
#> 5 Lilly King     IU                 
#> 6 James Sullivan Monsters University
#> 7 James Sullivan Monsters University
#> 8 James Sullivan Monsters University
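The idea of a most frequently occurring value can be sketched in a few lines of base R. The function mode_of is a hypothetical stand-in written for this example; it is not get_mode, and its handling of ties (the first value encountered wins) is an assumption that may differ from get_mode.

```r
# Hypothetical stand-in for a "most frequent value" helper (not SwimmeR's get_mode)
mode_of <- function(x) {
  uniq <- unique(x)
  # Count how often each unique value occurs, then return the most common one
  uniq[which.max(tabulate(match(x, uniq)))]
}

mode_of(c("IU", "IU", "Indiana", "IUWSD", "Indiana University"))
mode_of(c("Monsters University", "Monsters University", "MU"))
```

Note that the mode is only the most common spelling, so “IU” wins for Lilly King even though “Indiana University” might be the preferred label.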

Drawing brackets

To aid in making single elimination brackets for tournaments and shoot-outs, SwimmeR has draw_bracket. Any number of teams between 5 and 64 can be used, with byes automatically assigned to higher seeds.
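The number of byes follows from standard single elimination arithmetic: the bracket is padded out to the next power of two. The sketch below assumes that the order of teams is the seed order and that draw_bracket follows this convention; neither is stated outright above.

```r
# Illustrative bye arithmetic for a single elimination bracket
teams <- c("red", "orange", "yellow", "green", "blue", "indigo", "violet")

n_teams      <- length(teams)
bracket_size <- 2^ceiling(log2(n_teams))  # next power of two at or above n_teams
n_byes       <- bracket_size - n_teams    # empty slots become first-round byes

# Assumption: teams are listed in seed order, so the top seeds receive the byes
bye_teams <- head(teams, n_byes)

n_byes
bye_teams
```

With seven teams the bracket has eight slots, so one team skips the first round.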

teams <- c("red", "orange", "yellow", "green", "blue", "indigo", "violet")
draw_bracket(teams = teams)

Now add the results of round two:

round_two <- c("red", "yellow", "blue", "indigo")
draw_bracket(teams = teams,
             round_two = round_two)

And round three:

round_three <- c("red", "blue")
draw_bracket(teams = teams,
             round_two = round_two,
             round_three = round_three)

And crown the champion:

champion <- "red"
draw_bracket(teams = teams,
             round_two = round_two,
             round_three = round_three,
             champion = champion)