vignettes/articles/multisheet_resources.Rmd
multisheet_resources.Rmd
Excel files (XLS and XLSX) are a common form of data on the City of Toronto Open Data Portal.
In cases where the file only contains one sheet, the resource is returned as a tibble. For example, this data set on TTC Ridership Analysis from 1985 to 2018:
library(opendatatoronto)
list_package_resources("https://open.toronto.ca/dataset/ttc-ridership-analysis/") %>%
get_resource()
#> # A tibble: 71 x 37
#> `TORONTO TRANSI… ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10 ...11
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 ANALYSIS OF RID… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 1985 TO 2019 AC… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 4 <NA> FARE… 2019 2018 2017 2016 2015… 2014 2013 2012 2011
#> 5 WHO ADULT <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 6 <NA> PRES… 1906 N/A N/A N/A N/A N/A N/A N/A N/A
#> 7 <NA> PRES… 829 N/A N/A N/A N/A N/A N/A N/A N/A
#> 8 <NA> PRES… 1668… 1109… 67829 27397 13323 9862 8194 4399 1139
#> 9 <NA> PRES… 4340 1496 N/A N/A N/A N/A N/A N/A N/A
#> 10 <NA> PRES… 9937 4442 N/A N/A N/A N/A N/A N/A N/A
#> # … with 61 more rows, and 26 more variables: ...12 <chr>, ...13 <chr>,
#> # ...14 <chr>, ...15 <chr>, ...16 <chr>, ...17 <chr>, ...18 <chr>,
#> # ...19 <chr>, ...20 <chr>, ...21 <chr>, ...22 <chr>, ...23 <chr>,
#> # ...24 <chr>, ...25 <chr>, ...26 <chr>, ...27 <chr>, ...28 <chr>,
#> # ...29 <chr>, ...30 <chr>, ...31 <chr>, ...32 <chr>, ...33 <chr>,
#> # ...34 <chr>, ...35 <chr>, ...36 <chr>, ...37 <chr>
When the file contains multiple sheets, the resource is returned as a named list, where the names are the names of the sheets, as in the dataset on Wellbeing Toronto Demographics:
library(dplyr)
wellbeing_toronto_demographics <- list_package_resources("https://open.toronto.ca/dataset/wellbeing-toronto-demographics/") %>%
filter(name == "wellbeing-toronto-demographics") %>%
get_resource()
str(wellbeing_toronto_demographics, max.level = 1)
#> List of 3
#> $ IndicatorMetaData : tibble [87 × 8] (S3: tbl_df/tbl/data.frame)
#> $ RawData-Ref Period 2008: tibble [141 × 85] (S3: tbl_df/tbl/data.frame)
#> $ RawData-Ref Period 2011: tibble [141 × 39] (S3: tbl_df/tbl/data.frame)
To access the relevant sheet, pull out the list element:
wellbeing_toronto_demographics[["IndicatorMetaData"]]
#> # A tibble: 87 x 8
#> PROVENANCE SHORT_NAME LONG_NAME DESCRIPTION URL CURRENCY
#> <chr> <chr> <chr> <chr> <chr> <dttm>
#> 1 Statistic… Total Pop… Total Po… For Refere… http… 2011-04-25 00:00:00
#> 2 Statistic… Pop - Mal… Total Po… For Refere… http… 2011-04-25 00:00:00
#> 3 Statistic… Pop - Fem… Total Po… For Refere… http… 2011-04-25 00:00:00
#> 4 Statistic… Pop 0 - 4… Total Po… For Refere… http… 2011-04-25 00:00:00
#> 5 Statistic… Pop 5 - 9… Total Po… For Refere… http… 2011-04-25 00:00:00
#> 6 Statistic… Pop 6-12 … Total Po… For Refere… http… 2006-05-01 00:00:00
#> 7 Statistic… Pop 10 - … Total Po… For Refere… http… 2011-04-25 00:00:00
#> 8 Statistic… Pop 15 -1… Total Po… For Refere… http… 2011-04-25 00:00:00
#> 9 Statistic… Pop 20 - … Total Po… For Refere… http… 2011-04-25 00:00:00
#> 10 Statistic… Pop 25 -… Total Po… For Refere… http… 2011-04-25 00:00:00
#> # … with 77 more rows, and 2 more variables: DATE_UPDATED <dttm>, DOMAIN <chr>
There are also cases where the file contains multiple sheets and it would be helpful to have them all together as a single data set. For example, the 2019 TTC Bus Delay Data:
ttc_bus_delays_2019 <- search_packages("TTC Bus Delay Data") %>%
list_package_resources() %>%
filter(name == "ttc-bus-delay-data-2019") %>%
get_resource()
The result of is a list with an element for every month of data, each of which is a tibble:
str(ttc_bus_delays_2019, max.level = 1)
#> List of 12
#> $ Jan 2019 : tibble [6,743 × 10] (S3: tbl_df/tbl/data.frame)
#> $ Feb 2019 : tibble [6,958 × 10] (S3: tbl_df/tbl/data.frame)
#> $ Mar 2019 : tibble [5,712 × 10] (S3: tbl_df/tbl/data.frame)
#> $ Apr 2019 : tibble [5,144 × 11] (S3: tbl_df/tbl/data.frame)
#> $ May 2019 : tibble [5,023 × 10] (S3: tbl_df/tbl/data.frame)
#> $ June 2019 : tibble [5,232 × 10] (S3: tbl_df/tbl/data.frame)
#> $ July 2019 : tibble [5,113 × 10] (S3: tbl_df/tbl/data.frame)
#> $ Aug, 2019 : tibble [4,354 × 10] (S3: tbl_df/tbl/data.frame)
#> $ Sept 2019 : tibble [3,894 × 10] (S3: tbl_df/tbl/data.frame)
#> $ Oct 2019 : tibble [4,283 × 10] (S3: tbl_df/tbl/data.frame)
#> $ Nov 2019 : tibble [5,436 × 10] (S3: tbl_df/tbl/data.frame)
#> $ Dec 2019 : tibble [4,484 × 10] (S3: tbl_df/tbl/data.frame)
Note that the data for for the element Apr 2019
has one more variable than the rest (11 versus 10):
sapply(ttc_bus_delays_2019, colnames)
#> $`Jan 2019`
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident" "Min Delay" "Min Gap" "Direction" "Vehicle"
#>
#> $`Feb 2019 `
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident" "Min Delay" "Min Gap" "Direction" "Vehicle"
#>
#> $`Mar 2019 `
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident" "Min Delay" "Min Gap" "Direction" "Vehicle"
#>
#> $`Apr 2019`
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident ID" "Incident" "Delay" "Gap" "Direction"
#> [11] "Vehicle"
#>
#> $`May 2019 `
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident" "Min Delay" "Min Gap" "Direction" "Vehicle"
#>
#> $`June 2019`
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident" "Delay" "Gap" "Direction" "Vehicle"
#>
#> $`July 2019`
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident" "Min Delay" "Min Gap" "Direction" "Vehicle"
#>
#> $`Aug, 2019 `
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident" "Min Delay" "Min Gap" "Direction" "Vehicle"
#>
#> $`Sept 2019`
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident" "Min Delay" "Min Gap" "Direction" "Vehicle"
#>
#> $`Oct 2019 `
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident" "Min Delay" "Min Gap" "Direction" "Vehicle"
#>
#> $`Nov 2019`
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident" "Delay" "Gap" "Direction" "Vehicle"
#>
#> $`Dec 2019`
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident" "Delay" "Gap" "Direction" "Vehicle"
It seems that the Apr 2019
data has gained a variable Incident ID
, and that the variables Min Gap
and Min Delay
, present in all the other months, have been renamed to Gap
and Delay
, respectively.
We can rename these two variables:
ttc_bus_delays_2019[["Apr 2019"]] <- ttc_bus_delays_2019[["Apr 2019"]] %>%
rename(`Min Gap` = Gap, `Min Delay` = Delay)
and combine all of the elements into a single tibble using dplyr::bind_rows()
:
ttc_bus_delays_2019_combined <- bind_rows(ttc_bus_delays_2019)
ttc_bus_delays_2019_combined
#> # A tibble: 62,376 x 13
#> `Report Date` Route Time Day Location Incident
#> <dttm> <dbl> <dttm> <chr> <chr> <chr>
#> 1 2019-01-01 00:00:00 39 1899-12-31 00:13:00 Tues… NECR Mechani…
#> 2 2019-01-01 00:00:00 111 1899-12-31 00:15:00 Tues… Eglingt… Mechani…
#> 3 2019-01-01 00:00:00 35 1899-12-31 00:18:00 Tues… Finch Mechani…
#> 4 2019-01-01 00:00:00 25 1899-12-31 00:30:00 Tues… Don Mil… Mechani…
#> 5 2019-01-01 00:00:00 36 1899-12-31 00:40:00 Tues… Humberw… Investi…
#> 6 2019-01-01 00:00:00 45 1899-12-31 00:51:00 Tues… Kipling… Utilize…
#> 7 2019-01-01 00:00:00 32 1899-12-31 01:55:00 Tues… Royal Y… Mechani…
#> 8 2019-01-01 00:00:00 53 1899-12-31 02:19:00 Tues… FSTN Utilize…
#> 9 2019-01-01 00:00:00 112 1899-12-31 02:33:00 Tues… Kipling… Mechani…
#> 10 2019-01-01 00:00:00 85 1899-12-31 02:57:00 Tues… DONS Mechani…
#> # … with 62,366 more rows, and 7 more variables: `Min Delay` <dbl>, `Min
#> # Gap` <dbl>, Direction <chr>, Vehicle <dbl>, `Incident ID` <dbl>,
#> # Delay <dbl>, Gap <dbl>
Unfortunately, it looks like the Time
variable got Excel™ed, and will need some data cleaning.
For interests sake, it appears that Incident ID
is a lookup ID for the type of incident – only present in the Apr 2019
data, but interesting nonetheless!
ttc_bus_delays_2019_combined %>%
filter(!is.na(`Incident ID`)) %>%
distinct(`Incident ID`, Incident)
#> # A tibble: 11 x 2
#> Incident `Incident ID`
#> <chr> <dbl>
#> 1 Investigation 5
#> 2 Mechanical 1
#> 3 Utilized Off Route 4
#> 4 Diversion 3
#> 5 <NA> 9
#> 6 General Delay 8
#> 7 Emergency Services 6
#> 8 Late Leaving Garage - Operator 10
#> 9 Late Leaving Garage - Mechanical 11
#> 10 Vision 7
#> 11 Late Leaving Garage - Management 12