vignettes/multisheet_resources.Rmd
multisheet_resources.Rmd
Excel files (XLS and XLSX) are a common form of data on the City of Toronto Open Data Portal.
In cases where the file only contains one sheet, the resource is returned as a tibble. For example, this data set on TTC Ridership Analysis from 1985 to 2018:
library(opendatatoronto)
list_package_resources("https://open.toronto.ca/dataset/ttc-ridership-analysis/") %>%
get_resource()
#> # A tibble: 64 x 36
#> `TORONTO TRANSI… ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10 ...11
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 ANALYSIS OF RID… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 1985 TO 2018 AC… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 4 <NA> FARE… 2018 2017 2016 2015… 2014 2013 2012 2011 2010
#> 5 WHO ADULT <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 6 <NA> TOKE… 46974 76106 1020… 1109… 1111… 1123… 1179… 1247… 1203…
#> 7 <NA> TICK… N/A N/A N/A N/A N/A N/A N/A N/A 1298
#> 8 <NA> TWO-… N/A N/A N/A N/A N/A N/A N/A N/A N/A
#> 9 <NA> PRES… 1168… 67829 27397 13323 9862 8194 4399 1139 0
#> 10 <NA> PRES… 1752 N/A N/A N/A N/A N/A N/A N/A N/A
#> # … with 54 more rows, and 25 more variables: ...12 <chr>, ...13 <chr>,
#> # ...14 <chr>, ...15 <chr>, ...16 <chr>, ...17 <chr>, ...18 <chr>,
#> # ...19 <chr>, ...20 <chr>, ...21 <chr>, ...22 <chr>, ...23 <chr>,
#> # ...24 <chr>, ...25 <chr>, ...26 <chr>, ...27 <chr>, ...28 <chr>,
#> # ...29 <chr>, ...30 <chr>, ...31 <chr>, ...32 <chr>, ...33 <chr>,
#> # ...34 <chr>, ...35 <chr>, ...36 <chr>
When the file contains multiple sheets, the resource is returned as a named list, where the names are the names of the sheets, as in the dataset on Wellbeing Toronto Demographics:
library(dplyr)
wellbeing_toronto_demographics <- list_package_resources("https://open.toronto.ca/dataset/wellbeing-toronto-demographics/") %>%
filter(name == "wellbeing-toronto-demographics") %>%
get_resource()
str(wellbeing_toronto_demographics, max.level = 1)
#> List of 3
#> $ IndicatorMetaData :Classes 'tbl_df', 'tbl' and 'data.frame': 87 obs. of 8 variables:
#> $ RawData-Ref Period 2008:Classes 'tbl_df', 'tbl' and 'data.frame': 141 obs. of 85 variables:
#> $ RawData-Ref Period 2011:Classes 'tbl_df', 'tbl' and 'data.frame': 141 obs. of 39 variables:
To access the relevant sheet, pull out the list element:
wellbeing_toronto_demographics[["IndicatorMetaData"]]
#> # A tibble: 87 x 8
#> PROVENANCE SHORT_NAME LONG_NAME DESCRIPTION URL CURRENCY
#> <chr> <chr> <chr> <chr> <chr> <dttm>
#> 1 Statistic… Total Pop… Total Po… For Refere… http… 2011-04-25 00:00:00
#> 2 Statistic… Pop - Mal… Total Po… For Refere… http… 2011-04-25 00:00:00
#> 3 Statistic… Pop - Fem… Total Po… For Refere… http… 2011-04-25 00:00:00
#> 4 Statistic… Pop 0 - 4… Total Po… For Refere… http… 2011-04-25 00:00:00
#> 5 Statistic… Pop 5 - 9… Total Po… For Refere… http… 2011-04-25 00:00:00
#> 6 Statistic… Pop 6-12 … Total Po… For Refere… http… 2006-05-01 00:00:00
#> 7 Statistic… Pop 10 - … Total Po… For Refere… http… 2011-04-25 00:00:00
#> 8 Statistic… Pop 15 -1… Total Po… For Refere… http… 2011-04-25 00:00:00
#> 9 Statistic… Pop 20 - … Total Po… For Refere… http… 2011-04-25 00:00:00
#> 10 Statistic… Pop 25 -… Total Po… For Refere… http… 2011-04-25 00:00:00
#> # … with 77 more rows, and 2 more variables: DATE_UPDATED <dttm>, DOMAIN <chr>
There are also cases where the file contains multiple sheets and it would be helpful to have them all together as a single data set. For example, the 2019 TTC Bus Delay Data:
ttc_bus_delays_2019 <- search_packages("TTC Bus Delay Data") %>%
list_package_resources() %>%
filter(name == "ttc-bus-delay-data-2019") %>%
get_resource()
The result of is a list with an element for every month of data, each of which is a tibble:
str(ttc_bus_delays_2019, max.level = 1)
#> List of 11
#> $ Jan 2019 :Classes 'tbl_df', 'tbl' and 'data.frame': 6743 obs. of 10 variables:
#> $ Feb 2019 :Classes 'tbl_df', 'tbl' and 'data.frame': 6958 obs. of 10 variables:
#> $ Mar 2019 :Classes 'tbl_df', 'tbl' and 'data.frame': 5712 obs. of 10 variables:
#> $ Apr 2019 :Classes 'tbl_df', 'tbl' and 'data.frame': 5144 obs. of 11 variables:
#> $ May 2019 :Classes 'tbl_df', 'tbl' and 'data.frame': 5023 obs. of 10 variables:
#> $ June 2019 :Classes 'tbl_df', 'tbl' and 'data.frame': 5232 obs. of 10 variables:
#> $ July 2019 :Classes 'tbl_df', 'tbl' and 'data.frame': 5113 obs. of 10 variables:
#> $ Aug, 2019 :Classes 'tbl_df', 'tbl' and 'data.frame': 4354 obs. of 10 variables:
#> $ Sept 2019 :Classes 'tbl_df', 'tbl' and 'data.frame': 3894 obs. of 10 variables:
#> $ Oct 2019 :Classes 'tbl_df', 'tbl' and 'data.frame': 4283 obs. of 10 variables:
#> $ Nov 2019 :Classes 'tbl_df', 'tbl' and 'data.frame': 5436 obs. of 10 variables:
Note that the data for for the element Apr 2019
has one more variable than the rest (11 versus 10):
sapply(ttc_bus_delays_2019, colnames)
#> $`Jan 2019`
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident" "Min Delay" "Min Gap" "Direction" "Vehicle"
#>
#> $`Feb 2019 `
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident" "Min Delay" "Min Gap" "Direction" "Vehicle"
#>
#> $`Mar 2019 `
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident" "Min Delay" "Min Gap" "Direction" "Vehicle"
#>
#> $`Apr 2019`
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident ID" "Incident" "Delay" "Gap" "Direction"
#> [11] "Vehicle"
#>
#> $`May 2019 `
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident" "Min Delay" "Min Gap" "Direction" "Vehicle"
#>
#> $`June 2019`
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident" "Delay" "Gap" "Direction" "Vehicle"
#>
#> $`July 2019`
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident" "Min Delay" "Min Gap" "Direction" "Vehicle"
#>
#> $`Aug, 2019 `
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident" "Min Delay" "Min Gap" "Direction" "Vehicle"
#>
#> $`Sept 2019`
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident" "Min Delay" "Min Gap" "Direction" "Vehicle"
#>
#> $`Oct 2019 `
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident" "Min Delay" "Min Gap" "Direction" "Vehicle"
#>
#> $`Nov 2019`
#> [1] "Report Date" "Route" "Time" "Day" "Location"
#> [6] "Incident" "Delay" "Gap" "Direction" "Vehicle"
It seems that the Apr 2019
data has gained a variable Incident ID
, and that the variables Min Gap
and Min Delay
, present in all the other months, have been renamed to Gap
and Delay
, respectively.
We can rename these two variables:
ttc_bus_delays_2019[["Apr 2019"]] <- ttc_bus_delays_2019[["Apr 2019"]] %>%
rename(`Min Gap` = Gap, `Min Delay` = Delay)
and combine all of the elements into a single tibble using dplyr::bind_rows()
:
ttc_bus_delays_2019_combined <- bind_rows(ttc_bus_delays_2019)
ttc_bus_delays_2019_combined
#> # A tibble: 57,892 x 13
#> `Report Date` Route Time Day Location Incident
#> <dttm> <dbl> <dttm> <chr> <chr> <chr>
#> 1 2019-01-01 00:00:00 39 1899-12-31 00:13:00 Tues… NECR Mechani…
#> 2 2019-01-01 00:00:00 111 1899-12-31 00:15:00 Tues… Eglingt… Mechani…
#> 3 2019-01-01 00:00:00 35 1899-12-31 00:18:00 Tues… Finch Mechani…
#> 4 2019-01-01 00:00:00 25 1899-12-31 00:30:00 Tues… Don Mil… Mechani…
#> 5 2019-01-01 00:00:00 36 1899-12-31 00:40:00 Tues… Humberw… Investi…
#> 6 2019-01-01 00:00:00 45 1899-12-31 00:51:00 Tues… Kipling… Utilize…
#> 7 2019-01-01 00:00:00 32 1899-12-31 01:55:00 Tues… Royal Y… Mechani…
#> 8 2019-01-01 00:00:00 53 1899-12-31 02:19:00 Tues… FSTN Utilize…
#> 9 2019-01-01 00:00:00 112 1899-12-31 02:33:00 Tues… Kipling… Mechani…
#> 10 2019-01-01 00:00:00 85 1899-12-31 02:57:00 Tues… DONS Mechani…
#> # … with 57,882 more rows, and 7 more variables: `Min Delay` <dbl>, `Min
#> # Gap` <dbl>, Direction <chr>, Vehicle <dbl>, `Incident ID` <dbl>,
#> # Delay <dbl>, Gap <dbl>
Unfortunately, it looks like the Time
variable got Excel™ed, and will need some data cleaning.
For interests sake, it appears that Incident ID
is a lookup ID for the type of incident – only present in the Apr 2019
data, but interesting nonetheless!
ttc_bus_delays_2019_combined %>%
filter(!is.na(`Incident ID`)) %>%
distinct(`Incident ID`, Incident)
#> # A tibble: 11 x 2
#> `Incident ID` Incident
#> <dbl> <chr>
#> 1 5 Investigation
#> 2 1 Mechanical
#> 3 4 Utilized Off Route
#> 4 3 Diversion
#> 5 9 <NA>
#> 6 8 General Delay
#> 7 6 Emergency Services
#> 8 10 Late Leaving Garage - Operator
#> 9 11 Late Leaving Garage - Mechanical
#> 10 7 Vision
#> 11 12 Late Leaving Garage - Management