Chapter 4 Tidyverse / kableExtra pipelines
To really levarage the power of R for tables, you’ll likely want to get in the habit of
creating pipelines that use functions from the tidyverse
package to reshape or
clean your data before piping that data into kable
and kableExtra
functions to
create the table. This section will walk through some examples of this pipeline, and
how it can provide a powerful approach for creating tables from data in R.
To run this code, please be sure to load these packages:
4.1 Example data
Again, this section uses a dataset from the dslabs
package called “us_contagious_diseases”.
You can load that data by loading the dslabs
package (if you don’t have it, you
can install it with install.packages("dslabs")
) and then using the data
function
to load the dataset:
This dataset includes counts by states for various infectious diseases. It gives these counts by year over a several decades. It also gives the number of weeks for which that state reported data that year and the population of the state that year (in case you’d like to calculate rates):
## disease state year weeks_reporting count population
## 1 Hepatitis A Alabama 1966 50 321 3345787
## 2 Hepatitis A Alabama 1967 49 291 3364130
## 3 Hepatitis A Alabama 1968 52 314 3386068
## 4 Hepatitis A Alabama 1969 49 380 3412450
4.2 Example 1
One quick table we might want to create may be a table where each state is listed, with the range of years that the state reported data. This way, we can quickly see the years that each state reported cases for each disease, and if all states have data for the same number of years.
To do this, we’ll need to get a summary from each combination of state and disease,
where we get the minimum and maximum values of year
for that combination. Then, we
can paste those values together to get the year range. We can select
just the
columns we need for the final table, getting rid of any extra columns we created
along the way to help with our calculation. Finally, this table will look much nicer
if we have a separate column for each disease (while “long” data is easier to work with
when coding, “wide” data often is clearer in a presentation format). The
pivot_wider
can do that reshaping.
Here’s the full pipeline:
us_contagious_diseases %>%
# Get the year range for each combination of state and disease
# by grouping by these variables and then using summarize
group_by(state, disease) %>%
summarize(earliest_year = min(year),
latest_year = max(year)) %>%
# Use `mutate` to make a new column that presents this range more
# nicely (the `unite` function could work here, too)
mutate(years = paste0(earliest_year, "--", latest_year)) %>%
# Limit to just the data you want to present in the table
select(state, disease, years) %>%
# Reshape the data, so that each disease gets its own column
pivot_wider(names_from = "disease", values_from = "years") %>%
# Send to `kable` for formatting as a table
kable(booktabs = TRUE,
caption = "Years of reporting for each state and disease.") %>%
# Make the font size a bit smaller than the default, so the date ranges
# will be shown on one line each
kable_styling(font_size = 10)
state | Hepatitis A | Measles | Mumps | Pertussis | Polio | Rubella | Smallpox |
---|---|---|---|---|---|---|---|
Alabama | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Alaska | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Arizona | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Arkansas | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
California | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Colorado | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Connecticut | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Delaware | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
District Of Columbia | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Florida | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Georgia | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Hawaii | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Idaho | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Illinois | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Indiana | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Iowa | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Kansas | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Kentucky | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Louisiana | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Maine | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Maryland | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Massachusetts | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Michigan | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Minnesota | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Mississippi | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Missouri | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Montana | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Nebraska | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Nevada | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
New Hampshire | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
New Jersey | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
New Mexico | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
New York | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
North Carolina | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
North Dakota | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Ohio | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Oklahoma | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Oregon | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Pennsylvania | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Rhode Island | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
South Carolina | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
South Dakota | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Tennessee | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Texas | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Utah | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Vermont | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Virginia | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Washington | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
West Virginia | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Wisconsin | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
Wyoming | 1966–2011 | 1928–2002 | 1968–2002 | 1938–2011 | 1928–1968 | 1966–2002 | 1928–1952 |
4.3 Example 2
As another example, you might want to use this data to create a table that shows the number of cases of a specific disease as it’s changed over the last five years of reporting in a certain region. For example, you could pull data on states along the west coast of the US (California, Oregon, and Washington) and look at changes in pertussis over the last five available years.
us_contagious_diseases %>%
# Limit to states in the region of interest
filter(state %in% c("California", "Oregon", "Washington")) %>%
# Limit to the disease of interest
filter(disease == "Pertussis") %>%
# Limit to the latest five years reported for each state
group_by(state) %>%
arrange(desc(year)) %>%
slice(1:5) %>%
# Calculate the average weekly count from the total yearly count and
# the number of weeks reporting (should check data---does an
# 'unreported' week mean there were no cases that week?)
mutate(ave_weekly_count = count / weeks_reporting) %>%
# Limit to only the data to show in the table
select(state, year, ave_weekly_count) %>%
# Reorder by year so that the earliest year comes first
arrange(year) %>%
# "Widen" the data for a nicer table
pivot_wider(names_from = year, values_from = ave_weekly_count) %>%
# Only report one digit (i.e., after the decimal place) for each number
kable(digits = 1,
caption = "Average weekly number of cases of pertussis in west coast states for the last five years of reporting.")
state | 2007 | 2008 | 2009 | 2010 | 2011 |
---|---|---|---|---|---|
California | 3.3 | 7.1 | 10.4 | 26.4 | 23.4 |
Oregon | 2.0 | 1.6 | 3.2 | 3.3 | 3.6 |
Washington | 6.0 | 9.1 | 6.3 | 11.7 | 15.2 |