Chapter 5 How to…

5.1 Make a table full width

5.2 Add a top header

You can also add headers above certain column names, to help distinguish groupings.

For this section, we’ll use an example based on the “us_contagious_diseases” data from the dslabs package. The following code will load that data and then create the pertussis_latest_years dataframe from it:

This dataframe gives average weekly counts of pertussis in three west coast states over the last five years of reporting. The code used to create this dataset from the original data is discussed in a previous section of this book.

Previously, we created this table from the data:

Table 5.1: Average weekly number of cases of pertussis in west coast states for the last five years of reporting.
state 2007 2008 2009 2010 2011
California 3.3 7.1 10.4 26.4 23.4
Oregon 2.0 1.6 3.2 3.3 3.6
Washington 6.0 9.1 6.3 11.7 15.2

We might want to add a top header over the five columns that show years, to distinguish that those are all similar types of columns. With kableExtra, this can be added with the function add_header_above. The convention for this argument is to include a value for header that’s a vector; this vector will give the names to use for top-level headers as well as the number of columns they should span. In this case, we’d like to not add anything over the “state” column, so we can use the marker " " to indicate that no top-level header should be added. For all the other columns, we’ll add “Year”. Since we only have one column (the first) that will lack a top-level header, and then five (the years) that will all get the same header (“Year”), we can express that in the “header” argument as c(" " = 1, "Year" = 5). In other words, don’t have a top-level header for one column, and then have one named “Year” that spans five columns.

Here’s the result:

Table 5.2: Average weekly number of cases of pertussis in west coast states for the last five years of reporting.
Year
state 2007 2008 2009 2010 2011
California 3.3 7.1 10.4 26.4 23.4
Oregon 2.0 1.6 3.2 3.3 3.6
Washington 6.0 9.1 6.3 11.7 15.2

In the add_header_above call, you’ll need to make sure that you specify values for all the columns in your table. In other words, all the numbers in that argument should add up to the number of columns that you have. In this case, we have six columns in the table (the “state” column plus each of the yearly ones). Since 5 + 1 = 6, we’re good here.

Here’s another example, where separate headers are added for the first two then the next three years:

Table 5.3: Average weekly number of cases of pertussis in west coast states for the last five years of reporting.
First two years
Next three years
state 2007 2008 2009 2010 2011
California 3.3 7.1 10.4 26.4 23.4
Oregon 2.0 1.6 3.2 3.3 3.6
Washington 6.0 9.1 6.3 11.7 15.2

Again, you can see that 1 + 2 + 3 = 6, so we’ve covered all of our columns in the table.

If you want, you can even add headers on top of headers. Just keep adding calls to add_header_above, going from the header you want closest to the column names to the top-most one you want. For example, you could add a header above both of the two top-level headers in the previous example with:

Table 5.4: Average weekly number of cases of pertussis in west coast states for the last five years of reporting.
Year
First two years
Next three years
state 2007 2008 2009 2010 2011
California 3.3 7.1 10.4 26.4 23.4
Oregon 2.0 1.6 3.2 3.3 3.6
Washington 6.0 9.1 6.3 11.7 15.2

You can change the formatting of the header in the add_header_above call, as well. For example, you can change it to be in italics with italic = TRUE:

Table 5.5: Average weekly number of cases of pertussis in west coast states for the last five years of reporting.
Year
state 2007 2008 2009 2010 2011
California 3.3 7.1 10.4 26.4 23.4
Oregon 2.0 1.6 3.2 3.3 3.6
Washington 6.0 9.1 6.3 11.7 15.2

You can change the alignment, too. The default is to be centered over the columns that it spans, but you can change that with align. For example, align = "l" would change it to be left-aligned:

Table 5.6: Average weekly number of cases of pertussis in west coast states for the last five years of reporting.
Year
state 2007 2008 2009 2010 2011
California 3.3 7.1 10.4 26.4 23.4
Oregon 2.0 1.6 3.2 3.3 3.6
Washington 6.0 9.1 6.3 11.7 15.2

You can change the color of both the text and the background color for this new header section with teh color and background arguments, respectively:

Table 5.7: Average weekly number of cases of pertussis in west coast states for the last five years of reporting.
Year
state 2007 2008 2009 2010 2011
California 3.3 7.1 10.4 26.4 23.4
Oregon 2.0 1.6 3.2 3.3 3.6
Washington 6.0 9.1 6.3 11.7 15.2

You can change the angle of the text for the new header with angle:

Table 5.8: Average weekly number of cases of pertussis in west coast states for the last five years of reporting.
Year
state 2007 2008 2009 2010 2011
California 3.3 7.1 10.4 26.4 23.4
Oregon 2.0 1.6 3.2 3.3 3.6
Washington 6.0 9.1 6.3 11.7 15.2

If you’d like the text for a column to have a line break, you can include the code for a new line (“”) inside the column name:

Table 5.9: Average weekly number of cases of pertussis in west coast states for the last five years of reporting.
Year
(Georgian calendar)
state 2007 2008 2009 2010 2011
California 3.3 7.1 10.4 26.4 23.4
Oregon 2.0 1.6 3.2 3.3 3.6
Washington 6.0 9.1 6.3 11.7 15.2

If you have more than one top-level header, you may want to change the amount of separation between each group. You can do that with the line_sep argument. Here is the example table where separate top-level headers are added for the first two years versus the last three from earlier:

Table 5.10: Average weekly number of cases of pertussis in west coast states for the last five years of reporting.
First two years
Next three years
state 2007 2008 2009 2010 2011
California 3.3 7.1 10.4 26.4 23.4
Oregon 2.0 1.6 3.2 3.3 3.6
Washington 6.0 9.1 6.3 11.7 15.2

If you set line_sep = 0, then you’ll see that there’s no space between the two header sections, so the line under the two headers runs together into one line:

Table 5.11: Average weekly number of cases of pertussis in west coast states for the last five years of reporting.
First two years
Next three years
state 2007 2008 2009 2010 2011
California 3.3 7.1 10.4 26.4 23.4
Oregon 2.0 1.6 3.2 3.3 3.6
Washington 6.0 9.1 6.3 11.7 15.2

On the other hand, if you set line_sep to a value higher than it’s default of 3, you can increase the separation in the headers:

Table 5.12: Average weekly number of cases of pertussis in west coast states for the last five years of reporting.
First two years
Next three years
state 2007 2008 2009 2010 2011
California 3.3 7.1 10.4 26.4 23.4
Oregon 2.0 1.6 3.2 3.3 3.6
Washington 6.0 9.1 6.3 11.7 15.2

5.3 Group rows

You might also want to show structure or grouping within your rows. There are a number of functions in kableExtra that help with this.

These examples will again use data derived from the “us_contagious_diseases” data from the dslabs package. The following code will load that data and then create the per_hepa_latest_years dataframe from it:

This dataframe gives the average weekly counts of two diseases (separately) in three west coast states over the last five years of reporting. The code used to create this dataset is an extension of the code used in the section on adding top headings to tables—in this case, we’re adding on data for a second disease.

Here’s what the dataframe looks like:

## # A tibble: 6 x 7
##   state      disease     `2007` `2008` `2009` `2010` `2011`
##   <fct>      <fct>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 California Hepatitis A  6.12   6.48   4.60   4.10   3.59 
## 2 Oregon     Hepatitis A  0.612  0.596  0.422  0.413  0.277
## 3 Washington Hepatitis A  1.12   0.896  0.875  0.417  0.583
## 4 California Pertussis    3.28   7.08  10.4   26.4   23.4  
## 5 Oregon     Pertussis    2.04   1.61   3.15   3.26   3.60 
## 6 Washington Pertussis    5.98   9.08   6.3   11.7   15.2

When we create a table from this, we may want to group the rows by disease (or by state), so that readers can see common factors in certain rows.

You can create those groupings using the pack_rows function. You will include one of these for each of the groupings that you want to create. For example, here there’s one pack_rows to group the first three rows (all with measurements of Hepatitis A) and a second pack_rows to group the fourth through six rows (all with measurements of pertussis). For each pack_rows call, it gives the name that will be used for the grouping (this is the name that shows up as a label for those groups in the table) as well as the number of the line where the grouping should start (start_row) and the line where it should end (end_row):

state 2007 2008 2009 2010 2011
Hepatitis A
California 6.1 6.5 4.6 4.1 3.6
Oregon 0.6 0.6 0.4 0.4 0.3
Washington 1.1 0.9 0.9 0.4 0.6
Pertussis
California 3.3 7.1 10.4 26.4 23.4
Oregon 2.0 1.6 3.2 3.3 3.6
Washington 6.0 9.1 6.3 11.7 15.2

This method of grouping the rows required us to list, by hand, the rows in each group. However, that’s something that we originally had in the data, in the disease column.

The kableExtra has a function called collapse_rows that will do the grouping by a column in the data. You need to move the columns to use to early in the dataframe, which you can do with select. Then, you can specify to collapse the rows by the first column using collapse_rows:

disease state 2007 2008 2009 2010 2011
Hepatitis A California 6.1 6.5 4.6 4.1 3.6
Oregon 0.6 0.6 0.4 0.4 0.3
Washington 1.1 0.9 0.9 0.4 0.6
Pertussis California 3.3 7.1 10.4 26.4 23.4
Oregon 2.0 1.6 3.2 3.3 3.6
Washington 6.0 9.1 6.3 11.7 15.2

The collapse_rows function includes some different formatting options. For example, you can change the vertical alignment of the headings for the row groups with valign. Here’s the code to put them at the top of each row group, instead of vertically centered:

disease state 2007 2008 2009 2010 2011
Hepatitis A California 6.1 6.5 4.6 4.1 3.6
Oregon 0.6 0.6 0.4 0.4 0.3
Washington 1.1 0.9 0.9 0.4 0.6
Pertussis California 3.3 7.1 10.4 26.4 23.4
Oregon 2.0 1.6 3.2 3.3 3.6
Washington 6.0 9.1 6.3 11.7 15.2

5.4 Bold columns or rows

When tables have a lot of content, you may want to highlight elements to draw attention to them. This example uses data from the AmesHousing package.

The column_spec() function allows the selection of a column and then a specification for the look. For example, this is a bold treatment for the third column.

Neighborhood Home_Sales Total
North_Ames 58 $8,721,963
Old_Town 29 $3,556,339
Edwards 23 $2,835,421
Sawyer 23 $3,106,800
College_Creek 21 $4,163,850
Somerset 21 $4,334,126
Sawyer_West 18 $3,268,000
Mitchell 17 $3,043,600

The row_spec() function allows the selection of rows and a specification for a style. For greater emphasis, this example sets a background fill color and the font color.

Neighborhood Home_Sales Total
North_Ames 58 $8,721,963
Old_Town 29 $3,556,339
Edwards 23 $2,835,421
Sawyer 23 $3,106,800
College_Creek 21 $4,163,850
Somerset 21 $4,334,126
Sawyer_West 18 $3,268,000
Mitchell 17 $3,043,600

5.5 Dynamically make changes

You might want to make changes to your tables that depend on features of your data that aren’t known in advance or can’t be hardcoded.

Let’s look at the comma_survey data set, which polled Americans on the oxford comma and other grammatical concepts. In particular, we’ll look at the education of those who care about the debate of whether “data” is a singular or plural noun:

## # A tibble: 1,091 x 2
##    education                        care_data 
##    <ord>                            <ord>     
##  1 Bachelor degree                  Not much  
##  2 Graduate degree                  Not much  
##  3 <NA>                             Not at all
##  4 Less than high school degree     Some      
##  5 <NA>                             Not much  
##  6 Some college or Associate degree Some      
##  7 Some college or Associate degree Some      
##  8 Some college or Associate degree A lot     
##  9 Graduate degree                  Not much  
## 10 Bachelor degree                  Some      
## # … with 1,081 more rows

We want to make a table of what percent, by education, answered each of “Not much”, “Not at all”, “Some”, and “A lot” to the question: “How much, if at all, do you care about the debate over the use of the word”data" as a singular or plural noun?".

First, we need to do some wrangling to replace NA values of Education with “Education unknown”, and to calculate the percent of each response.

## # A tibble: 24 x 4
##    education                        care_data    prop percent
##    <ord>                            <ord>       <dbl> <chr>  
##  1 Less than high school degree     Not at all 0.273  27.3%  
##  2 Less than high school degree     Not much   0.455  45.5%  
##  3 Less than high school degree     Some       0.182  18.2%  
##  4 Less than high school degree     A lot      0.0909 9.1%   
##  5 High school degree               Not at all 0.28   28.0%  
##  6 High school degree               Not much   0.34   34.0%  
##  7 High school degree               Some       0.3    30.0%  
##  8 High school degree               A lot      0.08   8.0%   
##  9 Some college or Associate degree Not at all 0.2    20.0%  
## 10 Some college or Associate degree Not much   0.403  40.3%  
## # … with 14 more rows

Then reshape the data:

## # A tibble: 6 x 5
##   education                        `Not at all` `Not much` Some   `A lot`
##   <ord>                            <chr>        <chr>      <chr>  <chr>  
## 1 Less than high school degree     27.3%        45.5%      18.2%  9.1%   
## 2 High school degree               28.0%        34.0%      30.0%  8.0%   
## 3 Some college or Associate degree 20.0%        40.3%      29.8%  9.8%   
## 4 Bachelor degree                  18.90%       35.47%     35.76% 9.88%  
## 5 Graduate degree                  13.0%        37.3%      31.9%  17.8%  
## 6 Education unknown                18%          31%        32%    18%

Then, we can make a table. We’ll clean up the column names, make sure the education column is left aligned and the responses are right aligned, and add a top header with the question “How much, if at all, do you care about the debate over the use of the word ‘data’ as a singular or plural noun?” above the “Not at all”, “Not much”, “Some”, and “A lot” columns:

How much, if at all, do you care about the debate over the use of the word ‘data’ as a singular or plural noun?
Education Not at all Not much Some A lot
Less than high school degree 27.3% 45.5% 18.2% 9.1%
High school degree 28.0% 34.0% 30.0% 8.0%
Some college or Associate degree 20.0% 40.3% 29.8% 9.8%
Bachelor degree 18.90% 35.47% 35.76% 9.88%
Graduate degree 13.0% 37.3% 31.9% 17.8%
Education unknown 18% 31% 32% 18%

Doing this required us to hardcode in the question responses and the number of columns (4) corresponding to them (both for the header above, and for the column alignment). Our code is now fragile if this isn’t known in advance, or could change!

Instead, we can think about it this way: we know that the first column contains Education and the rest contain the answers. So, we can set the column names, alignment, and the header using attributes of data_noun_percent_wide.

For col.names and align, we can quickly use names() and ncol(), respectively:

Education Not at all Not much Some A lot
Less than high school degree 27.3% 45.5% 18.2% 9.1%
High school degree 28.0% 34.0% 30.0% 8.0%
Some college or Associate degree 20.0% 40.3% 29.8% 9.8%
Bachelor degree 18.90% 35.47% 35.76% 9.88%
Graduate degree 13.0% 37.3% 31.9% 17.8%
Education unknown 18% 31% 32% 18%

And for the header, we can create a named vector with the number of columns each header spans (again, using ncol()):

How much, if at all, do you care about the debate over the use of the word ‘data’ as a singular or plural noun?
Education Not at all Not much Some A lot
Less than high school degree 27.3% 45.5% 18.2% 9.1%
High school degree 28.0% 34.0% 30.0% 8.0%
Some college or Associate degree 20.0% 40.3% 29.8% 9.8%
Bachelor degree 18.90% 35.47% 35.76% 9.88%
Graduate degree 13.0% 37.3% 31.9% 17.8%
Education unknown 18% 31% 32% 18%

Now our code is more robust and can easily adapt to changes in the data (e.g., if there were additional responses).

We might want to do other things, too, like bold the cell with the highest percent for each of the levels of Education.

To do this, we can use cell_spec() to set values of percent to bold if they are the highest value of prop (the unformatted percent) for their level of education.

## # A tibble: 24 x 5
## # Groups:   education [6]
##    education           care_data   prop percent                     highest_prop
##    <ord>               <ord>      <dbl> <chr>                              <dbl>
##  1 Less than high sch… Not at a… 0.273  "<span style=\"     \" >27…        0.455
##  2 Less than high sch… Not much  0.455  "<span style=\" font-weigh…        0.455
##  3 Less than high sch… Some      0.182  "<span style=\"     \" >18…        0.455
##  4 Less than high sch… A lot     0.0909 "<span style=\"     \" >9.…        0.455
##  5 High school degree  Not at a… 0.28   "<span style=\"     \" >28…        0.34 
##  6 High school degree  Not much  0.34   "<span style=\" font-weigh…        0.34 
##  7 High school degree  Some      0.3    "<span style=\"     \" >30…        0.34 
##  8 High school degree  A lot     0.08   "<span style=\"     \" >8.…        0.34 
##  9 Some college or As… Not at a… 0.2    "<span style=\"     \" >20…        0.403
## 10 Some college or As… Not much  0.403  "<span style=\" font-weigh…        0.403
## # … with 14 more rows

This applies HTML formatting to percent (specifically, bold formatting to the highest value for each level of education) that will be picked up by kableExtra.

Then, we again reshape the data to a wide format and create a table. This time, we need to set escape = FALSE in kable() to let it know that there is HTML in the cells that should not be escaped (i.e., not treated as regular text):

How much, if at all, do you care about the debate over the use of the word ‘data’ as a singular or plural noun?
Education Not at all Not much Some A lot
Less than high school degree 27.3% 45.5% 18.2% 9.1%
High school degree 28.0% 34.0% 30.0% 8.0%
Some college or Associate degree 20.0% 40.3% 29.8% 9.8%
Bachelor degree 18.90% 35.47% 35.76% 9.88%
Graduate degree 13.0% 37.3% 31.9% 17.8%
Education unknown 18% 31% 32% 18%

This way, we can apply formatting based on specific values of the data, which may change.

5.6 Take advantage of PDF-only features

Example code that will allow us to embed PDF-only table output. This uses a couple of basic examples of doing this, with code from the dslabs package.

Then, you can use the kable function to create a basic PDF table:

The booktabs = TRUE option, if you include it in the kable function call, will give some cleaner default formatting for the table: