Formula versus `tidyselect` interface

tableone is implemented to allow a subset of columns in a large dataset to be pulled into a table without any fuss. It is also designed with a workflow in mind that involves building statistical models from the data later. We assume the data follows a general pattern in which there is one observation per row, and each column holds a specific data point for that observation, which may be one of:

outcome: the quantity that we will assess in a statistical model, which may be a continuous outcome, a time measure, or a logical measure.
intervention: the thing that is varied between the different observations.
covariates: the other factors that may influence the outcome and that we want to control for.

In the end we will want to construct a model that takes the following high level structure:

outcome ~ intervention + covariate_1 + covariate_2 + ... + covariate_n
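An R formula is an ordinary object that can be taken apart, which is what makes it reusable for both descriptive tables and models. The following is a minimal base R sketch using the placeholder names from the structure above (they are illustrative and not columns of any real data set):

```r
# Placeholder names mirroring the high level structure; not real columns.
model_formula = outcome ~ intervention + covariate_1 + covariate_2

# Right-hand-side terms, in the order they were written.
predictors = attr(terms(model_formula), "term.labels")

outcome_vars = all.vars(model_formula[[2]])  # variables on the left-hand side
intervention = predictors[1]                 # first predictor
covariates   = predictors[-1]                # every predictor after the first

print(outcome_vars)
print(intervention)
print(covariates)
```

Nothing here depends on tableone; it only shows that the roles of the columns can be read directly off the formula.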

Simple population description example

Before we build a model we need first to describe the distribution of the covariates in the whole population, and second to compare them between the intervention and non-intervention groups; this is usually done without reference to the outcome. To demonstrate this we use the survival::cgd data set.

cgd = survival::cgd %>% 
  # filter to include only the first visit
  dplyr::filter(enum==1) %>% 
  # make the steroids and propylac columns into a logical value
  # see later for a better way of doing this.
  dplyr::mutate(
    steroids = as.logical(steroids),
    propylac = as.logical(propylac)
  )
  

# A basic unstratified population description table is as follows:
cgd %>% describe_population(tidyselect::everything())
#> Warning: Unknown or uninitialised column: `level`.

Variable	Characteristic	Value	Count (N=128)
Id	Median [IQR]	64.5 [32.8—96.2]	128
Center	Harvard Medical Sch % [95% CI]	3.1% [1.2%—7.8%]	4/128
	Scripps Institute % [95% CI]	12.5% [7.8%—19.3%]	16/128
	Copenhagen % [95% CI]	3.1% [1.2%—7.8%]	4/128
	NIH % [95% CI]	20.3% [14.3%—28.1%]	26/128
	L.A. Children's Hosp % [95% CI]	6.2% [3.2%—11.8%]	8/128
	Mott Children's Hosp % [95% CI]	7.0% [3.7%—12.8%]	9/128
	Univ. of Utah % [95% CI]	3.1% [1.2%—7.8%]	4/128
	Univ. of Washington % [95% CI]	3.1% [1.2%—7.8%]	4/128
	Univ. of Minnesota % [95% CI]	4.7% [2.2%—9.8%]	6/128
	Univ. of Zurich % [95% CI]	12.5% [7.8%—19.3%]	16/128
	Texas Children's Hosp % [95% CI]	6.2% [3.2%—11.8%]	8/128
	Amsterdam % [95% CI]	14.8% [9.7%—22.0%]	19/128
	Mt. Sinai Medical Ctr % [95% CI]	3.1% [1.2%—7.8%]	4/128
Random		—	128
Treat	placebo % [95% CI]	50.8% [42.2%—59.3%]	65/128
Treat	rIFN-g % [95% CI]	49.2% [40.7%—57.8%]	63/128
Sex	male % [95% CI]	81.2% [73.6%—87.1%]	104/128
Sex	female % [95% CI]	18.8% [12.9%—26.4%]	24/128
Age	Median [IQR]	12 [7—22]	128
Height	Median [IQR]	141 [116—170]	128
Weight	Median [IQR]	34.8 [20.7—59.2]	128
Inherit	X-linked % [95% CI]	67.2% [58.7%—74.7%]	86/128
Inherit	autosomal % [95% CI]	32.8% [25.3%—41.3%]	42/128
Steroid treatment	false % [95% CI]	97.7% [93.3%—99.2%]	125/128
Steroid treatment	true % [95% CI]	2.3% [0.8%—6.7%]	3/128
Propylac	false % [95% CI]	13.3% [8.5%—20.2%]	17/128
Propylac	true % [95% CI]	86.7% [79.8%—91.5%]	111/128
Location	US:NIH % [95% CI]	20.3% [14.3%—28.1%]	26/128
	US:other % [95% CI]	49.2% [40.7%—57.8%]	63/128
	Europe:Amsterdam % [95% CI]	14.8% [9.7%—22.0%]	19/128
	Europe:other % [95% CI]	15.6% [10.3%—22.9%]	20/128
Tstart	Median [IQR]	0 [0—0]	128
Enum	Median [IQR]	1 [1—1]	128
Tstop	Median [IQR]	269 [197—304]	128
Status	Median [IQR]	0 [0—1]	128
Normal distributions determined by the Anderson-Darling test (P>0.005)
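The method behind the 95% CI column is not stated here. As a rough check, the printed intervals appear consistent with a Wilson score interval, which base R's prop.test() returns when the continuity correction is switched off. The sketch below recomputes the interval for one of the 4/128 rows on that assumption:

```r
# Assumption: a Wilson score interval; the text does not confirm the method.
ci = prop.test(x = 4, n = 128, correct = FALSE)$conf.int

# Format as percentages to compare with the 4/128 rows of the table above.
sprintf("%.1f%% [%.1f%%, %.1f%%]", 100 * 4 / 128, 100 * ci[1], 100 * ci[2])
```

If the assumption holds, the bounds should agree with the table to the displayed precision.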

The columns could also have been specified using the formula interface. In this example we take a formula that we might wish to use for a survival model and reuse it to give a more targeted descriptive table. It is also possible to supply tableone with a relabelling function that maps column names to printable labels, as demonstrated here:

# define a formula - this might be reused in model building later
formula = Surv(tstart, tstop, status) ~ treat + 
  sex + age + height + weight + inherit + steroids + hos.cat

# set a table relabelling function
rename_cols = function(col) {
  dplyr::case_when(
    col == "hos.cat" ~ "Location",
    col == "steroids" ~ "Steroid treatment",
    TRUE ~ stringr::str_to_sentence(col)
  )
}
options("tableone.labeller"=rename_cols)

# create a simple description
cgd %>% describe_population(formula)

Variable	Characteristic	Value	Count (N=128)
Treat	placebo % [95% CI]	50.8% [42.2%—59.3%]	65/128
Treat	rIFN-g % [95% CI]	49.2% [40.7%—57.8%]	63/128
Sex	male % [95% CI]	81.2% [73.6%—87.1%]	104/128
Sex	female % [95% CI]	18.8% [12.9%—26.4%]	24/128
Age	Median [IQR]	12 [7—22]	128
Height	Median [IQR]	141 [116—170]	128
Weight	Median [IQR]	34.8 [20.7—59.2]	128
Inherit	X-linked % [95% CI]	67.2% [58.7%—74.7%]	86/128
Inherit	autosomal % [95% CI]	32.8% [25.3%—41.3%]	42/128
Steroid treatment	false % [95% CI]	97.7% [93.3%—99.2%]	125/128
Steroid treatment	true % [95% CI]	2.3% [0.8%—6.7%]	3/128
Location	US:NIH % [95% CI]	20.3% [14.3%—28.1%]	26/128
	US:other % [95% CI]	49.2% [40.7%—57.8%]	63/128
	Europe:Amsterdam % [95% CI]	14.8% [9.7%—22.0%]	19/128
	Europe:other % [95% CI]	15.6% [10.3%—22.9%]	20/128
Normal distributions determined by the Anderson-Darling test (P>0.005)

The relabelling function can either be passed to each invocation of tableone functions or as an option as shown here, which makes the labeller available to all subsequent calls. This is useful if you are generating many tables from a single dataset.
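Because the labeller is set through options(), it stays active until it is changed. One way to scope it to a single block of work is the save-and-restore idiom from base R, sketched here with a placeholder labeller (shout is hypothetical and simply upper-cases names):

```r
# Hypothetical labeller used only for this illustration.
shout = function(col) toupper(col)

# options() returns the previous values, so they can be restored afterwards.
old = options("tableone.labeller" = shout)

# The function stored in the option can be called directly to preview labels.
labeller = getOption("tableone.labeller")
labeller(c("age", "hos.cat"))

# Put back whatever was configured before (or unset it if nothing was).
options(old)
```

Calling the stored function on a few column names is a quick way to check the labels before generating a table.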

We will generally use the formula interface from here on, but for exploring larger datasets with more covariates the tidyselect interface may be more useful.

Comparing the population by intervention

In this example a more useful table compares the treatment groups. We can use the same formula syntax for this, but here the first predictor is assumed to be the intervention and the data set is compared by intervention (in this case the treat column). From the resulting table we can conclude that the covariates are well balanced between the placebo and treatment groups and that there is no major bias in the randomisation process:


# same as above
formula = Surv(tstart, tstop, status) ~ treat + 
  sex + age + height + weight + inherit + steroids + hos.cat

# labelling function is still active
cgd %>% compare_population(formula)

		placebo	rIFN-g
Variable	Characteristic	Value (N=65)	Value (N=63)	P value
Sex	male % [95% CI] (n)	81.5% [70.4%—89.1%] (53)	81.0% [69.6%—88.8%] (51)	1 †
Sex	female % [95% CI] (n)	18.5% [10.9%—29.6%] (12)	19.0% [11.2%—30.4%] (12)	1 †
Age	Median [IQR]	14 [7—24]	12 [7—19.5]	0.56 ††
Height	Median [IQR]	143 [115—171]	139 [119—167]	0.45 †††
Weight	Median [IQR]	36.1 [21.6—63.7]	34.4 [20.6—53.7]	0.4 †††
Inherit	X-linked % [95% CI] (n)	63.1% [50.9%—73.8%] (41)	71.4% [59.3%—81.1%] (45)	0.35 †
Inherit	autosomal % [95% CI] (n)	36.9% [26.2%—49.1%] (24)	28.6% [18.9%—40.7%] (18)	0.35 †
Steroid treatment	false % [95% CI] (n)	96.9% [89.5%—99.2%] (63)	98.4% [91.5%—99.7%] (62)	1 †
Steroid treatment	true % [95% CI] (n)	3.1% [0.8%—10.5%] (2)	1.6% [0.3%—8.5%] (1)	1 †
Location	US:NIH % [95% CI] (n)	16.9% [9.7%—27.8%] (11)	23.8% [15.0%—35.6%] (15)	0.7 †
	US:other % [95% CI] (n)	49.2% [37.5%—61.1%] (32)	49.2% [37.3%—61.2%] (31)
	Europe:Amsterdam % [95% CI] (n)	15.4% [8.6%—26.1%] (10)	14.3% [7.7%—25.0%] (9)
	Europe:other % [95% CI] (n)	18.5% [10.9%—29.6%] (12)	12.7% [6.6%—23.1%] (8)
†, Fisher's exact test (categorical); ††, 2 sample Wilcoxon Rank Sum test (continuous); †††, 2 sample Kolmogorov-Smirnov test (continuous). Normal distributions determined by the Anderson-Darling test (P>0.005). An adjusted P value of 0.00714 may be considered significant.
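The footnote names Fisher's exact test for the categorical rows. As a sanity check, the same test can be re-run in base R on the Sex counts printed in the table. This is a sketch of the comparison, not a description of how tableone computes it internally:

```r
# Counts of sex by treatment arm, copied from the table above.
sex_by_arm = matrix(
  c(53, 12,   # placebo: male, female
    51, 12),  # rIFN-g:  male, female
  nrow = 2,
  dimnames = list(sex = c("male", "female"), treat = c("placebo", "rIFN-g"))
)

result = fisher.test(sex_by_arm)
result$p.value
```

The table reports a P value of 1 for this row, so the result here should be at or very near 1.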

Alternatively, with the tidyselect interface the following syntax gives the same table. Note that the data must first be grouped by the intervention for the tidyselect interface to work as intended:

cgd %>% dplyr::group_by(treat) %>% 
  compare_population(sex,age,height,weight,inherit,steroids,hos.cat)

Analysis of missing data

We need to make sure that not only is the data equivalent between the intervention groups but also that missing data is not unevenly distributed or excessive. Reporting on the frequency of missing data stratified by intervention is also easy, and to demonstrate this we make a data set with 10% of the placebo arm having missing values, but 25% of the treatment arm:


# generate a dataset with values missing not at random compared to the intervention:
cgd_treat = cgd %>% dplyr::mutate(treat = as.character(treat)) %>% dplyr::filter(treat != "placebo")
cgd_placebo = cgd %>% dplyr::mutate(treat = as.character(treat)) %>% dplyr::filter(treat == "placebo")

set.seed(100)
mnar_cgd = dplyr::bind_rows(
  cgd_placebo %>% .make_missing(p_missing = 0.1),
  cgd_treat %>% .make_missing(p_missing = 0.25)
)
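.make_missing is not defined in this text, so its exact behaviour is not known. The sketch below is a hypothetical stand-in, blank_at_random, which assumes the intent is to blank each value of the chosen columns independently with probability p_missing. It is shown on a toy data frame rather than on cgd:

```r
# Hypothetical helper (not part of tableone): set each cell of the chosen
# columns to NA independently with probability p_missing.
blank_at_random = function(df, p_missing, cols = names(df)) {
  for (col in cols) {
    drop = stats::runif(nrow(df)) < p_missing
    df[[col]][drop] = NA
  }
  df
}

set.seed(100)
toy = data.frame(
  arm = rep(c("a", "b"), each = 50),
  x = rnorm(100),
  y = sample(letters, 100, replace = TRUE)
)

# Leave the grouping column intact and blank roughly 25% of x and y.
toy_missing = blank_at_random(toy, p_missing = 0.25, cols = c("x", "y"))

# Observed fraction of missing values per column.
colMeans(is.na(toy_missing))
```

Applying different values of p_missing to each arm before binding the rows back together gives missingness that depends on the intervention.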

Comparing this new data set by intervention, we see that there are significant differences in missingness for some of the variables (for example age, but not the steroids variable). As this is quite a small dataset it is not sufficiently powered to reliably detect a difference in missingness of this size (15 percentage points).

# compare the MNAR dataset against the intervention:
formula = Surv(tstart, tstop, status) ~ treat + 
  sex + age + height + weight + inherit + steroids + hos.cat

mnar_cgd %>% compare_missing(formula)

	placebo	rIFN-g
variable	missing % (N)	missing % (N)	P value
Sex	12.3% (8/65)	23.8% (15/63)	0.11
Age	4.6% (3/65)	23.8% (15/63)	0.002
Height	6.2% (4/65)	20.6% (13/63)	0.019
Weight	10.8% (7/65)	23.8% (15/63)	0.062
Inherit	12.3% (8/65)	28.6% (18/63)	0.028
Steroid treatment	13.8% (9/65)	28.6% (18/63)	0.052
Location	9.2% (6/65)	19.0% (12/63)	0.13
More than 10% of data is missing for variables Sex, Age, Height, Weight, Inherit, Steroid treatment, Location. Data is missing not at random (compared to Treat) at a p-value<0.007 (0.05 over 7 comparisons) for variables Age.
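The adjusted threshold quoted in the footer is 0.05 divided by the number of comparisons. A short sketch of that arithmetic, using the P values as printed in the table above:

```r
# P values copied from the missingness table.
p_values = c(
  Sex = 0.11, Age = 0.002, Height = 0.019, Weight = 0.062,
  Inherit = 0.028, `Steroid treatment` = 0.052, Location = 0.13
)

# 0.05 shared equally over the 7 comparisons.
threshold = 0.05 / length(p_values)
round(threshold, 5)  # 0.00714

# Variables whose missingness differs by arm at the adjusted level.
flagged = names(p_values)[p_values < threshold]
flagged              # "Age"
```

Several variables fall below an unadjusted 0.05, but only Age passes the adjusted level, which matches the footer.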

With this analysis in hand, it is useful to be able to update the analysis formula, removing the variables with missing data, so that we are confident the models are based on reasonable data.


# formula can also be a list of formulae
new_formula = mnar_cgd %>% tableone::remove_missing(formula)
#> More than 10% of data is missing for variables Sex, Age, Height, Weight, Inherit, Steroid treatment, Location.
#> Data is missing not at random (compared to Treat) at a p-value<0.007 (0.05 over 7 comparisons) for variables Age.

print(new_formula)
#> [[1]]
#> Surv(tstart, tstop, status) ~ treat

Conversion of discrete data

Using this new data set with missing data it may be necessary to discretise some or all of the data, or convert logical values into properly named factors.


decade = function(x) sprintf("%d-%d",x-(x%%10),x-(x%%10)+9)

discrete_cgd = mnar_cgd %>% 
  # pick out the first episode
  dplyr::filter(enum == 1) %>%
  # convert data
  make_factors(
    steroids,propylac,age,weight,height,
    .logical = c("received","not received"),
    .numeric = list(
      age="{decade(value)}",
      weight="{ifelse(value<20,'<20','20+')}",
      height="{ifelse(value<mean(value, na.rm=TRUE),'below average','above average')}"
    )
  )

formula = Surv(tstart, tstop, status) ~ treat + 
  sex + age + height + weight + inherit + steroids + hos.cat


old = options("tableone.show_pvalue_method"=TRUE)
# This comparison implicitly ignores missing values.
t = discrete_cgd %>% compare_population(formula)
options(old)

t

		placebo	rIFN-g
Variable	Characteristic	Value (N=61)	Value (N=49)	P value
Sex	male % [95% CI] (n)	77.8% [65.1%—86.8%] (42)	80.6% [65.0%—90.2%] (29)	— †
Sex	female % [95% CI] (n)	22.2% [13.2%—34.9%] (12)	19.4% [9.8%—35.0%] (7)	— †
Age	0-9 % [95% CI] (n)	31.0% [20.6%—43.8%] (18)	47.4% [32.5%—62.7%] (18)	— ††
	10-19 % [95% CI] (n)	31.0% [20.6%—43.8%] (18)	23.7% [13.0%—39.2%] (9)
	20-29 % [95% CI] (n)	31.0% [20.6%—43.8%] (18)	13.2% [5.8%—27.3%] (5)
	30-39 % [95% CI] (n)	6.9% [2.7%—16.4%] (4)	13.2% [5.8%—27.3%] (5)
	40-49 % [95% CI] (n)	0.0% [0.0%—6.2%] (0)	2.6% [0.5%—13.5%] (1)
Height	below average % [95% CI] (n)	43.9% [31.8%—56.7%] (25)	52.5% [37.5%—67.1%] (21)	— ††
Height	above average % [95% CI] (n)	56.1% [43.3%—68.2%] (32)	47.5% [32.9%—62.5%] (19)	— ††
Weight	<20 % [95% CI] (n)	21.4% [12.7%—33.8%] (12)	20.5% [10.8%—35.5%] (8)	— ††
Weight	20+ % [95% CI] (n)	78.6% [66.2%—87.3%] (44)	79.5% [64.5%—89.2%] (31)	— ††
Inherit	X-linked % [95% CI] (n)	58.5% [45.1%—70.7%] (31)	68.4% [52.5%—80.9%] (26)	— †
Inherit	autosomal % [95% CI] (n)	41.5% [29.3%—54.9%] (22)	31.6% [19.1%—47.5%] (12)	— †
Steroid treatment	received % [95% CI] (n)	1.9% [0.3%—9.9%] (1)	0.0% [0.0%—10.4%] (0)	— †
Steroid treatment	not received % [95% CI] (n)	98.1% [90.1%—99.7%] (52)	100.0% [89.6%—100.0%] (33)	— †
Location	US:NIH % [95% CI] (n)	19.3% [11.1%—31.3%] (11)	20.5% [10.8%—35.5%] (8)	— †
	US:other % [95% CI] (n)	50.9% [38.3%—63.4%] (29)	51.3% [36.2%—66.1%] (20)
	Europe:Amsterdam % [95% CI] (n)	14.0% [7.3%—25.3%] (8)	12.8% [5.6%—26.7%] (5)
	Europe:other % [95% CI] (n)	15.8% [8.5%—27.4%] (9)	15.4% [7.2%—29.7%] (6)
†, Not calculated due to missing values (categorical); ††, Not calculated due to missing values (ordered). An adjusted P value of 0.00714 may be considered significant.
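The decade helper used in the conversion step can be checked in isolation before it is passed to make_factors(). The definition is copied from the code above; the ages are made-up inputs for illustration:

```r
# Copied from the conversion step: label a value with its decade.
decade = function(x) sprintf("%d-%d", x - (x %% 10), x - (x %% 10) + 9)

# Made-up ages used only to exercise the helper.
ages = c(7L, 12L, 45L)
decade_labels = decade(ages)
decade_labels  # "0-9" "10-19" "40-49"
```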

# N.B. The following option is involved when converting integer data
# which decides how many levels of integer data are considered discrete
# and when to decide integer data can be treated as continuous:
options("tableone.max_discrete_levels"=0)
# and is described in the documentation for make_factors().

Making missing factors explicit

In the comparison above missing values were not included, and we should be cautious of the findings. Because of the missingness, tableone will not calculate p-values. If factor values are missing (as in this case) we can include them as a new group and get a more robust comparison that includes the distribution of missingness, and for which a p-value can be calculated. However, previously ordered variables are now regarded as unordered, as we cannot determine where a missing level falls in the order.

discrete_cgd %>% explicit_na() %>% compare_population(formula)

		placebo	rIFN-g
Variable	Characteristic	Value (N=61)	Value (N=49)	P value
Sex	male % [95% CI] (n)	68.9% [56.4%—79.1%] (42)	59.2% [45.2%—71.8%] (29)	0.13 †
	female % [95% CI] (n)	19.7% [11.6%—31.3%] (12)	14.3% [7.1%—26.7%] (7)
	<missing> % [95% CI] (n)	11.5% [5.7%—21.8%] (7)	26.5% [16.2%—40.3%] (13)
Age	0-9 % [95% CI] (n)	29.5% [19.6%—41.9%] (18)	36.7% [24.7%—50.7%] (18)	0.0064 †
	10-19 % [95% CI] (n)	29.5% [19.6%—41.9%] (18)	18.4% [10.0%—31.4%] (9)
	20-29 % [95% CI] (n)	29.5% [19.6%—41.9%] (18)	10.2% [4.4%—21.8%] (5)
	30-39 % [95% CI] (n)	6.6% [2.6%—15.7%] (4)	10.2% [4.4%—21.8%] (5)
	40-49 % [95% CI] (n)	0.0% [-0.0%—5.9%] (0)	2.0% [0.4%—10.7%] (1)
	<missing> % [95% CI] (n)	4.9% [1.7%—13.5%] (3)	22.4% [13.0%—35.9%] (11)
Height	below average % [95% CI] (n)	41.0% [29.5%—53.5%] (25)	42.9% [30.0%—56.7%] (21)	0.12 †
	above average % [95% CI] (n)	52.5% [40.2%—64.5%] (32)	38.8% [26.4%—52.8%] (19)
	<missing> % [95% CI] (n)	6.6% [2.6%—15.7%] (4)	18.4% [10.0%—31.4%] (9)
Weight	<20 % [95% CI] (n)	19.7% [11.6%—31.3%] (12)	16.3% [8.5%—29.0%] (8)	0.21 †
	20+ % [95% CI] (n)	72.1% [59.8%—81.8%] (44)	63.3% [49.3%—75.3%] (31)
	<missing> % [95% CI] (n)	8.2% [3.6%—17.8%] (5)	20.4% [11.5%—33.6%] (10)
Inherit	X-linked % [95% CI] (n)	50.8% [38.6%—62.9%] (31)	53.1% [39.4%—66.3%] (26)	0.3 †
	autosomal % [95% CI] (n)	36.1% [25.2%—48.6%] (22)	24.5% [14.6%—38.1%] (12)
	<missing> % [95% CI] (n)	13.1% [6.8%—23.8%] (8)	22.4% [13.0%—35.9%] (11)
Steroid treatment	received % [95% CI] (n)	1.6% [0.3%—8.7%] (1)	0.0% [0.0%—7.3%] (0)	0.019 †
	not received % [95% CI] (n)	85.2% [74.3%—92.0%] (52)	67.3% [53.4%—78.8%] (33)
	<missing> % [95% CI] (n)	13.1% [6.8%—23.8%] (8)	32.7% [21.2%—46.6%] (16)
Location	US:NIH % [95% CI] (n)	18.0% [10.4%—29.5%] (11)	16.3% [8.5%—29.0%] (8)	0.33 †
	US:other % [95% CI] (n)	47.5% [35.5%—59.8%] (29)	40.8% [28.2%—54.8%] (20)
	Europe:Amsterdam % [95% CI] (n)	13.1% [6.8%—23.8%] (8)	10.2% [4.4%—21.8%] (5)
	Europe:other % [95% CI] (n)	14.8% [8.0%—25.7%] (9)	12.2% [5.7%—24.2%] (6)
	<missing> % [95% CI] (n)	6.6% [2.6%—15.7%] (4)	20.4% [11.5%—33.6%] (10)
†, Fisher's exact test (categorical). An adjusted P value of 0.00714 may be considered significant.
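How explicit_na() works internally is not shown here. The general idea of turning missing values into a level of their own can be sketched in base R with addNA(); the vector below is made up, and the "<missing>" label is borrowed from the table above:

```r
# Made-up factor with one missing value.
sex = factor(c("male", "female", NA, "male"))

# addNA() appends NA as a level; renaming that level makes it printable.
sex_explicit = addNA(sex)
levels(sex_explicit)[is.na(levels(sex_explicit))] = "<missing>"

# The missing value is now counted as its own group.
table(sex_explicit)
```

Once missingness is a level in its own right, every row contributes to the comparison, which is why a test can be computed.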

Non biomedical data

Beyond the biomedical example, tableone can make more general comparisons between data that has a structure like:

~ group + observation_1 + observation_2 + ... + observation_n

We will use the iris and the diamonds datasets to demonstrate this more general use case for tableone.


# revert the labeller setting to the default
# and additionally hide the footer.
old = options(
  "tableone.labeller"=NULL,
  "tableone.show_pvalue_method"=FALSE,
  "tableone.hide_footer"=TRUE)

# the heuristics detect that the petal measurements in the iris data set are
# not normally distributed and hence report median and IQR:
iris %>% dplyr::group_by(Species) %>% compare_population(tidyselect::everything())

		setosa	versicolor	virginica
Variable	Characteristic	Value (N=50)	Value (N=50)	Value (N=50)	P value
Sepal.Length	Mean ± SD	5.01 ± 0.352	5.94 ± 0.516	6.59 ± 0.636	<0.001
Sepal.Width	Mean ± SD	3.43 ± 0.379	2.77 ± 0.314	2.97 ± 0.322	<0.001
Petal.Length	Median [IQR]	1.5 [1.4—1.58]	4.35 [4—4.6]	5.55 [5.1—5.88]	<0.001
Petal.Width	Median [IQR]	0.2 [0.2—0.3]	1.3 [1.2—1.5]	2 [1.8—2.3]	<0.001


options(old)
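The summary statistics in the iris table can be cross-checked with base R. The sketch below computes the per-species mean and standard deviation for one column reported as Mean ± SD, and the quartiles for one column reported as Median [IQR]. The helper by_species is introduced here for illustration only:

```r
# Illustrative helper: apply a summary function to one iris column per species.
by_species = function(column, f, ...) {
  tapply(iris[[column]], iris$Species, f, ...)
}

sepal_mean = by_species("Sepal.Length", mean)
sepal_sd   = by_species("Sepal.Length", sd)

# Lower quartile, median and upper quartile for each species.
petal_quartiles = by_species("Petal.Length", quantile, probs = c(0.25, 0.5, 0.75))

signif(sepal_mean, 3)
signif(sepal_sd, 3)
petal_quartiles[["setosa"]]
```

The values should agree with the corresponding rows of the table to the displayed precision.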

The missing_diamonds data set which is included in this package has 10% of the values removed. This demonstrates the need for reporting the denominator.

# The counts sometimes seem redundant if there is no missing information:
# however in a data set with missing values the denominators are important:
missing_diamonds %>% describe_population(tidyselect::everything())

Variable	Characteristic	Value	Count (N=53940)
Carat	Median [IQR]	0.7 [0.4—1.04]	48682
Cut	Fair % [95% CI]	3.0% [2.8%—3.2%]	1454/48553
	Good % [95% CI]	9.2% [8.9%—9.5%]	4462/48553
	Very Good % [95% CI]	22.3% [21.9%—22.6%]	10816/48553
	Premium % [95% CI]	25.7% [25.3%—26.1%]	12460/48553
	Ideal % [95% CI]	39.9% [39.4%—40.3%]	19361/48553
Color	D % [95% CI]	12.5% [12.2%—12.8%]	6079/48569
	E % [95% CI]	18.3% [18.0%—18.6%]	8886/48569
	F % [95% CI]	17.7% [17.4%—18.1%]	8613/48569
	G % [95% CI]	20.9% [20.5%—21.2%]	10137/48569
	H % [95% CI]	15.4% [15.1%—15.7%]	7466/48569
	I % [95% CI]	10.0% [9.8%—10.3%]	4876/48569
	J % [95% CI]	5.2% [5.0%—5.4%]	2512/48569
Clarity	I1 % [95% CI]	1.4% [1.3%—1.5%]	664/48527
	SI2 % [95% CI]	17.0% [16.7%—17.4%]	8265/48527
	SI1 % [95% CI]	24.2% [23.8%—24.6%]	11756/48527
	VS2 % [95% CI]	22.7% [22.3%—23.1%]	11020/48527
	VS1 % [95% CI]	15.2% [14.8%—15.5%]	7355/48527
	VVS2 % [95% CI]	9.4% [9.2%—9.7%]	4570/48527
	VVS1 % [95% CI]	6.8% [6.6%—7.0%]	3298/48527
	IF % [95% CI]	3.3% [3.1%—3.5%]	1599/48527
Depth	Median [IQR]	61.8 [61—62.5]	48584
Table	Median [IQR]	57 [56—59]	48707
Price	Median [IQR]	2.41e+03 [952—5.33e+03]	48675
X	Median [IQR]	5.69 [4.72—6.54]	48577
Y	Median [IQR]	5.71 [4.72—6.54]	48578
Z	Median [IQR]	3.52 [2.91—4.03]	48559
Is_colored	clear % [95% CI]	30.7% [30.3%—31.1%]	16572/53940
Is_colored	colored % [95% CI]	69.3% [68.9%—69.7%]	37368/53940
Normal distributions determined by the Anderson-Darling test (P>0.005)
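The effect of the denominator can be seen by recomputing one row by hand. The counts below are copied from the Cut rows of the table; the percentage shown there uses the number of non-missing values, not the total number of rows:

```r
n_total = 53940  # all rows in the data set
n_cut   = 48553  # rows with a non-missing value for Cut
n_fair  = 1454   # rows where Cut is "Fair"

# Percentage among non-missing values, as reported in the table.
pct_observed = sprintf("%.1f%%", 100 * n_fair / n_cut)    # "3.0%"

# Percentage if the missing rows were wrongly kept in the denominator.
pct_all_rows = sprintf("%.1f%%", 100 * n_fair / n_total)  # "2.7%"

# Share of Cut values that are missing.
pct_missing = sprintf("%.1f%%", 100 * (1 - n_cut / n_total))  # "10.0%"
```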

tableone: Getting started