Title: | Descriptive Tables for Observational or Interventional Studies |
---|---|
Description: | Generating tabular summaries of data in a format suitable for reporting in journal articles is fiddly and slows down more detailed analysis. Comparing two populations with respect to an intervention and reporting the result is a task that can be largely automated. |
Authors: | Robert Challen [aut, cre] |
Maintainer: | Robert Challen <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.4.3 |
Built: | 2025-01-10 04:48:42 UTC |
Source: | https://github.com/bristol-vaccine-centre/tableone |
Convert a t1_shape object to a huxtable
## S3 method for class 't1_shape'
as_huxtable(
  x,
  ...,
  font_size = getOption("tableone.font_size", 8),
  font = getOption("tableone.font", "Arial"),
  footer_text = NULL,
  show_binary_value = NULL
)
x |
the |
... |
not used |
font_size |
(optional) the font size for the table in points |
font |
(optional) the font family for the table (which will be matched to closest on your system) |
footer_text |
any text that needs to be added at the end of the table,
setting this to FALSE disables the whole footer (as does
|
show_binary_value |
if set, this will filter the display of covariates that have exactly 2 possible values so that only this value is shown. |
a formatted table as a huxtable
Convert a t1_signif S3 class to a huxtable
This is responsible for printing the significance test results and comparison.
## S3 method for class 't1_signif'
as_huxtable(
  x,
  ...,
  layout = "compact",
  override_percent_dp = list(),
  override_real_dp = list(),
  p_format = names(.pvalue.defaults),
  font_size = getOption("tableone.font_size", 8),
  font = getOption("tableone.font", "Arial"),
  footer_text = NULL,
  show_binary_value = NULL
)
x |
the |
... |
not used |
layout |
(optional) various layouts are defined as default. As of this
version of |
override_percent_dp |
(optional) a named list of overrides for the default
precision of formatting percentages, following a |
override_real_dp |
(optional) a named list of overrides for the default
precision of formatting real values, following a |
p_format |
the format of the p-values: one of
"sampl", "nejm", "jama", "lancet", "aim" but any value
here is overridden by the |
font_size |
(optional) the font size for the table in points |
font |
(optional) the font family for the table (which will be matched to closest on your system) |
footer_text |
any text that needs to be added at the end of the table,
setting this to FALSE disables the whole footer (as does
|
show_binary_value |
if set, this will filter the display of covariates that have exactly 2 possible values so that only this value is shown. |
a formatted table as a huxtable
library(tableone)
tmp = iris %>%
  dplyr::group_by(Species) %>%
  as_t1_signif(tidyselect::everything()) %>%
  huxtable::as_huxtable()
Convert a t1_summary object to a huxtable
## S3 method for class 't1_summary'
as_huxtable(
  x,
  ...,
  layout = "single",
  override_percent_dp = list(),
  override_real_dp = list(),
  font_size = getOption("tableone.font_size", 8),
  font = getOption("tableone.font", "Arial"),
  footer_text = NULL,
  show_binary_value = NULL
)
x |
the |
... |
not used |
layout |
(optional) various layouts are defined as default. As of this
version of |
override_percent_dp |
(optional) a named list of overrides for the default
precision of formatting percentages, following a |
override_real_dp |
(optional) a named list of overrides for the default
precision of formatting real values, following a |
font_size |
(optional) the font size for the table in points |
font |
(optional) the font family for the table (which will be matched to closest on your system) |
footer_text |
any text that needs to be added at the end of the table,
setting this to FALSE disables the whole footer (as does
|
show_binary_value |
if set, this will filter the display of covariates that have exactly 2 possible values so that only this value is shown. |
a formatted table as a huxtable
The data set description is a simple summary of the data formats, types and missingness
as_t1_shape(df, ..., label_fn = label_extractor(df), units = extract_units(df))
df |
a dataframe of individual observations. Grouping, if present, is ignored.
(n.b. if you wanted to construct multiple summary tables a |
... |
the columns of variables we wish to summarise. This can be given as
a
which may be more convenient if you are going on to do a model fit. If the latter format the left hand side is ignored (outcomes are not usual in this kind of table). |
label_fn |
(optional) a function for mapping a co-variate column name to a
printable label. By default this is a no-operation and the output table
will contain the dataframe column names as labels. A simple alternative
would be some form of dplyr::case_when lookup, or a string function such
as stringr::str_to_sentence. (N.b. this function must be vectorised).
Any value provided here will be overridden by the
|
units |
(optional) a named list of units, following a |
a t1_shape
data frame.
tmp = iris %>% as_t1_shape( tidyselect::everything() )
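As a rough illustration of what a summary of data types and missingness contains, the base R sketch below tabulates the class and the proportion of missing values for each column. It is not the implementation of as_t1_shape; the helper name shape_sketch and the modified copy of iris are assumptions made for this example.

```r
# Hypothetical helper (not part of tableone): one row per column giving its
# class and the fraction of values that are missing.
shape_sketch <- function(df) {
  data.frame(
    variable = names(df),
    type = vapply(df, function(col) class(col)[1], character(1)),
    missing = vapply(df, function(col) mean(is.na(col)), numeric(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Example data: iris with the first 15 sepal lengths blanked out.
iris_na <- iris
iris_na$Sepal.Length[1:15] <- NA

shape <- shape_sketch(iris_na)
print(shape)
```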
The population comparison is a summary of the co-variates in a data set with no reference to outcome, but comparing intervention groups. It reports summary statistics for continuous data and counts for categorical data for each of the intervention groups, and reports on the significance of the association in relation to the intervention groups. It gives a clear summary of whether data is correlated with the intervention.
as_t1_signif(
  df,
  ...,
  label_fn = label_extractor(df),
  units = extract_units(df),
  override_type = list(),
  override_method = list()
)
df |
a dataframe of individual observations. If using the |
... |
the columns of variables we wish to summarise. This can be given as
a
which may be more convenient if you are going on to do a model fit later. If the latter format the left hand side is ignored (outcomes are not usual in this kind of table). |
label_fn |
(optional) a function for mapping a co-variate column name to a
printable label. By default this is a no-operation and the output table
will contain the dataframe column names as labels. A simple alternative
would be some form of dplyr::case_when lookup, or a string function such
as stringr::str_to_sentence. (N.b. this function must be vectorised).
Any value provided here will be overridden by the
|
units |
(optional) a named list of units, following a |
override_type |
(optional) a named list of data summary types. The
default type for a column in a data set is calculated using heuristics
depending on the nature of the data (categorical or continuous), and the result
of normality tests. If you want to override this the options are
"subtype_count","median_iqr","mean_sd","skipped" and you
specify this on a column-by-column basis with a named list (e.g.
|
override_method |
if you want to override the comparison method for a
particular variable the options are
"chi-sq trend","fisher","t-test","2-sided wilcoxon","2-sided ks","anova","kruskal-wallis","no comparison" and you
specify this on a column-by-column basis with a named list (e.g.
|
a t1_signif
dataframe.
tmp = iris %>%
  dplyr::group_by(Species) %>%
  as_t1_signif(tidyselect::everything())
tmp = diamonds %>%
  dplyr::group_by(is_colored) %>%
  as_t1_signif(tidyselect::everything())
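Several of the override_method option names correspond to familiar tests that are available in base R. The sketch below runs three of them directly on built-in data sets to show what kind of p-value each produces. How the package invokes its tests internally is not shown in this document, so treat this as an illustration of the underlying tests only.

```r
# "anova": parametric comparison of a continuous variable across three groups.
anova_fit <- stats::aov(Sepal.Length ~ Species, data = iris)
anova_p <- summary(anova_fit)[[1]][["Pr(>F)"]][1]

# "kruskal-wallis": rank-based alternative that does not assume normality.
kw_p <- stats::kruskal.test(Sepal.Length ~ Species, data = iris)$p.value

# "fisher": exact test on a 2x2 table of two categorical variables
# (transmission type against engine shape in mtcars).
fisher_p <- stats::fisher.test(table(mtcars$am, mtcars$vs))$p.value

print(c(anova = anova_p, kruskal_wallis = kw_p, fisher = fisher_p))
```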
The population description is a simple summary of the co-variates in a data set with no reference to outcome, and without comparing interventions (although it might contain intervention rates). It will report summary statistics for continuous data and counts for categorical data.
as_t1_summary(
  df,
  ...,
  label_fn = label_extractor(df),
  units = extract_units(df),
  override_type = list()
)
df |
a dataframe of individual observations. Grouping, if present, is ignored.
(n.b. if you wanted to construct multiple summary tables a |
... |
the columns of variables we wish to summarise. This can be given as
a
which may be more convenient if you are going on to do a model fit. If the latter format the left hand side is ignored (outcomes are not usual in this kind of table). |
label_fn |
(optional) a function for mapping a co-variate column name to a
printable label. By default this is a no-operation and the output table
will contain the dataframe column names as labels. A simple alternative
would be some form of dplyr::case_when lookup, or a string function such
as stringr::str_to_sentence. (N.b. this function must be vectorised).
Any value provided here will be overridden by the
|
units |
(optional) a named list of units, following a |
override_type |
(optional) a named list of data summary types. The
default type for a column in a data set is calculated using heuristics
depending on the nature of the data (categorical or continuous), and the result
of normality tests. If you want to override this the options are
"subtype_count","median_iqr","mean_sd","skipped" and you
specify this on a column-by-column basis with a named list (e.g.
|
a t1_summary
data frame.
tmp = iris %>% as_t1_summary(
  tidyselect::everything(),
  override_type = c(Petal.Length = "mean_sd", Petal.Width = "mean_sd")
)
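To make the idea of a type heuristic concrete, here is a minimal base R sketch that picks one of the summary type names for each column. The rule used (a Shapiro-Wilk test at a 5% threshold) and the helper name guess_summary_type are assumptions for illustration; the package's actual heuristics may well differ.

```r
# Hypothetical heuristic (an assumption, not tableone's actual rule):
# categorical data is counted; continuous data is summarised as mean and SD
# when a Shapiro-Wilk test does not reject normality, else as median and IQR.
guess_summary_type <- function(x, alpha = 0.05) {
  if (is.factor(x) || is.character(x) || is.logical(x)) {
    return("subtype_count")
  }
  x <- x[!is.na(x)]
  # shapiro.test needs 3 to 5000 non-missing values that are not all equal
  if (length(x) < 3 || length(x) > 5000 || length(unique(x)) == 1) {
    return("median_iqr")
  }
  if (stats::shapiro.test(x)$p.value >= alpha) "mean_sd" else "median_iqr"
}

types <- vapply(iris, guess_summary_type, character(1))
print(types)
```

An override then amounts to replacing individual entries of such a result by name, which is what a named list of overrides expresses.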
A list of columns for a test case
bad_test_cols
Test data
The missing data summary is a simple summary of the missingness of co-variates in a data set with no reference to outcome, but comparing intervention groups. It reports summary counts for missingness in data and reports on the significance of that missingness in relation to the intervention groups, allowing a clear summary of whether data is missing at random compared to the intervention.
compare_missing(
  df,
  ...,
  label_fn = label_extractor(df),
  p_format = names(.pvalue.defaults),
  font_size = getOption("tableone.font_size", 8),
  font = getOption("tableone.font", "Arial"),
  significance_limit = 0.05,
  missingness_limit = 0.1,
  footer_text = NULL,
  raw_output = FALSE
)
df |
a dataframe of individual observations. If using the |
... |
the columns of variables we wish to summarise. This can be given as
a
which may be more convenient if you are going on to do a model fit later. If the latter format the left hand side is ignored (outcomes are not usual in this kind of table). |
label_fn |
(optional) a function for mapping a co-variate column name to a
printable label. By default this is a no-operation and the output table
will contain the dataframe column names as labels. A simple alternative
would be some form of dplyr::case_when lookup, or a string function such
as stringr::str_to_sentence. (N.b. this function must be vectorised).
Any value provided here will be overridden by the
|
p_format |
the format of the p-values: one of
"sampl", "nejm", "jama", "lancet", "aim" but any value
here is overridden by the |
font_size |
(optional) the font size for the table in points |
font |
(optional) the font family for the table (which will be matched to closest on your system) |
significance_limit |
the limit at which we reject the hypothesis that the data is missing at random. |
missingness_limit |
the limit at which too much data is missing to include the predictor. |
footer_text |
any text that needs to be added at the end of the table,
setting this to FALSE disables the whole footer (as does
|
raw_output |
return comparison as tidy dataframe rather than formatted table |
a huxtable
formatted table.
# this option lets us change the column name for p value from its default
# "P value"
old = options("tableone.pvalue_column_name"="p-value")

# missing at random
missing_diamonds %>%
  dplyr::group_by(is_colored) %>%
  compare_missing(tidyselect::everything())

# nothing missing
iris %>%
  dplyr::group_by(Species) %>%
  compare_missing(tidyselect::everything())

# MNAR: by design missingness is correlated with grouping
mnar_two_class_1000 %>%
  dplyr::group_by(grouping) %>%
  compare_missing(tidyselect::everything())

options(old)
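The idea behind comparing missingness across groups can be sketched in base R: cross-tabulate a missing/present indicator against the grouping variable and test for association. The data below is constructed for illustration, and the use of Fisher's exact test is an assumption; this document does not state which test compare_missing applies.

```r
# Constructed example: 100 observations per group, with missing values
# deliberately concentrated in the treatment group.
group <- rep(c("control", "treatment"), each = 100)
value <- seq_len(200) / 10
value[c(1:5, 101:140)] <- NA   # 5 missing in control, 40 in treatment

# Cross-tabulate missingness against group and test for association.
missing_by_group <- table(group = group, missing = is.na(value))
p_value <- stats::fisher.test(missing_by_group)$p.value

# Same role as the significance_limit argument described above.
significance_limit <- 0.05
print(missing_by_group)
print(p_value < significance_limit)
```

A small p-value here suggests that missingness depends on the group, so the data cannot be treated as missing at random with respect to the intervention.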
The outcome table is a simple summary of a binary or categorical outcome in a data set compared by intervention groups. The comparison is independent of any covariates, and is a preliminary output prior to more formal statistical analysis or model fitting.
compare_outcomes(
  df,
  ...,
  label_fn = label_extractor(df),
  units = extract_units(df),
  override_type = list(),
  layout = "compact",
  override_percent_dp = list(),
  override_real_dp = list(),
  p_format = names(.pvalue.defaults),
  font_size = getOption("tableone.font_size", 8),
  font = getOption("tableone.font", "Arial"),
  footer_text = NULL,
  show_binary_value = NULL,
  raw_output = FALSE
)
df |
a dataframe of individual observations. If using the |
... |
the outcomes are specified either as a |
label_fn |
(optional) a function for mapping a co-variate column name to a
printable label. By default this is a no-operation and the output table
will contain the dataframe column names as labels. A simple alternative
would be some form of dplyr::case_when lookup, or a string function such
as stringr::str_to_sentence. (N.b. this function must be vectorised).
Any value provided here will be overridden by the
|
units |
(optional) a named list of units, following a |
override_type |
(optional) a named list of data summary types. The
default type for a column in a data set is calculated using heuristics
depending on the nature of the data (categorical or continuous), and the result
of normality tests. If you want to override this the options are
"subtype_count","median_iqr","mean_sd","skipped" and you
specify this on a column-by-column basis with a named list (e.g.
|
layout |
(optional) various layouts are defined as default. As of this
version of |
override_percent_dp |
(optional) a named list of overrides for the default
precision of formatting percentages, following a |
override_real_dp |
(optional) a named list of overrides for the default
precision of formatting real values, following a |
p_format |
the format of the p-values: one of
"sampl", "nejm", "jama", "lancet", "aim" but any value
here is overridden by the |
font_size |
(optional) the font size for the table in points |
font |
(optional) the font family for the table (which will be matched to closest on your system) |
footer_text |
any text that needs to be added at the end of the table,
setting this to FALSE disables the whole footer (as does
|
show_binary_value |
if set, this will filter the display of covariates that have exactly 2 possible values so that only this value is shown. |
raw_output |
return comparison as |
It reports summary counts for the outcomes and a measure of significance of the relationship between outcome and intervention. Interpretation of significance tests should include Bonferroni adjustment.
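A Bonferroni adjustment can be applied with base R once the unadjusted p-values are known. The p-values and outcome names below are made up for illustration and do not come from the package.

```r
# Hypothetical unadjusted p-values from three outcome comparisons.
p_raw <- c(outcome_a = 0.012, outcome_b = 0.030, outcome_c = 0.200)

# Bonferroni: multiply each p-value by the number of tests, capped at 1.
p_adj <- stats::p.adjust(p_raw, method = "bonferroni")
print(p_adj)

# Equivalent view: compare raw p-values against alpha / number of tests.
alpha <- 0.05
significant <- p_raw < alpha / length(p_raw)
print(significant)
```

With three tests only the first outcome remains significant at the 5% level after adjustment.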
a huxtable
formatted table.
The population comparison is a summary of the co-variates in a data set with no reference to outcome, but comparing intervention groups. It reports summary statistics for continuous data and counts for categorical data for each of the intervention groups, and reports on the significance of the association in relation to the intervention groups. It gives a clear summary of whether data is correlated with the intervention.
compare_population(
  df,
  ...,
  label_fn = label_extractor(df),
  units = extract_units(df),
  override_type = list(),
  override_method = list(),
  layout = "compact",
  override_percent_dp = list(),
  override_real_dp = list(),
  p_format = names(.pvalue.defaults),
  font_size = getOption("tableone.font_size", 8),
  font = getOption("tableone.font", "Arial"),
  footer_text = NULL,
  show_binary_value = NULL,
  raw_output = FALSE
)
df |
a dataframe of individual observations. If using the |
... |
the columns of variables we wish to summarise. This can be given as
a
which may be more convenient if you are going on to do a model fit later. If the latter format the left hand side is ignored (outcomes are not usual in this kind of table). |
label_fn |
(optional) a function for mapping a co-variate column name to a
printable label. By default this is a no-operation and the output table
will contain the dataframe column names as labels. A simple alternative
would be some form of dplyr::case_when lookup, or a string function such
as stringr::str_to_sentence. (N.b. this function must be vectorised).
Any value provided here will be overridden by the
|
units |
(optional) a named list of units, following a |
override_type |
(optional) a named list of data summary types. The
default type for a column in a data set is calculated using heuristics
depending on the nature of the data (categorical or continuous), and the result
of normality tests. If you want to override this the options are
"subtype_count","median_iqr","mean_sd","skipped" and you
specify this on a column-by-column basis with a named list (e.g.
|
override_method |
if you want to override the comparison method for a
particular variable the options are
"chi-sq trend","fisher","t-test","2-sided wilcoxon","2-sided ks","anova","kruskal-wallis","no comparison" and you
specify this on a column-by-column basis with a named list (e.g.
|
layout |
(optional) various layouts are defined as default. As of this
version of |
override_percent_dp |
(optional) a named list of overrides for the default
precision of formatting percentages, following a |
override_real_dp |
(optional) a named list of overrides for the default
precision of formatting real values, following a |
p_format |
the format of the p-values: one of
"sampl", "nejm", "jama", "lancet", "aim" but any value
here is overridden by the |
font_size |
(optional) the font size for the table in points |
font |
(optional) the font family for the table (which will be matched to closest on your system) |
footer_text |
any text that needs to be added at the end of the table,
setting this to FALSE disables the whole footer (as does
|
show_binary_value |
if set, this will filter the display of covariates that have exactly 2 possible values so that only this value is shown. |
raw_output |
return comparison as |
a huxtable
formatted table.
# the heuristics detect that Petals in the iris data set are not normally
# distributed and hence report median and IQR:
iris %>%
  dplyr::group_by(Species) %>%
  compare_population(tidyselect::everything())

# Missing data
old = options("tableone.show_pvalue_method"=FALSE)
missing_diamonds %>%
  dplyr::group_by(is_colored) %>%
  compare_population(-color, layout="relaxed")

tmp = missing_diamonds %>% explicit_na() %>% dplyr::group_by(is_colored)
tmp %>% compare_population(-color, footer_text = c(
  "IQR: Interquartile range; CI: Confidence interval",
  "Line two")
)
options(old)
Group data count and calculate proportions by column.
count_table(
  df,
  rowGroupVars,
  colGroupVars,
  numExpr = dplyr::n(),
  denomExpr = dplyr::n(),
  totalExpr = dplyr::n(),
  subgroupLevel = length(rowGroupVars),
  glue = list(`Count [%] (N={sprintf("%d",N)})` = "{sprintf(\"%d/%d [%1.1f%%]\", x, n, mean*100)}"),
  label_fn = label_extractor(df),
  font_size = getOption("tableone.font_size", 8),
  font = getOption("tableone.font", "Arial")
)
df |
a dataframe of linelist items |
rowGroupVars |
the rows of the table. The last one of these is the denominator grouping |
colGroupVars |
the column groupings of the table. |
numExpr |
defines how the numerator is defined in the context of the column and row groups (e.g. dplyr::n()) |
denomExpr |
defines how the denominator is defined in the context of the column and row (ungrouped one level) |
totalExpr |
defines how the column level total is defined |
subgroupLevel |
defines how the numerator grouping is defined in terms of the row groupings |
glue |
a named list of column value specifications. |
label_fn |
(optional) a function for mapping a co-variate column name to a
printable label. By default this is a no-operation and the output table
will contain the dataframe column names as labels. A simple alternative
would be some form of dplyr::case_when lookup, or a string function such
as stringr::str_to_sentence. (N.b. this function must be vectorised).
Any value provided here will be overridden by the
|
font_size |
(optional) the font size for the table in points |
font |
(optional) the font family for the table (which will be matched to closest on your system) |
a huxtable with the counts and proportions of the row groups
diamonds %>%
  count_table(dplyr::vars(cut,clarity), dplyr::vars(color), subgroupLevel = 1)
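The count and proportion by column, and the "x/n [p%]" cell format used in the default glue argument, can be reproduced in base R for a simple two-way table. This sketch uses the built-in mtcars data and does not call count_table; it only illustrates the arithmetic, with column totals assumed as the denominator.

```r
# Counts of cylinders (rows) by transmission type (columns) in mtcars.
counts <- table(cyl = mtcars$cyl, am = mtcars$am)

# Column totals act as the denominator for each column group.
totals <- colSums(counts)

x <- as.vector(counts)                # numerators, in column-major order
n <- as.vector(totals[col(counts)])   # matching column total for each cell

# Format every cell as "x/n [p%]", mirroring the default glue template.
cells <- matrix(
  sprintf("%d/%d [%1.1f%%]", x, as.integer(n), 100 * x / n),
  nrow = nrow(counts),
  dimnames = dimnames(counts)
)
print(cells, quote = FALSE)
```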
Deals with some annoying issues classifying integer data sets, such as ages, into groups, where you want to specify just the change-over points as integers and clearly label the resulting ordered factor.
cut_integer(
  x,
  cut_points,
  glue = "{label}",
  lower_limit = -Inf,
  upper_limit = Inf,
  ...
)
x |
a vector of integer valued numbers, e.g. ages, counts |
cut_points |
a vector of integer valued cut points which define the lower, inclusive boundary of each group |
glue |
a glue spec that may be used to generate a label. It can use low, high, next_low, or label as values. |
lower_limit |
the minimum value we should include (this is inclusive for the bottom category) (default -Inf) |
upper_limit |
the maximum value we should include (this is also inclusive for the top category) (default Inf) |
... |
not used |
an ordered factor of the integer
cut_integer(stats::rbinom(20,20,0.5), c(5,10,15))
cut_integer(floor(stats::runif(100,-10,10)), cut_points = c(2,3,4,6), lower_limit=0, upper_limit=10)
cut_integer(1:10, cut_points = c(1,3,9))
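For comparison, the same kind of grouping can be approximated with base R's cut(), treating each cut point as the inclusive lower bound of a group. This is a hypothetical analogue rather than the cut_integer implementation, and the label style it produces ("5-9", "15+") is an assumption that may not match the package output.

```r
# Hypothetical base R analogue (not the cut_integer implementation): each cut
# point is the inclusive lower bound of a group; labels show integer ranges.
cut_integer_sketch <- function(x, cut_points) {
  breaks <- c(-Inf, sort(unique(cut_points)), Inf)
  low <- breaks[-length(breaks)]
  high <- breaks[-1] - 1   # inclusive integer upper bound of each group
  labels <- ifelse(
    is.infinite(low), paste0("<", breaks[2]),
    ifelse(is.infinite(high), paste0(low, "+"), paste0(low, "-", high))
  )
  # right = FALSE makes intervals closed on the left: [low, next_low)
  cut(x, breaks = breaks, labels = labels, right = FALSE, ordered_result = TRUE)
}

ages <- c(3, 5, 9, 10, 14, 15, 40)
groups <- cut_integer_sketch(ages, c(5, 10, 15))
print(groups)
```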
Customisation of output can use one of these entries as a starting point. A custom layout should look like one of the entries in level 2 of this nested list, containing 4 named entries, one for each type of table summary.
default.format
A named list of lists:
The name of the table layout.
The name of the summary type required, one of subtype_count, median_iqr, mean_sd, skipped.
A named list of column = glue specification pairs. The column (itself a glue spec) might reference N_total, N_present or .unit but typically will be a fixed string; it defines the name of the table column to generate. The glue specification defines the layout of that column, and can use summary statistics as below:
  subtype_count can use level, prob.0.5, prob.0.025, prob.0.975, unit, n, N. n is subgroup count, N is data count.
  median_iqr can use q.0.5, q.0.25, ..., unit, n, N; n excludes missing, N does not.
  mean_sd can use mean, sd, unit, n, N; n excludes missing, N does not.
  skipped can use n, N; n excludes missing, N does not.
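The structure described above, four named entries with one per summary type, can be sketched as a plain R list. The column names and glue specifications below are illustrative assumptions rather than the package defaults, and q.0.75 is assumed to be among the quantiles elided by "..." in the description.

```r
# Hypothetical custom layout: one entry per summary type, each a named list
# of column name = glue specification pairs. Column names and specifications
# are illustrative assumptions, not the package defaults.
custom_layout <- list(
  subtype_count = list("Count (N={N_total})" = "{level}: {n}/{N}"),
  median_iqr    = list("Median [IQR]" = "{q.0.5} {unit} [{q.0.25} to {q.0.75}]"),
  mean_sd       = list("Mean (SD)" = "{mean} {unit} ({sd})"),
  skipped       = list("Count" = "{n}/{N}")
)
str(custom_layout)
```

Inspecting an entry of default.format is the safer starting point for a real layout, since it shows the exact names the package expects.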
The population description is a simple summary of the co-variates in a data set with no reference to outcome, and without comparing interventions (although it might contain intervention rates). It will report summary statistics for continuous data and counts for categorical data.
describe_data(
  df,
  ...,
  label_fn = label_extractor(df),
  units = extract_units(df),
  layout = "single",
  font_size = getOption("tableone.font_size", 8),
  font = getOption("tableone.font", "Arial"),
  footer_text = NULL,
  raw_output = FALSE
)
df |
a dataframe of individual observations. Grouping, if present, is ignored.
(n.b. if you wanted to construct multiple summary tables a |
... |
the columns of variables we wish to summarise. This can be given as
a
which may be more convenient if you are going on to do a model fit. If the latter format the left hand side is ignored (outcomes are not usual in this kind of table). |
label_fn |
(optional) a function for mapping a co-variate column name to a
printable label. By default this is a no-operation and the output table
will contain the dataframe column names as labels. A simple alternative
would be some form of dplyr::case_when lookup, or a string function such
as stringr::str_to_sentence. (N.b. this function must be vectorised).
Any value provided here will be overridden by the
|
units |
(optional) a named list of units, following a |
layout |
(optional) various layouts are defined as default. As of this
version of |
font_size |
(optional) the font size for the table in points |
font |
(optional) the font family for the table (which will be matched to closest on your system) |
footer_text |
any text that needs to be added at the end of the table,
setting this to FALSE disables the whole footer (as does
|
raw_output |
return comparison as |
a huxtable
formatted table.
# Overriding the heuristics is possible:
iris %>% describe_data(tidyselect::everything())
diamonds %>%
  dplyr::group_by(is_colored) %>%
  describe_data(tidyselect::everything())
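The label_fn argument described above must be vectorised: it receives a character vector of column names and returns labels of the same length. A minimal base R sketch of such a function, using a named lookup vector, might look like the following. The names label_map and label_lookup, and the labels themselves, are hypothetical.

```r
# Hypothetical lookup table from column names to printable labels.
label_map <- c(
  Sepal.Length = "Sepal length",
  Sepal.Width  = "Sepal width",
  Species      = "Iris species"
)

# Vectorised: takes a character vector of column names and returns a
# character vector of the same length, falling back to the name itself
# when no label has been defined.
label_lookup <- function(col_names) {
  out <- unname(label_map[col_names])
  ifelse(is.na(out), col_names, out)
}

result <- label_lookup(c("Species", "Petal.Width", "Sepal.Length"))
print(result)
```

A function of this shape could then be supplied as label_fn in place of a string function such as stringr::str_to_sentence.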
The population description is a simple summary of the co-variates in a data set with no reference to outcome, and without comparing interventions (although it might contain intervention rates). It will report summary statistics for continuous data and counts for categorical data.
describe_population(
  df,
  ...,
  label_fn = label_extractor(df),
  units = extract_units(df),
  override_type = list(),
  layout = "single",
  override_percent_dp = list(),
  override_real_dp = list(),
  font_size = getOption("tableone.font_size", 8),
  font = getOption("tableone.font", "Arial"),
  footer_text = NULL,
  show_binary_value = NULL,
  raw_output = FALSE
)
df |
a dataframe of individual observations. Grouping, if present, is ignored.
(n.b. if you wanted to construct multiple summary tables a |
... |
the columns of variables we wish to summarise. This can be given as
a
which may be more convenient if you are going on to do a model fit. If the latter format the left hand side is ignored (outcomes are not usual in this kind of table). |
label_fn |
(optional) a function for mapping a co-variate column name to a
printable label. By default this is a no-operation and the output table
will contain the dataframe column names as labels. A simple alternative
would be some form of dplyr::case_when lookup, or a string function such
as stringr::str_to_sentence. (N.b. this function must be vectorised).
Any value provided here will be overridden by the
|
units |
(optional) a named list of units, following a |
override_type |
(optional) a named list of data summary types. The
default type for a column in a data set is calculated using heuristics
depending on the nature of the data (categorical or continuous), and the result
of normality tests. If you want to override this the options are
"subtype_count","median_iqr","mean_sd","skipped" and you
specify this on a column-by-column basis with a named list (e.g.
|
layout |
(optional) various layouts are defined as default. As of this
version of |
override_percent_dp |
(optional) a named list of overrides for the default
precision of formatting percentages, following a |
override_real_dp |
(optional) a named list of overrides for the default
precision of formatting real values, following a |
font_size |
(optional) the font size for the table in points |
font |
(optional) the font family for the table (which will be matched to closest on your system) |
footer_text |
any text that needs to be added at the end of the table,
setting this to FALSE disables the whole footer (as does
|
show_binary_value |
if set this will filter the display of covariates where the number of possibilities is exactly 2 to this value. |
raw_output |
return comparison as |
a huxtable
formatted table.
# the heuristics detect that Petals in the iris data set are not normally
# distributed and hence report median and IQR:
iris %>% describe_population(tidyselect::everything())

# Overriding the heuristics is possible:
iris %>% describe_population(
  tidyselect::everything(),
  override_type = c(Petal.Length = "mean_sd", Petal.Width = "mean_sd")
)

# The counts sometimes seem redundant if there is no missing information:
diamonds %>% describe_population(tidyselect::everything())

# however in a data set with missing values the denominators are important:
missing_diamonds %>% describe_population(tidyselect::everything())

# for factor levels we can make the missing values more explicit
missing_diamonds %>% explicit_na() %>% describe_population(tidyselect::everything())

# in the output above the price variable is not presented the way we would
# like so here we override the number of decimal places shown for the price
# variable. While we are at it we will use a mid point for the decimal point,
# and make the variable labels sentence case.
old = options("tableone.dp"="\u00B7")
missing_diamonds %>% explicit_na() %>% describe_population(
  tidyselect::everything(),
  label_fn=stringr::str_to_sentence,
  override_real_dp=list(price=6)
)
options(old)
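The override_type documentation describes the default summary type as a heuristic based on the nature of the data and on normality tests. The following base R sketch shows one way such a heuristic could look. It is not the package's implementation: guess_summary_type is a hypothetical helper, and the use of the Shapiro-Wilk test with a 0.05 threshold is an assumption made for illustration.

```r
# Hypothetical helper, not part of tableone: picks a summary type label.
guess_summary_type = function(x, alpha = 0.05) {
  # Discrete data (factor, character, logical) is summarised as counts.
  if (is.factor(x) || is.character(x) || is.logical(x)) return("subtype_count")
  x = x[!is.na(x)]
  # shapiro.test() needs 3 to 5000 non-missing values that are not all equal;
  # outside that range this sketch falls back to the non-parametric summary.
  if (length(x) < 3 || length(x) > 5000 || length(unique(x)) < 2) {
    return("median_iqr")
  }
  # Assumption: a small p-value is taken as evidence against normality.
  if (stats::shapiro.test(x)$p.value < alpha) "median_iqr" else "mean_sd"
}

sapply(iris, guess_summary_type)
```

The returned labels reuse the option names listed for override_type, so the result reads like the per-column choice that an override would replace.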
with a binary class is_colored based on the color column
diamonds
Test data
Converts NA values in any factors in the dataframe into a new level.
This is a thin wrapper for forcats::fct_explicit_na(), but the missing
value level is added regardless of whether any values are missing. This forces an
empty row in count tables.
explicit_na(df, na_level = "<missing>", hide_if_empty = FALSE)
df |
the data frame |
na_level |
a label for NA valued factors |
hide_if_empty |
don't add a missing data category if no data is missing |
the dataframe with all factor columns containing explicit NA values
# before
missing_diamonds %>% dplyr::group_by(cut) %>% dplyr::count()
# after
missing_diamonds %>% explicit_na() %>% dplyr::group_by(cut) %>% dplyr::count()
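The idea of adding the missing level regardless of whether any value is missing can be sketched in base R for a single factor. This is only an illustration of the behaviour described above; add_missing_level is a hypothetical helper and not the code used by explicit_na().

```r
# Hypothetical helper, not part of tableone or forcats.
add_missing_level = function(f, na_level = "<missing>") {
  stopifnot(is.factor(f))
  # Append the new level even when no value is NA, then recode the NAs.
  out = factor(f, levels = c(levels(f), na_level))
  out[is.na(out)] = na_level
  out
}

# A factor with one NA: the NA is counted under the new level.
table(add_missing_level(factor(c("a", "b", NA, "a"))))

# A factor without NAs: the level is still present, with a zero count.
table(add_missing_level(factor(c("a", "b"))))
```

The zero count in the second call corresponds to the empty row that is forced in count tables.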
Get summary comparisons and statistics between variables as raw data.
extract_comparison(
  df,
  ...,
  label_fn = label_extractor(df),
  override_type = list(),
  p_format = names(.pvalue.defaults),
  override_method = list(),
  power_analysis = FALSE,
  override_power = list(),
  raw_output = FALSE
)
df |
a dataframe of individual observations. If using the |
... |
the outcomes are specified either as a |
label_fn |
(optional) a function for mapping a covariate column name to a
printable label. By default this is a no-operation, and the output table
will contain the dataframe column names as labels. A simple alternative
would be some form of dplyr::case_when lookup, or a string function such
as stringr::str_to_sentence. (N.B. this function must be vectorised.)
Any value provided here will be overridden by the
|
override_type |
(optional) a named list of data summary types. The
default type for a column in a data set is calculated using heuristics
depending on the nature of the data (categorical or continuous) and the result
of normality tests. If you want to override this, the options are
"subtype_count","median_iqr","mean_sd","skipped", and you
specify this on a column-by-column basis with a named list (e.g
|
p_format |
the format of the p-values: one of
"sampl", "nejm", "jama", "lancet", "aim" but any value
here is overridden by the |
override_method |
if you want to override the comparison method for a
particular variable, the options are
"chi-sq trend","fisher","t-test","2-sided wilcoxon","2-sided ks","anova","kruskal-wallis","no comparison", and you
specify this on a column-by-column basis with a named list (e.g
|
power_analysis |
conduct sample size based power analysis. |
override_power |
if you want to override the power calculation method for a
particular variable, the options are
"fisher","t-test","2-sided wilcoxon","2-sided ks","anova","kruskal-wallis","no comparison", and you
specify this on a column-by-column basis with a named list (e.g
|
raw_output |
return comparison as |
a list of accessor functions for the summary data allowing granular access to the results of the analysis:
comparison$compare(.variable, .characteristic = NULL) - prints a comparison between the different intervention groups for the specified variable (and optionally the given characteristic if it is a categorical variable).
comparison$filter(.variable, .intervention = NULL, .characteristic = NULL) - extracts a given variable (e.g. gender), optionally for a given level of intervention (e.g. control) and, if categorical, a given characteristic (e.g. male). This will output a dataframe with all the calculated summary variables, for all qualifying intervention, variable and characteristic combinations, and the significance tests (and power analyses) for the qualifying variable (comparing intervention groups).
comparison$signif_tests(.variable) - extracts for a given variable (e.g. gender) the significance tests (and optionally power analyses) of the univariate comparison between different interventions and the variable.
comparison$summary_stats(.variable, .intervention = NULL, .characteristic = NULL) - extracts a given variable (e.g. gender), optionally for a given level of intervention (e.g. control) and, if categorical, a given characteristic (e.g. male). This returns only the summary stats for all qualifying intervention, variable and characteristic combinations.
Extracts units set as dataframe column attributes
extract_units(df)
df |
the data frame from |
a named list of column / unit pairs.
iris = iris %>% set_units(-Species, units="mm")
iris %>% extract_units()
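Storing units as column attributes and reading them back as a named list can be sketched in base R. The helpers set_unit_attr and get_unit_attrs are hypothetical, and the attribute name "unit" is assumed from the description of set_units(); the package's own functions may behave differently.

```r
# Hypothetical helpers, not part of tableone.
set_unit_attr = function(df, cols, unit) {
  # Attach the unit string to each selected column as an attribute.
  for (col in cols) attr(df[[col]], "unit") = unit
  df
}

get_unit_attrs = function(df) {
  units = lapply(df, function(col) attr(col, "unit", exact = TRUE))
  # Keep only the columns that actually carry a unit.
  units[!vapply(units, is.null, logical(1))]
}

measured = set_unit_attr(iris, setdiff(colnames(iris), "Species"), "mm")
get_unit_attrs(measured)
```

Columns without a unit (here Species) are simply absent from the resulting list.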
Uses the default formatter set globally in options("tableone.pvalue_formatter")
in preference to the one defined by p_format,
which is only used if no default is set.
format_pvalue(p.value, p_format = names(.pvalue.defaults))
p.value |
the p-value to be formatted |
p_format |
a name of a p-value formatter (one of sampl, nejm, jama, lancet, aim) |
a formatted P-value
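The precedence described for format_pvalue (a globally set formatter wins, otherwise the named format is used) can be sketched as follows. Everything here is hypothetical: formatters, format_p_sketch, and the two formats with their thresholds are illustrative only and are not the journal styles implemented by the package. The sketch also assumes that the global option, when set, holds a function.

```r
# Hypothetical named formatters with illustrative thresholds.
formatters = list(
  three_dp = function(p) ifelse(p < 0.001, "<0.001", sprintf("%1.3f", p)),
  two_dp = function(p) ifelse(p < 0.01, "<0.01", sprintf("%1.2f", p))
)

format_p_sketch = function(p, p_format = names(formatters)) {
  # Assumption: the global option, when set, is a formatting function.
  global = getOption("tableone.pvalue_formatter", default = NULL)
  if (is.function(global)) return(global(p))
  # Otherwise fall back to the named format (the first one by default).
  p_format = match.arg(p_format)
  formatters[[p_format]](p)
}

format_p_sketch(c(0.0004, 0.0234))
format_p_sketch(0.0234, p_format = "two_dp")

# A globally set formatter overrides p_format:
old = options("tableone.pvalue_formatter" = function(p) sprintf("p=%1.1f", p))
format_p_sketch(0.0234, p_format = "two_dp")
options(old)
```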
At some point we need to take information from the tables produced by
tableone
and place it into the main text of the document. It is annoying
if this cannot be done automatically. The group_comparison()
function enables
extraction of one or more head-to-head comparisons and provides a fairly
flexible mechanism for building the precise format desired.
group_comparison(
  t1_signif,
  variable = NULL,
  subgroup = NULL,
  intervention = NULL,
  percent_fmt = "%1.1f%%",
  p_format = names(.pvalue.defaults),
  no_summary = FALSE,
  summary_glue = NULL,
  summary_arrange = NULL,
  summary_sep = ", ",
  summary_last = " versus ",
  no_signif = FALSE,
  signif_glue = NULL,
  signif_sep = NULL,
  signif_last = NULL
)
t1_signif |
a |
variable |
a variable or set of variables to compare. If missing, a
set of appropriate values is displayed based on the columns of |
subgroup |
a subgroup or set of subgroups to compare. |
intervention |
the side or sides of the intervention to select. N.b. using this effectively prevents any statistical comparison as only one side will be available. |
percent_fmt |
a |
p_format |
the format of the p-values: one of
"sampl", "nejm", "jama", "lancet", "aim" but any value
here is overridden by the |
no_summary |
only extract significance test values |
summary_glue |
a glue specification that maps the summary statistics to a readable string. |
summary_arrange |
an expression by which to order the summary output |
summary_sep |
a separator to combine the summary output (see |
summary_last |
a separator to combine the last 2 summary outputs (see |
no_signif |
do not try to include significance in the output. Sometimes
this is the only option if there is not enough of the comparison retained
by the |
signif_glue |
a glue specification that maps the combined summary output with the result of the significance tests, to give a complete comparison. |
signif_sep |
a separator to combine complete comparisons (see |
signif_last |
a separator to combine the last 2 complete comparisons (see |
ideally a single string, but various things will be returned depending on how much the input is constrained, and sometimes the output will provide guidance about what to do next. The function is intended to be used interactively until a satisfactory result is obtained.
tmp = diamonds %>%
  dplyr::group_by(is_colored) %>%
  set_units(price,units="£") %>%
  compare_population(-color, raw_output=TRUE)

# The tabular output is retrieved by converting to a huxtable
# as_huxtable(tmp, layout="simple")

# An unqualified group_comparison call gives informative messages
# about what can be compared:
tmp %>% group_comparison()

# filtering down the data gets us to a specific comparison:
tmp %>% group_comparison(variable = "cut", subgroup="Fair") %>% dplyr::glimpse()

# With further interactive exploration the
# data available for that comparison can be made into a glue string
tmp %>% group_comparison(
  variable = "cut", subgroup="Fair", intervention = "clear",
  summary_glue = "{is_colored}: {x}/{n} ({prob.0.5}%)",
  signif_glue = "{variable}={subgroup}; {text}; Overall p-value for '{variable}': {p.value}."
)

# group comparisons above using many individual subgroups are a bit confusing because
# the p-value is at the variable level. This is less of an issue for continuous
# or binary values.
tmp %>% group_comparison(
  variable = "price",
  summary_glue = "{is_colored}: {unit}{q.0.5}; IQR: {q.0.25} \u2014 {q.0.75} (n={n})",
  signif_glue = "{variable}: {text}; P-value {p.value}."
)

# Sometimes we only want to extract a p-value:
tmp %>% group_comparison(variable = "cut", subgroup="Fair", no_summary=TRUE) %>% dplyr::glimpse()
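The summary_sep and summary_last parameters control how the per-group summary strings are joined into one sentence fragment. A minimal base R sketch of that kind of joining is given here, assuming behaviour similar to glue::glue_collapse() with its sep and last arguments. collapse_sketch is a hypothetical helper and the input strings are made up for illustration.

```r
# Hypothetical helper, not part of tableone.
collapse_sketch = function(x, sep = ", ", last = " versus ") {
  n = length(x)
  # Zero or one item: nothing to join.
  if (n <= 1) return(paste0(x, collapse = ""))
  # Join all but the final item with `sep`, then attach the final item with `last`.
  paste0(paste0(x[-n], collapse = sep), last, x[n])
}

# Two groups are joined by `last` only:
collapse_sketch(c("group A: 10/100", "group B: 20/100"))

# Three or more groups use `sep` between the earlier items:
collapse_sketch(c("group A", "group B", "group C"))
```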
Retrieve column labels that are embedded as an attribute of each column.
label_extractor(df, ..., attribute = "label")
df |
a dataframe containing some labels |
... |
additional string manipulation functions to apply e.g. |
attribute |
the name of the label containing attribute (defaults to |
a labelling function. This is specific to the dataframe provided in df
iris = set_labels(iris, c(
  "Sepal Length", "Sepal Width", "Petal Length", "Petal Width", "Species"
))
fn = label_extractor(iris,tolower)
fn(colnames(iris))
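A labelling function that is specific to one dataframe can be built as a closure over a lookup of label attributes. The base R sketch below illustrates that idea only; label_sketch is a hypothetical helper, and falling back to the column name when no label is found is an assumption rather than documented package behaviour.

```r
# Hypothetical helper, not part of tableone.
label_sketch = function(df, ..., attribute = "label") {
  fns = list(...)
  # Build a named lookup: column name -> label (or the name itself if unlabelled).
  lookup = vapply(colnames(df), function(col) {
    lbl = attr(df[[col]], attribute, exact = TRUE)
    if (is.null(lbl)) col else as.character(lbl)
  }, character(1))
  # The returned function is vectorised over column names.
  function(cols) {
    out = unname(lookup[cols])
    # Assumption: unknown columns fall back to their own name.
    out[is.na(out)] = cols[is.na(out)]
    # Apply any extra string manipulation functions in order.
    for (fn in fns) out = fn(out)
    out
  }
}

df = data.frame(a = 1:2, b = 3:4)
attr(df$a, "label") = "Alpha"
fn = label_sketch(df, toupper)
fn(c("a", "b", "z"))
```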
It is simpler for presentation, and sometimes more correct, for discrete-valued data to be represented as factors. Such discrete-valued data might be logical values, character values, or numeric values with a limited number of levels (e.g. scores). This function lets you convert (a subset of) data frame columns into factors using
make_factors(
  df,
  ...,
  .logical = c("yes", "no"),
  .numeric = "{name}={value}",
  .character = NULL
)
df |
a data frame |
... |
either a |
.logical |
(optional) a length 2 vector defining the levels of TRUE, then FALSE. |
.numeric |
(optional) if provided it must either be a named list e.g.
|
.character |
in general character columns are converted into a factor with
the default levels. To explicitly set levels a named list can be given here
which |
a dataframe with the columns converted to factors
iris %>%
  make_factors(tidyselect::ends_with("Length"), .numeric = "{name}={round(value)}") %>%
  dplyr::glimpse()

# Convert everything in diamonds to be a factor, rounding all
# the numeric values and converting all the names to upper case
tmp = diamonds %>%
  dplyr::mutate(is_colored = color > "F") %>%
  make_factors(tidyselect::everything(), .numeric="{toupper(name)}={round(value)}")

# as we included `price` which has very many levels one factor is unusable with 11602 levels:
length(levels(tmp$price))

# we could explicitly exclude it from the `tidyselect` syntax `...` parameter:
diamonds %>%
  dplyr::mutate(is_colored = color > "F") %>%
  make_factors(-price, .numeric="{toupper(name)}={round(value)}") %>%
  dplyr::glimpse()

# or alternatively we set a limit on the maximum number of factors, which
# in this example picks up the `depth` and `table` columns as exceeding this
# new limit:
old = options("tableone.max_discrete_levels"=16)
diamonds %>%
  dplyr::mutate(is_colored = color > "F") %>%
  make_factors(tidyselect::everything(), .numeric="{toupper(name)}={round(value)}") %>%
  dplyr::glimpse()
options(old)

# converting a character vector. Here we specify `.character` as a list giving the
# possible levels of `alpha2`. Values outside of this list are converted to `NA`
set.seed(100)
eg_character = tibble::tibble(
  alpha1 = sample(letters,50,replace=TRUE),
  alpha2 = sample(LETTERS,50,replace=TRUE)
)
eg_character %>%
  make_factors(tidyselect::everything(), .character = list(alpha2 = LETTERS[3:20]))
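The default conversions (.logical = c("yes", "no") and .numeric = "{name}={value}") can be pictured with a base R sketch for a single column. as_factor_sketch is a hypothetical helper and does not reproduce the glue-based templating or the level limits of make_factors(); it only shows the shape of the resulting factors.

```r
# Hypothetical helper, not part of tableone.
as_factor_sketch = function(x, name, logical_levels = c("yes", "no")) {
  if (is.logical(x)) {
    # TRUE maps to the first level, FALSE to the second; NA stays NA.
    factor(ifelse(x, logical_levels[1], logical_levels[2]), levels = logical_levels)
  } else if (is.numeric(x)) {
    # Levels are "name=value", ordered by the numeric value.
    values = sort(unique(x))
    factor(paste0(name, "=", x), levels = paste0(name, "=", values))
  } else {
    # Character data gets the default (alphabetical) levels.
    factor(x)
  }
}

as_factor_sketch(c(TRUE, FALSE, TRUE), "flag")
as_factor_sketch(c(2, 1, 2), "score")
```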
with 10% of entries replaced by NA and a binary class is_colored based on the color column
missing_diamonds
Test data
A random test dataset with 2 classes (groupings column), one of which has 10% missing data and the other 20%.
mnar_two_class_1000
Test data
A multi-class dataset with equal random samples in each class
multi_class_negative
Test data
Columns contain a set of random data of different types, e.g. uniform continuous, normal, binomial, multinomial.
one_class_test_100
Test data
Columns contain a set of random data of different types, e.g. uniform continuous, normal, binomial, multinomial.
one_class_test_1000
Test data
Comparing missingness by looking at a table is good but we also want to update models to exclude missing data from the predictors.
remove_missing(
  df,
  ...,
  label_fn = label_extractor(df),
  significance_limit = 0.05,
  missingness_limit = 0.1
)
df |
a dataframe of individual observations. If using the |
... |
a list of formulae that specify the models that we want to check |
label_fn |
(optional) a function for mapping a covariate column name to a
printable label. By default this is a no-operation, and the output table
will contain the dataframe column names as labels. A simple alternative
would be some form of dplyr::case_when lookup, or a string function such
as stringr::str_to_sentence. (N.B. this function must be vectorised.)
Any value provided here will be overridden by the
|
significance_limit |
the limit at which we reject the hypothesis that the data is missing at random. |
missingness_limit |
the limit at which too much data is missing to include the predictor. |
a list of formulae with missing parameters removed
df = iris %>% dplyr::mutate(Petal.Width = ifelse(
  stats::runif(dplyr::n()) < dplyr::case_when(
    Species == "setosa" ~ 0.2,
    Species == "virginica" ~ 0.1,
    TRUE ~ 0
  ),
  NA,
  Petal.Width
))
remove_missing(df,
  ~ Species + Petal.Width + Sepal.Width,
  ~ Species + Petal.Length + Sepal.Length
)
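The effect of missingness_limit on a model formula can be sketched in base R. This covers only the "too much data is missing" rule for simple one-sided additive formulae and ignores the significance test of missingness entirely. drop_sparse_terms is a hypothetical helper, and treating a predictor as acceptable when its missing fraction is at or below the limit is an assumption.

```r
# Hypothetical helper, not part of tableone.
drop_sparse_terms = function(df, formula, missingness_limit = 0.1) {
  vars = all.vars(formula)
  # Fraction of missing values for each predictor in the formula.
  frac_missing = vapply(vars, function(v) mean(is.na(df[[v]])), numeric(1))
  # Assumption: predictors at or below the limit are retained.
  keep = vars[frac_missing <= missingness_limit]
  # Rebuild a one-sided formula (this fails if no predictor is retained).
  stats::reformulate(keep)
}

toy = data.frame(
  a = c(1, NA, NA, 4),   # 50% missing
  b = 1:4,               # complete
  c = c(NA, 2, 3, 4)     # 25% missing
)
drop_sparse_terms(toy, ~ a + b + c, missingness_limit = 0.3)
```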
Set a label attribute
set_labels(df, labels, attribute = "label")
df |
a dataframe |
labels |
a vector of labels, one for each column |
attribute |
the name of the label attribute (defaults to |
the same dataframe with each column labelled
iris = set_labels(iris, c(
  "Sepal Length", "Sepal Width", "Petal Length", "Petal Width", "Species"
))
fn = label_extractor(iris,tolower)
fn(colnames(iris))
Set a unit attribute
set_units(df, ..., units)
df |
a dataframe |
... |
a tidyselect specification or a formula |
units |
a list of units as strings, which must be either of length 1 or the same length as the columns matched by the tidyselect. |
the dataframe with the unit attribute updated
iris = iris %>% set_units(-Species, units="mm")
iris %>% extract_units()
A list of columns for a test case
test_cols
Test data
Columns contain a set of random data of different types, e.g. uniform continuous, normal, binomial, multinomial. In grouping 1 there are 100 items; in grouping 2 there are 1000 items.
two_class_test
one_class_test_100
Test data