Nested dataframes and purrr style list columns

library(interfacer)

Nesting & list columns

interfacer is designed to work with list columns of the kind generated by purrr. A purrr style list column may hold any data type. Consider, for example, the following dataframe, which contains one regular factor column alongside three list columns holding nested dataframes, S3 lm objects and matrices respectively:


tmp = iris %>% 
  tidyr::nest(by_species = -Species) %>%
  dplyr::mutate(
    model = purrr::map(by_species, ~ stats::lm(Sepal.Length ~ Sepal.Width, .x)),
    quantiles = purrr::map(by_species, ~ sapply(.x, quantile))
  )

tmp %>% dplyr::glimpse()
#> Rows: 3
#> Columns: 4
#> $ Species    <fct> setosa, versicolor, virginica
#> $ by_species <list> [<tbl_df[50 x 4]>], [<tbl_df[50 x 4]>], [<tbl_df[50 x 4]>]
#> $ model      <list> [2.6390012, 0.6904897, 0.04428474, 0.18952960, -0.14856834,…
#> $ quantiles  <list> <<matrix[5 x 4]>>, <<matrix[5 x 4]>>, <<matrix[5 x 4]>>
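
As a quick check of what each list column holds, the class of the first element in each column can be inspected directly. This is a small sketch using only purrr and base R; the name col_classes is introduced here for illustration and is not part of interfacer.

```r
library(dplyr)  # loaded here for the %>% pipe

# Rebuild the example dataframe so that this block runs on its own
tmp = iris %>%
  tidyr::nest(by_species = -Species) %>%
  dplyr::mutate(
    model = purrr::map(by_species, ~ stats::lm(Sepal.Length ~ Sepal.Width, .x)),
    quantiles = purrr::map(by_species, ~ sapply(.x, quantile))
  )

# col_classes (illustrative name): class of the first element of each list column
col_classes = purrr::map(
  tmp[c("by_species", "model", "quantiles")],
  ~ class(.x[[1]])
)
print(col_classes)
```

Each entry of col_classes is a character vector, so a column can hold objects with several classes (a tibble, for instance, is also a data.frame).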

interfacer can be used to both represent and validate this data structure. Here the initial specifications were generated using iclip(tmp) and then modified by hand:


# Pasted from `iclip(tmp)` with minor modification:
i_tmp = interfacer::iface(
    Species = enum(`setosa`,`versicolor`,`virginica`) ~ "the Species column",
    by_species = list(i_by_species) ~ "the by_species column",
    model = list(of_type(lm)) ~ "the model column",
    quantiles = list(matrix) ~ "the quantiles column",
    .groups = NULL
)

i_by_species = interfacer::iface(
    Sepal.Length = numeric ~ "the Sepal.Length column",
    Sepal.Width = numeric ~ "the Sepal.Width column",
    Petal.Length = numeric ~ "the Petal.Length column",
    Petal.Width = numeric ~ "the Petal.Width column",
    .groups = NULL
)

We can then test that the input matches this specification:

tmp %>% iconvert(i_tmp) %>% dplyr::glimpse()
#> Rows: 3
#> Columns: 4
#> $ Species    <fct> setosa, versicolor, virginica
#> $ by_species <list> [<tbl_df[50 x 4]>], [<tbl_df[50 x 4]>], [<tbl_df[50 x 4]>]
#> $ model      <list> [2.6390012, 0.6904897, 0.04428474, 0.18952960, -0.14856834,…
#> $ quantiles  <list> <<matrix[5 x 4]>>, <<matrix[5 x 4]>>, <<matrix[5 x 4]>>

Such specifications could be used for validation, or for controlling function dispatch. However, it must be recognised that validating nested dataframes is potentially computationally expensive, because each nested dataframe must be validated in full. This can create a high overhead when there are many small nested dataframes.

A second example, using the diamonds dataframe, demonstrates this overhead: 276 nested dataframes need to be validated individually, which takes a few seconds on my machine.


i_diamonds_cat = interfacer::iface(
  cut = enum(`Fair`,`Good`,`Very Good`,`Premium`,`Ideal`, .ordered=TRUE) ~ "the cut column",
  color = enum(`D`,`E`,`F`,`G`,`H`,`I`,`J`, .ordered=TRUE) ~ "the color column",
  clarity = enum(`I1`,`SI2`,`SI1`,`VS2`,`VS1`,`VVS2`,`VVS1`,`IF`, .ordered=TRUE) ~ "the clarity column",
  data = list(i_diamonds_data) ~ "A nested data column must be specified as a list",
  .groups = FALSE
)

i_diamonds_data = interfacer::iface(
  carat = numeric ~ "the carat column",
  depth = numeric ~ "the depth column",
  table = numeric ~ "the table column",
  price = integer ~ "the price column",
  x = numeric ~ "the x column",
  y = numeric ~ "the y column",
  z = numeric ~ "the z column",
  .groups = FALSE
)

nested_diamonds = ggplot2::diamonds %>%
  tidyr::nest(data = c(-cut,-color,-clarity))

system.time(
  nested_diamonds %>% 
    iconvert(i_diamonds_cat) %>% 
    dplyr::glimpse()
)
#> Rows: 276
#> Columns: 4
#> $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
#> $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, I, E, G,…
#> $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
#> $ data    <list> [<tbl_df[469 x 7]>], [<tbl_df[614 x 7]>], [<tbl_df[89 x 7]>],…
#>    user  system elapsed 
#>   4.337   0.000   4.337
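
The overhead described above comes from the number of nested dataframes that must each be validated. The sketch below, which uses only tidyr and purrr, counts them and summarises their sizes; n_groups and group_sizes are illustrative names, not something interfacer provides.

```r
library(dplyr)  # loaded here for the %>% pipe

# Rebuild the nested dataframe so that this block runs on its own
nested_diamonds = ggplot2::diamonds %>%
  tidyr::nest(data = c(-cut, -color, -clarity))

# One nested dataframe per observed cut / color / clarity combination
n_groups = nrow(nested_diamonds)

# group_sizes (illustrative name): number of rows in each nested dataframe
group_sizes = purrr::map_int(nested_diamonds$data, nrow)

print(n_groups)
print(summary(group_sizes))
```

Each of these nested dataframes is checked against i_diamonds_data separately, whatever its size.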

Errors in the validation of nested columns are bubbled up to the top level. In this example the price column is removed before nesting, so the nested dataframes no longer match i_diamonds_data:

try(
  ggplot2::diamonds %>%
    dplyr::select(-price) %>%
    tidyr::nest(data = c(-cut,-color,-clarity)) %>%
    iconvert(i_diamonds_cat) %>% 
    dplyr::glimpse()
)
#> Error : input column `data` in function parameter `<unknown>(<unknown> = ?)` cannot be coerced to a list(i_diamonds_data): nested dataframe problem - missing columns: price
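
When validation runs inside a larger pipeline it can be useful to capture this error rather than let it stop execution. The following sketch wraps the same call in base R's tryCatch() and keeps the message. The name msg is illustrative, the inner specification is defined first purely as a precaution, and the exact wording of the message may differ between interfacer versions.

```r
library(dplyr)       # loaded here for the %>% pipe
library(interfacer)

# Inner specification, defined first so it exists when the outer one refers to it
i_diamonds_data = interfacer::iface(
  carat = numeric ~ "the carat column",
  depth = numeric ~ "the depth column",
  table = numeric ~ "the table column",
  price = integer ~ "the price column",
  x = numeric ~ "the x column",
  y = numeric ~ "the y column",
  z = numeric ~ "the z column",
  .groups = FALSE
)

i_diamonds_cat = interfacer::iface(
  cut = enum(`Fair`,`Good`,`Very Good`,`Premium`,`Ideal`, .ordered=TRUE) ~ "the cut column",
  color = enum(`D`,`E`,`F`,`G`,`H`,`I`,`J`, .ordered=TRUE) ~ "the color column",
  clarity = enum(`I1`,`SI2`,`SI1`,`VS2`,`VS1`,`VVS2`,`VVS1`,`IF`, .ordered=TRUE) ~ "the clarity column",
  data = list(i_diamonds_data) ~ "A nested data column must be specified as a list",
  .groups = FALSE
)

# msg (illustrative name): the error message, or NA if validation succeeded
msg = tryCatch(
  {
    ggplot2::diamonds %>%
      dplyr::select(-price) %>%
      tidyr::nest(data = c(-cut, -color, -clarity)) %>%
      iconvert(i_diamonds_cat)
    NA_character_  # reached only if no error was raised
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

The captured message can then be logged or tested, for example to confirm that it names the missing price column.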

Conclusion

interfacer does work with nested dataframes, but there is a performance cost when nested columns have their own iface specifications. If this capability is used, care must be taken to keep data validation performant.
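
One possible way to limit this cost, offered as a sketch rather than a recommendation from the package, is to validate only a sample of the nested dataframes against the inner specification. The helper spot_check and its sample size are hypothetical; iconvert() is called in the same way as above, on one nested dataframe at a time. Note that this checks neither the outer columns nor the unsampled nested dataframes.

```r
library(dplyr)       # loaded here for the %>% pipe
library(interfacer)

i_diamonds_data = interfacer::iface(
  carat = numeric ~ "the carat column",
  depth = numeric ~ "the depth column",
  table = numeric ~ "the table column",
  price = integer ~ "the price column",
  x = numeric ~ "the x column",
  y = numeric ~ "the y column",
  z = numeric ~ "the z column",
  .groups = FALSE
)

nested_diamonds = ggplot2::diamonds %>%
  tidyr::nest(data = c(-cut, -color, -clarity))

# spot_check (hypothetical helper): validate n randomly chosen nested dataframes
spot_check = function(nested, spec, n = 5) {
  idx = sample(seq_along(nested), size = min(n, length(nested)))
  for (i in idx) {
    # iconvert() raises an error if this nested dataframe does not conform
    iconvert(nested[[i]], spec)
  }
  invisible(idx)  # positions of the nested dataframes that were checked
}

checked = spot_check(nested_diamonds$data, i_diamonds_data, n = 5)
print(checked)
```

A sampled check like this trades completeness for speed, so full validation with the nested specification remains the safer choice wherever the data source is not trusted.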