--- title: "Multinomial proportions models for genomic variants" output: html_document vignette: > %\VignetteIndexEntry{Multinomial proportions models for genomic variants} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) here::i_am("vignettes/variant-proportions.Rmd") source(here::here("vignettes/vignette-utils.R")) library(growthrates) ``` ## COVID-19 proportions in England The Sanger Centre & COGUK performed a large amount of sequencing of COVID-19 during the pandemic, to identify emerging genomic variants. This was scaled up in the second half of 2021 and continued through to the beginning of 2023. Lineages were assigned using the Pango lineage system and important ones given nicknames by the WHO. The Sanger variants data has been discontinued, but were still available for download. The code to download, process these data sets and determine the full lineage is in the `data-raw/variants.R` file, but the output of this has been bundled as a data set here. There are many caveats to the data here in terms of bias and it should not be regarded as definitive: ```{r} # tidy copy of the sanger weekly variants count data aggregated to England level growthrates::england_variants %>% dplyr::glimpse() ``` The data must have a `class` column defining the main categorisation of the data (in this case it is the main pango variant). The `time` column is a `time_period` derived from the date (which is weekly). The other necessary column is the `count` column which is integer counts of each `class`. The data must be grouped by `class`. Multiple models can be fitted simultaneously if the data is grouped by other columns. ## Multinomial proportions model. Genomic testing happened only in a subset of cases. The testing effort varied significantly over time. The frequency of each variant over time can be determined with a multinomial model. ```{r} probs = england_variants %>% multinomial_nnet_model(window = 28) plot_multinomial(probs)+ ggplot2::scale_fill_viridis_d(option="cividis") ``` ## Binomial proportions model The binomial proportions are not very different to the multinomial probabilities calculated above, but come with confidence intervals, however the median values do not necessarily sum to 1. ```{r} probs2 = england_variants %>% proportion_locfit_model(window = 14) plot_proportion(probs2)+ ggplot2::scale_colour_viridis_d(option="cividis",aesthetics = c("colour","fill")) ``` The rate of change of the proportion of each individual variant versus the others on a logistic scale can be used to work out the exponential growth rate of one variant relative to the others. Because this is a relative growth rate taken togehter the esimates of all variants at a given time are centred around zero. If one variant has a growth advantage, by definition others have a growth disadvantage despite potentially causing a larger disease burden and having increasing numbers in a growing epidemic. ```{r} plot_growth_rate(probs2) + ggplot2::scale_fill_viridis_d(option="cividis",aesthetics = c("colour","fill")) ``` The binomial relative growth rate per day is a growth advantage over existing variants. This has a dependency on the unit of time which is controlled by the `time_period` configuration. In the data provided here the `time_period` is defined on a daily basis despite the data being provided weekly. Doubling time does not make strict sense when describing relative growth rates and is not shown here.