---
title: "Multinomial proportions models for genomic variants"
output: html_document
vignette: >
  %\VignetteIndexEntry{Multinomial proportions models for genomic variants}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
here::i_am("vignettes/variant-proportions.Rmd")
source(here::here("vignettes/vignette-utils.R"))

library(growthrates)

```

## COVID-19 proportions in England

The Wellcome Sanger Institute and COG-UK sequenced a large number of SARS-CoV-2
samples during the pandemic to identify emerging genomic variants. Sequencing
was scaled up in the second half of 2021 and continued through to the beginning
of 2023. Lineages were assigned using the Pango lineage system, and important
ones were given nicknames by the WHO.

The Sanger variants data set has been discontinued, but the data were still
available for download at the time of writing. The code to download and process
these data sets, and to determine the full lineage, is in the
`data-raw/variants.R` file; its output has been bundled as a data set here.
There are many caveats to these data in terms of bias, and they should not be
regarded as definitive:

```{r}  

# tidy copy of the Sanger weekly variant count data, aggregated to England level
growthrates::england_variants %>% dplyr::glimpse()

```

The data must have a `class` column defining the main categorisation of the data
(in this case the main Pango variant). The `time` column is a `time_period`
derived from the date (here the data are weekly). The other required column is
`count`, which holds integer counts of each `class`. The data must be grouped by
`class`. Multiple models can be fitted simultaneously if the data are also
grouped by other columns.
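
As a minimal sketch of this layout, the base R code below aggregates a made-up
line list of sequenced samples into weekly counts per `class`. The `toy_samples`
data and its column names are hypothetical and are not part of the package.
Converting the week start date to a `time_period` and grouping by `class` (for
example with `dplyr::group_by(class)`) would still be needed before fitting.

```{r}
# Hypothetical line list: one row per sequenced sample (made-up values)
toy_samples <- data.frame(
  date = as.Date("2021-06-07") + c(0, 1, 3, 7, 8, 8, 9, 14, 15, 16),
  class = c("Alpha", "Alpha", "Delta", "Alpha", "Delta",
            "Delta", "Delta", "Delta", "Delta", "Alpha")
)

# Assign each sample to the start date of its 7-day window
toy_origin <- min(toy_samples$date)
toy_samples$week_start <- toy_origin +
  7 * (as.numeric(toy_samples$date - toy_origin) %/% 7)

# Cross-tabulate to one integer count per week and class (zeros are kept)
toy_counts <- as.data.frame(
  table(week_start = toy_samples$week_start, class = toy_samples$class),
  responseName = "count",
  stringsAsFactors = FALSE
)
toy_counts$week_start <- as.Date(toy_counts$week_start)

toy_counts
```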

## Multinomial proportions model

Genomic testing happened only in a subset of cases, and the testing effort
varied significantly over time. Raw counts therefore do not track case numbers,
but the relative frequency of each variant over time can be estimated with a
multinomial model.

```{r}

probs = england_variants %>% 
  multinomial_nnet_model(window = 28)

plot_multinomial(probs)+
  ggplot2::scale_fill_viridis_d(option="cividis")
  
```
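
To illustrate what a multinomial proportions model estimates, the sketch below
fits a plain `nnet::multinom()` model with a linear time term to made-up counts.
This is a simplification under stated assumptions and is not necessarily how
`multinomial_nnet_model()` is implemented; the `toy_multi` data are invented for
illustration.

```{r}
# Made-up weekly counts for three hypothetical classes (time in days)
toy_multi <- data.frame(
  time = rep(c(0, 7, 14, 21, 28), times = 3),
  class = factor(rep(c("A", "B", "C"), each = 5)),
  count = c(80, 50, 20, 10, 5,
            10, 20, 20, 60, 85,
            10, 30, 60, 30, 10)
)

# Multinomial logistic regression of class on time, weighted by the counts
toy_multi_fit <- nnet::multinom(
  class ~ time, data = toy_multi, weights = count, trace = FALSE
)

# Predicted class probabilities at three time points
toy_multi_probs <- predict(
  toy_multi_fit, newdata = data.frame(time = c(0, 14, 28)), type = "probs"
)

# All classes are modelled jointly, so each row sums to 1
rowSums(toy_multi_probs)
```

Because the classes are fitted jointly, the predicted probabilities at each time
point are constrained to sum to 1.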

## Binomial proportions model

The binomial proportions are not very different from the multinomial
probabilities calculated above, and they come with confidence intervals.
However, because each variant is modelled separately against all the others,
the median values do not necessarily sum to 1.

```{r}

probs2 = england_variants %>% proportion_locfit_model(window = 14)

plot_proportion(probs2)+
  ggplot2::scale_colour_viridis_d(option="cividis",aesthetics = c("colour","fill"))

```
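
Why separately fitted proportions need not sum to 1 can be seen in a small
sketch. It swaps in ordinary logistic regression (`glm()`) with a linear time
term for the package's local-fit model, so it only illustrates the principle;
the `toy_bin` data are made up.

```{r}
# Made-up weekly counts for three hypothetical classes (time in days)
toy_bin <- data.frame(
  time = rep(c(0, 7, 14, 21, 28), times = 3),
  class = rep(c("A", "B", "C"), each = 5),
  count = c(80, 50, 20, 10, 5,
            10, 20, 20, 60, 85,
            10, 30, 60, 30, 10)
)

# Total number of sequences at each time point
toy_bin$total <- ave(toy_bin$count, toy_bin$time, FUN = sum)

# One independent "this class versus the rest" binomial model per class
toy_bin_fitted <- lapply(split(toy_bin, toy_bin$class), function(d) {
  m <- glm(cbind(count, total - count) ~ time, family = binomial, data = d)
  unname(fitted(m))
})

# Sum of the fitted proportions at each time point: close to, but not exactly, 1
toy_bin_sums <- Reduce(`+`, toy_bin_fitted)
toy_bin_sums
```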

The rate of change of the proportion of each individual variant versus the
others, on a logistic scale, can be used to work out the exponential growth rate
of one variant relative to the others. Because this is a relative growth rate,
the estimates for all variants at a given time, taken together, are centred
around zero. If one variant has a growth advantage then, by definition, others
have a growth disadvantage, even though they may still be causing a larger
disease burden and be increasing in number in a growing epidemic.

```{r}

plot_growth_rate(probs2) +
  ggplot2::scale_fill_viridis_d(option="cividis",aesthetics = c("colour","fill"))

```
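
A small numerical sketch may help here. It assumes two hypothetical variants
growing exponentially at made-up rates of 0.05 and 0.10 per day, and computes
the slope of the logit of each proportion. None of these names or values come
from the package or the data.

```{r}
toy_days <- 0:28

# Both hypothetical variants grow in absolute numbers
toy_n_a <- 1000 * exp(0.05 * toy_days)  # established variant
toy_n_b <- 10 * exp(0.10 * toy_days)    # emerging variant

# Proportion of each variant among all sequences
toy_p_b <- toy_n_b / (toy_n_a + toy_n_b)
toy_p_a <- 1 - toy_p_b

# Relative growth rate per day: slope of the logit of the proportion
toy_rel_b <- diff(qlogis(toy_p_b)) / diff(toy_days)
toy_rel_a <- diff(qlogis(toy_p_a)) / diff(toy_days)

range(toy_rel_b)
range(toy_rel_a)
```

With only two variants the relative growth rate is the difference of the
absolute rates: the emerging variant has an advantage of 0.10 - 0.05 = 0.05 per
day and the established one a matching disadvantage, although both are
increasing in number.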

The binomial relative growth rate is a growth advantage, per unit of time, over
the other circulating variants. Its value depends on the unit of time, which is
controlled by the `time_period` configuration. In the data provided here the
`time_period` is defined on a daily basis, even though the data are provided
weekly, so the rates shown are per day. Doubling time does not make strict sense
when describing relative growth rates and is not shown here.
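
As a rough illustration of the dependence on the time unit, the sketch below
rescales a made-up per-day relative growth rate to a per-week rate. The value
0.05 is hypothetical and is not taken from the fitted model.

```{r}
# Hypothetical relative growth rate on the logit scale, per day
toy_rate_per_day <- 0.05

# The same rate expressed per week is simply seven times larger
toy_rate_per_week <- toy_rate_per_day * 7

# Factor by which the odds of the variant versus the others change in one week
toy_odds_ratio_week <- exp(toy_rate_per_week)

c(per_day = toy_rate_per_day, per_week = toy_rate_per_week,
  odds_ratio_week = toy_odds_ratio_week)
```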