# Estimating the linkage false discovery rate from sample size

This vignette provides an overview of the primary function of the linkage scenario portion of the phylosamp package: how to estimate the false discovery rate given a sample size. In the examples provided, we use the default assumption argument (multiple transmissions and multiple links, mtml), though alternative assumptions can also be specified.

The most basic function of the package is translink_tdr(), which calculates the true discovery rate: the probability that an identified link represents a true transmission event. The false discovery rate is one minus this value. The calculation relies on the following parameters:

| Param | Variable Name | Description |
|---|---|---|
| $$\eta$$ | sensitivity | the sensitivity of the linkage criteria for identifying transmission links |
| $$\chi$$ | specificity | the specificity of the linkage criteria for identifying transmission links |
| $$\rho$$ | rho | the proportion of infections sampled |
| $$M$$ | M | the number of infections sampled |
| $$R$$ | R | the average reproductive number (also denoted $$R_\text{pop}$$, see below) |

```r
library(phylosamp)
## Calculating true discovery rate assuming multiple-transmission and multiple-linkage
## [1] 0.2334906
```

In other words, given a sample size of 100 infections (representing 75% of all infections in the population) and linkage criteria with a sensitivity of 99% and a specificity of 95% for identifying infections linked by transmission, fewer than 25% of identified pairs will represent true transmission events. Increasing the specificity to 99.5% raises the true discovery rate to about 75%, substantially improving our ability to distinguish linked and unlinked pairs:

```r
## Calculating true discovery rate assuming multiple-transmission and multiple-linkage
## [1] 0.7528517
```
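
The calls that produced these two values are not shown here. As a hedged sketch, assume the mtml true discovery rate takes the closed form below (this expression is not stated in this vignette) and assume $$R = 1$$ (a value the text does not give). Under those assumptions, both printed results can be reproduced by hand with $$\eta = 0.99$$, $$\rho = 0.75$$ and $$M = 100$$:

$$
\text{TDR} = \frac{\eta\rho(R+1)}{\eta\rho(R+1) + (1-\chi)\left(M - 1 - \rho(R+1)\right)}
$$

$$
\begin{aligned}
\chi = 0.95:&\quad \frac{0.99 \times 0.75 \times 2}{1.485 + 0.05 \times 97.5} = \frac{1.485}{6.36} \approx 0.2335 \\
\chi = 0.995:&\quad \frac{0.99 \times 0.75 \times 2}{1.485 + 0.005 \times 97.5} = \frac{1.485}{1.9725} \approx 0.7529
\end{aligned}
$$

Under this form the false-link term is proportional to $$(1-\chi)$$ multiplied by a quantity close to $$M$$, so a tenfold reduction in $$1-\chi$$ removes most of the false links while leaving the true-link term unchanged.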

The other core functions are designed to calculate the expected number of true transmission pairs identified in the sample (translink_expected_links_true()) and the total number of linkages one can expect to identify given the sensitivity and specificity of the linkage criteria and a particular sample size and proportion (translink_expected_links_obs()).
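
To show how these two quantities relate to the true discovery rate, here is a worked sketch. The closed forms below are assumptions for the mtml case and are not stated in this vignette, and $$R = 1$$ is likewise an assumed value. With $$\eta = 0.99$$, $$\chi = 0.95$$, $$\rho = 0.75$$ and $$M = 100$$:

$$
\begin{aligned}
E[\text{true links}] &= \frac{\eta\rho M(R+1)}{2} = \frac{0.99 \times 0.75 \times 100 \times 2}{2} = 74.25 \\
E[\text{observed links}] &= \frac{M}{2}\left[\eta\rho(R+1) + (1-\chi)\left(M - 1 - \rho(R+1)\right)\right] = 50 \times 6.36 = 318
\end{aligned}
$$

The ratio $$74.25 / 318 \approx 0.2335$$ agrees with the first true discovery rate printed above, which is the consistency one would expect if the true discovery rate is the expected number of true links divided by the expected number of observed links.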

It is important to recognize that $$R$$ in these functions represents the average $$R$$ in the sampled population (alternatively denoted $$R_\text{pop}$$). Because any sampling frame contains a finite number of cases, there will always be more cases than infection events (at minimum, all infectees in a transmission chain plus a single index case), so $$R_\text{pop}\leq1$$. For outbreaks with a single introduction, $$R_\text{pop}$$ is approximately equal to 1; sampling frames containing cases from separate introduction events will have lower values of $$R_\text{pop}$$.
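
The bound can be made concrete with a short worked equation. As a sketch, assume (this setup is not stated above) a sampling frame of $$N$$ cases arising from $$k$$ separate introductions, with every non-index case infected by another case in the frame. Each introduction contributes one index case that was not infected within the frame, leaving $$N - k$$ infection events:

$$
R_\text{pop} = \frac{N - k}{N} = 1 - \frac{k}{N}
$$

For example, $$N = 200$$ cases from a single introduction give $$R_\text{pop} = 0.995$$, while the same number of cases from $$k = 20$$ introductions give $$R_\text{pop} = 0.9$$.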