Spaces:
Sleeping
Sleeping
| title: Tutorial - Fitness estimation from pooled growth and NGS | |
| emoji: 🧮 | |
| colorFrom: green | |
| colorTo: blue | |
| sdk: gradio | |
| sdk_version: 5.31.0 | |
| app_file: app.py | |
| pinned: false | |
| license: mit | |
| short_description: Inferring competitive fitness from NGS data | |
| tags: | |
| - biology | |
| - sequencing | |
| # Tutorial – Fitness estimation from pooled growth | |
| [](https://huggingface.co/spaces/scbirlab/tutorial-seq-fitness) | |
| This is the repository for the interactive tutorial, acccessed [here](https://huggingface.co/spaces/scbirlab/tutorial-seq-fitness). Non-interactive text from the tutorial is below. | |
| ## Multiplex growth curves | |
| How do strains grow when competing with each other? | |
| That's given by the [Lotka–Volterra competition model](https://en.wikipedia.org/wiki/Competitive_Lotka%E2%80%93Volterra_equations). | |
| For two strains: | |
| $$ | |
| \frac{dn_{wt}}{dt} = w_{wt} n_{wt} \left( 1 - \frac{n_{wt} + n_{1}}{K} \right) | |
| $$ | |
| $$ | |
| \frac{dn_1}{dt} = w_{1} n_1 \left( 1 - \frac{n_{wt} + n_{1}}{K} \right) | |
| $$ | |
| - $n_i(t)$: abundance of species (or strain) $i$ at time $t$. | |
| - $w_i$: intrinsic (exponential) growth rate of species $i$. | |
| - $K$: carrying capacity. | |
| We can generalize to many strains. For each one: | |
| $$ | |
| \frac{dn_i}{dt} = w_{i} n_i \left( 1 - \frac{\Sigma_j n_j}{K} \right) | |
| $$ | |
| It's not possible to algebraically integrate these equations, since they are | |
| circularly dependent on each other. But we can numerically integrate, to simulate | |
| multiplexed growth curves. | |
| ### Removing time dependence | |
| It can be difficult to get absolute fitness out of these curves, because | |
| when the pool approaches the carrying capacity, all the strains growth rates | |
| mutually affect each other. | |
| However, if we're only interested in the relative fitness of multiplexed strains relative | |
| to a reference (e.g. wild-type) strain, then we can make this simplification: | |
| $$ | |
| \frac{dn_{i}}{dt} / \frac{dn_{wt}}{dt} = \frac{dn_{i}}{dn_{wt}} = \frac{w_{i} n_i}{w_{wt} n_{wt}} | |
| $$ | |
| The interdependency term cancels out, and time is removed, with the reference strain's | |
| growth acting as the clock. Unlike the time-dependent Lotka-Volterra equations, this | |
| has a closed-form integral: | |
| $$ | |
| \log n_i(t) = \frac{w_i}{w_{wt}} \log \frac{n_{wt}(t)}{n_{wt}(0)} + \log{n_i(0)} | |
| $$ | |
| So now the log of the number of cells of a mutant at any moment ($n_1(t)$) is dependent only | |
| on its inoculum ($n_1(0)$), how much the reference strain has grown (i.e. fold-expansion, | |
| $\frac{n_{wt}(t)}{n_{wt}(0)}$), and the ratio of fitness between the mutant and the | |
| reference ($\frac{n_{wt}(t)}{n_{wt}(0)}$). | |
| ## Read counts from next-generation sequencing | |
| But we don't actually measure the number of cells directly. Instead, we're measuring the | |
| number of reads (or UMIs) which represent a random sampling of the population followed by | |
| molecular biology handling and uneven sequencing per lane which decouples the relative | |
| abundances for each timepoint. | |
| Below, you can simulate read counts for technical replicates of the growth curves above. | |
| The simulation: | |
| 1. Randomly samples a defined fraction of the cell population (without replacement, i.e. | |
| the [Hypergeometric distribution](https://en.wikipedia.org/wiki/Hypergeometric_distribution)). | |
| Smaller samples from smaller populations are noisier. | |
| 2. Calculates the resulting proportional representation of every strain in every sample. | |
| 3. Multiplies that proportion by read depth. | |
| 4. Randomly samples sequencing read counts resulting from variations in library construction | |
| and other stochasticity, according to the | |
| [Negative Binomial distribution](https://en.wikipedia.org/wiki/Negative_binomial_distribution), | |
| an established noise model for sequencing counts. | |
| ### Accounting for sequencing subsampling per sample | |
| Each sequencing sample $s$ could be over- or under-sampling the population relative to the first | |
| timepoint by some factor $\phi_s$. | |
| $$\log \frac{c_i(t)}{c_i(0)} = \log \phi_s\frac{n_i(t)}{n_i(0)} = \log \phi_s + \frac{w_i}{w_{wt}} \log \frac{n_{wt}(t)}{n_{wt}(0)}$$ | |
| Variables: | |
| - $c_i(t)$: Read (or UMI) count of strain $i$ at time $t$ | |
| - $\phi_s$: The ratio of sampling depth at time $t$ to that at time $0$ for sample $s$ | |
| The factor $\phi_s$ is the ratio of _the ratio of read counts between samples_ | |
| and _the ratio of cell counts between samples_ for any strain (assuming each strain | |
| is sampled without bias): | |
| $$\log \phi_s = \log \frac{c_i(t)}{c_i(0)} - \log \frac{n_i(0)}{n_i(0)}$$ | |
| We can get rid of the nuisance parameter $\phi_s$ (which is difficult to measure becuase | |
| we don't know the true number of cells for each strain and sample) using the following trick. | |
| We have the equation for read counts for mutant $i$ (same as above): | |
| $$ | |
| \log \frac{c_i(t)}{c_i(0)} = \log \phi_s + \frac{w_i}{w_{wt}} \log \frac{n_{wt}(t)}{n_{wt}(0)} | |
| $$ | |
| And for the reference strain (relative fitness is 1): | |
| $$ | |
| \log \frac{c_{wt}(t)}{c_{wt}(0)} = \log \phi_s + \log \frac{n_{wt}(t)}{n_{wt}(0)} | |
| $$ | |
| We can make $\phi_s$ disappear by taking the difference: | |
| $$ | |
| \log \frac{c_i(t)}{c_i(0)} - \log \frac{c_{wt}(t)}{c_{wt}(0)} = \frac{w_i}{w_{wt}} \log \frac{n_{wt}(t)}{n_{wt}(0)} - \log \frac{n_{wt}(t)}{n_{wt}(0)} | |
| $$ | |
| This is equivalent to: | |
| $$ | |
| \log \left( \frac{c_i(t)}{c_{wt}(t)}\frac{c_{wt}(0)}{c_i(0)} \right) = \left(\frac{w_i}{w_{wt}} - 1 \right) \log \frac{n_{wt}(t)}{n_{wt}(0)} | |
| $$ | |
| So the ratio of _the count ratio of a strain to the reference strain at time t_ to | |
| _the count ratio of a strain to the reference strain at time 0_ is | |
| dependent only on the relative fitness and the true fold-expansion of the reference strain. | |
| Plotting the ratio of _the count ratio of a strain to the reference strain at time t_ to | |
| _the count ratio of a strain to the reference strain at time 0_ | |
| should give a straight line (on a log-log) plot, with intercept 0 and gradient equal to the relative fitness minus 1. | |
| ### Using spike-in counts | |
| But we don't actually know the true fold-expansion of the reference strain, since | |
| it's not directly observed. However, a non-growing fitness-zero control can help, | |
| such as a heat-killed strain or a spike-in plasmid. | |
| We start with the equation before, | |
| $$ | |
| \log \left( \frac{c_i(t)}{c_{wt}(t)}\frac{c_{wt}(0)}{c_i(0)} \right) = \left(\frac{w_i}{w_{wt}} - 1 \right) \log \frac{n_{wt}(t)}{n_{wt}(0)} | |
| $$ | |
| But for the fitness-zero control, $w_{spike} = 0$, so: | |
| $$ | |
| \log \left( \frac{c_{spike}(t)}{c_{wt}(t)}\frac{c_{wt}(0)}{c_{spike}(0)} \right) = -\log \frac{n_{wt}(t)}{n_{wt}(0)} | |
| $$ | |
| This means that, although we don't know how the reference strain grows directly, its | |
| growth is given from the ratio of the spike counts to the reference counts, normalized | |
| to the same ratio at time 0. | |
| This leaves us with the overall equation: | |
| $$ | |
| \log \left( \frac{c_i(t)}{c_{wt}(t)}\frac{c_{wt}(0)}{c_i(0)} \right) = \left(1 - \frac{w_i}{w_{wt}} \right) \log \left( \frac{c_{spike}(t)}{c_{wt}(t)}\frac{c_{wt}(0)}{c_{spike}(0)} \right) | |
| $$ | |
| If we plot the left hand side against the right, we should get a straight line for | |
| each strain with intercept zero and gradient $1 - \frac{w_i}{w_{wt}}$. | |