Animating population demographic changes
This post animates the genetic signature of a population crash as it unfolds, attempting to shed a little bit more light on when a crash is detectable, and how long it takes to stabilise post-crash.
Although a forward simulation is probably more appropriate here, I chose the extremely fast coalesent simulator msprime
. Strictly speaking, the simulations shown here aren’t following a single population crashing, but multiple independent populations at different time periods given a particular demographic history.
msprime
is plenty fast enough to handle this type of work, comfortably handling 140 separate coalescent simulations over 10 Mbp in about 90 minutes. Also, the independent observations allow us to visualise the inherent stochasticity present.
The rest of the post details how this was done. But the final animation is shown below:
The left panel shows the demographic history of the population, with the time point currently displayed in the other two panels. In the centre, I show the site frequency spectrum (SFS). This is the distribution of allele frequencies among SNPs present in the population. Under neutrality and a constant population size, we expect rare variants to be observed frequently as they may occur via sporadic mutations- in contrast common variants should be rare, as they take a long time to drift to high frequency. The allele frequency of a polymorphism has a direct relationship to the number of times we may expect to see a polymophism with that allele frequency, a direct consequence of drift. Note that this is independent of the total number of mutations that a population can maintain, the SFS pertains to the relative levels of polymorphisms with different allele frequencies. The SFS therefore encodes information regarding population demographic history; populations that have experienced contraction tend to have a dearth of rare variants, as genetic drift is much more likely to remove these polymorphisms from the population, but insufficient time has passed to generate new mutations.
Here we plot the scaled SFS, where a stable population is expected to show a straight (horizontal) line.
Tajima’s D
is a related statistic, and represents the ratio of variation at a per chromosome level to the absolute number of variants in the population.
A negative D
suggests an excess of rare variants.
Following the crash we see that even 5000 generations later, the site frequency spectrum hugely disrupted, Tajima’s D
is also heavily right skewed.
This is an interesting observation - severe population bottlenecks still leave profound effects on genomes 5000 generations later.
(I explore this a little more thoroughly at the bottom of this page.)
Simulations begin
If you happened to see my previous post, I wrote about the interface between msprime
and scikit-allel
, see here. This function is taken directly from that, but with one addition, a cache for previously generated tree sequences. Although msprime
is fast, simulating 140 different dempgraphic scenarios takes some time!
Parameters
Here I define recombination rate, mutation rate, and how much DNA I want to simulate.
Specifications
Here I make some intial specifications about the demographic history I wish to simulate… The population is at time t
1000 chromosomes, with a decline of 0.5% per generation for 1000 generations completing at 5000 generations in the past (and therefore starting at 6000 generations in the past).
Now we need to decide at which time points to sample, I chose a gap of 100 generations, reduced to 10 during the population crash. This has the nice effect of having more detail over this time point, as well as slowing down the animation during the interesting bit.
array([ 0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000,
1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100,
2200, 2300, 2400, 2500, 2600, 2700, 2800, 2900, 3000, 3100, 3200,
3300, 3400, 3500, 3600, 3700, 3800, 3900, 4000, 4100, 4200, 4300,
4400, 4500, 4600, 4700, 4800, 4900, 5000, 5010, 5020, 5030, 5040,
5050, 5060, 5070, 5080, 5090, 5100, 5110, 5120, 5130, 5140, 5150,
5160, 5170, 5180, 5190, 5200, 5210, 5220, 5230, 5240, 5250, 5260,
5270, 5280, 5290, 5300, 5310, 5320, 5330, 5340, 5350, 5360, 5370,
5380, 5390, 5400, 5410, 5420, 5430, 5440, 5450, 5460, 5470, 5480,
5490, 5500, 5510, 5520, 5530, 5540, 5550, 5560, 5570, 5580, 5590,
5600, 5610, 5620, 5630, 5640, 5650, 5660, 5670, 5680, 5690, 5700,
5710, 5720, 5730, 5740, 5750, 5760, 5770, 5780, 5790, 5800, 5810,
5820, 5830, 5840, 5850, 5860, 5870, 5880, 5890, 5900, 5910, 5920,
5930, 5940, 5950, 5960, 5970, 5980, 5990, 6000, 6100, 6200, 6300,
6400, 6500, 6600, 6700, 6800, 6900])
160
This is just a quick sum to work out the ancestral size of the population- because we are dealing with the coalescent, we work backwards…
148413.1591025766
Here I specify a seed so that the cacheing described above works.
The crash truly happens at 5000 gen - 6000 generations ago. However, we need to adjust the times that are read into msprime
, as everything has to be relative. ie at t=1000, then the crash is 4000 generations away, not 5000.
After the simulation we are only interested in the allele counts, so we keep those in a dict.
Using scikit-allel
we calculate the scaled site frequency spectrum and Tajima’s D in 10kbp windows. These are stored in dicts to be used later in the animation.
Now animate
This is a fairly standard matplotlib animation. It’s not the most efficient, as it redraws the whole axes for the SFS and Tajima’s D plots, but it works well enough here.
These two tutorials were extremely helpful in putting this together:
-
http://jakevdp.github.io/blog/2013/05/12/embedding-matplotlib-animations/
-
https://stackoverflow.com/questions/43552575/how-to-make-a-matplotlib-animated-violinplot
So thanks to the respective authors/answerers.
Is a severe crash detectable 5000 generations later?
To show this I ran two contrasting demographic scenarios. One a population that has been at a constant Ne of 1000 for 5000 generations, but before that experiencing a severe crash from approximately 140,000 over a span of 1000 generations.
The null, alternative model shows a population at a stable Ne of 1000.
The simulation below clearly shows an extremely disrupted site frequency spectrum in the “crash model”.
Fin.
Thanks to @alimanfoo for the intial idea, and @jeromekelleher for help with the implementation.