Coalescent simulation in Python: a brief introduction
This blog is an introduction to two Python modules, msprime (by Jerome Kelleher) and scikit-allel (by Alistair Miles), whose authors are both based at the Big Data Institute in Oxford. They are both really great tools, and if you work with genomic data I would encourage you to take a look at them. However, the particular point I’d like to make here is that they work really nicely together.
msprime is a coalescent simulator with a Python interface, essentially an efficient version of ms that also writes its output in a sensible format: a compressed tree sequence, rather than text files that need to be parsed (the bane of bioinformaticians).
scikit-allel is a toolkit for working with genomic data. Crucially, it abstracts away data chunking and compression, allowing extremely large datasets to be handled on a desktop machine. It has a wealth of functions, and is very fast.
It’s easy to imagine taking several days to construct and debug an ms command, parse the results into something plink can work with, compute summary statistics, and then parse those into a format appropriate for a plotting library. What I will show here is that all of this is possible in a few very readable lines of Python.
Writing your own parsing code is prone to bugs, even for experienced coders. It’s almost invariably inefficient, and harder to maintain/reproduce when you have several tools in your chain.
Here comes some code…
We begin as usual by importing some essential libraries. As well as the two that are the subject of this blog, we use numpy to convert msprime output into a format that allel can understand, pandas for handling tabular data, and the seaborn wrapper to matplotlib for plotting.
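A minimal import block along these lines (the aliases are my assumption, though they are the conventional ones):

```python
import msprime
import allel
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```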
Overview
The task I am going to work through is to simulate two populations with different demographic histories, and see how genome summary statistics differ.
- We’ll draw 100 samples from a population of size 1000. Both populations have experienced bottlenecks: one very rapid, the other a gradual decline over many generations.
- Population 1 experienced a crash starting 40 generations ago and ending 20 generations ago, declining at 25% per generation.
- Population 2 experienced a 0.5% per generation decline for 1000 generations. Both populations have been stable for the last 20 generations.
Obviously we could specify more complex demographic scenarios here as needed!
msprime also has a neat demography debugger feature, so we can check that the demographic history is as we intended.
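As a minimal sketch, the two scenarios can be specified like this (the variable names pop_config, recent_crash and slow_decline are my own; the calls are msprime’s demography interface):

```python
# One population: 100 sampled haplotypes, present-day size 1000.
pop_config = [msprime.PopulationConfiguration(sample_size=100, initial_size=1000)]

# Times are in generations before the present; a growth-rate change at time t
# applies to the epoch older than t, until the next event.
# Scenario 1: a 25% per generation crash between 40 and 20 generations ago.
recent_crash = [
    msprime.PopulationParametersChange(time=20, growth_rate=-0.25, population_id=0),
    msprime.PopulationParametersChange(time=40, growth_rate=0, population_id=0),
]

# Scenario 2: a slow 0.5% per generation decline lasting 1000 generations,
# ending 20 generations ago.
slow_decline = [
    msprime.PopulationParametersChange(time=20, growth_rate=-0.005, population_id=0),
    msprime.PopulationParametersChange(time=1020, growth_rate=0, population_id=0),
]

# Print the epochs implied by each history.
for events in (recent_crash, slow_decline):
    msprime.DemographyDebugger(
        population_configurations=pop_config,
        demographic_events=events,
    ).print_history()
```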
Population 1 (recent crash):

```
============================
Epoch: 0 -- 20.0 generations
============================
start end growth_rate | 0
-------- -------- -------- | --------
0 | 1e+03 1e+03 0 | 0
Events @ generation 20.0
- Population parameter change for 0: growth_rate -> -0.25
===============================
Epoch: 20.0 -- 40.0 generations
===============================
start end growth_rate | 0
-------- -------- -------- | --------
0 | 1e+03 1.48e+05 -0.25 | 0
Events @ generation 40.0
- Population parameter change for 0: growth_rate -> 0
==============================
Epoch: 40.0 -- inf generations
==============================
start end growth_rate | 0
-------- -------- -------- | --------
0 |1.48e+05 1.48e+05 0 | 0
```

Population 2 (slow decline):

```
============================
Epoch: 0 -- 20.0 generations
============================
start end growth_rate | 0
-------- -------- -------- | --------
0 | 1e+03 1e+03 0 | 0
Events @ generation 20.0
- Population parameter change for 0: growth_rate -> -0.005
=================================
Epoch: 20.0 -- 1020.0 generations
=================================
start end growth_rate | 0
-------- -------- -------- | --------
0 | 1e+03 1.48e+05 -0.005 | 0
Events @ generation 1020.0
- Population parameter change for 0: growth_rate -> 0
================================
Epoch: 1020.0 -- inf generations
================================
start end growth_rate | 0
-------- -------- -------- | --------
0 |1.48e+05 1.48e+05 0 | 0
```
This function is the key part, and represents the interface between the two tools. I’ve tried to explain what’s happening in the comments.
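A sketch of such a function, assuming msprime’s simulate() interface and scikit-allel’s HaplotypeArray (the function name and parameter defaults are my own):

```python
def simulate_haplotypes(pop_config, demographic_events, length=10_000_000,
                        mutation_rate=2e-8, recombination_rate=2e-8, seed=42):
    """Run a coalescent simulation and return variant positions plus
    allele counts in a form that scikit-allel understands."""
    tree_sequence = msprime.simulate(
        population_configurations=pop_config,
        demographic_events=demographic_events,
        length=length,
        mutation_rate=mutation_rate,
        recombination_rate=recombination_rate,
        random_seed=seed)
    print("Simulated {} mutations".format(tree_sequence.get_num_mutations()))

    # Copy the genotypes into a numpy array: one row per variant,
    # one column per sampled haplotype.
    n_variants = tree_sequence.get_num_mutations()
    haplotypes = np.empty((n_variants, tree_sequence.get_sample_size()),
                          dtype="i1")
    positions = np.empty(n_variants)
    for i, variant in enumerate(tree_sequence.variants()):
        haplotypes[i] = variant.genotypes
        positions[i] = variant.position

    # Wrapping the matrix as a HaplotypeArray gives access to scikit-allel's
    # statistics, most of which operate on allele counts.
    allele_counts = allel.HaplotypeArray(haplotypes).count_alleles()
    return positions, allele_counts
```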
We’ll use a contig of 10 Mbp, and summarize the data in 10 kbp windows.
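Running it for both scenarios (timed here with the notebook’s %%time magic) might look like this, reusing the names sketched above:

```python
data = {
    "recent_crash": simulate_haplotypes(pop_config, recent_crash),
    "slow_decline": simulate_haplotypes(pop_config, slow_decline),
}
```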
```
Simulated 98234 mutations
Simulated 70472 mutations
CPU times: user 1min 54s, sys: 167 ms, total: 1min 54s
Wall time: 1min 54s
```
Initialize our data frame: create a multi-index from the combinations of the things we are interested in, and tell it how many rows to expect.
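A sketch of that initialization; the scenario and statistic labels match the results table below, and the row count follows from 10 Mbp / 10 kbp = 1000 windows:

```python
scenarios = ["recent_crash", "slow_decline"]
statistics = ["pi", "tajimaD", "wattersonTheta"]

# One column per (scenario, statistic) pair, one row per 10 kbp window.
columns = pd.MultiIndex.from_product(
    [scenarios, statistics], names=["scenario", "statistic"])
df = pd.DataFrame(index=range(1000), columns=columns, dtype=float)
```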
Using the nice indexing features of pandas, it’s simple to write a loop and save the data at the appropriate index. allel also returns some other values, such as counts and bin limits, which in this case we are not interested in, so we assign them to _.
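A sketch of that loop, using scikit-allel’s windowed statistics (the windowing parameters repeat the 10 kbp assumption above):

```python
for scenario, (positions, allele_counts) in data.items():
    pi, _, _, _ = allel.windowed_diversity(
        positions, allele_counts, size=10_000, start=1, stop=10_000_000)
    theta_w, _, _, _ = allel.windowed_watterson_theta(
        positions, allele_counts, size=10_000, start=1, stop=10_000_000)
    tajima_d, _, _ = allel.windowed_tajima_d(
        positions, allele_counts, size=10_000, start=1, stop=10_000_000)

    df[scenario, "pi"] = pi
    df[scenario, "wattersonTheta"] = theta_w
    df[scenario, "tajimaD"] = tajima_d
```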
```
CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 4.05 µs
```
(I haven’t yet figured out how to get multi-index tables to look nice in markdown/HTML, so the headers are flattened below; apologies!)
| window | recent_crash pi | recent_crash tajimaD | recent_crash wattersonTheta | slow_decline pi | slow_decline tajimaD | slow_decline wattersonTheta |
|---|---|---|---|---|---|---|
| 0 | 0.0019 | -0.0356 | 0.0019 | 0.0016 | 1.5078 | 0.0011 |
| 1 | 0.0018 | 0.2704 | 0.0017 | 0.0022 | 1.6559 | 0.0015 |
| 2 | 0.0019 | 0.3512 | 0.0017 | 0.0023 | 1.3548 | 0.0017 |
| 3 | 0.0020 | 0.5189 | 0.0018 | 0.0019 | 1.5628 | 0.0013 |
| 4 | 0.0013 | -0.7073 | 0.0017 | 0.0018 | 1.9320 | 0.0011 |
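Since seaborn was imported for plotting, here is one way the windowed statistics could be compared visually; this sketch is entirely my own construction:

```python
# Compare the distribution of Tajima's D across windows for the two scenarios.
fig, ax = plt.subplots()
for scenario in ("recent_crash", "slow_decline"):
    sns.kdeplot(df[scenario, "tajimaD"].dropna(), label=scenario, ax=ax)
ax.set_xlabel("Tajima's D per 10 kbp window")
ax.legend()
```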
Thanks for reading this far! Hope something was of use.