bokeh
is a plotting library that has some incredible example plots, usually written in about 10 lines of python... It has been around for a little while, but it's recent release spurred me to finally get around to trying it.
Working with genomic data, one thing I find myself doing a lot, is plotting some metric across the genome, then wondering what genomic features are underlying the interesting bits. Currently, my workflow relies on static matplotlib plots. This means I make a plot, then replot iteratively with different axes- which can end up wasting time. What bokeh
does, is to provide interactivity. Allowing panning and zooming into the interesting bits, as well as helpful annotations.
The aim of this post then, is to give an example of how to use the bokeh plotting library to create some interactive genome scale plots of two different organisms- An. gambiae, and Pl. falciparum.
Two examples are shown directly below, but described in detail further down.
TL;DR How to make some neat generalisable genome plots
%run boilerplate.ipynb
Try hovering over a data point, or clicking on one of the panels on the left to wheel zoom, or select zoom!
The hover label gives the x
and y
values, as well as any genes in the region.
gf.apply(stat_plot, stat="tajimaD", label="Tajima'sD")
gf2.apply(plot_gc_data)
Blogs are supposed to be narrative driven, so I will try not to spend much time discussing the below GenomePlot
class.
The main gist is that we first initialise the class with information about the genome we are plotting:
The class contains an apply()
method, that creates one subplot per contig, and places them in a grid. The function that actually makes the subplots is passed to apply()
as an argument, and is called for each contig.
class GenomePlot:
genome = None
contigs = None
tools = "pan,wheel_zoom,box_zoom,save,reset"
layout = None
pfunc = None
# we need these min border values to ensure that different contig sizes do not squeeze axes
min_border_left = 50
min_border_right = 10
major_tick_dist = 1e7
plot_width_per_mb = 6
chrom_label_func = lambda self, y: y
def __init__(self, fasta, contigs=None, layout=None, share_y=True, pfunc=None):
self.genome = pyfasta.Fasta(fasta)
if contigs is None:
self.contigs = list(self.genome.keys())
else:
self.contigs = contigs
# handle layout
if layout is not None:
self.layout = self.parse_layout(layout)
else:
self.layout = self.auto_layout()
if pfunc is not None:
self.pfunc = pfunc
def parse_layout(self, lstring):
layout = list()
assert type(lstring) == str, "layout must be a string"
q = list(self.contigs)
q.reverse()
rows = lstring.split("|")
for row in rows:
r = list()
for x in row:
assert x in "ox.", "Only chars o, x, . are allowed"
if x == ".":
r.append(None)
else:
r.append(q.pop())
layout.append(r)
return layout
def auto_layout():
pass
def apply(self, func, **kwargs):
# create a figure with specified layout
d = [[] for i in range(len(self.layout))]
for i, row in enumerate(self.layout):
is_bottom = ((i + 1) == len(self.layout))
for j, chrom in enumerate(row):
is_left = (j == 0)
if chrom is not None:
csize = len(self.genome[chrom])
px = int(csize * 1e-6 * self.plot_width_per_mb)
px += self.min_border_left
px += self.min_border_right
try:
yrange = s1.y_range
except NameError:
yrange = None
s1 = figure(width=px,
min_border_left=self.min_border_left,
min_border_right=self.min_border_right,
plot_height=250,
tools=self.tools,
title=self.chrom_label_func(chrom),
y_range=yrange,
x_range=(1, csize))
s1.xaxis.ticker = FixedTicker(
ticks=np.arange(0, csize, self.major_tick_dist))
s1.xaxis[0].formatter = NumeralTickFormatter(format="0a.0")
# handle general plot things specific to genome not data
if self.pfunc is not None:
self.pfunc(chrom, s1)
# function lives here
func(chrom, s1, is_left=is_left, is_bottom=is_bottom, **kwargs)
d[i].append(s1)
else:
d[i].append(None)
# put the subplots in a gridplot
p = gridplot(d, toolbar_location="left", sizing_mode='fixed', plot_width=None)
show(p)
Here, I want to plot some statistics across the Anopheles gambiae genome: Wattersons's $\Theta$, Tajima's D, and Nucleotide diversity ($\Pi$).
These data are precomputed in another notebook - here, using scikit-allel
%run generate_stats.ipynb
All these data are contained in a dict
containing pandas
dataframes that look like this:
# where X is a chromosome label
annotated_data["X"].head()
As mentioned above, the GenomePlot
class has an .apply()
method, that takes a plotting function, and applies it to each contig.
Plotting functions must take at least 4 arguments: the contig label (chrom), the subplot object, and two booleans describing whether the subplot is to be placed on the left and/or at the bottom of the grid, to allow customisation depending on position.
This particular function has 5th and 6th arguments describing which statistic is to be plotted and the label. You'll notice that we also only make a Y-axis label on plots on the left of the grid, and a X-axis label if we are positioned on the bottom.
We use the HoverTool class from bokeh to annotate points with chromosomal position, contig ID, and gene names.
def stat_plot(chrom, subplot, is_left=None, is_bottom=None, stat="diversity", label=None):
source = ColumnDataSource(annotated_data[chrom])
if label is None:
label = stat
hover = HoverTool(
tooltips=[
("Position", "@start{0a.000}-@stop{0a.000}"),
(label, "$y"),
("contig", chrom),
("gene(s)", "@gene")],
mode="mouse")
subplot.add_tools(hover)
subplot.circle("midpoint",
stat,
source=source,
size=3,
color="navy",
alpha=0.5)
if is_left:
subplot.yaxis[0].axis_label = label
if is_bottom:
subplot.xaxis[0].axis_label = "genome position (bp)"
Anopheles have 3 chromosomes (2, 3 and X). The autosomes are split into "left" and "right" arms (i.e. 2R, 2L, 3R, 3L).
We use the custom layout function to move the Y-axis of the left arms to the right hand side.
# Custom changes to axes that pertain to genome not data
# eg in this case move left chromosome arm axes to RHS
def anopheles_plot(chrom, subplot):
if chrom.endswith("L"):
subplot.yaxis.visible = False
subplot.add_layout(LinearAxis(), 'right')
The layout string below means that we place two plots on the top row, and 3 on the bottom. 'o' represents a plot, and '|' a line break. Gaps can be left using '.'
A little bit of trial and error may be required to get this right!
gf = GenomePlot(fasta=phase1_ar3.genome_fn, # filepath to fasta
contigs=("2R", "2L", "3R", "3L", "X"), # contigs to display in order
layout="oo|ooo", # layout string
pfunc=anopheles_plot) # custom layout function
Any further arguments to the plotting function can be passed as **kwargs
to the apply()
method.
gf.apply(stat_plot, stat="tajimaD", label="Tajima'sD")
gf.apply(stat_plot, stat="theta", label="\u0398W")
Here, instead of the circles shown previously, we present nucleotide diversity as a line, by changing the plotting function.
Note we also change the HoverTool
to use a vline, instead of a mouseover.
def stat_plot_line(chrom, subplot, is_left=None, is_bottom=None, stat="diversity", label=None):
source = ColumnDataSource(annotated_data[chrom])
if label is None:
label = stat
hover = HoverTool(
tooltips=[
("Position", "@start{0a.000}-@stop{0a.000}"),
(label, "$y"),
("contig", chrom),
("gene(s)", "@gene")],
mode="vline")
subplot.add_tools(hover)
subplot.line("midpoint",
stat,
source=source,
color="navy",
alpha=0.5)
if is_left:
subplot.yaxis[0].axis_label = label
if is_bottom:
subplot.xaxis[0].axis_label = "position (bp)"
The nucleotide diversity ($\Pi$) plot:
gf.apply(stat_plot_line, stat="diversity", label="\u03A0")
plasmo_fn = "../assets/PlasmoDB-35_Pfalciparum3D7_Genome.fasta"
fa = pyfasta.Fasta(plasmo_fn, key_fn=lambda y: y.split(" | ")[0])
list(fa.keys())
This function calculates the proportion of GC bases in windows.
def calc_gc(genome, window_size=10000):
clen = len(genome)
starts = np.arange(0, clen, window_size)
stops = starts + window_size
propgc = np.zeros(starts.shape[0])
for i, ix in enumerate(starts):
r = genome[ix:(ix + window_size)]
propgc[i] = (r.count("C") + r.count("G")) / (len(r) - r.count("N"))
return pd.DataFrame.from_dict({"start": starts, "stop": stops, "gc": propgc * 100})
gc_data = {c: calc_gc(fa[c]) for c in fa.keys()}
This is our plotting function, no extra arguments this time. Also we don't add a custom plot layout function in this case.
def plot_gc_data(chrom, subplot, is_left=False, is_bottom=False):
source = ColumnDataSource(gc_data[chrom])
hover = HoverTool(
tooltips=[
("Position", "@start{0a.00}-@stop{0a.00}"),
("%GC", "$y"),
("contig", chrom)],
mode="vline")
subplot.add_tools(hover)
subplot.line("start",
"gc",
source=source,
color="navy",
alpha=0.5)
if is_bottom:
subplot.xaxis[0].axis_label = "position (bp)"
if is_left:
subplot.yaxis[0].axis_label = "%GC"
clen = {c: len(fa[c]) for c in fa.keys()}
clen
The GenomePlot class stuggles to present data where some chromosomes are vastly different sizes, so here we exclude the API and the mitochondrial chromosomes.
plas_contigs = sorted(fa.keys())[:14]
gf2 = GenomePlot(plasmo_fn, contigs=plas_contigs, layout="ooooo|oooo|ooo|oo")
# set some parameters default work better with anopheles
gf2.plot_width_per_mb = 100
gf2.major_tick_dist = 5e5
gf2.chrom_label_func = lambda y: " ".join(y.split("_")[:2])
gf2.apply(plot_gc_data)
I hope this has been of some use- but please, this is very rough code, and should be used as a guideline for development, rather than a finished piece of work.
There are a few improvements I'd like to make- particularly guessing parameters and layouts, which require too much trial and error at the moment.
Any issues/questions please drop a comment. Or ask via twitter @nick_harding.
The idea for the GenomePlot()
class is heavily based on code using matplotlib by @alimanfoo.
~ nh