Interactive plotting with bokeh

bokeh is a plotting library that has some incredible example plots, usually written in about 10 lines of python... It has been around for a little while, but it's recent release spurred me to finally get around to trying it.

Working with genomic data, one thing I find myself doing a lot, is plotting some metric across the genome, then wondering what genomic features are underlying the interesting bits. Currently, my workflow relies on static matplotlib plots. This means I make a plot, then replot iteratively with different axes- which can end up wasting time. What bokeh does, is to provide interactivity. Allowing panning and zooming into the interesting bits, as well as helpful annotations.

The aim of this post then, is to give an example of how to use the bokeh plotting library to create some interactive genome scale plots of two different organisms- An. gambiae, and Pl. falciparum.

Two examples are shown directly below, but described in detail further down.

TL;DR How to make some neat generalisable genome plots

In [1]:
%run boilerplate.ipynb
Loading BokehJS ...

Genome-wide Tajima's D of Anopheles gambiae from Burkina Faso

Try hovering over a data point, or clicking on one of the panels on the left to wheel zoom, or select zoom!

The hover label gives the x and y values, as well as any genes in the region.

In [24]:
gf.apply(stat_plot, stat="tajimaD", label="Tajima'sD")

GC% of Plasmodium falciparium

In [25]:
gf2.apply(plot_gc_data)

GenomePlot class

Blogs are supposed to be narrative driven, so I will try not to spend much time discussing the below GenomePlot class.

The main gist is that we first initialise the class with information about the genome we are plotting:

  1. a fasta file, which provides information about contig names and sizes
  2. a list of contigs contained in the above that we wish to display (default all)
  3. a layout of the above, describing how many contigs on each row etc. (and optionally)
  4. a function that is applied to each subplot to customise layout that's not data specific

The class contains an apply() method, that creates one subplot per contig, and places them in a grid. The function that actually makes the subplots is passed to apply() as an argument, and is called for each contig.

In [4]:
class GenomePlot:
    
    genome = None
    contigs = None
    tools = "pan,wheel_zoom,box_zoom,save,reset"
    layout = None
    pfunc = None
    # we need these min border values to ensure that different contig sizes do not squeeze axes
    min_border_left = 50
    min_border_right = 10
    major_tick_dist = 1e7
    plot_width_per_mb = 6
    chrom_label_func = lambda self, y: y
    
    def __init__(self, fasta, contigs=None, layout=None, share_y=True, pfunc=None):
        
        self.genome = pyfasta.Fasta(fasta)
        
        if contigs is None:
            self.contigs = list(self.genome.keys())
        else:
            self.contigs = contigs
            
        # handle layout
        if layout is not None:
            self.layout = self.parse_layout(layout)
        else:
            self.layout = self.auto_layout()
        
        if pfunc is not None:
            self.pfunc = pfunc
                
    def parse_layout(self, lstring):
        layout = list()
        assert type(lstring) == str, "layout must be a string"
        q = list(self.contigs)
        q.reverse()
            
        rows = lstring.split("|")
        for row in rows:
            r = list()
            for x in row:
                assert x in "ox.", "Only chars o, x, . are allowed"
                if x == ".":
                    r.append(None)
                else:
                    r.append(q.pop())
            layout.append(r)
        return layout
    
    def auto_layout():
        pass

    def apply(self, func, **kwargs):
        
        # create a figure with specified layout        
        d = [[] for i in range(len(self.layout))]
        
        for i, row in enumerate(self.layout):
            
            is_bottom = ((i + 1) == len(self.layout))
            
            for j, chrom in enumerate(row):
                
                is_left = (j == 0)
                
                if chrom is not None:
                    
                    csize = len(self.genome[chrom])
                    px = int(csize * 1e-6 * self.plot_width_per_mb) 
                    px += self.min_border_left 
                    px += self.min_border_right
                    
                    try:
                        yrange = s1.y_range
                    except NameError:
                        yrange = None

                    s1 = figure(width=px, 
                                min_border_left=self.min_border_left,
                                min_border_right=self.min_border_right,
                                plot_height=250, 
                                tools=self.tools,
                                title=self.chrom_label_func(chrom),
                                y_range=yrange,
                                x_range=(1, csize))

                    s1.xaxis.ticker = FixedTicker(
                        ticks=np.arange(0, csize, self.major_tick_dist))
                    s1.xaxis[0].formatter = NumeralTickFormatter(format="0a.0")
                    
                    # handle general plot things specific to genome not data
                    if self.pfunc is not None:
                        self.pfunc(chrom, s1)

                    # function lives here
                    func(chrom, s1, is_left=is_left, is_bottom=is_bottom, **kwargs)
                    d[i].append(s1)

                else:
                    d[i].append(None)
            
            
        # put the subplots in a gridplot
        p = gridplot(d, toolbar_location="left", sizing_mode='fixed', plot_width=None)
        
        show(p)

Example 1: Anopheles gambiae

Here, I want to plot some statistics across the Anopheles gambiae genome: Wattersons's $\Theta$, Tajima's D, and Nucleotide diversity ($\Pi$).

These data are precomputed in another notebook - here, using scikit-allel

In [5]:
%run generate_stats.ipynb

All these data are contained in a dict containing pandas dataframes that look like this:

In [6]:
# where X is a chromosome label
annotated_data["X"].head()
Out[6]:
diversity start stop tajimaD theta midpoint gene
0 0.001781 25 132324 -2.190824 0.005357 66174.5 AGAP000011
1 0.003892 132325 246994 -2.041654 0.010288 189659.5 AGAP000011, AGAP000018
2 0.005919 246995 368599 -2.176849 0.017548 307797.0 AGAP000018
3 0.006699 368600 487739 -2.297078 0.022272 428169.5
4 0.009483 487740 625447 -2.252033 0.030146 556593.5

Define plotting functions

As mentioned above, the GenomePlot class has an .apply() method, that takes a plotting function, and applies it to each contig.

Plotting functions must take at least 4 arguments: the contig label (chrom), the subplot object, and two booleans describing whether the subplot is to be placed on the left and/or at the bottom of the grid, to allow customisation depending on position.

This particular function has 5th and 6th arguments describing which statistic is to be plotted and the label. You'll notice that we also only make a Y-axis label on plots on the left of the grid, and a X-axis label if we are positioned on the bottom.

We use the HoverTool class from bokeh to annotate points with chromosomal position, contig ID, and gene names.

In [7]:
def stat_plot(chrom, subplot, is_left=None, is_bottom=None, stat="diversity", label=None):
    
    source = ColumnDataSource(annotated_data[chrom])
    
    if label is None:
        label = stat
    
    hover = HoverTool(
        tooltips=[
            ("Position", "@start{0a.000}-@stop{0a.000}"),
            (label, "$y"),
            ("contig", chrom), 
            ("gene(s)", "@gene")],
        mode="mouse")
    
    subplot.add_tools(hover)
    
    subplot.circle("midpoint", 
                   stat,
                   source=source,
                   size=3, 
                   color="navy", 
                   alpha=0.5)
    
    if is_left:
        subplot.yaxis[0].axis_label = label
    if is_bottom:
        subplot.xaxis[0].axis_label = "genome position (bp)"

Custom layout function

Anopheles have 3 chromosomes (2, 3 and X). The autosomes are split into "left" and "right" arms (i.e. 2R, 2L, 3R, 3L).

We use the custom layout function to move the Y-axis of the left arms to the right hand side.

In [8]:
# Custom changes to axes that pertain to genome not data
# eg in this case move left chromosome arm axes to RHS
def anopheles_plot(chrom, subplot):
    if chrom.endswith("L"):
        subplot.yaxis.visible = False
        subplot.add_layout(LinearAxis(), 'right')

Layout

The layout string below means that we place two plots on the top row, and 3 on the bottom. 'o' represents a plot, and '|' a line break. Gaps can be left using '.'

A little bit of trial and error may be required to get this right!

In [9]:
gf = GenomePlot(fasta=phase1_ar3.genome_fn,              # filepath to fasta
                contigs=("2R", "2L", "3R", "3L", "X"),   # contigs to display in order
                layout="oo|ooo",                         # layout string
                pfunc=anopheles_plot)                    # custom layout function

Any further arguments to the plotting function can be passed as **kwargs to the apply() method.

Tajima's D

In [10]:
gf.apply(stat_plot, stat="tajimaD", label="Tajima'sD")

Wattersons $\Theta$

In [11]:
gf.apply(stat_plot, stat="theta", label="\u0398W")

Changing the plotting function

Here, instead of the circles shown previously, we present nucleotide diversity as a line, by changing the plotting function.

Note we also change the HoverTool to use a vline, instead of a mouseover.

In [12]:
def stat_plot_line(chrom, subplot, is_left=None, is_bottom=None, stat="diversity", label=None):
    
    source = ColumnDataSource(annotated_data[chrom])
    
    if label is None:
        label = stat
    
    hover = HoverTool(
        tooltips=[
            ("Position", "@start{0a.000}-@stop{0a.000}"),
            (label, "$y"),
            ("contig", chrom), 
            ("gene(s)", "@gene")],
        mode="vline")
    
    subplot.add_tools(hover)
    
    subplot.line("midpoint",
                 stat,
                 source=source,
                 color="navy",
                 alpha=0.5)

    if is_left:
        subplot.yaxis[0].axis_label = label
    if is_bottom:
        subplot.xaxis[0].axis_label = "position (bp)"

The nucleotide diversity ($\Pi$) plot:

In [13]:
gf.apply(stat_plot_line, stat="diversity", label="\u03A0")

Example 2: Plasmodium falciparum

Pl. falciparum is known to have a highly AT rich genome- here I am going to plot the % GC content in windows across the genome using the reference genome, downloaded from PlasmoDB.

In [14]:
plasmo_fn = "../assets/PlasmoDB-35_Pfalciparum3D7_Genome.fasta"
In [15]:
fa = pyfasta.Fasta(plasmo_fn, key_fn=lambda y: y.split(" | ")[0])
In [16]:
list(fa.keys())
Out[16]:
['Pf3D7_09_v3',
 'Pf3D7_11_v3',
 'Pf3D7_03_v3',
 'Pf3D7_07_v3',
 'Pf3D7_06_v3',
 'Pf3D7_08_v3',
 'Pf3D7_04_v3',
 'Pf_M76611',
 'Pf3D7_14_v3',
 'Pf3D7_12_v3',
 'Pf3D7_01_v3',
 'Pf3D7_02_v3',
 'Pf3D7_API_v3',
 'Pf3D7_05_v3',
 'Pf3D7_10_v3',
 'Pf3D7_13_v3']

This function calculates the proportion of GC bases in windows.

In [17]:
def calc_gc(genome, window_size=10000):
    
    clen = len(genome)
    starts = np.arange(0, clen, window_size)
    stops = starts + window_size
    
    propgc = np.zeros(starts.shape[0])
    
    for i, ix in enumerate(starts):
        r = genome[ix:(ix + window_size)]
        propgc[i] = (r.count("C") + r.count("G")) / (len(r) - r.count("N"))
        
    return pd.DataFrame.from_dict({"start": starts, "stop": stops, "gc": propgc * 100})
In [18]:
gc_data = {c: calc_gc(fa[c]) for c in fa.keys()}

This is our plotting function, no extra arguments this time. Also we don't add a custom plot layout function in this case.

In [19]:
def plot_gc_data(chrom, subplot, is_left=False, is_bottom=False):
    
    source = ColumnDataSource(gc_data[chrom])
    
    hover = HoverTool(
        tooltips=[
            ("Position", "@start{0a.00}-@stop{0a.00}"),
            ("%GC", "$y"),
            ("contig", chrom)],
        mode="vline")
    
    subplot.add_tools(hover)
    
    subplot.line("start",
                 "gc",
                 source=source,
                 color="navy", 
                 alpha=0.5)

    if is_bottom:
        subplot.xaxis[0].axis_label = "position (bp)"
    
    if is_left:
        subplot.yaxis[0].axis_label = "%GC"
In [20]:
clen = {c: len(fa[c]) for c in fa.keys()}
clen
Out[20]:
{'Pf3D7_01_v3': 640851,
 'Pf3D7_02_v3': 947102,
 'Pf3D7_03_v3': 1067971,
 'Pf3D7_04_v3': 1200490,
 'Pf3D7_05_v3': 1343557,
 'Pf3D7_06_v3': 1418242,
 'Pf3D7_07_v3': 1445207,
 'Pf3D7_08_v3': 1472805,
 'Pf3D7_09_v3': 1541735,
 'Pf3D7_10_v3': 1687656,
 'Pf3D7_11_v3': 2038340,
 'Pf3D7_12_v3': 2271494,
 'Pf3D7_13_v3': 2925236,
 'Pf3D7_14_v3': 3291936,
 'Pf3D7_API_v3': 34250,
 'Pf_M76611': 5967}

The GenomePlot class stuggles to present data where some chromosomes are vastly different sizes, so here we exclude the API and the mitochondrial chromosomes.

In [21]:
plas_contigs = sorted(fa.keys())[:14]
In [22]:
gf2 = GenomePlot(plasmo_fn, contigs=plas_contigs, layout="ooooo|oooo|ooo|oo")

# set some parameters default work better with anopheles
gf2.plot_width_per_mb = 100
gf2.major_tick_dist = 5e5
gf2.chrom_label_func = lambda y: " ".join(y.split("_")[:2])
In [23]:
gf2.apply(plot_gc_data)

Postscript

I hope this has been of some use- but please, this is very rough code, and should be used as a guideline for development, rather than a finished piece of work.

There are a few improvements I'd like to make- particularly guessing parameters and layouts, which require too much trial and error at the moment.

Any issues/questions please drop a comment. Or ask via twitter @nick_harding.

Acknowledgements

The idea for the GenomePlot() class is heavily based on code using matplotlib by @alimanfoo.

~ nh

In [ ]: