Making interactive PCA plots¶

Using the `bokeh` library¶

In population genetics, we frequently make PCA plots- they are helpful to identify population structure, as well as performing quality control. It's often of biological or technical interest (or concern!) if a sample doesn't cluster with samples that we expect it to. A limitation of most plots is that it is awkward to work out what the sample ID is of an outlier.

Another limitation of PCA plots is that samples fall into very dense clusters, making it difficult to see exactly what is going on, with some samples completely masking others.

Interactive plots solve both of these issues. A simple mouse over reveals sample ID, and pertinent information about the sample- and we can pan and zoom in on densely populated regions.

This isn't a PCA tutorial, in fact the PCA analysis performed here was taken from Alistair Miles' excellent blog post Fast PCA. The coordinates and percent variance explained values are written as .npy files, and loaded here.

I'll assume some working knowledge of python/numpy etc. in this post. First though, if TLDR here is the final interactive PCA plot based on phase 1 of the Anopheles gambiae 1000 genomes project.

from IPython.display import HTML

HTML(filename="pca_plots.html")

import numpy as np

# load some PCA data
coordinates = np.load("3L_10000000_30000000.coords.npy")
coordinates.shape

(765, 10)

# load the PVE data
pve = np.load("3L_10000000_30000000.pve.npy")
pve

array([ 0.01200335,  0.01070692,  0.00852672,  0.00606027,  0.00494645,
        0.00357033,  0.00341437,  0.00317344,  0.00313177,  0.00299404])

We need some metadata, so we can label our points accordingly. I know that the order of this dataframe corresponds to the PCA coordinates, it's important to check this!

Here we are interested in the sample ID, the population label, country and year of collection.

import pandas as pd

df = pd.read_table(
    "/kwiat/vector/ag1000g/release/phase1.AR3/samples/samples.meta.txt",
    index_col=0)[['ox_code', 'population', 'country', 'year']]

df.head()

We define some colours to use for the different populations

pop_colours = {
    'BFM': '#FF0000',
    'GAS': '#008000',
    'GNS': '#00FFFF',
    'UGS': '#90EE90',
    'GWA': '#FFA500',
    'AOM': '#8B0000',
    'BFS': '#1E90FF',
    'KES': '#808080',
    'CMS': '#0000FF',
}

# imports required for `bokeh`
from bokeh.plotting import figure, show, ColumnDataSource, output_file
from bokeh.layouts import gridplot

This is the main function that does most of the plotting work. It takes in the coordinates, the PVE, the principal components you wish to plot, a figure object (explained later) and a list of populations to show.

The only fiddly bit is that a bokeh plot expects a source, which should be tabular. All values that you wish to annotate with must be in this table. This is why I create a copy of my metadata frame and add in the relevant principal components as x and y.

bokeh expects hex colors, so we use mpl.colors to convert from the above.

import matplotlib as mpl

def plot_pca_coords(coords, pve, pc1, pc2, fig, populations):

    x = coords[:, pc1]
    y = coords[:, pc2]
    
    qdf = df.copy()
    qdf["x"] = x
    qdf["y"] = y
    
    for pop in populations:
        
        source = ColumnDataSource(
            data=qdf.query("population == @pop"))
        
        fig.circle(
            'x', 'y', 
            source=source,
            line_color='black',
            line_width=0.5,
            size=6,
            fill_color=mpl.colors.rgb2hex(pop_colours[pop]))
    
    fig.xaxis.axis_label = 'PC {0} ({1:.2f}%)'.format(
        pc1 + 1, 100 * pve[pc1])
    
    fig.yaxis.axis_label = 'PC {0} ({1:.2f}%)'.format(
        pc2 + 1, 100 * pve[pc2])
    
    return fig

components = np.array(range(8)).reshape((2, 2, 2)).tolist()

# This defines what is displayed when the mouse hovers over a point.
# The @ values correspond to values in the table.
TOOLTIPS = [
    ("ox_code", "@ox_code"),
    ("population", "@population"),
    ("collection year", "@year"),
    ("(x, y)", "($x, $y)"),]

The below code sets up a bokeh grid to display multiple PCA components at a time. If you just want to make a single figure, you just need to create the figure and pass it to the plot_pca_coords function.

output_file("pca_plots.html")
grid = []

for row in components:
    
    l = []
    
    for (c1, c2) in row:

        p = figure(plot_width=400, plot_height=400, tooltips=TOOLTIPS)
        l.append(
            plot_pca_coords(
                coordinates, pve, c1, c2, p,
                pop_colours.keys()))
        
    grid.append(l)

col = gridplot(grid)
show(col)

And that's it! Check out the box zoom, I find that super useful.

HTML(filename="pca_plots.html")

	ox_code	population	country	year
index
0	AB0085-C	BFS	Burkina Faso	2012
1	AB0087-C	BFM	Burkina Faso	2012
2	AB0088-C	BFM	Burkina Faso	2012
3	AB0089-C	BFM	Burkina Faso	2012
4	AB0090-C	BFM	Burkina Faso	2012

Making interactive PCA plots¶

Using the bokeh library¶

Using the `bokeh` library¶