Making interactive PCA plots

Using the bokeh library

In population genetics, we frequently make PCA plots- they are helpful to identify population structure, as well as performing quality control. It's often of biological or technical interest (or concern!) if a sample doesn't cluster with samples that we expect it to. A limitation of most plots is that it is awkward to work out what the sample ID is of an outlier.

Another limitation of PCA plots is that samples fall into very dense clusters, making it difficult to see exactly what is going on, with some samples completely masking others.

Interactive plots solve both of these issues. A simple mouse over reveals sample ID, and pertinent information about the sample- and we can pan and zoom in on densely populated regions.

This isn't a PCA tutorial, in fact the PCA analysis performed here was taken from Alistair Miles' excellent blog post Fast PCA. The coordinates and percent variance explained values are written as .npy files, and loaded here.

I'll assume some working knowledge of python/numpy etc. in this post. First though, if TLDR here is the final interactive PCA plot based on phase 1 of the Anopheles gambiae 1000 genomes project.

In [1]:
from IPython.display import HTML
In [2]:
HTML(filename="pca_plots.html")
Out[2]:
Bokeh Plot
In [3]:
import numpy as np
In [4]:
# load some PCA data
coordinates = np.load("3L_10000000_30000000.coords.npy")
coordinates.shape
Out[4]:
(765, 10)
In [5]:
# load the PVE data
pve = np.load("3L_10000000_30000000.pve.npy")
pve
Out[5]:
array([ 0.01200335,  0.01070692,  0.00852672,  0.00606027,  0.00494645,
        0.00357033,  0.00341437,  0.00317344,  0.00313177,  0.00299404])

We need some metadata, so we can label our points accordingly. I know that the order of this dataframe corresponds to the PCA coordinates, it's important to check this!

Here we are interested in the sample ID, the population label, country and year of collection.

In [6]:
import pandas as pd
In [7]:
df = pd.read_table(
    "/kwiat/vector/ag1000g/release/phase1.AR3/samples/samples.meta.txt",
    index_col=0)[['ox_code', 'population', 'country', 'year']]
In [8]:
df.head()
Out[8]:
ox_code population country year
index
0 AB0085-C BFS Burkina Faso 2012
1 AB0087-C BFM Burkina Faso 2012
2 AB0088-C BFM Burkina Faso 2012
3 AB0089-C BFM Burkina Faso 2012
4 AB0090-C BFM Burkina Faso 2012

We define some colours to use for the different populations

In [9]:
pop_colours = {
    'BFM': '#FF0000',
    'GAS': '#008000',
    'GNS': '#00FFFF',
    'UGS': '#90EE90',
    'GWA': '#FFA500',
    'AOM': '#8B0000',
    'BFS': '#1E90FF',
    'KES': '#808080',
    'CMS': '#0000FF',
}
In [10]:
# imports required for `bokeh`
from bokeh.plotting import figure, show, ColumnDataSource, output_file
from bokeh.layouts import gridplot

This is the main function that does most of the plotting work. It takes in the coordinates, the PVE, the principal components you wish to plot, a figure object (explained later) and a list of populations to show.

The only fiddly bit is that a bokeh plot expects a source, which should be tabular. All values that you wish to annotate with must be in this table. This is why I create a copy of my metadata frame and add in the relevant principal components as x and y.

bokeh expects hex colors, so we use mpl.colors to convert from the above.

In [11]:
import matplotlib as mpl

def plot_pca_coords(coords, pve, pc1, pc2, fig, populations):

    x = coords[:, pc1]
    y = coords[:, pc2]
    
    qdf = df.copy()
    qdf["x"] = x
    qdf["y"] = y
    
    for pop in populations:
        
        source = ColumnDataSource(
            data=qdf.query("population == @pop"))
        
        fig.circle(
            'x', 'y', 
            source=source,
            line_color='black',
            line_width=0.5,
            size=6,
            fill_color=mpl.colors.rgb2hex(pop_colours[pop]))
    
    fig.xaxis.axis_label = 'PC {0} ({1:.2f}%)'.format(
        pc1 + 1, 100 * pve[pc1])
    
    fig.yaxis.axis_label = 'PC {0} ({1:.2f}%)'.format(
        pc2 + 1, 100 * pve[pc2])
    
    return fig
In [12]:
components = np.array(range(8)).reshape((2, 2, 2)).tolist()
In [13]:
# This defines what is displayed when the mouse hovers over a point.
# The @ values correspond to values in the table.
TOOLTIPS = [
    ("ox_code", "@ox_code"),
    ("population", "@population"),
    ("collection year", "@year"),
    ("(x, y)", "($x, $y)"),]

The below code sets up a bokeh grid to display multiple PCA components at a time. If you just want to make a single figure, you just need to create the figure and pass it to the plot_pca_coords function.

In [14]:
output_file("pca_plots.html")
grid = []

for row in components:
    
    l = []
    
    for (c1, c2) in row:

        p = figure(plot_width=400, plot_height=400, tooltips=TOOLTIPS)
        l.append(
            plot_pca_coords(
                coordinates, pve, c1, c2, p,
                pop_colours.keys()))
        
    grid.append(l)

col = gridplot(grid)
show(col)

And that's it! Check out the box zoom, I find that super useful.

In [15]:
HTML(filename="pca_plots.html")
Out[15]:
Bokeh Plot