bokeh
library¶In population genetics, we frequently make PCA plots- they are helpful to identify population structure, as well as performing quality control. It's often of biological or technical interest (or concern!) if a sample doesn't cluster with samples that we expect it to. A limitation of most plots is that it is awkward to work out what the sample ID is of an outlier.
Another limitation of PCA plots is that samples fall into very dense clusters, making it difficult to see exactly what is going on, with some samples completely masking others.
Interactive plots solve both of these issues. A simple mouse over reveals sample ID, and pertinent information about the sample- and we can pan and zoom in on densely populated regions.
This isn't a PCA tutorial, in fact the PCA analysis performed here was taken from Alistair Miles' excellent blog post Fast PCA. The coordinates and percent variance explained values are written as .npy
files, and loaded here.
I'll assume some working knowledge of python/numpy etc. in this post. First though, if TLDR here is the final interactive PCA plot based on phase 1 of the Anopheles gambiae 1000 genomes project.
from IPython.display import HTML
HTML(filename="pca_plots.html")
import numpy as np
# load some PCA data
coordinates = np.load("3L_10000000_30000000.coords.npy")
coordinates.shape
# load the PVE data
pve = np.load("3L_10000000_30000000.pve.npy")
pve
We need some metadata, so we can label our points accordingly. I know that the order of this dataframe corresponds to the PCA coordinates, it's important to check this!
Here we are interested in the sample ID, the population label, country and year of collection.
import pandas as pd
df = pd.read_table(
"/kwiat/vector/ag1000g/release/phase1.AR3/samples/samples.meta.txt",
index_col=0)[['ox_code', 'population', 'country', 'year']]
df.head()
We define some colours to use for the different populations
pop_colours = {
'BFM': '#FF0000',
'GAS': '#008000',
'GNS': '#00FFFF',
'UGS': '#90EE90',
'GWA': '#FFA500',
'AOM': '#8B0000',
'BFS': '#1E90FF',
'KES': '#808080',
'CMS': '#0000FF',
}
# imports required for `bokeh`
from bokeh.plotting import figure, show, ColumnDataSource, output_file
from bokeh.layouts import gridplot
This is the main function that does most of the plotting work. It takes in the coordinates, the PVE, the principal components you wish to plot, a figure object (explained later) and a list of populations to show.
The only fiddly bit is that a bokeh
plot expects a source, which should be tabular. All values that you wish to annotate with must be in this table. This is why I create a copy of my metadata frame and add in the relevant principal components as x
and y
.
bokeh
expects hex colors, so we use mpl.colors
to convert from the above.
import matplotlib as mpl
def plot_pca_coords(coords, pve, pc1, pc2, fig, populations):
x = coords[:, pc1]
y = coords[:, pc2]
qdf = df.copy()
qdf["x"] = x
qdf["y"] = y
for pop in populations:
source = ColumnDataSource(
data=qdf.query("population == @pop"))
fig.circle(
'x', 'y',
source=source,
line_color='black',
line_width=0.5,
size=6,
fill_color=mpl.colors.rgb2hex(pop_colours[pop]))
fig.xaxis.axis_label = 'PC {0} ({1:.2f}%)'.format(
pc1 + 1, 100 * pve[pc1])
fig.yaxis.axis_label = 'PC {0} ({1:.2f}%)'.format(
pc2 + 1, 100 * pve[pc2])
return fig
components = np.array(range(8)).reshape((2, 2, 2)).tolist()
# This defines what is displayed when the mouse hovers over a point.
# The @ values correspond to values in the table.
TOOLTIPS = [
("ox_code", "@ox_code"),
("population", "@population"),
("collection year", "@year"),
("(x, y)", "($x, $y)"),]
The below code sets up a bokeh grid to display multiple PCA components at a time.
If you just want to make a single figure, you just need to create the figure
and pass it to the plot_pca_coords
function.
output_file("pca_plots.html")
grid = []
for row in components:
l = []
for (c1, c2) in row:
p = figure(plot_width=400, plot_height=400, tooltips=TOOLTIPS)
l.append(
plot_pca_coords(
coordinates, pve, c1, c2, p,
pop_colours.keys()))
grid.append(l)
col = gridplot(grid)
show(col)
And that's it! Check out the box zoom, I find that super useful.
HTML(filename="pca_plots.html")