Wednesday, May 20, 2015

Nested Heatmaps in Pandas

I kind of hate heatmaps . They look pretty, but they don't really mean anything. There are so many ways to torture your distance matrix to give you wildly different results, that I often just skip over them in papers. But, biologists love heatmaps. So, here I am.

A recent request way to make a nested heatmap. Basically we want to cluster things by one distance matrix, then show the sub-components of that component under it.

For instance, we have a heatmap that is clustered by genes.



We can see there is differential expression here, such as HLA-A in low in the brain and HLA-C is high in leukocytes. However, genes are made of transcripts -- sometimes there are single transcripts, sometimes there can be hundreds (or in drosophila, tens of thousands). We want to know what is the real signal driving the values measured at the gene level.

To do this, we wanted a heatmap that then showed what makes up each gene.

To accomplish this, here is a generic piece of code that groups things on a 'major index' and plots the 'minor index'.

%matplotlib qt
import numpy as np
import pandas as pd
major_index = 'Gene name'
minor_index = 'Ensembl ID'
# let's use a big dataset that is public, the illumina human bodymap summarized here:
# http://genomicdbdemo.bxgenomics.com/bxaf6/genomicdb/search_all.php?t=tbl_genomicdb_usr_search_131889_92&s=1
df = pd.read_table('/home/chris/Downloads/Human Body Map.csv', sep='\t', index_col=[major_index, minor_index])
# remove the description column
del df['Description']
# This is a bunch of random data pieced together
# some data pre-processing
df = np.log2(df)
# set our undected samples to our lowest detection
df[df==-1*np.inf] = df[df!=-1*np.inf].min().min()
# translate our data so we have no negatives (which would screw up our addition and makes no biological sense)
df+=abs(df.min().min())
major_counts = df.groupby(level=[major_index]).count()
# we only want to plot samples with multiple values in the minor index
multi = df[df.index.get_level_values(major_index).isin(major_counts[major_counts>=2].dropna().index)]

import seaborn as sns
# Let's select the most variable minor axis elements
most_variable = multi.groupby(level=major_index).var().mean(axis=1).order(ascending=False)
# and select the top 10
dat = multi[multi.index.get_level_values(major_index).isin(most_variable.index[:10])]
# we want to cluster by our major index, and then under these plot the values of our minor index
major_dat = dat.groupby(level=major_index).sum()
seaborn_map = sns.clustermap(major_dat, row_cluster=True, col_cluster=True)
# now we keep this clustering, but recreate our data to fit the above clustering, with our minor
# index below the major index (you can think of transcript levels under gene levels if you are
# a biologist)
merged_dat = pd.DataFrame(columns=[seaborn_map.data2d.columns])
for major_val in seaborn_map.data2d.index:
    minor_rows = multi[multi.index.get_level_values(major_index)==major_val][seaborn_map.data2d.columns]
    major_row = major_dat.loc[major_val,][seaborn_map.data2d.columns]
    merged_dat.append(major_row)
    merged_dat = merged_dat.append(major_row).append(minor_rows)
merged_map = sns.clustermap(merged_dat, row_cluster=False, col_cluster=False)

# recreate our dendrogram, this is undocumented and probably a hack but it works
seaborn_map.dendrogram_col.plot(merged_map.ax_col_dendrogram)

# for rows, I imagine at some point it will fail to fall within the major axis but fortunately
# for this dataset it is not true
seaborn_map.dendrogram_row.plot(merged_map.ax_row_dendrogram)



From this, we can see interesting aspects. Such as the wide variety of isoforms with HLA-A.  We can see there is a mixture of transcript classes being expressed here, and it is very different across tissue (note: I believe this is probably an artifact of sequence alignment and not true if there are any biologists reading this, this is just to illustrate a point.). We also see that FLOT1 has an isoform expressed solely in lymphocytes.

This looks quite nice on my own data, and is a fairly useful way to create quick graphics showing the general trend of your data (gene level) and being able to visually illustrate to others the underlying components.

No comments:

Post a Comment