A recent request way to make a nested heatmap. Basically we want to cluster things by one distance matrix, then show the sub-components of that component under it.
For instance, we have a heatmap that is clustered by genes.
We can see there is differential expression here, such as HLA-A in low in the brain and HLA-C is high in leukocytes. However, genes are made of transcripts -- sometimes there are single transcripts, sometimes there can be hundreds (or in drosophila, tens of thousands). We want to know what is the real signal driving the values measured at the gene level.
To do this, we wanted a heatmap that then showed what makes up each gene.
To accomplish this, here is a generic piece of code that groups things on a 'major index' and plots the 'minor index'.
%matplotlib qt import numpy as np import pandas as pd major_index = 'Gene name' minor_index = 'Ensembl ID' # let's use a big dataset that is public, the illumina human bodymap summarized here: # http://genomicdbdemo.bxgenomics.com/bxaf6/genomicdb/search_all.php?t=tbl_genomicdb_usr_search_131889_92&s=1 df = pd.read_table('/home/chris/Downloads/Human Body Map.csv', sep='\t', index_col=[major_index, minor_index]) # remove the description column del df['Description'] # This is a bunch of random data pieced together # some data pre-processing df = np.log2(df) # set our undected samples to our lowest detection df[df==-1*np.inf] = df[df!=-1*np.inf].min().min() # translate our data so we have no negatives (which would screw up our addition and makes no biological sense) df+=abs(df.min().min()) major_counts = df.groupby(level=[major_index]).count() # we only want to plot samples with multiple values in the minor index multi = df[df.index.get_level_values(major_index).isin(major_counts[major_counts>=2].dropna().index)] import seaborn as sns # Let's select the most variable minor axis elements most_variable = multi.groupby(level=major_index).var().mean(axis=1).order(ascending=False) # and select the top 10 dat = multi[multi.index.get_level_values(major_index).isin(most_variable.index[:10])] # we want to cluster by our major index, and then under these plot the values of our minor index major_dat = dat.groupby(level=major_index).sum() seaborn_map = sns.clustermap(major_dat, row_cluster=True, col_cluster=True) # now we keep this clustering, but recreate our data to fit the above clustering, with our minor # index below the major index (you can think of transcript levels under gene levels if you are # a biologist) merged_dat = pd.DataFrame(columns=[seaborn_map.data2d.columns]) for major_val in seaborn_map.data2d.index: minor_rows = multi[multi.index.get_level_values(major_index)==major_val][seaborn_map.data2d.columns] major_row = major_dat.loc[major_val,][seaborn_map.data2d.columns] merged_dat.append(major_row) merged_dat = merged_dat.append(major_row).append(minor_rows) merged_map = sns.clustermap(merged_dat, row_cluster=False, col_cluster=False) # recreate our dendrogram, this is undocumented and probably a hack but it works seaborn_map.dendrogram_col.plot(merged_map.ax_col_dendrogram) # for rows, I imagine at some point it will fail to fall within the major axis but fortunately # for this dataset it is not true seaborn_map.dendrogram_row.plot(merged_map.ax_row_dendrogram)
From this, we can see interesting aspects. Such as the wide variety of isoforms with HLA-A. We can see there is a mixture of transcript classes being expressed here, and it is very different across tissue (note: I believe this is probably an artifact of sequence alignment and not true if there are any biologists reading this, this is just to illustrate a point.). We also see that FLOT1 has an isoform expressed solely in lymphocytes.
This looks quite nice on my own data, and is a fairly useful way to create quick graphics showing the general trend of your data (gene level) and being able to visually illustrate to others the underlying components.
No comments:
Post a Comment