Tuesday, May 26, 2015

Djangui, empower your python scripts with Django!

Djangui is a Django app which converts scripts written in argparse to a web based interface.

In short, it converts this:





Into this:




If you just want to go out and play with it now, you can do so here:

https://djangui.herokuapp.com

Note: this is on a free heroku plan, so it may take ~20 seconds to spin up or you may have to refresh if the site is initially unresponsive.

And if you like what you see, you can see how to install it here:


So, what is it?

I quite often have to do things which are minimally related to my day to day work, usually involving handling other's data. In biology, there is an exponential increase in the datasets, and most biologists are limited by what can be opened in excel. Even then, they are limited to small datasets because even if you can open 1 million rows in your latest excel, doing useful analysis of it is another story (such as waiting 5 hours to group things).

So basic tasks, like:

  • I want to make a heatmap
  • I want to do a GO analysis
  • I want to do a PCA analysis
  • I want to group by X, and make a histogram
and other similar tasks, are out of reach. Traditionally, you sought out your resident bioinformatician or someone who 'I've seen use the command line before', or tried to find something online.

Because I didn't want to make heatmaps for people (which also, makes them realize making a heatmap of 80,000 rows x 50 columns is pointless unless you are displaying your graph on an imax), I wanted to make a system which enables others to use my scripts without my involvement. There are existing tools to do this with the interests of biologists in mind, such as galaxy or taverna. However, my fundamental issue with these programs is they make me write a wrapper around my scripts in their own markup language. I don't want to learn a platform specific language which has no relevant role in my day-to-day life. Knowing this language is purely altruistic on my end, and doesn't expedite my workday any.

Enter, Djangui

So, there is a wonderful library for building command line scripts in python, argparse. Given its fairly structured nature, I thought it would be fairly straight forward to map these structures to a web form equivalent (it wasn't too hard). This project was initially inspired by what power sandman could leverage over a sql database in a single command. Guided by this 'it should be simple' ideology, djangui will fully bootstrap a Django project with djangui pre-installed for you in a single command:


djanguify.py -p ProjectName

From here, Djangui leverages the awesome Django admin to control adding scripts as well as managing their organization. Scripts are classified into groups, such as plotting tools/etc., and user access to scripts can be controlled at the script group as well as the script level.




A Sample Script

After adding a script, your basic interface appears as so:



There are three aspects of the front-end:
  • Your script groups
  • Your currently selected script: We are currently using the 'find cats' script, which will find you cats.
  • A summary of past, on-going, and queued jobs. Tasks can be further viewed in here by clicking on their task name (its also a datatables table, so it's nicely sortable and queryable). 


Task Viewing

The task viewer has several useful feature. From the aforementioned 'find cats' script (also: how to cheaply get undeserved attention from the internet):

File Previews:

Currently there are three types of files we attempt to identify and embed. I'm considering ways to make this more of a plug-and-play style from the admin.
  • Images
    • From our find cats scripts
    • An incomprehensible heatmap: (images naturally have a lightbox)


  • Fasta Files: (a format for DNA/RNA/protein sequences)


  • Tabular Files:



Job Cloning:


Complex scripts may have several options to enable, a job can be cloned and its existing values will be pre-filled into a script for the user to submit.

And other obvious options like job deletion, resubmission, and downloading the contents of the job as a zip or tar.gz file.

Job Resubmission:

Made a stupid mistake in a script you uploaded? Fix the script, then you can simply resubmit the job.

Anything else?

Djangui is a work in progress. It supports ephemeral file systems, so you can use platforms such as Heroku with an Amazon S3 data store. This also allows you to host your worker on a separate node from your main web server. It also supports customization of the interface. If you know a bit of Django, you can override the template system with your own to customize its interface (or extend certain elements of it). Also, it is a pluggable app, meaning you can attach it to an existing Django project simply.

A Flask Alternative

In the time I was making this, another project emerged called Wooey which is based on Flask (ironically enough Martin also works in the scientific realm, so it seems we deal with similar issues). It has a slick interface, but has a few differences I went against such as storing the scripts as local json files instead of in the database. Thankfully, he chose a different avenue for loading scripts (using AST), which was a great boon for handling cases my approach to parsing argparse scripts would not cover. We have plans to release a library that is solely based on converting libraries such as argparse/clint to a standard markup for libraries such as Djangui and Wooey to utilize.

Pull requests and comments are welcome!

Friday, May 22, 2015

Minimizing render times of shared Django forms

A common situation with Django sites is the need to render a given form across all pages, such as a login-form that is embedded in the header. There is a recipe I came upon, probably from stackoverflow, that has some derivation of the following pattern:

# as a context_processor
from .forms import SomeLoginForm

def loginFormProcessor(request):
    ctx = {}
    if not request.user.is_authenticated():
        ctx['login_form'] = SomeLoginForm
    return ctx

# your template
{% if not request.user.is_authenticated %}
    {% crispy login_form %}
{% endif %}

I was using this pattern for a rather complicated form without thinking about the overhead incurred. However, when new-relic revealed this was taking ~600 ms per render, I knew it had to be fixed.

 The simplest solution is template caching, making our template look like so:

# your template
{% load cache %}
{% if not request.user.is_authenticated %}
  {% cache 99999 login_form_cache %}
    {% crispy login_form %}
  {% endcache %}
{% endif %}


The problem with this is we still incur the overhead in our context processor. We can avoid this by doing all our work within the cache tag. First, we need to move the logic of generating the form out of the context processor and into a template_tag.

# our template_tag.py file
@register.assignment_tag
def get_login_forms():
    from ..forms import StepOne, StepTwo, StepThree
    ctx = {}
    ctx['first'] = StepOne
    ctx['second'] = StepTwo
    ctx['third'] = StepThree
    return Context(ctx)

Now, we need to integrate this tag into our text, so our final template looks like the following (this is also more related to my particular example where I have a multi-stepped form):

# our template file
{% load cache our_tags %}
{% if not request.user.is_authenticated %}
  {% cache 99999 login_form_cache %}
    {% get_login_forms as modal_login_forms %}
    {% crispy modal_login_forms.first %}
    {% crispy modal_login_forms.second %}
    {% crispy modal_login_forms.third %}
  {% endcache %}
{% endif %}

This alone made the server response time come from ~2-3  seconds down to 0.69 seconds.  Not too shabby.

Note: This code should run but I didn't test it as it isn't exactly my code copy & pasted, but an example.

Wednesday, May 20, 2015

Nested Heatmaps in Pandas

I kind of hate heatmaps . They look pretty, but they don't really mean anything. There are so many ways to torture your distance matrix to give you wildly different results, that I often just skip over them in papers. But, biologists love heatmaps. So, here I am.

A recent request way to make a nested heatmap. Basically we want to cluster things by one distance matrix, then show the sub-components of that component under it.

For instance, we have a heatmap that is clustered by genes.



We can see there is differential expression here, such as HLA-A in low in the brain and HLA-C is high in leukocytes. However, genes are made of transcripts -- sometimes there are single transcripts, sometimes there can be hundreds (or in drosophila, tens of thousands). We want to know what is the real signal driving the values measured at the gene level.

To do this, we wanted a heatmap that then showed what makes up each gene.

To accomplish this, here is a generic piece of code that groups things on a 'major index' and plots the 'minor index'.

%matplotlib qt
import numpy as np
import pandas as pd
major_index = 'Gene name'
minor_index = 'Ensembl ID'
# let's use a big dataset that is public, the illumina human bodymap summarized here:
# http://genomicdbdemo.bxgenomics.com/bxaf6/genomicdb/search_all.php?t=tbl_genomicdb_usr_search_131889_92&s=1
df = pd.read_table('/home/chris/Downloads/Human Body Map.csv', sep='\t', index_col=[major_index, minor_index])
# remove the description column
del df['Description']
# This is a bunch of random data pieced together
# some data pre-processing
df = np.log2(df)
# set our undected samples to our lowest detection
df[df==-1*np.inf] = df[df!=-1*np.inf].min().min()
# translate our data so we have no negatives (which would screw up our addition and makes no biological sense)
df+=abs(df.min().min())
major_counts = df.groupby(level=[major_index]).count()
# we only want to plot samples with multiple values in the minor index
multi = df[df.index.get_level_values(major_index).isin(major_counts[major_counts>=2].dropna().index)]

import seaborn as sns
# Let's select the most variable minor axis elements
most_variable = multi.groupby(level=major_index).var().mean(axis=1).order(ascending=False)
# and select the top 10
dat = multi[multi.index.get_level_values(major_index).isin(most_variable.index[:10])]
# we want to cluster by our major index, and then under these plot the values of our minor index
major_dat = dat.groupby(level=major_index).sum()
seaborn_map = sns.clustermap(major_dat, row_cluster=True, col_cluster=True)
# now we keep this clustering, but recreate our data to fit the above clustering, with our minor
# index below the major index (you can think of transcript levels under gene levels if you are
# a biologist)
merged_dat = pd.DataFrame(columns=[seaborn_map.data2d.columns])
for major_val in seaborn_map.data2d.index:
    minor_rows = multi[multi.index.get_level_values(major_index)==major_val][seaborn_map.data2d.columns]
    major_row = major_dat.loc[major_val,][seaborn_map.data2d.columns]
    merged_dat.append(major_row)
    merged_dat = merged_dat.append(major_row).append(minor_rows)
merged_map = sns.clustermap(merged_dat, row_cluster=False, col_cluster=False)

# recreate our dendrogram, this is undocumented and probably a hack but it works
seaborn_map.dendrogram_col.plot(merged_map.ax_col_dendrogram)

# for rows, I imagine at some point it will fail to fall within the major axis but fortunately
# for this dataset it is not true
seaborn_map.dendrogram_row.plot(merged_map.ax_row_dendrogram)



From this, we can see interesting aspects. Such as the wide variety of isoforms with HLA-A.  We can see there is a mixture of transcript classes being expressed here, and it is very different across tissue (note: I believe this is probably an artifact of sequence alignment and not true if there are any biologists reading this, this is just to illustrate a point.). We also see that FLOT1 has an isoform expressed solely in lymphocytes.

This looks quite nice on my own data, and is a fairly useful way to create quick graphics showing the general trend of your data (gene level) and being able to visually illustrate to others the underlying components.