Biologists and bioinformaticians have different software needs

I attended the Bioinformatics Open Source Conference last week in Dublin. Galaxy and Docker were the buzzwords of the conference. A recurring theme was grounding our bioinformatics back in biology, starting right at the beginning with Holly Bik’s keynote, Bioinformatics: still a scary word for biologists. I missed the keynote, but Holly was active in the #BOSC2015 tweetstream, reminding people to consider how the other half lives.

After following the tweets and reading Kai Blin’s blog post, I felt I had to throw my thousand words at the topic of biologists versus bioinformaticians. I assume that all biologists are or will soon be computational biologists, and the distinction will become irrelevant.

(Computational) Biologists: create and use software to support specific datasets

Bioinformaticians: create and use software to support generalized datasets

The two functions have very different interests in software. 

Computational biologists are interested in results

A computational biologist asks, “How can I use this software or pipeline to interrogate my data?” Her interest is in results. To her, it doesn’t matter if you use Docker, a chain of bash scripts, a Java application, or a web service. She needs to understand what the tool is doing so that she can form and test her hypotheses appropriately.

Computational biologists use “toolbox” software: Bioconductor, R, Perl, IPython notebooks, Galaxy, or pretty much anything that will get the job done.

Day in the life of a computational biologist

You have a file full of aligned reads from your favourite disease model organism and you want to call variants. You’re hoping to find a SNP or small indel in one of a handful of genes, which will be the focus of your next paper. The Genome Analysis Toolkit (GATK) is the standard for variant calling in your field. Should be pretty straightforward. But to get the reads ready for processing, you need to:

  • Convert SAM to BAM
  • Sort in chromosome order and create an index
  • Add read groups
  • Mark duplicates

That’s four different command line tools. Then you can start the GATK best practices, which require about five more commands just to get raw variants. And while the GATK is complaining “Input file reads and reference have incompatible contigs”, you’re not thinking about Docker or the cloud. You’re thinking, “Just give me my stinking variants, you piece of sh…software.”
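
For the record, the preprocessing looks something like this sketch with samtools and Picard. The file names and read-group values are invented, and the exact flags vary by version:

```bash
# Rough sketch of the four preprocessing steps (all names are placeholders)
samtools view -bS reads.sam > reads.bam                      # SAM to BAM
samtools sort reads.bam -o reads.sorted.bam                  # chromosome order
samtools index reads.sorted.bam                              # create an index
java -jar picard.jar AddOrReplaceReadGroups \
    I=reads.sorted.bam O=reads.rg.bam \
    RGID=1 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=sample1   # add read groups
java -jar picard.jar MarkDuplicates \
    I=reads.rg.bam O=reads.dedup.bam M=dup_metrics.txt       # mark duplicates
```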

Bioinformaticians are interested in methods

A bioinformatician asks, “How can I build this software so that I can use it for many different datasets with minimum hassle?” Her interest is in methods (and ideally results!). She wants to write the tool once and have it run without support requests for ten years (okay, three months). It should produce good-quality results for a variety of datasets.

Bioinformaticians use “building blocks” software: Hadoop, cloud technologies, Docker, RESTful web services, NoSQL databases. 

Day in the life of a bioinformatician

You’ve sorted out the vagaries of GATK, and that makes you a local hero in your lab. Pretty soon you have grad students at your desk with plaintive looks. “Can you call variants for me?” You make little tweaks to your original script and produce good variants for all of them. 

But soon, someone else wants to use your script. You send it to them, and all of a sudden you’re in some kind of IT role. You discover hardcoded paths that you definitely don’t remember putting in, and for some reason Java throws up a JVM error on their machine, and you have some obscure library in your path that nobody else has. “If only,” you think. “If only I’d put this in Docker!”
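
If the wish had come true, it might have looked something like this minimal sketch, which freezes the script and its dependencies into an image. The script name, base image, and packages are all assumptions for illustration:

```bash
# Hypothetical sketch: bake the script and its environment into an image so
# hardcoded paths and stray libraries stop surprising other people's machines
cat > Dockerfile <<'EOF'
FROM ubuntu:14.04
RUN apt-get update && apt-get install -y openjdk-7-jre-headless samtools
COPY call_variants.sh /usr/local/bin/call_variants.sh
RUN chmod +x /usr/local/bin/call_variants.sh
ENTRYPOINT ["/usr/local/bin/call_variants.sh"]
EOF
docker build -t variant-caller .
docker run --rm -v "$PWD":/data variant-caller /data/sample.bam
```

Anyone with Docker installed gets the same Java, the same libraries, and the same paths, no IT role required.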

Not stirring the pot

As I indicated (somewhat facetiously) in my examples, most people in the field do a mixture of the two jobs. Neither writing reproducible code nor level of coding skill has anything to do with these definitions. Which camp we fall into depends on the needs of the task at hand.

I presented these definitions at my institution and I had several people insist afterwards that they were bioinformaticians, no matter what definition I gave, thank you very much. Labels matter to the people in the jobs. My point is purely about software.

It’s worth considering who is going to be using the software you develop. If you’re developing for bioinformaticians, go ahead and brag about decreasing runtime by massively parallelizing with Hadoop and Amazon. If your target is that computational biologist who wants her answer, don’t distract her with infrastructure. Just give her those stinking variants.

Converting an open source project to Docker: why bother?

Stuart Watt from Princess Margaret Hospital in Toronto has written a tracker app intended to replicate the experience of Google Docs spreadsheets for research study project managers in a private, local install. The webapp currently supports spreadsheets for multiple projects, fine-grained permissions handling, multiple views on the same sheet, and, of course, export to Excel.

A project manager at OICR leading a multi-centre project approached me about the tracker a few weeks ago. Currently, she keeps several spreadsheets up to date and passes them around to keep track of hundreds of samples in various stages of sample preparation, sequencing, and analysis. For many clinicians, technicians, and project and people managers, Excel spreadsheets are the tool of choice for sharing data and information about samples and projects. They are flexible and simple for presenting and organizing information. Unfortunately, they’re time-consuming to keep up to date and they can corrupt data. The tracker looked like a possible time (and tear) saver.

My first concern was assessing the tracker and enabling the project manager to try it out. Plan A was to launch a virtual machine, open a port on it, and give her my machine’s address. Halfway through building the VM (while my machine was crawling along), I realized I didn’t particularly want to email the image around or host the VM on my local machine forever.

So, on to Plan B: Docker
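
In practice, Plan B is only a couple of commands, assuming the tracker repository ships a Dockerfile. The image name and port here are invented for illustration:

```bash
# Build the tracker image and run it, publishing the webapp port
# so the project manager can reach it from her own machine
docker build -t tracker .
docker run -d -p 8080:8080 --name tracker-demo tracker
# then she points her browser at http://<my-host>:8080
```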

Read More »

Creating Phylogenetic Trees (a Happy Ending)

You may remember that I posted a few weeks ago about how to create phylogenetic trees out of similar genes using SeaView and RAxML. To recap briefly: I created a multiple sequence alignment from FASTA files, removed all the gaps so that only the substitutions were left, and then ran it through RAxML to produce the trees. Unfortunately, I couldn’t get the trees to open with TreeView.
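
For reference, the tree-building step was roughly the following; a sketch with invented file and run names, using standard raxmlHPC options:

```bash
# Build a maximum-likelihood tree from the gap-stripped alignment
# (file name, run name, and random seed are placeholders)
raxmlHPC -s alignment_nogaps.phy -n cog_tree -m GTRGAMMA -p 12345
# the resulting trees land in files named RAxML_*.cog_tree
```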

I spoke to an expert the next day and discovered that TreeView has certain dependencies on Ubuntu that were difficult to resolve, so the answer was… to use Windows. I have to admit, it seems like a pretty funny answer to the question, given how much better Ubuntu is than Windows for bioinformatics tasks. Even Bio-Linux, the Linux distribution built especially for bioinformatics, was unable to open my phylogenetic tree.

Anyway, here are two different views of the same tree for you to enjoy, with highlighting and legends added in MS Paint.

Creating Phylogenetic Trees

Our assignment for the past week has been to create phylogenetic trees from multiple sequence alignments based on clusters of orthologous genes (COGs); specifically, to decide why a simple BLAST search was unable to accurately place a subject gene from Cryptosporidium parvum into a COG category.
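
For context, the BLAST search in question would be something like the sketch below, assuming a local protein database built from the COG sequences; all file and database names are placeholders:

```bash
# Build a protein BLAST database from the COG sequences, then search it
# with the C. parvum query (names invented for illustration)
makeblastdb -in cog_proteins.fasta -dbtype prot -out cog
blastp -query cparvum_gene.fasta -db cog -evalue 1e-5 -outfmt 6 -out hits.tsv
```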

I think this assignment is an interesting exercise in ‘real’ bioinformatics: where data is messy, the programs are challenging to install and use, and in the end you’re not quite sure what you ended up with, but it’s enormous fun anyway!

Read More »

Selfishness in study

[Image by Psy3330 W10 (CC-BY-SA-3.0 or GFDL), from Wikimedia Commons: “Hopefully avoiding this during my degree”]
My contract ended in September, and I started a Master’s degree. The end of my contract was harrowing, to say the least, and the first week of being back in lectures was completely exhausting. But now that I consider the differences, I’ve come to a conclusion: studying is a selfish business.

Read More »

Generating plots and correlation coefficients with PostgreSQL and R

The major task of today was to generate some correlation coefficients showing that our approach to inferring data was consistent with established results. One of my colleagues generated a plot a few months (years?) ago that showed a respectable correlation of 0.86. Unfortunately, there are only 11 points on the plot, when there should be closer to 500,000. In the many presentations I’ve seen on this topic, that correlation slide is always questioned.

Fortunately, all of this data is available in our PostgreSQL database. Unfortunately, it was an adventure in several languages and programs that I tend to avoid: Perl, vi, and especially R.
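
The core of it boils down to something like this sketch, under invented database, table, and column names: dump the paired values out of PostgreSQL, then let R compute the Pearson correlation:

```bash
# Export the paired values as CSV, then correlate them in R
# (database, table, and column names are placeholders)
psql -d projectdb -At -F',' \
     -c "SELECT observed, inferred FROM expression_pairs" > pairs.csv
Rscript -e 'd <- read.csv("pairs.csv", header = FALSE); cat(cor(d$V1, d$V2), "\n")'
```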
Read More »