Biologists and bioinformaticians have different software needs

I attended the Bioinformatics Open Source Conference last week in Dublin. Galaxy and Docker were the buzzwords of the conference. A recurring theme was grounding our bioinformatics back in biology, starting right off at the beginning with Holly Bik’s keynote, “Bioinformatics: still a scary word for biologists”. I missed the keynote, but Holly was active in the #BOSC2015 tweetstream reminding people to consider how the other half lives.

After following the tweets and reading Kai Blin’s blog post, I felt I had to throw my thousand words at the topic of biologists versus bioinformaticians. I assume that all biologists are or will soon be computational biologists, and the distinction will become irrelevant.

(Computational) biologists: create and use software to support specific datasets

Bioinformaticians: create and use software to support generalized datasets

The two functions have very different interests in software. 

Computational biologists are interested in results

A computational biologist asks, “How can I use this software or pipeline to interrogate my data?” Her interest is in results. To her, it doesn’t matter if you use Docker, a chain of bash scripts, a Java application, or a web service. She needs to understand what the tool is doing so that she can form and test her hypotheses appropriately.

Computational biologists use “toolbox” software: Bioconductor, R, Perl, IPython notebooks, Galaxy, or pretty much anything that will get the job done.

Day in the life of a computational biologist

You have a file full of aligned reads from your favourite diseased model organism and you want to call variants. You’re hoping to find a SNP or small indel in one of a handful of genes, which will be the focus of your next paper.  The Genome Analysis Toolkit is standard for variant calling in your field. Should be pretty straightforward. But to get reads ready for processing, you need to:

  • Convert SAM to BAM and create an index
  • Sort in chromosome order
  • Add read groups
  • Mark duplicates
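For the curious, here is a rough sketch of what those preparation steps might look like when scripted. It only builds the command lines and does not run anything. The file names, the read group values, and the location of picard.jar are placeholders I made up for illustration, and the exact options depend on which versions of samtools and Picard you have installed. Indexing is done last because an index needs a coordinate-sorted BAM.

```python
from pathlib import Path


def prep_commands(sam, sample, picard_jar="picard.jar"):
    """Build the commands that turn an aligned SAM file into a GATK-ready BAM.

    Nothing is executed here. Each command is a list of arguments that could
    be handed to subprocess.run(cmd, check=True).
    All output names and read group values are hypothetical placeholders.
    """
    stem = Path(sam).with_suffix("")      # reads.sam -> reads
    bam = f"{stem}.bam"
    sorted_bam = f"{stem}.sorted.bam"
    rg_bam = f"{stem}.rg.bam"
    dedup_bam = f"{stem}.dedup.bam"
    picard = ["java", "-jar", picard_jar]  # assumes picard.jar is reachable

    return [
        # 1. Convert SAM to BAM
        ["samtools", "view", "-b", "-o", bam, sam],
        # 2. Sort in chromosome (coordinate) order
        ["samtools", "sort", "-o", sorted_bam, bam],
        # 3. Add read groups (values below are placeholders)
        picard + ["AddOrReplaceReadGroups", f"I={sorted_bam}", f"O={rg_bam}",
                  "RGID=1", "RGLB=lib1", "RGPL=ILLUMINA", "RGPU=unit1",
                  f"RGSM={sample}"],
        # 4. Mark duplicates
        picard + ["MarkDuplicates", f"I={rg_bam}", f"O={dedup_bam}",
                  f"M={stem}.dup_metrics.txt"],
        # 5. Index the final BAM
        ["samtools", "index", dedup_bam],
    ]


if __name__ == "__main__":
    for cmd in prep_commands("reads.sam", sample="mouse1"):
        print(" ".join(cmd))
```

Even in sketch form, that is a lot of intermediate files and flags to get right before any variant calling happens.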

That’s four different command line tools. Then you can start the GATK best practices, which require about five more commands just to get raw variants. And while the GATK is claiming “Input file reads and reference have incompatible contigs”, you’re not thinking about Docker or the cloud. You’re thinking, “Just give me my stinking variants, you piece of sh... software.”

Bioinformaticians are interested in methods

A bioinformatician asks, “How can I build this software so that I can use it for many different datasets with minimum hassle?” Her interest is in methods (and ideally results!). She wants to write this tool once and have it run without support requests for ten years (okay, three months). It should produce good quality results for a variety of datasets.

Bioinformaticians use “building blocks” software: Hadoop, cloud technologies, Docker, RESTful web services, NoSQL databases. 

Day in the life of a bioinformatician

You’ve sorted out the vagaries of GATK, and that makes you a local hero in your lab. Pretty soon you have grad students at your desk with plaintive looks. “Can you call variants for me?” You make little tweaks to your original script and produce good variants for all of them. 

But soon, someone else wants to use your script. You send it to them, and all of a sudden you’re in some kind of IT role. You discover hardcoded paths that you definitely don’t remember putting in, and for some reason Java throws up a JVM error on their machine, and you have some obscure library in your path that nobody else has. “If only!” you think. “If only I’d put this in Docker!”

Not stirring the pot

As I indicated (somewhat) facetiously in my example, most people in the field do a mixture of the two jobs. Also, neither writing reproducible code nor level of coding skill has anything to do with these definitions. Which camp we fall into depends on the needs of the task at hand.

I presented these definitions at my institution and I had several people insist afterwards that they were bioinformaticians, no matter what definition I gave, thank you very much. Labels matter to the people in the jobs. My point is purely about software.

It’s worth considering who is going to be using the software you develop. If you’re developing for bioinformaticians, go ahead and brag about decreasing runtime by massively parallelizing using Hadoop and Amazon.  If your target is that computational biologist who wants her answer, don’t distract her with infrastructure. Just give her those stinking variants.