Biologists and bioinformaticians have different software needs

I attended the Bioinformatics Open Source Conference last week in Dublin. Galaxy and Docker were the buzzwords of the conference. A recurring theme was grounding our bioinformatics back in biology, starting right at the beginning with Holly Bik’s keynote, “Bioinformatics: still a scary word for biologists”. I missed the keynote, but Holly was active in the #BOSC2015 tweetstream, reminding people to consider how the other half lives.

After following the tweets and reading Kai Blin’s blog post, I felt I had to throw my thousand words at the topic of biologists versus bioinformaticians. I assume that all biologists are or will soon be computational biologists, and the distinction will become irrelevant.

(Computational) Biologists: create and use software to support specific datasets

Bioinformaticians: create and use software to support generalized datasets

The two functions have very different interests in software. 

Computational biologists are interested in results

A computational biologist asks, “How can I use this software or pipeline to interrogate my data?” Her interest is in results. To her, it doesn’t matter if you use Docker, a chain of bash scripts, a Java application, or a web service. She needs to understand what the tool is doing so that she can form and test her hypotheses appropriately.

Computational biologists use “toolbox” software: Bioconductor, R, Perl, IPython notebooks, Galaxy, or pretty much anything that will get the job done.

Day in the life of a computational biologist

You have a file full of aligned reads from your favourite diseased model organism and you want to call variants. You’re hoping to find a SNP or small indel in one of a handful of genes, which will be the focus of your next paper.  The Genome Analysis Toolkit is standard for variant calling in your field. Should be pretty straightforward. But to get reads ready for processing, you need to:

  • Convert SAM to BAM and create an index
  • Sort in chromosome order
  • Add read groups
  • Mark duplicates

That’s four different command line tools. Then you can start the GATK best practices, which require about five more commands just to get raw variants. And while the GATK is claiming “Input file reads and reference have incompatible contigs”, you’re not thinking about Docker or the cloud. You’re thinking, “Just give me my stinking variants, you piece of shsoftware.”
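For the curious, here is roughly what those preparation steps can look like, as a minimal Python sketch that only builds the command lines and runs nothing. Several things are my assumptions rather than a recipe: the function name `prep_commands`, the output file names, a `picard` wrapper script on your PATH (many installs use `java -jar picard.jar` instead), and the placeholder read group values, including `RGPL=ILLUMINA`. I also moved indexing to the end, because `samtools index` expects a coordinate-sorted BAM.

```python
from pathlib import Path


def prep_commands(sam, sample, workdir="."):
    """Build (but do not run) the read-preparation commands as argument lists.

    Hypothetical helper: file names and read group values are placeholders.
    """
    out = Path(workdir)
    stem = Path(sam).stem
    bam = str(out / f"{stem}.bam")
    sorted_bam = str(out / f"{stem}.sorted.bam")
    rg_bam = str(out / f"{stem}.rg.bam")
    dedup_bam = str(out / f"{stem}.dedup.bam")
    metrics = str(out / f"{stem}.dup_metrics.txt")
    return [
        # 1. Convert SAM to BAM
        ["samtools", "view", "-b", "-o", bam, str(sam)],
        # 2. Sort by coordinate (chromosome order)
        ["samtools", "sort", "-o", sorted_bam, bam],
        # 3. Add read groups (placeholder values; assumes a `picard` wrapper)
        ["picard", "AddOrReplaceReadGroups", f"I={sorted_bam}", f"O={rg_bam}",
         f"RGID={sample}", f"RGLB={sample}", "RGPL=ILLUMINA",
         f"RGPU={sample}", f"RGSM={sample}"],
        # 4. Mark duplicates
        ["picard", "MarkDuplicates", f"I={rg_bam}", f"O={dedup_bam}",
         f"M={metrics}"],
        # 5. Index the final, sorted BAM
        ["samtools", "index", dedup_bam],
    ]


if __name__ == "__main__":
    # Print the commands for inspection instead of executing them
    for cmd in prep_commands("reads.sam", sample="sample1"):
        print(" ".join(cmd))
```

Printing the commands first, rather than running them, makes it easy to check every path before anything touches your data. And that is all before the GATK itself gets involved.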

Bioinformaticians are interested in methods

A bioinformatician asks, “How can I build this software so that I can use it for many different datasets with minimum hassle?” Her interest is in methods (and ideally results!). She wants to write this tool once and have it run without support requests for ten years (okay, three months). It should produce good quality results for a variety of datasets.

Bioinformaticians use “building blocks” software: Hadoop, cloud technologies, Docker, RESTful web services, NoSQL databases. 

Day in the life of a bioinformatician

You’ve sorted out the vagaries of GATK, and that makes you a local hero in your lab. Pretty soon you have grad students at your desk with plaintive looks. “Can you call variants for me?” You make little tweaks to your original script and produce good variants for all of them. 

But soon, someone else wants to use your script. You send it to them, and all of a sudden you’re in some kind of IT role. You discover hardcoded paths that you definitely don’t remember putting in, and for some reason Java throws up a JVM error on their machine, and you have some obscure library in your path that nobody else has. “If only!” you think. “If only I’d put this in Docker!”

Not stirring the pot

As I indicated (somewhat) facetiously in my example, most people in the field do a mixture of the two jobs. Also, creating reproducible code and level of coding skill have nothing to do with these definitions. Which camp we fall in depends on the needs of the task at hand. 

I presented these definitions at my institution and I had several people insist afterwards that they were bioinformaticians, no matter what definition I gave, thank you very much. Labels matter to the people in the jobs. My point is purely about software.

It’s worth considering who is going to be using the software you develop. If you’re developing for bioinformaticians, go ahead and brag about decreasing runtime by massively parallelizing using Hadoop and Amazon.  If your target is that computational biologist who wants her answer, don’t distract her with infrastructure. Just give her those stinking variants. 

7 thoughts on “Biologists and bioinformaticians have different software needs”

  1. Nice discussion, but at my institution we use “computational biologist” to define all the Computer Science faculty (with CS Ph.D.s) who develop algorithms for sequence analysis. So not at all the same as your definition. Many people here in the School of Medicine think that “bioinformatics” means something to do with databases. I think the terms should be interchangeable, but you have to know who you’re talking to, because your interlocutor might have a different definition from your own.

    These days I tell other academics I’m a computational biologist. But I kind of like the way my daughter described me years ago, when she was about 5: “DNA scientist.”

    • “DNA scientist”, lovely. Simplicity, natural in children, a gift in adults. That’s what we are missing, no matter whether we are computational biologists or bioinformaticians.

  2. Okay, yes, people are tied to their labels. But I’m talking about software. What type of software concerns you? You develop algorithms, so you probably don’t use Galaxy. Your software needs are different from those of people who are analysing one dataset.

    • Indeed, they are very fragrant variants. It does take a lot of steps to get there though.

  3. Interesting distinction but it worries me a little. I think it should be: bioinformaticians are ALSO interested in methods. If a bioinformatician cares about methods more than results, chances are they’re a bad bioinformatician who makes fast but useless software. (Not always but I see it happen far too often for comfort, especially from computer science departments.) At the same time, biologists also need to care deeply about methods, just not implementations, which I think you are really meaning. Too many biologists analyse their data wrong because they assume that the people writing the software (a) had their data in mind, and/or (b) have established the “best”/correct way to analyse that data. Often, we add lots of parameters and options to programs BECAUSE we don’t know what’s best. (Although the generalisation thing applies too. Perhaps we should keep cloning our software with different default settings for different uses.)

  4. We have a term for someone who develops algorithms – computer scientist.
    We have a term for someone who writes software – programmer.
    We have a term for someone who studies biology – biologist.
    And there’s nothing wrong with an individual being many things at once – we don’t need a new term for every intersection of two potential jobs, particularly if that new term is just a concatenation of the two job descriptions anyway…
    Rock climber and Geologist? Geoclimber.
    Butcher and fisherman? Fisherbutch.
    Biologist and Informatician? Bioinformatician.
    Brad and Angelina? Brangelina.
    No, if we’re going to add new words to our lexicon, they need to describe new concepts.

    So is Bioinformatics just a variable waiting for a function to be assigned to? Well, I think there’s one task that Bioinformaticians do that is unique to the profession, and that’s bridging the gap between CS and Biology. Unfortunately, I feel that the confusion over the term Bioinformatician is because few Bioinformaticians actually do this. Some do, but many do not.
    If you consider yourself a Bioinformatician, ask yourself these two questions: when was the last time you enabled a Biologist to understand a computer science concept? When did you last help a Computer Scientist understand a biological phenomenon?
    If you answer “not sure” to both of those questions, chances are your job is not about bridging the gap between CS and Bio, but rather acting as a middle-man. A “go-between”. A “point of contact”. Maybe even an elusive “data scientist”. But these are all pretty non-descript terms for a job that we already have a name for, and that is Tech Support.
