I live! Well, I have been alive all along, but I am not very good at this blogging thing. Thanks to my employer, Ontario Institute for Cancer Research, I am attending AGBT 2013 in sunny Marco Island, Florida. The following is the poster I am presenting.
Morgan L. Taschuk, Andrew M. K. Brown, Robert E. Denroche, Zhibin Liu, Peter Rusanov, Stuart Watt, Timothy Beck, Vincent Ferretti, John D. McPherson, Brian D. O’Connor, and B.F. Francis Ouellette
Human genome sequencing data generated in the context of clinical care is increasingly produced by centers of all forms and varying levels of resources around the world. Automated analysis allows the most critical information, including both quality control information and variant calls, to be returned to technicians and to clinicians as soon as possible, ensuring rapid diagnosis.
SeqWare is an open-source framework for building and executing automated NGS pipelines on Grid or Cloud architecture that also tracks data provenance as well as sufficient metadata to replicate the analysis. We created a SeqWare pipeline to identify candidate somatic mutations from clinical samples sequenced on Illumina MiSeq instruments. In conjunction with verification performed by a CAP/CLIA lab, a panel of experts used the somatic variants to inform personalised cancer treatments in an ongoing clinical study. The pipeline was able to launch automatically, recover from failure, and track data provenance with an average analysis time comparable to previous methods.
Genomics Pathway Strategy
Figure 1: The process used by the Genomic Pathway Strategy (GPS) to detect somatic variants in cancer patients. Consented donors are biopsied and the tissue is analysed with a CAP/CLIA-certified gene panel and by targeted high-throughput sequencing. The results are confirmed using an orthogonal technology. Confirmed somatic variants are presented to an expert panel of clinical oncologists, who determine whether they are reportable or actionable. This poster focuses on the analysis of the high-throughput sequencing data (blue).
The pilot project showed that high-throughput sequencing is a viable alternative to traditional gene assays and can in some situations be more sensitive to mutations. Given the short timeline of 21 days for analysis to be completed and the increasing number of participants, manual HTS analysis was shown to be impractical. A more efficient approach to analysis was suggested for future projects of this type.
The SeqWare infrastructure consists of a series of tools designed to facilitate and automate the analysis of high-throughput short-read sequencing data. SeqWare is cluster agnostic and its workflow language is abstracted, so the same workflows can be run on any supported scheduling environment, including traditional Grids, the Amazon Cloud, and Hadoop clusters. It is open source and available from http://seqware.github.com.
We describe the efficient assembly of an automated clinical sequencing pipeline for the GPS project using SeqWare.
SeqWare Pipelines at OICR
Since 2011, OICR’s Sequencing Production Bioinformatics has used SeqWare to automate the analysis of its Illumina Inc. HiSeq sequencing data and align over 50 trillion bases. This pipeline processes all of the data that is produced by the ten Illumina HiSeq machines at OICR, including two HiSeq 2500s. In an average month, the HiSeq pipeline processes 19 sequencer runs, aligns 246 libraries, and calls variants in 72 samples.
Figure 2: The HiSeq SeqWare Pipeline at OICR. SeqWare components are in blue and external files and technologies are in green. Base calls are produced by Illumina HiSeq 2000 or 2500 instruments and are deposited on the file system and in the LIMS. The results are migrated to the SeqWare MetaDB, which triggers designated deciders to launch appropriate workflows. The pipeline currently consists of CASAVA for base calling, Novoalign for alignment, and GATK for variant calling.
Figure 3: The pipeline developed for the GPS project. Each analysis step in the pipeline is a SeqWare workflow representing a distinct technology. After sequencing, the information about the run must be injected into the MetaDB, which launches the automated pipeline.
We created a virtual machine with a bespoke SeqWare pipeline for the GPS project. Due to the clinical nature of the data, the sample information needed to be kept independent from both the existing LIMS and the HiSeq SeqWare MetaDB. In addition, creating a virtual machine image allows the infrastructure to be re-used for other projects with similar requirements and privacy concerns.
Workflows from OICR’s HiSeq pipeline were repurposed for the MiSeq pipeline, enabling the pipeline to be created quickly. Additional workflows included a quality control pipeline that evaluates the quality of the sequencing and the alignment, and an annotation workflow that annotates the variants resulting from GATK.
As each sequencing run completes and analysis begins, the donor and sequencing information must be injected into the MetaDB. We developed a SeqWare plugin for importing MiSeq sample sheets into the SeqWare MetaDB. Tumour, reference and archival samples are often sequenced in separate assays and can be easily matched by this method. Having good metadata also allows more complicated analyses to be automated.
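The matching of tumour and reference samples sequenced in separate assays can be illustrated with a minimal Python sketch. It is not the SeqWare plugin and it does not parse the real MiSeq sample sheet layout: the simplified CSV columns `donor`, `tissue_type` and `sample_id`, and the function names `group_by_donor` and `matched_pairs`, are assumptions made purely for illustration.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_donor(sheet_text: str) -> Dict[str, Dict[str, List[str]]]:
    """Group sample IDs by donor and tissue type.

    `sheet_text` is a simplified, hypothetical CSV with the columns
    donor, tissue_type and sample_id (not the real sample sheet format).
    """
    donors: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(sheet_text)):
        donors[row["donor"]][row["tissue_type"]].append(row["sample_id"])
    # Convert nested defaultdicts to plain dicts for predictable behaviour.
    return {donor: dict(tissues) for donor, tissues in donors.items()}


def matched_pairs(donors: Dict[str, Dict[str, List[str]]]) -> List[Tuple[str, str, str]]:
    """Return (donor, tumour sample, reference sample) for every donor that has both."""
    pairs = []
    for donor, tissues in sorted(donors.items()):
        for tumour in tissues.get("tumour", []):
            for reference in tissues.get("reference", []):
                pairs.append((donor, tumour, reference))
    return pairs


if __name__ == "__main__":
    # Rows from two hypothetical runs concatenated under one header.
    sheet = "donor,tissue_type,sample_id\nD1,tumour,S1\nD2,tumour,S2\nD1,reference,S3\n"
    print(matched_pairs(group_by_donor(sheet)))
```

Once samples are keyed by donor in the metadata, a tumour/reference comparison can be started as soon as both halves of a pair are present, regardless of which run each arrived in.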
Analysis data is recorded in the MetaDB, satisfying the requirement for provenance tracking in CAP/CLIA certified software. Workflow failures are detected immediately and the failed workflow is resubmitted, up to a maximum of five tries. Information about each failure is available for the informatician to diagnose the problem.
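The retry behaviour described above can be sketched in a few lines of Python. This is a generic illustration rather than SeqWare's own implementation: the function `run_with_retries` and the idea of passing the workflow as a callable are assumptions, and only the limit of five tries comes from the poster.

```python
from typing import Callable, List, Tuple

MAX_TRIES = 5  # the poster states a maximum of five tries


def run_with_retries(
    workflow: Callable[[], None], max_tries: int = MAX_TRIES
) -> Tuple[bool, List[str]]:
    """Run `workflow`, resubmitting it after each failure, up to `max_tries` attempts.

    Returns (succeeded, failure_messages). The messages are kept so that
    someone can diagnose the problem afterwards.
    """
    failures: List[str] = []
    for attempt in range(1, max_tries + 1):
        try:
            workflow()
            return True, failures
        except Exception as exc:  # any failure triggers a resubmission
            failures.append(f"attempt {attempt}: {exc}")
    return False, failures
```

Keeping the failure messages alongside the final status mirrors the point made above: the retry is automatic, but the record of what went wrong is still there for a person to inspect.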
Figure 4: Implementation details of the virtual machine (VM) constructed for the GPS project. The MetaDB, the associated web service, the cluster abstraction layer, and the workflows and deciders are installed on the VM. The actual execution cluster is external to the VM, as is the file system where the input and output files are stored. A crontab regularly runs deciders, which check for the presence of new data in the MetaDB and launch workflows if necessary. The workflows monitor jobs in the cluster and record analysis information in the MetaDB, potentially triggering new deciders.
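The decider cycle in Figure 4 can be approximated by the following Python sketch. It uses a hypothetical in-memory `MetaDB` class and a hypothetical `run_decider` function as stand-ins; none of these names are SeqWare APIs, and a real decider would query the MetaDB through its web service rather than a Python list.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class MetaDB:
    """Hypothetical in-memory stand-in for the metadata database."""
    files: List[Dict[str, str]] = field(default_factory=list)   # {"path": ..., "type": ...}
    processed: Set[Tuple[str, str]] = field(default_factory=set)  # (workflow, path) already launched


def run_decider(
    db: MetaDB, workflow_name: str, input_type: str, launch: Callable[[str], None]
) -> List[str]:
    """Launch `workflow_name` once for each file of `input_type` it has not seen yet."""
    launched = []
    # Iterate over a copy: a launched workflow may record new files in the database.
    for record in list(db.files):
        key = (workflow_name, record["path"])
        if record["type"] == input_type and key not in db.processed:
            launch(record["path"])
            db.processed.add(key)
            launched.append(record["path"])
    return launched


if __name__ == "__main__":
    db = MetaDB(files=[{"path": "a.fastq", "type": "fastq"}])

    def align(path: str) -> None:
        # Pretend the alignment finished and recorded its output.
        db.files.append({"path": path.replace(".fastq", ".bam"), "type": "bam"})

    print(run_decider(db, "align", "fastq", align))         # first scheduled run
    print(run_decider(db, "align", "fastq", align))         # nothing new to do
    print(run_decider(db, "call_variants", "bam", print))   # downstream decider
```

Because each decider only records what it has already launched, running it repeatedly on a schedule is harmless, and output written by one workflow becomes the input that a downstream decider picks up on its next run.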
The current framework makes use of the original Freemarker template workflow language and Pegasus/Condor/Globus software stack. Recently, new workflow languages have been developed by the SeqWare team that enable the use of more diverse Grid and Cloud environments, such as Apache Oozie Workflow Scheduler for Hadoop. The workflows and deciders will be updated to take advantage of the latest features in SeqWare.
The SeqWare team is also working closely with the Galaxy project to integrate SeqWare infrastructure with the user-friendly Galaxy interface. Using Galaxy would enable non-informaticians to launch the pipeline on demand.
The current pipeline is also limited to the analysis of MiSeq data. We plan to expand it to handle other types of HTS data.
Liu, Z., M. L. Taschuk, B. O’Connor, and B.F.F. Ouellette. Integration of SeqWare within Galaxy. Galaxy Community Conference 2012.
O’Connor, B. D., B. Merriman, and S. F. Nelson. SeqWare Query Engine: storing and searching sequencing data in the cloud. BMC Bioinformatics 2010 11(Suppl 12):S2.
Tran, B., J. E. Dancey, S. Kamel-Reid, J. D. McPherson, P. L. Bedard, A. M. K. Brown, T. Zhang, P. Shaw, N. Onetto, L. Stein, T. J. Hudson, B. G. Neel, and L. L. Siu. Cancer Genomics: Technology, Discovery, and Translation. JCO Feb 20, 2012:647-660.