With genome sequencing becoming increasingly affordable, Garvan Institute of Medical Research needed more compute and storage resources to help researchers analyze the rapidly rising volumes of sequencing data. Garvan deployed an on-premises high performance computing (HPC) cluster with compute nodes aligned with Intel® Select Solutions for Genomics Analytics. Garvan gained cost-effective capacity for its diverse workloads and is seeing double the performance of its external cloud platform on identical workloads.
As one of Australia’s largest biomedical research institutes, Garvan Institute of Medical Research is at the forefront of next-generation genome analytics. In addition to sequencing whole genomes and increasing numbers of single cells, Garvan’s scientists and researchers perform sophisticated analysis of the genomic data to produce new insights into understanding the causes of cancer, immunity and inflammation, and diseases that interfere with healthy aging. Garvan’s mission is to make significant contributions to medical research that will change the directions of science and medicine and deliver major benefits for human health.
Established in 1963, Garvan has focused more heavily on being a genome-powered institution since 2012. The institute brings together 600 researchers, including more than 80 informaticists. Garvan’s high performance computing (HPC) infrastructure runs a range of demanding workloads, including the production workflows of whole human genomes as well as the fast-growing field of single-cell genomics. It also supports cross-disciplinary collaborations within the institute and with researchers across Australia and internationally.
Fig. 1. Close up of DNA methylation (tiny chemical tag shown as a glowing particle being added to one of the DNA bases), by Dr Kate Patterson, Garvan Institute
Garvan began building its HPC infrastructure in 2012 with a cluster based on a non-Intel architecture. By 2015, when the institute was ready for its next expansion, “It was clear Intel was the way to go,” according to Dr. Warren Kaplan, head of Garvan’s Data Sciences Platform.
By 2018, Garvan was using both external clouds and on-premises infrastructure, but still needed more capacity, performance, and scale to keep up with rising workloads. In addition to cost-effective infrastructure that could handle diverse genomics workloads, Kaplan listed three critical requirements for new compute nodes on his HPC infrastructure.
“The CPU is our workhorse, so we’re always obsessed about the CPU,” Kaplan said. “A fast CPU is good, and a faster one is always better. The world of bioinformatics and genomics has a large number of inefficient codes, so we’re also obsessed about memory. Finally, to support our Spark computing and also codes that generate large numbers of very small files, we like lots of very fast local storage.”
Kaplan discussed Garvan’s requirements with Intel experts and said those conversations were a key step in choosing technologies that would meet the institute’s needs in a scalable, cost-effective fashion. “When we started talking about expanding the system, our Intel team here in Sydney put us in touch with Intel’s genomics experts in the United States,” Kaplan recalled. “They were deeply, deeply embedded in the genome space and extremely knowledgeable about the genomics industry and technologies. They understood what we wanted to do, and the conversations we had with them about how to achieve our goals were extremely valuable.” The Intel team also demonstrated substantial improvements in latency and throughput that Intel and the Broad Institute had achieved for whole genome sequencing (WGS).1
Following those discussions, Garvan deployed a Dell EMC cluster with compute nodes that align with Intel® Select Solutions for Genomics Analytics. Intel® Select Solutions are verified configurations intended to deliver workload-optimized performance while simplifying deployment of data center infrastructure. The Intel® Select Solution for Genomics Analytics is based on the Broad-Intel Genomics Stack (BIGstack) 2.0, developed by Intel and the Broad Institute of MIT and Harvard. The cluster includes compute nodes based on Intel® Xeon® Scalable processors with the Intel® Ethernet Converged Network Adapter (Intel® Ethernet CNA) X710-DA2, which offers optimizations and sophisticated packet processing to address the demanding needs of the agile data center.
Garvan is using Intel® Optane™ DC SSDs and Intel® 3D NAND SSDs to expand local memory and storage capacity in the cluster. Compute nodes are equipped with 24 TB of Intel® Optane™ DC SSD P4800X storage.
Kaplan said the performance of the Intel® Optane™ SSDs has been outstanding and is helping give some of Garvan’s legacy codes a new lease on life. “Traditionally, using spinning disk configured as swap space led to terrible performance,” he said. “With the Intel® SSDs, we can use the disk as swap space and the performance is great. It gives us access to much more memory and performance.”
“We have a few non-Intel GPGPUs, but almost everything else, including all our production codes, is powered by Intel,” he added.
Garvan’s new HPC infrastructure is a powerful enabler for the institute’s sharpened focus on genomics-enabled medical research. “We’re only at the beginning of what we’re achieving with this platform and where we’re going with it, but fundamentally, it is leading to a lot of changes that are enabling Garvan to evolve into a data-informed medical research institute,” Kaplan stated. “The infrastructure is fundamental to the changing of our business and how we’re working, and the impact is transformative.”
Part of that impact comes from the platform’s performance and cost advantages compared to the commercial cloud services Garvan uses. Kaplan said his team has used the Singularity container system to port some of its containerized WGS workflows from external cloud environments onto the new infrastructure. “The workflow is identical in every respect, and the performance on our infrastructure is significantly faster than on the commercial clouds,” Kaplan said.
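As a rough illustration of what running the same container on a different host can look like, the sketch below builds a `singularity exec` command line in Python. The image name, script name, sample ID, and bind paths are hypothetical placeholders, not Garvan's actual workflow; only the general `singularity exec --bind host:container image command` form is assumed.

```python
import shlex
from typing import List, Mapping, Optional, Sequence


def singularity_exec_argv(
    image: str,
    command: Sequence[str],
    binds: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Build the argument list for `singularity exec`.

    `binds` maps host paths to the paths they should appear at
    inside the container (one --bind flag per entry).
    """
    argv = ["singularity", "exec"]
    for host_path, container_path in (binds or {}).items():
        argv += ["--bind", f"{host_path}:{container_path}"]
    argv.append(image)       # the container image to run
    argv.extend(command)     # the command executed inside the container
    return argv


# Hypothetical image, script, and paths, for illustration only.
argv = singularity_exec_argv(
    image="wgs-pipeline.sif",
    command=["bash", "run_alignment.sh", "sample01"],
    binds={"/local/scratch": "/scratch"},
)

# Print a shell-safe version of the command rather than executing it.
print(shlex.join(argv))
```

Because the image and the command stay the same, only the bind paths would need to change between a cloud instance and an on-premises node, which is one way a workflow can remain identical across environments.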
The added capacity and performance are keeping Garvan’s researchers at the forefront of new fields such as single-cell genomics. “With this system, we are able to keep up with the demand as the sequencing technologies are racing ahead,” Kaplan said. “Researchers also use R and Python languages in Apache Zeppelin that connects to our Spark cluster, so they also have this scale-out backend that lets them do extraordinary computation. That’s going to be transformative, too.”
Garvan’s new infrastructure is also advancing the institute’s collaboration with the National Computational Infrastructure, based in the Australian Capital Territory, on the development of workloads that can run at scale. “The ability to have a beautiful prototyping environment such as we do with this infrastructure, to be able to rapidly develop, and build out prototypes, learn lessons from them, share what we’re learning, and build out the solutions—that is extremely valuable,” said Kaplan.
Moving forward, Kaplan said he will continue to look to Intel for technologies and insights. “As a strategic partner, Intel is as good as it gets,” he stated. “We feel privileged to have worked with them. They have enabled us to build something very special—and that’s all to the good of medical research across Australia and the world.”
- Dell EMC PowerEdge* servers
- Intel® Xeon® Scalable processors
- Intel® Optane™ SSD DC P4800X Series
- Intel® 3D NAND SSD DC P4600 and P4500 Series
- Intel® Ethernet Converged Network Adapter (Intel® Ethernet CNA) X710-DA2