Advancing Rice Research Using High Performance Computing at NYU
By Stratos Efstathiadis and Aketh Thimmasandra
For the past 10,000 years, rice has played a central role in Asia’s nutrition. Even today, rice remains the staple food for approximately 50% of the world’s population. Rice productivity has more than doubled in recent decades, but by the year 2030, rice production must increase by an additional 25% to keep pace with global population growth and demand. To achieve this goal, it is imperative to investigate the genetic variation available in rice and to accelerate its genetic improvement. The rice of the future must be cultivated using less water and less land, and under severe environmental stresses due to climate change. Mapping the rice genome is a significant step toward achieving these goals.
What Makes Rice an Ideal Candidate for Genetic Research?
Rice genomes are well characterized and, at about 400 to 430 Mb (million base pairs), are the smallest among the genomes of major cereal crops. Rice is also among the easiest of all cereal crops to transform genetically. Hence, research directed toward developing improved rice strains offers the most promise of meeting the needs of a growing population.
The 3,000 Rice Genome Project
Although a number of studies have used next generation sequencing (NGS) to discover allelic variants in rice, these studies were unable to provide a complete picture of the total genetic diversity within the Oryza sativa gene pool that comprises rice. This has been due to either the small sample size of sequenced accessions (identified and tracked units of data), or the low-coverage sequencing depth (“reads” aligning with reference bases) of the genomes. Therefore, an international effort called the 3,000 Rice Genome Project was undertaken to extend the understanding of the total genetic diversity of the rice gene pool by re-sequencing 3,010 O. sativa genomes. The idea was to establish a public rice database containing genetic and genomic information suitable for advancing rice breeding technology. The results were published in Nature on April 25, 2018.
Why NYU Researchers Are Reanalyzing the 3,000 Rice Data
In December 2017, a team of researchers at NYU Faculty of Arts and Science’s Biology Department embarked on a reanalysis of 1,400 samples from the 3,000 Rice data. Their motivation for doing so was threefold: 1) to make use of a state-of-the-art genomic sequence processing tool to identify genetic mutations; 2) to advance rice genetic research pertaining to “landraces,” which are local rice varieties that have remained in the same place, often passed from generation to generation; and 3) to allow seamless integration of new data. The research team consists of NYU Dean for Science, Faculty of Arts and Science Michael Purugganan, postdoctoral researcher Rafal Gutaker, and postdoctoral researcher Jae Young Choi, who played a critical role in sequencing 200 genomes that were merged with the 3K dataset. The team strongly believes that it is important to distinguish elite rice cultivars from landraces in order to understand genetic variation and help advance rice genetic research.
For the original 3,000 Rice Genome Project, genomic sequence processing was conducted using a now-deprecated tool (GATK UnifiedGenotyper). In their reanalysis, NYU’s researchers have instead been using the state-of-the-art GATK Best Practices processing pipeline (https://software.broadinstitute.org/gatk) to achieve the highest degree of confidence in the results of this critical step. Using GATK Best Practices also offers advantages such as seamless integration of current and future genomic data. At the time of this writing, NYU researcher Jae Young Choi has used NYU High Performance Computing resources to sequence the data of about 200 landraces, which have been combined for reanalysis with the 1,400 samples the team extracted from the original 3,000 Rice data.
The NYU researchers intend to take advantage of these improved SNP calls, which identify variable sites, to advance their understanding of rice variability by studying its natural history. They hope to better understand how rice spread through Asia and beyond, and to gain a clearer picture of rice’s diversity by identifying the factors responsible for shaping its genomic diversity around the world and its adaptation to different climates. Another aspect of the research is focused on identifying genes from wild relatives of rice that would enable rice to be successfully grown around the world, and on finding mutations in rice genomes that change the effect of gene regulatory networks (networks inferred from gene expression data). The previously published SNP set was sufficiently robust to work on gene haplotypes (G3: Genes | Genomes | Genetics), but its reanalyzed version will add precision to whole-genome demographic and evolutionary inferences.
The Role of HPC in Genome Sequencing
Genome sizes range from a hundred million to tens of billions of base pairs. The human genome contains approximately three billion base pairs. By comparison, the rice genome contains roughly 400 to 430 million base pairs. Analyzing such data requires sophisticated computing systems specifically designed for performing computation on data at a very large scale. This is where high performance computing (HPC) resources come in. Using less-powerful systems to analyze genomes can take years. Analyzing that same data using HPC systems takes just a few weeks.
HPC systems employ multiple nodes (blocks of processors that can be programmed to perform specific computations), connected through high-speed networks, working in coordination to solve large computational problems. The vast amounts of data are distributed (equally or otherwise) among various nodes, and computations are performed simultaneously on all nodes. The results of the computation are then combined at a single, designated node to produce the desired results. This is known as “distributed computing,” a very popular way to solve large-scale computational problems.
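The distribute, compute, and combine steps described above can be illustrated with a small Python sketch. It is only a toy illustration and not the NYU team's code: worker threads on a single machine stand in for cluster nodes, the task (counting bases in a short DNA string) is chosen for simplicity, and the function names `split_into_chunks`, `count_bases`, and `distributed_base_count` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(sequence, n_chunks):
    """Distribute step: cut the sequence into roughly equal pieces."""
    size = max(1, -(-len(sequence) // n_chunks))  # ceiling division
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def count_bases(chunk):
    """Work done independently by each worker: count the bases in one chunk."""
    return Counter(chunk)


def distributed_base_count(sequence, n_workers=4):
    """Split the data, process the pieces concurrently, then combine the results.

    Assumption: threads stand in for the nodes of a real cluster, where the
    pieces would instead travel over a high-speed network.
    """
    chunks = split_into_chunks(sequence, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_bases, chunks))

    # Combine step: merge every partial result in a single place.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


if __name__ == "__main__":
    print(distributed_base_count("ACGTACGTAA", n_workers=3))
```

Because each chunk is counted independently, the combined total matches what a single pass over the whole sequence would give. That independence is the property that lets work of this kind be spread across many nodes.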
The requirements of computational genetic research, such as that being conducted by the NYU rice researchers, align well with NYU’s HPC Prince cluster resources. Genetic data is processed in multiple stages, which together are called a pipeline. This requires coordinating data transfer and managing many processes that run in parallel. The HPC Prince cluster has well-designed tools, such as schedulers and job managers, that enable jobs to run quickly and efficiently. Specialized software packages such as Nextflow are specifically designed to cater to genomic research needs for scheduling pipelines on HPC clusters. Using these and other HPC tools offers many advantages, including high throughput (by running many analysis jobs in parallel), scalability and easy management, preinstalled software, and speedy task execution. See the NYU IT website for more information about NYU’s HPC resources and services.
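To make the idea of a multi-stage pipeline running over many samples in parallel more concrete, here is a minimal Python sketch. It is not Nextflow, GATK, or the Prince cluster's scheduler. The two stages (`trim_reads` and `summarize_reads`), the sample layout, and the `run_all` helper are all hypothetical stand-ins, and threads on one machine take the place of scheduled cluster jobs.

```python
from concurrent.futures import ThreadPoolExecutor


def trim_reads(sample):
    """Hypothetical stage 1: drop reads that contain an ambiguous 'N' base."""
    kept = [read for read in sample["reads"] if "N" not in read]
    return {**sample, "reads": kept}


def summarize_reads(sample):
    """Hypothetical stage 2: report how many reads and bases remain."""
    reads = sample["reads"]
    return {
        "name": sample["name"],
        "n_reads": len(reads),
        "n_bases": sum(len(read) for read in reads),
    }


# The pipeline is an ordered list of stages; each stage consumes the
# output of the one before it.
PIPELINE = [trim_reads, summarize_reads]


def run_pipeline(sample):
    """Push one sample through every stage in order."""
    result = sample
    for stage in PIPELINE:
        result = stage(result)
    return result


def run_all(samples, max_parallel=4):
    """Run the whole pipeline for many samples at once (assumed independent)."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_pipeline, samples))


if __name__ == "__main__":
    demo = [
        {"name": "s1", "reads": ["ACGT", "ANGT", "GG"]},
        {"name": "s2", "reads": ["NN"]},
    ]
    for summary in run_all(demo):
        print(summary)
```

On a real cluster, a workflow manager would submit each stage of each sample as a job to the scheduler rather than to a thread pool. The structure is the same: stages run in order within a sample, and samples run independently of one another, which is where the high throughput comes from.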