AWS democratizes access to the largest genomic sequences repository — NIH’s Sequence Read Archive
For the first time, the largest genomic sequencing repository in the Americas will be natively accessible on AWS through the Open Data Sponsorship Program.
Amazon today announced that the National Institutes of Health (NIH) Sequence Read Archive (SRA) data, managed by the National Library of Medicine’s (NLM) National Center for Biotechnology Information (NCBI), will now be freely accessible through the AWS Open Data Sponsorship Program.
In 2018, NCBI began the migration of SRA data to AWS’ cloud platform by leveraging the NIH STRIDES Initiative. With SRA publicly accessible on Amazon Simple Storage Service (Amazon S3), scientists can now seamlessly integrate SRA data into cloud-based genomics workflows. Scientists can choose to access the data via native AWS clients such as the AWS Management Console or the AWS Command Line Interface (AWS CLI), or use open-source tools such as the SRA Explorer.
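As a rough illustration of what public access on Amazon S3 can look like in practice, the sketch below lists a few objects anonymously through the S3 REST `ListObjectsV2` endpoint, using only the Python standard library. The bucket name `sra-pub-run-odp` and the `sra/SRR000001` prefix are assumptions that the announcement itself does not state; confirm the current location in the Registry of Open Data on AWS before relying on them. The helper names are hypothetical.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 list responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}

# ASSUMPTION: bucket name and key prefix are not given in the article.
SRA_BUCKET = "sra-pub-run-odp"
EXAMPLE_PREFIX = "sra/SRR000001"


def build_list_url(bucket, prefix="", max_keys=10):
    """Build an anonymous ListObjectsV2 URL for a public bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_text):
    """Return (key, size_in_bytes) pairs from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_text)
    return [
        (
            item.findtext("s3:Key", namespaces=S3_NS),
            int(item.findtext("s3:Size", namespaces=S3_NS)),
        )
        for item in root.findall("s3:Contents", S3_NS)
    ]


def list_public_objects(bucket, prefix="", max_keys=10):
    """Fetch and parse one page of object listings (requires network access)."""
    with urllib.request.urlopen(build_list_url(bucket, prefix, max_keys)) as resp:
        return parse_listing(resp.read().decode("utf-8"))


if __name__ == "__main__":
    for key, size in list_public_objects(SRA_BUCKET, EXAMPLE_PREFIX):
        print(f"{size:>15,d}  {key}")
```

The same listing can be done with the AWS CLI or an SDK; plain HTTPS is used here only to keep the sketch free of third-party dependencies.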
“By making SRA data available in the cloud, researchers in the life sciences and genomics community can build on a strong foundation of open data and pay it forward with tools, solutions, and products that enrich and expand the science ecosystem,” says Dave Levy, AWS vice president for U.S. government, nonprofit and healthcare businesses.
“The ability to access SRA data in the cloud is a wonderful realization of NIH's long-standing principles for broad, rapid, and equitable public access to biomedical research data. The new opportunities for computational access afforded by this open data program will accelerate the pace of research, permit us to ask bold questions, and enable scientific discoveries,” says Susan Gregurick, PhD, NIH associate director for data science and director, Office of Data Science Strategy.
Since its discovery in the 1860s, DNA has been a source of fascination and revelation. Through genomics, the field dedicated to the study of DNA, scientists have begun to understand how DNA can dictate an individual’s appearance, behavior, and disease risk. In fact, common conditions such as diabetes, depression, and cancer all have known genetic contributions.
The reuse of genetic data is encouraged within the science community to reduce the overall cost of data generation and ensure that discoveries can be replicated by other scientists. Major research funding agencies, such as the NIH, the Howard Hughes Medical Institute, and the Department of Defense, recommend a few repositories to store genomic data so that it can be accessed and reused by others. SRA is one of these recommended repositories and is one of the oldest repositories designed specifically for next-generation biomedical/genomic sequencing data.
The fundamental unit of DNA is the nucleotide, which in genomics is denoted by its nitrogenous base: the familiar A, C, T, or G. SRA currently stores more than 44 petabases of genomic sequence. For perspective, this amount of data is over 6 billion human genomes, more than 18 times the current population of the United States.
SRA currently measures more than 40 petabytes in volume, and its growth shows no sign of slowing. In fact, experts predict that SRA will double in volume every 12 to 18 months for the foreseeable future.
“Object-based storage like Amazon S3 can scale with that rate of growth,” says Levy, “and with the power of the cloud, so can the compute.”
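To make the doubling prediction concrete, here is a minimal sketch that projects storage volume under steady exponential growth. It takes the 40-petabyte starting point and the 12-to-18-month doubling range from the figures above; the assumption of a constant doubling period, and the function name, are illustrative only.

```python
def projected_petabytes(start_pb, years, doubling_months):
    """Volume after `years`, assuming it doubles every `doubling_months` months."""
    doublings = years * 12 / doubling_months
    return start_pb * 2 ** doublings


if __name__ == "__main__":
    START_PB = 40  # "more than 40 petabytes", per the article
    for years in (1, 3, 5):
        fast = projected_petabytes(START_PB, years, 12)  # doubling every 12 months
        slow = projected_petabytes(START_PB, years, 18)  # doubling every 18 months
        print(f"after {years} year(s): {slow:,.0f} to {fast:,.0f} PB")
```

Under these assumptions, three years of growth would take the archive from 40 petabytes to somewhere between 160 and 320 petabytes.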
SRA comprises genomic sequences from all branches of the tree of life and has proven essential in the fight against COVID-19. For example, Serratus, an open science viral discovery platform stemming from the University of British Columbia’s Cloud Innovation Center, used AWS to align coronavirus pangenomes against 3.8 million SRA submissions and identify new coronavirus sequences (learn more about how AWS is helping the fight against COVID-19).