Programming Questionnaire

BIO 312 – Lab Quick Start Checklist
When you first come to lab, you need to do three main things:
1) Clone the lab to your GitHub account and open the lab document. The link to clone each lab will be provided in the respective folder under “Modules” on Blackboard.
2) Copy and paste the lab markdown text from GitHub into a new HackMD note. You MUST do this to easily annotate your labs.
3) Start the AWS Academy lab and boot up your instance. This is how you actually turn on the supercomputer you need to run your analyses!
Step 1: Clone the lab to your GitHub
Navigate to the “Modules” tab on Blackboard and click the folder corresponding to the
lab.
Click the classroom.github link:
Click “Accept this assignment”. The lab should appear in your personal repository within
a minute or so.
Step 2: Copy the lab text to HackMD
Log into your HackMD account.
Click the green “+ New note” button in the upper left corner of your dashboard.
Go back to GitHub, click the pencil icon on the lab README.md file:
Highlight and copy all of the text in the “Edit file” window:
Paste the text into your new HackMD note. In the preview window to the right, you
should see the lab formatted exactly as it is in GitHub.
Step 3: Start the AWS lab and launch your instance
Navigate to our course on AWS academy:
https://awsacademy.instructure.com/courses/18419
Click the “Modules” tab in the left side bar.
Select “Learner Lab – Foundational Services”.
Click “Start Lab”. The “AWS” icon will show a red dot if not running, a yellow if it is
booting up, and a green if it is running. Wait until the icon turns green:
When the icon is green, click the AWS icon (it is a link). This will bring you to your
AWS console.
At the console, click the “EC2” icon. You do NOT need to set up a new EC2 machine.
You have already done this in lab 2.
At your EC2 dashboard, click the “Instances” tab in the left sidebar.
In the instance dashboard, you should see your personal instance (the name will be whatever
you called it when you first created the instance.) Select the checkbox next to your instance,
click the “Instance state” button, then click “Start instance”. The instance may take a few
minutes to boot up. When your instance is up and running, the Instance state column will be
green and say “Running”:
Once your instance is successfully running, click “Connect”:
You now have two options to connect to your instance. You can either:
1) (easier) Connect to the internal AWS terminal shell in your browser using EC2 Instance Connect, or
2) (more flexible, fewer bugs) Generate a custom ssh command to connect to your instance on your own computer’s terminal.
For Option 1:
Under the “EC2 Instance Connect” tab change the User name field from “root” to
“ec2-user”. You must do this every single time if you are connecting via option 1.
Click “Connect”. You should now see a terminal logged in as ec2-user at your instance’s IP address.
For Option 2:
Under the “SSH client” tab, find your custom ssh command under “Example”:
IMPORTANT: You MUST change the word “root” to “ec2-user” in the ssh command in order for it to work! So in this example, the command you would run in your terminal is:
ssh -i “amb_keypair.pem” ec2-user@ec2-3-227-206-212.compute-1.amazonaws.com
Mac users: Open a terminal window by searching “terminal” in your Mac’s search
bar.
Remember the .pem file you downloaded at the beginning of lab 2? Find where it
is on your computer. The absolute easiest way to access it is to put it on your Desktop.
In the terminal, type the following (without the angle brackets), filling in the proper path to the directory that contains your .pem file:
cd <path to the directory containing your .pem file>
If your .pem file is on your Desktop, just type
cd ~/Desktop
Then restrict the key file’s permissions by typing (again, filling in your own file name):
chmod 400 <your_keypair.pem>
Paste the custom ssh command from AWS into your terminal, MAKING SURE TO
CHANGE THE WORD “root” to “ec2-user”. Everything else stays the same.
If you are successfully connected, you should see a window like this:
** NOTE: the name of your .pem file will be different. Mine is called
amb_keypair.pem, reflected in the code.
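Putting the whole Option 2 sequence together, a typical session looks something like this. This is only a sketch using the instructor’s example key name and host from above; your .pem file name and your instance’s address will differ:
cd ~/Desktop                        # directory that contains your key file
chmod 400 amb_keypair.pem           # keys must not be readable by others
ssh -i "amb_keypair.pem" ec2-user@ec2-3-227-206-212.compute-1.amazonaws.com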
If you allocated a custom elastic IP address to your instance, you can use the same
line of code that you just entered every time you want to access your instance (after
making sure your instance is running on AWS). If you did NOT allocate a custom IP
address, the ssh command will change every time.
I HIGHLY recommend you allocate a custom IP so accessing your instance on your
computer’s terminal is as easy as two simple steps:
1) Changing your working directory to where your .pem file is;
2) Running the exact same line of ssh code every time.
To allocate a custom IP, see separate document: “Quick Guide for Allocating a
Custom IP Address to your Instance” on Blackboard and posted in the “aws” channel
on Slack.
Quick Guide for Allocating a Custom IP Address to your Instance
If you want to use the same ssh code every time to easily connect to your instance (trust me
on this, you do), you need to allocate and then associate a custom IP address to your specific
instance:
1) On your EC2 dashboard, click “Elastic IPs” under Network and Security on the left
sidebar:
2) Click “Allocate Elastic IP address”
Make sure “us-east-1” is selected under Network Border Group, then click “Allocate”.
3) You should now see a custom IP address listed. Check the box next to the name, then
click “Associate this Elastic IP address”:
4) Make sure “Instance” is selected under Resource type, then select your instance under the
Instance drop-down menu.
5) Click “Associate”.
6) You now have a custom IP address! To access this, return to your EC2 instance page.
Check the box next to your instance and click “Connect”
7) Navigate to the “SSH client” tab. Your custom ssh command, which includes your
custom, allocated IP address, is under the “Example”:
IMPORTANT: You MUST change the word “root” to “ec2-user” in the ssh command in order for it to work! So in this example, the command you would run in your terminal is:
ssh -i “amb_keypair.pem” ec2-user@ec2-3-227-206-212.compute-1.amazonaws.com
• 1. Given gene annotation coordinates, students will calculate the lengths of exons
and introns, coding sequence, and the final protein product. BP; ESI1-2
• 2. Run programs at the bash command prompt, adjust program options using
command line switches, seek help for options using command line tools, and
specify input and output files. CT; TECH1-2
• 3. Launch a provided Amazon machine instance and connect to the instance. CT; TECH1-2
• 4. Track and store generated lab scripts, outputs, and results using GitHub code
development platform. CT; TECH1-2
WE MEET ON FRIDAY!
The Kills – “DNA”
What’s due by Friday morning?
• Pre-lecture questions
• Lab 2 blackboard questions
• Lab 2 GitHub repository – files and README.md
What’s a gene family?
How can you find out what gene family your starting gene
belongs to?
WE MEET ON FRIDAY!
Your gene belongs to a gene family.
Gene families are genes in a genome that are related
by common descent because they are the result of
gene duplication from a common ancestral gene.
They therefore have related functions.
A gene family
Sequence alignment of α-β globins, myoglobin, neuroglobin, globin-X and cytoglobin 2.
Wetten et al. 2010
We have a lot of data to process. We need computers.
Where can you find a computer that will:
• stay on, even when you are away
• be accessible from anywhere
• grow as large as you want (storage, processing power)?
Answer: AWS + EC2
• Bioinformaticians run pipelines
• A series of programs to manipulate and analyze data
• Examples: processing genome sequencing data,
processing transcriptomics data, estimating ancestry,
phylogenetics..
• For example, to evaluate SNPs from next generation
sequencing reads, we call no fewer than eight separate
programs…
• GBs of data from each subject… process hundreds of
patients…
Welcome to N. Virginia!
https://www.theatlantic.com/technology/archive/2016/01/amazon-web-services-data-center/423147/
• AWS is a great way to run these pipelines.
• We will use AWS academy
Are you set up on AWS Academy?
A. I have not received an invitation.
B. I received an invitation but have not accepted it yet.
C. I am in the AWS classroom, but haven’t yet set up my instance.
D. I have set up my instance, launched it, and I am getting ready to connect.
E. I don’t even understand this question.
Secure shell (SSH) text connection: this allows you to send and receive text from your virtual machine’s Linux console. An SSH connection is a good choice if you don’t need a graphical display.
Welcome to the shell!
How will you connect to your EC2 instance?
A. I will be connecting via EC2 Instance Connect Session Manager
B. I will connect via AWS Cloudshell
C. I will be connecting via a terminal on a mac or other linux box.
D. I will be connecting via ssh software on a PC
E. I’ve successfully connected to my instance via SSH. (any architecture)
F. I really have no idea how I am going to connect.
G. I don’t understand the question.
pwd = print working directory
cd = change directory
ls = list directory contents
cat = display file (etc.)
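For example, a first session in the shell might look like this (lab2-myusername stands in for your own repository name):
pwd                  # show which directory you are currently in
ls                   # list the files in that directory
cd lab2-myusername   # move into the lab repository
cat README.md        # print the lab instructions to the screen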
Use Git to clone this lab’s repository to the
local instance
To do this, you first need to generate a PAT (personal access token). In your web browser:
1. Navigate to: https://github.com/settings/tokens
2. Click “Generate new token”
3. Add a note (e.g. “for my Amazon instance”)
4. Select an expiration date (at least until the end of the semester)
5. Check the box “repo” under scopes
6. Click “Generate token”
7. Copy your token to the clipboard, and save it in a safe place.
git clone https://github.com/Bio312/lab2-myusername
cd lab2-myusername
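When git asks you to authenticate, enter your GitHub username and paste the PAT as the password (GitHub no longer accepts account passwords over HTTPS). If you would rather not paste the token every time, one optional convenience (my suggestion, not a required lab step) is to let git cache it after the first successful clone or push:
git config --global credential.helper store   # stores the username/token so later git commands reuse it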
How did you connect to your EC2 instance?
A. I connected via EC2 Instance Connect Session Manager
B. I connected via AWS Cloudshell
C. I connected via a terminal on a mac or other linux box.
D. I connected via ssh software on a PC
E. I didn’t connect.
F. I don’t understand the question.
Secure shell (SSH) text connection: this allows you to send and receive text from your virtual machine’s Linux console. An SSH connection is a good choice if you don’t need a graphical display.
Welcome to the shell!
Did you pull the lab from GitHub?
A. Yes!
B. Not yet
Did you push your files to the remote GitHub repository?
A. Yes!
B. Not yet
Local file system – PC example
cd C:\Users\Matthew\Desktop
Local file system – Mac example
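The Mac equivalent of the PC example above would be something like the following (a sketch; your own username and folder will differ):
cd /Users/Matthew/Desktop      # or simply: cd ~/Desktop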
Remote file system
Which linux program lists the files
in your directory?
A. list
B. less
C. ls
D. dir
E. cd
Remote file system
Where should you cd to?
A. cd lab2-myusername
B. cd lab2-L06-charlesdarwin (if your name is Charles Darwin and you are in L06)
Welcome to the shell!
Many ways to do bioinformatics.
You could complete this lab in
hundreds of different ways…
maybe even better and more
efficient ways.
ncbi-acc-download
• A Python script written to download sequences from NCBI’s FTP site (FTP = file transfer protocol)
• There are many similar solutions; I chose this one because of its relative simplicity
• Note: Python is an interpreted, high-level, general-purpose programming language.
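A minimal usage sketch (the accession is the same one used later in this lab; the -F switch selects the download format):
ncbi-acc-download -F fasta NM_001180370    # download the record in FASTA format
ncbi-acc-download -F gff3 NM_001180370     # download the annotation in GFF3 format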
Command Flag or Option or Switch
Programs installed on all Linux systems…
more – Display output one screen at a time.
less – Page through text one screenful at a time; search through output; edit the command line. less provides “more” emulation plus extensive enhancements, such as allowing backward paging through a file as well as forward movement.
cat – Concatenate and print (display) the content of files.
less NW_001834278.1.gff
less -S NW_001834278.1.gff
Chop long lines.
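To find out what a switch such as -S means, ask the program itself; both of these are standard on Linux systems:
man less       # full manual page for less (press q to quit)
less --help    # one-screen summary of the available options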
A. Modifier or Alternative or Adjunct
B. Command Flag or Option or Switch
C. Man or Help or Instructions
D. STDOUT or STDIN
File format you need to know: GFF
The general feature format (gene-finding format, generic feature format, GFF) is a file
format used for describing genes and other features of DNA, RNA and protein sequences.
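For reference, each GFF3 feature is one line of nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). The feature values below are made up purely for illustration:
NW_001834278.1   RefSeq   exon   1050   1230   .   +   .   ID=exon-1;Parent=rna-1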
File format you need to know: FASTA
.fas .fa .fasta
Description line
Multi-sequence fasta
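A made-up multi-sequence FASTA file looks like this: each record starts with a “>” description line, followed by the sequence itself on the next line(s):
>seq1 first made-up record; free-text description follows the identifier
ATGGCTAGCTTAGGCTTAACG
>seq2 second record in the same file
GGCTTAAGCTAGCTAGCATTA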
bedtools – your first true genomics tool
• bedtools: a powerful toolset for genome arithmetic
• Collectively, the bedtools utilities are a swiss-army knife of tools for a wide range of genomics analysis tasks.
• While each individual tool is designed to do a
relatively simple task (e.g., intersect two interval
files), quite sophisticated analyses can be
conducted by combining multiple bedtools
operations on the UNIX command line.
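Two hedged examples of that “genome arithmetic” idea (the file names here are placeholders, not files provided by the lab):
bedtools intersect -a genes.bed -b peaks.bed > overlap.bed    # keep intervals in genes.bed that overlap peaks.bed
bedtools getfasta -fi genome.fa -bed genes.bed -fo genes.fa   # pull the sequence under each interval out of a FASTA genome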
Diagnose this command!
What are the inputs?
What are the outputs?
Diagnose this command!
What are the inputs?
What are the outputs?
What format will be outputted as a result of this command?
A. gff
B. bed
C. fasta
D. genbank
emboss
• EMBOSS is “The European Molecular Biology Open
Software Suite”. EMBOSS is a free Open Source
software analysis package specially developed for the
needs of the molecular biology (e.g. EMBnet) user
community. The software automatically copes with
data in a variety of formats and even allows transparent
retrieval of sequence data from the web. Also, as
extensive libraries are provided with the package, it is a
platform to allow other scientists to develop and
release software in true open source spirit. EMBOSS
also integrates a range of currently available packages
and tools for sequence analysis into a seamless whole.
emboss
• Sequence alignment
• Rapid database searching with sequence patterns
• Protein motif identification, including domain analysis
• Nucleotide sequence pattern analysis, for example to identify CpG islands or repeats
• Codon usage analysis for small genomes
• Rapid identification of sequence patterns in large scale sequence sets
• Presentation tools for publication
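Two illustrative EMBOSS commands, assuming the transeq and needle programs are installed; the file names are placeholders, not lab files:
transeq -sequence cds.fa -outseq protein.fa   # translate a nucleotide coding sequence to protein
needle -asequence geneA.fa -bsequence geneB.fa -gapopen 10 -gapextend 0.5 -outfile pair.needle   # global pairwise alignment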
genometools – The versatile open source genome analysis software
• Similar idea to bedtools
• Powerful and diverse set of tools – try them out!
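For instance, assuming the gt wrapper and its gff3 and stat subcommands are available on your instance, you could tidy and summarize an annotation file like this (a sketch, not a required lab step):
gt gff3 -sort -tidy NW_001834278.1.gff > sorted.gff   # sort and clean up a GFF3 file
gt stat sorted.gff                                    # print summary statistics for the annotation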
Good luck!
Shell scripting
A shell script is a computer program designed to be run by the Unix shell,
a command-line interpreter. Note: we are using the bash shell; there are many
flavors.
Create a file; the first line of the file contains the shell we want to use:
#!/bin/bash
Then, additional lines can include commands:
ncbi-acc-download -F genbank NM_001180370
ncbi-acc-download -F fasta NM_001180370
ncbi-acc-download -F gff3 NM_001180370
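If those lines were saved in a file called download.sh (the name is just an example), you could run the script in either of two standard ways:
bash download.sh        # run it through bash directly
chmod +x download.sh    # or make it executable once...
./download.sh           # ...and then run it like any other program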
How do you know if you are cool?
vi or vim or nano
nano – the text editor
A text editor for efficiently creating and changing any kind of text. It is included as “nano” with most UNIX systems.
Cloud computing for genomic data
analysis and collaboration
Ben Langmead1 and Abhinav Nellore2
Abstract | Next-generation sequencing has made major strides in the past decade. Studies based
on large sequencing data sets are growing in number, and public archives for raw sequencing
data have been doubling in size every 18 months. Leveraging these data requires researchers to
use large-scale computational resources. Cloud computing, a model whereby users rent
computers and storage from large data centres, is a solution that is gaining traction in genomics
research. Here, we describe how cloud computing is used in genomics for research and
large-scale collaborations, and argue that its elasticity, reproducibility and privacy features make
it ideally suited for the large-scale reanalysis of publicly available archived data, including
privacy-protected data.
Glossary
Sequencing reads: snippets of DNA sequence as reported by a DNA sequencer.
Storage: a component of a computer that stores data.
Processors: a central component of a computer in which the computation takes place.
Computer cluster: a collection of connected computers that are able to work in a coordinated fashion to analyse data.
1 Department of Computer Science, Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA.
2 Department of Biomedical Engineering, Department of Surgery, Computational Biology Program, Oregon Health and Science University, Portland, OR, USA.
e-mail: langmea@cs.jhu.edu; anellore@gmail.com
doi:10.1038/nrg.2017.113
Published online 30 Jan 2018; corrected online 12 Feb 2018
Next-generation sequencing (NGS) technologies have
been improving rapidly and have become the workhorse technology for studying nucleic acids. NGS platforms work by collecting information on a large array
of polymerase reactions working in parallel, up to billions at a time inside a single sequencer1. The speed
and decreasing cost of NGS have led to the rapid accumulation of raw sequencing data (sequencing reads),
used in published studies, in public archives2 such as
the Sequence Read Archive (SRA)3,4, which is hosted by
the US National Center for Biotechnology Information
(NCBI), and the European Nucleotide Archive (ENA)5,
which is hosted by the European Molecular Biology
Laboratory at the European Bioinformatics Institute
(EMBL–EBI). The SRA now holds about 14 petabases
(millions of billions of bases) and has been doubling in
size every 10–20 months (FIG. 1). Genomics researchers
can use these archived data for various scientific purposes6. For example, in the microarray era, researchers
combined and reanalysed large collections of archived
data for platform comparisons 7 to improve methods8,9, conduct meta-analyses10,11 or find clinical predictors12. Sequencing data archives can democratize
access to valuable data, which in turn can improve our
understanding of biology, genetics and disease.
NGS is also fuelling ever-larger collaborations that generate vast sequencing data sets, including the Genome
Aggregation Database (gnomAD), which in its first
release contained exclusively exome data and was known
as the Exome Aggregation Consortium (ExAC)13, the
International Cancer Genome Consortium (ICGC)14,
the Genotype–Tissue Expression (GTEx) Project 15,16
and the Trans-Omics for Precision Medicine (TOPMed)
programme17, among others (TABLE 1). gnomAD now
spans over 120,000 exomes and over 15,000 whole
genomes. ICGC encompasses over 70 subprojects targeting distinct cancer types, which are conducted in more
than a dozen countries and have already collected samples from more than 20,000 donors. Aligned sequencing reads for ICGC require over 1 petabyte (PB; that
is, a million GB) of storage. The TOPMed programme,
which plans to sequence more than 120,000 genomes17,
has already deposited more than 18,000 human whole-­
genome sequencing data sets in the SRA, comprising
approximately 2.3 petabases or about 16.5% of the
entire archive. Large observational studies currently in
progress, such as the Precision Medicine Initiative18 and
Million Veterans Project 19, will drive up the totals yet
more rapidly.
While advances in NGS have increased opportunities
for reuse and collaboration, they have also created new
computational problems. To convert raw sequencing
data to scientific results requires coordinated computation, storage and data movement. Computer processors
are required to solve the various computational problems
encountered along the way, for example, read alignment,
de novo assembly, variant calling and quantification.
Storage is needed to hold raw data, processed data and
data from the computational steps in between. A perspective on data set sizes, relating size in bytes to number of
nucleotides and amount of computational power needed
to analyse, is presented in TABLE 2. Sometimes the resource
requirements are modest enough to fit within the computer cluster of research laboratories and small institutions.
But as NGS throughput and archives continue to grow and as projects grow larger, the resources required will more frequently outgrow those owned by a single laboratory or institution. Investigators must look for other ways to procure the needed computation and storage.

Figure 1 | Increase in storage of next-generation sequencing data. From July 2012 to March 2017, the amount of genomic data (total bases) in the Sequence Read Archive (SRA) doubled four times (0.625 to 1.25 PB in 9.1 months; 1.25 to 2.5 PB in 16.8 months; 2.5 to 5 PB in 19.2 months; 5 to 10 PB in 11.3 months). The large jump in October 2016 is chiefly due to data from the Trans-Omics for Precision Medicine (TOPMed) project. As of October 2017, the SRA contains about 14 petabases (millions of billions of bases) of data. PB, petabases.
Coincident with the success of NGS has been another
technological success story: cloud computing. Although
cloud computing was not invented with science in mind
— major commercial cloud customers are technology
companies and other businesses — it increasingly has
a key role as the venue of choice for many science and
engineering efforts20. In genomics, in particular, cloud
computing has played a major part in two areas. The first
involves the reanalysis of vast data sets available in public
sequencing data archives. The second area where cloud
computing has made inroads is enabling collaborations
on large amounts of shared data. The distributed nature
of the cloud has made it a natural venue for collaborative and distributed computing efforts and for otherwise
facilitating collaboration. This has been happening most
visibly in projects such as the ICGC and the related Pan-Cancer Analysis of Whole Genomes (PCAWG) effort21,
the National Cancer Institute (NCI) Cancer Genomics
Cloud (CGC) Pilots22–24 and the Encyclopedia of DNA
Elements (ENCODE and Model Organism ENCODE
(modENCODE)) projects25,26.
In this Review, we begin by describing the cloud computing model and its different forms. We then discuss
the two major areas in which cloud computing is having an impact on genomics: the reanalysis of large-scale
archived data sets and large genomics collaborations. We
give some perspective on how a move to cloud computing in genomics research will affect software development, training and funding, briefly discuss privacy and
regulatory issues and conclude with thoughts on the
future of cloud computing in genomics.
The cloud model
A formal definition of cloud computing, according to the US National Institute of Standards and Technology (NIST), is “a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources … that can be rapidly provisioned and released with minimal management effort or service provider interaction”27. Put briefly, cloud computing is a way of organizing computing resources so that users can conveniently rent them instead of buying them. At its advent, cloud computing was focused on computing infrastructure (storage and computers), although it has since expanded to include platforms and software.

Table 1 | Large genomics projects and resources
Name | Website | Description
1000 Genomes Project (1KGP) | www.internationalgenome.org | This project includes whole-genome and exome sequencing data from 2,504 individuals across 26 populations
Cancer Cell Line Encyclopedia (CCLE)115 | portals.broadinstitute.org/ccle | This resource includes data spanning 1,457 cancer cell lines
Encyclopedia of DNA Elements (ENCODE)33 | www.encodeproject.org | The goal of this project is to identify functional elements of the human genome using a gamut of sequencing assays across cell lines and tissues
Genome Aggregation Database (gnomAD)13 | gnomad.broadinstitute.org | This resource entails coverage and allele frequency information from over 120,000 exomes and 15,000 whole genomes
Genotype–Tissue Expression (GTEx) Portal15,16 | gtexportal.org | This effort has to date performed RNA sequencing or genotyping of 714 individuals across 53 tissues
Global Alliance for Genomics and Health (GA4GH)92 | genomicsandhealth.org | This consortium of over 400 institutions aims to standardize secure sharing of genomic and clinical data
International Cancer Genome Consortium (ICGC)14 | icgc.org | This consortium spans 76 projects, including TCGA
Million Veterans Program (MVP)19 | www.research.va.gov/mvp | This US programme aims to collect blood samples and health information from 1 million military veterans
Model Organism Encyclopedia of DNA Elements (modENCODE)25,85 | www.modencode.org | The goal of this effort is to identify functional elements of the Drosophila melanogaster and Caenorhabditis elegans genomes using a gamut of sequencing assays
Precision Medicine Initiative (PMI)18 | allofus.nih.gov | This US programme aims to collect genetic data from over 1 million individuals
The Cancer Genome Atlas (TCGA)116 | cancergenome.nih.gov | This resource includes data from 11,350 individuals spanning 33 cancer types
Trans-Omics for Precision Medicine (TOPMed)17 | https://www.nhlbiwgs.org | The goal of this programme is to build a commons with omics data and associated clinical outcomes data across populations for research on heart, lung, blood and sleep disorders

Table 2 | A comparison of genomics data types
NGS technology | Total bases | Compressed bytes | Equivalent size | Core hours to analyse 100 samples | Comments
Single-cell RNA sequencing | 725 million | 300 MB | 50 MP3 songs | 20 | >100,000 such samples in SRA, >50,000 from humans
Bulk RNA sequencing | 4 billion | 2 GB | 2 CD-ROMs | 100 | >400,000 such samples in SRA, >100,000 from humans
Human reference genome (GRCh38) | 3 billion | 800 MB | 1 CD-ROM | NA |
Whole-exome sequencing | 9.5 billion | 4.5 GB | 1 DVD movie | 4,000 | ~1,300 human samples from 1000 Genomes Project alone
Whole-genome sequencing of human DNA | 75 billion | 25 GB | 1 Blu-ray movie | 30,000 | ~18,000 human samples with 30× coverage from the TOPMed project alone
All numbers are approximate. Sizes and prevalence in the Sequence Read Archive (SRA) are estimated using the SRA RunSelector tool. Computation amounts are estimated from figures in published studies. Analyses for these figures are summarized in Supplementary methods S1. NA, not applicable; NGS, next-generation sequencing; TOPMed, Trans-Omics for Precision Medicine.
In the cloud computing model, computational
resources such as processors and hard disks are
thought of as utilities to be rented from a provider
(TABLE 3). The term ‘cloud provider’ is most often used
to describe major US‑based commercial services
such as Amazon Web Services (AWS), Google Cloud
Platform or Microsoft Azure. However, the number
of cloud vendors has proliferated recently, and many
other cloud services, both commercial and academic
(for example, Open Science Data Cloud, the EMBL–
EBI Embassy Cloud, Helix Nebula and Jetstream) are
currently available worldwide (TABLE 3). These have
matured rapidly in recent years, creating new data
centres, lowering prices, adding services and generating newsworthy profits 28. Providers control vast
pools of computers and storage that are organized
into data centres scattered across the world. Users
request resources, use them and then release them
back into the pool when the work is complete. Fees are
incurred according to usage. Storage incurs a per-GB
per-month fee, and computers incur a per-computer
per-unit-time fee, where time units might be seconds,
minutes or hours. Users are billed monthly, just as for
a home utility.
Elasticity. The cloud’s hallmarks are elasticity and convenience. Elasticity refers to the ability to rent and pay for the exact resources needed. The user is not compelled to downscale the task to fit the confines of a local cluster (‘underprovisioning’) nor must the user incur the cost of purchasing an amount of computing to match the largest possible future need (‘overprovisioning’).

Cloud providers — in particular commercial providers — control enormous fleets of computers. Although exact numbers are not public knowledge, a 2014 study estimated that AWS housed more than 1.4 million servers (computers)29. This is a few orders of magnitude larger than academic clusters, which range from hundreds to low thousands of servers, or even supercomputing centres, which typically have many thousands of servers. While they are generally smaller, academic clouds are proliferating20, building on platforms such as OpenStack30 and OpenNebula31 that convert large clusters into cloud-like resources.

Because of the size of cloud data centres, computational requests large and small can be fulfilled quickly, sometimes immediately. A user requesting a 1,000-computer cluster for 1 hour is as likely to succeed as a user requesting a single computer for 1,000 hours. The ability to recruit vast resources is crucial; instead of waiting for the trickle of computer-hours available on a busy institutional cluster, the user can rent a cluster the size of an entire institutional cluster for a day. Work is completed in a fraction of the time, and the user pays only for what they use. For services available to a wide user community (for example, a genetic imputation server32), elasticity allows the computational burden of answering user queries to be spread across many computers, avoiding bottlenecks and delays (FIG. 2).

The cloud also frees the user from maintaining computer hardware. Cloud providers maintain data centres in a way that achieves economies of scale. Users need not be concerned with outages, software patches, service contracts or damaged parts. That said, the task of recruiting and maintaining cloud resources appropriate for one’s needs is itself a complex administrative task that can require a professional administrator or a dedicated effort to learn.
Variations on the cloud. The NIST definition of
cloud computing describes a range of ‘deployment
models’27, including private clouds, where resources
are made available only to users in an organization;
community clouds, where several organizations share
access to an otherwise private cloud; public clouds,
where resources are available to the public (for example, commercial clouds); and hybrid clouds, where
multiple deployment models are mixed. The hybrid
cloud might be appropriate for an organization that
has a private cloud that is occasionally overbooked, requiring that some work be ‘offloaded’ to a larger commercial cloud. The term ‘science cloud’ is sometimes used to refer to a community cloud made available specifically to scientific researchers, such as Jetstream33 or the proposed European Open Science Cloud34 (TABLE 3).

Table 3 | Cloud service providers
Platform | Website | Notes
Commercial cloud services
AWS | www.aws.amazon.com | Largest commercial provider, IaaS
Google Cloud Platform | cloud.google.com | IaaS
Microsoft Azure | azure.microsoft.com | IaaS
IBM Cloud | www.ibm.com/cloud/ | IaaS
Alibaba Cloud | www.alibabacloud.com | IaaS
DNAnexus | www.dnanexus.com | SaaS; used by ENCODE
Illumina BaseSpace Sequence Hub | basespace.illumina.com | SaaS
Seven Bridges | www.sevenbridges.com/platform | SaaS; hosts one of the platforms for the CGC Pilots
Globus Genomics | globusgenomics.org | SaaS
Academic cloud services
Rodeo | www.tacc.utexas.edu/systems/rodeo | Part of TACC, which is located at the University of Texas at Austin and contains large-scale computing resources; comprises 256 processing cores and 2 TB of memory allocated for the Galaxy public server
Corral | www.tacc.utexas.edu/systems/corral | Part of TACC; comprises 20 PB of storage allocated for the Galaxy public server
XSEDE75 | xsede.org | Supported by NSF
Jetstream33 | jetstream-cloud.org | Science cloud in XSEDE located at TACC and Indiana University’s Pervasive Technology Institute
European Open Science Cloud | ec.europa.eu/research/openscience/index.cfm?pg=open-science-cloud | Proposed science cloud across European Union member states
OSDC | opensciencedatacloud.org | PB-scale science cloud tailored for access to and analysis of publicly available data stored in the OSDC Data Commons
Bionimbus Protected Data Cloud117 | bionimbus-pdc.opensciencedatacloud.org | Science cloud associated with OSDC that permits secure analysis of protected health information
Compute Canada | computecanada.ca | High-performance Canadian computing network spanning ACENET, Calcul Québec, Compute Ontario and WestGrid
de.NBI | www.denbi.de | Bioinformatics service provider in Germany spanning education, consulting, computing and storage, as well as databases
Embassy Cloud | embassycloud.org | Science cloud for EMBL–EBI affiliates including direct access to public genomics data sets
Helix Nebula | helix-nebula.eu | European open science partnership across industry and academia to provide cloud computing infrastructure
Nectar Cloud | nectar.org.au | Self-service Australian science cloud
Broad FireCloud | software.broadinstitute.org/firecloud | Part of the CGC Pilots initiative by the NCI to fund three platforms that colocate genomic data and analysis tools in the cloud; includes collaboration tools; allows working with new data
ISB-CGC | http://cgc.systemsbiology.net/ | Part of the CGC Pilots initiative
Seven Bridges CGC | cancergenomicscloud.org | Part of the CGC Pilots initiative; includes collaboration tools; allows working with new data
IaaS: infrastructure as a service providers offer pay-as-you-go cloud computing infrastructure together with software layers that facilitate deployment. SaaS: software as a service providers offer cloud-based genomics data analysis, data sharing and collaboration tools that run on IaaS providers.
ACENET, Atlantic Computational Excellence Network; AWS, Amazon Web Services; CGC, Cancer Genomics Cloud; de.NBI, German Network for Bioinformatics Infrastructure; EMBL–EBI, European Molecular Biology Laboratory at the European Bioinformatics Institute; ENCODE, Encyclopedia of DNA Elements; ISB, Institute for Systems Biology; Nectar, National eResearch Collaboration Tools and Resources project; NCI, National Cancer Institute; NSF, National Science Foundation; OSDC, Open Science Data Cloud; TACC, Texas Advanced Computing Center; XSEDE, Extreme Science and Engineering Discovery Environment.
The definition of ‘cloud computing’ is flexible in
another respect: the resources on offer can range from
lower-level computing infrastructure to higher-level
platforms and even higher-level software. Infrastructure
as a service (IaaS) refers to the renting out of capabilities as low-level as computers, disk space and network
bandwidth. The infrastructure can be used for nearly any
purpose, at the cost of greater effort spent on administration and maintenance.

Figure 2 | Cloud elasticity. Elasticity allows the user to rent resources while paying for only what is used. Imagine a scenario with two computational tasks to perform (coloured red and blue). The red task requires 36 computer-hours and runs on up to 8 computers simultaneously. The blue task requires 18 computer-hours and runs on up to 3 computers simultaneously. On a smaller cluster (left), both the tasks run sequentially and require 15 hours to complete. On a larger cluster (right), representing a cloud cluster, the tasks can run simultaneously and the red task can use its full complement of 8 computers. As a result, both tasks are completed within 6 hours. This ignores the fact that many more users are contending for cloud clusters than are contending for an institutional cluster. The greater number of users is mitigated by the fact that needs and timing vary from user to user. Cloud providers also provide incentives, such as spot pricing, to encourage renting at less busy times.

Glossary (continued)
Metadata: information about a data set, often pertaining to how and from where it was collected. For example, for a human data set, metadata might include sex, age, cause of death and sequencing protocol used.
Containers: similar to ‘virtual machines’, containers are ‘virtual computers’ that enable the use of multiple, isolated services on a single platform. They can run in the context of another computer, using a portion of the host computer’s resources. Docker and Singularity are two container management systems.
Firewalls: barriers that prevent unwanted, perhaps insecure network traffic from reaching a protected network.

Platform as a service (PaaS)
refers to the renting out of cloud-based platforms on
which the user can run software. For instance, some
commercial cloud providers offer platforms for running
software built on the MapReduce programming model
(reviewed in REF. 35). This gives the user the flexibility
to run a range of software tools as long as each tool is
compatible with the provided platform, but it requires
much less administration and maintenance effort than
IaaS. Software as a service (SaaS) refers to the renting out
of software, which should be a familiar idea to users of
Google products such as Gmail and Google Docs. In the
genomics space, commercial SaaS companies (TABLE 3)
provide cloud-based software for analysing sequencing
data. Third parties have relevant SaaS offerings as well;
for instance, preliminary PCAWG studies used a SaaS
product called Elasticsearch to manage and index metadata36,37 and other products called Logstash and Kibana
to index and analyse logging data36.
Application to genomics
Advantages of cloud computing for genomics researchers. For scientific users, the cloud has two major advantages: reproducibility and global access. Cloud resources
are rented in virtualized slices called instances.
Providers advertise a stable menu of instance types,
each with defined capabilities, for example, a certain
processor speed, amount of disk space or amount of
random-­access memory. This predictability extends
to the software running on the instances; the user
decides exactly which software catalogue should be
preinstalled, including the operating system and software. This approach has at least two benefits. First, it
gives researchers easy access to a variety of common
software setups, sidestepping the challenges associated
with installing complex software. Second, it allows
users, perhaps on opposite ends of the globe, to create near-­identical hardware and software setups (FIG. 3).
This is crucial in a field that has seen reproducibility
failures38–40 and where many are calling for more rigorous accounting of factors, such as underlying software
versions, that directly or indirectly affect computational analyses41,42. Similar reproducibility advantages
are possible on non-cloud computers using virtual
machines43,44 and Docker 45 or Singularity 46 containers,
tools that package software with all the necessary components to enable reproducible deployment in different
computing environments (TABLE 4).
The cloud is also globally accessible. A user anywhere in the world can rent resources from a provider,
regardless of whether the user is near a data centre.
Data can be secured and controlled by the collaborators without having to navigate several institutions’
firewalls. Team members can use the same commands
to run the same analysis on the same (virtualized)
hardware and software. This makes the cloud an
attractive venue for large genomics collaborations
and an important tool in the effort to promote robust
sharing of genomics data18,47. For example, the cloud
is the home of the National Institutes of Health (NIH)
Data Commons Pilot, an effort to increase avail­ability
and utility of data and software from NIH-funded
efforts48,49.
Revitalizing archived data. The 14 petabases of data
stored in the SRA were obtained from more than
100,000 studies spanning many species, individuals,
tissues, sequencing instruments and assays. The data
are available to download, with the caveat that access
to most human data, and a majority of the nucleotides in the SRA overall, is controlled-access, protected
by dbGaP measures for maintaining privacy of donors50;
dbGaP refers to the Database of Genotypes and
Phenotypes, which is the part of the SRA that houses
controlled-­access sequencing data. In short, the SRA
gives researchers ready access to valuable data, some of
it quite unique — for example, from individuals with
rare phenotypes or from hard‑to‑obtain tissues — and
does so at a scale that no one laboratory or institution
could recreate. These data have enabled researchers to
reproduce past findings or to ‘borrow’ data to address
new questions. For example, one recent study gathered human RNA sequencing (RNA-seq) samples from
various projects, including The Cancer Genome Atlas
(TCGA) and ENCODE, to create a catalogue of long
non-coding RNAs51. Another effort reanalysed RNA-seq
data from modENCODE52,53 that detailed transcription
in Drosophila melanogaster over developmental time.
The reanalysis focused on gene expression in the endosymbiont Wolbachia pipientis and yielded new insights
into gene expression patterns related to symbiosis54.
Public archives such as the SRA are comprehensive
enough to allow researchers to ask and answer a broad
range of sophisticated questions without generating new
data. But to get new results from archived data sets, a
researcher must secure storage space, perform large
and time-consuming downloads and perform a computing-intensive reanalysis of the data, usually from
scratch. Downloading, storing and analysing the data
are resource-intensive and subject to bottlenecks, such
as internet uplink speeds; many labs are not equipped for
this, so valuable data go unused6.
A second major obstacle is data quality of both raw
data and metadata. Low-quality sequencing data may
not pass typical quality control measures, and metadata
such as platform, gender, age, tissue of origin and disease
state may be incomplete or inaccurate. In the absence of reliable and automatic methods for dealing with poor-quality data and metadata — a problem receiving more attention55 — researchers must approach public data cautiously.

Figure 3 | Cloud reproducibility. The cloud fosters reproducibility by enabling investigators to publish data sets to the cloud, including different versions thereof, without loss or modification of the previous data set. Other investigators, situated near or far geographically, can clone data sets within the cloud and apply customized software to perform their own analyses and derive new results. Independent investigators can copy original primary data sets, software and published results within the cloud to replicate published analyses (part a, data layer). Collaborating investigators can set up cloud-based virtual machine images that contain the software, configurations and scripts needed for specific computational analyses (part b, system layer). Customized machine images can be copied and shared with other investigators within the cloud for replicate analyses (not shown). Figure adapted from REF. 114, Macmillan Publishers Limited.
Cloud computing addresses many problems posed
by data archives. The cloud’s elasticity allows users to
scale computing resources in proportion to the amount
of data being analysed, sidestepping constraints imposed
by local clusters. Input data can be downloaded directly
to the cloud computers that will process it, without first
traversing a particular investigator’s cluster. In some
cases, data may already be preloaded into a cloud; for
example, the ICGC data are available in the Cancer
Genome Collaboratory 56, an academic cloud computing
resource. If data are protected, for example, by dbGaP,
existing protocols make it possible to craft a compliant
cloud-based computational setup57. Commands used to
rent the cluster and run the software can be published or
shared so that collaborators can do the same, avoiding
inter-cluster compatibility issues.
Recent years have seen a series of studies applying
cloud computing to study large collections of publicly archived data. ReCount 58 used Myrna59, a cloud-­
enabled tool for calculating differential gene expression
in large RNA-seq data sets, and computing resources
rented from AWS to reanalyse 475 RNA-seq samples
from 18 different experiments. This is a small data
set by modern standards but constituted a large fraction of publicly available RNA-seq data in 2011. The
ReCount resource consists of genome-wide expression
values at the level of genes and exons, along with relevant metadata. The Intropolis60 and recount2 (REF. 61)
efforts used AWS and a custom cloud-enabled RNA-seq aligner62 to reanalyse over 70,000 human samples
spanning TCGA, GTEx and the SRA, releasing expression-level summaries at the level of splice junctions,
exons, genes and individual genomic bases. The Toil
effort 63 used a standardized computational pipeline and
AWS to analyse nearly 20,000 samples spanning four
major studies, including TCGA and GTEx, in under
4 days. After a total of approximately 1.3 million core
hours of analysis using Spliced Transcripts Alignment
to a Reference (STAR) 64, RNA-seq by Expectation
Maximization (RSEM)65 and Kallisto66, the estimated
cost for using discounted ‘pre-emptible’ (that is, excess)
instances was about US$1.30 per sample, but a much
lower per-­sample cost of US$0.19 was achieved with an
alternative pipeline using only Kallisto. Toil supports
workflows written in the Common Workflow Language
(CWL) 67 and can run the workflows in any major
commercial cloud or OpenStack-based cloud and
on many other clusters. Another effort used Google
Cloud Platform to quantify transcript expression levels for over 12,000 RNA-seq samples from large cancer
projects68. One of the pipelines proposed in this work,
again based on Kallisto, was shown to achieve a cost of
US$0.09 per sample, the lowest of any effort to date and
a small fraction of the cost of sequencing.
More recently, preliminary PCAWG studies reanalysed data from several individual ICGC projects, each studying different cancer types. The reanalysis spanned whole-genome sequencing data from 2,834 donors, including over 5,789 tumour and normal genomes. The vast computational needs, together with issues of data sovereignty, motivated the use of a distributed analysis strategy reminiscent of the ‘grid computing’ paradigm69, which predates cloud computing and has its own history of being applied to large-scale life science data sets70.

Table 4 | Ancillary technologies and services
Tool | Website | Notes
Workflow execution: tools that help deploy analysis workflows over large computer clusters
Butler36 | github.com/llevar/butler | PCAWG-affiliated workflow development tool and execution manager for OpenStack-based and commercial cloud platforms
Common Workflow Language | commonwl.org | Language for specifying analysis workflows read by Butler, Rabix and Toil
Nextflow118 | nextflow.io | Scientific workflow language deployed reproducibly via Docker or Singularity containers
Rabix/Bunny | github.com/rabix/bunny | Workflow executor
Toil | cgl.genomics.ucsc.edu/toil | Workflow executor used to analyse over 20,000 RNA-seq samples
Cluster management: tools that help create and manage large computer clusters
AWS Elastic MapReduce | aws.amazon.com/emr | AWS service that creates and manages Hadoop clusters for when efficient grouping and ordering of big data are required
Eucalyptus | github.com/eucalyptus/eucalyptus/wiki | Open-source framework for building clouds that mimic the AWS interface
OpenNebula31 | opennebula.org | Open-source software that sets up and manages infrastructure as a service
OpenStack35 | openstack.org | Open-source software that sets up and manages infrastructure as a service
StarCluster | star.mit.edu/cluster | Open-source software that facilitates creating and managing clusters on AWS
Reproducibility: tools that enable reproducibility of analysis results
Docker | docker.com | Packages software with all necessary components to enable reproducible deployment in different computing environments
Galaxy74 | galaxyproject.org | Cloud-based platform for analysis driven by reproducible workflows
Omics Pipe119 | sulab.org/tools/omics-pipe | Automates deployment of best-practice omics data analysis pipelines in the cloud
Singularity46 | singularity.lbl.gov | Similar to Docker but designed for scientific software running in high-performance computing environments
Data transfer: tools that enable rapid transfer of large volumes of data
Globus Online89 | globus.org | Enables rapid transfer of data between designated end points
GridFTP120 | http://toolkit.globus.org/toolkit/docs/latest-stable/gridftp/ | High-performance transfer protocol used by Globus Online
AWS, Amazon Web Services; PCAWG, Pan-Cancer Analysis of Whole Genomes; RNA-seq, RNA sequencing.
Combined with similar efforts that used large local
clusters rather than cloud computing 71, the field has produced an array of uniformly processed and summarized
data sets (TABLE 5). While these developments are recent,
we expect cross-study summaries such as recount2 and
PCAWG to become popular starting points for studies
that derive new conclusions from archived data. Of note,
enabled by the cloud’s predictable and transparent pricing, teams from around the world can publicly compete
to build the most cost-efficient analysis software.
A shared computational laboratory. Given the complexity of genomics studies and the need to enrol
patients in geographically dispersed study locations,
collaboration on large-scale genomics sequencing projects across multiple sites is fairly common.
Before a computational ana­lysis begins, all relevant
data are gathered at whichever site has the requisite
computing capacity and expertise. If more than one
site is to analyse the full data set, the data must be
copied. The larger and the more decentralized the
project, the more copies must be made72 (FIG. 4a). The
cloud combats this challenge by providing a single
common venue for data and computation (FIG. 4b).
Collaborators at the various sites can use computers
located near the data.
This approach is exemplified by the NCI CGC Pilots,
which were launched in 2016 to facilitate the reana­
lysis and querying of large cancer genomics data sets
such as TCGA (TABLE 1). Two of the three platforms tested as part of the pilot, the Broad Institute’s
FireCloud22 and Seven Bridges’ CGC23, additionally
allow users to upload their own data and perform ana­
lyses in workspaces that can be shared privately with
collaborators or publicly with the broader community.
Common ana­lysis workflows are provided, but users
can develop and share new workflows using CWL67
in the case of CGC or a custom language in the case of FireCloud. One recently published workflow used the CGC platform to detect patient-specific tumour neoantigens from sequencing data73.

Table 5 | Summarized data sets, services and resources
Name | Website | Notes
ArrayExpress | www.ebi.ac.uk/arrayexpress | Archives processed data from high-throughput functional genomics experiments
Beacon | beacon-network.org | Platform for sharing genetic mutations across web services called ‘beacons’
Bravo | bravo.sph.umich.edu | TOPMed data browser for accessing alleles across over 60,000 whole genomes
Expression Atlas121 | www.ebi.ac.uk/gxa | Gene expression information across 3,000 transcriptomic experiments from ArrayExpress
PCAWG | docs.icgc.org/pcawg | Called germline and somatic variants, including structural variants, from over 5,600 tumour and normal samples across ICGC projects
recount261 | jhubiostatistics.shinyapps.io/recount | Web and R/Bioconductor resource for accessing genome coverage data from over 70,000 archived human RNA-seq samples, including publicly available SRA, TCGA and GTEx samples
RNASeq-er93 | www.ebi.ac.uk/fg/rnaseq/api | Provides programmatic access to processed outputs for all archived publicly available RNA-seq samples
Snaptron | snaptron.cs.jhu.edu | Allows rapid querying of splice junctions, splicing patterns and metadata from recount2
Tatlow-Piccolo68 | osf.io/gqrz9 | Quantified transcripts across TCGA and CCLE
Toil63 | xenabrowser.net/datapages/?host=https://toil.xenahubs.net | Processed outputs from over 20,000 RNA-seq samples including TCGA and GTEx
Xena94 | xena.ucsc.edu | Visualizes investigators’ new functional genomics data next to publicly available data
CCLE, Cancer Cell Line Encyclopedia; GTEx, Genotype-Tissue Expression Project; ICGC, International Cancer Genome Consortium; PCAWG, Pan-Cancer Analysis of Whole Genomes; RNA-seq, RNA sequencing; SRA, Sequence Read Archive; TCGA, The Cancer Genome Atlas; TOPMed, Trans-Omics for Precision Medicine.
FireCloud and CGC rely on AWS and the Google
Cloud Platform for computing and data storage. By
contrast, since 2007, the Galaxy Project 74 has enabled
the execution of sharable analysis workflows for free
through its main public server, which uses computing
hardware at the Texas Advanced Computing Center
(TACC; part of the National Science Foundation
(NSF)-supported Extreme Science and Engineering
Discovery Environment (XSEDE))75. Hardware allocated exclusively for Galaxy users spans the Rodeo
cluster, with 256 processing cores and 2 TB of memory,
and Corral, with 20 PB of disk space. Hardware shared
with non-Galaxy users includes the Stampede cluster,
with over 400,000 processing cores and 205 TB of memory. However, a registered user has a 250 GB disk space
quota for their own data and is limited to running six
concurrent jobs, each using at most 16 processing cores.
Galaxy can also be used with the NSF-funded Jetstream
resource33, a large-scale cloud computing resource that
is part of XSEDE and broadly serves as an alternative to
commercial cloud computing services. Investigators can
request XSEDE allocations to use Jetstream, and analyses can be run on this platform through the Galaxy
main server.
Collaboration using Galaxy is less direct than on
FireCloud and CGC. Multiple collaborators do not
manage an analysis in a shared workspace; rather, a single user completes all or part of an analysis and Galaxy
records its history or the sequence of intermediate and
final outputs. The history is then shared with another
Galaxy user, who imports it, essentially creating a new
branch of the history. The history can also be published
for public consumption.
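These history-based interactions can also be scripted. As a rough sketch (not something prescribed in this Review), the snippet below uses BioBlend, Galaxy's Python client library; the server URL, API key and input file name are placeholder assumptions, and the calls available depend on the Galaxy version running on the server.

```python
# A minimal sketch using BioBlend, the Python client for the Galaxy API.
# Assumptions: an account and API key on some Galaxy server; URL, key and
# file name below are placeholders.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")

# Each analysis lives in a "history"; list the account's existing histories.
for history in gi.histories.get_histories():
    print(history["id"], history["name"])

# Create a new history and upload an input dataset into it.
new_history = gi.histories.create_history(name="shared-reanalysis")
gi.tools.upload_file("reads_1.fastq.gz", new_history["id"])
```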
The Galaxy interface can run atop different computing infrastructures, and the community has set up
more than 80 public servers in addition to the main
server 76. Users can also install Galaxy on cloud clusters
using CloudMan77, which not only supports AWS for
pay‑as‑you‑go storage and computing but also private and
public clouds that leverage OpenStack30 or OpenNebula31.
Globus Genomics35 is an alternative way to use Galaxy on
AWS. An initiative by the Computation Institute at the
University of Chicago, Globus Genomics not only combines an enhanced Galaxy instance for workflow management with AWS for powering computational analyses78
but also uses Globus Online79 to speed up data transfer.
Another way to analyse a distributed data set is to
distribute the computation, that is to analyse each
piece using computational resources near to it (FIG. 4c).
A layer of software coordinates computational activity
across sites, turning them into a federated cloud that
can run a common analysis workflow while enforcing
colocalization of data and the computing infrastructure
used to analyse them. This is the approach taken by
the Collaborative Cancer Cloud (CCC), a partnership
between Intel, Oregon Health and Science University, the
Dana-Farber Cancer Institute and the Ontario Institute
for Cancer Research80. The CCC will be a platform that
allows cancer researchers to search for patient omics
data across multiple sites and perform analyses while
keeping identifying information about patients secure81.
A federated cloud approach is also being explored
by the Global Alliance for Genomics and Health
(GA4GH)82. GA4GH was formed in 2013 to facilitate
legal and responsible sharing of genomic and clinical
data across the globe. A federated cloud is a natural
fit for sharing across borders: data may be housed in
the originating jurisdiction, where control is maintained, while authorized access is available outside the
jurisdiction through a common interface.
Many collaborations already use the cloud to consolidate project data. ENCODE uses the DNAnexus
platform for cloud-based analysis and data sharing, and
DNAnexus in turn uses the infrastructure of AWS83.
modENCODE and ICGC host their data sets in the
cloud through AWS84,85.
Genomics software development and boosting power.
While the cloud can be viewed as an alternative way of
running existing genomics software, it has also spurred
new thinking on how to design software for large data
sets.

Figure 4 | Models for distributed collaboration. Each site (blue rectangle) has some computational resources and also generates a portion of the data (red puzzle pieces). Analyses that require the full data sets are to be performed at multiple sites, requiring each of these sites to gather all portions of the data (part a). As more sites join the analysis, more copies must be made. Alternatively, sites can consolidate their data in a cloud-based data centre, where all analyses are performed (part b). Additionally, multiple sites can organize themselves into a federated cloud, where each analysis of the full data set is automatically coordinated to minimize data transfer (part c). Where possible, the computers located where the data are generated are used to analyse that subset.

A typical approach is to take software originally
designed to analyse a single sample on a single computer
and then attempt to scale it by launching many simultaneous copies, each analysing a distinct sample. But the
rise of cloud computing has driven advances in certain
programming frameworks, MapReduce in particular 86,
that make it easier to scale software to the kinds of large
clusters available from cloud providers. In return, the
programmer must adhere to certain permitted programming patterns. A prime advantage of MapReduce is that
it allows programs to scalably aggregate and sort data
in-between computational steps. Importantly for genomics research, this makes it easier to analyse many samples
at once, conferring greater power to find subtle effects62.
MapReduce and similar frameworks have been used in
variant calling 87,88, RNA-seq analysis59,62,63 and chromatin
immunoprecipitation followed by sequencing (ChIP–
seq) analysis89. Notably, the popular Genome Analysis
Toolkit (GATK) variant caller 90 uses MapReduce to
parallelize population-scale variant calling.
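To make the pattern concrete, the toy sketch below mimics the map, shuffle/sort and reduce phases in plain Python for a made-up task (summing splice-junction counts across samples); a real framework would distribute exactly these steps across a cluster rather than run them in one process.

```python
# Toy, in-memory illustration of the MapReduce pattern (not from this article):
# "map" emits key-value pairs per record, the framework groups records by key
# (the shuffle/sort), and "reduce" aggregates each group.
from collections import defaultdict

def map_step(sample_id, junction_counts):
    """Emit (splice_junction, count) pairs for one RNA-seq sample."""
    for junction, count in junction_counts.items():
        yield junction, count

def reduce_step(junction, counts):
    """Sum one junction's counts across all samples."""
    return junction, sum(counts)

# Two hypothetical samples with per-junction read counts.
samples = {
    "sampleA": {"chr1:100-200": 7, "chr1:300-400": 2},
    "sampleB": {"chr1:100-200": 5, "chr2:50-90": 1},
}

# Shuffle/sort: group all emitted values by key.
grouped = defaultdict(list)
for sample_id, data in samples.items():
    for key, value in map_step(sample_id, data):
        grouped[key].append(value)

results = [reduce_step(k, v) for k, v in sorted(grouped.items())]
print(results)  # [('chr1:100-200', 12), ('chr1:300-400', 2), ('chr2:50-90', 1)]
```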
One way that authors of genomics tools have been coping with the variety of cloud resources is by standardizing
the design of cloud-enabled software workflows and the
engines they execute on. The GA4GH–DREAM Workflow
Execution Challenge91, for example, is a competition where
participants submit workflow execution engines that are
then tested and compared on the basis of their ability to
run workflows described in standard languages such as
CWL67. We expect this trend to continue, eventually making it much easier for users to run large-scale analyses not
just on a single cloud or cluster but across many.
By making it easier to leverage public data, cloud
computing encourages another dimension of ‘strength
borrowing’. That is, researchers can use public data to
boost the power available to analyse a locally generated
data set, a paradigm that already prevails in microarray data analysis8,9,12,92. We expect this trend to continue and evolve, even to the point that new sequencing
data analyses are performed in the cloud and with the
benefit of being able to ‘see’ across many past studies
with important variables in common. RNASeq-er 93 and
the University of California Santa Cruz (UCSC) Xena
Functional Genomics Browser 94 show how this is already
possible for RNA-seq data. For example, RNASeq‑er
enables researchers who have submitted unpublished
sequencing data to the ArrayExpress archive95 to automatically analyse that data, free of charge, using computational resources at the EMBL–EBI in the context
of other public data in the archive. When the study is
published and the data become public, the summarized
results can be joined with those of the other published,
archived studies. Thus, the researcher benefits before
publication and the community benefits after.
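As an illustration of this kind of programmatic access, the snippet below queries a summarized-data web service over HTTP. The base URL is the one listed for RNASeq-er in Table 5, but the endpoint path and study accession are assumptions made for the example; consult the service's documentation for the actual routes.

```python
# Minimal sketch of programmatic access to a summarized-data web service.
# Base URL from Table 5; the endpoint path and accession below are assumptions.
import requests

BASE = "https://www.ebi.ac.uk/fg/rnaseq/api"
study = "SRP033494"  # hypothetical SRA study accession

resp = requests.get(f"{BASE}/json/getRunsByStudy/{study}", timeout=30)
resp.raise_for_status()
for run in resp.json():
    # Each record is expected to describe processed outputs for one sequencing run.
    print(run)
```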
Impact on funding
Academic clouds have the advantage of being cost-free
to applicants working on relevant projects, but commercial clouds cost money. Depending on usage patterns
and financial incentives, commercial clouds might cost
more to use than a local cluster. Building and maintaining a local cluster is work-intensive, but initial costs are
amortized over the cluster’s lifetime. The marginal cost
of undertaking one more computation is low; once hard
disks have been purchased and set up, storing an additional GB of data is practically free. For laboratories that
already own local infrastructure, this makes it harder to
justify paying anew for cloud resources. Furthermore,
institutions often subsidize indirect costs incurred by
local clusters, such as electricity or space, making local
resources more cost-effective.
That said, commercial clouds can also provide
excess computational resources — variously called
‘spot’, ‘pre-emptible’ or ‘low-priority’ instances —
at a discount that can be substantial. Discounts of
70–80% are often possible96, and more complex adaptive schemes have achieved discounts of more than
90%97. In one example, the discounted resources were
sufficient for one team to obtain over 1 million excess
processors at once from AWS98. Commercial clouds
also allow a degree of cost amortization through sustained use discounts, through use of partially prepaid
‘dedicated’ instances or through resellers or enterprise
agreements.
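A back-of-the-envelope calculation shows why these discounts matter. The hourly rate and instance-hours below are invented for illustration and are not quoted from any provider.

```python
# Illustrative cost comparison for spot/pre-emptible discounts (made-up numbers).
on_demand_rate = 0.68     # hypothetical $/hour for one instance
instance_hours = 10_000   # hypothetical total compute for a project

for discount in (0.0, 0.70, 0.80, 0.90):
    cost = on_demand_rate * instance_hours * (1 - discount)
    print(f"discount {discount:>4.0%}: ${cost:,.2f}")
# A 70% discount turns a $6,800.00 job into a $2,040.00 job; at 90% it costs $680.00.
```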
Funders have traditionally viewed computing as a
one-time expense yielding benefits that can be spread
over future projects. The cloud model is radically different: all costs are on‑going, and little to nothing is subsidized or amortized. But funders are beginning to make
adjustments. For example, the NIH is funding cloud
computing activities through its Commons Credit Pilot99
as is the NSF through its BIGDATA programme100. Both
programmes allow investigators to apply for cloud credits which are redeemable with major cloud providers. For
example, although smaller than the commercial clouds,
the NSF-funded Jetstream can handle cloud workloads
free of charge through XSEDE programme grants75.
Funders can further control costs for grantees by
working to keep key genomic data sets stored in the
cloud. The NIH and others are beginning to do this, as
evidenced by the National Institute of Mental Health
(NIMH) Data Archive101, the NCI CGC Pilots22–24 and
the Cancer Genome Collaboratory 56. Keeping raw data
in clouds at the outset greatly reduces the effort and cost
of moving these data to the cloud computers doing the
work. Cloud providers are aware of this advantage, with
AWS and Google advertising availability of data from
several large projects, including the 1000 Genomes
Project 102 on their cloud storage services.
Application programming interfaces (APIs). Formal specifications of the ways in which a user or program can interface with a system, for example, a cloud.
Privacy, security and regulation
Much human-derived sequencing data, including most
of the SRA, is restricted-access. Such ‘dbGaP-protected’
data50 have been de‑identified, and potential users must
be approved by a data access committee and conduct
their analysis in a way that maintains donor anonymity.
dbGaP and the related European Genome–Phenome
Archive (EGA)103 serve as portals for protected data sets,
providing a way to apply for access and an interface for
downloading the encrypted data once approved.
The NIH outlines minimum security standards to
be followed when analysing dbGaP-protected data104.
These include encryption at rest (protected data
stored on disks must be encrypted to foil potential
eavesdroppers), encryption in flight (when protected
data are transferred (for example, from the SRA to
the cloud) they must be encrypted), implementation
of strong password policies for users granted access
and physical security (for example, locked doors)
in the facility hosting the actual computers used for
analysis. Importantly, these constitute a minimum
standard. When applying to a data access committee,
an institutional signing official with the authority to
enter legal agreements on the institution’s behalf must
vouch that the application has been reviewed and that
relevant policies and legalities are adhered to. Of note,
some institutions may enforce stronger policies than
those outlined by the NIH.
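To make "encryption at rest" concrete, the sketch below encrypts a file with a symmetric key using the Python cryptography package. This is only an illustration of the idea, not the NIH-mandated mechanism; in practice, provider-managed disk and object-storage encryption is usually used instead, and file names here are placeholders.

```python
# Illustrative sketch of encryption at rest with a symmetric key (Fernet).
# Not a statement of what the NIH standard requires; file names are placeholders.
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # in practice, store the key separately from the data
fernet = Fernet(key)

with open("reads.fastq", "rb") as fh:        # hypothetical protected input
    ciphertext = fernet.encrypt(fh.read())

with open("reads.fastq.enc", "wb") as fh:    # only this encrypted copy is kept at rest
    fh.write(ciphertext)

# An authorized analysis later recovers the plaintext in memory.
plaintext = Fernet(key).decrypt(ciphertext)
```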
While the NIH initially disallowed analysis of protected data on commercial clouds, this policy was reversed
in 2015 (REF. 105). Since then, researchers have proposed
ways to implement the minimum security measures using
features of commercial clouds57. That said, users must still
work carefully with their institutional signing officials to
ensure such architectures are sufficient.
In Europe, data privacy laws vary widely between
jurisdictions. Some laws require that data be stored only
on servers located physically within the country of origin, a principle called ‘data sovereignty’. This requirement
has fuelled the development of clouds, usually based on
OpenStack, with the physical footprints required for data
sovereignty, for example, the Open Telekom cloud106
and the proposed European Open Science Cloud34. For
projects such as PCAWG, for which large computations
are performed over cross-jurisdictional data, concerns
over privacy and differing legislations have spurred the
development of new systems that distribute and manage
computational loads across many clouds and computing
centres, each built with data sovereignty in mind36,37.
Issues of privacy and data sovereignty go deeper than
we explore here, and we refer the reader to other reviews
on this topic107–109. As scientific clouds continue to gain
a foothold and as policies governing access to human-derived data continue to evolve, we expect readers will
increasingly encounter secure application programming
interfaces (APIs), such as the Seven Bridges API110, and the
related ideas of authentication, authorization and identity
federation (for overviews of these concepts, see REFS 111,112).
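To give a flavour of what such a secure API looks like from the user's side, the sketch below sends a token-authenticated request with Python's requests library. The base URL and header name follow the Seven Bridges public documentation as we understand it and should be treated as assumptions; the response handling is likewise illustrative.

```python
# Minimal sketch of token-based authentication against a secure genomics API.
# Base URL and header name are assumptions based on Seven Bridges' public docs;
# the token is issued by the platform after the user logs in and is authorized.
import os
import requests

API = "https://api.sbgenomics.com/v2"
headers = {"X-SBG-Auth-Token": os.environ["SB_AUTH_TOKEN"]}  # never hard-code tokens

resp = requests.get(f"{API}/projects", headers=headers, timeout=30)
resp.raise_for_status()
for project in resp.json().get("items", []):
    print(project["id"], project["name"])
```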
Conclusions and perspectives
Cloud computing is changing how large computational
resources are organized and procured. It is also changing
how scientists in genomics and other fields collaborate
and deal with vast archived data sets. We expect the role
of cloud computing to grow: first, as a venue for large
international collaborations; second, as a workhorse for
indexing and summarizing valuable archived sequencing
data; and third, as a storage medium for archived data.
As more archived data are housed in the cloud, the
advantage of using cloud computers — which are physically proximate to the data and can therefore access
them rapidly — also increases. As funders grow more
cognizant of the cloud’s capabilities, we expect that
it will become increasingly easy to obtain funds in a
manner that matches the rental model of the cloud.
Going forward, it will be important for biological
researchers to understand the cloud and the new modes
of analysis and collaboration it enables. This requires
changes in how bioinformaticians — especially those
who seek to deploy software and resources in the cloud
— are trained. Computer scientists increasingly view
cloud computing as a full-fledged classroom subject, but
teaching it is not straightforward. It has been argued that
this is in part because the educator cannot simply teach
programming; to teach cloud computing requires us “to
venture into operations, [operating systems], networking,
and other applied areas”113. The variety of resources (public and private clouds) and paradigms (IaaS, PaaS and
SaaS) means that part of the learner’s challenge is simply to understand how a problem maps onto the various
offerings available, which are still evolving. Fortunately, there are on‑ramps that allow learners to familiarize themselves with cloud concepts; a good start would be to learn how to use Docker and Singularity, Vagrant and the Galaxy cloud version. To get started with a free science cloud, learners can apply for access to Jetstream33.

Cloud computing is not a panacea, nor is it the only way to address the crucial issues that arise as scientists increasingly use large-scale genomics data to improve our understanding of biology and disease. But its relevance to many crucial problems in genomics today — scale, reproducibility, reusability and privacy — along with its recent successes in the field make it a crucial technology for computer scientists, bioinformaticians and life scientists to study and leverage going forward.

References
1. Goodwin, S., McPherson, J. D. & McCombie, W. R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333–351 (2016).
2. Stephens, Z. D. et al. Big data: astronomical or genomical? PLOS Biol. 13, e1002195 (2015).
This perspective puts the genomic data deluge in context with other sciences and shows how growth of archived genomics data is tracking improvements in technology.
3. Kodama, Y. et al. The sequence read archive: explosive growth of sequencing data. Nucleic Acids Res. 40, D54–D56 (2012).
4. Leinonen, R. et al. The sequence read archive. Nucleic Acids Res. 39, D19–D21 (2010).
5. Toribio, A. L. et al. European Nucleotide Archive in 2016. Nucleic Acids Res. 45, D32–D36 (2017).
6. Denk, F. Don't let useful data go to waste. Nature 543, 7 (2017).
7. Kuo, W. P., Jenssen, T.‑K., Butte, A. J., Ohno-Machado, L. & Kohane, I. S. Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics 18, 405–412 (2002).
8. Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11, 733–739 (2010).
9. McCall, M. N., Bolstad, B. M. & Irizarry, R. A. Frozen robust multiarray analysis (fRMA). Biostatistics 11, 242–253 (2010).
10. Rhodes, D. R. et al. Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc. Natl Acad. Sci. USA 101, 9309–9314 (2004).
11. Zeggini, E. et al. Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat. Genet. 40, 638–645 (2008).
12. Marchionni, L., Afsari, B., Geman, D. & Leek, J. T. A simple and reproducible breast cancer prognostic test. BMC Genomics 14, 336 (2013).
13. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
14. International Cancer Genome Consortium et al. International network of cancer genome projects. Nature 464, 993–998 (2010).
15. GTEx Consortium. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
16. Melé, M. et al. Human genomics. The human transcriptome across tissues and individuals. Science 348, 660–665 (2015).
17. Trans-Omics for Precision Medicine (TOPMed) Program. National Heart, Lung, and Blood Institute https://www.nhlbi.nih.gov/science/trans-omics-precision-medicine-topmed-program (2017).
18. Collins, F. S. & Varmus, H. A new initiative on precision medicine. N. Engl. J. Med. 372, 793–795 (2015).
19. Gaziano, J. M. et al. Million Veteran Program: a mega-biobank to study genetic influences on health and disease. J. Clin. Epidemiol. 70, 214–223 (2016).
20. Foster, I. G. & Dennis, B. Cloud Computing for Science and Engineering (MIT Press, 2017).
This book describes the public and private cloud offerings available and how to use APIs for both commercial and OpenStack clouds to automate cloud tasks. It also describes Globus Auth and other important ideas related to identity federation, authentication and authorization.
21. International Cancer Genome Consortium. PCAWG Data Portal and Visualizations. ICGC http://docs.icgc.org/pcawg/ (2017).
22. Birger, C. et al. FireCloud, a scalable cloud-based platform for collaborative genome analysis: strategies for reducing and controlling costs. bioRxiv http://dx.doi.org/10.1101/209494 (2017).
23. Lau, J. W. et al. The Cancer Genomics Cloud: collaborative, reproducible, and democratized − a new paradigm in large-scale computational research. Cancer Res. 77, e3–e6 (2017).
24. Reynolds, S. M. et al. The ISB Cancer Genomics Cloud: a flexible cloud-based platform for cancer genomics research. Cancer Res. 77, e7–e10 (2017).
25. Celniker, S. E. et al. Unlocking the secrets of the genome. Nature 459, 927–930 (2009).
26. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
27. Mell, P. M. & Grance, T. SP 800–145. The NIST definition of cloud computing. National Institute of Standards and Technology http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf (2011).
28. Wingfield, N., Streitfeld, D. & Lohr, S. Cloud produces sunny earnings at Amazon, Microsoft and Alphabet. New York Times https://www.nytimes.com/2017/04/27/technology/quarterly-earnings-cloud-computing-amazon-microsoft-alphabet.html (27 April 2017).
29. Mathews, L. Just how big is Amazon's AWS business? (hint: it's absolutely massive). Geek.com https://www.geek.com/chips/just-how-big-is-amazons-aws-business-hint-its-absolutely-massive-1610221/ (2014).
30. Sefraoui, O., Aissaoui, M. & Eleuldj, M. OpenStack: toward an open-source solution for cloud computing. Int. J. Comput. Appl. Technol. 55, 38–42 (2012).
31. Moreno-Vozmediano, R., Montero, R. S. & Llorente, I. M. IaaS cloud architecture: from virtualized datacenters to federated cloud infrastructures. Computer 45, 65–72 (2012).
32. Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48, 1284–1287 (2016).
33. Stewart, C. A. et al. in Proc. 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure https://dl.acm.org/citation.cfm?id=2792745 (2015).
34. European Open Science Cloud [Editorial]. Nat. Genet. 48, 821 (2016).
35. Madduri, R. K. et al. Experiences building Globus Genomics: a next-generation sequencing analysis service using Galaxy, Globus, and Amazon web services. Concurr. Comput. 26, 2266–2279 (2014).
36. Yakneen, S., Waszak, S., Gertz, M. & Korbel, J. O. Enabling rapid cloud-based analysis of thousands of human genomes via Butler. bioRxiv http://dx.doi.org/10.1101/185736 (2017).
37. Yung, C. K. et al. Large-scale uniform analysis of cancer whole genomes in multiple computing environments. bioRxiv http://dx.doi.org/10.1101/161638 (2017).
38. Baggerly, K. A. & Coombes, K. R. Deriving chemosensitivity from cell lines: forensic bioinformatics and reproducible research in high-throughput biology. Ann. Appl. Statist. 3, 1309–1334 (2009).
39. Dai, M. et al. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res. 33, e175 (2005).
40. Ioannidis, J. P. et al. Repeatability of published microarray gene expression analyses. Nat. Genet. 41, 149–155 (2009).
41. Nekrutenko, A. & Taylor, J. Next-generation sequencing data interpretation: enhancing reproducibility and accessibility. Nat. Rev. Genet. 13, 667–672 (2012).
42. Piccolo, S. R. & Frampton, M. B. Tools and techniques for computational reproducibility. Gigascience 5, 30 (2016).
43. Angiuoli, S. V. et al. CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics 12, 356 (2011).
44. Krampis, K. et al. Cloud BioLinux: pre-configured and on‑demand bioinformatics computing for the genomics community. BMC Bioinformatics 13, 42 (2012).
45. Merkel, D. Docker: lightweight Linux containers for consistent development and deployment. Linux J. 2014, 2 (2014).
46. Kurtzer, G. M., Sochat, V. & Bauer, M. W. Singularity: scientific containers for mobility of compute. PLOS One 12, e0177459 (2017).
47. The Clinical Cancer Genome Task Team of the Global Alliance for Genomics and Health. Sharing clinical and genomic data on cancer − the need for global solutions. N. Engl. J. Med. 376, 2006–2009 (2017).
48. Bonazzi, V. R. & Bourne, P. E. Should biomedical research be like Airbnb? PLOS Biol. 15, e2001818 (2017).
The authors of this paper describe the NIH Data Commons and suggest cloud computing as a means for making large-scale genomics data sets available and associated analyses reproducible.
49. Bourne, P. E., Lorsch, J. R. & Green, E. D. Perspective: sustaining the big-data ecosystem. Nature 527, S16–S17 (2015).
50. Tryka, K. A. et al. NCBI's database of genotypes and phenotypes: dbGaP. Nucleic Acids Res. 42, D975–D979 (2014).
51. Iyer, M. K. et al. The landscape of long noncoding RNAs in the human transcriptome. Nat. Genet. 47, 199–208 (2015).
52. Brown, J. B. et al. Diversity and dynamics of the Drosophila transcriptome. Nature 512, 393–399 (2014).
53. Graveley, B. The developmental transcriptome of Drosophila melanogaster. Genome Biol. 11, I11 (2010).
54. Gutzwiller, F. et al. Dynamics of Wolbachia pipientis gene expression across the Drosophila melanogaster life cycle. G3 5, 2843–2856 (2015).
55. Bernstein, M. N., Doan, A. & Dewey, C. N. MetaSRA: normalized human sample-specific metadata for the sequence read archive. Bioinformatics 33, 2914–2923 (2017).
56. Yung, C. K. et al. The Cancer Genome Collaboratory [abstract]. Cancer Res. 77, 378 (2017).
57. Nellore, A. et al. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the sequence read archive. Genome Biol. 17, 266 (2016).
58. Frazee, A. C., Langmead, B. & Leek, J. T. ReCount: a
multi-experiment resource of analysis-ready RNA-seq
gene count datasets. BMC Bioinformatics 12, 449
(2011).
59. Langmead, B., Hansen, K. D. & Leek, J. T. Cloud-scale
RNA-sequencing differential expression analysis with
Myrna. Genome Biol. 11, R83 (2010).
60. Nellore, A., Wilks, C., Hansen, K. D., Leek, J. T. &
Langmead, B. Rail-dbGaP: analyzing dbGaP-protected
data in the cloud with Amazon Elastic MapReduce.
Bioinformatics 32, 2551–2553 (2016).
This work reports the use of cloud computing and
MapReduce software to study tens of thousands of
human RNA sequencing data sets, showing that
many splice junctions that are well represented in
public data are not present in popular gene
annotations.
61. Collado-Torres, L. et al. Reproducible RNA-seq analysis
using recount2. Nat. Biotechnol. 35, 319–321
(2017).
62. Nellore, A. et al. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics 33,
4003–4040 (2017).
63. Vivian, J. et al. Toil enables reproducible, open source,
big biomedical data analyses. Nat. Biotechnol. 35,
314–316 (2017).
64. Dobin, A. et al. STAR: ultrafast universal RNA-seq
aligner. Bioinformatics 29, 15–21 (2013).
65. Li, B. & Dewey, C. N. RSEM: accurate transcript
quantification from RNA-Seq data with or without a
reference genome. BMC Bioinformatics 12, 323 (2011).
66. Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L.
Near-optimal probabilistic RNA-seq quantification.
Nat. Biotech. 34, 525–527 (2016).
67. Amstutz, P. et al. Common workflow language, v1.0.
Figshare https://doi.org/10.6084/m9.
figshare.3115156.v2 (2016).
68. Tatlow, P. J. & Piccolo, S. R. A cloud-based workflow to
quantify transcript-expression levels in public cancer
compendia. Sci. Rep. 6, 39259 (2016).
This study shows how cloud computing can be used
to reanalyse over 12,000 human cancer RNA
sequencing data sets for as little as US$0.09 per
sample.
69. Foster, I. & Kesselman, C. The Grid 2: Blueprint for a New Computing Infrastructure (Morgan Kaufmann, 2003).
70. Drew, K. et al. The Proteome Folding Project:
proteome-scale prediction of structure and function.
Genome Res. 21, 1981–1994 (2011).
71. Rahman, M. et al. Alternative preprocessing of RNA-Sequencing data in The Cancer Genome Atlas leads to
improved analysis results. Bioinformatics 31,
3666–3672 (2015).
72. Stein, L. D. The case for cloud computing in genome
informatics. Genome Biol. 11, 207 (2010).
73. Bais, P., Namburi, S., Gatti, D. M., Zhang, X. &
Chuang, J. H. CloudNeo: a cloud pipeline for
identifying patient-specific tumor neoantigens.
Bioinformatics 33, 3110–3112 (2017).
74. Afgan, E. et al. The Galaxy platform for accessible,
reproducible and collaborative biomedical analyses:
2016 update. Nucleic Acids Res. 44, W3–W10
(2016).
75. Towns, J. et al. XSEDE: accelerating scientific
discovery. Comput. Sci. Eng. 16, 62–74 (2014).
76. Galaxy Community Hub. Publicly accessible Galaxy
servers. Galaxy Project https://galaxyproject.org/
public-galaxy-servers/ (2017).
77. Afgan, E. et al. Galaxy CloudMan: delivering cloud
compute clusters. BMC Bioinformatics 11 (Suppl. 12),
S4 (2010).
78. Liu, B. et al. Cloud-based bioinformatics workflow
platform for large-scale next-generation sequencing
analyses. J. Biomed. Inform. 49, 119–133 (2014).
79. Foster, I. Globus Online: accelerating and
democratizing science through cloud-based services.
IEEE Internet Comput. 15, 70–73 (2011).
80. Dana-Farber Cancer Institute. Dana-Farber Cancer
Institute and Ontario Institute for Cancer Research
join Collaborative Cancer Cloud http://www.dana-farber.org/newsroom/news-releases/2016/dana-farber-cancer-institute-and-ontario-institute-for-cancer-research-join-collaborative-cancer-cloud/
(2016).
81. Hawkins, T. The Collaborative Cancer Cloud: Intel and
OHSU team up for cancer research. siliconANGLE
http://siliconangle.com/blog/2016/12/16/
collaborative-cancer-cloud-intel-ohsu-team-cancer-research-thecube/ (2016).
82. Global Alliance for Genomics and Health. A federated
ecosystem for sharing genomic, clinical data. Science
352, 1278–1280 (2016).
83. Amazon Web Services. AWS case study: DNAnexus.
Amazon https://aws.amazon.com/solutions/case-studies/dnanexus/ (2017).
84. ICGC Data Coordination Center. About cloud partners.
ICGC http://docs.icgc.org/cloud/about/ (2017).
85. modENCODE Project. modENCODE on the EC2 cloud.
modENCODE http://data.modencode.org/modencode-cloud.html (2017).
86. Dean, J. & Ghemawat, S. MapReduce. Commun. ACM
51, 107 (2008).
87. Kelly, B. J. et al. Churchill: an ultra-fast, deterministic,
highly scalable and balanced parallelization strategy for
the discovery of human genetic variation in clinical and
population-scale genomics. Genome Biol. 16, 6 (2015).
88. Langmead, B., Schatz, M. C., Lin, J., Pop, M. &
Salzberg, S. L. Searching for SNPs with cloud
computing. Genome Biol. 10, R134 (2009).
89. Feng, X., Grossman, R. & Stein, L. PeakRanger: a
cloud-enabled peak caller for ChIP-seq data. BMC
Bioinformatics 12, 139 (2011).
90. McKenna, A. et al. The Genome Analysis Toolkit: a
MapReduce framework for analyzing next-generation
DNA sequencing data. Genome Res. 20, 1297–1303
(2010).
91. GA4GH‑DREAM. GA4GH‑DREAM Workflow Execution
Challenge. Synapse https://www.synapse.org/
WorkflowChallenge (2017).
92. Franke, A. et al. Genome-wide meta-analysis increases
to 71 the number of confirmed Crohn’s disease
susceptibility loci. Nat. Genet. 42, 1118–1125 (2010).
93. Petryszak, R. et al. The RNASeq‑er API—a gateway to
systematically updated analysis of public RNA-seq
data. Bioinformatics 33, 2218–2220 (2017).
94. Goldman, M., Craft, B., Zhu, J. & Haussler, D. The
UCSC Xena system for cancer genomics data
visualization and interpretation [Abstr. 2584]. Cancer
Res. 77, 2584 (2017).
95. Kolesnikov, N. et al. ArrayExpress update—simplifying
data submissions. Nucleic Acids Res. 43,
D1113–D1116 (2015).
96. Google Compute Engine. Google Compute Engine
pricing. Google Cloud Platform https://cloud.google.
com/compute/pricing (2017).
97. Chard, R. et al. in 2015 IEEE 11th International
Conference on e‑Science, 136–144 (IEEE, 2015).
98. Barr, J. Natural Language Processing at Clemson
University – 1.1 Million vCPUs & EC2 Spot Instances.
Amazon https://aws.amazon.com/blogs/aws/natural-language-processing-at-clemson-university-1-1-million-vcpus-ec2-spot-instances/ (2017).
99. NIH Commons. Commons Credits Pilot Portal.
Commons Credits Pilot Portal https://www.commons-credit-portal.org/ (2017).
100. National Science Foundation. Amazon Web Services,
Google Cloud, and Microsoft Azure join NSF’s Big
Data Program. National Science Foundation https://
www.nsf.gov/news/news_summ.jsp?cntn_
id=190830&WT.mc_ev=click (2017).
101. National Institute of Mental Health. Welcome to the
NIMH Data Archive. NDA https://data-archive.nimh.
nih.gov/ (2017).
102. Genomes Project Consortium. A global reference for
human genetic variation. Nature 526, 68–74 (2015).
103. Lappalainen, I. et al. The European Genome-Phenome
Archive of human data consented for biomedical
research. Nat. Genet. 47, 692–695 (2015).
104. National Institutes of Health. NIH security best
practices for controlled-access data subject to the NIH
genomic data sharing (GDS) policy. NIH Office of
Science Policy https://osp.od.nih.gov/wp-content/
uploads/NIH_Best_Practices_for_Controlled-Access_
Data_Subject_to_the_NIH_GDS_Policy.pdf (2015).
105. Stein, L. D., Knoppers, B. M., Campbell, P., Getz, G. &
Korbel, J. O. Data analysis: Create a cloud commons.
Nature 523, 149–151 (2015).
In this paper, the authors argue for the use of
cloud computing in large consortia and describe
plans for its use in the ICGC.
106. Deutsche Telekom. Deutsche Telekom launches highly
secure public cloud based on Cisco platform. Deutsche
Telekom https://www.telekom.com/en/media/media-information/archive/deutsche-telekom-launches-highly-secure-public-cloud-based-on-cisco-platform-362100 (2015).
107. Datta, S., Bettinger, K. & Snyder, M. Secure cloud
computing for genomic data. Nat. Biotechnol. 34,
588–591 (2016).
108. Dove, E. S. et al. Genomic cloud computing: legal and
ethical points to consider. Eur. J. Hum. Genet. 23,
1271–1278 (2015).
109. Francis, L. P. Genomic knowledge sharing: a review of
the ethical and legal issues. Appl. Transl Genom. 3,
111–115 (2014).
110. Seven Bridges Genomics. API Overview. Seven Bridges
Genomics https://docs.sevenbridges.com/v1.0/docs/
the-api (2017).
111. Ananthakrishnan, R., Chard, K., Foster, I. & Tuecke, S.
Globus platform-as‑a‑service for collaborative science
applications. Concurrency Comput. Pract. Exp. 27,
290–305 (2015).
112. Chaterji, S. et al. Federation in genomics pipelines:
techniques and challenges. Brief Bioinform. http://dx.
doi.org/10.1093/bib/bbx102 (2017).
113. Campbell, S. Teaching cloud computing. Computer 49,
91–93 (2016).
114. Dudley, J. T. & Butte, A.J. In silico research in the era
of cloud computing. Nat. Biotech. 28, 1181–1185
(2010).
115. Barretina, J. et al. The Cancer Cell Line Encyclopedia
enables predictive modelling of anticancer drug
sensitivity. Nature 483, 603–607 (2012).
116. Cancer Genome Atlas Research Network et al. The
Cancer Genome Atlas Pan-Cancer analysis project.
Nat. Genet. 45, 1113–1120 (2013).
117. Heath, A. P. et al. Bionimbus: a cloud for managing,
analyzing and sharing large genomics datasets. J. Am.
Med. Inform. Assoc. 21, 969–975 (2014).
118. Di Tommaso, P. et al. Nextflow enables reproducible
computational workflows. Nat. Biotechnol. 35,
316–319 (2017).
119. Fisch, K. M. et al. Omics Pipe: a community-based
framework for reproducible multi-omics data
analysis. Bioinformatics 31, 1724–1728
(2015).
120. Allcock, W. et al. in Proceedings of the 2005 ACM/IEEE
conference on Supercomputing 54 (Seattle, 2005).
121. Petryszak, R. et al. Expression Atlas update — a
database of gene and transcript expression from
microarray- and sequencing-based functional
genomics experiments. Nucleic Acids Res. 42,
D926–D932 (2014).
Acknowledgements
The authors thank J. Taylor, E. Afgan, M. Schatz, J. Goecks
and A. Margolin for reading through a draft of this work and
providing helpful comments. B.L. was supported by the US
National Institutes of Health/National Institute of General
Medical Sciences grant 1R01GM118568.
Author contributions
The authors contributed equally to all aspects of this
manuscript.
Competing interests statement
The authors declare no competing interests.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
SUPPLEMENTARY INFORMATION
See online article: S1 (methods)
FURTHER INFORMATION
European Genome–Phenome Archive:
https://www.ebi.ac.uk/ega/home
European Nucleotide Archive: www.ebi.ac.uk/ena
Database of Genotypes and Phenotypes (dbGaP):
www.ncbi.nlm.nih.gov/gap
Sequence Read Archive: www.ncbi.nlm.nih.gov/sra
Vagrant: https://www.vagrantup.com/
Galaxy cloud: https://galaxyproject.org/cloud/
ONLINE ONLY
Figure permissions information
[FIG. 3: Figure adapted from REF. 114, Macmillan Publishers Limited.]
Key points
• Cloud computing is a paradigm whereby computational resources
such as computers, storage and bandwidth can be rented on a
pay-for-what-you-use basis.
• The cloud’s chief advantages are elasticity and convenience. Elasticity
refers to the ability to rent and pay for the exact resources needed,
and convenience refers to the fact that the user need not deal with the
disadvantages of owning or maintaining the resources.
• Archives of sequencing data are vast and rapidly growing. Cloud
computing is an important enabler for recent efforts to reanalyse
large cross-sections of archived sequencing data.
• The cloud is becoming a popular venue for hosting large international
collaborations, which benefit from the ability to hold data securely in
a single location and proximate to the computational infrastructure
that will be used to analyse it.
• Funders of genomics research are increasingly aware of the cloud and
its advantages and are beginning to allocate funds and create cloud-based resources accordingly.
• Cloud clusters can be configured with security measures needed
to adhere to privacy standards, such as those from the Database of
Genotypes and Phenotypes (dbGaP).
Subject categories
Biological sciences / Computational biology and bioinformatics [URI
/631/114]
Biological sciences / Genetics / Genomics [URI /631/208/212]
Biological sciences / Genetics / Sequencing / Next-generation sequencing [URI /631/208/514/2254]
Biological sciences / Computational biology and bioinformatics /
Databases / Genetic databases [URI /631/114/129/2043]
Scientific community and society / Scientific community / Research data
[URI /706/648/697]
ToC blurb
Cloud computing for genomic data analysis and collaboration
Ben Langmead and Abhinav Nellore
Next-generation sequencing technologies have fuelled
a rapid rise in data, which require vast computational
resources to store and analyse. This Review discusses
the role of cloud computing in genomics research to
facilitate data sharing and new analyses of archived
sequencing data, as well as large-scale international
collaborations.