Single-end mapped and split-reads.
When a read maps to the breakpoint
of the deletion on the donor, it cannot be mapped back to the reference
(Figure 2b, read “c”). In the case of a
“clean” deletion, the prefix and suffix
of the fragment can be mapped separately; such split-reads are indicative
of deletion events.
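The split-read idea can be illustrated with a minimal sketch. It assumes exact string matching against a tiny reference, which real mappers do not rely on; the function name `find_split`, the `min_anchor` parameter, and the toy sequences are hypothetical and chosen only for illustration.

```python
def find_split(read, reference, min_anchor=4):
    """Try to map a read as a prefix and a suffix separated by a gap.

    Returns (prefix_pos, suffix_pos, deleted_length), or None when the read
    maps end to end or no split placement is found. Exact matching only.
    """
    if reference.find(read) != -1:
        return None  # the whole read maps; no split is needed
    # Try every cut point that leaves at least min_anchor bases on each side.
    for cut in range(min_anchor, len(read) - min_anchor + 1):
        prefix, suffix = read[:cut], read[cut:]
        p = reference.find(prefix)
        if p == -1:
            continue
        # The suffix must map downstream of the end of the prefix.
        s = reference.find(suffix, p + cut)
        if s > p + cut:
            return p, s, s - (p + cut)
    return None


# Toy reference; the donor lacks reference[12:22] (a 10-base "clean" deletion).
reference = "AAGCTTGCATGCCTGCAGGTCGACTCTAGAGGATCC"
donor = reference[:12] + reference[22:]
read_c = donor[6:18]           # spans the breakpoint on the donor
read_ok = reference[3:15]      # lies entirely outside the deletion

split = find_split(read_c, reference)
```

Here `split` is `(6, 22, 10)`: the prefix maps at position 6, the suffix at position 22, and the 10 bases between them are missing from the donor. A read such as `read_ok` maps in one piece, so `find_split` returns `None` for it.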
Loss of heterozygosity. Consider the
SNV locations on the donor genome.
While sampling multiple polymorphic sites, a geneticist would expect a
mix of heterozygous and homozygous
sites. At a deletion, the single chromosome being sampled displays a loss of heterozygosity.
Even within the constraints of
these four categories, a number of design decisions must be made by software tools to account for repetitive sequences and to reconcile conflicting
evidence. Variant inference remains a
challenging research problem.
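As one illustration of the loss-of-heterozygosity signal described above, the sketch below scans genotype calls at SNV sites and reports long runs of homozygous sites. The input representation (position plus a genotype string such as "A/G"), the function name, and the `min_run` threshold are assumptions made for this example. A long homozygous run is only suggestive of a deletion, since such runs also arise for other reasons.

```python
def homozygous_runs(sites, min_run=5):
    """Report runs of consecutive homozygous SNV sites.

    sites: list of (position, genotype) sorted by position, where genotype
    is a string such as "A/G" (heterozygous) or "C/C" (homozygous).
    Returns a list of (first_position, last_position, site_count).
    """
    runs = []
    start, last, count = None, None, 0
    for pos, genotype in sites:
        allele_a, allele_b = genotype.split("/")
        if allele_a == allele_b:
            if count == 0:
                start = pos
            last = pos
            count += 1
        else:
            # A heterozygous site ends the current run.
            if count >= min_run:
                runs.append((start, last, count))
            count = 0
    if count >= min_run:
        runs.append((start, last, count))
    return runs


sites = [
    (100, "A/G"), (220, "C/C"), (310, "T/T"), (450, "G/G"),
    (520, "A/A"), (700, "C/T"), (810, "G/G"),
]
candidate_regions = homozygous_runs(sites, min_run=4)
```

With these toy calls, `candidate_regions` contains the single run from position 220 to 520, covering four homozygous sites.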
Layering for Genomics
Our vision is inspired by analogy with
systems and networks; for example,
the Internet has dealt with a variety of
new link technologies (from fiber to
wireless) and applications (from email
to social networks) via the “hourglass”
model using the key abstractions of
TCP and IP (see Figure 3a).
Similarly, we propose that genomic-processing software be layered into
an instrument layer, a compression
layer, an evidence layer, an inference
layer, and a variation layer that can
insulate genomic applications from
sequencing technology. Such modularity requires computer systems to
forgo efficiencies that can be gained
by leaking information across layers;
for example, biological inferences can
be sharpened by considering which
sequencing technology is used (such
as Illumina and Life Technologies),
but modularity is paramount.
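To make the layering concrete, here is a minimal sketch of how an application might depend only on narrow interfaces between layers. All class and method names are hypothetical and are not interfaces defined by the article; the "caller" is a deliberately naive stand-in for a real inference method.

```python
from dataclasses import dataclass
from typing import Iterable, List, Protocol


@dataclass(frozen=True)
class MappedRead:
    chrom: str
    start: int
    length: int


@dataclass(frozen=True)
class Variant:
    chrom: str
    start: int
    end: int
    kind: str


class EvidenceLayer(Protocol):
    def reads_overlapping(self, chrom: str, start: int, end: int) -> Iterable[MappedRead]: ...


class InferenceLayer(Protocol):
    def call(self, reads: Iterable[MappedRead], chrom: str, start: int, end: int) -> List[Variant]: ...


class InMemoryEvidence:
    """Toy evidence layer backed by a list of mapped reads."""

    def __init__(self, reads):
        self._reads = list(reads)

    def reads_overlapping(self, chrom, start, end):
        return [r for r in self._reads
                if r.chrom == chrom and r.start < end and r.start + r.length > start]


class CoverageGapCaller:
    """Toy inference layer: flags a region as deleted if no read overlaps it."""

    def call(self, reads, chrom, start, end):
        return [] if list(reads) else [Variant(chrom, start, end, "deletion")]


def variants_in(evidence: EvidenceLayer, inference: InferenceLayer, chrom, start, end):
    # The application sees only the two interfaces, never the instrument
    # or file format that produced the reads.
    return inference.call(evidence.reads_overlapping(chrom, start, end), chrom, start, end)


evidence = InMemoryEvidence([MappedRead("chr1", 0, 100), MappedRead("chr1", 500, 100)])
gap_calls = variants_in(evidence, CoverageGapCaller(), "chr1", 200, 400)
covered_calls = variants_in(evidence, CoverageGapCaller(), "chr1", 50, 150)
```

Swapping `InMemoryEvidence` for an implementation backed by a different sequencing technology or storage format would leave `variants_in` unchanged, which is the insulation the layering is meant to provide.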
Some initial interfaces are in vogue
among geneticists today. Many instruments now produce sequence data
in the “fastq” format. The output of
mapping reads is often represented
as “SAM/BAM” format, though other
compressed formats have been proposed.10 At a higher level, standards
(such as the Variant Call Format, or
VCF) are used to describe variants (see
We propose additional layering
between the mapping tools and applications. Specifically, our architecture
separates the collection of evidence
required to support a query (deterministic, large data movement, standardized)
from the inference (probabilistic, comparatively smaller data
movement, little agreement on techniques). While inference methods
vary considerably, the evidence for inferences is fairly standard. To gather
it in a flexible, efficient manner, we
propose a Genome Query Language
(GQL). Though we do not address it
here, a careful specification of a variation layer (see Figure 3a) is also important. While the data format of a variation is standardized using, say, VCF,
the interface functions are not.
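The separation between evidence and inference can be sketched as two functions. This is plain Python, not GQL syntax, which this section does not show; the `ReadPair` type, the span threshold, and the support-count rule are all assumptions made for illustration. The evidence step is a deterministic filter over many records, while the inference step applies a debatable rule to the much smaller result.

```python
from typing import List, NamedTuple, Optional, Tuple


class ReadPair(NamedTuple):
    chrom: str
    left: int    # leftmost mapped coordinate of the pair
    right: int   # rightmost mapped coordinate of the pair


def discordant_pairs(pairs: List[ReadPair], chrom: str, start: int, end: int,
                     max_span: int) -> List[ReadPair]:
    """Evidence step: select pairs in a region whose mapped span is too long."""
    return [p for p in pairs
            if p.chrom == chrom and p.left < end and p.right > start
            and p.right - p.left > max_span]


def call_deletion(evidence: List[ReadPair],
                  min_support: int = 3) -> Optional[Tuple[int, int]]:
    """Inference step (toy rule): with enough supporting pairs, report the
    interval that every supporting pair spans."""
    if len(evidence) < min_support:
        return None
    return max(p.left for p in evidence), min(p.right for p in evidence)


pairs = [
    ReadPair("chr1", 1000, 1400),   # normal span, not evidence
    ReadPair("chr1", 1900, 3100),
    ReadPair("chr1", 1950, 3150),
    ReadPair("chr1", 2000, 3200),
    ReadPair("chr2", 1900, 3100),   # other chromosome
]
evidence = discordant_pairs(pairs, "chr1", 1500, 3500, max_span=500)
deletion = call_deletion(evidence)
```

In this toy data, three pairs qualify as evidence and the rule reports the interval `(2000, 3100)`. A different caller could consume the same `evidence` list and reach a different conclusion without repeating the large scan.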
The case for an evidence layer. Genomes, each several hundred gigabytes long, are being produced at different locations around the world. To
realize the vision outlined in Figure 1,
individual laboratories must be able
to process them to reveal variations
and correlate them with medical outcomes/phenotypes at each place a
discovery study or personalized medicine assay is undertaken. The obvious
alternatives are not workable, as described in the following paragraphs:
Downloading raw data.
Transporting 100Gb for each of 1,000 genomes
across the network is infeasible today.
Compression can mitigate the problem (5x) but
cannot eliminate it.
Massive computational infrastructure
must be replicated at every study location for analysis.
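A back-of-the-envelope calculation shows the scale of the raw-data transfer. The sketch below reads "100Gb" as 100 gigabytes per genome and assumes a sustained 1 Gbit/s link; both are assumptions made here, not figures from the text beyond the per-genome size, the genome count, and the 5x compression factor.

```python
def transfer_hours(gb_per_genome, genomes, link_gbit_per_s, compression=1.0):
    """Hours needed to move a cohort over a link at a sustained rate.

    gb_per_genome is in gigabytes, link_gbit_per_s in gigabits per second;
    compression is the factor by which the data shrinks before transfer.
    """
    total_gbit = gb_per_genome * genomes * 8 / compression
    return total_gbit / link_gbit_per_s / 3600


# Assumed link speed: 1 Gbit/s sustained (hypothetical).
raw_hours = transfer_hours(100, 1000, 1.0)
compressed_hours = transfer_hours(100, 1000, 1.0, compression=5.0)
```

Under these assumptions the uncompressed transfer takes about 222 hours (more than nine days) and the compressed transfer about 44 hours, for a single study site and before any analysis begins.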
Downloading variation information.
Alternatively, the genomic repositories could run standard variant-calling
pipelines4 and produce much smaller
lists of variations in a standard format
(such as VCF). Unfortunately, variant
calling is an inexact science; researchers often want to use their own callers
and almost always want to see “evidence” for specific variants. Discovery
applications thus very likely need
raw genomic evidence. By contrast,
personalized genomics applications
might query only called variants and
a knowledgebase that correlates genotypes and phenotypes. However, even
medical personnel might occasionally
need to review the raw evidence for