applies to a personalized workflow,
where a geneticist might be interested
in patients with copy numbers similar
to the specific query individual.
Group inference without accurate individual inference. The ability to query
populations has an important benefit:
Individual genomes may have little
evidence for SNV calls at low-coverage
sequencing. However, if a large number of affected individuals in a population (such as 800 out of 1,000) all show
the same SNV while controls do not,
an inference tool can reliably predict
an association, even with unreliable
calls for individuals. While more work
is required to demonstrate the benefits of group inference, the point is
that GQL provides query support for
group inference.
Prototype implementation
We have developed a prototype implementation of a subset of GQL (see Figure 5). The uploaded genome (in BAM
format) can be queried through a simple text interface that allows the user
to write a GQL query. This query is
compiled and executed and the output
returned as a (smaller) BAM file that
can be visualized through appropriate
genome browsers (such as jbrowse,
http://jbrowse.org/) or downloaded to
the client for further analysis.
The implementation of “
genome-query” has a customized parser that
converts GQL to an intermediate representation. We included customized
procedures for each of the algebraic
operations, with some concessions
for efficiency, mainly with respect to
memory. Specifically, GQL uses interval trees to implement joins and customized indices (such as strength vectors) for efficient querying.
Challenges
We have made the case for a set of genomic layers, including an evidence
layer where evidence is retrieved
through GQL. Successful implementation of this vision depends on new
ideas from computer science:
Query power (database theory). Is
GQL sufficiently powerful to address
all evidence-layer queries needed in
practice? The goal is to have the evidence layer handle as much data-intensive computation as possible while
preserving performance; without the
the true power
of querying
genomes comes
from the ability to
query populations.
performance goal, any query can be
trivially satisfied by passing the entire
genome to the inference layer. Note
that GQL’s expressive power coincides
with that of first-order logic over the
schema of the three relations R, G, P,
a signature of aggregation functions,
and a group-by operator. However,
user feedback may require GQL developers to add extensions, and, in
implementing extensions, care must
be taken to balance expressive power
with efficient evaluation.
Query speed (database systems).
We designed a corresponding algebra
GQA as an internal representation for
optimization and for evaluating query
plans; for example, assume queries
on populations are automatically decomposed into queries on individuals. Consider queries of the general
form SELECT α FROM MAPJOIN R,
G WHERE b. Two steps are needed to
construct such a query:
1. Select for relations that satisfy
constraints b; and
2. Project (while removing duplicates) onto attributes α.
GQL uses a location-based index
LTOR, where LTOR(ℓ) is a pointer to
first read that maps to the point-inter-val l. For each individual, GQL keeps
a compressed index of the mapped
reads in memory. The index can be
used for select operations based on
specific locations (such as reads that
map to specific genes).
However, many queries involve
scanning the entire genome for maximal intervals; for example, find all
maximal regions where there is a disproportionate increase in the number
of mapped reads (high copy number).
For efficient implementation of these
queries, GQL constructs special indices that allow filtering for reads according to a user-defined constraint.
Define a strength-vector SΘ for a constraint Θ as a vector of length G (the
entire genome). Any location l ∈ G, SΘ
[ℓ] gives the strength of the evidence
at that location and can be pre-com-puted for common constraints Θ. To
reduce memory, GQL also chooses a
minimum strength cut-off and maintains CΘ,t as a sorted sequence of intervals i1,i2,… such that each ij is a
maximal interval satisfying SΘ[ℓ]≥t
for all ℓ∈ij. The compressed vectors
reduce memory and computation re-