quirements and can be recomputed
on demand.
EMRs (information retrieval). The
phenotypes associated with each sequenced individual are already in patient medical records. Initial results
from the eMERGE network indicate
that, for a limited set of diseases,
EMRs can be used for phenotype
characterization in genome-wide association studies within a reasonable
margin of error.9,13 We anticipate that
most health-care institutions will be
using EMRs by 2014, given incentives
provided by the Health Information
Technology for Economic and Clinical Health Act of 2009.17 Increasing
adherence to interoperability standards5 and advances in biomedical
natural language processing12 make
efficient querying possible. However, there is no integration of genotype and phenotype data today. GQL
should be useful for both interrogating a single genome and interrogating multiple genomes across groups
of individuals but will need to integrate with existing EMR systems so
phenotype data can be queried together with genomes.
Privacy (computer security). The
genome is the ultimate unique identifier. All privacy is lost once the public
has access to the genome of an individual, but current regulation, based
on the Health Information Portability and Accountability Act, is silent
about it.2,3,11 Though the Genetic Information Nondiscrimination Act
addresses accountability for the use
of genetic information,8 privacy laws
must change to ensure sensitive information is available only to the appropriate agents. Checking that a given
study satisfies a specific privacy definition requires formal reasoning about
the data manipulations that generated the disclosed data—impossible
without a declarative specification
(such as GQL) of such manipulations.
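To see why a declarative record of data manipulations helps, consider a deliberately minimal audit sketch (the field names and policy are invented for illustration): given the fields a recorded query discloses, check them against a set of fields deemed identifying.

```python
# Illustrative privacy check: when queries are recorded declaratively,
# an auditor can reason mechanically about what each disclosure exposes.
IDENTIFYING_FIELDS = {"name", "genome", "date_of_birth", "zip_code"}

def violates_policy(disclosed_fields):
    """Return the identifying fields a disclosure would expose, if any."""
    return set(disclosed_fields) & IDENTIFYING_FIELDS

# A study publishing only aggregate allele counts passes...
print(violates_policy(["locus", "allele_count"]))  # set()
# ...while one that attaches raw genomes does not.
print(violates_policy(["locus", "allele_count", "genome"]))
```

With only ad hoc scripts, the equivalent audit would require reverse-engineering arbitrary code.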
Provenance (software engineering).
GQL is an ideal way to record the provenance of genomic study conclusions.
Current scripts (such as GATK) often
consist of code too ad hoc for human
readability and span various programming
languages too low level for
automatic analysis. By contrast,
publishing the set of declarative GQL
queries along with their results would
significantly enhance the clarity and
reproducibility of a study’s claims.
Provenance queries also enable
scientists to reuse the data of previously published computing-intensive
studies. Rather than run their costly
queries directly on the original input
databases, these scientists would prefer to launch an automatic search for
previously published studies in which
provenance queries correspond to
(parts of) the computation needed by
their own queries. The results of provenance queries can be directly imported and used as partial results of a new
study’s queries, skipping recomputation. This scenario corresponds in relational database practice to rewriting
queries using views.
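A toy sketch of this reuse pattern, assuming (hypothetically) that published studies register their provenance queries as named views mapping a normalized query string to its precomputed result:

```python
# Hypothetical registry of published studies' provenance queries
# and their already-computed results. Query strings, gene names,
# and coordinates are invented for illustration.
published_views = {
    "variants(chr=7, gene=CFTR)": [("chr7", 117559590, "G")],
}

def recompute(query):
    # Stand-in for an expensive scan over the original raw data.
    return []

def answer(query):
    """Reuse a published view when the provenance query matches;
    otherwise fall back to costly recomputation."""
    if query in published_views:
        return published_views[query], "reused"
    return recompute(query), "recomputed"

result, how = answer("variants(chr=7, gene=CFTR)")
print(how)  # reused
```

Real view-based rewriting would match on query structure, not string equality, but the payoff is the same: prior results substitute for fresh computation.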
Scaling (probabilistic inference).
Learning the correlation between diseases and variations can be tackled
differently if there are a large number
of genomes. It may be less critical to
accurately evaluate individual variations for such a discovery problem, as
erroneous variations are unlikely to
occur over a large group of randomly
selected individuals. More generally,
do other inference techniques leverage the presence of data at scale? As
an example, Google leverages the big-data collections it has to find common
misspellings. Note that accurately
screening individual variations is still
needed for personalized medicine.
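The intuition that random call errors rarely recur across many individuals can be sketched as a simple recurrence filter (the data and threshold are invented for illustration):

```python
from collections import Counter

# Variant calls per individual. The call at position 500 appears in
# only one genome and is plausibly a sequencing error, while the
# call at position 100 recurs across the group.
calls = {
    "p1": [("chr1", 100, "A"), ("chr1", 500, "C")],
    "p2": [("chr1", 100, "A")],
    "p3": [("chr1", 100, "A")],
}

def recurrent_variants(calls, min_count=2):
    """Keep only variants seen in at least min_count individuals."""
    counts = Counter(v for vs in calls.values() for v in set(vs))
    return {v for v, n in counts.items() if n >= min_count}

print(recurrent_variants(calls))  # {('chr1', 100, 'A')}
```

At cohort scale, such population-level evidence can substitute for expensive per-genome accuracy, though not for the per-individual screening personalized medicine requires.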
Crowdsourcing (data mining).
Crowdsourcing might be able to address difficult challenges like cancer,14
but the query system must first have
mechanisms to allow a group to work
coherently on a problem. Imagine that
a group of talented high-school science students is looking for genetic
associations from cases and controls
for a disease. A potentially useful GQL
mechanism would be to select a random subset of cases and controls that
are nevertheless genetically matched
(arising from a single mixing population). Researchers could then query
for a random subset of 100 individuals
with a fraction of network bandwidth
while still providing similar statistical
power for detecting associations.
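One plausible realization of such a mechanism (the matching criterion here, a shared ancestry-cluster label, is a stand-in for a real population-structure test, and the cohort is synthetic):

```python
import random

# Hypothetical cohort: (id, is_case, ancestry_cluster) triples.
cohort = [
    ("p%d" % i, i % 2 == 0, "cluster_A" if i < 40 else "cluster_B")
    for i in range(60)
]

def matched_subset(cohort, cluster, n, seed=0):
    """Draw n cases and n controls at random from a single ancestry
    cluster, so the subsample stays genetically matched."""
    rng = random.Random(seed)
    cases = [p for p in cohort if p[1] and p[2] == cluster]
    controls = [p for p in cohort if not p[1] and p[2] == cluster]
    return rng.sample(cases, n), rng.sample(controls, n)

cases, controls = matched_subset(cohort, "cluster_A", 10)
print(len(cases), len(controls))  # 10 10
```

Shipping only the matched subsample, rather than the full cohort, is what saves the network bandwidth while preserving statistical power.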
Reducing costs (computer systems). Personalized medicine must
be commoditized to be successful so
requires computer systems research;
for example, since most genomes are