…ATG…GAGTA… Reference Assembly
…ACG…GAGTA… Maternal chromosome 1
…ATG…GAGCA… Paternal chromosome
Individual A is bi-allelic, or heterozygous, at the two SNV sites and has the
genotype …C/T…C/T…, and the genotypes are resolved into two haplotypes:
…C…T… and …T…C….
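The relationship between reference, genotype, and haplotype in this example can be made concrete with a minimal Python sketch. The three short strings are toy stand-ins for the elided sequences above (the "…" gaps are simply dropped), and the function names `snv_sites` and `haplotypes` are our own illustrative choices, not part of any genomics library.

```python
def snv_sites(reference, maternal, paternal):
    """Map each position where either chromosome differs from the
    reference to an unordered genotype string such as 'C/T'."""
    if not (len(reference) == len(maternal) == len(paternal)):
        raise ValueError("this sketch assumes equal-length sequences (no indels)")
    sites = {}
    for pos, (ref, mat, pat) in enumerate(zip(reference, maternal, paternal)):
        if mat != ref or pat != ref:
            # A genotype is unordered: it does not say which parent gave which allele.
            sites[pos] = "/".join(sorted((mat, pat)))
    return sites


def haplotypes(maternal, paternal, positions):
    """Resolve genotypes into the two per-chromosome allele strings."""
    return ("".join(maternal[i] for i in positions),
            "".join(paternal[i] for i in positions))


# Toy sequences (assumption): the sidebar's fragments with the gaps removed.
reference = "ATGGAGTA"
maternal = "ACGGAGTA"
paternal = "ATGGAGCA"

sites = snv_sites(reference, maternal, paternal)
haps = haplotypes(maternal, paternal, sorted(sites))
print(sites)  # {1: 'C/T', 6: 'C/T'}
print(haps)   # ('CT', 'TC')
```

Both sites carry the same genotype C/T, yet only the haplotypes record that the maternal chromosome holds C then T while the paternal one holds T then C.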
Sites containing SNVs that are prevalent in a population demarcate
chromosomal positions as varying, or polymorphic. Consequently, these
locations are called single nucleotide polymorphisms (SNPs). In discovery
workflows, geneticists test populations to see if the occurrence of
variation correlates, or associates, with the phenotype status of the
individual.
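One common way to test such an association, shown here only as an illustration, is a chi-square test on a 2x2 table of variant carriers versus phenotype status. The sketch below uses the standard closed form for a 2x2 table with one degree of freedom; the counts are hypothetical and the function name `chi_square_2x2` is our own.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom)
    for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: rows are cases and controls,
# columns are carriers and non-carriers of the variant allele.
counts = [[30, 70],
          [10, 90]]
chi2, p_value = chi_square_2x2(counts)
print(chi2, p_value)  # chi2 is 12.5 for these counts
```

A small p-value suggests the variant and the phenotype are not independent in the sample; it does not by itself establish a causal link.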
So far we have discussed simple
variations involving one or a small
number of changes at a location. By
contrast, geneticists also investigate
structural variations in which large
(1 kbp up to several million bases) genomic fragments are deleted, inserted, translocated, duplicated, or inverted, relative to the reference. 19
(and decreasing more slowly) than the
cost of sequencing.
We begin with exemplar queries on
genomic data that illustrate the difficulty of genomic analysis and lack of
consensus as to a best method. Abstractions must be flexible enough to
handle a variety of methods.
Sequencing Trends
Four technological trends are relevant
for designing a genomic software architecture:
Reduced cost. While the Human Genome Project (http://www.genome.gov/)
cost $100 million, human re-sequencing for redundant (15x) coverage now
costs less than $5,000 in the U.S., projected to fall below $1,000.
This implies universal sequencing may be realizable, and archiving and
analysis, not sequencing, will dominate cost;
Figure 2. Evidence for variation in the donor.
[Figure: panel (a) shows donor reads (CCT-AC, GTAGACT, GTACAC, TAGACTCA, TACACTCAC) aligned against the reference ACCGTACACTCAT, with an SNV marked; panel (b) shows paired-end reads "a", "b", and "c" from the donor genome mapped to the reference around a deleted region. Evidence types listed: 1. paired-end mapping; 2. depth of coverage; 3. loss of heterozygosity; 4. split reads.]
(a) Evidence for SNVs is provided by aligning donor reads against the reference sequence; the G/T
variation might be a sequencing error, as the variant read maps with too many errors, though the
G/C variation appears to be a true SNV. (b) Paired-end sequencing and mapping provides evidence
for deletion in the genome; the dotted rectangle demarcates the region in the reference deleted
in one of the two donor chromosomes. Read "a" samples the region around the deletion (marked
with the lightning bolt), mapping "discordantly" in the reference; read "b" maps concordantly,
but with coverage of only about half of neighboring regions; and read "c" is sampled from the
breakpoint, mapping at only one end.
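The paired-end evidence described in part (b) of the caption can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a production structural-variant caller: the expected insert size (400 bases), the tolerance (100 bases), the coordinates, and the names `MappedPair` and `classify_pair` are all hypothetical.

```python
from typing import NamedTuple, Optional


class MappedPair(NamedTuple):
    name: str
    left_pos: Optional[int]   # reference coordinate of the left end, None if unmapped
    right_pos: Optional[int]  # reference coordinate of the right end, None if unmapped


def classify_pair(pair, expected_insert=400, tolerance=100):
    """Label a read pair by how its mapped span compares to the library's insert size."""
    if pair.left_pos is None or pair.right_pos is None:
        # Like read "c": sampled from a breakpoint, so only one end maps.
        return "one_end_only"
    span = abs(pair.right_pos - pair.left_pos)
    if span > expected_insert + tolerance:
        # Like read "a": the ends map too far apart, hinting at a donor deletion.
        return "discordant_long"
    if span < expected_insert - tolerance:
        # Ends map too close together, hinting at a donor insertion.
        return "discordant_short"
    return "concordant"  # like read "b"


# Hypothetical coordinates, chosen only to mirror reads a, b, and c.
pairs = [
    MappedPair("a", 1000, 2900),
    MappedPair("b", 1500, 1890),
    MappedPair("c", 950, None),
]
labels = {p.name: classify_pair(p) for p in pairs}
print(labels)
```

For a long discordant pair, the mapped span minus the expected insert size gives a rough estimate of the deleted length, here about 1,500 bases for read "a".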
Variation Calling
The key to efficiency in genomics is
the premise that an individual’s genetic record can be summarized succinctly by a much smaller list of individual genetic variations. While we
develop this premise further in our
layering proposal, we provide insight
as to how variants are called today;
the expert should skip this section
and proceed to our layering proposal.
We start with querying for SNVs in the
donor genome, the simplest form of
variation:
Calling SNVs. Figure 2 outlines how a mutation may be called. Consider
the reference allele C. We see two reads from the donor genome with a G
allele and some reads with a C, indicating a heterozygous SNV. If the
variation were homozygous, all overlapping reads would be expected to
have a G in that position. Even this simple call can be confounded.
Some reads may have been mapped to the wrong place on the reference
(such as the top donor read in the figure). The G/T mutation may not
be correct, and the alignment of the incorrectly mapped read might
present many variations. Even if the read is mapped correctly,
sequencing errors could incorrectly appear as heterozygous mutations.
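The counting step behind such a call can be sketched as a pileup over one reference position. The reference and the four ungapped reads are taken from Figure 2; the read offsets are our own reading of the figure rather than values given in the text, and the fixed `min_fraction` threshold is a hypothetical simplification. A real caller would weigh mapping and base quality instead of using a hard cutoff.

```python
from collections import Counter


def pileup(reads, position):
    """Count the bases that ungapped reads place at one reference position.
    Each read is (offset, sequence), where offset is its 0-based start."""
    counts = Counter()
    for offset, seq in reads:
        index = position - offset
        if 0 <= index < len(seq):
            counts[seq[index]] += 1
    return counts


def naive_call(ref_base, counts, min_fraction=0.2):
    """Classify a site from allele fractions alone (threshold is an assumption)."""
    total = sum(counts.values())
    if total == 0:
        return "no coverage"
    has_alt = any(base != ref_base and n / total >= min_fraction
                  for base, n in counts.items())
    if not has_alt:
        return "homozygous reference"
    if counts.get(ref_base, 0) / total >= min_fraction:
        return "heterozygous SNV"
    return "homozygous SNV"


reference = "ACCGTACACTCAT"
# Offsets are assumed; the gapped top read (CCT-AC) is left out of this sketch.
reads = [(3, "GTAGACT"), (3, "GTACAC"), (4, "TAGACTCA"), (4, "TACACTCAC")]

site = 6  # 0-based position of the reference allele C discussed above
counts = pileup(reads, site)
call = naive_call(reference[site], counts)
print(dict(counts), call)  # {'G': 2, 'C': 2} heterozygous SNV
```

With only four reads, a single mis-mapped read or sequencing error would shift the allele fractions substantially, which is exactly how the confounding cases above arise.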
Mutation callers use statistical methods informed by the mapping
quality of the read (such as the number of potential places in the
genome a read can map to), the quality score of a base call, and the
distribution of bases or alleles in the reads for that location.
Some mutation callers use evidence based on the surrounding locations
(such as an excess of insertion/deletion events nearby suggesting alignment problems). The decision itself
could be based on frequentist or Bayesian inference, or on other machine-learning techniques. While SNP callers use various inference techniques,
all refer to the same evidence: the set