easily, chemically treated to sterilize
them, and then formed into artificial
structures. A team at the Massachusetts Institute of Technology (MIT)
led by Angela Belcher has used M13
scaffolds to make electrical batteries. Seung-Wuk Lee of the University
of California, Berkeley (UC Berkeley)
has used genetically engineered
versions of the same inovirus to create piezoelectric generators.
Members of Inoviridae can show a
much darker side, too. One inovirus
has been found to make cholera bacte-
ria much more deadly. Says Roux, “You
might think, as we have these viruses
with great applications and others with
a big impact, we must know a lot about
them. But we don’t.”
There are fewer than 100 confirmed
species of inovirus. They even seem to
elude methods that were developed
specifically to find and identify novel
species of microorganism. One such
technique, the metagenomic survey,
takes advantage of the high-speed “next-generation” gene-sequencing (NGS)
hardware now available to biologists.
Derived from the “shotgun” sequencing
used on the Human Genome Project
more than 15 years ago, NGS makes it
possible to reconstruct genomes from
multiple species that may be contained
in a single sample, instead of trying to
first isolate the DNA of each organism.
The first step is to shred DNA extracted from a biological sample before
using enzymes to make enough copies
for sequencing. High-performance
computers then attempt to piece together the resulting jigsaw into longer
sequences. The algorithms do this by
aligning segments that appear to overlap before assembling them into different candidate genomes. Normally,
in a metagenomic survey, researchers
hand-check the results to try to weed
out false matches.
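The overlap-and-merge idea can be illustrated with a toy sketch. This is not the software used in any survey described here: the function names, the `min_len` threshold, and the greedy merge order are assumptions made for illustration, and production assemblers use far more elaborate graph-based methods and must cope with sequencing errors.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, -1, -1)  # (overlap length, left index, right index)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            # Nothing overlaps any more; what is left are candidate sequences.
            return contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


# Hypothetical reads, invented for this example.
print(greedy_assemble(["ATGCGT", "CGTACC", "ACCGGA"]))
```

Fragments that share no overlap stay separate, which is roughly how a mixed sample yields several candidate genomes rather than one.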
With bacteria and higher organisms, it is relatively straightforward to
ensure that each genome represents
a single species. One commonly employed technique looks for variations
of one or two essential large genes.
Because these genes are
fundamental to the organism’s survival,
they exhibit relatively
minor deviations across species, and
organisms from the same family will
share changes that are not
seen in more distant relatives.
In some cases, metagenomics has
revealed thousands of previously unknown organisms lurking in samples
from a single location. A group led by
Jill Banfield at UC Berkeley took samples from sediment beds at an abandoned uranium mine in Colorado in
2015. From those samples, NGS and
computer analysis coupled with manual curation reconstructed more than
2,500 partial and complete genomes,
and among them were nearly 50
new families of bacteria. Further work
led to the team proposing a new “tree
of life” they believe better explains the
evolutionary relationships between microorganisms than traditional models.
For both bacteria and viruses,
metagenomic surveys have produced
genomes suitable for study without demanding that each species be cultured
in the lab. For many species, that is
impossible using current techniques.
Viruses present a significant problem
as they are closely associated with their
hosts and do not grow in isolation.
Siddharth Krishnamurthy, a re-
searcher at the Washington University
School of Medicine in St. Louis, says,
“Without these large genomic data-
bases and algorithmic approaches to
populate them, we would be unaware
of whole families of viruses that have
never been cultured.”
Yet within these databases, members
of the Inoviridae family are suspiciously
absent. Roux’s hunch was that
inoviruses are commonly found in the
environment and that detection was
the main problem. It seemed that traditional
genome-identification and binning
tactics did not work well on them. One
possibility was to use a tool called Vir-
Sorter, developed at the University of
Arizona when Roux worked there. This
software looks for characteristic nucleotide
patterns in genomes, such as sequences
that code for the protein shells in
which viruses wrap their DNA payloads
for transport to new victims.
“This work started when we realized that these viruses were missed
by the probabilistic techniques used
in VirSorter. The short story is that
these inovirus genomes are too short
and their genes are too variable for a
VirSorter-like approach to identify,”
Roux says.
One approach that some groups
have tried is to look at the statistical
composition of the many tiny fragments of DNA that the sequencer
reads. Although the reasons are
not yet understood, analysis of known
viral genomes has shown that closely
related genomes share a bias in the way
nucleotides are used, even in short sequences known as k-mers.
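To make the k-mer idea concrete, the sketch below counts every overlapping window of length k in a sequence and compares two of the resulting frequency profiles. The function names, the default of k = 4, and the use of a simple sum of absolute differences are illustrative assumptions, not the method of any tool mentioned in this article.

```python
from collections import Counter


def kmer_profile(seq: str, k: int = 4) -> dict[str, float]:
    """Relative frequency of each k-mer in seq, skipping ambiguous bases."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def profile_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Sum of absolute frequency differences; 0.0 means identical usage."""
    return sum(abs(p.get(m, 0.0) - q.get(m, 0.0)) for m in set(p) | set(q))


# Hypothetical fragments, invented for this example.
a = kmer_profile("ACGTACGTACGT")
b = kmer_profile("TTTTGGGGCCCC")
print(profile_distance(a, a), profile_distance(a, b))
```

Profiles like these, computed over many fragments, are the kind of numerical features a statistical or machine learning method can work with, without needing a match to any known gene.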
The DiscoVir tool developed by
Krishnamurthy and colleagues uses
machine learning trained on k-mer
data to sift the genomes of unidentified
viruses that infect plants and animals,
rather than bacteria, out of the bacterial
and fungal material in metagenomic
surveys. Machine learning makes it
possible to use features that do not
rely on similarity to known genetic sequences and apply rules that are more
likely to find virus candidates.
“In principle, if we knew the sequence of every virus on the planet,
there would be no value in using a
machine learning algorithm for virus
identification,” Krishnamurthy says.
“The greatest asset that I believe machine learning brings to viral identification is the ability of these algorithms
to identify different combinations of
variables that can lead to the positive
prediction of a virus.
“Things like support vector machines and random forests don’t require all viruses to have the same properties. This is an important feature of
viral classification because biologically, there are no molecular attributes
that are specific to all viruses that are
not present in any non-viruses, which
is one of the reasons why it’s so hard to