There are tons of examples of data that can be assembled right now that will compromise privacy. Unfortunately, the social value of compromising privacy is pretty substantial. So, you can argue that technology has rendered privacy a moot question, or you can argue that preserving privacy is a legislative issue.
As predictive models are increasingly used, how do we avoid biases when interpreting and using data?
DAPHNE KOLLER: Bias will always be a challenge, and there isn't a single, magic solution. The bigger question is: "How do we disentangle correlation from causation?" The gold standard in medicine is the randomized controlled trial; in the case of Web data, it's called A/B testing. Although not perfect, a randomized trial, or A/B test, is about as good a tool as we have been able to develop for addressing some of the confounders. Unfortunately, this type of control is not feasible in all cases. Then processes must be carefully scrutinized to check for different confounders and to look for any and all correlations that give rise to the phenomenon being viewed. It's a process that requires a lot of thought and a lot of care, and its importance cannot be overstated.
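The A/B-testing idea Koller describes can be sketched in a few lines. This is a minimal, hedged illustration, not anyone's production methodology: because users are randomly assigned to the two variants, confounders are balanced by design, and a simple two-proportion z-test can compare outcomes. The conversion counts below are made up for illustration.

```python
import math

def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: did variant B convert differently than A?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF via math.erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical experiment: 2,000 users randomly assigned to each variant.
z, p = ab_test(conv_a=120, n_a=2000, conv_b=160, n_b=2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Randomization is what licenses the causal reading of the result; the same arithmetic applied to observational data would only measure correlation.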
For example, sometimes there are biases that are reflected in the conclusions drawn from the data. In searches on certain sites, for example, "Steph" auto-completes to "Stephen" rather than "Stephanie" because Stephen is a more common search term. Some would say this is a gender bias and should be eliminated. As a woman in tech, I can certainly relate to and understand that perspective. Some would also say that the data is what it is, and if Stephen is a more common search term than Stephanie, do we really want to make the algorithm do something other than what is best for user efficiency? It's a real quandary, and one can make legitimate arguments either way.
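The autocomplete behavior Koller describes can arise from nothing more than frequency ranking. A minimal sketch, assuming a purely hypothetical query log (the queries and counts below are invented):

```python
from collections import Counter

# Hypothetical query log: counts of past searches (invented numbers).
query_log = Counter({"stephen": 900, "stephanie": 400, "stephen curry": 300})

def autocomplete(prefix: str) -> str:
    """Return the most frequent logged query starting with the prefix."""
    candidates = {q: n for q, n in query_log.items() if q.startswith(prefix)}
    return max(candidates, key=candidates.get)

print(autocomplete("steph"))  # the most frequent match wins
```

Nothing in the code "chooses" Stephen over Stephanie; the ranking simply reflects whatever imbalance the historical data contains, which is exactly why the quandary she raises is a policy question rather than a bug fix.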
MICHAEL STONEBRAKER: The trouble with predictive models is that they are built by humans, and humans by nature are prone to bias. If we look at the most recent presidential election, we see a spectacular failure of existing polling models. Twenty-twenty hindsight shows that nobody thought Trump could actually win, when in reality, it is far more likely the polling models were subtly biased against him.
So, the problem with predictive models is the models themselves. If they include fraud, bias, etc., they can yield very bad answers. One has to take predictive models with a grain of salt. We put way too much faith in predictive modeling.
What role can big data and machine learning play in helping scientists understand data (for example, in the Human Genome Project) and bring forth some potential real-world opportunities in health and medicine?
DAPHNE KOLLER: One of the main reasons I came back to the healthcare field is that I think the opportunity here is so tremendous. As costs go down, our ability to sequence new genomes increases dramatically. And it's not just genomes; it's transcriptomes and proteomes and many other data modalities. When we combine that with wearable devices that allow you to see the effect of phenotypes, there is an amazing explosion of data that we could access. One reason this is beneficial is that it will improve our ability to determine the genetic factors that cause certain diseases. Yes, we could do that before, but when faced with tens of millions of variations in the genome and only a couple hundred examples to use, it's really difficult to extract much out of that except the very strongest signals.
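Koller's point about millions of variants versus a couple hundred examples can be made concrete with a simulation. This sketch (entirely synthetic data, not a real analysis pipeline) generates genotypes and a phenotype that are pure noise, yet many variants still look "associated" by chance once enough tests are run:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_variants = 200, 20_000  # few patients, many genome positions

# No variant truly influences the phenotype: everything is random.
genotypes = rng.integers(0, 2, size=(n_samples, n_variants)).astype(float)
phenotype = rng.integers(0, 2, size=n_samples).astype(float)

# Pearson correlation of every variant with the phenotype, vectorized.
g = (genotypes - genotypes.mean(0)) / genotypes.std(0)
ph = (phenotype - phenotype.mean()) / phenotype.std()
r = g.T @ ph / n_samples

# With 20,000 tests and only 200 samples, spurious correlations abound.
hits = int(np.sum(np.abs(r) > 0.2))
print(f"spurious 'hits' with |r| > 0.2: {hits}")
```

This is why, with small cohorts, only the very strongest genetic signals can be trusted; larger sample sizes shrink the noise floor and let weaker real effects emerge.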
Are there potential technological breakthroughs on the horizon that could transform this area again in the near future?
DAVID BLEI: I think we are in the middle of a transformative time for machine learning and statistics, and it's fueled by a few ideas. Reinforcement learning is a big one. This is the idea that we can learn how to act in the face of an uncertain environment with uncertain consequences of our actions; it's fueling a lot of the amazing results that we're seeing in machine learning and AI. Deep learning is another idea: a very flexible class of learners that, when given massive datasets, can identify complex and compositional structure in high-dimensional data. Another idea is 60 years old, but it's optimization: I have some kind of function and I want its maximal value; how do I find it? Optimization tells us how to do that very efficiently with massive datasets.
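One reason optimization scales to massive datasets is stochastic gradient descent: each step uses only a small random minibatch rather than the full dataset. A toy sketch of the idea (invented data, not Blei's own examples), fitting a single weight by minimizing squared error:

```python
import random

random.seed(0)

# Toy dataset: y = 3*x exactly, so the true weight is 3.0.
data = [(x, 3.0 * x) for x in range(1, 1001)]

w, lr = 0.0, 1e-7
for step in range(5000):
    batch = random.sample(data, 32)           # small random minibatch
    # Gradient of mean squared error with respect to w on this batch.
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    w -= lr * grad                            # descend the gradient

print(f"learned w = {w:.3f}")
```

Because each update touches only 32 of the 1,000 points, the per-step cost is independent of dataset size; that property, at scale, is what makes optimization over massive datasets tractable.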
VIPIN KUMAR: New types of sensors and communication technologies can be quite transformational. The kinds of sensors that we see today could not even have been imagined just a few decades ago. Mobile health sensors such as the Fitbit and Apple Watch, which can record our physiological parameters in unprecedented detail, have been around only for the past decade or so. New types of sensors based on advances in electronics, nanotechnology, and biomedical sciences are already enabling deployment of small and inexpensive satellites that can monitor the earth and its environment at spatial and temporal resolutions never possible before. Without technologies such as RFID, it would be very hard for someone to imagine that you could walk into a store and purchase something just by looking at it or by being close to it, something that is now possible at Amazon Go, a grocery store in Seattle that has no checkout counter. New sensors based on quantum technology may open up entirely new applications that we are not even considering today.
Final thoughts?
MICHAEL STONEBRAKER: All of the fancy social benefits we expect from big data depend on seamless data integration. Solving the problem of how to improve data integration is going to be key in getting the most benefit from all the data being created.
© 2017 ACM 0001-0782/17/06 $15.00