[3]. Background on multi-armed bandits can be found in the survey [10]. For research on partial information disclosure as a means of persuasion, see Kamenica and Gentzkow’s “Bayesian Persuasion” [11] and the follow-up work.
This article does not attempt to provide a comprehensive survey of this research direction. Instead, it focuses on the model from Kremer et al. [1] and the improved and generalized results for this model due to Mansour et al. [3]. Relevance to medical trials has been suggested in Mansour et al. [3]. This article draws heavily on the text from their research, the thoughts from Kleinberg and Slivkins [9], and numerous discussions with my co-authors from [3, 4].
[1] Kremer, I., Mansour, Y., and Perry, M. Implementing the wisdom of the crowd. J. of Political Economy 122, 5 (2014). Prelim. version in ACM EC 2014.
[2] Che, Y.-K. and Hörner, J. Optimal design for social learning. Preprint, 2015. First draft: 2013.
[3] Mansour, Y., Slivkins, A., and Syrgkanis, V. Bayesian incentive-compatible bandit exploration. Working paper, 2017. Prelim. version in ACM EC 2015.
[4] Mansour, Y., Slivkins, A., Syrgkanis, V., and Wu, S. Bayesian exploration: Incentivizing exploration in Bayesian games. Working paper, 2016. Prelim. version in ACM EC 2016.
[5] Bahar, G., Smorodinsky, R., and Tennenholtz, M. Economic recommendation systems. In 16th ACM Conf. on Electronic Commerce (EC), 2016.
[6] Bimpikis, K., Papanastasiou, Y., and Savva, N. Crowdsourcing exploration. Management Science,
[7] Frazier, P., Kempe, D., Kleinberg, J. M., and Kleinberg, R. Incentivizing exploration. In ACM Conf. on Economics and Computation (ACM EC). ACM, New York.
[8] Kleinberg, R. D., Waggoner, B., and Weyl, E. G. Descending price optimally coordinates search. Working paper, 2016. Prelim. version in ACM EC 2016.
[9] Kleinberg, R. D. and Slivkins, A. Incentivizing and coordinating exploration, 2017. Tutorial at ACM EC 2017.
[10] Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5, 1 (2012).
[11] Kamenica, E. and Gentzkow, M. Bayesian persuasion. American Economic Review 101, 6 (2011).
Aleksandrs Slivkins is a senior researcher at Microsoft
Research New York. Previously, he was a researcher at
MSR Silicon Valley in 2007-2013, after receiving his Ph.D.
from Cornell. His research interests are in algorithms and
theoretical computer science, spanning machine learning
theory, algorithmic economics, and networks. Slivkins
is particularly interested in the exploration-exploitation tradeoff and its manifestations in mechanism design and human computation. His work has been recognized with the best paper award at ACM EC 2010, the best paper nomination at WWW 2015, and the best student paper award at ACM PODC 2005.
© 2017 Copyright held by owners/authors.
Publication rights licensed to ACM
framework for thinking about the explore-exploit trade-off. In particular, the planner is a “BIC bandit algorithm”: a multi-armed bandit algorithm with an ancillary BIC constraint.
Now that the model is set, what are
the specific technical questions one
could ask? My colleagues and I have
pursued several directions. First, how
about optimal BIC bandit algorithms?
In other words, how about BIC bandit algorithms whose social welfare is largest
among all BIC bandit algorithms? The
point here is not just to demonstrate
such algorithms exist—they always
do—but also to provide a relatively simple description. This has been achieved
for the special case of two actions with
deterministic rewards, and appears out
of reach for more general scenarios.
Second, what about approximately
optimal BIC bandit algorithms? We rely
on a standard measure of sub-optimality in bandit algorithms called regret: the expected reward of always following the best action minus the expected reward of the algorithm. Essentially, this is how much the algorithm “regrets” not knowing the best action in advance.
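As a concrete illustration of this regret measure, here is a minimal simulation sketch. All names and parameters are hypothetical, and ε-greedy merely stands in for an arbitrary bandit algorithm over Bernoulli-reward actions:

```python
import random

def estimate_regret(means, T=10000, eps=0.1, seed=0):
    """Estimate the regret of an epsilon-greedy learner on Bernoulli arms:
    (reward of always playing the best arm) - (reward the learner collects)."""
    rng = random.Random(seed)
    K = len(means)
    counts, sums, total = [0] * K, [0.0] * K, 0.0
    for _ in range(T):
        if 0 in counts or rng.random() < eps:
            arm = rng.randrange(K)  # explore: pick an arm uniformly at random
        else:
            # exploit: pick the arm with the best empirical mean so far
            arm = max(range(K), key=lambda a: sums[a] / counts[a])
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return T * max(means) - total
```

With a fixed exploration rate ε, regret grows linearly in T; near-optimal bandit algorithms bring this down to regret sublinear in T.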
We would like to construct BIC bandit algorithms whose regret is close to the optimal regret achievable by any bandit algorithm, BIC or not. Long story short, we have achieved this for a constant number of actions. Moreover, we
provide a generic “reduction” that takes
your favorite bandit algorithm, BIC or
not, and converts it into a BIC bandit
algorithm, with only a constant-factor
blow-up in regret. Conceptually, this
allows for a modular approach, where
one “injects” incentives into an existing
machine learning system.
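The flavor of such a reduction can be sketched as follows. This is an illustration only, not the construction from [3]: the phase length L, the round-robin stand-in for the black-box algorithm, and the “empirically best arm” exploit rule are all simplifying assumptions. In each phase of L agents, a single uniformly random agent receives the black-box algorithm’s pick, while everyone else is sent the arm that currently looks best; diluting exploration this way is the key idea behind keeping recommendations incentive-compatible.

```python
import random

class Bandit:
    """Hypothetical Bernoulli bandit environment (for illustration only)."""
    def __init__(self, means, seed=0):
        self.means, self.K = means, len(means)
        self.rng = random.Random(seed)
    def pull(self, arm):
        return 1.0 if self.rng.random() < self.means[arm] else 0.0

class RoundRobin:
    """Stand-in for 'your favorite bandit algorithm' (here: round-robin)."""
    def __init__(self, K):
        self.K, self.t = K, 0
    def next_arm(self):
        arm = self.t % self.K
        self.t += 1
        return arm
    def observe(self, arm, reward):
        pass  # round-robin ignores feedback

def best_so_far(history, K):
    """Empirically best arm given the rewards observed so far."""
    sums, counts = [0.0] * K, [0] * K
    for arm, r in history:
        sums[arm] += r
        counts[arm] += 1
    if 0 in counts:
        return counts.index(0)  # try each arm once before exploiting
    return max(range(K), key=lambda a: sums[a] / counts[a])

def bic_wrapper(base, bandit, num_phases, L=50, seed=1):
    """Sketch of a black-box BIC-style reduction: each phase serves L agents;
    one uniformly random agent gets the base algorithm's exploratory pick,
    the other L-1 get the empirically best arm, so any single recommendation
    is overwhelmingly likely to be 'exploit' from an agent's point of view."""
    rng = random.Random(seed)
    history = []  # (arm, reward) pairs across all agents
    for _ in range(num_phases):
        explore_slot = rng.randrange(L)
        explore_arm = base.next_arm()
        for slot in range(L):
            arm = explore_arm if slot == explore_slot else best_so_far(history, bandit.K)
            r = bandit.pull(arm)
            history.append((arm, r))
            if slot == explore_slot:
                base.observe(arm, r)  # only the exploratory pull feeds the base algorithm
    return history
```

The wrapper consumes only 1/L of the agents for exploration, which is how a constant-factor regret blow-up can plausibly be preserved while the remaining agents are served a recommendation they are happy to follow.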
Third, how about extending these
results to more general machine learning models? For example, there is much
research on scenarios in which additional
information is revealed to a bandit algorithm before and/or after each decision.
We extend the generic reduction mentioned above to these scenarios.
One major reason why the current
work is purely theoretical is that it makes standard, but very idealized, assumptions on human behavior, particularly
regarding a common Bayesian prior and
myopic rationality. While we have made
some progress along this direction,
much more is needed to bring behavior-
al modelling closer to practice. In par-
ticular, we would like to have BIC ban-
dit algorithms that work well assuming
the agents are myopically rational, and
work reasonably well even if the agents
can deviate from myopic rationality—
but not too much and/or not too often.
Another important “theory-to-practice”
direction is to incorporate constraints
on which additional information, such
as reviews and aggregate scores, can
and/or must be revealed to agents. How-
ever, it is unclear how to model such
constraints in a way that is not too tied
to a particular application.
BIC bandits are relevant to medical trials as a conceptual framework
to think about participation incentives
therein. The thinking is that a medical trial can be cast as a bandit algorithm, perhaps a very simple one, which assigns
drugs to patients. In fact, medical trials were one of the original motivations
for studying multi-armed bandits. Now,
every time you take any medicine you
could be in a medical trial and contribute valuable data to science. But this is
just not how the world works. One major reason (among several others that
will not be discussed here) is that you do
not have incentives to subject yourself
to an unproven medical treatment. In particular, consider inexpensive drugs
for common, not-very-serious medical
conditions, when stakes and costs are
relatively small, and suitable patients
are easy to find. We conjecture that participation incentives are an important
obstacle preventing large-scale medical
trials for such drugs. Thus, it is tantalizing to bring the theory of “incentivizing
exploration” closer to the theory of designing medical trials.
The study of incentivizing exploration
via information asymmetry was initiated in the seminal paper of Kremer, Mansour, and Perry [1], and continued in a series of papers [2, 3, 4, 5, 6]. A closely related direction, not discussed here, concerns incentivizing exploration via monetary payments [7, 8].
The general theme of exploration and
incentives also arises in several other
scenarios; see a related work discussion