technical Perspective
Learning to act in
uncertain environments
By Peter L. Bartlett
The ProBLeM oF decision making in an
uncertain environment arises in many
diverse contexts: deciding whether to
keep a hard drive spinning in a net-book; choosing which advertisement
to post to a Web site visitor; choosing how many newspapers to order so
as to maximize profits; or choosing a
route to recommend to a driver given
limited and possibly out-of-date information about traffic conditions.
All are sequential decision problems,
since earlier decisions affect subsequent performance; all require adaptive approaches, since they involve
significant uncertainty. The key issue
in effectively solving problems like
these is known as the
exploration/ex-ploitation trade-off: If I am at a crossroads, when should I go in the most
advantageous direction among those
that I have already explored, and when
should I strike out in a new direction,
in the hopes I will discover something
better?
The following paper by Ganchev,
Kearns, Nevmyvaka, and Vaughan
considers a sequential decision problem from the financial domain: how
to allocate stock orders across a variety of marketplaces, each with an
unknown volume available, so as to
maximize the number of orders that
are filled. These marketplaces are
known as dark pools because they allow traders to keep their transactions
hidden. The popularity of these dark
pools has grown enormously, as traders making large transactions hope
to reduce their market impact, that
is, the tendency for the price to move
in an unfavorable direction. Because
transactions are hidden, the characteristics of the various dark pools are
uncertain, and can only be discovered
by active exploration.
The broad approach followed by the
authors is based on an intuitive heu-
ristic that is reminiscent of a title you
might encounter in the self-help sec-
tion of a bookstore: “optimism in the
face of uncertainty.” The idea is to treat
uncertain outcomes as optimistically
as the data allows: pick the alternative
that, in the best possible world, is con-
sistent with our experiences so far, and
leads to the best outcome. One alter-
native might be chosen either because
it is clearly superior to all others or be-
cause there is not enough data to rule
out that possibility. If it turns out this
alternative is a poor choice, at least it
leads to a reduction in uncertainty, so
that in the future it can be confident-
ly avoided. This approach naturally
leads to a balance between the desire
to exploit information that has already
been gathered, and the need to explore
uncertain alternatives.
the following
paper considers
a sequential decision
problem from the
financial domain:
how to allocate
stock orders
across a variety of
marketplaces, each
with an unknown
volume available,
so as to maximize
the number of orders
that are filled.
the fact that the number of distinct
states in this problem is enormous: it
is exponential in the number of venues. They exploit the favorable structure of the problem, and in particular
the way it decomposes neatly across
the distinct venues. Their approach involves a modification of a conventional nonparametric statistical estimator
for censored data—the Kaplan-Meier
estimator, which is used in survival
analysis. They use one of these estimators for each venue in order to decide
on the allocation. Their modification
to the Kaplan-Meier estimator incorporates the optimism heuristic by encouraging exploration near the boundary of the region of state space that
has already been adequately explored.
The key result is that this approach
can successfully adapt to unknown
markets: under the assumption that
the volumes available in the various
venues are independent random variables, they prove that their strategy
rapidly performs almost as well as the
optimal allocation.
The heuristic of optimism in the
face of uncertainty is known to perform well in other sequential decision problems. For instance, in small
Markov decision problems, it leads to
learning algorithms that have small
regret: the amount of utility gathered
per time step rapidly approaches the
best possible value. The key challenge
in this area is the development of
methods that can deal with very large
problems: in a wide range of applications, the state space is enormous;
the dark pools problem is typical in
this regard. For good performance in
a case like this, it seems essential that
the problem exhibits some kind of
helpful structure. The work detailed
in the following paper shows how the
generic approach of optimism in the
face of uncertainty can be applied to
exploit the structure of a very large
sequential decision problem to give
an adaptive strategy that allows automatic performance optimization.
This is an approach that will certainly
see wider application.
Peter L. Bartlett is a professor in the computer science
division and department of statistics at the university of
california, berkeley.