Input: volume V, tail probability estimates {T̂_i}_{i=1}^K
Output: an allocation v⃗
v⃗ ← 0⃗;
for ℓ ← 1 to V do
    i* ← argmax_i T̂_i(v_i + 1);
    v_{i*} ← v_{i*} + 1;
end
return v⃗
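To make the scheme concrete, here is a minimal Python sketch of the greedy allocation; the function name `greedy` and the representation of the estimates as per-venue lists (with T_hat[i][s] standing for T̂_i(s)) are assumptions of this sketch, not the paper's notation:

```python
def greedy(V, T_hat):
    """Greedily allocate V units across K venues.

    T_hat[i][s] is the estimated probability that venue i can absorb
    at least s units. Each of the V units is given, one at a time, to
    the venue whose *next* unit has the highest estimated tail
    probability.
    """
    K = len(T_hat)
    v = [0] * K
    for _ in range(V):
        # marginal value of one more unit at each venue
        i_star = max(range(K), key=lambda i: T_hat[i][v[i] + 1])
        v[i_star] += 1
    return v
```

For example, greedy(3, [[1.0, 0.9, 0.5, 0.1], [1.0, 0.8, 0.7, 0.6]]) returns [1, 2]: the first unit goes to venue 0 (0.9 > 0.8), and the remaining two units go to venue 1, whose estimated tail decays more slowly.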
polynomial-time, near-optimal algorithm for multi-venue
exploration from censored data. The analysis of our algorithm bears a strong resemblance to the exploration–exploitation arguments common in the E^3 and RMAX family of algorithms for reinforcement learning [4, 12]. In particular,
there is an analogy to the notion of a known state inherent in
those earlier algorithms, along with an exploitation lemma
(proving that expected payoffs from known states are high)
and an exploration lemma (proving that extended periods of
low payoffs must result in more states becoming known).
In our setting, however, the number of states is exponential
and thus we must exploit the special structure of our problem to obtain a polynomial-time algorithm. We first provide an
overview of the algorithm and its analysis before examining
it in more detail.
At the highest level, the algorithm is quite simple and
natural. It maintains estimates T̂_i^t for the true unknown tail probabilities T_i for each venue i. These estimates
improve with time in a particular quantifiable sense which
drives between-venue exploration. At any given time t, the
current volume V_t is allocated across the venues by simply calling the optimal greedy allocation scheme from Figure 1 on the current set of estimated tail probabilities T̂_i^t. This
results in new censored observations from each venue,
which in turn are used to update the estimates T̂_i^{t+1} used at
the next time step. Thus the algorithm, which is formally
stated in Figure 2, implements a continuous allocate–
reestimate loop.
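As a concrete picture of what one round of this loop observes, the sketch below simulates a single censored observation; the demand model and the names `submit` and `hidden_demand` are illustrative assumptions, not part of the algorithm:

```python
import random

def submit(v_submitted, hidden_demand):
    """One censored observation: r shares sell out of v_submitted.

    If r < v_submitted, we learn the venue's available demand exactly.
    If the entire order fills (r == v_submitted), we only learn that
    demand was at least v_submitted -- the observation is censored.
    """
    r = min(v_submitted, hidden_demand)
    censored = (r == v_submitted)
    return r, censored

# e.g., with hidden demand drawn from a toy distribution:
r, censored = submit(10, random.randint(0, 20))
```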
Note that we have not yet described the algorithm’s
subroutine OptimisticKM, which specifies how we estimate T̂_i^t from the observed data. The most natural choice
would be the maximum likelihood estimator on the data.
This estimator is well-known in the statistics literature
as the Kaplan–Meier estimator. In the following section,
we describe Kaplan–Meier and derive a new convergence
result that suits our particular needs. This result in turn
lets us define an optimistic tail modification of Kaplan–
Meier that becomes our choice for OptimisticKM. Figure 3
shows the full subroutine.
The analysis of our algorithm, which is developed in more
detail over the next few sections, proceeds as follows:
step 1: We first review the Kaplan–Meier maximum likelihood estimator for censored data and provide a new finite-sample convergence bound for this estimator. This
bound allows us to define a cut-off for each venue i such
that the Kaplan–Meier estimate of the tail probability T_i(s)
for every value of s up to the cut-off is guaranteed to be
close to the true tail probability. We then define a slightly modified version of the Kaplan–Meier estimates in which
the tail probability of the next unit above the cut-off is
modified in an optimistic manner. We show that in conjunction with the greedy allocation scheme, this minor
modification leads to increased exploration, since the next
unit beyond the cut-off always looks at least as good as the
cut-off itself.
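As a toy illustration (with made-up numbers), the modification simply copies the Kaplan–Meier value at the cut-off onto the next unit, so the greedy rule never ranks that unit below the cut-off itself:

```python
# Suppose a venue's cut-off is c = 3 and its Kaplan-Meier tail
# estimates for s = 0, ..., 4 are (hypothetical values):
T_hat_i = [1.0, 0.9, 0.6, 0.4, 0.1]
c = 3
T_hat_i[c + 1] = T_hat_i[c]          # optimistic modification: 0.1 -> 0.4
assert T_hat_i[c + 1] >= T_hat_i[c]  # unit c+1 now looks as good as unit c
```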
step 2: We next prove our main Exploitation Lemma (Lemma
3). This lemma shows that at any time step, if it is the case
that the number of units allocated to each venue by the
greedy algorithm is strictly below the cut-off for that venue
(which can be thought of as being in a known state in the
parlance of reinforcement learning) then the allocation is
provably ε-optimal.
step 3: We then prove our main Exploration Lemma (Lemma
4), which shows that on any time step at which the allocation
made by the greedy algorithm is not ε-optimal, it is possible
to lower bound the probability that the algorithm explores.
Thus, any time we cannot ensure a near-optimal allocation,
we are instead assured of exploring.
step 4: Finally, we show that on any sufficiently long sequence of time steps (where "sufficiently long" means polynomial in the parameters of the model), it must be the case
that either the algorithm has already implemented a near-
figure 2. main algorithm.
Input: volume sequence V_1, V_2, V_3, ...
arbitrarily initialize T̂_i^1 for each i;
for t ← 1, 2, 3, ... do
    Allocation Step:
    v⃗^t ← Greedy(V_t, T̂_1^t, ..., T̂_K^t);
    for i ∈ {1, ..., K} do
        submit v_i^t units to venue i;
        let r_i^t be the number of shares sold;
        Reestimation Step:
        T̂_i^{t+1} ← OptimisticKM({(v_i^t', r_i^t')}_{t'=1}^t);
    end
end
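A Python rendering of this loop, reusing the greedy sketch above; optimistic_km is sketched after Figure 3 below, and the venues[i] callables that simulate venue fills are an assumption of this sketch, standing in for the live market:

```python
def main_loop(volumes, venues, V_max, eps, delta):
    """Allocate-reestimate loop of Figure 2 for K venues.

    volumes: the sequence V_1, V_2, V_3, ... of volumes to place.
    venues[i](v): submits v shares to venue i and returns the
        (possibly censored) number of shares sold, r <= v.
    """
    K = len(venues)
    history = [[] for _ in range(K)]                   # per-venue (v, r) pairs
    T_hat = [[1.0] * (V_max + 1) for _ in range(K)]    # arbitrary initialization
    for V_t in volumes:
        v = greedy(V_t, T_hat)                         # Allocation Step
        for i in range(K):
            r = venues[i](v[i])                        # censored observation
            history[i].append((v[i], r))
        T_hat = [optimistic_km(history[i], V_max, eps, delta)  # Reestimation Step
                 for i in range(K)]
    return T_hat
```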
figure 3. subroutine OptimisticKM. Let M_{i,s'}^t and N_{i,s'}^t be defined in Section 4.1, and assume that ε, δ > 0 are fixed parameters.
Input: observed data {(v_i^t', r_i^t')}_{t'=1}^t for venue i
Output: modified Kaplan–Meier estimators for venue i
Calculate the cut-off:
    c_i^t ← max{s : s = 0 or N_{i,s-1}^t ≥ 128(sV/ε)^2 ln(2V/δ)};
Compute Kaplan–Meier tail probabilities:
    T̂_i^t(0) = 1;
    for s = 1 to V do
        T̂_i^t(s) ← ∏_{s'=0}^{s-1} (1 − M_{i,s'}^t/N_{i,s'}^t);
    end
Make the optimistic modification:
    if c_i^t < V then
        T̂_i^t(c_i^t + 1) ← T̂_i^t(c_i^t);
return T̂_i^t;
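A self-contained Python sketch of the subroutine for one venue. Since Section 4.1's definitions of M_{i,s}^t and N_{i,s}^t are not reproduced here, the counts below follow one standard Kaplan–Meier convention and should be read as an assumption rather than the paper's exact construction:

```python
import math

def optimistic_km(data, V, eps, delta):
    """Sketch of OptimisticKM (Figure 3) for a single venue.

    data: list of (v, r) pairs -- shares submitted and shares sold.
    Returns T_hat, where T_hat[s] estimates the probability that at
    least s shares could have been sold at this venue.
    """
    # Counts M[s], N[s] under one standard Kaplan-Meier convention
    # (an assumption; the paper defines them in Section 4.1):
    # M[s] counts uncensored observations of demand exactly s
    # (r == s < v); N[s] counts observations still "at risk" at s
    # (r >= s).
    M = [0] * (V + 1)
    N = [0] * (V + 1)
    for v, r in data:
        if r < v:
            M[r] += 1
        for s in range(r + 1):
            N[s] += 1

    # Cut-off: the largest s whose estimate is backed by enough data
    # (s = 0 always qualifies).
    c = 0
    for s in range(1, V + 1):
        if N[s - 1] >= 128 * (s * V / eps) ** 2 * math.log(2 * V / delta):
            c = s

    # Kaplan-Meier tail probabilities, as a running product of
    # one-minus-hazard terms (guarding against empty risk sets).
    T_hat = [1.0] * (V + 1)
    for s in range(1, V + 1):
        hazard = M[s - 1] / N[s - 1] if N[s - 1] > 0 else 0.0
        T_hat[s] = T_hat[s - 1] * (1.0 - hazard)

    # Optimistic modification: the unit just past the cut-off looks
    # at least as good as the cut-off itself.
    if c < V:
        T_hat[c + 1] = T_hat[c]
    return T_hat
```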