Each algorithm was run in simulation for a number of episodes. Each episode consists of the allocation of a fixed number V of shares, so the same number of shares is repeatedly allocated by the algorithm, though of course this allocation changes over time for the two adaptive algorithms as they learn. Each episode results in some fraction of the V shares being executed. Two values of V were investigated: a smaller value V = 1000, and the larger and potentially more difficult V = 8000.
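To make the episode protocol concrete, here is a minimal Python sketch of a single simulated episode. The `allocator` and `venues` interfaces, and the Poisson liquidity in the usage example, are hypothetical stand-ins; the article's actual simulator is not reproduced here.

```python
import numpy as np

def uniform_allocator(k, shares):
    """Non-adaptive baseline: split the order as evenly as possible across k venues."""
    base, rem = divmod(shares, k)
    return [base + (i < rem) for i in range(k)]

def run_episode(allocator, venues, V=8000, rng=None):
    """One simulated episode: submit V shares across the venues and return the
    fraction executed.  `venues` is a list of callables, each drawing the
    available liquidity at one venue from that venue's distribution."""
    rng = rng or np.random.default_rng()
    allocation = allocator(len(venues), V)          # v_1 + ... + v_k = V
    executed = sum(min(v_i, venue(rng)) for v_i, venue in zip(allocation, venues))
    return executed / V                             # fraction of the order filled this episode

# Example: four venues with invented Poisson liquidity, split uniformly.
venues = [lambda r, lam=lam: r.poisson(lam) for lam in (50, 200, 800, 3000)]
print(run_episode(uniform_allocator, venues))
```

An adaptive scheme would additionally update its per-venue estimates from the censored fills it observes before the next episode.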
We begin by showing full learning curves over 2000 episodes with V = 8000 for a couple of representative stocks in Figure 4. Here the average performance of each of the two nonadaptive allocation schemes (ideal and uniform) is represented as a horizontal line, while learning curves are given for the adaptive schemes. Due to the high variance of the heavy-tailed venue distributions used by the simulator, a single trial of 2000 episodes is extremely noisy, so we both average over 400 trials for each algorithm and smooth the resulting averaged learning curve with a standard exponential-decay temporal moving average.
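For concreteness, the smoothing step can be sketched as follows; the function name and the decay constant are assumptions, since the article does not report the exact value used.

```python
import numpy as np

def smooth_curve(avg_curve, decay=0.99):
    """Exponential-decay temporal moving average of a learning curve.

    `avg_curve` holds per-episode performance already averaged over the 400
    independent trials; `decay` is an assumed smoothing constant."""
    avg_curve = np.asarray(avg_curve, dtype=float)
    smoothed = np.empty_like(avg_curve)
    running = avg_curve[0]
    for t, x in enumerate(avg_curve):
        running = decay * running + (1.0 - decay) * x
        smoothed[t] = running
    return smoothed
```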
We see that our learning algorithm converges toward the ideal allocation (as suggested by the theory), often relatively quickly. Furthermore, in each case this ideal asymptote is significantly better than the uniform allocation strawman, meaning that optimal allocations are highly nonuniform. Learning curves for the bandit approach exhibit one of three general behaviors across the set of 12 stocks. In some cases, the bandit approach is quite competitive with our algorithm, though perhaps converging to ideal slightly more slowly (not shown in Figure 4). In other cases, the bandit approach learns to outperform uniform allocation but appears to asymptote short of the ideal allocation. Finally, in some cases the bandit approach appears to actually “learn the wrong thing,” with performance decaying significantly over further episodes. This happens when one venue has a very heavy tail but also a relatively high probability of executing zero shares, and occurs because the very naive bandit approach we use does not maintain an explicit representation of the tails of the venue distributions.
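A toy calculation illustrates this failure mode. With invented liquidity distributions (not the simulator's), a venue that usually executes nothing but is heavy-tailed when it does looks far worse than a steady venue at small order sizes, yet overtakes it at V = 8000; a bandit that summarizes each venue by a single scalar score cannot represent this reversal.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000  # Monte Carlo draws per venue

# Invented liquidity distributions, purely for illustration:
# venue A executes nothing 90% of the time but is heavy-tailed when it does;
# venue B reliably executes a modest number of shares.
A = np.where(rng.random(N) < 0.9, 0.0, (rng.pareto(1.05, N) + 1) * 500)
B = rng.poisson(120, N).astype(float)

for v in (100, 1000, 8000):
    fill_A = np.minimum(v, A).mean() / v
    fill_B = np.minimum(v, B).mean() / v
    print(f"submit {v:5d} shares: venue A fills {fill_A:6.2%}, venue B fills {fill_B:6.2%}")
```

The reversal appears only when the submitted size is large enough to probe the tail, which is precisely the information a single per-venue score discards.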
The left column of Figure 5 shows more systematic
head-to-head comparisons of our algorithm’s performance versus the other allocation techniques after 2000
episodes for both small and large V. The values plotted are
averages of the last 50 points on learning curves similar to
Figure 4. These scatterplots show that across all 12 stocks
and both settings of V, our algorithm competes well with
the optimal allocation, dramatically outperforms uniform, and significantly outperforms the naive bandit allocations (especially with V = 8000). The average completion rate across all stocks for the large (small) order sequences is 10.0% (13.1%) for uniform and 13.6% (19.4%) for optimal allocations. Our algorithm performs almost as well as optimal, at 13.5% (18.7%), and much better than bandits at 11.9% (17.2%).
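As a small sketch of how each scatter point could be computed (the exact bookkeeping is not shown in the article), `final_performance` below is a hypothetical helper that averages the last 50 points of one averaged learning curve:

```python
import numpy as np

def final_performance(curve, window=50):
    """Average the last `window` points of an averaged learning curve,
    giving one coordinate of a scatter point in the left column of Figure 5."""
    return float(np.mean(np.asarray(curve, dtype=float)[-window:]))
```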
Figure 5. Comparison of our learning algorithm to the three baselines. In each plot, the performance of the learning algorithm is plotted on the y-axis and the performance of one of the baselines on the x-axis. Left column: evaluated by the fraction of submitted shares executed in a single time step; higher values are better, and points above the diagonal are wins for our algorithm. Right column: evaluated by order half-life; lower values are better, and points below the diagonal are wins for our algorithm. Each point corresponds to a single stock and order size; small orders (red plus signs) are 1000 shares, large orders (blue squares) are 8000 shares.

In the right column of Figure 5, we measure performance not by the fraction of V shares filled in one step, but by the natural alternative of order half-life: the number of steps of
repeated resubmission of any remaining shares required to get the total number executed above V/2. Despite the fact that our algorithm is not designed to optimize this criterion and that our theory does not directly apply to it, we see the same broad story on this metric as well: our algorithm competes with ideal, dominates uniform allocation, and beats the bandit approach on large orders. The average order half-life for large (small) orders is 7.2 (5.3) for uniform allocation and 5.9 (4.4) for the greedy algorithm on the true distributions. Our algorithm requires 6.0 (4.9) steps on average, while the bandit approach uses 7.0 (4.4) to trade the large (small) orders.
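As a concrete reading of this metric, the sketch below computes order half-life under the same hypothetical `allocator`/`venues` interfaces as the episode sketch earlier; the step cap is an arbitrary safety assumption.

```python
import numpy as np

def order_half_life(allocator, venues, V, max_steps=1000, rng=None):
    """Number of resubmission steps until more than V/2 shares have executed.

    Shares left unexecuted after a step are reallocated and resubmitted;
    `allocator` and `venues` are the hypothetical stand-ins from the episode
    sketch, and `max_steps` is an arbitrary safety cap."""
    rng = rng or np.random.default_rng()
    remaining = V
    for step in range(1, max_steps + 1):
        allocation = allocator(len(venues), remaining)
        filled = int(sum(min(v_i, venue(rng)) for v_i, venue in zip(allocation, venues)))
        remaining -= filled
        if V - remaining > V / 2:
            return step
    return max_steps  # half execution not reached within the cap
```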