and a few other enhancements allowed
us to compute λ27 and push the lower
bound on λ above 4.0. Here are the
main memory-reduction tricks we used
(see the online appendix for more):
Elimination of unreachable states.
Approximately 11% of all states of the
automaton are unreachable and do not
affect the result;
Bit-streaming of the succ0/1 arrays.
Instead of storing each entry of these
arrays in a full word, we allocated just
the required number of bits and stored
the entries consecutively in a packed
manner; in addition, we eliminated all
the null succ0-pointers (approximately
11% of all pointers);
Storing higher groups only once.
Approximately 50% of the states (those not
in G1) were not needed in the recursion
(∗); we thus did not keep the
yold[⋅] entries of those states in memory;
Recomputing succ0. We computed
the entries of the succ0 array on-the-fly
instead of keeping them in memory; and
Streamlining the successor computation. Instead of representing a Motzkin
path by a sequence of W + 1 integers,
we kept a sequence of W + 1 two-bit
items we could store in one 8-byte
word; this representation allowed
us to use word-level operations and
look-up tables for a fast mapping
from paths to numbers.
To make full use of the parallel capacities, we used the partition of the
set of states into groups, as in Figure 5, such that ynew of all elements
in a group could be computed independently. We also parallelized the
computation of the succ arrays in
the preprocessing phase and various
housekeeping tasks.
Execution. After 120 iterations,
the program announced the lower
bound 4.00064 on λ27, thus breaking
the 4 barrier. We continued to run
the program for a few more days. On
May 23, 2013, after 290 iterations,
the program reached the stable bounds
4.002537727 ≤ λ27 ≤ 4.002542973
(observed over several tens of successive iterations), establishing the new
record λ > 4.00253. The total running
time for the computations leading
to this result was approximately 36
hours using approximately 1.5TiB
of RAM. We had exclusive use of the
server for a few dozen hours in total,
spread over several weeks.
(see Figure 6), we estimated that only
when we reach W = 27 would we break
the mythical barrier of 4.0. However,
as mentioned earlier, the storage
requirement is proportional to MW,
and MW increases roughly by a factor of 3 when W is incremented by 1.
With this exponential growth of both
memory consumption and running
time, the goal of breaking the barrier
was then out of reach.
Computing λ27
Environment. In the spring of 2013,
we were granted access to the Hewlett
Packard ProLiant DL980 G7 server of
the Hasso Plattner Institute Future
SOC Lab in Potsdam, Germany (see
Figure 7). This server consists of eight
Intel Xeon X7560 nodes (Intel64 architecture), each with eight physical
2.26GHz processors (16 virtual cores),
for a total of 64 processors (128 virtual
cores). Each node was equipped with
256GiB of RAM (and 24MiB of cache
memory) for a total of 2TiB of RAM.
Simultaneous access by all processors to
the shared main memory was crucial to
project success. Distributed memory
would incur a severe penalty in
running time. The machine ran under the
Ubuntu version of the GNU/Linux
operating system. We used the C-compiler
gcc with OpenMP 2.0 directives for
parallel processing.
Programming improvements. Since
for W = 27 the finite automaton has M28
≈ 2.1 · 10^11 states, we initially estimated
we would need memory for two 8-byte
arrays (for storing succ0 and succ1) and
two 4-byte arrays (for storing yold and
ynew), all of length M28, for a total of
24 · 2.1 · 10^11 bytes ≈ 4.6TiB of RAM, which
exceeded the server’s available capacity. Only a combination of parallelization, storage-compression techniques,
Figure 7. Front view of the “supercomputer” we used, a box of approximately 45 × 35 × 70 cm.
Figure 6. Extrapolating the sequence λW.