outcomes (which is a single game outcome, if deterministic)
given any fixed future sequence of actions. Another way of
putting this latter condition is to say that both players have
perfect knowledge of the game’s state. In board games such
as Go, chess, and checkers, disregarding subtleties such as
castling restrictions in chess and ko rules in Go, the state is
identical to the board position, that is, the configuration of
all pieces on the board.
The rules of the game determine the terminal states in
which the game ends. There is a reward associated with each
terminal state, which determines how much Black earns
if the game ends in that state. There are no intermediate
rewards, that is, the reward associated with each nonterminal state is zero. The goal of Black is to get the highest final
reward, while the goal of White is to minimize Black’s reward.
A game tree organizes the possible future action sequences
into a tree structure. The root of the tree represents the initial state (and the empty action sequence), while each other
node represents some nonempty, finite action sequence
of the two players. Each finite action sequence leads deterministically to a state, which we associate with the node
corresponding to that action sequence (Figure 3).
Note that the same state can be associated with many
nodes of the tree, because the same state can often be
reached by many distinct action sequences, known as
transpositions. In this case, the game can be represented more
compactly by a directed acyclic graph over the set of states.
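As a concrete illustration (a minimal Python sketch, not taken from the article), a tree node can pair the state reached by its action sequence with its move-indexed children; transpositions arise when several such nodes hold the same state.

from dataclasses import dataclass, field

@dataclass
class Node:
    state: object        # board position reached by this node's action sequence
    to_move: str         # "black" or "white"
    children: dict = field(default_factory=dict)   # move -> child Node; distinct nodes may share a state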
The optimal value of a game tree node is the best possible
value that the player at that node can guarantee for himself,
assuming that the opponent plays the best possible counter-strategy. The mapping from the nodes (or states) to these values is called the optimal value function. Similarly, the optimal
action value of a move at a node is defined to be the optimal
value of the child node for that move.
If the optimal values of all children are known, then it
is trivial to select the optimal move at the parent: the Black
(White) player simply chooses the move with the highest (lowest) action-value. Assuming that the tree is finite,
the optimal value of each node can be computed by working backward from the leaves, using a recursive procedure
known as minimax search.
Figure 3. A minimax game tree for a small two-player game. Black selects actions to maximize his value; White selects actions to minimize her value.
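To make the backward-induction step concrete, the following minimal Python sketch computes the optimal value of a node by exhaustive minimax; the node interface (is_terminal, reward, children) is an assumption made for illustration, not an interface defined in the article.

def minimax(node, black_to_move):
    # Exhaustive minimax: back the terminal rewards up the tree.
    if node.is_terminal():
        return node.reward()                      # Black's reward at a terminal state
    values = [minimax(child, not black_to_move) for child in node.children()]
    # Black maximizes the value, White minimizes it.
    return max(values) if black_to_move else min(values)

Selecting the optimal move at the root is then just an argmax (argmin for White) over the children's values.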
While minimax search leads to optimal actions, it is
utterly intractable for most interesting games; the computation is proportional to the size of the game tree, which
grows exponentially with its depth. A more practical alternative is to consider a subtree of the game tree with limited
depth. In this case, computation begins at the leaves of
the subtree. The (unknown) true optimal values at the leaf
nodes are replaced with values returned by a heuristic
evaluation function. If the evaluation function is of sufficiently
“high quality,” the action computed is expected to be near-optimal. The computation can be sped up by various techniques, the best known being α–β pruning, which is
often used together with iterative deepening.
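A rough sketch of this depth-limited variant with α–β pruning is shown below, reusing the assumed node interface from the minimax sketch and taking a caller-supplied heuristic evaluate function for the leaves of the truncated subtree; it is illustrative, not the implementation of any particular program.

def alphabeta(node, depth, alpha, beta, black_to_move, evaluate):
    if node.is_terminal():
        return node.reward()
    if depth == 0:
        return evaluate(node)                     # heuristic stands in for the unknown true value
    if black_to_move:
        value = float("-inf")
        for child in node.children():
            value = max(value, alphabeta(child, depth - 1, alpha, beta, False, evaluate))
            alpha = max(alpha, value)
            if alpha >= beta:                     # remaining children cannot change the outcome
                break
        return value
    else:
        value = float("inf")
        for child in node.children():
            value = min(value, alphabeta(child, depth - 1, alpha, beta, True, evaluate))
            beta = min(beta, value)
            if alpha >= beta:
                break
        return value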
The evaluation function is typically provided by human
experts, or it can be tuned either using supervised learning
based on a database of games or using reinforcement learning and self-play.22 Programs based on variants of minimax
search with α–β pruning have outperformed human world
champions in chess, checkers, and Othello.22
3.2. Monte-Carlo simulation
In some games of interest, for example in the game of Go,
it has proven hard to encode or learn an evaluation function
of sufficient quality to achieve good performance in a minimax search. Instead of constructing an evaluation function,
an alternative idea is to first construct a policy (sometimes
called a playout policy) and then to use that policy to estimate the values of states. A policy is a mapping from states
to actions; in other words, a policy determines the way to
play the game. Given a policy pair (one policy for each player,
which if symmetric can be represented by a single policy),
a value estimate for a state s can be obtained by simulation:
start in state s and follow the respective policies in an alternating manner from s until the end of the game, and use the
reward in the terminal state as the value of state s. In some
games, it is easier to estimate the value indirectly by simulation, that is, it may be easier to come up with a simple policy
that leads to good value estimates via simulation than to
estimate those values directly.
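A minimal sketch of such a simulation is given below, assuming a hypothetical state interface (is_terminal, reward, apply) and policies represented as functions from states to actions; none of these names come from the article.

def simulate(state, black_policy, white_policy, black_to_move):
    # Play out one game from `state` with a fixed policy pair and
    # return the terminal reward (Black's payoff) as the value estimate.
    while not state.is_terminal():
        policy = black_policy if black_to_move else white_policy
        state = state.apply(policy(state))
        black_to_move = not black_to_move
    return state.reward()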
A major problem with the approach described so far
is that it can be very sensitive to the choice of policy. For
example, a good policy may choose an optimal action in 90%
of states but a suboptimal action in the remaining 10% of
states. Because the policy is fixed, the value estimates will
suffer from systematic errors, as simulation will always produce a single, fixed sequence of actions from a given state.
These errors may often have disastrous consequences, leading to poor evaluations and an exploitable strategy.
Monte-Carlo methods address this problem by adding
explicit randomization to the policy and using the expected
reward of that policy as the value estimate. The potential
benefit of randomization is twofold: it can reduce the influence of systematic errors and it also allows one to make a
distinction between states where it is “easy to win” (i.e.,
from where most reasonable policy pairs lead to a high
reward terminal state) and states where it is “hard to win.”
This distinction pays off because real-world opponents are
also imperfect, and therefore it is worthwhile to bias the
game toward states with many available winning strategies.
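As a sketch of this idea, the value estimate can be formed by averaging the terminal rewards of many playouts under a randomized policy; here the uniformly random policy and the state methods copy and legal_moves are illustrative assumptions, and simulate is the function from the previous sketch.

import random

def monte_carlo_value(state, black_to_move, num_playouts=100):
    # Average Black's terminal reward over playouts with a randomized policy.
    def random_policy(s):
        return random.choice(s.legal_moves())
    total = 0.0
    for _ in range(num_playouts):
        total += simulate(state.copy(), random_policy, random_policy, black_to_move)
    return total / num_playouts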