ABSTRACT
Building sophisticated computer players for games has been of interest since the advent of artificial intelligence research. Monte Carlo tree search (MCTS) techniques have led to recent advances in the performance of computer players in a variety of games. Without any refinements, the commonly-used upper confidence bounds applied to trees (UCT) selection policy for MCTS performs poorly on games with high branching factors, because an inordinate amount of time is spent performing simulations from each sibling of a node before that node can be further investigated. Move-ordering heuristics are usually proposed to address this issue, but when the branching factor is large, it can be costly to order candidate actions. We propose a technique combining sampling from the action space with a naïve evaluation function for identifying nodes to add to the tree when using MCTS in cases where the branching factor is large. The approach is evaluated on a restricted version of the board game Risk with promising results.

Categories and Subject Descriptors
I.2.1 [Artificial Intelligence]: Applications and Expert Systems—Games

General Terms
Algorithms, Experimentation

Keywords
Monte-Carlo Tree Search, Risk, Evaluation Function, Sampling, Heuristics

1. INTRODUCTION
Since the advent of artificial intelligence research, developing computer players for various games has received a large amount of attention. This can be partially attributed to the fact that a wide variety of games provide suitable abstractions of real-world tasks. Conceptually, the simplest games are deterministic two-player games with perfect information, such as Checkers, Chess, and Go. While the best action in most such games can theoretically be found by minimax game-tree search [33], in practice high branching factors make exact computation infeasible, so the development of various search refinements and heuristics has been necessary to achieve performance comparable to humans.

A large improvement in the quality of Computer Go players can be attributed to the development of Monte Carlo tree search (MCTS) techniques in 2006 [22]. Subsequently, various refinements and heuristics have made considerable further improvements to the performance of Computer Go players making use of MCTS [13]. An important class of refinements are those modifying the exploration-exploitation tradeoff central to MCTS to de-emphasize exploration of new actions when branching factors are prohibitively large. Without such refinements, an inordinate amount of time is spent performing simulations from each sibling of a node before that node can be further investigated.

This work proposes such a technique using sampling from the action space in conjunction with a naïve evaluation function for identifying nodes to add to the tree during MCTS when the branching factor is large. The approach is evaluated on a restricted version of the board game Risk. The results indicate that the sampling approach can improve performance over classical approaches and MCTS implementations without this enhancement.
erations of the following phases, starting from a single root node, until the available time expires:

• Selection — Recursively selecting a best child node to explore, until a node to expand is identified.¹ Leaf nodes are always expanded if reached; internal nodes may be expanded if there are legal actions from the node not yet represented in the current search tree. The mechanism for identifying the best child is known as the tree policy.

• Expansion — Adding a new node to the search tree as a child of the node found in the previous step. This node corresponds to a new legal action to be explored.

• Simulation — Performing a simulation from the newly expanded node. A simulation (also known as a playout or rollout) typically involves playing the rest of the game using some simple random strategy. This strategy is known as the simulation policy.

• Backpropagation — Updating nodes in the search tree with information obtained from the simulation. Most commonly, nodes maintain information on the success rate of simulations performed starting from their descendants; however, various other information may also be kept.

¹We omit minor technical details involving the case where a node under consideration is a terminal node in the game tree.

Figure 1: The MCTS Algorithm (Figure from [5]).

When the search stops, the actions that lead from the root are considered and the best-performing action is returned. Coulom [13] considered various options for identifying the best-performing action — based on his work, it is now fairly common to consider the action that has been visited the most as the best. We use this "most robust child" convention in the rest of this work.

2.1.1 Tree policies for large branching factors
The tree policy determines which nodes are selected in the selection phase of the MCTS algorithm. Generally, MCTS considers each node in the search tree as a multi-armed bandit (where each arm corresponds to a possible next action), and a tree policy applies some strategy for the multi-armed bandit problem at the node. One of the first MCTS algorithms, upper confidence bounds applied to trees (UCT), applied UCB1 [2], a classical multi-armed bandit algorithm, at each node [22]. The authors proved convergence of this algorithm to optimal behaviour, despite the fact that nodes do not behave exactly like multi-armed bandits in tree search due to nonstationarity of the reward distribution, as discussed in [13].

The UCT tree policy identifies a node for expansion if there are legal actions from the corresponding state not yet in the search tree; once all such legal actions have been added to the search tree as children for a node, the policy selects the child node with the highest UCT value (also known as urgency). The UCT value of a node is calculated as

UCT(v) = \frac{Q(v)}{N(v)} + c \sqrt{\frac{\ln N(p_v)}{N(v)}}    (1)

where N(v) is the number of simulations from descendants of a node v, Q(v) is the number of those simulations won, p_v denotes the parent node of v, and c is a tunable exploration constant. Thus, the first term estimates the probability of winning from v, while the second term represents uncertainty in the estimate based on the sample size. In this setting, a higher value for c will result in exploration being emphasised relative to exploitation. An appropriate value for c must typically be found experimentally — the experiments in this work used c = 1.2, a value found experimentally in [17].

A practical problem with UCT is that it requires every legal action from a node to be added to the search tree before any action from that node is evaluated for a second time. This requirement is inconvenient when the game tree branching factor is large, so some heuristics bypass this behaviour. The best-known heuristics in this regard are first-play urgency (FPU) [16] and progressive widening [6, 12].

First-play urgency assumes that if some actions that have already been considered look promising enough, they can be explored further before trying other unexplored actions. This is controlled by means of a constant called the FPU value: if certain children's UCT values exceed the FPU value, further exploitation of those children is permitted. To ensure that all nodes are eventually considered in the limit, the FPU value is typically set to a value larger than one — a common value in practice, used in Oakfoam [32] and in this work, is 1.1 [23]. Using this technique allows good-looking actions found early in positions to be investigated more thoroughly, while not hampering exploration when initial actions look poor.

Progressive widening avoids early expansion in UCT by specifying a schedule for adding unexplored actions to nodes: after a certain number of MCTS iterations have descended through the node, another child becomes eligible to be added to the tree. Usually these schedules restrict the number of children to grow as the logarithm or some root of the number of simulations — examples of each case can be found in [12] and [27] respectively.

Unlike FPU, this approach could perform very poorly if legal actions for expansion are selected entirely randomly: even if initial actions look poor, the schedule prevents further exploration. For this reason, progressive widening orders the legal actions based on some quality heuristic [27] (such as an evaluation function), and expands them in decreasing order of the heuristic.

This ordering approach is often also applied to FPU to improve its performance. Common approaches which facilitate this, by modifying the UCT value formula, are progressive bias [6] and rapid action value estimation (RAVE) [15] — however, further discussion of these techniques is beyond the scope of this paper.
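To make the selection rule above concrete, the sketch below combines the UCT value of equation (1) with first-play urgency. It is a minimal illustration under our own naming rather than the implementation used in this work; only the constants c = 1.2 and the FPU value of 1.1 are taken from the text.

import math

EXPLORATION_C = 1.2   # exploration constant used in this work
FPU_VALUE = 1.1       # first-play urgency value for actions not yet in the tree

def uct_value(wins, visits, parent_visits, c=EXPLORATION_C):
    # UCT(v) = Q(v)/N(v) + c * sqrt(ln(N(p_v)) / N(v)); children already in the
    # tree have at least one simulation, so visits > 0.
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits, has_unexplored_actions):
    # children: list of (wins, visits) pairs for actions already in the tree.
    # Returns the index of the child to descend into, or None to signal that a
    # new (unexplored) action should be expanded at this node instead.
    best_value, best_index = float("-inf"), None
    for i, (wins, visits) in enumerate(children):
        value = uct_value(wins, visits, parent_visits)
        if value > best_value:
            best_value, best_index = value, i
    # With FPU, unexplored actions are only tried next if no explored child's
    # urgency exceeds the FPU value.
    if has_unexplored_actions and best_value <= FPU_VALUE:
        return None
    return best_index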
2.1.2 MCTS for stochastic domains
The version of Risk we consider in our investigations is a perfect information game, but nevertheless certain outcomes are determined by chance. Various studies have considered approaches to dealing with stochasticity and uncertain information in MCTS [5, 3, 31, 8, 24]. We follow a fairly straightforward extension of MCTS for dealing with this stochasticity, based on the approach originally proposed for backgammon in [31].

The approach modifies MCTS as follows: in the tree, actions with stochastic outcomes correspond to nodes, while the outcomes correspond to separate nodes. Stochastic action nodes are never selected for expansion during the selection step; instead, when the action node is added to the tree in an expansion step, all of the nodes corresponding to that action's outcomes are also added to the tree. Furthermore, in the selection process, the tree policy is ignored at nodes with stochastic outcomes — instead, an outcome is selected according to its probability of occurring.

This approach has drawbacks when there are many stochastic outcomes for some actions. However, in the case we considered, no node has more than three possible outcomes. When an attack node is expanded, all the possible outcomes are added to the tree, but are subsequently explored proportionally to their respective probabilities of occurring.

2.1.3 Related work
It generally seems that some kind of prior knowledge about candidate actions is needed in order to outperform the simple UCT algorithm combined only with vanilla FPU. When there are too many actions at a node, however, it is not feasible to obtain an explicit heuristic ordering on all these actions.

The work we discuss later presents an approach that uses sampling to avoid evaluating all candidate actions. Many implementations avoid this concern by abstracting the state and action spaces so that the branching factor is feasible for established approaches. We are not aware of previous studies directly addressing this problem in this context; however, related work has been done on similar problems.

Wang et al. [34] present algorithms for regret minimization of multi-armed bandits with infinitely many arms. Their work also applies to cases where the number of actions exceeds the number of simulations that can be performed. Presumably this approach could be incorporated into a tree search, but we are not aware of a convergence result for such an approach like the one for UCB1 in [22]. Couëtoux et al. [10] point out a theoretical convergence issue with applying progressive widening to continuous (and hence infinite) state and action spaces with stochastic transitions, and suggest double progressive widening to address it. Couëtoux and Doghmen [9] give empirical evidence that MCTS using double progressive widening can outperform regular progressive widening on appropriate domains. The application of RAVE to continuous state and action spaces has also been considered: Couëtoux et al. [11] effectively apply Gaussian kernel density estimation to obtain interpolated values for applying RAVE.

Churchill et al. [7] propose scripted action ordering in the context of minimax tree search with alpha-beta pruning for real-time strategy games. This approach leverages scripts representing simple domain knowledge: actions predicted by the scripts are investigated first to enhance subsequent pruning in the rest of the search. One can envision similar use of scripts and similar heuristics for identifying initial nodes to expand when employing progressive widening.

Gibson et al. [18] investigated the use of MCTS and UCT for drafting territories in the setup phase in the game of Risk. They extracted candidate tactical features for guiding territory drafting, and used supervised learning to establish which combinations of features were important for initial territory drafting. They showed that the drafting strategy can have a notable effect on the strength of computer Risk players, and incorporated their techniques into existing computer players for a Risk variant called Lux Delux [29]. Another Risk variant for which computer players have been developed is Domination [37].

Wolf [36] used learning techniques in the design of his AI players for Risk. He developed an evaluation function for assessing the game state, which, coupled with a game tree search approach, formed the basis for his AI players. Tan [30] applied Markov chain theory to model the outcome of Risk battles, but made some erroneous independence assumptions. Osborne [26] addressed these errors, and provides a table predicting battle outcomes which we make use of in this work.

Other examples of enhancing MCTS with an evaluation function include Coulom [13], who introduced a technique for mixing regular MCTS backup operators with minimax evaluations. Similar techniques have been shown to significantly improve [24] the quality of MCTS players in games like Breakthrough and LOA [35]. Lanctot et al. [24] introduced adding evaluation function values to nodes in the MCTS tree. These values are updated during the backup phase by considering the minimax values of a node's children. This effectively mixes minimax-style leaf evaluations with Monte-Carlo simulation results.

2.2 Risk
Risk is a modern strategy board game for 2-6 players invented by French film director Albert Lamorisse in 1957. Players vie for global domination by recruiting and deploying troops (or armies) to various territories, and then using these armies to battle one another. The goal of a player is to eliminate all other players from the board (or, alternatively, occupy every territory on the board).

The game inherently includes stochasticity from dice rolls determining the outcomes of battles, as well as cards drawn from a deck during the game. There is also imperfect information, since the cards drawn remain hidden for some time. In this study, we consider a simplified form of Risk obtained by: (a) restricting the game to two players; (b) removing the deck of cards; (c) simplifying the initial setup of the game as detailed below; and (d) constraining the number of troops used to attack or defend in a battle (both players must always attack and defend with the maximum permitted number of armies)². This eliminates the imperfect information component, as well as opponent modelling considerations required for dealing sensibly with many players. (For the complete rules of Risk, see [1].)

²In the original version of Risk, the attacking player could choose to attack with any number of armies between one and three. [36] showed that a player that attacks or defends with the maximum number of armies possible in an encounter has the highest probability of victory in that encounter.
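Looking back at the chance-node treatment of Section 2.1.2, the sketch below shows one way that behaviour can be realised: all outcome nodes are created when a stochastic action node is expanded, and during selection an outcome is drawn according to its probability instead of applying the tree policy. The class and attribute names are ours, and the example probabilities are hypothetical.

import random

class OutcomeNode:
    def __init__(self, probability):
        self.probability = probability  # chance of this outcome occurring
        self.children = []              # ordinary deterministic subtree below

class StochasticActionNode:
    def __init__(self, outcome_probabilities):
        # All outcome nodes are added as soon as the action node is expanded.
        self.outcomes = [OutcomeNode(p) for p in outcome_probabilities]

    def select_outcome(self):
        # The tree policy is ignored here: outcomes are sampled in proportion
        # to their probability, so they are explored proportionally over time.
        return random.choices(self.outcomes,
                              weights=[o.probability for o in self.outcomes],
                              k=1)[0]

# In the restricted Risk domain an attack node never has more than three
# outcomes; the probabilities below are purely illustrative.
attack = StochasticActionNode([0.37, 0.34, 0.29])
chosen_outcome = attack.select_outcome()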
Risk has a huge branching factor [36]. As an example, during the recruitment phase, newly recruited troops are allocated to the territories owned by a player. Thus, each possible allocation of these troops to the territories owned corresponds to a legal action. The number of possible troop allocations is \binom{n+m-1}{n}, where n is the number of troops to place and m the number of territories that the player owns. If, for instance, a player owns 21 territories and recruits 7 troops (as is typical at the start of a two-player Risk game), there are 888030 different legal actions. Note that our simplified Risk version still has states with such large branching factors.

In our restricted Risk domain, game play begins with an initial setup phase, after which players take turns. A player's turn consists of three distinct game phases, namely:

1. the troop recruitment phase,
2. the attack phase, and
3. the troop manoeuvre phase.

As in the original game, the territories are initially divided randomly, but equally, between the players. Thereafter, each player receives a pool of sixty armies to allocate to those territories. They then proceed to place these armies simultaneously on their respective territories. This approach is simpler than the original game: the original game provisioned forty armies to each player and another forty to a 'neutral' player, and players would place their troops in a turn-based fashion. We simplified this because the initial placement of troops was not of interest for the general strategies of players. In our engine, after the territories are allocated, one player gets randomly selected to place their troops first, then the other player places their troops.

In the attack phase, the player has the choice of where they want to initiate an attack from. The player then chooses a source and a neighbouring destination territory of the attack. A number of dice are then rolled (between one and three for the attacking player and one and two for the defending player, depending on the number of troops they are attacking/defending with) and the attack is resolved. If the attack leads to the player conquering a territory, the player may move as many troops as he wishes from the attacking territory to the conquered territory, subject to at least one troop remaining on the attacking territory.

Finally, the player can manoeuvre some troops by selecting a source and destination territory, as well as a number of troops to move from the source to the destination.³

³For manoeuvring, it is required that the source and destination territories be connected to each other by a chain of territories owned by the manoeuvring player.

The game ends when one of the players has successfully conquered every territory on the map.

3. PROPOSED APPROACH
We propose leveraging a crude evaluation function via sampling to guide the expansion phase of MCTS when the branching factor is so large that normal approaches are not feasible. This allows us to bypass issues with large action spaces by identifying a promising node to expand based on its estimated value. Our approach is proposed for situations where a significant portion of the MCTS running time is consumed by evaluating and ordering the possible actions from nodes. In such cases, we propose sampling a number K of unexplored legal actions, evaluating the resulting states, and expanding the action with the best evaluation. To obtain good performance, it is desirable that sampling from the action space is uniform, i.e. all possible actions are equally likely to be selected.

3.1 Implementation
We implemented an open-source MCTS-based computer player for our restricted version of Risk. We achieve uniform sampling [4] for all the phases in a player's turn in our Risk domain as follows. The attack and manoeuvre actions have somewhat reasonable branching factors, so it is reasonable to explicitly enumerate all possible attack actions, as well as the possible source-destination pairs for all manoeuvre actions. Attack actions can thus easily be sampled uniformly, while manoeuvre actions can be sampled uniformly using a binary search on an array constructed from the possible pairs and the number of units on the various territories. Moving after an attack is not dealt with as an MCTS action, but instead by considering each possible number of troops to move and selecting the number leading to the game state with the highest value according to the evaluation function.

For recruitment, a scheme based on randomized partitioning was used to sample actions: n troops (tokens) must be placed into m territories (buckets), so randomly generating such a partition effectively ensures a uniform random troop recruitment.

To ensure all actions can potentially be added to the tree, and to reduce inefficiency from redundant sampling when the number of remaining unexplored actions u becomes similar to K, the number of samples considered for expansion is set to max{u/2, 1} when u/2 < K.

When an action is identified as the best of a number of samples in the attack or manoeuvre phases, it is checked against the actions already in the tree, to ensure that duplicate actions are not added. Uniqueness is not checked during the recruitment phase, since selecting duplicates becomes very unlikely when the branching factor becomes sufficiently large. In this case, there are thus potentially multiple child nodes for the same recruitment action, and this may lead to not all recruitment actions being added to the tree. However, these effects should only play a noticeable role when a player has very few remaining territories, in which case the player has already effectively lost the game. As such, we believe this effect should not have a significant impact on our results.

3.1.1 Simulation policy
The simulation policy is a random playing strategy, enhanced with some heuristics to simplify decision making and to increase the speed of individual simulations. The heuristics are hand-crafted and essentially capture the core Risk strategy of one of the authors:

• Recruitment phase: all recruited troops are placed on a randomly selected frontier territory.⁴

• Attack phase: the territory used during recruitment is chosen as the attacking territory. The neighbouring enemy territories are then traversed (in a fixed order) and the first territory (if any) that can be conquered by the attacking territory with a probability higher than 0.5 is attacked once. (The success probability of conquering a territory is obtained from the table in [26].) This is repeated until a territory is conquered (in which case a random number of troops is moved to the conquered territory, the conquered territory becomes the new active territory, and attacking continues from there) or there are no more such territories.
• Manoeuvre phase: troops are shifted from the territory with the most troops to a random legal destination territory with the least troops. The number of troops is chosen such that the two territories have an equal number of troops after the manoeuvre (in the case of a tie, the source territory gets an additional troop). If the territory with the most troops is isolated, no manoeuvre is performed.

⁴A frontier territory is one adjacent to an enemy territory.

3.1.2 Evaluation function construction
For this work, we constructed a heuristic evaluation function for the Risk domain which is a linear function of the thirteen features designed and tested by Wolf [36]. More details on these features are in Appendix C. To determine the feature weights, we made use of confidence local optimisation (CLOP) [14], a noisy black-box parameter-tuning algorithm.

Training weights with CLOP was done by constructing a "Greedy AI": a simple one-ply game tree search using the evaluation function and best-of-50 sampling to determine the set of leaf nodes stemming from the root. It was limited to a maximum of 10 branches. This greedy player was ideal for weight training, as it makes decisions quickly and relies entirely on the evaluation function for its selection. The training consisted of repeatedly playing games between this greedy AI and a baseline heuristic AI (detailed in Section 3.2.2). CLOP used the results of these games to optimise the feature weights for better general game performance.

3.2 Other agents
We implemented various other computer players to compare our proposed sampling-based MCTS agent ["MCTS(20)"] against. These were:

• a player using the MCTS simulation policy as a strategy ["Simulation Player"];

• a baseline player ["Baseline"], following an aggressive variation of the simulation policy;

• a player based on expectiminimax (EMM) search with a restricted branching factor ["EMM"];

• an MCTS player with no sampling or evaluation function (i.e. the player essentially expands random actions) ["MCTS(1)"];

• an MCTS player that first expands the node suggested by the baseline strategy and thereafter uses the proposed sampling method ["MCTS Baseline"];

• an MCTS player that samples 100 actions instead of the usual 20 ["MCTS(100)"].

Below we briefly outline EMM search and the approach used in our baseline player. The setup phase for all players is done in the same way (since it was not the focus of the investigation). During the setup phase, troops are placed on territories, in round-robin fashion, until every territory has either two or three troops.

3.2.1 EMM player
The EMM algorithm is a recursive depth-first search algorithm that builds a game tree to a predetermined depth, evaluating leaf nodes using an evaluation function, and propagating leaf values up the tree [25, 28]. It is an extension of the minimax tree search algorithm using stochastic nodes to model stochastic events. The value of a stochastic node is determined by calculating the mean value of its child nodes, thus making use of the probability of observing each stochastic outcome.

The EMM player in this work was restricted to examine only ten children per deterministic node, with these children identified using best-of-K sampling based on our evaluation function. This approach allowed a deeper tree to be built in the same time frame that the MCTS player received. The strategy of the EMM player for selecting the number of troops to move to a conquered territory was identical to that of the MCTS players.

3.2.2 Baseline player
The baseline player implements an aggressive version of the MCTS simulation policy as a playing strategy. The modification to the strategy is made to increase the playing strength of the player, while sacrificing some of the simulation speed.

During the recruitment phase, all recruited troops are placed on a frontier territory in order to maximize the resulting ratio of troops on the territory to the sum of opponent's troops on adjacent territories. (For example, if after placement there are three troops on the territory and the opponent has four neighbouring territories each with two troops, the ratio is 3/(2 × 4) = 3/8 = 0.375.) In the case of ties, one of the tied territories is chosen randomly.

In the attack phase, the player repeatedly attacks from its active territory to a (randomly selected) most vulnerable adjacent opponent territory (i.e. an adjacent opponent territory with the least number of troops) until further attack from the active territory is no longer possible. Initially, the active territory is the territory on which the player placed his recruited troops. If an opponent territory is defeated, all but one of the troops on the active territory are shifted to the defeated territory, which then becomes the active territory. When moving troops after successfully conquering an enemy territory, the baseline player moves all available troops from the attacking territory to the conquered territory.

The manoeuvre phase is handled exactly the same as in the simulation policy.
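Before turning to the experiments, the following sketch illustrates the best-of-K expansion of Section 3.1 for the recruitment phase, including uniform sampling of troop allocations via a randomized partition (the stars-and-bars correspondence). It is a simplified illustration rather than the released implementation [4]; in particular, evaluate() stands in for the heuristic evaluation function of Section 3.1.2 and is an assumption here.

import random

def sample_recruitment(n_troops, territories):
    # Uniformly sample one allocation of n_troops over the given territories:
    # choose the positions of len(territories) - 1 "bars" among
    # n_troops + len(territories) - 1 slots and read troop counts off the gaps.
    slots = n_troops + len(territories) - 1
    bars = sorted(random.sample(range(slots), len(territories) - 1))
    counts, previous = [], -1
    for b in bars + [slots]:
        counts.append(b - previous - 1)
        previous = b
    return dict(zip(territories, counts))

def expand_best_of_k(state, n_troops, territories, evaluate, k=20):
    # Sample k candidate recruitment actions and return the one whose resulting
    # state scores highest under the (assumed) evaluation function.
    best_action, best_score = None, float("-inf")
    for _ in range(k):
        action = sample_recruitment(n_troops, territories)
        score = evaluate(state, action)
        if score > best_score:
            best_action, best_score = action, score
    return best_action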
4. EXPERIMENTS AND RESULTS
We measured the relative playing strength of various computer players by using game outcomes to determine rating intervals for each player. The rating system used is the Glicko rating system developed by Glickman [19, 20]. This system is more desirable than the classical Elo rating system (which it generalizes), since each player x is assigned not only a rating µ_x, but also a rating deviation σ_x, which corresponds to the standard deviation of the estimated player rank. As with the Elo rating system, a difference of 100 rating points corresponds to an estimated 64% probability of winning for the higher rated player. To measure and keep track of the ratings, we used a Java library for Glicko-2 [21], a further generalization of the Glicko rating system. The Glicko system also caters for changes in rating deviation over time, but since computer players do not change in strength over time, we disabled this by setting each player's rating volatility as well as the Glicko-2 rating system constant τ to zero.

A game manager was implemented to schedule games between the players with the aim of identifying statistically significant differences in performance. Two pools of players were created, corresponding to two experiments. The first experiment compared the performance of MCTS(K) with other, (mostly) non-MCTS agents, while the second investigated the effect of K on the performance of MCTS(K). In each pool, players were given an initial Glicko rating of 1500 and a rating deviation of 350 (the default for an unranked player), and rating values were updated in the pool after each game was completed. The scheduler initially performed random pairings of players in each pool, to provide an initial indication of ratings. After that, games were typically scheduled by identifying pairs of players in the pool for which the evidence of a difference between the ratings was weak (to efficiently obtain statistical significance). This evidence for a pair of players x and y was quantified by calculating the standard score of the difference between their ratings:

Z_{xy} = \frac{|\mu_x - \mu_y|}{\sqrt{\sigma_x^2 + \sigma_y^2}}    (2)

This helped to identify matches that would be the most informative, as the results would potentially help significantly distinguish players. Note that the ratings obtained in different experiments are not directly comparable. In particular, there is no reason to expect a similar rating for the same agent in both experiments.⁵

⁵Although each experiment is reported on individually, the results of all the games were pooled to obtain a single rating and rating deviation for each agent once both experiments were completed. This makes it possible to compare the ratings of agents not involved in the same experiment to some extent — these results are presented in Appendix B.

For all matches, the starting player was randomly determined. The EMM agent uses best-of-50 sampling where applicable, and a five-ply search to select actions, which typically took a few seconds. The MCTS agents generated around 1000 playouts per second, and were given 2.5 seconds of computation time to choose each action. In order to get enough game results, we stopped games once one of the players had a ten territory advantage over the other player, subject to at least 100 actions having been performed.
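As an illustration of the scheduling criterion in equation (2), the short sketch below ranks candidate pairings by their standard score, so that the least-separated pairs (the most informative matches) are scheduled first. The ratings shown are a subset of those reported in Table 1; the function and variable names are ours.

import math
from itertools import combinations

ratings = {  # player: (rating, rating deviation), taken from Table 1
    "Baseline": (1669.23, 32.71),
    "MCTS Baseline": (1595.03, 30.23),
    "MCTS(100)": (1421.14, 40.56),
    "MCTS(20)": (1333.77, 28.46),
}

def standard_score(x, y):
    (mu_x, sd_x), (mu_y, sd_y) = ratings[x], ratings[y]
    return abs(mu_x - mu_y) / math.sqrt(sd_x ** 2 + sd_y ** 2)

# Pairs with the weakest evidence of a rating difference come first.
for x, y in sorted(combinations(ratings, 2), key=lambda pair: standard_score(*pair)):
    print(f"{x} vs {y}: Z = {standard_score(x, y):.2f}")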
4.1 Results
The final player ratings and rating deviations of the players in the first experiment are presented in Table 1. Figure 2 presents 95% confidence intervals for the various ratings, and Table 2 shows the P-values for detecting significant differences between players in this experiment.

Table 1: Ratings and rating deviations of our proposed approach and other agents.

Player              Rating    Rating Deviation
Baseline            1669.23   32.71
MCTS Baseline       1595.03   30.23
MCTS(100)           1421.14   40.56
MCTS(20)            1333.77   28.46
MCTS(1)             1319.97   39.33
Simulation Player   1270.66   27.29
EMM                 1157.42   43.48

Figure 2: 95% confidence intervals for the ratings in Table 1. These are calculated as twice the rating deviation on either side of the estimated rating, as recommended in [19].

Table 2: P-values for testing whether the player in the column is stronger than the player in the row (to 4 significant digits).

                    Baseline   MCTS Baseline   MCTS(100)   MCTS(20)   MCTS(1)   Simulation Player
Baseline            -          -               -           -          -         -
MCTS Baseline       0.0485     -               -           -          -         -
MCTS(100)           < 10⁻⁴     0.0003          -           -          -         -
MCTS(20)            < 10⁻⁴     < 10⁻⁴          0.0392      -          -         -
MCTS(1)             < 10⁻⁴     < 10⁻⁴          0.0367      0.3897     -         -
Simulation Player   < 10⁻⁴     < 10⁻⁴          0.0011      0.0548     0.1515    -
EMM                 < 10⁻⁴     < 10⁻⁴          < 10⁻⁴      0.0003     0.0028    0.0139

We see that the best performing player is, surprisingly, the baseline player. At first, it seems the observed performance of the baseline strategy over our proposed MCTS strategies could possibly be explained by the fact that the baseline strategy's actions are good enough that even the best-of-K sampling strategy would be unlikely to sample actions better than the baseline strategy. Another possible explanation could be that the evaluation function consistently undervalues the results of actions generated by the baseline strategy. However, our MCTS approach augmented with the baseline action as the first candidate action is also significantly outperformed by the baseline strategy, indicating that both of these explanations do not adequately explain the observation. Instead, a likely explanation is that the difference in performance is because the baseline strategy moves all available troops into conquered territories, while all the MCTS approaches select the number of troops to move by optimizing with respect to the evaluation function⁶. Thus the situation for the MCTS agent after selecting the baseline action is different to the situation for the baseline player. Another possibility, however, is that the poorer performance of MCTS augmented with the baseline action relative to the baseline player indicates that the MCTS search, which is guided by the simulation results, tends to favour the non-baseline actions in the tree. This seems unlikely because of the high similarity between the simulation policy and the baseline player.

⁶Since the submission of this paper, all the agents have been modified to employ MCTS-style search for selecting how many troops to move after conquering a territory. With this approach, initial experiments indicate no significant difference between the baseline and MCTS Baseline players, supporting this hypothesis.

Despite being outperformed by the baseline player, we see that using a naïve evaluation function with sampling to guide MCTS provides an improvement over the MCTS simulation strategy, naïve MCTS (i.e. MCTS(1)), as well as the EMM agent. Furthermore, we see that MCTS(100) significantly outperforms MCTS(20). A more thorough investigation of the effect of K on the performance was conducted in our second experiment. The results are presented in Table 3. We see that increasing the sample size improved the performance of the player (given the same time constraints). There is evidence of diminishing returns as K increases, however, so we expect that for larger K performance will plateau before beginning to degrade once the sampling overhead becomes too large: in general, it seems the optimal value of K should depend on the branching factor. In this experiment, the strength increases between successive values of K are generally not statistically significant. However, the trend is quite visible, and the differences between players with small K and those with large K are statistically significant. As far as we are aware, these are the first results reported in the literature illustrating and quantifying improvements in the performance of MCTS due to selective expansion of nodes.

Table 3: Performance comparison of various choices of K for our proposed sampling approach.

K     Rating    Rating Deviation
1     1393.98   95.42
2     1357.79   86.14
5     1483.57   76.66
10    1556.21   68.51
20    1582.86   81.48
50    1605.82   72.92
100   1621.76   73.16

5. CONCLUSIONS AND FUTURE WORK
This study presents a sampling-based approach to using an evaluation function for guiding MCTS during the expansion phase in domains with very large branching factors. Experiments in Risk show improved results over naïve MCTS, despite a hand-crafted strategy outperforming the proposed approach.

It would be valuable to investigate the performance of our approach with a better-tuned evaluation function (or in another domain with a good evaluation function): a weak evaluation function may well consistently rate good actions poorly and vice versa, so that all the expanded actions in the MCTS tree are poor. Another avenue to investigate in this regard is not always selecting the sample with the highest evaluation, but selecting samples stochastically based on the various samples' evaluations (for example, by using Boltzmann sampling).

6. ACKNOWLEDGEMENTS
The authors would like to thank Marc Lanctot for his feedback on a draft of this paper. Some of this work was done at the MIH Media Lab at Stellenbosch University.

7. REFERENCES
[1] Risk rule book. http://www.hasbro.com/common/instruct/risk.pdf, 1997. [Online; accessed 15-October-2013].
[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[3] R. Bjarnason, A. Fern, and P. Tadepalli. Lower bounding Klondike Solitaire with Monte-Carlo planning. In ICAPS, 2009.
[4] D. Brand. Risk. https://github.com/DirkBrand/Risk/releases/tag/v1.1, 2013.
[5] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
[6] G. M. J. Chaslot, M. H. M. Winands, H. J. van den Herik, J. W. H. M. Uiterwijk, and B. Bouzy. Progressive strategies for Monte-Carlo tree search. New Mathematics and Natural Computation, 4(3):343–357, 2008.
[7] D. Churchill, A. Saffidine, and M. Buro. Fast heuristic search for RTS game combat scenarios. In AIIDE, 2012.
[8] P. Ciancarini and G. P. Favini. Monte Carlo tree search in Kriegspiel. Artificial Intelligence, 174(11):670–684, 2010.
[9] A. Couëtoux and H. Doghmen. Adding double progressive widening to upper confidence trees to cope with uncertainty in planning problems. In The 9th European Workshop on Reinforcement Learning (EWRL-9), 2011.
[10] A. Couëtoux, J.-B. Hoock, N. Sokolovska, O. Teytaud, and N. Bonnard. Continuous upper confidence trees. In Learning and Intelligent Optimization, pages 433–445. Springer, 2011.
[11] A. Couëtoux, M. Milone, M. Brendel, H. Doghmen, M. Sebag, and O. Teytaud. Continuous rapid action value estimates. In The 3rd Asian Conference on Machine Learning (ACML 2011), volume 20, pages 19–31, 2011.
[12] R. Coulom. Computing Elo ratings of move patterns in the game of Go. In Computer Games Workshop, 2007.
[13] R. Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In Computers and Games, pages 72–83. Springer, 2007.
[14] R. Coulom. CLOP: Confident local optimization for noisy black-box parameter tuning. In Advances in Computer Games, pages 146–157. Springer, 2012.
[15] S. Gelly and D. Silver. Combining online and offline knowledge in UCT. In Proceedings of the 24th International Conference on Machine Learning, pages 273–280. ACM, 2007.
[16] S. Gelly and Y. Wang. Exploration exploitation in Go: UCT for Monte-Carlo Go. In NIPS On-line Trading of Exploration and Exploitation Workshop, 2006.
[17] S. Gelly, Y. Wang, R. Munos, and O. Teytaud. Modification of UCT with patterns in Monte-Carlo Go. Technical report, HAL-INRIA, 2006.
[18] R. G. Gibson, N. Desai, and R. Zhao. An automated technique for drafting territories in the board game Risk. In AIIDE, 2010.
[19] M. Glickman. Parameter estimation in large dynamic paired comparison experiments. Applied Statistics, 48:377–394, 1999.
[20] M. Glickman. Glicko rating system, 2013. [Online; accessed 28-March-2013].
[21] J. Gooch. Java implementation of the Glicko-2 rating algorithm. https://github.com/goochjs/glicko2, 2010. [Online; accessed 17-October-2013].
[22] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In Machine Learning: ECML 2006, pages 282–293. Springer, 2006.
[23] T. Kozelek. Methods of MCTS and the game Arimaa. Master's thesis, Faculty of Mathematics and Physics, Charles University, Prague, 2009.
[24] M. Lanctot, M. H. Winands, T. Pepels, and N. R. Sturtevant. Monte Carlo tree search with heuristic evaluations using implicit minimax backups. arXiv preprint arXiv:1406.0486, 2014.
[25] D. Michie. Game-playing and game-learning automata. Advances in Programming and Non-numerical Computation, pages 183–200, 1966.
[26] J. A. Osborne. Markov chains for the Risk board game revisited. Mathematics Magazine, pages 129–135, 2003.
[27] P. Rolet, M. Sebag, and O. Teytaud. Boosting active learning to optimality: a tractable Monte-Carlo, billiard-based algorithm. In Machine Learning and Knowledge Discovery in Databases, pages 302–317. Springer, 2009.
[28] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 3rd edition, 2009.
[29] Sillysoft Games. Lux Delux, 2013. [Online; accessed 20-June-2014].
[30] B. Tan. Markov chains and the RISK board game. Mathematics Magazine, 70:349–357, 1997.
[31] F. Van Lishout, G. Chaslot, and J. W. Uiterwijk. Monte-Carlo tree search in Backgammon. In Computer Games Workshop, 2007.
[32] F. van Niekerk. Oakfoam - Computer player for the game of Go. http://oakfoam.com/, 2012. [Online; accessed 18-March-2014].
[33] J. von Neumann and O. Morgenstern. The Theory of Games and Economic Behavior. Princeton University Press, 1944.
[34] Y. Wang, J.-Y. Audibert, and R. Munos. Algorithms for infinitely many-armed bandits. In NIPS, volume 8, pages 1–8, 2008.
[35] M. H. M. Winands, Y. Björnsson, and J.-T. Saito. Monte Carlo tree search in Lines of Action. IEEE Transactions on Computational Intelligence and AI in Games, 2(4):239–250, 2010.
[36] M. Wolf. An Intelligent Artificial Player for the Game of Risk. PhD thesis, Darmstadt University of Technology, 2005.
[37] yuranet. Domination (Risk board game). [Online; accessed 27-Mar-2013].

APPENDIX
A. SOURCE CODE
All source code that was used in this work is part of the open-source Risk framework [4]. Version 1.1 was used for the work in this paper and is tagged in the code repository. All default parameters were used.

B. POOLED EXPERIMENTAL RESULTS
Table 4 summarizes the results of the pooled experiments.
Table 4: Summary of game results and ratings of the various players, after pooling results from both experiments. The first two columns indicate the player rating and rating deviation, while the remaining columns show the numbers of wins and total games played per player pair. Here a cell's entry shows the number of wins of the row's player against the column's player, as well as the total number of games played between the pair. Entries of 0/0 occur when players were not in the same initial experiment. The last column shows total number of wins and games played by the player in each row.

Player          Rating    RD      EMM    MCTS(2)  Simulation AI  MCTS(5)  MCTS(1)  MCTS(20)  MCTS(50)  MCTS(10)  MCTS(100)  MCTS Baseline  Totals
EMM             1162.81   45.39   -      -        -              -        -        -         -         -         -          -              20/118
MCTS(2)         1213.32   99.29   0/0    -        -              -        -        -         -         -         -          -              6/19
Simulation AI   1305.03   25.99   7/21   0/0      -              -        -        -         -         -         -          -              119/305
MCTS(5)         1305.03   86.18   0/0    2/3      0/0            -        -        -         -         -         -          -              10/22
MCTS(1)         1334.42   37.60   12/16  0/2      10/18          1/3      -        -         -         -         -          -              49/135
MCTS(20)        1359.52   26.42   13/13  1/2      36/123         1/1      16/25    -         -         -         -          -              105/282
MCTS(50)        1382.54   75.40   0/0    2/3      0/0            4/5      2/3      2/3       -         -         -          -              14/27
MCTS(10)        1403.29   73.58   0/0    5/6      0/0            4/7      1/3      4/6       3/4       -         -          -              18/30
MCTS(100)       1446.75   35.23   28/28  3/3      6/18           1/3      14/22    9/22      6/9       3/4       -          -              86/147
MCTS Baseline   1646.28   28.14   17/18  0/0      88/91          0/0      13/16    26/37     0/0       0/0       9/19       -              189/282
Baseline        1724.37   30.53   21/22  0/0      32/34          0/0      24/27    39/50     0/0       0/0       13/19      65/101         194/253

C. FEATURES
All the features are scaled to be between 0 and 1. This was not explicitly done in Wolf's work [36], but was deemed necessary for training the weights and for subsequent experiments.

1. Armies. The percentage of the total number of armies that the current player owns.

2. Best Enemy. The relative strength of the best enemy player. Since our version of Risk only allows for two-player games, this would just return the strength of the single opponent.

3. Continent Safety. The relative threat from the opponent against continents completely occupied by the current player.

4. Continent Threat. The relative threat the current player poses against continents completely occupied by the opposing player.

5. Distance To Frontier. The average distance of troops from frontier territories. Effectively measures the army distribution of the player.

6. Enemy Estimated Reinforcement. An estimate of how many troops the enemy might recruit in the next turn.

7. Enemy Occupied Continents. The number of continents completely occupied by the opposing player.

8. Hinterland. The percentage of player territories that are not adjacent to any enemy territories.

9. Maximum Threat. A measure that looks at all possible attack source and destination combinations, calculates the victory probability of the battle and considers the maximum of these.

10. More Than One Army. The percentage of the current player's territories that have more than one troop on it.

11. Occupied Territories. The number of territories that the current player occupies in relation to the total number of territories.

12. Own Estimated Reinforcement. A measure that estimates how many troops the current player would recruit in their next turn.

13. Own Occupied Continents. The number of continents the current player occupies completely.

For more details on these feature calculations, see [36].
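The thirteen features above feed the evaluation function of Section 3.1.2 as a simple weighted sum over values scaled to [0, 1]. The sketch below shows that shape; the uniform weights are hypothetical placeholders, since the CLOP-tuned weights [14] are not listed in the paper.

FEATURE_NAMES = [
    "armies", "best_enemy", "continent_safety", "continent_threat",
    "distance_to_frontier", "enemy_estimated_reinforcement",
    "enemy_occupied_continents", "hinterland", "maximum_threat",
    "more_than_one_army", "occupied_territories",
    "own_estimated_reinforcement", "own_occupied_continents",
]

# Hypothetical weights; the tuned values are not given in the text.
WEIGHTS = {name: 1.0 / len(FEATURE_NAMES) for name in FEATURE_NAMES}

def evaluate(features):
    # features: dict mapping each feature name to a value already scaled to [0, 1].
    return sum(WEIGHTS[name] * features[name] for name in FEATURE_NAMES)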