AI Report
Abstract—The Artificial Intelligence Lab report presents a comprehensive exploration and discussion of solutions to Artificial Intelligence challenges using established software libraries and techniques that have been the focal point of our coursework. This report delves into the practical application of these concepts, showcasing how they can address real-world problems. It covers a wide range of topics, including graph search, heuristic functions, the Travelling Salesman Problem, non-deterministic search, simulated annealing, MINIMAX, alpha-beta pruning, Bayesian networks, hidden Markov models, decision trees, Hopfield networks, and the n-armed bandit problem. In this report, we also discuss and develop algorithms and design solutions using popular Python libraries.
In summary, the Artificial Intelligence Lab Report offers a valuable resource for understanding the practical application of Artificial Intelligence concepts and tools. It sheds light on the challenges and opportunities in this field, emphasizing the relevance of coursework materials in addressing real-world problems.

I. INTRODUCTION

ARTIFICIAL Intelligence research involves developing systems that can mimic human intelligence to perform tasks such as understanding natural language, recognizing patterns, making decisions, and learning from data. An AI lab focuses on experimenting with various AI techniques and algorithms to address challenges in creating intelligent systems.

This report provides a summary of experiments conducted in an AI lab, aimed at addressing the key challenges faced by AI systems and implementing solutions to overcome them. These experiments cover a range of topics within AI, including machine learning, natural language processing, computer vision, robotics, and more.

The experiments in the AI lab involve tasks such as importing and preprocessing data, training and evaluating machine learning models, designing algorithms for decision-making and problem-solving, and integrating AI systems into real-world applications. The goal is to develop AI systems that can effectively understand and respond to user needs, while also addressing challenges such as accuracy, efficiency, scalability, and adaptability to new data and environments.

II. EXPERIMENT-7: MATCHBOX EDUCABLE NOUGHTS AND CROSSES ENGINE (WEEK 7)

A. Objective
• Maximize the expected reward by balancing exploration (trying new arms) and exploitation (selecting known high-reward arms) in the context of a binary bandit with two rewards.
• Use an epsilon-greedy algorithm to decide upon the action to take for maximizing the expected reward.
• Develop a modified epsilon-greedy agent to track non-stationary rewards in a 10-armed bandit and show whether it is able to latch onto correct actions or not after at least 10,000 time steps.

B. Introduction
The lab assignment explores the N-armed bandit problem, a classic reinforcement learning problem. The goal is to identify which lever to pull in order to maximize the cumulative reward over repeated trials. This section discusses the exploration vs. exploitation dilemma: exploration involves trying different arms to learn about their reward distributions, while exploitation involves choosing the greedy action, i.e., selecting the arm believed to have the highest expected reward based on the available information. The assignment also involves developing a 10-armed bandit in which all ten mean rewards start out equal and then take independent random walks, implementing an epsilon-greedy agent to track the resulting non-stationary rewards, and analyzing the results after at least 10,000 time steps.

C. Problem statement
1) Read the reference on MENACE by Michie and check for its implementations. Pick the one that you like the most and go through the code carefully. Highlight the parts that you feel are crucial. If possible, try to code MENACE in any programming language of your liking.

METHODOLOGY

The MENACE experiment employs a reinforcement learning framework using a physical representation (matchboxes and beads) to learn the game of Noughts and Crosses (Tic-Tac-Toe). The primary components of the methodology include:
1) Physical Model:
• Each matchbox represents a unique game state, with colored beads signifying potential moves.
• The number of beads for each move indicates the likelihood of choosing that move.
2) Learning Mechanism:
• MENACE learns via trial and error, where it rewards successful moves by adding beads and penalizes unsuccessful ones by removing beads.
• The agent starts with an equal probability of selecting any move and adjusts its strategy based on past game outcomes.
3) State Representation and Action Selection:
• Game states are encoded into matchboxes, each containing bead counts for available moves.
• Moves are selected randomly based on the weighted counts of beads (see the sketch after this list).
4) Reinforcement Process:
• After each game, MENACE updates the bead counts in the relevant matchbox based on whether it won or lost.
5) Simulation:
• After testing the physical model, MENACE was implemented in a digital format on a Pegasus 2 computer, allowing for programmatic handling of states and moves.
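The bead-and-matchbox scheme above maps naturally onto a dictionary keyed by the board state. The short Python sketch below shows one possible digital representation together with the weighted move selection of item 3; it is a minimal illustration rather than MENACE's original code, and the class name Menace, the string board encoding, and the initial bead count of four are assumptions made here.

import random

class Menace:
    """Minimal sketch of a digital MENACE: one 'matchbox' of beads per board state."""

    def __init__(self, initial_beads=4):
        # Maps board state -> {move: bead count}; boxes are created lazily.
        self.matchboxes = {}
        self.initial_beads = initial_beads

    def _get_matchbox(self, state, legal_moves):
        # Every legal move starts with the same number of beads,
        # so all moves are initially equally likely.
        if state not in self.matchboxes:
            self.matchboxes[state] = {m: self.initial_beads for m in legal_moves}
        return self.matchboxes[state]

    def choose_move(self, state, legal_moves):
        # Draw a move with probability proportional to its bead count.
        box = self._get_matchbox(state, legal_moves)
        moves = list(box)
        weights = [box[m] for m in moves]
        return random.choices(moves, weights=weights, k=1)[0]

# Example: empty Tic-Tac-Toe board as a 9-character string, cells 0-8 as moves.
agent = Menace()
print("Opening move:", agent.choose_move("." * 9, legal_moves=list(range(9))))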
EXECUTION STEPS

1) Initialization:
• Set up the matchboxes for all possible game states with initial bead counts representing potential moves.
2) Game Simulation:
• Run multiple games of Noughts and Crosses, allowing MENACE to select moves based on the bead counts and adjusting the counts after each game.
3) Move Selection:
• For each game state, MENACE randomly selects a move from the matchbox, weighted by the number of beads for each move.
4) Updating Matchboxes:
• After each game, update the bead counts in the matchbox corresponding to the game state based on the game outcome (win or loss); a code sketch of this update follows the list.
5) Data Collection:
• Track the performance of MENACE over multiple games to analyze its learning process and effectiveness.
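Step 4 above, the per-game reinforcement, can be sketched as a standalone function operating on the same matchbox dictionary used in the earlier sketch. The bead amounts (add three for a win, one for a draw, remove one for a loss) follow the scheme commonly attributed to Michie's MENACE, but the function name, the board encoding, and the floor of one bead are assumptions made for illustration.

def reinforce(matchboxes, history, outcome, win_beads=3, draw_beads=1, loss_beads=1):
    """Adjust bead counts after one finished game.

    matchboxes: dict mapping board state -> {move: bead count}
    history:    list of (state, move) pairs MENACE played during the game
    outcome:    "win", "draw", or "loss" from MENACE's point of view
    """
    for state, move in history:
        box = matchboxes[state]
        if outcome == "win":
            box[move] += win_beads        # reward every move along the winning game
        elif outcome == "draw":
            box[move] += draw_beads       # small reward for a draw
        else:
            # Punish by removing a bead, but keep at least one so the move
            # is never ruled out entirely (a practical safeguard).
            box[move] = max(1, box[move] - loss_beads)

# Example: reinforce a single recorded move after a win (assumed board encoding).
matchboxes = {"." * 9: {0: 4, 4: 4, 8: 4}}
reinforce(matchboxes, history=[("." * 9, 4)], outcome="win")
print(matchboxes["." * 9])   # move 4 now holds 7 beads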
RESULTS

1) Performance Improvement:
• Over time, MENACE demonstrated an ability to select winning moves more frequently as it accumulated experience.
• The number of winning games increased, reflecting a successful adaptation to the game dynamics through reinforcement learning.
2) Bead Count Adjustments:
• The count of beads for winning moves increased, making them more likely to be selected in subsequent games, while losing moves saw a reduction in their bead counts.
3) Learning Behavior:
• Initial exploration was observed, with moves selected more randomly. As MENACE gained experience, it began to exploit the knowledge gained from previous games, favoring successful strategies.

CONCLUSION

The MENACE experiment serves as a pioneering example of reinforcement learning implemented in a physical format, illustrating how trial-and-error learning can lead to improved decision-making over time. By leveraging a simple yet effective method of reward and punishment through bead counts in matchboxes, MENACE effectively learns to play Noughts and Crosses. The simulation on the Pegasus 2 computer demonstrated that reinforcement learning principles can be applied digitally, providing insights into the learning process of intelligent agents. This foundational work laid the groundwork for future advancements in machine learning and artificial intelligence, highlighting the significance of exploration and exploitation in adaptive systems.

D. Problem statement
1) Consider a binary bandit with two rewards: 1 (success) and 0 (failure). The bandit returns 1 or 0 for the action that you select, i.e. 1 or 2. The rewards are stochastic (but stationary). Use an epsilon-greedy algorithm discussed in class and decide upon the action to take for maximizing the expected reward. Two binary bandits, named binaryBanditA.m and binaryBanditB.m, are waiting for you.
2) Develop a 10-armed bandit in which all ten mean rewards start out equal and then take independent random walks (by adding a normally distributed increment with mean zero and standard deviation 0.01 to all mean rewards on each time step).
3) The 10-armed bandit that you developed is difficult to crack with a standard epsilon-greedy algorithm since the rewards are non-stationary. We did discuss how to track non-stationary rewards in class. Write a modified epsilon-greedy agent and show whether it is able to latch onto correct actions or not (see the sketch after this list).
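Items 2 and 3 of the problem statement can be prototyped directly. The Python sketch below is a minimal illustration under a few assumptions made here: rewards are drawn with unit-variance Gaussian noise around the drifting means (the standard textbook setup), the constant step size alpha = 0.1 is one common choice for tracking non-stationary rewards rather than a value prescribed by the assignment, and all function and variable names are invented for this sketch.

import random

def run_nonstationary_bandit(steps=10000, k=10, epsilon=0.1, alpha=0.1, sigma=0.01):
    """Modified epsilon-greedy agent on a k-armed bandit with drifting mean rewards."""
    true_means = [0.0] * k      # all ten mean rewards start out equal
    q = [0.0] * k               # the agent's action-value estimates
    optimal_picks = 0

    for _ in range(steps):
        # Epsilon-greedy action selection.
        if random.random() < epsilon:
            action = random.randrange(k)
        else:
            action = max(range(k), key=lambda a: q[a])

        # Reward drawn around the current (drifting) mean of the chosen arm.
        reward = random.gauss(true_means[action], 1.0)

        # Constant step-size update: recent rewards dominate, so the estimate
        # can follow a non-stationary mean (unlike a plain sample average).
        q[action] += alpha * (reward - q[action])

        if action == max(range(k), key=lambda a: true_means[a]):
            optimal_picks += 1

        # Independent random walk of every mean (increment: mean 0, std 0.01).
        for a in range(k):
            true_means[a] += random.gauss(0.0, sigma)

    return optimal_picks / steps

print("Fraction of steps on the currently best arm:",
      round(run_nonstationary_bandit(), 2))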
E. Methodology

The methodology section details the operational aspects and procedures inherent in the Bandit problem:
1) Exploration vs Exploitation: Exploration entails venturing into unfamiliar territories, options, or strategies to discover potential benefits or opportunities, whereas exploitation sticks with the option currently believed to give the highest reward.
2) N-arm Bandit: The problem involves repeatedly choosing between N different options, each with an unknown reward distribution. The approach typically involves a balance between exploration (trying different options to learn their rewards) and exploitation (choosing the currently best-known option to maximize short-term rewards), often implemented using algorithms like epsilon-greedy, UCB (Upper Confidence Bound), or Thompson sampling.
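For context on the alternatives named above, the snippet below sketches UCB1-style action selection, which adds an optimism bonus that shrinks as an arm is sampled more often. It is illustrative only; the experiments in this report use epsilon-greedy, and the function name and the exploration constant c = 2.0 are arbitrary choices made here.

import math

def ucb_select(q, counts, t, c=2.0):
    """Pick the arm maximizing Q(a) + c * sqrt(ln t / N(a)).

    q:      list of estimated action values
    counts: list of pull counts per arm
    t:      current (1-based) time step
    """
    # Try every arm once before applying the confidence bound.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(range(len(q)),
               key=lambda a: q[a] + c * math.sqrt(math.log(t) / counts[a]))

Thompson sampling, by contrast, maintains a posterior distribution over each arm's reward and samples from those posteriors when choosing an action.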
F. Execution Steps

The following steps outline the execution process for comparing the binary bandits with different exploration-exploitation strategies (a Python sketch of the procedure follows the list):
1) Initialize parameters:
• Set the number of time steps T = 50.
• Define the exploration rates ϵ = [0.01, 0.1, 0.3].
2) Define Bandit A and Bandit B:
• For Bandit A: p = [0.1, 0.2].
• For Bandit B: p = [0.8, 0.9].
3) For each exploration rate ϵ:
• Initialize empty arrays to store rewards for Bandit A and Bandit B.
4) At each time step t from 1 to 50:
• For each bandit:
– Choose an action:
∗ With probability ϵ, choose a random action.
∗ Otherwise, choose the action with the highest estimated reward.
– Receive a reward based on the chosen action's success probability.
– Update the estimated reward for the chosen action.
– Store the obtained reward in the respective array.
5) Calculate average rewards:
• Calculate the average reward for Bandit A and Bandit B across all time steps for each exploration rate.
6) Compare results:
• Analyze the average rewards obtained for Bandit A and Bandit B under different exploration-exploitation strategies to determine their effectiveness.
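A compact Python sketch of this procedure is given below. It assumes Bernoulli stand-ins for binaryBanditA.m and binaryBanditB.m with the success probabilities from step 2; the function name and the sample-average value estimates are choices made for this sketch rather than details prescribed by the lab.

import random

def epsilon_greedy_run(p_success, epsilon, steps=50):
    """Average reward of an epsilon-greedy agent on one binary bandit."""
    k = len(p_success)
    q = [0.0] * k                  # estimated reward of each action
    n = [0] * k                    # pull counts
    rewards = []
    for _ in range(steps):
        if random.random() < epsilon:
            action = random.randrange(k)                 # explore
        else:
            action = max(range(k), key=lambda a: q[a])   # exploit
        reward = 1 if random.random() < p_success[action] else 0
        n[action] += 1
        q[action] += (reward - q[action]) / n[action]    # sample-average update
        rewards.append(reward)
    return sum(rewards) / len(rewards)

# Compare Bandit A and Bandit B for each exploration rate over T = 50 steps.
bandits = {"Bandit A": [0.1, 0.2], "Bandit B": [0.8, 0.9]}
for eps in [0.01, 0.1, 0.3]:
    for name, p in bandits.items():
        print(f"{name}, epsilon = {eps}: average reward = "
              f"{epsilon_greedy_run(p, eps):.2f}")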
G. Results
• The average rewards obtained for Bandit A and Bandit B under the different exploration-exploitation strategies are summarized below.
• For Bandit B, the probability of success for each arm is p = [0.8, 0.9].

H. Conclusion
• The N-arm bandit experiment provided valuable insights into the effectiveness of different exploration-exploitation strategies.
• Through experimentation with various exploration rates (ϵ), we observed distinct trade-offs between exploration and exploitation.
• The results indicate that a balanced approach (ϵ = 0.1) often yields competitive performance, leveraging both exploration and exploitation effectively.
• High exploration rates (ϵ = 0.3) showed potential for discovering new options but at the expense of immediate rewards, while low exploration rates (ϵ = 0.01) leaned heavily towards exploiting known options, risking overlooking better alternatives.
• Overall, the N-arm bandit experiment underscores the importance of carefully tuning exploration-exploitation strategies to optimize cumulative rewards over time.

I. References
1) D. Michie, "Experiments on the mechanization of game-learning. Part I: Characterization of the model and its parameters," The Computer Journal, vol. 6, no. 3, 1963.