COMP9414 23T2 Artificial Intelligence Assignment 1
COMP9414 23T2
Artificial Intelligence
Assignment 1 - Reward-based learning agents
Due: Week 5, Friday, 30 June 2023, 11:55 PM
1 Activities
In this assignment, you are asked to implement modified versions of the temporal-difference methods Q-learning and SARSA. Additionally, you are asked to implement modified versions of the action selection methods softmax and ϵ-greedy.
To run your experiments and test your code, you should use the example gridworld from Tutorial 3 (see Fig. 1). The methods are modified in the following two respects:
. Random numbers will be obtained sequentially from a file.
. The initial Q-values will be obtained from a file as well.
The random numbers are available in the file random_numbers.txt. The file contains 100k random numbers between 0 and 1, generated with numpy.random.random using seed = 9999, as follows:
import numpy as np
np.random.seed(9999)
random_numbers=np.random.random(100000)
np.savetxt("random_numbers.txt", random_numbers)
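One way to consume these numbers sequentially is to wrap them in a small object that keeps its own counter. The following is only a minimal sketch: the class name RandomStream and its next method are hypothetical, and the numbers are regenerated in memory with the same seed so that the example runs without the file.

```python
import numpy as np


class RandomStream:
    """Hands out pre-generated random numbers one at a time, in order."""

    def __init__(self, numbers):
        self.numbers = np.asarray(numbers, dtype=float)
        self.counter = 0  # index of the next unused number

    def next(self):
        value = self.numbers[self.counter]
        self.counter += 1  # each number is used exactly once
        return value


# In the assignment the numbers would be read with np.loadtxt("random_numbers.txt").
# Here they are regenerated with the same seed so the sketch is standalone.
np.random.seed(9999)
stream = RandomStream(np.random.random(100000))

first = stream.next()
second = stream.next()
```

After the two calls above, stream.counter is 2, so the third call would return the third number in the file.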
| 0 | 1 | 2 | 3 |
| 4 | 5 | 6 | 7 |
| 8 | 9 | 10 | 11 |
Figure 1: 3 × 4 gridworld with one goal state and one fear state.
1.1 Implementing modified SARSA and ϵ-greedy
For the modified SARSA, you must use the code reviewed during Tutorial 3 as a base. Consider the following:
. The method will use a given set of initial Q-values, i.e., instead of initialising them with random values, the initial Q-values should be obtained from the file initial_Q_values.txt. You must load the values using np.loadtxt("initial_Q_values.txt").
. The initial state of the agent before training will always be 0.
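The standard SARSA update rule can be sketched as follows. This is not the Tutorial 3 code: the function name sarsa_update is hypothetical, the Q-table is assumed to have one row per state and one column per action (12 × 4 for the example gridworld), and a zero table stands in for the values that would be loaded from the file.

```python
import numpy as np


def sarsa_update(Q, state, action, reward, next_state, next_action, alpha, gamma):
    """On-policy TD update: the target uses the action actually chosen next."""
    td_target = reward + gamma * Q[next_state, next_action]
    Q[state, action] += alpha * (td_target - Q[state, action])
    return Q


# Assumed shape: 12 states x 4 actions. In the assignment this table would come
# from the provided initial Q-values file instead of np.zeros.
Q = np.zeros((12, 4))
Q[1, 2] = 0.5  # illustrative value only

state = 0  # the agent always starts in state 0
Q = sarsa_update(Q, state, action=2, reward=0.0, next_state=1, next_action=2,
                 alpha=0.7, gamma=0.4)
```

With these illustrative values, Q[0, 2] moves from 0 towards the target 0.4 × 0.5 = 0.2 by a fraction α = 0.7, giving approximately 0.14.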
For the modified ϵ-greedy, create an action selection method that receives the state as an argument and returns the action. Consider the following:
. The method must use the random numbers from the provided file sequentially, one at a time, i.e., each random number is used only once.
. If the random number rnd <= ϵ, the method returns an exploratory action. The next random number is then used to decide which action to return, as shown in Table 1.
. You should keep a counter for the random numbers, as you will need it to access the numbers sequentially, i.e., you should increment the counter every time you use a random number.
Random number (T) | Action | Action code
T <= 0.25 | down | 0
0.25 < T <= 0.5 | up | 1
0.5 < T <= 0.75 | right | 2
0.75 < T <= 1 | left | 3
Table 1: Exploratory action selection given the random number.
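The selection rule and Table 1 can be sketched as below. The function name epsilon_greedy and its signature are hypothetical; the counter is passed in and returned rather than stored globally, and ties in the greedy branch are broken by np.argmax (first maximum), which is an assumption and may differ from the tutorial code.

```python
import numpy as np


def epsilon_greedy(Q, state, epsilon, numbers, counter):
    """Return (action, updated counter), consuming numbers sequentially."""
    rnd = numbers[counter]
    counter += 1
    if rnd <= epsilon:
        # Exploratory action: use the NEXT random number and Table 1.
        t = numbers[counter]
        counter += 1
        if t <= 0.25:
            action = 0  # down
        elif t <= 0.5:
            action = 1  # up
        elif t <= 0.75:
            action = 2  # right
        else:
            action = 3  # left
    else:
        # Greedy action (assumption: first maximum wins on ties).
        action = int(np.argmax(Q[state]))
    return action, counter


# Illustrative inputs only; real runs would use the provided files.
numbers = np.array([0.1, 0.6, 0.9])
Q = np.zeros((12, 4))
Q[0, 3] = 1.0

a1, counter = epsilon_greedy(Q, 0, 0.25, numbers, 0)        # explores: 0.1 <= 0.25, T = 0.6
a2, counter = epsilon_greedy(Q, 0, 0.25, numbers, counter)  # greedy: 0.9 > 0.25
```

Note that an exploratory step consumes two random numbers and a greedy step consumes one, so the counter advances by 2 and then by 1 in this example.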
1.2 Implementing Q-learning and softmax
You should implement the temporal-difference method Q-learning. Consider the following for the implementation:
. For Q-learning, the same set of initial Q-values will be used (provided in the file initial_Q_values.txt).
. Update the Q-values according to the method. Remember this is an off-policy method.
. As in the previous case, the initial state before training is also 0.
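The standard Q-learning update can be sketched as follows. As with the other examples, the function name q_learning_update is hypothetical, the Q-table is assumed to be 12 states × 4 actions, and the values are illustrative rather than taken from the provided file.

```python
import numpy as np


def q_learning_update(Q, state, action, reward, next_state, alpha, gamma):
    """Off-policy TD update: the target uses the best action in the next state,
    regardless of which action the agent actually takes there."""
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
    return Q


# Assumed shape: 12 states x 4 actions, with illustrative values.
Q = np.zeros((12, 4))
Q[1] = [0.1, 0.5, 0.2, 0.0]

Q = q_learning_update(Q, state=0, action=2, reward=0.0, next_state=1,
                      alpha=0.7, gamma=0.4)
```

The only difference from the SARSA update is the target: here it is built from max over Q[next_state] (0.5 in this example), so Q[0, 2] becomes approximately 0.7 × 0.4 × 0.5 = 0.14 whichever action is selected next.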
For the softmax action selection method, consider the following:
. Use a temperature parameter τ = 0.1.
. Use a random number from the provided file and compare it with the cumulative probabilities to select an action. Hint: np.searchsorted returns the position where a number should be inserted in a sorted array to keep it sorted. This position is equivalent to the action selected by softmax.
. Remember to use and increment a counter every time you use a random number.
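The hint about np.searchsorted can be sketched as below. The function name softmax_action is hypothetical. Subtracting the maximum Q-value before exponentiating and clamping the index to the last action are defensive choices added here for numerical safety; they are assumptions and not part of the specification.

```python
import numpy as np


def softmax_action(Q, state, tau, numbers, counter):
    """Return (action, updated counter) using one sequential random number."""
    q = Q[state]
    # Subtracting the max leaves the probabilities unchanged but avoids overflow.
    prefs = np.exp((q - np.max(q)) / tau)
    probs = prefs / np.sum(prefs)
    cumulative = np.cumsum(probs)

    rnd = numbers[counter]
    counter += 1
    # Position where rnd would be inserted in the sorted cumulative array.
    action = int(np.searchsorted(cumulative, rnd))
    # Guard against rounding leaving the last cumulative value just below 1.
    action = min(action, len(q) - 1)
    return action, counter


# Illustrative inputs only: equal Q-values give probabilities of 0.25 each,
# so the cumulative array is [0.25, 0.5, 0.75, 1.0].
Q = np.zeros((12, 4))
numbers = np.array([0.6, 0.1])

a1, counter = softmax_action(Q, 0, 0.1, numbers, 0)        # 0.6 falls in (0.5, 0.75]
a2, counter = softmax_action(Q, 0, 0.1, numbers, counter)  # 0.1 falls in [0, 0.25]
```

Unlike ϵ-greedy, this method consumes exactly one random number per call.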
1.3 Testing and plotting the results
You should plot a heatmap with the final Q-values after 1,000 learning episodes. Additionally, you should plot the accumulated reward per episode and the number of steps taken by the agent in each episode.
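A possible layout for the three plots with matplotlib is sketched below. The function name plot_results is hypothetical, the heatmap assumes a Q-table with one row per state and one column per action, and the arrays passed in at the end are placeholders to show the call, not results.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_results(Q, rewards, steps):
    """Draw the Q-value heatmap, accumulated reward and steps per episode."""
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    image = axes[0].imshow(Q, aspect="auto")
    axes[0].set_title("Final Q-values")
    axes[0].set_xlabel("Action code")
    axes[0].set_ylabel("State")
    fig.colorbar(image, ax=axes[0])

    axes[1].plot(rewards)
    axes[1].set_title("Accumulated Reward")
    axes[1].set_xlabel("Episodes")
    axes[1].set_ylabel("Reward")

    axes[2].plot(steps)
    axes[2].set_title("Steps per Episode")
    axes[2].set_xlabel("Episodes")
    axes[2].set_ylabel("Steps")

    fig.tight_layout()
    return fig, axes


# Placeholder inputs only, to show the call; they are not experimental results.
Q_demo = np.arange(48).reshape(12, 4) / 48.0
fig, axes = plot_results(Q_demo, np.zeros(1000), np.ones(1000))
# fig.savefig("results.png") would write the figure to disk.
```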
For instance, if you want to test your code, you can use the gridworld shown in Fig. 1 and you will obtain the rewards shown in Fig. 2 and the steps shown in Fig. 3. The learning parameters used are: learning rate α = 0.7, discount factor γ = 0.4, ϵ = 0.25, and τ = 0.1.
[Figure 2 panels, each plotting accumulated reward (y-axis: Reward) over episodes 0 to 1000 (x-axis: Episodes): (a) Q-learning + ϵ-greedy, (b) Q-learning + softmax, (c) SARSA + ϵ-greedy, (d) SARSA + softmax.]
Figure 2: Accumulated rewards.
In case you want to compare your results with the exact output for this example using diff, four files with the accumulated reward and four files with the steps per episode are provided (one for each combination of Q-learning/SARSA and ϵ-greedy/softmax).
To mark your submission, different gridworlds and learning parameters will be used as test cases.
[Figure 3: steps per episode.]