Experiment with the text's 4x3 grid example, modifying the costs, benefits and gamma values.
MDPs are sequential decision problems for fully observable, Markovian worlds with additive rewards.
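As a quick reminder (notation as in Chapter 17 of the text), "additive rewards" means the utility of a state sequence is the (possibly discounted) sum of its rewards, and the utilities of individual states obey the Bellman equation:

    U([s0, s1, s2, ...]) = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ...
    U(s) = R(s) + gamma * max_a SUM_s' P(s'|s,a) * U(s')

Value iteration solves the second equation by repeated updates; these are the quantities the exercises below ask you to compute and inspect.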
Do the following exercises with the Gridworld MDP:
Download the following sample code, which implements an MDP for the text example shown in Figure 17.1: lab1.py. First run the code and check that the policy and utility values it produces match those shown in the text example. Then make sure that you can answer the following questions:
Modify this code in each of the following ways:
Do the results you get in each case make sense? Compare them with the results shown in Figure 17.2 of the text. You may need to play with the reward/penalty, gamma and delta values a bit to get reasonably similar results.
What exactly is the value-iteration algorithm given as input, and what does it compute? Would you consider it to be learning anything? (A minimal sketch of value iteration appears after this list, for reference.)
Save your code and your answers.
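For reference, here is a minimal value-iteration sketch for the 4x3 gridworld. This is not lab1.py; the grid layout, motion model, and constants below are the textbook's defaults as we understand them, and all names are illustrative:

    # A minimal value-iteration sketch for the 4x3 gridworld (illustrative, not
    # lab1.py). Assumptions: the 0.8/0.1/0.1 motion model, R(s) = -0.04 for
    # nonterminal states, terminals (4,3) = +1 and (4,2) = -1, a wall at (2,2).
    GAMMA, EPSILON = 0.999, 1e-4
    TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
    WALL = (2, 2)
    STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != WALL]
    MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
    LEFT = {'N': 'W', 'W': 'S', 'S': 'E', 'E': 'N'}
    RIGHT = {'N': 'E', 'E': 'S', 'S': 'W', 'W': 'N'}

    def move(s, a):
        # Deterministic effect of heading in direction a; bumping the wall or
        # the grid edge leaves the agent where it is.
        nxt = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
        return nxt if nxt in STATES else s

    def outcomes(s, a):
        # 80% intended direction, 10% each perpendicular direction.
        return [(0.8, move(s, a)), (0.1, move(s, LEFT[a])), (0.1, move(s, RIGHT[a]))]

    def value_iteration():
        U = {s: 0.0 for s in STATES}
        while True:
            delta, U_new = 0.0, {}
            for s in STATES:
                if s in TERMINALS:
                    U_new[s] = TERMINALS[s]   # a terminal's utility is its reward
                else:
                    U_new[s] = -0.04 + GAMMA * max(
                        sum(p * U[t] for p, t in outcomes(s, a)) for a in MOVES)
                delta = max(delta, abs(U_new[s] - U[s]))
            U = U_new
            if delta < EPSILON:
                return U

    for s, u in sorted(value_iteration().items()):
        print(s, round(u, 3))

With gamma close to 1 and a small delta threshold, the utilities this prints should be close to those in the text example; it should correspond to what lab1.py computes, modulo implementation details.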
Note that the text examples only work with modifications to the AIMA Python code developed by Steven Klebanoff; see https://github.com/steveklebanoff/AIMA-Python-Reinforcement-Learning.
Modify the previous file to reproduce Thrun's examples. Thrun uses the standard gridworld, but with a default cell penalty of -3.0, terminal state rewards/penalties of +100.0/-100.0 respectively, and gamma = 1.0. Now compute the following:
Do the utilities you hand-computed match the utilities produced by your program? Why or why not? Save your code and your answers.
Note that Thrun computes the first iteration of the utility values for cells (3,3) and (3,2) in his “Planning under Uncertainty” lectures (unit 9, segments 26-29); a hedged version of that backup is worked below.
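Before comparing with your program, it helps to replicate that first backup by hand. A hedged version, assuming the standard 0.8/0.1/0.1 motion model, the +100 terminal at (4,3), and all nonterminal utilities initialized to 0 (terminals initialized to their rewards):

    U1(3,3) = R(3,3) + gamma * max_a SUM_s' P(s'|(3,3),a) * U0(s')
            = -3 + 1.0 * [0.8*U0(4,3) + 0.1*U0(3,3) + 0.1*U0(3,2)]   (best action: East)
            = -3 + 1.0 * [0.8*100 + 0.1*0 + 0.1*0]
            = 77

(The 0.1 "north" outcome bumps the top edge and leaves the agent in (3,3).) Whether later values such as U1(3,2) match Thrun's depends on whether utilities are updated synchronously or in place, so expect small discrepancies in intermediate iterations.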
Optimal policies can also be learned from experience, rather than computed from a known model.
Do the following exercises with passive reinforcement learning of MDPs.
Download the following sample code, which implements a passive ADP learning agent for the Gridworld MDP discussed in Chapter 21: lab3.py.
What sort of agent design and learning approach does this code use? The choices are discussed in Section 21.1. What is it given, and what does it learn (if anything)? (A minimal sketch of the core idea appears after this exercise.)
How well would this approach scale up to larger, more realistic problems?
Save your code and your answers.
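For orientation, here is a minimal sketch of the core of a passive ADP agent. This is not lab3.py; the class and method names are illustrative. The idea is that the agent follows a fixed policy, estimates the transition model from observed counts, and evaluates the policy against that learned model:

    # Illustrative sketch of passive ADP model learning -- not lab3.py.
    from collections import defaultdict

    class PassiveADPAgent:
        def __init__(self, pi, gamma=0.9):
            self.pi, self.gamma = pi, gamma   # fixed policy and discount
            self.N_sa = defaultdict(int)      # how often action a was tried in s
            self.N_s2_sa = defaultdict(int)   # how often that led to s'
            self.R = {}                       # observed reward for each state
            self.U = defaultdict(float)       # current utility estimates

        def observe(self, s, r, s2):
            # Record one transition s --pi(s)--> s2, with reward r received in s.
            a = self.pi[s]
            self.R[s] = r
            self.N_sa[(s, a)] += 1
            self.N_s2_sa[(s, a, s2)] += 1

        def P(self, s2, s, a):
            # Maximum-likelihood estimate of P(s' | s, a) from the counts.
            n = self.N_sa[(s, a)]
            return self.N_s2_sa[(s, a, s2)] / n if n else 0.0

        def evaluate_policy(self, sweeps=50):
            # Simplified policy evaluation: repeated Bellman updates for the
            # fixed policy, using the learned model instead of the true one.
            states = {s for (s, _) in self.N_sa}
            succs = {s2 for (_, _, s2) in self.N_s2_sa}
            for _ in range(sweeps):
                for s in states:
                    a = self.pi[s]
                    self.U[s] = self.R[s] + self.gamma * sum(
                        self.P(s2, s, a) * self.U[s2] for s2 in succs)
            return self.U

A full passive ADP agent would also handle terminal states and re-solve after each trial; this sketch only shows where the learned model comes from and how it feeds policy evaluation.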
The text covers other agent designs and other approaches to reinforcement learning.
Submit your source code, as specified above, in Moodle under lab 10.