There is a popular cliché ... that computers only do exactly what you tell them to, and that therefore computers are never creative. The cliché is true only in the crashingly trivial sense, the same sense in which Shakespeare never wrote anything except what his first schoolteacher taught him to write--words. – Richard Dawkins, The Blind Watchmaker.

Experiment with the text's 4x3 grid example, modifying the costs, benefits and gamma values.

Markov Decision Processes (MDPs)

MDPs are sequential decision problems for fully observable, Markovian worlds with additive rewards.
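As a concrete illustration of what such a problem specifies (a set of states, actions, a Markovian transition model, per-state rewards, and a discount factor), here is a tiny, entirely made-up example in Python; the state and action names are arbitrary:

    # A tiny, made-up MDP: the four ingredients the definition above refers to.
    states = ['sunny', 'rainy']
    actions = ['walk', 'drive']

    # Markovian transition model: P(next_state | state, action) depends only on
    # the current state and action, not on the history.
    P = {
        ('sunny', 'walk'):  {'sunny': 0.9, 'rainy': 0.1},
        ('sunny', 'drive'): {'sunny': 0.8, 'rainy': 0.2},
        ('rainy', 'walk'):  {'sunny': 0.3, 'rainy': 0.7},
        ('rainy', 'drive'): {'sunny': 0.4, 'rainy': 0.6},
    }

    # Additive (discounted) rewards: the utility of a state sequence is the sum
    # of the per-state rewards, each weighted by gamma raised to the time step.
    R = {'sunny': 1.0, 'rainy': -1.0}
    gamma = 0.9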

Exercise 10.1

Do the following exercises with the Gridworld MDP:

  1. Download the following sample code, which implements an MDP for the text example shown in Figure 17.1: lab1.py. First run the code and check that the policy and utility values it produces match those shown in the text example. Then, make sure that you can answer the following questions:

    1. What do the numbers in the grid represent?
    2. What are the terminals specified in the MDP?
    3. Can you explain how the utility values and the policy relate to one another?
  2. Modify this code in each of the following ways:

    1. Specify no reward/penalty for moving around the grid, apart from the two terminal states.
    2. Specify a heavy penalty for moving around the grid.
    3. Specify a reward for moving around the grid.

    Do the results you get in each case make sense? Compare them with the results shown in Figure 17.2 of the text. You may need to play with the reward/penalty, gamma and delta values a bit to get reasonably similar results. (One possible way to vary these settings is sketched after this list.)

  3. What exactly is the value-iteration algorithm given as input, and what does it compute? Would you consider it to be learning anything?
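For item 2, here is a minimal sketch of one way to vary the living reward and gamma. It assumes the AIMA-Python mdp module (GridMDP, value_iteration, best_policy, to_arrows), which lab1.py presumably builds on; check the actual names, grid layout, and terminal coordinates against the code you downloaded.

    from mdp import GridMDP, value_iteration, best_policy  # AIMA-Python mdp module

    # Figure 17.1 layout, rows listed top to bottom; None marks the wall and the
    # +1/-1 cells are the terminals.  Change living_reward (and gamma) to try
    # each of the variations above, e.g. 0.0, -2.0, or +0.1.
    living_reward = -0.04
    grid = [[living_reward, living_reward, living_reward, +1],
            [living_reward, None,          living_reward, -1],
            [living_reward, living_reward, living_reward, living_reward]]
    gridworld = GridMDP(grid, terminals=[(3, 2), (3, 1)], gamma=0.9)

    # value_iteration is given the whole MDP (states, actions, transition model,
    # rewards, gamma) and computes a utility for every state; best_policy then
    # picks, in each state, the action with the highest expected utility.
    U = value_iteration(gridworld, epsilon=0.001)
    pi = best_policy(gridworld, U)
    print(U)
    print(gridworld.to_arrows(pi))

Printing both the utilities and the arrow-rendered policy makes it easy to compare each run against Figure 17.2.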

Save your code and your answers.

The text examples work only with Steven Klebanoff's modified version of the AIMA Python code; see https://github.com/steveklebanoff/AIMA-Python-Reinforcement-Learning .

Exercise 10.2

Modify the previous file to reproduce Thrun's examples. Thrun uses the standard gridworld, but with a default cell penalty of -3.0, terminal-state rewards/penalties of +100.0/-100.0 respectively, and gamma = 1. Now compute the following by hand (a sketch for checking your arithmetic appears at the end of this exercise):

  1. the first-iteration values for the utility of cell (3,3) (see Figure 17.1) and then cell (2,3).
  2. the utility values for all the cells after convergence of the value-iteration algorithm.

Do the utilities you hand-computed match the utilities produced by your program? Why or why not? Save your code and your answers.

Note that Thrun computes the first iteration of the utility values for cells (3,3) and (3,2) in his “Planning under Uncertainty” lectures (unit 9, segments 26-29).
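If you want to check the hand computations with a program, below is a minimal, self-contained sketch of synchronous value iteration under Thrun's settings (step reward -3.0, terminal rewards +100.0/-100.0, gamma 1). The standard 0.8/0.1/0.1 transition model from the text example and the (column, row) numbering of Figure 17.1 are assumed here; adjust them if your setup differs. One sweep gives the first-iteration values (part 1); running until the largest change falls below a small threshold gives the converged values (part 2).

    # States are (column, row), 1-indexed as in Figure 17.1; (2, 2) is the wall.
    TERMINALS = {(4, 3): +100.0, (4, 2): -100.0}
    WALL = {(2, 2)}
    STEP_REWARD = -3.0
    GAMMA = 1.0
    STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) not in WALL]
    MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
    SLIPS = {'N': ('E', 'W'), 'S': ('E', 'W'), 'E': ('N', 'S'), 'W': ('N', 'S')}

    def move(state, action):
        """Deterministic effect of an action; hitting the wall or an edge stays put."""
        c, r = state
        dc, dr = MOVES[action]
        nxt = (c + dc, r + dr)
        return nxt if nxt in STATES else state

    def backup(state, U):
        """One Bellman backup: R(s) + gamma * best expected utility (0.8/0.1/0.1)."""
        if state in TERMINALS:
            return TERMINALS[state]
        best = max(0.8 * U[move(state, a)]
                   + 0.1 * U[move(state, SLIPS[a][0])]
                   + 0.1 * U[move(state, SLIPS[a][1])]
                   for a in MOVES)
        return STEP_REWARD + GAMMA * best

    def value_iteration(sweeps=None, epsilon=1e-4):
        """Synchronous sweeps over all states; sweeps=1 gives first-iteration values."""
        U = {s: TERMINALS.get(s, 0.0) for s in STATES}
        done_sweeps = 0
        while sweeps is None or done_sweeps < sweeps:
            new_U = {s: backup(s, U) for s in STATES}
            delta = max(abs(new_U[s] - U[s]) for s in STATES)
            U, done_sweeps = new_U, done_sweeps + 1
            if sweeps is None and delta < epsilon:
                break
        return U

    print('after one sweep:  ', value_iteration(sweeps=1))
    print('after convergence:', value_iteration())

Even with gamma = 1 the values settle here, because every trajectory eventually reaches a terminal state.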

Passive Reinforcement Learning

Rather than being computed from a fully specified model, policies and their utilities can also be learned from experience.

Exercise 10.3

Do the following exercises with passive reinforcement learning of MDPs.

  1. Download the following sample code, which implements a passive ADP learning agent for the Gridworld MDP discussed in Chapter 21: lab3.py.

    1. Run the code and check that the utility values it learns match those shown in Figure 21.1(b). How close are the learned values? Do the results make sense?
    2. Do you always get values for all the cells? Why or why not?
    3. Where is the transition model implemented?
    4. Does this code learn a good policy? Why or why not?
    5. What is a good choice for the value of n?
  2. What sort of agent design and learning approach does this code use? The choices are discussed in Section 21.1. What is it given? What does it learn (if anything)? (A bare-bones sketch of the ADP idea appears after this list.)

  3. How well would this approach scale up to larger, more realistic problems?
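To make the agent-design question in item 2 concrete, here is a bare-bones, self-contained sketch of the passive ADP idea. It is not the code in lab3.py, and the trial format below is an assumption, so treat it only as an outline of the approach.

    from collections import defaultdict

    def adp_estimates(trials, gamma=0.9, sweeps=100):
        """Passive ADP in outline: learn a transition model and rewards from trials
        generated by one fixed policy, then evaluate that policy on the learned model.

        Each trial is a list of (state, action, reward) steps in the order visited;
        the final step of a trial is the terminal state with action None (assumed format).
        """
        counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
        rewards = {}                                     # s -> observed reward R(s)
        policy = {}                                      # s -> action the policy took

        for trial in trials:
            for (s, a, r) in trial:
                rewards[s] = r
                policy[s] = a
            for (s, a, _), (s2, _, _) in zip(trial, trial[1:]):
                counts[(s, a)][s2] += 1                  # observed transition s --a--> s2

        # Maximum-likelihood transition model: P(s' | s, a) = count / total count.
        P = {sa: {s2: n / sum(succ.values()) for s2, n in succ.items()}
             for sa, succ in counts.items()}

        # Iterative policy evaluation on the learned model:
        #   U(s) = R(s) + gamma * sum over s' of P(s' | s, pi(s)) * U(s')
        U = {s: 0.0 for s in rewards}
        for _ in range(sweeps):
            U = {s: rewards[s] + gamma * sum(p * U.get(s2, 0.0)
                                             for s2, p in P.get((s, policy[s]), {}).items())
                 for s in rewards}
        return U

The essential point is that ADP estimates the transition model and rewards from observed trials and then applies standard policy evaluation to the learned model; lab3.py may organize the same steps differently.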

Save your code and your answers.

The text covers other agent designs and other approaches to reinforcement learning.

Checking in

Submit your source code, as specified above, in Moodle under lab 10.
