Policy, Value, and Q functions
Go to the “Playground” at the bottom of this article.
- Change the Algorithm to Q-Learning. We won’t look at the others at this time.
- Try each of the “Visualization” options. What does each one show? Each one displays a different function.
- Add one agent. How does completing an episode affect each of the functions that the agent is learning?
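The playground's internals aren't shown here, so the following is only a minimal tabular Q-learning sketch. It assumes a grid world with four moves, and the learning rate alpha = 0.5 and discount gamma = 0.9 are hypothetical values, not necessarily the playground's. It shows how a single Q-table yields all three functions: the value of a state is its best Q-value, and the policy picks the action that achieves it.

```python
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]  # hypothetical action set for a grid world

# Q-table: maps (state, action) -> estimated return; unseen pairs default to 0.0
Q = defaultdict(float)

def q_update(state, action, reward, next_state, done, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step toward r + gamma * max_a' Q(s', a')."""
    if done:
        target = reward  # no future reward after a terminal state
    else:
        target = reward + gamma * max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (target - Q[(state, action)])

def value(state):
    """State value V(s) = max over actions of Q(s, a)."""
    return max(Q[(state, a)] for a in ACTIONS)

def policy(state):
    """Greedy policy: the action with the highest Q-value (ties go to the first listed)."""
    return max(ACTIONS, key=lambda a: Q[(state, a)])

# Hypothetical episode: the agent walks right from (0, 0) to a goal at (0, 2).
episode = [((0, 0), "right", 0.0, (0, 1), False),
           ((0, 1), "right", 1.0, (0, 2), True)]

for i in range(2):
    for step in episode:
        q_update(*step)
    print(f"after episode {i + 1}: V(0,0)={value((0, 0)):.3f}, "
          f"V(0,1)={value((0, 1)):.3f}, policy(0,0)={policy((0, 0))}")
```

Running the same episode twice shows the reward propagating backward one step per episode: after the first, only (0, 1) has a nonzero value; after the second, (0, 0) does too, and its greedy action becomes "right".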
Exploration
- Set the Explore-Exploit slider all the way to Explore. What do you notice about the agent’s behavior?
- Set it all the way to Exploit. What do you notice now?
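The handout doesn't say how the Explore-Exploit slider is implemented. A common approach, and the one assumed in this sketch, is epsilon-greedy action selection, where the slider would set epsilon, the probability of taking a random action instead of the best-known one. The Q-values below are made up for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-known action.

    q_values: dict mapping action -> estimated Q-value for the current state.
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))   # explore: any action, uniformly
    return max(q_values, key=q_values.get)  # exploit: greedy action

q_here = {"up": 0.1, "down": -0.2, "left": 0.0, "right": 0.7}  # made-up values
rng = random.Random(0)
for eps in (0.0, 0.5, 1.0):
    picks = [epsilon_greedy(q_here, eps, rng) for _ in range(1000)]
    print(f"epsilon={eps}: chose 'right' {picks.count('right') / len(picks):.0%} of the time")
```

With epsilon = 0 the agent always repeats its current best guess, so it never tries anything that might turn out better; with epsilon = 1 it ignores what it has learned and picks uniformly at random.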
This environment isn’t rich enough for exploration to help much, so go to a different playground where we can edit the environment and see what the agent learns.
- Read through “Strategically Making Mistakes”.
- What does a low epsilon do? What does a high epsilon do?
- Try editing the “maze =” definition to change the environment. What does it take to get the agent to tolerate a short-term negative reward to achieve a higher long-term reward?
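One way to reason about this question, assuming the agent maximizes a discounted return in which a reward t steps away is weighted by gamma**t (whether the playground exposes such a discount factor is an assumption here), is to compare the returns of two paths by hand. The reward values are made up: a small reward one step away versus a large reward behind a penalty cell.

```python
def discounted_return(rewards, gamma):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ... for one path's reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical maze choices (reward per step, ending at a terminal cell):
near_goal = [1.0]             # step straight onto a small reward
far_goal = [-1.0, 0.0, 10.0]  # cross a penalty cell to reach a big reward

for gamma in (0.3, 0.5, 0.9):
    g_near = discounted_return(near_goal, gamma)
    g_far = discounted_return(far_goal, gamma)
    better = "far" if g_far > g_near else "near"
    print(f"gamma={gamma}: near={g_near:.2f}, far={g_far:.2f} -> prefers {better}")
```

With these numbers the far goal is worth more once 10 * gamma**2 exceeds 2, i.e. for gamma above roughly 0.45. A more far-sighted discount, a bigger long-term reward, or a smaller penalty all push the agent toward accepting the penalty, and the agent also has to explore enough to find the big reward in the first place.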