8.1 Models and Planning

By a model of the environment we mean anything that an agent can use to predict how the environment will respond to its actions.

The word planning is used in several different ways in different fields. We use the term to refer to any computational process that takes a model as input and produces or improves a policy for interacting with the modeled environment.

The difference is that whereas planning uses simulated experience generated by a model, learning methods use real experience generated by the environment. Of course this difference leads to a number of other differences.

In addition to the unified view of planning and learning methods, a second theme in this chapter is the benefits of planning in small, incremental steps.

The idea is to learn a model from a small amount of real experience, then use that model to generate much more simulated experience, so the limited real experience is used to the fullest. The downside is that the learned model may be inaccurate.
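As a concrete sketch of what such a model can look like (my own minimal illustration, not code from the book): a deterministic tabular model simply memorizes the last observed outcome of each state-action pair from real experience and can then be queried to produce simulated experience.

```python
import random

class TabularModel:
    """Deterministic tabular model: remembers the last observed (reward, next state)
    for every state-action pair seen in real experience."""

    def __init__(self):
        self.transitions = {}                      # (s, a) -> (r, s_next)

    def update(self, s, a, r, s_next):
        """Record one step of real experience."""
        self.transitions[(s, a)] = (r, s_next)

    def sample(self):
        """Return one step of simulated experience from a previously seen pair."""
        s, a = random.choice(list(self.transitions.keys()))
        r, s_next = self.transitions[(s, a)]
        return s, a, r, s_next
```

Because the table only stores what was actually observed, planning can reuse each real transition many times; the flip side, as noted above, is that a model built from very little data can be wrong.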

8.2 Dyna: Integrating Planning, Acting, and Learning

The possible relationships between experience, model, values, and policy are summarized in Figure 8.1.

Dyna-Q is fairly well known and worth mastering: one step of learning, then n steps of planning, per real step.

Learning and planning are deeply integrated in the sense that they share almost all the same machinery, differing only in the source of their experience.

In general, the larger n is, the faster and more accurately Q(s,a) is updated. The step R, S' <== Model(S, A) essentially replays transitions drawn from real experience, so it effectively propagates information about the real environment (assuming the environment is stationary).

However, if quantities such as R(s,a,s') and P(s,a,s') are instead estimated from real experience, the resulting model may be inaccurate when only a small amount of real experience is available.
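To make the "one step of learning, n steps of planning" structure concrete, here is a minimal sketch of tabular Dyna-Q under the assumptions above: a deterministic model stored as a table, ε-greedy action selection, and a hypothetical `env` object with `actions`, `reset()`, and `step(a)` returning (next state, reward, done). These names are mine, not the book's.

```python
import random
from collections import defaultdict

def dyna_q(env, n_planning=10, episodes=100, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Dyna-Q: one step of direct Q-learning per real step,
    plus n_planning Q-learning updates on simulated experience from the model."""
    Q = defaultdict(float)                     # Q[(s, a)] -> action value
    model = {}                                 # deterministic model: (s, a) -> (r, s_next, done)

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = eps_greedy(s)
            s_next, r, done = env.step(a)      # one step of real experience

            # Direct RL: one-step Q-learning update from real experience.
            target = r if done else r + gamma * max(Q[(s_next, b)] for b in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])

            # Model learning: record the observed transition.
            model[(s, a)] = (r, s_next, done)

            # Planning: n_planning updates on transitions replayed from the model.
            for _ in range(n_planning):
                ps, pa = random.choice(list(model.keys()))
                pr, ps_next, pdone = model[(ps, pa)]
                ptarget = pr if pdone else pr + gamma * max(Q[(ps_next, b)] for b in env.actions)
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])

            s = s_next
    return Q
```

The planning loop only replays transitions stored in the model, so increasing `n_planning` propagates information from real experience faster, exactly as discussed above.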

8.3 When the Model Is Wrong

When the model is wrong, planning often converges to a suboptimal policy. If the real environment has become worse than the model believes, the error tends to be corrected quickly, because the policy the model considers good fails to earn the expected rewards. But if the real environment has become better than the model believes, the improved policy is hard to discover: executing the old policy under the old model still yields the expected rewards, so the agent has little occasion to notice that some other policy could earn even more. The two examples in the book illustrate this well.

In some cases, the suboptimal policy computed by planning quickly leads to the discovery and correction of the modeling error. This tends to happen when the model is optimistic in the sense of predicting greater reward or better state transitions than are actually possible. The planned policy attempts to exploit these opportunities and in doing so discovers that they do not exist.

Greater difficulties arise when the environment changes to become better than it was before, and yet the formerly correct policy does not reveal the improvement. In these cases the modeling error may not be detected for a long time, if ever, as we see in the next example.

At heart this is a tradeoff between exploration and exploitation: "The general problem here is another version of the conflict between exploration and exploitation." A simple heuristic is to increase the incentive to explore state-action pairs that have not been tried in the real environment for a long time.
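One concrete form of that heuristic is the Dyna-Q+ exploration bonus: during planning, the modeled reward is augmented by a term that grows with the time since the pair was last tried for real. A minimal sketch, with variable names of my own choosing:

```python
import math

def planning_reward(r_model, tau, kappa=1e-3):
    """Dyna-Q+ style bonus: tau is the number of real time steps since (s, a)
    was last taken in the real environment. Long-untried pairs look increasingly
    attractive to the planner, which eventually drives the agent to re-test them."""
    return r_model + kappa * math.sqrt(tau)
```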

8.4 Prioritized Sweeping

In Dyna, state-action pairs are selected uniformly at random from all previously experienced pairs. But a uniform selection is usually not the best.

In general, we want to work back not just from goal states but from any state whose value has changed.

Suppose now that the agent discovers a change in the environment and changes its estimated value of one state, either up or down. Typically, this will imply that the values of many other states should also be changed.

In short, the idea behind prioritized sweeping is simple: when the value of some state changes, the values of related (predecessor) states are likely to need changing as well, so we update first the states whose potential change is largest.

Prioritized sweeping has been found to dramatically increase the speed at which optimal solutions are found in maze tasks, often by a factor of 5 to 10.

In the loop "for all S̄, Ā predicted to lead to S", it is in fact possible to consider only a necessary subset of the predecessors rather than all of them.
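Putting the pieces together, a sketch of the prioritized-sweeping planning phase might look as follows (assuming the same deterministic tabular model as above, a `predecessors` table mapping each state to the pairs predicted to lead to it, and `Q` stored as a `defaultdict(float)`; all names are mine):

```python
import heapq
import itertools

def prioritized_sweeping_planning(Q, model, predecessors, actions,
                                  n_updates=50, alpha=0.1, gamma=0.95, theta=1e-4):
    """Planning phase of prioritized sweeping (sketch).

    Q            : defaultdict(float), Q[(s, a)] -> action value
    model        : dict, (s, a) -> (r, s_next)   deterministic tabular model
    predecessors : dict, s -> iterable of (s_bar, a_bar) predicted to lead to s
    """
    tie = itertools.count()                        # tie-breaker so the heap never compares states
    pqueue = []

    def priority(s, a, r, s_next):
        return abs(r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])

    # Seed the queue with every modeled pair whose potential update exceeds theta.
    for (s, a), (r, s_next) in model.items():
        p = priority(s, a, r, s_next)
        if p > theta:
            heapq.heappush(pqueue, (-p, next(tie), (s, a)))   # max-heap via negated priority

    for _ in range(n_updates):
        if not pqueue:
            break
        _, _, (s, a) = heapq.heappop(pqueue)
        r, s_next = model[(s, a)]
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])

        # Re-prioritize predecessors of s; as noted above, this loop can be
        # restricted to a subset of the predecessors instead of all of them.
        for (s_bar, a_bar) in predecessors.get(s, ()):
            r_bar, _ = model[(s_bar, a_bar)]
            p = priority(s_bar, a_bar, r_bar, s)
            if p > theta:
                heapq.heappush(pqueue, (-p, next(tie), (s_bar, a_bar)))
    return Q
```

In the full algorithm the queue is also fed after every real step using the priority of the just-experienced pair; the sketch covers only the planning sweep driven by the queue.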

Finally, 8.7 Monte Carlo Tree Search is a fairly big topic; the book introduces some of the basic ideas, which are quite interesting, but I won't go into them here.

