RL paper misc

Off-Policy Reinforcement Learning with Gaussian Processes (JAS 2014)

Intro

Estimates the Q-function with a Gaussian process, proves convergence in the batch setting, and extends the theory and its application to the online setting.

Batch Off-Policy RL with a GP

$w^2$ is the noise variance of the Gaussian process's uncertainty measure. The paper offers an alternative interpretation of $w^2$ as a regularization term: it accounts for the fact that the current measurements are not necessarily drawn from the true model, and it therefore prevents the model from converging too quickly to an incorrect estimate of $Q^*$.
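
A minimal sketch of how a noise/regularization term like $w^2$ enters GP regression (illustrative code; the kernel, data, and function names are assumptions, not the paper's implementation):

```python
import numpy as np

def gp_posterior_mean(X, y, X_star, w2, length_scale=1.0):
    """GP posterior mean with noise variance w2 on the kernel diagonal.

    A larger w2 shrinks the fit toward the prior mean (0), acting like a
    regularizer that keeps the value estimate from committing too quickly
    to targets that may not come from the true model.
    """
    def rbf(A, B):
        # squared-exponential kernel
        d2 = (np.sum(A**2, axis=1)[:, None]
              + np.sum(B**2, axis=1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-0.5 * d2 / length_scale**2)

    K = rbf(X, X)
    K_star = rbf(X_star, X)
    # (K + w^2 I)^{-1} y : this is where w^2 plays the regularization role
    alpha = np.linalg.solve(K + w2 * np.eye(len(X)), y)
    return K_star @ alpha

# toy usage: state-action features X and bootstrapped Q targets y
X = np.random.randn(20, 3)
y = np.random.randn(20)
X_star = np.random.randn(5, 3)
print(gp_posterior_mean(X, y, X_star, w2=0.5))
```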

Reinforcement learning with Gaussian processes (ICML 2005)

Intro

Gaussian Process Temporal Difference (GPTD) learning offers a Bayesian solution to the policy evaluation problem of reinforcement learning.

GPTD leaves two problems unresolved: 1) it requires the MDP to be deterministic; 2) the policy must be optimal or near-optimal in order to train the value estimate.

The intrinsic randomness of the discounted reward comes from: 1) the randomness of the state sequence; 2) the randomness of the rewards.

$$D(\mathbf{x}) = R(\mathbf{x}) + \gamma D(\mathbf{x}'), \qquad \mathbf{x}' \sim p(\cdot \mid \mathbf{x})$$

Decompose $D$ into its mean and an uncertainty (residual) term:

$$D(\mathbf{x}) = V(\mathbf{x}) + \Delta V(\mathbf{x}), \qquad V(\mathbf{x}) = \mathbb{E}\big[D(\mathbf{x})\big],\; \Delta V(\mathbf{x}) = D(\mathbf{x}) - V(\mathbf{x})$$

Extrinsic uncertainty is modeled by defining $V(\mathbf{x})$ itself as a random process.
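
In GP terms this is the usual zero-mean prior over value functions (written with a generic kernel $k$; the notation here is an assumption consistent with the surrounding derivation):

$$V(\cdot) \sim \mathcal{GP}\big(0,\; k(\mathbf{x}, \mathbf{x}')\big)$$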

Substituting the decomposition into the recursion for $D(\mathbf{x})$ above:

$$R(\mathbf{x}) = V(\mathbf{x}) - \gamma V(\mathbf{x}') + \Delta V(\mathbf{x}) - \gamma \Delta V(\mathbf{x}')$$

Define the following noise process:

$$N(\mathbf{x}, \mathbf{x}') \;\overset{\text{def}}{=}\; \Delta V(\mathbf{x}) - \gamma \Delta V(\mathbf{x}'), \qquad \text{so that}\quad R(\mathbf{x}) = V(\mathbf{x}) - \gamma V(\mathbf{x}') + N(\mathbf{x}, \mathbf{x}')$$
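
Stacking these relations along a sampled trajectory $\mathbf{x}_0,\dots,\mathbf{x}_t$ gives the linear-Gaussian generative model behind GPTD. The vector form below is a sketch in the usual GPTD notation; treat the exact symbols and shapes as assumptions rather than a verbatim copy of the paper:

$$\mathbf{R}_{t-1} = \mathbf{H}_t \mathbf{V}_t + \mathbf{N}_t, \qquad
\mathbf{H}_t =
\begin{pmatrix}
1 & -\gamma & 0 & \cdots & 0 \\
0 & 1 & -\gamma & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & 1 & -\gamma
\end{pmatrix}$$

where $\mathbf{V}_t = \big(V(\mathbf{x}_0), \dots, V(\mathbf{x}_t)\big)^{\top}$, $\mathbf{R}_{t-1} = \big(R(\mathbf{x}_0), \dots, R(\mathbf{x}_{t-1})\big)^{\top}$, and $\mathbf{N}_t = \big(N(\mathbf{x}_0,\mathbf{x}_1), \dots, N(\mathbf{x}_{t-1},\mathbf{x}_t)\big)^{\top}$.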

Sample Efficient Reinforcement Learning with Gaussian Processes (ICML 2014)

Intro

Surveys how Gaussian processes have been used in both model-based and model-free RL.

The first model-free, continuous-state-space PAC-MDP algorithm using GPs: Delayed-GPQ (DGPQ).

  • DGPQ represents the current value function as a GP, and updates a separately stored value function only when sufficient outlier data has been detected.
  • This operation “overwrites” a portion of the stored value function and resets the GP confidence bounds, avoiding the slowed convergence rate of the naive model-free approach (a sketch of this delayed-update rule follows after this list).
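
A hedged Python sketch of the delayed-update idea (class name, thresholds, and the toy GP stand-in are illustrative assumptions, not the paper's pseudocode):

```python
import numpy as np

class DelayedGPQSketch:
    """Toy illustration of DGPQ-style delayed updates: keep an online GP of
    recent targets, and overwrite a separately stored value function only
    when the GP is confident and clearly disagrees with the stored value."""

    def __init__(self, q_max, epsilon, sigma_tol):
        self.q_stored = {}          # separately stored (optimistic) value function
        self.q_max = q_max          # optimistic default value
        self.epsilon = epsilon      # accuracy margin for overwriting
        self.sigma_tol = sigma_tol  # GP confidence required before an update
        self.gp_X, self.gp_y = [], []   # stand-in for the online GP dataset

    def stored_value(self, s_a):
        return self.q_stored.get(s_a, self.q_max)

    def gp_predict(self, s_a):
        # Placeholder for a GP posterior mean/std at s_a; a real version
        # would reuse kernel machinery like the gp_posterior_mean sketch above.
        if not self.gp_y:
            return self.q_max, np.inf
        ys = np.array(self.gp_y)
        return ys.mean(), 1.0 / np.sqrt(len(ys))

    def observe(self, s_a, td_target):
        self.gp_X.append(s_a)
        self.gp_y.append(td_target)
        mu, sigma = self.gp_predict(s_a)
        # Delayed update: overwrite only when the GP is confident AND its
        # estimate is well below the stored (optimistic) value.
        if sigma < self.sigma_tol and self.stored_value(s_a) - mu > 2 * self.epsilon:
            self.q_stored[s_a] = mu + self.epsilon
            self.gp_X, self.gp_y = [], []   # reset GP confidence after overwriting
```

The key design choice is that the stored value function changes only through these infrequent, confident overwrites, which is how the paper argues PAC-MDP sample efficiency is preserved.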

how to maintain PAC-MDP sample efficiency

GP-Sarsa: on-policy; uses a GP for the value function.

GPs for Model-Free RL