RL paper misc

Off-Policy Reinforcement Learning with Gaussian Processes (JAS 2014)

Intro

Estimates the Q-function with a Gaussian process, proves convergence in the batch setting, and extends the theory and its application to the online setting.

Batch Off-Policy RL with a GP

$w^2$ is the noise variance of the Gaussian process's uncertainty measure. The paper offers an alternative interpretation of $w^2$ as a regularization term: it accounts for the fact that the current measurements are not necessarily drawn from the true model, and it therefore prevents the model from converging too quickly to an incorrect estimate of $Q^*$.
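
A minimal sketch of how a noise/regularization term like $w^2$ enters GP regression (illustrative code; the kernel, data, and function names are assumptions, not the paper's implementation):

```python
import numpy as np

def gp_posterior_mean(X, y, X_star, w2, length_scale=1.0):
    """GP posterior mean with noise variance w2 on the kernel diagonal.

    A larger w2 shrinks the fit toward the prior mean (0), acting like a
    regularizer that keeps the value estimate from committing too quickly
    to targets that may not come from the true model.
    """
    def rbf(A, B):
        # squared-exponential kernel
        d2 = (np.sum(A**2, axis=1)[:, None]
              + np.sum(B**2, axis=1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-0.5 * d2 / length_scale**2)

    K = rbf(X, X)
    K_star = rbf(X_star, X)
    # (K + w^2 I)^{-1} y : this is where w^2 plays the regularization role
    alpha = np.linalg.solve(K + w2 * np.eye(len(X)), y)
    return K_star @ alpha

# toy usage: state-action features X and bootstrapped Q targets y
X = np.random.randn(20, 3)
y = np.random.randn(20)
X_star = np.random.randn(5, 3)
print(gp_posterior_mean(X, y, X_star, w2=0.5))
```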

Reinforcement learning with Gaussian processes (ICML 2005)

Intro

Gaussian Process Temporal Difference (GPTD) learning offers a Bayesian solution to the policy evaluation problem of reinforcement learning.

GPTD leaves two problems unresolved: 1) it requires the MDP to be deterministic; 2) the policy must be optimal or near-optimal in order to train the value estimate.

The intrinsic randomness of the discounted reward comes from: 1) the randomness of the state sequence; 2) the randomness of the rewards.

$$D(\mathbf{x}) = R(\mathbf{x}) + \gamma D(\mathbf{x}'), \qquad \mathbf{x}' \sim p(\cdot \mid \mathbf{x})$$

Decompose $D$ into its mean and an uncertainty (residual) term:

$$D(\mathbf{x}) = V(\mathbf{x}) + \Delta V(\mathbf{x}), \qquad V(\mathbf{x}) = \mathbb{E}\big[D(\mathbf{x})\big],\; \Delta V(\mathbf{x}) = D(\mathbf{x}) - V(\mathbf{x})$$

Extrinsic uncertainty is modeled by defining $V(\mathbf{x})$ itself as a random process.
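
In GP terms this is the usual zero-mean prior over value functions (written with a generic kernel $k$; the notation here is an assumption consistent with the surrounding derivation):

$$V(\cdot) \sim \mathcal{GP}\big(0,\; k(\mathbf{x}, \mathbf{x}')\big)$$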

Substituting the decomposition into the recursion for $D(\mathbf{x})$ above:

$$R(\mathbf{x}) = V(\mathbf{x}) - \gamma V(\mathbf{x}') + \Delta V(\mathbf{x}) - \gamma \Delta V(\mathbf{x}')$$

Define the following noise process:

$$N(\mathbf{x}, \mathbf{x}') \;\overset{\text{def}}{=}\; \Delta V(\mathbf{x}) - \gamma \Delta V(\mathbf{x}'), \qquad \text{so that}\quad R(\mathbf{x}) = V(\mathbf{x}) - \gamma V(\mathbf{x}') + N(\mathbf{x}, \mathbf{x}')$$
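
Stacking these relations along a sampled trajectory $\mathbf{x}_0,\dots,\mathbf{x}_t$ gives the linear-Gaussian generative model behind GPTD. The vector form below is a sketch in the usual GPTD notation; treat the exact symbols and shapes as assumptions rather than a verbatim copy of the paper:

$$\mathbf{R}_{t-1} = \mathbf{H}_t \mathbf{V}_t + \mathbf{N}_t, \qquad
\mathbf{H}_t =
\begin{pmatrix}
1 & -\gamma & 0 & \cdots & 0 \\
0 & 1 & -\gamma & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & 1 & -\gamma
\end{pmatrix}$$

where $\mathbf{V}_t = \big(V(\mathbf{x}_0), \dots, V(\mathbf{x}_t)\big)^{\top}$, $\mathbf{R}_{t-1} = \big(R(\mathbf{x}_0), \dots, R(\mathbf{x}_{t-1})\big)^{\top}$, and $\mathbf{N}_t = \big(N(\mathbf{x}_0,\mathbf{x}_1), \dots, N(\mathbf{x}_{t-1},\mathbf{x}_t)\big)^{\top}$.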

Sample Efficient Reinforcement Learning with Gaussian Processes (ICML 2014)

Intro

Surveys how Gaussian processes have been used in both model-based and model-free RL.

The first model-free, continuous-state-space PAC-MDP algorithm using GPs: Delayed-GPQ (DGPQ).

  • DGPQ represents the current value function as a GP, and updates a separately stored value function only when sufficient outlier data has been detected.
  • This operation “overwrites” a portion of the stored value function and resets the GP confidence bounds, avoiding the slowed convergence rate of the naive model-free approach (a sketch of this delayed-update rule follows after this list).
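
A hedged Python sketch of the delayed-update idea (class name, thresholds, and the toy GP stand-in are illustrative assumptions, not the paper's pseudocode):

```python
import numpy as np

class DelayedGPQSketch:
    """Toy illustration of DGPQ-style delayed updates: keep an online GP of
    recent targets, and overwrite a separately stored value function only
    when the GP is confident and clearly disagrees with the stored value."""

    def __init__(self, q_max, epsilon, sigma_tol):
        self.q_stored = {}          # separately stored (optimistic) value function
        self.q_max = q_max          # optimistic default value
        self.epsilon = epsilon      # accuracy margin for overwriting
        self.sigma_tol = sigma_tol  # GP confidence required before an update
        self.gp_X, self.gp_y = [], []   # stand-in for the online GP dataset

    def stored_value(self, s_a):
        return self.q_stored.get(s_a, self.q_max)

    def gp_predict(self, s_a):
        # Placeholder for a GP posterior mean/std at s_a; a real version
        # would reuse kernel machinery like the gp_posterior_mean sketch above.
        if not self.gp_y:
            return self.q_max, np.inf
        ys = np.array(self.gp_y)
        return ys.mean(), 1.0 / np.sqrt(len(ys))

    def observe(self, s_a, td_target):
        self.gp_X.append(s_a)
        self.gp_y.append(td_target)
        mu, sigma = self.gp_predict(s_a)
        # Delayed update: overwrite only when the GP is confident AND its
        # estimate is well below the stored (optimistic) value.
        if sigma < self.sigma_tol and self.stored_value(s_a) - mu > 2 * self.epsilon:
            self.q_stored[s_a] = mu + self.epsilon
            self.gp_X, self.gp_y = [], []   # reset GP confidence after overwriting
```

The key design choice is that the stored value function changes only through these infrequent, confident overwrites, which is how the paper argues PAC-MDP sample efficiency is preserved.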

how to maintain PAC-MDP sample efficiency

GP-Sarsa: on-policy; uses a GP for the value function.

GPs for Model-Free RL