Transfer Reinforcement Learning Paper Notes (1): Successor Features
1. Background and problem formulation
Consider an MDP
$$M \equiv(\mathcal{S}, \mathcal{A}, p, R, \gamma)$$
where
$\mathcal{S}$: the state space;
$\mathcal{A}$: the action space;
$p$: the transition distribution $p(\cdot\mid s_t,a_t)$;
$R$: the reward $R(s_t,a_t,s_{t+1})$;
$\gamma$: the discount factor.
2. Successor features
Assume the reward function can be written as
$$r\left(s, a, s^{\prime}\right)=\boldsymbol{\phi}\left(s, a, s^{\prime}\right)^{\top} \mathbf{w},$$
where $\boldsymbol{\phi}(s,a,s')$ is a $d$-dimensional feature vector and $\mathbf{w}$ is the corresponding weight vector. With this form we obtain the following result (defining $\boldsymbol{\phi}_{t+1}=\boldsymbol{\phi}(s_t,a_t,s_{t+1})$):
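As a concrete sketch of this decomposition (the feature map and weight vectors below are hypothetical, not from the paper), several tasks can share one $\boldsymbol{\phi}$ and differ only in $\mathbf{w}$:

```python
import numpy as np

# Hypothetical 3-dimensional transition features phi(s, a, s').
def phi(s, a, s_next):
    return np.array([float(s_next == 0), float(s_next == 1), float(a == 1)])

# Two tasks differ only in their weight vectors w.
w_task1 = np.array([1.0, 0.0, -0.1])   # reward for reaching state 0, small action cost
w_task2 = np.array([0.0, 1.0, -0.1])   # reward for reaching state 1 instead

features = phi(s=2, a=1, s_next=0)
r1 = features @ w_task1   # reward of the same transition under task 1
r2 = features @ w_task2   # ... and under task 2
print(r1, r2)             # approximately 0.9 and -0.1
```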
$$\begin{aligned} Q^\pi(s, a) & =\mathrm{E}^\pi\left[r_{t+1}+\gamma r_{t+2}+\ldots \mid S_t=s, A_t=a\right] \\ & =\mathrm{E}^\pi\left[\boldsymbol{\phi}_{t+1}^{\top} \mathbf{w}+\gamma \boldsymbol{\phi}_{t+2}^{\top} \mathbf{w}+\ldots \mid S_t=s, A_t=a\right] \\ & =\mathrm{E}^\pi\left[\sum_{i=t}^{\infty} \gamma^{i-t} \boldsymbol{\phi}_{i+1} \mid S_t=s, A_t=a\right]^{\top} \mathbf{w}=\boldsymbol{\psi}^\pi(s, a)^{\top} \mathbf{w} . \end{aligned}$$
$\boldsymbol{\psi}^{\pi}(s,a)$ is called the Successor Features (SFs) of $(s,a)$ under policy $\pi$.
By definition,
$$\boldsymbol{\psi}^{\pi}(s,a)=\mathrm{E}^{\pi}\left[\boldsymbol{\phi}_{t+1}+\gamma \boldsymbol{\phi}_{t+2}+\gamma^2\boldsymbol{\phi}_{t+3}+\cdots\mid S_t=s,A_t=a\right],$$
which yields the following Bellman equation:
$$\begin{aligned} \boldsymbol{\psi}^{\pi}(s,a)&=\mathrm{E}^{\pi}\left[\boldsymbol{\phi}_{t+1}+\gamma \boldsymbol{\phi}_{t+2}+\gamma^2\boldsymbol{\phi}_{t+3}+\cdots\mid S_t=s,A_t=a\right]\\ &=\mathrm{E}_{S_{t+1},A_{t+1}}\left[\boldsymbol{\phi}_{t+1}+\gamma\,\boldsymbol{\psi}^{\pi}(S_{t+1},A_{t+1})\mid S_t=s,A_t=a\right]\\ &=\mathrm{E}_{S_{t+1}}\left[\boldsymbol{\phi}_{t+1}+\gamma\,\boldsymbol{\psi}^{\pi}(S_{t+1},\pi(S_{t+1}))\mid S_t=s,A_t=a\right] \quad \text{if } \pi \text{ is deterministic.} \end{aligned}$$
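For a small tabular MDP this recursion can be run as a fixed-point iteration. A minimal sketch, assuming a known transition model `P`, feature tensor `Phi`, and deterministic policy `pi` (all hypothetical placeholders):

```python
import numpy as np

n_s, n_a, d = 3, 2, 2          # states, actions, feature dimension
gamma = 0.9
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s'] transition probs
Phi = rng.normal(size=(n_s, n_a, n_s, d))          # Phi[s, a, s'] = phi(s, a, s')
pi = np.zeros(n_s, dtype=int)                      # a fixed deterministic policy

psi = np.zeros((n_s, n_a, d))
for _ in range(500):                               # iterate the SF Bellman equation
    # expected next feature: E_{s'}[phi(s, a, s')]
    phi_exp = np.einsum('sap,sapd->sad', P, Phi)
    # expected next SF under pi: E_{s'}[psi(s', pi(s'))]
    psi_next = np.einsum('sap,pd->sad', P, psi[np.arange(n_s), pi])
    psi = phi_exp + gamma * psi_next

# Q^pi for any task w is then just a dot product
w = np.array([1.0, -0.5])
Q = psi @ w
```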
The above recursion lets us solve for $\boldsymbol{\psi}^{\pi}(s,a)$ iteratively, while estimating $\mathbf{w}$ is a supervised learning problem that many machine learning algorithms can handle.
Thus, for each new task, only a new $\mathbf{w}$ has to be solved for.
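Estimating $\mathbf{w}$ from observed transitions is then ordinary regression on $(\boldsymbol{\phi}, r)$ pairs. A sketch with synthetic data (in practice any regression method works):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
w_true = rng.normal(size=d)

# Simulate logged transitions: features phi_t and rewards r_t = phi_t^T w (+ noise).
Phi = rng.normal(size=(200, d))                  # one row per observed transition
r = Phi @ w_true + 0.01 * rng.normal(size=200)   # small observation noise

# Supervised learning of w: least squares  min_w ||Phi w - r||^2
w_hat, *_ = np.linalg.lstsq(Phi, r, rcond=None)
print(np.max(np.abs(w_hat - w_true)))            # close to zero
```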
3. Generalized policy improvement
In the paper the authors also prove a generalization error bound for transfer reinforcement learning.
Theorem 1. (Generalized Policy Improvement) Let $\pi_1, \pi_2, \ldots, \pi_n$ be $n$ decision policies and let $\tilde{Q}^{\pi_1}, \tilde{Q}^{\pi_2}, \ldots, \tilde{Q}^{\pi_n}$ be approximations of their respective action-value functions such that
$$\left|Q^{\pi_i}(s, a)-\tilde{Q}^{\pi_i}(s, a)\right| \leq \epsilon \ \text{ for all } s \in \mathcal{S},\ a \in \mathcal{A},\ \text{and } i \in\{1,2, \ldots, n\}.$$
Define
$$\pi(s) \in \underset{a}{\operatorname{argmax}} \max_i \tilde{Q}^{\pi_i}(s, a).$$
Then,
$$Q^\pi(s, a) \geq \max_i Q^{\pi_i}(s, a)-\frac{2}{1-\gamma} \epsilon$$
for any $s \in \mathcal{S}$ and $a \in \mathcal{A}$, where $Q^\pi$ is the action-value function of $\pi$.
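The GPI policy $\pi(s) \in \operatorname{argmax}_a \max_i \tilde{Q}^{\pi_i}(s, a)$ is a one-liner over a stack of Q-tables. A sketch with random tables standing in for the approximations $\tilde{Q}^{\pi_i}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n_policies, n_s, n_a = 3, 4, 2
Q_tilde = rng.normal(size=(n_policies, n_s, n_a))   # Q_tilde[i, s, a] ~ Q^{pi_i}(s, a)

# max over the n policies, then argmax over actions
Q_gpi = Q_tilde.max(axis=0)          # shape (n_s, n_a)
pi = Q_gpi.argmax(axis=1)            # GPI policy: one action per state
```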
Proof: To simplify notation, define
$$Q_{\max}(s,a)=\max_{i}Q^{\pi_i}(s,a) \quad \text{(the maximum of the true action-value functions over the policies } \pi_i\text{)},$$
$$\tilde{Q}_{\max}(s,a)=\max_{i}\tilde{Q}^{\pi_{i}}(s,a) \quad \text{(the corresponding maximum of their approximations)}.$$
With this notation we have the inequality
$$\left|Q_{\max }(s, a)-\tilde{Q}_{\max }(s, a)\right|=\left|\max_i Q^{\pi_i}(s, a)-\max_i \tilde{Q}^{\pi_i}(s, a)\right| \leq \max_i\left|Q^{\pi_i}(s, a)-\tilde{Q}^{\pi_i}(s, a)\right| \leq \epsilon .$$
It follows that
$$Q_{\max }(s, a)-\epsilon \leq\tilde{Q}_{\max }(s, a).$$
We will use the Bellman operator $T^{\pi}$, defined by
$$T^{\pi}f(s,a)=r(s,a)+\gamma\,\mathrm{E}_{s'\sim p(s'\mid s,a)}[V(s')],$$
$$V(s')=\mathrm{E}_{a\sim \pi(a\mid s')}[f(s',a)],$$
$$r(s,a)=\mathrm{E}_{s'\sim p(s'\mid s,a)}[r(s,a,s')].$$
Since we use the deterministic policy $\pi$ (which at each state picks the action with the largest estimated value across all policies),
$$V(s')=f(s',\pi(s')).$$
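For a deterministic policy the operator can be written directly. A tabular sketch, with hypothetical `r`, `P`, and `pi`:

```python
import numpy as np

def bellman_operator(f, r, P, pi, gamma):
    """Apply T^pi to a Q-table f: (T^pi f)(s,a) = r(s,a) + gamma * E_{s'}[f(s', pi(s'))]."""
    n_s = f.shape[0]
    v = f[np.arange(n_s), pi]            # V(s') = f(s', pi(s')) for deterministic pi
    return r + gamma * P @ v             # P[s, a, s'] contracts with v[s']

rng = np.random.default_rng(3)
n_s, n_a, gamma = 3, 2, 0.9
r = rng.normal(size=(n_s, n_a))
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))
pi = np.array([0, 1, 0])

# Repeated application converges to Q^pi (T^pi is a gamma-contraction in the sup norm).
Q = np.zeros((n_s, n_a))
for _ in range(500):
    Q = bellman_operator(Q, r, P, pi, gamma)
```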
For any $(s,a)\in \mathcal{S} \times \mathcal{A}$ and any policy $\pi_{i}$, the following holds:
$$\begin{aligned} T^\pi \tilde{Q}_{\max }(s, a) & =r(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s,a\right) \tilde{Q}_{\max }\left(s^{\prime}, \pi\left(s^{\prime}\right)\right) \\ & =r(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) \max _b \tilde{Q}_{\max }\left(s^{\prime}, b\right) \\ & \geq r(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) \max _b Q_{\max }\left(s^{\prime}, b\right)-\gamma \epsilon \\ & \geq r(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) Q_{\max }\left(s^{\prime}, \pi_i\left(s^{\prime}\right)\right)-\gamma \epsilon \\ & \geq r(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) Q^{\pi_i}\left(s^{\prime}, \pi_i\left(s^{\prime}\right)\right)-\gamma \epsilon \\ & =T^{\pi_i} Q^{\pi_i}(s, a)-\gamma \epsilon \\ & =Q^{\pi_i}(s, a)-\gamma \epsilon . \end{aligned}$$
Since $T^\pi \tilde{Q}_{\max }(s, a)\geq Q^{\pi_i}(s, a)-\gamma \epsilon$ holds for every policy $\pi_i$,
$$\begin{aligned} T^\pi \tilde{Q}_{\max }(s, a)&\geq \max_{i}Q^{\pi_{i}}(s,a)-\gamma \epsilon = Q_{\max}(s,a)-\gamma\epsilon\\ &\geq \tilde{Q}_{\max }(s, a)-\epsilon-\gamma\epsilon = \tilde{Q}_{\max }(s, a)-(1+\gamma)\epsilon, \end{aligned}$$
where the last step uses $Q_{\max}(s,a)\geq\tilde{Q}_{\max}(s,a)-\epsilon$.
To reach the final conclusion we also need the following fact:
$$\begin{aligned} T^{\pi}(f(s,a)+c)&=r(s,a)+\gamma\,\mathrm{E}_{s'\sim p(s'\mid s,a)}[f(s',\pi(s'))+c]\\ &=r(s,a)+\gamma\,\mathrm{E}_{s'\sim p(s'\mid s,a)}[f(s',\pi(s'))]+\gamma\cdot c\\ &=T^{\pi}(f(s,a))+\gamma\cdot c . \end{aligned}$$
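This constant-shift identity is easy to check numerically with a tabular $T^{\pi}$ (all quantities below are hypothetical placeholders):

```python
import numpy as np

rng = np.random.default_rng(4)
n_s, n_a, gamma, c = 3, 2, 0.9, 5.0
r = rng.normal(size=(n_s, n_a))
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s'], rows sum to 1
pi = np.array([1, 0, 1])
f = rng.normal(size=(n_s, n_a))

def T(f):
    # (T^pi f)(s, a) = r(s, a) + gamma * E_{s'}[f(s', pi(s'))]
    return r + gamma * P @ f[np.arange(n_s), pi]

lhs = T(f + c)            # T^pi applied to the shifted function
rhs = T(f) + gamma * c    # equals T^pi f plus the discounted constant
print(np.max(np.abs(lhs - rhs)))   # ~0 (up to floating point)
```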
Hence, applying $T^{\pi}$ repeatedly,
$$\begin{aligned} T^{\pi}\tilde{Q}_{\max }(s, a)&\geq \tilde{Q}_{\max }(s, a)-(1+\gamma)\epsilon\\ T^{\pi}(T^{\pi}\tilde{Q}_{\max }(s, a))&\geq T^{\pi}\tilde{Q}_{\max }(s, a)-\gamma(1+\gamma)\epsilon\\ &\;\;\vdots\\ (T^{\pi})^{k}\tilde{Q}_{\max }(s, a)&\geq (T^{\pi})^{k-1}\tilde{Q}_{\max }(s, a)-\gamma^{k-1}(1+\gamma)\epsilon . \end{aligned}$$
Chaining these inequalities and letting $k$ tend to infinity gives
$$\begin{aligned} Q^\pi(s, a) & =\lim _{k \rightarrow \infty}\left(T^\pi\right)^k \tilde{Q}_{\max }(s, a) \\ & \geq \tilde{Q}_{\max }(s, a)-\frac{1+\gamma}{1-\gamma} \epsilon \\ & \geq Q_{\max }(s, a)-\epsilon-\frac{1+\gamma}{1-\gamma} \epsilon\\ & = \max _i Q^{\pi_i}(s, a)-\frac{2}{1-\gamma} \epsilon . \end{aligned}$$
This completes the proof.
To prove the final transfer error bound, we also need the following lemma.
Lemma 1. Let $\delta_{ij}=\max_{s, a}\left|r_i(s, a)-r_j(s, a)\right|$. Then,
$$Q_i^{\pi_i^*}(s, a)-Q_i^{\pi_j^*}(s, a) \leq \frac{2 \delta_{ij}}{1-\gamma} .$$
Proof: To simplify notation, let $Q_i^j(s, a) \equiv Q_i^{\pi_j^*}(s, a)$. Then
$$\begin{aligned} Q_i^i(s, a)-Q_i^j(s, a) & =Q_i^i(s, a)-Q_j^j(s, a)+Q_j^j(s, a)-Q_i^j(s, a) \\ & \leq\left|Q_i^i(s, a)-Q_j^j(s, a)\right|+\left|Q_j^j(s, a)-Q_i^j(s, a)\right| . \end{aligned}$$
Let $\Delta_{ij}=\max_{s, a}\left|Q_i^i(s, a)-Q_j^j(s, a)\right|$. For the first term,
$$\begin{aligned} \left|Q_i^i(s, a)-Q_j^j(s, a)\right| & =\left|r_i(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) \max _b Q_i^i\left(s^{\prime}, b\right)-r_j(s, a)-\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) \max _b Q_j^j\left(s^{\prime}, b\right)\right| \\ & =\left|r_i(s, a)-r_j(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right)\left(\max _b Q_i^i\left(s^{\prime}, b\right)-\max _b Q_j^j\left(s^{\prime}, b\right)\right)\right| \\ & \leq\left|r_i(s, a)-r_j(s, a)\right|+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right)\left|\max _b Q_i^i\left(s^{\prime}, b\right)-\max _b Q_j^j\left(s^{\prime}, b\right)\right| \\ & \leq\left|r_i(s, a)-r_j(s, a)\right|+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) \max _b\left|Q_i^i\left(s^{\prime}, b\right)-Q_j^j\left(s^{\prime}, b\right)\right| \\ & \leq \delta_{ij}+\gamma \Delta_{ij} . \end{aligned}$$
Taking the maximum over $(s,a)$ on the left-hand side gives $\Delta_{ij} \leq \delta_{ij}+\gamma\Delta_{ij}$, and hence
$$\Delta_{ij} \leq \frac{1}{1-\gamma} \delta_{ij} .$$
Next define $\Delta_{ij}^{\prime}=\max_{s, a}\left|Q_j^j(s, a)-Q_i^j(s, a)\right|$. For the second term,
$$\begin{aligned} \left|Q_j^j(s, a)-Q_i^j(s, a)\right| & =\left|r_j(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) Q_j^j\left(s^{\prime}, \pi_j^*\left(s^{\prime}\right)\right)-r_i(s, a)-\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) Q_i^j\left(s^{\prime}, \pi_j^*\left(s^{\prime}\right)\right)\right| \\ & =\left|r_j(s, a)-r_i(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right)\left(Q_j^j\left(s^{\prime}, \pi_j^*\left(s^{\prime}\right)\right)-Q_i^j\left(s^{\prime}, \pi_j^*\left(s^{\prime}\right)\right)\right)\right| \\ & \leq\left|r_i(s, a)-r_j(s, a)\right|+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right)\left|Q_j^j\left(s^{\prime}, \pi_j^*\left(s^{\prime}\right)\right)-Q_i^j\left(s^{\prime}, \pi_j^*\left(s^{\prime}\right)\right)\right| \\ & \leq \delta_{ij}+\gamma \Delta_{ij}^{\prime} . \end{aligned}$$
In the same way, $\Delta_{ij}^{\prime} \leq \frac{1}{1-\gamma} \delta_{ij}$. Combining the two bounds,
$$Q_i^i(s, a)-Q_i^j(s, a) \leq \Delta_{ij}+\Delta_{ij}^{\prime} \leq \frac{2\delta_{ij}}{1-\gamma},$$
which completes the proof.
Theorem 2. Let $M_i \in \mathcal{M}^\phi$ and let $Q_i^{\pi_j^*}$ be the value function of an optimal policy of $M_j \in \mathcal{M}^\phi$ when executed in $M_i$. Given the set $\left\{\tilde{Q}_i^{\pi_1^*}, \tilde{Q}_i^{\pi_2^*}, \ldots, \tilde{Q}_i^{\pi_n^*}\right\}$ such that
$$\left|Q_i^{\pi_j^*}(s, a)-\tilde{Q}_i^{\pi_j^*}(s, a)\right| \leq \epsilon \ \text{ for all } s \in \mathcal{S},\ a \in \mathcal{A},\ \text{and } j \in\{1,2, \ldots, n\},$$
let
$$\pi(s) \in \underset{a}{\operatorname{argmax}} \max _j \tilde{Q}_i^{\pi_j^*}(s, a) .$$
Finally, let $\phi_{\max }=\max _{s, a}\|\boldsymbol{\phi}(s, a)\|$, where $\|\cdot\|$ is the norm induced by the inner product adopted. Then,
$$Q_i^*(s, a)-Q_i^\pi(s, a) \leq \frac{2}{1-\gamma}\left(\phi_{\max } \min _j\left\|\mathbf{w}_i-\mathbf{w}_j\right\|+\epsilon\right) .$$
Proof: By Theorem 1 and Lemma 1, for any $j$,
$$\begin{aligned} Q_i^*(s, a)-Q_i^\pi(s, a) & \leq Q_i^*(s, a)-Q_i^{\pi_j^*}(s, a)+\frac{2}{1-\gamma} \epsilon \\ & \leq \frac{2}{1-\gamma} \max _{s, a}\left|r_i(s, a)-r_j(s, a)\right|+\frac{2}{1-\gamma} \epsilon \\ & =\frac{2}{1-\gamma} \max _{s, a}\left|\boldsymbol{\phi}(s, a)^{\top} \mathbf{w}_i-\boldsymbol{\phi}(s, a)^{\top} \mathbf{w}_j\right|+\frac{2}{1-\gamma} \epsilon \\ & =\frac{2}{1-\gamma} \max _{s, a}\left|\boldsymbol{\phi}(s, a)^{\top}\left(\mathbf{w}_i-\mathbf{w}_j\right)\right|+\frac{2}{1-\gamma} \epsilon \\ & \leq \frac{2}{1-\gamma} \max _{s, a}\|\boldsymbol{\phi}(s, a)\|\left\|\mathbf{w}_i-\mathbf{w}_j\right\|+\frac{2}{1-\gamma} \epsilon \\ & =\frac{2 \phi_{\max }}{1-\gamma}\left\|\mathbf{w}_i-\mathbf{w}_j\right\|+\frac{2}{1-\gamma} \epsilon . \end{aligned}$$
Since $j$ is arbitrary, taking the minimum over $j$ yields the stated bound.
Original article: https://blog.csdn.net/weixin_54255111/article/details/137741809