
Policy gradient method


Policy gradient methods are a class of reinforcement learning algorithms.

Policy gradient methods are a sub-class of policy optimization methods. Unlike value-based methods, which learn a value function to derive a policy, policy optimization methods directly learn a policy function $\pi$ that selects actions without consulting a value function. For policy gradient to apply, the policy function $\pi_\theta$ is parameterized by a differentiable parameter $\theta$.

Overview

In policy-based RL, the actor is a parameterized policy function $\pi_\theta$, where $\theta$ are the parameters of the actor. The actor takes as argument the state of the environment $s$ and produces a probability distribution $\pi_\theta(\cdot \mid s)$.

If the action space is discrete, then $\sum_a \pi_\theta(a \mid s) = 1$. If the action space is continuous, then $\int_a \pi_\theta(a \mid s)\,da = 1$.
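
As a concrete illustration, here is a minimal sketch of a discrete (softmax) policy in Python. The tabular parameterization and the state/action counts are illustrative assumptions, not part of the general definition; the point is only that the output is a valid probability distribution over actions.

```python
import numpy as np

def softmax_policy(theta, s):
    """pi_theta(. | s) for a tabular softmax policy over a discrete action space.

    theta: array of shape (n_states, n_actions), the differentiable parameters.
    s:     integer state index.
    Returns a probability vector over actions that sums to 1.
    """
    logits = theta[s] - theta[s].max()   # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

theta = np.zeros((4, 2))                 # 4 states, 2 actions (made-up sizes)
print(softmax_policy(theta, 0))          # [0.5 0.5] -- sums to 1
```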

The goal of policy optimization is to find some $\theta$ that maximizes the expected episodic reward $J(\theta)$:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t\in 0:T} \gamma^t R_t \;\Big|\; S_0 = s_0\right]$$

where $\gamma$ is the discount factor, $R_t$ is the reward at step $t$, $s_0$ is the starting state, and $T$ is the time-horizon (which can be infinite).
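
For a single sampled episode, the quantity inside this expectation is just the discounted sum of the observed rewards; averaging it over many rollouts gives a Monte Carlo estimate of $J(\theta)$. A minimal sketch (the reward list is made up):

```python
def episodic_return(rewards, gamma):
    """Discounted return sum_t gamma^t * R_t for one sampled episode."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(episodic_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81*2.0 = 2.62
```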

The policy gradient is defined as $\nabla_\theta J(\theta)$. Different policy gradient methods stochastically estimate the policy gradient in different ways. The goal of any policy gradient method is to iteratively maximize $J(\theta)$ by gradient ascent. Since the key part of any policy gradient method is the stochastic estimation of the policy gradient, they are also studied under the title of "Monte Carlo gradient estimation".

REINFORCE

Policy gradient

The REINFORCE algorithm was the first policy gradient method. It is based on the identity for the policy gradient

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{j\in 0:T}\nabla_\theta \ln \pi_\theta(A_j\mid S_j)\,\sum_{i\in 0:T}(\gamma^i R_i)\;\Big|\;S_0=s_0\right]$$

which can be improved via the "causality trick"

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{j\in 0:T}\nabla_\theta \ln \pi_\theta(A_j\mid S_j)\,\sum_{i\in j:T}(\gamma^i R_i)\;\Big|\;S_0=s_0\right]$$

Lemma — The expectation of the score function is zero, conditional on any present or past state. That is, for any $0\le i\le j\le T$ and any state $s_i$, we have

$$\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \ln \pi_\theta(A_j\mid S_j)\;\Big|\;S_i=s_i\right]=0.$$

Further, if $\Psi_i$ is a random variable that is independent of $A_i, S_{i+1}, A_{i+1},\dots$, then

$$\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \ln \pi_\theta(A_j\mid S_j)\,\Psi_i\;\Big|\;S_i=s_i\right]=0.$$

Proof

Use the log-derivative trick $\nabla_\theta \pi_\theta(a\mid s)=\pi_\theta(a\mid s)\,\nabla_\theta \ln \pi_\theta(a\mid s)$.

$$\begin{aligned}
\mathbb{E}_{\pi_\theta}\left[\nabla_\theta\ln\pi_\theta(A_j\mid S_j)\;\Big|\;S_i=s_i\right]
&=\sum_s \Pr(S_j=s\mid S_i=s_i)\sum_a \pi_\theta(a\mid s)\,\nabla_\theta\ln\pi_\theta(a\mid s)\\
&=\sum_s \Pr(S_j=s\mid S_i=s_i)\sum_a \pi_\theta(a\mid s)\,\frac{\nabla_\theta\pi_\theta(a\mid s)}{\pi_\theta(a\mid s)}\\
&=\sum_s \Pr(S_j=s\mid S_i=s_i)\sum_a \nabla_\theta\pi_\theta(a\mid s)\\
&=\sum_s \Pr(S_j=s\mid S_i=s_i)\,\nabla_\theta\sum_a \pi_\theta(a\mid s)
\end{aligned}$$

Since the policy $\pi_\theta(a\mid s)$ is a probability distribution over actions for a given state, $\sum_a \pi_\theta(a\mid s)=1$. Therefore

$$\begin{aligned}
\mathbb{E}_{\pi_\theta}\left[\nabla_\theta\ln\pi_\theta(A_j\mid S_j)\;\Big|\;S_i=s_i\right]
&=\sum_s \Pr(S_j=s\mid S_i=s_i)\,\nabla_\theta(1)\\
&=\sum_s \Pr(S_j=s\mid S_i=s_i)\cdot 0\\
&=0
\end{aligned}$$

For the second claim, apply the tower law together with the result just proved:

$$\begin{aligned}
\mathbb{E}_{\pi_\theta}\left[\Psi_i\,\nabla_\theta\ln\pi_\theta(A_j\mid S_j)\;\Big|\;S_i=s_i\right]
&=\mathbb{E}_{\pi_\theta}\left[\mathbb{E}_{\pi_\theta}\left[\Psi_i\,\nabla_\theta\ln\pi_\theta(A_j\mid S_j)\,\big|\,S_j\right]\;\Big|\;S_i=s_i\right]\\
&=\mathbb{E}_{\pi_\theta}\left[\Psi_i\,\mathbb{E}_{\pi_\theta}\left[\nabla_\theta\ln\pi_\theta(A_j\mid S_j)\,\big|\,S_j\right]\;\Big|\;S_i=s_i\right]\\
&=\mathbb{E}_{\pi_\theta}\left[\Psi_i\cdot 0\;\Big|\;S_i=s_i\right]\\
&=0
\end{aligned}$$

Proof

Applying the log-derivative trick to the probability of the whole trajectory,

$$\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta\,\mathbb{E}_{\pi_\theta}\left[\sum_{i\in 0:T}\gamma^i R_i\;\Big|\;S_0=s_0\right]\\
&= \mathbb{E}_{\pi_\theta}\left[\Big(\sum_{i\in 0:T}\gamma^i R_i\Big)\,\nabla_\theta \ln\big(\pi_\theta(A_0,A_1,\dots,A_T\mid S_0,S_1,\dots,S_T)\big)\;\Big|\;S_0=s_0\right]\\
&= \mathbb{E}_{\pi_\theta}\left[\Big(\sum_{i\in 0:T}\gamma^i R_i\Big)\sum_{j\in 0:T}\nabla_\theta \ln \pi_\theta(A_j\mid S_j)\;\Big|\;S_0=s_0\right]\\
&= \mathbb{E}_{\pi_\theta}\left[\sum_{i,j\in 0:T}(\gamma^i R_i)\,\nabla_\theta\ln\pi_\theta(A_j\mid S_j)\;\Big|\;S_0=s_0\right]
\end{aligned}$$

which is the first equation.

By the lemma, $\mathbb{E}_{\pi_\theta}\left[(\gamma^i R_i)\,\nabla_\theta\ln\pi_\theta(A_j\mid S_j)\;\Big|\;S_0=s_0\right]=0$ for any $0\le i<j\le T$. Plugging this into the previous formula, we zero out a whole triangle of terms, to get

$$\begin{aligned}
\nabla_\theta J(\theta) &= \mathbb{E}_{\pi_\theta}\left[\sum_{0\le j\le i\le T}(\gamma^i R_i)\,\nabla_\theta\ln\pi_\theta(A_j\mid S_j)\;\Big|\;S_0=s_0\right]\\
&= \mathbb{E}_{\pi_\theta}\left[\sum_{j\in 0:T}\nabla_\theta\ln\pi_\theta(A_j\mid S_j)\sum_{i\in j:T}(\gamma^i R_i)\;\Big|\;S_0=s_0\right]
\end{aligned}$$

which is the second equation.

Thus, we have an unbiased estimator of the policy gradient:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{k=1}^{N}\left[\sum_{j\in 0:T}\nabla_\theta\ln\pi_\theta(A_{j,k}\mid S_{j,k})\sum_{i\in j:T}(\gamma^i R_{i,k})\right]$$

where the index $k$ ranges over $N$ rollout trajectories sampled using the policy $\pi_\theta$.
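
A minimal sketch of the "reward-to-go" weights $\sum_{i\in j:T}\gamma^i R_i$ that multiply each score term in this estimator. Note that, to match the formula above, the discount $\gamma^i$ is counted from the start of the episode rather than from step $j$.

```python
def rewards_to_go(rewards, gamma):
    """For each step j, compute sum_{i=j}^{T} gamma^i * R_i (discounted from episode start)."""
    T = len(rewards)
    out = [0.0] * T
    running = 0.0
    for j in reversed(range(T)):
        running += gamma**j * rewards[j]
        out[j] = running
    return out

print(rewards_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 0.75, 0.25]
```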

The score function $\nabla_\theta\ln\pi_\theta(A_t\mid S_t)$ can be interpreted as the direction in parameter space that increases the probability of taking action $A_t$ in state $S_t$. The policy gradient is then a weighted average of all possible directions for increasing the probability of taking any action in any state, weighted by the reward signal: if taking a certain action in a certain state is associated with high reward, that direction is strongly reinforced, and vice versa.

Algorithm

The REINFORCE algorithm is a loop:

  1. Roll out $N$ trajectories in the environment, using $\pi_{\theta_t}$ as the policy function.
  2. Compute the policy gradient estimate: $g_t \leftarrow \frac{1}{N}\sum_{k=1}^{N}\left[\sum_{j\in 0:T}\nabla_{\theta_t}\ln\pi_{\theta_t}(A_{j,k}\mid S_{j,k})\sum_{i\in j:T}(\gamma^i R_{i,k})\right]$
  3. Update the policy by gradient ascent: $\theta_{t+1}\leftarrow\theta_t+\alpha_t g_t$

Here, $\alpha_t$ is the learning rate at update step $t$.
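
The loop can be put together in a short, self-contained sketch. The toy chain environment, the tabular softmax policy, and all hyperparameters below are illustrative assumptions; the analytic score function of the softmax policy is used in place of automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, HORIZON = 5, 2, 20          # toy chain MDP (made up)
GAMMA, LR, N_TRAJ, N_UPDATES = 0.99, 0.1, 16, 200

def step(s, a):
    """Toy dynamics: action 1 moves right, action 0 moves left; reward 1 for reaching the last state."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == N_STATES - 1)

def policy(theta, s):
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def score(theta, s, a):
    """grad_theta ln pi_theta(a|s) for a tabular softmax policy (nonzero only in row s)."""
    g = np.zeros_like(theta)
    g[s] = -policy(theta, s)
    g[s, a] += 1.0
    return g

theta = np.zeros((N_STATES, N_ACTIONS))
for t in range(N_UPDATES):
    g_t = np.zeros_like(theta)
    for _ in range(N_TRAJ):                      # 1. roll out N trajectories
        s, traj = 0, []
        for _ in range(HORIZON):
            a = rng.choice(N_ACTIONS, p=policy(theta, s))
            s_next, r = step(s, a)
            traj.append((s, a, r))
            s = s_next
        to_go = 0.0                              # 2. sum_j grad ln pi(A_j|S_j) * sum_{i>=j} gamma^i R_i
        for j in reversed(range(len(traj))):
            s_j, a_j, r_j = traj[j]
            to_go += GAMMA**j * r_j
            g_t += score(theta, s_j, a_j) * to_go
    theta += LR * g_t / N_TRAJ                   # 3. gradient ascent step

print(policy(theta, 0))  # probability of moving right from the start state should be high
```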

Variance reduction

REINFORCE is an on-policy algorithm, meaning that the trajectories used for the update must be sampled from the current policy $\pi_\theta$. Its gradient estimate can have high variance, as the returns $R(\tau)$ can vary significantly between sampled trajectories. Many variants of REINFORCE have been introduced under the heading of variance reduction.

REINFORCE with baseline

A common way of reducing variance is the REINFORCE with baseline algorithm, based on the following identity:

$$\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\left[\sum_{j\in 0:T}\nabla_\theta\ln\pi_\theta(A_j\mid S_j)\left(\sum_{i\in j:T}(\gamma^i R_i)-b(S_j)\right)\;\Big|\;S_0=s_0\right]$$

for any function $b:{\text{States}}\to\mathbb{R}$. This can be proven by applying the previous lemma.

The algorithm uses the modified gradient estimator

$$g_t \leftarrow \frac{1}{N}\sum_{k=1}^{N}\left[\sum_{j\in 0:T}\nabla_{\theta_t}\ln\pi_{\theta_t}(A_{j,k}\mid S_{j,k})\left(\sum_{i\in j:T}(\gamma^i R_{i,k})-b_t(S_{j,k})\right)\right]$$

and the original REINFORCE algorithm is the special case where $b_t=0$.
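
Continuing the sketch from the REINFORCE section (reusing its `score` function and the `GAMMA` and `HORIZON` constants), the only change is subtracting a baseline from the reward-to-go before weighting each score term. Here a running average per time step is used as a crude stand-in for $b_t(S_j)$, which would more properly be a function of the state; this choice is an illustrative assumption.

```python
import numpy as np

baseline = np.zeros(HORIZON)   # b_t approximated per time step, updated across batches

def accumulate_with_baseline(theta, traj, baseline, beta=0.9):
    """Per-trajectory gradient term of REINFORCE with baseline."""
    g = np.zeros_like(theta)
    to_go = 0.0
    for j in reversed(range(len(traj))):
        s_j, a_j, r_j = traj[j]
        to_go += GAMMA**j * r_j
        g += score(theta, s_j, a_j) * (to_go - baseline[j])      # subtract the baseline
        baseline[j] = beta * baseline[j] + (1 - beta) * to_go    # update the running average
    return g
```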

Actor-critic methods

If $b_t$ is chosen well, such that $b_t(S_j)\approx \mathbb{E}\left[\sum_{i\in j:T}\gamma^i R_i\;\middle|\;S_j\right]=\gamma^j V^{\pi_{\theta_t}}(S_j)$, this can significantly decrease the variance of the gradient estimate. That is, the baseline should be as close to the discounted value function $\gamma^j V^{\pi_{\theta_t}}(S_j)$ as possible, approaching the ideal of:

$$\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\left[\sum_{j\in 0:T}\nabla_\theta\ln\pi_\theta(A_j\mid S_j)\left(\sum_{i\in j:T}(\gamma^i R_i)-\gamma^j V^{\pi_\theta}(S_j)\right)\;\Big|\;S_0=s_0\right]$$

Note that, as the policy $\pi_{\theta_t}$ updates, the value function $V^{\pi_{\theta_t}}$ updates as well, so the baseline should also be updated. One common approach is to train a separate function that estimates the value function and to use it as the baseline. This is one of the actor-critic methods, where the policy function is the actor and the value function is the critic.
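
A minimal sketch of this actor-critic idea, again reusing the toy environment, policy, and `score` function from the REINFORCE sketch: a tabular value estimate is trained alongside the policy and used as the baseline. Fitting the critic to Monte Carlo returns, as done here, is one simple choice among many (TD updates are also common).

```python
import numpy as np

V = np.zeros(N_STATES)          # critic: estimate of V^{pi_theta}(s)
CRITIC_LR = 0.1

def accumulate_actor_critic(theta, traj, V):
    """Per-trajectory gradient term weighted by gamma^j * (return-to-go - V(S_j))."""
    g = np.zeros_like(theta)
    ret = 0.0                   # return from step j onward, discounted from step j
    for j in reversed(range(len(traj))):
        s_j, a_j, r_j = traj[j]
        ret = r_j + GAMMA * ret
        advantage = ret - V[s_j]                 # critic used as the baseline
        g += (GAMMA**j) * score(theta, s_j, a_j) * advantage
        V[s_j] += CRITIC_LR * advantage          # move the critic toward the observed return
    return g
```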

The Q-function $Q^\pi$ can also be used as the critic, since

$$\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\left[\sum_{0\le j\le T}\gamma^j\,\nabla_\theta\ln\pi_\theta(A_j\mid S_j)\,Q^{\pi_\theta}(S_j,A_j)\;\Big|\;S_0=s_0\right]$$

by a similar argument using the tower law.

Subtracting the value function as a baseline, we find that the advantage function $A^\pi(S,A)=Q^\pi(S,A)-V^\pi(S)$ can be used as the critic as well:

$$\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\left[\sum_{0\le j\le T}\gamma^j\,\nabla_\theta\ln\pi_\theta(A_j\mid S_j)\,A^{\pi_\theta}(S_j,A_j)\;\Big|\;S_0=s_0\right]$$

In summary, there are many unbiased estimators of $\nabla_\theta J(\theta)$, all of the form:

$$\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\left[\sum_{0\le j\le T}\nabla_\theta\ln\pi_\theta(A_j\mid S_j)\,\Psi_j\;\Big|\;S_0=s_0\right]$$

where $\Psi_j$ is any linear sum of the following terms:

  • $\sum_{0\le i\le T}(\gamma^i R_i)$: the total discounted reward of the trajectory; never used in practice.
  • $\gamma^j\sum_{j\le i\le T}(\gamma^{i-j}R_i)$: used by the REINFORCE algorithm.
  • $\gamma^j\sum_{j\le i\le T}(\gamma^{i-j}R_i)-b(S_j)$: used by the REINFORCE with baseline algorithm.
  • $\gamma^j\left(R_j+\gamma V^{\pi_\theta}(S_{j+1})-V^{\pi_\theta}(S_j)\right)$: 1-step TD learning.
  • $\gamma^j Q^{\pi_\theta}(S_j,A_j)$: the Q-function as critic.
  • $\gamma^j A^{\pi_\theta}(S_j,A_j)$: the advantage function as critic.

Some more possible $\Psi_j$ are as follows, with very similar proofs.

  • $\gamma^j\left(R_j+\gamma R_{j+1}+\gamma^2 V^{\pi_\theta}(S_{j+2})-V^{\pi_\theta}(S_j)\right)$: 2-step TD learning.
  • $\gamma^j\left(\sum_{k=0}^{n-1}\gamma^k R_{j+k}+\gamma^n V^{\pi_\theta}(S_{j+n})-V^{\pi_\theta}(S_j)\right)$: n-step TD learning.
  • $\gamma^j\sum_{n=1}^{\infty}(1-\lambda)\lambda^{n-1}\left(\sum_{k=0}^{n-1}\gamma^k R_{j+k}+\gamma^n V^{\pi_\theta}(S_{j+n})-V^{\pi_\theta}(S_j)\right)$: TD(λ) learning, also known as GAE (generalized advantage estimation). This is an exponentially weighted average of the n-step TD terms above; a short computational sketch follows this list.
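
A minimal computational sketch of GAE, using the standard backward recursion over one-step TD errors, $\delta_j=R_j+\gamma V(S_{j+1})-V(S_j)$ and $\hat{A}_j=\delta_j+\gamma\lambda\hat{A}_{j+1}$. The inputs are assumed to come from one complete episode with the terminal next-state value taken to be zero, and the leading $\gamma^j$ factor from the list above is omitted, as is common in implementations.

```python
def gae_advantages(rewards, values, gamma, lam):
    """Generalized advantage estimates for one episode.

    rewards: list of R_j for j = 0..T
    values:  list of V(S_j) for j = 0..T (value after the last step treated as 0)
    """
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for j in reversed(range(T)):
        v_next = values[j + 1] if j + 1 < T else 0.0
        delta = rewards[j] + gamma * v_next - values[j]   # 1-step TD error
        running = delta + gamma * lam * running           # exponentially weighted sum
        adv[j] = running
    return adv

print(gae_advantages([1.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=0.99, lam=0.95))
```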

Other methods

Other important examples of policy gradient methods include Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO).

Natural policy gradient

The natural policy gradient method is a variant of the policy gradient method proposed by Kakade (2001). The key idea is that the standard policy gradient methods given above optimize $J(\theta)$ by taking its gradient $\nabla_\theta J(\theta)$. However, this gradient depends on the particular choice of the coordinates $\theta$. For example, if we were to rescale the coordinates by $\theta'=2\theta$, we would obtain a new policy gradient $\nabla_{\theta'}J(\theta')=\frac{1}{2}\nabla_\theta J(\theta)$.

Thus, the policy gradient method is "unnatural" in the geometric sense, since its updates depend on the choice of coordinates. A "natural" policy gradient would change this so that the policy updates are coordinate-free.
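
Concretely, the natural policy gradient preconditions the gradient with the Fisher information matrix of the policy. As a sketch of the standard update (with the average over the state distribution left implicit):

$$F(\theta)=\mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta\ln\pi_\theta(A\mid S)\,\nabla_\theta\ln\pi_\theta(A\mid S)^{\top}\right],\qquad \theta_{t+1}=\theta_t+\alpha_t\,F(\theta_t)^{-1}\nabla_\theta J(\theta_t).$$

Because $F(\theta)$ transforms under a change of coordinates in a way that cancels the transformation of $\nabla_\theta J(\theta)$, the resulting update direction does not depend on how the policy is parameterized.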


References

  1. Mohamed, Shakir; Rosca, Mihaela; Figurnov, Michael; Mnih, Andriy (2020). "Monte Carlo Gradient Estimation in Machine Learning". Journal of Machine Learning Research. 21 (132): 1–62. arXiv:1906.10652. ISSN 1533-7928.
  2. Williams, Ronald J. (May 1992). "Simple statistical gradient-following algorithms for connectionist reinforcement learning". Machine Learning. 8 (3–4): 229–256. doi:10.1007/BF00992696. ISSN 0885-6125.
  3. Sutton, Richard S; McAllester, David; Singh, Satinder; Mansour, Yishay (1999). "Policy Gradient Methods for Reinforcement Learning with Function Approximation". Advances in Neural Information Processing Systems. 12. MIT Press.
  4. Schulman, John; Moritz, Philipp; Levine, Sergey; Jordan, Michael; Abbeel, Pieter (2018-10-20), High-Dimensional Continuous Control Using Generalized Advantage Estimation, doi:10.48550/arXiv.1506.02438
  5. Schulman, John; Levine, Sergey; Moritz, Philipp; Jordan, Michael; Abbeel, Pieter (2015-07-06). "Trust region policy optimization". Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37. ICML'15. Lille, France: JMLR.org: 1889–1897.
  6. Schulman, John; Wolski, Filip; Dhariwal, Prafulla; Radford, Alec; Klimov, Oleg (2017-08-28), Proximal Policy Optimization Algorithms, arXiv:1707.06347
  7. Kakade, Sham M (2001). "A Natural Policy Gradient". Advances in Neural Information Processing Systems. 14. MIT Press.
  • Sutton, Richard S.; Barto, Andrew G. (2018). Reinforcement learning: an introduction. Adaptive computation and machine learning series (2 ed.). Cambridge, Massachusetts: The MIT Press. ISBN 978-0-262-03924-6.
  • Bertsekas, Dimitri P. (2019). Reinforcement learning and optimal control (2 ed.). Belmont, Massachusetts: Athena Scientific. ISBN 978-1-886529-39-7.
  • Szepesvári, Csaba (2010). Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning (1 ed.). Cham: Springer International Publishing. ISBN 978-3-031-00423-0.