TUTCRIS - Tampere University of Technology


Compatible natural gradient policy search

Research output: peer-reviewed

Standard

Compatible natural gradient policy search. / Pajarinen, Joni; Thai, Hong Linh; Akrour, Riad; Peters, Jan; Neumann, Gerhard.

In: Machine Learning, 2019.

Research output: peer-reviewed

Harvard

Pajarinen, J, Thai, HL, Akrour, R, Peters, J & Neumann, G 2019, 'Compatible natural gradient policy search', Machine Learning. https://doi.org/10.1007/s10994-019-05807-0

APA

Pajarinen, J., Thai, H. L., Akrour, R., Peters, J., & Neumann, G. (2019). Compatible natural gradient policy search. Machine Learning. https://doi.org/10.1007/s10994-019-05807-0

Vancouver

Pajarinen J, Thai HL, Akrour R, Peters J, Neumann G. Compatible natural gradient policy search. Machine Learning. 2019. https://doi.org/10.1007/s10994-019-05807-0

Author

Pajarinen, Joni ; Thai, Hong Linh ; Akrour, Riad ; Peters, Jan ; Neumann, Gerhard. / Compatible natural gradient policy search. In: Machine Learning. 2019.

BibTeX

@article{8eeb22fc25584feea54e02aba5a70d9b,
title = "Compatible natural gradient policy search",
abstract = "Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use KL-divergence to bound the region of trust resulting in a natural gradient policy update. We show that the natural gradient and trust region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule leading to premature convergence. To control entropy reduction we introduce a new policy search method called compatible policy search (COPOS) which bounds entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks and in discrete partially observable tasks.",
keywords = "Policy search, Reinforcement learning",
author = "Joni Pajarinen and Thai, {Hong Linh} and Riad Akrour and Jan Peters and Gerhard Neumann",
year = "2019",
doi = "10.1007/s10994-019-05807-0",
language = "English",
journal = "Machine Learning",
issn = "0885-6125",
publisher = "Springer Verlag",
}

RIS (suitable for import to EndNote)

TY - JOUR

T1 - Compatible natural gradient policy search

AU - Pajarinen, Joni

AU - Thai, Hong Linh

AU - Akrour, Riad

AU - Peters, Jan

AU - Neumann, Gerhard

PY - 2019

Y1 - 2019

N2 - Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use KL-divergence to bound the region of trust resulting in a natural gradient policy update. We show that the natural gradient and trust region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule leading to premature convergence. To control entropy reduction we introduce a new policy search method called compatible policy search (COPOS) which bounds entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks and in discrete partially observable tasks.

AB - Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use KL-divergence to bound the region of trust resulting in a natural gradient policy update. We show that the natural gradient and trust region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule leading to premature convergence. To control entropy reduction we introduce a new policy search method called compatible policy search (COPOS) which bounds entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks and in discrete partially observable tasks.

KW - Policy search

KW - Reinforcement learning

U2 - 10.1007/s10994-019-05807-0

DO - 10.1007/s10994-019-05807-0

M3 - Article

JO - Machine Learning

JF - Machine Learning

SN - 0885-6125

ER -
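
The abstract describes policy updates in which a KL-divergence trust region yields a natural gradient step. As a rough, hedged illustration of that generic idea only (this is not the authors' COPOS algorithm, which additionally bounds entropy loss and uses compatible value function approximation; the function and parameter names below are hypothetical), a single KL-bounded natural gradient step can be sketched as:

import numpy as np

def natural_gradient_step(theta, grad, fisher, kl_bound=0.01):
    # theta    : current policy parameters (1-D array)
    # grad     : estimated policy gradient at theta
    # fisher   : estimated Fisher information matrix at theta
    # kl_bound : trust-region size (epsilon) on the KL divergence
    #
    # Natural gradient direction: F^{-1} g
    nat_grad = np.linalg.solve(fisher, grad)
    # Under the quadratic approximation KL ~= 0.5 * d^T F d, a step
    # d = alpha * F^{-1} g hits the bound when
    # alpha = sqrt(2 * kl_bound / (g^T F^{-1} g)).
    quad = float(grad @ nat_grad)
    alpha = np.sqrt(2.0 * kl_bound / max(quad, 1e-12))
    return theta + alpha * nat_grad

Here theta, grad, and fisher stand in for quantities that would be estimated from sampled trajectories; the sketch omits the entropy-loss bound that distinguishes COPOS from standard natural gradient updates.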