Similarities Between Policy Gradient Methods (PGM) in Reinforcement Learning (RL) and Supervised Learning (SL)

7 Pages Posted: 7 Jun 2019

Eric Benhamou

Université Paris Dauphine; EB AI Advisory; AI For Alpha

Date Written: May 20, 2019

Abstract

Reinforcement learning (RL) is about sequential decision making and is traditionally opposed to supervised learning (SL) and unsupervised learning (USL). In RL, given the current state, the agent makes a decision that may influence the next state, whereas in SL (and USL) the next state remains the same regardless of the decisions taken, whether in batch or online learning. Although this difference between SL and RL is fundamental, there are connections that have been overlooked. In particular, we prove in this paper that policy gradient methods can be cast as a supervised learning problem where the true labels are replaced with discounted rewards. We provide a new proof of policy gradient methods (PGM) that emphasizes their tight link with cross entropy and supervised learning.

We provide a simple experiment in which we interchange labels and pseudo-rewards. We conclude that other relationships with SL could be established if the reward functions are modified wisely.
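The abstract's central claim is that the policy gradient update coincides with a cross-entropy gradient in which the true label is replaced (up to scaling) by the discounted return. A minimal numerical sketch of this identity, for a toy softmax policy with hypothetical features, action, and return values chosen purely for illustration, could look as follows: the finite-difference gradient of the REINFORCE-style loss -G log pi(a|s) matches G times the gradient of the cross-entropy loss -log pi(a|s) with "label" a.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical toy setup: linear softmax policy over 3 actions.
rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 3))   # state features -> action logits
state = rng.normal(size=4)
logits = state @ theta

action = 1    # action sampled during the episode (illustrative)
G = 2.5       # discounted return from this step (illustrative)

def pg_loss(z):
    """Policy gradient surrogate loss: -G * log pi(action | state)."""
    return -G * np.log(softmax(z)[action])

def ce_loss(z):
    """Cross-entropy loss treating the sampled action as the label."""
    return -np.log(softmax(z)[action])

def num_grad(f, z, eps=1e-6):
    """Central finite-difference gradient of f at z."""
    g = np.zeros_like(z)
    for i in range(len(z)):
        zp, zm = z.copy(), z.copy()
        zp[i] += eps
        zm[i] -= eps
        g[i] = (f(zp) - f(zm)) / (2 * eps)
    return g

# The policy gradient equals the supervised cross-entropy gradient
# scaled by the discounted return G.
assert np.allclose(num_grad(pg_loss, logits),
                   G * num_grad(ce_loss, logits), atol=1e-5)
```

This is only a single-step, single-trajectory sketch; the paper's actual proof concerns the expectation of such terms over trajectories, but the per-sample identity above is what makes the cross-entropy interpretation possible.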

Keywords: Policy gradient, Supervised learning, Cross entropy, Kullback-Leibler divergence, Entropy

Suggested Citation

Benhamou, Eric, Similarities Between Policy Gradient Methods (PGM) in Reinforcement Learning (RL) and Supervised Learning (SL) (May 20, 2019). Available at SSRN: https://ssrn.com/abstract=3391216 or http://dx.doi.org/10.2139/ssrn.3391216

Eric Benhamou (Contact Author)

Université Paris Dauphine ( email )

Place du Maréchal de Tassigny
Paris, Cedex 16 75775
France

EB AI Advisory ( email )

35 Boulevard d'Inkermann
Neuilly sur Seine, 92200
France

AI For Alpha ( email )

35 boulevard d'Inkermann
Neuilly sur Seine, 92200
France
