Temporal Credit Assignment in Off-Policy Reinforcement Learning
Matteo Papini
DEIB PhD student
DEIB - Seminar Room (building 20)
November 28th, 2017
10.30 am
Research Line:
Artificial Intelligence and robotics
Abstract
Temporal credit assignment is the problem of assigning a value to sequential decisions in the presence of sparse or delayed rewards. In off-policy reinforcement learning, this is made more difficult by the mismatch between the target policy and the actions that are actually performed. For this reason, combining off-policy learning with multi-step-return solutions, such as eligibility traces, is not trivial and has left long-standing problems open. In this seminar, we present the work of Munos et al. [Munos, Stepleton, Harutyunyan, Bellemare, Safe and Efficient Off-Policy, NIPS 2016], which provides both a theoretical analysis of the problem and a practical algorithm. We start with an introduction to temporal-difference reinforcement learning and traditional eligibility traces. Then, we frame the problem of off-policy credit assignment in a very general way, which allows us to compare existing solutions. Going further, we provide sufficient conditions for the convergence of a generalized Q-learning algorithm, obtaining as a corollary the convergence of Watkins's Q(\lambda), which had been an open problem since 1989. We then analyze the novel Retrace(\lambda) algorithm, which has several desirable properties: low variance, safe use of samples from any behavior policy, and efficient use of samples from near on-policy behavior.
Finally, we show experimental evaluations of Retrace(\lambda) on Atari games.
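As a rough illustration of the general off-policy return discussed in the abstract, the sketch below computes a Retrace(\lambda)-style corrected estimate for a single trajectory in a tabular setting. It follows our reading of the cited paper, where each TD error is weighted by a product of trace coefficients c_s = \lambda min(1, \pi(a_s|x_s) / \mu(a_s|x_s)). The function name, argument layout, and the convention that terminal states have zero Q-values are assumptions made for this example, not part of the seminar material.

```python
import numpy as np

def retrace_estimate(Q, states, actions, rewards, next_states, pi, mu,
                     gamma=0.9, lam=1.0):
    """Corrected estimate of Q[states[0], actions[0]] from one trajectory.

    Q      : (S, A) array of current action-value estimates
             (assumption: rows of terminal states are all zeros)
    pi, mu : (S, A) arrays with target and behavior action probabilities
    The trajectory is assumed to have been generated by mu, so that
    mu[x, a] > 0 for every visited (x, a).
    """
    total = 0.0
    trace = 1.0  # product c_1 * ... * c_t; the empty product at t = 0 is 1
    for t in range(len(states)):
        x, a = states[t], actions[t]
        r, x_next = rewards[t], next_states[t]
        if t > 0:
            # Truncated importance weight: never larger than lam
            c = lam * min(1.0, pi[x, a] / mu[x, a])
            trace *= c
        # Expected value of the next state under the target policy
        expected_next = float(np.dot(pi[x_next], Q[x_next]))
        td_error = r + gamma * expected_next - Q[x, a]
        total += (gamma ** t) * trace * td_error
    return Q[states[0], actions[0]] + total
```

In this sketch, replacing the coefficient with `pi[x, a] / mu[x, a]` would give plain importance sampling, and replacing it with `lam * pi[x, a]` would give a tree-backup-style trace; truncating the ratio at 1 is what keeps the product of coefficients from growing without bound.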
Biography
Matteo Papini was born in Sondrio, Italy, on 5th July 1993. In 2015 he obtained a Bachelor's degree in Ingegneria Informatica (Computer Engineering) cum laude at Politecnico di Milano. In 2017 he obtained a Master's degree in Computer Science and Engineering (Ingegneria Informatica) cum laude at Politecnico di Milano. Since November 2017 he has been a Ph.D. student at Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB) at Politecnico di Milano. His research interests include artificial intelligence, robotics, and machine learning, with a focus on reinforcement learning.