Temporal credit assignment is the problem of assigning credit to sequential decisions in the presence of sparse or delayed rewards. In off-policy reinforcement learning, this is made more difficult by the mismatch between the target policy and the behavior policy that actually generates the actions. For this reason, combining off-policy learning with multi-step return methods, such as eligibility traces, is not trivial and has given rise to long-standing open problems. In this seminar, we present the work of Munos et al. [Munos, Stepleton, Harutyunyan, Bellemare, Safe and Efficient Off-Policy Reinforcement Learning, NIPS 2016], which provides both a theoretical analysis of the problem and a practical algorithm. We start with an introduction to temporal-difference reinforcement learning and traditional eligibility traces. Then, we frame the problem of off-policy credit assignment in a very general way, which allows us to compare existing solutions. Going further, we provide sufficient conditions for the convergence of a generalized Q-learning algorithm, obtaining as a corollary the convergence of Watkins' Q(\lambda), which had remained an open problem since 1989. We then analyze the novel Retrace(\lambda) algorithm, which is low-variance, safe for arbitrary behavior policies, and efficient when the behavior and target policies are close.
Finally, we present an experimental evaluation of Retrace(\lambda) on Atari 2600 games.
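For reference, the general off-policy return operator studied in the paper can be written, in the paper's notation, as

\mathcal{R}Q(x,a) = Q(x,a) + \mathbb{E}_{\mu}\Big[\sum_{t \ge 0} \gamma^{t} \Big(\prod_{s=1}^{t} c_{s}\Big) \big(r_{t} + \gamma\, \mathbb{E}_{\pi} Q(x_{t+1}, \cdot) - Q(x_{t}, a_{t})\big)\Big],

where trajectories are drawn from the behavior policy \mu and evaluated under the target policy \pi. Different choices of the trace coefficient c_s recover existing methods: c_s = \pi(a_s|x_s)/\mu(a_s|x_s) gives importance sampling, c_s = \lambda gives Q(\lambda), and c_s = \lambda \pi(a_s|x_s) gives Tree-Backup(\lambda). Retrace(\lambda) uses the truncated ratio c_s = \lambda \min(1, \pi(a_s|x_s)/\mu(a_s|x_s)), which cuts the traces only where \pi and \mu disagree, yielding the low variance and safety properties mentioned above.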
Matteo Papini was born in Sondrio, Italy, on 5th July 1993. In 2015 he obtained his Bachelor's Degree in Ingegneria Informatica (Computer Engineering) cum laude from Politecnico di Milano. In 2017 he obtained his Master's Degree in Computer Science and Engineering - Ingegneria Informatica cum laude from Politecnico di Milano. Since November 2017 he has been a Ph.D. student at the Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB) of Politecnico di Milano. His research interests include artificial intelligence, robotics, and machine learning, with a focus on reinforcement learning.