Efficient Exploration-Exploitation in Reinforcement Learning
Matteo Pirotta
Post-doc - INRIA Lille
DEIB - Conference Room "Emilio Gatti" (building 20)
April 3rd, 2018
10.30 pm
Contact:
Marcello Restelli
Research Line:
Artificial intelligence and robotics
Abstract
Reinforcement Learning (RL) addresses the problem of learning and control in sequential decision-making when the dynamics of the underlying environment are unknown but can be learnt by performing actions and observing their outcomes. While learning in an unknown environment, an RL agent needs to trade off the exploration required to collect new information about dynamics and reward (i.e., to improve its internal representation) against the exploitation of the experience collected so far to gain as much reward as possible. In this setting, the performance of an algorithm is usually measured in terms of regret, which compares the reward accumulated by the algorithm with that accumulated by the optimal policy.
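For reference (a standard formulation in the average-reward setting, not spelled out in the abstract itself), the regret of an algorithm after T steps can be written as

    \Delta(T) = T \rho^* - \sum_{t=1}^{T} r_t

where \rho^* denotes the optimal average reward (gain) and r_t the reward collected at step t; the goal of UCRL-style algorithms is to make \Delta(T) grow sublinearly in T.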
We start by discussing exploration-exploitation in the bandit setting, where we review the UCB algorithm. We then progressively move to the general RL case, presenting the Upper-Confidence RL (UCRL) algorithm. Finally, we introduce SCAL, an algorithm designed to perform efficient exploration-exploitation in any unknown weakly-communicating Markov Decision Process (MDP) for which an upper bound c on the span of the optimal bias function is known. SCAL achieves a regret bound that significantly improves over existing algorithms (e.g., UCRL and PSRL), whose regret scales linearly with the MDP diameter D. In fact, the optimal bias span is finite and often much smaller than D (e.g., the diameter is infinite in non-communicating MDPs).
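To make the optimism-in-the-face-of-uncertainty principle behind UCB concrete, here is a minimal sketch of the classical UCB1 index for multi-armed bandits (the helper names pull, n_arms, and horizon are hypothetical; this is a generic textbook sketch, not code from the talk):

    import math
    import random

    def ucb1(pull, n_arms, horizon):
        """Play `horizon` rounds of UCB1; `pull(arm)` returns a reward in [0, 1]."""
        counts = [0] * n_arms      # number of times each arm has been played
        means = [0.0] * n_arms     # empirical mean reward of each arm
        total = 0.0
        for t in range(1, horizon + 1):
            if t <= n_arms:
                arm = t - 1        # play every arm once to initialise the estimates
            else:
                # optimistic index: empirical mean plus an exploration bonus
                arm = max(range(n_arms),
                          key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
            r = pull(arm)
            counts[arm] += 1
            means[arm] += (r - means[arm]) / counts[arm]  # incremental mean update
            total += r
        return total

    # Example usage with Bernoulli arms (hypothetical success probabilities):
    probs = [0.3, 0.5, 0.7]
    reward = ucb1(lambda a: float(random.random() < probs[a]), len(probs), 10_000)

The exploration bonus shrinks as an arm is played more often, so the agent naturally shifts from exploring poorly-known arms to exploiting the empirically best one; UCRL and SCAL extend this same optimism principle from bandits to MDPs.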