Soft Safe Policy Improvement with Baseline Bootstrapping
- Kimia Nadjahi,
- Romain Laroche,
- Remi Tachet des Combes
Batch Reinforcement Learning is a common setting in sequential decision-making under uncertainty. It consists of finding an optimal policy using trajectories collected with another policy, called the baseline. Previous work shows that safe policy improvement (SPI) methods improve mean performance compared to the basic batch algorithm (Laroche and Trichelair, 2017). Here, we build on that work and refine the algorithm to allow finer optimization under the safety constraint. Instead of binarily classifying state-action pairs into two sets (the uncertain ones and the safe-to-train-on ones), we adopt a softer strategy that locally accounts for the error induced by model uncertainty. The method takes just enough risk on uncertain actions to improve the policy while remaining safe in practice, and is therefore less conservative than state-of-the-art methods. We propose four algorithms for this constrained optimization problem and empirically show a significant improvement over existing SPI methods.
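The sketch below is a minimal illustration of the kind of uncertainty-weighted "soft" constraint the abstract alludes to: rather than freezing the policy on uncertain state-action pairs, the deviation from the baseline is bounded by a budget weighted by a per-pair error term. The error-bound formula, the greedy mass reallocation, and all names (`uncertainty`, `soft_spibb_step`, `epsilon`) are illustrative assumptions, not the paper's four algorithms.

```python
import numpy as np


def uncertainty(counts, delta=0.05, n_states=25, n_actions=4):
    """Hoeffding-style error bound e(s, a): larger for rarely observed pairs (assumed form)."""
    return np.sqrt(2.0 * np.log(2.0 * n_states * n_actions / delta) / np.maximum(counts, 1))


def soft_spibb_step(q, pi_b, e, epsilon):
    """One greedy policy-improvement step at a single state.

    Moves probability mass from low-Q to high-Q actions while keeping the
    uncertainty-weighted deviation  sum_a e[a] * |pi[a] - pi_b[a]|  below epsilon.
    """
    pi = pi_b.copy()
    budget = epsilon
    order = np.argsort(q)                  # actions sorted from worst to best
    best = order[-1]
    for a in order[:-1]:                   # take mass from the worst actions first
        if budget <= 0 or q[a] >= q[best]:
            break
        # Moving m units of mass from a to best costs m * (e[a] + e[best]) of the budget.
        cost_per_unit = e[a] + e[best]
        m = min(pi[a], budget / cost_per_unit)
        pi[a] -= m
        pi[best] += m
        budget -= m * cost_per_unit
    return pi


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    counts = rng.integers(1, 50, size=4)   # visit counts N(s, a) at one state
    q = rng.normal(size=4)                 # estimated action values
    pi_b = np.full(4, 0.25)                # uniform baseline policy at this state
    e = uncertainty(counts)
    pi = soft_spibb_step(q, pi_b, e, epsilon=0.5)
    print("baseline:", pi_b, "improved:", pi.round(3))
```

Under this view, well-observed actions (small e) can be reweighted almost freely, while poorly observed ones consume the budget quickly, which is what makes the constraint softer than a binary bootstrap to the baseline.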