Open Problem: First-Order Regret Bounds for Contextual Bandits
- Alekh Agarwal ,
- Akshay Krishnamurthy ,
- John Langford ,
- Haipeng Luo ,
- Robert E. Schapire
Conference on Learning Theory |
Published by PMLR
We describe two open problems related to first order regret bounds for contextual bandits. The first asks for an algorithm with a regret bound of \(O^\tilde(L^*\sqrt{K\ln N})\) where there are \(K\) actions, \(N\) policies, and \(L^*\) is the cumulative loss of the best policy. The second asks for an optimization-oracle-efficient algorithm with regret \(O^\tilde({L^*}^{2/3}poly(K,\ln(N/\delta)))\). We describe some positive results, such as an inefficient algorithm for the second problem, and some partial negative results.