Exploration vs exploitation with partially observable Gaussian autoregressive arms

Kuhn, Julia, Mandjes, Michel and Nazarathy, Yoni (2014). Exploration vs exploitation with partially observable Gaussian autoregressive arms. In: Valuetools '14: Proceedings of the 8th International Conference on Performance Evaluation Methodologies and Tools. Valuetools, Bratislava, Slovakia, (209-216). 9-11 December 2014. doi:10.4108/icst.valuetools.2014.258207


Author Kuhn, Julia
Mandjes, Michel
Nazarathy, Yoni
Title of paper Exploration vs exploitation with partially observable Gaussian autoregressive arms
Conference name Valuetools
Conference location Bratislava, Slovakia
Conference dates 9-11 December 2014
Proceedings title Valuetools '14: Proceedings of the 8th International Conference on Performance Evaluation Methodologies and Tools
Journal name Proceedings of the 8th International Conference on Performance Evaluation Methodologies and Tools, VALUETOOLS 2014
Place of Publication Brussels, Belgium
Publisher ICST
Publication Year 2014
Sub-type Fully published paper
DOI 10.4108/icst.valuetools.2014.258207
Open Access Status Not Open Access
ISBN 9781631900570
Start page 209
End page 216
Total pages 8
Language eng
Abstract/Summary We consider a restless bandit problem with Gaussian autoregressive arms, where the state of an arm is only observed when it is played and the state-dependent reward is collected. Since arms are only partially observable, a good decision policy needs to account for the fact that information about the state of an arm becomes more and more obsolete while the arm is not being played. Thus, the decision maker faces a tradeoff between exploiting those arms that are believed to be currently the most rewarding (i.e. those with the largest conditional mean), and exploring arms with a high conditional variance. Moreover, one would like the decision policy to remain tractable despite the infinite state space and also in systems with many arms. A policy that gives some priority to exploration is the Whittle index policy, for which we establish structural properties. These motivate a parametric index policy that is computationally much simpler than the Whittle index but can still outperform the myopic policy. Furthermore, we examine the many-arm behavior of the system under the parametric policy, identifying equations describing its asymptotic dynamics. Based on these insights we provide a simple heuristic algorithm to evaluate the performance of index policies; the latter is used to optimize the parametric index.
Keyword Restless bandits
Partially observable
Whittle index
Performance evaluation
Asymptotic dynamics
Q-Index Code EX
Q-Index Status Provisional Code
Institutional Status UQ

Document type: Conference Paper
Collection: School of Mathematics and Physics
 
Versions
Version Filter Type
Citation counts: Scopus Citation Count Cited 1 times in Scopus Article | Citations
Google Scholar Search Google Scholar
Created: Wed, 02 Mar 2016, 21:34:57 EST by Julia Kuhn on behalf of School of Mathematics & Physics