Learning in Combinatorial Optimization: What and How to Explore
66 Pages. Posted: 26 Sep 2017. Last revised: 10 Jun 2019.
Date Written: March 15, 2019
We study dynamic decision-making under uncertainty when, at each period, a decision-maker implements a solution to a combinatorial optimization problem. The objective coefficient vectors of said problem, which are unobserved prior to implementation, vary from period to period. These vectors, however, are known to be random draws from an initially unknown distribution with known range. By implementing different solutions, the decision-maker extracts information about the underlying distribution, but at the same time incurs the cost associated with said solutions. We show that resolving the implied exploration versus exploitation trade-off efficiently is related to solving a Lower Bound Problem (LBP), which simultaneously answers the questions of what to explore and how to do so. We establish a fundamental limit on the asymptotic performance of any admissible policy that is proportional to the optimal objective value of the LBP. We show that such a lower bound can be asymptotically attained by policies that adaptively reconstruct and solve the LBP at an exponentially decreasing frequency. Because the LBP is likely intractable in practice, we propose policies that instead reconstruct and solve a proxy for the LBP, which we call the Optimality Cover Problem (OCP). We provide strong evidence of the practical tractability of the OCP, which implies that the proposed policies can be implemented in real time. We test the performance of the proposed policies through extensive numerical experiments and show that they significantly outperform relevant benchmarks in the long term and are competitive in the short term.
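To make the setting concrete, the following is a minimal sketch (not the paper's LBP/OCP policies) of the feedback loop the abstract describes: a toy instance with three feasible subsets of a small ground set, per-period cost vectors drawn from an unknown distribution with known range [0, 1], semi-bandit feedback on the implemented elements only, and a simple explore-then-exploit rule with a vanishing exploration probability. All names, the instance, and the exploration schedule are illustrative assumptions.

```python
import random

# Hypothetical toy instance: three fixed feasible subsets (e.g., s-t paths)
# over a ground set of four elements. True mean costs are unknown to the
# policy; only the range [0, 1] is known.
EDGES = list(range(4))
SOLUTIONS = [(0, 1), (0, 2), (1, 3)]      # feasible solutions (subsets of EDGES)
TRUE_MEANS = [0.2, 0.6, 0.3, 0.1]         # unknown per-element mean costs

def draw_costs(rng):
    """One period's cost vector: noise around the unknown means, clipped to [0, 1]."""
    return [min(1.0, max(0.0, m + rng.uniform(-0.1, 0.1))) for m in TRUE_MEANS]

def run(T=2000, seed=0):
    rng = random.Random(seed)
    est_sum = [0.0] * len(EDGES)          # running sums of observed costs
    est_cnt = [0] * len(EDGES)            # number of observations per element
    total_cost = 0.0
    for t in range(1, T + 1):
        costs = draw_costs(rng)
        # Explore with a vanishing probability; otherwise exploit the solution
        # that is cheapest under the current per-element estimates.
        if rng.random() < min(1.0, 2.0 / t ** 0.5):
            sol = rng.choice(SOLUTIONS)
        else:
            def est(e):
                return est_sum[e] / est_cnt[e] if est_cnt[e] else 0.0
            sol = min(SOLUTIONS, key=lambda s: sum(est(e) for e in s))
        # Semi-bandit feedback: observe realized costs only on implemented elements.
        for e in sol:
            est_sum[e] += costs[e]
            est_cnt[e] += 1
        total_cost += sum(costs[e] for e in sol)
    best_mean = min(sum(TRUE_MEANS[e] for e in s) for s in SOLUTIONS)
    return total_cost / T, best_mean

avg, best = run()
```

The gap between `avg` (realized average cost per period) and `best` (expected cost of the best fixed solution in hindsight) is the per-period regret that the paper's LBP-based lower bound and OCP-based policies characterize far more sharply than this naive schedule.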
Keywords: Combinatorial Optimization, Multi-Armed Bandit, Mixed-Integer Programming