Exploration Optimization for Dynamic Assortment Personalization under Linear Preferences
56 Pages · Posted: 31 May 2022
Date Written: June 18, 2024
Abstract
We study the dynamic assortment personalization problem of an online retailer that adaptively personalizes assortments based on customers' attributes to learn their preferences and maximize revenue. We assume a linear relationship between product utilities and customer attributes that governs customers' preferences for products. The coefficient matrix characterizing this linear relationship is unknown to the retailer, and, as a result, the retailer faces the classic exploration (learning preferences) vs. exploitation (earning revenue) trade-off. We show that there are price-driven and linearity-driven efficiencies that can be leveraged for exploration. Specifically, we show that not all products need to be shown to all customer profiles to recover the optimal assortments and maximize revenue. We prove an instance-dependent lower bound on the regret (i.e., expected revenue loss relative to a clairvoyant retailer) of any admissible policy. We show that this lower bound depends on the optimal objective value of a Regret Lower Bound (RLB) problem. Even though the RLB problem is a linear program, solving it and using its solution in practice can be challenging, as it has a complex structure and depends non-trivially on parameters unknown to the retailer. We therefore also consider an alternative formulation, which we call the Exploration-Optimization problem, that imposes a simple and easily interpretable structure for exploration. We show that this problem can be formulated as a Mixed Integer Linear Program (MILP) that can be solved effectively with state-of-the-art solvers. We design learning policies that identify an efficient exploration set by solving either the RLB or the Exploration-Optimization problem. Finally, we prove a regret upper bound for our proposed exploration-optimization policy to provide further theoretical support for its performance. To illustrate the practical value of the proposed policies, we consider a setting calibrated on a dataset from a large Chilean retailer. We find that, in addition to running significantly faster, our proposed policies outperform the Thompson sampling benchmark in terms of regret (revenue). We also run experiments showing that our proposed policies are scalable in practice.
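As a rough illustration of the preference model described in the abstract (a minimal sketch, not the paper's exact specification): under a standard linear-utility multinomial logit model, a customer with attribute vector x assigns product j the utility x'Theta_j, where Theta_j is a column of the unknown coefficient matrix, and the retailer's expected revenue from an assortment follows from the resulting choice probabilities. The Python snippet below uses hypothetical parameter names and made-up data purely for illustration.

import numpy as np

def choice_probabilities(x, Theta, assortment):
    """MNL choice probabilities for a customer with attribute vector x.

    Theta[:, j] is the coefficient vector of product j (unknown to the
    retailer in the paper; assumed known here for illustration). The
    no-purchase option has utility 0.
    """
    utilities = np.array([x @ Theta[:, j] for j in assortment])
    weights = np.exp(utilities)
    denom = 1.0 + weights.sum()  # "+1" accounts for the no-purchase option
    return weights / denom

def expected_revenue(x, Theta, prices, assortment):
    """Expected revenue of offering `assortment` to a customer with attributes x."""
    probs = choice_probabilities(x, Theta, assortment)
    return sum(p * prices[j] for p, j in zip(probs, assortment))

# Toy example with hypothetical numbers
rng = np.random.default_rng(0)
d, n = 3, 5                          # attribute dimension, number of products
Theta = rng.normal(size=(d, n))      # unknown coefficient matrix in the paper
prices = rng.uniform(1, 10, size=n)
x = rng.normal(size=d)               # customer attribute vector
print(expected_revenue(x, Theta, prices, assortment=[0, 2, 4]))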
Keywords: Dynamic Assortment Planning, Personalization, Multi-Armed Bandit, Online Retailing