Abstract

We study online interaction in general decision-making problems, where the objective is not only to find optimal strategies but also to satisfy safety guarantees expressed in terms of accrued costs. In particular, we focus on the online learning problem in which an agent must find the optimal solution of a linear objective while satisfying a linear safety constraint at each round. We propose a theoretical framework to address such problems and present BAN-SOLO, a UCB-like algorithm that, interacting online with an unknown environment, attains sublinear regret of order $O(\sqrt{T})$ and satisfies the safety constraint with high probability at each round. BAN-SOLO provides a general framework that can be applied to any setting in which estimators of the objective and cost functions are available. At its core, it relies on tools from convex duality to manage exploration of the environment while satisfying the safety constraint imposed by the problem. To show the applicability of our framework, we provide two game-theoretic applications: normal-form games and sequential decision-making problems.
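To make the setting concrete, the following is a minimal illustrative sketch in Python of an optimistic-in-reward, pessimistic-in-cost learner over mixed strategies. It is not BAN-SOLO: it solves each round's problem by enumerating vertices rather than by the convex-duality tools the abstract refers to. Everything specific here is an assumption made for illustration: Bernoulli reward and cost feedback, Hoeffding-style confidence radii, an action 0 whose cost is known and below the threshold `alpha`, and the hypothetical names `safe_optimistic_mix` and `run`.

```python
import math
import random


def safe_optimistic_mix(reward_ucb, cost_ucb, alpha):
    """Maximise reward_ucb . x over the simplex subject to cost_ucb . x <= alpha.

    With a single linear constraint, every vertex of the feasible polytope
    puts mass on at most two actions, so enumerating those vertices is exact.
    Returns None when no mixture is feasible.
    """
    k = len(reward_ucb)
    best_value, best_x = -math.inf, None
    # Candidate vertices 1: pure actions that satisfy the constraint.
    for i in range(k):
        if cost_ucb[i] <= alpha and reward_ucb[i] > best_value:
            best_value = reward_ucb[i]
            best_x = [1.0 if j == i else 0.0 for j in range(k)]
    # Candidate vertices 2: a safe action i mixed with an unsafe action j
    # so that the cost constraint holds with equality.
    for i in range(k):
        for j in range(k):
            if cost_ucb[i] < alpha < cost_ucb[j]:
                w = (alpha - cost_ucb[i]) / (cost_ucb[j] - cost_ucb[i])  # mass on j
                value = (1.0 - w) * reward_ucb[i] + w * reward_ucb[j]
                if value > best_value:
                    best_value = value
                    best_x = [0.0] * k
                    best_x[i], best_x[j] = 1.0 - w, w
    return best_x


def run(true_reward, true_cost, alpha, rounds, delta=0.05, seed=0):
    """Simulate the learner; rewards and costs are assumed to lie in [0, 1]."""
    rng = random.Random(seed)
    k = len(true_reward)
    counts = [0] * k
    reward_sum = [0.0] * k
    cost_sum = [0.0] * k
    strategies = []
    for _ in range(rounds):
        reward_ucb, cost_ucb = [], []
        for i in range(k):
            if counts[i] == 0:
                # Unplayed action: fall back to the trivial upper bound of 1.
                radius, r_mean, c_mean = 1.0, 0.0, 0.0
            else:
                radius = math.sqrt(math.log(2 * k * rounds / delta) / (2 * counts[i]))
                r_mean = reward_sum[i] / counts[i]
                c_mean = cost_sum[i] / counts[i]
            reward_ucb.append(min(1.0, r_mean + radius))  # optimistic reward
            cost_ucb.append(min(1.0, c_mean + radius))    # pessimistic cost
        # Assumption: action 0 is safe and its cost is known exactly.
        cost_ucb[0] = true_cost[0]
        x = safe_optimistic_mix(reward_ucb, cost_ucb, alpha)
        strategies.append(x)
        # Sample an action from the mixed strategy and observe noisy feedback.
        a = rng.choices(range(k), weights=x)[0]
        counts[a] += 1
        reward_sum[a] += 1.0 if rng.random() < true_reward[a] else 0.0
        cost_sum[a] += 1.0 if rng.random() < true_cost[a] else 0.0
    return strategies


if __name__ == "__main__":
    plays = run([0.2, 0.9, 0.6], [0.1, 0.8, 0.3], alpha=0.4, rounds=2000)
    print("final strategy:", [round(p, 3) for p in plays[-1]])
```

In this sketch, whenever the upper confidence bounds on the costs hold, the expected cost of each played strategy is at most `alpha`, because the constraint is enforced on an overestimate of the true cost. The known safe action is what keeps the per-round problem feasible before any cost data has been collected.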

Keywords

Online learning, safety, regret minimization, convex duality, learning in games

A framework for safe decision making: A convex duality approach
