Online and offline learning of player objectives from partial observations in dynamic games

Abstract

Robots deployed to the real world must be able to interact with other agents in their environment. Dynamic game theory provides a powerful mathematical framework for modeling scenarios in which agents have individual objectives and interactions evolve over time. However, a key limitation of such techniques is that they require a priori knowledge of all players’ objectives. In this work, we address this issue by proposing a novel method for learning players’ objectives in continuous dynamic games from noise-corrupted, partial state observations. Our approach learns objectives by coupling the estimation of unknown cost parameters of each player with inference of unobserved states and inputs through Nash equilibrium constraints. By coupling past state estimates with future state predictions, our approach is amenable to simultaneous online learning and prediction in receding horizon fashion. We demonstrate our method in several simulated traffic scenarios in which we recover players’ preferences, for, e.g. desired travel speed and collision-avoidance behavior. Results show that our method reliably estimates game-theoretic models from noise-corrupted data that closely matches ground-truth objectives, consistently outperforming state-of-the-art approaches.

Keywords

Inverse dynamic games inverse optimal control multi-agent prediction

1. Introduction

To operate safely and efficiently in environments shared with other agents, robots must be able to predict the effects of their actions on the decisions of others. In many such settings, agents do not form a single team that shares a joint objective. Instead, each agent may have an individual objective, encoded by a cost function which they optimize unilaterally. Unless the objectives of all agents are perfectly aligned, agents must therefore compete to minimize their own cost, while accounting for the strategic behavior of others. For example, consider the highway navigation scenario in Figure 1. Here, each driver travels along the highway with an individual objective that encodes their preferences for speed, acceleration, and proximity to other cars. In heavy traffic, the objectives of drivers may conflict. For instance, if car 1 (blue) wishes to maintain its speed, it must overtake the slower vehicles in front. At the same time, however, the faster car 2 (orange) may wish maintain its speed and but would be forced to decelerate if the driver of car 1 changes lanes.

Figure 1.

5-player highway driving scenario, modeled as a dynamic game. Solving the “forward” problem amounts to finding optimal trajectories (right) for all cars, given their objectives (left). In contrast, this paper addresses the “inverse,” that is, it estimates the objectives of each player given noise-corrupted observations of each agent’s trajectories. For example, our method can infer properties such as the degree to which each player wishes to keep a safe distance from others (heat map, left). These learned objectives constitute an abstract model which can be used to predict players’ actions in the future.

Mathematically, such interactions of multiple agents with individual, potentially conflicting objectives are characterized by a noncooperative dynamic game. The theory underpinning dynamic games is well established (Isaacs 1954-1955; Başar and Olsder 1999), and recent work has put forth efficient algorithms to determine equilibrium solutions to these problems, given players’ objectives (Fridovich-Keil et al., 2020; Di and Lamperski 2019). The forward game problem is depicted in Figure 1 (left to right) for the highway driving scenario: given the cost functions of all players (left), a forward game solver computes their rational strategies and corresponding future trajectories (right).

Unfortunately, the objectives of agents in a scene are often not known a priori. Therefore, in order for game-theoretic methods to find practical application in fields such as robotics, it is imperative to recover these objectives from data. This inverse dynamic game problem is illustrated in Figure 1 (right to left) for the highway driving scenario: given observations of player’s strategies (right), an inverse game solver recovers objectives (left) which explain the observed behavior. This inverse dynamic game problem is the focus of this work.

The challenge of recovering objectives from observed behavior has been extensively studied in the literature on inverse optimal control (IOC) (Kalman 1964; Mombaur et al., 2010; Albrecht et al., 2011) and inverse reinforcement learning (IRL) (Ng and Russell 2000; Ziebart et al., 2008). Unfortunately, however, these methods are fundamentally limited to the single-player setting. While recent efforts extend these ideas to multi-agent IRL (Šošić et al., 2016; Natarajan et al., 2010), those approaches are limited to games with potential cost structures (Monderer and Shapley 1996) and do not directly apply in general noncooperative settings. While initial work extends IOC methods to address this limitation (Rothfuß et al., 2017; Inga et al., 2019; Awasthi and Lamperski 2020), these inverse dynamic game solvers rely upon full observation of states and inputs of all players.

The main contribution of this work is a novel method for learning player’s objectives in noncooperative dynamic games from only noise-corrupted, partial state observations. In addition to learning a cost model for all players, our method also recovers a forward game solution consistent with the learned objectives by enforcing equilibrium constraints on latent trajectory estimates. This bilevel formulation further allows to couple observed and predicted behavior to recover player’s objectives even from temporally incomplete interactions. As a result, our approach is amenable for online learning and prediction in receding horizon fashion.

This paper builds upon and extends our earlier work (Peters et al., 2021). In this work, we provide a more in-depth analysis of that approach. Additionally, while our original work was limited to offline operation and could therefore only recover players’ objectives for interactions which had already occurred, in this work we remove this requirement.

We evaluate our method in extensive Monte Carlo simulations in several traffic scenarios with varying numbers of players and interaction geometries. Empirical results show that our approach is more robust to partial state observations, measurement noise, and unobserved time-steps than existing methods, and consequently it is more suitable for predicting agents’ actions in the future.

2. Prior work

We begin by discussing recent advances in the well-studied area of IOC. While methods from that field address only single-player, cooperative settings, this body of work exposes many of the important mathematical and algorithmic concepts that appear in games. We discuss how some of these approaches have been applied in the noncooperative multi-player setting and emphasize the connections between existing approaches and our contributions.

2.1. Single-player inverse optimal control

The IOC problem has been extensively studied since the well-known work of Kalman (1964). In the context of IRL, early formulations such as that of Ng and Russell (2000) and maximum-entropy variants (Ziebart et al., 2008; Kretzschmar et al., 2016) have proven successful in treating problems with discrete state and control sets. In robotic applications, optimal control problems typically involve decision variables in a continuous domain. Hence, recent work in IOC differs from the IRL literature mentioned above as it is explicitly designed for smooth problems.

One common framework for addressing IOC problems with nonlinear dynamics and nonquadratic cost structures is bilevel optimization (Mombaur et al., 2010; Albrecht et al., 2011). Here, the outer problem is a least squares or maximum likelihood estimation (MLE) problem in which demonstrations are matched with a nominal trajectory estimate and decision variables parameterize the objective of the underlying optimal control problem. The inner problem determines the nominal trajectory estimate as the optimizer of the “forward” (i.e., standard) optimal control problem for the outer problem’s decision variables. A key benefit of bilevel IOC formulations is that they naturally adapt to settings with noise-corrupted partial state observations (Albrecht et al., 2011).

Early bilevel formulations for IOC utilize derivative-free optimization schemes to estimate the unknown objective parameters in order to avoid explicit differentiation of the solution to the inner optimal control problem (Mombaur et al., 2010). That is, the inner solver is treated as a black-box mapping from cost parameters to optimal trajectories which is utilized by the outer solver to identify the unknown parameters using a suitable derivative-free method. While black-box approaches can be simple to implement due to their modularity and lack of reliance on derivative information, they often suffer from a high sampling complexity (Nocedal and Wright 2006). Since each sample in the context of black-box IOC methods amounts to solving a full optimal control problem, such approaches remain intractable for scenarios with large state spaces or additional unknown parameters, such as unknown initial conditions.

Other works instead embed the Karush–Kuhn–Tucker (KKT) conditions of the inner problem as constraints on the outer problem. Since these techniques enforce only first-order necessary conditions of optimality, globally optimal observations are unnecessary and locally optimal demonstrations suffice. Yet, a key computational difficulty of KKT-constrained IOC formulations is that they yield a nonconvex optimization problem due to decision variables in the outer problem appearing nonlinearly with inner problem variables in KKT constraints. This occurs even in the relatively benign case of linear-quadratic IOC.

In contrast to bilevel optimization formulations where necessary conditions of optimality are embedded as constraints, recent methods (Levine and Koltun 2012; Englert and Toussaint 2018; Awasthi 2019; Menner and Zeilinger 2020; Jin et al., 2021) minimize the residual of these conditions directly at the demonstrations. Since the observed demonstration is assumed to satisfy any constraints of the underlying forward optimal control problem, this method can be formulated as fully unconstrained optimization. Additionally, these residual formulations yield a convex optimization problem if the class of objective functions is convex in the unknown parameters at the demonstration (Keshavarz et al., 2011; Englert and Toussaint 2018). This condition holds in the common setting of linearly parameterized objective functions. Levine and Koltun (2012) propose a variant of this approach that utilizes quadratic approximations of the reward model around demonstrations to derive optimality residuals in a maximum entropy framework. Englert and Toussaint (2018) present extensions of this method do accommodate inequality constraints on states and inputs. Much like KKT-constrained formulations, these residual methods operate on locally optimal demonstrations. However, an important limitation of residual methods is that they require observations of full state and input sequences. More recently, Menner and Zeilinger (2020) compared IOC techniques based on KKT constraints and residuals and demonstrated inferior performance of the latter even in problems with linear dynamics and quadratic target objectives.

Our work takes inspiration from the KKT-constraint formulation for single-player IOC as discussed by Albrecht et al. (2011) and Menner and Zeilinger (2020). While these works apply only to single-player settings, we utilize the necessary conditions for open-loop Nash equilibriums (OLNEs) (Başar and Olsder 1999) to generalize this approach to noncooperative multi-player scenarios.

2.2. Multi-player inverse dynamic games

Many of the IOC techniques discussed above have close analogues in the context of multi-player inverse dynamic games.

As in single-player IOC, methods akin to black-box bilinear optimization have also been studied in the context of inverse games (Peters 2020; Le Cleac’h et al., 2021). Peters (2020) uses a particle-filtering technique for online estimation of human behavior parameters. This work demonstrates the importance of inferring human behavior parameters for accurate prediction in interactive scenarios. However, there, inference is limited to a single parameter and the work highlights the challenges associated with scaling this sampling-based approach to high-dimensional latent parameter spaces. Le Cleac’h et al. (2021) employ a similar derivative-free filtering technique based on an unscented Kalman filter. While this approach drastically reduces the overall sample complexity, it still relies on exact observations of the state to reduce the required number of solutions to full dynamic games at the inner level.

Another line of research has put forth solution techniques for inverse games that follow from the residual methods outlined in Section 2.1 (Köpf et al., 2017; Rothfuß et al., 2017; Awasthi and Lamperski 2020; Inga et al., 2019). Köpf et al. (2017) study a special case of an inverse linear-quadratic game in which the equilibrium feedback strategies of all but one player are known. This assumption reduces the estimation problem to single-player IOC to which the residual methods discussed above can be applied directly. Rothfuß et al. (2017) present a more general approach that does not exploit such special structure but instead minimizes the residual of the first-order necessary conditions for a local OLNE. Inga et al. (2019) present a variant of this OLNE residual method in a maximum entropy framework, generalizing the single-player IOC algorithm proposed by Levine and Koltun (2012). Recently, Awasthi and Lamperski (2020) also extended the OLNE residual method of Rothfuß et al. (2017) to inverse games with state and input constraints. This approach extends that of Englert and Toussaint (2018) to noncooperative multi-player scenarios.

All of these inverse game KKT residual methods share many properties with their single-player counterparts. In particular, since they rely upon only local equilibrium criteria, they are able to recover player objectives even from local-rather than only global-equilibrium demonstrations. However, as in the single-player case, they rely upon observation of both state and input to evaluate the residuals.

In contrast to KKT residual methods (Rothfuß et al., 2017; Awasthi and Lamperski 2020; Inga et al., 2019), we enforce these conditions as constraints on a jointly estimated trajectory, rather than minimizing the residual of these conditions directly at the observation. By maintaining a trajectory estimate in this manner, our method explicitly accounts for observation noise, partial state observability, and unobserved control inputs. Furthermore, in contrast to black-box approaches to the inverse dynamic game problem (Peters 2020; Le Cleac’h et al., 2021), our method does not require repeated solutions of the underlying forward game. Moreover, our method returns a full forward game solution in addition to the estimated objective parameters for all players.

3. Background: Open-loop Nash games

While this work is concerned with the inverse game problem of learning objectives from observed behavior, we first provide a technical introduction to the theory of forward open-loop dynamic Nash games. These forward games correspond to the model that we seek to recover in this work. Furthermore, as we shall discuss in Section 4, they may be used at the inner level of a bilevel optimization problem to formulate the inverse game problem.

As discussed in Section 1, dynamic games provide an expressive mathematical formalism for modeling the strategic interactions of multiple agents with differing objectives. Interested readers are directed to Başar and Olsder (1999) for a more complete discussion. We note that dynamic games afford a wide variety of equilibrium concepts; our choice of open-loop Nash equilibria in this work captures scenarios in which players do not account for future information gains and instead commit to a sequence of control decisions a priori. These conditions may occur when occlusions prevent future information gains or when bounded rationality causes players to ignore them. OLNEs have been demonstrated to capture dynamic interaction when embedded in receding-horizon re-planning schemes (Wang et al., 2019; Le Cleac’h et al., 2020). Beyond that, restricting our attention to OLNEs engenders computational advantages which are discussed below. Other choices of solution concept are possible and should be explored in future work. Recent methods such as those of Di and Lamperski (2019) and Le Cleac’h et al. (2020) facilitate efficient solutions to the “forward” open-loop games given players’ objectives a priori.

3.1. Preliminaries

Consider a game played between N players over discrete time-steps t ∈ [T]≔{1, …, T}. The game is composed of three key components: dynamics, objectives (which are later presumed to be unknown in this work), and information structure.

We presume that the game is Markov with respect to state $x \in R^{n}$ . That is, given each player’s control input $u^{i} \in R^{m^{i}}, i \in [N]$ , the state evolves according to the difference equation

x_{t + 1} = f_{t} (x_{t}, u_{t}^{1}, \dots, u_{t}^{N})

(1)

For clarity, we shall introduce the following shorthand notation

\begin{array}{c} x & = (x_{1}, \dots, x_{T}), \\ u^{i} & = (u_{1}^{i}, \dots, u_{T}^{i}), \\ u_{t} & = (u_{t}^{1}, \dots, u_{t}^{N}), \\ u & = (u^{1}, \dots, u^{N}) \end{array}

Observe that the state x pertains to the entire game, not only to a single player. In the examples presented in this paper, x is simply the concatenation of individual players’ states, and correspondingly the dynamics are independent for all players. However, this is not always the case, and the methods developed here apply in the more general settings as well.

The objective of player i is encoded by their distinct cost function Jⁱ, which they seek to minimize. This cost can in general depend upon the sequence of states and inputs for all players.¹ In this paper, we presume that objectives are expressed in time-additive form, as is common across the optimal control and reinforcement learning literature

J^{i} (x, u) : = \sum_{t = 1}^{T} g_{t}^{i} (x_{t}, u_{t}^{1}, \dots, u_{t}^{N})

(2)

Since the state trajectory x follows equation (1), these cost functions can also be written in terms of the initial condition x₁ and the sequence of control inputs for all players u. For this reason, we shall also use the notation Jⁱ (u; x₁) and refer to the tuple of initial state, dynamics, and objectives as follows

Γ : = (x_{1}, {f_{t}}_{t \in [T]}, {J^{i}}_{i \in [N]})

(3)

Finally, the information structure of a dynamic game refers to the information available to each player when they are required to make a decision at each time. At time t, then, Player-i’s input is a function $γ_{t}^{i} : I_{t}^{i} \to R^{m^{i}}$ , where $I_{t}^{i}$ is the set of information available to Player-i at time t. In this paper, we consider open-loop information structures, that is, where $I_{t}^{i} = {x_{1}}$ .² In open-loop information, then, it suffices for Player-i to specify their input sequence uⁱ given a fixed initial condition x₁. For this reason, we neglect a more detailed treatment of strategy spaces and information structure and simply refer to the finite-dimensional sequence of control inputs for each player.

This characterization of a dynamic game is intentionally general. Our solution methods will rely upon established numerical methods for smooth optimization, however, and as such we require the following assumption.

Assumption 1

(Smoothness) Dynamics f and objectives J ⁱ have well-defined second derivatives in all state and control variables, at all times and for all players.

Most physical systems of interest and interactions thereof are naturally modeled in this way. However, we note that, for example, hybrid dynamics such as those induced by contact do not satisfy this assumption.

We shall illustrate key concepts using a consistent “running example” throughout the paper.

Running example

Consider an N = 2-player linear-quadratic (LQ) game that is, one in which dynamics f _t are linear in state x _t and control inputs u _t , and costs J ⁱ are quadratic in states and controls. Let each player independently follow the dynamics of a double integrator in the Cartesian plane. State $x = (p_{x}^{1}, p_{y}^{1}, {\dot{p}}_{x}^{1}, {\dot{p}}_{y}^{1}, p_{x}^{2}, p_{y}^{2}, {\dot{p}}_{x}^{2}, {\dot{p}}_{y}^{2})$ then evolves with inputs $u^{i} = ({\ddot{p}}_{x}^{i}, {\ddot{p}}_{y}^{i})$ according to

\begin{array}{c} x_{t + 1} = \overset{A}{\overset{⏞}{[\begin{array}{c} \tilde{A} & 0 \\ 0 & \tilde{A} \end{array}]}} x_{t} + \overset{B^{1}}{\overset{⏞}{[\begin{array}{c} \tilde{B} \\ 0 \end{array}]}} u_{t}^{1} + \overset{B^{2}}{\overset{⏞}{[\begin{array}{c} 0 \\ \tilde{B} \end{array}]}} u_{t}^{2}, \end{array}

(4)

\begin{array}{c} w h e r e \tilde{A} = [\begin{array}{c} 1 & 0 & Δ t & 0 \\ 0 & 1 & 0 & Δ t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{array}], \tilde{B} = [\begin{array}{c} 0 & 0 \\ 0 & 0 \\ Δ t & 0 \\ 0 & Δ t \end{array}], \end{array}

and Δt is a uniform time discretization, for example., 0.1s. Each player has a quadratic objective of the form

J^{i} = \frac{1}{2} \sum_{t = 1}^{T} (θ_{Q}^{i} x_{t}^{T} Q_{t}^{i} x_{t} + \sum_{j = 1}^{N} θ_{R}^{i j} u_{t}^{j •} R_{t}^{i j} u_{t}^{j})

(5)

In this simple example,

Q_{t}^{i}

and

R_{t}^{i j}

are known, positive definite matrices encoding the preferences of each player. The scalars

θ_{Q}^{i} \in R

and

θ_{R}^{i j} \in R

weight these known matrices. In this paper, we develop a technique to learn a priori unknown parameters such as the costs weights above from both offline and online data. Note that this simple LQ game shall only serve to explain the general concepts of our method. For our experiments presented in Section 7 , we consider more complex problems with nonlinear dynamics and nonquadratic costs, such as the 5-player highway navigation problem shown in Figure 1 .

3.2. The Nash solution concept

Combining these components, each player i in an open-loop dynamic game seeks to solve the following optimization problem

\forall i \in [N] {\begin{cases} \min_{x, u^{i}} J^{i} (u; x_{1}) & (6 a) \\ s . t . x_{t + 1} = f_{t} (x_{t}, u_{t}), \forall t \in [T - 1] . & (6 b) \end{cases}

There exist a variety of distinct solution concepts for such smooth open-loop dynamic games. In this paper, we consider the well-known Nash equilibrium concept, wherein no player has a unilateral incentive to change its strategy. Mathematically, the Nash concept is defined as follows.

Definition 1

(Open-loop Nash equilibrium) The strategies u*≔(u¹*, …, u^N∗) constitute an open-loop Nash equilibrium (OLNE) in the game $Γ = (x_{1}, {f_{t}}_{t \in [T]}, {J^{i}}_{i \in [N]})$ if the following inequalities hold:

J^{i *} = J^{i} (u^{*}; x_{1}) \leq J^{i} ((u^{i}, u^{- i *}); x_{1}), \forall i \in [N]

(7)

Here, we use the shorthand (uⁱ, u⁻ⁱ*) to indicate the collection of strategies in which only Player-i deviates from the Nash profile, that is, uⁱ ≠ uⁱ*.

Note that, at a Nash equilibrium, each player must independently have no incentive to deviate from its strategy. Since players’ objectives may generally conflict, the Nash concept encodes noncooperative, rational, and potentially selfish behavior.

Unfortunately, Nash equilibria are known to be very difficult to find in general (Daskalakis et al., 2009). In this work, we seek only local equilibria which satisfy the Nash conditions equation (7) to first order. That is, following similar approaches in both single-player IOC (Albrecht et al., 2011; Englert and Toussaint 2018) and forward/inverse open-loop games (Le Cleac’h et al., 2020; Awasthi 2019), we encode forward optimality via the corresponding first-order necessary conditions. These first-order necessary conditions are given by the union of the individual players’ KKT conditions, that is

\begin{array}{l} 0 = G (x, u, λ) : = [\begin{array}{c} \begin{array}{c} \nabla_{x} J^{i} + \nabla_{x} F {(x, u)}^{•} λ^{i} \\ \nabla_{u^{i}} J^{i} + \nabla_{u^{i}} F {(x, u)}^{•} λ^{i} \end{array}} \forall i \in [N] \\ F (x, u) \end{array}] \end{array}

(8)

Here, the first two block rows are repeated for all players, and the function F (x, u) accumulates the dynamic constraints of equation (6a) at all time steps, with the t^th row given by $x_{t + 1} - f_{t} (x_{t}, u_{t}^{1}, \dots, u_{t}^{N})$ . Note that we have also introduced costate variables $λ^{i} : = (λ_{1}^{i}, \dots, λ_{T - 1}^{i})$ for each player, with $λ_{t}^{i} \in R^{n}$ the Lagrange multiplier corresponding to Player-i’s dynamics constraint in equation (6a) at time step t. Note that, as with control inputs, we use the notation λ ≔( λ ¹, …, λ ^N ).

Running example

Consider the two-player LQ example above with double integrator dynamics given by equation (4) and quadratic objectives given by equation (5). The t^th block of the first row of equation (8) is given by

0 = θ_{Q}^{i} Q_{t}^{i} x_{t} + λ_{t - 1}^{i} - A^{•} λ_{t}^{i}

(9)

for Player-i. Likewise, the t^th block of the second row of equation (8) for Player-i is given by

0 = θ_{R}^{i i} R_{t}^{i i} u_{t}^{i} - B^{i •} λ_{t}^{i}

(10)

Finally, the t^th block of the final row of equation (8) is given by

0 = x_{t + 1} - A x_{t} - B^{1} u_{t}^{1} - B^{2} u_{t}^{2}

(11)

Computationally, the KKT conditions of the forward game, given in equation (8), are a set of, generally nonlinear, equality constraints in the variables x, u, and λ . To find a solution, that is, a root of G (x, u, λ ), we may employ a root-finding algorithm such as a variant of Newton’s method (Nocedal and Wright 2006: Chapter 11). This is the approach taken by, for example, Le Cleac’h et al. (2020).

Running example

For our LQ example, it can be seen that a single step of Newton’s method on G (⋅) amounts to the well-known Riccati solution to an open-loop LQ game ( Başar and Olsder 1999 : Chapter 6. ³

4. Problem setup

Solving a forward Nash game amounts to identifying optimal strategies for all players, provided a priori knowledge of their objectives Jⁱ. By contrast, in this work we are concerned with the inverse Nash problem, that is, that of identifying players’ objectives which explain their observed behavior. To develop the inverse Nash problem, here we shall presume that learning occurs offline, given a sequence of noisy, partial observations of all players’ state. The method we develop for this setting, however, is amenable to trajectory prediction and online, receding horizon operation as discussed in Section 5.2.

We formulate the inverse Nash problem as one of offline learning, in which players’ objectives belong to a known parametric function class. To that end, we make the following assumption.

Assumption 2

(Parametric objectives) Player-i’s cost function is fully described by a vector of parameters $θ^{i} \in R^{k^{i}}$ . That is, $J^{i} (\cdot; θ^{i}) \equiv \sum_{t = 1}^{T} g_{t}^{i} (x_{t}, u_{t}^{1}, \dots, u_{t}^{N}; θ^{i})$ .

Recalling Assumption 1, the functions $g_{t}^{i} (\cdot; θ^{i})$ have well-defined derivatives in states x_t and controls uⁱ. We shall also extend this smoothness assumption to include the parameters themselves.

Assumption 3

(Smoothness in parameter space) Extending Assumption 1, we require that stage cost functions $g_{t}^{i} (\cdot; θ^{i})$ have well-defined first- and second-derivatives with respect to the parameter vector θ ⁱ .

This smoothness assumption is quite general. For example, players’ stage costs $g_{t}^{i} (\cdot; θ^{i})$ may be encoded as arbitrary function approximators such as artificial neural networks. In this work, we choose a more interpretable (though less flexible) parametric structure; we defer an investigation of more general cost structures for future work. In particular, the examples considered here use a linearly parameterized structure in which $g_{t}^{i} (\cdot; θ^{i})$ is a linear function of θⁱ, that is, $g_{t}^{i} (\cdot; θ^{i}) \equiv θ^{i T} {\tilde{g}}_{t}^{i} (\cdot)$ for some set of potentially nonlinear basis functions ${\tilde{g}}_{t}^{i} (\cdot)$ . By incorporating appropriate domain-specific knowledge, however, these relatively simple cost structures are able to encapsulate complex, strategic interactions such as the highway lane changes of Figure 1.

Running example

Recall the quadratic objectives of equation ( 5 ), and take cost parameters $θ^{i} = {(θ_{Q}^{i}, θ_{R}^{i j})}_{j \in [N]}$ . Observe, therefore, that Player-i’s objective depends linearly upon its cost parameters θ ⁱ .

Thus equipped, the objective learning problem reduces to maximizing the likelihood of a sequence of partial state observations y ≔(y₁,…,y_T) for the parametric class of games $Γ (θ) = (x_{1}, f, {J^{(i)} (\cdot; θ^{(i)})}_{i \in [N]})$ . Formally, we seek to solve a problem of the form

\begin{array}{c} \max_{θ, x, u} & p (y ∣ x, u) \end{array}

(12a)

\begin{array}{c} s . t . & (x, u) i s a n O L N E o f Γ (θ) \end{array}

(12b)

\begin{array}{c} (x, u) i s d y n a m i c a l l y f e a s i b l e u n d e r f \end{array},

(12c)

where θ aggregates all players’ cost parameters, that is, θ≔(θ¹,…,θ ^N ), and p(y∣x, u) constitutes a known observation likelihood, or measurement, model.

Remark 1

(Initial state) Observe that x ₁ is an explicit decision variable in equation ( 12a ), whereas it represents a constant (known) initial condition in the forward game problem discussed in Section 3 . This reflects the fact that the state trajectory, including initial conditions, must be estimated as part of the inverse problem. As we shall see, estimating the state trajectory jointly with the cost parameters allows our method to be less sensitive to observation noise.

This measurement model is arbitrary, though, following Assumption 1 and Assumption 3, it must be smooth. In the simplest instance, we may receive an exact measurement of the sequence of states and inputs for all players. In that case, the measurement model p(y∣x, u) reduces to a Dirac delta function. More generally, p(y∣x, u) may be an arbitrary smooth probability density function, making our formulation amenable to realistic sensors such as cameras or LiDARs.

Prior work in both single-player IOC, such as that of Englert and Toussaint (2018), and inverse games, such as those of Awasthi and Lamperski (2020) and Rothfuß et al. (2017), presumes a degenerate measurement model in which states and controls are observed directly without any noise. When perfect observations are unavailable, these methods naturally extend by first estimating a sequence of likely states and controls (a standard nonlinear filtering problem). In Section 6, we describe these sequential estimation methods in greater detail. In contrast, our formulation given in equation (12b) encodes a coupled estimation problem in which states, control inputs, and cost parameters must all be estimated simultaneously. Thus, our method exploits the additional coupling imposed by the Nash equilibrium constraints onto the unknowns. In Section 7, we conduct a series of Monte Carlo experiments to quantify the advantages afforded by simultaneous learning over sequential estimation.

5. Equilibrium-constrained cost learning

Here we present our core contribution, a mathematical formulation of objective inference in multi-agent, noncooperative games. We express this problem as a nonconvex optimization problem with equilibrium constraints, which we relax into a standard-format equality-constrained nonlinear program.

5.1. Offline learning

We first consider the problem of learning each player’s objective from previously recorded data of prior interactions, offline.

Equation (12c) is a mathematical program with equilibrium constraints (Luo et al., 1996; Ferris et al., 2005), with the nested equilibrium conditions of equation (12b) encoding the Nash inequalities of Definition 1. Equilibrium constraints generalize bilevel programming, and computational approaches tend to be less mature than those for standard-form (in)equality-constrained programming.

We relax the equilibrium constraint of equation (12b) by replacing it with its KKT conditions, that is, by substituting equation (8). This yields the following equation

\begin{array}{c} \max_{θ, x, u, λ} & p (y ∣ x, u) \end{array}

(13a)

\begin{array}{c} s . t . & G (x, u, λ; θ) \end{array} = 0

(13b)

Here, we have explicitly written the KKT conditions from equation (8) in terms of the cost parameters θ. Additionally, observe that in equation (13a), the costates λ required to evaluate the KKT conditions G (⋅; θ) appear as additional primal variables. The constraints of equation (subsection 13b) will be assigned their own Lagrange multipliers, which are distinct from the original costates. By letting states, control inputs, and costates be primal variables, the KKT conditions G (⋅) do not depend explicitly upon the observations y. Thus, solving equation (13b) does not require complete state or input observations; rather, the equilibrium constraints of equation (subsection 13b) allow us to reconstruct this missing information while we estimate cost parameters θ, simultaneously. Several remarks are in order.

Remark 2

(Multiple observed trajectories) We have developed equation ( 13a ) for the setting in which a single trajectory ( x , u ) has been observed, yielding a measurement sequence y . However, our approach affords straightforward extension to settings in which player’s objectives are learned from multiple demonstrations. In this instance, the primal variables ( x , u , λ ) would be replicated for all trajectories, although the cost parameters θ would be shared. The objective given by equation ( 13a ) would be replaced by the joint probability of all measurements conditioned on all underlying trajectories, and the equilibrium constraints in equation ( 13b ) would be concatenated for all trajectories.

Remark 3

(Regularizing parameters) Depending upon the parametric structure of players’ objectives J ⁱ (⋅; θ ⁱ ), and hence the structure of KKT residual G (⋅; θ), it can be critical to regularize and/or constrain cost parameters. For example, if there exists a choice of θ ⁱ for Player-i such that J ⁱ ( x , u ; θ ⁱ ) is constant for all dynamically feasible trajectories ( x , u ), then every such trajectory would satisfy the equilibrium constraint of equation ( 12b ). Such choices of θ must be avoided, for example, by regularizing or otherwise constraining parameters.

Running example

Following Remark 3, we constrain the parameters θ ⁱ ≥ c > 0. Moreover, to account for scale invariance, we constrain their sum to unity, that is, $\sum_{i \in [N]} (θ_{Q}^{i} + \sum_{j \in [N]} θ_{R}^{i j}) = 1$ .

5.1.1. Least squares

A common observation model p(y∣x, u) is the additive white Gaussian noise (AWGN) model. Here, each observation y_t depends only upon the current state x_t and control inputs u _t , that is

y_{t} = h_{t} (x_{t}, u_{t}) + n_{t}

(14)

where the (potentially nonlinear) function h_t computes the expected measurement and n_t is a zero-mean Gaussian white noise process with known covariance, that is,

n_{t} \sim N (0, Σ_{t})

. In this case, following standard methods in maximum likelihood estimation (Gallager 2013), it is straightforward to express the maximization in equation (13a) as nonlinear least squares by taking the negative logarithm of p(y∣x, u)

\min_{θ, x, u, λ} \sum_{t = 1}^{T} {(y_{t} - h_{t} (x_{t}))}^{•} Σ_{t}^{- 1} (y_{t} - h_{t} (x_{t}))

(15a)

\begin{array}{c} s . t . & G (x, u, λ; θ) = 0 \end{array}

(15b)

in summary, this inverse problem entails the following task: Find those parameters θ for which the corresponding game solution generates expected observations near the observed data. This formulation of the inverse game problem can be encoded using well-established numerical modeling languages such as CasADi (Andersson et al., 2019) or JuMP (Dunning et al., 2017), and solved using off-the-shelf optimization routines such as IPOPT (Wächter and Biegler 2006) or SNOPT (Gill et al., 2005).

5.1.2. Problem complexity

Let us examine the structure of the least squares problem in equation (15a) more carefully. In general, the observation map h_t (⋅) and KKT conditions G(⋅; ⋅) may be arbitrarily nonlinear. Therefore, without further structural assumptions, our formulation is an equality-constrained nonlinear least squares problem. Due primarily to the nonlinearities in G, equation (15b) is generally nonconvex. Solution methods, therefore, may be sensitive to initial values of primal variables; we discuss a straightforward initialization scheme in Section 6.1.

Perhaps surprisingly, this nonconvexity persists in the LQ setting of our running example, even when h_t (⋅) is the identity.

Running example

Consider the LQ setting, with $θ^{i} = {(θ_{Q}^{i}, θ_{R}^{i j})}_{j \in [N]}$ as before. Let the observation map be the identity, that is, h _t (x _t ) = x _t and presume AWGN. The resulting nonlinear least squares problem in equation ( 15a ) has constraints of the form given in equations ( 9 ) to ( 11 ). Let us consider the first of these constraints for a single time step t and Player-i

0 = θ_{Q}^{i} Q_{t}^{i} x_{t} + λ_{t - 1}^{i} - A^{•} λ_{t}^{i}

Recall that the decision variables in our formulation are (θ, x, u, λ ). Here, we see that θⁱ multiplies x_t. At best, therefore, this constraint is a bilinear equality, making the overall problem in equation (15b) nonconvex even for this minimal inverse LQ game.

When we directly observe both state and control inputs without noise, that is, y_t ≡ (x_t, u _t ), these constraints become linear even in the general non-LQ setting, so long as players’ objectives are linearly parameterized. In this setting, we may rewrite equation (9) as follows

\begin{array}{c} 0 & = \nabla_{x_{t}} \overset{θ^{i ⊤} {\tilde{g}}_{t}^{i} (\cdot)}{\overset{⏞}{g_{t}^{i} (x_{t}, u_{t}; θ^{i})}} + λ_{t - 1}^{i} - \nabla_{x_{t}} f_{t} {(x_{t}, u_{t})}^{•} λ_{t}^{i} \end{array}

(16a)

\begin{array}{c} = θ^{i T} \nabla_{x_{t}} {\tilde{g}}_{t}^{i} (x_{t}, u_{t}) + λ_{t - 1}^{i} - \nabla_{x_{t}} f_{t} {(x_{t}, u_{t})}^{•} λ_{t}^{i} \end{array}

(16b)

with this observation model, then, the only decision variables are

(θ^{i}, λ_{t}^{i}, λ_{t - 1}^{i})

, which all appear linearly. Furthermore, the least squares objective in equation (15a) becomes unnecessary, since, by assumption, the measurements y already include the states x exactly. Incorporating these simplifications, the entire constrained least squares problem of equation (15a) reduces to the problem

\begin{array}{c} f i n d & θ, λ \end{array}

(17a)

\begin{array}{c} s . t . & 0 = θ^{i ⊤} \nabla_{x_{t}} {\tilde{g}}_{t}^{i} (x_{t}, u_{t}) + λ_{t - 1}^{i} \\ - \nabla_{x_{t}} f_{t} {(x_{t}, u_{t})}^{•} λ_{t}^{i}, \forall i, t \end{array}

(17b)

\begin{array}{c} 0 = θ^{i ⊤} \nabla_{u_{t}^{i}} {\tilde{g}}_{t}^{i} (x_{t}, u_{t}) \\ - \nabla_{u_{t}^{i}} f_{t} {(x_{t}, u_{t})}^{•} λ_{t}^{i}, \forall i, t \end{array}

(17c)

Because the constraints in equation (17a) are linear, the problem is equivalent to a linear system of equations. Moreover, since the constraints are completely decoupled for each player, they may be solved separately and in parallel for all players to obtain cost parameters θ ⁱ and costates λ ⁱ . This reduction forms the basis for the state of the art in solving inverse dynamic games (Rothfuß et al., 2017; Awasthi and Lamperski 2020), which only apply in settings with perfect state and input observations. To compare against these methods in more general settings that feature noise, unobserved inputs, and partial state measurements, we augment these methods with a sequential optimization procedure in Section 6. Comparative Monte Carlo studies of all approaches are presented in Section 7.

5.2. Online learning

While Section 5.1 estimates the objectives of interacting agents from recorded data offline, our formulation for inverse Nash problems extends naturally to an online learning setting; i.e. cost learning from observations of ongoing interactions. As we shall discuss below, our method can perform online cost learning and trajectory prediction simultaneously, making it suitable for receding horizon applications.

5.2.1. Learning with prediction

Equipped with a tractable solution strategy for the setting of offline learning, we now consider a coupled prediction and learning problem. Similar problems have been considered in the single-agent setting by, e.g. Jin et al. (2021) and Mukadam et al. (2019). Here, we aim to learn the cost parameters θ from only a subset of the game horizon; i.e., we presume that observations $y = (y_{1}, \dots, y_{\tilde{T}})$ where the observation horizon $\tilde{T} \leq T$ . Despite this change, the original problem of equation (12a) remains effectively unchanged; only the objective has changed. In particular, by substituting the KKT conditions for an OLNE in place of the original equilibrium constraint as in equation (13a), and making AWGN assumptions, we recover a variant of the constrained least squares formulation of equation (15a)

\min_{θ, x, u, λ} \sum_{t = 1}^{\tilde{T}} {(y_{t} - h_{t} (x_{t}))}^{•} Σ_{t}^{- 1} (y_{t} - h_{t} (x_{t}))

(18a)

\begin{array}{c} s . t . & G (x, u, λ; θ) = 0 \end{array}

(18b)

Note that the upper limit of addition is $\tilde{T}$ , rather than T as in equation (15a), while the OLNE KKT conditions in equation (subsection 18b) depend upon states, inputs, and costates for all times $t \in {1, \dots, \tilde{T}, \dots, T}$ .

Despite the similarities between this problem and equation (15a), the Nash trajectory (x∗, u*), which emerges as a solution, affords a new interpretation. In particular, for times $t \leq \tilde{T}$ , these equilibrium states and controls constitute filtered estimates of the observed quantities y, while for times $t > \tilde{T}$ they represent predictions of the future. Importantly, however, extending trajectories beyond the observation horizon $\tilde{T}$ adds additional constraints to equation (15b). This ability to incorporate future, unobserved states makes the method more robust and data efficient when only a fraction of the game horizon is observed. Consequently, this formulation can be employed for online learning in scenarios of ongoing interactions. We provide a detailed empirical analysis of this setting in Section 7.2.2. A summary of this variant of our inverse game solver is provided in Figure 2(a).

Figure 2.

Schematic overview of inverse game solvers set up for online operation. (a) Our method computes player’s objectives, state estimates, and trajectory predictions jointly. (b) The baseline requires full knowledge of states and inputs and therefore must preprocess raw observations before it can estimate players’ objectives. In order to generate trajectory predictions, the baseline must solve an additional forward game formulated over the estimated initial states and objectives.

5.2.2. Receding horizon learning

Our method is directly amenable to receding horizon, online operation. Here, we suppose that the agents interact over the half-open time-interval $t \in {1, \dots, \tilde{T}, \dots, \infty}$ , and that observations exist for $t \leq \tilde{T}$ . Here, $\tilde{T}$ may be interpreted as the current time and, as time elapses, both $\tilde{T}$ and the overall prediction horizon T increase accordingly. Unfortunately, however, increasing the overall problem horizon increases the number of variables in equation (12a), eventually making the problem intractable.

To simplify matters, we approximate the learning problem at each instant by neglecting all times outside the interval ${\tilde{T} - s_{o}, \dots, \tilde{T}, \dots, \tilde{T} + s_{p}}$ , where s_o is the length of a fixed-lag buffer of past observations and s_p is the horizon of future state predictions. In this setting, the total number of variables remains constant (since the length of this interval is constant), rendering equation (12b) tractable to solve online. More precisely, at time $\tilde{T}$ (and under AWGN assumptions), we solve a modified version of equation (18b)

\min_{θ, x, u, λ} \sum_{t = \tilde{T} - s_{o}}^{\tilde{T}} {(y_{t} - h_{t} (x_{t}))}^{•} Σ_{t}^{- 1} (y_{t} - h_{t} (x_{t}))

(19a)

\begin{array}{c} s . t . & G (x, u, λ; θ) = 0 \end{array}

(19b)

where the KKT constraint G (⋅) is understood to depend upon times

t \in {\tilde{T} - s_{o}, \dots, \tilde{T}, \dots, \tilde{T} + s_{p}}

and states, control inputs, and costates are also limited to that interval. At each later time, we solve a problem with identical structure, with the understanding that

\tilde{T}

will have changed to reflect the elapsed time. In effect, this procedure amounts to simultaneous fixed-lag smoothing and receding-horizon prediction. We simulate this online learning procedure in Section 7.3.2.

6. Baseline

Recall the discussion of Section 5.1.2, in which we show that—with noiseless observations of states x and controls u, and linear cost parameterization $g_{t}^{i} (\cdot; θ^{i}) \equiv θ^{i T} {\tilde{g}}_{t}^{i} (\cdot)$ —our formulation reduces to the linear system of equations of equation (17a). This reduction underlies state-of-the-art methods for learning the objectives of players in games (Rothfuß et al., 2017; Awasthi and Lamperski 2020). Therefore, such methods unfortunately require noiseless observations of the full state and input sequences for all players. In contrast, our approach in equation (13b) is amenable to noisy, partial observations.

6.1. Recovering unobserved variables

To provide a meaningful comparison between our proposed technique and the state-of-the-art in settings with imperfect observations, we augment (Rothfuß et al., 2017; Awasthi and Lamperski 2020) with a pre-processing to estimate unobserved states and inputs. To that end, we solve the following relaxed version of equation (13a)

\tilde{x}, \tilde{u} : = \arg \max_{x, u} p (y ∣ x, u)

(20a)

\begin{array}{c} s . t . & F (x, u) = 0 \end{array}

(20b)

As in Section 5.1.1, under an AWGN assumption equation (20a) becomes equality-constrained nonlinear least squares. However, unlike equation (15a), we have neglected the first two rows of the equilibrium constraint given in equation (8). That is, equation (20b) computes a maximum likelihood estimate of states and inputs irrespective of the underlying game structure.

The solution of this smoothing problem is used as an estimate of states and inputs when the baseline is employed in partially observed settings. Beyond that, the same procedure serves as simple, yet effective initialization scheme for our method to tackle issues of non-convexity discussed in Section 5.1.2.

6.2. Minimizing KKT residuals

Like our proposed method, the state-of-the-art methods developed by Rothfuß et al. (2017) and Awasthi and Lamperski (2020) use the forward game’s KKT conditions to measure the quality of a set of cost parameters θ. While we compare to this derivative-based, KKT condition approach, we note that other approaches outlined in Section 2.2 such as Le Cleac’h et al. (2021) utilize black-box optimization methods and do not require or exploit derivative information. These significant algorithmic differences and the resulting differences in sample complexity, locality of solutions, etc., make a direct comparison difficult to interpret.

Specifically, the KKT residual method of Awasthi and Lamperski (2020) and Rothfuß et al. (2017) fixes the state and input sequences to their observed—or in our case, estimated via equation (20a)—values. Fixing these variables, however, the resulting linearly constrained satisfiability problem of equation (17c) may be infeasible, depending upon the parametric structure of costs $g_{t}^{i} (\cdot; θ^{i})$ . In lieu, state-of-the-art approaches minimize the KKT residual itself, that is

\min_{θ, λ} {‖ G (\tilde{x}, \tilde{u}, λ; θ) ‖}_{2}^{2}

(21)

In prior work (Awasthi and Lamperski 2020; Rothfuß et al., 2017), $\tilde{x}$ and $\tilde{u}$ are assumed to be directly observed. As discussed in Section 6.1, here we presume they are the results of the pre-processing step given in equation (20a). Additionally, like the linear system of equations in equation (17a), the only decision variables here are the objective parameters θ and the costates λ . In effect, the baseline does not refine the state and input estimates given by the pre-processing step of equation (20b). Furthermore, as in equation (17b), the problem may be decomposed into separate problems for each player and solved in parallel. In essence, then, this KKT residual formulation neglects the coupling between players’ actions which is encoded in the equilibrium conditions; computationally, it reduces to solving separate IOC problems for each player neglecting game-theoretic interactions with others.

A schematic overview of this baseline approach is depicted in Figure 2. By first estimating the states x and inputs u from measurements y, and only afterward learning the cost parameters θ and associated costates λ , the KKT residual method can be thought of as a sequential decomposition of our approach. By contrast, our formulation maintains (x, u) as decision variables and refines the initial guess of $(\tilde{x}, \tilde{u})$ by identifying all variables simultaneously.

7. Experiments

In this work, we develop a technique for learning players’ objectives in continuous dynamic games from noise-corrupted, partial state observations. We conduct a series of Monte Carlo studies to examine the relative performance of our proposed methods and the KKT residual baseline in both offline and online learning settings.⁴

7.1. Experimental setup

We implement our proposed approach as well as the KKT residual baseline of Rothfuß et al. (2017) in the Julia programming language (Bezanson et al., 2017), using the mathematical modeling framework JuMP (Dunning et al., 2017). As a consequence, our implementation encodes an abstract description of equation (13b), making it straightforward to use in concert with a variety of optimization routines. In this work, we use the open source COIN-OR IPOPT algorithm (Wächter and Biegler 2006). The source code for our implementation is publicly available.⁵

To evaluate the relative performance of our proposed approach with the KKT residual baseline, we perform several Monte Carlo studies. The details of these studies are described below. However, all of these studies share the following overall setup: we fix a cost parameterization for each player, find corresponding OLNE trajectories as roots of equation (8) using the well-known iterated best response (IBR) algorithm (Wang et al., 2019), and simulate noisy observations thereof with additive white Gaussian noise (AWGN) as in equation (14). Each study then presents samples across a different problem parameter to test the sensitivity of both approaches to observation noise (Sections 7.2.1 and 7.3.1) and unobserved time-steps (Section 7.2.2) in two different problem settings.

In each of the studies below, we consider N vehicles navigating traffic, and instantiate game dynamics and player objectives as follows. Each vehicle has its own state xⁱ such that the global game state is concatenated as x = (x¹, …, x^N). Further, each vehicle follows unicycle dynamics at time discretization Δt

x_{t + 1}^{i} = {\begin{cases} (x - position) p_{x, t + 1}^{i} & = p_{x, t}^{i} + Δ t v_{t}^{i} \cos ψ_{t}^{i} \\ (y - position) p_{y, t + 1}^{i} & = p_{y, t}^{i} + Δ t v_{t}^{i} \sin ψ_{t}^{i} \\ (heading) ψ_{t + 1}^{i} & = ψ_{t}^{i} + Δ t ω_{t}^{i} \\ (s p e e d) v_{t + 1}^{i} & = v_{t}^{i} + Δ t a_{t}^{i}, \end{cases}

(22)

where

u_{t}^{i} = (ω_{t}, a_{t})

includes the yaw rate and longitudinal acceleration. Finally, each player’s objective is characterized by a stage cost

g_{t}^{i}

which is a weighted sum of several basis functions, that is

g_{t}^{i} = \sum_{ℓ = 1}^{5} w_{ℓ}^{i} g_{ℓ, t}^{i} {\begin{cases} g_{1, t}^{i} = 1 (t \geq T - t_{goal}) d (x_{t}^{i}, x_{goal}^{i}) & (23 a) \\ g_{2, t}^{i} = - \sum_{j \neq i} \log ({‖ p_{i} - p_{j} ‖}_{2}^{2}) & (23 b) \\ g_{3, t}^{i} = {(v^{i})}^{2} & (23 c) \\ g_{4, t}^{i} = {(ω_{t}^{i})}^{2} & (23 d) \\ g_{5, t}^{i} = {(a_{t}^{i})}^{2} . & (23 e) \end{cases}

Here, the cost parameters $θ^{i} = {(w_{ℓ}^{i})}_{ℓ \in [5]}, w_{ℓ}^{i} \in R_{+}$ are positive weights for each cost component. Further, $p_{i} = (p_{x}^{i}, p_{y}^{i})$ denotes the planar position of Player-i, and d (⋅, ⋅) is an arbitrary distance mapping. For example, we may choose $d (x_{t}^{i}, x_{goal}^{i}) = ‖ p_{t}^{i} - p_{goal}^{i} ‖_{2}^{2}$ to compute squared distance from a fixed goal position. Note, however, that this map is generic and can also be used to encode more complex goal-reaching specifications as in the highway lane-changing example depicted in Figure 1. Taken together, the basis functions encode the following aspects of each player’s preferences:

1. Be close to the goal state within the last t_goal time steps (23a).

2. Avoid close proximity to other vehicles (23b).

3. Avoid high speeds (23c).

4. Avoid large control efforts (23d, 23e).

Games of this form are inherently noncooperative since players must compete to reach their own goals efficiently while avoiding collision with one another. Hence, they must negotiate these conflicting objectives and thereby find an equilibrium of the underlying game.

In all of the Monte Carlo studies, we evaluate the approaches for two different noisy observation models $h_{t}^{full}$ and $h_{t}^{partial}$ . In $h_{t}^{full} (x_{t}) : = x_{t}$ , estimators observe the full state, and in $h_{t}^{partial} (x_{t}) : = (p_{t}^{1}, ψ_{t}^{1}, \dots, p_{t}^{N}, ψ_{t}^{N})$ , estimators observe the position and heading but not the speed of each agent, that is, they receive a partial state observation.

7.2. Detailed analysis of a 2-player game

We first study the performance of our method in a simplified, N = 2-player game. This set of experiments demonstrates the performance gap of our approach and the KKT residual baseline in methods in a conceptually simple and easily interpretable scenario. Here, the game dynamics are given as in equation (22), and player objectives are parameterized as in equation (23a). In particular, we let $d (x_{t}^{i}, x_{goal}^{i}) = ‖ p_{t}^{i} - p_{goal}^{i} ‖_{2}^{2}$ . In summary, therefore, each vehicle wishes to reach a fixed, known goal position in the plane while avoiding collision with the other.

7.2.1. Offline learning

We begin by studying both our method’s and the baseline’s ability to infer the unknown objective parameters θ, as developed in Section 5.1. To do so, we conduct a Monte Carlo study for the aforementioned 2-player collision-avoidance application.

We generate 40 random observation sequences at each of 22 different levels of isotropic observation noise. For each of the resulting 880 observation sequences, we run both our method and the baseline to recover estimates of weights $θ^{i} = {(w_{ℓ}^{i})}_{ℓ \in [5]}$ for each player. Note that in this offline setting both methods learn these objective parameters from noisy observations of a single, complete game trajectory. That is, each estimate relies upon 25 s of simulated interaction history from a single scenario.

Figure 3 shows the estimator performance for varying levels of observation noise in two different metrics. Figure 3(a) reports the mean cosine error of the objective parameter estimates. That is, we measure cosine-dissimilarity between the unobserved true model parameters θ_true and the learned estimates θ_est according to

D_{\cos} (θ_{true}, θ_{est}) = 1 - \frac{1}{N} \sum_{i \in [N]} \frac{θ_{true}^{i ⊤} θ_{est}^{i}}{{‖ θ_{true}^{i} ‖}_{2} {‖ θ_{est}^{i} ‖}_{2}}

(24)

where the mean is taken over the N players. The normalization of the parameter vectors in equation (24) reflects the fact that the absolute scaling of each player’s objective parameters does not affect their optimal behavior, holding other players’ parameters fixed. In sum, this metric measures the estimator performance in objective parameter space.

Figure 3.

Estimation performance of our method and the baseline for the 2-player collision-avoidance example, with noisy full and partial state observations. (a) Error measured directly in parameter space using equation (24). (b) Error measured in position space using equation (25). Triangular data markers in (b) highlight objective estimates which lead to ill-conditioned games. Solid lines and ribbons indicate the median and IQR of the error for each case.

Figure 3(b) shows the mean absolute position error for trajectory reconstructions computed by finding a root of equation (8) using the estimated objective parameters. Reconstruction error allows us to inspect the quality of learned cost parameters for explaining observed vehicle motion, providing a more tangible metric of algorithmic quality. In addition to the raw data, we highlight the median as well as the interquartile range (IQR) of the estimation error over a rolling window of 60 data points.

Figure 3(a) shows that both our method and the baseline recover the true parameters θ reliably even for partial observations, if the observations are noiseless. However, the performance of the baseline degrades rapidly with increasing noise variance. This pattern is particularly pronounced in the setting of partial observations. On the other hand, our estimator recovers the unknown cost parameters more accurately in both settings, and with a smaller variance than the baseline. Thus, compared to the KKT residual baseline, the performance of our method degrades gracefully when both full and partial observations are corrupted by noise.

Next, we study these methods’ relative performance as measured by reconstruction error, as shown in Figure 3(b). Here, reconstruction error is measured according to

D_{rec} (θ_{true}, θ_{est}) = \frac{1}{N T} \sum_{i \in [N]} \sum_{t \in [T]} {‖ p_{rec, t}^{i} - p_{true, t}^{i} ‖}_{2}

(25)

where

p_{true, t}^{i}

denotes the true position of Player-i at time step t and

p_{rec, t}^{i}

denotes the position reconstructed from a Nash solution to the game with estimated cost parameters θ_est. We see similar patterns here as in the parameter error space, indicating the reliability of our method in both noisy full and partial observation settings.

Additionally, note that we have denoted some data points for the baseline method with triangular markers. For these Monte Carlo samples, the learned parameters θ_est specify ill-conditioned objectives that prevent us from recovering roots of equation (8)—essentially rendering the parameter estimates useless for downstream applications. This can happen, for example, when proximity costs dominate control input costs. For the baseline, a total of 104 out of 880 estimates result in an ill-conditioned forward game when states are fully observed. In the case of partial observations, the number of learning failures increases to 218. In contrast, our method recovers well-conditioned player objectives for all demonstrations and allows for accurate reconstruction of the game trajectory.

For additional intuition of the performance gap, Figure 4 visualizes the reconstruction results in trajectory space for a fixed initial condition. Figure 4(a) shows the noise corrupted demonstrations generated for isotropic AWGN with standard deviation σ = 0.1. Figures 4(b) and (c) show the corresponding trajectories reconstructed by solving the game using the objective parameters learned by our method and the baseline, respectively. Note that our method generates a far smaller fraction of outliers than the baseline. Furthermore, the performance of our method is only marginally affected by partial state observability, whereas baseline performance degrades substantially.

Figure 4.

Qualitative reconstruction performance for the 2-player collision-avoidance example at noise level σ = 0.1 for 40 different observation sequences. (a) Ground truth trajectory and observations, where each player wishes to reach a goal location opposite their initial position. (b, c) Trajectories recovered by solving the game at the estimated parameters for our method and the baseline using noisy full and partial state observations.

7.2.2. Online learning with prediction

Next, we study the performance of both our proposed method and the KKT residual baseline in the setting of objective learning with prediction. Following the problem description of Section 5.2.1, here, only the beginning of an unfolding dynamic game is observed. This problem naturally describes a single time frame of online operations where observations accumulate as time evolves.

We conduct a Monte Carlo analysis of the two-player collision-avoidance game from Section 7.1 in which we vary the number of observed time steps of a fixed-length game. For this truncated observation sequence, each method is tasked to learn the players’ underlying cost parameters θⁱ and predict their motion for the next s_p = 10 time steps. Our method accomplishes these coupled tasks jointly by solving equation (18b). The KKT residual baseline, however, operates on the estimates provided the preceding smoothing step, therefore, cannot couple unobserved, future time steps with cost inference. Instead, it achieves this task in a two-stage procedure: First, parameter estimates are recovered from a truncated game over only the observed $\tilde{T}$ time steps. With these parameters in hand, the baseline then predicts future game states by re-solving a forward game starting from the final state estimate ${\tilde{x}}_{\tilde{T}}$ with time steps simulated from $t \in {\tilde{T}, \dots, \tilde{T} + 10}$ .

In Figure 5, we vary the observation horizon $\tilde{T} \in {5, \dots, 15}$ for a ground-truth game played over 25 time steps. For each value of $\tilde{T}$ , we sample 40 sequences of observations ${y_{t}}_{t = 1}^{\tilde{T}}$ . Here, we fix an isotropic Gaussian noise level of σ = 0.05 and measure the performance of both our method and the baseline using two distinct metrics. In Figure 5(a), we measure learning performance in parameter space using the metric given in equation (24). As shown, our approach consistently estimates the cost parameters more accurately than the baseline. Furthermore, as the observation horizon $\tilde{T}$ increases, both methods improve. In Figure 5(b), we see that these patterns persist when we measure performance in trajectory space, applying the metric of equation (25) to the predicted states $x_{t}, t \in {\tilde{T}, \dots, \tilde{T} + 10}$ . Indeed, in this case, the performance gap is even more pronounced. By observing only $\tilde{T} = 5$ steps, our method reliably outperforms the baseline even when the baseline is given triple the number of observations.

Figure 5.

Estimation performance for our method and the baseline for varying numbers of observations of the 2-player collision-avoidance example at a fixed noise level of σ = 0.05. (a) Estimation performance measured directly in parameter space using equation (24). (b) Prediction error over the next 10 s beyond the observation horizon using equation (25).

To inspect these results more closely, in Figure 6 we show the output of both methods for a single observation sequence of length $\tilde{T} = 10$ . This visualization highlights a key advantage of our approach compared with the baseline. In this scenario, Player-2 (bottom) turns left early on in order to avoid Player-1 (left) later along the path to its goal. Their ground truth trajectories are shown in black. However, the methods only receive noise-corrupted partial state observations of the first $\tilde{T} = 10$ time steps shown in gray. Our method models the players’ interactions as continuing into the future, allowing it to attribute observed behavior to future costs. In this instance, our method correctly explains Player-2’s observed left turn as the result of a modest penalty on proximity, which becomes important only later in the trajectory when the players are close to one another. Cost estimation is shown at the bottom of Figure 6. The KKT residual baseline is incapable of such attributions. More precisely, it can only consider the KKT residuals G (⋅; θ) of equation (21) for time steps $t \in [\tilde{T}]$ . Hence, the baseline must presume that the game terminates at $\tilde{T}$ rather than at some time in the future. Thus, it cannot anticipate the immediate future consequences of particular cost models. In Figure 6, the baseline can only explain the players’ early observed collision avoidance maneuver with an extremely large penalty on proximity to their opponents. As a result, it predicts that the players will quickly drive away from one another. Unlike our method, the baseline’s prediction rapidly diverges from the ground truth.

Figure 6.

Qualitative prediction performance of our method and the baseline for the 2-player collision-avoidance example when only the first 10 out of 25 time steps are observed.

Beyond inference and prediction accuracy, a key factor for online operation is the computational complexity. To investigate this point, Figure 7 shows the computation time of both methods for the same dataset underpinning Figure 5. These timing results were obtained on an AMD Ryzen 9 5900HX laptop CPU. Overall, we observe that the KKT residual baseline has a lower runtime than our approach. The reduced runtime can be attributed to the fact that, by fixing the states and inputs a priori, the KKT residual formulation yields a simpler convex optimization problem in equation (21). Nonetheless, our method’s runtime still remains moderate and scales gracefully with the observation horizon. We note that our current implementation is not optimized for speed. In practical applications in the context of receding-horizon applications—a topic that we shall discuss in Section 7.3.2-the runtime may be further reduced via improved warm-starting and memory sharing across planner invocations.

Figure 7.

Runtime of our method and the baseline for varying numbers of observations of the 2-player collision-avoidance example at a fixed noise level of σ = 0.05.

7.3. Scaling to larger games

While our approach is more easily analyzed in the small, two-player collision-avoidance game of Section 7.2, it readily extends to larger multi-agent interactions. In order to demonstrate scalability of the approach, we therefore replicate the offline learning analysis of Section 7.2.1 in a larger 5-player highway driving scenario depicted in Figure 1. Finally, we demonstrate a proof of concept for online, receding horizon learning in this scaled setting following the setup of Section 5.2.

In the highway scenario discussed through the remainder of this section, each player wishes to make forward progress in a particular lane at an unknown nominal speed, rather than reach a desired position as above. Therefore, ground-truth objectives use a quadratic penalty on deviation from a desired state that encodes each player’s target lane and preferred travel speed rather than a specific goal location. Despite these differences, this class of objectives is still captured by the cost structure introduced in equation (23e).

7.3.1. Offline learning

First, we study the performance of our method and the KKT residual baseline in the setting of offline learning without trajectory prediction. Figure 8 displays these results, using the same metrics as in Section 7.2.1 to measure performance in parameter space-Figure 8(a)-and position space-Figure 8(b). As before, our method demonstrably outperforms the baseline in both fully and partially observed settings. Furthermore, whereas our method performs comparably according to both metrics in the full and partial observation settings, the baseline performance differs between the two metrics. That is, while the performance of the baseline measured in parameter space is not significantly affected by less informative observations, the effect is significant in trajectory space. This inconsistency can be attributed to the fact that certain objective parameters have stronger influence on the resulting game trajectory than others. Since our method’s objective is observation fidelity, here measured by the measurement likelihood of equation (subsection 13a), it directly accounts for these varying sensitivities. The baseline, however, greedily optimizes the KKT residual of equation (21), irrespective of the resulting equilibrium trajectory.

Figure 8.

Estimation performance of our method and the baseline for the 5-player highway overtaking example, with noisy full and partial state observations. (a) Error measured directly in parameter space using equation (24). (b) Error measured in position space using equation (25). Triangular data markers in (b) highlight objective estimates which lead to ill-conditioned games. Solid lines and ribbons indicate the median and IQR of the error for each case.

7.3.2. Online learning and receding horizon prediction

Finally, we demonstrate the application of our method for simultaneous online learning and receding-horizon prediction in the 5-player highway navigation scenario depicted in Figure 1.

Here, the information available to the estimator evolves over time and the problem only admits access to past observations of the game state for cost learning. Following the proposed procedure of Section 5.2, here, we limit the computational complexity of the estimation problem by considering only a fixed-lag buffer of observations over the last 5s and predict all player’s behavior over the next 10s. The qualitative performance of our method under noise-corrupted partial state observation is shown in Figure 9. As can be seen, from only a few seconds of data, our method learns player objectives that accurately predict the evolution of the game over a receding prediction horizon. Note that, by design, objective learning and behavior prediction are achieved simultaneously by solving a single joint optimization problem as in equation (13a). This ability to couple online learning and prediction makes it particularly suitable for online applications.

Figure 9

. Demonstration of our method in an online application of simultaneous objective learning and trajectory prediction for the 5-player highway navigation scenario. At each time step, objective learning is performed on a fixed-lag buffer of 5s of observation data which is coupled with trajectory prediction 10s into the future.

8. Conclusion

In this paper, we have introduced a novel approach to learn the parameters of players’ objectives in dynamic, noncooperative interactions, given only noisy, partial observations. This inverse dynamic game arises in a wide variety of multi-robot and human–robot interactions and generalizes well-studied problems such as inverse optimal control, inverse reinforcement learning, and learning from demonstrations. Contrary to prior work, our method learns players’ cost parameters while simultaneously recovering the forward game trajectory consistent with those parameters, with overall performance measured according to observation fidelity. We have shown how this formulation naturally extends to both offline learning and prediction problems, as well as online, receding horizon learning.

We have conducted extensive numerical simulations to characterize the performance of our method and compare it to a state-of-the-art baseline method (Rothfuß et al., 2017; Awasthi and Lamperski 2020). These simulations clearly demonstrate our method’s improved robustness to both observation noise and partial observations. Indeed, existing methods presume noiseless, full-state observations and thus require a priori estimation of states and inputs. Our method recovers objective parameters, reconstructs past game trajectories, and predicts future trajectories far more accurately than the baseline. Beyond that, our method’s structure allows to perform all of these tasks jointly as the solution of a single optimization problem. This feature renders our method suitable for online learning and prediction in a receding horizon fashion.

In light of these encouraging results, there are several directions for future research. Most immediately, our method lends itself naturally to deployment onboard physical robotic systems such as the autonomous vehicles considered in the examples of Section 7. In particular, the online, receding horizon learning and prediction procedure of Section 5.2 may be run onboard an autonomous car. Here, the “ego” agent would seek to learn other vehicles’ objective parameters while simultaneously using the receding horizon game solution to respond to predicted opponent strategies.

Another exciting, more theoretical direction consists of extending our formulation to more complex equilibrium concepts than OLNE. For example, recent solution methods for forward games in state feedback Nash equilibria (Fridovich-Keil et al., 2020; Laine et al., 2021; Di and Lamperski 2021) might be adapted to solve inverse games along the lines of equation (12a).

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Lasse Peters

Notes

References

Albrecht

Ramirez-Amaro

Ruiz-Ugalde

, et al. (2011) Imitating human reaching motions using physically inspired optimization principles. In: Proceedings of the IEEE International Confirence on Humanoid Robots. Bled, Slovenia: IEEE.

Andersson

JAE

Gillis

Horn

, et al. (2019) CasADi: a software framework for nonlinear optimization and optimal control. Mathematical Programming Computation 11(1): 1–36.

Awasthi

(2019) Forward and Inverse Methods in Optimal Control and Dynamic Game Theory. Minneapolis, MN, USA: Master’s Thesis, University of Minnesota.

Awasthi

Lamperski

(2020) Inverse differential games with mixed inequality constraints. In: Proceedings of the IEEE American Control Conference (ACC). Philadelphia, PA, USA: IEEE.

Başar

Olsder

(1999) Dynamic Noncooperative Game Theory. Philadelphia, PA, USA; 2nd edition: Society for Industrial and Applied Mathematics (SIAM), volume 23.

Bezanson

Edelman

Karpinski

, et al. (2017) Julia: a fresh approach to numerical computing. SIAM Review (SIREV) 59(1): 65–98.

Daskalakis

Goldberg

Papadimitriou

(2009) The complexity of computing a Nash equilibrium. SIAM Journal on Computing 39(1): 195–259.

Lamperski

(2019) Newton’s method and differential dynamic programming for unconstrained nonlinear dynamic games. In: Proceedings of the Conference on Decision Making and Control (CDC). Nice, France: IEEE.

Lamperski

(2021) Newton’s method, Bellman recursion and differential dynamic programming for unconstrained nonlinear dynamic games. Dynamic Games and Applications 12: 394–442.

10.

Dirkse

Ferris

(1995) The path solver: a nommonotone stabilization scheme for mixed complementarity problems. Optimization Methods and Software 5(2): 123–156.

11.

Dunning

Huchette

Lubin

(2017) JuMP: a modeling language for mathematical optimization. SIAM Review (SIREV) 59(2): 295–320.

12.

Englert

Vien

Toussaint

(2017) Inverse KKT cost functions of manipulation tasks from demonstrations. The International Journal of Robotics Research 36(13-14): 1474–1488. DOI: 10.1177/0278364917745980

13.

Ferris

Dirkse

Meeraus

(2005) Mathematical programs with equilibrium constraints: automatic reformulation and solution via constrained optimization. Frontiers in Applied General Equilibrium Modeling. 67–93: Cambridge, England: Cambridge University Press.

14.

Fridovich-Keil

Ratner

Peters

, et al. (2020) Efficient iterative linear-quadratic approximations for nonlinear multi-player general-sum differential games. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). Paris, France: IEEE.

15.

Gallager

(2013) Stochastic Processes: Theory for Applications. Cambridge, England: Cambridge University Press.

16.

Gill

Murray

Saunders

(2005) SNOPT: an SQP algorithm for large-scale constrained optimization. SIAM Review (SIREV) 47: 99–131.

17.

Inga

Bischoff

Köpf

, et al. (2019) Inverse Dynamic Games Based on Maximum Entropy Inverse Reinforcement Learning. arXiv preprint arXiv:1911.07503.

18.

Isaacs

(1954-1955) Differential Games I-Iv. Santa Monica, CA, USA: Rand Corp Santa Monica Ca Santa Monica. Technical report.

19.

Jin

Kulić

Mou

, et al. (2021) Inverse optimal control from incomplete trajectory observations. The International Journal of Robotics Research 40(6–7): 848–865.

20.

Kalman

(1964) When Is a Linear Control System Optimal? Journal of Basic Engineering 86(1): 51–60. DOI: 10.1115/1.3653115

21.

Keshavarz

Wang

Boyd

(2011) Imputing a convex objective function. In: Proceedings of the International Symposium on Intelligent Control (ISIC). Denver, CO, USA: IEEE.

22.

Köpf

Inga

Rothfuß

, et al. (2017) Inverse reinforcement learning for identification in linear-quadratic dynamic games. IFAC-PapersOnLine 50(1): 14902–14908.

23.

Kretzschmar

Spies

Sprunk

, et al. (2016) Socially compliant mobile robot navigation via inverse reinforcement learning. The International Journal of Robotics Research 35(11): 1289–1307.

24.

Laine

Fridovich-Keil

Chiu

, et al. (2023) The computation of approximate generalized feedback nash equilibria. SIAM Journal on Optimization 33(1): 294–318.

25.

Le Cleac’h

Schwager

Manchester

(2020) ALGAMES: a fast solver for constrained dynamic games. In: Proceedings of Robotics: Science and Systems (RSS).

26.

Le Cleac’h

Schwager

Manchester

(2021) LUCIDGames: online unscented inverse dynamic games for adaptive trajectory prediction and planning. IEEE Robotics and Automation Letters (RA-L) 6(3): 5485–5492.

27.

Levine

Koltun

(2012) Continuous inverse optimal control with locally optimal examples. Proceedings of the International Conference on Machine Learning (ICML). Edinburgh, Scotland.

28.

Luo

Pang

Ralph

(1996) Mathematical Programs with Equilibrium Constraints. Cambridge, England: Cambridge University Press.

29.

Menner

Zeilinger

(2020) IFAC-PapersOnLine 53(2): 5266–5272. Amsterdam, Netherlands: Elsevier.

30.

Mombaur

Truong

Laumond

(2010) From human to humanoid locomotion-an inverse optimal control approach. Autonomous Robots 28(3): 369–383.

31.

Monderer

Shapley

(1996) Potential games. Games and Economic Behavior 14(1): 124–143.

32.

Mukadam

Dong

Dellaert

, et al. (2019) Steap: simultaneous trajectory estimation and planning. Autonomous Robots 43(2): 415–434.

33.

Natarajan

Kunapuli

Judah

, et al. (2010) Multi-agent inverse reinforcement learning. In: 2010 Ninth International Conference on Machine Learning and Applications. Washington, DC, USA: IEEE, pp. 395–400.

34.

Russell

(2000) Algorithms for inverse reinforcement learning. In: Proceedings of the International Conference on Machine Learning (ICML). Haifa, Israel: Omnipress 2010.

35.

Nocedal

Wright

(2006) Numerical Optimization. New York, NY, USA: Springer.

36.

Peters

(2020) Accommodating Intention Uncertainty in General-Sum Games for Human-Robot Interaction. Master’s Thesis. Hamburg, Germany: Hamburg University of Technology.

37.

Peters

Fridovich-Keil

Rubies-Royo

, et al. (2021) Inferring objectives in continuous dynamic games from noise-corrupted partial state observations. In: Proceedings of Robotics: Science and Systems (RSS).

38.

Rothfuß

Inga

Köpf

, et al. (2017) Inverse optimal control for identification in non-cooperative differential games. IFAC-PapersOnLine 50(1): 14909–14915.

39.

Šošić

Adrian

, et al. (2017) Inverse Reinforcement Learning in Swarm Systems. Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems. Richland, SC, USA: International Foundation for Autonomous Agents and Multiagent Systems.

40.

Wächter

Biegler

(2006) On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming 106(1): 25–57.

41.

Wang

Zhang

, et al. (2019) Analyses of long non-coding RNA and mRNA profiling using RNA sequencing in calcium oxalate monohydrate-stimulated renal tubular epithelial cells. Urolithiasis 47: 225–234.

42.

Ziebart

Maas

Bagnell

, et al. (2008) Maximum entropy inverse reinforcement learning. In: Proc. of the Conference on Advancements of Artificial Intelligence (AAAI). Chicago, IL, USA.