Abstract
Reward learning enables robots to learn adaptable behaviors from human input. Traditional methods model the reward as a linear function of hand-crafted features, but that requires specifying all the relevant features a priori, which is impossible for real-world tasks. To get around this issue, recent deep Inverse Reinforcement Learning (IRL) methods learn rewards directly from the raw state, but this is challenging because the robot has to simultaneously learn which features are important and how to combine them. Instead, we propose a divide-and-conquer approach: focus human input specifically on learning the features separately, and only then learn how to combine them into a reward. We introduce a novel type of human input for teaching features and an algorithm that utilizes it to learn complex features from the raw state space. The robot can then learn how to combine them into a reward using demonstrations, corrections, or other reward learning frameworks. We demonstrate our method in settings where all features have to be learned from scratch, as well as where some of the features are known. By first focusing human input specifically on the feature(s), our method decreases sample complexity and improves generalization of the learned reward over a deep IRL baseline. We show this in experiments with a physical 7-DoF robot manipulator, and in a user study conducted in a simulated environment.
1. Introduction
Whether it is semi-autonomous driving (Sadigh et al., 2016), recommender systems (Ziebart et al., 2008), or household robots working in close proximity with people (Jain et al., 2015), reward learning can greatly benefit autonomous agents to generate behaviors that adapt to new situations or human preferences. Under this framework, the robot uses the person's input to learn a reward function that describes how they prefer the task to be performed. For instance, in the scenario in Figure 1, the human wants the robot to keep the cup away from the laptop to prevent spilling liquid over it; she may communicate this preference to the robot by providing a demonstration of the task or even by directly intervening during the robot's task execution to correct it. After learning the reward function, the robot can then optimize it to produce behaviors that better resemble what the person wants.
Figure 1. (Left) The person teaches the robot the concept of horizontal distance from the laptop by providing a few feature traces. (Right-Top) In the online reward learning from corrections setting, once the robot detects that its feature set is incomplete, it queries the human for feature traces that teach it the missing feature and adapts the reward to account for it. (Right-Bottom) In the offline reward learning from demonstrations setting, the person teaches the robot each feature separately, one at a time, using feature traces, and only then teaches their combined reward.
In order to correctly interpret and efficiently learn from human input, traditional methods resorted to structuring the reward as a (linear) function of carefully hand-engineered features—important aspects of the task (Abbeel and Ng, 2004; Bajcsy et al., 2017; Jain et al., 2015; Ziebart et al., 2008). Unfortunately, selecting the right space of features is notoriously challenging, even for expert system designers: knowing and specifying a priori an exhaustive set of all the features that might be relevant for the reward is impossible for most real-world tasks. To bypass this feature specification problem, state-of-the-art deep IRL methods (Brown et al., 2020; Finn et al., 2016; Wulfmeier et al., 2016) learn rewards defined directly on the high-dimensional raw state (or observation) space, thereby implicitly constructing features automatically from task demonstrations.
In doing so, however, these approaches sacrifice the sample efficiency and generalizability that a well-specified feature set offers. While using an expressive function approximator to extract features and learn their reward combination at once seems advantageous, many such functions can induce policies that explain the demonstrations. Hence, to disambiguate between all these candidate functions, the robot requires a very large amount of (laborious to collect) data, and this data needs to be diverse enough to identify the true reward. For example, the human in the household robot setting in Figure 1 might want to demonstrate keeping the cup away from the laptop, but from a single demonstration the robot could find many other explanations for the person’s behavior: perhaps they always happened to keep the cup upright or they really like curved trajectories in general.
The underlying problem here is that demonstrations—or task-specific input more broadly—are meant to teach the robot about the reward and not about the features per se, so these function approximators struggle to capture the right feature structure for the reward. In this work, we argue that the robot does not have to learn everything at once; instead, it can divide-and-conquer the reward learning problem and focus on explicitly learning the features separately from learning how to combine them into the reward. In our earlier example, if the robot were taught about the concept of distances to laptops separately, it would be able to quickly tell what the person wants from a single demonstration.
We make the following contributions:
Learning features from a novel type of human input. We present a method for learning complex non-linear features separately from the reward (Section 4). We introduce a new type of human input specifically designed to teach features, which we call feature traces—partial trajectories that describe the monotonic evolution of the value of the feature to be learned. To provide a feature trace, the person guides the robot from states where the feature is highly expressed to states where it is not, in a monotonic fashion. Looking at Figure 1 (Left), the person teaches the robot to avoid the laptop by giving a few feature traces: she starts with the arm above the laptop and moves it away until comfortable with the distance from the object. We present an algorithm that harvests the structure inherent to feature traces and uses it to efficiently learn a feature relevant for the reward: in our example, the horizontal distance from the laptop. In experiments on a 7-DoF robot arm, we find that our method can learn high quality features closely resembling the ground truth (Section 6.1).
Demonstrating our feature learning method in a user study on a simulated 7-DoF robot arm. In a user study with the JACO2 (Kinova) robotic arm, we show that non-expert users can use our approach for learning features (Section 6.2). The participants were able to provide feature traces to teach good features, and found our teaching protocol intuitive. Unfortunately, due to the current pandemic, we conducted the study online in a simulated environment; despite the inevitable degradation in input quality that this entails, the users were still able to teach features that provide an informative inductive bias.
Analyzing generalization and sample complexity benefits of learning features for rewards. We show how our method, which we call Feature Expansive Reward Learning (FERL) because it expands the feature set one by one, can improve reward learning sample complexity and generalization. First, we look at an easier online reward learning setting like the one in Figure 1 (Right-Top) where the robot knows part of the feature set from the get-go, but the person’s preference also depends on other features not in the set (Section 5.2). We show that, by learning the missing feature, the robot obtains a more generalizable reward than if it had trained a deep IRL network directly from the raw state and the known set (Section 7). We then consider the more challenging offline reward learning case in Figure 1 (Right-Bottom) where the person teaches the reward from scratch, one feature at a time (Section 5.1). We find that the robot outperforms the baseline most of the time, with less clear results when the learned features are noisily taught by novice users in simulation (Section 8).
We note that this work is an extension of Bobu et al. (2021), which was published at the International Conference on Human Robot Interaction. We build on this work by formalizing a general framework for feature-based reward learning, and instantiating it in a new offline learning setting where the person can teach each feature one by one before combining them into a reward. Not only is this offline setting more commonly encountered in reward learning, but it also showcases that our approach can be applied more generally to preference learning from different kinds of human input.
Overall, this work provides evidence that taking a divide-and-conquer approach focusing on learning important features separately before learning the reward improves sample complexity in reward learning. Although showcased in manipulation, our method can be used in any robot learning scenario where feature learning is beneficial: in collaborative manufacturing, users might care about the rotation of the object being handed over, while in autonomous driving, passengers may care about how fast the car drives through curves.
2. Related Work
Programming robot behavior through human input is a well-established paradigm. In this paradigm, the robot receives human input and aims to infer a policy or reward function that captures the behavior the human wants the robot to express. In imitation learning, the robot directly learns a policy that imitates demonstrations given by the human (Osa et al., 2018). The policy learns a correlation between situations and actions but not why a specific behavior is desirable. Because of that, imitation learning only works in the training regime, whereas optimizing a learned reward, which captures why a behavior is desirable, can generalize to unseen situations (Abbeel and Ng, 2004).
In the IRL framework, the robot receives demonstrations through teleoperation (Abbeel and Ng, 2004; Javdani et al., 2018) or kinesthetic teaching (Argall et al., 2009) and learns a reward under which these demonstrations are optimal (Abbeel and Ng, 2004; Russell and Norvig, 2002). Recent research goes beyond demonstrations, utilizing other types of human input for reward learning such as corrections (Bajcsy et al., 2017; Jain et al., 2015), comparisons (Christiano et al., 2017) and rankings (Brown et al., 2019), examples of what constitutes a goal (Fu et al., 2018b), or even specified proxy objectives (Hadfield-Menell et al., 2017). Depending on the interaction setting, the human input can be given all-at-once, iteratively, or on specific requests of the robot in an active learning setting (Brown et al., 2018; Lopes et al., 2009; Sadigh et al., 2016).
All these methods require less human input if a parsimonious representation of the world, which summarizes raw state information in the form of relevant features, is available. This is because finite feature sets significantly reduce the space of possible functions which according to statistical learning theory reduces the information complexity of the learning problem (Vapnik, 2013). In the following, we discuss the role of feature representations in reward learning and methods for learning features.
2.1. Feature representations in reward learning
Traditional reward learning methods rely on a set of carefully hand-crafted features that capture aspects of the environment a person may care about. These are selected by the system designer prior to the task (Abbeel and Ng, 2004; Bajcsy et al., 2017; Hadfield-Menell et al., 2017; Jain et al., 2015; Ziebart et al., 2008). If chosen well, this feature set introduces an inductive bias that enables the algorithms to find a good estimate of the human’s preferences with limited input. Unfortunately, selecting such a set in the first place is notoriously challenging, even for experts like system designers. For one, defining a good feature function can be a time-consuming trial-and-error process, especially if the feature is meant to capture a complex aspect of the task (Wulfmeier et al., 2016). Moreover, the chosen feature space may not be expressive enough to represent everything that a person might want (and is giving input about) (Bobu et al., 2020; Haug et al., 2018). When this is the case, the system may misinterpret human guidance, perform unexpected or undesired behavior, and degrade in overall performance (Amodei and Clark, 2016; Haug et al., 2018; Russell and Norvig, 2002).
To tackle these challenges that come with hand-designing a feature set, state-of-the-art deep IRL methods use the raw state space directly and shift the burden of extracting behavior-relevant aspects of the environment onto the function approximator (Finn et al., 2016; Wulfmeier et al., 2016). The objective of IRL methods is to learn a reward which induces behavior that matches the state expectation of the demonstrations. The disadvantage of such approaches is that they require large amounts of highly diverse data to learn a reward function which generalizes across the state space. This is because with expressive function approximators there exists a large set of functions that could explain the human input; that is, many reward functions induce policies that match the demonstrations' state expectation. The higher-dimensional the state, the more human input is needed to disambiguate between those functions and find a reward function that accurately captures human preferences, that is, one that generalizes to states not seen during training rather than merely replicating the demonstrations' state expectations. Thus, when venturing sufficiently far away from the demonstrations, the learned reward in IRL does not generalize, which can lead to unintended behavior (Fu et al., 2018a; Reddy et al., 2020b).
It has been shown that providing linear feature functions as human input can reduce the risk of unintended behavior (Haug et al., 2018). In our work, we argue that generalization with limited input can be achieved without requiring hand-crafted features if the robot explicitly learns features, instead of attempting to learn them implicitly from demonstrations.
2.2. Learning features
In IRL, researchers have explored the direction of inferring a set of relevant features directly from task demonstrations. This can take the form of joint Bayesian inference on both reward and feature parameters (Choi and Kim, 2013) or projecting the raw state space to lower dimensions via PCA on demonstrated trajectories (Vernaza and Bagnell, 2012). There are also methods that add features iteratively to learn a non-linear reward, such as Levine et al. (2010), which constructs logical conjunctions of primitive integer features, and Ratliff et al. (2007), which trains regression trees to distinguish expert from non-expert trajectories in a base feature space. The approach of Levine et al. (2010) performs well in discrete-state MDPs, but it is not suitable for continuous state spaces, does not operate on raw states but rather on a hand-engineered set of integer component features, and requires the reward structure to be expressible as logical conjunctions. Meanwhile, the method of Ratliff et al. (2007) allows for larger state spaces and arbitrary continuous rewards, but still relies on engineering a relevant set of base features and severely underperforms with non-expert human input when compared to more recent IRL techniques (Levine et al., 2011; Wulfmeier et al., 2016). Because of these shortcomings, IRL researchers have recently opted for either completely hand-specifying the features or using deep IRL to extract them automatically from the raw continuous state space with non-expert demonstrations (Finn et al., 2016; Fu et al., 2018a).
Rather than relying on demonstrations for everything, we propose to first learn complex non-linear features leveraging explicit human input about relevant aspects of the task (Section 4). Based on these features, a reward can be inferred with minimal input (Section 5). Our results show that adding structure in such a targeted way can enhance both the generalization of the learned reward and data-efficiency of the method.
3. Problem Formulation
We consider a robot R operating in the presence of a human H from whom it is trying to learn to perform a task, ultimately seeking to enable autonomous execution. In the most general setting, both H and R are able to affect the evolution of the continuous state s through their respective actions a_R and a_H.
We assume that the human has some consistent internal preference ordering between different trajectories τ, which affects the actions a_H that they choose. In principle, these human preferences could be captured by a reward function R*(τ). Unfortunately, the robot does not have access to R*, so to learn how to perform the task it must attempt to infer it. Since R* may encode arbitrary preference orderings, which would render the inference problem intractable, we assume that the robot reasons over a parameterized approximation R_θ induced by parameters θ ∈ Θ. The robot's goal is, thus, to estimate the human's preferred θ from their actions a_H.
Even with this parameterization, the space of possible reward functions is infinite-dimensional. One way to represent it using a finite θ is through a finite family of basis functions Φ_i, also known as features (Ng and Russell, 2000):

R_θ(τ) = θ^⊤ Φ(τ), where Φ(τ) = [Φ_1(τ), …, Φ_k(τ)]^⊤.
This restriction to a finite set of features reduces the space of reward functions the robot has to consider, and with it the amount of human input needed to identify θ; at the same time, it means the reward can only be as expressive as the features it is built from.
We assume the robot has access to a (possibly empty) initial set of features. Our goal is to expand this set with features learned from dedicated human input (Section 4), and then to use the expanded set to learn the reward parameters θ (Section 5).
4. Algorithmic Approach: Feature Learning
We first look at learning individual feature functions. In this paper, we focus on state features (ignoring actions from the feature representation), which we define as arbitrarily complex mappings ϕ(s) from the raw state space to a scalar feature value.
One natural idea for learning this mapping is treating it as a regression problem and asking the human for regression labels (s, ϕ(s)) directly. Unfortunately, to learn anything useful, the robot would need a very large set of labels from the person, which would be too effortful for them to provide. Even worse, humans are notoriously unreliable at quantifying their preferences with any degree of precision (Braziunas and Boutilier, 2008), so their labels might result in arbitrarily noisy regressions. Hence, we need a type of human input that balances being informative and not placing too much burden on the human.
4.1 Feature traces
To teach a non-linear representation of ϕ with little data, we introduce feature traces ξ = s_{0:n}, a novel type of human input defined as a sequence of states that are monotonically decreasing in feature value, that is, ϕ(s_i) ≥ ϕ(s_j), ∀ i < j. This approach relaxes the need for accurate state labeling, while simultaneously providing a combinatorial number of state comparisons (see Section 4.2 for details) from each trace ξ.
When learning a feature, the robot can query the human for a set Ξ of N traces. The person gives a trace ξ by simply moving the system from any start state s_0 to an end state s_n, noisily ensuring monotonicity. Our method, thus, only requires an interface for communicating ordered feature values over states: kinesthetic teaching is useful for household or small industrial robots, while teleoperation and simulation interfaces may be better for larger robotic systems.
To illustrate how a human might offer feature traces in practice, let's turn to Figure 1 (Left). Here, the person is teaching the robot to keep the mug away from the laptop (i.e., not above). The person starts a trace at s_0 by placing the end-effector directly above the object center, then leads the robot away from the laptop to s_n. Our method works best when the person tries to be informative, that is, covers diverse areas of the space: the traces illustrated move radially in all directions and start at different heights. While for some features, like distance from an object, it is easy to be informative, for others, like slowing down near objects, it might be more difficult. We explore how easy it is for users to be informative in our study in Section 6.2, with encouraging findings, and discuss alleviating existing limitations in Section 9.
The power of feature traces lies in their inherent structure. Our algorithm, thus, makes certain assumptions to harvest this structure for learning. First, we assume that the feature values of states along the collected traces ξ ∈ Ξ are monotonically decreasing. Secondly, we assume that by default the human starts all traces in states s_0 with the highest feature value across the domain, then leads the system to states s_n with the lowest feature value. In some situations, this assumption might unnecessarily limit the kinds of feature traces the human can provide. For example, the person might want to start somewhere where the feature is only "half" expressed relative to the feature range of the domain. Because of this, we optionally allow the human to provide relative values v_0, v_n ∈ [0, 1] to communicate that the traces start/end at values that are fractions of the feature range of the domain.
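To make the input type concrete, the sketch below shows one way a feature trace could be represented in code; the class name, fields, and state dimension are illustrative rather than part of our implementation.

    from dataclasses import dataclass
    from typing import List
    import numpy as np

    @dataclass
    class FeatureTrace:
        """A sequence of raw states with (noisily) monotonically decreasing feature value."""
        states: List[np.ndarray]   # s_0, ..., s_n in the raw state space
        v_start: float = 1.0       # optional relative value of s_0 (default: feature maximum)
        v_end: float = 0.0         # optional relative value of s_n (default: feature minimum)

    # Example: a trace collected by guiding the arm from directly above the laptop
    # (feature fully expressed) to a state far away from it (feature not expressed).
    trace = FeatureTrace(states=[np.random.randn(27) for _ in range(20)])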
4.2 Learning a feature function
To allow for arbitrarily complex non-linear features, we approximate a feature by a neural network ϕ_ψ with parameters ψ, mapping the raw state s to a scalar feature value.
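As a concrete illustration, such a feature network can be a small multi-layer perceptron; the PyTorch sketch below uses the 2-layer, 64-unit architecture we report in Section 6.1, while the activation choice is an assumption.

    import torch
    import torch.nn as nn

    class FeatureNet(nn.Module):
        """phi_psi: maps a raw state s to a scalar feature value."""
        def __init__(self, state_dim: int, hidden: int = 64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, s: torch.Tensor) -> torch.Tensor:
            # s has shape (batch, state_dim); the output has shape (batch,)
            return self.net(s).squeeze(-1)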
4.2.1 Monotonicity along feature traces
First, due to the monotonicity assumption along any feature trace ξ ∈ Ξ, every ordered pair of states (s_i, s_j) with i < j yields a comparison in which s_i should have the higher feature value. We train the discriminative function ϕ_ψ as a predictor for whether a state s has a higher feature value than another state s′, which we represent as a softmax-normalized distribution

P(ϕ_ψ(s) > ϕ_ψ(s′)) = exp(ϕ_ψ(s)) / (exp(ϕ_ψ(s)) + exp(ϕ_ψ(s′))).

The ordering loss L_ord(ψ) then minimizes the negative log-likelihood of all ordered pairs extracted from the traces:

L_ord(ψ) = − Σ_{ξ ∈ Ξ} Σ_{i < j} log P(ϕ_ψ(s_i) > ϕ_ψ(s_j)).
Intuitively, this loss spaces out the feature values ϕ_ψ such that they decrease monotonically along every trace; however, this alone does not constrain the traces to have the same start and end values, respectively.
4.2.2 Start/end feature value equivalence
To encourage all traces to start and end in the same high and low feature values, we need an additional loss term encoding that, for any two traces, the pair of start states (and likewise the pair of end states) should have equal feature values.
When training ϕ_ψ on such a pair (s, s′), the predictor should not be able to distinguish which state has a higher feature value, hence P(ϕ_ψ(s) > ϕ_ψ(s′)) = 0.5. As such, we introduce a second loss function L_equiv(ψ) that minimizes the negative log-likelihood of both s having a higher feature value than s′ and s′ having a higher feature value than s:

L_equiv(ψ) = − Σ_{(s, s′)} [ log P(ϕ_ψ(s) > ϕ_ψ(s′)) + log P(ϕ_ψ(s′) > ϕ_ψ(s)) ],

where the sum runs over start/start and end/end state pairs from different traces.
This loss ensures that the states around feature trace starts and ends have similar feature values, respectively.
We now have a total dataset of state pairs gathered from both losses, and we train ϕ_ψ on the combined objective

L(ψ) = L_ord(ψ) + L_equiv(ψ). (8)
Given the loss function in equation (8), we can use any automatic differentiation package to compute its gradients and update ψ via gradient descent. Note that L_equiv is akin to a binary cross-entropy loss with a target of 0.5, whereas L_ord is similar to a binary cross-entropy loss with a target of 1. This form of loss function has been shown to be effective for preference learning (Christiano et al., 2017; Ibarz et al., 2018). The key differences here are that our loss is over feature functions not rewards, and that preferences are state orderings provided via feature traces not trajectory comparisons. Additionally, in practice we normalize the feature functions to make their subsequent reward weights reflect importance relative to one another. We present the full feature learning algorithm using feature traces in Alg. 1.
Algorithm 1: Feature Learning via Feature Traces
  Input: number of traces N, number of training iterations, learning rate.
  Ξ ← ∅
  for i = 1, …, N do
    Query feature trace ξ as in Section 4.1; Ξ ← Ξ ∪ {ξ}.
  Convert Ξ to datasets of state tuples for L_ord and L_equiv.
  Initialize ϕ_ψ randomly.
  for each training iteration do
    Sample a tuple batch from each dataset.
    Estimate L(ψ) using the sampled batches.
    Update parameter ψ via gradient descent on L(ψ).
  return ϕ_ψ
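A minimal PyTorch-style sketch of the training step in Alg. 1 is given below, under stated assumptions: both losses are written as binary cross-entropies over the softmax-normalized comparison probability (target 1 for ordered pairs within a trace, target 0.5 for start/start and end/end pairs across traces), the two terms are weighted equally, and helper names and hyperparameters are illustrative.

    import random
    import torch

    def comparison_prob(phi, s_a, s_b):
        """P(phi(s_a) > phi(s_b)) under the softmax-normalized comparison model."""
        return torch.sigmoid(phi(s_a) - phi(s_b))

    def sample_batch(pairs, n):
        """Randomly sample n (s_a, s_b) pairs and stack them into batched tensors."""
        chosen = random.sample(pairs, min(n, len(pairs)))
        s_a = torch.stack([a for a, _ in chosen])
        s_b = torch.stack([b for _, b in chosen])
        return s_a, s_b

    def feature_loss(phi, ordered_batch, equiv_batch):
        """L(psi) = L_ord + L_equiv from Section 4.2."""
        bce = torch.nn.functional.binary_cross_entropy
        # L_ord: within a trace, the earlier state should have the higher value (target 1).
        s_i, s_j = ordered_batch
        p_ord = comparison_prob(phi, s_i, s_j)
        l_ord = bce(p_ord, torch.ones_like(p_ord))
        # L_equiv: starts (resp. ends) of different traces are indistinguishable (target 0.5).
        s_a, s_b = equiv_batch
        p_eq = comparison_prob(phi, s_a, s_b)
        l_equiv = bce(p_eq, torch.full_like(p_eq, 0.5))
        return l_ord + l_equiv

    def train_feature(phi, ordered_pairs, equiv_pairs, iters=5000, lr=1e-3, batch=32):
        """Gradient-descent loop over randomly sampled tuple batches (Alg. 1).
        phi is a torch.nn.Module mapping batched states to scalar feature values."""
        opt = torch.optim.Adam(phi.parameters(), lr=lr)
        for _ in range(iters):
            loss = feature_loss(phi,
                                sample_batch(ordered_pairs, batch),
                                sample_batch(equiv_pairs, batch))
            opt.zero_grad()
            loss.backward()
            opt.step()
        return phi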
4.2.3 Incorporating relative values
So far, we have assumed that all feature traces have starts and ends of the same high and low feature value, respectively. The optional relative values v_0, v_n can relax this assumption to enable the human to provide richer traces and teach more complex feature functions, for example, where no monotonic path from the highest to lowest feature value exists. By default, v_0 = 1, communicating that the trace starts at the highest feature value of the domain, and v_n = 0, signifying that the trace ends at the lowest feature value. By allowing v_0 and v_n to take values different from their defaults, the person can provide traces that start at lower feature values or end at higher ones. We describe how to include these relative values in the feature training procedure in Online Appendix A.1.
5. Algorithmic Approach: Reward Learning
Now that we have a method for learning relevant features, we discuss how the robot can include this capability in reward learning frameworks. For exposition, we chose two reward learning frameworks—learning from demonstrations (offline) and from corrections (online)—but we stress that features learned with our method are applicable to any other reward learning method that admits features (e.g., comparisons, scalar feedback, state of the world, etc.).
5.1 Offline FERL
We first consider the scenario where the human is attempting to teach the robot a reward function from scratch; that is, the robot starts off with an empty feature set and must learn both the relevant features and how they combine into the reward from human input.
In standard learning from demonstrations, deep IRL uses a set of demonstrations to train a reward function directly from the raw state, in an end-to-end fashion. Under our divide-and-conquer framework, we redistribute the human input the robot asks for: first, ask for feature traces ξ, focusing explicitly on learning the F relevant features one by one via Alg. 1, and only then collect a few demonstrations D* to learn how to combine the features into the reward.
Offline FERL
  Input: K iterations, α learning rate.
  Initialize an empty feature set.
  for f = 1, …, F do
    Learn feature ϕ_f using Alg. 1 and add it to the feature set.
  Collect demonstrations D*.
  Initialize θ randomly.
  for k = 1, …, K do
    Generate samples from the current reward R_θ.
    Estimate the gradient using D* and the samples (Section 5.1.2).
    Update parameter θ using the gradient and learning rate α.
5.1.1 Creating the feature set
Since the robot starts off with an empty feature set, the person first teaches it each of the F features they care about, one at a time, by providing feature traces and running Alg. 1; every learned feature ϕ_f is added to the robot's feature set.
After being equipped with a new set of features taught by the human, the robot can undergo standard learning from demonstration procedures to recover the person’s preferences. We now review Maximum Entropy IRL (Ziebart et al., 2008) for completion of the offline reward learning exposition.
5.1.2 Offline reward learning
To teach the robot the desired reward function R_θ, the person collects a set of (approximately) optimal task demonstrations D*, for instance via kinesthetic teaching.
In order to reason about the human's preferences, the robot needs to be equipped with a model P(τ ∣ θ) for how those preferences affect their choice of demonstrations. For example, if the human were assumed to act optimally, the model would place all the probability on the set of trajectories that perfectly optimize the reward R_θ. However, since humans are not perfect, we relax this assumption and model them as being noisily-optimal, choosing trajectories that are approximately aligned with their preferences. We follow the Boltzmann noisily-rational decision model

P(τ ∣ θ) = exp(β R_θ(τ)) / ∫ exp(β R_θ(τ̄)) dτ̄, (10)

where the coefficient β controls how closely the human input is expected to follow the reward R_θ.
In maximum entropy IRL, to recover the θ parameter we maximize the log-likelihood of the observed demonstrations D* under this model, Σ_{τ* ∈ D*} log P(τ* ∣ θ).
Computing the integral over trajectories in the denominator of equation (10) is intractable in real-world problems, so sample-based approaches to maximum entropy IRL estimate it with samples drawn from a proposal distribution q(τ).
The distribution q(τ) is often chosen to be uniform; instead, we follow Finn et al. (2016) and generate samples in the regions of the trajectory space that are good according to the current estimate of the reward function, that is, q(τ) ∝ exp(R_θ(τ)).
We may now find θ by maximizing the resulting sample-based approximation of the log-likelihood.
At this point, a standard deep IRL baseline could use any automatic differentiation package to compute the gradient and update the reward parameters directly from the raw trajectory state. Instead, consistent with prior work on reward learning with feature sets, we represent the reward as a linear combination of the learned features, R_θ(τ) = θ^⊤ Φ(τ), where Φ(τ) collects the feature values accumulated along the trajectory τ.
Note that the linear reward assumption is not necessary for our algorithm to work. While in theory the reward could be modeled as non-linear, our divide-and-conquer approach is motivated by keeping the reward parameter space small while still effectively capturing the person’s preferences.
For the linear case, the gradient with respect to θ becomes the difference between the observed demonstration feature values and the expected feature values dictated by the sampled trajectories (up to importance weights on the samples),

∇_θ ≈ E_{τ* ∈ D*}[Φ(τ*)] − E_{τ ∼ q}[Φ(τ)].
Lastly, we compute an estimate of θ by following this gradient with learning rate α for K iterations, as summarized in the Offline FERL algorithm above.
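For illustration, a sketch of this linear-reward update is shown below; it assumes trajectory features are obtained by summing per-state feature values and, for simplicity, approximates the model's feature expectation by an unweighted average over the sampled trajectories (omitting the importance weights), so names and details are not those of our implementation.

    import numpy as np

    def trajectory_features(traj, feature_fns):
        """Phi(tau): sum each feature over the states of a trajectory."""
        return np.array([sum(f(s) for s in traj) for f in feature_fns])

    def maxent_irl_step(theta, demos, samples, feature_fns, alpha=0.1):
        """One gradient-ascent step on the MaxEnt IRL objective for a linear reward
        R_theta(tau) = theta . Phi(tau). `samples` are trajectories drawn from the
        proposal distribution q described in Section 5.1.2."""
        demo_phi = np.mean([trajectory_features(t, feature_fns) for t in demos], axis=0)
        sample_phi = np.mean([trajectory_features(t, feature_fns) for t in samples], axis=0)
        grad = demo_phi - sample_phi   # observed minus expected feature values
        return theta + alpha * grad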
5.2 Online FERL
In Section 5.1, we saw that our method allows the person to specify a reward by sequentially teaching features and adding them to the robot’s feature set before using demonstrations to combine them. However, in many situations the system designer or even the user teaching the features might not consider all aspects relevant for the task a priori. As such, we now consider an online reward learning version of our previous scenario, where the person provides inputs to the robot during the task execution and its feature space may or may not be able to correctly interpret them.
We assume the robot has access to an initial feature set that it uses for online reward learning; however, the human's input may depend on aspects of the task that this feature set cannot capture. When that happens, the robot should detect that a feature is missing, query the person for feature traces to learn it via Alg. 1, add it to its feature set, and only then update its reward estimate.
Online FERL
  Plan initial trajectory τ by optimizing R_θ.
  if human input a_H is received then
    Estimate the confidence β̂ (Section 5.2.2).
    if β̂ < ϵ then
      Query feature traces and learn feature ϕ_new using Alg. 1; add it to the feature set.
    Get induced trajectory τ_H from equation (17).
    Update parameter θ using τ_H in equation (18).
    Replan trajectory τ by optimizing the new R_θ.
5.2.1 Online reward update
Whether it needs to learn a new feature ϕ_ψ or not, the robot has to then use the human input a_H to update its estimate of the reward parameters θ. Here, any prior work on online reward learning from user input is applicable, but we highlight one example to complete the exposition.
For instance, take the setting where the human's input a_H was an external torque, applied as the robot was tracking a trajectory τ that was optimal under its current reward R_θ. Prior work (Bajcsy et al., 2017) has modeled this as inducing a deformed trajectory τ_H, obtained by propagating the change in configuration to the rest of the trajectory:

τ_H = τ + μ A^{−1} ũ_H, (17)

where μ > 0 scales the magnitude of the deformation, A is a norm-defining matrix dictating the deformation shape, and ũ_H contains the human's displacement at the corrected waypoint (and zeros elsewhere).
If we think of τ_H as the human observation and of τ as the expected behavior according to the current reward function (Bajcsy et al., 2017), we arrive at a natural alteration of the update rule in equation (16):

θ ← θ + α (Φ(τ_H) − Φ(τ)). (18)

Intuitively, the robot updates its estimate of θ in the direction of the feature change induced by the correction: features whose values the person increased gain weight, and features whose values they decreased lose weight.
If instead, the human intervened with a full demonstration, work on online learning from demonstrations (Section 3.2 in Ratliff et al. (2006)) has derived the same update with τ_H now being the human demonstration. In our implementation, we use corrections and follow Bajcsy et al. (2018), which shows that people more easily correct one feature at a time, and only update the θ index corresponding to the feature that changes the most (after feature learning this is the newly learned feature). After the update, the robot replans its trajectory using the new reward.
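A small sketch of this one-feature-at-a-time update is shown below; it assumes the per-trajectory feature values Φ(τ_H) and Φ(τ) have already been computed and follows the same sign convention as the update above, with illustrative names throughout.

    import numpy as np

    def online_correction_update(theta, phi_deformed, phi_planned, alpha=0.1):
        """Update the reward weights after a physical correction.

        phi_deformed: Phi(tau_H), feature values of the human-induced (deformed) trajectory
        phi_planned:  Phi(tau), feature values of the robot's currently planned trajectory
        Only the weight of the feature that changed the most is updated (Bajcsy et al., 2018).
        """
        delta = phi_deformed - phi_planned
        i = int(np.argmax(np.abs(delta)))   # feature the person is most plausibly correcting
        theta = theta.copy()
        theta[i] += alpha * delta[i]
        return theta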
5.2.2 Confidence estimation
The robot can learn a new feature from the person because we assumed it has the capacity to detect that a feature is missing in the first place. We alluded earlier, in Section 5.1, to how this ability might be enabled by manipulating the β parameter in the observation model in equation (10). We now expand on this remark.
In the presented Boltzmann model, β controls how much the robot expects to observe human input consistent with its reward structure, and, thus, its feature space. A high β suggests that the input is consistent with the robot’s feature space, whereas a low β may signal that no reward function captured by the feature space can explain the input. As such, inspired by work in Fridovich-Keil et al. (2019), Fisac et al. (2018), and Bobu et al. (2020), instead of keeping β fixed like in the maximum entropy IRL observation model, we reinterpret it as a confidence in the robot’s features’ ability to explain human data.
When the human input a_H is a correction, following Bobu et al. (2020), the robot estimates its confidence β̂ by comparing the person's correction to the most efficient correction that would have induced the same change under the robot's current feature space. Intuitively, if the person's input is close to the optimal input the robot would expect given its features, the confidence β̂ is high; if the input is much larger or differently directed than necessary for every reward the feature space can express, β̂ is low, signaling that the features cannot explain the person's behavior.
Alternatively, if the human input a_H is a demonstration, as in the classical IRL setting presented in Section 5.1, also following Bobu et al. (2020), we may estimate β̂ by treating β as an additional parameter of the observation model in equation (10) and inferring it jointly with θ from the demonstration.
To detect a missing feature, the robot simply needs a confidence threshold ϵ. If β̂ falls below ϵ, the robot deems its feature space unable to explain the human input and queries the person for feature traces, learning the missing feature via Alg. 1 before updating the reward.
6. Experiments: Learning Features
Before testing FERL in the two reward learning settings of interest, we first analyze our method for learning features in experiments with a robotic manipulator. In Section 6.1, we inspect how well FERL can learn six different features of varying complexity by using real robot data collected from an expert—a person familiar with how the algorithm works. We then conduct an online user study in simulation in Section 6.2 to test whether non-experts—people not familiar with FERL but taught to use it—can teach the robot good features.
6.1 Expert users
We have argued that feature traces are useful in teaching the robot features explicitly. In our first set of experiments, we look at how good the learned features are, and how their quality varies with the amount of feature traces provided.
6.1.1 Experimental design
We conduct our experiments on a 7-DoF JACO robotic arm. We investigate six features in the context of personal robotics:
1. table: distance of the End-Effector (EE) to the table (T), as a z-coordinate difference: EE^z − T^z (the superscript denotes pose coordinate selection);
2. coffee: coffee cup upright orientation, defined by how far the EE is from pointing up: 1 − EE^R ⋅ [0, 0, 1] (the superscript denotes the pose rotation matrix);
3. laptop: 0.3 m xy-plane distance of the EE to a laptop (L), to avoid passing over the laptop: max{0.3 − ‖EE^{xy} − L^{xy}‖_2, 0};
4. test laptop location: same as laptop, but with the test laptop position different from the training ones;
5. proxemics: non-symmetric 0.3 m xy-plane distance between the EE and the human (H), to keep the EE away from them, penalized three times as much when moving in front of them as when moving at their side;
6. between objects: 0.2 m xy-plane distance of the EE to two objects, O1 and O2—the feature penalizes being above either object and, to a lesser extent, passing in between the objects, as defined by a distance to the imaginary line O1O2.
Most features can be taught with the default relative values v_n = 0 and v_0 = 1, but between objects requires some traces with explicit values v_0, v_n. We approximate all features ϕ_ψ by neural networks (2 layers, 64 units each), and train them on a set of traces Ξ using stochastic gradient descent (see Online Appendix C.1 for training details).
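For reference, the simpler ground-truth features above can be computed directly from the end-effector pose; the sketch below assumes the EE position, a representative table height, the laptop position, and the EE axis corresponding to the cup's up direction are available, and the exact frame conventions are assumptions rather than part of our setup.

    import numpy as np

    def table_feature(ee_pos, table_z=0.0):
        """Distance of the EE to the table as a z-coordinate difference: EE^z - T^z."""
        return ee_pos[2] - table_z

    def coffee_feature(ee_up_axis):
        """How far the EE is from pointing up: 1 - (EE cup-up axis) . [0, 0, 1].
        Which EE frame axis corresponds to the cup's up direction depends on the gripper mounting."""
        return 1.0 - float(np.dot(ee_up_axis, np.array([0.0, 0.0, 1.0])))

    def laptop_feature(ee_pos, laptop_pos, radius=0.3):
        """Penalizes xy-plane proximity to the laptop: max{0.3 - ||EE^xy - L^xy||_2, 0}."""
        return max(radius - float(np.linalg.norm(ee_pos[:2] - laptop_pos[:2])), 0.0)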
For each feature, we collected a set Ξ of feature traces from an expert user on the physical robot, from which we subsample the traces used for training.
Our raw state space consists of the 27D xyz positions of all robot joints and objects in the scene, as well as the rotation matrix of the EE. We assume known object positions but they could be obtained from a vision system. It was surprisingly difficult to train on both positions and orientations due to spurious correlations in the raw state space, hence we show results for training only on positions or only on orientations. This speaks to the need for methods that can handle correlated input spaces, which we expand on in Online Appendix B.3.
Manipulated variables
We are interested in seeing trends in how the quality of the learned features changes with more or less data available. Hence, we manipulate the number of traces N the learner gets access to.
Dependent measures
After training a feature ϕ_ψ, we measure its error compared to the ground truth feature ϕ_True that the expert tries to teach, on a test set of 10,000 states sampled from the robot's reachable set. We report the normalized mean squared error, MSE_norm, so that errors are comparable across features.
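As an illustration, one way to compute such a normalized error, under the assumption that both functions are min-max normalized over the test set before comparison (our exact normalization may differ), is:

    import numpy as np

    def mse_norm(phi_learned, phi_true, test_states):
        """Normalized MSE between a learned and a ground-truth feature on a test set."""
        a = np.array([phi_learned(s) for s in test_states])
        b = np.array([phi_true(s) for s in test_states])
        a = (a - a.min()) / (a.max() - a.min() + 1e-8)   # min-max normalize to [0, 1]
        b = (b - b.min()) / (b.max() - b.min() + 1e-8)
        return float(np.mean((a - b) ** 2))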
Hypotheses
H1. With enough data, FERL learns features with low error relative to the ground truth.
H2. The error of the learned features decreases with more feature traces.
H3. The variance of the learned feature quality across random seeds decreases with more feature traces.
6.1.2 Qualitative results
We first inspect the results qualitatively, for N = 10. In Figure 2, we show the learned table and laptop features ϕ_ψ by visualizing the position of the EE for all 10,000 points in our test set. The color of the points encodes the learned feature values ϕ_ψ(s) from low (blue) to high (yellow): table is highest when the EE is farthest from the table, while laptop peaks when the EE is above the laptop. In Figure 3, we illustrate the Ground Truth (GT) feature values ϕ_True and the trained features ϕ_ψ by projecting the test points onto 2D sub-spaces and plotting the average feature value per 2D grid point. For Euclidean features, we used the EE's xy-plane or yz-plane (table), and for coffee, we project the x-axis basis vector of the EE after forward kinematic rotations onto the xz-plane (an arrow pointing up represents the cup upright). White pixels are an artifact of sampling.
Figure 2. Visualization of the experimental setup, learned feature values ϕ_ψ(s), and training feature traces ξ for table (top) and laptop (bottom). We display the feature values ϕ_ψ(s) for states s sampled from the reachable set of the 7-DoF arm, as well as their projections onto the yz and xy planes.
Figure 3. Ground truth ϕ_True (top rows) and learned feature values ϕ_ψ (bottom rows) over 2D projections of the test points.
We observe that ϕ_ψ resembles ϕ_True very well for most features. Our most complex feature, between objects, does not recreate the GT as well, although it does learn the general shape. However, we note in Online Appendix D.1 that with a smaller raw input space it is able to learn the fine-grained GT structure. This implies that spurious correlations in the input space are a problem; hence, for complex features, more data or active learning methods that collect informative traces are required.
6.1.3 Quantitative analysis
Figure 4 displays the means and standard errors across 10 seeds for each feature with an increasing amount of data N. To test H1, we look at the errors with the maximum amount of data. Indeed, FERL achieves small errors, put in context by the comparison with the error a random feature incurs (gray line). This is confirmed by an ANOVA with random versus FERL as a factor and the feature ID as a covariate, finding a significant main effect (F(1, 113) = 372.0123, p < 0.0001). In line with H2, most features have decreasing error with increasing data. Indeed, an ANOVA with N as a factor and feature ID as a covariate found a significant main effect (F(8, 526) = 21.1407, p < 0.0001). Lastly, supporting H3, we see that the standard error on the mean decreases when FERL gets more data. To test this, we ran an ANOVA with the standard error as the dependent measure and N as a factor, finding a significant main effect (F(8, 45) = 3.098, p = 0.0072).
Figure 4. For each feature, we show the MSE_norm mean and standard error across 10 random seeds with an increasing number of traces (orange), compared to a random feature (gray).
6.1.4 Summary
The qualitative and quantitative results support our hypotheses and suggest that our method requires few traces to reliably learn features ϕ_ψ that generalize well to states not seen during training. We also find that the more complex a feature, the more traces are needed for good performance: while table and laptop perform well with just N = 4, some other features, like between objects, require more traces. Active learning approaches that disentangle the learned function by querying traces at parts of the state space that are confusing could further reduce the amount of data required.
6.2 User study
In the previous section, we have demonstrated that experts can teach the robot good feature functions. We now design a user study to test how well non-expert users can teach features with FERL and how easily they can use the FERL protocol.
6.2.1 Experimental design
Due to COVID, we replicated our setup from Figure 1 (Left) in a pybullet simulator (Coumans and Bai, 2016–2019) in which users can move a 7-DoF JACO robotic arm using their cursor. Through the interface in Figure 5, the users can drag the robot to provide feature traces, and use the buttons for recording, saving, and discarding them.
Figure 5. The pybullet simulator interface used in the user study, replicating our lab setup with the JACO robot.
The user study is split into two phases: familiarization and teaching. In the first phase, we introduce the user to the task context, the simulation interface, and how to provide feature traces through an instruction video and a manual. Next, we describe and 3D visualize the familiarization task feature human (0.3 m xy-plane distance of the EE to the human position), after which we ask them to provide 10 feature traces to teach it. Lastly, we give the users a chance to see what they did well and learn from their mistakes by showing them a 3D visualization of their traces and the learned feature. See Online Appendix B.4 for more details on the user training.
In the second phase, we ask users to teach the robot three features from Section 6.1: table, laptop, and proxemics. This time, we don’t show the learned features until after all three tasks are complete.
Manipulated variables
We manipulate the input type with three levels: Random, Expert, and User. For Random, we randomly initialize 12 feature functions per task; for Expert, the authors collected 20 traces per task in the simulator, then randomly subsampled 12 sets of 10 that led to features with MSEs similar to those obtained in the physical setup; for User, each person provided 10 traces per task.
Dependent measures
Our objective metric is the learned feature's MSE compared to the GT feature on a test set of states, as in Section 6.1. Subjectively, participants answered a Likert scale survey (Figure 6), administered both before and after they saw visualizations of the features they taught.
Figure 6. Questions, answer distributions, and p-values (2-sided t-test against the middle score 4) from the user study. The p-values in orange are significant after adjusting for multiple comparisons using the Bonferroni correction.
Participants
We recruited 12 users (11 males, aged 18–30) from the campus community to interact with our simulated JACO robot and provide feature traces for the three tasks. All users had technical background, so we caution that our results will speak to FERL’s usability with this population rather than the general population.
Hypotheses
H4. Non-expert users can teach features with FERL whose quality is closer to that of expert-taught features than to random ones.
H5. Users find the FERL teaching protocol intuitive and feel they can teach features successfully.
6.2.2 Analysis
Objective
Figure 7 summarizes the results by showing how the MSE varies with each of our input types, for each task feature. Right off the bat, we notice that in line with H4, the MSEs for the user features are much closer to the expert level than to random. We ran an ANOVA with input type as a factor and task as a covariate, finding a significant main effect (F(2, 103) = 132.7505, p < 0.0001). We then ran a Tukey HSD post-hoc, which showed that the MSE for Random input was significantly higher than both Expert (p < 0.0001) and User (p < 0.0001), and found no significant difference between Expert and User (p = 0.0964). While this does not mean that user features are as good as expert features (we expect some degradation in performance when going to non-experts), it shows that they are substantially closer to them than to random; that is, the user features maintain a lot of signal despite this degradation.
Figure 7. MSE to GT for the three features learned from expert (orange) and user (yellow) traces provided in simulation, with a randomly initialized feature (gray) for comparison.
Subjective
In Figure 6, we see the Likert survey scores before and after the users saw the teaching results. For every question, we report 2-sided t-tests against the neutral score 4. These results support H5, although the evidence for finding the teaching protocol intuitive is weaker, and participants might have a bias to be positive given they are in a study. In fact, several participants mentioned in their additional remarks that they had a good idea of what traces to give, and the only frustrating part was the GUI interface, which was necessary because in-person studies are not possible during the COVID pandemic (“I had a pretty good mental model for what I wanted to show, but found it frustrating doing that with a mouse,” “I know what it wants, but the interface makes it difficult to give those exact traces”); performing the experiment as it was originally intended with the real robot arm would have potentially alleviated this issue (“With manual control of the arm it would have been a lot easier.”).
Looking before and after the visualization, we find a trend: seeing the result seems to reinforce people’s belief that they were effective teachers (Q3, Q4), also noticed in their comments (“Surprising how well it learned!”, “Surprised that with limited coverage it generalized pretty well.”). Also, in support of H4, we see significant evidence that users thought the robot learned the correct feature (Q9-Q11).
Lastly, we wanted to know if there was a correlation between subjective scores and objective performance. We isolated the “good teachers”—the participants who scored better than average on all three feature tasks in the objective metric, and compared their subjective scores to the rest of the teachers. By running a factorial likelihood-ratio test for each question, we found a significant main effect for good teachers: they are more certain that the robot has learned a correct feature even before seeing the results (Q3, p = 0.001), are more inclined to think they were successful (Q4, p = 0.0203), and find it significantly easier to teach features (Q7, p = 0.0202).
6.2.3 Summary
Both the objective and subjective results provide evidence that non-expert users can teach the robot reasonable features using our FERL protocol. In addition, participants found our teaching protocol intuitive, suggesting that feature traces can be useful for teaching features outside of the system designer’s setting. In the following sections, we explore whether both expert and non-expert features can be used to improve reward learning generalization.
7. Experiments: Online FERL
Now that we have tested our method for learning features with both experts and non-experts, we analyze how the learned features affect reward learning. In this section, we start with the easier setting where the robot already has a feature set that it is using for online reward learning, but the human might provide input about a missing feature.
7.1 Expert users
When the robot receives human input that cannot be explained by its current set of features, we hypothesize that adding FERL features to it can induce structure in the reward learning procedure that helps better recover the person’s preferences. We first test this hypothesis with expert user data.
7.1.1 Experimental design
We run experiments on the same JACO robot arm in three settings in which two features are known (ϕ_coffee, ϕ_known) and one is unknown. In all tasks, the true reward trades off the known features and the unknown one, so the robot cannot capture the person's preference without learning the feature missing from its set; following Figure 8, we refer to the tasks as Laptop Missing, Table Missing, and Proxemics Missing after the respective unknown feature.
Manipulated variables
We manipulate the learning method with 2 levels: FERL and an adapted Maximum Entropy Inverse Reinforcement Learning (ME-IRL) baseline (Finn et al., 2016; Wulfmeier et al., 2016) learning a deep reward function from demonstrations. We model the ME-IRL reward function r_ω as a neural network with 2 layers, 128 units each. For a fair comparison, we gave r_ω access to the known features: once the 27D Euclidean input is mapped to a scalar neuron, a final layer combines it with the known feature vector.
Also for a fair comparison, we took great care to collect a set of demonstrations for ME-IRL designed to be as informative as possible: we chose diverse start and goal configurations for the demonstrations, and focused some of them on the unknown feature and some on learning a combination between features (see Online Appendix B.2). Moreover, FERL and ME-IRL rely on different input types: FERL on feature traces ξ and corrective pushes a_H, and ME-IRL on a set of near-optimal demonstrations D*.
Following equation (13), the gradient of the ME-IRL objective with respect to the reward parameters ω can be estimated by the difference between the reward gradients of the demonstrations and those of the sampled trajectories,

∇_ω ≈ E_{τ* ∈ D*}[∇_ω r_ω(τ*)] − E_{τ ∼ q}[∇_ω r_ω(τ)].
Dependent measures
We compare the two reward learning methods across three metrics commonly used in the IRL literature (Choi and Kim, 2011): (1) Reward Accuracy: how close the learned reward is to the GT; (2) Behavior Accuracy: how well the behaviors induced by the learned rewards compare to the GT optimal behavior, measured by evaluating the induced trajectories on the GT reward; and (3) Test Probability: how likely trajectories generated by the GT reward are under the learned reward models.
For Reward Accuracy, note that any affine transformation of a reward function would result in the same induced behaviors, so simply measuring the MSE between the learner's reward and the GT reward may not be informative. As such, we make the reward functions given by different methods comparable by computing each learner's reward values on a common test set of states and normalizing them to the same scale before computing the MSE to the GT reward.
Hypotheses
H6. Online FERL learns rewards that are closer to the GT and generalize better across the state space than ME-IRL.
H7. ME-IRL is more sensitive than FERL to the particular demonstrations provided, resulting in higher variance across runs.
7.1.2 Qualitative comparison
In Figure 8, we show the learned FERL and ME-IRL rewards as well as the GT for all three tasks evaluated at the test points. As we can see, by first learning the missing feature and then the reward on the extended feature vector, FERL is able to learn a fine-grained reward structure closely resembling the GT. Meanwhile, ME-IRL learns some structure capturing where the laptop or the human is, but not enough to result in a good trade-off between the active features.
Figure 8. Visual comparison of the ground truth, online FERL, and ME-IRL rewards for Laptop Missing (top), Table Missing (middle), and Proxemics Missing (bottom).
7.1.3 Quantitative analysis
To compare Reward Accuracy, we show in Figure 9 the MSE mean and standard error across 10 seeds, with increasing training data. We visualize results from all three tasks, with FERL in orange and ME-IRL in gray. FERL is closer to GT than ME-IRL no matter the amount of data, supporting H6. To test this, we ran an ANOVA with learning method as the factor, and with the task and data amount as covariates, and found a significant main effect (F(1, 595) = 335.5253, p < 0.0001).
Figure 9. MSE of online FERL and ME-IRL to GT reward across all three tasks. FERL learns rewards that better generalize to the state space.
Additionally, the consistently decreasing MSE in Figure 9 for FERL suggests that our method gets better with more data; in contrast, no such trend exists for ME-IRL. Supporting H7, the high standard error that ME-IRL displays implies that it is highly sensitive to the demonstrations provided and that the learned reward likely overfits to the expert demonstrations. We ran an ANOVA with standard error as the dependent measure, focusing on the N = 10 trials which provide the maximum data to each method, with the learning method as the factor and the task as a covariate. We found that the learning method has a significant effect on the standard error (F(1, 4) = 12.1027, p = 0.0254). With even more data, this shortcoming of IRL might disappear; however, this would pose an additional burden on the human, which our method successfully alleviates.
We also looked at Behavior Accuracy for the two methods. Figure 10 illustrates the reward ratios to GT for all three tasks. The GT ratio is 1 by default, and the closer to 1 the ratios are, the better the performance, because all rewards are negative. The figure further supports H6, showing that FERL rewards produce trajectories that are preferred under the GT reward over ME-IRL reward trajectories. An ANOVA using the task as a covariate reveals a significant main effect for the learning method (F(1, 596) = 14.9816, p = 0.0001).
Figure 10. Induced trajectories' reward ratio for the two methods compared to GT. ME-IRL struggles to generalize across all tasks.
Lastly, we compare how likely a test set of trajectories given by optimizing the GT reward is under the two models. A more accurate reward model should give higher probabilities to the demonstrated trajectories under the Boltzmann noisily-rational assumption in equation (10). Figure 11 illustrates that FERL does indeed assign higher likelihood to the test trajectories than ME-IRL, which is consistent with H6.
Figure 11. Probability assigned by the two methods to a set of optimal trajectories under the Boltzmann assumption. The trajectories are more likely under FERL than ME-IRL, suggesting FERL is the more accurate reward model.
7.1.4 Summary
The rewards learned with FERL not only qualitatively capture more structure than the ME-IRL ones, they also quantitatively get closer to the GT. Using FERL features—at least when the robot is missing one feature—seems to induce useful structure in the reward learning process that guides the robot to better capture the person's preferences. These results hold when the person teaching the missing feature is an expert user; we next look at the case where a novice interacts with the robot instead.
7.2 Non-expert users
The objective results in Section 6.2 show that while users’ performance degrades from expert performance, they are still able to teach features with a lot of signal. We now want to test how important the user-expert feature quality gap is when it comes to using these features for online reward learning.
7.2.1 Experimental design
For this experiment, we had a similar setup to the one in Section 7.1, only that we performed reward learning with FERL using the user-taught simulation features from the user study. We wanted to see if the divide-and-conquer approach employed by FERL results in better rewards than ME-IRL even when using noisy simulation data.
Manipulated variables
We manipulate the learning method, FERL or ME-IRL, just like in Section 7.1. Because corrections and demonstrations would be very difficult in simulation, we use for ME-IRL the expert data from the physical robot. For FERL, we use the user data from the simulation, and the expert corrections that teach the robot how to combine the learned feature with the known ones. Note that this gives ME-IRL an advantage, since its data is both generated by an expert, and on the physical robot. Nonetheless, we hypothesize that the advantage of the divide-and-conquer approach is stronger.
Dependent measures
We use the same objective metric as Reward Accuracy in the expert comparison in Section 7.1: the learned reward's MSE to the GT reward on the test set of states.
Hypothesis
H8. Even with features learned from noisy, non-expert traces collected in simulation, online FERL recovers rewards closer to the GT than ME-IRL trained on expert physical demonstrations.
7.2.2 Analysis
Figure 12 illustrates our findings for the reward comparison. In the figure, we added FERL with expert-taught simulation features for reference: we randomly subsampled sets of 10 from 20 expert traces collected by the authors, and trained 12 expert features for each of our three task features. We see that, even though ME-IRL was given the advantage of using physical expert demonstrations, it still severely underperforms when compared to FERL with both expert and user features learned in simulation. This finding is crucial because it underlines the power of our divide-and-conquer approach in online reward learning: even when given imperfect features, the learned reward is superior to trying to learn everything implicitly from demonstrations.
Figure 12. MSE to GT reward for the three tasks, comparing ME-IRL from expert physical demonstrations (gray) to online FERL from expert (orange) and non-expert (yellow) features learned in simulation and combined via corrections.
We verified the significance of this result with an ANOVA with the learning method as a factor and the task as a covariate. We found a significant main effect for the learning method (F(1, 62) = 41.2477, p < 0.0001), supporting our H8.
7.2.3 Summary
Despite the degradation in feature quality we see in user features when compared to expert ones, we find that the structure they do maintain is advantageous in online reward learning. This suggests that the online instantiation of FERL can be used even by non-experts to better teach the robot their preferences.
8. Experiments: Offline FERL
In the online reward learning setting, the robot was already equipped with a starting feature set, and we tested how learning missing features affects the reward. We now look at the scenario where the robot’s reward must be programmed entirely from scratch, teaching each feature separately before combining them into a reward via demonstrations.
8.1 Expert users
We have argued that learned features can induce useful structure that speeds up reward learning. We test how the reward is affected when the entire structure is built up from the expert features taught from real robot data in Section 6.1.
8.1.1 Experimental design
We run experiments on the robot arm in three settings of increasing complexity: in the first, the true reward depends on a single feature, and every subsequent task adds another feature to the reward. In task 1, the true reward depends on only ϕ_table. In task 2, we add the ϕ_laptop feature, and in task 3 the ϕ_proxemics feature. In both tasks 2 and 3, the reward equally combines the two and three features, respectively. Task 1 should be easy enough for even an end-to-end IRL method to solve, especially since it relies on the simplest feature that we have considered. Meanwhile, tasks 2 and 3 require learning rewards that are more structurally complex. We name the three tasks One Feature, Two Features, and Three Features, respectively.
Manipulated variables
We manipulated the learning method with 2 levels: FERL and ME-IRL. While in Section 7 ME-IRL had access to the known features, this time the reward network is a function mapping directly from the 27D Euclidean input space only. For practical considerations and implementation details of the offline version of FERL we used, see Online Appendix C.3.
For a fair comparison, we once again took great care in how we collected the demonstrations ME-IRL learns from. Just like before, we chose diverse start and goal configurations, and focused some of the demonstrations on each individual feature and, where applicable, on each combination of features (see Online Appendix B.2). Importantly, while ME-IRL uses a set of near-optimal demonstrations alone, FERL uses feature traces for each feature plus only a handful of demonstrations to combine them; we detail the exact data budgets below.
Dependent measures
We use the same objective metrics as in Section 7.1: Reward Accuracy, Behavior Accuracy, and Test Probability. For Reward Accuracy, we vary the number N of traces/demonstrations each learner gets, but skip N = 1 because FERL would have an unfair advantage in the amount of data given. We give ME-IRL up to 10, 20, and 30 demonstrations for the three tasks, respectively, while FERL gets up to 10 traces per feature plus 1, 2, and 3 demonstrations for the respective tasks (so up to 10 traces and 1 demonstration, 20 traces and 2 demonstrations, and 30 traces and 3 demonstrations overall). For Behavior Accuracy and Test Probability, we train FERL with 10 traces per feature and 1, 2, or 3 demonstrations, and ME-IRL with 10, 20, and 30 demonstrations, respectively. Just like in Section 7.1, for Behavior Accuracy we produce optimal trajectories for 100 randomly selected start-goal pairs under the learned rewards and evaluate them under the GT reward; for Test Probability, we generate 100 optimal trajectories using the GT reward, then evaluate their likelihood under the learned models.
Hypotheses
H8. Offline FERL learns rewards that induce behaviors closer to the GT-optimal ones than ME-IRL when the true reward combines multiple features.
H9. Offline FERL learns rewards that generalize better to the state space than ME-IRL.
8.1.2 Qualitative comparison
In Figure 13, we show the learned FERL and ME-IRL rewards as well as the GT for all three tasks evaluated at the test points. The figure illustrates that by first learning each feature separately and only then the reward that combines them, FERL is able to learn a fine-grained reward structure closely resembling the GT. For the easiest task, One Feature, ME-IRL does recover the GT appearance, but this is unsurprising since the table feature is very simple. For the other two, more complex tasks, just like in the online case, ME-IRL learns some structure capturing where the laptop or the human is, but not enough to result in a good trade-off between the features.
Figure 13. Visual comparison of the ground truth, offline FERL, and ME-IRL rewards for One Feature (top), Two Features (middle), and Three Features (bottom).
8.1.3 Quantitative analysis
To compare Reward Accuracy, we show in Figure 14 the mean and standard error of the MSE across 10 seeds, with increasing training data. We visualize results from all three tasks side by side, with FERL in orange and ME-IRL in gray. For One Feature, as expected, ME-IRL does eventually learn a good reward with enough data. However, for the other, more complex tasks that combine multiple features, ME-IRL underperforms compared to our method. Overall, across the tasks, FERL is closer to the GT than ME-IRL no matter the amount of data, supporting H9. To test this, we ran an ANOVA with the learning method as the factor and the task and data amount as covariates, and found a significant main effect (F(1, 535) = 148.8431, p < 0.0001).
Figure 14. MSE of offline FERL and ME-IRL to the GT reward for One Feature (left), Two Features (middle), and Three Features (right). In most data regimes, FERL learns rewards that better generalize to the state space.
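The statistical test above can be reproduced schematically with an ordinary-least-squares ANOVA; the sketch below assumes the per-seed results are collected in a long-format table with hypothetical file and column names.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical long-format results table: one row per (seed, task, data amount, method),
# with the MSE to the GT reward in column "mse".
df = pd.read_csv("offline_reward_mse.csv")

# Learning method as the factor of interest; task and amount of data as covariates.
model = ols("mse ~ C(method) + C(task) + n_data", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # F statistic and p-value for the method factor
```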
For Behavior Accuracy, Figure 15 illustrates the reward ratios to the GT for all three tasks. When the reward consists of a single very simple feature, ME-IRL performs just as well as our method. However, when the reward combines multiple features in a more complex structure, ME-IRL does not produce as good trajectories under the GT reward as FERL, supporting H8. We ran an ANOVA with the learning method as a factor and the task as a covariate and did not find a significant main effect, probably due to the One Feature results. To verify this, we re-ran the ANOVA using only the data from the more complex Two Features and Three Features tasks, and did, in fact, find a significant main effect (F(1, 397) = 5.7489, p = 0.0097). Results with the Test Probability metric paint a similar picture: Figure 16 shows that for the easy One Feature case both methods perform comparably, but when the reward is more complex (Two Features and Three Features), FERL outperforms ME-IRL and assigns higher probability to the test trajectories.
Figure 15. Induced trajectories' reward ratio for the two methods compared to the GT. While ME-IRL generalizes for the single-feature task, it struggles with the more complex multiple-feature tasks.
Figure 16. Probability assigned by the two methods to a set of optimal trajectories under the Boltzmann assumption. For the more complex multiple-feature tasks, the trajectories are more likely under FERL than under ME-IRL.

8.1.4 Summary
The results in this section suggest that while ME-IRL is capable of recovering very simple reward structures, it does not perform as well as using FERL features for complex rewards. This observation applies when the features are taught by experts, so we now test what happens if we instead use non-expert user features.
8.2 Non-expert users
In Section 7.2, we saw that user-taught FERL features have enough structure to help the robot recover the human's preferences in the online reward learning setting, where the original feature set is incomplete. However, there we only had one missing feature. In this section, we test the more challenging scenario where we learn a reward entirely from scratch using the noisy user features learned in simulation.
8.2.1 Experimental design
For this experiment, we had a similar setup as in Section 7.2—using the user-taught simulation features for learning the reward—only this time we tested the offline instantiation of FERL. Given that now we combine multiple noisy features together into a reward, we wanted to see how our divide-and-conquer approach fares against the ME-IRL baseline.
Manipulated variables
We manipulate the learning method, FERL or ME-IRL, just like in Section 8.1. As in Section 7.2, we use demonstrations collected from the expert on the physical robot for ME-IRL. For FERL, we use the user data from simulation, together with the expert demonstrations that teach the robot how to combine the learned features into a reward. Note that this gives ME-IRL an advantage, since all its data is both generated by an expert and collected on the physical robot.
Dependent measures
We use the same objective metric as in the expert comparison in Section 8.1, Reward Accuracy: the MSE of the learned reward to the GT reward on the set of test points.
Hypotheses
8.2.2 Analysis
Figure 17 illustrates our findings for the reward comparison. For reference, we also added the offline FERL reward using expert-taught simulation features, where we randomly subsampled sets of 10 traces and trained 12 expert features for each of the three feature types. This time, we find that the user features are noisy enough that, when combined into a reward, they do not reliably provide an advantage over ME-IRL. This could be attributed to the difficulty of teaching features in a simulator, especially since approximating distances and providing traces in 3D space through a 2D interface is hard. We verified this result with an ANOVA with the learning method as a factor and the task as a covariate, and, as expected, we found no significant main effect.
Figure 17. MSE to the GT reward for the three tasks, comparing ME-IRL from expert physical demonstrations (gray) to offline FERL from expert (orange) and non-expert (yellow) features learned in simulation and combined via demonstrations.
8.2.3 Summary
Previously, we have seen how structure can indeed help reward learning generalizability and sample efficiency; but we now see that the wrong, or very noisy, structure obtained from traces collected in simulation may diminish the benefits that our divide-and-conquer approach promises. However, we suggest taking this result with a grain of salt, since ME-IRL had the advantage of all-expert, all-physical data, whereas our method was limited to data collected in simulation from novice users. While collecting physical data from users was not possible during the pandemic, we are optimistic that with data provided on the physical robot the benefits would be more prominent.
9. Discussion
Learning reward functions is a popular way to help robots generate behaviors that adapt to new situations or human preferences. In this work, we propose that robots can learn more generalizable rewards by using a divide-and-conquer approach, focusing on learning features separately from learning how to combine them. We introduced feature traces as a novel type of human input that allows for intuitive teaching of non-linear features from high-dimensional state spaces. We then presented two instantiations of our FERL algorithm: one that enables expanding the robot’s feature set in online reward learning situations, and one that lets the user sequentially teach every feature and then combine them into a reward. In extensive experiments with a real robot arm and a user study in simulation, we showed that online FERL outperforms deep reward learning from demonstrations (ME-IRL) in data-efficiency and generalization. Offline FERL similarly beats ME-IRL when the features used are of high enough quality, but the results are less conclusive when using very noisy features.
Implications for online reward learning
Because they have to perform updates in real time from very little input, online reward learning methods represent the reward as a linear function of a small set of hand-engineered features. As discussed, exhaustively choosing such a set a priori puts too much burden on system designers, and using an incomplete set of features can lead to learning the wrong reward. Prior work enabled robots to at least detect that their feature space is insufficient to explain the human's input (Bobu et al., 2018), but then the robot's only option was to either not update the reward or completely stop task execution. Our online FERL approach provides an alternative that allows people to teach features when the robot detects it is missing something, and then update the reward using the new feature set. Although in this paper we presented experiments where the robot learns rewards from corrections, our framework can conceivably be adapted to any online reward learning method, provided there is a way to detect that the feature set is insufficient. Recent work on confidence estimation from human demonstrations (Bobu et al., 2020) and teleoperation (Zurek et al., 2021) offers encouraging pathways to adapting FERL to other online human–robot collaborative settings.
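Schematically, the resulting online loop looks as follows; this is only a sketch of the idea described above, where `confidence`, `query_traces`, `train_feature`, and `update_weights` are hypothetical stand-ins for the corresponding components (a confidence estimator in the spirit of Bobu et al. (2018), the feature-trace query, feature learning from traces, and the underlying online reward update), passed in as callables.

```python
def online_ferl_step(correction, features, weights,
                     confidence, query_traces, train_feature, update_weights,
                     threshold=0.5):
    """One online FERL update: expand the feature set if the human's correction
    cannot be explained by the current features, then update the reward weights."""
    if confidence(features, weights, correction) < threshold:
        # The robot detects its feature space is insufficient and asks for feature traces.
        traces = query_traces()
        features = features + [train_feature(traces)]
        weights = list(weights) + [0.0]   # the new feature starts with zero weight
    # Standard online reward update over the (possibly expanded) feature set.
    weights = update_weights(features, weights, correction)
    return features, weights
```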
Implications for learning complex rewards from demonstrations
Reward learning from the raw state space with expressive function approximators is considered difficult because there exists a large set of functions $r_\theta(s)$ that could explain the human input. For example, in the case of learning from demonstrations, many functions $r_\theta(s)$ induce policies that match the demonstrations' state expectations. The higher dimensional the state $s$, the more human input is needed to disambiguate between those functions sufficiently to find a reward $r_\theta$ that accurately captures human preferences. Without that, the learned reward is unlikely to generalize to states not seen during training and might simply replicate the demonstrations' state expectations. In this paper, we presented evidence that offline FERL may provide an alternative to better disambiguate the reward and improve generalization.
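To make the ambiguity concrete, consider the maximum-entropy IRL objective used by deep IRL methods (Wulfmeier et al., 2016): in its simplest tabular-visitation form, the gradient of the demonstration log-likelihood with respect to the reward parameters is

$$\nabla_\theta \mathcal{L} \;=\; \sum_{s} \big(\mu_{\mathcal{D}}(s) - \mu_{\pi_\theta}(s)\big)\, \nabla_\theta r_\theta(s),$$

where $\mu_{\mathcal{D}}$ and $\mu_{\pi_\theta}$ denote the state visitation frequencies of the demonstrations and of the policy induced by $r_\theta$. Any reward whose induced visitation matches the demonstrations' is a stationary point of this objective, which is exactly why many different $r_\theta$ can explain the same demonstrations.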
The reason our divide-and-conquer approach can help relative to relying on demonstrations for everything is that demonstrations aggregate a lot of information. First, by learning features, we can isolate learning what matters from learning how to trade off what matters into a single value (the features vs. their combination)—in contrast, demonstrations have to teach the robot about both at once. Second, feature traces give information about states that are not on optimal trajectories, be it states with high feature values that are undesirable, or states with low feature values where other, more important features have high values. Third, feature traces are also structured by the monotonicity assumption: they tell us relative feature values of the states along a trace, whereas demonstrations only tell us about the aggregate reward across a trajectory. Thus, by focusing on learning features first before combining them into a reward, the robot can incorporate all three benefits and ultimately improve reward learning from demonstrations.
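To illustrate the third point, below is a minimal sketch of how the ordering along a trace could be turned into a training signal, assuming traces are ordered from high to low feature value; this is a Bradley–Terry-style pairwise comparison for illustration, not the exact FERL objective.

```python
import torch
import torch.nn as nn

def monotonicity_loss(feature_net: nn.Module, trace: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss over one feature trace of shape (T, state_dim), assumed
    ordered from high to low feature value: for every pair i < j, encourage the
    learned feature to assign a higher value to state i than to state j."""
    phi = feature_net(trace).squeeze(-1)                       # (T,) predicted feature values
    i, j = torch.triu_indices(phi.numel(), phi.numel(), offset=1)
    # -log sigmoid(phi_i - phi_j) == softplus(phi_j - phi_i): penalize order violations
    return nn.functional.softplus(phi[j] - phi[i]).mean()
```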
Limitations and future work
Our work is merely a step towards understanding how explicitly focusing on learning features can impact reward learning generalization and sample complexity. While FERL enables robots to learn features and induce structure in reward learning, there are several limitations that may affect its usability.
Our user study provides evidence that non-expert users can, in fact, use FERL to teach good features. However, due to the current pandemic, we conducted the study in a simulated environment instead of in person with the real robot, and most of our users had a technical background. It is unclear how people without a technical background would perform, and especially how kinesthetically providing feature traces (instead of clicking and dragging in a simulator) would affect their perception of the protocol's usability. Further, we only tested whether users could teach features we told them about, so we still need to test whether users can teach features they implicitly know about (as would happen when intervening to correct the robot or designing a reward from scratch).
Even if people know the feature they want to teach, it might be so abstract (e.g., comfort) that they would not know how to teach it. Moreover, with the current feature learning protocol, they might find it cumbersome to teach discontinuous features like constraints. We could ease the human supervision burden by developing an active learning approach where the robot autonomously picks starting states most likely to result in informative feature traces. For instance, the robot could fit an ensemble of functions from traces online, and query for new traces from states where the ensemble disagrees (Reddy et al., 2020a). But for such complex features, it may be more effective to investigate combining feature traces with other types of structured human input.
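As a sketch of the ensemble-disagreement idea (all names hypothetical): train several feature networks on bootstrapped subsets of the traces collected so far, then query a new trace from the candidate starting state where their predictions disagree most.

```python
import torch

def pick_query_state(ensemble, candidate_states):
    """Active querying sketch: return the candidate starting state where an ensemble
    of feature networks disagrees most, as a proxy for the most informative
    feature-trace query. candidate_states has shape (n_candidates, state_dim)."""
    with torch.no_grad():
        preds = torch.stack([net(candidate_states).squeeze(-1) for net in ensemble])
    disagreement = preds.var(dim=0)   # variance across ensemble members per candidate
    return candidate_states[disagreement.argmax()]
```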
The quality of the learned rewards depends directly on the quality of the learned features. When the human provides feature traces that lead to good features, many of our experiments demonstrate that they induce structure in the reward learning procedure that helps generalization and sample complexity. However, if the robot learns features that are too noisy or simply incorrect, that (wrong) structure may impair performance. We saw an example of this when we tried to utilize the user study features for reward learning. In online FERL where a single feature was missing, the structure captured by the (noisy) non-expert features was still helpful in learning a better reward than the baseline. However, when trying to combine multiple noisy features in offline FERL, reward learning did not see a benefit. Future work must investigate ways in which the robot can determine whether to accept or reject the newly learned feature. One idea is to use our current framework’s confidence estimation capability in Section 5.2.2 to determine whether the learned feature set explains the human’s reward input. Another idea is to visualize either the feature function or examples of behaviors induced by it, and let the person decide whether the learned feature is acceptable.
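The visualization idea could be as simple as plotting the learned feature over a planar slice of the workspace and letting the person inspect it; the sketch below assumes a hypothetical `to_state(x, y)` that maps a planar end-effector position (at a fixed height) to the feature network's input.

```python
import numpy as np
import torch
import matplotlib.pyplot as plt

def plot_feature_slice(feature_net, to_state, xlim=(-1.0, 1.0), ylim=(-1.0, 1.0), res=50):
    """Heatmap of the learned feature over a 2D slice of the workspace, so a person
    can eyeball whether the feature looks right before accepting it."""
    xs, ys = np.linspace(*xlim, res), np.linspace(*ylim, res)
    grid = np.array([to_state(x, y) for y in ys for x in xs], dtype=np.float32)
    with torch.no_grad():
        vals = feature_net(torch.from_numpy(grid)).squeeze(-1).numpy().reshape(res, res)
    plt.imshow(vals, origin="lower", extent=[*xlim, *ylim], aspect="auto")
    plt.colorbar(label="learned feature value")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()
```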
Lastly, while we show that FERL works reliably in 27D, more work is necessary to extend it to higher dimensional state spaces, like images. In our discussion in Online Appendix B.3, we show how spurious correlations in large input spaces may affect the quality of the learned features in low data regimes. To counteract that, we could ask the person for more data, but after a certain point this becomes too burdensome on the user. Alternatively, approaches that encode these spaces to lower dimensional representations or techniques from causal learning, such as Invariant Risk Minimization (Arjovsky et al., 2019), could help tackle these challenges.
Supplemental Material
sj-pdf-1-ijr-10.1177_02783649221078031 – Supplemental Material for "Inducing structure in reward learning by learning features" by Andreea Bobu, Marius Wiggert, Claire Tomlin, and Anca D. Dragan, The International Journal of Robotics Research.
Acknowledgements
We also thank Rohin Shah for providing guidance and feedback on our work.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is supported by the Air Force Office of Scientific Research (AFOSR), the Office of Naval Research (ONR-YIP), the DARPA Assured Autonomy Grant, the CONIX Research Center, and the German Academic Exchange Service (DAAD).
