Abstract
We present the architecture of a fully autonomous, bio-inspired cognitive agent built around a spiking neural network (SNN) implementing the agent's semantic memory. This agent explores its universe and learns concepts of objects/situations and of its own actions in a one-shot manner. While object/situation concepts are unary, action concepts are triples made up of an initial situation, a motor activity, and an outcome. They embody the agent's knowledge of its universe's action laws. Concepts of both kinds come in various degrees of generality. To make decisions, the agent queries its semantic memory for the expected outcomes of envisaged actions and chooses the action to take on the basis of these predictions. Our experiments show that the agent handles new situations by appealing to previously learned general concepts and rapidly modifies its concepts to adapt to environment changes.
Introduction
The ability of a cognitive agent to act adequately in a given environment depends on its ability to predict how performing a given action will affect its current situation; that is, it depends on the agent's knowledge of the laws of the universe that determine the effects of its possible actions on its environment (hereafter the action laws of the universe). How artificial agents can acquire these laws, and how the laws should be updated if the environment changes, has proved a difficult question. In the case where the intended environment is open (that is, where the agent's designer cannot foresee all the situations the agent might encounter in the future), providing a suitable set of action laws to the agent "by hand" is unfeasible. The only viable solution is for the agent to continuously learn the relevant laws from experience, just as natural agents (humans and animals) do. Crucially, this learning process should allow for generalization over disparate experiences, so that the agent is able to behave appropriately in new situations. It should also allow for rapid modification of the learned action laws to accommodate environment changes. Finally, it should address the fact that the agent's memory resources, however large, will always be finite, while the number of distinct objects and situations it may encounter in an open world is unbounded.
The aim of the present article is to show how such learning (autonomous, online, achieving generalization and rapid updating, and suited to open universes) can be achieved. Solving this problem would be useful in applications such as mobile autonomous robots (e.g., service robots) operating in open worlds.
The proposed approach's main hypothesis is that natural agents' ability to perform well in our open and changing world largely relies on the fact that they store their knowledge in the form of rapidly updatable concepts with various degrees of generality. Presumably, natural agents first form concepts about the objects/situations they encounter, and then use these as elements for composing more complex concepts, notably concepts of action. The latter constitute their knowledge of their environment's action laws. Concepts are a very compact way to store information, as more general concepts can apply to a whole range of distinct cases while remaining usable in new, similar situations. They also allow for efficient updating, as the adjustment of a single concept modifies all the inferences and further decisions that can be made on its basis. Another important hypothesis is that natural agents' endless ability to learn new concepts despite having finite brains relies on some efficacious management of forgetting. Forgetting is inevitable for an agent with finite memory living in an open world; but the agent needs to avoid catastrophic forgetting, that is, forgetting that would severely affect its ability to act judiciously in its current world. Most likely, natural agents avoid catastrophic forgetting by selectively forgetting their less important knowledge (such as details and old, unused memories) and reallocating the corresponding neurons to encode new useful knowledge. We suggest that artificial agents could take inspiration from these strategies and use an artificial neural network to learn and store concepts, and query this network to make predictions about the outcomes of envisaged actions.
To test this idea, we build an artificial agent with an SNN at its core. This agent lives in a very simple virtual world composed of rooms which may or may not be accessible (hence, knowable) to it. At first, the agent is confined to a single room and learns by itself how to act in it according to its own interests. Then, at some point, a door opens to a new room containing some never-before-encountered objects and situations. Yet, although these are new to the agent, some general laws carry over from one room to the other. Our experiments show that having learned these laws in the first room allows the agent to act, by and large, properly in the second one as soon as it enters it. They also show that the agent is able to learn new laws holding in the second room without catastrophic loss of previous knowledge. Finally, some changes are introduced in the second room, rendering some of the previously learned rules obsolete while new rules become true. Again, experiments show that the agent quickly updates its knowledge to account for these changes.
The article is organized as follows. Section 2 discusses related work and Section 3 presents the agent and its universe. Section 4 gives a general overview of the neural network, while Section 5 details its functioning. Section 6 describes the agent’s workings and Section 7 presents the experiments conducted on the agent and discusses the results. Section 8 concludes and outlines future possible developments of the framework.
Related Work
The research problem addressed in this work is the autonomous online learning, updating and generalization of concepts and action laws in an open universe. To our knowledge, no existing approach addresses this problem in all its dimensions, though its individual dimensions are investigated in separate research fields.
Agents living in complex open environments (such as the real world) can be confronted with a huge amount of information in the course of their life. Retaining the totality of that information and further processing it to use it does not seem a practical option, as this would be extremely costly in terms of memory, computational power and energy consumption. This is especially true in the case of agents such as autonomous mobile robots, for which frugality regarding these same resources can be critical. For these agents, memorizing the whole of the experienced objects and events cannot be taken as a reasonable goal or criterion. Rather, forgetting appears as the unavoidable counterpart of continual learning, and the question is how to manage it so as to preserve the agent’s performance as much as possible. The most natural way to do this is by ensuring that the previous knowledge that is lost through new learning is the agent’s less useful knowledge at the time of the learning. How to characterize “less useful knowledge” and how to preferentially select it for deletion is a key issue to address to efficiently manage forgetting. In turn, such an efficient management of forgetting appears as a necessary condition for making lifelong learning autonomous agents in complex open environments possible.
The present article aims at providing a proof of concept of a cognitive agent satisfying the above requirements. As a first step, it focuses on the learning of concepts and only uses them in some basic decision making to demonstrate the agent's learning abilities. However, the agent is designed so as to allow the retrieval of the learned concepts and action laws, which could then be encoded in symbolic format and used in more complex decision making involving explicit reasoning and action planning. The implementation of such complex reasoning abilities in the agent is left to future work.
In the absence of studies investigating the considered research problem in all its dimensions, comparing the proposed setup with existing approaches seems inappropriate. Indeed, this would only take into account some of the dimensions of the problem and fail to evaluate the proposal relative to its goal. It should also be remarked that the standard metrics commonly used to assess neural networks’ learning performance cannot be used in the present case. Notably, notions such as Average Accuracy, Forward and Backward Transfer (Chaudhry et al., 2018; Lopez-Paz & Ranzato, 2017) used in the field of Continual Learning to assess learning accuracy, generalization and resistance to catastrophic forgetting cannot be used. Indeed, these metrics were devised to assess the performance of classifiers and are built upon a basic notion of accuracy that consists in checking whether two objects (generally the predicted class and the actual class of a sample item) are identical. But in the present proposal a prediction is not a single object but a set of objects (namely a set of predicted features), which is to be compared with another set of objects (the actual situation’s set of features). Simply checking whether these two sets are identical would be a rather crude way to assess the prediction’s accuracy: a more fine-grained comparison is needed, based on set inclusion properties. For these reasons, a suitable set of experiments and metrics was devised (see Section 7).
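To make this concrete, here is a minimal Python sketch (with illustrative names of our own, not the actual implementation) of the kind of set-based comparison involved:

# Sketch of set-based prediction scoring. A prediction and an outcome are
# both sets of features; accuracy is assessed through set-inclusion relations
# rather than a simple identity check.
def score_prediction(predicted: set, actual: set) -> dict:
    missed = actual - predicted   # features present but not predicted
    wrong = predicted - actual    # features predicted but absent
    return {
        "correct_and_complete": predicted == actual,
        "missed_rate": len(missed) / len(actual) if actual else 0.0,
        "error_rate": len(wrong) / len(predicted) if predicted else 0.0,
    }

# e.g., score_prediction({"OK", "Cold"}, {"OK", "Cold", "WestWall"})
# -> not complete: "WestWall" was missed, but nothing was wrongly predicted.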
The Agent and Its Universe
This section describes the universe the agent lives in and gives a general overview of the agent. It also clarifies the notions of concepts and action laws used in the sequel.
The Universe
The universe is designed so as to support a set of experiments intended to assess the agent's abilities. It presents a number of challenges for the agent to face, while being kept as simple as possible to allow precise monitoring of the agent's performance.
The universe is built over a grid of boxes, which we (not the agent) identify using an orthonormal coordinate system (see Figure 1). Each box represents a particular location in the agent’s universe and possesses a particular set of features the agent is able to perceive, drawn from the set LF = {OK, KO, NorthWall, EastWall, SouthWall, WestWall, Cold, Sound, #0, #1,…, #24}. For example, the box with coordinates
The rooms are made out of boxes, and delimited with impassable walls. For instance, Figure 1A shows the first room, which is composed of 25 boxes. Opening a door amounts to removing the wall features from the corresponding boxes’ feature sets (as in Figure 1B, where the EastWall feature has been removed from boxes

Figure 1. The agent's accessible world: (A) In the first phase, room 1 only; (B) After opening the door, rooms 1 and 2; (C) After the sound feature is moved downwards (from row 3 to row
When considering the agent's universe, at any point in time we only consider the boxes to which it has access, the set of which is always finite. Note that this does not contradict the unboundedness of the universe: although each possible room is finite (as are rooms in the real world), the number of possible rooms the agent may discover in its life, the number of new objects it may encounter in them, and the number of changes that may occur in these rooms are all unbounded. Being able to handle these novelties (reacting appropriately and updating its knowledge on the fly) is the agent's main challenge.
As regards the features, KO corresponds to some unpleasant stimulus the agent spontaneously wants to avoid, and OK to the absence of such a stimulus. The other features convey indifferent information. It should be stressed that OK and KO are not rewards in the sense of reinforcement learning (RL), as they play no role in the neural network's learning process (see Section 5 for details). Their only purpose is to motivate the agent's choices and to allow us to assess its ability to make appropriate decisions.
The distribution of OK and KO boxes, which is common to both rooms, provides some general (non-monotonic) laws of the universe (such as "going North-East from an OK box leads to another OK box"), the learning of which is the agent's second challenge. Boxes' names in the first room are specific features (i.e., they apply to one single object). They are used to check that the agent also forms particular concepts of individual boxes and uses them to learn particular rules (such as "going North-East from the OK box with name '#12' leads to the OK box with name '#18'"). Cold is a feature common to a small number of boxes. It is mainly used to increase the number of features boxes may have (from two to five), and also to check that the agent can form concepts with an intermediate degree of generality such as, for example, the concept of cold OK boxes with a West Wall (see Section 3.3 for a formal definition of generality). Being able to handle concepts with various degrees of generality is a third challenge. Sound is used to test the agent's ability to use the learned general rules in the presence of novelty, and also to check that it can learn new concepts and action laws without suffering catastrophic forgetting. Notably, after the door opens (Figure 1B) the agent is expected to learn new general rules such as "going North from a box with sound leads to a box with a north wall". Moving the Sound feature from its "up" position (Figure 1B) to its "down" position (Figure 1C) is used to test the agent's ability to update its knowledge when the environment changes.
The agent is composed of a set of sensors, a perceptual system, a semantic memory, a decision system, a motor system and a set of actuators (see Figure 2). Sensors collect data from the external world and feed it to the perceptual system, which performs feature/object recognition. Neural networks doing this while relying on unsupervised learning from unlabeled data already exist (e.g., Kheradpisheh et al., 2017; Thiele et al., 2018), so we simply suppose that the agent's perceptual system operates as intended and provides the semantic memory with the appropriate inputs, namely the correct and complete set of features of the agent's current location. The semantic memory forms concepts by binding together sets of features, and stores them for further retrieval. Its modeling is the main focus of this article. The decision system is the other important part: it queries the semantic memory to predict the outcomes of possible actions, and decides which one to take on the basis of these predictions. This decision is then sent to the motor system, which activates the actuators to perform the corresponding motor activity. Information from the actuators is sent back to the semantic memory through proprioception, allowing the agent to memorize the motor-related features of the realized actions.
The agent's possible actions consist of steps from one box to an adjacent box, in any of the eight directions. Formally, an action is a triple made up of a depart location, a motor activity, and an outcome. By "motor activity" we mean the fact that the agent's actuators are activated so as to make it move to the immediately next box in the selected direction. The set of the agent's possible motor activities is therefore the set MotAct = {North, North-East, East, South-East, South, South-West, West, North-West}. The set of motor activity features the agent is able to perceive by proprioception is the set MF = {N, NE, E, SE, S, SW, W, NW, Diag, Orth}, where the first eight are specific to each particular motor activity, while Diag and Orth are more general features shared by all motor activities yielding diagonal/orthogonal moves. In cases where there is a wall at the edge of the depart box in the selected direction, the agent bumps into it and remains in the same place. We then say that the action's outcome is a failure. Otherwise, the action's outcome is the agent's new location.
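As an illustration, the action triple and the sets MotAct and MF could be represented along the following lines (a Python sketch with our own naming, not the actual implementation; the example step instantiates the '#12'-to-'#18' rule mentioned above):

# Illustrative encoding of an action as a (depart, motor activity, outcome)
# triple, where the outcome is None when the agent bumps into a wall.
from typing import NamedTuple, FrozenSet, Optional

MOT_ACT = {"North", "North-East", "East", "South-East",
           "South", "South-West", "West", "North-West"}
MF = {"N", "NE", "E", "SE", "S", "SW", "W", "NW", "Diag", "Orth"}

class Action(NamedTuple):
    depart: FrozenSet[str]             # feature set of the depart box
    motor: str                         # element of MOT_ACT
    outcome: Optional[FrozenSet[str]]  # feature set of arrival box, or None on failure

step = Action(frozenset({"OK", "#12"}), "North-East", frozenset({"OK", "#18"}))
bump = Action(frozenset({"OK", "NorthWall"}), "North", None)  # bumped into a wall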

Figure 2. Schema of the agent.
The agent can form two kinds of concepts. First, concepts of "things," in the broad sense. These concepts bind together co-occurring features and can be seen as a sort of conjunction in which the conjuncts have different "weights," reflecting the fact that some features are more important than others in a concept's definition (Freund, 2008). They store the agent's knowledge about locations and, more generally, about any object, so we call them object concepts. The second kind is relational concepts. These take other concepts as elements and bind them together into tuples. Concepts of actions are of this kind: they bind together the agent's concepts of a depart location, a performed motor activity, and a subsequent outcome, in the order in which they were experienced.
An object concept X is said to be general, as opposed to particular, if there is another concept Y such that the set of features composing X is a strict subset of the set of features composing Y. Y is then said to be more particular than X. These definitions capture the fact that if X is more general than Y then the set of objects X applies to is a superset of the set of objects Y applies to. In line with this idea, in this work an action concept is said to be general if the object concept of its initial situation is general or its motor activity component only contains Diag or Orth (thus, if X and Y are action concepts such that X is more general than Y, then the set of actions X applies to is a superset of the set of actions Y applies to). Generality of both kinds of concepts is understood relatively to the set of concepts the agent possesses at some point, so no concept is general or particular in itself.
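In code, the generality relation on object concepts boils down to a strict-subset test on feature sets (a minimal sketch under our own naming, read directly off the definition above):

# X is more general than Y iff X's feature set is a strict subset of Y's,
# so that X applies to a superset of the objects Y applies to.
def more_general(x: frozenset, y: frozenset) -> bool:
    return x < y  # strict subset test on feature sets

cold_ok_west = frozenset({"OK", "Cold", "WestWall"})
ok_box = frozenset({"OK"})
assert more_general(ok_box, cold_ok_west)  # {"OK"} applies to more boxes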
For example, when visiting the box
It should be mentioned that this representation of action laws is slightly different from the one used in the AI field of planning (see e.g., Fikes & Nilsson, 1971), which involves a notion of executability that specifies the conditions that need to hold for an action to be possible. Yet the notion of executability can readily be recovered from action concepts by stating that the action represented by the action concept
Implementing the Agent’s Semantic Memory in the Neural Network: Overview
This section presents the neural network implementing the agent’s semantic memory. It first gives an overview of SNNs, and more particularly of the neural models and the learning rules which are relevant for that purpose. Then it presents the network’s general architecture.
Spiking Neural Networks
Spiking neural networks (SNNs) are artificial neural networks in which information is transmitted between neurons by means of spikes that neurons fire at one another. At each time step, the activation level of each neuron is computed as a function of both its previous activation level and the amount of activation received by the neuron at this time step. If this activation level reaches a certain threshold (the neuron's spike threshold), the neuron fires a spike and its activation level is reset to some base value. Emitted spikes are conveyed from one neuron to another through weighted connections that modulate the amount of input transmitted by spikes to the receiving neuron. Learning consists in adjusting these weights.
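The following toy sketch illustrates these dynamics for a single neuron; the leak factor and parameter values are arbitrary placeholders of our own, not taken from the article:

# Minimal integrate-and-fire dynamics: accumulate input on top of the
# (decayed) previous level; fire and reset when the threshold is reached.
def neuron_step(level: float, incoming: float,
                threshold: float = 1.0, base: float = 0.0,
                leak: float = 0.9) -> tuple[float, bool]:
    level = leak * level + incoming
    if level >= threshold:
        return base, True   # spike fired, level reset to base value
    return level, False

level, fired = 0.0, False
for inp in [0.4, 0.4, 0.4]:  # repeated sub-threshold inputs
    level, fired = neuron_step(level, inp)
# fired becomes True once accumulated activation crosses the threshold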
SNNs are well suited for autonomous learning in open worlds because they allow for spike-timing-dependent plasticity (STDP), a family of biologically plausible learning rules which can achieve unsupervised online learning from unlabeled data (Thiele et al., 2018). There are a number of STDP variants, but all have in common that the modification of the connection weight between two neurons depends on the relative timing of their spikes. In the most popular version, the connection between two neurons is reinforced if the input neuron spikes just before the output neuron, decreased if it spikes just after the output neuron, and left unchanged otherwise.
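For concreteness, this most popular variant can be sketched as follows (the learning rate and window size are arbitrary placeholders):

# Pairwise STDP: the weight change depends only on the relative spike
# timing of the input (pre) and output (post) neurons.
def stdp_update(w: float, t_pre: int, t_post: int,
                lr: float = 0.05, window: int = 5) -> float:
    dt = t_post - t_pre
    if 0 < dt <= window:     # input spiked just before output: potentiate
        return w + lr
    if -window <= dt < 0:    # input spiked just after output: depress
        return w - lr
    return w                 # outside the window: unchanged

w = stdp_update(0.5, t_pre=10, t_post=12)  # -> 0.55 (potentiation)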
There are also many different implementations of STDP in SNNs, ranging from highly detailed and biologically accurate models to drastically simplified ones. Simplified versions are known for being rapid and energy-efficient (Yamazaki et al., 2022), which are interesting properties for autonomous robots.
The JAST learning rule (Thorpe, 2023; Thorpe et al., 2019) is certainly one of the simplest STDP implementations one can find. In this family of models, connections between neurons are binary (i.e., weights are either 0—meaning no connection—or 1) and each neuron has a fixed number of incoming connections. Learning is achieved through the replacement, at each time step, of a number
Although this learning rule is appealing for its simplicity, it cannot be used as it is for the present purpose. Obviously, freezing neurons is not suitable for continual learning and updating. Furthermore, neurons having both binary connections and a fixed number thereof would be unable to encode concepts having a variable number of features, which is required since an object’s number of features may vary over time (for instance, when the door is closed the box
For this reason, the neural network implementing the agent's semantic memory only retains some of the JAST learning rule's ideas, while relying on a more refined modeling of the neurons' internal dynamics to address these issues. The retained ideas are mainly the use of binary synapses, and the fact that learning consists in swapping a number of these so that the sum of the connections' weights on any given neuron remains constant through learning. This property ensures that there can be no "dead" neurons, that is, neurons having lost most of their incoming connections and which are therefore unable to respond to any input. However, here the same input neuron may have multiple synapses onto the same output neuron, which means that connection weights are in fact integers. Another retained idea is the use, in some specific cases, of dynamic spiking thresholds, to ensure that a given input always has a number of neurons responding to it.
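The retained swapping idea can be illustrated with the following sketch (our own simplified illustration, not the paper's learning algorithm): integer synapse counts per input, and learning as a swap that leaves the output neuron's incoming weight sum invariant.

# One output neuron with a fixed budget of incoming synapses, distributed
# as integer counts over input neurons. Learning moves synapses from
# currently silent inputs onto active ones; the total never changes.
import numpy as np

rng = np.random.default_rng(0)
n_inputs, budget = 8, 12
weights = rng.multinomial(budget, [1 / n_inputs] * n_inputs)  # sums to `budget`

def swap_synapses(weights: np.ndarray, active: list, n_swap: int) -> np.ndarray:
    w = weights.copy()
    for _ in range(n_swap):
        donors = [i for i in range(len(w)) if w[i] > 0 and i not in active]
        if not donors:
            break
        w[rng.choice(donors)] -= 1   # remove one synapse elsewhere
        w[rng.choice(active)] += 1   # regrow it on an active input
    return w

weights = swap_synapses(weights, active=[0, 3], n_swap=2)
assert weights.sum() == budget       # incoming weight sum is invariant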
The article's original contribution regarding SNNs mainly concerns the use of some internal metrics of the neurons to regulate learning and modulate information transmission. For instance, the input neurons' ability to establish new synapses onto output neurons depends on the number of synapses they already have, which facilitates the learning of rare or new features (such as Sound). Another example is that the choice of the neurons recruited for learning a new concept depends on the time of their last spike. This makes it possible to preferentially select less used neurons for new learning and to prevent in this manner the forgetting of more used—and therefore presumably more useful—knowledge. A third example is the modulation of information transmission between neurons in the querying process, which depends on the weight sum of input neurons' connections onto output neurons. This modulation favors the transmission of more specific information over that of less specific information, which is critical in non-monotonic contexts (see Section 6.1). It should be remarked that these methods do not depend on the specific type of spiking neurons used in the proposal and could thus be applied to regulate learning and forgetting and to balance information retrieval in other contexts.
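As an example of the second mechanism, recruitment biased by the time of the last spike could look like this (the boosting formula and all names are ours, purely for illustration):

# The longer a neuron has been silent, the more its input is boosted, so
# rarely used neurons are preferentially recruited for new concepts.
import numpy as np

def recruit(raw_input: np.ndarray, steps_since_spike: np.ndarray,
            k: int, alpha: float = 0.01) -> np.ndarray:
    boosted = raw_input * (1.0 + alpha * steps_since_spike)
    return np.argsort(boosted)[-k:]    # k neurons with highest boosted input

raw = np.array([0.50, 0.50, 0.50, 0.50])
silence = np.array([2, 900, 5, 40])    # steps since each neuron's last spike
print(recruit(raw, silence, k=2))      # favors the long-silent neurons 3 and 1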
The Neural Network’s Architecture
The network is composed of an interface, which communicates with the agent’s other components, and a body of hidden neurons which is itself divided into two layers (see Figure 3). The first layer learns object concepts and the second learns action concepts. For this reason we call their neurons, respectively, object concept neurons (O-neurons for short) and action concept neurons (A-neurons). This architecture draws on neuroanatomical studies according to which concepts are represented in the brain by hierarchically organized concept neurons, each receiving information from some lower neurons and sending reciprocal connections to these same neurons so that it can reactivate them for information retrieval (Bausch et al., 2021; Quiroga, 2012; Shimamura, 2010). For simplicity these reciprocal connections are not modeled as such, but instead information is allowed to flow in both directions along the same connections: from interface neurons to O-neurons and from there to A-neurons for learning and querying, and the other way round for retrieving information. A key point is that interface neurons are both input and output neurons, depending on the phase of the computation.
Interface neurons (I-neurons for short) mainly support the representation of features, be it of visited locations or of the agent's own motor activities. An additional neuron acts as a failure detector, specifically firing when the agent bumps into a wall and remains in the same place. All of them have their labels fixed from the start. I-neurons encoding mutually exclusive features have fixed reciprocal inhibitory connections, which prevents them from firing together in the same computing process: as soon as one has fired, the others are shut down for the rest of the process. The

Figure 3. Schema of the SNN. O-neurons #1 and #2, respectively, encode the concepts
The first layer of hidden neurons (O-neurons) is composed of 100 neurons whose dynamics differ depending on whether their input source is I-neurons or A-neurons. O-neurons learn co-occurrences of features. The second layer is composed of 400 compartment neurons with three separate input compartments. The first compartment receives connections from O-neurons, the second from motor-activity I-neurons, and the third from O-neurons and the
This section describes the neural network's functioning and details its implementation. For readability, inessential algorithms and parameter settings are relegated to the Appendices.
General Notions
Let
Connections
Synapses between any two neurons always have a weight of
These arrays are initialized at random, in such a way that for any O-neuron
Spike Thresholds
Each O-neuron
Last Spikes
The number of steps performed by the agent since the last spike of each O- and A-neuron is stored in separate vectors
Synapses’ Growth Rates
While the number of incoming synapses on O- and A-neurons is strictly constrained by
Implementation Apparatus
In the algorithms provided in the sequel we introduce and use dictionary structures in which each value
We also use Python's NumPy function
Forward Firing of O-Neurons
The set of L-neurons that fire in response to an input from either the perceptual system (if the agent is observing its current location) or the decision system (if the agent is querying its semantic memory) is noted
Information Retrieving
Given a set
Learning of O-Neurons
Learning of object concepts (see Algorithm 2) occurs after each observation by the agent of its current situation. The set of O-neurons selected for learning is noted
However, it may happen that there are not enough of them, which is materialized by the fact that a target number
When this happens, the network seeks additional neurons to recruit for learning by "boosting" the input received by O-neurons (see Algorithm 3).
In practice, for each non-firing O-neuron a boosting coefficient
The learning process then depends on the accuracy of the agent’s knowledge relative to the current situation. To assess it, the O-neurons from
If, on the contrary,
A-Neurons
Inputs
The sets of neurons from
A-Neurons’ Forward Spiking
Let
A-Neurons’ Learning
Learning of action concepts (see Algorithm 6) takes place after each step made by the agent. In this case,
The set of A-neurons selected for learning is noted
For each input set
For each neuron
The learning process then depends on the accuracy of the agent’s expectations relative to the action’s outcome. If these expectations were both correct (i.e., all predicted features are actually present) and complete (i.e., all actually present features were predicted), then for any learning neuron
If the agent’s expectations were not correct, then for any learning neuron
If the agent’s expectations were not complete, then for any learning neuron
The Agent’s Functioning
This section first explains how the agent queries its semantic memory to predict the outcomes of envisaged actions, and then how the agent uses these predictions to make its decisions. Finally, it describes the whole process of making a step.
Querying the Network
To query the neural network, the decision system first sends an input to the L-neurons that encode the features of the considered initial situation, bringing a set
Now, among the A-neurons that receive connections from
Each time an A-neuron spikes, it sends a backward input to O- and
Due to input modulation, the input sum received by a given A-neuron
Making Decisions
Suppose that the agent is at some location and wants to make a step. The process by which it decides which motor activity to perform (i.e., in which direction to go) is run by its decision system (see Algorithm 8). First, the agent decides whether to exploit its current knowledge about its environment, or to explore its environment to improve its knowledge. The exploration/exploitation dilemma is a well-known problem in online learning (Sutton & Barto, 2018; Watkins, 1989), and changing environments make it even more difficult. For this reason, we do not try to reach an optimal solution here; we simply make the agent's decision system choose at random, with equal probability, between an Exploration and an Exploitation mode.
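As an illustration, the mode choice and a schematic Exploitation step might look as follows (the rate_action callback, standing in for the memory-querying and rating process described next, is hypothetical):

# Sketch of the decision step: a fair coin flip between modes, then either
# a random move (Exploration) or the best-rated move (Exploitation).
import random

MOT_ACT = ["North", "North-East", "East", "South-East",
           "South", "South-West", "West", "North-West"]

def decide(rate_action) -> str:
    if random.random() < 0.5:                       # Exploration mode
        return random.choice(MOT_ACT)
    ratings = {m: rate_action(m) for m in MOT_ACT}  # Exploitation mode
    return max(ratings, key=ratings.get)

# e.g., with a toy rating that prefers eastward moves:
print(decide(lambda m: 1.0 if "East" in m else 0.0))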
The agent then makes predictions about the outcome of each possible action. To do so, it successively queries its semantic memory for the outcome of the action having the current location as initial situation and one of the eight possible moves as motor activity (see Algorithm 9).
After each such querying process, a coefficient
The prediction process thus returns a set (dictionary)
The agent then rates each motor activity
Making a Step
Let
First, it decides in which direction to go by running the decision process described in Section 6.2.
The agent’s move is simulated by a function
Once the action is learned, the input pair
Experiments and Results
The testing of the agent’s abilities was carried out by placing it at location
The results presented here are averaged over 50 trials. Three distinct groups of tests ("Experiments") were carried out. The code provided in the supplementary material makes it possible to reproduce the experiments while varying the number of trials, sequences, steps per sequence, and other parameters.
Experiment #1
In a first group of tests, trials were sequences
Table 1. Post-Learning Predictions Percentages (Mean over 50 trials). CC: Correct and Complete predictions, MF: Missed Features, PE: Prediction Errors.
A first test concerned the agent's ability to learn an action from one single experience ("one-shot learning"). To assess this, after each step the agent was asked to redo the prediction that led to the just-realized action, and this new prediction was compared with the action's actual outcome (see Table 1). A prediction is said to be Correct and Complete (CC) if the predicted features are exactly those of the arrival location. The table's first line shows the percentage of performed steps leading to a CC post-learning prediction for each series of steps.
At each step, the arrival location’s features (“features to predict”) were listed and counted, and those among them that the agent failed to predict (“Missed Features”) were counted. The table’s second line (MF for “Missed Features”) shows the percentage of features to predict that the agent failed to predict for each series of steps.
The features predicted by the agent were also listed and counted at each step, and those among them that were wrongly predicted (i.e., that did not belong to the arrival location's feature set) were counted. The table's third line (PE for "Prediction Errors") shows the percentage of predicted features that were wrongly predicted for each series of steps.
These results show good performance at immediate recall after learning. The fact that the network does not learn anything at the first step is expected, as there is no previous step, hence
To test whether the acquired knowledge was retained in the long run, after each series of steps the simulation was frozen and learning was deactivated, and the agent was placed successively in each location of each room. There, it was queried for its predictions for each of the eight possible motor activities. Its predictions were recorded and compared with the actions' actual outcomes. Tables 2 and 3 show each feature's mean Hit Rate (that is, its chances of being predicted when effectively present) and Correctness (its chances of being effectively present when predicted) for each room.
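Concretely, these two metrics can be computed per feature from a batch of (predicted set, actual set) pairs, along the following lines (a sketch with our own naming):

# Hit Rate: fraction of cases where the feature was predicted among those
# where it was actually present. Correctness: fraction of cases where it
# was actually present among those where it was predicted.
def hit_rate_and_correctness(pairs, feature):
    present = [(pred, act) for pred, act in pairs if feature in act]
    predicted = [(pred, act) for pred, act in pairs if feature in pred]
    hit_rate = (sum(feature in pred for pred, _ in present) / len(present)
                if present else None)
    correctness = (sum(feature in act for _, act in predicted) / len(predicted)
                   if predicted else None)
    return hit_rate, correctness

pairs = [({"OK"}, {"OK"}), ({"OK"}, {"KO"}), (set(), {"OK"})]
print(hit_rate_and_correctness(pairs, "OK"))  # -> (0.5, 0.5)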
Table 2. Predictions' Hit Rates after
Table 3. Predictions' Correctness after
Values for the first room (white lines) show that learned actions are indeed recalled long after having been performed. Values for the second room (grey lines) show that, despite never having been in this room (since the door was kept closed), the agent was able to correctly predict the OK and KO features and, to a lesser extent, Failure—and this even though locations in the second room have different sets of features, including for some of them a new feature, Sound. The poor performance at wall prediction is due to the lack of general laws of the universe, observable in the first room, regarding the presence of walls in adjacent boxes. The agent thus relies on concepts of particular actions (i.e., action concepts involving particular concepts of individual boxes) to predict walls in the first room, but these cannot be reliably used in the second room. The mixed result at failure prediction comes from a competition between general action concepts, the control of which needs to be improved.
To test the agent's ability to use its knowledge to make appropriate decisions, the outcome of each action taken in Exploitation mode was recorded throughout the trials. Figure 4.A shows the percentages of OK, KO, and Failure outcomes obtained in this manner for each series of steps ("no data" corresponds to trials for which no steps were taken in Exploitation mode in the considered series). Visited locations were logged to check that the agent was not looping indefinitely over the same boxes: all boxes kept being visited, albeit very rarely, at any point of the trials, due to the Exploration mode. On average, the most visited boxes were visited in about 10% of the taken steps, versus 2% for the least visited ones. The most visited locations are the OK boxes with no walls, which is expected since they are visited in both Exploration and Exploitation modes and can be accessed from all directions. The least visited locations are KO boxes with walls, which are less accessible and are mostly visited in Exploration mode.
Finally, the agent's ability to use the knowledge acquired in the first room to act judiciously in the second room was tested. At the end of each series of steps the agent was asked to choose a move from each of the second room's location types (where two locations are said to be of the same type if they have exactly the same features). Figure 4.B shows the percentages of OK, KO, and Failure outcomes obtained in this manner. These results reflect the agent's performance at making predictions about the second room's locations: it successfully predicts OK and KO boxes, but has more difficulties predicting failure.

Figure 4. Actions' mean outcomes with door closed; Green: OK, Red: KO, Blue: Failure, Grey: No Data.
Experiment #2
To test the agent's ability to handle changes, a second group of tests was carried out. The setup was the same as in Experiment #1, except that the door was opened at the 2048th step.
Table 4. Predictions' Hit Rates after … Steps, with the Door Opened at the 2048th Step.
Table 5. Predictions' Correctness after
It should be remarked that this learning did not lead to significant loss of previous knowledge regarding the first room: a moderate drop in Hit Rates can be observed for boxes' names, Cold and walls, but not for OK, KO, and Failure. Furthermore, Correctness is barely impacted. This is due to the use of selective forgetting in the learning process (recall the boosting of inputs to O- and A-neurons in Section 5). As neurons encoding more particular concepts (such as detailed descriptions of particular locations) are less often reactivated, they tend to be reallocated over time, which leads to the loss of the particular features they encode. But neurons encoding more general concepts are more often reactivated, which preserves them from reallocation and forgetting. As a result, particular and less used concepts tend to fade out over time, while more general and well-used concepts are preserved.
Accordingly, the agent’s ability to make appropriate decisions in the first room was not impacted by the door being opened. In fact, the bar chart of OK, KO, and Failure outcomes obtained in this second run of tests showed no visible difference with the one obtained with the door closed shown in Figure 4.A.
Experiment #3
To better assess the agent's ability to update its knowledge to accommodate environment changes, a third experiment was conducted. As previously, the agent was kept in the first room for 2048 steps before the door was opened, but then it was prompted to run 50 series of 100 steps. Halfway through these, the Sound feature was moved from the boxes
Figure 5 shows the percentage of NorthWall (resp., SouthWall) predictions after each series of steps. These results show that after the door opened the agent first learned to predict a north wall in the arrival box when considering a north move, but that once the Sound feature was moved to its "down" position it stopped doing so and instead learned to predict a south wall in the arrival box when considering a south move. This learning was both fast and robust, considering that the agent had very few learning occasions. Indeed, less than 0.6% (on average) of the taken steps were north moves from a box with sound, and similarly for south moves from a box with sound. This means that very few experiences were enough for the agent to update its knowledge, which is consistent with the results obtained in the first experiment.

Figure 5. Prediction of walls in arrival boxes given Sound as information about the depart location. Black dotted line: door opens; red dotted line: sound changes location; blue: NorthWall predictions considering a north move; orange: SouthWall predictions considering a south move (percentages over 50 trials).
Computing time was estimated simply by running trials consisting of a single series of 2048 steps. An average of 5.9 seconds (standard deviation 0.16) was obtained on a conventional computer. However, this is to be taken as an upper bound, since for practical reasons some testing carried out in the course of the series was not deactivated. No attempt was made to optimize computing time, as this seems less critical for online-learning autonomous agents, which can learn while physically performing their actions.
Conclusion and Future Developments
In this article, a proof of concept of the architecture of a fully autonomous agent learning action laws online and accommodating environment changes was provided. This agent relies on general concepts to handle new situations and dynamically adjusts its concepts to its current environment. This makes it well suited for open worlds: if a new door were to open onto a third room with new objects and laws, it would learn them just as it did in the second room. Of course, this would come at the cost of forgetting its least used concepts, but these are precisely the ones it needs the least. In fact, the agent's ability to selectively forget its least used knowledge ensures that it will always be able to adapt to new environments by replacing old unused concepts with new useful ones.
It should be stressed that this architecture does not crucially depend on the specific type of spiking neurons used to implement it, nor on the details of its implementation. Any similar architecture based on an SNN using STDP-like learning rules would probably work, provided that it retains the approach's main ideas. Notably, strict constraints on output neurons' connection weight sums, combined with flexible constraints on input neurons' forward connection weight sums, make it possible to regulate learning and keep the network balanced; target numbers of neurons for spiking and learning, combined with dynamic learning thresholds, ensure that the network will always have neurons available to learn a new input; boosting the inputs received by neurons that have not spiked for a long time preferentially selects those encoding less used knowledge for learning new inputs, and in this manner prevents catastrophic forgetting; forcing a few neurons to learn particular concepts counterbalances STDP's tendency towards generalization and favors the representation of concepts with diverse degrees of generality; finally, input modulation in the querying process promotes the firing of neurons encoding the most specific information, thus helping the agent to make accurate predictions in non-monotonic contexts. All these methods rely on individual neurons' properties, such as the weight sum of their incoming/forward connections, the time of their last spike and other similar metrics.
As regards scalability to larger environments, the number of neurons needed in each layer of the network (interface, O-neurons, A-neurons) grows linearly with the number of items (features, object concepts, action concepts) to encode; the number of synapses needed on each O-neuron grows linearly with the maximum number of features to be learned for a given object/situation. However, it is important to observe that these are upper bounds: the agent's ability to form general concepts out of individual experiences, and to use them to make decisions, means that it is not necessary to encode each individual object with all its features, nor each performed action.
Future developments of the architecture include endowing the agent with planning abilities. To do so, a notion of executable action law should be defined (see Section 3.3). The agent would also need to build its own set of possible situations (states) online (the set of its object concepts could probably be used to this end). Finally, the decision system should be augmented to represent goals and embed a cost function and a planning algorithm.
However, further work remains to be done to allow the agent to live in more realistic environments. Notably, it would be necessary to make the agent able to use incomplete information as input for learning and querying, as real-world agents' observations are rarely complete. It would then be interesting to make the agent able to query its semantic memory for object properties given some partial input. It would also be useful to implement negation in the network, to allow the agent to represent the fact that a given object does not have a given feature. Neural inhibition could probably be used for this, but the appropriate learning rules remain to be found. Taken together, these two improvements would bring the agent to draw non-monotonic inferences in the spirit of Grimaud (2016). It would also be desirable to make the agent able to distinguish between objects and their locations, as actions can modify one, the other, or both. Biological brains process the "what" and the "where" components of observations in two separate pathways before reunifying them, and this could be a source of inspiration. A further line of research would be to investigate how an agent should decide between Exploration and Exploitation modes in an open world. Intuitively, the choice between these two modes should depend on the agent's estimation of the risk associated with explorative behavior: the agent should refrain from entering Exploration mode in situations where the assessed risk is high, while allowing itself more exploration in situations where it is low. Yet a correct formalization of this idea remains to be found.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by the ANR (Agence Nationale de la Recherche) project ALoRS ("Action, Logical Reasoning and Spiking networks") [ANR-21-CE23-0018-01].
Declaration of Competing Interest
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Appendices
The network’s parameters were set as follows:
Interface neurons:
Lowering these spike thresholds increases the predictions' Hit Rates while decreasing their Correctness. Reciprocally, increasing these spike thresholds increases the predictions' Correctness but decreases their Hit Rates.
O-neurons:
For …. For …. Upper bound for ….
A-neurons:
Upper bound for ….
Research was carried out on a MacBook Pro with Apple M1 Max chip (2022). Operating System: macOS-15.6-arm64-arm-64bit. We used Python with Spyder IDE.
Spyder version: 6.0.5 (standalone); Python version: 3.11.11 (64-bit); Qt version: 5.15.8; PyQt5 version: 5.15.9.
