Abstract
In this paper, we review recent approaches for explaining concepts in neural networks. Concepts can act as a natural link between learning and reasoning: once the concepts that a neural learning system uses have been identified, one can integrate them with a reasoning system for inference, or use a reasoning system to act upon them to improve or enhance the learning system. Moreover, concept knowledge can not only be extracted from neural networks but also inserted into neural network architectures. Since integrating learning and reasoning is at the core of neuro-symbolic AI, the insights gained from this survey can serve as an important step towards realizing neuro-symbolic AI based on explainable concepts.
Introduction
In recent years, neural networks have been successful in tasks that were once regarded as requiring human-level intelligence, such as understanding and generating images and texts, conducting dialogues, and controlling robots to follow instructions [25,40,44]. However, their decision-making is often not explainable, which undermines user trust and hinders their use in sensitive or critical domains, such as automation, law, and medicine. One way to overcome this limitation is to make neural networks explainable, e.g., by designing them to generate explanations or by using a post-hoc explanation method that analyzes the behavior of a neural network after it has been trained.
This paper reviews explainable artificial intelligence (XAI) methods with a focus on explaining how neural networks learn concepts. Concepts can act as primitives for building complex rules and thus present themselves as a natural link between learning and reasoning [50], which is at the core of neuro-symbolic AI [23,30,33,55,56]. On the one hand, identifying the concepts that a neural network uses for a given input can inform the user about what information the network is using to generate its output [5,22,27,28,35,54]. Combined with an approach to extract all relevant concepts and their (causal) relationships, this could make it possible to generate explanations in logical or natural language that faithfully reflect the decision procedure of the network. On the other hand, the identified concepts can help a symbolic reasoner intervene in the neural network, so that the network can be debugged by modifying the concepts [1,6,28,34].
Some XAI surveys have been published in recent years [15,24,31,32,45,47,49]. However, almost all of them are mainly concerned with the use of saliency maps to highlight important input features. Only a few surveys include concept explanation as a way to explain neural networks. A recent survey in this vein is by Casper et al. [8], which discusses a broad range of approaches to explaining the internals of neural networks. However, due to its broader scope, the survey does not provide detailed descriptions of methods for explaining concepts and misses recent advances in the field. The surveys by Schwalbe [51] and Sajjad et al. [48], on the other hand, are dedicated to specific kinds of concept explanation methods with a focus on either vision [51] or natural language processing [48] and are, therefore, limited in scope, failing to analyze the two areas together.
We categorize concept explanation approaches and structure this survey based on whether they explain concepts at the level of individual neurons (Section 2) or at the level of layers (Section 3). The last section summarizes this survey with open questions.
Neuron-level explanations
The smallest entity in a neural network that can represent a concept is a neuron [48], which could be – in a broader sense – also a unit or a filter in a convolutional neural network [5]. In this section, we survey approaches that explain, in a post-hoc manner, concepts that a neuron of a pre-trained neural network represents, either by comparing the similarity between a concept and the activation of the neuron (see Section 2.1) or by detecting the causal relationship between a concept and the activation of the neuron (see Section 2.2).
Using similarities between concepts and activations
In this category, the concept a neuron is representing is explained by comparing the concept with the activations of the neuron when the concept is passed as an input to the model. The network dissection approach by Bau et al. [5] is arguably the most prominent approach in this category, which is mainly applied to computer vision models. In this approach, a set
Fong et al. [19] question whether a concept has to be represented by a single convolutional filter alone or whether it can be represented by a linear combination of filters. They show that the latter leads to a better representation of the concept and also suggest using binary classification to measure how well filters represent a concept. Complementary to that extension, Mu et al. [35] investigate how to better approximate what a single filter represents. To this end, they assume that a filter can represent a Boolean combination of concepts (e.g.,
One strong assumption made by the network dissection approach is the availability of a comprehensive set
The dissection approach can also be used in generative vision models. Bau et al. [6] identify that units of generative adversarial networks [21] learn concepts similar to network dissection and that one can intervene on the units and remove specific concepts to change the output image (e.g., removing units representing the concept

Fig. 1. Neuron-level explanation using similarities between concepts and activations. Depicted is the network dissection approach, which compares the segmented concept in the input with the activation mask of a neuron [5].
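The comparison between a segmented concept and the activation mask of a neuron can be illustrated with a small sketch. It assumes that the activation map has already been upsampled to the resolution of the concept masks, that the binarization threshold is a fixed hypothetical value, and that the similarity is measured by intersection over union (IoU). All names (`binarize`, `iou`, `dissect_neuron`) and the toy data are illustrative and not taken from [5].

```python
from typing import Dict, List, Tuple

Mask = List[List[bool]]


def binarize(activation: List[List[float]], threshold: float) -> Mask:
    """Turn an (already upsampled) activation map into a binary mask."""
    return [[value > threshold for value in row] for row in activation]


def iou(mask_a: Mask, mask_b: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    intersection = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += int(a and b)
            union += int(a or b)
    return intersection / union if union else 0.0


def dissect_neuron(
    activation: List[List[float]],
    concept_masks: Dict[str, Mask],
    threshold: float,
) -> Tuple[str, Dict[str, float]]:
    """Score every concept against one neuron and return the best match."""
    activation_mask = binarize(activation, threshold)
    scores = {name: iou(activation_mask, mask) for name, mask in concept_masks.items()}
    return max(scores, key=scores.get), scores


# Toy data (hypothetical): one 3x4 activation map and two segmented concepts.
T, F = True, False
activation = [
    [0.1, 0.2, 0.9, 0.8],
    [0.0, 0.3, 0.7, 0.9],
    [0.1, 0.1, 0.2, 0.6],
]
concept_masks = {
    "tree": [[F, F, T, T], [F, F, T, T], [F, F, F, F]],
    "sky": [[T, T, F, F], [F, F, F, F], [F, F, F, F]],
}

best_concept, scores = dissect_neuron(activation, concept_masks, threshold=0.5)
print(best_concept, scores)
```

In this toy case the neuron overlaps most with the "tree" mask. In practice, the scores would be aggregated over a whole dataset of segmented images rather than computed for a single input.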
Using causal relationships between concepts and activations
In this category, the concepts that a neuron is representing are explained by analyzing the causal relationship either (i) between the input concept and the neuron by intervening on the input and measuring the neural activation or (ii) between the neuron and the output concept by intervening on the neural activation and measuring the probability in predicting the concept. This approach is often used for explaining neurons of NLP models [48], where the types of concepts can be broader (e.g., subject-verb behavior, causal relationship, semantic tags).
The first line of work investigates the influence of a concept in the input on the activation of a neuron by intervening in the input. Kádár et al. [26] find the n-grams (i.e., a sequence of n words) that have the largest influence on the activation of a neuron by measuring the change in its activations when a word is removed from the n-grams. Na et al. [36] first identify k sentences that most highly activate a filter of a CNN-based NLP model. From these k sentences, they extract concepts by breaking down each sentence into a set of consecutive word sequences that form a meaningful chunk. Then they measure the contribution of each concept to the filter’s activations by first repeating the concept to create a synthetic sentence of a fixed length (to normalize the input’s contribution to the unit across different concepts) and then measuring the mean value of the filter’s activations.
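The two input interventions described above can be sketched as follows. This is a simplified illustration, not the procedure of [26] or [36]: the neuron is replaced by a hypothetical scoring function (`toy_neuron`), the effect of removing a word is measured as a plain difference in activation, and the helper names are our own.

```python
from typing import Callable, List, Tuple

Neuron = Callable[[List[str]], float]


def omission_scores(tokens: List[str], neuron: Neuron) -> List[Tuple[str, float]]:
    """Change in the neuron's activation when each token is removed in turn."""
    base = neuron(tokens)
    scores = []
    for i, token in enumerate(tokens):
        reduced = tokens[:i] + tokens[i + 1:]
        scores.append((token, base - neuron(reduced)))
    return scores


def replicated_input(concept: List[str], length: int) -> List[str]:
    """Repeat a concept (word sequence) to build a synthetic sentence of fixed length."""
    repeats = -(-length // len(concept))  # ceiling division
    return (concept * repeats)[:length]


# Hypothetical stand-in for the activation of a single neuron.
TOY_WEIGHTS = {"not": -1.0, "good": 2.0, "very": 0.5}


def toy_neuron(tokens: List[str]) -> float:
    return sum(TOY_WEIGHTS.get(token, 0.0) for token in tokens)


tokens = ["the", "film", "was", "very", "good"]
scores = omission_scores(tokens, toy_neuron)
most_influential = max(scores, key=lambda pair: abs(pair[1]))
print(most_influential)

synthetic = replicated_input(["very", "good"], length=5)
print(synthetic, toy_neuron(synthetic))
```

With a real model, `toy_neuron` would be replaced by a function that runs the network on the token sequence and reads out the activation of the neuron or filter of interest.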
The second line of work investigates the role of a neuron in generating a concept by intervening in the activation of the neuron. Dai et al. [11] investigate the factual linguistic knowledge of the BERT model [13], a widely used pre-trained model for text classification, which is pre-trained, among other tasks, by predicting masked words in a sentence. In this approach, given relational facts with a masked word (e.g., “Rome is the capital of [MASK]”), each neuron’s contribution to predicting the mask is measured using the integrated gradients method [52]. To verify the causal role of the neuron that is supposed to represent a concept, the authors also intervene in the neuron’s activation (by suppressing or doubling it) and measure the change in accuracy in predicting the concept. Finlayson et al. [18] analyze whether a neuron of a transformer-based language model (e.g., GPT-2 [43]) has acquired the concept of conjugation. The authors determine which neuron contributes most to the conjugation of a verb by using causal mediation analysis [53]. To this end, they first set the activation of a neuron to the value it would have taken if there had been an intervention on the input (e.g., if the subject in the input sentence had been changed from singular to plural) and then measure the change between the predictions of the correct conjugation of a verb with and without the intervention (see Fig. 2). Meng et al. [34] also apply causal mediation analysis to GPT-2 to understand which neurons memorize factual knowledge, and they modify specific facts (e.g., “The Eiffel Tower is in Paris” is modified to “The Eiffel Tower is in Rome”). The data they use consists of triples of the form (subject, relation, object), and the model has to predict the object given the subject and the relation. They discover that the neurons in the middle-layer feed-forward modules of GPT-2 are the most relevant for encoding factual information, and they implement a weight modifier that changes the values of the weights to alter the factual knowledge.
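The intervention step, in which the activation of a neuron is suppressed or doubled and the change in the prediction is measured, can be sketched on a toy network. The sketch below is not an implementation of integrated gradients or of causal mediation analysis. It only illustrates the activation intervention itself, and the two-layer network, its weights, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def forward(
    x: List[float],
    w_hidden: List[List[float]],
    w_out: List[List[float]],
    scale: Optional[Dict[int, float]] = None,
) -> List[float]:
    """Two-layer ReLU network; `scale` maps a hidden-neuron index to a multiplier."""
    scale = scale or {}
    hidden = []
    for j, weights in enumerate(w_hidden):
        pre_activation = sum(w * xi for w, xi in zip(weights, x))
        hidden.append(max(0.0, pre_activation) * scale.get(j, 1.0))
    logits = [sum(w * h for w, h in zip(weights, hidden)) for weights in w_out]
    return softmax(logits)


def intervention_effects(
    x: List[float],
    w_hidden: List[List[float]],
    w_out: List[List[float]],
    target: int,
) -> Dict[int, Tuple[float, float]]:
    """Change in the target probability when each neuron is suppressed or doubled."""
    base = forward(x, w_hidden, w_out)[target]
    effects = {}
    for j in range(len(w_hidden)):
        suppressed = forward(x, w_hidden, w_out, {j: 0.0})[target]
        doubled = forward(x, w_hidden, w_out, {j: 2.0})[target]
        effects[j] = (suppressed - base, doubled - base)
    return effects


# Toy network (hypothetical weights): 2 inputs, 3 hidden neurons, 2 outputs.
x = [1.0, 0.5]
w_hidden = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
w_out = [[1.0, 0.1, 0.0], [-1.0, 0.1, 0.0]]

effects = intervention_effects(x, w_hidden, w_out, target=0)
most_causal = max(effects, key=lambda j: abs(effects[j][0]))
print(most_causal, effects)
```

Here only the first hidden neuron changes the prediction noticeably when it is suppressed, which is the kind of evidence used to attribute a concept to a neuron.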

Fig. 2. Neuron-level explanation using causal relationships between concepts and activations. In causal mediation analysis, the activation of a neuron is set to the value it would have taken if there had been an intervention on the input (here, the subject in the input sentence is changed from singular to plural). Afterward, the change between the predictions of the correct conjugation of a verb with and without the intervention is measured [18].
Layer-level explanations
Concepts can also be represented by a whole layer as opposed to a neuron or a convolutional filter, as mentioned in the paragraph about the work by Fong et al. [19] in Section 2.1. This can be achieved in a post-hoc manner for a pre-trained model by passing examples of a concept dataset
Using vectors to explain concepts: Concept activation vectors
A concept activation vector (CAV) introduced by Kim et al. [27] is a continuous vector that corresponds to a concept represented by a layer of a neural network f (see Fig. 3). Let
CAVs can be used in many different ways. Nejadgholi et al. [37] use CAVs to identify the sensitivity of abusive language classifiers with respect to implicit types (as opposed to explicit types) of abusive language. In contrast to the original approach [27], which obtains CAVs by taking the vector normal to the decision boundary, they obtain CAVs by simply averaging over the activations

Fig. 3. Layer-level explanation using vectors to explain concepts. For each concept C positive examples
An issue with the original approach for learning CAVs is that one needs to prepare a set of concept labels and images to learn the CAVs. Ghorbani et al. [20] partially tackle this issue by preparing images of the same class and then segmenting them with multiple resolutions. The clusters of resulting segments then form concepts and can be used for TCAV. As corresponding concept labels are missing, the concepts need to be manually inspected. Yeh et al. [58] circumvent the problem of preparing a concept dataset by training CAVs together with a model on the original image classification dataset. To this end, they compute a vector-valued score, where each value corresponds to a learnable concept and indicates to which degree the concept is present in the receptive field of the convolutional layer (computed by building a scalar product). The score is then passed to a multilayer perceptron (MLP) to perform classification.
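The two ways of obtaining a CAV mentioned in this section, namely as the vector normal to the decision boundary of a linear classifier [27] and as an average of activations [37], can be sketched as follows. The sketch rests on several assumptions: the layer activations are given as toy 3-dimensional vectors, the linear classifier is a logistic regression trained by plain gradient descent, the average is taken over the activations of the concept examples, and the sensitivity of a (hypothetical) classification head to a concept is approximated by a finite-difference directional derivative along the CAV.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def train_linear_classifier(
    pos: List[Vector], neg: List[Vector], epochs: int = 200, lr: float = 0.5
) -> Tuple[Vector, float]:
    """Logistic regression separating concept activations from other activations."""
    data = [(a, 1.0) for a in pos] + [(a, 0.0) for a in neg]
    dim = len(pos[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for a, y in data:
            z = sum(wi * ai for wi, ai in zip(w, a)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad_w = [g + err * ai for g, ai in zip(grad_w, a)]
            grad_b += err
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b


def cav_from_classifier(pos: List[Vector], neg: List[Vector]) -> Vector:
    """CAV as the unit normal of the linear decision boundary."""
    w, _ = train_linear_classifier(pos, neg)
    return normalize(w)


def cav_from_mean(pos: List[Vector]) -> Vector:
    """CAV as the normalized mean activation of the concept examples (assumption)."""
    mean = [sum(a[i] for a in pos) / len(pos) for i in range(len(pos[0]))]
    return normalize(mean)


def concept_sensitivity(
    head: Callable[[Vector], float], activation: Vector, cav: Vector, eps: float = 1e-4
) -> float:
    """Finite-difference directional derivative of the head along the CAV."""
    shifted = [a + eps * c for a, c in zip(activation, cav)]
    return (head(shifted) - head(activation)) / eps


# Toy activations (hypothetical): the concept shifts the first dimension only.
signs = [(0.5, 0.5), (0.5, -0.5), (-0.5, 0.5), (-0.5, -0.5)]
concept_acts = [[2.0, s1, s2] for s1, s2 in signs]
random_acts = [[0.0, s1, s2] for s1, s2 in signs]

cav_classifier = cav_from_classifier(concept_acts, random_acts)
cav_mean = cav_from_mean(concept_acts)


def toy_head(activation: Vector) -> float:
    """Hypothetical class logit computed from the layer activation."""
    return 3.0 * activation[0] - activation[1]


sensitivity = concept_sensitivity(toy_head, [1.0, 0.2, 0.0], cav_mean)
print(cav_classifier, cav_mean, sensitivity)
```

On this symmetric toy data both constructions yield the same direction. On real activations they generally differ, which is why the choice between them is a design decision.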
Similar to the CAV-based approaches in Section 3.1, probing uses a classifier to explain concepts. However, instead of training a binary linear classifier for each concept

Fig. 4. Layer-level explanation using a classifier to explain concepts. In this example, a pre-trained model takes as its input a sentence and a probing classifier is applied to the activation of the highlighted layer to check whether the activation encodes the concept of sentence length [2].
Probing, which is designed as a layer-level explanation method, can also be combined with a neuron-level explanation method (see Section 2) by applying the probing classifier only to neurons that are relevant for the classification [14]. Such neurons can be found by applying elastic-net regularization to the classifier, which constrains both the L1- and the L2-norm of the classifier weights.
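How elastic-net regularization singles out relevant neurons can be sketched with a small probing classifier. The sketch assumes a logistic-regression probe trained by proximal gradient descent, where the L1 term is handled by soft-thresholding and the L2 term by weight decay. The activations, the hyperparameters, and the function names are hypothetical and do not reproduce the setup of [14].

```python
import math
from typing import List, Tuple


def soft_threshold(value: float, amount: float) -> float:
    """Proximal operator of the L1 penalty: shrink towards zero."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def train_probe(
    acts: List[List[float]],
    labels: List[int],
    l1: float = 0.05,
    l2: float = 0.05,
    lr: float = 0.5,
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """Logistic-regression probe with an elastic-net (L1 + L2) penalty."""
    dim, n = len(acts[0]), len(acts)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for a, y in zip(acts, labels):
            z = sum(wi * ai for wi, ai in zip(w, a)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad_w = [g + err * ai for g, ai in zip(grad_w, a)]
            grad_b += err
        # Gradient step on the loss and the L2 term, then L1 soft-thresholding.
        w = [
            soft_threshold(wi - lr * (g / n + l2 * wi), lr * l1)
            for wi, g in zip(w, grad_w)
        ]
        b -= lr * grad_b / n
    return w, b


# Toy layer activations (hypothetical): only the third neuron encodes the concept.
acts = [
    [s0, s1, c]
    for c in (1.0, -1.0)
    for s0 in (1.0, -1.0)
    for s1 in (1.0, -1.0)
]
labels = [1 if a[2] > 0 else 0 for a in acts]

weights, bias = train_probe(acts, labels)
selected_neurons = [j for j, wj in enumerate(weights) if abs(wj) > 1e-6]
print(weights, selected_neurons)
```

The L1 part drives the weights of uninformative neurons to exactly zero, so the neurons with nonzero weights are the candidates for a subsequent neuron-level analysis.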
The concepts learned by such probing classifiers can be combined with a knowledge base to provide richer explanations. Ribeiro and Leite [46] use identified concepts as evidence to draw conclusions from a set of axioms in a knowledge base (e.g., given an axiom
Since the probing classifier is trained independently of the pre-trained model, it has been pointed out that the pre-trained model does not necessarily leverage the same features that the classifier uses for predicting a given concept; that is, what the probing classifier detects may be merely a correlation between the activation and the concept [3,7].

Fig. 5. Layer-level explanation using the concept bottleneck model approach [28]. Each neuron in the concept bottleneck
Different from the neuron-based approach in Section 2, where concepts are learned in a post-hoc manner, in a concept bottleneck model (CBM) [28], each concept is represented by a unique neuron in the bottleneck layer
One of the main limitations of CBMs is the need for the aforementioned concept labels, which might not be available for specific tasks. Several recent approaches overcome this limitation [38,57,59]. The main idea behind these approaches is using an external resource to obtain a set
Conclusion
In this survey, we have reviewed recent methods for explaining concepts in neural networks. We have covered different approaches that range from analyzing individual neurons to learning classifiers for a whole layer. As witnessed by the increasing number of recent papers, this is an active research area in which much remains to be discovered, for example, by empirically comparing or integrating different approaches. With the progress of concept extraction from neural networks, integrating the learned neural concepts with symbolic representations (also known as neuro-symbolic integration) is again receiving increasing attention [4,9,12,17,29,46]. In the near future, we expect tighter integration between neural models and symbolic rules via concept representations to make the models more transparent and easier to control, providing ample opportunities for new forms of integration in neuro-symbolic AI.
Acknowledgement
The authors gratefully acknowledge support from the DFG (CML, MoReSpace, LeCAREbot), BMWK (SIDIMO, VERIKAS), and the European Commission (TRAIL, TERAIS). We would like to thank Cornelius Weber for valuable comments on this paper.
