From neural activations to concepts: A survey on explaining concepts in neural networks

Abstract

In this paper, we review recent approaches for explaining concepts in neural networks. Concepts can act as a natural link between learning and reasoning: once the concepts that a neural learning system uses are identified, one can integrate those concepts with a reasoning system for inference or use a reasoning system to act upon them to improve or enhance the learning system. Conversely, knowledge can not only be extracted from neural networks, but concept knowledge can also be inserted into neural network architectures. Since integrating learning and reasoning is at the core of neuro-symbolic AI, the insights gained from this survey can serve as an important step towards realizing neuro-symbolic AI based on explainable concepts.

Keywords

Explainable artificial intelligence, concept explanation, neuro-symbolic integration

1. Introduction

In recent years, neural networks have been successful in tasks that were regarded as requiring human-level intelligence, such as understanding and generating images and texts, performing dialogues, and controlling robots to follow instructions [25,40,44]. However, their decision-making is often not explainable, which undermines user trust and limits their use in sensitive or critical domains, such as automation, law, and medicine. One way to overcome this limitation is to make neural networks explainable, e.g., by designing them to generate explanations or by using a post-hoc explanation method that analyzes the behavior of a neural network after it has been trained.

This paper reviews explainable artificial intelligence (XAI) methods with a focus on explaining how neural networks learn concepts. Concepts can act as primitives for building complex rules and thus present themselves as a natural link between learning and reasoning [50], which is at the core of neuro-symbolic AI [23,30,33,55,56]. On the one hand, identifying the concepts that a neural network uses for a given input can inform the user about what information the network is using to generate its output [5,22,27,28,35,54]. Combined with an approach to extract all relevant concepts and their (causal) relationships, one could generate explanations in logical or natural language that faithfully reflect the decision procedure of the network. On the other hand, the identified concepts can help a symbolic reasoner intervene in the neural network such that debugging the network becomes possible by modifying the concepts [1,6,28,34].

Some XAI surveys have been published in recent years [15,24,31,32,45,47,49]. However, almost all of them are mainly concerned with the use of saliency maps to highlight important input features. Only a few surveys include concept explanation as a way to explain neural networks. A recent survey in this vein is by Casper et al. [8], which discusses a broad range of approaches to explaining the internals of neural networks. However, due to its broader scope, the survey does not provide detailed descriptions of methods for explaining concepts and misses recent advances in the field. The surveys by Schwalbe [51] and Sajjad et al. [48], on the other hand, are dedicated to specific kinds of concept explanation methods with a focus on either vision [51] or natural language processing [48] and are, therefore, limited in scope, failing to analyze the two areas together.

We categorize concept explanation approaches and structure this survey based on whether they explain concepts at the level of individual neurons (Section 2) or at the level of layers (Section 3). The last section summarizes this survey with open questions.

2. Neuron-level explanations

The smallest entity in a neural network that can represent a concept is a neuron [48], which could be – in a broader sense – also a unit or a filter in a convolutional neural network [5]. In this section, we survey approaches that explain, in a post-hoc manner, concepts that a neuron of a pre-trained neural network represents, either by comparing the similarity between a concept and the activation of the neuron (see Section 2.1) or by detecting the causal relationship between a concept and the activation of the neuron (see Section 2.2).

2.1. Using similarities between concepts and activations

In this category, the concept that a neuron represents is explained by comparing the concept with the activations of the neuron when the concept is passed as an input to the model. The network dissection approach by Bau et al. [5], which is mainly applied to computer vision models, is arguably the most prominent approach in this category. In this approach, a set $\mathcal{C}$ of concepts is prepared as well as a set $X_{C}$ of images for each concept $C \in \mathcal{C}$. Then the activations of a convolutional filter are measured for each input $x \in X_{C}$. Afterward, the activation map is thresholded to generate a binary activation mask $M(x)$ and scaled up to be compared with the original concept (e.g., concept head) in the binary segmentation mask $L_{C}(x)$ of the input $x$ (e.g., the head segment of an image with a bird). See Fig. 1 for an illustration. Then, to measure to which degree concept $C$ is represented by the convolutional filter, the dataset-wide intersection over union (IoU) metric is computed, which is defined as $IoU(C) = \sum_{x \in X_{C}} |M(x) \cap L_{C}(x)| / \sum_{x \in X_{C}} |M(x) \cup L_{C}(x)|$. If the IoU value is above a given threshold, the convolutional filter is considered to represent concept $C$. Several extensions of this approach have been introduced.
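The dataset-wide IoU defined above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation of [5]: the function names, the toy activation values, and the fixed threshold of 0.5 are assumptions made for this example, and the activation maps are assumed to be already scaled to the resolution of the segmentation masks.

```python
def threshold_mask(activation_map, threshold):
    """Binarize an (already upsampled) activation map into a mask M(x)."""
    return [[1 if a > threshold else 0 for a in row] for row in activation_map]


def dataset_iou(activation_masks, concept_masks):
    """Sum intersections and unions over all images before dividing."""
    intersection = 0
    union = 0
    for m_mask, c_mask in zip(activation_masks, concept_masks):
        for m_row, c_row in zip(m_mask, c_mask):
            for m_px, c_px in zip(m_row, c_row):
                intersection += m_px & c_px
                union += m_px | c_px
    return intersection / union if union else 0.0


# Two toy 2x3 activation maps of one filter (hypothetical values).
activations = [
    [[0.9, 0.8, 0.1], [0.2, 0.7, 0.0]],
    [[0.1, 0.6, 0.9], [0.0, 0.1, 0.8]],
]
# Binary segmentation masks L_C(x) of the same two images (hypothetical).
concept_masks = [
    [[1, 1, 0], [0, 0, 0]],
    [[0, 0, 1], [0, 0, 1]],
]

masks = [threshold_mask(a, threshold=0.5) for a in activations]
iou = dataset_iou(masks, concept_masks)
print(round(iou, 3))
```

Note that intersections and unions are accumulated over the whole image set before the division, which is what distinguishes the dataset-wide metric from an average of per-image IoU values.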

Fong et al. [19] question whether a concept has to be represented by a single convolutional filter alone or whether it can be represented by a linear combination of filters. They show that the latter leads to a better representation of the concept and also suggest using binary classification to measure how well filters represent a concept. Complementary to that extension, Mu et al. [35] investigate how to better approximate what a single filter represents. To this end, they assume that a filter can represent a Boolean combination of concepts (e.g., (water OR river) AND NOT blue) and show that this compositional explanation of concepts leads to a higher IoU. An intuitive extension of compositional explanations is natural language explanations. The approach called MILAN by Hernandez et al. [22] finds such a natural language explanation as a sequence $d$ of words that maximizes the pointwise mutual information between $d$ and a set $E$ of image regions that maximally activate the filter (i.e., $\arg\max_{d} \log P(d \mid E) - \log P(d)$). In this approach, the two probabilities $P(d \mid E)$ and $P(d)$ are approximated by an image captioning model and a language model, respectively, which are trained on a dataset that the authors curated.
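To make the pointwise mutual information criterion concrete, the following sketch selects the best description from a fixed candidate list. It is an illustration under strong assumptions: in MILAN the two probabilities come from trained models and the search is over word sequences, whereas here the candidate list and all probability values are hypothetical and hard-coded.

```python
import math


def best_description(candidates, p_d_given_e, p_d):
    """Return the candidate d that maximizes log P(d | E) - log P(d)."""
    def pmi(d):
        return math.log(p_d_given_e[d]) - math.log(p_d[d])
    return max(candidates, key=pmi)


# Hypothetical probabilities for three candidate descriptions.
p_d_given_e = {"an object": 0.30, "a dog": 0.20, "dog ears": 0.10}
p_d = {"an object": 0.25, "a dog": 0.05, "dog ears": 0.005}

choice = best_description(list(p_d), p_d_given_e, p_d)
print(choice)
```

In this toy setting, the generic description "an object" has the highest conditional probability, but subtracting $\log P(d)$ penalizes descriptions that are likely regardless of the image regions, so the more specific description is selected.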

One strong assumption made by the network dissection approach is the availability of a comprehensive set $\mathcal{C}$ of concepts and corresponding labeled images to provide accurate explanations of neurons. Such a set is, however, difficult to obtain in general. Oikarinen et al. [39] tackle this problem with their CLIP-Dissect method, which is based on the CLIP [42] vision-language model. (CLIP embeds images and texts in the same vector space, allowing for measuring the similarity between texts and images.) To explain the concept that a convolutional filter $k$ represents, they choose a set $X_{k}$ of the most highly activating images for filter $k$, then use CLIP to measure the similarity between $X_{k}$ and each concept $C \in \mathcal{C}$ (here, the concept set $\mathcal{C}$ consists of the 20K most common English words), and finally select the best matching concept.
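The matching step can be sketched as follows. This is a simplified illustration and not the method of [39]: the three-dimensional vectors stand in for CLIP image and text embeddings, and averaging cosine similarities over the top-activating images is an assumption made here, since the published method may use a different similarity function.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def best_concept(image_embeddings, concept_embeddings):
    """Pick the concept word that is on average most similar to the images."""
    def mean_similarity(concept):
        text_vec = concept_embeddings[concept]
        total = sum(cosine(img, text_vec) for img in image_embeddings)
        return total / len(image_embeddings)
    return max(concept_embeddings, key=mean_similarity)


# Hypothetical embeddings of the most highly activating images of a filter.
top_images = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
# Hypothetical text embeddings of three concept words.
concepts = {
    "stripes": [1.0, 0.0, 0.0],
    "water": [0.0, 1.0, 0.0],
    "sky": [0.0, 0.0, 1.0],
}

label = best_concept(top_images, concepts)
print(label)
```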

The dissection approach can also be used in generative vision models. Bau et al. [6] show that units of generative adversarial networks [21] learn concepts that can be identified in a way similar to network dissection and that one can intervene on the units and remove specific concepts to change the output image (e.g., removing units representing the concept tree leads to output images with fewer trees in their scenes).

Fig. 1.

Neuron-level explanation using similarities between concepts and activations. Depicted is the network dissection approach, which compares the segmented concept in the input with the activation mask of a neuron [5].

2.2. Using causal relationships between concepts and activations

In this category, the concepts that a neuron is representing are explained by analyzing the causal relationship either (i) between the input concept and the neuron by intervening on the input and measuring the neural activation or (ii) between the neuron and the output concept by intervening on the neural activation and measuring the probability in predicting the concept. This approach is often used for explaining neurons of NLP models [48], where the types of concepts can be broader (e.g., subject-verb behavior, causal relationship, semantic tags).

The first line of work investigates the influence of a concept in the input on the activation of a neuron by intervening in the input. Kádár et al. [26] find the n-grams (i.e., a sequence of n words) that have the largest influence on the activation of a neuron by measuring the change in its activations when a word is removed from the n-grams. Na et al. [36] first identify k sentences that most highly activate a filter of a CNN-based NLP model. From these k sentences, they extract concepts by breaking down each sentence into a set of consecutive word sequences that form a meaningful chunk. Then they measure the contribution of each concept to the filter’s activations by first repeating the concept to create a synthetic sentence of a fixed length (to normalize the input’s contribution to the unit across different concepts) and then measuring the mean value of the filter’s activations.
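The word-removal idea can be illustrated with a small sketch. The function `toy_neuron` is a hypothetical stand-in for the activation of a real neuron, and the weights inside it are invented for this example. The sketch only shows the bookkeeping of removing one word at a time and recording the change in activation.

```python
def toy_neuron(words):
    """Hypothetical stand-in for a neuron's activation on a word sequence."""
    weights = {"not": 2.0, "good": 1.0, "very": 0.5}
    return sum(weights.get(w, 0.0) for w in words)


def omission_scores(words, neuron):
    """Change in activation when each word is removed in turn."""
    full = neuron(words)
    return {
        (i, w): full - neuron(words[:i] + words[i + 1:])
        for i, w in enumerate(words)
    }


sentence = ["the", "film", "was", "not", "good"]
scores = omission_scores(sentence, toy_neuron)
most_influential = max(scores, key=scores.get)
print(most_influential, scores[most_influential])
```

Keys are (position, word) pairs so that repeated words in a sentence are scored separately.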

The second line of work investigates the role of a neuron in generating a concept by intervening in the activation of the neuron. Dai et al. [11] investigate the factual linguistic knowledge of the BERT model [13], a widely used pre-trained model for text classification, which is pre-trained, among other tasks, by predicting masked words in a sentence. In this approach, given relational facts with a masked word (e.g., “Rome is the capital of [MASK]”), each neuron’s contribution to predicting the mask is measured using the integrated gradients method [52]. To verify the causal role of the neuron that is supposed to represent a concept, the authors also intervene in the neuron’s activation (by suppressing or doubling it) and measure the change in accuracy in predicting the concept. Finlayson et al. [18] analyze whether a neuron of a transformer-based language model (e.g., GPT-2 [43]) has acquired the concept of conjugation. The authors determine which neuron contributes most to the conjugation of a verb by using causal mediation analysis [53]. To this end, they first modify the activation of a neuron to the one that the neuron would have output if there had been an intervention on the input (e.g., the subject in the input sentence was changed from singular to plural) and then measure the amount of change between the predictions of the correct conjugation of a verb with and without the intervention (see Fig. 2). Meng et al. [34] also apply causal mediation analysis to GPT-2 to understand which neurons memorize factual knowledge and modify specific facts (e.g., “The Eiffel Tower is in Paris” is modified to “The Eiffel Tower is in Rome”). The data they use consist of triples of the form (subject, relation, object), and the model has to predict the object given the subject and relation. They discover that the neurons in the middle-layer feed-forward modules in GPT-2 are the most relevant for encoding factual information and implement a weight modifier that changes the values of weights to alter the factual knowledge.
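The suppress-or-double style of intervention can be sketched on a toy model. Everything below is hypothetical: the hidden activations, the output weights, and the two-token vocabulary are invented for illustration, and a real study would intervene on a neuron inside a trained transformer rather than on a hand-written linear readout.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(hidden, w_out, scale=None):
    """Output distribution; `scale` maps a neuron index to a factor."""
    scale = scale or {}
    h = [a * scale.get(i, 1.0) for i, a in enumerate(hidden)]
    logits = [sum(w * a for w, a in zip(row, h)) for row in w_out]
    return softmax(logits)


hidden = [1.0, 0.5, 2.0]      # hypothetical hidden activations
w_out = [
    [0.1, 0.2, 1.5],          # logit weights for the correct token
    [0.3, 0.1, 0.2],          # logit weights for a distractor token
]
target = 0

base = predict(hidden, w_out)[target]
suppressed = predict(hidden, w_out, {2: 0.0})[target]   # set neuron 2 to zero
doubled = predict(hidden, w_out, {2: 2.0})[target]      # double neuron 2
print(round(suppressed, 3), round(base, 3), round(doubled, 3))
```

If the probability of the correct token drops when the neuron is suppressed and rises when it is doubled, this is taken as evidence for the causal role of the neuron in this toy setting.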

Fig. 2.

Neuron-level explanation using causal relationships between concepts and activations. In causal mediation analysis, the activation of a neuron is modified to the one that the neuron would have output if there was an intervention on the input (the subject in the input sentence was changed from singular to plural). Afterward, the amount of change between the predictions of the correct conjugation of a verb with and without the intervention is measured [18].

3. Layer-level explanations

Concepts can also be represented by a whole layer as opposed to a single neuron or convolutional filter, as mentioned in the paragraph about the work by Fong et al. [19] in Section 2.1. This can be achieved in a post-hoc manner for a pre-trained model by passing examples of a concept dataset $\mathcal{C}$ to the model and extracting the activations of a specific layer to train a concept classifier. Two approaches are prominent in layer-level explanations: the first is explaining with concept activation vectors (CAVs) (see Section 3.1) and the second is probing (see Section 3.2). The main difference between the two approaches is that with CAVs a linear binary classifier is trained for each concept $C \in \mathcal{C}$, whereas in probing a multiclass classifier is trained with classification labels that are often related to certain linguistic features (e.g., sentiments, part-of-speech tags). Alternatively, concepts can be built into a layer such that each neuron represents a concept, as was done with localist representations in the early days of neural network research (see Section 3.3).

3.1. Using vectors to explain concepts: Concept activation vectors

A concept activation vector (CAV), introduced by Kim et al. [27], is a continuous vector that corresponds to a concept represented by a layer of a neural network f (see Fig. 3). Let $f = f^{⊤} \circ f^{⊥}$, where $f^{⊥} : R^{m} \to R^{n}$ is the bottom part of the network whose final convolutional layer ℓ is of interest. To identify the existence of a concept C (e.g., the concept stripes) in layer ℓ, network $f^{⊥}$ is first fed with positive examples $x_{C}^{+}$ that contain concept C and negative examples $x_{C}^{-}$ that do not contain the concept, and then their corresponding activations $f^{⊥} (x_{C}^{+}) \in R^{n}$ and $f^{⊥} (x_{C}^{-}) \in R^{n}$ are collected. Next, a linear classifier is learned that distinguishes activations $f^{⊥} (x_{C}^{+})$ from activations $f^{⊥} (x_{C}^{-})$. The vector $v_{C} \in R^{n}$ normal to the decision boundary of the classifier is then a CAV of concept C. One useful feature of a CAV is that it allows for testing how much an input image x is correlated with a concept C (e.g., an image of a zebra and concept stripes), which is called testing with CAVs (TCAV) in [27]. This is accomplished, roughly speaking, by measuring the probability of a concept C having a positive influence on predicting a class label $k \in \{1, \dots, K\}$ on a dataset $X$, i.e., how much moving the latent vector $f^{⊥} (x) \in R^{n}$ in the direction of $v_{C}$ to $f^{⊥} (x) + ϵ \cdot v_{C}$ changes the log-probability of label k when it is fed to $f^{⊤}$, for all images $x \in X$ with class label k.
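The two steps, learning a CAV and testing with it, can be sketched in plain Python. This is a minimal sketch under several assumptions: the two-dimensional activations are invented, the linear classifier is a small logistic regression trained by gradient descent, the top part of the network is replaced by the hypothetical function `log_prob_zebra`, and the directional derivative is approximated by a finite difference with step `eps`.

```python
import math


def train_linear_classifier(pos, neg, lr=0.1, epochs=200):
    """Logistic regression by gradient descent; returns (weights, bias)."""
    w, b = [0.0] * len(pos[0]), 0.0
    data = [(x, 1.0) for x in pos] + [(x, 0.0) for x in neg]
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            w = [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]
            b += lr * (y - p)
    return w, b


def unit(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


def tcav_score(acts, log_prob_k, cav, eps=1e-3):
    """Fraction of inputs whose class-k log-probability rises along the CAV."""
    positive = 0
    for a in acts:
        shifted = [ai + eps * vi for ai, vi in zip(a, cav)]
        if log_prob_k(shifted) > log_prob_k(a):
            positive += 1
    return positive / len(acts)


def log_prob_zebra(a):
    """Hypothetical top part of the network: log-sigmoid of a linear score."""
    return -math.log(1.0 + math.exp(-(1.5 * a[0] - 0.5 * a[1])))


# Hypothetical layer activations for images with and without "stripes".
pos = [[2.0, 0.1], [1.8, -0.2], [2.2, 0.3]]
neg = [[-2.0, 0.0], [-1.7, 0.2], [-2.1, -0.3]]
weights, _ = train_linear_classifier(pos, neg)
cav = unit(weights)  # normal vector of the decision boundary

# Hypothetical activations of images labeled with the class of interest.
zebra_acts = [[1.0, 0.5], [0.2, -0.4], [1.5, 0.0]]
score = tcav_score(zebra_acts, log_prob_zebra, cav)
print(cav, score)
```

A score close to 1 indicates that moving activations in the concept direction increases the class log-probability for most inputs of that class, while a score close to 0 indicates the opposite.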

CAVs can be used in many different ways. Nejadgholi et al. [37] use CAVs to identify the sensitivity of abusive language classifiers with respect to implicit types (as opposed to explicit types) of abusive language. Unlike the original approach [27], which obtains CAVs by taking the vector normal to the decision boundary, they obtain CAVs by simply averaging over the activations $f^{⊥} (x_{C}^{+})$ for all positive samples $x_{C}^{+}$ to mitigate the impact of the choice of random negative samples $x_{C}^{-}$ on determining the decision boundary. Zhou et al. [60] decompose the row vector of the last linear layer for predicting a class label k and represent it as a linear combination of a basis that consists of CAVs using only positive weights. Each positive weight then indicates how much of the corresponding concept is involved in predicting class label k. Similarly, Abid et al. [1] propose an approach that learns a set of CAVs, but for debugging purposes. Given an input image misclassified by a model, a weighted sum of the set of CAVs is computed that leads to correct classification when added to the activations before the last linear layer of the model. In addition to explaining bugs on a conceptual level, this approach allows for identifying spurious correlations in the data.

Fig. 3.

Layer-level explanation using vectors to explain concepts. For each concept C positive examples $x_{C}^{+}$ and negative examples $x_{C}^{-}$ are fed to a pre-trained model to learn the so-called concept activation vector (CAV) $v_{C}$ from the corresponding activations of the target layer [27].

An issue with the original approach for learning CAVs is that one needs to prepare a set of concept labels and images to learn the CAVs. Ghorbani et al. [20] partially tackle this issue by preparing images of the same class and then segmenting them at multiple resolutions. The clusters of resulting segments then form concepts and can be used for TCAV. As corresponding concept labels are missing, the concepts need to be manually inspected. Yeh et al. [58] circumvent the problem of preparing a concept dataset by training CAVs together with a model on the original image classification dataset. To this end, they compute a vector-valued score, where each value corresponds to a learnable concept and indicates to which degree the concept is present in the receptive field of the convolutional layer (computed as a scalar product). The score is then passed to a multilayer perceptron (MLP) to perform classification.

3.2. Using classifiers to explain concepts: Probing

Similar to the CAV-based approaches in Section 3.1, probing uses a classifier to explain concepts. However, instead of training a binary linear classifier for each concept $C \in \mathcal{C}$ to measure the existence of the concept in the activations of a layer, probing uses a classifier for multiclass classification with labels that often represent linguistic features in NLP (e.g., sentiments, part-of-speech tags). For example, given sentences as inputs to a pre-trained NLP model (e.g., BERT [13]), probing allows for evaluating how well the sentence embeddings of the model capture certain syntactic and semantic information, such as the length or the tense of the sentence [2,10,16] (see Fig. 4).
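The probing workflow can be sketched as follows: collect layer activations, fit a multiclass classifier on them, and report held-out accuracy as a measure of how well the layer encodes the property. As a deliberate simplification, a nearest-centroid classifier stands in for the trained probe used in the literature, and all activations and sentence-length labels are hypothetical.

```python
def fit_centroids(activations, labels):
    """Compute one mean activation vector per class label."""
    sums, counts = {}, {}
    for a, y in zip(activations, labels):
        acc = sums.setdefault(y, [0.0] * len(a))
        for i, v in enumerate(a):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def probe_predict(centroids, a):
    """Assign the label of the closest class centroid."""
    def sq_dist(c):
        return sum((ai - ci) ** 2 for ai, ci in zip(a, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))


def probe_accuracy(centroids, activations, labels):
    """Share of held-out activations that the probe labels correctly."""
    hits = sum(probe_predict(centroids, a) == y
               for a, y in zip(activations, labels))
    return hits / len(labels)


# Hypothetical sentence embeddings from one layer, labeled by length bin.
train_acts = [[0.1, 0.0], [0.2, 0.1], [1.0, 0.9],
              [1.1, 1.0], [2.0, 2.1], [2.1, 1.9]]
train_labels = ["short", "short", "medium", "medium", "long", "long"]
test_acts = [[0.0, 0.2], [1.2, 0.8], [1.9, 2.0]]
test_labels = ["short", "medium", "long"]

centroids = fit_centroids(train_acts, train_labels)
accuracy = probe_accuracy(centroids, test_acts, test_labels)
print(accuracy)
```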

Fig. 4.

Layer-level explanation using a classifier to explain concepts. In this example, a pre-trained model takes as its input a sentence and a probing classifier is applied to the activation of the highlighted layer to check whether the activation encodes the concept of sentence length [2].

Probing, which is designed as a layer-level explanation method, can also be combined with a neuron-level explanation method (see Section 2) by applying the probing classifier only to neurons that are relevant for the classification [14]. Finding such neurons can be accomplished by applying elastic-net regularization to the classifier, which penalizes both the L1- and the L2-norm of the classifier weights.
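The regularization term and a simple neuron selection step can be sketched as follows. The penalty is the standard elastic-net form, a weighted sum of the L1-norm and the squared L2-norm. Ranking neurons by the absolute value of their probe weights is an assumption made for this illustration, and the weights and coefficients are hypothetical.

```python
def elastic_net_penalty(weights, l1, l2):
    """Return l1 * ||w||_1 + l2 * ||w||_2^2 for a list of probe weights."""
    l1_term = sum(abs(w) for w in weights)
    l2_term = sum(w * w for w in weights)
    return l1 * l1_term + l2 * l2_term


def salient_neurons(weights, top_k):
    """Indices of the top_k neurons ranked by absolute probe weight."""
    order = sorted(range(len(weights)),
                   key=lambda i: abs(weights[i]), reverse=True)
    return order[:top_k]


# Hypothetical weights of a trained probe, one weight per neuron.
probe_weights = [0.0, -1.5, 0.02, 0.9]

penalty = elastic_net_penalty(probe_weights, l1=0.1, l2=0.01)
selected = salient_neurons(probe_weights, top_k=2)
print(round(penalty, 6), selected)
```

The L1 term pushes the weights of irrelevant neurons towards zero, while the L2 term keeps the weights of correlated neurons from being reduced to a single arbitrary representative.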

The concepts learned by such probing classifiers can be combined with a knowledge base to provide richer explanations. Ribeiro and Leite [46] use identified concepts as evidence to draw conclusions from a set of axioms in a knowledge base (e.g., given an axiom $\mathit{LongFreightTrain} \leftarrow \mathit{LongTrain} \land \mathit{FreightTrain}$ in the knowledge base, identifying both antecedent concepts $\mathit{LongTrain}$ and $\mathit{FreightTrain}$ in the activations explains the presence of the consequent $\mathit{LongFreightTrain}$ in the input). However, one cannot always assume the presence of a knowledge base for a given task. Ferreira et al. [17] weaken this assumption by learning the underlying theory from the identified concepts using an induction framework.
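The train example can be reproduced with a few lines of forward chaining over Horn-style rules. This sketch is only meant to illustrate how detected concepts act as evidence for axioms. The rule encoding as (head, body) pairs and the function name are assumptions made here and do not reflect the reasoning machinery used in [46].

```python
def forward_chain(facts, rules):
    """Derive everything that follows from the facts under the rules."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if head not in derived and all(b in derived for b in body):
                derived.add(head)
                changed = True
    return derived


# Axiom: LongFreightTrain <- LongTrain AND FreightTrain
rules = [("LongFreightTrain", ("LongTrain", "FreightTrain"))]

# Concepts that the probing classifiers detected in the activations.
detected = {"LongTrain", "FreightTrain"}

conclusions = forward_chain(detected, rules) - detected
print(conclusions)
```

If only one of the two antecedent concepts is detected, the rule does not fire and no conclusion is drawn.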

Since the probing classifier is trained independently of the pre-trained model, it has been pointed out that the pre-trained model does not necessarily leverage the same features that the classifier uses for predicting a given concept, i.e., what the probing classifier detects can be merely a correlation between the activation and the concept [3,7].

Fig. 5.

Layer-level explanation using the concept bottleneck model approach [28]. Each neuron in the concept bottleneck $f^{ℓ}$ corresponds to a unique concept (e.g., wing color).

3.3. Using localist representations: Concept bottleneck models

Different from the neuron-based approach in Section 2, where concepts are explained in a post-hoc manner, in a concept bottleneck model (CBM) [28], each concept is represented by a unique neuron in the bottleneck layer $f^{ℓ}$ of a model f (see Fig. 5), which is reminiscent of localist representations [41]. This layer provides information about the existence or the strength of each concept in the input. The output of the bottleneck layer is then used by a classifier or regressor $f^{⊤}$ for the prediction, which allows for explaining which concepts led to the given prediction. Often, the bottom part $f^{⊥}$ of a pre-trained model is used for initializing the layers before the concept bottleneck $f^{ℓ}$, and $f^{ℓ}$ is a linear layer that maps the features from $f^{⊥}$ to concepts. A CBM is therefore $f = f^{⊤} \circ f^{ℓ} \circ f^{⊥}$, where ∘ stands for composition. To train the concept bottleneck $f^{ℓ}$, the training data has to include concept labels in addition to task labels.
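The composition of the three parts can be sketched with hand-set weights. All numbers and the concept names are hypothetical, the sigmoid on top of the linear bottleneck is an assumption made to obtain concept strengths between 0 and 1, and no training is shown. The optional `intervene` argument illustrates how overwriting a concept activation changes the prediction.

```python
import math


def f_bottom(x):
    """Hypothetical feature extractor standing in for a pre-trained backbone."""
    return [x[0] + x[1], x[0] - x[1]]


def f_concepts(features, weights, biases):
    """Linear bottleneck with one sigmoid-activated neuron per concept."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * f for w, f in zip(row, features)) + b)))
        for row, b in zip(weights, biases)
    ]


def f_top(concepts, weights):
    """Linear classifier over concept activations; returns a class index."""
    scores = [sum(w * c for w, c in zip(row, concepts)) for row in weights]
    return scores.index(max(scores))


CONCEPT_NAMES = ["wing color: red", "beak: long"]   # hypothetical concepts
W_C, B_C = [[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]     # bottleneck parameters
W_Y = [[1.0, -1.0], [-1.0, 1.0]]                    # classifier parameters


def cbm(x, intervene=None):
    """Apply f_top after f_concepts after f_bottom, with optional edits."""
    c = f_concepts(f_bottom(x), W_C, B_C)
    for i, value in (intervene or {}).items():
        c[i] = value
    return f_top(c, W_Y), c


label, concepts = cbm([1.0, 2.0])
edited_label, _ = cbm([1.0, 2.0], intervene={0: 0.0, 1: 1.0})
print(label, [round(c, 3) for c in concepts], edited_label)
```

Because the classifier only sees the concept activations, the prediction can be explained, and corrected, entirely in terms of the named concepts.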

One of the main limitations of CBMs is the need for the aforementioned concept labels, which might not be available for specific tasks. Several recent approaches overcome this limitation [38,57,59]. The main idea behind these approaches is to use an external resource to obtain a set $\mathcal{C}$ of concepts relevant to the task. This external resource could be a knowledge base such as ConceptNet [59], the 20K most common English words [38], or a language model like GPT-3 [57]. After obtaining the concept set $\mathcal{C}$, each concept word $C \in \mathcal{C}$ is embedded as a vector $v_{C}$ by means of the CLIP vision-language model (cf. Section 2.1) such that vector $v_{C}$ can be used for computing the strength of concept C for a given input $x \in X$, e.g., by measuring the cosine similarity between $v_{C}$ and the embedding $f^{⊥} (x)$. Finally, the presence of concepts in the concept bottleneck layer allows for inducing logical explanations. For example, Ciravegna et al. [9] induce explanations in disjunctive normal form (DNF) from concept activations and predicted labels, which is similar to the logic-based explanation approaches [17,46] in Section 3.2.
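To give an impression of what a DNF explanation over concept activations looks like, the following sketch builds one naively: every distinct pattern of thresholded concept activations that co-occurs with the target label becomes one conjunction. This is not the induction method of [9], which is considerably more refined. The concept names, activations, predictions, and the threshold of 0.5 are hypothetical.

```python
def induce_dnf(concept_activations, predictions, target, names, threshold=0.5):
    """Build a DNF string with one conjunction per observed concept pattern."""
    minterms = set()
    for acts, pred in zip(concept_activations, predictions):
        if pred == target:
            minterms.add(tuple(a > threshold for a in acts))
    clauses = []
    for pattern in sorted(minterms, reverse=True):
        literals = [n if on else f"NOT {n}" for n, on in zip(names, pattern)]
        clauses.append("(" + " AND ".join(literals) + ")")
    return " OR ".join(clauses)


names = ["wings", "fur"]
acts = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.95, 0.7]]
preds = ["bird", "bird", "cat", "bird"]

formula = induce_dnf(acts, preds, "bird", names)
print(formula)
```

A practical method would additionally simplify the formula, here to the single literal wings, and check it against held-out predictions.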

4. Conclusion

In this survey, we have reviewed recent methods for explaining concepts in neural networks. We have covered different approaches that range from analyzing individual neurons to learning classifiers for a whole layer. As witnessed by the increasing number of recent papers, this is an active research area and a lot is still to be discovered, for example, by empirically comparing or integrating different approaches. With the progress of concept extraction from neural networks, integrating the learned neural concepts with symbolic representations – also known as neuro-symbolic integration – is again receiving increasing attention [4,9,12,17,29,46]. In the near future, we expect tighter integration between neural models and symbolic rules via concept representations to make the models more transparent and easier to control. In conclusion, this line of research is still very active and in development, providing ample opportunities for new forms of integration in neuro-symbolic AI.

Acknowledgement

The authors gratefully acknowledge support from the DFG (CML, MoReSpace, LeCAREbot), BMWK (SIDIMO, VERIKAS), and the European Commission (TRAIL, TERAIS). We would like to thank Cornelius Weber for valuable comments on this paper.

References

Abid

Yuksekgonul

Zou

, Meaningfully debugging model mistakes using conceptual counterfactual explanations, in: Proceedings of the 39th International Conference on Machine Learning, PMLR, 2022, pp. 66–88, ISSN 2640-3498, https://proceedings.mlr.press/v162/abid22a.html.

Adi

Kermany

Belinkov

Lavi

Goldberg

, Fine-grained analysis of sentence embeddings using auxiliary prediction tasks, in: International Conference on Learning Representations, 2016, https://openreview.net/forum?id=BJh6Ztuxl .

Amini

Pimentel

Meister

Cotterell

, Naturalistic Causal Probing for Morpho-Syntax, 2022, arXiv preprint arXiv:2205.07043. http://arxiv.org/abs/2205.07043.

Barbiero

Ciravegna

Giannini

Zarlenga

M.E.

Magister

L.C.

Tonda

Lio

Precioso

Jamnik

Marra

, Interpretable neural-symbolic concept reasoning, in: Proceedings of the 40th International Conference on Machine Learning, PMLR, 2023, pp. 1801–1825, https://proceedings.mlr.press/v202/barbiero23a.html .

Bau

Zhou

Khosla

Oliva

Torralba

, Network dissection: Quantifying interpretability of deep visual representations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6541–6549, https://openaccess.thecvf.com/content_cvpr_2017/html/Bau_Network_Dissection_Quantifying_CVPR_2017_paper.html .

Bau

Zhu

J.-Y.

Strobelt

Zhou

Tenenbaum

J.B.

Freeman

W.T.

Torralba

Dissection

G.A.N.

, Visualizing and understanding generative adversarial networks, in: International Conference on Learning Representations, 2018, https://openreview.net/forum?id=Hyg_X2C5FX .

Belinkov

, Probing classifiers: Promises, shortcomings, and advances, Computational Linguistics 48(1) (2022), 207–219, https://aclanthology.org/2022.cl-1.7 . doi:10.1162/coli_a_00422.

Casper

Rauker

Hadfield-Menell

, Toward transparent AI: A survey on interpreting the inner structures of deep neural networks, in: First IEEE Conference on Secure and Trustworthy Machine Learning, 2023, https://openreview.net/forum?id=8C5zt-0Utdn .

Ciravegna

Barbiero

Giannini

Gori

Liò

Maggini

Melacci

, Logic Explained Networks, Artificial Intelligence (2022), 103822, https://www.sciencedirect.com/science/article/pii/S000437022200162X.

10.

Conneau

Kruszewski

Lample

Barrault

Baroni

, What you can cram into a single backslash$&!#* vector: Probing sentence embeddings for linguistic properties, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 2126–2136, http://aclweb.org/anthology/P18-1198 .

11.

Dai

Dong

Hao

Sui

Chang

Wei

, Knowledge neurons in pretrained transformers, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 8493–8502, https://aclanthology.org/2022.acl-long.581 .

12.

Dalal

Sarker

M.K.

Barua

Vasserman

Hitzler

, Understanding CNN Hidden Neuron Activations Using Structured Background Knowledge and Deductive Reasoning, 2023, arXiv preprint arXiv:2308.03999. http://arxiv.org/abs/2308.03999.

13.

Devlin

Chang

M.-W.

Lee

Toutanova

, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186, https://aclanthology.org/N19-1423 .

14.

Durrani

Sajjad

Dalvi

Belinkov

, Analyzing individual neurons in pre-trained language models, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 4865–4880, https://aclanthology.org/2020.emnlp-main.395 .

15.

Dwivedi

Dave

Naik

Singhal

Omer

Patel

Qian

Wen

Shah

Morgan

Ranjan

, Explainable AI (XAI): Core ideas, techniques, and solutions, ACM Computing Surveys 55(9) (2023), 194:1–194:33. doi:10.1145/3561048.

16.

Ettinger

Elgohary

Resnik

, Probing for semantic evidence of composition by means of simple classification tasks, in: Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, Association for Computational Linguistics, Berlin, Germany, 2016, pp. 134–139, https://aclanthology.org/W16-2524 . doi:10.18653/v1/W16-2524.

17.

Ferreira

Ribeiro

M.d.S.

Gonçalves

Leite

, Looking inside the black-box: Logic-based explanations for neural networks, in: Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning, Vol. 19, 2022, pp. 432–442, https://proceedings.kr.org/2022/45/ .

18.

Finlayson

Mueller

Gehrmann

Shieber

Linzen

Belinkov

, Causal analysis of syntactic agreement mechanisms in neural language models, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 1828–1843, https://aclanthology.org/2021.acl-long.144 .

19.

Fong

Vedaldi

, Net2Vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8730–8738, https://openaccess.thecvf.com/content_cvpr_2018/html/Fong_Net2Vec_Quantifying_and_CVPR_2018_paper.html .

20.

Ghorbani

Wexler

Zou

J.Y.

Kim

, Towards Automatic Concept-Based Explanations, Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, Inc., 2019, https://papers.nips.cc/paper_files/paper/2019/hash/77d2afcb31f6493e350fca61764efb9a-Abstract.html.

21.

Goodfellow

Pouget-Abadie

Mirza

Warde-Farley

Ozair

Courville

Bengio

, Generative adversarial nets, in: Advances in Neural Information Processing Systems 27, Ghahramani

Welling

Cortes

Lawrence

N.D.

Weinberger

K.Q.

, eds, Curran Associates, Inc., 2014, pp. 2672–2680, http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf .

22.

Hernandez

Schwettmann

Bau

Bagashvili

Torralba

Andreas

, Natural Language Descriptions of Deep Visual Features, in: International Conference on Learning Representations, 2022, https://openreview.net/forum?id=NudBMY-tzDr .

23.

Hitzler

Sarker

M.K.

Eberhart

(eds), Compendium of Neurosymbolic Artificial Intelligence, Frontiers in Artificial Intelligence and Applications / Faia, Vol. 369, IOS Press, Washington, 2023. ISBN 978-1-64368-406-2.

24.

Ibrahim

Shafiq

M.O.

, Explainable convolutional neural networks: A taxonomy, review, and future directions, ACM Computing Surveys 55(10) (2023), 206:1–206:37, https://dl.acm.org/doi/10.1145/3563691 . doi:10.1145/3563691.

25.

Jiang

Gupta

Zhang

Wang

Dou

Chen

Fei-Fei

Anandkumar

Zhu

Fan

, VIMA: Robot manipulation with multimodal prompts, in: Proceedings of the 40th International Conference on Machine Learning, PMLR, 2023, pp. 14975–15022, ISSN 2640-3498, https://proceedings.mlr.press/v202/jiang23b.html.

26.

Kádár

Á.

Chrupała

Alishahi

, Representation of linguistic form and function in recurrent neural networks, Computational Linguistics 43(4) (2017), 761–780. doi:10.1162/COLI_a_00300.

27.

Kim

Wattenberg

Gilmer

Cai

Wexler

Viegas

Sayres

, Interpretability beyond feature attribution: Quantitative Testing with Concept Activation Vectors (TCAV), in: Proceedings of the 35th International Conference on Machine Learning, PMLR, 2018, pp. 2668–2677, ISSN 2640-3498, https://proceedings.mlr.press/v80/kim18d.html.

28.

Koh

P.W.

Nguyen

Tang

Y.S.

Mussmann

Pierson

Kim

Liang

, Concept bottleneck models, in: Proceedings of the 37th International Conference on Machine Learning, PMLR, 2020, pp. 5338–5348, ISSN 2640-3498, https://proceedings.mlr.press/v119/koh20a.html.

29.

Lecue

, On the role of knowledge graphs in explainable AI, Semantic Web 11(1) (2020), 41–51, https://content.iospress.com/articles/semantic-web/sw190374. doi:10.3233/SW-190374.

30. J.H. Lee, Sioutis, Ahrens, Alirezaie, Kerzel and Wermter, Chapter 19. Neuro-symbolic spatio-temporal reasoning, in: Compendium of Neurosymbolic Artificial Intelligence, IOS Press, 2023, pp. 410–429, https://ebooks.iospress.nl/doi/10.3233/FAIA230151.

31. Li, Xiong, Li, Wu, Zhang, Liu, Bian and Dou, Interpretable Deep Learning: Interpretation, Interpretability, Trustworthiness, and Beyond, Knowledge and Information Systems 64(12) (2022), 3197–3234. doi:10.1007/s10115-022-01756-8.

32. Madsen, Reddy and Chandar, Post-hoc interpretability for neural NLP: A survey, ACM Computing Surveys 55(8) (2022), 155:1–155:42, https://dl.acm.org/doi/10.1145/3546577.

33. K.J. McGarry, Tait, Wermter and MacIntyre, Rule-extraction from radial basis function networks, in: International Conference on Artificial Neural Networks, Vol. 2, 1999, pp. 613–618, ISSN 0537-9989.

34. Meng, Bau, Andonian and Belinkov, Locating and editing factual associations in GPT, Advances in Neural Information Processing Systems 35 (2022), 17359–17372, https://proceedings.neurips.cc/paper_files/paper/2022/hash/6f1d43d5a82a37e89b0665b33bf3a182-Abstract-Conference.html.

35. Mu and Andreas, Compositional explanations of neurons, in: Advances in Neural Information Processing Systems, Vol. 33, Curran Associates, Inc., 2020, pp. 17153–17163, https://proceedings.neurips.cc/paper/2020/hash/c74956ffb38ba48ed6ce977af6727275-Abstract.html.

36. Na, Y.J. Choe, D.-H. Lee and Kim, Discovery of natural language concepts in individual units of CNNs, in: International Conference on Learning Representations, 2018, https://openreview.net/forum?id=S1EERs09YQ.

37. Nejadgholi, Fraser and Kiritchenko, Improving generalizability in implicitly abusive language detection with concept activation vectors, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 5517–5529, https://aclanthology.org/2022.acl-long.378.

38. Oikarinen, Das, L.M. Nguyen and T.-W. Weng, Label-free concept bottleneck models, in: The Eleventh International Conference on Learning Representations, 2023, https://openreview.net/forum?id=FlCg47MNvBA.

39. Oikarinen and T.-W. Weng, CLIP-dissect: Automatic description of neuron representations in deep vision networks, in: The Eleventh International Conference on Learning Representations, 2023, https://openreview.net/forum?id=iPWiwWHc1V.

40. OpenAI, 2023, GPT-4 Technical Report, arXiv preprint arXiv:2303.08774. http://arxiv.org/abs/2303.08774.

41. Page, Connectionist modelling in psychology: A localist manifesto, Behavioral and Brain Sciences 23(4) (2000), 443–467, https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/abs/connectionist-modelling-in-psychology-a-localist-manifesto/65F9E3CEC90E0C80A46B25E0028BCFE3. doi:10.1017/S0140525X00003356.

42. Radford, J.W. Kim, Hallacy, Ramesh, Goh, Agarwal, Sastry, Askell, Mishkin, Clark, Krueger and Sutskever, Learning transferable visual models from natural language supervision, in: Proceedings of the 38th International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763, ISSN 2640-3498, https://proceedings.mlr.press/v139/radford21a.html.

43. Radford, Wu, Child, Luan, Amodei and Sutskever, Language Models Are Unsupervised Multitask Learners, 2019, https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.

44. Ramesh, Pavlov, Goh, Gray, Voss, Radford, Chen and Sutskever, Zero-Shot Text-to-Image Generation, 2021, arXiv preprint arXiv:2102.12092 [cs]. http://arxiv.org/abs/2102.12092.

45. Ras, Xie, van Gerven and Doran, Explainable deep learning: A field guide for the uninitiated, Journal of Artificial Intelligence Research 73 (2022), 329–396, https://www.jair.org/index.php/jair/article/view/13200. doi:10.1613/jair.1.13200.

46. M.d.S. Ribeiro and Leite, Aligning artificial neural networks and ontologies towards explainable AI, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, 2021, pp. 4932–4940, https://ojs.aaai.org/index.php/AAAI/article/view/16626.

47. Sado, C.K. Loo, W.S. Liew, Kerzel and Wermter, Explainable goal-driven agents and robots – a comprehensive review, ACM Computing Surveys 55(10) (2023), 211:1–211:41. doi:10.1145/3564240.

48. Sajjad, Durrani and Dalvi, Neuron-level interpretation of deep NLP models: A survey, Transactions of the Association for Computational Linguistics 10 (2022), 1285–1303. doi:10.1162/tacl_a_00519.

49. Samek, Montavon, Lapuschkin, C.J. Anders and K.-R. Müller, Explaining deep neural networks and beyond: A review of methods and applications, in: Proceedings of the IEEE, Vol. 109, 2021, pp. 247–278.

50. Schockaert and Gutiérrez-Basulto, Modelling symbolic knowledge using neural representations, in: Reasoning Web. Declarative Artificial Intelligence, 17th International Summer School 2021, Leuven, Belgium, September 8–15, 2021, Tutorial Lectures, Šimkus and Varzinczak, eds, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2022, pp. 59–75. ISBN 978-3-030-95481-9. doi:10.1007/978-3-030-95481-9_3.

51. Schwalbe, 2022, Concept Embedding Analysis: A Review, arXiv preprint arXiv:2203.13909. http://arxiv.org/abs/2203.13909.

52. Sundararajan, Taly and Yan, Axiomatic attribution for deep networks, in: Proceedings of the 34th International Conference on Machine Learning, PMLR, 2017, pp. 3319–3328, ISSN 2640-3498, https://proceedings.mlr.press/v70/sundararajan17a.html.

53. Vig, Gehrmann, Belinkov, Qian, Nevo, Singer and Shieber, Investigating gender bias in language models using causal mediation analysis, in: Advances in Neural Information Processing Systems, Vol. 33, Curran Associates, Inc., 2020, pp. 12388–12401, https://proceedings.neurips.cc/paper/2020/hash/92650b2e92217715fe312e6fa7b90d82-Abstract.html.

54. Wermter, Knowledge extraction from transducer neural networks, Applied Intelligence 12(1) (2000), 27–42. doi:10.1023/A:1008320219610.

55. Wermter and W.G. Lehnert, A hybrid symbolic/connectionist model for noun phrase understanding, Connection Science 1(3) (1989), 255–272. doi:10.1080/09540098908915641.

56. Wermter, Panchev and Arevian, Hybrid neural plausibility networks for news agents, in: Proceedings of the National Conference on Artificial Intelligence AAAI, 1999, pp. 93–98, https://cdn.aaai.org/AAAI/1999/AAAI99-014.pdf.

57. Yang, Panagopoulou, Zhou, Jin, Callison-Burch and Yatskar, Language in a bottle: Language model guided concept bottlenecks for interpretable image classification, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19187–19197, https://openaccess.thecvf.com/content/CVPR2023/html/Yang_Language_in_a_Bottle_Language_Model_Guided_Concept_Bottlenecks_for_CVPR_2023_paper.html.

58. C.-K. Yeh, Kim, Arik, C.-L. Li, Pfister and Ravikumar, On completeness-aware concept-based explanations in deep neural networks, in: Advances in Neural Information Processing Systems, Vol. 33, Curran Associates, Inc., 2020, pp. 20554–20565, https://proceedings.neurips.cc/paper/2020/hash/ecb287ff763c169694f682af52c1f309-Abstract.html.

59. Yuksekgonul, Wang and Zou, Post-hoc concept bottleneck models, in: The Eleventh International Conference on Learning Representations, 2023, https://openreview.net/forum?id=nA5AZ8CEyow.

60. Zhou, Sun, Bau and Torralba, Interpretable basis decomposition for visual explanation, in: Computer Vision – ECCV 2018, Ferrari, Hebert, Sminchisescu and Weiss, eds, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2018, pp. 122–138. ISBN 978-3-030-01237-3. doi:10.1007/978-3-030-01237-3_8.