In this paper, we review recent approaches for explaining concepts in neural networks. Concepts can act as a natural link between learning and reasoning: once the concepts that a neural learning system uses are identified, one can integrate those concepts with a reasoning system for inference or use a reasoning system to act upon them to improve or enhance the learning system. On the other hand, knowledge can not only be extracted from neural networks but concept knowledge can also be inserted into neural network architectures. Since integrating learning and reasoning is at the core of neuro-symbolic AI, the insights gained from this survey can serve as an important step towards realizing neuro-symbolic AI based on explainable concepts.
In recent years, neural networks have been successful in tasks that were regarded as requiring human-level intelligence, such as understanding and generating images and texts, performing dialogues, and controlling robots to follow instructions [25,40,44]. However, their decision-making is often not explainable, which undermines user trust and negatively impacts their usage in sensitive or critical domains, such as automation, law, and medicine. One way to overcome this limitation is to make neural networks explainable, e.g., by designing them to generate explanations or by using a post-hoc explanation method that analyzes the behavior of a neural network after it has been trained.
This paper reviews explainable artificial intelligence (XAI) methods with a focus on explaining how neural networks learn concepts, as concepts can act as primitives for building complex rules, presenting themselves as a natural link between learning and reasoning [50], which is at the core of neuro-symbolic AI [23,30,33,55,56]. On the one hand, identifying the concepts that a neural network uses for a given input can inform the user about what information the network is using to generate its output [5,22,27,28,35,54]. Combined with an approach to extract all relevant concepts and their (causal) relationships, one could generate explanations in logical or natural language that faithfully reflect the decision procedure of the network. On the other hand, the identified concepts can help a symbolic reasoner intervene in the neural network such that the network can be debugged by modifying the concepts [1,6,28,34].
Some XAI surveys have been published in recent years [15,24,31,32,45,47,49]. However, almost all of them are mainly concerned with the use of saliency maps to highlight important input features. Only a few surveys include concept explanation as a way to explain neural networks. A recent survey in this vein is by Casper et al. [8], which discusses a broad range of approaches to explaining the internals of neural networks. However, due to its broader scope, the survey does not provide detailed descriptions of methods for explaining concepts and misses recent advances in the field. The surveys by Schwalbe [51] and Sajjad et al. [48], on the other hand, are dedicated to specific kinds of concept explanation methods with a focus on either vision [51] or natural language processing [48] and are, therefore, limited in scope, failing to analyze the two areas together.
We categorize concept explanation approaches and structure this survey based on whether they explain concepts at the level of individual neurons (Section 2) or at the level of layers (Section 3). The last section summarizes this survey with open questions.
Neuron-level explanations
The smallest entity in a neural network that can represent a concept is a neuron [48], which could be – in a broader sense – also a unit or a filter in a convolutional neural network [5]. In this section, we survey approaches that explain, in a post-hoc manner, concepts that a neuron of a pre-trained neural network represents, either by comparing the similarity between a concept and the activation of the neuron (see Section 2.1) or by detecting the causal relationship between a concept and the activation of the neuron (see Section 2.2).
Using similarities between concepts and activations
In this category, the concept a neuron is representing is explained by comparing the concept with the activations of the neuron when the concept is passed as an input to the model. The network dissection approach by Bau et al. [5] is arguably the most prominent approach in this category and is mainly applied to computer vision models. In this approach, a set of concepts is prepared as well as a set of images X_C for each concept C. Then the activations A_k(x) of a convolutional filter k are measured for each input x ∈ X_C. Afterward, the activation map is thresholded to generate a binary activation mask M_k(x) and scaled up to be compared with the binary segmentation mask L_C(x) of the concept C in the input x (e.g., the head segment of an image with a bird for the concept head). See Fig. 1 for an illustration. Then, to measure to which degree concept C is represented by the convolutional filter, the dataset-wide intersection over union metric (IoU) is computed, which is defined as IoU_{k,C} = Σ_x |M_k(x) ∩ L_C(x)| / Σ_x |M_k(x) ∪ L_C(x)|. If the IoU value is above a given threshold, then the convolutional filter represents the concept C. Several extensions of this approach have been introduced.
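The dataset-wide IoU computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: it assumes the activation maps have already been upscaled to the resolution of the segmentation masks, and the function and argument names are ours.

```python
import numpy as np

def dissection_iou(activation_maps, concept_masks, quantile=0.995):
    """Dataset-wide IoU between a filter's thresholded activations and a
    concept's segmentation masks (sketch of network dissection [5])."""
    # Threshold chosen so that only the top activations across the dataset pass.
    threshold = np.quantile(np.stack(activation_maps), quantile)
    inter, union = 0, 0
    for act, mask in zip(activation_maps, concept_masks):
        binary_act = act >= threshold          # binary activation mask M_k(x)
        inter += np.logical_and(binary_act, mask).sum()
        union += np.logical_or(binary_act, mask).sum()
    return inter / union if union else 0.0
```

A filter is then said to represent the concept when this ratio exceeds a fixed threshold (0.04 in the original paper).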
Fong et al. [19] question whether a concept has to be represented by a single convolutional filter alone or whether it can be represented by a linear combination of filters. They show that the latter leads to a better representation of the concept and also suggest using binary classification for measuring how well filters represent a concept. Complementary to that extension, Mu et al. [35] investigate how to better approximate what a single filter represents. To this end, they assume that a filter can represent a Boolean combination of concepts (e.g., (water OR river) AND NOT blue) and show that this compositional explanation of concepts leads to a higher IoU. An intuitive extension of compositional explanations is natural language explanations. The approach called MILAN by Hernandez et al. [22] finds such natural language explanations as a sequence d of words that maximizes the pointwise mutual information between d and a set of image regions E that maximally activate the filter, i.e., PMI(d; E) = log p(d | E) − log p(d). In the approach, the two probabilities p(d | E) and p(d) are approximated by an image captioning model and a language model, respectively, which are trained on a dataset that the authors curated.
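The compositional idea can be illustrated by evaluating a Boolean formula elementwise over concept segmentation masks; the composed mask can then be scored against a filter's activation mask with the same dataset-wide IoU. The `eval`-based helper below is only a sketch (the formula syntax and mask names are illustrative, not those of [35]):

```python
import numpy as np

def compose(concept_masks, formula):
    """Evaluate a Boolean formula over concept segmentation masks, e.g.
    '(water | river) & ~blue' (sketch of compositional explanations [35]).
    |, & and ~ act elementwise on boolean NumPy arrays; eval is used here
    purely for brevity in this illustration."""
    return eval(formula, {"__builtins__": {}}, dict(concept_masks))
```

The search in [35] then looks for the formula of bounded length whose composed mask yields the highest IoU with the filter.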
One strong assumption made by the network dissection approach is the availability of a comprehensive set of concepts and corresponding labeled images to provide accurate explanations of neurons. This is, however, difficult to obtain in general. Oikarinen et al. [39] tackle this problem with their CLIP-Dissect method, which is based on the CLIP [42] vision-language model. (CLIP embeds images and texts in the same vector space, allowing for measuring the similarity between texts and images.) To explain the concept a convolutional filter k is representing, they choose a set D_k of the most highly activating images for filter k, then use CLIP to measure the similarity between D_k and each concept C from a concept set (here, the 20K most common English words), and finally select the best matching concept.
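A stripped-down version of this matching step might look as follows, assuming the image and word embeddings have already been computed with CLIP and L2-normalized (the actual CLIP-Dissect method uses a more elaborate similarity scoring; this sketch simply averages cosine similarities):

```python
import numpy as np

def clip_dissect(image_embs, word_embs, words):
    """Assign a concept word to a filter: compare embeddings of the
    filter's most highly activating images with embeddings of candidate
    words and return the best match (simplified sketch of [39]).
    image_embs: (n_images, d); word_embs: (n_words, d); both unit-norm."""
    sims = image_embs @ word_embs.T        # cosine similarity matrix
    return words[int(sims.mean(axis=0).argmax())]
```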
The dissection approach can also be used for generative vision models. Bau et al. [6] show, in a manner similar to network dissection, that units of generative adversarial networks [21] learn concepts and that one can intervene on the units and remove specific concepts to change the output image (e.g., removing units representing the concept tree leads to output images with fewer trees in their scenes).
Fig. 1. Neuron-level explanation using similarities between concepts and activations. Depicted is the network dissection approach, which compares the segmented concept in the input with the activation mask of a neuron [5].
Using causal relationships between concepts and activations
In this category, the concepts that a neuron is representing are explained by analyzing the causal relationship either (i) between the input concept and the neuron by intervening on the input and measuring the neural activation or (ii) between the neuron and the output concept by intervening on the neural activation and measuring the probability in predicting the concept. This approach is often used for explaining neurons of NLP models [48], where the types of concepts can be broader (e.g., subject-verb behavior, causal relationship, semantic tags).
The first line of work investigates the influence of a concept in the input on the activation of a neuron by intervening on the input. Kádár et al. [26] find the n-grams (i.e., sequences of n words) that have the largest influence on the activation of a neuron by measuring the change in its activations when a word is removed from the n-grams. Na et al. [36] first identify the k sentences that most highly activate a filter of a CNN-based NLP model. From these k sentences, they extract concepts by breaking down each sentence into a set of consecutive word sequences that form a meaningful chunk. Then they measure the contribution of each concept to the filter’s activations by first repeating the concept to create a synthetic sentence of a fixed length (to normalize the input’s contribution to the unit across different concepts) and then measuring the mean value of the filter’s activations.
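The input-ablation idea behind these approaches can be sketched as follows, with a stand-in `activate` function in place of running the actual model and reading off one neuron's activation (all names here are illustrative):

```python
def word_influence(activate, tokens):
    """Influence of each word on a single neuron, measured as the drop in
    the neuron's activation when the word is removed (sketch of the
    input-ablation analysis in [26]; `activate` is a stand-in for a
    forward pass that returns one neuron's activation)."""
    base = activate(tokens)
    return {w: base - activate(tokens[:i] + tokens[i + 1:])
            for i, w in enumerate(tokens)}
```

The words (or n-grams, when whole spans are removed at once) with the largest drop are taken to be the ones the neuron responds to.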
The second line of work investigates the role of a neuron in generating a concept by intervening in the activation of the neuron. Dai et al. [11] investigate the factual linguistic knowledge of the BERT model [13], a widely used pre-trained model for text classification, which is pre-trained among other tasks by predicting masked words in a sentence. In this approach, given relational facts with a mask word (e.g. “Rome is the capital of [MASK]”), each neuron’s contribution to predicting the mask is measured using the integrated gradients method [52]. To verify the causal role of the neuron that is supposed to represent a concept, the authors also intervene in the neuron’s activation (by suppressing or doubling) and measure the change in accuracy in predicting the concept. Finlayson et al. [18] analyze whether a neuron of a transformer-based language model (e.g., GPT-2 [43]) has acquired the concept of conjugation. The authors determine which neuron contributes most to the conjugation of a verb by using the causal mediation analysis [53]. To this end, they first modify the activation of a neuron to the one that the neuron would have output if there was an intervention on the input (e.g., the subject in the input sentence was changed from singular to plural) and then measure the amount of change between the predictions of the correct conjugation of a verb with and without the intervention (see Fig. 2). Meng et al. [34] also apply causal mediation analysis to GPT-2 to understand which neurons memorize factual knowledge and modify specific facts (e.g., “The Eiffel Tower is in Paris” is modified to “The Eiffel Tower is in Rome”). The data they use consists of triples of the form (subject, relation, object) and the model has to predict the object given subject and relation. 
They discover that the neurons in the middle-layer feed-forward modules of GPT-2 are the most relevant for encoding factual information, and they implement a weight modifier that changes the values of these weights to alter the stored factual knowledge.
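The set-and-measure intervention at the heart of these analyses can be sketched as follows. This is a toy stand-in: real analyses intervene inside a transformer's forward pass, whereas here `forward` is a placeholder mapping a hidden vector to output probabilities.

```python
import numpy as np

def neuron_effect(forward, h, h_intervened, neuron):
    """Sketch of the set-and-measure step of causal mediation analysis
    [18,53]: replace one neuron's activation with the value it would take
    under an intervened input, then compare output probabilities."""
    p_base = forward(h)
    h_mod = h.copy()
    h_mod[neuron] = h_intervened[neuron]   # intervene on a single neuron
    return forward(h_mod) - p_base         # change in output probabilities
```

A large change in the probability of the grammatically correct continuation indicates that the neuron mediates the effect of the input intervention.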
Fig. 2. Neuron-level explanation using causal relationships between concepts and activations. In causal mediation analysis, the activation of a neuron is modified to the one that the neuron would have output if there had been an intervention on the input (e.g., the subject in the input sentence was changed from singular to plural). Afterward, the amount of change between the predictions of the correct conjugation of a verb with and without the intervention is measured [18].
Layer-level explanations
Concepts can also be represented by a whole layer as opposed to a single neuron or convolutional filter, as mentioned in the paragraph about the work by Fong et al. [19] in Section 2.1. This can be achieved in a post-hoc manner for a pre-trained model by passing examples of a concept dataset to the model and extracting the activations of a specific layer to train a concept classifier. Two approaches are prominent in layer-level explanations: the first is explaining with concept activation vectors (CAVs) (see Section 3.1) and the second is probing (see Section 3.2). The main difference between the two approaches is that in the case of CAVs a linear binary classifier is trained for each concept C, whereas in probing a multiclass classifier is trained with classification labels that are often related to certain linguistic features (e.g., sentiments, part-of-speech tags). On the other hand, concepts can also be baked into a layer, with each neuron representing one concept, as was done with localist representations in the early days of neural network research (see Section 3.3).
Using vectors to explain concepts: Concept activation vectors
A concept activation vector (CAV) introduced by Kim et al. [27] is a continuous vector that corresponds to a concept represented by a layer of a neural network f (see Fig. 3). Let f = f_2 ∘ f_1, where f_1 is the bottom part of the network whose final convolutional layer ℓ is of interest. To identify the existence of a concept C (e.g., the concept stripes) in layer ℓ, network f_1 is first fed with positive examples X_C^+ that contain concept C and negative examples X_C^- that do not contain the concept, and then the corresponding activations f_1(X_C^+) and f_1(X_C^-) are collected. Next, a linear classifier is learned that distinguishes the activations f_1(X_C^+) from the activations f_1(X_C^-). The vector v_C normal to the decision boundary of the classifier is then a CAV of concept C. One useful feature of a CAV is that it allows for testing how much an input image x is correlated with a concept C (e.g., an image of a zebra and concept stripes), which is called testing with CAVs (TCAV) in [27]. This is accomplished, roughly speaking, by measuring the probability of concept C having a positive influence on predicting a class label k over a dataset of images with that label, i.e., by checking for each such image x how much moving the latent vector f_1(x) along the direction of v_C, i.e., f_1(x) + ε·v_C, changes the log-probability of label k when it is fed to f_2.
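A minimal sketch of CAV learning is given below; a simple perceptron stands in for the linear classifiers used in practice, and all names are illustrative:

```python
import numpy as np

def learn_cav(acts_pos, acts_neg, epochs=200, lr=0.1):
    """Learn a concept activation vector: train a linear classifier to
    separate activations of positive and negative concept examples and
    take the (unit) normal of its decision boundary (sketch of [27])."""
    X = np.vstack([acts_pos, acts_neg])
    y = np.concatenate([np.ones(len(acts_pos)), -np.ones(len(acts_neg))])
    w = np.zeros(X.shape[1])
    for _ in range(epochs):                    # perceptron-style updates
        miss = y * (X @ w) <= 0                # misclassified examples
        w += lr * (y[miss, None] * X[miss]).sum(axis=0)
    return w / np.linalg.norm(w)               # CAV: unit normal vector
```

TCAV then measures, over all images of a class, how often a small step along this vector in latent space increases the class logit.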
CAVs can be used in many different ways. Nejadgholi et al. [37] use CAVs to identify the sensitivity of abusive language classifiers with respect to implicit types (as opposed to explicit types) of abusive language. Different from the original approach [27], which obtains CAVs by taking the vector normal to the decision boundary, they obtain CAVs by simply averaging the activations of all positive samples to mitigate the impact of the choice of random negative samples on determining the decision boundary. Zhou et al. [60] decompose the row vector of the last linear layer for predicting a class label k and represent it as a linear combination of a basis that consists of CAVs using only positive weights. Each positive weight then indicates how much of the corresponding concept is involved in predicting class label k. Similarly, Abid et al. [1] propose an approach that learns a set of CAVs, but for debugging purposes. Given an input image misclassified by a model, a weighted sum of the set of CAVs is computed that leads to a correct classification when added to the activations before the last linear layer of the model. In addition to explaining bugs on a conceptual level, this approach allows for identifying spurious correlations in the data.
Fig. 3. Layer-level explanation using vectors to explain concepts. For each concept C, positive and negative examples are fed to a pre-trained model to learn the so-called concept activation vector (CAV) from the corresponding activations of the target layer [27].
An issue with the original approach for learning CAVs is that one needs to prepare a set of concept labels and images to learn the CAVs. Ghorbani et al. [20] partially tackle this issue by preparing images of the same class and then segmenting them with multiple resolutions. The clusters of resulting segments then form concepts and can be used for TCAV. As corresponding concept labels are missing, the concepts need to be manually inspected. Yeh et al. [58] circumvent the problem of preparing a concept dataset by training CAVs together with a model on the original image classification dataset. To this end, they compute a vector-valued score, where each value corresponds to a learnable concept and indicates to which degree the concept is present in the receptive field of the convolutional layer (computed by building a scalar product). The score is then passed to a multilayer perceptron (MLP) to perform classification.
Using classifiers to explain concepts: Probing
Similar to the CAV-based approaches in Section 3.1, probing uses a classifier to explain concepts. However, instead of training a binary linear classifier for each concept to measure the existence of the concept in the activation of a layer, probing uses a classifier for multiclass classification with labels that often represent linguistic features in NLP (e.g., sentiments, part-of-speech tags). For example, given sentences as inputs to a pre-trained NLP model (e.g., BERT [13]), probing allows for evaluating how well the sentence embeddings of the model capture certain syntactic and semantic information, such as the length or the tense of the sentence [2,10,16] (see Fig. 4).
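A minimal probing setup might look as follows; here a one-hot least-squares fit stands in for the logistic-regression probes commonly used in practice, and the activations are assumed to be precomputed from a frozen model:

```python
import numpy as np

def train_probe(activations, labels, n_classes):
    """Fit a linear probing classifier on frozen layer activations via
    one-hot least squares (a minimal stand-in for the probes in
    [2,10,16]). activations: (n, d); labels: (n,) integer classes."""
    Y = np.eye(n_classes)[labels]              # one-hot targets
    W, *_ = np.linalg.lstsq(activations, Y, rcond=None)
    return W

def probe_predict(W, activations):
    """Predict the class (e.g., a linguistic feature) for each activation."""
    return (activations @ W).argmax(axis=1)
```

High probe accuracy is read as evidence that the layer encodes the probed concept, with the caveat about correlation versus use discussed below.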
Fig. 4. Layer-level explanation using a classifier to explain concepts. In this example, a pre-trained model takes a sentence as its input, and a probing classifier is applied to the activation of the highlighted layer to check whether the activation encodes the concept of sentence length [2].
Probing, which is designed as a layer-level explanation method, can also be combined with a neuron-level explanation method (see Section 2) by applying the probing classifier only to neurons that are relevant for the classification [14]. Finding such neurons can be accomplished by applying the elastic-net regularization to the classifier, which constrains both the L1- and the L2-norm of the classifier weights.
The concepts learned by such probing classifiers can be combined with a knowledge base to provide richer explanations. Ribeiro and Leite [46] use identified concepts as evidence to draw conclusions from a set of axioms in a knowledge base (e.g., given an axiom A ∧ B → C in the knowledge base, identifying both antecedent concepts A and B in the activations explains the presence of the consequent C in the input). However, one cannot always assume the presence of a knowledge base for a given task. Ferreira et al. [17] weaken this assumption by learning the underlying theory from the identified concepts using an induction framework.
Since the probing classifier is trained independently of the pre-trained model, it has been pointed out that the pre-trained model does not necessarily leverage the same features that the classifier uses for predicting a given concept, i.e., what the probing classifier detects can be merely a correlation between the activation and the concept [3,7].
Fig. 5. Layer-level explanation using the concept bottleneck model approach [28]. Each neuron in the concept bottleneck corresponds to a unique concept (e.g., wing color).
Using localist representations: Concept bottleneck models
Different from the neuron-based approach in Section 2, where concepts are explained in a post-hoc manner, in a concept bottleneck model (CBM) [28], each concept is represented by a unique neuron in the bottleneck layer of a model f (see Fig. 5), which is reminiscent of localist representations [41]. This layer provides information about the existence or the strength of each concept in the input. The output of the bottleneck layer is then used by a classifier or regressor h for the prediction, which allows for explaining which concepts led to the given prediction. Often, the bottom part f_1 of a pre-trained model is used for initializing the layers before the concept bottleneck, and the bottleneck g is a linear layer that maps the features from f_1 to concepts. Therefore, a CBM is f = h ∘ g ∘ f_1, where ∘ stands for composition. To train the concept bottleneck g, the training data has to include concept labels in addition to task labels.
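The forward pass of such a model can be sketched as follows; a stand-in feature extractor and hand-set weight matrices replace the trained components, and the names are illustrative:

```python
import numpy as np

def cbm_predict(x, f1, Wc, Wy):
    """Concept bottleneck model sketch [28]: features -> concept scores ->
    label logits. f1 is a stand-in feature extractor, Wc maps features to
    concepts (the bottleneck g), Wy maps concepts to labels (the head h)."""
    feats = f1(x)
    concepts = 1 / (1 + np.exp(-(feats @ Wc)))   # concept probabilities
    logits = concepts @ Wy                       # prediction from concepts only
    return concepts, logits
```

Because the label logits depend on the input only through the concept scores, inspecting (or editing) `concepts` directly explains and controls the prediction.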
One of the main limitations of CBMs is the need for the aforementioned concept labels, which might not be available for specific tasks. Several recent approaches overcome this limitation [38,57,59]. The main idea behind these approaches is using an external resource to obtain a set of concepts relevant to the task. This external resource could be a knowledge base such as ConceptNet [59], the 20K most common English words [38], or a language model like GPT-3 [57]. After obtaining the concept set, each concept word C is embedded as a vector v_C by means of the CLIP vision-language model (cf. Section 2.1) such that v_C can be used for computing the strength of concept C for a given input x, e.g., by measuring the cosine similarity between v_C and the CLIP embedding of x. Finally, the presence of concepts in the concept bottleneck layer allows for inducing logical explanations; e.g., Ciravegna et al. [9] induce explanations in disjunctive normal form (DNF) from concept activations and predicted labels, which is similar to the logic-based explanation approaches [17,46] in Section 3.2.
Conclusion
In this survey, we have reviewed recent methods for explaining concepts in neural networks. We have covered different approaches that range from analyzing individual neurons to learning classifiers for a whole layer. As witnessed by the increasing number of recent papers, this is an active research area and a lot is still to be discovered, for example, by empirically comparing or integrating different approaches. With the progress of concept extraction from neural networks, integrating the learned neural concepts with symbolic representations – also known as neuro-symbolic integration – is receiving (again) increasing attention [4,9,12,17,29,46]. In the near future, we expect tighter integration between neural models and symbolic rules via concept representations to make the models more transparent and easier to control. In conclusion, this line of research is still very active and in development, providing ample opportunities for new forms of integration in neuro-symbolic AI.
Acknowledgement
The authors gratefully acknowledge support from the DFG (CML, MoReSpace, LeCAREbot), BMWK (SIDIMO, VERIKAS), and the European Commission (TRAIL, TERAIS). We would like to thank Cornelius Weber for valuable comments on this paper.
References
1.
AbidA.YuksekgonulM.ZouJ., Meaningfully debugging model mistakes using conceptual counterfactual explanations, in: Proceedings of the 39th International Conference on Machine Learning, PMLR, 2022, pp. 66–88, ISSN 2640-3498, https://proceedings.mlr.press/v162/abid22a.html.
2.
AdiY.KermanyE.BelinkovY.LaviO.GoldbergY., Fine-grained analysis of sentence embeddings using auxiliary prediction tasks, in: International Conference on Learning Representations, 2016, https://openreview.net/forum?id=BJh6Ztuxl.
4.
BarbieroP.CiravegnaG.GianniniF.ZarlengaM.E.MagisterL.C.TondaA.LioP.PreciosoF.JamnikM.MarraG., Interpretable neural-symbolic concept reasoning, in: Proceedings of the 40th International Conference on Machine Learning, PMLR, 2023, pp. 1801–1825, https://proceedings.mlr.press/v202/barbiero23a.html.
5.
BauD.ZhouB.KhoslaA.OlivaA.TorralbaA., Network dissection: Quantifying interpretability of deep visual representations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6541–6549, https://openaccess.thecvf.com/content_cvpr_2017/html/Bau_Network_Dissection_Quantifying_CVPR_2017_paper.html.
6.
BauD.ZhuJ.-Y.StrobeltH.ZhouB.TenenbaumJ.B.FreemanW.T.TorralbaA., GAN dissection: Visualizing and understanding generative adversarial networks, in: International Conference on Learning Representations, 2018, https://openreview.net/forum?id=Hyg_X2C5FX.
8.
CasperS.RaukerT.HoA.Hadfield-MenellD., Toward transparent AI: A survey on interpreting the inner structures of deep neural networks, in: First IEEE Conference on Secure and Trustworthy Machine Learning, 2023, https://openreview.net/forum?id=8C5zt-0Utdn.
10.
ConneauA.KruszewskiG.LampleG.BarraultL.BaroniM., What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 2126–2136, http://aclweb.org/anthology/P18-1198.
11.
DaiD.DongL.HaoY.SuiZ.ChangB.WeiF., Knowledge neurons in pretrained transformers, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 8493–8502, https://aclanthology.org/2022.acl-long.581.
12.
DalalA.SarkerM.K.BaruaA.VassermanE.HitzlerP., Understanding CNN Hidden Neuron Activations Using Structured Background Knowledge and Deductive Reasoning, 2023, arXiv preprint arXiv:2308.03999. http://arxiv.org/abs/2308.03999.
13.
DevlinJ.ChangM.-W.LeeK.ToutanovaK., BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186, https://aclanthology.org/N19-1423.
14.
DurraniN.SajjadH.DalviF.BelinkovY., Analyzing individual neurons in pre-trained language models, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 4865–4880, https://aclanthology.org/2020.emnlp-main.395.
15.
DwivediR.DaveD.NaikH.SinghalS.OmerR.PatelP.QianB.WenZ.ShahT.MorganG.RanjanR., Explainable AI (XAI): Core ideas, techniques, and solutions, ACM Computing Surveys55(9) (2023), 194:1–194:33. doi:10.1145/3561048.
16.
EttingerA.ElgoharyA.ResnikP., Probing for semantic evidence of composition by means of simple classification tasks, in: Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, Association for Computational Linguistics, Berlin, Germany, 2016, pp. 134–139, https://aclanthology.org/W16-2524. doi:10.18653/v1/W16-2524.
17.
FerreiraJ.RibeiroM.d.S.GonçalvesR.LeiteJ., Looking inside the black-box: Logic-based explanations for neural networks, in: Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning, Vol. 19, 2022, pp. 432–442, https://proceedings.kr.org/2022/45/.
18.
FinlaysonM.MuellerA.GehrmannS.ShieberS.LinzenT.BelinkovY., Causal analysis of syntactic agreement mechanisms in neural language models, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 1828–1843, https://aclanthology.org/2021.acl-long.144.
19.
FongR.VedaldiA., Net2Vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8730–8738, https://openaccess.thecvf.com/content_cvpr_2018/html/Fong_Net2Vec_Quantifying_and_CVPR_2018_paper.html.
21.
GoodfellowI.Pouget-AbadieJ.MirzaM.XuB.Warde-FarleyD.OzairS.CourvilleA.BengioY., Generative adversarial nets, in: Advances in Neural Information Processing Systems 27, GhahramaniZ.WellingM.CortesC.LawrenceN.D.WeinbergerK.Q., eds, Curran Associates, Inc., 2014, pp. 2672–2680, http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.
22.
HernandezE.SchwettmannS.BauD.BagashviliT.TorralbaA.AndreasJ., Natural Language Descriptions of Deep Visual Features, in: International Conference on Learning Representations, 2022, https://openreview.net/forum?id=NudBMY-tzDr.
23.
HitzlerP.SarkerM.K.EberhartA. (eds), Compendium of Neurosymbolic Artificial Intelligence, Frontiers in Artificial Intelligence and Applications / Faia, Vol. 369, IOS Press, Washington, 2023. ISBN 978-1-64368-406-2.
24.
IbrahimR.ShafiqM.O., Explainable convolutional neural networks: A taxonomy, review, and future directions, ACM Computing Surveys55(10) (2023), 206:1–206:37, https://dl.acm.org/doi/10.1145/3563691. doi:10.1145/3563691.
25.
JiangY.GuptaA.ZhangZ.WangG.DouY.ChenY.Fei-FeiL.AnandkumarA.ZhuY.FanL., VIMA: Robot manipulation with multimodal prompts, in: Proceedings of the 40th International Conference on Machine Learning, PMLR, 2023, pp. 14975–15022, ISSN 2640-3498, https://proceedings.mlr.press/v202/jiang23b.html.
26.
KádárÁ.ChrupałaG.AlishahiA., Representation of linguistic form and function in recurrent neural networks, Computational Linguistics43(4) (2017), 761–780. doi:10.1162/COLI_a_00300.
27.
KimB.WattenbergM.GilmerJ.CaiC.WexlerJ.ViegasF.SayresR., Interpretability beyond feature attribution: Quantitative Testing with Concept Activation Vectors (TCAV), in: Proceedings of the 35th International Conference on Machine Learning, PMLR, 2018, pp. 2668–2677, ISSN 2640-3498, https://proceedings.mlr.press/v80/kim18d.html.
28.
KohP.W.NguyenT.TangY.S.MussmannS.PiersonE.KimB.LiangP., Concept bottleneck models, in: Proceedings of the 37th International Conference on Machine Learning, PMLR, 2020, pp. 5338–5348, ISSN 2640-3498, https://proceedings.mlr.press/v119/koh20a.html.
29.
LeeJ.H.SioutisM.AhrensK.AlirezaieM.KerzelM.WermterS., Chapter 19. Neuro-symbolic spatio-temporal reasoning, in: Compendium of Neurosymbolic Artificial Intelligence, IOS Press, 2023, pp. 410–429, https://ebooks.iospress.nl/doi/10.3233/FAIA230151.
31.
LiX.XiongH.LiX.WuX.ZhangX.LiuJ.BianJ.DouD., Interpretable Deep Learning: Interpretation, Interpretability, Trustworthiness, and Beyond, Knowledge and Information Systems64(12) (2022), 3197–3234. doi:10.1007/s10115-022-01756-8.
32.
MadsenA.ReddyS.ChandarS., Post-hoc interpretability for neural NLP: A survey, ACM Computing Surveys55(8) (2022), 155:1–155:42, https://dl.acm.org/doi/10.1145/3546577.
33.
McGarryK.J.TaitJ.WermterS.MacIntyreJ., Rule-extraction from radial basis function networks, in: International Conference on Artificial Neural Networks, Vol. 2, 1999, pp. 613–618, ISSN 0537-9989.
36.
NaS.ChoeY.J.LeeD.-H.KimG., Discovery of natural language concepts in individual units of CNNs, in: International Conference on Learning Representations, 2018, https://openreview.net/forum?id=S1EERs09YQ.
37.
NejadgholiI.FraserK.KiritchenkoS., Improving generalizability in implicitly abusive language detection with concept activation vectors, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 5517–5529, https://aclanthology.org/2022.acl-long.378.
38.
OikarinenT.DasS.NguyenL.M.WengT.-W., Label-free concept bottleneck models, in: The Eleventh International Conference on Learning Representations, 2023, https://openreview.net/forum?id=FlCg47MNvBA.
39.
OikarinenT.WengT.-W., CLIP-dissect: Automatic description of neuron representations in deep vision networks, in: The Eleventh International Conference on Learning Representations, 2023, https://openreview.net/forum?id=iPWiwWHc1V.
41.
PageM., Connectionist modelling in psychology: A localist manifesto, Behavioral and Brain Sciences23(4) (2000), 443–467, https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/abs/connectionist-modelling-in-psychology-a-localist-manifesto/65F9E3CEC90E0C80A46B25E0028BCFE3. doi:10.1017/S0140525X00003356.
42.
RadfordA.KimJ.W.HallacyC.RameshA.GohG.AgarwalS.SastryG.AskellA.MishkinP.ClarkJ.KruegerG.SutskeverI., Learning transferable visual models from natural language supervision, in: Proceedings of the 38th International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763, ISSN 2640-3498, https://proceedings.mlr.press/v139/radford21a.html.
43.
RadfordA.WuJ.ChildR.LuanD.AmodeiD.SutskeverI., Language Models Are Unsupervised Multitask Learners, 2019, https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
RasG.XieN.van GervenM.DoranD., Explainable deep learning: A field guide for the uninitiated, Journal of Artificial Intelligence Research73 (2022), 329–396, https://www.jair.org/index.php/jair/article/view/13200. doi:10.1613/jair.1.13200.
46. Ribeiro, M.d.S. and Leite, J., Aligning artificial neural networks and ontologies towards explainable AI, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, 2021, pp. 4932–4940, https://ojs.aaai.org/index.php/AAAI/article/view/16626.
47. Sado, F., Loo, C.K., Liew, W.S., Kerzel, M. and Wermter, S., Explainable goal-driven agents and robots – a comprehensive review, ACM Computing Surveys 55(10) (2023), 211:1–211:41. doi:10.1145/3564240.
48. Sajjad, H., Durrani, N. and Dalvi, F., Neuron-level interpretation of deep NLP models: A survey, Transactions of the Association for Computational Linguistics 10 (2022), 1285–1303. doi:10.1162/tacl_a_00519.
49. Samek, W., Montavon, G., Lapuschkin, S., Anders, C.J. and Müller, K.-R., Explaining deep neural networks and beyond: A review of methods and applications, Proceedings of the IEEE 109 (2021), 247–278.
50. Schockaert, S. and Gutiérrez-Basulto, V., Modelling symbolic knowledge using neural representations, in: Reasoning Web. Declarative Artificial Intelligence, 17th International Summer School 2021, Leuven, Belgium, September 8–15, 2021, Tutorial Lectures, Šimkus, M. and Varzinczak, I., eds, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2022, pp. 59–75. ISBN 978-3-030-95481-9. doi:10.1007/978-3-030-95481-9_3.
Sundararajan, M., Taly, A. and Yan, Q., Axiomatic attribution for deep networks, in: Proceedings of the 34th International Conference on Machine Learning, PMLR, 2017, pp. 3319–3328, ISSN 2640-3498, https://proceedings.mlr.press/v70/sundararajan17a.html.
53. Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y. and Shieber, S., Investigating gender bias in language models using causal mediation analysis, in: Advances in Neural Information Processing Systems, Vol. 33, Curran Associates, Inc., 2020, pp. 12388–12401, https://proceedings.neurips.cc/paper/2020/hash/92650b2e92217715fe312e6fa7b90d82-Abstract.html.
Wermter, S. and Lehnert, W.G., A hybrid symbolic/connectionist model for noun phrase understanding, Connection Science 1(3) (1989), 255–272. doi:10.1080/09540098908915641.
56. Wermter, S., Panchev, C. and Arevian, G., Hybrid neural plausibility networks for news agents, in: Proceedings of the National Conference on Artificial Intelligence AAAI, 1999, pp. 93–98, https://cdn.aaai.org/AAAI/1999/AAAI99-014.pdf.
57. Yang, Y., Panagopoulou, A., Zhou, S., Jin, D., Callison-Burch, C. and Yatskar, M., Language in a bottle: Language model guided concept bottlenecks for interpretable image classification, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19187–19197, https://openaccess.thecvf.com/content/CVPR2023/html/Yang_Language_in_a_Bottle_Language_Model_Guided_Concept_Bottlenecks_for_CVPR_2023_paper.html.
58. Yeh, C.-K., Kim, B., Arik, S., Li, C.-L., Pfister, T. and Ravikumar, P., On completeness-aware concept-based explanations in deep neural networks, in: Advances in Neural Information Processing Systems, Vol. 33, Curran Associates, Inc., 2020, pp. 20554–20565, https://proceedings.neurips.cc/paper/2020/hash/ecb287ff763c169694f682af52c1f309-Abstract.html.
59. Yuksekgonul, M., Wang, M. and Zou, J., Post-hoc concept bottleneck models, in: The Eleventh International Conference on Learning Representations, 2023, https://openreview.net/forum?id=nA5AZ8CEyow.
60. Zhou, B., Sun, Y., Bau, D. and Torralba, A., Interpretable basis decomposition for visual explanation, in: Computer Vision – ECCV 2018, Ferrari, V., Hebert, M., Sminchisescu, C. and Weiss, Y., eds, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2018, pp. 122–138. ISBN 978-3-030-01237-3. doi:10.1007/978-3-030-01237-3_8.