Abstract
The use of artificial intelligence in Critical Infrastructure Systems has increased substantially, having become both technically possible and financially beneficial. Yet there is an emerging consensus that the consideration and management of artificial intelligence-related risks in Critical Infrastructure Systems have not been commensurate with this rapid growth. Our surveys have identified that generalised artificial intelligence principles, such as those promoted by the Organisation for Economic Co-operation and Development, are not alone fit for purpose in guiding the use of artificial intelligence in Critical Infrastructure Systems. Evaluation is an important aspect of managing these risks, and we argue for the development of a foundational approach suited to the evaluation of artificial intelligence-enhanced Critical Infrastructure Systems as a basis for further research and improved practice. This study develops a novel conceptual framework for the evaluation of artificial intelligence-enhanced Critical Infrastructure Systems, based on theory adaptation of Value-Focused Thinking. The framework offers simplicity and additional functionality over the default principles-based framework.
Introduction
Critical infrastructure is essential to the functioning of society. Without infrastructure such as power, water, sanitation, communications and transport, our societies would quickly descend into chaos. Critical infrastructure has been clearly identified under the European Programme for Critical Infrastructure Protection and the U.S. National Infrastructure Protection Plan. Alcaraz and Zeadally (2015) have succinctly summarised the critical infrastructure inclusions for both jurisdictions. This infrastructure is the focus of our study.
The use of artificial intelligence (AI) in Critical Infrastructure Systems (CIS) such as energy (Wang et al., 2019), water (Fu et al., 2022), wastewater (Zhao et al., 2020), food production (Kakani et al., 2020), transport (Nguyen et al., 2018) and service centres (McLeay et al., 2021) has become widespread. McMillan and Varga (2022) provide evidence of an exponential growth in the use of AI within transport, energy, telecommunications, water and wastewater infrastructure systems. Their review reveals that this rapid and recent increase in AI use within these infrastructure systems is directed towards critical aspects of service provision such as forecasting of service demand and supply, system safety, routing in telecommunications and transport systems, security of infrastructure networks, monitoring of infrastructure condition and prediction of the need for preventive maintenance. Yet evaluation theory and practice for these AI interventions have not been sufficiently developed, resulting in the possibility of significant risk to society.
Concerningly, the scholarship has focused on the utilisation of AI in the audit or evaluation process rather than on evaluation of the impacts of AI utilised in the system (Fedyk et al., 2022; Nielsen et al., 2025; Wassie and Lakatos, 2024), with few exceptions (Leeuw, 2025; Montrosse-Moorhead, 2023; Rinaldi and Nielsen, 2025). AI interventions in critical services offer immense benefits such as efficiency and reliability. However, deficiencies in AI-enhanced CIS (AIeCIS) have the potential for many harmful impacts, such as injury, death, financial cost, failure to meet social objectives and decay of trust. Consideration of such risks and their management has not been commensurate with the rapid growth in the use of AI (Doorn, 2021). We therefore support the argument, first raised in the evaluation community by Montrosse-Moorhead (2023), that the evaluation literature must similarly move to establish support for the efficient and effective evaluation of AI-enhanced systems, and we note Rinaldi and Nielsen's (2025) call for work in this space to continue. Importantly, because of the inherent high risk to human life, well-being and economic prosperity, immediate attention should be given to the evaluation of AIeCIS.
An array of principles to be applied to the design and implementation of AI interventions is emerging (Felzmann et al., 2020; Gabriel, 2020; Hassija et al., 2024; Jobin et al., 2019; Li et al., 2023; Taeihagh, 2021), and there are natural differences of opinion about their usefulness. We have a body of 23 evaluation approaches systematically described by Stufflebeam and Coryn (2014), expanded to 42 approaches distributed across design, use, theory, hybrid and emerging categories (Lemire, 2024), with which to evaluate various interventions. Notably, principles-focused evaluation (Patton, 2017) has emerged; an approach which aligns very conveniently with the proliferation of principles for the design and implementation of AI interventions previously mentioned. In addition, many authors have called for the literature to move from merely advocating the principles of AI to practising those principles (Jobin et al., 2019; Kaur et al., 2022; Laine et al., 2024; Palladino, 2023; Thiebes et al., 2021). We also note that traditional systems engineering methods of evaluation during the design phase are considered deficient when applied to AI-enhanced systems (Vanderlinde et al., 2022) because of the probabilistic nature of AI (Hassija et al., 2024; Holzinger, 2021), its sensitivity to the quality of learning data (Holzinger, 2021), its complexity (Hassija et al., 2024; Holzinger, 2021) and, in some instances, compromised algorithm transparency (Hassija et al., 2024). These characteristics often result in issues with the explainability and trustworthiness of AI.
CIS are complex human-asset systems essential to the functioning and safety of society and are entrusted to perform this important role. Each system is a set of elements standing in interrelation (Bertalanffy, 1969: 38) that act together to achieve an objective. Accordingly, each of the many elements of these complex systems presents evaluators and investigators with the possibility of deficiencies, particularly for those elements where changes have been applied. A tragic example is the changes to the Boeing 737 MAX Maneuvering Characteristics Augmentation System (MCAS): the probable cause of the loss of numerous lives in two separate accidents (Komite Nasional Keselamatan Transportasi Republic of Indonesia, 2018; The Federal Democratic Republic of Ethiopia Ministry of Transport and Logistics, 2022), the basis for the worldwide grounding of Boeing 737 MAX aircraft for an extended period, and the main reason for the loss of confidence in Boeing signified by the subsequent fall in its share price.
Although events like this are rare, serious CIS failure modes are abundant. For example, the opportunities for collisions between the trains of a typical large city railway system are copious. Similarly, there are many possible ways for water systems in an urban setting to become dangerously contaminated, and there are multiple means of system failure through which city wastewater systems could seriously impact the environment, public health or both. It is the functional integrity, orchestrated system resilience and culture of accountability surrounding the use of these systems that prevent countless serious failures. However, the impending AI enhancements to CIS significantly change this landscape by introducing new forms of significant risk. Guidance for the evaluation of these risks and their control measures is scarce. Our article is dedicated to addressing this deficit with an evaluation tool backed by theory and example.
These horrific outcomes drive our objective: to propose and validate a conceptual theoretical framework for the evaluation of AIeCIS, comprising factors that matter to stakeholders at the time of evaluation. What matters to stakeholders of critical infrastructure (things like safety, service availability and cost efficiency) has been explained by a significant body of scholarship (Koppenjan et al., 2008; Van Gestel et al., 2008; Vuorinen and Martinsuo, 2019) as values. A model that systematically and proactively assists decision-makers to achieve a strong alignment between values pertinent to critical infrastructure and the fundamental objectives of that infrastructure, the Value-Focused Thinking (VFT) model, was formed by Keeney (1994, 1996a, 1996b) and developed in critical infrastructure case study research (Keeney and McDaniels, 1999; Pudney, 2010; Simon et al., 2014). The VFT model offers convenient insights into our objective of establishing a tool to assist theorists and practitioners evaluating AIeCIS. To meet that objective, our research is guided by the following question: In which ways do VFT, widely acknowledged governance principles for AI, and studies in critical infrastructure decision-making inform the development of novel conceptual theoretical frameworks for evaluating AIeCIS?
The remainder of this article is structured as follows. In Section ‘Background of the evaluation of AI-enhanced systems’, we set out the extant knowledge on evaluation of critical infrastructure which utilises AI. In Section ‘Methodology’, we explain and justify our research methodology. In Section ‘Findings and discussion’, we propose a framework to guide evaluations of AIeCIS and then we discuss the usefulness of the proposed framework to theorists and practitioners, with an example from Deutsche Bahn. Finally, in Section ‘Conclusions and future research’, we conclude by reflecting upon the extent to which our research has met the needs of evaluation theorists and practitioners, and we propose directions for further research.
Background of the evaluation of AI-enhanced systems
The widespread application of developed forms of AI to operational aspects of CIS has been a relatively recent phenomenon. This section provides a background on evaluation of AI-enhanced systems, currently thought by scholars to be relatively underdeveloped. We do this via a ‘selective review’ (Yin, 2016 [2015]: 72–73) examining reviews by others into the evaluation of AI-enhanced systems.
Arnold and Scheutz (2018) exposed the need for evaluation of modern AI systems and proposed a method for evaluation of AI in parallel with development but stopped short of proposing an evaluation framework. Our background investigations therefore start from 2018.
The evaluation community has proactively raised the need for the evaluation of AI used for policy and programme development. We refer to the workshop ‘Digitalization of Evaluations and Evaluation of Digitalization’ in 2023, and the set of publications dedicated to AI in evaluations (Nielsen et al., 2025).
Montrosse-Moorhead (2023) investigated evaluation criteria for AI use in evaluation. The article addressed a considerable oversight: previously there were no proposed criteria for evaluating AI use in evaluation, yet such use continues to rise. Montrosse-Moorhead applied Teasdale's Criteria Domains Framework to a group of evaluation studies to arrive at eight criteria domains. We note the author's comment on the importance of context in deriving appropriate evaluation criteria. Our investigations indicate that criteria applicable to the critical infrastructure context would include privacy and security aspects of AI algorithms because of the circumstances under which private citizens and other legal entities access critical infrastructure services. These criteria are not considered by Montrosse-Moorhead (2023), and rightly so in that application. In addition, the AI tools being considered for and applied to critical infrastructure have components that require sophisticated AI analysis and coding expertise to assess their trustworthiness. Evaluators could therefore be reliant on a technical expert for critical aspects of an evaluation. We therefore contend that the criteria specified by Montrosse-Moorhead (2023) are unsuitable for the evaluation of AIeCIS because some important criteria are missing.
Use of AI in evaluations and in evaluating critical infrastructure is interesting for the evaluation community for several reasons. First, low knowledge of AI makes it something evaluators either embrace or fear. Leeuw (2025) applies the realist evaluation approach to AI from the evaluators' perspective to shed light on the 'black box' in using AI and decrease the unknown. He sees a low understanding of AI among evaluators as one of the main causes of reluctance to use AI in evaluations. Our argument is similar. We believe a sound knowledge of AI is necessary for competent evaluation of AIeCIS, and one of the ways to achieve that is to use AI in evaluations. Nevertheless, evaluating AI in critical infrastructure could present a double 'black box' for evaluators of AIeCIS, since evaluators also need to understand the risks associated with critical infrastructure interventions. Second, in sympathy with Leeuw (2025), evaluators should investigate and familiarise themselves with the AI used in the CIS they evaluate.
Nielsen (2025) sought to answer two questions: ‘Will emerging technologies (ET) impact the evaluation industry?’ and ‘How will ET impact evaluation practice?’ In answering the first question, the author draws on Raftree’s insights that three waves of diffusion of ET can be distinguished in the evaluation industry. The first wave was characterised by the emergence of new data collection tools, such as mobile phones or geodata, in evaluation activities. The second wave included two main vectors: advanced data collection (e.g. drones, satellites) and data analysis methods (e.g. machine learning). The third wave summarised the previous developments and introduced new disciplines to the evaluation field, such as software development and data science. However, Nielsen (2025) emphasises that evaluation service commissioners play the main role in the development of ET in the evaluation industry. They formulate the demand for specific tools, instruments and technologies in the procurement specifications. If such demand is clearly expressed, ET becomes more widespread throughout the industry.
We examine Nielsen's response to the first question from the perspective of the evaluand, our main focus. Indeed, we expect the diffusion of ET will occur in a similar way, but the procurement of engineering into critical infrastructure, including software and hardware, will determine the rate and extent to which AI becomes embedded in CIS. As such, an evaluation framework directed to the evaluand would need to be cognisant of approaching waves of ET.
In response to the second question, Nielsen (2025) identifies five main factors on which the development of ET in the field of evaluation depends: (1) Competitive strategies of evaluation service providers, (2) Size and duration of evaluation contracts, (3) Nature of evaluation services, (4) Broadness and depth of capabilities of evaluation service providers and (5) Suitability of technologies for a specific evaluation task.
In addition, price competition and the desire to ensure the highest quality of evaluation also encourage providers to use efficient data collection tools. The application of ET may lead to shorter contract implementation terms, and it may be increasingly difficult for smaller service providers to compete with large companies that have access to advanced technologies and automation solutions. Nielsen (2025) predicts that monitoring systems will likely be automated and replace some of the tasks performed by humans. This is particularly relevant, as evaluators have often complained in the past that data quality in monitoring systems was dependent on the human factor (Dvorak, 2010). In the future, the skills of evaluation teams will be increasingly important, in particular data science.
Critical infrastructure is already experiencing such impacts. Critical infrastructure has embedded systems that continuously collect data and monitor and analyse those data, sometimes using sophisticated forms of AI. The outcomes are reported to dashboards with predetermined alarm levels. Hence, the evaluation of AIeCIS requires highly specialised personnel.
Leeuw (2025) has proposed a realist evaluation approach and makes arguments about its relevance to evaluating AI-enhanced interventions in real time. We agree with Leeuw that realist evaluation of AI interventions has the advantage of learning about limitations (and benefits) on the job and that 'human in the loop' strategies can substantially mitigate risk. However, when significant risks are presented, as is often the case with critical infrastructure, these human-based strategies by themselves are often insufficient. Opinions about AI in evaluations are divided. Some emphasise the risks and critically argue that AI can deepen social inequality (racial, gender and cultural) (Reid, 2023), pose a threat to marginalised minorities and erase alternative perceptions of the world (Head et al., 2023). Other evaluators, on the contrary, are more optimistic about the prospects for the application of AI; they emphasise that AI could enhance humanity and encourage trust in it (Sabarre et al., 2023). Meanwhile, Tilton et al. (2023) argued that it is necessary to include AI literacy training in evaluator training so that future and current evaluators can critically evaluate the application of AI models and reflect on the values hidden in AI algorithms.
The application of a set of standardised evaluation criteria, like those of the OECD's Development Assistance Committee (DAC) (OECD, 2021), makes some evaluations problematic (Qian-Khoo et al., 2022) because top-down evaluation criteria might fail to capture the specifics of the programmes. Moreover, without flexible application, such frameworks present a further challenge: their value content is often only assumed to be understood in the evaluation process (Teasdale, 2021), and actual understandings can differ.
Evaluators apply different evaluation criteria even when they are from the same field. A recent study by Mavrot et al. (2025) shows that the use of evaluation criteria differs depending on which evaluation teams carry out the evaluation, and for which purpose. This underlines the need for flexibility in how the criteria are applied, which our approach enables. Patton (2020) points out that the precision with which evaluation criteria are applied can give evaluators an illusion of meaningfulness and utility. We, on the other hand, see this flexibility as an advantage, enabling stakeholders to define the importance of each criterion according to their understanding and needs.
Nolt and Leviton (2022) recommend flexibility in defining evaluation criteria and accept that definitions of criteria will vary. Criteria should be defined before implementation, which should lead to more organic, authentic and transparent practice in applying them. Based on this experience, we define an evaluation framework but recommend its flexible application, enabling stakeholders to contribute their views and needs. This especially concerns complex systems, where a large variety of criteria may apply.

Numerous AI evaluation frameworks have been proposed in the computing research community since 2018. Nauta et al. (2023) proposed 12 conceptual properties of explainability arising from a review of 312 AI explainability studies, each of which depicts the evaluation of an explainability method. The categories proposed by Nauta et al. are useful for those seeking a deep understanding of the categories of explainability and how each type of explainability may be evaluated, but offer no contextualisation of the centrepiece framework to the fundamental objective of explainability, that is, to maintain stakeholder trust in AI. From this study, we can observe that the link between each of the 12 forms of explainability and trust in aspects of the services to which the AI is applied (such as safety, reliability, privacy of personal data, autonomy of service users, oversight of automation and trust in the continuing financial effectiveness and viability of the service) remains unaddressed.
Hedström et al. (2023) clearly articulate the fundamental purpose of explainable AI (XAI) and describe six evaluative dimensions of XAI: Faithfulness, Robustness, Localisation, Complexity, Axiomatic and Randomisation. Our main concern is that, because of the limited focus on neural network AI, application to broader contexts, such as AIeCIS, may not be valid. Hence, we believe further work is required to align fundamental service delivery objectives with each type of AI explainability to establish their relevance in different contexts.
A review of Trustworthy AI (TAI) literature reveals a relative scarcity of research into evaluation criteria for TAI (McCormack and Bendechache, 2024), especially in the development of metrics for evaluating the impacts of AI on society and the environment, but also in areas such as evaluating accountability, human agency and oversight. Evaluation of TAI has structural issues of conceptual clarity (Kowald et al., 2024), and there is a need to develop methods to evaluate the explainability of AI across multiple disciplines to build its trustworthiness in broader domains (Kaur et al., 2022; Kowald et al., 2024).
Ethical principles for the use of AI (EAI), reviewed by Jobin et al. (2019), revealed a significant increase in the number of ethical recommendations for AI proliferating in the public and private sectors over a short time frame. However, solutions to overcome the ethical challenges of AI differ according to context.
Our review of AI evaluation literature has found a dominance of principles-based guidelines promoted by the public sector, intergovernmental organisations and non-governmental organisations. These have become default evaluation frameworks for AI. Given the variety of philosophical standpoints found in XAI, TAI and EAI, and the multiple disciplines involved, it is not surprising that there is a divergence of opinion on the application of those principles and their meaning in different circumstances (Hagendorff, 2020; Jobin et al., 2019; Mittelstadt, 2019). We conclude that there may be a strong need for a more disciplined approach to evaluating AI.
Methodology
Essentially, the methodology of our study is that of conceptual research enabling us to build theory, forming an essential foundation for subsequent empirical research (Jaakkola, 2020). One part of conceptual research, known as theory adaptation, is to introduce an alternative frame of reference to an existing concept (Jaakkola, 2020).
Investigating AI principles in practice
We began by investigating whether the Principles for Trustworthy AI espoused by the OECD are mutually exclusive when applied to real cases. We collated, examined and discussed examples from Deutsche Bahn’s (DB’s) AI-enhanced traffic management system for this purpose.
The methodology continued with capturing and analysing representative data about the guiding principles of AI and associated objectives initiated from those principles. We applied Yin’s (2016 [2015]) selective review approach, essentially a review of highly cited reviews, to curate examples of AI principles-inspired real-world objectives covering each of the prevalent schools of thought on the application of guiding principles (XAI, TAI and EAI) to gain an understanding of the present application of the principles. This was not a systematic review, nor was that level of rigour thought to be necessary given this is not the main output of the study. The search terms ‘AI principl*’ and ‘review’ were applied to the Scopus and Web of Science databases and sorted by citation count. The search results were restricted to those items published since 2019 with citations greater than 100 according to Google Scholar. The only inclusion criterion was that the review had to include reference to examples of the real-world objectives manifested under the guiding principles of AI. Citation searching was then undertaken to locate additional literature, to ensure the main schools of thought were represented. To this, we added the very recent review by McCormack and Bendechache (2024). The list of included studies numbered 15.
Building a VFT-based framework
Our next step was to compile an alternative classification framework of fundamental objectives for critical services delivery derived with Keeney’s VFT. To do this, we searched for and reviewed studies where VFT was applied to decision-making in relation to critical infrastructure service delivery. The literature search approach is defined in Table 1. We extracted, from the included studies, the sets of hierarchically structured fundamental objectives, a partial example of which is illustrated in Figure 1. We utilised these to inform the development of an alternative framework comprising seven purposefully designed mutually exclusive classifications and 29 sub-classifications of fundamental objectives and analysed that outcome. We validated these fundamental objectives by checking their alignment with the principles of AI.
Table 1. Literature search for VFT case studies.

Figure 1. Example of hierarchically structured measurable fundamental objectives.
Within this literature, we searched for the objective or purpose behind applying each of the principles of AI, documenting each objective against the principle to which it relates. These objectives were then assessed as to whether they are means or fundamental objectives according to Keeney's VFT approach: fundamental objectives are those that represent core values, while means objectives are ways to achieve fundamental objectives.
The final step was to demonstrate the framework with an example. The recently developed Deutsche Bahn (DB) AI-enhanced traffic management system was chosen for this purpose.
Strengths and limitations of this methodology
The methodology consists of a series of data collection and analysis steps progressing from what is currently known about application of principles of AI towards an answer to the research question. The strength of the initial investigation into the application of AI principles at DB is that it is a real and recent case at the cutting edge of AI application. The main limitation is that it is a singular case and hence it is not appropriate to extend the findings to other cases.
The next part of the methodology, to curate examples of real-world objectives inspired by the AI principles, has the strength that the inspiring principles are not just the AI trustworthiness principles promoted by the OECD but also include AI principles emanating from AI explainability cohorts and the ethical AI schools of thought. The main limitation is that we have applied a selective review.
Our search for hierarchically structured, measurable fundamental objectives relating to critical infrastructure was of high quality. However, the data yielded has placed limitations on the analysis. The main issue is that the number of conforming studies is so low that some types of critical infrastructure, for example wastewater and telecommunications, are poorly represented. Another limitation is that the included studies are mainly from a time when the use of AI was in its infancy, so AI may not have been considered in the formulation of fundamental objectives and core values.
Findings and discussion
The results of our study are presented and discussed in this section interspersed with findings and discussion about a case study on Deutsche Bahn’s (DB’s) AI-enhanced traffic management system described in Annex 1.
Application of AI principles to Deutsche Bahn
We begin by discussing one of the challenges of using AI principles as the sole evaluative framework for the evaluation of AIeCIS. We introduce a Venn diagram (Figure 2) of DB's traffic management AI (tmAI). The box represents the universe of actions towards compliance with all AI principles. The black circles represent actions towards compliance with principles applicable to DB's tmAI, while the faint circle represents actions towards principles that do not apply to DB's tmAI, such as privacy. The sets of actions towards compliance with the principles of robustness of AI and transparency of AI are highly applicable to DB's tmAI, but they are not mutually exclusive. For example, DB requires the tmAI to be robust and tests the tmAI under many varied scenarios to confirm its robustness (Sturm et al., 2024). Confirmation of robustness is achieved at DB through transparency to relevant stakeholders, such as signal engineers and AI designers, involving those experts in the selection of test data, the design of the tests, carrying out the tests and analysing the results (Schneider et al., 2024). In this case, members of DB's tmAI development team and their collaborators wrote two separate articles describing DB's tmAI, subsequently published (Schneider et al., 2024; Sturm et al., 2024).

Figure 2. Venn diagram of AI ethical principles applicable to Deutsche Bahn traffic management AI.
Hence, we argue that actions towards transparency of robustness are simultaneously in the ‘robustness set’ and ‘transparency set’. Consequently, an evaluation of DB’s tmAI against the OECD Principles could potentially count the same data twice because of this overlap.
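The double-counting risk can be made concrete with a toy set calculation. In the sketch below, the action names are hypothetical rather than drawn from DB documentation; the point is only that summing evidence per principle counts overlapping actions twice, whereas the union counts each action once.

```python
# Toy illustration of double counting across overlapping principle sets.
# Action names are hypothetical, not taken from DB documentation.
robustness   = {"scenario testing", "stress testing", "expert test review"}
transparency = {"expert test review", "published system descriptions"}

per_principle_total = len(robustness) + len(transparency)  # overlap counted twice
distinct_actions    = len(robustness | transparency)       # each action counted once
print(per_principle_total, distinct_actions)               # 5 4
print("counted twice:", robustness & transparency)         # {'expert test review'}
```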
Actions towards transparency in relation to AI overlap with other principles too. For example, DB is transparent in its 2024 interim report to shareholders about AI diagnostics of images to ensure the system safety and security of freight car brakes and roof tarpaulins. The fact that DB's intentions regarding tmAI and AI diagnostics were reported to shareholders, some with associated timeframes, also demonstrates accountability to stakeholders. So, in relation to DB's tmAI, our finding is that it is possible for actions in support of the principles of AI to fall within more than one set of AI principles.
Investigating the application of AI principles
As a follow up, and in consideration of the fact that the DB case is but a single example, we sought additional data about the application of each of the widely accepted principles of AI. Table 2 was compiled for this purpose. The relevant AI principles, taken from XAI, TAI and EAI literature, are articulated in the first column, followed in the second and third columns by the objectives associated with each principle and their appearance in the literature.
Table 2. Principles of AI and corresponding objectives according to literature (items marked with # are means objectives).
Note that the means objectives indicated with the # symbol in Table 2 are defined as ways to achieve fundamental objectives; fundamental objectives are those that represent core values.
Table 2 provides an accurate representation of the many objectives found in practice that are inspired by the principles of AI. Our first observation concerns the number of individual principles (20, grouped into 13) that apply to the development and use of AI. These principles are not just convenient categories developed on the run; they originate from governments, agencies of governments and NGOs.
Each objective recorded in Table 2 is inspired by or aligned with a principle, and this link originates from the document in which the objective was articulated. Of the 98 individual objectives identified in Table 2, we assess 23 to be means objectives and the remainder to be fundamental objectives. These findings paint a picture of multiple principles that overlap in meaning and are interchanged in use. They are also indicative of a group of objectives that are not all ideally formulated. From an evaluation perspective, fundamental objectives are preferred because they provide a structure for explaining values expressed by stakeholders and can be used for evaluation against stakeholder values (Keeney and McDaniels, 1992).

While research institutions and public sector organisations have generally favoured the use of principles to guide AI development, scholars and theorists warn that this practice on its own has significant shortcomings. Agreement on the precise principles to apply is not straightforward given the diversity of people's moral beliefs (Gabriel, 2020). Nor can we assume a common understanding of the principles (Jobin et al., 2019). In effect, by adopting AI principles, such as those promoted by the OECD, as the only design guidance, designers are asking stakeholders in AI-enhanced systems to trust their interpretation (Hagendorff, 2020). Under circumstances where AIeCIS has received guidance only in the form of principles, it follows that evaluation is also problematic.

Our thoughts on this matter, supported by the findings of this study, go much further. We believe the principles being used to guide the development of AI provide an inadequate and confused classification system for evaluation research and practice because they overlap in scope, exacerbated by the lack of commonly accepted definitions for each of the principles. For example, in Table 2, objectives linked to the principles of Accountability (to identify and rectify processes that could potentially cause harm; to promote diversity) and Beneficence (to promote human well-being; to provide benefit to the environment) are similar to those associated with the principles of Safety (preventing harm to individuals or society), Fairness and Justice (to encourage respect for diversity, inclusion and equality) and Sustainability (to protect the environment and promote environmental well-being). The lack of agreed definitions and the uncertainty in scope lead to an inability to precisely outline metrics for determining the successful application of these principles, which in turn impacts the ability to operationalise evaluations. We therefore propose an alternative framework to support a structured evaluation approach, developed in Section 'Investigating cases of VFT in critical infrastructure to build an evaluation framework'.
Investigating cases of VFT in critical infrastructure to build an evaluation framework
This section utilises studies that rigorously apply VFT to critical infrastructure intervention decisions to build a novel evaluation framework via a review and analysis of the identified studies. Our review of fundamental objectives for critical infrastructure services is informed by 16 separate studies of decision-making in critical infrastructure: 8 in energy (2 of which concern hydroelectric energy specifically), 2 in transport, 2 in water, 2 in healthcare, 1 in gas infrastructure and 1 in wastewater. A total of 147 objectives were analysed. Our analysis yields 7 classifications and 29 sub-classifications of fundamental objectives, as shown in the left-hand columns of Table 3. The number of occurrences of each fundamental objective in the 16 items of literature is shown in brackets alongside. Only 8 of the 147 objectives did not align with any of the 29 sub-classifications.
Table 3. Alignment between fundamental objectives of CIS and principles of AI.
These fundamental objectives are compared with the 20 principles of AI compiled from our initial survey (Table 2) to assess the alignment between the fundamental objectives of critical infrastructure services and the principles for the implementation of AI. This is a rudimentary form of validation, illustrated in the sketch below. We note that, as a group, the fundamental objectives cover each of the 20 AI principles. An important point is that while the sub-classifications are typical for critical infrastructure services, we do not imply that they are fixed. They are an example only. The objectives are, and should always remain, unique to the services provided in response to stakeholder values and requirements, and will depend on contextual factors applicable to the situation.
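In minimal form, this rudimentary validation amounts to a coverage check: every principle should align with at least one fundamental-objective classification. The mapping below is a small illustrative extract under assumed alignments, not the full content of Table 3.

```python
# Minimal coverage check: does every AI principle align with at least one
# fundamental-objective classification? The mapping is an illustrative
# extract only, not the full Table 3.
ALIGNMENT = {
    "Service fulfilment":           {"Robustness"},
    "Health and safety":            {"Safety", "Beneficence"},
    "Financial effectiveness":      {"Accountability"},
    "Environmental sustainability": {"Sustainability", "Beneficence"},
    "Governance and compliance":    {"Transparency", "Accountability"},
}
PRINCIPLES = {"Robustness", "Safety", "Beneficence", "Accountability",
              "Sustainability", "Transparency"}

covered = set().union(*ALIGNMENT.values())
uncovered = PRINCIPLES - covered
print("all principles covered" if not uncovered else f"uncovered: {uncovered}")
```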
We assert that there is a strong similarity between the fundamental objectives of different types of CIS, and our findings support this. The VFT-inspired process of eliciting fundamental objectives requires us to engage with stakeholders about their values. It stands to reason that these values remain largely constant regardless of the critical service being accessed: safety is always a high priority, stakeholders care about sustaining the environment, and stakeholders expect the CIS to deliver effective services to a standard that reflects good value and to be available when needed.
Of 147 fundamental objectives, only 8 could not be classified into our framework of 7 categories. We list these below:
Promoting attraction of international investment and renewable technology;
Encouraging the equipment industries with a focus on technology and sustainability;
Reducing the institutional and market barriers for new technologies;
Minimise external dependence;
Ensure market transformation;
Maximising control over technology choice;
Improve staff qualifications;
Promote responsible decision making by others.
The first three come from the same study by Bortoluzzi et al. (2021), based on a portfolio of energy generation projects in Brazil. They are part of a drive to introduce distributed energy resources consisting of small-scale solar, wind turbines and energy storage systems, all of which would be located close to the consumer. In effect, these objectives are a means to increasing the availability and continuity of energy to distributed communities of Mato Grosso do Sul in Brazil. There are contextual factors relating to the declining capacity of existing hydroelectric sources driving this need (Bortoluzzi et al., 2021). The other five are similarly driven by unique contextual factors. Thus, a significant cause of misalignment of fundamental objectives for the set of CIS studies we examined appears to be the context of those studies.

Table 3 demonstrates that each of the fundamental objectives depicted in the proposed framework aligns with one or more principles of AI. Moreover, the proposed VFT approach delivers measurable clarity to those principles and provides fundamental objectives aligned with stakeholder values to guide their implementation and evaluation. Consequently, we assert that the evaluation of complex AIeCIS is made simpler and more effective by initially identifying fundamental objectives with stakeholder involvement. We offer the framework of fundamental objectives, against which AIeCIS may be evaluated, and the adapted VFT method as valuable outputs of our research.
An evaluation against this framework and in accordance with established and tested VFT practice would normally entail eight components: (1) Review the intervention context, (2) Engage with stakeholders, (3) Establish the intervention objectives, (4) Develop alternative solutions, (5) Develop the utility model, (6) Make the decision, (7) Monitor the intervention and (8) Manage the feedback. This process aligns most closely with the utilisation-focused evaluation model proposed by Patton (2011) in which the evaluator acts as a mentor or trainer of key users. The evaluator identifies and engages influential users to build and use evaluation processes and provides support to those individuals as and when needed. It also aligns with Keeney’s (1994, 1996a, 1996b) VFT method.
The evaluation framework presented in Table 3, combined with the proposed eight-component evaluation process, has several valuable features. It is a practical tool for evaluators of AI interventions in CIS that are locked into compliance with a principles-based edict, such as the IEEE standard for AI transparency. It enables the service provider to set up an evaluation use team and to define measurable objectives for the AI intervention informed by the proposed framework elements. In addition, Table 4 shows references to commonly used metrics for each of the fundamental objectives, developed, tested and validated by scholars with appropriate expertise.
Table 4. Examples of metrics for fundamental objectives.
In conclusion, the proposed evaluative framework demonstrably accommodates the fundamental objectives of 16 critical infrastructure interventions informed by Keeney's VFT. We posit that the resultant framework adapted from Keeney's VFT offers structured evaluation guidance in the form of common fundamental objectives aligned with typical stakeholder values and informed by evidence derived from 16 separate critical infrastructure intervention studies. We advance the case that our fundamental objectives framework aligns with principles of AI derived from the literature. Finally, we postulate that our framework satisfies objective measurability via references to tested and validated metrics.
Evaluation of Deutsche Bahn AI-enhanced traffic management system
The remaining task of our study is to provide an example of our values-inspired framework in action and to highlight its advantages for evaluators. We have chosen an AI intervention in CIS for which information is available. We first describe the AIeCIS development and implementation process and then indicate important aspects of evaluating the intervention. Please refer to Table 5, which sets out an eight-stage evaluation process beginning with an intervention context review.
Table 5. Evaluative oversight of AI interventions for critical infrastructure.
Step 1: We start with understanding the context
Deutsche Bahn (DB, 2024) is Germany's national railway, operating 33,464 km of track, serving 1837 million rail passengers and carrying 197.6 million tonnes of freight in 2023. Over the years, DB's networks and operations have become increasingly complex, with freight traffic in tonne-kilometres predicted to increase by 35 per cent between 2019 and 2040 (Kulikowska-Wielgus, 2024) and average passenger growth of 5.1 per cent per annum from 2024 to 2029 (IBISWorld, 2024). As a result, the manual planning and dispatching of trains will soon reach its limit (Digitale Schiene, 2022).
Ensuring the network's safety and efficiency through optimised traffic management is crucial. DB's proposed traffic management AI (tmAI) is an AI-based solution for optimising railway traffic management. It is an AI intervention to the manual traffic management system, planned for introduction into DB's Stuttgart Digital Node in 2029 (Digitale Schiene, 2025). At its core, the system employs a Multi-Agent Reinforcement Learning (MARL) approach, in which individual trains act as intelligent agents learning to navigate a shared network (Schneider et al., 2024). However, to trust such a complex AI-based system with safety-critical operations, its reliability must be rigorously evaluated. Explainable AI (XAI) methods suitable for Deep Reinforcement Learning (DRL) (Hickling et al., 2023) and MARL (Boggess et al., 2023) provide a pathway to understand, verify and ultimately trust the tmAI.
The German government, as part of its commitment to climate change targets, has committed to using more trains and less road and air traffic. The initiative is known at DB as Strong Rail, and the tmAI is an enabling technology for that initiative.
Comments about the evaluation of Step 1
An evaluation of this step must confirm the context is well researched and understood by the implementation team. A key aspect from an evaluative perspective is to ensure that stakeholders are clearly identified and that their strategic perspectives are understood well enough to determine their salience. Therefore, an evaluation would confirm adherence to an exhaustive process of identifying and understanding stakeholders. Stakeholders are not of equal salience. To this end, an evaluation ought to confirm that a salience assessment of each stakeholder has been completed by examining the dimensions of power, legitimacy and urgency proposed by Mitchell et al. (1997), as sketched below. This salience assessment would determine which stakeholders must be consulted in the subsequent steps. Our opinion is that the number consulted in the next stages should be of the order of five at most, because the level of complexity increases significantly with the involvement of each additional stakeholder.
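A salience screen of this kind might be recorded as simply as the sketch below. The ordinal scores and the two additional stakeholder rows are hypothetical assumptions for illustration; a real assessment would rest on the qualitative judgements of power, legitimacy and urgency discussed next.

```python
# Hypothetical sketch of a stakeholder salience screen using the three
# dimensions of Mitchell et al. (1997), each scored 0-3 by the team.
STAKEHOLDERS = {
    "DB management team":    (3, 3, 3),  # (power, legitimacy, urgency)
    "Passengers":            (2, 3, 2),
    "German government":     (3, 2, 2),
    "Freight customers":     (2, 2, 1),  # hypothetical additional rows
    "Rolling stock vendors": (1, 1, 1),
}
MAX_CONSULTED = 5  # keep the consulted group small, per the discussion above

def salience(scores):
    """Aggregate power, legitimacy and urgency into one simple rating."""
    return sum(scores)

ranked = sorted(STAKEHOLDERS.items(), key=lambda kv: salience(kv[1]), reverse=True)
for name, scores in ranked[:MAX_CONSULTED]:
    print(f"{name}: salience {salience(scores)} {scores}")
```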
In the case of DB, the most important stakeholder is the management team. They have the highest level of power due to their position in DB. They have legitimacy due to their responsibility for the viability of freight and passenger operations. They have urgency because the manual traffic management system is fast approaching its scalability limit. Another important stakeholder group, the passengers, have power in numbers and a legitimate need to reach their destinations on time. They also have urgency because the punctuality of passenger services has been declining in recent times. Their interests might best be represented by train traffic controllers. A third stakeholder group, the German government, has a stated interest in achieving greenhouse gas (GHG) reduction targets for 2030 and beyond and has announced its intention to do that through encouraging increased use of rail passenger and freight services. This represents its urgent mission. Its power resides in being the elected representative of the people of Germany, and its legitimacy comes from its reliance on railways as the most effective mode of transport from a GHG emission perspective and the only way of reducing GHG emissions at scale in the short term.
The evaluation should reveal sound process, not only in the identification of the three key stakeholders via stakeholder saliency rating, but also confirmation of the corporate strategy supporting why those stakeholders are the most appropriate focus. For example, our case study shows that DB’s corporate strategy supports the choice of the German government as a key stakeholder. In this case, DB’s Strong Rail strategy (DB, n.d.) is intimately tied to the German government’s GHG targets and advocates a growing use of rail transport in support of the achievement of GHG targets.
Consultation with stakeholders to develop a decision model – Steps 2, 3, 4 and 5
The objective of these steps is to arrive at the best intervention solution in consultation with the key stakeholders and in acknowledgement of their values. The process we recommend implementers of the DB tmAI adopt is to pose a series of questions that reveal the fundamental values important to stakeholders. This is done as a group so that stakeholders learn about each other's values. The values should be articulated, clarified and documented in an environment shared with other stakeholders. This will assist stakeholders to develop, and realise, shared strategic opportunities, which should also be explored and agreed.
The stakeholder participants should then be asked to articulate their fundamental objectives in front of each other and explain how those objectives relate to the values developed earlier. The objectives developed from critical infrastructure case studies, illustrated in Table 3, may be a useful prompt for this process. Keeney (1996a) describes how some objectives are a means to an end and others are ends in themselves (fundamental), and describes how stakeholders can be encouraged to identify fundamental objectives. Again, the objectives should be documented, noting that there may be different objectives for different stakeholders. Next, the stakeholders should discuss and explore the ways their fundamental objectives are, or could be, measured. The resources identified in Table 4 may assist this process. Often, useful data is already available, and it is a matter of sharing it so that the entire group can understand what is available and discuss metrics that may align more closely with the intent of the fundamental objectives.
An example relating to DB's tmAI, with fundamental objectives and their metrics taken directly from the Table 4 VFT framework, is given below:
1.0 Service Fulfilment
1.2 Maximise service effectiveness – Measured by percent of services on time (attribute se)
2.0 Health and Safety
2.1 Minimise human harm – Measured by person years of life lost/disability (attribute hh)
3.0 Financial Effectiveness
3.1 Maximise financial benefit – Measured in Euros (attribute fb)
3.2 Minimise financial cost – Measured in Euros (attribute fc)
4.0 Environmental Sustainability
4.1 Minimise climate impact – Measured in tonnes of CO2 equivalent (attribute ci)
7.0 Governance and Compliance
7.1 Maximise trust in systems – Measured with a composite trust metric (attribute ts)
The next part of this phase is to develop a utility model: a function of the individual attribute metrics in which stakeholder benefits and disbenefits are traded off; an illustrative form is given below. In this pre-implementation stage of the intervention, there must be intense consultation with key stakeholders who have sufficient technical expertise in AI. The stakeholder representatives and the implementation team must agree the weights assigned to each attribute.
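For illustration, when additive independence (discussed below) holds, the utility model reduces to the weighted form sketched here. This is a generic sketch rather than DB's actual model; the single-attribute utility functions u_i and the weights w_i are assumptions to be elicited from stakeholders.

```latex
% Illustrative additive multi-attribute utility function over the
% attributes listed above. Each u_i maps its metric onto [0, 1]; the
% stakeholder-elicited weights w_i are non-negative and sum to one.
U(x_{se}, x_{hh}, x_{fb}, x_{fc}, x_{ci}, x_{ts})
  = \sum_{i \in \{se,\, hh,\, fb,\, fc,\, ci,\, ts\}} w_i \, u_i(x_i),
\qquad \sum_i w_i = 1, \quad w_i \ge 0.
```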
It is then necessary to determine the most applicable type of utility model (additive, multiplicative or combinatorial). This is done by establishing the nature of independence between the individual model attributes (service fulfilment, health and safety, financial effectiveness and environmental sustainability). Additive independence between attributes implies that stakeholder preferences for the consequences of each attribute depend only on the level of each individual attribute and not on their manner of combination. Note that these attribute metrics are probabilistic: they have an expected distribution rather than a discrete value. The correlation between metric distributions must also be determined. The final part of this modelling is to run a Monte Carlo simulation of the utility function for each potential solution, as sketched below, and use the result to determine the most beneficial solution. This is a highly technical step, and it is possible that the implementation team does not recognise its own lack of technical expertise. It is critical for evaluators to have confidence in the technical competency of the implementation team in this respect.
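A minimal Monte Carlo sketch of this step is given below, assuming an additive model and independent attribute distributions. All distributions, weights and normalisation ranges are hypothetical placeholders rather than DB data; a real exercise would sample correlated attributes from a joint distribution and repeat the calculation for each candidate solution before comparing expected utilities.

```python
# Minimal Monte Carlo sketch of the utility comparison, assuming an
# additive model. All numbers are hypothetical placeholders, not DB data.
import numpy as np

rng = np.random.default_rng(42)
N = 100_000  # simulation draws per candidate solution

WEIGHTS = {"se": 0.3, "hh": 0.3, "fb": 0.1, "fc": 0.1, "ci": 0.1, "ts": 0.1}

def sample_solution(n):
    """Draw probabilistic attribute metrics for one candidate solution."""
    return {
        "se": rng.normal(92, 2, n),    # % of services on time
        "hh": rng.poisson(0.5, n),     # person-years of life lost/disability
        "fb": rng.normal(120, 15, n),  # financial benefit, EUR million
        "fc": rng.normal(80, 10, n),   # financial cost, EUR million
        "ci": rng.normal(-40, 5, n),   # tonnes CO2-e (negative = net saving)
        "ts": rng.beta(8, 2, n),       # composite trust metric in [0, 1]
    }

def utility(d):
    """Additive utility: rescale each metric onto [0, 1], then weight it."""
    u = {
        "se": d["se"] / 100.0,
        "hh": 1.0 - np.clip(d["hh"] / 10.0, 0, 1),   # less harm is better
        "fb": np.clip(d["fb"] / 200.0, 0, 1),
        "fc": 1.0 - np.clip(d["fc"] / 200.0, 0, 1),  # lower cost is better
        "ci": 1.0 - np.clip((d["ci"] + 100) / 200.0, 0, 1),
        "ts": d["ts"],
    }
    return sum(WEIGHTS[k] * u[k] for k in WEIGHTS)

u_a = utility(sample_solution(N))
print(f"Candidate A: expected utility {u_a.mean():.3f} (sd {u_a.std():.3f})")
```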
The implementation team, working with the stakeholders, must develop some possible solutions, decide on one of them using the utility model and implement it. In this case, solutions may include variations in the type of AI to use; they may also include a comparative non-AI solution.
Comments about the evaluation of Steps 2, 3, 4 and 5
Evaluation of the design of the AI intervention raises some interesting issues. The evaluation during design must include an assessment of the intervention implementation team's skills in determining fundamental objectives in consultation with stakeholders and in developing a utility model using their input. The components of this evaluation are articulated in rows 2 to 5 of Table 5. One of the largest risks during evaluation of the design stage is whether appropriate expertise is accessed by those implementing the intervention.
Evaluators are advised to ensure that a fit-for-purpose modelling process is diligently undertaken and that the process is characterised by genuine collaboration between stakeholders and expert implementers. Evaluators should also check the competence of the people involved, especially where modelling is concerned, and that the design stages are peer reviewed by competent reviewers. The models must be correctly designed, executed and analysed. The output of the utility models is stochastic rather than deterministic; this must be taken into account when the models are used and when their outputs are analysed.
DB is taking these steps in accordance with a detailed plan. The modelling includes simulations using real data inputs and will eventually lead to a real trial in Stuttgart Digital Node in 2029 before being implemented more widely across Germany.
Another key aspect is the identification and testing of alternative solutions. Keeney (1996b) indicates that the development and testing of many viable alternatives creates better solutions. Evaluators can check this process is being undertaken by requesting evidence of simulation and testing of alternatives.
In effect, the decision will become obvious through the process of identifying alternatives, modelling those and documenting the results. Stakeholders, some of whom would possess model performance assessment skills, would ideally participate in the decision-making process.
Steps 6, 7 and 8
Technical expertise is also needed in planning and implementing the intervention. There must be an implementation plan, with project controls, developed in consultation with stakeholders. During implementation, the cost, quality and time parameters defined in the plan are checked.
The final steps come after the intervention has been implemented, beginning immediately after the solution is in place. These steps are about monitoring the solution against utility criteria on a regular basis, performing an analysis and making improvements to the AI enhancement if needed.
In practice, this involves monitoring the measures defined at the outset, analysing the model outputs and continuously improving the AI-enhanced intervention, as sketched below. Again, we recommend a collaborative approach and the involvement of appropriate expertise for modelling and analysis. Any proposed improvements must go through a change management process to control the risks.
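As a minimal sketch of what this monitoring might look like in code, consider the threshold check below. The attribute names follow the DB example above, but the thresholds and period readings are hypothetical; a real system would draw on live telemetry and feed breaches into the change management process.

```python
# Hypothetical sketch of periodic monitoring against agreed thresholds.
THRESHOLDS = {
    "se": ("min", 90.0),   # % of services on time must stay above 90
    "hh": ("max", 0.0),    # any human harm triggers review
    "fc": ("max", 100.0),  # EUR million cost ceiling
    "ts": ("min", 0.70),   # composite trust metric floor
}

def review(readings):
    """Return the attributes that breach their threshold this period."""
    breaches = []
    for attr, (kind, limit) in THRESHOLDS.items():
        value = readings[attr]
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            breaches.append(f"{attr}: {value} breaches {kind} limit {limit}")
    return breaches

period = {"se": 88.4, "hh": 0.0, "fc": 97.2, "ts": 0.74}
for breach in review(period):
    print("ALERT:", breach)  # escalate via the change management process
```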
Comments about the evaluation of Steps 6, 7 and 8
Evaluators must check that there is a team with sufficient expertise responsible for monitoring the performance of the AI intervention. In the case where modifications to the intervention are to be undertaken, the evaluation aspects covered in Steps 2, 3, 4 and 5 are also applicable here.
Our final finding is that AI trust is a sleeping giant issue (Habbal et al., 2024). The travelling public presently places an enormous quantum of trust in DB's traffic management. This trust is valuable but could be lost with a single accident caused by poorly designed AI, poorly trained AI or inappropriate management of AI. The potential downside in relation to trust is so large that it could inflict crippling losses on the DB business.
Closing comments on the novel values-based evaluation framework
Numerous researchers have applied VFT as a decision-making framework for critical infrastructure cases (Bortoluzzi et al., 2021; Coelho, 2013; de Souza, 2012; Eskafi et al., 2020; Haydt et al., 2013; Höfer and Madlener, 2020; Keeney et al., 1996; Keeney and McDaniels, 1992, 1999; McDaniels, 1994; Morais et al., 2013; Neves et al., 2009; Oliveira et al., 2023; Pudney, 2010; Smith and Dhillon, 2018; Smith et al., 2018). Their work has informed the development of the Table 3 fundamental objectives. Of these, none have adapted VFT as an evaluation framework. We posit that these examples support the adaptation of VFT, a stakeholder values-based multi-criteria decision analysis method, to evaluation of AI interventions in critical infrastructure systems. We summarise the guidance for evaluative oversight derived from this theory adaptation in Table 5.
While Keeney's VFT-inspired approach provides an encouraging framework of fundamental objectives for reference by evaluators, our research is somewhat limited by the scarcity of evaluative studies of the application of AI in critical services and the associated lack of data. Consequently, a repeat of this study when more data are available would be advisable. These are, after all, the early stages of implementing AI technology into society's critical services.
Conclusions and future research
Our conceptual evaluation framework and eight-step approach is the first to align the evaluation of an intervention with stakeholder values using Keeney's Value-Focused Thinking. We demonstrate the potential impact of this novel framework through a specific focus on the evaluation of AI-enhanced Critical Infrastructure Systems interventions, which we know to be on an exponential growth trajectory.
We established through a review of literature that a substantially common group of AI guiding principles is being promoted by multiple public sector organisations, non-governmental organisations and intergovernmental organisations. These guidelines are being used extensively as default implementation and evaluation frameworks, though scholars warn of problems with doing so. Our Venn diagram analysis of an example from the railway industry revealed that AI guiding principles provide a confusing framework for evaluators because the principles overlap and compound one another.
By identifying, summarising and analysing the objectives of AI principles extracted from recent literature, we established that, for AIeCIS, the governing principles of AI could be represented under an alternative framework based on Keeney's VFT. We demonstrated that this alternative framework delivers measurable clarity to those principles, provides fundamental objectives to guide their implementation and evaluation and aligns those objectives with relevant guiding principles of AI. We also demonstrated the use of the framework on a case study from the railway industry. The demonstration illustrated how our values-based framework and eight-step evaluation process can be used to create a self-evaluation tool. It also provided some guidance for evaluators who mentor others in using the framework. The alternative framework was therefore found to be simpler and more effective in comparison with the widely available AI guiding principles.
Future research directions
Our novel framework provides multiple avenues for future research into the evaluation of AI trust. The framework has a trust sub-classification within Governance and Compliance, and this could be separated into sub-categories such as trust in (a) service fulfilment, (b) health and safety, (c) economic effectiveness, (d) environmental sustainability, (e) economic well-being, (f) socio-economic well-being, (g) system privacy and security and (h) human autonomy and oversight. Ways to measure each of these forms of trust are yet to be fully developed and may vary somewhat according to the type of critical infrastructure, the stakeholder extending trust (or withholding it) and other contextual information. Further, there is a trade-off between trust and other criteria, such as the cost of developing trust. How do we construct a utility function involving trust, and what does trust trade off against? Given the uncertainty of risk in relation to trust, and its potential for negative impact, we assert the need for urgent research into measuring trust using the proposed conceptual framework.
Another potential research area relates to complexity. We touched on this in the body of our work. Critical infrastructure systems are complex, and the ways in which we model simple systems differ from the ways we model complex systems. It is possible, even quite likely, that the framework may need modification to accommodate complexity. One approach would be to consider the effects of feedback; another would be to consider the impact of neighbouring systems.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship and/or publication of this article: This work was supported by the European Cooperation in Science and Technology [grant number CA20112].
