Abstract
Safety critical functions of the engineered railway need to perform at levels of integrity that are so high that an acceptable failure rate cannot be demonstrated through testing alone. Where such functions need to be implemented in complex programmable electronic systems, certain design, build and test requirements are defined in technical standards, and these are deemed to ensure that the correct level of systematic integrity is achieved. These approaches are based on assumptions around how system requirements are managed and delivered which are increasingly challenging to meet in practice. In particular, the V&V lifecycle used in functional safety standards and emerging cyber security design standards is idealised: it assumes a top-down cascade of requirements for each delivery project. The approaches have become the de facto standard internationally and are now mandated to an extent in European railway safety regulations. This paper proposes a different approach: a new lifecycle model that aligns better with the reality of the modern global supply chain and the order in which asset design and project delivery activities are actually undertaken, in order to improve the ability to proactively manage safety. This leads to a fundamental change in the assurance philosophy, bringing a simpler and more understandable approach. A framework for applying this approach is set out along with further research objectives to deliver the solution in practice.
Introduction
The railway was traditionally built from electro-mechanical systems whose function was relatively simple.1 Members of railway staff were also time-served, with a general degree of understanding of all aspects of railway function.2 Because of this there was a good local understanding of the railway’s function, both in normal operation and under failure conditions. On the modern railway, software systems are now being designed in localised pockets of expertise based in key locations around the world.3,4 For local railway staff, systems arrive as ‘black box’ commercial off-the-shelf (COTS) systems5 and therefore the same degree of understanding does not exist locally. This loss of knowledge and lack of transparency makes it increasingly difficult for those who own and operate the system to build and maintain a high degree of assurance of its safety.
Major accidents can occur on the railway. To ensure that railway assets are designed, built and operated safely, there are stringent regulations and standards in place. In Europe, requirements are set in two high level directives6,7 which are implemented in the national legislation of each member state. In support of these, a number of lower level requirements also exist. One of these is the regulation for the common safety method for risk evaluation and assessment (CSM RA).8,9 It requires that the responsible party determines whether the introduction of new technology into the railway is a ‘significant’ change. If a change is deemed ‘significant’ then a structured risk management process needs to be applied, evidenced and assessed by an independent body. The legislation recognises that various actors have a part to play in bringing complex railway technical systems into safe operation on the railway network. Asset manufacturers typically act as the responsible party for ‘placing in service’, i.e. for ensuring that the equipment is good as a product and fit to be sold for its intended application. The ultimate user of the equipment must put the system ‘in use’ and ensure that all necessary safety requirements for its operation and maintenance are met in situ. Effective transfer of risk information, and transparency between the actors, is critical to the achievement of a safe outcome. The detailed approach to meeting these regulatory requirements is set out in a number of specific safety engineering and functional safety standards. The risk management standard for the railway is EN50126, which is in two parts.10
The regulatory process includes particular requirements for ‘Technical Systems’.11 A ‘technical system’ means a product or an assembly of products including the design, implementation and support documentation: for example, new signalling systems or units of rolling stock. The development of a technical system starts with its definition and requirements specification and ends with its acceptance. Although the design of relevant interfaces with human behaviour is considered, human operators and their actions are not included in the technical system.
The regulation itself is silent on how to meet the requirements associated with the safety functions of the ‘technical system’. The most widely accepted technical standard that does so is the railway functional safety standard, EN50128,12 which is linked to the wider risk management process set out in EN50126. EN50128 is the railway version of the widely adopted process functional safety standard.13
The safety lifecycle
The safety engineering approach described in EN50126 and embedded in EN50128 is based upon the application of a ‘waterfall’ approach to verification and validation. The representation of the cascading process takes on the shape of the letter V (see Figure 1).
Figure 1. Key steps in the verification and validation development lifecycle (after EN50128).
Reference 14 describes the approach as it relates to software thus:
“Verification: the process of determining whether or not the products of a given phase of the software development cycle fulfil the requirements established during the previous phase. Validation: the process of evaluating software at the end of the software development process to ensure compliance with software requirements.”
More informally Boehm describes the terms via two questions. For verification the question is: “Am I building the product right?” For validation the question is instead “Am I building the right product?”
Descending the left hand side of the ‘V’, the process describes how the system designer decomposes its requirements to lower and lower levels of abstraction, verifying at each stage that the decomposition is correctly done. Then, ascending the right hand side of the ‘V’, each sub-system and lower level design realisation is validated against the appropriately decomposed specification that was previously produced. In this way the presence of design errors that would lead to systematic faults is continually checked for, and their existence minimised. The process is conceptually clear, but it is based on a number of assumptions that are increasingly under challenge, namely:
- That a design is undertaken under the strong control and authority of a single central design authority.
- That activities happen in a fixed, logical and sequential order.
- That the competence is in place to fully understand and interpret requirements and their validation evidence, across multiple separate teams and organisations.
Safety architecture of high integrity rail systems
The overall risk management framework defined in the CSM RA encompasses system definition, hazard identification and risk assessment, and the definition, implementation and testing of safety requirements. The body of evidence that this activity has been done is typically referred to as a ‘safety case’, although the regulation does not use this term. The particular requirements relating to the safety of ‘technical systems’ are a subset of these requirements and there are specific approaches to develop and address them.
A revision to the CSM regulation (9) and its associated guidance set out a number of core safety critical functions of the railway. These are listed in Annex 1 of (11) and include for example:
1. Total or partial loss of braking effort.
2. Correct movement authority not enforced by the train.
3. One door being unlocked (with train crew not correctly informed of this door status).
4. One door released and opened in inappropriate areas (e.g. wrong side of train) or situations (e.g. train running).
Each of these functions is assigned a severity class, i.e. (a) or (b) in point 2.5.5 in the Annex of the regulation. Items 1 and 2 above are examples of Category (a) failures, defined as: “a failure that has a credible potential to lead directly to an accident typically affecting a large number of people and resulting in multiple fatalities, the associated risk does not have to be reduced further if the frequency of the failure of the function has been demonstrated to be less than or equal to 10⁻⁹ per operating hour.” Items 3 and 4 above are examples of Category (b) failures, defined as: “where a failure has a credible potential to lead directly to an accident typically affecting a very small number of people and resulting in at least one fatality, the associated risk does not have to be reduced further if the frequency of the failure of the function has been demonstrated to be less than or equal to 10⁻⁷ per operating hour.”
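To give a sense of scale for these targets, the short sketch below estimates how many failure-free operating hours would be needed to support each rate by testing alone. It is an illustration rather than part of the regulation: it assumes a constant failure rate (an exponential model) and a 95% confidence level, and the function name and the confidence value are choices made here for the example.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def failure_free_hours_needed(target_rate_per_hour: float, confidence: float) -> float:
    """Failure-free operating hours needed to support a claimed failure rate.

    Assumption: failures arrive at a constant rate, so the probability of
    observing no failure in T hours is exp(-rate * T). Setting that equal to
    (1 - confidence) and solving for T gives T = -ln(1 - confidence) / rate.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    if target_rate_per_hour <= 0.0:
        raise ValueError("target rate must be positive")
    return -math.log(1.0 - confidence) / target_rate_per_hour


for label, rate in (("Category (a)", 1e-9), ("Category (b)", 1e-7)):
    hours = failure_free_hours_needed(rate, confidence=0.95)
    years = hours / HOURS_PER_YEAR
    print(f"{label}: {hours:.2e} failure-free hours, about {years:,.0f} unit-years")
```

Under these assumptions the Category (a) target needs roughly 3 × 10⁹ failure-free hours, which is on the order of hundreds of thousands of years of operation for a single unit. This is why evidence other than testing is relied upon at these levels of integrity.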
Both random and systematic failures need to be considered. A random failure is a failure whose occurrence is unpredictable in the absolute sense, but is predictable in a probabilistic or statistical sense. This is the domain of traditional reliability engineering. A systematic failure is a failure that is not determined by chance but is introduced by an inaccuracy or design flaw inherent in the system. Such failures occur repeatedly in the same set of circumstances. Software failures are always systematic, as software is a collection of instructions to a machine. Because there is a large state space of data inputs and outputs, such errors cannot be exhaustively tested for and may remain undiscovered in a system until a particular set of system inputs arises.
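The size of the state space can be illustrated with simple arithmetic. The sketch below assumes a hypothetical function whose behaviour depends on 64 bits of input data (for example two 32-bit values) and a hypothetical test rig running one million test cases per second; both figures are chosen here purely for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def exhaustive_test_years(input_bits: int, tests_per_second: float) -> float:
    """Years needed to try every input combination once at the given test rate."""
    combinations = 2 ** input_bits
    return combinations / tests_per_second / SECONDS_PER_YEAR


# Hypothetical case: 64 bits of input, one million tests per second.
years = exhaustive_test_years(input_bits=64, tests_per_second=1_000_000)
print(f"Exhaustive testing would take about {years:,.0f} years")
```

Even this understates the problem, since the calculation ignores internal state and the timing of inputs, both of which enlarge the space that a systematic fault can hide in.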
Safety integrity levels (SILs) (table from IEC61508 part 1, page 34).
For systematic software failures, SILs simply indicate which particular software design measures, approaches and roles are deemed necessary to attain the required level. Any practical link between the application of the standard and the failure rate actually achieved is not clearly proven.17
One critical aspect of compliance with the standards is the design of an appropriate system architecture. Partitioning and duplication of system functions is required in some circumstances to deliver high integrity. A given function is implemented multiple times in different ways. Residual software failures can then be detected and masked by comparing the outputs of these multiple systems and discarding outputs that are inconsistent. Different approaches to ‘voting’ can be used depending on the application requirements. For example, for SIL 4 system functions a ‘two out of three’ (2oo3) voting system might be required (see Figure 2). Three diverse channels are created to deliver the same specified output, but each is realised independently through separate technology and/or technical expertise.
Figure 2. Two out of three voting architecture (diagram from IEC61508 part 6).
Such approaches are generally highly recommended for safety critical software and are in many cases an essential feature of the system architecture.
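The voting step can be sketched in a few lines. The example below is a minimal illustration of two out of three voting, not an implementation taken from IEC61508 or any railway product. It assumes the three channel outputs can be compared for exact equality, whereas real systems also have to handle timing and tolerance. The names `VoteResult` and `vote_2oo3` are hypothetical.

```python
from dataclasses import dataclass
from typing import Hashable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class VoteResult:
    agreed: bool                 # True when at least two channels match
    value: Optional[Hashable]    # the majority output, or None if no agreement
    dissenting: Tuple[int, ...]  # indices of channels that disagreed


def vote_2oo3(outputs: Sequence[Hashable]) -> VoteResult:
    """Two out of three vote over the outputs of three channels."""
    if len(outputs) != 3:
        raise ValueError("a 2oo3 voter needs exactly three channel outputs")
    # Any value held by at least two of three channels must appear in one of
    # the first two positions, so only those candidates need checking.
    for candidate in outputs[:2]:
        supporters = [i for i, out in enumerate(outputs) if out == candidate]
        if len(supporters) >= 2:
            dissenting = tuple(i for i in range(3) if i not in supporters)
            return VoteResult(True, candidate, dissenting)
    # No two channels agree: the caller should fall back to a safe state.
    return VoteResult(False, None, (0, 1, 2))


result = vote_2oo3(["stop", "stop", "proceed"])
print(result)
```

The voter masks a fault in a single channel, but it offers no protection if two channels share the same design error, which is why the diversity of the channels matters as much as their number.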
Emerging weaknesses of the current approaches
The evidence for mitigating the risk from systematic failures is fundamentally the evidence of robust implementation of a clearly defined and formal waterfall development process for verification and validation. Compliance with this approach is becoming ever more critical as digitalisation creates more potential for systematic failures. However, rapid technological evolution is undermining a related set of assumptions that underpin the model:
• The model assumes that there is an overarching entity in control of the design. In reality the core platform is usually developed by integrating a range of different sub-systems into the railway, under control of a centralised computer system. The sub-systems are often developed by sub-supplier companies following their own verification and validation approaches independently of the project. The sub-system design is one step further removed than the asset platform design from an understanding of the operational safety requirements. This creates the possibility for miscommunication, misunderstanding, or loss of documented assurance of safety requirements.
• The V & V lifecycle assumes a fixed sequence of activities throughout the design, implementation and test of the system in its entirety. This way of working, the ‘waterfall’ method, is no longer the default approach in software development, which creates a mismatch of method. As already mentioned, different parts of the development are undertaken at different times. Also, agile approaches to software development are based on a less structured approach with iterative sprints to build a functional and user centred system.18
• The approach of certifying to a SIL at the sub-system level is sub-optimal. The SIL concept is intended to be applied to functions, not systems; the integrity of the function should be assured with respect to a functioning train, in which the sub-system has been integrated and configured for its particular use.
• As regards architectural design of the system, duplication of system hardware requires significant additional work and cost and requires rare, highly skilled resource and expertise. Even if it is possible to have multiple teams of the right level of skill and experience, it is difficult to ensure that their design solutions and implementations are truly diverse. Common specifications and design assumptions might be cascaded to these teams, and any common supply chain elements used will undermine the ability to build a high integrity solution.
• The platform will form the core basis of a wide range of different applications, each with its own operational use case. The delivery project requires local adaptations to national standards and local operating rules and constraints. Ultimately, safety and security requirements can only be truly and fully understood when a system is considered in its actual operating environment.
Together, these issues create a greater opportunity for systematic failures to exist and remain undetected, and for the effectiveness of assurance to be undermined. It is an accepted principle that engineered systems must be safe and secure by design.19–22 However, safety and security requirements analysis work often only begins in earnest in order to meet final authorisation deadlines, rather than proactively, to improve the inherent safety of the product. This approach leads to project delays and increased costs. It also creates the potential for unnecessary residual risk caused by sub-optimal design decisions made under delivery pressure and against a backdrop of sunk costs.
Many of the difficulties highlighted above have been raised in other sectors.23–25 They were tragically evident in the causation of the crashes of the Boeing 737 Max aeroplane in Indonesia and Ethiopia in 2018 and 2019, in which 346 people died. The immediate cause of those accidents was determined to relate to the aircraft’s Manoeuvring Characteristics Augmentation System (MCAS), which was designed to adjust the horizontal stabilizer trim to push the nose down so that the pilot would not inadvertently pull the aeroplane up too steeply, potentially causing a stall. In both crashes it was determined that the MCAS was activated by erroneous indications from its sensors, which were not duplicated in the design to enhance functional integrity. The investigation26 found that “the MCAS was not evaluated as a complete and integrated function in the certification documents that were submitted to the FAA.”
It also found that: “The lack of a unified top-down development and evaluation of the system function and its safety analyses, combined with the extensive and fragmented documentation, made it difficult to assess whether compliance was fully demonstrated.”
Emerging challenges: Cyber security and safety expectations
In addition to unintentional safety flaws, digitalisation brings a whole new threat: malicious intrusion into networked systems. The emergence of cyber security vulnerabilities must also be managed in the design, build, operation and maintenance of complex railway technology. Standards and legislation to manage the risks of cyber security have developed with a degree of independence and separation from the systems and approaches to manage safety risk. Security and threat risk management standards have arisen27–29 which broadly follow a ‘plan, do, check, act’ management framework and a V & V lifecycle of the same type as that specified in EN50126/8, and therefore many of the challenges set out here are relevant to cyber assurance as well. More specifically, in the UK, the Department for Transport stresses that all risks must be managed according to the usual legislative safety management and risk acceptance principles: the subset of security issues with safety implications must therefore be considered within existing, mandatory safety assurance activity. This implies a degree of integration in how safety and security requirements are developed and met. However:
• The approaches to architectural design are different: security levels require a zoning approach28,29 that is different to the concepts of redundancy associated with SIL assurance.
• There are practical and cultural conflicts: good safety culture requires the open sharing of safety information to support learning.30–32 However, there is typically much more secrecy around security information.
• Cyber security risks are characterised by rapid evolution. This manifests in systems design as continual update of software. This rapid update must be reconciled with the need for robust and stable safety systems to minimise the chances of introducing systematic safety failures.
• As risks are being deliberately created by ‘threat actors’, traditional safety engineering and reliability methods, based on randomness, may no longer be valid, and the legislative assumption that the person who creates the risk must manage it founders.
Some of these challenges are explained in detail in a code of practice produced by the Institution of Engineering and Technology.32 It should also be noted that railway safety performance has improved significantly over recent years.33 As a result, there is now comparatively less practical experience of the occurrence of major accidents than in previous decades. Based on the significant work on ‘societal concern’ (Hoyland, 2018; Bearfield, 2014) it is known that the travelling public has a very low tolerance for rail accidents (Van Gulijk, 2018). The sector needs to ensure that new systems are at least as safe as the simpler and better understood technologies they are replacing, and that the new emergent risks are mitigated as effectively as the old.
Improved model: The safety/security STAIRCASE model
The emerging technological issues set out above pose a fundamental challenge to the applicability of the classical V & V lifecycle model for safety and cyber security engineering, and to the assurance it provides. A new model is needed which creates the environment for meaningful and productive engagement on the emerging risks and design challenges set out here. This paper proposes a revised assurance lifecycle model, the safety ‘STAIRCASE’ (see Figure 3).
Figure 3. The safety STAIRCASE lifecycle model.
The left-hand side boxes show the different generic organisations responsible for determining the system and its requirements. Each has a different role to play sequentially, in ensuring that robust safety and security requirements are identified and implemented. The blue boxes indicate the type of safety case produced at key project lifecycle phases (the phases are annotated in bold italics). The bold downward lines indicate the source of fixed safety requirements for each safety case. The upwards arrows indicate the source of downstream requirements that need to be checked against the prevailing fixed requirements.
There are some similarities between the concepts set out here and the hierarchical concept of a Generic Product Safety Case, a Generic Application Safety Case and a Specific Application Safety Case as outlined in EN50126.10 Both recognise the fact that V & V activities have layers, different owners, and a natural temporal place. However, the STAIRCASE model is based on the idea that each responsible party must consider all requirements to the level that they are able to, at the point in delivery where they are the lead organisation.
Outline safety case
In the spirit of safety by design, the proposed framework recognises the critical importance of effectively identifying key safety requirements as early as possible, in order to de-risk project delivery and ultimately achieve the best outcome. In particular, the pre-contract safety case creates a commercial incentive to enhance safety and security by design and to address the emerging design assurance issues described in this paper. It also creates additional pressure for system architectures to evolve to meet the rapidly evolving digital assurance risks.
The first significant evaluation of safety should be a part of the tender process, and a basis on which the contract is awarded. The safety case would in effect be a first iteration through the risk management process already defined in the CSM RA regulation or its equivalent, focussing on the requirements within the design control of the manufacturer. This should not create significant additional work, as the ‘first of type’ platform analysis should provide the bulk of the ‘Reference System’ evidence that is legally required for subsequent safety demonstrations. Perhaps the most significant change to address is that teams evaluating bids would need the technical competence to evaluate such safety information at that early stage. Input from experienced operators in the local domain of application is highly valuable here too, as it provides an opportunity to determine whether any local application changes are needed before the design is frozen. Creating a formal stage-gate here would help to get the right level of engagement early on and create the incentives to make this happen.
Preliminary safety case
With the outline safety case and core argument understood, early project work can focus on identifying any location specific changes or adaptations that might have been missed. This requires early engagement with the future user/operator on operational risks and controls. Clear safety requirements can then be cascaded into the tier 1 and tier 2 supply chain, enhancing compliance, project delivery and assurance.
Validated safety case
Producing the validated safety case should be a relatively well-defined and simple process associated with the key regulatory stage gate. It should be about gathering the necessary information to evidence the safety argument and provide assurance that all is already in place. The approach makes this a more mechanical process, bringing greater assurance and ensuring that there is a clear audit trail for the safety argument and a solid basis for risk transfer into the operation and maintenance phases.
Fundamentally, the approach is based on strengthening the ownership of the whole project at the concept stage, and on the ultimate ‘owner’ taking overall accountability for the whole assurance process. This should have many benefits as regards getting things right first time, and importantly it should strengthen the overall approach to systems integration, as there is a controlling mind for the process and its application.
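One way to picture the audit trail across the three safety cases is as a simple traceability check at each stage gate. The sketch below is only an interpretation of the model for illustration. The class and function names, the lead organisations and the requirement identifiers are all hypothetical, and the two rules it applies (requirements fixed upstream must be carried forward, and every requirement must have evidence at the validated stage) are assumptions made here rather than rules stated by the framework.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class SafetyCase:
    stage: str                    # "outline", "preliminary" or "validated"
    lead_organisation: str        # hypothetical label for the lead party
    requirements: Set[str]        # requirement identifiers held at this stage
    evidence: Dict[str, str] = field(default_factory=dict)  # requirement -> evidence reference


def gate_findings(upstream: Optional[SafetyCase], current: SafetyCase) -> List[str]:
    """List the problems that should block the stage gate for `current`."""
    findings: List[str] = []
    if upstream is not None:
        # Requirements fixed at the earlier stage must not silently disappear.
        for req in sorted(upstream.requirements - current.requirements):
            findings.append(
                f"{req}: fixed at {upstream.stage} stage but missing from {current.stage} safety case"
            )
    if current.stage == "validated":
        # At the final gate every requirement needs supporting evidence.
        for req in sorted(current.requirements - set(current.evidence)):
            findings.append(f"{req}: no evidence recorded")
    return findings


# Hypothetical example data.
outline = SafetyCase("outline", "manufacturer", {"REQ-1", "REQ-2"})
preliminary = SafetyCase("preliminary", "delivery project", {"REQ-1", "REQ-2", "REQ-3"})
validated = SafetyCase(
    "validated", "operator", {"REQ-1", "REQ-3"}, evidence={"REQ-1": "test report A"}
)

print(gate_findings(outline, preliminary))
print(gate_findings(preliminary, validated))
```

In this example the preliminary gate raises no findings, while the validated gate reports that REQ-2 has been dropped and that REQ-3 has no evidence, which is the kind of gap an early and explicit safety argument is intended to expose.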
Case study: The ETCS Cambrian line failure
In 2017, a train driver travelling on the Cambrian Coast line in North Wales, UK, reported a fault with the information provided on his in-cab display. Temporary speed restrictions were not being transmitted to several trains under the control of the signalling system. The temporary speed restrictions were required on the approach to seven level crossings to provide level crossing users with sufficient warning of approaching trains so that they could cross safely. The line was equipped with a pilot installation of the European Rail Traffic Management System (ERTMS), a form of railway signalling which transmits signalling and control data directly to the train. Investigation by the local maintenance staff found that the signalling system had stopped transmitting temporary speed restriction data after it had experienced a shutdown the previous evening. The signallers had no indication of an abnormal condition, and the display at the signalling control centre (on the ‘poste de GEstion des Signalisations Temporaires’ or ‘GEST’ system) wrongly showed these restrictions as being applied correctly. The UK Rail Accident Investigation Branch (RAIB)37 undertook an investigation. It found that:
• An automated software reset occurred when the equipment requested part of a movement authority that it had previously released for use by another train.
• Temporary speed restriction data was not uploaded to the signalling system after the software reset, because the external database of signaller information had entered a fault condition.
• The system was not designed to provide any indication to signallers that the system had failed.
• The memory used for storing temporary speed restrictions in the Radio Block Centre (RBC) was volatile, allowing temporary speed restriction data to be lost during a rollover.
• The required level of safety integrity for validation of temporary speed restriction data uploaded to the RBC following a rollover was not achieved by the design.
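The findings above describe a system that lost data on restart and reported nothing to the signallers. The sketch below illustrates, in a deliberately simplified form, the kind of read-back check whose absence those findings point to. It is not a description of the actual ERTMS, RBC or GEST design. The function name, the restriction identifiers and the idea of comparing against an independently held record are all assumptions made for illustration.

```python
from typing import FrozenSet, Optional


def tsr_status_after_restart(
    expected: FrozenSet[str],
    held_by_rbc: Optional[FrozenSet[str]],
) -> str:
    """Status to present to signallers after a restart of the RBC.

    `expected` is the set of temporary speed restrictions (TSRs) that should
    be in force according to an independently held record. `held_by_rbc` is
    what was read back from the RBC, or None if the read-back failed.
    """
    if held_by_rbc is None:
        # Failing to confirm is treated as a failure, never as success.
        return "ALARM: TSR data could not be read back from the RBC"
    missing = expected - held_by_rbc
    if missing:
        return "ALARM: TSRs not held by the RBC: " + ", ".join(sorted(missing))
    return "OK: all expected TSRs confirmed"


# Hypothetical identifiers for restrictions on the approach to level crossings.
expected = frozenset({"TSR-01", "TSR-02", "TSR-03"})
print(tsr_status_after_restart(expected, frozenset()))  # data lost on restart
print(tsr_status_after_restart(expected, None))         # read-back unavailable
print(tsr_status_after_restart(expected, expected))     # healthy case
```

The design choice shown is that the display reflects what has been positively confirmed rather than what was last requested, so a silent loss of data becomes a visible alarm.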
Review of the benefits of the STAIRCASE model against selected recommendations from the RAIB report into the failure of the ETCS system on the Cambrian line in 2017.
In summary, the failure mode would have been much more likely to be prevented by robust application of the STAIRCASE methodology using competent people. More generally, the STAIRCASE methodology would have created earlier and more rigorous focus on the core safety argument and methodology, bringing a range of wider benefits.
Research and further work
Further work is needed to refine the methodology and to test the approach on a real-world project. This would involve:
- Aligning contracting, governance and assurance to implement the model set out (this could be done voluntarily on a contractual basis, rather than requiring any legislative change). Having said that, should the approach be successfully implemented contractually, it may make sense to review the prevailing legal frameworks to embed it. As the method is based on fundamental principles of good safety engineering this should not, in theory, present a significant challenge.
- Consideration of the optimal safety and security architecture for different rail assets, to support productive discussions on these topics.
- Clarification of the revised, ideal competence requirements needed to support the effective application of the revised model.
Conclusion
Rail technology is becoming more digitally complex and this is challenging the existing approaches to achieving safety assurance of software-driven functions. Meanwhile the travelling public have rising expectations for safety. The processes for building safety integrity need to be effective and transparent and need to drive the right design and assurance behaviours in the real world. The approach presented here provides an avenue of research for addressing these challenges through development of a refined safety and security lifecycle model that is attuned to real world behaviours and the need for proactive safety analysis and assurance.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
