Abstract
The Train Control and Monitoring System (TCMS) is the communication and command center of the train, and the Central Control Unit (CCU) is its core component, which should achieve the SIL4 safety integrity level defined in standard EN50129. Comparing the current CCU safety mode of dual-machine hot standby with the double 2 out of 2 (2oo2) structure, this paper finds that the latter performs better in safety. A new CCU architecture with enhanced safety is therefore proposed: the double 2oo2 structure is applied to the CCU design, and a Markov failure probability model is established to analyze its safety quantitatively. The CCU based on the double 2oo2 redundant structure can meet the composite fail-safety and reactive fail-safety principles stipulated for the SIL4 safety integrity level, and its Tolerable Functional Failure Rate (TFFR) can meet the SIL4 requirements. As another important part of RAMS (Reliability, Availability, Maintainability, and Safety), the reliability of this CCU architecture is evaluated both qualitatively and quantitatively. According to the results, the double 2oo2 structure can greatly improve the safety of the TCMS and can be used in railway signal systems with SIL4 safety integrity requirements.
Introduction
The Train Control and Monitoring System (TCMS) is an on-board computer system designed for train control and communication. Its main function is to monitor and display the operating status of important facilities and to perform real-time fault diagnosis. As the command and control center of rail vehicles, the TCMS needs to transmit various operating instructions and display data in real time. At the same time, it performs calculation and fault diagnosis on the system data, and issues control instructions and feedback information through the human-computer interaction interface.
The Central Control Unit (CCU) is the core component of the TCMS and is mainly responsible for train reconnection control, vehicle bus management, train-level control, and train fault diagnosis. After logical processing of the driver instructions, train protection system instructions, and system information, the CCU issues operating commands to the various control units and feeds back status and diagnosis information to the driver and maintenance personnel. As the key equipment of the TCMS, the CCU usually adopts a redundant structure to achieve high safety and reliability. 1
Dual-machine hot standby is a commonly used safety redundancy structure, in which two subsystems run at the same time and back each other up to jointly ensure the execution of important services. When one subsystem fails, the other automatically takes over the task, ensuring that the whole system can continue operating without manual intervention. 2 Owing to its advantages in safety and reliability, dual-machine hot standby has been widely used in agriculture, transportation, industry, and other fields.3–9 Currently, the CCUs of rail vehicles in China all adopt the dual-machine hot standby redundant structure,10,11 and for many years it has performed well. However, with the upgrading of high-speed trains, dual-machine hot standby redundancy has difficulty meeting higher safety requirements. On the one hand, the TCMS of the new generation of high-speed trains contains more subsystems, and the network bandwidth required for intelligent information services is greater. 12 On the other hand, international and European standards for the safety of railway signal systems have been applied around the world,13,14 and the railway industry has generally accepted that an active safety strategy based on safety integrity theory is more reliable.15,16
Over the years, researchers have taken various measures to reduce the dangerous failure rate of the CCU system. For example, Zhang et al. 17 designed an urban rail transit network CCU based on dual buses, using ECN and MVB buses to achieve redundant control. Gao and others 18 re-planned the vehicle network topology of the trains on Chongqing Metro Line 10 and introduced a maintenance network to strengthen system maintenance. Although such schemes can improve the reliability and stability of the TCMS to a certain extent, they cannot significantly improve its safety, since the CCU still adopts the dual-machine hot standby structure. A large number of studies have shown that the safety of the double 2oo2 redundancy architecture is far superior to that of the dual-machine hot standby architecture.19–22 Owing to its safety advantages, the double 2oo2 architecture has been applied to a variety of railway signal systems, especially those with strict safety requirements (SIL4), with good results. Chen et al. 2 used the double 2oo2 structure for the CTC autonomous machine system and compared its safety and reliability with those of the dual-machine hot standby structure. Chen and others analyzed the station electronic computer interlocking system with double 2oo2 architecture and described its safety mechanism. Sun and others 24 designed a distributed safety computer for trains combined with the FlexRay bus, which makes isolation and switching easy. Feng 25 designed a computer interlocking system based on the double 2oo2 structure for rail vehicles, and adopted a security communication protocol, intelligent maintenance, and multilayer checks to improve safety performance. Wang and others 26 designed a station computer interlocking system based on dynamic fault tree analysis and demonstrated the reliability of the double 2oo2 system in full fault mode.
There are also other applications such as automatic train protection systems, 27 automatic optimization of locomotive operation device, 28 urban rail transit ground control system, 29 railway signal network system,30,31 etc.
The safety problem of the CCU needs to be fundamentally solved to adapt to the development of high-speed trains, and double 2oo2 redundancy is a feasible approach. This paper proposes a safety enhancement design scheme for the CCU based on the double 2oo2 redundant architecture and demonstrates its performance. The main contributions of this article are as follows. Firstly, a suitable hardware architecture is designed according to the CCU functional requirements, and the working mechanism of the redundant platform, the safety policy, and the synchronous communication strategy of the redundant CCU are proposed. Secondly, a Markov analysis model is established to analyze the safety and reliability of the CCU with different redundant structures, and the contribution of double 2oo2 redundancy to CCU safety is demonstrated theoretically. Thirdly, compared with Markov models of the same type, the model in this paper considers more random cases, so the analysis results are closer to the real situation. Finally, the active safety strategy stipulated in EN50129 is introduced into the train network. The safety mechanism design is no longer independent of the functional system but is integrated into the whole life cycle of the product, which provides a reference for railway signal systems with SIL4 safety integrity requirements.
Research on CCU function safety requirement and basic structure
The train control and monitoring system is responsible for the communication and dispatching of the entire train, and some system functions with high safety demands place new requirements on the central control unit. For example, in European standard EN50129 the braking system has different SIL requirements for its various functions, including Brake System Management, Service Brake, Emergency Brake, Parking Brake, Automatic Brake Test, and Low Adhesion Management. The TCMS has an important impact on some sub-functions of Service Brake and Brake System Management. For communication solutions based on the Ethernet protocol, the standard requires a safety level of SIL3 to SIL4. In summary, the functional safety of an Ethernet-based TCMS must be guaranteed, and the CCU, as the core control equipment, is expected to reach the SIL4 safety integrity level.
The CCU is mainly composed of a gateway, a vehicle control unit, an I/O port management unit, and a communication module. As the safety control platform, the CCU is redundant in pairs and located in the control rooms at both ends of the train. Figure 1 is the schematic diagram of the CCU of a high-speed train, in which four cars form a vehicle marshaling and the on-board systems in the marshaling network communicate with each other under the control of two redundant CCUs. The CCU communicates with the microcomputer control unit of each system through the TCMS to realize process control, communication management, information display, and fault diagnosis.

Figure 1. Schematic diagram of the CCU of a high-speed train.
At present, most CCUs still use the traditional technology of a single processing core,17,18,32 and such designs cannot theoretically meet the safety integrity requirements of SIL4. Furthermore, most dual-machine hot standby systems do not include inter-system synchronization, or only achieve simple status detection, such as heartbeat packets over a CAN bus, an MVB bus, or hard wiring. With such a mechanism it is difficult to achieve seamless switching, which is a hidden danger that should not be ignored. Finally, without synchronization, the differences between the calculation results of the redundant processors accumulate over time, which may lead to different diagnosis results.
The dual-machine hot standby strategy has good reliability and can guarantee stable and reliable operation of the network. However, the new generation of train network systems based on industrial Ethernet has a more complex structure and needs a control platform with higher safety. The double 2oo2 structure offers high safety and is widely used because it supports technological upgrades during operation and suits the high traffic density and frequent system updates of the Chinese railway system. The safety modes of the dual-machine hot standby and double 2oo2 architectures are compared in Figure 2. In the double 2oo2 system, in addition to the fault diagnosis performed by each logic unit itself, the processing results of the units are compared by the comparison units. If the results are inconsistent, a fault not detected by the diagnostic unit is considered to have occurred. The double 2oo2 structure thus has an external means of detection, and its ability to detect a failure before it becomes dangerous is improved by orders of magnitude.

Figure 2. Safety mode comparison between dual-machine hot standby and double 2oo2.
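The comparison mechanism described above can be illustrated with a minimal Python sketch. It is only a software analogy of a hardware comparison unit: the names `Channel`, `two_out_of_two`, and `SAFE_OUTPUT`, and the two example compute functions, are hypothetical and do not come from the CCU design.

```python
from dataclasses import dataclass
from typing import Callable, Optional

SAFE_OUTPUT = None  # hypothetical marker: no effective output is produced


@dataclass
class Channel:
    """One logic unit of a 2oo2 pair (illustrative model only)."""
    compute: Callable[[int], int]
    self_test_ok: bool = True  # result of the unit's own diagnostics


def two_out_of_two(a: Channel, b: Channel, command: int) -> Optional[int]:
    """Return an output only if both units are healthy and agree."""
    # Step 1: internal diagnosis of each logic unit.
    if not (a.self_test_ok and b.self_test_ok):
        return SAFE_OUTPUT
    # Step 2: external comparison of the two results.
    result_a = a.compute(command)
    result_b = b.compute(command)
    if result_a != result_b:
        # A mismatch is treated as a fault that the diagnostics missed.
        return SAFE_OUTPUT
    return result_a


def healthy(x: int) -> int:
    return x * 2


def silently_faulty(x: int) -> int:
    return x * 2 + 1  # wrong result, although the self-test still passes
```

In this sketch, a unit whose self-test passes but whose result is wrong is still caught by the comparison step, which is the extra detection path that a plain hot standby pair lacks.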
CCU security enhancement design based on double 2oo2 architecture
The design scheme of the CCU based on the double 2oo2 redundant architecture is shown in Figure 3. The system has two identical processing subsystems (subsystem A and subsystem B). The hardware of the two subsystems is completely independent, and each can undertake the processing tasks of the CCU alone. The CCU exchanges information with the on-board systems through the vehicle network to control and monitor the whole train. The two CCU subsystems rely on inter-system communication to process synchronously. When the major system fails, the standby system takes over control and performs the tasks.

Figure 3. CCU based on double 2oo2 architecture.
To avoid common-cause faults caused by system failures, the two subsystems are completely isolated physically and connected only through inter-system synchronization. For a system with safety integrity requirements, its interfaces with other facilities need to meet certain safety requirements. The synchronous communication of the CCU system uses a security communication protocol, and the residual error rate of the safety channel is evaluated according to the standard. The fault diagnosis unit is logically isolated from the application functions and can diagnose process faults in real time. To achieve the required safety integrity level, the diagnostic coverage needs to be at the highest level in the industry, which is discussed in detail in the safety analysis section. The two CCU subsystems are mounted on the same train network with different communication addresses, so measures should be taken to avoid control competition. The two subsystems transfer control through inter-system synchronization communication, and the software of the CCU subsystems can additionally be designed so that they monitor each other through the application network.
The double 2oo2 redundant architecture is the most commonly used redundancy platform in the domestic rail transit field and has been applied in many signal systems. However, there is no CCU safety product for the TCMS based on this architecture in China. By combining the structural characteristics of double 2oo2 redundancy with the functional requirements of the CCU, the system architecture in this paper not only meets the safety requirements but also leaves enough room for change in the hardware interfaces and software design to adapt conveniently to different vehicles.
Transfer of CCU working status
The CCU system should have a well-defined working state transition mechanism to ensure that the system can execute a predetermined safety plan under any conditions. In the design process, it is necessary to consider all possible conditions of the redundant system and define their safety functions. The system status transition mechanism is shown in Figure 4, where the letters represent the transition conditions.

Figure 4. Mechanism of system status transfer.
Initialization: After power-on, the CCU performs software and hardware initialization, data initialization, task list initialization, and a self-check of each arithmetic unit. In this status, neither subsystem can produce effective output.
Waiting state: The CCU communicates with every device mounted on the TCMS, waits for the device status to switch to online, and then completes the initialization of the local area network communication.
Enter safe status: The system detects a dangerous fault and enters the safe status.
A as major system: A works as the major system of the CCU, producing effective output, and at the same time acts as the initiator of inter-system synchronization.
B as major system: B works as the major system of the CCU, producing effective output, and at the same time acts as the initiator of inter-system synchronization.
a: A potentially dangerous fault is detected; initialization is unsuccessful.
b: The initialization is successful and no faults are detected in either system.
c: The system self-repair is completed and the system attempts to re-enter the working status; the fault is manually removed and the system enters the working status by manual operation.
d: System A detects a fault and system B is working normally; system B does not receive the synchronization signal from system A within the specified time, and the CCU switches according to the plan.
e: Systems A and B have both detected faults.
f: System A resumes working status and the CCU is manually switched to system A; system A resumes working status and system B fails; system A resumes working status, the synchronization signal from system B is not received within the specified time, and the CCU switches according to the plan.
g: System A has not resumed working status and then system B fails.
h: CCU system reset or restart.
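The states and conditions listed above can be sketched as a small transition table in Python. Because the figure that attaches each condition to a specific arrow is not reproduced here, the source and target state of every entry below is an assumption made for illustration, and the `online` event for leaving the waiting state is hypothetical. Treating any undefined pair as a transition to the safe status is also a choice of this sketch, not a statement of the actual design.

```python
from enum import Enum


class State(Enum):
    INIT = "initialization"
    WAITING = "waiting"
    A_MAJOR = "A as major system"
    B_MAJOR = "B as major system"
    SAFE = "safe status"


# Assumed edges: the text names conditions a-h, but which arrow each one
# labels is inferred from the descriptions, so this table is a guess.
TRANSITIONS = {
    (State.INIT, "a"): State.SAFE,
    (State.INIT, "b"): State.WAITING,
    (State.WAITING, "online"): State.A_MAJOR,  # hypothetical event
    (State.SAFE, "c"): State.A_MAJOR,
    (State.A_MAJOR, "d"): State.B_MAJOR,
    (State.A_MAJOR, "e"): State.SAFE,
    (State.B_MAJOR, "f"): State.A_MAJOR,
    (State.B_MAJOR, "g"): State.SAFE,
}


def step(state: State, condition: str) -> State:
    """Apply one transition condition; undefined pairs fall to SAFE."""
    if condition == "h":  # reset or restart is accepted in every state
        return State.INIT
    return TRANSITIONS.get((state, condition), State.SAFE)


def run(conditions, start=State.INIT):
    """Replay a sequence of conditions and return the final state."""
    state = start
    for condition in conditions:
        state = step(state, condition)
    return state
```

For example, `run(["b", "online", "d", "g"])` follows a successful start-up, a switch from A to B, and then a failure of B while A is still down, which ends in the safe status.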
Hardware architecture
In order to ensure that the system remains safe in the event of any type of single random hardware failure, EN50129 proposes three implementation methods according to the fail-safe principle: composite fail-safety, reactive fail-safety, and inherent fail-safety. Systems with SIL3/SIL4 safety integrity requirements should use at least one of these techniques. Because some failure modes of the CCU are hazardous, and some hazardous faults cannot be eliminated by rapid detection, the hardware architecture of the 2oo2 subsystem is mainly based on composite fail-safety, supplemented by reactive fail-safety, as shown in Figure 5.

Figure 5. Hardware architecture of 2oo2 subsystem.
In response to multiple faults, and in order to achieve the specified quantitative safety goals, EN50129 requires the system to regularly detect integrated circuit faults online according to the SIL3/SIL4 composite fail-safe function of the multi-electronic architecture. Components that may fail include CPU registers, internal RAM, instruction decoding and execution, the program counter, the stack pointer, clock, and reset; memory; power supply; etc. The safety domain of the 2oo2 subsystem uses a dual-core real-time processing unit (RPU) with lockstep technology to detect this type of failure, which achieves task synchronization at clock-cycle granularity.
The 2oo2 subsystem uses multiple application processing cores (APU) as the main computing units, which are responsible for application-level task processing and for coordinating the work of the various processing modules of the CCU. The APU and RPU communicate safely through dynamic memory. The APU is equipped with three Gigabit Ethernet controllers (GEM), used for application communication, debugging, and inter-system synchronization respectively.
Inter-system synchronization
Inter-system synchronization is the basis on which the CCU system achieves seamless switching. Fast-Megabit Ethernet is used to transmit task data between the two 2oo2 subsystems to achieve task-level synchronization. The standby system always follows the major system, so when the switching mechanism is activated it can take over the scheduled tasks without disturbance. Synchronization communication only involves data exchange between the two 2oo2 subsystems, so no transfer conflicts arise when using the lwIP Ethernet protocol stack over a full-duplex cable connected to standard RJ45 network ports.
The CCU has hundreds of task threads, and their communication periods with the train network differ widely. The synchronization period between the two subsystems is determined by the thread that accesses the network most frequently. The major system is designed to communicate with the standby system every 10 ms, and the standby system adjusts its operating status to be consistent with the major system according to the synchronization data received. The timing logic of inter-system synchronization communication is shown in Figure 6, where T represents the maximum time to complete one synchronization data transmission. At the beginning of each synchronization cycle, the major system sends start information to the standby system, and the standby system then prepares to receive the synchronization data, which includes the input, output, and status information of the major system. If the standby system does not receive the synchronization data within the specified time, it handles the fault according to the plan and then either restores communication or takes over control of the whole system.

Figure 6. Synchronization communication logic sequence in time.
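The standby side of this timing logic can be sketched in Python as a per-cycle decision function. Only the 10 ms period comes from the text; the value used for T, the number of tolerated consecutive misses, and the names `standby_decisions`, `TIMEOUT_MS`, and `MAX_MISSED` are assumptions made for illustration. Time is simulated with a list of arrival delays rather than a real network.

```python
SYNC_PERIOD_MS = 10.0  # synchronization period stated in the text
TIMEOUT_MS = 4.0       # assumed value for T; the source gives no number
MAX_MISSED = 2         # assumed: take over at the second consecutive miss


def standby_decisions(arrival_delays_ms, timeout_ms=TIMEOUT_MS,
                      max_missed=MAX_MISSED):
    """Decide, cycle by cycle, what the standby system does.

    arrival_delays_ms[i] is the delay between the start of cycle i and the
    arrival of the synchronization data, or None if nothing arrived.
    Returns a list of "follow", "retry", or "take_over" decisions.
    """
    if timeout_ms > SYNC_PERIOD_MS:
        # Assumption of this sketch: T must fit inside one cycle.
        raise ValueError("timeout must not exceed the synchronization period")
    decisions = []
    missed = 0
    for delay in arrival_delays_ms:
        if delay is not None and delay <= timeout_ms:
            missed = 0
            decisions.append("follow")  # adopt the major system's state
            continue
        missed += 1
        if missed >= max_missed:
            decisions.append("take_over")  # standby becomes major system
            break
        decisions.append("retry")  # try to restore communication
    return decisions
```

A single late or lost frame leads to a retry in this sketch, while consecutive misses lead to a takeover; a real design would fix these thresholds from the safety plan.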
Establishment of Markov model of CCU redundancy system
European safety standards divide faults into random hardware failures and systematic failures. Systematic failures are those caused by human error in design, manufacturing, installation, verification, operation, maintenance, and so on; such failures are in principle completely avoidable. Random hardware failure is the most important factor affecting the safety and reliability of the system, and a random probability model is used to analyze it quantitatively. The Markov model is the most commonly used random failure analysis model in the field of redundant computers, and its effectiveness has been proved over years of application. Taking the working characteristics of the CCU into account, this section elaborates on the modeling process for the dual-machine hot standby and double 2oo2 redundant architectures.
Fundamental assumption of modeling
Risk in the railway system is highly uncertain and random, and the failure analysis of the signal system is a dynamic and random process. The Markov model is a commonly used random analysis method that is highly accurate in analyzing the safety and reliability of signal systems. In this section, the Markov model of the CCU based on the double 2oo2 architecture is constructed. In order to build a proper system transition mechanism and solve the random failure probability function, the following assumptions are made:
Only the failure of logic unit and comparison unit is considered;
Only a single fault occurs at a certain moment;
All modules cannot be repaired after failure;
The failure rates of modules that may fail all obey the exponential distribution;
The failure of the comparison unit is undetectable.
Figure 7 shows the main modules and failure parameters of the CCU with the dual-machine hot standby and double 2oo2 architectures. The failure rate of the logic unit is λ, the diagnostic coverage rate is c, and the failure rate of the comparison unit is μ.

Figure 7. Main modules of failure analysis.
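For a module whose lifetime follows an exponential distribution with failure rate λ, the reliability function is R(t) = e^(−λt), so the probability that the module fails within a short interval Δt is 1 − e^(−λΔt) ≈ λΔt.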
Modeling for dual-machine hot standby redundant CCU
According to the redundant structure and working principle in Figure 7, Table 1 lists the working status of the dual-machine hot standby system.
Table 1. Definition of working status of dual-machine hot standby CCU system.
The Markov status transition model of the dual-machine hot standby system is shown in Figure 8, and the status transition is described as follows:
Status 0–status 1: the CCU system is in normal status, system B has a detectable failure and is isolated, and system A operates alone;
Status 0–status 2: the CCU system is in normal status, system A has a detectable failure and is isolated, and system B operates alone;
Status 0–status 3: the CCU system is in normal status, and system B has an undetectable failure and is not isolated;
Status 0–status F: the CCU system is in normal status, and system A has an undetectable failure that drives the whole system into danger;
Status 1–status S: system A has a detectable failure and the CCU enters the safe status;
Status 1–status F: system A has an undetectable failure that drives the CCU system into danger;
Status 2–status S: system B has a detectable failure and the CCU enters the safe status;
Status 2–status F: system B has an undetectable failure that drives the CCU system into danger;
Status 3–status F: a failure drives the CCU system into danger (in status 3, system B has an undetectable failure and is not isolated, so whether system A has a detectable or an undetectable failure, the CCU system is driven into danger).

Figure 8. Markov analysis model of dual-machine hot standby system.
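The transitions listed above can be turned into a small numerical model. The following Python sketch is a hedged illustration, not the paper's solution: the rate attached to each transition (cλ for a detectable failure, (1 − c)λ for an undetectable one, λ for any failure of system A in status 3) is inferred from the verbal descriptions, the initial condition that the system starts in status 0 is assumed, and the value of λ used in the usage note is a placeholder. The function names are hypothetical.

```python
import math

STATES = ["0", "1", "2", "3", "S", "F"]


def transition_rates(lam, c):
    """Rates keyed by (from, to); inferred from the listed transitions."""
    det = c * lam          # detectable failure rate of one system
    undet = (1 - c) * lam  # undetectable failure rate of one system
    return {
        ("0", "1"): det,    # B fails detectably, A runs alone
        ("0", "2"): det,    # A fails detectably, B runs alone
        ("0", "3"): undet,  # B fails undetectably and is not isolated
        ("0", "F"): undet,  # A fails undetectably
        ("1", "S"): det,
        ("1", "F"): undet,
        ("2", "S"): det,
        ("2", "F"): undet,
        ("3", "F"): lam,    # any failure of A is dangerous in status 3
    }


def derivative(p, rates):
    """Kolmogorov forward equations: dP_j/dt = inflow - outflow."""
    dp = {s: 0.0 for s in STATES}
    for (src, dst), rate in rates.items():
        flow = rate * p[src]
        dp[src] -= flow
        dp[dst] += flow
    return dp


def solve(lam, c, t_end, steps=1000):
    """Integrate from P_0(0) = 1 with classical fourth-order Runge-Kutta."""
    rates = transition_rates(lam, c)
    p = {s: 0.0 for s in STATES}
    p["0"] = 1.0  # assumed initial condition: system starts fault-free
    h = t_end / steps
    for _ in range(steps):
        k1 = derivative(p, rates)
        k2 = derivative({s: p[s] + 0.5 * h * k1[s] for s in STATES}, rates)
        k3 = derivative({s: p[s] + 0.5 * h * k2[s] for s in STATES}, rates)
        k4 = derivative({s: p[s] + h * k3[s] for s in STATES}, rates)
        p = {s: p[s] + h * (k1[s] + 2 * k2[s] + 2 * k3[s] + k4[s]) / 6
             for s in STATES}
    return p


def p0_closed_form(lam, t):
    """Cross-check: the total outflow rate of status 0 is 2 * lam."""
    return math.exp(-2 * lam * t)
```

Calling `solve(1e-4, 0.9, 1000.0)` with the placeholder λ = 1e-4 per hour and c = 0.9 returns the probability of each status after 1000 hours; the entries sum to one, and the value for status 0 agrees with `p0_closed_form`, which gives a quick sanity check on the assumed rates.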
The system is in a definite status at time t, and consider the possibility of every status which the system may transfer to after
Combining the initial condition of time 0:
Modeling for double 2oo2 redundant CCU
According to the redundant structure and working principle in Figure 7, Table 2 lists the working status of double 2oo2 system.
Table 2. Definition of working status of double 2oo2 CCU system.
According to the status transition of double 2oo2 system, the Markov analysis model is constructed as shown in Figure 9.

Figure 9. Markov analysis model of double 2oo2 system.
List the status transition differential equations, combined with the initial conditions described in equation (8), the probability function of failure-safe status and dangerous status (
Analysis results of CCU safety and reliability
Using the formulas obtained from the Markov analysis model, the random failure probability of a specific CCU is evaluated. In order to compare the safety and reliability of the CCU with the dual-machine hot standby and double 2oo2 redundant architectures, specific values are assigned to the parameters. According to the random failure parameters given in the official reliability report of the processor, the failure rates of the logic unit and the comparison unit are assigned as:
The fault diagnosis coverage rate (c) depends on the software and hardware design of the system and indicates the coverage level of the diagnosis unit. Standard EN50129 gives the fault categories of basic components, and the percentage of these faults that can be detected is the fault diagnosis coverage rate. For the CCU system, fault diagnosis of the processing core is the most important factor affecting safety, followed by the dual-core lockstep mechanism, external memory, redundant power supply, Ethernet card, MVB manager, etc. According to the CCU system design example of the China Academy of Railway Sciences, the fault coverage rate of the diagnostic unit can generally reach 95%. In the main studies on the safety analysis of the double 2oo2 redundant architecture, Chen et al. 2 took the diagnostic coverage rate as 90%, Huang and Lei 21 took it as 95%, and Zhang et al. 29 took it as 99%. Following the principle of safety and conservatism, this paper takes the fault diagnosis coverage rate c as 90%. Substituting the parameter values into the simulation gives the safety and reliability comparison of the CCU with the dual-machine hot standby and double 2oo2 redundant architectures shown in Figures 10 and 11.

Figure 10. Safety curve.

Figure 11. Reliability curve.
The results show that the reliability of the double 2oo2 structure is lower than that of the dual-machine hot standby structure, but its safety is higher. This can be explained by the structure. The double 2oo2 structure uses more redundant hardware to identify abnormal statuses that may occur during operation, which enhances the detection capability and increases the probability of reaching the safe status. At the same time, it has twice as much hardware as the dual-machine hot standby system, which increases the probability of a single-point failure and makes the system more likely to trigger the fail-safe mechanism.
Several points are selected on the safety curves of the two systems, and the corresponding safety data are shown in Table 3. As the simulation time increases, the double 2oo2 architecture shows a great safety advantage over the dual-machine hot standby system. For a CCU that needs to run continuously, the double 2oo2 redundant architecture can greatly reduce the risk of danger. The model in this paper does not consider factors such as fault repair and regular maintenance, so the safety analysis results are theoretically conservative. For a CCU with SIL4 safety requirements, the hardware architecture proposed in this article is a viable solution.
Table 3. Safety comparison.
This paper analyzes safety and reliability with a Markov probability model: the differential equations of the system are obtained through state transition analysis and then solved for the probability function of each system state. No approximation is used in the solution process, so the calculation results are accurate and reliable provided that the analysis model fits the working mechanism of the system. The fault parameters are taken from the official documentation of the chip manufacturers and the experimental data of the China Academy of Railway Sciences, and conservative values are used in this paper, so the actual safety performance of the system is expected to be better than the calculated results.
Conclusion
This paper studies the safety integrity requirements and safety enhancement strategies of the TCMS, and proposes a safety-oriented redundant hardware architecture for the CCU based on the double 2oo2 platform. According to the analysis results of the Markov model, the new redundant architecture can greatly improve the safety of the CCU; its reliability decreases slightly because of the more complicated hardware, but this does not affect normal operation. Over the years, researchers have made many improvements to the train network, such as adding diagnostic functions and designing reactive fail-safe policies. All of these can improve the reliability and stability of the network system to a certain extent, but they cannot make the CCU system reach the SIL4 level. In this paper, the double 2oo2 redundant architecture is used for CCU design for the first time, and its safety improvement is demonstrated theoretically. Compared with passive safety policies, the active safety policy can better control accident risk and better meet the safety requirements of EN50129.
Footnotes
Handling Editor: Chenhui Liang
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Open Project Fund of State Key Laboratory for Traction and Control System of EMU and Locomotive(2019YJ201); Shanghai Maglev and Orbit Communication Collaborative Innovation Center.
