Safety enhancement design method and control strategy for CCU of high-speed train

Abstract

The Train Control and Monitoring System (TCMS) is the communication and command center of the train, and the Central Control Unit (CCU) is its core component, which should achieve the SIL4 safety integrity level defined in standard EN50129. Comparing the current CCU safety mode of dual-machine hot standby with the double 2 out of 2 (2oo2) structure, this paper finds that the latter performs better in terms of safety. A new CCU architecture with enhanced safety is therefore proposed: the double 2oo2 structure is applied to the CCU design, and a Markov failure probability model is established to analyze its safety quantitatively. The CCU based on the double 2oo2 redundant structure can meet the composite fail-safety and reactive fail-safety principles stipulated for the SIL4 safety integrity level, and its Tolerable Functional Failure Rate (TFFR) can reach the SIL4 requirement. As another important part of RAMS (Reliability, Availability, Maintainability, and Safety), the reliability of this CCU architecture is evaluated both qualitatively and quantitatively. According to the results, the double 2oo2 structure can greatly improve the safety of the TCMS and can be used in railway signal systems with SIL4 safety integrity requirements.

Keywords

Traffic engineering security, double 2 out of 2, train control and monitoring system, central control unit

Introduction

The Train Control and Monitoring System (TCMS) is an on-board computer system designed for train control and communication. Its main function is to monitor and display the operating status of important facilities and to perform real-time fault diagnosis. As the command and control center of rail vehicles, the TCMS needs to transmit various operating instructions and display data in real time. At the same time, it performs calculation and fault diagnosis on the system data, and issues control instructions and feedback information on the human-computer interaction interface.

The Central Control Unit (CCU) is the core component of the TCMS and is mainly responsible for train reconnection control, vehicle bus management, train-level control, and train fault diagnosis. After logically processing the driver instructions, train protection system instructions, and system information, the CCU issues operating commands to the various control units and feeds back status and diagnosis information to the driver and maintenance personnel. As the key equipment of the TCMS, the CCU usually adopts a redundant structure to achieve high safety and reliability.¹

Dual-machine hot standby is a commonly used safety redundancy structure, with two subsystems running at the same time and backing each other up to jointly ensure the execution of important services. When one subsystem fails, the other automatically takes over the task, ensuring that the whole system can continue operating without manual intervention.² Due to its advantages in safety and reliability, dual-machine hot standby has been widely used in agriculture, transportation, industry, and other fields.^3–9 Currently, the CCUs of rail vehicles in China all adopt the dual-machine hot standby redundant structure,^10,11 and for many years the results have been remarkable. However, with the upgrading of high-speed trains, dual-machine hot standby redundancy struggles to meet higher safety requirements. On the one hand, the TCMS of the new generation of high-speed trains contains more subsystems, and the network bandwidth required for intelligent information services is greater.¹² On the other hand, international and European standards for the safety of railway signal systems have been applied around the world,^13,14 and it is generally accepted across the railway industry that an active security strategy based on safety integrity theory is more reliable.^15,16

Over the years, researchers have taken various measures to reduce the dangerous failure rate of the CCU system. For example, Zhang et al.¹⁷ designed an urban rail transit network CCU based on dual buses, using ECN and MVB buses to achieve redundant control. Gao and others¹⁸ re-planned the vehicle network topology of the trains on Chongqing Metro Line 10 and introduced a maintenance network to strengthen system maintenance. Although such schemes can improve the reliability and stability of the TCMS to a certain extent, they cannot significantly improve its security, since the CCU still adopts the dual-machine hot standby structure. Many studies have shown that the security of the double 2oo2 redundancy architecture is far superior to that of the dual-machine hot standby architecture.^19–22 Owing to its safety advantages, the double 2oo2 architecture has been applied to a variety of railway signal systems, especially those with strict safety requirements (SIL4), with significant results. Chen et al.² used the double 2oo2 structure for the CTC autonomous machine system and compared its safety and reliability with those of the dual-machine hot standby structure. Chen and others analyzed the station electronic computer interlocking system with double 2oo2 architecture and described its safety mechanism. Sun and others²⁴ designed a distributed train security computer combined with the FlexRay bus, which makes isolation and switching easy. Feng²⁵ designed a computer interlocking system based on the double 2oo2 structure for rail vehicles, and adopted a security communication protocol, intelligent maintenance, and multilayer checks to improve safety performance. Wang and others²⁶ designed a station computer interlocking system based on dynamic fault tree analysis, and demonstrated the reliability of the double 2oo2 system in full fault mode. There are also other applications, such as automatic train protection systems,²⁷ automatic optimization of locomotive operation devices,²⁸ urban rail transit ground control systems,²⁹ and railway signal network systems.^30,31

The safety problem of the CCU needs to be fundamentally solved to keep pace with the development of high-speed trains, and double 2oo2 redundancy is a feasible approach. This paper proposes a security enhancement design scheme for the CCU based on the double 2oo2 redundant architecture and demonstrates its performance. The main contributions of this article are as follows. Firstly, this paper designs a suitable hardware architecture according to the CCU functional requirements, and proposes the working mechanism of the redundant platform, the security policy, and the synchronous communication strategy of the redundant CCU. Secondly, a Markov analysis model is established to analyze the safety and reliability of the CCU with different redundant structures, and the contribution of double 2oo2 redundancy to CCU security is demonstrated theoretically. Thirdly, compared with Markov models of the same type, the model in this paper considers more random cases, so the analysis results are closer to the real situation. Finally, the active safety strategy stipulated in EN50129 is introduced into the train network. The safety mechanism design is no longer independent of the functional system but is integrated into the whole life cycle of the product, which provides a reference for railway signal systems with SIL4 safety integrity requirements.

Research on CCU function safety requirement and basic structure

The train control and monitoring system is responsible for the communication and dispatching of the entire train, and some system functions with high safety demands place new requirements on the central control unit. For example, in European standard EN50129 the braking system has different SIL requirements for its various functions, including Brake System Management, Service Brake, Emergency Brake, Parking Brake, Automatic Brake Test, Low Adhesion Management, etc. The TCMS has an important impact on some sub-functions of Service Brake and Brake System Management. For communication solutions based on the Ethernet protocol, the standard requires a safety level of SIL3 to SIL4. In summary, the functional safety of an Ethernet-based TCMS must be guaranteed, and as the core control equipment, the CCU is required to reach the SIL4 safety integrity level.

The CCU is mainly composed of a gateway, a vehicle control unit, an I/O port management unit, and a communication module. As the safety control platform, the CCU is redundant in pairs and located in the control rooms at both ends of the train. Figure 1 is the schematic diagram of the CCU of the high-speed train, with four cars forming a vehicle marshaling; the on-board systems in the marshaling network communicate with each other under the control of two redundant CCUs. The CCU communicates with the microcomputer control unit of each system through the TCMS to realize process control, communication management, information display, and fault diagnosis.

Figure 1.

Schematic diagram of CCU of high-speed train.

At present, most CCUs still use the traditional technology of a single processing core,^17,18,32 and such designs cannot theoretically meet the safety integrity requirements of SIL4. Furthermore, most dual-machine hot standby systems do not include inter-system synchronization, or only achieve simple status detection, such as heartbeat packets over CAN bus, MVB bus, or hard wire. Such a mechanism can hardly achieve seamless switching, which is a hidden danger that should not be ignored. Finally, without synchronization, the difference between the calculation results of the redundant processors accumulates over time, which may cause different diagnosis results.

The dual-machine hot standby strategy has good reliability and can guarantee the stable and reliable operation of the network. However, the new generation of train network systems based on industrial Ethernet has a more complex structure and needs a control platform with higher safety. The double 2oo2 structure has high safety and is widely used because it supports technological upgrades during operation and suits the high traffic density and frequent system updates of the China railway system. The comparison of the safety modes of the dual-machine hot standby and double 2oo2 architectures is shown in Figure 2. In addition to the fault diagnosis of each logic unit itself, the processing results of the units are compared by the comparison units in the double 2oo2 system. If the comparison results are inconsistent, it is considered that a fault not detected by the diagnostic unit has occurred. The double 2oo2 structure thus has an external means of detection, and its ability to detect failures before they become dangerous is improved by orders of magnitude.

Figure 2.

Security mode comparison between dual-machine hot standby and double 2oo2.
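The comparison mechanism described above can be illustrated with a minimal sketch. It assumes each logic unit returns its output together with a self-diagnosis flag; the names `ChannelResult`, `run_2oo2`, `healthy`, and `silently_wrong` are hypothetical and are not taken from the CCU design.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class ChannelResult:
    value: int           # output computed by one logic unit
    self_check_ok: bool  # result of that unit's own fault diagnosis


LogicUnit = Callable[[Sequence[int]], ChannelResult]


def run_2oo2(unit_a: LogicUnit, unit_b: LogicUnit,
             inputs: Sequence[int]) -> Optional[int]:
    """Return the agreed output, or None to request the fail-safe reaction."""
    a = unit_a(inputs)
    b = unit_b(inputs)
    # Faults found by the units' own diagnosis.
    if not (a.self_check_ok and b.self_check_ok):
        return None
    # A mismatch reveals a fault that self-diagnosis did not detect.
    if a.value != b.value:
        return None
    return a.value


def healthy(inputs: Sequence[int]) -> ChannelResult:
    return ChannelResult(sum(inputs), True)


def silently_wrong(inputs: Sequence[int]) -> ChannelResult:
    # Wrong result, but the unit's self-diagnosis reports no fault.
    return ChannelResult(sum(inputs) + 1, True)


print(run_2oo2(healthy, healthy, [1, 2, 3]))
print(run_2oo2(healthy, silently_wrong, [1, 2, 3]))
```

In a single-channel hot standby unit, the second case would pass unnoticed; here the comparator turns it into a detected fault.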

CCU security enhancement design based on double 2oo2 architecture

The design scheme of the CCU based on the double 2oo2 redundant architecture is shown in Figure 3. The system has two identical processing subsystems (subsystem A and subsystem B). The hardware of the two subsystems is completely independent, and each can undertake the processing tasks of the CCU alone. The CCU exchanges information with the on-board systems through the vehicle network to control and monitor the whole train. The two CCU subsystems rely on inter-system communication to process synchronously. When the major system fails, the standby system takes over control and performs the tasks.

Figure 3.

CCU based on double 2oo2 architecture.

To avoid common-cause faults caused by system failures, the two subsystems are completely isolated physically and connected only through inter-system synchronization. For a system with a safety integrity requirement, its interfaces with other facilities need to meet certain security requirements. The synchronous communication of the CCU system uses a security communication protocol, and the residual error rate of the security channel is evaluated according to the standard. The fault diagnosis unit is logically isolated from the application functions and can diagnose process faults in real time. To achieve the required safety integrity level, the diagnostic coverage needs to be at the highest level achieved in the industry, which is discussed in detail in the safety analysis section. The two CCU subsystems are mounted on the same train network with different communication addresses, so measures should be taken to avoid competition for control. The two subsystems transfer control through inter-system synchronization communication, and the software of the CCU subsystems can also be designed to monitor each other through the application network.

The double 2oo2 redundant architecture is the most commonly used redundancy platform in the domestic rail transit field and has been applied in many signal systems. However, there is no CCU safety product for the TCMS based on this architecture in China. By combining the structural characteristics of double 2oo2 redundancy with the functional requirements of the CCU, the system architecture in this paper not only meets the security requirements, but also leaves enough room for change in the hardware interfaces and software design to adapt easily to different vehicles.

Transfer of CCU working status

The CCU system should have a definite working state transition mechanism to ensure that the system can execute a predetermined safety plan under any conditions. In the design process, it is necessary to consider all possible conditions of the redundant system and define their safety functions. The system status transition mechanism is shown in Figure 4, where the letters represent the transition conditions.

Figure 4.

Mechanism of system status transfer.

Initialization: After power-on, the CCU performs software and hardware initialization, data initialization, task list initialization, and a self-check of each arithmetic unit. In this status, neither subsystem can produce effective output.

Waiting state: The CCU communicates with every device mounted on the TCMS, waiting for the device status to switch to online, and then completes the initialization of the local area network communication.

Enter safe status: The system detects dangerous fault and enters safe status.

A as major system: A works as the major system of CCU, producing effective output, and at the same time acts as the initiator of inter-system synchronization.

B as major system: B works as the major system of CCU, producing effective output, and at the same time acts as the initiator of inter-system synchronization.

a: A potentially dangerous fault is detected, or initialization is unsuccessful.

b: The initialization is successful and no faults are detected in either system.

c: The system completes self-repair and attempts to re-enter the working status, or the fault is manually removed and the system enters the working status by manual operation.

d: System A detects a fault while system B is working normally, or system B does not receive the synchronization signal from system A within the specified time and the CCU switches according to the plan.

e: Systems A and B have both detected faults.

f: System A has resumed working status and one of the following occurs: the CCU is manually switched to system A; system B fails; or the synchronization signal from system B is not received within the specified time and the CCU switches according to the plan.

g: System A has not resumed working status and then system B fails.

h: CCU system reset or restart.
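The transition conditions a to h can be sketched as a lookup table. Figure 4 is not reproduced in the text, so the edges below are assumptions inferred from the condition descriptions only: in particular, the `"online"` event that leaves the waiting state, the target of condition c, and the rule that an undefined (state, condition) pair falls back to the safe status are hypothetical choices, not the authors' design.

```python
from enum import Enum, auto


class State(Enum):
    INIT = auto()
    WAITING = auto()
    A_MAJOR = auto()
    B_MAJOR = auto()
    SAFE = auto()


# Assumed edges; the letters are the transition conditions listed above.
TRANSITIONS = {
    (State.INIT, "a"): State.SAFE,
    (State.INIT, "b"): State.WAITING,
    (State.WAITING, "online"): State.A_MAJOR,  # hypothetical event name
    (State.SAFE, "c"): State.A_MAJOR,          # assumed target of condition c
    (State.A_MAJOR, "d"): State.B_MAJOR,
    (State.A_MAJOR, "e"): State.SAFE,
    (State.B_MAJOR, "f"): State.A_MAJOR,
    (State.B_MAJOR, "g"): State.SAFE,
}


def step(state: State, condition: str) -> State:
    """Apply one transition condition to the current CCU state."""
    if condition == "h":  # reset or restart is accepted from any state
        return State.INIT
    # Assumption: anything not explicitly planned leads to the safe status.
    return TRANSITIONS.get((state, condition), State.SAFE)


state = State.INIT
for condition in ["b", "online", "d", "g", "h"]:
    state = step(state, condition)
    print(condition, "->", state.name)
```

Keeping the table explicit makes it easy to review that every state has a defined reaction to every condition, which is the property the paragraph above asks for.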

Hardware architecture

In order to ensure that the system remains safe in the event of any type of single random hardware failure, EN50129 proposes three implementation methods according to the fail-safe principle: composite fail-safety, reactive fail-safety, and inherent fail-safety. Systems with SIL3/SIL4 safety integrity level requirements should use at least one of these techniques. Because some failure modes of the CCU are hazardous, and some of the hazardous faults cannot be eliminated by rapid detection, the hardware architecture of the 2oo2 subsystem is mainly based on composite fail-safety, supplemented by reactive fail-safety, as shown in Figure 5.

Figure 5.

Hardware architecture of 2oo2 subsystem.

To cope with multiple faults and achieve the specified quantitative safety goals, EN50129 requires a system that implements the SIL3/SIL4 composite fail-safe function with a multi-electronic architecture to regularly detect integrated circuit faults online. Components that may be faulty include: CPU registers, internal RAM, instruction decoding and execution, program counter, stack pointer, clock, and reset; memory; power supply; etc. The security domain in the 2oo2 subsystem uses a dual-core real-time processing unit (RPU) with lockstep technology to detect this type of failure, which achieves task synchronization at a time granularity of the clock cycle.

The 2oo2 subsystem uses multiple application processing cores (APU) as the main computing units, which are responsible for application-level task processing and for coordinating the work of the various processing modules of the CCU. The APU and RPU communicate safely through dynamic memory. The APU is equipped with three Gigabit Ethernet controllers (GEM), used for application communication, debugging, and inter-system synchronization respectively.

Inter-system synchronization

Inter-system synchronization is the basis for the CCU system to achieve seamless switching. Fast-Megabit Ethernet is used to transmit task data between the two 2oo2 subsystems to achieve task-level synchronization. The standby system always follows the major system, so that when the switching mechanism is activated it can take over the scheduling tasks without disturbance. Synchronization communication only involves data exchange between the two 2oo2 subsystems, so no transfer conflict arises when the LWIP Ethernet protocol is used over a full-duplex cable connected to standard RJ45 network ports.

The CCU has hundreds of task threads, and their communication periods with the train network differ widely. The synchronization period between the two subsystems is determined by the thread that accesses the network most frequently. The major system is designed to communicate with the standby system every 10 ms, and the standby system adjusts its operating status to be consistent with the major system according to the synchronization data received. The timing logic of inter-system synchronization communication is shown in Figure 6, where T represents the maximum time to complete one synchronization data transmission. At the beginning of each synchronization cycle, the major system sends start information to the standby system, and the standby system then prepares to receive the synchronization data, which includes the input, output, and status information of the major system. If the standby system does not receive the synchronization data within the specified time, it troubleshoots the fault according to the plan, and then either restores communication or takes over control of the whole system.

Figure 6.

Synchronization communication logic sequence in time.
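The standby side of this timing logic can be sketched in simulated time, without real sockets. The 10 ms period comes from the text; the value chosen for T, the class name `StandbyMonitor`, and its method names are hypothetical, and the troubleshooting step is reduced to an immediate takeover for brevity.

```python
SYNC_PERIOD_MS = 10  # synchronization period stated in the text
T_MAX_MS = 4         # hypothetical value for T; the paper gives no number


class StandbyMonitor:
    """Standby-side view of the synchronization cycle (simulated time)."""

    def __init__(self, t_max_ms: int = T_MAX_MS) -> None:
        self.t_max_ms = t_max_ms
        self.deadline_ms = None  # set while synchronization data is awaited
        self.mirror = {}         # last input/output/status of the major system
        self.is_major = False

    def on_start(self, now_ms: int) -> None:
        """Start information received: data must arrive within T."""
        self.deadline_ms = now_ms + self.t_max_ms

    def on_sync_data(self, now_ms: int, data: dict) -> bool:
        """Accept the data only if it was awaited and arrived in time."""
        if self.deadline_ms is None or now_ms > self.deadline_ms:
            return False
        self.mirror = dict(data)
        self.deadline_ms = None
        return True

    def tick(self, now_ms: int) -> None:
        """Periodic check: take over control if the deadline has passed."""
        if self.deadline_ms is not None and now_ms > self.deadline_ms:
            self.deadline_ms = None
            self.is_major = True  # simplified: takeover without troubleshooting


monitor = StandbyMonitor()
monitor.on_start(0)
monitor.on_sync_data(3, {"input": 1, "output": 0, "status": "ok"})
monitor.tick(SYNC_PERIOD_MS - 1)             # cycle 1 completed in time
monitor.on_start(SYNC_PERIOD_MS)             # cycle 2 starts
monitor.tick(SYNC_PERIOD_MS + T_MAX_MS + 1)  # no data arrived within T
print("standby took over:", monitor.is_major)
```

A lost start message is not covered by this sketch; a real implementation would also need a timeout on the start information itself.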

Establishment of Markov model of CCU redundancy system

European safety standards divide faults into random hardware failures and systematic failures. Systematic failures are those caused by human error in design, manufacturing, installation, verification, operation, maintenance, etc., and are in principle completely avoidable. Random hardware failure is the most important factor affecting the safety and reliability of the system, and a random probability model is used to analyze it quantitatively. The Markov model is the most commonly used random failure analysis model in the field of redundant computers, and its effectiveness has been proven over years of application. Based on the working characteristics of the CCU, this section elaborates on the modeling process for the dual-machine hot standby and double 2oo2 redundant architectures.

Fundamental assumption of modeling

The risk of the railway system has great ambiguity and randomness, and the failure analysis of the signal system is a dynamic and random process. The Markov model is a commonly used random analysis method, which has high accuracy in analyzing the safety and reliability of signal systems. In this section the Markov model of the CCU based on the double 2oo2 architecture is constructed. In order to build a proper system transition mechanism and solve the random failure probability functions, the following assumptions are made:

Only the failure of logic unit and comparison unit is considered;

Only a single fault occurs at a certain moment;

No module can be repaired after failure;

The failure rates of modules that may fail all obey the exponential distribution;

The failure of the comparison unit is undetectable.

Figure 7 shows the main modules and failure parameters of the CCU with the dual-machine hot standby and double 2oo2 architectures. The failure rate of the logic unit is λ, the diagnostic coverage rate is c, and the failure rate of the comparison unit is μ.

Figure 7.

Main modules of failure analysis.

For a module whose lifetime follows an exponential distribution with failure rate $x$ , suppose it is in normal status at the current time; then the probability that it fails within time $Δ t$ is $(1 - e^{- x Δ t})$ . If $Δ t$ is small enough, the failure probability can be simplified to $x Δ t$ . So the failure probabilities of the logic unit and the comparison unit are $λ Δ t$ and $μ Δ t$ respectively, and the probabilities of a detectable and an undetectable failure of a logic unit are $λ c Δ t$ and $λ (1 - c) Δ t$ .
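The quality of this first-order approximation is easy to check numerically. The sketch below assumes λ = 1 × 10⁻⁶ /h and c = 0.9 (the values used later in the analysis) and a step of one hour; the function and variable names are illustrative only.

```python
import math


def exact_failure_probability(rate_per_h: float, dt_h: float) -> float:
    """1 - exp(-x * dt), computed with expm1 to avoid cancellation."""
    return -math.expm1(-rate_per_h * dt_h)


def linear_failure_probability(rate_per_h: float, dt_h: float) -> float:
    """First-order approximation x * dt."""
    return rate_per_h * dt_h


LAM = 1e-6  # logic unit failure rate per hour (assumed from the later analysis)
C = 0.9     # diagnostic coverage (assumed from the later analysis)
DT = 1.0    # time step in hours, chosen for illustration

exact = exact_failure_probability(LAM, DT)
approx = linear_failure_probability(LAM, DT)
relative_error = (approx - exact) / exact

# Split of the logic unit failure probability by detectability.
detectable = C * approx
undetectable = (1.0 - C) * approx

print(f"exact = {exact:.12e}, linear = {approx:.12e}")
print(f"relative error = {relative_error:.2e}")
print(f"detectable = {detectable:.2e}, undetectable = {undetectable:.2e}")
```

The relative error is about $x Δ t / 2$, so the linearization is harmless as long as the product of failure rate and step length stays far below one.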

Modeling for dual-machine hot standby redundant CCU

According to the redundant structure and working principle in Figure 7, Table 1 lists the working status of the dual-machine hot standby system.

Table 1.

Definition of working status of dual-machine hot standby CCU system.

Status	Explanation
0	Logic units in subsystems A and B are all normal, A is the major system
1	Logic unit in A normal, logic unit in B has a detectable failure, A is the major system
2	Logic unit in A has a detectable failure, logic unit in B normal, B is the major system
3	Logic unit in A normal, logic unit in B has an undetectable failure, A is the major system
S	System enters the safe status
F	System in dangerous status

The Markov status transition model of the dual-machine hot standby system is shown in Figure 8, and the status transition is described as follows:

Status 0–status 1: CCU system in normal status, system B has a detectable failure and is isolated, system A operates alone;

Status 0–status 2: CCU system in normal status, system A has a detectable failure and is isolated, system B operates alone;

Status 0–status 3: CCU system in normal status, system B has an undetectable failure and is not isolated;

Status 0–status F: CCU system in normal status, system A has an undetectable failure and drives the whole system into danger;

Status 1–status S: system A has a detectable failure and CCU enters the safe status;

Status 1–status F: system A has an undetectable failure and drives the CCU system into danger;

Status 2–status S: system B has a detectable failure and CCU enters the safe status;

Status 2–status F: system B has an undetectable failure and drives the CCU system into danger;

Status 3–status F: system A has a failure and drives the CCU system into danger (in status 3, system B has an undetectable failure and is not isolated, so whether system A has a detectable or an undetectable failure, the CCU system is driven into danger).

Figure 8.

Markov analysis model of dual-machine hot standby system.

Suppose the system is in a definite status at time t, and consider the probability of every status to which the system may transfer after a short time $Δ t$ . For status 0, the total failure rate out of the status is $2 λ$ , so the probability of remaining in status 0 is given by equation (1). Rearranging equation (1) as $[P_{0} (t + Δ t) - P_{0} (t)] / Δ t = - 2 λ P_{0} (t)$ and letting $Δ t \to 0$ gives equation (2). The same procedure applies to all the other statuses, and the differential equations of the system transitions are listed in equation (3).

P_{0} (t + Δ t) = P_{0} (t) * (1 - 2 λ Δ t)

(1)

P_{0}^{'} (t) = - 2 λ P_{0} (t)

(2)

\left\{ \begin{matrix} P_{0}^{'} (t) = - 2 λ P_{0} (t) \\ P_{1}^{'} (t) = λ c P_{0} (t) - λ P_{1} (t) \\ P_{2}^{'} (t) = λ c P_{0} (t) - λ P_{2} (t) \\ P_{3}^{'} (t) = (1 - c) λ P_{0} (t) - λ P_{3} (t) \\ P_{S}^{'} (t) = c λ P_{1} (t) + c λ P_{2} (t) \\ P_{F}^{'} (t) = (1 - c) λ P_{0} (t) + (1 - c) λ P_{1} (t) + (1 - c) λ P_{2} (t) + λ P_{3} (t) \end{matrix} \right.

(3)
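As a cross-check of the model structure, the transitions listed for Figure 8 can be encoded as rates and integrated numerically. The sketch below is a plain fourth-order Runge-Kutta integration written for illustration, not the solution method used in this paper; λ = 1 × 10⁻⁶ /h and c = 0.9 are the values assumed from the later analysis, and all function names are hypothetical. It checks two structural properties: the probabilities always sum to one, and $P_{0} (t)$ follows the solution $e^{- 2 λ t}$ of equation (2).

```python
import math

STATES = ["0", "1", "2", "3", "S", "F"]


def hot_standby_rates(lam: float, c: float) -> dict:
    """Transition rates (per hour) taken from the listed status transitions."""
    det, undet = c * lam, (1.0 - c) * lam
    return {
        ("0", "1"): det, ("0", "2"): det, ("0", "3"): undet, ("0", "F"): undet,
        ("1", "S"): det, ("1", "F"): undet,
        ("2", "S"): det, ("2", "F"): undet,
        ("3", "F"): lam,  # any failure of A is dangerous in status 3
    }


def derivative(p: dict, rates: dict) -> dict:
    """Right-hand side of the state equations: flow out of src, into dst."""
    dp = dict.fromkeys(p, 0.0)
    for (src, dst), rate in rates.items():
        flow = rate * p[src]
        dp[src] -= flow
        dp[dst] += flow
    return dp


def integrate(rates: dict, t_end_h: float, steps: int = 1000) -> dict:
    """Classical RK4 integration starting from status 0."""
    p = {s: 0.0 for s in STATES}
    p["0"] = 1.0
    h = t_end_h / steps
    for _ in range(steps):
        k1 = derivative(p, rates)
        k2 = derivative({s: p[s] + 0.5 * h * k1[s] for s in p}, rates)
        k3 = derivative({s: p[s] + 0.5 * h * k2[s] for s in p}, rates)
        k4 = derivative({s: p[s] + h * k3[s] for s in p}, rates)
        p = {s: p[s] + h / 6.0 * (k1[s] + 2 * k2[s] + 2 * k3[s] + k4[s])
             for s in p}
    return p


LAM, C, T_END = 1e-6, 0.9, 1e5
p_end = integrate(hot_standby_rates(LAM, C), T_END)
print("sum of probabilities:", sum(p_end.values()))
print("P0 numeric:", p_end["0"], " P0 analytic:", math.exp(-2 * LAM * T_END))
```

Encoding the model as a list of (source, destination, rate) entries makes conservation of probability automatic, because every outflow appears once as an inflow elsewhere.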

With the initial conditions at time 0, $P_{0} (0) = 1, P_{1} (0) = P_{2} (0) = P_{3} (0) = P_{S} (0) = P_{F} (0) = 0$ , the differential equations (3) are solved to obtain the probability functions $P_{S} (t)$ and $P_{F} (t)$ , shown as equations (4) and (5). In status S, the CCU system stops working but no dangerous accident occurs, while in status F a dangerous accident has happened. Therefore, when the system is in any status other than F, it is considered safe. In status F or S, the system no longer works and is unreliable. The safety and reliability probability functions of the dual-machine hot standby system are given by equations (6) and (7).

P_{S} (t) = (2 c + 1) (c - 1) e^{- λ t} - c (c - 1) e^{- 2 λ t} - c^{2} + 1

(4)

P_{F} (t) = c^{2} - 2 c^{2} e^{- λ t} + c^{2} e^{- 2 λ t}

(5)

S_{D} (t) = 1 - P_{F} (t)

(6)

R_{D} (t) = 1 - P_{F} (t) - P_{S} (t)

(7)

Modeling for double 2oo2 redundant CCU

According to the redundant structure and working principle in Figure 7, Table 2 lists the working status of double 2oo2 system.

Table 2.

Definition of working status of double 2oo2 CCU system.

Status	Explanation
0	All units work normally
1	The major system is normal; one of logic units in standby system has a detectable failure
2	The major system is normal; one of logic units in standby system has an undetectable failure
3	The major system is normal; one of logic units in standby system has a detectable failure, another has a detectable or an undetectable failure
4	The major system is normal; one of logic units in standby system has an undetectable failure, another has a detectable or an undetectable failure
5	The major system is normal; one of logic units in standby system has a detectable failure, comparison unit in standby system fails
6	The major system is normal; one of logic units in standby system has a detectable failure, another has a detectable or an undetectable failure, comparison unit in standby system fails
7	Comparison unit in major system fails; one of logic units in standby system has a detectable failure, comparison unit in standby system fails
8	Comparison unit in major system fails; one of logic units in standby system has a detectable failure, another has a detectable or an undetectable failure, comparison unit in standby system fails
9	Comparison unit in major system fails; one of logic units in standby system has a detectable failure
10	Comparison unit in major system fails; one of logic units in standby system has a detectable failure, another has a detectable or an undetectable failure
11	Comparison unit in major system fails; one of logic units in standby system has an undetectable failure
12	Comparison unit in major system fails; one of logic units in standby system has an undetectable failure, another has a detectable or an undetectable failure
13	Comparison unit in major system fails; standby system is normal
14	Comparison unit in major system fails; one of logic units in standby system has an undetectable failure, comparison unit in standby system fails
15	Comparison unit in major system fails; one of logic units in standby system has an undetectable failure, another has a detectable or an undetectable failure, comparison unit in standby system fails
16	Major system is normal; one of logic units in standby system has an undetectable failure, comparison unit in standby system fails
17	Major system is normal; one of logic units in standby system has an undetectable failure, another has a detectable or an undetectable failure, comparison unit in standby system fails
18	Major system is normal; comparison unit in standby system fails
19	Comparison unit of major system fails; comparison unit in standby system fails
S	System enters the safe state
F	System in dangerous state

According to the status transition of double 2oo2 system, the Markov analysis model is constructed as shown in Figure 9.

Figure 9.

Markov analysis model of double 2oo2 system.

Listing the status transition differential equations and combining them with the initial conditions in equation (8), the probability functions of the fail-safe status and the dangerous status ( $Q_{S} (t)$ and $Q_{F} (t)$ ) are solved. The safety and reliability probability functions of the double 2oo2 system then follow as equations (9) and (10).

\left\{ \begin{matrix} Q_{0} (0) = 1 \\ Q_{i} (0) = 0, \quad i = 1, 2, \ldots, 19, S, F \end{matrix} \right.

(8)

S_{D 2 oo 2} (t) = 1 - Q_{F} (t)

(9)

R_{D 2 oo 2} (t) = 1 - Q_{F} (t) - Q_{S} (t)

(10)

Analysis results of CCU safety and reliability

Using the formulas obtained from the Markov analysis model, the random failure probability of the specific CCU is evaluated. In order to compare the safety and reliability of the CCU with the dual-machine hot standby and double 2oo2 redundant architectures, specific values are assigned to the parameters. According to the random failure parameters given in the official reliability report of the processor, the failure rates of the logic unit and the comparison unit are set to $λ = 1 \times 10^{- 6} / h$ and $μ = 1 \times 10^{- 6} / h$ .

The fault diagnosis coverage rate (c) depends on the software and hardware design of the system and indicates the level of fault coverage achieved by the diagnosis unit. The standard EN50129 gives the fault categories of basic components, and the percentage of faults that can be detected is the fault diagnosis coverage rate. For the CCU system, the fault diagnosis of the processing core is the most important factor affecting safety, followed by the dual-core lockstep mechanism, external memory, redundant power supply, Ethernet card, MVB manager, etc. According to the CCU system design example of China Academy of Railway Sciences, the fault coverage rate of the diagnostic unit can generally reach 95%. In the main studies on safety analysis of the double 2oo2 redundant architecture, Chen et al.² took the diagnosis coverage rate as 90%, Huang and Lei²¹ took it as 95%, and Zhang et al.²⁹ took it as 99%. To keep the analysis conservative, this paper takes the fault diagnosis coverage rate c as 90%. Substituting the parameter values into the simulation, the safety and reliability comparison of the CCU with the dual-machine hot standby and double 2oo2 redundant architectures is shown in Figures 10 and 11.

Figure 10.

Safety curve.

Figure 11.

Reliability curve.

The results show that the reliability of the double 2oo2 structure is lower than that of the dual-machine hot standby structure, but its safety is higher. This can be explained by the structure. The double 2oo2 structure uses more redundant hardware to identify abnormal statuses that may occur during operation, which enhances the detection capability and increases the probability of reaching the safe status. At the same time, it has twice as much hardware as the dual-machine hot standby system, which increases the probability of a single-point failure and makes the system more likely to trigger the fail-safe mechanism.

Several points are selected from the safety curves of the two systems, and the safety data are shown in Table 3. As the simulation time increases, the double 2oo2 architecture shows a growing safety advantage over the dual-machine hot standby system. For a CCU that needs to run continuously, the double 2oo2 redundant architecture can greatly reduce the risk of danger. The model analysis in this paper does not consider factors such as fault repair and regular maintenance, so the safety analysis results are theoretically conservative. For a CCU with SIL4 safety requirements, the hardware architecture proposed in this article is a viable solution.

Table 3. Safety comparison.

Simulation time (h)	$5 \times 10^{2}$	$5 \times 10^{3}$	$1 \times 10^{4}$	$1 \times 10^{5}$	$1 \times 10^{6}$	$2 \times 10^{6}$
$S_{D 2 oo 2} (t)$	1.00000000	0.99999798	0.99999101	0.99905848	0.96339883	0.94715549
$S_{D} (t)$	0.99999980	0.99997985	0.99991981	0.99266471	0.67634312	0.39440749
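To make the gap in Table 3 easier to read, the short sketch below converts each tabulated safety value into a danger probability, 1 − S(t), and takes the ratio between the two architectures. The numbers are copied from Table 3; the variable and function names are illustrative only, and the first column is skipped for the ratio because the double 2oo2 value is rounded to exactly 1 there.

```python
# Values copied from Table 3 (simulation time in hours, safety S(t)).
SIM_HOURS = [5e2, 5e3, 1e4, 1e5, 1e6, 2e6]
S_D2OO2 = [1.00000000, 0.99999798, 0.99999101, 0.99905848, 0.96339883, 0.94715549]
S_DUAL = [0.99999980, 0.99997985, 0.99991981, 0.99266471, 0.67634312, 0.39440749]


def danger_ratio(s_dual, s_d2oo2):
    """Ratio of danger probabilities (1 - S); None if the denominator rounds to zero."""
    danger_d2oo2 = 1.0 - s_d2oo2
    if danger_d2oo2 <= 0.0:
        return None  # tabulated value is rounded to 1, so the ratio is undefined
    return (1.0 - s_dual) / danger_d2oo2


RATIOS = [danger_ratio(d, q) for d, q in zip(S_DUAL, S_D2OO2)]

for hours, ratio in zip(SIM_HOURS, RATIOS):
    label = "n/a" if ratio is None else f"{ratio:.2f}"
    print(f"t = {hours:.0e} h: danger(dual hot standby) / danger(double 2oo2) = {label}")
```

On these tabulated values, the danger probability of the dual-machine hot standby system is several times that of the double 2oo2 system at every listed time where the ratio is defined.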

This paper analyzes safety and reliability based on a Markov probability model: the differential equations of the system are obtained through state transition analysis and are then solved for the probability functions of the system states. No approximation is used in the solution process, so the calculated results are accurate provided that the analysis model fits the working mechanism of the system. The fault parameters are taken from the official documentation of chip manufacturers and the experimental data of the China Academy of Railway Sciences, and conservative values are used, so the actual safety performance of the system is expected to be better than the calculated results.
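The workflow described above (state transitions, then differential equations, then state probabilities) can be sketched on a deliberately small example. The three-state model below is a hypothetical single-channel toy, not the double 2oo2 model of this paper: state 0 is operating, state 1 is a detected (fail-safe) failure, and state 2 is an undetected (dangerous) failure. The failure rate `lam`, the function names, and the use of a Runge-Kutta integrator as a cross-check are all assumptions made for illustration.

```python
import math


def build_generator(lam, c):
    """Transition-rate matrix Q of the toy model; each row sums to zero."""
    return [
        [-lam, c * lam, (1.0 - c) * lam],  # operating -> safe / dangerous
        [0.0, 0.0, 0.0],                   # fail-safe state is absorbing
        [0.0, 0.0, 0.0],                   # dangerous state is absorbing
    ]


def closed_form(lam, c, t):
    """Exact state probabilities of the toy model at time t."""
    p_operating = math.exp(-lam * t)
    return [p_operating, c * (1.0 - p_operating), (1.0 - c) * (1.0 - p_operating)]


def derivative(p, q):
    """Right-hand side of the state equations dP/dt = P * Q."""
    n = len(p)
    return [sum(p[i] * q[i][j] for i in range(n)) for j in range(n)]


def integrate_rk4(q, p_start, t_end, steps):
    """Integrate dP/dt = P * Q with the classical fourth-order Runge-Kutta method."""
    h = t_end / steps
    p = list(p_start)
    for _ in range(steps):
        k1 = derivative(p, q)
        k2 = derivative([x + 0.5 * h * k for x, k in zip(p, k1)], q)
        k3 = derivative([x + 0.5 * h * k for x, k in zip(p, k2)], q)
        k4 = derivative([x + h * k for x, k in zip(p, k3)], q)
        p = [x + h / 6.0 * (a + 2.0 * b + 2.0 * d + e)
             for x, a, b, d, e in zip(p, k1, k2, k3, k4)]
    return p


# Hypothetical parameters: lam is NOT from the paper; c = 0.90 matches the text.
lam, c, t_end = 1e-5, 0.90, 1e5
numeric = integrate_rk4(build_generator(lam, c), [1.0, 0.0, 0.0], t_end, 1000)
exact = closed_form(lam, c, t_end)
print("reliability R(t) =", exact[0])    # probability of still operating
print("safety S(t) =", 1.0 - exact[2])   # probability of not being in the dangerous state
print("max |numeric - exact| =", max(abs(a - b) for a, b in zip(numeric, exact)))
```

In this toy case the closed-form solution plays the role of the exact probability function mentioned in the text, and the numerical integration only serves to confirm that it satisfies the state equations.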

Conclusion

This paper studies the safety integrity requirements and safety enhancement strategies of TCMS, and proposes a safety-redundant hardware architecture for the CCU based on the double 2oo2 platform. According to the Markov model analysis, the new redundant architecture could greatly improve the safety of the CCU, while the reliability would decrease slightly because of the more complex hardware, without affecting normal operation. Over the years, researchers have made many improvements to the train network, such as adding diagnostic functions and designing reactive fail-safe policies; these can improve the reliability and stability of the network system to a certain extent, but cannot bring the CCU system to the SIL4 level. In this paper, the double 2oo2 redundant architecture is applied to CCU design for the first time, and its safety improvement is demonstrated theoretically. Compared with passive safety policies, the active safety policy can better control accident risk and better meet the safety requirements in EN50129.

Footnotes

Handling Editor: Chenhui Liang

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Open Project Fund of State Key Laboratory for Traction and Control System of EMU and Locomotive (2019YJ201); Shanghai Maglev and Orbit Communication Collaborative Innovation Center.

ORCID iDs

Haiquan Liang

Kai Zhang

References

1. C-X, Zou J-Y, Zhao M-H, et al. The central control unit based on MVB communication. J China Railw Soc 2010; 32: 125–130.

2. Chen, Yuan Z-M, Yan, et al. Reliability and safety evaluation of autonomous computer system of intelligent CTC in high speed railway. Zidonghua Xuebao 2020; 46: 463–470.

3. Samet. Recovery device for real-time dual-redundant computer systems. IEEE Trans Dependable Secure Comput 2011; 8: 391–403.

4. Tan, Xie, et al. Improving the performance of deduplication-based storage cache via content-driven cache management methods. IEEE Trans Parallel Distrib Syst 2021; 32: 214–228.

5. Yang, et al. Reliable data storage in heterogeneous wireless sensor networks by jointly optimizing routing and storage node deployment. Tsinghua Sci Technol 2021; 26: 230–238.

6. Sun G-L, Zhang L-S, Xue Y-B. Straw resource mass storage system’s design and implementation. J Comput Res Dev 2011; 48: 78–83.

7. Park, Kim. Availability analysis and improvement of active/standby cluster systems using software rejuvenation. J Syst Softw 2002; 61: 121–128.

8. Mukherjee, Dhar AS. Real-time fault-tolerance with hot-standby topology for conditional sum adder. Microelectron Reliab 2015; 55: 704–712.

9. Levitin, Xing, Dai. Cold vs. hot standby mission operation cost minimization for 1-out-of-N systems. Eur J Oper Res 2014; 234: 155–162.

10. Hong. Design and implementation of electronic voting system based on dual-link real-time hot standby. J Comput Appl 2018; 38: 257–259,265.

11. Zhang J-B, Cai J-Y, Meng Y, F. Design of the fault self-repair circuit system based on evolvable hardware and dual hot-backup technique. Microelectron Comput 2016; 33: 124–126,132.

12. Xin CJ. Research on hot standby switching mechanism of double three-vote-two safety computers. Chengdu: Southwest Jiaotong University, 2019.

13. Zhang, Guo, Shan. Comprehensive evaluation of risk severity level of railway signal system. J Southwest Jiaotong Univ 2010; 45: 758–762.

14. Yan, Tang, Yan. Research on concept and allocation principle of safety integrity level. J Beijing Jiaotong Univ 2017; 41: 79–84.

15. Reliability and safety of double 2-vote-2 redundancy system. Hefei: Hefei University of Technology, 2013.

16. Tan, Lin, et al. Design and reliability, availability, maintainability, and safety analysis of a high availability quadruple vital computer system. J Zhejiang Univ Sci A (Appl Phys Eng) 2011; 12: 926–935.

17. Zhang, Song. Redundancy control design of main control equipment of urban rail transit train network based on dual bus. Urban Mass Transit 2021; 24: 79–83.

18. Gao. Design and application of new rail vehicle network control system. Railw Locomot Car 2019; 39: 118–122.

19. Zhang, Wei, et al. Comparison study of D2V2R and C-DDMR structure. J Electron Meas Instrum 2009; 23: 15–22.

20. Zhang. Research on safety and performance analysis of computer based interlocking system based on dynamic fault tree analysis. J Railw Sci Eng 2019; 16: 1543–1552.

21. Huang, Lei. Reliability and security analysis of double 2oo2-fetch computer interlocking system based on Markov process. Railw Signaling Commun Eng 2017; 14: 1–4,17.

22. Cheng. Research on reliable redundant structure of computer interlocking system. Ind Control Comput 2016; 29: 78–79,81.

23. Chen, Fan, Wei, et al. All electronic computer interlocking system based on double 2-vote-2. China Railw Sci 2010; 31: 138–144.

24. Sun, Luo. Design of distributed vehicle security computer system based on FlexRay bus. Railw Comput Appl 2021; 30: 54–58.

25. Feng. Design and research of full electrical computer interlocking system for urban rail transit. J Railw Sci Eng 2021; 18: 2145–2155.

26. Wang, Gao. Study on reliability of redundant structure of station computer interlocking system based on dynamic fault tree. Autom Instrum 2021; 4: 31–34.

27. Cai, Wang. Synchronization mechanism for double 2oo2 safety computer platform of on-board automatic train protection. Comput Eng 2015; 41: 301–305.

28. Zheng. Analysis of locomotive manipulation automatic optimization device based on double 2-vote-2 computer system. J Railw Sci Eng 2013; 10: 112–115.

29. Zhang, Zou, et al. New degradation strategy of double 2-out-of-2 system. J East China Jiaotong Univ 2017; 34: 99–105.

30. Lopez, Aguado, Ugarte, et al. Exploiting redundancy and path diversity for railway signalling resiliency. In: 2016 IEEE international conference on intelligent rail transportation (ICIRT), Birmingham, 2016, pp.432–439.

31. Kim, Jeon H-J, Lee, et al. The design and evaluation of all voting triple modular redundancy system. In: Annual reliability and maintainability symposium. 2002 proceedings (Cat. No.02CH37318), Seattle, WA, USA, 2002, pp.439–444.

32. Liu. Development and design of metro network control software. Chengdu: Southwest Jiaotong University, 2019.