Abstract
Manufacturing industries involve both business processes and complex manufacturing processes. Predictive process monitoring techniques are effective for managing process executions by making multi-perspective real-time predictions, preventing issues such as delivery delays. Conventional predictive process monitoring for business processes focuses on predicting the next activity, the next event time, and the remaining time using single-task learning, which is costly and complex. For complex manufacturing processes, predictive process monitoring primarily aims to predict the remaining time, that is, the product cycle time. However, single-task learning methods fail to capture all the variations within historical process executions. To address these limitations, we propose the multi-gate mixture of transformer-based experts (MMoTE) framework, which leverages a transformer network within the multi-gate mixture-of-experts multi-task learning architecture to extract sequential features and employs gated expert networks to model task commonalities and differences. Empirical results demonstrate that MMoTE outperforms three alternative architectures across five real-life event logs, highlighting its generalization, effectiveness, and efficiency in predictive process monitoring.
Introduction
Under the wave of Industry 4.0, the manufacturing industry is undergoing profound changes, with smart manufacturing becoming an inevitable trend. To achieve this transformation, manufacturing industries must not only introduce advanced production equipment and technologies but also comprehensively optimize and modernize their production processes. 1 The processes in manufacturing are diverse and complex, typically categorized into conventional business processes within enterprise information systems and intricate manufacturing processes within production workshops. To realize smart manufacturing, it is essential to implement comprehensive monitoring and management across all processes, which not only enhances production efficiency but also helps in timely identification and resolution of potential issues. This significantly boosts overall operational efficiency and competitiveness of the enterprise. 2
Predictive process monitoring (PPM) is an advanced management approach initially developed for business processes. By analyzing event logs recorded in process-aware information systems (PAISs), PPM predicts various performance aspects of ongoing process instances. 3 These predictions typically include the next activity prediction, the next event time prediction, the remaining time prediction, and other key performance indicators.4–7 These predictive tasks are interrelated, sharing the same input data source (historical event logs) and influencing each other’s outcomes. For example, the result of the next activity prediction directly impacts the prediction of the next event time, which in turn affects the remaining time prediction. However, most current research methods are based on single-task learning (STL), where separate models are trained for each task, or they only combine next activity and next event time predictions, rarely addressing more than two tasks simultaneously. In contrast, our previous work 8 was the first to explore multiple prediction tasks in PPM using a basic multi-task learning (MTL) framework. While this was a step forward, there were still limitations in terms of performance, complexity, and efficiency, which motivated the improvements presented in this article.
As smart manufacturing evolves, the application of PPM has expanded to manufacturing processes.9–12 Although manufacturing processes also consist of a series of activities, each with a duration and a set of resources, they exhibit unique complexities due to numerous parameters and variables such as product type. 13 Unlike business processes, predictive monitoring for manufacturing processes primarily focuses on the remaining time prediction, that is, the product cycle time, as timely product delivery is critical for manufacturing enterprises. 14 However, most studies in this area also focus solely on this single task, without considering the potential performance improvements from related tasks.
To address these challenges, we introduce a novel MTL framework designed to achieve effective predictive monitoring across all processes in manufacturing. This framework aims to enhance the accuracy and generalization of PPM technologies, providing robust support for comprehensive automation in manufacturing. MTL has emerged as a crucial machine learning paradigm that significantly improves model generalization by sharing knowledge across multiple related tasks. In particular, the multi-gate mixture of experts (MMoE) method has gained traction in industry, 15 originating from the mixture-of-experts (MoE) architecture. 16 It employs multiple expert networks and gating mechanisms to enable effective information sharing and adaptive expert weighting tailored to the needs of different tasks. This architecture enhances the model's generalization and prediction accuracy while maintaining task independence. Inspired by auxiliary learning within MTL,17,18 we further explore the application of the MMoE framework in predictive monitoring of manufacturing processes. Specifically, we employ MTL to predict the remaining time by establishing auxiliary tasks that support this primary task: the next activity prediction and the next event time prediction serve as auxiliary tasks, while the remaining time prediction is treated as the primary task.
Additionally, most techniques used for the individual prediction tasks in PPM can be divided into two categories: conventional machine learning methods and deep learning-based approaches.7,19 In recent years, motivated by the superior capabilities of transformer networks in natural language processing (NLP), PPM techniques have evolved from RNN- and LSTM-based models, 20 through attention-based models, 21 to the current transformer-based models.22,23 The reason is that in manufacturing, the input to PPM is an event log of process executions recorded in PAIS or manufacturing systems (MSs). This log consists of multiple sequences of events with complex temporal ordering and dependencies, where each sequence represents a process execution. This allows each sequence of events to be treated like a sentence in natural language, enabling efficient feature learning using the transformer network.
Therefore, building on our previous research, 8 we propose the MMoTE (multi-gate mixture of transformer-based experts) approach, which integrates transformer networks with the MMoE MTL framework to enable predictive monitoring in both business and manufacturing processes. Unlike our previous research, 8 the proposed MMoTE approach addresses both the commonalities and distinctions between different tasks, reducing interference among weakly related tasks while enhancing feature sharing among strongly related tasks. Specifically, the approach leverages transformer networks to effectively extract both local and global features from process sequential data characterized by complex ordering and dependencies, and it employs the MMoE framework to achieve parallel learning of task-specific and shared feature representations. This further optimizes feature learning and improves the model's ability to recognize complex patterns, thereby improving performance and reducing computational resources in PPM of manufacturing.
In conclusion, this article presents the following contributions:
- We introduce a technique, MMoTE, designed to be applicable across various types of processes, including business processes and manufacturing processes, for predictive monitoring. Extensive experiments on event logs from these process types validate its effectiveness.
- We present a scalable MTL framework that enhances performance by sharing information across multiple tasks in PPM, improving predictions for business processes and supporting auxiliary learning to optimize PPM for manufacturing processes.
- We propose MMoTE, which integrates transformer networks with the MMoE framework, enabling efficient learning from diverse perspectives of event sequences and adapting to multiple tasks through gated units that manage expert networks.
The organization of this article is as follows. First, we discuss the relevant research work and offer a brief overview. Then, the “Preliminaries and problem statement” section introduces some fundamental concepts and the problem addressed in this article. Afterward, the “Approach” section provides an in-depth exposition of the proposed MMoTE approach. Extensive comparative experiments are conducted in the “Experimental evaluation” section to evaluate the performance of MMoTE. Finally, the “Conclusion and future work” section summarizes the article and suggests future work directions.
Related work
Predictive process monitoring technology
PPM belongs to the category of business process execution and monitoring, and its specific tasks can be divided according to the prediction goal: predicting the next event, 4 predicting the next event execution time, predicting the remaining execution time, 5 and predicting the outcome of a process execution (such as successful completion, failure termination, or other specific states),6,7 among others. Depending on the techniques used, they can be classified into conventional machine learning-based and deep learning-based approaches. Conventional business process prediction methods mainly include transition system-based remaining time prediction,24,25 probabilistic finite automata, 26 Markov chains, 27 and decision trees 28 to predict the next event, and random forests to forecast the outcome. 29 In contrast, Appice et al. 5 predicted the next event and its duration, along with the remaining time, through shallow machine learning methods. Nevertheless, a limitation of such methods is their significant dependence on manual feature engineering, especially when low-level feature representations are required.7,30
With the evolution of deep learning technology, numerous approaches based on neural networks have emerged in recent years. These approaches streamline manual feature selection and extraction, leading to significant improvements in predicting business process execution from extensive process execution logs. Rama-Maneiro et al. 31 systematically summarized these approaches and conducted exhaustive experimental evaluations. Depending on the type of neural network used, they can be categorized as recurrent neural network (RNN)-based,32,33 long short-term memory (LSTM)-based,20,34–36 attention mechanism-based,21,22 convolutional neural network (CNN)-based,37,38 generative adversarial network (GAN)-based, 4 graph neural network (GNN)-based,39,40 and custom networks.41–44
For example, Evermann et al. 32 first used a model composed of two hidden RNN layers to forecast the subsequent event. Cao et al. 33 first constructed a Petri net and its reachability graph from the event log, and then used an RNN with gate units to forecast the remaining duration time, increasing explainability. Tax et al. 20 presented an LSTM-based approach to predict the upcoming event and its duration time. Moreover, Camargo et al. 34 developed an LSTM-based model to forecast the sequence of events to be executed in the future, their execution times, and the associated resource pools. Similarly, Navarin et al. 35 predicted the remaining time in PPM using LSTM networks. In addition to the existing RNN-based process prediction methods, several studies integrated the attention mechanism as an optimization to enhance the accuracy and efficiency of the prediction model. For instance, Bukhsh et al. 22 provided a ProcessTransformer model using the self-attention mechanism, which adapts and optimizes the transformer network structure for specific forecasting tasks and achieves the expected prediction effect. Likewise, Wickramanayake et al. 21 developed different attention mechanisms and combined them with LSTM to forecast the next event.
In addition, several methods have been presented based on other types of neural networks. For instance, Taymouri et al. 4 put forward an innovative adversarial training framework based on GANs that is designed to make predictions about the next activity and its timestamp. Di Mauro et al. 37 and Pasquadibisceglie et al. 38 investigated how to employ CNN for predictive monitoring of processes. Harl et al. 39 were the first to use gated GNN to make decisions more explainable in process outcome prediction. Weinzierl 40 also explored how gated GNN could be used to forecast the next event.
Several other approaches based on customized networks have been presented. For instance, Khan et al. 41 proposed a memory augmented neural network (MANN) and employed it to recommend the upcoming event sequence for an on-going case. Theis and Darabi 42 augmented a mined Petri net process model with a time decay function, using the execution times of activities within a process as the primary variable to construct successive samples of process states; a prediction model is then trained on this basis to predict future activities using deep learning techniques. Guo et al. 43 proposed a feature selection approach alongside a feature-informed cascade model to make predictions.
In addition to the above methods, other studies have explored new directions by combining prior knowledge, process structure, and other domain knowledge, integrating them into neural network models for process prediction. Di Francescomarino et al. 36 utilized the structure of the process execution trace and a priori knowledge to predict the sequence of forthcoming activities using LSTM. Rama-Maneiro et al. 44 proposed a method combining RNN with GCN to simultaneously learn spatio-temporal information from both process models and historical process logs.
The methods discussed above focus primarily on exploring how different neural networks or machine learning techniques can be used for effective feature learning to enhance prediction performance. However, these approaches are all based on STL, that is, training the model on one prediction task at a time. While some methods aim to predict multiple aspects of an activity, such as the next activity, its associated time (i.e. next event prediction), and resources, none of them are based on MTL. Instead, they merely concatenate predictions across different tasks, without addressing them as distinct tasks within an MTL framework. A few studies have explored concepts related to MTL in PPM, though they differ from the focus of this work, which addresses the simultaneous prediction of multiple tasks in PPM. For instance, Chen et al. 23 proposed a pre-training model that can be applied to different tasks individually but does not support the simultaneous prediction of multiple tasks, meaning it does not employ an MTL approach. Similarly, Cheng et al. 45 focused only on outcome prediction in call center scenarios, applying MTL by incorporating data from other modalities; their work did not address multiple prediction tasks within PPM itself.
Multi-task learning
Recently, MTL has attracted significant interest as a way to enhance machine learning model performance by training models on multiple tasks simultaneously. 46 By sharing feature representations, MTL captures commonalities among related tasks, offering a more human-like learning process than STL, though it also introduces new challenges. MTL has been applied across fields such as NLP, recommendation systems, speech recognition, and computer vision.
STL versus MTL
STL, which focuses on learning one task at a time, is the most common approach in machine learning. In STL (as shown in Figure 1(a)), each task is learned independently with no shared knowledge or transfer between tasks, requiring each task to be learned from scratch. Moreover, STL constructs an independent model for each task, which may lead to high complexity and a tendency to overfit. In contrast, MTL enables multiple tasks to share learned knowledge through a shared underlying network and parameters. This sharing mechanism helps accelerate the learning process and improves the generalization ability of the model. Moreover, by simultaneously learning multiple tasks, MTL introduces additional constraints and regularization terms, which helps reduce the risk of overfitting on a single task and improves performance. Currently, most PPM techniques are based on STL, where the prediction tasks involving the next activity, the next event time, and the remaining time are learned separately to obtain individual models (as illustrated in Figure 1(a)).

The difference among traditional single-task learning and two different multi-task learning frameworks in PPM: (a) task-specific models for single-task learning, (b) traditional shared-backbone model for multi-task learning, and (c) MoE for multi-task learning. PPM: predictive process monitoring; MoE: mixture-of-experts.
Hard parameter sharing of MTL versus Soft parameter sharing of MTL
Existing MTL methods are typically segmented into two groups: hard parameter sharing and soft parameter sharing. Specifically, the former involves sharing weights across multiple tasks, allowing simultaneous training to minimize multiple loss functions. Traditional MTL models use this approach, where multiple tasks share the same bottom network (as shown in Figure 1(b)). 8 While this reduces the risk of overfitting, it can increase model complexity and limit flexibility since all tasks use the same parameter set, and the scalability of this framework is also challenged as the number of tasks increases. Soft parameter sharing involves using separate models for each task but incorporating parametric relationships or differences into a joint objective function. Mechanisms like regularization or constraints encourage similarity or distance between task models, which aids knowledge transfer and efficient parameter use. This approach allows the multi-task model to leverage both commonalities and differences among tasks, improving performance and model quality for each task. MoE represents a significant advancement in flexible parameter sharing. Initially proposed by Jacobs et al., 16 it divides a system into independent networks, each handling a portion of the data. Shazeer et al. 47 enhanced this concept with the sparsely-gated MoE layer, which integrates multiple experts and a trainable gating network. This approach uses a divide-and-conquer strategy to address complex problems, improving efficiency and model generalization. Furthermore, Shazeer et al. applied MoE to natural language modeling and machine translation, while Riquelme et al. 48 introduced the vision mixture-of-experts (V-MoE) model for image classification.
The MoE model performs well in single-task scenarios but faces challenges in MTL due to complex inter-task relationships such as correlation and conflict. 49 In MoE's MTL framework (as shown in Figure 1(c)), multiple tasks share a common set of experts and a single gating network, which may lead to conflicts and inefficiencies. To address these issues, MMoE 15 was introduced, utilizing multiple gating networks to enable task-specific expert selection and better capture task relationships. 50 Unlike MoE, which relies on one gating network for all tasks, MMoE allows different expert selections for distinct tasks. 51 As indicated by Wang et al., 52 MMoE enhances MTL by allowing task-specific adjustments to expert networks and improving the modeling of task relationships, thereby boosting overall performance. As a form of soft parameter sharing, MMoE uses soft gating networks to aggregate experts learned from different tasks, effectively addressing negative transfer problems. It outperforms other methods, such as cross-stitch networks, 53 particularly in content recommendation. Various novel approaches based on MMoE have emerged. For instance, Qin et al. 54 introduced the mixture of sequential experts (MoSE), which utilizes LSTM within an advanced MMoE framework to capture sequential user behavior. Zhang et al. 55 developed a dual-task model combining MMoE with bi-directional gated recurrent units (BiGRUs) for health status assessment and remaining useful life prediction, enhancing model versatility and dynamic task differentiation.
In the domain of MTL, despite variations in task definitions and sample characteristics, inherent commonalities are often observed among tasks. This is particularly evident in PPM, where forecasting the execution time of future activities and the remaining time of an ongoing case depends directly on which activities will be executed next. Thus, the time-related prediction tasks and the next activity prediction task share correlated and common feature representations. However, when the shared (common) representation dominates, the task-specific representations tend to weaken, while overly strong task-specific representations are usually detrimental to the shared representation. To share feature representations flexibly, the MMoE framework is chosen in this article to implement PPM in manufacturing.
Preliminaries and problem statement
This section introduces fundamental concepts and formal definitions pertinent to our study and outlines the problem addressed in this article to facilitate comprehension.
Preliminaries
An increasing number of information systems in manufacturing, such as PAIS and MS, automatically record a large amount of historical process execution data. This data can be analyzed using process mining techniques to gain valuable insights and enhance process performance. In manufacturing systems, an event log details various aspects of business or manufacturing processes, including operational events and related data. Each execution of such a process denotes a distinct process instance or case. Building on this foundation, we now outline the relevant definitions of event logs.
In a process of the manufacturing industry, an event signifies the completion or initiation of a particular production-related activity. An event can be represented as a tuple e = (c, a, t), where c is the case identifier, a is the executed activity, and t is the timestamp at which the event occurred; further attributes (e.g. resource) may be attached as well.

(Process Trace)

A trace σ = ⟨e_1, e_2, ..., e_n⟩ is a finite, time-ordered sequence of events belonging to the same case, that is, all events share one case identifier and t_i ≤ t_{i+1} for 1 ≤ i < n.

(Process Instance)

Usually, a process instance (also called a case) denotes one end-to-end execution of a business or manufacturing process; each completed case is recorded in the log as exactly one trace.

(Event Log)

An event log L = {σ_1, σ_2, ..., σ_m} is a multiset of traces collecting the historical executions of a process.
For example, consider a fragment of an event log from a simple automobile manufacturing process, which closely resembles a typical business process, as shown in Table 1. In this table, each row represents a distinct event: the Case ID identifies the process instance to which the event belongs, the Activity denotes the production step that was executed, and the timestamp records when the event occurred.
Event log of a simple automobile manufacturing process.
Since the aim of our study is to make predictions for an on-going case, it is necessary to establish the notion of a prefix trace, from which relevant features are extracted.
A prefix trace of length k of a trace σ = ⟨e_1, ..., e_n⟩, denoted hd^k(σ) = ⟨e_1, ..., e_k⟩ with k ≤ n, is the subsequence consisting of the first k events of σ; it represents the part of an ongoing case that has been observed so far.
Using an on-going case as an illustration, the three related prediction tasks investigated in this study are defined as follows:
(Next Activity Prediction)

Regarding the prefix trace hd^k(σ) of an on-going case, the next activity prediction aims to predict the activity of the next event e_{k+1}.

(Next Event Time Prediction)

Regarding the prefix trace hd^k(σ), the next event time prediction aims to predict the elapsed time between the last observed event e_k and the next event e_{k+1}.

(Remaining Time Prediction)

Regarding the prefix trace hd^k(σ), the remaining time prediction aims to predict the time between the last observed event e_k and the completion of the case, that is, the occurrence of its final event e_n.
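To make these definitions concrete, the following minimal sketch (plain Python, with hypothetical activity names and timestamps rather than data from any real log) shows how one completed trace yields a prefix-based training sample for each of the three tasks.

```python
# One completed case: (activity, completion timestamp in hours).
# Activity names and times are illustrative only.
trace = [("Turning", 0.0), ("Milling", 2.5), ("Inspection", 3.0), ("Packing", 4.0)]

samples = []
for k in range(1, len(trace)):                        # prefix lengths 1 .. n-1
    prefix = trace[:k]                                # hd^k(sigma)
    next_activity = trace[k][0]                       # Task 1 label
    next_event_time = trace[k][1] - trace[k - 1][1]   # Task 2 label
    remaining_time = trace[-1][1] - trace[k - 1][1]   # Task 3 label
    samples.append((prefix, next_activity, next_event_time, remaining_time))

for prefix, a, nt, rt in samples:
    print([e[0] for e in prefix], "->", a, f"next={nt:.1f}h", f"remaining={rt:.1f}h")
```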
Problem statement
The problem addressed in this article involves predicting the next activity, the next event time, and the remaining time of an ongoing case at run-time, which is achieved by training a prediction model from the historical event log recorded in PAIS or MS. Based on the trained model, we can predict the next activity, event time, and remaining time for any ongoing process instance. The detailed utilization of MMoTE in real-world manufacturing scenarios is illustrated in Figure 2: historical cases recorded in PAIS or MS are used to train the MMoTE prediction model off-line, and the trained model is then applied on-line to the prefix traces of ongoing cases to produce the three predictions simultaneously.

Illustration of the multi-gate mixture of transformer-based experts (MMoTE) framework utilization in manufacturing.
Approach
Modeling preliminary
Considering an event log L, each completed trace σ ∈ L is decomposed into its prefix traces, and each prefix hd^k(σ) is paired with three labels: the next activity, the next event time, and the remaining time. These prefix-label pairs constitute the training samples from which the multi-task prediction model is learned.
Multi-gate mixture of transformer-based experts
This section delineates MMoTE, built upon MMoE and the transformer network, for PPM in manufacturing. Compared to other MTL frameworks, MMoE offers distinct advantages in flexibility and efficiency. First, through the combination of multiple expert networks, MMoE can capture richer feature representations, thus better adapting to complex process prediction tasks. Second, the design of the gating network enables the model to dynamically select the appropriate combination of experts for each task, which is particularly important when dealing with tasks with high variability. Finally, the modular design of MMoE makes the model easier to extend and optimize, which is crucial for evolving PPM scenarios. For the three prediction tasks mentioned earlier, namely next activity prediction, next event time prediction, and remaining time prediction, the proposed MMoTE can be applied to build a multi-task fusion prediction model. During the prediction phase, this model facilitates multi-task parallel prediction on-line. The model structure of MMoTE comprises the following components, as shown in Figure 3.

Illustration for the multi-gate mixture of transformer-based experts (MMoTE) model structure.
Transformer Shared Bottom Module
A shared-bottom transformer module is employed to process sequential input from process traces, facilitating effective representation learning from the input. In this module, the sequential trace input is first converted into trace embeddings and then passed through multi-head self-attention and feed-forward layers to learn contextualized representations of the events.
Considering a trace σ = ⟨e_1, ..., e_n⟩, each event must first be encoded numerically before it can be processed by the network. Trace embedding comprises two components: feature embedding and position embedding. As we know, one-hot encoding uses binary values (0 and 1) to represent different category states: for each category, only one dimension is set to 1 (indicating that the current category state is active) and the rest are set to 0. In a classification task, the dimension of the one-hot vector increases dramatically with the number of categories, which leads to a high-dimensional sparsity problem. To solve this problem, one approach is to map each one-hot vector to a low-dimensional dense vector through a learned embedding matrix; a position embedding is then added so that the model can exploit the ordering of events within the trace.
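As a minimal sketch of this embedding step, assuming TensorFlow 2.x and illustrative sizes for the activity vocabulary, maximum prefix length, and embedding dimension (none of these are the paper's tuned values), activity indices are mapped to dense vectors and summed with learned position embeddings:

```python
import tensorflow as tf
from tensorflow.keras import layers

class TraceEmbedding(layers.Layer):
    """Feature embedding + position embedding for a padded prefix trace."""
    def __init__(self, vocab_size, max_len, d_model):
        super().__init__()
        self.feat_emb = layers.Embedding(vocab_size, d_model)  # dense activity vectors
        self.pos_emb = layers.Embedding(max_len, d_model)      # learned positions

    def call(self, act_ids):
        positions = tf.range(start=0, limit=tf.shape(act_ids)[-1], delta=1)
        return self.feat_emb(act_ids) + self.pos_emb(positions)

# Illustrative sizes, not taken from any of the evaluated logs.
vocab_size, max_len, d_model = 24, 32, 64
act_ids = tf.keras.Input(shape=(max_len,), dtype="int32")      # padded activity indices
x = TraceEmbedding(vocab_size, max_len, d_model)(act_ids)      # (batch, max_len, d_model)
```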
Especially, the transformer model employs a multi-head self-attention mechanism in which each head i has its own projection parameters W_i^Q, W_i^K, and W_i^V, producing head_i = Attention(X W_i^Q, X W_i^K, X W_i^V), where the scaled dot-product attention is Attention(Q, K, V) = softmax(Q K^T / √d_k) V. Considering the attention mechanism with h heads, the head outputs are concatenated and linearly projected, MultiHead(X) = Concat(head_1, ..., head_h) W^O, allowing the model to jointly attend to information from different representation subspaces of the trace.
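Continuing the sketch above, one encoder block built from Keras' MultiHeadAttention layer can produce the shared-bottom representation; the head count and feed-forward width here are illustrative choices:

```python
# Self-attention over the trace embedding x from the previous sketch,
# followed by a position-wise feed-forward network, each with a residual
# connection and layer normalization.
attn_out = layers.MultiHeadAttention(num_heads=4, key_dim=d_model // 4)(x, x)
h = layers.LayerNormalization()(x + attn_out)

ffn = tf.keras.Sequential([
    layers.Dense(4 * d_model, activation="relu"),
    layers.Dense(d_model),
])
shared_repr = layers.LayerNormalization()(h + ffn(h))   # shared-bottom output
```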
Muti-gate Mixture-of-Experts Module
This module consists of a mixture of multi-layer perceptron (MLP) experts with gating networks. Each expert of the MoE models different aspects of the input for each task. Gating networks dynamically adjust the weights of the experts based on the input features to achieve task-specific feature combinations and expert selection.

Based on the preceding Transformer Shared Bottom Module, the input to the current module is the shared representation x that it produces. Following Ma et al., 15 the mixture output for task k is f^k(x) = Σ_{i=1}^{n} g^k(x)_i f_i(x), where f_i denotes the i-th expert network, n is the number of experts, and g^k(x) = softmax(W_g^k x) is the gating network of task k, which produces a weight distribution over the experts.
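A minimal sketch of this module, built on a pooled version of the shared-bottom output from the previous sketch, follows; the expert and gate sizes are illustrative, and each gate realizes the mixture f^k(x) = Σ_i g^k(x)_i f_i(x) defined above:

```python
n_experts, n_tasks, units = 8, 3, 16                           # illustrative sizes

pooled = layers.GlobalAveragePooling1D()(shared_repr)          # (batch, d_model)

# Expert networks: simple MLPs, each modeling a different view of the input.
experts = [tf.keras.Sequential([layers.Dense(units, activation="relu"),
                                layers.Dense(units, activation="relu")])
           for _ in range(n_experts)]
expert_out = tf.stack([e(pooled) for e in experts], axis=1)    # (batch, n_experts, units)

# One softmax gate per task selects a task-specific mixture of experts.
task_inputs = []
for _ in range(n_tasks):
    gate = layers.Dense(n_experts, activation="softmax")(pooled)   # g^k(x)
    gate = tf.expand_dims(gate, axis=-1)                           # (batch, n_experts, 1)
    task_inputs.append(tf.reduce_sum(gate * expert_out, axis=1))   # f^k(x)
```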
Multi-tower Module
In this module, each task corresponds to an individual tower, facilitating the independent optimization for multiple tasks. Such a configuration is frequently employed in MTL research due to its efficacy in accommodating tasks with diverse scales and data characteristics. After the MMoE module, for each task, we need to generate distinct outputs to derive three independent prediction outcomes. The process is described as follows:
Each tower receives the task-specific expert mixture produced by the MMoE module and transforms it into the final output for its task: the tower for the next activity prediction ends with a softmax layer over the set of possible activities, whereas the towers for the next event time and the remaining time predictions each end with a single linear output unit.

Here, the three tower outputs together constitute the multi-task prediction for a given prefix trace.
Given that the MMoTE prediction model addresses multiple tasks, we need to compute the loss values for each task separately and subsequently aggregate them to construct the multi-task loss function. The first task involves multi-class prediction, necessitating a multi-class cross-entropy loss function to evaluate the disparity between the predicted and true outputs for every input trace within an event log. Conversely, the remaining two tasks involve regression predictions, warranting the use of a LogCosh loss function to quantify the difference. The loss function for the first task (i.e. the next activity prediction task) is

L_1 = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} y_{i,c} log ŷ_{i,c},

where N is the number of training samples, C is the number of activity classes, y_{i,c} is the true one-hot label, and ŷ_{i,c} is the predicted probability. For the two time prediction tasks, the LogCosh loss is L_t = (1/N) Σ_{i=1}^{N} log(cosh(ŷ_i − y_i)), t ∈ {2, 3}, and the overall training objective aggregates the three losses as a (possibly weighted) sum.
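A minimal sketch of the combined objective, assuming equal task weights (the aggregation weights are not specified above), can be written with Keras' built-in losses:

```python
import tensorflow as tf

cce = tf.keras.losses.CategoricalCrossentropy()   # Task 1: multi-class
logcosh = tf.keras.losses.LogCosh()               # Tasks 2 and 3: regression

def multi_task_loss(y_act, p_act, y_nt, p_nt, y_rt, p_rt, w=(1.0, 1.0, 1.0)):
    """Weighted sum of per-task losses; the equal weighting w is an assumption."""
    return (w[0] * cce(y_act, p_act)        # next activity
            + w[1] * logcosh(y_nt, p_nt)    # next event time
            + w[2] * logcosh(y_rt, p_rt))   # remaining time
```

Equivalently, one loss per output head can be passed to model.compile(loss={...}), letting Keras sum them during training.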
Once the prediction model is well trained, it can be applied to the event sequence of an ongoing case, yielding multi-task prediction values for the ongoing process.
Experimental evaluation
Extensive experiments were conducted to validate the effectiveness and applicability of the MMoTE in manufacturing by utilizing five real-life event logs sourced from different processes (the business process vs. the manufacturing process). In particular, we developed the MMoE approach according to Ma et al. 15 and used the MTLFormer 8 as well as the ProcessTransformer approach (called STLFormer) proposed by Bukhsh et al. 22 to conduct the ablation experiments. For STLFormer, a distinct backbone network was created for each task based on the transformer network. For MTLFormer, a shared-backbone model was created for multiple tasks based on the transformer network. For MMoE, we constructed a network capable of addressing multiple tasks by facilitating automatic parameter adjustment through a gating mechanism positioned between shared and task-specific models.
Experimental setup
Datasets
In our experiment, we used five datasets with three different processes from the literature for evaluation. These datasets were sourced from the public 4TU research repository. Since there are few publicly available manufacturing process event logs, we used only the Production 56 event log in this article to prove that the proposed approach is also applicable to manufacturing process prediction. Additionally, the widely used Helpdesk 57 and BPIC2012 58 event logs were chosen to compare the evaluation metrics with previously proposed approaches, highlighting the performance advantages of our proposal. These datasets are described as follows, and a detailed comparison is shown in Table 2.
- Helpdesk: This real-life log comes from the ticket management process of the Helpdesk of an Italian software company. It consists of 4580 cases, 21,348 events, and 14 activities. The main attributes include the Case ID, Activity, Resource, and Complete Timestamp.
- BPIC2012: This real-life log comes from a Dutch financial institute and records a loan application process for a personal loan or overdraft within a global financing organization. The log contains 13,087 cases, 262,200 events, and 23 activities. Because the log is a merger of three intertwined sub-processes (the Application, the Offer, and the Workflow), where the first letter of each task name identifies the sub-process it originated from, three individual subsets, BPIC2012_A, BPIC2012_O, and BPIC2012_W, can be extracted for use in our experiment.
- Production: This real-life log originates from the manufacturing data of some products in a production workshop from January to March 2012. The log contains 225 cases, 4543 events, and 55 activities. The main attributes include the Case ID, Activity, Start Timestamp, Complete Timestamp, Span, Work Order Qty, and Part Desc., as shown in Table 3. Among them, Work Order Qty and Part Desc., which are specific to the manufacturing process, represent the quantity and type of products to be produced.
Comparison of different datasets.
Fragment of production event log.
Overall, we train the prediction model for each technique using the five real-world datasets mentioned above. Every dataset is first preprocessed and then used to train the model. We split each dataset chronologically, using the first 80% of events (by order of occurrence) for training and the remaining 20% as the test set to evaluate how well each approach works. All approaches in this study were implemented with Python 3.8 and TensorFlow 2.5.0. The experiments were run on Windows 10 with two 12-core Intel Xeon 5118 CPUs (2.30 GHz), 256 GB of RAM, and three NVIDIA Tesla V100 GPUs.
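As a sketch of this chronological split, assuming a pandas frame with hypothetical columns case_id and timestamp (here the split is realized at the case level by case start time, so that no case straddles the training and test sets):

```python
import pandas as pd

log = pd.read_csv("event_log.csv", parse_dates=["timestamp"])   # hypothetical file

# Order cases by the timestamp of their first event and take the earliest
# 80% of cases for training, the most recent 20% for testing.
case_starts = log.groupby("case_id")["timestamp"].min().sort_values()
cutoff = int(len(case_starts) * 0.8)
train_ids = set(case_starts.index[:cutoff])

train_log = log[log["case_id"].isin(train_ids)]
test_log = log[~log["case_id"].isin(train_ids)]
```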
Evaluation Metric
Since the next activity prediction task (i.e. Task 1) in PPM is a standard multi-class classification problem, we evaluated the above techniques using four essential metrics: accuracy, precision, recall, and F-score. Because the next event time prediction task (i.e. Task 2) and the remaining time prediction task (i.e. Task 3) are regression problems, we employed the mean absolute error (MAE) to assess both of them. Higher accuracy, precision, recall, and F-score values indicate better classification performance, whereas a lower MAE indicates better regression performance.
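The following sketch computes the reported metrics with scikit-learn on toy labels; the weighted averaging scheme for precision, recall, and F-score is an assumption, as the averaging used here is not stated:

```python
from sklearn.metrics import (accuracy_score, mean_absolute_error,
                             precision_recall_fscore_support)

# Toy labels for illustration only.
y_true_act = ["A", "B", "A", "C"]; y_pred_act = ["A", "B", "C", "C"]
y_true_rt = [10.0, 4.5, 7.0];      y_pred_rt = [8.0, 5.0, 7.5]

acc = accuracy_score(y_true_act, y_pred_act)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true_act, y_pred_act, average="weighted", zero_division=0)
mae_rt = mean_absolute_error(y_true_rt, y_pred_rt)   # lower is better

print(f"acc={acc:.2f} precision={prec:.2f} recall={rec:.2f} "
      f"F={f1:.2f} MAE={mae_rt:.2f}")
```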
It is important to note that in manufacturing processes, which involve batch production, many activities are repeated within process instances. For such processes, the primary focus is on the remaining production time, which differs from the perspective of business processes. In this experiment, to evaluate the performance of the MMoTE in business process predictive monitoring, we measure performance metrics across three tasks for MMoTE and several other methods. However, for the manufacturing process, we consider only the performance metric of the remaining time prediction (Task 3). In our study, the first two tasks of the next activity prediction (Task 1) and the next event time prediction (Task 2) serve as auxiliary tasks, and the remaining time prediction (Task 3) is the target task, allowing us to assess the application of multi-task learning in predictive monitoring of manufacturing processes. Therefore, the experimental results for business processes and the manufacturing process will be presented separately below.
Experimental results on public business process logs
Our proposed MMoTE is trained on each of the business process logs using a default hyperparameter configuration, without task-specific tuning.
Comparison of approach effectiveness
A comprehensive comparison of MMoTE, MMoE, MTLFormer, and STLFormer demonstrates the effectiveness of MMoTE as proposed in this study. Notably, the performance achieved by the MMoTE approach is obtained without hyperparameter optimization, suggesting that the approach still has considerable room for further improvement. Table 4 describes the performance comparison of these four approaches on five datasets.
Comparison of different approaches on four datasets from business processes.
MMoTE: multi-gate mixture of transformer-based experts; MMoE: multi-gate mixture-of-experts; MTL: multi-task learning; STL: single-task learning; MAE: mean absolute error.
Next, we compare the performance of MMoTE, MMoE, MTLFormer, and other task-specific methods on the three individual tasks to evaluate them against the latest techniques in PPM. These task-specific methods are based on STL: each approach is applied to a specific task, with a separate model trained per task.
Comparison of different approaches in the next activity prediction task (higher is better).
MMoTE: multi-gate mixture of transformer-based experts; MMoE: multi-gate mixture-of-experts; MTL: multi-task learning; STL: single-task learning; RNN: recurrent neural network; LSTM: long short-term memory; CNN: convolutional neural network; MANN: memory augmented neural network; BERT: bidirectional encoder representations from transformer.
Comparison of different approaches in the next event time prediction task (lower is better).
MMoTE: multi-gate mixture of transformer-based experts; MMoE: multi-gate mixture-of-experts; MTL: multi-task learning; STL: single-task learning; LSTM: long short-term memory; MANN: memory augmented neural network.
Comparison of different approaches in the remaining time prediction task (lower is better).
MMoTE: multi-gate mixture of transformer-based experts; MMoE: multi-gate mixture-of-experts; MTL: multi-task learning; STL: single-task learning; LSTM: long short-term memory.
Comparison across different prefix trace lengths
To further evaluate the performance of the four approaches, we analyze the predicted samples by their lengths, studying the performance of these approaches when predicting samples with different prefix trace lengths. Figures 4 to 7 present a comparative analysis and evolution of performance metrics for MMoTE, MMoE, MTLFormer, and STLFormer across all datasets as length increases in three tasks. Subfigures (a) to (d) illustrate the trends in accuracy, precision, recall, and F-score for Task 1 with increasing prefix trace lengths. Additionally, subfigure (e) demonstrates the changes in MAE (i.e. MAE-nt) for Task 2, while subfigure (f) shows the evolution of MAE (i.e. MAE-rt) for Task 3 as the prefix trace length increases. We analyze the performance from three perspectives: overall performance, the trend of change, and the magnitude of change (stability of prediction) on different datasets, as shown in Figures 4 to 7.

Comparison of prediction performance across different prefix trace lengths on the Helpdesk dataset.

Comparison of prediction performance across different prefix trace lengths on the BPIC2012_A dataset.

Comparison of prediction performance across different prefix trace lengths on the BPIC2012_O dataset.

Comparison of prediction performance across different prefix trace lengths on the BPIC2012_W dataset.
From Figure 4, the performance advantage of the MMoTE approach is significant on the Helpdesk dataset. As shown in Figure 4(a) to (d), MMoTE predicts higher values of accuracy, precision, recall, and F-score for Task 1 with different prefix trace lengths compared to the other methods. Conversely, the two MAE metrics associated with Tasks 2 and 3 (i.e. MAE-nt in Figure 4(e) and MAE-rt in Figure 4(f)) are significantly lower than those of other approaches. Regarding the change trends, MMoTE is more consistent with the MMoE approach, while MTLFormer aligns more closely with STLFormer. Regarding the magnitude of change, the MMoTE is more stable than the other prediction approaches. The values of all four metrics (Figure 4(a) to (d)) show a gradual increase with longer prefix trace lengths, particularly beyond a length of 8. However, for the MAE-nt and MAE-rt metrics (Figure 4(e) and (f)), the performance advantage of MMoTE becomes more pronounced as the prefix trace length exceeds 8, especially compared to MTLFormer. This may be due to the presence or absence of a crucial event in the prefix trace at the time of prediction.
From Figure 5, the performance advantage of the MMoTE approach is more prominent, outperforming the other three approaches when the prefix trace length is 7–9. Overall, the performance trends of MMoTE and MMoE exhibit remarkable similarity, while the performance variations between MTLFormer and STLFormer differ significantly for Task 1, as depicted in Figure 5(a) to (d). However, the trends for Tasks 2 and 3 show greater consistency, as depicted in Figure 5(e) and (f). Regarding the magnitude of change, the MTLFormer approach demonstrates the best stability across the six metrics for the three tasks.
From Figure 6, it is evident that the MMoTE and MMoE approaches exhibit similar trends, while MTLFormer and STLFormer also share similar patterns. The MMoTE approach shows more drastic changes in Tasks 1 and 2 (Figure 6(a) to (e)) but demonstrates more stable performance in Task 3 (Figure 6(f)). In terms of the magnitude of change, the MTLFormer approach maintains more stable prediction performance in Tasks 1 and 2 (Figure 6(a) to (e)).
From Figure 7, the MTLFormer approach exhibits significant performance advantages in Tasks 1 and 2 (Figure 7(a) to (e)), and an even more pronounced advantage over the other three approaches in Task 3 (Figure 7(f)). Considering the overall change trend, the performance of MTLFormer and STLFormer exhibits greater consistency in Tasks 1 and 2 (Figure 7(a) to (e)), while the difference between the two is more significant in Task 3 (Figure 7(f)). In contrast, the overall performance of MMoTE and MMoE on the three tasks is more consistent, except for the significant difference in prediction performance at the maximum length in Task 1 (Figure 7(a) to (d)).
Comparison of approach complexity and efficiency
To comprehensively evaluate the model complexity of the four approaches, we compared their number of trainable parameters (i.e. model size) and the training time required across different datasets, as shown in Table 8. The number of trainable parameters (#params) refers to all the weights and biases that need to be trained during model training. Note that the #params for the STLFormer approach is calculated as the sum of the parameters of its three task-specific models.
Comparison of model size (1K = 1000) and training time across different approaches.
MMoTE: multi-gate mixture of transformer-based experts; MMoE: multi-gate mixture-of-experts; MTL: multi-task learning; STL: single-task learning.
First, it is observed that the #params for MMoTE and MMoE consistently remain below 100K (1K = 1000) across all datasets, while the model sizes of MTLFormer and STLFormer exceed 100K on all datasets. Furthermore, the MMoE approach exhibits the smallest model size on all datasets, followed by MMoTE, STLFormer, and MTLFormer. This indicates that the MMoE framework possesses a distinct advantage in MTL. Despite utilizing the transformer network, which typically increases the number of model parameters, MMoTE remains smaller than STLFormer and MTLFormer, both of which also employ the transformer network. Second, the training time of MMoE is the lowest across all datasets, while MMoTE, MTLFormer, and STLFormer exhibit similar training times. This may be attributed to the fact that the latter three methods are all based on the transformer network. Considering both metrics, the #params values for MMoTE and MMoE are comparable, indicating similar model complexity; however, their training times differ significantly, which may be because the transformer network requires more time to learn the trace features from these datasets.
Parameter analysis of MMoTE
For MMoTE, three key parameters significantly impact process predictions: the number of experts, the number of units in each expert network, and the number of heads within the multi-head attention mechanism. To investigate the effects of these parameters on model performance, we conducted experiments using the business process datasets described above. The number of experts reflects the capability of capturing differences among tasks; we evaluated its impact on MMoTE performance by varying it over 2, 4, 8, 16, and 24. The number of units denotes the number of hidden layer units in each expert network of MMoTE; we evaluated the impact of the hidden feature length by varying it over 4, 8, 16, 32, and 64. Lastly, the number of heads within the multi-head attention mechanism indicates how many different information sources MMoTE can attend to simultaneously during parallel input sequence processing; we evaluated its effect by varying it over 2, 4, 8, 16, and 32.
The parameter analysis on the Helpdesk dataset
From Figure 8(a), it is evident that as the number of experts increases, the performance of MMoTE in Tasks 2 and 3 shows a consistent trend, indicating that these tasks may share many features and have a strong correlation under this parameter. Within the selected range of this parameter, Task 1 achieves optimal performance with 4 and 16 experts, while Tasks 2 and 3 both reach their best performance with 16 experts. In Figure 8(b), the performance of Task 1 peaks at 8 units and then gradually declines as the number of units increases. This suggests that Task 1 achieves optimal results with 8 units, while Task 2 achieves near-optimal results with either 8 or 32 units, and Task 3 performs best with 32 units. From Figure 8(c), Task 1 reaches optimal performance with 4 heads, at which point Task 3 also achieves optimal results, while Task 2 attains near-optimal performance. In summary, the optimal settings for Tasks 1, 2, and 3 vary, indicating different sensitivities to these parameters.

Performance analysis with different parameters of multi-gate mixture of transformer-based experts (MMoTE) on the Helpdesk dataset.
The parameter analysis on the BPIC2012_A dataset
As shown in Figure 9, the performance of MMoTE in Task 2 remains relatively stable with changes in the three parameters, indicating that this task is less sensitive to these parameters. In contrast, the performance of MMoTE in the other two tasks shows more noticeable variations, especially in Task 1. Specifically, from Figure 9(a), although the four metrics for Task 1 show inconsistent trends, an optimal parameter can still be identified when the number of experts is 16. For Tasks 2 and 3, the optimal results are achieved when the number of experts is 24. In Figure 9(b), both Tasks 1 and 2 achieve optimal results when the hidden feature length is 8. However, Tasks 2 and 3 attain their best performance with a hidden feature length of 32, while Task 1 performs the worst at this setting. Figure 9(c) shows that Tasks 1 and 2 both achieve optimal results with four heads, whereas Task 3 requires only two heads for optimal performance.

Performance analysis with different parameters of multi-gate mixture of transformer-based experts (MMoTE) on the BPIC2012_A dataset.
The parameter analysis on the BPIC2012_O dataset
As depicted in Figure 10, the performance trends of MMoTE across three tasks generally align with the increase in the values of the three parameters, indicating a strong correlation among these tasks. This suggests that they may share some vital features, and as the parameter values vary, these shared features may influence the performance of MMoTE in a similar manner. Furthermore, it reflects that the sensitivity of MMoTE to these three parameters is comparable across the three tasks. Within the given range of parameter variations, the optimal parameter values for the three tasks can be easily identified, specifically when the number of experts is 16, the number of units is 16, and the number of heads is 4.

Performance analysis with different parameters of multi-gate mixture of transformer-based experts (MMoTE) on the BPIC2012_O dataset.
The parameter analysis on the BPIC2012_W dataset
As shown in Figure 11, the performance of MMoTE in Task 2 remains relatively stable with changes in the three parameters, indicating that this task is less sensitive to these parameters. In contrast, the performance of MMoTE in the other two tasks shows more noticeable variations, especially in Task 3. Specifically, as shown in Figure 11(a), Task 1 achieves the optimal performance when the number of experts is 4. For Tasks 2 and 3, the best results are obtained when the number of experts is 16. In Figure 11(b), Tasks 2 and 3 achieve the optimal performance when the number of units is 8, while Task 1 achieves its second-best performance at this value, with its optimal performance at 16 units. Figure 11(c) shows that Tasks 1 and 3 achieve optimal performance when the number of heads is 2. At this value, Task 2 achieves its second-best performance, which is nearly identical to its optimal performance when the number of heads is 16.

Performance analysis with different parameters of multi-gate mixture of transformer-based experts (MMoTE) on the BPIC2012_W dataset.
In summary, the optimal parameter settings for different tasks in MTL can vary and sometimes conflict. Therefore, it is crucial to consider the similarities and differences between tasks and their sensitivity to parameters in our proposed MMoTE approach. By finding a balance, we can enable multiple tasks to achieve or approach optimal performance simultaneously.
Experimental results on the public manufacturing process log
We conducted this experiment using the Production dataset from a real workshop. Due to the limited availability of publicly accessible comparison methods, we did not perform a direct comparison with other published methods. Instead, we performed a random search over the hyperparameters of the four approaches MMoTE, MMoE, MTLFormer, and STLFormer, and selected the best-performing hyperparameter combination for each method. Additionally, we implemented single-task remaining time prediction methods based on LSTM and GRU. Since predictive monitoring of manufacturing processes focuses primarily on the remaining time prediction, this experiment exclusively compares the MAE (i.e. MAE-rt) for Task 3.
Comparison of approach effectiveness, complexity, and efficiency
Although this experiment focuses solely on the performance of remaining time prediction (Task 3), Tasks 1 and 2 are still used by the MMoTE approach as auxiliary tasks. The MMoTE approach employs MTL to predict the remaining time of the manufacturing process, as do the MMoE and MTLFormer methods. Table 9 presents the prediction performance, model complexity, and training time of the different approaches on the Production dataset for Task 3. As shown in Table 9, the MMoTE approach achieves the best performance, followed by MTLFormer, STLFormer, MMoE, GRU, and LSTM. Regarding model complexity (#params), the MMoE approach maintains the lowest complexity, while MMoTE, STLFormer, LSTM, and GRU are comparable in this aspect; in contrast, the model complexity of MTLFormer increases significantly. In terms of training time, the MMoE approach requires the least time, followed by STLFormer, LSTM, GRU, MMoTE, and finally MTLFormer. A comprehensive analysis of both #params and training time reveals that training time does not always increase proportionally with model complexity. This discrepancy may be due to factors such as computational resources and dataset characteristics affecting the training time.
Comparison of model size (1K = 1000), training time, and remaining time prediction across different approaches on the production dataset.
MMoTE: multi-gate mixture of transformer-based experts; MMoE: multi-gate mixture-of-experts; MTL: multi-task learning; STL: single-task learning; LSTM: long short-term memory; GRU: gated recurrent units.
Comparison across different prefix trace lengths
To further evaluate the aforementioned approaches, we analyze their performance in remaining time prediction (Task 3) at different prefix trace lengths. Figure 12 illustrates the variation in MAE performance (i.e. MAE-rt) for the six different approaches at different stages of process execution (i.e. different lengths of prefix traces). Some approaches, such as STLFormer and LSTM, exhibit a rapid decrease in MAE-rt initially with increasing prefix length, followed by stabilization. In contrast, approaches like MMoTE and GRU show relatively stable MAE changes across the entire length range, indicating higher prediction stability. From different stages, the MMoTE approach demonstrates superior performance over larger continuous length intervals (i.e. [11, 29] and [64, 78]) compared to other methods. The MTLFormer approach performs best in intervals like [0, 10] and [54, 63], while the MMoE approach excels in the [30, 53] range. This phenomenon may be attributed to the varying ability of different approaches to learn and adapt to data features as specific events occur during process execution. The MMoTE approach likely captures and utilizes critical information in the data more effectively across most stages, maintaining a performance advantage over larger length intervals. In contrast, MTLFormer and MMoE may perform better at specific stages or under certain data characteristics, resulting in relatively better performance within certain length intervals.

Comparison of remaining time prediction performance across different prefix trace lengths on the production dataset.
Parameter analysis of MMoTE
Similarly, we conduct experiments on the Production dataset to investigate the impacts of the key parameters mentioned above: the number of experts, the number of units in the expert network, and the number of heads within the multi-head attention mechanism, as shown in Figure 13. From Figure 13(a), as the number of experts increases from 2 to 16, the MAE-rt initially rises and then falls, indicating that a greater number of expert networks does not always lead to improved performance. Performance peaks with 16 expert networks, where the MAE-rt is at its lowest. However, when the number of experts increases further to 24, the MAE-rt begins to rise again, suggesting that too many expert networks can make the model overly complex and reduce predictive performance. This indicates that while increasing the number of expert networks enhances the model's capacity to learn both shared and task-specific information, excessive expert networks may lead to overfitting or instability during training. From Figure 13(b), as the number of units increases from 4 to 8, performance improves; the task reaches its peak performance with 8 units, where the MAE-rt is at its lowest. However, as the number of units increases further to 64, overall performance begins to decline. This suggests that an appropriate number of units allows the model to capture the complexity of the task effectively while avoiding overfitting. From Figure 13(c), regarding the multi-head attention mechanism, two heads deliver the best performance, and the MAE-rt gradually rises as the number of heads increases, demonstrating that adding more heads does not improve performance for this task. This suggests that a simple attention mechanism is sufficiently effective, and too many heads can make the model overly complex and difficult to train.

Performance analysis with different parameters of multi-gate mixture of transformer-based experts (MMoTE) on the production dataset.
Sensitivity analysis of model performance to process complexity
To provide a more comprehensive comparison of our proposed MMoTE across different types of processes in manufacturing, we conducted a sensitivity analysis of the model's performance with respect to process complexity. Specifically, we began by examining the event log characteristics that represent process complexity and then analyzed the overall performance of MMoTE across event logs with different process complexity. As is well known, event logs reflect the complexity of the actual process execution. As indicated in Table 2, the characteristics that best represent process complexity are the number of activities, the number of case variants, the maximum case length, and the average case length. From this table, we observe that the most complex log is the Production log from a manufacturing process, which further confirms that manufacturing processes are more complex than business processes. Among the business process logs, BPIC2012_W is the most complex, followed by Helpdesk, while BPIC2012_A and BPIC2012_O exhibit similar, lower levels of complexity. As shown in Tables 4 and 9, for the most complex process, that is, the Production log, MMoTE outperforms all other approaches. Similarly, among the event logs of business processes, Table 4 indicates that MMoTE achieves the best performance on BPIC2012_W, followed by Helpdesk, then BPIC2012_A, and finally BPIC2012_O. This performance ranking aligns with the complexity of these processes. Therefore, we conclude that the performance of MMoTE is highly sensitive to process complexity, with its performance advantage becoming more pronounced as process complexity increases.
Conclusion and future work
This study introduces MMoTE, a multi-gate mixture of transformer-based experts approach, to address predictive process monitoring in manufacturing. Manufacturing involves not only various business processes but also more complex manufacturing processes. For business processes, PPM requires multiple prediction tasks, such as the next activity, the next event time, and the remaining time. In contrast, for manufacturing processes, PPM focuses more on the remaining time prediction (i.e. the cycle time). Given the diversity and complexity of manufacturing processes, single-task prediction methods may not sufficiently capture all variations within the historical process executions. Thus, we developed MMoTE to create a more comprehensive predictive process monitoring method suitable for different types of processes in the manufacturing industry. MMoTE leverages the feature extraction capabilities of the transformer network and the dynamic, flexible parallel learning capabilities of the MMoE framework, together with a set of expert networks controlled by gating mechanisms. By effectively handling the complex and variable data characteristics in various processes of manufacturing, MMoTE provides accurate and reliable predictions. This enables manufacturing enterprises to better monitor and control operations, promptly identify potential delays or issues, and make necessary adjustments to enhance production efficiency. Evaluations on five datasets from different processes show the effectiveness, generalization, and efficiency of MMoTE in PPM of manufacturing.
In our current study, MMoTE incorporates modules with general scalability, notably the expert network, which can be enhanced by replacing the existing simple MLP with more advanced sequence modeling networks, such as those based on the transformer or LSTM, or hybrid models combining both. Additionally, both the shared bottom and tower networks offer opportunities for further optimization to improve scalability. Future research will explore different sequences and combinations of these modules to enhance model performance. We also plan to investigate performance improvements by incorporating heterogeneous tasks and examining the effectiveness of multi-modal information fusion between diverse tasks in PPM, inspired by Cheng et al. 45
Footnotes
Authors’ contribution
Jiaojiao Wang: conceptualized and drafted the work; Yao Sui conducted the experiments; Chang Liu analyzed the data; Xuewen Shen prepared the outline, and Zhongjin Li and Dingguo Yu supervised the manuscript. All authors were involved in the writing, reviewing, and editing of the manuscript.
Declaration of conflicting interests
The author(s) declared that they have no potential conflicts of interest with respect to the research work reported in this paper.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work is supported by the National Natural Science Foundation, China (nos. 62002316, 62206241, and 61802095), the Key Science and Technology Project of Zhejiang Province (no. 2021C03138), the National Natural Science Foundation of Zhejiang Province, China (no. LY22F020021), and the Medium and Long-term Science and Technology Plan for Radio, Television, and Online Audiovisuals, China (no. 2022AD0400).
