Multi-view deviation detection under privacy for medical business processes: Enhancing patient-centered care via digital transformation

Abstract

Deviation detection has emerged as a critical research focus for business processes, enabling enterprises to prevent fraud, monitor anomalies, and safeguard the security of processes and data, particularly in the medical field. Despite its importance, existing methods face significant limitations. Some approaches focus solely on control flow deviations while neglecting data-induced deviations, whereas others rely on specific data, risking the exposure of personal privacy information. Consequently, a major challenge lies in balancing data availability for deviation detection with the imperative of preserving data privacy and security. To address this challenge, this paper proposes a multi-view deviation detection method based on privacy protection. First, data attributes critical to business processes are extracted using a random field model. Next, an identity and purpose-based data matching algorithm ensures the security of user identities and validates the intended use of data for privacy protection. Furthermore, the business process activity view regulates legally permissible data operations, while decision logic analysis links processes and data through decision tables to detect deviations. Beyond detecting deviations within each perspective, this method uncovers hidden deviations arising from the interplay of business process, data flow, and privacy perspectives. The evaluation using real-world medical event data demonstrates the method's effectiveness. Notably, it outperforms existing approaches by accurately identifying deviations that other methods fail to detect.

Keywords

Data security, deviation detection, privacy protection, decision logic, activity view

Introduction

Process mining can extract valuable information from the event logs commonly produced by modern information systems, providing a new means for process discovery, monitoring and improvement in various application fields. However, because of imperfect information system design or system upgrades, inconsistencies¹ between control flow and data flow in the system may cause security risks. We call such inconsistencies deviations;² detecting them allows us to validate and extend business process models and to improve the business processes accordingly. Deviation detection is also a popular topic in enterprises because its application areas are diverse, including fraud detection,³ intrusion detection and anomalous e-commerce transaction detection.⁴ The literature⁵ provides a formal definition of anomalous activities in business processes based on Petri nets and proposes an efficient method for detecting deviations between process models and event logs, covering missing, additional and misplaced activities. The literature⁶ presents an efficient method for detecting deviations between a process model with a cyclic structure and an event log. These approaches detect deviations from a pure control flow perspective. In response, the literature⁷ points out that traditional business process modelling approaches focus mainly on the sequence of activities in the process, although the occurrence of these activities may be subject to data constraints,⁸ and therefore cannot accurately identify and explain deviations from other perspectives of the business process.

Existing anomaly detection methods primarily focus on control flow and point anomalies, and endeavor to minimize false positives when unexpected events occur. Over the past few years, there has been a growing trend in process design and mining to shift from a purely control-flow-based perspective to more integrated models that explicitly take data⁹ and decisions into account. Indeed, without understanding the context in which data is accessed, it is challenging to differentiate between legitimate and illegitimate actions. The literature¹⁰ points out that existing process encoding techniques focus on the control flow perspective, encoding only the sequence of activities in the logs and ignoring the other features of the process, so that valuable process behaviours hidden in the data are not analysed.¹¹ A dependency pattern is proposed to encode the log as a whole into a suitable format that captures the perspectives lost in one-dimensional analysis. Coloured Petri nets (CPNs) can be simulated with CPN Tools, which makes the fusion of data flow and control flow possible on the basis of reliable existing technology. The literature¹² proposes a method that searches for deviations in transactional systems by checking the consistency between coloured Petri nets and event logs; in addition to control flow deviations, it also detects object priority deviations and resource deviations. The literature¹³ integrates time-related factors and resource information to propose an anomaly detection method capable of addressing unexpected events during process model execution. It suggests that not all deviant events should be immediately flagged as anomalies; instead, detection is based on a specific probability of occurrence, thereby reducing the number of false positives. BINet,¹⁴ a neural network architecture for multi-view anomaly detection in business process event logs, comes with a set of heuristic algorithms¹⁵ that automatically set the threshold of the anomaly detection algorithm, and it has been shown that BINet can detect anomalies in event logs not only at the case level but also at the event attribute level.

In the context of the digital economy, data openness has emerged as an inevitable trend. Over recent years, a substantial number of methods have been devised to analyze event logs within the framework of process mining. However, the impact of privacy rules¹⁶ on the design and organizational application of process mining techniques has been largely ignored, leading to irresponsible use of personal data, such as the use of employee data in process mining to predict employee performance. The literature¹⁷ proposes an application of differential privacy to protect the privacy of event data and analyzes potential privacy leakage and prevention methods. Healthcare information systems contain highly sensitive information, and healthcare regulations often require the protection of data privacy. The need to comply with strict privacy requirements may decrease the utility of the data used for analysis. The literature¹⁸ analyzed the data privacy and practical needs of healthcare process data and evaluated the applicability of privacy-preserving data transformation methods for anonymizing healthcare data. The accuracy of deviation detection can be improved by adding data information to the business process deviation detection process. However, this data information often contains sensitive information about individuals. Data security and privacy protection¹⁹ issues are inevitable in the data openness process, and exploring how to balance maximum data utilization against privacy protection in deviation detection is of great importance but has so far received little study.

The problem is that organizations often lack appropriate mechanisms to monitor the use of data. Existing methods either use data views²⁰ to compare data access with security policies or compare activity deviations with the activities needed to run business processes. Analyzing user behavior from these perspectives alone can not only leave some deviations undetected or produce false positives, but can also reveal the data provider's secrets. The literature²¹ posited that a deviation detection method ought to concurrently take into account the control flow, data flow, and privacy perspectives of the business process to identify a broad spectrum of deviations, especially deviations related to the intended use of the data and the data use environment. Current access control mechanisms are inadequate for data protection: they merely serve as preventive measures and do not ensure that the data is employed for its intended purpose. How to detect deviations in business processes while protecting the security of user data is therefore a challenge for process mining. In this paper, on the premise of privacy protection, we detect deviations between logs and business processes from multiple perspectives by considering the control flow, data flow, and data decisions of business processes. This paper proposes multi-perspective deviation detection of business processes based on privacy protection from the following three aspects. See Figure 1 for the framework diagram of the paper. The main contributions of this paper are as follows:

The conditional random field model is applied to extract important data attributes from complex and diverse unstructured data information, which is of great significance to business process research.

In order to prevent the fraud risk caused by privacy disclosure, an identity-based purpose matching algorithm is proposed to match the intended use of data with the access purpose of data users to ensure the safe use of data.

Under the fusion of control flow and data flow, deviation detection combines the business process with the activity view to cover the activity, resource, data operation, privacy and other perspectives, thereby improving the accuracy of deviation detection.

Figure 1.

Multi-view deviation detection for business processes under privacy protection framework diagram.

The rest of this article is arranged as follows. Section 2 presents the motivating case, Section 3 presents the basic knowledge, including the extraction of important data features using the random field model, Section 4 presents the identity and purpose-based privacy access control model, Section 5 presents the deviation detection based on the fusion of control flow and data flow, Section 6 presents the experimental evaluation, and Section 7 concludes.

Motivations

Organizations frequently employ process models and security policies to delineate the regulated behavior of business systems and the legitimate utilization of data. In practical scenarios, organizations might permit users to deviate from the prescribed behavior in order to effectively manage unforeseen situations. Nevertheless, this flexibility is susceptible to abuse, thereby elevating the risk of detrimental data breaches. Moreover, insiders may exploit their privileges to access sensitive data for personal or financial gain. There is ample evidence in the literature²² that real process behaviour often deviates from the expected process, which often opens the way for fraudulent behaviour. In recent years, different methods for finding business process deviations have been proposed. However, deviation detection from a data or process perspective alone may not explain why a deviation occurred, making it difficult for security analysts to take the necessary steps to fix security violations. Consider the following executed trace, in which underscores indicate deviations.

L = <(re{Name: Lily, PatientID: Li453726, Nurse: Alice}), (tr{Nurse: Ella, Triage: newly diagnosed}), (sc{Specialist: Frank, Medical history: diabetes}), (at{pharmacist: Jame, check item: EcgCT}), (ct{lab expert: Marry, Medical history: diabetes}), (di{Specialist: Frank, Medical history: NO, Disease grading: II}), (tu{Attending physician: John, Critically ill prove: False})>.

Figure 2 shows the healthcare business process, which contains data attributes, resources (roles) and constraints. Figure 3 shows that if deviations are detected only from the control flow perspective, the comparison {re, tr, sc, at, (ct), bt, di, tu} reveals only that the log skips the activity basic lab test (bt) and inserts the activity clinical trial (ct). The following behaviour and data deviations are not detected.

Figure 2.

Medical business process.

Figure 3.

Comparison of log and model.

Deviation caused by lack of data

Deviation 1: According to constraint C2, a new patient must be consulted by a general practitioner (gc), whereas in practice the patient is consulted by a specialist (sc). This deviation may be caused by the medical staff's failure to update the patient records, leading to inappropriate treatment of the patient.

Deviation 2: The diagnosis activity (di) did not update the medical history. A patient with disease grade II should have been hospitalized (hs), but the activity transfer to the ICU (tu) was actually executed. In this case there is no violation from a control flow perspective; from a data perspective, however, the patient's history and degree of illness were not updated in a timely manner, and this data manipulation led to the wrong business process path being selected.

Deviation caused by lack of privacy protection

Deviation 3: The clinical trial activity ct {lab expert: Marry, Medical history: diabetes} is inserted into the business process: Marry, a healthcare professional, performs a clinical trial without patient consent in order to obtain information about the patient's medical history. Curious actors may use this privilege to access private patient information for personal or financial gain.

Basic Knowledge

Process design, engineering, and mining are moving from a purely control-flow perspective to more integrated models in which data and decisions are explicitly considered. The control flow perspective of a business process is important because it describes the main behaviour of the business process. However, other perspectives, such as data flow, must also be taken into account. This section provides the basics of control flow and data flow.

Control Flow Perspective

Definition 1

(Process Model²³) A process model is a five-tuple $(P, T, F, τ_{i}, τ_{f})$ satisfying the following conditions.

T is a finite, non-empty set of activities.

P is a finite set of places.

$F \subseteq (P \times T) \cup (T \times P)$ is a flow relation.

$τ_{i}$ is the initial marking and $τ_{f}$ is the final marking.

Definition 2

(Log move, Model move²⁴) Let $A_{L}$ be the set of activities recorded in the trace $σ_{L}$ of an event log, and let $A_{M}$ be the set of activities in the process runs of the process model from the initial marking to the final marking, where $A_{L}^{≪} = A_{L} \cup \{≪\}$ and $A_{M}^{≪} = A_{M} \cup \{≪\}$. An activity pair $(x, y) \in A_{L}^{≪} \times A_{M}^{≪}$ is an alignment matching pair between the log and the model. There are three types of alignment moves between logs and models.

If $x = ≪$ and $y \in A_{M}$, then $(x, y)$ is a model move.

If $x \in A_{L}$ and $y = ≪$, then $(x, y)$ is a log move.

If $x \in A_{L}$ and $y \in A_{M}$, then $(x, y)$ is a synchronous move.
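The three move types of Definition 2 can be illustrated with a small sketch. It is only an illustration under simplifying assumptions: it aligns a log trace with a single, already chosen model run by means of a longest-common-subsequence table, whereas alignment-based conformance checking normally searches over all runs of the Petri net with a cost function. The function names `align` and `classify` are hypothetical; the activity abbreviations are those of the motivating example.

```python
from typing import List, Tuple

SKIP = "≪"  # the "no move" symbol of Definition 2


def align(log_trace: List[str], model_run: List[str]) -> List[Tuple[str, str]]:
    """Align a log trace with ONE model run (simplifying assumption)."""
    n, m = len(log_trace), len(model_run)
    # lcs[i][j] = length of the longest common subsequence of
    # log_trace[i:] and model_run[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if log_trace[i] == model_run[j]:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    pairs, i, j = [], 0, 0
    while i < n or j < m:
        if i < n and j < m and log_trace[i] == model_run[j]:
            pairs.append((log_trace[i], model_run[j]))  # synchronous move
            i, j = i + 1, j + 1
        elif j == m or (i < n and lcs[i + 1][j] >= lcs[i][j + 1]):
            pairs.append((log_trace[i], SKIP))          # log move
            i += 1
        else:
            pairs.append((SKIP, model_run[j]))          # model move
            j += 1
    return pairs


def classify(pair: Tuple[str, str]) -> str:
    """Name the move type of one alignment pair (x, y)."""
    x, y = pair
    if x == SKIP:
        return "model move"
    if y == SKIP:
        return "log move"
    return "synchronous move"


log_trace = ["re", "tr", "sc", "at", "ct", "di", "tu"]
model_run = ["re", "tr", "sc", "at", "bt", "di", "tu"]
deviations = [(p, classify(p)) for p in align(log_trace, model_run)
              if classify(p) != "synchronous move"]
```

Under these assumptions, `deviations` contains a log move on ct and a model move on bt, which are the two control-flow deviations of the motivating trace.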

Figure 4 shows the Petri net of the medical business process. Figure 5(a) displays a process run of the medical business process model. This run is also called a partially ordered run of the Petri net. Figure 5(b) is an event log. Figure 5(c) presents the alignment between the process run and the log from the control flow perspective; it is evident that there are two deviations, a model move (bt) and a log move (ct).

Figure 4.

Healthcare business process Petri net.

Figure 5.

Process run and process trace.

Concentrating solely on the control-flow perspective only allows us to detect the insertion or omission of specific activities. Nevertheless, this approach is insufficient to identify threats such as data leaks resulting from data operations (e.g. doctors can access sensitive medical information of patients and conduct clinical trials). These insiders have a profound understanding of information systems and security controls and may exploit their privileges and this knowledge for malicious ends. In the following sections, we present some data information derived from business processes and employ the fusion of control flow and data flow to detect deviations from multiple perspectives.

Data perspective

Traditional business process modelling focuses only on the control flow perspective and constrains the activities in the process by the behaviour profile relationships in the business process (e.g. sequential, concurrent, mutually exclusive). However, activities are executed by different resources, manipulate data objects, and are constrained by the state of these objects. This makes it necessary to extend the control-flow model to incorporate data. Particularly in the medical field, doctors need to make a comprehensive judgment based on important data such as patient history and auxiliary test results so as to make an accurate diagnosis of the disease. The following definitions introduce data into business processes.

Definition 3

(Data-aware soundness²⁵) A Petri net with data that has initial marking $M_{I}$, final marking $M_{F}$, and set of reachable states $Reach_{N} = \{(M, α) \mid (M_{I}, α_{I}) \overset{*}{\to} (M, α)\}$ is said to be data-aware sound if the following conditions are satisfied.

$\forall (M, α) \in Reach_{N}$, $\exists α^{'}$. $(M, α) \overset{*}{\to} (M_{F}, α^{'})$.

$\forall (M, α) \in Reach_{N}$, $M \geq M_{F} \Rightarrow (M = M_{F})$.

$\forall t \in T$, $\exists M_{1}, M_{2}, α_{1}, α_{2}, β$. $(M_{1}, α_{1}) \in Reach_{N}$ and $(M_{1}, α_{1}) \overset{t, β}{\to} (M_{2}, α_{2})$.

Definition 4

(Conditional random field, CRF²⁶) CRF models can effectively solve sequence labeling and text segmentation problems and can use words, phrases, sentences, etc. as contextual features. Assume that X denotes the token sequence of a medical history text and Y the sequence of entity classes, where $x = \{x_{1}, x_{2}, x_{3}, \dots, x_{n}\}$ is the observation sequence, $y = \{y_{1}, y_{2}, y_{3}, \dots, y_{n}\}$ is the state sequence, and $P(y \mid x)$ denotes the conditional probability distribution of the output y given x. The CRF can be represented as a first-order linear chain random field model, as shown in Figure 6.

$p(y \mid x) = \frac{1}{Z(x)} \exp \left( \sum_{i, k} λ_{k} t_{k}(y_{i - 1}, y_{i}, x, i) + \sum_{i, l} u_{l} s_{l}(y_{i}, x, i) \right)$

(1)

$Z(x) = \sum_{y} \exp \left( \sum_{i, k} λ_{k} t_{k}(y_{i - 1}, y_{i}, x, i) + \sum_{i, l} u_{l} s_{l}(y_{i}, x, i) \right)$

(2)
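To make equations (1) and (2) concrete, the following sketch evaluates them by brute force for a toy label set. Everything specific in it is an assumption made for illustration: the two labels, the single transition feature and single state feature, and their weights are hypothetical and not taken from the trained model. Enumerating all label sequences is only feasible for very short inputs; it is used here because it mirrors the sum over y in equation (2) directly.

```python
import itertools
import math

LABELS = ["O", "DISEASE"]  # hypothetical entity classes


def transition_feature(prev_label, label, x, i):
    """t_k: fires when a DISEASE label continues from the previous token."""
    return 1.0 if prev_label == label == "DISEASE" else 0.0


def state_feature(label, x, i):
    """s_l: fires when the keyword token is labeled DISEASE."""
    return 1.0 if x[i] == "diabetes" and label == "DISEASE" else 0.0


LAMBDA = 0.5  # weight of the transition feature (illustrative value)
MU = 2.0      # weight of the state feature (illustrative value)


def score(y, x):
    """Exponent of equation (1): weighted feature sum over all positions."""
    total = 0.0
    for i in range(len(x)):
        if i > 0:  # no transition into the first token in this sketch
            total += LAMBDA * transition_feature(y[i - 1], y[i], x, i)
        total += MU * state_feature(y[i], x, i)
    return total


def partition(x):
    """Z(x) of equation (2): sum over every possible label sequence."""
    return sum(math.exp(score(y, x))
               for y in itertools.product(LABELS, repeat=len(x)))


def probability(y, x):
    """p(y | x) of equation (1)."""
    return math.exp(score(y, x)) / partition(x)


tokens = ["history", "of", "diabetes"]
total_mass = sum(probability(y, tokens)
                 for y in itertools.product(LABELS, repeat=len(tokens)))
```

Because $Z(x)$ sums the same exponentials that appear in the numerator, `total_mass` is 1 up to floating-point error, which is the normalization role described for $Z(x)$.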

Figure 6.

CRF linear chain structure.

In equations (1) and (2), $t_{k}$ and $s_{l}$ are the feature functions (transition and state features, respectively) and $λ_{k}$ and $u_{l}$ are the corresponding weights. $Z(x)$ is the normalization factor, which ensures that the conditional probabilities sum to one. The features and their associated weights determine the final linear chain random field. Named entity recognition with a CRF model can be viewed as a sequence labeling problem. Each sentence to be identified is treated as an observation sequence, each word in the sentence as a symbol, and each symbol is assigned a category label. The simplest CRF model is a chain structure, as shown in Figure 7.

Figure 7.

Entity recognition of important medical attributes.

Algorithm 3-1: Conditional Random Field Entity Recognition Algorithm

Input: Sequence of medical text tokens: $X = \{x_{1}, x_{2}, x_{3}, \dots, x_{n}\}$

Feature Functions: Set of feature functions $f_{k}$ , weights $λ_{k}$

Entity Classes: Set of entity labels $Y$

Output: Predicted sequence of entity labels: $Y_{pred} = \{y_{1}, y_{2}, y_{3}, \dots, y_{n}\}$

1 Tokenize input medical text X into word sequence

2 for each token $x_{i} \in X$ do

3 Extract features including

4 Current token ($x_{i}$)

5 Part-of-speech tag

6 Contextual features (previous/next tokens)

7 Domain-specific features (medical abbreviations, keywords)

8 end

9 Define conditional probability using first-order linear chain CRF

$p(y \mid x) = \frac{1}{Z(x)} \exp \left( \sum_{i, k} λ_{k} t_{k}(y_{i - 1}, y_{i}, x, i) + \sum_{i, l} u_{l} s_{l}(y_{i}, x, i) \right)$

10 $Z(x) = \sum_{y} \exp \left( \sum_{i, k} λ_{k} t_{k}(y_{i - 1}, y_{i}, x, i) + \sum_{i, l} u_{l} s_{l}(y_{i}, x, i) \right)$ // $Z(x)$ is the normalization factor

11 Find the most probable label sequence using dynamic programming:

12 $Y_{pred} = \arg\max_{y} P(y \mid x)$

13 Compute optimal path through the label state space

14 Handle transition probabilities between consecutive labels

15 Postprocessing: Merge consecutive tokens with the same entity label into spans

16 Validate entity boundaries based on medical domain constraints

17 Resolve overlapping or conflicting entity annotations

18 Apply domain-specific rules for entity normalization

return predicted label sequence $Y_{pred}$

Algorithm 3-1, the conditional random field entity recognition algorithm, works as follows. Lines 1–8 describe the preprocessing and feature extraction phase. The input medical text sequence $X = \{x_{1}, x_{2}, \dots, x_{n}\}$ is tokenized into individual words, and for each token $x_{i}$ the relevant features are extracted, including the current token, its part-of-speech tag, contextual features, and domain-specific medical features. Lines 9–10 define the CRF model by specifying the conditional probability $P(y \mid x)$ using a first-order linear chain CRF. The probability is computed from the feature functions $f_{k}$ and their corresponding weights $λ_{k}$, with $Z(x)$ serving as the normalization factor that ensures a valid probability distribution. Lines 11–13 perform decoding via dynamic programming (Viterbi algorithm) to find the most probable label sequence $Y_{pred}$ by maximizing $P(y \mid x)$. In lines 14–18, transition probabilities between consecutive labels are handled, consecutive tokens with identical entity labels are merged into spans, entity boundaries are validated against medical domain constraints, overlapping or conflicting annotations are resolved, and domain-specific rules are applied for entity normalization to ensure accurate and consistent outputs. Finally, the algorithm returns the predicted entity label sequence $Y_{pred}$. This structured approach supports robust medical entity recognition by combining contextual features, CRF modeling, and domain-specific refinements.
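The decoding and span-merging steps of Algorithm 3-1 can be sketched as follows. This is a minimal illustration rather than the implementation used in the paper: the emission and transition scores stand in for the weighted feature sums of a trained CRF and are invented for the example, and the label names and the helper names `viterbi` and `merge_spans` are hypothetical.

```python
from typing import Dict, List, Tuple


def viterbi(emission: List[Dict[str, float]],
            transition: Dict[Tuple[str, str], float],
            labels: List[str]) -> List[str]:
    """Return the highest-scoring label sequence for additive (log-space) scores."""
    # best[i][y] = best score of any labeling of tokens 0..i that ends in y
    best = [{y: emission[0][y] for y in labels}]
    back: List[Dict[str, str]] = []
    for i in range(1, len(emission)):
        best.append({})
        back.append({})
        for y in labels:
            prev = max(labels,
                       key=lambda p: best[i - 1][p] + transition.get((p, y), 0.0))
            best[i][y] = (best[i - 1][prev]
                          + transition.get((prev, y), 0.0)
                          + emission[i][y])
            back[i - 1][y] = prev
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def merge_spans(tokens: List[str], labels: List[str],
                outside: str = "O") -> List[Tuple[str, str]]:
    """Merge consecutive tokens that share an entity label into (label, text) spans."""
    spans: List[Tuple[str, str]] = []
    previous = outside
    for token, label in zip(tokens, labels):
        if label != outside:
            if label == previous:
                spans[-1] = (label, spans[-1][1] + " " + token)
            else:
                spans.append((label, token))
        previous = label
    return spans


# Illustrative scores only; a trained CRF would supply these values.
tokens = ["severe", "abdominal", "pain", "no", "edema"]
emission = [
    {"O": 0.0, "SYMPTOM": 0.2},
    {"O": 0.0, "SYMPTOM": 1.0},
    {"O": 0.0, "SYMPTOM": 1.5},
    {"O": 2.0, "SYMPTOM": 0.0},
    {"O": 0.0, "SYMPTOM": 1.2},
]
transition = {("SYMPTOM", "SYMPTOM"): 0.5}

predicted = viterbi(emission, transition, ["O", "SYMPTOM"])
spans = merge_spans(tokens, predicted)
```

With these invented scores the decoder labels every token except "no" as SYMPTOM, so the merge step yields the spans "severe abdominal pain" and "edema". Boundary validation, conflict resolution and normalization (lines 16–18) are domain specific and are not covered by the sketch.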

The combination of CRF and rules (Table 1) was used to identify the entities with important medical attributes. Figure 7 illustrates the important attributes identified for the medical domain entity Patient Information. These medical attribute features are extremely valuable for treating patients. Yet the leakage of this medical information can also pose a threat to patients. Consequently, we propose below an identity and purpose-based privacy access control model.

Table 1.

Composite entity construction rules.

Rule expressions	Examples
Negation + clinical manifestation	No edema
Anatomy + negation + clinical manifestation	The conjunctiva no pallor
Negation + anatomy + clinical manifestation	No enlarged tonsils
Modifier + disease	Severe abdominal pain
Anatomy + modifier	Swollen throat
Body + data + quantifier	Blood pressure 110/80 mmHg
Physical examination + data	T 36.8
Data + quantifier	100 times/minute
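The way the rules in Table 1 combine basic entities into composite entities can be sketched as a simple pattern match over label sequences. The label names (`NEG`, `ANATOMY`, `SYMPTOM` and so on) and the function `build_composites` are hypothetical, and the greedy longest-rule-first strategy is an assumption; the paper does not state how rule conflicts are resolved.

```python
from typing import List, Tuple

# One tuple of basic entity labels per row of Table 1 (label names are assumed).
RULES = [
    ("NEG", "SYMPTOM"),
    ("ANATOMY", "NEG", "SYMPTOM"),
    ("NEG", "ANATOMY", "SYMPTOM"),
    ("MODIFIER", "DISEASE"),
    ("ANATOMY", "MODIFIER"),
    ("BODY", "DATA", "QUANTIFIER"),
    ("EXAM", "DATA"),
    ("DATA", "QUANTIFIER"),
]


def build_composites(entities: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Greedily merge runs of (label, text) entities that match a rule.

    Longer rules are tried first so that a three-part rule is not
    pre-empted by a two-part rule that matches its prefix.
    """
    rules = sorted(RULES, key=len, reverse=True)
    result, i = [], 0
    while i < len(entities):
        for rule in rules:
            window = entities[i:i + len(rule)]
            if tuple(label for label, _ in window) == rule:
                merged_text = " ".join(text for _, text in window)
                result.append(("+".join(rule), merged_text))
                i += len(rule)
                break
        else:
            # No rule starts here: keep the basic entity unchanged.
            result.append(entities[i])
            i += 1
    return result


recognized = [
    ("NEG", "no"), ("SYMPTOM", "edema"),
    ("MODIFIER", "severe"), ("DISEASE", "abdominal pain"),
    ("EXAM", "T"), ("DATA", "36.8"),
]
composites = build_composites(recognized)
```

Here the basic entities would come from the CRF step, and each composite keeps the rule that produced it as its label so that later steps can tell, for example, a negated finding from a measurement.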

Identity and purpose based privacy access control model

The advent of the big data era has brought great convenience to people's daily lives, exemplified by the sharing of medical data, which has enhanced the level of medical development and treatment efficiency. The literature²⁷ underscores the significance of establishing a patient-centric privacy-preserving access control model. This model ensures that patients can devise access control policies in accordance with their privacy preferences, without obstructing data visitors, such as healthcare professionals, from accessing patients’ medical information. In this paper, we propose the purpose-based access control (PBAC) model, the core of which is to match the intended purpose (IP) that the data owner attaches to the private data with the access purpose (AP) stated by the data user, before the data user is entitled to access the data. A conditional purpose-based access control (CPBAC) model is proposed to simultaneously ensure high data quality and privacy, together with an algorithm that computes the compliance between the access purpose and the intended purpose.

Definition 5.

(Ancestors and descendants¹⁹) Let $PT$ be a purpose tree and $P$ be the set of purposes in $PT$. Let $p \subseteq P$ be a set of purposes in the purpose tree $PT$.

$p^{↑}$ denotes the ancestor set of $p$, and $p^{↓}$ denotes its descendant set.

$p^{↕} = p^{↑} \cup p^{↓}$ denotes the set of all nodes that are ancestors or descendants.

An intended purpose $IP$ is a tuple $⟨AIP, PIP⟩$, where $AIP$ is the set of allowed intended purposes and $PIP$ is the set of prohibited intended purposes.

$IP^{*} = AIP^{↓} - PIP^{↕}$ denotes the set of purposes for which the data is actually accessible.

$IP^{\times} = PIP^{↕}$ denotes the set of purposes for which the data is inaccessible.

$IP^{+} = P - IP^{*} - IP^{\times}$ denotes the set of conditional purposes, for which data visitors have conditional access to the data.
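The three purpose sets of Definition 5 can be computed directly from a purpose tree. The sketch below uses the hierarchy listed in Table 2 and makes two assumptions that the definition leaves open: a purpose counts as its own ancestor and descendant, and the example policy (allow Medical Attendance, prohibit Clinical Trials) is invented for illustration. The function names are hypothetical.

```python
# child -> parent, following the id/parent columns of Table 2
PARENT = {
    "General Purpose": None,
    "Medical Attendance": "General Purpose",
    "Self-Access": "General Purpose",
    "Scientific Research": "General Purpose",
    "Clinical Attendance": "Medical Attendance",
    "Adjunctive therapy": "Medical Attendance",
    "Patient Access": "Self-Access",
    "Relatives Access": "Self-Access",
    "Health Screening": "Scientific Research",
    "Clinical Trials": "Scientific Research",
    "Internal Medicine": "Clinical Attendance",
    "Surgical Treatment": "Clinical Attendance",
}


def ancestors(purposes):
    """Ancestor set: each purpose plus everything on its path to the root."""
    out = set()
    for p in purposes:
        while p is not None:
            out.add(p)
            p = PARENT[p]
    return out


def descendants(purposes):
    """Descendant set: each purpose plus everything below it."""
    out = set(purposes)
    changed = True
    while changed:
        changed = False
        for node, parent in PARENT.items():
            if parent in out and node not in out:
                out.add(node)
                changed = True
    return out


def purpose_sets(aip, pip):
    """Return (IP*, IP+, IPx) for an intended purpose <AIP, PIP>."""
    prohibited = ancestors(pip) | descendants(pip)      # IPx = PIP ancestors and descendants
    allowed = descendants(aip) - prohibited             # IP* = AIP descendants minus IPx
    conditional = set(PARENT) - allowed - prohibited    # IP+ = P - IP* - IPx
    return allowed, conditional, prohibited


allowed, conditional, prohibited = purpose_sets(
    aip={"Medical Attendance"}, pip={"Clinical Trials"})
```

For this hypothetical policy the clinical trial of Deviation 3 falls into the prohibited set, while purposes such as Health Screening end up in the conditional set. By construction the three sets are disjoint and together cover the whole tree.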

Figure 8 establishes a set of intended purposes for data usage that are permitted for the medical information provided by patients.

Figure 8.

Entity recognition of important medical attributes.

In the current context where the digital business environment is becoming increasingly complex and business data contains a large amount of sensitive information, this paper focuses on studying business processes using a multi-perspective detection strategy within the framework of privacy protection to enhance the accuracy of deviation detection. Business processes serve as the core framework for business operations and the mainstay of detection, while business data acts as a crucial information carrier reflecting the status of processes. Only through the deep integration of the two can more precise deviation detection be achieved. To this end, this paper first employs a random field model to deeply mine business data, accurately extracting data attributes of significant value for deviation detection, covering key characteristics of business activities, data associations, and so on. After introducing these attributes, the accuracy of deviation detection is significantly improved. However, the addition of data attributes increases the risk of data leakage. To address this challenge, this paper innovatively proposes an identity- and purpose-based data matching algorithm. It ensures that only authorized legitimate users can access data through strict identity authentication and guarantees compliance by rigorously reviewing and monitoring the purpose of data use, thereby effectively protecting the privacy of business data. Finally, by using decision tables, this paper closely associates business processes with the processed data and conducts a comprehensive decision logic analysis across multiple dimensions such as business activities, data, and resources. 
Taking various factors and conditions into comprehensive consideration, it employs logical reasoning and data analysis methods to accurately detect deviations in activities, data, and resources, providing a solid decision-making basis for enterprises to optimize business processes, enhance operational efficiency, and reduce risks.

Establishment of the data patterns

Depending on which purpose set $(IP^{*}, IP^{\times}, IP^{+})$ the visitor's access purpose belongs to, data visitors can access different data patterns, which balance data availability and privacy protection and improve data security in the era of Big Data. Figure 9 shows the data patterns that a data visitor can access when the access purpose belongs to each of these purpose sets.

Figure 9.

Data patterns of medical information obtained for different access purposes. (a) Pattern 1: Access to complete information $IP^{*}$. (b) Pattern 2: Conditional access information $IP^{+}$. (c) Pattern 3: Access to overgeneralized information $IP^{\times}$.

Constructing the purpose tree table

Based on Figure 8, the medical purpose tree is traversed in breadth-first order, and the purpose code, allowed intended purpose code (aip-code) and prohibited intended purpose code (pip-code) of each node are calculated to construct the medical purpose table shown in Table 2. The collected medical information of the patient is shown in Figure 9. Pattern 1 in Figure 9(a) is the complete medical information of the patient, Pattern 2 in Figure 9(b) is partially generalized medical information, and Pattern 3 in Figure 9(c) is the medical information of the patient after the generalization process.

Table 2.

Table of medical purposes.

id	Name	parent	code	aip-code	pip-code
1	General Purpose	0	0x200	0x483	0x483
2	Medical Attendance	1	0x100	0x133	0x333
3	Self-Access	1	0x080	0x080	0x280
4	Scientific Research	1	0x040	0x052	0x252
5	Clinical Attendance	2	0x010	0x013	0x313
6	Adjunctive therapy	2	0x020	0x020	0x320
7	Patient Access	3	0x006	0x006	0x286
8	Relatives Access	3	0x012	0x012	0x292
9	Health Screening	4	0x008	0x008	0x248
10	Clinical Trials	4	0x004	0x004	0x244
11	Internal Medicine	5	0x002	0x002	0x310
12	Surgical Treatment	5	0x001	0x001	0x311

The above analysis leads to three public keys. The first byte of the public key holds the patient's 7-bit PID, followed by the two-bit data pattern identifier (CondBit), where 00 indicates data pattern 1, 01 indicates data pattern 2 and 11 indicates data pattern 3, and by the equal-length $PIP\_code$ and $AIP\_code$. Similarly, the data visitor applies to the system for a private key. The process is similar to the purpose matching stage: the access purpose AP is determined from the visitor role RIP, and the private key to return is determined by the result of matching AP against IP, where CondBit takes one of the values 00, 01 and 11.

As one of the most pivotal resources in the contemporary world, data holds inestimable value for the advancement of modern society. Data security and privacy protection issues are inevitable during the process of data openness, and it is of great significance to deliberate on how to strike a balance between data openness and privacy protection. This section proposes a role and access purpose matching algorithm that can both securely access data and effectively prevent the occurrence of malicious access data leakage. The role ID is selected to verify the identity of visitors to determine whether they have access rights, and then the data is encrypted as the identity key according to the conditions of the access purpose and the predetermined purpose, which can effectively ensure that the system users can access the patient information only after the authentication and the access purpose matches the predetermined purpose. For the specific algorithm content, see Algorithm 4-1.

Access purpose matching algorithm

In the purpose matching algorithm, lines 1–2 set the allowed intended purpose set and the prohibited intended purpose set according to the data provider's wishes. Lines 3–5 take the data visitor's ID number and access role information RIP; if the RIP meets the role attributes allowed by the data provider, access purpose compliance verification is performed, otherwise the user does not have access to the data. Lines 6–10 calculate the binary and hexadecimal codes of the allowed, prohibited, and conditional purpose sets based on the purpose tree table. The purpose sets are pairwise disjoint and together cover the whole purpose tree. Lines 11–14 check whether the access purpose matches the allowed purpose set and, if so, output the key and the complete data of data pattern 1. Lines 15–18 determine whether the access purpose matches the set of conditional purposes; if it does, the corresponding key and data pattern 2 are returned, otherwise lines 19–21 return the corresponding key and data pattern 3.
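The control flow of the purpose matching algorithm can be sketched with bit masks. The sketch rests on several assumptions: each purpose receives one bit in breadth-first order (the resulting codes are not meant to reproduce the values in Table 2), the role names and the example policy are invented, and key generation is left out because its construction is not specified here, so only the CondBit and the data pattern number are returned. All function names are hypothetical.

```python
from collections import deque

# child -> parent, following the id/parent columns of Table 2
PARENT = {
    "General Purpose": None,
    "Medical Attendance": "General Purpose",
    "Self-Access": "General Purpose",
    "Scientific Research": "General Purpose",
    "Clinical Attendance": "Medical Attendance",
    "Adjunctive therapy": "Medical Attendance",
    "Patient Access": "Self-Access",
    "Relatives Access": "Self-Access",
    "Health Screening": "Scientific Research",
    "Clinical Trials": "Scientific Research",
    "Internal Medicine": "Clinical Attendance",
    "Surgical Treatment": "Clinical Attendance",
}


def assign_codes(parent):
    """Give every purpose one bit, in breadth-first order (root = highest bit)."""
    children = {}
    for node, par in parent.items():
        children.setdefault(par, []).append(node)
    order, queue = [], deque(children[None])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children.get(node, []))
    return {node: 1 << (len(order) - 1 - k) for k, node in enumerate(order)}


CODE = assign_codes(PARENT)


def down_mask(node):
    """Bits of the node and all of its descendants."""
    mask = CODE[node]
    for child, par in PARENT.items():
        if par == node:
            mask |= down_mask(child)
    return mask


def up_mask(node):
    """Bits of the node and all of its ancestors."""
    mask = 0
    while node is not None:
        mask |= CODE[node]
        node = PARENT[node]
    return mask


def match_purpose(aip, pip, ap, rip, allowed_roles):
    """Return (CondBit, data pattern) or None if the role check fails."""
    if rip not in allowed_roles:          # identity / role verification
        return None
    prohibited = 0
    for p in pip:
        prohibited |= up_mask(p) | down_mask(p)
    allowed = 0
    for p in aip:
        allowed |= down_mask(p)
    allowed &= ~prohibited
    everything = (1 << len(CODE)) - 1
    conditional = everything & ~allowed & ~prohibited
    ap_code = CODE[ap]
    if ap_code & allowed:
        return "00", 1                    # complete information
    if ap_code & conditional:
        return "01", 2                    # partially generalized information
    return "11", 3                        # overgeneralized information


policy = dict(aip={"Medical Attendance"}, pip={"Clinical Trials"},
              allowed_roles={"Specialist", "Nurse", "Lab expert"})
result_treatment = match_purpose(ap="Internal Medicine", rip="Specialist", **policy)
result_trial = match_purpose(ap="Clinical Trials", rip="Lab expert", **policy)
```

Encoding purposes as bits turns each membership test into a single AND, which is presumably the motivation for storing codes in the purpose table. Under the hypothetical policy, the clinical trial request of Deviation 3 is mapped to data pattern 3.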

This section proposes that data patterns of different granularity can be obtained according to the different access purposes of different roles, which protects the privacy and security of patients to a certain extent and maximizes the utilization value of data. However, it remains open what data visitors can do with the data once they obtain it, and how the input data affects business process decisions. Next, we introduce business process deviation detection after the fusion of data flow and control flow.

Deviation detection under data flow control flow fusion

In the realm of Business Process Management (BPM), the fusion of business processes and related data represents a pivotal concern, given that the execution of business processes is frequently subject to data constraints.²⁸ Particularly in data decision-intensive scenarios, it is of great utility to establish connections between activities and data operations, thereby enabling the detection of anomalies in business processes from a variety of perspectives.

In this section, we expand the concept of activity view and decision table (DMN) to narrow the gap between process and data conceptual design, thereby linking process activities and related data. When business rules or constraints are required to be incorporated into the process for decision-making purposes, we introduce a decision table. Subsequently, we feed relevant data into the decision table, apply decision logic, and then output decision results back to the decision-making activities within the business process. By doing so, the business process is elevated to this richer data-aware level.

Control flow data flow fusion

Definition 7.

(Decision table²⁹) A decision table (DMN) is a tuple $\langle \mathit{Name}, I, O, X, \mathit{Infacet}, \mathit{ORange}, \mathit{ODef} \rangle$, where

$\mathit{Name}$ is the name of the decision table;

$I$ and $O$ are disjoint finite sets of input and output attributes, respectively;

$X$ is an external parameter set;

$\mathit{Infacet}$ is an input range function, which associates each input attribute $a \in I$, together with the external parameters $x \in X$, with an S-FEEL condition that specifies the allowed input values of $a$;

$\mathit{ORange}$ is an output range function that associates each output attribute $b \in O$ with a tuple of possible output values;

$\mathit{ODef}$ is a default assignment (partial) function that maps some output attributes to the corresponding default values.

Algorithm 4-1: Purpose Matching Algorithm

Input: Intended Purpose $IP = \langle AIP, PIP \rangle$, Access Purpose $ap$, Identity number $PID$, $RIP$

Output: $\mathit{Result}$, $\mathit{Pub}$, $\mathit{Pri}$, $\mathit{Data}$

1 $AIP = \{aip_{1}, aip_{2}, aip_{3}, \dots, aip_{m}\}$, $N = (P, T, F, τ_{i}, τ_{f})$, $PIP = \{pip_{1}, pip_{2}, pip_{3}, \dots, pip_{n}\}$

2 for $P_{i} \in \mathit{Visitor}$ do

3 $\mathit{IDNumber}(P_{i})$, $RIP(P_{i})$ //The visitor's ID number and access role

4 if $RIP(P_{i}) \in \mathit{Attribute}(a_{1}, a_{2}, \dots, a_{n})$ then

5 $AIP\_code = aip\_code_{1} + aip\_code_{2} + \dots + aip\_code_{m}$

6 $PIP\_code = pip\_code_{1} + pip\_code_{2} + \dots + pip\_code_{n}$

7 $IP^{*}\_code = (AIP^{↓} - PIP^{↕})\_code$ //The set of actually accessible data purposes when permitted purposes conflict with prohibited purposes

8 $IP^{x}\_code = (PIP^{↕})\_code$ //The set of data purposes that cannot be accessed

9 $IP^{+}\_code = Σ (P - IP^{*} - IP^{x})\_code$ //The set of purposes under which the data visitor accesses the data conditionally

10 if $ap\_code \,\&\, IP^{*}\_code \neq 0$ then

11 $\mathit{Result} = \mathit{Permit}$,

12 $\mathit{Pub} = PID + \mathit{Condbit}(00) + AIP\_code + PIP\_code$

$\mathit{Pri} = RID + \mathit{Condbit}(00) + AIP\_code + PIP\_code$

13 $\mathit{Data} = \mathit{datapattern}1$

14 else if $ap\_code \,\&\, IP^{+}\_code \neq 0$ then

15 $\mathit{Result} = \mathit{CondPermit}$,

16 $\mathit{Pub} = PID + \mathit{Condbit}(01) + AIP\_code + PIP\_code$

$\mathit{Pri} = RID + \mathit{Condbit}(00) + AIP\_code + PIP\_code$

17 $\mathit{Data} = \mathit{datapattern}2$

18 else if $ap\_code \,\&\, IP^{x}\_code \neq 0$ then $\mathit{Result} = \mathit{Deny}$

19 $\mathit{Pub} = PID + \mathit{Condbit}(11) + AIP\_code + PIP\_code$

$\mathit{Pri} = RID + \mathit{Condbit}(00) + AIP\_code + PIP\_code$

20 $\mathit{Data} = \mathit{datapattern}3$

21 end

22 end

23 else if $RIP(P_{i}) \notin \mathit{Attribute}(a_{1}, a_{2}, \dots, a_{n})$ then the visitor $P_{i}$ has no access permission;

end

24 end

25 return $\mathit{Result}, \mathit{Pub}, \mathit{Pri}, \mathit{Data}$

Data is essential for the execution of business activities. In particular, the execution of activities may require certain operations to be performed on data objects. Figure 10 presents a visual depiction of activities and data within a healthcare business process through an activity view. The dashed arrow establishes a connection between the activity and the relevant portion of the data schema delineated in the activity view. The connecting arrows are annotated with information regarding the accessed resource, the access type, and the quantity of objects involved in the operation. The activity view links the processing logic to the data layer through the actions performed on a given data object and specifies the actions that data visitors can perform.

Figure 10.

Flow chart of purpose matching.

The interplay between process and decision models assumes a pivotal role in BPM, given that decisions are often grounded in ongoing processes and can significantly influence process outcomes. In this section, we integrate the data layer, decision layer, and process layer to facilitate sound decision-making for business processes and steer business processes towards achieving optimal process management.

Figure 11 exhibits the business flow diagram for medical data decision-making subsequent to the amalgamation of the data flow and control flow. The data layer acquires the legitimate data schema according to the visitor's identity. The decision layer comprises a decision table, and the data layer inputs relevant data into the blue-colored region of the decision table. Subsequently, data rule constraint operations are carried out to yield the corresponding decision. The orange hue signifies that the decision output is transmitted to the process layer to direct the selection of business process execution paths.

Figure 11.

Data decision business process diagram.

Decision Table (1): The body mass index (BMI) decision table utilizes the patient's fundamental information, namely height and weight, as input data. It then computes the body fat percentage and calculates the BMI value, which, in this case, is 32.91, classifying the patient as obese. Given that obesity is a risk factor, the patient is at a higher likelihood of developing cardiovascular diseases, type II diabetes, and other obesity-related conditions. When considering a patient who also has diabetes and hypertension, the derived results are transmitted to the business process's decision activity triage (tr). This activity can recommend that the patient opt for a specialist consultation, thereby enhancing the efficiency of the treatment process.
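A first-hit decision table of this kind can be sketched in Python as shown below. The sketch rests on assumptions that are not taken from the paper: the BMI cut-offs are the WHO adult categories (the paper's own thresholds are not stated), and the triage rule with its outputs "specialist consultation" and "general consultation" is a hypothetical stand-in for the decision activity triage (tr).

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


# Each rule is (condition, output); rules are checked in order (first hit wins).
# Assumption: WHO adult cut-offs.
BMI_RULES = [
    (lambda b: b < 18.5, "underweight"),
    (lambda b: b < 25.0, "normal"),
    (lambda b: b < 30.0, "overweight"),
    (lambda b: True, "obese"),
]


def evaluate(rules, value, default=None):
    """Return the output of the first rule whose condition holds, else the default."""
    for condition, output in rules:
        if condition(value):
            return output
    return default


def triage(bmi_class, comorbidities):
    """Hypothetical triage rule fed by the decision table output."""
    if bmi_class == "obese" and {"diabetes", "hypertension"} <= set(comorbidities):
        return "specialist consultation"
    return "general consultation"


bmi_class = evaluate(BMI_RULES, 32.91)
print(bmi_class, "->", triage(bmi_class, ["diabetes", "hypertension"]))
```

Keeping the rules as data rather than nested conditionals mirrors the separation of decision logic from control flow that the decision table is meant to provide.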

Decision Table (2): This decision table is responsible for determining the disease grade. It takes the data from the data layer, specifically treatment-related information such as retinopathy grade 3 and renal function classified as Chronic Kidney Disease (CKD) stage 3, as input. Based on the constraint rule of disease grading (grade II), it facilitates inpatient treatment (hs). Moreover, the outcomes of this decision table serve as a guide for the selection activities that follow the decision activity (di).

Based on the aforementioned analysis, data, decisions, and business processes are intimately connected. Organizations are required to derive valuable information from the data gathered by business processes and data sources, which can then be utilized for decision-making and process enhancement. Business process models typically encode decision logic of varying complexity through conditional expressions attached to outgoing flows or conditional events of decision activities. By separating this decision logic from the control flow logic and capturing it at a higher level of abstraction, it becomes more straightforward for a business process to select a scientific path.

Multi-view deviation detection

In this section, we put forward a multi-view deviation detection method. Its primary objective is to identify hidden deviations among the three dimensions of behavior, data, and resources following the logical decision-making process. The activity view of the log in Figure 12 reflects the actual process trace and the real data operations of the performer. Notably, there are some deviations between the activity view of log execution in Figure 12 and the standard medical business activity view in Figure 10. To precisely detect these discrepancies and offer contextual information for diagnosis, we introduce the concept of the composite move. It is obtained by integrating system traces, event logs, data attributes, and activity views as dictated by the decision logic. In doing so, we connect the process and data views to detect deviations from multiple perspectives.

Figure 12.

Log activity view.

Definition 8

(Combined movement²³) $A_{φ}$ is the set of running events in process trace $φ$, $E_{σ}$ is the set of trace events recorded in event log trace $σ$, and $S_{β}$ is the set of system events of the given system trace $β$. $AV$ specifies the resource execution $R_{i}$ and data execution $s$ in the business process activity view, and $av$ is the resource execution $r_{i}$ and data execution $q$ recorded in the event log. $γ = (M_{γ}, <_{γ})$ is the alignment of control flow defined on $E_{σ}$ and $A_{φ}$. The combined move is a tuple $((R_{i}, r_{i}), (s, q), (e, a)) \in (S_{β}^{≫} \times av^{≫}) \times M_{γ}^{(≫, ≫)}$.

If $(e, a) = (≫, ≫)$, then $(s, q) = (x, ≫)$, where $x \in S_{β}$.

If $q \neq ≫$, then either $a \neq ≫$ and $π_{act}(q) = π_{act}(a)$, or $e \neq ≫$ and $π_{act}(q) = π_{act}(e)$.

Given a composite move $((r_{i}, R_{i}), (s, q), (e, a))$, the system operation events and the data operations of the activity view $av$ specify which legitimate resources may perform which data operations on a particular activity. Moreover, the purpose of these data operations can be used to evaluate whether the data execution complies with the business process data constraint specifications. The deviation detection between resource $(r_{i}, R_{i})$ and data $(s, q)$ offers ample data context information for decision-making. This, in turn, directs the business process to select the appropriate execution path. Based on this definition, composite moves fall into ten types.

Figure 13 presents a graphical representation of the composite move classification. Considering the degree of influence of the composite moves on deviation, we group them into the following six categories.

Synchronized move: the roles, data, and processes in the business process operations and event logs are fully compliant with the prescribed behavioral specifications and data operations. It is denoted as $((r_{i}, R_{i}), (s_{i}, q_{i}), (e_{i}, a_{i}))$ and depicted in Figure 13(1).

Synchronous move of the wrong role: the expected activities and data operations are executed by an unauthorized role. It is denoted as $((\mathit{Wrong}(r_{i}), R_{i}), (s_{i}, q_{i}), (e_{i}, a_{i}))$ and depicted in Figure 13(2). Since this paper adopts a data-acquisition mode centered on identity and access purpose, the system can still carry out the reasonable data operations and activity executions it requires even when the executor is wrong. Furthermore, this situation exerts no adverse impact on the business process.

Partial synchronization move for the correct role: a data operation that ought to be carried out within the system is absent from the data log, while the anticipated activity is executed by a duly authorized role. It is denoted as $((r_{i}, R_{i}), (≪, q_{i}), (e_{i}, a_{i}))$ and depicted in Figure 13(3). The missing operations, which could involve data updates or security checks, imply that the data may not be reliable.

Partial synchronization move for the wrong role: a data operation that should be performed in the system is missing from the data log, and the expected activity is performed by a role that is not allowed. It is denoted as $((\mathit{Wrong}(r_{i}), R_{i}), (≪, q_{i}), (e_{i}, a_{i}))$ and depicted in Figure 13(4).

Model move: the system specifies a system event with role and data operations, but the log skips the activity without performing it. As shown in Figure 13(7), this pure model move is denoted as $((≪, R_{i}), (≪, q_{i}), (≪, a_{i}))$.

Illegal operation move: a data operation is considered illegal either when it is executed on an activity prohibited by the activity view, or when the data or activity is processed in an invalid context. When the role performing the event is valid but a data operation not permitted by the system is recorded in the data log, the move is denoted as $((r_{i}, R_{i}), (s_{i}, ≪), (e_{i}, a_{i}))$; see Figure 13(5). The move in Figure 13(6), $((\mathit{Wrong}(r_{i}), R_{i}), (s_{i}, ≪), (e_{i}, a_{i}))$, where the wrong role performs system-prohibited data operations for unintended purposes, is more threatening to the system. For example, a nurse who does not have access to a patient's medical history leaks patient information to an insurance claims company. In Figure 13(10), the appropriate role conducts irrational data operations on an incorrect activity. For instance, a treating physician, who has access to all patient information, utilizes such information for clinical medical research.

Figure 13.

Composite move classification chart.

In addition, Figure 13(8), $((r_{i}, ≪), (s_{i}, ≪), (e_{i}, ≪))$, is a pure log move, where the role forces the execution of events not allowed by the system and performs data operations not specified by the activity view. In the same way, Figure 13(9), $(r_{i}, ≪) (e_{i}, ≪), (≪, ≪)$, is an illegal operation. In practical application scenarios, these moves can be used to capture unauthorized data access, unauthorized data modification, and secondary use of data, and deviations from these different views can pose different levels of threat to the business.
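The classification above can be illustrated with a small Python sketch. It is a simplified reading under stated assumptions, not the paper's algorithm: each composite move is a triple of (log, model) pairs for role, data operation, and activity, `None` stands for the "no move" symbol, a role is treated as wrong when the logged role differs from the model role, and the role and operation names are hypothetical.

```python
SKIP = None  # stands for the "no move" symbol of the alignment


def classify(move):
    """Classify a composite move ((r, R), (s, q), (e, a)).

    The first element of each pair comes from the log, the second from the model.
    """
    (r, R), (s, q), (e, a) = move
    if e is SKIP and a is SKIP:
        return "illegal operation move"   # data operation without any activity
    if e is SKIP:
        return "model move"               # the log skipped the model activity
    if a is SKIP:
        return "log move"                 # the log executed an unspecified activity
    role = "correct role" if r == R else "wrong role"
    if s is SKIP:
        return f"partial synchronization move ({role})"  # required operation missing
    if q is SKIP:
        return f"illegal operation move ({role})"         # operation not permitted
    return f"synchronized move ({role})"


# Hypothetical example: the nurse Ella performs what the specialist Mary should do.
print(classify((("Ella", "Mary"), ("read P", "read P"), ("ai", "ai"))))
```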

Interlayer alignment

The composite method combines process and data perspective views, thereby facilitating the identification of deviations that remain undetectable under other single-view approaches and enabling an accurate diagnosis of such deviations. To accurately assess the extent of deviation from both data and process perspectives, this section presents a set of cost functions (Figure 14) to evaluate the legitimacy and cost associated with deviations in composite moves.

Figure 14.

Composite move cost distribution diagram.

Figure 15 illustrates the two different interlayer alignments $ψ_{1}$ and $ψ_{2}$. These alignments furnish execution activities, roles, and data operations for every system event within the business process, thereby facilitating more precise deviation detection. In the multi-perspective interlayer alignment $ψ_{1}$, as shown in Figure 13, the laboratory specialist Mary was supposed to read (r) the check items (P) in the event appointment examination (ai), whereas the actual performer was the nurse Ella. This corresponds to the Figure 13 deviation type $((\mathit{Wrong}(r_{i}), R_{i}), (s_{i}, q_{i}), (e_{i}, a_{i}))$, and the cost of this composite move is 2. David failed to read the data attribute (P) or to update the data (rU); this missing operation may lead to a wrong decision and corresponds to a deviation cost of 1. The base test (bt) is a model move: the log skips this activity and the corresponding data operation, which corresponds to the deviation type $((≪, R_{i}), (≪, q_{i}), (≪, a_{i}))$ and a deviation cost of 3. General practitioner Tom checked patient information (I) and diagnostic records (C) for the purpose of clinical trial (ct) examination, which violated the business system specifications.

Figure 15.

Multi-view interlayer alignment comparison diagram.

Medical staff exploited their access rights to engage in system-unauthorized activities and execute system-prohibited data operations, thereby violating patient privacy. This type of deviation poses a more significant threat to the business system. For the deviation type $((r_{i}, ≪), (s_{i}, ≪), (e_{i}, ≪))$, the cost is 4. Additionally, the attending physician failed to review the patient's symptoms (S) upon the patient's hospitalization (hs). The deviation type was $((r_{i}, R_{i}), (≪, q_{i}), (e_{i}, a_{i}))$ and the deviation cost was 1. Accountant Jack read the patient's medical certificate (P) and examination items (C) when the patient paid for leaving the hospital (bi). The accountant checked that the total amount of the examination items was reasonable, but the data view prohibited him from checking the attribute diagnosis certificate (C). This type of deviation is $((r_{i}, R_{i}), (s_{i}, ≪), (e_{i}, a_{i}))$, and the deviation cost is 2.

In summary, the total cost of interlayer alignment $ψ_{1}$ is $2 + 1 + 3 + 4 + 1 + 2 = 13$. Computed in the same way, interlayer alignment $ψ_{2}$ costs 14. The most substantial disparity between the interlayer alignments $ψ_{1}$ and $ψ_{2}$ stems from the distinct business process execution paths that originate from the section with an orange-hued background in the decision table. The disease grade is determined based on the results of data attribute inspections and relevant conditions, and the decision-making result is derived from medical domain knowledge to guide the business process execution path. The data attribute, decision table, and control flow interact synergistically and are mutually dependent. This implies that the appropriate business process path can be selected provided that specific data constraints are satisfied, and deviations in the business process can be identified from multiple perspectives with a high degree of granularity.

Multi-view optimal interlayer alignment deviation detection algorithm

Current research primarily centers on extracting models from process control flow, whereas the relationship between process-related data and decision logic has yet to be investigated. Consequently, deviations in the business process cannot be identified from multiple viewpoints. In this paper, we incorporate data information into business processes via activity views and employ decision tables to ascertain whether the data complies with business process rule constraints, thereby offering a scientific foundation for business process logic decision-making. The multi-view deviation detection algorithm is presented below.

To detect business process deviations with greater precision, given the significant impact of resource errors on anomaly detection, this section divides the multi-perspective anomaly detection algorithm into Algorithm 5-1 and Algorithm 5-2, which conduct deviation detection from dimensions such as activities, resources, and data respectively. Among them, Algorithm 5-1 focuses on the deviation situations of activities and data under the condition of correct resources, while Algorithm 5-2 emphasizes the elaboration of deviation costs of business process activities and data under the circumstances of resource errors.

In the multi-view deviation detection Algorithm 5-1, lines 1–4 define the algorithm inputs: the business process activity view $AV_{M}$ with the types of data operations $ATP$ that an activity may perform in the model, the activity view $AV_{L}$ recorded in the logs with its data operation types $atp$, and the decision table $DMN$ that satisfies the business process condition constraints. Lines 6–17 determine the deviation between the actual data operation $s_{i}$ recorded in the log and the data operation $q_{i}$ specified by the model when the roles and activities are performed correctly.

In lines 14–16, the composite move is $\mathit{CompositeMove} = ((r_{i}, R_{i}), (s_{i}, ≪), (e_{i}, a_{i}))$. If $aty(s_{i}) \in ATY(q_{i})$, the data operation recorded in the log is legitimate; otherwise, if $aty(s_{i}) \notin ATY(q_{i})$, the data operation in the log does not meet the data operations specified in the business process, and the deviation cost is higher. Lines 18–31 give the deviation costs between log and model under correct roles for the different composite moves and data operation conditions.

Algorithm 5-1: Multi-view Deviation Detection Algorithm (1)

Input: $\mathit{Data} = \{\mathit{datapattern}\}$, $AV_{M}$, $AV_{L}$

Output: $\mathit{CompositeMove}$, $\mathit{Cost}$

1 $AV_{M} = \{R_{set_i} = R_{i}, C_{set_i} = q_{i}, Π_{M}(t_{i}) = a_{i}, ATP = (c, r, u, d)\}$

2 $AV_{L} = \{r_{set_i} = r_{i}, c_{set_i} = s_{i}, Π_{L}(t_{i}) = e_{i}, atp = (c, r, u, d)\}$

3 $DMN = \{I, O, Reg = (reg_{1}, reg_{2}, \dots, reg_{n})\}$

4 $\mathit{cost} = 0$, $\mathit{CompositeMove} = \emptyset$

5 for $t_{i} \in σ_{i}$ do

6 if $r_{i} = R_{i}$ then

7 if $a_{i} = e_{i}$ then

8 if $s_{i} = q_{i} \land ATY(q_{i}) = aty(s_{i})$ then

9 $\mathit{CompositeMove} = ((r_{i}, R_{i}), (s_{i}, q_{i}), (e_{i}, a_{i}))$, $\mathit{cost} = \mathit{cost}$ //The activities, resources and data in the log fully comply with the business process norms

10 else if $s_{i} = ≪, q_{i} = q$ then

11 $\mathit{CompositeMove} = ((r_{i}, R_{i}), (≪, q_{i}), (e_{i}, a_{i}))$ //The activities and resources in the log meet the process specifications, while the data operations specified by the process are missing in the log

12 $aty(s_{i}) = \emptyset$, $\mathit{cost} = \mathit{cost} + 1$ //Calculate the composite move deviation cost

13 else

14 $s_{i} = s, q_{i} = ≪$

$\mathit{CompositeMove} = ((r_{i}, R_{i}), (s_{i}, ≪), (e_{i}, a_{i}))$

15 $aty(s_{i}) \in ATY(q_{i})$, $\mathit{cost} = \mathit{cost} + 1$

16 $aty(s_{i}) \notin ATY(q_{i})$, $\mathit{cost} = \mathit{cost} + 2$ //If the data inserted in the log does not meet the process specifications, the deviation cost is calculated

17 end

18 else if $a_{i} = ≪, e_{i} = e$ then

19 $\mathit{CompositeMove} = ((r_{i}, R_{i}), (s_{i}, ≪), (e_{i}, ≪))$

20 $s_{i} = s$, $aty(s_{i}) \in ATY(q_{i})$, $\mathit{cost} = \mathit{cost} + 2$

21 $s_{i} = s$, $aty(s_{i}) \notin ATY(q_{i})$, $\mathit{cost} = \mathit{cost} + 3$

22 else if $a_{i} = a, e_{i} = ≪$ then

23 $\mathit{CompositeMove} = ((r_{i}, R_{i}), (s_{i}, q_{i}), (≪, a_{i}))$

$s_{i} = s, q_{i} = q$, $aty(s_{i}) \in ATY(q_{i})$, $\mathit{cost} = \mathit{cost} + 2$

24 else if $s_{i} = s, q_{i} = ≪$ then

25 $\mathit{CompositeMove} = ((r_{i}, R_{i}), (s_{i}, ≪), (≪, a_{i}))$

26 $aty(s_{i}) \in ATY(q_{i})$, $\mathit{cost} = \mathit{cost} + 2$

27 $aty(s_{i}) \notin ATY(q_{i})$, $\mathit{cost} = \mathit{cost} + 3$ //Calculate deviation cost

28 else

29 $s_{i} = ≪, q_{i} = q$

30 $\mathit{CompositeMove} = ((r_{i}, R_{i}), (≪, q_{i}), (≪, a_{i}))$, $\mathit{cost} = \mathit{cost} + 2$

31 end

32 end

33 end

34 return $\mathit{CompositeMove}$, $\mathit{Cost} = \mathit{Cost}1$
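The cost assignment of Algorithm 5-1 can be read as a lookup table, sketched below in Python. This is an illustrative reading rather than the authors' code: the case labels ("sync", "missing", "extra_legal", "extra_illegal", "log", "model") and helper names are hypothetical, `None` stands for the "no move" symbol, a data operation is a pair (attribute, operation type), and `allowed_types` is assumed to be the set of operation types the activity view permits.

```python
# Costs transcribed from Algorithm 5-1 (the logged role matches the model role).
# Keys are (activity case, data case).
COST_CORRECT_ROLE = {
    ("sync", "sync"): 0,
    ("sync", "missing"): 1,
    ("sync", "extra_legal"): 1,
    ("sync", "extra_illegal"): 2,
    ("log", "extra_legal"): 2,
    ("log", "extra_illegal"): 3,
    ("model", "sync"): 2,
    ("model", "missing"): 2,
    ("model", "extra_legal"): 2,
    ("model", "extra_illegal"): 3,
}


def activity_case(e, a):
    """e is the logged activity, a the model activity; None means no move."""
    if e is not None and a is not None:
        return "sync"
    if e is not None:
        return "log"
    if a is not None:
        return "model"
    raise ValueError("a move needs an activity in the log or in the model")


def data_case(s, q, allowed_types):
    """s is the logged data operation, q the one required by the model."""
    if s is not None and q is not None:
        return "sync"
    if s is None and q is not None:
        return "missing"
    if s is not None:
        return "extra_legal" if s[1] in allowed_types else "extra_illegal"
    raise ValueError("a move needs a data operation in the log or in the model")


def trace_cost(moves, allowed_types):
    """Sum the deviation costs of all (s, q, e, a) moves of one trace."""
    total = 0
    for s, q, e, a in moves:
        key = (activity_case(e, a), data_case(s, q, allowed_types))
        if key not in COST_CORRECT_ROLE:
            raise ValueError(f"combination {key} is not covered by Algorithm 5-1")
        total += COST_CORRECT_ROLE[key]
    return total


moves = [
    (("P", "r"), ("P", "r"), "ai", "ai"),   # fully synchronized
    (None, ("S", "r"), "hs", "hs"),         # required read missing in the log
    (("C", "r"), None, "bi", "bi"),         # extra read recorded in the log
]
print(trace_cost(moves, allowed_types={"r"}))
```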

In the multi-view deviation detection Algorithm 5-2, lines 1–17 give the deviation cost analysis of the logs and models under an execution role error, $R_{i} = \mathit{Wrong}(r_{i})$. In traditional anomaly detection methods, an execution role error may indicate a threat to the business process caused by the intrusion of unauthorized personnel. Lines 18–28 detect log and model deviations caused by missing roles in the log. Lines 30–31 construct the multi-view deviation alignment diagram for each log, calculate the total deviation cost $\mathit{Cost}$, record the data attributes in each log, and use them to call Algorithm 5-3 for the data decision logic.

Algorithm 5-2: Multi-view Deviation Detection Algorithm (2)

Input: $\mathit{Data} = \{\mathit{datapattern}\}$, $AV_{M}$, $AV_{L}$

Output: $\mathit{CompositeMove}$, $\mathit{Cost}$

1 $AV_{M} = \{R_{set_i} = R_{i}, C_{set_i} = q_{i}, Π_{M}(t_{i}) = a_{i}, ATP = (c, r, u, d)\}$

2 $AV_{L} = \{r_{set_i} = r_{i}, c_{set_i} = s_{i}, Π_{L}(t_{i}) = e_{i}, atp = (c, r, u, d)\}$

3 $DMN = \{I, O, Reg = (reg_{1}, reg_{2}, \dots, reg_{n})\}$

4 $\mathit{Cost} = \mathit{Cost}1$

5 for $t_{i} \in σ_{i}$ do

6 if $R_{i} = \mathit{Wrong}(r_{i})$, $a_{i} = e_{i}$ //Error in resource execution in the log, deviation type during activity alignment

7 then

8 $s_{i} = q_{i}$, $\mathit{cost} = \mathit{cost} + 2$ //The deviation cost of $\mathit{CompositeMove} = ((\mathit{Wrong}(r_{i}), R_{i}), (s_{i}, q_{i}), (e_{i}, a_{i}))$

9 $s_{i} = ≪, q_{i} = q$, $\mathit{cost} = \mathit{cost} + 4$

10 $s_{i} = s, q_{i} = ≪$, $aty(s_{i}) \in ATY(q_{i})$, $\mathit{cost} = \mathit{cost} + 4$

11 $s_{i} = s, q_{i} = ≪$, $aty(s_{i}) \notin ATY(q_{i})$, $\mathit{cost} = \mathit{cost} + 5$ //Data operations that do not conform to the business process were inserted into the log

12 end

13 else if $R_{i} = \mathit{Wrong}(r_{i})$, $(a_{i} = ≪, e_{i} = e) \cup (a_{i} = a, e_{i} = ≪)$ then

14 $s_{i} = q_{i}$, $\mathit{cost} = \mathit{cost} + 4$

15 $s_{i} = ≪, q_{i} = q$, $\mathit{cost} = \mathit{cost} + 4$ //Under resource error conditions, the deviation cost of data operations that are missing from the log

16 $s_{i} = s, q_{i} = ≪$, $aty(s_{i}) \in ATY(q_{i})$, $\mathit{cost} = \mathit{cost} + 5$

17 $s_{i} = s, q_{i} = ≪$, $aty(s_{i}) \notin ATY(q_{i})$, $\mathit{cost} = \mathit{cost} + 6$ //Under resource error conditions, the deviation cost of data operations in the log that are not allowed by the process

18 else if $R_{i} = R, r_{i} = ≪, a_{i} = e_{i}$ then

19 $s_{i} = q_{i}$, $\mathit{cost} = \mathit{cost} + 1$

20 $s_{i} = ≪, q_{i} = q$, $\mathit{cost} = \mathit{cost} + 2$

21 $s_{i} = s, q_{i} = ≪$, $aty(s_{i}) \in ATY(q_{i})$, $\mathit{cost} = \mathit{cost} + 3$

22 $s_{i} = s, q_{i} = ≪$, $aty(s_{i}) \notin ATY(q_{i})$, $\mathit{cost} = \mathit{cost} + 4$

23 else

24 $(R_{i} = R, r_{i} = ≪, a_{i} = a, e_{i} = ≪) \cup (R_{i} = R, r_{i} = ≪, a_{i} = ≪, e_{i} = e)$

25 $s_{i} = q_{i}$, $\mathit{cost} = \mathit{cost} + 3$

26 $s_{i} = ≪, q_{i} = q$, $\mathit{cost} = \mathit{cost} + 3$

27 $s_{i} = s, q_{i} = ≪$, $aty(s_{i}) \in ATY(q_{i})$, $\mathit{cost} = \mathit{cost} + 4$

28 $s_{i} = s, q_{i} = ≪$, $aty(s_{i}) \notin ATY(q_{i})$, $\mathit{cost} = \mathit{cost} + 5$

29 end

30 $\mathit{Cost} = Σ\, \mathit{cost}_{t}$, $\mathit{DataAttribute} = \sum_{i = 1}^{n} d_{i} = (s_{i}, q_{i})$

31 $DMN = \{I, O, Reg = (reg_{1}, reg_{2}, \dots, reg_{n})\}$ //Algorithm Data Decision

32 end

33 return $\mathit{CompositeMove}$, $\mathit{Cost}$
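The role-related costs of Algorithm 5-2 can likewise be written as a nested lookup table. The following Python sketch only transcribes the costs listed in the algorithm; the key names ("wrong", "missing", "sync", "extra_legal", "extra_illegal") and the function name are hypothetical labels introduced for illustration.

```python
# COST[role case][activities aligned?][data case], transcribed from Algorithm 5-2.
# "wrong": the logged role is not the role required by the model.
# "missing": the model requires a role but the log records none.
COST_ROLE_DEVIATION = {
    "wrong": {
        True:  {"sync": 2, "missing": 4, "extra_legal": 4, "extra_illegal": 5},
        False: {"sync": 4, "missing": 4, "extra_legal": 5, "extra_illegal": 6},
    },
    "missing": {
        True:  {"sync": 1, "missing": 2, "extra_legal": 3, "extra_illegal": 4},
        False: {"sync": 3, "missing": 3, "extra_legal": 4, "extra_illegal": 5},
    },
}


def role_deviation_cost(role_case, activities_aligned, data_case):
    """Look up the deviation cost of one composite move with a role deviation.

    activities_aligned is True when the logged and model activities match,
    and False for a pure log move or a pure model move.
    """
    return COST_ROLE_DEVIATION[role_case][activities_aligned][data_case]


# A wrong role performing the expected activity and data operation.
print(role_deviation_cost("wrong", True, "sync"))
# A skipped activity whose role and data operation are also absent from the log.
print(role_deviation_cost("missing", False, "missing"))
```

The two printed lookups correspond to the wrong-role synchronous move and the pure model move, which the alignment $ψ_{1}$ discussed earlier prices at 2 and 3 respectively.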

Lines 2–6 of Algorithm 5-3 determine whether the data operations performed in the log meet the requirements of the business process. Data operations that do not, $\mathit{opera}(s_{i}) = \mathit{illegal}$, are deleted from the data attribute set, $\mathit{DataAttribute} = \mathit{DataAttribute} ∖ (s_{i}, q_{i})$, before it is used as the input $I$ of the decision table $DMN$. The decision logic function $\mathit{Decisionlogic} = \{de(s_{i}) \mid f_{Reg}(s_{i})\}$ then determines whether the data meets the business process rules and condition constraints to guide the business process operation. Line 8 reconstructs the DMN model, clarifying the structure of input, output, and rule sets. Line 9 updates the data attribute set, integrating valid data as input for the decision table. Line 10 employs the decision logic function to verify whether the data complies with business rules and constraints. Finally, line 11 generates the final decision output based on the verification results, directly guiding the execution of business processes and ensuring that only compliant data influences operational pathways.

Algorithm 5-3: Data Decision Algorithm

Input: $DMN = \{I, O, Reg = (reg_{1}, reg_{2}, \dots, reg_{n})\}$, $\mathit{DataAttribute} = \sum_{i = 1}^{n} d_{i} = (s_{i}, q_{i})$

Output: $\mathit{Decisionlogic}(O)$

1 for $t_{i} \in σ_{i}$, $AV(t_{i}) = a_{i}$, $av(t_{i}) = e_{i}$ do

2 if $\mathit{opera}(s_{i}) \subseteq \mathit{opera}(q_{i}) = (Cc, Rr, Uu, Dd)$ then

3 $\mathit{opera}(s_{i}) = \mathit{legal}$ //Determine whether the data operations in the log are legal

4 else

5 $\mathit{opera}(s_{i}) = \mathit{illegal}$

6 $\mathit{DataAttribute} = \mathit{DataAttribute} ∖ (s_{i}, q_{i})$ //For unreasonable data operations, delete them from the data attribute set

7 end

8 $DMN = \{I, O, Reg = (reg_{1}, reg_{2}, \dots, reg_{n})\}$

9 $I = \mathit{DataAttribute} = \sum_{i = 1}^{n} d_{i} = (s_{i}, q_{i})$ //The updated set of data attributes is used as the input for the decision table DMN

10 $\mathit{Decisionlogic} = \{de(s_{i}) \mid f_{Reg}(s_{i})\}$ //Determine whether the data meets the constraint rules of the business process

11 $O = \mathit{Decisionlogic}(de(s_{i}))$

12 end

13 return $\mathit{Decisionlogic}(O)$
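The two stages of Algorithm 5-3, filtering out illegally obtained attributes and then evaluating the decision table on what remains, can be sketched in Python as follows. The record layout, the first-hit rule evaluation, and the example rule (retinopathy grade and CKD stage of at least 3 leading to inpatient treatment, hs) are assumptions made for illustration and are not the paper's rule base.

```python
def data_decision(records, rules, default=None):
    """Filter illegal data operations, then evaluate the decision rules.

    records: list of dicts with keys "attribute", "value", "op", "allowed_ops".
    rules:   list of (condition, output); the first matching rule wins.
    """
    # Steps 2-6: keep only attributes whose logged operation is permitted.
    legal = {
        rec["attribute"]: rec["value"]
        for rec in records
        if rec["op"] in rec["allowed_ops"]
    }
    # Steps 9-11: the surviving attributes are the decision table input.
    for condition, output in rules:
        if condition(legal):
            return output
    return default


# Hypothetical rule: both findings at level 3 or above lead to hospitalization (hs).
RULES = [
    (lambda d: d.get("retinopathy_grade", 0) >= 3 and d.get("ckd_stage", 0) >= 3, "hs"),
]

records = [
    {"attribute": "retinopathy_grade", "value": 3, "op": "r", "allowed_ops": {"r"}},
    {"attribute": "ckd_stage", "value": 3, "op": "r", "allowed_ops": {"r", "u"}},
]
print(data_decision(records, RULES))
```

If one of the records had been obtained through an operation outside its allowed set, it would be dropped before rule evaluation and the rule above would no longer fire, which is the intended effect of letting only compliant data influence the execution path.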

Experimental evaluation

In this section, we explore the accuracy and performance of our approach. We conducted experiments using real medical event logs and interpreted the results. The experiments were conducted on a machine equipped with a 3.4 GHz Intel Core i7 processor and 16 GB of RAM. Section 6.1 evaluates the accuracy of the CRF + rule method and the pure CRF method in identifying medical entities. Section 6.2 presents privacy-preserving soundness checks for the purpose-based access control mechanism. Section 6.3 presents a visualization of the deviation detection model after control flow and data flow fusion. Section 6.4 verifies the effectiveness of multi-view deviation detection.

Verification of medical entity identification accuracy

In this paper, 4200 medical records were obtained from a tertiary hospital in Anhui Province. A corpus of 653897 characters was constructed to identify six types of entities: basic patient information, symptoms, medical history, examination items, treatment methods, and medical records. The corpus was divided into 6 groups, with test data volumes of 200, 400, 600, 800, 1000, and 1200 respectively. These six groups were evaluated by applying the CRF and CRF + rule methods respectively. The accuracy of medical entity identification was determined by calculating the precision rate $P = \frac{NCI}{NEI} \times 100\%$, the recall rate $R = \frac{NCI}{NCE} \times 100\%$, and the F-score $F = \frac{2 \times P \times R}{P + R}$. $NEI$ is the number of recognized entities, regardless of whether the identification result is correct, $NCI$ is the total number of correctly identified entities, and $NCE$ is the actual number of entities contained in the corpus.
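The three metrics can be computed directly from the counts $NCI$, $NEI$, and $NCE$. The short Python sketch below uses made-up counts purely for illustration; they are not the experimental values, and the function name `entity_metrics` is hypothetical.

```python
def entity_metrics(nci, nei, nce):
    """Return (precision, recall, F-score) in percent.

    nci: correctly identified entities
    nei: all identified entities, correct or not
    nce: entities actually contained in the corpus
    """
    precision = nci / nei
    recall = nci / nce
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return precision * 100, recall * 100, f_score * 100


# Hypothetical counts, chosen only to show the call.
p, r, f = entity_metrics(nci=80, nei=100, nce=160)
print(f"P = {p:.2f}%, R = {r:.2f}%, F = {f:.2f}%")
```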

Figure 16 compares entity recognition by the CRF and CRF + rule methods. Figure 16(a) shows the effect of the test set ADT and the total number of corpus entities $NCE$ on entity recognition. As the amount of test data and the total number of corpus entities increased, the precision, recall, and F-score of the CRF + rule method increased smoothly: precision reached 86.03%, recall reached 90.53%, and F-score reached 87.45%. The main reasons why the precision did not reach 90% are the failure to identify complex drug names and the misclassification of disease names as symptoms. Figure 16(b) shows that the maximum number of entities identified by the CRF method is 18,737, of which 8,621 are correctly identified. The maximum number of entities identified by the CRF + rule method is 15,047, of which 11,270 are correctly identified. In contrast, the precision, recall, and F-score of the pure CRF method declined sharply at first and then rose slowly. Because the pure CRF method identified a large number of entities but only a small number of them correctly, its precision and recall decreased. Once the number of identified entities reached its maximum, the precision, recall, and F-score increased slowly. The precision reached 63.46%, the recall reached 73.75%, and the F-score reached 62.83%.

Figure 16.

Comparison diagram of CRF and CRF + rule entity recognition. (a) The impact of ACE and NCE on entity identification. (b) The impact of NEI and NCI on entity identification.

Reasonableness check of purpose-based access control mechanism

Unreasonable access to medical data may lead to the leakage of patients’ sensitive information. This paper proposes an identity-based and purpose-based access control mechanism model. According to the different data access permissions, the model combines attribute encryption technology to build the key from the expected purpose of the data, performs authentication and purpose matching, and grants access to different data pattern information. It also extends the traditional purpose tree to achieve full coverage. Figure 17 compares the running time of the purpose matching algorithm across purpose tree hierarchies for tuple-based and element-based annotation of the electronic medical records. As shown in the figure, when the purpose tree has 5 nodes (PT size = 5), the height of the purpose tree is 2, and a 5-bit encoding is needed to construct all possible expected purposes. When the purpose tree has 14 nodes (PT size = 14), the height of the purpose tree is 5, and 14 bits are needed to encode all possible intended purposes. Figure 17 shows that the size of the purpose tree makes no substantial difference for either the tuple-based or the element-based annotation method. The reason is that bitwise AND operations remain efficient no matter how long the encoding is.

Figure 17.

Purpose tree size and performance.
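The bit encoding behind this result can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the 5-node tree, its purpose names, and the functions `encode` and `is_allowed` are hypothetical, and AIP and PIP are assumed to denote allowed and prohibited intended purposes as in common purpose-based access control models. Each node of the purpose tree gets one bit, so a tree with 5 nodes needs a 5-bit mask and a match reduces to bitwise AND operations.

```python
# Hypothetical 5-node purpose tree (PT size = 5, height 2); names are illustrative.
PURPOSE_TREE = {
    "general": ["treatment", "research"],
    "treatment": ["diagnosis", "nursing"],
    "research": [],
    "diagnosis": [],
    "nursing": [],
}

# One bit per node, so the encoding length equals the number of nodes.
BIT = {name: 1 << i for i, name in enumerate(PURPOSE_TREE)}


def encode(purpose: str) -> int:
    """Bitmask of a purpose together with all of its descendants."""
    mask = BIT[purpose]
    for child in PURPOSE_TREE[purpose]:
        mask |= encode(child)
    return mask


def is_allowed(access_purpose: str, aip: int, pip: int = 0) -> bool:
    """Allow access when the purpose is covered by AIP and not by PIP."""
    bit = BIT[access_purpose]
    return bool(bit & aip) and not (bit & pip)


aip = encode("treatment")  # patient allows treatment and its sub-purposes
pip = encode("nursing")    # but prohibits use for nursing

print(is_allowed("diagnosis", aip, pip))
print(is_allowed("nursing", aip, pip))
```

Because each check is a constant number of AND operations on a short integer, growing the tree from 5 to 14 nodes only lengthens the mask, which is consistent with the flat running time reported in Figure 17.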

In Figure 18, the time consumption increases with the number of accessed attributes. This is expected because fine-grained labeling schemes require more purpose matching checks. The number of attributes in the data schema accessed by the query is also a significant factor in the time spent by the element-based labeling scheme: because it requires a conformance check for each element accessed by the query, the checking time grows with the number of attributes accessed and the complexity of the attribute extraction rules. The method in this paper annotates electronic medical records by tuple, aggregating elements of the same attribute into a tuple before performing the purpose matching conformance check, which is significantly more efficient. Figure 18 also shows that using both AIP and PIP does not significantly increase the time compared with using AIP alone. Dividing the access purpose into AIP and PIP lets patients set their own data usage scope and protect their private information as intended, while for the computer it adds only one more bitwise AND compliance check.

Figure 18.

Labeling scheme and performance.
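The difference in the number of checks between the two labeling schemes can be illustrated with a small sketch. The schema, the labels, and both function names are hypothetical; the only point taken from the text is that element-based labeling performs one purpose match per accessed element, whereas tuple-based labeling performs one match per tuple of aggregated attributes.

```python
from typing import Dict, Iterable, Tuple


def element_based_check(attrs: Iterable[str],
                        element_labels: Dict[str, int],
                        access_bit: int) -> Tuple[bool, int]:
    """One purpose match per accessed attribute element."""
    ok, checks = True, 0
    for attr in attrs:
        checks += 1
        ok = ok and bool(element_labels[attr] & access_bit)
    return ok, checks


def tuple_based_check(attrs: Iterable[str],
                      attr_to_tuple: Dict[str, str],
                      tuple_labels: Dict[str, int],
                      access_bit: int) -> Tuple[bool, int]:
    """One purpose match per distinct tuple touched by the query."""
    ok, checks = True, 0
    for tup in {attr_to_tuple[a] for a in attrs}:
        checks += 1
        ok = ok and bool(tuple_labels[tup] & access_bit)
    return ok, checks


# Hypothetical schema: five attribute elements aggregated into two tuples.
ATTR_TO_TUPLE = {"name": "identity", "id_number": "identity", "phone": "identity",
                 "diagnosis": "clinical", "prescription": "clinical"}
TUPLE_LABELS = {"identity": 0b01, "clinical": 0b11}  # allowed-purpose bitmasks
ELEMENT_LABELS = {a: TUPLE_LABELS[t] for a, t in ATTR_TO_TUPLE.items()}

query = ["name", "phone", "diagnosis", "prescription"]
print(element_based_check(query, ELEMENT_LABELS, 0b01))
print(tuple_based_check(query, ATTR_TO_TUPLE, TUPLE_LABELS, 0b01))
```

For this query the element-based scheme performs four matches and the tuple-based scheme two, and both reach the same decision.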

Control flow data flow fusion deviation detection

In this section, we evaluate the effectiveness and feasibility of the proposed method using real event logs from a Grade III, Class A hospital in Anhui. These logs directly reflect real-world medical scenarios, which makes them a suitable basis for our study. Before the evaluation, we performed attribute extraction on the logs so that only the most relevant and informative data were retained.

The experimental log presents a wealth of data that is both extensive and varied. It contains 10,000 traces, each representing a unique sequence of medical events. These traces collectively encompass 167,868 events, offering a comprehensive view of the medical workflow. The presence of 21 distinct types of activities underscores the complexity and diversity of the medical tasks involved. In addition, the log contains 51 medically important attribute elements. To facilitate a more structured and manageable analysis, these attribute elements have been aggregated into 8 attribute classes. This categorization allows us to better understand the relationships and patterns among the attributes and their impact on the deviation detection process. By providing such detailed information about the experimental data, we aim to offer a transparent and rigorous foundation for our evaluation.

Business process deviation detection aims to identify discrepancies between the actual execution of business processes and the expected models, which is of paramount importance for ensuring the standardization, efficiency, and quality of business processes. A variety of methods have been proposed in this field, analyzing business processes from different perspectives to uncover potential deviations. Among them, DFM (direct follow deviation detection), WFT (workflow table), and A* with ILP (A* search combined with integer linear programming) are three representative approaches, each with distinct characteristics in terms of control-flow analysis and the combination of data and control-flow analysis. DFM is a relatively fundamental deviation detection method. It identifies deviations by analyzing the direct follow relationships between activities within a business process. WFT takes into account the impact of data on the control flow. It employs workflow tables to represent the relationships between activities and data elements, enabling the detection of deviations caused by data errors. A* with ILP combines the A* search algorithm with integer linear programming (ILP) to find the optimal alignment between the observed process execution and the reference model. This method is effective in detecting deviations in complex business processes.
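The idea behind a direct follow check can be sketched as follows. This is a generic illustration of the directly-follows relation, not the DFM implementation compared in the experiments; the activity names and both function names are hypothetical.

```python
from typing import Iterable, List, Sequence, Set, Tuple

Pair = Tuple[str, str]


def directly_follows(traces: Iterable[Sequence[str]]) -> Set[Pair]:
    """Collect every pair (a, b) where b immediately follows a in some trace."""
    return {(a, b) for trace in traces for a, b in zip(trace, trace[1:])}


def dfm_deviations(trace: Sequence[str], allowed: Set[Pair]) -> List[Tuple[int, str, str]]:
    """Report positions where an observed pair is not allowed by the reference behaviour."""
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(trace, trace[1:]))
            if (a, b) not in allowed]


# Hypothetical reference behaviour and an observed trace that skips "consult".
reference = [["register", "triage", "consult", "pay"]]
allowed = directly_follows(reference)
print(dfm_deviations(["register", "triage", "pay"], allowed))
```

Such a check only looks at the order of activities, which is why a method of this kind cannot flag deviations that stem from data attributes or resources.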

We evaluated the effectiveness of the four methods in detecting deviations in medical event logs with different noise levels (0%-10%) and an attribute complexity of 3%. Figure 19(a) shows the effect of noise on recall for the four methods. The noise covers all types of deviations, i.e. missing, additional, and misplaced activities in the control flow, and the 3% data noise comprises a 3% increase in resource types and a 3% increase in incorrect matching of activities to data execution operations. Figure 19(b) and (c) show that the precision and F-score of all four methods decrease as the noise level increases. The precision of the multi-view deviation detection method proposed in this paper is better than that of direct follow deviation detection (DFM), workflow table (WFT),³⁰ and A* with integer linear programming (A* with ILP).³¹ At a noise level of 10%, its precision remains at 0.9328, the best noise resistance of the four methods. Because the proposed method combines the control flow with the data flow by using the activity view to standardize the activity executor and the legitimate data operations, deviations in activity execution can be identified accurately. A* with ILP and DFM focus on deviation detection from the control flow perspective and consider fewer data attributes. As the noise level increases, their recall decreases significantly: the recall and precision of DFM decrease to 0.6132 and those of A* with ILP to 0.5481. WFT can detect deviations due to data errors but is less effective for data anomalies caused by concurrent structures. When a process contains concurrency, the timing of system events may fall between the start time and completion time of multiple process moves. Without data attributes, it is easy to create incorrect links between system events and process moves. This leads to a significant drop in the recall for concurrent structures, which reaches a minimum of 0.7385 when too much noise is included.

Figure 19.

Influence of noise on deviation detection methods at 3% attribute complexity.
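One way to produce control-flow noise of the three kinds named above (missing, additional, and misplaced activities) is sketched below. The paper does not describe its injection procedure, so everything here is an assumption: noise is applied per trace, one randomly chosen perturbation per affected trace, and the names `inject_noise` and `noisy_log` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def inject_noise(trace: Sequence[str], activities: Sequence[str],
                 rng: random.Random) -> Tuple[str, List[str]]:
    """Apply one random control-flow perturbation to a copy of the trace."""
    noisy = list(trace)
    kind = rng.choice(["missing", "additional", "misplaced"])
    if kind == "missing" and len(noisy) > 1:
        del noisy[rng.randrange(len(noisy))]          # drop one activity
    elif kind == "additional":
        noisy.insert(rng.randrange(len(noisy) + 1),   # insert an extra activity
                     rng.choice(activities))
    elif kind == "misplaced" and len(noisy) > 1:
        i, j = rng.sample(range(len(noisy)), 2)       # swap two positions
        noisy[i], noisy[j] = noisy[j], noisy[i]
    return kind, noisy


def noisy_log(log: Sequence[Sequence[str]], activities: Sequence[str],
              level: float, seed: int = 0) -> List[List[str]]:
    """Perturb a fraction `level` of the traces; the others are copied unchanged."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(log)), round(level * len(log))))
    return [inject_noise(t, activities, rng)[1] if i in chosen else list(t)
            for i, t in enumerate(log)]


log = [["register", "triage", "consult", "pay"]] * 100
out = noisy_log(log, ["register", "triage", "consult", "pay", "test"], level=0.10)
print(sum(1 for o, t in zip(out, log) if o != t))
```

Swapping two identical activities would leave a trace unchanged, so the number of visibly altered traces equals the noise level only when the activities within a trace are distinct, as in this toy log.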

Figure 20 demonstrates that, as attribute complexity increases, the running time of DFM and A* with ILP is basically unchanged because they only consider deviations in the activity occurrence sequence of the business process. WFT and the method in this paper both consider the influence of data constraints on deviation detection. However, since we transform the activity view of the business process into a CRUD²³ matrix that specifies the resources executing each activity and the operations executed on the data, the computer can read it efficiently and the running time is kept within 2 ms. WFT also examines the data operations of business processes, but it takes longer because each data operation has to go through all activities to find a match.

Figure 20.

Influence of attribute complexity on deviation detection methods.
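Why a matrix representation of the activity view is cheap to query can be sketched as follows. The activities, attribute classes, roles, and the function `check_event` are hypothetical, and the matrix is modelled as a dictionary keyed by (activity, attribute class); this is an assumption about the data structure, not the paper's implementation.

```python
from typing import Dict, FrozenSet, Tuple

# (activity, data attribute class) -> (allowed resource roles, allowed CRUD operations)
ActivityView = Dict[Tuple[str, str], Tuple[FrozenSet[str], FrozenSet[str]]]

ACTIVITY_VIEW: ActivityView = {
    ("prescribe", "medication"): (frozenset({"doctor"}), frozenset({"C", "R", "U"})),
    ("dispense", "medication"): (frozenset({"pharmacist"}), frozenset({"R"})),
}


def check_event(activity: str, attribute: str, resource: str, operation: str,
                view: ActivityView = ACTIVITY_VIEW) -> str:
    """Classify one logged data operation against the activity view."""
    entry = view.get((activity, attribute))  # single hash lookup, no scan over activities
    if entry is None:
        return "data deviation: attribute not permitted for activity"
    roles, operations = entry
    if resource not in roles:
        return "resource deviation"
    if operation not in operations:
        return "operation deviation"
    return "conforming"


print(check_event("prescribe", "medication", "doctor", "U"))
print(check_event("dispense", "medication", "pharmacist", "U"))
```

Each event is resolved with one keyed lookup, whereas an approach that scans all activities for every data operation grows with the number of activities, which matches the running-time gap described for Figure 20.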

To sum up, DFM and A* with ILP cannot detect data deviations. WFT recognizes that the impact of data on control flow must be introduced into deviation detection and brings data into the business process through the workflow table. However, it does not consider the guiding role of decision logic in the business process, and it takes a long time to match data with activities. This paper proposes multi-view deviation detection that combines the activity view, data decisions, and control flow to detect deviations that are difficult to detect by other methods, while maintaining high precision and a relatively short running time.

Conclusion

The main contribution of this paper is to maximize the use of data while protecting the privacy of data providers. It introduces data attributes into business processes and addresses the problem that existing process models cannot represent some data-related decision requirements and therefore fail to detect deviations from other perspectives of the process. The method uses the activity view and decision logic to combine control flow with data flow, and applies a composite moving classification diagram and a deviation cost function that reflects the degree of impact of each deviation type on the business process. In this way it detects deviations in the privacy, data, and control flow perspectives and also finds hidden deviations arising from the combination of these three perspectives. To broaden the application of our algorithm, future research will concentrate on two pivotal areas. First, we aim to enable active data updates for loop structures in business process models, which is vital for managing complex and dynamic workflows. Second, we intend to employ Bayesian networks and extended likelihood graphs, designed to incorporate a wider range of data attributes. This will facilitate a more thorough understanding of the intricate data-activity interactions across various business processes.

Our privacy-preserving deviation detection method, shown to be effective in the medical field, holds great potential for wider adoption across multiple sectors. In finance, it can serve as a reliable tool for detecting fraudulent transactions and compliance violations while protecting sensitive customer data. In supply chain management, it enables real-time logistics workflow monitoring, identifying anomalies such as unauthorized diversions and data tampering while preserving partner data privacy. In manufacturing, it can identify production process deviations affecting product quality or safety, while securing proprietary operational data. In cybersecurity, it enhances intrusion detection by analyzing process deviations that may indicate security breaches, all while prioritizing data privacy. The method's versatility across these critical domains arises from its core capability to detect deviations effectively while strictly maintaining data privacy and security, making it a promising solution for diverse business process deviation detection scenarios. Although the method proposed in this paper demonstrates notable advantages in multiple aspects, it still has certain limitations. On the one hand, the handling of data updates for loop structures in business processes currently lacks sufficient flexibility. When confronted with complex and dynamically changing workflows, it may fail to promptly and accurately reflect the impact of data variations on the processes. On the other hand, in terms of the utilization of data attributes, despite the introduction of some key attributes, the mining of certain latent and complex data characteristics is not thorough enough. This may restrict the accuracy and comprehensiveness of the method when dealing with highly complex business processes.

Footnotes

Acknowledgments

I'm brimming with gratitude as I write this acknowledgment for my thesis. First, I'm deeply thankful to the review experts and editors. Their constructive feedback and insightful comments refined the research content, strengthened arguments, and elevated the paper's quality. Their pursuit of academic excellence motivates me to aim higher in future research. I also sincerely thank George K. Agordzo. His meticulous English grammar check was a great help. With his keen eye for detail and language proficiency, he polished the manuscript to meet international standards, ensuring my findings could reach a global audience. Lastly, I extend heartfelt appreciation to the funding agencies. Their generous support was the bedrock of this project, allowing me to access resources, conduct experiments, and attend conferences. Without them, this research would have been an uphill battle. I'm truly grateful to all who've contributed to this work.

ORCID iD

Juan Li

Author contributions

Juan Li wrote the main manuscript, Xianwen Fang optimized it, and Yan Wang collected part of the data. All the authors read the manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Anhui Provincial Natural Science Foundation, the Natural Science Foundation of Anhui Provincial Education Department, the National Natural Science Foundation of China, the Anhui Province Academic and Technical Leader Foundation, the Key Research and Development Program of Anhui Province, the Scientific Research Project for Graduate Students of Anhui Province, the Open Research Fund of Anhui Province Engineering Laboratory for Big Data Analysis and Early Warning Technology of Coal Mine Safety, the Leading Backbone Talent Project in Anhui Province, China, and the Postdoctoral Research Project of Anhui Province (grant numbers: Water Science Joint Fund, 2308085US11; 2024AH040091; No. 61572035, 61402011; No. 2022D327; 2022a05020005; 2021CX1011; No. CSBD2024-ZD03; 2020-1-12; No. BSHXM202401).

Declaration of conflicting interests

The datasets analyzed in this study contain sensitive medical information and are not publicly available to protect patient privacy. Data may be requested from the first author, Dr Juan Li. The authors declare that they have no competing financial or non-financial interests relevant to this work.

References

1. Yang, et al. An approach to automatic process deviation detection in a time-critical clinical process. J Biomed Inform Sep. 2018; 85: 155–167.

2. Ben Chaabene NEH, Bouzeghoub, Guetari, et al. Deep learning methods for anomalies detection in social networks using multidimensional networks and multimodal data: a survey. Multimed. Syst. Dec. 2022; 28: 2133–2143.

3. Omair, Alturki. Multi-dimensional fraud detection metrics in business processes and their application. Int. J. Adv. Comput. Sci. Appl. 2020; 11: 570.

4. Wang, Teng, et al. A detection method for abnormal transactions in E-commerce based on extended data flow conformance checking. Wirel Commun Mob Comput Jan. 2022; 2022: e4434714.

5. Wang. Efficient deviation detection between a process model and event logs. IEEE/CAA J. Autom. Sin. Nov. 2019; 6: 1352–1364.

6. Wang, et al. Petri net-based deviation detection between a process model with loop semantics and event logs. Concurr. Pract. Exp. 2018; 30: e4419.1–e4419.18.

7. Estañol, Munoz-Gama, Carmona, et al. Conformance checking in UML artifact-centric business process models. Softw. Syst. Model. Aug. 2019; 18: 2531–2555.

8. Fani Sani, van Zelst, van der Aalst WMP. Repairing outlier behaviour in event logs. In: Abramowicz, Paschke (eds) Business information systems. Lecture Notes in Business Information Processing. Cham: Springer International Publishing, 2018, pp.115–131. doi: 10.1007/978-3-319-93931-5_9.

9. Chadli, Kabbaj, Bakkoury. An enhanced adhoc approach based on active help to detect data flow anomalies in a loop of a business modeling. In: Elhoseny, Hassanien (eds) Emerging technologies for connected internet of vehicles and intelligent transportation system networks. Studies in Systems, Decision and Control, vol. 242. Cham: Springer International Publishing, 2020, pp.127–147. doi: 10.1007/978-3-030-22773-9_9.

10. Guzzo, Joaristi, Rullo, et al. A multi-perspective approach for the analysis of complex business processes behavior. Expert Syst Appl 2021; 177: 114934.

11. Herouala, Ziani, Kerrache, et al. Cadaca: a new caching strategy in NDN using data categorization. Multimed. Syst. Oct. 2023; 29: 2935–2950.

12. Carrasquel, Lomazova. Searching for deviations in trading systems: combining control-flow and data perspectives. In: International Conference on Software Testing, Machine Learning and Complex Process Analysis. Cham: Springer Nature Switzerland, vol. 1559, no.16, pp.94–116. doi: 10.48550/arXiv.2210.16800.

13. Böhmer, Rinderle-Ma. Multi-perspective anomaly detection in business process execution events. In: Debruyne, Panetto, Meersman, Dillon, Kühn, O’Sullivan, Ardagna (eds) On the move to meaningful internet systems: OTM 2016 conferences. Lecture Notes in Computer Science, vol. 10033. Cham: Springer International Publishing, October 2016, pp.80–98. doi: 10.1007/978-3-319-48472-3_5.

14. Nolle, Luettgen, Seeliger, et al. BINet: multi-perspective business process anomaly classification. Inf Syst Jan. 2022; 103: 101458.

15. Lahann, Pfeiffer, Fettke. LSTM-based anomaly detection of process instances: benchmark and tweaks. In: Montali, Senderovich, Weidlich (eds) Process mining workshops, vol. 468. Cham: Springer Nature Switzerland, March 2023, pp.229–241. doi: 10.1007/978-3-031-27815-0_17.

16. Michael, Koschmider, Mannhardt, et al. User-centered and privacy-driven process mining system design for IoT. In: Cappiello, Ruiz (eds) Information systems engineering in responsible information systems. Lecture Notes in Business Information Processing, vol. 350. Cham: Springer International Publishing, 2019, pp.194–206. doi: 10.1007/978-3-030-21297-1_17.

17. Pika, Wynn, Budiono, et al. Towards privacy-preserving process mining in healthcare. In: Di Francescomarino, Dijkman, Zdun (eds) Business process management workshops. Lecture Notes in Business Information Processing, vol. 362. Cham: Springer International Publishing, 2019, pp.483–495. doi: 10.1007/978-3-030-37453-2_39.

18. Pika, Wynn, Budiono, et al. Privacy-preserving process mining in healthcare. Int. J. Environ. Res. Public. Health Jan. 2020; 17: 1612.

19. Elkoumy, et al. Privacy and confidentiality in process mining: threats and research challenges. ACM Trans. Manag. Inf. Syst. 2021; 13: 11:1–11:17.

20. Beugelsdijk, van Witteloostuijn, Meyer. A new approach to data access and research transparency (DART). J Int Bus Stud Aug. 2020; 51: 887–905.

21. Mozafari Mehr, de Carvalho, van Dongen. Detecting privacy, data and control-flow deviations in business processes. In: Nurcan, Korthaus (eds) Intelligent information systems. Lecture Notes in Business Information Processing. Cham: Springer International Publishing, 2021, pp.82–91. doi: 10.1007/978-3-030-79108-7_10.

22. Zhang, et al. Fuzzy multi-perspective conformance checking for business processes. Appl Soft Comput Nov. 2022; 130: 109710.

23. Alizadeh, Fahland, et al. Linking data and process perspectives for conformance analysis. Comput Secur Mar. 2018; 73: 172–193.

24. Tsoury, Soffer, Reinhartz-Berger. How well did it recover? Impact-aware conformance checking. Computing Jan. 2021; 103: 3–27.

25. Felli, de Leoni, Montali. Soundness verification of data-aware process models with variable-to-variable conditions. Fundam. Informaticae Jan. 2021; 182: 1–29.

26. Lin, Xie. Research on named entity recognition of traditional Chinese medicine electronic medical records. In: Huang, Siuly, Wang, Zhou, Zhang (eds) Health information science. Lecture Notes in Computer Science. Cham: Springer International Publishing, 2020, pp.61–67. doi: 10.1007/978-3-030-61951-0_6.

27. Tan, Qin, Tang, et al. Privacy protection for medical images based on DenseNet and coverless steganography. Comput. Mater. Contin. 2020; 64: 1797–1817.

28. Böhmer, Rinderle-Ma. Mining association rules for anomaly detection in dynamic process runtime behavior and explaining the root cause to users. Inf Syst May 2020; 90: 101438.

29. Bazhenova, Zerbato, Oliboni, et al. From BPMN process models to DMN decision models. Inf Syst Jul. 2019; 83: 69–88.

30. Tao, Liu, Yang, et al. Workflow nets with tables and their soundness. IEEE Trans. Ind. Inform. Mar. 2020; 16: 1503–1515.

31. García-Bañuelos, van Beest NRTP, Dumas, et al. Complete and interpretable conformance checking of business processes. IEEE Trans Softw Eng Mar. 2018; 44: 262–290.