Sage Journals: Discover world-class research

Abstract

Activity recognition technologies only present a good performance in controlled conditions, where a limited number of actions are allowed. By contrast, industrial applications are scenarios with real and uncontrolled conditions where thousands of different activities (such as transporting or manufacturing craft products), with an incredible variability, may be performed. In this context, new and enhanced human activity recognition technologies are needed. Therefore, in this paper, a new activity recognition technology, focused on Industry 4.0 scenarios, is proposed. The proposed mechanism consists of different steps, including a first analysis phase where physical signals are processed using moving averages, filters and signal processing techniques, and an atomic recognition step where Dynamic Time Warping technologies and k-nearest neighbors solutions are integrated; a second phase where activities are modeled using generalized Markov models and context labels are recognized using a multi-layer perceptron; and a third step where activities are recognized using the previously created Markov models and context information, formatted as labels. The proposed solution achieves a best recognition rate of 87%, which demonstrates the efficacy of the described method. Compared to the state-of-the-art solutions, an improvement of up to 10% is reported.

Keywords

Activity recognition, context-aware systems, Industry 4.0, pervasive sensing, Markov model, time series analysis

1. Introduction

Industry 4.0 [1] refers to a new age in industry, where pervasive sensing and ubiquitous computing platforms are employed to support highly efficient processes. Industry 4.0 is also characterized by the integration of Cyber-Physical Systems (CPS) [2], the implementation of the Anything as a Service paradigm [4] and the use of fully automated and intelligent production systems [6]. Among all industrial intelligent solutions, human monitoring mechanisms are the most important component to be adapted to Industry 4.0. Indeed, to integrate people into Industry 4.0, it is essential that these technologies are able to capture and understand information about people and the tasks they perform [12].

Currently, human activity recognition is largely reliant on computer vision, with very good results, through 2D and 3D camera sensors [51]. In these use cases [8], wide open spaces are available and the activities to be recognized (steel bending, walking, transporting, etc.) depend only on the general body position, the movement and the elements workers manipulate [49]. By contrast, the craft industry and hand-made products include activities where the specific position and movement of fingers and feet, the interaction with other workers or the pressure a worker is applying are relevant [10]. For example, in the handmade pottery industry, tasks such as molding and casting a clay sculpture are distinguished by the position of the fingers. In order to recognize human activities in these scenarios using computer vision, cameras with a very high resolution would be required, or several cameras focusing on different areas of the scenario [48]. However, in the craft industry, spaces tend to be smaller and chaotic, and human activity recognition techniques based on computer vision have shown some limitations in those scenarios [7].

Thus, for these craft industrial scenarios, heterogeneous pervasive sensing platforms are investigated as a possible valid alternative [9, 11]. In these platforms, although cameras may be included, we typically find low-cost sensors such as accelerometers and RFID tags and readers integrated into wearables, Bluetooth beacon devices for indoor positioning, or passive infrared sensors to control the workers' movement [12, 13]. Besides, in those scenarios the number of sensing nodes is huge [14]. Moreover, the activities of craftsmen tend to be non-controllable, with an incredible variability [15]. Thus, existing activity recognition technologies usually perform poorly in real industrial scenarios.

Therefore, the objective of this paper is to define and evaluate a new hybrid activity recognition technology, focused on (craft) Industry 4.0 scenarios. The proposed mechanism consists of various steps. Those steps are designed to decouple the pervasive hardware platform from the software algorithms without needing any additional controller. Thus, complex human activities are recognized through a sequence of ensembled technologies. The referred steps include a first analysis phase where physical signals are processed with DTW technologies; a second phase where activities are modeled using Markov chains; and a third step where activities are recognized using the previously created Markov models.

The rest of the paper is organized as follows. Section 2 presents the state of the art on activity recognition technologies. Section 3 presents the proposed solution, including all the considered steps. Section 4 describes the experimental evaluation; and Section 5 concludes the paper.

2. State of the art

In general, human activity recognition systems can be classified into two different categories, according to the type of device employed to capture information: video-based and sensor-based.

• Video-based solutions use cameras to capture images of the scenario, which are later processed. In the most traditional approach, images are captured by multiple cameras in a predefined environment where optical markers are placed [47]. However, these markers are restrictive for workers, and how to implement these mechanisms in industrial scenarios is a pending challenge [46]. Markerless techniques have also been reported and have been successfully applied to Industry 4.0 scenarios [49]. The main advantage of these markerless approaches is the unobtrusive and precise monitoring. However, the presence of several objects and workers in the images reduces the precision of these methods [7], and focusing on small areas may be difficult because of the cameras' resolution and the environment. Furthermore, in general, video-based systems are very sensitive to extreme temperature variations, lighting, noise or vibrations, which are common in industrial applications [45].
• Sensor-based solutions (or non-optical systems) might be supported by three basic sensor technologies: environmental sensors, wearable sensors, or smartphones [45]. This approach is common in industrial applications, as devices are low-cost and pervasive platforms, with a huge number of devices, may be deployed [43]. The main advantage is the information granularity and redundancy [44]. However, environmental sensors are sensitive to the industrial environmental conditions, and smartphones and wearables may affect the workers' performance [44]. In this paper we address this pending challenge by employing a hybrid approach where we balance the advantages of environmental sensors (unobtrusive monitoring) and wearables (precision).

Table 1. State of the art in activity recognition techniques for industrial scenarios

| Reference | Detection mode | Model | Context | Conditions | Sensor type |
|---|---|---|---|---|---|
| [49] | Real-time | Neural network | Industry | Non-ideal | Camera |
| [48] | Real-time | EP | Industry | Ideal | Camera |
| [36] | Real-time | Gaussian | Laboratory | Ideal | Hybrid |
| [22] | Real-time | EP | Laboratory | Ideal | Wearable |
| [32] | Offline | HMM | Laboratory | Ideal | Phone |
| [43] | Offline | Other AI | Laboratory | Ideal | Wearable |
| [44] | Offline | HMM and other AI | Laboratory | Ideal | Hybrid |
| [47] | Real-time | EP | Street | Ideal | Camera |

From the mathematical point of view, recognition mechanisms for industrial scenarios may be classified into six basic categories: (i) Bayesian classifiers; (ii) Hidden Markov Models; (iii) the Conditional Random Field; (iv) the Skip Chain Conditional Random Field; (v) Emerging Patterns; and (vi) other artificial intelligence models.

• Bayesian classifiers are the most basic and elemental technology. Because of this simplicity, in scenarios of craft industry, with non-ideal conditions (or even in living labs and other real-like applications), where actions are highly variable, the performance of Bayesian classifiers is lower than that of other solutions [17], so their application in real scenarios is still an open challenge.
• The Hidden Markov Model (HMM) [21] is the most commonly employed mechanism to model human activities [22]. These models can be combined with cameras or sensors, although sensor-based systems are much more common [31]. Besides, HMM have been successfully employed in domestic environments [29]. However, as their main disadvantage, these models are not useful to model concurrent activities [24], which are very common in Industry 4.0 applications.
• In Conditional Random Fields (CRF) any probability distribution is allowed (although actions composing activities are still connected as chains). As their main advantage, CRF have been successfully employed in controlled scenarios such as living labs [28], as well as in in-home solutions [30]. Moreover, these models can be integrated with both camera-based [34] and sensor-based solutions [32].
• Skip Chain Conditional Random Fields (SCCRF) is a pattern recognition technique that enables modeling activities that are not sequences of actions in nature. This technique has been employed in scenarios such as complex biomedical applications [35] or surgery activity recognition [38]. This approach is the most adequate for craft Industry 4.0 scenarios [39].
• Emerging patterns (EP). For most authors, EP is a technique describing activities as vectors of parameters and their corresponding values (location, object, etc.) [41]. Its main advantage is its computational efficiency (so real-time operation is enabled), but standalone implementations have shown a reduced precision compared to other classifiers and hybrid approaches [40].
• Finally, other artificial intelligence models have been developed, especially for camera-based systems and computer vision. Gaussian models [36], semantic technologies [18], intelligent encoders [5], optimization functions [19] or estimation techniques [20] have been reported very recently. All these approaches have the advantage of showing a very good performance and precision, but they are not flexible.

Table 1 presents and analyzes works on these scenarios.

In this paper we aim to balance and combine the flexibility of Markov CRF models, and the precision of intelligent application-specific classifiers. To do that, a hybrid approach is proposed, where different phases or steps are considered.

3. Analysis-modeling-recognition algorithm

In this Section, the proposed activity recognition mechanism for Industry 4.0 is described. Figure 1 shows the block and flow diagram of the proposed solution.

Figure 1. Proposed activity recognition methodology.

In this paper we propose a hybrid approach in three steps. The first step (analysis phase) analyzes heterogeneous signals from different sensor types, and recognizes atomic actions limited in time and space through the location of emerging patterns. The second step considers the recognized atomic actions to model industrial activities using general CRF (GCRF). This model, however, is focused on actions performed by one person (user activities). In the third step, in order to recognize business activities (performed by several people, for example), all user activities are fed into a high-precision classifier (random forest), where information (labels) about the physical context (extracted from sensor signals) is also employed.

Before any further explanations, some formal definitions must be considered:

• Atomic action: Elemental movement, including some instruments or not, with an objective in the context of the production process (e.g., press a button).

• User action: Independent activity performed by only one worker, which meets a production objective (e.g., controlling a machine).

• Context label: Any representation of the environmental situation in an industrial scenario (e.g., temperature, noise level, etc.).

• Business action: Production activity, which meets an objective at business level (e.g., manufacturing a product).

The proposed solution is supported by pervasive sensing and computing platforms, composed of heterogeneous devices with very different behavior and characteristics. Signals are then acquired through a set of different technologies such as WiFi or publication/subscription brokers (step 1 in Fig. 1). These signals are unsynchronized and multimodal, so an analysis phase is carried out. The analysis phase starts with a noise reduction filter and a digitalization step (step 2), based on an exponential moving average (EMA) and the $\Sigma-\Delta$ encoder. To remove format divergences among signals, they are orthonormalized and all redundancies are also removed (step 3), considering the restrictions of the scenario.

These digital signals are then grouped considering spatial restrictions, and signal segments are created (step 4 in Fig. 1). If a segment matches the format of any of the patterns in the atomic task repository, the activity recognition process starts. Otherwise, segments are considered context information and sent to the next phase. The atomic recognition process starts with a temporal segmentation (step 5), considering the typical duration of activities in the pattern repository. Temporal segments are dynamically calculated through sliding windows. For each possible temporal segment, a Dynamic Time Warping (DTW) algorithm is employed to measure the distance between the segment and the patterns in the repository (step 6). If that distance is lower than a given threshold, the segment is close enough to run the recognition algorithm. Otherwise, the temporal segmentation process is updated and a new DTW distance is calculated. If the binary packet (spatial segment) finishes and the recognition algorithm could not be triggered, the segment is considered context information. Finally, atomic actions are recognized (step 7). This algorithm is based on the k-nearest neighbors (K-NN) solution but adapted to future Industry 4.0 scenarios. Recognized atomic actions are represented as labels (integer numbers) with some additional metadata such as the timestamp.

In a craft pottery industry, atomic actions could be, for example, pressing the pedal of the potter’s wheel or turning it on (if electrical).

In the modeling phase, two modules work in parallel: the user activity recognition and the context recognition modules. On the one hand, atomic actions are first classified according to the user performing those actions (step 8 in Fig. 1), using the metadata. Indeed, in this modeling phase, we are focusing on activities performed by only one user. Then, each sequence of atomic actions performed by each user is processed using a sliding window (step 9) to deal with the lack of synchronization among users. Different start points are then considered, and all the potential sequences of atomic actions are fed into a GCRF model. In this module (step 10), finally, probabilistic numerical models are employed to determine if some known user actions are recognized. User actions are represented as ASCII labels, typically composed of four or five printable characters (for example, WLK for “walking”). In the same craft pottery industry, an example of user action could be modeling one piece on the potter’s wheel (which includes atomic actions such as pressing the pedal periodically, moving the hands, etc.).

On the other hand, context signals are also processed using a temporal segmentation algorithm (step 11 in Fig. 1), but in this case it is based on a fixed square window. From each segment, a set of statistical features (step 12), such as the mean and the deviation, is then extracted. These features (as a numerical vector of double precision variables) are then fed into a multilayer perceptron (step 13). Specifically, this context recognition module is based on a supervised learning algorithm, built as a neural network. This perceptron generates a tensor (matrix with double values) which feeds a set of classifiers (step 14). Context labels are attributes (key-value pairs) indicating the temperature, geographical position, etc.
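As an illustration of the segmentation and feature extraction just described, the following minimal Python sketch splits one context signal into fixed-width windows and computes two statistics per window. The function name `window_features`, the use of non-overlapping windows and the choice of only the mean and the population standard deviation are assumptions made for this example; the text does not fix them.

```python
from statistics import mean, pstdev

def window_features(signal, width):
    """Split `signal` into consecutive fixed-width windows and return
    one [mean, standard deviation] feature vector per window."""
    features = []
    # Non-overlapping windows (assumption); an incomplete tail is discarded.
    for start in range(0, len(signal) - width + 1, width):
        segment = signal[start:start + width]
        features.append([mean(segment), pstdev(segment)])
    return features

# Hypothetical temperature readings, grouped in windows of four samples.
readings = [1, 1, 1, 1, 2, 4, 2, 4]
vectors = window_features(readings, 4)
```

Each resulting vector would then be the input of the perceptron; the network itself is outside the scope of this sketch.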

Finally, all user actions and context labels are combined to recognize the high-level business actions. A random forest approach is employed. First (step 15 in Fig. 1), both inputs are combined to create vectors containing a list of user actions and the corresponding context labels. This vector feeds a new set of classifiers, in this case decision trees, so that each tree independently evaluates (step 16) which business action is being performed. A counter selects (step 17) the activity most often recognized by the decision trees as the final recognized business action. In general, business actions may be represented employing any data format required by the visualization dashboard or management platform, for example, YAWL or simple ASCII labels. Many different examples of business actions could be imagined, for example, product production in a craft pottery industry (which includes design, modeling, decorating, etc.).
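The voting step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the trees are stand-in callables rather than trained decision trees, and the labels (`MOLD`, `DECO`, `PRODUCTION`, `IDLE`) and the layout of the input vector are hypothetical.

```python
from collections import Counter

def majority_vote(trees, features):
    """Ask every tree for a business action and return the most voted one."""
    predictions = [tree(features) for tree in trees]
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical input vector: user actions plus context labels (key-value pairs).
sample = {"user_actions": ["MOLD", "DECO"], "context": {"temperature": "high"}}

# Stand-ins for trained decision trees (assumption for illustration only).
trees = [
    lambda f: "PRODUCTION" if "MOLD" in f["user_actions"] else "IDLE",
    lambda f: "PRODUCTION" if f["context"]["temperature"] == "high" else "IDLE",
    lambda f: "IDLE",
]

business_action = majority_vote(trees, sample)
```

With an even number of trees a tie is possible; how ties are resolved is not specified in the text, and this sketch simply keeps the label that was seen first.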

In general, as previously proposed in other very precise hybrid approaches [40], recognition and analysis technologies in the lower levels (DTW, K-NN) are noise-tolerant although less precise than other alternatives. In the higher layers, very precise solutions are proposed, taking advantage of the noise removal and data curation in the lower layers.

The next subsections describe in detail each one of these three phases.

3.1 Analysis phase

In an Industry 4.0 scenario we are considering a pervasive hardware platform composed of $P$ information sources (sensors, computing elements, etc.). These information sources provide information about production processes carried out by all people and activities in the environment. We are considering $N$ different users developing $M$ independent business actions.

Thus, after acquiring and aggregating all information sources, we obtain a time-variant vector $\overrightarrow{X}(t)$ of $P$ components Eq. (1).

$\displaystyle\overrightarrow{X}(t)=\{x_{1}(t),\dots,x_{i}(t),\dots,x_{P}(t)\}$ (1)

First, in the general case, $\overrightarrow{X}(t)$ is an analog (or analog-like) vector. Thus, signals must be digitalized through a sampling and retention scheme. This scheme, besides, must transform all information signals into integer time series, all of them represented using the same number of bits, $B$. This is essential to compensate for differences in the characteristics of the hardware devices (precision, resolution, etc.), and to avoid numerical problems when operating with data Eq. (2).

$\displaystyle\overrightarrow{X}[n]=\{x_{1}[n],\dots,x_{i}[n],\dots,x_{P}[n]\}$

$\displaystyle x_{i}[n]=x_{i}(nT_{s}),\quad n\in\mathbb{N}$ (2)

$\displaystyle x_{i}[n]\in[0,2^{B}-1]$
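To make the meaning of Eq. (2) concrete, the sketch below samples one analog-like component every $T_{s}$ seconds and maps each sample to an integer in $[0,2^{B}-1]$. It deliberately uses a plain uniform quantizer, which is an assumption made only for illustration and is not the encoder employed in this work; the function name and the input range `lo`, `hi` are also hypothetical.

```python
import math

def sample_and_quantize(signal, t_s, n_samples, bits, lo, hi):
    """Sample signal(t) at t = n * t_s and map each value to an
    integer in [0, 2**bits - 1] (uniform quantizer, for illustration)."""
    levels = 2 ** bits - 1
    series = []
    for n in range(n_samples):
        x = signal(n * t_s)
        x = min(max(x, lo), hi)  # clip to the assumed input range
        series.append(round((x - lo) / (hi - lo) * levels))
    return series

# A 1 Hz sine sampled at 20 Hz (T_s = 0.05 s) with B = 8 bits.
series = sample_and_quantize(lambda t: math.sin(2 * math.pi * t),
                             0.05, 20, 8, -1.0, 1.0)
```

Whatever the resolution of the original device, every component ends up as an integer series with the same number of bits, which is the property required by the following steps.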

The sampling period $T_{s}$ must be selected according to human behavior characteristics. Thus, and according to the Nyquist theorem, any sampling frequency above 20 Hz is adequate Eq. (3).

$\displaystyle f_{s}=\frac{1}{T_{s}}\geqslant f_{s-\min}=20\ \mathrm{Hz}$ (3)

This frequency may be modified according to the scenario and the considered hardware devices (for example, if cameras are also employed), but is adequate for environmental sensors, wearables and smartphones.

Considering this very reduced bandwidth, and the extremely high resolution required to maximize the precision of the following recognition algorithms, the sampling and retention scheme we are employing is a sigma-delta encoder. Figure 2 describes the block diagram of a standard $\Sigma-\Delta$ encoder.

Figure 2. Basic block diagram for a $\Sigma-\Delta$ encoder.

Now, this initial digital vector is, in general, affected by physical random phenomena such as electronic noise, interference, etc. These high-frequency components may affect the following steps, so they must be removed. To do that, an exponential smoothing filter or exponential moving average (EMA) has been proved to be the most effective technique with time series. However, people's behavior tends to evolve over the workday, which creates trends in signals that a simple EMA may remove. Besides, these trends are seasonal, as they are repeated every day. Because of these characteristics, we are not employing a simple EMA but a triple exponential smoothing (or Holt-Winters method), consisting of three EMA applied in a recursive manner. The first EMA applies an overall smoothing Eq. (4). The second EMA preserves the trends in signals Eq. (5). And the third and final EMA preserves the seasonal information Eq. (6). The smoothed time series $\overrightarrow{Y}[n]$ (output) are obtained considering three real parameters $\alpha,\beta,\gamma\in[0,1]$. Moreover, $L$ is the discrete period of the seasonal components (in Industry 4.0 scenarios, twenty-four hours).

$\displaystyle\overrightarrow{Y}[n]=\alpha\cdot\frac{\overrightarrow{X}[n]}{\overrightarrow{I}[n-L]}+(1-\alpha)\cdot(\overrightarrow{Y}[n-1]+\overrightarrow{T}[n-1])$ (4)

$\displaystyle\overrightarrow{T}[n]=\gamma\cdot(\overrightarrow{Y}[n]-\overrightarrow{Y}[n-1])+(1-\gamma)\cdot\overrightarrow{T}[n-1]$ (5)

$\displaystyle\overrightarrow{I}[n]=\beta\cdot\frac{\overrightarrow{X}[n]}{\overrightarrow{Y}[n]}+(1-\beta)\cdot\overrightarrow{I}[n-L]$ (6)
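The three recursions can be sketched for a single component of the vector as follows. The initial values are assumptions, since the text does not state them: the seasonal indices of the first period are set to 1, the level before the first sample is set to the first sample, and the initial trend is set to 0. Samples are assumed to be strictly positive so that the divisions are defined.

```python
def triple_exponential_smoothing(x, season_len, alpha, beta, gamma):
    """Apply the level, trend and seasonal recursions to one component x[n].
    season[k] stores the seasonal index I[k - L], with L = season_len."""
    season = [1.0] * season_len                # neutral first period (assumption)
    prev_level, prev_trend = float(x[0]), 0.0  # assumed Y[-1] and T[-1]
    smoothed = []
    for n, x_n in enumerate(x):
        # Level: deseasonalized sample blended with the previous forecast.
        level = alpha * x_n / season[n] + (1 - alpha) * (prev_level + prev_trend)
        # Trend: smoothed change of the level.
        trend = gamma * (level - prev_level) + (1 - gamma) * prev_trend
        # Seasonal index I[n], stored at position n + L.
        season.append(beta * x_n / level + (1 - beta) * season[n])
        smoothed.append(level)
        prev_level, prev_trend = level, trend
    return smoothed

# Hypothetical component with a period of three samples.
x = [10, 12, 14, 10, 12, 14, 10, 12, 14]
y = triple_exponential_smoothing(x, season_len=3, alpha=0.5, beta=0.3, gamma=0.1)
```

For the complete vector, the same function would be applied to each of the $P$ components independently.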

Now, in the smoothed vector of time series $\overrightarrow{Y}[n]$ , not all components will be independent. In fact, as information sources belong to a pervasive platform, they are (in general) linked by $C$ constraints. These constraints may belong to three different types. Namely:

• Physical constraints: They are due to physical laws. For example, two close ambient sensors should generate the same output.

• Design constraints: They are due to the selected technological architecture. For example, two digital sensors are programmed to generate reversed bits.

• Business constraints: These constraints are caused by mandatory business workflows and routines in the industrial scenario.

These constraints, in industrial scenarios, are typically scleronomic (i.e. they are independent of time); and besides they are holonomic (i.e. they are independent of differential operations on the coordinates). In those conditions, constraints may be expressed as simple functions Eq. (7). These functions may be employed to remove redundant components in the smoothed vector $\overrightarrow{Y}[n]$ of time series, obtaining a new generalized vector $\overrightarrow{Q}[n]$ where all components (time series) are totally independent Eq. (8).

$\displaystyle f_{j}(\overrightarrow{Y}[n])=0,\quad j=1,\dots,C$ (7)

$\displaystyle\overrightarrow{Q}[n]=\{q_{1}[n],\dots,q_{i}[n],\dots,q_{P-C}[n]\}$ (8)

$\displaystyle q_{i}[n]=q_{i}(\overrightarrow{Y}[n],n)\quad\forall i\in[1,P-C]$

In general, this new vector will have $P-C$ components. The benefit of this approach is that, now, every component may be analyzed independently from the others. We can imagine the obtained vector in a $(P-C)$-dimensional space, where notions such as the Euclidean distance are applicable. Besides, vector $\overrightarrow{Q}[n]$ has a generic format where no particularities of physical sensors affect the signals. An additional normalization process can be carried out if required.

3.2 Atomic action recognition

In this context, it is possible to evaluate the similarity of two generalized vectors (or patterns) using simple distance functions, which enables a large number of comparisons in a short time (Euclidean distances are computationally very cheap). In that way, the distance between two patterns or generalized vectors $\overrightarrow{Q_{a}}[n]$ and $\overrightarrow{Q_{b}}[n]$ at each time instant may be expressed through simple mathematical operations Eq. (9), where each vector potentially contains information about atomic actions executed by workers.

$\displaystyle d(\overrightarrow{Q_{a}}[n],\overrightarrow{Q_{b}}[n];n_{0})=\sqrt{\sum^{P-C}_{i=1}{(q^{a}_{i}[n_{0}]-q^{b}_{i}[n_{0}])}^{2}}$ (9)

However, in Industry 4.0 scenarios, atomic actions take a time period to be executed, $T_{\textit{action}}$, so the proposed Euclidean distance would evolve with time. Besides, a squared difference as the mechanism to measure the distance in each one of the $P-C$ dimensions (components) is only valid if all actions are always executed at the same speed (which is not true in human-performed actions). Therefore, we are evaluating the distance in each dimension using dynamic time warping (DTW) technologies Eq. (10), where function $\textit{dtw}(\cdot,\cdot)$ is the standard DTW algorithm [15]. Using DTW technologies, variations in the execution speed do not affect the final result, and a global estimation of the difference between two time series is directly obtained (the obtained result is independent of time). On the other hand, the DTW algorithm is only valid for signals with a similar structure. This similarity level in the signals’ format is reached in the analysis phase, where signals are orthonormalized and segmented (aligned and synchronized).

$\displaystyle d_{\textit{DTW}}(\overrightarrow{Q_{a}}[n],\overrightarrow{Q_{b}}[n])=\sqrt{\sum^{P-C}_{i=1}{\textit{dtw}^{2}(q^{a}_{i},q^{b}_{i})}}$ (10)

Theoretically, the DTW distance could be directly applied to patterns $\overrightarrow{Q_{a}}[n]$ and $\overrightarrow{Q_{b}}[n]$ (as, for example, in speech recognition systems), but this approach assumes all components inside each vector evolve at the same speed (although it tolerates that this speed is different in different vectors). Nevertheless, this assumption is not true in general in Industry 4.0 (where components represent sensors that evolve independently), so in our technology (as shown above) the DTW mechanism must be applied to every information source independently and all the obtained costs must later be aggregated into a global distance.
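A minimal Python sketch of this per-component aggregation is given below. The inner function is the classic dynamic-programming DTW with an absolute-difference local cost and no warping-window constraint; both choices are assumptions, as the text only refers to the standard algorithm. The outer function takes the square root of the sum of the squared per-component costs.

```python
import math

def dtw(a, b):
    """Classic dynamic-programming DTW cost between two 1-D sequences."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])  # local cost (assumption)
            cost[i][j] = step + min(cost[i - 1][j],      # a advances alone
                                    cost[i][j - 1],      # b advances alone
                                    cost[i - 1][j - 1])  # both advance
    return cost[len(a)][len(b)]

def dtw_distance(q_a, q_b):
    """Aggregate the per-component DTW costs into one global distance."""
    return math.sqrt(sum(dtw(ca, cb) ** 2 for ca, cb in zip(q_a, q_b)))

# Two hypothetical patterns with two components each; the second pattern
# is the first one executed at a different speed in each component.
q_a = [[0, 1, 2, 1, 0], [3, 3, 4, 4]]
q_b = [[0, 0, 1, 2, 2, 1, 0], [3, 4, 4, 4]]
distance = dtw_distance(q_a, q_b)
```

Because each component is warped on its own, the two patterns above are at distance zero even though their components are stretched by different amounts.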

As said, in this initial analysis phase, atomic actions are recognized. To do that, the $P-C$ time series making up the generalized vector $\overrightarrow{Q}[n]$ are obtained as data streams. The whole recognition process is performed in real time (as required by Industry 4.0 applications) but, for clarity, we describe the recognition process as if the whole $\overrightarrow{Q}[n]$ vector had already been received. This approach does not affect the described mathematical operations and algorithms.

This atomic action recognition process, basically, calculates the distance between the elements ${\overrightarrow{s}}_{i}$ in a repository of patterns $\mathcal{S}=\{{\overrightarrow{s}}_{i}\mid i=1,\dots,Z\}$ and the current values of the generalized vector $\overrightarrow{Q}[n]$.

The pattern repository $\mathcal{S}$ contains $Z$ different action patterns which are experimentally determined. Different users and experts (industry workers) are requested to perform those actions to capture the patterns and feed the repository. In this approach, the time required to completely analyze an industry scenario grows exponentially with the number of users and processes to be considered. Therefore, in very complex scenarios, the deployment cost of this solution might be high. In return, the control and monitoring capacity also grows in precision and efficiency.

The pattern ${\overrightarrow{s}}^{*}_{i}$ that is the closest to the generalized vector (this point is described in detail later) is selected as the atomic action being performed. However, atomic actions are characterized by being limited in time and space. Thus, several atomic actions could be performed at the same time and in the same global space. The generalized vector $\overrightarrow{Q}[n]$ will contain information about all of them, and (in this situation) the DTW distance cannot be directly calculated. Then, a problem to be solved is to segment the generalized vector into sets of samples $\overrightarrow{g_{i}}[n]$ containing only one atomic action Eq. (11), where $[n_{\textit{init}},n_{\textit{fin}}]$ is the execution period of the atomic action.

$\displaystyle\overrightarrow{g_{i}}[n]=\{q_{j}[n]\mid j\in[1,P-C],\ n\in[n_{\textit{init}},n_{\textit{fin}}]\}$ (11)

To separate atomic actions, we group the components $q_{i}[n]$ in the generalized vector $\overrightarrow{Q}[n]$ that come from devices that are together at a certain moment. To recover this geographical information at this point, it could be stored as metadata in the acquisition process. Besides, information about the user performing the action could be acquired. In this paper, this information is presented as semantic annotations (metadata) [42]. As different people may perform actions in a different manner, two areas are defined (see Fig. 3):

• The $A_{1}$ area includes all components that are close enough to be considered, with certainty, part of a single atomic action.

• The $A_{2}$ area includes components that may or may not be part of the atomic action described by devices in the $A_{1}$ area.

Limits for these areas ($r_{1}$ and $r_{2}$) are fixed according to the scenario under study. For example, in an Industry 4.0 scenario of hand-made basketry, they would take values of a few centimeters.

Figure 3. Spatial distribution of information sources.

All possible sets $\overrightarrow{g_{i}}[n]$ generated by grouping elements in $A_{2}$ in all possible ways (regardless of the order and without repeating elements) potentially represent the atomic action being performed. However, although the number of possible actions grows exponentially with the number of components in the $A_{2}$ area Eq. (12), this calculation does not take long: only combinations (atomic actions) ${\overrightarrow{s}}_{i}$ which are also present in the pattern repository $\mathcal{S}$ must be evaluated. Hereinafter, $\textit{card}\{A_{2}\}$ represents the number of elements in set $A_{2}$. If different combinations describe atomic actions stored in the repository, all of them will be considered and evaluated using the DTW distance. Components $q_{i}[n]$ which are not finally attached to any atomic action are considered context signals.

$\displaystyle\sum^{\textit{card}\{A_{2}\}}_{j=1}\binom{\textit{card}\{A_{2}\}}{j}=\sum^{\textit{card}\{A_{2}\}}_{j=1}\frac{\textit{card}\{A_{2}\}!}{j!\cdot(\textit{card}\{A_{2}\}-j)!}$ (12)
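
The pruning argument above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each pattern in the repository is identified by the set of component names it involves; the names `candidate_groupings`, `count_all_groupings` and the component labels are hypothetical and do not come from the paper.

```python
from itertools import combinations
from math import comb


def count_all_groupings(n):
    """Number of non-empty groupings of n components, i.e. the sum in Eq. (12)."""
    return sum(comb(n, j) for j in range(1, n + 1))


def candidate_groupings(a2_components, repository):
    """Return only the groupings of A2 that also appear in the repository.

    `repository` is assumed to be a list of tuples of component names.
    """
    stored = {frozenset(pattern) for pattern in repository}
    kept = []
    for size in range(1, len(a2_components) + 1):
        for subset in combinations(a2_components, size):
            if frozenset(subset) in stored:
                kept.append(subset)
    return kept


a2 = ["q1", "q2", "q3"]
repo = [("q1", "q2"), ("q3",)]
print(count_all_groupings(len(a2)))    # all groupings that could be formed
print(candidate_groupings(a2, repo))   # groupings that actually need a DTW evaluation
```

The sum in Eq. (12) equals $2^{\textit{card}\{A_{2}\}}-1$, so the repository lookup is what keeps the number of DTW evaluations small in this sketch.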

Segmenting time series $q_{i}[n]$ into time intervals $(n_{\textit{init}}-n_{\textit{fin}})$ describing only one atomic action is a more complex problem.

To perform this segmentation, we propose a sliding window scheme. The window $s[n]$ has a square envelope and a duration of $D$ samples. In addition, we define a core $co[n]$ in the center of this window with a duration of $D_{c}$ samples. The sliding window moves with an overlap of $D_{o}$ samples, which must include at least one sample of the window core (see Fig. 4, where the dashed window is drawn only for clarity and represents the window at the previous time instant). Parameter $D_{c}$ is selected to fit the fastest atomic action, while $D$ is selected to fit the slowest atomic action, possibly including a certain error margin. Finally, $D_{o}$ is selected to match the average transition period between atomic actions performed by workers. The objective of this window structure is to locate all samples belonging to the same atomic action.

Figure 4.

Sliding window mechanism in the analysis phase.

The proposed solution operates as follows (see Algorithm 1). The windowed time series (segment) is compared to the patterns (using the proposed DTW technique), taking as the initial sample every sample from the first one to the $\frac{(D-D_{c})}{2}$-th sample. At the same time, the final sample is selected in the range $[\frac{(D+D_{c})}{2},D]$. A maximum admissible distance $d_{\max}$ is defined. If no path has a cost below $d_{\max}$, the sliding window advances $D-D_{o}$ samples, and all these samples are considered empty noise between actions. If some paths have a cost below $d_{\max}$, the longest path (the one with the most steps) is selected as the segment describing the atomic action, and the sliding window advances $D-D_{o}$ samples.

Algorithm 1: Time segmentation of time series
Input Pattern $q_{a}[n]$ and time series $q_{b}[n]$
Output Detection of $q_{a}[n]$ pattern in $q_{b}[n]$ or not
Integer $i=0$
while $q_{b}[n]$ is generating more data do
  Calculate $q_{c}=s[n]\cdot q_{b}[n+i\cdot(D-D_{o})]$
  Create the final distance between patterns $d_{f}$
  Create the warping path $w_{f}$ with zero length
  for each value of $j\in[0,\frac{(D-D_{c})}{2}]$ do
    for each value of $k\in[\frac{(D+D_{c})}{2},D]$ do
      Calculate a square window $s_{2}[n]$ with non-zero values in samples between $j$ and $k$
      Calculate $d=\textit{dtw}(q_{a},s_{2}[n]\cdot q_{c})$ and optimum path $w^{*}$
      if $d\leqslant d_{\max}$ and $w^{*}$ is larger than $w_{f}$ then
        $w_{f}$ is equal to $w^{*}$
        $d_{f}$ is equal to $d$
      end if
    end for
  end for
  if $w_{f}$ has a non-zero length then
    return event pattern $q_{a}$ has been detected with distance $d_{f}$
  end if
  Increment $i$ in one unit
end while
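
A minimal Python sketch of Algorithm 1 is given below for illustration. It makes several assumptions that are not part of the paper: a textbook DTW with absolute-difference cost stands in for the proposed DTW technique, the square window $s_{2}[n]$ is emulated by slicing rather than zero-masking, an offline list replaces the data stream, and all function names and numeric values are hypothetical.

```python
def dtw(a, b):
    """Textbook DTW; returns (accumulated cost, number of steps of the optimal path)."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    steps = [[0] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Cheapest predecessor; ties are broken in favour of the shorter path.
            prev_cost, prev_steps = min(
                (cost[i - 1][j - 1], steps[i - 1][j - 1]),
                (cost[i - 1][j], steps[i - 1][j]),
                (cost[i][j - 1], steps[i][j - 1]),
            )
            cost[i][j] = abs(a[i - 1] - b[j - 1]) + prev_cost
            steps[i][j] = prev_steps + 1
    return cost[n][m], steps[n][m]


def detect_in_window(pattern, window, d_core, d_max):
    """Try every admissible (start, end) pair; keep the longest path under d_max."""
    d = len(window)
    best = None  # (path length, distance, start, end)
    for j in range(0, (d - d_core) // 2 + 1):
        for k in range((d + d_core) // 2, d + 1):
            dist, length = dtw(pattern, window[j:k])
            if dist <= d_max and (best is None or length > best[0]):
                best = (length, dist, j, k)
    return best


def segment(pattern, series, d, d_core, d_overlap, d_max):
    """Slide a window of d samples with hop d - d_overlap over an offline series."""
    detections = []
    for start in range(0, len(series) - d + 1, d - d_overlap):
        hit = detect_in_window(pattern, series[start:start + d], d_core, d_max)
        if hit is not None:
            _, dist, j, k = hit
            detections.append((start + j, start + k, dist))
    return detections


series = [5, 5, 0, 1, 2, 1, 0, 5, 5, 5, 5, 5]
print(segment([0, 1, 2, 1, 0], series, d=8, d_core=4, d_overlap=4, d_max=0.5))
```

In this toy run the pattern sits inside the first window, so a single detection spanning samples 2 to 7 (end exclusive) is reported, and the second window is treated as empty noise.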

This sliding window and segmentation process is mainly meant to reduce false negatives, thereby increasing the system recall. In noiseless scenarios, DTW techniques tolerate the addition or removal of several samples from the signals. In noisy scenarios, however, distance thresholds must be more restrictive to avoid false positives, so the segmentation process is essential to ensure that all samples and contributions are considered.

In the most basic approach, the detected atomic actions ${\overrightarrow{s}}^{*}_{i}$ are those which are closest to each segment $\overrightarrow{g_{i}}[n]$. However, this approach is very weak, so we employ a k-nearest neighbors (K-NN) algorithm [33], modified to adapt to Industry 4.0 scenarios. First, if no pattern is closer than a certain threshold distance $d_{th}$, no action is recognized. Second, as some atomic actions may be more common in the pattern repository $\mathcal{S}$ than others, not all neighbors can be treated in the same manner. Contributions to the estimation function must therefore be weighted, Eq. (13), using a real parameter $\alpha_{j}$. Basically, patterns that are closer than a distance of $d_{\textit{break}}$ units are considered “close actions” and weighted in a similar way. Patterns further than $d_{\textit{break}}$ units are considered different actions and are penalized.

$\displaystyle\textit{action}\ {\overrightarrow{s}}^{*}_{i}\leftarrow\textit{argmax}_{l\in\textit{REPO}_{\textit{LAB}}}\left(\sum^{K}_{j=1}\alpha_{j}\cdot\delta[l,\textit{label}({\overrightarrow{s}}^{g,j}_{\textit{K-close}})]\cdot\delta[\textit{true},d_{\textit{DTW}}({\overrightarrow{s}}^{g,j}_{\textit{K-close}},\overrightarrow{g_{i}})\leqslant d_{th}]\right)$ (13)

This weighting parameter $\alpha_{j}$ is calculated through a function that may take different expressions. In this work, we consider a piecewise linear function that depends on the distance between patterns, Eq. (14). Besides, parameters $\beta_{1}$ and $\beta_{2}$ must fulfill a relation in order to define a valid (continuous) weighting function, Eq. (15).

$\displaystyle\alpha_{j}=\alpha(d)=\alpha(d_{\textit{DTW}}({\overrightarrow{s}}^{g,j}_{\textit{K-close}},\overrightarrow{g_{i}}))=\left\{\begin{array}[]{ll}-\beta_{1}\cdot d+1&\text{if }d\leqslant d_{\textit{break}}\\ -\beta_{2}\cdot(d-d_{th})&\text{if }d>d_{\textit{break}}\\ \end{array}\right.$ (14)

$\displaystyle\beta_{2}=\frac{\beta_{1}\cdot d_{\textit{break}}-1}{d_{\textit{break}}-d_{th}}$ (15)
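
As an illustration of Eqs (13) to (15), the following Python sketch implements the weighted vote. The numeric values of $\beta_{1}$, $d_{\textit{break}}$ and $d_{th}$, the labels, and the function names are assumptions chosen only to make the example concrete.

```python
from collections import defaultdict


def alpha(d, beta1, d_break, d_th):
    """Piecewise linear weight of Eq. (14); beta2 is derived from Eq. (15)."""
    beta2 = (beta1 * d_break - 1.0) / (d_break - d_th)
    if d <= d_break:
        return -beta1 * d + 1.0
    return -beta2 * (d - d_th)


def weighted_knn(neighbors, k, beta1, d_break, d_th):
    """Weighted vote of Eq. (13).

    `neighbors` is a list of (dtw_distance, label) pairs. Returns the winning
    label, or None when no neighbor is closer than d_th.
    """
    votes = defaultdict(float)
    for dist, label in sorted(neighbors, key=lambda pair: pair[0])[:k]:
        if dist <= d_th:  # second delta term in Eq. (13)
            votes[label] += alpha(dist, beta1, d_break, d_th)
    if not votes:
        return None
    return max(votes, key=votes.get)


neighbors = [(0.2, "cut"), (1.5, "weave"), (1.6, "weave"), (3.0, "weave")]
print(weighted_knn(neighbors, k=3, beta1=0.5, d_break=1.0, d_th=2.0))
```

With these values, the single close neighbor outweighs the two distant ones, whereas an unweighted majority vote would have selected the more frequent label.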

After recognizing the atomic actions being performed, all components $q_{i}$ describing any of these actions are not considered anymore. Components $q_{i}$ which have not been identified to be part of any atomic action are injected into the following steps as context information.

3.3 Modeling phase: User activities recognition

At this point, we have obtained two data structures. The first is a set $\overrightarrow{A}$ of recognized atomic actions, each labeled with a discrete timestamp $T$ and a piece of information $\mathfrak{u}$ indicating the user that performed the action. Although any user recognition solution could be employed to collect this piece of information $\mathfrak{u}$, we employ a deterministic scheme in order to reduce the acquisition cost. Wearable sensors and smartphones are directly associated with specific users (so each sensor monitors one worker), while environmental sensors may monitor several different users. Regarding the timestamp, in this model we assume that computational and acquisition delays are constant and independent of the specific action being recognized or the sensor being employed. Actions are therefore recognized in the same order in which they are performed, and they can easily be aggregated following a strict temporal order.

In the general case, we consider that $V$ atomic actions are recognized in $T_{0}$ time units, Eq. (16).

$\displaystyle\overrightarrow{A}=\{a^{T,\mathfrak{u}}_{i}\ i=1,\dots,V;\ T=1,\dots,T_{0};\ \mathfrak{u}=1,\dots,N\},\quad\textit{being }a^{T,\mathfrak{u}}_{i}=\textit{label}({\overrightarrow{s}}^{*}_{i})\ \textit{for}\ n=T$ (16)

The second data structure is a set of context signals $\overrightarrow{C}$ (those time series that do not contain data describing any atomic action and are not empty noise, as explained in the previous section), Eq. (17).

$\displaystyle\overrightarrow{C}=\{c_{m}[n]=q_{j}[n]\ \textit{being}\ q_{j}[n]\notin\overrightarrow{g_{i}}[n]\ \forall j,i,n\}$ (17)

In the modeling phase, these two data structures are employed to evaluate user action and context models, each with a different approach (Conditional Random Fields and Neural Networks, respectively). As a result, user actions and context labels are recognized.

First, we discuss how user actions are recognized.

We consider a repository $\mathcal{U}$ of user actions, where actions ${\overrightarrow{U}}_{i}$ are described as sequences of atomic actions $u^{j}_{i}$, Eq. (18). This repository is easily built by monitoring the industrial scenario and recognizing atomic actions in a supervised manner.

$\displaystyle\mathcal{U}=\{{\overrightarrow{U}}_{i}\ i=1,\dots,K_{u}\},\quad{\overrightarrow{U}}_{i}=\{u^{j}_{i}\ j=1,\dots,P_{u}\}$ (18)

As noted for the atomic action repository, the cost of supervising users and creating the repository of user actions is not negligible. Specifically, this cost grows exponentially with the number of workers and activities under consideration.

We now evaluate the conditional probability that a user $\mathfrak{u}$ is executing a certain user action ${\overrightarrow{U}}_{i}$ given the observed and recognized atomic actions $\overrightarrow{A}$, Eq. (19). The user action ${\overrightarrow{U}}_{i}$ maximizing this conditional probability is the recognized user action $\overrightarrow{U^{*}}$, Eq. (20).

$\displaystyle P({\overrightarrow{U}}_{i}|\overrightarrow{A})$ (19)

$\displaystyle\overrightarrow{U^{*}}\leftarrow\textit{argmax}_{{\overrightarrow{U}}_{i}\in\mathcal{U}}(P({\overrightarrow{U}}_{i}|\overrightarrow{A}))$ (20)

However, set $\overrightarrow{A}$ contains atomic actions performed by different users and, besides, atomic actions belonging to different user actions. We must therefore split set $\overrightarrow{A}$ into different subsets before applying the discriminative model, Eq. (20). Figure 5 presents the proposed splitting mechanism, which may even operate in real time.

Figure 5.

Aggregation process of atomic actions in the modeling phase.

Then, $N$ different subsets $\overrightarrow{A_{\mathfrak{u}}}$ are obtained, one for each user in the scenario, Eq. (21). This process may be easily performed using the piece of information $\mathfrak{u}$.

$\displaystyle\overrightarrow{A_{\mathfrak{u}}}=\{a^{T,j}_{i}\ j=\mathfrak{u}\ \forall i,n\}$ (21)

Now, the atomic actions in each subset $\overrightarrow{A_{\mathfrak{u}}}$ must be separated into new subsets describing only one user action, so that they can be compared to the patterns stored in repository $\mathcal{U}$. This process is based on the timestamp of atomic actions and a sliding window (see Fig. 6, where the dashed window is drawn only for clarity and represents the window at the previous time instant).

Figure 6.

Sliding window mechanism in the modeling phase.

Recognized atomic actions are ordered as a time series, according to their timestamp. Then, a window $e[n]$ with a square envelope is employed to aggregate $E$ atomic actions in a subset $\overrightarrow{A_{\mathfrak{u},[n_{\textit{init}},n_{\textit{fin}}]}}$ Eq. (22).

$\displaystyle\overrightarrow{A_{\mathfrak{u},[n_{\textit{init}},n_{\textit{fin}}]}}=\{\overrightarrow{A_{\mathfrak{u}}}[n]\ n\in[n_{\textit{init}},n_{\textit{fin}}]\}$ (22)

Besides, we define a core $eo[n]$ in this window with a width of $E_{c}$ actions. The sliding window moves with no overlap: each window starts exactly where the previous one finishes. Contrary to the analysis phase, where empty noise may appear in time series, in this scenario every atomic action must be part of a user action. Thus, the sliding window does not have a centered structure (see Fig. 4) but a left-aligned one (see Fig. 6), and no overlap is considered. In any case, the purpose of this structure is the same: to locate all atomic actions belonging to a user action, which are spread along a segment whose duration varies between the minimum ($E_{c}$) and the maximum ($E$).

The windowed (aggregated) atomic actions $\overrightarrow{A_{\mathfrak{u},[n_{\textit{init}},n_{\textit{fin}}]}}$ are compared to patterns (using the conditional probability) in the user activity repository $\mathcal{U}$ . Different possible combinations will be considered, by selecting the final atomic action in the range $[E_{c},E]$ . All possible combinations are evaluated. The one with the highest probability is selected as the performed and recognized user action $\overrightarrow{U^{*}}$ . All the included atomic actions in that user action are removed, and the sliding window moves to start exactly where the previous user action finished. Algorithm 2 describes the proposed mechanism.

Algorithm 2: User action recognition process
Input Recognized atomic actions $\overrightarrow{A}$
Output Recognized user actions $\overrightarrow{U^{*}_{i}}$
Create variable $\textit{prob}_{\max}\leftarrow 0$ and $k_{\max}\leftarrow E_{c}$
while $\overrightarrow{A}$ contains atomic actions do
  Read label $\mathfrak{u}$ of atomic action $a^{T,\mathfrak{u}}_{i}$
  Store $a^{T,\mathfrak{u}}_{i}$ in $\overrightarrow{A_{\mathfrak{u}}}$
  for each value of $j\in[1,N]$ do
    if $\overrightarrow{A_{j}}$ contains at least $E$ atomic actions then
      for each value of $k\in[E_{c},E]$ do
        Select the first $k$ atomic actions in $\overrightarrow{A_{j}}$ and store them in a set $\overrightarrow{A_{\mathfrak{u},k}}$
        for each user action ${\overrightarrow{U}}_{i}$ in $\mathcal{U}$ do
          Estimate the probability $P({\overrightarrow{U}}_{i}|\overrightarrow{A_{\mathfrak{u},k}})$
          if $P({\overrightarrow{U}}_{i}|\overrightarrow{A_{\mathfrak{u},k}})>\textit{prob}_{\max}$ then
            $\overrightarrow{U^{*}_{i}}={\overrightarrow{U}}_{i}$
            $\textit{prob}_{\max}=P({\overrightarrow{U}}_{i}|\overrightarrow{A_{\mathfrak{u},k}})$
            $k_{\max}=k$
          end if
        end for
      end for
    end if
    Remove all atomic actions in $\overrightarrow{A_{\mathfrak{u},k_{\max}}}$ from $\overrightarrow{A}$
    Return $\overrightarrow{U^{*}_{i}}$
  end for
end while
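
The following Python sketch illustrates the per-user split of Eq. (21) and the left-aligned window of Algorithm 2. It is a simplified reading under stated assumptions: the probability estimate is passed in as a function (here a hypothetical exact-match stand-in, `toy_prob`), the best score is reset for every window, and all action names are invented for the example.

```python
from collections import defaultdict


def split_by_user(actions):
    """Group (timestamp, user, label) records per user, in temporal order (Eq. (21))."""
    per_user = defaultdict(list)
    for _, user, label in sorted(actions):
        per_user[user].append(label)
    return per_user


def recognize(labels, repository, prob, e_core, e_max):
    """Left-aligned, non-overlapping window over the atomic actions of one user."""
    recognized = []
    pos = 0
    while len(labels) - pos >= e_max:
        best_prob, best_name, best_k = 0.0, None, e_core
        for k in range(e_core, e_max + 1):
            window = labels[pos:pos + k]
            for name, pattern in repository.items():
                p = prob(pattern, window)
                if p > best_prob:
                    best_prob, best_name, best_k = p, name, k
        recognized.append(best_name)
        pos += best_k  # the next window starts where this user action ended
    return recognized


def toy_prob(pattern, window):
    """Hypothetical stand-in for P(U_i | A_{u,k}): 1 for an exact match, else 0."""
    return 1.0 if list(pattern) == list(window) else 0.0


repository = {"assemble": ["pick", "place"], "inspect": ["look", "mark", "store"]}
actions = [(1, "u1", "pick"), (2, "u2", "look"), (3, "u1", "place"),
           (4, "u1", "look"), (5, "u1", "mark"), (6, "u1", "store"),
           (7, "u1", "pick")]
per_user = split_by_user(actions)
print(recognize(per_user["u1"], repository, toy_prob, e_core=2, e_max=3))
```

As in Algorithm 2, a user is only processed once at least $E$ atomic actions are buffered, so the trailing action of user `u1` and the single action of user `u2` remain pending.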

At this point, it only remains to discuss how the conditional probability $P({\overrightarrow{U}}_{i}|\overrightarrow{A_{\mathfrak{u},k}})$ may be numerically evaluated from the observed atomic actions $\overrightarrow{A}$ and the patterns in the user action repository $\mathcal{U}$.

Each of the random variables considered in the conditional probability (${\overrightarrow{U}}_{i}$ and $\overrightarrow{A_{\mathfrak{u},k}}$) is, in fact, a vector of $P_{u}$ and $k$ (respectively) atomic actions (or elemental random variables), Eq. (23). As seen before, these vectors represent the sequence of atomic actions $a^{T,\mathfrak{u}}_{i}$ performed by a certain user $\mathfrak{u}$ during a certain time period $[n_{\textit{init}},n_{\textit{fin}}]$.

$\displaystyle P({\overrightarrow{U}}_{i}|\overrightarrow{A_{\mathfrak{u},k}})=P(\{u^{j}_{i}\ j=1,\dots,P_{u}\}|\{a^{T_{i},\mathfrak{u}}_{i}\ i=1,\dots,k\})$ (23)

In real Industry 4.0 scenarios, despite the variable and flexible character of activities, actions tend to be performed following a minimum common structure (according to the production process, for example). Thus, once a certain atomic action $a^{T_{i},\mathfrak{u}}_{i}$ is observed, the probability distribution of the next atomic action $a^{T_{i+1},\mathfrak{u}}_{i+1}$ varies (for example, atomic actions belonging to the same production process as the first action $a^{T_{i},\mathfrak{u}}_{i}$ will be more probable). In other words, observed atomic actions depend on each other. However, in this work, we consider all atomic actions to be independent. This assumption introduces a certain error, but the resulting model is more stable and the error is easier to evaluate (it is aggregated into the model’s precision).

Under those conditions, the conditional probability can be expanded as a product of partial probabilities (one for each atomic action $u^{j}_{i}$ in ${\overrightarrow{U}}_{i}$). Besides, to allow this mechanism to be implemented using software tools, it must be causal, i.e. it must depend only on the atomic actions $a^{T_{i},\mathfrak{u}}_{i}$ observed currently or in the past, but not in the future, even though $\overrightarrow{A_{\mathfrak{u},k}}$ contains all of them, Eq. (24). Here, $Z$ is a parameter that keeps the global value of the product in the range $[0,1]$, according to Kolmogorov’s definition of probability.

$\displaystyle P({\overrightarrow{U}}_{i}|\overrightarrow{A_{\mathfrak{u},k}})=\frac{1}{Z}\prod^{P_{u}}_{j=1}{P(u^{j}_{i}|\{a^{T_{l},\mathfrak{u}}_{l}\ l=1,\dots,j\})}$ (24)

At this point, in order to improve the precision of the model, we can add (artificially) information about previously recognized user actions, contained in the set ${\mathcal{U}}^{*}$ . This new information takes the form of new conditions in the conditional probability Eq. (25).

$\displaystyle P({\overrightarrow{U}}_{i}|\overrightarrow{A_{\mathfrak{u},k}})=\frac{1}{Z}\prod^{P_{u}}_{j=1}{P(u^{j}_{i}|\{a^{T_{l},\mathfrak{u}}_{l}\ l=1,\dots,j\},{\mathcal{U}}^{*})}$ (25)

Each elemental probability $P(u^{j}_{i}|\{a^{T_{l},\mathfrak{u}}_{l}\ l=1,\dots,j\},{\mathcal{U}}^{*})$ must now be evaluated numerically, so it can be understood as an unknown function $f_{j}$ depending on atomic actions $u^{j}_{i}$ and $a^{T_{l},\mathfrak{u}}_{l}$ and the set ${\mathcal{U}}^{*}$. Then, expression Eq. (25) may be rewritten in a more compact manner, Eq. (26).

$\displaystyle P({\overrightarrow{U}}_{i}|\overrightarrow{A_{\mathfrak{u},k}})=\frac{1}{Z}\prod^{P_{u}}_{j=1}{f_{j}(u^{j}_{i},\{a^{T_{l},\mathfrak{u}}_{l}\ l=1,\dots,j\},{\mathcal{U}}^{*})}$ (26)

Now, as humans might freely perform any action at any time, function $f_{j}(u^{j}_{i},\{a^{T_{l},\mathfrak{u}}_{l}\ l=1,\dots,j\},{\mathcal{U}}^{*})$ never takes the zero value (no action is an impossible event). Then, function $f_{j}$ may be understood as a Gibbs random field (GRF), whose probability distribution is based on exponential functions, Eq. (27), where $H_{j}(u^{j}_{i},\{a^{T_{l},\mathfrak{u}}_{l}\ l=1,\dots,j\},{\mathcal{U}}^{*})$ is a new function called the energy function of the GRF. Besides, with this new view, parameter $Z$ may be calculated as the partition function of the GRF, Eq. (28).

$\displaystyle f_{j}(u^{j}_{i},\{a^{T_{l},\mathfrak{u}}_{l}\ l=1,\dots,j\},{\mathcal{U}}^{*})=e^{-H_{j}(u^{j}_{i},\{a^{T_{l},\mathfrak{u}}_{l}\ l=1,\dots,j\},{\mathcal{U}}^{*})}$ (27)

$\displaystyle Z=\sum_{\forall{\overrightarrow{U}}_{i}\in\mathcal{U}}\prod^{P_{u}}_{j=1}f_{j}(u^{j}_{i},\{a^{T_{l},\mathfrak{u}}_{l}\ l=1,\dots,j\},{\mathcal{U}}^{*})=\sum_{\forall{\overrightarrow{U}}_{i}\in\mathcal{U}}\prod^{P_{u}}_{j=1}{e^{-H_{j}(u^{j}_{i},\{a^{T_{l},\mathfrak{u}}_{l}\ l=1,\dots,j\},{\mathcal{U}}^{*})}}$ (28)

On the other hand, as function $f_{j}(u^{j}_{i},\{a^{T_{l},\mathfrak{u}}_{l}\ l=1,\dots,j\},{\mathcal{U}}^{*})$ is a GRF, we can apply the first lemma of the Hammersley-Clifford theorem. Then, function $f_{j}$ may be factorized into two terms, $f^{a}_{j}$ and $f^{it}_{j}$, Eq. (29), separating the influence of the observed atomic actions $a^{T_{l},\mathfrak{u}}_{l}$ from that of the previously recognized user actions.

$\displaystyle f_{j}(u^{j}_{i},\{a^{T_{l},\mathfrak{u}}_{l}\ l=1,\dots,j\},{\mathcal{U}}^{*})=f^{a}_{j}(u^{j}_{i},\{a^{T_{l},\mathfrak{u}}_{l}\ l=1,\dots,j\})\cdot f^{it}_{j}(u^{j}_{i},{\mathcal{U}}^{*})$ (29)

$f^{a}_{j}$ is called the “action function” and represents the influence of observed atomic actions. $f^{it}_{j}$ is called the “interaction function” and represents the influence of previously recognized user actions.

Each factor, as said before, Eq. (27), may be expressed as an exponential function of an energy function. This induces a new factorization, Eq. (30).

$\displaystyle f_{j}(u^{j}_{i},\{a^{T_{l},\mathfrak{u}}_{l}\ l=1,\dots,j\},{\mathcal{U}}^{*})=e^{-H^{a}_{j}(u^{j}_{i},\{a^{T_{l},\mathfrak{u}}_{l}\ l=1,\dots,j\})}\cdot e^{-H^{it}_{j}(u^{j}_{i},{\mathcal{U}}^{*})}=\exp(-H^{a}_{j}(u^{j}_{i},\{a^{T_{l},\mathfrak{u}}_{l}\ l=1,\dots,j\})-H^{it}_{j}(u^{j}_{i},{\mathcal{U}}^{*}))$ (30)

Now, we can rewrite the expression for the conditional probability, together with the partition function $Z$, considering the GRF, Eq. (31).

$\displaystyle P({\overrightarrow{U}}_{i}|\overrightarrow{A_{\mathfrak{u},k}})=\frac{1}{Z}\prod^{P_{u}}_{j=1}{e^{-H_{j}(u^{j}_{i},\{a^{T_{l},\mathfrak{u}}_{l}\ l=1,\dots,j\},{\mathcal{U}}^{*})}}=\frac{1}{Z}\exp\left(\sum^{P_{u}}_{j=1}\left[-H^{a}_{j}(u^{j}_{i},\{a^{T_{l},\mathfrak{u}}_{l}\ l=1,\dots,j\})-H^{it}_{j}(u^{j}_{i},{\mathcal{U}}^{*})\right]\right)$ (31)

$\displaystyle Z=\sum_{\forall{\overrightarrow{U}}_{i}\in\mathcal{U}}\exp\left(\sum^{P_{u}}_{j=1}\left[-H^{a}_{j}(u^{j}_{i},\{a^{T_{l},\mathfrak{u}}_{l}\ l=1,\dots,j\})-H^{it}_{j}(u^{j}_{i},{\mathcal{U}}^{*})\right]\right)$

Functions $H^{a}_{j}$ and $H^{it}_{j}$ must be selected to represent the restrictions of industrial processes and human behavior. To do that, we need a learning process that captures this information in a systematic manner. However, to allow the use of existing learning mechanisms, we must rewrite functions $H^{a}_{j}$ and $H^{it}_{j}$ in another manner, Eqs (32) and (33).

$\displaystyle H^{a}_{j}=-\sum^{j}_{l=1}{\theta^{a}_{l,j}\cdot g^{a}_{l,j}(u^{j}_{i},a^{T_{l},\mathfrak{u}}_{l})}$ (32)

$\displaystyle H^{it}_{j}=-\sum_{\forall\overrightarrow{U^{*}}\in{\mathcal{U}}^{*}}{\theta^{it}_{\overrightarrow{U^{*}},j}\cdot g^{it}_{\overrightarrow{U^{*}},j}(u^{j}_{i},\overrightarrow{U^{*}})}$ (33)

Functions $g^{a}_{l,j}(u^{j}_{i},a^{T_{l},\mathfrak{u}}_{l})$ and $g^{it}_{\overrightarrow{U^{*}},j}(u^{j}_{i},\overrightarrow{U^{*}})$ are unitary functions. Typically, they can be expressed as combinations of Kronecker delta functions. In the simplest formulation, each of them is a single delta function, Eqs (34) and (35).

$\displaystyle g^{a}_{l,j}(u^{j}_{i},a^{T_{l},\mathfrak{u}}_{l})=\delta[u^{j}_{i},a^{T_{l},\mathfrak{u}}_{l}]=\left\{\begin{array}[]{cl}1&\text{if }u^{j}_{i}=a^{T_{l},\mathfrak{u}}_{l}\\ 0&\text{otherwise}\\ \end{array}\right.$ (34)

$\displaystyle g^{it}_{\overrightarrow{U^{*}},j}(u^{j}_{i},\overrightarrow{U^{*}})=\delta[u^{j}_{i},\overrightarrow{U^{*}}]=\left\{\begin{array}[]{cl}1&\text{if }u^{j}_{i}\in\overrightarrow{U^{*}}\\ 0&\text{otherwise}\\ \end{array}\right.$ (35)

Parameters $\theta^{a}_{l,j}$ and $\theta^{it}_{\overrightarrow{U^{*}},j}$ are real values that weight the contribution of each function and must be learnt automatically. Different strategies could be employed to learn those parameters, but in order to select the optimal weighting scheme we represent them as a vector $\Theta^{*}$, Eq. (36). With this new formulation, the obtained model is formally identical to a General Conditional Random Field (GCRF), although the deduction process and the meaning of each element are different. Mathematically, therefore, the same numerical methods employed to train a GCRF may be employed in our case. In particular, we employ an optimization algorithm that maximizes the log-likelihood.

$\displaystyle\Theta^{*}=\{\theta^{a}_{l,j},\theta^{it}_{\overrightarrow{U^{*}},j};\ l=1,\dots,j;\ j=1,\dots,P_{u};\ \forall\overrightarrow{U^{*}}\in{\mathcal{U}}^{*}\}$ (36)
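
To make Eqs (31) to (35) concrete, the sketch below evaluates the normalized probability for a toy repository in Python. It assumes, only for illustration, one shared weight `theta_a` and one shared weight `theta_it` instead of the learnt per-index parameters in $\Theta^{*}$; the function names, weights and action labels are hypothetical.

```python
import math


def log_potential(pattern, observed, previous, theta_a, theta_it):
    """Sum over j of -(H^a_j + H^it_j), using the delta functions of Eqs (34)-(35)."""
    total = 0.0
    for j, u in enumerate(pattern, start=1):
        # Action term: only atomic actions observed up to position j (causality).
        for a in observed[:j]:
            if u == a:
                total += theta_a
        # Interaction term: previously recognized user actions containing u.
        for prev in previous:
            if u in prev:
                total += theta_it
    return total


def posterior(repository, observed, previous, theta_a=1.0, theta_it=0.5):
    """Normalized probabilities over the repository, Eq. (31)."""
    scores = {
        name: math.exp(log_potential(p, observed, previous, theta_a, theta_it))
        for name, p in repository.items()
    }
    z = sum(scores.values())  # partition function Z
    return {name: s / z for name, s in scores.items()}


repository = {"assemble": ["pick", "place"], "inspect": ["look", "mark"]}
probs = posterior(repository, observed=["pick", "place"], previous=[])
print(max(probs, key=probs.get), probs)
```

Because $Z$ sums the same exponential over every user action in the repository, the returned values add up to one, and the user action whose atomic actions match the observations receives the largest share.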

In our approach, we do not use a generic model, but one that is adapted to industrial scenarios from the outset, starting with its initial mathematical definition. This is a novelty compared to existing solutions; it yields a relevant increase in system precision, which justifies the higher processing delay.

3.4 Modeling phase: Context label recognition

We now turn to the context signals, Eq. (17), which must be transformed into high-level context labels in this phase (to enable the business action recognition process in the final phase).

In this case, context series do not represent a behavior as complex as human behavior (like atomic actions), but the evolution of the environment (which is, in general, much slower, and predictable). Thus, a more standard approach may be employed to create context labels from these context signals.

Figure 7.

Implementation of the context recognition module.

First, we define a set $\Lambda$ of context labels $\lambda_{c}$ including $\mathfrak{L}$ different labels, Eq. (37). The objective of this new module (context recognition, see Fig. 1) is to deduce which context labels are applicable at each time instant, according to the information contained in the context signals. This labeling problem is well suited to neural networks, so we employ a recognition module based on this technology (see Fig. 7).

$\displaystyle\Lambda=\{\lambda^{{i}}_{c}\ i=1,\dots,\mathfrak{L}\}$ (37)

Table 2

Features extracted from context signal segments

Feature	Mathematical expression
Maximum value	$\max\{b^{m}_{r}[n]\}$
Minimum value	$\min\{b^{m}_{r}[n]\}$
First maximum	$n\mid b^{m}_{r}[n]=\max\{b^{m}_{r}[n]\}$
First minimum	$n\mid b^{m}_{r}[n]=\min\{b^{m}_{r}[n]\}$
$p$-th raw moment	$\frac{1}{W_{c}}\sum^{W_{c}-1}_{n=0}{{(b^{m}_{r}[n])}^{p}}$
$p$-th central moment	$\frac{1}{W_{c}}\sum^{W_{c}-1}_{n=0}{{(b^{m}_{r}[n]-E[b^{m}_{r}])}^{p}}$
$p$-th standardized moment	$\frac{1}{{(E[{(b^{m}_{r}-E[b^{m}_{r}])}^{2}])}^{p/2}}\frac{1}{W_{c}}\sum^{W_{c}-1}_{n=0}{{(b^{m}_{r}[n]-E[b^{m}_{r}])}^{p}}$
Median	$b^{m}_{r}[\lfloor\frac{W_{c}+1}{2}\rfloor]$
$p$-th quartile	$b^{m}_{r}[\lfloor\frac{p(W_{c}+1)}{4}\rfloor]$
Entropy	$-\sum^{W_{c}-1}_{n=0}{P(b^{m}_{r}[n])}\,{\log}_{2}P(b^{m}_{r}[n])$
Mean of gradient signal	$\frac{1}{W_{c}}\sum^{W_{c}-1}_{n=0}{\frac{\left|b^{m}_{r}[n]-b^{m}_{r}[n-1]\right|}{\max\{b^{m}_{r}[n]\}}}$
Mean of Laplacian signal	$\frac{1}{W_{c}}\sum^{W_{c}-1}_{n=0}{\frac{\left|b^{m}_{r}[n+1]-2b^{m}_{r}[n]+b^{m}_{r}[n-1]\right|}{\max\{b^{m}_{r}[n]\}}}$

In the proposed context recognition module, we first segment the context signals $c_{m}[n]$ using a sliding rectangular window $sw[n]$ with a width of $W_{c}$ samples, Eq. (38). In this case, the window moves with no overlap. Then, from each segment $b^{m}_{r}[n]$, Eq. (39), a feature vector $\overrightarrow{V^{m}_{r}}$ is extracted, Eq. (40), including statistical and waveform characteristics. The extracted features (see Table 2) were selected because they are reported to be good-quality features for context recognition [25].

$\displaystyle sw[n]=\left\{\begin{array}[]{cl}1&\text{if }0\leqslant n\leqslant W_{c}\\ 0&\text{otherwise}\\ \end{array}\right.$ (38)

$\displaystyle b^{m}_{r}[n]=c_{m}[n]\cdot sw[n+r\cdot W_{c}]$ (39)

$\displaystyle\overrightarrow{V^{m}_{r}}=\{v^{i}_{r,m}\ i=1,\dots,F\}$ (40)

Then, for each position of the sliding window, we obtain a large vector $\overrightarrow{V_{r}}$ including all features extracted from all context signal segments $b^{m}_{r}[n]$, Eq. (41).

$\displaystyle\overrightarrow{V_{r}}=\{\overrightarrow{V^{m}_{r}}\ \forall m,\ r\in\mathbb{N}\}$ (41)
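
The segmentation and feature extraction of Eqs (38) to (41) may be sketched in Python as follows. Only a subset of the features in Table 2 is computed, the gradient is summed from the second sample of each segment so that it never reads outside the window, and the standard median definition is used; these choices and all names are illustrative assumptions rather than the implementation used in the paper.

```python
import statistics


def segments(signal, width):
    """Non-overlapping rectangular windows of `width` samples."""
    return [signal[r:r + width] for r in range(0, len(signal) - width + 1, width)]


def features(seg):
    """A few of the features listed in Table 2, for one segment."""
    mean = statistics.fmean(seg)
    peak = max(seg)
    grad = [abs(seg[n] - seg[n - 1]) for n in range(1, len(seg))]
    return [
        peak,                                          # maximum value
        min(seg),                                      # minimum value
        seg.index(peak),                               # position of the first maximum
        seg.index(min(seg)),                           # position of the first minimum
        mean,                                          # first raw moment
        sum((x - mean) ** 2 for x in seg) / len(seg),  # second central moment
        statistics.median(seg),                        # median
        sum(grad) / len(seg) / peak if peak else 0.0,  # mean of gradient signal
    ]


def feature_vector(context_signals, width, r):
    """Concatenate the features of the r-th segment of every context signal."""
    vec = []
    for signal in context_signals:
        vec.extend(features(segments(signal, width)[r]))
    return vec


temperature = [1, 3, 2, 4, 0, 0, 8, 0]  # hypothetical context signal
print(features(segments(temperature, 4)[0]))
print(len(feature_vector([temperature, temperature], width=4, r=0)))
```

The concatenated vector plays the role of $\overrightarrow{V_{r}}$ for one window position.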

Figure 8.

Proposed architecture for the multilayer perceptron.

This vector becomes the input of a neural network. For this neural network, we propose a multilayer network, specifically a Multilayer Perceptron (MLP) formed by a stack of five hidden layers (see Fig. 8). There are two dense layers followed by Dropout layers with a rate of $\frac{1}{2}$, which randomly switch off 50% of the MLP’s neurons in each epoch. This allows the neurons to independently develop meaningful features and avoids overfitting when processing the input feature vectors. Six hundred and twenty-five different trainable parameters are then defined in this network. The output of this network (a tensor) is a vector encoding the class probability for the input vector. The tensor resulting from this multilayer network (where the input to one layer is the output of the previous one) is then fed into a classifier bank with $\mathfrak{L}$ different one-unit classifiers (one per context label to be recognized), where a 2-class classification (context label recognized or not) is performed. As activation functions, we used the ReLU (Rectified Linear Unit) non-linearity for the fully connected layers and a sigmoid for the last dense layer (encoding the probability of one class or the other).

For the training process, we use stochastic gradient descent with Adam optimization (considered to be the fastest to converge) and a small learning rate of $1\times 10^{-3}$ to optimize the binary cross-entropy function (which measures how far the prediction is from the ground truth when working with a network ending in a sigmoid function). The network was trained for one hundred (100) epochs. After each epoch, the classification error is measured on previously unseen data, evaluated, and distributed across the entire network using backpropagation. For the testing phase (see Section 4), neither the training data nor the evaluation data are employed, so performance metrics are obtained using previously unseen data, which increases the reliability of the experiment.
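
As a small numerical illustration of the loss and the one-unit classifier bank described above, the Python sketch below computes the binary cross-entropy for sigmoid outputs and selects the active context labels. The 0.5 decision threshold, the label names and the function names are assumptions introduced for the example; the network itself is not reproduced here.

```python
import math


def sigmoid(z):
    """Logistic activation used by each one-unit classifier."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def active_labels(logits, labels, threshold=0.5):
    """One independent 2-class decision per context label."""
    return [lab for z, lab in zip(logits, labels) if sigmoid(z) >= threshold]


print(binary_cross_entropy([1, 0], [0.5, 0.5]))
print(active_labels([2.0, -1.0, 0.0], ["hot", "noisy", "dark"]))
```

Since every label has its own sigmoid unit, several context labels can be active for the same window position.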

We chose this architecture for its simplicity, computational efficiency and flexibility, which allow us to reach the real-time requirements of Industry 4.0 scenarios. As in the user activity recognition module, this neural network must be trained to capture information about the application scenario. To perform this process, we are also employing standard instruments.

Finally, after the classification process, we obtain for each time instant (position of the sliding window) a set $\{\Lambda_{r}\ r\in\mathbb{N}\}$ containing the recognized context labels.

3.5 Recognition phase

For this final phase, we need a classification technique that is able to split the input information and analyze the parts independently, while partial results are later composed into a global result. This description fits a classifier based on Random Forests [37].

The Random Forest technique consists of a set of $D$ decision trees (see Fig. 9), which are fed with different subsets $I_{d}$ of the input information (in our case, each subset may include discontinuous information), Eq. (42). We only ensure that each subset contains both information about user activities and information about context.

$\displaystyle I_{d}=\{\Lambda_{r1},\dots,\Lambda_{rx},\overrightarrow{U^{*}_{s1}},\dots,\overrightarrow{U^{*}_{sx}}\ r1,\dots,rx,s1,\dots,sx\in\mathbb{N}\},\quad\mathcal{I}=\{I_{d}\ d=1,\dots,D\}$ (42)

Each tree, then, evaluates the input subset (or sub-vector) and decides about which business action $E_{i}$ from the business action repository $\mathcal{E}$ Eq. (43) is being executed.

$\displaystyle\mathcal{E}=\{E_{i}\ i=1,\dots,E\}$ (43)

Figure 9.
Proposed architecture for the random forest classifier.

This repository, as the other ones described in this paper, is created by supervising users and workers in the scenario under study for a period. All repositories may be created at the same time through a unique configuration phase. Atomic, user and business actions and activities should be defined by managers or industry experts according to the production processes, manufactured products, and business objectives. Any modeling language could be employed for this purpose.

Besides, we modify the classic Random Forest so that all business actions $E^{*}_{i}$ recognized by more than $D_{th}$ decision trees are globally recognized as current actions $E^{**}_{i}$. Despite this modification in the final step, standard frameworks may be employed to create the proposed classifier, taking as input the sets ${\mathcal{U}}^{*}$, $\{\Lambda_{r}\ r\in\mathbb{N}\}$, $\mathcal{E}$ and $\mathcal{I}$. All decision trees are built during the training phase, which may also follow a standard procedure.
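
The modified final step can be sketched in a few lines of Python. The sketch assumes that each tree has already produced one business action for the current input; the function name, the vote list and the threshold values are hypothetical.

```python
from collections import Counter


def recognized_actions(tree_votes, d_th):
    """Business actions voted for by more than d_th decision trees."""
    counts = Counter(tree_votes)
    return sorted(action for action, votes in counts.items() if votes > d_th)


votes = ["weave", "weave", "cut", "weave", "cut", "pack"]  # one vote per tree
print(recognized_actions(votes, d_th=1))
print(recognized_actions(votes, d_th=3))
```

Unlike a plain majority vote, this rule may return several concurrent business actions, or none when no action exceeds the threshold.
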
4. Evaluation and discussion

In order to evaluate the performance of the proposed solution, in this section we carry out a set of relevant experiments and provide and analyze the obtained results.

4.1 Experimental validation: Materials and methods

To evaluate the performance of the proposed technology, two experiments were planned and performed. The first experiment focused on evaluating the quality of the described technique through a set of standard indicators in the field of activity recognition. The second experiment evaluated the performance of the new technology in terms of execution time and scalability. To perform these studies, the new activity recognition mechanism was implemented and executed using the MATLAB 2017 software suite. To guarantee that the results obtained for the new technology are comparable to results previously reported in the state of the art, both experiments are based on standard datasets commonly employed to evaluate activity recognition technologies. Specifically, we selected two datasets: the ExtraSensory dataset [23] and the UJAmI dataset [16].

The ExtraSensory dataset mainly contains information provided by personal sensors integrated into mobile phones, such as accelerometers, gyroscopes and magnetometers. The UJAmI dataset, on the other hand, contains information from a pervasive hardware platform including sensors such as NFC tags and temperature and CO2 sensors. Besides, to guarantee the statistical significance and validity of the results, the performance of the new activity recognition technology is analyzed through a $k$-fold cross-validation scheme. In this methodology, the working dataset is divided into $k$ equal and interchangeable subsets. In each iteration, one subset is employed for testing and the remaining ones for training; the roles are then interchanged. The process is repeated $k$ times, once for each subset. Considering the number of records in the datasets, we use a scheme with five iterations. The final results presented in this paper are the mean values over all these partial results.
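
The cross-validation scheme can be sketched as follows. This is a simplified illustration rather than the MATLAB code used in the experiments: the helper names `k_fold_indices` and `cross_validate` are hypothetical, folds are taken as contiguous index ranges, and the evaluation function is a placeholder.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def cross_validate(n_samples, k, evaluate):
    """Average the score returned by evaluate(train_idx, test_idx) over k folds."""
    scores = [evaluate(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

# Placeholder evaluation: the share of samples used for training in each fold.
# A real evaluate() would train the recognizer and return, e.g., its precision.
mean_train_share = cross_validate(100, 5, lambda tr, te: len(tr) / (len(tr) + len(te)))
print(mean_train_share)
```

With five folds, each iteration trains on four fifths of the records and tests on the remaining fifth.
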

The datasets employed in these experiments were selected according to the following criteria:

• Datasets must contain information about unconstrained activities. Unlike in controlled applications, people act freely in real scenarios.
• Different users must be present in the dataset. To represent a real Industry 4.0 scenario, more than one worker must be performing activities. This condition also allows us to guarantee that the proposed solution generalizes across the human-dependent factors in business activities.
• Samples in the datasets must be collected according to the communication and sampling schemes described in the proposed architecture.
• Information about activities executed in parallel, with interruptions, or collaboratively by different users must also be present in the selected datasets.
• More than one sensor must be present. Preferably, a heterogeneous set of information sources should be represented in the dataset.
• Time, context and geographical information must be present in the dataset, so that it is suitable for the proposed technology.

With these criteria, two datasets were selected. The ExtraSensory dataset describes sixty (60) users performing up to one hundred and sixteen (116) different activities in a multi-tasking scheme. It contains more than 300,000 minutes of monitoring, collected with personal mobile sensors, and the raw signals are available. The UJAmI dataset, on the other hand, represents workers performing activities in a pervasive sensing scenario, as is envisioned for Industry 4.0 applications. Only twenty-four different actions are monitored, and ten days of monitoring are available. The UJAmI dataset was initially published in 1996. However, several updates and versions have been released and, for this work, we consider the latest version (2018), so we can guarantee the dataset reflects current Industry 4.0 scenarios.

Table 3
Configuration parameters for the experimental phase

Parameter	Value	Comments
Analysis phase
$\overrightarrow{X}[n]$	Six hundred and sixty data sources in the ExtraSensory dataset (one mobile device per user and eleven sensors per device)
	Thirty-nine consolidated data sources in the UJAmI dataset
$B$	12	Standard number of bits for current analog-to-digital converters
$f_{s}$	80 Hz	Maximum frequency in data signals is 40 Hz
$\alpha$	0.441	Traditional values for a standard smoothing effect
$\beta$	0.030
$\gamma$	0.002
$L$	6912000	Season is considered as a workday (24 h)
$C$	Redundant sensors are considered equal
Modeling phase
$N_{\textit{train}}$	80% of available instances in each experiment (depends on the experiment, but around forty thousand)
$\mathcal{U}$	164	ExtraSensory dataset provides 116 different activities and UJAmI dataset provides 48 different activities
Number of users	60	As indicated in the considered datasets
Recognition phase
$\mathfrak{L}$	75	ExtraSensory dataset provides 51 different activities and UJAmI dataset provides 24 different activities
$W_{c}$	48000	The maximum variation period for context signals is fixed to ten minutes
$D$	100	Default value for a good quality classifier, as reported in the literature
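
Several values in Table 3 follow directly from the sampling frequency $f_{s}$. The short check below reproduces them, under the assumption that $L$ and $W_{c}$ are expressed in samples at $f_{s}$; the variable names are illustrative.

```python
F_S_HZ = 80            # sampling frequency f_s from Table 3
MAX_SIGNAL_HZ = 40     # maximum frequency in the data signals

nyquist_rate = 2 * MAX_SIGNAL_HZ           # minimum sampling rate for a 40 Hz signal
season_samples = 24 * 60 * 60 * F_S_HZ     # L: one 24 h workday
context_window_samples = 10 * 60 * F_S_HZ  # W_c: ten-minute context window
data_sources = 60 * 11                     # 60 users x 11 sensors per device

print(nyquist_rate, season_samples, context_window_samples, data_sources)
# 80 6912000 48000 660
```
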

To guarantee that the obtained results are not user-conditioned, independent groups of users were considered when creating the five folds in the validation process. Specifically, in each iteration 80% of the users were employed to train the model and the remaining 20% were employed to test its performance. Although this approach may cause overfitting under certain circumstances, in our experiments we observed a very high and constant performance in every fold; no fold with a significantly lower performance was detected. As a result, we can conclude that the generalization capacity of our model is very high.
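
A user-independent split can be sketched as follows. The round-robin assignment of users to folds and the helper names `user_folds` and `split_by_user` are assumptions made for illustration, since the exact grouping procedure is not detailed here; the records are hypothetical.

```python
def user_folds(user_ids, k=5):
    """Assign whole users to k disjoint folds (round-robin over sorted ids)."""
    users = sorted(set(user_ids))
    return [users[i::k] for i in range(k)]

def split_by_user(samples, test_users):
    """samples is a list of (user_id, record); no user appears on both sides."""
    test_users = set(test_users)
    train = [s for s in samples if s[0] not in test_users]
    test = [s for s in samples if s[0] in test_users]
    return train, test

# Hypothetical data: 60 users with 3 records each.
samples = [(u, f"record-{u}-{j}") for u in range(60) for j in range(3)]
folds = user_folds([u for u, _ in samples])
train, test = split_by_user(samples, folds[0])
print(len(folds[0]), len(train), len(test))  # 12 144 36
```

Because every record of a given user falls on one side of the split, the test performance reflects behavior on workers the model has never seen.
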

As mentioned, two different experiments were conducted. For both, the proposed activity recognition technique was configured with the set of parameters shown in Table 3.

Most of these parameters must be selected according to the activities to be recognized, although the parameters related to the smoothing effect have optimum values that have been analyzed and reported in the state of the art [26]. To tune the activity-dependent parameters, a “silence” detection analysis based on elementary signal processing may be performed, so that the length and duration of the different activities can be easily identified and calculated. In this work, the spectrogram tool is employed for this purpose.
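
As a rough illustration of “silence” detection, the sketch below uses a time-domain short-time energy threshold instead of the spectrogram employed in this work. The function name `active_segments`, the frame length and the threshold are assumptions, and the signal is synthetic.

```python
def active_segments(signal, frame_len, threshold):
    """Return (start, end) sample ranges whose mean frame energy exceeds threshold."""
    segments, start = [], None
    for pos in range(0, len(signal), frame_len):
        frame = signal[pos:pos + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy > threshold and start is None:
            start = pos                      # activity begins
        elif energy <= threshold and start is not None:
            segments.append((start, pos))    # activity ends, silence begins
            start = None
    if start is not None:
        segments.append((start, len(signal)))
    return segments

# Synthetic signal: silence, activity, silence, activity (values are illustrative).
signal = [0.0] * 80 + [1.0, -1.0] * 80 + [0.0] * 40 + [0.5, -0.5] * 20
print(active_segments(signal, frame_len=40, threshold=0.01))
# [(80, 240), (280, 320)]
```

At $f_{s}=80$ Hz, the first detected segment (160 samples) corresponds to an activity lasting two seconds.
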

The first experiment focused on analyzing the recognition and classification capabilities of the proposed technology. To do so, a standard collection of relevant performance indicators was considered (see Table 4). The entire datasets were employed to train and evaluate the proposed technique during this experiment. Three different situations were defined: in the first one, only the ExtraSensory dataset is used; in the second one, only the UJAmI dataset is used; and in the third one, a new dataset is created by merging the ExtraSensory and UJAmI datasets.

Table 4
Performance indicators considered in the first experiment

Indicator	Expression
Precision	$\frac{tp}{tp+fp}$
Recall	$\frac{tp}{tp+fn}$
F1-Score	$2\cdot\frac{tp}{2\cdot tp+fn+fp}$
Specificity	$\frac{tn}{tn+fp}$
Balanced accuracy	$\frac{1}{2}(\frac{tp}{tp+fn}+\frac{tn}{tn+fp})$
Root mean square error	$\sqrt{\frac{1}{N_{T}}\sum^{N_{T}}_{i=1}{\delta[y_{i},x_{i}]}}$
Kappa	$\frac{po-pc}{1-pc}$

In Table 4, $tp$ indicates the number of activities that are correctly recognized; $tn$ indicates the number of activities that are correctly not recognized; $fp$ indicates the number of activities that are falsely recognized; and $fn$ indicates the number of activities that are falsely not recognized. Besides, $y_{i}$ represents the recognized label for an activity, and $x_{i}$ indicates the real label for that activity. $N_{T}$ denotes the total number of samples in the dataset. Finally, $po$ refers to the observed probability of agreement in the entire dataset, and $pc$ represents the probability of agreement by chance.
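
The count-based indicators can be computed as in the following sketch. It assumes a binary (one-vs-rest) view of a single activity, uses the standard definitions of specificity and Cohen's kappa, and takes hypothetical counts; the function names are illustrative.

```python
def indicators(tp, tn, fp, fn):
    """Precision, recall, F1, specificity and balanced accuracy from counts."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "precision": tp / (tp + fp),
        "recall": recall,
        "f1": 2 * tp / (2 * tp + fn + fp),
        "specificity": specificity,
        "balanced_accuracy": (recall + specificity) / 2,
    }

def kappa_from_counts(tp, tn, fp, fn):
    """Cohen's kappa for the binary case: (po - pc) / (1 - pc)."""
    n = tp + tn + fp + fn
    po = (tp + tn) / n                                            # observed agreement
    pc = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2   # chance agreement
    return (po - pc) / (1 - pc)

# Hypothetical counts, for illustration only.
m = indicators(tp=87, tn=88, fp=13, fn=12)
print(round(m["precision"], 3), round(kappa_from_counts(87, 88, 13, 12), 3))
```
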

To highlight the novelty of the proposed solution and provide a relevant comparison, the obtained results are statistically compared to state-of-the-art hybrid mechanisms [40]. For this purpose, we selected as reference the hybrid technology showing the highest accuracy [40] among all solutions reported in the last five years. Other more recent proposals [22, 50] can be found, but they show a worse performance. Although different tests could be employed, in this experiment we use the Mann-Whitney U test, as it has proved effective for comparing activity recognition solutions. The $p$ value indicates the significance level of the Mann-Whitney U test. Tests for various significance levels (alpha parameter) were conducted, selecting the most usual and standard levels in the state of the art. The error associated with this test may be considered negligible given the size of the datasets we employ [3].
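
For reference, a minimal version of the Mann-Whitney U test is sketched below. It relies on the normal approximation without tie or continuity corrections, which is only a rough guide for very small samples; the two samples are hypothetical per-fold precision values, not results from this paper, and the function names are illustrative.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """Return (U, two-sided p) using the normal approximation."""
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u - mean) / sd
    return u, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical per-fold precision values (not taken from the paper).
proposed = [0.87, 0.88, 0.86, 0.89, 0.90]
reference = [0.78, 0.80, 0.79, 0.81, 0.77]
u, p = mann_whitney_u(proposed, reference)
print(u, p < 0.05)
```
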

The second experiment, on the other hand, focused on the performance and scalability analysis of the proposed technology. Using the dataset generated by merging the ExtraSensory and UJAmI datasets, the time required for the training process and the recognition delay were measured. From this dataset, different folds containing different numbers of users were extracted and, for each fold, the training time and the recognition delay were measured. From this experiment, the required processing time, the scalability of the scheme and the temporal order of the algorithm were calculated and discussed.
4.2 Results

Table 5 provides the results obtained in the first experiment. Globally, these results are coherent both internally (among the different indicators) and externally [37]. No anomalous value can be seen, so they may be considered valid and statistically representative of the technology’s behavior. This conclusion is also supported by the high values of Cohen’s kappa. From Table 5 it can be deduced that the proposed mechanism performs very well as an activity recognition technique in Industry 4.0 scenarios: the F1-Score is close to 0.9 in all experiments (and significantly above this value for the UJAmI dataset).

Table 5
Results from first experiment

Indicator	ExtraSensory	UJAmI	ExtraSensory $+$ UJAmI
Precision	0.869	0.957	0.859
Recall	0.875	0.960	0.870
F1-Score	0.872	0.959	0.864
Specificity	0.879	0.965	0.880
Balanced accuracy	0.877	0.963	0.875
Root mean square error	0.105	0.089	0.109
Kappa	0.904	0.923	0.903
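
The internal coherence of Table 5 can be checked numerically. The sketch below assumes the standard relations $F1=2PR/(P+R)$ and balanced accuracy as the mean of recall and specificity, with a tolerance of 0.001 to absorb the three-decimal rounding of the table; the function name is illustrative.

```python
table5 = {
    # dataset: (precision, recall, F1, specificity, balanced accuracy)
    "ExtraSensory": (0.869, 0.875, 0.872, 0.879, 0.877),
    "UJAmI": (0.957, 0.960, 0.959, 0.965, 0.963),
    "ExtraSensory + UJAmI": (0.859, 0.870, 0.864, 0.880, 0.875),
}

def is_coherent(p, r, f1, spec, bal, tol=1e-3):
    """True if F1 and balanced accuracy match the values implied by p, r, spec."""
    f1_expected = 2 * p * r / (p + r)
    bal_expected = (r + spec) / 2
    return abs(f1 - f1_expected) <= tol and abs(bal - bal_expected) <= tol

for name, row in table5.items():
    print(name, is_coherent(*row))
```

All three columns satisfy both relations within the rounding tolerance.
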

In crowdsensing scenarios (represented by the ExtraSensory dataset), precision is almost 87%. This value considers all business activities represented in the ExtraSensory dataset together (such as driving, cooking, or working in the lab). The activities are heterogeneous enough to represent a large catalogue of potential Industry 4.0 scenarios. The same catalogue of activities has previously been recognized using other approaches, some of them similar to the proposed solution, and the results obtained with our proposal (globally) improve the performance of these state-of-the-art techniques [37] on the same dataset by up to 10%. In general, activities that are performed in a continuous and homogeneous manner (such as driving or walking) are recognized with a better precision than non-continuous activities (such as cooking or bathing). The difference in precision between both kinds of activities is around 2.5%.

The best results are obtained for the UJAmI dataset, which represents environments based on pervasive sensing platforms and shows an F1-Score around 10% higher than the experiments with the other datasets. On the other hand, the scheme proposed in this work improves the precision by around 8% compared to state-of-the-art proposals in which the entire catalogue of activities in the UJAmI dataset is considered [27].

More complex Industry 4.0 scenarios will include both personal sensors and pervasive sensing platforms. These scenarios are represented by the merged ExtraSensory $+$ UJAmI dataset. In this case, the precision, as well as the F1-Score, is slightly lower than the values for the ExtraSensory and UJAmI datasets independently (a reduction of about 2% and 10%, respectively). However, the results still improve on the performance of state-of-the-art techniques, although in a more moderate manner (around 7%–8%, depending on the indicator).

Beyond the comparisons with state-of-the-art mechanisms discussed above, Table 6 shows a formal statistical comparison with existing hybrid approaches [40] using the Mann-Whitney U test. As can be seen, in general, for all metrics the proposed solution is significantly better than the state-of-the-art hybrid mechanisms [40] applied to the same datasets.

Table 6

Comparison of different indicators with the state of the art

Indicator	ExtraSensory	UJAmI	ExtraSensory $+$ UJAmI
Precision	**	*	*
Recall	**	**	**
F1-Score	*	**	NS
Specificity	**	**	**
Balanced accuracy	**	**	**
Root mean square error	***	**	**
Kappa	*	*	*

NS not significant; ${}^{*}$ significant at $p<$ 0.05; ${}^{**}$ significant at $p<$ 0.005; ${}^{***}$ significant at $p<$ 0.001.

First, in general, all metrics improve with a significance level of 0.005. However, in our approach, the F1-score behaves more similarly to previous proposals than the other metrics, and its significance level is reduced by one order of magnitude. Moreover, for the experiment considering the ExtraSensory and UJAmI datasets together, no significant difference is detected. In any case, globally, we can conclude that the proposed scheme improves the performance of state-of-the-art mechanisms, as the Kappa parameter shows a relevant improvement with a significance level of $p=$ 0.05.

To add more information to the discussion, we analyze some relevant activity types: namely, continuous (C) and non-continuous (NC) activities, and activities performed by one worker (I) or by several workers together (G). These disaggregated results are represented in Fig. 10. Besides, Table 7 presents the confusion matrix for these four groups and all the considered datasets.

Figure 10.

Precision, recall and F1-Score for some relevant activity types.

Continuous activities are those that last for a long period, generating homogeneous and almost permanent sensor outputs (such as driving or sitting). On the other hand, non-continuous activities are those that last a short time or have a variable behavior (for example, cooking).

As can be seen, there is no big difference between the performance for individual and group activities. Indicators are slightly higher for individual activities, but differences are below 1%. Only errors originating in this last phase affect the differences between individual and group activities (contrary to other approaches based on monolithic solutions).

Table 7

Confusion matrix for some relevant activity types

Dataset			Recognized
			NC	C	I	G
ExtraSensory	Real activity	NC	0.883	0.161
		C	0.124	0.888
		I			0.897	0.155
		G			0.158	0.865
UJAmI		NC	0.906	0.103
		C	0.071	0.917
		I			0.965	0.103
		G			0.094	0.909
ExtraSensory $+$ UJAmI		NC	0.896	0.148
		C	0.114	0.877
		I			0.865	0.140
		G			0.141	0.849

Table 8

Comparison of different indicators with the state of the art

Dataset			Indicator
			Precision	Recall	F1-score
ExtraSensory	Real activity	NC	*	**	NS
		C	**	**	*
		I	**	**	*
		G	*	**	NS
UJAmI		NC	*	**	**
		C	*	*	**
		I	*	**	**
		G	NS	*	*
ExtraSensory $+$ UJAmI		NC	*	*	NS
		C	**	**	*
		I	**	**	*
		G	*	*	NS

NS not significant; ${}^{*}$ significant at $p<$ 0.05; ${}^{**}$ significant at $p<$ 0.005; ${}^{***}$ significant at $p<$ 0.001.

However, there is a significant difference (of around 5%) between the performance for continuous and non-continuous activities. In this case, discontinuities affect both the modeling and the recognition phases. As a general idea, complex business activities (with discontinuities and several users collaborating), such as working in the laboratory, are recognized with a lower precision than activities with a simpler structure, such as driving or lying.

To analyze in more detail which kinds of activities are recognized with the best precision, Table 8 shows a statistical comparison of the obtained results with the state of the art, using the Mann-Whitney U test. Besides, to enable a heuristic comparison, Table 9 shows the values of the main indicators (precision, recall, specificity and F1-score).

Table 9

Main indicators for the main types of activities

	Indicators
	Precision	Recall	Specificity	F1-score
NC	0.896	0.858	0.884	0.878
C	0.877	0.884	0.858	0.880
I	0.865	0.860	0.858	0.862
G	0.849	0.858	0.860	0.853

First, in general, all kinds of activities show a significant improvement in all metrics compared to the state of the art. In general, the precision and F1-score improvements have a significance level of $p=$ 0.05, while the recall improvement shows a significance level of $p=$ 0.005. However, as can be seen, differences are more significant for continuous and individual activities (such as driving). Besides, sensor information from the ExtraSensory dataset (mainly from smartphones) also allows a more significant improvement than information from pervasive platforms. That may be caused by the precise user identification enabled by the phones’ sensors.

Regarding the different activity types, continuous and individual activities (as they have a simpler structure) are recognized with a better precision, recall and F1-Score. This includes activities such as lying, sitting, running or driving. In this case, the significance level of the improvement is close to $p=$ 0.005. On the contrary, non-continuous and group activities are more complex, and the improvement is less significant. The increase in precision and recall, in this case, has a significance level one order of magnitude lower, $p=$ 0.05, while the F1-Score does not show any significant difference. Activities such as working in the laboratory or cooking belong to this second group.

Figure 11 shows the results of the second experiment. As can be seen (Fig. 11b), only around 200 milliseconds are required to recognize a business activity using the proposed framework. Almost-random variations of 3% between the maximum and minimum values may be observed in the figure, but they can be easily explained by exogenous processes affecting the experiment, such as delays caused by the operating system and other applications sharing the resources. In any case, given the obtained graph, the temporal order of the proposed solution during the operation phase is almost linear with respect to the number of users. This is the most desired behavior for industrial solutions.

Figure 11.

(a) Evolution of the time required in the training process for different numbers of users in the dataset. (b) Operation delay for different numbers of users.

On the other hand, Fig. 11a shows the time required for the training process. Values range from approximately twenty minutes, for scenarios where only one user is employed to train the algorithm, to around four hours, for scenarios where sixty workers are involved in the training process. These results are coherent with the fact that training processes in our solution are performed using standard mechanisms, which have shown similar behaviors in other previously reported works [52].

In this case, using a fitting mechanism, we found that the temporal order of the proposed solution during the training process is $n\cdot\log(n)$ with respect to the number $n$ of users involved in the training.
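
One possible way to carry out such a fit is an ordinary least-squares regression of the training time on $n\cdot\log(n)$. The sketch below is an assumption about the procedure, not the fitting mechanism actually used; the function name is hypothetical, and the measurements are synthetic values generated from the model itself so that they match only the two approximate endpoints reported above (twenty minutes for one user and four hours for sixty users).

```python
import math

def fit_n_log_n(users, minutes):
    """Least-squares fit of minutes = a * n*log(n) + b; returns (a, b)."""
    x = [n * math.log(n) for n in users]
    mx, my = sum(x) / len(x), sum(minutes) / len(minutes)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, minutes))
    a = sxy / sxx
    return a, my - a * mx

# Synthetic measurements (not the paper's data): 20 min at n = 1, 240 min at n = 60.
a_true = 220 / (60 * math.log(60))
users = [1, 10, 20, 30, 40, 50, 60]
minutes = [a_true * n * math.log(n) + 20 for n in users]

a, b = fit_n_log_n(users, minutes)
print(a, b)
```

With real measurements, the residuals of this fit could be compared with those of a linear or quadratic model to support the choice of temporal order.
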

5. Conclusions

In this paper, a new activity recognition technology focused on Industry 4.0 scenarios is proposed. The proposed mechanism consists of different steps: a first analysis phase where physical signals are processed with DTW technologies; a second phase where activities are modeled using CRF, and neural networks are employed to analyze context information; and a third step where activities are recognized using previously recognized user actions and context information, formatted as labels.

The proposed solution achieves a best recognition rate of 87%, which demonstrates the efficacy of the described method. Results show that the proposed mechanism improves the precision of previously reported technologies by up to 10%, with a relevant significance level, when applied to Industry 4.0 (craft industry) scenarios. On the other hand, the weight of the craft industry within the global Industry 4.0 sector may be small (depending on the region, country, etc.), so other less precise mechanisms could be considered in practice by companies, if they are low-cost because of the exponential economy. Solutions such as artificial vision, which is extensively employed in other sectors like the automotive sector but currently has limited applicability in craft industries, could then be deployed in this scenario because of their affordable cost.

Future work will consider the validation of the proposed solution in different Industry 4.0 scenarios. Besides, other classifiers may be employed during the recognition phase, in order to adapt the proposed mechanism to certain critical scenarios where, for instance, only video signals are available (as in energy companies). Future work will also analyze how the proposed solution may be applied to other large-scale industries such as the automotive sector.

Acknowledgments

The research leading to these results has received funding from the Spanish Ministry of Science, Innovation and Universities through the COGNOS project (PID2019-105484RB-I00).

References

1. Industry 4.0: A survey on technologies, applications and open research issues. Journal of Industrial Information Integration, 2017; 6: 1-10.
2. Bordel, Alcarria, Robles, Martín. Cyber-physical systems: Extending pervasive sensing from control theory to the Internet of Things. Pervasive and Mobile Computing, 2017; 40: 156-184.
3. Perme, Manevski. Confidence intervals for the Mann-Whitney test. Statistical Methods in Medical Research, 2019; 28(12): 3755-3768.
4. Bordel, Alcarria, de Rivera, Robles. Process execution in Cyber-Physical Systems using cloud and Cyber-Physical Internet services. The Journal of Supercomputing, 2018; 74(8): 4127-4169.
5. Noering FKD, Schroeder, Jonas, Klawonn. Pattern discovery in time series using autoencoder in comparison to nonlearning approaches. Integrated Computer-Aided Engineering, 2021; 28(3): 235-254.
6. Roda-Sanchez, Olivares, Garrido-Hidalgo, de la Vara, Fernández-Caballero. Human-robot interaction in industry 4.0 based on internet of thing real-time gesture control system. Integrated Computer-Aided Engineering, 2021; 28(2): 159-175.
7. Beddiar, Nini, Sabokrou, Hadid. Vision-based human activity recognition: A survey. Multimedia Tools and Applications, 2020; 79(41): 30509-30555.
8. Zhang, Wei, Nie, Huang, Wang. A review on human activity recognition using vision-based method. Journal of Healthcare Engineering, 2017; 2017: 3090343.
9. Bordel, Alcarria, Martín, Robles, de Rivera. Self-configuration in humanized cyber-physical systems. Journal of Ambient Intelligence and Humanized Computing, 2017; 8(4): 485-496.
10. Bordel, Alcarria, Jara. Process execution in humanized Cyber-physical systems: Soft processes. In 2017 12th Iberian Conference on Information Systems and Technologies (CISTI), IEEE, June 2017, pp. 1-7.
11. Sánchez, Alcarria, de Rivera, Sánchez-Picot. Enhancing process control in industry 4.0 scenarios using cyber-physical systems. JoWUA, 2016; 7(4): 41-64.
12. Bordel Sánchez, Alcarria, Martín, Robles. TF4SM: A framework for developing traceability solutions in small manufacturing companies. Sensors, 2015; 15(11): 29478-29510.
13. Martín, Bordel, Alcarria, Sánchez-Picot, de Rivera, Robles. Improving learning tasks for mentally handicapped people using AmI environments based on cyber-physical systems. In International Conference on Ubiquitous Computing and Ambient Intelligence, Springer, Cham, Nov. 2016, pp. 166-177.
14. Bordel, Alcarria. Assessment of human motivation through analysis of physiological and emotional signals in Industry 4.0 scenarios. Journal of Ambient Intelligence and Humanized Computing, 2017; 1-21.
15. Bordel, Alcarria, Sánchez-de-Rivera. A Two-Phase Algorithm for Recognizing Human Activities in the Context of Industry 4.0 and Human-Driven Processes. In World Conference on Information Systems and Technologies, Springer, Cham, April 2019, pp. 175-185.
16. Espinilla, Martínez, Medina, Nugent. The experience of developing the UJAmI Smart lab. IEEE Access, 2018; 6: 34631-34642.
17. Benavoli, Corani, Demšar, Zaffalon. Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis. The Journal of Machine Learning Research, 2017; 18(1): 2653-2688.
18. Martín Rico, Gomez-Donoso, Escalona, García Rodríguez, Cazorla. Semantic visual recognition in a cognitive architecture for social robots. Integrated Computer-Aided Engineering, 2020; 27(3): 301-316.
19. Cai, Xue. Self-adapted optimization-based video magnification for revealing subtle changes. Integrated Computer-Aided Engineering, 2020; 27(2): 173-193.
20. Zhang, Neri, Zhu, Jiang, Kuhnert. A multi-aperture optical flow estimation method for an artificial compound eye. Integrated Computer-Aided Engineering, 2019; 26(2): 139-157.
21. Elliott, Aggoun, Moore. Hidden Markov models: estimation and control. Springer Science & Business Media, 2008; 29.
22. Ullah, Muhammad, Ding, Palade, Haq, Baik. Efficient activity recognition using lightweight CNN and DS-GRU network for surveillance applications. Applied Soft Computing, 2021; 103: 107102.
23. Kohavi. Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In Kdd, August 1996; 96: 202-207.
24. Jacob. Sequential Bayesian inference for implicit hidden Markov models and current limitations. ESAIM: Proceedings and Surveys, 2015; 51: 24-48.
25. Ehatisham-ul-Haq, Azam. Opportunistic sensing for inferring in-the-wild human contexts based on activity pattern recognition using smart computing. Future Generation Computer Systems, 2020.
26. Nakano, Takahashi. Generalized exponential moving average (EMA) model with particle filtering and anomaly detection. Expert Systems with Applications, 2017; 73: 187-200.
27. Salomón, Tîrnăucă. Human activity recognition through weighted finite automata. In Multidisciplinary Digital Publishing Institute Proceedings, 2018; 2(19): 1263.
28. Debes, Merentitis, Sukhanov, Niessen, Frangiadakis, Bauer. Monitoring activities of daily living in smart homes: Understanding human behavior. IEEE Signal Processing Magazine, 2016; 33(2): 81-94.
29. Kabir, Hoque, Thapa, Yang. Two-layer hidden Markov model for human activity recognition in home environments. International Journal of Distributed Sensor Networks, 2016; 12(1): 4560365.
30. Bakar UABUA, Ghayvat, Hasanm, Mukhopadhyay. Activity and anomaly detection in smart home: A survey. In Next Generation Sensors and Systems, Springer, Cham, 2016, pp. 191-220.
31. Lee, Cho. Activity recognition using hierarchical hidden markov models on a smartphone with 3D accelerometer. In International Conference on Hybrid Artificial Intelligence Systems, Springer, Berlin, Heidelberg, May 2011, pp. 460-467.
32. Ronao, Cho. Recognizing human activities from smartphone sensors using hierarchical continuous hidden Markov models. International Journal of Distributed Sensor Networks, 2017; 13(1): 1550147716683687.
33. Pandey, Jain. Comparative analysis of KNN algorithm using various normalization techniques. International Journal of Computer Network and Information Security, 2017; 11(11): 36.
34. Liu, Nie, Hao, Yang. Coupled hidden conditional random fields for RGB-D human action recognition. Signal Processing, 2015; 112: 74-82.
35. Liu, Huang, Zhu. Recognizing biomedical named entities using skip-chain conditional random fields. In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, Association for Computational Linguistics, July 2010, pp. 10-18.
36. Knoch, Ponpathirkoottam, Fettke, Loos. Technology-enhanced process elicitation of worker activities in manufacturing. In Business Process Management Workshops, Springer International Publishing, 2018, pp. 273-284.
37. Vaizman, Ellis, Lanckriet. Recognizing detailed human context in the wild from smartphones and smartwatches. IEEE Pervasive Computing, 2017; 16(4): 62-74.
38. DiPietro, Lea, Malpani, Ahmidi, Vedula, Lee, Hager. Recognizing surgical activities with recurrent neural networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, Cham, October 2016, pp. 551-558.
39. Yang. CIGAR: Concurrent and interleaving goal and activity recognition. In AAAI, July 2008; 8: 1363-1368.
40. Malazi, Davari. Combining emerging patterns with random forest for complex activity recognition in smart homes. Applied Intelligence, 2018; 48(2): 315-330.
41. García-Borroto, Martínez-Trinidad, Carrasco-Ochoa. A survey of emerging patterns for supervised classification. Artificial Intelligence Review, 2014; 42(4): 705-721.
42. Bordel, Alcarria, Sanchez de Rivera, Martín, Robles. Fast self-configuration in service-oriented Smart Environments for real-time applications. Journal of Ambient Intelligence and Smart Environments, 2018; 10(2): 143-167.
43. Hassan, Huda, Uddin, Almogren, Alrubaian. Human activity recognition from body sensor data using deep learning. Journal of Medical Systems, 2018; 42(6): 1-8.
44. Liu, Cao, Yang, Jiang. CPS-based smart warehouse for industry 4.0: A survey of the underlying technologies. Computers, 2018; 7(1): 13.
45. Antar, Ahmed, Ahad MAR. Challenges in sensor-based human activity recognition and a comparative analysis of benchmark datasets: a review. In 2019 Joint 8th International Conference on Informatics, Electronics & Vision (ICIEV) and 2019 3rd International Conference on Imaging, Vision & Pattern Recognition (icIVPR), IEEE, 2019, pp. 134-139.
46. Reining, Niemann, Moya Rueda, Fink, ten Hompel. Human activity recognition for production and logistics – a systematic literature review. Information, 2019; 10(8): 245.
47. Jalal, Mahmood, Hasan. Multi-features descriptors for human activity tracking and recognition in Indoor-outdoor environments. In 2019 16th International Bhurban Conference on Applied Sciences and Technology (IBCAST), IEEE, 2019, pp. 371-376.
48. Penumuru, Muthuswamy, Karumbu. Identification and classification of materials using machine vision and machine learning in the context of industry 4.0. Journal of Intelligent Manufacturing, 2019; 31: 1-13.
49. Luo, Xiong, Fang, Love, Zhang, Ouyang. Convolutional neural networks: Computer vision-based workforce activity assessment in construction. Automation in Construction, 2018; 94: 282-289.
50. Fioranelli, Jing. Semisupervised Human Activity Recognition With Radar Micro-Doppler Signatures. IEEE Transactions on Geoscience and Remote Sensing, 2021; 1-12.
51. Ibrahim, Mori. Hierarchical relational networks for group activity recognition and retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 721-736.
52. Guillén, Llanes, Imbernón, Martínez-España, Bueno-Crespo, Cano, Cecilia. Performance evaluation of edge-computing platforms for the prediction of low temperatures in agriculture using deep learning. The Journal of Supercomputing, 2021; 77: 818-840.

Recognizing human activities in Industry 4.0 scenarios through an analysis-modeling-recognition algorithm and context labels