Sage Journals: Discover world-class research

Abstract

How to deal with multi-modality data from different types of devices is a challenging issue for accurate recognition of human activities in a smart environment. In this paper, we propose a multimodal fusion enabled ensemble approach. Firstly, useful features collected from Bluetooth beacons, binary sensors, and a smart floor are extracted and represented by a fuzzy logic-based method with variable-size temporal windows. Secondly, a group of support vector machine classifiers is used to perform the classification task. Finally, a weighted ensemble method is used to obtain the final prediction. In particular, by applying the geometric framework, we are able to obtain the optimal weights for the ensemble. The proposed approach is evaluated on the UJAmI dataset. The experimental results demonstrate the efficacy and robustness of the proposed method.

Keywords

Ensemble learning, Feature-level fusion, Geometric framework, Human activity recognition, Multimodal fusion

Introduction

Human Activity Recognition (HAR) systems have become an increasingly important part of daily life in smart homes and environments. In many countries, the number of elderly people is increasing. Monitoring human Activities of Daily Living (ADL), especially with regard to cognitive or other impairments experienced by disabled individuals and elderly people living alone, is very helpful for providing timely health-related services to them.^1,2 The need is even greater while the COVID-19 pandemic is spreading around the world and many people have to live in isolation. Home monitoring is a possible solution for providing information on behavioral deviations and cognitive assistance.^3,4

To date, three types of sensing approaches have been used to monitor ADL: vision-based,^5,6 non-invasive sensor-based,^7–9 and hybrid methods.¹⁰ Vision-based approaches depend on video cameras or depth cameras for accurate activity recognition.^11,12 Non-invasive sensor-based approaches utilize monitoring data from non-vision-based devices such as binary sensors¹³ or IMU sensors.^14,15 Hybrid approaches combine vision-based and non-invasive sensors for recognizing human ADL and behavior.¹⁰ In many situations, non-invasive sensors are preferred due to advantages such as easy installation and maintenance, privacy protection, low storage requirements, and fast data transmission and processing.¹⁶ Therefore, HAR systems that recognize ADL from non-invasive sensor-based monitoring data are now an active topic of research.

This work focuses on non-invasive sensor-based monitoring to predict human activities in a smart environment. The UJAmI dataset^17,18 is used; it was collected in the UJAmI Smart Lab at the University of Jaén (Spain). Multi-modal data were collected from three types of sources: binary sensors, Bluetooth Low Energy (BLE) beacons, and smart floors.

Achieving good human activity recognition on the UJAmI dataset is a challenging task. First, the number of classes in the dataset is relatively large: twenty-four daily activities were performed by the inhabitant over a 10-day period. Second, some low-quality and noisy data were collected from the sensors. For example, there were missing values from binary sensors that were expected to activate during certain activities. Third, many existing methods of feature representation focus on binary sensors, whereas very few methods are able to deal with data from Bluetooth beacons and smart floors.

To deal with these challenges, we take measures at three levels. The first measure concerns feature representation: for the extracted data, fuzzy logic-based feature representation is defined on variable-size temporal windows to help represent temporal connections between activities. The dataset contains quite a few such related activities: preparing breakfast and breakfast; preparing lunch and lunch; preparing dinner and dinner; breakfast, lunch, and dinner; go to bed and wake up; and so on.

The second measure is to generate a group of strong and diversified base classifiers. To make the base classifiers stronger, we use feature-level fusion: using all available features, rather than a subset of them, is more likely to yield stronger classifiers. Support vector machines are chosen for all base classifiers because of their superior performance. Sampling with replacement is used to increase the diversity of the generated base classifiers.

Finally, based on those base classifiers, the third measure is to use an ensemble to combine their predictions. Under a geometric framework, optimal weights are calculated for the classifier ensemble to achieve the best possible results. Our experiment with the UJAmI dataset shows that the above strategies combined are effective for improving classification performance.

In short, the major contributions of our research are as follows:

(1) A multimodal fusion enabled ensemble is proposed. At the feature level, fuzzy logic-based methods with variable-size temporal windows are used to extract various features from multimodal data. At the base-classifier level, a group of support vector machines is trained with bagging for more diversity. At the combination level, a weighted ensemble method with optimal weights is proposed.

(2) The optimality of the weighted solution is discussed.

(3) The proposed method works well on the UJAmI dataset. This demonstrates that the method has good potential for HAR tasks with multimodal data.

The remainder of this paper is organized as follows. Related Work section discusses related work. The proposed approach is presented in Methodology section. Experimental Setup section presents the settings of the experiment for performance evaluation on the UJAmI dataset. The results are discussed in Results and Discussion section. Conclusion section concludes the paper.

Related work

Many studies on human activity recognition have centered on one type of sensor, such as binary sensors or BLE beacons. Temporal window definition¹⁹ and feature representation are the main topics in this process. Banos et al. found that using windows with fixed size is a good temporal division for evaluating daily activities based on binary-sensor data. Also using temporal windows with fixed size, Ordóñez et al. applied Artificial Neural Networks and Support Vector Machines within a Hidden Markov Model framework to perform activity recognition.²⁰ Hoffmann et al. calculated human body positions by using fixed-size temporal windows.^21,22 Mean movement distance distribution was used as a feature to distinguish persons with a high risk of falling from those with a low risk in [21]; trajectory localization updates were used as features to recognize the movement patterns of humans and cats in [22]. A few studies^23,24 used multiple and/or variable-size windows to represent long-term and short-term dependence between activities. Espinilla et al. proposed three temporal sub-windows in the sensor data stream, where the temporal sub-windows keep a partial order from the end time of the activity in the short term, medium term, and long term.²³ Their results suggest that multiple temporal windows outperform a single temporal window in terms of accuracy. Medina et al. proposed using multiple windows to express long- and short-term binary-sensor activation information.^24,25 Fuzzy features were used to represent binary sensors in [24], and aggregated spatial and temporal features were used in [25].

Various machine learning methods have been used to recognize human activities. Neural networks, support vector machines, and decision trees are commonly used as base classifiers.²⁶ Bagging, boosting, and stacking are commonly used ensemble learning methods. In almost all cases, an ensemble has been observed to be more effective than an approach based on an individual classifier.²⁷ In the following we review several ensemble approaches in the domain of HAR. Peng et al.²⁸ proposed a classifier-level fusion method to recognize complex activities by using acceleration, vital sign, and location data. In their study, base classifiers were built separately on data from different contexts. Muzammal et al. proposed a data fusion approach for medical data obtained from wireless body sensor networks in a fog computing environment.²⁹ In their solution, the daily activity data were first fused together, and then a novel kernel random forest ensemble classifier was developed for heart disease prediction. Garcia-Ceja et al. proposed a multi-view ensemble method to fuse the data collected from sound and accelerometer sensors.³⁰ Their base models for each of the sensor views were combined via a stacking framework. Irvine et al.³¹ proposed a homogeneous ensemble of four multi-layer perceptrons to investigate the resolution of predictive conflicts between base models.

In this study, we propose a multimodal fusion enabled ensemble approach that combines variable-size fuzzy temporal windows, fusion of data from all three types of devices, and a weighted ensemble of a group of SVM classifiers. Experiments demonstrate that the proposed approach performs very well compared to other methods.

Methodology

This section describes the detailed procedures of our approach. The whole framework, as shown in Figure 1, is mainly composed of two basic modules, namely, the feature extraction and fusion module and the SVM-based ensemble classification module. Data of three modalities were collected through a middleware platform, which was deployed for monitoring and collecting the information generated by three types of devices in the UJAmI Smart Lab. The original data description is detailed in [18]. In the first module, features are extracted and represented by the fuzzy logic-based methods and then fused together. In the second module, multiple SVM classifiers are built on the data output from the first module. Based on the geometric framework, optimal weights are calculated and applied for the ensemble.

Figure 1.

Framework of the proposed method for activity recognition.

Feature representation

To provide proper feature representation, the timeline is segmented into equal intervals. Each instance $I_{i} = (\mathit{attr}_{i}, \mathit{label}_{i})$ $(1 \leq i \leq N)$ is defined on a given interval $\Delta t_{i}$, where $\mathit{attr}_{i}$ is the feature vector and $\mathit{label}_{i}$ is the activity label of the $i$-th instance. Let $\Delta t_{i} = t_{i}^{*} - t_{i}$, $t_{i}^{*} > t_{i}$ $(1 \leq i \leq N)$, where $t_{i}^{*}$ represents the ending time and $t_{i}$ the starting time of the temporal segment $\Delta t_{i}$. In order to predict the activity performed within an interval $\Delta t_{i}$ more accurately, it is better to consider not just the information in the current time interval, but also the information from some previous time intervals. This enables proper representation of the dependency among activities.

1. Feature representation for binary sensors

The set of binary sensors is represented by $E = \{E_{1}, E_{2}, \ldots, E_{|E|}\}$, where $|E|$ is the number of binary sensors. Any binary sensor $E_{i}$ $(1 \leq i \leq |E|)$ is described by a set of binary activations

E_{i} = \{E_{i_{0}}, E_{i_{1}}, \ldots, E_{i_{|E_{i}|}}\}, \quad E_{i_{j}} = \begin{cases} \{E_{i_{j}}^{-}, E_{i_{j}}^{+}\} & E_{i} \in E_{A} \\ E_{i_{j}}^{0} & E_{i} \in E_{B} \end{cases}

(1)

where $|E_{i}|$ is the total number of activations for a given binary sensor $E_{i}$ and $E_{i_{j}}$ $(1 \leq j \leq |E_{i}|)$ is the $j$-th activation. There are two types of binary sensors: magnetic sensors, and motion and pressure sensors. They are denoted by $E_{A}$ and $E_{B}$ respectively.

If $E_{i}$ is of type $E_{A}$, then the binary activation $E_{i_{j}}$ $(1 \leq j \leq |E_{i}|)$ is described within a time range between $E_{i_{j}}^{-}$ and $E_{i_{j}}^{+}$. For magnetic sensors, $E_{i_{j}}^{-}$ and $E_{i_{j}}^{+}$ are the starting and ending time of $E_{i_{j}}$, that is, the times at which the “open” and “close” states are activated, respectively.

If $E_{i}$ is of type $E_{B}$, then the binary activation $E_{i_{j}}$ $(1 \leq j \leq |E_{i}|)$ is described as a time point $E_{i_{j}}^{0}$ at which the sensor is activated, that is, the time point when pressure or movement is detected for pressure and movement sensors.

To represent the activity performed in the interval $\Delta t$, the features extracted from binary sensors are represented by using fuzzy temporal windows (FTWs). A given FTW $T_{k}$ is defined by the values $[L_{k-3}, L_{k-2}, L_{k-1}, L_{k}]$ related to a trapezoidal membership function $F_{k}(x)$

F_{k}(x)[L_{k}, L_{k-1}, L_{k-2}, L_{k-3}] = \begin{cases} 0 & x \leq L_{k-3} \\ (x - L_{k-3}) / (L_{k-2} - L_{k-3}) & L_{k-3} < x < L_{k-2} \\ 1 & L_{k-2} \leq x \leq L_{k-1} \\ (L_{k} - x) / (L_{k} - L_{k-1}) & L_{k-1} < x < L_{k} \\ 0 & L_{k} \leq x \end{cases}

(2)

where $F_{k}(x)$ is the fuzzy function corresponding to FTW $T_{k}$; $L_{k-3}$ and $L_{k}$ are the starting and ending time of the given temporal window, respectively. Within FTW $T_{k}$, the activation degree $T_{k}(E_{i_{j}}, t_{i})$ of a binary activation $E_{i_{j}}$ can be computed as follows

T_{k}(E_{i_{j}}, t_{i}) = \begin{cases} \max(F_{k}(t_{i})) & (\forall t_{i} \in E_{i_{j}}) \land (t_{i} \in [L_{k-3}, L_{k}] \cap [E_{i_{j}}^{-}, E_{i_{j}}^{+}]) \land (E_{i} \in E_{A}) \\ F_{k}(t_{i}) & (t_{i} = E_{i_{j}}^{0}) \land (t_{i} \in [L_{k-3}, L_{k}]) \land (E_{i} \in E_{B}) \\ 0 & \text{otherwise} \end{cases}

(3)

The final activation degree of a binary sensor E_i in temporal window $T_{k}$ can be computed via the max operator as follows

T_{k} (E_{i}, t_{i}) = \max (T_{k} (E_{i_{j}}, t_{i})), \forall E_{i_{j}} \in E_{i}

(4)
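Equations (2) to (4) can be illustrated with a minimal Python sketch. It assumes that the four breakpoints and the activation times are expressed on one common, increasing time axis in seconds; the function names and the example window are hypothetical and are not taken from the original implementation. The plateau is tested first so that degenerate windows with two equal breakpoints do not cause a division by zero, which is a choice of this sketch.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership of equation (2), with a = L_{k-3}, b = L_{k-2},
    c = L_{k-1}, d = L_k and a <= b <= c <= d on a common time axis."""
    if b <= x <= c:          # plateau; checked first so a == b or c == d is safe
        return 1.0
    if a < x < b:            # rising edge
        return (x - a) / (b - a)
    if c < x < d:            # falling edge
        return (d - x) / (d - c)
    return 0.0


def interval_activation(start, end, window):
    """Equation (3), type E_A: largest membership reached while the sensor is active."""
    a, b, c, d = window
    lo, hi = max(a, start), min(d, end)
    if lo > hi:              # activation lies entirely outside the window
        return 0.0
    if lo <= c and hi >= b:  # the overlap touches the plateau
        return 1.0
    # Otherwise the overlap sits on one edge, so the maximum is at an endpoint.
    return max(trapezoid(lo, *window), trapezoid(hi, *window))


def point_activation(t0, window):
    """Equation (3), type E_B: membership at the single activation time."""
    return trapezoid(t0, *window)


def sensor_degree(degrees):
    """Equation (4): max over all activations of one sensor (0 if there are none)."""
    return max(degrees, default=0.0)


window = (0.0, 10.0, 20.0, 30.0)                   # hypothetical breakpoints, seconds
magnetic = interval_activation(2.0, 7.0, window)   # ends on the rising edge
motion = point_activation(28.0, window)            # falls on the falling edge
print(magnetic, motion, sensor_degree([magnetic, motion]))
```

With these made-up inputs the interval activation evaluates to 0.7 and the point activation to 0.2, so the sensor degree is the larger of the two.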

Figure 2 shows an example of the calculation of the activation degree for binary sensor E_i within the fuzzy temporal window T_k. The top part of the figure shows that the activation degree is 0.7 for the binary activation $E_{i_{j}}$ , which is fired by a given magnetic sensor. The middle part shows that the activation degree is 0.2 for the activation $E_{i_{j}}$ , which is fired by a given motion or pressure sensor. The bottom part shows that the final activation degree is 0.9 for a given sensor E_i.

Figure 2.

Example of the calculation of the activation degree for binary sensor $E_{i}$ within the fuzzy temporal window T_k. The top part of the figure shows that the activation degree is 0.7 for the binary activation $E_{i_{j}}$ . The middle part shows that the activation degree is 0.2. The bottom part shows that the final activation degree is 0.9.

Within a given FTW $T_{k}$ , the feature is formed by the sequence of activation degrees for each sensor. In our work, we defined multiple FTWs with incremental size to collect long- and short-term activation information for each binary sensor. The size of the temporal feature vector is |T| × |S| which is equal to the number of FTWs multiplied by the number of sensors.

2. Feature representation for BLE beacons

The set of BLE beacons is represented by $B = \{B_{1}, B_{2}, \ldots, B_{|B|}\}$, where $|B|$ is the number of beacons. Each BLE beacon is described by a set of signals received from a smart watch

B_{i} = \{B_{i_{0}}, B_{i_{1}}, \ldots, B_{i_{|B_{i}|}}\}, \quad B_{i_{j}} = \{\mathit{RecTime}_{i_{j}}, \mathit{RSSI}_{i_{j}}\}

(5)

where $|B_{i}|$ is the total number of signals for a given beacon $B_{i}$; signal $B_{i_{j}}$ is represented by the Received Signal Strength Indicator (RSSI) value measured by the smart watch and the corresponding timestamp.

In this work, we take the beacon signals as input to extract temporal and spatial features. On the one hand, the timestamp can be used to extract the relationship among activities with multiple fuzzy temporal windows. On the other hand, because the RSSI value is determined by the distance between the wearable device and the beacon, spatial features can be extracted from the RSSI values to represent the distance between the inhabitant and the beacons.

The temporal feature of the BLE beacon signals is also extracted by using fuzzy temporal windows. Within the given FTW $T_{k}$, we define the temporal feature $T_{k}(B_{i_{j}}, t_{i})$ by using the timestamp of the beacon signal $B_{i_{j}}$ as follows

T_{k}(B_{i_{j}}, t_{i}) = \begin{cases} F_{k}(t_{i}) & (t_{i} = \mathit{RecTime}_{i_{j}}) \land (\mathit{RecTime}_{i_{j}} \in [L_{k-3}, L_{k}]) \\ 0 & \text{otherwise} \end{cases}

(6)

In equation (6), $F_{k}(t_{i})$ is the trapezoidal membership function corresponding to the FTW as defined in equation (2). Furthermore, according to the transmission power and the characteristics of the beacons used, some fuzzy sets are proposed to describe the spatial features depending on the detected power of Bluetooth signals.

The spatial features describe the distance degree between the inhabitant and the BLE beacon. In our work, three fuzzy sets $FR^{l}(x)$ $(l \in \{1, 2, 3\})$ are characterized by the trapezoidal membership function. We extracted the spatial feature from the RSSI values by using three fuzzy sets (“Near”, “Far”, “Very far”). Figure 3 shows an example of the spatial feature calculation process for beacon signal $B_{i_{j}}$.

Figure 3.

Example of spatial feature calculation process for beacon signal $B_{i_{j}}$ . The RSSI value of signal $B_{i_{j}}$ is set to 0 for “Very far”, 0.7 for “Far”, and 0.35 for “Near” distance.

In Figure 3, the fuzzy set “Near” represents measurements of Bluetooth signal power close to −60 dBm and is associated with distances close to 0 m from the beacon. The fuzzy set “Far” represents less powerful signals, down to around −100 dBm, and is associated with closer distances. The fuzzy set “Very far” represents signal power below −100 dBm and is associated with long distances. Figure 3 describes the RSSI value of the beacon signal $B_{i_{j}}$ for “Very far”, “Far”, and “Near” distance.

With the spatial and temporal features, the aggregation degree of beacon $B_{i}$ within FTW $T_{k}$ can be computed by using max and min operators (equation (7)). The aggregation of the spatial-temporal features is a triple $\langle TS_{k}^{1}(B_{i}, t_{i}), TS_{k}^{2}(B_{i}, t_{i}), TS_{k}^{3}(B_{i}, t_{i}) \rangle$. The element $TS_{k}^{l}(B_{i}, t_{i})$ $(1 \leq l \leq 3)$ is described as follows.

TS_{k}^{l}(B_{i}, t_{i}) = \max(\min(T_{k}(B_{i_{j}}, t_{i}), FR_{k}^{l}(\mathit{RSSI}_{i_{j}}))), \quad (\forall t_{i} = \mathit{RecTime}_{i_{j}}) \land (\mathit{RecTime}_{i_{j}} \in [L_{k-3}, L_{k}])

(7)

Figure 4 shows an example of the feature aggregation process with the “Far” membership function within the FTW $T_{k}$ ; there are three temporal fuzzy values and three spatial fuzzy values from three beacon signals $B_{i_{1}}$ , $B_{i_{2}}$ and $B_{i_{3}}$ , respectively. Within FTW T_k, the spatial-temporal aggregation feature of “Far” and beacon B_i is 0.4 as per the max and min operators.

Figure 4.

Example of feature aggregation process with “Far” membership function in FTW $T_{k}$ . The spatial-temporal aggregation feature of “Far” and beacon B_i is 0.4.

The spatial-temporal feature is formed by the sequence of aggregated degrees within the FTW $T_{k}$ for each BLE beacon. We also used multiple FTWs to collect the activity dependency information for each BLE beacon. The size of the spatial-temporal feature vector is 3×|T|×|B|, where |T| is the number of FTWs and |B| the number of beacons.
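The max-min aggregation of equation (7) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the original implementation: the RSSI breakpoints are the ones listed in Table 3, while the function names, the temporal window, and the three example signals are hypothetical, and timestamps are assumed to lie on the same axis as the window breakpoints.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership (equation (2)); plateau first to tolerate a == b or c == d."""
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0


# RSSI fuzzy sets in dBm, breakpoints as listed in Table 3.
RSSI_SETS = {
    "near": (-70, -60, 0, 0),
    "far": (-105, -95, -70, -60),
    "very_far": (-200, -105, -95, -70),
}


def beacon_feature(signals, window):
    """Equation (7) for one beacon and one FTW.

    signals: list of (rec_time, rssi) pairs received from the beacon.
    window:  the four FTW breakpoints (L_{k-3}, L_{k-2}, L_{k-1}, L_k).
    Returns the triple of aggregated degrees keyed by fuzzy-set name.
    """
    feature = {}
    for name, bounds in RSSI_SETS.items():
        degrees = [
            min(trapezoid(t, *window), trapezoid(rssi, *bounds))  # temporal AND spatial
            for t, rssi in signals
            if window[0] <= t <= window[3]                        # signal inside the FTW
        ]
        feature[name] = max(degrees, default=0.0)                 # best-supported signal
    return feature


window = (0.0, 10.0, 20.0, 30.0)                        # hypothetical FTW, seconds
signals = [(5.0, -65.0), (15.0, -100.0), (25.0, -80.0)]  # made-up (time, RSSI) pairs
print(beacon_feature(signals, window))
```

For each fuzzy set, the min operator keeps a signal only to the degree that it is both inside the temporal window and inside the RSSI band, and the max operator then selects the strongest such signal for the beacon.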

3. Feature representation for the floor sensor

Here, the inhabitant’s location information can be extracted from the smart floor data to represent the activity with the fixed-window method. The indoor location information is estimated by means of distance measured in a given temporal window $Δ t$ . We take the distances between inhabitant and the binary sensors and BLE Beacons as inputs to obtain the location feature for floor sensor data.

Within a given interval $\Delta t$, multiple location points are measured, and the inhabitant’s location is determined by their center of gravity. Let $Z_{i} = (x_{i}, y_{i}, c_{i})$, $i \in \{1, \ldots, n\}$, be the $n$ positions measured by the smart floor within $\Delta t$; $c_{i}$ $(1 \leq i \leq n)$ is the measured electronic capacity corresponding to the $i$-th position and $(x_{i}, y_{i})$ $(1 \leq i \leq n)$ is the $i$-th measured position in the smart lab. The inhabitant’s location $z_{\Delta t} = (x_{\Delta t}, y_{\Delta t})$ within the interval $\Delta t$ can be calculated by

C = \sum_{i=1}^{n} c_{i}, \quad x_{\Delta t} = \frac{1}{C} \sum_{i=1}^{n} c_{i} x_{i}, \quad y_{\Delta t} = \frac{1}{C} \sum_{i=1}^{n} c_{i} y_{i}

(8)

where $C$ is the total capacity of all measured points and $(x_{\Delta t}, y_{\Delta t})$ is the location point in 2D world coordinates with the origin in the lower left corner of the UJAmI lab. The distance $\mathit{Dist}_{\Delta t}$ between the inhabitant location and the coordinate of a BLE beacon or binary sensor can be calculated as follows

\mathit{Dist}_{\Delta t} = \sqrt{(x_{\Delta t} - x_{\mathrm{sen}})^{2} + (y_{\Delta t} - y_{\mathrm{sen}})^{2}}

(9)

where $(x_{\mathrm{sen}}, y_{\mathrm{sen}})$ is the coordinate of a given BLE beacon or binary sensor. We define the location feature representation here by using fuzzy sets to describe the fuzzy distance between the inhabitant and the BLE beacon or binary sensor. Six fuzzy sets $FD^{l}(x)$ $(l \in \{1, \ldots, 6\})$ characterized by the triangular membership function (equation (10)) were defined to represent the fuzzy distance feature.

FD^{l}(x)[D_{i-1}^{l}, D_{i}^{l}, D_{i+1}^{l}] = \begin{cases} 0 & x \leq D_{i-1}^{l} \\ \frac{x - D_{i-1}^{l}}{D_{i}^{l} - D_{i-1}^{l}} & D_{i-1}^{l} < x < D_{i}^{l} \\ \frac{D_{i+1}^{l} - x}{D_{i+1}^{l} - D_{i}^{l}} & D_{i}^{l} \leq x < D_{i+1}^{l} \\ 0 & x \geq D_{i+1}^{l} \end{cases}

(10)

Figure 5 shows an example of the feature vector (0.4, 0.6, 0, 0, 0, 0) calculated to describe the distance between the inhabitant and the given BLE beacon or binary sensor within the interval $Δ t$ .

Figure 5.

Example of feature representation for location information. The distance between the inhabitant and the given BLE beacon or binary sensor within the interval $Δ t$ is represented by the vector (0.4, 0.6, 0, 0, 0, 0).

The size of the distance feature vector is 6×|S|, where |S| is the total number of binary sensors and BLE beacons.
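The location feature defined by equations (8) to (10) can be sketched as below. The triangular breakpoints are those listed in Table 4; the function names, the floor readings, and the sensor coordinate are hypothetical values chosen for illustration. Where the first fuzzy set has two equal breakpoints, this sketch lets the falling branch apply at zero distance, which is an assumption about how the overlapping cases of equation (10) are resolved.

```python
import math


def center_of_gravity(points):
    """Equation (8): capacity-weighted mean of the measured floor positions.

    points: iterable of (x, y, c) tuples, with c the measured capacity.
    """
    points = list(points)
    total = sum(c for _, _, c in points)
    x = sum(c * px for px, _, c in points) / total
    y = sum(c * py for _, py, c in points) / total
    return x, y


def triangle(x, left, peak, right):
    """Equation (10): triangular membership with breakpoints D_{i-1}, D_i, D_{i+1}."""
    if left < x < peak:
        return (x - left) / (peak - left)
    if peak <= x < right:
        return (right - x) / (right - peak)
    return 0.0


# Distance fuzzy sets FD^1 .. FD^6 in metres, breakpoints as listed in Table 4.
DIST_SETS = [(0, 0, 1), (0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6)]


def distance_feature(location, sensor_xy):
    """Equation (9) followed by the six fuzzy distance degrees for one device."""
    dist = math.hypot(location[0] - sensor_xy[0], location[1] - sensor_xy[1])
    return [triangle(dist, *bounds) for bounds in DIST_SETS]


readings = [(1.0, 2.0, 1.0), (2.0, 2.0, 3.0)]    # made-up (x, y, capacity) readings
location = center_of_gravity(readings)            # (1.75, 2.0)
print(distance_feature(location, (1.75, 2.6)))    # sensor placed 0.6 m away
```

With the Table 4 breakpoints, a distance of 0.6 m gives a vector of approximately (0.4, 0.6, 0, 0, 0, 0), the same vector as the one quoted for Figure 5.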

In our work, all three different types of features are combined at feature-level for classification. Then they are used by all base classifiers and the ensemble.

The geometric-framework-based ensemble

In this paper, a geometric framework describing the weighted ensemble approach is proposed to classify human activities in real time. Wu and Crestani³² first proposed a geometric framework for data fusion in the context of information retrieval. Bonab and Can^33,34 extended this framework to online multi-label data stream classification tasks. They used a dynamic weighting ensemble approach to obtain the optimal weights for all base classifiers at the data chunk level or sliding window level. Wu and Ding extended the geometric-framework-based linear ensemble method to the dataset level.³⁵

In this paper, we use a $p$-dimensional space to formulate the optimal weighted ensemble method. For a given instance $I_{i}$ $(1 \leq i \leq N)$ with class label vector $O_{i} = (o_{i}^{1}, o_{i}^{2}, \ldots, o_{i}^{p})^{T}$, the base classifier $BS_{j}$ $(1 \leq j \leq M)$ returns the predictive score vector $S_{ij} = (s_{ij}^{1}, s_{ij}^{2}, \ldots, s_{ij}^{p})^{T}$. $S_{ij}$ $(1 \leq j \leq M)$ and $O_{i}$ can be represented as $M$ prediction points and the ideal point in the $p$-dimensional space, respectively.

Definition 1

For a given instance $I_{i}$ $(1 \leq i \leq N)$, the soft majority voting approach (SMV) calculates the centroid $A_{i} = (a_{i}^{1}, a_{i}^{2}, \ldots, a_{i}^{p})^{T}$ of the $M$ prediction points $S_{ij} = (s_{ij}^{1}, s_{ij}^{2}, \ldots, s_{ij}^{p})^{T}$ $(1 \leq j \leq M)$ as

a_{i}^{k} = \frac{1}{M} \sum_{j=1}^{M} s_{ij}^{k} \quad (1 \leq k \leq p)

(11)
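As a small illustration of Definition 1, the following Python sketch computes the centroid of the prediction points for one instance. The function name and the score values are hypothetical, and taking the class with the largest centroid component as the prediction is an assumption of this sketch, since the text does not spell out the decision rule.

```python
def soft_majority_vote(scores):
    """Equation (11): component-wise mean of the M score vectors of one instance.

    scores: list of M lists, each holding p class scores from one base classifier.
    Returns the centroid A_i and the index of its largest component.
    """
    m = len(scores)
    centroid = [sum(column) / m for column in zip(*scores)]
    predicted = max(range(len(centroid)), key=centroid.__getitem__)
    return centroid, predicted


# Three hypothetical base classifiers scoring one instance over three classes.
scores = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.7, 0.2],
]
centroid, predicted = soft_majority_vote(scores)
print(centroid, predicted)
```

Soft majority voting is the special case of the weighted ensemble in which every base classifier receives the same weight 1/M.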

Definition 2

The optimal weighted ensemble (OWE) approach serves to reveal the optimum weights $W = {(w_{1}, w_{2}, . . ., w_{M})}^{T}$ for all base classifiers to minimize the distance between M prediction points $S_{i j}$ and the ideal-point $O_{i}$ over all training instances. The optimization goal is to minimize the following expression

\begin{array}{l} \sum_{i = 1}^{N} \sqrt{{(S_{i} W - O_{i})}^{T} (S_{i} W - O_{i})} \end{array}

(12)

where

S_{i} = [S_{i1}, S_{i2}, \ldots, S_{iM}] = \begin{bmatrix} s_{i1}^{1} & s_{i2}^{1} & \cdots & s_{iM}^{1} \\ s_{i1}^{2} & s_{i2}^{2} & \cdots & s_{iM}^{2} \\ \vdots & \vdots & \ddots & \vdots \\ s_{i1}^{p} & s_{i2}^{p} & \cdots & s_{iM}^{p} \end{bmatrix}

(13)

However, it is not straightforward to work out the solution of this problem directly. Rather than minimizing (12), it is common to minimize the sum of squared distances

\begin{array}{l} \sum_{i = 1}^{N} {(S_{i} W - O_{i})}^{T} (S_{i} W - O_{i}) \end{array}

(14)

We define $\hat{A} = \sum_{i = 1}^{N} S_{i}^{T} S_{i}$ , $\hat{b} = \sum_{i = 1}^{N} S_{i}^{T} O_{i}$ and $\hat{c} = \sum_{i = 1}^{N} O_{i}^{T} O_{i}$ , thus we get the following function

\begin{array}{l} f (W) = \sum_{i = 1}^{N} {(S_{i} W - O_{i})}^{T} (S_{i} W - O_{i}) \\ = W^{T} (\sum_{i = 1}^{N} S_{i}^{T} S_{i}) W - 2 W^{T} (\sum_{i = 1}^{N} S_{i}^{T} O_{i}) + \sum_{i = 1}^{N} O_{i}^{T} O_{i} \\ = W^{T} \hat{A} W - 2 W^{T} \hat{b} + \hat{c} \end{array}

(15)

where $\hat{A}$ is symmetric and nonnegative definite, and $f(W)$ is convex. Taking the derivative with respect to $W^{T}$, we have

\frac{d f}{d W^{T}} = 2 (\hat{A} W - \hat{b})

(16)

Theorem 1

If $\hat{A}$ is positive definite (thus invertible), $f (W)$ is strictly convex and we get the unique minimum solution

W^{*} = {\hat{A}}^{- 1} \hat{b}

(17)

If $\hat{A}$ is only nonnegative definite, we can use the Moore-Penrose pseudoinverse $\hat{A}^{\dagger}$ and the minimum norm solution

W^{*} = \hat{A}^{\dagger} \hat{b}

(18)

Furthermore, to get nonnegative weights, we can use quadratic programming.

In Theorem 1, if $\hat{A}$ is positive definite, then $\hat{A}$ has full rank and $W$ has the unique solution $W^{*} = \hat{A}^{-1} \hat{b}$. In our work, this holds when the outputs of the base classifiers are linearly independent of each other, which is equivalent to a full-rank $\hat{A}$. Because $\hat{A}$ is nonnegative definite and $f$ is therefore convex, the solution $W^{*} = \hat{A}^{-1} \hat{b}$ of equation (15) must be the global minimum point of $f$, since it is the only point at which the derivative of $f$ is zero.
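The closed-form solution of equation (17) can be sketched in plain Python by accumulating $\hat{A}$ and $\hat{b}$ over the training instances and solving the normal equations $\hat{A} W = \hat{b}$. This is a minimal sketch, not the original implementation: the function names and the toy scores are hypothetical, and the small Gauss-Jordan solver stands in for a numerical library routine (for instance `numpy.linalg.lstsq` or `numpy.linalg.pinv`, which would also cover the rank-deficient case of equation (18)).

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("A-hat is singular; a pseudoinverse is needed")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [u - factor * v for u, v in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def optimal_weights(scores, targets):
    """Equation (17): W* = A-hat^{-1} b-hat.

    scores[i][j] is the p-dimensional score vector S_ij of classifier j on instance i;
    targets[i] is the p-dimensional ideal point O_i (for example a one-hot label).
    """
    m = len(scores[0])
    A = [[0.0] * m for _ in range(m)]
    b = [0.0] * m
    for s_i, o_i in zip(scores, targets):
        for j in range(m):
            b[j] += sum(s * o for s, o in zip(s_i[j], o_i))            # S_i^T O_i
            for l in range(m):
                A[j][l] += sum(u * v for u, v in zip(s_i[j], s_i[l]))  # S_i^T S_i
    return solve(A, b)


def combine(instance_scores, weights):
    """Weighted prediction point S_i W for one instance."""
    p = len(instance_scores[0])
    return [sum(w * s[k] for w, s in zip(weights, instance_scores)) for k in range(p)]


# Toy data: classifier 0 reproduces the labels exactly, classifier 1 is noisy.
targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
scores = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.0, 1.0], [0.6, 0.4]],
    [[1.0, 0.0], [0.2, 0.8]],
]
weights = optimal_weights(scores, targets)
print(weights, combine(scores[0], weights))
```

In this toy case the first classifier already hits every ideal point, so the minimizer puts a weight of approximately 1 on it and approximately 0 on the noisy one. The sketch does not enforce nonnegative weights; that would require the quadratic programming step mentioned after Theorem 1.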

Experimental setup

The UJAmI dataset^17,18 is used; it was collected in the UJAmI Smart Lab at the University of Jaén (Spain). Multi-modal data were collected from three types of sources: 30 binary sensors, 15 Bluetooth Low Energy (BLE) beacons, and smart floors. One inhabitant stayed in the lab for a period of 10 days and performed 24 different activities. Considering the class skew in the dataset, we removed the three classes with the lowest frequency of occurrence. The data were segmented into non-overlapping five-second windows. Each such window is regarded as an instance; thus we obtain over 8880 instances in total. The task is to recognize the activity of the inhabitant in each time window. The main characteristics of the dataset are listed in Table 1.

Table 1.

Detailed activity information of the UJAmI dataset.

No.	Activity	Frequency	Number of instances
1	Take medication	10	231
2	Prepare breakfast	10	393
3	Prepare lunch	8	864
4	Prepare dinner	10	504
5	Breakfast	10	579
6	Lunch	8	657
7	Dinner	10	744
8	Eat a snack	5	90
9	Watch TV	7	627
10	Enter	17	243
11	Relax	6	840
12	Leave	15	222
13	Put waste	15	423
14	Wash hands	9	138
15	Brush teeth	30	552
16	Use the toilet	12	183
17	Put washing	6	84
18	Work	5	429
19	Dressing	23	519
20	Go to the bed	11	249
21	Wake up	10	309

Accuracy (Acc), Precision (P), Recall (R), and F1-score ($F_{1}$) were used to evaluate the performance of the proposed model and to compare it with other methods.
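The four metrics can be computed as in the sketch below. The text does not state how the per-class values are averaged, so the support-weighted average used here is an assumption; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1 (assumed averaging)."""
    n = len(y_true)
    support = Counter(y_true)                    # instances per true class
    predicted = Counter(y_pred)                  # instances per predicted class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    accuracy = sum(correct.values()) / n
    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = count / n                       # class share of the test set
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1


y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
print(weighted_metrics(y_true, y_pred))
```

Under this averaging the weighted recall always equals the accuracy, because each class recall is weighted by that class's share of the instances.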

Multiple continuous FTWs with different sizes were defined as listed in Table 2 to segment the timeline and generate the temporal features from the signals of binary sensors and BLE beacons.

Table 2.

Multiple continuous FTWs described by trapezoidal membership function were used to calculate the activity features for binary sensors and temporal features for BLE beacon.

FTW	L_{k-3} (s)	L_{k-2} (s)	L_{k-1} (s)	L_k (s)
T ₁	30	5	0	0
T ₂	60	30	5	0
T ₃	120	60	30	5
T ₄	180	120	60	30
T ₅	300	180	120	60
T ₆	900	300	180	120
T ₇	1800	900	300	180
T ₈	3600	1800	900	300
T ₉	10,800	3600	1800	900
T ₁₀	25,200	10,800	3600	1800
T ₁₁	36,000	25,200	10,800	3600

Fuzzy sets as shown in Table 3 were used to generate the spatial features from the BLE beacon signals.

Table 3.

Fuzzy sets described by trapezoidal membership function were used to calculate spatial features.

Fuzzy set	L_{k-3}	L_{k-2}	L_{k-1}	L_k
Near	−70 dBm	−60 dBm	0 dBm	0 dBm
Far	−105 dBm	−95 dBm	−70 dBm	−60 dBm
Very far	−200 dBm	−105 dBm	−95 dBm	−70 dBm

Fuzzy sets as listed in Table 4 were defined to generate the distance features referencing the location information from binary sensor and BLE beacon signals.

Table 4.

Fuzzy sets characterized by triangle membership function were used to represent the distance feature.

Fuzzy set	D_{i-1} (m)	D_i (m)	D_{i+1} (m)
FD ¹	0	0	1
FD ²	0	1	2
FD ³	1	2	3
FD ⁴	2	3	4
FD ⁵	3	4	5
FD ⁶	4	5	6

The feature vector obtained from the binary sensor data is 11×30, where 30 is the number of binary sensors and 11 the number of FTWs. The feature vector extracted from beacon signals is 15×11×3, where 15 is the number of BLE beacons, 11 the number of FTWs, and 3 the number of fuzzy sets. The feature vector extracted from the smart floor is 45×6 in size; there are 45 binary sensors and BLE beacons with six fuzzy sets. The feature sets obtained from different sensors are combined and the final feature vector has 1095 elements.
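The dimension count above can be checked with a few lines of Python. The constant names are hypothetical; the values are the ones stated in this section.

```python
N_BINARY = 30      # binary sensors
N_BEACONS = 15     # BLE beacons
N_FTW = 11         # fuzzy temporal windows (Table 2)
N_RSSI_SETS = 3    # "Near", "Far", "Very far" (Table 3)
N_DIST_SETS = 6    # FD^1 .. FD^6 (Table 4)

binary_dim = N_FTW * N_BINARY                       # 11 x 30
beacon_dim = N_BEACONS * N_FTW * N_RSSI_SETS        # 15 x 11 x 3
floor_dim = (N_BINARY + N_BEACONS) * N_DIST_SETS    # 45 x 6
total_dim = binary_dim + beacon_dim + floor_dim

print(binary_dim, beacon_dim, floor_dim, total_dim)
```

The three blocks contribute 330, 495, and 270 features respectively, which add up to the 1095 elements of the fused vector.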

The models were validated by using a “leave 1 day out” approach, in which one full day of data was retained for testing and the remaining days were used as training data. The process was repeated for each day and the performance was averaged. We selected the SVM model with a linear kernel to build the base classifiers, for three reasons. First, SVM is suitable for high-dimensional data and produces accurate classifiers. Second, SVM is sensitive to the training data and can be combined with sampling techniques to produce a group of classifiers with healthy diversity; previous theoretical and experimental studies have revealed that the performance of a classifier ensemble is determined by both the accuracy of and the diversity among the individual classifiers in the ensemble. Third, among the kernels available for SVM, the linear kernel is more efficient than the others. In total, 30 base classifiers were generated. Rather than using the whole training set, each of them was trained on a sample of 60% of the training instances drawn with replacement, which yields more diverse base classifiers. All the algorithms were implemented using the WEKA machine-learning suite.
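The validation and sampling protocol can be sketched in Python as follows. This only illustrates the index bookkeeping, not the WEKA setup that was actually used: the function names are hypothetical, and the sample size is assumed to be 60% of the number of training instances.

```python
import random


def leave_one_day_out(days):
    """Yield (held_out_day, train_indices, test_indices) for each distinct day.

    days: list giving the recording day of every instance.
    """
    for held_out in sorted(set(days)):
        train = [i for i, d in enumerate(days) if d != held_out]
        test = [i for i, d in enumerate(days) if d == held_out]
        yield held_out, train, test


def bagging_samples(train_indices, n_models=30, fraction=0.6, seed=0):
    """Draw one index sample per base classifier, with replacement."""
    rng = random.Random(seed)
    size = round(fraction * len(train_indices))
    return [rng.choices(train_indices, k=size) for _ in range(n_models)]


# Toy example: 15 instances spread over 3 days.
days = [1] * 5 + [2] * 5 + [3] * 5
for held_out, train, test in leave_one_day_out(days):
    samples = bagging_samples(train)
    print(held_out, len(train), len(test), len(samples), len(samples[0]))
```

Each fold trains 30 base classifiers on different samples of the same training days, and the reported performance is the average over the folds.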

The proposed method comprises two parts: support vector machines and the weighted ensemble. Let $l$ be the number of instances in the training process, $D_{L}$ the dimension of the input data, $N_{S}$ the number of support vectors, $N_{M}$ the number of support vector machines trained, and $N_{C}$ the number of class labels. For training, the time complexity of each support vector machine is bounded above by $O(N_{C} D_{L} l^{2})$,³⁶ and the time complexity of the weighted ensemble is $O(l D_{L} N_{M}^{2} + N_{M}^{3})$.³⁷ Therefore, the total training complexity is $O(N_{M} N_{C} D_{L} l^{2} + l D_{L} N_{M}^{2} + N_{M}^{3})$. For testing, the time complexity of a support vector machine is $O(N_{C} D_{L} N_{S})$, and that of the linear combination is $O(N_{C} N_{M})$. Therefore, the total testing complexity is $O(N_{C} D_{L} N_{S} + N_{C} N_{M})$.

Results and discussion

To evaluate the performance of the proposed approach, we compared it with several commonly used classifiers; the results are given in Table 5. The proposed method performs best, IBK second-best, and J48 worst on all measures. Compared with IBK, the proposed method achieves relative improvements of about 4%, 4%, 4%, and 3% in Accuracy, Precision, Recall, and F1, respectively.

Table 5.

Performance comparison of the proposed method and a few commonly used technologies.

Classifier	Accuracy	Precision	Recall	F1
IBK	0.810	0.841	0.810	0.812
Random forest	0.698	0.742	0.698	0.696
Bayesian network	0.735	0.811	0.735	0.738
J48	0.516	0.552	0.516	0.497
Proposed	0.839	0.871	0.843	0.840
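The improvement over IBK can be recomputed from the two rows of Table 5. The short sketch below treats the reported percentages as relative gains, which is an interpretation on our part; the helper name `relative_gain` is introduced only for this example.

```python
# Rows copied from Table 5.
ibk = {"Accuracy": 0.810, "Precision": 0.841, "Recall": 0.810, "F1": 0.812}
proposed = {"Accuracy": 0.839, "Precision": 0.871, "Recall": 0.843, "F1": 0.840}

def relative_gain(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base

gains = {metric: relative_gain(proposed[metric], ibk[metric]) for metric in ibk}
rounded = {metric: round(gain) for metric, gain in gains.items()}
# rounded == {"Accuracy": 4, "Precision": 4, "Recall": 4, "F1": 3}
```

In absolute terms, the differences are about 0.03 on every measure (0.029, 0.030, 0.033, and 0.028).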

Table 6 lists all the previous studies on the UJAmI dataset that we could find. All the methods except [31] and ours were evaluated using 7 days of data as the training set and the last 3 days as the test set, with a 30-s interval. Our study achieved an accuracy of 83.9% using the “leave 1 day out” approach. Because these studies differ in several respects, including the number of activities recognized, the sensor data used, and the methodology, the proposed approach is not directly comparable with them, and it would be questionable to claim that it outperforms them. Nevertheless, the results indicate that the proposed method is very competitive.

Table 6.

Performance comparison of several studies on UJAmI dataset.

Study	Techniques	Training/Testing	No. of activities	Accuracy	Sensors
[31]	Neural network ensemble	They reconstructed data of 10 days for training; 85% of the training data were randomly selected for testing	12	0.804	Binary sensors
[38]	AdaBoostM1	7 days for training, 3 days for testing	24	0.601	Binary sensors, BLE Beacons, Intelligent floor and acceleration
[39]	Naive Bayes	7 days for training, 3 days for testing	24	0.680	Binary sensors, BLE Beacons, Intelligent floor and acceleration
[40]	Rule-based approach	7 days for training, 3 days for testing	24	0.813	Binary sensors
[41]	Hybrid model	7 days for training, 3 days for testing	24	0.450	Binary sensor and the proximity
[42]	Filtered classifier	7 days for training, 3 days for testing	24	0.470	Binary sensors, BLE Beacons, Intelligent floor and acceleration
Proposed	SVM ensemble	Leave 1 day out	21	0.839	Binary sensors, BLE Beacons and Intelligent floor

We also investigated how the different types of sensors contribute to activity recognition, individually and in combination. Each of the three sensor types, every pair, and all three together were tested; the results are shown in Table 7. Among the single sensor types, the beacons achieve the best performance, and the binary sensors come second, not far behind. The smart floor performs worst and is not very useful when used alone. Every pair of fused sensor types performs better than any single type, which suggests that fusing sensing information improves recognition performance. Fusing binary and beacon sensors yields only a slight improvement over beacon sensors alone, possibly because the fuzzy features of the two represent similar relationships between activities within the multi-temporal windows. Although the recognition rate is low when relying on smart floor information alone, performance improves significantly when other types of sensors provide supplementary information. Our method achieves its best Accuracy, Recall, and F1 when all three modalities are fused at the feature level; only Precision is marginally higher (0.875 versus 0.871) for the beacon and smart floor pair.

Table 7.

Performance comparison of different sensor combinations.

Different combination of sensors	Accuracy	Precision	Recall	F1
Binary sensors only	0.702	0.752	0.708	0.691
Beacon only	0.778	0.799	0.784	0.773
Smart floor only	0.270	0.392	0.275	0.269
Binary sensors and Beacon	0.789	0.834	0.794	0.783
Binary sensors and smart floor	0.802	0.842	0.806	0.807
Beacon and smart floor	0.808	0.875	0.812	0.807
Binary sensors, Beacon and smart floor	0.839	0.871	0.843	0.840
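The seven rows of Table 7 correspond to every non-empty subset of the three modalities. The sketch below enumerates those subsets and shows feature-level fusion in its simplest form, as concatenation of the per-modality feature vectors. The concatenation, the modality keys, and the feature values are assumptions made for illustration and are not the fuzzy features extracted in the study.

```python
from itertools import combinations

MODALITIES = ("binary", "beacon", "floor")

def sensor_combinations(modalities=MODALITIES):
    """All non-empty subsets: 3 single types, 3 pairs, and 1 triple."""
    combos = []
    for size in range(1, len(modalities) + 1):
        combos.extend(combinations(modalities, size))
    return combos

def fuse(features, selected):
    """Feature-level fusion: concatenate the vectors of the selected modalities."""
    fused = []
    for name in selected:
        fused.extend(features[name])
    return fused

# Hypothetical feature vectors for one time window.
features = {
    "binary": [1.0, 0.0, 0.2],
    "beacon": [0.7, 0.1],
    "floor": [0.0, 0.4, 0.9, 0.3],
}
combos = sensor_combinations()
fused_all = fuse(features, MODALITIES)
```

A classifier trained on `fused_all` sees all modalities at once, whereas training on `fuse(features, ("beacon",))` would correspond to the "Beacon only" row.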

As Table 8 shows, activities with fixed temporal relationships, such as preparing breakfast, eating breakfast, preparing lunch, eating lunch, preparing dinner, eating dinner, dressing, going to bed, and waking up, have relatively high recognition rates. This is because we used multi-temporal windows to represent the dependency between the current activity and the previous activity. The recognition rate for work is high as well; we believe this is because the relationship between the morning activities and the current activity is generally consistent and certain activities are performed at a fixed time in the morning. When the temporal relationship of an activity is not consistent, the recognition rate decreases. For example, taking medication, disposing of waste, relaxing, leaving, and using the toilet have no fixed time dependence on previous activities. The Smart Lab inhabitant took medicine seven times after dinner in the evening and three times after lunch in the afternoon, which undermined the multi-window feature representation and led to poor recognition. The recognition rates for brushing teeth and washing hands were also poor, which we attribute to the fact that these activities are performed in the same location and trigger the same sensors. The low recognition rates for putting washing and eating a snack, by contrast, may be attributed to a serious class imbalance.

Table 8.

Per-activity performance of the proposed method in terms of Precision, Recall, and F1.

No.	Activity	Precision	Recall	F1
1	Take medication	0.716	0.609	0.626
2	Prepare breakfast	0.955	0.878	0.911
3	Prepare lunch	0.970	0.971	0.970
4	Prepare dinner	0.951	0.912	0.926
5	Breakfast	0.848	0.984	0.909
6	Lunch	0.860	0.965	0.903
7	Dinner	0.858	0.983	0.913
8	Eat a snack	0.900	0.653	0.750
9	Watch TV	0.808	0.751	0.774
10	Enter	0.853	0.851	0.851
11	Relax	0.845	0.714	0.719
12	Leave	0.892	0.649	0.723
13	Put waste	0.780	0.743	0.752
14	Wash hands	0.652	0.765	0.679
15	Brush teeth	0.783	0.749	0.761
16	Use the toilet	0.700	0.313	0.418
17	Put washing	0.739	0.704	0.714
18	Work	0.967	0.941	0.953
19	Dressing	0.884	0.870	0.874
20	Go to the bed	0.849	0.952	0.874
21	Wake up	0.853	0.971	0.906

As shown in Table 9, performance on Days 4, 8, and 10 is not as good as on the other days of the 10-day observation period. The poor performance on Days 4 and 10 may be due to behavior patterns in those afternoons that diverge from the other days: the inhabitant left the UJAmI Smart Lab after dressing and brushing his teeth on Days 4 and 10, whereas on the other days he performed a series of activities after preparing and eating lunch before leaving. The activity sequence of Day 8 is also quite different from that of most other days; we hypothesize that this discrepancy is the main reason for its poor performance.

Table 9.

Daily recognition performance of the proposed method.

Day no.	Accuracy	Precision	Recall	F1
1	0.870	0.932	0.873	0.890
2	0.840	0.866	0.846	0.841
3	0.900	0.913	0.903	0.893
4	0.810	0.815	0.810	0.807
5	0.880	0.895	0.886	0.883
6	0.890	0.900	0.891	0.888
7	0.920	0.937	0.925	0.921
8	0.780	0.865	0.780	0.795
9	0.840	0.802	0.842	0.808
10	0.660	0.783	0.670	0.676
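Because each day was held out once and the results were averaged, the overall figures can be recovered from the per-day rows. The sketch below averages the columns of Table 9; the function name `average_over_days` is introduced only for this example.

```python
from statistics import mean

# (Accuracy, Precision, Recall, F1) for Days 1 to 10, copied from Table 9.
daily = [
    (0.870, 0.932, 0.873, 0.890),
    (0.840, 0.866, 0.846, 0.841),
    (0.900, 0.913, 0.903, 0.893),
    (0.810, 0.815, 0.810, 0.807),
    (0.880, 0.895, 0.886, 0.883),
    (0.890, 0.900, 0.891, 0.888),
    (0.920, 0.937, 0.925, 0.921),
    (0.780, 0.865, 0.780, 0.795),
    (0.840, 0.802, 0.842, 0.808),
    (0.660, 0.783, 0.670, 0.676),
]

def average_over_days(rows):
    """Average each metric column over the held-out days, to 3 decimals."""
    return tuple(round(mean(column), 3) for column in zip(*rows))

averages = average_over_days(daily)
# averages == (0.839, 0.871, 0.843, 0.84)
```

These averages match the "Proposed" row of Table 5 (0.839, 0.871, 0.843, 0.840), which confirms that the headline results are unweighted means over the ten leave-one-day-out folds.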

Conclusion

In this paper, we have presented a method for fusing three types of sensor data for HAR in smart environments. Fuzzy logic-based methods were used to extract features from the sensors involved. A geometric framework was used to describe ensemble learning and to calculate the optimal weights. Evaluated on the UJAmI dataset, the proposed method is very competitive with other methods. We also found that combining all three data sources improved classification performance compared with using only one or two sources of information.

Some problems that emerged during this work have yet to be resolved. First, it is necessary to determine how to represent activities without fixed temporal dependency (e.g., using the toilet and putting out waste). If these activities can be represented well, the recognition performance of the proposed method should increase. Second, it may be possible to integrate other sensor modalities into the proposed fusion method. Acceleration data were also collected in the UJAmI Smart Lab by the inhabitant’s Android smart watch; in future work, we will investigate how to combine them with the three types of data used in this study to recognize human activities.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research was supported by the Science and Technology Development Project of Weifang City 2020, China, Grant No. 2020GX006.

Informed consent

Consent for publication was obtained for this study from the participant.

Ethics statement

Ethics approval was obtained from the Institutional Research Ethics Board of Jiangsu University. We certify that the study was performed in accordance with the 1964 declaration of HELSINKI and later amendments.

Trial registration number/date

No clinical trials occurred during the research.

Data sharing agreement/Availability of data and materials

The public datasets analyzed during this study are available in the [1st UCAmI Cup] repository, [].

ORCID iDs

Weimin Ding

Shengli Wu

Chris Nugent

References

1. Hjelm, Hedlund. Internet-of-Things (IoT) in healthcare and social services - experiences of a sensor system for notifications of deviant behaviours in the home from the users' perspective. Health Informatics J 2022; 28(1).
2. Razmara, Zaboli, Hassankhani. Elderly fall risk prediction based on a physiological profile approach using artificial neural networks. Health Informatics J 2018; 24(4): 410–418.
3. Farrús, Codina-Filbà, Escudero. Acoustic and prosodic information for home monitoring of bipolar disorder. Health Informatics J 2021; 27(1).
4. Lunardini, Borghese, Piccini, et al. Validity and usability of a smart ball-driven serious game to monitor grip strength in independent elderlies. Health Informatics J 2020; 26(3): 1952–1968.
5. Al-Faris, Chiverton, Ndzi, et al. A review on computer vision-based methods for human action recognition. J Imaging 2020; 6(6): 46.
6. Zhang, Zhong, et al. A comprehensive survey of vision-based human action recognition methods. Sensors 2019; 19(5): 1005.
7. Fridriksdottir, Bonomi. Accelerometer-based human activity recognition for patient monitoring using a deep neural network. Sensors 2020; 20(22): 6424.
8. Zhao, Tsai, Chen, et al. Resident activity recognition based on binary infrared sensors and soft computing. Int J Mach Learn Cybern 2019; 10(2): 291–299.
9. Chen, et al. Bi-view semi-supervised learning based semantic human activity recognition using accelerometers. IEEE Trans Mob Comput 2018; 17(9): 1991–2001.
10. Zhu, Cheng, Sheng. Human activity recognition via motion and vision data fusion. Circuits, Systems and Computers, 1977. Conference Record. 1977 11th Asilomar Conference on, 2011.
11. Beddiar, Nini, Sabokrou, et al. Vision-based human activity recognition: a survey. Multim Tools Appl 2020; 79(41–42): 30509–30555.
12. Jegham, Ben Khalifa, Alouani, et al. Vision-based human action recognition: an overview and real world challenges. Digit Investig 2020; 32: 200901.
13. Chen, Hoey, Nugent, et al. Sensor-based activity recognition. IEEE Trans Syst Man Cybern C 2012; 42(6): 790–808.
14. Zhuang, Chen, et al. Design of human activity recognition algorithms based on a single wearable IMU sensor. Int J Sens Networks 2019; 30(3): 193–206.
15. Ayman, Attalah, Shaban. An efficient human activity recognition framework based on wearable IMU wrist sensors. In: 2019 IEEE International Conference on Imaging Systems and Techniques, December 9-10 2019, Abu Dhabi, United Arab Emirates.
16. Medina-Quero, Zhang, Nugent, et al. Ensemble classifier of long short-term memory with fuzzy temporal windows on binary sensors for activity recognition. Expert Syst Appl 2018; 114: 441–453.
17. Espinilla, Martinez, Medina, et al. The experience of developing the UJAmI smart lab. IEEE Access 2018; 6: 34631–34642.
18. Espinilla, Medina, Nugent. UCAmI Cup. Analyzing the UJA human activity recognition dataset of activities of daily living. In: 12th International Conference on Ubiquitous Computing and Ambient Intelligence, December 4-7 2018, Punta Cana, Dominican Republic. pp. 1267.
19. Banos, Galvez, Damas, et al. Window size impact in human activity recognition. Sensors 2014; 14(4): 6474–6499.
20. Ordóñez, de Toledo, Sanchis, et al. Activity recognition using hybrid generative/discriminative models on home environments using binary sensors. Sensors 2013; 13(5): 5460–5477.
21. Hoffmann, Lauterbach, Techmer, et al. Recognising gait patterns of people in risk of falling with a multi-layer perceptron. In: Information Technologies in Medicine - 5th International Conference, June 20-22 2016, Kamień Śląski, Poland. pp. 87–97.
22. Hoffmann, Steinhage, Lauterbach. C5.4 - Increasing the reliability of applications in AAL by distinguishing moving persons from pets by means of a sensor floor. In: AMA Conferences 2015.
23. Espinilla, Medina, Hallberg, et al. A new approach based on temporal sub-windows for online sensor-based activity recognition. J Ambient Intell Humaniz Comput 2018.
24. Medina-Quero, Zhang, Nugent, et al. Ensemble classifier of long short-term memory with fuzzy temporal windows on binary sensors for activity recognition. Expert Syst Appl 2018; 114: 441–453.
25. López Medina MÁ, Espinilla, Paggeti, et al. Activity recognition for IoT devices using fuzzy spatio-temporal features as environmental sensor fusion. Sensors 2019; 19(16): 3512.
26. Ehatisham-Ul-Haq, Javed, Azam, et al. Robust human activity recognition using multimodal feature-level fusion. IEEE Access 2019; 7: 60736–60751.
27. Ding. A cross-entropy based stacking method in ensemble learning. J Intell Fuzzy Syst 2020; 39(3): 4677–4688.
28. Peng, Chen, et al. Complex activity recognition using acceleration, vital sign, and location data. IEEE Trans Mob Comput 2019; 18(7): 1488–1498.
29. Muzammal, Talat, Sodhro, et al. A multi-sensor data fusion enabled ensemble approach for medical data from body sensor networks. Inf Fusion 2020; 53: 155–164.
30. Garcia-Ceja, Galván-Tejada, Brena, et al. Multi-view stacking for activity recognition with sound and accelerometer data. Inf Fusion 2018; 40: 45–56.
31. Irvine, Nugent, Zhang, et al. Neural network ensembles for sensor-based human activity recognition within smart environments. Sensors 2019; 20(1): 216.
32. Crestani. A geometric framework for data fusion in information retrieval. Inf Syst 2015; 50: 20–35.
33. Bonab, Can. Goowe: Geometrically optimum and online-weighted ensemble classifier for evolving data streams. ACM Trans Knowl Discov Data 2018; 12(2): 1–33.
34. Bonab, Can. Less is more: a comprehensive framework for the number of components of ensemble classifiers. IEEE Trans Neural Netw Learn Syst 2019; 30(9): 2735–2745.
35. Ding. A dataset-level geometric framework for ensemble classifiers. CoRR abs/2106.08658 (2021).
36. Burges. A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 1998; 2(2): 121–167.
37. https://datascience.stackexchange.com/questions/35804/what-is-the-time-complexity-of-linear-regression.
38. Cerón, López, Eskofier. Human activity recognition using binary sensors, BLE Beacons, an intelligent floor and acceleration data: a machine learning approach. In: 12th International Conference on Ubiquitous Computing and Ambient Intelligence, December 4-7 2018, Punta Cana, Dominican Republic. pp. 1265.
39. Jiménez, Seco. Multi-event naive Bayes classifier for activity recognition in the UCAmI Cup. In: 12th International Conference on Ubiquitous Computing and Ambient Intelligence, December 4-7 2018, Punta Cana, Dominican Republic. pp. 1264.
40. Karvonen, Kleyko. Domain knowledge-based solution for human activity recognition: the UJA dataset analysis. In: 12th International Conference on Ubiquitous Computing and Ambient Intelligence, December 4-7 2018, Punta Cana, Dominican Republic. pp. 1261.
41. Lago, Inoue. A hybrid model using hidden Markov chain and logic model for daily living activity recognition. In: 12th International Conference on Ubiquitous Computing and Ambient Intelligence, December 4-7 2018, Punta Cana, Dominican Republic. pp. 1266.
42. Razzaq, Cleland, Nugent, et al. Multimodal sensor data fusion for activity recognition using filtered classifier. In: 12th International Conference on Ubiquitous Computing and Ambient Intelligence, December 4-7 2018, Punta Cana, Dominican Republic. pp. 1262.

A multimodal fusion enabled ensemble approach for human activity recognition in smart homes
