Irrelevant data elimination based on a k-means clustering algorithm for efficient data aggregation and human activity classification in smart home sensor networks

Abstract

For the successful operation of smart home environments, it is important to know the state or activity of an occupant. A large number of sensors can be deployed and embedded in places or things. All sensor nodes measure the physical world and send data to the base station for processing. However, processing all collected data from every sensor node can consume significant energy and time. To enhance the sensor network in smart home applications, we propose irrelevant data elimination based on the k-means clustering algorithm (IDEK) to improve data aggregation. This approach embeds a cluster head–based algorithm into cluster heads so that irrelevant data are not passed to the base station. The pattern of measured data in each room can be clustered as an active pattern when human activity happens in that room and a stable pattern when it does not. The IDEK approach can reduce the original data by 55.94% with similar results in human activity classification. This study shows that the proposed approach can eliminate meaningless data and intelligently aggregate data by delivering only data from rooms in which human activity likely occurs.

Keywords

Clustering, human activity, k-means, irrelevant data elimination

Introduction

The smart home has been of interest to many studies and for many applications, such as automatic appliance control and elderly assistance. Smart homes can improve the lives of occupants and the elderly by utilizing ambient sensors to capture the presence and behavior of occupants for active and assisted living (AAL) purposes. A smart home communication system can be divided into an external network, a base station network, and an internal network.¹ This network can be wired or wireless, and the Internet is used to interconnect devices in the home; thus, the smart home is one of the applications of the Internet of things (IoT). Sensors or devices can be placed everywhere, measure the physical world, and then transmit data to the base station by communicating through a network based on the IoT.² Sensor networks that gather large volumes of data are found in many settings, such as smart hospitals and smart buildings. The volume of data in smart homes is also growing as the number of intelligent appliances increases.

The large number of sensors produces a significant amount of data, which raises concerns about the energy consumed by data processing and sensor transmission. In addition, a well-designed sensor network must consider various challenges in a smart home, such as privacy and security,³ data handling and compression, learning, and assessment of an occupant’s behavioral patterns, as well as their home environment.⁴ Therefore, an efficient technique is needed to extend the lifetime of sensors in smart homes and guarantee the sustainable operation of sensor networks. One way to reduce power consumption in a sensor network is to optimize in-network data aggregation in the internal network, which mainly comprises the sensor nodes and the base station. Fusing data from multiple sensors yields both useful and useless data, and a clustering algorithm can be used for efficient data aggregation processing in the sensor network.

An internal communication network comprises sensor nodes and a base station. The base station stores data from all sensor nodes and performs management and configuration in the network. Different methods are used to operate the internal communication network between the sensor nodes and the base station, helping to aggregate data in ways that can improve the lifetime of the network.

Maraiya et al.⁵ categorized data aggregation into four approaches based on different methodologies: the tree-based approach, cluster-based approach, multipath-based approach, and hybrid-based approach. In a similar categorization, Sirsikar and Anavatti⁶ described four strategies consisting of a centralized approach, an in-network approach, a tree-based approach, and a cluster-based approach. The differences between these approaches are as follows: (1) the centralized approach is a simple data aggregation approach, as all sensor nodes send the data to a central node called a header node, and then the header node sends the data packet to a destination node; (2) the in-network aggregation approach gathers data through a multi-hop network and processes data at intermediate nodes in order to reduce power consumption; (3) the tree-based approach considers a sensor network as a tree that consists of a sink node as the root and source nodes as leaf nodes, where each node has a parent node to forward its data to the sink node; and (4) the cluster-based approach has several cluster heads, each of which gathers data from numerous source nodes under its control and transmits the result to a sink node or the destination.

A cluster-based approach can reduce the bandwidth overhead because it transmits fewer data packets. A popular cluster-based approach is the low energy adaptive clustering hierarchy (LEACH).⁷ LEACH has inspired other LEACH-based protocols that have attempted to improve the method of cluster head selection, such as EM-LEACH,⁸ ESO-LEACH,⁹ and LEACH-VA.¹⁰ The fundamental goal of LEACH is to distribute energy consumption equally among the sensors in the network through the randomized rotation of cluster heads. In data transmission, LEACH incorporates data fusion into the routing protocol to reduce the amount of information before transmitting data to the base station. Another cluster-based algorithm is the COUGAR algorithm. COUGAR selects the cluster head based on signal strength and performs in-network aggregation with duplicate sensitive aggregators; it includes a node synchronization engine, which ensures that data are aggregated correctly.¹¹ More recent cluster-based algorithms use swarm algorithms to develop cluster head (CH) selection methods. PSO-ECHS is an energy-efficient cluster head selection algorithm based on particle swarm optimization (PSO). The PSO-based algorithm randomly selects a suitable subset of nodes as CH candidates. PSO-ECHS considers parameters such as intra-cluster distance, sink distance, and residual energy of all the CHs in the fitness function.¹² The grey wolf optimizer (GWO) is a swarm algorithm that is used to select CHs based on the predicted energy consumption and current residual energy of each node. However, this method may not be suitable for applications in which the death of the first node has a significant effect on the performance of the network.¹³ A cluster-based approach can be implemented by increasing local data processing to contribute efficient data aggregation to sensor networks.

As the local processor processes data locally, some algorithms, such as a clustering algorithm, can be used to enhance data aggregation before transmitting data to the base station. A clustering algorithm can cluster different data patterns from a complete dataset; unwanted data can then be purged to reduce the effect of irrelevant data. Not all measured data can be trusted: data can be false, and irrelevant data do not correlate with the current circumstances; eventually, these data can produce false negatives or false positives. In a smart home environment, while the occupant is performing an activity in one room, sensors in other rooms may have active values for many reasons (e.g. the occupant may leave a TV on and go to another room, or false data may be transmitted from a sensor).

One of the simplest clustering algorithms is the k-means algorithm, which partitions data into k groups around the feature means based on Euclidean distance. Harb et al.¹⁴ used the k-means clustering algorithm to group similar data sets into generated clusters in their KPFF technique. This technique enhanced the prefix frequency filtering (PFF) technique using a k-means-based clustering approach for data aggregation in periodic sensor networks. The same researchers enhanced their previous model with a one-way analysis of variance (ANOVA) model to identify nodes that generate identical data sets and to aggregate these sets before sending them to the sink,¹⁵ thereby eliminating redundancy among sensor node members that generate redundant data sets. Harb and Jaoude¹⁶ presented a similar method of using the k-means algorithm in sensor networks for data compression to handle big data collection.

Idrees et al.¹⁷ proposed a Distributed Data Aggregation–based Modified k-means (DiDAMoK) technique, which uses a modified k-means technique to remove data redundancy in data aggregation and improve the lifetime of a sensor network. Using this technique, a sensor node measures and collects the data. Then, the modified k-means algorithm is employed on the collected data to convert the data into clusters and transmit the collected data for each cluster to the sink.

Rida et al.¹⁸ presented a clustering approach called EK-means for dataset classification in sensor networks with the objective of reducing the amount of transmitted data over a network while preserving their properties. This approach consists of two steps. The first step is to eliminate similar data generated at the sensor level using Euclidean distance based on the data aggregation technique. The second step is to group similar datasets and reduce the amount of data sent to the sink using the EK-means enhanced from the k-means clustering algorithm.

Most researchers used the k-means algorithm to eliminate redundant data in sensor networks because it can group redundant data. However, these approaches cannot remove irrelevant data that have no correlation with the current scenario. In this work, we consider the correlation between the collected data and the current scenario to remove irrelevant data. We aim to make data aggregation in a smart home more efficient for human activity classification over the lifetime of the network. We focus on data reduction via data fusion among cluster heads in a smart home, reducing the size and number of data items passing through the transmission process, in order to apply human activity classification to smart home applications. Based on the challenges above, this article presents a novel approach that uses an algorithm for clustering and decision-making in multi-sensor network system communication.

Here, we propose the irrelevant data elimination based on k-means clustering algorithm (IDEK) for local data processing in cluster heads. These cluster heads are pre-defined and divided based on rooms and could obtain meaningful data from the sensor nodes in the room by utilizing the data and enhanced k-means algorithm to decide whether to transmit the data to a base station as shown in Figure 1(b). The presented approach enables cluster heads to make an intelligent decision about whether to transmit stored data to a base station using an embedded cluster head–based algorithm. The cluster-based sensor network architecture is illustrated in Figure 1(a), where the cluster heads aggregate data from the sensor nodes and transmit those data to a base station by intelligently making a decision based on the patterns of source nodes.

Figure 1.

Illustration of cluster-based sensor network: (a) communication architecture and (b) sensor deployment scheme.

The contributions of this work are as follows: (1) efficient data aggregation to reduce energy consumption; (2) IDEK; and (3) human activity classification via ambient sensors in a smart home.

IDEK

This section presents IDEK, which exploits the correlation between measured data from sensors and human activity to reduce the amount of data and save the energy spent on data transmission to a base station. IDEK is proposed to eliminate irrelevant data during aggregation for the purpose of human activity classification in a smart home with a hierarchical network structure comprising three levels, as shown in Figure 2. The architecture is designed to embed a cluster head–based algorithm at the cluster head level. Figure 2 shows the IDEK hierarchy, which works alongside the hierarchical network. A cluster head intelligently transmits the measured data from sensor nodes in certain rooms using the IDEK approach. This approach utilizes an embedded cluster head–based algorithm based on the k-means algorithm to cluster data in each CH and decide whether to transmit data during human activities. The transmitted data are classified according to human activity at the base station level, and smart home applications can then provide a given service to the occupant or elderly. Figure 3 summarizes the procedure of the IDEK approach, which is explained in detail below.

Figure 2.

Irrelevant data elimination based on k-means algorithm (IDEK) hierarchy.

Figure 3.

The procedure of IDEK approach.

Sensor node level

The physical environment and human activities are measured through all embedded sensor nodes in the smart home. The sensor nodes update data at given time intervals as $s_{t, n}^{r}$ , where $n$ denotes the measured data id from 1 to the total number of measured data from the sensor nodes with the cluster head $r$ at timestamp $t$ . The sensor node decides to send the measured data to the cluster head when it obtains a value that is different from previously measured data.
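The send-on-change rule at the sensor node level can be illustrated with a minimal sketch. The class name `SensorNode`, the message layout, and the sensor id are hypothetical and chosen only for illustration; the source does not specify how a node stores its previous reading.

```python
from typing import Dict, Optional


class SensorNode:
    """Reports a reading to its cluster head only when the value changes."""

    def __init__(self, sensor_id: str) -> None:
        self.sensor_id = sensor_id
        self._last_value: Optional[float] = None  # nothing measured yet

    def update(self, value: float) -> Optional[Dict[str, object]]:
        """Return a message for the cluster head, or None if nothing changed."""
        if value == self._last_value:
            return None  # same as the previous measurement: stay silent
        self._last_value = value
        return {"sensor_id": self.sensor_id, "value": value}


# Six consecutive readings from one hypothetical temperature sensor.
node = SensorNode("kitchen_temperature")
readings = [21.5, 21.5, 21.7, 21.7, 21.7, 21.4]
sent = [msg for msg in (node.update(v) for v in readings) if msg is not None]
print(len(sent), "of", len(readings), "readings transmitted")
```

Only the readings that differ from their predecessor leave the node, so repeated values cost no transmission.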

Cluster head level

We assume that a certain spatial monitoring area, such as a room, is covered by a sensor cluster, and each cluster consists of a single cluster head and several member sensors. Every cluster head plays the role of aggregator and receives measured data from its descendants placed in the same room.

The cluster head election is based on the highest residual energy and the smallest distance from the base station. To prolong the lifetime of the sensor network, the sensor node with the highest residual energy among the members is selected as the cluster head; if all members have the same residual energy, the cluster head is selected based on the smallest distance from the base station. Sensor nodes communicate with their own cluster head, and cluster heads communicate with the base station, so that the network is aware if any cluster head dies during processing. A sensor node receives a response from its cluster head within a time limit if the cluster head has completely received the data packets; if not, a new cluster head is selected and announced to the other nodes. The same procedure applies when the base station cannot receive data packets from a cluster head.
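The election rule can be sketched as a single sort key. The `Node` fields and their units are hypothetical, and this sketch applies the distance tie-break whenever the top candidates share the same residual energy, which slightly generalizes the "all members equal" case described in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    node_id: str
    residual_energy: float   # hypothetical unit, e.g. joules
    distance_to_base: float  # hypothetical unit, e.g. metres


def elect_cluster_head(members: List[Node]) -> Node:
    """Pick the member with the highest residual energy; break ties by
    the smallest distance to the base station."""
    if not members:
        raise ValueError("a cluster needs at least one member")
    # Negating the energy lets one ascending key express both criteria.
    return min(members, key=lambda n: (-n.residual_energy, n.distance_to_base))


members = [Node("s1", 2.0, 5.0), Node("s2", 3.5, 9.0), Node("s3", 3.5, 4.0)]
head = elect_cluster_head(members)
print(head.node_id)
```

If the elected head stops acknowledging packets within the time limit, the same function could be called again on the remaining members.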

These data are aggregated as a ${CH}_{t}^{r}$ vector (e.g. ${CH}_{t}^{bathroom} = \{s_{t, 1}^{bathroom}, s_{t, 2}^{bathroom}, \dots, s_{t, n}^{bathroom}\}$, ${CH}_{t}^{bedroom} = \{s_{t, 1}^{bedroom}, s_{t, 2}^{bedroom}, \dots, s_{t, n}^{bedroom}\}$, and in general ${CH}_{t}^{r} = \{s_{t, 1}^{r}, s_{t, 2}^{r}, \dots, s_{t, n}^{r}\}$), where $r$ is the name of the room. The ${CH}_{t}^{r}$ vector contains measured data from $n$ sensor nodes at timestamp $t$. The ${CH}_{t}^{r}$ vector can be considered as either a vector that has a correlation to the occurrences of human activity or a vector that does not contain any relevant or wanted data. An example of some data in the kitchen cluster head plotted from the complete timestamp data (${CH}_{t}^{kitchen}$, $t = 0, \dots, T$) is shown in Figure 4.

Figure 4.

The example of some measured data in the kitchen cluster head.

In order to enhance the ability of a cluster head, the cluster head needs to undergo offline learning from previous datasets before the cluster head–based algorithm is embedded in it. The clustering algorithm, called the CorrKmeans++ algorithm (Algorithm 1), is used to cluster the collected data in each cluster head, so that the cluster head can decide to transmit aggregated data to the base station when the data possibly reflect the occurrence of an activity. The proposed clustering algorithm builds on the k-means++ algorithm.¹⁹ It utilizes the complete data of the cluster head, ${CH}_{t}^{r}$ for $t = 0, \dots, T$, in a dataset to find the optimal $k$ clusters from the matching patterns of sensor data and certain activities. Algorithm 1 groups similar patterns of measured data and selects the group of wanted data by considering the correlation between the collected data and the current scenario.

Algorithm 1 CorrKmeans++ algorithm
1: Input: a whole dataset $X$ of $T$ elements, match_rate = 0.7, where $X = \{{CH}_{0}^{r}, {CH}_{1}^{r}, \dots, {CH}_{T}^{r}\}$ and ${CH}_{t}^{r} = \{s_{t, 1}^{r}, s_{t, 2}^{r}, \dots, s_{t, n}^{r}\}$ for $t = 0, \dots, T$, $n$ = number of measured data in the room $r$, $A$ (activity label).
2: Output: $C$ (cluster centroids), where $C = \{C^{r}, \dots, C^{rooms}\}$
3: k-means++ cluster centroids initialization:
4: Choose the first centroid $c_{1}$ uniformly at random from among the data elements in $X$, where $c_{1}$ is a $1 \times n$ matrix
5: Compute the squared distances between all data elements and $c_{1}$: $D({CH}_{t}^{r})^{2} = \| {CH}_{t}^{r} - c_{1} \|^{2}$, and the weight from $\sum_{t = 0}^{T} D({CH}_{t}^{r})^{2} = \sum_{t = 0}^{T} \| {CH}_{t}^{r} - c_{1} \|^{2}$
6: Choose the second centroid $c_{2}$ from $X$ randomly with probability $\frac{D({CH}_{t}^{r})^{2}}{\sum_{{CH}_{t}^{r} \in X} D({CH}_{t}^{r})^{2}}$
7: Recompute the squared distances between all data elements and the nearest centroids that have been chosen as $D_{i}({CH}_{t}^{r})^{2} = \min_{i = 1, \dots, k} (\| {CH}_{t}^{r} - c_{i} \|^{2})$ and the cumulative values as $\sum_{t = 0}^{T} D_{i}({CH}_{t}^{r})^{2} = \sum_{t = 0}^{T} \min_{i = 1, \dots, k} (\| {CH}_{t}^{r} - c_{i} \|^{2})$
8: Choose the next centroid $c_{i}$ from $X$ randomly by recomputing the probability $\frac{D({CH}_{t}^{r})^{2}}{\sum_{{CH}_{t}^{r} \in X} D({CH}_{t}^{r})^{2}}$
9: Repeat steps 7 and 8 until $k$ centroids have been chosen: $C = \{c_{0}, c_{1}, \dots, c_{k}\}$ as a $k \times n$ matrix
10: while $\mathrm{match\_rate} < S_{Amax}$ do
11:   $k \leftarrow k + 1$
12:   for each room $r$ in rooms do
13:     repeat:
14:       for $t \leftarrow 0$ to $T$ do
15:         for $i \leftarrow 0$ to $k - 1$ do
16:           ${dist}_{t, i}^{r} = \| {CH}_{t}^{r} - c_{i} \|$ (calculate distance)
17:         end for
18:         $g_{t} = \underset{i}{\arg\min} ({dist}_{t, i}^{r})$ (assign a group to each of the $T$ data elements; $g$ is a $T \times 1$ vector)
19:       end for
20:       for $i \leftarrow 0$ to $k - 1$ do
21:         $f \leftarrow 0$
22:         for $t \leftarrow 0$ to $T$ do
23:           if $g_{t} = i$ then
24:             $m_{i, f} \leftarrow t$
25:             $f \leftarrow f + 1$
26:           end if
27:         end for
28:         for $j \leftarrow 1$ to $n$ do
29:           $c_{i, j} = \frac{1}{f} \sum_{t \in m_{i}} s_{t, j}^{r}$ (update centroid $C^{r} = \{c_{0, 1}, \dots, c_{i, j}\}$, $c_{i, j} \in c_{i}$)
30:         end for
31:       end for
32:     until no change;
33:     calculate $S_{A}^{r}$ from equation (1)
34:   end for
35:   $S_{Amax} = \max_{r} (S_{A}^{r})$
36: end while
37: return $C$
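The clustering core of Algorithm 1 (k-means++ seeding followed by the assign-and-update loop) can be sketched for a fixed $k$ as follows. This is an illustrative standard-library sketch, not the authors' implementation: the function names, the seed handling, the empty-cluster rule, and the toy data are assumptions, and the outer loop that increases $k$ using $S_{A}$ is deliberately left out.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two measurement vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_init(data: List[Vector], k: int, rng: random.Random) -> List[List[float]]:
    """k-means++ seeding: each new centroid is drawn with probability
    proportional to the squared distance to the nearest chosen centroid."""
    centroids = [list(rng.choice(data))]
    while len(centroids) < k:
        weights = [min(sq_dist(x, c) for c in centroids) for x in data]
        if sum(weights) == 0:  # every point coincides with a chosen centroid
            centroids.append(list(rng.choice(data)))
        else:
            centroids.append(list(rng.choices(data, weights=weights, k=1)[0]))
    return centroids


def kmeans(data: List[Vector], k: int, seed: int = 0,
           max_iter: int = 100) -> Tuple[List[List[float]], List[int]]:
    """Assign each vector to its nearest centroid and recompute centroids
    as group means until the assignment no longer changes."""
    rng = random.Random(seed)
    centroids = kmeanspp_init(data, k, rng)
    groups = [-1] * len(data)
    for _ in range(max_iter):
        new_groups = [min(range(k), key=lambda i: sq_dist(x, centroids[i]))
                      for x in data]
        if new_groups == groups:
            break  # "until no change"
        groups = new_groups
        for i in range(k):
            members = [x for x, g in zip(data, groups) if g == i]
            if members:  # assumption: keep the old centroid if a group is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, groups


# Toy stand-in for CH vectors: three "stable" and three "active" timestamps.
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, groups = kmeans(toy, k=2)
print(groups)
```

Each row of `toy` plays the role of one ${CH}_{t}^{r}$ vector; on real data the rows would hold the $n$ measured properties of one room at one timestamp.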

The results of the sensor pattern, which are grouped together and shown as examples in Figure 5, will also be considered for use as a time schedule for data transmission in the cluster head. For long-term usage, the model needs to be retrained when the model detects a change in human behavior, such as activity time duration or activity pattern.

Figure 5.

Examples of sensor data in the cluster head in the (a) bedroom and (b) kitchen are grouped with $k = 3$ .

A key problem in clustering the sensor data in the cluster heads is determining the number of clusters $k$ for the k-means algorithm. In the standard k-means algorithm, $k$ must be defined in advance. We optimized the k-means++ algorithm to find the appropriate $k$ for each cluster head by considering the correlation between the activity and the data in the CHs. After the data in a CH are grouped into $k$ clusters, the group with the highest $S_{A}$ score is selected, where the similarity is calculated using equation (1), $G_{t}$ is a label for transmitting time at $t$ in a cluster head, $A_{t}$ is an activity label, $w$ is weight, and $S_{A}$ is the similarity between $G_{t}$ and $A_{t}$ throughout the period $T$, with $G_{t}, A_{t} \in \{0, 1\}$. The selected group, called the active group, represents the data recorded when human activity occurs. The algorithm stops increasing $k$ once the stopping condition on match_rate and the $S_{A}$ score in Algorithm 1 is met

$$S_{A} = \frac{\frac{\sum_{t = 0}^{T} G_{t} \cdot A_{t}}{\sum_{t = 0}^{T} A_{t}} + \frac{\sum_{t = 0}^{T} {\sim}G_{t} \cdot {\sim}A_{t}}{\sum_{t = 0}^{T} {\sim}A_{t}}}{2} \tag{1}$$
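A small worked sketch of the similarity score may help. It assumes that the tilde in the second term of equation (1) denotes the logical complement of a 0/1 label, so that $S_{A}$ averages the match rate on activity timestamps and the match rate on non-activity timestamps; this reading is an interpretation, not something the text states. The weight $w$ mentioned in the definitions is not modelled, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def similarity_score(G: Sequence[int], A: Sequence[int]) -> float:
    """S_A under the complement assumption: mean of the fraction of activity
    timestamps that are transmitted and the fraction of non-activity
    timestamps that are not transmitted."""
    active = sum(A)
    idle = len(A) - active
    if active == 0 or idle == 0:
        raise ValueError("A must contain both 0 and 1 labels")
    match_active = sum(g * a for g, a in zip(G, A)) / active
    match_idle = sum((1 - g) * (1 - a) for g, a in zip(G, A)) / idle
    return (match_active + match_idle) / 2


def select_active_group(groups: Sequence[int], A: Sequence[int],
                        k: int) -> Tuple[int, float]:
    """Treat each cluster in turn as the transmitting group and keep the
    one whose transmit labels G best match the activity labels A."""
    scores = [similarity_score([1 if g == i else 0 for g in groups], A)
              for i in range(k)]
    best = max(range(k), key=lambda i: scores[i])
    return best, scores[best]


# Cluster assignment per timestamp and the matching activity labels.
groups = [0, 0, 1, 1, 1, 2, 0, 0]
A = [0, 0, 1, 1, 1, 0, 0, 0]
print(select_active_group(groups, A, k=3))
```

Here cluster 1 coincides exactly with the labelled activity, so it is selected as the active group with a score of 1.0, while clusters 0 and 2 score 0.1 and 0.4.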

From the offline learning explained above, we obtain the optimal $k$ and the centroids $C^{r} = \{c_{0}, c_{1}, \dots, c_{k}\}$ from the selected group in every chosen cluster head. We then embed the cluster head–based algorithm into the cluster head following Algorithm 2. This ensures that the cluster head transmits only meaningful data (data related to an activity) to the base station. Each cluster head calculates the Euclidean distance between its own ${CH}_{t}^{r}$ vector and each cluster centroid and locates the nearest centroid. If the nearest centroid of ${CH}_{t}^{r}$ belongs to the active group, the cluster head transmits the collected data in ${CH}_{t}^{r}$ at $t$ to the base station.

Algorithm 2 Embedded cluster head–based algorithm
1: Input: measured data of the $r$ cluster head at $t$ (${CH}_{t}^{r}$), cluster centroids $C^{r} = \{c_{0}, c_{1}, \dots, c_{k}\}$ as a $k \times n$ matrix.
2: Output: pass or do not pass to the base station
3: for $i \leftarrow 0$ to $k - 1$ do
4:   ${dist}_{t, i}^{r} = \| {CH}_{t}^{r} - c_{i} \|$ (calculate distance)
5: end for
6: $g_{t} = \underset{i}{\arg\min} ({dist}_{t, i}^{r})$ (find the nearest centroid)
7: if $g_{t} = 2$ then (assuming $c_{2}$ is the centroid of the activity occurrence group)
8:   send measured data ${CH}_{t}^{r}$ to the base station
9: else
10:   omit measured data ${CH}_{t}^{r}$; do not pass it to the base station
11: end if
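The decision in Algorithm 2 reduces to a nearest-centroid lookup. The sketch below is a minimal illustration; the function name `should_transmit`, the two-dimensional centroids, and the choice of index 2 as the active group are assumptions made for the example.

```python
import math
from typing import Sequence


def should_transmit(ch_vector: Sequence[float],
                    centroids: Sequence[Sequence[float]],
                    active_group: int) -> bool:
    """Return True when the nearest centroid is the active-group centroid."""
    distances = [math.dist(ch_vector, c) for c in centroids]
    nearest = min(range(len(centroids)), key=distances.__getitem__)
    return nearest == active_group


# Hypothetical centroids learned offline for one room; index 2 is assumed
# to be the centroid of the activity occurrence group.
centroids = [[0.0, 0.0], [5.0, 5.0], [10.0, 1.0]]
print(should_transmit([9.0, 1.5], centroids, active_group=2))  # near c_2
print(should_transmit([0.5, 0.2], centroids, active_group=2))  # near c_0
```

Because only $k$ distance computations are needed per timestamp, the check is cheap enough to run on a constrained cluster head.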

Base station level

At this level, the base station receives the measured data from every cluster head whose data pass; then, human activity classification is performed with a perceptron classifier using one versus all (OVA) for the multi-class problem via the Scikit-learn framework.²⁰ The measured data, from which irrelevant data were eliminated at the cluster head level, are fed as the input of the classifier, as shown in Figure 6. Each training point belongs to one of $k$ different classes, and the predicted activity is the class with the maximum of the $k$ outputs. The classifier was built using the default parameters, except that alpha = 1e-10, penalty = 'l2', tol (tolerance) = 1e-10, and class_weight is given by the inverse class frequency to handle imbalanced (sparsely labeled) datasets within the training model. The model uses optimized weights with instance-weighted hinge loss, as described in equation (2)

$$L(w, (x, y)) = \begin{cases} \psi_{k} \max_{k} (1 + s_{k} - s_{y}) & k \neq y \\ \psi_{k} \max_{k} (s_{k} - s_{y}) & \text{otherwise} \end{cases} \tag{2}$$

$$\psi = \log \frac{\text{samples in class } k}{\text{number of all samples}} \tag{3}$$

where $L$ is the multi-class hinge loss of class $k$ for the case $k \neq y$, $\psi$ is the instance weight based on the inverse class frequency defined in equation (3), and $s_{k}$ and $s_{y}$ are the score functions of the prediction and target, respectively.
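A classifier with the stated non-default parameters might be configured in Scikit-learn as sketched below. Several points are assumptions: `class_weight="balanced"` is one way to obtain weights inversely proportional to class frequency and may differ from the weighting in equation (3); Scikit-learn's `Perceptron` handles multi-class targets with a one-versus-all scheme internally; and the toy data, `random_state`, and variable names are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Perceptron

# Toy stand-in for the aggregated CH data: three imbalanced activity classes.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(40, 2)),
    rng.normal(loc=(4.0, 0.0), scale=0.3, size=(15, 2)),
    rng.normal(loc=(0.0, 4.0), scale=0.3, size=(5, 2)),
])
y = np.array([0] * 40 + [1] * 15 + [2] * 5)

# Non-default parameters quoted in the text; everything else is left at its
# default. class_weight="balanced" is an assumption about how the inverse
# class frequency was supplied.
clf = Perceptron(alpha=1e-10, penalty="l2", tol=1e-10,
                 class_weight="balanced", random_state=0)
clf.fit(X, y)

scores = clf.decision_function(X)  # one score per class for every sample
pred = clf.predict(X)              # class with the maximum score
print(scores.shape, pred[:5])
```

The predicted activity is the class whose one-versus-all score is largest, which mirrors the "maximum output" rule described for the base station.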

Figure 6.

One versus all perceptron classifier.

At this level, a behavioral change detection model can be added to detect when the inhabitant has changed behaviors, since this situation affects the CorrKmeans++ algorithm. Our network can then undergo offline learning to acquire new information and retrain its human activity classification model.

Simulation set-up

A simulation study was performed to validate our proposed method on a sensor-rich smart home scenario dataset. We implemented the human activity classifier in Python on a Google Colab environment, as shown in Table 1. We designed the simulation in three steps. The first step is to divide the sensors locally according to room function and define the fixed cluster head of each room, as shown in Table 2, and then to enhance the ability of each cluster head through the CorrKmeans++ algorithm (Algorithm 1) in offline learning. In the second step, Algorithm 2 is embedded in every cluster head to distinguish the measured data recorded while an activity is happening in the room from the data recorded while nothing is happening. The last step is to compare the performance of the IDEK approach with other approaches, as described in the “Results” section. The occupant's activity classification is used to measure the performance metrics, including the size of the data being communicated and the quality of the data aggregated for human activity classification, using the metrics described in the “Results” section.

Table 1.

The simulation set-up.

Environment set-up	Property
Processor	$1 \times$ single core hyper threaded (1 core, 2 threads), Xeon Processors @2.3 GHz
L3 cache	45 MB
RAM	12.72 GB

Table 2.

The choice of options.

Cluster head location	Data sources	Data properties
Bathroom	Contact sensor, carbon dioxide meter, humidity sensor, switch, dimmer, brightness sensor, infrared sensor, presence sensor, temperature sensor, water meter	24
Bedroom	Contact sensor, carbon dioxide meter, humidity sensor, switch, dimmer, brightness sensor, hue bulb, infrared sensor, presence sensor, temperature sensor, pressure sensor, noise level sensor	28
Dining room	Contact sensor, brightness sensor, noise level sensor, infrared sensor	3
First floor	Contact sensor	1
Hallway	Presence sensor, temperature sensor, temperature controller	7
Hallway first floor	Contact sensor, dimmer, switch, noise level sensor	6
Hallway second floor	Switch, noise level sensor, dimmer	12
Kitchen	Contact sensor, switch, carbon dioxide meter, humidity sensor, switch, dimmer, brightness sensor, noise level sensor, infrared sensor, temperature sensor, tension sensor, water meter, electric intensity, power sensor	36
Living room	Contact sensor, carbon dioxide meter, hue bulb, dimmer, humidity sensor, switch, dimmer, brightness sensor, noise level sensor, infrared sensor, temperature sensor, presence sensor	39
Study room	Contact sensor, switch, dimmer, brightness sensor, noise level sensor, infrared sensor, temperature sensor, presence sensor, noise	9
Total		165

The simulation was processed on the ContextAct@A4H Real-Life dataset from Amiqual4Home²¹ which focuses on AAL in a smart home. This dataset describes daily living activities during July 2016 (1 week) and November 2016 (3 weeks) in an apartment equipped with various types of sensors and actuators. All sensors are ambient sensors deployed in a bedroom, bathroom, kitchen, study room, and around hallways. We used the November dataset with 165 data properties from different sensors. We placed 10 cluster heads based on the room, namely the bathroom, bedroom, dining room, first floor, hallway, hallway first floor, hallway second floor, kitchen, living room, and study room. Each cluster head gathers the data properties from sensor nodes.

The dataset is a log dataset of 1,108,617 tuples with 444 tuples of activity annotation (start time and stop time). However, the occupant reported missing some activity annotations. Thus, we only utilized the data that have a labeled stop time for the activity and the measured properties for which we could identify the placement of the sensor or the relative room from the dataset description file, giving 452,245 tuples. We converted the log dataset into a time series dataset with a 1-min interval, since we set all sensor nodes to transmit data to the cluster head every minute. The modified dataset contains data for 27,339 timestamps, with activity label data for 17,230 timestamps based on the stop time of the activity annotation in the log dataset. Our model assumes that we are able to perform Algorithm 2 processing on the cluster heads. In a practical and simplified way, we can simulate this model on a virtual machine that emulates a Raspberry Pi.
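The conversion from an event log to a 1-min time series can be sketched as follows. The text does not say how gaps between log entries were filled, so carrying the last known value forward is an assumption, as are the event layout, the function name, and the sensor names.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Event = Tuple[datetime, str, float]  # (timestamp, sensor name, value)


def to_minute_series(events: List[Event],
                     sensors: List[str]) -> List[Tuple[datetime, List[float]]]:
    """Build one row per minute, carrying each sensor's last value forward."""
    if not events:
        return []
    events = sorted(events, key=lambda e: e[0])
    start = events[0][0].replace(second=0, microsecond=0)
    end = events[-1][0].replace(second=0, microsecond=0)
    last: Dict[str, float] = {s: float("nan") for s in sensors}  # unseen = NaN
    rows: List[Tuple[datetime, List[float]]] = []
    idx = 0
    t = start
    while t <= end:
        # Apply every event logged before the end of the current minute.
        while idx < len(events) and events[idx][0] < t + timedelta(minutes=1):
            _, sensor, value = events[idx]
            last[sensor] = value
            idx += 1
        rows.append((t, [last[s] for s in sensors]))
        t += timedelta(minutes=1)
    return rows


log = [
    (datetime(2016, 11, 1, 8, 0, 10), "temperature", 21.0),
    (datetime(2016, 11, 1, 8, 0, 40), "door_contact", 1.0),
    (datetime(2016, 11, 1, 8, 2, 5), "temperature", 21.5),
]
rows = to_minute_series(log, ["temperature", "door_contact"])
for stamp, values in rows:
    print(stamp.strftime("%H:%M"), values)
```

Each resulting row corresponds to one timestamp of the time series, with one column per measured property.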

Results

We compared the quality of data aggregation and the size of the dataset being transmitted between the full data transmission approach, the EK-means approach,¹⁸ and our cluster-based data transmission approach. The performance results of the data reduction with the existing aggregation approach, called the EK-means aggregation algorithm at the sensor level, show results with different values for the required parameters, which include the period size of the data represented by point $T$ and a similarity threshold $δ$ between the two readings of data. We defined $T = 5$ and 10 min, with a similarity threshold $δ = 0.0005$ . The cluster-based data transmission approach applied the IDEK approach to eliminate irrelevant data, as we explained in section “IDEK.”

In the simulation, we can eliminate the irrelevant data by transmitting the measured data following the schedule as shown in Figure 7. The schedule was made from the matching result of the activity and the chosen cluster head transmission. The highest matching scores ( $S_{A}$ ) are calculated via similarity equation (1). Using the present approach, the smart home sensor network with 10 cluster heads (bathroom, bedroom, dining room, first floor, hallway, hallway first floor, hallway second floor, kitchen, living room, and study room) will decide whether to transmit the collected data to the base station every 1 min. The decisions of each cluster head are shown in Figure 7.

Figure 7.

A summary of the sensor patterns grouped and selected for cluster head transmission compared to relevant activity labels (hallway (k = 8, group = 4)—leaving home; study room (k = 2, group = 1)—working; kitchen (k = 2, group = 1)—cooking; kitchen (k = 9, group = 6)—eating; bathroom (k = 6, group = 0)—using the toilet; bathroom (k = 3, group = 0)—washing dishes; bedroom (k = 3, group = 1)—sleeping; and bathroom (k = 2, group = 1)—taking a shower).
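The per-minute decision made by a cluster head can be sketched as a nearest-centroid test. This is a simplified illustration, not the authors' implementation: it assumes the k-means centroids of a room have already been learned and that one group index has been selected as the active pattern. The centroid values, the two-feature readings, and the function names are hypothetical.

```python
import math


def nearest_group(reading, centroids):
    """Index of the centroid closest to the reading (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda g: math.dist(reading, centroids[g]))


def should_transmit(reading, centroids, active_group):
    """Transmit only when the reading falls into the selected active group."""
    return nearest_group(reading, centroids) == active_group


# Hypothetical study-room model with k = 2, where group 1 is the active pattern.
study_centroids = [(0.0, 0.1), (1.0, 0.9)]
minute_readings = [(0.05, 0.0), (0.9, 1.0), (0.1, 0.2)]
decisions = [
    should_transmit(r, study_centroids, active_group=1) for r in minute_readings
]
```

Only the second reading is forwarded to the base station in this example; the other two are treated as the stable pattern and dropped at the cluster head.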

Because some activity labels are missing from the dataset, we used only the 17,230 samples of the time series dataset that have activity labels to train and test the activity classifier. We evaluated the performance using stratified fivefold cross-validation with 90% of training and 10% of testing. Folds were made by preserving the percentage of samples for each class. The activity classification was performed on a real-life sensor-rich environmental dataset, which includes a large variety of sensors to avoid the potential reuse of data. This suggests that our proposed approach can be used in a wide area, such as a smart hospital or smart building.
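The idea of preserving the class percentages in every fold can be shown with a small, dependency-free sketch. It is an illustration only: it assigns samples round-robin within each class and does not shuffle, whereas a library routine such as scikit-learn's `StratifiedKFold` would normally be used in practice. The function name and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_folds):
    """Assign each sample index to a fold, round-robin within each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


# Toy imbalanced label set: two sleeping samples for every cooking sample.
labels = ["sleeping"] * 10 + ["cooking"] * 5
folds = stratified_folds(labels, n_folds=5)
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]
```

Every fold in this toy example keeps the same 2:1 ratio between the two classes, which is the property that matters for rare activities such as using the toilet.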

In this section, we used performance metrics consisting of balanced accuracy (BA),²² training time (s), and testing time (s) to compare the performance of the classifier on data collected by the proposed approach and by the EK-means approach. Vaizman et al.²³ used BA as a fair (balanced) version of accuracy for an imbalanced dataset such as a human activity dataset. BA incorporates both the true positive rate (sensitivity) and the true negative rate (specificity): $BA = (sensitivity + specificity) / 2$ . For energy consumption, we considered the size of data reduction.
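As a worked example of the BA formula, the following sketch computes sensitivity and specificity for one activity treated as the positive class against all others. The one-vs-rest treatment and the toy labels are assumptions made for illustration, and the function expects both positive and negative samples to be present.

```python
def balanced_accuracy(y_true, y_pred, positive):
    """BA = (sensitivity + specificity) / 2 for one class against the rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return (sensitivity + specificity) / 2


# Toy example: 2 eating samples and 4 sleeping samples.
y_true = ["eating", "eating", "sleeping", "sleeping", "sleeping", "sleeping"]
y_pred = ["eating", "sleeping", "sleeping", "sleeping", "sleeping", "eating"]
ba = balanced_accuracy(y_true, y_pred, "eating")
```

Here the sensitivity is 1/2 and the specificity is 3/4, so BA is 0.625, whereas plain accuracy would be 4/6 and would weight the majority class more heavily.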

The quality of data aggregation

The dataset labels eight activities with imbalanced sample sizes: leaving home (9143), working (139), cooking (677), using the toilet (69), eating (336), washing dishes (197), sleeping (6330), and taking a shower (339). In Tables 3 and 4 and Figure 8, we compare the activity classification results obtained with the fully transmitted data (original), the IDEK approach, and the EK-means approach with $T = 5 \min$ and the similarity threshold $δ = 0.0005$ .

Table 3.

Balanced accuracy.

Approach	Leaving home	Working	Cooking	Using toilet	Eating	Washing dishes	Sleeping	Taking shower
Original	1.00	1.00	0.97	0.94	0.93	0.88	1.00	1.00
EK-means ( $T = 5 \min$ )	1.00	1.00	0.97	0.89	0.90	0.91	1.00	0.96
IDEK	1.00	1.00	0.95	0.90	0.93	0.84	1.00	1.00

IDEK: irrelevant data elimination based on k-means algorithm.

Table 4.

Training time and testing time.

Approach	Training time (s)	Testing time ( $\times 10^{-7}$ s)
Original	1.028	6.057
EK-means ( $T = 5 \min$ )	$T + 0.924$	$T + 5.779$
IDEK	0.910	5.809

IDEK: irrelevant data elimination based on k-means algorithm.

Figure 8.

Confusion matrix and normalized confusion matrix: (a, b) fully transmitting approach (original); (c, d) EK-means approach; and (e, f) IDEK approach.

The size of data passing transmission

The original log dataset of 452,245 tuples from different scheduled times was changed into minute-long intervals by assuming that all sensor nodes sample data every minute. The modified dataset used 17,230 timestamped samples. Our model can reduce the data transmitted from the cluster heads to the base station by 51.84%, as shown in Table 5. In addition, our work also reduces energy use at the sensor node level: when a sensor node measures the same physical value as the previous value, it does not send the current value to the cluster head, thereby reducing energy consumption.

Table 5.

Size of the data reduction.

Approach	Size of data reduction (%)
Original	0
EK-means ( $T = 5 \min$ )	48.631
EK-means ( $T = 10 \min$ )	62.695
IDEK	51.839

IDEK: irrelevant data elimination based on k-means algorithm.
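The sensor-level rule of not resending an unchanged value, and the way a reduction percentage such as those in Table 5 can be computed from message counts, can be sketched as follows. The readings and function names are hypothetical, and the sketch is not the procedure used to produce Table 5.

```python
def suppress_repeats(readings):
    """Keep a reading only when it differs from the previously sent one."""
    sent = []
    for value in readings:
        if not sent or value != sent[-1]:
            sent.append(value)
    return sent


def reduction_percent(original_count, transmitted_count):
    """Share of the original messages that no longer need to be transmitted."""
    return 100.0 * (original_count - transmitted_count) / original_count


# Hypothetical temperature readings sampled once per minute.
readings = [21.5, 21.5, 21.5, 21.7, 21.7, 21.5, 21.5, 21.5]
sent = suppress_repeats(readings)
saving = reduction_percent(len(readings), len(sent))
```

In this toy trace only three of the eight readings are sent, which corresponds to a reduction of 62.5%.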

We used the model discussed in Heinzelman et al.⁷ to analyze the energy consumption needed to transmit data messages to the base station. For simplicity, we assume that all cluster heads are placed far from the base station at the same distance $d$ , consume $E_{Tr-elec}$ (nJ/bit) for the transmitter electronics, and consume $E_{Tr-amp}$ (pJ/bit/m$^{2}$) for the transmission amplifier to achieve an acceptable signal-to-noise ratio and transfer data messages reliably. The energy consumption $E_{Tr}$ for transmitting an aggregated data packet of $k$ bits to the base station is given as

$E_{Tr}(k, d) = k \left( E_{Tr-elec} + E_{Tr-amp} \cdot d^{2} \right)$

(4)
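A short numeric sketch shows how the transmission energy scales with the number of bits sent. It assumes the quadratic distance term of the first-order radio model, consistent with the pJ/bit/m² unit of the amplifier energy. The constants (50 nJ/bit and 100 pJ/bit/m²) are illustrative values often quoted for this radio model, and the packet size and distance are hypothetical; none of them are values reported in this study.

```python
E_ELEC = 50e-9    # assumed electronics energy, J/bit (50 nJ/bit)
E_AMP = 100e-12   # assumed amplifier energy, J/bit/m^2 (100 pJ/bit/m^2)


def transmit_energy(k_bits, d_m, e_elec=E_ELEC, e_amp=E_AMP):
    """Energy in joules to send k_bits over distance d_m (quadratic path loss)."""
    return k_bits * (e_elec + e_amp * d_m ** 2)


# Hypothetical packet of 2000 bits sent over 50 m.
full = transmit_energy(2000, 50.0)
# The same traffic after a 51.84% reduction in transmitted data.
reduced = transmit_energy(2000 * (1 - 0.5184), 50.0)
saving_percent = 100 * (1 - reduced / full)
```

Because the energy is linear in the number of bits, the saving in transmission energy equals the saving in data size when the distance is fixed.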

According to equation (4), if the size of the data passing transmission is decreased, the energy consumption needed to transmit the collected data to the base station will also be reduced. Figure 9 shows that the IDEK approach can reduce the size of the data passing transmission and provide almost the same BA as transmitting all measured data. For the EK-means algorithm, Table 5 shows that the greater the $T$ , the greater the size of data reduction. However, the EK-means approach needs around 5 min more computation time to reduce the size of the data passing transmission at a rate similar to that of the IDEK approach, and the activity prediction will be delayed by around 5 min, because the EK-means algorithm depends on two parameters: the period size of the data representing point $T$ , over which readings must be collected, and the similarity threshold $δ$ between the data of two readings.

Figure 9.

The comparison of the overall balanced accuracy and size of the data passing transmission.

Semi-supervised activity classification

In semi-supervised learning, we can use unlabeled data to improve the trained model because the model can learn from more data. We used our model to predict the activity of unknown (unlabeled) data. The results of the fully transmitting (original) and IDEK approaches are shown in Figure 10(b) and (c) (shorter vertical lines) compared to the ground truth in Figure 10(a) (higher vertical lines). The results show that our model can predict sleeping activity that the subject did not annotate, as shown in Table 6.

Figure 10.

The results of missing annotated activity prediction (shorter vertical line): (b) fully transmitting approach (original) and (c) IDEK approach compared to (a) ground truth (higher vertical line).

Table 6.

The schedule of sleeping activity annotation.

Start sleeping		Stop sleeping
2016-11-14	21:01:06	2016-11-15	04:46:22
2016-11-16	21:24:16	2016-11-17	05:07:57
2016-11-17	20:47:44	2016-11-18	04:46:36
2016-11-18	21:50:32	2016-11-19	07:48:03
NaN		2016-11-20	07:14:22
2016-11-20	21:31:17	2016-11-21	05:34:19
2016-11-21	21:15:04	2016-11-22	04:47:17
2016-11-22	22:05:43	2016-11-23	06:02:36
2016-11-23	21:49:08	2016-11-24	05:00:39
2016-11-24	21:22:01	2016-11-25	05:06:58
NaN		2016-11-26	08:03:52
2016-11-26	23:19:24	2016-11-27	06:58:51
2016-11-27	20:52:06	2016-11-28	05:01:26
2016-11-28	22:09:07	2016-11-29	06:02:30
2016-11-30	00:24:44	2016-11-30	06:17:43
2016-11-30	23:05:19	2016-12-01	06:10:33
2016-12-02	02:32:58	2016-12-02	07:26:02
2016-12-02	22:18:09	2016-12-03	07:02:50
2016-12-03	22:04:46	2016-12-04	08:07:52
NaN		2016-12-05	05:20:30

This model can especially increase performance on a small sample. For example, Figure 11 illustrates a large difference in the performance classifications of eating and dishwashing activities.

Figure 11.

A confusion matrix and normalized confusion matrix: (a and b) IDEK approach in semi-supervised learning.

Discussion and conclusion

The results show that the IDEK approach, which embeds a cluster-based algorithm in the cluster heads, can enhance the data aggregation of a sensor network by eliminating irrelevant data, because the cluster heads act as local processors that pre-process the data and eliminate irrelevant data. In addition, the IDEK approach can reduce the size of the data being communicated, which affects energy consumption. The performance of this approach was compared to the existing approach, which utilizes the EK-means algorithm at the sensor level to reduce redundant data.

The EK-means algorithm needs two parameters to be defined: the period size of the data, represented by point $T$ ; and the similarity threshold $δ$ between the two readings of data. In fact, the $T$ period also defines the delay in the sensor network, as we need to wait for a $T$ period to gather the input for the EK-means algorithm. The similarity threshold $δ$ needs to be carefully defined, since it affects the error of the data representation. With the IDEK approach, even though the performance of the activity classifier was slightly poorer than with the fully transmitted data or the EK-means data as input, the training and testing times of the classifier were shorter than those with the EK-means approach, and the transmitted data, and hence the transmission energy, were reduced by 51.84% relative to full transmission.

However, our approach seems unsuitable to classify activities that are poorly captured by an ambient sensor (such as eating). These actions also include small sample or short-duration activities, such as using the toilet, and multifunctional appliance-using activities, such as using a sink for handwashing, vegetable preparation, dishwashing, and so on.

We developed this approach so that a cluster head is activated and forwards measured data only when relevant data are captured. This reduces the size of the data transmission, reduces energy consumption, and supports security and privacy in data aggregation, as not all collected data are sent to others. In contrast, most other works on data aggregation based on k-means algorithm enhancement focus on the similarities among the generated data without considering present circumstances.

In the future, if we embed more powerful artificial intelligence (AI) technologies into the cluster heads, the sensors could determine complex situations, such as when a user overuses an appliance. For example, if the user leaves a living room with the television on and cooks in the kitchen, then the sensors are active in two places. The cluster heads should be able to learn from the patterns of each room and send data to the base station only from the kitchen, where the occupant is located. Furthermore, human behaviors and environmental effects can alter these scenarios. Therefore, we may apply online learning to continuously test and update this model over time to reduce errors when the pertinent elements change.

Footnotes

Handling Editor: Zhong Shen

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is result of studies with the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2018R1C1B5045953).

ORCID iD

Siriporn Pattamaset

References

1. Chen, et al. Smart home: architecture, technologies and systems. Proc Comput Sci 2018; 131: 393–400.

2. Alaa, Zaidan, et al. A review of smart home applications based on internet of things. J Net Comput Appl 2017; 97: 48–65.

3. Lin, Bergmann. IoT privacy and security challenges for smart home environments. Informat 2016; 7(3): 44.

4. Majumder, Aghayi, Noferesti, et al. Smart homes for elderly healthcare-recent advances and research challenges. Sensors 2017; 17(11): E2496.

5. Maraiya, Kant, Gupta. Wireless sensor network: a review on data aggregation. Int J Sci Eng Res 2011; 2(4): 1–6.

6. Sirsikar, Anavatti. Issues of data aggregation methods in wireless sensor network: a survey. Proc Comput Sci 2015; 49: 194–201.

7. Heinzelman, Chandrakasan, Balakrishnan. Energy-efficient communication protocol for wireless microsensor networks. In: Proceedings of the 33rd annual Hawaii international conference on system sciences, Maui, HI, 7 January 2000, p. 10. New York: IEEE.

8. Al-Sodairi, Ouni. Reliable and energy-efficient multi-hop leach-based clustering protocol for wireless sensor networks. Sustain Comput Informat Syst 2018; 20: 1–13.

9. Nigam, Dabas. ESO-LEACH: PSO based energy efficient clustering in LEACH. J King Saud Univ Comput Informat Sci. Epub ahead of print 2 August 2018. DOI: 10.1016/j.jksuci.2018.08.002.

10. Liang, Yang, et al. Research on routing optimization of WSNs based on improved leach protocol. EURASIP J Wire Commun Network 2019; 2019(1): 194.

11. Fasolo, Rossi, Widmer, et al. In-network aggregation techniques for wireless sensor networks: a survey. IEEE Wire Commun 2007; 14(2): 70–87.

12. Rao, Jana, Banka. A particle swarm optimization based energy efficient cluster head selection algorithm for wireless sensor networks. Wirel Net 2017; 23(7): 2005–2020.

13. Daneshvar SMH, Mohajer PAA, Mazinani. Energy-efficient routing in WSN: a centralized cluster-based approach via grey wolf optimizer. IEEE Access 2019; 7: 170019–170031.

14. Harb, Makhoul, Laiymani, et al. K-means based clustering approach for data aggregation in periodic sensor networks. In: 2014 Proceedings of IEEE 10th international conference on wireless and mobile computing, networking and communications (WiMob), Larnaca, Cyprus, 8–10 October 2014, pp. 434–441. New York: IEEE.

15. Harb, Makhoul, Couturier. An enhanced k-means and ANOVA-based clustering approach for similarity aggregation in underwater wireless sensor networks. IEEE Sens J 2015; 15(10): 5483–5493.

16. Harb, Jaoude. Combining compression and clustering techniques to handle big data collected in sensor networks. In: 2018 Proceedings of IEEE middle east and north Africa communications conference (MENACOMM), Jounieh, Lebanon, 18–20 April 2018, pp. 1–6. New York: IEEE.

17. Idrees, Al-Yaseen, Taam, et al. Distributed data aggregation based modified k-means technique for energy conservation in periodic wireless sensor networks. In: 2018 Proceedings of IEEE middle east and north Africa communications conference (MENACOMM), Jounieh, Lebanon, 18–20 April 2018, pp. 1–6. New York: IEEE.

18. Rida, Makhoul, Harb, et al. EK-means: a new clustering approach for datasets classification in sensor networks. Ad Hoc Net 2019; 84: 158–169.

19. Arthur, Vassilvitskii. K-means++: the advantages of careful seeding. Technical report, Stanford University, Stanford, January 2006.

20. Pedregosa, Varoquaux, Gramfort, et al. Scikit-learn: machine learning in python. J Mach Learn Res 2011; 12: 2825–2830.

21. Lago, Lang, Roncancio, et al. The ContextAct@A4H real-life dataset of daily-living activities. In: Brézillon, Turner, Penco (eds) International and interdisciplinary conference on modeling and using context. Cham: Springer, 2017, pp. 175–188.

22. Brodersen, Ong, Stephan, et al. The balanced accuracy and its posterior distribution. In: 2010 Proceedings of the 20th international conference on pattern recognition, Istanbul, Turkey, 23–26 August 2010, pp. 3121–3124. New York: IEEE.

23. Vaizman, Weibel, Lanckriet. Context recognition in-the-wild: unified model for multi-modal sensors and multi-label classification. Proc ACM Inter Mob Wear Ubiquit Technol 2018; 1(4): 1–22.