Abstract
The automatic detection of falls within environments where sensors are deployed has attracted considerable research interest due to the prevalence and impact of falls, especially among the elderly. In this work, we analyze the capabilities of non-invasive thermal vision sensors to detect falls using several architectures of convolutional neural networks. First, we integrate two thermal vision sensors with different capabilities: (1) low resolution with a wide viewing angle and (2) high resolution with a central viewing angle. Second, we include a fuzzy representation of thermal information. Third, we enable the generation of a large data set from a small set of images using ad hoc data augmentation, which increases the original data set size by generating new synthetic images. Fourth, we define three types of convolutional neural networks, adapted for each thermal vision sensor, in order to evaluate the impact of the architecture on fall detection performance. The results show encouraging performance in single-occupancy contexts. In multi-occupancy contexts, the low-resolution thermal vision sensor with a wide viewing angle achieves better performance and shorter learning time than the high-resolution thermal vision sensor with a central viewing angle.
Introduction
The most recent studies of the World Bank estimate that the number of elderly people worldwide is increasing and is expected to double by 2050. As the average age of the population continues to rise, elderly people increasingly suffer from chronic diseases such as dementia, hypertension, diabetes, and gait issues.1,2
Fall detection is a major challenge in the area of public health care, especially for the elderly, and reliable surveillance is a necessity to mitigate the effects of falls. 3 In this context, an alarming 42% of people aged 70 and above fall annually, with 37.3 million of these falls being severe enough to require medical attention.1,4
Accidental falls experienced by elderly people are a prominent cause of hospitalization, death due to the injuries sustained, and reduced independence. Several risk factors are associated with falls in older adults; studies mainly identify physical frailty, poor balance, an unsteady gait, poor muscle strength, and cognitive impairment. The prevalence of falls also escalates as age increases, particularly when combined with other risk factors such as chronic disease, poor sleep patterns, and diminished vision. 5
Ambient assisted living (AAL) is becoming an important consideration to provide assistive technologies aimed at sustaining independence, well-being, and quality of life. 6 So, there has been a growing need to promote and support “aging in place” due to demographic issues, increasing health care costs, a shortage of caregivers, and the fundamental fact that a large portion of elderly people prefer to remain independent in their own homes for as long as possible. 7
These issues open up new research avenues for tracking activities related to elderly people’s daily routine, specifically with the aim of guaranteeing their safety. Over the last decade, interest in ubiquitous computing technologies has provided researchers with enough opportunities to design monitoring and intervention systems, which could provide continuous 24/7 real-time monitoring in environments with sensors, with the goal of improving the quality of life of elderly people. 8
In order to evaluate the proposed methodology, a case study on fall detection is presented, using data collected by two different thermal vision sensors (TVSs) and multiple convolutional neural networks (CNNs) in two different smart labs: the Smart Lab of Ulster University 9 and the Smart Lab of the University of Jaén. 10 Moreover, the data set design for data collection includes single-occupancy as well as multi-occupancy scenes.
The article is structured as follows: in this section, we have provided a review of related research in the fields of TVSs and fall detection. The methodology for the evaluation of CNNs to classify the shapes of falls from heterogeneous TVSs is presented in “Methodology” section. The experimental setup of the case study and a discussion of the results are presented in “Experimental setup” section. Finally, in “Conclusions and ongoing works” section, conclusions and ongoing works are discussed.
Related works
The automatic detection of falls within AAL scenarios has attracted considerable research interest due to the prevalence and impact of falls in the elderly, making it a crucial research area. 3 Impact-related accidents in indoor environments, such as falls and collisions, have been identified and studied in an attempt to avoid falls or reduce aid response time. 11 Fall detection approaches in AAL scenarios are divided into two categories: wearable/ambience sensors and vision sensors.3,12
In approaches based on wearable/ambience sensors, sensors are attached to an inhabitant under observation (wearable sensors or smartphones) or to objects that make up the environment where the activity takes place (dense sensing). These approaches work with time series of state changes and/or various parameter values, which are usually processed through data fusion, probabilistic or statistical analysis methods, and formal knowledge technologies. 13 The main benefit of wearable or ambience sensors is their cost efficiency. However, this kind of sensor has two main disadvantages: it is intrusive, and it relies on a fixed attachment to the object or the inhabitant, which can easily become disconnected. Furthermore, installation and setup can be complex. For these reasons, this kind of device is not a very good choice for the elderly. 3
Approaches based on vision sensors exploit computer vision techniques such as feature extraction, structural modeling, movement segmentation, action extraction, and movement tracking to analyze visual observations for pattern recognition. 13 In recent years, the number of approaches in this category has increased because video cameras are commonly included in the wearable technologies and systems we use daily. 3 General vision sensors, however, have entailed disadvantages concerning privacy and ethics. With the emergence of TVSs, these disadvantages can be mitigated, making them an excellent alternative for solutions aimed at the elderly.
Exploring state-of-the-art fall detection systems, we found recent studies within vision sensor-based approaches. In the proposal presented in Bromiley et al., 14 the image stream from the thermal detector is monitored; the extracted features include horizontal and vertical gradients, the aspect ratio, and the angle of the bounding box centroid to the horizontal axis, and falls were confirmed when the angle dropped below 45°. A fall detection system was proposed in De Miguel et al. 15 based on a low-cost device, such as a Raspberry Pi, comprising an embedded computer and camera, obtaining good performance values (i.e. 96% sensitivity), comparable to systems using more expensive and more powerful hardware. An approach for unobtrusive indoor fall detection by an infrared (IR) thermal array sensor was proposed in Hayashida et al. 16 The main innovation of this proposal was to perform the fall detection within the sensor node by a computationally inexpensive algorithm which notifies the server only when a fall has occurred. A method was proposed in Rougier et al. 17 to detect falls by analyzing human shape deformation during a video sequence: a shape matching technique was used to track the person's silhouette along the video sequence, and the shape deformation was quantified from these silhouettes based on shape analysis methods. In Asbjorn and Jim, 18 data collected from a ceiling-mounted 80 × 60 thermal array were combined with an ultrasonic sensor device; this approach monitored activities, recognizing the location and posture of an individual. In Taramasco et al., 19 a non-invasive monitoring system for fall detection in older people was presented, using very low-resolution thermal sensors to classify a fall and then alert the care staff. Furthermore, the authors analyzed the performance of three recurrent neural networks for fall detection: long short-term memory (LSTM), gated recurrent unit, and bi-LSTM.
Finally, a methodology based on CNNs to detect falls from non-invasive TVSs was presented in Medina-Quero et al. 20 with data augmentation techniques. The results show encouraging performance in single-occupancy contexts, with up to 92% accuracy, but a 10% reduction in accuracy in multiple-occupancy contexts.
Another work related to our proposal but without the application to fall detection was presented in Bayareh et al., 21 studying the diabetic foot by means of a Raspberry Pi as an embedded system and the Lepton-Flir Development Kit as an IR sensor. The IR sensor was characterized to measure the superficial temperature of the human skin radiometrically.
Most of the proposed vision-based approaches lack flexibility, as they are often case-specific, depending on the particular scenario and TVS.
In this article, we present a methodology to analyze the capabilities of non-invasive TVSs22 to detect falls by means of several architectures of CNNs in different scenarios. We propose the use of CNNs because they have provided excellent results in multiple areas, such as speech recognition, 23 image classification, 24 and gas classification. 25
The learning process with CNNs requires a large amount of data. 26 Therefore, it is necessary to collect multiple images from different inhabitants, orientations, and cases, which requires great effort. This could make customization and configuration in different contexts hugely difficult. This disadvantage can be overcome by data augmentation, which enlarges the number of learning cases from a limited set 27 and therefore reduces over-fitting. 28
Similar approaches have been proposed in recent works,29,30 where images of objects selected from a small number of human-annotated examples are projected onto the environmental background to provide new synthetic examples, as well as in thermal vision data sets. 20
In our proposal, the two studied TVSs have different capabilities. The first TVS has low resolution with a wide viewing angle and the second one has high resolution and a central viewing angle. Three types of CNN are adapted for each TVS in order to evaluate the impact of the architecture on fall detection performance. Furthermore, a large data set is generated from a set of few images as a data source, by using ad hoc data augmentation, that is, increasing the original data set size by generating new synthetic images.
Finally, we propose to include fuzzy representation of thermal information to compute the fuzzy color of human temperature. 31 Including fuzzy processing of TVS data provides (1) a filter for irrelevant information, (2) a reduction of noise from non-feasible values, 32 and (3) scaling and focusing of the relevant data range for the CNN kernels during the learning process. The use of a fuzzy approach has been demonstrated to be a successful tool for reducing uncertainty in multiple applications.33–36
Methodology
In this section, we describe the methodology applied. First, in “TVSs for analyzing fall detection” section, we describe the TVSs evaluated in this work. Second, in “Fuzzy representation of thermal information” section, we define a fuzzy representation of thermal information to improve the performance of the fall detection. Third, in “Data augmentation” section, we detail an ad hoc data augmentation for fall detection in the previous learning stage. Fourth, in “Design of the CNN” section, we describe several configurations of CNNs evaluated for each TVS.
TVSs for analyzing fall detection
In this work, we have integrated two TVSs with different capabilities to evaluate their performance in analyzing fall detection:
Low resolution with a wide viewing angle: in this case, we deployed the TVS Heimann HTPA 32 × 31, 37 which provides thermal vision with a 32 × 31 matrix, where each value defines a heat point of temperature. An effective factory calibration is integrated in the device, with no distortion by the fish-eye lens. 38 The data are collected from the TVS by means of a twisted Ethernet cable which is connected to the local area network. The middleware SensorCentral 39 integrates the TVS as a sensor source, providing the thermal sensor data within a Web Service in JSON format.
High resolution with a central viewing angle: in this case, we deployed the Lepton LWIR module included in the FLiR Dev Kit, 40 which provides thermal resolution with an 80 × 60 matrix. In addition, a Raspberry Pi 41 was used in order to collect the information from the TVS 42 in real time.
In a formal definition, each TVS provides a matrix of heat points $T = [t_{ij}]$, where each element $t_{ij}$ represents the temperature measured at point $(i, j)$ of the sensor grid (32 × 31 for the low-resolution sensor and 80 × 60 for the high-resolution one).
In Figure 1, we provide some figures on the sensors deployed and evaluated in this work.

The thermal vision sensors for analyzing fall detection evaluated in this work: (a) the Heimann HTPA with SensorCentral provides a thermal sensor with low resolution and a wide viewing angle; (b) the FLiR Dev Kit with a Raspberry Pi provides a thermal sensor with high resolution and a central viewing angle.
Fuzzy representation of thermal information
The data collected by the TVS represent the heat temperature in a matrix of points. In order to provide a visual representation, a transformation function to gray-scale values is required. In this work, we propose to define a fuzzy set to represent a fuzzy color 43 of human temperature by means of a membership function. In order to describe the fuzzy set straightforwardly, the shape of the membership function is given by a trapezoidal function $TR(t; a, b, c, d)$, defined by a lower limit $a$, a lower support limit $b$, an upper support limit $c$, and an upper limit $d$:

$$TR(t; a, b, c, d) = \max\left(\min\left(\frac{t - a}{b - a},\ 1,\ \frac{d - t}{d - c}\right),\ 0\right)$$
Including fuzzy data processing from TVSs provides (1) a filter for non-relevant information, (2) a reduction of noise from non-feasible values, 32 and (3) scaling and focusing of the relevant data range for the CNN kernels during the learning process. In Figure 2, we show an example of the application of fuzzy representation.

Images 1 and 2 show sample data from a low-resolution TVS. Image 3 shows sample data from a high-resolution TVS. Category B describes the raw TVS image data, and Category A shows the same data with fuzzy representation.
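As a minimal sketch of this representation, the trapezoidal membership can be applied point-wise to a raw heat matrix to obtain a gray-scale image. The limits used here (25–40 °C) are illustrative assumptions, not the calibrated values used in this work:

```python
import numpy as np

def trapezoidal_membership(t, a, b, c, d):
    """Trapezoidal membership TR(t; a, b, c, d): 0 below a, rising to 1
    on [b, c], falling back to 0 at d. Temperatures in degrees Celsius."""
    t = np.asarray(t, dtype=float)
    rising = (t - a) / (b - a)
    falling = (d - t) / (d - c)
    return np.clip(np.minimum(rising, falling), 0.0, 1.0)

def fuzzy_gray_image(thermal_matrix, a=25.0, b=30.0, c=37.0, d=40.0):
    """Map a raw matrix of heat points to a gray-scale image in [0, 255]
    using the fuzzy color of human temperature. The limits a..d are
    illustrative placeholders."""
    mu = trapezoidal_membership(thermal_matrix, a, b, c, d)
    return (mu * 255).astype(np.uint8)

# A 32 x 31 low-resolution frame with one warm blob (a standing person).
frame = np.full((32, 31), 21.0)          # ambient temperature
frame[10:14, 12:16] = 34.0               # body heat
gray = fuzzy_gray_image(frame)
```

Ambient pixels fall below the lower limit and map to 0, while pixels within the body-temperature support map to 255, which is precisely the filtering and scaling effect described above.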
Data augmentation
In this section, we propose the augmentation and enlargement of the image data from the original data set by means of image transformations. Thus, the innovation of our proposal is based on the creation of a new larger set of synthetic images to train the model. In this work, we have included the following image transformations—translation, rotation, and scale—to augment the original image data set:
Translation: the original image is relocated within a maximal window size
Rotation scale: the rotations are provided by two methods. First, the translated image is flipped horizontally and vertically by using a random process, which applies the transformation to a percentage of cases, defined by wH, wR respectively. Second, a rotation and scale transformation is defined by a maximal rotation angle
An example of new synthetic images is shown in detail in Figure 3 in order to extend the data set.

Images 2A and 2B show augmentation from a high-resolution TVS, and images 1A and 1B from a low-resolution TVS.
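The translation, flip, and rotation steps above can be sketched as follows. This is a simplified illustration: the shift window and flip probabilities are assumptions standing in for the parameters used in this work, and rotation is restricted to 180° so the frame shape is preserved, whereas the proposal also applies arbitrary rotation angles with re-scaling:

```python
import numpy as np

def augment(image, rng, max_shift=4, p_hflip=0.5, p_vflip=0.5):
    """Produce one synthetic frame from an original thermal frame.
    max_shift, p_hflip, and p_vflip stand in for the maximal window
    size and flip percentages (wH, wR), whose values are not
    reproduced here."""
    out = image.copy()
    # Translation within a maximal window. np.roll wraps pixels around,
    # a reasonable approximation when the background is a uniform
    # ambient temperature and shifts are small.
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(out, (dy, dx), axis=(0, 1))
    # Random horizontal / vertical flips.
    if rng.random() < p_hflip:
        out = np.fliplr(out)
    if rng.random() < p_vflip:
        out = np.flipud(out)
    # Rotation, restricted here to 180 degrees so the frame shape is
    # preserved; the full method draws an arbitrary angle below a
    # maximal rotation limit and re-scales the result.
    if rng.random() < 0.5:
        out = np.rot90(out, k=2)
    return out

rng = np.random.default_rng(0)
frame = np.full((60, 80), 21.0)          # ambient temperature
frame[20:30, 30:45] = 34.0               # warm region: a fallen person
synthetic = [augment(frame, rng) for _ in range(100)]
```

Because all transformations only relocate pixels, each synthetic frame contains exactly the same temperature values as the original, only in new positions.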
Design of the CNN
In this section, we describe several CNN architectures to classify the falls sustained by inhabitants. The two TVS devices show wide differences regarding technical characteristics and development purposes. For this reason, they are integrated within systems with different computing performance.
Regarding the low-resolution TVS (in our case, a Heimann HTPA), the thermal sensor collects a smaller matrix of heat points, which can be integrated in low-cost boards with low computing performance. For this purpose, three configurations of CNNs to classify fall detection with this kind of device are evaluated. These three CNN configurations have been previously identified as suitable structures for fall detection, 20 and their details are shown in Table 1.
Configurations of convolutional neural networks for low-resolution thermal vision sensors. [N, N] × M denotes a convolution with a window of size N × N and M filters.
Regarding the high-resolution TVS (in our case, a FLiR Dev Kit and a Raspberry Pi), the matrix of heat points is wider, requiring deeper CNN configurations to classify fall detection. In this work, we propose three CNN configurations:
Inception, which includes multiple-sized kernels operating on the same layer. 45 In this work, we integrate convolutions by 3 × 3 and 1 × 1.
Residual, which integrates residual blocks with the same topology, ending with an identity shortcut that connects outputs from lower layers as inputs to upper layers. 46 The residual blocks include convolutions by 3 × 3 and 1 × 1 for a given input and output size, which is defined for each layer: res_block([in, out]).
Configurations of convolutional neural networks for high-resolution TVSs.
The CNN architectures for the high-resolution TVS are shown in Table 2.
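To illustrate the residual topology named above, the following minimal NumPy sketch composes a 1 × 1 and a 3 × 3 convolution with an identity shortcut, in the spirit of res_block([in, out]). Weights are random and there is no batch normalization or training; the block only demonstrates the structure, not the exact architecture evaluated in this work:

```python
import numpy as np

def conv2d(x, w):
    """'Same'-padded 2D convolution: x is (H, W, C_in), w is (k, k, C_in, C_out)."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    H, W, _ = x.shape
    out = np.zeros((H, W, w.shape[3]))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k, :]
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

def relu(x):
    return np.maximum(x, 0.0)

def res_block(x, w1x1, w3x3):
    """Residual block sketch: a 1x1 and a 3x3 convolution, with an
    identity shortcut adding the input back before the final ReLU."""
    h = relu(conv2d(x, w1x1))      # 1x1 convolution
    h = conv2d(h, w3x3)            # 3x3 convolution
    return relu(h + x)             # identity shortcut

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))       # a small feature map
w1 = rng.standard_normal((1, 1, 4, 4)) * 0.1
w3 = rng.standard_normal((3, 3, 4, 4)) * 0.1
y = res_block(x, w1, w3)
```

The identity shortcut requires matching input and output channel counts, which is why each res_block([in, out]) in Table 2 specifies both sizes explicitly.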
Experimental setup
In this section, we detail the experimental setup of the case study carried out to evaluate the fall detection methodology using data collected by two different TVSs and multiple CNNs.
The data collection design to detect falls was divided into single-occupancy and multi-occupancy. In single-occupancy, we included three subcategories: (1) empty room, (2) one person standing/walking, and (3) one fallen person. In multi-occupancy, we added two new subcategories: (4) two to three people standing/walking and (5) one fallen person with another person standing/walking. The image data from three participants were collected with the two thermal sensors. While the data were being collected, each person adopted several natural positions to simulate falls and also walked around the vision area of the TVS to capture walking.
Description of case studies
The first case study was carried out in the Smart Lab of Ulster University 9 (https://www.ulster.ac.uk/research/institutes/computer-science/groups/smart-environments). The experiment was carried out in the hall of the Smart Lab. Three participants (one woman, two men) were involved in collecting data in the hall, using a TVS installed on the ceiling. The participants were 1.72, 1.68, and 1.83 m tall. The vision of the TVS in the hall was determined by a square 3.5 m bounding box (12.25 m²).
The second case study was carried out in the UJAmI Smart Lab of the CEATIC (Center for Advanced Studies in Information Technology and Communication) of the University of Jaén (Spain) 10 (http://ceatic.ujaen.es/ujami/). The experiment was also developed in the hall of the Smart Lab; analogously, three participants (one woman, two men) were involved in collecting data, using a TVS installed on the ceiling. The participants were 1.88, 1.64, and 1.70 m tall. The vision of the TVS in the hall was determined by a 2.5 × 2.0 m bounding box (5.0 m²).
In order to evaluate the two data sets, each was divided into 10% for testing and 90% for training using 10-fold cross-validation. Accuracy and time were collected over 2000 learning steps for each CNN in the case of the low-resolution TVS and 200 learning steps for the high-resolution TVS. For each data set, with 10-fold cross-validation, we computed (1) the average accuracy of the last 20 learning steps and (2) the average time spent across all steps.
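The evaluation protocol above can be sketched with a hypothetical helper that shuffles the sample indices and yields ten train/test splits, each holding out roughly 10% of the data:

```python
import random

def ten_fold_splits(n_samples, seed=0):
    """Yield (train_indices, test_indices) pairs for 10-fold
    cross-validation: each fold holds out roughly 10% of the samples
    for testing and trains on the remaining 90%."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::10] for k in range(10)]
    for k in range(10):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

# Ten train/test splits over a hypothetical data set of 500 images.
splits = list(ten_fold_splits(500))
```

Every sample appears in exactly one test fold across the ten splits, so the reported accuracy averages over the whole data set.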
The results presented in this work are available at the following URL (http://150.214.174.25:8052/thermal/).
Evaluation of low-resolution TVS with wide viewing angle
In this section, we detail the results achieved with the three types of CNNs and the performance of the fuzzy representation of thermal information to detect falls from thermal vision images. From the original data set, we include the following data augmentation steps:
Translation: the original images have been translated within a maximal window size,
Rotation scale: each image is flipped horizontally and vertically by a random probability
Crop-scale: we compute a final centered image with a window size of 28 pixels,
Evaluation of the best CNN configuration
In this section, we present the results from the low-resolution, wide viewing angle TVS, which was evaluated previously in Medina-Quero et al. 20 to detect the best CNN configuration. In Table 3, we include the data for the single- and multi-occupancy data sets.
Table summarizing the results of single- and multi-occupancy data for the low-resolution, wide viewing angle TVS.
CNN: convolutional neural network.
Bold value represents the highest precision values obtained.
Evaluation of fuzzy representation of thermal information
In this section, we evaluate performance when applying fuzzy representation to the raw data of the matrix of heat points. To define the fuzzy set which represents human temperature, we have included a trapezoidal membership function (TR is described in the “Abbreviations” in the appendix) whose limits delimit the temperature range of the human body.
In order to provide a symmetrical evaluation, both with fuzzy representation and raw data, a new augmented data set has been computed and the performance of the best configuration
Table summarizing the results of single and multi-occupancy data with raw and fuzzy representation for the best configuration

Evolution of accuracy of the best configuration

Confusion matrix for the best models. Fuzzy-based single- and multi-occupancy with
Evaluation of high-resolution TVS with central viewing angle
In this section, we detail the results of the three types of CNNs to detect falls from the high-resolution, central viewing angle TVS. From the original data set, we include the following data augmentation and fuzzy steps applied to previous learning data:
Translation: the original image is translated within a maximal window size
Rotation scale: each image is flipped horizontally and vertically by a random probability
Fuzzy configuration:
In Table 5, we include the data for the single- and multi-occupancy data set for each CNN configuration proposed in this work. In addition, the evolution of accuracy while learning is shown in Figure 6. In Figure 7, we also include a confusion matrix for the best model in multi-occupancy contexts.
Table summarizing the results of the single and multi-occupancy data for the high-resolution TVS with a central viewing angle.
CNN: convolutional neural network.

Evolution of accuracy of the six types of CNN for the high-resolution TVS with a central viewing angle in single and multi-occupancy contexts.

Confusion matrix for the best models. Single and multi-occupancy with
Discussion
In this work, two TVS devices, (1) one with low resolution and a wide viewing angle and (2) one with high resolution and a central viewing angle, together with data processing stages and different CNN architectures, are proposed to classify human falls in single- and multi-occupancy contexts.
First, high performance is obtained in single-occupancy scenarios, achieving over 90% accuracy for both devices. For the low-resolution, wide viewing angle TVS, we evaluate the impact of including fuzzy representation of thermal information against previous results; fuzzy representation has been demonstrated to increase learning speed and accuracy notably, which with
Furthermore, despite the capabilities of CNNs to extract visual features, the initial processing of information, such as fuzzy representation, is key to obtaining encouraging results. In the case of the high-resolution TVS, different CNN architectures have been evaluated, obtaining the best performance with the configuration
Second, notable performance is obtained in multi-occupancy contexts; for the high-resolution TVS, the results show a difference in accuracy of 2.3% and 7.7% between the best model and the second-best in single- and multi-occupancy contexts, respectively. However, a wide difference in performance is noted between the wide viewing angle TVS and the central viewing angle TVS. For the low-resolution TVS with a wide viewing angle, the best performance is achieved using
It is noteworthy that one of the key reasons for this low performance derives from the difference in vision area between the two devices (12.25 and 5.0 m², respectively).
Conclusions and ongoing works
In this work, we have evaluated two TVSs with different capabilities, installed on the ceiling of a smart environment, to classify the shapes of falls. Two case studies, in the Smart Lab of Ulster University (UK) and in the Smart Lab of the University of Jaén (Spain), are examined. Several CNN configurations are evaluated for each TVS. A low-resolution TVS with a wide viewing angle using fuzzy representation of thermal information provides outstanding performance in single- and multi-occupancy contexts.
In future works, we will analyze the impact of temporal sequences in dynamic data sets with fall detection in natural conditions using Deep Learning approaches on temporal models, such as LSTMs.
Footnotes
Appendix
Acknowledgements
The authors thank Dr. Martin Cooney from Halmstad University for providing the thermal vision sensor (TVS) with high resolution and a central viewing angle, as well as the six participants from the University of Jaén who collaborated in the recording of the data set.
Handling Editor: Joseph Rafferty
Author contributions
All authors contributed equally to this work.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research has received funding under the REMIND project Marie Sklodowska-Curie EU Framework for Research and Innovation Horizon 2020, under Grant Agreement No. 734355. Furthermore, this contribution has been supported by the Program “José Castillejo” Mobility stays abroad for young doctors (CAS17/00292), the Spanish Government by RTI2018-098979-A-I00, and the University of Jaén by EI_TIC1_2019.
