Abstract
This paper presents a vision-based method for fire detection from fixed surveillance smart cameras. The method integrates several well-known techniques properly adapted to cope with the challenges related to the actual deployment of the vision system. Concretely, background subtraction is performed with a context-based learning mechanism so as to attain higher accuracy and robustness. The computational cost of a frequency analysis of potential fire regions is reduced by means of focusing its operation with an attentive mechanism. For fast discrimination between fire regions and fire-coloured moving objects, a new colour-based model of fire's appearance and a new wavelet-based model of fire's frequency signature are proposed. To reduce the false alarm rate due to the presence of fire-coloured moving objects, the category and behaviour of each moving object is taken into account in the decision-making. To estimate the expected object's size in the image plane and to generate geo-referenced alarms, the camera-world mapping is approximated with a GPS-based calibration process. Experimental results demonstrate the ability of the proposed method to detect fires with an average success rate of 93.1% at a processing rate of 10 Hz, which is often sufficient for real-life applications.
1. Introduction
The safety of people and goods is a topic of great concern to society. The use of video surveillance systems is common practice when safety is to be ensured. These systems generate a high volume of video data that needs to be parsed continuously by human operators. To ease such a tedious and error-prone task in the context of fire detection, this paper proposes an automated vision-based method. Vision-based fire detection helps to overcome the limitations of contemporary smoke detectors, whose operation is constrained to indoor environments. Furthermore, in contrast to smoke detectors, vision-based systems are expected to generate sufficiently detailed data for the estimation of the fire's outline, location, and dynamics. Thermal cameras can do this in an extremely robust way. However, their high cost renders them practically non-existent in the vast majority of surveillance applications. Therefore, fire detection from low-cost surveillance cameras operating within the visible spectrum is expected to generate the highest practical impact.
The classical approach to fire detection by surveillance cameras is to classify the image pixels according to an appearance model of the fire, which can be devised to operate on the RGB [1–4], YCbCr [5], CIE L*a*b* [6] or HSI [7] colour spaces. To lower the false alarm rate, potential fire regions can be discarded when they do not comply with an expected deformation model [3, 6, 8–10]. Checking the dynamic characteristics of the potential fire's outline is also good practice for the reduction of false positives [11, 12]. To take into account the dynamic texture typical of fire, spatio-temporal wavelet analysis can be applied as well [3, 13]. The idea is to exploit the well-known flickering and textured characteristics of flames [14] for their detection.
Despite all the developments in fire detection, there is a lack of reports on integrated solutions ready for deployment in real-life scenarios. To be properly fielded, vision systems need to: (1) handle exceptions; (2) manage the speed-accuracy trade-off; (3) avoid perceptual aliasing situations; and (4) be embedded with seamless calibration procedures. In the case of fire detection, these challenges are related to: (1) handling sudden background changes; (2) determining when a computationally intensive frequency analysis is worth applying; (3) detecting and tracking potential distractors, such as people with fire-coloured clothing; and (4) automatically learning the camera-world coordinates mapping. All these challenges demand the proper selection, adaptation and integration of key previous work, as constrained by robustness and computational parsimony requirements.
In addition to offering an integrated solution, this paper also proposes the following adaptations to key elements of the fire detection system: (1) an attentive mechanism to focus the application of expensive yet accurate frequency analysis; (2) an object detection and tracking pipeline for the reduction of object-induced false fire alarms; (3) a context-based gating of the background learning processes for reducing the chances of erroneously learning moving objects (which is vital for proper object detection and tracking); (4) a GPS-based learning mechanism to automatically approximate the camera-world transformation and, thus, provide the vision system with scale-awareness and enable geo-referenced alarm reporting; (5) a new colour-based model of fire's appearance for enhanced detection accuracy; and (6) a new wavelet-based model of fire's spatio-temporal frequency signature.
Experimental results on a dataset of videos obtained from the Internet show the ability of the proposed method to detect fires with an average success rate of 93.1% at a processing rate of 10 Hz. These results show that, with the proposed method, the activity of human operators becomes less error-prone and less tedious.
This paper is organized as follows. First, Section 2 provides a general overview of the proposed method. Afterwards, Section 3 and Section 4 describe the fire detection and confirmation pipelines, respectively. Following this, the experimental results are presented in Section 5. Finally, Section 6 provides a set of conclusions and highlights future work.
2. General Overview
Figure 1 depicts the proposed method's processing pipeline. The method starts by detecting which regions of the input image correspond to objects in motion.

Figure 1. The proposed method's processing pipeline
Many well-known techniques for motion detection can be applied for this purpose [15–17]. Due to its simplicity - which is vital for fast computation - in this work motion detection is done by applying a dynamic threshold to the magnitude of each pixel's intensity variation across three consecutive frames [16]. The result of this process is a binary image
The
(1) segmenting fire regions according to a colour model; (2) determining which of the segmented regions present a dynamic texture; and (3) filtering out the regions with dynamic texture that do not exhibit the spatio-temporal frequency signature of typical fires. Despite the pipeline's robustness, the presence of challenging fire-coloured moving objects may still induce false fire alarms. To reduce the fire false alarm rate in these situations, knowledge about the location and category of the moving objects in the scene is used. This processing is the responsibility of the
Ideally, to foster situational awareness in humans, fire alarms should be geo-referenced. For this purpose, the events natively described in the camera-frame need to be described in the world-frame, that is, they need to be mapped from pixel coordinates to GPS coordinates. Conversely, the inverse mapping allows the object detection and tracking process to reject distractors based on the expected size of the objects' bounding boxes. One possibility to solve the mapping problem would be to know beforehand the GPS position of the camera, to make a few assumptions regarding the planarity of the environment, and to employ a camera calibration procedure. Nonetheless, here, calibration is done by learning from observing a moving person in the environment. This approach avoids the hard planar assumption and makes the calibration procedure intuitive and, thus, easily deployable.
To accumulate learning data, a human equipped with a GPS-enabled PDA moves in the scene while being tracked by the system. During the process, the person's GPS position is stored alongside the bounding box reported by the tracker. The resulting set of these tuples defines the learning set, which is then processed by a weighted K-nearest neighbours algorithm whenever the world-frame position and the expected object's bounding box must be retrieved, given an image position. In our experiments, k = 3 provided the best results. To match the query and the elements in the learning set, the Euclidean distance is used. Figure 2 depicts a typical calibration. This process provides good enough results given a learning set that covers at least the boundaries of the scene.
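The query-time lookup described above (rank the stored samples by Euclidean distance to the queried image position, keep the k closest, and combine them with weights) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `knn_lookup`, the tuple layout of the learning set, and the inverse-distance weighting are all assumptions, since the text does not state how the weights are computed.

```python
import math


def knn_lookup(learning_set, query_xy, k=3, eps=1e-6):
    """Estimate (lat, lon, box_w, box_h) for an image position.

    learning_set: list of ((x, y), (lat, lon, box_w, box_h)) tuples,
                  one per tracked sample of the walking person.
    Inverse-distance weights are an assumption of this sketch.
    """
    if not learning_set:
        raise ValueError("learning set is empty")
    # Rank stored samples by Euclidean distance in the image plane.
    ranked = sorted(
        ((math.dist(pos, query_xy), values) for pos, values in learning_set),
        key=lambda item: item[0],
    )[:k]
    # Closer samples contribute more; eps avoids division by zero.
    weights = [1.0 / (dist + eps) for dist, _ in ranked]
    total = sum(weights)
    n_values = len(ranked[0][1])
    return tuple(
        sum(w * values[i] for w, (_, values) in zip(weights, ranked)) / total
        for i in range(n_values)
    )


# Hypothetical learning set: image position -> (lat, lon, box width, box height).
samples = [
    ((0, 0), (38.0, -9.0, 20.0, 40.0)),
    ((100, 0), (38.001, -9.0, 30.0, 60.0)),
    ((0, 100), (38.0, -9.001, 40.0, 80.0)),
    ((100, 100), (38.001, -9.001, 50.0, 100.0)),
]
estimate = knn_lookup(samples, (50, 0), k=2)
```

Because the estimate is a weighted average of stored samples, queries outside the region covered by the learning set cannot be extrapolated, which matches the requirement that the learning set cover at least the boundaries of the scene.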

Figure 2. Typical calibration obtained in the environment depicted in Figure 9. (a) GPS positions, recorded during the learning phase, overlaid on satellite imagery - the positions are interpolated for improved readability. The red overlay corresponds to the camera's field of view. (b) Expected bounding box's height represented by brightness level, given the training set represented by the set of overlaid bounding boxes. The smear in the image is a result of the inability of the system to generalize beyond the boundaries imposed by the learning set.

Figure 3. Representative fire-containing images (top-row) and corresponding classification with the proposed HY-method for colour-based fire regions' segmentation (bottom-row), i.e., binary images

Figure 4. Representative fire-containing images (top-row) and corresponding ground-truth fire/no-fire labels (bottom-row) for each of the four analysed categories. Fire and non-fire labels represented by white and black pixels, respectively.

Figure 5. Dynamic texture detection. (a) Input image of a video stream containing fire. (b) Temporal filter output,

Figure 6. Spatio-temporal frequency analysis. (a) An image from the input video stream. (b) An image of the cropped video stream around a dynamic fire-coloured region. Classifications of the cropped video stream with 1-D DWT and 2-D DWT in (c) and (d), respectively.

Figure 7. Background subtraction process. (a) Input image. (b) Foreground image mask prior to shadow removal,

Figure 8. Typical occlusion situation between two moving objects. (a) Moments before occlusion. (b) Occlusion occurs. (c) Moments after occlusion. The objects are successfully associated with the same trackers before and after the occlusion.

Figure 9. Typical occlusion situation between a moving object and a static structure in the environment. (a) Moments before occlusion. (b) Occlusion occurs. (c)-(d) Moments after occlusion. The object is successfully associated with the same tracker before and after the occlusion.
3. Fire Detection
3.1 Colour-based Analysis
For the colour-based pixel classification process of the detected moving regions, the performance of three pixel colour classification methods for fire detection [4, 5, 7] is analysed here. The methods described in [7], [5], and [4] rely on the HSI, YCbCr, and RGB colour spaces and are hereafter referred to as the ‘H-method’, ‘Y-method’, and ‘R-method’, respectively. Two novel combinations of these three original methods are also studied here. In the first combination - hereafter the ‘HYR-method’ - pixels are classified as fire iff consensually classified likewise by the three original methods. In the second combination - hereafter the ‘HY-method’ - a consensus between the H-method and the Y-method suffices to classify a given pixel as fire. The goal of all these methods is to produce, for a given frame
To assess each of the classification methods, a dataset of 217 images containing flames arising from everyday situations was used. For an analysis in context, these images were divided into four categories: indoor, night, rural, and urban. Ground-truth data was generated by hand-labelling all the images' pixels as either fire or non-fire (see Figure 4). The classification methods were applied to all the images and their output binary masks compared to the hand-labelled ground-truth data. The resulting pixel-wise true and false positives and negatives were used to build a confusion matrix for each method-category pair. The two-class Matthews correlation coefficient (MCC) was then calculated for each confusion matrix (see Table 1). The MCC metric is well known for its ability to handle unbalanced datasets. The closer that the MCC is to 1, the better the hypothesis matches the ground-truth. The results show that the HY configuration is the most consistent across the dataset and, thus, it is selected for the proposed method. These results highlight the weakness of the RGB colour space and the complementary role of both the HSI and YCbCr colour spaces in fire detection. The results also show that a colour-based recognition process alone is insufficient for the robust segmentation of fire regions. Table 2 summarizes the average processing time of each of the tested classification methods (see Section 5.1 for details on the experimental setup).
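The evaluation described above can be illustrated with a short sketch: binary masks from individual classifiers are combined by pixel-wise consensus, compared against a hand-labelled mask to obtain the confusion counts, and scored with the two-class MCC. The function names and the list-of-rows mask layout are illustrative assumptions, as is the convention of returning 0 when the MCC denominator vanishes; the individual H-, Y- and R-method classifiers are not reproduced here.

```python
import math


def consensus_mask(*masks):
    """Pixel-wise AND of equally sized binary masks (lists of rows of 0/1)."""
    return [
        [int(all(pixels)) for pixels in zip(*rows)]
        for rows in zip(*masks)
    ]


def confusion_counts(predicted, truth):
    """Return pixel-wise (tp, fp, tn, fn) for two binary masks."""
    tp = fp = tn = fn = 0
    for pred_row, truth_row in zip(predicted, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif not p and t:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Two-class Matthews correlation coefficient, in [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: undefined MCC is reported as 0.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy masks standing in for the outputs of two classifiers and the labels.
h_mask = [[1, 1, 0], [0, 1, 0]]
y_mask = [[1, 0, 0], [0, 1, 1]]
truth = [[1, 0, 0], [0, 1, 0]]

hy_mask = consensus_mask(h_mask, y_mask)
score_h = mcc(*confusion_counts(h_mask, truth))
score_hy = mcc(*confusion_counts(hy_mask, truth))
```

In this toy case each individual mask contains one false positive that the other does not, so the consensus removes both and scores higher than either mask alone. This is the complementary behaviour that motivates combining colour spaces.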
Table 1. Colour-based fire detection comparative results (MCC)
Table 2. Colour-based fire detection processing times
3.2 Dynamic Textures Detection
Fire regions in video streams exhibit a dynamic texture. To perform the rapid detection of dynamic textures, a motion-history image (inspired by [15]) is computed with a parametric recursive temporal filter applied to the fire/non-fire binary image
where λ1 and λ2 are empirically defined scalars, and
To actually determine the presence of a dynamic texture, a threshold
3.3 Spatio-temporal Frequency Analysis
The elements present in
One of the main characteristics of fire is its flickering rate at a frequency of around 10 Hz, no matter what materials and fuels are involved in the process [18]. This
As in [3], the actual decision as to whether a pixel corresponds to a fire region is reached if its corresponding DWT filter bank's output has a minimum of a few peaks (three in the current implementation) above a reasonably high amplitude (100 in the current implementation). For the entire area under analysis to be labelled as fire, the following two conditions must be met. First, the ratio of the analysed pixels that were labelled as fire must be above a given threshold (0.15 in the current implementation). Second, the accumulated number of zero-crossings in the filters' outputs must be above another given threshold (three times the area of the image in the current implementation). Peaks, which are not considered in the original DWT-based model [3], are analysed in a pixel-wise fashion in order to avoid saturating the metric with spuriously high, peaked locations.
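The decision rule above can be sketched as follows, taking as input one filter-bank output sequence per analysed pixel. The thresholds are those quoted in the text. The function names are illustrative, and the definition of a peak (a local maximum of the absolute value) is an assumption of this sketch, as is passing the image area explicitly.

```python
def count_peaks(signal, amplitude=100.0):
    """Count local maxima of |signal| above `amplitude`.

    The peak definition is an assumption; a plateau counts once.
    """
    mags = [abs(v) for v in signal]
    return sum(
        1
        for i in range(1, len(mags) - 1)
        if mags[i] > amplitude and mags[i] > mags[i - 1] and mags[i] >= mags[i + 1]
    )


def count_zero_crossings(signal):
    """Count sign changes, ignoring exact zeros."""
    signs = [v > 0 for v in signal if v != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


def region_is_fire(pixel_signals, area, min_peaks=3, amplitude=100.0,
                   min_fire_ratio=0.15, crossings_per_pixel=3):
    """Apply the pixel-wise peak test and the two region-level conditions."""
    if not pixel_signals:
        return False
    fire_pixels = sum(
        count_peaks(s, amplitude) >= min_peaks for s in pixel_signals
    )
    crossings = sum(count_zero_crossings(s) for s in pixel_signals)
    return (fire_pixels / len(pixel_signals) > min_fire_ratio
            and crossings > crossings_per_pixel * area)


# Hypothetical filter outputs: a flickering pixel and a steady one.
flicker = [0, 150, -20, 160, -30, 170, -10, 0]
steady = [5, 6, 5, 6, 5, 6, 5, 6]
decision = region_is_fire([flicker, flicker, flicker, steady], area=4)
```

Counting peaks per pixel before aggregating means that a single location with many strong peaks raises the count of fire pixels by only one, which is the saturation the text seeks to avoid.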
To further reduce the chance of generating false alarms, the textured nature of a fire's flames (i.e., its spatial frequency) is also verified. This is attained by means of applying a 2-D DWT to the first image of the video stream under analysis [3]. Distinct from the original application of this method to fire detection [3] - which used a single-stage filter bank - here, a three-stage filter bank is considered for additional accuracy. Furthermore, rather than applying a threshold to the energy of the pixels belonging to a single frame, the threshold is applied to the average, minimum, and sum values computed across the entire frame set. In the current implementation, these thresholds are set to 1.0, 0.01, and 50.0, respectively (see Figure 6). To avoid polluting the frequency analysis with the oscillation caused by the intermittent visualization of the foreground and background in the flame's boundaries, these are first removed.
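A rough sketch of the spatial check is given below. It swaps in a plain Haar wavelet for the unspecified filter bank, defines a frame's energy as the summed squared detail coefficients of three stages divided by the number of pixels, and treats each of the three thresholds as a lower bound. All three choices are assumptions made for illustration, as are the function names, and the removal of flame boundaries is not modelled.

```python
def haar_stage(image):
    """One 2-D Haar stage on a list-of-rows image with even dimensions.

    Returns the approximation sub-band and the summed squared detail
    coefficients (LH, HL and HH together).
    """
    rows, cols = len(image) // 2, len(image[0]) // 2
    approx, detail_energy = [], 0.0
    for r in range(rows):
        approx_row = []
        for c in range(cols):
            a, b = image[2 * r][2 * c], image[2 * r][2 * c + 1]
            d, e = image[2 * r + 1][2 * c], image[2 * r + 1][2 * c + 1]
            approx_row.append((a + b + d + e) / 2.0)
            lh = (a + b - d - e) / 2.0
            hl = (a - b + d - e) / 2.0
            hh = (a - b - d + e) / 2.0
            detail_energy += lh * lh + hl * hl + hh * hh
        approx.append(approx_row)
    return approx, detail_energy


def frame_energy(image, stages=3):
    """Detail energy per pixel over `stages` Haar stages (assumed definition).

    Image dimensions must be divisible by 2 ** stages.
    """
    n_pixels = len(image) * len(image[0])
    total, current = 0.0, image
    for _ in range(stages):
        current, energy = haar_stage(current)
        total += energy
    return total / n_pixels


def is_textured(frames, min_avg=1.0, min_min=0.01, min_sum=50.0):
    """Threshold the average, minimum and sum of the per-frame energies."""
    energies = [frame_energy(frame) for frame in frames]
    if not energies:
        return False
    return (sum(energies) / len(energies) > min_avg
            and min(energies) > min_min
            and sum(energies) > min_sum)


# Synthetic 8 x 8 frames: a checkerboard (textured) and a flat patch.
checker = [[20 if (r + c) % 2 else 0 for c in range(8)] for r in range(8)]
flat = [[100] * 8 for _ in range(8)]
textured = is_textured([checker, checker, checker])
```

Requiring a minimum energy in every frame, in addition to the average and the sum, rejects regions that are textured only intermittently, such as the mixed checkerboard and flat sequence in the test cases.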
4. Fire Confirmation
This section describes the pipeline responsible for detecting, tracking, and recognizing objects in the scene, as well as for determining whether these were mistaken for fire regions by the fire detection pipeline (see Section 3). If not, then a fire alarm is generated by the system.
4.1 Object Detection
In line with Nummiaro et al. [19], the object detection and recognition process uses an object detection technique to initialize a set of particle filters capable of tracking objects according to a colour-based appearance model.
To determine which regions of the visual field are potentially populated by an object, we provide an adaptation to the well-known background subtraction technique proposed by Kim et al. [20]. Background subtraction is used to detect foreground objects because it is faster than solutions based on optical flow (e.g., [21]) and more robust than simple temporal differencing (e.g., [16]) (note that these considerations operate under the assumption that the camera is static in the environment).
The original background subtraction model upon which the proposed method builds [20] uses a vector of codebooks to build a model of the scene's background, which is iteratively updated. To avoid mistakenly learning foreground objects, the presence of motion can be used to cancel the update process, which makes the solution not ideal for dynamic environments. To overcome this limitation, a 3 × 3 regular grid superposed on the input image is applied here, with each cell being associated with an independently updated background model. This means that a moving object (e.g., a waving tree) in a given region of the scene no longer cancels the background update in other portions of the scene, which suffices for uncrowded scenes. In addition, the update process in each cell only occurs if there is no object already being tracked therein and no considerable motion is observed over a few seconds. Motion information comes from the method proposed by Collins et al. [16],
To remove spurious noise and holes in
Once objects are detected in
4.2 Object Tracking
Based on the binary mask
where
The importance weight of a given particle
The second factor affecting a particle's importance weight penalizes particles with a small ratio of moving pixels in its associated bounding box, given by ϑ
where
where φ is an empirically defined scalar. To allow for rapid adaptation to the appearance dynamics of the object in motion, φ was set to 0.3 in the current implementation. Finally, the proposal distributions are set based on a stochastic first-order motion model [19].
A key topic in object tracking is handling occlusions. One way to handle occlusions is to use multiple cameras with overlapping fields of view [23–25]. To cope with occlusions when using a single camera, particle filter forecasting can be used [19, 26–28]. In this manner, an object
If the estimated bounding box of a given tracker finds itself in a region without moving or foreground pixels, then the object being tracked is tagged as occluded by a static structure of the environment. In this case, the system waits for the emergence of a moving object with a similar appearance in the vicinity of the occlusion. If such an object emerges, then it is associated with the tracker and the occlusion is considered to be no longer active. If the object does not emerge for a few seconds, then the tracker is killed. The same logic is applied if objects disappear in the borders of the visual field. Figure 9 depicts a typical occlusion situation between a moving object and a static structure in the environment.
4.3 Decision-making
To reduce false alarms induced by fire-coloured moving objects, one of the two following conditions must be met in order to issue the alarm: (1) the bounding boxes of the fire region in question and of any other object being tracked do not overlap; (2) there is overlap in the bounding boxes but the distance between the current position of the overlapped object and its position when it emerged in the scene does not cross a given threshold (typically 50 pixels). The second condition ensures that only stationary objects are considered as putative fire regions.
A quasi-static object exhibiting a fire-like dynamic texture (e.g., a full-bodied person who is shaking but slowly moving and wearing fire-textured clothing) complies with these conditions and, thus, may generate an undesired fire alarm. If the object is of a known non-fire category, the alarm can be discarded immediately. The object's category is estimated with an offline learned classifier based on the histogram of oriented gradient (HOG) descriptor [29]. In the current implementation, only the people category is considered. Thus, with this information, the tracked moving objects are classified as human or else as a generic object.
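The confirmation logic of this section can be summarized with the sketch below. The class and function names are illustrative. Where several tracked objects overlap the fire region, the sketch requires every one of them to pass the checks, which is an assumption because the text does not cover that case. The 50-pixel displacement threshold is the typical value quoted above.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    x: float
    y: float
    w: float
    h: float

    def overlaps(self, other):
        """Axis-aligned bounding-box intersection test."""
        return (self.x < other.x + other.w and other.x < self.x + self.w
                and self.y < other.y + other.h and other.y < self.y + self.h)


@dataclass
class TrackedObject:
    box: Box
    start_xy: tuple        # position when the object emerged in the scene
    current_xy: tuple      # current position
    category: str = "object"   # "human" or generic "object"


def should_raise_alarm(fire_box, tracked, max_displacement=50.0):
    """Return True if the fire region should trigger an alarm."""
    for obj in tracked:
        if not fire_box.overlaps(obj.box):
            continue  # condition (1): no overlap with this object
        if obj.category == "human":
            return False  # known non-fire category: discard immediately
        if math.dist(obj.start_xy, obj.current_xy) > max_displacement:
            return False  # condition (2) violated: the object is moving
    return True


fire = Box(100, 100, 40, 60)
walker = TrackedObject(Box(110, 120, 30, 80), (10, 150), (125, 160))
alarm = should_raise_alarm(fire, [walker])
```

An overlapping object that has barely moved since it emerged is kept as a putative fire region, which is why the category check is needed to reject quasi-static people.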
Although the main purpose of using an object detection and tracking pipeline in this paper is to reduce false fire alarms, it can be used by itself to generate additional useful alarms. One of these alarms regards the presence of moving and static objects. The paths taken by the objects are also reported in order to help the operator to detect suspicious behaviour. To enrich the alarm, the object's category is also reported.
5. Experimental Results
5.1 Experimental Setup
The proposed method was fully implemented in C++ and all tests were run on an Ubuntu Linux machine equipped with an Intel Core 2 Duo 2.53 GHz processor. The OpenCV library [30] was used for the implementation of low-level image processing routines. With this setup, the method exhibits a processing rate of 10 Hz.
To validate the fire detection algorithm, a set of 12 videos obtained from the Internet was used. These videos encompass a total of 21992 frames of 300 × 250 resolution. To validate the object detection and tracking pipeline, a set of three videos with a total of 17247 frames of 600 × 480 resolution was used. In both cases, different environments and lighting conditions were covered by the dataset.
5.2 Fire Detection Results
Figure 11 depicts key frames from each video in which the fire region recognition pipeline (see Section 3) was tested. These results show the ability of the pipeline to accurately segment the fire regions from the background in a wide variety of situations. Moreover, the results also show that the presence of fire-coloured moving objects, such as people and cars, does not produce false positives. Table 3 summarizes the quantitative results obtained for the same dataset. Overall, the proposed method is able to attain a detection rate of 93.1%. A detection is considered successful when there are no false positives in the evaluated frame and at least 90% of the fire region is properly labelled by the proposed method.
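The success criterion stated above can be written down as a small scoring routine. The per-frame record layout (number of false positives, ground-truth fire area, correctly labelled fire area) and the treatment of frames without fire are assumptions of this sketch, since the text does not specify how these quantities were measured.

```python
def frame_is_success(false_positives, fire_area, labelled_fire_area,
                     min_coverage=0.9):
    """A frame succeeds with no false positives and >= 90% fire coverage."""
    if false_positives > 0:
        return False
    if fire_area == 0:
        return True  # assumed: a fire-free frame with no false positives passes
    return labelled_fire_area / fire_area >= min_coverage


def success_rate(frames):
    """Percentage of successful frames among (fp, area, labelled) records."""
    if not frames:
        raise ValueError("no frames to evaluate")
    outcomes = [frame_is_success(*record) for record in frames]
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical per-frame records, not data from the paper.
records = [(0, 100, 95), (0, 100, 80), (1, 100, 100), (0, 0, 0)]
rate = success_rate(records)
```

Under this criterion a frame with a perfectly segmented fire still fails if any false positive is present, so the reported rate penalizes false alarms as well as missed regions.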
Table 3. Fire regions recognition success rate

Figure 10. Object recognition and tracking results in three typical situations (one per row). The most likely bounding box of each object is represented by the shaded rectangles. The human labels indicate that the proposed method was able to recognize the objects as humans.

Figure 11. Representative frames of the tested dataset with the proposed fire detection algorithm output overlaid. Results are represented by the yellow contours. Note that the presence of distractors (i.e., fire-coloured moving objects) does not influence the algorithm's accuracy.
5.3 Fire Confirmation Results
Figure 10 illustrates key frames from each video in which the object detection and tracking pipeline was tested. These results show that the proposed method is able to track the objects even when their appearance and that of their surroundings are similar.
Table 4 summarizes the quantitative results obtained for the same dataset. Overall, the proposed method is able to attain 95.9% and 92.8% detection and tracking rates, respectively. The detection/tracking rate refers to the proportion of frames in which the presence/tracking of the objects in the scene is reported without false positives. The lag between the detection of the object and the initialization of the corresponding tracker is responsible for the lower value of the tracking rate when compared to the detection rate. This lag is caused by the minimum number of frames in which the object must be consecutively detected before creating a new tracker or associating it with an existing one. The proposed method also exhibited robustness to the presence of shadows. Moreover, the presence of waving trees and grass had little effect on the results. This is largely due to the extensive use of the expected object's size, given by the calibration data. A limitation of the proposed method is its inability to robustly detect motion in the far field, which means that object detection is delayed until the object is sufficiently near to the camera.
Table 4. Object detection and tracking results
A final test was run in order to assess the ability of the system to run as a whole, that is, with the fire detection and confirmation pipelines processing simultaneously. To perform this test, a properly scaled, real fire video was overlaid on a video with multiple people entering and leaving a scene (see Figure 12). This experimental setup aims to overcome the logistic difficulty of obtaining videos acquired from static cameras in situations that simultaneously exhibit fire regions and dynamic objects. Although synthetic, for an outdoor environment this setup proved capable of producing videos with good enough fidelity for the purposes of the method's validation. In fact, the fire region is promptly detected by the fire detection pipeline. All the moving people are also appropriately detected and tracked by the fire confirmation pipeline. One of the people in the video was wearing fire-coloured clothing and was asked to move vigorously in order to increase the chance that the fire detector would report an alarm. The fire confirmation pipeline always inhibited the sporadic alarms induced by this person's behaviour.

Figure 12. Typical output with the proposed method. (a) Output with fire, human and object alarms. Note that in the imaged frame, the person with reddish clothing has not yet been detected as human - this occurs in a later frame. (b) Paths taken by the several objects that crossed the scene. The circles correspond to the position in which the human category classifier reported a positive. Note that the classifier is used only until it reports a human. Individuals are represented by different colours.
6. Conclusions
A vision-based method for fire detection was presented. Experimental results showed that the method is able to segment fire regions in 93.1% of the tested dataset. A novelty in the presented method is the use of an object detection and tracking pipeline in order to reduce false fire alarms caused by fire-coloured moving objects. It was shown that the object detection and tracking pipeline by itself is able to produce a success rate of roughly 92.8% in the tested dataset. Overall, and without special code optimizations, the proposed method runs at 10 Hz.
Background subtraction was implemented in a windowed manner so as to increase robustness to the presence of artefacts. To focus the application of a computationally expensive frequency analysis component and - therefore - reduce the computational cost, an attentive mechanism based on a rough temporal analysis was proposed. An object detection and tracking algorithm was proposed and its output was integrated with the fire detection algorithm to further reduce the false alarm rate in fire detection. A new colour-based model of fire's appearance and a new wavelet-based model of fire's spatio-temporal frequency signature were proposed for improved accuracy. Finally, to avoid hard assumptions regarding the environment's configuration when determining the camera-world mapping, a GPS-based learning procedure was proposed.
In future work, we expect to improve the method in order to attain a full frame-rate on low-end computational units equipping affordable smart cameras. We also intend to include the ability to recognize multiple categories in the object detection and tracking pipeline and introduce the ability to detect smoke (which would reduce the response time in an emergency). Finally, we also intend to build a video dataset acquired from fixed surveillance cameras covering situations of co-existing dynamic objects and real fire regions. This will foster the further development of the object-based fire confirmation pipeline.
