MCS-RF: mobile crowdsensing–based air quality estimation with random forest

Abstract

It is a great challenge to offer a fine-grained and accurate PM_2.5 monitoring service in urban areas as required facilities are very expensive and huge. Since PM_2.5 has a significant scattering effect on visible light, large-scale user-contributed image data collected by the mobile crowdsensing bring a new opportunity for understanding the urban PM_2.5. In this article, we propose a fine-grained PM_2.5 estimation method based on random forest with data announced by meteorological departments and collected from smartphone users without any PM_2.5 measurement devices. We design and implement a platform to collect data in the real world including the image provided by users. By combining online learning and offline learning, the method based on random forest performs well in terms of time complexity and accuracy. We compare our method with two kinds of baselines: subsets of the whole data sets and six classical models (such as logistic, naive Bayes). Six kinds of evaluation indexes (precision, recall, true-positive rate, false-positive rate, F-measure, and receiver operating characteristic curve area) are used in the evaluation. The experimental results show that our method achieves high accuracy (precision: 0.875, recall: 0.872) on PM_2.5 estimation, which outperforms the other methods.

Keywords

Air quality estimation, mobile crowdsensing, semi-supervised random forest, online random forest, data fusion

Introduction

Since atmospheric pollutants may cause respiratory diseases such as lung cancer and severe environmental problems (NASA climate research, http://climate.nasa.gov/causes/), urban air pollution is a critical issue in both developed and developing countries. It is necessary to give relatively accurate evaluation and analysis of air pollution in urban areas.

To communicate to the public the levels of the air pollution, such as the concentrations of $N O_{2}$ , $P M_{2.5}$ (particulate matter in the atmosphere with a diameter no more than 2.5 µm), and $P M_{10}$ , the government agency has defined the air quality index (AQI). To measure AQI values, fixed and high-cost air quality monitoring stations are required. As of 2017, there were only 38 air pollution monitoring stations in Beijing. Although more and more monitoring stations have been established in recent years, they are far from enough for fine-grained air pollution monitoring.

Previous studies on air pollution monitoring are mainly based on wireless sensor networks (WSNs) and Internet of Things.¹ In static sensor networks (SSNs), sensors are usually placed on walls or street lamps. For example, CitySense² aims to provide an urban-scale wireless networking testbed for air pollution monitoring. In vehicle sensor networks (VSNs), sensor nodes are usually deployed on buses or taxis. For example, Völgyesi et al.³ present the mobile air quality monitoring network system by a large number of car-mounted sensor nodes. As gas sensors are usually of low cost, lightweight, and with a fast response time, the programs above have a good performance on gas pollution monitoring. However, to monitor the $P M_{2.5}$ accurately, professional monitoring facilities are required, which are usually very expensive and huge. For single data sets, the problem of sparse data cannot be avoided. Other studies have tried to provide fine-grained air quality estimation based on a variety of data sources. For example, Zheng et al.⁴ propose U-Air to infer real-time and fine-grained air quality, and Yu et al.⁵ propose RAQ for predicting air quality in urban sensing systems. However, all the data sets used are fixed data, which are passive and uncontrollable, so the impact of sparse data still exists.

In recent years, the large-scale user-contributed data provided by participatory sensing^6–8 bring a new opportunity for understanding the air pollution in the city. Photos shared by users are valuable resources. According to the principle of Mie scattering,⁹ $P M_{2.5}$ has a significant scattering effect on visible light, which can be captured by photos. In cooperation with the incentive mechanism,¹⁰ the data collected in this way could be very flexible and efficient. However, it is hard to achieve the data fusion between irregular real-time data and fixed data: the distribution of the irregular real-time data changes frequently, and the time series is very important.

To achieve the fusion between irregular real-time data (the photos collected by spontaneous volunteers and motivated participants) and fixed data (the official data published by relevant agencies), two main problems need to be solved: (1) correlation analysis of data and (2) modeling and inference for heterogeneous data.

The correlation analysis of $P M_{2.5}$ and fixed data has been fully discussed by Zheng et al.⁴ On the other hand, the relationships between $P M_{2.5}$ and photos have attracted little attention. To evaluate the relationships between image features and $P M_{2.5}$ , three kinds of image features are extracted, including spatial contrast,¹¹ dark channel,¹² and HSI color differences. A large number of photos labeled with real $P M_{2.5}$ values are used to obtain the final result. The analysis results show that these features are discriminative in $P M_{2.5}$ inference.

To make the best use of the heterogeneous data, the MCS-RF model combined with offline learning and online improvement has been proposed in this article. We apply the MCS-RF model to infer the real-time and fine-grained $P M_{2.5}$ throughout Beijing based on heterogeneous data sets. The data sources collected include meteorological data, traffic data, point of interest (POI) data, and the image data. Data from different data sources are quite different from each other on temporal distribution, spatial distribution, data density, and data expression. Furthermore, the image data collected by the users have strong real-time performance. To make the best use of the data, the feature level–based data fusion¹³ method is applied. As most of the data are unlabeled and the image data have irregular real-time characteristics, an algorithm based on semi-supervised random forest (SRF) and online random forest (ORF) is proposed to estimate the urban area air quality $(P M_{2.5})$ . Compared with other baselines, our method has obvious advantages. Finally, the precision of all data sets can be 87.5% and the recall is 87.2%.

Related work

There has been a lot of research on air quality monitoring based on WSN in the past few years. At present, there are three ways to achieve the air quality monitoring: SSN, VSN, and community sensor network (CSN). In SSN, the sensor nodes are typically mounted on the streetlight or traffic light poles or walls.^2,14 It can provide accurate and reliable data and can also guarantee the network connectivity. However, to achieve the fine-grained air quality monitoring, many sensors and a customized wireless network are required. What is more, the SSN may also cause resource waste.¹⁵ In VSN, the sensor nodes are typically carried by public transportation such as buses or taxis.³ With the development of the vehicular networks,¹⁶ the VSN can provide accurate and reliable data, and the sensor nodes have high mobility. However, to achieve fine-grained air quality monitoring, the cost of the carriers could be very high. Uncontrolled or semi-controlled mobility cannot be accepted in this way. Also, the spatial-to-temporal resolution trade-off problem cannot be ignored.¹⁷ In CSN, the sensor nodes are carried by the public or professional users.¹⁸ Since the sensors used are usually very cheap and easy to operate but of low accuracy, the quality and quantity of the data can be very poor in some situations. For $P M_{2.5}$ , the sensors with acceptable accuracy are usually of very high cost, large size, and heavy weight. It is hard to provide fine-grained $P M_{2.5}$ monitoring service unless the $P M_{2.5}$ sensors have a significant improvement.

There have been many studies on estimating the extinction coefficient and the transmittance of the scene based on the photos. He et al.¹² use the dark channel to remove the haze by calculating the transmittance with a single photo. Graves and Newsam¹⁹ use the spatial contrast to estimate atmospheric light extinction with cameras. As the wavelength range of visible light in air is between 390 and 780 nm, according to Mie scattering principle,²⁰ the fine particulate matter with the diameter in the range of 0.38–0.78 µm (the major component of $P M_{2.5}$ ) would obviously affect the extinction coefficient and the transmittance of the scene. However, the correlation between $P M_{2.5}$ and extinction coefficient or transmittance is not intuitively clear.

The mobile crowdsensing and computing (MCSC) and Internet of Things have been developing fast in recent years.²¹ The MCSC applications mainly focus on environment monitoring, transportation, traffic planning, and urban dynamics sensing with smartphones. Maisonneuve et al.²² measure and map noise pollution with mobile smartphones’ microphone. Eisenman et al.²³ deploy BikeNet, an extensible mobile sensing system to measure the air pollution. Cui et al.²⁴ schedule tasks fairly for large-scale mobile device systems with mobile cloudlets. Yang et al.²⁵ propose an incentive mechanism design for mobile phone sensing. F Restuccia and SK Das²⁶ propose a trust-based framework for secure user incentivization in participatory sensing. Xiao et al.²⁷ apply a deep Q-network (DQN) to derive the optimal MCS policy against faked sensing attacks. With the rapid development of video technology, the hybrid unicast/multicast adaptive videos²⁸ may also be used for environment prediction in the future.

Framework of the MCS-RF

MCS-RF is made up of five parts as shown in Figure 1: Data Collection, Data Storage, Model Training, Inference, and User Interaction.

Figure 1.

The framework of the MCS-RF.

Data Collection

Data Collection is responsible for collecting raw data. The data collected are made up of two major parts: the user-contributed image data collected by smartphones and the public data published by relevant functional departments. In recent years, smartphones have become very popular in cities, and photography is one of their basic functions; compared with other methods based on additional equipment, our method can easily attract more participants. The user clients only upload their photos with the time stamp and GPS information; the cloud server receives the photos and collects other data from public websites, usually published by the government.

Data Storage

Data Storage is made up of three parts. The Latest Data Buffer module stores the data updated in real time and sends them to the Online Learning module. The Current Data module updates every 24 h: it receives the new data from the Latest Data Buffer module and sends its data to the Offline Learning module and the Historical Data module. The Historical Data module is responsible only for data storage.

Model Training

Model Training consists of two parts: offline learning and online learning. Offline learning achieves the initialization of the forest with all data sets every 24 h. An SRF method is applied to offline learning. Online learning achieves the adaptive incremental learning based on the data arrived in real time.

Inference

Inference takes responsibility for the major function of the platform. It stores the inference model based on the models provided by online learning and offline learning. When the user requests arrive, Inference calculates the final evaluation value based on the data collected from the user and returns the results to the User Interaction module.

User Interaction

User Interaction is applied on the users’ smartphones, which makes contributions to the user experience and also uploads the image data collected by the users.

Problem statement

This section presents the problem definition and notation. Given a collection of grids $G = G_{1} \cup G_{2}$ , where $g . p$ (the $P M_{2.5}$ in grid g) is unknown for every $g \in G_{1}$ , $g' . p$ (the $P M_{2.5}$ in grid $g'$ ) is known for every $g' \in G_{2}$ , and $| G_{1} | \geq | G_{2} |$ , we collect images $(I)$ , meteorological data $(M)$ , traffic data $(T)$ , POIs $(P)$ , and air quality monitor station reports $(R)$ in G, and aim to infer $g . p$ for $g \in G_{1}$ in real time.

Correlation analysis between PM_2.5 and image

According to the principle of Mie scattering,⁹ $P M_{2.5}$ has a significant scattering effect on visible light, which can be captured by photos. The scattering effect of visible light has a significant effect on the color of the image, so three kinds of image features are extracted to achieve the analysis. Spatial contrast is calculated by the color difference of all pixels. Dark channel could be regarded as the atomization value of the image, which usually has a positive correlation with PM_2.5 concentration, and differences of HSI color mainly focus on hue, saturation, and intensity of the image, which can also reflect the change of image caused by scattering. Sky and scene in the image are quite different in image analysis, and the separation of the sky from the scenery cannot be ignored.

Image preprocessing

Our image analysis method is based on mobile phone cameras, and most mobile phones contain image processing functions which often apply non-linear transformation that maps the observed image from the irradiance to the brightness. We use the radiometric calibration method to recover the image.²⁹

Mobile phone photos usually contain two parts: the sky and the scene. These two parts are quite different in image features. S Poduri et al.³⁰ use sky luminance to estimate air turbidity with mobile phones. It is very necessary to separate the sky area and the scene area with a low-complexity algorithm. We find that, in most of the images of outdoor scenes, if we divide the image into nine parts (3 × 3), there will always be some parts which have little sky or scenery. We collect photos taken by mobile phones from different locations in Beijing and divide each image into nine parts. By calculating the image features of different parts, we find that, if the images are converted from RGB model to HSI model, there would be a significant difference in the variance of intensity between the sky part and the scene part. According to the above conclusions, when we get a new image to estimate PM_2.5, it would be divided into nine parts. By calculating the variance of intensity of each part, we choose the highest three parts as the SCPs (scenery parts), which will be used in the extraction of dark channel features and the lowest part as the SKP (sky part), which will be used in the extraction of HSI color features.
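The sky/scene separation described above can be sketched as follows. This is a minimal illustration, assuming the image has already been converted to an HSI intensity channel stored as a list of rows; the function names `split_blocks` and `select_parts` and the toy image are hypothetical and not part of the original platform.

```python
from statistics import pvariance

def split_blocks(intensity, rows=3, cols=3):
    """Split a 2-D intensity grid (list of rows) into rows x cols flattened blocks."""
    h, w = len(intensity), len(intensity[0])
    blocks = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            blocks.append([v for row in intensity[y0:y1] for v in row[x0:x1]])
    return blocks

def select_parts(intensity, n_scenery=3):
    """Return (scenery part indices, sky part index) ranked by intensity variance."""
    variances = [pvariance(b) for b in split_blocks(intensity)]
    order = sorted(range(len(variances)), key=lambda k: variances[k])
    sky = order[0]                         # lowest variance: SKP
    scenery = sorted(order[-n_scenery:])   # three highest variances: SCPs
    return scenery, sky

# Toy 6x6 intensity image: flat "sky" rows on top, textured "scene" rows below.
toy = [[0.9] * 6,
       [0.9] * 6,
       [0.8] * 6,
       [0.8] * 6,
       [0.1, 0.9, 0.2, 0.8, 0.0, 1.0],
       [0.9, 0.1, 0.8, 0.2, 1.0, 0.0]]
scenery_parts, sky_part = select_parts(toy)
```

Blocks are indexed row by row from 0 to 8, so in the toy image the three bottom blocks are selected as SCPs and one of the flat upper blocks as the SKP.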

Spatial contrast (F_ig)

Atmospheric transmission refers to how well light radiating from a scene is preserved when it reaches an observer. The atmospheric transmission model

I (x) = J (x) t (x) + A (1 - t (x))

(1)

has been widely accepted nowadays, where $I (x)$ is the image irradiance, $J (x)$ is the scene radiance, A is the atmospheric light, and $t (x)$ is the atmospheric transmission. Atmospheric light extinction is inversely related to transmission through the following exponential equation¹¹

t (x) = \exp (- b_{ext} r (x))

(2)

where $b_{ext}$ is the extinction coefficient and $r (x)$ is the length of the visual pathway. According to Graves and Newsam’s¹⁹ study

| \nabla_{x} I (x) | = t (x) | \nabla_{x} J (x) |

(3)

transmission has the intuitive interpretation as the ratio of the observed contrast to the true contrast. So we define $F_{ig}$ as $F_{ig} = | \nabla_{x} I (x) |$
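Equation (3) can be checked numerically on a toy signal. The sketch below assumes a one-dimensional scene and a transmission $t$ that is constant over the patch (an assumption made here for illustration); the helper names `hazy` and `mean_abs_gradient` are hypothetical.

```python
def hazy(J, t, A):
    """Apply the atmospheric transmission model I = J*t + A*(1 - t) pixel-wise."""
    return [j * t + A * (1.0 - t) for j in J]

def mean_abs_gradient(signal):
    """Mean absolute forward difference of a 1-D signal."""
    diffs = [abs(b - a) for a, b in zip(signal, signal[1:])]
    return sum(diffs) / len(diffs)

J = [0.1, 0.7, 0.2, 0.9, 0.3]   # toy scene radiance
t, A = 0.4, 1.0                 # assumed constant transmission and atmospheric light
I = hazy(J, t, A)

# With constant t, the additive term A*(1 - t) cancels in the differences,
# so the observed contrast is the true contrast scaled by t.
ratio = mean_abs_gradient(I) / mean_abs_gradient(J)
```

Here `ratio` recovers the transmission up to floating-point error, which is why the observed contrast can serve as a haze-sensitive feature.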

Dark channel (F_id)

The dark channel feature, recently proposed by He et al.,¹² has been widely used in haze removal. It is based on the assumption that, in most of the non-sky blocks of an image, at least one color channel includes pixels with very low or close-to-zero intensity. Therefore, only the SCPs are used for this feature. The dark channel of an image is defined by

J_{dark} (x) = \min_{y \in Ω (x)} \left( \min_{c \in \{r, g, b\}} J^{c} (y) \right)

(4)

where $Ω (x)$ is a small block around the pixel x, J is the scene radiance, and $J^{c}$ is a color channel. From the equation, we can see that the dark channel value of a given pixel is the minimum intensity of the three color channels of the image block around it. Prior knowledge drawn from a large number of haze-free images shows¹² that the dark channel of a haze-free image should be close to zero, which gives the property

J_{dark} \to 0

(5)

By applying dark channel prior to equation (1), the estimated transmission $t (x)$ can be obtained

t (x) = 1 - \min_{y \in Ω (x)} \left( \min_{c} \frac{I^{c} (y)}{A^{c}} \right)

(6)

where $A^{c}$ is the global atmospheric light. In this situation, $A^{c}$ is picked from the highest intensity of the image, and we estimate $A^{c}$ with the brightest 0.1% of the pixels in the dark channel. According to the theoretical model proposed by H Ozkaynak et al.,³¹ there is an exponential relationship between $t (x)$ and $P M_{2.5}$

b_{ext} \approx γ P M_{2.5}

(7)

where $b_{ext}$ is the extinction coefficient and $γ$ is a constant usually set to 3.75 for urban areas in H Ozkaynak et al.’s³¹ model. Since $t (x)$ depends on $b_{ext}$ through equation (2), we denote the estimated $t (x)$ as $F_{id}$ .
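Equations (4) and (6) can be sketched in a few lines. This is an illustrative sketch only: the image is a small list of rows of RGB tuples in [0, 1], the atmospheric light `A` is passed in rather than estimated from the brightest pixels, and the window radius and function names are assumptions made here.

```python
def dark_channel(img, radius=1):
    """Equation (4): window minimum of the per-pixel minimum over color channels."""
    h, w = len(img), len(img[0])
    pix_min = [[min(px) for px in row] for row in img]
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            out[y][x] = min(pix_min[j][i] for j in ys for i in xs)
    return out

def transmission(img, A, radius=1):
    """Equation (6): one minus the dark channel of the image normalised by A."""
    norm = [[tuple(c / a for c, a in zip(px, A)) for px in row] for row in img]
    return [[1.0 - d for d in row] for row in dark_channel(norm, radius)]

A = (1.0, 1.0, 1.0)                                     # assumed atmospheric light
hazy_img = [[(0.6, 0.6, 0.6)] * 3 for _ in range(3)]    # uniformly hazy patch
clear_img = [[(0.0, 0.5, 0.9)] * 3 for _ in range(3)]   # one channel is dark
t_hazy = transmission(hazy_img, A)
t_clear = transmission(clear_img, A)
```

The uniformly hazy patch yields a transmission of about 0.4 everywhere, while the patch with a dark channel yields 1.0, matching the dark channel prior in equation (5).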

HSI color difference (F_ih, F_is, F_ii)

According to Kim and Kim’s³² study, the sky’s color difference in the HSI color space has an exponential relation with the light extinction coefficient $b_{ext}$ , which can be denoted by

b_{ext} = a e^{b Δ D}

(8)

where a and b are the coefficients in this model and $Δ D$ is used to describe the difference in the HSI color space. In this situation, we choose the image’s SKP to complete the feature extraction. Since it is hard to obtain the influence coefficient of the three components (hue, saturation, and intensity) on the extinction coefficient $b_{ext}$ , we use the difference of the three components in the HSI color space, which can be denoted as

F_{ih} = \frac{1}{m * n} \sum_{y = 1}^{n} \sum_{x = 1}^{m} \sqrt{d_{x} {(h)}^{2} + d_{y} {(h)}^{2}}

(9)

d_{x} (h) = I_{h} (x, y) - I_{h} (x + 1, y)

(10)

d_{y} (h) = I_{h} (x, y) - I_{h} (x, y + 1)

(11)

where I is the input image with $m * n$ points and $I_{h} (x, y)$ is the h value of the point $(x, y)$ . In the same way, we can obtain the definitions of $F_{is}$ and $F_{ii}$ as follows

F_{is} = \frac{1}{m * n} \sum_{y = 1}^{n} \sum_{x = 1}^{m} \sqrt{d_{x} {(s)}^{2} + d_{y} {(s)}^{2}}

(12)

F_{ii} = \frac{1}{m * n} \sum_{y = 1}^{n} \sum_{x = 1}^{m} \sqrt{d_{x} {(i)}^{2} + d_{y} {(i)}^{2}}

(13)
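Equations (9) to (13) share one computation applied to the hue, saturation, and intensity channels in turn. The sketch below assumes a channel stored as a list of rows and, since the equations do not say how the last row and column are handled, treats differences across the image border as zero; the function name `color_difference` is hypothetical.

```python
import math

def color_difference(channel):
    """Equations (9)-(11) for a single HSI channel given as a list of rows."""
    n, m = len(channel), len(channel[0])   # n rows (y), m columns (x)
    total = 0.0
    for y in range(n):
        for x in range(m):
            # Assumption: differences that would cross the image border are zero.
            dx = channel[y][x] - channel[y][x + 1] if x + 1 < m else 0.0
            dy = channel[y][x] - channel[y + 1][x] if y + 1 < n else 0.0
            total += math.hypot(dx, dy)
    return total / (m * n)

flat = [[0.5, 0.5], [0.5, 0.5]]   # uniform sky patch
ramp = [[0.0, 1.0], [0.0, 1.0]]   # horizontal step
f_flat = color_difference(flat)
f_ramp = color_difference(ramp)
```

Calling the same function on the hue, saturation, and intensity channels of the SKP gives $F_{ih}$ , $F_{is}$ , and $F_{ii}$ respectively.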

Correlation analysis

Figure 2 shows the correlation matrix between the image features $(F_{i})$ and $P M_{2.5}$ , using the image data we collected from May 2014 to March 2015. The x- and y-axis are the features we extracted from the images. We use different colors and shapes to describe the classification results shown in Figure 2. As the x- and y-axis are the same on the diagonal line, we change the y-axis to the value of PM_2.5 to show the correlations between a single feature and PM_2.5. As the photos are taken from $P M_{2.5}$ monitoring sites, all the image data are with true $P M_{2.5}$ labels. Apparently, a high $F_{id}$ usually means a low concentration of $P M_{2.5}$ (the green points shown in Figure 2). A high concentration of $P M_{2.5}$ (the dark points) is always with a low $F_{ii}$ . The green points always appear when the $F_{id}$ and $F_{ig}$ are all high, and the heavy pollution usually appears when $F_{is}$ and $F_{ii}$ are all low. In short, these features are very discriminative in $P M_{2.5}$ inference.

Figure 2.

Correlation matrix between F_i and $P M_{2.5}$ .

MCS-RF: a random forest–based approach

The MCS-RF is divided into two parts: offline learning and online learning. Since there are only a few air quality monitoring stations, most data are unlabeled, and we use the SRF³³ model to perform the offline learning process. As the image data in our system are collected in real time, to improve the estimation accuracy, the ORF³⁴ model is used to achieve the incremental learning. The two models are combined in our system as shown in Algorithm 1.

Algorithm 1. MCS-RF
Require: Sequential sets of features: F
Require: Some labeled grids $G_{l}$
Require: Some unlabeled grids $G_{u}$
Require: The recalculation time of offline learning: T
Ensure: The model produced by offline learning: $M_{off}$
Ensure: The model produced by online learning: $M_{on}$
$i \leftarrow CurrentTime$
while F is not empty do
    add F to the offline learning data sets;
    if $CurrentTime - i \geq T$ then
        $CurrentModel \leftarrow M_{off}$ ;
        $i \leftarrow CurrentTime$
    else
        CurrentModel unchanged;
    end if
    if $F \in G_{l}$ then
        $CurrentModel \leftarrow M_{on}$ ;
    else
        CurrentModel unchanged;
    end if
end while

Random forest

A random forest (RF) is a set of decision trees. Each tree in the forest is built and tested independently from the other trees.

We denote the $t$th tree of the ensemble as $f_{t} = f (x, θ_{t}) : X \to Y$ , where $θ_{t}$ is a random vector capturing the various stochastic elements of the tree. The entire forest is denoted as $F = \{f_{1}, f_{2}, \dots, f_{T}\}$ , where T is the number of trees in the forest. The estimated probability $ep (p | x)$ for predicting class $p$ (a $P M_{2.5}$ level) can be defined as

ep (p | x) = \frac{1}{T} \sum_{t = 1}^{T} e p_{t} (p | x)

(14)

where $ep_{t} (p | x)$ is the estimated density of class labels of the leaf of the $t$th tree. Then the final decision function of the forest can be defined as

C (x) = \underset{p \in Y}{\arg\max} \; ep (p | x)

(15)

In this article, we define the classification margin of a labeled sample $(x, p)$ as

m_{a} (x, p) = ep (p | x) - \max_{k \in Y, k \neq p} ep (k | x)

(16)

and from equation (16) we can easily obtain the conclusion that, if the classification is correct, $m_{a} (x, p) > 0$ should be satisfied. So the generalization error can be written as

GE = E_{(X, Y)} (m_{a} (x, p) < 0)

(17)

where E is the expectation, which is taken over the entire distribution of $(x, p)$ . Breiman³⁵ has given the upper bound of this error, which can be written as

GE \leq \bar{ρ} \frac{1 - s^{2}}{s^{2}}

(18)

In this equation, $\bar{ρ}$ can be defined by the mean correlation between pairs of trees in the forest and s is described as the expected value of the margin over the entire distribution.
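Equations (14) to (16) can be illustrated with a small worked example. The sketch below assumes each tree reports a class distribution as a dictionary; the class names and the three toy trees are made up for illustration.

```python
def forest_posterior(tree_posteriors):
    """Equation (14): average the per-tree class distributions."""
    T = len(tree_posteriors)
    classes = tree_posteriors[0].keys()
    return {c: sum(tp[c] for tp in tree_posteriors) / T for c in classes}

def decide(posterior):
    """Equation (15): the class with the largest averaged probability."""
    return max(posterior, key=posterior.get)

def margin(posterior, true_class):
    """Equation (16): probability of the true class minus the best competitor."""
    rival = max(v for c, v in posterior.items() if c != true_class)
    return posterior[true_class] - rival

# Hypothetical leaf distributions of three trees for one sample x.
trees = [{"good": 0.6, "slight": 0.3, "heavy": 0.1},
         {"good": 0.2, "slight": 0.7, "heavy": 0.1},
         {"good": 0.5, "slight": 0.4, "heavy": 0.1}]
post = forest_posterior(trees)
predicted = decide(post)
```

If the true label is the predicted class the margin is positive, and for any other label it is negative, which is the condition used in equation (17).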

Offline learning

For many semi-supervised learning algorithms, the supervised loss function of unlabeled instances can be defined in the form of

\sum_{(x, y) \in X_{l}} λ (y, h (x)) + θ \sum_{x \in X_{u}} λ_{u} (h (x))

(19)

where $X_{l}$ is the labeled data, $X_{u}$ is the unlabeled data, $h (\cdot)$ is a binary classifier, and $λ_{u} (\cdot)$ encodes the regularizer based on the unlabeled samples. The regularization paradigm can be further subdivided into two main approaches: manifold assumption^36,37 and large margin approaches.^33,38

Since we target applications with a large amount of data and manifold regularization leads to algorithms that are quadratic, that is, $O (n^{2})$ , in terms of the number of samples, in this work, we chose to use Leistner et al.’s³³ approach. Given a margin-maximizing loss function $λ (g_{p} (x))$ , the local score for a decision node $R_{j}$ is defined as

ℓ (R_{j}) = \sum_{i = 1}^{T} p_{i}^{j} λ (p_{i}^{j} - \frac{1}{T})

(20)

The margin for the unlabeled data is defined as

m_{u} (x_{u}) = \max_{i \in P M_{2.5}} g_{i} (x_{u})

(21)

Based on the definition of the margin for unlabeled samples, the overall loss can be defined as

ℓ (g) = \frac{1}{| X_{l} |} \sum_{(x, p) \in X_{l}} λ (g_{p} (x)) + \frac{α}{| X_{u} |} \sum_{x \in X_{u}} λ (m_{u} (x))

(22)

where $α$ defines the contribution rate of the unlabeled samples. Leistner et al.³³ apply deterministic annealing to iteratively solve equation (22).

In semi-supervised learning, we directly monitor the strength of the ensemble by measuring the out-of-bag estimation (OOBE), which has been proved to be a good estimate of the generalization error.³⁹ The detailed algorithm is shown in Algorithm 2. Leistner et al.³³ have shown that the time complexity of the SRF trained in this way is quite similar to that of the supervised random forest, which is usually $O (mn \log n)$ (n instances with m features).

Algorithm 2. Offline learning
Require: A set of labeled data $X_{l}$ and unlabeled data $X_{u}$
Require: The size of the forest N
Require: The starting heat parameter $K_{0}$ and the cooling function $c (K, m)$
Ensure: The model produced by offline learning: $M_{off}$
Train the random forest (RF) based on the labeled data $X_{l}$ : $F \leftarrow train (X_{l})$ ;
Compute the OOBE: $O_{F}^{0} \leftarrow oobe (F, X_{l})$ ;
Set the epoch $m = 0$ ;
while not Stopping Condition do
    Get the heat parameter $K_{m + 1} \leftarrow c (K_{m}, m)$ and set $m \leftarrow m + 1$ ;
    $\forall x_{u} \in X_{u}, l \in p$
    for $n = 1$ to N do
        $\forall x_{u} \in X_{u}$ : set $X_{n} = X_{l} \cup \{(x_{u}, \widehat{P M_{2.5}}_{u}) \mid x_{u} \in X_{u}\}$
        Train the new tree: $F_{n} \leftarrow train (X_{n})$
    end for
    Set $O_{F}^{m} \leftarrow oobe (F_{n}, X_{l})$
end while
if $O_{F}^{0} \leq O_{F}^{m}$ then
    Reset the random forest $F \leftarrow train (X_{l})$ ;
end if
Output the forest $M_{off}$

Online learning

To perform online bagging, we use N Oza and S Russell’s⁴⁰ method where the sequential arrival of the data is modeled by a Poisson distribution. That means for each tree $T (x)$ it will update k times on each instance, where k is generated by $Poisson (θ)$ and $θ$ is usually defined as a constant. To grow random trees on the fly, Saffari et al.’s³⁴ method can be a good choice, which grows extremely randomized trees by generating the test functions and thresholds randomly. The usual choices for quality measures are the entropy

L (R_{j}) = - \sum_{i = 1}^{K} {po}_{i}^{j} \log ({po}_{i}^{j})

(23)

or the Gini index

L (R_{j}) = \sum_{i = 1}^{K} {po}_{i}^{j} (1 - ({po}_{i}^{j}))

(24)

where ${po}_{i}^{j}$ is the label density of class i in node j, while K is the number of classes. Then the gain with respect to a test t can be measured as

Δ L (R_{j}, t) = L (R_{j}) - \frac{| R_{jlt} |}{| R_{j} |} L (R_{jlt}) - \frac{| R_{jrt} |}{| R_{j} |} L (R_{jrt})

(25)

where $R_{jlt}$ and $R_{jrt}$ are the left and right partitions made by the test t, respectively, and $| R_{j} |$ is the number of instances in $R_{j}$ . As shown in Algorithm 3, when splitting a node, the test with the highest gain produces the best split of the data, that is, the one that most reduces the impurity of the node. Saffari et al.³⁴ perform a non-recursive strategy for online learning, which means that a generated tree starts with only one root node. In our system, the ORF is performed on the decision trees generated by the offline learning process. A node splits when $| R_{j} | > α$ and $Δ L (R_{j}, t) > β$ , where $α$ is the minimum number of samples a node has to collect before splitting and $β$ is the minimum gain.

Algorithm 3. Online learning
Require: A set of sequential training examples $x, p$
Require: The size of the forest N
Require: The minimum number $α$ and the minimum gain $β$
Ensure: The model produced by online learning: $M_{on}$
for each tree do
    for $n = 1$ to N do
        $k \leftarrow Poisson (θ)$
        for $p = 1$ to k do
            $j \leftarrow findLeaf (x)$ and update node $(j, 〈 x, p 〉)$
            if $| R_{j} | > α$ and $Δ L (R_{j}, t) > β$ then
                Find the best test
                Create the Left Child
                Create the Right Child
            end if
        end for
    end for
    Compute $OOB E_{n} \leftarrow updateOOBE (〈 x, p 〉)$
end for
Output the forest $M_{on}$
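The quantities used by the online learner, namely the impurity measures in equations (23) and (24), the gain in equation (25), the split rule, and the Poisson draw for online bagging, can be sketched as follows. This is a minimal illustration that works on per-class sample counts; the function names are hypothetical, and the Poisson sampler uses Knuth's multiplication method, which is a choice made here rather than something specified in the text.

```python
import math
import random

def entropy(counts):
    """Equation (23), computed from per-class sample counts in a node."""
    n = sum(counts)
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

def gini(counts):
    """Equation (24), computed from per-class sample counts in a node."""
    n = sum(counts)
    return sum(c / n * (1 - c / n) for c in counts if c > 0)

def gain(parent, left, right, impurity=entropy):
    """Equation (25): impurity reduction achieved by a candidate test."""
    n = sum(parent)
    return (impurity(parent)
            - sum(left) / n * impurity(left)
            - sum(right) / n * impurity(right))

def should_split(parent, left, right, alpha, beta):
    """Split rule from the text: |R_j| > alpha and gain > beta."""
    return sum(parent) > alpha and gain(parent, left, right) > beta

def poisson(theta, rng=random):
    """Draw k ~ Poisson(theta) using Knuth's multiplication method."""
    limit, k, prod = math.exp(-theta), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

# A node holding 4 samples of each of two classes, perfectly separated by a test.
parent, left, right = [4, 4], [4, 0], [0, 4]
best_gain = gain(parent, left, right)
```

For the perfectly separating test above, the entropy gain equals $\log 2$ , the largest value possible for two balanced classes.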

Evaluation

This section consists of three parts: the first part introduces the basic structure of the experiment; the second part offers the basic description of contrast algorithms; and the third part shows the results and the analysis.

Preliminaries

In this subsection, the standard of classification, data origin, evaluation index, and the ground truth will be fully discussed.

Concentration levels of PM_2.5

AQI is a number used by government agencies to communicate to the public how polluted the air is currently. Different countries have different definitions of AQI. In China, the AQI is based on the levels of six atmospheric pollutants, namely, $S O_{2}$ , $N O_{2}$ , $P M_{10}$ , $P M_{2.5}$ , CO, and $O_{3}$ . Each pollutant has its individual air quality index (IAQI). In this article, we use the IAQI of $P M_{2.5}$ published by China’s government to define the concentration levels of $P M_{2.5}$ . As shown in Figure 3, the concentration levels are divided into six categories. We combine the categories of heavy pollution and severe pollution, as these two have the same impact on human activities: outdoor activities are not recommended under either. On the other hand, it is useful to distinguish among excellent, good, slight, and moderate, as the effects of PM_2.5 on people of different ages, sexes, or health conditions are quite different.

Figure 3.

The definition of AQI in China.
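The five classes used in the experiments can be sketched as a simple lookup. The breakpoints below are an assumption: they are the 24-h $P M_{2.5}$ concentration limits (in µg/m³) commonly associated with China's IAQI levels, not values read from Figure 3, and the level names and the function `pm25_level` are illustrative.

```python
import bisect

# Assumed upper bounds (µg/m³) of the first four levels; anything above the
# last bound falls into the merged heavy/severe class.
BREAKPOINTS = [35, 75, 115, 150]
LEVELS = ["excellent", "good", "slight", "moderate", "heavy"]

def pm25_level(concentration):
    """Map a PM2.5 concentration to one of the five merged classes."""
    return LEVELS[bisect.bisect_left(BREAKPOINTS, concentration)]

examples = {c: pm25_level(c) for c in (20, 35, 36, 150, 151, 300)}
```

Using `bisect_left` keeps each breakpoint inside the lower class, so a reading exactly at a limit is not promoted to the next level.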

Datasets

In the experiments, the data were collected from May 2014 to March 2015, and the following datasets of Beijing are all available to the public except the image data. Most of the data are unlabeled except those collected from the air quality monitoring stations.

Mobile crowdsensing data

Data collected by mobile crowdsensing have the characteristics of flexibility on temporal and spatial distribution. The data collector can obtain data from specific areas easily with the incentive mechanism. However, it is hard to guarantee the quality of the mobile crowdsensing data and the irregular real-time data can also bring challenges to data fusion. In this article, the mobile crowdsensing data consist of image data.

Image data. The image data consist of photos taken by the participants with their smartphones, from May 2014 to March 2015 in Beijing. All the photos uploaded to our participatory sensing platform have accurate GPS information. The photos near the $P M_{2.5}$ stations (within 1 km) will be labeled with the $P M_{2.5}$ reported by the station. The image data features are based on the correlation analysis between PM_2.5 and image. Three kinds of features are extracted by calculating spatial contrast (F_ig), dark channel (F_id), and HSI color difference (F_ih, F_is, F_ii).
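The labeling rule for photos can be sketched with a great-circle distance check. This is a minimal sketch under stated assumptions: the dictionary layout, the coordinates, and the readings are hypothetical, and choosing the nearest station when several lie within 1 km is an assumption made here, since the text does not say how such cases are resolved.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def label_photo(photo, stations, max_km=1.0):
    """Return the PM2.5 reading of the nearest station within max_km, else None."""
    best = None
    for s in stations:
        d = haversine_km(photo["lat"], photo["lon"], s["lat"], s["lon"])
        if d <= max_km and (best is None or d < best[0]):
            best = (d, s["pm25"])
    return None if best is None else best[1]

# Hypothetical station and photos (coordinates and readings are made up).
stations = [{"lat": 39.90, "lon": 116.40, "pm25": 80}]
near = {"lat": 39.905, "lon": 116.40}   # roughly 0.56 km away
far = {"lat": 39.95, "lon": 116.40}     # roughly 5.6 km away
near_label = label_photo(near, stations)
far_label = label_photo(far, stations)
```

Photos for which `label_photo` returns `None` stay in the unlabeled set used by the semi-supervised offline learner.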

Generic data

Generic data in this article consist of meteorological data, traffic data, POI data, and air quality monitoring station data. The data in this dataset have fixed spatial distributions and update regularly. Especially, the data with high accuracy collected from the air quality monitoring stations make up the labeled datasets.

Meteorological data. Accordingly, we collect the meteorological data from the government website (Meteorological data, http://data.cma.cn/) and identify five features: humidity (F_wh), temperature (F_wt), wind speed (F_ws), barometer pressure (F_wb), and weather (F_ww). The weather features are classified into five categories: sunny, cloudy, foggy, rainy, and snowy.

Traffic data. We employ traffic features based on grids from Baidu maps. The pixels of different colors (green, yellow, red) can describe the traffic condition. We collect the fine-grained traffic status (green, yellow, and red) from the website (Baidu map, http://map.baidu.com/) and calculate the traffic features for each grid every hour.

Air quality records. We collect real-valued AQI of six kinds of air pollutants, consisting of $S O_{2}$ , $N O_{2}$ , $CO$ , $O_{3}$ , $P M_{10}$ , and $P M_{2.5}$ , reported by air quality monitoring stations in Beijing every hour. The features of the air quality data are the concentration values of these pollutants.

POI. We employ a POI database from Baidu maps to extract $F_{p}$ for each grid. Each POI is assigned to a unique category. For each grid, the features can be defined as the numbers of the POIs in each category. As the POIs do not change frequently, we update our database every day.

Combined Data

Combined data are the combination of the mobile crowdsensing data and the generic data.

Grid

We divide the city into disjoint grids (e.g. 1 km × 1 km) and assume that the air quality within a grid is uniform. Each grid g has a geographical position coordinate $g . l$ , a timestamp $g . t$ , and a $P M_{2.5}$ label $g . p$ , which is to be inferred, or is already known if an air quality monitoring station is located in the grid. We define $g . f = \{F_{p}, F_{w}, F_{t}, \dots\}$ as the features extracted from the data we observed in the city.
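The grid definition can be sketched as a small data structure, together with a helper that maps a GPS coordinate to its cell and a per-cell POI count of the kind used for $F_{p}$. The names `Grid`, `cell_index`, and `poi_counts`, the origin coordinates, and the flat-earth approximation of roughly 111.32 km per degree of latitude are assumptions for this example only.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude

@dataclass
class Grid:
    l: tuple                               # g.l: (row, col) index of the cell
    t: int                                 # g.t: timestamp, e.g. an hour index
    p: Optional[str] = None                # g.p: PM2.5 label, None until known
    f: dict = field(default_factory=dict)  # g.f: feature name -> value

def cell_index(lat, lon, origin_lat, origin_lon, size_km=1.0):
    """Map a GPS point to the (row, col) of its size_km x size_km cell."""
    km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(origin_lat))
    row = int((lat - origin_lat) * KM_PER_DEG_LAT // size_km)
    col = int((lon - origin_lon) * km_per_deg_lon // size_km)
    return row, col

def poi_counts(pois, origin_lat, origin_lon, size_km=1.0):
    """pois: iterable of (lat, lon, category). Returns {cell: Counter}."""
    out = {}
    for lat, lon, category in pois:
        cell = cell_index(lat, lon, origin_lat, origin_lon, size_km)
        out.setdefault(cell, Counter())[category] += 1
    return out
```

A grid that contains a monitoring station would have `p` set from the station report, while all other grids keep `p = None` until the model infers a label.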

Evaluation index

The evaluation indexes we used in this article are shown in Table 1.

Table 1.

Evaluation index.

Index	Description
Precision (P)	The proportion of instances assigned to a classification that actually belong to it
Recall (R)	The proportion of instances of a classification that are correctly identified
F-measure (F1)	The harmonic mean of precision and recall
TP rate (TP)	The probability of recognizing an instance of this classification
FP rate (FP)	The probability of identifying an instance of another classification as this classification
ROC area (ROC)	The area under the plot of the TP rate against the FP rate under various threshold settings

TP: true positive; FP: false positive; ROC: receiver operating characteristic.
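To make the indexes in Table 1 concrete, the sketch below derives the per-class precision, recall, F-measure, TP rate, and FP rate from a multi-class confusion matrix, plus a support-weighted average over classes. The function names are illustrative, and the weighted averaging is an assumption about how the overall values are aggregated.

```python
import numpy as np

def per_class_metrics(cm):
    """cm[i, j] = number of instances of true class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp   # predicted as the class, but belonging elsewhere
    fn = cm.sum(axis=1) - tp   # belonging to the class, but predicted elsewhere
    tn = cm.sum() - tp - fp - fn
    with np.errstate(divide="ignore", invalid="ignore"):
        precision = np.where(tp + fp > 0, tp / (tp + fp), 0.0)
        recall = np.where(tp + fn > 0, tp / (tp + fn), 0.0)
        f1 = np.where(precision + recall > 0,
                      2 * precision * recall / (precision + recall), 0.0)
        fp_rate = np.where(fp + tn > 0, fp / (fp + tn), 0.0)
    # The TP rate of a class is the same quantity as its recall.
    return {"precision": precision, "recall": recall, "f1": f1,
            "tp_rate": recall, "fp_rate": fp_rate}

def weighted_average(values, cm):
    """Average per-class values weighted by the number of true instances."""
    support = np.asarray(cm, dtype=float).sum(axis=1)
    return float(np.average(values, weights=support))
```

For a two-class matrix `[[8, 2], [1, 9]]`, the first class has precision 8/9, recall 0.8, and FP rate 0.1, and the weighted recall over both classes is 0.85, which equals the overall accuracy.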

Ground truth

We deliberately remove a station from a grid and infer the PM_2.5 of that grid from the other stations. The actual $P M_{2.5}$ reported by the removed station is then used as the ground truth to evaluate the inference. Cross-validation is used to obtain the final result.
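The station-removal procedure can be sketched as a leave-one-station-out loop: each station is held out in turn, a model is fitted on the records of the remaining stations, and its predictions are compared with the labels reported by the held-out station. The record layout and the majority-class stand-in model are assumptions for this example; in practice `fit` and `predict` would wrap the classifier under evaluation.

```python
from collections import Counter

def leave_one_station_out(records):
    """records: list of dicts with keys 'station', 'features', 'label'."""
    stations = sorted({r["station"] for r in records})
    for held_out in stations:
        train = [r for r in records if r["station"] != held_out]
        test = [r for r in records if r["station"] == held_out]
        yield held_out, train, test

def evaluate(records, fit, predict):
    """Fraction of held-out records whose label is predicted correctly."""
    correct = total = 0
    for _, train, test in leave_one_station_out(records):
        model = fit(train)
        for r in test:
            correct += predict(model, r["features"]) == r["label"]
            total += 1
    return correct / total

# Stand-in model for illustration only: always predict the majority label.
def majority_fit(train):
    return Counter(r["label"] for r in train).most_common(1)[0][0]

def majority_predict(model, features):
    return model
```

Because the held-out station never contributes training records, the score reflects how well the PM_2.5 level of a grid without a station can be inferred.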

Baselines

We compare our method with six baselines. In the cross-validation, 10% of the dataset is held out for testing.

Logistic

This classifier builds and uses a logistic regression model with a ridge estimator.

Naive Bayes

This is a naive Bayes classifier that uses estimator classes.⁴¹

Random tree

This classifier constructs a tree that considers K randomly chosen attributes at each node. We perform no pruning in the experiment.

Back-propagation neural network

Back-propagation neural network (BP ANN) is a neural network classifier trained with back-propagation; it is also the classifier used by U-Air.⁴

Sequential minimal optimization

This is a support vector machine trained with sequential minimal optimization (SMO). The implementation replaces all missing values and transforms nominal attributes into binary ones.⁴² This method has been widely used in classification problems in recent years, such as the classification of smartphone apps.⁴³

K*

K* is an instance-based classifier: the class of a test instance is determined from the classes of the training instances similar to it, where similarity is measured with an entropy-based distance function.⁴⁴

Results

In this subsection, we first discuss the overall results, which show that MCS-RF performs best on the overall average indexes. Second, we discuss the results for each classification, which show that MCS-RF has clear advantages on the classifications that are hard to distinguish, such as slight pollution and moderate pollution. Finally, the results by feature set show that the image features improve most of the classifications, especially the ones that are hard to distinguish.

Results of features on the whole

In this area, Zheng et al.⁴ have already shown that the meteorological data, traffic data, POI, and the air quality records are closely related to the air quality. As shown in Figure 2, the image features have only a loose relationship with $P M_{2.5}$ . In this article, the features are divided into two categories: image features, which are defined as the mobile crowdsensing features (MCS-F), and the other data features, which are defined as generic data features (GD-F). Their combination is denoted Combined-F. As shown in Table 2, adding the image feature set to the generic data features improves precision in every model and improves recall in every model except logistic and BP ANN, where recall is unchanged. The other evaluation indexes also improve in most cases; the exceptions are the receiver operating characteristic (ROC) curve area of the random tree model, which drops from 0.87 to 0.811, and the FP rate of the logistic model, which rises slightly from 0.059 to 0.061. The image data therefore make a clear contribution to the air quality assessment.

Table 2.

Results of features on the whole for all models.

Model	Feature set	P	R	F1	TP	FP	ROC
Logistic	Combined-F	0.77	0.75	0.748	0.75	0.061	0.927
	MCS-F	0.603	0.6	0.599	0.6	0.117	0.828
	GD-F	0.75	0.75	0.744	0.744	0.059	0.914
Naive Bayes	Combined-F	0.804	0.806	0.804	0.806	0.052	0.954
	MCS-F	0.597	0.517	0.516	0.517	0.105	0.808
	GD-F	0.759	0.75	0.751	0.75	0.054	0.935
Random tree	Combined-F	0.818	0.811	0.812	0.811	0.049	0.811
	MCS-F	0.551	0.561	0.544	0.561	0.129	0.716
	GD-F	0.783	0.772	0.769	0.772	0.053	0.87
BP ANN	Combined-F	0.81	0.806	0.804	0.806	0.052	0.955
	MCS-F	0.576	0.572	0.573	0.572	0.126	0.799
	GD-F	0.804	0.806	0.803	0.806	0.053	0.922
SMO	Combined-F	0.753	0.756	0.754	0.756	0.059	0.942
	MCS-F	0.517	0.533	0.518	0.533	0.153	0.818
	GD-F	0.737	0.744	0.739	0.744	0.063	0.934
K*	Combined-F	0.795	0.783	0.784	0.783	0.05	0.947
	MCS-F	0.556	0.567	0.551	0.567	0.136	0.806
	GD-F	0.725	0.717	0.706	0.717	0.084	0.898
MCS-RF	Combined-F	0.875	0.872	0.871	0.872	0.034	0.972
	MCS-F	0.632	0.633	0.632	0.633	0.105	0.867
	GD-F	0.825	0.822	0.822	0.822	0.043	0.954

TP: true positive; FP: false positive; ROC: receiver operating characteristic; BP ANN: back-propagation neural network; SMO: sequential minimal optimization.

The results also demonstrate that, on our datasets, MCS-RF with the combined features outperforms the logistic, naive Bayes, random tree, BP ANN, SMO, and K* models on every index in Table 2.
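The effect of the image features can be checked directly from Table 2 by subtracting the GD-F row from the Combined-F row for each model. The snippet below copies those two rows per model from Table 2; the names `TABLE2`, `INDEXES`, and `gains` are introduced only for this illustration.

```python
# (P, R, F1, TP, FP, ROC) for Combined-F and GD-F, copied from Table 2.
TABLE2 = {
    "Logistic":    {"Combined-F": (0.77, 0.75, 0.748, 0.75, 0.061, 0.927),
                    "GD-F":       (0.75, 0.75, 0.744, 0.744, 0.059, 0.914)},
    "Naive Bayes": {"Combined-F": (0.804, 0.806, 0.804, 0.806, 0.052, 0.954),
                    "GD-F":       (0.759, 0.75, 0.751, 0.75, 0.054, 0.935)},
    "Random tree": {"Combined-F": (0.818, 0.811, 0.812, 0.811, 0.049, 0.811),
                    "GD-F":       (0.783, 0.772, 0.769, 0.772, 0.053, 0.87)},
    "BP ANN":      {"Combined-F": (0.81, 0.806, 0.804, 0.806, 0.052, 0.955),
                    "GD-F":       (0.804, 0.806, 0.803, 0.806, 0.053, 0.922)},
    "SMO":         {"Combined-F": (0.753, 0.756, 0.754, 0.756, 0.059, 0.942),
                    "GD-F":       (0.737, 0.744, 0.739, 0.744, 0.063, 0.934)},
    "K*":          {"Combined-F": (0.795, 0.783, 0.784, 0.783, 0.05, 0.947),
                    "GD-F":       (0.725, 0.717, 0.706, 0.717, 0.084, 0.898)},
    "MCS-RF":      {"Combined-F": (0.875, 0.872, 0.871, 0.872, 0.034, 0.972),
                    "GD-F":       (0.825, 0.822, 0.822, 0.822, 0.043, 0.954)},
}
INDEXES = ("P", "R", "F1", "TP", "FP", "ROC")

def gains(table):
    """Combined-F minus GD-F for each model and index, rounded to 3 decimals."""
    return {
        model: {k: round(c - g, 3)
                for k, c, g in zip(INDEXES, rows["Combined-F"], rows["GD-F"])}
        for model, rows in table.items()
    }

if __name__ == "__main__":
    for model, delta in gains(TABLE2).items():
        print(model, delta)
```

For MCS-RF the precision and recall gains are both 0.050, and the only negative ROC difference is that of the random tree model (0.811 versus 0.87). Note that a negative FP difference is an improvement, since a lower FP rate is better.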

Results of the classifications of all models

The overall index alone is not sufficient to evaluate the models. As shown in Figure 4, precisely distinguishing good, slight pollution, and moderate pollution is hard, while identifying excellent and heavy & severe pollution is much easier. In daily life, outdoor activities are not advised once the air pollution level exceeds moderate pollution. On the other hand, if the air quality level is good, outdoor activities are not affected even if the weather looks bad. Figure 4 shows the precision results; the models perform quite differently across the classifications. Our method improves considerably on the classifications of good, slight pollution, and moderate pollution, and performs on par with the other models on excellent and heavy & severe pollution. As a result, our method shows a clear improvement on average. Figure 5 shows the recall results, and the conclusion is almost the same, except that on heavy & severe pollution our method is slightly worse than naive Bayes. Figures 6–9 show the results for the other indexes, where our method also has clear advantages.

Figure 4.

Precision of the classifications.

Figure 5.

Recall of the classifications.

Figure 6.

F-measure of the classifications.

Figure 7.

TP rate of the classifications.

Figure 8.

FP rate of the classifications.

Figure 9.

ROC of the classifications.

Results of the features on classifications of MCS-RF

Figure 10 shows the precision of MCS-RF for each classification. Although the image data alone perform poorly on air quality estimation, adding the image feature set to the model brings a significant precision improvement on all classifications. In particular, the image features work well for moderate pollution, whose precision rises sharply when they are added. Combining multiple data sources can be necessary when the problem is this complex. The situation is similar for the other indexes. As shown in Figure 11, although the recall for good gets slightly worse, the recall for moderate pollution increases by almost 51%. For the other indexes, namely the F-measure (Figure 12), the true-positive (TP) rate (Figure 13), the false-positive (FP) rate (Figure 14), and the ROC area (Figure 15), the classifications of slight pollution, moderate pollution, and heavy & severe pollution all improve markedly.

Figure 10.

Precision of the MCS-RF.

Figure 11.

Recall of the MCS-RF.

Figure 12.

F-measure of the MCS-RF.

Figure 13.

TP rate of the MCS-RF.

Figure 14.

FP rate of the MCS-RF.

Figure 15.

ROC of the MCS-RF.

Conclusion and future work

In this article, we find that several image features are very discriminative in $P M_{2.5}$ inference. However, it is difficult to obtain an accurate $P M_{2.5}$ concentration from the images alone. When we add the image features to the other features we collected (the meteorological features, the traffic features, and the POI features), we achieve high accuracy on $P M_{2.5}$ estimation (precision: 0.875, recall: 0.872) without any $P M_{2.5}$ measurement devices. Our method outperforms the other methods: logistic, naive Bayes, random tree, BP ANN, K*, and SMO.

In the future, to offer a better fine-grained air pollution monitoring service, it is essential to design an incentive mechanism for obtaining image data from specific areas. Since the image features have strong effects on the evaluation results, low-quality photos must be removed from our datasets. An algorithm based on image recognition needs to run on our cloud servers to perform this image filtering. Mechanisms for user incentive and deception recognition are also needed, and new algorithms can be applied in the future.

Footnotes

Acknowledgements

This paper is an enhanced version of the paper previously published in conference proceedings of 2017 IEEE International Symposium on A World of Wireless, Mobile and Multimedia Networks (WoWMoM’17) entitled “Estimate air quality based on mobile crowdsensing and big data.”

Handling Editor: Daming Zhou

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was partly supported by the National Natural Science Foundation of China (Nos 61370197, 61402045, and 61602051), Research and Technology Verification of Address-Driven Network Architecture (2015 AA015601), and Research on Architecture and Key Technology System of Service-Oriented Software-Defined Network (2015AA016101).

References

1. Liu, Wang, et al. USA: faster update for SDN-based Internet of Things sensory environments. Comput Commun 2018; 120: 80–92.

2. Murty, Mainland, Rose, et al. CitySense: an urban-scale wireless sensor network and testbed. In: Proceedings of the IEEE conference on technologies for homeland security, Waltham, MA, 12–13 May 2008, pp.583–588. New York: IEEE.

3. Völgyesi, Nádas, Koutsoukos, et al. Air quality monitoring with SensorMap. In: Proceedings of the 7th international conference on information processing in sensor networks, St Louis, MO, 22–24 April 2008, pp.529–530. New York: IEEE.

4. Zheng, Liu, Hsieh HP. U-Air: when urban air quality inference meets big data. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, Chicago, IL, 11–14 August 2013, pp.1436–1444. New York: IEEE.

5. Yang, et al. RAQ–a random forest approach for predicting air quality in urban sensing systems. Sensors 2016; 16(1): 86.

6. Gao, Liu, Wang, et al. A survey of incentive mechanisms for participatory sensing. IEEE Commun Surv Tut 2015; 17(2): 918–943.

7. Zhang, Liu, Tang, et al. Learning-based energy-efficient data collection by unmanned vehicles in smart cities. IEEE T Ind Inform 2018; 14(4): 1666–1676.

8. Tian, Sangaiah, et al. Privacy-preserving scheme in social participatory sensing based on secure multi-party cooperation. Comput Commun 2018; 119: 167–178.

9. Wiscombe WJ. Improved Mie scattering algorithms. Appl Opt 1980; 19(9): 1505.

10. Wang, Wellman. Spoofing the limit order book: an agent-based model. In: Proceedings of the conference on autonomous agents and multiagent systems, São Paulo, 8–12 May 2017, pp.651–659. New York: ACM.

11. Seinfeld, Pandis, Noone. Atmospheric chemistry and physics: from air pollution to climate change. Hoboken, NJ: John Wiley & Sons, 2016.

12. Sun, Tang. Single image haze removal using dark channel prior. IEEE T Pattern Anal 2011; 33(12): 2341–2353.

13. Zheng. Methodologies for cross-domain data fusion: an overview. IEEE T Big Data 2015; 1(1): 16–34.

14. Liu, Chen, Lin, et al. Developed urban air quality monitoring system based on wireless sensor networks. In: Proceedings of the fifth international conference on sensing technology (ICST), Palmerston North, New Zealand, 28 November–1 December 2011, pp.549–554. New York: IEEE.

15. Dobre, Arnold, Smalley, et al. Flow field measurements in the proximity of an urban intersection in London, UK. Atmos Environ 2005; 39(26): 4647–4657.

16. Jia, Wang, et al. Performance-aware mobile community-based VoD streaming over vehicular ad hoc networks. IEEE T Veh Technol 2015; 64(3): 1201–1217.

17. Wong, Chua. Environmental monitoring using wireless vehicular sensor networks. In: Proceedings of the 5th international conference on wireless communications, networking and mobile computing, Beijing, China, 24–26 September 2009, pp.4986–4989. New York: IEEE.

18. Hasenfratz, Saukh, Sturzenegger, et al. Participatory air pollution monitoring using smartphones. In: Proceedings of the 2nd international workshop on mobile sensing, Beijing, China, 16–20 April 2012, pp.1–5. New York: ACM.

19. Graves, Newsam. Using visibility cameras to estimate atmospheric light extinction. In: Proceedings of the IEEE workshop on applications of computer vision (WACV), Kona, HI, 5–7 January 2011, pp.577–584. New York: IEEE.

20. Kerker. The scattering of light and other electromagnetic radiation: physical chemistry: a series of monographs, vol. 16. Cambridge, MA: Academic Press.

21. Ganti, Lei. Mobile crowdsensing: current state and future challenges. IEEE Commun Mag 2011; 49(11): 32–39.

22. Maisonneuve, Stevens, Niessen, et al. NoiseTube: measuring and mapping noise pollution with mobile phones. In: Proceedings of the information technologies in environmental engineering, Thessaloniki, 28–29 May 2009, pp.215–228. Berlin: Springer.

23. Eisenman, Miluzzo, Lane, et al. BikeNet: a mobile sensing system for cyclist experience mapping. ACM Trans Sens Netw 2009; 6(1): 6.

24. Cui, Song, Ren. Software defined cooperative offloading for mobile cloudlets. IEEE ACM T Network 2017; 25: 1746–1760.

25. Yang, Xue, Fang, et al. Crowdsourcing to smartphones: incentive mechanism design for mobile phone sensing. In: Proceedings of the 18th annual international conference on mobile computing and networking (Mobicom '12), Istanbul, 22–26 August 2012. New York: ACM.

26. Restuccia, Das. FIDES: a trust-based framework for secure user incentivization in participatory sensing. In: Proceedings of IEEE international symposium on a world of wireless, mobile and multimedia networks, Sydney, NSW, Australia, 19 June 2014. New York: IEEE.

27. Xiao, Han, et al. A secure mobile crowdsensing game with deep reinforcement learning. IEEE T Inf Foren Sec 2017; 13: 35–47.

28. Guo, Gong, Liang, et al. An optimized hybrid unicast/multicast adaptive video streaming scheme over MBMS-enabled wireless networks. IEEE T Broadcast 2018; PP(99): 1–12.

29. Liu, Song, Ngai, et al. PM2.5 monitoring using images from smartphones in participatory sensing. In: Proceedings of the IEEE conference on computer communications workshops (INFOCOM WKSHPS), Hong Kong, China, 26 April–1 May 2015, pp.630–635. New York: IEEE.

30. Poduri, Nimkar, Sukhatme. Visibility monitoring using mobile phones. Annual report, Center for Embedded Networked Sensing, Los Angeles, CA, 11 January, pp.125–127, 2010.

31. Ozkaynak, Schatz, Thurston, et al. Relationships between aerosol extinction coefficients derived from airport visual range observations and alternative measures of airborne particle mass. JAPCA J Air Waste Ma 1985; 35(11): 1176–1185.

32. Kim YJ. Perceived visibility measurement using the HSI color difference method. J Korean Phys Soc 2005; 46(5): 1243–1250.

33. Leistner, Saffari, Santner, et al. Semi-supervised random forests. In: Proceedings of the IEEE 12th international conference on computer vision, Kyoto, Japan, 29 September–2 October 2009, pp.506–513. New York: IEEE.

34. Saffari, Leistner, Santner, et al. On-line random forests. In: Proceedings of the IEEE 12th international conference on computer vision workshops (ICCV Workshops), Kyoto, Japan 2009, pp.1393–1400. New York: IEEE.

35. Breiman. Random forests. Mach Learn 2001; 45: 5–32.

36. Chapelle, Weston, Schölkopf. Cluster kernels for semi-supervised learning. In: Proceedings of the advances in neural information processing systems, Cambridge, MA, 9–14 December 2002, pp.585–592. New York: ACM.

37. Belkin, Niyogi, Sindhwani. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J Mach Learn Res 2006; 7: 2399–2434.

38. Joachims. Transductive inference for text classification using support vector machines. In: Proceedings of the sixteenth international conference on machine learning (ICML '99), San Francisco, CA, 27–30 June 1999, vol. 99, pp.200–209. New York: ACM.

39. Breiman. Out-of-bag estimation, 1996, https://www.stat.berkeley.edu/~breiman/arcall96.pdf

40. Oza, Russell. Experimental comparisons of online and batch versions of bagging and boosting. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, San Francisco, CA, 26–29 August 2001, pp.359–364. New York: ACM.

41. John, Langley. Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the eleventh conference on uncertainty in artificial intelligence, Montréal, QC, Canada, 18–20 August 1995, pp.338–345. San Francisco, CA: Morgan Kaufmann Publishers, Inc.

42. Platt JC. Fast training of support vector machines using sequential minimal optimization. In: Schölkopf, Burges CJC, Smola (eds) Advances in kernel methods. Cambridge, MA: MIT Press, pp.185–208, 1999.

43. Rasthofer, Arzt, Bodden. A machine-learning approach for classifying and categorizing Android sources and sinks. In: Proceedings of 2014 network and distributed system security symposium, San Diego, CA, 23–26 February 2014, pp.1–15. Reston, VA: Internet Society.

44. Cleary, Trigg LE. K*: an instance-based learner using an entropic distance measure. In: Proceedings of the 12th international conference on machine learning, 1995, p.108, https://www.cs.waikato.ac.nz/ml/publications/1995/Cleary95-KStar.pdf
