Abstract
Low-cost indoor air quality (IAQ) sensors offer occupants and researchers new opportunities for real-time monitoring in the built environment. However, their performance can vary substantially with environmental conditions. This study presents a comprehensive evaluation of carbon dioxide (CO2) and fine particulate matter (PM2.5) measurements from two consumer-grade low-cost sensors (the Airthings View Plus for CO2 only and the Air Gradient Pro for CO2 and PM2.5) through co-location tests with two reference instruments: the Graywolf DSII-8 for CO2 and the Lighthouse Handheld 3016 for PM2.5. Using time-series analysis, linear regression, Pearson correlation, Root-Mean Squared Error (RMSE), Bland-Altman analysis, and paired t-tests, we assess the precision and accuracy of these sensors. At a 5-minute sampling interval, the Air Gradient sensor had a higher coefficient of determination (R2), a stronger Pearson correlation, and a narrower range of limits of agreement (LoAs) than the Airthings, but a larger bias (i.e., mean difference) and RMSE, suggesting higher precision but lower accuracy. As a result, it may be better suited to tracking relative changes in CO2 than to measuring absolute concentrations without calibration. For PM2.5, the Air Gradient also had a relatively high R2 (0.79), a moderately strong Pearson correlation (ρ = 0.69, p < 0.05), a narrow range of LoAs (30.1 μg/m3), and a low RMSE (5.8 μg/m3). Averaging the 5-minute measurements over 30-minute intervals generally improved the accuracy and precision of both sensors. However, statistically significant differences from the reference instruments remained for both sensors. Overall, this study offers a multi-metric assessment of consumer-grade sensors and highlights the need for in-situ calibration prior to long-term deployment.
Introduction
Canadians spend approximately 90% of their time indoors (Government of Canada, 2021), making indoor environmental quality critical for human health and well-being. Real-time assessment of indoor environmental conditions and pollutant concentrations is therefore essential for understanding occupants’ exposure and informing strategies to improve indoor environmental quality.
Low-cost sensors are effective tools for monitoring indoor pollutant concentrations, with growing applications in both consumer use and experimental research. They are widely accessible and more affordable than research-grade instruments, making them suitable for real-time monitoring of indoor air quality (IAQ) and for large-scale deployment. In addition, they are portable and often easy to deploy, allowing occupants to keep track of their IAQ and take steps to reduce their exposure to indoor pollutants. In this paper, the performance of two low-cost sensors, the Airthings View Plus (“Airthings”) and the Air Gradient Pro (“Air Gradient”), is assessed through co-location testing with higher-accuracy research-grade instruments.
Background
The US EPA defines low-cost sensors as those costing between $100 and $2,500 USD (Williams et al., 2014). However, the upper end of this range is likely inaccessible to many consumers, so previous research has proposed a more practical upper limit of $500 USD (Zhang and Srinivasan, 2020). There is a wealth of literature on the performance of low-cost sensors. Figure 1 summarizes the indoor air and environmental quality parameters most commonly assessed, based on a review of 40 studies published between January 2018 and December 2023. These studies were identified through Google Scholar, Scopus, and the Toronto Metropolitan University library database using various combinations of keywords including “IAQ,” “indoor air quality,” “low-cost sensor,” “performance,” “laboratory,” “field,” and “evaluating.” The two most commonly evaluated parameters in the literature were fine particulate matter (PM2.5) and carbon dioxide (CO2) concentrations. Among the 40 studies reviewed:
34 included PM testing (Afroz et al., 2023; Baldelli, 2021; Baptista et al., 2022; Chen et al., 2020; Collingwood et al., 2019; Coulby et al., 2021; Curto et al., 2018; Demanega et al., 2021; Gillooly et al., 2019; Jayaratne et al., 2020; Kaliszewski et al., 2020; Kim et al., 2023; Konstantinou et al., 2022; Li et al., 2018; Liu et al., 2020; Manibusan and Mainelis, 2020; Moreno-Rangel et al., 2018; Palmisani et al., 2021; Pei et al., 2023; Reis et al., 2023; Schalm et al., 2022; Shen et al., 2021; Singer and Delp, 2018; Taştan, 2022; Tiele et al., 2018; Tryner et al., 2021; Wang et al., 2019, 2020; Zamora et al., 2020; Zhang and Srinivasan, 2020; Zheng et al., 2022; Zou et al., 2020) and
16 included CO2 testing (Afroz et al., 2023; Baldelli, 2021; Baptista et al., 2022; Coulby et al., 2021; Demanega et al., 2021; Gillooly et al., 2019; Konstantinou et al., 2022; Marinov et al., 2021; Moreno-Rangel et al., 2018; Palmisani et al., 2021; Reis et al., 2023; Taştan, 2022; Thomas et al., 2019; Tiele et al., 2018; Tryner et al., 2021; Zheng et al., 2022).

Most commonly evaluated parameters among 40 studies published between Jan 2018 and Dec 2023.
Temperature and relative humidity (RH) measurements were also commonly assessed (Afroz et al., 2023; Baptista et al., 2022; Coulby et al., 2021; Demanega et al., 2021; Kim et al., 2023; Konstantinou et al., 2022; Moreno-Rangel et al., 2018; Reis et al., 2023; Zheng et al., 2022).
Accurate measurements of PM2.5 and CO2 are important because of their implications for health and IAQ assessment. Exposure to elevated levels of PM2.5 has been linked to a range of adverse health effects, including increased respiratory symptoms, decreased lung function, and premature death in individuals with heart or lung conditions (United States Environmental Protection Agency, 2024). Meanwhile, CO2 serves as a useful proxy for indoor ventilation levels, and exposure to high concentrations may increase the risk of decreased cognitive function and neurophysiological symptoms (Government of Canada, 2021).
While many studies have evaluated the measurement performance of low-cost sensors for CO2 and PM2.5, there remains a lack of standardized methodologies for sensor performance evaluation (Karagulian et al., 2019). In this study, we apply six analysis methods to assess the performance of two low-cost sensors and compare the results across these methods, to support the development of a more comprehensive evaluation procedure. This is accomplished by analyzing the accuracy and precision of sensors using data collected through co-location testing with reference instruments in an occupied house.
Data collection
Description of sensors
Co-location tests were conducted to collect CO2 and PM2.5 measurements from the Airthings and Air Gradient alongside two reference instruments: the Graywolf DSII-8 (Graywolf) for CO2 and the Lighthouse Handheld 3016 (Lighthouse) for PM2.5. Both reference instruments were calibrated by their manufacturers and have been used as reference instruments in previous peer-reviewed research (Baptista et al., 2022; Demanega et al., 2021; Moreno-Rangel et al., 2018; Zhang et al., 2022). In addition to their demonstrated accuracy, both were selected for their portability and lower cost compared with some higher-grade alternatives.
The Graywolf DirectSense II CO2 sensor (sensor model SEN-SMTX-CO2) is a dual-wavelength non-dispersive infrared (NDIR) sensor with a measurement range of 0–10,000 ppm. According to the manufacturer, it is factory-calibrated over five reference points to optimize performance in the 350–2000 ppm range, achieving an accuracy of ±35 ppm within that range. Between 2000 and 7000 ppm, the reported accuracy is ±3% of reading. While a formal calibration certificate was not available, this sensor was used as received and is commonly employed in indoor air quality and occupational exposure studies. Its accuracy and reliability have also been supported in prior peer-reviewed research (e.g. Baptista et al., 2022; Demanega et al., 2021; Moreno-Rangel et al., 2018).
The Lighthouse instrument reports particle number concentrations in five size bins (0.3–0.5, 0.5–1.0, 1.0–2.5, 2.5–5.0, and 5.0–10.0 µm) and collects raw data at one-second intervals (Lighthouse Worldwide Solutions, n.d.). The instrument was calibrated on May 23, 2022, in Medford, Oregon, using certified monodisperse particles in accordance with ISO 21501-4:2018 (Calibration Certificate 44704220544022). The device demonstrated a counting efficiency of 48.0% for 0.303 µm particles and 102.1% for 0.45 µm particles.
The Airthings is a wireless, battery-operated low-cost sensor. It was selected because it is widely used in the North American market (Airthings, 2024). It uses a non-dispersive infrared (NDIR) sensor to measure CO2 from 400 to 5000 ppm at 5-minute intervals, with an accuracy of ±50 ppm + 3% of reading (Airthings, n.d.). For PM2.5, it uses a laser-scattering optical particle sensor (Cubic PM2105L) and converts the optical signal directly to PM2.5 mass concentrations in μg/m3 using internal algorithms calibrated against a Grimm reference instrument with cigarette smoke (Evotech Air Quality, 2023). The Airthings can measure PM2.5 from 0 to 500 μg/m3 at 10-minute intervals, with an accuracy of ±5 μg/m3 + 15% of reading below 150 μg/m3 and ±5 μg/m3 + 20% of reading above 150 μg/m3. Unfortunately, the Airthings failed to log PM2.5 data during our experiment because of a malfunction. As a result, only its CO2 measurement performance was evaluated in this study.
The Air Gradient sensor uses the Plantower PMS5003 sensor for PM2.5 measurements (Air Gradient, n.d.). The sensor measures concentrations from 0 to 500 μg/m3 with an accuracy of ±10 μg/m3 below 100 μg/m3, and ±10% of reading between 100 and 500 μg/m3. For CO2 measurement, the Air Gradient sensor is equipped with a SenseAir S8 NDIR sensor (Air Gradient, n.d.). While the manufacturer’s specifications indicate that the sensor may be either the SenseAir S8 or S88 model, the unit tested in this study was confirmed to use the S8. It measures CO2 concentrations from 400 to 2,000 ppm with an accuracy of ±40 ppm + 3% of reading. The Air Gradient sensor records both PM2.5 and CO2 data every 30 seconds. This sensor was selected primarily for its open-access software and hardware, which provides greater flexibility for research applications.
To compare to the PM2.5 mass concentrations measured by the two low-cost sensors, particle number concentrations measured by the Lighthouse were converted to mass concentrations assuming spherical particles and a constant particle density of 1.7 g/cm3. This density is commonly used by research-grade instruments such as the Grimm MiniWRAS (Durag Group, n.d.) and has been applied in previous studies (Li et al., 2016; Li and Siegel, 2020). The number concentrations in each bin were first converted to volume concentrations based on equations (1.1) and (1.2), then to mass concentrations using equations (1.3) and (1.4). PM2.5 concentrations were then estimated by summing the converted mass concentrations from the first three bins (0.3–0.5, 0.5–1, and 1–2.5 μm). Similarly, PM5 and PM10 mass concentrations were estimated by summing the first four and all five bins, respectively, for further comparisons in this study.
where r is the geometric mean diameter of each bin, calculated from its lower and upper limits, V is the total particle volume, N is the number concentration in each bin, ρ is the assumed particle density, M is the total particle mass in each bin, and C is the calculated mass concentration for each bin.
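The conversion described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the authors’ processing code: per-bin number concentrations are assumed to be in particles/cm3 (the units actually exported by the Lighthouse may differ and would need converting), each bin is represented by the geometric mean of its edges, and the function names are hypothetical. With ρ in g/cm3, diameter d in μm, and N in particles/cm3, the factors 10^-6 μg/μm3 per g/cm3 and 10^6 cm3/m3 cancel, so ρ·(π/6)·d³·N is directly in μg/m3.

```python
import math

# Lighthouse size bins (lower, upper) in micrometres, as listed in the text.
BIN_EDGES_UM = [(0.3, 0.5), (0.5, 1.0), (1.0, 2.5), (2.5, 5.0), (5.0, 10.0)]
DENSITY_G_PER_CM3 = 1.7  # assumed constant particle density


def bin_mass_concentrations(number_per_cm3, density=DENSITY_G_PER_CM3):
    """Convert per-bin number concentrations (#/cm3) to mass concentrations (ug/m3).

    Assumes spherical particles whose diameter is the geometric mean of the
    bin edges. With density in g/cm3, diameter in um and N in #/cm3, the unit
    factors cancel, so density * volume * N is already in ug/m3.
    """
    if len(number_per_cm3) != len(BIN_EDGES_UM):
        raise ValueError("expected one number concentration per size bin")
    masses = []
    for (lower, upper), n in zip(BIN_EDGES_UM, number_per_cm3):
        d = math.sqrt(lower * upper)         # geometric mean diameter (um)
        volume_um3 = math.pi / 6.0 * d ** 3  # volume of one sphere (um3)
        masses.append(density * volume_um3 * n)
    return masses


def pm_fractions(number_per_cm3):
    """Sum bin masses into PM2.5 (bins 1-3), PM5 (bins 1-4) and PM10 (all bins)."""
    m = bin_mass_concentrations(number_per_cm3)
    return {"PM2.5": sum(m[:3]), "PM5": sum(m[:4]), "PM10": sum(m)}


# Usage with made-up counts (#/cm3), for illustration only.
example = pm_fractions([20.0, 3.0, 0.5, 0.05, 0.01])
```

Summing the first three, first four, or all five bins mirrors how the text derives PM2.5, PM5, and PM10.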
Measurement setup
To evaluate the performance of the Airthings and Air Gradient, co-location tests were conducted with the two reference instruments on a desk in the home office of a residence in downtown Toronto, Canada. Figure 2 shows the setup of the sensors during the testing periods. The occupants’ daily activities, including use of the home office, were not disrupted by the deployment of these devices, and the device locations remained unchanged throughout each test. CO2 data were collected from May 23 to May 25, 2023, with both low-cost sensors and the Graywolf, and PM2.5 data were collected from July 17 to July 19, 2023, with only the Air Gradient and Lighthouse. During the test periods, the only CO2 source in the house was occupants’ exhalation. PM2.5 was primarily generated by typical indoor sources such as resuspension, cleaning, and cooking, with possible contributions from infiltration of outdoor PM2.5. All measurements recorded by the Air Gradient, Graywolf, and Lighthouse were converted to 5-minute averages so that they could be compared directly with the Airthings. In total, 396 and 516 five-minute measurements were collected for CO2 and PM2.5, respectively.
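A minimal sketch of the 5-minute averaging step, using only the Python standard library. It assumes clock-aligned intervals (e.g. 12:00–12:05) and a simple arithmetic mean of all samples falling in each interval; the text does not state how intervals were aligned with the Airthings timestamps, so `block_average` is a hypothetical helper rather than the authors’ procedure.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def block_average(samples, interval_minutes=5):
    """Average (timestamp, value) samples into clock-aligned fixed intervals.

    Each sample is assigned to the interval containing its timestamp
    (intervals are aligned to midnight), and the arithmetic mean of each
    interval is returned, keyed by the interval start time.
    """
    interval = timedelta(minutes=interval_minutes)
    groups = defaultdict(list)
    for ts, value in samples:
        midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        start = midnight + ((ts - midnight) // interval) * interval
        groups[start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(groups.items())}


# Usage: made-up 30-second readings over 10 minutes.
t0 = datetime(2023, 5, 23, 12, 0, 0)
readings = [(t0 + timedelta(seconds=30 * i), float(i)) for i in range(20)]
five_min = block_average(readings, interval_minutes=5)
thirty_min = block_average(list(five_min.items()), interval_minutes=30)
```

Passing the 5-minute averages back in with `interval_minutes=30` gives a 30-minute aggregation of the 5-minute values.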

Co-location test set-up showing: (1) the Lighthouse Handheld 3016, (2) the Graywolf DSII-8, (3) the Air Gradient Pro, and (4) the Airthings View Plus sensors.
Analysis methods
Six analysis methods were selected, primarily based on the findings of the literature review described above. Some of the most commonly used analysis methods were:
time-series visual analysis (Konstantinou et al., 2022; Palmisani et al., 2021; Reis et al., 2023; Schalm et al., 2022; Singer and Delp, 2018; Taştan, 2022; Thomas et al., 2019),
linear regression with the coefficient of determination (R2; Afroz et al., 2023; Baldelli, 2021; Chen et al., 2020; Collingwood et al., 2019; Curto et al., 2018; Demanega et al., 2021; Gillooly et al., 2019; Kim et al., 2023; Konstantinou et al., 2022; Liu et al., 2020; Manibusan and Mainelis, 2020; Marinov et al., 2021; Moreno-Rangel et al., 2018; Shen et al., 2021; Singer and Delp, 2018; Taştan, 2022; Tryner et al., 2021; Wang et al., 2019, 2020; Zamora et al., 2020; Zhang and Srinivasan, 2020; Zheng et al., 2022; Zou et al., 2020), and
Pearson correlation tests (Coulby et al., 2021; Demanega et al., 2021; He et al., 2020; Kaliszewski et al., 2020; Liu et al., 2020; Manibusan and Mainelis, 2020; Marinov et al., 2021; Reis et al., 2023; Singer and Delp, 2018; Taştan, 2022; Tryner et al., 2021).
Root-Mean Squared Error (RMSE) has become more common in recent studies (Baptista et al., 2022; Gillooly et al., 2019; Li et al., 2018; Reis et al., 2023; Tryner et al., 2021; Zamora et al., 2020; Zheng et al., 2022). Although Bland-Altman plots are less commonly used for low-cost sensor assessment (Baldelli, 2021; Coulby et al., 2021; Curto et al., 2018; Moreno-Rangel et al., 2018; Reis et al., 2023), they are an effective way to quantify the agreement between two quantitative measurements and were therefore included. Similarly, paired t-tests were included to examine whether the mean differences between the low-cost sensors and the reference instruments were significant.
Using the six analysis methods, the performance of the Airthings and Air Gradient sensors was evaluated by comparing their 5-minute concentration measurements with the reference instruments. In addition, the measurements for both CO2 and PM2.5 were averaged over 30-minute intervals to assess the impact of sampling frequency on sensor performance. Details of these analysis methods are provided below.
Linear regression
Linear regression was conducted to identify the relationship between measurements from the low-cost sensors (dependent variable) and the reference instruments (independent variable). The coefficient of determination (R2), calculated using equation (2), quantifies the proportion of the variance in the dependent variable that is predictable from the independent variable.
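$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (2)$$
where $y_i$ is the $i$-th low-cost sensor measurement, $\hat{y}_i$ is the value predicted by the fitted regression line from the corresponding reference measurement $x_i$, $\bar{y}$ is the mean of the low-cost sensor measurements, and $n$ is the number of paired measurements.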
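The regression and R2 calculation can be sketched as follows, with the low-cost sensor as the dependent variable and the reference as the independent variable, as described above. This is an illustrative ordinary-least-squares implementation, not the authors’ code; `linear_fit_r2` is a hypothetical name, and library routines such as `scipy.stats.linregress` give the same slope and intercept.

```python
def linear_fit_r2(reference, sensor):
    """Ordinary least squares fit of sensor = intercept + slope * reference.

    Returns (slope, intercept, r_squared), where
    r_squared = 1 - SS_res / SS_tot.
    """
    n = len(reference)
    if n != len(sensor) or n < 2:
        raise ValueError("need at least two paired measurements")
    x_mean = sum(reference) / n
    y_mean = sum(sensor) / n
    sxx = sum((x - x_mean) ** 2 for x in reference)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(reference, sensor))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    predicted = [intercept + slope * x for x in reference]
    ss_res = sum((y - p) ** 2 for y, p in zip(sensor, predicted))  # residual sum of squares
    ss_tot = sum((y - y_mean) ** 2 for y in sensor)                # total sum of squares
    return slope, intercept, 1.0 - ss_res / ss_tot


# slope 0.8, intercept 0.5, r2 0.64 (up to floating-point rounding)
slope, intercept, r2 = linear_fit_r2([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
```

For ordinary least squares with an intercept, R2 computed this way equals the square of the Pearson coefficient for the same pairs.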
Pearson correlation
Pearson correlation analysis was conducted to assess the strength and direction of the correlation between the measurements from the low-cost sensors and reference instruments. The Pearson correlation coefficient (ρ), calculated using equation (3), quantifies the degree of association between the two datasets.
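$$\rho = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (3)$$
where $x_i$ and $y_i$ are the $i$-th paired reference and low-cost sensor measurements, $\bar{x}$ and $\bar{y}$ are their respective means, and $n$ is the number of paired measurements.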
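A direct implementation of the Pearson coefficient, as a sketch with a hypothetical function name. It computes ρ only; the significance test reported in the results would additionally need the t distribution, for example via `scipy.stats.pearsonr`. On Python 3.10+, `statistics.correlation` returns the same coefficient.

```python
import math


def pearson(reference, sensor):
    """Pearson correlation coefficient between paired measurements."""
    n = len(reference)
    if n != len(sensor) or n < 2:
        raise ValueError("need at least two paired measurements")
    x_mean = sum(reference) / n
    y_mean = sum(sensor) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(reference, sensor))
    sxx = sum((x - x_mean) ** 2 for x in reference)
    syy = sum((y - y_mean) ** 2 for y in sensor)
    return sxy / math.sqrt(sxx * syy)


rho = pearson([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
```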
RMSE
RMSE was calculated to quantify the overall magnitude of the differences between the measurements from the low-cost sensors and the reference instruments, using equation (4).
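$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - x_i)^2} \qquad (4)$$
where $y_i$ and $x_i$ are the $i$-th paired low-cost sensor and reference measurements and $n$ is the number of paired measurements.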
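A short sketch (hypothetical helper names) of RMSE alongside the bias and spread of the differences. Because the mean squared difference equals the squared bias plus the population variance of the differences, RMSE mixes systematic and random error. Using the 5-minute CO2 values reported later, √(78.3² + 16.2²) ≈ 80 ppm for the Air Gradient and √(19.4² + 31.8²) ≈ 37 ppm for the Airthings, in line with the reported RMSEs; the match is approximate because the reported standard deviations are rounded and may use n − 1.

```python
import math


def rmse(reference, sensor):
    """Root-mean-square difference between paired readings."""
    if len(reference) != len(sensor) or not reference:
        raise ValueError("need equal-length, non-empty paired measurements")
    return math.sqrt(sum((y - x) ** 2 for x, y in zip(reference, sensor)) / len(reference))


def bias_and_spread(reference, sensor):
    """Mean difference (bias) and population standard deviation of the differences."""
    diffs = [y - x for x, y in zip(reference, sensor)]
    bias = sum(diffs) / len(diffs)
    spread = math.sqrt(sum((d - bias) ** 2 for d in diffs) / len(diffs))
    return bias, spread


# Made-up CO2 pairs (ppm) for a sensor with a large, stable negative bias.
ref = [400.0, 450.0, 500.0, 550.0]
sen = [330.0, 375.0, 420.0, 470.0]
b, s = bias_and_spread(ref, sen)
# rmse(ref, sen) ** 2 equals b ** 2 + s ** 2 up to rounding
```

This decomposition explains how a sensor can combine a narrow spread (high precision) with a high RMSE driven mostly by bias.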
Bland-Altman
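Bland-Altman analysis was conducted to illustrate the bias between the low-cost and reference instrument measurements. First, the mean of each pair of low-cost and reference measurements, the difference between them, and the standard deviation of the differences ($s_d$) were calculated. The bias is the mean of the differences, and the lower and upper limits of agreement (LoAs) lie 1.96 standard deviations below and above the bias:
$$m_i = \frac{y_i + x_i}{2}, \qquad d_i = y_i - x_i, \qquad \bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$$
$$s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(d_i - \bar{d}\right)^2}, \qquad \mathrm{LoA} = \bar{d} \pm 1.96\, s_d$$
where $x_i$ and $y_i$ are the $i$-th paired reference and low-cost sensor measurements, $m_i$ and $d_i$ are their mean and difference, $\bar{d}$ is the bias, and $n$ is the number of paired measurements.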
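The Bland-Altman quantities can be sketched as below, with differences taken as sensor minus reference so that a negative bias means underestimation, matching the sign of the reported biases. `bland_altman` is a hypothetical helper, and the 1.96 multiplier assumes approximately normal differences. As a check, the reported Airthings values (bias −19.4 ppm, standard deviation 31.8 ppm) give LoAs of about −81.7 and 42.9 ppm, matching the reported −81.7 and 42.8 ppm to within rounding.

```python
import math


def bland_altman(reference, sensor, z=1.96):
    """Per-pair means and differences plus bias and limits of agreement."""
    if len(reference) != len(sensor) or len(reference) < 2:
        raise ValueError("need at least two paired measurements")
    means = [(x + y) / 2.0 for x, y in zip(reference, sensor)]
    diffs = [y - x for x, y in zip(reference, sensor)]  # sensor minus reference
    n = len(diffs)
    bias = sum(diffs) / n
    sd = math.sqrt(sum((d - bias) ** 2 for d in diffs) / (n - 1))  # sample SD
    return {
        "means": means,          # x-axis of the Bland-Altman plot
        "diffs": diffs,          # y-axis of the Bland-Altman plot
        "bias": bias,
        "sd": sd,
        "loa_lower": bias - z * sd,
        "loa_upper": bias + z * sd,
    }


res = bland_altman([10.0, 20.0, 30.0, 40.0], [12.0, 21.0, 33.0, 41.0])
```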
Paired t-test
A paired t-test was conducted to determine whether the mean values measured by the low-cost sensors and reference instruments were significantly different. To check the normality assumption of the t-test, a Jarque-Bera test was applied to determine whether the differences between the low-cost sensor and reference measurements followed a normal distribution.
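The two tests can be sketched with the Python standard library alone. This is an illustrative sketch, not the authors’ code. The Jarque-Bera p-value comes from its asymptotic chi-square distribution with 2 degrees of freedom, whose survival function is exp(−x/2). The paired t-test p-value below uses a normal approximation that is only reasonable for large samples such as the several hundred pairs here. `scipy.stats.ttest_rel` and `scipy.stats.jarque_bera` provide full implementations.

```python
import math


def paired_t_test(reference, sensor):
    """Paired t statistic on sensor-minus-reference differences.

    The two-sided p-value uses a normal approximation to the t distribution,
    which is adequate only for large samples (hundreds of pairs).
    """
    diffs = [y - x for x, y in zip(reference, sensor)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired measurements")
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = mean / (sd / math.sqrt(n))
    p_approx = math.erfc(abs(t) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|t|))
    return t, p_approx


def jarque_bera(values):
    """Jarque-Bera normality statistic and its asymptotic p-value.

    JB = n/6 * (S**2 + (K - 3)**2 / 4), with S and K the sample skewness and
    kurtosis; under normality JB ~ chi-square(2), whose survival function
    is exp(-JB / 2).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    return jb, math.exp(-jb / 2.0)
```

In use, `jarque_bera` would be applied to the sensor-minus-reference differences first, and `paired_t_test` interpreted only if normality is not rejected.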
Two extreme CO2 outliers in the Graywolf measurements were identified through visual inspection of the time-series comparison plots: they were the only points exceeding 850 ppm, whereas all other measurements were below 700 ppm. They were likely caused by accidental contact between the occupants and the Graywolf and were therefore excluded from the analysis.
Results
The analysis results for the CO2 and PM2.5 measurements are presented in this section, starting with CO2 at 5- and 30-minute intervals, followed by PM2.5. The results of the time-series analysis, linear regression, and Bland-Altman analysis are presented in figures, and the results of the Pearson correlation, RMSE, and paired t-tests are summarized in tables.
CO2 data analysis
Figures 3(a) and (b) show the time-series comparison of CO2 concentrations measured by the Graywolf, Airthings, and Air Gradient at 5- and 30-minute intervals, respectively. Both figures show that the Airthings closely followed the Graywolf from the start of the test up to the morning of May 24, with a modest mean underestimation of 11.6 ppm (std. dev. = 15.5 ppm). Around the peak concentration of 603.1 ppm recorded by the Graywolf on the afternoon of May 24, the Airthings overestimated CO2 for approximately 80 minutes (n = 16), by a mean of 60.5 ± 22.5 ppm and a maximum of 105.4 ppm. Interestingly, from the peak until the end of the test, the Airthings instead underestimated CO2 by a mean of 34.0 ± 31.8 ppm, with greater fluctuations. In contrast, the Air Gradient consistently underreported CO2 concentrations by a mean of 78.3 ppm throughout the test, though with less fluctuation (std. dev. = 16.2 ppm). Despite these discrepancies, both sensors captured the same overall trend observed by the Graywolf and detected all elevated CO2 concentrations caused by occupant activities in the vicinity. The aggregated 30-minute concentrations (Figure 3(b)) show smoother CO2 trends for both sensors, but they also make the fluctuations in the Airthings measurements toward the end of the testing period more apparent.

Time-series comparison of CO2 concentrations at: (a) 5-minute and (b) 30-minute intervals.
Linear regression plots comparing the Airthings and Air Gradient data with the Graywolf measurements at 5- and 30-minute intervals are shown in Figures 4(a) and (b). At 5-minute intervals, the regression model for the Airthings had an R2 of 0.68, indicating that 68% of the variance in its measurements could be explained by the reference measurements. The Air Gradient had a higher R2 of 0.87, which is also visually evident from the closer alignment of its data points with the line of best fit in Figure 4(a). At 30-minute intervals (Figure 4(b)), both sensors had higher R2 values. This is expected, as aggregation smooths out short-term fluctuations and reduces discrepancies that may be caused by differences in sensor response time. However, at both intervals, most data points from both sensors fall below the 1:1 line, indicating a consistent underestimation of CO2 concentrations.

Linear regression for Airthings (blue) and Air Gradient (yellow) CO2 measurements against the Graywolf at: (a) 5-minute intervals and (b) 30-minute intervals.
Figure 5 shows the Bland-Altman analysis comparing the CO2 concentrations measured by the Airthings and Air Gradient against the Graywolf at 5- and 30-minute intervals. Figure 5(a) shows that the Airthings disagreed more with the Graywolf when the mean CO2 concentration was below 550 ppm. As CO2 concentrations increased, the differences between the Airthings and Graywolf became smaller, with data points clustering closer to the bias line (−19.4 ppm). However, a notable number of points fall outside the lower LoA (−81.7 ppm) at lower concentrations and outside the upper LoA (42.8 ppm) at higher concentrations, indicating that the Airthings underestimates CO2 at lower concentrations and overestimates it at higher concentrations. Although the small bias shows that the Airthings was relatively accurate on average, the wide range of its LoAs (124.5 ppm) shows that it was not precise.

Bland-Altman plots for CO2 concentration by (a) Airthings and (b) Air Gradient compared to the Graywolf at 5-minute intervals and by (c) Airthings and (d) Air Gradient at 30-minute intervals.
In contrast, Figure 5(b) shows that the Air Gradient had a larger bias (−78.3 ppm), indicating a greater underestimation of CO2 concentrations, consistent with the time-series comparison. However, with a standard deviation of the differences of 16.2 ppm, the range of its LoAs (63.6 ppm) is narrower than that of the Airthings, suggesting that the Air Gradient provides more precise measurements. As with the Airthings, more points fall outside the upper LoA at higher concentrations (>550 ppm). Interestingly, because the Air Gradient consistently underestimated concentrations, this upward deviation from the bias line made its measurements less precise but ultimately more accurate at higher concentrations. At 30-minute intervals (Figures 5(c) and (d)), the standard deviations of the differences decreased moderately for both the Airthings (from 31.8 to 24.5 ppm) and the Air Gradient (from 16.2 to 12.5 ppm), with fewer data points falling outside the LoAs. This suggests that data aggregation improves the precision of both sensors, although the general trends remain unchanged.
Table 1 summarizes the Pearson correlation, RMSE, and paired t-test results for the Airthings and Air Gradient at 5- and 30-minute intervals. The Pearson correlation tests show that both low-cost sensors have positive and statistically significant correlations with the Graywolf. The Air Gradient has a stronger correlation (ρ = 0.93), consistent with the linear regression results, and aggregating the measurements over 30 minutes further strengthens the correlation. In contrast, the Airthings had 54% and 58% lower RMSE than the Air Gradient at 5- and 30-minute intervals, respectively, indicating that it provided more accurate measurements during the test period. However, the paired t-tests show that the measurements from both sensors, at both sampling intervals, are statistically significantly different from the Graywolf CO2 measurements (p < 0.05).
Summary of Pearson correlation, RMSE, and paired t-test results for both sensors at 5- and 30-minute intervals.
PM2.5 data analysis
Figures 6(a) and (b) present the time-series comparison of PM2.5 concentrations between the Air Gradient and Lighthouse at 5- and 30-minute intervals. Throughout the sampling period, the Air Gradient consistently overreported PM2.5 concentrations (mean difference = 3.0 ± 5.0 μg/m3). In particular, it registered periods of elevated concentrations that were not detected by the Lighthouse. To quantify this discrepancy, we defined an elevated concentration period as any period in which PM2.5 concentrations remained above 10 μg/m3 for at least 15 minutes. The Lighthouse identified only 2.9% (75 minutes) of the test period as elevated, whereas the Air Gradient identified 18% (465 minutes). Even during elevated periods identified by both devices, the Air Gradient overreported PM2.5, with a mean overestimation of 11.3 μg/m3 and a maximum of 48.8 μg/m3. This is unexpected, as higher-grade reference instruments are generally better at detecting elevated concentrations than low-cost sensors (Singer and Delp, 2018). The 30-minute comparison in Figure 6(b) shows that, after aggregation, the Air Gradient’s response aligns more closely with the Lighthouse, and the maximum difference decreased from 48.8 to 15.6 μg/m3. Thus, the alignment of the Air Gradient with the Lighthouse can be improved at the expense of data resolution. However, closer alignment does not necessarily indicate better sensor performance, as discussed with the linear regression results below.
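The elevated-period definition above can be made concrete with a short sketch, assuming evenly spaced 5-minute samples so that “at least 15 minutes” means at least three consecutive samples strictly above 10 μg/m3. The helper names are hypothetical. With 516 five-minute samples, 75 and 465 elevated minutes correspond to 15 and 93 samples, or about 2.9% and 18% of the test period.

```python
import math


def elevated_periods(values, threshold=10.0, min_minutes=15, step_minutes=5):
    """Find runs of consecutive samples strictly above `threshold`.

    Returns (start, end) index pairs (end exclusive) for runs lasting at
    least `min_minutes`, assuming evenly spaced samples `step_minutes` apart.
    """
    min_samples = math.ceil(min_minutes / step_minutes)
    runs = []
    start = None
    for i, value in enumerate(values):
        if value > threshold:
            if start is None:
                start = i  # a new run begins
        else:
            if start is not None and i - start >= min_samples:
                runs.append((start, i))
            start = None
    # close a run that continues to the end of the record
    if start is not None and len(values) - start >= min_samples:
        runs.append((start, len(values)))
    return runs


def elevated_fraction(values, **kwargs):
    """Fraction of samples that fall inside qualifying elevated periods."""
    if not values:
        return 0.0
    runs = elevated_periods(values, **kwargs)
    return sum(end - start for start, end in runs) / len(values)


# Hypothetical 5-minute PM2.5 series (ug/m3).
series = [5.0, 12.0, 13.0, 14.0, 5.0, 11.0, 12.0, 5.0]
print(elevated_periods(series))  # -> [(1, 4)]
```

The second excursion above 10 μg/m3 lasts only two samples (10 minutes), so it does not qualify.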

Time-series comparison of PM2.5 concentrations at: (a) 5-minute and (b) 30-minute intervals.
A likely cause of the discrepancy shown in Figure 6 is that the counting algorithms of the two devices respond differently to particles of different sizes. Although both are optical instruments, the Lighthouse sorts particles into five size bins and reports the number concentration in each bin, whereas the Air Gradient converts light-scattering signals directly to mass concentrations using internal calibration algorithms.
A closer examination of the Lighthouse measurements shows that, on average, particles in the 0.3–0.5, 0.5–1.0, and 1.0–2.5 μm bins contributed 28.2%, 24.4%, and 47.4% of the PM2.5 mass concentration, respectively. During the elevated concentration periods, these contributions shifted to 21.8%, 25.6%, and 52.7%, respectively. This suggests that changes in the particle size distribution may have led the Air Gradient to overestimate the size of particles larger than 0.5 μm from their light-scattering signals, inflating its mass concentration estimates. Furthermore, the Air Gradient may have included particles larger than 2.5 μm in its PM2.5 mass concentrations. As shown in Figure 7, during some elevated concentration periods the Air Gradient measurements were better aligned with the Lighthouse PM5 or even PM10 concentrations.

Time-series comparison of Air Gradient PM2.5 concentrations with Lighthouse PM2.5, PM5, and PM10 concentrations converted based on number concentrations.
The Lighthouse’s response to some emission sources could also have contributed. For instance, Singer and Delp (2018) found that two types of research-grade instruments (Thermo pDR-1500 and MetOne BT-645) responded more weakly than some low-cost sensors to large-particle sources such as Arizona Test Dust 2 (0–3 μm) and dust from a workshop dust mop. Similarly, the Lighthouse may not have correctly sized some of the larger particles generated by sweeping, mopping, and vacuuming, or resuspended by occupant movement, into the corresponding bins, resulting in an underestimation of PM2.5. The relatively small fraction of elevated PM2.5 periods recorded by the Lighthouse, compared with previous studies on the impact of indoor emissions in residential settings, also suggests that some emission events may not have been captured (Chan et al., 2018; Zhang et al., 2020). However, since occupants’ activities were not recorded, this hypothesis could not be verified.
Figures 8(a) and (b) show the linear regression of the Air Gradient PM2.5 measurements against the Lighthouse measurements at 5- and 30-minute intervals, respectively. Most data points are clustered near the origin and above the 1:1 line, further confirming that the Air Gradient overreported PM2.5 concentrations. Interestingly, while the R2 values for both the 5- and 30-minute data were relatively high, aggregation did not improve them. This suggests that although aggregation reduced the maximum difference between the devices and improved their alignment, as discussed earlier, the variance in the 30-minute Air Gradient data was not better explained by the Lighthouse measurements.

Linear regression for Air Gradient PM2.5 measurements against the Lighthouse at (a) 5-minute intervals and (b) 30-minute intervals. Note that panel (b) has a smaller scale because of the aggregated data.
The Bland-Altman comparisons between the Air Gradient and Lighthouse in Figures 9(a) and (b) show that the differences between the two devices were not constant. Instead, the differences increased with the mean PM2.5 concentration in a near-linear relationship. Most data points are also located above the bias lines, again indicating a consistent overestimation by the Air Gradient. Aggregating the 5-minute measurements to 30-minute intervals slightly increased the bias from 3.0 to 4.8 μg/m3, while slightly narrowing the LoAs (lower and upper LoAs of −12.1 and 18.0 μg/m3 at 5-minute intervals versus −7.3 and 16.8 μg/m3 at 30-minute intervals).

Bland-Altman plots for PM2.5 concentration by Air Gradient compared to the Lighthouse at: (a) 5-minute intervals and (b) at 30-minute intervals. For better visual clarity, the graphs are cropped at a mean PM2.5 concentration of 40 μg/m3, resulting in 2 data points excluded from (a) and 1 data point excluded from (b).
Lastly, Table 2 summarizes the Pearson correlation, RMSE, and paired t-test results at 5- and 30-minute intervals. The Pearson correlation tests show that the Air Gradient has a positive and statistically significant correlation with the Lighthouse, and the correlation strengthens with data aggregation. Similarly, the RMSE of the Air Gradient decreased with aggregation. However, the paired t-test indicates that the mean difference between the devices remains statistically significant (p < 0.05) regardless of aggregation.
Summary of Pearson correlation, RMSE, and paired t-test results for Air Gradient at 5- and 30-minute intervals.
Discussion
In this study, co-location tests were conducted to assess the CO2 measurements of two low-cost sensors, the Airthings and Air Gradient, against the Graywolf, and the PM2.5 measurements of the Air Gradient against the Lighthouse. Employing multiple analysis methods allowed a more comprehensive evaluation of sensor accuracy and precision than any single approach could provide. The results highlight the strengths and limitations of these sensors and help identify the most suitable scenarios for their deployment.
The CO2 results show that the Airthings provided more accurate CO2 measurements, with better alignment in the time-series comparison and data points clustering closer to the 1:1 line in the linear regression. At 5-minute intervals, it also had a smaller bias (−19.4 ppm) in the Bland-Altman analysis and a smaller RMSE (37 ppm), both within the manufacturer’s specified accuracy of ±50 ppm + 3% of reading. In contrast, the Air Gradient was less accurate, with a bias of −78 ppm and an RMSE of 80 ppm at 5-minute intervals, exceeding the manufacturer’s specification (±40 ppm + 3% of reading). However, the Air Gradient was more precise, with a narrower range of LoAs in the Bland-Altman analysis. It also had a higher R2 (0.87) than the Airthings (R2 = 0.68).
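As a rough worked check of these comparisons, the CO2 tolerances given in the sensor descriptions (±50 ppm + 3% of reading for the Airthings and ±40 ppm + 3% of reading for the Air Gradient) can be evaluated at an illustrative reading of 500 ppm, an assumed value within the observed 450–650 ppm range:

$$\text{Airthings: } 50 + 0.03 \times 500 = 65~\text{ppm} > \mathrm{RMSE} \approx 37~\text{ppm}$$
$$\text{Air Gradient: } 40 + 0.03 \times 500 = 55~\text{ppm} < \mathrm{RMSE} \approx 80~\text{ppm}$$

At 650 ppm the tolerances rise only to 69.5 and 59.5 ppm, so the comparison holds across the observed range.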
These findings show that while the Airthings may better represent actual indoor CO2 concentrations, the Air Gradient is more reliable for tracking relative changes rather than absolute values. This makes the Air Gradient more suitable for applications such as estimating indoor–outdoor air exchange rates from increases and decreases in CO2 concentrations (Di Gilio et al., 2021; Du et al., 2024), where changes in CO2 levels matter more than absolute values. In addition, since the bias of the Air Gradient remained generally consistent across CO2 levels, a calibration equation could readily be applied to improve its accuracy. Averaging the 5-minute concentrations to 30-minute intervals improved the accuracy and precision of both sensors at the cost of data resolution. However, this approach limits their applicability in transient-state analyses, such as air exchange rate estimation, which require high data resolution (Du et al., 2024). For long-term CO2 exposure assessment and studies on the risk of airborne infectious disease transmission (e.g. Rudnick and Milton, 2003), however, this trade-off may be acceptable.
Due to the malfunction of the Airthings, the PM2.5 performance assessment was conducted only for the Air Gradient sensor. Overall, the Air Gradient consistently overestimated PM2.5 concentrations, by an average of 3.0 μg/m3, with an RMSE of 5.8 μg/m3. While both values fall within the manufacturer’s specifications (±10 μg/m3 below 100 μg/m3, and ±10% of reading between 100 and 500 μg/m3), the overestimation was more pronounced in the presence of emission sources, with a maximum difference of 48.8 μg/m3. This was likely because particles larger than 2.5 μm were misclassified as PM2.5, as was evident when comparing the Air Gradient PM2.5 data with the PM5 and PM10 measurements by the Lighthouse. Despite a relatively high coefficient of determination (R2 = 0.77) and Pearson correlation (ρ = 0.69) compared to previous studies on other types of low-cost sensors (e.g. Coulby et al., 2021; Singer and Delp, 2018), the increasing discrepancy between the Air Gradient and the Lighthouse at higher PM2.5 concentrations indicates that the Air Gradient may not be suitable for short-term tracking of changes in particle concentrations or for long-term exposure monitoring without careful calibration.
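The two-band manufacturer specification quoted above can be checked reading by reading. The following is a minimal sketch under two stated assumptions: the band (and the 10% tolerance) is selected and scaled using the reference concentration, and readings above 500 μg/m3 are treated as outside the specification’s range. The function names are hypothetical.

```python
def pm25_within_spec(sensor, reference):
    """True/False if the sensor reading is within the stated tolerance.

    Assumption: the tolerance is +/-10 ug/m3 when the reference is below
    100 ug/m3 and +/-10% of the reference between 100 and 500 ug/m3.
    Returns None when the reference lies outside the 0-500 ug/m3 range.
    """
    if reference < 0 or reference > 500:
        return None
    tolerance = 10.0 if reference < 100 else 0.10 * reference
    return abs(sensor - reference) <= tolerance

def fraction_within_spec(sensor_series, reference_series):
    """Share of paired readings within tolerance, ignoring unchecked pairs."""
    if len(sensor_series) != len(reference_series):
        raise ValueError("series must have equal length")
    flags = [pm25_within_spec(s, r) for s, r in zip(sensor_series, reference_series)]
    checked = [f for f in flags if f is not None]
    if not checked:
        return None
    return sum(checked) / len(checked)

if __name__ == "__main__":
    # Hypothetical readings (ug/m3): sensor vs. reference.
    print(pm25_within_spec(12.0, 5.0))
    print(pm25_within_spec(60.0, 11.2))
    print(fraction_within_spec([12.0, 60.0], [5.0, 11.2]))
```

Under these assumptions, a 48.8 μg/m3 difference at a low reference concentration (e.g. the hypothetical pair 60.0 vs. 11.2 μg/m3) falls outside the band even though the mean bias does not.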
Overall, the findings from the CO2 and PM2.5 co-location tests show that while low-cost sensors can provide useful information on pollutant concentrations, their accuracy and precision should be carefully evaluated in the specific environments in which they will be deployed. Notably, all paired t-tests showed that the mean concentrations measured by both low-cost sensors differed significantly from those of the reference instruments, regardless of the sampling interval. For low-cost PM2.5 sensors, it is especially important to challenge the sensors with the specific types of particles they will be exposed to, as their responses vary by particle type, a point frequently noted in previous studies (Jayaratne et al., 2020; Qin et al., 2024).
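For reference, the paired t-test works on the per-timestamp differences: t = mean(d) / (SD(d) / √n) with n − 1 degrees of freedom. The stdlib-only sketch below computes that statistic. Its p-value is a large-sample normal approximation, which is an assumption on our part and not necessarily how the study computed p-values; an exact t-distribution p-value is available from SciPy’s scipy.stats.ttest_rel. The function name paired_t is hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def paired_t(sensor, reference):
    """Paired t statistic, degrees of freedom, and an approximate p-value.

    The two-sided p-value treats t as standard normal, which is only a
    reasonable approximation when the number of pairs is large.
    """
    if len(sensor) != len(reference) or len(sensor) < 2:
        raise ValueError("need at least two paired readings of equal length")
    diffs = [s - r for s, r in zip(sensor, reference)]
    n = len(diffs)
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences are constant; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, n - 1, p_approx

if __name__ == "__main__":
    # Hypothetical paired readings for illustration only.
    print(paired_t([10, 12, 14, 16], [9, 10, 12, 13]))
```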
There were a few limitations in this study that should be acknowledged. First, the testing durations for both CO2 and PM2.5 measurements were relatively short. A longer measurement duration would strengthen the analysis and provide a more comprehensive assessment of both sensors. It would also allow comparison of mean values over longer periods (e.g. daily averages) that are commonly used for occupant exposure assessment. A related limitation is the generally narrow range of CO2 concentrations (450 to 650 ppm) and the generally low PM2.5 concentrations (exceeding 5 μg/m3 only 20% of the time, based on Lighthouse measurements), which prevented testing of these sensors under more extreme conditions. Moving the measurement location to a space with higher occupancy density and more occupant activities (e.g. the living room or kitchen) would likely address both issues. Furthermore, we did not track other factors that might influence sensor performance. For instance, environmental parameters such as temperature and RH could affect the accuracy of NDIR sensors for CO2 measurements (Vafaei and Amini, 2021). Similarly, a record of occupant activities could provide more information on particle sources, and hence on their size distribution and composition, allowing a better understanding of their influence on PM2.5 measurements. Lastly, the study design could be improved by deploying multiple units of each sensor during the tests. This would greatly improve the reliability of data collection and mitigate data loss due to malfunctions. It would also enable us to examine variation between units of the same sensor and better identify their most suitable deployment scenarios. Based on our experience with the Airthings sensor, we also recommend regular checks during tests to minimize data loss from connectivity issues.
Conclusion
Continuous real-time monitoring of indoor air pollutants, such as CO2 and PM2.5, is crucial for assessing occupants’ exposure and for informing HVAC systems, building managers, and occupants about strategies to improve air quality. In this study, we evaluated the performance of the Airthings and Air Gradient for CO2 measurements, and of the Air Gradient for PM2.5 measurements, against reference instruments. The comparison results show that the Airthings provided more accurate CO2 measurements, making it more suitable for exposure assessment or threshold-based control strategies, such as Demand Controlled Ventilation (ASHRAE, 2022). In contrast, the Air Gradient was more precise but less accurate, making it more suitable for tracking relative changes in CO2 concentrations for air exchange rate estimation. For PM2.5 measurements, although the Air Gradient sensor had a small mean difference (i.e. bias), its overestimation increased with PM2.5 concentration, making it less than ideal for concentration monitoring without careful calibration. The paired t-tests from all analyses confirmed significant differences between the mean measurements by the low-cost sensors and the reference instruments, further highlighting the importance of co-location tests and calibration.
Together, the six analysis methods employed in this study provided a well-rounded assessment of sensor accuracy and precision. In particular, the Bland-Altman analysis, which has been less commonly used in previous studies on low-cost sensors, proved valuable for determining the trend of bias across pollutant concentrations and for quantifying precision through the range of LoAs. We recommend that future studies, low-cost sensor evaluation agencies and guidelines (e.g. Health Canada, AQ-SPEC, and US EPA), and potentially low-cost sensor manufacturers adopt this set of testing metrics for sensor evaluation (South Coast AQMD, n.d.; United States Environmental Protection Agency, 2025).
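One common way to quantify the trend of bias in a Bland-Altman analysis is to regress the differences (sensor minus reference) on the pairwise means. A slope near zero indicates a roughly constant bias, as reported here for the Air Gradient’s CO2 readings. A positive slope indicates a difference that grows with concentration, as observed for PM2.5. The sketch below assumes an ordinary least-squares fit; the function name bias_trend and the example readings are hypothetical.

```python
def bias_trend(sensor, reference):
    """OLS intercept and slope of (sensor - reference) on the pairwise mean."""
    n = len(sensor)
    if n != len(reference) or n < 2:
        raise ValueError("need at least two paired readings of equal length")
    means = [(s + r) / 2 for s, r in zip(sensor, reference)]
    diffs = [s - r for s, r in zip(sensor, reference)]
    mean_m = sum(means) / n
    mean_d = sum(diffs) / n
    smm = sum((m - mean_m) ** 2 for m in means)
    if smm == 0:
        raise ValueError("pairwise means are constant; slope is undefined")
    smd = sum((m - mean_m) * (d - mean_d) for m, d in zip(means, diffs))
    slope = smd / smm
    intercept = mean_d - slope * mean_m
    return intercept, slope

if __name__ == "__main__":
    reference = [5, 10, 20, 40]  # hypothetical PM2.5 readings (ug/m3)
    proportional = [1.2 * r for r in reference]  # overestimation grows
    constant = [r + 3 for r in reference]  # fixed offset
    print(bias_trend(proportional, reference))
    print(bias_trend(constant, reference))
```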
We also recommend that future studies conduct longer co-location tests over a wider range of pollutant concentrations, record parameters beyond pollutant concentrations, such as temperature, RH, and occupant activities, and deploy multiple units of the same sensor type to better evaluate sensor performance. In addition, usability metrics such as reliability, cloud connectivity, integration with building systems, and visual display clarity should be evaluated, as they play an important role in the successful adoption of these sensors at a larger scale.
Footnotes
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC Discovery Grant RGPIN 2022-03821).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
