Combining travel survey and mobile network operator data to produce disaggregated trip statistics

Abstract

The Travel Survey has traditionally been the source for yearly official trip statistics of the resident population. In addition to population totals, disaggregated trip statistics are of great interest to many users, such as monthly trips or local trips inside different regions. However, due to the limited sample size, direct survey estimates would be too unstable to be acceptable. Meanwhile, trip counts can be produced using Mobile Network Operator (MNO) data that are derived from the contact signals of in-scope devices. Despite the differences in their coverage, concept and measurement, Travel Survey and MNO data can potentially be combined to yield fit-for-purpose disaggregated trip statistics. We present the relevant methods and their application in Sweden. The disaggregated trip statistics are now publicly available at Transport Analysis’ web portal.

Keywords

disaggregated statistics; empirical best linear unbiased predictor; transfer learning; bootstrap; mobile network operator data

1. Introduction

Transport Analysis is responsible for the annual Swedish National Travel Survey (Resvaneundersökningen, RVU Sweden), referred to as the Travel Survey in this paper, which is traditionally the basis for producing official trip statistics of individuals registered in Sweden aged 6–84. Each trip observed in the survey is characterised by time, origin and destination, as well as purpose, mode of transport and distance.

In addition to the population totals, many disaggregated trip statistics are of great interest to the users, such as monthly trips or local trips inside different regions. However, due to the limited sample size, direct survey estimates would be too unstable to meet acceptable quality standards.

Meanwhile, trip counts can be produced using Mobile Network Operator (MNO) data derived from the contact signals of the in-scope devices. As we shall explain, although trip counts from the Travel Survey and MNOs unavoidably differ in terms of coverage, concept and measurement, it is potentially possible to combine aggregated and anonymous data from the two sources to produce fit-for-purpose disaggregated trip statistics.

In this paper, we present the relevant methods and their application based on Travel Survey estimates and the corresponding trip counts from an MNO (data from Telia Crowd Insights). The resulting trip statistics by calendar month or administrative region are now publicly available at Transport Analysis’ web portal.

1.1. More on trip data

Travel Survey

The Travel Survey covers all the individuals aged 6-84 with registered resident status in Sweden.

The associated daily travels involve two hierarchical concepts: journey and trip. A journey is delimited by home, secondary residence, another overnight location, workplace, or school. A trip is defined by the occurrence of a new activity. An activity can, for example, be grocery shopping or dropping off children at football training. A change of transport mode is not considered an activity; and there is no restriction for the length of an activity, or the distance between activity locations, as long as it is within the same day.

For example, a journey from office to home may consist of three trips, from office to supermarket (grocery shopping), from supermarket to kindergarten (picking up a child), and from kindergarten to home (going home); and it does not matter if the individual (and the child) stopped to get into a car driven by a neighbour on the way from kindergarten to home.

Trips are collected from a random sample of individuals aged 6–84 in the Population Register. Each selected individual is assigned a specific reporting day distributed across a calendar year. The sample size has varied over the years; see Table 1 for the years 2019-2023.

Table 1.
Yearly sample size, no. respondents and response rate.

Survey year	Sample size	No. respondents	Response rate
2019	42 168	12 386	29.4%
2020	39 415	12 068	30.6%
2021	12 597	3 496	27.8%
2022	12 597	3 331	26.4%
2023	12 246	3 251	26.5%

In 2019 and 2020, the sample consisted of 300 individuals per county (see Appendix A, Table 6 and Figure 6, for the list of Swedish counties), plus approximately $1.6 \times 10^{4}$ individuals allocated to the counties in proportion to their populations aged 6–84. From 2021 onwards, the equal-allocation sample size was 200 per county, while the proportionally allocated sample size was about $8 \times 10^{3}$ in 2021–2023. Finally, extra samples were administered in Stockholm and Södermanland in 2019, and in Stockholm in 2020. Sampling is stratified by age, sex, and county, except for the extra samples in Stockholm, which are further stratified by municipality.

It is thus clear that direct estimates of monthly trip totals or within-county yearly trip totals based only on the Travel Survey would be too unstable. Each sample respondent reports for only one day in a year, so that even if the Travel Survey had selected the whole population as its sample, the sampling fraction would still have been only $\approx$ 1/30 of all the trips in a month, or 1/365 of those in a year.

MNO trip counts

Transport Analysis had access to data from Telia, Sweden for the years 2020-2023, which included service providers Telia, Halebop and Fello. The trip counts should cover all the mobile service users who are registered residents in Sweden. It is assumed that all the mobile phone users are above age 6, when expansion weights are constructed from the in-scope devices to the resident population in Sweden, although we do not know the details of this device-to-population weighting.

Contacts between a mobile phone and the network generate signalling data, whether the phone is actively being used or in a stand-by mode. Although a device is often connected to the nearest radio cell available to save energy, this is by no means always the case. In 2021, there were 51241 radio masts and 150736 unique radio cells to which mobile phones could connect in Sweden.

Algorithms are used to delineate trips from signalling data. For the Telia data, an activity is registered at a location if a mobile phone remains static at this location for at least 10 minutes. An activity is classified as “home” if the mobile phone has stayed there the longest in one continuous period before 9:00 a.m. It is classified as “work” if the mobile phone has stayed there the longest in one continuous period between 9:00 a.m. and 4:00 p.m., and if it has been located there for at least one hour and at least 500 meters from home. The third, residual category is “other”.

A trip is then a movement between activities, provided the distance is at least 100 meters. A trip is classified as completed depending on how long the mobile phone remains at a location in relation to the distance travelled. Any stop from 10 to 70 minutes may constitute a completed trip, depending on the distance travelled.

For example, the threshold for a completed trip during a journey from Skåne to Stockholm is a stop of 70 minutes at a location. If the trip goes from Malmö to Stockholm and the traveller stops in Jönköping for 90 minutes, this results in two trips by the algorithm: one from Malmö to Jönköping and one from Jönköping to Stockholm. If the stop in Jönköping is instead 60 minutes, it is counted as a single trip: from Malmö to Stockholm.

At the other end, if a person walks or travels to a grocery store and shops for less than 10 minutes, it is counted as one trip, whereas it becomes two trips if the shopping takes more than 10 minutes. The same applies to other situations, such as dropping off or picking up children at school.
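
The stop rule in the two examples above can be sketched in a few lines of Python. This is only an illustration: the function name `count_trips` and the single fixed `threshold_min` are assumptions made here, whereas the actual MNO algorithm varies the threshold between 10 and 70 minutes with the distance travelled in a way that is not documented in this paper.

```python
def count_trips(stop_minutes, threshold_min):
    """Count trips along one chain of movements within a calendar day.

    stop_minutes: dwell times (in minutes) at the intermediate stops
        between the first origin and the final destination.
    threshold_min: dwell time from which a stop completes a trip
        (hypothetical fixed value, for illustration only).
    """
    # Each sufficiently long stop splits the movement into one more trip.
    completed = sum(1 for minutes in stop_minutes if minutes >= threshold_min)
    return completed + 1


# Malmö to Stockholm via Jönköping, with the 70-minute threshold of the text.
print(count_trips([90], threshold_min=70))  # stop of 90 minutes
print(count_trips([60], threshold_min=70))  # stop of 60 minutes
```

With these inputs the first call counts two trips and the second counts one, in line with the Malmö to Stockholm example.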

It should be noted that the trips are limited to each calendar day, as the device identifiers are scrambled every 24 hours such that a device cannot be tracked from one day to the next by Telia. A trip that lasts from one calendar day to the next will be recorded as two trips.

The Telia trip data were only accessible to Transport Analysis as anonymised and aggregated counts, either visually on computer screens or as downloadable CSV files. Three geographical breakdowns are possible: county (21 of them), municipality (290), and a level of grouped DeSO areas (1624).

The underlying data are first calculated at the level of grid cells, 22626 of them, which may be finer or coarser than grouped DeSO areas depending on the location. A grid cell is at minimum $500 \times 500$ meters. Figure 1 shows the number of grid cells per grid size, along with a map of them. A grid cell may overlap with a neighbouring country, such that the total area covered by the grid cells is more than twice the land area of Sweden. Two frequencies are available: daily and hourly. This means that Telia trip counts exist for county-to-county, municipality-to-municipality, grouped DeSO-to-grouped DeSO, and grid cell–to–grid cell, per day and per hour.

Figure 1.

Grid cells as a geographical level.

Summary in comparison

The strength of MNO trip counts clearly lies in the volume of data, with a huge number of trips recorded every day for a vast number of origin-destination pairs. This represents an entirely different scale compared to the Travel Survey, given the latter’s sample size, its single-day measurement period and its increasing nonresponse rate over the years. However, the MNO data also have obvious shortcomings, as listed below.

Despite the amount of data, MNO trip counts do not cover the population as long as not everyone aged 6-84 is a mobile service user.

The trip concept can never be fully aligned with the official definition, unless the latter is revised to align with the MNO data-processing algorithm.

Due to the noise in signalling data, location-trip measurement errors are unavoidable whatever the target trip concept.

Black-box data processing at MNO can be challenging for quality assurance, without the transparency provided by standard pipeline-processing.

Despite the challenges summarised above, in terms of coverage, concept and measurement-processing, previous analysis by Transport Analysis using data in the years 2019–2021 suggested that it may be possible to combine the two data sources, in order to produce disaggregated trip statistics either per month or within each county. The present study will focus on the methods that may be able to achieve these targets.

1.2. Related works in the literature

Ahas et al.¹ gave an early illustration of the potential of mobile phone data for foreign visitor statistics. Nichols et al.² recently offered a comprehensive survey of the literature on the use of mobile phone location data in official statistics, as well as in other social, demographic and health studies. The main topics in official statistics are population estimates, mobility, socio-economic indicators, and epidemic (COVID-19) tracing and monitoring.

MNO-MINDS D3.2 (Deliverable 3.2 of ESSnet project MNO-MINDS, available at https://cros.ec.europa.eu/system/files/2025-09/WP3_D3.2.pdf) provides a repository of methods for combining MNO and non-MNO data to produce official statistics. Different approaches are possible, depending on how the associated uncertainty is conceptualised and measured. For disaggregating total trips based on Travel Survey estimates and MNO proxy counts, we shall treat the corresponding direct survey estimates as realised dependent variables, for which the MNO counts provide relevant known features.

Treating disaggregation as a problem of small area estimation (e.g. Rao³), we can apply the model of Fay-Herriot,⁴ which yields the empirical best linear unbiased predictor (EBLUP) for each domain of interest. However, in practice, there may exist model outliers, or the fixed-effects model predictor using the MNO counts may be misspecified to a certain extent.

We shall apply a transfer learning approach (MNO-MINDS D3.2, Sec. 5.3 and Chapter 10) in addition, which does not rely on an assumed prediction function, and the inference can be fully based on the sampling design of the Travel Survey. See Pan and Yang⁵ for a general review of transfer learning; see also Gu et al.⁶ and Li et al.⁷ for transfer learning to high-dimensional parameter estimation of linear models. The transfer learning approach of MNO-MINDS D3.2 uses a different technique, which provides an alternative to the empirical Bayes estimator of James and Stein⁸ or the design-based composite estimator in small area estimation.

The rest of the paper will be organised as follows. In Section 2, we describe the methods to be implemented. In Section 3 we present the results of our application. Some final remarks are given in Section 4, where we also point out several topics for future research.

2. Methods

The target of estimation related to various trips can be defined generically as follows. Denote by $U = \{1, \dots, N\}$ the population of individuals. Denote by $t = 1, \dots, T$ the time duration of interest. For instance, $U$ may consist of all the residents aged 6 to 84 in a given country, and each $t$ may refer to a day in a calendar year. Let $y_{k t}$ be the number of in-scope trips by individual $k$ at time $t$, such that the target total is given as

Y = \sum_{k \in U} \sum_{t = 1}^{T} y_{k t}

In applications, examples of $y_{k t}$ include the following.

All trips, regardless of purpose or mode of transport.

OD trips from a given county (origin) to another (destination).

Local trips, with origin and destination inside the same county.

POI trips to a given point-of-interest, such as Lofoten in Norway.

The trips $y_{k t}$ may in addition refer to a given purpose or mode of transport.

An estimator of $Y$ based on the sample survey can be given as

\hat{Y} = \sum_{k \in U} \sum_{t = 1}^{T} δ_{k t} w_{k t} y_{k t}

where $δ_{k t} = 1$ if $y_{k t}$ is observed in the survey or $δ_{k t} = 0$ otherwise, and $w_{k t}$ is an estimation weight. For instance, the design weight is $w_{k t} = π_{k t}^{- 1}$ given $π_{k t} = \Pr (δ_{k t} = 1)$ according to the sampling design. In practice, however, $w_{k t}$ usually differs from $π_{k t}^{- 1}$ due to adjustments for survey nonresponse.
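
As a minimal sketch of the estimator $\hat{Y}$, the Python snippet below sums weighted trip counts over the observed person-days only, which is what the indicator $δ_{k t}$ achieves in the formula. The function names and the numbers are hypothetical, and pure design weights are used, i.e. no nonresponse adjustment is applied.

```python
def design_weight(inclusion_prob):
    """Design weight w = 1 / pi for an observed person-day."""
    return 1.0 / inclusion_prob


def weighted_total(observed):
    """Estimate the trip total from observed person-days.

    observed: iterable of (weight, trips) pairs, one pair for each
        person-day (k, t) with delta_kt = 1; unobserved person-days
        contribute nothing to the sum.
    """
    return sum(weight * trips for weight, trips in observed)


# Three hypothetical respondents, each reporting for one day.
sample = [
    (design_weight(0.001), 3),
    (design_weight(0.002), 2),
    (design_weight(0.001), 0),
]
print(weighted_total(sample))
```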

Meanwhile, it is possible to compile trip counts of mobile devices based on their positions inferred from the contact signals. Denote by $x_{d t}$ the count of device $d$ at time $t$ , where $d \in D$ and $D$ consists of all the in-scope devices at the MNO. In principle, the totals

X = \sum_{t = 1}^{T} X_{t} \quad \text{and} \quad X_{t} = \sum_{d \in D} x_{d t}

may be available from each of several MNOs. We do not distinguish in notation whether $X_{t}$ is obtained from a single MNO or summed over several MNOs, as long as it does not affect the estimation methods to be described.

For disaggregating the total $Y$, we shall focus on the estimation of subtotal proportions:

p_{i} = Y_{i} / Y \quad \text{where} \quad Y = \sum_{i = 1}^{m} Y_{i}

and $m$ is the number of subtotals of interest, which may refer to disaggregation temporally (such as monthly trip totals) or spatially (such as local trips within each county). Let the corresponding MNO subtotal proportions be

q_{i} = X_{i} / X \quad \text{where} \quad X = \sum_{i = 1}^{m} X_{i}

2.1. Diagnostic test

Patone and Zhang⁹ devise a test for the null hypothesis that the difference $X_{i} - Y_{i}$ is constant over $i$, given some big-data $X_{i}$ that has negligible variance compared to the unbiased sample survey estimator ${\hat{Y}}_{i}$. This is an instance of audit sampling inference (Zhang¹⁰,¹¹), which uses sample surveys to make valid inference with respect to the sampling distribution of ${\hat{Y}}_{i}$. Below we adapt the test for another relevant null hypothesis here, i.e.

H_{0} : X_{i} \propto Y_{i} \Leftrightarrow H_{0} : X_{i} / X = Y_{i} / Y

Under $H_{0}$, we have $E (\hat{Y} X_{i} / X) = Y_{i}$ and $E ({\hat{Y}}_{i} / X_{i}) = E ({\hat{Y}}_{j} / X_{j})$ over repeated sampling, where $X_{i}$ is a constant of sampling. Let $P = I - 1 1^{⊤} / m$, where $I$ is the $m \times m$ identity matrix and $1$ is the $m$-vector of ones, such that $P P^{⊤} = P P = P$. Denote by $[Z_{i}]$ the $m$-vector of $Z_{i} = {\hat{Y}}_{i} / X_{i}$. We have

E (P [Z_{i}]) = 0 \quad \text{and} \quad V (P [Z_{i}]) = P Σ P

where $Σ$ has diagonal elements $X_{i}^{- 2} V ({\hat{Y}}_{i})$. We can set the off-diagonal elements to 0 in practice, since each ${\hat{Y}}_{i}$ is based on a distinct subsample with negligible finite-population sampling fraction.

Since the components of $P [Z_{i}]$ always sum to $0$, let $[Z_{i}^{'}]$ be the $(m - 1)$-vector obtained on deleting an arbitrary component of $P [Z_{i}]$. Let $Q$ be the corresponding $(m - 1) \times (m - 1)$ submatrix of $P Σ P$, such that $[Z_{i}^{'}]$ has the $(m - 1)$-variate normal distribution

[Z_{i}^{'}] \sim N (0, Q)

Let $L L^{⊤} = Q$ be the Cholesky decomposition with lower-triangular $L$. We have

R = L^{- 1} [Z_{i}^{'}] \sim N (0, I_{(m - 1) \times (m - 1)})

such that a test statistic for $H_{0} : X_{i} \propto Y_{i}$ follows as

D = R^{⊤} R \sim χ_{m - 1}^{2}
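
The steps of the test can be sketched in plain Python as follows. The function names are illustrative, $Σ$ is taken to be diagonal as suggested above, and the last component of $P [Z_{i}]$ is the one deleted. The sketch returns only $D$ and the degrees of freedom; the $p$-value is the upper tail of the $χ_{m - 1}^{2}$ distribution, which could be obtained for example with `scipy.stats.chi2.sf(d, df)` where SciPy is available.

```python
import math


def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def proportionality_statistic(y_hat, var_y_hat, x):
    """Return (D, m - 1) for H0: X_i proportional to Y_i."""
    m = len(y_hat)
    z = [y / xi for y, xi in zip(y_hat, x)]
    sigma = [v / xi ** 2 for v, xi in zip(var_y_hat, x)]  # diagonal of Sigma
    z_bar = sum(z) / m
    pz = [zi - z_bar for zi in z]  # P [Z_i]: centred ratios
    s_bar = sum(sigma) / m ** 2

    def psp(i, j):
        # Element (i, j) of P Sigma P when Sigma is diagonal.
        diag = sigma[i] if i == j else 0.0
        return diag - sigma[i] / m - sigma[j] / m + s_bar

    # Delete the last component to obtain Z' and the submatrix Q.
    q = [[psp(i, j) for j in range(m - 1)] for i in range(m - 1)]
    low = cholesky(q)
    # Forward substitution solves L r = Z', i.e. r = L^{-1} Z'.
    r = []
    for i in range(m - 1):
        s = sum(low[i][k] * r[k] for k in range(i))
        r.append((pz[i] - s) / low[i][i])
    return sum(ri * ri for ri in r), m - 1
```

Because $[Z_{i}^{'}]$ is a full-rank set of contrasts of $[Z_{i}]$, the value of $D$ does not depend on which component is deleted.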

2.2. Linear mixed modelling

The model of Fay and Herriot⁴ is commonly used in small area estimation, which combines random effects and sampling errors. Treating $i = 1, \dots, m$ as the ‘small areas’, one can let

{\hat{Y}}_{i} = Y_{i} + e_{i} = β_{0} + β_{1} X_{i} + v_{i} + e_{i}

where $(β_{0}, β_{1})$ are the regression coefficients, $v_{i}$ is a mean-zero random effect with model variance $V (v_{i}) = σ_{v}^{2}$, and $e_{i}$ is the sampling error of ${\hat{Y}}_{i}$ with mean zero and sampling variance $V ({\hat{Y}}_{i}) = V (e_{i})$. In particular, it is assumed that $v_{i}$ and $v_{j}$ are independent if $i \neq j$, that $e_{i}$ and $e_{j}$ are independent if $i \neq j$, and that $v_{i}$ and $e_{j}$ are independent of each other whether or not $i = j$.

Provided the variance components $σ_{v}^{2}$ and $V (e_{i})$ , let $({\hat{β}}_{0}, {\hat{β}}_{1})$ be the weighted least squares (WLS) estimator of $(β_{0}, β_{1})$ . The best linear unbiased predictor (BLUP) of $Y_{i}$ is given as

{\tilde{Y}}_{i}^{H} = γ_{i} {\hat{Y}}_{i} + (1 - γ_{i}) ({\hat{β}}_{0} + {\hat{β}}_{1} X_{i}) and γ_{i} = \frac{σ_{v}^{2}}{σ_{v}^{2} + V (e_{i})}

In practice, one would replace $σ_{v}^{2}$ and $V (e_{i})$ by their estimates to obtain ${\hat{γ}}_{i}$ and the corresponding empirical BLUP, and constrain the final estimates of $Y_{i}$ to the overall $\hat{Y}$. An induced linear-prediction (LP) estimator can be given as

{\hat{p}}_{i}^{L P} = \frac{{\hat{γ}}_{i} {\hat{Y}}_{i} + (1 - {\hat{γ}}_{i}) ({\hat{β}}_{0} + {\hat{β}}_{1} X_{i})}{\sum_{j = 1}^{m} \{ {\hat{γ}}_{j} {\hat{Y}}_{j} + (1 - {\hat{γ}}_{j}) ({\hat{β}}_{0} + {\hat{β}}_{1} X_{j}) \}} \tag{1}
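
The LP estimator (1) can be sketched as below for a single MNO covariate. The function name `lp_proportions` and the argument `sigma2_v` are illustrative; in particular, the variance component is simply supplied by the caller here, since the method used to estimate $σ_{v}^{2}$ is not part of this sketch.

```python
def lp_proportions(y_hat, var_y_hat, x, sigma2_v):
    """Linear-prediction proportions from Fay-Herriot type shrinkage.

    y_hat: direct survey estimates; var_y_hat: their sampling variances;
    x: MNO counts; sigma2_v: random-effect variance, taken as given.
    """
    # WLS weights are the inverse total variances sigma2_v + V(e_i).
    w = [1.0 / (sigma2_v + v) for v in var_y_hat]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y_hat)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y_hat))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Shrink each direct estimate towards its regression prediction.
    blup = []
    for yi, vi, xi in zip(y_hat, var_y_hat, x):
        gamma = sigma2_v / (sigma2_v + vi)
        blup.append(gamma * yi + (1.0 - gamma) * (b0 + b1 * xi))
    total = sum(blup)
    return [b / total for b in blup]
```

Setting `sigma2_v = 0` yields the purely synthetic proportions, whereas a very large `sigma2_v` returns essentially the direct proportions ${\hat{Y}}_{i} / \sum_{j} {\hat{Y}}_{j}$.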

2.3. Transfer learning (TL)

Transfer learning can be helpful when direct unbiased estimation is too noisy to be acceptable in applications. Let the survey estimator of $p_{i}$ be

{\hat{p}}_{i} = {\hat{Y}}_{i} / \hat{Y}

For transfer learning given $\{q_{i} : i = 1, \dots, m\}$, consider minimising

\begin{aligned} Δ (p; γ) & = - \sum_{i = 1}^{m} {\hat{Y}}_{i} \log p_{i} + γ \sum_{i = 1}^{m} X_{i} (\log q_{i} - \log p_{i}) \\ & \quad + λ (\sum_{i = 1}^{m} p_{i} - 1) \end{aligned} \tag{2}

where the penalty with multiplier $γ$ is related to the Kullback-Leibler divergence from the target distribution $\{p_{i}\}$ to the source distribution $\{q_{i}\}$, and the last term with multiplier $λ$ is due to the ensemble parameter restriction $\sum_{i = 1}^{m} p_{i} = 1$.

Clearly, the solution is ${\hat{p}}_{i}$ if $γ = 0$ in (2), whereas it tends to $q_{i}$ as $γ \to \infty$. Given non-trivial $γ$, setting the partial derivatives of $Δ$ to 0, we obtain

{\dot{p}}_{i} = \frac{{\hat{Y}}_{i} + γ X_{i}}{\hat{Y} + γ X} = ψ (γ) {\hat{p}}_{i} + \{1 - ψ (γ)\} q_{i}

where

ψ (γ) = \frac{\hat{Y} / X}{γ + \hat{Y} / X}

and $\sum_{i = 1}^{m} {\dot{p}}_{i} = 1$ holds automatically. Notice the resemblance to the empirical Bayes estimator of James-Stein (1961), although the derivation from (2) does not invoke any empirical Bayes argument.
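
For readers who wish to verify the solution, the steps from (2) can be written out as follows, assuming that the domain estimates add up to the total, i.e. $\sum_{i} {\hat{Y}}_{i} = \hat{Y}$, just as $\sum_{i} X_{i} = X$.

```latex
\begin{aligned}
\frac{\partial \Delta}{\partial p_i}
  &= -\frac{\hat{Y}_i}{p_i} - \gamma \frac{X_i}{p_i} + \lambda = 0
  \quad\Longrightarrow\quad
  p_i = \frac{\hat{Y}_i + \gamma X_i}{\lambda}, \\
\sum_{i=1}^{m} p_i = 1
  &\quad\Longrightarrow\quad
  \lambda = \sum_{i=1}^{m} (\hat{Y}_i + \gamma X_i) = \hat{Y} + \gamma X, \\
\dot{p}_i
  &= \frac{\hat{Y} \hat{p}_i + \gamma X q_i}{\hat{Y} + \gamma X}
   = \psi(\gamma) \hat{p}_i + \{1 - \psi(\gamma)\} q_i,
  \qquad
  \psi(\gamma) = \frac{\hat{Y}}{\hat{Y} + \gamma X}
               = \frac{\hat{Y} / X}{\gamma + \hat{Y} / X}.
\end{aligned}
```

The last line uses ${\hat{Y}}_{i} = \hat{Y} {\hat{p}}_{i}$ and $X_{i} = X q_{i}$.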

Moreover, to choose the tuning parameter $ψ (γ)$, or $ψ$ directly, we minimise the total mean squared error (MSE) of $\{{\dot{p}}_{i}\}$ over repeated sampling, which is

E (\sum_{i = 1}^{m} ({\dot{p}}_{i} - p_{i})^{2}) = ψ^{2} \sum_{i = 1}^{m} V ({\hat{p}}_{i}) + (1 - ψ)^{2} \sum_{i = 1}^{m} u_{i}^{2}

where $u_{i} = q_{i} - p_{i}$ for $i = 1, \dots, m$ are treated as constants rather than random variables. The total MSE is minimised given

ψ = \frac{τ_{u}}{τ_{u} + τ_{e}}, \quad τ_{u} = \frac{1}{m} \sum_{i = 1}^{m} u_{i}^{2} \quad \text{and} \quad τ_{e} = \frac{1}{m} \sum_{i = 1}^{m} V ({\hat{p}}_{i})

A transfer-learning estimator ${\hat{p}}_{i}^{T L}$ follows as

{\hat{p}}_{i}^{T L} = \hat{ψ} {\hat{p}}_{i} + (1 - \hat{ψ}) q_{i} \quad \text{given} \quad \hat{ψ} = \frac{{\hat{τ}}_{u}}{{\hat{τ}}_{u} + {\hat{τ}}_{e}} \tag{3}

where

{\hat{τ}}_{u} = \frac{1}{m} \sum_{i = 1}^{m} (({\hat{p}}_{i} - q_{i})^{2} - \hat{V} ({\hat{p}}_{i})) \quad \text{and} \quad {\hat{τ}}_{e} = \frac{1}{m} \sum_{i = 1}^{m} \hat{V} ({\hat{p}}_{i})
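
A minimal Python sketch of the TL estimator (3) is given below. The function name is illustrative, and truncating a negative ${\hat{τ}}_{u}$ to zero (so that the estimate falls back on $q_{i}$) is an assumption of this sketch, made in line with the discussion of negative estimates in Section 2.5.

```python
def transfer_learning(p_hat, var_p_hat, q):
    """Return (psi_hat, TL estimates) from direct proportions and MNO proxies.

    p_hat: direct survey proportions; var_p_hat: their estimated sampling
    variances; q: MNO proportions, held fixed.
    """
    m = len(p_hat)
    tau_e = sum(var_p_hat) / m
    mean_sq_dist = sum((p - qi) ** 2 for p, qi in zip(p_hat, q)) / m
    # Moment estimate of tau_u, truncated at zero (assumption of this sketch).
    tau_u = max(0.0, mean_sq_dist - tau_e)
    denom = tau_u + tau_e
    psi = tau_u / denom if denom > 0 else 0.0
    estimates = [psi * p + (1.0 - psi) * qi for p, qi in zip(p_hat, q)]
    return psi, estimates


psi, p_tl = transfer_learning(
    p_hat=[0.5, 0.3, 0.2],
    var_p_hat=[0.005, 0.005, 0.005],
    q=[0.4, 0.4, 0.2],
)
print(psi, p_tl)
```

Since both $\{{\hat{p}}_{i}\}$ and $\{q_{i}\}$ sum to one, so do the TL estimates for any value of $\hat{ψ}$.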

On the one hand, in the extreme case of $X_{i} / X = Y_{i} / Y$, the probability will be high for the TL estimate ${\hat{p}}_{i}^{T L}$ to be equal to $X_{i} / X$, i.e. no error at all, whereas one still needs to estimate $(β_{0}, β_{1}) = (0, Y / X)$ for ${\hat{p}}_{i}^{L P}$ based on the $m$ survey estimates ${\hat{Y}}_{i}$, such that it will be subject to the sampling errors of ${\hat{Y}}_{i}$ even in this case. On the other hand, in the opposite extreme case of uncorrelated $(X_{i}, Y_{i})$, the TL estimator can be reduced to the survey estimator ${\hat{Y}}_{i} / \hat{Y}$, i.e. no gain at all, whereas by virtue of the convex combination of ${\hat{Y}}_{i}$ and $\hat{Y} / m$, the LP estimator ${\hat{p}}_{i}^{L P}$ can have a smaller MSE than ${\hat{p}}_{i}$, although $(β_{0}, β_{1}) = (Y / m, 0)$ admits nil effect of $X_{i}$. In short, neither estimator dominates in general, and TL is certainly worth considering given good proxy MNO counts.

2.4. Bootstrap

Although the linear mixed model involves both the model variance $V (v_{i})$ and the sampling variances $V ({\hat{Y}}_{i})$ , we propose a simple bootstrap to emulate only the sampling error, as a common ground of uncertainty evaluation, which results in a design-based MSE estimator as given below:

b1) Choose a set of plug-in values $\{Y_{i}^{*} : i = 1, \dots, m\}$. Let $p_{i}^{*} = Y_{i}^{*} / \sum_{j = 1}^{m} Y_{j}^{*}$.

b2.1) Draw ${\hat{Y}}_{i}^{*} \sim N (Y_{i}^{*}, \hat{V} ({\hat{Y}}_{i}))$ independently for $i = 1, \dots, m$.

b2.2) Obtain estimates $\{{\hat{p}}_{i}^{*} (η) : i = 1, \dots, m\}$ by a given method signified by $η$.

b3) Repeat b2.1-b2.2 to obtain ${\hat{p}}_{i, b}^{*} (η)$ for $b = 1, \dots, B$, and the MSE estimate

{mse}_{i} (η) = \frac{1}{B} \sum_{b = 1}^{B} ({\hat{p}}_{i, b}^{*} (η) - p_{i}^{*})^{2}

Note that the MNO counts $\{X_{i} : i = 1, \dots, m\}$ are held fixed, as $Y_{i}^{*}$ and $p_{i}^{*}$ are. The sampling variance estimates $\hat{V} ({\hat{Y}}_{i})$ are available from the survey, and the Monte Carlo error of the bootstrap MSE estimator ${mse}_{i} (η)$ would vanish as $B \to \infty$. As for the plug-in values $\{(Y_{i}^{*}, p_{i}^{*})\}$, we suggest one may choose the best available estimates, and use other plausible estimates for sensitivity analysis.
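
Steps b1 to b3 can be sketched in Python as follows. The function names, the default number of replicates and the seed are illustrative choices; the method $η$ is passed as a function `estimator` that maps the bootstrap draws to proportions, and any fixed inputs it needs (such as the MNO counts) are assumed to be bound inside that function.

```python
import random


def bootstrap_mse(y_star, var_y_hat, estimator, n_boot=1000, seed=1):
    """Bootstrap MSE estimates of the proportions given by `estimator`.

    y_star: plug-in values Y_i^*; var_y_hat: estimated sampling variances
    of the direct estimates; estimator: function taking the list of
    bootstrap draws and returning the list of estimated proportions.
    """
    rng = random.Random(seed)
    total = sum(y_star)
    p_star = [y / total for y in y_star]  # b1: plug-in proportions
    sq_err = [0.0] * len(y_star)
    for _ in range(n_boot):
        # b2.1: independent normal draws around the plug-in values.
        y_boot = [rng.gauss(y, v ** 0.5) for y, v in zip(y_star, var_y_hat)]
        # b2.2: apply the estimation method to the bootstrap draws.
        p_boot = estimator(y_boot)
        for i, (pb, ps) in enumerate(zip(p_boot, p_star)):
            sq_err[i] += (pb - ps) ** 2
    # b3: average squared error over the replicates.
    return [s / n_boot for s in sq_err]


def direct(y_boot):
    """Direct estimator of the proportions, used here as an example method."""
    total = sum(y_boot)
    return [y / total for y in y_boot]


mse_direct = bootstrap_mse([100.0, 200.0, 300.0], [25.0, 36.0, 49.0], direct)
print(mse_direct)
```

The same call with a different `estimator`, for instance one wrapping the LP or TL method, evaluates that method against the same plug-in values.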

Using the design-based MSE as the common criterion allows one to compare ${\hat{p}}_{i}^{L P}$ and ${\hat{p}}_{i}^{T L}$ regardless of the difference between their underlying assumptions about the source(s) of uncertainty. The additional sensitivity analysis results should hopefully agree with one’s conclusion about the relative merits of ${\hat{p}}_{i}^{L P}$ and ${\hat{p}}_{i}^{T L}$, although the reported MSE estimates would surely have been different using a different set of plug-in values $\{(Y_{i}^{*}, p_{i}^{*})\}$ for the bootstrap.

2.5. Robust estimator

One needs to estimate $σ_{v}^{2} = V (v_{i})$ for the LP estimator (1). Negative estimates arise if

\sum_{i = 1}^{m} \hat{V} ({\hat{Y}}_{i}) > \sum_{i = 1}^{m} ({\hat{Y}}_{i} - {\hat{μ}}_{i})^{2}

given ${\hat{μ}}_{i} = {\hat{β}}_{0} + {\hat{β}}_{1} X_{i}$. The EBLUP is then simply the synthetic estimator ${\hat{μ}}_{i}$. Similarly for $τ_{u} = \sum_{i} u_{i}^{2} / m$ in transfer learning, if

\sum_{i = 1}^{m} \hat{V} ({\hat{p}}_{i}) > \sum_{i = 1}^{m} ({\hat{p}}_{i} - q_{i})^{2}

The TL estimator (3) is then simply the MNO proxy $q_{i}$. However, as long as ${\hat{μ}}_{i}$ is not always equal to $q_{i}$, it would have been imprudent to proceed as if either of them had completely negligible bias.

We notice that the situation occurs because the survey sampling variance estimates are relatively large compared to the errors of ${\hat{μ}}_{i}$ or $q_{i}$. In other words, ${\hat{μ}}_{i}$ and $q_{i}$ can still be quite good, because the sampling variances depend on the relevant sample sizes, whereas the errors of ${\hat{μ}}_{i}$ and $q_{i}$ depend on how good the model or the MNO proxy is. The problem is that one cannot accept both ${\hat{μ}}_{i}$ and $q_{i}$ if they are not equal to each other.

It seems sensible in this situation to let a robust estimator of $p_{i}$ be a convex combination of ${\hat{p}}_{i}^{L P}$ derived from ${\hat{μ}}_{i}$ and ${\hat{p}}_{i}^{T L} = q_{i}$ . Now that

\begin{aligned} E (({\hat{p}}_{i} - μ_{i} / Y)^{2}) & = E (({\hat{p}}_{i} - p_{i} + p_{i} - μ_{i} / Y)^{2}) \\ & = V ({\hat{p}}_{i}) + (μ_{i} / Y - p_{i})^{2} \\ E (({\hat{p}}_{i} - q_{i})^{2}) & = E (({\hat{p}}_{i} - p_{i} + p_{i} - q_{i})^{2}) \\ & = V ({\hat{p}}_{i}) + (q_{i} - p_{i})^{2} \end{aligned}

with respect to sampling, a reasonable choice may be given as

\begin{aligned} {\hat{p}}_{i}^{M X} & = w {\hat{μ}}_{i} + (1 - w) q_{i} \quad \text{and} \\ w & = \frac{\sum_{i} ({\hat{p}}_{i} - q_{i})^{2}}{\sum_{i} ({\hat{p}}_{i} - μ_{i} / Y)^{2} + \sum_{i} ({\hat{p}}_{i} - q_{i})^{2}} \end{aligned} \tag{4}

For instance, $w = 0.5$ in the hypothetical case of $μ_{i} / Y \equiv q_{i}$, as it should be.

In practice, estimates of $(μ_{i}, Y)$ are needed to calculate $w$ , for which one may as well use the OLS fit of $(β_{0}, β_{1})$ to compute ${\hat{μ}}_{i}$ for extra robustness. One can still use the WLS $({\hat{β}}_{0}, {\hat{β}}_{1})$ for the synthetic estimator ${\hat{μ}}_{i}$ in (4). Finally, one would use $p_{i}^{*} = {\hat{p}}_{i}^{M X}$ as the plug-in values for bootstrap MSE estimation.
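
The mixing in (4) can be sketched as follows. As an assumption of this sketch, the synthetic component is supplied on the proportion scale, i.e. as the proportions `p_syn` derived from ${\hat{μ}}_{i}$, and the same values are used in place of $μ_{i} / Y$ when computing $w$; the function name and the fallback $w = 0.5$ for a zero denominator are likewise illustrative.

```python
def robust_mix(p_hat, p_syn, q):
    """Return (w, mixed proportions) combining synthetic and MNO proportions.

    p_hat: direct survey proportions; p_syn: synthetic proportions derived
    from the fitted linear predictor (assumed already normalised to sum
    to one); q: MNO proportions.
    """
    dist_q = sum((p - qi) ** 2 for p, qi in zip(p_hat, q))
    dist_syn = sum((p - s) ** 2 for p, s in zip(p_hat, p_syn))
    denom = dist_syn + dist_q
    # The component that lies further from the direct estimates gets less weight.
    w = dist_q / denom if denom > 0 else 0.5
    mixed = [w * s + (1.0 - w) * qi for s, qi in zip(p_syn, q)]
    return w, mixed


w, p_mx = robust_mix(
    p_hat=[0.50, 0.30, 0.20],
    p_syn=[0.45, 0.35, 0.20],
    q=[0.45, 0.35, 0.20],
)
print(w, p_mx)
```

In this example the two components coincide, so that $w = 0.5$ and the mixed estimate equals both of them.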

3. Application

For the analysis in this paper, we use MNO data from Telia, Sweden, for the years 2020–2023. Figure 2 shows the yearly total trips by MNO data, compared to the estimated totals from the Travel Survey and their associated 95% confidence intervals.

Figure 2.

Total trips in years 2020-2023. Source: MNO by Telia (long-dashed), RVU by Travel Survey (solid) with associated 95% confidence interval.

Clearly, there are statistically significant differences between the two totals, due to the various reasons that have been discussed in Section 1. Below, for disaggregating trip statistics, we shall focus on the proportions $p_{i} = Y_{i} / Y$ as the target parameters, where $\sum_{i = 1}^{m} p_{i} = 1$ .

Figure 3 shows the total trips in 2020-2023 by month or county, according to either source. These are indeed all the data we use in the analysis below.

Figure 3.

Total trips 2020-2023 by month (top), county (bottom). Source: MNO by Telia (dotted), RVU by Travel Survey (solid) with associated 95% confidence interval (shaded).

3.1. Diagnostic test

In addition to the test for $H_{0}^{A}$ : constant $Y_{i} - X_{i}$ , where $i = 1, \dots, m$ , which was proposed by Patone and Zhang,⁹ Section 2.1 describes another test for $H_{0}^{B}$ : constant $Y_{i} / X_{i}$ .

Table 2 gives the $p$ -values of both the tests, where $i$ refers to the calendar months or the counties. It is clear that the data provide highly significant evidence against both the hypotheses in the case of county trip totals, where the $p$ -values are all 0 to the second decimal. The null hypotheses hold much better in the case of monthly trip totals, where the $p$ -values are only significant at the 1%-level in year 2020 but not in any of the other years.

Table 2.
P-values of null hypothesis $H_{0}^{A}$ : constant $Y_{i} - X_{i}$ , or $H_{0}^{B}$ : constant $Y_{i} / X_{i}$ , where $i = 1, \dots, m$ refers to month or county.

Year	Month $H_{0}^{A}$	Month $H_{0}^{B}$	County $H_{0}^{A}$	County $H_{0}^{B}$
2020	0.00	0.00	0.00	0.00
2021	0.12	0.15	0.00	0.00
2022	0.60	0.63	0.00	0.00
2023	0.60	0.74	0.00	0.00

On the one hand, this conforms to the prior knowledge that the MNO trip counts cannot be treated as the target totals directly for official statistics. On the other hand, it suggests that a given estimator may have somewhat different performances for temporal or spatial disaggregation here, since the statistical relationship between $Y_{i}$ and $X_{i}$ seems different in the one or the other case.

3.2. Modelling

The LP estimator (1) is based on the assumption that $Y_{i}$ is linear in $X_{i}$ , in contrast to the TL estimator (3) that does not rely on any particular functional relationship between $Y_{i}$ and $X_{i}$ . Meanwhile, the plot at the bottom of Figure 3 suggests that there may exist county outliers to the linear assumption.

A diagnostic for model outliers can be given as follows. Let ${\hat{β}}_{(k)}$ be the ordinary least squares (OLS) fit of $(β_{0}, β_{1})$ in $E ({\hat{Y}}_{i}) = β_{0} + X_{i} β_{1}$, which is obtained from $i = 1, \dots, k - 1, k + 1, \dots, m$ after deleting the $k$th unit. Let $\hat{β}$ be the OLS fit based on all the units. Let $| {\hat{β}}_{(k)} / \hat{β} - 1 |$, computed separately for each coefficient, be the absolute relative difference between the fits without and with the $k$th unit. Any unit $k$ may be called critical/influential and is potentially a model outlier if it causes unusually large changes to the regression coefficients. Moreover, a unit may be so by chance, or it may be so persistently over time; it is certainly an outlier to the linear assumption in the latter case.
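
The leave-one-out diagnostic can be sketched in plain Python as below. The function names and the toy data are illustrative; the sketch assumes that neither full-data coefficient is exactly zero, since the relative change is otherwise undefined.

```python
def ols(x, y):
    """Ordinary least squares fit (b0, b1) of y on x with an intercept."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def leave_one_out_changes(x, y):
    """Absolute relative change of (b0, b1) when each unit is deleted in turn."""
    b0, b1 = ols(x, y)
    changes = []
    for k in range(len(x)):
        # Refit without the k-th unit.
        b0_k, b1_k = ols(x[:k] + x[k + 1:], y[:k] + y[k + 1:])
        changes.append((abs(b0_k / b0 - 1.0), abs(b1_k / b1 - 1.0)))
    return changes


# Toy data on the line y = 1 + 2x, except for the last unit.
x_toy = [1.0, 2.0, 3.0, 4.0, 5.0]
y_toy = [3.0, 5.0, 7.0, 9.0, 30.0]
for k, (c0, c1) in enumerate(leave_one_out_changes(x_toy, y_toy), start=1):
    print(k, round(c0, 2), round(c1, 2))
```

Each printed row corresponds to one column of Tables 3 and 4 for a given year, with the first value referring to $β_{0}$ and the second to $β_{1}$.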

Tables 3 and 4 give the diagnostic values, respectively, in the context of temporal and spatial disaggregation. Looking across all these numbers, one can see three outlier counties in Table 4: $k =$ 12, 11 and 19, causing unusual changes to $β_{1}$ or $β_{0}$ , with effects that are largely the same over time for any of these three counties. In comparison, there are no similarly outstanding units in Table 3 in the context of temporal disaggregation. Although there exist units causing relatively large changes to the OLS in a given year, no unit seems to have largely the same unusual effect over time.

Table 3.
Absolute relative changes of OLS by $k$ th month.

Year		1	2	3	4	5	6	7	8	9	10	11	12
2020	$β_{0}$	.10	.02	.00	.62	.05	.00	.03	.27	.06	.17	.05	.40
	$β_{1}$	.05	.01	.00	.28	.02	.00	.01	.14	.03	.08	.02	.18
2021	$β_{0}$	2.14	1.85	.30	.77	.14	.01	.00	.06	.48	.28	.40	.15
	$β_{1}$	.35	.29	.03	.12	.03	.00	.01	.02	.08	.04	.07	.05
2022	$β_{0}$	1.79	1.55	1.71	.12	1.69	1.04	.65	.70	.70	3.01	.39	1.38
	$β_{1}$	.08	.08	.09	.00	.09	.06	.03	.04	.04	.17	.03	.06
2023	$β_{0}$	.45	.27	.08	.06	.10	.10	.48	.01	.64	.67	.15	.66
	$β_{1}$	.17	.10	.04	.02	.03	.05	.19	.02	.28	.27	.06	.24

Table 4.

Absolute relative changes of OLS by $k$ th county (listed in Appendix A).

Year		1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21
2020	$β_{0}$	.00	.00	.05	.00	.00	.08	.01	.03	.00	.05	.00	.94	.03	.02	.01	.08	.05	.02	.07	.04	.04
	$β_{1}$	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.04	.19	.00	.00	.00	.00	.00	.00	.02	.00	.00
2021	$β_{0}$	.02	.03	.08	.01	.06	.04	.00	.00	.01	.06	.00	.71	.02	.03	.00	.04	.05	.04	.11	.04	.03
	$β_{1}$	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.01	.16	.00	.00	.00	.00	.00	.00	.03	.00	.00
2022	$β_{0}$	.03	.06	.05	.00	.07	.04	.07	.04	.00	.01	.00	.73	.00	.06	.05	.03	.05	.00	.03	.07	.00
	$β_{1}$	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.04	.18	.00	.00	.00	.00	.00	.00	.00	.00	.00
2023	$β_{0}$	.01	.03	.00	.00	.06	.04	.03	.00	.05	.04	.01	1.14	.01	.04	.09	.05	.11	.04	.13	.07	.02
	$β_{1}$	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.03	.19	.00	.00	.00	.00	.00	.00	.03	.00	.00

In short, these diagnostic results do not reveal any single calendar month that obviously and persistently violates the linear model assumption over time, which agrees with the impression one gets by inspecting the corresponding plot in Figure 3. In contrast, outliers to the linear assumption do seem to exist in three counties: Stockholm ( $k = 12$ ), Skåne ( $k = 11$ ) and Västra Götaland ( $k = 19$ ), which conforms to the impression from the county plot in Figure 3. These are the three most populated counties in Sweden, which are, in descending order, Stockholm, Västra Götaland and Skåne.
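To make the notion of a persistent effect concrete, the following sketch applies a simple counting rule to the $β_{1}$ rows of Table 4. The rule itself (a change of at least 0.01 in at least three of the four years) is an assumption chosen for illustration, not a criterion used in the paper.

```python
import numpy as np

years = [2020, 2021, 2022, 2023]
n_counties = 21

# Non-zero beta_1 entries of Table 4 by county number k, one value per year.
# All other beta_1 entries in Table 4 are .00.
nonzero_beta1 = {
    11: [0.04, 0.01, 0.04, 0.03],
    12: [0.19, 0.16, 0.18, 0.19],
    19: [0.02, 0.03, 0.00, 0.03],
}

beta1_change = np.zeros((len(years), n_counties))
for k, values in nonzero_beta1.items():
    beta1_change[:, k - 1] = values  # county k sits in column k - 1

# Assumed illustrative rule (not from the paper): a county is persistently
# influential if its change is at least `threshold` in `min_years` years.
threshold = 0.01
min_years = 3
n_flagged = (beta1_change >= threshold).sum(axis=0)
persistent = [k + 1 for k in range(n_counties) if n_flagged[k] >= min_years]
print(persistent)  # [11, 12, 19]
```

Under this rule the same three counties are flagged as in the text; a stricter or looser threshold would of course change the outcome, so the rule should be read as a reading aid for Table 4 rather than a formal test.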

3.3. Bootstrap MSE evaluation

We now apply the bootstrap described in Section 2.4 to evaluate the MSEs. Let ${mse}_{i}$ be the bootstrap estimate of the MSE of a given estimator of $p_{i}$ for unit $i = 1, \dots, m$ , where $p_{i} = Y_{i} / Y$ is the target proportion for disaggregating a population trip total. The average relative root MSE over all the units, evaluated by bootstrap, is then given as

$$\mathrm{ARRMSE} = \frac{1}{m} \sum_{i = 1}^{m} \frac{\sqrt{{mse}_{i}}}{p_{i}^{*}},$$

where $p_{i}^{*}$ is the proportion in the plug-in population for bootstrap.
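As a minimal sketch of the ARRMSE computation, the code below assumes that ${mse}_{i}$ is obtained as the mean squared deviation of bootstrap replicate estimates from $p_{i}^{*}$ ; the details of the bootstrap itself are those of Section 2.4 and are not reproduced here. The function names and the toy numbers are hypothetical.

```python
import numpy as np


def bootstrap_mse(replicates, p_star):
    """Mean squared deviation from p_star over bootstrap replicates.

    replicates: array of shape (B, m), one row of m estimates per replicate.
    Assumption: mse_i is the average of (estimate_i - p_star_i)^2 over rows.
    """
    replicates = np.asarray(replicates, dtype=float)
    p_star = np.asarray(p_star, dtype=float)
    return np.mean((replicates - p_star) ** 2, axis=0)


def arrmse(mse, p_star):
    """Average relative root MSE: mean over units of sqrt(mse_i) / p_star_i."""
    mse = np.asarray(mse, dtype=float)
    p_star = np.asarray(p_star, dtype=float)
    return float(np.mean(np.sqrt(mse) / p_star))


# Hypothetical toy example: m = 3 units and B = 2 replicates.
p_star = np.array([0.5, 0.3, 0.2])
replicates = np.array([[0.52, 0.29, 0.19],
                       [0.48, 0.31, 0.21]])
mse = bootstrap_mse(replicates, p_star)
print(arrmse(mse, p_star))
```

Dividing each root MSE by $p_{i}^{*}$ before averaging puts small and large units on the same relative scale, which matters for the county results where three counties dominate numerically.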

Table 5 gives the bootstrap results for temporal and spatial disaggregation, based on $10^{5}$ iterations in each case. We note the following in particular.

Table 5.
Average relative root mean squared error (ARRMSE) of disaggregated trip totals, per year and source/method, by bootstrap with $10^{5}$ iterations.

	Month, $m = 12$
Year	Survey	MNO	LP	TL	MX
2020	0.052	0.073	0.026	0.022	0.024
2021	0.083	0.026	0.022	0.021	0.022
2022	0.082	0.002	0.0001	0.001	0.001
2023	0.085	0.009	0.0001	0.008	0.003
	County, $m = 21$
Year	Survey	MNO	LP	TL	MX
2020	0.075	0.108	0.016	0.007	0.010
2021	0.107	0.110	0.059	0.034	0.051
2022	0.101	0.131	0.032	0.024	0.026
2023	0.106	0.114	0.044	0.040	0.042

Firstly, for monthly trips in years 2022 and 2023, the direct estimate of $σ_{v}^{2}$ or $τ_{u}$ is often negative and truncated to zero based on the bootstrap replicate samples. This is the situation discussed in Section 2.5, where it is preferable to adopt the estimator ${\hat{p}}_{i}^{M X}$ that mixes ${\hat{p}}_{i}^{L P}$ and ${\hat{p}}_{i}^{T L}$ by (4). For comparable interpretation across all the results here, we have therefore always used the MX-estimates as the plug-in bootstrap population proportions, $p_{i}^{*} = {\hat{p}}_{i}^{M X}$ .

Secondly, in the bootstrap evaluation here, the Travel Survey estimator ${\hat{p}}_{i}$ is treated as unbiased, while the MNO proportions $q_{i} = X_{i} / X$ are held as fixed constants. It follows that, in terms of ARRMSE, we are essentially comparing the standard error (SE) of the Travel Survey to the bias of the MNO trips. One can see that the survey SE is much larger than the MNO bias for monthly trips in years 2021-2023, which is consistent with the high $p$ -values reported earlier in Table 2. By contrast, for county trips the survey SE is somewhat smaller than, or of about the same magnitude as, the MNO bias, which is again consistent with the corresponding test results in Table 2.
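The comparison can be spelled out with the usual decomposition of the MSE into variance and squared bias. In the sketch below, $E^{*}$ and $V^{*}$ denote expectation and variance under the bootstrap; this notation is introduced here for illustration only and is not used elsewhere in the paper.

$$\mathrm{mse}({\hat{p}}_{i}) = V^{*}({\hat{p}}_{i}) + \{ E^{*}({\hat{p}}_{i}) - p_{i}^{*} \}^{2} = V^{*}({\hat{p}}_{i}), \qquad \mathrm{mse}(q_{i}) = V^{*}(q_{i}) + (q_{i} - p_{i}^{*})^{2} = (q_{i} - p_{i}^{*})^{2}$$

The first equality chain uses the unbiasedness of the survey estimator, the second uses that $q_{i}$ is a constant with zero variance. The terms averaged in the ARRMSE are therefore $\sqrt{V^{*}({\hat{p}}_{i})} / p_{i}^{*}$ for the Travel Survey, a relative SE, and $| q_{i} - p_{i}^{*} | / p_{i}^{*}$ for the MNO proportions, a relative absolute bias.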

A lesson one can take from this is that the MNO trip counts may be better for certain targets than others, because the effects of the underlying errors can vary for different targets. By contrast, under the hypothetical assumption that the MNO counts are the unknown true values, they would have been equally good for all purposes.

Thirdly, comparing transfer learning to linear modelling, we can see that the TL-estimator has smaller MSEs than the LP-estimator, except for monthly trips in 2022 and 2023. The LP-estimator appears ‘super-efficient’ with extremely low ARRMSEs in these two cases due to the sensitivity of the EBLUP under linear models, where the estimate of $σ_{v}^{2}$ is more often negative than that of $τ_{u}$ . After all, it is intuitively implausible that the true $Y_{i}$ should be exactly linear in the MNO counts $X_{i}$ , despite all the known contingencies of the latter.

Finally, comparing the TL- or MX-estimator to the Travel Survey estimator, we see that there is little difference in efficiency between the TL- and MX-estimators, while both are much more efficient than the direct survey estimator. Indeed, the ARRMSE of the disaggregated TL- (or MX-) estimator here has a magnitude comparable to the coefficient of variation of the survey-based national trip total estimator; in this respect the disaggregated statistics have achieved an accuracy similar to that of the official statistics at the national level.
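The size of the efficiency gain can be read off Table 5 directly. The short script below simply divides the Survey ARRMSE by the TL and MX ARRMSEs for each year; the numbers are copied from Table 5, while the dictionary layout and the helper name `survey_ratio` are illustrative choices.

```python
# ARRMSE values from Table 5, as (Survey, TL, MX) per year.
table5 = {
    "month": {
        2020: (0.052, 0.022, 0.024),
        2021: (0.083, 0.021, 0.022),
        2022: (0.082, 0.001, 0.001),
        2023: (0.085, 0.008, 0.003),
    },
    "county": {
        2020: (0.075, 0.007, 0.010),
        2021: (0.107, 0.034, 0.051),
        2022: (0.101, 0.024, 0.026),
        2023: (0.106, 0.040, 0.042),
    },
}


def survey_ratio(domain, year):
    """Ratios Survey/TL and Survey/MX of ARRMSE; values above 1 favour TL or MX."""
    survey, tl, mx = table5[domain][year]
    return survey / tl, survey / mx


for domain, rows in table5.items():
    for year in rows:
        r_tl, r_mx = survey_ratio(domain, year)
        print(f"{domain:6s} {year}: Survey/TL = {r_tl:.1f}, Survey/MX = {r_mx:.1f}")
```

Every ratio exceeds one, and the TL and MX ratios are close to each other in most rows, in line with the two observations made in the paragraph above.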

3.4. Disaggregation results

Based on the analysis above, we find the MX-estimator (4) to be preferable for estimating monthly trips. On the one hand, this yields large gains of efficiency compared to the direct survey estimator; on the other hand, the results would be more robust than selecting one of the LP and TL estimators, when they do not admit any month-specific contributions from the survey estimator (such as in years 2022 and 2023, Table 5).

The monthly disaggregation results are plotted in Figure 4 for the year 2023. The improvements over the Travel Survey results are visible. The results for all the years 2020-2023 are available at Transport Analysis’ web portal.

Figure 4.

Proportions of trips per month in 2023. LP, linear prediction; TL, transfer learning; MX, robust estimator; RVU by Travel Survey; MNO by Telia.

When it comes to county trips, our analysis has evidenced the problems caused by the outlier counties, which suggests that it would be inappropriate to adopt the linear model given the influential effects of these outlier counties. We therefore find the TL-estimator more suitable for county trips.

The county disaggregation results are plotted in Figure 5 for the year 2023. The numerical domination of the three outlier counties over the rest makes it difficult to appreciate the improvements over the Travel Survey results on this scale. However, the plot does emphasise the outlier effects, in particular the different results for Stockholm ( $k = 12$ ). Moreover, a bootstrap MSE evaluation additional to Table 5, using $p_{i}^{*} = {\hat{p}}_{i}^{T L}$ , confirms the efficiency gains of transfer learning over direct survey estimation. The county trip results for all the years 2020-2023 are available at Transport Analysis’ web portal.

Figure 5.

Proportions of trips per county in 2023. LP, linear prediction; TL, transfer learning; MX, robust estimator; RVU by Travel Survey; MNO by Telia.

4. Final remarks

We have presented some methods relevant for combining Travel Survey and MNO data to produce disaggregated trip statistics, which are applicable to many other topics, such as commuter statistics or domestic tourist statistics. The methods use only anonymous and aggregated MNO data, which is attractive with respect to confidentiality, integrity and commercial concerns. Our application to the data available to Transport Analysis has proved that it is possible to obtain fit-for-purpose disaggregated statistics. However, there certainly exist topics for future research, such as the ones to be mentioned below.

First, we have treated the survey estimator as unbiased in the methods presented. In reality, however, survey nonresponse and reporting errors may cause some bias that is not fully resolved for the survey estimator. How to extend either the transfer learning or the modelling approach to accommodate bias in the survey estimator presents an intriguing question.

Next, one would like to enrich the disaggregated statistics by allowing for additional trip classifications such as mode of transport or purpose. Since the relevant data currently exist only in the survey, one needs to investigate two possibilities: either using MNO trip counts without such classifications, or introducing similar proxy classifications in the MNO data as well.

Finally, although bootstrap MSE evaluation is simple to implement, the need to choose explicitly the plug-in bootstrap population is an issue that motivates further research of alternative assumption-lean uncertainty measures, such as interval estimation only based on exchangeable distributions.

Footnotes

Acknowledgement

This work was co-funded by the European Commission Project “MNO-MINDS” – 101132744 - 2022-IT-TSS-METH-TOO.

ORCID iD

Li-Chun Zhang

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

A Appendix

Table 6.

List of Swedish counties (alphabetical order).

Number	County
1	Blekinge County
2	Dalarnas County
3	Gotland County
4	Gävleborgs County
5	Hallands County
6	Jämtlands County
7	Jönköpings County
8	Kalmar County
9	Kronobergs County
10	Norrbottens County
11	Skåne County
12	Stockholms County
13	Södermanlands County
14	Uppsala County
15	Värmlands County
16	Västerbottens County
17	Västernorrlands County
18	Västmanlands County
19	Västra Götalands County
20	Örebro County
21	Östergötlands County

References

1. Ahas, Aasa, Silm, et al. Mobile positioning data in tourism studies and monitoring: Case study in Tartu, Estonia. In: Sigala M, Mich L, Murphy J (eds.) Information and communication technologies in tourism 2007. Vienna: Springer, 2007. DOI: 10.1007/978-3-211-69566-1_12.

2. Nichols, O’Brien, Feuer, Childs. Use of mobile phone location data in official statistics, social, demographic and health studies. In: Research and Methodology Directorate, Center for Behavioral Science Methods, 2023, Research Report Series (Survey Methodology 2023-03). U.S. Census Bureau. https://www.census.gov/library/working-papers/2023/adrm/rsm2023-03.html.

3. Rao JNK. Small Area Estimation. New York: John Wiley & Sons, Inc., 2003.

4. Fay, Herriot. Estimates of income for small places: An application of James-Stein procedures to census data. J Am Stat Assoc 1979; 85: 398–409.

5. Pan, Yang. A survey on transfer learning. IEEE Trans Knowl Data Eng 2009; 22: 1345–1359.

6. Han, Duan. Robust angle-based transfer learning in high dimensions. J R Stat Soc Ser B 2024; 87(3). DOI: 10.1093/jrsssb/qkae111.

7. Cai. Transfer learning for high-dimensional linear regression: Prediction, estimation, and minimax optimality. J R Stat Soc Ser B 2020; 84: 149–173.

8. James, Stein. Estimation with quadratic loss. Proc Fourth Berkeley Symp Math Statist Prob 1961; 1: 361–379.

9. Patone, Zhang L-C. On two existing approaches to statistical analysis of social media data. Int Stat Rev 2020; 89: 54–71.

10. Zhang L-C. Audit sampling as a quality standard for multisource official statistics. Spanish J Stat 2023; 5: 67–83.

11. Zhang L-C. Proxy expenditure weights for consumer price index: Audit sampling inference for big-data statistics. J R Stat Soc Ser A 2021; 184: 571–588.
