Abstract

Bicycle volume estimation is essential for effective city planning and ensuring the safety of vulnerable road users. Traditional approaches often involve resource-intensive data collection, such as short-term counts expanded using continuous counts (CCs), or the development of direct-demand models from site-specific data such as geometric, socioeconomic, and land use characteristics. Although recent efforts have explored crowdsourced data, such sources may be biased and are not consistently available across jurisdictions. This study explores the potential of large language models (LLMs), specifically ChatGPT-4o mini, as a low-cost and accessible alternative for estimating bicycle volume levels at intersections.

A data set of CCs over 12 months and site characteristics for 158 signalized intersections in the Region of Waterloo, Ontario, Canada, was compiled. From this, the average annual daily bicycle volume was computed, and intersections were categorized into low, medium, or high volume levels.

Eight LLMs and three benchmark approaches, including a Naïve (random), an ordered logit (OL) model, and human survey responses, were applied to the data set. Accuracy in predicting the correct volume category was the evaluation metric. The best-performing LLM requires only two input features: (1) bike lane length (km); and (2) the presence of schools, and achieves an average accuracy of 60%, just 4% below the OL model, and significantly better than the Naïve model and human survey respondents. Of interest, providing satellite images or additional quantitative input variables decreased the LLM performance. The results suggest LLMs can offer a scalable, data-efficient alternative for jurisdictions lacking extensive bicycle count data or modeling expertise.

Keywords

annual average daily bicycle volume, direct-demand model, large language models, ChatGPT vision, artificial intelligence, satellite images

Introduction

Volume (also referred to as exposure) is a key input to road safety analysis and for planning transportation infrastructure requirements and improvements. Traditionally, focus has been on vehicular traffic and is frequently expressed as the annual average daily traffic (AADT). However, with the growing focus on improving the safety of vulnerable road users, such as bicyclists and pedestrians, estimating their exposure has become equally important. Analogous exposure measures for pedestrians and bicyclists are annual average daily pedestrian traffic and annual average daily bicycle (AADB) volume, respectively.

Several methods have been used to estimate bicycle volumes. The most common is to use factoring methods to scale up short-term counts (STCs) to estimate a longer-term average volume (e.g., AADB). The scaling factors capture temporal variations (by time of day, day of the week, and month of the year) and are typically computed from a smaller number of locations at which continuous counts (CCs) are available. These factoring methods work well, but STCs must be available for all sites of interest, and scaling factors must be available (or be computed from data from CC sites).
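The factoring idea described above can be illustrated with a minimal sketch. The form of the factor used here (a CC site's AADB divided by its mean daily count for the same day-of-week and month) is a common convention and is an assumption, not a formula taken from this study; the function names and numbers are hypothetical.

```python
def expansion_factor(cc_aadb: float, cc_mean_for_period: float) -> float:
    """Ratio of a CC site's AADB to its mean daily count in one period.

    The "period" (e.g., weekdays in July) is an assumed grouping.
    """
    return cc_aadb / cc_mean_for_period


def expand_short_term_count(stc_daily_count: float, factor: float) -> float:
    """Scale a short-term daily count to an estimated AADB."""
    return stc_daily_count * factor


# Hypothetical CC site: AADB of 80, but 160 bicycles/day on July weekdays.
factor = expansion_factor(cc_aadb=80.0, cc_mean_for_period=160.0)

# Hypothetical STC site counted on a July weekday: 90 bicycles.
estimated_aadb = expand_short_term_count(90.0, factor)
```

The sketch makes the data requirement visible: the factor cannot be computed without at least one CC site that shares the temporal pattern of the STC site.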

When jurisdictions do not have (sufficient) STCs and/or CCs, then statistical methods, such as direct-demand (DD) models, have been used. The DD models typically associate bicycle volumes (e.g., AADB) with socioeconomic and land use characteristics of the area directly surrounding the site of interest. The main challenge with DD models is that a large data set with known bicycle volumes is required to develop the DD model.

More recently, methods utilizing crowdsourced data, such as Strava ( 1 ) and StreetLight ( 2 ) data, have gained popularity. Crowdsourced data from dedicated platforms such as Strava provides detailed trajectory data from individual trips. However, there are challenges with sampling bias (not all bicyclists use these apps), and the sampling rate is unknown and varies temporally and spatially. Other sources of crowdsourced data face these and other challenges, such as correctly distinguishing between different modes (e.g., pedestrian versus bicyclist). Despite the range of methods that have been and are being used in practice, a need for an accurate, cost-effective, and robust method for estimating bicycle volumes remains.

Machine learning (ML) approaches such as random forest (RF), eXtreme Gradient Boosting (XGBoost), and deep neural network methods have been applied to the problem of estimating pedestrian and bicyclist volumes ( 3 ). These ML techniques have advantages over traditional statistical models, including improved ability to capture nonlinearities, no need to specify the functional form of the relationship, and fewer assumptions about the underlying distributions of errors. However, these methods have limitations, including the need for relatively large training data sets, the possibility of large computational loads for training, and the propensity for overfitting when training data sets are not sufficiently large.

There have been tremendous advances recently in artificial intelligence (AI) and, in particular, large language models (LLMs). The LLMs have been applied to a wide range of transportation problems, including:

Various applications in intelligent transportation ( 4 );

Investigating LLMs’ risk perceptions from forward-facing traffic images ( 5 );

Bridging the gap between machine execution and human intention in human–machine interactions using LLMs as a “co-pilot” ( 6 );

Utilizing prompt engineering and LLMs to interact with traffic data imputation systems, enabling users to request information using simple language without needing knowledge of the detailed information or mathematical modeling behind it ( 7 );

Exploring the combination of LLMs and vision-language models to evaluate image-based questions from driving license tests, such as traffic sign image questions, scenario-based questions with images, and a combination of both ( 8 );

Using natural language processing and advanced language models to classify pedestrians’ maneuvers before a crash based on crash data and police reports ( 9 ).

The LLMs are particularly well-suited to handle unstructured data and address complex and nonlinear relationships. They are widely available at low cost and do not require expertise in statistical model fitting to apply. Furthermore, general LLMs, such as ChatGPT, are not trained on a data set from a specific jurisdiction. This is particularly important because it means that there is no need to have a training data set of known AADBs from the local jurisdiction to apply the model.

This leads to the main research question for this study: Can LLMs be used to estimate bicycle volumes with sufficient accuracy to be of practical value? In addition, this study determines how the LLM estimation accuracy is affected by the prompt and input information. The analysis is carried out using bicycle volume counts collected at a set of 158 signalized intersections in the Region of Waterloo, Ontario, Canada, over 12 months.

The remainder of this study is organized as follows. The next section presents a review of the literature, including benchmark models in bicycle volume estimation, as well as relevant studies investigating volume assessments using LLM models. Then, the data used in this study is described, followed by the Methodology, Results, and Conclusion sections. Finally, limitations of this study are discussed.

Literature Review

A large body of literature exists on methods for estimating bicycle volumes. Existing methods can generally be classified into six categories.

Direct calculation in which long-term volumes (typically AADB) are calculated directly from counts from CC sites ( 10 ). Because these methods can only be applied to CC locations and most jurisdictions have a small fraction of their intersections instrumented with CC technologies, the application domain of these methods is limited.

Expansion methods in which STCs are expanded into long-term volumes (typically AADB) using expansion factors calculated from CC sites ( 11 – 13 ). The expansion factors reflect temporal variations in bicycle activity (across days of the week and seasonality). Therefore, these temporal patterns must be identified, the available CC sites must be grouped into temporal patterns (factor groups), and then expansion factors must be computed for each temporal pattern (factor group). Finally, a temporal pattern (and the associated set of expansion factors) must be associated with each STC site. Much research has been performed to: (1) propose methods for defining temporal patterns; (2) recommend a minimum number of CC sites for each factor group; and (3) quantify the effect that duration (e.g., 1, 2, … 7 days) of the STC has on the AADB estimation accuracy.

DD models are statistical models in which the long-term bicycle volume (dependent variable) is estimated as a function of the set of explanatory variables that reflect land use, demographics, and transportation infrastructure characteristics ( 14 – 20 ). The main challenge with DD models is that a relatively large data set of known volumes is required to develop the model. Studies have also explored the spatial transferability of DD models for pedestrians ( 14 ) and bicyclists ( 15 ). However, applying these models naively has led to significant estimation errors. These errors, quantified as the average absolute error between the estimated and actual counts across all sites in a jurisdiction, divided by the average actual count, are from 0.52 to 50 for pedestrian volume ( 14 ) and 0.41 to 31.9 for bicyclist volume ( 15 ).

Network models include travel demand models (TDMs) and other models that incorporate some form of bike trip origin–destination (O–D) demand matrix (or matrices) and then assign these trips to the network via a route assignment algorithm. The advantage of these models is that they can provide estimates for all locations in the network, not just intersections; however, general TDMs are typically used to assess large-scale land use and transportation network decisions over long time horizons. Aggregate four-step TDMs typically have poor accuracy for estimating bicyclist volumes. These models divide the network into traffic analysis zones that typically do not provide the spatial resolution appropriate for bike trips, which are shorter than motorized vehicle trips ( 21 ). Activity-based TDMs are disaggregated but require greater input data and are less frequently used by jurisdictions. Dedicated bike network models can have higher accuracy but typically require a bike trip O–D matrix as an input and a calibrated bike trip assignment model; information that jurisdictions frequently do not have. For example, Bhowmick et al. ( 22 ) developed a bike volume estimation model using 4 years of an activity survey to derive origins and destinations and used GPS trace data from almost 20,000 cycling trips to develop a route choice model. They reported a mean absolute percent error = 25% for bike volume counts across 48 links in Melbourne, Australia.

Crowdsourcing, in which a sample of bicycle trips is observed. These observations are typically over a much longer period of time than STCs, and often include the entire trajectory of the trip, but represent only a sample of all bicycle trips made during the observation period. Then, techniques are applied to expand the sample to obtain an estimate of the total number of trips ( 2 , 23 , 24 ).

Hybrid methods, which are a combination of Methods 1–5. Typically, these hybrid methods are a combination of crowd-sourced data and expansion methods, or the use of ML and deep learning approaches ( 3 , 16 , 23 , 25 , 26 ).
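The transferability error quoted for DD models in the list above (average absolute error across sites divided by the average actual count) can be sketched as follows. The function name and the sample values are illustrative assumptions and are not data from the cited studies.

```python
from typing import Sequence


def normalized_mean_absolute_error(
    estimated: Sequence[float], actual: Sequence[float]
) -> float:
    """Mean absolute error across sites divided by the mean actual count."""
    if len(estimated) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mae = sum(abs(e - a) for e, a in zip(estimated, actual)) / len(actual)
    mean_actual = sum(actual) / len(actual)
    return mae / mean_actual


# Hypothetical counts for three sites.
actual_counts = [100.0, 50.0, 150.0]
estimated_counts = [120.0, 20.0, 160.0]
error = normalized_mean_absolute_error(estimated_counts, actual_counts)
```

A value of 1.0 on this measure means the typical error is as large as the typical count, which gives a sense of scale for the upper ends of the reported ranges.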

Each of the previous categories of methods has unique strengths and limitations. The most significant constraint for most jurisdictions is the lack of data (bike trip O–D matrix, sample of bike trip trajectories, CCs, and/or STCs), which precludes the application of the previous methods.

Recent advances in AI models, including LLMs, and the ease of access to these models make them a potentially effective and efficient method for estimating bicycle volumes at intersections. No previous studies could be found that applied LLMs to this problem; however, LLMs have been applied to similar transportation problems.

Driessen et al. ( 5 ) evaluated ChatGPT-4 Vision’s (GPT-4V) ability to estimate risk levels from forward-facing road traffic images and compared its performance with human responses. Performance was measured by calculating the correlation coefficient between AI-generated risk scores and human assessments across 210 images. The results showed a correlation coefficient of 0.79, demonstrating a high degree of correlation between the AI rankings and human assessments.

Li et al. ( 27 ) applied and evaluated several LLMs, including one that included a spatio-temporal encoder to capture temporal dependencies, to predict bicycle inflows to and outflows from 80 regions (approximately 1 × 1 km) within New York City (NYC). They used NYC-bike data sets from the first 2 weeks of January 2020. For a given region, they provided land use information and the historical inflows and outflows for the 12 30-min periods from noon to 06:00 p.m. and then directed the LLM to predict the inflows and outflows for the next 12 30-min periods (i.e., 06:00 p.m. to midnight). They found that the LLM generally performed well, but the model with the spatio-temporal encoder performed best because this model could capture the temporal dependencies between the 30-min intervals.

Other studies have evaluated the performance of LLMs to: (1) correctly answer driving license test questions ( 8 ); (2) determine pedestrians’ maneuvers before a crash ( 9 ); and (3) predict pedestrian crossing behavior at crosswalks ( 28 ).

In a study focused on estimating pedestrian volumes at intersections, Sobreira and Hellinga ( 29 ) investigated the use of ChatGPT-4V to rank urban intersections based on the level of pedestrian activity when the LLM was only given satellite images of the area directly surrounding the intersections. The results revealed a strong correlation (0.73) between the true rankings and those estimated by ChatGPT-4V. In addition, a novel methodology was proposed and tested, which combined a sample of intersections with known pedestrian volumes and ChatGPT’s rankings to estimate pedestrian volumes at other ranked sites. The study found that ChatGPT could rival and, in some cases, outperform traditional methods such as DD models. This highlights the potential of ChatGPT for estimating bicycle volumes, offering a simpler alternative to traditional statistical and factoring methods.

To the best of the authors’ knowledge, no other studies evaluate the potential usage of LLMs for estimating long-term bicycle volumes at intersections. This study aims to assess the accuracy of LLMs in estimating bicycle volumes (AADB) at intersections using different scenarios as inputs and examines their practicality compared with the existing benchmark models.

Data Description

A total of 256 signalized intersections were selected as target locations in the Region of Waterloo, Ontario, Canada, as a permanently deployed camera-based CC data collection system provided by a single vendor existed at these locations. These counts included minute-by-minute bicycle data for each turning movement and crossing at the intersections. Counts per minute were aggregated to form counts for each calendar day (i.e., 24-h). Count data were acquired from these sites with the objective of obtaining data from a full year, from July 1, 2023, to June 30, 2024. At some of the intersections, the data collection system had not been deployed for the full year period; 170 intersections met the 1-year data availability condition.

The camera-based count system only reports counts when one or more entities (vehicles, pedestrians, or bicyclists) are detected in a 1-min interval. Consequently, the system does not distinguish between periods with no activity (zero counts) and periods when there is an issue with the system (e.g., power loss) and no counts are recorded. Therefore, in this study, a 24-h count is considered valid if there was at least one vehicle, pedestrian, or bicyclist detected in the 24-h period. Sites were included in this study only if the camera-based system was installed for 12 months, and there were no significant data gaps (sites were excluded if no vehicle traffic was detected for more than 4 consecutive days). After applying the above filtering process, 158 sites remained, and these sites formed the data set used in this study.

The true AADB was computed from the observed counts using Equation 1.

AADB_{s} = \frac{\sum_{t = 1}^{d_{s}} c_{s, t}}{d_{s}}

(1)

where

$c_{s, t}$ = daily bicycle count for intersection $s$ on day $t$,

$d_{s}$ = number of valid days for site $s$, and

$AADB_{s}$ = annual average daily bicycle volume for site $s$.
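A minimal sketch of Equation 1 is given below. It assumes that invalid days (days on which the camera system reported no vehicles, pedestrians, or bicyclists) have already been flagged and are represented as `None`; that representation and the function name are illustrative choices, not part of the study.

```python
from typing import Optional, Sequence


def aadb(daily_bike_counts: Sequence[Optional[int]]) -> float:
    """Average the daily bicycle counts over valid days only (Equation 1).

    A value of None marks an invalid day; it is excluded from both the
    sum of counts and the number of valid days d_s.
    """
    valid = [c for c in daily_bike_counts if c is not None]
    if not valid:
        raise ValueError("site has no valid days")
    return sum(valid) / len(valid)


# Hypothetical site with five days of data, one of them invalid.
counts = [12, 0, None, 30, 18]
site_aadb = aadb(counts)
```

Note that a valid day with zero bicycles still counts toward the denominator, whereas an invalid day does not.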

Methodology

The goal of this study is to quantify the performance of an LLM, GPT-4o mini, in estimating the AADB category (low, medium, or high, as defined in Phase 1) of a set of sites and to compare this performance with benchmark methods. The methodology of this study is structured into six phases, as shown in Figure 1. Each of these phases is described in the following subsections.

Figure 1.

Methodology framework.

Phase 1: Site Selection

The level of bicycling activity at intersections varied considerably across the different CC sites. Based on the frequency distribution of the AADB, three levels (categories) of cycling activity were defined: low, medium, and high, as shown in Figure 2. The boundaries for these categories were established as follows: low = AADB ≤ 50; medium = 50 < AADB ≤ 150; and high = AADB > 150 bicycle counts per day. Within these categories, 75 sites were identified as low, 57 as medium, and 26 as high. To ensure equal representation across the three AADB categories, 75 sites (25 per category) were selected using stratified random sampling.
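The category boundaries and the stratified random sampling described above can be sketched as follows. The site identifiers, the seed, and the helper names are hypothetical; the study does not state how the sampling was implemented.

```python
import random
from collections import defaultdict


def categorize(aadb: float) -> str:
    """Map an AADB value to the study's three categories."""
    if aadb <= 50:
        return "Low"
    if aadb <= 150:
        return "Medium"
    return "High"


def stratified_sample(site_aadb: dict, per_category: int, seed: int = 0) -> dict:
    """Draw the same number of sites at random from each category."""
    groups = defaultdict(list)
    for site, value in sorted(site_aadb.items()):
        groups[categorize(value)].append(site)
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return {cat: rng.sample(sites, per_category) for cat, sites in groups.items()}


# Hypothetical sites and AADB values (not study data).
demo_sites = {"A": 10, "B": 40, "C": 50, "D": 75, "E": 150,
              "F": 120, "G": 151, "H": 300, "I": 900}
sample = stratified_sample(demo_sites, per_category=2)
```

With the study's counts (75 low, 57 medium, and 26 high sites), a draw of 25 per category leaves only one high-volume site unselected.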

Figure 2.

Cumulative relative frequency (Cum. Rel. Frequency) distribution of AADB for 158 study sites.

Phase 2: Site Features

Demographic information and variables were sourced from the 2016 census data. To derive the explanatory variables listed in Table 1, open-source data sets were utilized, including OpenStreetMap ( 30 ) for network and land use variables, along with additional open data from the City of Waterloo ( 31 ). These characteristics were selected based on the shortlisted DD models from a study conducted by Azizi Soldouz and Hellinga ( 15 ). Their study examined the spatial transferability of existing DD models in estimating the AADB for a different jurisdiction. The shortlisted models were chosen based on the retrievability of their variables for other jurisdictions. Similarly, in this study, features from these DD models were selected based on two criteria:

Characteristics that are most commonly used in DD models;

Characteristics that are quantitative and easier for jurisdictions to extract.

Table 1.

Site Features

Feature variable description	Variable type	Buffer radius	Minimum	Maximum	Mean
Presence of bicycle facilities	Binary (yes = 1; no = 0)	na	0	1	0.3
Number of lanes approaching the intersection	Count	na	3	31	11.1
Bike facilities length (BL)	Continuous (length of bicycle facilities [km])	800 m	0	6.8	1.2
Bus stops	Count (number of transit stops)	400 m	0	8	1.9
Commercial land use	Continuous (commercial land area [1,000s of m²])	161 m	0	1.17	0.13
Employment	Continuous (number of employment in thousands)	400 m	0	1.652	0.5
Land use mix	Continuous (mix index of 0–1)	800 m	0	1	0.59
Residential area	Continuous (hectares of low-density residential area)	161 m	0	3.33	0.42
Schools	Binary (yes = 1 and no = 0)	400 m	0	1	0.4
Presence of parking entrance	Binary (yes = 1 and no = 0)	50 m	0	1	0.02
Presence of three approaches	Binary (yes = 1 and no = 0)	na	0	1	0.08
Connected node ratio (CNRH)	Continuous (connected node ratio within 800 m buffer)	800 m	0.66	0.99	0.98
Income	Continuous (average annual income [in 1,000s of dollars (Canadian dollars (CAD))])	50 m	0	110.3	41.3

Note: na = not applicable.

Phase 3: Site Satellite Images

In Phase 3, satellite images of all selected sites were extracted using the Google Cloud Platform ( 32 ). These images were utilized either as a stand-alone scenario or in combination with quantitative features for the category estimation process. The images were square-shaped, centered on the coordinates of each intersection, with an area coverage of approximately 70 × 70 m per image. This zoom level effectively captures the geometry of the intersections, such as the number of lanes and the presence of bike lanes. An example of the satellite image used with OpenAI Application Programming Interface (API) models is shown in Figure 3.

Figure 3.

Example of satellite image ( 33 ) used for LLMs examination.

Phase 4: Benchmark Models

Given the novelty of LLMs, particularly in addressing transportation engineering problems, it is crucial to conduct a comprehensive analysis using benchmark models for comparison. Therefore, two benchmark models were developed.

The first benchmark model (BM1) is a Naïve model, which randomly assigns one of three categories (low, medium, or high) to each of the 75 selected sites without considering any associated site characteristics. The BM1 is applied 30 times to each site to create 30 replications.
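A minimal sketch of BM1 follows, assuming that each category is drawn independently with equal probability; the seed and the function name are illustrative.

```python
import random

CATEGORIES = ("Low", "Medium", "High")


def naive_accuracy(true_labels, rng) -> float:
    """Accuracy (%) of one replication of uniformly random guesses."""
    guesses = [rng.choice(CATEGORIES) for _ in true_labels]
    hits = sum(g == t for g, t in zip(guesses, true_labels))
    return 100.0 * hits / len(true_labels)


# 75 sites, 25 per category, as in the study design.
true_labels = ["Low"] * 25 + ["Medium"] * 25 + ["High"] * 25

rng = random.Random(42)  # arbitrary seed, for repeatability only
replications = [naive_accuracy(true_labels, rng) for _ in range(30)]
mean_accuracy = sum(replications) / len(replications)
```

Individual replications vary noticeably around one third, which is why the benchmark is averaged over 30 replications rather than reported from a single draw.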

The second benchmark model (BM2) is an OL model, which estimates the AADB category for each site from the site characteristics. To develop the best OL model, the underlying latent variable $Y^{*}$ is first determined, as shown in Equation 2. Then, the observed outcome $Y$ is determined by the thresholds $μ_{1}$ and $μ_{2}$ , as shown in Equations 3–5. The probability of being in each category is based on these thresholds.

Y^{*} = β_{0} + β_{1} X_{1} + β_{2} X_{2} + \dots + β_{p} X_{p} + ε

(2)

where

$Y^{*}$ = latent continuous variable (unobserved),

$X_{1}, X_{2}, \dots, X_{p}$ = explanatory variables,

$β_{0}, β_{1}, \dots, β_{p}$ = model coefficients, and

$ε$ = error term assumed to follow logistic distribution.

The site features from Table 1 were considered as explanatory variables in the logit model. Explanatory variables most strongly correlated with the dependent variable were selected, while accounting for multicollinearity among the independent variables using the variance inflation factor (VIF < 5). A stepwise regression method was employed to determine the optimal OL model, maximizing the accuracy of category prediction and minimizing the Akaike Information Criterion (AIC).

The observed categories $Y$ are determined as follows.

Y = 1 (low) if Y^{*} \leq μ_{1},

(3)

Y = 2 (medium) if μ_{1} < Y^{*} \leq μ_{2},

(4)

Y = 3 (high) if Y^{*} > μ_{2}

(5)

The performance of all models, including the benchmark models, was expressed in a confusion matrix (Table 2) and summarized by computing the categorization accuracy using Equation 6.

Table 2.

Confusion Matrix: Accuracy Calculation

		Predicted AADB category
		Low	Medium	High
True AADB category	Low	F_L–L	F_L–M	F_L–H
	Medium	F_M–L	F_M–M	F_M–H
	High	F_H–L	F_H–M	F_H–H

Note: AADB = annual average daily bicycle volume; L-L = Low-Low; L-M = Low-Medium; L-H = Low-High; M-L = Medium-Low; M-M = Medium-Medium; M-H = Medium-High; H-L = High-Low; H-M = High-Medium; H-H = High-High.

Accuracy = \frac{F_{L - L} + F_{M - M} + F_{H - H}}{F_{T}} \times 100\%

(6)

where

$F_{x - y}$ = number of sites that are in AADB category $x$ and predicted to be in AADB category $y$, and

$F_{T}$ = sum of $F_{x - y}$ over all AADB categories ($F_{T}$ = 75 in this study).
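The confusion matrix of Table 2 and the accuracy of Equation 6 can be computed as in the following sketch. The label lists are hypothetical and the function names are illustrative.

```python
from collections import Counter

CATEGORIES = ("Low", "Medium", "High")


def confusion_matrix(true_labels, predicted_labels):
    """Rows are true categories, columns are predicted categories."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in CATEGORIES] for t in CATEGORIES]


def accuracy(matrix) -> float:
    """Diagonal sum divided by the total number of sites, in percent."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 * correct / total


# Hypothetical labels for six sites.
true = ["Low", "Low", "Medium", "Medium", "High", "High"]
pred = ["Low", "Medium", "Medium", "Medium", "High", "Low"]
matrix = confusion_matrix(true, pred)
acc = accuracy(matrix)
```

Accuracy treats every misclassification equally, so a High site predicted as Low is penalized no more than one predicted as Medium.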

These benchmarks serve as baselines against which the predictions generated through the OpenAI API are compared. All models were compared using confusion matrices and the accuracy of predicting the correct category for the 75 selected sites, averaged over 30 replications for the LLMs and the Naïve model.

Phase 5: LLMs

The performance of an LLM is expected to be influenced by the information provided to the LLM as part of the prompt. Consequently, different LLM scenarios were created, each with different prompt information. The performance of the LLM for each scenario was quantified using the same metrics (i.e., confusion matrix and accuracy) as described in Phase 4 and used for the benchmark models.

Scenarios were developed to answer the following five key research questions.

Question 1: Do LLMs perform better than randomly assigning volume categories (Naïve model)? This question was addressed by comparing randomly generated categories across all sites with those produced by LLMs, supported by a statistical test.

Question 2: Is the LLM performance comparable with traditional methods, such as statistical approaches and human survey analysis? This question was answered by comparing the outputs of the developed OL model and the human survey with the LLMs.

Question 3: Does providing satellite images improve the LLM AADB categorization accuracy? This question was answered by comparing LLM performance across a range of scenarios in which prompt information varied, with some scenarios including images only, and others combining images with various quantitative site feature variables.

Question 4: Does providing more site feature variables within the LLM prompt improve the LLM AADB categorization accuracy? A set of scenarios was constructed in which the number of feature variables provided as input to the LLM varied.

Question 5: How stable is the LLM performance across different sites (i.e., are results affected by overfitting to the 75 sites used in this study)? To answer this question, a second data set was compiled (labeled the Test 2 data set). This data set again consisted of 75 sites, but a new set of 25 sites (i.e., sites not included in the original data set) was selected for the low and medium categories. There were not enough sites in the high category to select 25 new sites; therefore, most selected sites in this category were identical to the original set. The LLM was then tested using this new set of sites.

All LLM prompts with variables started with the following phrase:

Using the additional information, analyze this intersection’s infrastructure, land use, and surrounding characteristics to estimate its annual average daily bicycle volume. If your estimated volume based on the provided data is below 50 then categorize it as “Low”, if your estimated volume is between 50 and 150 then categorize it as “Medium”, and if your estimation for the bicycle volume is above 150 then categorize it as “High”. Additional data provided.

All LLM models that only included variables used the same format for the above phrase. The only difference was the specific additional data provided to each model.

For LLM scenarios that included satellite imagery, the prompt used the following phrase.

For the model that used only satellite images, the prompt was.

Analyze the infrastructure and land use characteristics of this intersection’s image to estimate its annual average daily bicycle volume. Please categorize the volume into three categories based on your analysis:

“Low” if the estimated volume is below 50 bicycles per day, “Medium” if the estimated volume is between 50 and 150 bicycles per day, and “High” if the estimated volume is above 150 bicycles per day. Please respond with only “Low”, “Medium”, or “High”.

An example prompt used for LLM Scenario 2 is as follows.

Using the image and additional information, analyze this intersection’s infrastructure, land use, and surrounding characteristics to estimate its annual average daily bicycle volume. If your estimated volume based on the image and provided data is below 50 then categorize it as “Low”, if your estimated volume is between 50 and 150 then categorize it as “Medium”, and if your estimation for the bicycle volume is above 150 then categorize it as “High”. Additional data provided: presence of a bike lane as binary variable is “{bike_lane_presence}”, length of bicycle lane in km in 800 m buffer is {bike_lane_length} km, and the number of bus stops in 400 m buffer of the intersection is {bus_stops}. Please respond with only “Low”, “Medium”, or “High” based on your analysis.
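The placeholders in the example prompt can be filled per site as in the sketch below. The template text is copied from the example above; the helper names, the sample values, and the reply parser are assumptions, and the API call itself is intentionally omitted.

```python
from typing import Optional

PROMPT_TEMPLATE = (
    "Using the image and additional information, analyze this intersection’s "
    "infrastructure, land use, and surrounding characteristics to estimate its "
    "annual average daily bicycle volume. If your estimated volume based on the "
    "image and provided data is below 50 then categorize it as “Low”, if your "
    "estimated volume is between 50 and 150 then categorize it as “Medium”, and "
    "if your estimation for the bicycle volume is above 150 then categorize it "
    "as “High”. Additional data provided: presence of a bike lane as binary "
    "variable is “{bike_lane_presence}”, length of bicycle lane in km in 800 m "
    "buffer is {bike_lane_length} km, and the number of bus stops in 400 m "
    "buffer of the intersection is {bus_stops}. Please respond with only "
    "“Low”, “Medium”, or “High” based on your analysis."
)


def build_prompt(bike_lane_presence, bike_lane_length, bus_stops) -> str:
    """Fill the three site-specific placeholders in the template."""
    return PROMPT_TEMPLATE.format(
        bike_lane_presence=bike_lane_presence,
        bike_lane_length=bike_lane_length,
        bus_stops=bus_stops,
    )


def parse_category(reply: str) -> Optional[str]:
    """Hypothetical helper: normalize a reply to one of the three labels."""
    cleaned = reply.strip().strip("\"'“”.").capitalize()
    return cleaned if cleaned in ("Low", "Medium", "High") else None


# Hypothetical site values.
prompt = build_prompt("Yes", 1.2, 3)
```

Returning `None` for any reply that is not exactly one label keeps malformed responses from being silently counted as a category.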

To extract the LLM output using the API, all parameters remained at their default settings except for the maximum number of tokens allowed in the generated response (max tokens), which was set to 500 ( 34 ). Each token corresponds to approximately four characters, or about 75% of a word in English ( 35 ).

Phase 6: Accuracy of AADB Categorization by Humans

In addition to the two benchmark models described in Phase 4, a survey was conducted asking human participants to categorize each of the 75 sites as Low, Medium, or High AADB. The participants were given the following information about each site: whether or not a school was located within a 400 m radius of the intersection and the total length of bicycle facilities (on-street bike lanes and off-street bike or multiuse paths) within an 800 m radius of the intersection. These two site features were selected because they were the features associated with the best-performing LLM (this is described in the Results section of this study). To align with the replication process used for the LLMs, 30 participants completed the survey. The survey was designed using the open-source tool Streamlit ( 36 ). Because the survey did not collect any personally identifying information, no consent form was required from the participants.

Initial pilot testing of the survey revealed that participants were not clear on what “total length of bicycle facilities” meant. The survey design went through several iterations to determine the best way to describe this feature variable. The final survey consisted of a brief introduction explaining that the survey did not collect any personally identifiable information and that participants were asked to categorize the level of bicycle activity for each of the 75 intersections. Participants were told the intersections were in the Region of Waterloo, Ontario, Canada, but were not told the names of the intersections or given a site map. Figure 4 shows the image given to participants to illustrate the total length of bicycle facilities within an 800-m radius from the intersection. Survey participants were only given the value of the two site feature variables for each site and were asked to select one of the three AADB volume categories (Low: AADB ≤ 50; Medium: 50 < AADB ≤ 150; High: AADB > 150). A sample portion of the survey is shown in Figure 5.

Figure 4.

Image provided to survey participants to illustrate the “total length of bike facilities” feature variable.

Figure 5.

Sample of the survey questionnaire.

Results and Discussion

This section is organized into three subsections. The first subsection presents the performance of the Naïve benchmark model and the development and performance of the OL benchmark model. The second subsection presents the performance results of the LLMs. The third subsection presents the results of the survey, demonstrating the performance of humans as a further benchmark for assessing the performance of the LLMs.

Comparison Models

Naïve Model (BM1)

The Naïve model (BM1) randomly selects an AADB category for each site and is presented as the minimum accuracy threshold that must be surpassed by any proposed model (including LLMs). In this study, the AADB is categorized into three levels, and there are an equal number of sites (25) in each AADB category. The accuracy of each application of the Naïve model varies because of randomness; however, the average accuracy of the Naïve model should be 33%. The application of the Naïve model 30 times resulted in an average accuracy = 35%.
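The gap between the expected 33% and the observed 35% can be put in context with a short calculation. It assumes (this is not stated explicitly in the study) that each site's category is drawn independently with probability $p = 1/3$, so that the number of correct guesses in one replication over $N = 75$ sites is binomial. The accuracy $A$ of a single replication then has

E[A] = p \approx 33.3\%, \qquad SD[A] = \sqrt{\frac{p (1 - p)}{N}} = \sqrt{\frac{(1/3)(2/3)}{75}} \approx 5.4\%

and the mean over $R = 30$ replications has a standard error of

SE[\bar{A}] = \frac{SD[A]}{\sqrt{R}} \approx \frac{5.4\%}{\sqrt{30}} \approx 1.0\%

Under these assumptions, the observed average of 35% lies within about two standard errors of the expected value, which is consistent with random variation.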

OL Benchmark Model (BM2)

The OL model was calibrated on the study data set. Equations 7–10 define the OL model: Equation 7 gives the continuous latent variable, and Equations 8–10 give the corresponding probability for each category. The selection criteria were to minimize the Akaike Information Criterion (AIC) while maximizing the accuracy from the confusion matrix. The correlation matrix for the site features is provided in Table 3. From this table, site features can be identified that are highly correlated with each other (collinear) and those that are strongly correlated with the dependent variable AADB (e.g., bike facilities length [BL], presence of bike lane, number of bus stops, employment [Emp], and schools). The best-fit OL model included just two independent variables, BL and Emp, and two thresholds (τ_Low/Medium and τ_Medium/High). Model fit and coefficient values are provided in Table 4.

$$Y^{*} = X_{1} \beta_{1} + X_{2} \beta_{2} + \dots + X_{n} \beta_{n} + \varepsilon \tag{7}$$

$$P (Y_{i} > j) = \frac{\exp \left(\sum_{i=1}^{n} X_{i} \beta_{i} - \tau_{j}\right)}{1 + \exp \left(\sum_{i=1}^{n} X_{i} \beta_{i} - \tau_{j}\right)}; \quad j = 1, 2, 3, \dots, T - 1 \tag{8}$$

$$P (Y_{i} = 1) = 1 - \frac{\exp \left(\sum_{i=1}^{n} X_{i} \beta_{i} - \tau_{1}\right)}{1 + \exp \left(\sum_{i=1}^{n} X_{i} \beta_{i} - \tau_{1}\right)} \tag{9}$$

$$P (Y_{i} = j) = \frac{\exp \left(\sum_{i=1}^{n} X_{i} \beta_{i} - \tau_{j - 1}\right)}{1 + \exp \left(\sum_{i=1}^{n} X_{i} \beta_{i} - \tau_{j - 1}\right)} - \frac{\exp \left(\sum_{i=1}^{n} X_{i} \beta_{i} - \tau_{j}\right)}{1 + \exp \left(\sum_{i=1}^{n} X_{i} \beta_{i} - \tau_{j}\right)}; \quad j = 2, 3, \dots, T - 1 \tag{10}$$

where

$Y^{*}$ = latent (unobserved) continuous variable,

$X_{i}$ = observed explanatory variables,

$β_{i}$ = log-odds coefficient,

$ε$ = error term,

$n$ = total number of explanatory variables,

$P (Y_{i})$ = estimated probability for a given threshold,

$τ_{j}$ = estimated coefficient for threshold $j$ , and

$T$ = total number of ordered categories.
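The category probabilities in Equations 8–10 can be sketched numerically as follows. This is an illustrative sketch only: the function names are hypothetical, the coefficient and threshold values in the example are placeholders rather than the calibrated values in Table 4, and the probability of the highest category is taken as $P(Y > T - 1)$, which follows from Equation 8 but is not written out as a separate equation in the text.

```python
import math


def logistic(z: float) -> float:
    """exp(z) / (1 + exp(z)), written in an equivalent form."""
    return 1.0 / (1.0 + math.exp(-z))


def ordered_logit_probs(x, beta, thresholds):
    """Return [P(Y=1), ..., P(Y=T)] for one site.

    x          -- explanatory variable values X_i
    beta       -- log-odds coefficients beta_i
    thresholds -- increasing thresholds tau_1 .. tau_{T-1}
    """
    if len(x) != len(beta):
        raise ValueError("x and beta must have the same length")
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be in increasing order")
    utility = sum(xi * bi for xi, bi in zip(x, beta))
    # Equation 8: P(Y > j) for j = 1 .. T-1
    exceed = [logistic(utility - tau) for tau in thresholds]
    probs = [1.0 - exceed[0]]                      # Equation 9
    for j in range(1, len(exceed)):
        probs.append(exceed[j - 1] - exceed[j])    # Equation 10
    probs.append(exceed[-1])                       # highest category
    return probs


# Hypothetical inputs: 2.0 km of bike facilities and 300 jobs nearby.
example = ordered_logit_probs(x=[2.0, 300.0], beta=[0.8, 0.002], thresholds=[1.5, 3.0])
predicted_category = ["Low", "Medium", "High"][example.index(max(example))]
```

The predicted category is the one with the highest probability; by construction the probabilities sum to one for any increasing set of thresholds.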

Table 3.

Correlation Matrix

	AADB	BL	PBL	NBS	NL	PPE	3L	Inc	Emp	Res
Bike facilities length (BL [km])	0.63	1.00	0.41	0.44	−0.18	0.21	−0.02	−0.20	0.32	−0.34
Presence of bike lane (PBL)	0.39	0.41	1.00	−0.17	−0.15	−0.15	−0.09	−0.24	−0.11	−0.10
Number of bus stops (NBS)	0.31	0.44	−0.17	1.00	−0.05	0.13	−0.05	−0.02	0.39	−0.26
Number of lanes (NL)	−0.02	−0.18	−0.15	−0.05	1.00	−0.19	−0.17	−0.01	−0.24	−0.26
Presence of parking entrance (PPE)	0.06	0.21	−0.15	0.13	−0.19	1.00	−0.05	0.04	0.14	−0.09
Presence of three leg intersection (3L)	−0.17	−0.02	−0.09	−0.05	−0.17	−0.05	1.00	−0.02	0.02	0.12
Average income (in 50 m) (Inc)	−0.17	−0.20	−0.24	−0.02	−0.01	0.04	−0.02	1.00	−0.01	−0.03
Num employment (in 400 m) (Emp)	0.36	0.32	−0.11	0.39	−0.24	0.14	0.02	−0.01	1.00	0.01
Low-density residential area (in 161 m) (Res)	−0.26	−0.34	−0.10	−0.26	−0.26	−0.09	0.12	−0.03	0.01	1.00
Presence of school	0.27	0.35	0.06	0.22	−0.13	0.08	0.06	−0.31	0.10	0.04
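The screening step described above, identifying the features most strongly correlated with AADB, can be sketched by ranking the AADB column of Table 3 by absolute correlation. The values below are copied from Table 3; the variable names and the choice of a top-five cut-off are illustrative assumptions.

```python
# Correlation of each site feature with AADB (first column of Table 3).
aadb_correlation = {
    "Bike facilities length (BL)": 0.63,
    "Presence of bike lane (PBL)": 0.39,
    "Number of bus stops (NBS)": 0.31,
    "Number of lanes (NL)": -0.02,
    "Presence of parking entrance (PPE)": 0.06,
    "Presence of three leg intersection (3L)": -0.17,
    "Average income (Inc)": -0.17,
    "Num employment (Emp)": 0.36,
    "Low-density residential area (Res)": -0.26,
    "Presence of school": 0.27,
}

# Rank by strength of association, ignoring the sign.
ranked = sorted(aadb_correlation.items(), key=lambda kv: abs(kv[1]), reverse=True)
top_five = [name for name, _ in ranked[:5]]
```

Under this ranking, the five strongest features are the same five named in the text as candidate explanatory variables.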

Table 4.

Ordered Logit Model Calibration Results

Statistic	Value
Dependent variable	AADB	na
Log-likelihood	−61.3	na
AIC	130.61	na
Independent variable	Coefficient	p-value
Bike facilities length (km)	0.781	0.000
Num employment (in 400 m)	0.002	0.013
τ_Low/Medium	1.433	0.012
τ_Medium/High	0.813	0.000

Note: AADB = annual average daily bicycle volume; AIC = Akaike Information Criterion; τ_Low/Medium and τ_Medium/High = thresholds; na = not applicable.
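As a consistency check on Table 4, the AIC can be recomputed from the reported log-likelihood. The sketch below assumes the standard definition AIC = 2k − 2 ln L and assumes k = 4 estimated parameters (two coefficients and two thresholds); neither assumption is stated explicitly in the text.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


# Log-likelihood from Table 4; k = 4 is an assumption (2 coefficients + 2 thresholds).
table4_aic = aic(log_likelihood=-61.3, n_params=4)
```

Under these assumptions the result agrees with the reported AIC of 130.61 to within the rounding of the log-likelihood.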

Categorization by Humans (H1)

As mentioned in the Methodology section, a survey was conducted in which respondents were given values for two site features (total length of bike facilities within 800 m of the site and whether or not a school was located within 400 m of the site) and asked to categorize the level of bicycle volume at the site. Respondents were asked to do this categorization for 75 sites. The survey was distributed online, with a brief explanation of its purpose. The survey took approximately 10–15 min to complete. The target sample size was 30, to match the number of LLM replications. No personal information was collected; only estimation results for all 75 sites were gathered, similar to the LLM tests.

The average accuracy of the 30 survey respondents was 52% (ranging from approximately 40% to 60%), which is much better than the Naïve model.

LLM Models

Based on the explanatory variables described in the Site Features section, the correlation analysis in the previous section, and features found to be statistically significant in the logit model, various models were tested using the OpenAI API. Details of the variables selected for each model are provided in Table 5. As explained in the Methodology section, three main hypotheses were tested for the LLM models to examine their potential use in providing the highest average accuracy over 30 replications and their compatibility with traditional benchmark models.

Table 5.

Model Scenario

		Benchmark models		LLM (ChatGPT4) models								Humans
	Scenario code	BM1	BM2	LLM1	LLM2	LLM3	LLM4	LLM5	LLM6	LLM7	LLM8	H1
	Scenario name	Naïve	Logit	Images only	Images + DD	Images + Emp + BL	Emp + BL	All variables	Four variables	Three variables	Two variables	Survey
Site characteristics (features)	Presence of bicycle facilities	na	x	na	x	na	na	x	x	x	na	na
	Number of lanes approaching the intersection	na	x	na	na	na	na	x	na	na	na	na
	Bike facilities length (BL)	na	X	na	x	x	x	x	x	x	x	x
	Bus stops	na	x	na	x	na	na	x	na	na	na	na
	Commercial land use	na	x	na	na	na	na	x	na	na	na	na
	Employment (Emp)	na	X	na	na	x	x	x	x	na	na	na
	Land use mix	na	x	na	na	na	na	x	na	na	na	na
	Residential area	na	x	na	na	na	na	x	na	na	na	na
	Schools	na	x	na	na	na	na	x	x	x	x	x
	Presence of parking entrance	na	x	na	na	na	na	x	na	na	na	na
	Presence of three approaches	na	x	na	na	na	na	x	na	na	na	na
	Connected node ratio (CNRH)	na	x	na	na	na	na	x	na	na	na	na
	Income	na	x	na	na	na	na	x	na	na	na	na
	Intersection satellite image	na	na	x	x	x	na	na	na	na	na	na

Note: x = feature is input to the model; X = statistically significant (at the 95% confidence level) feature in the ordered logit (OL) model; BM1 = Naïve benchmark model; BM2 = OL benchmark model; H1 = human categorization; LLM = large language model; na = not applicable.

The average accuracy, based on confusion matrices, was calculated for all LLM models and the comparison models, as shown in Figure 6. The highest accuracy was achieved by the benchmark (OL) model, with 64%. The second-best performance came from the ChatGPT model (accuracy = 60%), which utilized only two variables (bicycle facility length and presence of schools) and no images (LLM8).
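Accuracy based on a confusion matrix is the share of sites on the diagonal, that is, sites whose predicted category matches the observed category. The following sketch shows the calculation; the function names are hypothetical and the example matrix is invented for illustration, not taken from the study results.

```python
CATEGORIES = ["Low", "Medium", "High"]


def confusion_matrix(true_labels, predicted_labels):
    """Rows are observed categories, columns are predicted categories."""
    index = {c: i for i, c in enumerate(CATEGORIES)}
    matrix = [[0] * len(CATEGORIES) for _ in CATEGORIES]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Correctly classified sites (diagonal) divided by all sites."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Hypothetical matrix for 75 sites (25 per observed category).
example_matrix = [
    [15, 7, 3],
    [8, 12, 5],
    [4, 6, 15],
]
example_accuracy = accuracy(example_matrix)  # 42 of 75 sites correct
```

Averaging this value over the 30 replications of a model gives the average accuracy compared across models.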

Figure 6.

Accuracy comparison across all models.

Statistical tests were used to determine whether the differences in average accuracy between different pairs of models are statistically significant. The F-test was used to compare the variances. If the variances were statistically different, then the independent two-tailed t-test assuming unequal variances was used; otherwise, the independent two-tailed t-test assuming equal variances was used.
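The two-step procedure (variance comparison, then the matching t-test) can be sketched with the Python standard library. This is a sketch under stated assumptions rather than the authors' code: the function names are hypothetical, and the critical F value is left as an input that the analyst must look up for the chosen significance level and degrees of freedom.

```python
import math
import statistics


def variance_ratio(a, b):
    """F statistic: larger sample variance divided by the smaller one."""
    va, vb = statistics.variance(a), statistics.variance(b)
    if min(va, vb) == 0:
        raise ValueError("F ratio is undefined when a sample variance is zero")
    return max(va, vb) / min(va, vb)


def pooled_t(a, b):
    """t statistic assuming equal variances (pooled variance)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))


def welch_t(a, b):
    """t statistic assuming unequal variances."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / na + vb / nb)


def compare_models(a, b, f_critical):
    """Choose the t-test variant from the F-test outcome and return (variant, t)."""
    if variance_ratio(a, b) > f_critical:
        return "unequal", welch_t(a, b)
    return "equal", pooled_t(a, b)
```

In use, `a` and `b` would each hold the 30 replication accuracies of one model, and the absolute value of the returned t statistic would be compared with the critical t-value.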

Table 6 presents the t-value results of independent two-tailed t-tests for each pair of models. For all comparisons, the critical t-value = 2.045. The three cells with bolded text indicate model pairs for which the difference is not statistically significant; all other results are statistically significant at the 95% confidence level. The findings indicate that all LLM models without images outperform the models with images, and these differences are statistically significant. With the exception of LLM1 (Images Only), all other models have average accuracies that are statistically better (t-values are positive and greater than 2.045) than the Naïve model (BM1). The LLM1 exception is surprising because the Naïve model represents a low bar for performance. More interesting is that the average accuracy of the human participants was (statistically) higher than that of some of the LLM models, specifically LLM1 (Images Only) and LLM2 (Images + DD), both of which include images.

Table 6.

Student t-values from Independent Two-Tailed t-tests for Each Pair of Models

	BM1	BM2	H1	LLM1	LLM2	LLM3	LLM4	LLM5	LLM6	LLM7
BM2 (logit)	21.13	0.00	−12.55	−58.67	−35.73	−24.05	−20.37	−13.41	−25.71	−13.83
H1 (survey)	9.66	−12.55	0.00	−13.4	−10.24	−6.22	0.23	5.05	2.24	5.55
LLM1 (images only)	1.11^a	−58.67	−13.4	0.00	−2.71	−6.69	−19.75	−29.52	−28.17	−31.75
LLM2 (images + DD)	2.53	−35.73	−10.24	−2.71	0.00	3.85	−13.81	−20.92	−18.54	−22.13
LLM3 (Images + Emp + BL)	5.05	−24.05	−6.22	−6.69	3.85	0.00	−8.06	−13.83	−11.2	−14.62
LLM4 (Emp + BL)	11.13	−20.37	0.23	−19.75	−13.81	−8.06	0.00	6.83	−2.98	−7.68
LLM5 (all variables)	15.09	−13.41	5.05	−29.52	−20.92	−13.83	6.83	0.00	5.04	−0.65
LLM6 (four variables)	13.19	−25.71	2.24	−28.17	−18.54	−11.2	−2.98	5.04	0.00	−6.08
LLM7 (three variables)	15.56	−13.83	5.55	−31.75	−22.13	−14.62	−7.68	−0.65	−6.08	0.00
LLM8 (two variables)	17.71	−10.98	8.17	−40.08	−26.95	−18.09	−12.01	−4.8	−11.8	−4.36

Note: LLM = large language model; Emp = employment; BL = bicycle facilities length; DD = direct-demand; BM1 = Naïve benchmark model; BM2 = OL benchmark model; H1 = human categorization.

Bolded values indicate the cases for which the difference in means was not statistically significant. In all other cases, the difference in means is statistically significant.

The best-performing LLM relies solely on two inputs: BL and the presence of schools. An examination of the correlation matrix in Table 3 shows that both features are positively correlated with AADB (correlation coefficients of 0.63 and 0.27, respectively), and bicycle facility length was a statistically significant variable in the OL model. The presence of schools is frequently a statistically significant variable in DD models for estimating pedestrian and bicycle volumes; therefore, its presence is not surprising. In addition, this study’s jurisdiction contains three post-secondary educational institutions (two universities and one college), which are likely to be generators and attractors of bicycle trips.

Five research questions were posed in the Methodology section. The model application results are used to answer these five questions.

Question 1: Does the LLM Perform Better than Randomly Assigning Volume Categories (Naïve Model)?

The results show that all tested LLMs, except the LLM with only satellite images as input (LLM1), outperform the Naïve model in accuracy, and the difference is statistically significant.

Question 2: Is the LLM Performance Comparable with Traditional Methods, such as Statistical Approaches and Human Survey Analysis?

A comparison with a traditional statistical method, the OL model (BM2), indicates that the best-performing LLMs can achieve similar levels of accuracy. For instance, LLM8, using only two variables (BL and the presence of schools), achieves an accuracy of 60%, and BM2, using two variables (BL and Emp), reaches an accuracy of 64%. A comparison between the LLM and the human survey (H1) shows that LLMs can outperform humans in accuracy. For example, LLM8 achieved an average classification accuracy of 60%, and the average accuracy from the human survey was only 52%. This is a positive outcome, particularly because the survey design closely mirrored the structure of the input prompts provided to the LLMs.

Question 3: Does Providing Satellite Images Improve LLM AADB Categorization Accuracy?

The results shown in Figure 6 and listed in Table 6 indicate that LLMs without images (LLMs 4–8) outperformed those with images (LLMs 1–3). The differences in accuracy for the LLM models without images were statistically significant, and their average accuracy was higher than that of the models with images. Therefore, the results indicate that providing LLMs with satellite images of the sites does not improve the average accuracy of category classifications. This result is somewhat surprising because it might be expected that the spatial distribution of site characteristics would be helpful in estimating the AADB category. It was hypothesized that the LLM could not accurately extract features from the image, and that the errors in the feature extraction had a detrimental effect on AADB estimation accuracy. This hypothesis was tested by providing an image (the same image size as used in the rest of this study) to the LLM and prompting the LLM to assess the image and provide two features: (1) the number of bus stops present in the image; and (2) the length of bicycle facilities (e.g., bike lanes and multiuse paths) in the image (in km). This analysis was repeated for five randomly chosen sites in the data set. The results (Table 7) confirmed the hypothesis that the LLM is unable to accurately extract important land use features from the provided satellite images. Bicycle facilities exist at all five tested sites; however, the LLM detected bike facilities at only one of the sites, and even at this site, it incorrectly determined the length of the facility. There were bus stops at all five sites, but the LLM detected bus stops at only three of the five sites, and at all three sites, the extracted number of bus stops was incorrect.

Table 7.

Results of Large Language Model Feature Extraction from Satellite Images

	Bicycle facility length (km)		Number of bus stops
Satellite image	True value	LLM	True value	LLM
1	0.052	0	3	2
2	0.065	0	1	2
3	0.148	0	3	2
4	0.095	0.069	2	0
5	0.097	0	2	0
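The counts quoted in the discussion of Table 7 can be reproduced directly from the table. The sketch below uses the values in Table 7; the variable names and the mean absolute error summary are illustrative additions, not statistics reported by the authors.

```python
# (site, true bike facility length km, LLM length km, true bus stops, LLM bus stops)
table7 = [
    (1, 0.052, 0.0, 3, 2),
    (2, 0.065, 0.0, 1, 2),
    (3, 0.148, 0.0, 3, 2),
    (4, 0.095, 0.069, 2, 0),
    (5, 0.097, 0.0, 2, 0),
]

# Sites where the LLM reported any bicycle facility / any bus stop.
bike_detected = sum(1 for _, _, llm_len, _, _ in table7 if llm_len > 0)
bus_detected = sum(1 for _, _, _, _, llm_bus in table7 if llm_bus > 0)

# Sites where the extracted bus stop count equals the true count.
bus_exact = sum(1 for _, _, _, true_bus, llm_bus in table7 if true_bus == llm_bus)

# Mean absolute error of the extracted facility length (km).
mean_abs_length_error = sum(abs(t - l) for _, t, l, _, _ in table7) / len(table7)
```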

Question 4: Does Providing More Site Feature Variables Within the LLM Prompt Improve LLM AADB Categorization Accuracy?

The results indicate that LLMs given a larger number of explanatory variables were less accurate than those given fewer variables. This is somewhat surprising and might be related to how LLMs interpret input data: they seem to rely more on context-based information than on the numerical values of the explanatory variables. In addition, adding more variables might make interpretation more complex for LLMs.

Question 5: How Stable is the LLM Performance Across Different Sites (i.e., Are Results Affected by Overfitting to the Specific 75 Sites Used in this Study)?

This question is difficult to answer definitively without compiling an entirely separate data set, which was unavailable. Instead, overfitting was evaluated using another set of 75 sites from the same jurisdiction (Test 2 data set, as described in the Methodology section). Only a subset of the LLMs (LLM5–8) were applied to the Test 2 data set because these models performed much better than the other LLMs. The findings show that overfitting was not an issue, as very similar results (and in some cases, higher average accuracy) were achieved by the LLM models on the Test 2 data set. An independent two-tailed t-test was conducted for each model to compare its mean accuracy on the Test 2 data set with its mean accuracy on the original data set. For LLM5 and LLM6, there was no statistically significant difference in mean accuracy between the two data sets. However, for LLM7 and LLM8, the mean accuracies were significantly different, with a higher mean accuracy for the Test 2 data set. This finding may be attributed to the sample size of 30 and to the fact that both data sets share the same sites for the High bicycle volume category. A comparison of prediction accuracy between the Test 2 and original data sets is given in Table 8.

Table 8.

Accuracy Comparison Between Original and Test 2 Data Sets

	Classification accuracy
	Test 2 data set	Original data set	Student t-value	p-value
LLM5 (all variables)	56%	57%	−0.089	0.422
LLM6 (four variables)	55%	54%	1.769	0.083
LLM7 (three variables)	61%	58%	3.828	0.000
LLM8 (two variables)	62%	60%	2.89	0.006

Note: LLM = large language model.

Conclusions and Recommendations

This study aimed to answer five specific research questions as discussed in the previous section. The main conclusions from this work are as follows.

ChatGPT exhibited statistically significantly higher average bicycle volume classification accuracy than the Naïve model (average accuracy = 35%) for almost all input data combinations examined (the only exception being the LLM using only satellite images as input).

The best-performing LLM (LLM8) required only two site characteristics as inputs (total length of bike facilities within 800 m of the intersection and the presence of a school within 400 m of the intersection) and provided an average classification accuracy of 60%, which was significantly better than the Naïve model (35%) and human survey respondents (52%). Furthermore, this LLM performed almost as well as the statistical model (OL model) developed on this same data set (accuracy = 64%). The findings of this study show that the LLM model can perform almost as well as traditional DD models. This is promising because, compared with traditional DD models, the LLM model requires less time and cost for development, as well as for data collection and analysis.

An examination of the effect that inputs have on the LLM performance indicated that the use of quantitative inputs, rather than satellite images, improved classification accuracy. This suggests that ChatGPT is unable to accurately extract relevant land use, demographic, or transportation network characteristics directly from satellite images. Furthermore, providing a higher number of quantitative features (variables) as input to the LLM tended to lower the average accuracy, with reductions from 2.5% to 6.3% and an average decrease of approximately 4% compared with the best-performing model, which uses only two variables. All models with more variables were statistically less accurate than the two-variable model. Efforts were made to determine whether results were affected by overfitting. The analysis showed that overfitting is not an issue; however, limitations in the number of sites for which AADB data were available precluded the compilation of a completely different data set on which to test for overfitting.

Overall, LLMs present substantial promise as a tool for estimating bicycle volume levels. The models do not require extensive input data, and they are readily available at low cost.

This study and its findings have several limitations.

The variables used in this study are not guaranteed to provide the highest average accuracy when used in other jurisdictions.

The number of AADB categories and the associated AADB boundary values to define the categories were developed relative to the jurisdiction used for this study. Therefore, they are not necessarily applicable to other jurisdictions with different characteristics.

It is recommended that the use of LLMs for estimating bicycle volume levels be further evaluated as follows.

Applying LLMs to data sets from other jurisdictions to examine the spatial transferability of LLMs and the optimal set of inputs.

Exploring the ability of LLMs to classify bicycle volume levels at locations other than intersections, which is the focus of this study (e.g., mid-block locations or links).

Examining the potential to improve LLM classification accuracy by training a special-purpose LLM.

Evaluating the performance of LLM classification compared with traditional machine learning (ML) approaches, such as XGBoost or random forest (RF).

Footnotes

Acknowledgements

The authors gratefully acknowledge the Region of Waterloo, Ontario, Canada, for providing permission to use the bicycle count data and for providing rich open data portals that were essential sources of information for this research and Miovision for providing access to the bicycle data. The work in this study reflects the views of the authors, and there is no explicit or implicit endorsement by any of the aforementioned jurisdictions or companies.

Author Contributions

The authors confirm contribution to the paper as follows: study conception and design: Hellinga, Azizi Soldouz; data collection: Azizi Soldouz; analysis and interpretation of results: Azizi Soldouz, Hellinga; manuscript preparation: Azizi Soldouz, Hellinga. All authors reviewed the results and approved the final version of the manuscript.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors gratefully acknowledge financial support from the Natural Sciences and Engineering Research Council of Canada via the Discovery Grant Program (funding reference number 2022-03275) and Transport Canada’s Enhanced Road Safety Transfer Payment Program. The research was carried out by the authors, and no endorsement of the methods or findings by funding agencies is claimed or implied.

ORCID iDs

Sina Azizi Soldouz

Bruce Hellinga


Estimating Bicycle Volume Levels at Urban Intersections using Large Language Models