Abstract
In projects centered on rare-event data, understanding the data is considerably harder because too few observations exist to support insight and analysis. This is particularly true of traffic crash occurrence, where positive events (crashes) are rare and, in most cases, no data set exists for negative events (non-crashes). One way to increase the available data is negative sampling, the process of creating a negative event based on the absence of a positive event. In this work, four negative sampling techniques are presented with varying ratios of negative to positive data. The techniques are based on spatial data, temporal data, and a mixture of the two, with the data ratios acting as class-balancing tools. The best-performing model used a negative sampling technique that shifted temporal information with an even 50/50 data split, achieving an F-1 score, the harmonic mean of precision and recall, of 93.68. These results are promising for Intelligent Transportation Systems (ITS) applications, which could flag potential crash locations across an entire area so that proactive measures can be put in place.
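As an illustration only, temporal-shift negative sampling and the F-1 computation might be sketched as follows. The abstract does not specify the procedure, so everything here is an assumption rather than the authors' method: the record layout (a location identifier paired with an hourly timestamp), the function names `temporal_shift_negatives` and `f1_score`, the `ratio` and `max_shift_hours` parameters, and the rule that a shifted record counts as a non-crash only if no crash was recorded at that location and hour.

```python
import random
from datetime import datetime, timedelta


def temporal_shift_negatives(crashes, ratio=1.0, max_shift_hours=72, seed=0):
    """Create hypothetical non-crash records by shifting crash timestamps.

    crashes: list of (location_id, datetime) tuples, assumed hourly.
    ratio:   negatives per positive; 1.0 gives an even 50/50 split.
    """
    rng = random.Random(seed)
    observed = set(crashes)
    target = int(round(len(crashes) * ratio))
    # Every non-zero hour offset within the allowed window.
    shifts = [h for h in range(-max_shift_hours, max_shift_hours + 1) if h != 0]
    negatives = set()
    attempts = 0
    # Cap the attempts so the loop ends even if candidates run out.
    while len(negatives) < target and attempts < 100 * max(target, 1):
        attempts += 1
        location, when = rng.choice(crashes)
        candidate = (location, when + timedelta(hours=rng.choice(shifts)))
        # Keep the candidate only if no crash is recorded at that place and hour.
        if candidate not in observed:
            negatives.add(candidate)
    return sorted(negatives)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, on a 0-1 scale."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up records, for illustration only.
crashes = [
    ("segment_a", datetime(2020, 1, 6, 8)),
    ("segment_b", datetime(2020, 1, 7, 17)),
    ("segment_c", datetime(2020, 1, 9, 23)),
]
negatives = temporal_shift_negatives(crashes, ratio=1.0)
```

With `ratio=1.0` the sketch returns one negative per crash, which corresponds to an even 50/50 split. Note that `f1_score` returns a value between 0 and 1, whereas the score reported above appears to be on a 0 to 100 scale.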
