Abstract
Information retrieval (IR) methods seek to locate meaningful documents in large collections of textual and other data. Few studies apply these techniques to discover descriptions in historical documents for physical geography applications. This absence is noteworthy given the use of qualitative historical descriptions in physical geography and the amount of historical documentation available online. This study therefore introduces an IR approach for finding meaningful and geographically resolved historical descriptions in large digital collections of historical documents. Presenting a biogeography application, it develops a ‘search engine’ that uses a boosted regression trees (BRT) model to help find forest compositional descriptions (FCDs) based on textual features in a collection of county histories. The study then investigates whether FCDs corroborate existing estimates of the relative abundances and spatial distributions of tree taxa from presettlement land survey records (PLSRs) and existing range maps. The BRT model is trained using portions of text from 458 US county histories. When evaluated on a spatially independent test dataset, the model helps discover 97.5% of FCDs while reducing the amount of text to search through to 0.3% of the total. The prevalence rank of taxa in FCDs (i.e. the number of FCDs in which a taxon is mentioned at least once, divided by the total number of FCDs, then ranked) is strongly related to the abundance rank in PLSRs. Patterns in species mentions from FCDs generally match relative abundance patterns from PLSRs. However, analyses suggest that FCDs are biased towards large and economically valuable tree taxa and against smaller taxa. Overall, the study demonstrates the potential of IR approaches for developing novel datasets over large geographic areas, corroborating existing historical datasets, and providing spatial coverage of historic phenomena.
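The prevalence rank defined above can be illustrated with a minimal Python sketch. It is not the study's code: the function name `prevalence_ranks`, the toy FCDs, and the alphabetical tie-breaking rule are all assumptions made for illustration. Each FCD is represented as a list of the taxa it mentions, and a taxon is counted at most once per FCD.

```python
from collections import Counter


def prevalence_ranks(fcds):
    """Return (prevalence, ranks) for taxa mentioned in a list of FCDs.

    fcds: list of iterables, each holding the taxa named in one FCD.
    Prevalence = (number of FCDs mentioning the taxon at least once)
                 / (total number of FCDs).
    """
    total = len(fcds)
    counts = Counter()
    for taxa in fcds:
        # set() ensures repeated mentions within one FCD count only once
        counts.update(set(taxa))
    prevalence = {taxon: n / total for taxon, n in counts.items()}
    # Rank 1 = most prevalent; ties broken alphabetically (an assumption)
    ordered = sorted(prevalence, key=lambda t: (-prevalence[t], t))
    ranks = {taxon: i + 1 for i, taxon in enumerate(ordered)}
    return prevalence, ranks


# Hypothetical FCDs, invented purely for illustration
example_fcds = [
    ["oak", "hickory", "oak"],
    ["oak", "maple"],
    ["maple", "oak", "beech"],
    ["hickory", "oak"],
]
prevalence, ranks = prevalence_ranks(example_fcds)
```

The resulting ranks could then be compared with PLSR abundance ranks using a rank correlation; the abstract does not name the statistic the study uses.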
