Abstract
In this study, we develop and test a comparative framework that uses Multimodal Large Language Models (MLLMs) to examine and partially bridge the gap between Street View Imagery (SVI) and User-Generated Content (UGC), enabling quantitative, spatially explicit mining of urban perception in public parks. Understanding how individuals perceive urban spaces is central to inclusive urban planning. Building on Lefebvre’s spatial triad, our approach frames SVI as perceived space and UGC as lived space, and it integrates geo-computation with large-scale visual–textual analysis to address the modality gaps inherent to SVI. We compare the two data sources and assess whether an MLLM, specifically GPT-4o, can bridge the gaps between them. Empirical analysis shows that SVI captures functional, infrastructure-based attributes, whereas UGC highlights the aesthetic and experiential dimensions of urban form and livability. These findings underscore that the two data sources are complementary and that relying solely on SVI-based indices is limiting. By using MLLMs to interpret visual and textual data within a unified framework, we demonstrate a scalable method for computationally modelling urban perception that integrates both perceived and lived space, offering new insights for data-driven planning and the optimization of urban environments.
