Abstract
Aims and objectives:
This paper attempts to develop a predictive computational model of Cantonese–English code-switching (CS) in Hong Kong, informed by language-internal and “language-external” (e.g., social) factors. I analyze this bilingual practice with respect to these factors and evaluate how accurately a model informed by this analysis can forecast Cantonese–English lexical choice on the digital platform WhatsApp.
Approach:
A quantitative “bag-of-words” approach was used to analyze bilingual variability and choice. The paper focuses on the frequency distribution of the choice between English and Cantonese at the word level, without considering information from peripheral constituents (i.e., the part of speech of the preceding and following words, and collocations).
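As a rough illustration of what a word-level frequency count of this kind might look like, the sketch below tallies English and Cantonese tokens in a handful of messages. It is not the paper's pipeline: the names `language_counts` and `english_rate`, the script-based language labels, and the treatment of each Chinese character as one token are all assumptions made for the example, and the sample messages are invented.

```python
import re
from collections import Counter

# Assumption: a Latin-letter run is one English token, and each CJK
# character (U+4E00 to U+9FFF) is one Cantonese token. Real corpora would
# need proper word segmentation and handling of romanized Cantonese.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|[\u4e00-\u9fff]")


def label_token(token):
    """Label a token by script: ASCII letters as English, otherwise Cantonese."""
    return "english" if token[0].isascii() else "cantonese"


def language_counts(messages):
    """Count tokens per language, ignoring word order and context."""
    counts = Counter()
    for message in messages:
        for token in TOKEN_RE.findall(message):
            counts[label_token(token)] += 1
    return counts


def english_rate(messages):
    """Share of tokens labeled English; 0.0 when there are no tokens."""
    counts = language_counts(messages)
    total = sum(counts.values())
    return counts["english"] / total if total else 0.0


if __name__ == "__main__":
    sample = ["我哋 today 去 lunch", "ok 好"]  # invented messages
    print(language_counts(sample))
    print(english_rate(sample))
```

Because the count ignores neighboring words entirely, it reflects the bag-of-words simplification described above: only the distribution of language choices is kept, not their position in the clause.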
Data and analysis:
A 329,087-word sociolinguistic corpus of WhatsApp messages from 24 Hong Kong residents was used. The data were analyzed using principal components analysis, sentiment analysis, and Bayesian multivariate regression.
Findings:
Part of speech, style, proficiency in English and Cantonese, and attitudes toward switching to Cantonese interact with matrix language to condition CS. Switches from Cantonese to English signal “interpersonality,” whereas the maintenance of English in English-matrix clauses indexes “informationality.” Individual factors have less of an impact than other factors, suggesting uniformity within the community. Attitudes toward mixing and preference for frequent mixing do not correlate with rates of CS.
Originality:
Unlike prior work, this paper analyzes original, manually collected WhatsApp data, a data type that is typically underexplored because of access and privacy limitations. It brings understudied variables such as style, sentiment (affect/emotion), attitudes, and linguistic factors, together with their interactions, under a single model of digital CS. Furthermore, this paper considers the effect of individual/stylistic and dialectal/social factors on CS.
Significance and implications:
This paper advances research on Cantonese–English code-switching in Hong Kong and East Asia, enriching our understanding of bilingualism’s social, linguistic, and affective dimensions while informing multilingual AI models. By prioritizing a simple “bag-of-words” approach to modeling, it also offers a computationally efficient method accessible to researchers with limited resources, broadening the methodological toolkit for sociolinguistic analysis.
