Predicting language choice in a digital medium: A computational approach to analyzing WhatsApp code-switching in Hong Kong

Abstract

Aims and objectives:

This paper attempts to develop a predictive computational model of Cantonese–English code-switching (CS) in Hong Kong, informed by language-internal and “language-external” (e.g., social) factors. I analyze this bilingual practice with respect to these factors and evaluate how accurately a model informed by this analysis can forecast Cantonese–English lexical choice on the digital platform WhatsApp.

Approach:

A quantitative “bag-of-words” approach was used to analyze bilingual variability/choice. The paper focuses on the frequency distribution of English and Cantonese choices at the word level, without considering information in peripheral constituents (i.e., the part of speech of the preceding and succeeding words, collocations).
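As a rough illustration of what a word-level, context-free count of language choice might look like, the following minimal sketch tallies English and Cantonese tokens in a handful of messages. It is not the paper's pipeline: the regular-expression tokenizer, the script-based language labels (`eng`, `yue`), the treatment of each Chinese character as one token, and the sample messages are all assumptions made for this example.

```python
import re
from collections import Counter

# Assumption: Latin-letter runs count as English words and each CJK
# character counts as one Cantonese token. Real Cantonese word
# segmentation is more involved; this is only a crude stand-in.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|[\u4e00-\u9fff]")


def label_language(token):
    """Label a token by script: ASCII letters -> 'eng', otherwise 'yue'."""
    return "eng" if token[0].isascii() else "yue"


def language_counts(messages):
    """Count tokens per language, ignoring word order and context."""
    counts = Counter()
    for message in messages:
        for token in TOKEN_RE.findall(message):
            counts[label_language(token)] += 1
    return counts


def english_rate(counts):
    """Share of English tokens among all counted tokens."""
    total = sum(counts.values())
    return counts["eng"] / total if total else 0.0


# Made-up messages, for illustration only.
sample = ["我今日好 busy 呀", "see you later"]
counts = language_counts(sample)
print(counts, round(english_rate(counts), 3))
```

Because the count ignores neighboring words, it mirrors the bag-of-words assumption stated above; any predictor such as part of speech or style would have to be attached to each token separately.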

Data and analysis:

A 329,087-word sociolinguistic corpus of WhatsApp messages from 24 Hong Kong residents was used. The data were analyzed using principal components analysis, sentiment analysis, and Bayesian multivariate regression.

Findings:

Part of speech, style, proficiency in English and Cantonese, and attitudes toward switching to Cantonese interact with matrix language to condition CS. Switches from Cantonese to English signal “interpersonality,” whereas the maintenance of English in English-matrix clauses indexes “informationality.” Individual factors have less of an impact than other factors, suggesting uniformity within the community. Attitudes toward mixing and preference for frequent mixing do not correlate with rates of CS.

Originality:

Unlike prior work, this paper analyzes original, manually collected WhatsApp data, a data type that is typically underexplored due to access and privacy limitations. It brings understudied variables such as style, sentiment (affect/emotion), attitudes, and linguistic factors, together with their interactions, under a single model of digital CS. Furthermore, this paper considers the effect of individual/stylistic and dialectal/social factors on CS.

Significance and implications:

This paper advances research on Cantonese–English code-switching in Hong Kong and East Asia, enriching our understanding of bilingualism’s social, linguistic, and affective dimensions while informing multilingual AI models. By prioritizing a simple “bag-of-words” approach to modeling, it also offers a computationally efficient method accessible to researchers with limited resources, broadening the methodological toolkit for sociolinguistic analysis.

Keywords

Variation and change in the Asia-Pacific, bilingualism, computer-mediated communication, computational sociolinguistics, predictive analytics, supervised language model, sociolinguistics, Cantonese–English code-switching

