Sage Journals: Discover world-class research

Abstract

Given the rapid advancement of large-scale language models, AI models, like ChatGPT, are playing an increasingly prominent role in human society. However, to ensure that AI models equally benefit human society, we must fully understand the cultural stereotypes and biases that AI models may exhibit in interacting with humans. Study 1 first measured ChatGPT in 13 dimensions of cultural values; it was revealed that ChatGPT's cultural value patterns are dissimilar to those of various countries/regions worldwide. In Study 2, we analyzed the performance of ChatGPT-3.5 and ChatGPT-4.0 across thirteen decision-making tasks involving human interactions from different countries or regions. The results showed that ChatGPT 3.5 exhibits apparent cultural stereotypes in most decision-making tasks and shows significant cultural bias in third-party punishment and ultimatum games. However, ChatGPT 4o, while reinforcing stereotypes, showed reduced cultural bias on these tasks. The findings indicate that, compared to humans, ChatGPT exhibits differential cultural value orientation, and it also shows cultural biases and stereotypes in interpersonal decision-making, and the version upgrade increases the sensitivity of chatbots but weakens the potential bias. Furthermore, Study 3 created four kinds of prompt strategies and applied them to decision-making tasks. It also found that the four prompt strategies can effectively reduce the generation of ChatGPT's stereotypes and biases. Future research should emphasize enhanced technical oversight and augmented transparency in the database and algorithmic training procedures to foster more efficient cross-cultural communication and mitigate social disparities.

Keywords

AI ChatGPT cultural bias cultural stereotypes RSA

Introduction

Modern artificial intelligence (AI) has been applied to realize various human–computer interaction functions, including dialogues, providing suggestions, composing poems, and proving theorems. This highly humanized performance has garnered widespread public attention (Gilardi et al., 2023; Trautman et al., 2025), and human–computer interaction is gradually becoming common and may replace the interaction process between people in the future. Previous studies have primarily examined the degree of humanization of ChatGPT from algorithmic and data-centric perspectives, investigating whether ChatGPT can understand and process information akin to humans (Binz & Schulz, 2023; Coda-Forno et al., 2024; Huang et al., 2023; Mei et al., 2024), as well as the potential advantages and challenges associated with such humanization (Acerbi & Stubbersfield, 2023; Bubeck et al., 2023; Duan et al., 2025; Zhang et al., 2025). However, beyond cognitive abilities, the sociocultural characteristics of AI models are particularly important in ensuring equality and fairness in human–machine interactions. Considering cultural related materials ingrained in the training data, ChatGPT may have inadvertently acquired certain cultural biases or stereotypes during its training, thus posing challenges to human–computer relationships, interpersonal relationships, and intergroup relationships. Hence, this study utilizes multidimensional indicators to systematically analyze the cultural values, potential biases, and stereotypes inherent in ChatGPT and attempts to propose feasible prompt interventions to reduce the cultural biases generated by ChatGPT.

In the learning process, since the large language model learning contains various viewpoints, beliefs, and attitudes of different societies, it may unconsciously absorb this potential information and form its unique cultural values. Cultural values are the norms that determine a particular culture's social order and profoundly impact individual psychology and behavior. Studies have shown that individuals from different cultural backgrounds exhibit unique psychological characteristics and behavioral patterns from the values and societal norms emphasized within their respective cultures (Henrich et al., 2001; McCrae & Terracciano, 2005). Previous studies have found that AI systems often reflect the dominant cultural values in their training data when making moral and ethical decisions (ÓhÉigeartaigh et al., 2020). Thus, ChatGPT may have developed specific intrinsic cultural values. This could have potentially negative consequences. For example, if ChatGPT shows a different cultural value from the user, it may have controversial answers or offensive suggestions, which will cause user resistance or distrust (Bender et al., 2021), or subtly influence users’ values and affect their physical and mental health. Therefore, this study first collected data on ChatGPT's performance in multiple cultural dimensions through a questionnaire survey to explore its cultural values.

The cultural values embedded in ChatGPT shape its broad cognitive understanding of different cultural groups (Ornelas et al., 2023). This generalized understanding often leads to the formation of cultural stereotypes. Cultural stereotypes refer to expectations about the motivations, behaviors, and emotions of others (Stewart et al., 1979), and ChatGPT may develop oversimplified views of certain cultural groups based on limited information sources. These expectations can further give rise to sociocultural biases, which manifest as optimistic or pessimistic attitudes toward specific social and material contexts (Ostroot & Snyder, 1985). Since sociocultural biases often lack explicit counter-norms, they are more difficult to identify than other forms of bias, such as gender or racial biases (Kleinman & Benson, 2006; Nosek et al., 2002). Currently, ChatGPT is widely used across various fields. However, if its responses reflect stereotypes and biases, it may undermine the neutrality and fairness of interactions, exacerbate misunderstandings, and lead to group conflict or societal divisions (Olli & Swedlow, 2023; Pursiainen, 2022; Yang et al., 2023).

Recent research has begun to systematically examine cultural biases in large language models, establishing important foundational knowledge in this emerging field. Atari et al. (2023) conducted pioneering work analyzing cultural values in GPT models, demonstrating that these systems exhibit culture-specific moral and value orientations that reflect patterns from their training data. Building on this foundation, Tao et al. (2024) provided empirical replications showing consistent cultural alignment patterns across different LLM versions, confirming the persistence of these biases across model iterations. Additionally, Zewail et al. (2024) extended this line of inquiry by examining moral stereotyping behaviors across multiple LLM architectures, revealing systematic patterns of cultural stereotyping that vary across different AI systems.

While these studies have established the existence of cultural biases in LLMs, several important gaps remain in our understanding of these phenomena. First, most prior work has focused on identifying the presence of cultural biases without systematically examining the multidimensional nature of cultural values and their interrelationships within AI systems. Second, the relationship between an AI system's cultural value orientations and its behavioral manifestations in decision-making contexts remains underexplored. Our study aims to enrich current understanding by providing a comprehensive multidimensional assessment of ChatGPT's cultural values, examining how these values translate into biased behaviors in social decision-making tasks. Therefore, we employed classic behavioral decision-making paradigms to assess ChatGPT's potential cultural stereotypes and biases. Given that stereotypes can be reflected in predictions about the behaviors of specific cultural groups, we ask ChatGPT to simulate decision-making across various nationalities. This approach has been shown to effectively enhance the alignment between model outputs and the cultural contexts of various countries or regions (Tao et al., 2024). In contrast, cultural bias refers to attitudes toward particular cultural groups, so we ask ChatGPT to simulate interactions with individuals from different countries to observe its attitudes and behavioral preferences. Additionally, since stereotypes and biases may stem from the specific cultural values embedded in ChatGPT, this study also explores the relationship between ChatGPT's cultural stereotypes and biases and the cultural values of different nations.

Furthermore, in addition to looking at ChatGPT's cultural value, bias, and stereotypes in each dimension, understanding the high-dimensional representation patterns of ChatGPT in these features may be even more important. Representational similarity analysis (RSA) initially originated in systems neuroscience and is a specialized multivariate pattern analysis method that reflects the representation of concepts or stimulus units in a high-dimensional space through pairwise comparisons (Haxby et al., 2014; Huang et al., 2022; Luo et al., 2023; Peng & Luo, 2021, 2022). Previous studies have often explored the values of ChatGPT on a single dimension, such as the individualism–collectivism cultural value, neglecting the multidimensionality and complexity of culture. Studying cultural variables solely from a single dimension is insufficient to capture the full extent of culture values exhibited by ChatGPT, thus necessitating the integration of culture values across different dimensions (Luo et al., 2024). One advantage of RSA is its ability to extract the degree of association between different dimensions, thereby constructing global feature patterns of object variables (Yuan & Luo, 2024). Therefore, RSA analysis can help us integrate ChatGPT's overall culture profile. Additionally, RSA enables the quantification of global pattern information and objective comparisons of pattern features across different scales. Consequently, we can directly compare the differences between ChatGPT and data from different scales regarding overall patterns. In summary, employing pattern analysis may offer us novel and complementary perspectives.

Given the cultural bias developed by ChatGPT, an urgent question to be explored is how to effectively mitigate the prejudice and stereotypes that may arise during the interaction between individuals and large language models. Previous studies have shown that integrating manual input, feedback, or supervision in developing and deploying large language models can help developers mitigate the bias of generative large language models (Ferrara, 2023). However, the above process is mainly aimed at the model development process, and it is challenging to continue to play a robust role in various human–computer interaction processes. As a generative large language model, ChatGPT has strong anthropomorphism (Waldrop, 2024); that is, it can simulate its specific behavior based on the assumed identity (Motoki et al., 2023). This strong imitation ability reflects ChatGPT's understanding and inference of key information in the conversation process. Therefore, an important way to reduce the bias generated by ChatGPT is to apply bias-avoiding words and phrases in human–computer interaction. In this study, we proposed standardized questioning methods with different directions. We used them with the classic behavioral decision-making paradigm to explore effective models for reducing ChatGPT cultural bias during interpersonal interaction.

This study aims to explore the cultural values, biases, and stereotypes of ChatGPT and to find appropriate prompts when interacting with ChatGPT (Figure 1). First, Study 1 measures ChatGPT's value orientations across different cultural dimensions to investigate its value preferences. Subsequently, Study 2 employs complex decision-making experiments to examine ChatGPT's decision tendencies across various cultural contexts, analyzing its potential cultural biases and stereotypes, and their relationship with underlying cultural values. Furthermore, based on RSA analysis, the study explores ChatGPT's overall value patterns. Finally, Study 3 proposes and compares the effects of different questioning styles on the values and cultural biases of ChatGPT. In particular, due to the continuous iteration of ChatGPT versions, we measured the responses of ChatGPT 3.5 and ChatGPT 4o in the decision experiment to explore the potential changes caused by the release update.

Figure 1.

The research framework of the current study.

Study 1

Culture is a complex research topic, representing the integrated manifestations of human societal customs, norms, and ways of life across various contexts. To explore ChatGPT's cultural values, we selected 13 common cultural values and measured ChatGPT based on scales. Additionally, we collected data on cultural values from multiple countries/regions to compare with ChatGPT scores.

Methods

Scales and data collection of multicultural values

The specific dimensions and measurement contents are as follows:

Individualism–collectivism: It is one of the cultural dimensions proposed by Hofstede (Hofstede, 2001), which refers to whether society as a whole emphasizes individual interests or collective interests. Our study used three items from the World Values Survey (WVS) to calculate the level of individualism–collectivism in 65 countries/regions (Hamamura, 2012; Inglehart & Oyserman, 2004).

Independent–interdependent self-construal: Self-construal is the embodiment of cultural values at the individual level, indicating how individuals define and attribute meaning to themselves. Independent self-construal emphasizes individual autonomy and self-determination, focusing on individual rights and independent thinking. In contrast, interdependent self-construal highlights the interdependence between individuals and their social environment, emphasizing collective goals and social relationships (Markus & Kitayama, 2014). Our study utilized scores of independent and interdependent self-construal from 29 countries/regions measured by Fernández et al. (2005).

Relationship mobility: Relationship mobility refers to the individual's perceived ease of establishing new interpersonal relationships and terminating old interpersonal relationships in their context (Oishi & Graham, 2010). Our study used relational mobility data for 39 countries/regions measured by Thomson et al. (2018).

In addition, we selected nine cultural dimensions from the Global Leadership and Organizational Behavior Effectiveness (GLOBE) project. These dimensions include uncertainty avoidance, power distance, institutional collectivism, in-group collectivism, gender differentiation/egalitarianism, assertiveness, performance orientation, future orientation, and humane orientation (House et al., 2004).

The first six dimensions are derived from the cultural dimension proposed by Hofstede (Hofstede, 2001). Uncertainty avoidance, power distance, and future orientation correspond to uncertainty avoidance, power distance, and long-term orientation in Hofstede's cultural dimension. Institutional collectivism and in-group collectivism measure the level of individualism–collectivism from the social and within-group (family or organization) levels, corresponding to individualism–collectivism in Hofstede's cultural dimension, which the higher the score, the more inclined to collectivism. Moreover, the GLOBE project divides the masculinity–femininity dimension of Hofstede's cultural dimension into gender differentiation/egalitarianism and assertiveness. In addition, the GLOBE project has added humane orientation and performance orientation dimensions (Venaik & Brewer, 2016).

Measurement method of ChatGPT's cultural values

In each cultural dimension, the measurement method for the cultural values of ChatGPT follows a standardized format when asking about ChatGPT. The format follows: “If you are an ordinary individual in society,” followed by the questionnaire question and concluding with “Please be sure to select one of the following options,” along with the questionnaire's scoring criteria.

As shown in Figure 2, to assess the relational mobility, the following questions were posed to ChatGPT 3.5.

Figure 2.

Example of standardized input and typical ChatGPT 3.5 output when responding to a multi-item cultural values scale (relational mobility).

Utilizing the responses provided by ChatGPT, scores were derived for each scale. In cases where ChatGPT's response did not generate a score, we will modify the questioning until it yields a score.

Results

Figure 3 shows the scores of ChatGPT and different countries/regions in each dimension of cultural values. (See Table S1 for details) In terms of independent–interdependent self-construal, ChatGPT ranks at 100% for independence and 36.67% for interdependence, indicating a stronger tendency towards interdependent self-construal. This implies that ChatGPT tends to construct self-awareness from a social relational perspective. In the dimension of gender differentiation/egalitarianism, ChatGPT scores first among 60 regions/countries, suggesting a higher inclination towards gender equality than the overall level.

Figure 3.

Results of the cultural value of ChatGPT. Note: (A∼G) The ranking of ChatGPT and different country/region’s score on the (A) future orientation, (B) uncertainty avoidance, (C) individualism–collectivism, institutional collectivism, in-group collectivism, (D) independent self-construal, interdependent self-construal, (E) gender differentiation/egalitarianism, assertiveness, (F) relationship mobility, power distance, (G) performance orientation, humane orientation cultural values dimensions. The ChatGPT score is represented in black. (H) Based on the ranking, ChatGPT displays a collectivist value tendency, tends to have a feminine and interdependent self-construal, and tends to avoid risks, plan for the future, pay attention to the individual and performance, and attach importance to freedom and equality of interpersonal relations. (I) The representation similarity matrix of ChatGPT’s score on the GLOBE Project’s nine cultural value dimensions (left): each cell of the 9 × 9 matrix represents the similarity of ChatGPT’s score on each of the two cultural value dimensions. And the representation similarity matrix of the score difference between 60 countries/regions on the GLOBE Project’s nine cultural value dimensions (right): each cell of the 9 × 9 matrix represents the similarity of the scores of 60 countries/regions between each of the two value dimensions. There is no significant representational similarity between the two matrices.

Furthermore, ChatGPT ranks 100% in assertiveness, emphasizing interpersonal relationships and cooperation. The score ranking of gender differentiation/egalitarianism and assertiveness of ChatGPT shows that ChatGPT is more feminine. Additionally, in the power distance value, ChatGPT's score ranks at 1.61%, and in relationship mobility, ChatGPT's score ranks at 10%, indicating that ChatGPT values equality and freedom in interpersonal communication.

In the dimension of individualism–collectivism cultural values, ChatGPT's collectivism level ranks at 16.92%, and in the GLOBE's institutional and in-group collectivism cultural values scores rank at 1.61% and 17.74%, respectively, which is high-level among all countries/regions. These results suggest that in Hofstede's and GLOBE's dimensions of individualism–collectivism national culture, ChatGPT exhibits a higher tendency towards collectivism.

ChatGPT ranked first in both humane orientation and performance orientation, indicating that it not only pays attention to treating others well but also encourages others to achieve achievements. In addition, ChatGPT ranks 96.24% out of 60 countries/regions in the uncertainty avoidance value, indicating a strong tendency towards risk avoidance. In the future orientation value, ChatGPT ranks 23.81%, suggesting a tendency towards rational future planning.

Overall, ChatGPT demonstrates a cultural value orientation towards collectivism, emphasizing interpersonal relationships and concern for others. Simultaneously, ChatGPT also prioritizes equality and freedom in interpersonal relationships.

Furthermore, to explore which countries’ cultural characteristic patterns are similar with ChatGPT's representation patterns in multiple cultural dimensions. We conducted RSA by constructing representation similarity matrices for ChatGPT and the 60 countries/regions on the nine dimensions of the GLOBE project, respectively (Table 1). The representation similarity was computed using both Euclidean and Canberra distances. Subsequently, Mantel tests were performed to explore the correlation between the representational similarity matrix of ChatGPT and each country/region. As shown in Table 1, there was no significant similarity between ChatGPT and 60 countries/regions in representation patterns on nine value dimensions (rs < 0.16, ps > 0.101; Table 1). This result indicated that the representation patterns of ChatGPT in cultural values are unique and different from those of the measured countries/regions.

Table 1.

The representational similarity between ChatGPT and 62 regions of 60 countries on GLOBE cultural values.

Country/Region	Canberra Distance		Euclidean Distance		Mean r
	r	p	r	p
New Zealand	0.16	0.154	0.16	0.101	0.16
Sweden	0.14	0.301	0.04	0.367	0.09
French Switzerland	0.06	0.427	−0.01	0.496	0.03
Greece	0.05	0.280	0.00	0.384	0.02
Malaysia	0.05	0.377	−0.03	0.461	0.01
Hungary	0.00	0.391	−0.03	0.476	−0.02
Canada	−0.02	0.388	−0.06	0.498	−0.04
China	−0.02	0.378	−0.06	0.511	−0.04
Japan	−0.01	0.467	−0.10	0.594	−0.05
Ireland	−0.07	0.503	−0.09	0.610	−0.08
El Salvador	−0.07	0.464	−0.11	0.618	−0.09
Qatar	−0.14	0.765	−0.11	0.742	−0.13
Argentina	−0.10	0.525	−0.15	0.640	−0.13
Kazakhstan	−0.12	0.674	−0.14	0.698	−0.13
USA	−0.13	0.515	−0.14	0.570	−0.14
Namibia	−0.14	0.618	−0.14	0.602	−0.14
Denmark	−0.14	0.656	−0.15	0.665	−0.15
Indonesia	−0.13	0.599	−0.17	0.737	−0.15
Kuwait	−0.13	0.613	−0.18	0.752	−0.15
Thailand	−0.14	0.691	−0.18	0.797	−0.16
Colombia	−0.14	0.601	−0.18	0.711	−0.16
Albania	−0.15	0.574	−0.18	0.674	−0.16
Singapore	−0.14	0.625	−0.20	0.805	−0.17
Venezuela	−0.16	0.788	−0.18	0.768	−0.17
Italy	−0.16	0.676	−0.19	0.734	−0.17
Russia	−0.17	0.707	−0.20	0.807	−0.19
Portugal	−0.18	0.646	−0.20	0.727	−0.19
Czech Republic	−0.22	0.766	−0.17	0.718	−0.20
Taiwan, China	−0.18	0.680	−0.22	0.772	−0.20
India	−0.17	0.644	−0.24	0.775	−0.20
Nigeria	−0.20	0.710	−0.22	0.758	−0.21
Costa Rica	−0.20	0.801	−0.23	0.852	−0.22
Slovenia	−0.22	0.814	−0.23	0.822	−0.22
Poland	−0.20	0.759	−0.25	0.836	−0.22
Morocco	−0.22	0.771	−0.24	0.862	−0.23
Mexico	−0.21	0.757	−0.25	0.839	−0.23
Finland	−0.22	0.853	−0.24	0.881	−0.23
Turkey	−0.22	0.781	−0.25	0.830	−0.24
Spain	−0.23	0.843	−0.25	0.888	−0.24
Switzerland	−0.25	0.921	−0.24	0.870	−0.24
Germany	−0.24	0.892	−0.26	0.927	−0.25
Egypt	−0.23	0.772	−0.27	0.848	−0.25
Philippines	−0.23	0.863	−0.27	0.933	−0.25
Brazil	−0.24	0.868	−0.26	0.894	−0.25
Netherlands	−0.25	0.936	−0.26	0.933	−0.26
France	−0.26	0.923	−0.27	0.921	−0.27
Australia	−0.26	0.845	−0.28	0.831	−0.27
South Africa	−0.26	0.887	−0.29	0.900	−0.27
Georgia	−0.26	0.916	−0.29	0.924	−0.27
Zambia	−0.26	0.880	−0.30	0.925	−0.28
Bolivia	−0.28	0.976	−0.29	0.961	−0.29
Guatemala	−0.29	0.960	−0.30	0.970	−0.29
Hong Kong, China	−0.28	0.985	−0.31	0.982	−0.30
Zimbabwe	−0.29	0.861	−0.31	0.906	−0.30
Israel	−0.29	0.957	−0.33	0.952	−0.31
England	−0.31	0.942	−0.32	0.929	−0.32
Austria	−0.32	0.994	−0.35	0.997	−0.33
IRAN	−0.34	0.989	−0.35	0.987	−0.34
South Korea	−0.32	0.999	−0.36	1.000	−0.34
Ecuador	−0.34	0.999	−0.36	0.999	−0.35

In addition, to explore whether the representation patterns of differences in cultural values between countries is associated with the representation patterns of ChatGPT in multiple cultural value dimensions, we construct a representation similarity matrix of the score differences among 60 countries/regions in the nine dimensions of GLOBE, which is a second-order level RSA matrix (Figure 3I). First, in the regional dimension, the cross-country cultural values pattern matrix under each cultural values dimension was constructed. Then, in the cultural values dimension, the cross-values second-order matrix was built, which reflects the similarity pattern of 60 countries/regions in the nine dimensions of the GLOBE project. Each cell in the matrix indicates the similarity of distribution patterns of cultural value scores between every pair of cultural value dimensions across countries/regions. The comparison of the similarity between ChatGPT and the above second-order representation similarity matrix revealed no significant similarity between the two matrices (r = −0.04, p = 0.507). This result indicated that the representation similarity patterns of ChatGPT across different cultural dimensions do not align with those at the real-life cross-regional level. These findings suggested that although ChatGPT demonstrates specific cultural value biases across various dimensions, its overall pattern is distinct from the existing 60 countries/regions and the overall level, exhibiting unique cultural value patterns.

Study 2

To evaluate whether ChatGPT exhibits cultural stereotypes and biases, we operationally define cultural stereotype as systematic behavioral variations in decision-making when the model is prompted to simulate ordinary individuals from different national or cultural backgrounds. That is, if the model generates distinct responses under otherwise identical task conditions solely due to the national identity specified in the prompt, such consistent variation indicates the presence of internalized cultural stereotypes. Conversely, we define cultural bias as asymmetries in ChatGPT's behavior toward individuals from different cultural groups during social interaction tasks—for example, differential levels of trust, fairness, or punishment toward interactants from varying countries. Both constructs are quantitatively assessed across a standardized set of decision-making tasks.

Methods

Measurement method of ChatGPT's cultural stereotype and bias

To probe a broad spectrum of social and interpersonal motivations using clear, single-shot decision rules, we adapted ten well-validated social interaction tasks and three individual decision-making paradigms into one-round text prompts. These tasks have been widely used in cross-cultural research to reveal variations in reciprocity, norm enforcement, risk–reward learning, self-control, and moral reasoning (e.g., House et al., 2020; Ishii & Kurzban, 2008). Each can be succinctly presented with a defined payoff structure and a single decision point, making them suitable for standardized interaction with large language models.

The social interaction tasks based on two-person or multi-person interactive scenarios include: (1) Approach–Avoidance Conflict Task (Aupperle et al., 2011), (2) Third-Party Punishment Task (Fehr & Fischbacher, 2004), (3) Intergenerational Decision-Making Task (Langenbach et al., 2019), (4) Trust Game Task (Burks et al., 2003), and (5) Ultimatum Task (Nowak et al., 2000). Individual decision-making tasks focus on measuring social traits exhibited by individuals during the decision-making process, including (6) Iowa Gambling Task (Bechara et al., 1994), (7) Delayed Gratification Task (Johnson & Bickel, 2002), (8) Ethical Dilemmas Task (Greene et al., 2008), (9) Dictator Task (Forsythe et al., 1994), (10) Dishonesty Task (Fischbacher & Follmi-Heusi, 2013), (11) Public Good Game Task (Santos et al., 2008), (12) Prisoner's Dilemma Task (Shafir & Tversky, 1992), and (13) Bystander Task (Latané & Darley, 1968). After appropriate adaptation, these tasks were used to measure the performance of ChatGPT 3.5 and ChatGPT 4o. Five experimenters conducted measurements on the task.

In each trial, ChatGPT will first be presented with the task rules.(see Table S2 for details) Taking the Ultimatum Game as an example, ChatGPT saw:

You and another participant will take on the roles of the proposer and the responder, respectively. The system provides you with an initial fund of 10 units of currency, and you must decide on the distribution of the funds alone. The system will then present your proposed distribution to the responder. If the responder agrees with your proposal, the funds will be distributed according to your proposal. If the responder disagrees, neither of you will receive anything.

Then, ChatGPT would do decision-making behavior. For example, in the Ultimatum Game, ChatGPT provides a specific number within 10 to represent the money it wants to allocate to the responder.

In the first part, we asked ChatGPT to simulate ordinary individuals from no cultural background and 20 different countries across five continents (China, Japan, India, Russia, Germany, France, the United Kingdom, Italy, the United States, Canada, Mexico, Brazil, Argentina, Colombia, South Africa, Nigeria, Egypt, Kenya, Australia, New Zealand), completing the 13 tasks mentioned above. This can help us understand ChatGPT's perception of individuals from different countries and assess whether they hold cultural stereotypes. At the beginning of each task, ChatGPT is instructed to assume the role of an ordinary person from a specific country. The name of a particular country will be replaced with the corresponding country's name in different rounds.

In the second part, we selected five social interaction tasks based on two-person or multi-person interactions to observe any unfair phenomenon in ChatGPT interactions with individuals from different countries and assess whether it has cultural biases. At this point, we inform ChatGPT that an ordinary person needs to interact with an individual from a specific country. The name of this country will be replaced with the corresponding country's name in different rounds, just like in the first part. A total of 20 countries were measured, and the order of questioning was randomized.

The questioning is uniformly conducted in English. Each task for each country condition was repeated ten times, and the data from these ten repetitions were used to conduct variance and correlation analyses. We conducted one-way ANOVA to test for differences between countries and corrected them with Bonferroni for pairwise comparisons, if the within-group standard deviation of the data was 0 (see Table S16 for details), the Kruskal–Wallis test was used for analysis.

Results

ChatGPT has cultural stereotypes and biases

Country-specific task performance indicates that both ChatGPT 3.5 and ChatGPT 4o exhibit cultural differences in their task responses, but ChatGPT 4o demonstrates cultural stereotypes across a broader range of concepts. ChatGPT 3.5 exhibited cultural stereotypes in eight tasks related to help others (Approach–avoidance conflict task: F(20) = 6.60, p < 0.001, η² = 0.41), trust perception (Trust game task: F(20) = 5.95, p < 0.001, η² = 0.39), moral decision-making (Ethical dilemma task: F(20) = 76.14, p < 0.001, η² = 0.89), risk preference (Iowa gambling task: F(20) = 29.64, p < 0.001, η² = 0.76), fairness preference (Ultimatum game task: F(20) = 15.88, p < 0.001, η² = 0.63; Third-party punishment task: F(20) = 2.16, p = 0.004, η² = 0.19), delayed gratification (Delayed gratification task: F(20) = 33.76, p < 0.001, η² = 0.78), and intergroup cooperation (Public good game task: F(20) = 16.50, p < 0.001, η² = 0.64), while no stereotypes were present in the others (ps > 0.050) (Figure 4 and Table S3). Results of ChatGPT 4o showed that except dishonest task, scores differed significantly between countries in other 12 tasks (ps < 0.050, Figure 4 and Table S5).

Figure 4.

ChatGPT simulates the performance of individuals from 20 countries in 13 single-round decision-making tasks. Note: The figure shows the mean and standard error of 10 simulations under different tasks and national backgrounds, and the grey dashed line represents the decision results without national backgrounds. (A) Third-party punishment. The higher the score, the more emphasis individuals place on social fairness. (B) The Ultimatum game. The higher the score, the more altruistic it is. (C) Iowa gambling task. If the score is greater than 0.5, it indicates that the individual tends to pursue risk. (D) Ethical dilemma. The higher the score, the greater the possibility of sacrificing one person with illness to save five people without illness. (E) Trust game. The higher the score, the more trust one has in others. (F) Delayed gratification. The higher the score, the more emphasis is placed on long-term benefits. (G) Approach avoidance conflict. The higher the score, the less willing one is to sacrifice their interests. (H) Public good game task. The higher the score, the more intergroup cooperation. (I) Dictator task. The higher the score, the less authoritarian. (J) Intergenerational decision-making. In situations where resources are limited and there are descendants, the more resources are extracted, the less they are considered for descendants. (K) Prisoner's dilemma task. The higher the score, the higher the willingness to cooperate. (L) Bystander task. The higher the score, the less responsibility to help others. (M) Dishonest task. The higher the score, the more dishonest. (N) Comparing the effect size for national differences in various tasks between ChatGPT 3.5 and ChatGPT 4o.

Based on understanding ChatGPT's generalized beliefs of different cultures, we also observed whether there was any bias in ChatGPT's interactions with individuals from different cultures. When ChatGPT 3.5 interacting with individuals from different countries, there were differences in scores of four tasks (Figure 5 and Table S7). These tasks include ultimatum game task (F(20) = 4.11, p < 0.001, η² = 0.30), third-party punishment task (F(20) = 6.06, p < 0.001, η² = 0.39), dictator task (F(20) = 10.82, p < 0.001, η² = 0.53), public good game (H(20) = 209, p < 0.001, η² = 1.00). These tasks are related to distribution, indicating that ChatGPT has cultural biases in distributive fairness. There was no difference in the scores of trust game task, intergenerational decision-making task, prisoner's dilemma task and bystander task between countries (ps > 0.050). Although the results of the approach–avoidance conflict task were significantly different (F(20) = 2.14, p = 0.003, η² = 0.19), there were no differences in pairwise comparisons (ps > 0.050). ChatGPT 4o exhibits cultural bias in fewer tasks (Figure 5 and Table S10). The differences between countries in other tasks were only significant in the dictator task (H(20) = 209, p < 0.001, η² = 1.00) and public good game task (F(20) = 12.29, p < 0.001, η² = 0.57).

Figure 5.

Decision-making behaviors of ChatGPT in interactions with imagined individuals from 20 countries. Note: The figure shows the mean and standard error of 10 simulations under different tasks and national backgrounds, and the grey dashed line represents the decision results of imagined individuals without national backgrounds. (A) Dictator task. (B) Public good game task. (C) Ultimatum game task. (D) Third-party punishment task. (E) Approach avoidance conflict. (F) Trust game task. (G) Intergenerational decision-making task. (H) Prisoner’s dilemma task. (I) Bystander task. (J) Comparing the effect size for national differences in various cultural bias tasks between ChatGPT 3.5 and ChatGPT 4o.

Building on the understanding of cultural stereotypes and cultural bias as distinct yet interconnected concepts, we examined their interplay within the same tasks performed by ChatGPT. Correlation analysis of the simulation results revealed that ChatGPT 3.5's cultural bias is independent of cultural stereotypes, while ChatGPT 4o's cultural bias aligns with cultural stereotypes (see Table S8 and Table S11). The task scores where ChatGPT 4o exhibited cultural bias were significantly positively correlated with the corresponding cultural stereotype scores (Dictator task: r = 0.81, p < 0.001; Public good game task: r = 0.77, p < 0.001).

In summary, both ChatGPT 3.5 and ChatGPT 4o exhibit cultural sensitivity, with ChatGPT 4o showing broader cultural stereotypes and a stronger alignment between its cultural biases and stereotypes across more tasks. While both models display cultural bias, ChatGPT 4o exhibits it in fewer tasks, suggesting improved fairness in interactions with diverse cultural groups. ChatGPT 3.5's cultural bias appears independent of its stereotypes, whereas ChatGPT 4o's biases are more closely tied to its generalized cultural beliefs. Despite improvements, further refinement is needed to reduce cultural bias, especially in tasks related to fairness and cooperation.

The potential sources of ChatGPT's stereotypes and bias

The above results found that ChatGPT showed stereotypes and biases in most decision-making tasks. The formation of these sociocultural biases may come from its cognition of different cultural groups, that is, cultural values. Therefore, to verify the above hypothesis, we further explored the correlation between the stereotypes and cultural biases of ChatGPT and the cultural values of the country itself.

First, we analyzed the relationship between the cultural stereotypes exhibited by ChatGPT in these tasks and the real-world cultural differences among these countries in the four basic cultural value dimensions (individualism–collectivism, relational mobility, interdependent–independent self-construal, and cultural tightness–looseness) to explore the sources of ChatGPT's cultural stereotypes (Figure 6A and Table S4). The results found that when a country is more inclined to a collectivistic culture, ChatGPT 3.5 believes that individuals in that country are more inclined to immediate gains (delayed gratification task, r = −0.57, p = 0.026) and tend to allocate more money to themselves (ultimatum game task, r = −0.73, p = 0.002). In countries with high relational mobility, ChatGPT 3.5 tends to believe that people are more willing to take risks (Iowa gambling task, r = 0.63, p = 0.027) and trust others (trust game task, r = 0.39, p = 0.206), pursuing fairness (third-party punishment task, r = 0.41, p = 0.186), and safeguarding one's interests (approach–avoidance conflict task, r = −0.39, p = 0.207). In addition, ChatGPT 3.5 believes that individuals in strong interdependence cultures are more cooperative (public good game task, r = 0.35, p = 0.287) and tend to make utilitarian choices (moral dilemma task: r = −0.65, p = 0.030). However, there is no significant correlation between the tightness of national culture and the stereotypes displayed by ChatGPT 3.5 in the task. At the same time, the RSA results showed that the ChatGPT cultural stereotype pattern in the ultimatum game task was significantly similar to the real-world collectivism cultural similarity pattern (r = 0.34, p = 0.001, Figure 6B&5E). Meanwhile, the ChatGPT cultural stereotype pattern in the Iowa gambling task significantly correlates with the pattern of real-world relation mobility scores (r = 0.26, p = 0.010, Figure 6C&5F and Table S4).

Figure 6.

The relationship between ChatGPT task performance and human cultural values. Note: (A) Correlation pattern between ChatGPT 3.5 cultural dependent performance and cultural value dimensions in human society. Correlate the ChatGPT country simulation results in each task with the corresponding real-world cultural values scores. There are four cultural values included: collectivism–individualism (yellow), relation mobility (orange), interdependent self (pink), and cultural tightness–looseness (gold). The correlation coefficients from the inner circle to the outermost circle of the radar map are 0, 0.2, 0.4, 0.6, and 0.8, respectively. (B) The similarity between the real-world collectivism–individualism patterns across 20 countries (top) and the ChatGPT 3.5's simulation patterns of cultural stereotypes between 20 countries on the ultimatum game (bottom). (C) The similarity between the real-world relation mobility scores across 20 countries (top) and the ChatGPT 3.5's simulation patterns of cultural stereotypes between 20 countries on the Iowa gambling task (bottom). (D) The similarity between the real-world cultural tightness–looseness patterns across 20 countries (top) and the ChatGPT 3.5's simulation patterns of cultural biases between 20 countries on the Ultimatum game task (bottom). ***: p < 0.001, **: p < 0.01, *: p < 0.05.

Second, we also explored whether the cultural bias generated by ChatGPT in the task is also affected by the cultural value tendency of the country. The correlation analysis results (Table S9) show that ChatGPT 3.5 will tend to distribute unfairly in dictator tasks when interacting with individuals in high individualism cultures (r = −0.73, p = 0.002). In the ultimatum task, the distribution plan of ChatGPT 3.5 is significantly related to the country's cultural looseness, that is, the looser the country's culture, the more inclined it is to unfair distribution (r = 0.79, p < 0.001). At the same time, the RSA results are also consistent with the above findings (r = 0.60, p < 0.001, see Figure 6D&5G).

In general, the cultural values of countries can predict ChatGPT's stereotypes and biases against these countries, indicating that ChatGPT's stereotypes and biases towards different societies are generated based on the cultural value orientation of specific societies. In addition, we also analyzed the correlation between ChatGPT4o's decision-making tendency in the task and the cultural value orientation of the country, and the results also supported the above conclusions (see Text S1 for details).

Study 3

Based on the results above, we have found that ChatGPT's responses exhibit certain cultural biases, which can exacerbate cultural conflicts and misunderstandings between people and groups, potentially leading to greater social inequality. Consequently, we aim to explore methods to reduce the biases present in ChatGPT's responses. To this end, we have designed various prompt strategies to enable ChatGPT to overlook cultural differences between countries and provide unbiased answers. Our approaches include emphasizing personal traits, emphasizing fairness, creating future-oriented scenarios, and creating scenarios without national differences.

Methods

In previous experiments, tasks related to fair distribution included the Dictator Task, Ethical Dilemma, Third-party Punishment, Trust Game, Ultimatum Game, and Public Good Game. Among them, ChatGPT 3.5 and ChatGPT 4o both showed significant stereotypes in the Ultimatum Game and Third-party Punishment task. Therefore, we selected the above two tasks to test the effectiveness of prompting strategies in improving stereotypes. In addition, ChatGPT 3.5 also showed significant bias in the Ultimatum Game and Third-party Punishment task, while ChatGPT 4o only produced bias in the Public Good Game and Dictator Task, so we selected the above tasks to measure the effectiveness of the prompting strategy in eliminating bias.

Building upon our previous experiments that employed prompts specifying national or cultural contexts, we introduced four new prompting strategies to systematically examine their effectiveness in reducing cultural stereotypes and biases in ChatGPT (see Table S15). These interventions were embedded within the same national context frameworks as before, allowing us to test whether specific types of prompts could mitigate cross-country behavioral differences previously observed in model responses. The four strategies included: emphasizing personal preferences in the questioning style (e.g., “As an independent individual, please choose according to your own ideas”), emphasizing fairness (e.g., “From a fairness perspective”), creating future-oriented scenarios (e.g., “If you live in a future world”), and creating borderless world scenarios (e.g., “Imagine you live in a multicultural world”).

In our design, each intervention was implemented by modifying two parts of the prompt: Prompt 1, which described the identity or background of the decision-maker (e.g., nationality or imagined identity), and Prompt 2, which introduced a particular reasoning perspective or cognitive frame (e.g., fairness, independence). The placement of the intervention cue depended on its nature—for instance, identity-oriented interventions such as “future world” were placed in Prompt 1, while reasoning-based interventions such as “from a fairness perspective” were included in Prompt 2.

Using these four new prompting methods, we remeasured the potential stereotypes of ChatGPT 3.5 and 4o in the Ultimatum Game and Third-party Punishment task, the bias of ChatGPT 3.5 in the Ultimatum Game and Third-party Punishment task, and the bias of ChatGPT 4o in the Public Good Game and Dictator Task.

Taking the Third-party Punishment task (Cultural stereotype) as an example, the text generated under these four prompt strategies is as follows:

[prompt 1] If you are an ordinary individual from [the country]. You complete the game with the distributor (A) and the receiver (B). A and B receive a total of 100 tokens, which are distributed by A and B can only accept and has no right to refuse. You will start with 50 tokens, which you can use to punish A or keep for yourself after watching A's distribution. You can choose the exact amount of punishment, and the relationship between the number of tokens you use to punish and the number of tokens reduced by A is 1:3; for example, if you take out 5 tokens to punish, the number of tokens of A will be reduced by 15.

If A's allocation scheme is: A gets 90 tokens, and B gets 10 tokens. [prompt 2] please choose one of the following schemes to punish A:

Keep your tokens without penalizing A

Take out 5 tokens and make A lose 15 tokens

Take out 10 tokens and make A lose 30 tokens

Taking out 15 tokens causes A to lose 45 tokens

Take out 20 tokens and make A lose 60 tokens

Take 25 tokens and make A lose 75 tokens

Take out 30 tokens and make A lose 90 tokens

[prompt 1] and [prompt 2] under different prompt strategies are shown in the table below:

Prompt strategy	Prompt content (Replace Content in [ ])
Emphasizing personal preferences	[prompt 1: blank][prompt 2: As an independent individual, please make a choice according to your own ideas and]
Emphasizing fairness:	[prompt 1: blank][prompt 2: From a fairness perspective,]
Creating future-oriented scenarios	[prompt 1: If you live in a future world.][prompt 2: blank]
Creating scenarios without national differences	[prompt 1: Imagine you live in a multicultural world,][prompt 2: blank]
Emphasizing personal preferences	[prompt 1: blank][prompt 2: As an independent individual, please make a choice according to your own ideas and]

Results

The effect of prompt strategies on cultural stereotypes

Using four prompts that emphasizing personal preferences, fairness, creating future-oriented scenarios and creating borderless world scenario, We reassessed ChatGPT 3.5 and ChatGPT 4o's stereotypes in the third-party penalty and ultimatum paradigms. Results showed that all four prompt methods effectively reduced cultural stereotype scores in ChatGPT 3.5 to nonsignificant levels in both the third-party punishment and ultimatum game tasks across the 20 countries examined (ps > 0.05 or SDs = 0; see Figure 7A and Table S13). In contrast, the effectiveness of the prompts varied for ChatGPT 4o. For the third-party punishment task, only the future-oriented scenario prompt successfully reduced cultural stereotypes (ps > 0.05 or SD = 0; see Figure 7A and Table S13). In the ultimatum game task, only the future-oriented and personal preference prompts were effective in reducing stereotypes (ps > 0.05 or SD = 0; see Figure 7A and Table S13). The result indicated that while all four prompts are effective in mitigating cultural bias in ChatGPT 3.5 across different contexts, ChatGPT 4o exhibits a more nuanced response, with certain prompts proving more effective in specific tasks. This highlights the need for tailored prompt strategies depending on the context and version of the AI model being utilized.

Figure 7.

The performance of ChatGPT 3.5 and ChatGPT 4o under four prompt strategies. Note: (A) The cultural stereotypes of ChatGPT 3.5 and ChatGPT 4o in the Ultimatum task and Third-party punishment task under four prompt strategies. (B) The cultural bias of ChatGPT 3.5 in the Ultimatum task and Third-party punishment task and of ChatGPT 4o in the Dictator task and Public good game under four prompt strategies. The horizontal axis is the effect size(η²) of one-way ANOVA or Kruskal–Wallis test, and the data is labelled as p-value or standard deviation (SD). ***: p < 0.001, **: p < 0.01.

The effect of prompt strategies on cultural bias

In addition, we reevaluate the bias of ChatGPT 3.5 in the third-party punishment and ultimatum paradigms and the bias of ChatGPT 4o in the dictator game and the public goods game. The data and one-way ANOVA results showed significant effects in reducing the cultural biases of ChatGPT. Specifically, one-way ANOVA analyses indicated no significant differences across the 20 countries in the measures of third-party punishment and the ultimatum game with ChatGPT 3.5, regardless of the prompt conditions, which included those emphasizing personal traits, fairness, future-oriented scenarios, and borderless world scenarios (ps > 0.05 or SDs = 0; see Figure 7B and Table S14). Similarly, for the dictator game with ChatGPT 4o, all four prompt strategies effectively reduced cultural biases. In the public goods game with ChatGPT 4o, three prompts—emphasizing fairness, future-oriented scenarios, and borderless world scenarios—also effectively eliminated cultural bias (ps > 0.05 or SDs = 0; see Figure 7B and Table S14), while the prompt emphasizing personal preferences did not achieve the same effect (F(20) = 3.27, p <0.001, η² = 0.26). Overall, these findings underscore the effectiveness of the developed prompt strategies in reducing cultural bias across different games and contexts.

Discussion

Our study found that ChatGPT exhibits unique cultural value orientations through questionnaire assessments, displaying both collectivist tendencies and value preferences that contradict collectivism, such as low power distance and high relational mobility. Additionally, ChatGPT demonstrated cultural biases and stereotypes in social interaction tasks. Fundamental cultural value dimensions, such as individualism–collectivism and cultural tightness–looseness, significantly influenced these biases and stereotypes. Moreover, four prompt strategies were developed, and the results indicated that these prompt strategies were efficient in reducing such cultural biases and stereotypes.

ChatGPT exhibits a distinct value orientation characterized by a unique blend of cultural values. Assessments have revealed that ChatGPT displays tendencies towards collectivism while simultaneously holding preferences that contradict traditional collectivist values, such as low power distance and high relational mobility. These findings align with previous research (Ghosh, 2011; Kito et al., 2017). However, ChatGPT's value orientation significantly diverges from that of human society, resulting in the formation of a unique cultural value system. This divergence highlights the complexities and nuances of ChatGPT's cultural values, which cannot be fully encapsulated within existing human cultural frameworks. Such differences could be attributed to the algorithms and diverse data sources used to train ChatGPT, which encompass a broad spectrum of human interactions and cultural contexts.

ChatGPT, like other algorithms, has demonstrated cultural biases in decision-making tasks. Prior research, such as Buolamwini (2017), has shown that algorithmic biases arise from dataset construction, goal setting, feature selection, and data labelling. Our research reveals that ChatGPT's biases may also stem from its understanding of cultural values across different regions. This indicates that biases are not solely due to dataset handling but also to the transfer of learned content. This underscores the need for further enhancement and regulation of AI algorithms. With advancements in AI, new models like ChatGPT 4.0 and 4o have been introduced. Our evaluation of ChatGPT 4o in single-round interactive social tasks revealed that while it still exhibits cultural stereotypes, it no longer displays cultural biases in the measured tasks. This highlights the effectiveness of continuous AI improvements and the importance of regular updates and thorough evaluations. These advances may be attributable to specific architectural and alignment updates, such as the integration of Rule-Based Reward Models (RBRMs) alongside Reinforcement Learning from Human Feedback (RLHF), which systematically penalize biased outputs and promote safer completions. Moreover, GPT 4o's improved multilingual capabilities suggest broader generalization across cultural contexts, potentially due to more inclusive evaluation and feedback processes. Our findings support the view that continual refinement of model training and alignment pipelines plays a key role in reducing algorithmic bias, underscoring the importance of ongoing evaluation and regulatory oversight.

Therefore, we developed four potential interventions to mitigate cultural stereotypes and biases in ChatGPT. The results demonstrated that prompts emphasizing personal preferences, fairness, future-oriented scenarios, or scenarios without national context could significantly reduce cultural stereotypes and biases in ChatGPT's responses regarding social distributive fairness. Previous studies have also found that responses from LLMs vary based on how they are prompted (Geng et al., 2024; Mei et al., 2024), highlighting the critical importance of refining prompt design to achieve more equitable and unbiased AI outputs. By systematically altering the framing of questions, we can influence the nature of AI responses and mitigate inherent biases, thereby enhancing the fairness and reliability of AI systems.

RSA analysis complements the main results from a modal perspective. By comparing the ChatGPT matrix of cultural values with the differential matrix of values scores across 60 countries/regions, we found that ChatGPT is also unique in the cultural value pattern. In addition, the cross-dimensional comparison also found that ChatGPT's inter-state decision-making model in moral dilemmas and ultimatums was significantly similar to the inter-state interdependent cultural values model, suggesting that the cultural bias displayed by ChatGPT may be based on its learning of the cultural values of the country. The above results support the current results from a model perspective.

The cultural stereotypes and biases of ChatGPT pose real-world challenges. In the context of globalization, cross-cultural interactions between individuals and organizations are increasingly important, and ChatGPT has gained attention for its role in cross-cultural communication (Baskara, 2023; Y. Chen et al., 2023). Studies have found that ChatGPT can predict users’ cultural values based on their language, aiding communication across different languages (Cao et al., 2023; Dong et al., 2024). Our findings support that ChatGPT is culturally sensitive and can distinguish between the cultural values of different regions. However, these perceptions are only sometimes accurate and may lead to misleading advice during communication. Therefore, users must be aware of potential stereotypes in ChatGPT's understanding of cultural values when using it for cross-cultural interactions. To reduce misunderstandings and communication barriers, users should implement measures to mitigate these biases, such as using revised prompts tested in this study. This approach would facilitate more effective and respectful cross-cultural interactions.

The cultural biases and stereotypes in ChatGPT pose a threat to social ethics. Our research found that ChatGPT exhibits these biases in tasks related to trust perception and fairness preferences. Previous studies indicate that cultural stereotypes and biases often result in social injustices (Agarwal et al., 2023; Howard & Borenstein, 2018; Lamont et al., 2014). For instance, Webster et al. (2022) found that biases influence interactions within medical environments, leading to medical inequity and health inequality. AI systems filtering information based on stereotypes can exacerbate information cocoon effects, reinforcing users’ existing biases (S. Chen et al., 2022). Given that ChatGPT 4o has shown reduced cultural bias with AI advancements, future research should involve stricter scrutiny and selection of algorithms and training datasets to mitigate socio-ethical issues. Emphasis should be placed on the interpretability and transparency of AI systems to understand their decision-making processes better. Methods to reduce biases in large language models include modifying word embeddings (Bolukbasi et al., 2016) and training on diverse databases (Zhao et al., 2018). Implementing these strategies can help minimize cultural biases in AI systems.

This study also has some limitations. First, our methodology relies on ChatGPT's self-reported responses, which may exhibit instability due to the inherent generative mechanisms of LLMs (Navigli et al., 2023). To mitigate this potential instability, we implemented repeated measurements where ChatGPT responded to each question ten times and we used the mean values as final responses. Future research could further enhance robustness by conducting systematic sensitivity analyses across different temperature settings to better capture the stability of AI-generated cultural representations. Second, our focus on ChatGPT limits generalizability to other large language models (LLMs). Different LLMs vary in training data, architecture, and fine-tuning processes, potentially producing distinct cultural bias patterns (Blodgett et al., 2020; Bender et al., 2021). Future research should employ comparative analyses across multiple LLMs (Bai et al., 2022) and develop AI-specific measurement approaches that better account for the unique nature of machine-generated responses (Perez et al., 2023). Moreover, using only English as the prompt language may diminish the cultural differences exhibited by LLMs when simulating different identities (Zhong et al., 2024). Future research should take into account the potential impact of prompt language and consider incorporating prompts in multiple languages.

In summary, our study reveals that ChatGPT exhibits both unique cultural values and biases, influencing its responses in decision-making tasks and interactions across different cultures. While improvements in AI models and revised prompts have reduced some cultural biases, ongoing refinements in prompt design and algorithm adjustments remain crucial for enhancing fairness and reducing stereotypes. Future research should continue to focus on optimizing AI systems to mitigate cultural biases and improve their cross-cultural communication effectiveness.

Transparency and openness

We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study, and we follow JARS (Appelbaum et al., 2018). All responses were collected via the ChatGPT web interface (https://chat.openai.com) between July 19 and August 5, 2024, using both ChatGPT 3.5 and ChatGPT 4o. Each prompt was entered in a new, independent session to avoid carryover effects from prior interactions. All data, analysis code, and research materials are available at https://osf.io/mpveb/?view_only=46d1031260f9435080c9880ac67968d5. Data were analyzed using R, version 4.3.2 (R Core Team, 2013). This study's design and its analysis were not pre-registered.

Supplemental Material

sj-docx-1-pac-10.1177_18344909251355673 - Supplemental material for The cultural stereotype and cultural bias of ChatGPT

Supplemental material, sj-docx-1-pac-10.1177_18344909251355673 for The cultural stereotype and cultural bias of ChatGPT by Hang Yuan, Zhongyue Che, Yue Zhang, Shao Li, Xianger Yuan, Liqin Huang, Xiaomeng Hu, Kaiping Peng and Siyang Luo in Journal of Pacific Rim Psychology

Footnotes

Author Note

Siyang Luo is now at the Department of Psychology, Guangdong Provincial Key Laboratory of Social Cognitive Neuroscience and Mental Health, Guangdong Provincial Key Laboratory of Brain Function and Disease, Sun Yat-Sen University.

Author contributions

Siyang Luo designed the study; Hang Yuan, Zhongyue Che, Yue Zhang, Shao Li, Xianger Yuan, and Siyang Luo conducted the experiments; Hang Yuan, Zhongyue Che, Yue Zhang, Shao Li, Xianger Yuan, and Siyang Luo analyzed the data; Hang Yuan, Zhongyue Che, Yue Zhang, Shao Li, Xianger Yuan, Liqin Huang, Xiaomeng Hu, Kaiping Peng, and Siyang Luo wrote the manuscript. All authors commented on the manuscript.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (grant number 32371125, 32071081).

ORCID iDs

Hang Yuan

Siyang Luo

Supplemental Material

Supplemental material for this article is available online.

References

Acerbi

Stubbersfield

J. M.

(2023). Large language models show human-like content biases in transmission chain experiments. Proceedings of the National Academy of Sciences, 120(44), e2313790120. https://doi.org/10.1073/pnas.2313790120

Agarwal

Bjarnadottir

Rhue

Dugas

Crowley

Clark

Gao

(2023). Addressing algorithmic bias and the perpetuation of health inequities: An AI bias aware framework. Health Policy and Technology, 12(1), 100702. https://doi.org/10.1016/j.hlpt.2022.100702

Appelbaum

Cooper

Kline

R. B.

Mayo-Wilson

Nezu

A. M.

Rao

S. M.

(2018). Journal article reporting standards for quantitative research in psychology: The APA publications and communications board task force report. American Psychologist, 73(1), 3. https://doi.org/10.1037/amp0000191

Atari

Xue

M. J.

Park

P. S.

Blasi

D. E.

Henrich

(2023, September 22). Which humans? https://doi.org/10.31234/osf.io/5b26t

Aupperle

R. L.

Sullivan

Melrose

A. J.

Paulus

M. P.

Stein

M. B.

(2011). A reverse translational approach to quantify approach–avoidance conflict in humans. Behavioral Brain Research, 225(2), 455–463. https://doi.org/10.1016/j.bbr.2011.08.003

Bai

Kadavath

Kundu

Askell

Kernion

Jones

, … & Kaplan

(2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint. https://doi.org/10.48550/arXiv.2212.08073

Baskara

F. R.

(2023). Chatgpt-assisted English language learning: Theoretical implications for global mobility and cross-cultural communication. In International Conference on Language and Language Teaching (pp. 105–120).

Bechara

Damasio

A. R.

Damasio

Anderson

S. W.

(1994). Insensitivity to future consequences following damage to human prefrontal cortex. Cognition, 50(1–3), 7–15. https://doi.org/10.1016/0010-0277(94)90018-3

Bender

E. M.

Gebru

McMillan-Major

Shmitchell

(2021). On the dangers of stochastic parrots: Can language models be too big?. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610–623). https://doi.org/10.1145/3442188.3445922

10.

Binz

Schulz

(2023). Using cognitive psychology to understand GPT-3. Proceedings of the National Academy of Sciences, 120(6), e2218523120. https://doi.org/10.1073/pnas.2218523120

11.

Blodgett

S. L.

Barocas

Daumé

III Wallach

(2020). Language (technology) is power: A critical survey of "bias" in NLP. arXiv preprint. https://doi.org/10.48550/arXiv.2005.14050

12.

Bolukbasi

Chang

K. W.

Zou

J. Y.

Saligrama

Kalai

A. T.

(2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems (Vol.29). Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2016/file/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf

13.

Bubeck

Chandrasekaran

Eldan

Gehrke

Horvitz

Kamar

Lee

Y. T.

Lundberg

Nori

Palangi

Ribeiro

M. T.

Zhang

(2023). Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv preprint. https://doi.org/10.48550/arXiv.2303.12712

14.

Buolamwini

J. A.

. (2017). Gender shades: Intersectional phenotypic and demographic evaluation of face datasets and gender classifiers [Doctoral dissertation]. Massachusetts Institute of Technology.

15.

Burks

S. V.

Carpenter

J. P.

Verhoogen

(2003). Playing both roles in the trust game. Journal of Economic Behavior & Organization, 51(2), 195–216. https://doi.org/10.1016/S0167-2681(02)00093-8

16.

Cao

Zhou

Lee

Cabello

Chen

Hershcovich

(2023). Assessing cross-cultural alignment between ChatGPT and human societies: An empirical study. In Proceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP) (pp. 53–67). Dubrovnik, Croatia: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.c3nlp-1.7

17.

Chen

Liu

T. X.

Shan

Zhong

(2023). The emergence of economic rationality of GPT. Proceedings of the National Academy of Sciences, 120(51), e2316205120. https://doi.org/10.1073/pnas.2316205120

18.

Chen

Qiu

Zhao

Han

Siponen

Mou

Xiao

(2022). When more is less: The other side of artificial intelligence recommendation. Journal of Management Science and Engineering, 7(2), 213–232. https://doi.org/10.1016/j.jmse.2021.08.001

19.

Coda-Forno

Binz

Wang

J. X.

Schulz

(2024). CogBench: A large language model walks into a psychology lab. Proceedings of the 41st International Conference on Machine Learning, 235, 9076–9108. https://proceedings.mlr.press/v235/coda-forno24a.html

20.

Dong

Zhunis

Chin

Han

Cha

(2024). I am not them: Fluid identities and persistent out-group bias in large language models. arXiv preprint. https://doi.org/10.48550/arXiv.2402.10436

21.

Duan

Chen

Yuan

Zhang

Luo

(2025). The psychological impact of reopening after COVID-19 lockdowns in China: Threat perception, affect and moral conflict. Social Science & Medicine, 365, 117611. https://doi.org/10.1016/j.socscimed.2024.117611

22.

Fehr

Fischbacher

(2004). Third-party punishment and social norms. Evolution and Human Behavior, 25(2), 63–87. https://doi.org/10.1016/S1090-5138(04)00005-4

23.

Fernández

Paez

González

J. L.

(2005). Independent and interdependent self-construals and socio-cultural factors in 29 nations. Revue Internationale de Psychologie sociale, 18(1), 35–63.

24.

Ferrara

(2023). Should ChatGPT be biased? Challenges and risks of bias in large language models. First Monday, 28(11). https://doi.org/10.5210/fm.v28i11.13346

25.

Fischbacher

Follmi-Heusi

(2013). Lies in disguise: An experimental study on cheating. Journal of the European Economic Association, 11(3), 525–547. https://doi.org/10.1111/jeea.12014

26.

Forsythe

Horowitz

J. L.

Savin

N. E.

Sefton

(1994). Fairness in simple bargaining experiments. Games and Economic Behavior, 6(3), 347–369. https://doi.org/10.1006/game.1994.1021

27.

Geng

Trotta

(2024). Are large language models chameleons? An attempt to simulate social surveys. arXiv preprint. https://doi.org/10.48550/arXiv.2405.19323

28.

Ghosh

(2011). Power distance in organizational contexts: A review of collectivist cultures. Indian Journal of Industrial Relations, 47(1), 89–101. https://www.jstor.org/stable/23070557

29.

Gilardi

Alizadeh

Kubli

(2023). ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30), e2305016120. https://doi.org/10.1073/pnas.2305016120

30.

Greene

J. D.

Morelli

S. A.

Lowenberg

Nystrom

L. E.

Cohen

J. D.

(2008). Cognitive load selectively interferes with utilitarian moral judgment. Cognition, 107(3), 1144–1154. https://doi.org/10.1016/j.cognition.2007.11.004

31.

Hamamura

(2012). Are cultures becoming individualistic? A cross-temporal comparison of individualism–collectivism in the United States and Japan. Personality and Social Psychology Review, 16(1), 3–24. https://doi.org/10.1177/1088868311411587

32.

Haxby

J. V.

Connolly

A. C.

Guntupalli

J. S.

(2014). Decoding neural representational spaces using multivariate pattern analysis. Annual Review of Neuroscience, 37, 435–456. https://doi.org/10.1146/annurev-neuro-062012-170325

33.

Henrich

Boyd

Bowles

Camerer

Fehr

Gintis

McElreath

(2001). Cooperation, reciprocity and punishment in fifteen small-scale societies. American Economic Review, 91(2), 73–78. https://doi.org/10.1257/aer.91.2.73

34.

Hofstede

(2001). Culture's consequences: Comparing values, behaviors, institutions and organizations across nations. Sage Publications.

35.

House

R. J.

Hanges

P. J.

Javidan

Dorfman

P. W.

Gupta

(2004). Culture, leadership, and organizations: The GLOBE study of 62 societies. Sage Publications.

36.

House

B. R.

Kanngiesser

Barrett

H. C.

Yilmaz

Smith

A. M.

Sebastian-Enesco

Erut

Silk

J. B.

(2020). Social norms and cultural diversity in the development of third-party punishment. Proceedings of the Royal Society B: Biological Sciences, 287(1925), 20192794. https://doi.org/10.1098/rspb.2019.2794

37.

Howard

Borenstein

(2018). The ugly truth about ourselves and our robot creations: The problem of bias and social inequity. Science and Engineering Ethics, 24(5), 1521–1536. ttps://doi.org/10.1007/s11948-017-9975-2

38.

Huang

Sun

Luo

(2022). The impact of individualism on the efficiency of epidemic control and the underlying computational and psychological mechanisms. Acta Psychologica Sinica, 54(5), 497. https://doi.org/10.3724/SP.J.1041.2022.00497

39.

Huang

J. T.

Wang

E. J.

Lam

M. H.

Ren

Yuan

Jiao

Lyu

M. R.

(2023). Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench. arXiv preprint. https://doi.org/10.48550/arXiv.2310.01386

40.

Inglehart

Oyserman

(2004). Individualism, autonomy and self-expression: The human development syndrome. In Comparing cultures (pp. 73–96). Brill.

41.

Ishii

Kurzban

(2008). Public goods games in Japan: Cultural and individual differences in reciprocity. Human Nature, 19(2), 138–156. https://doi.org/10.1007/s12110-008-9034-4

42.

Johnson

M. W.

Bickel

W. K.

(2002). Within-subject comparison of real and hypothetical money rewards in delay discounting. Journal of the Experimental Analysis of Behavior, 77(2), 129–146. https://doi.org/10.1901/jeab.2002.77-129

43.

Kito

Yuki

Thomson

(2017). Relational mobility and close relationships: A socioecological approach to explain cross-cultural differences. Personal Relationships, 24(1), 114–130. https://doi.org/10.1111/pere.12174

44.

Kleinman

Benson

(2006). Anthropology in the clinic: The problem of cultural competency and how to fix it. PLoS Medicine, 3(10), e294. https://doi.org/10.1371/journal.pmed.0030294

45.

Lamont

Beljean

Clair

(2014). What is missing? Cultural processes and causal pathways to inequality. Socio-Economic Review, 12(3), 573–608. https://doi.org/10.1093/SER/MWU011

46.

Langenbach

B. P.

Baumgartner

Cazzoli

Müri

R. M.

Knoch

(2019). Inhibition of the right dlPFC by theta burst stimulation does not alter sustainable decision-making. Scientific Reports, 9(1), 13852. https://doi.org/10.1038/s41598-019-50322-w

47.

Latané

Darley

J. M.

(1968). Group inhibition of bystander intervention in emergencies. Journal of Personality and Social Psychology, 10(3), 215–221. https://doi.org/10.1037/h0026570

48.

Luo

L. M. W.

Espina

Bond

M. H.

Lun

V. M. C.

Huang

, … & Liu

J. H.

(2023). Individual uniqueness in trust profiles and well-being: Understanding the role of cultural tightness–looseness from a representation similarity perspective. British Journal of Social Psychology, 62(2), 825–844. https://doi.org/10.1111/bjso.12599

49.

Luo

Yuan

Wang

Michael

H. B.

(2024). Culturomics: Taking the cross-scale, interdisciplinary science of culture into the next decade. Neuroscience and Biobehavioral Reviews, 167, 10542. https://doi.org/10.1016/j.neubiorev.2024.105942

50.

Markus

H. R.

Kitayama

(2014). Culture and the self: Implications for cognition, emotion, and motivation. In College student development and academic life (pp. 264–293). Routledge. https://doi.org/10.1037/0033-295X.98.2.224

51.

McCrae

R. R.

Terracciano

(2005). Universal features of personality traits from the observer's perspective: Data from 50 cultures. Journal of Personality and Social Psychology, 88(3), 547. https://doi.org/10.1037/0022-3514.88.3.547

52.

Mei

Xie

Yuan

Jackson

M. O.

(2024). A Turing test of whether AI chatbots are behaviorally similar to humans. Proceedings of the National Academy of Sciences, 121(9), e2313925121. https://doi.org/10.1073/pnas.2313925121

53.

Motoki

Neto

V. P.

Rodrigues

(2023). More human than human: Measuring ChatGPT political bias. Public Choice, 198, 3–23. https://doi.org/10.1007/s11127-023-01097-2

54.

Navigli

Conia

Ross

(2023). Biases in large language models: Origins, inventory, and discussion. ACM Journal of Data and Information Quality, 15(2), 1–21. https://doi.org/10.1145/3597307

55.

Nosek

B. A.

Banaji

M. R.

Greenwald

A. G.

(2002). Harvesting implicit group attitudes and beliefs from a demonstration web site. Group Dynamics: Theory, Research, and Practice, 6(1), 101–115. https://doi.org/10.1037/1089-2699.6.1.101

56.

Nowak

M. A.

Page

K. M.

Sigmund

(2000). Fairness versus reason in the ultimatum game. Science, 289(5485), 1773–1775. https://doi.org/10.1126/science.289.5485.1773

57.

ÓhÉigeartaigh

S. S.

Whittlestone

Liu

Zeng

Liu

(2020). Overcoming barriers to cross-cultural cooperation in AI ethics and governance. Philosophy & Technology, 33, 571–593. https://doi.org/10.1007/s13347-020-00402-x

58.

Oishi

Graham

(2010). Social ecology: Lost and found in psychological science. Perspectives on Psychological Science, 5(4), 356–377. https://doi.org/10.1177/1745691610374588

59.

Olli

Swedlow

(2023). Cultural theory, rejection of cultural bias, and party preference. Party Politics, 29(2), 359–373. https://doi.org/10.1177/13540688211071065

60.

Ornelas

M. L.

Smith

G. B.

Mansouri

(2023). Redefining culture in cultural robotics. AI & SOCIETY, 38(2), 777–788. https://doi.org/10.1007/s00146-022-01476-1

61.

Ostroot

N. M.

Snyder

W. W.

(1985). Measuring cultural bias in a cross-national study. Social Indicators Research, 17, 243–251. https://doi.org/10.1007/BF00319313

62.

Peng

Luo

(2021). Impact of social economic development on personality traits among Chinese college students: A cross-temporal meta-analysis, 2001–2016. Personality and Individual Differences, 171, 110461. https://doi.org/10.1016/j.paid.2020.110461

63.

Peng

Luo

(2022). Impact of social economic development on positive and negative affect among Chinese college students: A cross-temporal meta-analysis, 2001–2016. Culture and Brain, 10(2), 95–112. https://doi.org/10.1007/s40167-022-00109-0

64.

Perez

Ringer

Lukosiute

Nguyen

Chen

Heiner

… & Kaplan

(2023). Discovering language model behaviors with model-written evaluations. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 13387–13434).

65.

Pursiainen

(2022). Cultural biases in equity analysis. The Journal of Finance, 77(1), 163–211. https://doi.org/10.1111/jofi.13095

66.

R Core Team. (2013). R: A language and environment for statistical computing. Vienna, Austria: Foundation for Statistical Computing.

67.

Santos

F. C.

Santos

M. D.

Pacheco

J. M.

(2008). Social diversity promotes the emergence of cooperation in public goods games. Nature, 454(7201), 213–216. https://doi.org/10.1038/nature06940

68.

Shafir

Tversky

(1992). Thinking through uncertainty: Nonconsequential reasoning and choice. Cognitive Psychology, 24(4), 449–474. https://doi.org/10.1016/0010-0285(92)90015-T

69.

Stewart

R. A.

Powell

G. E.

Chetwnd

S. J.

(1979). Person perception and stereotyping. Saxon House. https://doi.org/10.1093/sf/59.3.856

70.

Tao

Viberg

Baker

R. S.

Kizilcec

R. F.

(2024). Cultural bias and cultural alignment of large language models. PNAS nexus, 3(9), 346. https://doi.org/10.1093/pnasnexus/pgae346

71.

Thomson

Yuki

Talhelm

Schug

Kito

Ayanian

A. H.

Becker

J. C.

Becker

Chiu

Choi

H.-S.

Ferreira

C. M.

Fülöp

Gul

Houghton-Illera

A. M.

Joasoo

Jong

Kavanagh

C. M.

Khutkyy

Manzi

Visserman

M. L.

(2018). Relational mobility predicts social behaviors in 39 countries and is tied to historical farming and threat. Proceedings of the National Academy of Sciences, 115(29), 7521–7526. https://doi.org/10.1073/pnas.1713191115

72.

Trautman

L. J.

Voss

W. G.

Shackelford

S. J.

(2025). How We Learned to Stop Worrying and Love AI: Analyzing the Rapid Evolution of Generative Pre-Trained Transformer (GPT) and Its Impacts on Law, Business, and Society. Albany Law Journal of Science & Technology, 34(3), 203–277.

73.

Venaik

Brewer

(2016). National culture dimensions: The perpetuation of cultural ignorance. Management Learning, 47(5), 563–589. https://doi.org/10.1177/1350507616629356

74.

Waldrop

M. M.

(2024). Can ChatGPT help researchers understand how the human brain handles language? Proceedings of the National Academy of Sciences, 121(25), e2410196121. https://doi.org/10.1073/pnas.2410196121

75.

Webster

C. S.

Taylor

Thomas

Weller

J. M.

(2022). Social bias, discrimination and inequity in healthcare: Mechanisms, implications and recommendations. BJA Education, 22(4), 131–137. https://doi.org/10.1016/j.bjae.2021.11.011

76.

Yang

Zhang

Lyu

Tang

(2023). Group bias under uncertain environment: A perspective of third-party punishment. Acta Psychologica, 237, 103957. https://doi.org/10.1016/j.actpsy.2023.103957

77.

Yuan

Luo

(2024). Representation similarity analysis—A new perspective to study socio-cultural change: Take the mental health of the elderly as an example. Acta Psychologica Sinica, 56(7), 938–953. https://doi.org/10.3724/SP.J.1041.2024.00938

78.

Zewail

Figueroa

Graham

Atari

(2024, December 13). Moral stereotyping in large language models. https://doi.org/10.31234/osf.io/t9 × 8r

79.

Zhang

Yuan

Che

Luo

(2025). The high dimensional psychological profile of ChatGPT. Science China Technological Sciences, 182401. https://doi.org/10.1007/s11431-025-2934-8

80.

Zhao

Wang

Yatskar

Ordonez

Chang

K. W.

(2018). Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Vol. 2 (Short Papers), pp. 15–20). New Orleans, Louisiana: Association for Computational Linguistics.

81.

Zhong

Yun

Sun

(2024). Cultural value differences of LLMs: Prompt, language, and model size. arXiv preprint. https://doi.org/10.48550/arXiv.2407.16891.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.10 MB