Abstract
Personalizing product innovation methodologies to align with user needs is essential in China’s rapidly evolving electronic manufacturing industry. Despite this necessity, there remains limited understanding of how advanced technologies can be effectively leveraged to create consumer personas that inform successful product development. The present study examines the impact of the CoPersona methodology, a novel approach that integrates human expertise with Large Language Models (LLMs), on the innovation process within this competitive sector. Through a detailed case study of a mid-sized manufacturer, B.Co, we investigated the use of CoPersona in the design and marketing of a bedside lamp. By analyzing over 38 million posts from the RedNote social network, we examined users’ daily life behaviors in bedroom contexts, leading to the creation of five distinct personas that informed key aspects of the product’s design. By integrating GPT-4 with expert human review, this methodology enabled B.Co to translate consumer pain points into 13 actionable product design directives, achieving a commercial adoption rate of 69.2%. Our findings suggest that the CoPersona methodology enhances the ability to process large datasets while maintaining the critical understanding needed for effective product design. We provide insights into how this hybrid approach can be utilized to personalize product and marketing strategies, offering valuable recommendations for manufacturers aiming to achieve market success through consumer-driven innovation.
Keywords
Introduction
In the field of human-computer interaction (HCI), personas are widely acknowledged as a powerful technique for fostering user understanding and promoting user-centered design. These fictional yet realistic representations of user segments (An et al., 2018) are invaluable for researchers and practitioners across various industries and application domains. Personas aid in user-centric decision-making processes in sectors such as software development (Watanabe et al., 2021), design (Salminen et al., 2021), e-health (Solem et al., 2020), marketing and advertising (Aimé et al., 2022), cyber-persona identification (Racherache et al., 2023), video games (Vandewalle et al., 2023), online news (Jansen et al., 2021), and recommender systems (Duan et al., Lawrence 2022). In the current era of personified big data (Salminen et al., 2022), personas are essential for segmenting diverse online user populations and for transcending traditional segmentation methods (Jansen et al., 2021) to foster empathetic understanding among stakeholders.
The evolution of the user persona field has shifted from traditional ontology-based methods to deep learning techniques. The concept of user personas was first proposed by Cooper (1999). Gauch et al. (2003) established the earliest ontology-based traditional user persona method, which relies on knowledge graphs and ontologies to construct user personas. Data-driven User Persona Development evolved from initial quantification efforts (2005–2008) through methodological diversification (2009–2014) to the current digitalization phase (2015–present), characterized by social media analytics and advanced computational methods (Salminen et al., 2021). Guntuku et al. (2016) proposed a method for establishing user personas using deep learning models, making a significant leap forward in the field. Deep learning offers stronger user representation capabilities, simplifies modeling processes, improves accuracy, handles multimodal data, and reduces iteration costs. The C-HMCNN (Coherent Hierarchical Multi-Label Classification Networks) framework demonstrated its applicability to structured label prediction, while lookalike technology was used for ad targeting and user attribute inference (Kikuchi & Takahashi, 2021; Ravichandran & Rao, 2022). Active learning combined with Bayesian networks was used for low-cost persona iteration, and large models’ world knowledge was utilized to enhance persona annotation and predictive capabilities, marking significant milestones in the development of the user persona field.
Noteworthy at this stage is the emergence of deep learning technology. Tan et al. (2022) propose a method for creating and validating predictive personas (PPs) to leverage large model technology, combining quantitative and qualitative data to enhance the effectiveness and targeting of marketing strategies. Large models trained on extensive open-domain knowledge can fill gaps in closed systems, enabling more accurate persona annotation and prediction. They provide a high-quality abstraction of the world’s conceptual system, which is well-suited to the construction of personas and tagging systems. Therefore, large-scale modeling technology may enhance the ability to combine analytical methods, thereby improving the accuracy and timeliness of user personas and offering powerful tools for personalized services and decision support.
This study examines the implementation of a novel persona development method through a case study of B.Co, a prominent electrical manufacturer in China’s competitive market. The study proposes a hybrid method that synergizes LLMs (Large Language Models), specifically GPT, with human expertise to generate valuable user personas for business strategy optimization. This method is particularly relevant in the Chinese consumer landscape, where social media-driven purchase behaviors, especially through detailed product reviews (“Ceping” posts), significantly influence electronic word of mouth (eWOM) and consumer decision-making. Through this investigation, this study aims to bridge the gap between advanced computational capabilities and practical business applications in user-centered design.
Building on this background, this study poses two research questions: (1) What methodological approaches can be developed to enhance the precision of social media data collection, focusing on relevance filtering during acquisition rather than post-collection cleaning? (2) To what extent can the derived persona schema be operationalized via similarity-based classification to reliably assign large-scale user-generated content (UGC) to persona categories, as validated against expert annotations?
Related Study
Data-Driven Persona Development
Persona in Product Design
Personas are essential to the product design process, as they provide a detailed representation of target users, enabling designers to make informed decisions throughout the product’s life cycle (Howard & Baines, 2023). By leveraging data from sources such as customer feedback, market research, and online interactions, personas accurately capture user requirements and preferences during the requirements analysis phase (Daga et al., 2022). This ensures that the product attributes align with what customers truly want. During the conceptual design stage, personas guide the creation of data-driven models by combining functional requirements with relevant user insights, resulting in solutions that effectively meet user needs. In the detailed design phase, personas facilitate the modeling and verification of product solutions, ensuring that the final product resonates with the intended users. Overall, integrating personas into product design fosters a user-centered approach, enhancing the relevance and usability of the final product (Tan et al., 2022).
Persona for Digital Marketing
In digital marketing, personas are instrumental in tailoring strategies to effectively reach and engage target customers. By leveraging data-driven personas, marketers can gain deep insights into user behaviors, preferences, and needs, enabling more precise targeting and personalization. Personas help define digital marketing use cases, such as channel selection (Cruz & Karatzas, 2020), where understanding customer behavior across channels informs platform selection to engage potential customers (Jansen et al., 2022). They also play a crucial role in creating messages, ensuring that marketing communications resonate with specific user segments. Furthermore, personas assist in budget allocation by identifying the most effective channels and strategies to invest in, thereby optimizing marketing spend. Dynamic pricing strategies and personalized recommendations are enhanced through persona insights, leading to improved conversion rates (Dolbec, 2023). Overall, the use of personas in digital marketing enables a more focused, effective approach, ensuring marketing efforts align with the specific needs and behaviors of the target audience.
Methods of Data-Driven Persona Development
Various methodologies have been applied in data-driven persona development (DDPD). In this study, we group them by how user features are abstracted: qualitative data analysis, discriminative models, and generative models. Regardless of the methodology applied, DDPD follows the same overall process: data collection, feature abstraction, data analysis, persona construction, and evaluation.
Qualitative Data Analysis
In the context of Data-Driven Persona Development (DDPD), qualitative data analysis plays a crucial role, especially when it involves manually abstracting persona features from a small sample size. Despite the challenges associated with high dependency on structured interview questions and the complexity of dimensionality reduction techniques, qualitative approaches provide rich, contextual insights that quantitative methods often lack (Edberg & Beck, 2020). For instance, Cooper’s widely adopted method for qualitative persona development involves several meticulous stages, such as identifying behavioral variables, mapping interview subjects to these variables, and synthesizing characteristics and goals (Colin, 2020). These stages ensure a systematic approach to transforming qualitative data into actionable personas. Specifically, the process begins by quantifying interview data using visual analog scales (VAS) to map behaviors, followed by identifying significant patterns through segmentation. Korsgaard et al. (2020) manually extracted behavioral variables from interviews, quantified them on scales, and then used subspace clustering algorithms to identify optimal user groupings that form the basis for personas. While qualitative methods provide rich, contextual insights into user behavior (Nascimento et al., 2022), they face significant challenges when dealing with the vast amounts of data generated by social media platforms, making it difficult to effectively analyze and incorporate millions of data points into the persona development process.
Discriminative Models
Discriminative models for persona development apply various algorithms to generate user personas from data, each with unique strengths and limitations. From the reviewed articles, five commonly used algorithms for persona generation were identified (Jansen et al., 2021): Clustering Analysis (CA), Principal Component Analysis (PCA), Latent Semantic Analysis (LSA), Non-Negative Matrix Factorization (NMF), and Latent Dirichlet Allocation (LDA).
Clustering Analysis (CA) has emerged as a fundamental technique in user segmentation. An et al. (2016) proposed a clustering method based on social media data for real-time generation and updating of user personas. Their approach involves feature extraction and vectorization of user behavior data, followed by an improved K-means clustering algorithm. The study demonstrated that weighting features using tf-idf and determining the optimal number of clusters using the elbow method effectively generate user groups with similar behavioral patterns, forming representative personas. Notably, integrating data from multiple social media platforms enriched the resulting personas, thereby enhancing their accuracy and comprehensiveness.
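The general pattern described above (TF-IDF weighting, K-means clustering, and an elbow check on the number of clusters) can be illustrated with a minimal sketch. This is not the improved algorithm of An et al. (2016); it is plain Lloyd's K-means on toy data, and the documents, function names, and deterministic initialization are assumptions made for illustration only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into dense TF-IDF vectors over a shared vocabulary."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([(tf[t] / len(doc)) * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_inertia(points, k, iters=20):
    """Plain Lloyd's k-means, initialized with the first k points; returns the inertia."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

# Toy, already-tokenized "user behavior" documents (hypothetical data).
docs = [
    ["sleep", "lamp", "night"],
    ["sleep", "lamp", "dim"],
    ["desk", "work", "lamp"],
    ["desk", "work", "late"],
]
vocab, vectors = tfidf_vectors(docs)

# Elbow method: inspect how the inertia falls as k grows and pick the bend.
inertia_by_k = {k: kmeans_inertia(vectors, k) for k in range(1, len(docs) + 1)}
```

In practice the inertia values would be plotted against k, and the value of k after which the curve flattens would be taken as the number of user groups.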
Principal Component Analysis (PCA) has been applied in data-driven web user persona development to address the challenges of high-dimensional data. Mijac et al. (2018) utilized PCA as a dimensionality reduction technique, simplifying complex data while preserving essential features. This approach facilitates the analysis of vast datasets, capturing authentic user behavior patterns to inform business decisions. However, the researchers noted limitations in PCA’s effectiveness when dealing with highly complex or non-linear data structures.
Latent Semantic Analysis (LSA) has shown promise in quantifying qualitative data for persona creation. Korsgaard et al. (2020) proposed a method combining user stereotype modeling with a semi-automatic subspace clustering algorithm. Their approach employs LSA to extract concepts through Singular Value Decomposition (SVD), followed by clustering based on similarity scores. This method demonstrates the potential of LSA in bridging qualitative insights with quantitative analysis in persona development.
Non-Negative Matrix Factorization (NMF) has been effectively applied to analyze user behavior patterns. Jung et al. (2020) used NMF to identify behavioral patterns among YouTube content creators, integrating these insights with demographic data to generate user personas. Their research highlighted the dynamic nature of user personas and the need for continuous updating to keep them relevant.
Latent Dirichlet Allocation (LDA) has proven valuable in topic modeling for persona development. An et al. (2018) used LDA to extract topics from video titles and linked them to user viewing behavior. This approach enabled the description and differentiation of user personas based on interest patterns, facilitating personalized content recommendations. Similarly, Yan et al. (2018) applied LDA to mine interests from mobile users’ browsing records, treating URLs as words and browsing histories as documents. The resulting topic models were used to generate user feature vectors, which were then clustered using K-means to describe similarities and individual differences among users.
Despite their utility, these algorithms often require supplementary qualitative data to enrich the personas and ensure they are contextually relevant and actionable. The integration of qualitative insights helps to overcome the limitations of purely quantitative approaches, providing a more comprehensive view of user behaviors and preferences.
Generative Models
Recent research has shown significant advancements in the use of artificial intelligence, particularly Large Language Models (LLMs), for persona development in Human-Computer Interaction (HCI) and User Experience (UX) research. These studies demonstrate the potential of AI to enhance the efficiency, depth, and diversity of persona creation processes. Bhandari et al. (2025) employed a multi-faceted approach combining k-means clustering and TF-IDF analysis to extract user behavior patterns and pain points from operational logs. They further utilized ChatGPT to generate attributes that are difficult to extract directly from user logs, such as names and ages, and employed Face Generator to create visual representations of user personas. Salminen et al. (2024) investigated the diversity and potential biases in personas generated by LLMs. They showed that LLM-generated personas were evaluated as informative, credible, and relevant, though some stereotypical representations and a lack of realism were noted. Barambones et al. (2024) explored the feasibility of using ChatGPT to simulate user interviews for persona creation in educational contexts, although further improvements in prompt design and model configuration are still needed to enhance response quality and variety.
LLM in Persona Development
The advent of social media platforms has revolutionized persona development by providing unprecedented access to rich, authentic user data. Research indicates that social media platforms serve as valuable repositories of user-generated content (UGC), offering insights into users’ behaviors, preferences, and demographic characteristics (Jiang & Ferrara, 2023). Using social media data for personas generally follows a process of collecting, processing, and analyzing, with slight variations in specific tasks (Hu et al., 2024). Scholars have demonstrated that machine learning techniques, particularly Large language models (LLMs), can effectively leverage this vast corpus of social media data, including posts, comments, and engagement metrics, to facilitate sophisticated user segmentation and behavioral prediction (Mustak et al., 2021).
The analysis of UGC using LLM methodologies has proven particularly effective for various analytical tasks, including feature extraction (Rocchi & Pescatore, 2022), classification (An et al., 2018), prediction (Majima & Markov, 2022), and persona generation (Tan et al., 2022). Using LLMs in persona development can help businesses better understand their target users, leading to more precise decisions in product design (Tatlisu & Turan, 2022), content creation (Lopes & Casais, 2022), and marketing strategies (Aimé et al., 2022). For instance, by combining quantitative and qualitative analysis methods, social media data can be used to generate product personas (Tan et al., 2022) or to build social media data analysis tools (Jung et al., 2018). Additionally, social media data can assist persona construction by enriching data through topic detection, thereby defining personas more accurately (Spiliotopoulos et al., 2020). Analyzing user interaction data on social media can also identify different user behavior segments and influential user groups (An et al., 2018), and integrating data from various social media platforms can capture customer reactions in real time (An et al., 2016).
Large language models (LLMs) can remove noise, standardize text formatting, and extract key information from social media data (Lu et al., 2023). Examples include revealing users’ socio-psychological characteristics, such as vocabulary use in Facebook messages related to personality, gender, and age (Schwartz et al., 2013), and automatically generating personas, including names, photos, and personal attributes, for each group (Jung et al., 2018). Notable implementations include the GPT-4-based PersonaGen tool (Zhang et al., 2023), which facilitates automated persona generation through systematic attribute classification and user feedback analysis. However, critical examination of existing research reveals significant limitations in addressing diversity and bias characteristics (Salminen et al., 2024) within LLM-generated personas (Cheng et al., 2023). Studies have highlighted potential ethnic mismatches and challenges in demographic representation (Kocaballi, 2023), while others have emphasized the complexities of ensuring representative user perspectives in generated personas (Hong et al., 2023). Though the “Marked Personas” framework represents a significant advancement in addressing these limitations (Cheng et al., 2023), comprehensive diversity assessment and subject-matter expert validation of LLM-generated personas remain a research gap.
Research Aims
Personas, as methodologically constructed representations of target users, serve as instrumental tools in Human-Computer Interaction (HCI) research by providing systematic insights into user needs, behavioral patterns, and preference frameworks (Salminen et al., 2024). In the contemporary digital marketing landscape, organizations increasingly recognize the strategic value of persona development as a methodological approach to inform product design decisions.
This investigation aims to develop and validate an innovative methodological framework for generating demographically representative user personas by systematically integrating Large Language Models (LLMs) with expert validation protocols. This research seeks to establish a robust approach for data-driven persona development that addresses current methodological limitations while providing actionable insights for product innovation.
Methods
This study selects RedNote as the local social media platform. RedNote is one of the most prominent user-generated content (UGC) platforms in China, where users actively share “notes” (posts) that combine textual narratives and images documenting their daily lives (Gendered Chronotopes on Social Media Through the Lens of Small Stories and Positioning Analysis: The Case of the “Pretty Girl” on Xiaohongshu (RedNote), 2024). Such content provides rich and ecologically valid data for examining users’ behavioral patterns, lifestyle practices, and preference structures.
Data were collected using a systematic keyword-based crawling strategy. Specifically, we constructed 47 × 6 = 282 distinct search keyword combinations derived from the persona dimensions defined in this study. All publicly accessible posts retrievable through the platform’s search interface were collected. The resulting dataset spans the period from October 1, 2016, to March 1, 2024, as illustrated in Figure 1, and captures long-term trends and platform evolution.
Figure 1. Annual growth of collected RedNote posts from 2016 to 2024. The vertical axis is presented on a log10 scale to better visualize the rapid increase in post volume over time.
Overview of the “CoPersona” Approach
We propose CoPersona, a human-AI collaborative framework for data-driven persona development from large-scale social media data, with the entire three-phase process shown in Figure 2. The method embeds the theoretical relationship between product attributes, user behaviors, and usage scenarios directly into the data-acquisition stage via keyword design. Specifically, six bedroom-related scenarios were combined with forty-seven bedside lighting attributes to construct 282 search queries, yielding 31,110 relevant posts (D1). After user-level deduplication, 22,039 unique users were identified, from whom up to 20 recent posts per user were collected to form a lifestyle-level corpus of 389,324 posts (D2-1).
Figure 2. The three phases of the CoPersona process.
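The keyword design can be illustrated as a Cartesian product of scenarios and attributes. The sketch below uses placeholder labels, because the actual scenario and attribute terms are not listed in this section; only the counts (6 and 47) come from the text.

```python
from itertools import product

# Placeholder labels (assumptions): the real scenario and attribute terms are not given here.
scenarios = [f"scenario_{i}" for i in range(1, 7)]      # 6 bedroom-related scenarios
attributes = [f"attribute_{j}" for j in range(1, 48)]   # 47 bedside-lighting attributes

# One search query per (scenario, attribute) pair.
queries = [f"{scenario} {attribute}" for scenario, attribute in product(scenarios, attributes)]

print(len(queries))  # 282
```

Pairing every scenario with every attribute is what ties each retrieved post to both a usage context and a product feature at acquisition time.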
To avoid over-representing highly active users, user-normalized weighting was applied so that each user contributed equal total weight regardless of post count. A stratified sample of posts was then analyzed using GPT-4 to generate candidate persona categories and behavioral features, which were iteratively refined through expert review to produce predefined SME persona classifications. Finally, these classifications were applied to the full corpus using NLP-based similarity matching and user-level aggregation, yielding five stable lifestyle-oriented personas.
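One simple way to realize the user-normalized weighting described above is to give every post a weight equal to the reciprocal of its author's post count, so that the weights of each user sum to one. The exact weighting formula is not stated in the text; the function name, field names, and toy records below are assumptions.

```python
from collections import Counter, defaultdict

def user_normalized_weights(posts):
    """Assign each post the weight 1 / (number of posts by the same user)."""
    counts = Counter(post["user_id"] for post in posts)
    return [1.0 / counts[post["user_id"]] for post in posts]

# Hypothetical records: user u1 is far more active than user u2.
posts = [
    {"user_id": "u1", "text": "post a"},
    {"user_id": "u1", "text": "post b"},
    {"user_id": "u1", "text": "post c"},
    {"user_id": "u1", "text": "post d"},
    {"user_id": "u2", "text": "post e"},
]
weights = user_normalized_weights(posts)

# Total weight per user is the same regardless of how many posts the user wrote.
total_by_user = defaultdict(float)
for post, weight in zip(posts, weights):
    total_by_user[post["user_id"]] += weight
```

With these weights, any weighted statistic or weighted sample drawn from the corpus treats a user with four posts and a user with one post as equally important.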
Phase One: Data Acquisition and Enrichment
Phase One implemented a systematic two-tier data-acquisition strategy to capture both product-specific interactions and broader lifestyle-level behavioral patterns, which are essential for constructing robust, context-aware user personas. The process comprised three stages (Figure 2): Initial Data Collection (D1), User-Level Deduplication (D1-1), and Lifestyle-Level Enrichment (D2-1) (Figure 3).
Figure 3. Three stages of D1 and D2-1 data screening and enrichment.
282 (6 × 47) Keyword Combinations
Using this keyword set, we retrieved 31,110 posts, forming dataset D1. These posts were subsequently cleaned and deduplicated based on user identifiers, yielding 22,039 unique users (D1-1). This user-level refinement ensured that each individual was represented once in the core sample, preventing early-stage over-representation by prolific content creators and establishing a user-centered foundation for persona construction.
To capture lifestyle-level behavioral patterns beyond isolated product mentions, we conducted an Extended Data Collection stage based on D1-1. For each identified user, up to 20 of their most recent posts were collected from their profile pages, along with publicly available engagement metrics and profile metadata. Duplicate posts were removed based on post identifiers and textual similarity, while missing values were handled through removal or imputation. Rule-based authenticity checks were applied to exclude commercial or promotional accounts. This process resulted in a lifestyle-level corpus (D2-1) comprising 389,324 daily life posts. Although 20 posts were targeted per user, a proportion of users had fewer than 20 retrievable public posts due to low activity levels, privacy restrictions, or content removal, so the final corpus falls below the ceiling of 440,780 posts implied by 22,039 users × 20 posts. By expanding from product-related content to users’ broader everyday expressions, this phase enabled persona development to be grounded in stable lifestyle patterns rather than isolated product interactions.
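The user-level deduplication and the cap of 20 recent posts per user can be sketched as follows. The record fields (`user_id`, `post_id`, `created_at`) and the toy data are assumptions for illustration; the textual-similarity deduplication, imputation, and authenticity checks described above are not reproduced here.

```python
from collections import defaultdict

MAX_POSTS_PER_USER = 20  # cap stated in the text

def unique_users(d1_posts):
    """User-level deduplication: keep each author once, in first-seen order."""
    return list(dict.fromkeys(post["user_id"] for post in d1_posts))

def recent_posts_per_user(profile_posts, users, limit=MAX_POSTS_PER_USER):
    """Keep up to `limit` most recent posts per user, dropping repeated post ids."""
    wanted = set(users)
    by_user = defaultdict(dict)
    for post in profile_posts:
        if post["user_id"] in wanted:
            by_user[post["user_id"]].setdefault(post["post_id"], post)
    corpus = []
    for user in users:
        newest_first = sorted(
            by_user[user].values(), key=lambda p: p["created_at"], reverse=True
        )
        corpus.extend(newest_first[:limit])
    return corpus

# Hypothetical data: user "a" appears twice in D1 and has 25 profile posts; "b" has 3.
d1 = [{"user_id": "a"}, {"user_id": "b"}, {"user_id": "a"}]
profile = [
    {"user_id": "a", "post_id": f"a{i}", "created_at": f"2024-01-{i:02d}"}
    for i in range(1, 26)
] + [
    {"user_id": "b", "post_id": f"b{i}", "created_at": f"2023-12-{i:02d}"}
    for i in range(1, 4)
]

users = unique_users(d1)
corpus = recent_posts_per_user(profile, users)
```

Because less active users contribute fewer than 20 posts, the corpus size is bounded above by, but generally smaller than, 20 times the number of users.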
Phase Two: Human-AI Collaborative Feature Extraction
Phase Two employed a human-AI collaborative approach to extract interpretable persona categories and characteristic behavioral features from user-generated content.
Expert refinement prompts based on the CO-STAR framework
The analytical process followed a structured iteration cycle: (1) Initial analysis: sampled posts from D2-1 were processed with GPT-4 to generate preliminary classification schemes and extract characteristic features. (2) Human review: a human expert reviewed the features of each persona classification scheme to ensure that they were relevant to the persona and informative for interpreting the scheme. (3) Refinement: the expert refined the prompts, adjusted the classification parameters, and improved feature extraction accuracy, repeating the cycle until GPT-4 produced acceptable classification schemes and characteristic features.
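To make the prompt-refinement step more concrete, the sketch below assembles a prompt from the six CO-STAR sections (Context, Objective, Style, Tone, Audience, Response). The section wording is entirely hypothetical and is not the prompt used in this study; the helper name `build_costar_prompt` is likewise an assumption.

```python
COSTAR_FIELDS = ("context", "objective", "style", "tone", "audience", "response")

def build_costar_prompt(**sections):
    """Assemble a prompt with one labeled block per CO-STAR section, in order."""
    missing = [name for name in COSTAR_FIELDS if name not in sections]
    if missing:
        raise ValueError(f"missing CO-STAR sections: {missing}")
    return "\n\n".join(
        f"# {name.upper()}\n{sections[name].strip()}" for name in COSTAR_FIELDS
    )

# Hypothetical wording for a first-pass persona extraction prompt.
prompt = build_costar_prompt(
    context="You are analyzing a sample of lifestyle posts written by social media users.",
    objective="Propose candidate persona categories and the behavioral features of each.",
    style="Analytical, in the manner of a user researcher's coding memo.",
    tone="Neutral and descriptive.",
    audience="Product designers and domain experts who will review the categories.",
    response="A list of categories, each with a name and a short list of feature keywords.",
)
```

During each review round, an expert would edit individual sections (for example, tightening the objective or the response format) and rerun the model, which keeps the changes between iterations easy to track.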
This collaborative process resulted in four key outputs: (1) a set of predefined persona categories grounded in observed lifestyle patterns, (2) characteristic behavioral keywords defining each persona, (3) criteria for small-sample validation, and (4) optimized prompts capable of producing consistent and reproducible feature abstractions. Importantly, this phase was conducted at the user level rather than the post level, ensuring that persona definitions reflected coherent behavioral profiles instead of isolated textual instances.
Phase Three: Similarity-Based Iterative Posts Classification
In Phase Three, the objective is not to discover new user groups, but to assign large-scale user-generated posts to predefined persona categories derived from Phase Two. Based on the persona schema and characteristic feature sets collaboratively constructed by LLMs and domain experts in Phase Two, this phase implements a similarity-based classification pipeline to operationalize persona assignment at scale.
Tokenization and Vector Transformation
To prepare the posts (D2-1) for natural language processing, we first applied tokenization using Jieba, a widely recognized Chinese segmentation tool. This step decomposes the posts into a collection of word tokens:
After removing stop words, a filtered set of tokens was generated, as defined in Equation (1), where S denotes the stop-word set and W_filtered represents the cleaned token set.
Subsequently, each word was transformed into a high-dimensional semantic vector using the Tencent AI Lab Embedding Corpus Dataset (Song et al., 2018). This ensures that each token captures rich contextual semantics, as shown in Equation (2).
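The equations referenced in this subsection are not reproduced in this text. One form consistent with the definitions given (S as the stop-word set, W_filtered as the cleaned token set) is sketched below. The symbols w_i, n, E (the pretrained embedding lookup), d (the embedding dimension), and v_i are notational assumptions rather than the authors' own notation.

```latex
\begin{align}
W &= \{ w_1, w_2, \dots, w_n \} \notag \\
W_{\text{filtered}} &= \{\, w \in W \mid w \notin S \,\} \tag{1} \\
\mathbf{v}_i &= E(w_i) \in \mathbb{R}^{d}, \quad w_i \in W_{\text{filtered}} \tag{2}
\end{align}
```

Here the first line is the token collection produced by segmentation, Equation (1) removes stop words, and Equation (2) maps each remaining token to its embedding vector.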
Comment Classification
In this stage, we assign personas using weighted cosine similarity rather than unsupervised clustering. Each post is evaluated against a set of SME-validated persona-feature representations, which serve as prototype vectors for each persona category.
First, for each post, the semantic similarity between its token vectors and persona-specific feature vectors is computed using cosine similarity (Equation (4)).
Next, the similarity scores are aggregated using a weighted scoring mechanism (Equation (5)) to determine the overall relevance of the posts to each predefined SME persona classification.
Finally, the post is assigned to the category with the highest score (Equation (6)). If no category meets a predefined similarity threshold, the post re-enters Phase 1 for further refinement.
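Equations (4) to (6) are likewise not reproduced in this text. The cosine similarity in Equation (4) has a standard form; the aggregation in Equation (5) and the thresholded decision in Equation (6) are shown below in one plausible form only. The notation is assumed: v_i is the vector of the i-th token of post p, f_{c,k} is the k-th feature vector of persona c, α_{c,k} is its weight, K_c is the number of features of persona c, and τ is the similarity threshold. Taking the maximum over tokens in Equation (5) is an illustrative choice, not a detail confirmed by the source.

```latex
\begin{align}
\operatorname{sim}(\mathbf{v}_i, \mathbf{f}_{c,k}) &= \frac{\mathbf{v}_i \cdot \mathbf{f}_{c,k}}{\lVert \mathbf{v}_i \rVert \, \lVert \mathbf{f}_{c,k} \rVert} \tag{4} \\
\operatorname{Score}_c(p) &= \sum_{k=1}^{K_c} \alpha_{c,k} \, \max_{i} \operatorname{sim}(\mathbf{v}_i, \mathbf{f}_{c,k}) \tag{5} \\
\hat{c}(p) &= \arg\max_{c} \operatorname{Score}_c(p), \quad \text{accepted only if } \max_{c} \operatorname{Score}_c(p) \ge \tau \tag{6}
\end{align}
```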
It is important to note that this phase does not perform unsupervised clustering. Instead, it operationalizes a classification step grounded in persona schemas that were inductively constructed during Phase Two. This separation enables the framework to balance exploratory persona discovery with scalable, interpretable persona assignment, which is critical for industrial design applications.
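The assignment step described in this section can be sketched in a few lines. The code assumes that posts are already segmented into tokens (for example by Jieba) and that an embedding table is available as a dictionary; the two-dimensional vectors, English stand-in tokens, feature keywords, weights, and threshold are all hypothetical, and the max-over-tokens aggregation is an assumption rather than the study's confirmed scoring rule.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def persona_scores(tokens, embeddings, stop_words, personas):
    """Weighted similarity of one tokenized post to each persona's feature keywords."""
    vectors = [embeddings[t] for t in tokens if t not in stop_words and t in embeddings]
    scores = {}
    for name, features in personas.items():
        total = 0.0
        for keyword, weight in features.items():
            if vectors and keyword in embeddings:
                # Best-matching token in the post for this persona feature.
                total += weight * max(cosine(v, embeddings[keyword]) for v in vectors)
        scores[name] = total
    return scores

def assign_persona(scores, threshold):
    """Return the highest-scoring persona, or None if it falls below the threshold."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

# Toy 2-D embeddings and persona features (hypothetical values).
embeddings = {
    "sleep": [1.0, 0.0], "insomnia": [0.9, 0.1],
    "overtime": [0.0, 1.0], "the": [0.5, 0.5],
}
stop_words = {"the"}
personas = {"Night Owls": {"sleep": 1.0}, "Workaholics": {"overtime": 1.0}}

scores = persona_scores(["the", "insomnia", "again"], embeddings, stop_words, personas)
label = assign_persona(scores, threshold=0.5)

# A post with no known tokens scores zero everywhere and is left unassigned.
unmatched = assign_persona(
    persona_scores(["again"], embeddings, stop_words, personas), threshold=0.5
)
```

Returning None for posts below the threshold mirrors the rule that unmatched posts are sent back for further refinement instead of being forced into a category.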
The Results of the “CoPersona” Process
Result of Personas
Data Filtering Process and Dataset Characteristics
The integration of Large Language Models (GPT-4) with expert review demonstrated significant effectiveness in developing nuanced user personas. Our hybrid approach successfully identified five distinct user segments (personas) from the lifestyle data (D2-1), each characterized by unique behavioral patterns and needs in bedroom environments, as shown in Figure 4. The five personas are Health Aficionados, Night Owls, Interior Decorators, Child-care Workers, and Workaholics.
Figure 4. Five personas generated by CoPersona.
This LLM-enhanced classification process enabled the rapid processing of complex, unstructured social media data while maintaining contextual understanding. The combination of GPT-4’s pattern recognition capabilities and expert validation ensured both efficiency and accuracy in persona development. The quality improvement was particularly evident in three aspects: (1) the elimination of commercial and promotional content through our authenticity verification system, (2) the comprehensive coverage of user behaviors through our dual-layer data collection strategy, and (3) the enhanced contextual understanding enabled by our behavior-product association framework. This approach proved especially effective for household appliance design, where user behaviors and lifestyle patterns significantly influence product interactions. The effectiveness of this methodology was particularly evident in the refinement process, where expert review validated and enhanced the AI-generated categories, ensuring their practical applicability to product design decisions.
Evaluation of CoPersona Classification Performance
To evaluate the performance of the CoPersona assignment mechanism, we conducted an expert-based validation using a persona-stratified sample of 50 posts per persona category. Five domain experts independently annotated all sampled posts according to the persona schemas and characteristic features derived in Phase Two, yielding 1,250 annotations in total (5 experts × 5 personas × 50 posts). These annotations should be understood as interpretive labels grounded in expert consensus, reflecting alignment with the constructed persona schemas rather than objective user identities. We summarized classification performance with a confusion matrix, from which accuracy, precision, and recall were derived (Figure 5).
Figure 5. Confusion matrix for the CoPersona assignment mechanism. The x-axis shows the true category and the y-axis shows the predicted category. The color scale represents the number of annotations (frequency), with darker shades indicating higher counts.
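For readers who want to reproduce this kind of evaluation, the sketch below derives accuracy, per-category precision, and per-category recall from paired labels. The labels are toy values, not the study's annotations; "true" stands for the expert annotation and "predicted" for the CoPersona assignment, and the function names are assumptions.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))

def per_class_metrics(counts, labels):
    """Precision and recall for each label, derived from the confusion counts."""
    metrics = {}
    for c in labels:
        tp = counts[(c, c)]
        fp = sum(counts[(t, c)] for t in labels if t != c)
        fn = sum(counts[(c, p)] for p in labels if p != c)
        metrics[c] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return metrics

def accuracy(counts):
    """Share of pairs where the predicted label equals the true label."""
    total = sum(counts.values())
    correct = sum(n for (t, p), n in counts.items() if t == p)
    return correct / total if total else 0.0

# Toy labels (hypothetical): A, B, C stand in for persona categories.
true_labels = ["A", "A", "A", "B", "B", "C"]
predicted_labels = ["A", "A", "B", "B", "B", "C"]

counts = confusion_counts(true_labels, predicted_labels)
metrics = per_class_metrics(counts, ["A", "B", "C"])
overall = accuracy(counts)
```

In this toy case one post of category A is predicted as B, which lowers the recall of A and the precision of B while leaving C unaffected; the same reading applies to off-diagonal cells in a confusion matrix of any size.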
Symmetric Measures for Agreement Between CoPersona and Expert Annotations
Note. a: Not assuming the null hypothesis; b: Using the asymptotic standard error assuming the null hypothesis; c: Based on normal approximation.
Classification Performance Metrics for Each Persona Category
The empirical evaluation demonstrates that CoPersona, our human-AI collaborative persona development framework, achieves promising performance while highlighting areas for future enhancement. While the framework demonstrates robust classification performance (overall accuracy = 81%), our analysis identifies two directions for optimization. Firstly, feature disambiguation requires refinement, particularly in distinguishing between overlapping behavioral patterns. Low-discriminative features (e.g., “mobile phone usage”) that appear across multiple persona categories contribute to classification ambiguity. Future iterations should focus on identifying and prioritizing more distinctive behavioral markers for each persona category.
Secondly, targeted data collection and augmentation techniques should be implemented for minority personas. Social media posts inherently carry sampling biases, such as over-representing aspirational content and premium product usage while under-representing practical majority users. In this study, human expert review deliberately compensates for these biases by validating personas against real-world usage patterns.
These findings contribute to the broader discourse on human-AI collaboration in user research methodologies, suggesting that while hybrid approaches offer promising results, careful attention to feature selection and data balance remains crucial for developing comprehensive and accurate user personas.
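One simple way to flag the low-discriminative features discussed above is to score each feature by how many persona categories contain it. The sketch below uses an inverse-frequency score analogous to IDF; this is an illustrative assumption, not the procedure used in the study. Apart from “mobile phone usage,” the feature names and persona keys are hypothetical.

```python
import math


def persona_specificity(persona_features):
    """Score each feature as log(N / number of personas that contain it).

    A score of 0 means the feature appears in every persona and therefore
    cannot help to tell personas apart.
    """
    n_personas = len(persona_features)
    counts = {}
    for features in persona_features.values():
        for feature in set(features):
            counts[feature] = counts.get(feature, 0) + 1
    return {f: math.log(n_personas / c) for f, c in counts.items()}


# Hypothetical feature lists for three placeholder personas.
personas = {
    "persona_a": ["mobile phone usage", "reading"],
    "persona_b": ["mobile phone usage", "night feeding"],
    "persona_c": ["mobile phone usage", "reading", "skincare routine"],
}
scores = persona_specificity(personas)
for feature, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{feature}: {score:.3f}")
```

Features with scores near zero could be down-weighted or reviewed by experts before being used for assignment.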
Discussion
Extrapolate Beyond the User-Generated Data
In the context of specific products, we can identify and analyze relevant keywords such as “sleep” and “late night.” However, a critical challenge arises when users are reluctant to share personal information on social media or other public platforms. This reticence creates a gap in our understanding of their behaviors and preferences (Salminen et al., 2022). Consequently, the data we gather may be neither comprehensive nor entirely accurate, regardless of its volume. To address this issue, we can employ techniques such as generative tasks and transfer learning with large language models (LLMs) such as GPT-4 (Kocaballi, 2023). These models, which are trained on extensive datasets, can be fine-tuned with smaller, annotated datasets to generate text that accurately reflects user personas. This method aligns with the concept of personas as mental models, enabling us to simulate user responses consistently and with contextual accuracy. By incorporating tools such as Jieba for multidimensional text analysis, we can develop more complete user models. This approach involves not only analyzing linguistic consistency but also assessing character consistency and conducting user-focused assessments, for example through conversational interfaces.
The integration of LLMs with expert analysis enables the creation of precise and actionable personas. These personas transcend mere linguistic descriptions, providing concrete insights for product design. For example, by analyzing the pre-sleep behaviors of each user category, we can derive specific design ideas for bedside products. In collaborations with companies like B.Co, using user interaction data to create CoPersona personas not only safeguards user privacy but also lets businesses rely on data that users willingly share. This method avoids intrusive data collection practices, aligning with the goals of sustainable industry and smart manufacturing. By extracting and analyzing the most relevant data, we can develop personas that directly inform product design and ensure that products are tailored to the specific needs and preferences of different user groups. For instance, insights into pre-sleep routines can inform the design of bedside products for health-conscious individuals, late-night entertainment enthusiasts, and parents with infants.
Extending Personas in Practical Applications and Industry
The CoPersona methodology presents several advancements in data-driven persona development. First, our approach introduces a data-acquisition strategy that enhances data relevance at the collection phase, rather than relying solely on post-collection filtering. This pre-emptive relevance optimization represents a departure from approaches that often struggle with noise reduction in social media datasets (Park & Kang, 2022). The methodology’s dual-layer data collection strategy, which combines product-specific interactions with broader lifestyle patterns, enables a more nuanced understanding of user behaviors while maintaining data relevance.
Second, the framework’s integration of LLMs with subject-matter expert (SME) review introduces a novel approach to feature extraction and persona development. By leveraging GPT-4’s capabilities to analyze user-generated content (Salminen et al., 2024) while incorporating domain expert oversight, the methodology achieves a balance between computational efficiency and human insight. This hybrid approach enhances the accuracy and contextual relevance of generated personas, addressing a limitation in purely automated approaches.
Third, the implementation of an iterative classification system with human supervision at critical junctures also represents a methodological innovation. The system’s ability to recursively process unclassified data while maintaining classification integrity through expert validation (Ohlén & Silvander, 2022) ensures comprehensive coverage while minimizing potential biases. This supervised-learning-like approach demonstrates the potential to create more robust and reliable persona development processes.
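The iterative classification loop with human supervision described above can be sketched as follows. This is a minimal illustration under strong simplifying assumptions: keyword matching stands in for the LLM-based assignment, the `review` callback stands in for the expert step, and all function names, persona labels, and posts are hypothetical.

```python
def classify(post, persona_keywords):
    """Return the persona with the most keyword hits in the post, or None."""
    best, best_hits = None, 0
    for persona, keywords in persona_keywords.items():
        hits = sum(1 for kw in keywords if kw in post)
        if hits > best_hits:  # on ties, the earlier persona is kept
            best, best_hits = persona, hits
    return best


def iterative_assign(posts, persona_keywords, review, max_rounds=3):
    """Assign posts over several rounds, with a review step between rounds.

    review(unclassified_posts, keywords) returns extra keywords per persona,
    or an empty dict when the reviewer has nothing to add.
    """
    keywords = {p: set(k) for p, k in persona_keywords.items()}  # work on a copy
    assigned = {}
    remaining = list(range(len(posts)))
    for round_no in range(max_rounds):
        unresolved = []
        for idx in remaining:
            persona = classify(posts[idx], keywords)
            if persona is None:
                unresolved.append(idx)
            else:
                assigned[idx] = persona
        remaining = unresolved
        if not remaining or round_no == max_rounds - 1:
            break
        updates = review([posts[i] for i in remaining], keywords)
        if not updates:
            break  # no new guidance, so another pass would change nothing
        for persona, extra in updates.items():
            keywords.setdefault(persona, set()).update(extra)
    return assigned, [posts[i] for i in remaining]


def toy_review(unclassified, keywords):
    """Hypothetical reviewer that introduces one new persona when needed."""
    if "gamer" not in keywords and any("gaming" in p for p in unclassified):
        return {"gamer": {"gaming"}}
    return {}


toy_posts = [
    "reading a novel before sleep",
    "night feeding the baby again",
    "late night gaming session",
]
seed_keywords = {"reader": {"novel", "reading"}, "parent": {"baby"}}
assigned, leftover = iterative_assign(toy_posts, seed_keywords, toy_review)
print(assigned, leftover)
```

The design choice worth noting is that unclassified items are never forced into a category: they are returned to the reviewer, and only reviewer-approved changes alter the next pass.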
Methodological Implications: Meaning-First Persona Construction
Rather than relying on unsupervised clustering to retrospectively infer user meanings (Korsgaard et al., 2020), the CoPersona framework embeds persona hypotheses directly into the data acquisition and feature construction stages. Specifically, the study adopts a user-centered sampling logic that begins with individuals who have demonstrably engaged with bedside lighting products, and subsequently examines their broader lifestyle expressions. This design choice reflects a conceptual shift from asking “Which users can be grouped together?” to “How do users who engage with a given product live and behave in daily contexts?” As a result, the personas generated in this study are not marketing abstractions derived solely from product interaction metrics, but lifestyle-oriented representations grounded in everyday practices.
Within this framework, Large Language Models (LLMs) are not positioned as end-to-end decision-makers, but as high-dimensional semantic feature generators. Human experts retain control over interpretive validity by curating, refining, and validating the feature space before any distance-based computation is performed. Consequently, clustering serves as a confirmatory rather than discovery-oriented step, ensuring that computational similarity operates within a semantically meaningful space.
This “meaning-first, distance-second” paradigm addresses a recurring challenge in data-driven persona research, where algorithmic clusters often require extensive post hoc interpretation to establish design relevance. By aligning data collection, feature abstraction, and classification logic with an explicit persona hypothesis, the CoPersona methodology enhances interpretability while preserving computational scalability.
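The “meaning-first, distance-second” ordering can be illustrated with a small sketch in which the feature space is fixed by curation before any similarity is computed. The curated feature list, persona prototypes, and binary encoding are hypothetical stand-ins for the LLM-generated and expert-validated features, and nearest-prototype cosine matching is used here only as a simple substitute for the study's clustering step.

```python
import math

# Hypothetical expert-curated feature space (meaning comes first).
CURATED_FEATURES = ["reading", "night feeding", "gaming", "mobile phone usage"]


def encode(tags):
    """Binary vector over the curated features; other tags are ignored."""
    return [1.0 if feature in tags else 0.0 for feature in CURATED_FEATURES]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical persona prototypes expressed in the curated space.
PROTOTYPES = {
    "reader": encode({"reading", "mobile phone usage"}),
    "parent": encode({"night feeding", "mobile phone usage"}),
}


def assign(tags):
    """Distance comes second: pick the most similar persona prototype."""
    vector = encode(tags)
    return max(PROTOTYPES, key=lambda persona: cosine(vector, PROTOTYPES[persona]))


print(assign({"night feeding"}))
print(assign({"reading", "gaming"}))
```

Because tags outside the curated list never enter the vector, similarity is only ever computed over dimensions that experts have judged meaningful.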
Limitations and Further Study
Several limitations of this study should be acknowledged. First, this study relies on data from a single social media platform, RedNote, which may introduce platform-specific sampling bias. User demographics, content norms, and posting behaviors on RedNote may not fully represent broader social media populations. Future research should therefore explore multi-platform data integration to enhance demographic diversity and reduce platform-dependent effects, enabling more generalizable persona construction across social contexts.
Second, the temporal distribution of posts in dataset D2-1 is skewed toward more recent years. This pattern reflects the natural growth trajectory and increasing user engagement of the RedNote platform rather than a sampling artifact. Importantly, the study does not aim to model long-term behavioral evolution. Instead, it adopts a user-centered sampling strategy that focuses on individuals who have engaged with bedside lighting products and analyzes their recent lifestyle expressions to capture contemporary usage contexts. Given that bedside lighting is a function-oriented household product and that core pre-sleep and nighttime behaviors tend to remain relatively stable over short-to-medium time spans, this temporal skew is unlikely to compromise the validity of persona construction in this study. Nevertheless, future research may extend the CoPersona framework to longitudinal datasets to investigate persona dynamics and behavioral change over time.
Third, the evaluation phase is based on a relatively small, stratified sample due to the high cost of expert annotation and the interpretive nature of persona labeling. Consequently, the reported evaluation metrics should be interpreted as indicators of classification reliability rather than as exhaustive guarantees of performance across the full datasets. While expert agreement statistics demonstrate substantial consistency, future studies could explore scalable validation strategies, such as semi-automated expert-in-the-loop evaluation, cross-study benchmarking, or probabilistic uncertainty modeling, to strengthen robustness and generalizability.
Fourth, the current analysis focuses primarily on textual user-generated content, leaving multimodal signals—such as images, interaction patterns, and temporal posting rhythms—largely unexplored. Incorporating multimodal analysis techniques may enable richer, more nuanced persona representations, particularly for lifestyle-oriented products, where visual and contextual cues play a critical role in user expression.
Finally, although expert validation is employed to mitigate potential biases introduced by LLM-assisted feature extraction, this process remains largely manual. While this design choice prioritizes interpretive rigor, it limits scalability. Future research should investigate algorithmic approaches to bias detection, uncertainty estimation, and automated consistency checking, enabling the CoPersona framework to scale without compromising methodological transparency.
Taken together, these limitations highlight important pathways for future research in data-driven persona development. Particularly crucial is the development of more sophisticated bias-mitigation and automated-validation mechanisms that preserve the methodological strengths of human-AI collaboration while enhancing scalability. As LLM capabilities continue to evolve, future work may explore deeper integration of advanced natural language understanding techniques to automate portions of expert validation and feature refinement, while maintaining accuracy and interpretability.
In conclusion, while the CoPersona methodology represents a meaningful advancement in data-driven persona development, addressing these limitations through systematic future research will be essential for establishing more comprehensive, scalable, and generalizable approaches to user understanding in HCI research.
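As a small illustration of the uncertainty modeling mentioned among the limitations, the sketch below computes a Wilson score interval for an accuracy estimate from a modest sample. The counts (81 correct out of 100) are hypothetical and chosen only to mirror the reported 81% accuracy; they are not the study's actual counts, and the study does not state that it used this interval.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical counts for illustration only.
low, high = wilson_interval(81, 100)
print(f"95% interval: [{low:.3f}, {high:.3f}]")
```

Reporting an interval of this kind alongside a point estimate makes explicit how much a small, stratified validation sample constrains the true classification accuracy.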
Participants
All participants provided informed consent before data collection.
Conclusion
This research addresses a critical challenge faced by small and medium-sized enterprises in generating actionable user personas under data and resource constraints. We introduce CoPersona, a human-LLM collaborative persona development framework that leverages large-scale user-generated content from RedNote to construct data-driven personas for product design.
CoPersona integrates advanced text processing techniques—including tokenization, semantic embedding, and iterative classification—with structured expert validation. Its primary contribution lies in the “co” paradigm: rather than replacing human judgment, large language models function as high-dimensional semantic feature generators that collaborate with UX researchers in persona construction. Distinct from demographic-based profiling, this approach derives personas from publicly shared lifestyle posts without collecting sensitive or private user data, thereby enabling privacy-conscious user understanding grounded in real-world behavior.
Empirical results demonstrate that the CoPersona framework effectively translates large-scale social media insights into actionable product design decisions. In an industrial case study, the personas generated through CoPersona informed concrete design actions and achieved a 69.2% real-world adoption rate, indicating strong practical value and organizational acceptance. Overall, this study highlights the potential of human-LLM collaboration to bridge the gap between large-scale social data analysis and user-centered product innovation.
Footnotes
Author Contributions
M.Y. designed the experiment and drafted the main manuscript.
H.L. performed data cleaning, organization, and analysis.
B.L. contributed to manuscript writing, prepared figures, and assisted in data collection.
R.C. provided data resources and contributed to data analysis.
All authors reviewed and approved the final manuscript.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the 2025 Annual Regular Project of the Zhejiang Provincial Philosophy and Social Sciences Planning Programme: Research on Entry Models, Mechanisms and Pathways for Digital Intelligence Technology Empowering the Design Industry; 25NDJC102YBM.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The datasets generated and analyzed during the study are available from the corresponding author upon reasonable request, subject to platform data restrictions.
