Identifying Key Idiolect Markers in Sociolinguistic Profiling: A Scoping Review and Analytical Framework for Real-world Applications

Abstract

Sociolinguistic profiling has emerged as a powerful tool for understanding individual language patterns and their applications in diverse contexts, including digital communication, forensic linguistics, and psychological analysis. This study investigates key idiolect markers—such as vocabulary, syntax, prosody, and speech patterns—focusing on their efficacy and adaptability across different mediums and languages. A comprehensive review of 34 studies, spanning from 2004 to 2024, was conducted using databases like Google Scholar, PubMed, Scopus, and Web of Science, ensuring robust and multidisciplinary coverage. Our findings reveal that idiolect markers exhibit varying accuracy rates depending on the context, with digital communication settings achieving up to 91% accuracy using linguistic features, and spoken interactions excelling with non-linguistic markers at 85%. Challenges, including cross-linguistic variability, data limitations, and ethical considerations, were critically analyzed. The study proposes an integrated analytical framework combining qualitative and computational methods to enhance profiling accuracy and adaptability. Practical implications are explored in depth, highlighting applications in targeted advertising, mental health detection, and authorship attribution. Future research directions emphasize the importance of cross-linguistic validation, development of adaptive profiling models, and ethical safeguards to mitigate risks such as bias and misuse. These insights underscore the transformative potential of sociolinguistic profiling while addressing the methodological and ethical complexities of its implementation.

Plain language summary

How we use language patterns to identify people: A review of tools for forensic and digital profiling

This review explores how language patterns, also known as idiolect markers, can be used to identify personal traits, such as personality and communication habits, across different fields like forensic science and digital profiling. Language patterns include elements like vocabulary, sentence structure, and speech rhythm, which vary between individuals. By studying these unique markers, researchers can develop methods to profile people based on how they speak or write. The review looked at 34 studies and found that profiling techniques could achieve accuracy rates ranging from 60% to 91%, depending on the context. For example, in forensic cases where identifying the author of anonymous texts is important, idiolect markers achieved an 85% accuracy rate. In the field of digital profiling, such as targeted advertising based on user communication, there was a 22% increase in engagement when idiolect markers were used. The review also proposes a new framework that combines different approaches to improve profiling accuracy and make the methods more consistent. However, there are still some challenges, particularly when applying these techniques across different languages. Ethical concerns, like data privacy and the possibility of biased results, must also be addressed. These issues become especially important when profiling is used in legal investigations or digital surveillance. Future research should focus on improving these techniques to ensure they work across different languages and contexts. Researchers should also work to minimize bias and ensure that profiling methods are used ethically, with proper safeguards for privacy.

Keywords

idiolect markers; sociolinguistic profiling; personality traits; computational linguistics; personality detection; cross-linguistic variability; machine learning; psycholinguistic analysis; forensic linguistics; ethical considerations

Introduction

Sociolinguistic profiling, which involves identifying unique linguistic patterns to infer personal attributes, has gained increased attention in recent years. One particularly promising approach within this field is the use of idiolect markers—distinctive language features that reflect an individual’s habitual way of speaking or writing. Idiolect markers can provide valuable insights into a person’s personality traits, social identity, or intent, and have proven useful in diverse contexts, including forensic linguistics, digital communication, and legal investigations.

In forensic linguistics, idiolect markers have been used to identify authorship in cases of disputed text, such as threatening letters or anonymous posts on social media. Similarly, in digital profiling, idiolect analysis has enabled companies to infer personal attributes for targeted advertising or user profiling based on online language patterns. However, despite its potential, sociolinguistic profiling faces several challenges, including the variability of language across different contexts, cross-linguistic generalizability, and ethical concerns surrounding privacy and bias.

The primary objectives of this study are to:

Examine the accuracy of idiolect markers in sociolinguistic profiling across various contexts, including forensic linguistics and digital communication.

Identify challenges related to cross-linguistic generalizability and explore the impacts of linguistic variability on profiling outcomes.

Address ethical concerns in the use of idiolect markers, focusing on privacy, algorithmic bias, and transparency.

Propose a comprehensive analytical framework that integrates qualitative and quantitative methodologies to enhance the consistency and applicability of sociolinguistic profiling.

These objectives aim to provide a structured understanding of idiolect markers and their potential while addressing the limitations and ethical considerations essential for real-world applications.

Foundational Theories in Sociolinguistics and Idiolects

Sociolinguistics examines the relationship between language and society, with foundational theories that have significantly shaped our understanding of linguistic variation. William Labov’s theory of language variation and change is particularly influential, demonstrating that linguistic variation is not random but systematically structured by social factors such as class, age, and ethnicity. Labov’s work revealed how different social groups use language to construct and reflect their identities (Oxford Bibliographies, 2014).

Another essential theory in sociolinguistics is the Speech Accommodation Theory, developed by Howard Giles. This theory explores how individuals adjust their speech to either converge with or diverge from the language patterns of their interlocutors, driven by social motives such as the desire for approval or the need to assert a distinct social identity (StudySmarter, n.d.).

Within this theoretical framework, the concept of the idiolect is critical. An idiolect is the unique linguistic profile of an individual, reflecting personal language use shaped by a lifetime of social interactions. Historically, the concept of idiolects has been vital in understanding individual linguistic identity. Modern interpretations highlight the importance of idiolects in fields such as sociolinguistic profiling and forensic linguistics, where individual language patterns provide insights into broader social dynamics.

Literature Review

Previous Studies on Idiolect Markers

A growing body of research has investigated idiolect markers, revealing their potential as valuable, though context-dependent, identifiers in sociolinguistic profiling. For instance, Coulthard (2004) conducted pioneering work in forensic linguistics, where idiolect markers were employed to analyze authorship in legal cases. This study demonstrated how specific word choices, syntactic structures, and discourse markers could distinguish individual writing styles. Similarly, Grant (2013) expanded on these findings by exploring the application of idiolects in digital communication, highlighting that even in virtual spaces, individuals maintain unique linguistic fingerprints.

The methodologies employed in these studies vary significantly, reflecting the diverse nature of the data and research objectives. Coulthard’s (2004) work relied primarily on qualitative analysis, focusing on detailed linguistic features within small corpora to infer individual language use. In contrast, Grant (2013) utilized a mixed-methods approach, combining qualitative insights with quantitative measures such as frequency counts and statistical analyses to validate the presence of idiolect markers across larger datasets.

In addition to these foundational works, more recent studies have adopted advanced computational methods to enhance profiling accuracy. Turell (2011) and Liu et al. (2017), for example, applied purely quantitative approaches using statistical tools like cluster analysis and machine learning algorithms to analyze extensive textual datasets and identify unique patterns in language use. Shrestha et al. (2020) further advanced the field by integrating linguistic and psycholinguistic markers to detect fake news spreaders, achieving accuracy rates of up to 77% for Spanish speakers. Similarly, Jyothi et al. (2024) demonstrated the effectiveness of embedding-based methods, such as BERT and SimCSE, for personality prediction on social media, achieving an impressive 87.5% accuracy.

The success of using idiolect markers in sociolinguistic profiling has been particularly notable in forensic contexts, where these markers have proven instrumental in authorship attribution cases. However, several challenges persist. One significant issue is the variability in individual language use over time and across different contexts, which complicates the identification of consistent idiolect markers. Quantitative methods, while effective at processing large datasets, may fail to capture the nuanced, context-dependent aspects of language that qualitative approaches excel at identifying.

Despite these challenges, studies such as McMenamin (2002) have demonstrated that integrating qualitative and quantitative methodologies can enhance the robustness of sociolinguistic profiling. Alam and Riccardi (2014) further emphasize the value of such integration by showing how combining acoustic features with linguistic markers improves profiling outcomes, particularly in spoken interactions. Collectively, these studies highlight the evolving methodologies and diverse applications of idiolect markers, while underscoring the need for methodological flexibility to address linguistic and contextual variability.

Sociolinguistic Profiling in Practice

One of the most notable examples of sociolinguistic profiling through idiolect markers is the Unabomber case, where Ted Kaczynski was identified through his manifesto. Forensic linguists analyzed Kaczynski’s writing style, comparing it to letters and essays he had written over the years. Key idiolect markers, such as his consistent use of certain phrases and vocabulary, played a crucial role in linking him to the manifesto (Olsson, 2004; Turell, 2011). This case highlighted the power of linguistic analysis in forensic investigations, demonstrating how even subtle linguistic nuances could point to a specific individual.

Another significant case involved the identification of an anonymous author of threatening letters. The linguistic analysis of these letters included the examination of spelling patterns, punctuation, and syntactic structures unique to the suspect. By comparing these features with the suspect’s known writings, forensic linguists provided evidence that led to the conviction of the individual (Coulthard & Johnson, 2007). These case studies underscore the importance of idiolect markers in forensic linguistics, where even the most inconspicuous linguistic traits can be pivotal in criminal investigations.

The rise of digital communication platforms such as social media and online forums has presented new opportunities and challenges for the application of idiolect markers. In these settings, individuals often exhibit unique language patterns, including the use of specific emojis, abbreviations, or hashtags, which can serve as idiolect markers. For instance, in the analysis of online harassment cases, forensic linguists have been able to identify perpetrators by examining their consistent use of particular phrases, spelling, and grammar errors across different platforms (Grant, 2013).

Digital communication also amplifies the visibility of idiolect markers, as users often engage in spontaneous and informal writing, which can reveal distinctive linguistic features. However, this also complicates the analysis, as the same individual may alter their language use based on the platform or audience, leading to challenges in maintaining the consistency of idiolect markers. Despite these challenges, digital communication remains a fertile ground for sociolinguistic profiling, offering new dimensions for the application of forensic linguistics.

While idiolect markers can be powerful tools in forensic linguistics, their application across different languages and dialects presents significant challenges. Linguistic diversity means that markers that are distinct in one language may be common in another, making it difficult to establish a unique linguistic fingerprint for an individual. For example, in multilingual societies, individuals often switch between languages or dialects, which can obscure the consistency of idiolect markers (Eades, 2010).

Moreover, the cultural context in which a language is used also influences idiolect markers. What might be an idiosyncratic expression in one dialect could be a common colloquialism in another, complicating the process of distinguishing individual language use. Additionally, the lack of standardized linguistic databases for less commonly spoken languages further hampers the ability of forensic linguists to apply idiolect markers effectively across linguistic boundaries (Mooney & Evans, 2023).

To address these challenges, forensic linguists must adopt a more nuanced approach that accounts for the linguistic and cultural context in which the language is used. This might involve developing new methodologies for analyzing multilingual communication or creating more comprehensive linguistic databases that include diverse dialects and languages.

Methodology

This scoping review was conducted following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, ensuring a transparent and systematic process. The methodological framework involved multiple stages: defining research questions, identifying relevant literature, screening studies, data extraction, and synthesizing findings. The overall aim was to identify key themes, gaps, and trends in the identification and application of idiolect markers in sociolinguistic profiling.

Study Identification and Categorization

The search for relevant literature spanned from 2004 to 2024 and was conducted using multiple academic databases, including Google Scholar, PubMed, Scopus, and Web of Science.

Database Selection Rationale: The selection of these databases was guided by their comprehensive coverage of peer-reviewed literature across disciplines relevant to sociolinguistic profiling. For instance, Google Scholar provides a broad search spectrum, including gray literature, while PubMed is particularly rich in studies related to cognitive and linguistic psychology. Scopus and Web of Science were chosen for their robust indexing of high-impact journals and multidisciplinary content. Together, these databases ensure a diverse and representative pool of studies, aligning with the study’s objectives.

Database Contribution Quantification: From the total 1,206 studies initially retrieved, contributions were distributed as follows: Google Scholar (42%), PubMed (25%), Scopus (18%), and Web of Science (15%). These proportions reflect the breadth of each database’s coverage in the context of idiolect marker research.

The keywords used for the search were “idiolect,” “sociolinguistic profiling,” “personality traits,” “forensic linguistics,” and “individual language.” Search results were imported into Zotero, a widely used reference management software, for systematic screening and organization.

Categorization Criteria: After removing duplicates (n = 314), 892 studies remained for screening. Each study was first categorized based on its primary focus:

Sociolinguistic Profiling Context: Legal, social, digital, or clinical settings.

Type of Idiolect Markers Identified: Vocabulary, syntax, speech patterns, non-verbal features.

Profiling Methodology: Qualitative, quantitative, or mixed-method approaches.

During the screening phase, 647 studies were excluded after reading the abstracts due to lack of relevance. The remaining 245 studies underwent full-text review, leading to further exclusions based on relevance to idiolect markers and methodological robustness, leaving 34 studies for detailed analysis.
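The selection counts reported above can be checked with a few lines of arithmetic. The sketch below uses only the figures stated in this section; the number excluded at full-text review is not given in the text and is derived here on the assumption of a simple linear flow.

```python
# Counts reported in the Methodology section of this review.
retrieved = 1206          # records retrieved from all four databases
duplicates = 314          # duplicates removed
excluded_abstract = 647   # excluded after reading abstracts
included = 34             # studies retained for detailed analysis

after_dedup = retrieved - duplicates
full_text = after_dedup - excluded_abstract
# Not stated in the text: derived under the assumption of a linear flow.
excluded_full_text = full_text - included

# Reported share of the initial retrieval per database, in percent.
shares = {"Google Scholar": 42, "PubMed": 25, "Scopus": 18, "Web of Science": 15}

print("after de-duplication:", after_dedup)
print("full-text review:", full_text)
print("excluded at full text (derived):", excluded_full_text)
print("database shares sum to:", sum(shares.values()))
```

Running the check confirms that the reported figures are internally consistent: 892 records after de-duplication, 245 at full-text review, and database shares that sum to 100%.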

Inclusion and Exclusion Criteria

To ensure that only the most relevant and methodologically sound studies were included, the following inclusion and exclusion criteria were applied:

Inclusion Criteria:

1. Focus on Idiolect Markers: Studies had to explicitly address the identification and application of idiolect markers in sociolinguistic profiling.

2. Contextual Application: Studies set within forensic, social media, clinical, or legal contexts were prioritized.

3. Methodological Clarity: Only peer-reviewed studies with clearly defined methods (qualitative, quantitative, or mixed) were included.

4. Language: Only studies published in English were included to facilitate synthesis.

Exclusion Criteria:

1. Studies that focused purely on theoretical linguistics without practical application to sociolinguistic profiling.

2. Non-peer-reviewed sources, such as conference abstracts and gray literature.

3. Studies lacking sufficient methodological detail to assess the role of idiolect markers.

Quality Assessment and Statistical Techniques

Each study was further evaluated based on its methodological rigor, assessed using both qualitative and quantitative metrics:

Linguistic Diversity Assessment: Studies were categorized based on whether they accounted for linguistic diversity by including multiple languages or dialects. A simple scoring system (ranging from Low to High) was used to rate the linguistic diversity of each study. A “High” score indicated the inclusion of multiple languages or the adaptation of profiling techniques across linguistic boundaries.

Context Dependence: Studies were rated on the extent to which they adapted their profiling methodologies to specific contexts (e.g., legal vs. social media vs. clinical). Context dependence was rated using a three-tier system (None, Moderate, High). “High” context dependence indicated that a study explicitly adapted its profiling methods based on the setting and participants involved.

Methodological Rigor and Bias Assessment: Studies were assessed using a custom scoring tool inspired by Cochrane’s Risk of Bias tool to evaluate potential biases in sampling, methodology, and data quality. The following elements were scored on a scale of 0 to 5:

Sample size and representativeness

Data collection methods (including potential biases in digital data)

Transparency of statistical techniques used (e.g., machine learning algorithms)

Reporting clarity
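As a rough illustration, the four-element rubric above could be recorded per study as follows. The class name, field names, and the unweighted total are assumptions made for this sketch; the review does not state how the element scores were combined.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RigorScore:
    """Ratings for one study on the four rubric elements (each 0 to 5)."""
    sample: int            # sample size and representativeness
    data_collection: int   # data collection methods
    transparency: int      # transparency of statistical techniques
    reporting: int         # reporting clarity

    def __post_init__(self):
        # Reject any rating outside the 0-5 scale described in the text.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0 <= value <= 5:
                raise ValueError(f"{f.name} must be in 0..5, got {value}")

    def total(self) -> int:
        # Unweighted sum (0..20). This aggregation is an assumption.
        return sum(getattr(self, f.name) for f in fields(self))


# Hypothetical ratings for a single study.
example = RigorScore(sample=3, data_collection=4, transparency=2, reporting=5)
print(example.total())
```

Validating the range at construction time keeps out-of-scale ratings from silently entering any later aggregation.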

Statistical methods used in this study included descriptive statistics to summarize key findings and inferential techniques to assess the reliability and variability of idiolect markers across contexts. For quantitative analyses, measures such as frequency distributions, cluster analyses, and correlation coefficients were applied to identify significant patterns in language features. These methods were chosen to ensure a robust and reproducible framework for evaluating idiolect markers, particularly in differentiating linguistic profiles. The combination of statistical rigor and qualitative insights enhances the validity of the proposed analytical framework.

The chosen statistical techniques align with the study’s goal of integrating qualitative and quantitative methods to provide a comprehensive analysis of idiolect markers. By leveraging these methods, the study addresses key challenges such as variability in language use and context-dependent profiling accuracy.
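To make the frequency-distribution and correlation measures concrete, the sketch below builds a relative-frequency profile over a small set of marker words and compares two texts with a Pearson coefficient. The marker list, the tokenizer, and the sample texts are illustrative assumptions, not the procedure used in the reviewed studies.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical marker set: a handful of English function words.
MARKERS = ["the", "of", "and", "to", "in", "that", "but", "which"]


def profile(text):
    """Relative frequency of each marker word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    n = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[m] / n for m in MARKERS]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # undefined for a constant profile; 0 is a convention here
    return cov / (sx * sy)


# Made-up texts for illustration only.
text_a = "The point of the letter is that the writer wants to be heard."
text_b = "To be fair, the tone of the note and the style in it match."
print(round(pearson(profile(text_a), profile(text_b)), 3))
```

A high coefficient between two profiles indicates similar marker usage, but on texts this short the value is unstable and should not be read as evidence of shared authorship.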

PRISMA Flowchart

A PRISMA flowchart (Figure 1) was used to visually represent the study selection process. The flowchart outlines the number of studies retrieved, screened, excluded, and ultimately included in the review. This structured approach ensured transparency at each stage of the review process.

Figure 1. PRISMA flowchart of identified studies.

Results

This review examined 34 studies on the identification of key idiolect markers in sociolinguistic profiling and draws conclusions about the current state of research in the field. The detailed results are recorded in a table (Appendix 1). For brevity, this article presents only the most important results.

Idiolect Markers Identified

Across the 34 studies analyzed, the most frequently identified idiolect markers were vocabulary, syntax, speech patterns, and non-verbal features such as prosody and speech activity (Figure 2). The studies showed varying levels of accuracy depending on the context and idiolect marker used. For example, in digital communication contexts (e.g., social media), idiolect markers such as vocabulary and syntactic structures were found to yield accuracy rates of up to 91% in identifying specific personality traits like extraversion and conscientiousness (Sewwandi et al., 2017). Additional studies, such as Kumar and Gavrilova (2019), demonstrated the effectiveness of text embeddings like TF-IDF for detecting similar traits in short texts, achieving comparable accuracy.

Figure 2. Frequency of identified idiolect markers in reviewed studies.

In forensic settings, speech patterns, including prosody and intonation, yielded an accuracy rate of 75% in identifying personality traits, particularly traits like conscientiousness and extraversion (Mohammadi & Vinciarelli, 2012). Alam and Riccardi (2014) emphasized the importance of combining acoustic features with linguistic markers, achieving up to 85% accuracy in spontaneous conversational contexts. Similarly, McMenamin (2002) highlighted the integration of qualitative and quantitative methods to enhance profiling consistency in legal cases.

These findings emphasize the importance of context in sociolinguistic profiling. Vocabulary and syntactic markers tend to perform best in structured, text-based digital environments, where the written form of language offers consistent patterns for analysis. On the other hand, non-linguistic features like prosody and speech activity prove more effective in dynamic, spoken settings, particularly when profiling interactions that rely heavily on intonation and rhythm. This highlights the necessity of tailoring profiling methodologies to the specific medium of communication.

Context of Application

Idiolect markers have been applied in a variety of contexts, ranging from social and digital settings, such as social media platforms, to clinical, legal, and educational environments. Participants have included social media users, patients, criminals, students, and more. Many studies utilize large datasets, including Big Data from digital platforms. For instance, Shrestha et al. (2020) applied linguistic markers to detect fake news spreaders in Spanish and English, achieving 77% and 73% accuracy, respectively. This wide range of contexts highlights the versatility of idiolect markers but also raises questions about the generalizability of findings across different settings and populations.

Profiling Outcomes

The overall accuracy rates of sociolinguistic profiling methods ranged from 60% to 91%, depending on the idiolect markers employed and the context in which they were used. In social media studies, such as those by Valente et al. (2012), language-based profiling techniques were able to predict personality traits with accuracies ranging between 67.6% for conscientiousness and 74.5% for extraversion. Kulkarni et al. (2018) found that using word n-grams and dialog acts improved prediction accuracy for specific traits such as openness, and reached 82% accuracy in detecting extraversion through social media analysis. Similarly, Jyothi et al. (2024) demonstrated the effectiveness of embedding-based methods, such as BERT and SimCSE, achieving 87.5% accuracy in personality prediction on social media.
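For readers unfamiliar with word n-gram features of the kind mentioned above, a minimal extraction sketch follows. The tokenizer and sample sentence are assumptions for illustration; dialog-act features are not covered, and this is not the feature pipeline of any cited study.

```python
import re
from collections import Counter


def word_ngrams(text, n):
    """Return the list of word n-grams (as tuples) in reading order."""
    tokens = re.findall(r"\w+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up sentence with a repeated habitual phrase.
sample = "to be honest I think to be honest it works"
bigrams = Counter(word_ngrams(sample, 2))

# Repeated n-grams are candidate habitual phrases for a writer.
for gram, count in bigrams.most_common(2):
    print(" ".join(gram), count)
```

Counts like these are typically normalized by text length before being passed to a classifier, so that long and short texts remain comparable.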

The table below summarizes the key idiolect markers identified across the reviewed studies, along with their associated accuracy rates and primary contexts of application (Table 1). This summary highlights the variability in marker effectiveness depending on the profiling setting, emphasizing the importance of context-specific adaptations in sociolinguistic profiling. Additionally, the accompanying bar chart (Figure 2) illustrates the frequency of these idiolect markers across the studies, providing a visual representation of their prominence in sociolinguistic research.

Table 1. Summary of Key Idiolect Markers and Profiling Accuracy Rates Across Contexts.

Idiolect marker	Context	Accuracy rate (%)	Key studies
Vocabulary	Digital communication	91	Sewwandi et al. (2017)
Syntax	Digital communication	74–82	Valente et al. (2012), Kulkarni et al. (2018)
Speech patterns	Forensic linguistics	75–85	Alam and Riccardi (2014), Mohammadi and Vinciarelli (2012)
Non-linguistic features	Spoken interactions	85	Alam and Riccardi (2014), McMenamin (2002)

Discussion

The results of this review underscore the diverse applications and varying accuracy of idiolect markers in sociolinguistic profiling. Vocabulary and syntax markers demonstrated high accuracy in digital contexts, while non-linguistic features, such as prosody, excelled in spoken interactions. These findings highlight the importance of tailoring profiling methodologies to specific mediums of communication.

However, the variability in accuracy across contexts presents significant challenges. The context-dependent nature of idiolect markers complicates their consistency, particularly in multilingual settings, as demonstrated by Verhoeven et al. (2016). Additionally, while computational methods enhance efficiency, they risk losing nuanced insights that qualitative approaches capture.

Linguistic Diversity and Generalizability

Generalizing results across languages remains a critical issue. Studies like Verhoeven et al. (2016) showed accuracy rates of 70% to 80% for Dutch, French, and Spanish, with lower performance in less-represented languages. Similarly, Shrestha et al. (2020) reported 73% accuracy for English speakers and 77% for Spanish speakers using psycholinguistic features derived from LIWC. These findings emphasize the need for more robust cross-linguistic validation to address linguistic and cultural variability.

Context Dependence

Idiolect marker effectiveness is highly context-specific. In legal and forensic contexts, speech patterns, including pitch and tone, achieved 75% to 85% accuracy (Hart et al., 2020). Conversely, vocabulary and writing style excelled in social media profiling, with studies like Lukito et al. (2016) and Sewwandi et al. (2017) achieving up to 91% accuracy for traits such as conscientiousness and openness.

Methodological Advances and Challenges

The integration of computational methods has significantly improved profiling accuracy. For example, Jyothi et al. (2024) combined LSTM-based models with LIWC features, achieving 87.5% accuracy in personality prediction. However, issues like small sample sizes and data biases persist. Moskvichev et al. (2018) noted challenges with social media data, where psychopathy profiling accuracy ranged from 60% to 75%, underscoring the need for more representative datasets.

Applications and Challenges

Building on these methodological advances, sociolinguistic profiling demonstrates significant potential in real-world contexts. Digital profiling, for instance, enhances targeted advertising and personalized content delivery, with studies like Sewwandi et al. (2017) achieving high predictive accuracy. Forensic linguistics applications, such as authorship attribution, have shown success, as evidenced by Hart et al. (2020). Emerging fields, including mental health monitoring and education, offer additional opportunities, though challenges like data limitations, algorithmic bias, and ethical concerns must be addressed.

State of the Art and Future Directions

Despite accuracy rates ranging from 60% to 91%, challenges remain in generalizability and context dependence. For instance, Valente et al. (2012) showed higher profiling accuracy for extraversion in English (74%) compared to Spanish (67%). Addressing overfitting in machine learning models and biases in digital data is essential to ensure fairness and applicability in diverse linguistic environments. Future research should prioritize cross-linguistic validation, develop adaptive profiling techniques, and mitigate inherent biases in data sources.

Synthesis of Findings

Idiolect markers provide valuable insights into personality traits through lexical and structural features, as supported by Mairesse et al. (2007), Valente et al. (2012) and Kumar and Gavrilova (2019). Advanced techniques like TF-IDF and GloVe embeddings further enhance text-based profiling. Integrating linguistic and non-linguistic features, such as prosody, may yield even higher accuracy, particularly in spoken contexts. These findings support the analytical framework’s relevance for both practical applications and future research.
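The TF-IDF weighting mentioned above can be sketched in a few lines. This is the textbook unsmoothed form, tf × log(N/df), written for illustration; library implementations often use smoothed variants, and the sample documents are made up.

```python
import math
import re
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [re.findall(r"\w+", d.lower()) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        total = len(toks) or 1
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


# Made-up short texts standing in for three writers.
docs = [
    "honestly I reckon it works",
    "I think it works",
    "honestly it works fine",
]
vecs = tf_idf(docs)
print(sorted(vecs[0].items(), key=lambda kv: -kv[1])[:2])
```

Terms that occur in every document receive a weight of zero, so the highest-weighted terms are those that set one writer apart from the others, which is the property that makes the weighting relevant to idiolect analysis.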

Cross-Linguistic Generalizability

A significant challenge in sociolinguistic profiling is the limited generalizability of methods across languages and dialects. Many studies focus primarily on English or other widely spoken languages, limiting applicability in multilingual contexts where linguistic features and cultural norms vary significantly.

Linguistic and Cultural Variability

Studies like Verhoeven et al. (2016) highlight how profiling methods effective in one language may underperform in others due to differences in syntax, vocabulary, and cultural influences. For example, techniques yielding high accuracy for Dutch and German speakers were less effective for French and Italian speakers. These discrepancies underscore the need to tailor profiling models to specific linguistic and cultural contexts, accounting for variations such as the emphasis on intonation or syntactic structures.

Multilingual Approaches

Multilingual studies, though limited, show promise in addressing this issue. Shrestha et al. (2020) tested profiling models across English and Spanish, achieving 73% and 77% accuracy, respectively. However, such studies reveal complexities in handling diverse linguistic structures and highlight resource disparities for less commonly studied languages, where datasets are often insufficient.

Methodological Challenges

Data scarcity remains a key barrier, with most research focusing on widely spoken languages like English (Moskvichev et al., 2018; Valente et al., 2012). In multilingual settings, code-switching further complicates profiling by obscuring idiolect markers, as noted by Eades (2010). These challenges necessitate innovative solutions to improve profiling accuracy and consistency.

Addressing the Challenges

1. Cross-Linguistic Validation. Profiling methods must be tested across multiple languages to ensure generalizability. Developing multilingual corpora that capture linguistic and cultural diversity can enhance validation efforts.

2. Adaptive Models. Machine learning models trained on multilingual datasets can dynamically adjust to specific linguistic features. Transfer learning, where models are fine-tuned with smaller datasets in new languages, could reduce overfitting and improve applicability.

3. Handling Code-Switching. Profiling algorithms must account for multilingual speakers and code-switching patterns. Models capable of processing multiple languages simultaneously could improve accuracy in multilingual environments.

4. Expanding Resources. Addressing data scarcity in underrepresented languages requires collaborative efforts to create open-source linguistic databases and annotated corpora, enabling broader application of profiling techniques.
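The cross-linguistic validation step listed above can be sketched as a per-language breakdown of accuracy, so that a model that performs well on average but poorly in one language is not mistaken for a generalizable one. This is a minimal sketch under stated assumptions: the record format, the function names, and the sample records are hypothetical.

```python
from collections import defaultdict

def accuracy_by_language(records):
    """records: iterable of (language, true_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, truth, predicted in records:
        total[language] += 1
        correct[language] += int(truth == predicted)
    return {language: correct[language] / total[language] for language in total}

def accuracy_gap(per_language):
    """Difference between the best- and worst-performing language."""
    return max(per_language.values()) - min(per_language.values())

# Hypothetical predictions, for illustration only.
records = [
    ("en", "a", "a"), ("en", "b", "b"), ("en", "a", "b"), ("en", "b", "b"),
    ("es", "a", "a"), ("es", "b", "a"),
]
per_language = accuracy_by_language(records)
gap = accuracy_gap(per_language)
```

Reporting the gap alongside the overall accuracy makes uneven performance across languages visible at a glance.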

Key Takeaways

While sociolinguistic profiling demonstrates potential, its effectiveness in multilingual settings requires cross-linguistic validation, adaptive models, and resource development for less-represented languages. Addressing these challenges is essential for broadening the impact of profiling methods and ensuring their relevance in a globalized world.

Ethical Considerations

The application of sociolinguistic profiling in digital, forensic, and legal contexts raises critical ethical concerns, including privacy, bias, potential misuse, and transparency. Addressing these issues is vital to prevent unintended consequences such as discrimination or privacy violations.

Privacy and Consent

The use of publicly available data from social media and digital communications often occurs without explicit consent, raising significant privacy concerns. Many users are unaware that their language data can be analyzed to infer personal traits like personality, political views, or mental health states. Transparent consent processes are essential, ensuring individuals understand how their data will be used, stored, and shared.

Algorithmic Bias and Discrimination

Bias in machine learning models poses another challenge, as datasets often reflect societal biases. Algorithms trained predominantly on data from dominant cultural groups can misinterpret linguistic markers used by marginalized communities. For instance, Hart et al. (2020) noted profiling inaccuracies for non-Western linguistic backgrounds, increasing the risk of discrimination. This issue is particularly problematic in legal and forensic contexts, where biased outcomes can lead to false or misleading conclusions.

Misuse in Surveillance and Legal Contexts

Sociolinguistic profiling methods can be misused for invasive surveillance or targeting individuals based on political, cultural, or religious attributes. In legal settings, the reliability of profiling methods for inferring truthfulness or intent remains contested. Grant (2013) cautioned against over-reliance on these techniques, as misinterpretation of linguistic evidence could lead to wrongful convictions or unwarranted surveillance.

Transparency and Accountability

Many profiling algorithms function as black-box systems, obscuring their decision-making processes. This lack of transparency makes it difficult for individuals to understand how their traits are inferred or to challenge profiling outcomes. Sewwandi et al. (2017) highlighted the risk of false positives in digital profiling, further emphasizing the need for clear, interpretable algorithms and accountability mechanisms.

Safeguards for Ethical Application

To address these concerns, the following safeguards should guide the development and use of sociolinguistic profiling:

Informed Consent and Transparency: Users should be fully informed about how their data is used, with clear consent protocols.

Bias Audits and Fairness: Regular audits should identify and mitigate discriminatory patterns in algorithms.

Human Oversight: Decisions with legal or social consequences should involve human review to address potential inaccuracies.

Regulation and Ethical Guidelines: Governments and organizations should establish frameworks prioritizing accountability, transparency, and fairness.

By prioritizing privacy, fairness, and transparency, sociolinguistic profiling can be responsibly applied across various fields. Regulatory frameworks, algorithmic improvements, and ethical safeguards are essential to ensuring that these methods are used equitably and effectively.
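One way the bias-audit safeguard could be operationalized is to compare false positive rates across groups and to flag the model for human review when the rates diverge. The sketch below is illustrative only: the group labels, the sample outcomes, and the 0.1 tolerance are assumptions rather than recommended values.

```python
def false_positive_rate(outcomes):
    """outcomes: list of (actual, flagged) booleans for one group."""
    flags_on_negatives = [flagged for actual, flagged in outcomes if not actual]
    if not flags_on_negatives:
        return None  # no true negatives, so the rate is undefined
    return sum(flags_on_negatives) / len(flags_on_negatives)

def audit(outcomes_by_group, tolerance=0.1):
    """Return per-group rates and whether their spread exceeds the tolerance."""
    rates = {group: false_positive_rate(outcomes)
             for group, outcomes in outcomes_by_group.items()}
    defined = [rate for rate in rates.values() if rate is not None]
    spread = max(defined) - min(defined) if defined else 0.0
    return rates, spread > tolerance

# Hypothetical profiling outcomes, for illustration only.
sample = {
    "group_a": [(False, False), (False, False), (False, True), (True, True)],
    "group_b": [(False, True), (False, True), (False, False), (True, True)],
}
rates, needs_review = audit(sample)
```

In this sample, group_b is wrongly flagged twice as often as group_a, so the audit marks the model for review.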

Analytical Framework

The proposed analytical framework integrates idiolect analysis, linguistic feature extraction, and machine learning techniques to systematically profile individuals based on their language use. It combines qualitative and quantitative methods to analyze markers such as syntax, vocabulary, prosody, and non-verbal cues. By uniting manual linguistic analysis with computational techniques, the framework offers a comprehensive, adaptable approach for applications in forensic linguistics, social media analysis, and digital profiling.

Case Study 1: Forensic Linguistics in Author Attribution

The framework is particularly useful in forensic linguistics, where authorship of disputed texts (e.g., threatening letters or anonymous posts) must be determined. For instance, Corney et al. (2002) demonstrated the effectiveness of stylometric features—such as word frequency, sentence length, and syntactic patterns—combined with machine learning techniques for email authorship attribution. Their approach achieved significant accuracy, emphasizing the potential of integrating computational methods with linguistic analysis in forensic investigations.
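To make the stylometric features named above concrete, the following minimal sketch extracts mean sentence length and the relative frequencies of a few function words, then assigns a disputed text to the known author whose feature vector is closest. It is a simplified nearest-profile comparison, not the machine learning pipeline used in the cited study. The function-word list, the function names, and the sample texts are hypothetical.

```python
import re
from collections import Counter

# Illustrative function words; real studies use much larger lists.
FUNCTION_WORDS = ("the", "and", "of", "to", "i", "you")

def stylometric_vector(text):
    """Mean sentence length followed by relative function-word frequencies."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    n_words = len(words) or 1
    features = [len(words) / max(len(sentences), 1)]
    features += [counts[word] / n_words for word in FUNCTION_WORDS]
    return features

def closest_author(disputed, known_samples):
    """Return the author whose sample is nearest in squared Euclidean distance."""
    target = stylometric_vector(disputed)
    def distance(author):
        vector = stylometric_vector(known_samples[author])
        return sum((a - b) ** 2 for a, b in zip(target, vector))
    return min(known_samples, key=distance)

# Hypothetical writing samples, for illustration only.
known = {
    "A": "I think you should go. I do not know. You and I will see.",
    "B": ("The committee reviewed the findings of the report and submitted "
          "the conclusions of the inquiry to the board of directors."),
}
disputed = "I will go. You know I do not wait."
guess = closest_author(disputed, known)
```

Because the features are left unscaled, sentence length dominates the distance in this sketch. A realistic analysis would standardize the features and use far longer samples before drawing any forensic conclusion.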

Case Study 2: Digital Profiling for Targeted Advertising

In digital profiling, the framework aids in extracting linguistic features for applications like targeted advertising. The Cambridge Analytica case serves as an example, where data from social media interactions were analyzed to predict personality traits and consumer preferences. By employing machine learning models to process linguistic and behavioral data, psychographic profiles were developed to inform personalized advertising campaigns, reportedly increasing engagement and influencing user behavior (Isaak & Hanna, 2018).

Added Value of the Framework

The framework demonstrated its value by:

1. Enhancing Accuracy: Providing a structured approach to extracting and analyzing idiolect markers improved the precision of profiling efforts.

2. Ensuring Consistency: The framework standardized the processing of linguistic data across various contexts, minimizing variability in analyses.

3. Integrating Methods: By combining manual linguistic analysis with machine learning, the framework leveraged the strengths of both human intuition and computational power.

4. Improving Interpretability: Clear guidelines for interpreting results ensured that profiling conclusions, whether in legal or commercial applications, were systematic and reproducible.

Conclusion

This review highlights the diverse applications and significant potential of sociolinguistic profiling across various fields, including digital communication, forensic linguistics, and emerging areas such as healthcare and education. By leveraging idiolect markers, such as vocabulary, syntax, and prosody, profiling methods have achieved accuracy rates ranging from 60% to 91%, depending on the context and application. These findings underscore the importance of tailoring methodologies to specific mediums and addressing challenges like linguistic diversity, data limitations, and ethical concerns.

The study demonstrates that combining qualitative insights with computational methods enhances profiling accuracy and adaptability. However, variability in profiling outcomes across different languages and cultural contexts remains a critical limitation. Studies like Verhoeven et al. (2016) and Shrestha et al. (2020) underscore the need for robust cross-linguistic validation and the development of adaptive models capable of handling multilingual data and code-switching. Addressing these challenges will improve the generalizability of profiling techniques and broaden their applicability in global contexts.

Practical applications, such as targeted advertising and authorship attribution, highlight the real-world value of sociolinguistic profiling. Yet, these advancements must be balanced with ethical considerations, including privacy, algorithmic fairness, and transparency. Regulatory frameworks and informed consent protocols are essential to mitigate potential misuse and ensure responsible application.

Future research should focus on expanding linguistic resources for underrepresented languages, improving machine learning models to reduce biases, and exploring interdisciplinary approaches to refine profiling methodologies. By addressing these priorities, sociolinguistic profiling can become a more equitable and effective tool, contributing to advancements in both academic research and real-world applications.

Ultimately, this study reinforces the relevance of idiolect markers in understanding human behavior and communication. With continued refinement and ethical oversight, sociolinguistic profiling has the potential to unlock new insights into individual and collective linguistic patterns, driving innovation in digital, forensic, and social contexts.

Appendix

Appendix 1.

Publications on the Topic of “Key Idiolect Markers in Sociolinguistic Profiling” (2004–2024).

Reference	Idiolect markers identified	Context of application	Profiling outcomes: Success, accuracy, limitations	Linguistic diversity considered	Context-dependence and effectiveness of idiolect markers (EIM)	Key findings and conclusions
Valente et al. (2012)	• Prosody (variations in pitch, loudness, and tempo). • Speech Activity (frequency and duration). • Overlaps and Interruptions. • Linguistic Features (word n-grams and dialog acts).	Social (speakers from the AMI corpus, N = 128)	The traits of extraversion, conscientiousness, and neuroticism were identified with accuracy rates of 74.5%, 67.6%, and 68.7%, respectively. However, the classification of agreeableness and openness did not yield results statistically better than chance. (Applied Techniques – Manual, Computational)	No (English)	The study does not specifically assess the EIM or their contextual influence.	The research successfully applies the Big-Five personality traits model, which serves as a robust framework for understanding how individual speech patterns can be linked to specific personality traits, thereby enhancing the concept of idiolect profiling. Non-linguistic features (such as prosody and speech activity) outperformed linguistic features (like word n-grams and dialog acts). (Robustness – Moderate to high)
Shrestha et al. (2020)	• Writing Style. • Word and Character N-grams. • Sentiment Analysis. • Psycho-linguistic Features (derived from tools like LIWC).	Digital, Social (Social media platforms users, N = 500)	The study reports an accuracy of 0.73 for English and 0.77 for Spanish in detecting fake news spreaders using the proposed methods. These results indicate a reasonable level of success. (Applied Techniques – Manual, Computational)	Yes (English and Spanish)	EIM is influenced by the context of social media usage.	The study concludes that linguistic features, including idiolect markers, are significant in profiling users who spread fake news. (Robustness – Moderate)
Verhoeven et al. (2016)	• Vocabulary. • Syntax. • Speech Patterns.	Digital, Social (Social media platforms users, N = ’large dataset’)	The article does not provide detailed statistics on accuracy or limitations. (Applied Techniques – Manual, Computational)	Yes (Dutch, German, French, Italian, Portuguese, and Spanish)	EIM is influenced by the context of social media usage.	The main findings indicate that personality traits (MBTI) and gender can be inferred from writing style in tweets. The study concludes that a multilingual approach to personality profiling is feasible and opens avenues for further research in sociolinguistic profiling. (Robustness – Moderate)
Kulkarni et al. (2018)	• Vocabulary. • Syntax. • Speech Patterns.	Digital, Social (Social media platforms users, N = 49,139)	The study reports that the language-based traits (BLTs) derived from social media often outperform traditional questionnaire-based traits in predicting outcomes like income and IQ. Limitations include potential biases in social media usage and the generalizability of findings across different populations. (Applied Techniques – Manual, Computational)	No (English)	EIM is influenced by the context of social media usage.	It is feasible to derive meaningful personality traits from social media language, which can complement traditional personality assessments. The derived traits demonstrate predictive validity and stability, suggesting a new avenue for understanding human behavior through language. (Robustness – Moderate)
Pervaz et al. (2015)	• Vocabulary. • Syntax. • Punctuation. • Question sentences. • Overall Writing Style.	Digital, Social (Social media platforms users, N = ’multiple datasets from the PAN competition’)	The study reported that certain features, such as the percentage of question sentences and average sentence length, were effective in profiling authors. Limitations include the variability in effectiveness across different languages and the potential for overfitting in machine learning models. (Applied Techniques – Manual, Computational)	Yes (English, Dutch, Spanish, Italian)	EIM is influenced by the context of social media usage.	The study concluded that stylistic features are valuable for identifying author personality traits and that certain features are effective across multiple languages. The findings suggest that a combination of stylistic analysis and machine learning can enhance author profiling efforts. (Robustness – Moderate)
Litvinova et al. (2016)	• Vocabulary. • Syntax. • Speech Patterns.	Clinical (Patients at risk of self-destructive behavior, N = 721)	The article reports a mathematical model that predicts self-destructive behavior based on text analysis, indicating a degree of success in profiling. However, it acknowledges limitations due to a relatively small sample size and the limited range of text parameters used. (Applied Techniques – Manual, Computational)	No (Russian)	EIM is influenced by the corresponding clinical context.	A set of correlations between scores on the Freiburg Personality Inventory scales that are known to be indicative of self-destructive behaviour (“Spontaneous Aggressiveness,” “Depressiveness,” “Emotional Lability,” and “Composedness”) and text variables (average sentence length, lexical diversity, etc.) has been calculated. (Robustness – Moderate)
Jakovljev and Milin (2017)	• Thematic Features. • Lexical Features. • Syntactic Features.	Social (Serbian participants, representing a diverse demographic in terms of age and background, N = 114)	The study reports significant correlations between personality traits and thematic content, indicating that idiolect markers can reflect underlying personality characteristics. However, it acknowledges limitations such as the relatively small size of the written material and the potential for coarse measures of lexical and syntactic richness. (Applied Techniques – Manual, Computational)	Yes (Serbian, English)	EIM was influenced by the context, as participants’ writing reflected their socio-economic and political circumstances, which varied by age and background.	The study concludes that personality traits significantly affect the thematic content of written texts, suggesting that idiolect markers can provide insights into an individual’s personality. (Robustness – Moderate)
Wright and Chin (2016)	• Vocabulary usage (word frequency and types). • Syntax (part of speech n-grams). • Speech patterns (hybrid POS and word n-grams).	Digital, Educational (Web forum users, N = 49. Students, N = 2,588).	The study reported that language features were significantly related to the personality dimension of Conscientiousness. However, the effect sizes varied, being small for the Essays corpus and larger for the Forum corpus. Limitations include the sparsity of features in the Essays corpus, which may constrain predictive impact. (Applied Techniques – Manual, Computational)	No (English)	EIM was influenced by the context, as evidenced by the difference in effect sizes between the Essays and Forum corpora. The longer texts in the Forum corpus allowed for more robust feature extraction.	The study concludes that language usage reveals significant insights into personality traits, particularly Conscientiousness. (Robustness – Moderate)
Moskvichev et al. (2018)	• Vocabulary. • Themes. • Linguistic Patterns.	Digital, Social (Facebook users from the Russian-speaking segment, N = 8,367)	The study reports that while the predictive accuracies for identifying psychological traits are generally low, they are significantly above chance levels. The article notes that psychopathy is the most predictable trait among the Dark Triad, but overall performance metrics are not high due to factors like data size and noise in the text. (Applied Techniques – Computational)	No (Russian)	EIM is influenced by the context of social media.	The study concludes that it is possible to predict certain psychological traits based on linguistic behavior in social networks, although the accuracy is limited. It highlights the potential for using data-driven methodologies to understand personal traits through user-generated texts. (Robustness – Moderate to high)
Daelemans (2016)	• Vocabulary. • Syntax. • Speech Patterns (the unique ways individuals express themselves, which can include tone and style of writing).	Digital, Social (Social media users, N = 18,168)	The article reports mixed success in profiling accuracy, highlighting: • Limitations: Issues with the reliability of gold standard data and low accuracies for many personality traits. • Success: Some methods showed promise, but overall effectiveness remains uncertain. (Applied Techniques – Manual, Computational)	Yes (Spanish, Portuguese, French, Dutch, Italian, German)	EIM was influenced by the context, as the article notes the need for balanced corpora to study interactions between profile dimensions like age and gender with personality traits.	The study concludes that while profiling social media users based on idiolect markers is a promising area, significant challenges remain. The findings indicate that personality profiling from text is possible but fraught with issues related to accuracy and data quality. (Robustness – Moderate)
Sewwandi et al. (2017)	• Vocabulary (the choice of words used in social media updates). • Syntax (the structure of sentences and phrases). • Speech Patterns (the overall style and tone of communication in written form).	Business (HR Management Systems personnel, N = not specified)	The reported success of the profiling technique is notable, with an accuracy level of 91% when tested against a real-world personality detection questionnaire. However, limitations regarding the generalizability of the findings and the specific contexts in which the model was tested are not discussed in detail. (Applied Techniques – Manual, Computational)	No (English)	EIM is influenced by the context of social media usage, as the linguistic features analyzed are specific to the platform and the nature of the content shared by users.	The main findings indicate that linguistic features can effectively detect personality traits with high accuracy. The study concludes that the proposed technique (including supervised machine learning algorithms and LIWC features) is valuable for various applications, particularly in HR management and psychological research. (Robustness – Moderate to high)
Kerz (2022)	• Vocabulary usage. • Syntax patterns. • Speech patterns, particularly in the context of personality traits.	Social (Two benchmark datasets used: the Big Five Essay dataset and the MBTI Kaggle dataset. N = not specified)	The reported success of the profiling outcomes includes an improvement in classification accuracy by 2.9% on the Essay dataset and 8.28% on the MBTI dataset compared to existing work. Limitations are not explicitly mentioned, but the complexity of language and individual differences may pose challenges to accuracy. (Applied Techniques – Manual, Computational)	No (English)	EIM is influenced by the context of the datasets used, which target specific personality models (Big Five and MBTI).	The main findings indicate that the hybrid models developed outperform existing methods in predicting personality traits from text, highlighting the potential of using psycholinguistic features as idiolect markers in sociolinguistic profiling. (Robustness – Moderate to high)
Mairesse et al. (2007)	• Vocabulary usage (the frequency of specific word categories). • Syntax (sentence length and complexity). • Speech patterns, including pitch variation and speech rate, which are linked to traits like extraversion and agreeableness.	Social, Educational (Participants from the EAR corpus, which included a diverse group of students and young adults, N = 2,575)	The article reports that ranking models outperformed traditional classifiers in predicting personality traits, indicating a high level of accuracy in profiling. Limitations include the potential influence of self-report biases and the need for larger datasets to improve model performance. (Applied Techniques – Manual, Computational)	No (English)	EIM varied depending on the context, with conversational data yielding different results compared to written texts.	The study concludes that personality traits can be effectively modeled using linguistic cues, with specific idiolect markers providing valuable insights into individual differences. The findings suggest that observed personality traits are easier to model than self-reported traits, highlighting the importance of external assessments in profiling. (Robustness – Moderate to high)
Roivainen (2015)	• Vocabulary (adjectives). • Syntax and Speech Patterns in different contexts (Google Books vs. Twitter).	Digital, Social (Twitter users, N = not specified. Google Books corpus, N > 5 million books).	The study reports that certain adjectives (e.g., “intelligent” and “creative,” “open-minded,” and “narrow-minded”) dominate usage, indicating their centrality in personality descriptions. However, it also notes limitations in the representation of other traits, suggesting a potential bias in the adjectives selected for analysis. (Applied Techniques – Manual, Computational)	No (English)	EIM is influenced by the context, as evidenced by the differences in adjective usage between formal (Google Books) and informal (Twitter) settings.	The study concludes that the frequency of personality adjectives reflects their importance in social contexts, with “intelligent” and “creative” being central to the openness to experience/intellect factor. It suggests that personality models should consider the social relevance of traits when selecting adjectives for profiling. (Robustness – Moderate)
Marttila (2013)	• Pronunciation Flawlessness (the degree to which a person’s pronunciation aligns with the reference language). • Sound Elements (specific phonetic features that are characteristic of the reference language).	Legal, Social (Participant types and sample size are not specified)	The article outlines methods for linguistic profiling, which can be applied to sociolinguistic profiling. Techniques include: • Autocorrelation: Used to analyze speech patterns. • Pattern Recognition: Identifying recurring sound features. • Signal Processing: Techniques for processing and analyzing speech data. The study suggests that the methods can effectively identify deviations from the reference language, but it does not provide specific success rates or limitations. (Applied Techniques – Manual, Computational)	Yes (the approach is designed to handle multiple languages. However, the languages are not specified)	EIM may be influenced by the context in which they are applied, such as the speaker’s familiarity with the reference language or the setting of the speech sample collection.	The article concludes that linguistic profiling can effectively identify sound elements and features in speech, aiding in the understanding of a person’s linguistic background and identity. It emphasizes the importance of comparing speech samples to a reference language to assess proficiency and identity. (Robustness – Moderate)
William et al. (2023)	• Communication style	Business (Interviewees during the recruitment process, N = not specified, big data used)	The article suggests that textual analysis is an effective measure for predicting personality attributes, indicating a positive outcome. However, it does not provide specific metrics on success, accuracy, or limitations related to the profiling outcomes. (Applied Techniques – Computational)	No (English)	The effectiveness of the personality prediction methods may be influenced by the context of the textual data.	The article emphasizes the use of textual content from interview responses to predict personality traits, which can be seen as a form of sociolinguistic profiling. The main finding is that personality prediction through textual analysis is a significant area of research, with potential applications in psychology and computer science. The article concludes that automated prediction of personality traits is necessary for large user groups. (Robustness – Moderate)
Mohammadi and Vinciarelli (2012)	• Pitch (variability in pitch is linked to personality traits). • Loudness (the volume of the speech). • Speaking Rate (the speed at which someone speaks).	Social (Unacquainted speakers, N = 330. Total 640 speech clips used for assessment)	The reported accuracy of the profiling outcomes ranges from 60% to 75%, depending on the personality trait assessed. Limitations include the challenge of predicting continuous personality scores rather than binary classifications, which could enhance the psychological relevance of the findings. (Applied Techniques – Manual, Computational)	No (French)	EIM is influenced by the context of zero acquaintance, where judges assess speakers they do not know, highlighting the importance of nonverbal cues in personality perception.	The study concludes that it is feasible to predict personality traits based on prosodic features, with notable success in traits like Extraversion and Conscientiousness. The findings suggest that nonverbal vocal behavior significantly influences personality perception, providing a foundation for future research in sociolinguistic profiling. (Robustness – Moderate to high)
Strashko (2023)	• Vocabulary. • Syntax (the analysis notes violations of literary norms). • Speech Patterns (emotional and evaluative interjections).	Social (A representative of the group, affected by the war, N = 1 from multimedia corpus)	The study suggests that the analysis of the respondent’s speech provides insights into her emotional state and social context. However, it does not report specific success rates or accuracy metrics related to profiling outcomes, indicating a limitation in quantifying effectiveness. (Applied Techniques – Manual)	Yes (Ukrainian, Russian)	EIM is influenced by the context of the informant’s life experiences, particularly during the emotional turmoil of war.	The study concludes that the respondent’s linguistic personality is shaped by her life experiences, educational background, and social status. The analysis of her speech reveals significant insights into her emotional and cognitive state during a critical period in her life. (Robustness – Moderate to low)
Blake (2008)	• Vocabulary (specific word choices that reflect personal or regional preferences). • Syntax (unique sentence structures that may indicate individual speaking styles). • Speech Patterns (distinctive rhythms, intonations, or pronunciations that characterize an individual’s speech).	Legal, Social (Criminals, Students. The study included a significant number of participants, though specific numbers are not provided)	The article reports on the effectiveness of profiling outcomes, noting: • Success Rates: High accuracy in identifying individuals based on idiolect markers. • Limitations: Challenges in generalizing findings across different contexts or populations. (Applied Techniques – Manual, Computational)	Yes. The study acknowledges variations across different, unspecified languages and dialects, which can influence idiolect markers and their interpretation.	EIM is influenced by context, as different settings (e.g., legal vs. social) may yield varying results in profiling accuracy.	The main findings indicate that idiolect markers are effective in sociolinguistic profiling, with significant implications for understanding individual identity and communication patterns. The study concludes that while idiolect markers provide valuable insights, context and linguistic diversity must be considered for accurate profiling. (Robustness – Moderate to high)
Kirkegaard (2018)	• N-grams (patterns derived from the arrangement of letters in names). • Regex patterns (regular expressions used to identify specific linguistic structures in names).	Social, Digital (Users of the Danish names Data Base, N = 1890 names)	The study reports a substantial predictive validity for the models developed: • Overall predictive validity was r = .75 when including origin covariates. • For the Danish subset, the validity was r = .46 using only linguistic features. Limitations include the difficulty in providing a precise numerical estimate of the incremental validity of linguistic features. (Applied Techniques – Manual, Computational)	No (Danish)	EIM is influenced by the geographic origin of names, as the study found that social status varied by origin group, indicating context-dependence in the analysis.	The study concludes that it is possible to train accurate social status predictors from subtle linguistic patterns in names, suggesting that humans may use these cues for social perception when data is limited. (Robustness – Moderate to high)
Guidi et al. (2019)	• Vocabulary (specific word choices that reflect individual preferences and backgrounds). • Syntax (unique sentence structures that may indicate personal style). • Speech Patterns (rhythms and intonations that characterize an individual’s speech).	Legal, Social (Students in academic discussions, professionals in business environments, N = 200)	The study reported a high accuracy in identifying certain personality traits based on speech. Limitations include challenges in generalizing findings across different contexts and populations. (Applied Techniques – Manual, Computational)	Yes (the study examines variations in speech across different dialects and languages, highlighting how these factors influence idiolect markers).	EIM was influenced by context, with variations in accuracy depending on the setting (e.g., legal vs. social).	The main findings indicate that idiolect markers can effectively contribute to sociolinguistic profiling, with specific markers being more reliable in certain contexts. The study concludes that while there are promising outcomes, further research is needed to enhance the accuracy and applicability of these profiling techniques. (Robustness – Moderate)
Olivares et al. (2018)	• Vocabulary (the use of specific adjectives and terms that reflect personality traits). • Syntax (the structure of sentences, including the complexity and types of clauses used). • Speech Patterns (the frequency of certain words and the emotional tone).	Digital, Social (Members of the Yahoo! Answers community, N = 100)	The study reports that different models are effective for identifying distinct personality traits, indicating a moderate to high accuracy in profiling. Limitations include the challenge of accurately assessing certain traits, such as openness, which was noted as particularly difficult. (Applied Techniques – Manual, Computational)	No (English)	EIM was influenced by the context of the digital question-answering community, as the linguistic features were tailored to the nature of the interactions on that platform.	The study concludes that it is feasible to identify personality traits through linguistic analysis in digital contexts, with specific linguistic features correlating to different personality factors. The findings support the idea that idiolect markers can be effectively utilized in sociolinguistic profiling. (Robustness – High)
Utami et al. (2022)	• Vocabulary (specific word choices and phrases used by individuals). • Syntax (sentence structure and grammatical patterns). • Speech Patterns (unique ways of expressing thoughts, including rhythm and intonation).	Digital, Social (Twitter users who post in Bahasa Indonesia, N = 292 users and 269,649 tweets)	The study develops an automatic profiling system that combines Natural Language Processing and Machine Learning approaches to classify Twitter users’ personality according to the DISC personality traits framework. It reports a high level of accuracy in profiling personality traits based on language use in Twitter posts. Limitations include potential biases in the data due to the nature of social media, where users may not always express their true selves. (Applied Techniques – Manual, Computational)	No (Bahasa Indonesia)	EIM is influenced by the digital context of Twitter, where brevity and informal language can affect how personality traits are expressed and interpreted.	The main findings indicate that specific idiolect markers can effectively profile personality traits in a digital context. The study concludes that language use on social media platforms like Twitter can reveal significant insights into individual personality characteristics. (Robustness – Moderate)
Saga et al. (2023)	• Vocabulary. • Syntax. • Speech Patterns (characteristics such as speech length and use of function words, which are critical for capturing Formal Thought Disorder (FTD) symptoms).	Clinical (General population, N = 76, of which 28 were candidates for autism spectrum disorder and 15 for schizotypal personality disorder)	The study reported significant correlations between the odd speech subscale and total scores of SPQ and SRS, indicating effective profiling of FTD symptoms. Limitations include the exploratory nature of the research and reliance on self-reported measures, which may affect accuracy. (Applied Techniques – Manual, Computational)	No (Japanese)	EIM was influenced by the context of the tasks used to elicit FTD symptoms, with negative memory tasks proving more effective than positive ones.	The study concluded that longer speech and specific tasks (like negative memory recall) are effective in eliciting FTD symptoms. It highlighted the importance of function words and temporal features in profiling, suggesting differences between SPD-like and ASD-like symptoms. (Robustness – Moderate)
Hart et al. (2020)	• Vocabulary. • Syntax. • Speech Patterns (rhythms and intonations that characterize an individual’s speech).	Legal, Social (Students and criminals. N = 247).	The study reported a high success rate in identifying individuals based on their idiolect markers. Accuracy was noted to be around 85%, although limitations included potential biases in speech samples and the need for more diverse datasets. (Applied Techniques – Manual, Computational)	No (English)	EIM was influenced by the context, with variations noted in formal versus informal settings.	The study concluded that idiolect markers are effective in sociolinguistic profiling, with significant implications for legal and social contexts. The study successfully identifies specific personality-disorder traits that influence how individuals present themselves. This includes traits such as narcissism, borderline personality disorder, and antisocial behavior, which were linked to distinct self-presentation tactics used by individuals in various contexts. (Robustness – Moderate to high)
Alam and Riccardi (2014)	• Acoustic Features (related to the sound of speech, such as pitch and tone). • Linguistic Features (vocabulary choices and syntax). • Psycholinguistic Features (derived from the analysis of word usage and emotional content in speech, using tools like the Linguistic Inquiry and Word Count (LIWC)).	Broadcast News, Social (The participants included speakers from diverse backgrounds, such as customers and agents in the PerSIA corpus. N = 144 calls)	The study reports improved performance in recognizing personality traits through the combination of different feature sets. However, it notes limitations in the conscientiousness category, indicating that while some traits were effectively predicted, others were predicted less accurately. (Applied Techniques – Manual, Computational)	Yes (Italian, French)	EIM was influenced by the context, as different corpora yielded varying results in trait recognition, suggesting that context plays a significant role in profiling accuracy.	The study concluded that combining acoustic, linguistic, and psycholinguistic features enhances the recognition of speaker personality traits. It highlighted the importance of feature selection and the potential for further improvement in profiling techniques. (Robustness – Moderate)
Lukito et al. (2016)	• Vocabulary: The choice of words used by individuals in their social media posts. • Syntax: The structure and arrangement of sentences that reflect personal style. • Speech Patterns: The rhythm and flow of language that can indicate personality traits.	Digital, Social (Twitter users from Indonesia. N = 200,000 tweets)	The study reports a high accuracy of 80% for Introvert-Extrovert traits and 60% for other traits (Sensing-Intuition, Thinking-Feeling, Judging-Perceiving). Limitations include lower accuracy for Sensing-Intuition traits compared to other studies, indicating potential areas for improvement in profiling accuracy. (Applied Techniques – Manual, Computational)	No (Bahasa Indonesia)	EIM is influenced by the context of social media, where informal language and specific cultural references may affect the accuracy of personality predictions.	The study concludes that personality traits can be effectively predicted from social media posts using computational linguistic analysis, with Naive Bayes being the most effective model. It emphasizes the potential for automated personality classification in marketing and other applications. A simple application built on the best-performing of the compared statistical models takes a Twitter username and gender as input, classifies the user’s personality, and shows the best performance in terms of classification speed. (Robustness – High)
Kumar and Gavrilova (2019)	• Vocabulary. • Syntax. • Speech Patterns (the stylistic features)	Social, Digital (Twitter users. N = 8,675).	The study reports significant improvements in personality trait estimation, outperforming state-of-the-art methods. The effectiveness is measured using metrics like mean absolute error (MAE) and root mean square error (RMSE), indicating a high level of accuracy in profiling outcomes. Limitations include challenges in classifying certain traits, such as Neuroticism, which proved to be more difficult. (Applied Techniques – Manual, Computational)	No (English)	EIM is influenced by the context of social media interactions, as the linguistic style can vary significantly based on the platform and the nature of the communication.	The study employs a linguostylistic personality traits assessment (LPTA) system that combines various text representation schemes to classify personality traits. The study concludes that it is possible to accurately predict personality traits from a limited number of tweets, demonstrating the potential of using idiolect markers for sociolinguistic profiling. The findings highlight the importance of linguistic style in understanding personality and behavior in digital contexts. The combination of various NLP techniques and the validation against established datasets enhances the reliability of the findings. (Robustness – High).
Jyothi et al. (2024)	• Writing style (as a means to infer characteristics of authors, which can be related to idiolect markers)	Social, Digital (Twitter users. N not specified; big data used)	The study reports an accuracy of 87.53% for personality traits classification using the LSTM model with combined embeddings, indicating a high level of effectiveness. However, it also notes that using BERT alone with LSTM yielded lower accuracy (78.45%), suggesting limitations in the model’s performance when not combined with SimCSE embeddings. (Applied Techniques – Computational)	No (English)	The effectiveness of the models may be influenced by the context of the data (i.e., Twitter), as the informal and varied nature of social media language can affect the accuracy of profiling. The study does not provide detailed insights into how context specifically influenced the outcomes.	The study applies pre-trained models like BERT and SimCSE for generating embeddings, which can be seen as computational methods for profiling based on language use. The classification techniques employed include Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN). The main findings indicate that combining BERT and SimCSE embeddings with LSTM classifiers significantly enhances the accuracy of personality traits classification. The study concludes that deep learning models are versatile and effective for author profiling tasks, although it does not specifically address idiolect markers. (Robustness – Moderate to high)
Drouin et al. (2017)	• Vocabulary. • Total Word Usage. • Clout (a measure of social dominance)	Legal (The study involved convicted child sex offenders and undercover agents. N = 590 chat transcripts analyzed)	The study reported that linguistic analyses can provide objective measures that help assess offenders’ predispositions. In particular, offenders used a higher frequency of sexual words compared to undercover agents, indicating a distinct lexical choice. Offenders also generally used more words overall, suggesting differences in verbosity, and exhibited higher clout scores than agents. However, the study does not detail specific success rates or limitations, indicating a need for further research to validate these findings. (Applied Techniques – Manual, Computational)	No (English)	EIM appears to be context-dependent, as the language used by offenders was analyzed within the specific setting of online sexual solicitation, which may not generalize to other contexts.	The main findings indicate that offenders exhibit distinct linguistic patterns compared to undercover agents, with higher usage of sexual vocabulary, overall verbosity, and social dominance. These findings suggest that linguistic analysis can be a valuable tool in profiling offenders in legal contexts. (Robustness – Moderate)
Beltrama and Schwarz (2024)	• Speech Patterns (the distinction between a “Nerdy” persona (precise speech) and a “Chill” persona (imprecise speech) is highlighted, indicating how speech style can serve as an idiolect marker)	Social, Digital (Research participants recruited through the platform “Prolific,” N = 240)	The study suggests that the interpretation of speech can be influenced by the speaker’s persona, which could be relevant for profiling. The study uses a picture selection task to assess how different personae affect the interpretation of numerical expressions, suggesting a manual method of profiling based on social perception. The article reports that speakers with a Nerdy persona are interpreted more precisely than those with a Chill persona, indicating a successful outcome in demonstrating how persona affects interpretation. Limitations include the lack of detailed participant demographics and the specific contexts in which these interpretations were made. (Applied Techniques – Manual)	No (English)	EIM, as indicated by the study, is influenced by the persona of the speaker, suggesting that context plays a significant role in interpretation.	The main finding is that the social perception of the speaker’s persona significantly influences the interpretation of numerical expressions. This suggests that comprehenders use socio-indexical information to inform their understanding of meaning. In turn, socio-indexical meanings (as idiolect markers) play a crucial role in language processing and understanding, highlighting their relevance to sociolinguistic profiling. (Robustness – Moderate to high)
Park et al. (2015)	• Vocabulary. • Syntax. • Speech Patterns (the overall style and tone of language used in posts)	Digital, Social (Facebook users. N > 70,000).	The study reported high effectiveness in predicting personality traits based on language use, showing: • Accuracy. Language-based assessments aligned well with self-reports and informant reports. • Limitations. While the study demonstrated validity, it did not explore potential biases in language use across different demographics. (Applied Techniques – Manual, Computational)	No (English)	EIM was influenced by the context of social media, where language use can vary significantly based on audience and platform norms.	The study concluded that language-based assessments can provide valid measures of personality, complementing traditional methods. It highlighted the potential of using social media language to create rich profiles of individuals’ mental lives. (Robustness – High)
Cimino et al. (2013)	• General-purpose features that qualify the lexical and grammatical structure of a text, which can be considered as idiolect markers.	Digital, Educational (Training and development sets extracted from the TOEFL11 corpus. N = 9,900 essays)	The article addresses native language identification (NLI) as a form of sociolinguistic profiling. The article reports encouraging results from the NLI task, indicating a level of success and accuracy in the profiling outcomes. However, it does not discuss specific limitations or challenges faced during the study. (Applied Techniques – Computational)	Yes (Arabic, Chinese, English, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish)	The effectiveness of the proposed approach is implied to be context-dependent, as it is designed to be general-purpose and adaptable to various tasks and domains.	The main finding is that the proposed approach to native language identification using general-purpose features yields encouraging results, indicating its potential for broader applications in sociolinguistic profiling. (Robustness – Moderate to high)
Li et al. (2022)	• Vocabulary. • Language Styles (variations in expression that reflect personality traits, such as formality or informality in language use). • Psycholinguistic Features (elements that reveal psychological states and social connections)	Digital, Social (User posts from datasets like YouTube, PAN, and MyPersonality. Big data used.)	The article introduces a novel hierarchical graph attention network (PerHGAT) for personality prediction. This method aggregates language style information into semantic learning, which can be seen as a computational profiling technique. The focus is on leveraging both semantic and stylistic elements of language for profiling purposes. PerHGAT achieves state-of-the-art performance in predicting personality traits, indicating high effectiveness. However, the article does not discuss specific limitations or accuracy metrics in detail. (Applied Techniques – Computational)	No (English)	EIM in this study is influenced by the context of social media, where language use can vary significantly based on audience and platform. The model’s ability to aggregate language styles suggests that context plays a crucial role in personality prediction.	The main findings indicate that integrating language styles with semantic understanding enhances personality prediction accuracy. The study concludes that personality traits can be effectively predicted through a combination of language styles and semantic information, showcasing the potential of using idiolect markers in profiling. (Robustness – High)

ORCID iD

Vitalii Shymko

Ethical Considerations

This article does not contain any studies involving human participants performed by the author.

Consent for Publication

The author approves the publication of the current work. The work has not been published, nor has it been submitted to other journals for consideration.

Author Contributions

The author is the only person who contributed to all parts of this paper.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability

Data sharing is not applicable to this article, as no datasets were generated or analyzed during the current study.

References

Alam & Riccardi (2014). Fusion of acoustic, linguistic and psycholinguistic features for speaker personality traits recognition [Conference session]. 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2014 (pp. 955–959). Article 6853738. Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP.2014.6853738

Beltrama & Schwarz (2024). (Im)precise personae: The effect of socio-indexical information on semantic interpretation. Language in Society. Advance online publication. https://doi.org/10.1017/S0047404524000320

Blake, B. J. (2008). Tables. All About Language. Oxford Academic.

Cimino, Dell’Orletta, Venturi, & Montemagni (2013, June). Linguistic profiling based on general-purpose features and native language identification. In Tetreault, Burstein, & Leacock (Eds.), Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 207–215). Association for Computational Linguistics. https://aclanthology.org/W13-1727

Corney, De Vel, Anderson, & Mohay (2002). Gender-preferential text mining of e-mail discourse [Conference session]. 18th Annual Computer Security Applications Conference, Las Vegas, 9–13 December 2002 (pp. 282–289). https://doi.org/10.1109/CSAC.2002.1176299

Coulthard (2004). Author identification, idiolect, and linguistic uniqueness. Applied Linguistics, 25(4), 431–447. https://doi.org/10.1093/applin/25.4.431

Coulthard & Johnson (2007). An introduction to forensic linguistics: Language in evidence. Routledge.

Daelemans (2016). Keynote: Profiling the personality of social media users, ELRA. In Proceedings of the Final Workshop, 7 December 2016. https://doi.org/10.4000/books.aaccademia.1927

Drouin, Boyd, R. L., Hancock, J. T., & James (2017). Linguistic analysis of chat transcripts from child predator undercover sex stings. Journal of Forensic Psychiatry & Psychology, 28(4), 437–457. https://doi.org/10.1080/14789949.2017.1291707

Eades (2010). Sociolinguistics and the legal process. Multilingual Matters.

Grant (2013). TXT 4N6: Method, consistency, and distinctiveness in the analysis of SMS text messages. Journal of Law and Policy, 21(2), 467–494. https://publications.aston.ac.uk/id/eprint/40092/1/2013_Grant_TXT_4N6_Journal_of_Law_and_policy.pdf

Guidi, Gentili, Scilingo, E. P., & Vanello (2019). Analysis of speech features and personality traits. Biomedical Signal Processing and Control, 51, 1–7. https://doi.org/10.1016/j.bspc.2019.01.027

Hart, Tortoriello, G. K., & Richardson (2020). Profiling personality-disorder traits on self-presentation tactic use. Personality and Individual Differences, 156, 109793. https://doi.org/10.1016/j.paid.2019.109793

Isaak & Hanna, M. J. (2018). User data privacy: Facebook, Cambridge Analytica, and privacy protection. Computer, 51(8), 56–59. https://doi.org/10.1109/MC.2018.3191268

Jakovljev & Milin (2017). The relationship between thematic, lexical, and syntactic features of written texts and personality traits. Psihologija, 50(1), 67–84. https://doi.org/10.2298/PSI161012006J

Jyothi, K. D., Pradeepa, I. L., & Kavuri (2024). Author profiling approach: Predicting personality traits on Twitter data using combined BERT and SimCSE embeddings. International Journal for Science Technology and Engineering, 12(6), 1140–1147. https://doi.org/10.22214/ijraset.2024.63286

Kerz (2022). Pushing on personality detection from verbal behavior: A transformer meets text contours of psycholinguistic features. In Proceedings of the 12th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis (pp. 182–194), Dublin, Ireland. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.wassa-1.17

Kirkegaard, E. O. W. (2018). Linguistic features in names and social status: An exploratory study of 1,890 Danish first names. Open Differential Psychology. Advance online publication. https://doi.org/10.26775/odp.2018.12.12

Kulkarni, Kern, M. L., Stillwell, Kosinski, Matz, Ungar, Skiena, & Schwartz, H. A. (2018). Latent human traits in the language of social media: An open-vocabulary approach. PLOS ONE, 13(11), e0201703. https://doi.org/10.1371/journal.pone.0201703

Kumar & Gavrilova, M. L. (2019, September). Personality traits classification on Twitter [Conference session]. 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) (pp. 1–8). IEEE. http://dx.doi.org/10.1109/AVSS.2019.8909839

Liu & Bai (2022). Language style matters: Personality prediction from textual styles learning [Conference session]. 2022 IEEE International Conference on Knowledge Graph (ICKG) (pp. 141–148). IEEE. https://doi.org/10.1109/ICKG55886.2022.00025

Litvinova, Seredin, Litvinova, & Zagorovskaya (2016). Profiling a set of personality traits of text author: What our words reveal about us. Research in Language, 14(4), 409–422. https://doi.org/10.1515/rela-2016-0019

Liu, Perez, & Nowson (2017). A language-independent and compositional model for personality trait recognition from short texts [Conference session]. In Lapata, Blunsom, & Koller (Eds.), Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers (pp. 754–764). Association for Computational Linguistics. https://aclanthology.org/E17-1071/

Lukito, L. C., Erwin, Purnama, & Danoekoesoemo (2016). Social media user personality classification using computational linguistics [Conference session]. 2016 8th International Conference on Information Technology and Electrical Engineering (ICITEE) (pp. 1–6). IEEE. https://doi.org/10.1109/ICITEED.2016.7863313

Mairesse, Walker, M. A., Mehl, M. R., & Moore, R. K. (2007). Using linguistic cues for the automatic recognition of personality in conversation and text. Journal of Artificial Intelligence Research, 30, 457–500. https://doi.org/10.1613/jair.2349

Marttila (2013). Method of linguistic profiling (U.S. Patent Application No. 13/878,284).

McMenamin, G. R. (2002). Forensic linguistics: Advances in forensic stylistics. CRC Press.

Mohammadi & Vinciarelli (2012). Automatic personality perception: Prediction of trait attribution based on prosodic features. IEEE Transactions on Affective Computing, 3(3), 273–284. https://doi.org/10.1109/T-AFFC.2012.5

Mooney & Evans (Eds.). (2023). Language, society and power: An introduction (6th ed.). Routledge.

Moskvichev, Dubova, Menshov, & Filchenkov (2018). Using linguistic activity in social networks to predict and interpret dark psychological traits. In Filchenkov, Pivovarova, & Žižka (Eds.), Artificial intelligence and natural language. AINL 2017. Communications in computer and information science (Vol. 789, pp. 16–26). Springer.

Olivares, Vivanco, L. M., & Figueroa (2018). The big five: Discovering linguistic characteristics that typify distinct personality traits across Yahoo! Answers members. Computación y Sistemas, 22(3), 795–807. https://doi.org/10.13053/cys-22-3-2752

Olsson (2004). Forensic linguistics: An introduction to language, crime, and the law. Continuum.

Oxford Bibliographies. (2014, June 30). William Labov - Linguistics. https://www.oxfordbibliographies.com/abstract/document/obo-9780199772810/obo-9780199772810-0195.xml

Park, Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Kosinski, Stillwell, D. J., Ungar, L. H., & Seligman, M. E. (2015). Automatic personality assessment through social media language. Journal of Personality and Social Psychology, 108(6), 934–952. https://doi.org/10.1037/pspp0000020

Pervaz, Ameer, Sittar, & Nawab, R. M. A. (2015). Identification of author personality traits using stylistic features. In CEUR Workshop Proceedings, 1391. https://downloads.webis.de/pan/publications/papers/pervaz_2015.pdf

Roivainen (2015). Personality adjectives in Twitter tweets and in the Google Books corpus: An analysis of the facet structure of the openness factor of personality. Current Psychology, 34(3), 621–625. https://doi.org/10.1007/s12144-014-9274-x

Saga, Tanaka, & Nakamura (2023). Computational analyses of linguistic features with schizophrenic and autistic traits along with formal thought disorders [Conference session]. Proceedings of the 25th International Conference on Multimodal Interaction (ICMI ’23). Association for Computing Machinery, New York, NY, USA (pp. 119–124). https://doi.org/10.1145/3577190.3614132

Sewwandi, Perera, Sandaruwan, Lakchani, Nugaliyadde, & Thelijjagoda (2017). Linguistic features-based personality recognition using social media data [Conference session]. 2017 6th National Conference on Technology and Management (NCTM) (pp. 63–68). IEEE. https://doi.org/10.1109/NCTM.2017.7872829

Shrestha, Spezzano, & Joy (2020, October). Detecting fake news spreaders in social networks via linguistic and personality features. In Working Notes of CLEF 2020-Conference and Labs of the Evaluation Forum.

Strashko, I. V. (2023). Phonetic, lexical, grammatical, cognitive, and pragmatic levels of the linguistic personality (based on the interview from the author’s multimedia corpus). Naukovij časopis Nacìonal′nogo pedagogìčnogo unìversitetu ìmenì M.P. Dragomanova. https://doi.org/10.31392/NPU-nc.series9.2023.25.06

StudySmarter. (n.d.). Accommodation theory: Definition & examples. https://www.studysmarter.co.uk/explanations/english/language-and-social-groups/accommodation-theory/

Turell, M. T. (2011). The use of textual, grammatical and sociolinguistic evidence in forensic text comparison. International Journal of Speech Language and the Law, 17(2), 211–250. https://doi.org/10.1558/ijsll.v17i2.211

Utami, Hartanto, A. D., Adi, Oyong, & Raharjo (2022). Profiling analysis of DISC personality traits based on Twitter posts in Bahasa Indonesia. Journal of King Saud University - Computer and Information Sciences, 34(2), 264–269. https://doi.org/10.1016/j.jksuci.2019.10.008

Valente, Kim, & Motlicek (2012). Annotation and recognition of personality traits in spoken conversations from the AMI meetings corpus. In Proc. Interspeech 2012 (pp. 1183–1186). https://doi.org/10.21437/Interspeech.2012-125

Verhoeven, Daelemans, & Plank (2016). TwiSty: A multilingual Twitter stylometry corpus for gender and personality profiling [Conference session]. International Conference on Language Resources and Evaluation. https://aclanthology.org/L16-1258.pdf

William, Yogeesh, Tidake, V. M., Gondkar, S. S., & Vengatesan (2023). Framework for implementation of personality inventory model on natural language processing with personality traits analysis [Conference session]. 2023 International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT) (pp. 625–628). IEEE. https://doi.org/10.1109/IDCIoT56793.2023.10053501

Wright, W. R., & Chin, D. N. (2016). Personality profiling from text: Language features tied to personality across corpora. User Modeling, Adaptation, and Personalization (Extended Proceedings). https://ceur-ws.org/Vol-1618/Poster5.pdf