Automatic theme and motif identification in large-scale English literary corpora using deep learning approaches

Abstract

The identification of themes and motifs in literary texts is a fundamental aspect of literary analysis, traditionally performed through manual annotation and expert interpretation. However, the increasing availability of large-scale English literary corpora presents new challenges and opportunities for automated analysis. This paper proposes a deep learning (DL)-based framework for automatically detecting themes and motifs in extensive literary collections. The dataset comprises diverse sources, including classic literature, modern fiction, and poetry, ensuring a broad representation of thematic structures. A rigorous preprocessing pipeline is applied, involving stop word removal and tokenization to refine textual data. For feature extraction, Word2Vec is utilized to capture semantic relationships between words. The core novelty of this research lies in the implementation of a Duelist Algorithm-optimized Bi-directional Long Short-Term Memory (DAO-BiLSTM) model, which enhances the model’s ability to detect and classify recurring thematic elements with high accuracy. The proposed method achieves an accuracy of 96.24%, recall of 97.32%, precision of 95.6%, and an F1-score of 94.7%, demonstrating superior performance over existing methods. The model is implemented in Python 3.9 using TensorFlow in a high-performance computing environment, ensuring efficient processing of large-scale textual data. Experimental results illustrate the effectiveness of the proposed approach in identifying complex motifs and themes across various literary genres. These findings highlight the potential of DL in augmenting literary analysis, enabling large-scale, data-driven thematic exploration that complements traditional human-driven methodologies.
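The preprocessing step named in the abstract (tokenization followed by stop word removal) can be illustrated with a minimal sketch. This is not the authors' pipeline. The stop word list, the token pattern, and the function name `preprocess` are assumptions made for this example, since the abstract does not say which tokenizer or stop list was used. The resulting token lists are the kind of input a Word2Vec model is typically trained on.

```python
import re

# Hypothetical, deliberately tiny stop word list; the abstract does not
# name the list that was actually used.
STOP_WORDS = frozenset(
    {"a", "an", "and", "in", "is", "it", "of", "the", "to", "was"}
)

# Assumed token pattern: runs of ASCII letters, optionally followed by an
# apostrophe suffix (e.g. "whale's"). Digits and punctuation are dropped.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, tokenize it, and remove stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [tok for tok in tokens if tok not in stop_words]


sample = "It was the best of times, it was the worst of times."
print(preprocess(sample))  # ['best', 'times', 'worst', 'times']
```

A production pipeline would more plausibly rely on an established tokenizer and a full stop word list. The regular expression here only keeps the example self-contained.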

Keywords

theme and motif identification; English literary corpora; deep learning; Duelist Algorithm Optimized Bi-Directional Long Short-Term Memory (DAO-Bi-LSTM)
