Abstract
As AI technologies and large language models (LLMs) mature, libraries are experimenting with how these tools can improve and augment library operations, particularly classification and cataloging. This study investigates how well three LLMs, ChatGPT-4o, DeepSeek, and Gemini 2.0, perform library classification using the Dewey Decimal Classification (DDC) scheme. Previous studies have evaluated various AI models for classification and cataloging, mostly through content analysis or similarity measures, which offer limited insight into where, how, and to what extent these models make errors. Our study develops a hierarchical evaluation scale that respects the structural characteristics of the DDC system. We tested the selected models on a dataset of 110 book titles spanning all main classes of the DDC, with expert-assigned numbers as the benchmark. Models were evaluated for accuracy, mismatch distribution, direction of misclassification, and cross-model compensation of error, addressing a gap in the existing body of knowledge. The results indicate that all three models handle the broader levels of classification well, particularly up to the second and third digits. DeepSeek performed best overall, with an average match score of 56.43 out of 100, followed by ChatGPT-4o (51.82) and Gemini 2.0 (45.73), which produced the most variable outcomes of the three. Most errors occur at the section (third digit) and early decimal levels, indicating that such granular distinctions demand contextual understanding beyond current model capabilities. Misclassifications at the main-class level were rare (ChatGPT-4o: 9.09%; DeepSeek: 0.91%; Gemini 2.0: 8.18%). The cross-model compensation matrix revealed that the models' strengths differ across the hierarchical bins: DeepSeek excelled at broader-level classification, while ChatGPT-4o performed better at granular-level classification, which suggests potential for hierarchy-aware model combinations for this task.
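The abstract does not spell out the evaluation scale itself, so the following is only an illustrative sketch of the general idea of locating where in the DDC hierarchy a predicted number first diverges from the expert-assigned one. The function name `first_mismatch_level` and the level labels are assumptions made for this example and are not taken from the study.

```python
def first_mismatch_level(predicted: str, expected: str) -> str:
    """Return the hierarchical level at which two DDC numbers first diverge.

    Hypothetical helper for illustration; not the scoring scale used in the study.
    """
    # Keep digits only, so "512.5" is compared as "5125".
    p = "".join(ch for ch in predicted if ch.isdigit())
    e = "".join(ch for ch in expected if ch.isdigit())

    # The first three digits of a DDC number are main class, division, section.
    levels = ["main class", "division", "section"]

    for i in range(max(len(p), len(e))):
        # A missing digit on either side counts as a mismatch at that position.
        if i >= len(p) or i >= len(e) or p[i] != e[i]:
            return levels[i] if i < 3 else f"decimal {i - 2}"
    return "exact match"


if __name__ == "__main__":
    print(first_mismatch_level("515", "512"))      # diverges at the third digit
    print(first_mismatch_level("512.7", "512.5"))  # diverges at the first decimal
```

Tallying these levels over a test set would give a mismatch distribution of the kind the abstract reports, although the study's own binning and scoring may differ.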
