Abstract
Background
Limited access to reliable information is a critical challenge in occupational health. With over 180 million users, ChatGPT has become a prominent tool, swiftly answering a wide array of queries, yet its use in occupational health still requires formal validation.
Objective
This study evaluated GPT-3.5 (free version) and GPT-4 (paid version) on their ability to respond to Occupational Risk Prevention formal multiple-choice questions.
Methods
A total of 303 questions were assessed, categorized into four levels of complexity (task-specific, national, European, and global) and drawn from examinations across various Spanish regions.
Results
GPT-3.5 achieved an overall accuracy of 56.8%, while GPT-4 reached 73.9% (p < 0.001). GPT-3.5 showed particularly limited performance on domain-specific content. Both models shared similar error patterns, with incorrect response rates ranging from 18% to 24% across regions.
Conclusion
Despite GPT-4's improved performance, both models display notable limitations in occupational health applications. To enhance reliability, four strategies are proposed: formal validation, continuous training, error analysis, and regional adaptation.
