Abstract
The study investigated the feasibility and reliability of using generative artificial intelligence to conduct heuristic evaluations of workplace instructions. A custom GPT model, fine-tuned with examples and heuristic criteria, was tasked with evaluating aerospace-based work instructions. The AI's output included identified weaknesses and suggested improvements, heuristic scores, and rationales for its decisions and analyses. Results showed that the AI's scoring was consistent and reproducible, but its agreement with human experts was low, largely because of high variability among the human evaluators themselves. Qualitative analysis nonetheless confirmed the AI's ability to identify common weaknesses and offer relevant feedback, in some cases outperforming individual human evaluators. Finally, although the AI generally provided adequate explanations, some lacked detail. The research demonstrates the potential of AI-driven heuristic evaluation to streamline assessment processes and augment human analysis in high-risk industries, while acknowledging the need for ongoing model refinement and improved transparency.
