Abstract
The quality of safety procedures in generative AI-enhanced work instructions was empirically assessed against established work instruction authoring criteria. Three operational tasks from manufacturing, baggage handling, and spaceflight operations were selected. The generative AI was then provided with a set of 75 biomechanical safety requirements common to these settings and prompted to produce work instructions that provided safety requirements and recommendations for each step. Four expert evaluators then rated the improved work instructions against common criteria for their adherence to safety standards, clarity, and practical applicability. Results show that generative AI can significantly improve safety procedures in work instructions by identifying potential hazards and recommending protective equipment or other mitigation approaches. However, the quality of these safety procedures can still be improved significantly.
Introduction
Adequate work instruction (WI) development is important to ensure quality, prevent unnecessary monetary loss, and ensure workplace safety. The integration of large language models (LLMs) into operational environments presents a novel avenue for enhancing both efficiency and safety, including the reduction of injury risk in sectors where manual handling tasks are prevalent. The present project aimed to explore and validate the efficacy of generative AI in improving WI authoring with a focus on biomechanical safety. The primary objective of the project was to investigate the capability of LLM-based generative AI, such as OpenAI’s ChatGPT and Google’s Gemini, to develop WI that comprehensively integrate biomechanical safety requirements into operational tasks. These tasks often involve substantial manual handling, such as lifting, pushing, or pulling. The project aimed to address the gaps identified in current practices, which increase the risk of manual handling injuries and compromise operational safety.
Work Instructions
In the context of this paper, “work instruction” includes any available instruction on how to perform job procedures. WI support personnel in completing their job tasks successfully, help mitigate safety risks inherent to the operation, and align tasks with corporate guidelines, established standards, and quality goals (Novatsis & Skilling, 2016). WI can vary significantly in format. Some instructions follow the common step-by-step list format, while other WI may be presented as a video that staff can watch on demand. Improperly created WI can lead not only to task completion errors or failures, but also to increased operational safety risks. While it is well known that personal motivation factors influence an individual’s safety mindset regarding WI (Cornelissen et al., 2014), it is still important to focus on the proper design of adequate WI that include all necessary safety-risk-related elements. However, when developing WI, authors often forget to include all required safety equipment and techniques, leading to the potential for increased risk to operational safety.
Common Problems With Work Instructions
In 2022, 2.8 million nonfatal workplace injuries and illnesses and 5,486 fatal injuries were reported in the United States (U.S. Bureau of Labor Statistics, 2023). The most common causes included exposure to harmful substances or environments, overexertion and biomechanical reactions, and slips, trips, and falls (National Safety Council, 2024). In job tasks, especially those which are non-routine or where staff expertise is limited, a primary way to prevent and control hazards is through the development of procedures which account for methodologically identified hazards and controls (Occupational Safety and Health Administration, 2024). However, procedures often fail to be developed, implemented, or utilized adequately.
Common reasons why procedures fail include inaccurate or incomplete procedures, poor procedural presentation, and poorly written instructions (Novatsis & Skilling, 2016). With regard to safety risk, missing hazards, protective equipment information, or safe procedures could result in employees being ill-prepared for known or preventable hazards (Read, 2023). Poor procedural presentation can negatively impact both the ability of staff to properly and safely carry out a task (Novatsis & Skilling, 2016) and the goal of the procedure itself (Eiriksdottir & Catrambone, 2011). Finally, poorly written WI can reduce the impact of work procedures in reducing human error probability (HSE, 1999). Finding techniques that improve the quality of WI and address these common reasons for procedural failure should improve workplace safety.
Generative AI as a Solution
The emergent popularity of generative AI provides an easily accessible and promising avenue by which WI designers can reduce errors in WI creation. Generative AI uses natural language processing (NLP) and LLMs to generate complex output based on user prompts. Since late 2022, generative AI chatbots such as OpenAI’s ChatGPT and Google’s Gemini (formerly Bard) have increased in popularity, with 65% of organizations adopting generative AI (McKinsey & Company, 2024). The use of AI or algorithms to support the creation of work procedures is nothing new (e.g., Patalas-Maliszewska & Halikowski, 2019), but the rise of consumer generative AI has made complex AI capabilities accessible. Among the ways in which AI can enhance work outcomes are improvements to business functions, providing knowledge when needed, and reducing the number of errors or mistakes related to human reliability (Ramachandran et al., 2022). In the context of producing WI, generative AI can improve safety by ensuring that all relevant safety knowledge required is included. For example, a manager creating WI for offloading baggage from a plane can provide the completed instruction to generative AI, where it can be vetted against a database of safety standards and procedures. The manager can then approve or disapprove of the AI provided improvements to the WI.
Multiple techniques can be employed to leverage generative AI for WI improvements. In its most basic implementation, a work procedure can be provided to generative AI with a simple prompt asking the AI to improve the safety aspects of the instruction using commonly implemented safety standards and procedures, while considering the task at hand. However, such an approach may fail to consider that safety considerations may depend on the task, industry, profession, or other factors. A better approach is to provide the generative AI with a comprehensive database of identified hazards, established controls, and approved standards that an organization must abide by, with instructions to use the database when considering improvements to the safety procedures in the WI. This database approach would allow companies to use internal generative AI LLMs, reducing privacy, intellectual property, and security risks. However, there is still insufficient literature demonstrating the efficacy of using generative AI for this purpose.
The Present Project
The present project examined the ability of generative AI to improve WI by incorporating safety elements derived from a database of safety risks, controls, and standards. Workplace procedures and a safety database were provided to OpenAI’s ChatGPT. After the AI vetted the WI and created a finalized version that incorporated safety improvements, the WI were evaluated using a heuristics-based approach.
Method
To evaluate the effectiveness of generative AI-supplemented WI, four professionals examined AI-generated instructions against a set of heuristics.
Procedure
First, 75 workplace safety rules were created for the generative AI to follow when evaluating the WI. The database was a spreadsheet with three columns describing when the rule applies, what safety procedures must be followed, and the industry standard or company policy from which the safety requirement is derived. The rules covered general requirements for personal protective equipment (PPE), as well as walking/ground-based, chemical, working at height, and biomechanical motion hazards.
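The three-column rule database described above can be illustrated with a minimal sketch. The field names and the two example rows are hypothetical placeholders and are not taken from the study's 75-rule database; the sketch only shows one plausible way to hold such rules and serialize them to a spreadsheet-style CSV.

```python
import csv
import io
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SafetyRule:
    applicability: str  # when the rule applies
    procedure: str      # what safety procedure must be followed
    source: str         # industry standard or company policy it derives from


# Hypothetical example rows for illustration only (not the study's rules).
EXAMPLE_RULES = [
    SafetyRule(
        applicability="Manual lifting of loads above a site-defined weight limit",
        procedure="Use a two-person lift or a mechanical lifting aid",
        source="Company manual handling policy (placeholder)",
    ),
    SafetyRule(
        applicability="Work performed above a site-defined height",
        procedure="Wear a fall-arrest harness attached to an approved anchor point",
        source="Company working-at-height policy (placeholder)",
    ),
]


def rules_to_csv(rules):
    """Serialize rules to CSV text with one header row and one row per rule."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["applicability", "procedure", "source"]
    )
    writer.writeheader()
    for rule in rules:
        writer.writerow(asdict(rule))
    return buffer.getvalue()
```

Keeping the source column alongside each rule means that any safety statement derived from the database can be traced back to the standard or policy that requires it.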
Then, work procedures were designed for three tasks: baggage loading into a Boeing 737, aircraft landing gear assembly, and a space satellite crane transfer. The work procedures and database were then provided in a prompt to ChatGPT 4o, with the AI instructed to create WI that included any relevant safety aspects that needed to be taken into account based on the work procedure and safety database provided.
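The prompting step can be sketched as simple text assembly. The prompt wording, the function name `build_prompt`, and the example steps below are assumptions made for illustration; the study's actual prompt is not reproduced in the text, and no model API call is shown.

```python
def build_prompt(task_name, procedure_steps, rules_csv):
    """Combine a work procedure and a safety database into one prompt string.

    The instruction wording is a hypothetical example, not the study's prompt.
    """
    numbered_steps = "\n".join(
        f"{number}. {step}" for number, step in enumerate(procedure_steps, start=1)
    )
    return (
        f"Task: {task_name}\n\n"
        "Work procedure:\n"
        f"{numbered_steps}\n\n"
        "Safety database (CSV):\n"
        f"{rules_csv.strip()}\n\n"
        "Instruction: Rewrite the work procedure as a work instruction. "
        "For each step, add every safety requirement from the database that "
        "applies, and cite the standard or policy it comes from."
    )


# Hypothetical usage with placeholder content.
example_prompt = build_prompt(
    "Baggage loading",
    ["Position the belt loader at the cargo door", "Lift bags onto the belt"],
    "applicability,procedure,source\nManual lifting,Use a lifting aid,Policy X\n",
)
```

Passing the database in the prompt itself, rather than relying on the model's general knowledge, is what lets the output be checked against an organization's own rules.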
Heuristic Assessment
The AI-generated WI were then evaluated using a heuristic analysis approach. Based on the work of Novatsis and Skilling (2016) and the Health and Safety Executive (HSE) (1999), a set of nine Heuristics for Work Instruction Design was established for our analysis. These heuristics are described in more detail in the Results section below. Each heuristic was scored on a scale of 0 to 10, for a maximum attainable score of 90 for each WI.
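The scoring arithmetic can be sketched as follows. The heuristic names are taken from the Results subsections of this paper; the function names and the validation behavior are illustrative assumptions rather than the procedure the evaluators actually used.

```python
# The nine heuristics, named after the Results subsections of this paper.
HEURISTICS = (
    "Format",
    "Comprehensibility",
    "Completeness",
    "Accuracy and Up-to-Date",
    "Usability",
    "Safety and Precautions",
    "Human Error Prevention",
    "Tracking and Feedback",
    "Simplicity and Efficiency",
)
MAX_PER_HEURISTIC = 10
MAX_TOTAL = MAX_PER_HEURISTIC * len(HEURISTICS)  # 9 heuristics x 10 points = 90


def total_score(ratings):
    """Sum one evaluator's ratings for one WI after validating them."""
    missing = [name for name in HEURISTICS if name not in ratings]
    if missing:
        raise ValueError(f"Missing ratings for: {missing}")
    for name in HEURISTICS:
        if not 0 <= ratings[name] <= MAX_PER_HEURISTIC:
            raise ValueError(f"Rating for {name!r} must be between 0 and 10")
    return sum(ratings[name] for name in HEURISTICS)


def mean_total(all_ratings):
    """Average the total score for one WI across several evaluators."""
    totals = [total_score(ratings) for ratings in all_ratings]
    return sum(totals) / len(totals)
```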
Evaluators
Evaluator 1 is a civil and industrial engineer with experience designing, implementing, and regulating procedures in manufacturing and airport settings. Evaluator 2 is a spaceflight operations human factors expert with experience evaluating spaceflight procedures for human error potential. Evaluator 3 is a spaceflight operations human factors professional with experience in evaluating spaceflight operations. Evaluator 4 is a human factors professional with experience designing and evaluating work procedures in construction, manufacturing, and spaceflight operations settings.
Results
Heuristic Analysis Results
Table 1 shows the average scores the evaluators assigned to the heuristics for each of the WI. Interrater reliability was fair for the satellite task (κ = .26) and slight for the landing gear task (κ = .16), and there was no agreement for the baggage handling task (κ = −.17). This may reflect a wide range of opinions about AI-assisted WI. However, while the scores showed variability, there were strong common themes in the strengths and weaknesses of each WI.
Table 1. Average Ratings Across All Evaluators for All Work Instructions.
LG = landing gear task; BH = baggage handling task; ST = satellite transfer task.
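The text does not state which kappa statistic was computed. As a sketch only, the block below assumes Fleiss' kappa, a common choice when more than two raters are involved, which treats ratings as nominal categories. The verbal labels follow the widely used Landis and Koch bands, with which the labels reported above ("fair", "slight", "no agreement") appear consistent; function names are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must have been rated by the same number of raters.
    """
    if not counts:
        raise ValueError("At least one rated item is required.")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("Every item needs the same number of ratings (>= 2).")
    n_categories = len(counts[0])

    # Proportion of all ratings that fall in each category.
    p_category = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement among raters for each item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_category)
    if p_expected == 1.0:
        raise ValueError("Kappa is undefined when all ratings share one category.")
    return (p_observed - p_expected) / (1 - p_expected)


def interpret_kappa(kappa):
    """Map kappa to a verbal label using the Landis and Koch bands."""
    if kappa < 0:
        return "no agreement"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"
```

Because heuristic scores on a 0 to 10 scale are ordinal, a nominal statistic such as this one counts a 7 versus an 8 as full disagreement, which may partly explain low kappa values even when evaluators broadly agree.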
Format
Adequate WI should be presented in a consistent and logical format that includes a purpose statement, hazards and safety requirements, equipment and tools needed, preconditions required, and reference documents (Novatsis & Skilling, 2016). The generative AI either maintained or improved the format of the instructions it was provided. Importantly, it added bold sections to emphasize key elements for users and organized safety aspects in a logical format.
Comprehensibility
WI should be presented using the language commonly used by the operators who will use the WI (Novatsis & Skilling, 2016). For the most part, the comprehensibility of the WI was excellent, including the safety procedures and requirements added by the generative AI. However, the landing gear WI was suboptimal in comprehensibility and this deficiency was not improved by the AI outside of the safety related aspects.
Completeness
For instructions to be effective, all necessary steps, equipment and materials, safety precautions, and PPE requirements should be listed (Health and Safety Executive [HSE], 1999). Missing items can lead to incomplete steps, confusion, or the use of incorrect and potentially erroneous parts. For example, incomplete battery management procedures were a contributing factor in the loss of a multi-million dollar robot at NASA’s Jet Propulsion Laboratory (National Aeronautics and Space Administration [NASA], 2018).
Across the generated WI, some of the PPE guidance the AI provided did not adequately explain how it applied to the work task presented. For example, “hearing protection to ensure clear communication” leaves the reader needing clarification. Similarly, while the AI significantly improved the safety aspects of the WI, some of the required PPE was missing from a few of the steps.
Accuracy and Up-to-Date
Inaccurate or outdated WI are a common reason why procedures fail (Novatsis & Skilling, 2016). All WI should be vetted and reviewed to confirm that they are accurate and current. The generative AI-provided WI were accurate and up-to-date. Additionally, the AI improved on the WI by adding relevant industry standards it was not provided with.
Usability
Usable WI improves output consistency, training, performance, and compliance, while reducing safety risks (Novatsis & Skilling, 2016). While research on instruction usability varies by task, studies show that WI which are designed to improve usability and are centered on the target user lead to increases in task efficiency and effectiveness, as well as reductions in human error (Allwood & Kalén, 1997).
Usability was rated high for all the WI with no raters pointing out any deficiencies. The generative AI improved usability of the procedures by using emphasis on critical steps, simplifying language, and providing small improvements in formatting.
Safety and Precautions
Adequately designed WI identify relevant hazards and include all necessary safety equipment and procedures (Health and Safety Executive [HSE], 1999). However, it is important that the safety procedures do not overwhelm the instructions and that only relevant hazards are included (Novatsis & Skilling, 2016).
The generative AI significantly improved safety and precaution information, identifying all but 2.5% of the steps that required the inclusion of safety procedures. However, there is still some room for improvement. For example, the generative AI generally omitted specifics on the PPE needed, including only the general name of the PPE and the standard it was required to adhere to. In other cases, the PPE was presented, but it was not tied to a hazard. Finally, there was one instance where the PPE was simply referred to as “protective clothing”.
Human Error Prevention
Human reliability, and conversely a reduction in human error, can be achieved by anticipating potential human errors and safety concerns and including mitigations into WI (Sharit, 1998). Effective WI identifies potential hazards and error points and influences the behavior of users by providing specific steps to avoid errors, highlighting potential error points, or demonstrating consequences of errors (Novatsis & Skilling, 2016).
Evaluators felt that the WI provided enough detail to improve human reliability in task performance. However, potential errors were not specifically listed at any moment. When a task has significant safety criticality, outlining such errors might improve operator situational awareness of potential hazards.
Tracking and Feedback
It is important for users of WI to have enough information to ensure their work is completed properly and for process stakeholders to know that procedures were completed successfully. Successful and documented completion of operational steps helps improve the effectiveness of WI at reducing risks and hazards (Yashar, n.d.). WI should provide users with expected outcomes, as well as buyoff processes to ensure it can be verified if steps were completed correctly.
Evaluators diverged significantly in their ratings for this heuristic. Two evaluators believed most of the WI provided enough detail with regard to feedback, but the landing gear task received low marks for its lack of feedback instructions. The other two evaluators felt that tracking of step buyoffs was nonexistent. The difference in ratings between evaluators may be due to the dual-construct nature of the heuristic and may indicate an opportunity to improve the heuristics themselves.
Simplicity and Efficiency
Designing procedures adequately requires skill in understanding the level of detail needed. WI should never be so detailed they become cumbersome, but they do need to include all the steps necessary for the process to be completed with little confusion (Stup, 2023). In essence, a good balance must be ensured to provide a low number of steps which improves efficiency, as well as enough detail to eliminate significant variation in process outcomes.
In general, evaluators gave high marks to the generative AI in the simplicity and efficiency category. However, in the landing gear task, the tasks presented were excessively simple and lacked some necessary detail to complete the task adequately. While the generative AI missed this improvement opportunity, the original pre-AI WI was also deficient.
Conclusion
The present research highlights the opportunity presented by generative AI toward improving WI. First, we aimed to explore the ability of generative AI to improve WI by detecting safety procedures needed and inserting them into the relevant steps. Second, we wanted to assess the quality of the resultant WI through a heuristic assessment using industry experts.
Our results show that commercially available generative AI can significantly improve WI by identifying and inserting necessary safety procedures. While the AI missed some instances, it was successful in 97.5% of potential instances and even added safety procedures which are required by OSHA but were not part of the provided rule set.
However, our study also shows that the quality of the safety precautions in the WI still requires significant improvement. Specifically, the generative AI needs to provide more thorough and understandable PPE statements. For example, while the AI was presented with the safety glasses’ common name, manufacturer, and industry standard, it always provided a generic “safety glasses” statement and the applicable standard, requiring the user to research which glasses meet the standard.
In conclusion, generative AI promises to improve WI significantly, especially if presented with advanced learning sets and rules. While there is significant room for improvement, the AI offers the potential of identifying safety risks which the procedure writer may miss. This will help reduce safety risks in WI.
Limitations
Two limitations were present in the study. First, given limitations on public access to WI, the three WI presented to the AI were created specifically for the study and no active industry WI was employed. Actual industry WI, especially in spaceflight operations and manufacturing, are typically much more specific and vetted through peer review. Second, while the AI prompt was improved over multiple iterations, a GPT model trained specifically on the production of work instructions (as is likely to be seen in a corporation) may significantly outperform OpenAI’s ChatGPT 4o, which, while powerful, is still a multi-purpose AI.
Future Directions
Although the present work presents generative AI-augmented WI, it is important to understand that the best output is achieved through an iterative human-AI team process. Reliability and safety are increased through strong contributions from both humans and AI (Shneiderman, 2021). Tasks should not be viewed as a dichotomy of what humans versus AI do better, but rather as the work of a synergistic team aligned toward a goal. Future research should evaluate WI that have been vetted by a human-AI team, specifically one in which the human improves the AI output after the AI looks for important safety aspects within the procedure. Such an application will likely produce optimal WI.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
