Abstract
Social work scholars have long questioned the validity and utility of the Association of Social Work Boards (ASWB) licensing exams. Data released in 2022 revealed severe disparities in pass rates based on race, age, and language, exacerbating these concerns. In this paper, we explore the potential of generative artificial intelligence (AI) such as ChatGPT to address core problems of the ASWB exams, including the use of a multiple-choice format that does not reflect real-world social work practice. To assess its social work reasoning, we used ChatGPT to answer ASWB-developed practice questions for the Bachelors, Masters, and Clinical exams. ChatGPT scored 76%, 80%, and 64%, respectively, and identified additional validity challenges. Based on this performance, we provide a proof-of-concept for how generative AI might move us toward a more valid and equitable exam. While we strongly support licensure requirements, state regulators and legislators should temporarily suspend the use of the ASWB exams for this purpose.
The emergence of generative artificial intelligence (AI) technologies and large language models (LLMs), such as ChatGPT, the new Bing, and Bard, is reshaping how we interact with information and communicate with others. Previous AI technologies have received some attention in social work, primarily in research settings for information extraction (e.g., Frey et al., 2020; Perron et al., 2019; Victor et al., 2021) and simulation-based clinical and job training (e.g., Asakura et al., 2020; Smith et al., 2023). Generative AI represents a substantial leap forward and opens new and far-reaching opportunities for social work research, education, and practice.
Recent headlines of ChatGPT passing high-stakes medical, business, and legal exams (Hammer, 2023; Kelly, 2023) inspired our decision to evaluate its performance on the social work licensing exams. We think that generative AI can offer additional insights into recent discussions of whether the profession's reliance on licensing exams developed by the Association of Social Work Boards (ASWB) is appropriate, effective, and ethical for their intended purpose (Apgar & Nienow, 2023; DeCarlo, 2022; NASW, 2023; Nienow et al., 2023). To set the stage for this discussion, we first provide a plain language overview of generative AI and the ASWB exams. We then detail our evaluation of ChatGPT's performance on ASWB-developed practice exam questions. We discuss critical validity issues that our evaluation helped reveal. Combined with prior research, we believe our findings indicate that the problems of the ASWB exams do not justify continued use, and the emergence of generative AI models offers a potential solution for creating innovative and equitable procedures for evaluating the preparedness of social workers to practice ethically and safely.
Generative AI Models
Whereas traditional AI technologies rely on rules-based programming to understand and respond to user input, generative AI models use a combination of deep learning and natural language processing to understand the nuances of human conversation and respond with natural, human-like language. LLMs are a specific type of generative AI trained on massive datasets of text, such as books, articles, and web pages. These models use deep learning techniques to analyze and understand the patterns and structures of language, and are capable of generating human-like text that can be used for a wide range of applications, including language translation, chatbots, summarization, content creation, and text classification. Other tasks relevant to our purposes here are question answering and evaluating the quality of written responses.
In this evaluation, we use ChatGPT, an LLM currently taking the world by storm. Developed by OpenAI (2022), a research company based in San Francisco, ChatGPT uses advanced natural language processing techniques to understand the user's input and generate human-like responses. The model was trained on a large amount of conversation data, allowing it to understand the context of conversations across various topics. The primary use case for ChatGPT is to build chatbots and other conversational AI applications that can be integrated into websites, mobile apps, and messaging platforms, giving the end user quick and convenient access to information and task automation. ChatGPT has made history as the fastest-growing consumer application, reaching a staggering 100 million users within just two months of its launch (Hu, 2023). ChatGPT's popularity stems from its ease of use, the potential to integrate with a wide range of platforms and devices, and the ability to generate high-quality text.
ASWB Licensing Exams
Although we see considerable opportunities for the use (and misuse) of generative AI and LLMs across social work research, teaching, and practice, we are limiting our focus to the profession's current licensing exams. Since 1979 the ASWB has created a set of exams that are now mandatory for social workers seeking licensure at various levels throughout the United States (ASWB, 2022a). In our home state of Michigan, for instance, clinical licensure applicants are obligated by law to take the ASWB clinical exam. Those seeking a macro social work license must take the ASWB advanced generalist exam. And those seeking a Bachelor's license must take the ASWB bachelors exam. Combined with several thousand supervised hours, these exams are intended to ensure that social workers can practice ethically and safely as fully licensed professionals.
The debut of ChatGPT and other generative AI models coincides with an essential social work debate about the practical and ethical implications of ASWB exams on the profession. After many years of educators and practitioners calling for transparency from ASWB around test construction and pass rates (Albright & Thyer, 2010; DeCarlo, 2022; Thyer, 2011), the ASWB released exam-related data in August 2022 (ASWB, 2022b). The report revealed severe disparities in exam success rates across race, age, and first language along with smaller disparities based on gender. In response, the National Association of Social Workers (NASW), the profession's largest organizing body, has publicly opposed using ASWB exams as a requirement for professional social work licensure (NASW, 2023). ASWB has acknowledged these concerns, stating that the “discrepancies seen in the data are unacceptable” (para. 2) and that the organization is “looking at ways that candidates can demonstrate competence beyond the use of a multiple-choice examination format” (ASWB, 2023a, para. 6).
As educators engaged in preparing social work professionals, we share concerns about the ASWB licensing exams articulated by the NASW. We believe that new approaches are needed to ensure that workers are ready to practice safely and ethically at the time of licensure. Here we explore whether generative AI technologies might be useful in helping our profession to develop those new approaches in ways that are fair and equitable.
ChatGPT as Test Taker
We started this exploration by assessing the performance of ChatGPT on a set of ASWB practice exam questions. Since the official exams were not accessible for our purpose, we obtained a bank of practice questions and answer keys developed and sold by the ASWB (2023b). We assumed these questions to represent the actual exam questions, thereby serving as a proxy for the exam itself. We obtained a total of 150 practice questions, 50 each for the three most common versions of the exam: Bachelors, Masters, and Clinical.
Our objective in having ChatGPT answer these practice questions was two-fold. First, we wanted to assess the capacity of ChatGPT to perform tasks that relied on an ability to recognize text patterns relevant to social work practice. If ChatGPT cannot recognize social work-related patterns, it lacks exposure to sufficient relevant text documents and is not suitable for developing alternatives to the current ASWB format. Our second goal was to compare ASWB answers with ChatGPT answers to detect any differences. This allows us to determine if the ASWB answers align with the texts that ChatGPT was trained on, revealing potential flaws in the ASWB exams, the generative AI model, or both.
ASWB exam questions typically start with a brief scenario, presented in 2–3 sentences, describing a specific practice-related challenge. The prompt that follows the scenario usually asks “What should the social worker do FIRST?” or “What should the social worker do NEXT?” The test taker then chooses the best response from four multiple-choice options. The correct response is determined by the ASWB exam developers.
We prompted ChatGPT to answer the ASWB practice questions with the following statement and question: “I would like to give you some different questions and scenarios. Can you select the correct response and provide a rationale for each one?” Please note that we did not provide any additional contextual information or guidance in the prompt, such as directing the model to consider the NASW Code of Ethics or best practices in social work. Our intention was to evaluate the model's zero-shot performance, that is, its ability to perform a task that it has not been explicitly trained to do. This gives us an estimate of the baseline, or out-of-the-box, performance of the model. After prompting ChatGPT, we input exam questions and response options exactly as they appeared in the question bank. We recorded ChatGPT's responses and respective rationales.
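The prompting procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PracticeItem` record, the `format_item` helper, and the A to D lettering are hypothetical, and the item shown is a placeholder because the ASWB practice questions cannot be reproduced. Only the opening statement is taken from the text.

```python
from dataclasses import dataclass

# Opening statement quoted in the text; no further guidance is added
# (for example, no instruction to consult the NASW Code of Ethics).
OPENING_PROMPT = (
    "I would like to give you some different questions and scenarios. "
    "Can you select the correct response and provide a rationale for each one?"
)


@dataclass
class PracticeItem:
    """Hypothetical record layout for one multiple-choice item."""
    stem: str
    options: list[str]


def format_item(item: PracticeItem) -> str:
    """Join the scenario and its four options into one message.

    The A-D lettering is an assumption made for this sketch.
    """
    letters = "ABCD"
    lines = [item.stem]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(item.options)]
    return "\n".join(lines)


# Placeholder item: the real ASWB text is deliberately not reproduced.
placeholder = PracticeItem(
    stem="[scenario text] What should the social worker do FIRST?",
    options=["[option 1]", "[option 2]", "[option 3]", "[option 4]"],
)

# The conversation opens with the prompt, followed by one message per item.
messages = [OPENING_PROMPT, format_item(placeholder)]
```

Keeping the opening statement free of task-specific guidance is what makes this a zero-shot setup: the model sees only the instruction and the item.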
ChatGPT Model Performance
We assessed performance with a simple count of the number of questions that the model answered correctly across the 50 items in each set of practice exam questions. While the passing score for actual exams varies by exam level and version, a score of 70% is an approximate passing score. ChatGPT has a high chance of passing the Bachelors exam (76% correct) and Masters exam (80% correct). The Clinical exam was close to the benchmark, with 64% correct responses.
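The scoring described above is simple arithmetic, sketched below. The correct counts are not reported directly; they are inferred from the reported percentages and the 50 items per set (for example, 76% of 50 is 38). The variable and function names are illustrative.

```python
ITEMS_PER_SET = 50
PASS_THRESHOLD = 0.70  # approximate passing score cited in the text

# Counts inferred from the reported percentages: 76%, 80%, and 64% of 50 items.
correct_counts = {"Bachelors": 38, "Masters": 40, "Clinical": 32}


def proportion_correct(correct: int, total: int = ITEMS_PER_SET) -> float:
    """Share of items answered in agreement with the ASWB answer key."""
    return correct / total


for exam, correct in correct_counts.items():
    share = proportion_correct(correct)
    status = "at or above" if share >= PASS_THRESHOLD else "below"
    print(f"{exam}: {correct}/{ITEMS_PER_SET} = {share:.0%} ({status} the 70% benchmark)")
```

On a 50-item set, the 70% benchmark corresponds to 35 correct answers, so the Clinical result of 32 is three items short.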
Our results approximate actual scores. Again, these are ASWB practice questions, and the official exams contain roughly three times as many questions as we tested. ChatGPT is a probabilistic model that can generate different results. Therefore, it is possible to observe differences in the answers provided by ChatGPT, especially for questions that are ambiguous or have subjective interpretations.
Content and Quality of ChatGPT Decision-Making Rationales
Below we offer a few examples of the rationales provided by the model to give the reader a sense of ChatGPT's ability to recognize social work-related text patterns and generate rationales aligned with safe and ethical practice. For copyright reasons, we cannot provide the actual ASWB practice questions or the rationales for the answers.
One exam question asks what to do in the event that an interpreter needed to cancel an upcoming appointment with a social worker and a deaf client. ChatGPT responded with the correct answer as defined by ASWB. ChatGPT also provided the following rationale, explaining why this answer was selected and why the other choices were excluded. This example is illustrative and representative of the quality and content of ChatGPT's rationales observed throughout the evaluation.
“In this scenario, the social worker should communicate the need to set up another appointment time because the sign language interpreter has become ill and cannot be present at the session. It is important for the social worker to ensure that the client is able to effectively communicate and understand the information being discussed during the session. The use of a sign language interpreter is necessary for this to occur, and the absence of the interpreter means that the session cannot proceed as planned. The social worker should explain the situation to the client and arrange for another appointment time when the interpreter is available. Attempting the session anyway or providing the client with reading material about the topic to be discussed are not appropriate solutions, as they do not address the client's communication needs. While it may be helpful to ask the client for names of alternative interpreters, the social worker should first communicate the need to set up another appointment time to ensure that the client's needs are met in a timely and appropriate manner.”
A second rationale, given for a question about developing a tool to assess the characteristics of clients who are successful in maintaining sobriety, read as follows:
“Before developing a new tool, the social worker should determine if any existing tools or instruments can be used to achieve the desired outcome of assessing the characteristics of clients who are successful in maintaining sobriety. This will help the social worker to avoid reinventing the wheel and will also ensure that the assessment tool is reliable, valid, and consistent with best practices in the field. Additionally, it may be more cost-effective to utilize an existing tool, as the development of a new tool can be time-consuming and resource-intensive.”
ChatGPT gave several reasons to support its decision. Furthermore, these reasons were communicated in a simple language that is likely to be comprehensible to a wide audience.
For us, the performance of ChatGPT on these questions and many others provided strong evidence that ChatGPT can recognize social work-related text patterns, consider contextual information, and generate decision-making rationales while accounting for safety and ethics in social work practice. Given this performance, we then proceeded to our second objective, which involved comparing discrepancies between ChatGPT's answers and the ASWB answer key. The comparison indicated that ChatGPT performed better than or equal to ASWB on several discrepant items, which brings up validity concerns that we will explain next.
Validity Challenges of the ASWB Exams
Our approach to evaluating ChatGPT's performance assumed that the ASWB-developed answers to the test questions were the gold standard. However, in our comparative analysis, we found several instances in which ChatGPT provided a discrepant answer but a compelling and sometimes better rationale than what is provided by the ASWB. Because LLMs like ChatGPT are based on vast collections of text data, these discrepancies suggest that some of the ASWB answers to the test questions may not be the best or correct response. These validity concerns also suggest that our evaluation is likely underestimating the actual performance of ChatGPT.
Our validity concerns are consistent with those raised (but not resolved) more than a decade ago (see Albright & Thyer, 2010). We identified three major question types related to these concerns.
None of the Above
The format of the ASWB exam involves providing 2–3 sentences of context before asking test takers to decide what the social worker should “do NEXT” or “do FIRST.” This format assumes that the social worker has been provided with enough relevant details and that the sole appropriate action is listed among the four possible responses. One of the questions presented to ChatGPT began with a scenario in which a child was in foster care. The father was incarcerated and the mother was in residential treatment for alcohol misuse. The scenario also indicated that the child was small for his age with immature speech, had temper tantrums, and experienced difficulties completing school work. The question asks what the social worker should do “FIRST.” The details presented here represent the entire collection of details presented in the question.
For this question, ChatGPT did not select any of the four responses provided (screen for fetal alcohol syndrome, develop a behavior modification plan, refer to special education, or pursue family reunification) but recommended the following course of action:
“The FIRST action for the social worker to take is to gather more information and assess the child's needs. This may involve conducting a comprehensive assessment of the child's developmental, behavioral, and educational needs, as well as any past trauma or neglect. Based on this assessment, the social worker can then determine the appropriate interventions, such as referral for a medical evaluation or special education services, working with the foster parents on a behavior plan, or working with the biological mother on reunification if appropriate. However, referring the child for assessment for fetal alcohol syndrome may be a priority if there is any indication that the child may have been exposed to alcohol in utero, but this would depend on the results of the comprehensive assessment.”
Empirically Unsupported Test Items
Another instance in which ChatGPT provided a different answer than the ASWB developers related to a question about the most likely emotional response among adult children when learning of a parent's Alzheimer's diagnosis. Choosing among the available responses, ChatGPT suggested that depression was likely to be the most common response, while also hedging its response by acknowledging that the children would be “likely to demonstrate a range of emotional reactions.” Conversely, the ASWB rationale asserted that “denial” was the most likely emotional response based on the Stages of Grief model developed by Elisabeth Kübler-Ross. See Kübler-Ross (1969) and Kübler-Ross & Kessler (2005) for a model overview.
A major validity concern here is that the ASWB's correct answer does not have sufficient empirical support to inform practice (Bonanno et al., 2002; McVean, 2019). In fact, health researchers have cautioned against the continued use of the Stages of Grief model in anticipating outcomes and counseling those experiencing grief (Stroebe et al., 2017). Again, this example question also shows that we may be underestimating the actual performance of ChatGPT by assuming that the ASWB answer key is the gold standard.
Ambiguity of Multiple Choice Format
The expressed objective of the ASWB exam is to ensure social workers are ready to practice ethically and safely. Our evaluation of ASWB practice questions using ChatGPT supports the move away from a multiple-choice format on a social work licensing exam, a move the ASWB itself has suggested might be necessary.
Consider an ASWB practice exam question related to domestic violence. The question begins with a scenario in which a client at a domestic violence shelter states intentions to re-engage with an abusive partner but is concerned the social worker will express disapproval. The question directs the test taker to select the BEST way to demonstrate acceptance of the client's choice (italics added for emphasis). The correct response according to ASWB was to encourage the client to further discuss their decision. The ASWB answer key explains this decision is consistent with the ethical principle of self-determination without any further explanation. ChatGPT selected a different response – that is, allow the client to continue guiding the conversation. ChatGPT's rationale was as follows:
“In this scenario, it is important for the social worker to demonstrate acceptance of the client's choice, even if the choice is to return to an abusive partner. The client is concerned that the social worker will not approve of the decision, which may create further barriers to the client's engagement in the process. Allowing the client to direct the conversation and to express their thoughts and feelings can help to build trust and rapport between the client and the social worker. This can increase the likelihood that the client will be open to receiving support and services, regardless of their decision. The social worker can also outline concerns about the risk of returning to the abusive partner, but it is important to do so in a non-judgmental and supportive manner. The goal is to empower the client to make informed decisions about their safety and well-being. Referring the client to other services and encouraging the client to discuss the decision can also be helpful, but it is important to prioritize allowing the client to direct the conversation.”
Potential Opportunities for Dynamic Testing
Our evaluation reveals that generative AI models like ChatGPT show an excellent ability to recognize social work-related text patterns and offer high-quality rationales for scenario-based decision-making in the absence of any training for this particular task. Even though generative AI technologies like ChatGPT and Bard are still early in their development, we think they have shown considerable promise for creating dynamic testing as a replacement for traditional static standardized exams. However, because generative AI models are just in their infancy and still prone to frequent errors, we are not offering a specific course of action at this point. Instead, we offer a proof of concept with the existing technology that we view as a starting point for exploring and anticipating future possibilities.
We specifically explored whether ChatGPT might be able to assist in moving our licensing exam away from a multiple-choice format. A licensing exam designed to ensure ethical and safe social work practice will need to account for the various courses of safe and ethical action that could be taken in most professional decision-making scenarios. A format in which test takers provide brief written responses to presented decision-making scenarios would better account for this reality. Yet unlike a multiple choice format, an exam with short written responses could be cost-prohibitive given the time and resources required to manually grade. That is, unless we could automate the grading process in some way.
As an initial proof-of-concept, we provided ChatGPT with the following prompt and question:
“I would like to provide you with a decision-making scenario and a suggested response. Can you tell me if the suggested response is ethical and safe from a professional social work perspective? You are working with a 25-year-old client who has a history of substance abuse and has been sober for six months. The client reports feeling lonely and disconnected from others since most of their time spent with friends was at the bar. The client thinks that dating someone might be a good solution but that it's complicated. They disclose that they are attracted to people of the same gender but have not yet come out to their family or friends. The client is anxious about how their disclosure will be received by others but is eager to start dating. What should be your next step as the therapist?”
We then proposed three possible responses. We considered the first two options to be ethical and safe, and the third option to be both unethical and unsafe for the client. Please note that in the following excerpts, only the first paragraph from ChatGPT's answer is provided. This selected paragraph shows the model's decision to help determine whether the proposed response is appropriate or not.
The first suggested action submitted to ChatGPT was to “affirm the dignity and worth of the client and explore strategies for disclosure of their sexual orientation.” ChatGPT responded:
“Based on the scenario you provided, the suggested response of affirming the client's dignity and worth while exploring different strategies for disclosing their sexual orientation appears to be an ethical and safe response from a professional social work perspective. Referring the client to a support group for those exploring their sexuality could be a helpful next step as a therapist, as it can provide the client with a safe and supportive space to explore their sexual orientation, connect with others who may be going through similar experiences, and receive additional resources and support.”
The third course of action submitted to ChatGPT was to “refer the client to conversion therapy.” ChatGPT responded:
“Conversion therapy is not an appropriate or ethical response to this scenario. Conversion therapy is a harmful and discredited practice that seeks to change an individual's sexual orientation or gender identity. It has been widely disavowed by professional organizations, including the National Association of Social Workers, the American Psychological Association, and the American Medical Association, due to its lack of scientific evidence and the harm it can cause to individuals.”
ChatGPT was able to consider the contextual factors presented in our scenario and—from our perspective—correctly classify the proposed responses on dimensions of social work ethics and safety. Extensive testing and fine-tuning of the generative AI model would be needed prior to any deployment within the licensing process. However, the out-of-the-box performance of ChatGPT on this task suggests that it might be possible to automate the grading of an exam format centered on brief written responses.
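To make the idea of automated grading concrete, the sketch below shows one way such a pipeline might be wired together. It is a hypothetical illustration, not the procedure used in this evaluation: `build_grading_request`, `classify_verdict`, `grade`, and `stub_model` are invented names, the keyword screen is a deliberately crude stand-in for a properly validated classifier, and the stub returns opening sentences from the replies quoted above instead of calling a real model.

```python
from typing import Callable

# Instruction quoted from the proof-of-concept prompt in the text.
GRADING_PROMPT = (
    "I would like to provide you with a decision-making scenario and a "
    "suggested response. Can you tell me if the suggested response is "
    "ethical and safe from a professional social work perspective?"
)


def build_grading_request(scenario: str, response: str) -> str:
    """Combine the instruction, the scenario, and the test taker's answer."""
    return f"{GRADING_PROMPT}\n\nScenario: {scenario}\n\nSuggested response: {response}"


def classify_verdict(model_reply: str) -> str:
    """Crude keyword screen (an assumption of this sketch).

    Negative phrasings are checked first; anything unclear is routed
    to a human reviewer instead of being scored automatically.
    """
    text = model_reply.lower()
    if "not an appropriate" in text or "unethical" in text or "not ethical" in text:
        return "flagged"
    if "ethical and safe" in text:
        return "acceptable"
    return "needs human review"


def grade(scenario: str, response: str, ask_model: Callable[[str], str]) -> str:
    """Send one written response to the model and classify its reply."""
    return classify_verdict(ask_model(build_grading_request(scenario, response)))


def stub_model(request: str) -> str:
    """Stand-in for a real model call, echoing replies quoted in the text."""
    if "conversion therapy" in request.lower():
        return "Conversion therapy is not an appropriate or ethical response to this scenario."
    return "The suggested response appears to be an ethical and safe response."


print(grade("[scenario]", "refer the client to conversion therapy", stub_model))
print(grade("[scenario]", "affirm the dignity and worth of the client", stub_model))
```

Routing unclear replies to a human reviewer reflects the caution expressed above: any real deployment would need extensive testing, and a keyword screen like this one would not be adequate on its own.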
Moving Forward
Generative AI models such as ChatGPT represent a technological breakthrough that demands the attention of social work professionals. We evaluated ChatGPT within the context of the ASWB licensing exams, which is a timely contribution given the significant race, age, and language disparities recently disclosed by ASWB (2022b). We think that generative AI models in general and LLMs specifically hold the potential to support safe and ethical social work decision-making in a variety of ways. At the same time, we recognize this is an entirely new era of rapidly growing technologies, which necessarily requires the field to be cautious moving forward. Importantly, we want to be clear that we see generative AI models as tools that can help social workers, but we do not think these tools can replace social workers.
As for the current debate concerning the role of the ASWB exams in professional social work licensing, the exact solution is not yet known. What seems clear to us, however, is that licensing exams with serious validity issues that disproportionately exclude workers from the profession on the basis of race, age, and language are not the path forward. Our current evaluation reveals further problems with the exams that cannot be ignored. While we strongly support licensure and title protection for those who graduate from CSWE-accredited schools, we believe state legislatures and regulatory bodies should suspend passage of an ASWB exam as a requirement of social work licensure until an equitable and valid solution is in place. Certainly, public safety is the concern of state regulators, but historically we also know that when states initially legislated licensure, many practicing social workers were ‘grandparented’ into the profession if they could demonstrate supervised practice which did not jeopardize safety. Temporarily suspending the testing requirement will instead rein in the big and lucrative business of the licensing exams and preparation courses, allowing the field to redirect intellectual and financial resources to create new, valid processes for ensuring social workers are prepared to practice ethically and safely. Time will tell, but generative AI models like ChatGPT and those sure to follow might well play a pivotal role in building those solutions for our profession.
Acknowledgments
The authors would like to thank Drs. Suzanne Brown and Carolyn Joy Dayton at the Wayne State University School of Social Work for their consultation on best practices for clinical social work.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
