Abstract
This article explores the limits and risks of using artificial intelligence (AI) and large language models (LLMs) as tools to expedite and streamline empirical legal research and content analysis of cases. It emphasises the current risks and limits of AI for enabling socio-legal research and case-based analysis, including a lack of reproducibility, bias, hallucinations and inaccuracy. It concludes that, at present, legal researchers should be exceedingly cautious when utilising new tools and need to put in place stringent procedures to better evaluate the outputs of AI models.
Content analysis is often used as a research method for analysing case law, especially where doctrinal analysis is inappropriate or limited due to a relative absence of superior case law. However, content analysis and coding, if done manually, can be a time- and labour-intensive research method where there are many first instance judicial decisions. There is a temptation, then, as cases are increasingly published online, to use artificial intelligence (AI) and large language models (LLMs) as tools to expedite and streamline content analysis of cases.
This article explores the limits and risks of undertaking empirical legal research ‘creatively’ using AI and LLMs, with a particular focus on content analysis of cases. It argues for a need for caution before using AI tools in content analysis, flagging risks of hallucinations, bias and a lack of reproducibility. The second section considers how content analysis might be used as a method in legal research, and how it might complement doctrinal research. The article then considers how AI and LLMs might be used to expedite content analysis of cases. The fourth section considers the risks posed by using these automated tools, and the article ultimately concludes that content analysis should not currently be automated without stringent controls in place, as the risk of producing poor research is too high.
Content analysis of legal decisions: Possibilities and limits
Van Hoecke describes doctrinal research as an ‘empirical-hermeneutical discipline’: 1 ‘a mainly hermeneutic discipline, with also empirical, argumentative, logical and normative elements’. 2 For Van Hoecke, legal research is primarily focused on description and interpretation of the law, as well as systematisation or theory building. 3 Bogg, building on MacCormick, 4 sees doctrinal method as involving ‘rational reconstruction’ 5 to present disparate legal materials ‘as a rational, coherent, and systematic whole’. 6 Doctrinal analysis generally focuses on authoritative materials, such as superior court decisions and legislation; it is less easily undertaken in substantive areas with little authoritative case law, but many first instance judicial decisions (equality law, for example). The reasoning in first instance decisions is generally not legally binding on later decisions. At the same time, these lower-level decisions might offer critical insights into how the law is operating in practice: who is (and is not) bringing legal claims? How are those claims being resolved? Which legal rules, if any, might inhibit access to justice or the success of claims? These insights are critically important in areas of the law where most potential claims are not pursued via legal avenues 7 or are resolved via confidential conciliation, 8 as in equality law. Where case law is limited, and conciliated outcomes are confidential, first instance decisions become the key public artefact for examining the law as it operates in practice.
It is here that content analysis offers potential as a legal research method, as a critical complement to doctrinal analysis. Content analysis is particularly useful when scholars are seeking to examine themes or patterns in texts, 9 rather than to systematise the texts (as in doctrinal analysis). As Hall and Wright emphasise, the focus in content analysis is on the collective insights that can be gained from case law, not on individual decisions themselves. 10 It works best as a method when there are many judicial decisions, and all decisions have around the same weight (that is, are decided by courts at the same level in the judicial hierarchy). 11 For Hall and Wright, content analysis brings the rigour of social science methods to legal research. 12 For the authors, the benefits of content analysis are in producing ‘objective, falsifiable, and reproducible knowledge about what courts do and how and why they do it’. 13 Hall and Wright express concern, though, about the use of content analysis to try to predict the outcome of future cases: 14 the facts and reasoning in judicial decisions are necessarily limited, selective and biased, to build a narrative of coherence, 15 and are therefore not necessarily good predictors of later decisions. 16 Despite this, the burgeoning field of judicial analytics has developed a number of models for predicting the outcomes of future cases, although this scholarship is not without risks, including – potentially – of undermining the rule of law. 17
While content analysis typically involves coding features of decisions, this is often not a straightforward process. There is a need to be clear about coding categories and instructions. 18 For example, Hall and Wright speak to the complexity of coding case outcomes: ‘Defining what counts as a win or loss across a range of cases is not a simple matter. Appellate cases arise in a variety of procedural postures, involve multiple issues, and each issue can be resolved in several different ways. Case coding projects often have to devise complex categories to capture all the relevant detail.’ 19
Given this complexity, there is a need to ensure the reliability of coding; if coding is subjective or uncertain, there is a need to ensure, for example, that different people would code items consistently, so that the study is reproducible. 20 As Hall and Wright summarise: ‘Coding that primarily reflects the subjective, idiosyncratic interpretation of the particular individuals who read the cases or that has large elements of error or arbitrariness undermines the claim of replicability.’ 21
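To make this concrete, inter-coder reliability is commonly quantified with a statistic such as Cohen’s kappa, which measures agreement between two coders beyond what chance alone would produce. The following minimal sketch (using hypothetical outcome labels and data, not drawn from any study discussed here) shows the basic computation:

# A minimal sketch of inter-coder reliability using Cohen's kappa.
# The outcome labels and coded data below are hypothetical.
from sklearn.metrics import cohen_kappa_score

coder_a = ["win", "loss", "withdrawn", "loss", "partial", "loss"]
coder_b = ["win", "loss", "withdrawn", "partial", "partial", "loss"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement; 0 = chance-level

By convention, kappa values above roughly 0.8 are often treated as strong agreement, though acceptable thresholds vary across fields.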
Content analysis can identify trends and gaps in case law, which would not be evident in doctrinal research. For example, in my 2020 Sydney Law Review article, analysing all Australian age discrimination decisions relating to employment handed down until the end of 2017, using qualitative and quantitative content analysis, I found that most cases related to dismissal, and were brought by older, white men. Further, few claims were substantively successful (12 out of 108). 22 Content analysis, here, enabled me to identify systemic gaps in enforcement, particularly for older women and young people, and set the groundwork for empirical analysis on the barriers to justice for these cohorts. 23 In the UK, I once again analysed age discrimination decisions, focusing on Employment Tribunal decisions that are published online. 24 Using qualitative and quantitative content analysis, I mapped how most claims were withdrawn, 25 and few were successful. 26 Most claims were brought by older workers. 27 As in Australia, I found gaps in enforcement for young people and older women. 28 I also considered issues relating to delays in Employment Tribunals, as they emerged from the cases, and how those delays sat uneasily alongside the strict time limits imposed for filing a complaint. 29 This study revealed issues with the use of Employment Tribunal decisions as ‘data’: many decisions lacked full written reasons, 30 or did not clearly state that the age discrimination claim had been withdrawn (often claims simply disappeared). 31 This research was laborious: reading and coding the 1208 cases in the study took over a year 32 and, as academic workloads intensify, this sort of labour-intensive research may prove prohibitive. 33
Could content analysis be automated?
So, rather than spending years analysing decisions in this way, could the process be automated, to expedite content analysis? Could we use AI to streamline this research method? I used the Python programming language to pull data relating to Employment Tribunal decisions and add them to a spreadsheet for analysis, 34 but then coded the decisions manually. 35 Could AI and LLMs be used to undertake this coding, saving researchers time and effort? Researchers are starting to experiment with LLMs in just this way.
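By way of illustration only, that sort of data-pulling step might look something like the sketch below; the URL, page structure and field names are hypothetical assumptions, not the actual code used in my study.

# A hedged sketch of pulling decision metadata into a spreadsheet.
# The URL and HTML structure are hypothetical; real decision
# databases will differ and may impose access conditions.
import csv
import requests
from bs4 import BeautifulSoup

INDEX_URL = "https://example.org/employment-tribunal-decisions"  # hypothetical

response = requests.get(INDEX_URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

rows = []
for link in soup.select("a.decision-link"):  # assumed CSS selector
    rows.append({"title": link.get_text(strip=True), "url": link["href"]})

with open("decisions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)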
In their working paper, Ribeiro de Faria and her co-authors used LLMs to extract information from UK Employment Tribunal decisions. 36 The authors asked the LLM GPT-4 37 to extract information from 260 Employment Tribunal decisions across eight areas: (1) the facts of the case, (2) the claims made by the parties, (3) references to statutes, (4) references to case law, (5) general case outcomes, (6) general case outcomes summarised as one of four labels (‘claimant wins’, ‘claimant partially wins’, ‘claimant loses’ and ‘other’), (7) detailed orders and remedies, and (8) reasons for the decision. 38 A human ‘legal expert’ 39 was used as a ‘quality check’ on the GPT-4 outputs, with GPT-4’s outputs coded as ‘accurate’ if they were ‘generally correct’ (0 for inaccurate, 1 for accurate). 40 The authors found generally high accuracy across the GPT-4 summaries, 41 except in relation to the four ‘outcome’ summaries. 42 As discussed in the next section, assessing the ‘success’ of a claim can be complicated and nuanced; perhaps unsurprisingly, this was where the LLM struggled.
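To make the set-up concrete, an extraction pipeline of this kind might look roughly like the following sketch, which asks a model to assign one of the four outcome labels. The prompt wording and parameters are my own illustrative assumptions, not the authors’ actual prompts.

# An illustrative sketch of LLM-based outcome labelling, assuming
# the OpenAI Python client and an API key; the prompt and settings
# are assumptions, not those used by Ribeiro de Faria et al.
from openai import OpenAI

client = OpenAI()

LABELS = ["claimant wins", "claimant partially wins", "claimant loses", "other"]

def label_outcome(decision_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # reduces, but does not eliminate, output variability
        messages=[
            {"role": "system",
             "content": "You label UK Employment Tribunal outcomes. "
                        f"Answer with exactly one of: {', '.join(LABELS)}."},
            {"role": "user", "content": decision_text},
        ],
    )
    return response.choices[0].message.content.strip().lower()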
The ultimate aim of Ribeiro de Faria and her co-authors was to create a tool which could predict the outcome of employment disputes 43 based on the facts and the claims made. 44 The general case outcomes (as coded to the four labels) would then be used ‘as gold-standard references to guide model training and testing’. 45 The authors posit that the predictive model should be trained only on complete GPT-4 summaries (that is, where facts, claims and outcomes were identified) and on substantive (not procedural) decisions. 46 This left 124 decisions in the case sample. 47 Given that the predictive model would be based on the four-fold outcome summary, and this was the least accurate output from the summaries, concerns might be raised about the suitability of this data to build a predictive model.
In a related paper, Xie and her co-authors then used GPT-4 to predict the outcomes of Employment Tribunal decisions. 48 In that study, the authors filtered out all one-page decisions from the sample, 49 likely excluding most (but not all) cases where claims were withdrawn. 50 GPT-4 was used to annotate the remaining decisions. 51 This annotation was compared with the coding of outcomes done by a legal expert (a PhD student in law). GPT-4 was then used to predict outcomes of cases, given a set of facts and claims. These predictions were compared with the predictions made by human PhD students in law, using the facts and claims summarised by GPT-4. 52 Humans outperformed every GPT-4 variant tested. 53
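At its simplest, a comparison of this kind reduces to scoring predicted labels against a human-coded gold standard, as in the short sketch below (the data are hypothetical, not Xie and her co-authors’ results).

# Illustrative only: scoring predicted outcome labels against
# human-coded gold labels. The data below are hypothetical.
from sklearn.metrics import accuracy_score, f1_score

gold = ["claimant wins", "claimant loses", "other", "claimant loses"]
predicted = ["claimant wins", "other", "other", "claimant wins"]

print("accuracy:", accuracy_score(gold, predicted))
print("macro F1:", f1_score(gold, predicted, average="macro"))

Macro-averaged scores matter here because outcome labels are rarely balanced: a model that always predicts the most common label can look deceptively accurate.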
These working papers are perhaps best seen as exploratory investigations of the potential of AI and LLMs to enable legal analysis and prediction. However, early concerns emerge from this approach to automation. First, concerns might be raised about the approach adopted by Ribeiro de Faria and her co-authors to evaluating the GPT-4 outputs and summaries. In the study, GPT-4’s summary was compared with the Employment Tribunal decision itself, to ascertain if it was ‘generally correct’. GPT-4’s summary was not independently and blindly evaluated against a human summary of the decision, for example (compare the ASIC/AWS study below). There is a significant risk that the results of this study, and its findings of high levels of accuracy, might have been distorted by automation bias – that is, the propensity for humans to favour suggestions from automated decision-making systems. 54 Without a blinded evaluation process, it is difficult to judge whether the summaries were, indeed, ‘generally correct’. This is a point I return to below.
Second, using materials which are ‘generally correct’ as the basis for content analysis, and predictive analysis, raises questions about the veracity of the findings more generally. As noted earlier, for Hall and Wright, content analysis should bring the rigour of social science methods to legal research. 55 Using materials that are ‘generally correct’ as the foundation for analysis seems to lack a degree of rigour in the research process.
Third, these papers fail to address the risk of bias or selectivity in the reporting of facts in judicial decisions themselves, and how this might distort or limit predictive analysis of case outcomes. 56 Ribeiro de Faria and her co-authors, and Xie and her co-authors, both note that the facts in judicial decisions are biased, and both suggest the possibility of obtaining other sources of facts to correct this bias, 57 yet still endeavour to predict case outcomes.
Fourth, Xie and her co-authors compare the case outcome predictions of GPT-4 with the predictions of human PhD students in law. Without careful recruitment and selection, PhD students are unlikely to be experienced practitioners in employment law, with a clear ability to predict case outcomes. (Indeed, even experienced practitioners may be hesitant to predict case outcomes in many cases, especially based on a short factual summary.) While PhD students are experts in other topics, using employment law practitioners as an expert check would seem a more reliable measure. Given that even the PhD students significantly outperformed GPT-4, experts in employment litigation would likely reveal the limits of automation even more dramatically.
Should content analysis be automated?
These working papers prompt us to think more deeply about how LLMs and AI should be used in the research process, particularly in relation to legal research. As Hall and Wright articulate, the benefits of content analysis are in producing ‘objective, falsifiable, and reproducible knowledge about what courts do and how and why they do it’. 58 At present, however, content analysis conducted with the aid of AI and LLMs potentially offers few of these benefits.
The usefulness of LLMs for content analysis of case law might be undermined by three key issues: automation bias, ‘hallucinations’, and LLM bias. First, automation bias recognises the propensity for humans to favour suggestions from automated decision-making systems. 59 In a study of AI-aided mammography readings, for example, radiologists were purposefully given incorrect information from a purported AI-based system. Inexperienced, moderately experienced, and very experienced radiologists were all more likely to adopt the AI’s (wrong) assessment of the scan; but inexperienced radiologists were most often swayed by the false AI information. 60
In the legal context, overcoming automation bias to critically evaluate the limits and benefits of LLMs as a research tool requires proper, independent and blind testing of LLM outputs. One example of how this might be done comes from the ASIC-commissioned AWS study of the use of LLMs to summarise submissions to a Parliamentary Joint Committee inquiry. 61 LLM summaries (created with Llama2-70B) were graded against human summaries by human assessors in a blind test. Each summary (LLM- and human-generated) was scored against specific criteria and given comments by the assessors. In the final results, the AI summaries scored lower, across all criteria, than the human summaries: out of a maximum of 75 points, human summaries in aggregate scored 61 (81 per cent) and the AI summaries scored 35 (47 per cent). 62

The qualitative comments emphasised the LLM’s limited capacity to understand the nuance or context required to effectively analyse the submissions. 63 LLM summaries frequently included incorrect information, missed information, missed the key point, or focused on minor or irrelevant points. 64 LLMs also lacked the ability to understand the nuance of a submission: ‘LLMs [had] limited ability to analyse and summarise complex content requiring a deep understanding of context, subtle nuances, or implicit meaning. This led to summaries that missed deeper implications or oversimplified concepts within the original submissions. The finding emphasises the importance of a critical human eye which can “read between the lines” and not take information at face value.’ 65
This raises particular concerns in relation to legal materials, which often require a critical human eye to be meaningfully comprehended. Overall, the assessors in the AWS study found that LLMs (at their current level of development) often simply create more work, given the need to fact-check each of their outputs, and because the original submission was often easier to read and comprehend. 66 The AWS study does offer a helpful model of how LLM outputs might be meaningfully assessed and evaluated, in a way that is more likely to overcome automation bias. Indeed, when using this model of evaluation, the usefulness of LLM summaries appears far more questionable.
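Operationally, a blinded protocol of this kind is simple to implement: strip any indication of provenance, shuffle the order in which summaries are presented, and record scores against a blind identifier that is unmasked only after grading. The sketch below illustrates the idea; the data structures and scoring scale are hypothetical, not the AWS study’s actual instrument.

# A sketch of a blinded evaluation protocol: assessors score
# summaries without knowing their source; provenance is unmasked
# only after all scores are recorded. Details are hypothetical.
import random

summaries = [
    {"source": "human", "text": "Summary A ..."},
    {"source": "llm", "text": "Summary B ..."},
]

random.shuffle(summaries)  # remove ordering cues
key = {f"item-{i}": s["source"] for i, s in enumerate(summaries)}

scores = {}
for i, s in enumerate(summaries):
    print(f"item-{i}:", s["text"])  # the assessor sees the text only
    scores[f"item-{i}"] = int(input("Score (1-5): "))

for item, score in scores.items():  # unmask after grading
    print(item, key[item], score)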
Second, ‘hallucinations’ may distort the results of analysis. Hallucinations (or ‘bullshit’) 67 are a well-recognised property of LLMs, with well-publicised examples of the invention of case law appearing in the courts. 68 Hallucinations are understood as being ‘text that is nonsensical, or unfaithful to the provided source input’, 69 and may be an inevitable part of LLMs. 70 For Farquhar and his co-authors, hallucinations ‘have come to include a vast array of failures of faithfulness and factuality’. 71 Hallucinations appear particularly common in legal uses of LLMs: in one study of American law, the authors found that, when asked a direct question about a federal court case, LLMs hallucinated anywhere from 58 per cent (GPT-4) to 88 per cent (Llama2-13b) of the time. 72 Hallucinations increased in frequency with the complexity of the task; 73 for high complexity binary tasks, the LLMs sometimes performed worse than a random guess. 74 Further, the authors found that the LLMs performed better in relation to law from more prominent jurisdictions; 75 the results in relation to Australian law may therefore be worse than those in this study. As the authors conclude, then, ‘[o]ur results … temper optimism for the ability of off-the-shelf, publicly available LLMs to accelerate access to justice.’ 76
Hallucinations are also likely to affect how LLMs categorise or summarise data. Indeed, in the AWS study, it was noted that hallucinations appeared in the LLM summaries: ‘Text-based Gen AI can on occasion generate texts that suffer from “hallucinations”, which means that the model generated text that was grammatically correct, but on occasion factually inaccurate.’ 77 While work is being done to help detect and identify some types of hallucinations, 78 this work is still in its infancy, and hallucinations may simply be an inevitable and intrinsic property of LLMs. 79
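One simple family of checks, related in spirit to the detection work cited above, samples the same question several times and flags answers on which the model disagrees with itself. The sketch below shows the idea in its crudest form (exact string matching), with the model call left as a stub; it is an illustration of the approach, not a reliable hallucination detector.

# A crude self-consistency check: sample the same prompt several
# times and flag low-agreement answers as possibly unreliable.
# `ask_llm` is a hypothetical stub for any chat-completion call.
from collections import Counter

def ask_llm(prompt: str) -> str:
    raise NotImplementedError  # e.g., a chat-completion call at temperature > 0

def consistency(prompt: str, samples: int = 5) -> float:
    answers = [ask_llm(prompt) for _ in range(samples)]
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / samples  # 1.0 = fully self-consistent

A low consistency score might warrant human review, but note the limits: this flags inconsistency, not errors the model repeats confidently every time.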
Third, bias can also be built into LLMs through their training data or design, potentially distorting the results of legal analysis. As Gallegos et al describe, ‘[t]ypically trained on an enormous scale of uncurated Internet-based data, LLMs inherit stereotypes, misrepresentations, derogatory and exclusionary language, and other denigrating behaviors that disproportionately affect already-vulnerable and marginalized communities’. 80 ChatGPT has therefore produced responses that are racist and sexist 81 and, in résumé screening experiments, has discriminated against women, people of different ethnicities, 82 and parents as potential recruits. 83 Again, while attempts can be made to mitigate the bias in LLMs, it cannot (yet) be eliminated; bias persists. 84
Content analysis using LLMs is also limited in its reproducibility. While, in theory, LLMs and the algorithms underlying AI should be able to reproduce the same outputs from the same inputs, practical considerations make this nearly impossible. Many LLMs which are suitable for aiding content analysis are too large and complex to run on personal computers. Researchers are therefore likely to use a third-party Software as a Service (SaaS) provider to access LLMs (such as OpenAI). SaaS providers regularly update their LLMs; even if using the same prompts at a later date, the LLM parameters may be different, meaning results may not be reproducible. For example, in one study of Meta’s algorithm, the algorithm was changed mid-study, 85 fundamentally changing the control condition of the researchers’ experiment, and potentially undermining the validity and conclusions of the study. 86 Similarly, for studies using LLMs, the results will not be reproducible unless the LLM and all of its inputs and settings are preserved exactly.
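At a minimum, researchers can log every parameter of each LLM call alongside its output, so that the study’s conditions are at least documented even if the hosted model itself later changes. A minimal sketch, assuming the OpenAI client:

# A minimal provenance log for LLM calls: record the model snapshot,
# parameters, prompt and output with every request. This documents
# the study's conditions but cannot freeze a hosted model itself.
import datetime
import json
from openai import OpenAI

client = OpenAI()

def logged_call(prompt: str, model: str = "gpt-4", temperature: float = 0.0) -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        seed=1234,  # supported by some providers; improves, but does not guarantee, determinism
        messages=[{"role": "user", "content": prompt}],
    )
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "requested_model": model,
        "served_model": response.model,  # the snapshot that actually answered
        "temperature": temperature,
        "prompt": prompt,
        "output": response.choices[0].message.content,
    }
    with open("llm_log.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record["output"]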
Given these challenges and limits, LLMs and AI might be useful legal research tools in specific, limited circumstances. The lowest risk would arise when using LLMs: to summarise specific texts which are provided directly to the LLM; to analyse only non-confidential texts when running an LLM on a third-party provider (such as OpenAI); where a level of inaccuracy can be tolerated or is accepted; where there is limited benefit in a researcher themselves becoming deeply acquainted with the primary materials; where numerical computations are not required; and where blind, independent testing of AI outputs is implemented to assess their utility as a research tool.
Conclusion
Overall, then, studies using LLMs and AI cannot yet achieve the posited benefits of content analysis. Analysis using LLMs offers neither accuracy nor easy reproducibility. Technology may well come to be a useful tool to support case-based analysis and content analysis of judicial decisions. At present, though, this article sounds a clear warning regarding the use of AI and LLMs to aid analysis. Even with blind, independent testing of AI outputs, researchers need to exercise significant caution in using new tools to aid analysis and require a clear understanding of when AI tools might be a useful and beneficial addition to human analysis, and when they will simply produce junk.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
