Abstract
The Short-Term Assessment of Risk and Treatability (START) is a 20-item structured professional judgment instrument for assessing dynamic risk in mental health services. Much of the START research literature examines the relationship between Strengths and Vulnerabilities sub-scale total scores and various adverse outcomes including violence. This assumes that the two sub-scales have the psychometric property of unidimensionality i.e. all the items cluster together as a measure of a single construct. Such assumed unidimensionality is a necessary condition for any analyses based on scale “total score” and the widespread use of scores summated in this way in research studies may obscure more specific clusters of items within each sub-scale. This multinational study examined START assessments (n = 685) conducted in four forensic services in Scandinavia and the UK using principal component analysis. It was found that all but three Strengths items (Substance Use, Social Support and Material Resources) and all but four Vulnerabilities items (Substance Use, Social Support, Material Resources and Self care) loaded >0.5 on the expected component. This indicates a unidimensional structure underlying the START and provides empirical support from a large multinational sample for the widespread use of summated Strengths and Vulnerabilities scores in forensic psychiatric risk research.
Introduction
The Short-Term Assessment of Risk and Treatability (START) (Webster et al., 2006) is a structured professional judgment (SPJ) instrument developed in the Canadian forensic mental health system in the early 2000s which has since been widely implemented in forensic and general mental health services in many countries (O'Shea & Dickens, 2014; Ramesh et al., 2018). It is concerned with the improvement of short-term risk management (i.e. over weeks to months) and is usually completed by clinicians based on interactions with the patient, consultation with colleagues and case note review. In some situations co-completion with the person being assessed has been implemented.
The START consists of 20 items covering a broad range of domains considered pertinent to mental health and risk including, for example, social support, treatment adherence and substance use. It stands out in the crowded field of risk assessment instruments for two particular reasons. Firstly, whilst there are many tools which guide decision making on violence and, to a lesser extent, self-harm/suicide (Carter et al., 2017; Viljoen et al., 2018), the START purports to provide information relating to four other negative outcomes beyond these i.e. substance misuse, self-neglect, unauthorized leave and victimization (Marriott et al., 2017). Secondly, each item is rated in terms of its relevance as both a strength and a vulnerability in the patient’s profile. This emphasis on the importance of individual strengths alongside vulnerabilities has contributed to a widespread acceptance that an overall risk estimate is more valid if it takes positive, protective factors into account (Robbé et al., 2013). These two features, alongside the comprehensiveness of the domains covered by the twenty items, have led to the popularity of the START in a large number of services (Nielsen et al., 2015; Singh et al., 2016).
Since its introduction in 2004 a solid research evaluation literature has developed supporting the use of the START in clinical services. A number of studies have examined its successful implementation (Kroppan et al., 2017) and established that it has adequate psychometric properties in terms of internal consistency and inter-rater reliability (Nicholls et al., 2006; Timmins et al., 2018). There is also evidence of good predictive validity in relation to violence for the separate Strengths and Vulnerabilities scales (Chu et al., 2011; O'Shea & Dickens, 2016). As a tool which consists entirely of dynamic risk factors, it has been used to examine associations between changes in risk and changes in violent outcomes (Whittington et al., 2014) and it has also been tested as an active intervention in a randomized controlled trial involving forensic outpatients (Troquete et al., 2013).
Most studies of predictive validity test the association of the total scores on the two START scales with violent outcomes but some more concise variations of the START have also been tested in this way including the accuracy of individual items (O'Shea & Dickens, 2015; Paetsch et al., 2019). In particular, Braithwaite et al. (2010) examined the predictive validity of the overall START (all twenty items) and compared this overall predictive validity with that for various shortened “optimized scales” related to violence and the other outcomes. They reported that the optimized scales performed better than the overall instrument. For example, four vulnerability items on their own (Mental state, Impulse control, External triggers and Conduct) were more highly associated with subsequent violence (OR 1.23) than the full scale of twenty items (OR 1.05). Similarly, six vulnerability items on their own (Emotional State, Impulse Control, External triggers, Attitudes, Rule Adherence, and Conduct) were more highly associated with subsequent victimization (OR 1.26) than the full scale of twenty items (OR 1.05) and three strength items (Impulse Control, Rule Adherence and Conduct) were better predictors (OR 0.72) of the avoidance of victimization (OR 0.97 for the full scale). This finding raises the possibility that one or more shortened versions of the START could be developed which are easier to administer and might have improved validity for predicting the likelihood of the various outcomes when this is required. This in turn raises the theoretical question of whether the broad domain of risk captured by the START masks a number of underlying clusters which meaningfully constitute different components of the overall risk construct. These clusters would be sub-groups of items which associate with each other and dissociate from other sub-groups empirically.
This evidence of potential item clusters within the START highlights the absence of empirical evidence on the dimensionality of the START. Whilst the START manual advises against calculating total scores for clinical purposes (Webster et al., 2009), many, if not most, research studies on the START construct total Strengths and Vulnerabilities scores by summing across the twenty ratings in each domain and then conduct analyses on these total scores (e.g. Abidin et al., 2013; Hogan & Olver, 2018; Wilson et al., 2010). However, such an approach assumes that the START is a unidimensional scale and that it is meaningful to “add up” across all the contributing items in this way. Unidimensionality is an important attribute of a measurement instrument because, in psychometric terms, it indicates that there is a single latent trait (e.g. in this case “general risk”) underlying the responses (Hattie, 1985) rather than several such traits which are conceptually incompatible and potentially irrelevant to the key trait. This assumption is wrong if the START Strengths and Vulnerabilities subscales are actually made up of item clusters which are distinct from each other and which relate to different aspects of a person’s risk level. Unidimensionality should be examined in any psychological measurement tool to ensure the soundness of the assessments being made about the overall concept under consideration (Ziegler & Hagemann, 2015).
Unidimensionality is tested using factor analysis or related techniques and, in psychometric terms, only a finding that “all the items have substantial loadings on a single factor can be used to justify adding the item scores together to generate a single scale score” (Gardner, 1995). For comparison, it should be noted that the assumption of unidimensionality in various widely-used depression scales was unsupported in a recent investigation indicating that summated scores on such scales should not be interpreted theoretically as reflecting a single construct of depression (Fried et al., 2016). Measurement of the theoretical construct of “total risk” reflected by summated Strengths and Vulnerabilities scores from the START would benefit from similar interrogation and clarification.
To the best of our knowledge there are no previous published studies examining the factor structure of the START and thus this commonly used unidimensional approach remains an assumption without evidence. There is an argument that structured professional judgment (SPJ) instruments such as the START are not psychometric tools at all and only the latter are specifically designed with the aim of measuring a single underlying theoretical construct using multiple items (Fayers & Hand, 2002). As an SPJ instrument, it is argued that the START more closely resembles a clinimetric (rather than a psychometric) tool. Such tools have a more practical purpose as “an index that is ‘clinically sensible’ and has desirable properties for prognosis or prediction” (Machin & Fayers, 2016, p. 53) in which case the abstract concept of unidimensionality is not relevant.
However, SPJ tool development relies heavily on a wide range of psychometric techniques to establish credibility in terms of, for example, inter-rater reliability, internal consistency and convergent validity (Nonstad et al., 2010; O'Shea & Dickens, 2014). Summated Strength and Vulnerability scores are regularly presented and discussed as if the items can be meaningfully combined presumably to represent high or low levels of an unobserved construct sometimes called “risk.” The dimensionality or factor structure of SPJ tools is therefore important regardless of whether such instruments have primarily a psychometric or clinimetric rationale. This is confirmed by, for example, a recent factor analysis of the Structured Assessment of Protective Factors for Violence (SAPROF) which indicated a 4 factor structure in contrast to the 3 subscales rationally derived by the instrument authors (Abbiati et al., 2020).
It is also true that a number of studies have demonstrated that START subscales have high internal consistency with estimates ranging from 0.80 to 0.95 for Strengths and from 0.76 to 0.95 for Vulnerabilities in O’Shea & Dickens’ review (2014). Whilst Hattie (1985) considers the main measure of internal consistency (Cronbach’s alpha) to be “suspect” as a measure of unidimensionality and O'Shea and Dickens (2014) note that it is not a direct metric of unidimensionality, the latter do argue that repeated internal consistency values at this level (≥0.80) from several studies are a good proxy measure of it. Nevertheless the average sample size of studies in this review was 60 indicating the need for further direct examination of unidimensionality in a large combined sample as reported below. This study therefore sought to test the dimensionality of the START instrument in a large multinational forensic sample in order to establish whether the common research practice of summating total Strengths and Vulnerabilities scores is justifiable in psychometric terms.
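For readers less familiar with the internal consistency statistic discussed above, the following is a minimal sketch of the standard Cronbach's alpha calculation from item-level ratings. The function name and the toy ratings are illustrative assumptions and are not taken from any of the studies cited; the sketch also shows why a high alpha alone cannot establish unidimensionality, since the statistic depends only on item and total-score variances.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a set of items rated on the same cases.

    item_scores: list of items, each a list of ratings (one per case).
    Formula: alpha = k / (k - 1) * (1 - sum(item variances) / variance(total)).
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Total score per case: sum the ratings across items, case by case.
    totals = [sum(case) for case in zip(*item_scores)]
    total_var = pvariance(totals)
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    item_var_sum = sum(pvariance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical ratings (0/1/2) for three items across four cases.
toy_items = [
    [0, 1, 2, 1],
    [0, 1, 2, 1],
    [0, 1, 2, 1],
]
print(round(cronbach_alpha(toy_items), 2))
```

In this toy example the three items are identical, so alpha reaches its maximum; real START ratings would of course be less uniform.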
Materials and methods
The START instrument (Webster et al., 2009)
The START consists of 20 items each of which is rated on an ordinal scale with values of 0 (not present), 1 (present to some extent) and 2 (fully present) according to the degree to which a factor is considered a feature of a specific patient’s current clinical profile. Each item is considered and scored in terms of its potential as both a risk factor (vulnerability) and as a protective factor (strength) in relation to the propensity to engage in 7 different types of adverse behavior (violence, self-harm, suicide, unauthorized leave, substance abuse, self-neglect, and exposure to victimization). It is designed for completion through consensus discussion amongst a clinical team but can be meaningfully completed by an individual practitioner. In many countries it is primarily completed by nursing staff based on personal knowledge of the patient amongst team members and appraisal of case notes and relates to the period since the last assessment. Assessment at least every 12 weeks and at most every week is recommended. Raters are required to possess a qualification in one of the recognized mental health professions and ideally should attend a training course. The instrument Manual provides extensive guidance on item descriptors and scoring anchors.
The EuroSTART dataset
This is an integrated standardized dataset of START assessments conducted as part of routine clinical practice in mental health services in Scandinavia and the UK. It has been constructed through collaboration between forensic mental health services in five countries with the aim of pooling data to increase statistical power and enable cross-national comparisons. Four high-security forensic mental health in-patient services have contributed all START ratings conducted as part of routine clinical practice over a specified time period: Forensic Psychiatric Clinic of Stockholm County, (FS), Sweden (168 beds, ∼55 admissions per year); Sct. Hans Mental Health Center (SH), Roskilde, Denmark (104 beds, ∼40 admissions per year); the Scott Clinic (SC), Merseyside, UK (66 beds, ∼29 admissions per year); and Vanha Vaasa Hospital (VV), Vaasa, Finland (152 beds, ∼91 admissions per year).
Three of these services (FS, SH and SC) are regional in scope serving catchment areas with a population of approximately 2 million people. The fourth service (VV) is a national hospital covering the whole of Finland with a population of 5.5 million. The four samples largely reflect the overall demographic and clinical profile of each service and are broadly comparable. The average admission duration (years) was as follows: VV: 7.0; FS: 4.9; SC: 2.3; SH: 2.2. The percentage of each sample that was male and the mean/median age was as follows: VV: 79%, 41 years; FS: 84%, 35 years; SC: 98%, 33 years; SH: 94%, 40 years. The most common diagnosis was schizophrenia or psychosis in all services and the most common legal decision governing compulsory treatment in all services was diminished responsibility, not guilty by reason of insanity or equivalent. The median time between admission to the service and the first START assessment varied substantially, being 8 months in FS and 57 months in SH. Further information on the demographic and clinical characteristics of the samples is not available as such information was required to be removed to obtain ethical approval. The project is co-ordinated by the Brøset Center for Research and Education in Forensic Psychiatry, Trondheim, Norway.
The START had been implemented in each service for a number of years prior to data integration. The median number of assessments per patient was VV: 4; FS: 2; SC: 7; SH: 5. Assessments at VV were conducted at fixed time points every six months regardless of clinical condition but assessments in the other services were conducted as required and/or when staff resources were available. In SC and SH, some patients were assessed many times with a maximum of 29 assessments on one patient at SC and 22 assessments at SH. One fifth of patients at FS (22.6%) and SH (20.8%) had a single assessment compared to less than a tenth (8.6%) of patients at SC. Completion of the START in all cases was conducted by staff who had training based on the instrument Manual and who followed the protocol set out in that document as far as possible. Assessments were based on clinical documentation, multidisciplinary team consultation and, where possible, collaboration with the patient. If recorded on paper they were subsequently loaded in a digital format to a centralized secure drive run by the relevant service.
Each service obtained appropriate local ethical and research governance approval and exported anonymized START ratings into a standardized Excel spreadsheet. The four datasets were then merged and exported into SPSS v25 for analysis.
The overall dataset consists of 2890 START assessments but only one rating per patient was included in this analysis to avoid confounding through repeated assessments (Tabachnik & Fidell, 2007). The selected rating was usually the first assessment conducted on the patient during the study period. When the assessment date was unknown the first assessment listed in the dataset for that patient was selected. This may reflect the first assessment in time but may also be the result of how the data were loaded and sorted during data preparation. This created a sample of 685 patients with full (n = 593) or partial (n = 92) item completion on either the Strength or Vulnerabilities scales (VV: n = 112, 16% of the overall sample; FS: n = 327, 48%; SC: n = 112, 16%; SH: n = 134, 20%).
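The selection rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually run on the EuroSTART dataset: the record layout (patient_id, assessed_on, row_order) is hypothetical, and the sketch assumes that whenever any of a patient's assessment dates is unknown the first listed row is used.

```python
from datetime import date
from typing import NamedTuple, Optional


class Assessment(NamedTuple):
    patient_id: str               # hypothetical anonymized identifier
    assessed_on: Optional[date]   # None when the assessment date is unknown
    row_order: int                # position of the row in the merged dataset


def select_index_assessments(rows):
    """Return a dict mapping each patient to the single assessment retained."""
    by_patient = {}
    for row in rows:
        by_patient.setdefault(row.patient_id, []).append(row)

    selected = {}
    for patient_id, group in by_patient.items():
        if all(r.assessed_on is not None for r in group):
            # All dates known: keep the earliest assessment in time.
            selected[patient_id] = min(
                group, key=lambda r: (r.assessed_on, r.row_order)
            )
        else:
            # Assumption: with any unknown date, fall back to listing order.
            selected[patient_id] = min(group, key=lambda r: r.row_order)
    return selected


example_rows = [
    Assessment("p1", date(2015, 3, 1), 0),
    Assessment("p1", date(2014, 1, 1), 1),
    Assessment("p2", None, 2),
    Assessment("p2", None, 3),
]
for pid, chosen in select_index_assessments(example_rows).items():
    print(pid, chosen.row_order)
```

Retaining one row per patient keeps the observations independent, at the cost of discarding the repeated assessments.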
Statistical analysis
Categorical principal components analysis (CATPCA) (Linting et al., 2007) using variable principal normalization in the SPSS Dimension Reduction menu was used to examine the dimensionality of the START Strengths and Vulnerabilities ratings. This approach was chosen because of the ordinal nature of the three category response format for each item. Whilst factor analysis is routinely used with ordinal data in applied social sciences it can be unsuitable as it can generate erroneous factors (Dolan, 1994). The ordinal scale was selected as optimal for the SPSS procedure and all ratings were recoded from 0/1/2 to 1/2/3 as values of 0 are treated as missing by SPSS in this procedure (IBM Support, 2020). A stringent component loading cut-off of >0.5 was set (Tabachnik & Fidell, 2007). There was a small amount of missing data (1.12% of observations). For all item-level analyses these were treated passively (Linting et al., 2007) in that the missing observation on a variable did not contribute to the analysis on that variable only. Subscale scores were not calculated for a case when missing data was present.
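The data preparation steps described in this section can be illustrated with a short sketch. It is written in Python purely for illustration (the analysis itself was run in SPSS), and the function names are hypothetical. It shows the 0/1/2 to 1/2/3 recoding and the rule that a subscale total is only computed for cases with complete item ratings.

```python
from typing import Optional, Sequence


def recode_for_catpca(rating: Optional[int]) -> Optional[int]:
    """Shift a 0/1/2 rating to 1/2/3; a missing rating (None) stays missing."""
    if rating is None:
        return None
    if rating not in (0, 1, 2):
        raise ValueError(f"unexpected START rating: {rating!r}")
    return rating + 1


def subscale_total(ratings: Sequence[Optional[int]]) -> Optional[int]:
    """Sum the original 0/1/2 ratings, or return None if any item is missing."""
    if any(r is None for r in ratings):
        return None
    return sum(ratings)


def missing_fraction(cases: Sequence[Sequence[Optional[int]]]) -> float:
    """Proportion of item-level observations that are missing."""
    observations = [r for case in cases for r in case]
    return sum(r is None for r in observations) / len(observations)


complete_case = [2, 1, 0, 1]       # toy case with four items only
partial_case = [2, None, 0, 1]
print([recode_for_catpca(r) for r in partial_case])
print(subscale_total(complete_case), subscale_total(partial_case))
print(missing_fraction([complete_case, partial_case]))
```

Under passive treatment the partial case would still contribute its three observed items to the item-level analysis, while being excluded from any comparison of total scores.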
Results
Mean scores on individual items and the summed Strength and Vulnerabilities scale for each country are given in Table 1.
Table 1. Mean scores on individual items and the summed Strengths and Vulnerabilities scales for each country.
Note: Total n = 593 due to incomplete subscale item completion.
Variations between countries were highly statistically significant (chi-squared test, p < .005) for every Strength item. The distributions of scores were more similar across the four countries for the Vulnerabilities scale but variation in all cross-national comparisons was statistically significant (p < .05) for all items apart from Emotional State, Substance Use, Impulse Control, Social Support and Attitudes. Both scale total scores also varied significantly between countries (Strengths: F = 31.07, p < .001; Vulnerabilities: F = 3.43, p = .017; df = 3, 589) with Strengths rated particularly highly in the UK sample. However, for Vulnerabilities, despite the statistical significance of the variation, no national sample varied by more than 10% away from the overall mean.
The results of the PCA are presented in Table 2 for the combined sample from all four countries and in Supplementary Table 1 for each country separately. In the overall sample, there was strong evidence of unidimensionality for both Strengths and Vulnerabilities. All Strengths items loaded >0.3 on Component 1 and all but three items (Substance Use, Social Support and Material Resources) loaded on this component at the required cutoff (>0.5). Two of the Component 1 non-loading items (Substance use and Material resources) loaded instead onto Component 2 and the third (Social support) loaded onto Component 3. The eigenvalue for Component 1 indicates that it explained 43% of the overall variance and the high Cronbach’s alpha value indicates that Component 1 had high internal consistency. With regard to Vulnerabilities, again all items loaded >0.3 on Component 1 and all but four items loaded >0.5. Three of these items were the same as the non-loading Strengths items but the distribution of loadings on the other components was slightly different for two of these three items. Social support did not load onto any component for Vulnerabilities and Material resources loaded onto Component 4 (Vulnerabilities) rather than Component 2 (Strengths). The eigenvalue and Cronbach’s alpha for Component 1 Vulnerabilities were slightly lower (7.33, 0.91) than those for Strengths (8.60, 0.93) though remained high overall.
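The relationship between the reported eigenvalues, the proportion of variance accounted for and the component-level Cronbach's alpha can be checked with a brief calculation. The sketch below assumes the eigenvalue-based form of alpha commonly reported alongside CATPCA solutions, alpha = M(λ − 1) / ((M − 1)λ), where M is the number of items (20) and λ the eigenvalue; the function names are illustrative only.

```python
N_ITEMS = 20  # number of START items in each subscale


def variance_accounted_for(eigenvalue: float, n_items: int = N_ITEMS) -> float:
    """Proportion of total variance explained by a component."""
    return eigenvalue / n_items


def alpha_from_eigenvalue(eigenvalue: float, n_items: int = N_ITEMS) -> float:
    """Eigenvalue-based Cronbach's alpha (assumed CATPCA-style formula)."""
    return n_items * (eigenvalue - 1) / ((n_items - 1) * eigenvalue)


# Component 1 eigenvalues as reported in the text.
for label, eigenvalue in (("Strengths", 8.60), ("Vulnerabilities", 7.33)):
    share = variance_accounted_for(eigenvalue)
    alpha = alpha_from_eigenvalue(eigenvalue)
    print(f"{label}: variance {share:.1%}, alpha {alpha:.2f}")
```

Under this assumption the calculation reproduces the figures given above (43% of variance and alpha of 0.93 for Strengths; alpha of 0.91 for Vulnerabilities) and implies that Component 1 accounted for roughly 37% of the variance in the Vulnerabilities ratings.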
Table 2. Component loadings for Strengths and Vulnerabilities items (>0.5 highlighted in grey).
Note: Total Cronbach’s alpha: 0.96 Strengths, 0.96 Vulnerabilities. Total eigenvalue: 11.86 Strengths, 10.99 Vulnerabilities.
The same analysis conducted for each country individually was largely consistent with this unidimensional pattern (Supplementary Table 1). Unidimensionality was somewhat weaker in the Swedish sample, especially for Vulnerabilities. Material resources and Substance use were the least well-fitted items and did not load in more than half of the individual country analyses.
Discussion
This is the first study to examine the unidimensionality of the START Strengths and Vulnerabilities scales in a multinational sample in order to test the empirical justification for summed total scores. There is strong evidence here that both scales are indeed unidimensional with all but three or four items loading strongly onto a single component in both cases. This is preliminary evidence that a total Strengths or Vulnerabilities score as used by researchers is a meaningful entity. Thus the rational process used to select items for inclusion in the START as a clinimetric tool in the development stage has successfully produced a psychometrically robust pair of scales. Given this evidence of unidimensionality it appears that those making the ratings view risk or protective factors as a unified concept with few separate domains. A set of items here consistently did not fit with this unidimensional pattern for either Strengths or Vulnerabilities. Substance use in particular cross-loaded onto other potential components with one or two other items. This suggests perhaps the unique contribution of substance use to perceptions of risk, cutting across other risk domains as a general factor exacerbating the potential for poor outcomes in violence, self-harm and the other aspects to which the START purports to relate. The other two divergent items, Material resources and Social support, both clearly relate to the patient’s external environment and factors which are relatively beyond their sphere of personal control. As such, they may be perceived by raters as contributing a special set of challenges for the patient which is independent of their internal world.
The unidimensionality demonstrated here does not in itself provide empirical support for the theoretical concept of risk sub-domains embedded in the overall START item list. However, this does not undermine the potential usefulness of the optimized scales proposed by Braithwaite et al. (2010). Again, the lack of evidence here for multiple dimensionality is a psychometric issue but the relatively high level of predictive validity demonstrated for the optimized scales in that study is evidence of the START’s effectiveness as a clinimetric tool.
Whilst there is support here for the process of summation, a number of aspects of the clinical context must be considered. Firstly, despite the availability of quite specific rating guidelines in the manual, the complexity of the behavior being assessed still leaves much room for subjectivity in evaluations. For example the Substance Use item could be rated as a Strength (1 or 2) in a variety of ways: the patient has never used substances; the patient has used substances before hospitalization and has a substance use diagnosis, but cannot access drugs in the hospital because of restrictive conditions; or the patient admits his or her urges, but does not use substances. All of these are valid responses but reflect very diverse clinical situations. This raises the question of consistent communication within and between clinical teams over the precise meaning of individual items and overall risk estimates. One way to enhance such communication is to provide comprehensive training for staff in making the risk formulation including risk-scenarios with related actions described together with the patients. In addition it should be noted that simple summation assumes equal weighting across all items when it is quite possible that some items are more important than others and so different weights should be attached to some items.
The focus throughout this report has been on the psychometric approach to the START. It has been noted that clinical staff are advised in the START manual to avoid summation in practice and to use the individual items to guide individualized planning and interventions with patients. They are encouraged however to make a specific risk estimate (low, medium or high) for each of the adverse outcomes (aggression etc.) and space for recording such estimates is prominent on the START summary sheet. As with all SPJ tools, these risk estimates are explicitly not intended to be direct numerical translations of the summated score but instead are expected to be more sophisticated reflective interpretations of the person’s overall profile. Such formulations will take into account specific key risks or strengths particular to the person being assessed and awareness of this individual profile plays a bigger role in guiding an estimate of specific risk than a total score. However, it is interesting to consider the degree to which a risk estimate of “high” in clinical practice relates to a high numerical score on the two subscales especially as there is little guidance on what constitutes high, medium or low categories in risk estimates. It is likely that in practice, the process of conducting an assessment which yields consistent scores of “2” across all START Vulnerabilities items and “0” on Strength items will create a presumption that a high risk estimate is appropriate. Indeed, high total scores have been found to be significantly associated with specific risk estimates for violence in several studies (O'Shea & Dickens, 2014). In that sense, the summated score is at least likely to be an influence on the risk estimate and the psychometric issue of unidimensionality is thus relevant in the clinical context as well as the research context.
The main strength of this study is the overall sample size which is more than double the requirement for meaningful PCA (Guadagnoli & Velicer, 1988; Tabachnik & Fidell, 2007). The reliance here on secondary data derived from routine clinical information has both positive and negative implications. The approach has high ecological validity as it clearly reflects the “real world” usage of the START instrument by clinicians as they go about their business of managing dangerous individuals. At the same time the reliability and validity of individual ratings is questionable given the large number of different raters in diverse countries and the constraints imposed upon them when conducting risk assessments as part of a busy routine and heavy workload. The strength of the overall findings is also limited by identification of some variations in unidimensionality between the contributing countries and generalizability is restricted due to the use of the instrument here with an overwhelmingly male sample. The individual country samples are variable in size with some countries contributing fewer than 150 cases and one country contributing a much larger proportion than others. Conclusions about individual countries should therefore be made with extra caution and those drawn about the overall sample must be made with an awareness of the large contingent from Sweden. There was also substantial variation between countries in terms of the length of time between admission and the START assessment in the study. This should be noted when considering the results but the variation itself is not fundamentally relevant to the issue of the internal structure of the scale being addressed here.
This analysis of the EuroSTART dataset represents the first step in a potential research programme examining the psychometric properties of the START instrument. A number of additional analyses will be considered based on the current dataset and additional variables when they can be added. Exploratory and then confirmatory factor analysis may be conducted to enable comparison between this unidimensional PCA model and any multidimensional models which may be identified by EFA. Also item response theory (IRT) models may be used to examine the question of whether additional weightings should be allocated to one or more items when calculating the total score. It is desirable to add supplementary variables to the dataset when this is feasible. These variables include the specific risk estimates from the START assessments and the relevant outcomes in terms of adverse behavior. Whilst these aspects are universally available in the various clinical services contributing to the project there are major challenges when integrating these into a combined dataset. The START itself is always recorded in a standardized structure which makes it relatively straightforward to integrate but outcome data in particular is recorded very differently across services even within the same country. Addressing this inconsistency is a priority but will require some time to achieve.
The START evidence base continues to expand and the findings from this study suggest a number of avenues for future research. In particular the formulation process by which assessors move from rating of individual items to a specific overall risk estimate as the key construct underpinning the prioritization of clinical decisions could be examined further. Tighter protocols for rating and replication of the approach adopted here with larger female samples and non-forensic samples would also be worthwhile.
In conclusion, the evidence here supports the use of summated START Strengths and Vulnerabilities scores for research purposes. It is therefore meaningful to conduct analyses using such total scores. However, notwithstanding this evidence for a single “general risk” construct underpinning START assessments, there may still be other clinimetric reasons for developing sub-scales or identifying clusters of items if these produce clinically useful findings.
Acknowledgements
Thanks to Dr. Ghitta Weizmann-Henelius for contributing to the early stages of this project and to Dr. Nutmeg Hallett for comments on a draft of the paper.
Conflict of interest
The authors have no conflicts of interest to report.
