Sage Journals: Discover world-class research

Abstract

This research aims to develop a novel domestic abuse risk assessment tool, called Lizzy, for predicting the repeat victimisation of German female victims of physical intimate partner violence. Our approach includes actuarial and machine learning techniques based on data from a longitudinal online survey with a nationally representative sample of 3878 respondents (July to November 2023). Four algorithms were employed: CatBoost, XGBoost, Logistic Regression, and Random Forest. Logistic regression performed best with an accuracy of 0.82 and Area Under the Curve of 0.85. We find that predictors covering multiple dimensions of abuse, including physical abuse as well as economic, digital, and emotional abuse, contribute to model performance.

Keywords

Domestic abuse, intimate partner violence, recidivism risk, risk assessment

Introduction

Intimate partner violence (IPV) is categorised by the World Health Organisation (WHO) as a ‘major public health problem’, as globally, 27% of women aged 15–49 have experienced IPV at least once in their lifetime (WHO, 2024). Its severity is evident from 38% of global femicides being committed by intimate partners. The lifetime prevalence of physical and/or sexual IPV amongst women in Europe is ∼22%, with Germany being close to the EU average at 21% (OECD, 2017; WHO, 2024). In Germany, cases rose by 39% from 120,758 in 2012, when data on IPV were first officially collected, to 167,865 in 2023 (BKA, 2024). Reflecting global rates, most victims in Germany are female (around 80%) (Trafford and Le, 2024; WHO, 2024), with 132,966 women (79.2%) reporting IPV to the police in Germany during 2023 (BKA, 2024).

Structured risk assessments have been developed as tools to help improve criminal justice responses to domestic abuse (DA) (Myhill et al., 2023; Robinson, 2011). Using a question-and-answer format, victim responses to risk assessments can help safeguard survivors from re-victimisation, determine appropriate interventions and enable resource prioritisation. The development of tools to predict future IPV, including recidivism and lethality, has grown extensively over the last two decades following a growth in predictive policing. Recently, Van der Put et al. (2019) identified over 39 DA risk assessment tools in their review. Replacing or supporting practitioner judgement, they are intended to work as a standard measurement for determining the risk of future and current harm to victims and have come to be relied on internationally in responding to IPV (EIGE, 2019). Their primary utility in violence reduction lies in identifying cases with an elevated risk of harm to the victim and aligning interventions with the assessed risk level to mitigate further harm (Lauria, 2020). This makes it crucial to ensure accurate and reliable risk measurements for recidivism to protect victims from future and potentially lethal consequences (Graham et al., 2021; Van der Put et al., 2019). That being said, researchers believe that the evolution of accurate DA risk assessment tools is still in its ‘infancy’ (Van der Put et al., 2019).

Despite the widespread use of risk assessment tools, evidence of their effectiveness in reducing or preventing violence is mixed, often due to methodological limitations. In a systematic review, Viljoen et al. (2018) noted common issues in intervention studies, including sample representativeness, lack of comparison groups, and adherence to protocols. However, there is evidence that risk assessments can reduce violence, particularly when they go beyond simple risk prediction to inform evidence-based interventions. For example, a study in Rhineland-Palatinate, Germany, conducted by Weis et al. (2016) used a parallel group design to test a new IPV risk management procedure. The procedure combined multiagency risk assessment conferences with tools including the Ontario Domestic Assault Risk Assessment (ODARA), an actuarial tool that calculates risk based on information available in police records with a follow-up period of five years (Hilton et al., 2004), and the Danger Assessment, a structured professional tool administered to victims with a follow-up period of one year (Campbell, 1986). Across three police forces, recidivism in the treatment group fell from 42% to 20% over one year, demonstrating the potential of comprehensive risk management strategies to reduce reoffending.

This study seeks to develop a novel DA risk assessment tool called Lizzy by utilising machine learning techniques to predict victimisation amongst female victims of physical IPV in the German context. The tool is based on data from a nationally representative longitudinal online survey, which included physical and non-physical predictors of IPV.

Literature review

Types of DA risk assessments

DA risk assessment tools can generally be categorised into three major types, distinguished by the methods used to weigh and combine information to generate a risk output (Kebbell, 2019). Unstructured clinical judgement or ‘“gut” approaches’ (Kebbell, 2019: 4) rely on the professional's training and experience, and tend to be less reliable and accurate compared to actuarial tools (Seewald et al., 2017). Actuarial approaches like ODARA rely on statistical models to identify risk factors and apply standardised, weighted risk items to assess the likelihood of repeat or severe IPV (Hilton, 2021). Structured professional judgement approaches like the Spousal Assault Risk Assessment combine these two approaches (Kropp, 2008). They guide professionals in evaluating risk factors based on predefined criteria, while also allowing for clinical discretion in individual cases. Germany's approach to DA risk assessment is defined by a patchwork of methods, including ODARA and Danger Assessment, along with local adaptations and adjustments derived from these tools, as well as other regionally developed structured professional instruments (Weißer Ring, 2021).

Evaluation metrics for DA risk assessments

To compare and validate risk assessments, the Receiver Operating Characteristic Area Under the Curve (AUC) has been used as the most important measurement of efficacy due to the metric's reliability. This reliability stems from its resistance to changes across base rates, selection ratios and truncated distributions (Messing, 2019; Rice and Harris, 2005; Swets et al., 2000; Van der Put et al., 2019). Generally, values between 0.556 and 0.639 represent a small effect size, values between 0.639 and 0.714 represent a moderate effect size, and values > 0.714 represent a large effect size (Van der Put et al., 2019). Others have proposed more stringent criteria, suggesting that AUC scores below 0.7 are poor and can be considered no better than chance (Lamb et al., 2022). Van der Put et al. (2019) found that most risk assessments used in the context of IPV had a small-to-average effect. Their extensive review of 39 predictive risk assessment tools produced an average AUC score of 0.643. When adjusted to account for missing effect sizes, the average AUC fell to 0.599. Messing and Thaller (2013) similarly found that risk assessments (weighted by sample size) had a small effect size, ranging from 0.537 to 0.628. ODARA was the exception, with a 0.67 chance of correctly identifying recidivism. Still, the at best moderate predictive accuracy reflected in these scores demonstrates that the efficacy of these tools remains low.
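The effect-size bands quoted above can be expressed as a small lookup. The following is a minimal sketch rather than part of the study's analysis; the function name `auc_effect_size` is hypothetical, and because the quoted bands share their boundary values (0.639 and 0.714), the assignment of a boundary value to a band is an assumption made here for illustration.

```python
def auc_effect_size(auc: float) -> str:
    """Label an AUC using the bands quoted from Van der Put et al. (2019).

    Assumption: 0.639 is placed in the moderate band and 0.714 stays
    moderate, since the text defines 'large' as strictly greater than 0.714.
    """
    if auc > 0.714:
        return "large"
    if auc >= 0.639:
        return "moderate"
    if auc >= 0.556:
        return "small"
    return "below small"


# AUC values reported in the paragraph above.
for name, auc in [
    ("average across 39 tools", 0.643),
    ("average adjusted for missing effect sizes", 0.599),
    ("ODARA", 0.67),
]:
    print(f"{name}: AUC = {auc} -> {auc_effect_size(auc)}")
```

Under these bands, the unadjusted average falls just inside the moderate range, while the adjusted average is small.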

Solely relying on AUC as a measure of the predictive ability of risk assessment tools is, however, not recommended as it might mask variations in performance at specific thresholds (Lobo et al., 2008). To analyse the diagnostic performance of a risk assessment tool, Dhamnetiya et al. (2022) recommend considering additional metrics such as sensitivity (true positive rate, also known as recall) and specificity (true negative rate). Sensitivity measures how often a test correctly identifies victims who go on to be re-victimised, while specificity measures how often a test correctly identifies victims who will not re-experience IPV. For instance, while having an AUC of 0.628 for predicting severe violence, the Danger Assessment reported sensitivity values of 0.494–0.921, specificity values of 0.201–0.666 and positive predictive values of 0.255–0.306 (Roehl et al., 2005). This indicates that while the Danger Assessment tool has a moderate AUC score, it struggles with specificity and positive predictive power, showing a tendency to produce ‘too many false positives’ in its predictions (Lamb et al., 2022). Moreover, research has repeatedly found that victims’ self-assessments are capable of outperforming the predictive power of DV-specific tools (Roehl et al., 2005; Van der Put et al., 2019).
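To make the relationship between these metrics concrete, the sketch below computes sensitivity, specificity and positive predictive value from the four cells of a confusion matrix. The counts are hypothetical and chosen only to illustrate how a tool can combine high sensitivity with low specificity and a low positive predictive value; they are not taken from any of the cited studies.

```python
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute threshold-dependent metrics from confusion-matrix counts."""
    return {
        # Share of re-victimised cases that the tool flagged.
        "sensitivity": tp / (tp + fn),
        # Share of non-re-victimised cases that the tool did not flag.
        "specificity": tn / (tn + fp),
        # Share of flagged cases that were actually re-victimised.
        "ppv": tp / (tp + fp),
    }


# Hypothetical counts: 100 re-victimised and 400 non-re-victimised cases.
metrics = diagnostic_metrics(tp=80, fp=240, tn=160, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this hypothetical case, most true cases are detected, yet three in four flagged cases are false positives.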

It is also essential to report the models’ performance in different development phases (training, validation, testing) to give users a sense of the risk of overfitting (Seghier, 2022). Overfitting occurs when an algorithm fits too closely to the data it was trained on, that is, it replicates the noise and random fluctuations in the training data instead of deriving a pattern. The result is an algorithm that cannot make accurate predictions when deployed on new data. Ideally, only minor variations will occur between the models’ performance values across development phases. López-Ossorio et al. (2016) tested the predictive validity of the Police Risk Assessment (VPR – Valoración Policial del Riesgo), used by Spanish police forces to screen IPV cases as part of the Comprehensive Monitoring System for Cases of Gender Violence (VioGén), across different stages of development. They found the model's AUC score to be 0.71 within a follow-up period of three months, while during its fourth iteration, the authors reported an AUC score of 0.63 for VPR. Whilst this change indicates a risk of overfitting, the size of this risk is difficult to estimate.

Data sources for DA risk assessments

Although sample sizes for constructing and validating actuarial risk assessments have increased, they remain relatively small. The sample sizes reported by Messing and Thaller (2013) for five different risk assessments varied between 56 and 1465. By the time of Van der Put et al.'s (2019) review, sample sizes had expanded to between 26 and 29,317 participants. Yet only ten of the 39 risk assessment tools included in the review were tested at least once with a sample larger than 1000 participants. Only five IPV tools included in the review had at least one study with a sample size exceeding 2000 participants, and none of these tools is widely used (Fitzgerald and Graham, 2016; Mason and Roberta, 2009; Stansfield and Williams, 2014; Williams, 2012).

Additionally, most risk assessment tools have been developed using criminal justice samples. Euser et al. note that retrospective designs can be cost-effective and ‘a time-efficient and elegant way of answering new questions with existing data, but one has no choice other than to work with what has been measured in the past, often for another purpose … than the one under investigation’ (Euser et al., 2009: c216). Historical crime data can be biased and/or incomplete, which can lead tools trained on these data to ‘reproduce and entrench existing discriminatory practices’ (EFRA, 2022: 36). In their reanalysis of civilian–police encounters, Knox et al. (2020) found that traditional estimates tend to underestimate racial bias significantly: ‘a naive analysis that assumes no race-based selection into the data suggests only 10,000 black and Hispanic civilians were handcuffed because of racial bias in New York City between 2003 and 2013, we estimate that the true number is approximately 56,000’ (Knox et al., 2020: 620).

The possibility for replication of discrimination, and particularly racial bias, is clear from López-Ossorio et al.'s (2019) study which used historical police records from over 6613 IPV cases to validate VioGén's DA risk assessment tools. The dataset showed a high overrepresentation of several groups, including Romanian victims, who constituted 5.4% of the cases despite representing only 1.3% of the population (López-Ossorio et al., 2019; Statista, n.d.). As such, historical police data does not necessarily provide the most reliable basis for developing risk assessment tools. In fact, government agencies often conduct victimisation surveys, which poll a random sample of the population about their experiences of crime and violence to estimate the ‘real’ prevalence and extent of a crime. This is because official data sources, such as law enforcement data, can provide an unrepresentative sample of IPV victims and crime types, as most IPV is unreported (Tjaden and Thoennes, 2000). Consequently, this research utilises a national sample to determine victimisation rates across Germany.

Using policing data derived from arrests to build risk assessments also poses other challenges for risk assessments, which are intended for use in alternative settings. Trafford and Le's (2024) analysis of a nationally representative sample of IPV victims in Germany pointed to sociodemographic differences between IPV victims across the various types of support services approached. This suggests that respondents who seek support from the police may differ from respondents who consult other services. Thus, the reliability and validity of a DA risk assessment tool trained solely on respondents likely to interact with the police may not necessarily be upheld in non-police settings, such as health or consulting environments.

In terms of research design, most DA risk assessment tools such as ODARA, the VioGén risk assessment tools and B-SAFER, rely on historical data (López-Ossorio et al., 2019; Van der Put et al., 2019). In contrast, prospective designs, while more resource-intensive, allow for more precise and flexible data collection by observing risk factors in real time, which offers a more reliable approach to understanding recidivism (Euser et al., 2009). Rather than relying on historical data and associations, which might suffer from biases and recall errors, researchers can track how different risk factors impact recidivism over time.

Follow-up periods for DA risk assessments

It is essential to consider the length of the follow-up period for risk assessment tools because this defines the timeframe over which they are effective. Douglas and Skeem (2005: 370) note that ‘…most time intervals will risk missing fluctuations that occur in between assessments. The longer the interval between assessments, the greater the risk that changes will be missed. If changes are missed, the validity of measurement is threatened’. Therefore, longer follow-up periods spanning numerous years can present significant challenges for targeted intervention and prevention strategies. Where perpetrators are tracked for up to five years, such as in the sample used to construct ODARA (Hilton et al., 2004), the tool can predict the likelihood of physical abuse occurring over a five-year period. Yet, predictions across multiple years can be impractical for developing effective risk management strategies by emergency workers who seek to intervene with victims across short timeframes. Comparatively, shorter follow-up times, such as the one-year follow-up period used to develop the Danger Assessment (Campbell, 1986), allow for closer monitoring and more immediate and actionable insights across time periods that are more reflective of professional intervention. Despite the difficulties that longer follow-up periods pose for implementing tools in practice, Graham et al. (2021) found that 14/26 tools analysed did not record a follow-up period. This included B-SAFER, which makes it difficult to gauge the period over which its predictions are effective (Kropp, 2008). Furthermore, 5/26 tools had a follow-up time of more than one year, with follow-up times ranging up to eight years.

Consequently, the literature suggests that risk assessment tools often have limited predictive validity in practice. This is despite the enhanced focus and reliance upon risk assessments over the past decade (Messing and Thaller, 2013; Van der Put et al., 2019). Whilst the predictive accuracy of current risk assessments might be ‘sufficient to justify their use … in both high-risk and general populations’, it is essential that the validity and reliability of these tools are improved (Van der Put et al., 2019: 113). Increasing the accuracy with which IPV is predicted can enhance targeted intervention and prevention, but it requires a greater investment of ‘time, money, and resources’ (Van der Put et al., 2019: 113).

AI advancements in DA risk assessments

Advancements in AI and predictive technology have heralded a new era for risk assessments to become increasingly sophisticated. Using historical data, several researchers have begun to demonstrate the potential use of AI to create models with significantly higher accuracy scores. For example, Turner et al. (2022) utilised information gathered using the DASH questionnaire, alongside incident descriptors, IPV history, personal and geographical demographics, and criminal and victimisation history to create a logistic regression model with an AUC of 0.748 for IPV recidivism, compared to an AUC of 0.567 based purely on DASH risk grading. Garcia-Vergara et al. (2023) have also utilised natural language processing to construct a dataset of relevant IPV cases and employed a random forest classifier that achieved an average accuracy of 81.14%. Quijano-Sánchez et al. (2021), meanwhile, enhanced the VioGén model, achieving an accuracy of 81.26% through the application of machine learning techniques. Thus, while the application of AI and machine learning to the development of risk assessments remains sparse in the sphere of DA, these results suggest it is a promising area for development (Hui et al., 2023).

Aim

This research seeks to build upon previous studies by utilising a nationally representative sample of victims in Germany to develop a risk assessment tool which utilises AI to predict the likelihood of future physical abuse within the next three months, given previous physical, emotional, financial and digital abuse by an intimate partner/ex-partner. Sexual abuse was not included in the study due to its higher potential to inhibit disclosure of abuse, which may stem from the sensitive nature of this form of abuse and shame experienced by victims (Felson and Paré, 2005; Hassanpour et al., 2025). While the harmful consequences of non-physical abuse are well-documented, German law only recognises physical forms of IPV. The vast majority of IPV cases are prosecuted under provisions related to bodily harm (§ 223 of the German Criminal Code – StGB), dangerous bodily harm (§ 224 StGB), grievous bodily harm (§ 226 StGB), as well as murder and manslaughter (§§ 211, 212 StGB). In contrast, fewer than one in five IPV cases involve charges related to stalking (§ 238 StGB), coercion (§ 240 StGB), or threats (§ 241 StGB), highlighting the relative neglect of non-physical forms of abuse in legal proceedings. As such, this study focuses on predicting physical abuse. This study is unlikely to be affected by professional intervention as a confounding factor because data collection included victims who had and had not been in contact with public authorities.

Methods

Study design and setting

To build a novel IPV risk assessment tool for the German context called Lizzy, we designed a longitudinal study with a three-month follow-up period based on a nationally representative German sample of 3878 respondents. The first survey was administered over two and a half weeks from 12 to 31 July 2023 (hereafter referred to as Wave 1) to create a nationally representative sample of 7400. Limiting the sample to female victims of abuse provided 1400 respondents in Wave 1. Respondents who disclosed having experienced any abuse at least once in the past year were recontacted after three months between 12 October and 12 November 2023 (hereafter referred to as Wave 2). The re-contact rate was high at 76%, resulting in 1065 participants. This is in line with the retention rates reported by other studies on IPV (Devries et al., 2013; Whitton et al., 2019). Overall, this provided us with a sample of 1065 female victims who had experienced abuse in the past year and responded to the second survey, three months later. The sample size reflected our estimated figure of 1000 victims, based on previous research indicating that one in four female Germans experience abuse by a partner at least once in their lifetime (Schröttle and Müller, 2004).

The survey was introduced to respondents with the following text: ‘This survey addresses the topic of unhealthy behaviours in romantic relationships. Even if you have not experienced unhealthy behaviours in your current or previous relationships, answering these questions will help create a broader understanding of relationship behaviours. Please be assured that we will treat your data confidentially. We will only evaluate your answers anonymously’. At the end of the survey, respondents were signposted to the relevant services. The study's design was built with expert and practitioner feedback and based on an extensive literature review. To reduce the risk of self-selection bias, the survey was framed around unhealthy behaviours in intimate relationships rather than explicitly focusing on intimate partner violence.

The survey respondents were recruited from YouGov's German online panel using nonprobability sampling. Panellists were recruited through various sources such as social media advertising; they joined YouGov's online panel voluntarily and provided demographic information to enable targeted sampling. To ensure national demographic representation, YouGov applied post-stratification weighting with quotas derived from the national microcensus of the German Federal Statistics Office. Participants were incentivised with reward points that could be exchanged for donations to a charity of their choice or credits for a gift card, worth approximately €1.11 for completing this 20-minute survey. Response quality was monitored through various checks (e.g., response patterns, speeding) to ensure data reliability.

Demographics

The participants were all adults aged 18 years old or above, with a negatively skewed age distribution (Table 1). The average age was around 49 (SD = 16), with victims of IPV being on average 10 years younger (M = 43, SD = 16) than non-victims (M = 53, SD = 15). The majority of women were partnered (55.6%), either through marriage, civil partnership, or cohabitation. The rest were single (22.3%), divorced (13.6%), or widowed (5.8%). Relationship status at the time of the Wave 1 survey indicated that most women were in a relationship (67.8%) or had been in one within the past year (9.5%), with the vast majority of these relationships being heterosexual (90.3%). The sample distribution across German federal states was consistent with census population data. Most participants (83.1%) had completed at least 10 years of education, such as Realschule, POS, Mittlere Reife, Abitur, or equivalent qualifications. Household income was disclosed by 81.8% of women and was primarily below €5000 per month, with 37.6% earning between €1500 and €3000, 28.8% between €3000 and €5000, and 24.2% earning less than €1500. These income levels reflect, on average, two-person households, whether consisting of two adults or an adult and a child.

Table 1.

Demographic characteristics of full sample.

		Grouped by experience of domestic abuse at least once in the past 12 months
		Missing	Overall	No (0)	Yes (1)
N			3878	2454	1400
Experience of domestic abuse at least once in the past 12 months, n (%)	No (0)	24	2454 (63.7)	2454 (100.0)	
	Yes (1)		1400 (36.3)		1400 (100.0)
Age in years, mean (SD)		0	49.2 (16.1)	52.7 (15.2)	43.2 (16.0)
Marital status, n (%)	Single	82	848 (22.3)	544 (22.6)	296 (21.7)
	Married		1575 (41.5)	945 (39.2)	621 (45.5)
	Civil partnership according to the Civil Partnership Act		44 (1.2)	20 (0.8)	23 (1.7)
	Living with partner		489 (12.9)	246 (10.2)	242 (17.7)
	Living separately		104 (2.7)	55 (2.3)	48 (3.5)
	Divorced		516 (13.6)	409 (17.0)	106 (7.8)
	Widowed		220 (5.8)	191 (7.9)	29 (2.1)
Highest education level, n (%)	Still in school education	84	51 (1.3)	18 (0.7)	33 (2.4)
	Secondary or elementary school certificate		549 (14.5)	380 (15.8)	164 (12.0)
	Realschule or equivalent qualification (POS, Mittlere Reife)		1551 (40.9)	1043 (43.3)	501 (36.7)
	Abitur, entrance qualification for universities of applied sciences		1600 (42.2)	948 (39.3)	648 (47.5)
	No school-leaving qualification		43 (1.1)	22 (0.9)	19 (1.4)
Monthly household net income, n (%)	EUR 0 up to EUR 1500	705	768 (24.2)	547 (27.5)	215 (18.3)
	EUR 1500 up to EUR 3000		1194 (37.6)	769 (38.7)	420 (35.8)
	EUR 3000 up to EUR 5000		915 (28.8)	520 (26.2)	394 (33.6)
	EUR 5000 and more		296 (9.3)	152 (7.6)	143 (12.2)
Household size incl. children and adults, median [Q1, Q3]		0	2.0 [1.0, 3.0]	2.0 [1.0, 2.8]	2.0 [2.0, 3.0]
One or more children under the age of 18 years, n (%)	No	0	3011 (77.6)	2032 (82.8)	961 (68.6)
One or more children under the age of 18 years, n (%)	Yes		867 (22.4)	422 (17.2)	439 (31.4)
One or more children over the age of 18 years, n (%)	No	0	2220 (57.2)	1265 (51.5)	936 (66.9)
One or more children over the age of 18 years, n (%)	Yes		1658 (42.8)	1189 (48.5)	464 (33.1)
Heterosexual relationships in the last 12 months, n (%)	No	880	290 (9.7)	132 (7.9)	146 (11.2)
Heterosexual relationships in the last 12 months, n (%)	Yes		2708 (90.3)	1541 (92.1)	1160 (88.8)
Homosexual relationships in the last 12 months, n (%)	No	880	2884 (96.2)	1626 (97.2)	1239 (94.9)
Homosexual relationships in the last 12 months, n (%)	Yes		114 (3.8)	47 (2.8)	67 (5.1)
Other relationships in the last 12 months, n (%)	No	880	2923 (97.5)	1643 (98.2)	1262 (96.6)
Other relationships in the last 12 months, n (%)	Yes		75 (2.5)	30 (1.8)	44 (3.4)
Federal State in Germany, n (%)	Schleswig-Holstein	0	153 (3.9)	96 (3.9)	56 (4.0)
	Hamburg		112 (2.9)	71 (2.9)	41 (2.9)
	Lower Saxony		341 (8.8)	226 (9.2)	113 (8.1)
	Bremen		46 (1.2)	30 (1.2)	16 (1.1)
	North Rhine-Westphalia		914 (23.6)	582 (23.7)	328 (23.4)
	Hesse		284 (7.3)	166 (6.8)	116 (8.3)
	Rhineland-Palatinate		186 (4.8)	116 (4.7)	69 (4.9)
	Baden-Wuerttemberg		472 (12.2)	267 (10.9)	200 (14.3)
	Bavaria		580 (15.0)	385 (15.7)	190 (13.6)
	Saarland		45 (1.2)	29 (1.2)	16 (1.1)
	Berlin		139 (3.6)	90 (3.7)	48 (3.4)
	Brandenburg		115 (3.0)	82 (3.3)	33 (2.4)
	Mecklenburg-Western Pomerania		83 (2.1)	54 (2.2)	28 (2.0)
	Saxony		204 (5.3)	130 (5.3)	72 (5.1)
	Saxony-Anhalt		108 (2.8)	70 (2.9)	38 (2.7)
	Thuringia		96 (2.5)	60 (2.4)	36 (2.6)

Measures

Dependent variable: The dependent variable was whether the respondent experienced at least one form of physical abuse at least once within the follow-up period of three months. We used the Conflict Tactics Scale (Straus et al., 1996) and Severe Violence Scale (Johnson et al., 2014) to map physical abuse. ‘No’ or 0 indicates no physical abuse, ‘Yes’ or 1 indicates physical abuse.

Independent variables: Abuse experiences at Wave 1 were chosen for feature selection as independent variables. These included the Coercive Control Scale (Johnson et al., 2014) for emotional abuse, the Conflict Tactics Scale (Straus et al., 1996) and the Severe Violence Scale (Johnson et al., 2014) for physical abuse, the Revised Scale of Economic Abuse (Adams et al., 2020) for financial and economic abuse, and the Controlling Partners Inventory (Burke et al., 2011) for digital abuse. The variables were ordinal with the following response options: 0 = never, 1 = yes but not in the past year, 2 = once in the past year, 3 = a few times in the past year, 4 = monthly, 5 = weekly, 6 = (almost) daily.
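The response coding above can be written down as a lookup table, from which a binary past-year indicator can be derived. This is a minimal sketch under stated assumptions: the names `RESPONSE_LABELS`, `past_year_flag` and `any_past_year_abuse` are hypothetical, and treating codes 2 to 6 as 'at least once in the past year' is our reading of the response options rather than a documented step of the study.

```python
# Response codes exactly as listed in the Measures section.
RESPONSE_LABELS = {
    0: "never",
    1: "yes but not in the past year",
    2: "once in the past year",
    3: "a few times in the past year",
    4: "monthly",
    5: "weekly",
    6: "(almost) daily",
}

# Assumption: codes 2-6 all describe abuse within the past year.
PAST_YEAR_MIN_CODE = 2


def past_year_flag(code: int) -> int:
    """Return 1 if the ordinal code indicates abuse in the past year."""
    if code not in RESPONSE_LABELS:
        raise ValueError(f"unknown response code: {code}")
    return int(code >= PAST_YEAR_MIN_CODE)


def any_past_year_abuse(item_codes: list) -> int:
    """Return 1 if any item of a scale was experienced in the past year."""
    return int(any(past_year_flag(code) for code in item_codes))


# Hypothetical respondents answering three items of one scale.
print(any_past_year_abuse([0, 1, 3]))  # one item occurred a few times
print(any_past_year_abuse([0, 1, 1]))  # abuse only before the past year
```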

Data preparation

The data were provided in a clean, structured format and required minimal preprocessing. ‘Don’t know’ and ‘Prefer not to say’ responses were removed from the analysis. This was possible due to the high disclosure rates across questions relating to abusive experiences within the online survey setting.

Statistical analysis

The analysis comprised a descriptive analysis of the prevalence and occurrence of IPV, feature selection via Student's t-test, the Wilcoxon rank-sum test and the chi-squared test, and the application of the selected features to four machine learning algorithms to predict the occurrence of physical abuse within the three-month follow-up period, given any abuse experiences in the past year (see Figure 1).

Figure 1.

Workflow chart of development of IPV risk assessment tool, Lizzy.

Multivariate feature selection methods such as forward selection (FS), backward elimination (BE) and Lasso regression were applied using consensus nested cross-validation (Parvandeh et al., 2020), yielding similar results to univariate selection in terms of emerging predictive features. Given its simplicity compared to FS/BE, univariate selection was chosen as the feature pre-selection method, allowing us to narrow down our search to a practical number of risk factors from a total of 110. Additionally, our focus was to build a predictive model; contributions to general explanatory models of the main factors behind physical abuse recidivism are therefore outside the scope of this article.

The top 3–15 factors were extracted from a list of substantively significant factors. The latter were selected using the Bonferroni-corrected p-value threshold (.0005 for t and Wilcoxon tests, .005 for chi-squared tests) and effect size (Cohen's d ≥ 0.5, Cramer's V ≥ 0.3). After filtering the dataset for the chosen predictors and target, null values were dropped.
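The filtering rule described above combines a corrected p-value threshold with a minimum effect size. The sketch below illustrates the two effect-size measures and the combined filter in plain Python. It is not the study's code: the function names are hypothetical, the data are invented, and the combination of a familywise alpha of .05 with 110 tests is an assumption that happens to give a value close to the reported .0005 threshold. Comparing the absolute value of the effect size against the cut-off is likewise an assumption.

```python
from math import sqrt
from statistics import mean, variance


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def cohens_d(group_a: list, group_b: list) -> float:
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def cramers_v(table: list) -> float:
    """Cramér's V for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi2 / (n * k))


def passes_filter(p_value, effect, p_threshold, effect_threshold) -> bool:
    """Keep a factor only if it is both significant and large enough."""
    return p_value < p_threshold and abs(effect) >= effect_threshold


# Assumed inputs: alpha = .05 spread over 110 candidate factors.
print(round(bonferroni_threshold(0.05, 110), 5))
# Invented data for illustration only.
d = cohens_d([2, 4, 6], [1, 3, 5])
v = cramers_v([[30, 10], [10, 30]])
print(d, v)
print(passes_filter(0.0001, d, 0.0005, 0.5))
```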

For the model development, we performed nested stratified five-fold cross-validation. This method randomly partitions data into five equal-sized folds twice, using four out of the five folds for model training and the remaining fold for model validation. It also ensures the victim/non-victim ratio is equal in the train and test sets. The outer cross-validation layer is meant for model evaluation on a test set that the model has not previously seen. The inner cross-validation layer was used to choose the optimal probability threshold (Youden index). This process results in five sets of evaluation metrics for the inner and outer fold, summarised by their average and standard deviation.
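The two building blocks of this procedure, stratified fold assignment and threshold selection via the Youden index (J = sensitivity + specificity − 1), can be sketched in plain Python. This is an illustrative outline under stated assumptions, not the study's pipeline: the helper names are hypothetical, the labels and predicted probabilities are invented, and model fitting is omitted so that the example stays self-contained.

```python
import random


def stratified_folds(y, k=5, seed=42):
    """Split indices into k folds that approximately preserve the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        members = [i for i, value in enumerate(y) if value == label]
        rng.shuffle(members)
        # Deal the shuffled members of each class round-robin across folds.
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds


def youden_threshold(y_true, y_prob):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1.

    Requires at least one positive and one negative label.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < threshold)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Invented labels: 10 repeat victims (1) and 40 non-victims (0).
labels = [1] * 10 + [0] * 40
outer = stratified_folds(labels, k=5, seed=42)
for test_fold in outer:
    held_out = set(test_fold)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    inner = stratified_folds([labels[i] for i in train_idx], k=5, seed=42)
    # A model would be fitted on four inner folds, its threshold tuned on
    # the fifth, and the tuned model evaluated once on test_fold.

# Invented validation labels and predicted probabilities.
y_val = [0, 0, 0, 0, 1, 0, 1, 1]
p_val = [0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.9]
threshold, j = youden_threshold(y_val, p_val)
print(threshold, round(j, 3))
```

Choosing the threshold only on inner folds keeps the outer test fold untouched until the final evaluation, which is the purpose of the nested design.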

Given the positive skew of abuse data, with the majority of respondents having not experienced IPV, we used the class weights parameter of each model to penalise inaccurate predictions – a method inspired by King and Zeng's paper (2001). The penalty is based on the class prevalences within the training set, so incorrect predictions in the minority class receive a higher penalty than incorrect predictions in the majority class.
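One common way to derive such prevalence-based weights is the 'balanced' heuristic, in which each class is weighted by n_samples / (n_classes × class count). The sketch below applies it to invented labels. Whether the study used exactly this formula is an assumption; the source states only that the penalty is based on class prevalences in the training set.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class inversely to its prevalence in the training labels.

    Uses the common 'balanced' heuristic: n_samples / (n_classes * count).
    """
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Invented training labels: 800 non-victims (0) and 200 repeat victims (1).
train_labels = [0] * 800 + [1] * 200
weights = balanced_class_weights(train_labels)
print(weights)
```

With these invented counts, an error on the minority class is penalised four times as heavily as an error on the majority class.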

A model was considered overfit if the difference between its train and validation AUC scores was greater than 0.02 points. The same random state (42) was used for inner/outer cross-validation and tree-based models. The specific control parameters for each model are in Table 10 in the Appendix. The models were compared using performance metrics such as AUC score, accuracy, specificity, sensitivity and Brier score. The features essential to the model outcome were shown for the best-performing model and the number of predictors. The best-performing model was selected based on a balance between a high average AUC score and a small AUC score range (at most 0.10 points variation) across the five inner cross-validation folds of each outer cross-validation fold. All analyses were conducted using Python version 3.11. Tests of association were run using SciPy (Pedregosa et al., 2011; Virtanen et al., 2020), while model development was done using MLflow (Chen et al., 2020) for MLOps, and the ML configuration manager Hydra (Yadan, 2019). FS/BE experiments were run using MLxtend (Raschka, 2018).
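The decision rules in this paragraph are simple enough to state as code. The following sketch encodes the overfitting rule, the AUC range criterion and the Brier score; the function names and all numerical inputs are hypothetical, and using the absolute train/validation difference is an assumption, since the text does not say whether the difference is signed.

```python
def is_overfit(train_auc: float, val_auc: float, tolerance: float = 0.02) -> bool:
    """Flag a model whose train and validation AUC differ by more than the tolerance."""
    return abs(train_auc - val_auc) > tolerance


def auc_range_ok(fold_aucs: list, max_range: float = 0.10) -> bool:
    """Check that AUC varies by at most max_range across cross-validation folds."""
    return max(fold_aucs) - min(fold_aucs) <= max_range


def brier_score(y_true: list, y_prob: list) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Invented values for illustration only.
print(is_overfit(train_auc=0.90, val_auc=0.85))
print(is_overfit(train_auc=0.86, val_auc=0.85))
print(auc_range_ok([0.82, 0.85, 0.88, 0.84, 0.86]))
print(round(brier_score([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4]), 4))
```

A lower Brier score indicates better-calibrated probabilities, which complements the rank-based view given by AUC.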

Results

Descriptive results

The past-year prevalence of IPV, including emotional, digital, financial, and physical abuse, was 36.1% (1400). Of those 1400 respondents, 1065 (76.1%) participated in the subsequent survey, and 58.8% (823) reported abuse by a partner again within three months; 16.6% (232) experienced at least one instance of physical abuse by a partner by Wave 2.

Table 2 shows the prevalence, average frequency, and count of abuse experienced by type of abuse in Wave 1, and Table 3 shows the same for Wave 2. The ordinal scale applied ranges from once (1), a few times (2), monthly (3) and weekly (4) to daily (5). The findings reveal that the majority of IPV experienced in the sample did not occur in isolation, with individuals reporting on average 1.5 types of abuse (SD = 1.7) over the past year in Wave 1 and 1.2 types of abuse (SD = 1.4) over the past three months by Wave 2. A significant portion of the sample experienced several types of abuse at least once in Wave 1 (averaging 70.38% across abuse types) and at least once in Wave 2 (averaging 30.38% across abuse types), indicating a pattern of multiple interacting types of abuse. Respondents in the sample experienced abuse on average between once and a few times in Wave 1 (over the past year) and during Wave 2 (over the past three months). Moreover, individuals reported experiencing various abuse tactics within the same abuse type, with an average of 2.6 tactics of physical abuse (SD = 3.7) in Wave 1 and 0.8 (SD = 2.5) in Wave 2.

Table 2.

Prevalence of IPV in the past year in the victim sample by type of abuse in Wave 1.

		Missing	Overall
N			1065
Count of abuse type experiences by binary, mean (SD)		0	2.1 (1.4)
At least once digital abuse (binary), n (%)	No		660 (62.0)
At least once digital abuse (binary), n (%)	Yes		405 (38.0)
Digital abuse (frequency), mean (SD)		0	0.6 (1.1)
Digital abuse (count of at least once), mean (SD)		0	1.7 (3.1)
At least once emotional abuse (binary), n (%)	No		134 (12.6)
At least once emotional abuse (binary), n (%)	Yes		931 (87.4)
Emotional abuse (frequency), mean (SD)		0	1.1 (1.1)
Emotional abuse (count of at least once), mean (SD)		0	3.8 (3.9)
At least once financial abuse (binary), n (%)	No		583 (54.7)
At least once financial abuse (binary), n (%)	Yes		482 (45.3)
Financial abuse (frequency), mean (SD)		0	0.6 (1.1)
Financial abuse (count of at least once), mean (SD)		0	2.0 (3.7)
At least once physical abuse (binary), n (%)	No		776 (72.9)
At least once physical abuse (binary), n (%)	Yes		289 (27.1)
Physical abuse (frequency), mean (SD)		0	0.5 (1.0)
Physical abuse (count of at least once), mean (SD)		0	1.2 (2.7)

IPV: intimate partner violence.

Table 3.

Recidivism of IPV in the past three months in the victim sample by type of abuse in Wave 2.

		Missing	Overall
N			1065
Count of abuse type experiences by binary, mean (SD)		0	1.6 (1.6)
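At least once digital abuse (binary), n (%)	No		706 (66.3)
At least once digital abuse (binary), n (%)	Yes		337 (31.6)
At least once digital abuse (binary), n (%)	Missing		22 (2.1)
Digital abuse (frequency), mean (SD)		728	1.2 (1.2)
Digital abuse (count of at least once), mean (SD)		0	1.6 (3.3)
At least once emotional abuse (binary), n (%)	No		389 (36.5)
At least once emotional abuse (binary), n (%)	Yes		662 (62.2)
At least once emotional abuse (binary), n (%)	Missing		14 (1.3)
Emotional abuse (frequency), mean (SD)		403	0.9 (1.0)
Emotional abuse (count of at least once), mean (SD)		0	3.1 (4.1)
At least once financial abuse (binary), n (%)	No		648 (60.8)
At least once financial abuse (binary), n (%)	Yes		395 (37.1)
At least once financial abuse (binary), n (%)	Missing		22 (2.1)
Financial abuse (frequency), mean (SD)		670	1.0 (1.1)
Financial abuse (count of at least once), mean (SD)		0	2.0 (3.8)
At least once physical abuse (binary), n (%)	No		818 (76.8)
At least once physical abuse (binary), n (%)	Yes		224 (21.0)
At least once physical abuse (binary), n (%)	Missing		23 (2.2)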
Physical abuse (frequency), mean (SD)		841	1.3 (1.3)
Physical abuse (count of at least once), mean (SD)		0	1.1 (2.8)

IPV: intimate partner violence.

Results of difference tests

Table 4 shows the top-performing numerical features of physical abuse by t-test statistic. The total number of physical risk factors was limited to three to focus on the effect of proxies on predictive accuracy. The mean values between 1 and 2 indicate that victims of physical abuse in Wave 2 experienced these forms of abuse at least once or a few times in Wave 1. For example, victims of physical abuse at Wave 2 reported that their partners had thrown something at them that could hurt them ‘a few times’ on average by Wave 1 (M = 1.9, SD = 1.94), compared with almost ‘never’ among Wave 1 victims who did not experience physical abuse at Wave 2 (M = 0.29, SD = 0.71).

Table 4.

Top 15 risk factors of physical abuse by t-test on unbalanced sample.

Risk factor	df	Non-victim, mean (SD)	Victim, mean (SD)	t-test	Cohen's d	Point-biserial r
My partner threatened to use physical violence or an object such as a knife or weapon to harm someone close to me. (e.g., friends, family, new partner, children, colleagues, neighbours, etc.)	1026	0.14 (0.56)	1.66 (2.0)	−19.06***	−1.46	−0.51
My partner threw something at me that could hurt me.	1034	0.36 (0.81)	1.97 (1.95)	−18.48***	−1.4	−0.5
My partner threatened to use physical violence or an object such as a knife or weapon to harm my pet(s).	1028	0.1 (0.49)	1.47 (1.92)	−18.35***	−1.4	−0.5
My partner pulled my hair.	1034	0.21 (0.62)	1.61 (1.87)	−18.11***	−1.37	−0.49
My partner used physical force or an object such as a knife or weapon against me.	1032	0.17 (0.56)	1.57 (1.94)	−18.11***	−1.37	−0.49
My partner took out a loan or bought something on credit in my name without my consent.	1035	0.15 (0.6)	1.59 (1.99)	−17.83***	−1.35	−0.48
My partner threatened to publish embarrassing, insulting and threatening posts about me or has done so.	1028	0.15 (0.57)	1.58 (2.13)	−17.04***	−1.3	−0.47
My partner threatened to use physical force or an object such as a knife or weapon to harm me.	1030	0.21 (0.65)	1.61 (1.99)	−17.04***	−1.3	−0.47
My partner forced me to take out a loan or buy something on credit even though I did not want to.	1032	0.18 (0.67)	1.62 (2.03)	−16.99***	−1.29	−0.47
My partner threatened to publish inappropriate photos of me or has done so.	1029	0.14 (0.55)	1.47 (1.97)	−16.95***	−1.29	−0.47
My partner locked me in a room or elsewhere.	1029	0.15 (0.6)	1.53 (2.01)	−16.93***	−1.29	−0.47
My partner sent me threats by email, text message, social media or phone call.	1033	0.29 (0.87)	1.85 (2.05)	−16.83***	−1.28	−0.46
My partner has forced me to spend my money on buying them things or paying their bills even though I didn't want to.	1029	0.32 (0.85)	1.9 (2.11)	−16.82***	−1.28	−0.46
My partner used a (hidden) webcam and/or spyware to monitor my activities.	1013	0.13 (0.67)	1.54 (2.01)	−16.73***	−1.28	−0.47
My partner slapped or hit me.	1025	0.35 (0.76)	1.74 (1.87)	−16.66***	−1.27	−0.46
My partner issued invoices in my name so that I had to pay them.	1031	0.2 (0.68)	1.57 (1.97)	−16.6***	−1.26	−0.46

Note: *p < .05, **p < .01, ***p < .001.

Due to the unequal sample sizes and the possibility of unequal variances, we also conducted the Wilcoxon Rank-Sum Test. The top risk factors by t-test relate to physical abuse, while the top risk factors by Wilcoxon Rank-Sum Test relate to digital abuse. Most features appear in both tests with slight differences in order of effect size. Overall, victims of physical abuse at Wave 1 appear to have frequently experienced various forms of digital, financial and physical abuse.
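The statistics reported in Table 4 can be computed per item along the lines of the sketch below. It is an illustration under assumptions: the data are synthetic ordinal scores, the t-test is the pooled-variance version, and non-victims are coded 1 so that the signs match Table 4 (non-victim minus victim). The helper names `compare_groups` and `r_from_t` are hypothetical. For a pooled-variance t-test, the point-biserial r follows from t and its degrees of freedom as r = t / sqrt(t² + df), which reproduces the first row of Table 4.

```python
import numpy as np
from scipy import stats


def r_from_t(t, df):
    """Point-biserial r implied by a pooled-variance t statistic."""
    return t / np.sqrt(t ** 2 + df)


def compare_groups(non_victim, victim):
    """t-test, Cohen's d, point-biserial r and Wilcoxon rank-sum for one item."""
    non_victim = np.asarray(non_victim, dtype=float)
    victim = np.asarray(victim, dtype=float)
    n0, n1 = len(non_victim), len(victim)
    t_stat, t_p = stats.ttest_ind(non_victim, victim)  # pooled variance
    pooled_sd = np.sqrt(((n0 - 1) * non_victim.var(ddof=1)
                         + (n1 - 1) * victim.var(ddof=1)) / (n0 + n1 - 2))
    cohens_d = (non_victim.mean() - victim.mean()) / pooled_sd
    # Non-victims coded 1 so that the sign of r matches the sign of t.
    labels = np.r_[np.ones(n0), np.zeros(n1)]
    r_pb, _ = stats.pointbiserialr(labels, np.r_[non_victim, victim])
    z_stat, z_p = stats.ranksums(non_victim, victim)  # Wilcoxon rank-sum
    return {"t": float(t_stat), "df": n0 + n1 - 2, "p_t": float(t_p),
            "cohens_d": float(cohens_d), "r_pb": float(r_pb),
            "ranksum_z": float(z_stat), "p_ranksum": float(z_p)}


# Synthetic frequency scores on a 0-5 scale (not the survey data).
rng = np.random.default_rng(0)
non_victim = rng.integers(0, 2, size=200)
victim = rng.integers(0, 6, size=50)
res = compare_groups(non_victim, victim)
print({key: round(value, 3) for key, value in res.items()})

# First row of Table 4: t = -19.06 with df = 1026.
print(round(r_from_t(-19.06, 1026), 2))  # -0.51
```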

All nominal variables were excluded due to low effect size, except for recent partner step-parenthood (‘Has one of your partners taken on the role of a step-parent for one or more of your children in the last 12 months?’), which had a Cramér's V of 0.307 (χ²(1) = 40.04, p < .001). However, since this question is relevant to only 40% (<50%) of our sample, it was also excluded from model development.
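For reference, Cramér's V for a contingency table is V = sqrt(χ² / (n · (min(rows, columns) − 1))). The sketch below computes it with SciPy on hypothetical counts; the helper name `cramers_v` is illustrative, and whether a continuity correction was applied in the study is not stated, so none is applied here.

```python
import numpy as np
from scipy.stats import chi2_contingency


def cramers_v(table):
    """Cramér's V from a contingency table of counts (no Yates correction)."""
    table = np.asarray(table, dtype=float)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


# Hypothetical 2x2 counts: rows = step-parent (no/yes), columns = victim (no/yes).
toy_table = [[10, 20],
             [20, 10]]
v = cramers_v(toy_table)
print(round(v, 3))  # 0.333
```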

Results of model development

To determine the number of risk factors to be input into the models, we first explored the validation AUC values as a function of the number of risk factors. The t-test statistic determined the order of risk factors. Figure 2 indicates that a longer risk assessment is not necessarily a better predictor of the outcome, that is, physical recidivism by an intimate partner. In the validation set, all models except logistic regression peak at seven features, whereas logistic regression peaks at six, with model performance decreasing as more features are added. We see a similar trend in the test set, although models peak at four (RF, XGBoost) and eight (LR, CatBoost) features, respectively (Figure 3).

Figure 2.

Line chart showing validation AUC values as a function of the number of risk factors across models.

Figure 3.

Line chart showing test AUC values as a function of the number of risk factors across models.
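The kind of curve shown in Figures 2 and 3 can be produced by adding features one at a time in order of absolute t-test statistic and recording the cross-validated AUC at each step. The sketch below is a simplified illustration on synthetic data with a single logistic regression; the function names are hypothetical. For brevity the ranking is computed once on the full data, whereas a stricter version would rank features inside each training fold to avoid leakage.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def rank_by_t_statistic(X, y):
    """Column indices ordered from largest to smallest absolute t statistic."""
    t_stat, _ = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
    return np.argsort(-np.abs(t_stat))


def auc_by_feature_count(X, y, max_k):
    """Mean cross-validated AUC when using the top 1, 2, ..., max_k features."""
    order = rank_by_t_statistic(X, y)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    curve = {}
    for k in range(1, max_k + 1):
        model = LogisticRegression(class_weight="balanced", max_iter=1000)
        scores = cross_val_score(model, X[:, order[:k]], y, cv=cv,
                                 scoring="roc_auc")
        curve[k] = float(scores.mean())
    return curve


# Synthetic stand-in data: 10 candidate items, 4 of them informative.
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=1.5, weights=[0.8, 0.2], random_state=1)
curve = auc_by_feature_count(X, y, max_k=10)
best_k = max(curve, key=curve.get)
print(best_k, round(curve[best_k], 3))
```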

When looking at the difference in train and validation AUC scores across each number of questions, we noted that a higher number of questions increased the risk of overfitting (see Figure 4). Although reducing the number of trees and maximum depth of tree-based models to 15 and 3, respectively, decreased the risk of overfitting considerably, models with fewer predictors still overfit (see Figure 5).

Figure 4.

Line chart showing train-validation-test AUC values as a function of risk factors across models.

Figure 5.

Line chart showing train-validation-test AUC values as a function of risk factors across models when the number of trees and maximum depth for tree-based models were reduced.

Setting six as the optimal number of risk factors to reduce the chance of overfitting, we re-selected predictors by pre-filtering on t-test results with a Cohen's d lower than −1, then ranked questions based on a mixture of t-test statistic and percentage prevalence within the sample. This was to ensure that Lizzy could detect as many potential victims as possible, assuming that a higher prevalence rate also indicates a higher disclosure rate in more stressful settings.

To create the ranking, we first scaled the absolute t-test statistic and percentage prevalence on the pre-filtered sample to fall within the 0–1 range using Min–max scaling. Then, we summed the scaled scores to create an absolute ranking, denoted as Ranksum in Table 5.

Table 5.

Top seven risk factors of physical abuse by scaled t-test statistic and sample pct. prevalence.

Question	Ranksum	Pct. prevalence	Pct. prevalence (scaled)	Abs. t-test statistic	Abs. t-test statistic (scaled)
My partner threw something at me that could hurt me.	1.39	14.77	0.49	18.48	0.90
My partner pushed, grabbed or shoved me.	1.37	19.07	0.80	16.52	0.56
My partner threatened to use physical violence or an object such as a knife or weapon to harm someone close to me (e.g., friends, family, new partner, children, colleagues, neighbours, etc.)	1.21	10.89	0.21	19.06	1.00
My partner has forced me to spend my money on buying them things or paying their bills even though I didn't want to.	1.11	14.84	0.50	16.82	0.61
My partner pulled my hair.	1.10	11.68	0.27	18.11	0.84
My partner sent me threats by email, text message, social media or phone call.	1.07	14.30	0.46	16.83	0.62
My partner checked my emails and/or internet browser history and/or call logs.	1.07	18.86	0.79	14.86	0.28
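The ranking behind Table 5 can be sketched as below: both criteria are min-max scaled to the 0–1 range and summed, with optional weights. The helper names and the three-item toy input are hypothetical. The scaled columns in Table 5 were computed over the whole pre-filtered item set, so they cannot be reproduced from the seven rows shown; the Ranksum column is, however, the sum of the two scaled columns up to rounding (for example, 0.49 + 0.90 = 1.39).

```python
import numpy as np


def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())


def rank_sum(abs_t, prevalence, w_t=1.0, w_prev=1.0):
    """Weighted sum of the scaled |t| statistic and scaled prevalence."""
    return w_t * min_max(abs_t) + w_prev * min_max(prevalence)


# Hypothetical values for three candidate questions.
abs_t = [19.0, 17.0, 15.0]
prevalence = [10.0, 20.0, 15.0]
scores = rank_sum(abs_t, prevalence)
order = np.argsort(-scores).tolist()
print(scores.tolist())  # [1.0, 1.5, 0.5]
print(order)            # [1, 0, 2]
```

Equal weights correspond to `w_t = w_prev = 1.0`; other weightings can be explored by changing these two arguments.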

From the top seven risk factors by t-test statistic and percentage prevalence (Table 5), two models were further tested: one with three physical risk factors (including threats) and one with four, both totalling six risk factors. The model with the best validation AUC score (Table 6) included all of the aforementioned risk factors except for ‘My partner pulled my hair’. This ranking method was also applied using different weightings of the t-test statistic and pct. prevalence, but equal weighting gave the best results.

Table 6.

Evaluation of the model's average performance across configurations.

	Logistic regression	Random Forest	CatBoost classifier	Extreme gradient boosting
Avg. accuracy	0.82	0.828	0.797	0.812
Avg. AUC	0.846	0.840	0.841	0.841
Avg. PR AUC	0.694	0.677	0.679	0.678
Avg. AUC (train/mean)	0.850	0.860	0.861	0.881
Avg. AUC (train/std)	0.008	0.007	0.008	0.008
Avg. AUC (val/mean)	0.839	0.836	0.836	0.828
Avg. AUC (val/std)	0.034	0.035	0.032	0.032
Avg. probability threshold (Youden index)	0.491	0.484	0.509	0.523
Avg. probability threshold (val/mean)	0.491	0.484	0.509	0.459
Avg. probability threshold (val/std)	0.130	0.119	0.163	0.153
Avg. sensitivity (TPR)	0.620	0.650	0.650	0.630
Avg. specificity (TNR)	0.870	0.870	0.830	0.860
Avg. specificity CI (low)	0.808	0.813	0.767	0.794
Avg. specificity CI (high)	0.917	0.922	0.888	0.908
Avg. sensitivity CI (low)	0.459	0.479	0.483	0.464
Avg. sensitivity CI (high)	0.772	0.789	0.793	0.776
Avg. Brier score	0.141	0.144	0.146	0.142

We compared the four algorithms using the average Accuracy, AUC, Recall (Sensitivity), Specificity, Precision, and Brier score across model configurations by algorithm. The average test accuracy score of the Logistic Regression, CatBoost, Random Forest, and XGBoost algorithms for our sample was 0.82, 0.80, 0.83 and 0.81, respectively, as shown in Table 6. The logistic regression algorithm performed best on AUC and Brier scores, with the highest AUC score of 0.85 and the lowest Brier score of 0.141, indicating better calibration. The Random Forest algorithm performed best with an accuracy of 0.83, but the train-validation AUC scores showed slight overfitting, with a ∼0.03-point difference. Regarding AUC, all models had a score between 0.84 and 0.85, with logistic regression performing best and Random Forest performing the worst. Figure 6 displays the Receiver Operating Characteristic curves for each model. The predictors were also associated with a high internal consistency with a Cronbach's alpha of 0.894 [0.884, 0.904]. When using only the non-physical predictors from the six predictors, the performance of the logistic regression drops to 0.78 AUC.
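As a worked check, the overfitting rule from the Methods (a train-validation AUC gap greater than 0.02) can be applied directly to the "Avg. AUC (train/mean)" and "Avg. AUC (val/mean)" rows of Table 6. The snippet is plain arithmetic on the published averages; applying the rule to averages rather than to individual folds is an assumption made for illustration.

```python
# Average AUC values copied from Table 6.
train_auc = {"Logistic regression": 0.850, "Random Forest": 0.860,
             "CatBoost": 0.861, "XGBoost": 0.881}
val_auc = {"Logistic regression": 0.839, "Random Forest": 0.836,
           "CatBoost": 0.836, "XGBoost": 0.828}

OVERFIT_GAP = 0.02  # threshold stated in the Methods

gaps = {name: round(train_auc[name] - val_auc[name], 3) for name in train_auc}
overfit = {name: gap > OVERFIT_GAP for name, gap in gaps.items()}

for name in train_auc:
    print(f"{name}: gap={gaps[name]:.3f}, overfit={overfit[name]}")
```

On these averages, only logistic regression stays within the 0.02 margin (gap 0.011); the three tree-based models exceed it, with XGBoost showing the largest gap (0.053).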

Figure 6.

ROC curve (average) across the five outer cross-validation folds with k = 6.

Discussion

To build Lizzy, a novel IPV risk assessment tool based on nationally representative data, we deviated from existing actuarial methodologies in two critical ways. First, we carried out a comprehensive evaluation of 18 machine learning algorithms, expanding beyond the limited model sets typically employed in risk assessment. Second, we used a feature selection approach integrating both univariate and multivariate methods. Our methodology leverages parametric and non-parametric statistical tests (t-test, Wilcoxon Rank-Sum test), regularisation techniques (Lasso regression), and stepwise selection methods (forward/backward elimination). Validation AUC was examined as a function of the number of predictors. We found that a longer risk assessment is not necessarily better for predictive validity and chose six as the number of predictors to be input into the models. This is significant as longer risk assessments can be difficult to administer due to the psychological burdens imposed on victims and the time pressures faced by practitioners. Accuracy scores ranged from 0.80 to 0.83 and AUC values from 0.84 to 0.85, with XGBoost performing the worst (due to overfitting) and logistic regression performing the best. Our findings also demonstrate that multiple linear regression models using representative data can outperform conventional tools by >0.20 points and surpass the minimum threshold set by Lamb of 0.70. This is important as DA risk assessment tools are increasingly being used in criminal justice settings (CJS) and social services to inform and determine outcomes (Viljoen et al., 2018).

The top six factors identified as most relevant by feature importance were:

1. A partner threw something at them that could hurt them.

2. A partner pushed, grabbed, or shoved them.

3. A partner threatened to use physical violence or an object such as a knife or weapon to harm someone close to them.

4. A partner forced them to spend their money on buying things or paying bills even if they didn’t want to.

5. A partner sent them threats by email, text message, social media, or phone calls.

6. A partner checked their emails, internet browser history, or call logs.

Among the predictors tested, the most significant predictors of future physical violence therefore appear to be a combination of prior physical abuse and coercive control. One key benefit of this research is the identification of proxies. Risk assessments can produce false negatives where victims are unwilling to disclose the abuse suffered due to fear of criminal justice or social service interventions, where they do not perceive themselves to be victims of certain crimes, or where they do not wish to disclose victimisation (Felson and Paré, 2005; Hassanpour et al., 2025). Consequently, using various forms of abuse across different levels of severity as indicators of future risk may increase the likelihood that physical abuse is predicted from the behaviours that are disclosed. This identification is essential for protecting victims.

Another significant shift emerging from this study, compared to conventional risk assessments, is the adoption of a shorter follow-up period of three months between data collection. This timeframe is informed by research which suggests that help-seeking might be linked to shorter durations of IPV, with most victims seeking help within the first three months after their initial experience of violence (Trafford and Le, 2024). Trafford and Le's analysis also shows that nearly half of abusive relationships last less than two years. It is therefore unlikely that tools predicting that a victim may be revictimised within the next 5–8 years provide sufficiently actionable information to improve emergency workers’ case management. In contrast, a three-month horizon aligns more closely with the behaviour patterns of victim populations and offers a critical window for effective intervention.

Looking ahead, pilot tests and consultations with practitioners and victims in the field and across different settings are needed to further develop Lizzy into a tool that, in addition to providing accurate risk assessments, can guide helpful risk- and needs-based risk management strategies (Graham et al., 2021; Jose Medina Ariza et al., 2016; Robinson et al., 2018; Spivak et al., 2021). This would reduce the likelihood of unintended consequences and help understand how the tool could affect risk management processes already in place. Furthermore, because the data were drawn from a nationally representative sample, and because Trafford and Le's (2024) research has shown that there are subtly different typologies among the victims who approach various settings, there is a chance that Lizzy's accuracy would change in setting-specific environments.

Limitations

There were some limitations in the present study. Firstly, respondents might have been inhibited from fully disclosing their IPV experience due to the topic's sensitivity. However, the online research design reduces the potential for social desirability bias (Sperber et al., 2023). Additionally, using a nationally representative sample increases the validity and reliability of the data collected and relied upon. This is because it avoids inaccuracies and biases prevalent in CJS data (Jose Medina Ariza et al., 2016) and avoids focusing on a targeted sub-sample of victims (Graham et al., 2021). Yet, using a nationally representative sample can also raise limitations. For example, prior research conducted by the authors has shown that victims who approach one setting are often slightly different to victims who approach another. This means that whilst Lizzy may be effective on a nationally representative sample, determining accuracy and applicability to specific settings (such as within criminal justice agencies and health services) will require further validation studies to understand and evaluate the impact of setting-specific targets and practitioner usage for risk assessment tools. Limitations also arise concerning the forms of data that can be collected at a national level. Due to limitations on data collection in Germany (Weider, 2025), it has not been possible to collect data on the ethnicity of victims and their abusers or to determine the accuracy of Lizzy across ethnicities.

Additionally, the choice of predictors is constrained to information available to the victim. This precludes characteristics about the perpetrators, such as psychological variables and other predictors, which are known to be strongly correlated with violent IPV and violent recidivism (Garcia-Vergara et al., 2022; Gerino et al., 2018; Yakubovich et al., 2018). This was a deliberate choice, as our aim in creating Lizzy was to develop an accurate, evidence-based, yet user-friendly DA risk assessment tool that frontline workers can easily and quickly administer without requiring substantial prior training or access to data from other agencies. Similarly, we chose to limit Lizzy to cases involving female victims and male perpetrators to reflect the disproportionately high rate at which IPV is reported among women. To the authors’ knowledge, no validated risk assessment for male victims or victims in same-sex relationships exists to date. However, since none of the questions are gender-specific, there is a strong case for validating the tool on more diverse samples. We also chose not to include sexual abuse as a potential predictor in the feature selection procedure due to the potential inhibitive impact on victims’ disclosure of abuse (Felson and Paré, 2005; Hassanpour et al., 2025). Future research could consider how direct questions about sexual abuse impact disclosure rates and/or violence prediction.

Finally, like many other risk assessment tools, the current version of Lizzy does not provide guidance on risk management strategies tailored to different risk levels (Lamb et al., 2022). However, access to data on emotional, financial and digital abuse enables the development of multiple models by type of abuse. These models could assist frontline workers in establishing risk profiles that differentiate by type of abuse, thereby contributing to more risk- and needs-driven risk management (Viljoen et al., 2018). Whilst comparisons to other risk assessment tools are difficult due to the varying reporting levels in construction studies (Graham et al., 2021), the creation of Lizzy builds upon and enhances current conventions by utilising a nationally representative sample, combined with a prospective design, to harness the power of AI.

Conclusion

In conclusion, our study utilised machine learning techniques and nationally representative online survey data to identify IPV risk factors in Germany and build an IPV risk assessment tool, which we have called Lizzy. The predictors were selected via t-test with a minimum absolute Cohen's d of 1 and were associated with a high internal consistency, with a Cronbach's α of 0.894 [0.884, 0.904]. The logistic regression algorithm performed best on AUC and Brier scores, with the highest AUC score of 0.85 and the lowest Brier score of 0.141, indicating better calibration. Average sensitivity was 0.64 and average specificity was 0.86 at the cut-off point given by the Youden index. The results demonstrate that machine learning classifiers, such as logistic regression, perform well in predicting IPV recidivism and that a longer risk assessment tool does not necessarily yield better predictive performance. This supports the case for developing short, yet accurate, tools to predict violent recidivism in IPV cases. It also demonstrates that proxies are effective in predicting the re-occurrence of physical abuse in female victim populations alongside physical risk factors.

Supplemental Material

Supplemental material, sj-docx-1-euc-10.1177_14773708251412637 for Beyond physical violence: A machine learning framework for predicting IPV victimisation using multidimensional predictors by Ba Linh Le, Lucy Trafford, Sabina Firtala and Babatunde Williams in European Journal of Criminology

Footnotes

ORCID iDs

Ba Linh Le

Lucy Trafford

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the German Federal Ministry of Economy and Climate Action and the European Social Fund (Grant No. 03EGSBE582).

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental material

Supplemental material for this article is available online.

References

Adams, Greeson, Littwin, et al. (2020) The revised scale of economic abuse (SEA2): Development and initial psychometric testing of an updated measure of economic abuse in intimate relationships. Psychology of Violence 10(3): 268–278.

BKA (2024) Häusliche Gewalt. Lagebild zum Berichtsjahr 2023 (Bundeslagebild Häusliche Gewalt). Bundeskriminalamt. Available at: https://www.bka.de/SharedDocs/Downloads/DE/Publikationen/JahresberichteUndLagebilder/HaeuslicheGewalt/HaeuslicheGewalt2023.html?nn=219004.

Burke, Wallen, Vail-Smith, et al. (2011) Using technology to control intimate partners: An exploratory study of college undergraduates. Computers in Human Behavior 27(3): 1162–1167.

Campbell (1986) Nursing assessment for risk of homicide with battered women. Advances in Nursing Science 8(4): Article 4. 10.1097/00012272-198607000-00006.

Chen, Chow, Davidson, et al. (2020) Developments in MLflow: A system to accelerate the machine learning lifecycle. In: Proceedings of the fourth international workshop on data management for end-to-end machine learning, pp. 1–4. 10.1145/3399579.3399867.

Devries, Mak, Bacchus, et al. (2013) Intimate partner violence and incident depressive symptoms and suicide attempts: A systematic review of longitudinal studies. PLoS Medicine 10(5): e1001439.

Dhamnetiya, Jha, Shalini, et al. (2022) How to analyze the diagnostic performance of a new test? Explained with illustrations. Journal of Laboratory Physicians 14(01): 090–098.

Douglas and Skeem (2005) Violence risk assessment: Getting specific about being dynamic. Psychology, Public Policy, and Law 11(3): 347–383.

EFRA (2022) Bias in algorithms: Artificial intelligence and discrimination. Publications Office. Available at: https://data.europa.eu/doi/10.2811/25847.

EIGE (2019) Risk assessment and management of intimate partner violence in the EU. Publications Office. Available at: https://data.europa.eu/doi/10.2839/39960.

Euser, Zoccali, Jager, et al. (2009) Cohort studies: Prospective versus retrospective. Nephron Clinical Practice 113(3): c214–c217.

Felson and Paré (2005) The reporting of domestic violence and sexual assault by nonstrangers to the police. Journal of Marriage and Family 67(3): 597–610.

Fitzgerald and Graham (2016) Assessing the risk of domestic violence recidivism. NSW Bureau of Crime Statistics and Research 189: 2–12. Available at: https://eprints.qut.edu.au/127945/1/Report-2016-Assessing-the-risk-of-domestic-violence-recidivism-cbj189.pdf.

Garcia-Vergara, Almeda, Fernández-Navarro, et al. (2023) Artificial intelligence extracts key insights from legal documents to predict intimate partner femicide. Scientific Reports 13(1): 18212.

Garcia-Vergara, Almeda, Ríos, et al. (2022) A comprehensive analysis of factors associated with intimate partner femicide: A systematic review. International Journal of Environmental Research & Public Health 22: 1–22.

Gerino, Caldarera, Curti, et al. (2018) Intimate partner violence in the golden age: Systematic review of risk and protective factors. Frontiers in Psychology 9: 1595.

Graham, Sahay, Rizo, et al. (2021) The validity and reliability of available intimate partner homicide and reassault risk assessment tools: A systematic review. Trauma, Violence, & Abuse 22(1): 18–40.

Hassanpour, Buchwald, Mehta AHP, et al. (2025) Sexual violence and shame: A meta-analysis. Trauma, Violence, & Abuse 27(1): 240–255.

Hilton (2021) Domestic Violence Risk Assessment: Tools for Effective Prediction and Management. 2nd ed. Washington DC: American Psychological Association. 10.1037/0000223-000.

Hilton, Harris, Rice, et al. (2004) A brief actuarial assessment for the prediction of wife assault recidivism: The Ontario domestic assault risk assessment. Psychological Assessment 16(3): 267–275.

Hui, Constantino and Lee (2023) Harnessing machine learning in tackling domestic violence—An integrative review. International Journal of Environmental Research and Public Health 20(6): 4984.

Johnson and Leone (2014) Intimate terrorism and situational couple violence in general surveys. Violence Against Women 20(2): 186–207.

Jose Medina Ariza, Robinson and Myhill (2016) Cheaper, faster, better: Expectations and achievements in police risk assessment of domestic abuse. Policing (Bradford, England) 10(4): 341–350.

Kebbell (2019) Risk assessment for intimate partner violence: How can the police assess risk? Psychology, Crime & Law 25(8): 829–846.

King and Zeng (2001) Logistic regression in rare events data.

Knox, Lowe and Mummolo (2020) Administrative records mask racially biased policing. American Political Science Review 114(3): 619–637.

Kropp (2008) Development of the spousal assault risk assessment guide (SARA) and the brief spousal assault form for the evaluation of risk (B–SAFER). In: Baldry AC and Winkel FW (eds) Intimate Partner Violence Prevention and Intervention: The Risk Assessment and Management Approach. Nova Science, 19–31.

Lamb, Forsdike, Humphreys, et al. (2022) Drawing upon the evidence to develop a multiagency risk assessment and risk management framework for domestic violence. Journal of Gender-Based Violence 6(1): 173–208.

Lauria (2020) The risk assessment and management of intimate partner violence in an Australian policing context. PhD Thesis. Swinburne University of Technology.

Lobo, Jiménez-Valverde and Real (2008) AUC: A misleading measure of the performance of predictive distribution models. Global Ecology and Biogeography 17(2): 145–151.

López-Ossorio, González-Álvarez and Andrés-Pueyo (2016) Eficacia predictiva de la valoración policial del riesgo de la violencia de género. Psychosocial Intervention 25(1): 1–7.

López-Ossorio, González-Álvarez, Muñoz Vicente, et al. (2019) Validation and calibration of the Spanish police intimate partner violence risk assessment system (VioGén). Journal of Police and Criminal Psychology 34(4): 439–449.

Mason and Roberta (2009) Analysis of the Tasmania Police Risk Assessment Screening Tool (RAST). Tasmanian Institute of Law Enforcement Studies. Available at: https://www.safeathome.tas.gov.au/__data/assets/pdf_file/0009/567450/RAST_Report_Analysis_of_Risk_Assessment_Screening_Tool.pdf.

Messing (2019) Risk-informed intervention: Using intimate partner violence risk assessment within an evidence-based practice framework. Social Work 64(2): 103–112.

Messing and Thaller (2013) The average predictive validity of intimate partner violence risk assessment instruments. Journal of Interpersonal Violence 28(7): 1537–1558.

Myhill, Hohl and Johnson (2023) The ‘officer effect’ in risk assessment for domestic abuse: Findings from a mixed methods study in England and Wales. European Journal of Criminology 20(3): 856–877.

OECD (2017) Violence against women indicator [Data set]. OECD. 10.1787/f1eb4876-en.

Parvandeh, Yeh H-W, Paulus, et al. (2020) Consensus features nested cross-validation. Bioinformatics (Oxford, England) 36(10): 3093–3098.

Pedregosa, Varoquaux, Gramfort, et al. (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12: 2825–2830.

40.

Quijano-Sánchez, Liberatore, Rodríguez-Lorenzo, et al. (2021) A twist in intimate partner violence risk assessment tools: Gauging the contribution of exogenous and historical variables. Knowledge-Based Systems 234: 107586.

41.

Raschka (2018) MLxtend: Providing machine learning and data science utilities and extensions to Python’s scientific computing stack. Journal of Open Source Software 3(24): 38.

42.

Rice and Harris (2005) Comparing effect sizes in follow-up studies: ROC area, Cohen’s d, and r. Law and Human Behavior 29(5): 615–620.

43.

Robinson (2011) Risk and intimate partner violence. In: Kemshall and Wilkinson (eds) Good Practice in Assessing Risk: Current Knowledge, Issues and Approaches. London, UK: Jessica Kingsley Publishers, 119–138.

44.

Robinson, Pinchevsky and Guthrie (2018) A small constellation: Risk factors informing police perceptions of domestic abuse. Policing and Society 28(2): 189–204.

45.

Roehl, O’Sullivan, Webster, et al. (2005) Intimate partner violence risk assessment validation study: The RAVE study practitioner summary and recommendations: Validation of tools for assessing risk from violent intimate partners (515672006-001) [Data set]. American Psychological Association. 10.1037/e515672006-001.

46.

Schröttle and Müller (2004) Lebenssituation, Sicherheit und Gesundheit von Frauen in Deutschland. Eine repräsentative Untersuchung zu Gewalt gegen Frauen in Deutschland. BMFSFJ. Available at: https://www.bmfsfj.de/bmfsfj/studie-lebenssituation-sicherheit-und-gesundheit-von-frauen-in-deutschland-80694.

47.

Seewald, Rossegger, Urbaniok, et al. (2017) Assessing the risk of intimate partner violence: Expert evaluations versus the Ontario domestic assault risk assessment. Journal of Forensic Psychology Research and Practice 17(4): 217–231.

48.

Seghier (2022) Ten simple rules for reporting machine learning methods implementation and evaluation on biomedical data. International Journal of Imaging Systems and Technology 32(1): 5–11.

49.

Sperber, Bor, Fang, et al. (2023) Face-to-face interviews versus internet surveys: Comparison of two data collection methods in the Rome foundation global epidemiology study: Implications for population-based research. Neurogastroenterology & Motility 35(6): e14583.

50.

Spivak, McEwan, Luebbers, et al. (2021) Implementing evidence-based practice in policing family violence: The reliability, validity and feasibility of a risk assessment instrument for prioritising police response. Policing and Society 31(4): 483–502.

51.

Stansfield and Williams (2014) Predicting family violence recidivism using the DVSI-R: Integrating survival analysis and perpetrator characteristics. Criminal Justice and Behavior 41(2): 163–180.

52.

Statista (n.d.) Spain: Immigrant population by nationality 2023. Statista. Available at: https://www.statista.com/statistics/445784/foreign-population-in-spain-by-nationality/ (accessed 17 September 2024).

53.

Straus, Hamby, Boney-McCoy, et al. (1996) The revised conflict tactics scale. Journal of Family Issues 17: 283–316.

54.

Swets, Dawes and Monahan (2000) Psychological science can improve diagnostic decisions. Psychological Science in the Public Interest 1(1): 1–26.

55.

Tjaden and Thoennes (2000) Prevalence and consequences of male-to-female and female-to-male intimate partner violence as measured by the national violence against women survey. Violence Against Women 6(2): 142–161.

56.

Trafford and Le (2024) Victim-help seeking patterns and how they can inform future support services for victims of intimate partner violence (IPV). 10.31219/osf.io/gp4vz.

57.

Turner, Brown and Medina-Ariza (2022) Predicting domestic abuse (fairly) and police risk assessment. Psychosocial Intervention 31(3): 145–157.

58.

Van der Put, Gubbels and Assink (2019) Predicting domestic violence: A meta-analysis on the predictive validity of risk assessment tools. Aggression and Violent Behavior 47: 100–116.

59.

Viljoen, Cochrane and Jonnson (2018) Do risk assessment tools help manage and reduce risk of violence and reoffending? A systematic review. Law and Human Behavior 42(3): 181–214.

60.

Virtanen, Gommers, Oliphant, et al. (2020) SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods 17: 261–272.

61.

Weider (2025) Let’s talk about race. Diversität und race an Universitäten zwischen Erinnerung und Datenschutz. ZDfm – Zeitschrift für Diversitätsforschung und -management 10(1): 22–37.

62.

Weis, Görgen, Käsmayr, et al. (2016) Evaluation der Fortführung und Erweiterung des Pilotprojektes High Risk in Rheinland-Pfalz [Universität Koblenz-Landau]. Available at: https://mffki.rlp.de/fileadmin/07/Dokumente/Themen/Frauen/Downloads/Dokumentationen_und_Studien/HRM_Studie_II_Kurzbericht.pdf.

63.

Weißer Ring (2021) Tödliche Gewalt gegen Frauen. Es geschieht an jedem dritten Tag. Forum Opferhilfe 44(04): Article 04. Available at: https://weisser-ring.de/system/files/domains/weisser_ring_dev/downloads/wer21-04forum-opferhilfemagazinweb.pdf.

64.

Whitton, Newcomb, Messinger, et al. (2019) A longitudinal study of IPV victimization among sexual minority youth. Journal of Interpersonal Violence 34(5): 912–945.

65.

WHO (2024, March 25) Factsheets on Violence against Women. Geneva, Switzerland: World Health Organization (WHO). Available at: https://www.who.int/news-room/fact-sheets/detail/violence-against-women.

66.

Williams (2012) Family violence risk assessment: A predictive cross-validation study of the domestic violence screening instrument-revised (DVSI-R). Law and Human Behavior 36(2): 120–129.

67.

Yadan (2019) Hydra - A framework for elegantly configuring complex applications [Computer software]. Meta. Available at: https://github.com/facebookresearch/hydra.

68.

Yakubovich, Stöckl, Murray, et al. (2018) Risk and protective factors for intimate partner violence against women: Systematic review and meta-analyses of prospective–longitudinal studies. American Journal of Public Health 108(7): e1–e11.