Abstract
There is considerable interest in studying the impact of major life events (e.g., marriage, job loss) on people’s lives. This line of research is inherently causal: Its goal is to study whether life events cause changes in the examined outcomes. However, because major life events cannot be randomly assigned, studies in this area necessarily rely on longitudinal observational data. In this article, we provide guidelines for researchers interested in studying life events in an explicitly causal framework. Although focused on life-event studies for substantive context, many recommendations also apply to longitudinal observational studies more broadly. We begin by emphasizing the importance of clearly specifying the causal estimand and describe conditions in which the defined causal estimand can be identified. Then, we discuss the features and challenges of the two main analytical approaches to causal inference in life-event studies: difference-in-difference designs with a (matched) comparison group that attempt to separate event-related changes from normative changes and within-person designs that control for all time-invariant person-level confounders. We describe how the desired causal effect can be estimated in these designs and provide recommendations for when to apply each modeling strategy. In addition, we present methods for conducting sensitivity analysis, probing the robustness of the estimated causal effects, and evaluating the generalizability of the results. We conclude by describing how new specialized panel studies can be designed to examine the impact of various life events in more controlled settings.
People get married, become (grand)parents, and start their dream job, but they also experience breakups, the death of loved ones, and involuntary job losses. Such major life events can be defined as “time-discrete transitions that mark the beginning or the end of a specific status” (Luhmann, Hofmann, et al., 2012, p. 594). Life events can be primarily exogenous, for example, when a child is killed in an automobile accident (Lehman et al., 1987). Or life events can have a large endogenous component, such as when individuals get divorced (van Scheppingen & Leopold, 2020). Studying how life events affect people has been an active area of research in diverse areas of psychology in recent years. For example, researchers have examined how life events affect subjective well-being (for a meta-analysis, see Luhmann, Hofmann, et al., 2012), personality traits (for a meta-analysis, see Bühler et al., 2024), mental and physical health (e.g., Asselmann, Garthus-Niegel, Knappe, & Martini, 2022; Lawes et al., 2022), loneliness (Buecker et al., 2021), optimism (Chopik et al., 2020), spirituality (Trutzenberg & Eid, 2024), and perceived social support (Asselmann, Garthus-Niegel, & Martini, 2022). The goal of life-events research is inherently causal: It studies whether life events cause changes in the examined outcomes. Yet the traditional gold standard method of estimating causal effects, the randomized experiment, is not feasible in the context of life-event studies because ethical and practical issues preclude random assignment of individuals to specific life events (West et al., 2008). Consequently, life-event research has to rely on observational data in which individuals are exposed or not exposed to specific life events.
Most life-event studies are prospective in nature, meaning that the outcome is assessed at least once before the life event occurred. In contrast, in retrospective life-event studies, the first assessment occurs after the life event occurred. In this article, we cover only prospective life-event studies (for a discussion of issues in retrospective studies, see Holland & Rubin, 1988). Many prospective life-event studies are based on existing panel studies, such as the German Socio-Economic Panel (Wagner et al., 2007; for an example, see Krämer & Rodgers, 2020), the Swiss Household Panel (Tillmann et al., 2022; for an example, see Anusic et al., 2014), the Household, Income, and Labour Dynamics in Australia panel (Watson & Wooden, 2004; for an example, see Hentschel et al., 2017), the British Household Panel Survey (University of Essex, Institute for Social and Economic Research, 2018; for an example, see Yap et al., 2012), or the Dutch LISS panel (Scherpenzeel & Das, 2010; for an example, see Reitz et al., 2022). All of these studies repeatedly interview many individuals using a common set of items over multiple years. These studies typically assess a wide range of psychological constructs (e.g., personality, well-being) and ask their respondents whether they experienced any major life events since the last interview. Based on these data, researchers can track how various constructs change before and after major life events. Event-related changes in the focal outcome that occur after the life event are generally called
Historically in psychology, there have been strong admonitions against the use of causal analysis in observational studies, with advice to substitute causal language with scientific euphemisms such as the life event is “associated with” or “predicts” the outcome. Over the past few decades, major advances have occurred in the understanding of causal inference in computer science (e.g., Pearl, 2009), econometrics (e.g., Imbens, 2024), epidemiology (e.g., Hernán & Robins, 2024), and statistics (e.g., Rosenbaum, 2020; Rubin, 2007). Comparisons of randomized experiments and well-designed observational studies sharing the same treatment and control groups have shown that they frequently lead to comparable estimates of causal effects (e.g., Cook et al., 2020; Keller et al., 2024). Given these developments, calls for the use of causal thinking in the design and causal language in the write-up of observational studies have occurred in public health (Hernán, 2018), medicine (Dahabreh & Bibbins-Domingo, 2024), and psychology (Grosz et al., 2020). In this article, we review these new developments in design and analysis with the goal of presenting a guide to strengthen causal inference in longitudinal research on the effects of life events. As a running example, we often use the research question, “How does the birth of the first child affect life satisfaction?” (Dyrdal & Lucas, 2013; Krämer & Rodgers, 2020; Yap et al., 2012) to clearly convey the ideas presented. Although we focus on life-event studies to provide a specific substantive context, most of our recommendations will also apply to longitudinal observational studies more broadly.
We commence this article by stressing the importance of clearly defining the causal estimand. Then, we describe conditions in which the defined causal estimand can be identified. We proceed by discussing the features and challenges of the two main design approaches to estimate causal effects in longitudinal observational studies: difference-in-difference designs with a (matched) comparison group that attempt to separate event-related changes from normative changes and within-person designs that control for all time-invariant person-level confounders. We then describe methods for conducting sensitivity analyses that probe the robustness of the estimated causal effects and methods for evaluating the generalizability of the results. We conclude by describing how new panel studies could be designed to deepen understanding of how various life events affect individuals.
Step 1: Defining the Causal Estimand
The first step in a life-event study should be to clearly define the causal estimand—the specific quantity or parameter that represents the causal effect of interest in the study. A clear definition of the causal estimand enables researchers to pose well-defined research questions without reference to any specific statistical model (Lundberg et al., 2021; Rohrer & Murayama, 2023). Only after specifying the estimand should researchers think about developing a suitable method to estimate their defined target quantity. The definition of the causal estimand in life-event studies requires specification of the following four entities: the causal contrast of interest, the focal outcome, the time lag of the causal effect, and the target population (Diener et al., 2022; Hernán & Robins, 2024; Lundberg et al., 2021). We consider each of these four entities in turn below. Moreover, in Box 1, we present guiding questions for the definition of the causal estimand that researchers should try to consider and transparently answer in their life-event studies.
Box 1. Guiding Questions to Define the Causal Estimand in Life-Event Studies
Defining the Causal Contrast
To clearly define the causal contrast, it can be helpful to think about the randomized experiment the researcher would have conducted if no practical or ethical concerns were present. This hypothetical experiment is sometimes referred to as the
Defining the life event of interest
On the surface, defining the life event of interest seems straightforward: The occurrence of the focal life event needs to be observed and coded (e.g., by using event checklists or observing a status change over time).¹ What is often overlooked in this context, however, is that the same life events can refer to distinctly different exposures. For example, a repeatedly occurring life event can have different effects depending on whether it was the first, second, or third occurrence (Luhmann & Eid, 2009): For parents, the birth of their first child is likely a different event than the birth of their third child. Furthermore, the effects of life events can differ depending on whether their occurrence is normative or nonnormative (Luhmann et al., 2014): Becoming a parent in one’s 20s or 30s is normative in Western societies and will likely represent a different event than becoming a parent at age 15. Individuals may further differ in how they perceive different life events (Haehner et al., 2023; Luhmann et al., 2021): A pregnancy can be expected or unexpected, desired or undesired.
To identify the causal effects of interest, it is crucial that there is only a single well-defined version of the exposure (see section “Step 2: Identifying the Causal Effect”). Careful consideration is therefore needed to determine potential dimensions on which the experience of the focal life event might differ between individuals. To minimize unwanted variation in the exposure (i.e., the experience of the focal life event), researchers can specify eligibility criteria for their study (Hernán & Robins, 2016; Moreno-Betancur, 2021). Paralleling experimental studies, these eligibility criteria need to be defined in terms of characteristics that are measured before the exposure could have had an effect. Life-event researchers should always make it transparent how they define the exposure and the eligibility criteria. For example, researchers could define the exposure as “voluntarily becoming first-time biological parents between the ages of 20 and 35” and consider only individuals who fulfill these criteria.
Defining the comparison condition
After defining the life event of interest (i.e., the exposure), the comparison condition to which this exposure should be contrasted needs to be specified. In life-event studies, individuals who experienced the life event (i.e., event group) are commonly compared with a comparison group of individuals who could have potentially been exposed but were not exposed to the life event. We term the contrast between the event group and the comparison group a
An alternative approach is to define a within-person contrast by examining how an individual’s outcomes change after experiencing the focal life event compared with the individual’s own outcome levels before the event occurred. A key assumption of this approach is that the level of the outcome at pretest assessment (i.e., before the life event occurred) resembles the level of the outcome that would have been observed at posttest assessment if that person had not experienced the life event. This assumption may be violated in life-event studies, for example, when the life event of interest is predictable and known to have anticipatory effects, such as the birth of the first child. To provide a proper baseline comparison, researchers interested in the causal effect of the life event should therefore ideally include only preevent observations that were made before the potential onset of anticipatory effects. Researchers specifically interested in distinguishing between the anticipatory effects and the immediate effects of the event that occur over and above anticipatory effects should include observations made during the preevent period in which anticipatory effects are expected to occur.
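The role of the baseline period in a within-person contrast can be illustrated with a minimal sketch. It assumes a single hypothetical person with yearly assessments, that time 0 is the first assessment after the event, and that anticipatory effects begin 1 year before the event; the function name, the data, and both time windows are illustrative assumptions, not part of any cited study.

```python
from statistics import mean

def within_person_change(observations, anticipation_start, post_window):
    """Mean postevent outcome minus mean baseline outcome for one person.

    observations: list of (years_relative_to_event, outcome) pairs, where
    time 0 is assumed to be the first assessment after the event.
    anticipation_start: relative time at which anticipatory effects are
    assumed to begin (an assumption the researcher has to justify).
    post_window: last relative time that still counts as postevent.
    """
    baseline = [y for t, y in observations if t < anticipation_start]
    post = [y for t, y in observations if 0 <= t <= post_window]
    if not baseline or not post:
        raise ValueError("need at least one baseline and one postevent observation")
    return mean(post) - mean(baseline)

# Hypothetical life-satisfaction trajectory of one first-time parent.
person = [(-3, 7.0), (-2, 7.0), (-1, 8.0), (0, 8.5), (1, 7.5), (2, 7.0)]

# Baseline restricted to assessments before the assumed anticipation period.
change_clean = within_person_change(person, anticipation_start=-1, post_window=1)

# Baseline that also includes the assessment at t = -1, which may already
# reflect anticipatory effects.
change_naive = within_person_change(person, anticipation_start=0, post_window=1)

print(change_clean, change_naive)
```

In this toy trajectory, including the assessment from the assumed anticipation period raises the baseline and therefore shrinks the estimated change, which is why the choice of baseline observations should be stated explicitly.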
Defining the Outcome
After specifying the causal contrast of interest, researchers have to decide on a measure of the outcome. The outcome measure has to be carefully chosen and should be sensitive to the theoretically expected pattern of change over time. In general, three concepts of change can be distinguished.
Researchers should also carefully think about the method of measurement. Self-reports are typically the most important method of assessing the participant’s personal viewpoint. In Box 2, we present four key factors that should be considered when working with self-reports in life-event studies. However, self-reports also have well-known limitations: Depending on the construct of interest, expert ratings or ratings of knowledgeable informants might be a more appropriate or an important supplemental measure of the construct (Eid et al., 2025; Funder & West, 1993; Letzring & Spain, 2021). Moreover, assessing changes in physiological systems—activity, health, social functioning, mobility, and so on—might also offer valid insights into the effects of life events. Modern measurement devices, such as mobile sensing, allow less intrusive and more frequent recording of many important physiological and social variables that can be used in life-event research (Mehl et al., 2024). Multimethod assessment strategies not only allow researchers to measure diverse facets of the effects of life events but also can enhance the validity of change measurements (e.g., Eid & Diener, 2006).
Box 2. Key Factors to Consider When Using Self-Report Measures in Life-Event Studies
Finally, because questionnaire measures do not have a natural scaling, it can be useful to rescale questionnaire measures to allow comparison across constructs and studies. For example, when response scales with a varying number of response categories are used (e.g., 5-point and 7-point scales), responses can be transformed into percentage of maximum possible (POMP) scores (Cohen et al., 1999). POMP scores range from 0 to 100, making them interpretable as percentage scores, which can simplify the interpretation of the magnitude of causal effects. Furthermore, POMP scores, unlike standardized scores (e.g., Cohen’s
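As a minimal sketch of the POMP transformation, the function below rescales a raw response to the 0 to 100 range using the minimum and maximum of the response scale. The function name `pomp` and the example responses are hypothetical and serve only to show how responses from a 5-point and a 7-point scale end up on a common metric.

```python
def pomp(score, scale_min, scale_max):
    """Convert a raw score to a percentage of maximum possible (POMP) score."""
    if scale_max <= scale_min:
        raise ValueError("scale_max must be larger than scale_min")
    return 100.0 * (score - scale_min) / (scale_max - scale_min)

# Hypothetical responses: the same raw value on two different response scales.
five_point = pomp(4, scale_min=1, scale_max=5)   # 4 on a scale from 1 to 5
seven_point = pomp(4, scale_min=1, scale_max=7)  # 4 on a scale from 1 to 7

print(five_point, seven_point)  # 75.0 50.0
```

The same raw response of 4 corresponds to different POMP scores on the two scales, so raw responses should be transformed before effects are compared across instruments.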
Defining the Time Lag
Another integral part of defining the causal estimand is to specify the time lag of interest (Gollob & Reichardt, 1987; Voelkle et al., 2018). In the case of first-time parenthood, researchers might be interested in (a) the immediate effects occurring in the days and weeks after the birth or (b) the longer-term effects several years after becoming a parent. Researchers should ideally match their choice of time lags to those proposed by theories describing the timing of how a certain life event affects a certain outcome (Hopwood et al., 2022; Luhmann et al., 2014). In practice, achieving this desideratum will often be challenging because psychological theories often fail to precisely describe the timing of psychological processes. Furthermore, life-event studies often rely on preexisting panel data so that researchers are bound to the temporal resolution provided by the measurement schedule in these studies. Specifically, most panel studies use yearly measurements, which are well-suited to examine long-term effects of life events but are too coarse to study finer-grained changes occurring in close proximity to life events. One way to increase the temporal resolution of life-event studies is to collect information on the exact timing of the occurrence of the event (Hudde & Jacob, 2022). An even better way is to collect new prospective-panel data with shorter time intervals between measurement occasions (e.g., Lawes et al., 2023). We recommend that researchers always clearly state the time lag when interpreting the effects and elaborate on how well the study’s time lags align with the theorized temporal processes (Hopwood et al., 2022). One effective approach is to create a timeline that maps both the theoretically relevant periods associated with a life event and the study’s time frame. 
This timeline can also guide the selection of covariates, ensuring that the event and comparison groups are appropriately defined and balanced (see section “Step 2: Identifying the Causal Effect”).
Defining the Target Population
The last component of the causal estimand is the target population for which researchers wish to make inferences. Causal effects can be defined at the individual level, at the subgroup level (e.g., males), or for a population of individuals. Rubin (1974; see also Holland, 1986) originally defined the
When researchers aim at estimating average causal effects, they should clearly conceptualize and transparently describe the target population to which they wish to generalize. Clearly defining the target population enables researchers to make well-informed decisions concerning which data they should use and which analyses they should run (Greifer & Stuart, 2023; Lundberg et al., 2021). Often, researchers are interested in the average effect of the life event in a whole population; this effect is called the
Step 2: Identifying the Causal Effect
Conditions That Need to Be Met
Once the causal estimand is conceptually defined, researchers need to identify the causal estimand using observable variables (Lundberg et al., 2021). Average causal effects can be identified if the following three conditions are met (Hernán & Robins, 2024): consistency, positivity, and exchangeability. Consistency refers to the fact that (a) the exposure is well defined and (b) only a single version of the exposure is realized. Whether the consistency condition can actually be fulfilled cannot be empirically tested; it can be made plausible only by reasoning. As discussed above, this can be achieved by minimizing unwanted variation in the exposure (i.e., the life event), for example, by specifying eligibility criteria. Furthermore, consistency implies that there is no interference, meaning that the potential outcome of any individual should not depend on the exposure versus nonexposure of other individuals. In the context of becoming first-time parents, this would mean that the effect of becoming a first-time parent on the parents’ life satisfaction has to be independent of whether other couples in the study (e.g., friends or neighbors) also become first-time parents. Rubin (1980) introduced the term

Fig. 1. Hypothetical directed acyclic graph for the effects of birth of first child on life satisfaction.
Using Directed Acyclic Graphs to Specify the Hypothesized Causal Structure
The hypothesized causal structure can be specified using directed acyclic graphs (DAGs; for a nontechnical introduction, see Rohrer, 2018). A DAG consists of nodes (observed or unobserved variables) that are connected by arrows. These arrows encode theory, prior empirical research, and assumptions to represent the hypothesized causal relationships between nodes. Typically, DAGs are nonparametric so that no particular functional form (e.g., linear, quadratic) of the causal relationships is assumed. DAGs should include all theoretically relevant variables regardless of whether they were measured in a given study. As an example, Figure 1 depicts a simple hypothetical DAG for the effects of becoming first-time biological parents on life satisfaction.
Based on the hypothesized DAG, researchers can determine which variables need to be controlled for and which variables should not be controlled to identify the causal effect of interest (Cinelli et al., 2024; Wysocki et al., 2022). Of particular relevance is how the different covariates are assumed to be related to (a) the exposure (e.g., birth of first child) and (b) the outcome of interest (e.g., life satisfaction). Three types of covariates can be differentiated: confounders, mediators, and colliders. Confounders are common causes of the exposure and the outcome. In our example DAG, career focus is a common cause of the likelihood of becoming first-time biological parents and of life satisfaction. To identify the causal effects of interest, it is essential to control for confounders. Confounders can be either time-invariant, meaning that their levels and effects do not change over time (e.g., biological sex, ethnicity), or time-varying, meaning that their levels or effects can change over time (e.g., social support, career focus, income).
Mediators are variables that are causally affected by the exposure and that, in turn, have a causal effect on the outcome. In our illustrative DAG (Fig. 1), parental role identity is a mediator because it is causally affected by the birth of first child while it, in turn, causally affects life satisfaction. Optimally, mediators are assessed after the exposure and before the outcome is assessed (see Rosenbaum, 1984).² Colliders are variables that are caused by both the exposure and the outcome. In our example, relationship satisfaction after the birth of the first child is a collider because it is causally affected by both becoming a first-time parent and life satisfaction. In contrast to confounders, mediator and collider variables should not be controlled for because they are causally affected by the exposure itself; they are posttreatment variables (Elwert & Winship, 2014; Pearl, 2009; Rohrer, 2018; Rosenbaum, 1984; Rubin, 1974).
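The three covariate roles can be read off a DAG mechanically, as the following sketch illustrates. It encodes only the four relationships named in the text (career focus, parental role identity, and postbirth relationship satisfaction around the exposure and outcome); the node labels, the dictionary encoding, and the helper functions are assumptions made for illustration, and the classification is a simplified heuristic rather than a full implementation of graphical identification criteria.

```python
# Hypothetical DAG: each key is a cause, each list holds its direct effects.
dag = {
    "career_focus": ["birth_first_child", "life_satisfaction"],
    "birth_first_child": ["parental_role_identity", "life_satisfaction",
                          "relationship_satisfaction_post"],
    "parental_role_identity": ["life_satisfaction"],
    "life_satisfaction": ["relationship_satisfaction_post"],
    "relationship_satisfaction_post": [],
}

def descendants(graph, node, blocked=()):
    """Nodes reachable from `node` along arrows, never entering `blocked` nodes."""
    seen = set()
    stack = [child for child in graph.get(node, []) if child not in blocked]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(c for c in graph.get(current, []) if c not in blocked)
    return seen

def classify(graph, exposure, outcome):
    """Label every other node as confounder, mediator, collider, or other."""
    after_exposure = descendants(graph, exposure)
    after_outcome = descendants(graph, outcome)
    roles = {}
    for node in graph:
        if node in (exposure, outcome):
            continue
        if node in after_exposure and node in after_outcome:
            roles[node] = "collider"      # caused by exposure and outcome
        elif node in after_exposure and outcome in descendants(graph, node):
            roles[node] = "mediator"      # lies on a path exposure -> outcome
        elif (node not in after_exposure
              and exposure in descendants(graph, node)
              and outcome in descendants(graph, node, blocked={exposure})):
            roles[node] = "confounder"    # causes both, not only via exposure
        else:
            roles[node] = "other"
    return roles

roles = classify(dag, "birth_first_child", "life_satisfaction")
adjustment_set = sorted(n for n, role in roles.items() if role == "confounder")
print(roles)
print(adjustment_set)
```

In this toy graph, only career focus ends up in the adjustment set, whereas parental role identity and postbirth relationship satisfaction are flagged as posttreatment variables that should not be controlled for.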
In longitudinal studies, DAGs quickly become complex because the same construct can play different roles in the same causal chain depending on its (temporal) location. For example, prebirth relationship satisfaction could be both a positive cause of the likelihood of becoming first-time parents and of postbirth life satisfaction, making it a confounder. In contrast, postbirth relationship satisfaction may be changed by the birth of the first child and, in turn, affect life satisfaction, making it a mediator. Thus, the timing of the measurements needs to be considered when identifying variables that need to be controlled for. Another challenge arises when an earlier life event affects the occurrence of a later life event (e.g., marriage increases the probability of the birth of the first child) and researchers want to examine the effects of both events in a single analysis. In these cases, the same posttreatment variable can serve (a) as a confounder on one path in the DAG that should be controlled for and (b) as a collider on another path that should not be controlled for. For these situations, Robins (1986) developed an approach known as
In sum, specifying DAGs in longitudinal studies can be challenging. Yet it is an integral part of the research process to make the exchangeability assumption plausible. Furthermore, the process of specifying the DAG makes the assumed causal structures and limitations of a given study transparent. Finally, the DAG informs future research about which confounders should be measured in order to estimate the effects of interest.
Step 3: Estimating the Causal Effect
After the causal effect of interest has been identified, it can be estimated using empirical data. In theory, unbiased estimates of a causally identified effect can be obtained if the following two conditions are met (Hernán & Robins, 2024). First, all variables included in the analysis are reliably measured so that measurement error does not bias the estimates. Second, the parametric models are correctly specified. This means that the functional form (e.g., linear, polynomial, spline; see Suk et al., 2019) for the relationships between variables needs to be correctly specified. In practice, there are two main strategies for estimating causal effects of life events: difference-in-difference designs that use a comparison group to separate event-related changes from normative changes (e.g., Lawes et al., 2023; Yap et al., 2012) and within-person designs that control for all time-invariant person-level confounders (e.g., Clark et al., 2008; Krämer et al., 2024). We describe below the main features and most frequent challenges of these two modeling approaches. To address the two key statistical conditions for estimating causally identified effects described above (i.e., no measurement error, no model misspecification), we also discuss latent-variable models and methods of representing different functional forms.
Difference-in-Difference Designs
Difference-in-difference designs (for a review, see Wing et al., 2018) can be used to distinguish event-related changes from changes that would occur irrespective of the life event (e.g., normative changes or aging). In prototypical life-event studies, difference-in-difference designs compare individuals who experience the focal life event (i.e., the event group) with individuals who do not experience it (i.e., the comparison group). Causal effects are estimated based on group differences in within-person changes, meaning that time-invariant confounders that affect only the outcome
The central assumption of difference-in-difference designs is that the changes in the comparison group equal the counterfactual changes of individuals in the event group if the event group (contrary to fact) had not experienced the life event. This assumption is often called the
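The basic logic of the design can be sketched for the simplest case of two groups and two measurement occasions. The example below computes the mean within-person change in each group and takes the difference between the two; it is an illustration under the central assumption stated above, with hypothetical data and a hypothetical function name, and it ignores covariates, measurement error, and standard errors.

```python
from statistics import mean

def did_estimate(event_pre, event_post, comparison_pre, comparison_post):
    """Difference between the mean within-person changes of the two groups."""
    event_change = mean([post - pre for pre, post in zip(event_pre, event_post)])
    comparison_change = mean(
        [post - pre for pre, post in zip(comparison_pre, comparison_post)]
    )
    return event_change - comparison_change

# Hypothetical life-satisfaction scores before and after the event period.
event_pre = [6.0, 7.0, 5.0]
event_post = [7.0, 7.5, 6.5]        # within-person changes: 1.0, 0.5, 1.5
comparison_pre = [6.0, 7.0, 5.0]
comparison_post = [6.5, 7.0, 6.0]   # within-person changes: 0.5, 0.0, 1.0

effect = did_estimate(event_pre, event_post, comparison_pre, comparison_post)
print(effect)  # 0.5
```

Here the event group changes by 1.0 points on average and the comparison group by 0.5 points, so half of the observed change in the event group is attributed to changes that would have occurred without the event.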
Creating covariate balance between the event and comparison groups
The first step to achieve balance between the event and comparison groups is to identify and measure all baseline covariates that are potential confounders. DAGs facilitate this process. In general, researchers should aim to balance the groups with respect to the last measure before the first (anticipatory) effects of the life event occurred. Researchers can then choose between three main approaches to achieve balance: matching, weighting, and covariate adjustment in outcome regression (for an overview, see Schafer & Kang, 2008). These approaches aim at optimizing the trade-off between internal validity (i.e., ruling out confounding by creating event and comparison groups whose covariate distributions are highly similar), precision (i.e., obtaining a large sample size for estimating the effect with minimal standard error), and external validity (i.e., estimating effects that can be generalized to the target population; Greifer, 2020). Current research suggests that the choice of covariates used for balancing plays a much more important role for estimating unbiased causal effects than the method used to create balance (Cook et al., 2009; Pohl et al., 2009).
Matching
Matching achieves covariate balance by identifying individuals in the event and comparison groups who are highly similar to each other on the set of identified confounders. Most existing life-event studies have used the propensity score in their matching algorithms (e.g., Buecker et al., 2021; Golle et al., 2019; P. L. Hill et al., 2021; Jackson et al., 2012; Krämer & Rodgers, 2020; Lawes et al., 2023). The propensity score is defined as the probability of treatment exposure (e.g., experiencing the life event) given a set of observed baseline covariates (for an accessible introduction to propensity-score methods, see West et al., 2014). Most often, the propensity scores are estimated using logistic-regression models; however, alternative machine-learning methods may also be used (for an overview, see Westreich et al., 2010). The propensity score (if correctly estimated) has a highly valuable property: Conditioning on the propensity score will balance all measured covariates used to construct the propensity score between the event and comparison groups (Austin, 2011). Matching individuals from the event and comparison groups with highly similar propensity scores is expected to produce unbiased treatment effects. Algorithms for a variety of different matching approaches are implemented in many software packages (e.g.,
Matching approaches have many valuable features: They make it easy to inspect whether each of the covariates is balanced between the event and comparison groups, they separate the covariate-balancing step from the effect-estimation step, and the effect estimation does not rely on functional form assumptions. However, matching approaches also have several drawbacks. Most importantly, matching often involves discarding individuals with poor matches from the analysis so that the effect of the exposure is not estimated based on the full sample. Specifically, in most matching applications, some individuals from the comparison group will be discarded because they are too different from the individuals in the event group. Furthermore, individuals from the event group may also be discarded from the effect estimation, for example, when a caliper is specified (i.e., calipers restrict the differences on propensity scores between matched individuals to be below a certain threshold) or matches are restricted to the region of common support (i.e., the region in which the covariate distributions of the comparison and event groups overlap). This exclusion of individuals from the analysis sample has two consequences. First, the sample size is reduced, resulting in less precision (i.e., larger standard errors, lower statistical power) in the estimate of the causal effect. Second, depending on which individuals are discarded, the analysis may not estimate the ATE but, rather, another estimand that might not generalize to the population from which the sample was drawn. Indeed, it may even be unclear whether the effects estimated in the matched sample generalize to any broader population (Greifer & Stuart, 2023). Finally, matching is unable to control for unmeasured confounders and selective attrition.
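How a caliper leads to discarded individuals can be seen in a small sketch of greedy 1:1 nearest-neighbor matching without replacement. The propensity scores are taken as given (in practice they would first have to be estimated), and the person identifiers, scores, caliper width, and function name are hypothetical. Dedicated matching software offers more refined algorithms, so this is an illustration of the principle rather than a recommended implementation.

```python
def caliper_match(event_ps, comparison_ps, caliper):
    """Greedy 1:1 nearest-neighbor matching on the propensity score.

    event_ps and comparison_ps map person IDs to propensity scores.
    Each comparison person is used at most once. Returns the matched pairs,
    the unmatched event-group IDs, and the unused comparison-group IDs.
    Note that greedy matching depends on the order of the event group.
    """
    available = dict(comparison_ps)
    pairs, unmatched_event = [], []
    for event_id, score in event_ps.items():
        if not available:
            unmatched_event.append(event_id)
            continue
        # Closest remaining comparison person on the propensity score.
        best_id = min(available, key=lambda cid: abs(available[cid] - score))
        if abs(available[best_id] - score) <= caliper:
            pairs.append((event_id, best_id))
            del available[best_id]
        else:
            unmatched_event.append(event_id)  # no match within the caliper
    return pairs, unmatched_event, sorted(available)

# Hypothetical propensity scores.
event_ps = {"e1": 0.30, "e2": 0.52, "e3": 0.90}
comparison_ps = {"c1": 0.28, "c2": 0.50, "c3": 0.10, "c4": 0.60}

pairs, unmatched_event, unused_comparison = caliper_match(
    event_ps, comparison_ps, caliper=0.05
)
print(pairs)              # [('e1', 'c1'), ('e2', 'c2')]
print(unmatched_event)    # ['e3']
print(unused_comparison)  # ['c3', 'c4']
```

Because the event-group member with the highest propensity score has no comparison person within the caliper, the matched sample no longer represents the full event group, which illustrates how the estimand can shift when individuals are discarded.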
Weighting
Covariate balance between the event and comparison groups can also be achieved by weighting the observations in a way that creates a pseudopopulation in which the confounding variables are not associated with the likelihood of exposure to the event (Hernán & Robins, 2024). If weighting is successful, unbiased causal effects can be estimated based on the weighted sample (e.g., using weighted linear regression). The most common weighting method is inverse probability of treatment weighting, which is based on the propensity score (for an introduction, see Thoemmes & Ong, 2016). However, numerous other weighting methods have been proposed (see e.g., Chan et al., 2016; Hainmueller, 2012; Huling & Mak, 2022; Wang & Zubizarreta, 2019); most of these methods are implemented in software packages (e.g.,
Weighting approaches share many desirable features with matching: Balance of each covariate can be easily investigated by inspecting the covariate distributions in the weighted samples, the covariate-balancing step is separated from the effect estimation, and the effect estimation does not rely on any functional form assumptions. Weighting approaches also have their own unique advantages. Weighting is less likely to require discarding individuals so that the causal effects can be estimated with higher statistical power (Desai & Franklin, 2019). Furthermore, when no individuals are discarded, weighting allows the estimation of causal effects that can be generalized to different target populations (Greifer & Stuart, 2023). Finally, the balancing weights can be combined with survey weights (Dong et al., 2020) and censoring weights (Cole & Hernán, 2008; Hernán & Robins, 2024, Chapter 12.6) to minimize bias because of selective participation and selective dropout, respectively. However, weighting approaches also have important limitations. First, weighting approaches may result in extreme weights for some individuals, in which case, causal-effect estimates are primarily driven by those individuals with large weights. To avoid this problem, researchers can either use a different method of estimating the weights or trim the weights if they are more extreme than a certain threshold (e.g., the first and 99th percentiles; Austin & Stuart, 2015). Second, even though weighting tends to yield more statistical power than matching, the effects will still be estimated with less precision compared with the unweighted sample (i.e., lower effective sample size; Shook-Sa & Hudgens, 2022). In other words, weighting approaches trade off the loss of some statistical power to obtain unbiased estimates of the causal effect that can be generalized to a specific target population. Finally, like matching, weighting approaches do not control for unmeasured confounding variables.
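The following sketch illustrates inverse probability of treatment weighting, the effect of an extreme weight on the effective sample size, and weight trimming. The propensity scores are taken as given, the data are hypothetical, and the trimming bound is passed as a fixed number for simplicity, whereas in practice it would typically be derived from percentiles of the weight distribution. The effective sample size is computed with the common approximation (sum of weights) squared divided by the sum of squared weights, which is our assumption about the formula rather than a detail stated in the text.

```python
def iptw_weights(exposed, propensity):
    """ATE weights: 1/e for exposed persons and 1/(1 - e) for unexposed persons."""
    return [1.0 / e if t else 1.0 / (1.0 - e) for t, e in zip(exposed, propensity)]

def effective_sample_size(weights):
    """Approximate effective sample size of a weighted sample."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

def trim(weights, lower, upper):
    """Set weights below `lower` or above `upper` to the respective bound."""
    return [min(max(w, lower), upper) for w in weights]

def weighted_mean_difference(outcome, exposed, weights):
    """Weighted mean of the exposed minus weighted mean of the unexposed."""
    num_1 = sum(w * y for y, t, w in zip(outcome, exposed, weights) if t)
    den_1 = sum(w for t, w in zip(exposed, weights) if t)
    num_0 = sum(w * y for y, t, w in zip(outcome, exposed, weights) if not t)
    den_0 = sum(w for t, w in zip(exposed, weights) if not t)
    return num_1 / den_1 - num_0 / den_0

# Hypothetical data: the second person experienced the event despite a very
# low propensity score and therefore receives an extreme weight.
exposed = [1, 1, 0, 0, 0]
propensity = [0.5, 0.05, 0.5, 0.2, 0.5]
outcome = [8.0, 6.0, 7.0, 5.0, 6.0]

weights = iptw_weights(exposed, propensity)
trimmed = trim(weights, lower=0.0, upper=10.0)

ess_raw = effective_sample_size(weights)
ess_trimmed = effective_sample_size(trimmed)
effect_raw = weighted_mean_difference(outcome, exposed, weights)
effect_trimmed = weighted_mean_difference(outcome, exposed, trimmed)
print(ess_raw, ess_trimmed, effect_raw, effect_trimmed)
```

In this toy sample of five persons, the single extreme weight pushes the effective sample size below 2; trimming raises it but also changes the estimate, which is why the handling of extreme weights should be reported transparently.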
Covariate adjustment in outcome regression
Measured covariates can also be included in parametric-regression models used to estimate the effects of the life event on the outcome variable. One important advantage of this approach is that these models can account for the measurement error in the confounders and the outcome (Mayer et al., 2016; Sengewald & Mayer, 2024). A second advantage is that no individuals are excluded from the effect estimation so that various (conditional) causal effects can be estimated with high precision. However, outcome regression also has disadvantages. First, researchers have to make the correct assumptions about the functional form of the relationship between the covariates, the exposure, and the outcome variable to obtain unbiased estimates of the causal effects. Functional forms can be challenging to specify, especially when many covariates are included in the outcome regression. In addition, outcome-regression models extrapolate the parametric model, potentially leading to implausible extrapolation to domains beyond the range of the observed data. Furthermore, outcome regression is a one-step approach: The covariate-balance step is not separated from the effect-estimation step. Thus, balance in the covariate distributions between the event and comparison groups cannot be easily inspected, and there is greater risk that researchers might (unknowingly) select an outcome-regression model that yields the desired effects, limiting the objectivity of a study (Rubin, 2007). Finally, as with the other balancing approaches, outcome regression can control only for measured confounders but not for unmeasured confounders.
How to choose an appropriate balancing method
The choice of balancing method should primarily depend on the analysis goal. For example, if researchers are interested in the ATE in the population from which the sample was drawn, they should not use matching approaches because these typically involve discarding unmatched individuals from the estimation of the effect. In our opinion, weighting methods are the most versatile and would generally be our preferred choice. Furthermore, matching or weighting can be combined with outcome-regression methods in doubly robust balancing methods that aim at combining the advantages of two balancing methods. For example, matching and outcome regression can be combined by including the covariates that are still insufficiently balanced after propensity-score matching in a regression model fitted to the matched sample. The advantage of doubly robust methods is that it is sufficient for only one of the balancing models to be correct to estimate causal effects without bias (Funk et al., 2011; Kang & Schafer, 2007).
Analysis models
Once the event and comparison groups have been balanced with respect to all relevant confounders, difference-in-difference estimation entails estimating the causal effects by comparing the event group and the comparison group regarding their changes in the outcome(s). To specify the trajectories of the event and comparison groups, researchers have to make assumptions about the timing and the shape of the changes that occur over time in each group.
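In its simplest form, with one preevent and one postevent measurement per group, the difference-in-difference logic reduces to a comparison of mean changes. The sketch below illustrates only this basic idea; it assumes that the groups have already been balanced, and the function name and scores are hypothetical.

```python
from statistics import fmean


def did_estimate(event_pre, event_post, comp_pre, comp_post):
    """Mean change in the event group minus mean change in the comparison group."""
    change_event = fmean(event_post) - fmean(event_pre)
    change_comp = fmean(comp_post) - fmean(comp_pre)
    return change_event - change_comp


# Hypothetical life-satisfaction scores for two individuals per group
effect = did_estimate(
    event_pre=[7.0, 6.0], event_post=[5.0, 6.0],
    comp_pre=[7.0, 6.0], comp_post=[7.0, 7.0],
)
```

Here the event group declines by 1 point while the comparison group increases by 0.5 points, so the estimate is -1.5. The models discussed next generalize this comparison to multiple measurement occasions and phase-specific trajectories.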
Modeling the trajectory of the event group
In life-event research, change patterns are often nonlinear and discontinuous because life events tend to have stronger effects shortly after the event and weaker effects the further a life event is in the past (Baltes et al., 1980; Luhmann et al., 2014). For example, the birth of the first child probably has large detrimental effects on environmental mastery in the weeks after birth but substantially smaller effects several years later. To model such nonlinear and discontinuous changes, distinct phases (e.g., preevent and postevent phases) with phase-specific trajectories can be defined (for details, see Hosoya et al., 2020; Luhmann & Eid, 2013; Singer & Willett, 2003). Such phase-specific trajectories can be estimated using multilevel approaches with piecewise regression or spline models (e.g., Denissen et al., 2019; Krämer & Rodgers, 2020; Luhmann & Eid, 2009; Suk et al., 2019). Figure 2 depicts three examples of models that differ in the number of phases (two vs. five phases) and/or the parametric functional form of the phase-specific trajectories (linear vs. quadratic). To control for measurement error, multiple-indicator models can be used in multilevel structural equation models (SEMs). Alternatively, single-level SEMs can also be specified that are equivalent to these multilevel specifications (e.g., Castro-Alvarez, Tendeiro, Meijer, & Bringmann, 2022). Single-level SEMs tend to be more flexible than their multilevel counterparts and allow for testing of the central assumptions of the analysis (e.g., measurement invariance over time). However, single-level SEMs also tend to be more tedious to specify. The supplementary materials of Dugan et al. (2024), who examined changes in personality-development trajectories around life events (without an explicitly causal focus), further provide open-source code and data for specifying single-level SEMs to model phase-specific trajectories using the latent-growth-curve framework (see McArdle, 2009).

Fig. 2. Examples of piecewise regression models. (a) A model with two phase-specific linear trajectories. (b) A model with five phase-specific linear trajectories in which an intercept shift is modeled only for the transition from the “anticipation phase” to the “short-term effects phase.” (c) A model with two phase-specific quadratic trajectories.
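A two-phase linear piecewise model of this kind can be set up by recoding time relative to the event into phase-specific predictors. The following sketch shows one possible coding scheme (a preevent slope, an intercept shift at the event, and a postevent slope); the variable names and coefficient values are hypothetical, and other coding schemes are equally defensible.

```python
def piecewise_codes(months_since_event):
    """Recode time relative to the event into three piecewise predictors.

    pre_slope  : months before the event (negative values), 0 afterward
    post_shift : 0 before the event, 1 from the event onward
    post_slope : 0 before the event, months elapsed since the event afterward
    """
    pre_slope = min(months_since_event, 0)
    post_shift = 1 if months_since_event >= 0 else 0
    post_slope = max(months_since_event, 0)
    return pre_slope, post_shift, post_slope


def predicted_outcome(months_since_event, intercept, b_pre, b_shift, b_post):
    """Model-implied outcome for given (hypothetical) regression coefficients."""
    pre, shift, post = piecewise_codes(months_since_event)
    return intercept + b_pre * pre + b_shift * shift + b_post * post


# Hypothetical coefficients: flat preevent trajectory, a drop of 1.2 points
# at the event, and a recovery of 0.1 points per month afterward.
six_months_after = predicted_outcome(6, intercept=7.0, b_pre=0.0,
                                     b_shift=-1.2, b_post=0.1)
```

In a multilevel model, these three predictors would enter as fixed (and possibly random) effects, with the intercept representing the expected outcome level immediately before the event.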
Modeling the trajectory of the comparison group and effect estimation
In life-event research, two approaches to modeling the trajectory of the comparison group can be differentiated. The first approach (see Table 1, first model) is to model the general shape of the trajectory of the comparison group (dotted line) analogously to the event group (solid line). This approach assumes that the general shape of the phase-specific trajectories applies to all individuals regardless of whether the life event is experienced. A central challenge of this approach is establishing a clear counterfactual because it is unknown when the comparison group would have experienced the life event of interest (e.g., when childless women in the comparison group would have given birth). As a consequence, it is not clear how to temporally align the measurement occasions of the comparison group to those of the event group. One way to solve this issue is to first match individuals from the event group to individuals from the comparison group and then use the measurement occasion in which the individuals in the event group experienced the life event as the “artificial time” for when their matched counterparts in the comparison group would have experienced the life event. For detailed analysis scripts for this approach, see, for example, the supplementary materials of Krämer and Rodgers (2020) and van Scheppingen and Leopold (2020). After the trajectories of the event and comparison groups have been temporally aligned, the causal effects can be estimated by interacting an exposure indicator (i.e., event group vs. comparison group) with the parameters specifying the phase-specific trajectories (e.g., intercept, linear slope; for example analysis code, see Krämer & Rodgers, 2020). Alternatively, multigroup SEMs can be applied to test whether the phase-specific trajectories of the control and event groups differ from each other (for example analysis code, see van Scheppingen & Leopold, 2020).
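The temporal-alignment step described above can be sketched as follows. The code assumes that matching has already been carried out and that each matched comparison individual is linked to exactly one event-group individual; the data structures and identifiers are hypothetical and are not taken from the cited analysis scripts.

```python
def align_to_event_time(observations, event_wave, matched_to):
    """Express each observation in time relative to the (artificial) event.

    observations : list of (person_id, wave, outcome)
    event_wave   : {event_person_id: wave in which the event occurred}
    matched_to   : {comparison_person_id: matched event_person_id}
    Returns a list of (person_id, time_since_event, exposed, outcome).
    """
    aligned = []
    for person, wave, outcome in observations:
        if person in event_wave:
            reference, exposed = event_wave[person], 1
        elif person in matched_to:
            # Comparison individuals inherit the event wave of their match.
            reference, exposed = event_wave[matched_to[person]], 0
        else:
            continue  # unmatched comparison individuals are dropped
        aligned.append((person, wave - reference, exposed, outcome))
    return aligned


observations = [
    ("e1", 2019, 6.0), ("e1", 2021, 5.0),
    ("c1", 2019, 6.5), ("c1", 2021, 6.4),
    ("c2", 2020, 7.0),
]
aligned = align_to_event_time(
    observations, event_wave={"e1": 2020}, matched_to={"c1": "e1"}
)
```

The resulting exposure indicator can then be interacted with the phase-specific trajectory parameters, as described in the text.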
Table 1. Overview of Modeling Approaches
Note: SEMs = structural equation models. ᵃ For the two difference-in-difference designs, we depicted piecewise models with a linear preevent and postevent slope. For the within-person design, we depicted a model with occasion-specific effects. Although these are popular choices, other specifications are, of course, possible (see also Fig. 2).
The second approach (Table 1, second model) is to not model a trajectory for the comparison group at all (dotted line). The assumption underlying this so-called “iron bar” model is that the outcome levels are stable in the absence of exposure to the life event (e.g., childless comparison individuals do not experience changes in life satisfaction over time). This approach can be implemented by setting the indicator (dummy) variables referencing the various time phases for individuals in the comparison group to zero. The causal effects in these models then correspond to the coefficients of the dummy variables referencing the various time phases for individuals following the life event in the event group. Each of the dummy variables estimates the mean difference between the time-invariant intercept in the comparison group and the estimated level of the trajectory at the beginning of each time phase for the event group. Because it is unlikely that the outcome levels in the comparison group are completely stable over time, additional covariates can be included to adjust the causal effect for normative changes that occur regardless of the life event. For example, by adding age as a covariate, one can control for normative age-related changes. For detailed analysis scripts for this approach, see, for example, the supplementary materials of Reitz et al. (2022) and Denissen et al. (2019).
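The dummy coding underlying this second approach can be illustrated with a short sketch. The phase boundaries and names below are hypothetical, and the sketch covers only the indicator variables, not any phase-specific slopes or the additional covariates (e.g., age) that a full model would include.

```python
# Hypothetical phases, in months relative to the event: [start, end)
PHASES = [
    ("anticipation", -12, 0),
    ("short_term", 0, 12),
    ("long_term", 12, float("inf")),
]


def phase_dummies(exposed, months_since_event, phases=PHASES):
    """Return one 0/1 indicator per phase.

    For comparison individuals (exposed == 0), all indicators stay at 0,
    so that their observations contribute only to the stable intercept.
    """
    dummies = {name: 0 for name, _, _ in phases}
    if exposed:
        for name, start, end in phases:
            if start <= months_since_event < end:
                dummies[name] = 1
    return dummies


event_row = phase_dummies(exposed=1, months_since_event=3)
comparison_row = phase_dummies(exposed=0, months_since_event=None)
```

Observations of event-group individuals that fall outside all defined phases (here, more than 12 months before the event) also receive zeros on every indicator and therefore contribute to the reference level.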
Both approaches rely on strong assumptions. The first approach heavily relies on the assumption that the artificial time in which individuals in the comparison group would have experienced the life event is correctly specified and that the trajectories of the event and comparison groups have the same functional form. The second approach relies on the assumption that all changes occurring in the comparison group can be adequately modeled through the inclusion of covariates. However, whether these conditions are actually met in a given study cannot be tested.
Limitations of difference-in-difference designs
Even though difference-in-difference designs attempt to mimic randomized experiments and are viewed as relatively strong designs for estimating causal effects in observational studies (Rubin, 2007), they also have important drawbacks for the study of life events. For one, developing a clear definition of the event and comparison groups becomes difficult when life events are reversible (e.g., marriage → divorce; job loss → reemployment). Consider a study that examines the effects of job loss over an extended time period. If the event group is defined based on whether a person experienced a transition from employment to unemployment, the event group will contain subgroups of individuals who become long-term unemployed and individuals who find a new job quickly. Although it may be possible to estimate the short-term effects of a job loss with this event group, estimation of longer-term effects will conflate the results of adaptation processes in the long-term unemployed group with the (positive) effects of reemployment in the latter group. One approach to this problem is to define different event subgroups depending on how long individuals stay unemployed (e.g., Event Group 1: reemployed after 1 month of unemployment; Event Group 2: reemployed after 2 months of unemployment). This approach, however, may become infeasible because of low sample sizes in the different event subgroups. Another, more straightforward approach is to use g-estimation (Loh et al., 2024; Robins, 1986), briefly described in the above section “Using Directed Acyclic Graphs to Specify the Hypothesized Causal Structure.” G-estimation allows for the examination of the effects of different sequences of reversible life events (e.g., employed → unemployed → unemployed → employed). A further limitation of difference-in-difference designs is that all confounders that are related to changes in the outcomes over time need to be measured and adequately accounted for. 
Within-person designs, to which we now turn, can (at least in part) address the latter of these limitations because they allow researchers to account for all time-invariant confounders regardless of whether they are measured or unmeasured.
Within-Person Designs
Within-person designs use each individual as his or her own control. Estimates of causal effects in within-person designs are obtained by examining how strongly individuals deviate from their “baseline levels” before and after the exposure (see Table 1, third model). By focusing on within-person changes, these models can rule out confounding resulting from any observed or unobserved time-invariant covariates so that only time-varying confounders need to be controlled for (Allison, 2009; Kim & Steiner, 2021; Rohrer & Murayama, 2023; Wysocki et al., 2022).
A key challenge in within-person models is to define the baseline level of each individual. This baseline level should represent the outcome level in the absence of the exposure. In life-event studies, the baseline level is often defined as the mean outcome level of a person across all observations for which it is plausible that the life event has not (yet) affected the outcome measure. For example, in studies examining the effects of the birth of a first child, it may be sensible to define the baseline level as the mean of all observations collected at least 9 months before the birth. In practice, the data are usually organized in a long format (i.e., one row per measurement occasion per individual), and occasion-specific dummy variables are created for each of the focal occasions relative to the occurrence of the life event (e.g., the first, second, or third observations after the child is born). The baseline level then corresponds to the mean outcome levels across all nonfocal observations, that is, those for which each of the occasion-specific dummy variables is 0 (for example analysis code, see Krämer et al., 2024). Individuals who were not exposed to the event are often also included in within-person models with the justification that (a) the baseline levels (model intercept) and (b) the regression coefficients for time-varying covariates can be estimated with higher precision. To implement this approach, the occasion-specific dummy variables are set to 0 for all individuals who were not exposed to the event. However, this approach assumes that regression coefficients are the same for individuals who experience the life event and individuals who do not. Because this assumption is often unrealistic, we advise against this practice.
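As a descriptive illustration of the baseline definition, the sketch below computes one person's baseline as the mean of all observations collected at least 9 months before the event and then expresses later observations as deviations from that baseline. It is a simplification: the models described above estimate these quantities jointly in a regression with occasion-specific dummy variables. The function name and data are hypothetical.

```python
from statistics import fmean


def baseline_deviations(observations, baseline_cutoff=-9):
    """Baseline level and deviations from it for ONE person.

    observations    : list of (months_since_event, outcome)
    baseline_cutoff : observations at or before this month define the baseline
    """
    baseline_values = [y for t, y in observations if t <= baseline_cutoff]
    if not baseline_values:
        raise ValueError("no baseline observations available for this person")
    baseline = fmean(baseline_values)
    deviations = [(t, y - baseline) for t, y in observations if t > baseline_cutoff]
    return baseline, deviations


# Hypothetical person observed 24 and 12 months before and 3 and 15 months after
person = [(-24, 7.0), (-12, 6.0), (3, 5.5), (15, 6.5)]
baseline, deviations = baseline_deviations(person)
```

For this hypothetical person, the baseline is 6.5, the first postevent observation lies 1 point below it, and the second has returned to the baseline level.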
For reversible life events (e.g., job loss), the reference period is often defined as all occasions in which individuals are not observed in the new status resulting from the life event (e.g., all occasions during which the individual is employed). However, this approach relies on the assumption that the focal life event does not have any persisting effects on the outcome after it has been reversed. Yet, for example, negative effects of unemployment on life satisfaction have sometimes persisted when individuals regain employment (“scarring effect”; Clark et al., 2001; Hetschko et al., 2019; for a contrasting finding, see Zhou et al., 2019). In such cases, postevent measurement occasions should not be included in the definition of the baseline levels to maintain a clear separation between the preexposure baseline level and the event’s effects.
Analysis models
There are two general approaches that may be used to estimate within-person causal effects: multilevel modeling with person-mean-centered predictors (e.g., Hoffman, 2019; Raudenbush & Bryk, 2002) and fixed-effects models (Allison, 2009). Although these two modeling strategies have different mechanics and are rooted in different research traditions, they both aim at estimating the same within-person effect (Hamaker & Muthén, 2019). Both approaches are implemented in various statistical packages, such as R, SAS, or Stata.
Multilevel models with person-mean centering
In the psychological literature, multilevel models with person-mean-centered predictors are generally used to simultaneously examine within-person changes and between-person differences (e.g., Hoffman, 2019; Raudenbush & Bryk, 2002). These flexible models allow for the inclusion of time-invariant person-characteristics (stable traits; e.g., biological sex) to explain between-person differences. Moreover, they aim at making population-based inferences by treating the observed individuals as a random sample from the target population and assuming a specific distribution (typically normal) of the between-person effects across individuals in the population (McNeish, 2023). When multiple indicators of the outcome variable are available, the influence of measurement error can also be addressed. Castro-Alvarez, Tendeiro, de Jonge, et al. (2022) proposed a particularly relevant model for this purpose: the mixed-effects trait-state-occasion (ME-TSO) model, which is rooted in latent-state-trait theory (Steyer et al., 1992, 1999, 2015). The ME-TSO model can be used to contrast the true (i.e., free of measurement error) trait-outcome levels before and after a life event occurred (for example analysis code, see Lawes et al., 2024).
Fixed-effects models
In economics, sociology, and political science, fixed-effects models (Allison, 2009) are often applied to estimate within-person causal effects. Fixed-effects models account for all stable differences between individuals (Angrist & Pischke, 2009; Bell & Jones, 2015; Wooldridge, 2013). The goal of this approach is to remove all between-person variation so that only within-person variation remains. An advantage of this approach is that no assumptions are made about the distribution of the between-person effects in the population. However, this approach comes at a cost of being unable to quantify between-person differences. Fixed-effects models also aim at making inferences for the specific collection of individuals in the sample rather than in the target population (McNeish, 2023; McNeish & Kelley, 2019). Furthermore, to our knowledge, fixed-effects models cannot address the issue of measurement error (T. D. Hill et al., 2020). Given these limitations, fixed-effects models are less suited for life-event studies, which are typically interested in (a) quantifying between-person differences, (b) estimating causal effects that generalize to broader target populations, and (c) controlling for measurement error.
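The shared logic of both approaches, removing stable between-person differences so that only within-person variation remains, can be illustrated with a single predictor. The sketch below person-mean-centers the predictor and the outcome and then computes the pooled slope on the centered data, which corresponds to the fixed-effects point estimate in this simple case. Standard errors, additional covariates, and random effects are deliberately omitted, and the function name and data are hypothetical.

```python
from statistics import fmean


def within_estimate(panel):
    """Within-person slope from person-mean-centered data.

    panel : {person_id: [(x, y), ...]} with x the event indicator (0/1)
            and y the outcome at each measurement occasion
    """
    numerator = 0.0
    denominator = 0.0
    for rows in panel.values():
        x_mean = fmean([x for x, _ in rows])
        y_mean = fmean([y for _, y in rows])
        for x, y in rows:
            numerator += (x - x_mean) * (y - y_mean)
            denominator += (x - x_mean) ** 2
    return numerator / denominator


# Hypothetical data: two people with very different stable outcome levels
# but the same within-person change of -1 after the event.
panel = {
    "A": [(0, 7.0), (0, 7.0), (1, 6.0), (1, 6.0)],
    "B": [(0, 4.0), (1, 3.0), (1, 3.0), (1, 3.0)],
}
effect = within_estimate(panel)
```

Although the two individuals differ by 3 points in their preevent levels, the estimate of -1 reflects only the change within each person.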
Limitations of within-person models
The results of within-person designs can be biased by time-varying covariates (e.g., historical events) and maturational trends that may differ between the baseline and event periods (Holland, 1986; Shadish et al., 2002) and confound the estimate of the effect of the event. In life-event studies, many covariates are time-varying because either their levels or the magnitude of their effects may change over time. Of particular concern are other life events that co-occur with the focal event (Krämer et al., 2024). For example, the estimate of the effect of the birth of the first child might be confounded with the effects of other life events that occur immediately before the event (e.g., partners moving in together, getting married). Furthermore, characteristics that are typically conceptualized as being stable over time (e.g., residence, personality) might change over time. We therefore advocate that, just as in difference-in-difference models, researchers critically discuss the hypothesized causal structures when they use within-person models and ideally measure and analytically control for time-varying confounders.
Another important limitation of within-person models is that only individuals who were exposed to the event actually contribute to the estimation of the causal effects. This limitation has two important implications. First, the statistical power to detect the causal effects may be lower than in difference-in-difference designs (Allison, 2009). Second, the causal effects of within-person models apply only to those individuals who were actually exposed to the event. This feature changes the estimand to the ATT, which generally differs from the ATE, the effect that generalizes to the population from which the sample was drawn (Collischon & Eberl, 2020; T. D. Hill et al., 2020).
Recommendations for Deciding Between Difference-in-Difference Models and Within-Person Models
Whether difference-in-difference or within-person models should be used primarily depends on the assumed causal structure and the desired causal estimand. Figure 3 depicts a flowchart that provides a guide for choosing the appropriate analysis model in different contexts. The flowchart also shows that it is not possible to estimate unbiased causal effects if there are (a) unmeasured time-varying confounders or (b) unmeasured time-invariant confounders and the desired causal estimand is not the ATT.

Fig. 3. Guide for choosing between difference-in-difference and within-person models. ᵃ As described in the section on matching, it is generally not appropriate to use matching when the target estimand is the average treatment effect (ATE) for the population from which the sample was drawn.
Step 4: Probing Credibility, Robustness, and Generalizability of Effects
The process of establishing the most credible causal inference in observational life-event studies does not end once the causal effect is estimated. After estimating the causal effects of interest, it is essential that researchers (a) probe whether their theoretical assumptions are plausible, (b) examine the robustness of the results when alternative analysis methods are used, and (c) critically discuss the generalizability of the results.
Plausibility of the Assumptions
As a first step, researchers should critically check whether they have done everything possible to make the key assumptions of their analytical strategy plausible. Of particular importance is the assumption of exchangeability. Recall that this assumption states that, conditional on the set of controlled confounders, it is random whether an individual is exposed to the event. In life-event studies, there are two main issues that challenge this assumption. First, there are often strong selection effects, meaning that individuals who experience the focal life event may differ considerably from those who do not. Even after controlling for measured covariates (e.g., through covariate balancing), selection effects resulting from unobserved confounders potentially remain. Second, the true causal structure of life-event studies is often highly complex, and the relevant psychological theories are often incomplete. Thus, it is generally impossible to be sure that the identified confounders comprise a sufficient set to identify the causal effect. Accordingly, full (conditional) exchangeability will only rarely hold in life-event studies. Below, we present two indirect methods to probe the plausibility of the exchangeability assumption in studies using difference-in-difference designs.
Design elements and pattern matching
Shadish et al. (2002) and Rosenbaum (2020) emphasized the importance of including so-called design-element controls to strengthen causal inferences from observational studies. Design elements are used to capture the effect of plausible threats to internal validity in the absence of an actual treatment effect. For example, multiple pretests over time may be used to estimate maturational trends before exposure to a life event. The obtained pattern of results predicted from the threat to internal validity (confounding variable) is compared with the pattern of results predicted from the hypothesized effect of the treatment. In life-events research, multiple comparison groups that are differentially sensitive to the confounding variable are often used. These different comparison groups are sometimes termed
Another common design element is the inclusion of additional outcome variables that are not expected to be affected by the exposure but would be expected to be affected by other threats to validity (nonequivalent dependent variables; Shadish et al., 2002). These additional outcomes are sometimes called
Sensitivity analysis
A second method to examine the plausibility of the exchangeability assumption is sensitivity analysis (Rosenbaum, 1986, 2017). The basic idea of sensitivity analysis is to examine how large the effect size of an unmeasured confounder would need to be to alter the study conclusions, either making the effect no longer statistically significant or making the magnitude of the effect no longer of practical interest (Rosenbaum, 1986, 2017). For example, Haviland et al. (2007) investigated the effects of adolescent boys joining a youth gang (i.e., exposure) on subsequent self-reported violence (i.e., outcome). Their sensitivity analysis showed that it would require an unmeasured confounder that led to an increase in the odds of joining a gang of more than 50% to reduce the effect of gang membership on violence so that it would no longer be statistically significant.
In a related approach, researchers can use measured covariates to gauge the possibility that unmeasured confounders could potentially alter the study’s conclusions. For example, the covariate with the largest standardized mean difference can be used as a benchmark for a large imbalance on an unmeasured covariate between the event and comparison groups. Furthermore, the largest correlation between any of the covariates and the outcome can be taken as the benchmark for a large relationship. Researchers can then compute how the original estimate of the causal effect would be reduced if an unmeasured confounder existed with the properties of the derived benchmarks. Although there is no theoretical proof of the absence of a stronger unmeasured confounding variable, this procedure provides a plausible empirically based upper limit for the effect of an unmeasured confounder (Hong, 2004; Rosenbaum, 1986). If even the presence of this large hypothetical confounder would not alter the conclusion of the study, it becomes more plausible that the causal effect is credible.
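One simple way to carry out such a benchmark calculation is sketched below. It rests on assumptions that go beyond the text: a linear outcome model in which the unmeasured confounder U has a constant effect gamma on the outcome, so that the bias of the effect estimate equals gamma multiplied by delta, the mean difference in U between the event and comparison groups. Using the benchmark correlation as a stand-in for a standardized gamma is a further simplification, and the function name and numbers are hypothetical. Formal sensitivity analyses (e.g., Rosenbaum, 2017) proceed differently.

```python
def confounder_adjusted_effect(estimate, delta, gamma):
    """Effect estimate after removing the bias implied by a hypothetical confounder.

    Assumes a linear model Y = tau * A + gamma * U + error, in which the
    unadjusted group difference equals tau + gamma * delta.

    estimate : original (standardized) effect estimate
    delta    : benchmark standardized mean difference in U between groups
    gamma    : benchmark standardized effect of U on the outcome
    """
    return estimate - gamma * delta


# Hypothetical benchmarks taken from the most imbalanced measured covariate
# (delta) and the covariate most strongly related to the outcome (gamma)
adjusted = confounder_adjusted_effect(estimate=0.50, delta=0.30, gamma=0.40)
```

With these hypothetical benchmarks, an original standardized effect of 0.50 would shrink to 0.38, so a confounder of this strength would weaken but not eliminate the effect.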
Robustness of the Analysis
Researchers should also test whether their causal effects are robust when different analytical methods to estimate them are used. For example, Lawes et al. (2023) found highly similar effects of unemployment on well-being in (a) an unmatched analysis, (b) a propensity-score-matched analysis, and (c) an analysis that included key potential confounders as covariates in the outcome-regression model. Optimally, a robustness check would involve comparing all potentially appropriate analytical approaches to each other. This can, for example, be achieved in a specification-curve analysis (Simonsohn et al., 2020). Of importance, only those analytic methods that estimate the same causal estimand should be considered in such comparisons. When analytic methods that yield different causal estimands are compared with each other, it is unclear whether potential differences in the effect estimates result from a lack of robustness or from differences in the causal estimands. For example, when the causal estimand is the ATE, matching procedures that estimate the ATT should not be included in the robustness check.
Generalizability, Effect Heterogeneity, and Effect Moderation
So far in this article, we have primarily focused on how to identify and estimate average causal effects in longitudinal life-event studies. However, life-event researchers have stressed that people generally have heterogeneous reactions to life events (e.g., Bleidorn et al., 2020; Luhmann et al., 2021). For example, there are strong individual differences in how unemployment affects well-being (e.g., Lawes et al., 2024): For most individuals, becoming unemployed negatively affects life satisfaction; yet some individuals also report higher life satisfaction during unemployment. An important step in life-event research is to document this heterogeneity of responses to different life events on various outcomes and to understand its sources. Future research needs to search for individual-difference variables that lead to different magnitudes or even different directions of the causal effects of life events. These effect modifiers may be categorical (biological sex, marital status) or numeric (e.g., age, years employed) measured variables. Or they may be latent continuous variables (e.g., socioeconomic status created from measures of education, income, and occupation) or latent groups created from measured variables. To identify homogeneous latent groups with similar response trajectories to life events, researchers have proposed latent-trajectory analysis (e.g., Galatzer-Levy et al., 2010, 2011; Infurna & Luthar, 2017). This method can determine the proportion of individuals who exhibit different patterns of change after the event (e.g., declines, increases, no change). If properly specified, latent-trajectory models provide a reliable summary of the different patterns of change based on multiple measurement waves.
An important issue with effect moderation is the theoretical distinction between causal moderation, which implies that intervening on the moderator variable would cause changes in the effects, and noncausal moderation, which implies that the moderators are related but not causally related to heterogeneity in the effects of life events (see Hernán & Robins, 2024; VanderWeele, 2015). To date in life-event research, nearly all moderation analyses have been noncausal. Thus, it is important to carefully report and interpret these moderation analyses so that readers do not misinterpret noncausal moderation effects as causal (Rohrer et al., 2022).
The issues of effect heterogeneity and effect moderation further provide a segue into another important topic that often receives less attention in causal analyses: the external validity of study results (Diener et al., 2022; Rothwell, 2005). To directly generalize the estimated causal effects to the target population, a random sample of the defined target population should be drawn. In practice, however, truly random samples can never be achieved in life-event studies, for example, because not all individuals selected for the sample will agree to participate in the study. Thus, the obtained causal effect in most life-event studies will not directly generalize to the population from which the individuals were drawn. When information on the characteristics of the population is available (e.g., through administrative records), methods such as the inclusion of sampling weights (Winship & Radbill, 1994) may still permit the researcher to estimate effects that are directly generalizable to the target population of interest. It might even be possible to provide an estimate of the causal effect in a different target population, a concept known as
Generalizability and Transportability of the Results
Summary: Causal Inference in Life-Events Studies
Approaching life-event research within an explicitly causal framework is challenging. We have presented the four central steps involved in making credible causal inferences in life-event studies: (a) conceptually defining the causal estimand, (b) identifying the causal estimand, (c) estimating the causal effect, and (d) probing the credibility, robustness, and generalizability of the estimated effect. For a checklist summarizing these steps, see Box 4.
Checklist for Steps Involved in Causal Inference for Life-Event Studies
Future Directions
Toward More Specialized Observational Studies
Currently, most life-events studies rely on preexisting data from large national panel studies. These panel studies have many valuable features, including large, nationally representative samples and long study durations, but they also have several important limitations. First, they often have rather long time lags between measurement occasions (typically 1 year) so that the temporal resolution is generally too coarse to study shorter-term effects. Second, they generally have a very limited number of outcome measures, and relevant covariates are often not assessed. Third, single-item measures or (less reliable) short scales are often used so that the variables of interest cannot be well operationalized. To overcome these shortcomings, additional panel data need to be collected. In an ideal world, intensive longitudinal data would be collected from many individuals over an extended period of time before and after the life event of interest (for a comprehensive discussion about desirable design features, see Bleidorn et al., 2020). However, such intensive studies are generally not feasible because of both budget and time constraints and the high response burden on participants. Participant reactivity to repeated measurements may further alter the causal systems of interest (Rohrer & Murayama, 2023) and decrease the signal-to-noise ratio (Boker & Nesselroade, 2002). Thus, a useful strategy is for researchers to aim at determining the minimum number of measurement occasions needed to appropriately capture the causal effects of interest. For some outcome variables, participant reactivity can further be minimized through the use of modern assessment tools, such as wearable devices and content analysis of social media (Mehl et al., 2024).
Knowledge is needed about the timing and functional form of the change processes so that the number and spacing of the measurement occasions are adequate to model the event-related changes in the outcome variables (Collins, 2006; Hopwood et al., 2022; Luhmann et al., 2014). A valuable approach could be to supplement the panel data with measurement-burst designs (Nesselroade, 1991) in which high-frequency assessments occur in the months before and after the life event, with less frequent routine assessments otherwise. When it cannot be anticipated when the life event might occur, online tools could be used to ask individuals to indicate when they (expect to) experience the focal life event. Such event-contingent responding permits increased measurement frequency, ideally both immediately before and after the life event (for a discussion of event-contingent methods, see Moskowitz & Sadikaj, 2012). In planning the total duration of the study, researchers can be guided by their choice of causal estimand (Rohrer & Murayama, 2023): If the researchers are interested in the immediate effects of the life event, shorter studies may be sufficient, whereas long-term effects require longer-duration studies.
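To make the measurement-burst idea concrete, the following minimal Python sketch builds an assessment schedule that combines routine assessments with a dense burst centered on the (expected) event date. The function name `burst_schedule`, its parameters, and all default values (yearly routine assessments, weekly burst assessments, a 13-week window on each side of the event) are illustrative assumptions and are not taken from any of the cited studies.

```python
from datetime import date, timedelta


def burst_schedule(event_date, study_start, study_end,
                   routine_gap_days=365, burst_gap_days=7,
                   burst_window_days=91):
    """Return sorted assessment dates for a hypothetical measurement-burst design.

    Routine assessments are spaced `routine_gap_days` apart from `study_start`.
    Burst assessments are spaced `burst_gap_days` apart within
    `burst_window_days` before and after `event_date`.
    All defaults are illustrative assumptions, not recommendations.
    """
    dates = set()

    # Routine (sparse) assessments across the whole study period.
    current = study_start
    while current <= study_end:
        dates.add(current)
        current += timedelta(days=routine_gap_days)

    # Burst (dense) assessments centered on the event date.
    for offset in range(-burst_window_days, burst_window_days + 1, burst_gap_days):
        candidate = event_date + timedelta(days=offset)
        if study_start <= candidate <= study_end:
            dates.add(candidate)

    return sorted(dates)


if __name__ == "__main__":
    schedule = burst_schedule(
        event_date=date(2025, 6, 1),      # hypothetical (expected) event date
        study_start=date(2024, 1, 1),
        study_end=date(2026, 12, 31),
    )
    for d in schedule:
        print(d.isoformat())
```

In an event-contingent design, `event_date` would be updated once a participant reports (or expects) the event, and the schedule would be regenerated. Which burst length and spacing are appropriate depends on the assumed timing and functional form of the change process.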
In planning a study, researchers should also consider their measurement instruments. Specifying the hypothesized causal structure a priori can help identify potential confounders that need to be measured to identify the causal effects of interest (Rohrer & Murayama, 2023). Furthermore, self-reported data should ideally be complemented with other data sources (e.g., peer reports, biomarkers, wearable devices, administrative data) to obtain a detailed and differentiated picture of the effects. If possible, the constructs of interest should be measured with multiple indicators, permitting correction for measurement error. To make the inclusion of multiitem questionnaires in panel studies possible, more reliable short scales that are sensitive to change need to be developed and validated.
Natural Experiments
A promising opportunity for particularly credible causal inference in longitudinal observational studies is the increased use of natural experiments (Grosz et al., 2024). Natural experiments are quasi-experiments in which “some key elements of a randomized experiment occur on their own, even though the investigator neither creates nor assigns the treatments” (Rosenbaum, 2017, p. 100). An example of a natural experiment is the German Job Search Panel (GJSP; Hetschko et al., 2022), a study on the effects of unemployment on well-being and health. The GJSP exploited the German job-search registration process, in which employees have to register as jobseekers at least 3 months before they expect to lose their job because of plant closures or mass layoffs. Crucially, only some of the registered jobseekers actually enter unemployment; others manage to either keep their jobs (e.g., because the plant could be saved after all) or immediately start a new job without entering unemployment. By recruiting employed registered jobseekers for a smartphone panel study with monthly measurements, the GJSP permitted comparisons of individuals who entered unemployment (i.e., the event group) with highly similar individuals who remained employed (i.e., the comparison group). This kind of research design can easily be transferred to other life events. For example, the effects of retirement could be studied by collaborating with national pension funds to recruit individuals who are approaching retirement age, the effects of the birth of the first child could be studied by contacting newly married couples via local marriage bureaus, and the effects of divorce could be studied by recruiting couples through marriage counselors. It will likely not be random which of these individuals “at risk” will experience the focal life event; therefore, confounding variables (e.g., expectations of experiencing the event) need to be assessed and controlled for.
Another potential drawback of these natural experiments is that the identification of individuals who are at risk of experiencing the focal life event will be based on a prior event (e.g., registration as jobseeker because of expected job loss, marriage) or a normative process (e.g., reaching retirement age) that triggers recruitment. Thus, estimation of anticipation effects of the life event will often not be possible, and the number of preevent measurements may be limited. We recommend that the findings from such highly controlled natural experiments be combined with those from nationally representative panel studies to obtain credible and generalizable estimates of the effects.
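As a purely illustrative sketch of the comparison that such a design permits, the following Python code computes a simple two-group, two-period difference-in-differences contrast between an event group and a comparison group. The function name `did_estimate`, the data layout, and the toy numbers are hypothetical assumptions; the sketch does not adjust for confounders, which would be needed in practice because event occurrence among at-risk individuals is unlikely to be random.

```python
from statistics import mean


def did_estimate(rows):
    """Compute a 2x2 difference-in-differences contrast.

    `rows` is an iterable of (group, period, outcome) tuples with
    group in {"event", "comparison"} and period in {"pre", "post"}.
    This is an unadjusted sketch: no covariates, no standard errors.
    """
    cells = {}
    for group, period, outcome in rows:
        cells.setdefault((group, period), []).append(outcome)
    means = {key: mean(values) for key, values in cells.items()}

    # Change in the event group minus change in the comparison group.
    change_event = means[("event", "post")] - means[("event", "pre")]
    change_comparison = means[("comparison", "post")] - means[("comparison", "pre")]
    return change_event - change_comparison


if __name__ == "__main__":
    # Hypothetical well-being scores, invented for illustration only.
    toy_rows = [
        ("event", "pre", 7.0), ("event", "pre", 6.0),
        ("event", "post", 5.0), ("event", "post", 4.0),
        ("comparison", "pre", 7.0), ("comparison", "pre", 6.0),
        ("comparison", "post", 6.5), ("comparison", "post", 5.5),
    ]
    print(did_estimate(toy_rows))
```

The contrast subtracts the change observed in the comparison group from the change observed in the event group, so it attributes to the event only the change that goes beyond what at-risk individuals who remained in their prior status experienced over the same period.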
Conclusion
Almost all research questions in life-event studies are inherently causal. Significant advances in research design and statistical analysis for causal inference in observational studies have emerged in fields such as computer science, econometrics, epidemiology, and statistics. Psychology, like other behavioral and health sciences, is just beginning to adopt these advances. In this article, we draw on these developments to present a guide for making credible causal inferences in life-event studies. We are convinced that carefully considering and detailing decisions at each step of the research process is essential: It helps researchers and readers understand a study’s assumptions and limitations, facilitates precise research-question formulation, guides design and analysis choices, and improves the interpretation of estimated effects. We believe that adopting an explicit causal framework for causal research questions will enhance transparency in psychological research, support constructive scientific critique, and ultimately bring the field closer to the “truth” of scientific claims (Campbell, 1988).
Footnotes
Acknowledgements
We thank Claudia Crayen and Ana Tomova for their valuable feedback throughout the writing process. We acknowledge support by the Open Access Publication Fund of Freie Universität Berlin.
Transparency
