Abstract
With ongoing advances in text analysis methods and the increasing accessibility of parliamentary documents, the range of available tools for legislative scholars has increased massively over the past years. While the potential for comparative studies is huge, researchers can easily overlook the pitfalls associated with analyzing these documents. Against this background, I assess which theoretical considerations need to be carefully thought through before any (legislative) text analysis: I show that the definition and conceptualization of the unit of analysis, such as a speech or a bill, can vary substantially depending on the research interest. Furthermore, I discuss how the nested structure of legislative behavior and the data generating process influence our theoretical assumptions about parliamentary behavior. Based on these concepts, I derive some recommendations for the theoretical approach to quantitative textual analyses of parliamentary documents.
Introduction
In recent years, the field of legislative research has experienced significant advances in text analysis methods (Lauderdale and Herzog, 2016; Rheault and Cochrane, 2020) as well as in the accessibility of parliamentary documents (Schwalbach and Rauh, 2021; Sebők et al., 2023). This has provided an opportunity for comparative studies, which can offer a deeper understanding of legislative behavior across countries. However, the analysis of legislative text documents can be challenging, and scholars must be aware of the associated pitfalls. In this study, I address these challenges by pointing towards concepts that systematize known theoretical parameters that influence parliamentary speech behavior. The aim is to structure theoretical factors as well as to provide a better framework for methodological implementation. By showing that these conceptual choices have an impact on the results, I aim to provide guidance especially for researchers new to quantitative text analysis of parliamentary documents.
To this end, I assess three theoretical concepts that are crucial for analyzing legislative documents: the nested structure of legislative procedures, the definition of the unit of analysis, and the data generating process. If these are taken into account when formulating theoretical expectations, methodological implementations can produce meaningful outcomes. Based on these concepts, the paper derives recommendations for the theoretical approach to quantitative textual analyses of parliamentary documents to improve the quality and reliability of studies of legislative behavior.
Theoretical framework
It is important to emphasize that previous studies of legislative behavior based on quantitative text analyses have by no means ignored the theoretical context. Rather, the theoretical foundation of previous studies often focuses on particular aspects (see Laver (2021) and Slapin and Proksch (2021) for extensive overviews). In this study, however, the focus is on the holistic nature of the legislative context. While the theoretical conceptualization of the research question and hypotheses of a quantitative text analysis should generally precede the methodological implementation, the two are never detached from each other and can inform one another (Grimmer et al., 2022). For this reason, the theoretical foundation or concept should be in a feedback relationship with the data collection process and the data analysis process (see Figure 1). It is exactly between these steps that the systematization of known theoretical foundations into concepts is supposed to inform measurement validity issues that “arise in moving between concepts and observations” (Adcock and Collier, 2001: p.530).
Figure 1. Feedback loop.
If existing data do not allow measuring or controlling for all relevant factors, the theoretical model needs to be revised, while keeping the theoretical assumptions consistent with previous research. The same feasibility check should be implemented for the analysis. It is also not always necessary to clearly differentiate between theory and methodology in this regard. Some decisions that can be framed as pre-processing decisions in a quantitative text analysis strongly express a theoretical approach and should thus only be taken if they are in line with the theoretical framework.
It is also important to note that the case selection determines “which contexts matter (or should matter), and under what circumstance” (Gerring, 1999: p.367). Some of the issues raised might not be applicable, while others might need to be spelled out very differently or in more detail. The following sections discuss three theoretical concepts in detail: the nested structure of legislative behavior, the conceptualization of the unit of analysis, and the data generating process. Appendix A illustrates an example of how all three concepts can be considered. They are implemented in a scaling example, using an extended version of the ParlSpeech dataset (Rauh and Schwalbach, 2020).
The nested structure of legislative procedures
The nested structure of parliamentary procedures can have two unintended consequences if it is not considered in the methodological implementation: First, important confounding variables may not be considered, which may affect the outcome of models based on the collected textual data. Second, there is a risk that it might lead to a lack of variance in the analysis. Figure 2 shows a very simplified version of the nested structure for the cases where a legislative speech or a bill proposal is the unit of analysis for a given study. In both cases, however, the nested structure could be broken down even further to a semantic level. For example, speeches can consist of different contributions with interruptions or of different paragraphs, which in turn consist of different sentences and words. Again, all entities are interdependent: Each day of debates is structured hierarchically, with sentences nested within utterances, which are in turn nested within discussions (Tzelgov, 2014).
Figure 2. Nested structure of parliamentary procedures.
The implications of some structures are rather easy to follow. For example, it is apparent that parliamentary speeches on a particular agenda item are not independent of each other. Lauderdale and Herzog (2016) have even accounted for this in the methodological implementation of scaling approaches with their Wordshoal model. With other structures, however, it is less obvious. The nested structure can be particularly difficult to conceptualize when different structures overlap and/or the hierarchical order is not clear. For example, parliamentary speeches and bills can be grouped by the type of procedure (e.g., government bill) or by the policy field (e.g., debate on immigration) (Bäck et al., 2019). Both can potentially be an important factor, and the extent to which both need to be controlled for in the analysis depends on the particular research context.
It is especially important to consider the nested structure when relevant variables vary only at a very high level of aggregation. While, for example, one or two legislative periods might comprise thousands of speech acts or bill proposals, there might be no variance in important variables like the government constellation. It may be the case, for example, that a government coalition is particularly prone to conflict or that it contains a party with a very specific policy profile. Even if the analysis is conducted at the level of speeches or bills, and thus many thousands of speeches or hundreds of bills are analyzed, the generalizability is severely limited if there is no variation in government. Additionally, if several periods are considered, other changes may co-occur at this level (e.g., the entry of a populist party), which are again difficult to disentangle in the analysis. Therefore, it is important to consider clusters at higher levels and their implications.
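The point about missing variance at higher levels can be checked before any model is estimated. The following is a minimal sketch in Python, not part of the original analysis: the records, the field names (period, government) and the helper cluster_variation are all hypothetical and only illustrate how one might count the distinct values of a cluster-level covariate relative to the number of units.

```python
from collections import defaultdict

# Hypothetical speech-level records. Each speech carries the legislative
# period it belongs to and a period-level covariate (the government
# constellation). Field names and values are purely illustrative.
speeches = [
    {"speech_id": 1, "period": 1, "government": "coalition A"},
    {"speech_id": 2, "period": 1, "government": "coalition A"},
    {"speech_id": 3, "period": 2, "government": "coalition A"},
    {"speech_id": 4, "period": 2, "government": "coalition A"},
]


def cluster_variation(records, cluster_key, covariate_key):
    """Count units, clusters, and distinct covariate values across clusters."""
    values_by_cluster = defaultdict(set)
    for record in records:
        values_by_cluster[record[cluster_key]].add(record[covariate_key])
    distinct_values = set()
    for values in values_by_cluster.values():
        distinct_values |= values
    return {
        "n_units": len(records),
        "n_clusters": len(values_by_cluster),
        "n_distinct_values": len(distinct_values),
    }


summary = cluster_variation(speeches, "period", "government")
# If only one distinct value exists, the covariate cannot account for
# differences between speeches, no matter how many speeches are observed.
has_variation = summary["n_distinct_values"] > 1
print(summary, has_variation)
```

In this toy example, four speeches spread over two periods still yield a single government constellation, so the number of speeches says little about how far the findings generalize beyond that government.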
Moreover, nested structures exist not only on a vertical axis but also on a horizontal (time) axis. For example, a speech held in parliament can be influenced by the preceding speeches (e.g., as a reaction to a verbal attack). However, the extent of the reaction will depend on various factors (such as the debate style or the number of speeches in between). Additionally, it is important to keep in mind that the nested structure originates from institutional differences. This institutional heterogeneity is due to endogenous decisions on how to structure parliamentary processes. For example, whether a bill proposal can be discussed in combination with other proposals also depends on whether a parliament usually schedules only one or several readings.
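The horizontal dependence between speeches can be made explicit in the data by attaching information about the preceding speech to each unit. The sketch below assumes a hypothetical record layout (debate_id, position, party) and a hypothetical helper add_previous_speaker; it only records the party of the previous speaker within the same debate and makes no claim about how such a context variable should enter a model.

```python
def add_previous_speaker(speeches):
    """Attach the party of the preceding speech within the same debate.

    The first speech of each debate receives None, because no earlier
    speech exists that it could react to.
    """
    last_party_by_debate = {}
    enriched = []
    ordered = sorted(speeches, key=lambda s: (s["debate_id"], s["position"]))
    for speech in ordered:
        previous_party = last_party_by_debate.get(speech["debate_id"])
        enriched.append({**speech, "previous_party": previous_party})
        last_party_by_debate[speech["debate_id"]] = speech["party"]
    return enriched


# Hypothetical input, deliberately not in chronological order.
speeches = [
    {"debate_id": 1, "position": 2, "party": "Opp"},
    {"debate_id": 2, "position": 1, "party": "Opp"},
    {"debate_id": 1, "position": 1, "party": "Gov"},
    {"debate_id": 1, "position": 3, "party": "Gov"},
]

enriched = add_previous_speaker(speeches)
previous_parties = [s["previous_party"] for s in enriched]
print(previous_parties)
```

Whether one lag is sufficient, or whether the number of speeches in between matters, remains a theoretical decision that depends on the debate style of the parliament under study.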
The unit of analysis
Once the importance of the nested data structure is recognized, the question of the adequate unit of analysis for the respective study arises immediately. While this may be clear in some cases, the unit can be difficult to define from a theoretical perspective, and it can be problematic from a methodological perspective to create a comparable database across different cases. The unit of analysis needs to be consistent with the theoretical concept, the empirical implementation, and the conclusions drawn from the analysis. Although a clear definition of the concepts is important for the analysis, many studies do not explicitly discuss what is meant by analyzing speeches or a bill. However, concentrating only on these two entities poses several hurdles.
In their systematic review of sentiment and position-taking analyses of parliamentary debates, Abercrombie and Batista-Navarro (2020, p.258) find that “speech appears to mean different things in different publications, and in some, it is not immediately clear just what the unit of analysis actually is.” Furthermore, differences between academic fields were identified with a tendency of computer scientists working at finer-grained levels and political scientists on a more aggregate level. This is important since for many scaling approaches it remains an open question what constitutes a sufficient number of documents and words (Proksch and Slapin, 2009). Thus, the theoretical conceptualization of the unit of analysis can be at odds with what is ideal for analyses from a methodological side. Moreover, “systematic text analysis methods perform better when calibrated to a specific speech setting, rather than collections of diverse settings” (Laver, 2021: p.30).
Furthermore, a common approach for the analysis of speeches is to combine all text from a speaker or a party for a certain point on the parliamentary agenda (Schwalbach, 2023). However, while this procedure seems straightforward, several challenges arise. A first challenge emerges if several topics are grouped under one agenda item in a parliamentary debate. In some cases, only government bill proposals are of interest (Martin and Vanberg, 2014). But what if they are not discussed independently? It cannot be traced a priori whether a speech refers to only one or several parts of an agenda point or all of them. For example, in a sentiment analysis, it may happen that both a government and an opposition politician speak particularly positively in a debate because one talks about a government bill and the other about an opposition amendment. Thus, it might be necessary to analyze the speech text differently for each item on the agenda. However, this may lead to the situation that one speech is more important for the analysis than another.
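How strongly the choice of aggregation shapes the resulting documents can be seen in a small sketch. The code below is an illustration under assumed field names (agenda_item, party, text) with a hypothetical helper aggregate_documents; it concatenates speech text by whichever grouping keys the theoretical concept calls for and is not the procedure of any cited study.

```python
from collections import defaultdict


def aggregate_documents(speeches, keys=("agenda_item", "party")):
    """Concatenate speech texts into one document per combination of keys."""
    grouped = defaultdict(list)
    for speech in speeches:
        group = tuple(speech[key] for key in keys)
        grouped[group].append(speech["text"])
    return {group: " ".join(texts) for group, texts in grouped.items()}


# Hypothetical speeches: party X speaks twice on item A1 and once on A2.
speeches = [
    {"agenda_item": "A1", "party": "X", "text": "tax reform"},
    {"agenda_item": "A1", "party": "Y", "text": "oppose reform"},
    {"agenda_item": "A1", "party": "X", "text": "more on tax"},
    {"agenda_item": "A2", "party": "X", "text": "immigration bill"},
]

by_item_and_party = aggregate_documents(speeches)
by_party = aggregate_documents(speeches, keys=("party",))
print(len(by_item_and_party), len(by_party))
```

The same four speeches produce three documents when grouped by agenda item and party, but only two when grouped by party alone. If one agenda item bundles several bills, neither grouping separates them, which is exactly the problem described above.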
Subsequently, the question arises whether the speeches on a parliamentary agenda item, which is discussed on different days, should theoretically be summarized as speeches under the same agenda item. Many parliaments require several readings for a regular legislative process. If a researcher decides to combine all speeches on a given legislative process for a scaling approach, there could be different groupings with other legislative procedures on each debate day resulting in a lack of comparability. Vice versa, the respective individual readings for a bill sometimes do not result in a comparable baseline either, since these can range from a simple presentation/introduction of the bill to a sequence of speeches lasting several hours (see speakers for procedure 2 in Figure 3).
Figure 3. Unit of analysis.
Moreover, on a lower level, the extent to which multiple segments of text are assigned to one speaker per agenda item in a parliamentary protocol may depend on the parliamentary debate style and the instructions given to the respective stenographic service. This brings us back to the essential question of what a speech is. It can be defined as the sequence of a documented spoken text, which always ends when the spoken text of another speaker begins. Such a definition is sometimes difficult in its strict interpretation when comparing different parliaments. For example, in some parliaments, the spoken words of the chair are documented, while in others, this is not the case. This can lead to a considerable fragmentation of speeches if there are several interruptions at the end of a speech to call attention to a time limit. Similarly, it is difficult to compare the text of different debate styles from different parliamentary traditions (Proksch and Slapin, 2012). Whereas in the British House of Commons, it is common for speakers to interrupt and interact with each other, in the German Bundestag speeches are read out one after the other and intervening questions need to be announced and approved.
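The consequences of the strict definition can be illustrated with a short sketch. The code assumes a hypothetical protocol format in which each utterance has a speaker, a role, and a text; the helper merge_utterances and the role label "chair" are illustrative assumptions, not features of any particular parliamentary record.

```python
def merge_utterances(utterances, ignore_roles=("chair",)):
    """Merge consecutive utterances of the same speaker into one speech.

    Utterances whose role is listed in ignore_roles are dropped first, so
    that a brief intervention by the chair does not split a speech. Passing
    an empty tuple applies the strict definition, under which a speech ends
    whenever any other speaker's text begins.
    """
    speeches = []
    for utterance in utterances:
        if utterance["role"] in ignore_roles:
            continue
        if speeches and speeches[-1]["speaker"] == utterance["speaker"]:
            speeches[-1]["text"] += " " + utterance["text"]
        else:
            speeches.append(
                {"speaker": utterance["speaker"], "text": utterance["text"]}
            )
    return speeches


# Hypothetical protocol excerpt with a time-limit reminder by the chair.
utterances = [
    {"speaker": "A", "role": "member", "text": "First part."},
    {"speaker": "Chair", "role": "chair", "text": "Your time is up."},
    {"speaker": "A", "role": "member", "text": "Final sentence."},
    {"speaker": "B", "role": "member", "text": "Reply."},
]

strict = merge_utterances(utterances, ignore_roles=())
merged = merge_utterances(utterances)
print(len(strict), len(merged))
```

Under the strict definition the excerpt yields four units, while ignoring the chair yields two speeches. Which of the two is appropriate depends on the research question and on how the respective stenographic service documents interruptions.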
Finding the right unit of analysis can be just as challenging for bill proposals. One example is so-called package bills, where different legislative acts are treated together. Furthermore, it is sometimes not trivial to determine which version of a given bill proposal is available as text and whether this is comparable between different types of bill proposals and across countries. In any case, the following fundamental questions always need to be answered for the theoretical conceptualization: What is the appropriate level of aggregation to reach the best balance between comparability and simplification? How comparable is the concept between cases, should we compare different units, and can we compare at all?
The data generating process
As a synthesis of the first two concepts, an approximation of the data generating process follows. How did the unit of analysis in the nested structure come about? The data generating process is a complicated relationship between actors’ preferences and the institutional context and will never be completely traceable for researchers. Asking which actors and structures influenced a certain parliamentary behavior, and what assumptions one holds about what happens when these change, is a good approximation. Thus, the three conceptual perspectives outlined should by no means be viewed in isolation from one another or strictly one after the other. Rather, they are mutually dependent and should be understood as an iterative process.
Figure 4 illustrates a simplified version of this connection. This sketch might look very simplistic at first. However, it becomes more complicated when the broad categories are spelled out in detail: For example, one of the important factors for the legislative behavior of parliamentary parties or speakers is the division into government and opposition (Hix and Noury, 2016). It influences many forms of legislative behavior from speech-making to voting as well as the parliamentary output such as the drafting of bill proposals. Nevertheless, while this division has proven to be important for many types of parliamentary behavior, the ideological position of a party is also relevant (Schwalbach, 2023). Furthermore, many other factors on the individual speaker level such as gender (Bäck et al., 2014) or on the party level (Bäck et al., 2019) can influence legislative behavior, and party rules for debates are themselves endogenous to strategic considerations (Proksch and Slapin, 2012). All these actor-centered factors need to be considered not only for the respective analyzed actor but also for the other actors with whom this actor interacts.
Figure 4. Data generating process.
In many (quantitative text) analyses of legislative behavior, the influence of institutional factors is considered as a central part or is even in focus. These effects are very diverse and include but are not limited to: the effect of the electoral system (Slapin and Proksch, 2008); the effect of parliamentary rules restricting or granting open access to the floor on the allocation of speaking time within government and opposition parties (Giannetti and Pedrazzani, 2016); constraints of legislative actors influencing the likelihood that these actors will take the lead in legislative agenda-setting (Bräuninger and Debus, 2009); differences in the procedure and usage of oral and written questions in parliament (Rozenberg and Martin, 2011). It is central to any analysis of legislative behavior to sort out which of these factors are relevant, what their effects are, and to what extent they are comparable across cases.
Furthermore, it is important to note that the positions and intentions of actors as well as the effects of the institutional context can have a time dimension. Martin (2004) finds that more conflictual bills are introduced by coalition partners at the end of the legislative cycle. This change of legislative output again influences the behavior of parliamentary actors. Knox and Lucas (2021, p.649) thus model political speech as a “stochastic process shaped by fixed and time-varying covariates, including the history of the conversation itself”. Moreover, parliamentary rules such as standing orders can change very quickly and asymmetrically in different parliaments (Sieberer et al., 2016). It is therefore extremely important to consider whether the institutional context is consistent for all identified cases or whether corresponding changes need to be considered for the analysis.
In the end, the data generating process is always a combination of what all involved actors want, all involved rules that restrict the actors, and all previous actions by all actors involved. Once again, before the implementation of a quantitative text analysis, it is important to ask what this means in detail. Furthermore, for comparative analyses, the data generating process needs to be comparable between cases and within cases between units. This framework and the aspects outlined above are generally applicable to all analyses of legislative behavior. However, they are especially important in the case of quantitative text analysis, where it is hard to get face validity benchmarks for many measures. Moreover, this is also an iterative process: As indicated in Figure 4, the measured legislative behavior or output is never an end point, but in turn has an impact on the previous aspects.
Conclusion and recommendations
This study aims to systematically point to the underlying theory-based decisions that should guide an analysis of legislative text. Against this background, three theoretical concepts have been highlighted, which should precede the development of any quantitative textual analysis: the nested structure of parliamentary procedures, the conceptualization of an appropriate unit of analysis, and the data generating process. Following the described theoretical concepts does not always mean that there is an objectively correct way of implementation. Instead, researchers need to make informed decisions, based on the available information, about which methodological implementation is best for the given research question. While the three concepts have been considered separately for illustrative purposes, all are related to one another and can have strong effects on one another.
This study naturally comes with limitations. While it aims to help researchers in sorting theoretical factors in broader concepts, it does not tell them which factors are relevant for a certain analysis. Furthermore, as mentioned above, this study does not provide any guidance on the essential decisions that need to be taken regarding the methodological implementation, especially for pre-processing the data. Thus, it is important to stress that the final impact and relevance is always case-dependent. Therefore, the concepts should be viewed as an overarching framework for approaching legislative data. This leads to some recommendations for the quantitative textual analysis of parliamentary behavior:
First, case knowledge is important to judge if and how cases are comparable: What are notable differences that can be discovered when applying the same approach to different cases and what are differences between cases that make the comparison difficult or impossible? Second, transparency is key. This means providing descriptions and justifications for all decisions, from theory-based concepts to making analysis scripts available for replication purposes. In particular for theory-based design choices such as the selection of the unit of analysis, decision trees (see Figure 5 as an example for choosing the unit of analysis for a scaling analysis of parties in parliaments) can help. These can be used both for the comparison of theoretical options and for the empirical validation of results.
Figure 5. Example decision tree.
Third, validate all results, using human judgment as a benchmark (Lowe and Benoit, 2013). It is important to keep in mind that a particular text analysis has probably never been conducted on a specific set of text documents before. Thus, especially when, for example, measures are aggregated across documents, validation is central (Grimmer et al., 2022). Fourth, a good consideration of the trade-off between over- and under-simplification is necessary: It is not helpful to come up with an explanation for every single unit. However, if you do aggregate you need to explain why expectations stay the same.
Fifth, from a measurement perspective, it is important to keep in mind that the “complexity of language implies that all methods necessarily fail to provide an accurate account of the data generating process used to produce texts” (Grimmer and Stewart, 2013). While assessing the data generating process in theory is still very important in this case, it is just as necessary to evaluate whether the data source used really is the best one to measure the concept of interest.
Supplemental Material
Supplemental Material - The role of theoretical concepts for analyzing legislative text data
Supplemental Material for The role of theoretical concepts for analyzing legislative text data by Jan Schwalbach in Research & Politics.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
Carnegie Corporation of New York Grant
This publication was made possible (in part) by a grant from the Carnegie Corporation of New York. The statements made and views expressed are solely the responsibility of the author.
