Sage Journals: Discover world-class research

Abstract

Studying political opinions of citizens stands as a fundamental pursuit for both policymakers and researchers. While traditional surveys remain the primary method to investigate individual political opinions, the advent of social media data (SMD) offers novel prospects. However, studies using SMD to extract individuals’ political opinions are few and differ greatly in their methodological approaches and levels of success. Recent studies highlight the benefits of analyzing individuals’ social media network structure to estimate political opinions. Nevertheless, current methodologies exhibit limitations, including the use of simplistic linear models and a predominant focus on samples from the United States. Addressing these issues, we employ an unsupervised Variational Autoencoder (VAE) machine learning model to extract individual opinion estimates from SMD of N = 276 008 German Twitter (now called “X”) users, compare its performance to a linear model and validate model estimates on self-reported opinion measures. Our findings suggest that the VAE captures Twitter users’ network structure more precisely, leading to higher accuracy in predicting following decisions and stronger associations with self-reported political ideology and voting intentions. Our study emphasizes the need for advanced analytical approaches capable of capturing complex relationships in social media networks when studying political opinion, at least in non-US contexts.

Keywords

social media, political opinion estimation, network structure, machine learning

Introduction

Monitoring the political opinions of the general public is a valuable asset in today’s world. Tracking individuals’ political views and attitudes enables policymakers to develop tailored regulatory measures and supports researchers in studying political trends and factors that shape political opinions and behavior (Dong & Lian, 2021; Schober et al., 2016). Today, various methods exist to study public- and individual-level political opinions. Although self-reports integrated into surveys still dominate the opinion research space (Berinsky, 2017), the use of publicly available social media data (SMD) has gained extensive popularity in recent years (Rousidis et al., 2020). Arguably, SMD offers key benefits: Spontaneous expressions of opinions can be captured, accessed and analyzed in real time without the limitations of predefined response options (e.g., Reveilhac et al., 2022; Schober et al., 2016). Further, the pool of available participants is large, with a total of 4.8 billion unique user identities and 400 million active users worldwide across the major platforms as of July 2023 (Kemp, 2023). Although SMD represents a promising tool in social science, the number of studies using SMD to study sociopolitical issues, although increasing, remains relatively low compared to other data sources like self-reports (cf. Dong & Lian, 2021), especially in countries other than the US. Additionally, existing studies differ greatly in their overall study aim, methodologies and study characteristics (Dong & Lian, 2021). The existing work extracting information about political opinions (e.g., attitudes and ideology) and behavior of users from SMD on an individual level can be broadly categorized based on what type of SMD is analyzed. Most of the work either deployed text analysis or network analysis (or in rare cases a combination of both).

Text analysis aims to transform textual (social media) content into quantitative data using (automated) lexical analyses (Schober et al., 2016). A variety of methods exists within text analysis (for an overview see Eichstaedt et al., 2021), all relying on the theoretical notion that individuals freely share their thoughts and opinions (e.g., in posts) and that derived estimates are based on psycho-linguistic properties of the texts (e.g., Kumar & Sebastian, 2012). A popular text analysis method to analyze SMD is sentiment analysis. Here, researchers first cluster semantically similar words within a text (e.g., social media posts) into categories or contexts which are then used to further categorize the content into positive, negative or neutral (and infrequently more detailed) sentiments. The contextualization of words in these models to extract sentiments is done using dictionary-based approaches, word embedding models or (nowadays increasingly popular) transformer-based models (Widmann & Wich, 2023). Finally, the estimated sentiment of posts is used to predict sociopolitical outcomes (see Skoric et al., 2020, for an overview of recent studies). Notably, not all studies on SMD explicitly model sentiments. Some works instead use machine learning (ML) models to predict sociopolitical outcomes directly from the contextualized words (Skoric et al., 2020). Although text analysis, specifically sentiment analysis, has yielded promising results in the past (e.g., Chung & Zeng, 2016; Lansdall-Welfare et al., 2012; Tumasjan et al., 2010), recent meta-analyses and reviews have revealed significant divergences in the accuracy with which such studies predict sociopolitical outcomes (e.g., Rousidis et al., 2020; Skoric et al., 2020). This may be due not only to large differences in the analyzed political outcomes, data sources and political systems but also to the model types used, with ML models yielding the highest prediction performance of political opinions and sociopolitical outcomes (Skoric et al., 2020).

Further, recent articles also showed that so-called network analysis approaches often outperformed text analysis approaches in estimating (or predicting) political opinions, with the highest accuracy observed when using both approaches in combination (e.g., Skoric et al., 2020). Based on these results and as argued by others, network analysis presents a promising way to study political opinions and sociopolitical outcomes using SMD (e.g., Gayo-Avello, 2013; Kwak & Cho, 2018; Livne et al., 2021; Pallavicini et al., 2017), which is why we will focus on this approach in our work. Network approaches use the connections (i.e., follows) and interactions (i.e., sharing, likes and comments) between users, so-called relational edges, to extract information about individuals’ political opinions and behavior. Such approaches rely on the theoretical notion that interactions and following decisions of users represent signals of their political opinions (e.g., attitudes and ideology) (e.g., Barberá et al., 2015). In line with that, research has shown that individuals tend to be more likely to connect with individuals who are similar to them, for instance, those who share their (political) opinions (McPherson et al., 2001), a phenomenon known as homophily. Especially on social media, such a tendency can be strengthened by friend recommenders and other algorithms (Aral, 2020). Although the homophily assumption describes a general tendency and does not imply that all individuals only connect with like-minded others, most empirical studies have shown that—much like in the real world—like-minded users (e.g., similar opinions) are more often than not connected on social media and, consequently, are also more likely to form homogeneous opinion networks (e.g., Aiello et al., 2012; Cinelli et al., 2021; Figeac & Favre, 2023; Khanam et al., 2023; Lee & Brusilovsky, 2010; McPherson et al., 2001; Pallavicini et al., 2017). 
Building on the homophily assumption, we also aim to estimate individuals’ political opinions from their social networks.

Although manifold approaches and model classes exist to analyze social networks, an essential method is dimensionality reduction (DR). DR comprises a family of tools that condense the network structure (i.e., users and their connections) to create a more compact and informative representation of the network (Chikhi et al., 2007; Nishana & Surendran, 2013). In previous research, DR has been used to enable network models to work with high-dimensional SMD through creating embeddings (e.g., Chikhi et al., 2007; Grover & Leskovec, 2016; Yan et al., 2007) or was indirectly built into (advanced) models (e.g., layer transformations & pooling in Graph Convolutional Neural Networks, Bronstein et al., 2016; Zhang et al., 2019). However, DR methods have also been used standalone in network analysis for user clustering and community detection (e.g., Al-Omairi et al., 2021; Zarzour et al., 2018), uncovering hidden structures (e.g., Chikhi et al., 2007) or extracting global network characteristics like religious beliefs (e.g., Kurucz et al., 2008). Notably, and relevant to our approach, many studies on estimating (individual-level) political opinions from SMD have also used standalone DR models. For instance, Barberá et al. (2015) successfully used DR to analyze ideological segregation and cross-ideological communication in social networks. Wojcieszak et al. (2022) used DR to analyze ideological congruency in the interaction of social media users with politicians and news organizations, while Barberá (2015a) and Bond and Messing (2015) used DR to estimate the political ideology of Facebook and Twitter users (Twitter has since been renamed “X”; we use the name “Twitter” throughout, as the platform was called during data collection); these estimates correlated with self-reported political opinions and were predictive of voter turnout.

Although studies using standalone DR have yielded promising results, they exhibit a few limitations that we will address in the present study. First, the DR techniques used so far to extract individual-level political opinions from SMD (e.g., ideology, beliefs and attitudes) tend to be relatively simplistic. That is, studies mostly used models such as Correspondence Analysis (CA) or linear latent space models (e.g., Barberá, 2015b; Barberá et al., 2015; Eady et al., 2019; Tausanovitch & Warshaw, 2017). However, these models focus on linear relationships in the data and thus are likely to perform poorly when more complex, non-linear relationships exist (e.g., De Backer et al., 1998; Nanga et al., 2021). We argue that, similar to SMD sentiment analysis, where more complex ML models increased estimation performance (Anjaria & Guddeti, 2014; Skoric et al., 2020), network-based analysis of SMD would also benefit from using complex ML models. This is because individual political opinions (such as ideology) are complex phenomena, and a linear combination of factors (i.e., followed accounts) seems unlikely to explain this complexity. Social networks, including those on platforms like Twitter, can be understood as complex systems (Boccaletti et al., 2006; Tunstel et al., 2021). As such, the different nodes (i.e., Twitter accounts) in a social network may interact, communicate, provide feedback to each other, and adjust accordingly. Looking at social networks from the angle of cybernetics, causal feedback loops continuously impact nodes in their thinking, attitudes, and behaviors (Tilak et al., 2023). In such networks, synergetic effects can also occur between nodes, alongside linear and non-linear effects. Therefore, posts, likes, shares, etc. from the various accounts a user follows might exhibit synergetic effects that only occur in a specific constellation, and/or complex non-linear, non-additive effects might shape the opinions of the user, including ideological views. Such effects, however, are unlikely to be captured by simplistic linear models. Complex ML models therefore seem more fitting to examine complex network structures and ultimately individuals’ opinions (Silva & Zhao, 2016). Notably, however, they have not been widely applied to extract individual-level public opinions from SMD.

Further, most studies so far have been conducted in the US political context (e.g., Barberá et al., 2015; Bond & Messing, 2015; Brito et al., 2021; Eady et al., 2019). It is questionable whether their results would translate to other parts of the world, since estimating individual political opinions of users in two-party political systems (such as the US) seems much simpler than in more complex, multiparty systems like those of most European countries (Traber et al., 2023; Wagner, 2021). Further, previous studies analyzing cross-country network compositions revealed that European Twitter users usually have more ideologically heterogeneous networks than US users, making it potentially harder to pinpoint their political opinion from their network structure (Barberá, 2015b). To our knowledge, only one study by Barberá (2015b) explicitly estimated individuals’ political opinions from social media networks in countries other than the US. However, validation of the opinion estimates was limited to correlating the estimates for political elites (i.e., political accounts followed by individual users) with expert ratings. Therefore, the feasibility and accuracy of estimating individuals’ political opinions from social media networks in non-US contexts are still understudied.

In the current study, we aim to address the mentioned limitations. Our primary research objective is to examine if general political opinions of non-US social media users can be estimated based solely on their Twitter network structure. Further, we aim to test if a complex, non-linear DR model outperforms a widely used linear DR model in extracting these opinions. Specifically, we employ an unsupervised Variational Autoencoder (VAE) ML model and a linear CA. Thereby, we obtain a point estimate for each social media account in latent space which we assume to represent their general political opinion. To test the capability of both models to estimate individuals’ political opinions, we first compare their DR performance on the same dataset and inspect the distribution of point estimates to validate how closely they resemble expected opinion distributions (from left-leaning to right-leaning). In the final validation step, we correlate the models’ point estimates with self-reported opinion data (symbolic ideology and voting intention) for a sub-sample of users from a dataset surveyed in 2021.

Methods

Data Collection and Pre-Processing

To facilitate individual-level opinion estimation, we analyzed the connections of regular, politically interested Twitter users with political accounts which present a clear political stance. For the most part, we followed the approach of Barberá et al. (2015). First, we identified the Twitter accounts of German politicians and their followers. To accomplish this, we created a list of German politicians in office in all sixteen federal states or the national parliament and matched each politician with their Twitter account. A detailed description of this procedure can be found in the supplemental materials, I.I. In total, we retrieved N = 1434 active Twitter accounts of German politicians. Next, using their Twitter user IDs, we queried the complete follower list of each account. Data collection took place in July 2022. In the following, we refer to the retrieved followers of political accounts as users. Similar to previous studies and to improve data quality (e.g., Gayo-Avello, 2013; Kwak & Cho, 2018), we pre-processed the user lists by removing politicians following each other and user accounts created after our survey data collection (see supplemental materials, I.I for full procedure). After the initial pre-processing, the list of users following political accounts comprised N = 13 306 769 unique users. Next, analogous to Barberá et al. (2015), we only kept users who followed at least 10 political accounts to reduce the number of inactive accounts and focus on politically interested users. The final sample after these steps comprised N = 276 008 unique users. Although this final list only encompassed a subset of all followers obtained in the first step, we expected this reduction based on similar following rates of political accounts (number of users following ten or more political accounts) in social media networks reported in previous research (e.g., Barberá et al., 2015; Wojcieszak et al., 2022).
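The filtering step described above can be illustrated with a minimal sketch. The input format (a dictionary mapping each political account to its follower IDs) and the function name are hypothetical and not taken from the study; the removal of accounts created after the survey period is omitted here.

```python
from collections import Counter

def politically_interested_users(follower_lists, min_follows=10):
    """Return users who follow at least `min_follows` political accounts.

    follower_lists: dict mapping a political account ID to an iterable of
    follower IDs (hypothetical input format, assumed for illustration).
    """
    politician_ids = set(follower_lists)
    counts = Counter()
    for followers in follower_lists.values():
        # Ignore politicians following each other; count each follow once.
        for user in set(followers) - politician_ids:
            counts[user] += 1
    return {user for user, n in counts.items() if n >= min_follows}

# Toy example with a threshold of 2 instead of 10.
toy = {"p1": ["u1", "u2", "p2"], "p2": ["u1", "u3"], "p3": ["u1", "u2"]}
kept = politically_interested_users(toy, min_follows=2)
print(sorted(kept))  # ['u1', 'u2']
```

In the toy data, u3 follows only one political account and p2 is itself a political account, so both are dropped.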

Analogous to Barberá et al. (2015), we considered the political opinion of a user as a position (i.e., point estimate) in a multidimensional, latent space which can be obtained from the following structure of Twitter users with the respective political accounts. For our models to calculate these positions, we first arranged users and political accounts in an n × m adjacency matrix, representing the following structure of each Twitter user i ∈ {1, ..., n} (row) for a target political account j ∈ {1, ..., m} (column), with Y_ij = 1 indicating a following decision and Y_ij = 0 otherwise. Doing so, our adjacency matrix represented a bipartite, directional graph structure (see Tabassum et al., 2018) with initial network matrix dimensions of Y = [276 008 × 1434]. After a further check of this following matrix, we removed seven additional political accounts that were either no longer followed by any user or had previously been mistakenly identified as accounts of politicians, yielding the final following matrix Y = [276 008 × 1427]. To validate our models on out-of-sample data, we divided the final matrix into a separate training and test set (90% train and 10% test), resulting in Y_train = [248 407 × 1427] and Y_test = [27 601 × 1427] matrices. Further information about the adjacency matrix and dataset splitting procedure can be found in the supplemental materials, subsection I.I.
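The matrix construction and the 90/10 split described above can be sketched as follows. This is an illustration under assumptions: the edge-list input, the function names and the random row split are hypothetical, and the actual splitting procedure is documented in the supplemental materials. At the full sample size a sparse matrix format would likely be preferable to the dense array used here.

```python
import numpy as np

def build_following_matrix(follows, user_ids, account_ids):
    """Binary n x m matrix Y with Y[i, j] = 1 if user i follows account j."""
    row = {u: i for i, u in enumerate(user_ids)}
    col = {a: j for j, a in enumerate(account_ids)}
    Y = np.zeros((len(user_ids), len(account_ids)), dtype=np.uint8)
    for user, account in follows:
        Y[row[user], col[account]] = 1
    return Y

def train_test_split_rows(Y, test_fraction=0.1, seed=0):
    """Randomly assign whole users (rows) to a training and a test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(Y.shape[0])
    n_test = int(round(test_fraction * Y.shape[0]))
    return Y[order[n_test:]], Y[order[:n_test]]

# Toy edge list of (user, political account) pairs.
follows = [("u1", "p1"), ("u1", "p2"), ("u2", "p2"), ("u3", "p3")]
Y = build_following_matrix(follows, ["u1", "u2", "u3"], ["p1", "p2", "p3"])

# Repeat the toy rows so that a 10% test set is non-empty.
Y_train, Y_test = train_test_split_rows(np.tile(Y, (10, 1)))
print(Y_train.shape, Y_test.shape)  # (27, 3) (3, 3)
```

With n = 276 008 rows, rounding 10% of n gives 27 601 test rows, which agrees with the reported matrix sizes.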

For our final model checks and the relation of point estimates to self-reported political opinion and behavioral intentions, we used survey data (collected between December 2020 and February 2021) from a study conducted by one of the authors. In total, N = 780 individuals reported on different personality and political opinion items. Relevant to the current study, this survey assessed self-reported symbolic political ideology via the one-item left-right self-placement and voting intentions. Participants were also asked to provide their Twitter user name voluntarily (if they had an account). Of all participants in the study, N = 173 provided a (valid) user name. In the present study, we used the self-report data of these participants to validate our models and refer to it as the self-report dataset in the following. To create this dataset, we first used the reported Twitter user names and queried the Twitter API to retrieve their user IDs. In total, N = 163 unique accounts could be retrieved. Then, we matched user IDs to the IDs in the follower lists of political accounts acquired previously. Since one of our intended models (CA) only works if an individual user follows at least one political account (see Model Selection and Description), we removed all users not following any political account (i.e., rows containing only zeros). After applying these procedures, the self-report dataset matrix Y_self-report = [119 × 1427] was obtained. Further information about the self-report dataset (including deployed scales, study procedure and ethics approval) can be found in the supplemental materials, subsection I.III and in Sindermann et al. (2021, 2022, 2023).

Model Selection and Description

As mentioned in the Introduction, manifold network analysis approaches could be applied to our research questions. However, based on our data structure (bipartite, directed graph) and similar to many previous studies on individual-level political opinion (e.g., Barberá, 2015a; Barberá et al., 2015; Bond & Messing, 2015; Eady et al., 2019; Kurucz et al., 2008; Wojcieszak et al., 2022), we used standalone DR models. Specifically, to test whether individual political opinions of German Twitter users can be inferred solely from their network structure and whether a more complex ML model would outperform a simple linear model, we deployed two DR models: Correspondence Analysis (CA) and Variational Autoencoder (VAE).

CA is conceptually related to Principal Component Analysis and can be used to analyze relationships between multiple categorical variables. Specifically, CA uses linear combinations (i.e., transformations) of the original input data to project the rows and columns onto a new, lower-dimensional subspace. Similar to previous studies (e.g., Barberá et al., 2015; Eady et al., 2019), we used CA to reduce the full following matrix Y to a lower-dimensional subspace. Afterward, we analyzed its DR capabilities by reversing this transformation to reconstruct (i.e., approximate) the input data from this subspace and compared it to the original input data. We further checked whether the projected subspace row coordinates represented Twitter users’ opinion point estimates and whether the column coordinates represented the overall political positioning of the respective political accounts. The CA was set to reduce the following matrix to two latent dimensions. This was done for several reasons. First, we were interested in obtaining a single point estimate on one latent dimension for each user that is expected to represent their political opinion. In a CA, the first dimension extracted has the highest eigenvalue (i.e., captures the highest amount of variance) of all dimensions, which we assumed to represent overall political opinion (see Barberá et al., 2015, for a similar approach and in-depth description). Second, fitting a CA on large datasets can be computationally expensive. Thus, mapping inputs to fewer dimensions reduces the computation time substantially (e.g., Halko et al., 2011). Lastly, using two dimensions simplifies data visualization by creating a 2D plot of the point estimates, even though we were only interested in the first CA latent dimension. A more detailed description of the CA and its working principles can be found in the supplemental materials, subsection I.II and in Barberá et al. (2015).
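To make the projection and its reversal concrete, the following is a minimal sketch of a textbook CA computed via a singular value decomposition of the standardized residuals. It is not the implementation used in the study (see the supplemental materials for that); the function name and the 0.5 threshold used to binarize the reconstruction are assumptions. The sketch requires that no row or column of Y sums to zero, which is why users following no political account cannot be handled by CA.

```python
import numpy as np

def correspondence_analysis(Y, n_dims=2):
    """Textbook CA via SVD; returns coordinates, inertias and a reconstruction."""
    Y = np.asarray(Y, dtype=float)
    total = Y.sum()
    P = Y / total                      # correspondence matrix
    r = P.sum(axis=1)                  # row masses (users)
    c = P.sum(axis=0)                  # column masses (political accounts)
    expected = np.outer(r, c)          # independence model
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    U, sing, Vt = U[:, :n_dims], sing[:n_dims], Vt[:n_dims]
    rows = (U * sing) / np.sqrt(r)[:, None]   # principal row coordinates
    cols = Vt.T / np.sqrt(c)[:, None]         # standard column coordinates
    # Reverse the transformation: approximate Y from the retained dimensions.
    Y_hat = total * expected * (1.0 + rows @ cols.T)
    return rows, cols, sing ** 2, Y_hat

Y = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]])
rows, cols, inertia, Y_hat = correspondence_analysis(Y, n_dims=2)
Y_rec = (Y_hat >= 0.5).astype(int)   # assumed threshold for binarization
user_estimates = rows[:, 0]          # first dimension as opinion estimate
```

The squared singular values (principal inertias) are returned in descending order, so the first column of `rows` corresponds to the dimension with the highest eigenvalue.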

Like CA, a VAE is a DR model. However, in contrast to CA, VAEs compress data with greater complexity using non-linear and probabilistic mappings. VAEs are unsupervised ML models based on neural-network autoencoders. They comprise an encoder network for projecting data into a latent distribution and a decoder network for reconstructing the original input from the latent codes. The encoder typically consists of multiple, fully connected layers with the last layer approximating the latent representations (posterior) using a multivariate Gaussian distribution from which the individual latent codes of users and political accounts (much like the projected row and column coordinates in the CA) can be sampled. The decoder part of the network then reconstructs (i.e., approximates) the original input matrix from the sampled latent codes. From an intuitive standpoint, VAEs learn to represent the essence of a dataset (i.e., network structure) in a condensed and structured way. Using a probabilistic, non-linear, neural network-based DR approach, their encoding of users and political accounts should lead to a more nuanced representation and differentiation of political stances in a multiparty system compared to a linear, deterministic DR model like CA. Although adaptations of VAE for network-based data (geometric ML) exist (Variational Graph Autoencoder, see Kipf & Welling, 2016), we used a regular VAE which does not explicitly preserve relational structures between users but only between users and political accounts. This decision was made based on our data structure and the overall goal of estimating users’ opinions from their decisions to (not) follow political accounts, analogous to previous studies. To enable model comparison with the CA, we mapped inputs to two latent dimensions in the VAE as well. We assumed that one dimension would represent individuals’ and politicians’ overall political opinions. The VAE used in our study comprised three hidden, fully connected neural layers each for the encoder and the decoder. The general model architecture is depicted in Figure 1; a detailed description of the VAE (latent code extraction, differences to CA, training procedure and hyperparameter setting) can be found in the supplemental materials, subsection I.II.
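The encoder, sampling and decoder steps described above can be sketched as a single untrained forward pass. Everything specific in this sketch is an assumption made for illustration: the hidden layer sizes, the ReLU activations, the Bernoulli reconstruction loss and the weight initialization are hypothetical, and no training loop is shown. The actual architecture and hyperparameters are given in Figure 1 and the supplemental materials.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layers(sizes):
    """Small random weights for a fully connected network (untrained)."""
    return [(rng.normal(0.0, 0.05, size=(a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def dense_stack(layers, x):
    """Apply ReLU after every layer except the last, which stays linear."""
    for k, (W, b) in enumerate(layers):
        x = x @ W + b
        if k < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

def vae_forward(Y, encoder, decoder, latent_dim=2):
    """One forward pass: encode, sample latent codes, decode, compute loss."""
    h = dense_stack(encoder, Y)
    mu, log_var = h[:, :latent_dim], h[:, latent_dim:]
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps          # reparameterization trick
    p = 1.0 / (1.0 + np.exp(-dense_stack(decoder, z)))  # follow probabilities
    tiny = 1e-9
    recon = -(Y * np.log(p + tiny) + (1 - Y) * np.log(1 - p + tiny)).sum(axis=1)
    kl = -0.5 * (1 + log_var - mu ** 2 - np.exp(log_var)).sum(axis=1)
    return mu, p, float((recon + kl).mean())      # loss = negative ELBO

m = 8                                              # toy number of accounts
encoder = init_layers([m, 64, 32, 16, 2 * 2])      # 3 hidden layers, then [mu, log_var]
decoder = init_layers([2, 16, 32, 64, m])          # 3 hidden layers, then logits
Y = rng.integers(0, 2, size=(5, m)).astype(float)  # toy following matrix
mu, p, loss = vae_forward(Y, encoder, decoder)
```

Training would minimize the returned loss with a gradient-based optimizer in an automatic differentiation framework; the sketch only shows how two latent dimensions and a reconstruction of the following matrix arise from the same pass.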

Figure 1.

VAE architecture and analysis procedure. Note: (a) Depiction of the general VAE architecture. Left part (red bars) shows the input and encoder model. Right part (yellow bars) shows the decoder model. (Hidden) layer sizes (data shape) of the VAE are depicted under the respective bars. Green box depicts the sampling layer, in which the user and political account point estimates are extracted (sampled) from the two latent dimensions. (b) Study analysis procedure and workflow for VAE and CA. Detailed information about the VAE model and latent representations can be found in the supplementary material, subsection I.I and I.II.

Analysis Procedure

Both the CA and VAE were first trained on the training dataset. For an initial check of how accurately the models compressed the original network (DR performance), we reconstructed the following matrix from the calculated latent dimensions and compared it with the original matrix of the training dataset. Since both the original and reconstructed matrix contained binary data (following status), we used typical classification performance metrics to check model performance. In detail, we calculated Precision, Recall, F1 score, Matthews correlation coefficient (MCC) and Balanced Accuracy (BACC). The latter two are particularly well-suited metrics for sparse, imbalanced datasets, common in many network analysis applications (Chen et al., 2024). Imbalance also applies to our study’s data, since only ∼1.7% of all values in the training and test datasets represented “follows” (i.e., ones). Although we did not apply a data resampling technique before model fitting to balance the distribution, MCC and BACC still allowed us to evaluate the models’ performance and capability to represent the minority class appropriately. Also, we assumed follow decisions to be as important (i.e., as informative a signal) as non-follow decisions for estimating individual political opinions. We were thus interested in the classification performance for both follows and non-follows, which is better represented by MCC and BACC than by the aforementioned metrics. Additionally, we calculated the overall proportion of correctly predicted cases (PCP) and the Brier score to enable performance comparison with previous studies (see supplemental materials, subsection II.II). As an uninformed classification baseline and to benchmark both the CA and VAE, we further calculated a Naive Classifier (NCF), which always predicts the positive majority class (“no follow”). After training and evaluating the models on the training set via the reconstruction approach, we applied and evaluated the trained models on the test set using the same classification performance metrics.
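The following sketch illustrates why MCC and BACC are more informative than Precision, Recall and F1 on such imbalanced data. It treats non-follows (zeros) as the positive class, as in the text; the function name, the toy data and the convention of returning 0 when a denominator is zero are assumptions made for illustration.

```python
import numpy as np

def binary_metrics(y_true, y_pred, positive=0):
    """Precision, Recall, F1, MCC and BACC with `positive` as positive class."""
    t = np.asarray(y_true).ravel() == positive
    p = np.asarray(y_pred).ravel() == positive
    tp, tn = float(np.sum(t & p)), float(np.sum(~t & ~p))
    fp, fn = float(np.sum(~t & p)), float(np.sum(t & ~p))

    def ratio(a, b):
        return a / b if b > 0 else 0.0   # assumed convention for 0/0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    f1 = ratio(2 * precision * recall, precision + recall)
    mcc = ratio(tp * tn - fp * fn,
                np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"precision": precision, "recall": recall, "f1": f1,
            "mcc": mcc, "bacc": (recall + specificity) / 2}

# Toy data: 98 non-follows and 2 follows.
y_true = np.array([0] * 98 + [1] * 2)
naive = np.zeros_like(y_true)            # always predicts "no follow"
model = y_true.copy()
model[0] = 1                             # one non-follow predicted as follow
model[99] = 0                            # one follow predicted as non-follow

ncf = binary_metrics(y_true, naive)
mdl = binary_metrics(y_true, model)
print(ncf["f1"] > mdl["f1"], ncf["bacc"], ncf["mcc"])  # True 0.5 0.0
```

In this toy case the naive classifier reaches a slightly higher F1 than the informative model, while its BACC of 0.5 and MCC of 0 show that it carries no information about the minority class.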

Next, we visually inspected our models’ point estimates and latent dimensions. First, we inspected the overall distribution of user point estimates on the two latent dimensions. As mentioned, the latent dimensions in the models should represent a condensed version of the overall structure in the original network matrix. Therefore, we expected one of the two dimensions in our models to reflect the general political opinions of individual users. To examine this, we plotted the point estimates of each user (training and test set) on the two latent dimensions and highlighted them based on the percentage of political accounts from a respective party the user followed in relation to the total number of followed accounts. To do so, we first matched all political accounts in our dataset to their respective party membership for all parties that were part of the German national parliament at the time of the study: Parties on the left side of the political spectrum: Alliance ’90/Greens (Green party), Social Democratic Party (SPD) and The Left (Linke); and parties on the right side: Alternative for Germany (AfD), Christian Democratic Union/Christian Social Union (CDU/CSU), Free Democratic Party (FDP). The left-right categorization was based on the Open Expert Survey 2021 (OES), in which experts rated each party for their political stance on different political dimensions and scales (Jankowski et al., 2022). We used data from this work instead of the often used Manifesto report (Lehmann et al., 2023) because the OES surveyed considerably more experts, potentially leading to more robust ratings of parties’ political positions. As described, we expected that the more accounts from a specific party users follow (percentage-wise), the more likely they are to hold opinions similar to those of the party, based on the homophily assumption (McPherson et al., 2001), which was introduced in detail above and has been investigated and supported by previous research (e.g., Aiello et al., 2012; Barberá, 2015a; Barberá et al., 2015; Bond & Messing, 2015; Cinelli et al., 2021; Lee & Brusilovsky, 2010; McPherson et al., 2001; Pallavicini et al., 2017; Wojcieszak et al., 2022). Therefore, individuals following a relatively high number of politicians from left-leaning parties (Linke, Green party, SPD) were expected to be on the opposite end of the spectrum compared to individuals following right-leaning parties (AfD, CDU/CSU, FDP). The total and average counts of followed accounts per party and dataset as well as the final vote shares of the German federal election in 2021 (Wilko, 2021) are depicted in Table 1. Vote shares were added to enable a comparison of the party following distribution in our datasets against the respective party popularity in the German population.
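The per-user party percentages used to highlight the point estimates can be computed as in the following sketch. The function name and the toy party assignment are hypothetical; only accounts matched to one of the listed parties enter the totals here, which is an assumption about how unmatched accounts would be handled.

```python
import numpy as np

def party_follow_shares(Y, account_party, parties):
    """Share of each user's followed accounts that belongs to each party."""
    Y = np.asarray(Y)
    account_party = np.asarray(account_party)
    counts = np.stack([Y[:, account_party == party].sum(axis=1)
                       for party in parties], axis=1)
    totals = counts.sum(axis=1, keepdims=True)
    # Users following none of the listed parties get a share of zero.
    return np.divide(counts, totals,
                     out=np.zeros(counts.shape, dtype=float),
                     where=totals > 0)

parties = ["Linke", "SPD", "AfD"]
account_party = ["SPD", "SPD", "AfD", "Linke"]   # party of each column
Y = np.array([[1, 1, 0, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 0]])
shares = party_follow_shares(Y, account_party, parties)
print(shares[1])  # [0.5 0.  0.5]
```

Each row of `shares` can then be used to color a user’s point estimate in the two-dimensional latent space, for example by the party with the largest share.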

Table 1.

Overview of party accounts followed per dataset.

Party	Train: % follows	Train: Mdn (IQR)	Test: % follows	Test: Mdn (IQR)	SR: % follows	SR: Mdn (IQR)	Federal election: % votes
Linke	11.3	2 (1, 4)	11.4	2 (1, 4)	10.0	2 (1, 5)	4.9
SPD	27.0	5 (1, 8)	26.9	5 (1, 8)	33.3	3 (2, 5)	25.7
Greens	24.9	4 (1, 7)	24.9	4 (1, 7)	34.9	2 (1.5, 5)	14.8
FDP	11.0	2 (1, 3)	11.0	2 (1, 3)	12.0	2 (1, 5)	11.5
CDU/CSU	19.3	4 (1, 6)	19.1	4 (1, 6)	7.8	2 (1, 3)	24.1
AfD	6.5	3 (1, 9)	6.7	3 (1, 9)	2.0	3.5 (2, 10.3)	10.3

Note. % follows = Percentage of follows relative to all followed party accounts, excluding non-follows. Mdn = Median number of party accounts followed, calculated for all users following at least one of the respective party accounts. IQR = Interquartile range (25th and 75th percentiles). % votes = Overall party vote shares in the 2021 German federal election. Train = Training dataset, Test = Test dataset, SR = Self-report dataset. Values are rounded to one decimal. Sample size per dataset: N_train = 248 407, N_test = 27 601, N_self-report = 119; total count of accounts followed per dataset: N_train = 5 801 440, N_test = 643 563, N_self-report = 1759.

Afterward, we checked if the overall estimated political positioning of parties corresponded to their overall political positioning judged by experts. To this end, we again used data from the 2021 OES (Jankowski et al., 2022). Specifically, we used expert ratings positioning each party on a typically used left-right ideology scale ranging from 0–20. To check alignment, we plotted these scores against the median party point estimates of both the CA and VAE from our models. In detail, similar to Barberá et al. (2015), we collated the model column coordinates of political accounts from the same party on the first dimension (matching individual political accounts to their party) and used the median of all these column coordinates as a measure for the models’ overall party positioning. Further information about the extraction of column coordinates can be found in the supplementary materials, subsection I.II. All estimates were standardized to ensure comparability of party positionings in the models and the scores of the OES report. As a final check of our models capturing political opinions, we compared user point estimates with their self-report data in our self-report data set. After applying our trained models on the self-report dataset to get users’ point estimates, we first calculated Spearman correlations between users’ point estimates and their self-reported political (symbolic) ideology assessed on a scale ranging from 1 (left) to 10 (right). We expected a positive relationship between point estimates and symbolic ideology. Next, we plotted users’ point estimates conditioned on their reported party voting intention in the next German Federal election (Sonntagsfrage). Analogous to self-reported ideology, we expected a gradient in overall point estimates. That is, median point estimates of users signaling to vote for more left-leaning parties should be lower than users intending to vote for more right-leaning parties. 
Further, we used the opinion point estimates to predict self-reported voting intention using a multinomial logistic regression model (see supplemental materials, subsection I.III). However, due to the highly unequal cell sizes, with far fewer participants intending to vote for more right-leaning parties (see section “Opinion Estimate Validation Using Self-Report Data”), we focus on the visual inspection of point estimate distributions in the main text. Finally, we calculated the respective classification metrics of the reversed (i.e., reconstructed) matrices for the self-report set (see supplemental materials, subsection II.I). Figure 1 provides a visual depiction of the full analysis procedure and workflow.
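The party-positioning check described above (taking the median column coordinate per party and then standardizing) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not the authors' analysis code: the account names, party labels, coordinate values and expert scores are hypothetical toy data, and the use of the population standard deviation for standardization is an assumption, since the text does not specify the variant.

```python
from statistics import mean, median, pstdev


def party_medians(account_coords, account_party):
    """Median column coordinate of all political accounts belonging to each party."""
    by_party = {}
    for account, coord in account_coords.items():
        by_party.setdefault(account_party[account], []).append(coord)
    return {party: median(coords) for party, coords in by_party.items()}


def standardize(scores):
    """z-standardize a {label: value} mapping (population SD, an assumption)."""
    values = list(scores.values())
    center, spread = mean(values), pstdev(values)
    return {label: (value - center) / spread for label, value in scores.items()}


# Hypothetical toy data: column coordinates of eight political accounts
# on the latent dimension interpreted as left-right.
account_coords = {
    "acc_a1": -1.2, "acc_a2": -0.8, "acc_a3": -1.0,
    "acc_b1": 0.1, "acc_b2": 0.3,
    "acc_c1": 1.5, "acc_c2": 2.1, "acc_c3": 1.7,
}
account_party = {
    "acc_a1": "Party A", "acc_a2": "Party A", "acc_a3": "Party A",
    "acc_b1": "Party B", "acc_b2": "Party B",
    "acc_c1": "Party C", "acc_c2": "Party C", "acc_c3": "Party C",
}

# Hypothetical expert ratings on a 0-20 left-right scale.
expert_scores = {"Party A": 4.0, "Party B": 10.0, "Party C": 17.0}

model_positions = standardize(party_medians(account_coords, account_party))
expert_positions = standardize(expert_scores)

for party in expert_positions:
    print(party, round(model_positions[party], 2), round(expert_positions[party], 2))
```

Standardizing both sets of scores puts them on a common scale, so the model-based and expert-based positions can be compared or plotted side by side.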

Results

Reconstruction Performance

To check the models’ capacity to represent users’ network structure (i.e., following decisions) in a latent space, we inspected the adjacency matrix reconstruction performance. The overall results are depicted in Table 2. Across all calculated metrics and both the training and test datasets, the VAE consistently achieved a higher reconstruction performance than both the CA and the NCF. The VAE classified more of the positive class (non-follows) correctly (higher Recall) and misclassified fewer of the following decisions (i.e., negative class) as non-follows (higher Precision). However, since both datasets were highly imbalanced (∼98.3% of all values are in the positive, non-follow class), the NCF (almost) matched the performance of the VAE on the F1 score, Precision and Recall metrics and even outperformed the CA on the first two. Therefore, we focused on comparing model performance on the MCC and BACC, which take the classification performance for both classes into account evenly. Again, the VAE showed an overall higher classification performance for the follow and non-follow decisions (higher BACC and MCC), thus reconstructing the following matrix more accurately than the other two models. Although it performed worse than the VAE and the NCF on the unbalanced metrics, the CA outperformed the NCF on the balanced BACC and MCC metrics. A similar pattern of results emerged for the additionally calculated Brier score and PCP (see supplemental materials, subsection II.II).

Table 2.

Model reconstruction performance.

Models	Metrics
	BACC		MCC		F1 score		Precision		Recall
	Train	Test	Train	Test	Train	Test	Train	Test	Train	Test
VAE	0.69	0.69	0.52	0.52	0.99	0.99	0.99	0.99	1.00	1.00
CA	0.52	0.52	0.12	0.12	0.79	0.79	0.66	0.65	0.99	0.99
NCF	0.50	0.50	0.00	0.00	0.99	0.99	0.98	0.98	1.00	1.00

Note. Performance rounded to two decimals. NCF: Naive classifier (always predicting the majority class, “No follow”). Positive class: “No follow,” negative class: “Follow.” BACC: Balanced accuracy, MCC: Matthews correlation coefficient. Train: Training dataset, Test: Test dataset. Preferred model for each metric and dataset in bold.
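Why the balanced metrics matter here can be illustrated with a short Python sketch. It computes the five metrics of Table 2 from confusion-matrix counts, with “no follow” as the positive class, and applies them to a naive majority-class classifier. The toy labels (98 non-follows, 2 follows) and the function names are hypothetical and only mimic the class imbalance described above; treating undefined ratios as 0 is a convention assumed for this sketch.

```python
from math import sqrt


def safe_div(numerator, denominator):
    """Return 0.0 when a ratio is undefined (convention assumed for this sketch)."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(y_true, y_pred, positive="no_follow"):
    """Precision, Recall, F1, balanced accuracy and MCC for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)        # true positive rate
    specificity = safe_div(tn, tn + fp)   # true negative rate
    return {
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "bacc": (recall + specificity) / 2,
        "mcc": safe_div(tp * tn - fp * fn,
                        sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
    }


# Hypothetical toy data: 98 non-follows and 2 follows.
y_true = ["no_follow"] * 98 + ["follow"] * 2

# Naive classifier: always predicts the majority class.
naive = classification_metrics(y_true, ["no_follow"] * 100)

# A hypothetical model that recovers one of the two follows.
informed = classification_metrics(y_true, ["no_follow"] * 98 + ["follow", "no_follow"])

print({name: round(value, 2) for name, value in naive.items()})
print({name: round(value, 2) for name, value in informed.items()})
```

On these toy data, the naive classifier reaches high Precision, Recall and F1 but a BACC of 0.5 and an MCC of 0, the same pattern as the NCF row in Table 2, whereas the model that recovers even one follow scores higher on both balanced metrics.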

Inspection of Opinion Estimates

Next, we visually inspected the latent dimensions and user opinion point estimates. Since the VAE outperformed the CA regarding matrix reconstruction performance, we only show the point estimates for this model in the following (see supplemental materials, subsection II.III for CA point estimates). Figure 2 shows individual user point estimates from the VAE on the two latent dimensions plotted separately for each party. As expected, individuals following a proportionally high number of accounts from left-wing parties (Green party, Linke and SPD) appear to have low values on the second latent dimension, whereas followers of right-wing parties (AfD, CDU/CSU, FDP) appear to have high values. This finding indicates that the second latent dimension in the VAE represents individuals’ general political opinions. The first latent dimension in the VAE seems to represent the magnitude of users’ following decisions. In detail, the more political accounts users followed overall (total follows), the lower their scores on the first latent dimension seem to be. This is indicated by a gradient of large to small points from negative to positive values on the first latent dimension (Figure 2). Supporting this visual inspection, we found a substantial correlation between the total number of followed accounts and user values on the first latent dimension (r = −.54, p < .001). These interpretations are, however, purely post hoc, because we had no prior assumptions about what this latent dimension in the VAE might represent before data analysis.

Figure 2.

Latent dimensions in the VAE. Note: Opinion point estimates in the VAE for all parties and users (combined training and test set). Each point represents a user, sequentially colored by the percentage of accounts followed per party (total number of accounts followed from a specific party divided by overall followed political accounts) with a brighter hue indicating a lower and a darker hue indicating a higher percentage of accounts followed from the respective party. Point sizes refer to the total number of followed accounts (independent of party) per user. Larger points indicate more followed accounts overall.

Next, we used the party estimates (i.e., column coordinates) on the second latent dimension in the VAE and the first dimension in the CA to check the overall estimated party alignment with expert ratings from the OES. The results are depicted in Figure 3. Broadly, the VAE more closely matched the expected overall party positions from the OES. The distance to the reference opinion estimates (OES) was lower for most of the parties compared to the CA (except SPD). Additionally, the overall distinction of estimated party positions (i.e., separating more left from more right-leaning parties) in the VAE was higher. In fact, party estimates in the CA were almost identical for five out of six parties (except AfD). Notably, the party estimates in both the VAE and CA for the party LINKE seem to deviate the most from the OES.

Figure 3.

Comparison of the left-right alignment of parties. Note: OES estimates represent the mean estimates of party positioning by experts from 2021. The party ordering runs from the most left to the most right party as identified in the OES. Negative opinion point estimates on the Y-axis represent left-leaning parties and positive ones represent right-leaning parties. Points for VAE and CA represent party opinion estimates, calculated by using the median of all political accounts’ column coordinates for each party. All estimates are standardized.

Opinion Estimate Validation Using Self-Report Data

In the last step, we validated opinion estimates from the VAE and CA with our self-report dataset. First, we looked at the distributions of self-reported ideology and voting intentions in our sample. On average, our sample’s self-reported ideology was left-leaning, and not a single individual reported a strong right-leaning ideology (M_ideology = 3.29, SD_ideology = 1.33, MinMax_ideology = [1, 8]). The same was true for voting intentions. The majority of individuals reported intending to vote for left-leaning parties (Green party: 54, Linke: 15, SPD: 19) and only twelve individuals in total for the (more) right-leaning parties (FDP: 7, CDU/CSU: 4, AfD: 1). The remaining participants reported intending to vote for other, non-major parties (Other: 18) or not to vote at all (no vote: 1). These self-report results align only partly with the overall following rates of political accounts in our work, since users in the self-report dataset followed even more politicians from left-leaning parties percentage-wise than users in the training and test sets (Table 1).

Looking at the distribution of opinion point estimates of both the VAE and CA, their mean and standard deviation seemed to align with this overall left-leaning trend of the self-report data, with the VAE exhibiting a lower mean and higher standard deviation compared to the CA (M_CA = −0.24, SD_CA = 0.33, M_VAE = −0.72, SD_VAE = 0.76). Looking at the relationship between opinion point estimates of users with their self-reported symbolic political ideology (left-right), we found a medium correlation in both models (VAE: r = .46, p < .001, CA: r = .46, p < .001). Although exhibiting similar positive correlation coefficients, the opinion point estimate distributions and absolute values seemed to differ between models (see Figure 4). In the CA, most point estimates were closely grouped except for a significant outlier on the highest self-reported ideology scale point. In comparison, the general pattern of point estimates and self-reported ideology in the VAE seemed more nuanced, showing a bigger variance in point estimates within and between self-reported ideology scores.

Figure 4.

Correlation plot of opinion point estimates and self-reported ideology. Note: Opinion point estimates for all participants in the self-report dataset are represented by individual dots jittered on the x-axis with a bold red regression line. Error bars around the regression line indicate 95% confidence interval.
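The Spearman correlation used in this validation step is the Pearson correlation of the rank-transformed variables, with tied values receiving their average rank. The following Python sketch shows the computation under stated assumptions: the ideology scores and point estimates are hypothetical toy values rather than the study data, the function names are illustrative, and no p-value is computed.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the average of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: self-reported ideology (1 = left, 10 = right)
# and model-based opinion point estimates for eight users.
ideology = [1, 2, 2, 3, 3, 4, 5, 8]
estimates = [-1.3, -0.9, -1.1, -0.6, -0.2, -0.4, 0.1, 1.7]

rho = spearman(ideology, estimates)
print(round(rho, 2))
```

Because the ideology scale has few distinct values, ties are common, which is why the average-rank treatment is used instead of the shortcut formula that assumes untied ranks.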

Next, we analyzed the capacity of our models’ point estimates to predict party voting intentions. Figure 5 shows the distribution of point estimates by self-reported voting intentions for each party and model. As expected, the ordering of opinion estimates in the VAE was mostly in line with the expected ordering per party. That is, individuals intending to vote for more left-leaning parties had lower, negative median opinion estimates (Md_Linke = −1.04, Md_90/Greens = −0.87, Md_SPD = −0.62), whereas those intending to vote for more right-leaning parties had higher, (mostly) positive opinion estimates (Md_CDU/CSU = −0.1, Md_FDP = .33, Md_AfD = 1.73). Notably, individuals intending to vote for CDU/CSU exhibited a lower median point estimate than expected (i.e., a lower median estimate than those intending to vote for FDP). However, the point estimate variance for the CDU/CSU was relatively large compared to the other parties, and only four individuals reported voting intentions for this party. Compared to the VAE, the CA median opinion estimates were less in line with the expected party ordering. Similar to the overall party positioning (Figure 3), the variance between median point estimates for the different parties was much smaller than in the VAE, with only the most right-leaning party (AfD) exhibiting a large difference (and high absolute value) in median point estimates. Median estimates for the parties LINKE and Greens were identical (Md = −0.3), with the median estimate for SPD (Md = −0.29) being the lowest of all parties (although only marginally). Additionally, estimates for the more right-leaning parties FDP and CDU/CSU were only slightly higher than for the previously mentioned left-leaning parties and also almost identical to one another (Md_FDP = −0.19, Md_CDU/CSU = −0.2). Put simply, the CA showed only small differences in opinion estimates between individuals voting for different parties. This may, in turn, have led to a greater divergence from the expected ordering of opinion estimates by voting intention compared to the VAE. In sum, the relationship between the point estimates and voting intention in the VAE seemed more nuanced and more in line with the expected distribution than in the CA. This result was further supported by the multinomial logistic regression model, in which the VAE showed a superior model fit compared to the CA in predicting voting intentions from opinion point estimates (see supplemental materials, subsection II.IV).

Figure 5.

Self-reported voting intention and estimated ideological position. Note: Individual boxplots calculated for all participants intending to vote for the respective party with an interquartile range of IQR = 3. Black dots indicate single observations. Points outside of the whiskers indicate outliers. The boxplot for AfD shows only a single point estimate (N = 1). Median point estimates per intended voting behavior are shown underneath the individual boxplots. The CA graph (b) includes a y-axis break to facilitate plotting.

Lastly, we also checked the CA and VAE following matrix reconstruction performance on the self-report dataset. Similar to the training and test sets, the VAE consistently outperformed the CA and NCF across calculated metrics (see supplemental materials, subsection II.I). This again indicates that the learned latent representations in the VAE captured the network relationships more closely than those in the CA.

Discussion

In the present study, we explored whether social network data can be used to infer individuals’ political opinions in countries other than the US. In doing so, we utilized dimensionality reduction (DR) models to analyze the network of German Twitter users to obtain individual-level political opinion estimates.

Our results not only corroborate findings from previous studies showing that estimating individual political opinions from social media data (SMD) using individuals’ following decisions is feasible (e.g., Barberá, 2015a; Barberá et al., 2015; Bond & Messing, 2015) but also show that this estimation works in a multiparty political system like Germany, and thus potentially in other countries outside of the US as well. This result is noteworthy since many countries other than the US have a multiparty system and, in the special case of Germany, ordering political parties on a general left-right continuum is a complex issue with no universally agreed-upon solution. Depending on the dimension used (social, economic, etc.) and the method applied, researchers have arrived at different party orderings in the past (Jankowski et al., 2022; Lehmann et al., 2023). This ordering issue, in combination with a higher number and unequal distribution of voting intention classes in our study (six parties following different political agendas in the German context), thus presents a more challenging prediction endeavor than in a two-party context like the US.

Despite these challenges, our study shows that opinion estimates of users following accounts from German politicians broadly align with the respective overall party stance judged by experts. Further, the results suggest that the more accounts portraying a specific political stance social media users follow, the more likely they are to hold similar opinions. Particularly, the more accounts from “extreme” parties, like the right-leaning AfD, individuals follow, the more extreme their opinion point estimates become, reflecting a left-right dimension. In line with previous studies on social media and offline social networks, this supports the assumption of homophily (e.g., Aiello et al., 2012; McPherson et al., 2001).

Moreover, we find that using complex ML models can benefit the estimation accuracy of individuals’ public opinions from SMD. First, the unsupervised VAE ML model captured the intricacies of the following structure more precisely than an uninformed baseline model and a linear DR algorithm (CA). Possibilities to compare our models’ matrix reconstruction performance to the literature are somewhat limited since, to our knowledge, only one related previous study (conducted in the US) reported the (unbalanced) reconstruction performance (Barberá et al., 2015). Compared to this study, however, our VAE exhibited higher absolute reconstruction performance and greater improvements over an uninformed baseline. The nuanced network representation in the VAE is further corroborated by a closer match of the ideological party estimates with expert ratings compared to the CA. The individual-level opinion estimates in the VAE also showed stronger relationships with self-reported opinions compared to the linear CA. Not only did the user opinion estimates show strong correlations with self-reported symbolic ideology but also a more accurate relationship with individuals’ voting intentions. All these results empirically support the assumption that complex, non-linear relationships in the following structure of social media networks exist and need to be considered in the modeling phase to estimate individual-level opinions more accurately. Being able to extract individual-level political opinions solely from SMD following decisions has several practical implications. Using SMD seems to provide a promising alternative to cost- and labor-intensive surveys. The extracted opinion estimates may further be used by researchers and policymakers alike to predict political outcomes like elections, opinion networks and political polarization in social media networks.

Nevertheless, our study comes with a few limitations. We found slight differences between the political stance of German parties estimated by our models and their expected positions as rated by experts. One of the reasons might be structural biases in our data, even though we adhered to previous best practices of data cleaning as far as possible. We might not have excluded all inactive and/or bot accounts due to restrictions in users’ geo-location and other meta information. However, despite the potentially resulting noise in our datasets, the VAE still captured meaningful relationships, as shown by our results. Judging from the following distribution of political accounts compared to the overall vote shares in the 2021 German federal election and the fact that most social media users do not seem to follow political elites, our data may not be representative of the general population, a common problem of studies using SMD (Mellon & Prosser, 2017; Wojcieszak et al., 2022). The representativeness of our sample for the Twitter user space might further be limited since we only analyzed data of users who followed at least ten political accounts (similar to previous approaches). Importantly, however, users who do follow political elites seem to be more aligned with the opinions of the followed accounts, which in turn supports our study’s assumption of homophilic networks (Wojcieszak et al., 2022). On a similar note, our self-report dataset is comparatively small and the following frequencies of political accounts deviated from the bigger training and test datasets. As shown by our results, the overall distribution of voting intentions and self-reported ideology is rather left-leaning and limited in its variance; only a few participants reported intending to vote for right-leaning parties. This may have led to slightly biased correlations of self-reported opinions with point estimates and may thus also explain the similar correlations for the VAE and CA. Nevertheless, a small to medium correlation of opinion estimates with symbolic ideology and a fit with voting intention patterns indicate that the VAE captured important patterns in the networks of users in the self-report dataset. Finally, researchers planning to adapt our method should be aware that VAEs require more careful planning, fine-tuning (i.e., hyperparameter and architecture selection) and computation time than simple linear DR models. Also, the interpretation of latent dimensions may require additional effort, as noted in the Methods section. However, when trained and applied properly, they can capture the intricacies of the network in more detail, as shown by our results.

Based on our findings and study limitations, we offer several recommendations for future studies. While an ML DR model yielded promising results in the context of German SMD network-based analysis, future research should apply these models to other countries, political systems and politically uninterested users to test generalizability. Future studies may also test other existing, more traditional network analysis models and approaches which, for instance, focus on node-link prediction rather than DR (e.g., GNNs) to estimate individual-level political opinions. On a similar note, since our dataset exhibits a directed bipartite network structure (users → political accounts), future studies may explore the potential of other types of network data structures and representations (e.g., undirected user-user/user-political accounts networks, attributed graphs) to estimate political opinions from SMD. Doing so could, for instance, facilitate the application of resampling techniques and of other network analysis and benchmark models. Considering our rather small self-report dataset, future research may validate point estimates with self-report measures using larger samples. While we found substantial correlations between opinion estimates and self-reported symbolic ideology, future studies may investigate whether manifestations of dimensions of more complex models of political views (see for example Fatke, 2017; Feldman & Johnston, 2014; Gerber et al., 2010; Jost et al., 2009; Treier & Hillygus, 2009) can be predicted by these data as well. In this case, fitting models with more latent dimensions may be advisable to capture the multifaceted nature of users’ political opinions in European countries even more accurately (Barberá, 2015a). On that note, future studies could explore whether our results can be generalized to measures of operational ideology assessing individuals’ attitudes on specific topics and policy issues. Moreover, although our VAE ML model surpasses the network reconstruction performance of previous studies as well as of linear (CA) and uninformed (Naive) models, it is debatable what constitutes “good” dimensionality reduction performance. Therefore, future research could explore potential thresholds and benchmarks. Shortly after our data collection (July 2022), Twitter’s leadership changed and user counts decreased substantially (Alex Hern, 2024). Although we do not expect this circumstance to affect (homophilic) behavior, future research may still investigate periodic effects of such events on social media networks and replicate our findings. Relatedly, Twitter has become less attractive to researchers and political actors as a data source since, like most social media platforms, it does not share data with independent researchers and sets comparatively high prices for acquiring big datasets (Calma, 2023). This hinders and may even prevent further in-depth examinations in this research area (Bruns, 2019). We hope that with the Digital Services Act, researchers will (re)gain access to conduct research on online platforms in the future (European Commission, 2024).

Using social media data to assess the political opinions of individuals offers valuable benefits over traditional opinion surveys for researchers and policymakers alike. However, finding reliable and accurate methods to extract information from unstructured SMD in different sociopolitical contexts remains one of the main challenges. Our study adds new insights to this endeavor, showing that using ML models capable of capturing intricate user network characteristics in a multiparty system is crucial to estimating individual-level political opinions more accurately. Opinion estimates from SMD may then be used to monitor political trends, capture ideological shifts in societies, or predict political behavior like voting patterns or policy support, ultimately benefiting the democratic process in modern societies.

Supplemental Material

Supplemental Material - To Follow or Not to Follow: Estimating Political Opinion From Twitter Data Using a Network-Based Machine Learning Approach

Supplemental Material for To Follow or Not to Follow: Estimating Political Opinion From Twitter Data Using a Network-Based Machine Learning Approach in Nils Brandenstein, Christian Montag, and Cornelia Sindermann in Social Science Computer Review

Data Availability Statement

The data underlying this research project constitute “special categories of personal data” (“personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, [...]” GDPR, Chapter 2, Art. 9). Data processing was based on the fact that “processing relates to personal data which are manifestly made public by the data subject” (GDPR, Chapter 2, Art. 9, §2e; training set, test set) or “the data subject has given explicit consent to the processing of those personal data [...]” (GDPR, Chapter 2, Art. 9, §2a; self-report set). Participants of the survey used to recruit the self-report dataset additionally provided consent that their data might be shared if re-identification is impossible. Given the exponential growth in available data analysis strategies (algorithms, etc.), we did not find an appropriate data masking or data aggregation strategy that would prevent re-identification of participants while still providing the whole dataset to the public for replicability of the findings. Thus, if researchers are interested in replicating the results reported in the present work, we ask them to contact us (cornelia.sindermann@iris.uni-stuttgart.de) and we will provide access to part of the data, aggregated data, or the like. Note that every request will need to undergo thorough examination first to ensure compliance with the GDPR and that participants cannot be re-identified in the dataset to which access is provided. All questionnaires used in the survey to recruit the self-report dataset will be made publicly available at the OSF upon acceptance of the manuscript. The analysis code used to produce the results is openly accessible on the OSF: https://osf.io/2ft8d/?view_only=e56f7e9b7fdc4b4f8188084584141793.

Footnotes

Acknowledgments

We acknowledge the support of the Ministerium für Wissenschaft, Forschung und Kunst Baden-Württemberg (MWK, Ministry of Science, Research and the Arts Baden-Württemberg under Az. 33-7533-9-19/54/5) in “Künstliche Intelligenz & Gesellschaft: Reflecting Intelligent Systems for Diversity, Demography and Democracy (IRIS3D)” and the support by the “Interchange Forum for Reflecting on Intelligent Systems” (IRIS) at the University of Stuttgart. Further, we acknowledge the support by the Stuttgart Center for Simulation Science (SimTech).

Author Contributions

Nils Brandenstein: Conceptualization; Methodology; Software; Validation; Formal analysis; Investigation; Data Curation; Writing - Original Draft; Visualization; Administration

Christian Montag: Writing - Review & Editing

Cornelia Sindermann: Conceptualization; Methodology; Software; Validation; Formal analysis; Investigation; Data Curation; Writing - Review & Editing; Supervision; Administration

Declaration of conflicting interests

The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: For reasons of transparency Dr. Montag mentions that he has received (to Ulm University and earlier University of Bonn) grants from agencies such as the German Research Foundation (DFG). Dr. Montag has performed grant reviews for several agencies; has edited journal sections and articles; has given academic lectures in clinical or scientific venues or companies; and has generated books or book chapters for publishers of mental health texts. For some of these activities he received royalties, but never from gaming or social media companies. Dr. Montag mentions that he was part of a discussion circle (Digitalität und Verantwortung: ) debating ethical questions linked to social media, digitalization and society/democracy at Facebook. In this context, he received no salary for his activities. Also, he mentions that he currently functions as independent scientist on the scientific advisory board of the Nymphenburg group (Munich, Germany). This activity is financially compensated. Moreover, he is on the scientific advisory board of Applied Cognition (Redwood City, CA, USA), an activity which is also compensated.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Supplemental Material

Supplemental material for this article is available online.

Author Biographies

Nils Brandenstein is currently a PhD student in Psychology at Heidelberg University, Germany, where he also earned his Master’s degree. His research interests lie in the field of Political Psychology, and he uses Machine Learning models to investigate a variety of topics, including belief in conspiracy theories, sustainable behavior and political attitudes/behavior.

Dr. Christian Montag is Professor for Molecular Psychology at Ulm University, Ulm, Germany. He works at the intersection of psychology, neuroscience, computer science and behavioral economics. At the moment, his work focuses on the questions of how AI impacts society and how social media can be improved.

Dr. Cornelia Sindermann did her Ph.D. in Psychology at Ulm University, Ulm, Germany. Currently, she is the Independent Research Group Leader of the Computational Digital Psychology team within the Interchange Forum for Reflecting on Intelligent Systems, University of Stuttgart. She is interested in how interactions between individual differences and technological innovations shape how information is presented and processed, and how this, in turn, impacts political opinion formation and behavior.

References

Aiello

L. M.

Barrat

Schifanella

Cattuto

Markines

Menczer

(2012). Friendship prediction and homophily in social media. ACM Transactions on the Web, 6(2), 1–33. https://doi.org/10.1145/2180861.2180866

Alex Hern . (2024). Twitter usage in US ‘fallen by a fifth’ since Elon Musk’s takeover. Retrieved June 24, 2024, from. https://www.theguardian.com/technology/2024/mar/26/twitter-usage-in-us-fallen-by-a-fifth-since-elon-musks-takeover#:∼:text=Use_of_Twitter_in_the,app%2Dmonitoring_company_Sensor_Tower

Al-Omairi

L. J.

Abawajy

Chowdhury

M. U.

Al-Quraishi

(2021). An empirical analysis of graph-based linear dimensionality reduction techniques. Concurrency and Computation: Practice and Experience, 33(5), e5990. https://doi.org/10.1002/cpe.5990

Anjaria

Guddeti

R. M. R.

(2014). Influence factor based opinion mining of Twitter data using supervised learning. 2014 sixth International Conference on communication systems and networks (COMSNETS), (pp. 1–8). https://doi.org/10.1109/COMSNETS.2014.6734907

Aral

(2020). The hype machine: How social media disrupts our elections, our economy, and our health–and how we must adapt. Crown. https://books.google.de/books?id=oH7JDwAAQBAJ

Barberá

(2015a). Birds of the same feather tweet together: Bayesian ideal point estimation using Twitter data. Political Analysis, 23(1), 76–91. https://doi.org/10.1093/pan/mpu011

Barberá

(2015b). How social media reduces mass political polarization. Evidence from Germany, Spain and the US. Working Paper, New York University. https://pablobarbera.com/static/barbera_polarization_APSA.pdf

Barberá

Jost

J. T.

Nagler

Tucker

J. A.

Bonneau

(2015). Tweeting from left to right: Is online political communication more than an echo chamber? Psychological Science, 26(10), 1531–1542. https://doi.org/10.1177/0956797615594620

Berinsky

A. J.

(2017). Measuring public opinion with surveys. Annual Review of Political Science, 20(1), 309–329. https://doi.org/10.1146/annurev-polisci-101513-113724

10.

Boccaletti

Latora

Moreno

Chavez

Hwang

(2006). Complex networks: Structure and dynamics. Physics Reports, 424(4), 175–308. https://doi.org/10.1016/j.physrep.2005.10.009

11.

Bond

Messing

(2015). Quantifying social media’s political space: Estimating ideology from publicly revealed preferences on Facebook. American Political Science Review, 109(1), 62–78. https://doi.org/10.1017/S0003055414000525

12.

Brito

K. D. S.

Filho

R. L. C. S.

Adeodato

P. J. L.

(2021). A systematic review of predicting elections based on social media data: Research challenges and future directions. IEEE Transactions on Computational Social Systems, 8(4), 819–843. https://doi.org/10.1109/TCSS.2021.3063660

13.

Bronstein

M. M.

Bruna

LeCun

Szlam

Vandergheynst

(2016). Geometric deep learning: Going beyond Euclidean data. [Publisher: arXiv Version Number: 2]. https://doi.org/10.48550/ARXIV.1611.08097

14.

Bruns

(2019). After the ‘APIcalypse’: Social media platforms and their fight against critical scholarly research. Information, Communication & Society, 22(11), 1544–1566. https://doi.org/10.1080/1369118X.2019.1637447

15.

Calma

(2023). Twitter just closed the book on academic research. The Verge. Retrieved September 21, 2023, from https://www.theverge.com/2023/5/31/23739084/twitter-elon-musk-api-policy-chilling-academic-research

16.

Chen

Gan

Lin

(2024). Data scarcity in recommendation systems: A survey. ACM Transactions on Recommender Systems.

17.

Chikhi

N. F.

Rothenburger

Aussenac-Gilles

(2007). A comparison of dimensionality reduction techniques for web structure mining, 116–119. https://doi.org/10.1109/WI.2007.86

18.

Chung

Zeng

(2016). Social-media-based public policy informatics: Sentiment and network analyses of U.S. immigration and border security. Journal of the Association for Information Science and Technology, 67(7), 1588–1606. https://doi.org/10.1002/asi.23449

19.

Cinelli

De Francisci Morales

Galeazzi

Quattrociocchi

Starnini

(2021). The echo chamber effect on social media. Proceedings of the National Academy of Sciences, 118(9), Article e2023301118. https://doi.org/10.1073/pnas.2023301118

20.

De Backer

Naud

Scheunders

(1998). Non-linear dimensionality reduction techniques for unsupervised feature extraction. Pattern Recognition Letters, 19(8), 711–720. https://doi.org/10.1016/S0167-8655(98)00049-X

21.

Dong

Lian

(2021). A review of social media-based public opinion analyses: Challenges and recommendations. Technology in Society, 67, Article 101724. https://doi.org/10.1016/j.techsoc.2021.101724

22.

Eady

Nagler

Guess

Zilinsky

Tucker

J. A.

(2019). How many people live in political bubbles on social media? Evidence from linked survey and Twitter data. Sage Open, 9(1), 215824401983270. https://doi.org/10.1177/2158244019832705

23.

Eichstaedt

J. C.

Kern

M. L.

Yaden

D. B.

Schwartz

H. A.

Giorgi

Park

Hagan

C. A.

Tobolsky

V. A.

Smith

L. K.

Buffone

Iwry

Seligman

M. E. P.

Ungar

L. H.

(2021). Closed- and open-vocabulary approaches to text analysis: A review, quantitative comparison, and recommendations. Psychological Methods, 26(4), 398–427. https://doi.org/10.1037/met0000349

24.

European Commission. (2024, January). DSA: Very large online platforms and search engines. Shaping Europe’s digital future. Retrieved January 26, 2024, from https://digital-strategy.ec.europa.eu/en/policies/dsa-vlops

25.

Fatke

(2017). Personality traits and political ideology: A first global assessment. Political Psychology, 38(5), 881–899. https://doi.org/10.1111/pops.12347

26.

Feldman

Johnston

(2014). Understanding the determinants of political ideology: Implications of structural complexity. Political Psychology, 35(3), 337–358. https://doi.org/10.1111/pops.12055

27.

Figeac

Favre

(2023). How behavioral homophily on social media influences the perception of tie-strengthening within young adults’ personal networks. New Media & Society, 25(8), 1971–1990. https://doi.org/10.1177/14614448211020691

28.

Gayo-Avello

(2013). A meta-analysis of state-of-the-art electoral prediction from Twitter data. Social Science Computer Review, 31(6), 649–679. https://doi.org/10.1177/0894439313493979

29.

Gerber

A. S.

Huber

G. A.

Doherty

Dowling

C. M.

Ha

S. E.

(2010). Personality and political attitudes: Relationships across issue domains and political contexts. American Political Science Review, 104(1), 111–133. https://doi.org/10.1017/S0003055410000031

30.

Grover

Leskovec

(2016). Node2vec: Scalable feature learning for networks (Version 1). arXiv. https://doi.org/10.48550/ARXIV.1607.00653

31.

Halko

Martinsson

P. G.

Tropp

J. A.

(2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 217–288. https://doi.org/10.1137/090771806

32.

Jankowski

Kurella

A.-S.

Stecker

Blätte

Bräuninger

Debus

Müller

Pickel

(2022). Die Positionen der Parteien zur Bundestagswahl 2021: Ergebnisse des Open Expert Surveys. Politische Vierteljahresschrift, 63(1), 53–72. https://doi.org/10.1007/s11615-022-00378-7

33.

Jost

J. T.

Federico

C. M.

Napier

J. L.

(2009). Political ideology: Its structure, functions, and elective affinities. Annual Review of Psychology, 60(1), 307–337. https://doi.org/10.1146/annurev.psych.60.110707.163600

34.

Kemp

(2023, July). Digital 2023 July global statshot report (Tech. Rep.). DataReportal. Retrieved August 23, 2023, from https://datareportal.com/reports/digital-2023-july-global-statshot

35.

Khanam

K. Z.

Srivastava

Mago

(2023). The homophily principle in social network analysis: A survey. Multimedia Tools and Applications, 82(6), 8811–8854. https://doi.org/10.1007/s11042-021-11857-1

36.

Kipf

T. N.

Welling

(2016). Variational graph auto-encoders (Version 1). arXiv. https://doi.org/10.48550/ARXIV.1611.07308

37.

Kumar

Sebastian

T. M.

(2012). Sentiment analysis: A perspective on its past, present and future. International Journal of Intelligent Systems and Applications, 4(10), 1–14. https://doi.org/10.5815/ijisa.2012.10.01

38.

Kurucz

Benczúr

Pereszlényi

(2008). Large-scale principal component analysis on LiveJournal friends network. Proceedings of SNAKDD, 2008.

39.

Kwak

J.-a.

Cho

S. K.

(2018). Analyzing public opinion with social media data during election periods: A selective literature review. Asian Journal for Public Opinion Research, 5(4), 285–301. https://doi.org/10.15206/AJPOR.2018.5.4.285

40.

Lansdall-Welfare

Lampos

Cristianini

(2012). Effects of the recession on public mood in the UK. Proceedings of the 21st International Conference on World Wide Web, 1221–1226. https://doi.org/10.1145/2187980.2188264

41.

Lee

D. H.

Brusilovsky

(2010). Social networks and interest similarity: The case of CiteULike. Proceedings of the 21st ACM Conference on Hypertext and Hypermedia (pp. 151–156). https://doi.org/10.1145/1810617.1810643

42.

Lehmann

Franzmann

Burst

Regel

Riethmüller

Volkens

Weßels

Zehnter

(2023). The Manifesto data collection. Manifesto Project (MRG/CMP/MARPOR). Version 2023a [Place: Berlin/Göttingen]. https://doi.org/10.25522/manifesto.mpds.2023a

43.

Livne

Simmons

Adar

Adamic

(2021). The party is over here: Structure and content in the 2010 election. Proceedings of the International AAAI Conference on Web and Social Media, 5(1), 201–208. https://doi.org/10.1609/icwsm.v5i1.14129

44.

McPherson

Smith-Lovin

Cook

J. M.

(2001). Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1), 415–444. https://doi.org/10.1146/annurev.soc.27.1.415

45.

Mellon

Prosser

(2017). Twitter and Facebook are not representative of the general population: Political attitudes and demographics of British social media users. Research & Politics, 4(3), 205316801772000. https://doi.org/10.1177/2053168017720008

46.

Nanga

Bawah

A. T.

Acquaye

B. A.

Billa

M.-I.

Baeta

F. D.

Odai

N. A.

Obeng

S. K.

Nsiah

A. D.

(2021). Review of dimension reduction methods. Journal of Data Analysis and Information Processing, 9(3), 189–231. https://doi.org/10.4236/jdaip.2021.93013

47.

Nishana

Surendran

(2013). Graph embedding and dimensionality reduction-a survey. International Journal of Computer Science & Engineering Technology (IJCSET), 4(1), 29–34.

48.

Pallavicini

Cipresso

Mantovani

(2017). Beyond sentiment. In Sentiment analysis in social networks (pp. 13–29). Elsevier. https://doi.org/10.1016/B978-0-12-804412-4.00002-4

49.

Reveilhac

Steinmetz

Morselli

(2022). A systematic literature review of how and whether social media data can complement traditional survey data to study public opinion. Multimedia Tools and Applications, 81(7), 10107–10142. https://doi.org/10.1007/s11042-022-12101-0

50.

Rousidis

Koukaras

Tjortjis

(2020). Social media prediction: A literature review. Multimedia Tools and Applications, 79(9–10), 6279–6311. https://doi.org/10.1007/s11042-019-08291-9

51.

Schober

M. F.

Pasek

Guggenheim

Lampe

Conrad

F. G.

(2016). Social media analyses for social measurement. Public Opinion Quarterly, 80(1), 180–211. https://doi.org/10.1093/poq/nfv048

52.

Silva

T. C.

Zhao

(2016, January). Machine learning in complex networks (1st ed.). Springer.

53.

Sindermann

Cornelia

Kannen

Christopher

Montag

Christian

, et al. (2021). The degree of heterogeneity of news consumption in Germany—Descriptive statistics and relations with individual differences in personality, ideological attitudes, and voting intentions. New Media & Society, 26(2), 711–731. https://doi.org/10.1177/14614448211061729

54.

Sindermann

Cornelia

Kannen

Christopher

Montag

Christian

(2022). Longitudinal data on (political) news consumption and political attitudes in a German sample collected during the election year 2021. Data in Brief, 43, Article 108326. https://doi.org/10.1016/j.dib.2022.108326

55.

Sindermann

Cornelia

Kannen

Christopher

Montag

Christian

(2023). Linking primary emotional traits to ideological attitudes and personal value types. PLoS One, 18(1), e0279885. https://doi.org/10.1371/journal.pone.0279885

56.

Skoric

M. M.

Liu

Jaidka

(2020). Electoral and public opinion forecasts with social media data: A meta-analysis. Information, 11(4), 187. https://doi.org/10.3390/info11040187

57.

Tabassum

Pereira

F. S. F.

Fernandes

Gama

(2018). Social network analysis: An overview. WIREs Data Mining and Knowledge Discovery, 8(5), Article e1256. https://doi.org/10.1002/widm.1256

58.

Tausanovitch

Warshaw

(2017). Estimating candidates’ political orientation in a polarized congress. Political Analysis, 25(2), 167–187. https://doi.org/10.1017/pan.2017.5

59.

Tilak

Evans

Wen

Glassman

(2023). Social network analysis as a cybernetic modelling facility for participatory design in technology-supported college curricula. Systemic Practice and Action Research, 36(5), 691–724. https://doi.org/10.1007/s11213-022-09625-9

60.

Traber

Stoetzer

L. F.

Burri

(2023). Group-based public opinion polarisation in multi-party systems. West European Politics, 46(4), 652–677. https://doi.org/10.1080/01402382.2022.2110376

61.

Treier

Hillygus

D. S.

(2009). The nature of political ideology in the contemporary electorate. Public Opinion Quarterly, 73(4), 679–703. https://doi.org/10.1093/poq/nfp067

62.

Tumasjan

Sprenger

Sandner

Welpe

(2010). Predicting elections with Twitter: What 140 characters reveal about political sentiment. Proceedings of the International AAAI Conference on Web and Social Media, 4(1), 178–185. https://doi.org/10.1609/icwsm.v4i1.14009

63.

Tunstel

Cobo

M. J.

Herrera-Viedma

Rudas

I. J.

Filev

Trajkovic

Chen

C. L. P.

Pedrycz

Smith

M. H.

Kozma

(2021). Systems science and engineering research in the context of systems, man, and cybernetics: Recollection, trends, and future directions. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 51(1), 5–21. https://doi.org/10.1109/TSMC.2020.3043192

64.

Wagner

(2021). Affective polarization in multiparty systems. Electoral Studies, 69, Article 102199. https://doi.org/10.1016/j.electstud.2020.102199

65.

Widmann

Wich

(2023). Creating and comparing dictionary, word embedding, and transformer-based models to measure discrete emotions in German political text. Political Analysis, 31(4), 626–641. https://doi.org/10.1017/pan.2022.15

66.

Wilko

(2021, October). Results of the German federal election 2021 (Tech. Rep.). https://www.wahlrecht.de/ergebnisse/bundestag.htm

67.

Wojcieszak

Casas

Nagler

Tucker

J. A.

(2022). Most users do not follow political elites on Twitter; those who do show overwhelming preferences for ideological congruity. Science Advances, 8(39), Article eabn9418. https://doi.org/10.1126/sciadv.abn9418

68.

Yan

Zhang

H.-j.

Yang

Lin

(2007). Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1), 40–51. https://doi.org/10.1109/TPAMI.2007.250598

69.

Zarzour

Al-Sharif

Al-Ayyoub

Jararweh

(2018). A new collaborative filtering recommendation algorithm based on dimensionality reduction and clustering techniques. 2018 9th International Conference on Information and Communication Systems (ICICS) (pp. 102–106). https://doi.org/10.1109/IACS.2018.8355449

70.

Zhang

Tong

Maciejewski

(2019). Graph convolutional networks: A comprehensive review. Computational Social Networks, 6(1), 11. https://doi.org/10.1186/s40649-019-0069-y

Supplementary Material

Please find the following supplemental material available below.