Proportional Classification Revisited: Automatic Content Analysis of Political Manifestos Using Active Learning

Abstract

Supervised machine learning is a promising methodological innovation for content analysis (CA) that addresses the challenge of ever-growing amounts of text in the digital era. Social scientists have named the accurate measurement of category proportions and trends in large collections as their primary goal. Proportional classification allows, for example, for time-series analysis of diachronic data sets or for correlating categories with text-external covariates. We evaluate the performance of two common approaches to this goal: a method based on regression analysis with feature profiles from entire collections and a method aggregating classifier decisions for individual documents. For both, we observed a significant negative effect on classification performance caused by the uneven distribution of characteristic language structures within the text collection, which poses considerable problems for proportional classification. To address this problem, we propose an active learning workflow that alternates between machine learning and human coding. Results from experiments with empirical data (political manifestos) demonstrate that active learning enables researchers to create training sets for automatic CA efficiently, reliably, and with high accuracy for the desired goal while retaining control over the automatic process.
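The alternation between machine learning and human coding, followed by the aggregation of individual classifier decisions into category proportions, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the article: the nearest-centroid classifier is a toy stand-in for a real text classifier, the query strategy (smallest margin between the two closest categories) is one common form of uncertainty sampling, the `oracle` function plays the role of the human coder, and the category names and two-dimensional features are hypothetical.

```python
import random
from collections import Counter


def train_centroids(labeled):
    """Average the feature vectors of each category (toy classifier)."""
    sums, counts = {}, Counter()
    for features, label in labeled:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def distances(centroids, features):
    """Squared Euclidean distance from one document to every centroid."""
    return {label: sum((a - b) ** 2 for a, b in zip(features, c))
            for label, c in centroids.items()}


def predict(centroids, features):
    """Assign the category whose centroid is closest."""
    d = distances(centroids, features)
    return min(d, key=d.get)


def margin(centroids, features):
    """A small margin means the two closest categories are nearly tied."""
    d = sorted(distances(centroids, features).values())
    return d[1] - d[0] if len(d) > 1 else float("inf")


def proportions(centroids, collection):
    """Aggregate individual decisions into category shares."""
    counts = Counter(predict(centroids, f) for f in collection)
    return {label: n / len(collection) for label, n in counts.items()}


def active_learning(pool, oracle, seed_labeled, rounds=5, batch=4):
    """Alternate between retraining and asking the coder about uncertain items."""
    labeled = list(seed_labeled)
    unlabeled = list(pool)
    for _ in range(rounds):
        centroids = train_centroids(labeled)
        # Most uncertain documents first.
        unlabeled.sort(key=lambda f: margin(centroids, f))
        queried, unlabeled = unlabeled[:batch], unlabeled[batch:]
        labeled += [(f, oracle(f)) for f in queried]
    return train_centroids(labeled), labeled


# Hypothetical data: two numeric features per document, two categories.
rng = random.Random(0)
pool = [(rng.random(), rng.random()) for _ in range(200)]


def oracle(features):
    """Stand-in for the human coder."""
    return "economy" if features[0] > features[1] else "welfare"


seed = [((0.9, 0.1), "economy"), ((0.1, 0.9), "welfare")]
model, labeled = active_learning(pool, oracle, seed)
shares = proportions(model, pool)
print(shares)
```

In this sketch the coder labels only 20 additional documents, chosen where the current model is least certain, and the resulting model is then applied to the whole collection to estimate category shares.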

Keywords

content analysis, active learning, proportional classification, text classification, text as data, supervised machine learning, computer-assisted content analysis, computational social science, big data


