Abstract
Supervised machine learning is a promising methodological innovation for content analysis (CA) that helps meet the challenge of ever-growing amounts of text in the digital era. Social scientists have identified accurate measurement of category proportions and trends in large collections as their primary goal. Proportional classification, for example, allows for time-series analysis of diachronic data sets or correlation of categories with text-external covariates. We evaluate the performance of two common approaches to this goal: a method based on regression analysis with feature profiles from entire collections, and a method that aggregates classifier decisions for individual documents. For both, we observed a significant negative effect on classification performance caused by the uneven distribution of characteristic language structures within the text collection, which poses considerable problems for proportional classification. To address this problem, we propose an active learning workflow that alternates between machine learning and human coding. Results from experiments with empirical data (political manifestos) demonstrate that active learning enables researchers to create training sets for automatic CA efficiently, reliably, and with high accuracy for the desired goal, while retaining control over the automatic process.
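As a rough illustration of two ideas named in the abstract, the sketch below shows (a) estimating category proportions by aggregating classifier decisions for individual documents and (b) one step of an active learning loop that picks documents for human coding. It is not the authors' implementation: the function names are hypothetical, and uncertainty sampling is an assumed query strategy, since the abstract does not say how documents are selected.

```python
from collections import Counter


def estimate_proportions(predicted_labels, categories):
    """Aggregate per-document decisions into category proportions."""
    total = len(predicted_labels)
    if total == 0:
        raise ValueError("need at least one classified document")
    counts = Counter(predicted_labels)
    # Categories that were never predicted get a proportion of 0.0.
    return {c: counts[c] / total for c in categories}


def select_for_coding(probabilities, batch_size):
    """Assumed query strategy (uncertainty sampling): return the indices of
    the documents whose most likely category has the lowest probability.

    probabilities: one dict per document, mapping category -> probability.
    """
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: max(probabilities[i].values()),
    )
    return ranked[:batch_size]


# Toy data, invented purely for illustration.
labels = ["economy", "economy", "welfare", "economy"]
proportions = estimate_proportions(labels, ["economy", "welfare"])

probs = [
    {"economy": 0.90, "welfare": 0.10},
    {"economy": 0.55, "welfare": 0.45},
    {"economy": 0.20, "welfare": 0.80},
]
to_code = select_for_coding(probs, batch_size=2)
```

In a full workflow, a human coder would label the selected documents, the labels would be added to the training set, the classifier would be retrained, and the proportions would be re-estimated, repeating until the estimates stabilise.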
