Abstract
This study characterizes Catalan texts written by primary school boys and girls (aged seven to 12) in terms of their syntactic patterns. The corpus contains 169 texts divided by sex (76 by boys and 93 by girls), averaging 200 words each and totalling 33,763 words. From this corpus, we extracted the 40 most frequent n-grams (bigrams and trigrams) of morphological categories. The data were analysed statistically using ANOVA and Linear Discriminant Analysis; in a cross-validation experiment, the writer’s gender was predicted with 60.4% accuracy using both bigrams and trigrams. When the children’s age was taken into account, accuracy exceeded 70% in both the original classification and the cross-validation. The most discriminating bigrams and trigrams indicate that girls show greater expressive capacity, superior syntactic maturity, and greater lexical and syntactic richness.
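As an illustration of the feature extraction described above, the following minimal sketch counts bigrams and trigrams of morphological category tags and keeps the most frequent ones. It is not the authors' implementation. The tag names, the toy input, and the function names `tag_ngrams` and `top_ngrams` are hypothetical. The sketch also assumes that each text has already been tagged and that the 40 n-grams are selected from a pooled count of bigrams and trigrams, which the abstract does not specify.

```python
from collections import Counter


def tag_ngrams(tags, n):
    """Return the contiguous n-grams of a sequence of category tags."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def top_ngrams(tagged_texts, k=40, sizes=(2, 3)):
    """Count n-grams of the given sizes over all texts; return the k most frequent."""
    counts = Counter()
    for tags in tagged_texts:
        for n in sizes:
            counts.update(tag_ngrams(tags, n))
    return counts.most_common(k)


# Hypothetical toy input: each text is reduced to its sequence of category tags.
texts = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["DET", "NOUN", "VERB", "ADJ"],
]
top = top_ngrams(texts, k=3)
```

The relative frequency of each selected n-gram in each text could then serve as one predictor per n-gram for a classifier such as Linear Discriminant Analysis.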
