Abstract
This corpus-based multifactorial study delves deeper into the well-known alternation between bare and full infinitive complements, specifically regarding the help concordances. It extends the line of research to learners’ language productions with a focus on comparing and contrasting the probabilistic grammatical knowledge reflected in the help/help to choices, which are constrained by a complex network of factors. Regression modeling indicates that constraints such as the horror aequi and certain morphological forms of the verb help are comparatively stable among groups of Chinese English language learners and native English speakers. Therefore, the principle of cognitive complexity and the avoidance strategy fueling infinitive variations are further tested. While the object typology differs in effect strength, it also prompts consideration of cultural influences on the part of Chinese English learners. In addition, two different patterns of interaction effects are explored in each dataset, which implies that learners’ language choice is usage-based, determined by an unconscious assessment of multiple clues that are learned from language input and shaped by language experiences.
Plain language summary
Alternation studies situated at the crossroads of learner corpus linguistics and second language acquisition deal with one of the questions, namely, what factors constrain learners’ choices between similar language structures. By comparing and contrasting, we can explore reasons why native speakers may prefer one structure over another, to help learners predict a more appropriate choice matching the specific context. This study explores the verb help with following bare or full infinitives, such as help prepare the meeting or help to prepare the meeting. We select the data from a learner corpus. A native-speaker student corpus is used as a reference corpus. Besides, this study is guided by the theory of probabilistic grammar which holds that language is not determined by categorical hard constraints, but by probabilistic soft constraints learned from input and maintained and refined by experience. The mixed-effects regression modeling is employed to clarify the effect strength and direction of predictors and their interactions. Findings are as follows. Constraints such as the horror aequi and certain morphological forms of help are comparatively stable among groups of Chinese EFL learners and native English speakers. Therefore, the principle of cognitive complexity and the avoidance strategy fueling infinitive variations are further tested. While the object typology differs in effect strength, it also prompts consideration of cultural influences on the part of Chinese English learners. In addition, two different patterns of interaction effects are explored in each dataset, which implies that learners’ language choice is usage-based, determined by an unconscious assessment of multiple clues learned from language input and shaped by language experiences. This study makes a valuable attempt to outline part of learners’ probabilistic knowledge, constituting an essential step to help learners produce language appropriately under different contexts.
Introduction
This corpus-based study examines the matching dilemma faced by English learners who are unable to make an appropriate decision when presented with similar language structures. In most cases, the choices learners make show a degree of randomness. Benefiting from the development of large corpora and advancing analytic methods, researchers have been able to explore the reasons why native speakers may prefer one structure over another. It is possible to help learners predict a more appropriate construction that matches the specific context.
Previous studies focusing on alternating phenomena in natural language have generally followed a usage-based perspective. These studies argue that alternating phenomena are multifactorial, and they respect a variety of constraining factors and the interactions between factors (Gries, 2017). Several typical morphosyntactic alternations have been investigated based on various types of data and different perspectives, for example, the dative (Bresnan & Ford, 2010; Gries, 2003; Szmrecsanyi & Engel, 2021), the genitive (Hinrichs & Szmrecsanyi, 2007; Pijpops & Van de Velde, 2018), the particle placement (Gries, 2005; Wulff & Gries, 2019), the clause positioning (Gries & Wulff, 2021; Kang & Xu, 2020) and that complement constructions (Kruger, 2019; Shank et al., 2016), to name a few. The studies, which have primarily employed quantitative and comparative methodologies, suggest that these constructions may present challenges for language users who lack a precise understanding of the exact contexts under which alternating variants should occur.
The present study aims to contribute to this line of research, by featuring an alternative trap, specifically concerning the verb help with following bare or full infinitives, as exemplified in the sentences in (1).
Two aspects motivate this study. First, English textbooks do not typically address the factors that may influence the choice of either type of construction in everyday language. This is also true of English dictionaries for non-native speakers, although some dictionaries do mention the difference that may be motivated by register (Hornby, 2016; Zhang & Zhang, 2017). It is therefore evident that a multifactorial exploration is of practical significance in view of teaching and learning practice in terms of infinitive usage. Second, the flourishing field of learner corpus linguistics has played an indispensable role in studies from a usage-based perspective. In light of this background, the present study seeks to examine the phenomenon of help/help to alternation in interlanguage from a contrastive perspective, which could shed light upon the universal or local usage patterns. As for feasibility, several corpus-based studies regarding help occurrences in a certain range of English varieties have yielded rich results, offering theoretical and methodological guidance for the current attempt.
Literature Review
Previous studies examining concordances with the verb help have employed two main approaches: the monofactorial approach, which focuses on identifying a single determining factor, and the multifactorial approach, which examines the effects of multiple predictors. The most prevalent assumption is that the use and non-use of a linguistic element are never in absolutely free variation (McGregor, 2013). Based on a corpus of 50 modern English novels, Lind (1983) assessed the relative frequency of optional to after help and identified the syntactic conditions, in which either of these forms might occur. Regarding Lind’s study, Kjellmer (1985) questioned the representativeness of detective stories and the analogy of help with verbs like see, hear, etc., testified the euphonic reason in the case of help with bare infinitives, and proposed that the lexical verb help might have acquired the characteristics of an auxiliary. Kjellmer voiced some criticism concerning Lind’s study, but his conclusion of the auxiliary feature of help was not supported by data-based usages. McEnery and Xiao (2005), however, made a comprehensive investigation of the potential factors relevant to speakers’ choices of infinitives after help, including language internal and external conditions, such as variety and register. Yet, all of the factors they identified were explored through the examination of pure frequency data.
Alongside these studies, there are also general discussions about the underlying principles that determine speakers’ choices. First, Rohdenburg (1996) proposed the cognitive complexity principle to analyze the distribution of competing constructions, in which the to-infinitive was viewed as associated with a more cognitively complex environment, for instance, when there were intervening elements between help and the following complement. Then, the distance principle was put forward as a semantic motivation (Fischer, 1995; Haiman, 1983). The distance between help and the following infinitive corresponded to semantic connectedness between the subject and the event. The third principle, horror aequi, or the avoidance of identity, was taken as another decisive predictor in speakers’ choices of variant forms (Rohdenburg, 2003). A to immediately preceding help would lead to more bare infinitives. These three principles have enriched the factor group and provided additional elucidation for other syntactic constraints. However, not all of these have been confirmed in every alternation study.
The first mixed-effects account of help/help to is found in Lohmann (2011). Lohmann quantified the contributions of various conditioning factors with data retrieved from the BNC corpus. Despite its rigorousness, this study still lacked in the exploration of the variety and the mode that influenced the choice of infinitives after help. Levshina (2018) conducted a more recent multifactorial study of help/help to alternation. Levshina investigated help/help to constructions in seven varieties of web-based English. The findings demonstrated the importance of examining cross-lectal data when searching for universal functional principles of human languages. Continuing in this line of research, Levshina (2022) again investigated constructions with help/help to concordances, but the focus was more on methodological considerations.
As the aforementioned studies show, the primary focus of research on help/help to alternation has been on native speakers while little is known about how English learners choose between these two constructions. Accordingly, the present study hopes to fill this gap by exploring learners’ language productions to uncover the constraining factors underlying the choice of infinitives. To the best of our knowledge, this is one of the few attempts to target the construction under investigation.
In particular, this study situates the theoretical approach of probabilistic grammar (henceforth PG) within the learner English paradigm, thus adopting a comparative probabilistic grammar analysis. PG has emerged as a research tradition at the crossroads of corpus linguistics, variationist linguistics, and theoretical linguistics (Claes, 2017). The main claim in PG is that language is not determined by categorical hard constraints, but by probabilistic soft constraints which are learned from input and maintained and refined by experience (Bresnan, 2007). Through converging modeling and experimental evidence, previous studies confirm the view that parts of grammatical knowledge are probabilistic and that users can make predictions about alternative variants (Klavan & Divjak, 2016). These claims have been tested in a growing body of comparative research investigating the alternating phenomena in different English varieties (Grafmiller & Szmrecsanyi, 2018; Szmrecsanyi et al., 2016). However, the adaptability of the PG framework to learner corpus data has not been investigated. Therefore, the current study will fill this gap by exploring the probabilistic knowledge of foreign English learners in particular, to further test the view of probabilistic indigenization which paints a complex picture of the “stability and fluidity” of the probabilistic grammar (Heller et al., 2017; Tamaredo et al., 2020).
To sum up, this article aims to introduce probabilistic grammar to learners’ language. This objective is achieved by conducting a large-scale corpus analysis of the specific help/help to alternation and modeling the probabilistic knowledge through the integration of a wide range of factors. Guided by this aim, we seek to address the following two specific questions:
(1) What factors constrain the alternation between help/help to constructions produced by Chinese EFL learners and native-speaker students?
(2) In the light of previous studies, what is at the core of learners’ probabilistic grammar regarding help/help to alternation?
Since the present study focuses predominantly on learners’ alternating behavior, a contrastive analysis will be conducted between native English and interlanguage through modeling. To guarantee the reliability of comparison, we will collect data from productions of university students and focus solely on the genre of argumentative essays. The study proceeds as follows. Section “Methodology” discusses the methodology, including the corpora selection, the data retrieval, the relevant factors annotated in this study and the modeling method. Then, the results from statistical models are summarized in Section “Results.” Section “Discussion” discusses the findings. Section “Conclusion” provides a general conclusion of the study, including major findings, limitations, future directions and implications of this study.
Methodology
Data
Two corpora were selected to investigate learners’ implicit probabilistic grammatical knowledge and to analyze how various factors influence their choices. For the Chinese group, data were extracted from the Written English Corpus of Chinese learners (version 2.0) (henceforth WECCL) (Wen et al., 2008). This corpus contains a large number of texts produced by students from over 20 Chinese universities of varying levels, types and regions, providing a broad representation of Chinese EFL learners’ English writing in recent years. To ensure sufficient data and maintain genre consistency, only argumentative texts were selected to build the learner corpus, excluding the expositions. For the native-speaking group, data were sourced from the British Academic Written English corpus (henceforth BAWE; Nesi & Gardner, 2012), a specialized corpus representing British students’ academic writing. All the assignments were submitted by undergraduate and master’s students from three universities across 35 disciplines. To establish a reference corpus aiding in the formation of a native “norm” for comparison, only essays written by students whose first language was English were included. Despite differences in genre, the two corpora are comparable in terms of the time of data collection as all texts were produced approximately 3 to 4 years before 2007 (BAWE, 2004–2007).
We extracted the infinitive complements of constructions with help in three steps: (1) automatic extraction of all concordances using the concordancer AntConc 4.1.4 (Anthony, 2022). In AntConc, the search query was set as “Words” and the context size was 20 tokens. The search string was the key verb and all its inflected forms, namely “help,” “helps,” “helping” and “helped”; (2) quick manual filtering of the interchangeable cases by excluding categorical cases, such as the passive constructions (e.g., The young people should be helped to improve ourselves.), and sentences with not-negation (e.g., …and are helping this poem not to be too serious), where the full infinitives should be used (Levshina, 2022; Lohmann, 2011); (3) manual corrections to ensure the precision of the data. Sentences were re-checked with the “File” function in AntConc to make sure the contextual information was clear enough. For instance, in some cases, the subject information was missing in the helping event due to the limited context size. Duplicate occurrences were removed. Sentences containing grammatical mistakes were also discarded if they caused issues in annotation, such as subject-verb disagreement (e.g., it help us to prepare for employment…), or misspelling (e.g., them for then). Finally, a relatively balanced dataset was obtained, comprising 978 cases from WECCL and 845 cases from BAWE. The distribution of infinitive types reflects significant differences between the two groups, as shown in Table 1. Based on the frequency data, native speakers show a weaker preference for either bare or full infinitives, while Chinese EFL learners tend to use bare infinitives more frequently in their writing. The annotation of the qualified attestations for the relevant variables is detailed in Section “Variables and Their Annotation” below.
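The first two retrieval steps can be approximated in a short script. The following is only a minimal sketch, not the procedure actually run in AntConc: the function names (concordance, is_categorical), the whitespace tokenization, and the rough passive and not-negation heuristics are all assumptions made for illustration, and the filtering reported in this study was done manually.

```python
import re

# The four search strings named in the text. fullmatch keeps out
# related words such as "helpful" or "helper".
HELP_FORMS = re.compile(r"help|helps|helping|helped", re.IGNORECASE)

# Hypothetical list of auxiliaries used to spot passives such as "be helped".
BE_FORMS = {"be", "is", "are", "was", "were", "been", "being"}


def concordance(text, window=20):
    """Return (left context, hit, right context) for every form of help.

    window mirrors the 20-token context size; tokenization by
    whitespace is a simplifying assumption.
    """
    tokens = text.split()
    hits = []
    for i, token in enumerate(tokens):
        word = token.strip(".,;:!?\"'()")
        if HELP_FORMS.fullmatch(word):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, word, right))
    return hits


def is_categorical(left, hit, right):
    """Rough flag for cases excluded from the alternation.

    Assumption: a passive is 'helped' directly after a form of be,
    and not-negation is the string 'not to' in the right context.
    """
    left_words = left.lower().split()
    passive = (
        hit.lower() == "helped"
        and bool(left_words)
        and left_words[-1] in BE_FORMS
    )
    negated = re.search(r"\bnot to\b", right.lower()) is not None
    return passive or negated
```

A flag of this kind could only serve as a pre-filter; borderline cases would still need the manual check described in step (3).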
Table 1. Frequency of Infinitive Choices of Native English Speakers and EFL Learners.
Variables and Their Annotation
Previous research on the help/help to alternation has provided a comprehensive account of the constraints influencing users’ choice of a particular variant, consisting of both language-internal and language-external factors (cf. Levshina, 2018, 2022; Lohmann, 2011). Crucially, none of these factors is deterministic (Bresnan & Ford, 2010). For instance, an animate helper may favor the bare infinitive, but if the intervening element is long enough, the principle of cognitive complexity which requires grammatical explicitness may outweigh, that is the to-infinitive will be preferred. Therefore, various determinants constitute a complex network waiting to be disentangled. The following paragraphs will elucidate the variables to be targeted in this study. Examples are selected from the attestations retrieved from the corpora (see Section “Data”), mainly for elucidation.
Phonological Factors
Phonological factors, including rhythmic and segmental alternations, have been vastly understudied in the context of morphosyntactic alternations (Gries, 2017). However, they have the potential to play a co-determining role in shaping morphological and syntactic structures (Schlüter, 2003). The principle of rhythmic alternation refers to the preference for an optimal rhythmic pattern when stressed and unstressed syllables alternate in a sequence. The occurrence of two adjacent stressed or unstressed syllables will lead to stress clashes or lapses respectively, which are generally avoided in language production. Therefore, it can be hypothesized that (2b) is preferred over (2a) since the addition of to in the constructed sentence helps eliminate a stress clash. The annotation of the stress follows the practice in Levshina (2022) which proposes to code the attestations for stress patterns in the absence of to. Ternary values are attached to the variable “stress,” namely good, lapse and clash. In the cases of good or lapse, the omission of to is expected since adding it will increase the number of adjacent unstressed syllables. If it is clash, the addition of to will contribute to an ideal phonological structure.
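The ternary coding of “stress” can be expressed as a small decision rule. The sketch below assumes that the coder has already judged whether the syllable immediately before the infinitive and the first syllable of the infinitive are stressed; the function names and the Boolean inputs are illustrative assumptions, not part of the original annotation procedure.

```python
def stress_pattern(prev_stressed, next_stressed):
    """Classify the syllable boundary that arises when to is absent.

    prev_stressed: is the last syllable before the infinitive stressed?
    next_stressed: is the first syllable of the infinitive stressed?
    Both judgments are assumed to come from the human coder.
    """
    if prev_stressed and next_stressed:
        return "clash"   # two adjacent stressed syllables
    if not prev_stressed and not next_stressed:
        return "lapse"   # two adjacent unstressed syllables
    return "good"        # stressed and unstressed syllables alternate


def expected_variant(pattern):
    """Variant predicted by the rhythmic alternation principle."""
    # Only a clash is repaired by inserting the unstressed to.
    return "full" if pattern == "clash" else "bare"
```

The mapping in expected_variant restates the hypothesis only; whether the data bear it out is a matter for the regression models.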
Similarly, the segmental alternation refers to the consonant-vowel alternation to avoid hiatuses and consonant clusters, thereby achieving an optimal syllable structure. The coding of segments focuses on the initial segment of the word, typically the verb in the infinitive. Examples are given below. (3b) is preferred to (3a) because the omission of to contributes to a vowel-consonant-vowel sequence at the boundary between the verb help and the infinitive. The addition of to can result in the adjacency of two vowels. It is hypothesized that when the verb in the infinitive begins with a vowel, the bare form is more likely to occur. However, it should be noted that not all instances ideally align with CV structures. For example, CC structures frequently arise between to and the preceding word (e.g., …help the students to break). Whether to is omitted or not, a CC structure still occurs, making it unsuitable for directly testing the segmental alternation. Therefore, we mainly focus on cases where a vowel begins as research suggests that VV structures are less tolerated in natural languages than CC structures (Schlüter, 2003). In some rare cases, such as …help the boy to explain, even if the omission of to is preferred, a VV structure still happens between boy and explain, but the omission reduces the number of VV sequences. Bearing this in mind, we code each sentence for the segment pattern, resulting in a binary value (yes/no) assigned to the variable “segment.”
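As a rough illustration of the binary “segment” coding, the sketch below marks an attestation as yes when the infinitive verb begins with a vowel. Two assumptions should be noted: yes is taken to mean a vowel-initial verb, and the check is orthographic (the first letter), which only approximates the phonological judgment made in the actual annotation.

```python
# Orthographic stand-in for vowel-initial verbs. This is an
# approximation: spelling and pronunciation can diverge.
VOWEL_LETTERS = set("aeiou")


def segment(infinitive_verb):
    """Return 'yes' if the infinitive verb starts with a vowel letter."""
    first = infinitive_verb[:1].lower()
    return "yes" if first in VOWEL_LETTERS else "no"
```

For instance, explain would be coded yes and break would be coded no under this approximation.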
This study analyzes phonological well-formedness mainly for two reasons. One is to further validate phonological influence which has not been adequately discussed based on a large sample of data and the other is to see how they work in learners’ language, the written mode in particular.
Structural Factors
The first structural aspect of relevance in this study is the inflection of the verb help. Previous research indicates that the form helping frequently co-occurs with the to-infinitive (Levshina, 2018; Lohmann, 2011). However, this effect may be weakened by the presence of the helpee. When the object of help is specified, the different verb forms show little difference in alternating behaviors. The present study mainly distinguishes among the non-finite form (i.e., helping), the finite forms help (present tense), helps (third-person singular) and helped (both the simple past and the past participle). In the native dataset, it is found that help in all its inflected forms favors full infinitives, with frequencies mostly above 50%, and helping shows the greatest preference at 71.8%. In comparison, Chinese EFL learners exhibit an almost equal preference for bare and full infinitives, both around 50%. In the case of the base form, help is more frequently used with bare infinitives in both groups, but this preference is more pronounced among the Chinese group (64.1%). In annotation, this variable is labeled as “verbform” with the four levels described above.
The presence or absence of the helpee also potentially affects the choice of a variant, as illustrated by the following sentences.
The examples in (4) illustrate that the object of the helping event may be unexpressed, but when it is explicitly stated, the omission of to is favored (Lind, 1983; Lohmann, 2011; McEnery & Xiao, 2005). Hence, we include this factor in our analysis to determine whether it significantly predicts learners’ infinitive choices. In the dataset, both native speakers and EFL learners prefer bare infinitives when the helpee is present, but the Chinese group exhibits a stronger tendency (63% vs. 55%). By comparison, when the helpee is absent, both groups favor the full infinitive (55%). Beyond the mere presence of the helpee, its typology can also trigger the choice of to in various ways. Lind (1983) showed in his data that the omission of to predominated when a noun phrase appeared between the verb help and the infinitive. However, he did not differentiate between noun phrases and pronouns when comparing examples like (5a) and (5b). In their analysis of the distribution of make and make to in early English medical writing, Calle-Martín and Romero-Barranco (2015) identified three types of intervening elements: noun phrases, pronouns, and adverbials. Their findings indicated a higher ratio of bare infinitives than to-infinitives when a pronoun intervenes. Following this categorization, this study also further classifies the types of helpees.
A preliminary examination of the dataset reveals that adverbials rarely occur and are therefore excluded from the current analysis. In line with previous research, bare infinitives are preferred when there is an explicit helpee. However, among native speakers, the difference in the choice of bare or full complements is minimal when pronouns are involved (52% vs. 48%) compared to cases with noun phrases (56% vs. 44%). In contrast, Chinese learners demonstrate a stronger preference for bare infinitives when pronouns are present (66%), compared to cases with noun phrases (60%). In summary, regarding the variable “helpee,” three values can be identified, namely absence, pron and np.
Another influential factor is the length of the intervening material between help and the infinitive, though previous research has found its effect to be insignificant (McEnery & Xiao, 2005). However, based on a larger corpus, Lohmann (2011) demonstrated that as the number of intervening words increased, the likelihood of using the bare variant decreased. This is consistent with the principle of cognitive complexity, that is, the longer the elements inserted between help and the infinitive, the more strongly the to-infinitive is favored. Word count is commonly used to measure syntactic complexity in alternation studies (Levshina, 2018; Szmrecsanyi, 2004). While Deshors (2020) attaches a binary value to the length of direct objects (long/short), the intervening elements in our dataset consist primarily of pronouns and noun phrases and seldom extend to three words; a numeric value is therefore assigned to the predictor “length.” If the infinitive follows help immediately, the length is coded as 1. Hence, the length will be 2 in (6a) and 3 in (6b) respectively. To minimize the impact of outliers, we derive the factor “length_log” through logarithmic transformation. Generally speaking, occurrences of long intervening materials are rare in the two corpora. The presence of parenthetical elements leads to a comparatively higher probability of bare infinitives, which contradicts the cognitive complexity principle. As the number of words increases, the proportion of full infinitives rises for EFL learners, whereas this tendency is less pronounced among native speakers.
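The coding of “length” and “length_log” amounts to a short calculation, sketched below. The function names are hypothetical, and the natural logarithm is an assumption, since the base of the transformation is not specified here.

```python
import math


def length(intervening_words):
    """Number of intervening words plus one.

    An infinitive that follows help immediately therefore gets 1,
    a one-word helpee gets 2, a two-word helpee gets 3.
    """
    return len(intervening_words) + 1


def length_log(intervening_words):
    """Log-transformed length (natural log assumed)."""
    return math.log(length(intervening_words))
```

Under this coding, adjacent help plus infinitive yields length_log = 0, and each additional intervening word adds less to the predictor than the one before, which is what dampens the influence of unusually long helpees.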
Horror aequi, also known as the principle of the avoidance of identity, refers to a widespread tendency to avoid the repetition of identical elements (Rohdenburg, 2003). Originally used in phonology, this notion has since been used to interpret phenomena at all linguistic levels. In alternation studies, the horror aequi principle is used to explain why the to-infinitive is avoided when the main verb is itself preceded by to. This factor has also been confirmed to interact with other probabilistic factors, such as mode and variety (Tizón-Couto, 2022). In alternation studies with help concordances, example (7a) shows the case.
Horror aequi has yielded a significant influence on the distribution of bare and full infinitives, but this effect can be weakened by the weight of the intervening material (Levshina, 2018, 2022; Lohmann, 2011), which means that when the distance between help and the infinitive is long enough, the to-infinitive will still be used, though another to is preceding help. Consider example (7b). As expected by the percentage data, the horror aequi effect is significantly evident in both groups of students, that is, when help is preceded by to, they tend to choose bare infinitives to avoid the repetition of to. This effect is more obvious among native speakers (92%) compared to EFL learners (68%). This variable is coded as “horaeq” with a binary value of yes and no depending on whether help is adjacently preceded by to.
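The binary coding of “horaeq” can be sketched as a single adjacency check. The function name and the token-list input are assumptions for illustration; the attestations in this study were annotated by hand.

```python
def horaeq(tokens, help_index):
    """Return 'yes' if help is adjacently preceded by to, else 'no'.

    tokens: the sentence as a list of words
    help_index: position of the form of help in that list
    """
    if help_index > 0 and tokens[help_index - 1].lower() == "to":
        return "yes"
    return "no"


# Fragment of a BAWE attestation quoted in this article.
example = "in order to help understand why people choose".split()
label = horaeq(example, example.index("help"))  # 'yes'
```

Only strict adjacency counts here, in line with the coding rule; a to separated from help by other material would be coded no.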
Semantic Factors
The animacy of the helper is arguably one of the constraints influencing the choice of infinitives under investigation. An animate helper may trigger the use of the bare infinitive, which is in line with the distance principle, suggesting that there is potentially greater involvement of the helper in the helping event (Lind, 1983; Lohmann, 2011). However, the animacy constraint may be outweighed by the weight of the intervening material, as shown in example (8a) (Levshina, 2022). Moreover, distinguishing between animacy categories can be challenging. While Zaenen et al. (2004) identified five animacy categories (i.e., animate, collective, inanimate, locative, temporal), the present study restricts itself to a binary distinction between animate and inanimate entities. Our approach follows Heller et al. (2017) and Dubois et al. (2022), who argue that a five-fold classification may not be suitable for corpus data, due to the possible high collinearity and the difficulty in annotating a large set of attestations. The animate category mainly includes humans, animals, and human-like entities (e.g., gods, robots), while all others are categorized as inanimate. It should be noted that the categorization of organizations depends on whether they refer to a group of people or denote places, as illustrated in example (8b). Our dataset shows that when organizations, such as governments, universities and countries, are used together with help, they tend to exhibit great initiative, and hence are regarded as animate groups. Based on the distribution of bare and full infinitives in the data, it is observed that EFL learners favor a higher ratio of bare infinitives, regardless of whether the helper is animate (58%) or inanimate (61%). Comparatively, native speakers show a stronger preference for the bare form (53%) when the helper is animate and for the full form (53%) in inanimate cases, but the difference is subtle.
In our annotation, three values are assigned to the variable “helper”: animate, inanimate and unexpressed. Following previous studies, it is noted that in some attestations, the identity of the helper cannot be definitely determined. See example (8c) from WECCL below. Consequently, the level is marked as unexpressed.
Other Factors
The aforementioned factors have gained significant attention; however, language-external factors, such as variety, mode and genre, also influence infinitive choices to varying degrees. The monofactorial analysis has shown a distinct preference for infinitive types between American and British English. Levshina (2018) investigated the interaction of seven varieties and various factors, proving that cross-lectal similarities and differences existed in the constraining patterns. In terms of mode, McEnery and Xiao (2005) concluded that although the rate of bare infinitives was slightly higher in spoken English than in written English, the influence of the spoken/written distinction was not statistically significant. Similarly, Lohmann (2011) did not identify a clear difference in infinitive choices between spoken and written modes. This is closely related to another factor, the genre, which shows a more complex picture. The distributional pattern of bare and full infinitives across different genres is not easily generalized, though the formality hypothesis holds to some extent, as evidenced by the extreme preference for the full or bare infinitives in certain genres. In this study, mode and genre are not considered in the choice of an infinitive variant due to data type limitations. Nonetheless, to the best of our knowledge, they may not be strong extra-linguistic triggers for the alternating behavior. Regarding the variety factor, we move beyond the native-speaker standard, and instead consider learners’ language as a variety in its own right (Granger et al., 2015).
The Coding Scheme
The response variable is the bare or full infinitive after help. Individual contextualized occurrences are annotated for seven predictors presented in Table 2. An abbreviated sample of the annotation is displayed in Table 3. The level of each variable is either a categorical value or a numeric value, as described in the sections above. A pre-coding agreement check was conducted to examine the coding scheme and to enhance the reliability of the single-coded samples (see Spooren & Degand, 2010). This process was completed in five steps. First, a second coder was trained in the coding strategy developed in this study. Second, three controversial variables were selected for measuring coding agreement, including helper, stress and segment. Unlike the coding of the helpee or the length of the intervening elements, which was done objectively, the animacy identification required individual judgment, potentially leading to subjective interpretation. Moreover, given that stress and segment patterns were comparatively less analyzed in previous studies dealing with alternations, we also selected them as variables to be coded in samples. Third, we used the sample function in R (R Core Team, 2023) to randomly choose about 10% of the data points from each corpus (N = 180). Fourth, the two coders worked independently and the degree of agreement was measured by Cohen’s kappa. The resulting values indicated almost perfect agreement in animacy and stress coding (κ = .833 and .849, respectively), and complete agreement in segment coding (κ = 1) (Cohen, 1960; Landis & Koch, 1977). Not surprisingly, the stress or segment patterns concerning formal phonological characteristics did not lead to much coding ambiguity among the coders. However, the animacy of the helper, being a semantic feature, caused difficulties in classifying some attestations. The primary question concerned whether the helper was recognized as “unexpressed” or not.
For example, in the sentence Within the leisure industry there are numerous paradigms that can be looked at in order to help understand why people choose the leisure activities, it was difficult to identify an overt subject of the helping event (Levshina, 2022; Lohmann, 2011). Finally, the two coders discussed and resolved the disagreements. Instances in which it was unclear whether the helper was expressed accounted for a very small proportion of the data and were excluded from the helper analysis, since we focused solely on comparing cases with animate and inanimate helpers.
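The agreement measure described above can be sketched in a few lines. The original analysis was carried out in R; the Python version below is only an illustrative sketch, and the function names (`cohen_kappa`, `draw_sample`) and the toy labels are assumptions introduced here, not part of the study's materials.

```python
from collections import Counter
import random


def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders' labels over the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items that received identical labels.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement, derived from each coder's marginal label counts.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def draw_sample(items, share=0.10, seed=1):
    """Randomly pick roughly `share` of the items without replacement."""
    rng = random.Random(seed)
    k = max(1, round(len(items) * share))
    return rng.sample(items, k)


# Toy labels, invented purely for illustration (not the study's data).
coder_1 = ["animate", "animate", "inanimate", "animate", "inanimate", "inanimate"]
coder_2 = ["animate", "animate", "inanimate", "inanimate", "inanimate", "inanimate"]
kappa = cohen_kappa(coder_1, coder_2)
print(round(kappa, 3))
```

Kappa corrects the raw agreement rate for the agreement expected by chance, which is why it can be noticeably lower than the simple share of identical codes when one label dominates.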
Table 2. The Coding Scheme.
Table 3. A Sketch of the Annotation Table.
The Modeling Method
To clarify the influence of the suite of factors on infinitive choices, mixed-effects logistic regression modeling was employed. Recent years have witnessed a burgeoning of research into alternations with the help of mixed-effects logistic regression (cf. Dubois et al., 2022; Gries, 2021; Li et al., 2023; Pijpops & Van de Velde, 2018), contributing to rapid growth in studies that assess the contributions and interactions of a wide range of factors simultaneously. All the variables in Table 2 were derived from representative studies on the help/help to alternation, notably Lohmann (2011) and Levshina (2022). Based on their findings, we further classified phonological factors into rhythmic and segmental alternations, and object types into noun phrases and pronouns. To achieve the optimal model, the modeling process began with a full model including all of the factors mentioned in the coding scheme, with helper, helpee, length_log, verbform, horaeq, stress and segment as fixed effects. We incorporated verbs into the model as a random effect, specifying random intercepts rather than random slopes because the verbs were widely dispersed and most of them occurred very infrequently, which made slopes difficult to estimate reliably. In addition, two interactions were tested separately in the full model, namely, the interaction between helpee and verbform, and that between length_log and horaeq. Because we anticipated that the role of each factor might vary in learners’ data, these interactions were not directly included in the final model. The final full model was determined based on the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), with lower values indicating a better fit. As expected, a different interaction pattern proved significant in each dataset.
Afterwards, to arrive at the parsimonious model, we employed a backward elimination procedure, removing non-significant effects one by one and comparing model fit using likelihood-ratio tests. The key criterion was whether the removal of a variable would significantly affect the model fit. If it did not, the variable was removed as non-significant; if it did, the variable was retained in the model. All the operations were performed in R. The lme4 package, specifically the function glmer, was used to build the models. The packages MuMIn, Hmisc, gmodels and car were used for assessing the goodness of fit of the models. Other packages, such as ggplot2 and effects, were loaded for drawing interaction plots.
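The model-comparison logic behind this procedure can be illustrated with a short sketch. It is not the R code used in the study (which relied on lme4); it only shows, under stated assumptions, how a likelihood-ratio test and the AIC/BIC are computed from the log-likelihoods of two nested models. The function names and the log-likelihood values in the example are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (integer df >= 1)."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Closed forms for df = 1 and df = 2, then the recurrence
    # Q(df + 2) = Q(df) + z**(df/2) * exp(-z) / Gamma(df/2 + 1).
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(z)), 1
    else:
        q, k = math.exp(-z), 2
    while k < df:
        q += z ** (k / 2.0) * math.exp(-z) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


def likelihood_ratio_test(loglik_full, loglik_reduced, df_diff):
    """Return the LR statistic and its p-value for two nested models."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chi2_sf(stat, df_diff)


def aic(loglik, n_params):
    return 2.0 * n_params - 2.0 * loglik


def bic(loglik, n_params, n_obs):
    return n_params * math.log(n_obs) - 2.0 * loglik


def keep_predictor(loglik_full, loglik_reduced, df_diff, alpha=0.05):
    """Backward-elimination rule: keep the term only if dropping it hurts fit."""
    _, p_value = likelihood_ratio_test(loglik_full, loglik_reduced, df_diff)
    return p_value < alpha


# Hypothetical log-likelihoods, for illustration only.
stat, p = likelihood_ratio_test(-600.0, -600.5, df_diff=1)
print(f"LR = {stat:.2f}, p = {p:.4f}, keep term: {keep_predictor(-600.0, -600.5, 1)}")
```

In this toy comparison, dropping the term barely changes the log-likelihood, so the test is non-significant and the term would be removed in a backward step.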
Results
Two fitted models were generated for each dataset. The bare infinitive was set as the reference level; hence, a positive estimated coefficient implied a preference for the full infinitive, while a negative coefficient indicated a preference for the bare form. Furthermore, all the coefficients were exponentiated using the exp() function in R to yield odds ratios, an effect size that is easier to interpret. An odds ratio estimates the multiplicative change in the odds of the predicted response level for every one-unit increase in the predictor or, for a categorical predictor, relative to its reference level. An odds ratio larger than 1 indicated that the full infinitive was more likely in the specific context than at the reference level, whereas an odds ratio between 0 and 1 suggested that the full infinitive was less likely. For instance, the odds ratio of 3.23 in Table 4 suggested that when help appeared in its gerund form, the odds of the full infinitive were 3.23 times those for the reference form. The level of statistical significance was set at 0.05. Model diagnostics and interpretations are presented separately in the following two parts.
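The conversion from coefficients to odds ratios can be made concrete with a small sketch. The two odds ratios used below (3.23 and 0.064) are the values reported in this paper; the helper names and the baseline probability of 0.5 are assumptions chosen only to illustrate the arithmetic.

```python
import math


def odds_ratio(coefficient):
    """Exponentiate a log-odds coefficient to obtain an odds ratio."""
    return math.exp(coefficient)


def percent_change_in_odds(ratio):
    """Percentage change in the odds relative to the reference level."""
    return (ratio - 1.0) * 100.0


def shifted_probability(baseline_prob, ratio):
    """Probability of the full infinitive after applying an odds ratio
    to a given baseline probability (the baseline is an assumption)."""
    odds = baseline_prob / (1.0 - baseline_prob) * ratio
    return odds / (1.0 + odds)


# Odds ratios reported in the text; the 0.5 baseline is hypothetical.
for label, ratio in (("gerund form (helping)", 3.23), ("to before help", 0.064)):
    change = percent_change_in_odds(ratio)
    prob = shifted_probability(0.5, ratio)
    print(f"{label}: OR = {ratio}, odds change = {change:+.1f}%, "
          f"P(full) from a 0.5 baseline = {prob:.3f}")
```

Note that an odds ratio describes a change in odds relative to the reference level of the predictor; the resulting probability depends on the baseline, which is why the same odds ratio can correspond to different probability shifts.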
Table 4. Significant and Marginally Significant Effects in the Parsimonious Model Based on the Native Speakers’ Dataset.
Significance codes: *** p < 0.001; ** p < 0.01; * p < 0.05; . p < 0.1.
Regression Model Based on the Native Speakers’ Dataset
The minimal model yields a concordance index of 0.76 which indicates an acceptable discrimination between the bare and full infinitives (Levshina, 2015). It correctly classifies 68.9% of the cases, significantly higher than the baseline accuracy of 51.7%. This suggests the existence of a moderate relationship between the model’s predictions and the actual groupings. There is no indication of multicollinearity since the variance inflation factor for every predictor remains below the stricter threshold of 5 (Levshina, 2015).
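For readers unfamiliar with these diagnostics, the sketch below shows how a concordance index and a classification accuracy can be computed from predicted probabilities. It assumes that the baseline accuracy is the share of the more frequent variant, which the text does not state explicitly; the function names and the toy values are illustrative only.

```python
def concordance_index(probs, outcomes):
    """Share of (full, bare) pairs in which the full-infinitive case
    receives the higher predicted probability; ties count as 0.5."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome levels must be present")
    score = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                score += 1.0
            elif p == q:
                score += 0.5
    return score / (len(pos) * len(neg))


def accuracy(probs, outcomes, cutoff=0.5):
    """Share of cases classified correctly at the given cutoff."""
    preds = [1 if p >= cutoff else 0 for p in probs]
    return sum(pred == y for pred, y in zip(preds, outcomes)) / len(outcomes)


def baseline_accuracy(outcomes):
    """Accuracy of always guessing the more frequent variant (assumed baseline)."""
    share_full = sum(outcomes) / len(outcomes)
    return max(share_full, 1.0 - share_full)


# Toy predictions (1 = full infinitive, 0 = bare); invented for illustration.
toy_probs = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2]
toy_outcomes = [1, 1, 0, 1, 0, 0]
print(concordance_index(toy_probs, toy_outcomes),
      accuracy(toy_probs, toy_outcomes),
      baseline_accuracy(toy_outcomes))
```

A concordance index of 0.5 corresponds to chance-level discrimination and 1 to perfect discrimination, so values such as 0.76 fall in the range conventionally described as acceptable.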
Adapted from the original statistical results, Table 4 presents the individual significant and marginally significant effects, along with their interactions. The predicted response level is the full infinitive complement. The reference level of each predictor is shown in brackets. The logistic regression results exclude three variables, namely helper, segment, and length_log, which make no statistically significant contribution to the model. The marginally significant predictor, stress, is retained in the model because it offers promising insights into the inclusion of phonological factors in alternation studies. In addition, adding this variable does not distort the model results, as evidenced by the likelihood ratio test.
As shown in Table 4, the strongest predictor of infinitive choices is the horror aequi, with an odds ratio of 0.064, which indicates that when help is preceded by to, the odds of the full infinitive complement decrease by 93.6%; in other words, the bare infinitive is strongly favored. An example sentence is provided in (9a). Another strong predictor is the morphological form of help. However, only two forms show significant effects, that is, helps and helping, as illustrated by examples (9b) and (9c), respectively. In the former case, the odds ratio of 2.35 suggests that the third-person singular form is over 1.56 times more likely to be followed by full infinitives compared to the uninflected form. For the gerund form, the likelihood is even higher. The reason why helped is not found to be significant may be the limited number of observations involving this form; thus, it cannot be discarded as an insignificant predictor based solely on the results of the present study.
The predictor stress deserves special attention, as the result is contrary to expectations. As shown in Table 4, a stress clash promotes the use of the bare infinitive. Native speakers tend to favor the omission of to instead of adding it to avoid the clash. However, this marginal effect requires further clarification through more data.
A significant interaction is found between two predictors, helpee and verbform. In general, the presence of a helpee decreases the chance of full infinitives. More specifically, when a noun phrase or a pronoun follows help in inflected forms, native speakers tend to omit to. The sentences in example (10) illustrate these cases. This effect is stronger when helping is followed by a pronoun. In the case of helps, the effect is weaker, but both noun phrases and pronouns following this verb form tend to promote the use of bare infinitives. However, the interaction effect between helps and pronouns is only marginal. In Figure 1, the lower the point in the plot, the less likely the full infinitive is to occur.

Figure 1. Effect plot of the interaction between verbform and helpee.
Based on the data of argumentative essays by native English speakers, two morphological forms of help and the horror aequi, along with an interaction effect, exert a significant influence on the choice of infinitives. When help is preceded by to, native speakers tend to omit the second to in the infinitive to avoid repetition; when help is inflected as helps or helping, there is a greater likelihood of using the marked infinitive. However, the effect direction is reversed when noun phrases or pronouns are inserted.
Regression Model Based on the EFL Learners’ Dataset
The discrimination ability of the model based on Chinese EFL learners’ data is basically acceptable, as the concordance index approaches 0.7. The classification accuracy of this model reaches 67%, still higher than the baseline of 60.8%, which means that the model has a relatively good descriptive accuracy. The VIFs for the predictors also indicate no problem of multicollinearity.
Table 5 presents the significant and marginally significant predictors. The predicted response level is again the full infinitive complement, and the reference level for each predictor is again indicated in brackets. Three insignificant predictors are excluded, namely, helper, stress and segment. The most influential predictor is the horror aequi, that is, a to preceding help significantly favors the bare form, as shown in example (11a) from the WECCL corpus. Besides, both the object typology and the morphological forms of help strongly influence the choice of infinitives. Specifically, when the object is a noun phrase or a pronoun, the likelihood of choosing a bare infinitive increases compared to cases with an absent helpee. In particular, if the object is a pronoun, the preference for the bare form is stronger, as demonstrated in example (11b). The morphological form of the verb help also results in differences in the choice of infinitives. Compared to the uninflected form, help in the third-person singular form is over 1.7 times more likely to take the full infinitive, as shown in example (11c). Although the effect of the gerund form is only marginal, we can still assume that this form promotes the use of the full infinitive. As in the native speakers’ data, the case of helped remains unclear due to insufficient observations.
Table 5. Significant and Marginally Significant Effects in the Parsimonious Model Based on the EFL Learners’ Dataset.
Significance codes: *** p < 0.001; ** p < 0.01; * p < 0.05; (.) p < 0.1.
Compared to the native dataset, a different interaction effect is observed in this model, that is, the commonly evidenced interaction between length_log and horaeq. While length_log predicts the full infinitive, its effect is not significant on its own, whereas the preceding to strongly predicts the bare infinitive. The interaction of the two, however, results in a higher probability of the full infinitive, which indicates that the increasing length of the intervening material counterbalances the effect of the horror aequi. This is illustrated by the examples in (12). The effect plot in Figure 2 presents the probability of the full infinitive as a function of the length and the presence of to before help. As the length increases, the probability of choosing the full infinitive rises, and the horror aequi effect is thereby mitigated.

Figure 2. Effect plot of the interaction between length_log and horaeq.
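How an interaction term can counterbalance a main effect is easiest to see on the log-odds scale. The sketch below uses HYPOTHETICAL coefficients chosen only to mimic the direction of the effects described above (a strong negative effect of a preceding to, a small positive effect of length, and a positive interaction); they are not the estimates from Table 5, and the names `COEF` and `prob_full` are introduced here for illustration.

```python
import math


def logistic(x):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# HYPOTHETICAL log-odds coefficients, NOT the study's estimates.
COEF = {
    "intercept": -0.5,
    "length_log": 0.1,         # weak main effect of intervening length
    "horaeq": -2.0,            # a preceding "to" favors the bare infinitive
    "length_log:horaeq": 0.8,  # longer material weakens that avoidance
}


def prob_full(length_log, to_before_help):
    """Predicted probability of the full infinitive under the toy model."""
    horaeq = 1 if to_before_help else 0
    eta = (COEF["intercept"]
           + COEF["length_log"] * length_log
           + COEF["horaeq"] * horaeq
           + COEF["length_log:horaeq"] * length_log * horaeq)
    return logistic(eta)


for length_log in (0.0, 1.0, 2.0):
    without_to = prob_full(length_log, False)
    with_to = prob_full(length_log, True)
    print(f"length_log={length_log:.1f}  no to: {without_to:.3f}  "
          f"to: {with_to:.3f}  gap: {without_to - with_to:.3f}")
```

With these toy values, the gap between contexts with and without a preceding to shrinks as the intervening material grows, which is the qualitative pattern the effect plot conveys.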
Altogether, the predicting power of the same variables on infinitive choices varies across groups of EFL learners and English native speakers. Section “Discussion” explores the similarities and differences in the features of predictors that influence the choice of a bare or full infinitive in both groups, and discusses the reasons for these diverse phenomena.
Discussion
The analysis of variation patterns between bare and full infinitives is guided by two primary research focuses. The first focus is on the predictors that influence help/help to alternation in the language productions of both groups. The second focus explores the core probabilistic grammar in learners’ language concerning help/help to alternation.
The Principles of Horror Aequi and Cognitive Complexity as Determinants of Infinitive Choices Across the Two Groups
A common syntactic condition that may encourage the use of bare infinitives among both learners and native English speakers is when help is preceded by to. Previous research has confirmed the horror aequi effect in the omission of to following help. This effect is further verified in the present study, which focuses on learners’ essays. Moreover, in terms of the relative predictive power of the significant variables, the horror aequi exerts the strongest influence on the infinitive choice. This may be due to a “neurological motivation” observed in grammar, which has also been verified in phonological phenomena (Mondorf, 2003, p. 278). From a psychological perspective, it is believed that language users attempt to inhibit the reactivation of neurons within a given time span to create refractory phases during language production. The node that corresponds to the linguistic unit to preceding help is activated upon its initial use. After activation, there is a period of self-inhibition, leading to the avoidance of identical or near-identical forms, as they are linked to the same node.
Another similarity lies in the inflected forms of the verb help, one of the language-internal constraints on infinitive variations. In line with previous studies, help in the third-person singular and gerund forms tends to occur with full infinitives. Besides, the gerund form shows less preference for the marked variant compared to other inflectional forms. This suggests a subtle difference from the stronger connection between the gerund form and the full infinitive observed in previous studies (Lohmann, 2011). The principle of complexity may account for this effect. A change in the morphological form may increase the cognitive burden of users, thus resulting in a more explicit variant. As noted by Rohdenburg (2003), a morphologically marked form may represent a more cognitively complex category which tends to exhibit a greater degree of grammatical explicitness than other counterparts. Nevertheless, as previous studies show, in the case of help in past tense, no clear preference is shown in the selected corpora and the reason for this remains elusive. One possible explanation may be the smaller proportion of relevant samples in the selected data. Similar to native English speakers, Chinese EFL learners are influenced by the morphological changes of the verb help in the choice of infinitives, especially when help is in the third-person singular form. However, there is a slight difference in the marginally significant value when help is used in its gerund form. Generally speaking, the lexical effect remains relatively stable in both groups of language users based on the occurrences of full infinitives.
Differences in the Constraining Predictors of Infinitive Choices Between the Two Groups
The modeling results have revealed three significant differences. First, EFL learners appear to consider the typology of objects when they decide on the omission of to. Lind (1983) observed that bare infinitives were more likely to occur when an intervening noun phrase was present. This is also supported by our study. Our results suggest that the increase in the proportion of bare infinitives associated with an intervening noun phrase is statistically significant. This is particularly true when the object is a pronoun, as the share of bare infinitives increases even more. Interestingly, this observation contradicts the complexity principle, which holds that the growing discontinuity of two grammatically related elements leads to choices of the more explicit form. These results therefore call for reconsidering the principle of distance. Further analysis indicates that among the help concordances with intervening noun phrases or pronouns, animate subjects account for 82% of the occurrences. According to the principle of distance, an animate agent shows a potentially greater involvement in the helping event. This shorter conceptual distance is reflected in the shorter linguistic distance manifested by the omission of to. Besides, the data from EFL learners show a high frequency of first- and second-person pronouns. This pattern suggests that there are other ways of shortening the distance between the helper and helpee, emphasizing the direct help provided by the subject to the object.
Another difference identified in our study is related to the interactional patterns. Unlike previous studies that incorporate two interactions in statistical modeling, we find that these interactions manifest differently in native speakers’ and learners’ data. While the interaction between helpee and verbform is evident in data from native speakers, the dataset from EFL learners features an interaction effect between length_log and horaeq. This observation underscores the complexity of predictors, demonstrating their integration within an intricate network, whether these predictors are language-internal or -external. In terms of the first interaction effect, we can speculate that native speakers are responsive to the typological features of the helpee, which may mitigate the effect of complexity attributed to the morphological change. There is no definitive explanation for this phenomenon. On the one hand, the proportion of animate helpers in the case of noun phrases or pronouns is relatively low compared to the data from EFL learners. On the other hand, the tendency remains ambiguous in the combined impact of helpees with different typological features and help in different morphological forms. Concerning the second interaction effect, it is evident that the horror aequi effect is significant and more pronounced among native speakers. They consistently employ the avoidance strategy, regardless of changes in the form of help or the increasing weight of inserted elements after help. By contrast, EFL learners are attuned to the preceding to, but this sensitivity is readily overridden in complex language situations. This reveals their tendency to use linguistic units that enhance recognition or processing.
As evidenced, phonological predictors display somewhat unexpected effects in learners’ data. First of all, segmental patterns prove to be insignificant. Although segmental features in the help/help to alternation are explored here for the first time in learners’ data, the results are similar to those of previous studies of other genres (Lohmann, 2011). However, the stress alternation, being more readily identifiable through language input, presents a different scenario. While EFL learners do not show sensitivity to the stress constraint, native speakers demonstrate a slight preference when a stress clash occurs. Nevertheless, rather than opting for the full form to avoid the clash, they tend to choose the bare form, which coincides with previous studies on the effect direction of stress patterns. Apart from the suggested auxiliary feature of help, which may have lost its stress (Levshina, 2022), the most plausible explanation is that language users may not be highly attuned to phonological patterns during writing. As previously noted, even in speaking, users reduce the scope of phonological planning under increased task demands (Damian & Dumay, 2007). This phonological advance planning may present difficulties for both native speakers and learners, especially for EFL learners, as they typically receive less natural language input, including phonological stimuli, than natives. We have made efforts to further substantiate the effect of phonological predictors, but the evidence remains inconclusive. It is clear, however, that a growing body of work has evidenced an influence of stress and segmental patterns in several types of alternations (Gries & Wulff, 2013; Tizón-Couto, 2022; Wasow et al., 2015; Wulff & Gries, 2015), although phonological effects are not universally observed. Moreover, a synoptic and comparative analysis of relevant studies suggests that structures at phrasal levels are more susceptible to rhythmic effects (Kentner & Franz, 2019).
Rhythm alone, however, may not fully account for the effects (Shih et al., 2015), as it may interact with other variables, such as mode (Wulff & Gries, 2019). Additionally, it remains unclear whether factors such as the sample size or the statistical method influence the results, given that the Bayesian method has provided weak support for stress effect in cases of stress lapse (Levshina, 2022).
The Core of Learners’ Probabilistic Grammar Concerning help/help to Alternation
Probabilistic grammar has been widely applied in sociolinguistic variation studies, and its working assumptions are also found in other branches of linguistic studies, such as second and foreign language learning and acquisition. By considering learners’ language as a variety per se, we can explore the features of learners’ grammatical knowledge and its relations with various predictive factors. The fundamental idea is that while some predictors are universal, some are socioculturally shapable.
Our study provides compelling evidence for the principles of cognitive complexity and the horror aequi, which show the same pattern of effects on infinitive choices by both native English speakers and learners. In language production, users tend to prefer syntactic structures that can lower their cognitive load. When the syntactic environment is more cognitively complex, they tend to favor explicit grammatical elements. In addition, the reduction of linguistic elements may also indicate a reduced processing load.
However, it is important to note that constraints such as helper animacy and the length of the intervening material, which have been found significant in choosing infinitives, are unstable across genres. This is particularly evident since the present study focuses solely on argumentative essays while nearly all previous studies have used multi-genre corpora, such as the BNC. Moreover, when research is extended to foreign language learners, the strength and direction of predictor effects also vary. It seems clear that language use, especially among EFL learners, is malleable. Native speakers are more likely to use the avoidance strategy not only because of the psychological foundation of this syntactic phenomenon, but also due to their deeper immersion in similar English grammatical forms which have been linked to horror aequi effects (Mondorf, 2003). Chinese EFL learners, by contrast, although influenced by a universal psychological motivation, lack native speakers’ familiarity with semantically unmotivated identical elements, and therefore may not exhibit the same production pattern. A further investigation into the frequency of avoidance behavior shows that 67.8% of the concordances with to preceding help involve the use of bare infinitives, implying that some users still adhere to the horror aequi principle. However, when all the factors are measured collectively, the effect is weakened. This may be due to a common cognitive process, namely probabilistic processing (Ellis, 2006). Learners’ grammatical knowledge is sensitive to the frequency of usage, leading them to unconsciously choose forms that match the immediate context based on a network of clues.
Awareness of a cultural aspect should also be raised. EFL learners’ data show a higher proportion of noun phrases or pronouns (85.4%) compared to natives’ data (32%). Those inserted elements are also connected with animate helpers, as we have discovered. This may stem from the cultural connotation of help in China, where always being ready to help others is considered a traditional virtue. In such a context, people involved in a helping event express deep gratitude to each other, which shortens the conceptual distance between them. This deep-rooted cultural tradition may also be reflected in language use, particularly in the preference for shortened linguistic distance realized through bare infinitives. However, it should be noted that helper animacy has not been verified as a significant predictor of bare infinitives, suggesting that further exploration is needed. The core probabilistic grammar of learners’ output regarding the help/help to alternation echoes some common principles observed across different varieties and language phenomena. Nonetheless, it remains grounded in a usage-based view, which accounts for the stability and variability of constraints recognized by learners with different language experiences to varying degrees.
Conclusion
This paper addresses the constraining factors influencing infinitive choices across groups of Chinese EFL learners and native English speakers with a focus on argumentative essays. To tackle this issue, this study adopts a comparative approach based on large corpus data with help concordances, investigating a richly annotated dataset consisting of over a thousand interchangeable infinitive complement constructions. The regression analysis reveals both similarities and differences in the language-internal constraints. For both the native-speaking group and the learners’ group, the production of infinitives following help is affected by specific morphological changes in the main verb help and the horror aequi principle. Therefore, the principle of cognitive complexity and the avoidance strategy are tested. Comparatively, while Chinese EFL learners prefer to choose the cognitively simpler infinitive variant, their choices show greater emphasis on the shortened conceptual distance between the helper and helpee, as reflected by the omission of to, especially after pronominal objects. This phenomenon also prompts a consideration of cultural differences between the two groups, notably the traditional Chinese virtue of being ready to help others. Besides, two different patterns of interaction effects are observed in the data, which indicates that the strength and direction of one predictor’s effect may be overridden by another. This finding implies that language use is determined by an unconscious assessment of multiple clues which are learned from language input and shaped by language experiences.
Despite the valuable attempt to outline aspects of learners’ probabilistic knowledge, several limitations are worth noting. First, the two corpora are not completely comparable, as the mode and genre factors, along with learner-related variables, are not explored due to the limitations of the selected corpora and considerations of sample size. Second, although the corpora in the current study match in the period when the selected essays were produced, they have not been updated for over a decade. This raises concerns about how accurately the alternating practice reflects learners’ language behaviors nowadays. Lastly, the present study does not distinguish between EFL and ESL (English as a second language) learners. To deal with these issues, the present study could be extended in the following ways. Firstly, future research is encouraged to include additional language-external constraints, such as learners’ gender, proficiency levels, mode and other genres, which may collectively regulate infinitive choices in different ways. In addition, the time period of corpus compilation should be given due consideration. More recently, several learner corpora have been built or updated, for instance, the Trinity Lancaster Corpus (Gablasova et al., 2019) and the International Corpus of Learner English (version 3; Granger et al., 2020). Most importantly, as it is unclear whether learners of different language backgrounds are sensitive to the same probabilistic constraints, the comparative analysis of infinitive choices could be enriched by considering diverse L1 backgrounds. All in all, the investigation of help/help to alternations in learners’ language has yielded insightful observations and offered valuable implications for extending the theory of probabilistic grammar to foreign language acquisition.
Practically speaking, these findings can inform the compilation of dictionaries, teaching materials and other language resources, helping learners understand and develop probabilistic knowledge to use language more appropriately in various contexts.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Ethical Approval
This article does not deal with animal or human research directly. In terms of the corpora used in this study, we have promised to follow the conditions of use and cited the corpora properly with reference to either one of the related publications arising from the corpus-building project or the corpus handbook.
