
Abstract

Authorship credit allocation schemes have attracted considerable research attention. However, no consensus about which one is the best has been attained until now, and limited evidence from practical tasks has been reported. Therefore, this study uses the author interest discovery task as a real-world task case to provide valuable insights into authorship credit allocation schemes and guidelines for further practical applications. For this purpose, a novel model, AT^credit, is proposed to strengthen the Author-Topic (AT) model with an authorship credit allocation scheme, and collapsed Gibbs sampling is used to approximate the posterior and estimate model parameters. Extensive experiments using the SynBio dataset reveal several interesting findings as follows. (a) Any scheme for allocating unequal authorship credits performs better than its equal-credit counterpart with our AT^credit model in terms of perplexity. (b) The fixed versions of four out of the six schemes work better than their flexible counterparts with our AT^credit model, regardless of the hyper-authorship strategy. (c) The variation coefficient of credit awards can serve as a criterion to decide whether the hyper-authorship strategy should be used. (d) When the number of authors in a scholarly article is less than three, the six authorship credit allocation schemes are similar to each other with our AT^credit model in terms of perplexity. (e) The harmonic counting scheme performs the best, followed by the arithmetic counting scheme, and the network-based counting scheme performs the worst with our AT^credit model in terms of perplexity. (f) The arithmetic counting scheme is similar to the harmonic counting scheme in terms of the normalised mutual information (NMI) of discovered interests, but the geometric counting scheme is different from the axiomatic and network-based counting schemes.

Keywords

Author interest discovery; authorship credit allocation scheme; hyper-authorship strategy; perplexity

1. Introduction

As big science grows ever larger, research problems become more complex and demand more diverse and specialised knowledge [1]. As a result, collaboration among scholars has come to dominate modern science, as reflected in the number of authors per publication [2,3]. Each coauthor’s contribution to a multi-authored article is known to be unequal in the majority of scientific domains. Considerable effort has therefore been devoted to the fair assessment of each coauthor’s contribution by assigning proper credit awards [4–7], which are regarded as the currency system of the research and academic communities [8]. Such assessment can provide decision-making support for recruiting and promoting a scholar as well as for determining awards and funding.

To date, no single, universally accepted rule exists for arranging multiple authors in an academic article, but various norms have been established in different fields, such as alphabetical order in economics and mathematics [9–11] and pre-study agreements for ordering names in psychology and nursing [12,13]. An alternative solution is to order authors according to their relative contribution to a target publication, which is a dominant convention in many domains [11]. Thus, the position in the author list bears information about the (partial) responsibility of individuals for a scholarly article and the credit they deserve. Therefore, most authorship credit allocation schemes [4,6,7] consider only byline information.

The last four decades have witnessed significant progress in the domain of authorship credit allocation schemes since the work of Lindsey in 1980 [14]. Several major counting schemes have been developed, such as indiscriminate [14–16], arithmetic [17–19], geometric [20,21], harmonic [22,23], network-based [24], axiomatic [25] and golden number [26] counting schemes. However, until now, no consensus on which one is the best has been attained.

Interestingly, several scholars have argued that some schemes can approximate real-world credit assignments better than others based on surveys in chemistry, biomedicine and psychology. For instance, Hagen [27,28] determined that the harmonic counting scheme performed best and summarised its advantages: it simultaneously removes both inflationary and equalising biases and offers a better trade-off between parsimony and accuracy. Kim and Diesner [24] reported that the network-based counting scheme outperformed the others, including the harmonic counting scheme. However, Kim and Kim [4] argued that the superiority of the network-based counting scheme should be attributed to its flexibility rather than its innate characteristics. Based on the properties of the previous schemes, Osório [6] determined that a general counting method that simultaneously rules out both advantageous merging and advantageous splitting is impossible. However, from a theoretical perspective, the generalised variations in geometric and harmonic counting schemes [17,20] are the most flexible and robust [6].

Despite several empirical and theoretical comparisons [4,6,7], limited evidence from practical tasks has been reported. Furthermore, the volume of surveyed data from Maciejovsky et al. [29], Vinkler [30,31] and Wren et al. [32] is small (only 21 combination cases between the field and number of authors in total), and the number of authors per paper is less than six. Many actual phenomena are not reflected in the surveyed data, such as equal first authors [33–36], more than one corresponding author [34] and hyper-authorship [37,38]. Note that hyper-authorship is defined as the phenomenon in which an academic article has more than a certain number of authors (e.g. more than 10 authors). We believe that the advantages and disadvantages of these schemes should depend on specific applications. Therefore, this study takes the author interest discovery task [39,40] as a proxy for evaluating authorship credit allocation schemes.

Each scholar typically has their own research themes of interest. These themes are easily identified from the scholar’s curriculum vitae (CV). However, because CVs on the Internet are often outdated or unavailable, several data-driven topic models for automatically discovering interests from research outputs have been proposed in the literature, such as the Author-Topic (AT) [41], Author-Persona-Topic (APT) [42], Author-Interest-Topic (AIT) [43] and Author-Topic over time (AToT) [39,40] models. Upon closer examination, these models evidently assume an implicit uniform distribution on the author lists of scientific publications. That is, each scholar in the byline of a focal article is assumed to contribute equally to this article. This assumption is not in accordance with the real-world situation. Hence, to obtain valuable insights into authorship credit allocation schemes and provide guidelines for further practical applications, we identify the following research questions:

To strengthen conventional interest discovery models using an authorship credit allocation scheme such that the contribution assumption is more realistic.

To investigate whether the discriminate counting scheme can promote the performance of an author’s interest model and to what extent this model can benefit from this scheme.

To determine which authorship credit allocation scheme is best for discovering author interests, particularly when the number of authors in a scholarly article remains unchanged.

2. Research framework and methodology

As shown in Figure 1, the research framework consists of four phases. As preprocessing, the first phase splits sentences in the title and abstract of each scientific publication, tokenises and lemmatises each split sentence, recognises entity mentions and then filters stopwords. In the second phase, after disambiguating the author names [44], credits are allocated to each coauthor using one of six authorship credit allocation schemes. In the third phase, the AT^credit model with tuned parameters, which is an improved version of the AT model [41], is used to discover the themes of interest of each author. Finally, extensive comparisons among the six authorship credit allocation schemes are conducted in the fourth phase. The following subsections further describe the second and third phases.

Figure 1.

Research framework for comparing authorship credit allocation schemes.

2.1. Authorship credit allocation schemes

Before discussing the literature pertinent to authorship credit allocation schemes, this study uses the following notations. Given a scientific publication $m$ , let $A_{m}$ be the number of authors in the document and ${\vec{a}}_{m} = [a_{m, 1}, a_{m, 2}, \dots, a_{m, A_{m}}]$ denote the author list in the byline information. That is, $A_{m} = | {\vec{a}}_{m} |$ . Accordingly, credit assignments are expressed as ${\vec{c}}_{m} = [c_{m, 1}, c_{m, 2}, \dots, c_{m, A_{m}}]$ . Hereafter, a piece of work is assumed to have a unit value of one.

2.1.1. Indiscriminate counting scheme

In the indiscriminate counting scheme, distinguishing the importance of each coauthor is impossible. Full [14], fractional [15] and modified fractional [16] counting methods fall into this category. That is, this approach uniformly distributes one unit value to each coauthor $a_{m, i} \in {\vec{a}}_{m}$ in document $m$ after normalisation

c_{m, i} = \frac{1}{A_{m}}

(1)

2.1.2. Arithmetic counting scheme

This counting method linearly distributes publication credits in descending order among the coauthors ${\vec{a}}_{m}$ . The difference in credit awards between two adjacent coauthors, $a_{m, i}$ and $a_{m, i + 1}$ , is defined as $λ = c_{m, i} - c_{m, i + 1}$ for $i = 1, \dots, A_{m} - 1$ . Thus, the higher the value of $λ$ , the larger the decrease in credits between two adjacent coauthors. Formally, for $a_{m, i} \in {\vec{a}}_{m}$ in document $m$ , the resulting credit can be expressed as follows [17]

c_{m, i} = \frac{1}{A_{m}} + \frac{1}{2} λ (A_{m} - 2 i + 1)

(2)

with parameter $λ \in [0, (2 / (A_{m} (A_{m} + 1)))]$ . When $λ = 0$ , this method reduces to an indiscriminate counting scheme. At the other extreme, that is, $λ = 2 / (A_{m} (A_{m} + 1))$ , a fixed version of this method can be obtained [18,19]

c_{m, i} = \frac{2 (A_{m} + 1 - i)}{A_{m} (A_{m} + 1)}

(3)

Table 4 in Appendix 1 illustrates the possible credit allocations generated using equation (2). Here, the number of authors $A_{m}$ is fixed at 10, and the parameter $λ$ takes 10 evenly spaced values between zero and $(2 / (A_{m} (A_{m} + 1))) \approx 0.0182$ . These values are indexed from 0 to 9, where the 0th and 9th values denote the lower and upper bounds, respectively. As $λ$ increases, two interesting phenomena can be observed from Table 4 in Appendix 1: (a) the allocations become more unequal, but the difference among coauthors remains modest, even at the upper bound of $λ$ ; and (b) the credits of the first five ranked coauthors increase, whereas those of the last five ranked coauthors decrease.
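As an illustration, equation (2) and the grid of $λ$ values described above can be sketched in a few lines of Python. The function name `arithmetic_credits` and the printing loop are illustrative choices, not part of the original study; this is a minimal sketch rather than the implementation used to produce Table 4.

```python
def arithmetic_credits(n_authors, lam=None):
    """Credits under equation (2); lam=None uses the upper bound, i.e. equation (3)."""
    upper = 2.0 / (n_authors * (n_authors + 1))
    if lam is None:
        lam = upper
    # Small tolerance so that grid values computed in floating point are accepted.
    if not 0.0 <= lam <= upper + 1e-12:
        raise ValueError("lam must lie in [0, 2 / (A (A + 1))]")
    return [1.0 / n_authors + 0.5 * lam * (n_authors - 2 * i + 1)
            for i in range(1, n_authors + 1)]


# Ten evenly spaced values of lam for a byline of ten authors.
A = 10
upper = 2.0 / (A * (A + 1))
for step in range(10):
    lam = upper * step / 9
    print(f"lam={lam:.4f}", [round(c, 4) for c in arithmetic_credits(A, lam)])
```

Because the term $A_{m} - 2 i + 1$ is positive for $i \leq 5$ and negative for $i \geq 6$ when $A_{m} = 10$, the printed rows show the first five credits rising and the last five falling as $λ$ grows.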

2.1.3. Geometric counting scheme

The authorship credits allocated by this scheme form a geometric progression with an initial value $λ^{A_{m} - 1}$ and common ratio $1 / λ (λ \geq 1)$ . That is, the credit award for each author $a_{m, i} (i = 2, \dots, A_{m})$ after the first author is determined by dividing that of the previous neighbour coauthor by parameter $λ$ ; thus, the larger the value of $λ$ , the larger the ratio of credits between two consecutive coauthors. For $a_{m, i} \in {\vec{a}}_{m}$ in document $m$ , the credit award is formally defined according to the following normalised formula [20]

c_{m, i} = \frac{λ^{A_{m} - i}}{\sum_{i' = 1}^{A_{m}} λ^{i' - 1}} = \frac{(λ - 1) λ^{A_{m} - i}}{λ^{A_{m}} - 1}

(4)

When $λ$ is fixed at 1, this approach reduces to an indiscriminate counting scheme. This method delivers a fixed version [21] when $λ = 2$ . If $λ$ converges to infinity, full publication credit is assigned to the first author. This is called the single-author counting scheme [45].

For ease of understanding, Table 5 in Appendix 1 reports an example with the number of authors $A_{m} = 10$ . In this study, parameter $λ$ assumes a value in the interval [1.00, 3.50] with increments of 0.25. As $λ$ increases, the first-positioned coauthors are increasingly rewarded and the last-positioned coauthors are increasingly penalised. In particular, for $λ = 3.50$ , almost no credit is assigned to the last two ranked coauthors.
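A minimal Python sketch of equation (4) is given below. The function name `geometric_credits` is hypothetical, and the code normalises the weights $λ^{A_{m} - i}$ directly rather than using the closed form, which avoids a division by zero at $λ = 1$.

```python
def geometric_credits(n_authors, lam=2.0):
    """Credits under equation (4); lam=1 gives equal credits, lam=2 the fixed version."""
    if lam < 1.0:
        raise ValueError("lam must be at least 1")
    # Unnormalised weights lam^(A - i) for i = 1..A, then normalise to sum to one.
    weights = [lam ** (n_authors - i) for i in range(1, n_authors + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# The grid used in the text: lam from 1.00 to 3.50 in steps of 0.25, ten authors.
for step in range(11):
    lam = 1.0 + 0.25 * step
    print(f"lam={lam:.2f}", [round(c, 4) for c in geometric_credits(10, lam)])
```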

2.1.4. Harmonic counting scheme

Hagen [22] proposed that coauthor $a_{m, i} \in {\vec{a}}_{m}$ should receive $1 / i$ credits. The normalised credit award for $a_{m, i} \in {\vec{a}}_{m}$ can be readily obtained as follows [22]

c_{m, i} = \frac{1 / i}{\sum_{i' = 1}^{A_{m}} 1 / i'}

(5)

Unlike the geometric counting scheme, the ratio $(i + 1) / i$ between two consecutive coauthors in this method is not a constant but decreases with the authorship rank.

To accommodate different application scenarios, Abbas [23] generalised this approach by introducing a parameter $λ \geq 1$ as follows

c_{m, i} = \frac{1 / i^{1 - 1 / λ}}{\sum_{i' = 1}^{A_{m}} 1 / {i'}^{1 - 1 / λ}}

(6)

for $a_{m, i} \in {\vec{a}}_{m}$ in document $m$ . Here, parameter $λ$ controls the distribution of credit awards between two neighbouring coauthors. In this case, the ratio between two neighbouring coauthors becomes $((i + 1) / i)^{1 - (1 / λ)}$ . When $λ$ approaches 1, this scheme reduces to an indiscriminate counting method. The fixed version in equation (5) is recovered as $λ$ tends to infinity.

Table 6 in Appendix 1 shows the possible authorship credits allocated using equation (6) when the number of authors $A_{m} = 10$ . Parameter $λ$ can assume any real number between 1 and infinity. Here, 11 different values are considered, including infinity $(\infty)$ . From Table 6 in Appendix 1, two interesting patterns are observed: (a) as $λ$ increases, the credits become more non-uniform, but the difference among coauthors is larger than that with the arithmetic counting scheme; and (b) because the ratio of credits $((i + 1) / i)^{1 - (1 / λ)}$ between coauthors $i$ and $i + 1$ decreases as the ranks increase, the disparities for last-ranked coauthors are smaller than those for first-ranked coauthors in terms of credits.
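Equations (5) and (6) can be sketched together, since equation (5) is the limit of equation (6) as $λ$ tends to infinity. In the following Python sketch, the function name `harmonic_credits` is hypothetical and `math.inf` stands in for $λ = \infty$.

```python
import math


def harmonic_credits(n_authors, lam=math.inf):
    """Credits under equation (6); lam=math.inf recovers equation (5)."""
    if lam < 1.0:
        raise ValueError("lam must be at least 1")
    exponent = 1.0 - 1.0 / lam  # 1.0 / math.inf evaluates to 0.0
    weights = [1.0 / i ** exponent for i in range(1, n_authors + 1)]
    total = sum(weights)
    return [w / total for w in weights]


credits = harmonic_credits(10, lam=4.0)
# The ratio between neighbours i and i + 1 equals ((i + 1) / i) ** (1 - 1 / lam).
for i in range(1, 10):
    assert math.isclose(credits[i - 1] / credits[i], ((i + 1) / i) ** 0.75)
print([round(c, 4) for c in credits])
```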

2.1.5. Network-based counting scheme

The first step in this method [24] is equivalent to that of the fractional counting scheme [15], that is, $c_{m, i} = 1 / A_{m}$ for $a_{m, i} \in {\vec{a}}_{m}$ . Then, each coauthor after the first author uniformly transfers a portion ( $λ \in [0, 1]$ ) of their own credit awards to preceding authors. Specifically, this method groups the involved authors into three categories: first, last and middle authors. Each type of author is expressed by a different formula as follows

c_{m, i} = \begin{cases} \frac{1}{A_{m}} + \frac{λ}{A_{m}} \sum_{i' = 1}^{A_{m} - 1} \frac{1}{A_{m} - i'}, & i = 1 \\ \frac{1 - λ}{A_{m}} + \frac{λ}{A_{m}} \sum_{i' = 1}^{A_{m} - i} \frac{1}{A_{m} - i'}, & 1 < i < A_{m} \\ \frac{1 - λ}{A_{m}}, & i = A_{m} \end{cases}

(7)

Overall, the smaller the value of $λ$ , the more uniformly the publication credits are distributed. When $λ = 0$ , each coauthor after the first author does not transfer any of their shares, and the scheme reduces to the indiscriminate counting scheme. Conversely, if $λ$ is set at 1, the last coauthor receives no credit. This study permits parameter $λ$ to assume a value between 0 and 1 with steps of 0.1 and fixes the number of authors $A_{m}$ to 10, as shown in Table 7 in Appendix 1. As the fraction $λ$ of each author’s credit share that is transferred to preceding coauthors increases, more credits are concentrated on the first-positioned coauthors.
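The three cases of equation (7) can be expressed as a single rule: every author except the first keeps $(1 - λ) / A_{m}$, and author $i$ receives $(λ / A_{m}) \sum_{k = i}^{A_{m} - 1} 1 / k$ from the authors ranked after them, where $k = A_{m} - i'$ re-indexes the sum in equation (7). The Python sketch below follows this reading; the function name `network_credits` is hypothetical.

```python
def network_credits(n_authors, lam=1.0):
    """Credits under equation (7) with transfer fraction lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    credits = []
    for i in range(1, n_authors + 1):
        # The first author transfers nothing; every other author keeps (1 - lam) / A.
        kept = 1.0 / n_authors if i == 1 else (1.0 - lam) / n_authors
        # Author j > i passes lam / A uniformly to its j - 1 predecessors, so author i
        # receives lam / (A * k) for k = i..A-1; the sum is empty for the last author.
        received = lam / n_authors * sum(1.0 / k for k in range(i, n_authors))
        credits.append(kept + received)
    return credits


# The grid used in the text: lam from 0 to 1 in steps of 0.1, ten authors.
for step in range(11):
    lam = step / 10
    print(f"lam={lam:.1f}", [round(c, 4) for c in network_credits(10, lam)])
```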

2.1.6. Axiomatic counting scheme

This method is derived from three axioms: ranking preference, credit normalisation and maximum entropy [25]. To cover various practical cases in terms of authorship ordering, the authors of document $m$ are first divided into $G_{m} \leq A_{m}$ groups with the number of elements $g_{m, k}$ in the $k$th group according to their ordered positions. For $a_{m, i} \in {\vec{a}}_{m}$ in document $m$ , the following formula formally expresses the corresponding credit award [25]

c_{m, i} = \frac{1}{G_{m}} \sum_{j = i}^{G_{m}} \frac{1}{\sum_{k = 1}^{j} g_{m, k}}

(8)

If an equal contribution is not claimed by any coauthor, then each group of coauthors has only one member. In this case, we have $g_{m, k} = 1$ , $G_{m} = A_{m}$ and

c_{m, i} = \frac{1}{A_{m}} \sum_{j = i}^{A_{m}} \frac{1}{j}

(9)

Concerning this popular special case, by observing the ratio of credits between two neighbouring coauthors $a_{m, i}$ and $a_{m, i + 1} (i = 1, \dots, A_{m} - 1)$ , that is, $(1 / i + \dots + 1 / A_{m}) / (1 / (i + 1) + \dots + 1 / A_{m})$ , Osório determined an interesting property [6]: the ratio decreases with authorship position $i$ for the first-ranked coauthors, but increases with position $i$ for the last-ranked coauthors. This property indicates that the last author is penalised more than the other authors.

Note that this method is a fixed type of counting scheme [4]. To benefit from the flexibility, similar to Kim and Diesner [24] and Kim and Kim [4], we extend this approach by combining the fractional counting scheme [15] and equation (8) as follows

c_{m, i} = \frac{1 - λ}{A_{m}} + \frac{λ}{G_{m}} \sum_{j = i}^{G_{m}} \frac{1}{\sum_{k = 1}^{j} g_{m, k}}

(10)

with $λ \in [0, 1]$ . The trade-off between the indiscriminate counting scheme and fixed version in equation (8) is controlled by parameter $λ$ . According to equation (10), Table 8 in Appendix 1 presents the resulting credit awards when $g_{m, k} = 1$ and $G_{m} = A_{m} = 10$ . Again, as parameter $λ$ increases, the difference between coauthors in terms of credits widens.
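The following Python sketch illustrates equations (8)–(10). It assumes that the index $i$ in equation (8) refers to the group to which an author belongs, so that all members of a group receive the same credit; this reading, like the function name `axiomatic_credits`, is an assumption made for illustration.

```python
def axiomatic_credits(group_sizes, lam=1.0):
    """Credits under equation (10); group_sizes[k] plays the role of g_{m,k}.

    Returns one credit per author in byline order. lam=1 gives equation (8),
    and lam=0 gives equal credits.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    n_groups = len(group_sizes)
    n_authors = sum(group_sizes)
    # cumulative[j] is the number of authors in groups 1..j+1.
    cumulative, running = [], 0
    for size in group_sizes:
        running += size
        cumulative.append(running)
    credits = []
    for k, size in enumerate(group_sizes):
        fixed = sum(1.0 / cumulative[j] for j in range(k, n_groups)) / n_groups
        credit = (1.0 - lam) / n_authors + lam * fixed
        credits.extend([credit] * size)
    return credits


print(axiomatic_credits([1] * 10))        # equation (9): no equal contributors
print(axiomatic_credits([2, 1, 1], 0.5))  # two equal first authors, lam = 0.5
```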

2.1.7. Golden number counting scheme

With the help of the golden number $ρ = ((\sqrt{5} - 1) / 2) \approx 0.618$ , Assimakis and Adam [26] proposed the following formula to calculate the credit award of $a_{m, i} \in {\vec{a}}_{m}$

c_{m, i} = \begin{cases} ρ {(1 - ρ)}^{i - 1}, & i = 1, \dots, A_{m} - 1 \\ {(1 - ρ)}^{A_{m} - 1}, & i = A_{m} \end{cases}

(11)

The main idea behind this counting scheme is that the fraction $ρ$ of the full credit is allocated to the first author, the fraction $ρ$ of the remaining credit $1 - ρ$ is assigned to the second author and so on, until the last author, who receives all the credit that remains. This makes the ratio of credits between two consecutive coauthors a constant $1 / (1 - ρ) \approx 2.618$ , except for the last author, for whom the ratio is $ρ / (1 - ρ) \approx 1.618$ . Note that this method is also a fixed-type counting scheme without flexibility. This study proposes the following formula of a flexible version

c_{m, i} = \begin{cases} \frac{1 - λ}{A_{m}} + λ ρ {(1 - ρ)}^{i - 1}, & i = 1, \dots, A_{m} - 1 \\ \frac{1 - λ}{A_{m}} + λ {(1 - ρ)}^{A_{m} - 1}, & i = A_{m} \end{cases}

(12)

with $λ \in [0, 1]$ . The choice of a specific $λ$ may vary from the indiscriminate counting scheme ( $λ = 0$ ) to the fixed version in equation (11) ( $λ = 1$ ). Table 9 in Appendix 1 reports the authorship credit allocation when $λ$ varies between 0 and 1. For the fixed version, that is, the most extreme situation, the first author obtains 61.8% publication credits, and the last author obtains 0.02% publication credits. This is similar to the geometric counting scheme with $λ = 2.50$ (Table 5 in Appendix 1).
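Equations (11) and (12) can be checked numerically with the short Python sketch below. The names `RHO` and `golden_credits` are hypothetical; the code is a minimal illustration rather than the implementation behind Table 9.

```python
import math

RHO = (math.sqrt(5) - 1) / 2  # golden number, approximately 0.618


def golden_credits(n_authors, lam=1.0):
    """Credits under equation (12); lam=1 gives the fixed version in equation (11)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # Authors 1..A-1 take the fraction RHO of whatever credit is still unassigned.
    fixed = [RHO * (1 - RHO) ** (i - 1) for i in range(1, n_authors)]
    # The last author receives everything that remains.
    fixed.append((1 - RHO) ** (n_authors - 1))
    return [(1.0 - lam) / n_authors + lam * c for c in fixed]


fixed = golden_credits(10)
print(f"first author: {100 * fixed[0]:.1f}%, last author: {100 * fixed[-1]:.2f}%")
print("ratio of consecutive credits:", round(fixed[0] / fixed[1], 3))
print("ratio for the last author:", round(fixed[-2] / fixed[-1], 3))
```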

2.2. Equal contributors and hyper-authorship

If a scholarly article has more than one first or corresponding author, or if the first and corresponding authors do not coincide, these authors can be moved to the first positions before applying the counting scheme, and the average of their scores is then allocated to each of them. Here, the first and corresponding authors are assumed to contribute equally to a target article, regardless of their number. This conventional strategy has been used in previous studies [4,24,28,44]. Thus, the credits of other coauthors are reduced accordingly. Note that the axiomatic counting scheme is an exception that can directly handle this case.
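The reordering and averaging strategy described above might be sketched as follows. The function names, the use of a set `lead_authors` for the first and corresponding authors, and the choice of the harmonic scheme as the example are all assumptions made for illustration.

```python
def harmonic(n_authors):
    """Fixed harmonic credits, used here only as an example scheme."""
    weights = [1.0 / i for i in range(1, n_authors + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def credits_with_equal_leads(byline, lead_authors, scheme=harmonic):
    """Move lead authors to the front, apply the scheme, then average their credits."""
    leads = [a for a in byline if a in lead_authors]
    others = [a for a in byline if a not in lead_authors]
    ordered = leads + others  # relative order within each part is preserved
    credits = scheme(len(ordered))
    if leads:
        mean = sum(credits[:len(leads)]) / len(leads)
        credits = [mean] * len(leads) + credits[len(leads):]
    return dict(zip(ordered, credits))


# Author "d" is the corresponding author and "a" the first author.
print(credits_with_equal_leads(["a", "b", "c", "d"], {"a", "d"}))
```

In this example, authors "b" and "c" drop from the second and third positions to the third and fourth, which is how the credits of the other coauthors are reduced.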

In addition, owing to the diminishing nature of the aforementioned authorship credit allocation schemes, when a publication has an extraordinary number of coauthors [38], many schemes fail when assigning credits to individual authors. Failure means that some coauthors receive almost no credit, such as the last two positioned coauthors in Table 5 in Appendix 1 ( $λ = 3.50$ ). To handle this issue, Tscharntke et al. [46] suggested that the first author should obtain full credit, the second author half, the third a third and so on, up to the 10th author, after which each remaining author receives 0.05 credits. This suggestion is called the sequence-determines-credit (SDC) scheme [46]. Formally, coauthor $a_{m, i} \in {\vec{a}}_{m}$ can obtain credit award $c_{m, i}$ as follows

c_{m, i} = \begin{cases} \frac{1}{i}, & i \leq 10 \\ 0.05, & \text{otherwise} \end{cases}

(13)

This scheme combines an unnormalised harmonic counting scheme [22] and transformed indiscriminate counting scheme.

From another perspective, equation (13) can be understood as follows: once a coauthor would obtain less than a tenth of the credit of the first-ranked author, a twentieth of the credit of the first author is assigned to that coauthor and to each remaining coauthor. Naturally, the credits should then be re-normalised so that they sum to 1. Several examples are coloured in grey in Tables 5–9 in Appendix 1. To some extent, the number of highlighted elements reflects the degree of variability in credits. To quantitatively measure the dispersion of credits, the variation coefficient [47], which is defined as the ratio of the standard deviation to the mean, is calculated, as shown in Table 1. Among the authorship credit allocation schemes, the geometric counting scheme has the largest dispersion degree, followed by the golden number counting scheme, and the arithmetic counting scheme yields the smallest dispersion degree.

Table 1.

Variation coefficient for authorship credit allocation schemes.

Arithmetic counting scheme	Geometric counting scheme	Harmonic counting scheme	Network-based counting scheme	Axiomatic counting scheme	Golden number counting scheme
0.2894	1.6081	0.6836	0.6464	0.4998	1.1074
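The SDC rule in equation (13), its generalisation to other schemes and the variation coefficient can be sketched in Python as follows. The function names are hypothetical, and the use of the population standard deviation is an assumption, because the text does not state which estimator underlies Table 1; the sketch is therefore not claimed to reproduce the values in that table.

```python
import statistics


def sdc_credits(n_authors):
    """Equation (13), followed by normalisation so that the credits sum to one."""
    raw = [1.0 / i if i <= 10 else 0.05 for i in range(1, n_authors + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def apply_hyper_authorship(credits):
    """From the first credit below a tenth of the first author's credit onwards,
    assign a twentieth of the first author's credit, then re-normalise."""
    credits = list(credits)
    cut = next((idx for idx, c in enumerate(credits) if c < credits[0] / 10),
               len(credits))
    adjusted = credits[:cut] + [credits[0] / 20] * (len(credits) - cut)
    total = sum(adjusted)
    return [c / total for c in adjusted]


def variation_coefficient(credits):
    """Ratio of the (population) standard deviation to the mean."""
    return statistics.pstdev(credits) / statistics.mean(credits)


geometric = [2.0 ** (6 - i) for i in range(1, 7)]  # unnormalised, lam = 2, six authors
print(apply_hyper_authorship(geometric))
print(round(variation_coefficient(sdc_credits(15)), 4))
```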

2.3. Author interest discovery

Author interest discovery can help answer many important questions, such as which themes each scholar prefers, which scholars are similar to each other in terms of their expertise and which scholarly articles are likelier to be read by a focal researcher. Given that many CVs, particularly their latest versions, cannot be accessed from the Internet, several data-driven topic models have been developed. A popular model is the AT model [41], which correlates authorship information with themes to provide a better insight into the expertise of each author. The last decade has witnessed significant progress in this direction, and several variants have been proposed, such as the APT [42], AIT [43] and AToT [39,40] models. Furthermore, many successful cases have been reported in several personalised academic recommendation systems, including Microsoft Academic [48] and AMiner [49].

These models treat each scientific publication as if generated using a two-stage stochastic procedure. Let us consider the AT model [41] as an example. To generate each word token in a scientific publication, an author index is first drawn uniformly from the byline information, and a theme index is then drawn from that author’s research interests. Finally, a word token is drawn from the multinomial distribution of the theme. From this generative procedure, these models evidently share the same assumption of indiscriminate contribution. Therefore, the AT model [41] is extended in this subsection to consider the authorship credit allocation scheme when discovering the author’s interest from scholarly articles by introducing a set of hidden random variables for credits, as shown in Figure 2. For convenience, the extended model is named AT^credit. Notably, the idea in our AT^credit model is also applicable to other similar models.

Figure 2.

Graphical model representation of the (a) AT and (b) AT^credit models.

Before discussing more specific terms, the notation used in these models is briefly introduced. Let $K$ , $M$ , $V$ and $A$ be the numbers of topics, documents, unique words and unique authors, respectively. Given article $m$ with byline information ${\vec{a}}_{m}$ , $N_{m}$ and $A_{m}$ denote the number of word tokens and authors in $m$ , respectively. ${\vec{c}}_{m}$ stores the credits of each coauthor allocated by an authorship credit allocation scheme with the parameter $λ$ in Subsection 2.1. The research interests ${\vec{ϑ}}_{a}$ for scholar $a$ are modelled as a multinomial distribution over the themes, in which theme $k$ is represented as a multinomial distribution over words ${\vec{φ}}_{k}$ . The probability distribution over the themes in a multiauthor article is a mixture of the distributions associated with their coauthors. The $n$th token $w_{m, n}$ in article $m$ is associated with topic assignment $z_{m, n}$ and author assignment $x_{m, n}$ . In addition, $\vec{α}$ and $\vec{β}$ are Dirichlet priors (hyper-parameters).

As with many well-known probabilistic topic models [40,41,50–52], the AT^credit model can be described from the perspective of a generative procedure as follows. (a) Two multinomial distributions ${\vec{φ}}_{k}$ and ${\vec{ϑ}}_{a}$ for each theme $k \in [1, K]$ and author $a \in [1, A]$ are randomly drawn from Dirichlet distributions with parameters $\vec{β}$ and $\vec{α}$ , respectively. (b) The credits ${\vec{c}}_{m}$ for each article $m \in [1, M]$ are calculated using a designated authorship credit allocation scheme. (c) Finally, word token $w_{m, n}$ is randomly drawn from ${\vec{φ}}_{z_{m, n}}$ after the author and theme indices are randomly drawn from ${\vec{c}}_{m}$ and ${\vec{ϑ}}_{x_{m, n}}$ , respectively.
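Steps (a)–(c) can be illustrated with a small forward simulation in Python. The function names, the toy corpus sizes and the prior values (0.5, chosen for numerical convenience and not the settings used in this study) are assumptions made for illustration.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalising independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(byline, credits, theta, phi, n_tokens, rng):
    """Step (c): for each token draw an author, then a theme, then a word."""
    tokens = []
    for _ in range(n_tokens):
        x = rng.choices(byline, weights=credits)[0]               # author from c_m
        z = rng.choices(range(len(phi)), weights=theta[x])[0]     # theme from theta_x
        w = rng.choices(range(len(phi[z])), weights=phi[z])[0]    # word from phi_z
        tokens.append((w, z, x))
    return tokens


rng = random.Random(0)
K, V, A = 3, 8, 4  # toy numbers of themes, words and authors
# Step (a): draw theme-word and author-theme distributions from their priors.
phi = [sample_dirichlet([0.5] * V, rng) for _ in range(K)]
theta = [sample_dirichlet([0.5] * K, rng) for _ in range(A)]
# Step (b): credits for a two-author article, here fixed harmonic credits (2/3, 1/3).
byline, credits = [2, 0], [2.0 / 3.0, 1.0 / 3.0]
print(generate_document(byline, credits, theta, phi, 10, rng))
```

Passing equal credits to `generate_document` corresponds to the uniform author draw of the original AT model.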

From Figure 2 and the generative procedure, our AT^credit evidently degenerates to the AT model [41] when using an indiscriminate counting scheme. That is, the AT model is a special case of our model. Although a variety of approximate inference algorithms have been proposed in the literature [53–55], collapsed Gibbs sampling, a particular case of Markov chain Monte Carlo (MCMC), was originally adopted in Rosen-Zvi et al. [41] to approximate the posterior of the AT model. To facilitate a comparison, collapsed Gibbs sampling was used in this study.

In the collapsed Gibbs sampling process, the posterior must be calculated, that is, the conditional distribution of the hidden random variables ( $\vec{z}$ and $\vec{x}$ ) given the observations and other hidden variables, $\Pr (z_{m, n}, x_{m, n} | \vec{w}, {\vec{z}}_{\neg (m, n)}, {\vec{x}}_{\neg (m, n)}, \vec{a}, \vec{c}, \vec{α}, \vec{β}, λ)$ . Here, ${\vec{z}}_{\neg (m, n)}$ and ${\vec{x}}_{\neg (m, n)}$ are the topic and author assignments, respectively, for all word tokens, except $w_{m, n}$ . According to the derivation presented in Appendix A2, the posterior can be formally expressed as follows

\Pr (z_{m, n}, x_{m, n} | \vec{w}, {\vec{z}}_{\neg (m, n)}, {\vec{x}}_{\neg (m, n)}, \vec{a}, \vec{c}, \vec{α}, \vec{β}, λ) \propto \frac{n_{z_{m, n}}^{(w_{m, n})} + β_{w_{m, n}} - 1}{\sum_{v = 1}^{V} (n_{z_{m, n}}^{(v)} + β_{v}) - 1} \times \frac{n_{x_{m, n}}^{(z_{m, n})} + α_{z_{m, n}} - 1}{\sum_{k = 1}^{K} (n_{x_{m, n}}^{(k)} + α_{k}) - 1} \times c_{m, x_{m, n}}

(14)

where $n_{k}^{(v)}$ denotes the number of tokens of word $v$ with topic assignment $k$ , and $n_{a}^{(k)}$ is the number of word tokens with topic assignment $k$ and author assignment $a$ . The first two terms on the right-hand side of equation (14) are identical to those in the AT model [41]. Using the expectation of the Dirichlet distribution, we obtain the following model parameters

φ_{k, v} = \frac{n_{k}^{(v)} + β_{v}}{\sum_{v = 1}^{V} (n_{k}^{(v)} + β_{v})}

(15)

ϑ_{a, k} = \frac{n_{a}^{(k)} + α_{k}}{\sum_{k = 1}^{K} (n_{a}^{(k)} + α_{k})}

(16)

Although our AT^credit model can consider an authorship credit allocation scheme, parameter estimation formulas (15)–(16) remain unchanged. That is, equations (15)–(16) are shared by the AT^credit and AT models. This study fixed the number of topics $K$ at 50 and symmetric Dirichlet priors $α$ and $β$ at 0.1 and 0.01, respectively. Collapsed Gibbs sampling was run for 2000 iterations, including 500 iterations for the burn-in period. We conducted all experiments discussed in Section 4 with other parameter settings (the number of topics and symmetric Dirichlet priors), and a similar conclusion can be drawn. That is, these parameters did not affect the main conclusions of this study. Owing to space limitations, this study only presents the results with the aforementioned parameter settings.
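A compact Python sketch of the sampler implied by equations (14)–(16) is given below. It assumes symmetric priors, removes the current token from the counts before sampling (which plays the role of the $- 1$ terms in equation (14)) and estimates the parameters from the final state only, without averaging over post-burn-in samples. The function name and data layout are hypothetical, and the sketch is not the implementation used in this study.

```python
import random


def gibbs_at_credit(docs, bylines, credits, n_topics, n_words, n_authors,
                    alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """docs[m]: word ids; bylines[m]: author ids; credits[m]: c_m aligned with bylines[m]."""
    rng = random.Random(seed)
    n_kv = [[0] * n_words for _ in range(n_topics)]    # n_k^{(v)}
    n_k = [0] * n_topics                               # sum over v of n_k^{(v)}
    n_ak = [[0] * n_topics for _ in range(n_authors)]  # n_a^{(k)}
    n_a = [0] * n_authors                              # sum over k of n_a^{(k)}

    def update(k, a, w, delta):
        n_kv[k][w] += delta
        n_k[k] += delta
        n_ak[a][k] += delta
        n_a[a] += delta

    # Random initialisation of the topic and author assignments.
    z, x = [], []
    for m, doc in enumerate(docs):
        z.append([rng.randrange(n_topics) for _ in doc])
        x.append(rng.choices(bylines[m], weights=credits[m], k=len(doc)))
        for w, k, a in zip(doc, z[m], x[m]):
            update(k, a, w, +1)

    for _ in range(n_iter):
        for m, doc in enumerate(docs):
            for n, w in enumerate(doc):
                update(z[m][n], x[m][n], w, -1)  # exclude the current token
                pairs, weights = [], []
                for a, c in zip(bylines[m], credits[m]):
                    for k in range(n_topics):
                        p_word = (n_kv[k][w] + beta) / (n_k[k] + n_words * beta)
                        p_topic = (n_ak[a][k] + alpha) / (n_a[a] + n_topics * alpha)
                        pairs.append((k, a))
                        weights.append(p_word * p_topic * c)  # equation (14)
                k, a = rng.choices(pairs, weights=weights)[0]
                z[m][n], x[m][n] = k, a
                update(k, a, w, +1)

    # Equations (15) and (16), evaluated on the final state of the chain.
    phi = [[(n_kv[k][v] + beta) / (n_k[k] + n_words * beta)
            for v in range(n_words)] for k in range(n_topics)]
    theta = [[(n_ak[a][k] + alpha) / (n_a[a] + n_topics * alpha)
              for k in range(n_topics)] for a in range(n_authors)]
    return phi, theta


docs = [[0, 1, 0], [2, 3, 2, 3]]
bylines = [[0, 1], [1]]
credits = [[0.7, 0.3], [1.0]]
phi, theta = gibbs_at_credit(docs, bylines, credits, n_topics=2, n_words=4,
                             n_authors=2, n_iter=20)
print(theta)
```

Only the factor `c` distinguishes this update from the AT model; with equal credits it cancels in the normalisation, which is why the parameter estimates in equations (15)–(16) keep the same form.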

3. Dataset

This study used the SynBio dataset [56], which was originally collected from the Web of Science (WoS) database with a search strategy pertaining to the synthetic biology (SynBio) field. Upon closer examination, we determined that this dataset includes three duplications, one record without any author and one record missing three coauthors. Only one copy of each duplicated record was retained for further analysis, and the record without any author was excluded. The missing coauthors were manually supplemented. Finally, our dataset contained 2580 articles.

Xu et al. [44] disambiguated author names using a revised rule-based scoring and clustering method and then individually checked them manually. After name disambiguation, 9990 unique scholars were identified. Single-authored publications accounted for 6.09%; publications authored by 2–7 scholars, 79.46%; and scholarly articles with more than 10 authors (hyper-coauthorship), 4.50%.

Figure 3 plots the number of publications against the number of scholars, with both axes on the log scale. The distribution approximately follows a power law. That is, most authors appear in only one article, but a few scholars authored numerous articles. In our dataset, Charles Boone from University of Toronto authored 29 scholarly articles in total, followed by Brenda J. Andrews from the same university with 22 publications.

Figure 3.

Number of publications (y-axis) authored by the number of scholars (x-axis) in the log-log space.

As mentioned, some valuable information is missing from the WoS database [57], although the data quality has been improved significantly. Consider the byline information of a focal publication as an example. According to our observations, only the author list is typically recorded, in addition to the corresponding author. If this publication involves several corresponding authors (Figure 4(a)) or has multiple authors with equal contributions (Figure 4(b) and (c)), the WoS database does not seem to store this information.

Figure 4.

Several examples on various practical authorship orderings: (a) Example of four corresponding authors (UID = ‘WOS:000254209300033’). (b) Example of all equally contributing authors (UID = ‘WOS:000280334000001’). (c) Example of three equally contributing authors with the role of neither first nor corresponding author (UID = ‘WOS:000308225500017’).

Therefore, this study fetched all full texts in PDF format from the World Wide Web (WWW) and individually checked the corresponding authorship information. In the SynBio dataset, 136 articles (5.27%) were not explicitly attached to any corresponding author. Scientific publications with multiple first and corresponding authors accounted for 9.03% (233) and 8.06% (208), respectively. Meanwhile, 13 articles were authored by multiple scholars with equal contributions, but with the role of neither first nor corresponding author.

Note that the same preprocessing steps as in Xu et al. [44,58] were conducted in this study. For completeness, they are described as follows. The Geniass tool [59] was used to detect the sentences in titles and abstracts, and Geniatagger [60] was used to tokenise and lemmatise the detected sentences. Subsequently, an expanded stopword list from the Natural Language Toolkit (NLTK) was used to filter the stopwords. All numbers (including integer and decimal) were collectively denoted by the special word NUMBER. To reduce interference, copyright and article status information were excluded from further analysis.
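The stopword filtering and number normalisation steps can be sketched as follows. This is a minimal illustration that assumes the text has already been split into sentences, tokenised and lemmatised; the function name `normalise_tokens`, the regular expression and the tiny stopword set in the usage example are hypothetical and do not reproduce the expanded NLTK list used in this study.

```python
import re

# Matches integers and decimals such as "12", "3.5" or ".75" (an assumed pattern).
NUMBER_PATTERN = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)")


def normalise_tokens(tokens, stopwords):
    """Drop stopwords and replace every numeric token by the special word NUMBER."""
    cleaned = []
    for token in tokens:
        if token.lower() in stopwords:
            continue  # stopword: filtered out
        if NUMBER_PATTERN.fullmatch(token):
            cleaned.append("NUMBER")  # integers and decimals share one symbol
        else:
            cleaned.append(token)
    return cleaned


# Usage with a toy stopword set (hypothetical).
example = normalise_tokens(["the", "yeast", "2", "3.5", "genes"], {"the"})
```

Collapsing all numbers into one symbol keeps the vocabulary small, so that numeric tokens do not each form a rare word type.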

In addition, this study split the dataset into two disjoint subsets: a training set of 2348 documents and a test set of 232 documents. Unlike the conventional splitting procedure, the following constraint was imposed: each scholar in the test set must have authored at least one publication in the training set. This splitting procedure is nontrivial because our dataset has the characteristics of a multi-label learning task [61–63] if the author list is viewed as the labels of the resulting document. To handle this problem, the stratification method in Sechidis et al. [64] was used first. Then, the instances that did not satisfy the above constraint were moved from the test set to the training set, and several solo articles were moved from the training set to the test set.
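The repair step applied after stratification can be sketched as below. The sketch reads the constraint as "every author of a test document also appears in the training set", which is an interpretation. It covers only the move from the test set to the training set; the stratification itself and the compensating move of solo articles are omitted. The function name and the `(doc_id, authors)` document layout are hypothetical.

```python
def enforce_author_constraint(train_docs, test_docs):
    """Move test documents with an author unseen in training into the training set.

    Each document is a (doc_id, authors) pair, where authors is a list of names.
    Returns the repaired (train, test) lists.
    """
    train = list(train_docs)
    seen_authors = {author for _, authors in train for author in authors}
    kept_test = []
    for doc in test_docs:
        _, authors = doc
        if all(author in seen_authors for author in authors):
            kept_test.append(doc)  # constraint satisfied
        else:
            train.append(doc)  # violates the constraint: move to training
            seen_authors.update(authors)  # its authors are now covered
    return train, kept_test


# Usage on a toy corpus (hypothetical identifiers).
train, test = enforce_author_constraint(
    [("d1", ["A", "B"]), ("d2", ["C"])],
    [("d3", ["A"]), ("d4", ["A", "D"]), ("d5", ["D"])],
)
```

Because the training set only grows during the pass, a document that satisfied the constraint when it was checked still satisfies it at the end, so a single pass is sufficient.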

4. Results and discussion

4.1. Evaluation criteria

Perplexity [65], originally derived from information theory, measures how well a probability distribution or model predicts an instance. As an intrinsic evaluation metric [66], it is widely used to evaluate language models. To select a proper model from multiple candidates, each model can be trained on the training dataset and then evaluated by how well it predicts unseen instances from the test dataset. Similar to other topic models, our AT^credit model is actually a language model over entire texts and authors from scientific publications. Therefore, perplexity was used in this study. Formally, this measure is defined as the exponential of the negative normalised predictive likelihood of word tokens under trained model $M$ (equations (17)–(18)). A lower value on the testing set indicates a better generalisation performance.

\operatorname{Perplexity} (\vec{\tilde{w}} | \vec{\tilde{a}}, M) = \exp \left( - \frac{\sum_{m = 1}^{M} \log \Pr ({\vec{\tilde{w}}}_{m, \cdot} | {\vec{\tilde{a}}}_{m}, M)}{\sum_{m = 1}^{M} N_{m}} \right)

(17)

In principle, the predictive likelihood of a test document can be calculated by integrating out all parameters from the joint distribution of word tokens in the document. For our AT^credit model, the likelihood of a document in the testing set, $\Pr ({\vec{\tilde{w}}}_{m, \cdot} | {\vec{\tilde{a}}}_{m}, M)$, can be directly expressed as a function of the multinomial parameters:

\Pr ({\vec{\tilde{w}}}_{m, \cdot} | {\vec{\tilde{a}}}_{m}, M) = \prod_{n = 1}^{N_{m}} \sum_{a = 1}^{A_{m}} \sum_{k = 1}^{K} φ_{k, {\tilde{w}}_{m, n}} ϑ_{a, k} C_{m, a}

(18)
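As a hedged illustration, the perplexity computation can be sketched in plain Python. The data layout is an assumption made for this sketch: `phi[k][w]` holds the topic-word probabilities, `theta[a][k]` the author-topic probabilities, and `credits[a]` the credit of author `a` in the document, assumed to sum to one over its coauthors. The function names are hypothetical.

```python
import math


def document_log_likelihood(words, authors, credits, phi, theta):
    """Log-likelihood of one test document, in the spirit of equation (18)."""
    num_topics = len(phi)
    log_lik = 0.0
    for w in words:
        # Credit-weighted mixture over the document's authors and all topics.
        token_prob = sum(
            credits[a] * theta[a][k] * phi[k][w]
            for a in authors
            for k in range(num_topics)
        )
        log_lik += math.log(token_prob)
    return log_lik


def perplexity(docs, phi, theta):
    """Exponential of the negative per-token log-likelihood over all documents."""
    total_log_lik = 0.0
    total_tokens = 0
    for words, authors, credits in docs:
        total_log_lik += document_log_likelihood(words, authors, credits, phi, theta)
        total_tokens += len(words)
    return math.exp(-total_log_lik / total_tokens)


# Toy example: two topics that are both uniform over a two-word vocabulary.
phi = [[0.5, 0.5], [0.5, 0.5]]
theta = {"a": [0.5, 0.5], "b": [1.0, 0.0]}
docs = [([0, 1, 1], ["a", "b"], {"a": 0.6, "b": 0.4})]
toy_perplexity = perplexity(docs, phi, theta)
```

In the toy example every token has probability 0.5 regardless of the credits, so the perplexity equals the vocabulary size of 2, which is a convenient sanity check.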

4.2. Parameter tuning

To reduce the influence of parameter $λ$ in the authorship credit allocation scheme, the optimal parameter was identified on the training set in terms of perplexity, as shown in Figures 5–10. Note that $λ = 1.0$ was excluded from the network-based counting scheme because the last coauthor does not obtain any credit in this case. Because the axiomatic counting scheme can handle the case of equal contributors, the other five schemes used the equal contributor strategy for fair comparisons. However, the hyper-authorship strategy was treated as a switch so that its influence on author interest discovery could be observed. For convenience, Table 2 summarises the optimal parameter settings for each scheme. Note that if the performance under different parameter settings was not easy to distinguish (e.g. $λ = 0.7$, $λ = 0.8$ and $λ = 0.9$ in Figure 8), a smaller value for parameter $λ$ was used according to Occam’s razor [67].

Figure 5.

Perplexities of the AT^credit model using the arithmetic counting scheme: (a) disabled hyper-authorship strategy and (b) enabled hyper-authorship strategy.

Figure 6.

Perplexities of the AT^credit model using the geometric counting scheme: (a) disabled hyper-authorship strategy and (b) enabled hyper-authorship strategy.

Figure 7.

Perplexities of the AT^credit model using the harmonic counting scheme: (a) disabled hyper-authorship strategy and (b) enabled hyper-authorship strategy.

Figure 8.

Perplexities of the AT^credit model using the network-based counting scheme: (a) disabled hyper-authorship strategy and (b) enabled hyper-authorship strategy.

Figure 9.

Perplexities of the AT^credit model using the axiomatic counting scheme: (a) disabled hyper-authorship strategy and (b) enabled hyper-authorship strategy.

Figure 10.

Perplexities of the AT^credit model using the golden number counting scheme: (a) disabled hyper-authorship strategy and (b) enabled hyper-authorship strategy.

Table 2.

Optimal parameter settings for each authorship credit allocation scheme.

Scheme	Hyper-authorship strategy	Optimal parameter
Arithmetic counting	Disabled	$\frac{2}{A_{m} (A_{m} - 1)}$
Arithmetic counting	Enabled*	$\frac{2}{A_{m} (A_{m} - 1)}$
Geometric counting	Disabled*	3.50
Geometric counting	Enabled	3.25
Harmonic counting	Disabled*	$\infty$
Harmonic counting	Enabled	$\infty$
Network-based counting	Disabled*	0.8
Network-based counting	Enabled	0.7
Axiomatic counting	Disabled	1.0
Axiomatic counting	Enabled*	1.0
Golden number counting	Disabled*	1.0
Golden number counting	Enabled	1.0

Note: The asterisk (*) superscript indicates the optimal hyper-authorship strategy for each authorship credit allocation scheme.

From Table 2, several interesting phenomena can be observed. First, authorship credit allocation schemes can be divided into two categories according to the hyper-authorship strategy: (a) arithmetic and axiomatic counting schemes and (b) geometric, harmonic, network-based and golden number counting schemes. This seems to be in line with the variation coefficient in Table 1 if 0.5 is taken as the cutting point of the dispersion degree. That is, if the dispersion degree exceeds 0.5, the hyper-authorship strategy is better disabled; otherwise, it is enabled.

Second, for four out of the six schemes, the fixed version achieved the best performance in the author interest discovery task, regardless of whether the hyper-authorship strategy was used. Third, any scheme for allocating unequal authorship credits performs better than its equal-credit counterpart in terms of perplexity. This indicates that our AT^credit model outperforms the AT model. Finally, for the network-based counting scheme with the enabled hyper-authorship strategy, when $λ \geq 0.7$, our AT^credit model does not seem sensitive to this parameter.
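The cutting-point rule for the hyper-authorship strategy can be written as a small helper. This sketch assumes that the variation coefficient is the population standard deviation of a credit vector divided by its mean; how the values in Table 1 were actually computed is not restated here, so this definition and both function names are assumptions.

```python
from statistics import mean, pstdev

CUTTING_POINT = 0.5  # threshold on the dispersion degree quoted in the text


def variation_coefficient(credits):
    """Population standard deviation divided by the mean (assumed definition)."""
    return pstdev(credits) / mean(credits)


def hyper_authorship_recommended(credits, cutting_point=CUTTING_POINT):
    """Return True when the rule suggests enabling the hyper-authorship strategy."""
    # Dispersion above the cutting point -> disable; otherwise -> enable.
    return variation_coefficient(credits) <= cutting_point


# Illustrative credit vectors (not taken from the paper).
flat = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
decisions = (hyper_authorship_recommended(flat), hyper_authorship_recommended(skewed))
```

Under these assumptions, a flat credit vector falls below the cutting point and a strongly skewed one falls above it.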

4.3. Prediction power analysis

To check the prediction power of each scheme and answer the third question listed in Section 1, we trained six AT^credit models on the training set, and each model used a different authorship credit allocation scheme with the tuned parameters in Table 2. Then, five chains of collapsed Gibbs sampling were run for 500 iterations on the test set, and the perplexities from these chains were averaged, as depicted in Figure 11. Surprisingly, among these schemes, the network-based counting scheme performed the worst with our AT^credit model in terms of perplexity. This observation is opposite to that in Kim and Diesner [24]. The perplexities between the other schemes are so close that differentiating them is difficult. This is in accordance with the observations in Kim and Kim [4].

Figure 11.

Prediction performances for each authorship credit allocation scheme with tuned parameters on the test set.

To further investigate the influence of the number of authors on the prediction performance of each authorship credit allocation scheme, we grouped scientific publications into 10 subsets according to the number of authors, as shown in Figure 12. Note that single-authored publications were excluded because they require no scheme. Figure 12 shows several interesting phenomena. (a) When the number of authors in a scholarly article is less than 3, the six authorship credit allocation schemes are considerably similar in terms of perplexity. (b) The harmonic counting scheme performed the best, followed by the arithmetic counting scheme, and the network-based counting scheme performed the worst with our AT^credit model in terms of perplexity. (c) The harmonic and arithmetic counting schemes follow a similar trend, whereas the other schemes follow another pattern. (d) The axiomatic counting scheme outperforms the golden number counting scheme with our AT^credit model when the number of authors is less than 10; however, the latter performs better when the number of authors is greater than 10.

Figure 12.

Performance trends for each authorship credit allocation scheme according to the number of authors in a scholarly article.

4.4. Consistency analysis

To obtain insights into the commonalities and differences among the authorship credit allocation schemes, NMI [66,68] was used. The NMI normalises the mutual information (MI) between two schemes to a range between 0 (no MI) and 1 (perfect correlation). Before computing the NMI, each author is assigned to the interest theme with the highest strength, that is, $\arg\max_{k} ϑ_{a, k}$ for author $a$. If two or more themes shared the same highest strength, one of them was selected at random.
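The two steps described above, assigning each author a dominant theme and comparing two assignments with NMI, can be sketched as follows. Several normalisations of MI exist [68]; this sketch divides by the geometric mean of the two entropies, which is an assumption and not necessarily the variant used in this study. The function names are hypothetical.

```python
import math
import random
from collections import Counter


def dominant_theme(strengths, rng):
    """Index of the strongest theme; ties are broken at random."""
    best = max(strengths)
    candidates = [k for k, s in enumerate(strengths) if s == best]
    return rng.choice(candidates)


def entropy(counts, n):
    """Shannon entropy (natural log) of a label distribution given as counts."""
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def normalised_mutual_information(labels_x, labels_y):
    """MI between two labelings divided by sqrt(H(X) * H(Y))."""
    n = len(labels_x)
    count_x = Counter(labels_x)
    count_y = Counter(labels_y)
    count_xy = Counter(zip(labels_x, labels_y))
    mi = sum(
        (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )
    h_x, h_y = entropy(count_x, n), entropy(count_y, n)
    if h_x == 0.0 or h_y == 0.0:
        # Degenerate case: at least one labeling puts every author in one theme.
        return 1.0 if h_x == h_y else 0.0
    return mi / math.sqrt(h_x * h_y)


# Usage: theme strengths per author under two schemes (toy values).
rng = random.Random(0)
scheme_1 = [dominant_theme(t, rng) for t in ([0.1, 0.9], [0.8, 0.2], [0.3, 0.7])]
scheme_2 = [dominant_theme(t, rng) for t in ([0.4, 0.6], [0.7, 0.3], [0.2, 0.8])]
agreement = normalised_mutual_information(scheme_1, scheme_2)
```

NMI is invariant to relabelling of the themes, which matters here because theme indices learned under different schemes need not correspond to each other.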

Because most scholars have authored only one article (Figure 3), only scholars with two or more publications were considered to reduce noise. Table 3 reports the NMI values for the six authorship credit allocation schemes. Table 3 shows that the arithmetic counting scheme had the strongest consistency with the harmonic counting scheme, and the geometric counting scheme seems to be noticeably different from the axiomatic and network-based counting schemes. Furthermore, the interests of a scholar with more articles can be uncovered more consistently than those of a scholar with fewer articles, regardless of the scheme used (Table 3(a)–(c)).

Table 3.

Consistencies between six authorship credit allocation schemes in terms of the normalised mutual information.

	1	2	3	4	5	6
1: Arithmetic counting scheme	1.0	0.4509	0.4864*	0.4728	0.4743	0.4533
2: Geometric counting scheme	0.4509	1.0	0.4638	0.4662	0.4498	0.4836
3: Harmonic counting scheme	0.4864*	0.4638	1.0	0.4579	0.4631	0.4684
4: Network-based counting scheme	0.4728	0.4662	0.4579	1.0	0.4514	0.4559
5: Axiomatic counting scheme	0.4743	0.4498	0.4631	0.4514	1.0	0.4586
6: Golden number counting scheme	0.4533	0.4836	0.4684	0.4559	0.4586	1.0
(a) Scholars with publications $\geq 2$
	1	2	3	4	5	6
1: Arithmetic counting scheme	1.0	0.6484	0.6902*	0.6650	0.6574	0.6545
2: Geometric counting scheme	0.6484	1.0	0.6554	0.6440	0.6334	0.6536
3: Harmonic counting scheme	0.6902*	0.6554	1.0	0.6664	0.6648	0.6526
4: Network-based counting scheme	0.6650	0.6440	0.6664	1.0	0.6545	0.6589
5: Axiomatic counting scheme	0.6574	0.6334	0.6648	0.6545	1.0	0.6424
6: Golden number counting scheme	0.6545	0.6536	0.6526	0.6589	0.6424	1.0
(b) Scholars with publications $\geq 3$
	1	2	3	4	5	6
1: Arithmetic counting scheme	1.0	0.7160	0.7872*	0.7425	0.7441	0.7451
2: Geometric counting scheme	0.7160	1.0	0.7423	0.7145	0.7159	0.7536
3: Harmonic counting scheme	0.7872*	0.7423	1.0	0.7397	0.7634	0.7466
4: Network-based counting scheme	0.7425	0.7145	0.7397	1.0	0.7430	0.7479
5: Axiomatic counting scheme	0.7441	0.7159	0.7634	0.7430	1.0	0.7429
6: Golden number counting scheme	0.7451	0.7536	0.7466	0.7479	0.7429	1.0
(c) Scholars with publications $\geq 4$

Note: The asterisk (*) superscript indicates the highest normalised mutual information value among the six authorship credit allocation schemes.

5. Conclusion

Scholarly articles are not simply a means for communicating scientific findings but also a proxy for a scholar’s performance. From this perspective, the assessment of a researcher’s output plays an important role in hiring, promotion, award and funding procedures, among others. Given that increasing collaboration among scholars dominates knowledge production, and each coauthor’s contribution to a multi-authored article is unequal in most scientific domains, authorship credit allocation schemes have attracted considerable attention from scientometric and bibliometric scholars. Several schemes have been developed, and comparisons among them have been conducted in the literature [4–7], but no consensus about which one is best has been reached so far. Furthermore, many actual phenomena of arranging multiple authors are not reflected in the small-scale survey data, and limited evidence from practical tasks has been reported.

As a core module of personalised academic recommendation systems, the author interest discovery task [39,40] is considered a case study in this article to obtain insights about authorship credit allocation schemes and provide guidelines for further practical applications. However, many topic models for discovering author interests implicitly assume equal contribution from each coauthor to a target document. To overcome this limitation, a novel model, AT^credit, was proposed to strengthen the AT model [41] with an authorship credit allocation scheme, and the collapsed Gibbs sampling algorithm was used to approximate the posterior and estimate the model parameters. In summary, our model considered six counting schemes, including fixed and flexible versions and equal contributors and hyper-authorship strategies. Moreover, the methodology of AT^credit is applicable to other similar models.

Several interesting observations can be summarised from the extensive experiments conducted on the SynBio dataset [56]. (a) Any scheme for allocating unequal coauthorship credits performs better than its equal-credit counterpart with our AT^credit model, in terms of perplexity. (b) The fixed versions of four out of the six schemes work better than their flexible counterparts with our AT^credit model, regardless of the hyper-authorship strategy. (c) The variation coefficient of credit awards can serve as a criterion to decide whether the hyper-authorship strategy should be used. (d) When the number of authors in a scholarly article is less than 3, the six authorship credit allocation schemes are considerably similar to each other with our AT^credit model in terms of perplexity. (e) The harmonic counting scheme performed the best, followed by the arithmetic counting scheme, and the network-based counting scheme performed the worst with our AT^credit model in terms of perplexity. (f) The arithmetic counting scheme is similar to the harmonic counting scheme in terms of the NMI of discovered interests, but the geometric counting scheme is different from the axiomatic and network-based counting schemes. The main contributions of this study are summarised as follows:

A novel model, AT^credit, was proposed to strengthen the AT model with an authorship credit allocation scheme, and the collapsed Gibbs sampling algorithm was used to approximate the posterior.

Flexible versions of the axiomatic and golden number counting schemes were developed. Among the six counting schemes, the fixed versions of four schemes performed better than their flexible counterparts.

The variation coefficient of the credits can serve as a criterion to decide whether the hyper-authorship strategy should be used.

Note that only one dataset was used; therefore, our findings require further verification in future work. In addition, CRediT (Contributor Roles Taxonomy) [69,70] has been adopted by over 120 journals to allow authors to provide an accurate and detailed description of their contributions to a target work. However, most authorship credit allocation schemes consider only byline information. Therefore, we intend to further explore how to design an authorship credit allocation scheme based on these contribution declarations.

Footnotes

Appendix 1

Acknowledgements

This research received financial support from the National Natural Science Foundation of China (Nos: 72074014 and 72004012). We also thank the anonymous reviewers for their valuable comments.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship and/or publication of this article.

References

1. Leahey. From sole investigator to team scientist: trends in the practice and study of research collaboration. Annu Rev Sociol 2016; 42: 81–100.

2. Greene. The demise of the lone author. Nature 2007; 450: 1165–1165.

3. Wuchty, Jones, Uzzi. The increasing dominance of teams in production of knowledge. Science 2007; 316(5827): 1036–1039.

4. Kim. Rethinking the comparison of authorship credit allocation schemes. J Informetr 2015; 9(3): 667–673.

5. Marušić, Bošnjak, Jerončić. A systematic review of research on the meaning, ethics and practices of authorship across scholarly disciplines. PLoS ONE 2011; 6(9): e23477.

6. Osório. On the impossibility of a perfect counting method to allocate the credits of multiauthored publications. Scientometrics 2018; 116(3): 2161–2173.

7. Ding, Song, et al. Author credit-assignment schemas: a comparison and analysis. J Am Soc Inf Sci Technol 2016; 67(8): 1973–1989.

8. Claxton LD. Scientific authorship part 2: history, recurring issues, practices, and guidelines. Mutat Res 2005; 589(1): 31–45.

9. Frandsen, Nicolaisen. What is in a name? Credit assignment practices in different disciplines. J Informetr 2010; 4(4): 608–617.

10. van Praag BMS. The benefits of being economics professor A (rather than Z). Economica 2008; 75(300): 782–796.

11. Waltman. An empirical analysis of the use of alphabetical authorship in scientific publishing. J Informetr 2012; 6(4): 700–711.

12. Spiegel, Keith-Spiegel. Assignment of publication credits: ethics and practices of psychologists. Am Psychol 1970; 25(8): 738–747.

13. Waltz, Nelson, Chambers SB. Assigning publication credits. Nurs Outlook 1985; 33: 233–238.

14. Lindsey. Production and citation measures in the sociology of science: the problem of multiple authorship. Soc Stud Sci 1980; 10(2): 145–162.

15. De Solla Price. Multiple authorship. Science 1981; 212(4498): 986–986.

16. Sivertsen, Rousseau, Zhang. Measuring scientific contributions with modified fractional counting. J Informetr 2019; 13(2): 679–694.

17. Abbas AM. Generalized linear weights for sharing credits among multiple authors. Eprint Arxiv 2010; 1012; 5477.

18. Trenchard PM. Hierarchical bibliometry: a new objective measure of individual scientific performance to replace publication counts and to complement citation measures. J Inf Sci 1992; 18(1): 69–75.

19. van Hooydonk. Fractional counting of multiauthored publications: consequences for the impact of authors. J Am Soc Inf Sci 1997; 48(10): 944–945.

20. Abbas AM. Polynomial weights or generalized geometric weights: yet another scheme for assigning credits to multiple authors. eprint arXiv:2011:1103.2848, https://arxiv.org/abs/1103.2848

21. Egghe, Rousseau, van Hooydonk. Methods for accrediting publications to authors or countries: consequences for evaluation studies. J Am Soc Inf Sci 2000; 51(2): 145–157.

22. Hagen NT. Harmonic allocation of authorship credit: source-level correction of bibliometric bias assures accurate publication and citation analysis. PLoS ONE 2008; 3(12): e4021.

23. Liu, Fang. Fairly sharing the credit of multi-authored papers and its application in the modification of h-index and g-index. Scientometrics 2012; 91(1): 37–49.

24. Kim, Diesner. A network-based approach to coauthorship credit allocation. Scientometrics 2014; 101(1): 587–602.

25. Stallings, Vance, Yang, et al. Determining scientific impact using a collaboration index. Proc Natl Acad Sci USA 2013; 110(24): 9680–9685.

26. Assimakis, Adam. A new author’s productivity index: p-index. Scientometrics 2010; 85(2): 415–427.

27. Hagen NT. Harmonic publication and citation counting: sharing authorship credit equitably – not equally, geometrically or arithmetically. Scientometrics 2010; 84(3): 785–793.

28. Hagen NT. Harmonic coauthor credit: a parsimonious quantification of the byline hierarchy. J Informetr 2013; 7(4): 784–791.

29. Maciejovsky, Budescu, Ariely. The researcher as a consumer of scientific publications: how do name-ordering conventions affect inferences about contribution credits? Mark Sci 2009; 28(3): 589–598.

30. Vinkler. Research contribution, authorship and team cooperativeness. Scientometrics 1993; 26(1): 213–230.

31. Vinkler. Evaluation of the publication activity of research teams by means of scientometric indicators. Curr Sci 2000; 79(5): 602–612.

32. Wren, Kozak, Johnson, et al. The write position: a survey of perceived contributions to papers based on byline position and number of authors. EMBO Rep 2007; 8(11): 988–991.

33. Birnholtz JP. What does it mean to be an author? The intersection of credit, contribution, and collaboration in science. J Am Soc Inf Sci Technol 2006; 57(13): 1758–1770.

34. Loads of special authorship functions: linear growth in the percentage of ‘equal first authors’ and corresponding authors. J Am Soc Inf Sci Technol 2009; 60(11): 2378–2381.

35. Lapidow, Scudder. Shared first authorship. J Med Libr Assoc 2019; 107(4): 618–620.

36. Sun, et al. Equal contributions and credit: an emerging trend in the characterization of authorship in major anaesthesia journals during a 10-yr period. PLoS ONE 2013; 8(8): e71430.

37. Castelvecchi. Physics paper sets record with more than 5000 authors. Nature 2015; 15: 17657.

38. Cronin. Hyperauthorship: a postmodern perversion or evidence of a structural shift in scholarly communication practices? J Am Soc Inf Sci Technol 2001; 52(7): 558–569.

39. Shi, Qiao, et al. Author-topic evolution model and its application in analysis of research interests evolution. J China Soc Sci Tech Inform 2013; 32(9): 912–919.

40. Shi, Qiao, et al. A dynamic users’ interest discovery model with distributed inference algorithm. Int J Distrib Sens Netw 2014; 2014: 280890.

41. Rosen-Zvi, Chemudugunta, Griffiths, et al. Learning author topic models from text corpora. ACM Trans Inf Syst 2010; 28(1): 41–438.

42. Mimno, McCallum. Expertise modeling for matching papers with reviewers. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, San Jose, CA, 12–15 August 2007, pp. 500–509. New York: ACM.

43. Kawamae. Author interest topic model. In: Proceedings of the 33rd international ACM SIGIR conference on research and development in information retrieval, Geneva, 19–23 July 2010, pp. 887–888. New York: ACM.

44. Hao, Yang, et al. A topic models based framework for detecting and forecasting emerging technologies. Technol Forecast Soc Chang 2021; 162: 120366.

45. Cole. Social stratification in science. Chicago, IL: University of Chicago Press, 1973.

46. Tscharntke, Hochberg, Rand, et al. Author sequence and credit for contributions in multiauthored publications. Plos Biol 2007; 5(1): e18.

47. Everitt, Skrondal. The Cambridge dictionary of statistics. 4th ed. Cambridge: Cambridge University Press, 2010.

48. Wang, Shen, Huang, et al. A review of Microsoft academic services for science of science studies. Front Big Data 2019; 2: 45.

49. Tang, Zhang, Jin, et al. Topic level expertise search over heterogeneous networks. Mach Learn 2011; 82(2): 211–237.

50. Wen, et al. A shared interest discovery model for co-author relationship in SNS. Int J Distrib Sens Netw 2014; 2014: 820715.

51. Blei, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res 2003; 3: 993–1022.

52. Wang, Zhu. Semantic relation extraction aware of n-gram features from unstructured biomedical text. J Biomed Inform 2018; 86: 59–70.

53. Andrieu, de Freitas, Doucet, et al. An introduction to MCMC for machine learning. Mach Learn 2003; 50(1-2): 5–43.

54. Hoffman, Blei, Wang, et al. Stochastic variational inference. J Mach Learn Res 2013; 14: 1303–1347.

55. Jordan, Ghahramani, Jaakkola, et al. An introduction to variational methods for graphical models. Mach Learn 1999; 37(2): 183–233.

56. Porter, Chiavetta, Newman NC. Measuring tech emergence: a contest. Technol Forecast Soc Chang 2020; 159: 120176.

57. Hao, et al. Types of DOI errors of cited references in Web of Science with a cleaning method. Scientometrics 2019; 120(3): 1427–1437.

58. Hao, et al. Emerging research topics detection with multiple machine learning models. J Informetr 2019; 13(4): 100983.

59. Sætre, Yoshida, Yakushiji, et al. AKANE system: protein-protein interaction pairs in the BioCreAtIvE2 challenge, PPI-IPS subtask. In: Proceedings of the 2nd BioCreative challenge evaluation workshop, Madrid, 23–25 April 2007, pp. 209–212. Spain: BioCreative.

60. Tsuruoka, Tateishi, Kim, et al. Developing a robust part-of-speech tagger for biomedical text. In: Bozanis, Houstis (eds) Proceedings of the 10th Panhellenic conference on informatics. Berlin: Springer, 2005, pp. 382–392.

61. ML2S-SVM: multi-label least-squares support vector machine classifiers. Electr Libr 2019; 37(6): 1040–1058.

62. Zhang. Team BJUT-BJFU at BioCreative VII LitCovid track: a deep learning based method for multi-label topic classification in COVID-19 literature. In: Proceedings of the BioCreative VII challenge evaluation workshop, Bethesda, Maryland 8–10 November 2021, pp. 275–277. Madrid, Spain: BioCreative.

63. Chen, Allot, Leaman, et al. Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid Track for COVID-19 literature topic annotations. Database 2022; 2022: baac069.

64. Sechidis, Tsoumakas, Vlahavas. On the stratification of multi-label data. In: Gunopulos, Hofmann, Malerba, et al. (eds) Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Berlin; Heidelberg: Springer, 2011, pp. 145–158.

65. Azzopardi, Girolami, van Rijsbergen. Investigating the relationship between language model perplexity and IR precision-recall measures. In: Proceedings of the 26th international ACM SIGIR conference on research and development in information retrieval, Toronto, ON, Canada, 28 July–1 August 2003, pp. 369–370. New York: ACM.

66. Qiao, Zhu, et al. Reviews on determining the number of clusters. Appl Math Inf Sci 2016; 10(4): 1493–1512.

67. Schaffer. What not to Multiply without Necessity. Australas J Philos 2015; 93(4): 644–664.

68. Vinh, Epps, Bailey. Information theoretic measures for clustering comparison: variants, properties, normalization and correction for chance. J Mach Learn Res 2010; 11(Oct): 2837–2854.

69. Allen, O’Connell, Kiermer. How can we ensure visibility and diversity in research contribution? How the Contributor Role Taxonomy (CRediT) is helping the shift from authorship to contributorship. Learn Publ 2019; 32(1): 71–74.

70. Brand, Allen, Altman, et al. Beyond authorship: attribution, contribution, collaboration, and credit. Learn Publ 2015; 28(2): 11–155.

An improved author-topic (AT) model with authorship credit allocation schemes
