How to identify the roots of broad research topics and fields? The introduction of RPYS sampling using the example of climate change research

Abstract

Since the introduction of the reference publication year spectroscopy (RPYS) method and the corresponding programme CRExplorer, many studies have been published revealing the historical roots of topics, fields and researchers. The application of the method was restricted up to now by the available memory of the computer used for running the CRExplorer. Thus, many users could not perform RPYS for broader research fields or topics. In this study, we present various sampling methods to solve this problem: random, systematic and cluster sampling. We introduce the script language of the CRExplorer that can be used to draw many samples from the population data set. Based on a large data set of publications from climate change research, we compare RPYS results using population data with RPYS results using different sampling techniques. From our comparison with the full RPYS (population spectrogram), we conclude that the cluster sampling performs worst and the systematic sampling performs best. The random sampling also performs very well but not as well as the systematic sampling. The study therefore demonstrates the fruitfulness of the sampling approach for applying RPYS.

Keywords

Climate change research; CRExplorer; population; reference publication year spectroscopy (RPYS); sampling

1. Introduction

Thor et al. [1] introduced the CRExplorer – a programme that can be used to investigate the roots of research fields and topics. For example, the programme has been used by Rhaiem and Bornmann [2] to reveal the historical roots of the new topic in scientometrics of academic efficiency assessments or by Yeung [3] identifying seminal works that built the foundation for functional magnetic resonance imaging studies of taste and food. The CRExplorer facilitates the so-called reference publication year spectroscopy (RPYS) [4]. This statistical method is based on a field- or topic-specific publication set including cited references (CRs). RPYS visualises CR counts by referenced publication years (RPYs, not to be confused with the method RPYS); years with high counts (especially early years) point to underlying cited publications that might be interpreted as historical roots or landmark papers of a field or topic.

Since its introduction, RPYS has faced the problem of processing large data sets that are based on broader topics or fields. The hardware capacities of conventional computers running the CRExplorer are frequently not sufficient to process large data sets. To tackle this problem, we introduce in this article the technique of drawing several samples from a large data set and producing RPYS results based on these samples. The study is based on a large data set that was produced by Haunschild et al. [5] to identify the early roots of climate change research [6]. As we will demonstrate in this study, some sampling methods lead to results that are very close to the results from the complete climate change data set (the population).

By using samples to draw conclusions on populations, the study connects to the recent discussion in the Journal of Informetrics around the paper ‘sampling issues in bibliometric analysis’ published by Williams and Bornmann [7]. Both authors demonstrate the relevance of the sampling concept for bibliometric analyses (in the context of inferential statistics). Some authors have commented on the paper by questioning the relevance of the sampling topic for the field. In this article, however, we will demonstrate the fruitfulness of this concept for bibliometric studies.

In section 2, we describe the climate change data set that we used in this study to demonstrate the various RPYS sampling methods. The three different sampling methods that are implemented in the CRExplorer are also explained in this section: random, systematic and cluster sampling. Section 3 starts with the RPYS based on the complete climate change data set, that is, the population data set (section 3.1). The results of the population RPYS constitute the outcome which should be reached by the sampling methods: the closer the RPYS of the sampling method is to the population RPYS, the more appropriate the method is for replacing the population RPYS. The results of the population RPYS revealing the historical roots of climate change research are explained in detail. Sections 3.3, 3.4 and 3.6 present the RPYS results based on the different sampling methods.

All subsections in section 3 presenting the RPYS results based on the population and sample data are followed by corresponding subsections, in which the script language of the CRExplorer is explained for performing the specific RPYS. The explanations are provided in detail so that the reader learns how to use the language.

2. Data set and methodology

2.1. Climate change publications

Our analyses are based on the Web of Science (WoS, Clarivate Analytics) custom data of our in-house database derived from the Science Citation Index Expanded (SCI-E), Social Sciences Citation Index (SSCI) and Arts and Humanities Citation Index (AHCI) produced by Clarivate Analytics (Philadelphia, PA, USA). We used in this study a publication set containing most of the relevant literature regarding climate change research. The set was compiled using a sophisticated method known as ‘interactive query formulation’. A set of key papers was retrieved and a reformulated search query based on the keyword analysis of key papers was constructed [8]. The search was restricted to the publication years 1980–2014 and to the document types ‘article’ and ‘review’. A detailed description of the search process for retrieving the relevant publications on climate change can be found in Haunschild et al. [5].

In total, the publication set (the population) comprises 222,060 publications and 10,932,050 CRs in 4,004,082 distinct CR variants. An earlier RPYS study by Marx et al. [6] has analysed the RPYs before 1971. The restriction to RPYs before 1971 reduced the number of distinct CR variants to 239,887. This reduction of the number of cited references (NCRs) made the RPYS analysis feasible. The CRs published between 1970 and 2014 comprise 6,594,657 CRs in 3,728,879 distinct CR variants. The main memory requirements rise with the number of unique CR variants, which makes it impossible to analyse the RPYS using the full climate change data set on a current standard computer. Thus, the data set is well suited to demonstrate different sampling methods in this study.

2.2. Sampling methods

If a data set contains numerous CRs from many publications, the full data set cannot be completely imported in the CRExplorer because of restrictions by the available main memory on the computer of many users. To tackle this problem, the user has the option to draw one of the following three types of samples from the full data set. The samples are based on different methods for selecting a subset from the original set of all CRs (the population) [9]:

Random sampling: The sample of CRs is randomly selected from the population where every possible combination of n CRs from the population has the same chance of being selected. For example, if the user wants to import a sample of 100 CRs out of the population of 400 overall CRs, CRExplorer randomly selects 25% of all CRs.

Systematic sampling: Systematic sampling is a very popular sampling method [9], whereby elements are selected from an ordered sampling frame. Here, a given number of CRs is used to select the sample uniformly distributed over the list of all CRs of the citing publications. For example, if the user wants to import 100 CRs out of 400 overall CRs, CRExplorer systematically selects 25% of the list of all CRs by picking the first, fifth, ninth and so on.

Cluster sampling: In cluster sampling, the sampling frame is based not on individual units but on clusters of units. Thus, clusters of units are sampled instead of individual units. The CRExplorer randomly selects 1 year from the citing publication years which lie between two given years set by the user of the programme. Then, all CRs in the papers published in this year are selected as a sample and are imported. The results of Bornmann and Mutz [10] reveal that the restriction to all CRs from a recent citing year leads to results very similar to those obtained by considering all CRs from several citing years in reference analysis.
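The three selection rules described above can be sketched in a few lines of Python. This is only an illustration of the sampling logic, not the CRExplorer implementation; the function names, the `seed` and `offset` parameters and the dictionary `crs_by_citing_year` are assumptions introduced for this example.

```python
import random

def random_sample(crs, n, seed=None):
    """Simple random sample: every combination of n CRs is equally likely."""
    return random.Random(seed).sample(crs, n)

def systematic_sample(crs, n, offset=0):
    """Pick n CRs at equal distances from the ordered list, starting at offset."""
    if not 0 < n <= len(crs):
        raise ValueError("n must be between 1 and the number of CRs")
    step = len(crs) // n
    return [crs[(offset + i * step) % len(crs)] for i in range(n)]

def cluster_sample(crs_by_citing_year, first_year, last_year, seed=None):
    """Pick one citing year at random and return all CRs cited in that year."""
    years = sorted(y for y in crs_by_citing_year if first_year <= y <= last_year)
    year = random.Random(seed).choice(years)
    return year, crs_by_citing_year[year]

# Example from the text: 100 CRs out of 400, i.e. the first, fifth, ninth, ...
population = list(range(1, 401))
print(systematic_sample(population, 100)[:3])  # [1, 5, 9]
```

The systematic sampler depends on the order of the CR list, whereas the random sampler does not, which is why an offset only matters for the former.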

3. Results

3.1. Population analysis

The results of the RPYS based on the population data that are shown in Figure 1 (the population spectrogram) serve as baseline for the comparison with the results based on the three sampling methods. The figure presents the NCRs for each RPY. Frequently occurring RPYs show up as distinct peaks within the RPYS spectrogram. The highest peak in Figure 1 with the most CRs is visible for RPY = 2000.

Figure 1.

Annual distribution of CRs throughout the period 1970–2010 which have been cited in climate change publications (published between 1980 and 2014).

For this study, we restrict the RPYS analysis to the RPYs from 1970 to 2010 and use the results for comparison with the RPYS results from various sampling methods. We connect with this focus to the study by Marx et al. [6], who analysed the very early roots of climate change research. Thus, the results of the RPYS are of interest not only in the comparison of samples and population but also for revealing landmark publications in climate change research from the past which have been published more recently.

The RPYS in Figure 1 shows not only the NCRs (in red) but also the 5-year median deviation (in blue). Thus, the blue line is the deviation of the NCRs in each year from the median for the NCRs in the two previous, the current and the two following years. This deviation from the 5-year median provides a curve smoother than the one in terms of absolute numbers. Using the 5-year median deviation curve, peaks in the data can be identified more easily than with the absolute numbers, since each year is compared with its adjacent years. Although we have calculated the RPYS until 2014, we show the spectrogram in Figure 1 only until 2010 to ensure a referencing window of at least 3 years. The spectrogram features nine more or less pronounced peaks at the following RPYs: 1974, 1976, 1982, 1984, 1987, 1993, 2000/2001, 2004 and 2007. Table 1 lists the CRs that occur most frequently within the peak RPYs.
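As a minimal sketch of the two curves described above, the following Python code counts CRs per RPY and computes the deviation from the 5-year median. It is not taken from the CRExplorer; the function names are hypothetical, and the treatment of the first and last two years (using a shortened window) is an assumption made for this example.

```python
from collections import Counter
from statistics import median

def spectrogram(rpys):
    """Count CRs per referenced publication year (NCR per RPY)."""
    counts = Counter(rpys)
    years = range(min(counts), max(counts) + 1)
    return {year: counts.get(year, 0) for year in years}

def median_deviation(ncr_by_year, half_window=2):
    """Deviation of each year's NCR from the median of the surrounding years.

    half_window=2 gives the 5-year median (two previous, the current and
    two following years). Years near the edges use a shortened window
    (an assumption of this sketch).
    """
    deviation = {}
    for year, ncr in ncr_by_year.items():
        window = [ncr_by_year[y]
                  for y in range(year - half_window, year + half_window + 1)
                  if y in ncr_by_year]
        deviation[year] = ncr - median(window)
    return deviation

# Toy data: 1972 stands out against its neighbouring years.
ncr = {1970: 10, 1971: 12, 1972: 30, 1973: 11, 1974: 13}
print(median_deviation(ncr)[1972])  # 30 - median(10, 12, 30, 11, 13) = 18
```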

Table 1.

Most frequently cited references, their titles and NCR values from selected RPYs in Figure 1.

No.	RPY	Reference and title	NCR
CR1	1974	AKAIKE H, 1974, IEEE T AUTOMAT CO AC, V19, P716 A new look at the statistical model identification	688
CR2	1974	DEAN WE, 1974, J SEDIMENT PETROL, V44, P242 Determination of carbonate and organic matter in calcareous sediments and sedimentary rocks by loss on ignition: comparison with other methods	527
CR3	1974	ARAKAWA A, 1974, J ATMOS SCI, V31, P674 Interaction of a cumulus cloud ensemble with the large-scale environment, part I	493
CR4	1976	FRITTS HC, 1976, TREE RINGS CLIMATE Book title: Tree rings and climate	1515
CR5	1976	HAYS JD, 1976, SCIENCE, V194, P1121 Variations in the Earth’s orbit: pacemaker of the Ice Ages	923
CR6	1982	NORTH GR, 1982, MON WEA REV, V110, P699 Sampling errors in the estimation of empirical orthogonal functions	676
CR7	1982	RASMUSSON EM, 1982, MON WEA REV, V110, P354 Variations in tropical sea surface temperature and surface wind fields associated with the Southern Oscillation/El Nino	614
CR8	1982	POST WM, 1982, NATURE, V298, P156 Soil carbon pools and world life zones	542
CR9	1984	WIGLEY TML, 1984, J CLIM APPL METEOROL, V23, P201 On the average value of correlated time series, with applications in dendroclimatology and hydrometeorology	793
CR10	1984	IMBRIE J, 1984, MILANKOVITCH CLIMA 1, P269 The orbital theory of Pleistocene climate: support from a revised chronology of the marine d¹⁸O record	768
CR11	1987	ROPELEWSKI CF, 1987, MON WEATHER REV, V115, P1606 Global and regional scale precipitation patterns associated with the El Nino/Southern Oscillation	1243
CR12	1987	BARNSTON AG, 1987, MON WEATHER REV, V115, P1083 Classification, seasonality and persistence of low-frequency atmospheric circulation patterns	1067
CR13	1987	MARTINSON DG, 1987, QUATERNARY RES, V27, P1 Age dating and the orbital theory of the Ice Ages: development of a high-resolution 0 to 300,000 year chronostratigraphy	1047
CR14	1993	STUIVER M, 1993, RADIOCARBON, V35, P215 Extended ¹⁴C data base and revised Calib 3.0 ¹⁴C age calibration program	2332
CR15	1993	DANSGAARD W, 1993, NATURE, V364, P218 Evidence for general instability of past climate from a 250-kyr ice-core record	1872
CR16	2000	NAKICENOVIC N, 2000, SPECIAL REPORT EMISS Special report on emissions scenarios	1470
CR17	2000	GORDON C, 2000, CLIM DYNAM, V16, P147 The simulation of SST, sea ice extents and ocean heat transports in a version of the Hadley Centre coupled model without flux adjustments	1283
CR18	2001	HOUGHTON JT, 2001, CLIMATE CHANGE 2001 Climate change 2001: the scientific basis	2566
CR19	2001	ZACHOS J, 2001, SCIENCE, V292, P686 Trends, rhythms, and aberrations in global climate 65 ma to present	1779
CR20	2001	IPCC, 2001, CLIM CHANG 2001 SCI	1625
CR21	2004	THOMAS CD, 2004, NATURE, V427, P145 Extinction risk from climate change	1765
CR22	2004	REIMER PJ, 2004, RADIOCARBON, V46, P1029 Intcal04 terrestrial radiocarbon age calibration, 0-26 cal kyr bp	1225
CR23	2007	SOLOMON S, 2007, CLIM CHANG 2007, P19 IPCC Fourth Assessment Report: Climate Change 2007 (AR4), Working Group I Report ‘The Physical Science Basis’, Technical Summary	4125
CR24	2007	IPCC, 2007, CLIM CHANG 2007 PHYS IPCC Fourth Assessment Report: Climate Change 2007 (AR4), Working Group I Report ‘The Physical Science Basis’	2622
CR25	2007	MEEHL GA, 2007, CLIM CHANG 2007, P747 IPCC Fourth Assessment Report: Climate Change 2007 (AR4), Working Group I Report ‘The Physical Science Basis’, Chapter 10: Global climate projections.	1882
CR26	2007	IPCC, 2007, CLIMATE CHANGE 2007 IPCC Fourth Assessment Report: Climate Change 2007 (AR4)	1743

CR: cited reference; NCR: number of cited references; RPY: reference publication year.

The 26 CRs in Table 1 can be categorised into four different groups of climate change research papers. Nine CRs (CR4, CR5, CR9, CR10, CR13, CR14, CR15, CR19, and CR22) can be assigned to the disciplines paleoclimatology and dating techniques. The corresponding papers deal with the orbital theory of the Ice Ages, the instability of the climate of the past and dendrochronology in connection with climate change research. Six CRs (CR3, CR6, CR7, CR11, CR12, and CR17) are concerned with meteorology. The publications mainly present measured data or modelling results with regard to the atmospheric and oceanic circulation systems. These two sets of CRs are distributed more or less equally over the selected time span. Since the year 2000, however, the Intergovernmental Panel on Climate Change (IPCC) reports increasingly appear as the most frequently cited references. Seven CRs (CR16, CR18, CR20, and CR23-CR26) are part of IPCC reports, mostly related to the scientific basis of climate change and emission scenarios of greenhouse gases. Finally, there are four CRs (CR1, CR2, CR8, and CR21) that deal with various other issues in climate change research, for example, biological and statistical studies about effects from climate change.

We use the spectrogram in Figure 1 and the most frequently cited publications in Table 1 to judge the reliability of the different sampling methods’ results, which are presented in the following sections.

3.2. Using the script language for the population analysis

We employed the script language of CRExplorer to produce the results in Figure 1 and Table 1. The language can be applied instead of using the menus of the graphical user interface of CRExplorer. A separate JAR file is necessary to use the language (this file can be downloaded from http://www1.hft-leipzig.de/thor/crexplorer/CitedReferencesExplorerScript.jar). We started by analysing the CRs in all climate change papers on a machine with 512 GB of main memory (RAM, random access memory). The CRE and CSV files which are necessary for an RPYS analysis of all CRs published between 1970 and 2014 can be produced using the following CRExplorer script.

Listing 1.

CRExplorer script to analyse the CRs in the WoS file savedrecs.txt.

Listing 1 imports the WoS file with the complete climate change data. Furthermore, it identifies variants of the same CR in the data set, clusters them and merges their occurrences (NCRs) [1]. Three export files are saved in different formats.

The set function in the listing can be used to change options of the settings dialog in the CRExplorer. We set usage of two neighbouring RPYs for calculation of the median deviation in this case, that is, a 5-year median deviation. The option n_pct_range: 0 is set here and in the following scripts for purely technical reasons. This option does not change the results presented in this study.

The function importFile is needed to import WoS or Scopus files. We supply options to restrict the CRs to RPYs between 1970 and 2014 and publication years of citing publications between 1980 and 2014. The value of maxCR can be used to limit the number of imported CRs. A value of 0 means no limit. The function info prints a brief line of information to the screen.

With the function cluster, we clustered the imported CRs automatically by using a similarity threshold of 0.75 considering volume and page. The function merge merges the clustered CR variants. Consistent with Marx et al. [6], we removed all CR variants occurring fewer than 100 times with the removeCR function.
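The sequence of clustering, merging and removing can be illustrated with a small Python sketch. It is not the CRExplorer algorithm: the string similarity from Python's `difflib.SequenceMatcher` is a stand-in for the programme's own similarity measure, the greedy grouping is an assumption, and the split of the 676 citations of North (1982) into two variants is invented for illustration.

```python
from difflib import SequenceMatcher

def same_cluster(a, b, threshold=0.75):
    """Two CR variants match if volume and page agree and the strings are similar."""
    if (a["volume"], a["page"]) != (b["volume"], b["page"]):
        return False
    return SequenceMatcher(None, a["text"], b["text"]).ratio() >= threshold

def cluster_and_merge(variants, threshold=0.75):
    """Greedily group variants and sum their NCRs under the most cited variant."""
    clusters = []
    for variant in sorted(variants, key=lambda v: v["ncr"], reverse=True):
        for cluster in clusters:
            if same_cluster(cluster, variant, threshold):
                cluster["ncr"] += variant["ncr"]
                break
        else:  # no existing cluster matched: start a new one
            clusters.append(dict(variant))
    return clusters

def remove_rare(crs, min_ncr):
    """Keep only CRs that occur at least min_ncr times."""
    return [cr for cr in crs if cr["ncr"] >= min_ncr]

# Hypothetical variant counts; only the merged total of 676 is from Table 1.
variants = [
    {"text": "NORTH GR, 1982, MON WEA REV, V110, P699",
     "volume": "110", "page": "699", "ncr": 600},
    {"text": "NORTH GR, 1982, MON WEATHER REV, V110, P699",
     "volume": "110", "page": "699", "ncr": 76},
    {"text": "POST WM, 1982, NATURE, V298, P156",
     "volume": "298", "page": "156", "ncr": 50},
]
merged = cluster_and_merge(variants)
kept = remove_rare(merged, min_ncr=100)
```

In this toy example the two spellings of the journal name are merged into one CR with 676 occurrences, and the third CR is dropped because it falls below the removal threshold.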

The functions saveFile and exportFile allow us to save the results of our analysis in different formats: the CRE-internal file format, the list of CRs in CSV file format and the data to produce the RPYS graph in CSV format (see Figure 1). The latter can be used to produce RPYS graphs with the Stata command plotrpys (see https://ideas.repec.org/c/boc/bocode/s458378.html) and the R package BibPlots (see https://cran.r-project.org/web/packages/BibPlots/index.html). To cluster the CRs of RPYs 1970–2014, 35 GB RAM is needed.

3.3. Random sampling

In an attempt to cover a range from a small to a large number of samples, we performed seven different random sample RPYS analyses using 10, 50, 100, 500, 1000, 2500, and 10,000 samples with 50,000 CRs in each sample. Figure 2 shows the results of the merged samples in comparison with the population spectrogram (full RPYS). As the samples are of different size, they had to be scaled. We used f = max(NCR_{sample, RPY})/max(NCR_{full, RPY}) as a scaling factor. The samples do not fully reproduce the population spectrogram but most of the relevant peaks also occur in all of the samples. It seems that a few (10 or 50) random samples are sufficient to obtain a first impression of the RPYS.
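The scaling step can be sketched as follows. The factor f is computed as defined above; dividing the sample counts by f, so that the highest peak of the sample coincides with the highest peak of the population, is our reading of how the factor is applied and should be treated as an assumption, as should the function names and the toy numbers.

```python
def scaling_factor(sample_ncr, full_ncr):
    """f = max(NCR_sample) / max(NCR_full), taken over all RPYs."""
    return max(sample_ncr.values()) / max(full_ncr.values())

def scale_to_population(sample_ncr, full_ncr):
    """Rescale a sample spectrogram so that its maximum equals the population maximum."""
    f = scaling_factor(sample_ncr, full_ncr)
    return {year: ncr / f for year, ncr in sample_ncr.items()}

# Toy numbers, not taken from the climate change data set.
full = {2000: 200, 2007: 400}
sample = {2000: 30, 2007: 50}
print(scale_to_population(sample, full))  # {2000: 240.0, 2007: 400.0}
```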

Figure 2.

Annual distribution of random samples of the CRs throughout the period 1970–2010 which have been cited in climate change publications (published between 1980 and 2014).

The differences between the samples can be seen more clearly in Figure 3 where the difference between each sample and the RPYS with 10,000 samples is shown. The random sampling seems to converge rather slowly with the sample size, but the RPYS with 500 samples seems to be a good compromise between accuracy and computational time. Each sample needed approximately 1 min of computational time on our Intel® Xeon® E5-2640 with 2.6 GHz so that 500 samples can be calculated within a day or overnight. In total, 10,000 samples of 50,000 CRs each needed about a week on the same PC. Due to the slow convergence of the random sampling, we present the most important references under the peaks for the results from 10,000 samples in Table 2.

Table 2.

Most frequently cited references from selected RPYs with their NCR values using 10,000 random samples.

No.	RPY	Reference	NCR
CR1	1974	AKAIKE H, 1974, IEEE T AUTOMAT CO AC, V19, P716	676
CR2	1974	DEAN WE, 1974, J SEDIMENT PETROL, V44, P242	509
CR3	1974	ARAKAWA A, 1974, J ATMOS SCI, V31, P674	459
CR4	1976	FRITTS HC, 1976, TREE RINGS CLIMATE	1515
CR5	1976	HAYS JD, 1976, SCIENCE, V194, P1121	913
CR6	1982	NORTH GR, 1982, MON WEA REV, V110, P699	649
CR7	1982	RASMUSSON EM, 1982, MON WEA REV, V110, P354	606
CR8	1982	POST WM, 1982, NATURE, V298, P156	538
CR9	1984	WIGLEY TML, 1984, J CLIM APPL METEOROL, V23, P201	789
CR10	1984	IMBRIE J, 1984, MILANKOVITCH CLIMA 1, P269	768
CR11	1987	MARTINSON DG, 1987, QUATERNARY RES, V27, P1	1042
CR12	1987	ROPELEWSKI CF, 1987, MON WEATHER REV, V115, P1606	1008
CR13	1987	BARNSTON AG, 1987, MON WEATHER REV, V115, P1083	832
CR14	1993	DANSGAARD W, 1993, NATURE, V364, P218	1854
CR15	1993	STUIVER M, 1993, RADIOCARBON, V35, P215	1559
CR16	2000	NAKICENOVIC N, 2000, SPECIAL REPORT EMISS	1470
CR17	2000	GORDON C, 2000, CLIM DYNAM, V16, P147	1274
CR18	2001	HOUGHTON JT, 2001, CLIMATE CHANGE 2001	2566
CR19	2001	ZACHOS J, 2001, SCIENCE, V292, P686	1707
CR20	2001	IPCC, 2001, CLIM CHANG 2001 SCI	1625
CR21	2004	THOMAS CD, 2004, NATURE, V427, P145	1746
CR22	2004	REIMER PJ, 2004, RADIOCARBON, V46, P1029	1175
CR23	2007	SOLOMON S, 2007, CLIM CHANG 2007, P19	4125
CR24	2007	IPCC, 2007, CLIM CHANG 2007 PHYS	2622
CR25	2007	MEEHL GA, 2007, CLIM CHANG 2007, P747	1882
CR26	2007	IPCC, 2007, CLIMATE CHANGE 2007	1743

CR: cited reference; RPY: reference publication year; NCR: number of cited references.

Figure 3.

Deviation of the randomly sampled RPYS results from the RPYS based on 10,000 samples.

A comparison of the CRs in Table 2 with those in Table 1 shows that the same CRs occur as relevant peak papers in the sampling procedure as well as in the population analysis. However, the order of the peak papers is different for RPYs 1987 and 1993. In the case of nine CRs (CR4, CR10, CR16, CR18, CR20, and CR23-CR26), even the NCR value of the sampling result agrees with the result from the population RPYS analysis.

3.4. Systematic sampling

Analogously to the random sampling, we calculated seven different RPYSs of different sample sizes. The scaled sampling RPYS results and the population spectrogram are shown in Figure 4. Essentially all peaks except the small peak in 2004 are reproduced by all samples. Also in the case of the systematic sampling, a small sample size seems to be enough to resemble the most important features of the population spectrogram.

Figure 4.

Annual distribution of systematic samples of the CRs throughout the period 1970–2010 which have been cited in climate change publications (published between 1980 and 2014).

The differences between the RPYS with 10,000 samples and the RPYS results with smaller sample sizes are displayed in Figure 5. In the case of the climate change literature, the systematic sampling converges faster than the random sampling. The difference between the RPYS result of 500 samples and larger samples seems to be insignificant. However, smaller sample sizes do not seem to be sufficient to resemble the RPYS accurately.

Figure 5.

Deviation of the systematically sampled RPYS results from the RPYS based on 10,000 samples.

The most frequently occurring CRs according to 500 systematic samples for the peak RPYs are shown in Table 3.

Table 3.

Most frequently cited references from selected RPYs with their NCR values using 500 systematic samples.

No.	RPY	Reference	NCR
CR1	1974	AKAIKE H, 1974, IEEE T AUTOMAT CO AC, V19, P716	522
CR2	1974	DEAN WE, 1974, J SEDIMENT PETROL, V44, P242	482
CR3	1974	ARAKAWA A, 1974, J ATMOS SCI, V31, P674	427
CR4	1976	FRITTS HC, 1976, TREE RINGS CLIMATE	1510
CR5	1976	HAYS JD, 1976, SCIENCE, V194, P1121	896
CR6	1982	NORTH GR, 1982, MON WEA REV, V110, P699	620
CR7	1982	RASMUSSON EM, 1982, MON WEA REV, V110, P354	579
CR8	1982	POST WM, 1982, NATURE, V298, P156	510
CR9	1984	WIGLEY TML, 1984, J CLIM APPL METEOROL, V23, P201	766
CR10	1984	IMBRIE J, 1984, MILANKOVITCH CLIMA 1, P269	747
CR11	1987	MARTINSON DG, 1987, QUATERNARY RES, V27, P1	1035
CR12	1987	ROPELEWSKI CF, 1987, MON WEATHER REV, V115, P1606	1000
CR13	1987	BARNSTON AG, 1987, MON WEATHER REV, V115, P1083	816
CR14	1993	DANSGAARD W, 1993, NATURE, V364, P218	1851
CR15	1993	STUIVER M, 1993, RADIOCARBON, V35, P215	1552
CR16	2000	NAKICENOVIC N, 2000, SPECIAL REPORT EMISS	1459
CR17	2000	GORDON C, 2000, CLIM DYNAM, V16, P147	1268
CR18	2001	HOUGHTON JT, 2001, CLIMATE CHANGE 2001	2551
CR19	2001	ZACHOS J, 2001, SCIENCE, V292, P686	1696
CR20	2001	*IPCC, 2001, CLIM CHANG 2001 SCI	1619
CR21	2004	THOMAS CD, 2004, NATURE, V427, P145	1735
CR22	2004	REIMER PJ, 2004, RADIOCARBON, V46, P1029	1165
CR23	2007	SOLOMON S, 2007, CLIM CHANG 2007, P19	4109
CR24	2007	*IPCC, 2007, CLIM CHANG 2007 PHYS	2614
CR25	2007	MEEHL GA, 2007, CLIM CHANG 2007, P747	1875
CR26	2007	*IPCC, 2007, CLIMATE CHANGE 2007	1737

CR: cited reference; RPY: reference publication year; NCR: number of cited references.

A comparison of Tables 1 and 3 shows that all top papers of the population RPYS also appear as top papers in the RPYS from 500 systematic samples. Only the order of the top papers is different for RPYs 1987 and 1993. The ordering of the top papers is the same as in the population RPYS for all other RPYs. Even the NCR agrees quite well in most cases. Mainly, the reference Stuiver M, 1993, is significantly underestimated in terms of the NCRs. It seems from our results that systematic sampling with 500 samples of 50,000 CRs each can be used to approximate the population spectrogram very well.
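The size of these differences can be made explicit with a short calculation. The following Python sketch computes the relative deviation of the sample NCR from the population NCR for three references, using the values from Tables 1 and 3; the function name and the shortened reference labels are introduced only for this example.

```python
# NCR values copied from Table 1 (population) and Table 3 (500 systematic samples).
population = {"FRITTS HC, 1976": 1515, "STUIVER M, 1993": 2332, "SOLOMON S, 2007": 4125}
systematic_500 = {"FRITTS HC, 1976": 1510, "STUIVER M, 1993": 1552, "SOLOMON S, 2007": 4109}

def relative_deviation(sample_ncr, population_ncr):
    """Relative difference (sample - population) / population for each CR."""
    return {cr: (sample_ncr[cr] - population_ncr[cr]) / population_ncr[cr]
            for cr in population_ncr}

deviations = relative_deviation(systematic_500, population)
for cr, value in deviations.items():
    print(f"{cr}: {value:+.1%}")
```

For Fritts (1976) and Solomon (2007) the deviation is below 1%, whereas Stuiver (1993) is underestimated by about one third.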

3.5. Using the script language for random and systematic sampling

The script language can be extended using the Java programming language. Every user can expand the capabilities of the CRExplorer by writing such extensions. One CRExplorer extension, Loop.crs, is available at https://github.com/andreas-thor/cre/blob/master/crs/packages/Loop.crs. This extension simplifies loop programming in the CRExplorer script language. The sampling analyses were performed using the extension Loop.crs. In the example shown in Listing 2, 10 random samples of 50,000 CRs were drawn from the population of CRs. They were clustered and merged. Afterwards, CRs referenced only once were removed.

Most functions from Listing 2 were already explained in the comments regarding Listing 1. The extension Loop.crs provides the functions forEachUnion and forEach. Both functions provide loops. The number of cycles is provided as the value of count (here 10). The functions differ in their behaviour after the loops are finished; forEach performs no further action whereas forEachUnion merges the CRE files of each cycle to a final CRE data set. The parameter dir is optional. If dir is not provided, the system default temporary directory is used. If there is insufficient disc space, the CRExplorer stops with an error message. Furthermore, if dir is provided, the temporary files of each cycle are kept and can be used later on by other CRExplorer script files. The variable index is available in the loop and runs from 0 to count - 1. The importFile function contains two additional arguments compared with Listing 1. The parameter sampling can be set to “RANDOM” (as in this example) or “SYSTEMATIC”. In this way, two of the three sampling methods can be selected. The argument offset: index + 1 instructs the CRExplorer to skip the first index + 1 CRs. This is not necessary for the random sampling, but very important for the systematic sampling. The systematic sampling uses an equidistant set of CRs from the data file. Without the offset option, all samples would contain the same CRs.

Listing 2.

CRExplorer script to analyse 10 random samples of 50,000 CRs from the WoS file savedrecs.txt.

The argument maxCR: 50000 restricts the sample size to 50,000 CRs which easily fit into 1 GB RAM, although about 250,000 CRs could be imported per GB from the climate change publication set. However, merging of the samples needed more RAM depending on the number of samples. As multiple samples need more memory than single samples, we deem it appropriate to restrict the sample sizes in our study consistently to 50,000 CRs per sample.
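The interplay of the loop, the offset and the final merging step described for Listing 2 can be mimicked in Python. This is a sketch of the logic only and does not use the CRExplorer script language; `for_each_union` and `systematic_sample` are hypothetical Python functions, and merging is simplified to summing the occurrences of identical CR strings.

```python
from collections import Counter

def systematic_sample(crs, n, offset=0):
    """Pick n CRs at equal distances from the ordered list, starting at offset."""
    step = len(crs) // n
    return [crs[(offset + i * step) % len(crs)] for i in range(n)]

def for_each_union(crs, count, n):
    """Draw `count` systematic samples of size n and merge them by summing NCRs."""
    merged = Counter()
    for index in range(count):  # index runs from 0 to count - 1
        sample = systematic_sample(crs, n, offset=index + 1)
        merged.update(sample)
    return merged

crs = ["a", "b", "c", "d", "e", "f", "g", "h"]

# Without an offset, every cycle would draw the same CRs.
assert systematic_sample(crs, 2) == systematic_sample(crs, 2) == ["a", "e"]

# With offset = index + 1, each cycle starts one position further along the list.
print(for_each_union(crs, count=2, n=2))  # counts for b, f (cycle 0) and c, g (cycle 1)
```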

We conducted a series of merging tests determining the number of samples we were able to merge with a certain amount of RAM. The results are shown in Table 4. However, the number of samples and the amount of RAM should be seen as guiding values as they may differ between publication set types and sampling methods. Especially, the values obtained for the random sampling of course strongly depend on the random samples drawn.

Table 4.

Amount of RAM necessary to merge a certain number of samples with 50,000 cited references each.

Amount of RAM	Number of merged systematic samples	Number of merged random samples
1 GB	29	134
2 GB	64	450
4 GB	148	4664
6 GB	296	9521
8 GB	>500	>10,000

If the user has less than 8 GB of RAM available but would still like to merge 500 systematic samples of 50,000 CRs each, the samples can be merged in batches; for example, merging four batches of 125 samples each is possible with 4 GB RAM. However, the resulting CR variants might differ somewhat as they might be determined differently in the various merging steps. In the case of cluster sampling, 2 GB were enough to analyse the publication year 2011 and it was possible to process the publication year 2014 with 4 GB RAM.

The function removeCR in Listing 2 now contains a lower threshold than in the case of the population spectrogram. We propose to use the following rule of thumb for calculating the number of CRs to be removed

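$$\mathrm{NCR}_{\mathrm{threshold}}(\mathrm{sample}) = \mathrm{round}\left(\frac{\mathrm{NCR}_{\mathrm{threshold}}(\mathrm{full})}{\mathrm{NCR}_{\mathrm{full}} / \mathrm{NCR}_{\mathrm{sample}}}\right) \tag{1}$$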

The number of CRs of each sample (NCR_sample) and of the population (NCR_full) can be determined via the function analyzeFile. The syntax of analyzeFile is analogous to the one of the importFile function. This rule of thumb results in our current case in

$$\mathrm{NCR}_{\mathrm{threshold}}(\mathrm{sample}) = \mathrm{round}\left(\frac{100}{6{,}594{,}657 / 50{,}000}\right) \approx \mathrm{round}(0.758) = 1 \tag{2}$$
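The rule of thumb in equation (1) is easy to check numerically. The following Python function is a direct transcription of the formula; the function and parameter names are ours, and Python's built-in round is assumed to be an acceptable rounding rule here (it rounds exact halves to the nearest even integer, which does not affect this example).

```python
def sample_threshold(threshold_full, ncr_full, ncr_sample):
    """Equation (1): scale the population removal threshold to the sample size."""
    return round(threshold_full / (ncr_full / ncr_sample))

# Values from equation (2): threshold 100, 6,594,657 CRs in the population,
# 50,000 CRs per sample.
print(sample_threshold(100, 6_594_657, 50_000))  # round(0.758...) = 1
```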

3.6. Cluster sampling

For cluster sampling, the CRExplorer randomly selects 1 year from the given set of citing publication years. Then, all CRs from the papers in this year are selected and imported. As an exploration of the cluster sampling, we used the publication years 2011, 2012, 2013 and 2014 and compared the corresponding spectrograms with the population spectrogram (see Figure 6). Only the peak in 2007 is reproduced quite well by the cluster sample spectrograms in the figure. No other peak is reproduced properly. The shoulder in 2009 is much too pronounced in the cluster sampling RPYS results in comparison with the population spectrogram.

Figure 6.

Annual distribution of cluster samples of the CRs throughout the period 1970–2010 which have been cited in climate change publications (published between 2011 and 2014).

It seems from these results that the cluster sampling should not be recommended for RPYS. Future studies should explore whether the cluster sampling approach is appropriate for other publication sets. We could imagine, for instance, that this approach is feasible for research topics that emerged only a few years ago. In these cases, the CRs in the single citing years might be so uniform that the cluster sampling could work.

3.7. Using the script language for cluster sampling

The cluster sampling was performed using CRExplorer scripts like the one in Listing 3. First, the CRs of the citing year 2011 were imported into the CRExplorer. Second, the CRs were clustered using volume and page number but not DOI. Third, the equivalent CRs were merged. Finally, CRs that are referenced 15 times or less were removed from the data set, and the resulting CRE and CSV files were saved.

Listing 3.

CRExplorer script to perform a cluster sample from the WoS file savedrecs.txt.

In this study, we selected the citing publication year explicitly so that the threshold for removing CRs after merging could be set systematically according to our proposed rule of thumb (see section 3.7). Alternatively, one can use PY: [1980, 2014, false], sampling: "CLUSTER" in Listing 3 to let the CRExplorer select a citing year at random. We used removeCR(N_CR: [0, 15]) for 2011, removeCR(N_CR: [0, 18]) for 2012, removeCR(N_CR: [0, 22]) for 2013 and removeCR(N_CR: [0, 24]) for 2014 in accordance with equation (1).
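The effect of the year-specific removal step can be mimicked in Python as follows. This is a hedged sketch rather than the behaviour of the CRExplorer function itself: remove_cr is a hypothetical analogue of removeCR, the interval is assumed to be inclusive (in line with "15 times or less" for 2011), and the citation counts are invented for illustration. Only the thresholds are taken from the text.

```python
# Removal thresholds per citing year, as reported in the text.
THRESHOLDS = {2011: 15, 2012: 18, 2013: 22, 2014: 24}


def remove_cr(n_cr, low, high):
    """Drop every CR whose citation count lies in [low, high] (inclusive)."""
    return {cr: n for cr, n in n_cr.items() if not (low <= n <= high)}


# Hypothetical citation counts of merged CRs in the 2011 cluster sample.
n_cr_2011 = {"CR A": 15, "CR B": 16, "CR C": 3, "CR D": 240}

kept = remove_cr(n_cr_2011, 0, THRESHOLDS[2011])
print(kept)  # {'CR B': 16, 'CR D': 240}
```

A CR cited exactly 15 times is removed for 2011, whereas one cited 16 times is kept.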

4. Discussion and conclusion

Since the introduction of the RPYS method (and the corresponding programme CRExplorer), many studies have been published revealing the historical roots of topics, fields and researchers [4–15]. Until now, the application of the method has been restricted by the available memory of the computer used for running the CRExplorer. This meant that many users could not perform RPYS for broader research fields or topics. In this study, we present various sampling methods to solve this problem. The study therefore demonstrates the fruitfulness of the sampling approach for bibliometric studies. Some comments following the paper by Williams and Bornmann [7] had questioned the usefulness of this approach for bibliometric studies.

The statistical analysis of large data sets with the CRExplorer is becoming more prevalent now that the new programme version can import data from CrossRef (see https://www.crossref.org). The user of CrossRef gains free access to meta-data of publications that can be (1) downloaded as files and imported into the CRExplorer or (2) directly imported by using the CRExplorer search interface for CrossRef data. The search interface in particular allows fast access to comprehensive CR data from publications.

In this study, we introduce the script language of the CRExplorer, which can be used to draw many samples from the population data set (see also the handbook of the programme at www.crexplorer.net). The language can be used instead of the menus in the programme. Script languages are standard in statistical software for automating empirical analyses. Once a script has been written for a given data set, it can be reused for similar data sets. Scripts also fulfil an important function for the replicability and reproducibility of empirical studies: if the script, data set and programme for a published study are available, the results in the manuscript can be reproduced (and possible errors identified). Although replicability and reproducibility are essential components of the open science movement [16], scripts are scarcely available for popular bibliometric software, such as VOSviewer or CitNetExplorer. The script language of the CRExplorer improves reproducibility and replicability of RPYS analyses. Unfortunately, only a minority of journals explicitly welcome replication studies, according to Yeung [17]. Furthermore, the script language makes it feasible to perform the same (or an analogous) action many times (e.g. drawing 10,000 samples).

Based on a large data set of publications from climate change research, we compare RPYS results using population data with RPYS results using sampling data. We show RPYS results for three different sampling techniques: random sampling, systematic sampling and cluster sampling. From our comparison with the full RPYS (population spectrogram), we conclude that cluster sampling performs worst and systematic sampling performs best. Random sampling also performs very well but not as well as systematic sampling. Merging 500 systematic samples of 50,000 CRs each reproduces the population RPYS rather accurately, and the same peak CRs are found in the sampled spectrogram as in the population spectrogram. Merging 10,000 random samples also results in the same peak CRs as obtained from the population RPYS results.

It is unknown whether our findings can be transferred to research fields other than climate change. Studying different publication sets might make it necessary to increase the sample sizes or the number of samples drawn, or it might be possible to obtain good RPYS results with smaller sample sizes or fewer samples. We would like to encourage other studies to check which sample sizes and numbers of samples are needed to approximate the population spectrogram accurately enough.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Robin Haunschild

Andreas Thor

References

1. Thor, Marx, Leydesdorff, et al. Introducing CitedReferencesExplorer (CRExplorer): a program for reference publication year spectroscopy with cited references standardization. J Informetr 2016; 10: 503–515.

2. Rhaiem, Bornmann. Reference publication year spectroscopy (RPYS) with publications in the area of academic efficiency studies: what are the historical roots of this research topic? Appl Econ 2018; 50: 1442–1453.

3. Yeung AWK. Identification of seminal works that built the foundation for functional magnetic resonance imaging studies of taste and food. Curr Sci 2017; 113: 1225–1227.

4. Marx, Bornmann, Barth, et al. Detecting the historical roots of research fields by reference publication year spectroscopy (RPYS). J Assoc Inf Sci Tech 2014; 65: 751–764.

5. Haunschild, Bornmann, Marx. Climate change research in view of bibliometrics. PLoS ONE 2016; 11: 19.

6. Marx, Haunschild, Thor, et al. Which early works are cited most frequently in climate change research literature? A bibliometric approach based on reference publication year spectroscopy. Scientometrics 2017; 110: 335–353.

7. Williams, Bornmann. Sampling issues in bibliometric analysis. J Informetr 2016; 10: 1253–1257.

8. Wacholder. Interactive query formulation. Annu Rev Inform Sci Technol 2011; 45: 157–196.

9. Levy, Lemeshow. Sampling of populations: methods and applications. Hoboken, NJ: Wiley, 2008.

10. Bornmann, Mutz. Growth rates of modern science: a bibliometric analysis based on the number of publications and cited references. J Assoc Inf Sci Tech 2015; 66: 2215–2222.

11. Barth, Marx, Bornmann, et al. On the origins and the historical roots of the Higgs Boson research from a bibliometric perspective. Eur Phys J Plus 2014; 129: 1–13.

12. Leydesdorff, Bornmann, Marx, et al. Referenced publication years spectroscopy applied to iMetrics: scientometrics, Journal of Informetrics, and a relevant subset of JASIST. J Informetr 2014; 8: 162–174.

13. Leydesdorff, Bornmann, Comins, et al. Referenced publication year spectrography (RPYS) and algorithmic historiography: the bibliometric reconstruction of András Schubert’s Œuvre. In: Glänzel, Schlemmer (eds). András Schubert – a world of models and metrics. Festschrift for András Schubert’s 70th birthday. Berlin: International Society for Scientometrics and Informetrics, 2016, pp. 79–96.

14. Wray, Bornmann. Philosophy of science viewed through the lens of ‘referenced publication years spectroscopy’ (RPYS). Scientometrics 2014; 102: 1987–1996.

15. Marx, Haunschild, French, et al. Slow reception and under-citedness in climate change research: a case study of Charles David Keeling, discoverer of the risk of global warming. Scientometrics 2017; 112: 1079–1092.

16. Cumming, Calin-Jageman. Introduction to the new statistics: estimation, open science, and beyond. Didcot: Taylor & Francis, 2016.

17. Yeung AWK. Do neuroscience journals accept replications? A survey of literature. Front Hum Neurosci 2017; 11: 468.