Abstract
This study develops a combined indicator to evaluate the performance of different search engines. Documentary analysis, survey, and evaluative methods were employed. The research was conducted in two stages. First, a combined indicator was designed to measure search engines. To this end, 72 criteria for measuring the performance of search engines were identified, out of which 22 criteria were selected. From these, 10 criteria in six general classes were chosen through a survey of subject matter experts. The proposed combined indicator was validated by the Delphi method, using the opinions of experts in the fields of information science and information systems. Second, web search engines were evaluated based on the proposed combined indicator. The statistical population of this part of the research consisted of two categories: (1) general web search engines, and (2) general subjects. The sample for the first category comprised four search engines (Yahoo, Google, DuckDuckGo, and Bing), and the second category involved 40 search terms under 10 general categories. The results showed that the combined indicator had six general criteria: (1) relevance, (2) ranking, (3) novelty ratio, (4) coverage ratio, (5) ratio of unrelated documents, and (6) proportion of duplicate hits. According to this indicator, Google is at the top, followed by Bing. This study proposes a new indicator for evaluating search engine performance, which can measure the efficiency of search engines; its use is therefore recommended to researchers and search engine developers.
Introduction
Since there are many general search engines with different indexing and ranking algorithms, and hence different search results, it is necessary to evaluate their performance to determine which one is best (Ali and Beg, 2011: 836), especially as users are very sensitive to the time and effort they spend searching (Silverstein et al., 1998).
Evaluation is the process of measuring the efficiency of a system. In particular, evaluation determines which goals and objectives of the system have been achieved (Ali and Beg, 2011). Evaluation means judging value or worthiness. When evaluating an information retrieval system, both subjects and performance are considered and assessed. Evaluation is essential in solving information systems problems (Pao, 1989). Saracevic (1995) posited that evaluation essentially affects the study, development, and use of information retrieval systems. Baeza-Yates and Ribeiro-Neto (1999) described two overall dimensions of “functional analysis” and “system performance evaluation.” In studies like Hjørland (2010), Huang and Soergel (2013), Saracevic (2015), and Zeynali-Tazehkandi and Nowkarizi (2020), the methods for assessing information retrieval efficiency are: “(1) system-oriented approach, (2) user-oriented approach, and (3) hybrid-oriented approach (system-user-oriented approach).”
A search engine is an information retrieval system. In recent years, search engine evaluation has received much attention and has been implemented using different approaches. According to Lewandowski and Hochstotter (2008), search engine quality assessment has four components: “index quality, results quality, search features quality, and search engine usability.”
How can search engine evaluation be performed, and what metrics and approaches might be used? It can be done through manual and automatic evaluation approaches. Azimzadeh et al. (2016) posited that “evaluation of search engines can be done in two different ways; either manually using human arbitrators or automatically using automatic machinery approaches which do not use human arbitrators and their judgments.” Ali and Beg (2011) offered the categories of “testimonials” and “shootouts” to evaluate search engines.
Besides, based on the literature, search engine performance has been evaluated by over 72 criteria/metrics (Appendix 2) as well as various frameworks (e.g. the Cranfield model and AWSEEM (Can et al., 2004)). Moreover, evaluating search engine performance involves problems such as uncertainty in choosing appropriate evaluation criteria, consideration of only one or a few specific aspects of performance, and inconsistencies across evaluations. When it comes to evaluating search engine performance, the question is: “which metrics should be used?” Because no single criterion is comprehensive enough to evaluate search engines thoroughly, a set of performance criteria should be considered. It is necessary to create a set of standard tools for evaluating web search engines, so that in the future a better comparison between search engines can be made and changes in the performance of each particular search engine can be tracked over time (Oppenheim et al., 2000: 190). Therefore, in this study, we first introduce a new search engine performance index, a combined indicator, and then use this index to evaluate the performance of four search engines: Google, Yahoo, DuckDuckGo, and Bing.
Related research
The first major attempt to evaluate and examine search engines was made by Cyril Cleverdon, one of the leading researchers in the field of information retrieval, who conducted two initial series of experiments in 1957, the results of which were published in 1962 (Pao, 1989). The Cranfield model represents a standard for evaluating the efficiency and performance of information retrieval systems, using six criteria: (1) coverage, (2) time delay, (3) recall, (4) precision, (5) method of presenting to the user, and (6) user effort to search. Of these six criteria, recall and precision are the most common in evaluating information retrieval systems (Cleverdon and Keen, 1966).
Most studies on search engine performance evaluation (Chu and Rosenthal, 1996; Clarke and Willett, 1997; Ding and Marchionini, 1996; Gauch and Wang, 1996; Tomaiuolo and Packer, 1996) are based on some relevance concepts and are referred to as Cranfield designs (Harter and Hert, 1997).
In general, studies in the field of search engine performance evaluation can be divided into seven categories as follows.
Ali and Beg (2011) classified the different methods for the evaluation of Web search systems into eight categories: (1) relevance-based evaluation, (2) ranking-based evaluation, (3) user satisfaction-based evaluation, (4) size/coverage of the Web-based evaluation, (5) dynamics of search results-based evaluation, (6) few relevant/known item-based evaluation, (7) specific topic/domain-based evaluation, and (8) automatic evaluation. Azimzadeh et al. (2016) investigated manual and automatic approaches to search engine evaluation, suggesting “a framework for selecting the best pertinent method for each evaluation” (p. 78). Sanderson (2020) revisited a body of research on measuring search engine performance from the 1990s that has been largely neglected by subsequent researchers and research communities.
Between 2000 and 2009, search engines were examined using a variety of metrics/indicators. Sroka (2000) evaluated five search engines based on the criteria of precision, the overlap of retrieved documents, and response time; the results showed that Polski Infoseek and Onet.pl performed better on the precision criterion. Hawking et al. (2001) evaluated 20 search engines based on several measures such as TSAP, MRR1, P@1, P@5, P@1–5, and coverage. Boudry (2002) examined eight search engines (Bioview, Scirus, Search4-science, Altavista, Google, Copernic, Infomine, and Open Directory Project) based on three criteria (precision, relative coverage, and proportion of dead/out-of-date links) and found that “the use of non-specialized search engines and meta-engines seems preferable over specific search engines in the field of biology” (p. 1112). Goh and Ang (2002) evaluated the retrieval effectiveness of Google and Overture based on precision and the distribution of relevant documents over the number of documents retrieved; the results showed that Google performed better in both respects. Muh-Chyun and Ying (2003) investigated the applicability of three user-effort-sensitive evaluation measures (first 20 full precision, search length, and rank correlation) on the Google, AltaVista, Excite, and Metacrawler search engines; the results showed greater consistency between first 20 full precision and search length. Meng et al. (2007) evaluated the performance of the new Microsoft search engine (MSE) based on the average user response time, the average processing time for a query reported by MSE itself, and the number of pages relevant to a query, finding that “the MSE performs well in speed and diversity of the query results, while weaker in other statistics” (p. 17). Sakai (2007) compared “14 information retrieval metrics based on graded relevance, together with 10 traditional metrics based on binary relevance” (p. 532). Lu et al. (2007) assessed Yahoo, Copernic, Archivarius, Google, and Windows based on recall and precision averages, document-level precision and recall, and exact precision and recall. Deka and Lahkar (2010) evaluated and compared Google, Yahoo, Live, Ask, and AOL based on coverage, relevance, duplicate links, and bad links; the overall results revealed the superiority of Google over the other four search engines.
Since 2010, in step with the growing volume of research on search engine evaluation, novel metrics/indicators have been exploited to assess search engines. Zhuhadar and Nasraoui (2010) “investigated the efficiency of ranking the documents using precision and the usability of the visual search engine” (p. 1). Foo (2011) studied the retrieval efficiency of English-Chinese (EC) “cross-language information retrieval” using recall and precision for four search engines, indicating Google’s supremacy over Yahoo. Bilal and Ellis (2011) explored “Google, Yahoo!, Bing, Yahoo Kids!, and Ask Kids” using overlap across engines and relevance rankings. Teixeira Lopes and Ribeiro (2011) comparatively evaluated general search engines (such as Bing, Google, Sapo, and Yahoo) and health-specific search engines (such as MedlinePlus, SapoSaude, and WebMD) in health information retrieval using six different measures (graded average precision (GAP), average precision (AP), gap@5, gap@10, ap@5, and ap@10), finding that general search engines surpass the precision of health-specific engines. Zhang et al. (2013) examined the efficiency of “Google, Google China, and Baidu” by searching title, basics, exact phrase, PDF, and URL, indicating the supremacy of Google over the other engines. Ali and Gul (2016) examined the “relative recall” and “precision” of Yahoo and Google via navigational, informational, and transactional queries, showing Google’s supremacy in these two criteria. Balabantaray (2017) also reported the supremacy of Google over Ask, Yahoo, AOL, and Bing in terms of results ranking and its characteristics. Hussain et al. (2019) evaluated the retrieval effectiveness of Google Images, Yahoo Image Search, and Picsearch in terms of their image retrieval capability based on precision and relative recall ratios, finding that Yahoo Image Search performs better than the other two on both measures. CheshmehSohrabi and Abassi-Dashtaki (2019) evaluated the performance of the Ask search engine on three types of queries (keyword, phrase, and question) using recall and precision measures; the results indicated averages of less than 50% for both recall and precision. Gul et al. (2020) assessed the retrieval performance of Google, Yahoo, and Bing using precision and relative recall measures in the fields of life science and biomedicine; the results showed the superiority of Google in both measures. Using various query set classes, Jatwani et al. (2020) measured Google, DuckDuckGo, and Bing in terms of “cumulative gain, discounted cumulative gain, ideal discounted cumulative gain, and normalized discounted cumulative gain” and reported that DuckDuckGo outperformed the others in understanding human intention and retrieving more related results. Zeynali-Tazehkandi and Nowkarizi (2021) evaluated the effectiveness of Google, Parsijoo, Rismoon, and Yooz based on precision, recall, and NDCG measures; the results showed that these four search engines differ significantly on these three metrics. CheshmehSohrabi and Sadati (2022) evaluated the performance of four general image search engines (Google, Yahoo, DuckDuckGo, and Bing) and three specialized image search engines (Flickr, PicSearch, and GettyImages) in image retrieval; the results showed that the general image search engines have higher average recall and precision.
Bokhari et al. (2021) assessed the retrieval performance of Google, Bing, and Newslookup using the vector space, Okapi BM25, and latent semantic indexing models, indicating Google’s supremacy over the other systems. Soni and Roberts (2021) reported that TREC-COVID had better retrieval efficiency than the commercial deep-learning-based Google and Amazon systems on the metrics of “P@5, P@10, NDCG@10, MAP, NDCG, and bpref.”
Table 1 shows the categorization of research conducted in the field of search engine performance evaluation.
Distribution of research conducted in the field of search engine performance evaluation according to the proposed classifications.
A review of the literature shows that after the Cranfield model in the 1960s, researchers like Van Rijsbergen (1979), Leighton and Srivastava (1999), Gwizdka and Chignell (1999), and Järvelin and Kekäläinen (2002) sought to address the shortcomings of earlier criteria and proposed new ones; however, these criteria were in turn criticized and replaced by others. Some researchers (such as Sakai, 2004) even demonstrated the disadvantages of their own criteria and offered alternative, more comprehensive criteria, and some research continues to focus on a specific criterion, such as search engine stability (e.g. Bar-Ilan, 1998).
The trend of research in the field of search engine performance evaluation leans toward the development of criteria that consider only certain aspects of search engines. In other words, an examination of the new criteria presented by researchers shows that each of them evaluates search engines from a specific aspect. Accordingly, the existing formulas in the field of relevance have some shortcomings. These formulas address only one aspect of performance evaluation (relevance). Besides, some of these formulas are complex, and some are used only in specific situations, such as ambiguous and diverse searches (e.g. Agrawal et al., 2009; Sakai et al., 2010) or evaluations with incomplete relevance judgments (e.g. Sakai and Kando, 2008). In this context, Oppenheim et al. (2000), Chu and Rosenthal (1996), Su (2003), and Al-Maskari et al. (2007) proposed more comprehensive studies and more complete criteria. Oppenheim et al. (2000) recognized that much of the research is methodologically inconsistent and that there is an urgent need for a series of criteria to evaluate search engines; accordingly, they proposed some criteria.
Research methodology
Research method
The present study seeks to address the lack of comprehensiveness of criteria in search engine evaluation. Documentary, Delphi (survey), and evaluative methods were used. In the parts of the research in which the criteria for evaluating the performance of search engines were identified from the literature, documentary analysis was used and the studies were conducted qualitatively. Then, the survey method was used to investigate the views of subject matter experts on the identified criteria, as well as to prioritize and validate these criteria. Finally, a quantitative evaluation research method was used to evaluate the performance of the search engines under study.
Statistical population and sample
The statistical population of this research had two categories: (1) general web search engines, and (2) general subjects. The sampling method was purposive for both populations. To select the search engines, we referred to sites such as www.netmarketshare.com, www.searchenginejournal.com, www.gs.statcounter.com, and www.ebizmba.com, which report usage and popularity statistics for search engines, and we used the list of top search engines introduced by these sites. Search engines native to a particular country (such as Baidu, which is native to China) were then removed from the list. A survey of the site statistics showed that among the general search engines, the highest rates of usage belonged to Google (70.60%), Bing (13.02%), Yahoo (2.30%), and DuckDuckGo (0.44%), and these four search engines were therefore selected.
The search terms were selected through a three-step process as follows: (1) using http://Schema.org ontology, (2) using schema.org ontology and MeSH (Medical Subject Headings) and LCSH (Library of Congress Subject Headings), and (3) final selection of search terms by the expert team.
In the first step, since the search engines are among the public Internet search tools, in choosing search terms, general words and phrases were used. To this end, 10 general subject categories based on the ontology of schema.org were first considered: (1) products and things, (2) persons, (3) organizations, (4) medical entities, (5) geographical places, (6) events and incidents, (7) intangible works or things, (8) creative works, (9) action and practice (procedures and processes), and (10) other items (Schema.org Community Group, 2022).
In the second step, after determining the 10 general subject categories, subcategories were determined for each category using the hierarchies of these 10 categories and the two authoritative sources MeSH and LCSH. The number of subcategories considered for the first through tenth categories was 6, 5, 2, 6, 5, 2, 1, 3, 9, and 1, respectively, amounting to a total of 40 subcategories (Appendix 1).
In the final step, for each of these 40 cases, two to three search terms with their synonyms were identified and given to a three-member team of information retrieval specialists to select the most suitable one. As a result, one search term was selected for each subcategory (Appendix 1). According to the literature review that Zeynali-Tazehkandi and Nowkarizi (2021) carried out on the number of queries or simulated work tasks, 40 search terms appear reasonable.
In order to score the relevance of the records, the relevance judgment criterion was determined. The searches were conducted by one of the researchers and the first 20 records were saved in an Excel file by each search engine. Matching the search results with the search terms based on the determined judgment criteria was done by both researchers, and in cases of ambiguity, we received help from a subject expert related to that field.
After recording the retrieved results for each search term in each search engine, calculations were made based on the combined indicator criteria. First, the scoring methods related to relevance in previous studies such as Chu and Rosenthal (1996), Oppenheim et al. (2000), Leighton and Srivastava (1999), and CheshmehSohrabi and Sadati (2022) were reviewed. Then, a four-level fuzzy spectrum from 0 to 3 was adopted (a score of 3 for fully relevant records, 2 for records that are somewhat relevant to the user’s needs, 1 for slightly relevant records, and zero for unrelated records). To calculate the criteria for each search engine, an Excel file was created with one sheet per criterion, containing the 40 queries and the first 20 retrieved records for each. Finally, the scores of the criteria and the calculations of formulas such as AP, NDCG, BPREF, and ERR were obtained in Excel.
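As an illustration of how such formulas can be computed outside a spreadsheet, the following sketch calculates NDCG and ERR in Python from the graded 0–3 scores assigned to the first 20 hits of a query. It uses the standard textbook formulations of the two measures, which may differ in detail from the exact variants applied in this study, and the graded scores in the example are invented for illustration only.

```python
import math

def dcg(grades):
    """Discounted cumulative gain of a ranked list of graded scores (0-3)."""
    return sum((2 ** g - 1) / math.log2(rank + 2)   # rank 0 -> log2(2) = 1
               for rank, g in enumerate(grades))

def ndcg(grades):
    """NDCG: actual DCG divided by the DCG of the ideal (descending) ranking."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

def err(grades, max_grade=3):
    """Expected reciprocal rank for graded scores."""
    p_not_stopped, score = 1.0, 0.0
    for rank, g in enumerate(grades, start=1):
        p_rel = (2 ** g - 1) / 2 ** max_grade   # chance the user is satisfied here
        score += p_not_stopped * p_rel / rank
        p_not_stopped *= 1 - p_rel
    return score

# Invented graded scores (0-3) for the first 20 hits of one query
grades = [3, 2, 0, 1, 3, 0, 0, 2, 1, 0, 0, 0, 3, 0, 0, 1, 0, 0, 0, 2]
print(round(ndcg(grades), 3), round(err(grades), 3))
```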
The new search engine performance indicator
In this study, a total of 125 documents related to search engine performance evaluation were examined. After studying these sources, 72 criteria for evaluating the performance of search engines were identified (Appendix 2). The existing criteria and their strengths and weaknesses were then reviewed, and 22 criteria were selected (Appendix 3). In selecting the criteria, we considered the important aspects of search engine performance as well as the basic needs of users in data retrieval, because suitable criteria for evaluating search engine performance are those in which the user’s behavior is acceptably modeled. These 22 criteria were divided into nine general categories. Then, through a survey of subject matter experts, agreement was reached on six general categories of criteria, and sub-criteria were also determined for the two general criteria of relevance and ranking. A total of 10 criteria were selected in the form of six general categories. The proposed combined indicator was validated by the Delphi method, using the opinions of experts in the fields of information retrieval and information systems. In the second round of the Delphi, the criteria were agreed upon, with a high Kendall score (0.882), and then prioritized. Of the initial nine general categories ((1) relevance, (2) ranking, (3) recall, (4) coverage, (5) stability, (6) ratio of unrelated documents, (7) information novelty ratio (newness and originality), (8) ratio of dead (inactive) links, and (9) proportion of duplicate hits), the two criteria of stability and ratio of dead links were removed from the list of general criteria because of their low expert scores. Based on expert opinion, the criterion of recall in information systems was subsumed under the general criterion of relevance. The six general categories representing the essential needs and expectations of users of data retrieval performance in a search engine are: (1) relevance, (2) ranking, (3) information novelty ratio, (4) coverage, (5) ratio of unrelated documents, and (6) proportion of duplicate hits. The relevance and ranking criteria include subgroups as defined in Table 2. Table 3 describes each of the combined indicator criteria.
Selected criteria for search engine evaluation.
Description of selected criteria.
Using the Delphi results, weights were assigned to the different criteria based on their priority, and the combined indicator was then formulated. Weighting was performed numerically, and criteria with higher importance were given a higher weight. Besides, criteria reflecting positive aspects of search engine performance were assigned positive scores, while criteria reflecting negative aspects were assigned negative scores. The designed coefficients and formulas were reported to the experts, and the majority agreed with them. Thus, the proposed combined indicator was validated through the Delphi method.
In total, the proposed combined indicator consists of 10 criteria, namely (1) distance precision, (2) MAP, (3) recall, (4) NDCG, (5) ERR, (6) BPREF, (7) novelty ratio, (8) coverage ratio, (9) ratio of unrelated documents, and (10) proportion of duplicate hits, grouped into six categories, namely (1) relevance, (2) ranking, (3) information novelty ratio, (4) coverage, (5) ratio of unrelated documents, and (6) proportion of duplicate hits. The relevance and ranking criteria are divided into several sub-criteria because no single sub-criterion can show the difference between two search engines, as each sub-criterion captures only some aspects of evaluation. Although these 10 criteria had previously been introduced by researchers and were used individually or in combination in earlier studies to evaluate the effectiveness of information retrieval systems (examples of which were given in the introduction), a combined indicator with these criteria and features has not been introduced before.
The combined indicator’s criteria and sub-criteria are then weighted numerically according to their priority, and from these weights the combined indicator’s formula is formed. More weight is assigned to the criteria that are used more frequently and that are more important, from the users’ standpoint, in evaluating retrieval efficiency. Besides, a positive score is given to criteria that reflect the positive side of search engine efficiency, and a negative score to those that reflect the negative side.
The relevance and ranking criteria have been assigned a higher coefficient than novelty and coverage to avoid distortions in the evaluation results: if all criteria were weighted equally, a search engine with greater coverage but lower relevance than another engine could appear equal to it in efficiency, given the sameness of the other criteria.
After a user searches the Web, the search engine undertakes two tasks to identify relevant documents in its database: (1) providing query-related results, and (2) ranking the results and placing the most relevant documents at the top of the list. Accordingly, the relevance and ranking criteria were given a higher coefficient (a value of 2) due to their importance. If numerous relevant documents are retrieved without the most relevant being placed at the top of the list, the system’s performance is weak. Likewise, if many unrelated documents are included but the ranking is good, the search engine’s performance is inadequate. Consequently, relevance and ranking are two key parameters of roughly equal importance. Besides, a coefficient of 1 is assigned to the novelty and coverage criteria, and the ratios of unrelated documents and duplicate hits enter as negative terms. Using these coefficients in the combined indicator, search engine information retrieval efficiency can be calculated as follows:
This metric integrates the following search engine evaluation criteria: “(1) Distance precision, (2) MAP, (3) Recall, (4) NDCG, (5) ERR, (6) BPREF, (7) Novelty Ratio, (8) Coverage ratio, (9) Ratio of unrelated documents, and (10) Proportion of duplicate hits.” The combined indicator can be written as follows:
CI(j) = [2 × Re(j) + 2 × Ra(j) + N(j) + C(j) − UD(j) − DH(j)] / 6
where:
CI(j) = the combined indicator score of search engine j, which lies between zero and one;
Re(j) = the average relevance score of search engine j (the average of its distance precision, MAP, and recall scores over all queries);
Ra(j) = the average ranking score of search engine j (the average of its NDCG, ERR, and BPREF scores over all queries);
N(j) = the average information novelty ratio of search engine j;
C(j) = the average coverage ratio of search engine j;
UD(j) = the average ratio of unrelated documents in search engine j;
DH(j) = the average proportion of duplicate hits in search engine j.
The denominator is 6 because the coefficients of relevance, ranking, novelty, and coverage are 2, 2, 1, and 1, respectively, summing to 6. Consider an ideal search engine, for which the combined indicator should be 1: if the relevance, ranking, coverage, and novelty scores are each equal to 1, the weighted total is 6, and since an ideal search tool is expected to have a value of 0 for the ratios of duplicate records and unrelated documents, nothing is deducted from this total. Dividing by 6 then yields 1. The number obtained from the fraction represents the combined indicator value, and the closer it is to 1, the higher the engine's efficiency.
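For readers who wish to apply the indicator, the minimal sketch below computes CI(j) under the coefficients described above (2 for relevance and ranking, 1 for novelty and coverage, minus the two negative ratios, all divided by 6). The per-criterion averages passed to the function are invented figures used only to show the calculation, not values reported in this study.

```python
def combined_indicator(relevance, ranking, novelty, coverage,
                       unrelated_ratio, duplicate_ratio):
    """Combined indicator CI(j) for one search engine.

    All inputs are averages between 0 and 1: relevance is the mean of the
    distance precision, MAP, and recall scores, and ranking is the mean of
    the NDCG, ERR, and BPREF scores, each averaged over all queries.
    """
    return (2 * relevance + 2 * ranking + novelty + coverage
            - unrelated_ratio - duplicate_ratio) / 6

# Invented averages for a hypothetical engine
print(round(combined_indicator(0.70, 0.65, 0.40, 0.55, 0.20, 0.05), 3))
```

An ideal engine (all positive criteria equal to 1 and both negative ratios equal to 0) scores exactly 1, and higher values of the negative ratios pull the score down, as intended.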
This indicator has been built upon the previous formulas and metrics such as recall and precision (Buckland and Gey, 1994; Lancaster, 1978, 1979), novelty (Bharat and Broder, 1998), coverage (Henzinger et al., 2000), and duplicate hits (Gwizdka and Chignell, 1999), and the article of Marek and Pawlak (1976).
Working method
In this section, the Google, Yahoo, DuckDuckGo, and Bing search engines were evaluated based on the designed combined indicator. First, the 40 terms were searched in each engine. For each search term, the status of the first 20 documents retrieved was recorded in tables: graded (distance) relevance scores, binary relevance scores, new information for the user, total number of related documents, total number of unrelated documents, total number of new documents, total number of retrieved documents, total number of related documents retrieved with all synonyms after removing duplicates, number of dead (inactive) links, and number of duplicate links. Then, based on the data in these tables, the formulas for the evaluation criteria were calculated.
Results
The performance of the search engines based on the designed combined indicator
Tables 4 to 7 report the average scores of the combined indicator criteria for each of the four search engines:
Average scores of combined indicator criteria in Google (appendix 4).
Average scores of combined indicator criteria in Yahoo.
Average scores of combined indicator criteria in DuckDuckGo.
Average scores of combined indicator criteria in Bing.
The scores of the four search engines Google, Yahoo, DuckDuckGo, and Bing based on the combined indicator are listed in Table 8:
Search engine scores by combined indicator.
According to the data in Table 8, the Google search engine has the best performance (0.667) among the surveyed engines based on the combined indicator, followed by Bing (0.592), DuckDuckGo (0.588), and Yahoo (0.56). Therefore, the highest combined indicator value belongs to Google and the lowest to Yahoo.
Comparison of the four search engines based on combined indicator criteria scores
In this section, the scores of each of the combined indicator criteria are compared across the four search engines.
According to Figure 1, Google has the highest relevance among the search engines, followed, with very little difference, by Bing, DuckDuckGo, and Yahoo. In terms of ranking, Google ranks highest, followed by DuckDuckGo, Yahoo, and Bing with slight differences between them. In terms of novelty, the order is DuckDuckGo, Bing, Google, and Yahoo; the highest value therefore belongs to DuckDuckGo and the lowest to Yahoo. Maximum coverage belongs to Google, followed by Bing, DuckDuckGo, and Yahoo. The highest ratio of unrelated documents belongs to Yahoo and the lowest to Google. The highest proportion of duplicate hits likewise belongs to Yahoo and the lowest to Google, the order being Yahoo, Bing, DuckDuckGo, and Google. Thus, relevance, ranking, and coverage are higher in Google than in the other engines, and Google also has fewer unrelated and duplicate hits than the other three engines. Overall, Google performs well in terms of relevance, ranking, coverage, ratio of unrelated documents, and proportion of duplicate hits; however, in terms of novelty, its performance is poor compared to the other search engines.

Comparison of the four search engines based on combined indicator scores.
In this section, in each search engine, the scores of each of the combined indicator criteria are compared. It is also specified in which criteria each search engine has obtained higher scores.
The values indicate that, in the Google search engine, the order of the criteria from maximum to minimum is relevance, ranking, novelty, coverage, ratio of unrelated documents, and proportion of duplicate hits. Thus, relevance and ranking have the highest values of the combined indicator in Google.
In the Yahoo, DuckDuckGo, and Bing search engines, the highest value relates to relevance and the lowest to duplicate hits. The order of the criteria is similar in Google and Yahoo: the highest value relates to the relevance criterion, followed by ranking, and the lowest to the ratio of unrelated documents and the proportion of duplicate hits. The order of the criteria is the same in DuckDuckGo and Bing: the highest values relate to relevance, novelty, and ranking, followed by coverage, the ratio of unrelated documents, and the proportion of duplicate hits. This is consistent with the desired performance characteristics of a search engine, because the lower the rate of unrelated and duplicate documents and the higher the quality of ranking and the relevance of the retrieved results, the better the performance of that search engine.
Discussion and conclusion
To select a search engine with better performance in data retrieval, it is essential to evaluate the performance of search engines. There are many criteria for evaluating search engines, but none of them alone indicates search engine performance. Evaluation criteria should serve to optimize information retrieval systems and should reflect the essential needs of users. Evaluations based solely on precision and recall are incomplete and do not cover all aspects of what a user expects from a search engine; a set of criteria must therefore be considered. Some evaluation criteria are designed for specific situations; for example, some criteria are used where there is ambiguity in the search term. In designing the combined indicator, an attempt was made to select criteria that are sufficiently comprehensive and exclusive in meeting the user's needs. Accordingly, the proposed combined indicator consists of the six general criteria of relevance, ranking, novelty, coverage, proportion of duplicate hits, and ratio of unrelated documents. The relevance and ranking criteria each have three sub-criteria. The six general criteria reflect the basic expectations and needs of users of information retrieval in a search engine. The reason for designing sub-criteria to measure relevance and ranking is that relevance and ranking cannot be measured by a single criterion: each criterion has pros and cons, and in some cases cannot reflect the performance of an engine well. The MAP criterion, by considering the rankings of related documents, does not suffer from the disadvantages of the conventional precision criterion; however, it is calculated on a binary scale, cannot show the difference between the performance of two search engines when the number of related documents is high, and cannot separate completely related documents from slightly related ones. Distance precision does not have this problem, as it is calculated on a graded (distance) scale. However, neither of these two criteria alone can be used to evaluate the performance of search engines, and neither alone can always show the difference in retrieval performance across engines. The combined use of both criteria can eliminate their shortcomings. For more clarity, some explanations are provided below.
In some cases, distance precision alone cannot tell the difference between the performance of two search engines. The following example illustrates this issue (Table 9; Sirotkin, 2013).
Relevance scores of two search engines.
In this example, engines A and B each retrieve 3 documents with a relevance score of 3, 2 documents with a score of 2, 2 documents with a score of 1, and 3 documents with a score of zero; that is, the number of documents with the same relevance score is equal in the two engines, so the distance precision of engines A and B is equal (distance precision = 0.5), but their relevance performance is different. In engine A, the first five documents are relevant, while in engine B, only the first three documents are relevant. In engine A, of the first five relevant documents, two have a score of 3, two have a score of 2, and one has a score of 1, while in engine B, of the first three documents, one has a score of 3, one has a score of 2, and one has a score of 1. Therefore, the performance of engines A and B in terms of relevance is not the same: if we calculate the AP of the two engines, the AP of engine A is 0.9, while that of engine B is 0.7.
Thus, distance precision cannot show the difference between engines A and B (it equals 0.5 for both), whereas the non-interpolated average precision does show the difference in the performance of the two engines.
The following example is the opposite of the previous one: here, the AP value cannot distinguish the performance of the two engines, whereas distance precision shows this difference (Table 10; Sirotkin, 2013):
Relevance scores of search engines A and B.
In search engine A, the first ten documents are relevant, so AP = 1. In search engine B, the first three documents are relevant and the other seven are unrelated, in which case AP is also 1. However, engines A and B show different performance when measured with distance precision: the distance precision of engine A is 0.76, while that of engine B is 0.2.
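Under the definitions implied by these two examples (distance precision as the sum of graded scores divided by the maximum attainable sum, and AP as non-interpolated average precision over binarized judgments), the two measures can be sketched as follows; the ranked lists are hypothetical stand-ins for engines A and B, chosen so that the distance precisions coincide while the APs differ.

```python
def distance_precision(grades, max_grade=3):
    """Sum of graded scores divided by the maximum attainable sum."""
    return sum(grades) / (max_grade * len(grades))

def average_precision(grades):
    """Non-interpolated AP over binarized judgments (grade > 0 = relevant)."""
    hits, total = 0, 0.0
    for rank, g in enumerate(grades, start=1):
        if g > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Hypothetical engines: the same multiset of grades, ordered differently
engine_a = [3, 3, 2, 2, 1, 0, 0, 3, 1, 0]
engine_b = [3, 2, 1, 0, 0, 3, 3, 2, 1, 0]
for grades in (engine_a, engine_b):
    print(round(distance_precision(grades), 2), round(average_precision(grades), 2))
```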
In this study, in the evaluation of the studied engines, which was done by the sub-criteria of relevance and ranking, in addition to the points mentioned above, some other interesting points were also observed.
For example, in the DuckDuckGo search engine, for subjects 1 and 26, the number of related documents among the first 20 retrieved is the same (eight in each). For subject 1, the first two documents were relevant but documents 3 and 4 were unrelated, whereas for subject 26, the first five documents were relevant but documents 6 and 7 were irrelevant. The AP value was 0.72 in the former case and 0.78 in the latter. Therefore, although the number of related documents is the same for the two subjects, their AP differs: the subject in which the related documents were ranked higher had the higher AP.
From the above examples, it is understood that neither distance precision nor AP alone can always show the differences between two engines in terms of relevance; it is therefore better to combine these two indicators so that their respective disadvantages are offset.
The criterion of recall in information systems is added to the aforementioned criteria because it considers all documents related to the keywords or their synonymous expressions; it is therefore used to measure the system's capability in retrieving synonyms. The greater a search engine's ability to retrieve synonyms of a search term, the higher its recall score.
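A plausible reading of this recall measure, given the working-method description of the pool of related documents retrieved with all synonyms after removing duplicates, is sketched below: the relevant hits obtained with the search term are divided by the deduplicated pool of relevant hits gathered with the term and all of its synonyms. The pooling logic and the example document identifiers are assumptions made for illustration.

```python
def synonym_recall(relevant_for_term, relevant_per_synonym):
    """Relevant hits for the search term divided by the deduplicated pool
    of relevant hits gathered with the term and all of its synonyms."""
    pool = set(relevant_for_term)
    for hits in relevant_per_synonym:
        pool.update(hits)
    return len(set(relevant_for_term)) / len(pool) if pool else 0.0

# Invented document identifiers for a term and one synonym
term_hits = {"doc1", "doc2", "doc3", "doc4"}
synonym_hits = [{"doc2", "doc5", "doc6"}]
print(round(synonym_recall(term_hits, synonym_hits), 2))   # 4 relevant out of a pool of 6
```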
In this study, while evaluating engines through ranking criteria, other findings were obtained as follows:
The NDCG criterion shows the difference between the actual and the ideal ranking; the smaller the difference, the closer the fraction is to 1 and the better the system's performance. The BPREF criterion, on the other hand, considers the number of unrelated documents appearing before each relevant document and penalizes the system accordingly. It indicates how many unrelated documents the user must pass before reaching a related one, and hence it is a good criterion for showing the performance of a system. However, like AP, it may have shortcomings in some cases. The BPREF problem is similar to the AP problem, as the following example illustrates.
If, among the n retrieved documents, all unrelated documents occupy the last ranks, then BPREF will be 1 no matter how many unrelated documents there are. For example, if out of 20 retrieved documents the first 2 are related and the rest unrelated, BPREF equals 1; if the first 10 are related and the rest unrelated, BPREF is still 1; and if the first 17 are related and the last 3 unrelated, BPREF is again 1. If all 20 of the retrieved documents are related, BPREF is still 1. The performance of an engine with 17 related documents followed by 3 unrelated ones is thus scored the same as that of an engine with 20 related documents. Therefore, in cases where BPREF equals 1, this criterion alone cannot adequately evaluate the performance of two engines, and the NDCG criterion should be used along with it.
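A minimal sketch of the standard bpref formula, assumed here to match the study's usage, reproduces the behaviour just described: whenever every unrelated document sits below all related ones, the score is 1 regardless of how many unrelated documents there are. The ranked lists are hypothetical.

```python
def bpref(judged):
    """Standard bpref over a ranked list of judged documents
    (True = relevant, False = not relevant)."""
    R = sum(judged)              # number of relevant judged documents
    N = len(judged) - R          # number of non-relevant judged documents
    if R == 0:
        return 0.0
    score, nonrel_seen = 0.0, 0
    for is_relevant in judged:
        if is_relevant:
            penalty = min(nonrel_seen, R) / min(R, N) if N else 0.0
            score += 1 - penalty
        else:
            nonrel_seen += 1
    return score / R

# All unrelated documents ranked last: bpref is 1 whatever their number
print(bpref([True] * 2 + [False] * 18))    # 1.0
print(bpref([True] * 17 + [False] * 3))    # 1.0
print(bpref([True] * 20))                  # 1.0
```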
In this study, with respect to the combined indicator criteria across the search engines, the highest values relate to the relevance criteria. The Google search engine, which has higher relevance than the other three engines, also received a higher ranking score than they did. This is consistent with the results of Al-Maskari et al. (2007), who reported a significant correlation between precision and ranking.
In the present study, the Google search engine received the highest combined indicator score among the studied engines, followed by Bing, DuckDuckGo, and Yahoo; thus, the highest combined indicator belongs to Google and the lowest to Yahoo. These results are in line with those of Singh and Sharan (2013) in suggesting that Google performs better than Yahoo and DuckDuckGo in answering queries; however, our findings differ from theirs in suggesting that Bing performs better than Yahoo and DuckDuckGo in this respect.
In the present study, among the considered engines, Google has the highest relevance. The results of this study are in line with the results of Ajayi and Elegbeleye (2014) in that Google outperforms Yahoo in terms of relevance. However, in Ajayi and Elegbeleye (2014), Yahoo has higher precision than Bing, which is inconsistent with the results of this study.
In terms of ranking, the highest score belongs to Google, followed by DuckDuckGo, Yahoo, and Bing, with slight differences. These results are in line with those of Kaur et al. (2011), who found that Google outperforms Yahoo and Bing in terms of ranking. Google outperformed the other three engines in all three sub-criteria of relevance (distance precision, average precision, and recall). Google also outperformed the other three engines in all three sub-criteria of ranking (normalized discounted cumulative gain, expected reciprocal rank, and binary preference). Google's good performance in relevance and ranking, as well as in their sub-criteria, is due to its strong algorithms in this area.
In terms of novelty criterion, the search engines are in the order from DuckDuckGo, Bing, Google to Yahoo, and thus the highest value belongs to DuckDuckGo and the lowest to Yahoo. Therefore, Google is weaker than DuckDuckGo and Bing in terms of the novelty of documents, but has a better position than Yahoo. Accordingly, to retrieve new documents, it is better to use Google rather than Yahoo, but among the four search engines, the best choice is the DuckDuckGo search engine. The highest ratio of coverage belongs to Google search engine, followed by Bing, DuckDuckGo, and Yahoo. Therefore, Google is the best option for searching subjects such as history, in which the search engine should cover a high percentage of information on the web and retrieve the most relevant documents from the web. The highest proportion of unrelated documents belongs to Yahoo and the lowest belongs to Google. Besides, the highest proportion of duplicate hits belongs to Yahoo and the lowest to the Google search engine. The order of search engines in terms of the proportion of duplicate hits is: Yahoo, Bing, DuckDuckGo and Google. Therefore, Yahoo has more duplicate and irrelevant documents than the other three search engines and needs more powerful algorithms and methods to remove duplicate and irrelevant documents. Examining the scores of criteria in specific and general subjects in the four search engines shows that the highest and lowest criteria (except in a few cases) belong to different subjects. In other words, the highest rate of a given criterion in each search engine belongs to a different subject (for example, the highest distance precision in the four search engines belongs to different subjects) and the same is true for the lowest rate of each criterion. Therefore, search engines have different retrieval performances on the same subjects, and this is due to different retrieval algorithms and methods in search engines.
Lack of a clear standard for scoring retrieved documents and lack of a standard framework for selecting criteria were among the limitations of this study. On the other hand, some search engines provide search results based on the user’s geographical location, and thus the results of a search in several geographical locations may be different, and this problem creates a challenge in the evaluation of search engines.
Researchers are recommended to use our proposed combined indicator criteria and formula to evaluate search engines. The proposed combined indicator provides greater integration and coherence in search engine evaluations and also helps to consider important aspects of search engine performance in evaluations.
Users are also advised to choose the right search engine based on the results of the evaluation of the four search engines. According to our findings, in general, Google search engine had the best performance compared to Yahoo, Bing and DuckDuckGo; however, if a certain criterion is more important for users, they can get the best performance from the search engine in which the said criterion has the highest rate. For example, for a user who gives more weight to the novelty of documents, it is better to use DuckDuckGo or Bing as these two search engines have a better performance than Google or Yahoo in this regard.
Based on the results, search engines perform differently on different criteria and may show good performance on one criterion and poor performance on another. Therefore, search engine designers are recommended to improve search engine performance by considering all the criteria in the proposed combined indicator and to use the indicator to identify the strengths and weaknesses of search engine performance.
Because the documents retrieved by the Google and Yahoo search engines are less up to date, the designers of these engines are advised to improve the novelty of retrieved documents. Likewise, because the ratio of unrelated documents and the proportion of duplicate hits in Yahoo are higher than in Google, DuckDuckGo, and Bing, Yahoo's designers are advised to reduce unrelated and duplicate documents in retrieval.
In this work, we introduced a new combined indicator for evaluating the performance of information retrieval systems, together with a formula to calculate it, and applied it to evaluate Google, DuckDuckGo, Bing, and Yahoo. As future work, it would be interesting to evaluate search engines that are less popular or less well known. The evaluation of other information retrieval systems, especially digital libraries, based on this indicator could be a further task. Moreover, in our experiments, we searched the terms ourselves and judged the results against the relevance judgment criterion, consulting subject experts where necessary; a future task is therefore to have searches conducted by users in a normal environment. In this work, we focused on general search terms; we also plan to study search terms in a specialized field, for which we will ask experts to select the queries and evaluate the results retrieved by the search engines.
Footnotes
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
