
Abstract

Objectives: To obtain insights about inexperienced clinical researchers’ hypothesis quality and associated factors. The findings inform the development of informatics tools to aid the hypothesis generation process. Methods: We analyzed an existing dataset collected through a randomized controlled study, focusing on individual hypotheses and participants. We invited clinical researchers to analyze datasets and develop hypotheses using the think-aloud method. Participants’ screen activity and audio were recorded, transcribed, coded, and analyzed to measure the time and cognitive events (a granular unit of thought processes used by the participants while generating hypotheses). Hypotheses were rated by an expert panel. Here we analyzed (1) the top 5-rated hypotheses, (2) the bottom 5-rated hypotheses, and (3) the participants who generated them. Results: Participants who generated the top 5-rated hypotheses utilized fewer cognitive events and a shorter range of time per hypothesis; their hypotheses presented a higher valid rate, and they were more experienced. Conclusion: Having more experience is positively associated with higher quality and valid rates of the generated hypotheses. The higher-rated hypotheses seem to be positively associated with slightly fewer cognitive events and shorter time. The effect may not be linear. These analyses provide evidence for customized study designs or tool development based on these associated factors.

Keywords

case study, clinical research, research design, scientific hypothesis generation

Introduction

Hypothesis generation is an important and valuable process in nearly every research project. A solid hypothesis is critical and a foundation for conducting impactful research. The hypothesis marks a project’s starting line. Exploring and understanding these processes and their influencing factors are crucial for improving hypothesis generation, especially to initiate a research project. Past research into the mechanisms of hypothesis generation has focused on clinical care and pure scientific settings.^1–9 Additionally, two textbooks offer systematic introductions to scientific thinking, including hypothesis generation.^10,11 However, more insights are needed regarding the mechanisms and processes of hypothesis generation for clinical research projects, as much is still unknown regarding factors influencing hypothesis generation in such a context.

There are at least two types of research hypotheses: (1) hypotheses based on experimental observations, where researchers may observe phenomena and seek to understand them further, and (2) data-driven scientific hypothesis generation, where researchers analyze various data sets and may identify patterns or differences that prompt further exploration.¹² While both types are valuable, this case study primarily focuses on data-driven hypothesis generation.

Our team conducted a randomized controlled human subject study to explore hypothesis generation by clinical researchers and documented the data-driven hypothesis generation processes, i.e., analyzing data sets and developing hypotheses. We recorded their thought processes (including multiple cognitive events, a granular unit of thought processes used by the participants while generating hypotheses) via think-aloud protocol, the time it took to generate hypotheses, etc.^12–17 The quality of these hypotheses was rated by the same expert panel by using the same clinical research hypothesis quality evaluation instrument.¹⁶ In this manuscript, we only focus on inexperienced clinical researchers (based on their years of experience, publications, and roles in research projects).^13,14

Patel’s studies demonstrate differences in hypothesis generation in clinical diagnosis among clinicians with different experience levels.^1,2,18,19 We, therefore, separate participants based on experience levels. Understanding hypothesis generation gives us insight into ways to improve the hypothesis generation process and how outside tools and factors influence hypothesis generation. Additionally, our team developed VIADS (a visual interactive tool for filtering and summarizing large health datasets coded with hierarchical terminologies) as a secondary data analysis tool to organize, filter, summarize, and visualize large datasets and facilitate data-driven hypothesis generation.¹³ We then conducted utility and usability studies of VIADS. We have published the study protocol,¹³ the usability study of VIADS,¹⁵ clinical research hypothesis quality assessment instruments,¹⁶ cognitive events (i.e., cognitive activities used to generate hypotheses, e.g., “Seeking for Connection”, an attempt to draw connections between data points) used during hypothesis generation,¹² and comparison of hypothesis generation between VIADS users and users of other tools.^14,17,20

To further explore the data collected from the randomized controlled study, in this manuscript, we concentrated on the highest and lowest-rated hypotheses regarding quality and analyzed associated factors to better understand the hypothesis generation process. Subsequently, we utilized data from nine participants who generated these highest and lowest-rated hypotheses and compared their overall hypothesis generation processes and results. While other published results of the study were based on comparisons at the group level, this manuscript focuses on comparisons of the individual hypotheses and participants. Therefore, this case study paper examines an existing data set via different lenses that have not been explored or published. We aim to provide additional insights into understanding factors that may influence the quality of scientific hypotheses within clinical research. Such insights can inform the development of future informatics tools to support the hypothesis generation process.

Methods

Case study purpose and analysis strategies

In order to gain more insights about the quality of hypotheses and associated factors, we compared the quality ratings of hypotheses and examined the associated factors, such as the time needed to create hypotheses, the number of hypotheses generated by each participant, the number of cognitive events used per hypothesis, and the participants’ clinical research experience and demographics. Such analysis can add a higher resolution of granularity, providing a deeper understanding of the factors that may contribute to the quality of the hypotheses generated during the process.

We compared (1) the five highest-rated hypotheses (Top 5) and the five lowest ones (Bottom 5) and (2) all the participants who generated the Top 5 and the Bottom 5 hypotheses. Both the methods and results sections were organized by subheadings: (1) the hypotheses (Top 5 and Bottom 5) and (2) all the participants who generated the Top and Bottom 5 hypotheses. The data collection, qualitative coding, data processing, and quantitative comparison methods and results of the original study have been published separately.^12–17,20,21 Given that this is a case study paper, we conducted a secondary analysis of the data collected from the original study to explore the quality of the hypotheses generated. The results are descriptive without statistical analysis due to the nature of the case study. The results presented in this paper complement the published group comparisons, i.e., we used a magnifying lens to examine each of the Top and Bottom 5 cases one by one and shared the findings.

Context of the study

The original published study was a randomized controlled human subject study that compared hypothesis generation by clinical researchers using the same data sets and different tools within 2 h. The study processes were recorded, coded, analyzed, and compared. The group comparison (VIADS vs other tools, e.g., SPSS, SAS, R) of the original study has been published.^12,14,15,17 After organizing and writing the group comparison results, we realized the data we collected in the original study had not been fully utilized and interpreted since data from individual hypotheses and individual participants had not been examined fully through the group comparisons. Therefore, we conducted this case study to examine more specific aspects of the data collected, particularly the quality of the hypotheses generated (with the highest and the lowest ratings and the participants who generated them). Due to the importance of the quality of hypotheses and our limited understanding of the factors associated with hypothesis quality, we decided to focus on several extreme cases of hypotheses and participants first in this study.

Brief description of the randomized controlled study

After IRB approval, recruitment, and scheduling, all participants used identical health data sets coded with ICD9 (International Classification of Diseases, 9th Revision) codes, adhered to the same study scripts and think-aloud protocols (i.e., verbally “work through” and articulate what they are doing while doing it), were facilitated by the same study coordinator, and used 2 h to analyze data and to generate their hypotheses. Participants were recruited through multiple channels, including team members, professional associations, and international conferences.¹⁴ The inclusion and exclusion criteria were previously published.¹³ The participants were compensated based on the time used during study sessions.¹⁴ Participants were randomly assigned to VIADS or the other groups (any chosen tool, e.g., R, SPSS, Excel).¹³ After the study session, participants completed surveys regarding their background and clinical research experience.¹³ The participants’ screen activities and audio were recorded during the study sessions.

Next, the study recordings were transcribed by a professional transcription service and coded by our team. The time required to generate each hypothesis was examined and averaged for each hypothesis and each participant. Subsequently, cognitive events were coded using Atlas.ti 9 based on an established cognitive process framework.^12,14,17,20 Both inductive and deductive approaches were used to analyze the hypothesis generation processes, focusing on identifying the process’s cognitive events. The inductive approach was used to develop the cognitive process framework, and the deductive approach was used to code the study transcripts based on the framework. For instance, “analyze data” was assigned as a cognitive event whenever a participant examined the data to comprehend them before formulating a hypothesis.¹² Cognitive events are granular units of thought processes used by the participants while generating hypotheses. Codes (such as “analyze data”) and cognitive events were used interchangeably during data analysis. The transcriptions from the pilot study were used as pilot coding materials to establish the coding principles and training materials for the two research assistants to discuss their coding practice, understanding, confusion, and differences. Then, two research assistants independently conducted time and cognitive event coding based on coding principles, compared their results, and consolidated them. Any discrepancies were resolved by including a third team member and refining coding principles. The total number of cognitive events (codes) per participant and the average number of codes used per hypothesis were calculated.

Subsequently, an expert panel comprising seven members rated the hypotheses based on established clinical research hypothesis quality evaluation metrics and instruments.^16,21 The comprehensive instrument includes 10 items (validity, significance, novelty, clinical relevance, potential benefits and risks, ethicality, feasibility, testability, clarity, and researcher interest level) and 39 subitems (e.g., scientific validity and clinical validity) to measure the quality of hypotheses in clinical research.¹⁶ Ratings were conducted using a 5-point Likert scale, and the quality rating was determined by an average score among all expert panel members.¹⁶ During the evaluation, the expert panel utilized a three-dimensional instrument (i.e., validity, significance, and feasibility) following a reliability check. Therefore, each hypothesis received a quality rating between 3 and 15.
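The 3 to 15 range follows from summing three items that are each scored from 1 to 5. The Python sketch below illustrates one way such a rating could be aggregated. It assumes that each expert's three dimension scores are summed and that the totals are then averaged across the panel; this order of operations is an assumption, not the published scoring procedure. The function name `quality_rating` and the example scores are hypothetical.

```python
from statistics import mean

# The three dimensions of the brief instrument named in the text.
DIMENSIONS = ("validity", "significance", "feasibility")


def quality_rating(expert_scores):
    """Average, across experts, of each expert's summed dimension scores.

    expert_scores: list of dicts, one per expert, mapping each dimension
    to a 5-point Likert score (1..5). This aggregation is an assumption,
    not the authors' documented procedure.
    """
    if not expert_scores:
        raise ValueError("at least one expert score set is required")
    totals = []
    for scores in expert_scores:
        for dim in DIMENSIONS:
            if not 1 <= scores[dim] <= 5:
                raise ValueError(f"{dim} score must be between 1 and 5")
        totals.append(sum(scores[dim] for dim in DIMENSIONS))
    return mean(totals)


# Hypothetical scores from a three-member panel (illustration only).
example = [
    {"validity": 4, "significance": 5, "feasibility": 4},
    {"validity": 5, "significance": 4, "feasibility": 4},
    {"validity": 4, "significance": 4, "feasibility": 5},
]
print(quality_rating(example))
```

Under this assumption, a hypothesis scored 1 on every dimension by every expert receives 3, and one scored 5 throughout receives 15, which matches the stated bounds.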

Results

We organized the results into two main sections: hypotheses and participants who generated these hypotheses. Each section includes the highest and lowest-rated hypotheses while examining factors that may be associated with the ratings, such as cognitive event (code) usage, time allocation, and demographics. Figure 1 provides a visual overview of the study and results.

Figure 1.

The visual study flow and the results overview (the results include two sections: (1) top 5 and bottom 5 hypotheses, (2) all the hypotheses generated by participants who generated top 5 and bottom 5 hypotheses).

Hypothesis

The Top 5 and the Bottom 5-rated hypotheses

Figure 2 provides comparisons of the time required to generate the top 5-rated and the bottom 5-rated hypotheses via VIADS or other tools. For instance, the highest-rated hypothesis, H9 by CP1 (coded participant 1; H9, hypothesis #9; NV- Non-VIADS participant), had a rating of 13 and took 5 min and 37 s to generate (Figure 2).

Figure 2.

Comparison of the top 5-rated (blue bars) and the bottom 5-rated (red bars) hypotheses and time needed (orange). (C- control, V- VIADS; CP, coded inexperienced participants; H, hypothesis).

Observed differences

Time usage

The time required to formulate a hypothesis varied considerably, particularly for the bottom 5-rated hypotheses (Figure 2, orange line). For instance, the longest time taken to generate a hypothesis is CP7’s Hypothesis 7, 0:07:54 (H:MM:SS), whereas the shortest time taken to generate a hypothesis is CP8’s Hypothesis 9, which took only 0:01:05 to generate. Both hypotheses were in the bottom 5-rated hypotheses. The average time it took to generate the top 5-rated hypotheses was 0:04:10, while the average time it took to generate the bottom 5-rated hypotheses was 0:04:24.

Cognitive event usage

The cognitive events used to generate the top 5 and the bottom 5-rated hypotheses are compared in Figure 3.

Figure 3.

Cognitive events used in the top 5 (blue) and the bottom 5-rated (red) hypotheses. (C- control, V- VIADS; CP, coded inexperienced participants; H, hypothesis).

The number of cognitive events utilized to generate the Top 5 and the Bottom 5-rated hypotheses exhibited considerable variability (Figure 3). Participant 7 (i.e., CP7) employed nine cognitive events (the most) to generate Hypothesis 7. Conversely, Participant 3’s (i.e., CP3) Hypothesis 4 used just one (the least). Overall, participants with the Top 5 hypotheses utilized a total of 20 cognitive events to generate their hypotheses, while those with the Bottom 5 hypotheses utilized a total of 24 cognitive events.

Highest versus lowest rated hypothesis comparison

Participant 1’s hypothesis 9 received a rating of 13 (the highest), whereas Participant 5’s hypothesis 17 received a rating of 6.1 (the lowest). Here are the two hypotheses:

• I hypothesize that HPV vaccines made a difference in cervical cancer prevalence. To look into the last 10 to 15 years [or it depends on how long has the HPV vaccine been implemented broadly] of cervical cancer incidence data longitudinally to check if there is a correlation between the decreased cancer incidence and the implementation of the HPV vaccine broadly. (Participant 1’s hypothesis 9, CP1-H9)

• #366 (ICD9 code): Cataracts had a diagnosis frequency of 746 in 2015 and 331 in 2005. The hypothesis is that more patients, especially lower economic status patients, were diagnosed early in 2015, contributing to higher diagnosis frequency in 2015. (Participant 5’s hypothesis 17, CP5-H17)

Table 1 presents comparisons of quantitative measures between the highest and the lowest-rated hypotheses.

Table 1.

Comparison of the highest and lowest individual hypothesis ratings.

Coded participant	Hypothesis rating	Average rating/hypothesis	Hypothesis time	Avg. time/hypothesis	# of valid hypotheses/total # of hypotheses	# of cognitive events	Average cognitive events/hypothesis
CP1- H9	13	10.25	0:05:37	0:05:16	12/15	8	5.93
CP5- H17	6.1	8.819	0:01:49	0:02:09	19/21	2	3.14

Participant 1’s 9th hypothesis (CP1-H9, the highest rating) required considerably more time (0:05:37) and cognitive events (8) than Participant 5’s 17th hypothesis (CP5-H17, the lowest rating), which took 0:01:49 and two cognitive events. Notably, each participant took about as long to generate this hypothesis as their own average time per hypothesis (Table 1). Participant 1 utilized more cognitive events to generate Hypothesis 9 (8) than their average per hypothesis (5.93). In contrast, Participant 5 employed slightly fewer cognitive events for Hypothesis 17 (2) than their average (3.14).

For comparison, Participant 1 used the following cognitive events flow to generate hypothesis 9 (the highest rated): Pause/think → Analyze Data (x2) → Use Suggestion from Colleague → Analyze Data → Use Analysis Results & Seeking for Connection → Search/ask for more evidence → Use Analysis Results & Seeking for Connection.

Participant 5 used the following flow to generate hypothesis 17 (the lowest rated): Analyze Data → Use Analysis Results & Seeking for Connection.
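The two flows can be written down as ordered sequences, which makes the event counts in Table 1 easy to check. The Python sketch below does this; it assumes that “Analyze Data (x2)” stands for two consecutive events and uses a single spelling for each event label. The names `CP1_H9`, `CP5_H17`, and `transitions` are hypothetical.

```python
from collections import Counter

# Cognitive event flows as reported in the text, in order.
# "Analyze Data (x2)" is expanded to two consecutive events (assumption).
CP1_H9 = [
    "Pause/think",
    "Analyze Data",
    "Analyze Data",
    "Use Suggestion from Colleague",
    "Analyze Data",
    "Use Analysis Results & Seeking for Connection",
    "Search/ask for more evidence",
    "Use Analysis Results & Seeking for Connection",
]
CP5_H17 = [
    "Analyze Data",
    "Use Analysis Results & Seeking for Connection",
]


def transitions(flow):
    """Return consecutive (from_event, to_event) pairs of a flow."""
    return list(zip(flow, flow[1:]))


for name, flow in (("CP1-H9", CP1_H9), ("CP5-H17", CP5_H17)):
    counts = Counter(flow)
    print(name, "events:", len(flow), "distinct:", len(counts))
    for event, n in counts.most_common():
        print(f"  {n} x {event}")
```

Read this way, the flow for CP1-H9 contains eight events and the flow for CP5-H17 contains two, in line with the counts reported in Table 1.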

Participants

Participants that generated the Top 5 and Bottom 5-rated hypotheses

Four participants generated the Top 5 hypotheses, and five participants generated the Bottom 5 hypotheses. Figure 4 shows the average quality rating per participant and the average time taken per hypothesis by each participant. For instance, CP1 generated the highest-rated hypothesis, had an average rating of 10.25 across all hypotheses generated by CP1 during the study and took an average time of 5 min and 16 s per hypothesis. In contrast, CP5 generated the lowest-rated hypothesis, had an average rating of 8.819 and took an average time of 2 min and 9 s per hypothesis.

Figure 4.

The average quality rating per participant and time per hypothesis. Participants who generated the top 5 (blue) and those who generated the bottom 5-rated hypotheses (red). (C- control, V- VIADS; CP, coded inexperienced participants).

Observed differences

Time usage

The average time participants took to generate their hypotheses varied considerably, especially among participants with the Bottom 5-rated hypotheses (Figure 4). For instance, CP7 took 0:09:26 per hypothesis (the longest), while CP3 took 0:01:42 (the shortest) to generate each hypothesis. The average time it took for the participants who generated the top 5-rated hypotheses to generate each of their hypotheses was 0:03:39, while the average time it took for the participants who generated the bottom 5-rated hypotheses to generate each of their hypotheses was 0:04:57.

Cognitive event (code) usage

Figure 5 shows each participant’s average number of cognitive events used while generating each hypothesis.

Figure 5.

Average number of cognitive events per hypothesis per participant. Participants who generated the top 5 (blue) and the bottom 5-rated (red) hypotheses (corresponding with Figures 2–4). (C- control, V- VIADS; CP, coded inexperienced participants).

CP7 used the highest average number of cognitive events (6.31 codes/hypothesis, Figure 5), while CP3 had the lowest average number of cognitive events (2.60 codes/hypothesis).

Valid rates

During the hypothesis quality rating process, any hypothesis scored at “1” (the lowest rating) for validity by three or more experts was considered invalid. Among participants with the Top 5 hypotheses, CP1 had 12 valid hypotheses out of 15 (80%), CP2 had 9/10 (90%), CP3 had 8/15 (53.33%), and CP4 had 8/9 (88.89%). For participants with the Bottom 5 hypotheses, CP5 had 19 valid hypotheses out of 21 (90.48%), CP6 had 13/16 (81.25%), both CP7 and CP8 had 5/13 each (38.46%), and CP9 had 7/14 (50%).
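The validity rule and the valid rates above can be reproduced with a few lines of Python. The sketch below is illustrative only: the function names `is_valid`, `valid_rate`, and `pooled_rate` are hypothetical, the per-expert score lists in the usage lines are made up, and the counts are the ones reported in this section.

```python
def is_valid(validity_scores, threshold=3):
    """Apply the rule from the text: a hypothesis is invalid when
    `threshold` or more experts gave it the lowest validity score (1)."""
    lowest_count = sum(1 for score in validity_scores if score == 1)
    return lowest_count < threshold


# (valid hypotheses, total hypotheses) per participant, as reported.
VALID_COUNTS = {
    "Top 5": {"CP1": (12, 15), "CP2": (9, 10), "CP3": (8, 15), "CP4": (8, 9)},
    "Bottom 5": {
        "CP5": (19, 21), "CP6": (13, 16), "CP7": (5, 13),
        "CP8": (5, 13), "CP9": (7, 14),
    },
}


def valid_rate(valid, total):
    """Valid rate as a percentage, rounded to two decimals."""
    return round(100 * valid / total, 2)


def pooled_rate(group):
    """Valid rate over all hypotheses of a group (counts are pooled,
    rather than averaging the per-participant percentages)."""
    valid = sum(v for v, _ in group.values())
    total = sum(t for _, t in group.values())
    return valid_rate(valid, total)


# Hypothetical validity scores from a seven-member panel.
print(is_valid([1, 1, 1, 3, 4, 2, 5]))  # three experts scored 1
print(is_valid([1, 1, 2, 3, 4, 2, 5]))  # two experts scored 1

for group_name, group in VALID_COUNTS.items():
    for participant, (valid, total) in group.items():
        print(group_name, participant, valid_rate(valid, total))
    print(group_name, "pooled:", pooled_rate(group))
```

Pooling the counts gives roughly 75.5% for the participants with the Top 5 hypotheses (37 of 49) and roughly 63.6% for those with the Bottom 5 hypotheses (49 of 77).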

Demographics

Demographics and experiences provide valuable insights into participants’ backgrounds, potentially influencing their hypothesis generation and the quality ratings of hypotheses (Table 2).

Table 2.

Participants’ demographics and experiences.

Coded participant	YOE in hypothesis generation	YOE in study design	YOE in data analysis	TOR in hypothesis generation	TOR in study design	Publication #
CP1	>= 5 and <10	>2 and <5	>2 and <5	Collaborator role	Collaborator role	<5
CP2	>2 and <5	>2 and <5	<= 2	Leading role	Leading role	<5
CP3	<= 2	<= 2	<= 2	Collaborator role	Collaborator role	<5
CP4	<= 2	<= 2	<= 2	Collaborator role	Collaborator role	<5
CP5	>2 and <5	<= 2	<= 2	Leading role	Collaborator role	<5
CP6	>2 and <5	>2 and <5	>2 and <5	Collaborator role	Collaborator role	<5
CP7	<= 2	<= 2	>2 and <5	Collaborator role	Collaborator role	<5
CP8	<= 2	>2 and <5	<= 2	Research coordinator	Research coordinator	<5
CP9	<= 2	<= 2	<= 2	Mixed; collaborator and leading role	Collaborator role	<5

YOE, years of experience; TOR, type of role. CP1 to CP4 have the highest-rated hypotheses, while CP5 to CP9 have the lowest.

Comparing participants with the highest and lowest average ratings

Table 3 compares the participants with the highest (CP2) and the lowest (CP8) average hypothesis ratings. CP2 averaged a longer time per hypothesis, had a higher valid rate, and used more cognitive events per hypothesis in comparison to CP8.

Table 3.

Comparison of the highest and lowest average hypothesis ratings.

Coded participant	Participant average rating	Average time/hypothesis	# of valid hypotheses/total # of hypotheses	Codes used (total)	Codes/hypothesis
CP2	10.55	0:04:10	9 out of 10 (90%)	54	5.4
CP8	8.298	0:03:29	5 out of 13 (38.46%)	56	4.31

Summary of results

The time used per hypothesis varied less among the Top 5 hypotheses (difference 0:03:03) than among the Bottom 5 hypotheses (difference 0:07:13). The Top 5 hypotheses also required slightly fewer cognitive events on average than the Bottom 5 hypotheses (4 per hypothesis vs 4.8 per hypothesis). When comparing the highest and lowest-rated hypotheses, the highest-rated hypothesis required more time and cognitive events than the lowest-rated hypothesis (0:05:37 vs 0:01:49; 8 vs 2). Comparing the participants who generated the Top 5 and the Bottom 5 hypotheses across all the hypotheses they generated during the study session, the average time per hypothesis again varied less among the participants with the Top 5 hypotheses (difference 0:03:34 vs 0:07:17). They used fewer cognitive events per hypothesis (3.498 vs 4.626). They also had a higher valid rate of hypotheses (75.51% vs 63.63%) and generally had more experience with clinical research. When comparing the two participants with the highest and the lowest average ratings (CP2 and CP8) across all their hypotheses in this case study, the higher-rated participant used a slightly longer time and more cognitive events per hypothesis (0:04:10 vs 0:03:29; 5.4 vs 4.31) on average and had a much higher percentage of valid hypotheses (90% vs 38.46%).

Discussion

Time usage for hypothesis generation

We examined individual participants and their hypotheses to obtain insights about how and where technology support may be valuable. We also compared the Top 5 to the Bottom 5-rated hypotheses (Figure 2).

Regarding time usage, i.e., the time needed to generate each hypothesis, notable differences emerge when comparing the Top 5 to the Bottom 5-rated hypotheses. The time used to generate the Top 5-rated hypotheses showed less variation (difference 0:03:03) than the time used to generate the Bottom 5-rated hypotheses (difference 0:07:13). The latter is more than twice the former. A similar trend was observed when we compared the average time used to generate each hypothesis by participants who generated the Top 5 to those who generated the Bottom 5-rated hypotheses (Figure 4). Once again, the average time per hypothesis among the participants who generated the Top 5 shows considerably less variation than among those who generated the Bottom 5-rated hypotheses (difference 0:03:34 vs 0:07:17, Figure 4). The results indicate that higher-rated hypotheses are associated with a relatively more stable time (smaller variations).
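The participant-level differences quoted above are ranges (longest minus shortest average time per hypothesis). The Python sketch below shows the arithmetic using the per-participant averages reported in the Results. It assumes that CP1 and CP3 are the extremes among the participants with the Top 5 hypotheses and that CP7 and CP5 are the extremes among those with the Bottom 5 hypotheses; this pairing is an inference that reproduces the reported differences, not something stated explicitly. The helper names are hypothetical.

```python
from datetime import timedelta


def parse_hms(text):
    """Parse an H:MM:SS string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_hms(delta):
    """Format a non-negative timedelta as H:MM:SS."""
    total = int(delta.total_seconds())
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def time_range(times):
    """Difference between the longest and shortest time in a list."""
    parsed = [parse_hms(t) for t in times]
    return format_hms(max(parsed) - min(parsed))


# Average time per hypothesis for the assumed group extremes.
top_extremes = {"CP1": "0:05:16", "CP3": "0:01:42"}
bottom_extremes = {"CP7": "0:09:26", "CP5": "0:02:09"}

print("Top 5 participants:", time_range(top_extremes.values()))
print("Bottom 5 participants:", time_range(bottom_extremes.values()))
```

With these inputs the ranges come out as 0:03:34 and 0:07:17, matching the differences given in the text.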

However, if we look at the highest (CP1 H9, rated at 13) and the lowest-rated (CP5 H17, rated at 6.1) hypotheses, we notice something different. CP1 took 0:05:37 to generate H9 and CP5 took 0:01:49 to generate H17. The average time it took to generate the top 5-rated hypotheses (0:04:10) is slightly shorter than the average time it took to generate the bottom 5-rated hypotheses (0:04:24). That is, the individual extreme hypotheses did not show similar trends to the aggregated top 5 and bottom 5-rated hypotheses results regarding time usage.

Interestingly, when we compare the two participants (CP2: the highest average rating, 10.55; CP8: the lowest average rating, 8.30) and their average time usage to generate each hypothesis, we notice a trend similar to that of the extreme hypotheses. That is, CP2 used a slightly longer time than CP8 per hypothesis (0:04:10 vs 0:03:29). In comparison, the average time it took for the participants who generated the top 5-rated hypotheses to generate each of their hypotheses was shorter than that of the participants who generated the bottom 5-rated hypotheses (0:03:39 vs 0:04:57). Once again, the highest and the lowest rating participants did not show a similar trend to the participants who are in the top 5 or the bottom 5 groups.

Literature indicates that individuals with high time urgency tend to generate lower-rated hypotheses.²² CP5 had to use a tool (VIADS) that he/she had learned only an hour before the study session, which could have inadvertently resulted in a feeling of urgency and cognitive overload, thereby potentially negatively affecting the quality ratings of CP5’s hypotheses. Findings from a study by Dasgupta et al. support the idea that often, the “brain trades off accuracy and computational cost to make efficient use of its limited cognitive resources to approximate probabilistic inference”.²³

The discrepancy regarding time usage between the top 5 and bottom 5 and the highest and lowest rating hypotheses indicates the real relationship is far more complicated than our observations. The results shared in this study should be treated as cases that may provide insights for further testing; no generalization should be inferred.

Cognitive events during hypothesis generation

Another crucial factor to consider is cognitive events used in generating hypotheses. We noted the Top 5 group utilized slightly fewer cognitive events (4/hypothesis) than the Bottom 5 (4.8/hypothesis) on average, i.e., the Bottom 5 used 20% more cognitive events per hypothesis on average.
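The per-hypothesis averages and the 20% figure follow directly from the totals reported in the Results (20 cognitive events across the Top 5 hypotheses and 24 across the Bottom 5). The short Python sketch below spells out the arithmetic; the function names are hypothetical.

```python
def events_per_hypothesis(total_events, n_hypotheses):
    """Mean number of cognitive events per hypothesis."""
    return total_events / n_hypotheses


def percent_more(larger, smaller):
    """How much larger `larger` is than `smaller`, in percent."""
    return 100 * (larger - smaller) / smaller


top = events_per_hypothesis(20, 5)     # Top 5 hypotheses
bottom = events_per_hypothesis(24, 5)  # Bottom 5 hypotheses

print(f"Top 5: {top:.1f} events/hypothesis")
print(f"Bottom 5: {bottom:.1f} events/hypothesis")
print(f"Bottom 5 used {percent_more(bottom, top):.0f}% more events per hypothesis")
```

The percentage is taken relative to the Top 5 average; expressed relative to the Bottom 5 average, the same gap would be about 17% fewer events.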

The participants who generated the Top 5-rated hypotheses employed an average of 3.498 codes (i.e., cognitive events) per hypothesis, whereas the participants of the Bottom 5 utilized an average of 4.626 codes per hypothesis. This finding aligns with the comparison between the Top 5 and the Bottom 5-rated hypotheses, suggesting that lower cognitive event usage may be positively associated with the ratings of hypotheses.

However, as with time usage, the trend reverses when looking at the highest and lowest-rated hypotheses or participants. CP1 H9 (the highest-rated hypothesis) took 8 cognitive events to generate, and CP5 H17 (the lowest-rated hypothesis) took 2. The same reversal was noted between the highest (CP2) and lowest (CP8) average rating participants: CP2 used an average of 5.4 cognitive events to generate each hypothesis, while CP8 used 4.31. Like time usage, the results of cognitive events should be treated as cases. The number of cognitive events used to generate hypotheses and their relationships to the quality of hypotheses need to be examined carefully in a larger-scale study.

Participants’ demographics

Interestingly, there are notable discrepancies when examining the demographics and experiences of participants in this case study. For instance, CP1 has the most extensive experience in hypothesis generation, study design, and data analysis, aligning with the expectation that greater experience would yield higher-rated hypotheses. However, outliers such as CP3 and CP4, with minimal experience (i.e., 2 years or less of experience in hypothesis generation, study design, and data analysis) and collaborator roles, produced some of the highest-rated individual hypotheses. Conversely, participants like CP5, despite holding a leading role in hypothesis generation and possessing more experience in the field, produced one of the lowest-rated individual hypotheses.

CP2 (the highest average rating) has more experience in hypothesis generation than CP8 (the lowest average rating), while both participants had similar experience levels in data analysis and study design. CP2 held a leading role, while CP8 served as a research coordinator in the past. CP2 maintained a slight advantage in experience concerning hypothesis generation, which could have influenced the rating and validity rate of their hypotheses.

The results suggest that the relationship between one’s experience level and the quality of hypotheses may be more complicated than a simple positive association. Additional studies on these aspects are needed.

On the contrary, the hypotheses’ valid rate demonstrates a positive association with higher-rated hypotheses compared to those with lower ratings. The participants of the Top 5 generated 37 (out of 49, 75.51%) valid hypotheses, whereas participants of the Bottom 5 had 49 (out of 77, 63.63%) valid hypotheses.

Strengths and limitations

One of the strengths of this study lies in its detailed examination and comparison of hypothesis generation through different lenses: at the hypothesis level, at the participant level, as well as examining them individually or as a group. We compared the time needed and cognitive events used during hypothesis generation and the participants’ demographic factors based on hypothesis quality ratings. This study provides systematic comparisons between the higher-rated and lower-rated hypotheses and valuable insights with comprehensive raw data collected during the data-driven hypothesis generation process. This study is also related to the nature of the tools used to support the process of hypothesis generation by experienced and inexperienced researchers.

With a small cohort of nine participants, the study enabled us to have a deeper examination and a thorough comparison between individuals who generated higher-rated and lower-rated hypotheses, even though it has limited generalizability. This approach offers a nuanced understanding of differences in their hypothesis generation processes and also provides an important starting point to explore the complex and crucial process. This manuscript complements our prior reported results, which focused on group comparisons.^12–15,20

However, the study also has its limitations. First, the time constraint of the original study session, capped at 2 h, may have influenced participants’ ability to generate hypotheses naturally, potentially introducing bias.²² Furthermore, this case study provides insights into individual hypotheses and participants among inexperienced clinical researchers only; any comparisons involving broader participant samples should be approached with caution. Ideally, comparisons need to be conducted throughout the natural hypothesis generation processes of the participants who generated the highest-rated and lowest-rated hypotheses, as well as the intermediate-rated hypotheses. We only compared the highest and the lowest cases in this study. Prior studies of medical diagnosis have shown that intermediates behave differently from what would be expected based on the two extreme groups alone. The authors found this phenomenon to be generic, going beyond the medical domain.^18,24

Regarding the roles of VIADS or other tools (such as R, SPSS, SAS) in hypothesis generation among clinical researchers, please refer to our publications for a thorough comparison.^12,14,15,20

Future work

By focusing solely on participants with the highest-rated and lowest-rated hypotheses, we excluded a substantial amount of data from participants with average ratings. Incorporating these middle-ground data could provide additional insights into the hypothesis generation process. Furthermore, this case study exclusively examined inexperienced clinical researchers. Future research could include experienced clinical researchers, whose perspectives and approaches to hypothesis generation may offer additional valuable insights, and a larger sample of participants. An additional qualitative analysis of the think-aloud protocols, using a cognitive theoretical framework, may also provide further insights into the thought process of hypothesis generation.^25,26

Conclusion

Demographics and experience emerged as influential factors: higher levels of experience were associated with higher-rated hypotheses and a higher valid rate of hypotheses. Notably, outliers in time usage, characterized by either excessively short or prolonged time per hypothesis, were negatively associated with the quality ratings of hypotheses. Participants in the Top 5 group used fewer cognitive events, possibly indicating that they required less cognitive effort to process the data. The trends in time usage and cognitive events were evident at the group level among the top-rated and bottom-rated hypotheses, whereas individual hypotheses and participants did not show the same trends. This study offers insights into factors affecting hypothesis quality and provides evidence for future larger-scale studies to examine the relationships among these factors and hypothesis quality more comprehensively. Such insights can also inform customized study designs or tool development based on these associated factors.

Supplemental Material

Supplemental Material - A secondary data analysis on hypotheses generated by inexperienced clinical researchers: Cases from a randomized control study

Supplemental Material for A secondary data analysis on hypotheses generated by inexperienced clinical researchers: Cases from a randomized control study by Mytchell A. Ernst, Brooke N. Draghi, James J. Cimino, Vimla L. Patel, Yuchun Zhou, Jay H. Shubrook, Sonsoles De Lacalle, Aneesa Weaver, Chang Liu and Xia Jing in Health Informatics Journal.

Footnotes

Author note

The original randomized controlled study was approved by Clemson University (IRB2020-056) and Ohio University (18-X-192). All data used in this publication have been anonymized, and no identifiable information is used in any part of this publication.

Acknowledgment

We would like to thank all participants and expert panel members for participating in this study. Without their significant contribution and participation, none of the work would be possible. The project was supported by a grant from the National Library of Medicine (R15LM012941) and was partially supported by the National Institute of General Medical Sciences of the National Institutes of Health (P20 GM121342). This work has also benefited from research training resources and the intellectual environment enabled by the NIH/NLM T15 SC BIDS4Health research training program (T15LM013977).

Author contributions

XJ, YCZ, JJC, VLP designed the study; XJ and YCZ determined the analysis strategy; XJ, JJC, VLP, YCZ, JHS, SDL, CL obtained the funding; CL provided technical support; JJC, JHS, and SDL helped with the recruitment; XJ conducted the study; MAE, BND coded the data and conducted initial analysis; AW anonymized the hypotheses and randomized the order and put them in all Qualtrics surveys; MAE prepared the first draft of the manuscript; all coauthors revised the manuscript substantially.

ORCID iDs

James J. Cimino

Xia Jing

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: this work was supported by the National Institute of General Medical Sciences of the National Institutes of Health (P20 GM121342), the National Library of Medicine (R15LM012941), and the NIH/NLM T15 SC BIDS4Health research training program (T15LM013977).

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

More detailed, analyzed, and organized data are available upon request to the corresponding author; requests for the raw data are considered on a case-by-case basis.

Supplemental Material

Study scripts for the VIADS group, the control group, and all supplementary materials can be found at https://doi.org/10.1017/cts.2023.708 and . Supplemental material for this article is available online.

References

1. Arocha, Patel. Hypothesis generation and the coordination of theory and evidence in novice diagnostic reasoning. Med Decis Mak 1993; 13: 198–211.

2. Patel, Groen. Knowledge based solution strategies in medical reasoning. Cogn Sci 1986; 10: 91–116.

3. Joseph G-M, Patel. Domain knowledge and hypothesis generation in diagnostic reasoning. Med Decis Mak 1990; 10: 31–46.

4. Patel, Kaufman, Cohen (eds). Cognitive Informatics in Health and Biomedicine: Case Studies on Critical Care, Complexity and Errors. Springer, 2014.

5. Klahr, Dunbar. Dual space search during scientific reasoning. Cogn Sci 1988; 12(1): 1–48.

6. Spangler, Wilkins, Bachman, et al. Automated hypothesis generation based on mining scientific literature. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, New York, NY, 24–27 August 2014, Association for Computing Machinery, 2014, pp. 1877–1886.

7. Sybrandt, Shtutman, Safro. Moliere: automatic biomedical hypothesis generation system. ACM, 2017.

8. Dunbar. How scientists really reason: scientific reasoning in real-world laboratories. The nature of insight. The MIT Press, 1995, pp. 365–395.

9. Dunbar. The analogical paradox: why analogy is so easy in naturalistic settings, yet so difficult in the psychology laboratory. In: Gentner, Holyoak, Kokinov (eds). Analogy: Perspectives from cognitive science. MIT Press, 2001, pp. 323–334.

10. Holyoak, Morrison. The Oxford handbook of thinking and reasoning. Oxford University Press, 2012.

11. Holyoak, Morrison. The Cambridge handbook of thinking and reasoning. Cambridge University Press, 2005.

12. Jing, Draghi, Ernst, et al. Using think-aloud protocol to identify cognitive events while generating data-driven scientific hypotheses by inexperienced clinical researchers. AMIA Annu Symp Proc 2024; 2024: 561–570, eCollection. https://pmc.ncbi.nlm.nih.gov/articles/PMC12099392/

13. Jing, Patel, Cimino, et al. The roles of a secondary data analytics tool and experience in scientific hypothesis generation in clinical research: protocol for a mixed methods study. JMIR Res Protoc 2022; 11(7): e39414.

14. Jing, Cimino, Patel, et al. Data-driven hypothesis generation among inexperienced clinical researchers: a comparison of secondary data analyses with visualization (VIADS) and other tools. J Clin Transl Sci 2023; 8(1): e13.

15. Jing, Patel, Cimino, et al. A visual analytic tool (VIADS) to assist the hypothesis generation process in clinical research: mixed methods usability study. JMIR Hum Factors 2023; 10: e44644.

16. Jing, Zhou, Cimino, et al. Development, validation, and usage of metrics to evaluate the quality of clinical research hypotheses. BMC Med Res Methodol 2025; 25: 11.

17. Jing, Cimino, Patel, et al. Data-driven hypothesis generation in clinical research: what we learned from a human subject study? Med Res Arch 2024; 12(2). DOI: 10.18103/mra.v12i2.513.

18. Patel, Arocha, Kaufman. Diagnostic reasoning and medical expertise. Psychol Learn Motiv 1994; 31(C): 187–252.

19. Patel, Groen, Patel. Cognitive aspects of clinical performance during patient workup: the role of medical expertise. Adv Health Sci Educ Theory Pract 1997; 2: 95–114.

20. Draghi, Ernst, Patel, et al. Number of scientific hypotheses and time needed in a 2-hour study session among inexperienced clinical researchers—preliminary results. In: AMIA summit 2023, Seattle, 18-21 Mar, 2023; 13-16 Mar 2023, 818.

21. Jing, Zhou, Cimino, et al. Development and preliminary validation of metrics to evaluate data-driven clinical research hypotheses. In: AMIA, Washington DC, 5–9 Nov 2022, pp. 1363–1364.

22. Alison, Doran, Long, et al. The effects of subjective time pressure and individual differences on hypotheses generation and action prioritization in police investigations. J Exp Psychol Appl 2013; 19(1): 83–93.

23. Dasgupta, Schulz, Goodman, et al. Remembrance of inferences past: amortization in human hypothesis generation. Cognition 2018; 178: 67–81.

24. Rikers RMJP, Schmidt, Boshuizen HPA. Knowledge encapsulation and the intermediate effect. Contemp Educ Psychol 2000; 25(2): 150–166.

25. Ericsson, Simon. Protocol analysis: verbal reports as data. The MIT Press, 1993.

26. Groen, Patel. Relationship between comprehension and reasoning in medical expertise. In: Chi, Glaser, Farr (eds). The Nature of Expertise. Lawrence Erlbaum, 1988, pp. 287–310.
