Abstract
Background
With the rise of digital health, proficiency in data science is becoming increasingly important. University curricula often lack adequate data analysis and programming education, hence the need to develop an accessible training platform.
Methods
A free, accelerated R programming course was developed for healthcare trainees with no prior programming experience. The first module consisted of seven video capsules, to be completed over 10 days, teaching foundational R programming for clinical research. A pretest and posttest study assessed participants’ skills pretraining, immediately posttraining, and three months later. Participants (students, researchers, and professors) were recruited from Montreal's academic healthcare community. A Real-Time Delphi method guided test development, and mixed-effects models compared scores.
Results
Of 102 enrolled participants, 100 were analyzed; most were aged 20–30 (72%) and were students (92%), primarily in medicine. Of them, 84% successfully completed the course within 10 days (95% CI [77%–91%]). Mean test scores increased from 4.5/10 pretest (95% CI [4.1–4.8]) to 8.4/10 posttraining (95% CI [8.1–8.7]) (p < .001; Cohen's d = 2.5), with scores at three months (6.8/10, 95% CI [6.4–7.2]) remaining significantly higher than baseline (p < .001), despite a slight expected decline.
Conclusion
This accelerated R programming course effectively improves data science skills in healthcare trainees with no prior knowledge. It addresses key gaps in formal data science education with the potential to enhance independent research and analysis skills in complement to university curricula.
Introduction
Proficiency in digital health is increasingly essential as technology becomes an integral part of clinical practice, particularly in managing and analyzing clinical data.1,2 However, a significant knowledge gap exists among medical trainees and physicians, primarily due to a lack of relevant digital health training in medical curricula.3
In 2020, the Royal College of Physicians and Surgeons of Canada created a Task Force on Artificial Intelligence and Emerging Digital Technologies, which highlighted the deficiency of data science and digital health skills among Canadian healthcare professionals, emphasizing the need for targeted education.4 Developing data science skills within the health sciences community could promote literacy in informatics, mathematics, and statistics, helping professionals better navigate the transition toward digital health and fostering interdisciplinary collaboration for the integration of technology in clinical settings. It can also enhance the ability to critically interpret findings in the scientific literature and equip researchers to lead more rigorous projects.
Despite this need, many universities’ curricula still lack formal data science training.5,6 Therefore, students may be underprepared for independent research or collaboration with data science teams.
Recognizing this gap, several global initiatives have attempted to introduce computer programming education into medical training. A two-day Python coding course at Imperial College London demonstrated that beginners could develop simple, functional clinical programs after a short, intensive weekend of training. Participants not only acquired foundational programming skills but also recognized the importance of coding in medical education, research, and technology development.7 Similarly, a pilot project in the Netherlands introduced hospital staff to programming through a two-day workshop, “Building Medical Apps,” using Apple's Xcode and Swift.8 The program successfully improved participants’ programming knowledge and app development skills, highlighting the feasibility of integrating coding into clinical environments.
Beyond short-term initiatives, longer training models have been explored. At the University of Toronto, a 14-month “Computing for Medicine (C4M)” program was developed to provide medical students with a structured introduction to programming.9 The curriculum combined weekly hands-on workshops, coding exercises, and expert-led seminars to strengthen participants’ computational thinking and problem-solving skills.6 Participants highlighted the need for stronger collaboration between medical professionals and technology developers to advance digital health innovation.
Despite these initiatives, most existing studies focus on short-term skill acquisition, with limited research assessing long-term retention of programming skills among health sciences trainees. Most initiatives are conducted in English, leaving a significant gap in French-language programming education tailored to the healthcare sector. Additionally, these initiatives focused on general-purpose languages such as Python and did not evaluate statistical analysis competencies.
To address this, our research team developed a free, accelerated online training program focused on the R programming language,10 a widely used statistical platform in academic research. Unlike traditional programs, which often require dozens of hours of coursework (such as undergraduate degrees, master's programs, or specialized certifications), our program is designed to be completed in under a week during students’ free time. By offering this flexible, accelerated format, we aim to provide students with the essential skills they need to engage with data science in a condensed and accessible manner. The program is tailored for healthcare trainees and professionals with no prior data science experience. R, being a free, open-source language supported across multiple platforms, is an ideal entry point for health science trainees.11
This initiative is the first French-language program of its kind focused on the applications of R in health sciences, and it is freely available under a Creative Commons 4.0 CC BY-NC-ND license.
The primary objective of this study was to evaluate the feasibility of implementing a structured R programming training program within the health sciences community. This was addressed by the research question: “Can an accelerated online R training program be feasibly implemented for health sciences trainees?” To assess this, we hypothesized that the majority of participants would successfully complete the training within 10 days, demonstrating its practical applicability as a complementary resource to existing health sciences curricula.
The secondary objective was to evaluate the program's effectiveness in improving programming skills over the short and medium term, guided by the research question: “Does participation in this program lead to a statistically significant improvement in R programming proficiency immediately after training and three months posttraining?” We hypothesized that participants would exhibit a statistically significant improvement in R programming proficiency immediately after completing the training and would retain a substantial portion of their acquired skills three months later, despite an anticipated decline in retention.
Methodology
Study design and participants
This study was approved by the ethical committee of the CHU Sainte-Justine Research Center (2024–6020). This was a prospective pretest–posttest design12 with a single group of volunteers. A control group was not used due to the logistical constraints associated with implementing and assessing multiple teaching approaches, and due to the absence of a widely accepted gold standard for data science training in the health sciences community.
The target population was the university academic community in the health sciences field who self-identifies as having no or little prior knowledge in programming, particularly in the R language. The inclusion criteria were:
(1) being a student at the university level in Canada in a health sciences program, a university-affiliated professor in a health sciences program, or a researcher/research assistant in the health sciences (affiliation was demonstrated through an institutional e-mail address); and (2) speaking, understanding, and writing French.
The exclusion criterion was having previous knowledge of the R language or equivalent (Python, Java, PHP, C/C++, etc.).
Participants were recruited using an online questionnaire (Supplemental Material), which was distributed via social media and university newsletters and shared by teachers and researchers in their communities. The questionnaire described the project, the voluntary, uncompensated nature of participation, and declarations of anonymity and confidentiality. The recruitment questionnaire was not validated or pilot-tested. Snowball sampling was also used,13 in which study participants recruited other participants from among their acquaintances.
Training program
The training program included 16 video capsules (in French), ranging from 20 to 45 min each, divided into three key sections: “The Basics” (seven capsules), “Data” (five capsules), and “Statistics” (four capsules). This research project was based on the “Basics” section, which provided the foundational knowledge necessary for further learning in data science. The content of the capsules, presented in Table 1, was developed by experts in the use of R and RStudio in the healthcare sector, including university biostatistics professors and researchers in the field. The videos were cross-reviewed by the experts.
Table 1. Overview of the content of “The Basics” section of the training.
The video format was chosen to enable participants to learn at their own pace. The capsules began with a short introduction to theoretical concepts, which were then applied to real open-access medical databases. Suggested exercises and a summary guide containing the main aspects covered in each capsule were also provided to participants.
Pre- and posttests
There is no validated test in the scientific literature for assessing programming skills and proficiency in R. We therefore used a method inspired by the Real-Time Delphi.14 We assembled a team of five experts who use R in a clinical research context. Each person who created a capsule provided three to five multiple-choice questions related to the content of their capsule. A survey was sent to each expert, enabling them to rate the relevance of each question on a Likert scale of 1 to 9 and to provide comments. Consensus on a question was reached if at least four of the five experts (80%) assigned a score of 7 or higher on the Likert scale.14
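This consensus rule can be expressed as a short check; a minimal sketch in R, using hypothetical expert ratings (the study's actual ratings are not reproduced here):

```r
# Consensus rule: a question is retained if at least 4 of the 5 experts
# (80%) rate its relevance at 7 or higher on the 1-9 Likert scale.
has_consensus <- function(ratings, cutoff = 7, min_agree = 0.8) {
  mean(ratings >= cutoff) >= min_agree
}

# Hypothetical ratings from five experts for two candidate questions:
has_consensus(c(8, 7, 9, 7, 5))  # 4/5 rate >= 7 -> TRUE (retained)
has_consensus(c(8, 6, 9, 5, 6))  # 2/5 rate >= 7 -> FALSE (revised)
```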
The pretest and posttest questions were similar, but with minimal differences preventing answers from the pretest from being simply reproduced, and the order of questions and answer choices was altered. Each test consisted of 10 multiple-choice questions with five answer choices each, covering the material of the seven capsules of the “Basics” section. Tests could be completed online, with an estimated duration of less than 30 min (a limit of 2 h was allocated). At the end of Posttest No. 1, participants could write optional comments regarding the training and suggest areas to improve. Participants did not receive their score or the correction after completing a test. We considered the success threshold to be a score of 6/10 or higher, the pass mark most commonly used in Canada; however, this threshold has not been formally validated. The pretest and posttests were not pilot-tested.
The platform for hosting the pre- and posttests, as well as the demographic questionnaires (Supplemental Material), was REDCap, hosted and managed by the Applied Clinical Research Unit (Unité de Recherche Clinique Appliquée) at CHU Sainte-Justine. Participants received a personalized, secure REDCap link to complete the demographic questionnaire and pretest. They then received a link to view the seven capsules and to complete the first posttest. After three months, they received a link to complete the second posttest. Up to five automatic reminders were sent when participants did not complete the tests.
Sample size and statistical analysis
We used the G*Power software (Heinrich-Heine-Universität Düsseldorf, Germany, version 3.1.9.6 of 2020)15 to determine the sample size.16 We assumed a medium effect size (d = 0.5), so as not to overestimate participants’ learning at the end of training, an alpha level of .05, and a power (1 − β) of 0.8, as these are the most conventionally accepted values in most fields.17–19 The calculated sample size was a minimum of 34 participants.
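The same calculation can be cross-checked in base R; a sketch of an equivalent paired t-test power analysis (G*Power was the tool actually used, so this is only an illustrative check, not the study's script):

```r
# Paired t-test sample size for a medium effect (d = 0.5),
# alpha = .05, power = .80 -- the same inputs used in G*Power.
res <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05,
                    power = 0.80, type = "paired")
ceiling(res$n)  # rounds up to 34 participants
```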
Statistical analyses were performed in R (The R Foundation for Statistical Computing, Vienna, Austria)10 using RStudio Version 2024.04.2+764. Participant characteristics were collected and presented in tables containing frequencies as well as medians and interquartile ranges. Pre- and posttest scores were automatically collected for each participant using REDCap. To compare the mean scores obtained at pretest and posttest and calculate a p-value for statistical significance, we used a mixed-effects model (package lme4 in R), which includes all available data under the assumption that data are missing at random. Given the limited amount of sociodemographic data collected at recruitment, we were not able to use a multiple imputation model to account for missing data. Cohen's d was calculated to determine effect size, using pairwise comparisons while omitting missing values. We also calculated a 95% confidence interval for each mean test score, as well as the proportion of participants who completed the training in less than 10 days, with its 95% confidence interval. We conducted a thematic analysis of participants’ optional comments to identify areas for improvement.
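A minimal sketch of this mixed-effects analysis, assuming hypothetical variable names (score, time, id) and simulated scores, since the study's actual analysis script and data are not reproduced here:

```r
library(lme4)

set.seed(42)
# Simulated long-format data standing in for the real test scores:
# one row per participant per test occasion (hypothetical values).
scores <- data.frame(
  id   = factor(rep(1:30, each = 3)),
  time = factor(rep(c("pretest", "posttest1", "posttest2"), times = 30),
                levels = c("pretest", "posttest1", "posttest2")),
  score = rnorm(90, mean = rep(c(4.5, 8.4, 6.8), times = 30), sd = 1)
)

# Random intercept per participant, fixed effect of test occasion.
# lmer() uses all available rows, so a participant with a missing
# posttest still contributes their remaining scores (MAR assumption).
fit <- lmer(score ~ time + (1 | id), data = scores)
fixef(fit)  # pretest mean plus the two posttest contrasts
```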
Results
Demographics
A total of 102 participants were enrolled in the study. Participant demographics can be found in Table 2. Two participants were excluded because prior training in the R programming language made them ineligible. Most participants were aged between 20 and 30 years (72/102, 70.6%) and identified as students (94/102, 92.2%), especially in medicine (64/94, 68.1%).
Table 2. Demographics of participants who underwent the training program in R language.
Evaluation of the training program
The proportion of participants who completed the training within 10 days was 84% (95% CI [77%–91%]). When asked how strongly they would recommend the training on a scale of 1 (would not recommend at all) to 5 (would recommend strongly), 73% of participants would recommend it (4–5), 22% were neutral (3), and 5% would not recommend it (1–2) (Table 3).
Table 3. Answers received to the following question after viewing the capsules: “On a scale of 1 to 5, how much would you recommend this training to your colleagues?” (n = 88).
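The reported completion interval is consistent with a normal-approximation (Wald) confidence interval for a proportion; a quick illustrative check in R, assuming 84 completers among the 100 analyzed participants:

```r
# 84 of 100 analyzed participants finished the training within 10 days.
p_hat <- 84 / 100
se    <- sqrt(p_hat * (1 - p_hat) / 100)       # standard error of the proportion
ci    <- p_hat + c(-1, 1) * qnorm(0.975) * se  # 95% Wald interval
round(ci, 2)  # 0.77 0.91, matching the reported 95% CI [77%-91%]
```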
The mean score (out of 10) increased from 4.5 (SD = 1.8, 95% CI [4.1–4.8]) at pretest to 8.4 (SD = 1.3, 95% CI [8.1–8.7]) at Posttest No. 1, a statistically significant difference (p < .001) with a large effect size of 2.5. At Posttest No. 2, the mean score remained significantly higher than at pretest, at 6.8 (SD = 1.3, 95% CI [6.4–7.2]) versus 4.5 (SD = 1.8, 95% CI [4.1–4.8]) (p < .001), with a large effect size of 1.3. The mean score decreased from 8.4 at Posttest No. 1 (95% CI [8.1–8.7]) to 6.8 at Posttest No. 2 (95% CI [6.4–7.2]) (p < .001), with a large effect size of 1.3 (Figure 1).

Figure 1. Comparison of mean scores (out of 10) obtained at pretest (n = 102), Posttest 1 (n = 88), and Posttest 2 (n = 50), with standard deviations, among health science participants who viewed an accelerated data science training course.
Participants who did not complete a test received up to five reminders. At Posttest No. 1, 14 participants were lost to follow-up (14% of total participants), and at Posttest No. 2, 52 participants were lost to follow-up (51% of total participants).
Overall, 73% of participants would recommend the training in their environment (score of 4–5 out of 5). A thematic analysis was conducted on participants’ optional comments to identify areas for improvement, and four main themes emerged. First, the order of the capsules could be improved, with many participants suggesting that capsules 5, 6, and 7 should precede capsule 3 for a better learning progression. Second, participants underlined audio distraction, indicating that they would prefer the capsules without background music. Third, participants expressed a desire for more exercises with answer keys to reinforce learning. Finally, some participants discussed capsule duration and recommended standardizing the length of the capsules, with one participant suggesting splitting longer capsules into two parts for better engagement. Several participants said they appreciated having a summary guide containing the main aspects covered in each capsule.
Discussion
To our knowledge, this is the first publicly available accelerated French-language R programming training program for health sciences trainees. We succeeded in demonstrating the feasibility of our training, with 84% of participants completing the course in less than 10 days in their own time (95% CI [77%–91%]). This illustrates that viewing self-taught capsules can be a good introduction to data science for the academic health science community. As previously reported, the strength of this format lies in concise videos, engaging participants with interactive exercises, and clearly demonstrating how the knowledge can be applied to their future projects. 20 Compared to existing time-intensive programming initiatives, our accelerated 10-day format provides a more accessible option for busy students. Despite its short and intensive nature, it achieves effective knowledge retention over time, reinforcing its efficiency as a condensed and impactful training model.
The mean test score increased from before to immediately after the training. With follow-up losses of only 14%, these results suggest that participants’ skills and knowledge improved greatly in the short term after viewing the training.
Similarly, the mean test score at three months posttraining remained higher than at pretest. However, the mean score fell from 8.4 at Posttest No. 1 (95% CI [8.1–8.7]) to 6.8 at Posttest No. 2 (95% CI [6.4–7.2]) (p < .001, Cohen's d = 1.3). There therefore appears to be good retention of knowledge in the medium term compared with baseline, although the knowledge acquired tends to diminish over time after viewing the capsules. This is in line with Ebbinghaus’ forgetting curve, which states that learned information decays over time if it is not reactivated.21 It is important to highlight the high number of losses to follow-up at Posttest No. 2 (51%), which affects the validity of the conclusions drawn.
The qualitative data collected show that participants appreciated the training, as most would recommend it. These data will be used in future work to continue improving the training (e.g., fine-tuning the background music, the length of each video, and the pace of the training).
This study has several limitations. Methodologically, the sample was primarily composed of medical students aged 20–30, which limits the generalizability of the results to other health disciplines and age groups. The snowball sampling method may have introduced selection bias, favoring participants already interested in digital health; this approach was chosen to enhance recruitment efficiency, but it inherently limits the generalizability of the findings to other populations. While the pre- and posttests were developed through expert consensus, they were not based on a validated questionnaire, as no standardized tool for assessing R programming proficiency was available in the literature at the time of the study. Because the pre- and posttests were administered in an unproctored online setting, we cannot exclude the possibility that some participants used external aids, which may have influenced their test performance. We also did not include a control group, which limits comparison with existing training approaches. Additionally, the 51% loss to follow-up at three months reduces the reliability of the long-term retention estimates. For instance, participants with greater motivation may have been more likely to complete Posttest No. 2, and participants who struggled with the material may have been more likely to be lost to follow-up, possibly leading to an overestimation of knowledge retention. Given the limited amount of sociodemographic data collected at recruitment, we were not able to use a multiple imputation model to address missing data. We instead used a mixed-effects model, under the assumption that data are missing at random, which is hard to verify in this context and not necessarily true.
Future studies should aim for greater participant diversity and implement strategies to minimize attrition in follow-up assessments. They should also aim to incorporate comparative or randomized designs to better evaluate the relative effectiveness of similar interventions. Given the relevance of statistical knowledge across all medical fields, this training can be easily adapted for a broader audience, including nurses, pharmacists, and allied health professionals with varying levels of digital literacy. As part of our expansion strategy, we plan to collaborate with faculty departments to increase awareness and accessibility of the training among a more diverse healthcare workforce.
To enhance long-term retention, initiatives such as certification upon completion or a long-term retention certification could motivate participants and encourage sustained engagement. Additionally, in-person workshops could provide structured reinforcement opportunities, fostering interaction and practical application of acquired skills.
Potential barriers to this training include access to a computer and a reliable internet connection, particularly in lower-resource settings. To address this, universities could provide on-campus computer labs or device lending programs to ensure equitable access to the training.
Conclusion
The positive reception of this program suggests a growing interest in, and need for, such initiatives in the increasingly data-driven medical field. The free, open-access, and flexible format of the training makes it easily scalable for addressing current gaps in data science education among healthcare trainees.
Our findings demonstrate the feasibility and effectiveness of integrating this training as a complement to university curricula for students and researchers in health. Most participants completed the training within a short time frame and demonstrated a significant improvement in their R programming skills.
This training serves as a stepping stone for broader efforts to integrate digital health and programming education into university curricula in the health sciences. Moving forward, collaboration with universities and research centers is essential to integrate this training into formal health science education, preparing students for clinical and academic research.
Supplemental Material
Supplemental material (sj-pdf-1-dhj-10.1177_20552076251343792, sj-pdf-2-dhj-10.1177_20552076251343792, sj-pdf-3-dhj-10.1177_20552076251343792, and sj-pdf-4-dhj-10.1177_20552076251343792) for “Developing and validating an accelerated online Canadian training in data science for the health sciences community” by Reda Goudrar, Maxine Joly-Chevrier, Lise De Cloedt, Géraldine Pettersen, Benoît Mâsse and Michaël Sauthier in DIGITAL HEALTH.
Acknowledgements
We would like to thank Simon LaRue, Dre Janie Coulombe, Guillaume Dubé, and Dre Masoumeh Sajedi for their help in developing the content of the capsules and presenting them.
Ethical considerations
This study was approved by the ethical committee of the CHU Sainte-Justine Research Center (2024–6020).
Consent to participate
Written informed consent to participate was obtained from all participants.
Author contributions
Conceptualization: MJC, RG, LDC, GP, BM, and MS; methodology: MJC, RG, LDC, GP, BM, and MS; formal analysis: MJC and RG; investigation: BM and MS; resources: BM and MS; data curation: RG, MJC, BM, and MS; writing—original draft preparation: MJC, RG, and MS; writing—reviewing and editing: MJC, RG, LDC, GP, BM, and MS; supervision: MS; funding acquisition: MJC, RG, and MS. All authors have read and agreed to the published version of this manuscript.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors acknowledge the financial support of the following organizations that enabled the development of the R training program: Canadian Medical Association, Canadian Federation of Medical Students, Fédération des Associations Étudiants du Campus de l’Université de Montréal, Association Santé Étudiante du Québec, and Fonds de Recherche du Québec en Santé.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability
The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.
Guarantor
Dr Michaël Sauthier is the corresponding author of the project.
Informed consent
Institutional ethics approval was obtained for the training program in R language (Institutional Review Board approbation number: 2024–6020), and all participants provided informed consent before study enrollment.
Reprint request
Michaël Sauthier, MD, MBI, PhD.
Supplemental material
Supplemental material for this article is available online.
References
