Abstract
Presentation of data is a major component of academic research. However, programming languages, computational tools, and methods for exploring and analyzing data can be time-consuming and frustrating to learn, and finding help with these stages of the broader research process can be daunting. In this work, we highlight the impacts that computational research support programs housed in library contexts can have in filling gaps in student, staff, and faculty research needs. The archival history of one such organization, Software and Services for Data Science (SSDS) in the Stanford University Cecil H. Green Library, is used to outline challenges faced by social sciences and humanities researchers from the 1980s to the present day. To complement this history, participation metrics from consulting services (1999–2021) and workshops (2000–2021) are presented along with responses to an updated workshop participant feedback form (n = 99), which further illustrate the profound impacts that these services can have in helping researchers succeed. Consulting and workshop metrics indicate that SSDS has supported at least 27,031 researchers between 1999 and 2021 (an average of more than 1175 per year). A t-test on the feedback form data indicates that participant knowledge increased by more than one scale point from workshop start to completion, a statistically significant gain. Results also indicate that despite our successes, many past challenges continue to present barriers regardless of exponential advances in computing, teaching, and learning—specifically around learning to access data and learning the software and tools to use it. We hope that our story helps other institutions understand how indispensable computational research support is within the library.
Introduction
The role of university libraries is to support the scholarship of their research communities through diverse acts of librarianship that help make resources available for study. This support exists for virtually all fields, their histories and literatures, main theoretical tenets, and data sources. A major avenue for researchers to contribute to legitimate, peer-reviewed bodies of knowledge is through data exploration, visualization, and analysis, whether for sampling, description, inference, or prediction about a specific dataset, topic, or issue. However, this stage of the research process is challenging because of the computational, technical, and methodological barriers it presents. Specifically, a researcher must learn computer and software skills to acquire, curate, wrangle, visualize, analyze, secure, and present data.
Formal data analysis tutoring was traditionally associated with departments of computer science, engineering, and statistics to support the inherent technical needs of those disciplines. Computer software for data analysis has not always been taught in “non-technical” academic disciplines and departments and is challenging to learn on one’s own. This is especially true for social scientists and humanists, who pose wide-ranging and highly specific questions for projects that frequently require data and statistical components. For example, Kenny’s (1982) pioneering work The Computation of Style introduced the statistical landscape for studying literature, humanities, and social sciences in the Information Age long before modern challenges posed by rapid technological advancements and (generally speaking) greater access to using computers (Boyd and Crawford, 2012; Castells, 1996; Davidson, 2010; Grimmer, 2015; Hanania, 2018; Rahm and Do, 2000; Varghese and Buyya, 2018).
Thus, many individual and departmental needs around data, computing, and software support catalyzed the creation of technical and interdisciplinary training teams. One of the earliest groups was the Carpentries (2022; https://carpentries.org/). Software Carpentry began in 1998 (followed by Data and Library Carpentries tracks) and consists of software and data science training for researchers that is now found in sessions all over the world, usually through university collaborations—including at the Stanford University Terman Engineering Library. The Carpentries have been crucial for developing pedagogical ideas and protocol-based approaches for community and inclusion, curriculum development, feedback, and teacher training that have been instrumental for the creation of other foundational initiatives such as the Raspberry Pi Foundation’s Teaching Programming in Schools (Waite and Sentance, 2021; https://www.raspberrypi.org/app/uploads/2021/11/Teaching-programming-in-schools-pedagogy-review-Raspberry-Pi-Foundation.pdf). These high-watermark institutions both inform and are informed by the profound challenges for empowering novice coders in their computational research pursuits (Black, 2006; Bubica and Boljat, 2014; de Raadt, 2008; Eckerdal, 2009; Eisner, 2003; Heintz et al., 2015; Horton, 2015; Iqbal Malik et al., 2021; Karvelas, 2019; Kross and Guo, 2019; Loksa et al., 2020; Nederbragt et al., 2020; Prather et al., 2018; Robins, 2019; Shi et al., 2018; Soloway, 1986; Vihavainen et al., 2011; von Vacano et al., 2020).
In academic settings, computational research support units might serve the entire campus or specific departments. Staff members sometimes liaise with their home departments to teach modularized lessons within formal course curricula. These groups are often successful because they purposefully seek to remove barriers at the intersection of researcher identity, subject knowledge, and technical skills in relaxed settings where learners receive hands-on training and can ask questions to learn in more individualized ways. While issues around computational education have long been known (Joni and Soloway, 1986), Dale’s (2002) unsettling—and impressively concise—summation of issues facing computer science education research highlighted a host of further concerns around planning, organization, administration, advertising and community, publication, and maintaining continuity. These concerns apply to computational research support units in libraries as well, specifically around changing technologies for data access and storage, compute power, and familiarity with computational software, methods, and statistics. Other issues such as staff turnover, funding, and training further complicate organizational coherence, collaboration with other library and campus units, and the construction of internal measurement instruments and data collection, which are necessary tools to improve planning, prototyping, and evaluation of teaching, consulting, and data-related services.
Unsurprisingly, university libraries and associated collections and data repositories have long been instrumental hubs for collaborating with and training researchers for working with data of all types in preparation for careers in academia and industry. The Inter-University Consortium for Political and Social Research (ICPSR, founded 1962; https://www.icpsr.umich.edu) and the United Kingdom Data Service (founded 1967; https://ukdataservice.ac.uk/) have been the gold standards for social science/behavioral archives. In the United States of America, Princeton University Library’s Data and Statistical Services (https://library.princeton.edu/dss) and initiatives at the University of California, San Diego Library (https://library.ucsd.edu/) and the Yale StatLab (https://marx.library.yale.edu/data-gis-statlab/statlab) have long provided data and statistical support. Other notable programs include those in libraries at Columbia University, Cornell, Duke, Emory, University of North Carolina, University of Michigan, the University of California, Berkeley (and other UC campuses), amongst others in the USA and around the world. See Gold (2010) and Yoon and Schultz (2017) (and references therein) for useful information.
We present a brief history of our own computational research support organization in Stanford University Libraries that has grappled with these challenges. Software and Services for Data Science (SSDS; https://ssds.stanford.edu/) is the public service point within the Stanford Libraries’ Center for Interdisciplinary Digital Research (https://cidr.stanford.edu/) that frequently collaborates to provide consultation and workshop services to students, faculty, and staff. Our focus is on the acquisition of social science data and the selection and use of quantitative (and until recently, qualitative) software and methods through workshop trainings, one-to-one consultation sessions, and other less formal connections that occur in-person, through emails and asynchronous learning materials, and more recently, on Zoom teleconferencing calls. SSDS specializes in software onboarding and introductory programming, data acquisition, wrangling, visualization, analysis, survey design, working with text data, and machine/deep learning.
It is hard to tell how original the SSDS service model is because our own history is patchy, not well organized, and not publicly available; making our history and participant statistics public is part of our motivation for this study. Comparison to other institutions’ service models is further complicated because they face the same challenges in curating and publicizing their histories that we do. Regardless of how unique SSDS’s history is, our service model focuses on a shared public space where consultations and workshop trainings occur and a small, lean staff that is diverse in its backgrounds and computational and methods expertise. Since its origins, SSDS staff structure has generally consisted of 1–2 (part- or full-time) data specialists who plan workshops and services, conduct outreach activities, and maintain the website. The data specialists have been formally trained librarians, a humanist, and a social scientist/humanist (EM, current Head of SSDS). A small group of around 6–8 graduate students (paid hourly) consult on data, tools, and software, design and teach workshops, and write asynchronous learning guides. In the event we cannot help, we refer consultees to our network of on-campus partners.
We present three sources of SSDS data to explore the impact of our services: (1) an annotated historical timeline, (2) over 20 years’ worth of consulting and workshop participation metrics, and (3) workshop feedback form responses (n = 99) collected since December 1, 2021, when the form was overhauled. Results indicate that SSDS computational research support services in Green Library have remained in demand, despite changes in tools, software, methods, online learning, and the global COVID-19 pandemic. Furthermore, issues that researchers attempted to address in the 1980s are still prevalent today, especially onboarding beginner researchers to software and methods and navigating large datasets and remote computing support. In conclusion, we discuss overarching trends, emphasize why university libraries should build and maintain well-integrated computational research units as part of their broader support for researchers, and provide recommendations for some of the challenges faced.
Data and methods
The SSDS archive
We explored the SSDS printed and written archival materials for the historical analysis of this research. A case study approach (Creswell, 2013) was used to analyze the SSDS physical archive, which dates from 1971 to May 2022. The archive contains much of SSDS’s internal documentation and is assumed to be representative of its activity and service history. Printed materials include old notes and emails, meeting agendas, invoices, teaching materials, learning guides, consulting and workshop registration and sign-in sheets, and yearly reports. Written materials include many notes about services, procedures, and attendance. Old newspaper clippings were also included when activities attracted enough attention to be publicized. The only digital items were a few emails and notes that were printed on paper and added to the physical archive. EM manually reviewed the materials and coded the historical categories, organization, and patterns. A matrix (Miles et al., 2013) was used to organize the archive entries into categories about computing, data, software, and methods support. The entries were then arranged on a timeline to explore inflection points and trends in social science computing in the library through time. Examples include needs assessments and reports, technological advancements, and new software documentation.
Consulting and workshop metrics
Participant data from consultations and workshops are presented to complement the historical perspective. These data were curated and analyzed from multiple ingestion points such as emails, appointments, drop-in hours, paper sign-in sheets, electronic sign-in kiosks, shared storage hard drives, notes, summary reports, and Zoom attendance reports. The consulting data cover academic years 1999–2021, while the workshop data cover 2000–2021. Stanford operates on the quarter system, with an academic year consisting of Fall (September to December), Winter (January to March), Spring (March to May), and Summer (June to September) quarters. A dot plot is used to illustrate the number of consulting and workshop participants through time.
The consulting dataset contains the following variables per academic year: total number of consultations, number of email and in-person consultations, and percent consultee affiliation, when available: graduate, undergraduate, faculty, staff, postdoctoral, or other (which includes visiting scholars/researchers, resident assistants, non-affiliated staff, etc.). The percent affiliation columns might not sum to 100% for some years because this information was not always captured and instead was often included in the SSDS physical archive as a stray note or scribble. Due to the COVID-19 pandemic and the health risks of meeting in person, consultations for academic years 2020–2022 were held over Zoom rather than in person, except during Winter and Spring Quarters 2022, when in-person consultations resumed. The 2021–2022 data only represent Fall, Winter, and part of Spring Quarter, thus do not represent the entire year, and are presented separately at the end of the Results section.
The workshop dataset reports the following variables per academic year: total number of participants, number of workshops, average number of participants per workshop, and percent affiliation, when available: graduate, undergraduate, faculty, staff, postdoctoral, or other (which includes visiting scholars/researchers, resident assistants, non-affiliated staff, etc.). Like the consulting dataset, the percent affiliation columns might not sum to 100% because this information was not always recorded. The attrition rate (percentage of those who registered for a workshop but did not participate) for years 2020–2022 is 51%. In cases where only registrant information was present, the registrant count was multiplied by 0.49 to estimate actual participation in a given workshop. Also like the consulting data, the 2021–2022 data only represent Fall, Winter, and part of Spring Quarter, thus do not represent the entire year, and are also presented separately at the end of the Results section. Finally, due to the COVID-19 pandemic and the risks of meeting in person, workshops for academic years 2020–2022 were held over Zoom rather than in person, except for Spring Quarter 2022, when in-person/Zoom hybrid formats are being piloted.
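Where only registrant counts survived in the archive, participation was back-estimated from the observed attrition rate. A minimal sketch of that adjustment follows; the registrant count below is purely illustrative.

```python
# Back-estimate workshop participation where only registration counts exist.
# Uses the 51% attrition rate observed for 2020-2022; the registrant count is illustrative.
ATTRITION_RATE = 0.51                 # share of registrants who did not attend
registrants = 32                      # hypothetical registrant count for one workshop
estimated_participants = round(registrants * (1 - ATTRITION_RATE))
print(estimated_participants)         # -> 16
```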
Workshop feedback form responses
To provide a glimpse of how services impact learning computational software, tools, and methods, we overhauled the workshop feedback form on December 1, 2021, to gain a better understanding of workshop participant attitudes during the COVID-19 pandemic and the transition back to in-person/hybrid learning. Qualtrics is used to distribute the form, and the following workshop feedback form metrics are presented: mean item response, mean standard deviation, and Cronbach’s alphas with confidence intervals. The arithmetic mean of the response (x̄) is simply the sum of the responses divided by the number of responses. The mean standard deviation (σ) indicates the amount of dispersion in the responses around the mean. N is the number of responses. Cronbach’s alpha (α) ranges from 0 to 1 and indicates the internal consistency/reliability of a set of questions in a block, or more simply how related the questions in a block are to each other; values closer to 1 indicate greater reliability. Alpha was calculated for the familiarity, content, instructor, and environment question blocks. The following statements and point scales are used in the blocks presented below:
Pace of the workshop: 1 = very slow, 2 = a little slow, 3 = just right, 4 = a little fast, 5 = very fast
Familiarity with the presented workshop topic and/or tool before versus after the workshop: 1 = not at all familiar, 2 = a little familiar, 3 = familiar, 4 = very familiar, 5 = expert
The workshop content: A) I learned a new skill or skills, B) I learned something I can apply directly to my own research, and C) I learned something to help in a class: 1 = strongly disagree, 2 = somewhat disagree, 3 = neither agree nor disagree, 4 = somewhat agree, 5 = strongly agree
The workshop instructor: A) Clearly explained the workshop goals, B) explained the topic in a way that I could understand, and C) took time to answer my questions: 1 = strongly disagree, 2 = somewhat disagree, 3 = neither agree nor disagree, 4 = somewhat agree, 5 = strongly agree
The workshop environment: A) Considered my needs, B) made me feel welcome, and C) was appropriate for asking questions: 1 = strongly disagree, 2 = somewhat disagree, 3 = neither agree nor disagree, 4 = somewhat agree, 5 = strongly agree
Beginning April 1, 2022, we have also asked: did the hybrid learning environment meet your needs? 0 = no, 1 = yes
To boost the response rate, we currently do not publicly report participant affiliation (i.e., undergraduate student, graduate student, faculty member, staff member, etc.) from the workshop feedback form. Verbal feedback indicated that this item can be identifying when groups are small, an apprehension expressed by some undergraduate students that has hindered their participation in workshops alongside graduate students, faculty, and staff. A two-sample, two-tailed homoscedastic t-test was used to test for statistically significant differences in participant familiarity with the presented topic before versus after a workshop.
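As a minimal illustrative sketch (not our production analysis pipeline), the metrics above could be computed in Python roughly as follows, assuming the Qualtrics export is a CSV with one row per respondent and hypothetical column names for the familiarity items and the content block:

```python
import pandas as pd
from scipy import stats

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for one question block (rows = respondents, columns = items)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical file and column names, for illustration only.
responses = pd.read_csv("feedback_responses.csv")
content_block = responses[["content_new_skill", "content_own_research", "content_class"]]

print("Mean item responses:", content_block.mean().round(2).to_dict())
print("Item standard deviations:", content_block.std(ddof=1).round(2).to_dict())
print("Cronbach's alpha (content block):", round(cronbach_alpha(content_block), 2))

# Two-sample, two-tailed t-test on familiarity before vs. after a workshop
# (equal_var=True matches the homoscedastic test described above).
t, p = stats.ttest_ind(responses["familiarity_after"], responses["familiarity_before"],
                       equal_var=True)
print(f"t = {t:.2f}, p = {p:.4f}")
```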
Historical context of Software and Services for Data Science (SSDS) at Stanford University
Early years
The earliest Social Science Computing at Stanford: Report to the Vice Provost for Academic Computing and Information Systems and the Dean of the School of Humanities and Sciences in the SSDS archives was dated March 1984. For this report, Stanford’s Instruction and Research Information Systems Division surveyed faculty members in Fall 1983 through interviews and questionnaires to understand substantive themes around computer use by social scientists on campus, and it identified two major themes. First, social scientists overwhelmingly used computers to perform text processing and statistical analysis of their datasets and faced myriad challenges when trying to manage databases, visualize data, and use data for instructional purposes. Second, social scientists were frustrated with the high financial cost of using microcomputers, computer systems, and mainframes for their varying departmental needs.
By 1987–1988, two IBM computers running the VM/CMS operating system were made available at Stanford for use by researchers. The first machine, named Watson (2 CPUs; full specifications uncertain at the time of writing), was used for statistical processing, mainly by social science researchers in the School of Humanities and Sciences (H&S) (40% of users), Academic Data Services (ADS, 12.5%; discussed below), the Institute for Mathematical Studies in the Social Sciences (25%), the Graduate School of Education (10%), and Law (5%). The second, named Oberon (also 2 CPUs; specifications also uncertain), was used for numerical processing by Chemistry (50% of users), various vector processing tasks (25%), and Law (12.5%). Researchers from Engineering, Earth Sciences, Medicine, and the Graduate School of Business became more frequent users as well. Besides the hardware itself, administrative challenges revolved around how to combine users with similar needs, problems, and applications while allowing for maximum resource allocation in environments that would minimize software maintenance and still reserve space for one-off users not affiliated with main departments or projects. Two other servers, named Power and Wisdom, were also made available for faculty and student use.
SSDS’s history began a few years later, following the 1989 Loma Prieta earthquake (https://news.stanford.edu/features/2014/loma-prieta/). During the interior renovation of the earthquake-damaged Cecil H. Green Library west wing, plans to expand library services were also underway. Responding in large part to the growing trend on campus toward more distributed computing, the planning group for the newly formed Social Science Resource Center (SSRC) within Stanford University Libraries considered how the library might provide better access to, and help with, finding, accessing, and using social science data. Their proposal called for a service for students, faculty, and staff that would incorporate library resources, statistical software, and computational expertise to support social science and humanities research.
However, Academic Data Services (ADS) was actually the forerunner to SSDS. It was officially founded in 1971, although formal documentation does not appear in the SSDS archive until 1986. ADS was a joint collaboration between the SSRC and the Distributed Computing Group from Information Technology Systems and Services (now known as University IT). ADS provided consultation help for accessing, downloading, and wrangling data, published user guides, and troubleshot SAS and SPSS on Unix. Two other complementary and colocated public services, Social Sciences Data Service (SSDS) and Statistical Software Support (SSS), opened in Fall 1999, although the group behind these two services formed in 1992. During the first 3 years these services grew steadily, and in 2003 SSS merged with SSDS to form Social Science Data and Software (keeping the original SSDS acronym). The change underscored a unique collaboration of services and resources by establishing a single name and point of contact for users. This included the website https://ssds.stanford.edu, with the earliest Internet Archive capture of our home page dating to November 25, 2003, although the actual origin date might be even earlier. These collaborators collected and cataloged numeric datasets (and their documentation) in machine-readable format (e.g. data tapes), and staff members provided assistance to the Stanford community in finding computer datasets via the Libraries’ online catalog (“Socrates”, now called SearchWorks). Researchers could request access to Stanford-owned datasets on tape via ADS and mount them on the Andrew File System (AFS) Stanford-based data storage servers. Statistical consultants helped consultees select and use quantitative software to analyze data and provided workshop trainings.
Between 1986 and 1989, ADS received 1230 total requests (44% Political Science, 36% Economics, and 13% Sociology, with the remainder from other departments) and averaged one request per day in 1988 and two requests per day in 1995. Statistical consulting under Libraries and Information Resources (https://news.stanford.edu/pr/94/940809Arc4164.html) reported in Fall Quarter 1993 that 54% of inquiries pertained to using the SAS or SPSS packages on Unix, 34% dealt with application issues on PC and Mac (especially SAS on PC), and the remaining questions pertained to the Bio-Medical Data, Time Series Processor, MiniTab, and Math software packages. Between Fall Quarter 1995 and Winter Quarter 1997, a total of 730 data tapes were ordered for 74 users (52 graduate students, 8 undergraduate students, 7 faculty, and 7 unknown). Students were mostly from the Psychiatry (22%), Education (19%), and Sociology (16%) departments, along with unknown percentages from Economics, Economic Engineering Systems, Anthropology, Industrial Engineering, Medicine, and Business.
The Stanford University Library Academic Information Resources (SULAIR) was formed in 1991. SULAIR was a consortium of staff from the Information Center, Digital Library Systems and Services (DLSS), Branner GIS Library, Cubberley Education Library, Residential and Library Clusters, Academic Technology Specialists, the Integrated Data and Statistical Lab, and Communication and Documentation specialists. The Social Sciences Research Group was founded in 1992 to perform statistical applications consulting. Most users of these services between 1995 and 1996 were from the departments of Economics, Business, Political Science, Sociology, Health Research/Policy, Education, Law, and the University of California, San Diego. In 1999 the Collection Redeployment Plan requisitioned Federal, California, and International government documents on folio and microfiche. Between Fall 2000 and Spring 2002, 734 datasets were requested (out of 3359 available) on the Stanford Ops.db server for acquiring tapes and print codebooks from organizations such as ICPSR. Sixty-three percent of all requests came from students, many working as research assistants for faculty members.
The 2000s
The 2001 Computing Infrastructure and Academic Needs Report at Stanford University provided an updated assessment for several aspects related to social science computing. Hardware was to be updated on 3-year cycles and provide high performance computing clusters and multiprocessor arrays for students. Additional needs included the conversion and building of classrooms to accommodate rapidly changing technologies. Software was to be provided and supported with new market releases, along with commercial Internet access in the evolving campus-wide infrastructure, the updating and installation of high-speed wall ethernet ports in offices, labs, cafes, and common spaces, and facilitation of high speed Internet access to faculty homes. These upgrades required staff support in the form of computing and database professionals, expansion of the Academic Technology Specialist program (departmental specialists who utilize computers, software, and other technology to enhance departmental-specific teaching and research), expansion of the Software Licensing Office, and development of training programs and tools for faculty, staff, and students. At this time there was also a need to increase central and localized security protocols, provide a single point of access to resources, expand IT and curriculum resources to Stanford international campuses, develop a general endowment fund to support the prevailing initiatives, and kickstart an innovation fund to ensure that Stanford stayed on the cutting edge of social science computing heading into the future. Many users in 2001–2002 were from Human Biology, the Graduate School of Education, Sociology, the Graduate School of Business, Economics, International Relations, and Medicine.
By 2002, SSDS workshop and consulting topics included service overviews, how to access and download data, and support for the most popular quantitative and qualitative software such as: Microsoft Office ‘98, Adobe PageMaker, SAS, SPSS, Stata, S and S-PLUS, Stat/Transfer, MiniTab, Amos, DeltaGraph, Table Curve 3D, ArcGIS, ArcView 3.2, S+ for ArcView GIS, S+ Spatial Stats, SpaceStat, MapInfo, Adobe Flash, EndNote, NVivo, ATLAS.ti, Sudaan, HLM, MLwiN, and CVINET. Common support requests were made for help with data wrangling, text processing, database searching and management, survey design and text analysis, social network analysis, geospatial mapping and analysis, statistical procedures, and instructional uses for data.
From Fall 1999 through Summer 2012, SSDS reported 16,234 total in-person and email consultations (48% graduate students, 42% undergraduates, and 10% faculty, staff, visiting scholars, and others). Between Fall 1998 and Summer 2012, SSDS provided more than 257 workshops that covered social science data resources, data discovery and download strategies, and quantitative and qualitative software for beginning and advanced users, reaching at least 2660 students, faculty, staff, and postdocs (an average of more than 10 participants per workshop during this time period). SSDS and the Library also created the Social Science Data Collection, which allowed Stanford researchers to curate, redistribute, and archive their projects. The earliest evidence of feedback form distribution in the archives is from 2002: a staff email discussing which questions to send to Honors College workshop participants to improve delivery of services and to learn what topics participants hoped to see in future offerings, questions that are still on the feedback form to this day.
Many consultation topics at this time dealt with accessing numeric datasets on tape, CD-ROM, and diskette. The SSDS archive also included documentation to help researchers access email and navigate the Internet. Furthermore, SSDS continued to support ICPSR, the Roper Center for Public Opinion Research, the Association of Public Data Users, the Data Documentation Initiative, and the Data Extraction Web Interface (DEWI). ICPSR was immensely popular, with a total of 68,920 data files downloaded between 1998 and 2005. The Roper Express service was also launched, and in Fall Quarter 2004, 20 datasets and 25 codebooks were downloaded. In Winter Quarter 2005–2006 alone, Stanford users downloaded a total of 38,665 data files. DEWI, launched in 2003, was a well-integrated web-based data search and extraction tool created by the Jonsson Library of Government Documents and ADS for accessing social science numeric data for research and teaching. It contained 156 datasets by 2005 and provided an easy-to-use resource for instructors teaching courses that used social science methods via secondary survey data, as well as a valuable discovery tool for research. DEWI outreach occurred in individual consultations, workshops, and course sessions in Sociology, Political Science, and Education. Other special sessions provided help for honors students in the Methods of Analysis Program to prepare them for summer data collection work, especially regarding survey design, text analysis, tips for data entry, and the quantitative and qualitative software available at Stanford. The majority of school affiliations of SSDS participants between 2002 and 2005 were Humanities and Sciences (77%), Graduate School of Education (16%), and Other (7%), which consisted of researchers from Economics, Political Science, the Graduate School of Business, Law, Sociology, Anthropology, Medicine, International Relations, Psychology, Public Policy, and Human Biology.
In 2005, four workshops were standardized and offered every quarter: Introduction to SSDS Data and Software Services and Resources, Choosing Quantitative Software for Research, Choosing Software for Qualitative Research, and Finding and Getting Data for Research and Instruction. Other services offered included software troubleshooting and tips for data entry, research planning and design, and writing. At this time, SSDS’s public teaching and consulting space, the Velma Denning Room (VDR), housed nine computers, 2085 non-circulating volumes (mostly software codebooks), 809 dataset codebook titles, 440 software manuals and texts on statistics and advanced methods, and 504 CD-ROM, DVD, and diskette titles, and had distributed over 3000 asynchronous learning guides across campus. Researcher data were commonly stored on CD-ROMs, USB drives, diskettes, and the AFS server.
Examples of collaborations during this period included those with subject matter specialists (traditionally referred to as librarians, selectors, or bibliographers) who worked with the Honors College, Undergraduate Advising and Program, Disability Resource Center, and other campus departments and committees. Another successful collaboration was with the Institute for Research in the Social Sciences (IRiSS) to identify the web survey software needs of Stanford researchers through piloting of the Opinio web survey package to better construct, administer, and analyze web-based surveys. The resulting Survey and Data Analysis project was initiated between SULAIR and the University of California, Berkeley, Survey Research Center (SRC) and was eventually hosted on a Solaris server supported by DLSS. In 2007, SSDS, SULAIR, and IRiSS spearheaded the Social Science Needs Assessment Project to identify the current research, teaching, and learning needs of social science programs at Stanford University and hosted the 34th annual International Association for Social Science Information Services and Technology (IASSIST) conference the following year (http://web.archive.org/web/20090524005219/http://stanford.edu/group/ADS/cgi-bin/drupal/). IASSIST is a professional membership organization composed of data librarians, archivists, producers, researchers, and applications developers throughout the world, whose members work in a variety of settings and often serve as directors of libraries, national data archives, statistical agencies, research centers, academic departments, government institutions, and non-profit organizations. The local arrangements committee was staffed by Ron Nakao, the UC Berkeley SRC, and the UC Berkeley Library.
Between 2006 and 2009, researchers downloaded 3745 datasets and 3686 documentation files from the Roper Center. Between 2007 and 2009, 3415 projects were undertaken on 17,222 datasets stored in 58,381 individual files from ICPSR. Pedagogical trainings to improve SSDS workshop instruction were first implemented in 2008, while 2009 marked the appearance of R programming language support in the SSDS archive, a trend that would grow sharply into the modern era. Other supported topics included research planning, experimental design, Microsoft Excel, statistical procedures, geographic information software, NVivo, using Stata and R on Stanford remote servers, and asynchronous learning via the printed guides. There was also a noticeable need to improve reporting practices, and a Drupal server upgrade glitch caused the loss of a significant proportion of consulting log data. From 2010 to 2012, the Roper Center reported 11,592 queries for dataset and documentation downloads, RoperExplorer Table Views, and iPOLL + Tab views. Like years prior, SSDS mainly continued to support SPSS, SAS, Stata, NVivo, ATLAS.ti, survey research design, text preprocessing and analysis, geospatial mapping, network analysis, and statistical procedures. Similar trends continued through 2012, but with an increasing number of requests for R support.
The future
The years 2012–2017 marked a shift toward instruction and consultation support for Stata and R (and, to an increasing extent, Python). SSDS also began tracking consultations via the SSDS website consult log in 2012. As of 2017, the Velma Denning Room was home to 1159 books/volumes and 5083 data disks and CD-ROMs, and since then we have mainly focused on R, Stata, and Python support for researchers querying APIs to access “big” data, data wrangling, visualization, and analysis, survey design, text preprocessing and analysis, and various forms of machine and deep learning. The year 2021 marked another name change, as SSDS became Software and Services for Data Science (luckily still using the same abbreviation!). Traditionally in-person consultations and workshops moved to a Zoom format during 2020–2021 because of the COVID-19 pandemic, with hybrid workshop experimentation currently underway in Spring Quarter 2022. Requests for Python programming language support are rapidly increasing due to its generalized structure and strong data science capabilities. Our current standardized workshop schedule consists of R and Python trainings for introductory programming, data wrangling, data visualization, introduction to text analysis, and introduction to machine learning.
Part of SSDS’s goals is to not only consult with our patrons on research software and training, but also to highlight the growing data collections of the Stanford Libraries. A natural partnership is one between computational research support units and other internal library staff, such as our bibliographers and subject specialists, who are uniquely positioned to provide the initial consultations with researchers regarding the identification of information resources pertinent to their work.
As patrons move from identification of the data resources, to access and exploration, to understanding how to use the data in their analysis, there is a natural liaising process that ultimately brings many patrons to SSDS for assistance. Through this process, the researchers may be exposed to some of the Libraries’ data storage and exploration tools including our primary catalog SearchWorks, the Stanford Digital Repository, and the Stanford Data Farm (https://redivis.com/Stanford). Within these systems, patrons can find collections in machine-readable formats such as the Washington Post and New York Times historical text archives, the Gallup World Poll, as well as various CoreLogic and L2 datasets. These collections are growing rapidly and we expect datasets to be a very large part of our collections and curation work as Stanford University continues to support data science and data-intensive research more holistically within the Library.
If co-consulting between librarians and computational research support groups like SSDS can happen in the early stages of a consultee’s research, where data acquisition and computational planning overlap, there arises an opportunity for all parties involved—the consultee, the librarian, and the computational research support consultant—to boost their data literacy. However, the term “data literacy” carries with it an inherent technical component, beyond the data itself, that is yet another hurdle to accessing and using data. SSDS’s definition of data literacy is flexible and evolving but aligns mostly with the definition provided by Bonikowska et al. (2019, cited in Bauder, 2021: 8): “. . .skills necessary to access data, manipulate them, evaluate their quality, conduct analysis, interpret the results, and (in most frameworks) use data ethically.” We would add to this definition the need to know how to explore, visualize, secure, and present data. It can be argued that higher education is failing its students through inadequate training around data, research, and preparation for the modern job market. To enact the changes needed to make all students, faculty, and staff at all universities data literate, however, there need to be sweeping changes at the administrative, curricular, and faculty training levels to ensure that students are prepared to find, use, and think critically about data for adequate career preparation, as an increasing number of jobs have—or will soon have—data and technical components. Impacting students at the undergraduate level is one of the ways to make the most immediate change (Burress et al., 2021; Douglas et al., 2021).
Quantitative Results
Quantitative results show that SSDS has impacted at least 27,031 researchers between 1999 and 2021 through 22,366 consultations (an average of more than 1000 consultations per year) and 4665 workshop participants (an average of 222 participants per year). Graduate students make up roughly half of these consultations, showing how critical computational research support is to the success of their presentations, posters, theses, dissertations, and publications. Undergraduate students have historically followed in a close second (and even surpassed graduate students in some years) for help with capstone and other research projects, but their engagement has tapered off a little in recent years. Staff are the next most represented group and generally seek help with professional academic projects related directly to the University and/or their own careers. Faculty and postdocs comprise a smaller percentage of services and generally contact us for help with specific aspects of nuanced, highly engaged grant- and contract-funded projects (and sometimes do so through their graduate students and postdocs). Consulting metrics are presented in Table 1 and workshop participation metrics are shown in Table 2. Consulting, workshop, and total participation numbers are illustrated as a dot plot in Figure 1.
Consulting metrics by academic year 1999–2021. Data are organized by number of in person and email consultations, total number of consultations, and percent affiliation by graduate student, undergraduate student, faculty, staff, postdocs, and other.
Workshop metrics by academic year 2000–2021. Data are organized by number of workshops, total number of participants, average attendance per workshop, and percent affiliation by graduate student, undergraduate student, faculty, staff, postdocs, and other.

Dot plot showing number of participants (y-axis) by academic year (x-axis). For each year, the × symbol represents consulting participation, the ○ symbol represents workshop participation, and the • symbol represents the total sum of consulting plus workshop participation. No workshop data were present for academic year 1999–2000, thus the symbol is not marked on the figure.
Workshop feedback form results are presented in Table 3 and indicate that participants find the pace of workshops to be approximately just right (x̄ = 3.26, σ = 0.66). Participants report an increase of over one scale point in familiarity with a topic from before taking a workshop (x̄ = 1.78, σ = 0.70) to after (x̄ = 2.81, σ = 0.73), a statistically significant difference as identified by the t-test (p < 0.001, t = 9.92, df = 176.8). Responses indicate high marks for questions related to content: I learned a new skill or skills (x̄ = 4.21, σ = 0.93), I learned something I can apply directly to my research (x̄ = 4.22, σ = 0.91), and I learned something to help me in a class (x̄ = 3.86, σ = 0.97). Responses about the instructor were very high: the instructor clearly explained the workshop goals (x̄ = 4.67, σ = 0.79), explained the topic in a way I could understand (x̄ = 4.72, σ = 0.61), and took time to answer questions (x̄ = 4.70, σ = 0.76). Participants gave additional high marks for the workshop environment: it considered my needs (x̄ = 4.52, σ = 0.83), made me feel welcome (x̄ = 4.76, σ = 0.63), and was appropriate for asking questions (x̄ = 4.76, σ = 0.68). Although the question was added late in the academic year, 100% of queried participants (n = 43) indicated that the hybrid in-person/Zoom learning environment met their needs.
Workshop feedback form metrics since implementation of the updated form on December 1, 2021.
x̄ = mean response, σ = mean standard deviation, n = number of responses, Alpha = Cronbach’s alpha, Lower = lower bound of the 95% confidence interval for alpha, Upper = upper bound of the 95% confidence interval for alpha. Refer to the Data and Methods section to review the items and response scales.
As of May 9, 2022, we report an additional 222 consultations and 223 participants (from 24 workshops; average of 9.33 participants per workshop), with 139 days remaining in the 2021–2022 academic year.
Discussion
The archival and data histories of SSDS presented here demonstrate the far-reaching impacts that a computational research support unit housed in a university library can have on its researchers. Historically, fluctuations in participation are related to staffing availability: years with more staff can serve more researchers because more consultation hours and workshops can be scheduled and more expertise is available. Graduate students are our main constituents, who undoubtedly are trying to gain research experience through completion of their theses, dissertations, and other projects in preparation for the job market. Despite rapid advances in computing power, data storage, and online learning, the challenges identified nearly 40 years ago still resonate with many researchers today. They still need to learn computers, software, programming languages, tools, and methods to acquire, preprocess, visualize, analyze, secure, and present data. However, as dataset sizes continue to grow at exponential rates, researchers today, like those in the early 1980s, must still navigate issues of inadequate memory, data transfer across systems, data storage, and utilization of remote computing solutions and their cost structures. Importantly, onboarding beginners to a programming language and teaching them methods for visualization and analysis to accomplish research-related tasks remain in high demand, and training others through instruction and consultation remains equally challenging.
SSDS’s service model uses workshops, consultations, and professional development opportunities to upskill and reskill our community by meeting researchers where they are, removing barriers to entry, and democratizing the computational learning process in a way that helps individuals become self-learners on their own data literacy journeys. One of the most engaging aspects of using data literacy as a foundation for all of our interactions is how, in the hands of a competent instructor, consultant, or mentor, data skills can be shown to be transferable to other topics, issues, and problems because of the contextual emphasis on accomplishing a specific task. Although interpretation of results depends on domain-specific knowledge (which takes time and a lot of close reading to acquire), learning the basic programmatic and theoretical tenets of how data are stored, accessed, wrangled, explored, visualized, and analyzed is arguably the most valuable and transferable skill set for the modern job market. This is absolutely necessary in the 21st century, as sustainability science (Kates et al., 2001) is becoming a “new” major scientific focus to protect planet Earth and its inhabitants as neoliberal policies continue to decimate humanity and natural environments, particularly through devaluation of the lives of the most vulnerable populations. While it is challenging to interpret departmental breakdowns because microlevel data are not available for all years and services, it is clear that SSDS has provided—and continues to provide—invaluable services for patrons from the Humanities and Sciences, the foundation of what is often referred to disparagingly as a “liberal arts education,” spanning the technification of disciplines from Urban Studies to International Relations, Languages, Theater and Performance Studies, Chemistry, Biology, and Statistics. This could also be interpreted to suggest that the Library does not inherently serve more or less “scientific” disciplines but is a home to all, since technical barriers to working with data must be overcome everywhere and the technological requirements for using data have permeated all the disciplines and fields of study needed to solve today’s most challenging problems.
Figure 1 shows that consultation and workshop numbers were generally stable from 1999 until 2011, when a variety of online learning platforms launched: Khan Academy in 2008, Udemy in 2010, Udacity in 2011, Coursera and edX in 2012, and DataCamp in 2013, among others, that frequently offer basic coding skills training for free with more features and certifications available at cost. Importantly, Stack Overflow was also founded in 2008 and remains the world’s premier question and answer website for programming-related questions: someone asks a question in the form of a reproducible example, the community provides answers and comments, and the best answer gets upvoted as the accepted solution. Stack Overflow has become a ubiquitous part of the self-directed learning process, where it is generally accepted protocol to search how to perform a data-related task and cobble together a solution to the task at hand. This has interestingly caused Joni and Soloway’s (1986) assertion to come full circle again—that just because code works does not mean it is readable, interpretable, or that it provides the correct output. We then see a rebound in consulting and workshop numbers around 2017, perhaps as researchers began to realize the difficulties of copy/paste learning and that they should still rely on communication with experts to get their questions fully answered. However, this rebound back to early-2000s numbers was upended by the COVID-19 global pandemic (The World University Rankings, n.d.; DOE OCR, 2021; Godber and Atkins, 2021). Results for 2021–2022 might show signs of a return to pre-pandemic numbers but are incomplete at the time of this writing.
These trends speak to a few important points about community. First, we have come to understand that many people do not care for online learning when they know they can experience the social, in-person aspects of learning as part of their tuition costs. Feedback indicates that these in-person, human interactions are invaluable for supporting researchers not just with their current problem or issue, but also through the many spin-off conversations that can be had. This often leads to professional development opportunities that might not otherwise be possible in terms of graduate school and job application coaching, networking, research paper acknowledgments and authorships, and interpersonal relationships. Second, remote learning is clearly here to stay, and universities would be wise to invest sufficient funding in developing online learning programs, especially for computational support units within libraries. Many students are hesitant to rejoin in-person activities due to potential health concerns and the unaffordable costs of living, transportation, and child care that can complicate attending events in person. One big challenge that remains is how to recreate, in online settings where engagement must still be built, the professional development experiences that tend to arise more organically in person. Thus, these challenges present new opportunities for solving issues of learning in the modern world in ways that might exist outside of the traditional tuition structure, such as microcredential certifications, help with school and job applications, interview practice, and a host of other professional development opportunities that can augment regular, research-focused workshop and consultation sessions by placing research in its broader professional contexts.
Challenges and solutions for organizational data collection and use
Beyond challenges posed by funding, organizing, planning, coordinating, and implementing services, and balancing in-person, online, and hybrid learning, computational research support units also face challenges in collecting basic data about those who use their services. Such data can help clarify who is being served and for what purposes, and also provide a data-driven perspective to identify old and new needs across campus and ensure that services remain cutting-edge. First, determining how data are to be collected can be a major point of confusion. It is therefore essential to streamline data ingestion points across services such as the organizational website, email addresses, Google and Qualtrics Forms, events webpages, paper forms, and other registration systems. These data run the risk of growing stale if they live in disparate locations that are not aggregated into a central location. A dedicated staff member must be responsible for the aggregation, analysis, and dissemination of results through reports and presentations to communicate the value of research support services. This person must also wrangle the challenges of deduplicating and coalescing multiple ingestion points and create and follow protocols to prevent the loss of additional service metadata/contextual information. As with our own data, failures here can lead to underrepresentation in the datasets and complicate interpretation of the metrics that are reported and acted upon. This missingness caused by failure to capture data can stem from many sources, including staff forgetting to distribute paper or electronic sign-in and feedback forms, forgetting to enter paper forms electronically, consultations that are scheduled through staff email addresses, and so on. Additionally, the simple averaging of workshop feedback scores might be less appropriate than something like a Rasch model for analyzing feedback (Wilson, 2004).
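For illustration, a minimal sketch of what aggregating and deduplicating several ingestion points might look like with pandas; the file and column names are hypothetical, and the matching rule (same email, event, and date) is only one possible convention:

```python
import pandas as pd

# Hypothetical exports from three ingestion points, each with email, event, and date columns.
qualtrics = pd.read_csv("qualtrics_registrations.csv")
libcal = pd.read_csv("libcal_appointments.csv")
signin = pd.read_csv("paper_signin_transcribed.csv")

combined = pd.concat([qualtrics, libcal, signin], ignore_index=True)

# Normalize fields first so trivial differences do not create spurious duplicates.
combined["email"] = combined["email"].str.strip().str.lower()
combined["date"] = pd.to_datetime(combined["date"], errors="coerce")

# Treat the same person at the same event on the same day as one participation record.
deduplicated = combined.drop_duplicates(subset=["email", "event", "date"])

# Simple yearly totals for reporting.
print(deduplicated.groupby(deduplicated["date"].dt.year).size())
```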
Second, there is also the issue of determining what counts as a consultation versus a contact. While a contact might be someone asking when a workshop will occur or how to sign up for a consultation, we count as a consultation anything that directly helps move a researcher’s project forward, even if it happens through a brief informal conversation or email. It does not have to be a formal one-to-one consultation meeting, nor is there a minimum time limit for it to count, as long as it legitimately solves a problem directly related to the research at hand. Third, advertising is a big challenge. People have told us things like, “I just found out about SSDS! If I had known about your services, I could have finished my degree sooner and saved a year of student loans.” This is heartbreaking to hear and motivates us to keep finding creative ways to advertise our services and convey that our support services work and exist in non-judgmental and anti-bro-culture atmospheres. Finally, it is challenging to remain consistent on the above points while building an evolving community of practice. This can be accomplished through staff communications, trainings, and events. Library-adopted tools such as Slack and Springshare-powered LibCal have proved helpful for coordinating help during the COVID-19 pandemic, as coworking in the same space became impossible and calendars needed to be synced.
Solutions appear easy but are difficult to implement in practice. First, planning and organization of services should be kept as simple as possible. Service capacity should depend on the skills and availability of staff and students to teach and consult. The further in advance that schedules can be set, the more regularly and widely they can be advertised and, hopefully, attended. Broad computational and data needs are difficult to gauge, but sending out a wide-ranging survey or talking to a few key faculty members can provide a well-informed starting point; services can then be matched to campus needs as they become known. Ingestion points for contacts, registration, and feedback should be minimal, ideally beginning through a single point such as a simple static website with one or very few forms that track and store information in shared spreadsheets. Software such as Drupal, Qualtrics, Microsoft Outlook Mail, and Google Mail makes it easy to track this information and send out automated reminders prior to workshops and consultations, and afterward for feedback surveys. The administrator of these materials can track registration for workshops and assign consultations if they are not claimed by staff through a predetermined system. Staff can even protect their personal email addresses through use of a shared inbox for email replies if desired. Although paper signs posted on doors and in hallways might seem useful, they require manual updating and can easily cause confusion when they are forgotten about while information on websites and digital forms changes.
Staff should be trained in at least minimal pedagogy for instruction and consulting. When possible, these pedagogies can be fundamentally shaped by university centers for teaching and learning, but the Carpentries offers essential foundational materials (https://carpentries.github.io/instructor-training/01-welcome/index.html). Exercises can include reading key pedagogical articles and book chapters and applying them to mock workshop instruction and consultation scenarios so that staff learn how to start a consultation or workshop, explain what services are available, clarify roles, expectations, and limits of services, spot the “gotcha!” moments, and generally improve social skills and the ability to think on one’s feet—all key aspects of success in academic settings. Professional development tends to emerge naturally from such training interactions and can help prepare students for co-consulting and team building, graduate school and job applications and interviews, and future success based on the wide-ranging applicability of these transferable and marketable skills.
Improving research starts with improving researchers
One approach to ensuring that patrons learn when they visit SSDS is to shift the focus to an empathetic view of the researcher as a person, which helps us better understand the research question being asked. By getting to know the researcher a little, both in terms of their specific research interests and their short- and long-term goals, we can contextualize the research problem more holistically and identify a more productive starting point. Oftentimes, a researcher thinks they need help with “Method X” because they have been told to learn it, without understanding why or having guidance for its application. Usually, basic questions aimed at genuinely understanding the research question and its context, the dataset, and the purpose of the software and methods can clear up the initial opacity. The consultee can then report back to their faculty advisor, supervisor, or lab group with feedback that catalyzes more meaningful discussion and responsible research. This can also serve as a segue for opening higher-level dialog with faculty members and lab groups across campus about specific research needs and custom trainings and consultations, conducted in plain language and in terms of researchers’ own strengths and weaknesses, in newly forged third spaces where otherwise hesitant researchers can flourish by transcending traditional ideas of work and home (Gutiérrez, 2008). This can in turn lead to clearer ideas and greater confidence to plan for job searches and other professional development.
Many researchers who contact us have little to no research experience but want to execute immensely complex research objectives on large and complicated datasets, using software and methods they are unfamiliar with, crammed within short time frames, and with the expectation that our organization will complete the majority of the actual work for them. The fetishization of “artificial intelligence” is the most conspicuous example of this. More often than not, these researchers have no experience programming in languages such as Python or R, are sometimes not even familiar with essential spreadsheet programs such as Microsoft Excel, and lack foundational knowledge about how to ask a research question and explore data, as well as about probability and statistics, data types and structures, distributions, sampling, estimation, and hypothesis testing. This speaks broadly to the mentorship void and the lack of preparation students and young researchers receive, combined with the felt need to jump into buzzword projects without first understanding basic research principles and learning foundational research and statistical methods. If the consultee is open to the suggestion, we can set them on a path to basic literacy and make it clear that we are happy to explore these topics with them at a future time, across multiple consultation sessions and workshop trainings. Unfortunately, the influx of third-party vendors seeking to capitalize financially further contributes to misunderstandings of basic scientific principles by selling access to data and canned analyses for text analysis and machine learning that produce shortcut outcomes at the expense of these foundational research principles.
We recommend that computational research support organizations start by focusing on the basics of software, tools, and methods in order to better address researchers as inherently creative individuals (Kelley and Kelley, 2013). There is often greater value for novice researchers if we can identify and explain the different components of the general research process in terms that make sense to them, rather than rushing to the fastest buzzword-driven solution devoid of context. While there are scenarios in which the fast solution is desirable, such as when helping more intermediate and advanced researchers, helping the novice situate their question in a broader context can show them how to read and combine ideas to ask new and relevant research questions, formulate hypotheses, outline protocols for acquiring and preparing data, understand and develop statistical frameworks, visualize and test data, interpret results, and practice writing and presenting data and research. Different researchers come from different backgrounds, have different capacities for coding and levels of creativity, seek different goals, and learn in their own ways. This makes our ability to speak at their level perhaps the most important part of getting through to the researchers who seek our help.
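To make that scaffolding concrete, the sketch below walks a hypothetical two-group dataset through the kind of first pass implied by "visualize and test data, interpret results": summary statistics, a simple plot, and one basic test. The file name, column names, and the choice of a Welch's t-test are assumptions made for illustration; the point is the order of operations, not any particular method.

```python
# Minimal sketch of a first exploratory pass over a hypothetical dataset
# with one grouping column ("condition") and one numeric outcome ("score").
# File and column names are illustrative, not a prescribed workflow.
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv("survey_responses.csv")  # hypothetical tidy dataset

# 1. Describe: summary statistics by group surface coding errors and outliers.
print(df.groupby("condition")["score"].describe())

# 2. Visualize: a simple histogram per group often supplies the "Aha!" moment.
df.hist(column="score", by="condition", sharex=True)
plt.savefig("score_by_condition.png")

# 3. Test: a basic two-sample comparison, only after the data have been inspected.
group_a = df.loc[df["condition"] == "A", "score"].dropna()
group_b = df.loc[df["condition"] == "B", "score"].dropna()
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"Welch's t = {t_stat:.2f}, p = {p_value:.3f}")
```

Each step produces an artifact the researcher can explain back to an advisor or lab group in their own words, which is the real goal of the exercise.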
Researchers who are mentored in this manner often grasp these principles quickly, simply because nobody else has ever taken the time, or cared, to divide the research process into digestible chunks in postcolonial, individualized language. This usually means helping the researcher determine whether their research design is justifiable and can stand up to questioning by a committee member or peer reviewer, and then helping them understand their data using the most basic vocabulary and jargon in scaffolded ways. From there, data exploration often provides the researcher with “Aha!” moments revealed through descriptive statistics, visualizations, and basic tests. Through this process, it becomes simple (but not necessarily easy) to demonstrate how the same rationale can be applied to the next phase of the project, which might be slightly more complex and use intermediate or advanced methods of computational text analysis or machine learning, for example. This approach shows that a research outcome can be greater than the sum of its parts because it becomes about helping researchers better understand the generated results, within the scope of their own knowledge and skill set, and setting them on the path to autodidactic and transferable learning. All of this is time-consuming and requires funding that universities would be wise to invest sooner rather than later, as tuition rises exponentially and student learning experiences are arguably diluted. The 4-year cost of an in-state public college is expected to exceed $200,000 by 2039, assuming tuition increases of about 5% per year (College Cost Calculator, www.collegeboard.com). Failing to teach basic research skills will continue to dissuade enrollment when so many online courses, bootcamps, and data science influencers offer cheaper (or free) alternatives, in part as a rebuke to higher education’s slow pace of change.
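The projection is straightforward compound growth, and the short calculation below checks it. The present-day cost and the start year are illustrative assumptions made here for the arithmetic, not figures taken from the cited calculator.

```python
# Sanity check of the compound-growth projection: an assumed present-day
# 4-year in-state cost grown at roughly 5% per year until 2039.
present_cost = 90_000     # assumed 4-year in-state cost today (USD), illustrative
annual_increase = 0.05    # ~5% tuition growth per year
years = 2039 - 2022       # assumed horizon for the projection

projected_cost = present_cost * (1 + annual_increase) ** years
print(f"Projected 4-year cost in 2039: ${projected_cost:,.0f}")  # about $206,000
```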
Practicing inclusion
One thing that many academics fail to realize is that the key to helping researchers succeed is not gatekeeping, raising barriers to research, or showing off one’s own knowledge; it is using every opportunity to employ practical language and examples to help build mental models, to discuss cognitive load management and visual learning strategies for internalizing knowledge, and to help the novice understand the outcomes of their research question, how to find help, and how learning works in general. Universities exhibit blissful or willful ignorance around issues that plague many of their programs and should be required to teach diversity training (Caldwell, 1996; Devine and Ash, 2022; Gay, 2000; Ladson-Billings, 1994) with a focus on shifting cultures toward anti-racist (Ash et al., 2020; Fritzgerald and Rice, 2020) and anti-gender-bias (Carnes et al., 2015) attitudes and language.
Success in research depends on many things, but if we are to truly use our own success to open doors for novice researchers instead of closing them, we must understand how sense of belonging (Chen et al., 2021; Murphy et al., 2020; Walton and Cohen, 2007), stereotype threat (Aronson et al., 2002; Spencer et al., 1999), and social identity threat (Hernandez et al., 2021; Martiny and Nikitin, 2019; Stephens et al., 2015) shape engagement and the manifestation of impostor syndrome (Canning et al., 2020). These factors in turn affect motivation for scientific inquiry, which shapes science identity (Hazari et al., 2013) and, ultimately, the positive outcomes that can catalyze the building of science capital (DeWitt et al., 2016) and help close the achievement gap (Harackiewicz et al., 2016). This can be accomplished through (near-)peer instruction and mentorship (Aikens et al., 2017; Balta et al., 2017; Destin et al., 2018; Evans and Cuffe, 2009; Schell and Butler, 2018; von Vacano et al., 2022) in spaces where negative personal judgment and ad hominems are unacceptable. We must also work to eliminate cultures of bullying, especially those perpetrated by staff supervisors and peers who promote the idea of safe spaces while simultaneously making them unsafe for many through the self-aggrandizing attitudes they promote.
Finally, it is worth reemphasizing to ourselves and to the researchers we train that we do not live in a post-fact world. Facts, however, are different from evidence for an assertion, a distinction well summarized in the response by J. Dirk Nies, PhD (Executive Director, Floriescence Institute in Crozet, Virginia) to Jureidini and McHenry’s (2022) exposé “The illusion of evidence based medicine” (https://www.bmj.com/content/376/bmj.o702/rapid-responses). Dr. Nies succinctly explained several crises in medical research pertaining to the erosion of scientific rigor, and his points can also be extended to understanding the perspectives that students and novice researchers bring with them. There are many parallels with rushed academic research that is not properly designed or theoretically framed and that uses poorly understood methods:
A fact is an occurrence in the real world. A thing that is indisputably the case. Something that has actual existence. A truth verifiable from experience or observation. Evidence is an assembly of facts indicating whether a belief or proposition is true or false. Evidence is always gathered and presented either in support of or in opposition to an assertion. Notice the distinction. Facts have no purpose or agenda associated with them. Evidence always does. Furthermore, evidence always considers relevance. Evidence is an intentionally selected subset of all available facts chosen because they are deemed relevant to determining the validity of an assertion. And therein lies the rub. Who determines what the assertion is? And who determines which facts are considered relevant? As a scientist, I seek solid data to guide my personal choices. . .
As we help the researchers we serve solve today’s most challenging problems, we must teach them how to know about data: how it is presented, how to critique it, and how to interpret it in the context of the assertion it is meant to support. In many instances, we must also help researchers formulate an assertion and teach them to weigh the evidence both for and against it, not simply the evidence that supports the argument being made. Every researcher is unique, and we must realize that one-size-fits-all approaches to teaching, consulting, and mentoring are no longer valid because they often fail to teach students how to match an assertion to the evidence presented within a critical framework.
Evidence must still be presented in a manner in which researchers understand the process and apply methods responsibly, based on a basic understanding of their principles. While shortcuts exist, and can often be helpful, computational research support units should help researchers understand, in terms they can grasp, the processes used to find an answer given the motivations of the assertion and the evidence. Although the goal is to make difficult problems easier to comprehend, applying such techniques should never be presented as “magical”; the researcher needs to understand the moving parts well enough to explain them to different audiences, using plain language with the public and technical terminology with other professionals when necessary. Additionally, we must warn young researchers against applying techniques they do not understand, which can help them be more self-reflective about their own work (and critical of the literature they read) and identify steps for their own technical development. Overly complicated, poorly understood, and poorly explained statistical frameworks and representations are not always best and should never substitute for basic statistical reasoning (Box et al., 1994; Donoho, 2017; Kass, 2021; Kass et al., 2005; NASEM, 2017; Thompson, 2001; Tukey, 1977; Velleman and Hoaglin, 2012). This is particularly challenging in the era of data monetization and as higher education continues down its path of extreme neoliberal commercialization. It is up to us to continue to democratize data literacy and technological preparation for every single patron we serve as we train the next generation of data visualizers, text analyzers, and machine learners in the social sciences and humanities, and beyond.
Acknowledgements
The authors thank Mike Keller and Dr. Matt Marostica for helping steer the direction of SSDS over the past decade of its existence, along with the countless community members dating back to the 1970s who laid the groundwork for us all. We are especially grateful to the many graduate student consultants and instructors who have provided the time, effort, and spirit for supporting computational research efforts at Stanford. We also thank Dr. Diana Pacheco for consultation on the qualitative analysis, and two anonymous reviewers who helped significantly improve the quality of the manuscript.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
