Lodestar: Supporting rapid prototyping of data science workflows through data-driven analysis recommendations

Abstract

Keeping abreast of current trends, technologies, and best practices in visualization and data analysis is becoming increasingly difficult, especially for fledgling data scientists. In this paper, we propose Lodestar, an interactive computational notebook that allows users to quickly explore and construct new data science workflows by selecting from a list of automated analysis recommendations. We derive our recommendations from directed graphs of known analysis states, with two input sources: one manually curated from online data science tutorials, and another extracted through semi-automatic analysis of a corpus of over 6000 Jupyter notebooks. We validated Lodestar through three separate user studies. The first was a formative evaluation involving novices learning data science using the tool; we used the feedback from this study to improve the tool. The second was a summative study involving both new and returning participants from the formative evaluation to test the efficacy of our improvements. In the third, we engaged professional data scientists in an expert review assessing the utility of the different recommendations. Overall, our results suggest that both novice and professional users find Lodestar useful for rapidly creating data science workflows.

Keywords

Computational notebook, visualization recommendation, Markov chain, data science, Python

Introduction

Data science is still a nascent and emerging discipline, which makes it challenging for analysts to learn and keep up with new tools and techniques. There is already a dizzying array of libraries, such as Scikit-Learn, Pandas, and TensorFlow, and best practices and workflows change often. Furthermore, few standardized methods exist for data analysis: many times, the exact data transformations, computations, and analyses needed depend on the data, task, and user. This means that cookbook methods or simple templates are insufficient to teach fledgling analysts how to tackle realistic and ever-changing data science problems.¹

We present Lodestar, an interactive and visual sandbox for independent learning of analysis methods and best practices in data science. Our aim in developing Lodestar is to simplify the process of finding and experimenting with new methods by providing automated, data-driven recommendations. The vision is for Lodestar to be a self-contained environment for rapid learning and prototyping that combines everything the user needs to infer the function and purpose of an analysis step in one place.

The Lodestar system shows a sequence of analysis steps in the form of Python code cells (see Figure 1), as in a computational notebook interface (such as Jupyter Notebook²), but enables the user to initially select from and interact with self-contained code cells without having to write any code. The user merely selects which data frame to analyze, and the system displays a ranked list of recommended analysis steps to be executed on that data. Each analysis step is represented by an interactive visualization in the notebook interface, giving the user insights into its output and behavior. Furthermore, users can view the corresponding code for any analysis step, and even export the resulting notebook from Lodestar, providing flexibility in how users learn from and interact with Lodestar’s analysis recommendations.

Figure 1.

Lodestar web interface. The top panel, above the selected recommendations, provides a data selection menu. The black dividers between sections are recommendation panels combining suggested analysis steps from various sources (called advisors). Areas that display graphs and analysis output are analysis cells, each with multiple tabs: “Analysis Results” gives charts or tables, “Output Dataframe” and “Code Script” show the output data and the current code block, and “What’s this analysis?” gives a brief description of the analysis.

Lodestar provides recommendations for the user’s next analysis step based on the current state of the analysis and the dataset being analyzed. Recommended analysis steps and workflows are derived from two sources representing current best practices in data science: (1) existing data science tutorials from online academies and training materials (i.e. expert recommendations), and (2) common analysis patterns mined from a large corpus of publicly available Jupyter notebooks³ (i.e. crowd recommendations). The code cells extracted from each source are manually curated, then programmatically clustered into synonymous analysis steps, and inserted into a large directed graph of connected cells representing common analysis workflows. The Lodestar recommendation engine can then identify and rank the most relevant analysis steps given the user’s current position in the graph.
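The ranking idea described above can be illustrated with a minimal sketch. It assumes a first-order Markov chain over analysis steps, in which candidate next steps are ranked by how often they follow the current step in the source workflows. The step names, the example workflows, and the functions `build_graph` and `recommend` are hypothetical and are not taken from the Lodestar implementation.

```python
from collections import Counter, defaultdict


def build_graph(workflows):
    """Count how often each analysis step is directly followed by another."""
    graph = defaultdict(Counter)
    for steps in workflows:
        for current, following in zip(steps, steps[1:]):
            graph[current][following] += 1
    return graph


def recommend(graph, current_step, k=3):
    """Return up to k next steps ranked by estimated transition probability."""
    successors = graph.get(current_step, Counter())
    total = sum(successors.values())
    if total == 0:
        return []
    # Sort by descending count; break ties alphabetically for stable output.
    ranked = sorted(successors.items(), key=lambda item: (-item[1], item[0]))
    return [(step, count / total) for step, count in ranked[:k]]


# Hypothetical workflows standing in for curated tutorials or mined notebooks.
workflows = [
    ["load-data", "describe", "scatterplot-regression", "shuffle-split"],
    ["load-data", "describe", "shuffle-split", "fit-decision-tree"],
    ["load-data", "drop-missing", "shuffle-split", "fit-decision-tree"],
]

graph = build_graph(workflows)
print(recommend(graph, "describe"))
print(recommend(graph, "shuffle-split"))
```

In this toy graph, both observed transitions out of `shuffle-split` lead to `fit-decision-tree`, so it is the only recommendation for that state.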

We developed Lodestar using an iterative design process through three separate user studies. In the first study, we used early feedback from six novice data scientists to improve the interface design in a formative user study. Once we made improvements to the interface design, we asked three returning participants and seven new participants to evaluate the improved version of the interface in a summative user study. Our findings show that key Lodestar interactive features, such as automated recommendations, a visualization of the full analysis workflow, a code review pane for suggested analysis steps, and support for Jupyter Notebooks, provide significant value to those who are learning data science. Finally, in a third study, we invited three professional data scientists to evaluate Lodestar recommendations and its recommendation engine. This evaluation showed that Lodestar recommendations provide an easy way to explore data and analysis techniques.

In this work, we make the following contributions: (1) a recommendation system involving two sources of data analysis practice: crowd-based and expert-based; (2) a sandbox interface design integrating visualizations, interactions, and code to facilitate learning about new data analysis techniques; (3) results from a formative study and a summative study evaluating the Lodestar interface; (4) results from a study evaluating Lodestar recommendations and recommender engine; and (5) a novel data analysis architecture that integrates a recommender system with an interactive interface. All our materials have been made available on OSF: https://osf.io/pztva/

Motivating scenario

Chewie is a recently retired journalist, and is hoping to devote his retirement to his true passion: traveling the world. He owns a condo in a nice downtown area, and is now trying to decide whether selling or subletting his house is the better choice given the changing market. What will yield him the most funds to travel in the long run? Unfortunately, while Chewie has long experience finding the facts, he has no training in data science or temporal forecasting. He turns to Lodestar for help.

Chewie talks to his realtor, who is able to quickly get him a dataset of recently sold homes in the neighborhood. The resulting CSV file contains approximately 500 homes sold in the last 5 years, including the price, address, square footage, date sold, the income of the house owners, and school statistics. The realtor also provides him with another dataset of 1000 rentals in the city for the last 10 years, including monthly rent, address, size, and year.

Chewie first creates a Lodestar notebook and loads the 500 sold houses. The first few recommendations include several descriptive statistics that give an overview of the dataset. He chooses scatterplot-regression, which renders scatterplots between various attributes with an estimated regression line. He then selects the shuffle-split analysis from the expert advisor, which is described as the first step of training a model. The human-readable description informs him that this is a standard method to randomly select one holdout dataset for training and one for testing. Then he goes along with the next top recommendation, fit-decision-tree, to actually train a decision tree model; this shows up from both the expert and crowd advisors, as it is a common step for forecasting. He is now ready to make predictions and uses the Lodestar default parameters to input the data for his house: selling it now, versus in 5 years. The model suggests that selling his house in 5 years instead of now will net him an extra $25,000.

Satisfied, he creates a new Lodestar notebook and repeats the process for the rental dataset. Again, the parameters of his house yield a suggested rent ($1200 per month) as well as its market increase over the 5 years. He takes this figure, manually deducts his monthly mortgage payment, and realizes that his rental income over those 5 years will be close to $5000 more than selling the condo now. His decision made, Chewie contacts his realtor and tells him to list the condo for rent, and then turns to planning his escape to warmer climes.

Background

Lodestar was architected to encourage best practices in sensemaking. Russell et al. described sensemaking as “the process of searching for a representation and encoding data in that representation to answer task-specific questions.”⁴ Dubbed the sensemaking loop,⁵ each sensemaking iteration works to refine and build on the previous insights, ultimately enabling the analyst to address less specialized audiences.

In combination, these iterations make up the data science workflow. Analysts often use visualizations or other types of intermediate results to guide further analysis. However, these results can sometimes be dead ends. Kandel et al.⁶ found that analysts will commonly overcome dead ends by backtracking and exploring new branches.

Interactive visualization design environments

Many visualization systems and toolkits are designed around specific data analysis tasks, making the analysis process easier to perform. Excel supports basic visualization and data transformations. Shelf-based visualization environments such as Tableau (née Polaris⁷) allow easy configuration of visualizations through drag-and-drop of data attributes and metadata onto “shelves” representing visual channels. This approach is flexible enough for even novice users to construct a wide range of visualizations. Interactive visual design environments such as Lyra,⁸ iVoLVER,⁹ and iVisDesigner¹⁰ utilize direct manipulation to allow users to bind data to visual representations. More recently, Data-Driven Guides,¹¹ Data Illustrator,¹² DataInk,¹³ and Charticulator¹⁴ provide advanced tools for representing data items as visual elements and mapping their attributes to data dimensions. Keshif,¹⁵ a faceted visualization tool, generates grids of predefined charts to support visual exploration by novices. ExPlates¹⁶ uses fluid drag-and-drop interaction to support spatialized data analysis.

Visualization development toolkits such as D3¹⁷ and Protovis¹⁸ provide fine-grained control over designing interactive visualizations, but require significant programing expertise to use. Visualization grammars, such as ggplot2,¹⁹ Vega,²⁰ and Vega-Lite,²¹ abstract away implementation details, but still require programing knowledge to use. Furthermore, even advanced visualization tools, toolkits, and grammars offer only limited functionality for manipulating the data, and only support a small number of statistical functions.

Visualization recommendation

The purpose of visualization recommendation is to suggest relevant visualizations to the user to facilitate data analysis,²² where the visualizations are fully designed in advance and therefore directly accessible to the user. It was first proposed by Mackinlay²³ in 1986 with automatic design of effective presentations based on input data. The work combines expressiveness and effectiveness criteria from studies such as those by Bertin²⁴ and Cleveland and McGill²⁵ to recommend appropriate visualizations. In 2007, Tableau’s Show Me feature²⁶ brought these ideas to a commercial product. Following the idea of Mackinlay’s automatic visualization, Roth et al.²⁷ enhance user-oriented design by completing and retrieving partial design graphics based on their appearance and data contents. The rank-by-feature framework²⁸ ranks histograms, scatterplots, and boxplots over 1D or 2D projections to find important features in multidimensional data. SeeDB²⁹ generates all possible visualizations given a query and identifies the interesting ones. Perry et al.³⁰ as well as van den Elzen and van Wijk³¹ propose generating small multiple visualizations shown as thumbnails using summary statistics.

In the last few years, recommender systems have become widely used for visualization. Voyager³² generates a large number of visualizations given a user-specified partial specification, and organizes them by data attributes. The generated visualizations are rendered as cards on a scrolling view. Saket et al.³³ propose the Visualization-by-Demonstration framework, which allows users to provide incremental changes to the visual representation. The system recommends potential transformations such as data mapping, axes, and view specification transformations. Zenvisage³⁴ automatically identifies and recommends desired visualizations from a large dataset. Voyager 2³⁵ extended the original Voyager through wildcard functionality that explores all possible combinations of attributes. Draco³⁶ automates visualization design itself using partial specifications and a database of design knowledge expressed as constraints. VizML learns what visualizations to recommend by training neural network models on millions of visualization designs made using Plotly.³⁷ Similarly, Qian et al.³⁸ use learning-based approaches to generate relevant visualizations based on data. Most recently, and uniquely, Solas³⁹ learns to provide visualization recommendations using a user’s analysis history.

Several tools extend these ideas to recommending analytical insights and data processing steps. “Top-K insights”⁴⁰ provides a theory for generating top $K$ insights from multidimensional data. Similarly, Foresight⁴¹ presents the top $K$ insights in a dataset from 12 insight classes using a corresponding visualization. DataSite⁴² organizes significant automatic findings in a specific feed of notifications. Finally, Voder⁴³ builds on a similar feed as DataSite to provide “interactive data facts” using visualizations.

Our proposed Lodestar system combines these ideas from visualization recommendation with an analytical perspective, and allows stringing together such analytical steps into a sequence. There are some existing efforts on recommending data analysis techniques and workflows. Yan and He⁴⁴ demonstrate that online repositories of computational notebooks can be a valuable resource for modeling and testing a recommendation system for data cleaning techniques. Bar El et al.⁴⁵ take this a step further by automatically generating entire data exploration workflows using deep reinforcement learning techniques. Our system builds on these works by presenting a holistic model and code mining pipeline for deriving new recommendation features in a data-driven way, whether for data visualization, data preparation, or data analysis workflows. Essentially, Lodestar extends the idea of automated recommendations to the entire data science pipeline, rather than visualizations only.

Interactive notebooks

Donald Knuth’s notion of a “literate” form of programing,⁴⁶ which merges source code with natural language and multimedia, has extended to the concept of literate computing in the form of computational notebooks² that combine executable code, its output, and media objects in a single document. This has proven to be very useful for rapid prototyping and exploration as well as for replicability and communication in data science.³

Because of their success, with adoption even at the level of entire organizations,⁴⁷ notebooks have enjoyed significant progress in recent years. The new generation of computational notebooks, such as Google Colaboratory and Codestrates,⁴⁸ enable synchronous collaboration. The JavaScript-based Observable notebook also supports reactive execution flows.

Visualization in particular has recently begun to adopt computational notebooks. Altair⁴⁹ builds on Vega²⁰ and Vega-Lite²¹ to provide statistical visualizations in Python, and thus in Jupyter Notebooks as well. Idyll⁵⁰ supports a notebook-like markup language to create interactive data-driven documents for communication. Vistrates⁵¹ provides a collaborative visualization workflow in a notebook. Observable leverages computational notebooks to also provide a collaborative visualization platform. Literate visualization⁵² integrates the visualization design process with the choices that led to the implementation.

End-user and live programing paradigms have proven useful in creating intuitive interactions with visualizations found in computational notebooks. For example, Wrex⁵³ and Mage⁵⁴ leverage user interactions on data visualizations to automatically generate exemplar code. Both tools demonstrate the link between code and visual interactions. Torii⁵⁵ uses a live programing model to enable easy maintenance and reuse of source code for building tutorials. These systems not only add to the number of ways users can interact with their literate document, but also connect code and visualization so as to facilitate iterative analysis.

Design Requirements: Formative Study

Our goal is to make Lodestar an interactive and visual sandbox environment for learning and experimenting with new data science methods in a data-driven way. We also wanted to make data science universally accessible to fledgling data analysts and enthusiasts alike. These core ideas helped us compile a set of design requirements and some preliminary prototypes. In this section, we outline our major design requirements, and report on a formative study conducted to validate and refine our approach to the Lodestar interface design and system development processes.

D1: Informed by best practices. Recommendations should be drawn from current practice, empowering those new to data science to learn how to effectively analyze data.¹,²⁶

D2: Prioritize analysis steps over code. Our intended users are trying to analyze data in a fast and fluid fashion, but may not yet be familiar with specific libraries or modules needed to complete different analysis steps. Lodestar needs to build a bridge between the high-level analysis steps common in data science, and the low-level code needed to accomplish these steps.⁵⁶ For example, recommendations should be immediately relevant and situated within the overall data science pipeline to enable users to progress in their analysis.

D3: Enable independent exploration. To ensure that users can explore their data independently, educational interface elements must also be incorporated to automatically provide documentation and clarification of system behavior.⁵⁷ Furthermore, intermediate and final results should be presented using visual representations that can be easily interpreted regardless of user expertise.⁵⁸

We conducted a formative user study to evaluate the usability of an early prototype of the Lodestar system, which we used to validate our initial design requirements and refine the design. Questions posed to the participants in the formative study protocol can be found in the supplemental material.

While we refer to Lodestar as a computational notebook, we note that building a fully-fledged notebook system from scratch is a vast engineering effort. Our goal in this research project was to focus on novel aspects of recommendation for data science and visualization, so our resulting notebook lacks significant features normally associated with such systems.

Study Design

The study was conducted over a period of 1 month in which we interviewed six fledgling analysts and data scientists, all undergraduate university students. We focused on recruiting university students, since they are generally learning data science methods for the first time and thus could provide helpful insights in our design process. Each student had demonstrated knowledge of data science fundamentals through attending a university-level introductory data science course and/or other relevant machine learning/data science experience. Although not a prerequisite of the recruitment process, some students also had experience performing analysis on platforms such as Excel and Tableau.

Method

Each interview lasted for 60 min and was divided into three phases. Prior to the interview, each participant signed a consent form, allowing us to record audio and screen capture throughout the duration of the interview. The first phase consisted of questions, delivered verbally, that assessed the participant’s recent experience in learning data science techniques and tools through classes, side projects, research, and other such activities.

The second phase of the interview was dedicated to introducing an early prototype of the Lodestar system. Participants were first given a brief 2-min description of Lodestar and its goals. The next 5 min were spent giving the participant a cursory tutorial of the system. For each participant, the tutorial was given using a pre-written script and with the same sample dataset to give each of them equal knowledge of the system prior to their exploration. The participants then spent the next 15–20 min using the Lodestar system to conduct exploratory data analysis on a dataset of their choosing. We restricted their choices to two datasets: the Boston House dataset from a Udacity tutorial and the Cars dataset (the Udacity tutorial is available here: https://github.com/sajal2692/data-science-portfolio/blob/master/boston_housing/boston_housing.ipynb). During this exploratory session, participants verbalized their thought process, questions, and comments with a think-aloud protocol. We encouraged participants “to use any and all features of the Lodestar system” and to “explore whatever aspects of the data [they found] interesting.” Participants were allowed to end the session before the allotted time expired if they were satisfied with their results.

The third and final phase of the interview consisted of a post-exploration questionnaire that asked participants to describe the utility of Lodestar for their common data analysis tasks. Participants were specifically asked if they would adopt Lodestar to learn new data science techniques.

All sessions were held in a lab environment using Google Chrome on a MacBook Pro with a 15-inch Retina display. Audio was recorded using the built-in voice recording application on a mobile device. Screen capture was done using Apple’s QuickTime Player. Observational notes from the study coordinators, text responses from our questionnaires, and audio and video recordings were collected for further analysis and prioritization of design requirements and functional features of the existing prototype. We used the participant responses from the think-aloud and questionnaire portions of the study to illustrate the themes with respect to the Lodestar enhancements.

Results

Our formative study found that a majority of participants were in favor of using Lodestar in their daily work, but suggested several modifications to make the system more useful. For the sake of brevity, we focus primarily on summarizing their constructive feedback below (participant IDs start with “FP”):

Provide Clear Documentation & Context: Our early prototypes did not include tooltips or descriptions of analysis steps. Several participants highlighted the need for increased transparency in the interface. Specifically, they wanted clearer naming conventions, documentation of features and methodologies (e.g. the difference between expert and crowd recommendations), and explanation of expected system behavior. For example, some participants had difficulties understanding the meaning of certain user interface elements. Participants asked questions such as “what are these percentages?” (FP6), or “[what do] the columns on the left side represent?” (FP5). Participants FP1, FP2, FP3, and FP5 also asked if there “is actually a way to view the entire dataset?” (FP2).

There were many questions specific to the meaning of recommendations. For example, FP3 said “I think the names [are] misleading... there were some really complicated names for just a simple linear regression. [It] should just be changed [to more] obvious names.” Similarly, FP2 suggested that there should be “a longer description [...] [or] some way to show their effectiveness without the user having to Google search them.” These misconceptions indicate that better documentation is needed to help new users understand the interface.

Improve Tracking of Analysis Progress: Several participants wanted to be able to see what phase of the data science process they were in based on the current state of their analysis workflow. Our early prototypes did not include a feature for tracking previously selected analyses. FP4 drew parallels with a restaurant order tracker, where Lodestar should partition each part of the data science process into separate steps, and group analysis recommendations into these steps. Users would then be able to better understand their progress within the data science process.

Enable More Granular Control: The early Lodestar prototype only allowed users to choose from pre-loaded datasets, and did not provide any export or customization functionality for analysis steps. However, multiple participants expressed the desire to import their own dataset and export their own code for later sharing and reuse. Participant FP4 said that they would be frustrated if they wanted to “export it or make some changes in the data or [try] to do something that is not supported by Lodestar [while] not having any way of doing so.” Participants also highlighted the need for more control over what parameters or attributes were being passed into different analysis steps, such as selecting specific attributes when generating visualizations or executing regression analyses. These observations suggest that users should be able to customize analysis steps and export their current analysis.

Further refinement of Lodestar

Though participants could see promise in providing automated recommendations (design requirement D1), the expressed need for more tracking of workflow structure and progress also reinforces design requirement D2. Without additional context to help users situate themselves within the broader data science process, users can easily lose their train of thought, hindering their analytic flow. The need for more documentation and control observed in our formative study supports design requirement D3. Without adequate information, users are unable to explore new data analysis techniques and interpret the results in Lodestar on their own. Users also find it difficult to tailor their explorations to their specific needs without access to the code.

These points of feedback served as motivation for additional iteration on the Lodestar feature design. Three features were added as a result of this study: the ability to export the user’s notebook to an .ipynb file for use outside of the system; a visual tracker in each output cell that displays the progress of the user’s analysis, showing which recommendations have been chosen so far; and descriptive tooltips for the different analysis techniques in each output cell.

The Lodestar system

Lodestar is a data analysis recommender, that is, a system that interactively suggests the next step to take in an analysis workflow (D1). Lodestar is designed in the style of an interactive computational notebook, and generally inspired by the designs of existing notebooks such as Jupyter, Observable, and Google Colaboratory. Given Python’s broad popularity in data science contexts,³ we chose to focus on Python as our target environment.

System overview

Lodestar consists of four main components, shown in Figure 2: a browser-based notebook interface, an interactive computing protocol, a recommendation engine to suggest analysis steps, and a server-side kernel to execute analysis steps. The protocol manages communication between the client and server (commands as well as computational results), and the kernel on the server side runs each analysis step that the user selects using an interpreter. The Flask server handles all of the client requests for data processing, analysis, and recommendations, with different endpoints.

Figure 2.

Overview of the Lodestar architecture. The user interacts with the notebook interface and selects either a dataset to bootstrap the notebook or an analysis step within a guided workflow. The notebook interface sends the selection as a request to the Lodestar server. The Lodestar server sends requests to the recommendation engine for subsequent recommendations based on current selections (data or analysis). The Lodestar server also sends a request to the Python interpreter to execute any selected analysis. Results from both these requests are sent back from the Lodestar server to the notebook interface for the user to view and interact with.

Lodestar emphasizes an iterative workflow design where analysis steps are added progressively, one at a time, providing fine-grained control to the user (D3). To help users focus more on analysis steps and best practices (D1, D2) rather than low-level code, Lodestar allows the user to rapidly choose from a list of recommended analysis steps. These recommendations are displayed in the form of buttons, so a user can easily select and execute an analysis step of interest with a single click. Furthermore, these recommendations are mined from recent Python tutorials and active GitHub repositories of Jupyter Notebooks, enabling the user to construct new analysis workflows based on best practices in a data-driven way.

Notebook Interface

The Lodestar interface (shown in Figure 1) is an interactive notebook providing a literate computing environment⁴⁶ that runs in a web browser on the client. Similar to existing computational notebooks, the Lodestar notebook is a linear document that the user can selectively edit and execute. The interface contains three major components: a menu panel at the top, one or more notebook cells, and recommendation panels for each cell. The notebook cells and recommendation panels dynamically appear and update within the notebook interface in response to user interactions.

The user begins their analysis using the menu panel to load an existing dataset or a new dataset (in CSV format) into the system. Once a dataset has been loaded, Lodestar generates a recommendation panel within the notebook interface, providing the user with an initial set of recommended analysis steps. We refer to the actual code behind each analysis step as an analysis block, and the displayed result of executing the analysis step as a notebook cell. From this point onward, the analysis process forms a cycle that repeats until the user is satisfied with their new workflow:

The user selects an analysis step from a recommendation;

The kernel executes the analysis block on the server;

The notebook displays output by appending a new cell; and

The notebook generates a new panel of recommendations, based on the user’s previous selection.
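The four-step cycle above can be sketched in a few lines. This is an illustration only: the `Notebook` and `Cell` classes, the step names, and the lambda functions standing in for analysis blocks are hypothetical, and the real system runs blocks in a server-side kernel rather than in-process.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Cell:
    step: str    # name of the selected analysis step
    output: Any  # result of executing its analysis block


@dataclass
class Notebook:
    # Maps a step name to a callable standing in for its analysis block.
    blocks: Dict[str, Callable[[Any], Any]]
    # Maps a step name to the steps recommended after it.
    next_steps: Dict[str, List[str]]
    data: Any = None
    cells: List[Cell] = field(default_factory=list)

    def select(self, step: str) -> List[str]:
        """Run one iteration of the cycle and return the new recommendations."""
        # The input is the preceding cell's output, or the loaded dataset.
        source = self.cells[-1].output if self.cells else self.data
        output = self.blocks[step](source)     # the block is executed
        self.cells.append(Cell(step, output))  # a new cell is appended
        return self.next_steps.get(step, [])   # a new panel is generated


notebook = Notebook(
    blocks={
        "drop-missing": lambda rows: [r for r in rows if r is not None],
        "mean": lambda rows: sum(rows) / len(rows),
    },
    next_steps={"drop-missing": ["mean"]},
    data=[4, None, 8],
)
panel = notebook.select("drop-missing")  # the user picks a recommended step
notebook.select(panel[0])                # then picks from the new panel
print([cell.step for cell in notebook.cells], notebook.cells[-1].output)
```

Each call to `select` corresponds to one pass through the cycle, so the list of cells grows by exactly one per user selection.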

When the user is ready to migrate their workflow to a complementary tool, for example to iterate on the code directly within a code editor, they can export the Lodestar workflow as a Jupyter notebook file.

Recommendation panel

Every notebook cell in the Lodestar interface has an accompanying recommendation panel, allowing the user to extend their latest analysis step by one cell. When the user selects an analysis step from a recommendation panel, a new notebook cell is generated for the selected recommendation, along with a new recommendation panel underneath. Lodestar uses the output of the preceding notebook cell as the input for executing any analysis step selected in this recommendation panel. Each panel provides two sets of recommendations, one from a crowd advisor and one from an expert advisor. The crowd advisor sources recommendations from online data analysis repositories such as GitHub. The expert advisor sources recommendations from educational resources such as textbooks, online classes, or online tutorials. Crowd and expert advice are presented separately so that the source of a recommendation is itself a data point the user can take into consideration when choosing. We believe that this type of transparency supports independent exploration (design requirement D3).
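A recommendation panel that keeps the two advisors separate could be assembled as in the following sketch. The function `build_panel`, the score tables, and the step names are hypothetical; the scores stand in for whatever ranking each advisor produces.

```python
def build_panel(current_step, advisors, k=3):
    """Return one ranked list of recommended steps per advisor."""
    panel = {}
    for name, scores in advisors.items():
        candidates = scores.get(current_step, {})
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(candidates.items(), key=lambda item: (-item[1], item[0]))
        panel[name] = [step for step, _ in ranked[:k]]
    return panel


# Hypothetical per-advisor scores for the steps that may follow "shuffle-split".
advisors = {
    "expert": {
        "shuffle-split": {"fit-decision-tree": 0.7, "fit-linear-regression": 0.3},
    },
    "crowd": {
        "shuffle-split": {
            "fit-decision-tree": 0.5,
            "fit-random-forest": 0.3,
            "describe": 0.2,
        },
    },
}

panel = build_panel("shuffle-split", advisors, k=2)
# Steps suggested by both advisors can be identified without merging the lists.
shared = set(panel["expert"]) & set(panel["crowd"])
print(panel, shared)
```

Keeping one list per advisor, rather than a single merged ranking, preserves the source of each suggestion for the user.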

If a user is unsatisfied with a given set of recommendations, they can choose from Lodestar’s full catalog of analysis steps in a drop-down menu at the bottom of each recommendation panel. This list is available in the supplementary materials.

Notebook cell

Once a selection is made in a recommendation panel, the selected analysis step is highlighted and the results are displayed in a new notebook cell, allowing the user to review their past selections and the corresponding results with each subsequent step. Furthermore, the user is able to go back and update the results at any time by selecting a different analysis step in any of the previous recommendation panels. Any cell can also be deleted, which triggers the removal of all downstream cells that depend on the deleted cell. In this way, Lodestar maintains a linear structure in the notebook, making it easier for users to navigate within the analysis.
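A minimal sketch of the deletion behavior, assuming the notebook is a strictly linear chain in which every cell depends on the one before it (so all later cells are downstream). The function name `delete_cell` and the step names are hypothetical.

```python
from typing import List


def delete_cell(cells: List[str], index: int) -> List[str]:
    """Remove the cell at `index` together with every cell after it."""
    if not 0 <= index < len(cells):
        raise IndexError(f"no cell at position {index}")
    return cells[:index]


workflow = ["load", "drop-missing", "sort-values", "histogram"]
print(delete_cell(workflow, 2))  # ['load', 'drop-missing']
```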

To help users understand the functionality of each recommended analysis step and its purpose within the context of the larger data science process, notebook cells consist of five tabs (Figure 3). Each tab describes the behavior of the analysis block represented by this notebook cell. We refined the design of each tab based on the feedback we received from the formative study:

Output Data Frame: Default view that renders the output data frame produced by executing the analysis step as a table.

Analysis Results: Displays the raw results produced by the analysis step (e.g. print output or Seaborn visualization).

Script: Displays the Python code within the analysis block.

“What’s this Analysis?”: High-level description of the step.

Analysis Progress: Displays the chain of analyses leading to the current analysis step, where each step has a name.

Figure 3.

Figure grid. Visualizations generated by an analysis block using the Seaborn statistical data visualization package for Python.

Exporting code and results

When the user is ready to migrate their analysis workflow to a related tool, they can export content directly from Lodestar. To export the code for a specific analysis step into an independent Jupyter notebook file, the user can click on the export button next to the Code Script tab of the corresponding cell. To export the entire analysis workflow, the user can click on the export button on the menu panel at the top of the interface. Similarly, Lodestar enables users to export the output data of any displayed notebook cell as a CSV file. To do this, the user clicks on the export button next to the Output Data Frame tab. The user can also download the visualizations displayed in any notebook cell as PNG files.
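To make the export target concrete, the sketch below assembles a minimal Jupyter notebook document (nbformat 4.4 JSON) with one code cell per analysis script. It is not Lodestar's export code: the function name `workflow_to_notebook` and the example scripts are assumptions made for illustration.

```python
import json
from typing import List


def workflow_to_notebook(scripts: List[str]) -> dict:
    """Build a minimal nbformat-4 notebook with one code cell per script."""
    cells = [
        {
            "cell_type": "code",
            "execution_count": None,  # the cell has not been run yet
            "metadata": {},
            "outputs": [],
            "source": script,
        }
        for script in scripts
    ]
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


notebook = workflow_to_notebook([
    "import pandas as pd\ndf = pd.read_csv('data.csv')",
    "df = df.dropna()",
])

# Serialize to the text that would be saved as an .ipynb file.
text = json.dumps(notebook, indent=1)
print(text[:60])
```

Saving `text` to a file with the `.ipynb` extension yields a notebook that Jupyter can open for further editing.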

Advisors and recommendations

The Lodestar recommendation engine is based on the notion of an advisor: a source of analysis recommendations. Lodestar supports multiple advisors, each consisting of a library of analysis steps and a set of advisor-recommended transitions between analysis steps (i.e. a recommendation graph). In our current implementation, we use two advisors: a “crowd” advisor drawn from our semi-automatic code analysis, and an “expert” advisor drawn from the manual code curation. For each advisor, the recommendation panel will show a list of up to five recommendations, ordered by probability, or how frequently this analysis came next in the recommendation graph.
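As a rough sketch of how each advisor's panel could be populated, the snippet below ranks the outgoing edges of the current step by probability and keeps at most five. The edge tables, step names, and probabilities are invented for illustration and do not come from the Lodestar graphs.

```python
from typing import Dict, List, Tuple

# Hypothetical outgoing edges for the current step, one table per advisor.
# Values are transition probabilities; all names and numbers are invented.
ADVISOR_EDGES: Dict[str, Dict[str, float]] = {
    "expert": {"histogram": 0.50, "scatter-plot": 0.30, "group-statistics": 0.20},
    "crowd": {
        "drop-columns": 0.25, "k-means": 0.21, "bar-plot": 0.18,
        "percentile-range": 0.15, "pca": 0.14, "stack-arrays": 0.07,
    },
}


def top_recommendations(edges: Dict[str, float], limit: int = 5) -> List[Tuple[str, float]]:
    """Return up to `limit` next steps, most probable first."""
    ranked = sorted(edges.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


for advisor, edges in ADVISOR_EDGES.items():
    print(advisor, top_recommendations(edges))
```

With these invented numbers the expert panel shows three entries, while the crowd panel is truncated to its five most probable steps.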

In this section, we describe how we build our recommendation graphs for the expert and crowd advisors, and how we enable Lodestar to identify equivalent or related states across both graphs.

Recommendation graph

Lodestar models transitions between analysis steps by treating analysis workflows (e.g. existing tutorials or computational notebooks) as paths taken through a network graph. Each node in the graph is an analysis step, and a directed edge appears in the graph for each pair of consecutive analysis steps observed in a workflow. Lodestar leverages the relative frequency of these transitions to predict which analysis steps are likely to occur next. The particular graph structure used in Lodestar is a Markov chain, and we refer to the final computed graph as a recommendation graph. We believe this to be a good initial step toward modeling analytical decisions, but acknowledge that more complex factors influence their construction. We hope to explore this further in future work.

Lodestar traverses the recommendation graph one state at a time for each user input (i.e. choice of analysis step). As a result, our recommendation approach does not require maintaining specific state about the analysis itself. Instead, the location in the Markov chain serves as state, and transitions (e.g. recommendations) thus depend only on the current state. In this way, recommendations are agnostic of the data being analyzed, thus allowing users to draw from a wider range of tools and techniques (D3).
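Stated formally, this is the Markov property. Writing $X_{t}$ for the analysis step chosen at the user's $t$-th input (notation introduced here purely for illustration, not taken from the Lodestar implementation), the assumption is:

$$
\Pr(X_{t+1} = j \mid X_{t} = i, X_{t-1}, \dots, X_{1}) = \Pr(X_{t+1} = j \mid X_{t} = i)
$$

The next recommendation panel can therefore be computed from the current node alone, with no record of earlier selections and no inspection of the dataset.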

We can infer these recommendation graphs programmatically by mining analysis blocks (i.e. code snippets) from existing computational notebooks. In this case, the analysis blocks are used as the graph states, in place of their corresponding analysis steps. Figure 4 shows the general approach for mining analysis blocks into this recommendation graph. We extract the analysis blocks from existing computational notebooks and recover the transitions between states from the sequences observed in each notebook, with the weights signifying the frequency of observed transitions. Analysis blocks become nodes $B_{i}$ in this graph, and edges represent probabilistic transitions $\Pr (j | i) = P_{i, j}$ , where the probabilities $P_{i, j}$ are taken from a stochastic matrix $P$ that simply represents the frequency of transitions between blocks in the individual sequences.

Figure 4.

Mining blocks into a recommendation graph representing a Markov chain. In Step 1 (left), sources $S_{1}, \dots, S_{n}$ (manually curated or automatically extracted) yield (ordered) sequences of blocks $S_{i} = (B_{1}, \dots, B_{m})$ . In Step 2, a recommendation graph can be derived by matching blocks that appear in multiple sequences and joining the sequences at those nodes. Edges between blocks in the graph are the frequency-weighted state transitions in the chain.

To infer the full recommendation graph, we first construct a separate Markov chain for each notebook (or tutorial) identified as a source for our advisors. Specifically, we model each notebook as a Markov chain with one state per block and the transition probability to move from block $B_{i}$ to the next block $B_{i + 1}$ for each time step (e.g. user input) expressed as $\Pr (i + 1 | i) = 1$ . Similar analysis steps are labeled with the same high-level identifier, representing a broader category of computation that transcends individual notebooks (e.g. $B 1$ , $B 2$ , etc. in Figure 4). The result is a larger two-dimensional nested list, where each notebook is one row within the list (i.e. the left side of Figure 4), and each column a sequence of analysis step categories.

We can then merge the resulting sequences into a single graph (e.g. merging $S_{1}$ , $S_{2}$ , etc. in Figure 4), and aggregate the relative frequencies associated with the different categories to determine transition weights (i.e. how often do we see blocks from category B1 executed before blocks from category B2?).

Specifically, the transition probability $P_{i, j}$ (and thus edge weight in the recommendation graph) for the $i^{th}$ row and the $j^{th}$ column is the number of edges from $B_{i}$ to $B_{j}$ across all the sequences, divided by the out-degree of $B_{i}$. In other words, the graph will have no edges (weight 0) between blocks that never appeared in sequence, and will have normalized weights for blocks that fan out to multiple different destinations (because they are used by many notebooks). To bootstrap the recommendation, we recommend the first analysis in all the sequences (the root nodes in the graph). To validate this method, we manually calculated the probabilities of two edges, one from each of the crowd and expert advisors. For example, we found that the recommendation graph of the expert advisor specified a probability of 33% for group statistic calculation transitioning to an ANOVA-test. Indeed, we found that this transition occurred 33% of the time in our sample of tutorials. The recommendation graph corresponding to our crowd advisor specified a probability of 7% for matrix-normalization calculations transitioning to the use of the numpy-hstack operation. We found this was accurate with a manual calculation of the probability of their co-occurrence.
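The construction described above can be sketched in a few lines. The sequences below are invented, the function names are hypothetical, and the sketch reads "out-degree" as the total number of transitions observed out of a block (counted with multiplicity), so that each row of the matrix sums to 1; that reading is our assumption.

```python
from collections import Counter, defaultdict
from typing import Dict, List

# Hypothetical block-category sequences, one per source notebook or tutorial.
sequences: List[List[str]] = [
    ["B1", "B2", "B3"],
    ["B1", "B2", "B4"],
    ["B1", "B3"],
    ["B5", "B2", "B3"],
]


def build_recommendation_graph(seqs: List[List[str]]) -> Dict[str, Dict[str, float]]:
    """Count consecutive pairs, then normalize each row into probabilities."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for seq in seqs:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    graph: Dict[str, Dict[str, float]] = {}
    for current, outgoing in counts.items():
        total = sum(outgoing.values())  # transitions observed out of `current`
        graph[current] = {nxt: n / total for nxt, n in outgoing.items()}
    return graph


def root_blocks(seqs: List[List[str]]) -> List[str]:
    """Blocks that start at least one sequence, used to bootstrap recommendations."""
    return sorted({seq[0] for seq in seqs if seq})


graph = build_recommendation_graph(sequences)
print(graph["B1"])  # B2 with probability 2/3, B3 with probability 1/3
print(graph["B2"])  # B3 with probability 2/3, B4 with probability 1/3
print(root_blocks(sequences))  # ['B1', 'B5']
```

Blocks that are never followed by another block (here `B3` and `B4`) have no outgoing edges and therefore no row in the graph.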

Extracting analysis blocks for the expert advisor

We extracted analysis blocks for our expert advisor from online tutorials (our supplemental materials include a detailed report of the full process for extracting and curating analysis blocks for the expert advisor: https://osf.io/3gpsy/). These tutorials were either Jupyter Notebooks or blogs which clearly delineated code from text. Analysis blocks correspond directly to code cells found in tutorial notebooks, or self-contained code snippets found in blog posts. We derived analysis blocks from these sources because we believed that they could sufficiently encapsulate current best practices in Python data science programing (D1).

While there exist many data science resources online, their focus and depth vary widely, from simple hands-on learning for beginners to expert-level guides on deep learning, sensitivity analysis, and model building and tuning. As a rule, we picked guides focused on teaching a specific analysis task (D2). We narrowed our search to end-to-end data science examples, which provide concrete sequences of analysis steps along the data science pipeline. Specifically, we selected examples that have an explicit purpose for the data analysis, step-by-step explanations and results, and runnable code. These requirements helped to ensure that the extracted analysis blocks had similar types of functionality and were of high quality.

Formatting analysis blocks for the expert advisor

To ensure that the extracted analysis blocks are executable in Lodestar, we also apply a separate code curation process. From our experience, each source has a specific analysis goal, and the blocks across different sources may use different libraries, data attributes, and variables to achieve it. For example, a tutorial using the Boston housing dataset may generate a scatter plot to examine a linear relationship between four housing attributes, while in a school test-scores dataset it only makes sense to examine a linear relationship between two attributes. This nuance is useful for manual analysis, but cannot be directly used in a generic data analysis system such as Lodestar. In other words, the analysis blocks must be curated – typically generalized – to be applicable across multiple applications.

The block curation process is idiosyncratic, but consists of the following steps: (1) adding missing dependencies, (2) replacing data-specific labels and attributes, (3) setting appropriate default parameters, and (4) generalizing the code to accept generic data frames as input and to return data frames as output. This process is very similar to our curation strategy for recommendations from our “crowd” advisor. We manually compared new blocks to existing blocks within the library to ensure there were no duplicates. Upon completion of the curation process, each new analysis block is added to the library for the recommendation graph.

Managing analysis blocks for the crowd advisor

We extracted analysis blocks for our crowd advisor from a corpus of approximately 6000 Jupyter notebooks, originally collected by Rule et al.³ We filtered out notebooks that did not contain import statements and API calls using common data science libraries, such as Numpy, Scikit-Learn, or Pandas. We first partition each notebook into discrete analysis blocks. For Jupyter notebooks, the code is often already partitioned by the notebook authors through the use of Jupyter notebook code cells. Our straightforward approach is to identify existing cells in the Jupyter notebook corpus as separate analysis blocks for Lodestar. Please see our supplemental materials for a detailed report on our full process for extracting and curating analysis blocks for the crowd advisor.

Our key insight for this process is that similar data analysis steps often use similar API calls in the code. For example, notebooks that leverage sklearn.linear_model to build a linear regression model using the LinearRegression() module could be characterized as performing the same analytical step. Using this idea, we construct a term vector to represent each analysis block, where the vector represents the normalized frequency of each API call that appears within the block. Each entry in the vector represents a unique API call observed in any notebook in the dataset, allowing the vectors for the block to be compared with any other block.
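One way to build such term vectors is sketched below using Python's standard `ast` module. The choices made here are assumptions, not the paper's pipeline: a call is identified only by its final function or method name (so `df.dropna()` counts as `dropna`), and frequencies are normalized by the number of calls in the block.

```python
import ast
from collections import Counter
from typing import List, Tuple


def api_calls(code: str) -> Counter:
    """Count the function or method name of every call in a code block."""
    counts: Counter = Counter()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):  # e.g. df.dropna() -> "dropna"
                counts[func.attr] += 1
            elif isinstance(func, ast.Name):  # e.g. print(...) -> "print"
                counts[func.id] += 1
    return counts


def term_vectors(blocks: List[str]) -> Tuple[List[str], List[List[float]]]:
    """Return the shared vocabulary and one normalized vector per block."""
    per_block = [api_calls(code) for code in blocks]
    vocabulary = sorted(set().union(*per_block))
    vectors = []
    for counts in per_block:
        total = sum(counts.values())
        vectors.append([counts[name] / total if total else 0.0 for name in vocabulary])
    return vocabulary, vectors


blocks = [
    "model = LinearRegression()\nmodel.fit(X, y)\nprint(model.score(X, y))",
    "df = pd.read_csv('cars.csv')\ndf = df.dropna()",
]
vocabulary, vectors = term_vectors(blocks)
print(vocabulary)
print(vectors)
```

Because every vector shares the same vocabulary, any two blocks can be compared directly, which is what the clustering step relies on.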

We use these term vectors to cluster the analysis blocks. Specifically, the normalized vectors are passed to a $k$-means clustering algorithm to be clustered for similarity. After some iteration, we identified 200 clusters as an ideal number for grouping the analysis blocks extracted from our corpus (please see our supplemental materials for more details). We observed cluster strength through the distribution of cells across clusters and through a silhouette score (score = 0.287). Given the underlying quality of the dataset, we found that 200 clusters were acceptable for manual processing. Each resulting cluster represents a set of analysis blocks that share similarities in functionality, and thus could also represent a shared or synonymous analysis step across the corresponding Jupyter notebooks.
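The clustering step can be illustrated with scikit-learn, which we assume here for convenience; the text does not state which implementation was used. The toy vectors are invented, and two clusters replace the 200 used for the real corpus.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy term vectors (rows are blocks, columns are API calls); invented values.
vectors = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.6, 0.4, 0.0, 0.0],
    [0.4, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.3, 0.7],
])

# Two clusters are enough for this toy data; the corpus in the text used 200.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(vectors)

# The silhouette score lies in [-1, 1]; higher means tighter, better-separated clusters.
score = silhouette_score(vectors, labels)
print(labels, round(float(score), 3))
```

Repeating this for several values of `n_clusters` and comparing the scores and cluster sizes is one common way to settle on a cluster count.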

Of the 200 representatives (one for each cluster), we ultimately selected 22 blocks as a starting set for the Lodestar library. For any given cluster, Lodestar needs a way of recommending a single analysis block to users. We use code-line count as a heuristic to pick a representative analysis block from each cluster. Specifically, we pick the block which has the median number of lines relative to all other blocks within a cluster. We posit that this will yield the “average” code unit.
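The median-line-count heuristic is simple enough to sketch directly. Two details below are assumptions the text leaves open: blank lines are ignored when counting, and the lower median is taken when a cluster has an even number of blocks. The function names are hypothetical.

```python
from typing import List


def line_count(block: str) -> int:
    """Number of non-blank lines in a code block."""
    return sum(1 for line in block.splitlines() if line.strip())


def pick_representative(cluster: List[str]) -> str:
    """Return the block whose line count is the median within its cluster."""
    if not cluster:
        raise ValueError("cluster is empty")
    ordered = sorted(cluster, key=line_count)
    return ordered[(len(ordered) - 1) // 2]  # lower median for even-sized clusters


cluster = [
    "df.describe()",
    "import pandas as pd\ndf = pd.read_csv('data.csv')\ndf.describe()",
    "a = 1\nb = 2\nc = 3\nd = 4\ne = 5\nf = 6",
]
print(line_count(pick_representative(cluster)))  # 3
```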

Blocks for both the crowd and expert advisors are formatted to follow the same consistent structure assumed by the Lodestar system. We format each analysis block to be a Python function, include necessary imports, convert the function’s input and output to a data frame, and remove print statements and irrelevant comments.
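A formatted block might look like the sketch below: a self-contained Python function with its own imports that takes a data frame and returns a data frame, with no print statements. The two functions and the toy data are hypothetical examples written for illustration, not entries from the Lodestar library.

```python
import pandas as pd


def drop_missing_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical curated block: remove rows that contain missing values."""
    return df.dropna().reset_index(drop=True)


def numeric_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical curated block: summary statistics for numeric columns."""
    return df.select_dtypes(include="number").describe()


cars = pd.DataFrame({"mpg": [21.0, None, 30.5], "cylinders": [6, 4, 4]})
cleaned = drop_missing_rows(cars)  # the output of one block ...
summary = numeric_summary(cleaned)  # ... is the input of the next
print(summary)
```

Because every block shares this data-frame-in, data-frame-out signature, any block can follow any other in a workflow.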

Identifying synonymous states across advisors

Of course, managing multiple advisors means that the system must track the state of the analysis in the recommendation graph for all advisors when the user selects a recommendation from a specific advisor. Our current solution uses a multi-level tagging mechanism where each block is manually tagged given its functionality; for example, a decision tree block could be tagged with train-model and test-model. Tags correspond to steps in the data analysis workflow. We developed an understanding of these steps using previous studies.⁴⁴,⁵⁹,⁶⁰ Much like Yan and He,⁴⁴ we cast particular Python APIs to specific analysis steps. For example, the Pandas dropna function was cast as a data-cleaning operation since dropping empty elements is a common way to clean data. Our tags include: statistical-sampling, visualization, data-organization, data-cleaning, data-formatting, and statistical-summary.

By tagging analysis blocks in this way, we allow for matching the new state of the specific advisor, chosen by the user, to relevant states in the other advisors. More specifically, if the user chooses a recommendation from the expert advisor that suggests running a specific decision tree block, the Lodestar engine will advance the crowd advisor to a state in its recommendation graph that corresponds to the train-model and test-model tags. This design, together with ordering recommendations by probability, allows Lodestar to guide users toward best practices.

The same functionality is used when the user eschews all of the recommendations and instead selects directly from the library through the drop-down box in the recommendation panel. In this case, all of the advisor models will be advanced to the appropriate state matching the block that the user executed. This allows the user to iterate on techniques unhindered by a guided system.
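The tag-based synchronization can be sketched as a lookup over tag sets. The matching rule used here (pick the state that shares the most tags, and return nothing when no tags overlap) is our assumption, as are the block names and tag assignments; the text specifies only that states are matched through their tags.

```python
from typing import Dict, FrozenSet, Optional

# Hypothetical tag assignments for blocks in each advisor's graph.
EXPERT_TAGS: Dict[str, FrozenSet[str]] = {
    "decision-tree": frozenset({"train-model", "test-model"}),
    "histogram": frozenset({"visualization"}),
}
CROWD_TAGS: Dict[str, FrozenSet[str]] = {
    "random-forest": frozenset({"train-model", "test-model"}),
    "bar-plot": frozenset({"visualization"}),
    "dropna": frozenset({"data-cleaning"}),
}


def matching_state(tags: FrozenSet[str], advisor_tags: Dict[str, FrozenSet[str]]) -> Optional[str]:
    """Return the state sharing the most tags with `tags`, or None if none overlap."""
    best, best_overlap = None, 0
    for state, state_tags in advisor_tags.items():
        overlap = len(tags & state_tags)
        if overlap > best_overlap:
            best, best_overlap = state, overlap
    return best


# The user picks the expert's decision tree block; the crowd advisor follows.
chosen = "decision-tree"
crowd_state = matching_state(EXPERT_TAGS[chosen], CROWD_TAGS)
print(crowd_state)  # random-forest
```

The same lookup would apply when a block is chosen from the drop-down catalog: its tags are matched against every advisor's graph in turn.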

Summative evaluation

We conducted a second user study to evaluate the improvements we made to Lodestar after receiving formative feedback from the first user study. While the formative study evaluated the usability of the early prototype, this summative study evaluated the viability of Lodestar for learning data science practices.

The summative study was conducted over a period of 1 month. We interviewed 10 fledgling data scientists. Seven of these participants were new to Lodestar and three were part of our formative study. All participants were undergraduate university students who demonstrated knowledge of data science fundamentals through a university-level introductory course or other relevant experiences. Again, some students had experience with performing analysis on platforms like Excel and Tableau, but this was not a prerequisite for the recruitment process. Similar to the formative study, we chose undergraduate students for our user study population because they were learning data science principles for the first time. This study was approved by our home institution’s IRB.

Method

Each interview lasted for 60 min and was divided into four phases. Unlike the formative study, this study was conducted exclusively online with video conferencing software. Prior to the interview, each participant gave us explicit consent to record audio and screen capture throughout the duration of the interview.

In our formative study, participants did not use Lodestar before completing the data exploration task, making it difficult to tease apart design challenges with the Lodestar prototype from a lack of user training. As a result, we included a separate training session where participants were given an overview of system features and then trained to use the system with an initial demo dataset. The training session was then followed by a think-aloud data exploration session that lasted for 15–20 min. Questions posed in the formative study can be found in the supplement. Afterward, participants were asked to verbally respond to a post-exploration questionnaire that assessed their view on the viability of the system. Questions included:

What do you like about Lodestar? What do you dislike? Why?

Would you use Lodestar outside of this study? Why/why not?

If so, in what situations could you see yourself using Lodestar?

We analyzed participant responses from the think-aloud and questionnaire portion for themes regarding the usability of Lodestar. We present the participants’ quotes which represent common themes.

Results

Here, we summarize both the strengths of our design as well as opportunities for future improvements as noted by participants. Identifiers for new participants begin with “N,” whereas returning participants have the same identifiers as before.

Lodestar Strengths

Intuitive and supportive UI features

Many participants said that they found the (new) Lodestar interface design to be intuitive. For example, participant NP4 said they “liked [a lot] of the different UI features like the tooltips and collapsing [views].” Participant NP4 also liked that they were able to verify different analyses all on one page. FP2, FP3, and NP6 echoed this sentiment. Thus, the new learning widgets in Lodestar (e.g. tooltips and tabs) seem to help users learn how to use the interface and verify their work.

Integrates with existing tools and workflows

Participants particularly liked the ability to export their workflow as a Jupyter Notebook file for editing outside of Lodestar. For example, NP4 said they “like the integration with Jupyter Notebook and [the] exporting functionality.”

Eases data science tasks

NP3, NP4, FP2, and NP5 appreciated how Lodestar “recommends what to do with the data, and based on that result, [recommends] something else” (NP4). Overall, they found Lodestar helpful for guiding their analysis.

Lodestar also seemed to be helpful for specific data science tasks. For example, participants liked that Lodestar allowed them to quickly familiarize themselves with a particular dataset, which helped them determine what kinds of patterns or trends to analyze later on. This familiarization process is part of data profiling, an important and early task in the data science process.⁶ Participants also valued the features in Lodestar for data exploration; for example, participant FP3 said using Lodestar was “Really convenient to do exploratory data analysis.” Participant NP1 also stated that they would use Lodestar for “specific cases where [they] don’t know how to write the code.” Participants NP6 and FP1 shared similar sentiments. These findings suggest that Lodestar helps users more easily complete data science tasks without being hindered by low-level programing issues, and may help users learn how relevant code could be written for future data science projects.

Limitations

Customizing visualization outputs

Many participants wanted the ability to customize and choose what attributes of the data were used in generating visualizations. For example, participant FP2 said that “it would be nice to be able to set the parameters...” Thus, even finer-grained control over visualization (and interaction) designs in Lodestar would be a point of improvement.

Additional documentation

Some participants noted that providing more documentation of the features in Lodestar would be helpful when navigating the interface. For example, NP3 stated that it would be helpful if “you could have [some information] about what the data is, where [it] came from, what the columns are.” Thus, the Lodestar interface could be improved further to give users more context for the inputs to each analysis step.

Recommendation evaluation

We conducted a third user study to evaluate the utility of Lodestar recommendations. We recruited three professional data scientists through a convenience sample of people within a professional network connected with the authors. Participant demographics can be found in Table 1. These professionals took part in an hour-long expert review⁶¹ conducted using online videoconferencing. The interviewer initiated a video call with each participant and shared a screen with a running instance of Lodestar.

Table 1.

Participant demographics. These participants were all data science professionals.

Position title	Age	Degree	Exp.
Machine Learning Engineer	30	C.S. B.S	2.6 years
Analytical Engineer	24	D.S. M.S	3.1 years
Data Scientist	34	D.S. M.S	2.4 years

To start, participants were presented with two different analytical scenarios and asked to evaluate the strength of crowd and expert recommendations on behalf of an intern working within two scenarios. The first scenario proposed that an intern would be exploring the cars dataset and using Lodestar to understand the general trends. In the second scenario the intern would be trying to identify the factors which influence housing prices in the boston-housing dataset.

Participants were encouraged to remotely guide interactions with Lodestar recommendations and to build a data science workflow using Lodestar recommendations. At each step, after considering and choosing a recommendation, participants were asked to consider two questions:

Why have they chosen this crowd or expert recommendation?

Are the current crowd or expert recommendations appropriate for an intern performing the current task? Why or why not?

We adopted this protocol to ensure that participants had the flexibility to build an appropriate workflow for the objectives presented and the structure to provide feedback regarding a wide array of analytical branches represented by the Markov recommender. Questions posed to the experts can be found in the supplement.

Results

We transcribed the verbal responses of each participant and coded their responses using the identifiers “crowd,” “expert,” and “workflow.” The crowd identifier classified reflections regarding crowd recommendations. The expert identifier classified comments on expert recommendations. Finally, participants’ reflections on how the data science workflow was constructed and their interactions with Lodestar were identified as comments regarding the workflow. We summarize the results of this analysis in this section. Identifiers for each professional participant begin with “P.”

Expert

All participants found that expert recommendations “were quite reasonable” (P1), and helpful for interns performing exploratory analysis as in the first scenario. Expert recommendations seemed to match expectations and act as a helpful guide for further analysis. P1 and P3 found the recommendations which visualized dataset attributes to be particularly helpful during exploration. For example, P1 said they found the “Category Distribution” recommendation “almost too perfect” (Figure 5).

Figure 5.

Visualization recommendations. Examples of visualizations generated by Lodestar recommendations implemented in Python and Seaborn.

While performing more directed analysis as part of the second scenario, participants found that expert recommendations were helpful during the beginning of the analysis. For example, P1 commented that “looking at the data as the first step makes sense.” However, P1 and P3 expressed a desire for more control over how the recommended techniques were being executed in order to dig deeper into data. All participants found the idea of exporting to a traditional notebook environment to be a useful next step in reaction to reaching the end of a Lodestar analytical track.

Crowd

Participants found crowd recommendations overwhelming and unhelpful for exploratory tasks. The fact that Lodestar displayed low confidence in the recommendations was a particularly strong deterrent for all three participants. For example, P2 said that “there is one [crowd recommendation] I might have wanted to click but... the probability looks really low. So, I’m not really sure how to interpret that” (Figure 6).

Figure 6.

Crowd advisor. The crowd advisor often recommended complex techniques.

P1 and P2 found crowd recommendations equally unhelpful for the directed tasks of the second scenario. However, P3 felt that crowd recommendations could be occasionally helpful when expert techniques seem less varied. Unlike the other participants, P3 performed more than a handful of the advanced techniques suggested by the crowd: k-means clustering, percentile range, and quantitative bar plots.

Workflow

Two participants found that expert and crowd recommendations were appropriately generated based on previous selections. However, these participants also found the data agnosticism of the recommendations to be confusing.

The system seemed to be sufficiently transparent, since all participants navigated to the “Code” tab in order to examine the programing details of each analytical step. However, P1 and P2 did not find the categorization of recommendations based on advisor (code source) to be helpful. Both P1 and P3 suggested categorizing recommendations in more ways to allow for better control over analytical goals. For example, P3 suggested separating recommendations geared toward visualizing a data attribute from recommendations which provide analytical support. This would be an interesting direction for our future work.

All three experts agreed that Lodestar recommendations would be helpful for novice users who were learning new analytical techniques and learning to program.

Summary

Participants found expert recommendations more appropriate for data exploration than crowd recommendations. This seems reasonable given that the tutorials we used to train our expert advisor were demonstrating exploratory data analysis. Participants generally found crowd recommendations difficult to trust and understand. However, P1 and P3 suggested that crowd recommendations occasionally supported directed analysis better than expert recommendations. Finally, participants expressed that a combination of expert and crowd recommendations would support interns who wanted to safely sandbox unfamiliar data analysis techniques.

Discussion

We have presented Lodestar, a computational notebook for rapid experimentation and learning of new data science practices. Instead of forcing fledgling analysts to search for and apply relevant data analysis methods by hand, Lodestar recommends suitable next steps for the current workflow using both manually curated as well as automatically crowd-sourced guidance. Our work on Lodestar has uncovered several interesting discussion points: the prospect for data science for novices, the actual “wisdom” of crowd recommendations, and alternate recommendation mechanisms.

Data science for non-experts

The real power of Lodestar lies not in its data sources, which are publicly available to anyone online, but in its ability to synthesize the knowledge from these diverse sources into a single unified model. By sharing this knowledge in the form that data scientists are most familiar with – Python (or R) source code – Lodestar provides reusable building blocks that can be transferred across workflows.

However, for the tool to be truly effective for its purpose, the library of analysis blocks must be expanded and drawn from a large set of sources. For example, new data sources could be incorporated to customize Lodestar for specific disciplines such as bio-informatics, computational journalism, and computer vision. Lodestar’s advisor model may be one way to support this; as suggested by our recommendation user study, instead of the “expert” versus “crowd” dichotomy that our current implementation uses, a more robust implementation could support a plethora of pluggable advisors drawn from a central repository. In this way, the advisors, analysis blocks, and library could be community-driven and improved by anyone.

Choosing an analysis step or interpreting results in our current prototype still requires baseline data science knowledge, such as from a university data science course (indeed, all our participants had this). However, the Lodestar approach does help compensate for a lack of practical data science experience, which is common when learning has been mostly academic.

The philosophy of the current Lodestar implementation is to give the user as many options as possible for how to proceed with the analysis. However, choice is sometimes a bad thing: for a novice data scientist, getting multiple – and, worse, conflicting – advice can be bewildering. In future work, it would be interesting to curate and coordinate recommendations from multiple advisors to help the user make better and more informed choices.

On the “wisdom of the crowd” for data analysis

While we are excited about the prospects of the “wisdom of the crowd”⁶² for data science and analysis, it has become clear that this is an area that will require significantly more work. For example, our current approach is not entirely automated; manual curation is still required in choosing a representative block from the clustering analysis and in editing the block into the appropriate form that Lodestar expects, including eliminating side effects, removing output statements, and resolving dependencies. We plan to automate these steps in the future.

The need for manual curation, or at least review, is compounded by the fact that a significant portion of Rule et al.’s Jupyter corpus³ was of low quality: some notebooks had cells with a single line of code, or all of the source code in a single cell. Many had non-functional code, syntax errors, or code that was never used. While we filtered such notebooks from our analysis, the signal-to-noise ratio in crowdsourced code is often low.

The remedy for many of these challenges can often be found in sheer scale. While we studied the “sampler” dataset containing 6530 notebooks in this paper, the full 600 GB dataset contains more than 1.25 million notebooks. With access to this many examples, we could afford to discard more problematic ones. Furthermore, frequency of use would help ensure that best practices are easier to identify. Of course, a dataset of this size brings with it a new set of scalability challenges. Existing data processing⁶³ and code analysis⁶⁴,⁶⁵ techniques could help address this big data challenge in the future.

Different recommendation strategies

The Lodestar recommendation engine is based on Markov chains, which are useful for representing a sequence of chained states or commands, as in a data science script. However, Markov chains may oversimplify the relationships between analysis steps and data science users in some ways. It would be interesting to study how to use more sophisticated methods as part of the Lodestar recommendation engine. For example, state-of-the-art recommender systems tend to be organized into collaborative filtering, content-based filtering, and hybrid filtering.⁶⁶ Collaborative filtering is based on a social view of recommendation, where the behavior of other users, such as their navigation and ratings, as well as their personal traits, is used to match content to a specific user. In the case of Lodestar, this would enable the historical preferences of Lodestar users to guide other users. For content-based filtering, recommendations can be derived by comparing items to recommend with user preferences and auxiliary information. This approach could enable Lodestar users to be matched to specific analysis steps based on, for example, workflows they have created in the past, specific data types, and metadata for existing datasets and code. Finally, we could combine methods to develop new hybrid recommendation strategies.
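To make the Markov chain idea concrete, the following minimal sketch estimates first-order transition probabilities between analysis steps from a set of example workflows and recommends the most likely next steps. It is an illustration only, not the Lodestar implementation: the function names `build_transitions` and `recommend`, the workflows, and the step labels are hypothetical.

```python
from collections import Counter, defaultdict


def build_transitions(workflows):
    """Count how often each analysis step is directly followed by another."""
    counts = defaultdict(Counter)
    for steps in workflows:
        for current, following in zip(steps, steps[1:]):
            counts[current][following] += 1
    return counts


def recommend(counts, current, k=3):
    """Return up to k next steps with estimated transition probabilities."""
    followers = counts.get(current)
    if not followers:
        return []  # state never observed with a successor
    total = sum(followers.values())
    return [(step, n / total) for step, n in followers.most_common(k)]


# Hypothetical workflows; the step names are illustrative only.
workflows = [
    ["load", "describe", "dropna", "plot"],
    ["load", "describe", "plot"],
    ["load", "dropna", "groupby", "plot"],
]
transitions = build_transitions(workflows)
print(recommend(transitions, "load"))
```

Because the recommendation depends only on the current step, such a model ignores earlier history, the dataset, and the user, which is the simplification that the filtering approaches above could address.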

A recent development in artificial intelligence is to build recommender systems using deep learning techniques (or deep recommenders),⁴⁵,⁶⁷ particularly for content-based approaches. Given our large available corpus of potential training data, unsupervised methods such as Recurrent Neural Networks could prove useful, since they are well suited to sequential data. The Lodestar advisor model provides a useful framework from which to incorporate and merge future recommendation strategies for data science. However, these topics are beyond the scope of this paper.

Limitations and future work

Our evaluation of the recommender suggests that curated recommendations geared toward different types of analytical goals (e.g. data cleaning, data exploration, visualization, etc.) can enable rapid experimentation with different programming techniques for real-world analysis. We limited our system to tutorials that presented high-level data exploration techniques. Advisors specifically geared to provide recommendations on data wrangling techniques could help Lodestar users handle messy real-world datasets. Advisors can even be designed to provide “unique” or less popular recommendations to ensure Lodestar users consider many options. Diversifying the recommendation techniques to target multiple goals will be an important part of future work.

When we first curated crowd techniques, our aim was to introduce users to a variety of more modern libraries and conventions – particularly since expert tutorials become outdated. The professional data scientists in our study were, understandably, sometimes confused by the complexity of the techniques presented by the crowd advisor. We believe that more detailed documentation⁶⁸ would be helpful in reducing this type of confusion.

Due to the many challenges of automatic code analysis, we currently do not allow users to write their own code directly in Lodestar, or even to modify existing code. To make online code editing possible, we would need an automatic classification process that could determine how new code fits into the recommendation graph, so that the system could resume the analysis with new recommendations after a manually written code block. Such live updates to the recommender are not currently part of Lodestar, but are an interesting direction for future work. These live updates would provide a view into the “latest” analysis trends and provide a means for the Lodestar analysis library to grow over time. We anticipate that such a live update mechanism would feed into a dashboard through which Lodestar users could further view, label, and filter for valid analysis techniques.

We made several design decisions for the Lodestar notebook that will need to be revisited for a general implementation. Lodestar currently does not consider the specifics of each input dataset while making recommendations; it only displays recommendations that do not fail to execute on the selected dataset. This should be studied in future work. Furthermore, all of our analysis blocks take a Pandas data frame as input and generate a new data frame as output. However, other disciplines use other data representations, and some computations may require passing multiple data objects as arguments. To address these limitations, we look to improving our existing design and thoroughly evaluating these improvements in our future work.
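The execution-based filtering described above can be sketched as a trial run of each candidate block on a copy of the data, keeping only the blocks that complete without raising. This is a simplified illustration under stated assumptions rather than the Lodestar code: plain lists of dictionaries stand in for Pandas data frames so that the example has no dependencies, and the function `runnable_blocks` and the two example blocks are hypothetical.

```python
import copy


def runnable_blocks(blocks, data):
    """Return the names of blocks that run on `data` without raising."""
    kept = []
    for name, block in blocks.items():
        try:
            # Trial run on a deep copy so a block cannot alter the input.
            block(copy.deepcopy(data))
        except Exception:
            continue  # the block fails on this dataset; do not recommend it
        kept.append(name)
    return kept


# Hypothetical analysis blocks: each takes a table and returns a new table.
def mean_age(rows):
    return [{"mean_age": sum(r["age"] for r in rows) / len(rows)}]


def total_income(rows):
    return [{"total_income": sum(r["income"] for r in rows)}]


# Rows of dicts stand in for a data frame; there is no "income" column.
rows = [{"name": "a", "age": 30}, {"name": "b", "age": 40}]
blocks = {"mean_age": mean_age, "total_income": total_income}
print(runnable_blocks(blocks, rows))
```

A filter of this kind only establishes that a block executes; it says nothing about whether the result is meaningful for the dataset, which is why dataset-aware recommendation remains future work.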

Finally, with the release of state-of-the-art Large Language Models (LLMs) from OpenAI, Microsoft, and Google, it is safe to say that the future of data science recommendation is changing rapidly. Plugins for these LLMs already exist that allow users to upload datasets and then ask for customized data science analysis using natural language queries. We think that the findings in this paper can help guide and influence these future directions.

Footnotes

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by the Foundation for the National Institutes of Health grant no. R01GM114267.

ORCID iDs

Deepthi Raghunandan

Niklas Elmqvist

References

1. Kross, Guo. Practitioners teaching data science in industry and academia: Expectations, workflows, and challenges. In: Proceedings of the ACM conference on human factors in computing systems, 2019, pp.1–14. New York, NY: ACM.

2. Kluyver, Ragan-Kelley, Pérez, et al. Jupyter notebooks – a publishing format for reproducible computational workflows. In: Fernando L and Birgit S (eds) Positioning and power in academic publishing: players, agents and agendas. Amsterdam: IOS Press, 2016, pp.87–90.

3. Rule, Tabard, Hollan. Exploration and explanation in computational notebooks. In: Proceedings of the ACM conference on human factors in computing systems, 2018, pp.32:1–32:12. New York, NY: ACM.

4. Russell, Stefik, Pirolli, et al. The cost structure of sensemaking. In: Proceedings of the ACM conference on human factors in computing systems, 1993, pp.269–276. New York, NY: ACM.

5. Pirolli, Card. The sensemaking process and leverage points for analyst technology as identified through cognitive task analysis. In: Proceedings of the International Conference on Intelligence Analysis, 2005, pp.2–4, vol. 5. McLean, VA: The MITRE Corporation.

6. Kandel, Paepcke, Hellerstein, et al. Enterprise data analysis and visualization: an interview study. IEEE Trans Vis Comput Graph 2012; 18(12): 2917–2926.

7. Stolte, Tang, Hanrahan. Polaris: A system for query, analysis, and visualization of multidimensional relational databases. IEEE Trans Vis Comput Graph 2002; 8(1): 52–65.

8. Satyanarayan, Heer. Lyra: an interactive visualization design environment. Comput Graph Forum 2014; 33(3): 351–360.

9. Méndez, Nacenta, Vandenheste. iVoLVER: Interactive visual language for visualization extraction and reconstruction. In: Proceedings of the ACM conference on human factors in computing systems, 2016, pp.4073–4085. New York, NY: ACM.

10. Ren, Höllerer, Yuan. IVisDesigner: Expressive interactive design of information visualizations. IEEE Trans Vis Comput Graph 2014; 20(12): 2092–2101.

11. Kim, Schweickart, Liu, et al. Data-driven guides: supporting expressive design for information graphics. IEEE Trans Vis Comput Graph 2017; 23(1): 491–500.

12. Liu, Thompson, Wilson, et al. Data illustrator: Augmenting vector design tools with lazy data binding for expressive visualization authoring. In: Proceedings of the ACM conference on human factors in computing systems, 2018, pp.1–13. New York, NY: ACM.

13. Xia, Riche, Chevalier, et al. DataInk: Direct and creative data-oriented drawing. In: Proceedings of the ACM conference on human factors in computing systems, 2018, pp.223:1–223:13. New York, NY: ACM.

14. Ren, Lee, Brehmer. Charticulator: interactive construction of bespoke chart layouts. IEEE Trans Vis Comput Graph 2019; 25(1): 789–799.

15. Yalcin, Elmqvist, Bederson. Keshif: rapid and expressive tabular data exploration for novices. IEEE Trans Vis Comput Graph 2018; 24(8): 2339–2352.

16. Javed, Elmqvist. ExPlates: Spatializing interactive analysis to scaffold visual exploration. Comput Graph Forum 2013; 32(3pt4): 441–450.

17. Bostock, Ogievetsky, Heer. D³: data-driven documents. IEEE Trans Vis Comput Graph 2011; 17(12): 2301–2309.

18. Bostock, Heer. Protovis: A graphical toolkit for visualization. IEEE Trans Vis Comput Graph 2009; 15(6): 1121–1128.

19. Wickham. ggplot2: Elegant Graphics for Data Analysis. New York, NY: Springer, 2016.

20. Satyanarayan, Russell, Hoffswell, et al. Reactive Vega: A streaming dataflow architecture for declarative interactive visualization. IEEE Trans Vis Comput Graph 2016; 22(1): 659–668.

21. Satyanarayan, Moritz, Wongsuphasawat, et al. Vega-Lite: A grammar of interactive graphics. IEEE Trans Vis Comput Graph 2017; 23(1): 341–350.

22. Herlocker, Konstan, Terveen, et al. Evaluating collaborative filtering recommender systems. ACM Trans Inf Syst 2004; 22(1): 5–53.

23. Mackinlay. Automating the design of graphical presentations of relational information. ACM Trans Graph 1986; 5(2): 110–141.

24. Bertin. Semiology of graphics: diagrams, networks, maps. Madison, WI: University of Wisconsin Press, 1983.

25. Cleveland, McGill. Graphical perception: theory, experimentation, and application to the development of graphical methods. J Am Stat Assoc 1984; 79(387): 531–554.

26. Mackinlay, Hanrahan, Stolte. Show me: Automatic presentation for visual analysis. IEEE Trans Vis Comput Graph 2007; 13(6): 1137–1144.

27. Roth, Kolojejchick, Mattis, et al. Interactive graphic design using automatic presentation knowledge. In: Proceedings of the ACM conference on human factors in computing systems, 1994, pp.112–117. New York, NY: ACM.

28. Seo, Shneiderman. A rank-by-feature framework for interactive exploration of multidimensional data. Inf Vis 2005; 4(2): 96–113.

29. Vartak, Madden, Parameswaran, et al. SeeDB: automatically generating query visualizations. Proc Large Database Endow 2014; 7(13): 1581–1584.

30. Perry, Howe, Key AMF, et al. VizDeck: Streamlining exploratory visual analytics of scientific data. In: Proceedings of the iConference, 2013, pp.338–350. Fort Worth, TX: iSchools.

31. van den Elzen, van Wijk. Small multiples, large singles: A new approach for visual data exploration. Comput Graph Forum 2013; 32(3pt2): 191–200.

32. Wongsuphasawat, Moritz, Anand, et al. Voyager: exploratory analysis via faceted browsing of visualization recommendations. IEEE Trans Vis Comput Graph 2016; 22(1): 649–658.

33. Saket, Kim, Brown, et al. Visualization by demonstration: an interaction paradigm for visual data exploration. IEEE Trans Vis Comput Graph 2017; 23(1): 331–340.

34. Siddiqui, Kim, Lee, et al. Effortless data exploration with zenvisage: an expressive and interactive visual analytics system. Proc Large Database Endow 2016; 10(4): 457–468.

35. Wongsuphasawat, Moritz, et al. Voyager 2: Augmenting visual analysis with partial view specifications. In: Proceedings of the ACM conference on human factors in computing systems, 2017, pp.2648–2659. New York, NY: ACM.

36. Moritz, Wang, Nelson, et al. Formalizing visualization design knowledge as constraints: Actionable and extensible models in Draco. IEEE Trans Vis Comput Graph 2019; 25(1): 438–448.

37. Bakker, et al. VizML: A machine learning approach to visualization recommendation. In: Proceedings of the ACM conference on human factors in computing systems, 2019, pp.1–12. New York, NY: ACM.

38. Qian, Rossi, et al. Learning to recommend visualizations from data. In: Proceedings of the ACM conference on knowledge discovery and data mining, 2021, pp.1359–1369. New York, NY: ACM.

39. Epperson, Jung-Lin Lee, Wang, et al. Leveraging analysis history for improved in situ visualization recommendation. Comput Graph Forum 2022; 41(3): 145–155.

40. Tang, Han, Yiu, et al. Extracting top-k insights from multi-dimensional data. In: Proceedings of the ACM conference on management of data, 2017, pp.1509–1524. New York, NY: ACM.

41. Demiralp, Haas, Parthasarathy, et al. Foresight: Recommending visual insights. Proc Large Database Endow 2017; 10(12): 1937–1940.

42. Cui, Badam, Yalçin, et al. DataSite: Proactive visual data exploration with computation of insight-based recommendations. Inf Vis 2019; 18(2): 251–267.

43. Srinivasan, Drucker, Endert, et al. Augmenting visualizations with interactive data facts to facilitate interpretation and communication. IEEE Trans Vis Comput Graph 2019; 25(1): 672–681.

44. Yan. Auto-Suggest: Learning-to-recommend data preparation steps using data science notebooks. In: Proceedings of the ACM conference on management of data, 2020, pp.1539–1554. New York, NY: ACM.

45. Bar El, Milo, Somech. Automatically generating data exploration sessions using deep reinforcement learning. In: Proceedings of the ACM conference on management of data, 2020, pp.1527–1537. New York, NY: ACM.

46. Knuth. Literate programming. Comput J 1984; 27(2): 97–111.

47. Ufford, Pacer, Seal, et al. Beyond interactive: Notebook innovation at Netflix, https://medium.com/netflix-techblog/notebook-innovation-591ee3221233 (2018, accessed 15 June 2019).

48. Rädle, Nouwens, Antonsen, et al. Codestrates: Literate computing with webstrates. In: Proceedings of the ACM symposium on user interface software and technology, 2017, pp.715–725. New York, NY: ACM.

49. VanderPlas, Granger, Heer, et al. Altair: Interactive statistical visualizations for Python. J Open Source Softw 2018; 3(32): 1057.

50. Conlen, Heer. Idyll: A markup language for authoring and publishing interactive articles on the web. In: Proceedings of the ACM symposium on user interface software and technology, 2018, pp.977–989. New York, NY: ACM.

51. Badam, Mathisen, Rädle, et al. Vistrates: A component model for ubiquitous analytics. IEEE Trans Vis Comput Graph 2019; 25(1): 586–596.

52. Wood, Kachkaev, Dykes. Design exposition with literate visualization. IEEE Trans Vis Comput Graph 2019; 25(1): 759–768.

53. Drosos, Barik, Guo, et al. Wrex: A unified programming-by-example interaction for synthesizing readable code for data scientists. In: Proceedings of the ACM conference on human factors in computing systems, 2020, pp.1–12. New York, NY: ACM.

54. Kery, Ren, Hohman, et al. Mage: Fluid moves between code and graphical work in computational notebooks. In: Proceedings of the ACM symposium on user interface software and technology, 2020, pp.140–151. New York, NY: ACM.

55. Head, Jiang, Smith, et al. Composing flexibly-organized step-by-step tutorials from linked source code, snippets, and outputs. In: Proceedings of the ACM conference on human factors in computing systems, 2020, pp.1–12. New York, NY: ACM.

56. Mathisen, Horak, Klokmose, et al. InsideInsights: Integrating data-driven reporting in collaborative visual analytics. Comput Graph Forum 2019; 38(3): 649–661.

57. Drozdal, Weisz, Wang, et al. Trust in AutoML: Exploring information needs for establishing trust in automated machine learning systems. In: Proceedings of the ACM conference on intelligent user interfaces, 2020, pp.297–307. New York, NY: ACM.

58. Tufte. The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press, 2001.

59. Battle, Heer. Characterizing exploratory visual analysis: A literature review and evaluation of analytic provenance in Tableau. Comput Graph Forum 2019; 38(3): 145–159.

60. Kandel, Paepcke, Hellerstein, et al. Wrangler: Interactive visual specification of data transformation scripts. In: Proceedings of the ACM conference on human factors in computing systems, 2011, pp.3363–3372. New York, NY: ACM.

61. Tory, Möller. Evaluating visualizations: Do expert reviews work? IEEE Comput Graph Appl 2005; 25(5): 8–11.

62. Surowiecki. The Wisdom of Crowds: Why the many are smarter than the few and how collective wisdom shapes business, economies, societies, and nations. New York, NY: Anchor Books, 2004.

63. Mudgal, Rekatsinas, et al. Deep learning for entity matching: A design space exploration. In: Proceedings of the ACM conference on management of data, 2018, pp.19–34. New York, NY: ACM.

64. Glassman, Scott, Singh, et al. OverCode: Visualizing variation in student solutions to programming problems at scale. ACM Trans Comput Hum Interact 2015; 22: 1–735.

65. Glassman, Zhang, Hartmann, et al. Visualizing API usage examples at scale. In: Proceedings of the ACM conference on human factors in computing systems, 2018, pp.1–12. New York, NY: ACM.

66. Adomavicius, Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans Knowl Data Eng 2005; 17(6): 734–749.

67. Zhang, Yao, Sun, et al. Deep learning based recommender system: A survey and new perspectives. ACM Comput Surv 2020; 52(1): 1–538.

68. Wang, Drozdal, et al. Documentation matters: human-centered AI system to assist data science code documentation in computational notebooks. ACM Trans Comput Hum Interact 2022; 29(2): 1–33.