Abstract
Artificial intelligence will play an increasingly prominent role in scientific research ecosystems, and will become indispensable as more interdisciplinary science questions are tackled. While in recent years computers have propelled science by crunching through data and leading to a data science revolution, qualitatively different scientific advances will result from advanced intelligent technologies for crunching through knowledge and ideas. We propose seven principles for developing
Introduction
Scientific research is accomplished by an ecosystem of contributors. From principal investigators who propose insightful problems, to graduate students who go deep into a specific question, to lab assistants who patiently sit through experiments, to undergraduates who contribute to simpler mundane tasks, there is a range of contributions made by people with different abilities and levels of experience. What kind of role could intelligent machines have in this ecosystem?
Though I pose this question from the point of view of science, I believe it applies to data science just the same. Scientific discovery is arguably the hardest form of data science, where the goal is to understand rather than correlate, where the phenomena are deeply interacting rather than independent, and where new findings must be integrated with existing theories.
In the last few decades, advances in data-intensive computing have pushed the envelope in the scale of the phenomena that can be studied. Well-designed data structures, efficient algorithms, and distributed computation work in unison to process large-scale data, leading to spectacular discoveries in diverse areas such as high-energy physics, biomedicine, and geosciences. In recent years, the incorporation of intelligent techniques for data mining and machine learning has given rise to data science, bringing powerful data-driven discovery capabilities to scientists [33]. Indeed, a recent cover of Science states “artificial intelligence transforms science” [35].
However, the role of artificial intelligence systems, particularly in machine learning, has been reduced to solving a well-defined task where the data and techniques are given to them by the scientist. This limits our ability to tackle problems where the complexity of the data, the questions, and the tasks challenge our human capabilities to make discoveries. Confining intelligent machines to this data-intensive computing realm is severely limiting our ability to truly harness the potential of machines to enable us to take on larger problems.
I also posit that future scientific endeavors will require partnerships of scientists and thoughtful artificial intelligence systems, where machines will independently pursue substantial aspects of the research and contribute their own discoveries. These intelligent systems should be capable of taking on significant problems by formulating their own research goals, proposing and testing hypotheses, designing theories, debating alternative options, and synthesizing new knowledge. They should be able to explain their reasoning, compare their lines of inference to other possible ones, and situate their findings. Intelligent systems should be able to communicate with scientists with different levels of expertise and understanding in a topic. To form a true partnership, they should be able to take guidance from scientists as well as provide guidance to them in turn.
The paper begins with a discussion of the nature of the complexity of major unresolved questions in science, and why machine intelligence will be necessary to make headway. After a short overview of artificial intelligence research for scientific discovery, I propose seven principles for the development of thoughtful artificial intelligence. I then discuss a personal perspective on a research agenda to achieve this vision.
Science challenges in the new millennium
Our capabilities to do research need to be augmented as scientific questions become significantly more complex. Compare the challenges of finding a cure for polio with finding a cure for cancer. Polio, a scourge that has affected humanity for millennia, was cured through a vaccine that was discovered by one scientist. We are now faced with understanding diseases such as glioblastoma, a brain cancer that progresses to advanced stages within a few months and is very hard to detect and to treat. Scientists with different specialties have different information about the problem, with disparate genomic, proteomic, transcriptomic, MRI, clinical, therapy, and drug data.
Important research questions are increasingly collaborative, requiring dozens, hundreds, and in some cases thousands of people in different disciplines to work together for months or years to produce results. This progression is described very well in [2]: individual researchers were the norm in the past (Galileo, Newton, Darwin), giving way to dual collaborations in the 20th century (Watson and Crick) and then to larger projects such as the Human Genome Project. These collaborations have now become even more sophisticated and complex. For example, the discovery of the Higgs boson required a major mobilization of the particle physics community to build the Large Hadron Collider (LHC), carry out the ATLAS and Compact Muon Solenoid (CMS) experiments, and analyze the resulting data, leading to articles with thousands of authors [37]. Massive amounts of information were created and consumed by different subgroups of researchers in a painstaking and time-consuming manner. As a result, this kind of significant discovery occurs only occasionally and with significant coordination effort.
Computers already play a significant role in this collaborative scientific research ecosystem, albeit mostly by crunching through the large amounts of data. Through high-end computing capabilities, we now see machines do petascale computations routinely. Through scalable databases and information retrieval capabilities, we know machines can store large amounts of information and help scientists sift through it. Through advances in machine learning, new discoveries have been made in climate research, ecosystems, materials science, and social sciences. The scientific advances due to these technologies would be inconceivable or impossible without them.
As we look to the future, science questions will be ever more complex, posing unprecedented challenges for scientists. In particle physics, the LHC will increase its collision rate by a factor of 10, and more sophisticated instruments such as the Long-Baseline Neutrino Facility (LBNF) will soon be a reality [7,32]. Other examples of these significant future science challenges include understanding the Earth as a system of systems, studying the brain from the molecular to the cellular to the organ level, detecting and managing natural hazards, protecting our environment with sustainable policies, curing cancer and other degenerative diseases, personalizing learning and on-demand training, and designing materials with desired properties. Other important scientific questions are not even being posed, since they are far out of reach given our current capabilities. The diversity and complexity of the data available for analysis, the amount of information to track, the number of possible hypotheses and models to explore, and the amount of coordination across expertise areas all add up to a closely related but fragmented information space that is extremely difficult to explore and manage. Even if important aspects of these problems can be solved, we want faster turnaround in the work, with new results in days or weeks rather than after years of research.
Artificial intelligence in science
Artificial intelligence has a long tradition in tackling scientific research as a problem-solving activity [24,26,34]. Table 1 highlights examples of the aspects of scientific discovery that have been addressed by artificial intelligence research, mentioning a particular science targeted where the impact was most significant. This is not meant to be a comprehensive survey, but rather a sample of pioneering work on artificial intelligence for scientific discovery for readers unfamiliar with this literature. For more detailed overviews see [16,18,21,23,36]. The goal here is to highlight two important things. First, the scientific challenges uncovered research challenges for artificial intelligence, and as a result new advances in artificial intelligence enabled significant new advances in science. Second, there is a wide diversity both in artificial intelligence research areas and in scientific domains.
Today, many artificial intelligence technologies are familiar to scientists. They include machine learning, natural language processing, semantic networks and ontologies, causal reasoning, robotics, and image processing among others. There are numerous uses of these technologies across diverse areas of science. The impact of artificial intelligence in science is palpable and continues to expand. Many scientific advances have only been possible through the pursuit of new lines of research in artificial intelligence.
Major activities in scientific discovery that have been a focus of artificial intelligence research
In the coming years, I expect that we will see broader dissemination of these artificial intelligence techniques as well as further automation of scientific tasks. These intelligent systems will be embedded in all aspects of the scientific research ecosystem.
I envision a much more expanded role for artificial intelligence in scientific discovery. Scientists will need intelligent systems with the ability to do independent inquiry, proactive learning, and deliberative reasoning. I envision a new generation of artificial intelligence systems that will enable a true partnership between scientists and machines. This partnership will be essential to tackle a new generation of science quests.
AI systems will become effective partners that will significantly enhance human abilities. Humans are clearly capable of being magnificent researchers, but even scientists with the best reputation make mistakes due to a number of factors:
As our science endeavors grow in ambition, their complexity will exacerbate human shortcomings and limitations. Intelligent systems can help counter these human shortcomings, and we have already referred to some examples that were used to address them (e.g., [29] and [1]). Intelligent systems can be systematic, covering all the space of choices without ignoring any details. They are correct, in that they follow instructions to the letter. They are unbiased, considering all the plausible interpretations however unlikely. They can also do rigorous reporting, recording and presenting every aspect of their work with justifications and verifications if needed. These are all ideal qualities for capable research collaborators.
At the same time, today’s intelligent systems have important limitations in areas where humans excel. Because intelligent systems are always built to perform specific tasks, they have very narrow knowledge. Therefore, they cannot put ideas in a broader context and understand the importance of a new result. They cannot think out of the box and change perspectives or reframe problems. They also cannot envision novel forms of thinking about a problem. Interestingly, these are areas where humans shine, through ingenious and creative perspectives, unconventional insights, priorities of science questions and goals, and awareness of the importance of results and ideas.
These limitations of intelligent systems will be lifted as intelligent systems research progresses. Today’s intelligent systems provide very helpful but very specific and narrow capabilities. Tomorrow, they will become research assistants to scientists. But very soon they will develop into colleagues that can make valuable research contributions independently. Intelligent systems indeed have the potential to play an important role in the research ecosystem and become effective research partners for scientists. What would it take to build such intelligent machines?
Thoughtful artificial intelligence
I propose a research agenda on a new generation of approaches that I will refer to as
These principles are summarized in Table 2. Current artificial intelligence systems aim to satisfy only one or a few of these principles, but not all. For example, game playing systems care about the rationality principle, conversational interfaces are designed with the articulation principle in mind, and semantic web systems have the networking principle at their core. Significant new capabilities will come from the combination of all these principles.
ThAIs will represent a new generation of artificial intelligence systems that will be more resilient and resourceful, behave more responsibly, and have a sense of their role within a larger science ecosystem. ThAIs will be key to bringing data science to a new level. Scientists expect these principles from research partners, and intelligent systems will have to grow in all these directions in order to take a much-needed role as partners in data science and scientific discovery.
Defining Principles for Thoughtful Artificial Intelligence Systems
This section discusses major research challenges in developing ThAIs, and highlights some of our ongoing work towards this vision. I leave out the rationality principle because it has been discussed in depth elsewhere.
Networked ThAIs: Towards a scientific knowledge web
Increasingly, knowledge about scientific entities of interest is captured in shared catalogs with metadata descriptions that enable scientists to find, compare, and relate those entities. Many ontologies reflect community agreements on standard ways to represent them. There is significant adoption of semantic web technologies in science. However, the establishment of links among scientific entities and the meaning of those links is largely done manually, as is the processing and understanding of the linked items. Networked ThAIs would need semantic links that they could find, interpret, exploit, and enrich.
Current structured representations for shared scientific knowledge focus on ontological models that represent objects and their properties. Ontologies are widely used in data repositories to specify metadata in order to find and integrate datasets by reasoning about their descriptions [3]. But scientific knowledge is much more than objects and properties, and scientists need support beyond querying and aggregating data based on basic properties. In order for ThAIs to access comprehensive scientific knowledge, we need to augment our current representations and capture more complex aspects of scientific research such as data analysis processes, hypotheses and claims, and synthesis of evidence from data [8]. I am personally interested in computational processes for data analysis, and have developed semantic workflow representations to capture knowledge about how the properties of datasets affect how they should be analyzed, and how the assumptions of data analysis algorithms constrain their use in a data analysis method [15]. The WINGS intelligent workflow system incorporates a variety of computational techniques that use those representations to assist scientists in: (1) finding appropriate workflows, (2) customizing existing workflows, (3) exploring the space of alternative workflow configurations, (4) correlating the results of exploratory executions, and (5) detecting commonly used workflow fragments. We publish workflows as web objects following linked open data principles [9]. This enables other systems to examine, compare, and reuse these workflows, effectively building on another system’s scientific processes. This is a key building block towards allowing ThAIs to acquire knowledge from other networked sources about how to do data analysis.
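As a rough illustration of how dataset properties and algorithm assumptions can constrain one another, the following minimal sketch checks which algorithms are applicable to a dataset given its metadata. It is not the WINGS representation or API: the classes `Dataset` and `Algorithm`, the property names, and the two example algorithms are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical dataset description: a name plus metadata properties."""
    name: str
    properties: dict


@dataclass
class Algorithm:
    """Hypothetical algorithm description with the input properties it assumes."""
    name: str
    requires: dict = field(default_factory=dict)


def violated_assumptions(algorithm, dataset):
    """Return the assumed properties that the dataset does not satisfy."""
    return {
        key: expected
        for key, expected in algorithm.requires.items()
        if dataset.properties.get(key) != expected
    }


def applicable(algorithms, dataset):
    """Names of the algorithms whose assumptions all hold for the dataset."""
    return [a.name for a in algorithms if not violated_assumptions(a, dataset)]


# Illustrative metadata only; the property names are assumptions.
expression = Dataset(
    "expression_matrix",
    {"normalized": True, "has_missing_values": True},
)
algorithms = [
    Algorithm("k_means", {"normalized": True, "has_missing_values": False}),
    Algorithm("decision_tree", {}),
]

print(applicable(algorithms, expression))
print(violated_assumptions(algorithms[0], expression))
```

A semantic workflow system would express such constraints in an ontology and reason over them, rather than comparing dictionary entries, but the sketch conveys the kind of knowledge being captured: an algorithm is only offered when the dataset metadata is compatible with its assumptions.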
Further research is needed on capturing more explicitly additional forms of knowledge that are currently scattered in publications, lab notebooks, emails, presentations, and other documentation. As more structured representations of additional kinds of scientific knowledge become available, ThAIs can access increasingly more information and resources about science.
I see not only limitations of current representations but also the need for research on knowledge capture systems that help scientists to create them. The research involves understanding how to embed new computational approaches for method representations and abstractions into routine practice while minimizing the scientist’s effort in specifying them. This will require novel intelligent user interfaces that interconnect data, software, people, instruments, and other scientific resources to effectively create meaningful chunks of a scientific knowledge web.
Given access to the data, models, and methods in this semantically enriched scientific knowledge web, ThAIs could find relevant information for their tasks, reason about new interconnections, discover new relevant data or results across different disciplines, establish new connections and generalizations, suggest and potentially resolve inconsistencies, and generally manage the intricacies of this immensely rich and incredibly powerful scientific knowledge web. Particularly effective will be intelligent systems that could help scientists connect and collaborate across disciplines, something that currently takes significant effort and often serendipitous connections, and yet leads to transformative approaches to tackle old problems.
Context-driven ThAIs: Task scoping
Data analysis occurs in a much larger context of discovery: scientists are driven by models and hypotheses, and the analysis results must be converted to evidence and related back to those hypotheses. The meta-reasoning processes involved in deciding how to explore and revise such hypotheses computationally are largely unexplored. Data analytics and machine learning methods focus on finding patterns in the data, but today the overarching processes of hypothesis formulation and revision are fully carried out manually by scientists. ThAIs could largely automate this process, starting with initial hypotheses and automatically finding relevant data, analyzing it, and assessing the results.
I am very interested in enabling intelligent systems to reason about hypotheses to create goals and define their own problems to pursue. We are developing DISK, a system that uses a meta-reasoning approach that starts from research hypotheses, triggers lines of inquiry to map hypotheses to data queries and computational workflows, and uses meta-workflows to combine workflow results and discern what is interesting to report back to the researcher [13,14].
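The flow from hypothesis to data query to workflow to meta-workflow can be sketched in a few lines. This is a toy illustration under stated assumptions, not the DISK implementation: the in-memory `DATA` repository, the hypothesis dictionary fields, the mean-based "workflow", and the threshold rule for what counts as interesting are all hypothetical, and the numbers are invented placeholders.

```python
from statistics import mean

# Hypothetical in-memory repository: dataset name -> placeholder measurements.
DATA = {
    "cohort_a": [0.9, 1.1, 1.0],
    "cohort_b": [1.6, 1.8, 1.7],
    "cohort_c": [1.0, 1.2, 0.8],
}


def query_datasets(hypothesis):
    """Data query step: select the datasets the hypothesis refers to."""
    return {name: DATA[name] for name in hypothesis["datasets"]}


def run_workflow(values):
    """Workflow step: a stand-in analysis that simply computes the mean."""
    return mean(values)


def meta_workflow(results, baseline, threshold):
    """Meta-workflow step: combine per-dataset results and keep only those
    whose difference from the baseline exceeds the threshold."""
    return {
        name: round(value - baseline, 2)
        for name, value in results.items()
        if abs(value - baseline) > threshold
    }


def line_of_inquiry(hypothesis):
    """Map a hypothesis to queries and workflows, then filter what to report."""
    datasets = query_datasets(hypothesis)
    results = {name: run_workflow(values) for name, values in datasets.items()}
    return meta_workflow(results, hypothesis["baseline"], hypothesis["threshold"])


hypothesis = {
    "statement": "The measured quantity is elevated in some cohorts",
    "datasets": ["cohort_a", "cohort_b", "cohort_c"],
    "baseline": 1.0,
    "threshold": 0.5,
}

print(line_of_inquiry(hypothesis))
```

In this toy setting only the cohort whose mean departs clearly from the baseline is reported back. A real line of inquiry would select among many candidate workflows and would need a far richer notion of what is worth reporting to the researcher.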
Much work remains to be done to represent scientific hypotheses, to capture meta-reasoning strategies that scientists pursue, and to analyze results. Ongoing work in cognitive science and philosophy of science is articulating such processes [5]. We need to incorporate them in artificial intelligence systems.
Initiative-driven ThAIs: Access to the scientific record
In order for ThAIs to acquire new knowledge about science, the scientific record should be made more accessible. Current published articles do not contain the information necessary to understand what was done in enough detail that they can be reproduced [10].
Many scientists would like to improve the transparency and reproducibility of their papers, but the best practices remain difficult to understand and follow in practice. Inspired by and partnering with early career researchers, I developed the Scientific Paper of the Future (SPF) Initiative to teach scientists how to write papers that describe and cite explicitly not just data but also software and methods (workflows and provenance) [11]. The Initiative includes a special issue of a journal where submissions discuss the scientist’s motivation for structuring and reporting their research products more thoroughly. We have now trained hundreds of scientists, including many center directors and principal investigators, who have changed their practices as a result. Improved publication of computational experiments will provide ThAIs with a more accessible scientific record.
This line of work complements ongoing efforts to extract information from the scientific literature through text mining [29]. Most of the work focuses on extracting facts and claims. Much remains to be done to address the challenges of extracting information about processes and methods.
Articulate ThAIs: Describing scientific results to different audiences
Communicating scientific findings will be a strong requirement for ThAIs. New findings have to be placed in perspective of what is already known in the literature. In addition, appropriate explanations and compelling arguments need to be generated.
I am very interested in using computational workflows to automatically generate explanatory text that could be included in scientific papers. We are investigating the use of semantic representations of data analysis workflows to generate alternative narrative accounts in DANA, a system to customize the methods section of an article to different readers depending on their interest and expertise levels [12]. In order to be true partners in the scientific research enterprise, ThAIs will have to communicate not just how they analyzed data, but the reasons and context for the analysis as well as the significance of the results. Perhaps by reading ThAI-generated descriptions, human scientists will adopt more precise language to describe scientific results and improve the reproducibility of the work.
System design for ThAIs: Compositionality, abstraction, and connectivity
We need to develop architectures for intelligent systems that can be easily extended with new capabilities and integrated with others. Unless ThAIs have compositionality, abstraction, and connectivity, it will be hard to combine the capabilities above as they grow in breadth and depth.
Ethical ThAIs: Awareness of limitations
ThAIs should understand their limitations, and behave accordingly. This involves reasoning about their confidence in the completeness and quality of their knowledge, their ability to accomplish any task requested, and the responsibilities of being an authority in a particular subject domain. Their ethical behavior will hinge on whether they can stop themselves from taking action when they are not qualified or able to do so.
Research is needed in the area of representations of hypotheses, claims, and evidence. A biomedical repository with an entry about the interaction between a protein and a gene can cite several papers as evidence, but more nuanced representations of that evidence are needed in order to make informed decisions about how to use it: were the experiments done for humans or another organism, were they reliable mass spectrometry experiments or simple fluorescence spectroscopy, were any of the results replicated, and what were the p-value ranges obtained. This kind of meta-knowledge and provenance is crucial to determine the confidence in the accumulated scientific record [4]. I have contributed to the development of provenance standards as an important enabler for this line of research [17,28].
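To make the idea of nuanced evidence concrete, the following minimal sketch attaches the kinds of meta-knowledge mentioned above (organism, experimental method, replication, p-value) to each piece of evidence and filters on them. The `Evidence` class, its field names, and the example records are hypothetical and invented purely for illustration; they do not follow any particular provenance standard and do not describe real experiments.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical evidence record with provenance-style metadata."""
    source: str        # placeholder label for the cited paper
    organism: str      # organism the experiment was performed on
    method: str        # experimental method used
    replicated: bool   # whether the result was independently replicated
    p_value: float     # reported p-value


def usable_evidence(records, organism, trusted_methods, max_p_value):
    """Keep evidence that matches the organism, uses a trusted method,
    was replicated, and meets the significance cutoff."""
    return [
        r for r in records
        if r.organism == organism
        and r.method in trusted_methods
        and r.replicated
        and r.p_value <= max_p_value
    ]


# Invented placeholder records, not real findings.
records = [
    Evidence("paper_a", "human", "mass_spectrometry", True, 0.01),
    Evidence("paper_b", "mouse", "mass_spectrometry", True, 0.001),
    Evidence("paper_c", "human", "fluorescence_spectroscopy", False, 0.04),
]

usable = usable_evidence(records, "human", {"mass_spectrometry"}, 0.05)
print([r.source for r in usable])
```

The filtering criteria here are deliberately crude. The point is only that decisions about how to use a claim depend on metadata that a bare citation does not carry.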
Conclusions
Thoughtful artificial intelligence goes beyond exhibiting rational behavior, and incorporates expanded context, initiative, networked connection, articulate communication, compositional system design, and ethical considerations. Thoughtful artificial intelligence systems will result in fundamentally new capabilities that will be a game changer for data science and scientific discovery.
There are significant research challenges ahead, and we must focus not only on developing thoughtful artificial intelligence but also on developing infrastructure and mechanisms to enable them. With thoughtful artificial intelligence systems as partners for data science and scientific discovery, we will be better equipped to tackle discoveries that are now unreachable.
Acknowledgements
We gratefully acknowledge support from the Defense Advanced Research Projects Agency through the SIMPLEX program with award W911NF-15-1-0555, and from the National Institutes of Health under award 1R01GM117097. We would like to thank Daniel Garijo, Hiroaki Kitano, and Parag Mallick for many thoughtful discussions.
