A review of methodologies for natural-language-facilitated human–robot cooperation

Abstract

Natural-language-facilitated human–robot cooperation refers to using natural language to facilitate interactive information sharing and task execution between robots and humans under a common goal constraint. Recently, natural-language-facilitated human–robot cooperation research has received increasing attention. Typical natural-language-facilitated human–robot cooperation scenarios include robotic daily assistance, robotic health caregiving, intelligent manufacturing, autonomous navigation, and robotic social companionship. However, a thorough review that reveals the latest methodologies for using natural language to facilitate human–robot cooperation is missing. In this review, we comprehensively investigated natural-language-facilitated human–robot cooperation methodologies by summarizing the research in three aspects: natural language instruction understanding, natural language-based execution plan generation, and knowledge-world mapping. We also analyzed in depth the theoretical methods, applications, and model advantages and disadvantages. Based on our paper review and perspective, future directions of natural-language-facilitated human–robot cooperation research are discussed.

Keywords

Natural language, human–robot cooperation, NL instruction understanding, NL-based execution plan generation, knowledge-world mapping

Introduction

Attracted by the naturalness of natural language (NL) communication among humans, researchers have begun to enable intelligent robots to understand NL so that intuitive human–robot cooperation can be developed for various tasks.^1,2 Natural-language-facilitated human–robot cooperation (NLC) has received increasing attention in human-involved robotics research over the past decade. By using NL, human intelligence in high-level task planning and robot physical capability in low-level task execution, such as force,³ precision,⁴ and speed,² are combined to perform intuitive cooperation.^5,6 For example, in furniture assembly, natural cooperation is challenging because a human has limited precision and speed in holding a driller, while a robot lacks an understanding of the assembly sequence. By giving robots NL instructions, such as “drill holes, then clean surface, last install screws,” a human’s high-level plan is combined with the robot’s low-level executions, such as “grasping drillers and brush” and “motion planning in drilling, cleaning, and installing,” to achieve natural cooperation.⁴

Currently, typical communication manners in human–robot cooperation include tactile indications, such as contact location⁷ and force strength,⁸ and visual indications, such as body pose⁹ and motion.¹⁰ Compared with these methods, using NL to conduct intuitive NLC has several advantages. First, NL makes human–robot cooperation natural. With the traditional methods mentioned above, humans need to be trained to use certain actions or poses to make themselves understandable to a robot.^11,12 In NLC, by contrast, even nonexpert users without prior training can use verbal conversations to instruct robots.¹³ Second, NL transfers human commands efficiently. Traditional communication methods using visual or motion indications require the design of informative patterns, such as “‘lift hand’ means ‘stop’” and “‘horizontal hand movements’ means ‘follow,’” for delivering human commands.¹⁴ Existing languages, such as English, Chinese, and German, already have standard linguistic structures that contain abundant informative expressions to serve as patterns.¹⁵ NL-based methods therefore do not need specially designed informative patterns for various commands, making human–robot cooperation efficient. Lastly, since NL instructions are delivered orally rather than through physical involvement, human hands are set free to perform more important executions. Typical areas using NLC are shown in Figure 1.

Figure 1.

Promising areas using NLC. (a) Daily robotic assistance using NL.¹⁶ A robot categorized daily objects following human NL instructions. (b) Autonomous manufacturing using NL.¹⁷ An industrial robot welded parts under a human’s oral instructions. (c) Robotic navigation using NL.¹⁸ A quadcopter navigated in indoor environments with a human’s oral guidance. (d) Social companionship.¹⁹ A pet dog played ball with a human through socialized verbal communication. NLC: natural-language-facilitated human–robot cooperation; NL: natural language.

Advancements in natural language processing (NLP) support an accurate understanding of the task in NLC, and advancements in robots’ physical capabilities support increasingly improved task execution. With supporting techniques from both NLP and robot execution, NLC has developed from low-cognition-level symbol-matching control, such as using “yes/no” to control robotic arms, to high-cognition-level task understanding, such as identifying a plan from the description “go straight and turn left at the second cross.”

NLC research is regularly published in international journals, such as IJRR,²⁰ TRO,²¹ AI,²² and KBS,²³ and at international conferences, such as ICRA,²⁴ IROS,²⁵ and AAAI.²⁶ Using the keywords “NLP, human, robot, cooperation, speech, dialog, natural language,” we retrieved about 1400 papers from Google Scholar,²⁷ of which about 570 focused on NL-facilitated human–robot cooperation. The publication trend is shown in Figure 2, where the increasing significance of NLC is reflected by steadily increasing publication numbers.

Figure 2.

The annual number of NLC-related publications since the year 2000 according to our paper review. Over the past 18 years, the number of NLC publications has steadily increased and has now reached a historic high, suggesting that NLC research is encouraged by advances in related fields such as robotics and NLP. NLC: natural-language-facilitated human–robot cooperation; NL: natural language.

Although review papers exist on human–robot cooperation using communication manners such as gesture and pose,^28,29 action and motion,³⁰ and tactile indications,³¹ a review of human–robot cooperation using NL communication is lacking. Therefore, given the great potential of NL for facilitating human–robot cooperation and the increasing attention NLC has received, this review summarizes state-of-the-art NLC methodologies across a wide range of domains, revealing current research progress and signposting future NLC research. Our novelty is that we summarize NLC research in three aspects: NL instruction understanding, NL-based execution plan generation, and knowledge-world mapping. Each aspect is comprehensively analyzed in terms of research progress, method advantages, and limitations. The organization of this article is shown in Figure 3.

Figure 3.

Organization of this review paper. This review systematically summarizes methodologies for using NL to facilitate human–robot cooperation. Three main research areas are introduced: NL instruction understanding, NL-based execution plan generation, and knowledge-world mapping. For each area, typical models, application scenarios, model comparisons, and open problems are summarized.

Framework of NLC realization

Realizing NLC is challenging for the following reasons. First, human NL is abstract and ambiguous. It is hard to understand humans accurately during task assignments, which impedes natural communication between a robot and a human. Second, NL-instructed plans are implicit. It is difficult to reason out appropriate execution plans from human NL instructions for effective human–robot cooperation. Third, NL-instructed knowledge is incomplete and inconsistent with the real world. It is difficult to map enough theoretical knowledge into the real world to support successful NLC. To solve these problems and achieve effective and natural NLC, three main types of research have been conducted.

NL instruction understanding: To accurately understand assignments during NLC, research on NL instruction understanding builds semantic models for extracting cooperation-related knowledge from human NL instructions.

NL-based execution plan generation: To reason out a robot’s execution plans from human NL instructions, research on NL-based execution plan generation creates various reasoning mechanisms for identifying human requests and formulating robot execution strategies.

Knowledge-world mapping: To map NL-instructed theoretical knowledge to real-world situations for practical cooperation, research on knowledge-world mapping recommends missing knowledge and corrects real-world-inconsistent knowledge to realize NLC in various real-world environments.

NL instruction understanding

NL instruction understanding enables a robot to receive human-assigned tasks, identify human-preferred execution procedures, and understand the surrounding environment from abstract and ambiguous human NL instructions during NLC. By improving the robot’s understanding of the human, the accuracy and naturalness of NLC are improved. To understand human NL expressions intuitively and with environment awareness, two types of semantic analysis models have been developed: literal models and interpreted models. Both types extract the cooperation-related information that humans indicate explicitly or implicitly. The difference between them is the information source. Literal models extract information only from human NL instructions, while interpreted models also extract information from the human’s surrounding environment. With literal models, a robot understands tasks merely by following human NL instructions, whereas with interpreted models, a robot understands tasks by critically reasoning about cooperation-related practical environment conditions, thereby becoming situation aware.

From the model construction perspective, to analyze the meanings of human NL instructions in NLC, literal models mainly use literal linguistic features, such as words, Part-of-Speech (PoS) tags, word dependencies, word references, and sentence syntax structures, as shown in Figure 4; interpreted models mainly use interpreted linguistic features, such as temporal and spatial relations, object categories, object physical properties, object functional roles, action usages, and task execution methods, as shown in Figure 5. Literal linguistic features are directly extracted from human NL instructions, while interpreted linguistic features are indirectly inferred from common sense based on NL expressions.

Figure 4.

Typical literal models for NL instruction understanding. (a) The model of House et al.³² is a grammar model. The robotic arm’s motion was controlled by predefined vowels, such as “aw, ee, ch,” in human speech. (b) The model of Dominey et al.³³ is an association model. NL expressions, such as “OpenLeft,” were interpreted as specific parameters, such as “open left hand for 1 DOF,” for robotic arms. NL: natural language; DOF: degrees of freedom.

Figure 5.

A typical interpreted model for NL instruction understanding.³⁴ Robot memory, real-world states and human NL instructions were integrated to instruct robot executions. NL: natural language.

Literal models

According to how literal linguistic features are involved, literal models are categorized into the following types. (1) Grammar model: Literal linguistic feature patterns such as “action + destination” are manually defined. (2) Association model: Literal linguistic features are mutually associated with commonsense knowledge.

Grammar models

To initially identify key cooperation-related information, such as goals, tool usages, and action sequences, from human NL instructions, grammar patterns are defined to build grammar models. Grammar patterns refer to keyword combinations, PoS tag combinations, and keyword-PoS tag combinations.^35,36 With these grammar models, robot behaviors are triggered by the grammar patterns found in human NL instructions. Some grammar patterns explored execution logics. For example, verbs and nouns were combined to describe a type of action, such as V(go) + NN(Hallway) and V(grasp) + NN(cup).^37–39 Some grammar patterns explored temporal relations, such as the if–then relation “if door open, then turn right” and the step 1 to step 2 relation “go → grasp.”^40,41 Some grammar patterns explored spatial relations, such as the IN relation “cup IN room” and the CloseTo relation “cup CloseTo plate.”^42,43 The rationale of the grammar model is that sentences with similar meanings have similar syntax structures, so the similarity of NL meanings is calculated by evaluating syntax structure similarity.
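As a rough illustration of the V + NN grammar pattern described above, the following minimal sketch scans an instruction for a known verb followed by a known noun. The vocabularies and the function name `match_verb_noun` are hypothetical and chosen only for this example; the cited systems define their own patterns and use PoS taggers rather than fixed word lists.

```python
import re

# Hypothetical vocabularies for illustration only; a real grammar model would
# rely on a PoS tagger and a task-specific lexicon.
VERBS = {"go", "grasp", "drill", "clean"}
NOUNS = {"hallway", "cup", "hole", "surface"}


def match_verb_noun(instruction):
    """Return (verb, noun) pairs that follow a V + NN pattern in the instruction."""
    tokens = re.findall(r"[a-z]+", instruction.lower())
    pairs = []
    current_verb = None
    for token in tokens:
        if token in VERBS:
            # Remember the most recent verb until a noun completes the pattern.
            current_verb = token
        elif token in NOUNS and current_verb is not None:
            pairs.append((current_verb, token))
            current_verb = None
    return pairs


if __name__ == "__main__":
    print(match_verb_noun("Go to the hallway, then grasp the cup"))
```

Because the patterns are listed by hand, any instruction outside the vocabulary yields no match, which is the exhaustive-listing limitation of grammar models.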

Association models

To understand abstract and implicit NL execution commands during cooperation, association models were developed to associate different literal linguistic features and thereby extract new semantic meanings. Essentially, the association model exploits existing knowledge by creating high-level abstract knowledge from low-level detailed knowledge. One typical association model is the probabilistic association model, in which informative literal linguistic features in NL instructions are correlated with other informative keywords by using probability likelihoods computed from human communications. Typical works are as follows.

Learning from previous human execution experiences: Cooperation-needed actions are inferred based on mentioned tasks, locations, and their probabilistic associations.⁴⁴

Learning from daily common sense: Quantitative dynamic spatial relations such as “away from, between,…” have been associated with their corresponding NL expressions based on their probabilistic relations⁴⁵; general terms such as “beverage” are specified to “juice” according to cooperation types and task–object probabilistic relations.⁴⁶

With this probabilistic association model, the uncertainty in NL expressions was modeled, disambiguating NL instructions and improving a robot’s adaptation toward different human users with various NL expressions. Another typical association model is an empirical association model. High-level abstract literal linguistic features, such as ambiguous words and uncertain NL phrases, are empirically specified by low-level detailed literal features such as action usage, sensor values, and tool usages. The rationale is that general knowledge could be recommended for disambiguating ambiguous NL instructions in specific situations. Compared with probabilistic association models, which use objective probabilistic calculation, empirical association models use subjective empirical association.

Typical usages include the following types.

By defining sensor value ranges for ambiguous NL descriptions, such as “slowly, frequently, heavy,” ambiguous execution-related NL expressions were quantitatively interpreted, making ambiguous NL expressions sensor-perceivable.^41,47

By integrating key aspects, such as execution preconditions, action sequences, human preferences, tool usages, and location information, into abstract NL expressions, such as “drill a hole,” human-instructed high-level plans were specified into detailed robot-executable plans, such as “clean the surface” or “install a screw.”^{4,36,39,42,48}

By using discrete fuzzy statuses—such as “close, far, cold, warm”—to divide continuous sensor data ranges, unlimited objective sensor values were “translated” into limited subjective human feelings, such as “close to the robot, day is hot,” supporting a human-centered task understanding.^49,50

By combining human factors, such as “human’s visual scope,” with linguistic features, such as the keyword “wrench” in human NL instructions, the empirical association model became environmental-context-sensitive, enabling a robot to understand a human NL instruction such as “deliver him a wrench” from the human perspective: “the human-desired wrench is actually the human-visible wrench.”^51–53

The advantage of using association models in NLC is that the robot cognition level is improved by means of mutual knowledge compensation. With an association model, a robot can explore unfamiliar environments by exploiting its existing knowledge.
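The empirical associations listed above, in which ambiguous words are tied to sensor value ranges and continuous readings are divided into discrete fuzzy statuses, can be sketched as simple lookup tables. All thresholds, units, and names below (`DISTANCE_STATUSES`, `SPEED_RANGES`, `to_status`, `speed_range_for`) are assumptions made for illustration, not values from the cited works.

```python
# Hypothetical thresholds; a real system would calibrate them per sensor and task.
DISTANCE_STATUSES = [(1.0, "close"), (float("inf"), "far")]      # meters
TEMPERATURE_STATUSES = [(15.0, "cold"), (float("inf"), "warm")]  # degrees Celsius

# Hypothetical speed ranges (m/s) attached to ambiguous adverbs.
SPEED_RANGES = {"slowly": (0.0, 0.2), "quickly": (0.5, 1.0)}


def to_status(value, statuses):
    """Map a continuous sensor value to the first status whose upper bound exceeds it."""
    for upper_bound, label in statuses:
        if value < upper_bound:
            return label
    # Values that exceed every bound fall into the last status.
    return statuses[-1][1]


def speed_range_for(adverb):
    """Return the (min, max) speed range assumed for an ambiguous adverb, or None."""
    return SPEED_RANGES.get(adverb.lower())


if __name__ == "__main__":
    print(to_status(0.4, DISTANCE_STATUSES))
    print(to_status(22.0, TEMPERATURE_STATUSES))
    print(speed_range_for("slowly"))
```

The tables make the subjective nature of empirical association visible: the boundaries are chosen by a designer rather than estimated from data, in contrast to the probabilistic association model.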

Interpreted models

Human requests are usually situated, which means that human NL expressions carry default environmental preconditions, such as “cup is dirty, a driller is missing, robot is far from a human.” Human NL instructions are closely correlated with situation-related information, such as human tactile indication (tactile modality), human hand/body pose (vision modality and motion dynamics modality), and environmental conditions (environment sensor modality).

To accurately understand human NL instructions, interpreted models are developed to integrate information from multiple modalities, instead of merely from the NL modality. The rationale behind interpreted models is that humans depend on their surrounding environment, so a better understanding of a human needs to be environmentally context aware. With multimodality models, information from different modalities related to the human, the robot, and their surrounding environment is aligned to establish semantic correlations.^54–56 For multimodality models using NL instructions and human-related features to understand human NL instructions, typical features beyond the linguistic features considered in single-modality models include the following: individual identity detected by radio-frequency identification (RFID) sensors,⁵⁷ touch events detected by tactile sensors,⁵⁸ facial expressions (joy, sad),⁵⁹ hand poses detected by computer vision systems,⁶⁰ and human head orientations detected by motion tracking systems.⁶¹ Supported by rich information from multiple modalities, typical problems tackled for NLC include complex-instruction understanding,⁶² human-like cooperation,⁶¹ and human social behavior understanding and mimicking.⁵⁰ For multimodality models using environment and robot-related features to understand human NL instructions in NLC, typical features also include the following: spatial object-robot relations indicated by human hand directions,⁶² temporal robot-speech-and-head-orientation dependencies measured by computer vision systems,⁶¹ object visual cues detected by cameras,^63,64 and robot sensorimotor behaviors monitored by both motion systems and computer vision systems.³⁴ Supported by rich information from these features, typical problems tackled in NLC include real-time communication, context-sensitive cooperation (sensor-speech alignment), machine-executable task plan generation, and implicit human request interpretation. Typical algorithms used for constructing multimodality models include the hidden Markov model (HMM) for modeling hidden probabilistic relations among interpreted linguistic features,^63,65 the Bayesian network (BN) for modeling probabilistic transitions among task-execution steps,^66–68 and first-order logic for modeling semantic constraints among interpreted linguistic features.^69,70 These algorithms integrate different modalities with appropriate contribution distributions and extract contributive feature patterns among modalities. Multimodality models have three potential advantages in understanding human NL instructions.

By exploring multimodality-information sources, rich information can be extracted for an accurate NL instruction understanding.

Information in one modality can be compensated by information learned from other modalities for better NL disambiguation.

Consistency of multimodality information enables mutual confirmations among knowledge from multiple modalities. A reliable NL command understanding could be conducted.

Supported by these advantages, multimodality models have the potential to understand complex plans and various users and to perform practical NL instruction understanding in real-world NLC situations.

Model comparison

Literal models, which use basic linguistic features taken directly from human NL instructions, provide shallow, literal-level understanding. Interpreted models, which use multimodality features interpreted from human NL instructions, provide comprehensive, connotation-level understanding. Each has unique advantages and is therefore suitable for different application scenarios. Literal models are good at scenarios with simple procedures and clear work assignments, such as robot arm control and robot pose control. Interpreted models are good at scenarios involving daily common sense, human cognitive logics, and rich domain information, such as object physical property-assisted object searching, intuitive machine-executable plan generation, and vision–verbal–motion-supported object delivery. From literal models to interpreted models, robots have become more closely integrated with humans both physically and mentally. This integration enables a robot to accurately understand both human requests and practical environments, improving the effectiveness and naturalness of NLC.

Open problems

Although robots using grammar models have an initial capability of understanding human NL instructions during cooperation, the drawback is that the feature correlations needed for understanding must be exhaustively listed, and it is difficult to summarize all the grammar rules likely to be encountered. Compared with grammar models, association models give more cooperation-related knowledge to a robot by exploiting associations among literal features. Even though the association model can interpret abstract linguistic features into detailed execution plans, it still suffers from incorrect associations. These open problems decrease NL instruction understanding accuracy and, in turn, robot adaptability.

Although interpreted models are capable of comprehensively understanding human NL instructions by considering practical environment conditions, several problems remain. First, it is difficult to combine different types of modalities, such as motion, speech, and visual cues, in an appropriate manner that reveals the practical contribution distributions of different modalities. Second, it is difficult to extract contributive features for describing both distinctive and common aspects of one modality in understanding NL instructions. Third, the overfitting problem still exists when using multimodality information to understand NL instructions. NL instruction understanding based on different modalities could be mutually conflicting, thereby preventing the practical implementation of multimodality models. Model details are presented in Table 1.

Table 1.

Summary of NL instruction understanding methods.

	Literal models		Interpreted models
	Grammar	Association	Interpreted models
Knowledge format	Linguistic structures	Meaningful concepts	Semantic correlations
Algorithms	First-order logic	Ontology tree	Typical classification algorithms (NB, SVM), first-order logic
User adaptability	Low	Low	High
Tackled problems	Initially understand logic relations, temporal and spatial relations in execution processes	Specify abstract executions into machine-executable executions	Complex task instruction understanding, human-like human–robot cooperation, context-sensitive cooperation
Advantages	Performance is good and steady in trained situations	Model human cognitive process, scaling up robot knowledge	Rich cooperation-related information is involved; information is more reliable
Disadvantages	Exhaustive listing of NL instructions, time-consuming and labor-intensive	Lacking standards for concept interpretation and interpretation evaluation	Difficult to combine different-modality features, difficult to extract important NL features
Typical references	^{35,36,40,41,43}	^{44,45,47,48,50}	^{63,65,66,67,70}

NL: natural language.

NL-based execution plan generation

With task knowledge extracted in NL instruction understanding, it is critical to use this knowledge to plan robot executions in NLC. Models for NL-based execution plan generation (“generation models” for short) are developed to formulate robot execution plans, theoretically supporting robots in cooperating with humans in appropriate manners. In these models, previously learned piecemeal knowledge is organized with different algorithm structures. Different algorithms give the models different cooperation manners under various human–robot cooperation scenarios. For example, dynamic models supported by HMM enable real-time NL understanding and execution, while static models supported by naive Bayesian (NB) enable spatial human–robot relation exploration. During plan generation, correlations among NLC-related knowledge, such as execution steps, step transitions, actions, tools, and locations, as well as their temporal, spatial, and logic relations, are defined. Regarding reasoning mechanisms, generation models have three main types.

Probabilistic models: To give robots an associative capability for cooperation, in which a likely plan is inferred and appropriate tools and actions are recommended, probabilistic models were developed based on probabilistic dependencies, as shown in Figure 6.

Logic models: To give robots a logical reasoning capability, in which the internal logics among execution procedures are followed, logic models were developed based on ontology and first-order logics, as shown in Figure 7.

Cognitive models: To give robots a cognitive thinking capability, in which plans are intuitively made and adjusted, cognitive models were developed based on weighted logics, as shown in Figure 8.

Figure 6.

Typical probabilistic models. (a) Takano’s⁷¹ model is an HMM, in which the potential execution sequences of an NLC task are modeled by hidden Markov states. (b) Salvi et al.’s⁷² model is a naive Bayesian model, in which observations “object size, object shape” and their conditional correlations such as “size-big, shape-ball,…” are combined to form joint-probability correlations such as “object-size-shape,….” NLC: natural-language-facilitated human–robot cooperation; NL: natural language.

Figure 7.

In the study by Dantam and Stilman,³⁸ hard logic relations, such as “move = (grasp, place),…,” were defined to control robot motion in playing chess with a human.

Figure 8.

In the cognitive model,⁷³ human’s cognitive process in decision-making was simulated by execution logics with different influence weights, based on which important logics with larger weights could be emphasized and trivial logics with smaller weights could be ignored. With this soft logic manner, the flexible cooperation between a human and a robot could be conducted.

Probabilistic model

Joint-probabilistic BN methods

To enable a robot to plan cooperation based on various observations, joint-probabilistic BN methods are developed. By using a single joint probability p(x, y), a robot could use the probabilistic association p between a human NL instruction y, such as “move,” and one execution parameter x, such as the object “ball,” to plan simple cooperation such as the object placement “move ball.”⁷² Typical joint-probability associations in NLC include activity-object associations, such as “drink-cup,”⁵¹ activity-environment associations, such as “drink-hot day,”⁷⁴ and action-sensor associations.⁷⁵ During the generation of cooperation strategies, a single joint-probabilistic BN association is used as independent evidence to describe one semantic aspect of a task. When multiple joint-probabilistic associations ∏p(·,·) are used, interpreted linguistic features of an NLC task are collected from various NL descriptions and sensor data, describing relatively complex plans. Typical methods using multiple joint-probability associations include the Viterbi algorithm,⁷⁵ the NB algorithm,⁷⁴ and the Markov random field.⁷⁶ With these algorithms, the most complete plan described in human NL instructions is selected as the human-desired plan. With multi-joint-probabilistic BN models, the tackled problems are as follows.

Modeling plans by extracting linguistic features, such as NL instruction patterns^76,77;

Enriching cooperation details by aligning multiple types of sensor data, such as speech meaning, task execution statuses, and robot or human motion status³⁹;

Making flexible plans by specifying verbally-described tasks with appropriate execution details, such as execution actions and effects^78,79;

Intuitively cooperating with a human by integrating current NL descriptions with previous execution experiences⁸⁰;

Searching for tools accurately by associating theoretical knowledge, such as tool identities, with practical real-world evidence, such as tools’ colors and placement locations.⁸¹

One common characteristic of probabilistic models, such as NB, is that dependencies among task features are simplified so that the features are treated as fully or partially independent.⁷⁴ In practical situations, when a set of observations is made, the pieces of evidence involved in cooperation, such as speech, object, context, and action, are usually not mutually independent.⁸² For task plan representation, this simplification brings both negative effects, such as undermining plan representation accuracy, and positive effects, such as preventing overfitting in the plan-representation process. The common problem of multi-joint-probabilistic BN models is that temporal associations are ignored, limiting the implementation of real-time NLC.
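To make the independence simplification concrete, the following minimal naive Bayesian sketch scores candidate plans by multiplying a prior with per-feature likelihoods, treating every observed feature as independent given the plan. The plan names, features, probability values, and the floor probability for unseen features are all hypothetical and serve only to illustrate the technique.

```python
import math

# Hypothetical priors and per-feature likelihoods; values are illustrative only.
PRIORS = {"serve_drink": 0.5, "clean_table": 0.5}
LIKELIHOODS = {
    "serve_drink": {"cup": 0.7, "hot_day": 0.6, "brush": 0.1},
    "clean_table": {"cup": 0.2, "hot_day": 0.3, "brush": 0.8},
}
UNSEEN_FLOOR = 1e-3  # assumed probability for features missing from the table


def most_likely_plan(observed_features):
    """Score each plan with log p(plan) + sum of log p(feature | plan)."""
    scores = {}
    for plan, prior in PRIORS.items():
        score = math.log(prior)
        for feature in observed_features:
            # Independence assumption: each feature contributes its own factor.
            score += math.log(LIKELIHOODS[plan].get(feature, UNSEEN_FLOOR))
        scores[plan] = score
    best_plan = max(scores, key=scores.get)
    return best_plan, scores


if __name__ == "__main__":
    plan, _ = most_likely_plan(["cup", "hot_day"])
    print(plan)
```

Note that the score has no notion of order: swapping the observed features gives the same result, which mirrors the remark that temporal associations are ignored by these models.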

DBN methods

To enable temporal knowledge association for real-time cooperation planning, the dynamic Bayesian network (DBN) was developed. With DBN, temporal dependencies p(xₜ | xₜ₋₁) were propagated among NLC-related requests and object usages xₜ₋₁.⁸³ Given that the final format of DBN is the joint probabilistic form p(y, xₜ, xₜ₋₁, …), DBN is still a joint-probabilistic model. A widely used DBN algorithm in NLC is the HMM algorithm,⁸⁴ which uses a Markov chain assumption to explore the hidden influence of previous task-related features on the current NLC status. The rationale of HMM in NLC is that human-desired executions, such as going to a position, grasping a tool, and lifting a robot hand, are decided by the previous cooperation statuses (xₜ₋₁), such as action sequences, and the current cooperation statuses. These statuses include environmental conditions, task execution progress, and human NL instructions, as well as working statuses for the human and robot. HMM uses both observation probabilities (absolute probability p(x)) and transition probabilities (conditional probability p(y | x)) for modeling associations p(x, y) among NLC-related knowledge.^71,84 With HMMs, tackled problems mainly include real-time task assignments,⁸⁵ dynamic human-centered cooperation adjustment,^71,86 and accurate tool delivery by simultaneously fusing multi-view data such as NL instructions, shoulder coordinates, shoulder-elbow 3-D angle data, and hand poses.^84,87 Limited by Markov assumptions, HMM is only capable of modeling shallow-level hidden correlations among NLC-related knowledge. Moreover, given that hidden statuses need to be explored for HMM modeling, a large amount of training data is needed, limiting HMM implementations in unstructured scenarios with limited training data availability.
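The way an HMM links previous and current cooperation statuses can be illustrated with a small Viterbi decoder, which recovers the most probable sequence of hidden execution steps from a sequence of observed cues. The states, cue names, and all probabilities below are invented for this sketch; they are not taken from the cited systems, which learn such parameters from training data.

```python
# Hypothetical HMM over execution steps; every number here is an assumption.
STATES = ["go", "grasp", "lift"]
START = {"go": 0.8, "grasp": 0.1, "lift": 0.1}
TRANS = {
    "go": {"go": 0.2, "grasp": 0.7, "lift": 0.1},
    "grasp": {"go": 0.1, "grasp": 0.2, "lift": 0.7},
    "lift": {"go": 0.3, "grasp": 0.3, "lift": 0.4},
}
EMIT = {
    "go": {"move_cue": 0.7, "hand_cue": 0.2, "raise_cue": 0.1},
    "grasp": {"move_cue": 0.1, "hand_cue": 0.8, "raise_cue": 0.1},
    "lift": {"move_cue": 0.1, "hand_cue": 0.2, "raise_cue": 0.7},
}


def viterbi(observations):
    """Return the most probable hidden state sequence for the observed cues."""
    if not observations:
        return []
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (START[s] * EMIT[s][observations[0]], None) for s in STATES}]
    for obs in observations[1:]:
        column = {}
        for s in STATES:
            column[s] = max(
                (best[-1][p][0] * TRANS[p][s] * EMIT[s][obs], p) for p in STATES
            )
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


if __name__ == "__main__":
    print(viterbi(["move_cue", "hand_cue", "raise_cue"]))
```

Each step depends only on the immediately preceding state, which is the Markov assumption that restricts HMMs to shallow hidden correlations.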

Logic model

To support a robot with rational logical reasoning about cooperation strategies, rather than merely conducting exhaustive probabilistic inferences from various NL-indicated evidence, logic models were developed. Logic models teach robots to use inviolable logic formulas to describe complex execution procedures that include multiple actions and statuses. These inviolable logics are usually first-order logic formulas, such as “in possible worlds a kitchen is a region (∀w∀x(kitchen(w, x) → region(w, x))).”⁸⁸ The rationale behind logic models in NLC is that an NLC task is decomposed into sequential logic formulas, and the task is accomplished by satisfying them. In a logic model, logics are equally important, without differences in their contributions to execution success. Logic relations, including tool usages, action sequences, and locations, are defined in the structure. Typical tackled problems include the following:

Autonomous robot navigation by using logic navigation sequences, such as going to a location “hallway” then going to a new location “rest room.”^66,70

Environment uncertainty modeling by summarizing potential executions, such as “ground atoms (Boolean random variables) eats (Dominik, Cereals), uses (Dominik, Bowl), eats (Michael, Cereals) and uses (Michael, Bowl).”⁸⁹

Robot action control by defining action-usage logics such as “move (grasp piece(location, grip), place piece(location, ungrip)).”^66,90–92

Autonomous failure analysis by looking up first-order logic representations to detect the missing knowledge, such as “tool brush, action: sweep.”^4,92

NL-based robot programming by using the grammar language, such as point(object, arm-side), lookAt(object), and rotate(rot-dir, arm-side).⁶⁹

The drawback of logic models in modeling NLC tasks is that the logic relations defined in the model are hard constraints. If a single logic formula is violated during practical execution, the whole logic structure becomes inapplicable and the task execution fails. This drawback limits the models’ implementation scope and reduces a robot’s environment adaptability. Moreover, hard constraints are defined uniformly, ignoring the relative importance of executions. Execution flexibility is undermined because, when NLC plan modifications are necessary, critical executions are not prioritized and trivial executions are not skipped.
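The hard-constraint behavior described above can be sketched in a few lines of Python. This is an illustrative toy, not the formalism of any cited system: the rule set, the action names, and the "screwdriver" tool are assumptions, and ordinary Python predicates stand in for first-order logic formulas.

```python
def before(first, second):
    """Build a hard rule: action `first` must occur before action `second`."""
    def rule(plan, available_tools):
        actions = [action for action, _ in plan]
        return (
            first in actions
            and second in actions
            and actions.index(first) < actions.index(second)
        )
    return rule


def tools_available(plan, available_tools):
    """Hard rule: every tool named in the plan must exist in the environment."""
    return all(tool in available_tools for _, tool in plan)


# Every rule is equally important; there are no weights.
HARD_RULES = [before("drill", "clean"), before("clean", "install"), tools_available]


def plan_is_executable(plan, available_tools):
    """A plan is accepted only if ALL hard rules hold; one violation rejects it."""
    return all(rule(plan, available_tools) for rule in HARD_RULES)


PLAN = [("drill", "driller"), ("clean", "brush"), ("install", "screwdriver")]
ok_in_workshop = plan_is_executable(PLAN, {"driller", "brush", "screwdriver"})
ok_without_brush = plan_is_executable(PLAN, {"driller", "vacuum cleaner", "screwdriver"})
```

In the second call the plan is rejected outright because one tool is missing, even though a substitute tool is present, which is the inflexibility the paragraph above refers to.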

Cognitive model

Neuroscience research⁹³ and psychology research⁹⁴ have shown that human cognitive planning is not a sensorimotor transformation but goal-based cognitive thinking. This is reflected in the fact that cognitive thinking about cooperation does not rely on specific objects and specific executions; it relies only on goal realization. Based on this theory, another category of plan generation models is summarized as the cognitive model. Human-like robot cognitive planning in NLC is reflected in flexibly changing execution plans (different procedures), adjusting execution orders (same procedures, different orders), removing some less important execution steps (same procedures, fewer steps), and adding more critical execution procedures (similar procedures, similar orders).

Cognitive models

To develop human-like robot cognitive planning for robust NLC, cognitive models are developed by using soft logic, which is defined by both logic formulas and their weighted importance. A typical cognitive model is the Markov logic network (MLN). An MLN represents an NLC task in a form such as “0.3Drill(1) ^ 0.3TransitionFeasible(1, 2) ^ 0.3Clean(2) ⇒ 0.9Task(1, 2),”⁴ imitating the human cognition process in task planning.

In this model, single execution steps and step transitions are defined by logic formulas, which can be grounded into different logic formulas by substituting real-world conditions. With this cognitive model, a flexible execution plan can be generated by omitting non-contributing and weakly contributing logic formulas and retaining strongly contributing ones. Different from the hard constraints in logic models, the constraints (logic formulas) in an MLN are soft. Soft constraints mean that when human NL instructions are only partially obeyed by a robot, the task can still be successfully executed. Typical tackled problems include using MLN to generate a flexible machine-executable plan from human NL instructions for autonomous industrial task execution,⁷³ and NL-based cooperation in uncertain environments by using MLN to meet constraints from both robots’ knowledge availability (human-NL-instructed knowledge) and the real world’s knowledge requirements (practical situation conditions).^89,95 The advantage of using cognitive models in NLC is that soft logic is relatively close to the human cognitive process reflected in human NL instructions during cooperation. It helps a robot cooperate intuitively in unfamiliar situations by modifying, replacing, and executing plan details, such as tool or action usages, improving robots’ cognition levels and enhancing their environment adaptability. The major drawback is that an MLN still differs from human cognitive processes, which consider logic conditions at a deep level to enable plan modification, new plan making, and failure analysis. Logic parameters for analyzing real-world conditions are still insufficient to imitate logic relations in the human mind, thereby limiting robots’ performance in adapting to users and environments.
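A minimal sketch of the soft-constraint idea follows. It uses the standard MLN scoring form, in which a candidate is weighted by the exponential of the summed weights of the formulas it satisfies. Two simplifications are assumptions of this sketch: normalization runs only over a hand-picked list of candidate plans rather than over all possible worlds, and the formulas and weights are hypothetical rather than learned.

```python
import math

# (weight, formula) pairs; a formula returns True when the plan satisfies it.
# Weights are illustrative assumptions, not values from the cited work.
WEIGHTED_FORMULAS = [
    (2.0, lambda plan: "drill" in plan),
    (1.5, lambda plan: "install" in plan),
    (0.5, lambda plan: "clean" in plan),  # weakly contributing step
    (1.0, lambda plan: "drill" in plan and "install" in plan
                       and plan.index("drill") < plan.index("install")),
]


def score(plan):
    """Sum the weights of all formulas the plan satisfies."""
    return sum(weight for weight, formula in WEIGHTED_FORMULAS if formula(plan))


def plan_distribution(candidates):
    """Normalize exp(score) over the given candidates (a simplification of MLN inference)."""
    weights = [math.exp(score(plan)) for plan in candidates]
    total = sum(weights)
    return [w / total for w in weights]


CANDIDATE_PLANS = [
    ("drill", "clean", "install"),  # satisfies every formula
    ("drill", "install"),           # omits the weakly weighted cleaning step
    ("install", "drill"),           # violates the ordering formula
]
plan_probabilities = plan_distribution(CANDIDATE_PLANS)
```

Unlike a hard-constraint model, the plan that skips cleaning and the plan that violates the ordering both keep a nonzero probability; they are simply ranked lower than the complete plan.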

Model comparison

The probabilistic model usually works in an end-to-end manner, reasoning cooperation strategies directly from observations and ignoring internal correlations among execution procedures. A logic model works in a step-by-step manner, exploring ontology correlations and temporal or spatial correlations among execution procedures and thereby enabling process reasoning for intuitive planning. The cognitive model also works in a step-by-step manner. In addition to logic correlations, it explores the relative influences of execution procedures, enabling flexible plan adjustment. The probabilistic model suits scenarios with rich evidence and a single objective goal, such as tool delivery and navigation path selection. The logic model suits scenarios with either poor evidence or multiple objective goals, such as assembly planning and cup grasping planning. The cognitive model suits scenarios with rich or poor evidence and multiple subjective goals, such as human emotion-guided social interaction and human preference-based object assembly.

Open problems

Probabilistic models lack exploration of indirect human cognitive processes in NLC, limiting the naturalness of robotic executions. Logic models are inflexible and incapable of simulating a human’s intuitive planning in real-world environments. The cognitive model is close to a human’s cognitive process in simulating flexible decision-making. However, cognitive models still suffer from two shortcomings. The first is that a simulated cognitive process is still not a real cognitive process, because the fundamental theory of cognitive process modeling lacks sufficient support for human-like task execution.⁸⁹ The second is the difficulty of learning cognitive models. Different individuals have different cognitive processes, making it difficult to learn a general reasoning model. Model details are presented in Table 2.

Table 2.

Method summary of NL-based execution plan generation.

	Probabilistic model		Logic model	Cognitive model
	Joint-probabilistic BN	Dynamic Bayesian network	Logic model	Cognitive model
Knowledge format	Joint probabilistic correlations	Conditional probabilistic correlations	Logic formulas	Logic formulas, their weighted influences
Algorithms	Joint BN, NB, MRF	Conditional BN, Viterbi Algorithm, HMM	First-order Logic, Ontology Tree	MLN, fuzzy logic
User adaptability	Moderate	Moderate	Low	High
Tackled problems	Modeling meaning distributions on NL instructions, aligning multi-view sensor data, action and tool recommendation	Meaning disambiguation, entity-sensor data mapping, human-attended object identification, real-time uncertainty assessment	Autonomous robot navigation, environment uncertainty modeling, autonomous execution failure diagnosis, NL-based robot programming	Support a flexible machine-executable plan implementation, task execution in unknown environments
Advantages	Good at representing a complete plan	Good at distinguishing plans	Strong logic correlations among execution steps	Flexible task plan, human cognitive process imitating, strong environment/user adaptability
Disadvantages	Weak capability in modeling the mutual distinctiveness among tasks.	Weak capability in representing a complete task; rely on large amount of training data	Inflexible task execution, weak environment adaptation	Parameters are difficult to learn, the current soft logic is still far from a human cognitive process
Typical references	^72,75,77–79	^83–87	^66,70,88,90,91	^73,89,95

NL: natural language; NB: naïve Bayesian; BN: Bayesian network; HMM: hidden Markov model; MRF: Markov random field; MLN: Markov logic network.

Knowledge-world mapping

With an understanding of NL instructions and execution plans, it is critical for a robot to use this knowledge in practical cooperation scenarios. Knowledge-world mapping methods are developed to enable intuitive human–robot cooperation in real-world situations. The general process of knowledge-world mapping is shown in Figure 9. Considering the different implementation problems, knowledge-world mapping methods include two main types: theoretical knowledge grounding and knowledge gap filling. Theoretical knowledge grounding methods accurately map learned knowledge items, such as objects and spatial/temporal logic relations, into corresponding objects and relations in real-world scenarios. Gap filling methods detect and recommend two kinds of knowledge: missing knowledge, which is needed in real-world situations but is not covered by theoretical execution plans, and real-world inconsistent knowledge, which is provided by a human but has no counterpart in practical real-world scenarios.

Figure 9.

Typical methods for theoretical knowledge grounding. In (a), from Takano and Nakamura,¹¹ predefined motions, such as “avoiding, bowing, carrying,…,” were directly associated with their corresponding symbolic words. In (b), from Hemachandra et al.,⁹⁶ special features, such as “kitchen location, lab locations,…,” were considered to identify human-desired paths.

Theoretical knowledge grounding

To accurately map theoretical knowledge to practical things, knowledge grounding methods are developed. In these methods, a knowledge item is defined by properties, such as visual properties “object color and shape” captured by RGB cameras, motion properties “action speed” captured by motion tracking systems, and execution properties “tool usage and location” captured by RFID. Different from the direct symbol mapping method, which maps element by element, the general property mapping method maps structurally. The rationale behind these methods is that a knowledge item can be successfully grounded into the real world by mapping its properties. The properties are collected by using methods such as “semantic similarity measurement,”⁹⁷ which can establish correlations between an object and its corresponding properties. One typical method that uses general property mapping is the semantic map.^96,98 Theoretical indoor entities such as rooms and objects are identified by meaningful real-world properties, such as location, color, point cloud, spatial relation “parallel,” and neighbor entities, constructing a semantic map with both objective locations and semantic interpretations “wall, ceiling, wall, floor.” Visual properties in the real world are usually detected by RGBD cameras, and spatial relations are detected by laser sensors and motion tracking systems. By identifying these properties in the real world, an indoor entity is identified, enabling accurate robotic navigation in real-world NLC. Other typical mapping methods include the following.

Object searching by using NL instructions (detected by microphones) as well as visual properties such as object color, size, and shape (detected by motion tracking systems and cameras).⁹⁹

Executing NL-instructed motion plans, such as “pick up the tire pallet” by focusing on realizing actions “drive, insert, raise, drive, set.”¹⁰⁰

Identifying human-desired cooperation places, such as “lounge, lab, conference room,” by checking spatial-semantic distributions of landmarks, such as “hallway, gym,…”¹⁰¹

With these mapping methods, knowledge can be mapped into the real world in a flexible manner, in which only part of an item’s properties needs to be matched to ground a theoretical item into a real-world thing. This flexibility can improve a robot’s adaptability toward users and environments. The limitation is that these mapping methods still rely on predefinitions to give a robot knowledge, reducing the intuitiveness of human–robot cooperation.
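The partial property matching described above can be illustrated with a short Python sketch. The property names, the scene contents, and the 0.6 acceptance threshold are all hypothetical; real systems would obtain the observed properties from cameras, trackers, or RFID rather than from a hand-written dictionary.

```python
def match_ratio(item_properties, observed_properties):
    """Fraction of the knowledge item's properties confirmed by one observed object."""
    if not item_properties:
        return 0.0
    matched = sum(
        1
        for name, value in item_properties.items()
        if observed_properties.get(name) == value
    )
    return matched / len(item_properties)


def ground(item_properties, scene, threshold=0.6):
    """Return the id of the best-matching scene object, or None if no match reaches the threshold."""
    best_id, best_ratio = None, 0.0
    for object_id, observed in scene.items():
        ratio = match_ratio(item_properties, observed)
        if ratio > best_ratio:
            best_id, best_ratio = object_id, ratio
    return best_id if best_ratio >= threshold else None


# Theoretical knowledge item (assumed) and perceived scene (assumed).
CUP = {"color": "red", "shape": "cylinder", "location": "table"}
SCENE = {
    "obj1": {"color": "red", "shape": "cylinder", "location": "shelf"},
    "obj2": {"color": "blue", "shape": "box", "location": "table"},
}
grounded_object = ground(CUP, SCENE)
```

Here the cup is grounded to obj1 although its location differs from the expected one, because two of three properties match. Raising the threshold toward 1.0 makes the mapping stricter and closer to direct symbol mapping.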

Knowledge gap filling

A theoretical execution plan defines an ideal real-world situation. Given the unpredicted aspects of a practical situation, even if all defined knowledge has been accurately mapped into the real world, it is still challenging to ensure the success of NLC by providing all the knowledge needed in that situation. In real-world situations especially, human users and environment conditions vary, causing knowledge gaps, that is, knowledge that is required by real-world situations but is missing from a robot’s knowledge database.

To ensure the success of a robot’s execution, knowledge gap filling methods are developed to fill in these knowledge gaps. There are three main types of knowledge gaps: (1) environment gaps, which are constraints such as tool availability and space or location limitations imposed by unfamiliar environments¹⁰²; (2) robot gaps, which are constraints such as a robot’s physical structure strength, capable actions, and operation precision¹⁰³; (3) user gaps, which are missing information caused by abstract, ambiguous, or incomplete human NL instructions.^104,105 Filling these knowledge gaps enhances a robot’s capability to adapt to dynamic environments and various tasks or users. Knowledge gap filling is challenging in that it is difficult to make a robot aware of its knowledge shortage in specific situations, and it is difficult to make a robot understand how missing knowledge should be compensated for to achieve successful task execution.

The first step of gap filling is gap detection. Gap detection methods mainly include the following.

Hierarchical knowledge structure checking, which detects knowledge gaps by checking real-world-available knowledge from top-level goals to low-level NLC execution parameters defined in a hierarchical knowledge structure.^41,103

Knowledge-applicability assessment, which detects knowledge gaps by checking the similarities between theoretical scenarios and real-world scenarios.^34,103

Performance-triggered knowledge gap estimation, which detects knowledge gaps by considering the final execution performances.^105,106

Hierarchical knowledge structure checking has the rationale that if desired knowledge defined in a knowledge structure is missing in real-world situations, then knowledge gaps exist. Knowledge-applicability assessment has the rationale that if the NLC situation is not similar to previously trained situations, then knowledge gaps exist. Performance-triggered knowledge gap estimation has the rationale that if the final NLC performance of a robot is not acceptable, then knowledge gaps exist. In this detection stage, the execution plan provides reasoning mechanisms, while the real world provides practical things such as objects, locations, and human identities, and relations such as spatial and temporal relations, which are detected by perception systems.
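Of the three detection strategies, hierarchical knowledge structure checking is the most direct to sketch. The Python example below walks a task hierarchy from the top-level goal down to its sub-tasks and reports required items that are absent from the environment. The dictionary layout, the task names, and the required items are assumptions made for illustration.

```python
def find_gaps(node, available, path=()):
    """Walk the hierarchy top-down and collect required items missing from `available`."""
    here = path + (node["name"],)
    gaps = [(here, item) for item in node.get("requires", []) if item not in available]
    for child in node.get("children", []):
        gaps.extend(find_gaps(child, available, here))
    return gaps


# Hypothetical hierarchical knowledge structure for an assembly task.
TASK = {
    "name": "assemble_table",
    "children": [
        {"name": "drill_holes", "requires": ["driller"]},
        {"name": "clean_surface", "requires": ["vacuum cleaner"]},
        {"name": "install_screws", "requires": ["screwdriver", "table leg"]},
    ],
}
AVAILABLE = {"driller", "brush", "screwdriver"}  # what perception reports (assumed)

detected_gaps = find_gaps(TASK, AVAILABLE)
```

Each reported gap carries the path from the top-level goal to the sub-task that needs the missing item, so a later filling step knows which part of the plan is affected.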

The second step is the gap filling itself. Gap filling methods mainly include the following.

Using existing alternative knowledge such as “brush” in the robot knowledge base to replace inappropriate knowledge such as “vacuum cleaner” in NLC tasks such as “clean a surface”^4,106;

Using general commonsense knowledge “drilling action needs driller” in a robot database to satisfy the need for a specific type of knowledge such as “tool for drilling a hole in the install a screw task”^103,106;

Requesting knowledge input from human users by proactively asking questions such as “where is the table leg”^105,107,108;

Autonomously learning from the Internet for recognizing human daily intentions, such as “drink water, wash dishware.”^107,108

In the gap filling stage, the execution plan describes the needed knowledge items, and the real world provides practical objects as well as robot performance monitoring.
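The first three filling strategies listed above can be chained as a simple fallback, sketched below in Python. The function name, the lookup tables, and the callback for asking the user are hypothetical; autonomous learning from the Internet is left out of this sketch.

```python
def fill_gap(missing_item, need, available, alternatives, commonsense, ask_human):
    """Try alternative knowledge, then commonsense knowledge, then ask the human."""
    # 1. Replace the missing item with a known alternative that is available.
    for candidate in alternatives.get(missing_item, []):
        if candidate in available:
            return candidate, "alternative"
    # 2. Fall back to general commonsense knowledge about what the need requires.
    candidate = commonsense.get(need)
    if candidate in available:
        return candidate, "commonsense"
    # 3. Proactively ask the human user.
    return ask_human(f"What should I use for '{need}'?"), "human"


# Illustrative knowledge tables (assumed, loosely following the examples in the text).
ALTERNATIVES = {"vacuum cleaner": ["brush"]}
COMMONSENSE = {"drill a hole": "driller"}
AVAILABLE_ITEMS = {"brush", "driller"}


def scripted_user(question):
    """Stand-in for a real dialogue turn; always gives the same answer."""
    return "under the bench"


cleaning_fix = fill_gap("vacuum cleaner", "clean a surface",
                        AVAILABLE_ITEMS, ALTERNATIVES, COMMONSENSE, scripted_user)
drilling_fix = fill_gap("hand drill", "drill a hole",
                        AVAILABLE_ITEMS, ALTERNATIVES, COMMONSENSE, scripted_user)
leg_fix = fill_gap("table leg", "locate the table leg",
                   AVAILABLE_ITEMS, ALTERNATIVES, COMMONSENSE, scripted_user)
```

The ordering reflects one possible design choice: cheaper knowledge sources are tried first, and the human is queried only when the robot’s own knowledge base cannot supply a usable answer.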

Model comparison

Knowledge grounding and knowledge gap filling are two critical steps for a successful mapping between NL-instructed theoretical knowledge and real-world cooperation situations. For knowledge grounding models, the objective is to strictly map NL-instructed objects and logic relations into real-world conditions. It is a necessary step for all NLC application scenarios, such as human-like action learning and indoor and outdoor cooperative navigation. For knowledge gap filling models, the objective is to detect and repair missing or incorrect knowledge in human NL instructions. It is only necessary when human NL instructions cannot ensure successful NLC under the given real-world conditions. Typical scenarios include daily assistance such as serving a drink, where information such as the correct types of “drink” and “vessel” and the default places for drink delivery is missing, and cooperative surface processing, where execution procedures are incorrect and tools are missing.

Open problems

A typical problem of theoretical knowledge grounding is the non-executable-instruction problem. Human NL-instructed knowledge is usually ambiguous, in that NL-mentioned objects are too ambiguous to be identified in the real world; abstract, in that high-level cooperation strategies are difficult to interpret into low-level execution details; information-incomplete, in that important cooperation information such as tool usages, action selections, and working locations is partially omitted; and real-world inconsistent, in that human NL-instructed knowledge is not available in the real world. These non-executable problems limit practical executions of human NL-instructed plans. One type of cause of non-executable-instruction problems includes intrinsic NL characteristics, such as omitting, referring, and simplifying, as well as human speaking habits, such as different sentence organizations and phrase usages. Another type of cause is the lack of environment understanding. For example, if object-related information such as availability, location, and distance to a robot or human is ignored, it is difficult for a robot to infer which object a human user needs.⁵¹

For knowledge gap filling, when a robot queries knowledge from either a human or open knowledge databases such as openCYC,¹⁰⁹ the scalability is limited. For a specific user or a specific open knowledge database, the available content is insufficient to satisfy general knowledge needs in various NLC executions. The time and labor costs are also high, further limiting knowledge support for NLC. Model details are presented in Table 3.

Table 3.

Summary of knowledge-world mapping methods.

	Theoretical knowledge grounding	Knowledge gap filling
	Theoretical knowledge grounding	Hierarchical knowledge structure checking	Performance-triggered knowledge gap estimation
Knowledge format	Real-world objects, spatial/temporal/logic correlations	Real-world objects, spatial/temporal/logic correlations	Real-world objects, spatial/temporal/logic correlations
Algorithms	Typical classification algorithms	First-order logic, ontology tree	Typical classification algorithms
User adaptability	Low	Middle	Middle
Tackled problems	Indoor routine identifying, accurate object searching, scene understanding	Detecting gaps among environments, users, and robots	Detecting gaps among environments, users, and robots
Advantages	Flexible knowledge usage, improving robots’ environment adaptability.	Improving smoothness of task executions	Improving the success rate of task executions
Disadvantages	Predefined manner limits robots’ environment adaptability, execution naturalness and intuitiveness	Difficult to decide which knowledge can miss	Difficult to decide which gaps lead to the failure of task executions
Typical references	^96,99–101	^80,103,106	^103,107,108

Discussion

DL for better command understanding

Nowadays, NLP is undergoing a deep learning (DL) revolution to create sophisticated models for semantic analysis. Potential benefits of using advanced DL models for NLC include the following.

Word embedding methods add semantic correlations, such as “cats and dogs are animals,” between otherwise unrelated words, such as “cat, dog.”¹¹⁰ In future NLC research, embedding methods could be used to introduce extra task-specific meanings, such as the general common sense “drilling needs the tool driller” and safety rules such as “stay away from hot surface and sharp tools,” endowing robots with better command understanding and awareness of environment limitations and human requirements.

Sequence-to-sequence language models, such as long short-term memory (LSTM) networks, output meaning sequentially based on continuously inputted text.^109,111 In future NLC research, sequence-to-sequence models could enable robots to follow real-time instructions for timely executions and modifications by aligning temporal verbal instructions with task-related knowledge, such as “action sequence and location assignments.”

Attention-based NL understanding models, such as the “Recurrent neural networks (RNN) encoder–decoder,” emphasize relatively important words by increasing the weights of the important expressions.¹¹² For example, in translating the sentence “that does not mean that we want to bring an end to subsidization,” the keyword “subsidization,” which carries the key information, is emphasized.¹¹³ In future NLC research, attention models could help a robot focus on human-desired executions by analyzing verbal attention.
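To make the weighting idea behind attention concrete, the following Python sketch turns per-word relevance scores into normalized weights with a softmax. In an actual attention model the scores are computed from learned parameters; here the instruction and the scores are hand-set, hypothetical values used only to show the normalization step.

```python
import math


def attention_weights(scores):
    """Softmax: convert raw relevance scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def most_attended(words, scores):
    """Return the (word, weight) pair that receives the largest attention weight."""
    weights = attention_weights(scores)
    return max(zip(words, weights), key=lambda pair: pair[1])


# Hypothetical instruction and hand-set relevance scores.
WORDS = ["please", "hand", "me", "the", "driller"]
SCORES = [0.1, 1.2, 0.0, 0.0, 2.5]

focus_word, focus_weight = most_attended(WORDS, SCORES)
```

With these scores, the tool name receives the largest weight, which is the kind of emphasis a robot could use to decide which part of an instruction to act on first.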

Cost reduction for knowledge learning

Cost reduction in knowledge collection is critical for intuitive NLC. On the one hand, to understand human NL instructions, represent tasks, or fill in knowledge gaps, a large amount of reliable knowledge is needed. On the other hand, time, economic cost, and labor investment need to be reduced. To solve this problem, two trends in developing knowledge-scaling-up methods have recently appeared: existing-knowledge exploitation and new-knowledge exploration. In existing-knowledge exploitation, existing knowledge is interpreted and abstracted into general knowledge, thereby increasing knowledge interchangeability. Knowledge for specific situations, such as “use cup and spoon for preparing coffee,” could be used for general situations, such as “preparing drink.”⁷⁴ In new-knowledge exploration, new knowledge is collected by proactively asking humans and autonomously retrieving it from the World Wide Web,¹¹⁴ books,¹¹⁵ operation logs,¹¹⁶ and videos.¹¹⁷

NLC system personalization

When a robot cooperates with a specific human for a long time, personalization of the robot becomes critical. Personalization means not only defining individualized knowledge for a robot to adapt to a specific user, but also designing a knowledge-individualization method that lets a robot autonomously adapt to different users.¹⁰² Therefore, a future research direction in NLC would be developing knowledge-personalization methods that consider both execution preferences and social norms, supporting long-term NLC personalization.

Safety consideration in NLC system design

When robots are deployed in human-involved environments, it is necessary to follow safety standards^118,119 to detail and minimize safety hazards before actual implementation. The objective of enforcing safety standards in NLC system design is to protect humans’ individual safety and psychological comfort. Given the unique communication and interaction aspects of NLC systems, the safety considerations typically fall into three types, summarized as follows. First, verbally instructed actions that endanger human safety should be rejected or carefully assessed by the robotic system.¹²⁰ This requires NLC systems to have safety-related common sense for understanding and predicting safety issues. Second, verbal abuse, whether from a human to a robot or from a robot to a human, should be avoided, because verbal abuse makes a human psychologically uncomfortable and ultimately affects the performance of NLC systems.^121,122 Ensuring psychological comfort in NLC systems requires research on human behaviors and human–machine trust. Third, it is necessary to enforce rigorous risk assessment and controlled safety verification before releasing NLC systems into the market, to make sure they are safe and helpful without causing safety issues for humans.¹²³

Conclusion

This article reviewed state-of-the-art methodologies for realizing NLC. With in-depth analysis of application scenarios, method rationales, method formulations, and current challenges, research on using NL to push forward the limits of human–robot cooperation was summarized from a high-level perspective. This review mainly categorized a typical NLC process into three steps: NL instruction understanding, NL-based execution plan generation, and knowledge-world mapping. With these three steps, a robot can communicate with a human, reason about human NL instructions, and practically provide human-desired cooperation according to those instructions.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Rui Liu

Xiaoli Zhang

References

Winograd

. Understanding natural language. Cogn Psychol 1972; 3: 1–91.

Baraglia

Cakmak

Nagai

. Initiative in robot assistance during collaborative task execution. In: ACM/IEEE HRI, New Zealand, March 2016, pp. 67–74.

Tellex

S A

Kollar

Dickerson

. Understanding natural language commands for robotic navigation and mobile manipulation. In: AAAI, San Francisco, California, 7–11 August 2011, pp. 1507–1514.

Liu

Webb

Zhang

. Natural-language-instructed industrial task execution. In: ASME IDETC/CIE, North Carolina, USA, 21–24 August 2016, pp. v01bt02a043–v01bt02a04.

Tenorth

Nyga

Beetz

. Understanding and executing instructions for everyday manipulation tasks from the world wide web. In: IEEE ICRA, 2010, pp. 1486–1491.

Hemachandra

Duvallet

Howard

. Learning models for following natural language directions in unknown environments. In: IEEE ICRA, 2015, pp. 5608–5615.

Iwata

Sugano

. Human-robot-contact-state identification based on tactile recognition. IEEE Trans Ind Electron 2005; 52(6): 1468–1477.

Kruger

Surdilovic

. Hand force adjustment: robust control of force-coupled human–robot-interaction in assembly processes. CIRP Ann Manuf Technol 2008; 57(1): 41–44.

Kim

Jung

Kavuri

. Intention estimation and recommendation system based on attention sharing. In: International conference on neural information processing, 2013, pp. 395–402.

10.

Liu

Zhang

. Understanding human behaviors with an object functional role perspective for robotics. IEEE Trans Cogn Develop Syst 2016; 8(2): 115–127.

11.

Takano

Nakamura

. Action database for categorizing and inferring human poses from video sequences. Robot Auton Syst 2015; 70: 116–125.

12.

Raman

Lignos

Finucane

. Sorry Dave, I’m afraid I can’t do that: explaining unachievable robot tasks using natural language. In: Robotics: science and systems, Berlin, Germany, 24–28 June 2013.

13.

Matuszek

Herbst

Zettlemoyer

. Learning to parse natural language commands to a robot control system. In: Desai

Dudek

Khatib

Kumar

(eds) Experimental robotics. Springer tracts in advanced robotics, Vol. 88. Heidelberg: Springer, 2016, pp. 403–415.

14.

Waldherr

Romero

Thrun

. A gesture based interface for human-robot interaction. Auton Robot 2000; 9(2): 151–173.

15.

Hunston

Francis

. Pattern grammar: a corpus-driven approach to the lexical grammar of English. Comput Linguist 2000; 27(2): 318–320.

16.

Hsiao

Vosoughi

Tellex

. Object schemas for responsive robotic language use. In: Proceedings of the third ACM/IEEE international conference on Human robot interaction, Amsterdam, The Netherlands, 12–15 March 2008, pp. 233–240. ACM: New York, USA.

17.

Guerin

Lea

Paxton

. A framework for end-user instruction of a robot assistant for manufacturing. In: IEEE ICRA, Seattle, May 2015, pp. 6167–6174.

18.

Steels

Kaplan

. AIBO’s first words: the social learning of language and meaning. Evol Commun 2000; 4: 3–32.

19.

Huang

Tellex

Bachrach

. Natural language command of an autonomous micro-air vehicle. In: IEEE/RSJ IROS, 2010, pp. 2663–2669.

20.

Hollerbach

. The international journal of robotics research. SAGE. [Online]. http://journals.sagepub.com/home/ijr. (accessed 27 January 2017).

21.

Park

. IEEE Xplore: IEEE transactions on robotics [Online]. http://ieeexplore.ieee.org/xpl/RecentIssue.jsp? punumber=8860. (accessed 27 January 2017).

22.

Dechter

. Artificial intelligence. [Online]. https://www.journals.elsevier.com/artificial-intelligence/. (accessed 27 January 2017).

23.

Fujita

. Knowledge-based systems. [Online]. https://www.journals.elsevier.com/knowledge-based-systems. (accessed 27 January 2017).

24.

ICRA2015. in ICRA2015. [Online]. http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp? punumber=71 28761. (accessed 27 January 2017).

25.

IROS2015.in IROS2015. [Online]. https://www.ieee.org/conferences_events/conferences/conferenc edetails/index.html? Conf_ID=33365. (accessed 27 January 2017).

26.

AAAI15: Twenty-Ninth conference on artificial intelligence (AAAI15), 1995. [Online]. http://www.aaai.org/Conferences/AAAI/aaai15.php. (accessed 27 January 2017).

27.

GoogleScholar. [Online]. https://scholar.google.com.

28.

Argall

Chernova

Veloso

. A survey of robot learning from demonstration. Robot Auton Syst 2009; 57: 469–483.

29.

Rautaray

Agrawal

. Vision based hand gesture recognition for human computer interaction: a survey. Artif Intell Rev 2015; 43: 1–54.

30.

Bethel

Salomon

Murphy

. Survey of psychophysiology measurements applied to human-robot interaction. In: IEEE RO-MAN, 2007, pp. 732–737.

31.

Argall

Billard

. A survey of tactile human–robot interactions. Robot Auton Syst 2010; 58: 1159–1176.

32.

House

B, Malkin

Bilmes

. The VoiceBot: a voice controlled robot arm. In: ACM SIGCHI, 2009, pp. 183–192.

33.

Dominey

Mallet

Yoshida

. Progress in programming the hrp-2 humanoid using spoken language. In: IEEE ICRA, 2007, pp. 2169–2174.

34.

Ovchinnikova

Wachter

. Multi-purpose natural language understanding linked to sensorimotor experience in humanoid robots. In: IEEE-RAS Humanoids, 2015, pp. 365–372.

35.

Lee

Kim

Yoon

. Designing a human-robot interaction framework for home service robot. In: IEEE RO-MAN, 2005, pp. 286–293.

36.

Lee

Kim

Lee

. Affective effects of speech-enabled robots for language learning. In: IEEE SLT, 2010, pp. 145–150.

37.

Motallebipour

Bering

. A spoken dialogue system to control robots. 2002.

38.

Dantam

Stilman

. The motion grammar: analysis of a linguistic method for robot control. IEEE Trans Robot 2013; 29: 704–718.

39.

Bicho

Louro

Erlhagen

. Integrating verbal and nonverbal communication in a dynamic neural field architecture for human–robot interaction. Front Neurorobot 2010; 4: 1–13.

40.

Mcguire

Fritsch

Steil

. Multi-modal human-machine communication for instructing robot grasping tasks. In: IEEE/RSJ IROS, Vol. 2, 2005, pp. 1082–1088.

41.

Zender

Jensfelt

Mozos

. An Integrated robotic system for spatial understanding and situated interaction in indoor environments. In: AAAI, 2007, pp. 1584–1589.

42.

Guadarrama

Riano

Golland

. Grounding spatial relations for human-robot interaction. In: IEEE/RSJ IROS, 2013, pp. 1640–1647.

43.

Scheutz

Cantrell

Schermerhorn

. Toward humanlike task-based dialogue processing for human robot interaction. AI Mag 2011; 32: 77–84.

44.

Kollar

Perera

Nardi

. Learning environmental knowledge from task-based human-robot dialog. In: IEEE ICRA, 2013, pp. 4304–4309.

45.

Fasola

Mataric

. Using semantic fields to model dynamic spatial relations in a robot architecture for natural language instruction of service robots. In: IEEE/RSJ IROS, 2013, pp. 143–150.

46.

Chen

. Understanding user instructions by utilizing open knowledge for service robots. arXiv: 1606.02877v1, 2016.

47.

Brenner

Hawes

Kelleher

. Mediating between qualitative and quantitative representations for task-orientated human-robot interaction. In: Proceedings of the 20th international joint conference on artificial intelligence, IJCAI 2007, Hyderabad, India, 6–12 January 2007, pp. 2072–2077. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

48.

Cantrell

Schermerhorn

Scheutz

. Learning actions from human-robot dialogues. In: IEEE RO-MAN, 2011, pp. 125–130.

49.

Jayawardena

Watanabe

Izumi

. Posture control of robot manipulators with fuzzy voice commands using a fuzzy coach–player system. Adv Robot 2007; 21: 293–328.

50.

Zhang

Knoll

. A two-arm situated artificial communicator for human–robot cooperative assembly. IEEE Trans Ind Electron 2003; 50: 651–658.

51.

Misra

Sung

Lee

. Tell me Dave: context-sensitive grounding of natural language to manipulation instructions. Int J Robot Res 2016; 35: 281–300.

52.

Fong

Nourbakhsh

Kunz

. The peer-to-peer human-robot interaction project. In: AIAA Space 2005, Long Beach, California, August 2005, paper no. 6750.

53.

Rybski

Stolarz

Yoon

. Using dialog and human observations to dictate tasks to a learning robot assistant. Intell Serv Robot 2008; 1: 159–167.

54.

Walter

Hemachandra

Homberg

. Learning semantic maps from natural language descriptions. In: RSS, 2013.

55.

Bastianelli

Croce

Vanzo

. A discriminative approach to grounded spoken language understanding in interactive robotics. In: IJCAI, 2016, pp. 9–15.

56.

Thomason

Zhang

Mooney

. Learning to interpret natural language commands through human-robot dialog. In: IJCAI, 2015.

57.

Foster

Richert

. Human-robot dialogue for joint construction tasks. In: ICMI Proceedings of the eighth international conference on multimodal interfaces, Banff, Alberta, Canada, 2–4 November 2006, pp. 68–71. New York, USA: ACM.

58.

Bischoff

Graefe

. Dependable multimodal communication and interaction with robotic assistants. In: IEEE RO-MAN, 2002, pp. 300–305.

59.

Breazeal

Aryananda

. Recognition of affective communicative intent in robot-directed speech. Auton Robot 2002; 12: 83–104.

60.

Connell

Marcheret

Pankanti

. An extensible language interface for robot manipulation. In: ICAGI, 2012, pp. 21–30.

61.

Stiefelhagen

Fugen

Gieselmann

. Natural human-robot interaction using speech, head pose and gestures. In: IEEE/RSJ IROS, 2004, pp. 2422–2427.

62.

Ghidary

Nakata

Saito

. Multi-modal human robot interaction for map generation. In: IEEE/RSJ IROS, 2001, pp. 2246–2251.

63.

Levinson

Zhu

. Automatic language acquisition by an autonomous robot. In: IJCNN, Vol. 4, 2003, pp. 2716–2721.

64.

Kruijff

Zender

Jensfelt

. Situated dialogue and spatial organization: what, where and why. IJARS 2007; 4: 16.

65.

Oliveira

Ince

Nakamura

. An active audition framework for auditory-driven HRI: application to interactive robot dancing. In: IEEE RO-MAN, 2012, pp. 1078–1085.

66.

Shimizu

Haas

. Learning to follow navigational route instructions. In: IJCAI, 2009, pp. 1488–1493.

67.

Gemignani

Veloso

Nardi

. Language-based sensing descriptors for robot object grounding. In: Robot Soccer World Cup, 2015, pp. 3–15.

68.

Lu

Zhang

Stone

. Leveraging commonsense reasoning and multimodal perception for robot spoken dialog systems. In: IEEE/RSJ IROS, 2017, pp. 3855–3863.

69.

Allen

Duong

Thompson

. Natural language service for controlling robots and other agents. In: IEEE KIMAS, 2005, pp. 592–595.

70.

Bos

. Applying automated deduction to natural language understanding. J Appl Logic 2009; 7: 100–112.

71.

Takano

. Learning motion primitives and annotative texts from crowd-sourcing. Robomech J 2015; 2: 1–9.

72.

Salvi

Montesano

Bernardino

. Language bootstrapping: learning word meanings from perception–action association. IEEE Trans Syst Man Cybern B 2011; 42: 660–671.

73.

Liu

Zhang

. Generating machine-executable plans from end-user’s natural-language instructions. Knowl Based Syst 2017; 140: 15–26.

74.

Liu

Zhang

Webb

. Context-specific intention awareness through web query in robotic caregiving. In: IEEE ICRA, 2015, pp. 1962–1967.

75.

Sattar

Dudek

. Towards quantitative modeling of task confirmations in human-robot dialog. In: IEEE ICRA, 2011, pp. 1957–1963.

76.

Liu

Zhang

. Context-specific grounding of web natural descriptions to human-centered situations. Knowl Based Syst 2016; 111: 1–16.

77.

Dindo

Zambuto

. A probabilistic approach to learning a visually grounded language model through human-robot interaction. In: IEEE/RSJ IROS, 2010, pp. 790–796.

78.

Oates

. Grounding knowledge in sensors: unsupervised learning for language and planning. 2001.

79.

Krunic

Salvi

Bernardino

. Affordance based word-to-meaning association. In: IEEE ICRA, 2009, pp. 4138–4143.

80.

Deits

Tellex

Thaker

. Clarifying commands with information-theoretic human-robot dialog. JHRI 2012; 1: 78–95.

81.

Matuszek

Fitzgerald

Zettlemoyer

. A joint model of language and perception for grounded attribute learning. arXiv preprint arXiv:1206.6423. 2012.

82.

Bustamante

Garrido

Soto

. Fuzzy naive Bayesian classification in RoboSoccer 3D: a hybrid approach to decision making. In: Robot Soccer World Cup, 2006, pp. 507–515.

83.

Tahboub

. Intelligent human-machine interaction based on dynamic Bayesian networks probabilistic intention recognition. J Intell Robot Syst 2006; 45: 31–52.

84.

Burger

Ferrane

Lerasle

. Two-handed gesture recognition and fusion with speech to command a robot. Auton Robot 2012; 32: 129–147.

85.

Doshi

Roy

. Spoken language interaction with model uncertainty: an adaptive human–robot interaction system. Connect Sci 2008; 20: 299–318.

86.

Takano

Nakamura

. Bigram-based natural language model and statistical motion symbol model for scalable language of humanoid robots. In: IEEE ICRA, 2012, pp. 1232–1237.

87.

Rossi

Leone

Fiore

. An extensible architecture for robust multimodal human-robot communication. In: IEEE/RSJ IROS, 2013, pp. 2208–2213.

88.

Bos

Oka

. A spoken language interface with a mobile robot. Artif Life Robot 2007; 11: 42–47.

89.

Jain

Mosenlechner

Beetz

. Equipping robot control programs with first-order probabilistic reasoning capabilities. In: IEEE ICRA, 2009, pp. 3626–3631.

90.

Dzifcak

Scheutz

Baral

. What to do and how to do it: translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: IEEE ICRA, 2009, pp. 4163–4168.

91.

Finucane

Jing

. LTLMoP: experimenting with language, temporal logic and robot control. In: IEEE/RSJ IROS, 2010, pp. 1988–1993.

92.

Cheng

Jia

Fang

. Modelling and analysis of natural language controlled robotic systems. IFAC Proc Vol 2014; 47: 11767–11772.

93.

Johnson-Frey

. What’s so special about human tool use? Neuron 2003; 39: 201–204.

94.

Gray

Breazeal

. Manipulating mental states through physical action. Int J Soc Robot 2014; 6: 315–327.

95.

Tenorth

. Knowledge processing for autonomous robots. Dissertation. Universität München, 2011.

96.

Hemachandra

Walter

Tellex

. Learning spatial-semantic representations from natural language descriptions and scene classifications. In: IEEE ICRA, 2015, pp. 2623–2630.

97.

Bastianelli

Croce

Basili

. Using semantic models for robust natural language human robot interaction. In: AIIA, 2015, pp. 343–356.

98.

Nuchter

Hertzberg

. Towards semantic maps for mobile robots. Robot Auton Syst 2008; 56: 915–926.

99.

Steels

Baillie

. Shared grounding of event descriptions by autonomous robots. Robot Auton Syst 2003; 43: 163–173.

100.

Tellex

Thaker

Joseph

. Learning perceptually grounded word meanings from unaligned parallel data. Mach Learn 2014; 94: 151–167.

101.

Spexard

Wrede

. BIRON, where are you? Enabling a robot to learn new places in a real home environment by integrating spoken dialog and visual localization. In: IEEE/RSJ IROS, 2006, pp. 934–940.

102.

Mason

Lopes

. Robot self-initiative and personalization by learning through repeated interactions. In: ACM/IEEE HRI, 2011, pp. 433–440.

103.

Chen

Xie

. Toward open knowledge enabling for human-robot interaction. JHRI 2012; 1: 100–117.

104.

Thomas

Jenkins

. RoboFrameNet: verb-centric semantics for actions in robot middleware. In: IEEE ICRA, 2012, pp. 4750–4755.

105.

Knepper

Tellex

. Recovering from failure by asking for help. Auton Robot 2015; 39: 347–362.

106.

Scioni

Borghesan

Bruyninckx

. Bridging the gap between discrete symbolic planning and optimization-based robot control. In: 2015 IEEE international conference on robotics and automation (ICRA), 26–30 May 2015, pp. 5075–5081. USA: IEEE.

107.

Toris

Kent

Chernova

. Unsupervised learning of multi-hypothesized pick-and-place task templates via crowdsourcing. In: IEEE ICRA, 2015, pp. 4504–4510.

108.

Tenorth

Klank

Pangercic

. Web-enabled robots. IEEE Robot Autom Mag 2011; 18: 58–68.

109.

Agarwal

Gurumurthy

Sharma

. Community regularization of visually-grounded dialog. 2018. arXiv:1808.04359.

110.

Weston

Ratle

Collobert

. Deep learning via semi-supervised embedding. In: Proceedings of the 25th international conference on machine learning (ICML), 2008, pp. 639–655.

111.

Sutskever

Vinyals

. Sequence to sequence learning with neural networks. Adv Neural Inf Process Syst 2014; 4: 3104–3112.

112.

Bahdanau

Cho

Bengio

. Neural machine translation by jointly learning to align and translate. In: International conference on learning representations (ICLR), 2015.

113.

Ling

Trancoso

Dyer

. Character-based neural machine translation. In: ICLR, 2016.

114.

Samadi

Kollar

Veloso

. Using the web to interactively learn to find objects. In: Proceedings of the 26th AAAI conference on artificial intelligence, 2012.

115.

Gordon

Breazeal

. Bayesian active learning-based robot tutor for children’s word-reading skills. In: Proceedings of the 29th AAAI conference on artificial intelligence, Austin, Texas, 25–30 January 2015. pp. 1343–1349.

116.

Zeng

Sun

Duan

. Cross-organizational collaborative workflow mining from a multi-source log. Decis Support Syst 2013; 54: 1280–1301.

117.

Liu

Zhang

. Web-video-mining-supported workflow modeling for laparoscopic surgeries. Artif Intell Med 2016; 74: 9–20.

118.

https://www.iso.org/standard/53820.html.

119.

Jacobs

Virk

. ISO 13482—The new safety standard for personal care robots, ISR/Robotik 2014. In: 41st international symposium on robotics, Munich, Germany, 2014, pp. 1–6.

120.

Singh

Payne

Jennings

. Toward a methodology for assessing electric vehicle exterior sounds. IEEE Trans Intell Transp Syst 2014; 15(4): 1790–1800.

121.

Han

Lin

Song

. Robotic emotional expression generation based on mood transition and personality model. IEEE Trans Cybern 2013; 43(4): 1290–1303.

122.

Wrede

Sagerer

. A dialog system for comparative user studies on robot verbal behavior. In: IEEE international symposium on robot and human interactive communication, 6–8 September 2006, pp. 129–134. Hatfield, UK: IEEE.

123.

Guiochet

Martin-Guillerez

Powell

. Experience with model-based user-centered risk assessment for service robots. In: IEEE international symposium on high-assurance systems engineering (HASE), 3–4 November 2010, pp. 104–113. CA, USA: IEEE.
