
Enhancing Embodied AI: Teaching Agents to Seek Clarification Using Multimodal Large Language Models

7 min read
ai robotics embodied-ai multimodal-llm reinforcement-learning

The intersection of robotics and large language models has ushered in a new era of embodied AI systems capable of navigating real-world environments and interacting with humans. However, these systems often struggle with a fundamental aspect of human communication: recognizing when instructions are ambiguous and seeking appropriate clarification. A recent breakthrough paper published on arXiv, "Grounding Multimodal LLMs to Embodied Agents that Ask for Help with Reinforcement Learning," introduces a novel approach that enables robots to identify ambiguity in human instructions and ask relevant clarifying questions before acting.

The Challenge of Ambiguity in Human-Robot Interaction

Imagine instructing your home robot, "Please get me the book from the table," only to realize there are three different books present. This common scenario highlights a critical challenge in human-robot interaction: ambiguity in natural language instructions. Traditionally, embodied AI agents have faced three problematic options when encountering such situations:

  1. Making assumptions: The agent selects one book based on internal heuristics, potentially bringing the wrong item
  2. Requesting repetition: The agent asks for the entire instruction again, frustrating the human
  3. Refusing the task: The agent simply fails to complete the instruction, requiring the human to provide more specificity

None of these approaches effectively mirrors how humans navigate ambiguity—by asking targeted questions to clarify the specific ambiguous elements ("Which book would you prefer: the red hardcover, the paperback novel, or the textbook?").

The Ask-to-Act Framework: Teaching Robots When and How to Seek Clarification

The researchers address this challenge by introducing the "Ask-to-Act" task, which extends the traditional Vision-Language-Action (VLA) framework by incorporating clarification-seeking behavior. This framework trains agents to:

  1. Detect ambiguity: Determine when an instruction contains insufficient information given the current visual scene
  2. Generate relevant questions: Formulate targeted questions addressing the specific ambiguity
  3. Incorporate clarifications: Process the human's answer to resolve the ambiguity
  4. Execute the appropriate action: Perform the task correctly based on the complete information

This approach transforms robot behavior from passive instruction-following to active participation in cooperative dialogue—a significant step toward natural human-robot interaction.
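
At inference time, these four steps amount to a simple decision loop: ask only if the instruction underdetermines the action, otherwise act directly. The sketch below is a minimal illustration of that loop; the helper names (run_episode, ask_user, and the agent and env objects) are assumptions made for this example, not interfaces from the paper.

# Minimal sketch of an Ask-to-Act style decision loop at inference time
# (agent, env, and ask_user are illustrative placeholders)
def run_episode(agent, env, instruction):
    observation = env.reset()

    # 1. Detect ambiguity: decide whether a clarifying question is needed
    question = agent.generate_question(instruction, observation)

    if question is not None:
        # 2. Ask a targeted question and 3. fold the answer into the context
        clarification = ask_user(question)
        action = agent.select_action(instruction, observation, clarification)
    else:
        # The instruction already specifies the task unambiguously
        action = agent.select_action(instruction, observation)

    # 4. Execute the resolved action in the environment
    return env.step(action)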

Technical Innovation: Reinforcement Learning with LLM-Generated Rewards

The study's key innovation lies in its training methodology. Rather than relying on extensive human-annotated datasets of ambiguous scenarios (which would be prohibitively expensive and time-consuming to create), the researchers developed a reinforcement learning approach using LLM-generated rewards.

Their method involves:

  1. Initial fine-tuning: Starting with a multimodal LLM (capable of processing both visual and text inputs), the researchers fine-tune the model on a small set of human-annotated examples demonstrating appropriate question-asking behavior.

  2. Reward model development: The team creates an automatic evaluation framework that uses LLMs to assess:

    • Question relevance: Does the question address the actual ambiguity?
    • Question conciseness: Is the question direct and to the point?
    • Action correctness: After receiving clarification, does the agent perform the correct action?
  3. Policy optimization: The model is then optimized with reinforcement learning, using the LLM-generated rewards as the training signal (a form of reinforcement learning from AI feedback rather than conventional RLHF), maximizing performance on the above metrics without requiring human evaluators for each training instance.

This approach offers substantial advantages over methods that rely on human evaluation at every step. The training loop can be sketched roughly as follows:

# Pseudocode for LLM-reward-based training (helper functions assumed)
def train_clarification_agent(initial_model, training_environments, num_episodes):
    reward_model = initialize_reward_llm()      # LLM used as an automatic judge
    policy = initialize_policy(initial_model)   # multimodal LLM policy

    for _ in range(num_episodes):
        env = sample_environment(training_environments)
        instruction = generate_ambiguous_instruction(env)

        # Agent generates a clarifying question or decides to act directly
        question = policy.generate_question(instruction, env.observation)

        if question != "NO_QUESTION":
            clarification = generate_answer_to_question(question, env)
            action = policy.select_action(instruction, env.observation,
                                          clarification=clarification)
        else:
            action = policy.select_action(instruction, env.observation)

        # The reward LLM scores the question and the resulting action
        question_relevance = reward_model.evaluate_question_relevance(
            instruction, env.observation, question)
        question_conciseness = reward_model.evaluate_question_conciseness(question)
        action_correctness = reward_model.evaluate_action(
            instruction, env.observation, action)

        # Combine the scores into a single scalar reward
        reward = compute_weighted_reward(
            question_relevance, question_conciseness, action_correctness)

        # Update the policy with reinforcement learning
        policy.update(reward)

    return policy
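
The reward_model.evaluate_* calls above are where the LLM-as-judge design does its work. One plausible way to implement such a call, shown purely as an illustration, is to prompt a judge model for a numeric score and blend the resulting scores with fixed weights. The function query_llm, the prompt wording, and the weights below are assumptions made for this sketch, not values reported in the paper.

# Illustrative LLM-as-judge reward; query_llm, the prompt text, and the
# weights are assumptions, not details taken from the paper
def evaluate_question_relevance(query_llm, instruction, scene_description, question):
    prompt = (
        "You are grading a robot's clarifying question.\n"
        f"Instruction: {instruction}\n"
        f"Scene: {scene_description}\n"
        f"Question: {question}\n"
        "On a scale of 0 to 1, how well does the question target the actual "
        "ambiguity? Reply with a single number."
    )
    reply = query_llm(prompt)
    try:
        return max(0.0, min(1.0, float(reply.strip())))
    except ValueError:
        return 0.0  # an unparseable judge reply yields no reward

def compute_weighted_reward(relevance, conciseness, correctness,
                            weights=(0.25, 0.15, 0.6)):
    # Weight task success most heavily; question quality adds a denser signal
    w_rel, w_con, w_cor = weights
    return w_rel * relevance + w_con * conciseness + w_cor * correctness

In practice a judge's outputs would need calibration and consistency checks, but the structure is the essential idea: score each criterion separately, then collapse the scores into a single scalar reward for the policy update.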

Experimental Results: Performance Gains and Generalization

The study's experiments demonstrate compelling improvements over current approaches. The RL-finetuned MLLM outperformed strong zero-shot baselines (including GPT-4o) by margins of 19.1%–40.3% across various test scenarios. The performance gains were particularly notable in:

  • Novel object configurations: The agent successfully handled new arrangements of objects not seen during training
  • New object categories: The agent could ask appropriate questions even about object types not encountered previously
  • New instruction types: The agent generalized to instruction patterns beyond its training distribution

These results highlight the robustness of the approach and its potential for real-world deployment, where environments and instructions are inherently diverse and unpredictable.

Applications and Implications

The ability of embodied agents to ask clarifying questions extends far beyond household robots and has significant implications across numerous domains:

Healthcare Assistance

Robots providing patient care can seek clarification about symptoms or medication needs instead of making potentially dangerous assumptions. For example, when instructed to "bring the medication," a robot could ask "Which medication would you like: the pain reliever, the antibiotic, or the blood pressure medication?"

Educational Technology

AI teaching assistants can identify when a student's question contains ambiguities and seek clarification before providing potentially confusing or incorrect information. This mirrors effective teaching practices where educators check their understanding of student queries.

Industrial Robotics

Manufacturing robots can request specification clarification when task instructions are ambiguous, reducing errors and improving safety. When told to "tighten the bolt," a robot could ask "To what torque specification should I tighten it?" rather than applying an arbitrary torque.

Accessibility Technology

Assistive robots for individuals with disabilities can ensure they correctly understand user needs through clarification, particularly important when supporting people with limited mobility or communication abilities.

Ethical Considerations and Future Directions

While this research represents significant progress, several important considerations remain for future development:

Privacy and Data Security

As clarification-seeking robots collect more specific information from users, privacy protections become increasingly important. Future systems will need robust mechanisms to ensure user data is handled securely and ethically.

Cultural and Linguistic Sensitivity

Different cultures and linguistic groups have varying norms around question-asking behavior. Future research should explore how to make clarification-seeking behavior appropriate across diverse cultural contexts.

Avoiding Excessive Questioning

A robot that asks too many questions could become annoying rather than helpful. Finding the right balance between seeking necessary clarification and avoiding excessive interruption remains an open challenge.

Multimodal Clarification Methods

Beyond verbal questions, robots might use gestures (pointing between multiple options) or visual interfaces (highlighting ambiguous objects on a screen) to seek clarification more efficiently in certain contexts.

Conclusion

The development of embodied AI agents capable of recognizing ambiguity and seeking targeted clarification represents a significant advancement toward more natural and effective human-robot collaboration. By teaching robots to ask relevant questions—much as humans do when faced with unclear instructions—we can create systems that are not only more capable but also more intuitive to interact with.

This approach addresses a fundamental limitation in current AI systems: their struggle to acknowledge and resolve uncertainty through dialogue. As embodied AI continues to integrate into homes, workplaces, and public spaces, the ability to engage in clarification-seeking behavior will be essential for creating robots that can function as collaborative partners rather than rigid instruction-followers.

The research demonstrates that by combining multimodal large language models with reinforcement learning techniques, we can create systems that learn this crucial human-like communication skill without requiring prohibitively expensive human annotation. This scalable approach suggests a promising path forward for developing embodied AI that can navigate the inherent ambiguities of human language and real-world environments.


Further Resources

This post explores findings from recent research in embodied AI and multimodal large language models, highlighting how reinforcement learning can enable more natural human-robot interactions through appropriate clarification-seeking behavior.

Author

William Zujkowski

Personal website and technology blog