Introducing WebAgent: An LLM-Driven Agent for Real-World Web Navigation
Introduction:
Google DeepMind and researchers from the University of Tokyo have developed WebAgent, an LLM-driven agent that can complete tasks on real websites by following natural language instructions. Large language models (LLMs) have shown great success in a range of natural language tasks, including arithmetic, commonsense and logical reasoning, question answering, and interactive decision-making. More recently, they have also demonstrated strong autonomous web navigation capabilities, controlling computers or browsing the internet to fulfill natural language instructions.
Challenges in Real-World Web Navigation:
Despite this success, real-world web navigation poses distinct challenges: there is no preset action space, HTML documents are long and complex, and LLMs lack HTML domain knowledge. Because real websites are open-ended, it is difficult to define a suitable action space in advance, and the missing HTML knowledge hurts navigation performance.
The WebAgent Solution:
WebAgent addresses these challenges in three steps. It plans by decomposing a natural language instruction into lower-level sub-instructions, one for each step. It summarizes lengthy HTML documents into task-relevant snippets based on the current sub-instruction. Finally, it acts on real websites by executing programs grounded in those sub-instructions and HTML snippets.
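The plan, summarize, and act loop described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the three functions are hypothetical rule-based stand-ins for the language models, the page is a made-up list of elements, and the generated "program" is a plain string rather than code that drives a browser. None of the names come from the WebAgent work itself.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical page, represented as a flat list of HTML elements.
PAGE = [
    '<button id="search">Search</button>',
    '<a id="login">Log in</a>',
    '<input id="city" placeholder="City">',
    '<footer id="legal">Terms and conditions</footer>',
]


def words(text: str) -> set:
    """Lowercase alphabetic words in a string (tags and attributes included)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def plan_next_step(instruction: str, history: List[str]) -> Optional[str]:
    """Toy planner: split on ' then ' and return the next unfinished step."""
    steps = [s.strip() for s in instruction.split(" then ")]
    return steps[len(history)] if len(history) < len(steps) else None


def summarize_html(page: List[str], sub_instruction: str) -> str:
    """Toy summarizer: keep elements sharing a word with the sub-instruction."""
    wanted = words(sub_instruction)
    return "\n".join(e for e in page if wanted & words(e))


def generate_program(sub_instruction: str, snippet: str) -> str:
    """Toy code generator: emit an action string aimed at the first id found."""
    match = re.search(r'id="([^"]+)"', snippet)
    target = match.group(1) if match else "unknown"
    verb = sub_instruction.split()[0].lower()
    return f'{verb}("#{target}")'


def run_agent(instruction: str, page: List[str]) -> List[Tuple[str, str, str]]:
    """Repeat plan -> summarize -> act until the planner has no step left."""
    history: List[str] = []
    results = []
    while True:
        sub = plan_next_step(instruction, history)
        if sub is None:
            break
        snippet = summarize_html(page, sub)
        program = generate_program(sub, snippet)
        results.append((sub, snippet, program))
        history.append(sub)
    return results


results = run_agent(
    "type tokyo into the city field then click the search button", PAGE
)
for sub, snippet, program in results:
    print(f"{sub!r} -> {program}")
```

The point of the structure is that the code generator only ever sees a short snippet tied to one sub-instruction, never the whole page.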
Enhancing LLM Capabilities:
The researchers combine two LLMs: HTML-T5, a domain-expert pre-trained language model used for planning and conditional HTML summarization, and Flan-U-PaLM, used for grounded code generation. By using local and global attention mechanisms in the encoder, HTML-T5 better captures the structure, syntax, and semantics of lengthy HTML documents. The model is pre-trained in a self-supervised manner on a large HTML corpus built from CommonCrawl, using long-span denoising objectives.
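To give a rough sense of why mixing local and global attention helps with long HTML, here is a simplified sketch. It assumes each token attends to neighbors within a fixed radius plus one averaged summary per fixed-size block. The function names, the radius, and the block-averaging scheme are assumptions made for illustration, not a description of HTML-T5's actual encoder.

```python
from typing import List, Tuple


def attention_pattern(
    n_tokens: int, radius: int, block_size: int
) -> List[Tuple[List[int], List[int]]]:
    """For each token, list the local token indices and global block indices
    it may attend to under this simplified scheme."""
    n_blocks = (n_tokens + block_size - 1) // block_size  # ceiling division
    pattern = []
    for i in range(n_tokens):
        # Local attention: a window of at most 2 * radius + 1 tokens.
        local = list(range(max(0, i - radius), min(n_tokens, i + radius + 1)))
        # Global attention: every token sees every block summary.
        global_blocks = list(range(n_blocks))
        pattern.append((local, global_blocks))
    return pattern


def block_summaries(values: List[float], block_size: int) -> List[float]:
    """Average each block of values to form one global summary per block."""
    summaries = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        summaries.append(sum(block) / len(block))
    return summaries


pattern = attention_pattern(n_tokens=8, radius=1, block_size=4)
sparse_cost = sum(len(local) + len(glob) for local, glob in pattern)
full_cost = 8 * 8  # every token attending to every token
print(sparse_cost, full_cost)
```

With a fixed radius and block size, the number of attended positions grows roughly linearly with input length instead of quadratically, which is what makes lengthy HTML inputs tractable in this kind of design.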
WebAgent's Superior Performance:
On static website comprehension tasks, WebAgent achieves higher QA accuracy than single LLMs and performs comparably to strong baselines. HTML-T5 plays a crucial role as a plug-in component of WebAgent and produces state-of-the-art results on web-based tasks. On the MiniWoB benchmark, HTML-T5 achieves a 14.9% improvement over prior LLM agents.
Key Contributions:
The researchers present WebAgent, a combination of two LLMs for practical web navigation. The generalist language model generates executable programs, while the domain-expert language model handles planning and HTML summarization. With HTML-T5, a new HTML-specific language model, success rates on real websites increase by over 50%, and on the MiniWoB benchmark HTML-T5 outperforms prior LLM agents by 14.9%.
Conclusion:
WebAgent represents a major advancement in LLM-driven web navigation, showcasing the potential of combining language models for real-world tasks. With its enhanced capabilities and improved success rates, WebAgent opens new avenues for natural language-based interactions with real websites, bringing us closer to more sophisticated and practical AI applications in web navigation.