World-Renowned AI Researchers Reveal Breakthroughs Accelerating the Practical Application of AI Agents: Fujitsu's Challenge to Revolutionize AI Agent "Collaboration, Memory, and Quality"

Article | 2025-12-15

While 2025 is hailed as the "Year One of AI Agents" and the technology itself draws significant attention, numerous barriers remain to actual implementation. An MIT report [1] points out that a major failure factor is AI's inability to correctly understand situations, leading to unstable workflows. Gartner [2] similarly warns that complex system design, data management, and security issues complicate adoption. Worse still, research from MongoDB [3] shows that when multiple agents fail to coordinate, or when memory management and system structure are not robust, problems cascade quickly; their study found that these issues derail 40% to 80% of implementations.

Fujitsu has begun pioneering research and development [4] into AI agents that autonomously advance sophisticated tasks while collaborating with humans. Through this R&D, Fujitsu identifies three fundamental technological gaps that must be bridged to address these challenges: "Collaboration" to enable smooth multi-agent coordination, "Memory" to retain information without context loss, and "Quality" to optimize routing and ensure reliable output.
In this article, Dr. Kobashi of Fujitsu Research introduces cutting-edge research tackling these gaps, featuring interviews with the authors of the research papers.

Agent Data Protocol: Enabling “Collaboration” Among AI Agents

For multiple AI agents to work together effectively across diverse tasks, each agent needs strong, well-rounded training. Today, however, the shortage of high-quality supervised fine-tuning data, the kind required to teach agents how to collaborate smoothly, remains a major barrier. Without it, improving overall agent performance becomes much harder.

The breakthrough addressing this challenge is the Agent Data Protocol[5]. The conceptual diagram below illustrates how it works. First, raw data is collected from a variety of agent datasets. It is then processed using a unified set of Actions and Observations defined by the Agent Data Protocol, and the resulting Trajectory is stored.
By standardizing these diverse datasets and converting them into a form that is immediately ready for training, the protocol dramatically reduces the preparation time required before fine-tuning can begin. A dataset with more than 1.6 million training instances has already been released publicly, allowing anyone to begin training capable agents right away.
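To make the idea concrete, the sketch below shows how one raw episode might be normalized into a unified Action/Observation trajectory. This is a minimal illustration only; the class and field names are assumptions made for this article, not the protocol's actual schema, which is defined in the paper [5].

```python
# Minimal sketch of an Agent Data Protocol-style record. Class and field
# names are illustrative assumptions, not the protocol's real schema;
# see the paper [5] and its released dataset for the actual format.
from dataclasses import dataclass, field
from typing import Union

@dataclass
class Action:
    kind: str       # e.g. "api_call", "code", "message"
    content: str    # the agent's emitted action, serialized as text

@dataclass
class Observation:
    source: str     # e.g. "browser", "terminal", "user"
    content: str    # what the environment returned

@dataclass
class Trajectory:
    task: str       # natural-language task description
    steps: list[Union[Action, Observation]] = field(default_factory=list)

def convert(raw_episode: list[dict]) -> Trajectory:
    """Normalize one raw episode from an arbitrary agent dataset into
    the unified Action/Observation trajectory form."""
    traj = Trajectory(task=raw_episode[0].get("instruction", ""))
    for event in raw_episode:
        if event.get("role") == "agent":
            traj.steps.append(Action(kind=event.get("type", "message"),
                                     content=event["text"]))
        else:
            traj.steps.append(Observation(source=event.get("source", "env"),
                                          content=event["text"]))
    return traj
```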

Conceptual diagram of Agent Data Protocol

In this segment, Dr. Kobashi engages in an in-depth conversation with Professor Graham Neubig (Carnegie Mellon University) about the Agent Data Protocol.

――We'd like to ask about the challenges and solutions for deploying AI agents in enterprises. Integration with existing systems and overall complexity are major concerns. What technical and organizational barriers do you anticipate?

Professor Neubig: Deploying AI agents in enterprise environments is highly challenging for several reasons. The primary reason is that the data companies possess is generally not publicly available and is not included in the training data of existing large language models. However, this is where the Agent Data Protocol proves highly valuable. By creating company-specific datasets following this protocol and combining them with existing general-purpose datasets, we can improve agent accuracy without over-biasing toward specific corporate data. This approach allows AI agents to utilize valuable corporate data while preserving broad applicability. Security and other challenges remain, but this marks a significant step forward.

―― It is often noted that multi-agent systems consume up to 15 times more tokens than traditional chat systems. (Tokens refer to units of text processed by AI models.)
In training AI agents using the Agent Data Protocol, are there technical challenges to address from the perspectives of cost efficiency and computational resources?

Professor Neubig: There are two costs: training cost and inference cost. Training costs are not prohibitively high; for instance, a model with 32 billion parameters can be trained on two nodes with NVIDIA H100 GPUs in approximately four days. (H100 refers to NVIDIA’s high-performance AI accelerator).
This isn't particularly expensive compared to many other tasks. More importantly, once an agent is sufficiently trained, its knowledge can be transferred and utilized in smaller models [6]. This has the potential to improve cost efficiency by up to 10 times compared to using API-based models. The Agent Data Protocol provides a crucial foundation for achieving this significant improvement in cost efficiency.

――Please tell us about the necessary conditions and requirements for the Agent Data Protocol to become widely adopted as an industry standard, as well as the challenges the research community and industry should address.

Professor Neubig: First, we believe we must continue to actively promote this protocol. Research projects often end with the publication of a paper, but it's crucial to convert new agent training datasets into the Agent Data Protocol format whenever they emerge and to strengthen collaboration with other institutions. Furthermore, a key challenge we haven't yet addressed is handling multimodal data. We are currently working on this, and if achieved, it will make a significant difference in many use cases. We believe these are the keys to widespread adoption of the protocol.

Deepening AI Agent "Memory" with "Embodied RAG"

For AI agents, a memory mechanism is essential for maintaining contextual consistency and preventing information loss. In this chapter, we interviewed Professor Bisk, the paper's author, about the technical innovation behind "Embodied RAG (Retrieval-Augmented Generation)" [7], a groundbreaking approach for efficiently utilizing memory within physical environments.

In this segment, Dr. Kobashi explores the innovations behind "Embodied RAG" with Professor Yonatan Bisk (Carnegie Mellon University).

――First, could you explain how Embodied RAG differs from conventional RAG, particularly regarding the innovation in memory management within physical environments?

Professor Bisk: The most fundamental difference between Embodied RAG and conventional RAG lies in the "definition of semantic units" and the "method for determining relevance." Conventional RAG retrieves information based on word similarity within documents. However, in the physical environment Embodied RAG addresses—such as an office with rugs and plants—it is unclear how to define similarity for physical objects. For example, the experience of "the route taken to work today" is a composite accumulation of multiple elements and does not directly correspond to a specific document like language does. Therefore, the greatest innovation in Embodied RAG lies in how it understands the spatial environment, determines which information is worth retrieving, and judges which pieces of information are relevant to each other.

Conceptual diagram of Embodied RAG (Retrieval-Augmented Generation)

――How does Embodied RAG select appropriate information and retain only highly relevant context, despite the impact of information overload and irrelevant data that can hinder an agent's decision-making?

Professor Bisk: This challenge boils down to understanding dynamically changing "contextual relevance." For instance, when asked "I want to go to lunch," traditional RAG would only return restaurant information. Embodied RAG, however, must account for diverse environmental factors and personal circumstances—such as time until the next meeting, travel distance, current weather conditions, and even accessibility needs like wheelchair use. These factors constantly alter the relevance of information, influencing the agent's planning and actions. There is strong evidence that large language models often struggle to interpret spatial relationships. Therefore, it is crucial to flexibly adjust weighting in real-time based on "common-sense reasoning" akin to human cognition. Furthermore, not only visual information but also supplementary textual information, such as crowd levels, plays a role in enhancing the accuracy of reasoning.
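As a toy illustration of such dynamic weighting (and not Professor Bisk's actual method), relevance can be imagined as a base semantic similarity rescaled by contextual factors. All factors, constants, and weights below are assumptions for illustration:

```python
# Toy context-dependent relevance scoring. The base semantic similarity is
# assumed to come from an ordinary retriever; the factors and weights here
# are invented for illustration, not Embodied RAG's actual scoring [7].
def contextual_relevance(semantic_sim: float,
                         minutes_until_next_meeting: float,
                         distance_m: float,
                         wheelchair_accessible: bool,
                         needs_accessibility: bool) -> float:
    # Hard constraint: accessibility requirements override similarity.
    if needs_accessibility and not wheelchair_accessible:
        return 0.0
    score = semantic_sim
    # Penalize options that cannot fit the remaining time window
    # (round trip at a walking pace of roughly 80 m per minute).
    walking_minutes = distance_m / 80.0
    if 2 * walking_minutes > minutes_until_next_meeting:
        score *= 0.2
    return score

# e.g. a nearby cafe before a meeting in 30 minutes:
print(contextual_relevance(0.9, 30.0, 400.0, True, False))  # -> 0.9
```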

――Regarding shared memory in multi-agent environments, how does Embodied RAG share and manage memory while maintaining consistency when multiple agents operate in the same environment?

Professor Bisk: This is an area we're currently exploring, but ideally, each agent would have its own RAG database and share information "selectively." Centralized memory makes consistency difficult if agents go offline or services become unstable. Consider two people shopping in a mall: instead of sharing all their experiences, they exchange only task-relevant details such as “I found the food court” or “I located the shoe store.” This allows the other person to update their model efficiently based on necessary information—a "sparse graph." Key to this mechanism will be entity matching and the agents' ability to autonomously decide when to share information.
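A minimal sketch of this selective-sharing idea follows, assuming each agent keeps its own local memory. The keyword filter and lowercased-name entity matching are naive stand-ins for the far more robust matching a real system would need:

```python
# Minimal sketch of "selective" memory sharing between agents, assuming each
# agent keeps its own local memory. Keyword filtering and lowercased-name
# entity matching are naive stand-ins for real matching techniques.
def share_task_relevant(memory: dict[str, str], task_keywords: set[str]) -> dict[str, str]:
    """Return only the entries relevant to the current task: a sparse
    subgraph of the agent's full memory."""
    return {entity: fact for entity, fact in memory.items()
            if any(kw in entity.lower() or kw in fact.lower() for kw in task_keywords)}

def merge(own: dict[str, str], received: dict[str, str]) -> dict[str, str]:
    """Fold shared facts into local memory, matching entities by normalized name."""
    merged = {entity.strip().lower(): fact for entity, fact in own.items()}
    for entity, fact in received.items():
        merged[entity.strip().lower()] = fact
    return merged

# Agent A tells agent B only what matters for finding lunch:
agent_a = {"food court": "level 2, near the east entrance",
           "shoe store": "level 1, west wing"}
shared = share_task_relevant(agent_a, {"food", "lunch"})   # only the food court
```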

――How does Embodied RAG update and adapt its memory to environmental changes and unexpected situations? Unstable workflows are cited as a cause of failure. Could you elaborate on adaptation to dynamic environments?

Professor Bisk: To handle dynamic environments, "reliability estimation for memory" and "quantifying uncertainty" are essential. For example, the location of buildings or walls is a highly reliable memory, while the position of a coffee cup is prone to change and less reliable. Humans judge this intuitively, but agents similarly need the ability to determine what is still true based on memory decay and uncertainty. This enables agents to act more cautiously—seeking help or shortening plans. Furthermore, since physical mistakes can be irreversible, a "human-in-the-loop" mechanism becomes crucial when risk levels exceed a certain threshold, prompting requests for human intervention. Agents, as collaborators, must also adopt a stance of not acting on ambiguous instructions.
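The sketch below illustrates one way such reliability estimation could work, combining exponential memory decay with a human-in-the-loop trigger. The categories, decay rates, and thresholds are invented for illustration and are not taken from the paper:

```python
# Toy reliability estimation for memories via exponential decay, with a
# human-in-the-loop trigger. Categories, decay rates, and thresholds are
# assumed values for illustration, not taken from the Embodied RAG paper.
import math
import time

DECAY_RATE = {"building": 1e-9, "furniture": 1e-6, "coffee_cup": 1e-3}  # per second

def reliability(category: str, last_seen: float) -> float:
    """Confidence that a remembered observation still holds, decaying with age."""
    age = max(0.0, time.time() - last_seen)
    return math.exp(-DECAY_RATE.get(category, 1e-4) * age)

def act_or_ask(category: str, last_seen: float, risk: float) -> str:
    """Act on reliable memories; for risky actions on stale memories,
    ask a human before doing anything irreversible."""
    if risk > 0.5 and reliability(category, last_seen) < 0.8:
        return "request_human_confirmation"
    return "act"

# A cup seen an hour ago is unreliable; a wall seen a year ago is not.
print(act_or_ask("coffee_cup", time.time() - 3600, risk=0.9))  # -> request_human_confirmation
print(act_or_ask("building", time.time() - 3.15e7, risk=0.9))  # -> act
```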

――What do you see as the technical complexities of implementing Embodied RAG in enterprise environments and the key challenges to overcome for practical deployment?

Professor Bisk: Implementation in enterprise environments is extremely complex; as the saying goes, "The devil is in the details." The biggest challenge in physical environments is that we cannot ignore the details of the real world, such as ambiguity, dynamic change, and multi-agent interaction. Practical challenges also include ensuring reliable power supply, overcoming local computing limitations, and mitigating communication failures. For example, in a factory there are dozens of cameras and diverse robots, and managing all these components through a single centralized database is impractical. I believe the preferable direction is distributed cooperative memory, where each agent has its own memory and shares information "selectively" as needed. This alleviates the requirements for full communication and continuous power supply, and also contributes to privacy protection. We should aim for systems that support human work through a human-centered approach.

Maximizing AI Agent "Quality" with "Adaptive LLM Routing"

With numerous LLMs now available, the technology enabling AI agents to select and route to the optimal model based on user intent is crucial for ensuring quality. Particularly when performance metrics are contractually defined, such as in Service Level Agreements (SLAs), ensuring the system operates without violating these agreements presents a significant challenge. In this chapter, we interviewed researcher Chaitanya Devaguptapu, author of the paper on the groundbreaking approach "Adaptive LLM Routing" [8], about its technical innovations.

In this segment, Dr. Kobashi discusses the technical innovations of "Adaptive LLM Routing" with Mr. Chaitanya Devaguptapu (Fujitsu Research of India Private Limited).

――Could you explain the key points of this technology and the challenges it addresses?

Chaitanya: Adaptive LLM Routing is a system that automatically selects the most suitable large language model (LLM) for a given user query and continuously improves its decision-making based on user feedback. For example, a lightweight model is sufficient for simple questions like “What are your business hours?” in a customer service chatbot, whereas a model with advanced reasoning capabilities is essential for complex tasks such as comparing multiple smartphone models.
The main challenge was how to make this optimal selection both automatic and efficient. To solve this, we reframed the problem as a “bandit learning” task, enabling the system to learn and improve without having to exhaustively evaluate every model’s performance.
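To illustrate the bandit framing (as a generic sketch, not the paper's specific algorithm), an epsilon-greedy router might look like the following; the model names and reward signal are hypothetical:

```python
# Generic epsilon-greedy contextual-bandit sketch of LLM routing. This
# illustrates the framing only; it is not the specific algorithm of the
# Adaptive LLM Routing paper [8], and the model names are hypothetical.
import random
from collections import defaultdict

MODELS = ["small-fast", "large-reasoning"]
value = defaultdict(float)   # estimated reward per (context, model) arm
count = defaultdict(int)

def route(context: str, epsilon: float = 0.1) -> str:
    if random.random() < epsilon:                              # explore
        return random.choice(MODELS)
    return max(MODELS, key=lambda m: value[(context, m)])      # exploit

def update(context: str, model: str, reward: float) -> None:
    """Bandit feedback: learn only from the arm that was actually chosen."""
    key = (context, model)
    count[key] += 1
    value[key] += (reward - value[key]) / count[key]           # incremental mean

# e.g. the user rates the chosen model's answer as "good":
chosen = route("simple_faq")
update("simple_faq", chosen, reward=1.0)
```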

Conceptual diagram of Adaptive LLM Routing

―― In real-world LLM deployments, constraints like budget, response speed, and quality often overlap. How do you adaptively manage these challenges, and what technical approach do you use?

Chaitanya: This is one of the biggest hurdles in practical system operations. It requires a balanced strategy that optimizes multiple objectives simultaneously, rather than focusing solely on cost or quality.
Our solution is an approach called “online cost policy.” For example, we divide 10,000 queries into 100 groups and assign a budget to each group. Any unused budget from one group rolls over to the next. This flexibility allows the system to keep costs low for simple queries while allocating more resources to complex queries where quality is critical.
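A rough sketch of this budget rollover, mirroring the 10,000-queries-in-100-groups example above, might look as follows. The budget figure and the rule for when to escalate to the expensive model are assumptions for illustration:

```python
# Sketch of the "online cost policy": split the stream into groups, give each
# a budget, and roll unused budget forward. The group counts mirror the
# example above; the budget figure and escalation rule are assumptions.
TOTAL_QUERIES = 10_000
GROUPS = 100
GROUP_SIZE = TOTAL_QUERIES // GROUPS   # 100 queries per group
BUDGET_PER_GROUP = 50.0                # arbitrary currency units

def run(queries, cost_of, needs_quality, answer_cheap, answer_expensive):
    carryover = 0.0
    for g in range(GROUPS):
        budget = BUDGET_PER_GROUP + carryover
        spent = 0.0
        for q in queries[g * GROUP_SIZE:(g + 1) * GROUP_SIZE]:
            expensive_cost = cost_of(q, "expensive")
            # Escalate only while this group's (rolled-over) budget allows it.
            if needs_quality(q) and spent + expensive_cost <= budget:
                answer_expensive(q)
                spent += expensive_cost
            else:
                spent += cost_of(q, "cheap")
                answer_cheap(q)
        carryover = budget - spent     # unused budget rolls to the next group
```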

―― In real-world environments, we rarely have complete information about model performance or perfect evaluations. How does the system adapt to changing user preferences and needs? What role does bandit feedback play?

Chaitanya: That challenge is exactly why we reframed routing as a contextual bandit problem. Traditional supervised learning requires a “correct answer” for every instance, which is costly and inflexible when user needs change. With bandit feedback, the system learns only from its own choices. For example, if the router selects a high-performance model and the user rates the response as “good,” the system learns that decision was correct—without needing to compare other models. This approach enables practical, adaptive learning in dynamic real-world environments where user preferences constantly shift.

―― When routing, deciding whether to use an expensive, high-performance model or a more affordable one is critical. How does the system strike the right balance between quality and cost?

Chaitanya: We achieve this using a technique called shared embedding space, which maps both the query and the LLM into the same representation. The system learns that the distance between a query and a model reflects how well they match. For instance, if users consistently prefer high-performance models for complex reasoning tasks, the system positions those models closer to such queries. This allows the system to accurately gauge query complexity and select the most suitable model—balancing quality and cost intelligently.
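The toy sketch below captures the shared-embedding intuition: queries and models live in the same vector space, the nearest model wins, and feedback nudges a model's embedding toward queries it handled well. The two-dimensional embeddings and update rule are illustrative, not the paper's actual training procedure:

```python
# Toy shared-embedding routing: queries and models live in one vector space
# and smaller distance means a better match. The 2-D embeddings and update
# rule are illustrative, not the actual training procedure of [8].
import numpy as np

model_emb = {
    "small-fast":      np.array([1.0, 0.0]),   # near "simple" queries
    "large-reasoning": np.array([0.0, 1.0]),   # near "complex" queries
}

def route(query_emb: np.ndarray) -> str:
    """Pick the model whose embedding is closest to the query's."""
    return min(model_emb, key=lambda m: np.linalg.norm(model_emb[m] - query_emb))

def feedback_update(query_emb: np.ndarray, chosen: str, good: bool, lr: float = 0.1) -> None:
    """Pull the chosen model toward queries it handled well; push it away otherwise."""
    step = lr * (query_emb - model_emb[chosen])
    model_emb[chosen] += step if good else -step

# A complex comparison query lands near the reasoning model:
print(route(np.array([0.2, 0.9])))  # -> "large-reasoning"
```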

―― Finally, what are the main challenges for widespread adoption of adaptive routing in enterprise environments? Are there technical hurdles or considerations companies should keep in mind when moving from static model selection to dynamic routing?

Chaitanya: Several challenges stand out. Current research focuses mainly on single-turn conversations, but real-world enterprise use cases often involve multi-turn dialogues, making context retention a future priority. On the positive side, routing decisions are fast—taking only 0.065 to 0.239 seconds—so computational overhead is minimal. Another strength is adaptability: the system can automatically adjust to changes in query patterns caused by seasonal trends or new product launches. For enterprises, the biggest shift is mindset—moving away from fixed costs and embracing intelligent, dynamic budget allocation.

FieldWorkArena Benchmark: Guiding Practical Evaluation of AI Agents

The technologies introduced above offer promising solutions for three critical gaps: collaboration, memory, and quality. However, their effectiveness must be demonstrated through rigorous benchmark verification aligned with real-world use cases. Existing benchmarks typically focus on narrow tasks and fail to adequately address the complexity, integration needs, and security requirements of enterprise AI agent deployment.

Therefore, collecting enterprise benchmarks through cross-company collaboration is essential. By pooling the knowledge of researchers, business practitioners, and technology developers, we can build an evaluation framework reflecting real-world challenges and enable effective comparisons across different technical approaches.

Fujitsu has developed "FieldWorkArena" [9], a comprehensive benchmark suite for evaluating AI agents in real-world enterprise tasks. This benchmark provides standardized evaluation metrics for collaboration, memory, and quality.

Workshop Announcement: Co-Creating the Future of AI Agents

In response to the growing demand for enterprise benchmarks, Fujitsu, Carnegie Mellon University (CMU), and Keio University will host the workshop "Agentic AI Benchmarks and Applications for Enterprise Tasks" [10] at AAAI (the Annual AAAI Conference on Artificial Intelligence), a renowned international conference celebrating its 40th anniversary. The workshop aims to bridge the gap between cutting-edge agent-based AI research and practical enterprise needs, fostering the discussion and collaboration necessary to build robust, efficient, and reliable agent-based AI technologies for complex, dynamic enterprise operations.

At this workshop, researchers, industry practitioners, and technology developers will gather to discuss various benchmarks covering diverse business processes in enterprise environments. Case studies applying agent-based AI research to real-world operations will also be shared. This collaboration aims to create a community dedicated to developing, standardizing, and evolving enterprise benchmarks—a shared resource that grows with the field’s progress.
You can register via the official website [11] to participate either in-person or online (audience only). We sincerely look forward to your participation.

Related Links