A Hierarchical Architecture Achieves Up to 475× Higher Output Tokens per GPU than Transformer

June 24, 2026

Advancing Large Language Model Efficiency: An Architecture for Reduced Operational Costs


Fujitsu Limited has developed Parallel Hierarchical Operation for TOp-down Networks (hereinafter PHOTON), an architecture designed to substantially reduce the operational cost of large language models (LLMs). PHOTON delivers up to 475 times the multi-query throughput per GPU (Note 1) of the Transformer, the dominant backbone architecture in contemporary LLMs. Combined with multi-query integration, this throughput advantage lets PHOTON deliver higher output quality than conventional Transformer-based systems while using fewer GPU resources.

Background

Recent advances have demonstrated that letting generative AI models reason more deeply at inference time improves output quality, and deployment of such approaches is accelerating. However, the prevailing Transformer architecture has a fundamental limitation: as input sequences grow longer or the number of concurrent queries increases, the memory accesses required to retain past context also increase, and processing speed degrades. This bottleneck is especially pronounced in long-document processing and in high-concurrency serving with many simultaneous users.
PHOTON addresses this challenge by enabling efficient, low-cost handling of workloads that require multiple simultaneous input-output streams — such as multi-agent pipelines — thereby contributing to reductions in GPU resource consumption.
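
The memory pressure described above can be made concrete with a rough estimate of the key-value cache of a standard decoder-only Transformer. This is a minimal sketch and not a description of PHOTON: the function name and the configuration values (24 layers, 16 key-value heads, head dimension 128, 2 bytes per value) are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, num_sequences, bytes_per_value=2):
    """Approximate KV-cache size of a standard decoder-only Transformer."""
    # Factor of 2: each layer stores one key and one value vector per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    # The cache grows linearly with context length and with concurrent sequences.
    return per_token * seq_len * num_sequences


# Hypothetical model configuration, for illustration only.
single_user = kv_cache_bytes(num_layers=24, num_kv_heads=16, head_dim=128,
                             seq_len=4096, num_sequences=1)
# Four times the context length and eight concurrent sequences.
busy_server = kv_cache_bytes(num_layers=24, num_kv_heads=16, head_dim=128,
                             seq_len=4 * 4096, num_sequences=8)

print(single_user / 2**30, "GiB for one sequence")
print(busy_server // single_user, "times larger under load")
```

Under these assumed values, the cache that must be read at every generation step grows in proportion to both context length and the number of concurrent sequences, which is the scaling behavior the bottleneck above refers to.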

Key Features of the Developed Technology

Figure 1. (1) Reduces computational resource requirements by aggregating input tokens and processing them as semantic units. (2) Improves the accuracy of final responses by decomposing an input question into multiple distinct query variants and integrating their outputs.

(1) PHOTON Architecture: Hierarchical Processing at the Semantic-Unit Level, Not the Token Level

Standard Transformer architectures decompose text into short fragments of a few characters each, known as tokens, and compute attention across every pair of tokens in the sequence. PHOTON, by contrast, groups text into meaningful semantic units and processes them hierarchically, substantially reducing the overall computational load. Processing multiple sequences in parallel further allows PHOTON to achieve up to 475× higher multi-query throughput per GPU.
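
To give a feel for why grouping tokens reduces work, the following sketch counts attention pairs under a deliberately simplified two-level scheme. The scheme is an assumption made for illustration: fixed-size units, causal attention among tokens inside each unit, and causal attention among one summary per unit. It is not PHOTON's actual design and does not reproduce the 475× figure; the function names and the unit size of 16 are hypothetical.

```python
def token_level_pairs(num_tokens):
    """Causal attention: each token attends to itself and every earlier token."""
    return num_tokens * (num_tokens + 1) // 2


def two_level_pairs(num_tokens, unit_size):
    """Attention pairs for an assumed two-level hierarchy (illustrative only)."""
    full_units, remainder = divmod(num_tokens, unit_size)
    num_units = full_units + (1 if remainder else 0)
    # Lower level: causal attention restricted to tokens within the same unit.
    local = full_units * token_level_pairs(unit_size) + token_level_pairs(remainder)
    # Upper level: causal attention among one summary vector per unit.
    top = token_level_pairs(num_units)
    return local + top


flat = token_level_pairs(4096)
grouped = two_level_pairs(4096, unit_size=16)
print(flat, grouped, round(flat / grouped, 1))
```

In this toy setting, a 4,096-token sequence needs roughly two orders of magnitude fewer attention pairs when grouped into units of 16 tokens. The actual savings of any real architecture depend on its design details, which this sketch does not model.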

(2) Multi-Query Integration Technology: Enhancing Performance through Output Aggregation

Multi-query integration is a technique that generates multiple slightly varied queries from the same input problem, obtains a response for each, and aggregates these responses to determine the final answer. With PHOTON, the responses are aggregated by majority voting or best-of-N selection, yielding more stable, higher-quality results than a single inference run.
Numerical experiments confirm that, at model scales of 600M, 900M, and 1.2B parameters, PHOTON achieves higher generation throughput than a conventional Transformer while maintaining a lower memory footprint. In particular, the 1.2B-parameter model delivers approximately 475× the multi-query throughput of a conventional Transformer baseline, at the cost of only a modest reduction in generation quality. Furthermore, because PHOTON uses substantially less KV cache per generation, multiple generation results can be obtained in parallel within the same GPU memory budget. In validation experiments, aggregating as few as 9 queries was sufficient to match the output quality of a conventional Transformer.
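
The two aggregation strategies named above can be sketched in a few lines. This is a minimal illustration, not Fujitsu's implementation: the function names, the nine sample responses, and the scoring table (a stand-in for whatever verifier or reward model a real system would use) are all hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers, score):
    """Return the answer ranked highest by a caller-supplied scoring function."""
    return max(answers, key=score)


# Nine hypothetical responses, one per query variant of the same question.
responses = ["42", "41", "42", "42", "40", "42", "41", "42", "43"]
final_by_vote = majority_vote(responses)

# Hypothetical scores standing in for a verifier or reward model.
toy_scores = {"40": 0.1, "41": 0.3, "42": 0.9, "43": 0.2}
final_by_score = best_of_n(responses, score=toy_scores.get)

print(final_by_vote, final_by_score)
```

Majority voting needs no extra model but only works when answers can be compared for equality, whereas best-of-N selection handles free-form text at the cost of requiring a scorer.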

Looking Ahead

These research findings are scheduled to be presented in an oral session at The 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), a premier venue for natural language processing research, to be held in San Diego, USA, from 2 July. Going forward, Fujitsu will work to improve the efficiency of generative AI systems that currently demand substantial GPU resources, reducing their energy consumption and cost and thereby advancing both the environmental and commercial sustainability required to meet the world's surging demand for AI.