A Hierarchical Architecture Achieves Up to 475× Higher Output Tokens per GPU than Transformer
June 24, 2026
Advancing Large Language Model Efficiency: An Architecture for Reduced Operational Costs
Fujitsu Limited has developed Parallel Hierarchical Operation for TOp-down Networks (hereinafter PHOTON), an architecture designed to substantially reduce the operational cost of large language models (LLMs). PHOTON delivers up to 475 times the multi-query throughput per GPU (Note 1) of the Transformer, the dominant backbone architecture in contemporary LLMs. By combining this throughput advantage with multi-query integration, PHOTON enables higher output quality than conventional Transformer-based systems while requiring fewer GPU resources.
Background
Recent advances have shown that letting generative AI models reason more deeply at inference time improves output quality, and deployment of such approaches is accelerating. However, the prevailing Transformer architecture has a fundamental limitation: as input sequences grow longer or the number of concurrent queries increases, the memory accesses needed to retain past context also increase, and processing speed degrades. This bottleneck is especially pronounced in long-document processing and in high-concurrency serving with many simultaneous users.
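To make this growth concrete, the following is a rough sketch of how the key-value (KV) cache of a standard Transformer decoder scales with sequence length and with the number of concurrent sequences. The model dimensions (24 layers, 16 KV heads, head dimension 64, 16-bit values) are hypothetical and are not taken from the experiments reported here; the formula is the usual back-of-the-envelope estimate and ignores implementation details such as paging or quantization.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, num_sequences, bytes_per_value=2):
    """Approximate KV-cache size for a standard Transformer decoder.

    Every token of every sequence stores one key vector and one value
    vector per layer and per KV head.
    """
    # The factor 2 accounts for storing both keys and values.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * num_sequences


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    layers, heads, dim = 24, 16, 64
    for seq_len, n_seq in [(1024, 1), (8192, 1), (8192, 64)]:
        size = kv_cache_bytes(layers, heads, dim, seq_len, n_seq)
        print(f"seq_len={seq_len}, sequences={n_seq}: "
              f"{size / 2**30:.2f} GiB")
```

Under these assumptions the cache grows linearly in both the sequence length and the number of concurrent sequences, and all of it must be read back for every newly generated token.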
PHOTON addresses this challenge by enabling efficient, low-cost handling of workloads that require multiple simultaneous input-output streams — such as multi-agent pipelines — thereby contributing to reductions in GPU resource consumption.
Key Features of the Developed Technology
(1) PHOTON Architecture: Hierarchical Processing at the Semantic-Unit Level, Not the Token Level
Standard Transformer architectures decompose text into short fragments of a few characters each, known as tokens, and compute attention across every pair of tokens in the sequence. PHOTON, by contrast, treats text as meaningful semantic groups and processes them hierarchically, substantially reducing overall computational load. Processing multiple sequences in parallel further allows PHOTON to achieve up to 475× higher multi-query throughput per GPU.
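As a rough illustration of why grouping tokens can reduce computation, the sketch below counts attention pairs for full token-level attention and for a generic two-level scheme in which tokens attend only within fixed-size groups and group summaries attend to each other. This two-level scheme, the fixed group size, and the function names are assumptions made for illustration only; PHOTON's actual hierarchy is not specified in this release and may differ substantially.

```python
def full_attention_pairs(num_tokens):
    """Token pairs scored when every token attends to every token."""
    return num_tokens ** 2


def two_level_pairs(num_tokens, group_size):
    """Pairs scored by a generic two-level scheme (illustrative only).

    Local level: tokens attend to tokens in the same group.
    Global level: one summary per group attends to all group summaries.
    """
    full_groups, remainder = divmod(num_tokens, group_size)
    num_groups = full_groups + (1 if remainder else 0)
    local = full_groups * group_size ** 2 + remainder ** 2
    global_level = num_groups ** 2
    return local + global_level


if __name__ == "__main__":
    n, g = 4096, 16  # hypothetical sequence length and group size
    full = full_attention_pairs(n)
    hier = two_level_pairs(n, g)
    print(f"full: {full}, two-level: {hier}, ratio: {full / hier:.0f}x")
```

The pair count is only a proxy for compute; real speedups also depend on memory traffic, batching, and hardware utilization.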
(2) Multi-Query Integration Technology: Enhancing Performance through Output Aggregation
Multi-query integration is a technique that generates multiple slightly varied queries from the same input problem, obtains a response for each, and aggregates these responses to determine the final answer. With PHOTON, results are aggregated via majority voting or best-of-N selection, achieving more stable and higher-quality performance from a single inference run.
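The two aggregation strategies named above can be sketched as follows. This is a minimal illustration, not Fujitsu's implementation: the function names are hypothetical, answers are assumed to be directly comparable strings, and the scoring function for best-of-N is a stand-in for whatever verifier or reward model a real system would use.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among the generated responses.

    Ties are resolved in favor of the answer that appeared first.
    """
    if not answers:
        raise ValueError("at least one answer is required")
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates, score_fn):
    """Return the candidate with the highest score under score_fn.

    score_fn is a placeholder for an external quality estimate.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=score_fn)


if __name__ == "__main__":
    # Hypothetical final answers from five varied queries.
    answers = ["42", "41", "42", "42", "40"]
    print(majority_vote(answers))

    # Toy scorer that simply prefers longer responses.
    print(best_of_n(["a", "abc", "ab"], score_fn=len))
```

Majority voting suits tasks with a short, checkable final answer, while best-of-N applies when responses are free-form and a separate scorer is available.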
Numerical experiments confirm that, across model scales of 600M, 900M, and 1.2B parameters, PHOTON achieves higher generation throughput than a conventional Transformer while maintaining a lower memory footprint. In particular, the 1.2B-parameter model delivers approximately 475× the multi-query computational capacity of a conventional Transformer baseline, at the cost of only a modest reduction in generation quality. Furthermore, because PHOTON's per-generation KV-cache usage is substantially smaller, multiple generation results can be obtained in parallel within the same GPU memory budget. In validation experiments, aggregating as few as 9 queries was sufficient to match the output quality of a conventional Transformer.
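The link between per-generation cache size and the number of parallel generations can be sketched with simple arithmetic. The memory figures below are hypothetical placeholders, not measurements from these experiments, and the estimate ignores activations and other runtime overhead.

```python
def max_parallel_generations(gpu_memory_bytes, model_weight_bytes,
                             cache_bytes_per_generation):
    """Generations that fit in the memory left after loading the weights."""
    free = gpu_memory_bytes - model_weight_bytes
    if free <= 0 or cache_bytes_per_generation <= 0:
        return 0
    return free // cache_bytes_per_generation


if __name__ == "__main__":
    GIB = 2 ** 30
    gpu, weights = 80 * GIB, 4 * GIB  # hypothetical GPU and model sizes
    # Hypothetical caches: 1 GiB vs 128 MiB per generation.
    for cache in (GIB, GIB // 8):
        n = max_parallel_generations(gpu, weights, cache)
        print(f"{cache / GIB:.3f} GiB per generation -> {n} in parallel")
```

In this toy setting, shrinking the per-generation cache by a factor of eight allows eight times as many generations to run side by side, which is what makes aggregating several queries affordable.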
Looking Ahead
These research findings are scheduled to be presented in an oral session at the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), a premier venue for natural language processing research, to be held in San Diego, USA, from 2 July. Going forward, Fujitsu will work to improve the efficiency of generative AI systems that currently demand substantial GPU resources, reducing their energy consumption and cost and thereby advancing both the environmental and commercial sustainability required to meet the world's surging demand for AI.
Trademarks
All product names and other proper nouns mentioned herein are trademarks or registered trademarks of their respective owners.
Note
- (Note 1) Multi-query throughput: throughput (output tokens per second) per unit of GPU resource.