Edinburgh Uni AI Breakthrough Makes Models 10x Faster

Researchers at the University of Edinburgh claim to have made a software breakthrough that could allow future AI models to process information ten times faster than today’s systems.

Working with the world’s largest computer chip, called a wafer-scale chip (roughly the same size as a dinner plate), the team developed an approach for trained LLMs to draw conclusions from new data in a much more efficient way.

This process, known as inference, currently runs on GPUs, the backbone of today's AI economy and the hardware that handles most of AI's everyday applications once models have been trained on vast amounts of data.
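
As a concrete illustration of what inference means in practice, the sketch below generates text from a small pretrained model using the open-source Hugging Face transformers library. It is not connected to the Edinburgh work, and the model name is an arbitrary assumption chosen for the example:

```python
# Illustrative only: minimal LLM inference with Hugging Face transformers.
# The model name is an assumed example, unrelated to the Edinburgh study.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-0.5B"  # assumed small model for demonstration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Wafer-scale chips are", return_tensors="pt")

# Inference: the trained model draws conclusions from new input,
# here by predicting the continuation one token at a time.
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```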

However, wafer-scale chips differ from these typical AI chips not only in terms of size but also in how they operate. The larger chips are designed to carry out many computational tasks at once, within a single chip, aided by massive on-chip memory.

With all the computation taking place on the same piece of silicon, data moves between different parts of the chip much faster than if it had to travel between separate groups of chips and memory via a network.

By integrating hundreds of thousands of computation cores working in parallel, these next-gen chips excel at the kind of mathematical operations that underpin neural networks, the foundation of LLM technology.
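
To make the idea of massed parallel cores concrete, here is a minimal sketch, in plain NumPy and purely illustrative (it is not the chip's actual programming model), of how the matrix-vector products at the heart of a neural network layer can be split into independent pieces, one per core:

```python
import numpy as np

# Illustrative sketch: splitting a matrix-vector product (the core
# operation in neural networks) across many cores. Each "core" here is
# simulated by one row-block of the weight matrix; on a wafer-scale chip,
# hundreds of thousands of cores would each hold a block in local
# on-chip memory and compute their piece in parallel.
num_cores = 4                      # stand-in for hundreds of thousands
W = np.random.randn(8, 8)          # weight matrix of one layer
x = np.random.randn(8)             # input activations

row_blocks = np.array_split(W, num_cores, axis=0)      # one block per core
partials = [block @ x for block in row_blocks]         # parallel on hardware
y = np.concatenate(partials)                           # assemble the output

assert np.allclose(y, W @ x)       # matches the unpartitioned product
```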

These advantages have sparked growing interest in how wafer-scale chips could enhance AI inference, but the hardware promise comes with software challenges.

To function properly, wafer-scale chips require different software from that used in today's AI systems: software that can intelligently run an AI model while coordinating enormous parallel computations and data movement across a huge number of processing cores.

Hoping to solve the problem, the Edinburgh team developed software called WaferLLM, designed specifically for wafer-scale chips to get the most out of their parallel processing and on-chip memory while keeping latency low.
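
The coordination task such software faces can be sketched in the abstract. The toy example below (a hypothetical illustration, not WaferLLM's actual design) places a model's layers onto a small 2-D grid of cores in a snake pattern, so that activations only ever travel between neighbouring cores:

```python
# Hypothetical sketch of the placement problem: map model layers onto a
# 2-D grid of cores so data mostly moves between neighbours. This is NOT
# WaferLLM's actual design, just an illustration of the coordination task.
grid_w, grid_h = 4, 4                      # assumed 4x4 core grid
cores = []
for x in range(grid_w):                    # snake (boustrophedon) order
    ys = range(grid_h) if x % 2 == 0 else reversed(range(grid_h))
    cores.extend((x, y) for y in ys)

layers = [f"layer_{i}" for i in range(len(cores))]
placement = dict(zip(layers, cores))       # one layer shard per core

def hops(a, b):
    """Manhattan distance between two cores: a proxy for data-movement cost."""
    (ax, ay), (bx, by) = placement[a], placement[b]
    return abs(ax - bx) + abs(ay - by)

# With the snake placement, every consecutive pair of layers sits on
# neighbouring cores, so each activation transfer costs a single hop.
total = sum(hops(layers[i], layers[i + 1]) for i in range(len(layers) - 1))
print(f"total hops along the pipeline: {total}")   # 15, i.e. one hop each
```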

To evaluate the software, the team carried out a series of tests at EPCC, the UK's National Supercomputing Centre based at the University, measuring how the wafer-scale chips performed when running several LLMs, such as LLaMA and Qwen.

The team claims to have observed a dramatic acceleration in inference, with the LLM responding to queries ten times faster, a tenfold improvement in latency, compared with a cluster of sixteen GPUs.
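
A comparison like this is typically made by timing the same query end-to-end on each system and averaging over repeats. The sketch below is a generic illustration with invented stand-ins, not the team's benchmark harness:

```python
import time

def measure_latency(run_model, prompt, repeats=10):
    """Average end-to-end time for one query, in seconds."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_model(prompt)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

# Dummy stand-ins for the two systems under test (figures are invented):
gpu_cluster = lambda prompt: time.sleep(0.010)   # pretend GPU-cluster query
wafer_chip = lambda prompt: time.sleep(0.001)    # pretend wafer-scale query

gpu = measure_latency(gpu_cluster, "hello")
wafer = measure_latency(wafer_chip, "hello")
print(f"speedup: {gpu / wafer:.1f}x")            # roughly 10x for these dummies
```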

The wafer-scale chips were also found to have considerable energy efficiency benefits. When operating at scale, the chips could be up to two times more energy-efficient in running LLMs compared with GPUs.
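
For clarity, "two times more energy-efficient" means doing the same work for half the energy, i.e. twice the tokens generated per joule. A tiny worked example, with figures invented for illustration rather than taken from the study:

```python
# Worked example of the energy-efficiency claim. All figures are assumed
# for illustration; none come from the study.
gpu_joules_per_token = 2.0      # assumed energy cost per generated token
wafer_joules_per_token = 1.0    # assumed energy cost per generated token

# Efficiency = useful work per unit energy, i.e. tokens per joule:
ratio = (1 / wafer_joules_per_token) / (1 / gpu_joules_per_token)
print(f"{ratio:.1f}x more energy-efficient")     # -> 2.0x
```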

“Wafer-scale computing has shown remarkable potential, but software has been the key barrier to putting it to work,” said Dr Luo Mai, lead researcher on the project from the University of Edinburgh’s School of Informatics.

“With WaferLLM, we show that the right software design can unlock that potential, delivering real gains in speed and energy efficiency for large language models. This is a step toward a new generation of AI infrastructure – one that can support real-time intelligence in science, healthcare, education, and everyday life.”

Source: DIGIT
