Groq Positions Inference as AI’s Core Engine in Post-Training Era
Moazami argues that inference—not training—will define the AI industry’s infrastructure, economics, and global competition
AI is entering a new phase—one where real-time inference, not massive model training, becomes the dominant force driving growth, infrastructure, and returns. Mohsen Moazami, President (International) at Groq, argues that the industry’s next chapter will belong to those who can deliver low-latency AI at the lowest cost.
“What the emergence of DeepSeek highlighted for all of us was the fact that inference is going to be a massive market, a large and growing market, and far larger than training,” Moazami said.
He noted that training large language models (LLMs) is the phase where “a lot of capex, a lot of money is spent” building the model. Inference, by contrast, is the execution phase, when models are used in production. “If you spend money on training, you make money on inference.”
The shift is about more than just workloads. It’s also about speed.
“In training, run times are weeks; the time frames are much longer. In inference, it's a matter of milliseconds,” Moazami said. “Ultimately, all the agentic things that you've heard about here, or read about, or practice every day: if there is no speed, no broadband, could you and I use Uber? Uber would be unusable. The advent of broadband enabled us to use applications like Deliveroo, DoorDash, and Uber.”
GPUs vs LPUs
Moazami, a Cisco veteran and founder of multiple companies, represents Groq, a U.S.-based AI infrastructure firm built specifically for inference. He argues that the company’s technology addresses a fundamental mismatch in today’s AI hardware landscape.
“GPUs are the best for training; they are perfect,” he said. “We tell every customer to go buy Nvidia for training.”
But when it comes to inference, GPUs show their limits.
“Low-latency inference is sequential. You cannot parallelize in inference,” he explained. “Ours is called LPU, language processing unit, and we think—and this is what we're arguing, and luckily, a portion of the market believes in what we're doing—LPUs are a better architecture for inference.”
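As a rough illustration of that sequential dependency, the sketch below shows why autoregressive decoding cannot produce token t+1 before token t exists, so per-token latency adds up instead of being hidden by batch parallelism the way it is in training. The `model_forward` function is a placeholder standing in for a real model, not Groq's or any vendor's actual API.

```python
# Minimal sketch of why low-latency inference is sequential: in autoregressive
# decoding, each new token depends on everything generated so far, so the steps
# cannot run in parallel. `model_forward` is a placeholder, not a real LLM call.

def model_forward(tokens: list[int]) -> int:
    """Placeholder: returns the next token given the sequence so far."""
    return (sum(tokens) * 31 + 7) % 50_000  # stand-in for a real model step

def generate(prompt: list[int], n_new: int) -> list[int]:
    tokens = list(prompt)
    for _ in range(n_new):
        next_token = model_forward(tokens)  # depends on ALL previous tokens
        tokens.append(next_token)           # so per-token latency compounds
    return tokens

print(generate([1, 2, 3], n_new=5))
```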
A significant technical difference lies in how data is handled. Traditional GPU systems rely on HBM (high-bandwidth memory), which creates multiple bottlenecks. “All these tokens have to leave the chip, go into HBM, get processed, get back in, and then transfer. All of this causes latency. And what else does it cause? Power consumption.”
He continued, “Look at some of Nvidia's announcements at GTC last month in San Jose, California. The power consumption of these units is becoming unbearably high.”
Groq’s LPU architecture avoids those pitfalls by eliminating reliance on HBM. “We don't use any HBM. High bandwidth memory has two issues. One, as I told you, it causes latency. But most importantly, the entire world's supply is made by two companies, Samsung and SK Hynix in South Korea, and the entire allocation of high bandwidth memory is sold out for the next two years. To whom? Nvidia.”
Groq instead uses SRAM (static random-access memory) directly on its chip.
“The data never leaves the chip. The more memory you need, the more chips you deploy, and all of this is done synchronously and deterministically, which yields high throughput and low cost. And again, power consumption goes down because you're not having to deal with high bandwidth memory.”
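One way to see why keeping data on-chip matters is a back-of-envelope estimate: if a single-stream decode step is memory-bandwidth-bound, per-token latency is roughly the model's weight footprint divided by effective bandwidth. Every figure below is an illustrative assumption, not a Groq or Nvidia specification.

```python
# Back-of-envelope: if each decode step must stream the model's weights through
# memory, per-token latency is roughly weight_bytes / bandwidth.
# All numbers below are illustrative assumptions, not vendor specifications.

def ms_per_token(params_billion: float, bytes_per_param: float, bandwidth_tb_s: float) -> float:
    weight_bytes = params_billion * 1e9 * bytes_per_param
    seconds = weight_bytes / (bandwidth_tb_s * 1e12)
    return seconds * 1e3

# Hypothetical 70B-parameter model at 8-bit weights.
for label, bw in [("off-chip HBM, ~3 TB/s", 3.0),
                  ("aggregate on-chip SRAM across many chips, ~80 TB/s", 80.0)]:
    print(f"{label}: ~{ms_per_token(70, 1.0, bw):.1f} ms/token")
```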
The Founder
Moazami’s remarks came during a keynote at AI Rush, a high-profile industry gathering held in London on May 16. His speech followed a day of discussions around generative AI, compute infrastructure, and the economics of scaling large models.
Groq was founded in 2016 by Jonathan Ross, the engineer who created Google’s original TPU (Tensor Processing Unit). Since its founding, Groq has focused entirely on inference, building a platform capable of running large open-source models at unprecedented speed and efficiency.
“Our language processing unit executes things synchronously and deterministically,” Moazami told the audience. “We're dealing in milliseconds.”
That performance is translating into adoption. “This speed, with zero marketing budget, has taken our platform, GroqCloud, from zero developers to more than 1.5 million registered developers building their applications on top of GroqCloud today.”
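For context, building on GroqCloud typically means calling its OpenAI-compatible chat-completions endpoint. The sketch below assumes that interface; the model name and environment variable are illustrative, so check Groq's current documentation.

```python
# Sketch of calling GroqCloud through its OpenAI-compatible endpoint.
# Endpoint, model name, and env var are assumptions based on Groq's public
# docs; verify against the current documentation before relying on them.
import os
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # illustrative model name
    messages=[{"role": "user",
               "content": "In one sentence, why does inference latency matter?"}],
)
print(response.choices[0].message.content)
```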
Groq recently announced it had been selected as the official inference provider for HUMAIN, a Saudi-backed AI company operating across the entire value chain. As part of the initiative, Groq opened a data center in Dammam earlier this year.
“Over there, we're dealing with a power cost of three cents per kilowatt-hour, and I hate to say it, but when you come to the UK, you're dealing with 30 cents per kilowatt-hour. You cannot be competitive with that level of difference.”
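A quick back-of-envelope shows how that tenfold electricity gap flows straight into the cost of serving tokens. The energy-per-token figure below is an assumed illustration, not a measured value for any real deployment.

```python
# Rough electricity cost per million output tokens at different power prices.
# The energy-per-token figure is an assumption for illustration only.

JOULES_PER_TOKEN = 0.5                     # assumed energy per generated token
KWH_PER_TOKEN = JOULES_PER_TOKEN / 3.6e6   # 1 kWh = 3.6 MJ

for region, usd_per_kwh in [("Dammam (~$0.03/kWh)", 0.03),
                            ("UK (~$0.30/kWh)", 0.30)]:
    cost = KWH_PER_TOKEN * usd_per_kwh * 1_000_000
    print(f"{region}: ~${cost:.4f} of electricity per million tokens")
```

Whatever the true energy per token, the ratio is what matters: at ten times the power price, the electricity component of every token served is ten times more expensive.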
Energy Constraints
Moazami argued that the availability and cost of power are now central factors in AI infrastructure planning.
“Jensen Huang, the CEO and founder of Nvidia, has said over and over again in conferences and interviews that the industry is power-limited. The growth of this sector is now limited by power and the optionality of power.”
In his view, the U.S. and UK are already “hitting a concrete wall in terms of access and availability of cheap power.”
Governments across Europe are exploring solutions. “Now you see a lot of investments here and there, in France, in nuclear,” he said. “This again speaks to how large these things are.” Norway, too, was cited as an emerging player, thanks to its energy strategy.
Moazami emphasized that the growth of AI is outpacing infrastructure in many countries. “Look how much is spent on data centers and power,” he said. “This is by far the largest thing that I have lived through and experienced as an individual.”
A New Compute Economy
Beyond power and architecture, Groq’s broader thesis is that a new economic model is emerging—one defined not by software licensing but by token generation.
“All apps will be AI,” Moazami said. “Even Satya Nadella, CEO of Microsoft, is on record predicting the end of SaaS (software as a service) as we know it. It's now agents as a service (AaaS).”
That change will shift the market toward ultra-efficient compute delivery. “You have to optimize on two dimensions. You have to be fast and you have to be cheap. These token factories are all about token production at the most economical unit and as fast as possible.”
Moazami, who has been involved in global discussions from Japan to Brazil, said the message is everywhere: speed and cost determine value.
“You've got to be fast to deliver the user experience and get answers to the questions people are asking. But you have to do it cheaply because every day billions and billions and billions of tokens are generated and consumed.”
As Groq scales further—with new data centers launching this month across North America—it positions itself not just as a chip company, but as a next-generation infrastructure layer.
“We have, as a company, hit an inflection point,” Moazami said.
And in a world where milliseconds matter and power is scarce, that inflection point might mark the start of a broader industry reckoning.