Cloudflare Infire LLM Inference Engine — Running AI at Global Edge | RECATOOLS

Cloudflare's New "Infire" Engine Runs Trillion-Parameter AI Models on Eight GPUs

DEVELOPER TOOLS · 3 May 2026

Key Takeaways

Cloudflare built a custom AI inference engine called Infire for running large language models across its global network
The engine uses "disaggregated prefill" — separating input processing and output generation onto different machines
New "Unweight" compression shrinks model weight files by 15–22% without measurable accuracy loss
Kimi K2.5, with over 1 trillion parameters, now runs on just eight H100 GPUs via Infire
Cloudflare operates data centres in five ASEAN cities, so developers in the region stand to get lower-latency inference as Infire rolls out

The Facts

Cloudflare has announced a significant expansion of its AI infrastructure with the release of technical details behind Infire, a custom inference engine designed to run large language models efficiently across its global edge network.

The engineering challenge Cloudflare is solving is substantial. Modern frontier models such as Moonshot AI's Kimi K2.5 contain over one trillion parameters and occupy approximately 560 GB of storage. Simply holding the model in GPU memory requires at least eight NVIDIA H100 GPUs (640 GB in total at 80 GB per card), before any inference computation begins. At that scale, efficiency is not optional: it is the difference between a commercially viable service and one that loses money on every request.
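The GPU count can be checked with a back-of-the-envelope calculation. The sketch below is illustrative only: it assumes the 80 GB H100 variant, and the 10% per-GPU reserve for KV cache and activations is a hypothetical figure, not one from Cloudflare. The function name is ours.

```python
import math

def gpus_needed(model_gb: float, gpu_memory_gb: float, reserved_fraction: float = 0.0) -> int:
    """Smallest GPU count whose usable memory can hold the model weights."""
    usable_gb = gpu_memory_gb * (1.0 - reserved_fraction)
    return math.ceil(model_gb / usable_gb)

MODEL_GB = 560  # Kimi K2.5 weight size quoted in the article
H100_GB = 80    # assumption: 80 GB H100 variant

# Weights only, with every byte of GPU memory given to the model.
weights_only = gpus_needed(MODEL_GB, H100_GB)

# Hypothetical 10% of each GPU held back for KV cache and activations.
with_headroom = gpus_needed(MODEL_GB, H100_GB, reserved_fraction=0.10)

print(weights_only, with_headroom)
```

Under these assumptions the weights alone would fill seven cards exactly, leaving nothing for the KV cache, so any realistic headroom pushes the count to eight.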

Alongside Infire, Cloudflare introduced Unweight, a compression system that reduces LLM weight file sizes by 15 to 22 percent without measurable accuracy degradation. Because GPUs have less data to load and transfer during inference, response times and per-request costs both fall. Cloudflare's engineering blog documents the full technical implementation for developers integrating these capabilities.
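To give a sense of scale, the sketch below applies the quoted 15–22% range to a 560 GB weight file. This is an illustration, not a reported result: the article does not say what Unweight achieves on Kimi K2.5 specifically, and the replica count of 100 is a hypothetical value.

```python
def compressed_size_gb(original_gb: float, reduction: float) -> float:
    """Size of a weight file after removing the given fraction of it."""
    return original_gb * (1.0 - reduction)

def fleet_savings_gb(original_gb: float, reduction: float, replicas: int) -> float:
    """Total data no longer stored or transferred across all replicas."""
    return (original_gb - compressed_size_gb(original_gb, reduction)) * replicas

ORIGINAL_GB = 560  # weight size quoted in the article
REPLICAS = 100     # hypothetical number of deployed copies

for reduction in (0.15, 0.22):
    size = compressed_size_gb(ORIGINAL_GB, reduction)
    saved = fleet_savings_gb(ORIGINAL_GB, reduction, REPLICAS)
    print(f"{reduction:.0%} reduction: {size:.1f} GB per copy, {saved:.0f} GB saved across {REPLICAS} replicas")
```

On these assumed inputs a single copy shrinks to somewhere between roughly 437 GB and 476 GB.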

Cloudflare also ran Llama 4 Scout, Meta's latest open-weight multimodal model, on just two H200 GPUs with capacity remaining for a large context window, demonstrating that smaller open-weight models can be served very efficiently under this architecture.

Technical Deep-Dive

The most technically significant innovation in Cloudflare's infrastructure is disaggregated prefill — a technique that splits LLM inference into two separate stages handled by physically different machines.

Stage one is the prefill phase: the model reads and processes all the input tokens (your question or prompt) and builds the KV cache, a memory structure that holds the attention key and value vectors computed for each token so they do not have to be recomputed for every output token. Prefill is compute-bound: the input tokens are processed in parallel, which keeps the GPU's arithmetic units busy.

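Stage two is the decode phase: the model generates output tokens one at a time, reading from the KV cache populated in stage one. Decode is bound by memory bandwidth rather than compute, because each new token requires reading the model weights and the cache from GPU memory while performing comparatively little arithmetic.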
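The difference between the two stages can be made concrete with a rough arithmetic-intensity estimate: the number of floating-point operations performed per byte of weights read. The sketch below is a simplification under stated assumptions. It models a single dense matrix multiply with 16-bit weights, and ignores attention, KV cache reads, batching across requests, and mixture-of-experts sparsity. The 4,096-token prompt length is hypothetical.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weights read for one dense matrix multiply.

    Each weight takes part in one multiply and one add per token (2 FLOPs)
    and is read from memory once per forward pass.
    """
    return 2.0 * tokens_per_pass / bytes_per_param

# Prefill: the whole prompt goes through the model in one pass.
prefill = arithmetic_intensity(tokens_per_pass=4096)  # hypothetical prompt length

# Decode: a single new token per pass for one sequence.
decode = arithmetic_intensity(tokens_per_pass=1)

print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.0f} FLOPs/byte")
```

With so little work per byte, a decode step spends most of its time waiting on memory, whereas prefill reuses each weight thousands of times once it has been loaded.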

By routing these two stages to hardware optimised for each workload (compute-dense machines for prefill, machines with high memory bandwidth for decode), Cloudflare can serve more requests per GPU-hour than it could if both stages shared the same machines.
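The handoff between the two stages can be sketched as a toy pipeline. This is not Infire's API: the class and function names are hypothetical, the "cache" holds placeholder strings rather than tensors, and no model is actually run. It only shows the shape of the flow, in which one worker builds the cache and a different worker consumes and extends it.

```python
from dataclasses import dataclass, field

@dataclass
class KVCache:
    """Stand-in for per-token key/value tensors; here just (key, value) strings."""
    entries: list = field(default_factory=list)

class PrefillWorker:
    """Would run on the compute-dense pool: handles the whole prompt at once."""
    def run(self, prompt_tokens: list) -> KVCache:
        cache = KVCache()
        for tok in prompt_tokens:
            cache.entries.append((f"k({tok})", f"v({tok})"))
        return cache

class DecodeWorker:
    """Would run on the high-bandwidth pool: emits one token per step."""
    def run(self, cache: KVCache, max_new_tokens: int) -> list:
        output = []
        for step in range(max_new_tokens):
            # A real model would attend over every cached entry here.
            token = f"tok{step}"  # placeholder for a sampled token
            cache.entries.append((f"k({token})", f"v({token})"))
            output.append(token)
        return output

def serve(prompt_tokens: list, max_new_tokens: int):
    cache = PrefillWorker().run(prompt_tokens)  # stage one
    # In a real deployment the cache would cross the network at this point.
    tokens = DecodeWorker().run(cache, max_new_tokens)  # stage two
    return tokens, len(cache.entries)

tokens, cache_len = serve(["What", "is", "Infire", "?"], max_new_tokens=3)
print(tokens, cache_len)
```

The cost this design has to pay is the transfer of the cache between machines, which is why the comment in `serve` marks that step explicitly.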

Combined with Unweight compression, which cuts a model's memory footprint by up to 22%, Infire lets Cloudflare run models on its globally distributed edge fleet that would previously have required dedicated high-cost hardware.

The ASEAN Perspective

For developers in Singapore, Indonesia, and Malaysia building applications on top of AI APIs, Cloudflare's infrastructure work has direct practical implications.

Cloudflare operates data centres in Singapore, Kuala Lumpur, Jakarta, Bangkok, and Manila. As Infire rolls out to the global Workers AI platform, developers in ASEAN will be able to run inference on large models — including trillion-parameter systems — with lower latency than routing requests to US or European data centres.

For startups building AI-powered products on a budget, open-weight models like Llama 4 running efficiently on Cloudflare's edge infrastructure offer a cost-effective alternative to commercial API providers. A model that runs on two H200 GPUs rather than ten significantly changes the economics of serving regional ASEAN users.

RECATOOLS Verdict

Cloudflare's Infire work matters beyond its own business. By making trillion-parameter model inference economically viable at the edge, Cloudflare is moving AI compute closer to where users actually are — not just in data centre hubs.

The Unweight compression technology is particularly significant. A 15–22% reduction in weight file size may sound incremental, but across hundreds of replicas deployed on a global network it compounds into substantial cost savings and latency improvements.

For the developer ecosystem, this means more powerful models at lower cost, served closer to users. That combination should accelerate AI application development across every market, including ASEAN.

RECATOOLS Editorial

