Key Takeaways
- Cloudflare built a custom AI inference engine called Infire for running large language models across its global network
- The engine uses "disaggregated prefill" — separating input processing and output generation onto different machines
- New "Unweight" compression reduces model weight file sizes by 15–22% without measurable accuracy loss
- Kimi K2.5, with over 1 trillion parameters, now runs on just eight H100 GPUs via Infire
- This infrastructure is directly relevant to developers building AI applications in ASEAN
The Facts
Cloudflare has announced a significant expansion of its AI infrastructure with the release of technical details behind Infire, a custom inference engine designed to run large language models efficiently across its global edge network.
The engineering challenge Cloudflare is solving is substantial. Modern frontier models such as Moonshot AI's Kimi K2.5 contain over one trillion parameters and occupy approximately 560GB of storage. Loading the model into GPU memory alone requires at least eight NVIDIA H100 GPUs, before any actual inference computation begins. At that scale, efficiency is not optional — it is the difference between a commercially viable service and one that loses money on every request.
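The eight-GPU figure can be sanity-checked with a rough capacity calculation. This is a sketch under stated assumptions: it takes the 80 GB H100 variant (the announcement as summarised here does not say which variant is used) and ignores framework overhead. The helper name `min_gpus_for_weights` is made up for illustration.

```python
import math


def min_gpus_for_weights(weights_gb: float, gpu_memory_gb: float) -> int:
    """Smallest GPU count whose combined memory can hold the weights."""
    return math.ceil(weights_gb / gpu_memory_gb)


WEIGHTS_GB = 560     # approximate Kimi K2.5 size quoted in the article
H100_MEMORY_GB = 80  # assumption: 80 GB H100 variant

# Bare minimum to hold the weights, with nothing left over.
floor_gpus = min_gpus_for_weights(WEIGHTS_GB, H100_MEMORY_GB)

# Memory left on an eight-GPU node for KV cache and activations.
headroom_gb = 8 * H100_MEMORY_GB - WEIGHTS_GB

print(floor_gpus)   # 7
print(headroom_gb)  # 80
```

Under these assumptions the weights alone would fill seven GPUs completely, so eight is the first count that leaves any room for the KV cache and activations.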
Alongside Infire, Cloudflare introduced Unweight — a compression system that reduces LLM weight file sizes by 15 to 22 percent without measurable accuracy degradation. Unweight reduces the amount of data that GPUs need to load and transfer during inference, directly translating to faster response times and lower per-request costs. Cloudflare's engineering blog documents the full technical implementation for developers integrating these capabilities.
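To put the 15 to 22 percent range in concrete terms, the arithmetic below applies it to the 560 GB figure quoted for Kimi K2.5. This is purely illustrative: the article does not state that Unweight has been applied to that model or what the resulting size would be, and `compressed_size_gb` is a hypothetical helper.

```python
def compressed_size_gb(original_gb: float, reduction: float) -> float:
    """Size after removing the given fraction (0.15 means 15 percent)."""
    return original_gb * (1.0 - reduction)


ORIGINAL_GB = 560  # hypothetical input: the Kimi K2.5 figure from the article

low_saving = compressed_size_gb(ORIGINAL_GB, 0.15)   # least compression
high_saving = compressed_size_gb(ORIGINAL_GB, 0.22)  # most compression

print(f"{high_saving:.1f} GB to {low_saving:.1f} GB")  # 436.8 GB to 476.0 GB
```

On those numbers, roughly 84 to 123 GB less data would need to be stored and moved per copy of the model.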
Cloudflare also ran Llama 4 Scout — Meta's latest open-weight multimodal model — on just two H200 GPUs with capacity remaining for a large context window, demonstrating that smaller open-source models can be served extremely efficiently under this architecture.
Technical Deep-Dive
The most technically significant innovation in Cloudflare's infrastructure is disaggregated prefill — a technique that splits LLM inference into two separate stages handled by physically different machines.
Stage one is the prefill phase: the model processes all the input tokens (your question or prompt) and builds the KV cache, a memory structure holding the attention keys and values computed for each token so they need not be recomputed later. Prefill is compute-bound: the input tokens are processed in parallel, which keeps the GPU's cores busy.
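Stage two is the decode phase: the model generates output tokens one at a time, reading from the KV cache populated in stage one. Decode is memory-bandwidth intensive rather than compute intensive, because each new token requires reading the model weights and the growing cache while doing comparatively little arithmetic.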
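A back-of-the-envelope model shows why the two phases stress different resources. The sketch below is deliberately simplified: it counts about two floating-point operations per parameter per token, assumes the weights are read from memory once per forward pass, and ignores attention over the KV cache and batching across requests. The hardware figures are round, assumed numbers for a modern data-centre GPU, not values from Cloudflare.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs performed per byte of weights read in one pass.

    Each parameter contributes about 2 FLOPs (multiply and add) per token,
    but is read from memory only once per pass, however many tokens there are.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


# Assumed, approximate accelerator figures (not from the article).
PEAK_FLOPS = 1.0e15     # about 1 PFLOP/s of 16-bit compute
MEM_BANDWIDTH = 3.0e12  # about 3 TB/s of memory bandwidth

# Below this intensity the GPU waits on memory; above it, on arithmetic.
ridge_point = PEAK_FLOPS / MEM_BANDWIDTH


def limiting_resource(tokens_per_pass: int) -> str:
    """Classify a forward pass by which resource it saturates first."""
    if arithmetic_intensity(tokens_per_pass) >= ridge_point:
        return "compute"
    return "memory-bandwidth"


print(limiting_resource(2048))  # prefill over a 2048-token prompt: compute
print(limiting_resource(1))     # decode, one token per step: memory-bandwidth
```

Real engines batch many decode requests together to raise the intensity, but the basic asymmetry between the phases remains.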
By routing these two stages to hardware optimised for each workload (compute-dense machines for prefill, high-memory-bandwidth machines for decode), Cloudflare can serve more requests per GPU-hour than a unified architecture allows.
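The hand-off between the two stages can be sketched as a toy program. Nothing here reflects Infire's actual code or API: `KVCache`, `PrefillWorker`, `DecodeWorker`, and `serve` are hypothetical names, and the "model" is a placeholder arithmetic rule chosen only to make the data flow visible.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Per-request cache; a real one holds key/value tensors per layer."""
    entries: list[tuple[int, int]] = field(default_factory=list)


class PrefillWorker:
    """Stands in for a compute-dense machine."""

    def run(self, prompt_tokens: list[int]) -> KVCache:
        cache = KVCache()
        # A real engine handles all prompt tokens in one parallel pass;
        # the loop here only fills in toy (key, value) pairs.
        for tok in prompt_tokens:
            cache.entries.append((tok, tok * 2))
        return cache


class DecodeWorker:
    """Stands in for a machine chosen for memory bandwidth."""

    def run(self, cache: KVCache, max_new_tokens: int) -> list[int]:
        output = []
        for _ in range(max_new_tokens):
            # Each step reads the whole cache to produce a single token,
            # then appends that token's entry for the next step.
            next_tok = sum(value for _, value in cache.entries) % 1000
            cache.entries.append((next_tok, next_tok * 2))
            output.append(next_tok)
        return output


def serve(prompt_tokens: list[int], max_new_tokens: int) -> list[int]:
    """Route one request through both stages."""
    cache = PrefillWorker().run(prompt_tokens)  # stage one
    # In a disaggregated deployment the cache would cross the network here.
    return DecodeWorker().run(cache, max_new_tokens)  # stage two


tokens = serve([1, 2, 3], max_new_tokens=2)
print(tokens)  # [12, 36]
```

The point of the sketch is the interface: the only thing the decode side needs from the prefill side is the populated cache, which is what makes it possible to place the two stages on different machines.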
Combined with Unweight compression reducing the model's memory footprint by up to 22%, Infire enables Cloudflare to run models that would previously have required dedicated high-cost hardware on its globally distributed edge fleet.
The ASEAN Perspective
For developers in Singapore, Indonesia, and Malaysia building applications on top of AI APIs, Cloudflare's infrastructure work has direct practical implications.
Cloudflare operates data centres in Singapore, Kuala Lumpur, Jakarta, Bangkok, and Manila. As Infire rolls out to the global Workers AI platform, developers in ASEAN will be able to run inference on large models — including trillion-parameter systems — with lower latency than routing requests to US or European data centres.
For startups building AI-powered products on a budget, the combination of open-weight models like Llama 4 running efficiently on Cloudflare's edge infrastructure represents a genuinely cost-effective alternative to commercial API providers. A model that runs on two H200 GPUs rather than ten significantly changes the economics of serving regional ASEAN users.
RECATOOLS Verdict
Cloudflare's Infire work matters beyond its own business. By making trillion-parameter model inference economically viable at the edge, Cloudflare is moving AI compute closer to where users actually are — not just in data centre hubs.
The Unweight compression technology is particularly significant. A 15–22% reduction in weight file size may sound incremental, but at the scale of deploying hundreds of replicas across a global network, it compounds into substantial cost savings and latency improvements.
For the developer ecosystem, this means more powerful models at lower cost and closer geographic proximity — a combination that will accelerate AI application development across every market, including ASEAN.
Frequently Asked Questions
Infire is Cloudflare's custom AI inference engine for running large language models across its global edge network more efficiently.
Disaggregated prefill is a technique that separates LLM inference into prefill (input processing) and decode (output generation) stages, routing each to hardware optimised for that workload.
Unweight is Cloudflare's model weight compression system that reduces LLM file sizes by 15–22% without accuracy loss.
Cloudflare's Workers AI platform is available globally, including at edge locations in Singapore, Malaysia, and Indonesia.
Models cited so far include Kimi K2.5 (1T+ parameters, 8x H100) and Llama 4 Scout (2x H200).