Groq

Ultra-fast open-weight model inference on custom LPU silicon

AI Inference Developer Tools LLM API LPU Hardware Open-Weight Models Speed-Optimised Voice AI

Code & Dev Tools Freemium Has API

Researched 20 May 2026, 01:23 SGT · Published 19 May 2026, 01:23 SGT · Reviewed 11 Jul 2026

Visit Groq Compare alternatives

RECATOOLS Score

7.6 / 10

Capability

Value for money

Ease of use

ASEAN readiness

API quality

Founded

2016

Mountain View, California, USA

Users

~3M developers/teams on GroqCloud (mid-2026); ~75% of Fortune 100 hold accounts

Launched

GroqCloud public API: February 2024

Developer

Groq, Inc. (independent; Nvidia acquired most chip assets/IP/leadership in a ~$20B Dec 2025 deal, but GroqCloud was excluded and stays with independent Groq)

Overview

Groq is an AI inference company that built the Language Processing Unit (LPU) — a purpose-designed chip architecture radically different from GPUs — to deliver some of the fastest large language model inference available. Founded in 2016 in Mountain View, California, by Jonathan Ross (who created the original Google TPU, famously launched as a Google '20% side project') and co-founder Douglas Wightman (a former Google X engineer who served as Groq's first CEO), Groq's core insight is that inference workloads have fundamentally different computational characteristics than training, and that deterministic, synchronous dataflow through SRAM-heavy chips eliminates the memory-bandwidth bottlenecks that throttle GPU inference. In independent Artificial Analysis benchmarks, Groq runs Llama 3.3 70B at roughly 276 tokens per second — the fastest of all benchmarked providers, typically several times quicker than GPU-backed rivals.

GroqCloud is the developer-facing cloud API platform exposing this speed advantage via an OpenAI-compatible REST API; existing apps migrate with as little as a base-URL and model-name change. The model catalogue is exclusively open-weight: Llama 3.1 8B and 3.3 70B, Mixtral, Gemma 2, DeepSeek R1 Distill, Qwen 2.5, Whisper (Large v3 Turbo benchmarked at ~216x real-time), and Compound Beta (agentic mode with built-in web search and code execution). Pricing starts at $0.05 per million input tokens, with a generous free tier requiring no credit card. Crucially, in December 2025 Nvidia agreed to a ~$20 billion (about $17 billion cash) deal for Groq's chip assets, patents and leadership — Jonathan Ross, president Sunny Madra and most senior staff moved to Nvidia. GroqCloud's cloud business was explicitly carved OUT of that transaction and remains with the independent Groq, now led by former CFO Simon Edwards as CEO, which is pivoting to an inference 'neocloud' (dubbed 'Groq 2.0') and was reported in late May 2026 to be raising roughly $650 million to fund it.

Groq's key limitation is by design: the LPU runs only open-weight models. Proprietary frontier models (GPT-4o, Claude, Gemini) are unavailable, and adding new model architectures requires hardware-level optimisation. Very large models require many LPU chips and become expensive at scale. The free tier's 6,000 tokens-per-minute cap is a real bottleneck for production apps with concurrent users. Despite these constraints, Groq remains a go-to choice for latency-critical applications — voice AI, real-time agents, and streaming interfaces — where raw tokens-per-second speed is the primary decision criterion. (Note: Groq the inference firm is unrelated to Grok, xAI's chatbot.)

Pricing

Pricing shown for reference only. These figures reflect RECATOOLS research as of 11 Jul 2026 and may be out of date or incomplete. This is not financial or purchasing advice — always confirm the current price on the provider’s official website before making any decision.

Free

Forever-free developer tier — rate-limited, no credit card.

14,400 requests/day, 30 RPM, 6,000 TPM
All open-weight models included
Limits apply at the org level

Developer (pay-as-you-go)

From $0.05/M tokens

On-demand per-token pricing on Groq's LPU hardware.

Llama 3.1 8B: $0.05 in / $0.08 out per 1M
Llama 3.3 70B: $0.59 in / $0.79 out per 1M
Prompt caching: 50% off cached input
Batch API: 50% lower cost

Enterprise

Custom

Custom capacity and support through Groq's Enterprise Access program.

Reserved capacity and higher limits
Custom terms via sales

Use cases

Real-time conversational AI and voice assistants requiring low first-token latency High-throughput document processing, summarisation, and classification pipelines using the Batch API Rapid LLM prototyping on the free tier with OpenAI-compatible SDK drop-in replacement Autonomous AI agents with built-in web search and code execution via Compound Beta

What you can produce with Groq

Build real-time voice AI pipelines combining Whisper Large v3 Turbo transcription (~216x real-time) with Llama generation for low end-to-end latency
Prototype LLM features instantly on the free tier using a simple base-URL swap in any existing OpenAI SDK integration
Deploy streaming chat interfaces where Groq's market-leading throughput (~276 t/s on Llama 3.3 70B, per Artificial Analysis) minimises perceptible generation lag
Run autonomous AI agents via Compound Beta, which bundles built-in web search and code execution with no external tool wiring
Cut inference costs on large-scale summarisation, translation, or classification jobs using the Batch API for async processing
Build multilingual ASEAN content tools on Llama 3.3 70B or Qwen 2.5 with lower latency via the Sydney APAC inference node
Implement content moderation pipelines using Llama Guard models served at Groq throughput rates for near-real-time safety checks

ASEAN Perspective

Groq in Southeast Asia

Groq is gaining meaningful traction in Southeast Asia, driven by a price-to-speed ratio that appeals to cost-conscious startups in Singapore, Indonesia, Malaysia, and the Philippines. Groq says over half of its developers are Asia-Pacific based, and the Sydney APAC inference node (opened November 2025 with Equinix) reduces latency for the region versus US-only endpoints. For ASEAN developers building voice AI, multilingual chatbots, or real-time agent workflows on open-weight models like Llama or Qwen, Groq's free tier and low cost per million tokens make it one of the most accessible inference options available. The absence of Singapore or Southeast Asian data centres means compliance-sensitive workloads (finance, healthcare) still carry data-residency concerns, though Groq has signalled Asia as an active expansion target. Groq does not serve proprietary models, limiting its fit for ASEAN enterprises standardised on OpenAI or Anthropic. A further caveat: after Nvidia's December 2025 deal absorbed Groq's hardware assets and founding team, the independent GroqCloud is rebuilding as an inference neocloud — its API is operating normally today, but its long-term roadmap is worth monitoring. For pure-speed inference on open-weight models, it remains hard to beat at this price point.

RECATOOLS Verdict

Groq is among the fastest options for open-weight model inference — if raw tokens-per-second throughput is your primary requirement, few providers beat it at this price point (independent benchmarks put Llama 3.3 70B at ~276 t/s, the fastest measured). The OpenAI-compatible API, generous free tier, and predictable per-token pricing make it genuinely easy to adopt. However, the hard ceiling of open-weight-only models means teams tied to GPT-4o, Claude, or Gemini cannot use Groq at all, and the free tier's 6,000 TPM cap forces paid upgrades sooner than it appears. The bigger caveat is strategic: Nvidia's December 2025 deal absorbed Groq's chip assets and most of its senior team (including founder Jonathan Ross), leaving the independent GroqCloud to rebuild as an inference 'neocloud' under new leadership — its long-term hardware roadmap deserves scrutiny even as the API operates normally today.

Independent AI-assisted assessment by RECATOOLS.

What people say

Ask developers what Groq is for and you get one answer: speed. Llama 3.3 70B at roughly 276 tokens per second — the fastest Artificial Analysis has measured for that model — feels instantaneous next to GPU-backed rivals, and the OpenAI-compatible API means migrating is usually a base-URL change. The free tier needs no credit card, and GroqCloud claimed more than 3.5 million developers by February 2026.

The complaints are just as consistent. The catalogue is open-weight only — no GPT, Claude or Gemini — a hard blocker for teams tied to proprietary models. Free-tier ceilings (30 requests and 6,000 tokens per minute on most models, plus daily caps) bite fast in production; one long prompt can eat a minute's budget, and limits apply org-wide. The paid developer tier lifts limits roughly 10x for the price of adding a card. And Groq trains nothing itself, so quality tracks whatever Meta, Google, Mistral and Alibaba release as open weights.

The strategic question is newer. In December 2025 Nvidia paid about $20 billion to license Groq's LPU technology and hire most of its senior team, founder Jonathan Ross included. GroqCloud was carved out and stays independent under former CFO Simon Edwards, now rebuilding as an inference 'neocloud' on remaining LPU hardware plus Nvidia GPUs, funded by a roughly $650 million raise reported in late May 2026. The API runs uninterrupted today, but the long-term hardware roadmap deserves scrutiny.

User verdict: a superb specialist for voice AI, streaming agents and rapid prototyping — not a one-stop inference platform.

Summary of public user & expert reviews, compiled by RECATOOLS.

Notable facts

Groq founder Jonathan Ross originally created the Google TPU as a '20% side project' (one day a week) before leaving Google in 2016 to build LPU chips — which Nvidia eventually acquired/licensed in a ~$20 billion deal in December 2025, taking Ross and most of Groq's senior team with it
Groq's LPU uses on-chip SRAM instead of off-chip DRAM for model weights, enabling deterministic synchronous dataflow that eliminates memory-bandwidth stalls — the architectural reason it tops independent Artificial Analysis throughput benchmarks (~276 tokens/second on Llama 3.3 70B, the fastest of all measured providers)

Frequently asked questions

Can I use Groq with Claude, GPT-4o, or Gemini?

No. Groq's LPU hardware only runs open-weight models (Llama, Mixtral, Gemma, Qwen, DeepSeek Distill, Whisper). Proprietary frontier models from Anthropic, OpenAI, and Google are not available and cannot be added without hardware-level re-engineering for each new architecture. (Also note: Groq is unrelated to xAI's 'Grok' chatbot.)

How does the free tier work and when will I hit limits?

The free tier gives 14,400 requests/day and 30 RPM with no credit card required. The real bottleneck is 6,000 tokens per minute — a single long prompt with a lengthy response can exhaust most of your per-minute budget, and limits apply at the organisation level. It is excellent for prototyping but production apps with concurrent users will need a paid plan.

What happened to Groq after the Nvidia deal?

In December 2025, Nvidia agreed to a ~$20 billion deal (about $17 billion cash across installments through end-2026) for Groq's chip assets, patents and leadership — founder Jonathan Ross, president Sunny Madra and most senior staff moved to Nvidia. Critically, GroqCloud's cloud business was excluded from that deal and remains with the independent Groq, now led by former CFO Simon Edwards as CEO. GroqCloud was not disrupted, and in late May 2026 Groq was reported to be raising roughly $650 million to fund its inference-'neocloud' second act ('Groq 2.0').

Does Groq have data centres in Asia?

Yes. Groq opened its first Asia-Pacific inference site in an Equinix data centre in Sydney, Australia in November 2025 (a 4.5MW facility, part of a planned ~US$300M expansion). Asia-Pacific is a key growth market — Groq says over half its developers are APAC-based — but no Southeast Asian data centres exist yet, so some latency and data-residency concerns remain for ASEAN users.

Is Groq's API compatible with OpenAI SDKs?

Yes, mostly. Point the base URL to https://api.groq.com/openai/v1, supply your Groq API key, and select a Groq model name. A small set of OpenAI parameters (e.g. logprobs, logit_bias, messages[].name, n greater than 1) are not yet supported, but standard chat-completion workflows migrate with minimal changes.

Was this listing helpful?

Visit Groq

Quick facts

DeveloperGroq, Inc. (independent; Nvidia acquired most chip assets/IP/leadership in a ~$20B Dec 2025 deal, but GroqCloud was excluded and stays with independent Groq)

Founded2016

HQMountain View, California, USA

Users~3M developers/teams on GroqCloud (mid-2026); ~75% of Fortune 100 hold accounts

PricingFreemium

APIYes

GitHub Source

Top alternatives

Fireworks AI

Production inference platform for op...

OpenRouter

One API key, one bill, 300-plus LLMs

Together AI

High-performance inference for open-...

In-house AI Tools

Prompt Framework Builder

Build a structured AI prompt from a...

System Prompt Builder

Build a system prompt for a custom G...

llms.txt Generator

Build a spec-compliant /llms.txt to...

AI-Crawler robots.txt Builder

Allow or block AI crawlers — GPTBot,...

Token Counter

Count exact GPT tokens (tiktoken) pl...

About this listing

Researched on Wednesday, 20 May 2026 at 01:23 SGT (UTC+8)

Published on Tuesday, 19 May 2026 at 01:23 SGT (UTC+8)

Last reviewed Saturday, 11 July 2026 (1 week ago)

This entry was compiled from publicly available data including Groq's official website, press releases, documentation, and reputable third-party publications. RECATOOLS is not affiliated with Groq unless explicitly stated.

Data accuracy

Third-party AI tools update their pricing, features, availability, and policies frequently. Information here may be outdated by the time you read this — we make reasonable efforts to keep listings current, but cannot guarantee absolute accuracy.

For the latest details, please refer to Groq directly →

Spotted something out of date? Suggest an update →