RedPajama

Open LLaMA-recipe training data — 30T tokens, fully transparent

Tags: Fully Open · Open Source · Research · Together AI · Training Data · Transparent

Category: LLMs & Chat · Open Source · Has API

Researched 8 May 2026, 20:44 SGT · Published 8 May 2026, 08:00 SGT · Reviewed 11 Jul 2026


RECATOOLS Score

5 / 10

Criteria: Capability · Value for money · Ease of use · ASEAN readiness · API quality

Founded: 2023 · San Francisco, California
Users: 100k+ downloads
Launched: Apr 2023
Developer: Together AI

Overview

RedPajama is a Together AI-led open project that reproduced the LLaMA training recipe in the open. It comprises the 1.2T-token v1 corpus, the 30-trillion-token RedPajama-Data-v2 with 40+ quality annotations, and the fully open RedPajama-INCITE base models. It is a research resource, not an end-user product.

Pricing

Pricing shown for reference only. These figures reflect RECATOOLS research as of 11 Jul 2026 and may be out of date or incomplete. This is not financial or purchasing advice — always confirm the current price on the provider’s official website before making any decision.

Free

Fully free

Use cases

Training a new language model from scratch with full data provenance for research
Auditing whether a model's knowledge comes from specific contaminated sources
Academic research requiring full reproducibility of the training pipeline
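
The auditing use case can be made concrete with a small sketch. One common way to check for contamination is to measure how many of a benchmark item's word n-grams also appear in the training corpus. The code below is a generic illustration of that idea, not a method published by the RedPajama project; the function names, the 5-gram window, and the toy corpus are all assumptions chosen for the example.

```python
import re


def ngrams(text, n=5):
    """Return the set of lowercase word n-grams in a piece of text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, corpus_docs, n=5):
    """Fraction of the item's n-grams that appear in any corpus document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Toy stand-ins for corpus documents (hypothetical, not real RedPajama text).
CORPUS = [
    "Forum post: the quick brown fox jumps over the lazy dog, as everyone knows.",
    "An unrelated page about cooking rice in a clay pot.",
]
LEAKED = "The quick brown fox jumps over the lazy dog"
CLEAN = "Which planet in the solar system has the most moons"

leaked_score = overlap_ratio(LEAKED, CORPUS)
clean_score = overlap_ratio(CLEAN, CORPUS)
print(leaked_score, clean_score)
```

A high ratio suggests the item's text is present in the corpus. At the scale of a trillion-token corpus, holding every n-gram in memory is not realistic, so a real audit would likely swap the in-memory set for an index or a probabilistic structure.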

ASEAN Perspective

RedPajama in Southeast Asia

ASEAN-region availability and pricing notes coming soon. Drop the editorial team a note via /contact/ if you can supply local context (Singapore/Malaysia/Indonesia/Thailand/Vietnam).

RECATOOLS Verdict

RedPajama is an open-source initiative led by Together AI and collaborators to reproduce LLaMA-scale training datasets and release open base models, contributing one of the most influential open pre-training corpora in the field. Its real value is to ML researchers, dataset builders and teams pre-training or studying open models, where the transparent, permissively available data is a major asset.

It is not an end-user product: there is no app, chat UI, or polished API, and the released base models have been surpassed by newer open families (Llama, Qwen, Mistral, etc.) for practical use. As a free, foundational research contribution it is significant; as a tool to deploy today it is mostly of historical and dataset interest. Scored as a research artifact, not a product.

Independent AI-assisted assessment by RECATOOLS.

What people say

There's no G2 page for RedPajama because it isn't a product — it's a dataset project, and its reputation lives in downloads and citations. On those terms it did its job. The original 1.2-trillion-token v1 corpus (2023) proved Meta's LLaMA recipe could be reproduced in the open, and the follow-up RedPajama-Data-v2 — 30 trillion tokens across 84 CommonCrawl snapshots, shipped with 40+ precomputed quality signals — became genuine infrastructure: Together reported over 20,000 downloads a month, and it fed training runs for models like Snowflake's Arctic. The accompanying paper landed at NeurIPS 2024.

The models are a different story. The RedPajama-INCITE 3B and 7B checkpoints, trained on Oak Ridge INCITE compute in 2023, beat contemporaries like Pythia and GPT-Neo at the 3B scale but never closed the gap to LLaMA-7B, and nobody would deploy them in 2026 — Llama, Qwen and Mistral families outclass them at every size. They survive as reference points and ablation baselines, not deployment candidates.

Even on the data side the field has moved. Newer curated corpora such as Hugging Face's FineWeb have become the default for English web pretraining, and v2's ship-everything-with-quality-signals approach gets praised for flexibility and knocked for pushing heavy filtering work onto the user. The honest 2026 read: RedPajama mattered enormously, the v2 corpus still gets pulled by dataset builders, and its audience is people training or auditing models — anyone shopping for a tool should keep walking.
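
To illustrate what "pushing filtering work onto the user" means in practice, here is a minimal sketch of threshold-based filtering over precomputed quality signals. It assumes each record carries a JSON-encoded `quality_signals` field with one number per signal; the signal names (`word_count`, `perplexity`), the thresholds, and the sample records are hypothetical, and the real v2 schema is richer than this flat layout.

```python
import json

# Hypothetical rules: signal name -> predicate the value must satisfy.
RULES = {
    "word_count": lambda v: v >= 50,      # drop very short documents
    "perplexity": lambda v: v <= 500.0,   # drop text a reference LM finds unlikely
}


def keep(record, rules=RULES):
    """Return True only if every rule's signal is present and passes."""
    signals = json.loads(record["quality_signals"])
    for name, passes in rules.items():
        value = signals.get(name)
        if value is None or not passes(value):
            return False
    return True


def filter_corpus(records, rules=RULES):
    """Keep the records that satisfy all rules, preserving order."""
    return [r for r in records if keep(r, rules)]


# Made-up sample records for illustration only.
SAMPLE = [
    {"doc_id": "a", "quality_signals": json.dumps({"word_count": 420, "perplexity": 180.5})},
    {"doc_id": "b", "quality_signals": json.dumps({"word_count": 12, "perplexity": 95.0})},
    {"doc_id": "c", "quality_signals": json.dumps({"word_count": 900, "perplexity": 2300.0})},
    {"doc_id": "d", "quality_signals": json.dumps({"word_count": 75})},
]

kept_ids = [r["doc_id"] for r in filter_corpus(SAMPLE)]
print(kept_ids)
```

Treating a missing signal as a failure is a conservative choice made for this sketch; a pipeline that prefers recall could skip rules whose signal is absent instead.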

Summary of public user & expert reviews, compiled by RECATOOLS.

Notable facts

RedPajama was the first project to fully reproduce the LLaMA training dataset — 1.2 trillion tokens from Common Crawl, Wikipedia, GitHub, ArXiv, Books, and other sources.
The data release enabled researchers to audit exactly what information models learned from, addressing a key criticism of opaque training data in commercial models.
Together AI released RedPajama specifically to counter the 'open source in name only' criticism of models that release weights but not training data.

Frequently asked questions

Is RedPajama free?

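Yes, it is free to download and use. The RedPajama-INCITE models and the project code are released under Apache 2.0; the dataset text itself remains subject to the licenses of its underlying sources, such as the Common Crawl terms of use.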

What makes RedPajama different from Llama?

RedPajama releases the training data alongside the model weights, so the full pipeline can be inspected. Meta's Llama releases weights only, without the underlying training data.

How large is the RedPajama dataset?

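It depends on the version. The original v1 corpus has about 1.2 trillion tokens across web text, books, code, scientific papers, and other sources. RedPajama-Data-v2 has about 30 trillion tokens drawn from 84 CommonCrawl snapshots.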

Can I train my own model from RedPajama data?

Yes. The dataset and training code are both open source.

Why is training data transparency important?

It allows researchers to audit what information models learned, test for data contamination, and understand model behaviour origins.


Quick facts

Developer: Together AI
Founded: 2023
HQ: San Francisco, California
Users: 100k+ downloads
Pricing: Open Source
API: Yes

GitHub Source

GitHub ★ 5k ⑂ 377 Apache-2.0 updated 1 month ago · synced 16 Jul 2026

Hugging Face ⬇ 333 ♥ 92 · synced 14 Jul 2026

Top alternatives

BLOOM

The first multilingual open-access l...

Llama

Meta's open-weight LLM family — Llam...

Mistral AI

European frontier-model lab

About this listing

This entry was compiled from publicly available data including RedPajama's official website, press releases, documentation, and reputable third-party publications. RECATOOLS is not affiliated with RedPajama unless explicitly stated.

Data accuracy

Third-party AI tools update their pricing, features, availability, and policies frequently. Information here may be outdated by the time you read this — we make reasonable efforts to keep listings current, but cannot guarantee absolute accuracy.
