RedPajama

Open reproduction of the LLaMA training data and models, with transparent data sources and permissively licensed code and model weights.

LLMs & Chat · Open Source · Has API
RECATOOLS Score: 5 / 10
Capability: 5
Value for money: 7
Ease of use: 3
ASEAN readiness: 5
API quality: 3

Founded: 2023
HQ: San Francisco, California
Users: 100k+ downloads
Launched: Apr 2023
Developer: Together AI

Overview

RedPajama is an open-source project by Together AI that created a fully open reproduction of the LLaMA training dataset and model series. While Meta's LLaMA models are open-weight, the original training data was not released. RedPajama addressed this by recreating the training dataset (1.2 trillion tokens) and training new models from scratch with full transparency about data sources and processing.

The RedPajama-Data-v1 dataset was among the largest openly available training corpora at its release in 2023, and researchers have used it to train new models from scratch. Because the data is public, it supports research into training data influence, contamination testing, and tracing model capabilities back to their source data.
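
As a rough illustration of the contamination testing mentioned above, the sketch below flags benchmark items that share a long word n-gram with corpus documents. It is a minimal sketch, not the project's own tooling: the function names, the whitespace tokenisation, and the 8-gram window are all assumptions chosen for illustration, and a real audit over 1.2 trillion tokens would need an on-disk index rather than an in-memory set.

```python
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = 8) -> Set[NGram]:
    """Return the set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(
    benchmark_items: Iterable[str],
    corpus_docs: Iterable[str],
    n: int = 8,
) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    # Hypothetical in-memory index; only practical for small samples.
    corpus_index: Set[NGram] = set()
    for doc in corpus_docs:
        corpus_index |= ngrams(doc, n)

    items = list(benchmark_items)
    if not items:
        return 0.0
    flagged = sum(1 for item in items if ngrams(item, n) & corpus_index)
    return flagged / len(items)


# Toy data standing in for corpus documents and benchmark questions.
corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
benchmark = [
    "Question: the quick brown fox jumps over the lazy dog near what?",
    "What is the capital of France and when was it founded as a city?",
]
print(contamination_rate(benchmark, corpus))  # 0.5
```

Exact n-gram matching is a deliberately simple choice here: it is easy to reason about but misses paraphrased or reformatted leaks, so treat a low rate as weak evidence only.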

RedPajama-INCITE models are the language models trained on this data, offering fully open models where the entire stack — data, preprocessing, training code, and weights — is publicly available. This level of transparency is unavailable from Meta's LLaMA or most commercial models, making RedPajama valuable for research requiring full provenance tracking.

Pricing

Pricing shown for reference only. These figures reflect RECATOOLS research as of 8 May 2026 and may be out of date or incomplete. This is not financial or purchasing advice — always confirm the current price on the provider’s official website before making any decision.

Free: fully free

Use cases

  • Training a new language model from scratch with full data provenance for research
  • Auditing whether a model's knowledge comes from specific contaminated sources
  • Academic research requiring full reproducibility of the training pipeline

ASEAN Perspective

RedPajama in Southeast Asia

ASEAN-region availability and pricing notes coming soon. Drop the editorial team a note via /contact/ if you can supply local context (Singapore/Malaysia/Indonesia/Thailand/Vietnam).

RECATOOLS Verdict

RedPajama is an open-source initiative led by Together AI and collaborators to reproduce LLaMA-scale training datasets and release open base models, contributing one of the most influential open pre-training corpora in the field. Its real value is to ML researchers, dataset builders and teams pre-training or studying open models, where the transparent, permissively available data is a major asset.

It is not an end-user product: there is no app, chat UI, or polished API, and the released base models have been surpassed by newer open families (Llama, Qwen, Mistral, etc.) for practical use. As a free, foundational research contribution it is significant; as a tool to deploy today it is mostly of historical and dataset interest. Scored as a research artifact, not a product.

Independent AI-assisted assessment by RECATOOLS.

Notable facts

  • RedPajama was one of the first projects to openly reproduce the LLaMA training dataset: roughly 1.2 trillion tokens from Common Crawl, C4, GitHub, ArXiv, Books, Wikipedia, and StackExchange.
  • The data release enabled researchers to audit exactly what information models learned from, addressing a key criticism of opaque training data in commercial models.
  • Together AI released RedPajama specifically to counter the 'open source in name only' criticism of models that release weights but not training data.
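
To give a feel for how the 1.2 trillion tokens are distributed, the sketch below computes each source's share of the total. The per-slice counts are approximate figures recalled from the project's README rather than values stated in this listing, so treat them as assumptions and check them against the official repository before relying on them.

```python
from typing import Dict

# Approximate token counts in billions. ASSUMPTION: recalled from the
# RedPajama-Data README; verify against the official repository.
SLICE_TOKENS_B: Dict[str, int] = {
    "CommonCrawl": 878,
    "C4": 175,
    "GitHub": 59,
    "ArXiv": 28,
    "Books": 26,
    "Wikipedia": 24,
    "StackExchange": 20,
}


def slice_shares(tokens_b: Dict[str, int]) -> Dict[str, float]:
    """Return each slice's fraction of the total token count."""
    total = sum(tokens_b.values())
    return {name: count / total for name, count in tokens_b.items()}


total_b = sum(SLICE_TOKENS_B.values())
shares = slice_shares(SLICE_TOKENS_B)

# Print slices from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14}{share:6.1%}")
print(f"Total: {total_b} billion tokens")
```

On these figures, web text (Common Crawl plus C4) makes up well over four fifths of the corpus, which is worth keeping in mind when auditing where a model's knowledge may have come from.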

Frequently asked questions

Is RedPajama free?
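Yes, both the dataset and the models are free to download. The RedPajama-INCITE models and the data-preparation code are released under Apache 2.0; the dataset text itself remains subject to the licences of its underlying sources.
What makes RedPajama different from LLaMA?
RedPajama releases its training data alongside the model weights. The original LLaMA release provided weights only, without the training data.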
How large is the RedPajama dataset?
1.2 trillion tokens across web text, books, code, scientific papers, and other sources.
Can I train my own model from RedPajama data?
Yes. The dataset and training code are both open source.
Why is training data transparency important?
It allows researchers to audit what information models learned, test for data contamination, and understand model behaviour origins.
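
For readers wondering what working with the data looks like in practice, here is a minimal sketch of a first preprocessing step: reading documents from a JSON Lines file and dropping very short ones. It assumes each line is a JSON object with a `text` field and a `meta` field, which is how the dataset files are understood to be laid out; the function name, the `min_chars` threshold, and the sample records are hypothetical.

```python
import io
import json
from typing import Iterator, TextIO


def iter_documents(stream: TextIO, min_chars: int = 20) -> Iterator[str]:
    """Yield the `text` field of each JSON line, skipping blank lines
    and documents shorter than `min_chars` characters."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        text = record.get("text", "")
        if len(text) >= min_chars:
            yield text


# In-memory stand-in for a downloaded .jsonl shard (hypothetical records).
sample = io.StringIO(
    '{"text": "A long enough document about open training data.", "meta": {"source": "example"}}\n'
    "\n"
    '{"text": "too short", "meta": {"source": "example"}}\n'
)
docs = list(iter_documents(sample))
print(len(docs))  # 1
```

Streaming line by line keeps memory use flat, which matters because the full dataset is far too large to load at once.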

About this listing

This entry was compiled from publicly available data including RedPajama's official website, press releases, documentation, and reputable third-party publications. RECATOOLS is not affiliated with RedPajama unless explicitly stated.

Data accuracy

Third-party AI tools update their pricing, features, availability, and policies frequently. Information here may be outdated by the time you read this — we make reasonable efforts to keep listings current, but cannot guarantee absolute accuracy.

For the latest details, please refer to RedPajama directly.

