RedPajama
Open reproduction of the LLaMA training data, plus openly released models, with transparent data sources and permissive licensing.
Overview
RedPajama is an open-source project by Together AI that created a fully open reproduction of the LLaMA training dataset and model series. While Meta's LLaMA models are open-weight, the original training data was not released. RedPajama addressed this by recreating the training dataset (1.2 trillion tokens) and training new models from scratch with full transparency about data sources and processing.
The RedPajama-Data-v1 dataset is one of the largest openly available training corpora and has been used by many researchers to train new models from scratch. Because the data is fully public, it supports research into training data influence, benchmark contamination testing, and tracing model capabilities back to their source data.
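To illustrate what contamination testing against an open corpus can look like, here is a minimal sketch based on word n-gram overlap. It is not RedPajama tooling: the function names (ngrams, contamination_rate), the whitespace tokenisation, and the default n-gram length are assumptions chosen for illustration, and the corpus is a small in-memory list standing in for documents streamed from the dataset.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    # The range is empty when the text has fewer than n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(item: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that occur in at least one corpus document."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    seen: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        # Keep only the n-grams shared by the benchmark item and this document.
        seen |= item_grams & ngrams(doc, n)
    return len(seen) / len(item_grams)


if __name__ == "__main__":
    # Hypothetical stand-in for documents read from the open corpus.
    corpus = ["the quick brown fox jumps over the lazy dog near the river bank today"]
    benchmark_item = "quick brown fox jumps over the lazy dog"
    print(contamination_rate(benchmark_item, corpus, n=4))
```

A rate near 1 suggests the benchmark item appears almost verbatim in the corpus, while a rate near 0 suggests no close match. Published contamination studies generally use more careful tokenisation and normalisation than this sketch does.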
RedPajama-INCITE models are the language models trained on this data, offering fully open models where the entire stack — data, preprocessing, training code, and weights — is publicly available. This level of transparency is unavailable from Meta's LLaMA or most commercial models, making RedPajama valuable for research requiring full provenance tracking.
Pricing
Pricing shown for reference only. These figures reflect RECATOOLS research as of 8 May 2026 and may be out of date or incomplete. This is not financial or purchasing advice — always confirm the current price on the provider’s official website before making any decision.
ASEAN Perspective
RedPajama in Southeast Asia
ASEAN-region availability and pricing notes coming soon. Drop the editorial team a note via /contact/ if you can supply local context (Singapore/Malaysia/Indonesia/Thailand/Vietnam).
RedPajama is an open-source initiative led by Together AI and collaborators to reproduce LLaMA-scale training datasets and release open base models, contributing one of the most influential open pre-training corpora in the field. Its real value is to ML researchers, dataset builders and teams pre-training or studying open models, where the transparent, permissively available data is a major asset.
It is not an end-user product: there is no app, chat UI, or polished API, and the released base models have been surpassed by newer open families (Llama, Qwen, Mistral, etc.) for practical use. As a free, foundational research contribution it is significant; as a tool to deploy today it is mostly of historical and dataset interest. Scored as a research artifact, not a product.
Notable facts
- RedPajama was the first project to openly reproduce the LLaMA training dataset, following the recipe described in the LLaMA paper: 1.2 trillion tokens drawn from Common Crawl, C4, GitHub, Books, ArXiv, Wikipedia, and StackExchange.
- The data release enabled researchers to audit exactly what information models learned from, addressing a key criticism of opaque training data in commercial models.
- Together AI positioned RedPajama as a response to the 'open source in name only' criticism of models that release weights but not training data.
About this listing
This entry was compiled from publicly available data including RedPajama's official website, press releases, documentation, and reputable third-party publications. RECATOOLS is not affiliated with RedPajama unless explicitly stated.
Third-party AI tools update their pricing, features, availability, and policies frequently. Information here may be outdated by the time you read this — we make reasonable efforts to keep listings current, but cannot guarantee absolute accuracy.
For the latest details, please refer to RedPajama directly.