List of large language models
A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters that are trained with self-supervised learning on vast amounts of text.
List
For the training cost column, 1 petaFLOP-day = 1 petaFLOP/sec × 1 day = 8.64E19 FLOP. Where a family includes several sizes, only the largest model's cost is listed.
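As a sanity check on this unit, the minimal Python sketch below converts a reported raw FLOP count into petaFLOP-days. The 3.8E25 FLOP input is the Llama 3.1 405B figure quoted in the notes column further down; the helper name is purely illustrative, not from any library.

```python
# 1 petaFLOP-day: 1e15 FLOP/sec sustained for 86,400 seconds.
PETAFLOP_DAY_FLOP = 1e15 * 86_400  # = 8.64e19 FLOP

def to_petaflop_days(total_flop: float) -> float:
    """Convert a total training compute budget (in FLOP) to petaFLOP-days."""
    return total_flop / PETAFLOP_DAY_FLOP

# Example: Llama 3.1 405B reportedly used 3.8e25 FLOP.
print(f"{to_petaflop_days(3.8e25):,.0f}")  # -> 439,815, i.e. ~440,000 petaFLOP-days
```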
| Name | Release date | Developer | Number of parameters (billions) | Corpus size | Training cost (petaFLOP-day) | License | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-1 | June 2018 | OpenAI | 0.117 | | 1 | MIT | First GPT model, decoder-only transformer. Trained for 30 days on 8 P600 GPUs. |
| BERT | October 2018 | Google | 0.340 | 3.3 billion words | | Apache 2.0 | An early and influential language model. Encoder-only and thus not built to be prompted or generative. Training took 4 days on 64 TPUv2 chips. |
| T5 | October 2019 | Google | 11 | 34 billion tokens | | Apache 2.0 | Base model for many Google projects, such as Imagen. |
| XLNet | June 2019 | Google | 0.340 | 33 billion words | 330 | Apache 2.0 | An alternative to BERT; designed as encoder-only. Trained on 512 TPU v3 chips for 5.5 days. |
| GPT-2 | February 2019 | OpenAI | 1.5 | 40 GB (~10 billion tokens) | 28 | MIT | Trained on 32 TPUv3 chips for 1 week. |
| GPT-3 | May 2020 | OpenAI | 175 | 300 billion tokens | 3640 | Proprietary | A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022. |
| GPT-Neo | March 2021 | EleutherAI | 2.7 | 825 GiB | | MIT | The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3. |
| GPT-J | June 2021 | EleutherAI | 6 | 825 GiB | 200 | Apache 2.0 | GPT-3-style language model. |
| Megatron-Turing NLG | October 2021 | Microsoft and Nvidia | 530 | 338.6 billion tokens | 38000 | | Trained for 3 months on over 2000 A100 GPUs on the NVIDIA Selene supercomputer, for over 3 million GPU-hours. |
| Ernie 3.0 Titan | December 2021 | Baidu | 260 | 4 TB | | Proprietary | Chinese-language LLM. Ernie Bot is based on this model. |
| Claude | December 2022 | Anthropic | 52 | 400 billion tokens | | Proprietary | Fine-tuned for desirable behavior in conversations. |
| GLaM | December 2021 | Google | 1200 | 1.6 trillion tokens | 5600 | Proprietary | Sparse mixture-of-experts model, making it more expensive to train but cheaper to run inference compared to GPT-3. |
| Gopher | December 2021 | DeepMind | 280 | 300 billion tokens | 5833 | Proprietary | Later developed into the Chinchilla model. |
| LaMDA | January 2022 | Google | 137 | 1.56T words, 168 billion tokens | 4110 | Proprietary | Specialized for response generation in conversations. |
| GPT-NeoX | February 2022 | EleutherAI | 20 | 825 GiB | 740 | Apache 2.0 | Based on the Megatron architecture. |
| Chinchilla | March 2022 | DeepMind | 70 | 1.4 trillion tokens | 6805 | Proprietary | Reduced-parameter model trained on more data. Used in the Sparrow bot. Often cited for its neural scaling law; see the worked example below the table. |
| PaLM | April 2022 | Google | 540 | 768 billion tokens | | Proprietary | Trained for ~60 days on ~6000 TPU v4 chips. |
| OPT | May 2022 | Meta | 175 | 180 billion tokens | 310 | Non-commercial research | GPT-3 architecture with some adaptations from Megatron. Uniquely, the training logbook written by the team was published. |
| YaLM 100B | June 2022 | Yandex | 100 | 1.7 TB | | Apache 2.0 | English-Russian model based on Microsoft's Megatron-LM. |
| Minerva | June 2022 | Google | 540 | 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server | | Proprietary | For solving "mathematical and scientific questions using step-by-step reasoning". Initialized from PaLM models, then finetuned on mathematical and scientific data. |
| BLOOM | July 2022 | Large collaboration led by Hugging Face | 176 | 350 billion tokens (1.6 TB) | | Responsible AI license | Essentially GPT-3 but trained on a multilingual corpus. |
| Galactica | November 2022 | Meta | 120 | 106 billion tokens | | CC BY-NC 4.0 | Trained on scientific text and modalities. |
| AlexaTM | November 2022 | Amazon | 20 | 1.3 trillion tokens | | Proprietary | Bidirectional sequence-to-sequence architecture. |
| Llama | February 2023 | Meta AI | 65 | 1.4 trillion tokens | 6300 | Non-commercial research | Corpus has 20 languages. "Overtrained" for better performance with fewer parameters. |
| GPT-4 | March 2023 | OpenAI | | | estimated 230,000 | Proprietary | Available to ChatGPT users and used in several products. |
| Cerebras-GPT | March 2023 | Cerebras | 13 | | 270 | Apache 2.0 | Trained with the Chinchilla formula. |
| Falcon | March 2023 | Technology Innovation Institute | 40 | 1 trillion tokens, from RefinedWeb plus some "curated corpora" | 2800 | Apache 2.0 | |
| BloombergGPT | March 2023 | Bloomberg L.P. | 50 | 363 billion token dataset based on Bloomberg's data sources, plus 345 billion tokens from general purpose datasets | | Proprietary | Trained on financial data from proprietary sources, for financial tasks. |
| PanGu-Σ | March 2023 | Huawei | 1085 | 329 billion tokens | | Proprietary | |
| OpenAssistant | March 2023 | LAION | 17 | 1.5 trillion tokens | | Apache 2.0 | Trained on crowdsourced open data. |
| Jurassic-2 | March 2023 | AI21 Labs | | | | Proprietary | Multilingual. |
| PaLM 2 | May 2023 | Google | 340 | 3.6 trillion tokens | | Proprietary | Used in the Bard chatbot. |
| YandexGPT | May 2023 | Yandex | | | | Proprietary | Used in the Alice chatbot. |
| Llama 2 | July 2023 | Meta AI | 70 | 2 trillion tokens | | Llama 2 license | 1.7 million A100-hours. |
| Claude 2 | July 2023 | Anthropic | | | | Proprietary | Used in the Claude chatbot. |
| Granite 13b | | IBM | 13 | | | | Used in IBM Watsonx. |
| Mistral 7B | September 2023 | Mistral AI | 7.3 | | | Apache 2.0 | |
| YandexGPT 2 | | Yandex | | | | Proprietary | Used in the Alice chatbot. |
| Claude 2.1 | November 2023 | Anthropic | | | | Proprietary | Used in the Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages. |
| Grok 1 | November 2023 | xAI | 314 | | | Apache 2.0 | Used in the Grok chatbot. Grok 1 has a context length of 8,192 tokens and has access to X. |
| Gemini 1.0 | December 2023 | Google DeepMind | | | | Proprietary | Multimodal model, comes in three sizes. Used in the chatbot of the same name. |
| Mixtral 8x7B | December 2023 | Mistral AI | 46.7 | | | Apache 2.0 | Outperforms GPT-3.5 and Llama 2 70B on many benchmarks. Mixture-of-experts model, with 12.9 billion parameters activated per token. |
| DeepSeek-LLM | November 2023 | DeepSeek | 67 | 2T tokens | | DeepSeek License | Trained on English and Chinese text. 1e24 FLOPs for the 67B model, 1e23 FLOPs for the 7B. |
| Phi-2 | December 2023 | Microsoft | 2.7 | 1.4T tokens | 419 | MIT | Trained on real and synthetic "textbook-quality" data, for 14 days on 96 A100 GPUs. |
| Gemini 1.5 | February 2024 | Google DeepMind | | | | Proprietary | Multimodal model, based on a mixture-of-experts architecture. Context window above 1 million tokens. |
| Gemini Ultra | February 2024 | Google DeepMind | | | | Proprietary | |
| Gemma | February 2024 | Google DeepMind | 7 | 6T tokens | | Gemma Terms of Use | |
| Claude 3 | March 2024 | Anthropic | | | | Proprietary | Includes three models: Haiku, Sonnet, and Opus. |
| DBRX | March 2024 | Databricks and Mosaic ML | 136 | 12T tokens | | Databricks Open Model License | Training cost 10 million USD. |
| YandexGPT 3 Pro | | Yandex | | | | Proprietary | Used in the Alice chatbot. |
| Fugaku-LLM | May 2024 | Fujitsu, Tokyo Institute of Technology, etc. | 13 | 380B tokens | | | The largest model ever trained using only CPUs, on the Fugaku supercomputer. |
| Chameleon | May 2024 | Meta AI | 34 | 4.4 trillion tokens | | | |
| Mixtral 8x22B | April 2024 | Mistral AI | 141 | | | Apache 2.0 | |
| Phi-3 | April 2024 | Microsoft | 14 | 4.8T tokens | | MIT | Microsoft markets them as "small language models". |
| Granite Code Models | May 2024 | IBM | | | | Apache 2.0 | |
| YandexGPT 3 Lite | | Yandex | | | | Proprietary | Used in the Alice chatbot. |
| Qwen2 | June 2024 | Alibaba Cloud | 72 | 3T tokens | | | Multiple sizes, the smallest being 0.5B. |
| DeepSeek-V2 | May 2024 | DeepSeek | 236 | 8.1T tokens | | DeepSeek License | 1.4M hours on H800 GPUs. |
| Nemotron-4 | June 2024 | Nvidia | 340 | 9T tokens | | NVIDIA Open Model License | Trained for 1 epoch on 6144 H100 GPUs between December 2023 and May 2024. |
| Claude 3.5 | June 2024 | Anthropic | | | | Proprietary | Initially, only one model, Sonnet, was released. In October 2024, Sonnet 3.5 was upgraded, and Haiku 3.5 became available. |
| Llama 3.1 | July 2024 | Meta AI | 405 | 15.6T tokens | 440,000 | Llama 3 license | The 405B version took 31 million GPU-hours on H100-80GB, at 3.8E25 FLOPs. |
| Grok-2 | August 2024 | xAI | | | | | Originally closed-source, then re-released as "Grok 2.5" under a source-available license in August 2025. |
| OpenAI o1 | September 2024 | OpenAI | | | | Proprietary | Reasoning model. |
| YandexGPT 4 Lite and Pro | | Yandex | | | | Proprietary | Used in the Alice chatbot. |
| Mistral Large | November 2024 | Mistral AI | 123 | | | Mistral Research License | Upgraded over time. The latest version is 24.11. |
| Pixtral | November 2024 | Mistral AI | 123 | | | Mistral Research License | Multimodal. There is also a 12B version, which is under the Apache 2.0 license. |
| Phi-4 | December 2024 | Microsoft | 14 | 9.8T tokens | | MIT | Microsoft markets them as "small language models". |
| DeepSeek-V3 | December 2024 | DeepSeek | 671 | 14.8T tokens | | MIT | 2.788M hours on H800 GPUs. Originally released under the DeepSeek License, then re-released under the MIT License as "DeepSeek-V3-0324" in March 2025. |
| Amazon Nova | December 2024 | Amazon | | | | Proprietary | Includes three models: Nova Micro, Nova Lite, and Nova Pro. |
| DeepSeek-R1 | January 2025 | DeepSeek | 671 | | | MIT | No additional pretraining; reinforcement-learned on top of V3-Base. |
| Qwen2.5 | January 2025 | Alibaba | 72 | 18T tokens | | | Seven dense models, with parameter counts from 0.5B to 72B, plus two MoE variants. |
| MiniMax-Text-01 | January 2025 | Minimax | 456 | 4.7T tokens | | | |
| Gemini 2.0 | February 2025 | Google DeepMind | | | | Proprietary | Three models released: Flash, Flash-Lite, and Pro. |
| Claude 3.7 | February 2025 | Anthropic | | | | Proprietary | One model, Sonnet 3.7. |
| YandexGPT 5 Lite Pretrain and Pro | | Yandex | | | | | Used in the Alice neural network chatbot. |
| GPT-4.5 | February 2025 | OpenAI | | | | Proprietary | OpenAI's largest non-reasoning model. |
| Grok 3 | February 2025 | xAI | | | | Proprietary | Training compute claimed to be "10x the compute of previous state-of-the-art models". |
| Gemini 2.5 | March 2025 | Google DeepMind | | | | Proprietary | Three models released: Flash, Flash-Lite, and Pro. |
| YandexGPT 5 Lite Instruct | | Yandex | | | | | Used in the Alice neural network chatbot. |
| Llama 4 | April 2025 | Meta AI | 400 | | | Llama 4 license | |
| OpenAI o3 and o4-mini | April 2025 | OpenAI | | | | Proprietary | Reasoning models. |
| Qwen3 | April 2025 | Alibaba Cloud | 235 | 36T tokens | | Apache 2.0 | Multiple sizes, the smallest being 0.6B. |
| Claude 4 | May 2025 | Anthropic | | | | Proprietary | Includes two models: Sonnet and Opus. |
| Grok 4 | July 2025 | xAI | | | | Proprietary | |
| GLM-4.5 | July 2025 | Zhipu AI | 355 | 22T tokens | | MIT | Released in 355B and 106B sizes. Corpus size was calculated by combining the 15 trillion tokens and 7 trillion tokens of the pre-training mix. |
| GPT-OSS | August 2025 | OpenAI | 117 | | | Apache 2.0 | Released in 20B and 120B sizes. |
| Claude 4.1 | August 2025 | Anthropic | | | | Proprietary | Includes one model, Opus. |
| GPT-5 | August 2025 | OpenAI | | | | Proprietary | Includes three models: GPT-5, GPT-5 mini, and GPT-5 nano. GPT-5 is available in ChatGPT and the API. It includes thinking abilities. |
| DeepSeek-V3.1 | August 2025 | DeepSeek | 671 | 15.639T tokens | | MIT | Training data: the 14.8T tokens of DeepSeek-V3 plus 839B tokens from the context-extension phases. A hybrid model that can switch between thinking and non-thinking modes. |
| YandexGPT 5.1 Pro | | Yandex | | | | | Used in the Alice neural network chatbot. |
| Apertus | September 2025 | ETH Zurich and EPF Lausanne | 70 | 15T tokens | | Apache 2.0 | Described as the first LLM compliant with the EU's Artificial Intelligence Act. |
| Claude Sonnet 4.5 | September 2025 | Anthropic | | | | Proprietary | |
| DeepSeek-V3.2-Exp | September 2025 | DeepSeek | 685 | | | MIT | Experimental model built upon V3.1-Terminus; uses a custom efficient attention mechanism called DeepSeek Sparse Attention. |
| GLM-4.6 | October 2025 | Zhipu AI | 357 | | | MIT | |
| Alice AI LLM 1.0 | | Yandex | | | | | Available in the Alice AI chatbot. |
| Gemini 3 | November 2025 | Google DeepMind | | | | Proprietary | Two models released: Deep Think and Pro. |
| Claude Opus 4.5 | November 2025 | Anthropic | | | | Proprietary | The largest model in the Claude family. |
| GPT-5.2 | December 11, 2025 | OpenAI | | | | Proprietary | Reported to have solved an open problem in statistical learning theory that human researchers had not previously resolved. |
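For rows where both parameter count and token count are known, several training-cost figures in the table are consistent with the approximation C ≈ 6·N·D (total FLOP ≈ 6 × parameters × training tokens) that is widely used in the scaling-law literature. This is a rule of thumb, not necessarily how every entry was derived; a minimal sketch:

```python
# Approximate training compute with C = 6 * N * D (N = parameters,
# D = training tokens), expressed in the table's petaFLOP-day unit.
PETAFLOP_DAY_FLOP = 8.64e19  # as defined above the table

def approx_cost_petaflop_days(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training cost in petaFLOP-days under C = 6*N*D."""
    return 6 * n_params * n_tokens / PETAFLOP_DAY_FLOP

# These reproduce the table's entries for two DeepMind models:
print(int(approx_cost_petaflop_days(280e9, 300e9)))  # Gopher     -> 5833
print(int(approx_cost_petaflop_days(70e9, 1.4e12)))  # Chinchilla -> 6805
```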