List of large language models


A large language model is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, trained with self-supervised learning on vast amounts of text.

List

For the training cost column, 1 petaFLOP-day = 1 petaFLOP/sec × 1 day = 8.64E19 FLOP. For models released in multiple sizes, only the largest model's cost is listed.
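The unit conversion can be sanity-checked with a few lines of Python; the 3640 petaFLOP-day figure used in the example is GPT-3's listed training cost:

```python
# Sanity-check the petaFLOP-day conversion used in the training cost column.
PETAFLOP = 1e15            # 1 petaFLOP/sec = 1e15 floating-point operations per second
SECONDS_PER_DAY = 86_400   # 24 h * 60 min * 60 s

petaflop_day = PETAFLOP * SECONDS_PER_DAY
print(f"1 petaFLOP-day = {petaflop_day:.3e} FLOP")  # 8.640e+19

# Example: GPT-3's listed training cost of 3640 petaFLOP-days in raw FLOP.
gpt3_flop = 3640 * petaflop_day
print(f"GPT-3 training compute = {gpt3_flop:.2e} FLOP")  # 3.14e+23
```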

| Name | Release date | Developer | Number of parameters (billions) | Corpus size | Training cost (petaFLOP-day) | License | Notes |
|---|---|---|---|---|---|---|---|
| GPT-1 | | OpenAI | | | 1 | | First GPT model, decoder-only transformer. Trained for 30 days on 8 P600 GPUs. |
| BERT | | Google | | | | | An early and influential language model. Encoder-only, and thus not built to be prompted or generative. Training took 4 days on 64 TPUv2 chips. |
| T5 | | Google | | 34 billion tokens | | | Base model for many Google projects, such as Imagen. |
| XLNet | | Google | | | 330 | | An alternative to BERT; designed as encoder-only. Trained on 512 TPU v3 chips for 5.5 days. |
| GPT-2 | | OpenAI | | 40 GB | 28 | | Trained on 32 TPUv3 chips for 1 week. |
| GPT-3 | | OpenAI | | | 3640 | | A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022. |
| GPT-Neo | | EleutherAI | | 825 GiB | | | The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3. |
| GPT-J | | EleutherAI | | 825 GiB | 200 | | GPT-3-style language model. |
| Megatron-Turing NLG | | Microsoft and Nvidia | | | 38000 | | Trained for 3 months on over 2000 A100 GPUs on the NVIDIA Selene supercomputer, for over 3 million GPU-hours. |
| Ernie 3.0 Titan | | Baidu | | 4 TB | | | Chinese-language LLM. Ernie Bot is based on this model. |
| Claude | | Anthropic | | | | | Fine-tuned for desirable behavior in conversations. |
| GLaM | | Google | | | 5600 | | Sparse mixture-of-experts model, making it more expensive to train but cheaper to run inference compared to GPT-3. |
| Gopher | | DeepMind | | | 5833 | | Later developed into the Chinchilla model. |
| LaMDA | | Google | | 1.56T words | 4110 | | Specialized for response generation in conversations. |
| GPT-NeoX | | EleutherAI | | 825 GiB | 740 | | Based on the Megatron architecture. |
| Chinchilla | | DeepMind | | | 6805 | | Reduced-parameter model trained on more data. Used in the Sparrow bot. Often cited for its neural scaling law. |
| PaLM | | Google | | | | | Trained for ~60 days on ~6000 TPU v4 chips. |
| OPT | | Meta | | | 310 | | GPT-3 architecture with some adaptations from Megatron. Uniquely, the training logbook written by the team was published. |
| YaLM 100B | | Yandex | | 1.7 TB | | | English-Russian model based on Microsoft's Megatron-LM. |
| Minerva | | Google | | 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server | | | For solving "mathematical and scientific questions using step-by-step reasoning". Initialized from PaLM models, then fine-tuned on mathematical and scientific data. |
| BLOOM | | Large collaboration led by Hugging Face | | | | | Essentially GPT-3 but trained on a multilingual corpus. |
| Galactica | | Meta | | | | | Trained on scientific text and modalities. |
| AlexaTM | | Amazon | | | | | Bidirectional sequence-to-sequence architecture. |
| Llama | | Meta AI | | | 6300 | | Corpus has 20 languages. "Overtrained" for better performance with fewer parameters. |
| GPT-4 | | OpenAI | | | estimated 230,000 | | Available to all ChatGPT users and used in several products. |
| Cerebras-GPT | | Cerebras | | | 270 | | Trained with the Chinchilla formula. |
| Falcon | | Technology Innovation Institute | | 1 trillion tokens, from RefinedWeb plus some "curated corpora" | 2800 | | |
| BloombergGPT | | Bloomberg L.P. | | 363 billion token dataset based on Bloomberg's data sources, plus 345 billion tokens from general-purpose datasets | | | Trained on financial data from proprietary sources, for financial tasks. |
| PanGu-Σ | | Huawei | | 329 billion tokens | | | |
| OpenAssistant | | LAION | | 1.5 trillion tokens | | | Trained on crowdsourced open data. |
| Jurassic-2 | | AI21 Labs | | | | | Multilingual. |
| PaLM 2 | | Google | | | | | Was used in the Bard chatbot. |
| YandexGPT | | Yandex | | | | | Used in the Alice chatbot. |
| Llama 2 | | Meta AI | | | | | 1.7 million A100-hours. |
| Claude 2 | | Anthropic | | | | | Used in the Claude chatbot. |
| Granite 13b | | IBM | | | | | Used in IBM Watsonx. |
| Mistral 7B | | Mistral AI | | | | | |
| YandexGPT 2 | | Yandex | | | | | Used in the Alice chatbot. |
| Claude 2.1 | | Anthropic | | | | | Used in the Claude chatbot. Has a context window of 200,000 tokens, or about 500 pages. |
| Grok 1 | | xAI | 314 | | | | Used in the Grok chatbot. Grok 1 has a context length of 8,192 tokens and has access to X. |
| Gemini 1.0 | | Google DeepMind | | | | | Multimodal model, comes in three sizes. Used in the chatbot of the same name. |
| Mixtral 8x7B | | Mistral AI | 46.7 | | | | Outperforms GPT-3.5 and Llama 2 70B on many benchmarks. Mixture-of-experts model, with 12.9 billion parameters activated per token. |
| DeepSeek-LLM | | DeepSeek | 67 | 2T tokens | | | Trained on English and Chinese text. 1e24 FLOPs for 67B; 1e23 FLOPs for 7B. |
| Phi-2 | | Microsoft | 2.7 | 1.4T tokens | 419 | | Trained on real and synthetic "textbook-quality" data, for 14 days on 96 A100 GPUs. |
| Gemini 1.5 | | Google DeepMind | | | | | Multimodal model, based on a mixture-of-experts architecture. Context window above 1 million tokens. |
| Gemini Ultra | | Google DeepMind | | | | | |
| Gemma | | Google DeepMind | 7 | 6T tokens | | | |
| Claude 3 | | Anthropic | | | | | Includes three models: Haiku, Sonnet, and Opus. |
| DBRX | | Databricks and Mosaic ML | | 12T tokens | | | Training cost 10 million USD. |
| YandexGPT 3 Pro | | Yandex | | | | | Used in the Alice chatbot. |
| Fugaku-LLM | | Fujitsu, Tokyo Institute of Technology, etc. | | 380B tokens | | | The largest model ever trained on CPUs only, on the Fugaku supercomputer. |
| Chameleon | | Meta AI | | | | | |
| Mixtral 8x22B | | Mistral AI | 141 | | | | |
| Phi-3 | | Microsoft | 14 | 4.8T tokens | | | Microsoft markets them as "small language models". |
| Granite Code Models | | IBM | | | | | |
| YandexGPT 3 Lite | | Yandex | | | | | Used in the Alice chatbot. |
| Qwen2 | | Alibaba Cloud | 72 | 3T tokens | | | Multiple sizes, the smallest being 0.5B. |
| DeepSeek-V2 | | DeepSeek | 236 | 8.1T tokens | | | 1.4M hours on H800 GPUs. |
| Nemotron-4 | | Nvidia | | 9T tokens | | | Trained for 1 epoch on 6144 H100 GPUs between December 2023 and May 2024. |
| Claude 3.5 | | Anthropic | | | | | Initially, only one model, Sonnet, was released. In October 2024, Sonnet 3.5 was upgraded and Haiku 3.5 became available. |
| Llama 3.1 | | Meta AI | 405 | 15.6T tokens | | | The 405B version took 31 million hours on H100-80GB GPUs, at 3.8E25 FLOPs. |
| Grok-2 | | xAI | | | | | Originally closed-source, then re-released as "Grok 2.5" under a source-available license in August 2025. |
| OpenAI o1 | | OpenAI | | | | | Reasoning model. |
| YandexGPT 4 Lite and Pro | | Yandex | | | | | Used in the Alice chatbot. |
| Mistral Large | | Mistral AI | 123 | | | | Upgraded over time; the latest version is 24.11. |
| Pixtral | | Mistral AI | 123 | | | | Multimodal. There is also a 12B version, which is under the Apache 2 license. |
| Phi-4 | | Microsoft | 14 | | | | Microsoft markets it as a "small language model". |
| DeepSeek-V3 | | DeepSeek | 671 | 14.8T tokens | | | 2.788M hours on H800 GPUs. Originally released under the DeepSeek License, then re-released under the MIT License as "DeepSeek-V3-0324" in March 2025. |
| Amazon Nova | | Amazon | | | | | Includes three models: Nova Micro, Nova Lite, and Nova Pro. |
| DeepSeek-R1 | | DeepSeek | 671 | | | | No pretraining; reinforcement-learned on top of V3-Base. |
| Qwen2.5 | | Alibaba | 72 | 18T tokens | | | 7 dense models, with parameter counts from 0.5B to 72B. They also released 2 MoE variants. |
| MiniMax-Text-01 | | Minimax | 456 | 4.7T tokens | | | |
| Gemini 2.0 | | Google DeepMind | | | | | Three models released: Flash, Flash-Lite, and Pro. |
| Claude 3.7 | | Anthropic | | | | | One model, Sonnet 3.7. |
| YandexGPT 5 Lite Pretrain and Pro | | Yandex | | | | | Used in the Alice Neural Network chatbot. |
| GPT-4.5 | | OpenAI | | | | | Largest non-reasoning model. |
| Grok 3 | | xAI | | | | | Training cost claimed to be "10x the compute of previous state-of-the-art models". |
| Gemini 2.5 | | Google DeepMind | | | | | Three models released: Flash, Flash-Lite, and Pro. |
| YandexGPT 5 Lite Instruct | | Yandex | | | | | Used in the Alice Neural Network chatbot. |
| Llama 4 | | Meta AI | | | | | |
| OpenAI o3 and o4-mini | | OpenAI | | | | | Reasoning models. |
| Qwen3 | | Alibaba Cloud | 235 | | | | Multiple sizes, the smallest being 0.6B. |
| Claude 4 | | Anthropic | | | | | Includes two models, Sonnet and Opus. |
| Grok 4 | | xAI | | | | | |
| GLM-4.5 | | Zhipu AI | 355 | 22T tokens | | | Released in 355B and 106B sizes. Corpus size was calculated by combining the 15-trillion-token and 7-trillion-token pre-training mixes. |
| GPT-OSS | | OpenAI | 117 | | | | Released in 20B and 120B sizes. |
| Claude 4.1 | | Anthropic | | | | | Includes one model, Opus. |
| GPT-5 | | OpenAI | | | | | Includes three models: GPT-5, GPT-5 mini, and GPT-5 nano. GPT-5 is available in ChatGPT and the API. It includes thinking abilities. |
| DeepSeek-V3.1 | | DeepSeek | 671 | 15.639T tokens | | | Training size: the 14.8T tokens of DeepSeek-V3 plus 839B tokens from the extension phases. A hybrid model that can switch between thinking and non-thinking modes. |
| YandexGPT 5.1 Pro | | Yandex | | | | | Used in the Alice Neural Network chatbot. |
| Apertus | | ETH Zurich and EPF Lausanne | 70 | | | | Said to be the first LLM compliant with the EU's Artificial Intelligence Act. |
| Claude Sonnet 4.5 | | Anthropic | | | | | |
| DeepSeek-V3.2-Exp | | DeepSeek | 685 | | | | Experimental model built upon V3.1-Terminus; uses a custom efficient attention mechanism called DeepSeek Sparse Attention. |
| GLM-4.6 | | Zhipu AI | 357 | | | | |
| Alice AI LLM 1.0 | | Yandex | | | | | Available in the Alice AI chatbot. |
| Gemini 3 | | Google DeepMind | | | | | Two models released: Deep Think and Pro. |
| Claude Opus 4.5 | | Anthropic | | | | | The largest model in the Claude family. |
| GPT-5.2 | December 11, 2025 | OpenAI | | | | | It was able to solve an open problem in statistical learning theory that had previously remained unresolved by human researchers. |
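Several entries above refer to compute-optimal "Chinchilla" scaling (the Chinchilla row's neural scaling law, Cerebras-GPT's "Chinchilla formula"). The commonly cited rules of thumb derived from that work — roughly 20 training tokens per parameter, with training compute approximated as 6 × parameters × tokens FLOPs — can be sketched as follows. These are simplified approximations, not the paper's exact fitted law:

```python
def chinchilla_sketch(params: float) -> tuple[float, float]:
    """Rough compute-optimal estimate: ~20 training tokens per parameter,
    training compute approximated as 6 * N * D FLOPs."""
    tokens = 20 * params                 # D ~= 20 * N
    flops = 6 * params * tokens          # C ~= 6 * N * D
    petaflop_days = flops / 8.64e19      # conversion used in the training cost column
    return tokens, petaflop_days

# Example: a 70-billion-parameter model (roughly Chinchilla's size).
tokens, cost = chinchilla_sketch(70e9)
print(f"tokens = {tokens:.2e}, cost = {cost:.0f} petaFLOP-days")  # 1.40e+12, 6806
```

For a 70B model this heuristic gives about 1.4 trillion tokens and roughly 6806 petaFLOP-days, which lines up closely with the 6805 petaFLOP-days listed for Chinchilla (a 70B model trained on 1.4T tokens) above.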