List of large language models
A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters that are trained with self-supervised learning on vast amounts of text.
List
For the training cost column, 1 petaFLOP-day = 1 petaFLOP/sec × 1 day = 8.64E19 FLOP. Where a family includes several sizes, only the largest model's cost is listed.
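As a sanity check on this unit, the minimal Python sketch below converts a reported raw FLOP count into petaFLOP-days. The 3.8E25 FLOP input is the Llama 3.1 405B figure quoted in the notes column further down; the helper name is purely illustrative, not from any library.

```python
# 1 petaFLOP-day: 1e15 FLOP/sec sustained for 86,400 seconds.
PETAFLOP_DAY_FLOP = 1e15 * 86_400  # = 8.64e19 FLOP

def to_petaflop_days(total_flop: float) -> float:
    """Convert a total training compute budget (in FLOP) to petaFLOP-days."""
    return total_flop / PETAFLOP_DAY_FLOP

# Example: Llama 3.1 405B reportedly used 3.8e25 FLOP.
print(f"{to_petaflop_days(3.8e25):,.0f}")  # -> 439,815, i.e. ~440,000 petaFLOP-days
```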
| Name | Release date | Developer | Number of parameters (billions) | Corpus size | Training cost (petaFLOP-day) | License | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-1 | June 2018 | OpenAI | 0.117 | | 1 | MIT | First GPT model, decoder-only transformer. Trained for 30 days on 8 P600 GPUs. |
| BERT | October 2018 | Google | 0.340 | 3.3 billion words | | Apache 2.0 | An early and influential language model. Encoder-only and thus not built to be prompted or generative. Training took 4 days on 64 TPUv2 chips. |
| T5 | October 2019 | Google | 11 | 34 billion tokens | | Apache 2.0 | Base model for many Google projects, such as Imagen. |
| XLNet | June 2019 | Google | 0.340 | 33 billion words | 330 | Apache 2.0 | An alternative to BERT; designed as encoder-only. Trained on 512 TPU v3 chips for 5.5 days. |
| GPT-2 | February 2019 | OpenAI | 1.5 | 40 GB (~10 billion tokens) | 28 | MIT | Trained on 32 TPUv3 chips for 1 week. |
| GPT-3 | May 2020 | OpenAI | 175 | 300 billion tokens | 3640 | Proprietary | A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022. |
| GPT-Neo | March 2021 | EleutherAI | 2.7 | 825 GiB | | MIT | The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3. |
| GPT-J | June 2021 | EleutherAI | 6 | 825 GiB | 200 | Apache 2.0 | GPT-3-style language model. |
| Megatron-Turing NLG | October 2021 | Microsoft and Nvidia | 530 | 338.6 billion tokens | 38000 | | Trained for 3 months on over 2000 A100 GPUs on the NVIDIA Selene supercomputer, for over 3 million GPU-hours. |
| Ernie 3.0 Titan | December 2021 | Baidu | 260 | 4 TB | | Proprietary | Chinese-language LLM. Ernie Bot is based on this model. |
| Claude | December 2022 | Anthropic | 52 | 400 billion tokens | | Proprietary | Fine-tuned for desirable behavior in conversations. |
| GLaM | December 2021 | Google | 1200 | 1.6 trillion tokens | 5600 | Proprietary | Sparse mixture-of-experts model, making it more expensive to train but cheaper to run inference compared to GPT-3. |
| Gopher | December 2021 | DeepMind | 280 | 300 billion tokens | 5833 | Proprietary | Later developed into the Chinchilla model. |
| LaMDA | January 2022 | Google | 137 | 1.56T words, 168 billion tokens | 4110 | Proprietary | Specialized for response generation in conversations. |
| GPT-NeoX | February 2022 | EleutherAI | 20 | 825 GiB | 740 | Apache 2.0 | Based on the Megatron architecture. |
| Chinchilla | March 2022 | DeepMind | 70 | 1.4 trillion tokens | 6805 | Proprietary | Reduced-parameter model trained on more data. Used in the Sparrow bot. Often cited for its neural scaling law; see the worked example below the table. |
| PaLM | April 2022 | Google | 540 | 768 billion tokens | | Proprietary | Trained for ~60 days on ~6000 TPU v4 chips. |
| OPT | May 2022 | Meta | 175 | 180 billion tokens | 310 | Non-commercial research | GPT-3 architecture with some adaptations from Megatron. Uniquely, the training logbook written by the team was published. |
| YaLM 100B | June 2022 | Yandex | 100 | 1.7 TB | | Apache 2.0 | English-Russian model based on Microsoft's Megatron-LM. |
| Minerva | June 2022 | Google | 540 | 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server | | Proprietary | For solving "mathematical and scientific questions using step-by-step reasoning". Initialized from PaLM models, then finetuned on mathematical and scientific data. |
| BLOOM | July 2022 | Large collaboration led by Hugging Face | 176 | 350 billion tokens (1.6 TB) | | Responsible AI license | Essentially GPT-3 but trained on a multilingual corpus. |
| Galactica | November 2022 | Meta | 120 | 106 billion tokens | | CC BY-NC 4.0 | Trained on scientific text and modalities. |
| AlexaTM | November 2022 | Amazon | 20 | 1.3 trillion tokens | | Proprietary | Bidirectional sequence-to-sequence architecture. |
| Llama | February 2023 | Meta AI | 65 | 1.4 trillion tokens | 6300 | Non-commercial research | Corpus has 20 languages. "Overtrained" for better performance with fewer parameters. |
| GPT-4 | March 2023 | OpenAI | | | estimated 230,000 | Proprietary | Available to ChatGPT users and used in several products. |
| Cerebras-GPT | March 2023 | Cerebras | 13 | | 270 | Apache 2.0 | Trained with the Chinchilla formula. |
| Falcon | March 2023 | Technology Innovation Institute | 40 | 1 trillion tokens, from RefinedWeb plus some "curated corpora" | 2800 | Apache 2.0 | |
| BloombergGPT | March 2023 | Bloomberg L.P. | 50 | 363 billion token dataset based on Bloomberg's data sources, plus 345 billion tokens from general purpose datasets | | Proprietary | Trained on financial data from proprietary sources, for financial tasks. |
| PanGu-Σ | March 2023 | Huawei | 1085 | 329 billion tokens | | Proprietary | |
| OpenAssistant | March 2023 | LAION | 17 | 1.5 trillion tokens | | Apache 2.0 | Trained on crowdsourced open data. |
| Jurassic-2 | March 2023 | AI21 Labs | | | | Proprietary | Multilingual. |
| PaLM 2 | May 2023 | Google | 340 | 3.6 trillion tokens | | Proprietary | Used in the Bard chatbot. |
| YandexGPT | May 2023 | Yandex | | | | Proprietary | Used in the Alice chatbot. |
| Llama 2 | July 2023 | Meta AI | 70 | 2 trillion tokens | | Llama 2 license | 1.7 million A100-hours. |
| Claude 2 | July 2023 | Anthropic | | | | Proprietary | Used in the Claude chatbot. |
| Granite 13b | | IBM | 13 | | | | Used in IBM Watsonx. |
| Mistral 7B | September 2023 | Mistral AI | 7.3 | | | Apache 2.0 | |
| YandexGPT 2 | | Yandex | | | | Proprietary | Used in the Alice chatbot. |
| Claude 2.1 | November 2023 | Anthropic | | | | Proprietary | Used in the Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages. |
| Grok 1 | November 2023 | xAI | 314 | | | Apache 2.0 | Used in the Grok chatbot. Grok 1 has a context length of 8,192 tokens and has access to X. |
| Gemini 1.0 | December 2023 | Google DeepMind | | | | Proprietary | Multimodal model, comes in three sizes. Used in the chatbot of the same name. |
| Mixtral 8x7B | December 2023 | Mistral AI | 46.7 | | | Apache 2.0 | Outperforms GPT-3.5 and Llama 2 70B on many benchmarks. Mixture-of-experts model, with 12.9 billion parameters activated per token. |
| DeepSeek-LLM | November 2023 | DeepSeek | 67 | 2T tokens | | DeepSeek License | Trained on English and Chinese text. 1e24 FLOPs for the 67B model, 1e23 FLOPs for the 7B. |
| Phi-2 | December 2023 | Microsoft | 2.7 | 1.4T tokens | 419 | MIT | Trained on real and synthetic "textbook-quality" data, for 14 days on 96 A100 GPUs. |
| Gemini 1.5 | February 2024 | Google DeepMind | | | | Proprietary | Multimodal model, based on a mixture-of-experts architecture. Context window above 1 million tokens. |
| Gemini Ultra | February 2024 | Google DeepMind | | | | Proprietary | |
| Gemma | February 2024 | Google DeepMind | 7 | 6T tokens | | Gemma Terms of Use | |
| Claude 3 | March 2024 | Anthropic | | | | Proprietary | Includes three models: Haiku, Sonnet, and Opus. |
| DBRX | March 2024 | Databricks and Mosaic ML | 136 | 12T tokens | | Databricks Open Model License | Training cost 10 million USD. |
| YandexGPT 3 Pro | | Yandex | | | | Proprietary | Used in the Alice chatbot. |
| Fugaku-LLM | May 2024 | Fujitsu, Tokyo Institute of Technology, etc. | 13 | 380B tokens | | | The largest model ever trained using only CPUs, on the Fugaku supercomputer. |
| Chameleon | May 2024 | Meta AI | 34 | 4.4 trillion tokens | | | |
| Mixtral 8x22B | April 2024 | Mistral AI | 141 | | | Apache 2.0 | |
| Phi-3 | April 2024 | Microsoft | 14 | 4.8T tokens | | MIT | Microsoft markets them as "small language models". |
| Granite Code Models | May 2024 | IBM | | | | Apache 2.0 | |
| YandexGPT 3 Lite | | Yandex | | | | Proprietary | Used in the Alice chatbot. |
| Qwen2 | June 2024 | Alibaba Cloud | 72 | 3T tokens | | | Multiple sizes, the smallest being 0.5B. |
| DeepSeek-V2 | May 2024 | DeepSeek | 236 | 8.1T tokens | | DeepSeek License | 1.4M hours on H800 GPUs. |
| Nemotron-4 | June 2024 | Nvidia | 340 | 9T tokens | | NVIDIA Open Model License | Trained for 1 epoch on 6144 H100 GPUs between December 2023 and May 2024. |
| Claude 3.5 | June 2024 | Anthropic | | | | Proprietary | Initially, only one model, Sonnet, was released. In October 2024, Sonnet 3.5 was upgraded, and Haiku 3.5 became available. |
| Llama 3.1 | July 2024 | Meta AI | 405 | 15.6T tokens | 440,000 | Llama 3 license | The 405B version took 31 million GPU-hours on H100-80GB, at 3.8E25 FLOPs. |
| Grok-2 | August 2024 | xAI | | | | | Originally closed-source, then re-released as "Grok 2.5" under a source-available license in August 2025. |
| OpenAI o1 | September 2024 | OpenAI | | | | Proprietary | Reasoning model. |
| YandexGPT 4 Lite and Pro | | Yandex | | | | Proprietary | Used in the Alice chatbot. |
| Mistral Large | November 2024 | Mistral AI | 123 | | | Mistral Research License | Upgraded over time. The latest version is 24.11. |
| Pixtral | November 2024 | Mistral AI | 123 | | | Mistral Research License | Multimodal. There is also a 12B version, which is under the Apache 2.0 license. |
| Phi-4 | December 2024 | Microsoft | 14 | 9.8T tokens | | MIT | Microsoft markets them as "small language models". |
| DeepSeek-V3 | December 2024 | DeepSeek | 671 | 14.8T tokens | | MIT | 2.788M hours on H800 GPUs. Originally released under the DeepSeek License, then re-released under the MIT License as "DeepSeek-V3-0324" in March 2025. |
| Amazon Nova | December 2024 | Amazon | | | | Proprietary | Includes three models: Nova Micro, Nova Lite, and Nova Pro. |
| DeepSeek-R1 | January 2025 | DeepSeek | 671 | | | MIT | No additional pretraining; reinforcement-learned on top of V3-Base. |
| Qwen2.5 | January 2025 | Alibaba | 72 | 18T tokens | | | Seven dense models, with parameter counts from 0.5B to 72B, plus two MoE variants. |
| MiniMax-Text-01 | January 2025 | Minimax | 456 | 4.7T tokens | | | |
| Gemini 2.0 | February 2025 | Google DeepMind | | | | Proprietary | Three models released: Flash, Flash-Lite, and Pro. |
| Claude 3.7 | February 2025 | Anthropic | | | | Proprietary | One model, Sonnet 3.7. |
| YandexGPT 5 Lite Pretrain and Pro | | Yandex | | | | | Used in the Alice neural network chatbot. |
| GPT-4.5 | February 2025 | OpenAI | | | | Proprietary | OpenAI's largest non-reasoning model. |
| Grok 3 | February 2025 | xAI | | | | Proprietary | Training compute claimed to be "10x the compute of previous state-of-the-art models". |
| Gemini 2.5 | March 2025 | Google DeepMind | | | | Proprietary | Three models released: Flash, Flash-Lite, and Pro. |
| YandexGPT 5 Lite Instruct | | Yandex | | | | | Used in the Alice neural network chatbot. |
| Llama 4 | April 2025 | Meta AI | 400 | | | Llama 4 license | |
| OpenAI o3 and o4-mini | April 2025 | OpenAI | | | | Proprietary | Reasoning models. |
| Qwen3 | April 2025 | Alibaba Cloud | 235 | 36T tokens | | Apache 2.0 | Multiple sizes, the smallest being 0.6B. |
| Claude 4 | May 2025 | Anthropic | | | | Proprietary | Includes two models: Sonnet and Opus. |
| Grok 4 | July 2025 | xAI | | | | Proprietary | |
| GLM-4.5 | July 2025 | Zhipu AI | 355 | 22T tokens | | MIT | Released in 355B and 106B sizes. Corpus size was calculated by combining the 15 trillion tokens and 7 trillion tokens of the pre-training mix. |
| GPT-OSS | August 2025 | OpenAI | 117 | | | Apache 2.0 | Released in 20B and 120B sizes. |
| Claude 4.1 | August 2025 | Anthropic | | | | Proprietary | Includes one model, Opus. |
| GPT-5 | August 2025 | OpenAI | | | | Proprietary | Includes three models: GPT-5, GPT-5 mini, and GPT-5 nano. GPT-5 is available in ChatGPT and the API. It includes thinking abilities. |
| DeepSeek-V3.1 | August 2025 | DeepSeek | 671 | 15.639T tokens | | MIT | Training data: the 14.8T tokens of DeepSeek-V3 plus 839B tokens from the context-extension phases. A hybrid model that can switch between thinking and non-thinking modes. |
| YandexGPT 5.1 Pro | | Yandex | | | | | Used in the Alice neural network chatbot. |
| Apertus | September 2025 | ETH Zurich and EPF Lausanne | 70 | 15T tokens | | Apache 2.0 | Described as the first LLM compliant with the EU's Artificial Intelligence Act. |
| Claude Sonnet 4.5 | September 2025 | Anthropic | | | | Proprietary | |
| DeepSeek-V3.2-Exp | September 2025 | DeepSeek | 685 | | | MIT | Experimental model built upon V3.1-Terminus; uses a custom efficient attention mechanism called DeepSeek Sparse Attention. |
| GLM-4.6 | October 2025 | Zhipu AI | 357 | | | MIT | |
| Alice AI LLM 1.0 | | Yandex | | | | | Available in the Alice AI chatbot. |
| Gemini 3 | November 2025 | Google DeepMind | | | | Proprietary | Two models released: Deep Think and Pro. |
| Claude Opus 4.5 | November 2025 | Anthropic | | | | Proprietary | The largest model in the Claude family. |
| GPT-5.2 | December 11, 2025 | OpenAI | | | | Proprietary | Reported to have solved an open problem in statistical learning theory that human researchers had not previously resolved. |
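For rows where both parameter count and token count are known, several training-cost figures in the table are consistent with the approximation C ≈ 6·N·D (total FLOP ≈ 6 × parameters × training tokens) that is widely used in the scaling-law literature. This is a rule of thumb, not necessarily how every entry was derived; a minimal sketch:

```python
# Approximate training compute with C = 6 * N * D (N = parameters,
# D = training tokens), expressed in the table's petaFLOP-day unit.
PETAFLOP_DAY_FLOP = 8.64e19  # as defined above the table

def approx_cost_petaflop_days(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training cost in petaFLOP-days under C = 6*N*D."""
    return 6 * n_params * n_tokens / PETAFLOP_DAY_FLOP

# These reproduce the table's entries for two DeepMind models:
print(int(approx_cost_petaflop_days(280e9, 300e9)))  # Gopher     -> 5833
print(int(approx_cost_petaflop_days(70e9, 1.4e12)))  # Chinchilla -> 6805
```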