GPT-3


Generative Pre-trained Transformer 3 is a large language model released by OpenAI in 2020.
Like its predecessor, GPT-2, it is a decoder-only transformer neural network that replaces recurrence- and convolution-based architectures with a technique known as "attention". This attention mechanism allows the model to focus selectively on the segments of input text it predicts to be most relevant. GPT-3 has 175 billion parameters, each stored with 16-bit precision, so the weights alone require 350 GB of storage (2 bytes per parameter). It has a context window of 2048 tokens, and has demonstrated strong "zero-shot" and "few-shot" learning abilities on many tasks.
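The attention mechanism at the core of the transformer can be illustrated with a minimal NumPy sketch of scaled dot-product attention over a single sequence. This is an illustrative toy, not OpenAI's implementation: GPT-3 runs many such attention heads in parallel in each of its layers rather than a single one.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=True):
    """Q, K, V: arrays of shape (seq_len, d_k) for one sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # relevance of every position to every other
    if causal:
        # A decoder-only model masks future positions: each token attends only to the past.
        mask = np.tril(np.ones_like(scores, dtype=bool))
        scores = np.where(mask, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over input positions
    return weights @ V                               # weighted mixture of the value vectors

# Toy usage: 4 tokens with 8-dimensional projections.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
print(scaled_dot_product_attention(x, x, x).shape)   # (4, 8)
```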
On September 22, 2020, Microsoft announced that it had licensed GPT-3 exclusively. Others can still receive output from its public API, but only Microsoft has access to the underlying model.

Background

According to The Economist, improved algorithms, more powerful computers, and a recent increase in the amount of digitized material have fueled a revolution in machine learning. New techniques in the 2010s resulted in "rapid improvements in tasks", including manipulating language.
Software models are trained to learn by using thousands or millions of examples in a "structure... loosely based on the neural architecture of the brain". One architecture used in natural language processing is the transformer, a deep learning model introduced in 2017. There are a number of NLP systems capable of processing, mining, organizing, connecting and contrasting textual input, as well as correctly answering questions.
On June 11, 2018, OpenAI researchers and engineers published a paper introducing the first generative pre-trained transformer (GPT), a type of generative large language model that is pre-trained on an enormous and diverse corpus of text and then fine-tuned discriminatively for a specific task. GPT models are transformer-based deep-learning neural network architectures. Previously, the best-performing neural NLP models commonly employed supervised learning on large amounts of manually labeled data, which made it prohibitively expensive and time-consuming to train extremely large language models. The first GPT model was known as "GPT-1", and it was followed by "GPT-2" in February 2019. Created as a direct scale-up of its predecessor, GPT-2 had both its parameter count and dataset size increased by a factor of 10. It had 1.5 billion parameters, and was trained on a dataset of 8 million web pages.
In February 2020, Microsoft introduced its Turing Natural Language Generation, which it claimed was the "largest language model ever published at 17 billion parameters". It performed better than any other language model at a variety of tasks, including summarizing texts and answering questions.

Training and capabilities

On May 28, 2020, an arXiv preprint by a group of 31 engineers and researchers at OpenAI described the achievement and development of GPT-3, a third-generation "state-of-the-art language model". The team increased the capacity of GPT-3 by over two orders of magnitude from that of its predecessor, GPT-2, making GPT-3 the largest non-sparse language model at that time. Because GPT-3 is structurally similar to its predecessors, its greater accuracy is attributed to its increased capacity and greater number of parameters. GPT-3's capacity is ten times larger than that of Microsoft's Turing NLG, the next largest NLP model known at the time.
Lambdalabs estimated in 2020 that training GPT-3 on a single GPU would have cost around US$4.6 million and taken roughly 355 years; in practice the actual training time was far shorter because many GPUs were used in parallel.
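The scale of that estimate can be reproduced with a back-of-envelope calculation. The figures below are assumptions for illustration (the common approximation of 6 floating-point operations per parameter per training token, roughly 300 billion training tokens, and a single GPU sustaining about 28 teraFLOPS), not numbers taken from Lambdalabs' analysis.

```python
# Rough reconstruction of the "hundreds of GPU-years" scale of training GPT-3.
params = 175e9                 # GPT-3 parameter count
tokens = 300e9                 # approximate number of training tokens (assumption)
train_flops = 6 * params * tokens          # ~3.15e23 floating-point operations
gpu_flops_per_s = 28e12                    # assumed sustained throughput of one GPU
seconds = train_flops / gpu_flops_per_s
print(f"~{seconds / (365 * 24 * 3600):.0f} GPU-years")   # on the order of 355 GPU-years
```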
Sixty percent of the weighted pre-training dataset for GPT-3 comes from a filtered version of Common Crawl consisting of 410 billion byte-pair-encoded tokens. Fuzzy deduplication was performed with Apache Spark's MinHashLSH. Other sources are 19 billion tokens from WebText2 representing 22% of the weighted total, 12 billion tokens from Books1 representing 8%, 55 billion tokens from Books2 representing 8%, and 3 billion tokens from Wikipedia representing 3%. GPT-3 was trained on hundreds of billions of words and is also capable of coding in CSS, JSX, and Python, among other languages.
Dataset | Tokens | Proportion within training
Common Crawl | 410 billion | 60%
WebText2 | 19 billion | 22%
Books1 | 12 billion | 8%
Books2 | 55 billion | 8%
Wikipedia | 3 billion | 3%
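During training, examples were not drawn in proportion to raw dataset size: smaller, higher-quality sources such as Wikipedia were sampled more often per token than Common Crawl. A hypothetical sketch of such mixture-weighted sampling, using the proportions from the table above (the names and the sampling loop are illustrative, not OpenAI's training code):

```python
import random

# Mixture weights from the table above (they sum to 101% due to rounding in the source).
mixture = {
    "Common Crawl (filtered)": 0.60,
    "WebText2": 0.22,
    "Books1": 0.08,
    "Books2": 0.08,
    "Wikipedia": 0.03,
}
sources, weights = zip(*mixture.items())

# Decide which corpus each of the next 10 training documents is drawn from.
print(random.choices(sources, weights=weights, k=10))
```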

Since GPT-3's training data was all-encompassing, it does not require further training for distinct language tasks. The training data contains occasional toxic language, and GPT-3 occasionally generates toxic language as a result of mimicking it. A study from the University of Washington found that GPT-3 produced toxic language at a level comparable to the similar natural language processing models GPT-2 and CTRL. OpenAI has implemented several strategies to limit the amount of toxic language generated by GPT-3. As a result, GPT-3 produced less toxic language than its predecessor model, GPT-1, although it produced more toxic generations, and more toxic language overall, than CTRL Wiki, a language model trained entirely on Wikipedia data.
On June 11, 2020, OpenAI announced that users could request access to its user-friendly GPT-3 API—a "machine learning toolset"—to help OpenAI "explore the strengths and limits" of this new technology. The invitation described how this API had a general-purpose "text in, text out" interface that can complete almost "any English language task", instead of the usual single use case. According to one user, who had access to a private early release of the OpenAI GPT-3 API, GPT-3 was "eerily good" at writing "amazingly coherent text" with only a few simple prompts. In an initial experiment, 80 US subjects were asked to judge whether short, roughly 200-word articles had been written by humans or by GPT-3. The participants judged correctly 52% of the time, doing only slightly better than random guessing.
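Access to the model is through that "text in, text out" API rather than through the weights themselves. The sketch below uses the legacy (pre-1.0) openai Python client as it existed for GPT-3; the API key, prompt, and parameter values are placeholders, and newer versions of the client expose a different interface.

```python
import openai  # legacy pre-1.0 client; shown as an illustrative sketch

openai.api_key = "sk-..."  # placeholder API key

# "Text in, text out": a single prompt string goes in, a completion string comes out.
response = openai.Completion.create(
    engine="davinci",                     # the base 175B GPT-3 model exposed by the early API
    prompt="Translate English to French:\n\ncheese =>",
    max_tokens=16,
    temperature=0,
)
print(response.choices[0].text)
```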
On November 18, 2021, OpenAI announced that enough safeguards had been implemented that access to its API would be unrestricted. OpenAI provided developers with a content moderation tool to help them abide by OpenAI's content policy. On January 27, 2022, OpenAI announced that its newest GPT-3 language models, collectively referred to as InstructGPT, were now the default models used by its API. According to OpenAI, InstructGPT produced content that was better aligned to user intentions by following instructions better, generating fewer made-up facts, and producing somewhat less toxic content.
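As an illustration of that kind of policy check, the sketch below calls OpenAI's hosted moderation endpoint through the legacy (pre-1.0) Python client; it is one example of the moderation tooling OpenAI has offered developers, not necessarily the exact tool referenced in the 2021 announcement.

```python
import openai  # legacy pre-1.0 client; illustrative sketch

openai.api_key = "sk-..."  # placeholder API key

# Ask the moderation endpoint whether a piece of text violates the content policy.
result = openai.Moderation.create(input="some model output to check")
print(result["results"][0]["flagged"])      # True if the text is flagged
print(result["results"][0]["categories"])   # per-category booleans (e.g. hate, violence)
```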
Because GPT-3 can "generate news articles which human evaluators have difficulty distinguishing from articles written by humans," GPT-3 has the "potential to advance both the beneficial and harmful applications of language models." In their May 28, 2020 paper, the researchers described in detail the potential "harmful effects of GPT-3" which include "misinformation, spam, phishing, abuse of legal and governmental processes, fraudulent academic essay writing and social engineering pretexting". The authors draw attention to these dangers to call for research on risk mitigation.
GPT-3 is capable of performing zero-shot and few-shot learning: it can attempt a task given only a natural-language description of it, or given a handful of demonstrations placed in its prompt, without any update to its weights.
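The difference is visible in the prompts themselves; the translation examples below echo those used in the GPT-3 paper, and no gradient update takes place in either case.

```python
# Zero-shot: the task is described, but no worked examples are given.
zero_shot = "Translate English to French:\n\ncheese =>"

# Few-shot: a handful of demonstrations are placed in the 2048-token context window,
# and the model is expected to continue the pattern.
few_shot = (
    "Translate English to French:\n\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)
```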
In June 2022, Almira Osmanovic Thunström wrote that GPT-3 was the primary author on an article on itself, that they had submitted it for publication, and that it had been pre-published while waiting for completion of its review.

GPT-3 models

There are many models in the GPT-3 family, some serving different purposes than others. In its initial research paper, OpenAI described eight different sizes of the main GPT-3 model; a rough parameter-count check follows the table.
Model Name | Parameters | Layers | d_model | Heads | d_head | Batch Size | Learning Rate | API name
GPT-3 Small | 125M | 12 | 768 | 12 | 64 | 0.5M | 6.0e-4 | -
GPT-3 Medium | 350M | 24 | 1024 | 16 | 64 | 0.5M | 3.0e-4 | ada
GPT-3 Large | 760M | 24 | 1536 | 16 | 96 | 0.5M | 2.5e-4 | -
GPT-3 XL | 1.3B | 24 | 2048 | 24 | 128 | 1M | 2.0e-4 | babbage
GPT-3 2.7B | 2.7B | 32 | 2560 | 32 | 80 | 1M | 1.6e-4 | -
GPT-3 6.7B | 6.7B | 32 | 4096 | 32 | 128 | 2M | 1.2e-4 | curie
GPT-3 13B | 13.0B | 40 | 5140 | 40 | 128 | 2M | 1.0e-4 | -
GPT-3 175B | 175.0B | 96 | 12288 | 96 | 128 | 3.2M | 0.6e-4 | davinci
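The parameter counts in the table can be sanity-checked with a common rule of thumb for decoder-only transformers: roughly 12 · n_layers · d_model² weights in the attention and feed-forward blocks, plus n_vocab · d_model embedding weights. The 50,257-token BPE vocabulary and the formula itself are approximations used here for illustration, not the exact architecture.

```python
N_VOCAB = 50257  # GPT-2/GPT-3 byte-pair-encoding vocabulary size

def approx_params(n_layers, d_model):
    """Rule-of-thumb parameter count for a decoder-only transformer."""
    return 12 * n_layers * d_model**2 + N_VOCAB * d_model

print(f"GPT-3 Small: {approx_params(12, 768) / 1e6:.0f}M")    # ~124M  (table: 125M)
print(f"GPT-3 13B:   {approx_params(40, 5140) / 1e9:.1f}B")   # ~12.9B (table: 13.0B)
print(f"GPT-3 175B:  {approx_params(96, 12288) / 1e9:.0f}B")  # ~175B  (table: 175B)
```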

Half of the models are accessible through the API, namely GPT-3-medium, GPT-3-XL, GPT-3-6.7B and GPT-3-175B, which are referred to as ada, babbage, curie and davinci respectively. While the size of the API models was not originally disclosed by OpenAI, EleutherAI announced the mapping between model sizes and API names in May 2021. These model sizes were later confirmed by OpenAI, but the sizes of subsequent models have not been disclosed.
Model | Parameters | Description | Series
ada | 350M | Capable of very simple tasks, usually the fastest model in the GPT-3 series, and lowest cost. | Base GPT-3
babbage, babbage-002 | 1.3B | Capable of straightforward tasks, very fast, and lower cost. | Base GPT-3
curie | 6.7B | Very capable, but faster and lower cost than davinci. | Base GPT-3
davinci, davinci-002 | 175B | Most capable GPT-3 model. Can do any task the other models can do, often with higher quality. | Base GPT-3
text-ada-001 | 350M | Capable of very simple tasks, usually the fastest model in the GPT-3 series, and lowest cost. | InstructGPT
text-babbage-001 | 1.3B | Capable of straightforward tasks, very fast, and lower cost. | InstructGPT
text-curie-001 | 6.7B | Very capable, faster and lower cost than davinci. | InstructGPT
text-davinci-001 | 175B | Older version of the most capable model in the GPT-3 series. Can perform any task the other GPT-3 models can, often with less context. | InstructGPT
text-davinci-002, code-davinci-002 | Undisclosed | Similar capabilities to text-davinci-003, but trained with supervised fine-tuning instead of reinforcement learning. | GPT-3.5
text-davinci-003 | Undisclosed | Can do any language task with better quality, longer output, and more consistent instruction-following than the curie, babbage, or ada models. Also supports inserting completions within text. | GPT-3.5
gpt-3.5-turbo, gpt-3.5-turbo-instruct, gpt-3.5-turbo-16k | Undisclosed | Most capable and cost-effective GPT-3.5 model, optimized for chat at 1/10th the cost of text-davinci-003. | GPT-3.5
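Unlike the base GPT-3 models, which are driven through a single prompt string, the gpt-3.5-turbo family in the table above is called through a chat-style endpoint that takes a list of role-tagged messages. A minimal sketch using the legacy (pre-1.0) openai Python client, with placeholder key and messages:

```python
import openai  # legacy pre-1.0 client; illustrative sketch

openai.api_key = "sk-..."  # placeholder API key

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what GPT-3 is in one sentence."},
    ],
)
print(response.choices[0].message["content"])
```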