- John "NextWord" Hwang
How ChatGPT and GPT3.5 Work in Plain English (for Non-Developers)
As an ex-AI product manager who’s currently helping non-techies with AI, I have found that most GPT explainers are riddled with technical jargon. But it doesn’t have to be that way.
In this post, I'll cover the foundational ideas behind ChatGPT, GPT3, and large language models (LLMs) using plain English, targeting a non-techie audience interested in AI. These ideas will underlie pretty much every application you build with large language models or OpenAI's APIs. So it's important to have a working mental model around these terms.
Let’s start with talking about how ChatGPT works.
Essentially, ChatGPT is just a user interface that sits in front of an AI model called InstructGPT, which is the core component responsible for generating text. Put another way, InstructGPT is the AI model doing (almost) all the work. So how does InstructGPT work?
Turns out, InstructGPT itself is an adapted (aka finetuned) version of yet another AI model called GPT3.5 (“text-davinci-003”), which encapsulates most of the intelligence around generating text. Here’s a visual diagram of how everything fits together.
While GPT3.5 is available to the public as an API, InstructGPT is not. So let’s dive deeper into GPT3.5 and set InstructGPT aside for now (I'll touch again on InstructGPT at the end).
So what is GPT3.5 (“text-davinci-003”)? It and all its older versions (Curie, Babbage, Ada, GPT2, GPT, etc.) belong to a category of AI models called Large Language Models (LLMs). Let’s forget about the “Large” part for now, and focus on “Language Model”.
Think of a language model as a program that works similarly to the auto-complete feature on your phone keyboard. Auto-complete helps you “generate text” by iteratively completing a series of suggestions, which are themselves sorted by some probability.
Whether the suggested words or phrases are “probable” or “appropriate” is determined by what you have typed so far, i.e. the context. A word that doesn’t fit well with the words that came before will be assigned a low probability, and vice versa. And in the case of the auto-complete keyboard, the human is actively choosing (i.e. sampling) from a set of the most probable suggestions.
Language models (like GPT3.5, GPT3, GPT2, etc.) essentially work the same way, except the algorithm repeats two steps in a loop: 1/ calculate the most probable next-token suggestions, and 2/ sample from those most probable tokens. (To be precise, GPT models are sub-categorized as auto-regressive models in that they generate one word at a time, from left to right.)
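To make that loop concrete, here’s a minimal Python sketch of the generate-one-token-at-a-time process. The vocabulary and probability numbers below are made up purely for illustration; a real GPT model computes these probabilities with a neural network over tens of thousands of tokens.

```python
import random

# Toy stand-in for a language model: given the context so far, return a
# probability table over a tiny, made-up vocabulary. A real LLM computes
# this table with a neural network; these numbers are invented.
def next_token_probs(context):
    if context == ["I", "think,", "therefore"]:
        return {"I": 0.85, "he": 0.05, "we": 0.05, "yawn": 0.05}
    if context[-1] == "I":
        return {"am": 0.90, "will": 0.05, "exist.": 0.05}
    return {"exist.": 1.0}

def generate(context, n_tokens):
    context = list(context)
    for _ in range(n_tokens):
        probs = next_token_probs(context)            # step 1: score every token
        tokens, weights = zip(*probs.items())
        picked = random.choices(tokens, weights)[0]  # step 2: sample one token
        context.append(picked)                       # feed it back as new context
    return context

# Most runs produce "... I am", but not every run - the sampling is random.
print(generate(["I", "think,", "therefore"], 2))
```

Note how the freshly sampled token is appended to the context before the next iteration; that feedback loop is what “auto-regressive” refers to.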
Consider the following word completion challenge: I think, therefore ___. The most appropriate completion would be “I am”, given that it completes the quote from the French philosopher René Descartes.
When GPT sees this sentence completion task, it will first generate probabilities for every single word / token in its vocabulary, and these word probabilities are determined solely by the text that came before (the context). The process of reflecting the context into the next-word probabilities is called the attention mechanism. (Note: I have been using word and token interchangeably, but they are two different yet related concepts. See Appendix A for the fine-print difference.)
From here, GPT will sample (i.e. randomly pick) from this table of probabilities according to the probability associated with each word, meaning higher-probability words will get picked more often, but less likely words such as “he” can also get picked. Thus, GPT is a non-deterministic model - if you ran the algorithm twice on the same input / prompt, you may not get the same output.
In the above case, we pick “I” and feed it back into the input, thus changing the context. And when the context changes, the next-word probabilities also change, which now assigns the highest probability to the word “am”. But again, anything could happen - something very unlikely like “yawn” could get picked as a freak accident. In this example, we will just assume “am” gets picked. (Note: you can go to the OpenAI Playground to further visualize this example.)
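You can simulate the sampling step yourself in a few lines of Python. The probability numbers below are hypothetical, not GPT’s actual values; the point is that weighted random sampling mostly picks the top word but occasionally picks a long shot, which is exactly why the same prompt can yield different outputs on different runs.

```python
import random
from collections import Counter

# Hypothetical next-word probabilities after "I think, therefore".
# (A real GPT model assigns a probability to every token in its vocabulary.)
probs = {"I": 0.85, "he": 0.06, "we": 0.05, "yawn": 0.04}

# Sample the next word 1,000 times using those probabilities as weights.
draws = random.choices(list(probs), weights=list(probs.values()), k=1000)

# "I" wins the vast majority of draws, but "he", "we", and even "yawn"
# show up occasionally - the source of GPT's non-determinism.
print(Counter(draws).most_common())
```

Lowering the weight spread (making the distribution flatter) makes the output more surprising; this is essentially what the “temperature” setting controls in practice.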
At this point, you may be wondering “this all looks so simple, that’s it? Where’s the secret sauce??”. Well, the secret sauce comes mainly from the “Large” part in “Large Language Model”. So far, we only learned how Language Models work at a high level, so what does “Large” mean?
“Large” refers to the overall number of parameters that the GPT series of models have. Parameters are kind of like neurons in a brain - they determine the capacity or ceiling of how much knowledge a model can internalize. While not linear, there’s generally a correlation between number of parameters and how sophisticated a model can be, just like there’s a weak but positive correlation between brain size and intelligence in mammals.
And OpenAI’s GPT3 models (2020) have up to 175 billion parameters, which eclipsed the “size” of competitors’ models and previous generations. Below is a chart from Nvidia plotting number of parameters by model year vintage. Considering GPT2 models (from 2019) had just 1.5 billion, this was a 100x increase - in just 1 year. Really, the only thing that changed from GPT2 to GPT3 was the number of parameters (plus a larger training dataset, though that was a less important factor than the parameters) - everything else about the model’s mechanisms remained the same. So all of the performance gain & magic could be attributed to beefing up parameters by 100x.
Note, GPT wasn't the first language model, nor the first transformer-based language model (one that uses the attention mechanism), to come around. The transformer mechanism itself came out in 2017. So kudos to OpenAI for being bold enough to try a 100x scaled-up version of the status quo at the time. And boldness is required, since larger models consume more computing power (read: money $$$) to train.
It’s also worth mentioning that GPT3 was trained on the largest digitized body of human text, at least in English. Below is a chart from the GPT3 paper, and you can see that almost half a trillion or so tokens were used to train the model. After scanning so much language data, GPT seems to have picked up some deep insight into human language, and therefore cognition, because language and intelligence are deeply interrelated. This explains why GPT seems to have picked up “emergent” behaviors like language translation, classification, etc. - abilities the model was never specifically trained for.
InstructGPT vs GPT3.5
And lastly, I wanted to quickly mention InstructGPT, which is a finetuned form of GPT3.5. Why are two separate models needed? Why isn’t GPT3.5 used in its raw form inside ChatGPT?
Prior to launching ChatGPT to the public, OpenAI had two concerns:
- Performance (does it work?)
- Safety (is it human-friendly?): ensuring the model avoids saying something abrasive, racist, or sexual that could raise red flags with the public
So OpenAI had to create a model that can run with some guardrails and constraints. In addition, they wanted a model that understands human instructions very well. So OpenAI hired human annotators to manually give feedback on GPT3.5’s answers, which allowed InstructGPT to learn and adapt its raw GPT3.5 answers to what it predicts humans will like.
Key Takeaways
- ChatGPT leverages InstructGPT, which in turn leverages GPT3.5
- GPT3.5 belongs to a class of models called language models.
- GPT3.5 is what’s available as an API, while InstructGPT isn’t.
- Language Models are basically automated auto-completers, but it’s the “Largeness” of Large Language Models that makes them so powerful.
- The sheer largeness of Language Models gave birth to emergent behaviors.
Appendix A: Word versus Token
When a language model is trained (created), it doesn’t just process words (e.g. “tree”) but also special characters, spaces, etc. Thus, language models have to work not just on “words” but on a more general set of character sequences, which are called tokens in Natural Language Processing parlance.
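As a rough illustration, here’s a toy Python tokenizer that splits text into word pieces, punctuation, and spaces. Real GPT tokenizers use byte-pair encoding learned from data, so their splits look different (a rare word can even be broken into several sub-word tokens), but the idea that the model’s units of text go beyond whole words is the same.

```python
import re

# A toy tokenizer: splits text into runs of word characters, individual
# punctuation marks, and individual whitespace characters. This is NOT how
# GPT tokenizes - real byte-pair-encoding tokenizers learn their pieces
# from data - but it shows that "tokens" include more than whole words.
def toy_tokenize(text):
    return re.findall(r"\w+|[^\w\s]|\s", text)

print(toy_tokenize("I think, therefore I am."))
# ['I', ' ', 'think', ',', ' ', 'therefore', ' ', 'I', ' ', 'am', '.']
```

Notice that the comma, the period, and even each space count as tokens of their own, which is why a model’s token count for a passage is usually higher than its word count.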