GPT-4 Architecture: Unravelling the Deep Technical Weave

Explore GPT-4's evolution, architecture, and potential in this comprehensive guide. Delve into transformer details and the Mixture of Experts framework. A must-read for AI aficionados and beginners, this piece demystifies the brilliance of GPT-4. Embark on this enlightening AI journey.


In the realm of artificial intelligence, there are giants, and then there's GPT-4 — a behemoth that stands head and shoulders above the rest. Hailing from OpenAI's innovative lab, GPT-4 is the latest prodigy in the illustrious line of Generative Pre-trained Transformer (GPT) language models. OpenAI has not disclosed how many parameters it holds, but rumoured estimates run into the trillions, far beyond GPT-3's 175 billion, and GPT-4 doesn't just play in the major leagues; it's rewriting the rulebook. Early benchmarks hint at its prowess, revealing performance metrics that eclipse its predecessors. While the mystique around its architecture remains, given OpenAI's discretion on the specifics, we know its transformer roots and the intriguing, though unconfirmed, utilisation of a Mixture of Experts (MoE) framework. Dive with us into this article, as we embark on a journey to unravel the intricacies of GPT-4's architecture, juxtapose it against other titanic language models, and weigh its monumental promise against potential perils.

The Saga of GPT: From Pioneer to Prodigy

Once upon a techno-time, in 2018, OpenAI unveiled the first in its line of marquee offerings: GPT, or the Generative Pre-trained Transformer. More than just an eloquent name, it was a quantum leap in the domain of natural language processing. Fueled by an extensive corpus of text and code, GPT wasn't merely a model; it was a maestro, orchestrating symphonies of coherent text, dexterously translating languages, penning an array of creative compositions, and proffering answers with sagacious charm.

2019 heralded the arrival of GPT-2, a behemoth that made its predecessor seem but a fledgling. With its formidable capabilities, it stirred both wonder and wariness. It was so astoundingly adept that OpenAI, in an unprecedented move, deemed it "too dangerous" for an immediate public debut. While the initial reticence sparked intrigue, GPT-2 eventually graced the hands of a select cohort of researchers, forging a path for a slew of remarkable applications.


Fast-forward to 2020, and the stage was set for GPT-3, a model that was not just bigger—it was monumental. Touting 175 billion parameters, a leap of more than a hundredfold over GPT-2's 1.5 billion, it redefined versatility. From conjuring lines of code to orchestrating melodic tunes and fluently flipping between languages, GPT-3 wasn't just on the horizon of AI—it was the horizon.

The Architectural Essence of GPT: The Transformer's Decoder

At the foundational core of GPT lies the transformer architecture. Originally conceived for tasks like machine translation, transformers have proven versatile and powerful, dominating a myriad of NLP challenges. Unlike traditional RNNs or LSTMs that process sequences incrementally, transformers process all tokens in the sequence in parallel, harnessing self-attention mechanisms to draw global dependencies.

While the pioneering Transformer model is constructed on a dual pillar of Encoder-Decoder architecture, the GPT series by OpenAI takes a specialised approach. Based on the work of Radford et al. (2018), GPT models singularly harness the power of the transformer's decoder, both in design and during the training phase. This focused architecture enables the GPT series to excel in generative tasks, providing rich context without the need for a separate encoding phase.

The figure below showcases the intricacies of the decoder as employed in GPT-1:

As we can see, the decoder consists of three types of layers and two types of connections. Let us see what each of them is and how it contributes to the decoder's unique strength:

Masked Multi-Head Self-Attention Mechanism:

Cracking open this seemingly complex layer, let's dissect it into its three main components: the self-attention mechanism, the masking mechanism, and the illustrious multiple heads.

Self-Attention Mechanism

The self-attention mechanism isn't just the core, but the pulsating heart of the Transformer architecture. Picture it as the model's magic glasses, allowing it to read sequences in any order and spot patterns, no matter where they hide in the sequence. It's a bit like having x-ray vision!

For our model to cast its 'attention spell' on every word, it conjures up three magical entities for a given input sequence X: the enigmatic queries (Q), the cryptic keys (K), and the valuable treasures known as values (V). And how, you ask? By weaving the input with their respective alchemical formulas (or weight matrices) WQ, WK, and WV. Just so you know, these matrices aren't just handed down; they're honed and perfected during training.

$$Q = X \cdot W_{Q}, \quad K = X \cdot W_{K}, \quad V = X \cdot W_{V}$$
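
To make the projections concrete, here is a minimal numpy sketch; the dimensions and the random initialisation are purely illustrative, since in the real model the weight matrices are learned during training:

#Illustrative Q, K, V projections (toy dimensions, randomly initialised for the example)
import numpy as np

seq_len, d_model, d_k = 4, 8, 8          # toy sizes, not GPT's actual dimensions
X = np.random.randn(seq_len, d_model)    # one embedding vector per input token

W_Q = np.random.randn(d_model, d_k)      # in the real model these weights are learned, not random
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # one query, key, and value vector per token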

Following its 'attention spell', the model then works its arithmetic magic to compute attention scores. Think of these scores as an intricate dance card, deciding which word in the sequence should twirl with which. These scores are concocted by taking the dot product of the query and key matrices for each word token. It's a bit like the model's internal matchmaking service, determining which word gets to waltz with whom and for how long.

$$\text{Attention score} = QK^T$$

Before our words waltz through the sequence, the scores get a tweak. They're divided by the square root of the depth of the key vectors (dk), ensuring balance. Picture the softmax function as our dance instructor, fine-tuning each word's step and poise to produce balanced attention weights. With these in hand, our model choreographs a final dance, merging the value vectors using the weights. And voilà, the self-attention mechanism's elegant output takes the stage.

$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_{k}}}\right)V$$

Where Q, K, and V are the query, key, and value matrices respectively, and $d_k$ is the depth of the key vectors.

Below is a simple Python code for self-attention computation:

#Sample code for self-attention computation
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]                              # depth of the key vectors
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)      # scaled dot-product scores
    scores -= scores.max(axis=-1, keepdims=True)   # shift scores for numerical stability
    weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)  # softmax over each row
    return np.matmul(weights, V)                   # weighted sum of the value vectors

#This is a simplified version and does not account for
#batching, masking, or multi-headedness


This mechanism allows the decoder to focus on different parts of the input sequence. By determining attention scores, the decoder can understand which parts of the input are relevant when producing a particular output word, enabling context-aware generation. The self-attention mechanism can attend to any word in the input sequence, allowing it to capture both short-term and long-term dependencies in the data.

Masking Mechanism

While training a language model like GPT, our goal is straightforward: predict the next word in a sequence by only considering the words that preceded it. Essentially, if the model is generating a sentence word by word, it shouldn't have knowledge of future words because, in a real-world scenario, those words haven't been generated yet. Think of it as building a jigsaw puzzle; you want to place the next piece based on the pieces you've already set, not the ones still in the box.

Consider the sequence, "The cat is on the ___". The aim is to fill in that blank based on prior context without the model being privy to any words that follow.

The masking process is a tad like using a strategic bookmark while reading. This bookmark is placed under your current line, guarding the upcoming text and ensuring you're not tempted to jump ahead. With this technique, the model is confined to the lines — or, in our case, words — that came before, ensuring its predictions are rooted in the previous context.

But here's the technical magic: The model doesn't just ignore future words; it's actively deterred from them. This is achieved by giving the attention scores of future words a whopping negative value, so big that it's nearly infinite in its negativity! Now, when we push these scores through a softmax function, they essentially shrivel up to a value so minuscule, it's almost zero. This ensures the model doesn't give two hoots about those future words when making its predictions. Quite clever, if you ask me!

For example:

$$\text{softmax}([- \infty, x, y, z]) \approx \left[ 0, \frac{e^x}{e^x + e^y + e^z}, \frac{e^y}{e^x + e^y + e^z}, \frac{e^z}{e^x + e^y + e^z} \right]$$

These near-zero values mean that, when calculating the final output of the attention layer, the model doesn't consider future words at all. Their influence is effectively nullified.
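
Here is a minimal numpy sketch of that idea: a hypothetical helper (not GPT's actual code) that fills the future positions of the score matrix with a very large negative number before the softmax is applied.

#Illustrative causal masking of the attention score matrix
import numpy as np

def causal_mask(scores):
    # scores has shape (seq_len, seq_len); entry (i, j) says how much token i attends to token j
    seq_len = scores.shape[-1]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal = future tokens
    return np.where(future, -1e9, scores)  # a huge negative number stands in for minus infinity

#After the softmax, the masked positions get near-zero weight, so each token
#can only attend to itself and to the tokens that came before it.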

Multi-Head Mechanism

Think of the multi-head mechanism as viewing a scene with multiple cameras, each capturing a unique angle. Instead of a single set of Q, K, and V matrices, our model uses multiple sets, each representing a different "camera" or "head." Every head zeroes in on a different aspect of the input data.

After capturing these diverse perspectives, their outputs are seamlessly merged and go through a linear transformation. The result? A comprehensive, well-rounded output that capitalises on the insights from every angle. In essence, it's like piecing together a multi-camera shot to get the full picture!

$$ \text{MultiHeadOutput} = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W_O $$

Here $W_O$ is the output weight matrix, and each head is computed as

$$\text{head}_i = \text{Attention}(X \cdot W_{Q_i}, X \cdot W_{K_i}, X \cdot W_{V_i}) $$

After gathering diverse insights from each head, we have a vast concatenated output. Now, if left unchecked, this could be overwhelming in size. So, to neatly package these insights without making it a cumbersome bundle, the concatenated output is multiplied by $W_O$  – our trusty output weight matrix. This is similar to condensing a thick novel into a concise yet informative summary.

And what's cool? $W_O$ isn't rigid. It's like a clay model, malleable and trainable. During the training phase, it morphs and tweaks itself, ensuring that the multi-head attention mechanism consistently churns out insightful and meaningful outputs.

In a nutshell, multi-head attention is like having multiple detectives on a case. They each pick up on different clues, ensuring no detail, no matter how minuscule, goes unnoticed. Multi-head attention allows the model to focus on different parts of the input simultaneously, capturing various aspects or features from the input data.
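
As a rough numpy sketch of this idea, reusing the attention function from earlier (the per-head weight matrices here are placeholders; the head count and sizes are illustrative, not GPT's real configuration):

#Illustrative multi-head attention built on the single-head attention() defined above
import numpy as np

def multi_head_attention(X, W_Qs, W_Ks, W_Vs, W_O):
    # W_Qs, W_Ks, W_Vs are lists holding one projection matrix per head
    heads = [attention(X @ W_Q, X @ W_K, X @ W_V)       # each head runs its own attention
             for W_Q, W_K, W_V in zip(W_Qs, W_Ks, W_Vs)]
    concat = np.concatenate(heads, axis=-1)             # stitch the heads' outputs side by side
    return concat @ W_O                                 # mix them with the output weight matrix W_O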

Feed-Forward Neural Network:

Each word in the input sequence is passed through the same feed-forward network, independently. This network's role is to transform the contextual embeddings received from the self-attention layer. For each position in the input x:

$$\text{FFN}(x) = \text{ReLU}(xW_1+b_1)W_2+b_2 $$

Where $W_1$, $W_2$, $b_1$, and $b_2$ are learned parameters of the network. It is a two-layer feed-forward neural network (often called a position-wise feed-forward network in the context of the Transformer architecture).

Why this two-step finesse, you ask? While the self-attention mechanism is the discerning eye, capturing the global essence of the input text, the feed-forward network is the hand that adds depth and nuance. It complements the self-attention mechanism, enabling the model to learn different types of transformations and thereby enriching its representational power.
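
In code, the position-wise step is just two matrix multiplications with a ReLU in between; a minimal sketch following the formula above (the weights would be learned in practice):

#Illustrative position-wise feed-forward network
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    hidden = np.maximum(0, x @ W1 + b1)   # first linear layer followed by ReLU
    return hidden @ W2 + b2               # second linear layer projects back to the model dimension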

A fine balance: Residual Connections and Layer Normalisation

Residual connections, represented by the "+" operation, allow gradients to flow more easily during training. If a layer needs to make only a small change to the input representation, it can easily do so. Essentially, these connections provide 'shortcuts' for the gradient, combating the vanishing gradient problem in deep networks. After adding the residual, layer normalisation stabilises the activations, ensuring that they remain in a consistent scale and distribution, further aiding in stable and faster training.

In more technical terms, for every layer's output, F(x), and input x, the output is given by:

$$\text{Output} = \text{LayerNorm}(x + F(x)) $$
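
A minimal numpy sketch of that wrapping step (the learnable gain and bias of layer normalisation are omitted to keep the example short):

#Illustrative residual connection followed by layer normalisation
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)          # learnable gain and bias omitted for brevity

def add_and_norm(x, sublayer_output):
    return layer_norm(x + sublayer_output)   # Output = LayerNorm(x + F(x))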

Unmasking the Mixture of Experts (MoE)

Though OpenAI has not confirmed it, there are rumours that GPT-4 uses an MoE framework. So what exactly is it?

Imagine you're assembling a team for a project. Instead of relying on one person who knows a bit about everything, wouldn't it be better to have specialists who are really good at one thing?

That's the idea behind MoE:

  • Experts Everywhere: Instead of one model trying to know everything, we have many mini-models (called "experts"). Each of these experts is a pro at understanding a particular type of information.
  • The Decision Maker: We have a helper network (the "gating network"). Its job? To look at incoming data and decide which expert should take a look.
  • Passing the Baton: For any piece of data, our gating network figures out the best experts to handle it. The data is then passed on to those top experts for processing.
  • Teamwork: These chosen experts look at the data and give their outputs. The outputs are then mixed together, using weights given by our gating network, to produce the final result.

The magic here? By sharing tasks among experts, MoE can cover more ground without making everything super complex. It's like having a team with diverse skills, leading to a richer and more comprehensive solution without burning extra resources.
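
Since OpenAI has not published GPT-4's internals, the sketch below is purely illustrative: a toy gating network that scores a handful of expert networks, routes each token to the top few, and mixes their outputs with the gate's weights.

#Illustrative (and deliberately tiny) Mixture of Experts layer
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, expert_weights, gate_W, top_k=2):
    # x: a single token's vector; expert_weights: one weight matrix per expert; gate_W: gating network weights
    gate_scores = softmax(x @ gate_W)                # the gating network scores every expert
    chosen = np.argsort(gate_scores)[-top_k:]        # keep only the top-k experts for this token
    output = np.zeros_like(x)
    for i in chosen:
        output += gate_scores[i] * (x @ expert_weights[i])  # weighted mix of the chosen experts' outputs
    return output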

In the world of massive models like GPT, MoE is like having specialised linguists for every corner of the language. Some experts might be wordplay wizards, while others could be champions of Chaucer or sages of slang.

So, to wrap it up: MoE gives these models a boost. It's a smart method to pack more punch in our models without overloading them. This means we get powerful, efficient models that can dive into the vast seas of language and come up with treasures every time.

Conclusion

In the sprawling expanse of AI's landscape, GPT-4 stands as a beacon of innovation, a testament to what's possible when ingenuity meets expertise. From its roots in the transformer architecture to the whispers of MoE's inclusion, GPT-4 is not merely a leap but a quantum jump in the evolution of language models. As we stand on the brink of this new frontier, we're not just spectators but active participants in a revolution that promises to redefine the way we perceive, process, and interact with language.

💡
Amita Kapoor: Author, Research & Code Development
Narotam Singh: Conceptualisation, Design & Digital Management