Transformers and Large Language Models

Week 4: BERT, Llama, GPT, and Co

Dr. Philipp K. Masur

A Look Back at the Chronology of NLP

What we focus on today…

The classic machine learning approach

  • A lot of different steps…

  • Training takes a lot of data and time… and performance was often still somewhat limited

# Get data
science_data <- read_csv("data/science.csv") |>  filter(label != "biology" & label != "finance" & label != "mathematics")

# Create test and train data sets
split <- initial_split(science_data, prop = .50)

# Feature engineering
rec <- recipe(label ~ lemmata, data = science_data) |> 
  step_tokenize(lemmata) |> 
  step_tf(all_predictors())  |> 
  step_normalize(all_predictors())

# Setup algorithm/model
mlp_spec <- mlp(epochs = 600, hidden_units = c(6),  
                penalty = 0.01, learn_rate = 0.2) |>   
  set_engine("brulee") |>    
  set_mode("classification")

# Create workflow
mlp_workflow <- workflow() |> 
  add_recipe(rec) |> 
  add_model(mlp_spec)

# Fit model
m_mlp <- fit(mlp_workflow, data = training(split))

What if things were easier?

  • Wouldn’t it be great if we did not have to wrangle with the data, did not have to engage in any text preprocessing, and could simply let the “computer” figure this out?

  • In fact, aren’t we living in times where we can simply ask the computer, similar to Theodore in the movie “Her”?

library(tidyllm)
library(glue)

glue('Classify this scientific abstract: {abstract}

     Pick one of the following fields.
     Provide only the name of the field:
     
     Physics
     Computer Science
     Statistics', 
     abstract = science_data$text[2]) |> 
  llm_message() |> 
  chat(ollama(.model = "llama3", .temperature = 0)) 
Message History:
system: You are a helpful assistant
--------------------------------------------------------------
user: Classify this scientific abstract: Rotation Invariance Neural Network Rotation invariance and translation invariance have great values in image
recognition tasks. In this paper, we bring a new architecture in convolutional
neural network (CNN) named cyclic convolutional layer to achieve rotation
invariance in 2-D symbol recognition. We can also get the position and
orientation of the 2-D symbol by the network to achieve detection purpose for
multiple non-overlap target. Last but not least, this architecture can achieve
one-shot learning in some cases using those invariance.


Pick one of the following fields.
Provide only the name of the field:

Physics
Computer Science
Statistics
--------------------------------------------------------------
assistant: Computer Science
--------------------------------------------------------------

Content of this Lecture

  1. Machine Learning vs. Deep Learning

    1.1. Reminder: Text Classification Pipeline

    1.2. How did the Field move on?

  2. The Rise of Transformers and Transfer Learning

    2.1. Overview

    2.2. Transfer Learning

    2.3. Architecture of the Transformer Model

    2.4. The Transformer Text Classification Pipeline Using BERT

  3. Large Language Models: BERT, Llama, GPT and Co

    3.1. What are Large Language Models and Generative AI?

    3.2. General Idea: Next-Token-Prediction

    3.3. A Peek into the Architecture of GPT

  4. Using LLMs for Text Classification

    4.1. Zero-Shot, One-Shot, and Few-Shot Classification

    4.2. Text Classification Using Llama and GPT

    4.3. Validation, validation, validation!

    4.4. Examples in the Literature

  5. Summary and conclusion

    5.1. State-of-the-Art in Classification

    5.2. Ethical considerations

    5.3. Conclusion

Machine Learning vs. Deep Learning

Our Text Classification Pipeline

Classic Machine Learning (1990-2013)

Word-Embeddings (2013-2020)

But: Massive advancements in recent years

  1. Massive advancements in how text can be represented as numbers

    • From simple word counts to word embeddings
    • From static to contextual word embeddings (see the small sketch after this list)
    • Increasingly better embedding of meaning
  2. Pretraining and transfer learning

    • Word embeddings can be trained on large-scale corpora
    • Pretrained word embeddings can be fine-tuned (with less training data) and then used for downstream tasks
  3. Transformers and Generative AI

    • Larger and larger “language models”
    • New mechanisms for better embedding (e.g., attention)
    • Conversational frameworks and Generative AI
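
As a minimal illustration of the static-versus-contextual distinction above (a toy sketch with made-up numbers, not a real embedding model): in a static embedding lookup, a word such as “bank” always retrieves the same vector, no matter which sentence it occurs in.

library(tidyverse)

# Toy static embedding table (3 dimensions, invented values)
static_embeddings <- tribble(
  ~word,    ~dim1, ~dim2, ~dim3,
  "bank",     0.2,   0.7,  -0.1,
  "river",    0.1,   0.9,   0.0,
  "money",    0.8,  -0.2,   0.3
)

# "bank" in "river bank" and "bank" in "bank account" retrieve the identical vector
static_embeddings |> filter(word == "bank")

# A contextual model (e.g., BERT) would instead return different vectors
# for "bank" depending on the surrounding words.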

The Rise of Transformers and Transfer Learning

Origin of the Transformer

  • Until 2017, the state of the art in natural language processing relied on deep neural networks (e.g., recurrent neural networks, long short-term memory networks, and gated recurrent units)

  • In a paper called “Attention is all you need”, published in 2017 and cited more than 95,000 times, a team at Google Brain introduced the so-called Transformer

  • It represents a neural-network architecture that learns context and thus meaning by tracking relationships in sequential data, such as the words in this sentence

What was so special about transformers?

  • Transformer models apply an evolving set of mathematical techniques, called attention or self-attention, to detect subtle ways in which even distant data elements in a series influence and depend on each other.

  • The proposed network structure had notable characteristics:

    • No need for recurrent or convolutional network structures
    • Based solely on attention mechanism (stacked on top of one another)
    • Required less training time (can be parallelized)
  • It thereby outperformed prior state-of-the-art models in a variety of tasks

  • The transformer architecture is the backbone of all current large language models and so far drives the entire “AI revolution”

Overview of the architecture

  • The figure on the right represents an abstract overview of a transformer’s architecture

  • It can be used for next-token-predictions

    • the classic example is translation, e.g., English-to-Dutch
    • but also: question-to-answer, text-to-summary, sentence-to-next-word…
    • and thus also for text classification: text-to-label
  • Although models can differ, they generally include:

    • An encoder-decoder framework
    • Word embeddings + positional embedding
    • Attention and self-attention modules

Vaswani et al. 2017

Basic Encoder-Decoder Framework (for Translation)

Stacked Encoders and Decoders

More elaborate encoding of words

Source: Alammar, 2018

Inside of an encoder and a decoder

  • The word, position, and time signal embeddings are passed to the first encoder

  • Here, they flow through a self-attention layer, which further refines the encoding by “looking at other words” as it encodes a specific word

  • The outputs of the self-attention layer are fed to a feed-forward neural network.

  • The decoder has both of these layers as well, plus an extra attention layer that helps it focus on relevant parts of the input (i.e., the encoder’s outputs)

Source: Alammar, 2018

Putting it all together

Source: Alammar, 2018

Different Types of Models for different tasks

BERT vs. GPT (or Llama)

  • Encoder-Decoder Transformers:

    • BART (Lewis et al., 2019): translation, but also text generation, …
  • Encoder-Only Transformer

    • BERT (Devlin et al., 2019): Embedding-only, then down-stream tasks…
  • Decoder-Only Transformer:

    • GPT-series (OpenAI): Text generation, …
    • Llama (Meta): Text generation…

Text Classification with Transformers, Encoder-Only

Text Classification with Transformers, Decoder-Only

Pre-training, Fine-tuning, and Transfer Learning

  • Generally, transformer models are pre-trained using specific natural language processing tasks

    • Masked Language Modelling (MLM): simply mask some percentage of the input tokens at random and then predict those masked tokens (see the small sketch after this list)
    • Next Sentence Prediction (NSP): predict whether sentence B follows sentence A, to model relationships between sentences
  • The general idea was to use a pre-trained model and then “fine-tune” it on the specific task it is supposed to perform (e.g., annotating text with topics or sentiment)

  • Although the transformer’s architecture has made training more efficient (due to the ability to parallelize), it nonetheless requires significant computing power to fine-tune a model

  • As pre-training often involves tasks that are different from what we want the model to do, this is often referred to as “transfer learning”, i.e., a type of learning that transfers to other tasks as well

  • As transformer-based models become larger and larger, the need for fine-tuning decreases, as they already do well on downstream tasks (e.g., using only prompt engineering)
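
As a rough illustration of the masked language modelling objective, here is a toy base-R sketch of the data side only (not the actual BERT tokenization or pre-processing): a share of the tokens is replaced by a [MASK] token, and the model is trained to recover the original tokens from the surrounding context.

set.seed(123)

# Toy sentence, already tokenized
tokens <- c("the", "transformer", "learns", "context", "by",
            "predicting", "masked", "tokens")

# Randomly select roughly 15% of the positions to mask
mask_positions <- sample(seq_along(tokens),
                         size = ceiling(0.15 * length(tokens)))

masked_input <- tokens
masked_input[mask_positions] <- "[MASK]"
masked_input

# The pre-training task: predict tokens[mask_positions] from masked_input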

Very Large Language Models: Llama and GPT

Source: Christian Behler on Medium

What are large language models

  • A large language model (LLM) is a type of language model notable for its ability to achieve general-purpose language understanding and generation

  • LLMs acquire these abilities by using massive amounts of data to learn billions of parameters during training and consuming large computational resources during their training and operation

  • LLMs are still just a type of artificial neural network (mainly transformers!) and are pretrained using self-supervised and semi-supervised learning (e.g., Masked Language Modelling)

  • As so-called autoregressive language models, they take an input text and repeatedly predict the next token or word (see the toy loop below)
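
To make the autoregressive logic concrete, here is a toy next-token loop in R. The “model” is just a hypothetical lookup table that maps each word to one plausible successor; a real LLM replaces the lookup with a transformer forward pass over the full context, but the generation loop (predict, append, repeat) is the same.

# Toy "language model": each word maps to one plausible next word
next_token <- c("the" = "cat", "cat" = "sat", "sat" = "on", "on" = "the")

generate <- function(prompt, n_tokens = 4) {
  output <- prompt
  for (i in seq_len(n_tokens)) {
    last <- tail(output, 1)
    if (!last %in% names(next_token)) break  # stop if no continuation is known
    output <- c(output, next_token[[last]])  # append the predicted token
  }
  paste(output, collapse = " ")
}

generate("the")
#> [1] "the cat sat on the"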

Current models

Next token prediction (as in GPT-2)

Next token prediction

GPT-Series by OpenAI

  • Generative Pre-trained Transformer (GPT) is a series of state-of-the-art large language models developed by OpenAI.

  • Particularly GPT-3.5, released publicly in November 2022 together with a chat interface (ChatGPT), attracted a lot of public attention

  • Millions of users in a very short amount of time (faster than Facebook, Instagram, TikTok, etc.), now 1.5 billion users

Overview of the GPT-series by OpenAI

Source: Wikipedia

A Peek into the Architecture of GPT

Next-token-Prediction based on input text

Source: Adapted from Alammar, 2018

Intricate Meaning of Words

Core Idea: Better Encoding

  • Based on static word embeddings (which we discussed in the last lecture), the word “Harry” would get the same embedding vector in every context, even though we clearly see that different occurrences refer to different persons

  • The same is true for words like “mole” (which can refer to an animal or a small skin spot) or “model” (e.g., a fashion model vs. a computer model)

  • LLMs like GPT work so well because their architecture allows them to encode additional information into a token’s embedding vector and to take surrounding tokens into account

  • This way, they learn which name (e.g., Harry) refers to which person or which word (e.g., model) refers to which meaning

A large neural network

Source: Alammar, 2018 and 3Blue1Brown, 2024

Many Layers of Attention

Source: Alammar, 2018 and 3Blue1Brown, 2024

General idea behind Attention

  • In general terms, self-attention encodes how similar each word is to all the words in the sentence, including itself

  • Once the similarities are calculated, they are used to determine how the transformer should update the embedding of each word (see the matrix-algebra sketch below)
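
These two steps can be written out as a few lines of matrix algebra. The sketch below is a deliberately simplified single self-attention step in base R (scaled dot-product attention in which the raw embeddings serve as queries, keys, and values; real transformers learn separate projection matrices for each role and run many such attention heads in parallel).

# Toy embeddings: 3 tokens, 4 dimensions (one row per token)
E <- matrix(c( 0.1, 0.3, -0.2, 0.5,
               0.0, 0.4,  0.1, 0.2,
              -0.3, 0.2,  0.6, 0.1),
            nrow = 3, byrow = TRUE)

softmax <- function(x) exp(x) / sum(exp(x))

# 1. How similar is each token to every other token? (scaled dot products)
scores <- (E %*% t(E)) / sqrt(ncol(E))

# 2. Turn each row of similarities into weights that sum to 1
weights <- t(apply(scores, 1, softmax))

# 3. Each token's updated embedding is a weighted average of all embeddings
E_updated <- weights %*% E
E_updated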

Attention in Detail

Attention in Detail: Static Embeddings

Inspiration: 3Blue1Brown, 2024

Attention in Detail: Positional encoding

Inspiration: 3Blue1Brown, 2024

Attention in Detail: Mechanism

Inspiration: 3Blue1Brown, 2024

Updating word embeddings in the k-dimensional space

Inspiration: 3Blue1Brown, 2024

Summary of the GPT Architecture

  • This was only a very short peek into the architecture of GPT (and, to a degree, into large language models more generally)

  • It is a decoder-only model designed for text generation

  • It is based on static word embeddings that are updated via positional encoding and many attention layers, which can even encode distant relationships between words and sentences

  • In principle, it is just a lot of matrix algebra: we are constantly updating a multitude of vectors that represent words and their meaning in the context of a sentence and the entire text

  • Now we can ask how we can use these models for text classification tasks

Using LLMs for Text Classification

Text Classification with Large Language Models

Zero-Shot, One-Shot, and Few-Shot Classification

  • Classic machine learning models (last lecture) are typically trained on a specific set of classes, and their performance is evaluated on the same set of classes during testing

  • LLMs have the ability to perform a task or make predictions on a set of classes that they have never seen or been explicitly trained on

  • In other words, the model can generalize its pre-trained knowledge to new, unseen tasks without being trained for those tasks

  • Depending on the type of prompt engineering, i.e., how we describe the task to the LLM, we differentiate three types of classification (see the one-shot sketch after this list):

    • Zero-Shot: No examples are given
    • One-Shot: One example is given
    • Few-Shot: More than one example is given
  • It is not straightforward which strategy works best for which task.

  • At times, zero-shot classification works well, as the model is not too tied to the examples
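
As a sketch of how these strategies differ in practice, the snippet below builds a one-shot version of the classification prompt used later in this lecture with glue(); the example abstract and its label are invented here purely for illustration. A zero-shot prompt would drop the example, and a few-shot prompt would add several.

library(glue)
library(tidyllm)

# One-shot prompt: task description plus a single worked example
one_shot_prompt <- glue(
  "Identify the discipline based on this abstract.
   Pick one of the following fields and respond only with its name:
   Physics, Computer Science, Statistics.

   Example abstract: We prove consistency of a new estimator for
   high-dimensional regression under sparsity assumptions.
   Example answer: Statistics

   Abstract: {abstract}",
  abstract = "We propose a convolutional architecture for rotation-invariant symbol recognition.")

one_shot_prompt |>
  llm_message() |>
  chat(ollama(.model = "llama3", .temperature = 0))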

Overview of classification without training

How to work with LLMs

  • There are generally three ways in which we can work with LLMs:

    • Access open-source models via the Hugging Face API (low rate limits per minute, but still useful)
    • Access models such as GPT via their respective APIs (limited rates per minute, and costs per token)
    • Download and use models via “ollama” on our own computer (requires some disk space and can be computationally intensive)
  • In this course, we are going to “play around” with GPT-3.5 and GPT-4 via the OpenAI API and (if possible) with smaller models downloaded via ollama

    • The package tidyllm provides a straightforward workflow that integrates with the tidy-style analyses that we already know
    • An API key for OpenAI will be provided by the teachers
    • A tutorial on how to install ollama will be provided

A small example in tidyllm

  • In the package tidyllm, we can create a prompt using the function `llm_message()`

  • We then pass this prompt to a chat and specify the model we want to use (this can be a local model, like llama, or a model on a server, like GPT)

library(tidyverse)
library(tidymodels)
library(glue)
library(tidyllm)

llm_message("What is this sentence about: To be, or not to be: that is the question.") |> 
  chat(ollama(.model = "llama3", .temperature = 0)) 
Message History:
system: You are a helpful assistant
--------------------------------------------------------------
user: What is this sentence about: To be, or not to be: that is the question.
--------------------------------------------------------------
assistant: A classic!

This sentence is from the opening of William Shakespeare's play "Hamlet", Act 3, Scene 1. It is a famous soliloquy spoken by Prince Hamlet himself.

The sentence is about the existential crisis and philosophical dilemma that Hamlet is grappling with. He is contemplating whether to take action against his uncle Claudius, who has murdered his father (Hamlet's king) and taken the throne for himself. Hamlet is torn between two options:

1. To be: to exist, to live, to take action, and potentially face the consequences of doing so.
2. Not to be: to not exist, to die, to avoid the troubles and uncertainties of life.

The sentence sets the tone for the rest of the play, which explores themes of mortality, morality, and the human condition.
--------------------------------------------------------------

Zero-Shot Classification: Prompt Engineering

  • In a zero-shot classification framework, we simply provide the task.

  • Additionally, we can provide the potential answer options, which helps streamline the parsing of the output (otherwise, we get full-text answers due to the next-token-prediction logic of LLMs!)

set.seed(42)

# We create a small test data set
test_data <- science_data |> 
  sample_n(size = 100)

# We create a codebook that includes all texts.
codebook <- glue("Identify the discipline based on this abstract: {abstract}
      
                  Pick one of the following numerical codes from this list. 
                  Respond only with the code!
                  
                  1 = Computer Science
                  2 = Physics
                  3 = Statistics",
                  abstract = test_data$text)

# Create a list of llm-message prompts
classification_task <- map(codebook, llm_message)
  • This creates a list that already contains all prompts, one per text

Zero-Shot Classification: Sequential Prompting

  • Next, we create a function that sends the message via the API to the relevant model

  • Then, we use the function pmap_dfr() to “map” this function across our list of prompts

# Create a function to map across the prompts
classify_sequential_llama <- function(texts, message){
    raw_code <- message |>
        chat(ollama(.model = "llama3", .temperature = .0)) |>
        get_reply() 
    tibble(text = texts, label = raw_code) 
}

# Run the classification by sequentially prompting LLama3
results_llama <- tibble(texts = test_data$text, 
                        message = classification_task) |> 
  pmap_dfr(classify_sequential_llama, .progress = T) 

As a result, we get a data set with the predicted code for each text.

head(results_llama)
# A tibble: 6 × 2
  text                                                                     label
  <chr>                                                                    <chr>
1 "Optimal stopping via reinforced regression In this note we propose a n… 1    
2 "Traces of surfactants can severely limit the drag reduction of superhy… 2    
3 "A Unified Strouhal-Reynolds Number Relationship for Laminar Vortex Str… 2    
4 "Spatio-Temporal Backpropagation for Training High-performance Spiking … 1    
5 "Well quasi-orders and the functional interpretation The purpose of thi… 1    
6 "Concentration of Multilinear Functions of the Ising Model with Applica… 2    

Validation of our Classification with LLama3

  • We can use the same functions from the tidymodels package to get our performance scores

  • Llama3 actually does amazingly well on this task without any prior training!

# Create predict table
predict_llama <- test_data |> 
  bind_cols(results_llama |> select(predicted = label)) |> 
  mutate(truth = factor(label), 
          predicted = factor(case_when(predicted == 1 ~ "computer science",
                                       predicted == 2 ~ "physics",
                                       predicted == 3 ~ "statistics")))


# Define performance scores
class_metrics <- metric_set(accuracy, precision, recall, f_meas)

# Check performance 
predict_llama |> 
  class_metrics(truth = truth, estimate = predicted)
# A tibble: 4 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 accuracy  multiclass     0.9  
2 precision macro          0.780
3 recall    macro          0.814
4 f_meas    macro          0.795

Testing with other models

Our neural network from last week:

# Create recipe
rec <- recipe(label ~ text, 
              data = science_data) |>               
  step_tokenize(text, 
                options = list(strip_punct = T,             
                              strip_numeric = T)) |>       
  step_stopwords(text, language = "en") |>                        
  step_tokenfilter(text, min_times = 20, 
                   max_tokens = 1000) |>                
  step_tf(all_predictors()) |> 
  step_normalize(all_predictors())   

# Create workflow (nnet_spec is the model specification from last week's lecture)
ann_workflow <- workflow() |>
  add_recipe(rec) |>    
  add_model(nnet_spec)

# Create predict table (m_ann is the model fitted last week)
predict_nn <- predict(m_ann, test_data) |> 
   bind_cols(test_data) |> 
   mutate(truth = factor(label),
          predicted = .pred_class)

The same classification with GPT-4:

# Create a function to map across the prompts
classify_sequential_gpt4 <- function(texts, message){
    raw_code <- message |>
        chat(openai(.model = "gpt-4", 
                    .temperature = .0)) |>
        get_reply() 
    tibble(text = texts, label = raw_code) 
}

# Run commands
results_gpt <- tibble(texts = test_data$text, 
                      message = classification_task) |> 
  pmap_dfr(classify_sequential_gpt4, .progress = T) 

# Create predict table
predict_gpt <- test_data |> 
  bind_cols(results_gpt |> 
            select(predicted = label)) |> 
  mutate(truth = factor(label), 
          predicted = parse_number(predicted),
          predicted = factor(case_when(predicted == 1 ~ "computer science",
                                       predicted == 2 ~ "physics",
                                       predicted == 3 ~ "statistics")))

Comparison

As we can see, both LLMs perform almost as well as our neural network; GPT-4 even does better, despite not having been trained on any of the data!

bind_rows(
   predict_llama |> 
   class_metrics(truth = truth, 
                 estimate = predicted) |> 
   mutate(model = "LLM: llama3"),
   predict_nn |> 
   class_metrics(truth = truth, 
                 estimate = predicted) |> 
   mutate(model = "Neural Network"),
   predict_gpt |> 
   class_metrics(truth = truth, 
                 estimate = predicted) |> 
   mutate(model = "LLM: gpt-4")
) |> 
  ggplot(aes(x = .metric, y = .estimate, 
             fill = model)) +
  geom_col(position = position_dodge(), alpha = .5) +
  geom_text(aes(label = round(.estimate, 3)), 
            position = position_dodge(width = 1)) +
  ylim(0, 1) +
  coord_flip() +
  scale_fill_brewer(palette = "Dark2") +
  theme_minimal(base_size = 18) +
  labs(y = "Performance Score", x = "", fill = "")

Examples in the literature

  • Balluff et al. (2023) investigated a recent case of media capture, a mutually corrupting relationship between political actors and media organizations.

  • This case involves a former Austrian chancellor who allegedly colluded with a tabloid newspaper to receive better news coverage in exchange for increased ad placements by government institutions.

  • They implemented an automated content analysis (using BERT) of political news articles from six prominent Austrian news outlets spanning 2012 to 2021 (n = 188,203). Using a difference-in-differences approach, they scrutinized political actors’ visibility and favorability in news coverage for patterns indicative of the alleged serious breach of professional political and journalistic norms.

Methods

  • They used a German-language GottBERT model (Scheible et al., 2020) that they further fine-tuned for the task using publicly available data from the AUTNES Manual Content Analysis of the Media Coverage 2017 and 2019 (Galyga et al., 2022; Litvyak et al., 2022c)

  • A comparatively difficult task, but they were able to reach a satisfactory F1 score of 0.77 (precision = 0.77, recall = 0.77).

Findings

  • The findings indicate a substantial increase in the news coverage of the former Austrian chancellor within the news outlet that is alleged to have received bribes.

  • In contrast, several other political actors did not experience similar shifts in visibility, nor were similar patterns identified in other media outlets.

Summary and conclusion

A Look Back at the Chronology of NLP

Explosion in model size?

Performance depends strongly on scale

Kaplan et al, 2020

  • Performance depends strongly on scale, weakly on model shape

  • Model performance is mostly related to scale, which consists of three factors (see the power-law form after this list):

    • the number of model parameters (excluding embeddings)

    • the size of the dataset

    • the amount of compute used for training

  • Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width
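
For orientation, the relationships reported by Kaplan et al. (2020) are power laws of (approximately) the form

L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}

where L is the test loss, N the number of non-embedding parameters, D the dataset size in tokens, and C the training compute; the fitted exponents are small (on the order of 0.05 to 0.1), which is why the loss falls smoothly but slowly as scale increases.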

Environmental Impact

Ethical considerations

  • Training large language models requires significant computational resources, contributing to a substantial carbon footprint.

  • LLMs can inherit and perpetuate biases present in their training data, which can result in the generation of biased or unfair content, reflecting and potentially amplifying societal biases and stereotypes.

  • Developers and users must be aware of the potential for bias and take steps to mitigate it during model training and deployment.

  • The fact that some LLMs are developed, trained, and employed behind closed doors causes yet another ethical dilemma in using them!

Reminder: Guidelines

Conclusion

  • Advancements in NLP and AI are fast-paced; it is difficult to keep up

  • LLMs promise immense potential for communication research

  • Yet, large language models can contain biases or even hallucinate!

    • Validation, validation, validation!
  • Also: We already see that more and more content online is AI-generated. What does it mean if, in the future, LLMs are trained on their own output?

A.I. (Steven Spielberg)

Her (Spike Jonze)

Ex Machina (Alex Garland)

And it doesn’t stop here…

  • Large language models like “llava” can also identify and describe images…

llm_message("Describe this picture? Can you guess where it was made?",
                                 .imagefile = "img/england.jpeg") |>
  chat(ollama(.model = "llava", .temperature = 0))
Message History:
system: You are a helpful assistant
--------------------------------------------------------------
user: Describe this picture? Can you guess where it was made?
 -> Attached Media Files:  england.jpeg 
--------------------------------------------------------------
assistant:  The image shows a scene from London, England. There is a blue taxi cab in the foreground with the words "Look Right" on its side, indicating that drivers should look to their right when approaching an intersection, which is standard practice in the United Kingdom due to driving on the left side of the road. In the background, there's a famous landmark known as Big Ben, which is actually the nickname for the Great Bell housed within the clock tower at the north end of the Palace of Westminster, and the Elizabeth Tower, which houses Big Ben, is visible in the distance. The sky is overcast, suggesting it might be a cool or cloudy day. There are people walking on the sidewalks, and the overall atmosphere suggests a typical day in London with some tourists around. 
--------------------------------------------------------------

Thank you for your attention!

Required Reading



Kroon, A., Welbers, K., Trilling, D., & van Atteveldt, W. (2023). Advancing Automated Content Analysis for a New Era of Media Effects Research: The Key Role of Transfer Learning. Communication Methods and Measures, 1-21


(available on Canvas)

Reference

  • Alammar, J. (2018). The illustrated Transformer. Retrieved from: https://jalammar.github.io/illustrated-transformer/

  • Andrich, A., Bachl, M., & Domahidi, E. (2023). Goodbye, Gender Stereotypes? Trait Attributions to Politicians in 11 Years of News Coverage. Journalism & Mass Communication Quarterly, 100(3), 473-497. https://doi-org.vu-nl.idm.oclc.org/10.1177/10776990221142248

  • Balluff, P., Eberl, J., Oberhänsli, S. J., Bernhard, J., Boomgaarden, H. G., Fahr, A., & Huber, M. (2023, September 15). The Austrian Political Advertisement Scandal: Searching for Patterns of “Journalism for Sale”. https://doi.org/10.31235/osf.io/m5qx4

  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

  • Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., … & Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

  • Kroon, A., Welbers, K., Trilling, D., & van Atteveldt, W. (2023). Advancing Automated Content Analysis for a New Era of Media Effects Research: The Key Role of Transfer Learning. Communication Methods and Measures, 1-21

  • Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., … & Lample, G. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

Example Exam Question (Multiple Choice)

How are word embeddings learned?


A. By assigning random numerical values to each word

B. By analyzing the pronunciation of words

C. By scanning the context of each word in a large corpus of documents

D. By counting the frequency of words in a given text

Example Exam Question (Multiple Choice)

How are word embeddings learned?


A. By assigning random numerical values to each word

B. By analyzing the pronunciation of words

C. By scanning the context of each word in a large corpus of documents (correct answer)

D. By counting the frequency of words in a given text

Example Exam Question (Open Format)

What does zero-shot learning refer to in the context of large language models?


In the context of large language models, zero-shot learning refers to the ability of a model to perform a task or make predictions on a set of classes or concepts that it has never seen or been explicitly trained on. Essentially, the model can generalize its knowledge to new, unseen tasks without specific examples or training data for those tasks.

In traditional machine learning, models are typically trained on a specific set of classes, and their performance is evaluated on the same set of classes during testing. Zero-shot learning extends this capability by allowing the model to handle tasks or categories that were not part of its training set.

In the case of large language models like GPT-3, which is trained on a diverse range of internet text, zero-shot learning means the model can understand and generate relevant responses for queries or prompts related to concepts it hasn’t been explicitly trained on. This is achieved through the model’s ability to capture and generalize information from the vast and varied data it has been exposed to during training.