Content of this Lecture
Machine Learning vs. Deep Learning
1.1. Reminder: Machine Learning Text classification Pipeline
1.2. How did the field move on?
From Term Frequencies to Word-Embeddings
2.1. Basic principles
2.2. Similarity of words and texts
2.3. Pre-trained Word-Embeddings
2.4. Text Classification Pipeline with Word-Embeddings
2.5. Examples in the literature
The Rise of Transformers and Transfer Learning
3.1. Overview and Principles
3.2. Architecture of the Transformer Model
3.3. The Transformer/LLM Text Classification Pipeline
3.4. The Concept of Fine-Tuning
Large Language Models: BERT, GPT and the “AI Revolution”?
4.1. What are Large Language Models and Generative AI?
4.2. A Peek into the Architecture of Famous LLMs
4.3. Zero-Shot Text Classification Using BERT and GPT
4.4. Validation, validation, validation!
4.5. Examples in the Literature
Summary and conclusion
5.1. A Look Back at the Chronology of NLP
5.2. State-of-the-Art in Classification
5.3. Ethical considerations
5.4. Conclusion
Machine Learning vs. Deep Learning
Text Classification Pipeline
Machine Learning (1990-2010)
And now?
Massive advancements in recent years
Massive advancement in how text can be represented as numbers
From simple word counts to word embeddings
Static vs. contextual word embeddings
Pretraining and transfer learning
Word embeddings can be trained on large-scale corpora
Pretrained word embeddings can be fine-tuned (requiring less training data) and then used for downstream tasks
Transformers and Generative AI
Larger and larger “language models”
Ever more powerful in solving rather complex tasks
Conversational frameworks (e.g., GPT)
From Term Frequencies to Word-Embeddings
More powerful and informative ways to represent text as numbers.
Remember: The initial problem of text analysis
Computers don’t read text; they can only deal with numbers
For this reason, we have so far tokenized our texts (e.g., into words) and counted their frequencies across texts to create a document-feature matrix within the bag-of-words model
Such a text representation has some issues:
Treats words as equally important (→ requires removal of noise, stopwords…)
Ignores word order and context
Results in a sparse matrix (→ computationally expensive)
Alternative: Map words into a vector space
(Static) Word embeddings
Word embeddings are a “learned” type of word representation that allows words with similar meaning to have a similar representation via a k-dimensional vector space
The first core idea behind word embeddings is that the meaning of a word can be expressed using a relatively small embedding vector, generally consisting of around 300 numbers which can be interpreted as dimensions of meaning.
The second core idea is that these embedding vectors can be derived by scanning the context of each word in millions and millions of documents.
This means that words that are used in similar ways in the training data result in similar representations, thereby capturing their similar meaning.
This can be contrasted with the crisp but rather limited representation of words in a bag of words model where different words have different representations, regardless of how they are used.
How do we get these “values” for each word?
All word embedding methods learn a real-valued vector representation for a predefined, fixed-size vocabulary from a corpus of text.
There are many different ways to “learn” or use word embeddings:
Via an embedding layer in a neural network designed for a particular downstream task
Learning word embeddings using a shallow neural network and context windows (e.g., word2vec)
Learning word embeddings by aggregating a global word-word co-occurrence matrix (e.g., GloVe)
Word2Vec: Continuous Bag-of-words (CBOW)
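The core idea of CBOW can be made concrete with a toy example: running text is turned into (context window → target word) prediction tasks, and the word embeddings are whatever representations help a shallow network solve these tasks. A minimal sketch of the data setup only (not the model itself; tokens and window size are illustrative):

# Toy illustration: build CBOW training examples (context -> target) from a sentence
tokens <- c("the", "cat", "sat", "on", "the", "mat")
window <- 2

cbow_pairs <- lapply(seq_along(tokens), function(i) {
  idx <- setdiff(max(1, i - window):min(length(tokens), i + window), i)
  list(context = tokens[idx], target = tokens[i])
})

cbow_pairs[[3]]   # context: "the" "cat" "on" "the"  ->  target: "sat"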
Pre-trained Word embeddings: GloVe
glove_fn ="glove.6B.50d.10k.w2v.txt"url = glue::glue("https://cssbook.net/d/{glove_fn}")if (!file.exists(glove_fn)) download.file(url, glove_fn)# Data wranglingwv_tibble <-read_delim(glove_fn, skip=1, delim=" ", quote="", col_names =c("word", paste0("d", 1:50)))# 10 highest scoring words on dimension 1wv_tibble |>arrange(-d1) |>select(1:10)
# A tibble: 5 × 2
  word       similarity
  <chr>           <dbl>
1 basketball      1
2 football        0.879
3 hockey          0.862
4 baseball        0.861
5 nba             0.838
wv_similar(wv, wvector(wv, "netherlands"))
# A tibble: 5 × 2
  word        similarity
  <chr>            <dbl>
1 netherlands      1
2 belgium          0.893
3 switzerland      0.821
4 denmark          0.809
5 france           0.789
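Such similarity scores are typically cosine similarities between the embedding vectors (the course’s wv_similar() helper may differ in details). A minimal sketch using the GloVe vectors loaded above; the cos_sim() helper is defined here just for illustration:

# Cosine similarity between two word vectors
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Turn the tibble into a matrix with words as row names
wv <- as.matrix(wv_tibble[, -1])
rownames(wv) <- wv_tibble$word

cos_sim(wv["netherlands", ], wv["belgium", ])   # high: used in similar contexts
cos_sim(wv["netherlands", ], wv["music", ])     # much lower: unrelated contexts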
Similarity of entire sentences or texts
But we can also generalize word embeddings to entire sentences (or even texts):
library(tidyverse)
library(ccsamsterdamR)

# Example sentences
movies <- tibble(sentences = c(
  "This movie is great, I loved it.",
  "The film was fantastic, a real treat!",
  "I did not like this movie, it was not great.",
  "Today, I went to the cinema and watched a movie",
  "I had pizza for lunch."
))

# Get embeddings from a sentence transformer
movie_embeddings <- hf_embeddings(txt = movies$sentences)

# Each text now has 384 values
movie_embeddings
The subtle similarity of some texts in the example
We can see that text 2 is most similar to text 1: Both express a very similar sentiment, just with different words (“great” ≈ “fantastic”; “I loved it” ≈ “A real treat”)
Text 2 is still similar to text 3 (after all it is about movies), but less so compared to text 1 (“fantastic” is the opposite of “not great”)
Text 4 still shares similarities (the context is the cinema/watching movies), but text 5 is very different as it doesn’t contain similar words and is not about similar things (except “I”).
# Similarity between 2nd and the other sentences
movies |>
  mutate(similarity = as.matrix(movie_embeddings) %*% t(as.matrix(movie_embeddings)[2, , drop = F]))
# A tibble: 5 × 2
  sentences                                       similarity[,1]
  <chr>                                                    <dbl>
1 This movie is great, I loved it.                         0.646
2 The film was fantastic, a real treat!                    1.00
3 I did not like this movie, it was not great.             0.513
4 Today, I went to the cinema and watched a movie          0.372
5 I had pizza for lunch.                                   0.132
Word Embeddings as Input
Word, sentence, or text embedding vectors can then be used as features in further text analysis tasks
Think about the example we just investigated: the sentence embeddings did capture some difference between:
“The film was fantastic, a real treat!” (positive)
“I did not like this movie, it was not great.” (negative)
Approaches from the last lecture (classic machine learning) would have a hard time detecting the negation “not great”.
Yet, bear in mind: we do not know what the 100+ (often >300) dimensions actually mean (→ we cannot look under the hood later!)
From Sparse to Dense Matrix Representation
Using embedding vectors instead of word frequencies further has the advantages of strongly reducing the dimensionality of the DTM: instead of (tens of) thousands of columns for each unique word we only need hundreds of columns for the embedding vectors (→ dense instead of sparse)
This means that further processing can be more efficient as fewer parameters need to be fit, or conversely that more complicated models can be used without blowing up the parameter space.
Text classification with Word-Embeddings
Doing analyses with word embeddings themselves
Using vector arithmetic, we can also analyse semantic relationships between words (see the sketch below)
This is by now a quite common approach in research on gender and other types of stereotypes
Source: https://developers.google.com
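A well-known illustration is the analogy task (king − man + woman ≈ queen), which can be reproduced with plain vector arithmetic. A minimal sketch, reusing the wv matrix and the cos_sim() helper from the earlier sketch (word choices are illustrative):

# Analogy via vector arithmetic: king - man + woman should land near "queen"
target <- wv["king", ] - wv["man", ] + wv["woman", ]

# Rank all words by cosine similarity to the target vector
sims <- apply(wv, 1, cos_sim, b = target)
head(sort(sims, decreasing = TRUE), 5)   # "queen" should appear near the top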
Example from the literature: Gender-Stereotypes
Andrich et al. (2023) examine stereotypical traits in portrayals of 1,095 U.S. politicians.
Analyzed 5 million U.S. news stories published from 2010 to 2020 to study gender-linked (feminine, masculine) and political (leadership, competence, integrity, empathy) traits
Methodologically, they estimated word embeddings using the Continuous Bag of Words (CBOW) model, meaning that a target word (e.g., honest) is predicted from its context (e.g., Who thinks President Trump is [target word]?)
Bias can thus be identified if e.g., gender-neutral words (e.g., competent) are closer to words that represent one gender (e.g., donald_trump) than to words that represent the opposite gender (e.g., hillary_clinton).
Results
All three masculine traits were more strongly associated with male politicians.
In contrast, only the feminine physical traits were more strongly associated with female politicians.
Differences remained stable across time.
The Rise of Transformers and Transfer Learning
Origin of Transformer Models
Until 2017, the state of the art in natural language processing relied on deep neural networks (e.g., recurrent neural networks, long short-term memory, and gated recurrent neural networks)
In the paper “Attention is all you need”, published in 2017 and cited more than 95,000 times, a team at Google Brain introduced the so-called Transformer, a neural network architecture that learns context and thus meaning by tracking relationships in sequential data such as the words in this sentence.
Transformer models apply an evolving set of mathematical techniques, called attention or self-attention, to detect subtle ways even distant data elements in a series influence and depend on each other.
The proposed network structure had notable characteristics:
No need for recurrent or convolutional network structures
Based solely on attention mechanism (stacked on top of one another)
Requires less training time (can be parallelized)
Outperformed prior state-of-the-art models in a variety of tasks
Overview of the architecture
The figure on the right represents an abstract overview of a transformer’s architecture
It can be used for sequence-to-sequence predictions
A classic example is translation (e.g., English-to-Dutch)
but also: question-to-answer, text-to-summary, sentence-to-next-sentence…
Although models can differ, they generally include:
An encoder-decoder framework
Word embeddings + positional embedding
Attention and self-attention modules
We won’t cover any of these in much detail and just aim for a high-level understanding
Vaswani et al. 2017
Basic Encoder-Decoder Framework (for Translation)
Stacked Encoders and Decoders
More elaborate encoding of words
Source: Alammar, 2018
Inside of an encoder and a decoder
The word, position, and time signal embeddings are passed to the first encoder
Here, they flow through a self-attention layer, which further refines the encoding by “looking at other words” as it encodes a specific word
The outputs of the self-attention layer are fed to a feed-forward neural network.
The decoder has both layers as well, but also an extra attention layer that helps it focus on different parts of the input (e.g., the encoder’s outputs)
Source: Alammar, 2018
Encoding pipeline
The embedding only happens in the bottom-most encoder.
In other encoders, it would be the output of the encoder that’s directly below.
Source: Alammar, 2018
Self-Attention
In general terms, self-attention encodes how similar each word is to all the words in the sentence, including itself.
Once the similarities are calculated, they are used to determine how the transformer encodes each word.
Self-Attention at a High Level
As the encoder processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.
On the decoder side, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.
The actual architecture of this step is incredibly complex, so we keep it at that for now.
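Leaving the full architecture aside, the core computation (scaled dot-product attention, Vaswani et al., 2017) can be sketched in a few lines. This is a toy illustration with random matrices; all names and sizes here are made up for the example:

# Toy scaled dot-product attention for 4 tokens with 8-dimensional representations
set.seed(1)
n_tokens <- 4
d_k <- 8
Q <- matrix(rnorm(n_tokens * d_k), n_tokens)   # queries
K <- matrix(rnorm(n_tokens * d_k), n_tokens)   # keys
V <- matrix(rnorm(n_tokens * d_k), n_tokens)   # values

scores   <- Q %*% t(K) / sqrt(d_k)             # similarity of each token to every token
weights  <- exp(scores) / rowSums(exp(scores)) # softmax: each row sums to 1
attended <- weights %*% V                      # each token becomes a weighted mix of values

round(weights, 2)

In a real transformer, Q, K, and V are learned linear projections of the token embeddings, and several such attention “heads” run in parallel.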
Putting it all together
Source: Alammar, 2018
High-level process
The transformer starts by creating word embeddings (combining the word, position, and time signal embeddings)
The encoders then process the input sequence (the embeddings).
The output of the top encoder is then transformed into a set of attention vectors that are used by each decoder in its “encoder-decoder attention” layer, which helps the decoder focus on appropriate places in the input sequence
The decoder produces a first output token (e.g., the word “I”), which then becomes part of the decoder’s input in the following steps
The decoder repeats these steps until a special symbol (e.g., <eos> = “end of sentence”) is produced.
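In pseudo-code, this decoding loop looks roughly as follows; decode_step() is a hypothetical stand-in for the full decoder stack, not a real function:

# Hypothetical sketch of autoregressive decoding
output <- character(0)
repeat {
  next_token <- decode_step(encoder_output, output)  # attends to encoder output + tokens so far
  if (next_token == "<eos>") break                   # stop at the end-of-sentence symbol
  output <- c(output, next_token)
}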
Text Classification Pipeline Using Transformers
Pre-training and transfer learning
Generally, transformer models are pre-trained using specific natural language processing tasks
It has been shown that this (often self-supervised or unsupervised) pre-training is often sufficient to let the model perform well on many downstream tasks
However, the general idea would be to use a pre-trained model and then “fine-tune” it on the specific tasks it is supposed to perform (e.g., annotating text with topics or sentiment)
Although the transformer’s architecture has made training more efficient (due to the ability to parallelize), it nonetheless requires significant computing power to fine-tune a model
Luckily, transformers are the backbone of today’s large language models, which - due to their immense training - are able to perform very well on most tasks without any specific fine-tuning (so-called zero-shot learning).
As pre-training often involves tasks that differ from what we ultimately want the model to do, this is often referred to as “transfer learning”: a type of learning that transfers to other tasks as well
Large Language Models: BERT, GPT and the “AI Revolution”?
A large language model (LLM) is a type of language model notable for its ability to achieve general-purpose language understanding and generation.
LLMs acquire these abilities by using massive amounts of data to learn billions of parameters during training and consuming large computational resources during their training and operation.
LLMs are still just a type of artificial neural networks (mainly transformers!) and are (pre-)trained using self-supervised learning and semi-supervised learning.
Many of them are so-called autoregressive language models: they take an input text and repeatedly predict the next token or word.
Fine-tuning vs. zero-shot learning
Up to around 2020, fine-tuning was the only way a model could be adapted to accomplish specific tasks.
Larger models (since BERT, GPT-2, and GPT-3), however, can be prompt-engineered to achieve similar results.
They are thought to acquire embodied knowledge about syntax, semantics and “ontology” inherent in human language corpora, but also inaccuracies and biases present in the corpora.
Notable examples include:
OpenAI’s GPT models (e.g., GPT-3.5 and GPT-4, used in ChatGPT)
Google’s BERT and PaLM (used in Bard),
Meta’s LLaMa
Next token prediction (as in GPT-2)
Different architectures
Encoder-Decoder Transformers:
BART (Lewis et al., 2019): language generation, translation, comprehension,…
Encoder-Only Transformers:
BERT (Devlin et al., 2019): Question answering, language inference,…
Decoder-Only Transformer:
GPT-series (OpenAI)
In the following, we will focus on BERT and GPT
Note: Whereas many large language models are open source AND free to use (e.g., BERT), others are free to use only in a limited capacity and are essentially a black box when it comes to details about the architecture, training procedure, and training data (e.g., GPT-3 or GPT-4).
BERT
Bidirectional Encoder Representations from Transformers (BERT) is a family of language models introduced in October 2018 by researchers at Google.
BERT is an “encoder-only” transformer architecture.
Generally speaking, BERT consists of three modules:
Embedding. This module converts an array of one-hot encoded tokens into an array of vectors representing the tokens.
Stack of encoders. These encoders are the Transformer encoders (BERT-base = 12, BERT-large = 24). They perform transformations over the array of representation vectors.
Un-embedding. This module converts the final representation vectors into one-hot encoded tokens again.
Devlin et al. 2018
Why un-embedding?
The un-embedding module is necessary for pretraining, but it is often unnecessary for downstream tasks.
Here, one takes the representation vectors output at the end of the stack of encoders, uses them as a vector representation of the text input, and trains a smaller model on top of that (technically, anything we covered last lecture!)
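As a hedged sketch of this idea (not code from the lecture): take the embedding vectors for a set of texts, as produced earlier with hf_embeddings(), and fit a simple tidymodels classifier on top of them. The objects texts and labels are assumed to exist, and labels is assumed to be a binary factor:

library(tidyverse)
library(tidymodels)
library(ccsamsterdamR)

# One embedding vector per text (as in the movie example earlier)
emb <- as.matrix(hf_embeddings(txt = texts))
colnames(emb) <- paste0("dim", seq_len(ncol(emb)))

# Combine gold labels and embedding features
train_data <- bind_cols(tibble(label = factor(labels)), as_tibble(emb))

# Any classifier from last lecture would do; logistic regression as a placeholder
fit <- workflow() |>
  add_model(logistic_reg()) |>
  add_formula(label ~ .) |>
  fit(data = train_data)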
Training and Fine-Tuning of BERT
Pretrained on two tasks:
Masked Language Modelling (MLM): simply mask some percentage of the input tokens at random, and then predict those masked tokens
Next sentence prediction (NSP): predict sentence B from sentence A to model relationships between sentences
This pre-training led to strong performance in downstream tasks
Source: Devlin et al., 2018
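To make the masked language modelling objective concrete, here is a toy illustration of the data setup (tokens and masking rate are illustrative; BERT masks about 15% of word-piece tokens):

# Toy illustration of masked language modelling: hide tokens, predict them from context
tokens <- c("senate", "passes", "stopgap", "bill", "to", "avert", "government", "shutdown")
set.seed(3)
masked_pos <- sample(seq_along(tokens), size = 2)

masked_input <- replace(tokens, masked_pos, "[MASK]")
masked_input         # what the model sees
tokens[masked_pos]   # what it has to predict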
GPT-Series by OpenAI
Generative Pre-trained Transformer (GPT) is a family of state-of-the-art large language models developed by OpenAI.
Particularly GPT-3.5, released publicly in November 2022 together with a chat interface (ChatGPT), attracted a lot of public attention.
Millions of users within a very short time (faster than Facebook, Instagram, TikTok, etc.), now around 1.5 billion users
High-level architecture and training of GPT
GPT models are decoder-only transformers
Source: Wikipedia
Zero-shot Learning
Zero-shot learning refers to the ability of a model to perform a task or make predictions on a set of classes or concepts that it has never seen or been explicitly trained on.
In other words, the model can generalize its pre-trained knowledge to new, unseen tasks without specific examples or training data for those tasks.
Classic machine learning models (last lecture) are typically trained on a specific set of classes, and their performance is evaluated on the same set of classes during testing.
Zero-shot learning extends this capability by allowing the model to handle tasks or categories that were not part of its training set.
This works because of the model’s ability to capture and generalize information from the vast and varied data it has been exposed to during training.
Zero-Shot Text Classification with LLMs
Many LLMs are available at huggingface.co
Workflow
There are generally two ways in which we can work with these models:
Access the models via the Hugging Face API (rate-limited per minute, but still useful)
Download and use the model on our own computer/GPU (requires Python, can be computationally intensive)
In this course, we are going to “play around” with BERT models via the Hugging Face API
We created a set of simple functions that allow you to use different types of models for different purposes
To use them, it makes sense to create an account on Hugging Face and create an access token (homework!; see the brief note after this list)
For the project-phase, I will provide a more elaborate tutorial on how to use such models on your computer (but you don’t have to, if you don’t want to)
This will require installing Python and using it via R
Generally a bit more complicated
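As a hedged pointer (the exact variable name expected by the ccsamsterdamR functions may differ), such an access token is usually stored once in your R environment rather than hard-coded in scripts:

# Hypothetical sketch: store the Hugging Face access token as an environment variable
usethis::edit_r_environ()     # opens .Renviron; add a line such as HF_API_TOKEN=hf_xxxxxxxx
Sys.getenv("HF_API_TOKEN")    # after restarting R, check that the token is available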
Using a BERT model to zero-shot topics
library(tidyverse)
library(ccsamsterdamR)

example_corpus <- tibble(
  id = c(1:6),
  text = c(
    "To be, or not to be: that is the question.",
    "An atom is a particle that consists of a nucleus of protons and neutrons.",
    "Senate passes stopgap bill to avert government shutdown.",
    "S&P 500 ends Friday slightly higher, major averages cruise to third week of gains: Live updates",
    "Joe Burrow out for season: League to investigate Bengals over injury, looking whether team violated NFL policy",
    "Joe Biden attended the final NFL game together with French president Emmanuel Macron."
  )
)

output <- hf_zeroshot(
  txt = example_corpus$text,
  labels = c("Poetry", "Politics", "Physics", "Finance", "Sport"),
  url = "https://api-inference.huggingface.co/models/MoritzLaurer/deberta-v3-large-zeroshot-v1"
)

results <- output |>
  unnest(cols = c("labels", "scores")) |>
  as_tibble() |>
  spread(labels, scores)
results
# A tibble: 6 × 6
  sequence                                  Finance Physics Poetry Politics  Sport
  <chr>                                       <dbl>   <dbl>  <dbl>    <dbl>  <dbl>
1 An atom is a particle that consists of…   0.0003  0.999   0.0003   0.0003 0.0003
2 Joe Biden attended the final NFL game …   0.0003  0.0002  0.0002   0.116  0.884
3 Joe Burrow out for season: League to i…   0       0       0        0      1.00
4 S&P 500 ends Friday slightly higher, m…   1.00    0.0001  0.0001   0.0001 0.0001
5 Senate passes stopgap bill to avert go…   0.0003  0.0001  0.0001   0.999  0.0001
6 To be, or not to be: that is the quest…   0.0625  0.101   0.687    0.0829 0.0665
Using GPT for text classification
The simplest way to use GPT is to simply paste text into the chat and ask for the relevant topics (in fact, let’s try this now!)
Yet, for larger text corpora this is not feasible
To use GPT for text analysis purposes, we need to access the models via the OpenAI API
We created functions that allow you to use the GPT models similarly to how we used the Hugging Face models
Requires paying per token (luckily not too expensive; we will provide some accounts)
Using GPT-3.5-turbo for a simple topic classification
We simply pass the corpus to the function gpt_zeroshot() included in the package ccsamsterdamR
We further pass a vector of labels that we want GPT to classify the text with.
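A hedged sketch of what such a call could look like (argument names follow the gpt_zeroshot() call shown later in this lecture; the exact arguments, e.g. for requesting the justification column in the output below, may differ):

# Hypothetical sketch of a zero-shot topic classification call with GPT
output <- gpt_zeroshot(
  txt = example_corpus$text,
  labels = c("Poetry", "Politics", "Physics", "Finance", "Sport"),
  model = "gpt-3.5-turbo"
)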
# A tibble: 6 × 4
     id text                                                                                                            labels   justification
  <dbl> <chr>                                                                                                           <chr>    <chr>
1     1 To be, or not to be: that is the question.                                                                      Poetry   Contains famous line from Shakespeare's Hamlet.
2     2 An atom is a particle that consists of a nucleus of protons and neutrons.                                       Physics  Describes basic concept of an atom.
3     3 Senate passes stopgap bill to avert government shutdown.                                                        Politics Refers to a specific legislative action in the Senate.
4     4 S&P 500 ends Friday slightly higher, major averages cruise to third week of gains: Live updates                 Finance  Covers stock market performance and S&P 500.
5     5 Joe Burrow out for season: League to investigate Bengals over injury, looking whether team violated NFL policy  Sport    Reports on injury of football player Joe Burrow.
6     6 Joe Biden attended the final NFL game together with French president Emmanuel Macron.                           Politics Describes diplomatic engagement between Joe Biden and Emmanuel Macron at NFL game.
Comparison with classic machine learning
Let’s compare the zero-shot performance of GPT-3.5 against our best neural network classifier from last week in predicting genre from music lyrics.
To make things easier, we only use 87 songs from the lyrics data set.
library(tidyverse)
library(textrecipes)
library(tidymodels)

# Set seed
set.seed(42)

# Read data
lyrics_data <- read_csv("data/lyrics-data-prep.csv") |>
  mutate(binary_genre = factor(
    ifelse(Genre == "Rock", "rock", "other"),
    levels = c("rock", "other")
  )) |>
  sample_frac(size = .001) |>
  select(doc_id = SLink, Artist, Song = SName, Genre, binary_genre, text = Lyric)

head(lyrics_data)
# A tibble: 6 × 6
  doc_id                                      Artist            Song                             Genre binary_genre text
  <chr>                                       <chr>             <chr>                            <chr> <fct>        <chr>
1 /snoop-dogg/213-tha-gangsta-clicc.html      Snoop Dogg        213 Tha Gangsta Clicc            Hip … other        "[Sn…
2 /far-east-movement/jello-feat-rye-rye.html  Far East Movement Jello (feat. Rye Rye)            Pop   other        "Jel…
3 /janet-jackson/got-till-its-gone.html       Janet Jackson     Got 'til It's Gone (feat. Q-Tip… Pop   other        "Wha…
4 /tori-amos/yo-george.html                   Tori Amos         Yo George                        Rock  rock         "I s…
5 /van-morrison/a-sense-of-wonder.html        Van Morrison      A Sense of Wonder                Rock  rock         "I w…
6 /heart/together-now.html                    Heart             Together Now                     Rock  rock         "Dee…
Predict genre with the neural network classifier
Because I saved the classifier from last week, I can now simply use it to predict the genre in the new subset
Remember, it was a multilayer perceptron with 1 hidden layer (6 nodes), trained on 10,445 songs.
# Setting relevant metrics
class_metrics <- metric_set(accuracy, precision, recall, f_meas)

# Loading the model from last week
load("results/m_ann.Rdata")

# Predicting the new subset of the sample (the 87 songs!)
predict_ann <- predict(m_ann, new_data = lyrics_data) |>
  bind_cols(select(lyrics_data, binary_genre)) |>
  rename(predicted = .pred_class, actual = binary_genre)
Predicting genre with GPT-3.5
To predict the data with GPT (I am choosing the GPT-3.5 Turbo version here), we don’t have to engage in elaborate preprocessing. We simply create an id variable (which helps to later combine predictions and the actual gold standard).
However, the GPT API only allows a certain number of tokens per minute (~10,000) AND a certain number of tokens per prompt (for this model, 32,000)
This requires us to get creative and do a few things:
split the sample into smaller chunks so that we can submit one prompt after the other
map across those chunks and capture the predictions in each step
add a delay (Sys.sleep(10) = 10 seconds break) in between prompts
Predicting genre with GPT-3.5
First, we have to create an id variable and split the data into small chunks:
# Create id variable
lyrics_data_gpt <- lyrics_data |>
  mutate(id = 1:n()) |>
  select(id, text)

# Split data set in smaller chunks
splits <- lyrics_data_gpt |>
  gpt_split_data(n_per_group = 2)
# Actual prompting via the API
map_results <- map_df(splits, function(x) {
  output <- gpt_zeroshot(
    txt = x,
    expertise = "You are an expert in classifying the genre of a song based on its lyrics.",
    labels = c("rock", "other"),
    model = "gpt-3.5-turbo-1106"
  )
  Sys.sleep(10)  # Adding 10 seconds delay between prompts to avoid token-per-minute rate limit
  output
})
Finally, we can join the resulting predictions with the original data:
# Join predictions and original gold standard
predict_gpt <- lyrics_data_gpt |>
  left_join(map_results) |>
  left_join(lyrics_data) |>
  select(id, labels, binary_genre) |>
  mutate(labels = factor(labels, levels = c("rock", "other")))
Comparison between neural network and GPT-3.5
We can see that the neural network does better overall, but bear in mind that it was trained on a lot of data (10,445 songs!)
GPT-3.5 yields a quite astonishing performance, given that it was not trained to do this task at all!
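A hedged sketch of how the two sets of predictions could be scored side by side, reusing the class_metrics set defined above (column names follow the code shown earlier):

# Compare the pre-trained neural network and zero-shot GPT on the same 87 songs
bind_rows(
  class_metrics(predict_ann, truth = actual, estimate = predicted) |>
    mutate(model = "Neural network (trained on 10,445 songs)"),
  class_metrics(predict_gpt, truth = binary_genre, estimate = labels) |>
    mutate(model = "GPT-3.5 (zero-shot)")
)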
Balluff et al. (2023) investigated a recent case of media capture, a mutually corrupting relationship between political actors and media organizations.
This case involves the former Austrian chancellor, who allegedly colluded with a tabloid newspaper to receive better news coverage in exchange for increased ad placements by government institutions.
They implemented an automated content analysis (using BERT) of political news articles from six prominent Austrian news outlets spanning 2012 to 2021 (n = 188,203) and adopted a difference-in-differences approach to scrutinize political actors’ visibility and favorability in news coverage for patterns indicative of the alleged breach of professional political and journalistic norms.
Methods
Used a German-language GottBERT model (Scheible et al., 2020) that they further fine-tuned for the task using publicly available data from the AUTNES Manual Content Analysis of the Media Coverage 2017 and 2019 (Galyga et al., 2022; Litvyak et al., 2022c)
A comparatively difficult task, but they were able to reach a satisfactory F1-score of 0.77 (precision = 0.77, recall = 0.77).
Findings
The findings indicate a substantial increase in the news coverage of the former Austrian chancellor within the news outlet that is alleged to have received bribes.
In contrast, several other political actors did not experience similar shifts in visibility, nor were similar patterns identified in other media outlets.
Summary and conclusion
A Look Back at the Chronology of NLP
Explosion in model size?
Environmental Impact
Ethical considerations
Training large language models requires significant computational resources, contributing to a substantial carbon footprint. Ethical considerations involve assessing the environmental impact of developing and deploying such models.
LLMs can inherit and perpetuate biases present in their training data.
This can result in the generation of biased or unfair content, reflecting and potentially amplifying societal biases and stereotypes.
Developers and users must be aware of the potential for bias and take steps to mitigate it during model training and deployment.
The fact that some LLMs are developed, trained, and employed behind closed doors causes yet another ethical dilemma in using them!
Conclusion
Advancements in NLP and AI are fast-paced; it is difficult to keep up
LLMs promise immense potential for communication research
Yet, large language models can contain biases or even hallucinate!
Validation, validation, validation!
Thank you for your attention!
Required Reading
Kroon, A., Welbers, K., Trilling, D., & van Atteveldt, W. (2023). Advancing Automated Content Analysis for a New Era of Media Effects Research: The Key Role of Transfer Learning. Communication Methods and Measures, 1-21
(available on Canvas)
Reference
Alammar, J. (2018). The illustrated Transformer. Retrieved from: https://jalammar.github.io/illustrated-transformer/
Andrich, A., Bachl, M., & Domahidi, E. (2023). Goodbye, Gender Stereotypes? Trait Attributions to Politicians in 11 Years of News Coverage. Journalism & Mass Communication Quarterly, 100(3), 473-497. https://doi-org.vu-nl.idm.oclc.org/10.1177/10776990221142248
Balluff, P., Eberl, J., Oberhänsli, S. J., Bernhard, J., Boomgaarden, H. G., Fahr, A., & Huber, M. (2023, September 15). The Austrian Political Advertisement Scandal: Searching for Patterns of “Journalism for Sale”. https://doi.org/10.31235/osf.io/m5qx4
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Kroon, A., Welbers, K., Trilling, D., & van Atteveldt, W. (2023). Advancing Automated Content Analysis for a New Era of Media Effects Research: The Key Role of Transfer Learning. Communication Methods and Measures, 1-21
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Example Exam Question (Multiple Choice)
How are word embeddings learned?
A. By assigning random numerical values to each word
B. By analyzing the pronunciation of words
C. By scanning the context of each word in a large corpus of documents
D. By counting the frequency of words in a given text
Example Exam Question (Open Format)
What does zero-shot learning refer to in the context of large language models?
In the context of large language models, zero-shot learning refers to the ability of a model to perform a task or make predictions on a set of classes or concepts that it has never seen or been explicitly trained on. Essentially, the model can generalize its knowledge to new, unseen tasks without specific examples or training data for those tasks.
In traditional machine learning, models are typically trained on a specific set of classes, and their performance is evaluated on the same set of classes during testing. Zero-shot learning extends this capability by allowing the model to handle tasks or categories that were not part of its training set.
In the case of large language models like GPT-3, which is trained on a diverse range of internet text, zero-shot learning means the model can understand and generate relevant responses for queries or prompts related to concepts it hasn’t been explicitly trained on. This is achieved through the model’s ability to capture and generalize information from the vast and varied data it has been exposed to during training.