Computational Analysis of Digital Communication

Week 3: Supervised Text Classification

Dr. Philipp K. Masur

Machine learning???

If science fiction stories…

  • are to be believed, the invention of artificial intelligence inevitably leads to apocalyptic wars between machines and their makers

  • It begins with today’s reality: computers learning how to play simple games and automate routines

  • They later are given control over traffic lights and communications, followed by military drones and missiles

  • This evolution takes a bad turn once computers become sentient and learn how to teach themselves….

Machines are taking over!

And then?

  • Having no more need for human programmers, humankind is simply deleted

  • Is this what machine learning is about?









(Stills from the movies “Ex Machina” and “Her”)

So what are we actually talking about?

  • Machine learning is the study of computer algorithms that can improve automatically through experience and by the use of data

  • The field originated in an environment where the available data, statistical methods, and computing power rapidly and simultaneously evolved

  • Due to the “black box” nature of the algorithm’s operations, it is often seen as a form of artificial intelligence

  • But in simple terms: machines are not good at asking questions or even knowing what questions are

  • They are much better at answering them, provided the question is stated in a way that a computer can comprehend (remember the main challenge of text analysis?)

Applications of Machine Learning

  • Machine learning is most successful when it augments, rather than replaces, the specialized knowledge of a subject-matter expert.

  • Machine learning is used in a wide variety of applications and contexts, such as in businesses, hospitals, scientific laboratories, or governmental organizations

  • In communication science, we can use these techniques to automate text analysis!


Lantz, 2013

Some success stories

Applying machine learning in practical context:

  • Identification of spam messages in mails
  • Segmentation of customers for targeted advertising
  • Weather forecasts and long-term climate changes
  • Reduction of fraudulent credit card transactions
  • Prediction of election outcomes
  • Auto-piloting and self-driving cars
  • Face recognition
  • Optimization of energy use in homes and buildings
  • Discovery of genetic sequences linked to diseases

Content of this lecture

1. What is machine learning?

2. Supervised text classification

  • Overview
  • Principles
  • Validation
  • Example: Predicting genre from song lyrics

3. Examples from the literature

4. Outlook and conclusion

What is machine learning?

Differences between supervised and unsupervised approaches.

Deductive vs. inductive approaches

  • In the previous lecture, we talked about deductive approaches (e.g., dictionary approaches)

  • These are deterministic and are based on text theory (e.g., happy -> positive, hate -> negative)

  • Yet, natural language is often ambiguous and probabilistic coding may be better

  • Dictionary-based or generally rule-based approaches are not very similar to manual coding; a human being assesses much more than just a list of words!

  • Inductive approaches promise to combine the scalability of automatic coding with the validity of manual coding (supervised learning) or can even identify things or relations that we as human beings cannot identify (unsupervised learning)

Supervised learning

  • Algorithms build a model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to do so

  • Combines the scalability of automatic coding with the validity of manual coding (requires pre-labeled data to train algorithm)

  • Examples:

    • Supervised text classification, such as extending manual coding to large text corpora, sentiment analysis…
    • Pattern recognition: e.g., face recognition, spam filter,…

Unsupervised learning (next lecture)

  • Algorithm detects clusters, patterns, or associations in data that have not been labeled previously, but the researcher needs to interpret the results

  • Very helpful to make sense of new data (similar to cluster analysis or exploratory factor analysis)

  • Examples:

    • Topic modeling: Extracting topics from unlabeled (text) data
    • Customer segmentation: Better understanding different customer groups around which to build marketing or other business strategies

Supervised text classification

Training algorithms to make good predictions!

Supervised text classification

We can now use machine learning models to classify text into specific sets of categories. This is known as supervised learning. The basic process is:


1. Manually code a small set of documents (say N = 1,000) for whatever variable(s) you care about

2. Train a machine learning model on the hand-coded data, using the variable as the outcome of interest and the text features of the documents as the predictors

3. Evaluate the effectiveness of the machine learning model via cross-validation (test it on new data/gold standard)

4. Once you have trained a model with sufficient predictive accuracy, apply the model to the remaining set of documents that have never been hand-coded (e.g., N = 100,000) or use it in the planned application (e.g., spam detection software)

Basic Procedure

Example: Spam Detection

  • Suppose we would want to develop a tool to automatically filter spam messages

  • How would you do this if you could only use a dictionary?

    • Compare spam and non-spam emails to see which words and word combinations occur frequently in spam but not in legitimate mail.
    • Downsides: Extremely time consuming and difficult!
  • Machine learning solution

    • Transform the emails into data (e.g., a document-term matrix; see the toy example below)
    • Let the computer figure out how to compute a probability for whether an email is spam
    • Different ML algorithms figure this out in different ways
  • The resulting “classifier” can then be integrated in a software tool that can be used to detect spam mails automatically
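To make the "transform the emails into data" step concrete, here is a toy illustration (made-up mails, not from the lecture): two tiny "emails" turned into a document-feature matrix with quanteda, the input format the classifiers below expect.

library(quanteda)
# Two made-up example mails, one spam and one legitimate ("ham")
mails <- c(spam = "Buy viagra now, cheap viagra!",
           ham = "Meeting moved to Friday, see agenda.")
# Tokenize and build the document-feature matrix (rows = mails, columns = words)
dfm(tokens(mails, remove_punct = TRUE))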

General Idea

  • Model relation between…
    • Input features
      • Similar to independent variables in statistics
      • Can be MANY features (e.g., all words in a DTM)
    • Output class
      • Similar to dependent variables
      • Can be categories (e.g., sentiment, topic classification) or continuous (e.g., stock market value)

Statistical Modeling vs. Machine Learning?

  • Machine learning is thus similar to normal statistical modeling

  • Learn \(f\) so you can predict \(y\) from \(x\):

\(y = f(x)\)
  • In a linear regression model, for example, we aim to find the "line" that best predicts y based on x.
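As a minimal sketch of this framing (using the built-in cars data, not part of the lecture example): lm() "learns" f from observed x and y, and predict() applies it to new values of x.

# Learn f: stopping distance (y) as a function of speed (x)
f <- lm(dist ~ speed, data = cars)
# Predict y for new values of x
predict(f, newdata = data.frame(speed = c(10, 20)))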

Differences

  • Goal of ‘normal’ modeling: explaining/understanding

    • Serves to make inferences about a population (“Does X relate to Y?”)
    • Doesn't use too many variables, to avoid the difficulty of interpreting too many parameters
    • Requires interpretable parameters
  • Goal of machine learning: best possible prediction

    • make generalizable predictions (“How to best predict Y?”)
    • Use as many variables as you need, and don’t worry about interpretability of parameters
    • Always train (fit) and test (validate) on distinct data

Note: Machine learning models often have thousands of collinear independent variables and can have many latent variables!

Supervised Approaches

Advantages

  • independent of language and topic; we only need consistently coded training material

  • can be connected to traditional content analysis (same operationalization, similar criteria in terms of validity and reliability)

  • efficient (analysis of very large samples and text corpora possible)

Disadvantages

  • Requires large amounts of (manually) coded training data

  • Requires in-depth validation

Principles of Supervised Text Classification

How do these algorithms work?

Overview of different algorithms

  • There are many different “algorithms” or classifiers that we can use:

    • Naive Bayes
    • Support Vector Machines
    • Logistic regression
    • k-Nearest neighbors
    • … and many more
  • Most of these algorithms have certain hyperparameters that need to be set

    • e.g., learning rate, regularization, structure…
  • Unfortunately, there is no good theoretical basis for selecting an algorithm

    • Solution: choose algorithm that performs best

The Naive Bayes Algorithm

  • Computes the prior probability P(c) for every category c (the outcome variable) based on the training data set

  • Computes the probability P(x|c) of every feature x given the class c, i.e., the relative frequency of the feature within that category

  • To obtain the probability of a category given the observed features, P(c|X), the prior is multiplied with all the feature probabilities (assuming the features are independent)

  • The algorithm then assigns the class with the highest resulting posterior probability

Applying the Naive Bayes classifier: Spam filtering

  • Without any knowledge of an incoming mail, the best estimate of whether or not it is spam would be P(spam), i.e., the probability that any prior message is spam (= class prior probability, e.g., 20%)
  • The algorithm now "learns" (based on the document-feature matrix) that the word "viagra" can often be found in spam mails. This probability is known as the likelihood P(viagra|spam) (e.g., 20%)
  • The algorithm also “learns” the probability of the word “viagra” appearing in any mail P(viagra) (predictor prior probability, e.g. 5%). Applying the Bayes Theorem, we get:
\(P(spam|viagra) = \frac{P(viagra|spam) * P(spam)}{P(viagra)} = \frac{.20 * .20}{.05} = .80\)
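The same calculation written out in R (the numbers are the hypothetical probabilities from the example above, not estimates from real data):

p_spam        <- 0.20  # class prior P(spam)
p_viagra_spam <- 0.20  # likelihood P(viagra | spam)
p_viagra      <- 0.05  # predictor prior P(viagra)
p_viagra_spam * p_spam / p_viagra  # posterior P(spam | viagra) = 0.8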

Overview: Naive Bayes

Strengths

  • Simple, fast, very effective
  • Does well with noisy and missing data
  • Requires relatively few examples for training, but also works with large numbers of examples
  • Easy to obtain the estimated probability for a prediction

Weaknesses

  • Relies on an often-faulty assumption that all features are equally important and independent
  • Not ideal for data sets with many numeric features
  • Estimated probabilities are less reliable than predicted classes


Lantz, 2013

Support Vector Machines

  • Very often used machine learning method

  • Can be imagined as a “surface” that creates a boundary between points of data plotted in a multidimensional space representing examples and their feature values

  • Tries to find decision boundary between points that maximizes margin between classes while minimizing errors

  • More formally, a support-vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space

Overview: Support Vector Machines

Strengths

  • Can be used for classification or numeric prediction
  • Not overly influenced by noisy data, not prone to overfitting
  • Easier to use than neural networks
  • Often high accuracy

Weaknesses

  • Finding the best model requires testing of various combinations of model parameters
  • Can be slow to train, particularly with many features
  • Results in a complex 'black box' that is difficult, if not impossible, to understand


Lantz, 2013

Neural networks

  • Inspired by the human brain (but abstracted to a mathematical model)

  • Each ‘neuron’ is a linear model with activation function:

    \(y = f(w_1x_1 + … + w_nx_n)\)

  • Common activation functions: logistic, linear, block, tanh, …

  • Each neuron is practically a generalized linear model (see the toy example after this list)

  • Networks differ with regard to three main characteristics:

    • The number of (hidden) layers
    • Whether information is allowed to travel backward
    • The number of nodes within each layer
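A toy illustration of a single neuron (made-up weights and inputs, not from the lecture): a weighted sum of inputs passed through a logistic activation function, i.e., a generalized linear model.

x <- c(0.5, 1.2, -0.3)   # input features x_1 ... x_n
w <- c(0.8, -0.4, 1.5)   # learned weights w_1 ... w_n
logistic <- function(z) 1 / (1 + exp(-z))  # activation function
y <- logistic(sum(w * x))  # y = f(w_1*x_1 + ... + w_n*x_n)
y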

Neural networks: Hidden layers

  • Simple model cannot combine features
  • But we can add “hidden” layers (latent variables)
  • Estimated iteratively via backpropagation

    1. Start with random assignments
    2. Optimize final layer to predict output
    3. Optimize second layer (etc) given ‘best’ final layer
    4. Repeat from 2 until converged


Universal approximator: a neural network with a single hidden layer can approximate any continuous function

Overview: Neural Networks

Strengths

  • Can be adapted to any prediction problem
  • Capable of modelling more complex patterns than nearly any algorithm
  • Makes few assumptions about the data's underlying structure

Weaknesses

  • Extremely computationally intensive and thus slow
  • Very prone to overfitting
  • Results in a complex 'black box' that is difficult, if not impossible, to interpret


Lantz, 2013

How to know which performs best?

  • Sufficiently complex algorithms can “predict” all training data perfectly

  • But such an algorithm does not generalize to new data

  • Essentially, we want the model to fit the data well, but we do not want it to optimize on things that are specific to the training data set

  • Problem of under- vs. overfitting

Statistical modeling

(Illustration: a linear fit that underfits, a good fit, and an overfit of the same data)

Preventing overfitting in machine learning

  • Regularization

    • When fitting a model, ‘punish’ complexity / flexibility
  • Out-of-sample validation detects overfitting

    • To see whether a model generalizes to new data, simply test it on new data!
  • In sum, we need to validate our new classifier on unseen data

Validation

Best practices and processes.

Typical Machine Learning Process: Training vs. Testing

  • Models (almost) always overfit: performance on training data is not a good indicator of real quality

  • Solution

    • Split data into train and test sets (there are different ways of doing this, like split-half, leave-1-out, or k-fold)
    • Train model on training data
    • Test model on unseen test data
    • We can again estimate accuracy, precision, recall, and F-Score (see last lecture)
  • So… why don’t we do this with statistics?

    • Less complex models, so less risk of overfitting (but it’s still a risk!)
    • Less focus on prediction
    • But we sometimes could, and should use this approach!
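As a minimal sketch of what this would look like for an ordinary regression model (using the built-in mtcars data, not the lecture example): fit on a random half of the data, then evaluate on the unseen half.

set.seed(42)
test_idx <- sample(nrow(mtcars), nrow(mtcars) / 2)  # split-half
train <- mtcars[-test_idx, ]
test  <- mtcars[test_idx, ]
m <- lm(mpg ~ hp + wt, data = train)   # train (fit) on training data
pred <- predict(m, newdata = test)     # test on unseen data
sqrt(mean((pred - test$mpg)^2))        # out-of-sample prediction error (RMSE)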

Example: Predicting music genre from lyrics

  • This data is scraped from the "Vagalume" website, so its coverage depends on which lyrics the site stores and shares (not really representative or complete)

  • Many different songs, but not all types of music are represented in this data set

library(tidyverse)  # for read_csv, the dplyr verbs, and ggplot2 used below

# Read song lyrics and artist metadata
s <- read_csv("data/lyrics-data.csv")
a <- read_csv("data/artists-data.csv")

# Keep one row per artist and rename the key used for joining
a <- a %>%
  group_by(Artist) %>%
  filter(row_number() == 1) %>%
  rename(ALink = Link)

# Join lyrics with artist information, keep one row per song, drop missing genres
d <- left_join(s, a) %>%
  unique %>% group_by(SLink) %>%
  filter(row_number() == 1) %>%
  filter(!is.na(Genre))
d
# A tibble: 161,289 × 10
# Groups:   SLink [161,289]
   ALink           SName  SLink Lyric Idiom Artist Songs Popularity Genre Genres
   <chr>           <chr>  <chr> <chr> <chr> <chr>  <dbl>      <dbl> <chr> <chr> 
 1 /10000-maniacs/ More … /100… "I c… ENGL… 10000…   110        0.3 Rock  Rock;…
 2 /10000-maniacs/ Becau… /100… "Tak… ENGL… 10000…   110        0.3 Rock  Rock;…
 3 /10000-maniacs/ These… /100… "The… ENGL… 10000…   110        0.3 Rock  Rock;…
 4 /10000-maniacs/ A Cam… /100… "A l… ENGL… 10000…   110        0.3 Rock  Rock;…
 5 /10000-maniacs/ Every… /100… "Tru… ENGL… 10000…   110        0.3 Rock  Rock;…
 6 /10000-maniacs/ Don't… /100… "Don… ENGL… 10000…   110        0.3 Rock  Rock;…
 7 /10000-maniacs/ Acros… /100… "Wel… ENGL… 10000…   110        0.3 Rock  Rock;…
 8 /10000-maniacs/ Plann… /100… "[ m… ENGL… 10000…   110        0.3 Rock  Rock;…
 9 /10000-maniacs/ Rainy… /100… "On … ENGL… 10000…   110        0.3 Rock  Rock;…
10 /10000-maniacs/ Anthe… /100… "For… ENGL… 10000…   110        0.3 Rock  Rock;…
# … with 161,279 more rows

Understanding the data set

  • Contains artist name, song name, lyrics, and genre of the artist (not the song)

  • The following genres are in the data set:

    • Rock
    • Hip Hop
    • Pop Music
    • Sertanejo (Basically the Brazilian version of Country Music)
    • Funk Carioca (Originated in 60s US Funk, a completely different genre in Brazil nowadays)
    • Samba (Typical Brazilian music)
# Count songs per genre and plot the distribution
d %>%
  group_by(Genre) %>% 
  tally %>%
  ggplot(aes(x = reorder(Genre, n), 
             y = n, fill = Genre)) +
  geom_col() +
  scale_fill_brewer(palette = "Pastel2") +
  coord_flip() +
  theme_classic() +
  guides(fill = "none") +
  labs(x = "", 
       y = "number of songs", 
       title = "Songs per Genre")

How is the data stored and encoded?

d %>%
  ungroup %>%
  filter(Artist == "Britney Spears" & SName == "...Baby One More Time") %>%
  select(Artist, SName, Lyric, Genre)
# A tibble: 1 × 4
  Artist         SName                 Lyric                                                                                           Genre
  <chr>          <chr>                 <chr>                                                                                           <chr>
1 Britney Spears ...Baby One More Time Oh baby baby. Oh baby baby (wow). Oh baby baby. How was I supposed to know?. That something wa… Pop  


d %>%
  ungroup %>%
  filter(Artist == "Drake" & SName == "God's Plan") %>%
  select(Artist, SName, Lyric, Genre)
# A tibble: 1 × 4
  Artist SName      Lyric                                                                                                              Genre
  <chr>  <chr>      <chr>                                                                                                              <chr>
1 Drake  God's Plan "Yeah they wishin' and wishin' and wishin' and wishin'. They wishin' on me, yuh. I been movin' calm, don't start … Hip …

Note: Dealing with non-determinism

  • Many machine learning algorithms are non-deterministic

  • Random initial state and/or random parameter improvements

  • Even deterministic algorithms require a random data split

  • Problem: research is not replicable, outcome may be affected

  • For replicability: set random seed in R: set.seed(123)

  • For valid outcome: repeat X times and report average performance
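A minimal sketch of both recommendations (a made-up example with the built-in iris data, not a text classifier): set a seed for replicability, then repeat the random split several times and report the average out-of-sample accuracy.

set.seed(123)  # makes the random splits replicable
iris$virginica <- as.numeric(iris$Species == "virginica")
accuracies <- replicate(10, {
  test_idx <- sample(nrow(iris), nrow(iris) / 2)  # a new random split each repetition
  train <- iris[-test_idx, ]
  test  <- iris[test_idx, ]
  m <- glm(virginica ~ Petal.Length + Petal.Width, data = train, family = binomial)
  pred <- as.numeric(predict(m, newdata = test, type = "response") > .5)
  mean(pred == test$virginica)  # accuracy on held-out data
})
mean(accuracies)  # report the average performance across repetitions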

Creating a corpus

library(quanteda)
library(quanteda.textmodels)

# Keep only three genres for the classification task
d <- d %>%
  filter(Genre == "Rock" | Genre == "Pop" | Genre == "Hip Hop")

# Create a corpus with the song link as document id and the lyrics as text
music <- corpus(d, docid_field = "SLink", text_field = "Lyric")
music
Corpus consisting of 116,190 documents and 8 docvars.
/10000-maniacs/more-than-this.html :
"I could feel at the time. There was no way of knowing. Falle..."

/10000-maniacs/because-the-night.html :
"Take me now, baby, here as I am. Hold me close, and try and ..."

/10000-maniacs/these-are-days.html :
"These are. These are days you'll remember. Never before and ..."

/10000-maniacs/a-campfire-song.html :
"A lie to say, "O my mountain has coal veins and beds to dig...."

/10000-maniacs/everyday-is-like-sunday.html :
"Trudging slowly over wet sand. Back to the bench where your ..."

/10000-maniacs/dont-talk.html :
"Don't talk, I will listen. Don't talk, you keep your distanc..."

[ reached max_ndoc ... 116,184 more documents ]

Splitting the data into a train and a test set

# Set seed to ensure replicability
set.seed(42)

# Sample rows for testset and create subsets
testset <- sample(docnames(music), nrow(d)/2)
music_test <-  music %>% 
  corpus_subset(docnames(music) %in% testset)
music_train <- music %>% 
  corpus_subset(!docnames(music) %in% testset)

# Define outcome variable for each set
genre_train <- as.factor(docvars(music_train, "Genre"))
genre_test <- as.factor(docvars(music_test, "Genre"))
  • The procedure is always the same

  • Split data into train and test set

    • split-half (what we do here)
    • leave-1-out
    • k-fold
  • We store the outcome variable (Genre) for each subset

Text preprocessing (remember lecture 3?)

  • Step 1: Tokenization (including removing ‘noise’) and normalization

  • Step 2: Removing stop words

  • Step 3: Stemming

  • Step 4: Create document-feature matrix (DFM)

  • Step 5: Remove too short (< 2 characters) and rare words

  • (Step 6: Transform the DFM so that words with a high document frequency are weighted less, i.e., tf-idf weighting)

dfm_train <- music_train %>% 
  tokens(remove_punct = T, 
         remove_numbers = T, 
         remove_symbols = T) %>%
  tokens_tolower %>%
  tokens_remove(stopwords('en')) %>%
  tokens_wordstem %>%
  dfm %>%
  dfm_select(min_nchar = 2) %>% 
  dfm_trim(min_docfreq=20) %>%
  dfm_tfidf()  ## weighting process

Choose algorithm and train model

library(quanteda.textmodels)
m_nb <- textmodel_nb(x = dfm_train, y = genre_train) ## Naive Bayes
summary(m_nb)

Call:
textmodel_nb.dfm(x = dfm_train, y = genre_train)

Class Priors:
(showing first 3 elements)
Hip Hop     Pop    Rock 
 0.3333  0.3333  0.3333 

Estimated Feature Scores:
            take      now     babi      hold     close       tri understand     desir    hunger      fire    breath
Hip Hop 0.001212 0.001709 0.001888 0.0006368 0.0003482 0.0008651  0.0003373 4.339e-05 2.898e-05 0.0003382 0.0002871
Pop     0.001912 0.002209 0.003575 0.0012459 0.0007566 0.0013738  0.0005118 1.710e-04 6.059e-05 0.0007605 0.0006797
Rock    0.002145 0.002460 0.002282 0.0013524 0.0007455 0.0015209  0.0006004 2.822e-04 1.040e-04 0.0010468 0.0008306
            love   banquet      feed     come      way     feel   command      hand       sun   descend      hurt
Hip Hop 0.001688 1.010e-05 1.329e-04 0.001310 0.001060 0.001011 3.257e-05 0.0007522 0.0001953 8.027e-06 0.0003476
Pop     0.004675 7.163e-06 9.344e-05 0.002088 0.001887 0.002379 3.420e-05 0.0011299 0.0007235 2.252e-05 0.0007979
Rock    0.003456 1.822e-05 2.855e-04 0.002653 0.002062 0.002207 8.816e-05 0.0012946 0.0011366 5.312e-05 0.0007014
            night    belong     lover        us     caus     doubt      alon      ring
Hip Hop 0.0007309 0.0000717 0.0001143 0.0009473 0.001716 0.0001802 0.0003117 0.0003029
Pop     0.0016766 0.0002989 0.0005302 0.0012290 0.002201 0.0002322 0.0008661 0.0003434
Rock    0.0018035 0.0004535 0.0005678 0.0014138 0.001477 0.0002986 0.0012920 0.0004085

Predict genre in test set using the algorithm

  • To see how well the model does, we test it on the test (held-out) data

  • For this, it is important that the test data uses the same features (vocabulary) as the training data

  • The model contains parameters for these features, not for words that only occur in the test data

  • In other words, we have to “match” or “align” the train and test data

    • Same text preprocessing
    • Matching of the features
# Matching
dfm_test <- music_test %>% 
  tokens(remove_punct = T, 
         remove_numbers = T, 
         remove_symbols = T) %>%
  tokens_remove(stopwords('en')) %>%
  tokens_wordstem %>%
  dfm %>% 
  dfm_match(featnames(dfm_train)) %>% 
  dfm_tfidf()

# Actual prediction
nb_pred <- predict(m_nb, newdata = dfm_test)
head(nb_pred, 2)
/10000-maniacs/more-than-this.html /10000-maniacs/these-are-days.html 
                              Rock                               Rock 
Levels: Hip Hop Pop Rock

Evaluating the Prediction

  • As we can see in the confusion matrix, there are a lot of false positives and false negatives!

  • Overall Accuracy: 64.71%

  • Precision, Recall and F1-Score are not too good for each genre

    • Precision is slightly better for Rock and Pop,
    • Recall is better for Hip Hop
library(caret)
cm_nb <- confusionMatrix(nb_pred, genre_test)
cm_nb$table
          Reference
Prediction Hip Hop   Pop  Rock
   Hip Hop    7106  2501  3315
   Pop        1012  7073  2849
   Rock       1166  9659 23414


(cm_nb2 <- cm_nb$byClass %>%
  as.data.frame %>%
  rownames_to_column("Genre") %>%
  select(Genre, Precision, Recall, F1) %>%
  as.data.frame %>%
  mutate_if(is.numeric, round, 2))
           Genre Precision Recall   F1
1 Class: Hip Hop      0.55   0.77 0.64
2     Class: Pop      0.65   0.37 0.47
3    Class: Rock      0.68   0.79 0.73

Different algorithm = better results?

m_svm <- textmodel_svm(x = dfm_train, y = genre_train, type = 2) 
svm_pred <- predict(m_svm, newdata = dfm_test)
cm_svm <- confusionMatrix(svm_pred, genre_test)
  • When we refit the model with support vector machines, there are still a lot of false positives and false negatives

  • Overall Accuracy: 69.66%

  • However, Precision, Recall and F1-Score all have improved!

cm_svm <- confusionMatrix(svm_pred, genre_test)
cm_svm$table
          Reference
Prediction Hip Hop   Pop  Rock
   Hip Hop    6509  1228   796
   Pop        1682 10297  5117
   Rock       1093  7708 23665
(cm_svm2 <- cm_svm$byClass %>%
  as.data.frame %>%
  rownames_to_column("Genre") %>%
  select(Genre, Precision, Recall, F1) %>%
  as.data.frame %>%
  mutate_if(is.numeric, round, 2))
           Genre Precision Recall   F1
1 Class: Hip Hop      0.76   0.70 0.73
2     Class: Pop      0.60   0.54 0.57
3    Class: Rock      0.73   0.80 0.76

Comparison between Algorithms

bind_rows(cm_nb2, cm_svm2) %>%
  bind_cols(Model = c(rep("Naive Bayes", 3), 
                      rep("SVM", 3))) %>%
  pivot_longer(Precision:F1) %>%
  ggplot(aes(x = Genre, 
             y = value, 
             fill = Model)) +
  geom_bar(stat= "identity",
           position = "dodge", 
           color = "white") +
  scale_fill_brewer(palette = "Pastel1") +
  facet_wrap(~name) +
  coord_flip() +
  ylim(0, 1) +
  theme_grey() +
  theme(legend.position = "bottom")

Drivers of model performance

  1. Task difficulty

  2. Amount of training data

  3. Choice of features (n-grams, lemmata, etc.; see the sketch below)

  4. Text preprocessing (e.g., exclude or include stopwords?)

  5. Tuning of algorithm (if required)
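To illustrate point 3, a hedged sketch of one way to vary the choice of features: adding bigrams to the feature set (this assumes the music_train corpus and preprocessing from the example above).

dfm_train_bigrams <- music_train %>%
  tokens(remove_punct = T,
         remove_numbers = T,
         remove_symbols = T) %>%
  tokens_tolower %>%
  tokens_remove(stopwords('en')) %>%
  tokens_wordstem %>%
  tokens_ngrams(n = 1:2) %>%  # keep unigrams and add bigrams
  dfm %>%
  dfm_trim(min_docfreq = 20) %>%
  dfm_tfidf()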

What is the effect of text preprocessing on model performance?

Scharkow, 2013

Examples from the literature

How is this used in research?

Example 1: Validating different approaches

  • Van Atteveldt et al. (2021) re-analyzed data reported in Boukes et al. (2020) to understand the validity of different text classification approaches

  • The data included news from a total of ten newspapers and five websites published between February 1 and July 7, 2015:

    • three quality newspapers (NRC Handelsblad, Trouw, de Volkskrant)
    • a financial newspaper (Financieel Dagblad)
    • three popular newspapers (Algemeen Dagblad, Metro, De Telegraaf)
    • three regional outlets (Dagblad van het Noorden, de Gelderlander, Noordhollands Dagblad)

Methods

  • They analyzed the paper using different methods and compared the results

    • Gold standard (manual coding by the three authors)
    • Manual coding (1 or 3 coders)
    • Crowd-Coding (1, 3 or 5 coders on an online platform)
    • Sentiment dictionaries (various versions)
    • Different supervised machine learning algorithms (NB, SVM, convolutional neural networks)
  • Investigated performance results of all models

Main results

So which method is valid?

  • Manual coding still outperforms all other approaches

  • Supervised text classification (particularly deep learning) is better than dictionary approaches (not too surprising)

  • Particularly supervised learning gets better with more training data (more is more!)

  • Nonetheless strongly depends on quality of training data

  • Recommendation for dictionaries: apply any applicable off-the-shelf dictionaries and, if one of them is sufficiently valid (as determined by comparison with the gold standard), use it for the text analysis

    • Dictionaries give very good transparency
    • Replicability for a low cost

Example 2: Incivility in Facebook comments

  • Study examined the extent and patterns of incivility in the comment sections of 42 US news outlets’ Facebook pages in 2015–2016

  • News source outlets included

    • National-news outlets (e.g., ABC, CBS, CNN…)
    • Local-news outlets (e.g., The Denver Post, San Francisco Chronicle…)
    • Conservative and liberal partisan news outlets (e.g., Breitbart, The Daily Show…)
  • Implemented a combination of manual coding and supervised machine learning to code comments with regard to:

    • Civility
    • Interpersonal rudeness
    • Personal rudeness
    • Impersonal extreme incivility
    • Personal extreme incivility

Results: Incivility over time

  • Despite several discernible spikes, the percentage of extremely uncivil personal comments on national-news outlets’ pages shifted only modestly

  • On conservative outlets’ Facebook pages, the proportions of both extremely uncivil and rude comments fluctuated dramatically across the sampling window

Su et al., 2018

Overall differences

Outlook and Conclusion

Deep learning

  • Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning

  • Learning can again be supervised, semi-supervised or unsupervised

  • Generally refers to large neural networks with many hidden layers

    • Possible because of modern computing power, large training sets
    • Powers e.g. automatic translation, self-driving cars, chess computers, etc.
  • Originally developed to deal with image recognition, now also adapted for text analysis

  • Uses combinations of words, bi-grams, and word embeddings rather than raw feature frequencies

Deep Learning Neural Networks

Example: TV gender representation

BERT language models

  • Bidirectional Encoder Representations from Transformers (BERT)

  • A machine learning technique for natural language processing pre-training, developed by Google

  • A deep learning model in which every output element is connected to every input element, and the weightings between them are dynamically calculated based upon their connection

  • BERT is pre-trained on two different tasks: Masked Language Modeling and Next Sentence Prediction.

    • Masked Language Model training is to hide a word in a sentence and then have the program predict what word has been hidden (masked) based on the hidden word’s context
    • Next Sentence Prediction training is to have the program predict whether two given sentences have a logical, sequential connection or whether their relationship is simply random

BERT language models - Schematic Overview

BERT in R?

  • There is a new package (very recently released) that allows you to use such pre-trained, large-scale language models in R

  • If you are interested, check out the package "text": https://r-text.org/

  • But: Be mindful! Running a BERT model can take a long time and might even require a more powerful computer than yours!
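A minimal, hedged sketch of what this looks like (assuming the "text" package is installed and set up; it requires a one-time text::textrpp_install() and downloads a pre-trained language model on first use):

library(text)
# Retrieve BERT-style embeddings for a sentence; these could then serve as
# input features for a downstream classifier
embeddings <- textEmbed("Oh baby baby, how was I supposed to know?")
embeddings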

Is machine learning really useful?

  • Is it okay to use a model we can’t possibly understand?

  • Machine learning in the social sciences is generally used to solve an engineering problem

  • Output of Machine Learning is input for “actual” statistical model (e.g., we classify text, but run an analysis of variance with the output)

Conclusion

  • Machine learning is a useful tool for generalizing from a sample

  • It is very useful to reduce the amount of manual coding needed

  • Many different models exist (each with many parameters/options)

  • We always need to validate model on unseen and representative test data!

Thank you for your attention!

Required Reading



van Atteveldt, W., van der Velden, M. A. C. G., & Boukes, M. (2021). The Validity of Sentiment Analysis: Comparing Manual Annotation, Crowd-Coding, Dictionary Approaches, and Machine Learning Algorithms. Communication Methods and Measures, 15(2), 121-140. https://doi.org/10.1080/19312458.2020.1869198

Su, L. Y.-F., Xenos, M. A., Rose, K. M., Wirz, C., Scheufele, D. A., & Brossard, D. (2018). Uncivil and personal? Comparing patterns of incivility in comments on the Facebook pages of news outlets. New Media & Society, 20(10), 3678–3699. https://doi.org/10.1177/1461444818757205


(available on Canvas)

References

  • Boumans, J. W., & Trilling, D. (2016). Taking stock of the toolkit: An overview of relevant automated content analysis approaches and techniques for digital journalism scholars. Digital Journalism, 4(1), 8-23.

  • Günther, E., & Domahidi, E. (2017). What Communication Scholars Write About: An Analysis of 80 Years of Research in High-Impact Journals. International Journal of Communication, 11, 3051–3071.

  • Hvitfeldt, E., & Silge, J. (2021). Supervised Machine Learning for Text Analysis in R. CRC Press. https://smltar.com/

  • Jürgens, P., Meltzer, C., & Scharkow, M. (2021, in press). Age and Gender Representation on German TV: A Longitudinal Computational Analysis. Computational Communication Research.

  • Lantz, B. (2013). Machine learning in R. Packt Publishing Ltd.

  • Scharkow, M. (2013). Thematic content analysis using supervised machine learning: An empirical evaluation using german online news. Quality & Quantity, 47(2), 761–773. https://doi.org/10.1007/s11135-011-9545-7

  • Su, L. Y.-F., Xenos, M. A., Rose, K. M., Wirz, C., Scheufele, D. A., & Brossard, D. (2018). Uncivil and personal? Comparing patterns of incivility in comments on the Facebook pages of news outlets. New Media & Society, 20(10), 3678–3699. https://doi.org/10.1177/1461444818757205

  • van Atteveldt, W., van der Velden, M. A. C. G., & Boukes, M. (2021). The Validity of Sentiment Analysis: Comparing Manual Annotation, Crowd-Coding, Dictionary Approaches, and Machine Learning Algorithms. Communication Methods and Measures, 15(2), 121-140. https://doi.org/10.1080/19312458.2020.1869198

Example Exam Question (Multiple Choice)

Van Atteveldt and colleagues (2020) tested the validity of various automated text analysis approaches. What was their main result?


A. English dictionaries performed better than Dutch dictionaries in classifying the sentiment of Dutch news paper headlines.

B. Dictionary approaches were as good as machine learning approaches in classifying the sentiment of Dutch news paper headlines.

C. Of all automated approaches, supervised machine learning approaches performed the best in classifying the sentiment of Dutch news paper headlines.

D. Manual coding and supervised machine learning approaches performed similarly well in classifying the sentiment of Dutch news paper headlines.


Example Exam Question (Open Format)

Describe the typical process used in supervised text classification.

Any supervised machine learning procedure to analyze text usually contains at least 4 steps:

  1. One has to manually code a small set of documents for whatever variable(s) you care about (e.g., topics, sentiment, source,…).

  2. One has to train a machine learning model on the hand-coded /gold-standard data, using the variable as the outcome of interest and the text features of the documents as the predictors.

  3. One has to evaluate the effectiveness of the machine learning model via cross-validation. This means one has to test the model on new (held-out) data.

  4. Once one has trained a model with sufficient predictive accuracy, precision, and recall, one can apply the model to more documents that have never been hand-coded or use it for the purpose it was designed for (e.g., spam detection software)