Text Classification Using Classic Machine Learning

Week 3: From Naive Bayes to Neural Networks

Dr. Philipp K. Masur

So what are we actually talking about?

  • Machine learning is the study of computer algorithms that can improve automatically through experience and by the use of data

  • The field originated in an environment where the available data, statistical methods, and computing power rapidly and simultaneously evolved

  • Due to the “black box” nature of the algorithm’s operations, it is often seen as a form of artificial intelligence

Source: qlik

Some success stories

Source: Eschenzweig/Wikimedia

Applying machine learning in practical contexts:

  • Identification of spam messages in mails
  • Segmentation of customers for targeted advertising
  • Weather forecasts and long-term climate changes
  • Reduction of fraudulent credit card transactions
  • Prediction of election outcomes
  • Auto-piloting and self-driving cars
  • Face recognition
  • Optimization of energy use in homes and buildings
  • Discovery of genetic sequences linked to diseases
  • … and many more

Content of this lecture

  1. What is Machine Learning?

    1.1. Concepts and Principles

    1.2. The General Machine Learning Pipeline

    1.3. Difference between statistics and machine learning

    1.4. Over- vs. underfitting

    1.5. Training vs. Testing

    1.6. Example Data: Predicting Genre from Lyrics

  2. Supervised text classification with different “algorithms”

    2.1. Naive Bayes

    2.2. Logistic Regression

    2.3. Support Vector Machines

    2.4. Artificial Neural Networks

  3. Testing Different Approaches

    3.1. Performance of different approaches

    3.2. Fine-tuning preprocessing and hyperparameters

    3.3. Effect of text preprocessing

    3.4. Validity of different text classification approaches

  4. Examples from the Literature

    4.1. Incivility on Facebook (Su et al., 2018)

    4.2. Electoral News Sharing (de León et al., 2021)

  5. Summary and Conclusion

What is machine learning?

Deductive vs. inductive approaches

  • In the previous lecture, we talked about deductive approaches (such as dictionary approaches)

  • These are deterministic and are based on a priori text theory (e.g., happy -> positive, hate -> negative)

  • Yet, natural language is often ambiguous and a probabilistic coding may be better

  • Dictionary-based and other rule-based approaches are not very similar to manual coding; a human coder assesses much more than just a list of words

  • Inductive approaches promise to combine the scalability of automatic coding with the validity of manual coding (supervised learning) or can even identify things or relations that we as human beings cannot identify (unsupervised learning)

Example of supervised text classification

General text classification pipeline

Supervised Text Classification Pipeline

Basic idea of Machine Learning Algorithms

  • The general goal, however, is always the same: Model the relationship between…

    • Input features
      • Similar to independent variables in statistics
      • Can be MANY features (e.g., all words in a corpus)
    • Output class
      • Similar to dependent variables
      • Can be categorical or continuous
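
To make this concrete, here is a minimal, purely hypothetical sketch of what such input/output data looks like for text: each word is a feature column, and the genre is the output class (toy numbers, not from the actual data).

library(tibble)

# Hypothetical toy data: word counts are the input features,
# the genre label is the output class we want to predict
toy_data <- tribble(
  ~doc_id, ~love, ~death, ~dance, ~genre,
  "song1",     3L,     0L,     2L, "other",
  "song2",     0L,     4L,     0L, "rock",
  "song3",     1L,     1L,     1L, "rock"
)
toy_data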

Statistical modeling vs. Supervised Machine learning

  • Machine learning, many people joke, is nothing other than a fancy name for statistics.

  • There is some truth to this: if you say “logistic regression”, this will sound familiar to both statisticians and machine learning practitioners.

  • Still, there are some differences between traditional statistical approaches and the machine learning approach, even if some of the same mathematical tools are used.

Source: Demetri Pananos/stackoverflow

Statistical Modeling

  • Statistical modeling is about understanding the relationship between one (or several) predictors and an outcome variable

  • Learn \(f\) so you can predict \(y\) from \(x\):

\(y = f(x)\)
  • In a linear regression model, we aim to find the “line” that best predicts y based on x:
\(y = -0.16 + 2.31 * x + \epsilon\)
  • In a typical communication science paper, we would say something like: when x increases by one unit, y increases by 2.31 units.

Machine Learning

  • Machine learning is less about understanding the relationship, but about maximizing prediction.

  • A statistical model such as the one estimated above can be used to predict the most likely y values for new x data.

  • For example, even though these values are not in the data, x = -1 should yield y ≈ -2.47 and x = 5 should yield y ≈ 11.41 based on the fitted line!

  • In other words, machine learning doesn’t focus on explanation, but emphasizes prediction.
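
A minimal sketch of this logic in base R, using simulated data (the estimated coefficients will only approximately match the ones shown above):

set.seed(42)

# Simulate data roughly following y = -0.16 + 2.31 * x + error
x <- runif(100, 0, 4)
y <- -0.16 + 2.31 * x + rnorm(100)

# Fit the linear model ("learn f")
m <- lm(y ~ x)

# Predict y for x values that were not in the data
predict(m, newdata = data.frame(x = c(-1, 5)))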

Differences in language

Source: van Atteveldt et al., 2021

Differences in Goals

  • Goal of statistical modeling: explaining/understanding

    • Serves to make inferences about a population (“Does X relate to Y?”)
    • Doesn’t use too many variables to avoid the difficulty of interpreting too many parameters
    • Requires interpretable parameters
  • Goal of machine learning: best possible prediction

    • make generalizable predictions (“How to best predict Y?”)
    • We do not care about the actual values of the coefficients, we just need them for our prediction.
    • In fact, in many machine learning models, we will have so many of them that we do not even bother to report them.

Note: Machine learning models often have thousands of collinear independent variables and can have many latent variables!

Over- vs. underfit

  • Problem in Machine Learning: Sufficiently complex algorithms can “predict” all training data perfectly

  • But such an algorithm does not generalize to new data (the actual goal!)

  • Essentially, we want the model to have a good fit to the data, but we also want it to not optimize on things that are specific to the training data set

  • Problem of under- vs. overfitting

Exemplifying over- and underfit

Linear fit (underfit)

Overfit

Good fit

Preventing overfitting

  • Regularization during fitting process

    • ‘Punish’ complexity
    • Constrain flexibility
    • Removing noise
  • Out-of-sample validation

    • To see whether a model generalizes to new data, simply test it on new data
    • This validation set clearly detects overfitting
  • In sum, we need to validate our new classifier on unseen data

Solution: Training and Testing

  • As models (almost) always overfit, it is clear that performance on only training data is not a good indicator of the real quality of the classifier

  • The standard solution is to split the labeled data into training and test data sets: this way, we can train the algorithm on one part and then evaluate its validity/performance on the held-out part of the data.

  • We can be confident there is no problematic overfitting if the classifier still performs well on unseen data and not just on the training data!

Integrating testing in the pipeline

Testing = Validation

  • Remember how we validated dictionary approaches?

    • We manually coded a small set of the text
    • Compared this gold standard to the dictionary result
  • In supervised text classification, the procedure is similar

    • We use the classifier (trained on the training data) to predict the outcome in the test data
    • Because the test data is only a subset of our labeled data, it also contains the true outcome
    • We can compare the predicted with the actual outcome and compute the same performance scores (Accuracy, Precision,…)

A Small Note on Dealing with non-determinism

  • Many aspects of machine learning are non-deterministic, i.e., there is a certain randomness in the procedure

    • Randomly splitting the available labeled data into training and test sets
    • Some algorithms start from random initial states (e.g., drawing a random number)
  • Problem: the research thereby becomes non-replicable and the outcome may be affected by chance

  • To ensure replicability

    • We can set a so-called random seed in R with set.seed(123)
    • This way, we force R to use the exact same procedure (even when we repeat it)
  • For valid outcome:

    • Repeat the procedure X times
    • Report average performance
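
A minimal sketch of this “repeat and average” logic with tidymodels, using the small two_class_dat example data that ships with the modeldata package (attached by tidymodels); the lyrics example later in this lecture follows the same idea with a single split.

library(tidymodels)

set.seed(123)  # makes the sequence of random splits reproducible

# Repeat the split-train-evaluate procedure 10 times and average accuracy
accuracies <- map_dbl(1:10, function(i) {
  split <- initial_split(two_class_dat, prop = 0.6)                    # random split
  m     <- logistic_reg() |> fit(Class ~ A + B, data = training(split))
  predict(m, testing(split)) |>
    bind_cols(testing(split)) |>
    accuracy(truth = Class, estimate = .pred_class) |>
    pull(.estimate)
})

mean(accuracies)  # report the average performance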

Example: Predicting music genre from lyrics

  • For the remainder of this lecture, I will exemplify different approaches to supervised text classification using a data set that contains song lyrics and the genre of the artist.

  • The goal is to investigate whether we can train an algorithm sufficiently well to predict the genre from only song lyrics.

library(tidyverse)

# Set seed
set.seed(42)

# Read data
lyrics_data <- read_csv("data/lyrics-data-prep.csv") |>               
  mutate(binary_genre = factor(ifelse(Genre =="Rock","rock", "other"),  # <-- create binary outcome 
                               levels = c("rock", "other") )) |>        # <-- order categories!
  sample_frac(size = .20) |>                                            # <-- Sample 20% of the data
  select(doc_id = SLink, Artist, Song = SName, 
         Genre, binary_genre, text = Lyric)
head(lyrics_data)
# A tibble: 6 × 6
  doc_id                                     Artist            Song                             Genre binary_genre text 
  <chr>                                      <chr>             <chr>                            <chr> <fct>        <chr>
1 /snoop-dogg/213-tha-gangsta-clicc.html     Snoop Dogg        213 Tha Gangsta Clicc            Hip … other        "[Sn…
2 /far-east-movement/jello-feat-rye-rye.html Far East Movement Jello (feat. Rye Rye)            Pop   other        "Jel…
3 /janet-jackson/got-till-its-gone.html      Janet Jackson     Got 'til It's Gone (feat. Q-Tip… Pop   other        "Wha…
4 /tori-amos/yo-george.html                  Tori Amos         Yo George                        Rock  rock         "I s…
5 /van-morrison/a-sense-of-wonder.html       Van Morrison      A Sense of Wonder                Rock  rock         "I w…
6 /heart/together-now.html                   Heart             Together Now                     Rock  rock         "Dee…

The data set

  • This data was scraped from the “Vagalume” website, so it depends on what lyrics are stored and shared there (not really representative or complete)

  • Contains artist name, song name, lyrics, and genre of the artist (not the song)

  • The following genres are in this subsample of the data set:

    • Rock
    • Hip Hop
    • Pop Music
lyrics_data |> 
  group_by(Genre) |>  
  tally() |> 
  ggplot(aes(x = reorder(Genre, n), y = n, 
             fill = Genre)) +
  geom_col() +
  scale_fill_brewer(palette = "Pastel2") +
  coord_flip() +
  ggridges::theme_ridges() +
  guides(fill = F) +
  labs(x = "", 
       y = "Number of Songs", 
       title = "Songs per Genre")

Let’s check out a song

library(tidytext)
lyrics_data |> 
  filter(Artist == "Radiohead" & Song == "Paranoid Android") |> 
  unnest_tokens(lines, text, token = "sentences") |> 
  select(lines) |> 
  print(n = 20)
# A tibble: 39 × 1
   lines                                                        
   <chr>                                                        
 1 please could you stop the noise, i'm trying to get some rest.
 2 from all the unborn chicken voices in my head.               
 3 what's this...?                                              
 4 (i may be paranoid, but not an android).                     
 5 what's this...?                                              
 6 (i may be paranoid, but not an android).                     
 7 when i am king, you will be first against the wall.          
 8 with your opinion which is of no consequence at all.         
 9 what's this...?                                              
10 (i may be paranoid, but no android).                         
11 what's this...?                                              
12 (i may be paranoid, but no android).                         
13 ambition makes you look pretty ugly.                         
14 kicking and squealing gucci little piggy.                    
15 you don't remember.                                          
16 you don't remember.                                          
17 why don't you remember my name?.                             
18 off with his head, man.                                      
19 off with his head, man.                                      
20 why don't you remember my name?.                             
# ℹ 19 more rows

Splitting the data into a train and a test set

  • There are different ways to think about creating a training vs. test data set

  • Simplest form: split-half (or any other percentage distribution)

  • But there are also other, more complex procedures that fall under the term “cross-validation”

    • Leave-1-out
    • k-fold
  • Which approach is meaningful depends on the question and goal!


Leave-1-out cross validation



k-fold cross validation
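
A minimal sketch of how these resampling schemes can be defined with rsample (part of tidymodels); the objects below only define the folds, the repeated fitting across them happens later (e.g., via fit_resamples() or tune_grid()).

library(tidymodels)   # rsample, which provides these functions, is attached with tidymodels

set.seed(42)

# k-fold (here: 10-fold) cross-validation:
# each of the 10 folds serves once as the held-out validation set
folds_k10 <- vfold_cv(lyrics_data, v = 10)

# Leave-one-out cross-validation:
# every single song is held out once (computationally expensive!)
folds_loo <- loo_cv(lyrics_data)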

Splitting the data set in R

  • For the entire model fitting process, we are going to use the package tidymodels, which integrates nicely with the already familiar package tidytext

  • Here, we can use the function initial_split to split our data set. The functions training and testing then create the actual data sets from the split.

library(tidymodels)

# Set seed to ensure replicability
set.seed(42)

# Create initial split proportions
split <- initial_split(lyrics_data, prop = .60)

# Create training and test data
train_data <- training(split)
test_data <- testing(split)

# Check
tibble(dataset = c("training", "testing"),
       n_songs = c(nrow(train_data), nrow(test_data)))
# A tibble: 2 × 2
  dataset  n_songs
  <chr>      <int>
1 training   10445
2 testing     6964

Algorithms for Supervised text classification

From Naive Bayes to Neural Networks.

Feature Engineering

  • Classic machine learning models require a numerical representation of text (e.g., document-feature matrix)

  • They further need the outcome variable that they should predict

  • All text-preprocessing steps (e.g., stopword removal, stemming, lemmatization, frequency trimming, weighting, etc.) may change the performance, but there are no clear rules on what works and what does not

  • Only solution: Trial and error!

Text Preprocessing with TidyText

library(tidytext)

# Text Preprocessing
dfm_train <- train_data |>
  unnest_tokens(word, text) |> 
  anti_join(stop_words) |> 
  group_by(doc_id, word) |> 
  summarize(n = n()) |> 
  pivot_wider(names_from = word, values_from = n, values_fill = 0)

# Check
dfm_train |> head()
# A tibble: 6 × 43,613
# Groups:   doc_id [6]
  doc_id  answer blowing breeze brick burning candle changed close color considered distant earth fading fingers flowers
  <chr>    <int>   <int>  <int> <int>   <int>  <int>   <int> <int> <int>      <int>   <int> <int>  <int>   <int>   <int>
1 /10000…      2       1      1     1       1      1       1     2     1          1       1     1      1       1       1
2 /10000…      0       0      0     0       0      0       0     0     0          0       0     0      2       0       0
3 /10000…      0       0      0     0       0      0       0     0     0          0       0     0      0       0       0
4 /10000…      0       0      0     0       0      0       0     0     0          0       0     0      0       0       0
5 /10000…      0       0      0     0       0      0       0     2     0          0       0     0      0       0       0
6 /10000…      0       0      0     0       0      0       0     0     0          0       0     0      0       0       0
# ℹ 43,597 more variables: forever <int>, frozen <int>, funny <int>, glass <int>, grey <int>, hold <int>, home <int>,
#   lane <int>, liquid <int>, live <int>, looked <int>, love <int>, melding <int>, memories <int>, overturned <int>,
#   past <int>, planting <int>, questions <int>, raw <int>, remember <int>, safe <int>, seldom <int>, shuddered <int>,
#   silent <int>, sleep <int>, slip <int>, slowly <int>, smile <int>, summer <int>, surface <int>, tend <int>,
#   threw <int>, vision <int>, walks <int>, weary <int>, whisper <int>, world <int>, `10,000` <int>, australian <int>,
#   blue <int>, calling <int>, calls <int>, cds <int>, choice <int>, choose <int>, cold <int>, days <int>,
#   deserting <int>, euro <int>, feelings <int>, hear <int>, hero <int>, left <int>, maniacs <int>, moment <int>, …

Alternative in tidymodels: Creating A “Recipe”

library(textrecipes)

# Create recipe for text preprocessing
rec <- recipe(binary_genre ~ text, data = lyrics_data) |>
  step_tokenize(text, options = list(strip_punct = T, strip_numeric = T)) |>  
  step_stopwords(text, language = "en") |>    
  step_tokenfilter(text, min_times = 20, max_tokens = 1000 ) |> 
  step_tf(all_predictors()) 

# "Bake" (check) based on recipe to see text preprocessing
rec |> 
  prep(train_data) |>
  bake(new_data=NULL)
# A tibble: 10,445 × 1,001
   binary_genre tf_text_2x tf_text_across tf_text_act tf_text_afraid tf_text_ah tf_text_ahead `tf_text_ain't`
   <fct>             <int>          <int>       <int>          <int>      <int>         <int>           <int>
 1 rock                  0              0           0              0          0             0               0
 2 other                 0              0           0              0          0             0               0
 3 other                 0              0           0              0          0             0               0
 4 rock                  0              0           0              0          0             0               0
 5 other                 0              0           0              0          0             0               2
 6 rock                  0              0           0              0          0             0               2
 7 other                 0              0           1              0          0             0               1
 8 rock                  0              1           0              0          0             0               0
 9 other                 0              0           0              0          0             0               0
10 rock                  0              0           0              1          0             0               1
# ℹ 10,435 more rows
# ℹ 993 more variables: tf_text_aint <int>, tf_text_air <int>, tf_text_alive <int>, tf_text_almost <int>,
#   tf_text_alone <int>, tf_text_along <int>, tf_text_already <int>, tf_text_alright <int>, tf_text_always <int>,
#   tf_text_angel <int>, tf_text_angels <int>, tf_text_another <int>, tf_text_answer <int>, tf_text_anybody <int>,
#   tf_text_anymore <int>, tf_text_anyone <int>, tf_text_anything <int>, tf_text_anyway <int>, tf_text_apart <int>,
#   tf_text_arms <int>, tf_text_around <int>, tf_text_ask <int>, tf_text_asking <int>, tf_text_ass <int>,
#   tf_text_awake <int>, tf_text_away <int>, tf_text_ay <int>, tf_text_b <int>, tf_text_ba <int>, tf_text_babe <int>, …

Workflow in Tidymodels

  • The collection tidymodels contains a variety of packages that facilitate and streamline machine learning in R

  • The basic procedure is the following:

    1. Create a recipe (this already includes a formula and all pre-processing steps)
    2. Bake training and testing data using the recipe (not strictly necessary)
    3. Create a designated model function (depending on what type of algorithm should be used)
    4. Bind all together using a workflow
    5. Fit the entire workflow and evaluate performance

Setting up a recipe for our example

# Load specific support library
library(textrecipes)

# Create recipe
rec <- recipe(binary_genre ~ text, data = lyrics_data)                 # <-- predict binary_genre by text

Setting up a recipe for our example

# Load specific support library
library(textrecipes)

# Create recipe
rec <- recipe(binary_genre ~ text, data = lyrics_data) |>              # <-- predict binary_genre by text
  step_tokenize(text, options = list(strip_punct = T,                  # <-- tokenize, remove punctuation
                                     strip_numeric = T))               # <-- remove numbers

Setting up a recipe for our example

# Load specific support library
library(textrecipes)

# Create recipe
rec <- recipe(binary_genre ~ text, data = lyrics_data) |>              # <-- predict binary_genre by text
  step_tokenize(text, options = list(strip_punct = T,                  # <-- tokenize, remove punctuation
                                     strip_numeric = T)) |>            # <-- remove numbers
  step_stopwords(text, language = "en") |>                             # <-- remove stopwords                 
  step_tokenfilter(text, min_times = 20, max_tokens = 1000)            # <-- filter out rare words and use only top 1000                         

Setting up a recipe for our example

# Load specific support library
library(textrecipes)

# Create recipe
rec <- recipe(binary_genre ~ text, data = lyrics_data) |>              # <-- predict binary_genre by text
  step_tokenize(text, options = list(strip_punct = T,                  # <-- tokenize, remove punctuation
                                     strip_numeric = T)) |>            # <-- remove numbers
  step_stopwords(text, language = "en") |>                             # <-- remove stopwords                 
  step_tokenfilter(text, min_times = 20, max_tokens = 1000) |>         # <-- filter out rare words and use only top 1000              
  step_tf(all_predictors())                                            # <-- create document-feature matrix

Setting up a recipe for our example

# Load specific support library
library(textrecipes)

# Create recipe
rec <- recipe(binary_genre ~ text, data = lyrics_data) |>              # <-- predict binary_genre by text
  step_tokenize(text, options = list(strip_punct = T,                  # <-- tokenize, remove punctuation
                                     strip_numeric = T)) |>            # <-- remove numbers
  step_stopwords(text, language = "en") |>                             # <-- remove stopwords                 
  step_tokenfilter(text, min_times = 20, max_tokens = 1000) |>         # <-- filter out rare words and use only top 1000     
  step_tf(all_predictors())                                            # <-- create document-feature matrix

# Small adaptation to the recipe (for SVM and neural networks)
rec_norm <- rec |> 
  step_normalize(all_predictors())    
# A tibble: 6 × 1,001
  binary_genre tf_text_2x tf_text_across tf_text_act tf_text_afraid tf_text_ah
  <fct>             <int>          <int>       <int>          <int>      <int>
1 rock                  0              0           0              0          0
2 other                 0              0           0              0          0
3 other                 0              0           0              0          0
4 rock                  0              0           0              0          0
5 other                 0              0           0              0          0
6 rock                  0              0           0              0          0
# ℹ 995 more variables: tf_text_ahead <int>, `tf_text_ain't` <int>,
#   tf_text_aint <int>, tf_text_air <int>, tf_text_alive <int>,
#   tf_text_almost <int>, tf_text_alone <int>, tf_text_along <int>,
#   tf_text_already <int>, tf_text_alright <int>, tf_text_always <int>,
#   tf_text_angel <int>, tf_text_angels <int>, tf_text_another <int>,
#   tf_text_answer <int>, tf_text_anybody <int>, tf_text_anymore <int>,
#   tf_text_anyone <int>, tf_text_anything <int>, tf_text_anyway <int>, …

Overview of different algorithms

  • There are many different “algorithms” or classifiers that we can use:

    • Naive Bayes
    • Logistic regression
    • Support Vector Machines
    • Decision trees
    • Artificial neural networks
    • … and many more
  • Most of these algorithms have certain hyperparameters that need to be set

    • e.g., learning rate, regularization, structure…
  • Unfortunately, there is no good theoretical basis for selecting an algorithm or hyperparameters

    • Solution: Choose algorithm that performs best

Naive Bayes

Using the Bayes Theorem for Classification

Theoretical background

  • Goes back to the Bayes’ Theorem published in 1763!

  • Computes the prior probability P(c) of every category c (the outcome class) based on the training data set

  • Computes the probability of every feature x given the class c, i.e., the relative frequency of the feature within that category

  • To obtain the probability of a category in light of certain features, P(c|X), the individual feature probabilities are multiplied (the “naive” independence assumption)

  • The algorithm then chooses the class with the highest resulting posterior probability
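
More formally, and assuming (naively) that the features \(x_1, \dots, x_n\) are independent given the class, this boils down to the standard formulation:

\(P(c|x_1, \dots, x_n) = \frac{P(c) \prod_{i=1}^{n} P(x_i|c)}{P(x_1, \dots, x_n)} \propto P(c) \prod_{i=1}^{n} P(x_i|c)\)

The predicted class is then \(\hat{c} = \arg\max_c P(c) \prod_{i=1}^{n} P(x_i|c)\).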

Applying the Naive Bayes classifier to our example

  • Without any knowledge about the genre, the best estimate of whether or not a song is rock would be P(rock), i.e., the probability that any given song is rock (= class prior probability, here 54.2%)
  • The algorithm now “learns” (based on the document-feature matrix) that the word “death” can often be found in rock songs. This probability is known as the likelihood P(death|rock) (e.g., 15%)
  • The algorithm also “learns” the probability of the word “death” appearing in any song, P(death) (predictor prior probability, e.g., 10%). Applying the Bayes Theorem, we get:


\(P(rock|death) = \frac{P(death|rock) * P(rock)}{P(death)} = \frac{.15 * .54}{.10} = .81\)

Why use Naive Bayes?

Strengths

  • Simple, fast, very effective
  • Does well with noisy and missing data
  • Requires relatively few examples for training, but also works with large numbers of examples
  • Easy to obtain the estimated probability for a prediction

Weaknesses

  • Relies on an often-faulty assumption that all features are equally important and independent
  • Not ideal for data sets with many numeric features
  • Estimated probabilities are less reliable than predicted classes


Lantz, 2013

Fitting a Naive Bayes model

# Additional package that provides the naive_Bayes() model specification
library(discrim)

# Create workflow 
nb_workflow <- workflow() |>                        # We initiate a workflow
  add_recipe(rec) |>                                # We add our text preprocessing recipe to the workflow
  add_model(naive_Bayes(mode = "classification",    # We choose Naive Bayes for classification
                        engine = "naiveBayes"))     # Engine = R package to be used

# Fitting the naive bayes model
m_nb <- fit(nb_workflow, data = training(split))     # Fitting the model
# Predict outcome in test set
predict_nb <- predict(m_nb, new_data = test_data) |>                 # Using 'predict' to predict genre in test set 
  bind_cols(select(testing(split), binary_genre)) |>                 # Add actual genre of test set
  rename(predicted=.pred_class, actual=binary_genre) 

# Check
predict_nb 
# A tibble: 6,964 × 2
   predicted actual
   <fct>     <fct> 
 1 rock      rock  
 2 other     other 
 3 rock      rock  
 4 rock      other 
 5 rock      other 
 6 rock      other 
 7 rock      other 
 8 other     other 
 9 other     other 
10 rock      other 
# ℹ 6,954 more rows

Validation on the test data

  • Using the function conf_mat on the actual and predicted genre columns, we can now inspect the confusion matrix

  • We can already see that there are many false positives (1890!), but fewer false negatives (271).

# Confusion Matrix
predict_nb |> 
  conf_mat(truth = actual, estimate = predicted)
          Truth
Prediction rock other
     rock  3536  1890
     other  271  1267
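
For reference, the performance scores reported on the next slide follow directly from this matrix (treating rock as the “positive” class):

\(Accuracy = \frac{3536 + 1267}{6964} \approx .690\)

\(Precision = \frac{3536}{3536 + 1890} \approx .652\)

\(Recall = \frac{3536}{3536 + 271} \approx .929\)

\(F_1 = 2 \cdot \frac{.652 \cdot .929}{.652 + .929} \approx .766\)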

Assessing performance scores

  • In tidymodels, we have to define a set of measures that we want to compute based on the predictions in the test set.
# Define a set of performance scores to be computed
class_metrics <- metric_set(accuracy, precision, recall, f_meas)
  • We can then use this custom function to extract the relevant measures
# Performance Scores
predict_nb |> 
  class_metrics(truth = actual, estimate = predicted)
# A tibble: 4 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 accuracy  binary         0.690
2 precision binary         0.652
3 recall    binary         0.929
4 f_meas    binary         0.766
  • Overall accuracy is 69%, which is better than chance, but not very good.

  • Precision is also low because of a lot of false positives (65.2%)

  • Yet, recall is quite high as only few rock songs were classified as another genre (92.9%)

Logistic Regression

Why not use a simple statistical model for binary classification?

Theoretical background

  • As mentioned earlier, a linear regression can be described with the following formula:

\(Y_i = b_0 + b_1 x_i + \epsilon_i\)

  • When we classify text, the dependent variable (the outcome class: \(Y_i\)) is not metric, but has only two values (1 = “is class”, 0 = “is not class”).

  • We thus need to estimate the probability of \(Y = 1\) by using the so-called sigmoid function:

\(P(Y) = \frac {1}{1+e^{-(\beta_0 + \beta_1 x)}}\)
  • \(P(Y)\): Probability of \(Y\)
  • \(e\): Basis of the natural logarithm
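
As a purely illustrative example (hypothetical coefficients), with \(\beta_0 = 0\) and \(\beta_1 = 1\), a document with feature value \(x = 2\) would receive:

\(P(Y) = \frac{1}{1+e^{-(0 + 1 \cdot 2)}} = \frac{1}{1+e^{-2}} \approx .88\)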

Inverse logit (sigmoid) function

  • In both classic statistics as well as machine learning, this is known as logistic regression

  • In machine learning, we see it as an algorithm that accomplishes binary classification tasks by predicting the probability of an outcome, event, or observation

  • To train a classifier during text classification tasks, we simply fit a logistic regression model with all text features (e.g., the frequencies of words per document) as input (predictors) and provide the outcome class as output (dependent variable)

  • In contrast to a typical statistical analysis using logistic regression, the machine learning model simply contains many more variables!

Why use logistic regression?

Strengths

  • Comparatively simple model
  • We can look under the hood and investigate the coefficients of individual words (-> a proxy for a word’s importance in predicting the outcome class)
  • Does comparatively well for many classification tasks!

Weaknesses

  • Dependent variable must be binary
  • Relies on an often-faulty assumption that all features are equally important and independent
  • Quite slow with many data sets and features

Lantz, 2013

Fitting a Logistic Regression model

# Additional package needed
library(discrim)

# Creating a workflow
lr_workflow <- workflow() |>                            # Initiate the workflow
  add_recipe(rec) |>                                    # Add recipe
  add_model(logistic_reg(engine = "glm"))               # Plain (unregularized) logistic regression with the standard glm engine

# Fitting the logistic regression model
m_logistic <- fit(lr_workflow, data = training(split))  # Fit model
# Predict outcome in test set
predict_lr <- predict(m_logistic, new_data=test_data) |>  # Predict outcome in test data
  bind_cols(select(testing(split), binary_genre)) |>      # Add actual outcome
  rename(predicted=.pred_class, actual=binary_genre) 

# Check
predict_lr
# A tibble: 6,964 × 2
   predicted actual
   <fct>     <fct> 
 1 rock      rock  
 2 other     other 
 3 rock      rock  
 4 other     other 
 5 other     other 
 6 other     other 
 7 rock      other 
 8 other     other 
 9 other     other 
10 rock      other 
# ℹ 6,954 more rows

Validation

# Confusion Matrix
predict_lr |> 
  conf_mat(truth = actual, estimate = predicted)
          Truth
Prediction rock other
     rock  3171  1123
     other  636  2034
# Performance Scores
predict_lr |> 
  class_metrics(truth = actual, estimate = predicted)
# A tibble: 4 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 accuracy  binary         0.747
2 precision binary         0.738
3 recall    binary         0.833
4 f_meas    binary         0.783
  • Overall accuracy of the Logistic Regression classifier is 74.7%, which is better than the Naive Bayes!

  • Precision is also ok due to less false positives (73.8%)

  • Yet, recall is slightly lower, but still good (83.3%)

What words best predict rock?

  • Because a logistic regression model estimates “slopes” for individual words, we can extract those words that best predict whether a song is rock (or not)
m_logistic |> 
  extract_fit_parsnip() |>
  vip::vi() |> 
  group_by(Sign) |>
  mutate(Sign = recode(Sign, 
                       POS = "negative", 
                       NEG = "positive")) |> 
  top_n(20, wt = abs(Importance)) |> 
  ungroup() |>
  mutate(
    Importance = abs(Importance),
    Variable = str_remove(Variable, "tf_text_"),
    Variable = fct_reorder(Variable, Importance)
  ) |>
  ggplot(aes(x = Importance, 
             y = Variable, 
             fill = Sign)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Sign, scales = "free_y") +
  labs(y = NULL)

Break (5 Minutes)

Support Vector Machines

A versatile black box method

Theoretical background

  • A very frequently used machine learning method, due to its high performance in a variety of tasks

  • A support vector machine (SVM) can be imagined as a “surface” that creates the greatest separation between two groups in a multidimensional space representing classes and their feature values

  • Tries to find the decision boundary between points that maximizes the margin between classes while minimizing errors

  • More formally, a support-vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space (wait what?)

Classification with hyperplanes

  • The following figure depicts two groups of data that can be plotted in two dimensions (here, I show only two dimensions because it is difficult to imagine or illustrate space in greater than two dimensions)

  • Because the groups are perfectly separable by a straight line, they are said to be linearly separable (but hyperplanes don’t have to be linear)

Classification with hyperplanes

  • The task of the SVM algorithm is to identify a line that separates the two classes, but there is always more than one choice

  • Therefore, the algorithm searches for the maximum margin hyperplane (MMH) that creates the greatest separation between the two classes: although any of the three lines separates the classes, the line that leads to the greatest separation will generalize best to new data

  • The support vectors are the points from each class that are closest to the MMH (each class must have at least one support vector)
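
In the linearly separable case, this search can be written as a standard textbook optimization problem (independent of the specific engine used later): a hyperplane is defined by \(w \cdot x + b = 0\), and the SVM chooses the weights \(w\) and offset \(b\) that maximize the margin \(\frac{2}{\|w\|}\), subject to every training point lying on the correct side, i.e., \(y_i(w \cdot x_i + b) \geq 1\) for all \(i\) (with class labels \(y_i \in \{-1, +1\}\)).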

Why use support vector machines?

Strengths

  • Can be used for classification or numeric prediction
  • Not overly influenced by noisy data, not prone to overfitting
  • Easier to use than neural networks
  • Often high accuracy!

Weaknesses

  • Finding the best model requires testing of various combinations of model parameters
  • Can be slow to train, particularly with many features
  • Results in a complex ‘black box’ that is difficult, if not impossible to understand

Lantz, 2013

Fitting a SVM model

library(LiblineaR)
svm_workflow <- workflow() |>
  add_recipe(rec_norm) |>
  add_model(svm_linear(mode = "classification", 
                     engine = "LiblineaR"))

# Fitting the SVM model
m_svm <- fit(svm_workflow, data = train_data)
# Predict outcome in test data
predict_svm <- predict(m_svm, new_data=testing(split)) |>
  bind_cols(select(testing(split), binary_genre)) |>
  rename(predicted=.pred_class, actual=binary_genre)

# Check
predict_svm
# A tibble: 6,964 × 2
   predicted actual
   <fct>     <fct> 
 1 rock      rock  
 2 other     other 
 3 rock      rock  
 4 other     other 
 5 rock      other 
 6 other     other 
 7 rock      other 
 8 other     other 
 9 other     other 
10 rock      other 
# ℹ 6,954 more rows

Validation

# Confusion matrix
predict_svm |> 
  conf_mat(truth = actual, estimate = predicted)
          Truth
Prediction rock other
     rock  3094  1126
     other  713  2031
# Performance scores
predict_svm |> 
  class_metrics(truth = actual, estimate = predicted)
# A tibble: 4 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 accuracy  binary         0.736
2 precision binary         0.733
3 recall    binary         0.813
4 f_meas    binary         0.771
  • Overall performance is very similar to logistic regression

  • Can probably be improved with fine-tuning of hyper-parameters

Artificial Neural Networks

The human brain abstracted to a numerical model

Theoretical background

  • An artificial neural network (ANN) models the relationship between a set of input signals and an output signal using a model derived from our understanding of the human brain

  • Just like the brain uses a network of interconnected cells called “neurons” (a) to provide fast learning capabilities, an ANN uses a network of artificial neurons (b) to solve learning tasks

Source: Arthur Arnx/Medium

How does a neural network work?

  • The operation of an ANN is straightforward:

    • One enters variables as inputs (e.g., features of a text)
    • And after some calculations via different neurons, an output is returned (e.g. the word “politics”)
  • In the simplest form, neurons are stacked in columns: a neuron in column n can only be connected to inputs from column n-1 and provides outputs to neurons in column n+1

The Neuron as generalized linear model

  • First, a neuron adds up the values of all neurons from the previous column it is connected to (here x1, x2, and x3)

  • Each of these values is multiplied by a so-called weight (w1, w2, w3), the strength of the connection between two neurons, before being summed

  • A bias value may be added (shifting the point at which the neuron activates)

  • After all those summations, the neuron finally applies a function called activation function to the obtained value

Source: Arthur Arnx/Medium

The activation function

  • The activation function serves to turn the total value calculated into a number between 0 and 1

  • A threshold then defines at what value the function should “fire” the output to the next neuron (of course this can be probabilistic)

  • We can choose from different activation functions; which works best is sometimes hard to tell

Source: AI Wiki
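
A minimal sketch of a single artificial neuron in R, with made-up inputs, weights, bias, and threshold (purely illustrative numbers, not taken from the lyrics example):

# Hypothetical inputs from the previous column and their connection weights
x <- c(0.5, 1.0, 0.2)   # x1, x2, x3
w <- c(0.4, -0.3, 0.8)  # w1, w2, w3
b <- 0.1                # bias

# Weighted sum of inputs plus bias
z <- sum(w * x) + b

# Sigmoid activation: squashes z into the range (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))
a <- sigmoid(z)

a          # activation value
a > 0.5    # does the neuron "fire", given a threshold of 0.5?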

How the neural network learns

Source: 3Blue1Brown

  • In a first try, the ANN sets its weights randomly and thus cannot produce the right output (except by luck)

  • If the (random) choice was a good one, the current parameters are kept and the next input is processed. If the obtained output doesn’t match the desired output, the weights are changed.

  • To determine which weights to modify, an ANN uses backpropagation, which consists of “going back” through the neural network and inspecting every connection to check how the output would change if a weight were changed

  • The learning rate thereby determines the speed at which a neural network learns, i.e., how strongly it modifies a weight (e.g., little by little or in bigger steps).

  • Learning rate and number of learning cycles (epochs) have to be set manually upfront!
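
The role of the learning rate is easiest to see in the generic gradient-descent update rule on which backpropagation is based (standard formulation):

\(w_{new} = w_{old} - \eta \cdot \frac{\partial L}{\partial w}\)

where \(\eta\) is the learning rate and \(\frac{\partial L}{\partial w}\) is the gradient of the loss (prediction error) with respect to the weight: a small \(\eta\) changes weights little by little, a large \(\eta\) takes bigger steps.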

Architecture of a neural network

  • A simple model with just inputs and outputs is called a perceptron

  • Despite its usefulness for many tasks, it cannot model combinations (interactions) of features

  • But we can add “hidden” layers (latent variables), creating a multilayer perceptron, which is able to process more complex inputs

  • Backpropagation is thus the way a neural network adjusts the weights of its connections (-> Can take a long time to find the right solutions)

Universal approximator: neural networks with a single hidden layer can approximate any continuous function!

Why use neural networks?

Strengths

  • Can be adapted to almost any prediction problem
  • Capable of modelling more complex patterns than nearly any other algorithm
  • Makes few assumptions about the data’s underlying characteristics

Weaknesses

  • Extremely computationally intensive and thus slow (however, simple models still fast)
  • Very prone to overfitting
  • Results in a complex ‘black box’ that is difficult, if not impossible to interpret

Lantz, 2013

Fitting an artificial neural network

# For replication purposes
set.seed(42)

# Specify multilayer perceptron
nnet_spec <- 
  mlp(epochs = 400,          # <- times that algorithm will work through train set
      hidden_units = c(6),   # <- nodes in hidden units
      penalty = 0.01,        # <- regularization
      learn_rate = 0.2) |>   # <- shrinkage
  set_engine("brulee") |>    # <-- engine = R package
  set_mode("classification")

Fitting an artificial neural network

# For replication purposes
set.seed(42)

# Specify multilayer perceptron
nnet_spec <- 
  mlp(epochs = 400,          # <- times that algorithm will work through train set
      hidden_units = c(6),   # <- nodes in hidden units
      penalty = 0.01,        # <- regularization
      learn_rate = 0.2) |>   # <- shrinkage
  set_engine("brulee") |>    # <-- engine = R package
  set_mode("classification")

# Create workflow
ann_workflow <- workflow() |>
  add_recipe(rec_norm) |>    # Use updated recipe with normalization
  add_model(nnet_spec)

Fitting an artificial neural network

# For replication purposes
set.seed(42)

# Specify multilayer perceptron
nnet_spec <- 
  mlp(epochs = 400,          # <- times that algorithm will work through train set
      hidden_units = c(6),   # <- nodes in hidden units
      penalty = 0.01,        # <- regularization
      learn_rate = 0.2) |>   # <- shrinkage
  set_engine("brulee") |>    # <-- engine = R package
  set_mode("classification")

# Create workflow
ann_workflow <- workflow() |>
  add_recipe(rec_norm) |>    # Use updated recipe with normalization
  add_model(nnet_spec)

# Fit model
m_ann <- fit(ann_workflow, train_data)
# Predict outcome in test data
predict_ann <- predict(m_ann, new_data=test_data) |>
  bind_cols(select(testing(split), binary_genre)) |>
  rename(predicted=.pred_class, actual=binary_genre)

# Check
predict_ann |> head(n = 4)
# A tibble: 4 × 2
  predicted actual
  <fct>     <fct> 
1 rock      rock  
2 other     other 
3 rock      rock  
4 other     other 

Validation

# Confusion matrix
predict_ann |> 
  conf_mat(truth = actual, estimate = predicted)
          Truth
Prediction rock other
     rock  3367  1305
     other  440  1852
# Performance score
predict_ann |> 
  class_metrics(truth = actual, estimate = predicted)
# A tibble: 4 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 accuracy  binary         0.749
2 precision binary         0.721
3 recall    binary         0.884
4 f_meas    binary         0.794
  • Highest accuracy of all four algorithms (74.9%)

  • Most balanced performance scores

  • Yet, performance overall similar to logistic regression and SVM

Testing Different Algorithms and Recipes

Comparison of the previous approaches

Fine-tuning preprocessing and hyperparameters

  • You may have noticed that I always set some hyperparameters in all of the models

  • There is no clear rule of how to set these parameters and their influence on performance is often unknown

  • Using trial and error, we simply compare many different model specifications to find optimal hyperparameters

  • Very good examples are the hyperparameters of support vector machines: it is hard to know how soft our margins should be, and we may also be unsure about the right kernel, or, in the case of a polynomial kernel, how many degrees we want to consider

  • Similarly, we have to simply try out whether a neural network needs more than one hidden layer or how many nodes make sense, what learning rate works best, etc.

  • Good machine learning practice is to conduct a so-called grid search, i.e., systematically running combinations of different specifications (but this is computationally expensive!)

Hvitfeldt & Silge (2021)

Grid search for best neural network architecture

# Create a model function for tuning a multilayer perceptron neural network
mlp_spec <- mlp(hidden_units = tune(),     # <- parameters marked with tune() will be tuned
                penalty = tune(), 
                epochs = tune()) |> 
  set_engine("brulee", trace = 0) |> 
  set_mode("classification")

Grid search for best neural network architecture

# Create a model function for tuning a multilayer perceptron neural network
mlp_spec <- mlp(hidden_units = tune(),     
                penalty = tune(), 
                epochs = tune()) |> 
  set_engine("brulee", trace = 0) |>  
  set_mode("classification")

# Extract "dials" for parameter tuning
mlp_param <- extract_parameter_set_dials(mlp_spec)

# Simple combinatorial design
mlp_param |> grid_regular(levels = c(hidden_units = 2, penalty = 2, epochs = 2)) 
# A tibble: 8 × 3
  hidden_units      penalty epochs
         <int>        <dbl>  <int>
1            1 0.0000000001      5
2           10 0.0000000001      5
3            1 1                 5
4           10 1                 5
5            1 0.0000000001    500
6           10 0.0000000001    500
7            1 1               500
8           10 1               500

Grid search for best neural network architecture

# Create a model function for tuning a multilayer perceptron neural network
mlp_spec <- mlp(hidden_units = tune(),     
                penalty = tune(), 
                epochs = tune()) |> 
  set_engine("brulee", trace = 0) |>  
  set_mode("classification")

# Extract "dials" for parameter tuning
mlp_param <- extract_parameter_set_dials(mlp_spec)

# Random design with 1000 combinations
mlp_param |> grid_random(size = 1000) |> summary()
  hidden_units       penalty              epochs     
 Min.   : 1.000   Min.   :0.0000000   Min.   :  5.0  
 1st Qu.: 3.000   1st Qu.:0.0000000   1st Qu.:115.0  
 Median : 6.000   Median :0.0000212   Median :254.5  
 Mean   : 5.558   Mean   :0.0483832   Mean   :251.1  
 3rd Qu.: 8.000   3rd Qu.:0.0052931   3rd Qu.:382.2  
 Max.   :10.000   Max.   :0.9945632   Max.   :500.0  

Random design

  • We can plot the random combinations to get an idea what they will cover

  • Bear in mind, such a grid search would be quite computationally intensive as 50 neural networks would need to be fitted!

Fitting models across the grid

set.seed(42)

# Create new workflow
mlp_wflow <- workflow() |> 
  add_recipe(rec_norm) |> 
  add_model(mlp_spec)

# Adjust parameter extremes
mlp_param <- mlp_wflow  |>  
  extract_parameter_set_dials() |> 
  update(epochs = epochs(c(200, 600)),
         hidden_units = hidden_units(c(1, 9)),
         penalty = penalty(c(-10, -1)))

# Set metric of interest
acc <- metric_set(accuracy)

# Define resampling strategy
twofold <- vfold_cv(lyrics_data, v = 2) 

# Run the tuning process
mlp_reg_tune <- mlp_wflow  |> 
  tune_grid(
    resamples = twofold,
    grid = mlp_param  |>  
      grid_regular(levels = c(hidden_units = 4, penalty = 4, epochs = 3)),
    metrics = acc
  )

Evaluation

  • The result of this tuning grid search is a table that shows the best combinations of parameters in descending order

  • We can see here that 9 nodes in one hidden layer, a penalty of 0.0001, and 600 epochs lead to the best accuracy in the grid, but it is actually slightly worse than our original model with 6 nodes, a penalty of 0.01, and 400 epochs

# Get best fit
best_fit <- show_best(mlp_reg_tune, n = 48) |> 
  select(-.estimator, -.config, -.metric) |> 
  rename(accuracy = mean)
best_fit
# A tibble: 48 × 6
   hidden_units      penalty epochs accuracy     n  std_err
          <int>        <dbl>  <int>    <dbl> <int>    <dbl>
 1            9 0.0001          600    0.746     2 0.00125 
 2            3 0.1             600    0.745     2 0.00849 
 3            9 0.1             600    0.745     2 0.00918 
 4            6 0.0001          600    0.743     2 0.00429 
 5            9 0.0000000001    600    0.742     2 0.00447 
 6            6 0.0000000001    600    0.742     2 0.00211 
 7            6 0.0000001       600    0.741     2 0.00257 
 8            9 0.0000001       600    0.740     2 0.000674
 9            3 0.0000001       600    0.740     2 0.000360
10            1 0.1             600    0.739     2 0.00182 
# ℹ 38 more rows

Evaluation

  • The effect of parameters becomes even more clear, when we visualize the grid search:

Using the best neural network classifier

  • The best fitting model is actually our first model:
# Model architecture
m_ann |> extract_fit_parsnip()
parsnip model object

Multilayer perceptron

relu activation
6 hidden units,  6,020 model parameters
10,445 samples, 1,000 features, 2 classes 
class weights rock=1, other=1 
weight decay: 0.01 
dropout proportion: 0 
batch size: 9401 
learn rate: 0.2 
validation loss after 55 epochs: 0.527 

How does it perform on some completely new songs?

# Some test songs (not in the training nor test data)
test_songs
# A tibble: 2 × 3
  artist                      song               text                           
  <chr>                       <chr>              <chr>                          
1 Joan Jett & the Blackhearts I love rock'n'roll "I saw him dancing there by th…
2 Celine Dion                 The Power of Love  "The whispers in the morning\n…
  • We can simply use the predict function and supply the new songs. This code could of course also be used in software that detects the genre from song lyrics
# Predict genre
predict(m_ann, test_songs)
# A tibble: 2 × 1
  .pred_class
  <fct>      
1 rock       
2 other      

Impact of text preprocessing on performance?

  • Scharkow (2013) ran a simple simulation study in which he systematically varied text preprocessing steps

  • The classifier was always Naive Bayes; below, we see the average difference in performance

General Drivers of model performance

  1. Task difficulty

  2. Amount of training data

  3. Choice of features (n-grams, lemmata, etc)

  4. Text preprocessing (e.g., exclude or include stopwords?)

  5. Representation of text (tf, tf-idf, word-embedding)

  6. Tuning of algorithm (what we did in the grid search)

Validity of different approaches

  • Van Atteveldt et al. (2021) re-analyzed data reported in Boukes et al. (2020) to assess the validity of different text classification approaches for sentiment analysis

  • The data included news from a total of ten newspapers and five websites published between February 1 and July 7, 2015:

    • three quality newspapers (NRC Handelsblad, Trouw, de Volkskrant)
    • a financial newspaper (Financieel Dagblad)
    • three popular newspapers (Algemeen Dagblad, Metro, De Telegraaf)
    • three regional outlets (Dagblad van het Noorden, de Gelderlander, Noordhollands Dagblad)

Main results

Examples from the literature

How Machine Learning is used in Communication Science

Example 1: Incivility in Facebook comments

  • Su et al. (2018) examined the extent and patterns of incivility in the comment sections of 42 US news outlets’ Facebook pages in 2015–2016

  • News source outlets included

    • National-news outlets (e.g., ABC, CBS, CNN…)
    • Local-news outlets (e.g., The Denver Post, San Francisco Chronicle…)
    • Conservative and liberal partisan news outlets (e.g., Breitbart, The Daily Show…)
  • Implemented a combination of manual coding and supervised machine learning to code:

    • Civility
    • Interpersonal rudeness
    • Personal rudeness
    • Impersonal extreme incivility
    • Personal extreme incivility

Results: Overall differences

Example 2: Electoral news sharing in Mexico

  • de León et al. (2021) explored how elections transform news sharing behaviour on Facebook

  • They investigated changes in news coverage and news sharing behaviour on Facebook

    • by comparing election and routine periods, and
    • by addressing the ‘news gap’ between preferences of journalists and news consumers on social media.
  • Employed a novel data set of news articles (N = 83,054) in Mexico

  • First coded 2,000 articles manually into topics (Politics, Crime and Disasters, Culture and Entertainment, Economic and Business, Sports, and Other), then used support vector machines to classify the rest

Results

  • During periods of heightened political activity, both the publication and dissemination of political news increases

  • The gap between the news choices of journalists and consumers narrows, with increased political news sharing leading to a decrease in the sharing of other news.

Summary, Conclusion, and Outlook

Next week, we reach the present!

Machine Learning Text Classification Pipeline

General considerations

  • Machine learning in the social sciences is generally used to solve an engineering problem

  • Output of Machine Learning is input for “actual” statistical model (e.g., we classify text, but run an analysis of variance with the output)

  • Ethical concerns

    • Is it okay to use a model we can’t possibly understand (e.g., SVM, neural networks)?

    • If they produce false positives or false negatives, which in turn discriminate against someone or introduce bias, it is difficult to find the source of the error and assign responsibility

  • We always need to validate model on unseen and representative test data!

  • Think before you run models or you will waste a lot of computational resources

Conclusion and outlook

  • Classic machine learning is a useful tool for generalizing from a sample

  • It is very useful to reduce the amount of manual coding needed

  • That said, the field has moved on and innovations are fast-paced these days:

    • Deep Learning, Transfer Learning, Attention, Self-Attention…. (Next week!!!)


Thank you for your attention!

Required Reading



van Atteveldt, W., van der Velden, M. A. C. G., & Boukes, M. (2021). The Validity of Sentiment Analysis: Comparing Manual Annotation, Crowd-Coding, Dictionary Approaches, and Machine Learning Algorithms. Communication Methods and Measures, 15(2), 121-140. https://doi.org/10.1080/19312458.2020.1869198

Su, L. Y.-F., Xenos, M. A., Rose, K. M., Wirz, C., Scheufele, D. A., & Brossard, D. (2018). Uncivil and personal? Comparing patterns of incivility in comments on the Facebook pages of news outlets. New Media & Society, 20(10), 3678–3699. https://doi.org/10.1177/1461444818757205


(available on Canvas)

References

  • Boumans, J. W., & Trilling, D. (2016). Taking stock of the toolkit: An overview of relevant automated content analysis approaches and techniques for digital journalism scholars. Digital journalism, 4(1), 8-23.

  • de León, E., Vermeer, S. & Trilling, D. (2023). Electoral news sharing: a study of changes in news coverage and Facebook sharing behaviour during the 2018 Mexican elections. Information, Communication & Society, 26(6), 1193-1209. https://doi.org/10.1080/1369118X.2021.1994629

  • Hvitfeldt, E., & Silge, J. (2021). Supervised Machine Learning for Text Analysis in R. CRC Press. https://smltar.com/

  • Lantz, B. (2013). Machine learning in R. Packt Publishing Ltd.

  • Scharkow, M. (2013). Thematic content analysis using supervised machine learning: An empirical evaluation using german online news. Quality & Quantity, 47(2), 761–773. https://doi.org/10.1007/s11135-011-9545-7

  • Su, L. Y.-F., Xenos, M. A., Rose, K. M., Wirz, C., Scheufele, D. A., & Brossard, D. (2018). Uncivil and personal? Comparing patterns of incivility in comments on the Facebook pages of news outlets. New Media & Society, 20(10), 3678–3699. https://doi.org/10.1177/1461444818757205

  • van Atteveldt, W., van der Velden, M. A. C. G., & Boukes, M. (2021). The Validity of Sentiment Analysis: Comparing Manual Annotation, Crowd-Coding, Dictionary Approaches, and Machine Learning Algorithms. Communication Methods and Measures, 15(2), 121-140. https://doi.org/10.1080/19312458.2020.1869198

Example Exam Question (Multiple Choice)

Van Atteveldt and colleagues (2021) tested the validity of various automated text analysis approaches. What was their main result?


A. English dictionaries performed better than Dutch dictionaries in classifying the sentiment of Dutch news paper headlines.

B. Dictionary approaches were as good as machine learning approaches in classifying the sentiment of Dutch news paper headlines.

C. Of all automated approaches, supervised machine learning approaches performed the best in classifying the sentiment of Dutch news paper headlines.

D. Manual coding and supervised machine learning approaches performed similarly well in classifying the sentiment of Dutch news paper headlines.

Example Exam Question (Open Format)

Describe the typical process used in supervised text classification.

Any supervised machine learning procedure to analyze text usually contains at least 4 steps:

  1. One has to manually code a small set of documents for whatever variable(s) you care about (e.g., topics, sentiment, source,…).

  2. One has to train a machine learning model on the hand-coded/gold-standard data, using the variable as the outcome of interest and the text features of the documents as the predictors.

  3. One has to evaluate the effectiveness of the machine learning model via cross-validation. This means one has to test the model on new (held-out) data.

  4. Once one has trained a model with sufficient predictive accuracy, precision, and recall, one can apply it to documents that have never been hand-coded, or use it for the purpose it was designed for (e.g., spam detection software).