Week 5: Research Projects
Dr. Philipp K. Masur
Obtaining Text: Scraping data from Facebook and importing it into R
Feature Engineering: Removing noise, summarizing data to create a tidy data set, transforming the data into a document-feature matrix
Text classification: Running a sentiment analysis to classify the posts (using a dictionary approach)
Validation: Assessing the classification result against a gold standard
Substantive Analysis: Predicting sentiment towards migration with party ideology
Visualize: Plotting the relationship between sentiment and party ideology with ggplot2
Communicate: Creating a table and a figure and reporting the results in the paper
Heidenreich et al., 2019
In this case, the text classification via a dictionary approach was just a tool to get the data they needed
The substantive analysis then used the labels that were created to assess the relationship between party ideology and sentiment towards migration
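The classification and validation steps of such a pipeline can be sketched in a few lines of base R. This is only a minimal illustration under stated assumptions: the posts, the gold-standard labels, and the tiny word lists are all made up for this example and are not the data or the dictionary used by Heidenreich et al.

```r
# Toy posts with hand-coded gold-standard labels (hypothetical data)
posts <- data.frame(
  id   = 1:4,
  text = c("Migration is a great opportunity for our economy",
           "This terrible policy is a threat to our country",
           "We welcome refugees and support safe integration",
           "The crisis shows a failure of border control"),
  gold = c("positive", "negative", "positive", "negative"),
  stringsAsFactors = FALSE
)

# Hypothetical mini dictionary; a real project would use a validated one
positive_words <- c("great", "opportunity", "welcome", "support", "safe")
negative_words <- c("terrible", "threat", "crisis", "failure")

# Score one text: positive dictionary hits minus negative dictionary hits
score_text <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  sum(tokens %in% positive_words) - sum(tokens %in% negative_words)
}

posts$score <- vapply(posts$text, score_text, integer(1), USE.NAMES = FALSE)
posts$predicted <- ifelse(posts$score > 0, "positive",
                   ifelse(posts$score < 0, "negative", "neutral"))

# Validation: confusion matrix and accuracy against the gold standard
confusion <- table(predicted = posts$predicted, gold = posts$gold)
accuracy  <- mean(posts$predicted == posts$gold)
print(confusion)
print(accuracy)
```

The resulting labels in `posts$predicted` are what a substantive analysis would then use as a variable, for example as the outcome in a regression on party ideology.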
Obtaining Text: political news articles from six prominent Austrian news outlets spanning 2012 to 2021 via the API of the Austrian Press Agency
Feature Engineering: Used specific search strings previously validated by the Austrian National Election Study to identify relevant articles
Text classification: Used automated content analysis (using BERT, fine-tuned on an existing data set) of the political news articles (n = 188,203) to scrutinize political actors’ visibility and favorability in news coverage
Validation: Ran specific experiments to test the validity of the final model (accuracy = .77)
Substantive Analysis: Difference-in-differences estimation to find out whether Kurz (Austrian chancellor at the time) was favored by certain media
Visualize: Plotting differences between time points using coefficient plots with ggplot2
Communicate: Reporting the results in the paper
Balluff et al., 2023
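The logic of a difference-in-differences estimation can be illustrated with simulated data. The sketch below is not the authors' model: the variable names (`treated_outlet`, `post_period`, `favorability`) and all numbers are assumptions made up for this example. It only shows that the interaction coefficient in a linear model equals the difference between the before-after changes of the two groups.

```r
set.seed(42)
n <- 400

# Simulated article-level data (hypothetical; not the study's data)
articles <- data.frame(
  treated_outlet = rep(c(0, 1), each = n / 2),   # 1 = outlet of interest
  post_period    = rep(c(0, 1), times = n / 2)   # 1 = after the event
)
true_effect <- 0.5
articles$favorability <- 0.2 * articles$treated_outlet +
  0.1 * articles$post_period +
  true_effect * articles$treated_outlet * articles$post_period +
  rnorm(n, sd = 0.5)

# The interaction coefficient is the difference-in-differences estimate
did_model    <- lm(favorability ~ treated_outlet * post_period, data = articles)
did_estimate <- coef(did_model)[["treated_outlet:post_period"]]
print(summary(did_model)$coefficients)

# The same estimate computed by hand from the four group means
m <- with(articles, tapply(favorability, list(treated_outlet, post_period), mean))
did_by_hand <- unname((m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"]))
print(c(model = did_estimate, by_hand = did_by_hand))
```

The estimates with their confidence intervals could then be shown in a coefficient plot, as described in the Visualize step.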
Bivariate regression: Estimating relationships between two variables
Multiple linear regression: Predicting a dependent variable with multiple independent variables
Mediation analysis: Understanding how something may indirectly affect/relate to something else
Moderation analysis: Testing whether a third variable influences a relationship or effect
Analysis of Variance: Testing differences in means between groups
Call:
lm(formula = mpg ~ hp, data = d)

Residuals:
    Min      1Q  Median      3Q     Max
-5.7121 -2.1122 -0.8854  1.5819  8.2360

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
hp          -0.06823    0.01012  -6.742 1.79e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.863 on 30 degrees of freedom
Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892
F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07
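As a short refresher, the methods listed above can each be written as a one-line model call. The sketch assumes that `d` is R's built-in `mtcars` data set, which is not stated on the slide but is consistent with the printed output; the choice of `wt`, `am`, and `cyl` as additional variables is only for illustration.

```r
# Assumption: `d` is the built-in mtcars data set (32 cars)
d <- mtcars

# Bivariate regression: fuel efficiency predicted by horsepower
m1 <- lm(mpg ~ hp, data = d)

# Multiple linear regression: add weight as a second predictor
m2 <- lm(mpg ~ hp + wt, data = d)

# Moderation: does the effect of horsepower depend on transmission type (am)?
m3 <- lm(mpg ~ hp * am, data = d)

# Analysis of variance: does mean mpg differ by number of cylinders?
a1 <- aov(mpg ~ factor(cyl), data = d)

print(round(coef(m1), 5))
print(summary(m2)$coefficients)
print(summary(m3)$coefficients)
print(summary(a1))
```

In `m3`, the coefficient of `hp:am` is the moderation term: it indicates how much the slope of `hp` differs between the two transmission types.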
Data Wrangling: Transforming and reshaping data to suit specific needs
Data Visualization: Creating insights with graphs and figures
Basics of Text Analysis: Extracting meaning from text / text as data
Automatic Text Analysis: Classifying big corpora
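library(tidyverse)  # read_csv(), group_by(), summarize(), filter()
library(tidytext)   # unnest_tokens(), stop_words, cast_dfm()
url <- 'https://bit.ly/2QoqUQS'
corp <- read_csv(url) |>
  # One row per word
  unnest_tokens(output = word, input = text) |>
  # Drop common stop words
  anti_join(stop_words, by = "word") |>
  # Count each word within each paragraph
  group_by(paragraph, word) |>
  summarize(n = n()) |>
  # Still grouped by paragraph: keep paragraphs with more than 100 distinct remaining words
  filter(n() > 100) |>
  # Convert to a quanteda document-feature matrix (requires quanteda to be installed)
  cast_dfm(document = paragraph, term = word, value = n)
corp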
Document-feature matrix of: 436 documents, 28,071 features (96.80% sparse) and 0 docvars.
features
docs 04 05 10 100th 11 11th 12 1787 1979 1980
1 1 1 2 2 1 1 1 1 1 1
2 0 0 0 0 0 0 1 0 0 0
3 0 0 2 0 0 0 1 0 0 2
4 0 0 0 0 0 2 0 0 0 0
5 0 0 2 0 1 0 2 0 0 0
6 0 0 2 0 2 1 1 0 0 0
[ reached max_ndoc ... 430 more documents, reached max_nfeat ... 28,061 more features ]
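To make the printed structure easier to read, here is a minimal base R sketch of what a document-feature matrix is and how its sparsity is obtained. The three documents are made-up example text, and the code deliberately avoids tidytext and quanteda so that each step is visible; it is not how `cast_dfm()` is implemented.

```r
# Three toy documents (hypothetical example text)
docs <- c(d1 = "the union is strong",
          d2 = "the economy is growing",
          d3 = "strong economy strong union")

# Tokenize: lower-case and split on whitespace
tokens <- strsplit(tolower(docs), "\\s+")

# Vocabulary = all unique words, sorted alphabetically
vocab <- sort(unique(unlist(tokens)))

# Rows = documents, columns = features, cells = word counts
dfm <- t(vapply(tokens,
                function(tok) as.integer(table(factor(tok, levels = vocab))),
                integer(length(vocab))))
colnames(dfm) <- vocab

# Sparsity = share of cells that are zero
sparsity <- mean(dfm == 0)
print(dfm)
print(sparsity)
```

With thousands of documents and features, most words never occur in most documents, which is why real matrices such as the one above are more than 90% sparse.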
For the purpose of this course, we have created more knowledge clips and tutorials that you can work through by yourself (if necessary for the project at hand). These tutorials cover some things that you already know (but haven’t done in R) as well as some new methods that might be interesting for the research projects:
Test theory and factor analyses: Measuring abstract concepts
Advanced statistical modeling:
Web Scraping: Getting data from online sources
More Text Analysis: Unsupervised Machine Learning
We create small working groups (4 students per group)
You will conduct your own research projects
Teachers will supervise your progress and support you during the practical session times
At the end, you will present your findings at a mini-conference (18th of December)
Each group will be assigned to a) a teacher (Kasper, Emma or Roan) and b) a particular time slot (during the usual practical session times)
You will have 15-minute time slots with your supervisor
The first supervision session is not on Tuesday, but on Thursday (so that you can go to the master thesis information session)
However, you need to send a short proposal for the research project to your teacher by Tuesday night.
In the actual supervision sessions, we expect you to come prepared:
This mini-conference is going to be a so-called “poster-market”
You will create an academic poster (DIN A0) that we will put on the wall
During the conference, everybody (including some colleagues) will be able to check out everybody’s project and discuss the results!
Time and Date: 18th December from 9:30 to 11:30
Structural Elements:
Title:
Introduction:
Methods:
Results:
Discussion:
References:
Tips and Tricks for Design:
Visual Hierarchy:
Consistent and Balanced Style:
Use of Images and Graphics:
Color Scheme:
Limited Text:
You are required to hand in the RMarkdown file and a compiled HTML output beforehand (Sharp Deadline: 17th of December 2023 at 24:00)!
If you want to have your poster printed by us, you need to submit it as a PDF via mail to p.k.masur@vu.nl on Thursday, the 14th of December 2023 (at 24:00 at the latest). We need the Friday to actually print them.
You can choose to print it yourself later, but unfortunately, we won’t be able to reimburse you then.
Any questions?
Please go to Canvas
You can assign yourself to a group (max 4 people)
In the end, I will put remaining people randomly into the groups with open places
Note: Please only build groups with people from groups that have the same time slot (otherwise your supervision will be at a different time than the practical sessions!)
Public repositories of research data sets
Other public repositories
More data:
Scraping primary sources
Proprietary texts from third parties:
Your own social media data
Check out this page: https://github.com/awesomedata/awesome-public-datasets
There are tons of great data sets worth exploring for text analysis purposes
paperboy
We can use the paperboy package to read the actual articles into R:
library(tidyverse)  # read_csv(), select()
library(paperboy)
# Load the URLs exported from mediacloud
d <- read_csv("data/mediacloud_example.csv")
# Provide the first four URLs to this function
corp <- pb_deliver(d$url[1:4])
# Resulting text corpus
corp |>
  select(domain, datetime, headline, text)
# A tibble: 4 × 4
domain datetime headline text
<chr> <dttm> <chr> <chr>
1 amsterdamnews.net NA Storm Ciaran Whips Western Europe… "Voi…
2 nltimes.nl NA About 100 Dutch climate activists… "Abo…
3 nrc.nl 2023-11-03 00:00:00 Wat vindt NRC | Omgang met AI zal… ""
4 nu.nl 2023-11-02 15:20:54 Gaat klimaatverandering sneller d… "De …
rwhatsapp
It is very easy to download a WhatsApp conversation
We simply have to export the conversation
Using the rwhatsapp package, we can easily read this into R to get an interesting corpus:
# A tibble: 45 × 5
time text source emoji emoji_name
<dttm> <chr> <chr> <list> <list>
1 2023-11-28 20:52:12 "Messages and calls are end-to-… data/… <NULL> <NULL>
2 2023-11-28 20:52:12 "You created group “Replication… data/… <NULL> <NULL>
3 2023-11-28 20:52:34 "image omitted" data/… <NULL> <NULL>
4 2023-11-28 20:55:05 "Major changes to the table in … data/… <NULL> <NULL>
5 2023-11-28 20:57:53 "That's all I can manage tonigh… data/… <chr> <chr [1]>
6 2023-11-28 21:05:47 "Thanks and good to hear baby i… data/… <NULL> <NULL>
7 2023-11-29 10:12:05 "Looks good! I like it. A few n… data/… <NULL> <NULL>
8 2023-11-29 10:14:25 "But you even used the word \"g… data/… <NULL> <NULL>
9 2023-11-29 10:16:22 "Also I am not sure about the a… data/… <NULL> <NULL>
10 2023-11-29 10:16:23 "See e.g., this paper's figure … data/… <NULL> <NULL>
# ℹ 35 more rows
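The call that produced this tibble is not shown here. A minimal sketch of how an exported chat can be read, assuming the rwhatsapp package is installed, uses its `rwa_read()` function. The example below reads the small sample export that ships with the package; for your own data you would pass the path to your exported text file instead (the path in the comment is hypothetical).

```r
library(rwhatsapp)

# Sample export bundled with the package; for your own chat, use the path
# to your exported file instead, e.g. "data/my_chat.txt" (hypothetical path)
history <- system.file("extdata", "sample.txt", package = "rwhatsapp")

# Parse the export into a tibble with one row per message
chat <- rwa_read(history)
chat
```

Keep in mind that a conversation contains other people's messages as well, so author names should be removed or anonymized before the data is shared or analyzed.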
Try to formulate an appropriate research question that can be studied with the respective data set, text documents, tweets,…
Formulate hypotheses using theory and relevant literature
If you combine text with survey data, test scales before using them!
Engage in descriptive analyses
Visualize data
Test assumptions if necessary
Put an emphasis on correctly reporting and interpreting your methods and results
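One common way to test a scale before using it is to check its internal consistency with Cronbach's alpha. The sketch below computes alpha by hand in base R on simulated items; the data, the item names, and the number of items are assumptions for illustration only, and a full scale test would usually also include a factor analysis.

```r
set.seed(1)
n <- 200

# Simulate four survey items driven by one latent trait (hypothetical data)
latent <- rnorm(n)
items <- data.frame(
  item1 = latent + rnorm(n, sd = 0.8),
  item2 = latent + rnorm(n, sd = 0.8),
  item3 = latent + rnorm(n, sd = 0.8),
  item4 = latent + rnorm(n, sd = 0.8)
)

# Cronbach's alpha: k / (k - 1) * (1 - sum of item variances / variance of sum score)
cronbach_alpha <- function(items) {
  k         <- ncol(items)
  item_vars <- apply(items, 2, var)
  total_var <- var(rowSums(items))
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

alpha <- cronbach_alpha(items)

# Descriptive statistics for each item, then the reliability estimate
print(round(sapply(items, function(x) c(mean = mean(x), sd = sd(x))), 2))
print(round(alpha, 2))
```

If alpha is acceptable, the items can be averaged into a mean index (for example with `rowMeans(items)`) and used in the substantive analysis.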
Goal: Investigating news stance towards Dutch politicians before and after elections
Method: Several different possibilities:
Goal: Create algorithm that can predict sentiment of financial news sufficiently well
Method: Comparing various supervised machine learning approaches and zero-shot learning
Data: https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news
Goal: Detect linguistic bias in specific newspaper articles
Method: Using dictionary or zero-shot approaches to detect linguistic bias (e.g., gender bias)
Goal: Investigating sentiment in Tweets/News coverage about a company before and after a crisis
Method: Manual coding of a subset, supervised machine learning afterwards or zero-shot learning
Goal: Does ChatGPT reproduce biases and stereotypes?
Method: Prompt GPT to write stories or headlines, gather the output, and analyze the text for stereotypes or biases
Goal: Use of emotions (or emojis) in WhatsApp conversations
Method: Classifying emotions in WhatsApp conversations (would require a bit more than just your own)
rwhatsapp
Analyzing your own data
Going beyond text analysis
You also learned a lot about data wrangling and visualization in this course
You can also use this knowledge and engage in other types of data analysis
e.g., using World Values Survey data to show trends in values over time and across countries; secondary analyses of large-scale survey data (Pew Research, GESIS…)
Computational Analysis of Digital Communication