Multiverse Analysis in R

Dr. Philipp K. Masur

Horoscopes

Fortune-telling framework:

  • Practice of predicting information about a person’s life

  • From vague facts, provide vague advice

  • Must be applicable to any case

  • Exaggerated importance of advice

Tasseomancy

A divination or fortune-telling method that interprets patterns in tea leaves, coffee grounds, or wine sediments:

  • snake = enmity or falsehood
  • spade = good fortune through industry
  • mountain = journey of hindrance
  • house = change, success
  • anchor = prosperity in business, stable romantic relationship
  • axe = power to overcome difficulties

Fortune-telling and Statistics

Social Media and Mental Health

  • Sampling 500 participants from a commercial online panel

  • Measuring social media use and mental health with (questionable) self-report measures

  • Simple decision path:

    • Type of data? Continuous!
    • Do you have a true independent variable? Yes!
  • My choice: A (somewhat arbitrary) statistical model such as linear regression

  • My conclusion: Social media use is negatively related to mental health

Social Media and Mental Health

  • Sampling 500 participants from a commercial online panel

  • Measuring social media use and mental health with (questionable) self-report measures

  • Simple decision path:

    • Type of data? Continuous!
    • Do you have a true independent variable? Yes!
  • My second guess: A (somewhat arbitrary) statistical model such as curvilinear regression

  • My conclusion: Low doses of social media are beneficial, but high doses are detrimental!

Statistical stargazing

  • Statistical procedures only acquire meaning from scientific models

  • You cannot offload a subjective responsibility to an objective procedure

  • Often easier to defend an objective statistical procedure than a subjective choice that gives “meaning” to the procedure




Intro taken and adapted from McElreath’s (2023) “Statistical Rethinking” Lecture on Horoscopes

We should exercise caution in using seemingly complex statistical procedures, specifically those that claim to analyze a “multiverse”!

So who am I?

  • Assistant Professor at Vrije Universiteit Amsterdam

  • Studied Communication Science, Philosophy, and Economics

  • Research Areas

    • Online Privacy
    • Social Influence and Contagion
    • Media Literacy
  • Methodological Interest

    • Test Theory and Scale Development
    • Bayesian Estimation
    • Flexibility in Data Analysis
    • Computational Methods and Machine Learning

Content

Tuesday, 19th November

Time Topic
13:00 - 14:00 Statistical Fortune-Telling
14:00 - 15:00 A Garden of Forking Paths
15:30 - 16:30 R: Exercise I: Basic Mechanisms of Multiverse Analysis
16:30 - 17:00 Into the Multiverse
17:00 - 18:00 Q&A - Thinking through your own projects

Wednesday, 20th November

Time Topic
09:00 - 09:45 Arbitrary Decisions
09:45 - 10:45 R: Exercise II: Multiverse Analysis with `specr`
11:15 - 12:00 A Mass of Poorly Justified Alternatives
12:00 - 13:00 R: Exercise III: Advanced Specifications

About this workshop

Slides and Material

  • You can find everything on this website: https://masurp.github.io/workshop_specr/

  • The page will stay online beyond the workshop and will potentially be updated in the future

Formalia

  • A mix of theoretical concepts and ideas and practical sessions in R

  • Do ask questions at any time!

  • Let’s make this more of a discussion than a “me-telling-you-about-stuff” situation

A Garden of Forking Paths





“I thought of a labyrinth of labyrinths, of one sinuous spreading labyrinth that would encompass the past and the future… I felt myself to be, for an unknown period of time, an abstract perceiver of the world.”

Evolutionary Garden of Forking Paths

Image by Eisenberg, 2018

Statistical Garden of Forking Paths


Many Analysts, Different Conclusions

  • Creativity in science often linked to hypothesis generation and research design development.

  • Data analysis is sometimes perceived as a mechanical, unimaginative process.

  • Analytic strategies are influenced by theory, assumptions, and subjective choice points.

  • Multiple reasonable (and unreasonable) data evaluation approaches exist.

Silberzahn et al. 2018

Many Analysts, Different Conclusions

  • Researchers may default to familiar methods rather than rationale-based strategies.

  • Peer reviewers rarely engage directly with the data, often accepting analytic strategies at face value.

  • Reanalyses and critiques of analytic strategies are rare, partly due to limited data sharing.

  • Scientific results may hinge on subjective decisions, leading to potential uncertainty.

Silberzahn et al. 2018

Results

Silberzahn et al.: Conclusion

  • Research outcomes are influenced by subjective but justifiable analytic decisions, adding uncertainty beyond statistical power or questionable practices.

  • Subjectivity in research is inevitable but does not disconnect findings from reality; it highlights the role of decision-making.

  • Transparency in data, methods, and processes allows the scientific community to scrutinize, question, and test research decisions.

Undisclosed flexibility

“Despite the nominal endorsement of a maximum false-positive rate of 5% (…) current standards for disclosing details of data collection and analyses make false positives vastly more likely. In fact, it is unacceptably easy to publish ‘statistically significant’ evidence consistent with any hypothesis.”

Method

  • Computer simulations were used to estimate how researcher degrees of freedom influence false-positive rates.

  • Four common degrees of freedom were analyzed: flexibility in dependent variables, sample size, covariates, and reporting subsets of conditions.

  • Simulations involved generating random samples and testing multiple analyses to observe how often p-values fell below significance levels.

  • Across 15,000 simulations, the probability of false positives was systematically quantified under various scenarios.
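
To illustrate the logic, here is a toy sketch (my own minimal simulation, not Simmons et al.’s actual code) of a single degree of freedom: measuring two dependent variables and reporting whichever yields the smaller p-value roughly doubles the false-positive rate.

# Toy sketch: two DVs under a true null, report the better p-value
set.seed(1)
false_pos <- replicate(15000, {
  g  <- rep(0:1, each = 10)           # two conditions, no true effect
  y1 <- rnorm(20)                     # first dependent variable
  y2 <- rnorm(20)                     # second dependent variable
  min(t.test(y1 ~ g)$p.value,
      t.test(y2 ~ g)$p.value) < .05   # significant on either DV?
})
mean(false_pos)                       # ~.10 instead of the nominal .05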

Simulation-based findings

As high as these estimates are, they may actually be conservative. We did not consider many other degrees of freedom that researchers commonly use… (p. 1361)

What can we do?


Solution 1: Choosing one analytical pathway beforehand

  • Classic falsification paradigm (Popper)
  • Strictly speaking, requires preregistration
  • Requires strong justification (maybe not always possible?)



Solution 2: Examine all possible analytical pathways

  • Specification curve (SCA) or multiverse analysis (MA)
  • Investigating all (theoretically plausible) models
  • Requires justification for inclusions of choices/models






A Prominent Paper

Type of analytical decisions

Results


  • Association between digital technology use and adolescent well-being is negative but small

  • Technology use explains at most 0.4% of the variation in well-being.

  • Findings suggest that these effects are too small to warrant policy change


  • How important was the method - specification curve analysis - for this finding?

R: Exercise I

Into the Multiverse

Two Seminal Papers

Genealogy of the Approach

  • Long tradition of considering robustness to alternative specifications in the social sciences

  • Economics and Political science: reporting regression results in tables in which each column reports a different specification

  • Robustness tests: examining how certain “core” regression coefficient estimates behave when the regression specification is modified by adding or removing regressors (usually ~1-10 alternative models reported in the appendix)

  • Multiverse (MA) or specification curve analysis (SCA) can be regarded as extension and formalization of these approaches


Simonsohn et al., 2020

The problem

  • Multiverse analysis acknowledges that data are actively shaped during the transformation from raw data to analyzable form

  • Researchers often apply multiple processing steps during data preparation

  • These steps involve numerous “researcher’s degrees of freedom”, offering various, alternative options at each stage

  • Raw data don’t yield a single analyzable dataset; instead, they generate multiple versions based on the choices made: a data multiverse!

  • Each dataset in this multiverse can produce different statistical outcomes - a statistical multiverse!


Steegen et al., 2016
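
To make the data multiverse concrete, here is a minimal sketch with two hypothetical processing choices (the labels are illustrative, not from Steegen et al.): four analyzable datasets already emerge from one raw dataset.

# Two hypothetical processing choices span four datasets
library(tidyr)

processing_choices <- expand_grid(
  outlier_rule = c("none", "exclude +/- 3 SD"),          # hypothetical choice 1
  wellbeing_dv = c("depression", "life_satisfaction_r")  # hypothetical choice 2
)
nrow(processing_choices)  # 4 combinations = 4 analyzable datasets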

A Universe of Specifications

Perspective of one researcher


A Universe of Specifications

Two researchers with similar views


A Universe of Specifications

Two researchers with dissimilar views

A formal representation of the problem

Let’s consider a relationship between variables \(x\) and \(y\), in a context in which other variables, \(Z\), may influence the relationship:



\(y = F(x,Z) + \epsilon\)



We are faced with the following practical challenges:

  • \(x\) and \(y\) are often imprecisely defined latent variables
  • the set of moderators and confounders in \(Z\) is often not fully known ex ante
  • \(Z\) also contains imprecisely defined latent variables
  • the functional form \(F()\) is not known

A formal representation of the problem

To study \(y = F(x,Z)\), we must operationalize the underlying constructs. We usually approximate this relationship with a specification from a set of operationalizations:

\(y_{k_y} = F_{k_F}(x_{k_x}, Z_{k_Z})\)



where \(k_y\), \(k_F\), \(k_x\) and \(k_Z\) are indices for single operationalizations of the respective constructs.

For example, \(y_1\) may operationalize ‘well-being’ as life satisfaction, while \(y_2\) operationalizes it as reversed depression.


Simonsohn et al., 2020

The number of reasonable specifications

For each construct there are multiple statistically valid, theoretically justified and non-redundant operationalizations. Their combination leads to what we refer to as the set of reasonable specifications.

Designating the total number of valid operationalizations for each construct with \(n_y\), \(n_x\), \(n_Z\) and \(n_F\), the total number of reasonable specifications available to study \(y = F(x,Z)\) is:

\(N = n_x \times n_y \times n_Z \times n_F\)


Simonsohn et al., 2020
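
Because the count is a simple product, it grows quickly. A toy computation with hypothetical counts of operationalizations:

# Hypothetical counts: 3 predictors, 3 outcomes, 4 covariate sets, 1 functional form
n_x <- 3; n_y <- 3; n_Z <- 4; n_F <- 1
N <- n_x * n_y * n_Z * n_F
N  # 36 reasonable specifications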

Sampling specifications

  • Let \(\Pi\) be this set of \(N\) reasonable specifications, and \(\pi\) be the subset of specifications reported in a paper

  • \(\pi\) can be regarded as a sample of \(\Pi\)

  • Any given \(y_{k_y} = F_{k_F}(x_{k_x}, Z_{k_Z})\) is considered a valid proxy for \(y = F(x,Z)\) and therefore so is the full set of all such proxies \(\Pi\).

  • A sufficiently large, random and independently drawn sample of \(\Pi\) should lead to a reasonable estimate of the model of interest

  • The problem is that \(\pi\), the sample of specifications reported in a paper, usually has none of these three properties (i.e., before specification curve analysis)


Simonsohn et al., 2020

How SCA may solve the problem

  • It allows researchers to systematically generate a much larger \(\pi\), in which hundreds or even thousands of specifications are reported

  • It makes the existence of noise transparent and allows researchers to determine its nature, i.e., which operationalization decisions are consequential and which are not

  • It generates a \(\pi\) with fewer arbitrary inclusion decisions, and thus more closely approximates a random sample of \(\Pi\)

  • Yet it may also inflate the number of specifications and thus hide true effects of interest (more on that later!)


Simonsohn et al., 2020

General Procedure

1. Define the set of reasonable specifications to estimate

  • Differentiating types of decisions (type-E-, type-N-, type-U-decisions)
  • Reducing redundancy

2. Estimate all specifications

  • Run all models (i.e., with all specifications)
  • Visualize specification curve and the influence of different choices

3. Conduct joint statistical tests using an inferential specification curve

  • If specifications are truly non-redundant and valid, we can technically run inference tests
  • A type of bootstrapping approach where the true effect is fixed to zero


Simonsohn et al., 2020

Specification curve analysis with specr

  • R package (currently version 1.0.0 on CRAN, 1.0.1 on GitHub)

  • Versatile framework for conducting multiverse/specification curve analyses in R

  • Based around the “mapping” approach implemented in purrr

  • This also allows parallelization, which, depending on how many specifications are estimated, can be quite important to save computation time (see the sketch after this list)

  • Ties nicely into the tidyverse

  • Active and continuous development
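
As a hedged sketch of the parallelization mentioned above: specr’s mapping builds on purrr/furrr, so estimation can run under a future plan. The exact specr() arguments differ across versions, so consult the package’s parallelization vignette; only the generic plan setup is shown here.

# Set up a future plan for parallel estimation (generic future API)
library(future)
plan(multisession, workers = 4)  # four background R sessions
# ... run setup() and specr() as usual here ...
plan(sequential)                 # back to sequential processing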

The example data

For this example, we are going to use a synthetic data set that includes variables allowing us to assess the relationship between media use and well-being:

# Packages
library(specr)
library(tidyverse)  # for read_csv(), the pipe, and select()

# Loading data
d <- read_csv("https://raw.githubusercontent.com/ccs-amsterdam/r-course-material/master/data/pairfam_synth_data.csv")

d %>%
  select(age, gender, depression, 
         self_esteem, sns_use) 
# A tibble: 3,132 × 5
     age gender depression self_esteem sns_use
   <dbl> <chr>       <dbl>       <dbl>   <dbl>
 1    16 female       3.77        4          3
 2    15 female       3.42        4          0
 3    15 female       1           1          3
 4    16 male         4.46        4          0
 5    17 male         2.38        1          5
 6    15 female       1           1          4
 7    16 female       2.04        1         NA
 8    15 male         2.38        1.75       4
 9    15 female       2.04        1.75       3
10    17 male         7.58        6.25       5
# ℹ 3,122 more rows

Preparations

# select variables
std_vars <- c("age", "tv_use", "internet_use",
              "sns_use", "self_esteem", 
              "depression", "life_satisfaction_r",
              "friend_satisfaction")

# standardizing relevant variables
d <- d %>% 
  mutate(across(all_of(std_vars), 
                function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))) 

# check
d %>%
  select(all_of(std_vars)) %>%
  psych::describe() %>%
  select(n, mean, sd)
                       n mean sd
age                 3132    0  1
tv_use              3084    0  1
internet_use        3086    0  1
sns_use             2936    0  1
self_esteem         3084    0  1
depression          2899    0  1
life_satisfaction_r 3126    0  1
friend_satisfaction 3129    0  1
  • Technically, specification curve analysis does not necessarily require standardization of the variables

  • Yet, as the goal is often comparison of outcome estimates, it mostly makes sense to standardize the variables before the computation

Setting up reasonable specifications

The set of reasonable specifications can be generated by:


1. Identifying all of the data analytic decisions necessary to map the scientific hypothesis or construct of interest onto a statistical hypothesis

2. Listing all the reasonable alternative ways a researcher may make those decisions

3. Generating the exhaustive combination of decisions, eliminating combinations that are invalid or redundant.


If the resulting set is too large, one can randomly draw from it in the next step (estimation) to create the specification curve (see the sketch below)
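
A minimal sketch of step 3 and the random draw just mentioned, using the choice sets from the example data above (the covariate labels are illustrative):

# Enumerate all combinations of choices, then sample if there are too many
library(tidyverse)

all_specs <- expand_grid(
  x        = c("sns_use", "tv_use", "internet_use"),
  y        = c("depression", "life_satisfaction_r", "self_esteem"),
  controls = c("none", "friend_satisfaction", "age")
)
nrow(all_specs)                  # 27 combinations
slice_sample(all_specs, n = 10)  # random draw if the full set is too large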

Setting up specifications

specs <- setup(data = d,
               x = "sns_use",
               y = c("depression", "life_satisfaction_r"),
               model = "lm")
summary(specs)
Setup for the Specification Curve Analysis
-------------------------------------------
Class:                      specr.setup -- version: 1.0.1 
Number of specifications:   2 

Specifications:

  Independent variable:     sns_use 
  Dependent variable:       depression, life_satisfaction_r 
  Models:                   lm 
  Covariates:               no covariates 
  Subsets analyses:         all 

Function used to extract parameters:

  function (x) 
broom::tidy(x, conf.int = TRUE)
<environment: 0x11a1f5660>


Head of specifications table (first 6 rows):
# A tibble: 2 × 6
  x       y                   model controls      subsets formula                          
  <chr>   <chr>               <chr> <chr>         <chr>   <glue>                           
1 sns_use depression          lm    no covariates all     depression ~ sns_use + 1         
2 sns_use life_satisfaction_r lm    no covariates all     life_satisfaction_r ~ sns_use + 1

Setting up specifications

specs <- setup(data = d,
               x = "sns_use",
               y = c("depression", "life_satisfaction_r"),
               controls = c("friend_satisfaction"),
               model = "lm")
summary(specs)
Setup for the Specification Curve Analysis
-------------------------------------------
Class:                      specr.setup -- version: 1.0.1 
Number of specifications:   4 

Specifications:

  Independent variable:     sns_use 
  Dependent variable:       depression, life_satisfaction_r 
  Models:                   lm 
  Covariates:               no covariates, friend_satisfaction 
  Subsets analyses:         all 

Function used to extract parameters:

  function (x) 
broom::tidy(x, conf.int = TRUE)
<environment: 0x12a664900>


Head of specifications table (first 6 rows):
# A tibble: 4 × 6
  x       y                   model controls            subsets formula                                            
  <chr>   <chr>               <chr> <chr>               <chr>   <glue>                                             
1 sns_use depression          lm    no covariates       all     depression ~ sns_use + 1                           
2 sns_use depression          lm    friend_satisfaction all     depression ~ sns_use + friend_satisfaction         
3 sns_use life_satisfaction_r lm    no covariates       all     life_satisfaction_r ~ sns_use + 1                  
4 sns_use life_satisfaction_r lm    friend_satisfaction all     life_satisfaction_r ~ sns_use + friend_satisfaction

Setting up specifications

specs <- setup(data = d,
               x = c("sns_use", "tv_use", "internet_use"),
               y = c("depression", "life_satisfaction_r", "self_esteem"),
               controls = c("friend_satisfaction", "age"),
               model = "lm",
               subsets = list(gender = c("male", "female")))
summary(specs, rows = 4)
Setup for the Specification Curve Analysis
-------------------------------------------
Class:                      specr.setup -- version: 1.0.1 
Number of specifications:   108 

Specifications:

  Independent variable:     sns_use, tv_use, internet_use 
  Dependent variable:       depression, life_satisfaction_r, self_esteem 
  Models:                   lm 
  Covariates:               no covariates, friend_satisfaction, age, friend_satisfaction + age 
  Subsets analyses:         male, female, all 

Function used to extract parameters:

  function (x) 
broom::tidy(x, conf.int = TRUE)
<environment: 0x1386db558>


Head of specifications table (first 4 rows):
# A tibble: 4 × 7
  x       y          model controls            subsets gender formula                                   
  <chr>   <chr>      <chr> <chr>               <chr>   <fct>  <glue>                                    
1 sns_use depression lm    no covariates       male    male   depression ~ sns_use + 1                  
2 sns_use depression lm    no covariates       female  female depression ~ sns_use + 1                  
3 sns_use depression lm    no covariates       all     <NA>   depression ~ sns_use + 1                  
4 sns_use depression lm    friend_satisfaction male    male   depression ~ sns_use + friend_satisfaction

Assessing the garden of forking paths

plot(specs)

Running the analyses

results <- specr(specs)
summary(results)
Results of the specification curve analysis
-------------------
Technical details:

  Class:                          specr.object -- version: 1.0.1 
  Cores used:                     1 
  Duration of fitting process:    0.384 sec elapsed 
  Number of specifications:       108 

Descriptive summary of the specification curve:

 median  mad   min  max  q25  q75
   0.07 0.04 -0.06 0.15 0.02 0.08

Descriptive summary of sample sizes: 

 median  min  max
   1516 1311 3080

Head of the specification results (first 6 rows): 

# A tibble: 6 × 25
  x       y          model controls       subsets gender formula estimate std.error statistic p.value conf.low conf.high
  <chr>   <chr>      <chr> <chr>          <chr>   <fct>  <glue>     <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 sns_use depression lm    no covariates  male    male   depres…     0.08      0.03      2.71    0.01     0.02      0.13
2 sns_use depression lm    no covariates  female  female depres…     0.04      0.03      1.64    0.1     -0.01      0.09
3 sns_use depression lm    no covariates  all     <NA>   depres…     0.06      0.02      3.08    0        0.02      0.1 
4 sns_use depression lm    friend_satisf… male    male   depres…     0.09      0.03      3.4     0        0.04      0.14
5 sns_use depression lm    friend_satisf… female  female depres…     0.07      0.03      2.73    0.01     0.02      0.12
6 sns_use depression lm    friend_satisf… all     <NA>   depres…     0.08      0.02      4.34    0        0.04      0.12
# ℹ 12 more variables: fit_r.squared <dbl>, fit_adj.r.squared <dbl>, fit_sigma <dbl>, fit_statistic <dbl>,
#   fit_p.value <dbl>, fit_df <dbl>, fit_logLik <dbl>, fit_AIC <dbl>, fit_BIC <dbl>, fit_deviance <dbl>,
#   fit_df.residual <dbl>, fit_nobs <dbl>

Typical Visualization

plot(results)

Basic descriptive analysis

Basic descriptive summary of the entire specification curve:

summary(results, 
        type = "curve")
# A tibble: 1 × 7
  median    mad     min   max    q25    q75   obs
   <dbl>  <dbl>   <dbl> <dbl>  <dbl>  <dbl> <dbl>
1 0.0675 0.0393 -0.0629 0.150 0.0209 0.0834  1516

Basic descriptive analysis

Descriptive summary per specific choices:

summary(results, 
        type = "curve", 
        group = c("x"))  # group analysis by choices
# A tibble: 3 × 8
  x            median    mad     min    max      q25    q75   obs
  <chr>         <dbl>  <dbl>   <dbl>  <dbl>    <dbl>  <dbl> <dbl>
1 internet_use 0.0796 0.0144  0.0468 0.150   0.0720  0.124   1551
2 sns_use      0.0444 0.0466 -0.0136 0.0966  0.0119  0.0752  1495
3 tv_use       0.0284 0.0682 -0.0629 0.144  -0.00300 0.0817  1560

Basic descriptive analysis

Descriptive summary with customized statistics:

summary(results, 
        type = "curve", 
        group = "subsets", 
        stats = list(mean = mean, 
                     median = median))
# A tibble: 3 × 4
  subsets   mean median   obs
  <chr>    <dbl>  <dbl> <dbl>
1 all     0.0495 0.0646 2930.
2 female  0.0274 0.0389 1516 
3 male    0.0965 0.0985 1412.

Decomposing the variance

  • We propose to further “decompose” the variance in the specification curve

  • We can treat the analysis as a factorial design in which the resulting estimates are nested in the choices, and compute intra-class correlation coefficients (ICC) (a manual sketch follows below)

plot(results, type = "variance")
       grp vcov  icc percent
1 controls    0 0.00    0.00
2  subsets    0 0.38   37.59
3        y    0 0.00    0.00
4        x    0 0.29   29.15
5 Residual    0 0.33   33.26
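
Under the hood, this decomposition amounts to a multilevel model with a random intercept per choice (cf. the formula argument on the next slide). A manual sketch, assuming lme4 is installed:

# Manual variance decomposition of the specification curve
library(lme4)

m  <- lmer(estimate ~ 1 + (1|x) + (1|y) + (1|controls) + (1|subsets),
           data = results$data)
vc <- as.data.frame(VarCorr(m))   # variance components per choice
vc$icc <- vc$vcov / sum(vc$vcov)  # intra-class correlations
vc[, c("grp", "vcov", "icc")]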

Decomposing the variance

  • We propose to “decompose” the variance in the specification curve

  • We can treat the analysis as a factorial design in which the resulting estimates are nested in the choices, and compute intra-class correlation coefficients (ICC)

plot(results, type = "variance", 
     formula = "estimate ~ 1 + (1|x) + (1|y) + 
               (1|controls) + (1|subsets) + (1|x:y)")
       grp vcov  icc percent
1      x:y    0 0.25   24.60
2 controls    0 0.00    0.00
3  subsets    0 0.38   38.00
4        y    0 0.00    0.00
5        x    0 0.21   21.39
6 Residual    0 0.16   16.01

Alternative visualization

plot(results, type = "boxplot")

Customized visualizations

  • Yet bear in mind that the standard functions of specr provide only one perspective on the data

  • As specr() also returns a data frame with the results of all specifications, we can wrangle the data ourselves and create whatever plot we want

head(results$data)
# A tibble: 6 × 27
  x       y          model controls  subsets gender formula model_function term 
  <chr>   <chr>      <chr> <chr>     <chr>   <fct>  <glue>  <list>         <chr>
1 sns_use depression lm    no covar… male    male   depres… <fn>           sns_…
2 sns_use depression lm    no covar… female  female depres… <fn>           sns_…
3 sns_use depression lm    no covar… all     <NA>   depres… <fn>           sns_…
4 sns_use depression lm    friend_s… male    male   depres… <fn>           sns_…
5 sns_use depression lm    friend_s… female  female depres… <fn>           sns_…
6 sns_use depression lm    friend_s… all     <NA>   depres… <fn>           sns_…
# ℹ 18 more variables: estimate <dbl>, std.error <dbl>, statistic <dbl>,
#   p.value <dbl>, conf.low <dbl>, conf.high <dbl>, fit_r.squared <dbl>,
#   fit_adj.r.squared <dbl>, fit_sigma <dbl>, fit_statistic <dbl>,
#   fit_p.value <dbl>, fit_df <dbl>, fit_logLik <dbl>, fit_AIC <dbl>,
#   fit_BIC <dbl>, fit_deviance <dbl>, fit_df.residual <int>, fit_nobs <int>

Customized visualizations

results$data |> 
   ggplot(aes(x = x, y = estimate, ymin = conf.low, ymax = conf.high, color = controls)) +
   geom_hline(yintercept = 0, linetype = "dashed", color = "grey")+
   geom_pointrange(position = position_dodge(width = .9)) +
   facet_grid(y ~ subsets) +
   coord_flip() +
   theme_bw()

Inference tests

  • Only partially implemented in the specr development version

  • Simonsohn et al. (2020) propose three test statistics:

  1. Extremity of median: Testing whether the median effect across all specifications is more extreme than would be expected if all specifications had a true effect of zero

  2. Share of significant results: Testing whether the share of specifications with statistically significant effects in the expected direction is more extreme than would be expected if all specifications had an effect of zero

  3. Aggregated p-value: Testing whether the aggregated z value associated with each p-value (for example, z = 1.96 for p = 0.05) is more extreme than would be expected if all specifications had a true effect of zero
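
The first two statistics can already be computed descriptively from the results object (a small sketch using the columns of results$data shown earlier); the inferential part, i.e., comparing them against resampled null curves, follows on the next slides.

# Descriptive versions of test statistics 1 and 2
res <- results$data

median(res$estimate, na.rm = TRUE)                        # extremity of the median
mean(res$p.value < .05 & res$estimate > 0, na.rm = TRUE)  # share significant & positive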

Computing the test statistics

  • Generating distributions for these test statistics under the null hypothesis isn’t feasible analytically, but we can create these distributions through resampling under the null hypothesis

  • This process entails adjusting the observed data to ensure a known true null hypothesis and then randomly sampling from the modified data

  • The test statistic is then computed on each of these samples


  1. Generate \(K\) different dependent variables under the null: \(y_k^* = y_k - b_k x_k\)
  2. Draw at random, and with replacement, \(N\) rows from this matrix, using the same drawn rows of data for all \(K\) specifications
  3. Estimate the \(K\) specifications on the drawn data
  4. Repeat steps 2 and 3 a large number of times (e.g., 500 or 1,000)

Resampling under-the-null

set.seed(42)

# Custom extract function to get full model
tidy_full <- function(x) {
  fit <- broom::tidy(x, conf.int = TRUE)
  fit$res <- list(x)  # Store model object
  return(fit)
}

# Smaller model (without subsets)
specs <- setup(data = d,
               x = c("sns_use", "tv_use", "internet_use"),
               y = c("depression", "life_satisfaction_r", 
                     "self_esteem"),
               controls = c("friend_satisfaction", "age"),
               model = "lm",
               fun1 = tidy_full)

# Run specification curve analysis
results <- specr(specs)

# Resampling under the null (use 500-1,000 samples in practice!)
boot_results <- boot_null(results, specs, 
                          n_samples = 10) 

Joint statistical tests

  • For each bootstrapped sample we now have K estimates, one for each specification

  • Now we compute what percentage of the resampled specification curves (for example, of the 10 resamples drawn above) exhibits an overall test statistic that is at least as extreme as that observed in the real data

summary(boot_results)
# A tibble: 3 × 3
  type           estimate p.value
  <chr>          <chr>    <chr>  
1 median         0.06     < .001 
2 share positive 24 / 36  < .001 
3 share negative 0 / 36   < .001
  • Test statistic 1: As we can see, a median of 0.06 across all specifications is more extreme than would be expected if all specifications had a true effect of zero: p < .001

  • Test statistic 2: The share of specifications that obtain a statistically significant effect in the predicted direction is more extreme than would be expected if all specifications had an effect of zero: 24/36 (66.7%, p < .001).

Observed vs. under-the-null curve

  • We can also plot the actual specification curve on top of the distribution of resamples under-the-null (e.g., the 2.5%, 50% and 97.5% quantiles) to visually inspect the extremity of the observed results
plot(boot_results)

Exercise: Thinking through your own projects

Format

  • Do you have a current project in which you took certain analytical decisions?

  • Try to map out the decisional path that you took to arrive at the “results”

  • Consider decisional crossroads on your path: are there alternative analytical choices that you could have taken?

  • Try to identify a “universe” of possible, yet still justifiable, choices that would represent a basis for a potential specification curve analysis (see the hypothetical example below)

  • Discuss your choice table with your neighbour
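
A hypothetical choice table for this exercise might look like this (decisions and alternatives are purely illustrative):

# Hypothetical choice table for mapping your own garden of forking paths
library(tibble)

choice_table <- tribble(
  ~decision,           ~alternatives,
  "Predictor (x)",     "sns_use; internet_use; tv_use",
  "Outcome (y)",       "depression; life_satisfaction_r",
  "Outlier handling",  "none; exclude +/- 3 SD",
  "Control variables", "none; age; friend_satisfaction"
)
choice_table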

Arbitrary Decisions

Current practices as cause for concern?

  • The central notion of these methods is that the alternatives included in the multiverse are “arbitrary” or equally “reasonable”

  • Little guidance or consensus on how to evaluate arbitrariness

  • No consideration of the potential pitfalls of multiverse-style methods

Different types of choices/decisions

  • Type E (Principled Equivalence): Evidence and conceptual considerations may indicate that alternative analyses are effectively equivalent. Examples: alternative measures with comparable validity and reliability; arbitrary thresholds for outliers.

  • Type N (Principled Non-Equivalence): Available evidence or considerations support the conclusion that alternative specifications are not equivalent, and some are objectively more justified. Examples: theoretically plausible subsets; different types of measures; inclusion of theoretically plausible control variables.

  • Type U (Uncertainty): There are no compelling reasons to expect equivalence or non-equivalence. Examples: any of the above, when available evidence does not suggest that one option is better than, or equivalent to, another.

What choices are “ok” to include?

Alternative 1: Obtaining robust estimates of the effect of interest

  • Understanding how arbitrary choices affect the results
  • Confirmatory in nature
  • Only type E decisions can be included
  • Inference tests can be done (step 3)


Alternative 2: Exposing the impact of hidden degrees of freedom

  • Understanding how different conceptual and analytical choices affect the results
  • Exploratory in nature
  • All types of decisions, but particularly type U decisions, may be included
  • No inference tests, but descriptive analyses of differences

Alternative 1 vs. 2 (Masur & Ranzini, 2023)

R: Exercise II

A Mass of Poorly Justified Alternatives

Dangers and pitfalls

  • The main danger of multiverse-style methods lies in their potential for combinatorial explosion

  • Just a few choices considered arbitrary can lead to a vast multiverse, overshadowing reasonable effect estimates with unjustified alternatives.

    • A single decision with two options doubles the number of specifications
    • Five binary decisions expand the multiverse by 32 times
    • If one option is consistently justifiable in each case, justified choices represent only about 3% of the entire multiverse (see the arithmetic below)


Del Giudice & Gangestad, 2021
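
The arithmetic behind these numbers is a simple power of two:

# Combinatorial explosion: k binary decisions multiply the space by 2^k
2^5      # five binary decisions: 32 times as many specifications
1 / 2^5  # one justified path per decision: ~3% of the multiverse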

Combinatorial Explosion

specs1 <- setup(data = example_data,
               x = c("x1", "x2", "x3", "x4"),
               y = c("y1", "y2", "y3", "y4"),
               model = "lm")
results1 <- specr(specs1)
plot(results1, type = "curve")

Combinatorial Explosion

specs2 <- setup(data = example_data,
               x = c("x1", "x2", "x3", "x4"),
               y = c("y1", "y2", "y3", "y4"),
               model = "lm",
               control = "c1")
results2 <- specr(specs2)
plot(results2, type = "curve")

Combinatorial Explosion

specs3 <- setup(data = example_data,
               x = c("x1", "x2", "x3", "x4"),
               y = c("y1", "y2", "y3", "y4"),
               model = c("lm", "glm"),
               control = c("c1", "c2", "c3", "c4"),
               simplify = FALSE)
results3 <- specr(specs3)
plot(results3, type = "curve")

Combinatorial Explosion

specs4 <- setup(data = example_data,
               x = c("x1", "x2", "x3", "x4"),
               y = c("y1", "y2", "y3", "y4"),
               model = c("lm", "glm"),
               control = c("c1", "c2", "c3", "c4"),
               subset = list(group1 = c("young", "middle", "old")),
               simplify = FALSE)
results4 <- specr(specs4)
plot(results4, type = "curve")

Comparison of the curves

Comparison of the variance decomposition

       grp vcov  icc percent
1        x 0.00 0.00     0.0
2        y 0.04 0.75    75.1
3 controls 0.00 0.00     0.0
4 Residual 0.01 0.25    24.9

       grp vcov  icc percent
1 controls 0.00 0.00    0.02
2  subsets 0.00 0.01    0.50
3        y 0.02 0.61   60.64
4        x 0.00 0.01    0.75
5    model 0.00 0.00    0.00
6 Residual 0.01 0.38   38.09

What exactly is the problem?

  • The explosion of unjustified specifications, by expanding the analysis space, paradoxically amplifies the appearance of comprehensiveness and credibility within the multiverse

  • Simultaneously, it significantly diminishes the informative portion of the multiverse

  • The vastness of the specification space can complicate the examination of results for potentially valuable insights.

R: Exercise III

Conclusion

So what?

  • It makes little sense to include in the multiverse a specification that, a priori, one would have dismissed as inferior to other specifications

  • Researchers conducting a multiverse-style analysis should provide a clear rationale for treating alternatives as equivalent (preregistration!)

  • However, type U decisions will likely not be uncommon

  • Strong call for systematic exploratory multiverse analysis!

“Specification curve analysis will not end debates about what specifications should be run. Specification curve analysis will instead facilitate those debates” (Simonsohn et al., 2020, p. 1209).

Want to learn more?

  • Visit the website of specr

  • Several extra tutorials (e.g., parallelization, incorporating SEM, multilevel, Bayesian estimation)

  • Continuous development (e.g. integration of inferential tests, speed improvements)

THE END

Thank you!

References

  • Del Giudice, M., & Gangestad, S. W. (2021). A traveler’s guide to the multiverse: Promises, pitfalls, and a framework for the evaluation of analytic decisions. Advances in Methods and Practices in Psychological Science, 4(1). https://journals.sagepub.com/doi/abs/10.1177/2515245920954925

  • Eisenberg, L. (2018). The tree of life. Retrieved from: https://www.evogeneao.com/en

  • Masur, P. K. & Ranzini, G. (2023). Privacy Calculus, Privacy Paradox, and Context Collapse: A Replication of Three Key Studies in Communication Privacy Research. Manuscript in preparation.

  • Masur, P. K. & Scharkow, M. (2020). specr: Conducting and Visualizing Specification Curve Analyses (R-package, version 1.0.0). https://CRAN.R-project.org/package=specr

  • McElreath, R. (2023). Statistical Rethinking 2023 - Horoscopes. Lecture on Youtube: https://www.youtube.com/watch?v=qwF-st2NGTU&t=224s

  • Orben, A., & Przybylski, A. K. (2019). The association between adolescent well-being and digital technology use. Nature Human Behaviour, 3(2), 173-182.

  • Silberzahn, R., Uhlmann, E. L., Martin, D. P., Anselmi, P., Aust, F., Awtrey, E., … & Nosek, B. A. (2018). Many analysts, one data set: Making transparent how variations in analytic choices affect results. Advances in Methods and Practices in Psychological Science, 1(3), 337-356.

  • Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366.

  • Simonsohn, U., Simmons, J.P. & Nelson, L.D. (2020). Specification curve analysis. Nature Human Behaviour, 4, 1208–1214. https://doi.org/10.1038/s41562-020-0912-z

  • Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing Transparency Through a Multiverse Analysis. Perspectives on Psychological Science, 11(5), 702-712. https://doi.org/10.1177/1745691616658637