Multiverse Analysis in R

Dr. Philipp K. Masur

Horoscopes

Fortune-telling framework:

  • Practice of predicting information about a person’s life

  • From vague facts, provide vague advice

  • Must be applicable to any case

  • Exaggerated importance of advice

Tasseomancy

A divination or fortune-telling method that interprets patterns in tea leaves, coffee grounds, or wine sediments:

  • snake = enmity or falsehood
  • spade = good fortune through industry
  • mountain = journey of hindrance
  • house = change, success
  • anchor = prosperity in business, stable romantic relationship
  • axe = power to overcome difficulties

Fortune-telling and Statistics

Social Media and Mental Health

  • Sampling 500 participants from a commercial online panel

  • Measuring social media use and mental health with (questionable) self-report measures

  • Simple decision path:

    • Type of data? Continuous!
    • Do you have a true independent variable? Yes!
  • My choice: A (somewhat arbitrary) statistical model such as linear regression

  • My conclusion: Social media use is negatively related to mental health

Social Media and Mental Health

  • Sampling 500 participants from a commercial online panel

  • Measuring social media use and mental health with (questionable) self-report measures

  • Simple decision path:

    • Type of data? Continuous!
    • Do you have a true independent variable? Yes!
  • My second guess: A (somewhat arbitrary) statistical model such as curvilinear regression

  • My conclusion: Low doses of social media are beneficial, but high doses are detrimental!

Statistical stargazing

  • Statistical procedures only acquire meaning from scientific models

  • You cannot offload a subjective responsibility to an objective procedure

  • Often easier to defend an objective statistical procedure than a subjective choice that gives “meaning” to the procedure




Intro taken and adapted from McElreath’s (2023) “Statistical Rethinking” Lecture on Horoscopes

We should exercise caution in using seemingly complex statistical procedures, specifically those that claim to analyze a “multiverse”!

So who am I?

  • Assistant Professor at Vrije Universiteit Amsterdam

  • Studied Communication Science, Philosophy, and Economics

  • Research Areas

    • Online Privacy
    • Social Influence and Contagion
    • Media Literacy
  • Methodological Interest

    • Test Theory and Scale Development
    • Bayesian Estimation
    • Flexibility in Data Analysis
    • Computational Methods and Machine Learning

Content

Tuesday, 19th November

Time Topic
13:00 - 14:00 Statistical Fortune-Telling
14:00 - 15:00 A Garden of Forking Paths
15:30 - 16:30 R: Exercise I: Basic Mechanisms of Multiverse Analysis
16:30 - 17:00 Into the Multiverse
17:00 - 18:00 Q&A - Thinking through your own projects

Wednesday, 20th November

Time Topic
09:00 - 09:45 Arbitrary Decisions
09:45 - 10:45 R: Exercise II: Multiverse Analysis with `specr`
11:15 - 12:00 A Mass of Poorly Justified Alternatives
12:00 - 13:00 R: Exercise III: Advanced Specifications

About this workshop

Slides and Material

  • You can find everything on this website: https://masurp.github.io/workshop_specr/

  • The page will stay online beyond the workshop and will potentially be updated in the future

Formalia

  • A mix of theoretical concepts and ideas and practical sessions in R

  • Do ask questions at any time!

  • Let’s make this more of a discussion than a “me-telling-you-about-stuff” situation

A Garden of Forking Paths





“I thought of a labyrinth of labyrinths, of one sinuous spreading labyrinth that would encompass the past and the future… I felt myself to be, for an unknown period of time, an abstract perceiver of the world.”

Evolutionary Garden of Forking Paths

Image by Eisenberg, 2018

Statistical Garden of Forking Paths


Many Analysts, Different Conclusions

  • Creativity in science often linked to hypothesis generation and research design development.

  • Data analysis is sometimes perceived as a mechanical, unimaginative process.

  • Analytic strategies are influenced by theory, assumptions, and subjective choice points.

  • Multiple reasonable (and unreasonable) data evaluation approaches exist.

Silberzahn et al. 2018

Many Analysts, Different Conclusions

  • Researchers may default to familiar methods rather than rationale-based strategies.

  • Peer reviewers rarely engage directly with the data, often accepting analytic strategies at face value.

  • Reanalyses and critiques of analytic strategies are rare, partly due to limited data sharing.

  • Scientific results may hinge on subjective decisions, leading to potential uncertainty.

Silberzahn et al. 2018

Results

Silberzahn et al.: Conclusion

  • Research outcomes are influenced by subjective but justifiable analytic decisions, adding uncertainty beyond statistical power or questionable practices.

  • Subjectivity in research is inevitable but does not disconnect findings from reality; it highlights the role of decision-making.

  • Transparency in data, methods, and processes allows the scientific community to scrutinize, question, and test research decisions.

Undisclosed flexibility

“Despite the nominal endorsement of a maximum false-positive rate of 5% (…) current standards for disclosing details of data collection and analyses make false positives vastly more likely. In fact, it is unacceptably easy to publish ‘statistically significant’ evidence consistent with any hypothesis.”

Method

  • Computer simulations were used to estimate how researcher degrees of freedom influence false-positive rates.

  • Four common degrees of freedom were analyzed: flexibility in dependent variables, sample size, covariates, and reporting subsets of conditions.

  • Simulations involved generating random samples and testing multiple analyses to observe how often p-values fell below significance levels.

  • Across 15,000 simulations, the probability of false positives was systematically quantified under various scenarios.
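
To illustrate the logic, here is a toy sketch (my own minimal simulation, not Simmons et al.’s actual code) of a single degree of freedom: measuring two dependent variables and reporting whichever yields the smaller p-value roughly doubles the false-positive rate.

# Toy sketch: two DVs under a true null, report the better p-value
set.seed(1)
false_pos <- replicate(15000, {
  g  <- rep(0:1, each = 10)           # two conditions, no true effect
  y1 <- rnorm(20)                     # first dependent variable
  y2 <- rnorm(20)                     # second dependent variable
  min(t.test(y1 ~ g)$p.value,
      t.test(y2 ~ g)$p.value) < .05   # significant on either DV?
})
mean(false_pos)                       # ~.10 instead of the nominal .05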

Simulation-based findings

As high as these estimates are, they may actually be conservative. We did not consider many other degrees of freedom that researchers commonly use… (p. 1361)

What can we do?


Solution 1: Choosing one analytical pathway beforehand

  • Classic falsification paradigm (Popper)
  • Strictly speaking, requires preregistration
  • Requires strong justification (maybe not always possible?)



Solution 2: Examine all possible analytical pathways

  • Specification curve (SCA) or multiverse analysis (MA)
  • Investigating all (theoretically plausible) models
  • Requires justification for inclusions of choices/models






A Prominent Paper

Type of analytical decisions

Results


  • Association between digital technology use and adolescent well-being is negative but small

  • Technology use explains at most 0.4% of the variation in well-being.

  • Findings suggest that these effects are too small to warrant policy change


  • How important was the method - specification curve analysis - for this finding?

R: Exercise I

Into the Multiverse

Two Seminal Papers

Genealogy of the Approach

  • Long tradition of considering robustness to alternative specifications in the social sciences

  • Economics and Political science: reporting regression results in tables in which each column reports a different specification

  • Robustness tests: examining how certain “core” regression coefficient estimates behave when the regression specification is modified by adding or removing regressors (usually ~1-10 alternative models reported in the appendix)

  • Multiverse (MA) or specification curve analysis (SCA) can be regarded as extension and formalization of these approaches


Simonsohn et al., 2020

The problem

  • Multiverse analysis acknowledges that data are actively shaped during the transformation from raw data to analyzable form

  • Researchers often apply multiple processing steps during data preparation

  • These steps involve numerous “researcher’s degrees of freedom”, offering various, alternative options at each stage

  • Raw data don’t yield a single analyzable dataset; instead, they generate multiple versions based on the choices made: a data multiverse!

  • Each dataset in this multiverse can produce different statistical outcomes - a statistical multiverse!


Steegen et al., 2016
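
To make the data multiverse concrete, here is a minimal sketch with two hypothetical processing choices (the labels are illustrative, not from Steegen et al.): four analyzable datasets already emerge from one raw dataset.

# Two hypothetical processing choices span four datasets
library(tidyr)

processing_choices <- expand_grid(
  outlier_rule = c("none", "exclude +/- 3 SD"),          # hypothetical choice 1
  wellbeing_dv = c("depression", "life_satisfaction_r")  # hypothetical choice 2
)
nrow(processing_choices)  # 4 combinations = 4 analyzable datasets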

A Universe of Specifications

Perspective of one researcher


A Universe of Specifications

Two researchers with similar views


A Universe of Specifications

Two researchers with dissimilar views

A formal representation of the problem

Let’s consider a relationship between variables \(x\) and \(y\), in a context in which other variables, \(Z\), may influence the relationship:



\(y = F(x,Z) + \epsilon\)



We are faced with the following practical challenges:

  • \(x\) and \(y\) are often imprecisely defined latent variables
  • the set of moderators and confounders in \(Z\) is often not fully known ex ante
  • \(Z\) also contains imprecisely defined latent variables
  • the functional form \(F()\) is not known

A formal representation of the problem

To study \(y = F(x,Z)\), we must operationalize the underlying constructs. We usually approximate this relationship with a specification from a set of operationalizations:

\(y_{k_y} = F_{k_F}(x_{k_x}, Z_{k_Z})\)



where \(k_y\), \(k_F\), \(k_x\) and \(k_Z\) are indices for single operationalizations of the respective constructs.

For example, \(y_1\) may operationalize ‘well-being’ as life satisfaction, while \(y_2\) operationalizes it as reversed depression.


Simonsohn et al., 2020

The number of reasonable specifications

For each construct there are multiple statistically valid, theoretically justified and non-redundant operationalizations. Their combination leads to what we refer to as the set of reasonable specifications.

Designating the total number of valid operationalizations for each construct with \(n_y\), \(n_x\), \(n_Z\) and \(n_F\), the total number of reasonable specifications available to study \(y = F(x,Z)\) is:

\(N = n_x \times n_y \times n_Z \times n_F\)


Simonsohn et al., 2020
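
Because the count is a simple product, it grows quickly. A toy computation with hypothetical counts of operationalizations:

# Hypothetical counts: 3 predictors, 3 outcomes, 4 covariate sets, 1 functional form
n_x <- 3; n_y <- 3; n_Z <- 4; n_F <- 1
N <- n_x * n_y * n_Z * n_F
N  # 36 reasonable specifications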

Sampling specifications

  • Let \(\Pi\) be this set of \(N\) reasonable specifications, and \(\pi\) be the subset of specifications reported in a paper

  • \(\pi\) can be regarded as a sample of \(\Pi\)

  • Any given \(y_{k_y} = F_{k_F}(x_{k_x}, Z_{k_Z})\) is considered a valid proxy for \(y = F(x,Z)\) and therefore so is the full set of all such proxies \(\Pi\).

  • A sufficiently large, random and independently drawn sample of \(\Pi\) should lead to a reasonable estimate of the model of interest

  • The problem is that \(\pi\), the sample of specifications reported in a paper, usually has none of these three properties (i.e., before specification curve analysis)


Simonsohn et al., 2020

How SCA may solve the problem

  • It allows researchers to systematically generate a much larger \(\pi\), in which hundreds or even thousands of specifications are reported

  • It makes the existence of noise transparent and allows researchers to determine its nature, i.e., which operationalization decisions are consequential and which are not

  • It generates a \(\pi\) with fewer arbitrary inclusion decisions, and thus more closely approximates a random sample of \(\Pi\)

  • Yet it may also inflate the number of specifications and thus hide true effects of interest (more on that later!)


Simonsohn et al., 2020

General Procedure

1. Define the set of reasonable specifications to estimate

  • Differentiating types of decisions (type-E-, type-N-, type-U-decisions)
  • Reducing redundancy

2. Estimate all specifications

  • Run all models (i.e., with all specifications)
  • Visualize specification curve and the influence of different choices

3. Conduct joint statistical tests using an inferential specification curve

  • If specifications are truly non-redundant and valid, we can technically run inference tests
  • A type of bootstrapping approach where the true effect is fixed to zero


Simonsohn et al., 2020

Specification curve analysis with specr

  • R package (currently version 1.0.0 on CRAN, 1.0.1 on GitHub)

  • Versatile framework for conducting multiverse/specification curve analyses in R

  • Based around the “mapping” approach implemented in purrr

  • This also allows parallelization, which, depending on how many specifications are estimated, can be quite important to save computation time (see the sketch after this list)

  • Ties nicely into the tidyverse

  • Active and continuous development
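
As a hedged sketch of the parallelization mentioned above: specr’s mapping builds on purrr/furrr, so estimation can run under a future plan. The exact specr() arguments differ across versions, so consult the package’s parallelization vignette; only the generic plan setup is shown here.

# Set up a future plan for parallel estimation (generic future API)
library(future)
plan(multisession, workers = 4)  # four background R sessions
# ... run setup() and specr() as usual here ...
plan(sequential)                 # back to sequential processing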

The example data

For this example, we are going to use a synthetic data set that includes variables allowing us to assess the relationship between media use and well-being:

# Packages
library(specr)
library(tidyverse)  # for read_csv(), the pipe, and select()

# Loading data
d <- read_csv("https://raw.githubusercontent.com/ccs-amsterdam/r-course-material/master/data/pairfam_synth_data.csv")

d %>%
  select(age, gender, depression, 
         self_esteem, sns_use) 
# A tibble: 3,132 × 5
     age gender depression self_esteem sns_use
   <dbl> <chr>       <dbl>       <dbl>   <dbl>
 1    16 female       3.77        4          3
 2    15 female       3.42        4          0
 3    15 female       1           1          3
 4    16 male         4.46        4          0
 5    17 male         2.38        1          5
 6    15 female       1           1          4
 7    16 female       2.04        1         NA
 8    15 male         2.38        1.75       4
 9    15 female       2.04        1.75       3
10    17 male         7.58        6.25       5
# ℹ 3,122 more rows

Preparations

# select variables
std_vars <- c("age", "tv_use", "internet_use",
              "sns_use", "self_esteem", 
              "depression", "life_satisfaction_r",
              "friend_satisfaction")

# standardizing relevant variables
d <- d %>% 
  mutate(across(all_of(std_vars), 
                function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))) 

# check
d %>%
  select(all_of(std_vars)) %>%
  psych::describe() %>%
  select(n, mean, sd)
                       n mean sd
age                 3132    0  1
tv_use              3084    0  1
internet_use        3086    0  1
sns_use             2936    0  1
self_esteem         3084    0  1
depression          2899    0  1
life_satisfaction_r 3126    0  1
friend_satisfaction 3129    0  1
  • Technically, specification curve analysis does not necessarily require standardization of the variables

  • Yet, as the goal is often comparison of outcome estimates, it mostly makes sense to standardize the variables before the computation

Setting up reasonable specifications

The set of reasonable specifications can be generated by:


1. Identifying all of the data analytic decisions necessary to map the scientific hypothesis or construct of interest onto a statistical hypothesis

2. Listing all the reasonable alternative ways a researcher may make those decisions

3. Generating the exhaustive combination of decisions, eliminating combinations that are invalid or redundant.


If the resulting set is too large, one can randomly draw from it in the next step (estimation) to create the specification curve (see the sketch below)
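
A minimal sketch of step 3 and the random draw just mentioned, using the choice sets from the example data above (the covariate labels are illustrative):

# Enumerate all combinations of choices, then sample if there are too many
library(tidyverse)

all_specs <- expand_grid(
  x        = c("sns_use", "tv_use", "internet_use"),
  y        = c("depression", "life_satisfaction_r", "self_esteem"),
  controls = c("none", "friend_satisfaction", "age")
)
nrow(all_specs)                  # 27 combinations
slice_sample(all_specs, n = 10)  # random draw if the full set is too large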

Setting up specifications

specs <- setup(data = d,
               x = "sns_use",
               y = c("depression", "life_satisfaction_r"),
               model = "lm")
summary(specs)
Setup for the Specification Curve Analysis
-------------------------------------------
Class:                      specr.setup -- version: 1.0.1 
Number of specifications:   2 

Specifications:

  Independent variable:     sns_use 
  Dependent variable:       depression, life_satisfaction_r 
  Models:                   lm 
  Covariates:               no covariates 
  Subsets analyses:         all 

Function used to extract parameters:

  function (x) 
broom::tidy(x, conf.int = TRUE)
<environment: 0x11a1f5660>


Head of specifications table (first 6 rows):
# A tibble: 2 × 6
  x       y                   model controls      subsets formula                          
  <chr>   <chr>               <chr> <chr>         <chr>   <glue>                           
1 sns_use depression          lm    no covariates all     depression ~ sns_use + 1         
2 sns_use life_satisfaction_r lm    no covariates all     life_satisfaction_r ~ sns_use + 1

Setting up specifications

specs <- setup(data = d,
               x = "sns_use",
               y = c("depression", "life_satisfaction_r"),
               controls = c("friend_satisfaction"),
               model = "lm")
summary(specs)
Setup for the Specification Curve Analysis
-------------------------------------------
Class:                      specr.setup -- version: 1.0.1 
Number of specifications:   4 

Specifications:

  Independent variable:     sns_use 
  Dependent variable:       depression, life_satisfaction_r 
  Models:                   lm 
  Covariates:               no covariates, friend_satisfaction 
  Subsets analyses:         all 

Function used to extract parameters:

  function (x) 
broom::tidy(x, conf.int = TRUE)
<environment: 0x12a664900>


Head of specifications table (first 6 rows):
# A tibble: 4 × 6
  x       y                   model controls            subsets formula                                            
  <chr>   <chr>               <chr> <chr>               <chr>   <glue>                                             
1 sns_use depression          lm    no covariates       all     depression ~ sns_use + 1                           
2 sns_use depression          lm    friend_satisfaction all     depression ~ sns_use + friend_satisfaction         
3 sns_use life_satisfaction_r lm    no covariates       all     life_satisfaction_r ~ sns_use + 1                  
4 sns_use life_satisfaction_r lm    friend_satisfaction all     life_satisfaction_r ~ sns_use + friend_satisfaction

Setting up specifications

specs <- setup(data = d,
               x = c("sns_use", "tv_use", "internet_use"),
               y = c("depression", "life_satisfaction_r", "self_esteem"),
               controls = c("friend_satisfaction", "age"),
               model = "lm",
               subsets = list(gender = c("male", "female")))
summary(specs, rows = 4)
Setup for the Specification Curve Analysis
-------------------------------------------
Class:                      specr.setup -- version: 1.0.1 
Number of specifications:   108 

Specifications:

  Independent variable:     sns_use, tv_use, internet_use 
  Dependent variable:       depression, life_satisfaction_r, self_esteem 
  Models:                   lm 
  Covariates:               no covariates, friend_satisfaction, age, friend_satisfaction + age 
  Subsets analyses:         male, female, all 

Function used to extract parameters:

  function (x) 
broom::tidy(x, conf.int = TRUE)
<environment: 0x1386db558>


Head of specifications table (first 4 rows):
# A tibble: 4 × 7
  x       y          model controls            subsets gender formula                                   
  <chr>   <chr>      <chr> <chr>               <chr>   <fct>  <glue>                                    
1 sns_use depression lm    no covariates       male    male   depression ~ sns_use + 1                  
2 sns_use depression lm    no covariates       female  female depression ~ sns_use + 1                  
3 sns_use depression lm    no covariates       all     <NA>   depression ~ sns_use + 1                  
4 sns_use depression lm    friend_satisfaction male    male   depression ~ sns_use + friend_satisfaction

Assessing the garden of forking paths

plot(specs)

Running the analyses

results <- specr(specs)
summary(results)
Results of the specification curve analysis
-------------------
Technical details:

  Class:                          specr.object -- version: 1.0.1 
  Cores used:                     1 
  Duration of fitting process:    0.384 sec elapsed 
  Number of specifications:       108 

Descriptive summary of the specification curve:

 median  mad   min  max  q25  q75
   0.07 0.04 -0.06 0.15 0.02 0.08

Descriptive summary of sample sizes: 

 median  min  max
   1516 1311 3080

Head of the specification results (first 6 rows): 

# A tibble: 6 × 25
  x       y          model controls       subsets gender formula estimate std.error statistic p.value conf.low conf.high
  <chr>   <chr>      <chr> <chr>          <chr>   <fct>  <glue>     <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 sns_use depression lm    no covariates  male    male   depres…     0.08      0.03      2.71    0.01     0.02      0.13
2 sns_use depression lm    no covariates  female  female depres…     0.04      0.03      1.64    0.1     -0.01      0.09
3 sns_use depression lm    no covariates  all     <NA>   depres…     0.06      0.02      3.08    0        0.02      0.1 
4 sns_use depression lm    friend_satisf… male    male   depres…     0.09      0.03      3.4     0        0.04      0.14
5 sns_use depression lm    friend_satisf… female  female depres…     0.07      0.03      2.73    0.01     0.02      0.12
6 sns_use depression lm    friend_satisf… all     <NA>   depres…     0.08      0.02      4.34    0        0.04      0.12
# ℹ 12 more variables: fit_r.squared <dbl>, fit_adj.r.squared <dbl>, fit_sigma <dbl>, fit_statistic <dbl>,
#   fit_p.value <dbl>, fit_df <dbl>, fit_logLik <dbl>, fit_AIC <dbl>, fit_BIC <dbl>, fit_deviance <dbl>,
#   fit_df.residual <dbl>, fit_nobs <dbl>

Typical Visualization

plot(results)

Basic descriptive analysis

Basic descriptive summary of the entire specification curve:

summary(results, 
        type = "curve")
# A tibble: 1 × 7
  median    mad     min   max    q25    q75   obs
   <dbl>  <dbl>   <dbl> <dbl>  <dbl>  <dbl> <dbl>
1 0.0675 0.0393 -0.0629 0.150 0.0209 0.0834  1516

Basic descriptive analysis

Descriptive summary per specific choices:

summary(results, 
        type = "curve", 
        group = c("x"))  # group analysis by choices
# A tibble: 3 × 8
  x            median    mad     min    max      q25    q75   obs
  <chr>         <dbl>  <dbl>   <dbl>  <dbl>    <dbl>  <dbl> <dbl>
1 internet_use 0.0796 0.0144  0.0468 0.150   0.0720  0.124   1551
2 sns_use      0.0444 0.0466 -0.0136 0.0966  0.0119  0.0752  1495
3 tv_use       0.0284 0.0682 -0.0629 0.144  -0.00300 0.0817  1560

Basic descriptive analysis

Descriptive summary with customized statistics:

summary(results, 
        type = "curve", 
        group = "subsets", 
        stats = list(mean = mean, 
                     median = median))
# A tibble: 3 × 4
  subsets   mean median   obs
  <chr>    <dbl>  <dbl> <dbl>
1 all     0.0495 0.0646 2930.
2 female  0.0274 0.0389 1516 
3 male    0.0965 0.0985 1412.

Decomposing the variance

  • We propose to further “decompose” the variance in the specification curve

  • We can treat the analysis as a factorial design in which the resulting estimates are nested in the choices, and compute intra-class correlation coefficients (ICC) (a manual sketch follows below)

plot(results, type = "variance")
       grp vcov  icc percent
1 controls    0 0.00    0.00
2  subsets    0 0.38   37.59
3        y    0 0.00    0.00
4        x    0 0.29   29.15
5 Residual    0 0.33   33.26
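
Under the hood, this decomposition amounts to a multilevel model with a random intercept per choice (cf. the formula argument on the next slide). A manual sketch, assuming lme4 is installed:

# Manual variance decomposition of the specification curve
library(lme4)

m  <- lmer(estimate ~ 1 + (1|x) + (1|y) + (1|controls) + (1|subsets),
           data = results$data)
vc <- as.data.frame(VarCorr(m))   # variance components per choice
vc$icc <- vc$vcov / sum(vc$vcov)  # intra-class correlations
vc[, c("grp", "vcov", "icc")]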

Decomposing the variance

  • We propose to “decompose” the variance in the specification curve

  • We can treat the analysis as a factorial design in which the resulting estimates are nested in the choices, and compute intra-class correlation coefficients (ICC)

plot(results, type = "variance", 
     formula = "estimate ~ 1 + (1|x) + (1|y) + 
               (1|controls) + (1|subsets) + (1|x:y)")
       grp vcov  icc percent
1      x:y    0 0.25   24.60
2 controls    0 0.00    0.00
3  subsets    0 0.38   38.00
4        y    0 0.00    0.00
5        x    0 0.21   21.39
6 Residual    0 0.16   16.01

Alternative visualization

plot(results, type = "boxplot")

Customized visualizations

  • Yet bear in mind that the standard functions of specr provide only one perspective on the data

  • As specr() also returns a data frame with the results of all specifications, we can wrangle the data ourselves and create whatever plot we want

head(results$data)
# A tibble: 6 × 27
  x       y          model controls  subsets gender formula model_function term 
  <chr>   <chr>      <chr> <chr>     <chr>   <fct>  <glue>  <list>         <chr>
1 sns_use depression lm    no covar… male    male   depres… <fn>           sns_…
2 sns_use depression lm    no covar… female  female depres… <fn>           sns_…
3 sns_use depression lm    no covar… all     <NA>   depres… <fn>           sns_…
4 sns_use depression lm    friend_s… male    male   depres… <fn>           sns_…
5 sns_use depression lm    friend_s… female  female depres… <fn>           sns_…
6 sns_use depression lm    friend_s… all     <NA>   depres… <fn>           sns_…
# ℹ 18 more variables: estimate <dbl>, std.error <dbl>, statistic <dbl>,
#   p.value <dbl>, conf.low <dbl>, conf.high <dbl>, fit_r.squared <dbl>,
#   fit_adj.r.squared <dbl>, fit_sigma <dbl>, fit_statistic <dbl>,
#   fit_p.value <dbl>, fit_df <dbl>, fit_logLik <dbl>, fit_AIC <dbl>,
#   fit_BIC <dbl>, fit_deviance <dbl>, fit_df.residual <int>, fit_nobs <int>

Customized visualizations

results$data |> 
   ggplot(aes(x = x, y = estimate, ymin = conf.low, ymax = conf.high, color = controls)) +
   geom_hline(yintercept = 0, linetype = "dashed", color = "grey")+
   geom_pointrange(position = position_dodge(width = .9)) +
   facet_grid(y ~ subsets) +
   coord_flip() +
   theme_bw()

Inference tests

  • Only partially implemented in the specr development version

  • Simonsohn et al. (2020) propose three test statistics:

  1. Extremity of median: Testing whether the median effect across all specifications is more extreme than would be expected if all specifications had a true effect of zero

  2. Share of significant results: Testing whether the share of specifications with statistically significant effects in the expected direction is more extreme than would be expected if all specifications had an effect of zero

  3. Aggregated p-value: Testing whether the aggregated z value associated with each p-value (for example, z = 1.96 for p = 0.05) is more extreme than would be expected if all specifications had a true effect of zero
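
The first two statistics can already be computed descriptively from the results object (a small sketch using the columns of results$data shown earlier); the inferential part, i.e., comparing them against resampled null curves, follows on the next slides.

# Descriptive versions of test statistics 1 and 2
res <- results$data

median(res$estimate, na.rm = TRUE)                        # extremity of the median
mean(res$p.value < .05 & res$estimate > 0, na.rm = TRUE)  # share significant & positive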

Computing the test statistics

  • Generating distributions for these test statistics under the null hypothesis isn’t feasible analytically, but we can create these distributions through resampling under the null hypothesis

  • This process entails adjusting the observed data to ensure a known true null hypothesis and then randomly sampling from the modified data

  • The test statistic is then computed on each of these samples


  1. Generate \(K\) different dependent variables under the null: \(y_k^* = y_k - b_k x_k\)
  2. Draw at random, and with replacement, \(N\) rows from this matrix, using the same drawn rows of data for all \(K\) specifications
  3. Estimate the \(K\) specifications on the drawn data
  4. Repeat steps 2 and 3 a large number of times (e.g., 500 or 1,000)

Resampling under-the-null

set.seed(42)

# Custom extract function to get full model
tidy_full <- function(x) {
  fit <- broom::tidy(x, conf.int = TRUE)
  fit$res <- list(x)  # Store model object
  return(fit)
}

# Smaller model (without subsets)
specs <- setup(data = d,
               x = c("sns_use", "tv_use", "internet_use"),
               y = c("depression", "life_satisfaction_r", 
                     "self_esteem"),
               controls = c("friend_satisfaction", "age"),
               model = "lm",
               fun1 = tidy_full)

# Run specification curve analysis
results <- specr(specs)

# Resampling under the null (use 500-1,000 samples in practice!)
boot_results <- boot_null(results, specs, 
                          n_samples = 10) 

Joint statistical tests

  • For each bootstrapped sample we now have K estimates, one for each specification

  • Now we compute what percentage of the resampled specification curves (for example, of the 10 resamples drawn above) exhibits an overall test statistic that is at least as extreme as that observed in the real data

summary(boot_results)
# A tibble: 3 × 3
  type           estimate p.value
  <chr>          <chr>    <chr>  
1 median         0.06     < .001 
2 share positive 24 / 36  < .001 
3 share negative 0 / 36   < .001
  • Test statistic 1: As we can see, a median of 0.06 across all specifications is more extreme than would be expected if all specifications had a true effect of zero: p < .001

  • Test statistic 2: The share of specifications that obtain a statistically significant effect in the predicted direction is more extreme than would be expected if all specifications had an effect of zero: 24/36 (66.7%, p < .001).

Observed vs. under-the-null curve

  • We can also plot the actual specification curve on top of the distribution of resamples under-the-null (e.g., the 2.5%, 50% and 97.5% quantiles) to visually inspect the extremity of the observed results
plot(boot_results)

Exercise: Thinking through your own projects

Format

  • Do you have a current project in which you took certain analytical decisions?

  • Try to map out the decisional path that you took to arrive at the “results”

  • Consider decisional crossroads on your path: are there alternative analytical choices that you could have taken?

  • Try to identify a “universe” of possible, yet still justifiable, choices that would represent a basis for a potential specification curve analysis (see the hypothetical example below)

  • Discuss your choice table with your neighbour
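
A hypothetical choice table for this exercise might look like this (decisions and alternatives are purely illustrative):

# Hypothetical choice table for mapping your own garden of forking paths
library(tibble)

choice_table <- tribble(
  ~decision,           ~alternatives,
  "Predictor (x)",     "sns_use; internet_use; tv_use",
  "Outcome (y)",       "depression; life_satisfaction_r",
  "Outlier handling",  "none; exclude +/- 3 SD",
  "Control variables", "none; age; friend_satisfaction"
)
choice_table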

Arbitrary Decisions

Current practices as cause for concern?

  • The central notion of these methods is that the alternatives included in the multiverse are “arbitrary” or equally “reasonable”

  • Little guidance or consensus on how to evaluate arbitrariness

  • No consideration of the potential pitfalls of multiverse-style methods

Different types of choices/decisions

  • Type E (Principled Equivalence): Evidence and conceptual considerations may indicate that alternative analyses are effectively equivalent. Examples: alternative measures with comparable validity and reliability; arbitrary thresholds for outliers.

  • Type N (Principled Non-Equivalence): Available evidence or considerations support the conclusion that alternative specifications are not equivalent, and some are objectively more justified. Examples: theoretically plausible subsets; different types of measures; inclusion of theoretically plausible control variables.

  • Type U (Uncertainty): There are no compelling reasons to expect equivalence or non-equivalence. Examples: any of the above, when available evidence does not suggest that one option is better than, or equivalent to, another.

What choices are “ok” to include?

Alternative 1: Obtaining robust estimates of the effect of interest

  • Understanding how arbitrary choices affect the results
  • Confirmatory in nature
  • Only type E decisions can be included
  • Inference tests can be done (step 3)


Alternative 2: Exposing the impact of hidden degrees of freedom

  • Understanding how different conceptual and analytical choices affect the results
  • Exploratory in nature
  • All types of decisions, but particularly type U decisions, may be included
  • No inference tests, but descriptive analyses of differences

Alternative 1 vs. 2 (Masur & Ranzini, 2023)

R: Exercise II

A Mass of Poorly Justified Alternatives

Dangers and pitfalls

  • The main danger of multiverse-style methods lies in their potential for combinatorial explosion

  • Just a few choices considered arbitrary can lead to a vast multiverse, overshadowing reasonable effect estimates with unjustified alternatives.

    • A single decision with two options doubles the number of specifications
    • Five binary decisions expand the multiverse by 32 times
    • If one option is consistently justifiable in each case, justified choices represent only about 3% of the entire multiverse (see the arithmetic below)


Del Giudice & Gangestad, 2021
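
The arithmetic behind these numbers is a simple power of two:

# Combinatorial explosion: k binary decisions multiply the space by 2^k
2^5      # five binary decisions: 32 times as many specifications
1 / 2^5  # one justified path per decision: ~3% of the multiverse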

Combinatorial Explosion

specs1 <- setup(data = example_data,
               x = c("x1", "x2", "x3", "x4"),
               y = c("y1", "y2", "y3", "y4"),
               model = "lm")
results1 <- specr(specs1)
plot(results1, type = "curve")

Combinatorial Explosion

specs2 <- setup(data = example_data,
               x = c("x1", "x2", "x3", "x4"),
               y = c("y1", "y2", "y3", "y4"),
               model = "lm",
               control = "c1")
results2 <- specr(specs2)
plot(results2, type = "curve")

Combinatorial Explosion

specs3 <- setup(data = example_data,
               x = c("x1", "x2", "x3", "x4"),
               y = c("y1", "y2", "y3", "y4"),
               model = c("lm", "glm"),
               control = c("c1", "c2", "c3", "c4"),
               simplify = FALSE)
results3 <- specr(specs3)
plot(results3, type = "curve")

Combinatorial Explosion

specs4 <- setup(data = example_data,
               x = c("x1", "x2", "x3", "x4"),
               y = c("y1", "y2", "y3", "y4"),
               model = c("lm", "glm"),
               control = c("c1", "c2", "c3", "c4"),
               subset = list(group1 = c("young", "middle", "old")),
               simplify = FALSE)
results4 <- specr(specs4)
plot(results4, type = "curve")

Comparison of the curves

Comparison of the variance decomposition

       grp vcov  icc percent
1        x 0.00 0.00     0.0
2        y 0.04 0.75    75.1
3 controls 0.00 0.00     0.0
4 Residual 0.01 0.25    24.9

       grp vcov  icc percent
1 controls 0.00 0.00    0.02
2  subsets 0.00 0.01    0.50
3        y 0.02 0.61   60.64
4        x 0.00 0.01    0.75
5    model 0.00 0.00    0.00
6 Residual 0.01 0.38   38.09

What exactly is the problem?

  • The explosion of unjustified specifications, by expanding the analysis space, paradoxically amplifies the appearance of comprehensiveness and credibility within the multiverse

  • Simultaneously, it significantly diminishes the informative portion of the multiverse

  • The vastness of the specification space can complicate the examination of results for potentially valuable insights.

R: Exercise III

Conclusion

So what?

  • It makes little sense to include in the multiverse a specification that, a priori, one would have dismissed as inferior to other specifications

  • Researchers conducting a multiverse-style analysis should provide a clear rationale for treating alternatives as equivalent (preregistration!)

  • However, type U decisions will likely not be uncommon

  • Strong call for systematic exploratory multiverse analysis!

“Specification curve analysis will not end debates about what specifications should be run. Specification curve analysis will instead facilitate those debates” (Simonsohn et al., 2020, p. 1209).

Want to learn more?

  • Visit the website of specr

  • Several extra tutorials (e.g., parallelization, incorporating SEM, multilevel, Bayesian estimation)

  • Continuous development (e.g. integration of inferential tests, speed improvements)

THE END

Thank you!

References

  • Del Giudice, M., & Gangestad, S. W. (2021). A traveler’s guide to the multiverse: Promises, pitfalls, and a framework for the evaluation of analytic decisions. Advances in Methods and Practices in Psychological Science, 4(1). https://journals.sagepub.com/doi/abs/10.1177/2515245920954925

  • Eisenberg, L. (2018). The tree of life. Retrieved from: https://www.evogeneao.com/en

  • Masur, P. K. & Ranzini, G. (2023). Privacy Calculus, Privacy Paradox, and Context Collapse: A Replication of Three Key Studies in Communication Privacy Research. Manuscript in preparation.

  • Masur, P. K. & Scharkow, M. (2020). specr: Conducting and Visualizing Specification Curve Analyses (R-package, version 1.0.0). https://CRAN.R-project.org/package=specr

  • McElreath, R. (2023). Statistical Rethinking 2023 - Horoscopes. Lecture on Youtube: https://www.youtube.com/watch?v=qwF-st2NGTU&t=224s

  • Orben, A., & Przybylski, A. K. (2019). The association between adolescent well-being and digital technology use. Nature Human Behaviour, 3(2), 173-182.

  • Silberzahn, R., Uhlmann, E. L., Martin, D. P., Anselmi, P., Aust, F., Awtrey, E., … & Nosek, B. A. (2018). Many analysts, one data set: Making transparent how variations in analytic choices affect results. Advances in Methods and Practices in Psychological Science, 1(3), 337-356.

  • Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366.

  • Simonsohn, U., Simmons, J.P. & Nelson, L.D. (2020). Specification curve analysis. Nature Human Behaviour, 4, 1208–1214. https://doi.org/10.1038/s41562-020-0912-z

  • Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing Transparency Through a Multiverse Analysis. Perspectives on Psychological Science, 11(5), 702-712. https://doi.org/10.1177/1745691616658637