A major improvement in version 0.3.0 of specr is that we can parallelize the computations, which can reduce the fitting time. For this, specr uses functions from the package furrr. I suggest checking out the package's website for further information. To be able to use the relevant functions, we load the package furrr upfront.
Before we start to run some examples, bear in mind that using parallelization does not automatically make the computations faster. If the overall computation time is not very long, parallelization may even lead to a longer fitting process, as setting up several “workers” (essentially setting up procedures on several cores) produces considerable “overhead”. A true reduction in fitting time is thus only achieved when the data set is large, the number of specifications is high, and the computed models are complex. I therefore first simulate a comparatively large data set that allows us to specify more than 1,000 specifications.
# Load packages
library(tidyverse)
library(specr)
library(furrr)
# Data generating function
generate_data <- function(seed = 42, n = 1e5) {
  if (!is.na(seed)) set.seed(seed)
  tibble(
    x1 = rnorm(n),
    x2 = rnorm(n) + 0.9*x1,
    x3 = rnorm(n) + 0.9*x2,
    x4 = rnorm(n) + 0.9*x3,
    y4 = rep(c(1, 0), times = n/2),
    y1 = rnorm(n) + x1*.1 * 0.9*y4,
    y2 = rnorm(n) + x1*.2,
    y3 = rnorm(n) + x1*.2 + -0.4*x2,
    c1 = rnorm(n) + x1*.3,
    c2 = rnorm(n),
    c3 = rnorm(n) + 0.9*c1,
    c4 = rnorm(n),
    group = sample(c("a", "b", "c", "d", "e"), n, replace = TRUE)
  )
}
# Generate very large data set (n = 100,000, 9.9 MB on disk)
dat <- generate_data(9)
If we use standard model fitting functions (e.g., lm) from base R, parallelization is comparatively simple. We only need to load the package furrr and specify a “plan for how to resolve a future” (for more information, see ?future::plan). In this case, I am choosing multisession (resolve the computations in separate R sessions running in the background on the same machine) and specify workers = 4 so that the models are fitted on 4 cores in parallel.
# Setup of specifications (number of specs = 1152)
specs <- setup(data = dat,
               y = c("y1", "y2", "y3"),
               x = c("x1", "x2", "x3", "x4"),
               model = c("lm"),
               controls = c("c1", "c2", "c3", "c4"),
               subsets = list(group = unique(dat$group)))
# Default: Sequential ---
results_simple <- specr(specs)
# Parallel: Multisession (only works when `furrr` is loaded!)
plan(strategy = multisession, workers = 4)
results_parall <- specr(specs)
# Comparison
cat("Sequential: ", results_simple$time, "\n",
    "Parallel:   ", results_parall$time)
## Sequential: 57.227 sec elapsed
## Parallel: 36.781 sec elapsed
As we can see, the default (sequential) computation took around a minute (57.227 sec elapsed) and the parallel computation about half a minute (36.781 sec elapsed).
We have to acknowledge that even with this comparatively large data set and more than 1,000 specifications, the reduction in time is not that large. Parallelization thus only makes sense if you have a truly large data set and thousands of specifications, or if the type of model you are estimating takes a long time to fit (e.g., a complex structural equation model, a negative binomial model, etc.). Furthermore, the true power of parallelization only comes into play if you are using a considerable number of cores.
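Before committing to a number of workers, it can help to check how many cores the machine actually offers. A minimal sketch using future::availableCores() (the "leave one core free for the system" rule of thumb is a common convention, not a specr requirement):

```r
library(future)

# How many cores does this machine offer?
n_cores <- availableCores()

# Common rule of thumb: leave one core free for the operating system
n_workers <- max(1, n_cores - 1)

plan(strategy = multisession, workers = n_workers)
```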
If we use a custom function, we need to make sure that this function is passed to the different workers. This can be done by specifying so-called furrr_options. We need to pass objects from the global environment (and potentially also packages) as shown below. Please note that we do not have to specify the “future plan” again, because we have already specified it earlier in this session (see above). If we are unsure what plan is currently specified, we can simply run plan() to get information about the current setup.
As the computation can take a long time, it is helpful to see some kind of progress indication. This can easily be achieved by adding the argument .progress = TRUE to specr(). This argument is passed on to the future_pmap() function within specr() and prints a rudimentary progress bar during the fitting process.
# Custom function
log_model <- function(formula, data) {
  glm(formula, data = data, family = binomial())
}
# Setup specs
specs <- setup(data = dat,
               y = c("y4"),
               x = c("x1", "x2", "x3"),
               model = c("log_model"),
               controls = c("c1", "c2"),
               subsets = list(group = unique(dat$group)))
# Create furrr_options to be passed to specr() (only works if `furrr` is loaded)
opts <- furrr_options(
  globals = list(log_model = log_model)
)
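If the custom function additionally relies on a package that may not be loaded on the workers, we can make that package available via the packages argument of furrr_options() as well. A sketch with a hypothetical negative binomial wrapper around MASS::glm.nb() (the function nb_model is purely illustrative, not part of this analysis):

```r
# Hypothetical custom function relying on the MASS package
nb_model <- function(formula, data) {
  glm.nb(formula, data = data)
}

opts <- furrr_options(
  globals = list(nb_model = nb_model),  # ship the function to each worker
  packages = "MASS"                     # load MASS on each worker
)
```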
# What "plan" is currently specified?
plan()
## multisession:
## - args: function (..., workers = 4, envir = parent.frame())
## - tweaked: TRUE
## - call: plan(strategy = multisession, workers = 4)
# Run results
results_parall_2 <- specr(specs,
                          .options = opts,    # Pass opts to specr()
                          .progress = TRUE)   # Adds a progress bar (not shown in output)
# Summarize results
summary(results_parall_2)
## Results of the specification curve analysis
## -------------------
## Technical details:
##
## Class: specr.object -- version: 1.0.0
## Cores used: 4
## Duration of fitting process: 37.736 sec elapsed
## Number of specifications: 72
##
## Descriptive summary of the specification curve:
##
## median mad min max q25 q75
## 0.01 0.01 -0.02 0.03 0 0.01
##
## Descriptive summary of sample sizes:
##
## median min max
## 20038 19844 1e+05
##
## Head of the specification results (first 6 rows):
##
## # A tibble: 6 × 21
## x y model controls subsets group formula estimate std.error statistic p.value conf.low
## <chr> <chr> <chr> <chr> <chr> <fct> <glue> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 x1 y4 log_m… no cova… e e y4 ~ x… 0.01 0.01 0.38 0.7 -0.02
## 2 x1 y4 log_m… no cova… a a y4 ~ x… 0.01 0.01 0.62 0.54 -0.02
## 3 x1 y4 log_m… no cova… d d y4 ~ x… -0.02 0.01 -1.17 0.24 -0.04
## 4 x1 y4 log_m… no cova… c c y4 ~ x… 0 0.01 -0.16 0.87 -0.03
## 5 x1 y4 log_m… no cova… b b y4 ~ x… 0.03 0.01 1.84 0.07 0
## 6 x1 y4 log_m… no cova… all <NA> y4 ~ x… 0 0.01 0.66 0.51 -0.01
## # … with 9 more variables: conf.high <dbl>, fit_null.devian… <dbl>, fit_df.null <dbl>,
## # fit_logLik <dbl>, fit_AIC <dbl>, fit_BIC <dbl>, fit_deviance <dbl>, fit_df.residual <dbl>,
## # fit_nobs <dbl>
Note: In the technical details of the summary, we always see how many cores were used and how long the fitting process took.
At the end of our analysis, it makes sense to explicitly close the multisession workers by switching the plan back to sequential.
plan(sequential)
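To verify that the workers are indeed closed, we can inspect the plan once more; future::nbrOfWorkers() should now report a single worker:

```r
plan()           # should now print "sequential"
nbrOfWorkers()   # should return 1
```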