`vignettes/parallel-bayesian-models.Rmd`

This vignette shows how to incorporate Bayesian modeling using the package `brms` in `specr`. When we fit a model with `brms`, the package actually calls `rstan` in the background, which, in turn, is an R interface to the statistical programming language `Stan`. Stan is built in the programming language C++, and models have to be compiled in C++ before they can be run. This is all taken care of by `brms`, so you just need to run `brm(…)` to fit any model. `brms` uses a syntax for specifying model formulae that is based on the syntax of the well-known `lme4` package (for multilevel modeling). The comparatively simple syntax of `brms` is then converted into Stan code automatically.
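
For example, a simple regression model can be specified with the same formula syntax you would use in `lm()` or `lme4`. Here is a minimal sketch (using `example_data`, the built-in data set we analyze throughout this vignette; the grouping variable in the commented multilevel variant is hypothetical):

```
# Simple regression with lme4-style formula syntax
fit <- brm(y1 ~ x1 + c1, data = example_data)

# A multilevel variant would add lme4-style random effects, e.g.:
# fit_ml <- brm(y1 ~ x1 + (1 | group), data = example_data)
```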

That said, since the models have to be compiled in C++, you need to set up your computer so that it can actually compile C++ code. This has to be done only once, before installing `brms`. Unfortunately, how you set up the C++ toolchain depends on your operating system. Check this vignette if you have problems installing `brms`.
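
Once the toolchain is in place, `brms` can be installed from CRAN as usual:

```
# One-time installation (requires a working C++ toolchain)
install.packages("brms")
```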

We are also loading the library `furrr`, as we want to parallelize the computations.
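
A setup chunk could thus look as follows (a sketch; `tidyverse`, `ggridges`, and `cowplot` are assumptions based on the plotting code used further below):

```
library(tidyverse)  # data wrangling and plotting
library(specr)      # specification curve analysis
library(brms)       # Bayesian models via Stan
library(furrr)      # parallelized mapping
library(ggridges)   # ridgeline plots of the posteriors (used below)
library(cowplot)    # plot_grid() for combining panels (used below)
```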

Although you can simply pass the function `brm` to `setup` and run `specr()` with `workers = 1` without any specific custom functions, it usually makes sense to set up custom model fitting and extraction functions to make sure `specr()` does exactly what we want.
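
For reference, the default route without custom functions would look roughly like this (a sketch using the same variables as the full example below; note that the default parameter extraction via `broom::tidy()` requires `broom.mixed` to be loaded for `brmsfit` objects):

```
# Default route: pass brm directly and use specr's default extraction
specs_default <- setup(data = example_data,
                       x = c("x1", "x2", "x3"),
                       y = c("y1", "y2"),
                       model = "brm")
results_default <- specr(specs_default, workers = 1)
```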

For example, I would suggest creating a custom model fitting function and specifying relevant parameters for the Bayesian models. Here, I switch off parallel processing within `brm` as I want to do this via `specr()` (i.e., setting `cores = 1` in `brm`). I also reduce the number of chains to 1 to speed up the fitting process (but we can bump this up if we want). This custom function also allows me to load `brms` itself and `broom.mixed`, which provides the `tidy` function to extract parameters from the `brmsfit` object.

Furthermore, I also create a custom fit extraction function. Most importantly, I add the full `brmsfit` object to the `glance` output. This way, we can later access the entire object in order to, e.g., extract posterior distributions or other aspects of the Bayesian models.

```
# Customized function, not necessary if standard values are to be used
brm_new <- function(formula, data) {
  require(brms)         # load brms (also on parallel workers)
  require(broom.mixed)  # provides tidy() for brmsfit objects
  brm(formula, data,
      chains = 1,   # only one chain to speed up fitting
      cores = 1,    # no within-model parallelization; we parallelize via specr()
      silent = 2,   # don't print progress
      refresh = 0,  # remove messages (not necessary, but nicer in the output)
      iter = 1000)  # reduced here to save computing time, can be higher, of course
}

# New fun2: fit extraction function
glance_brm <- function(x) {
  fit2 <- broom::glance(x)
  fit2$full_model <- list(x)  # add the full model object
  return(fit2)
}

# Setting up specifications
specs <- setup(data = example_data,
               x = c("x1", "x2", "x3"),
               y = c("y1", "y2"),
               model = "brm_new",
               controls = c("c1", "c2"),
               fun2 = glance_brm)

# Check specs
summary(specs)
#> Setup for the Specification Curve Analysis
#> -------------------------------------------
#> Class: specr.setup -- version: 1.0.0
#> Number of specifications: 24
#>
#> Specifications:
#>
#> Independent variable: x1, x2, x3
#> Dependent variable: y1, y2
#> Models: brm_new
#> Covariates: no covariates, c1, c2, c1 + c2
#> Subsets analyses: all
#>
#> Function used to extract parameters:
#>
#> function(x) broom::tidy(x, conf.int = TRUE)
#> <environment: 0x7f9c7371bad0>
#>
#>
#> Head of specifications table (first 6 rows):
#> # A tibble: 6 × 6
#> x y model controls subsets formula
#> <chr> <chr> <chr> <chr> <chr> <glue>
#> 1 x1 y1 brm_new no covariates all y1 ~ x1 + 1
#> 2 x1 y1 brm_new c1 all y1 ~ x1 + c1
#> 3 x1 y1 brm_new c2 all y1 ~ x1 + c2
#> 4 x1 y1 brm_new c1 + c2 all y1 ~ x1 + c1 + c2
#> 5 x1 y2 brm_new no covariates all y2 ~ x1 + 1
#> 6 x1 y2 brm_new c1 all y2 ~ x1 + c1
```

Because we are estimating 24 Bayesian models, this may take a while. Note that we switched off parallel processing within `brm` itself; if we want to speed things up, we can instead let `specr()` distribute the specifications across several workers, as shown below.

`results <- specr(specs)`
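
Since we loaded `furrr`, we could distribute the model fitting across several cores via the `workers` argument (a sketch; the number of workers is an example and should be adapted to your machine):

```
# Parallelize across specifications (example: 4 workers)
results <- specr(specs, workers = 4)
```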

Now we can again summarize and plot our results.

```
summary(results)
#> Results of the specification curve analysis
#> -------------------
#> Technical details:
#>
#> Class: specr.object -- version: 1.0.0
#> Cores used: 1
#> Duration of fitting process: 9.25 sec elapsed
#> Number of specifications: 24
#>
#> Descriptive summary of the specification curve:
#>
#> median mad min max q25 q75
#> 0.16 0.25 -0.34 0.62 -0.07 0.25
#>
#> Descriptive summary of sample sizes:
#>
#> median min max
#> 1000 1000 1000
#>
#> Head of the specification results (first 6 rows):
#>
#> # A tibble: 6 × 18
#> x y model controls subsets formula effect component group estimate std.error
#> <chr> <chr> <chr> <chr> <chr> <glue> <chr> <chr> <chr> <dbl> <dbl>
#> 1 x1 y1 brm_new no covariates all y1 ~ x1 + 1 fixed cond <NA> 0.62 0.04
#> 2 x1 y1 brm_new c1 all y1 ~ x1 + c1 fixed cond <NA> 0.6 0.04
#> 3 x1 y1 brm_new c2 all y1 ~ x1 + c2 fixed cond <NA> 0.61 0.04
#> 4 x1 y1 brm_new c1 + c2 all y1 ~ x1 + c… fixed cond <NA> 0.59 0.04
#> 5 x1 y2 brm_new no covariates all y2 ~ x1 + 1 fixed cond <NA> -0.33 0.04
#> 6 x1 y2 brm_new c1 all y2 ~ x1 + c1 fixed cond <NA> -0.34 0.04
#> # … with 7 more variables: conf.low <dbl>, conf.high <dbl>, fit_algorithm <chr>, fit_pss <dbl>,
#> # fit_nobs <dbl>, fit_sigma <dbl>, fit_full_model <list>
plot(results, choices = c("x", "y", "controls"))
```

Again, the plotted coefficients are the median estimates from the posterior distributions, together with their respective 95% credible intervals.

Because we kept the entire `brms` models in our results data set, we can explore specific models using the `pull` function.

```
# Pull entire models from the list column
models <- results %>%
  as_tibble %>%
  pull(fit_full_model)

# Summarize e.g., the first model
summary(models[[1]])
#> Family: gaussian
#> Links: mu = identity; sigma = identity
#> Formula: y1 ~ x1 + 1
#> Data: data (Number of observations: 1000)
#> Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
#> total post-warmup draws = 2000
#>
#> Population-Level Effects:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept -1.10 0.04 -1.17 -1.02 1.00 1915 1356
#> x1 0.62 0.04 0.55 0.69 1.00 1900 1631
#>
#> Family Specific Parameters:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma 1.14 0.03 1.09 1.18 1.00 2903 1542
#>
#> Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).
```
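
Beyond `summary()`, any method for `brmsfit` objects can be applied to the pulled models, for example:

```
# E.g., extract the fixed effects of the first model
fixef(models[[1]])
```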

From these model fit objects, we can also extract the posterior distributions for the parameters of interest and store them in one data frame.

```
posteriors <- results %>%
  as_tibble %>%
  pull(fit_full_model) %>%
  map(function(x) as_tibble(as_draws_df(x))[, 2] %>%  # 2nd column = the predictor's slope
        gather(key, value)) %>%
  bind_rows(., .id = "id")

# First 10 draws from the first specification for predictor `x1`
head(posteriors, n = 10)
#> # A tibble: 10 × 3
#> id key value
#> <chr> <chr> <dbl>
#> 1 1 b_x1 0.642
#> 2 1 b_x1 0.655
#> 3 1 b_x1 0.647
#> 4 1 b_x1 0.650
#> 5 1 b_x1 0.611
#> 6 1 b_x1 0.736
#> 7 1 b_x1 0.565
#> 8 1 b_x1 0.624
#> 9 1 b_x1 0.602
#> 10 1 b_x1 0.611
```
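
With all draws pooled in one long data frame, we can also summarize each specification's posterior ourselves, for instance with standard `dplyr` verbs (a sketch):

```
# Median and 95% interval per specification
posteriors %>%
  group_by(id, key) %>%
  summarize(median = median(value),
            conf.low = quantile(value, .025),
            conf.high = quantile(value, .975))
```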

Using the `ggridges` package, we can plot the posterior distributions of the different specifications.

```
posteriors %>%
  mutate(id = factor(id, levels = 1:24)) %>%
  ggplot(aes(x = value, y = id, fill = key)) +
  geom_density_ridges() +
  scale_fill_brewer(palette = "Blues") +
  theme_minimal() +
  labs(x = "Posterior",
       y = "Specifications",
       fill = "")
```

Of course, the posterior distributions are not sorted in any way. With a bit of data wrangling, we can create a specification curve consisting of posterior distributions.

```
# First panel: posterior distributions, sorted by the median estimate
p1 <- results %>%
  as_tibble %>%
  mutate(id = as.character(1:nrow(.))) %>%
  arrange(estimate) %>%
  mutate(specifications = factor(as.character(1:nrow(.)),
                                 levels = c(1:24))) %>%
  left_join(posteriors) %>%
  ggplot(aes(x = value, y = specifications)) +
  stat_density_ridges(quantile_lines = TRUE,
                      quantiles = c(0.025, 0.975),
                      alpha = .5,
                      fill = "lightblue",
                      color = "black") +
  coord_flip() +
  theme_minimal() +
  labs(x = "Posterior", y = "") +
  theme(strip.text = element_blank(),
        axis.line = element_line("black", size = .5),
        legend.position = "none",
        panel.spacing = unit(.75, "lines"),
        axis.text = element_text(colour = "black"))

# Second panel: the analytical choices that define each specification
p2 <- results %>%
  as_tibble %>%
  arrange(estimate) %>%
  mutate(specifications = factor(as.character(1:nrow(.)),
                                 levels = c(1:24))) %>%
  gather(key, value, c("x", "y", "controls")) %>%
  mutate(key = factor(key, levels = c("x", "y", "controls"))) %>%
  ggplot() +
  geom_point(aes(x = specifications,
                 y = value),
             shape = 124,
             size = 3.35) +
  theme_minimal() +
  facet_grid(.data$key ~ 1, scales = "free_y", space = "free_y") +
  theme(axis.line = element_line("black", size = .5),
        legend.position = "none",
        panel.spacing = unit(.75, "lines"),
        axis.text = element_text(colour = "black"),
        strip.text.x = element_blank()) +
  labs(x = "", y = "")

# Combine both panels (plot_grid comes from the cowplot package)
plot_grid(p1, p2,
          ncol = 1,
          align = "hv",
          axis = "tblr",
          rel_heights = c(3, 2))
```