`specr.rmd`

The following example is a more comprehensive version of the example on the homepage and exemplifies how to use the major functions of this package.

In order to understand what type of analytical choices exists, you need to understand your data set. In a first step, you should hence investigate your data closely.

```
# Load library
library(specr)
# We have a look at the simulated data set that is included in the package
head(example_data)
#> x1 x2 c1 c2 y1 y2 group1
#> 1 1.533913 1.3697122 0.5424902 3.8924435 23.500543 10.4269278 0
#> 2 1.680639 1.5163745 -1.2415868 3.3377268 17.017955 0.5733467 1
#> 3 1.223941 -0.2381044 0.1405891 0.8911959 -3.678272 4.2303190 0
#> 4 1.765276 0.9524049 4.0397943 1.8567454 21.668684 14.8865252 1
#> 5 1.907134 0.6282816 3.1002518 5.5840574 32.713106 20.5251920 0
#> 6 1.710695 1.1898467 0.4648824 4.0239483 20.422171 4.3471236 1
#> group2
#> 1 A
#> 2 C
#> 3 B
#> 4 B
#> 5 A
#> 6 C
# Summary of the data set
summary(example_data)
#> x1 x2 c1 c2
#> Min. :-1.185 Min. :-1.6037 Min. :-4.093 Min. :-3.780
#> 1st Qu.: 1.823 1st Qu.: 0.5248 1st Qu.: 1.242 1st Qu.: 1.282
#> Median : 2.479 Median : 1.2740 Median : 2.513 Median : 2.775
#> Mean : 2.455 Mean : 1.2999 Mean : 2.499 Mean : 2.690
#> 3rd Qu.: 3.192 3rd Qu.: 2.1049 3rd Qu.: 3.897 3rd Qu.: 4.064
#> Max. : 5.576 Max. : 4.5741 Max. : 7.994 Max. : 9.001
#> y1 y2 group1 group2
#> Min. :-22.93 Min. :-22.148 Min. :0.0 A:144
#> 1st Qu.: 12.50 1st Qu.: 3.944 1st Qu.:0.0 B:162
#> Median : 21.69 Median : 11.753 Median :0.5 C:194
#> Mean : 23.89 Mean : 11.200 Mean :0.5
#> 3rd Qu.: 34.04 3rd Qu.: 17.966 3rd Qu.:1.0
#> Max. : 94.13 Max. : 46.341 Max. :1.0
```

There are several numeric variables. For this use case, we assume that `x`

represents independent variables, `y`

represents dependent variables, `c`

represents control variables, and `group`

denotes potential grouping variables that can be used for subsetting the data.

The next steps involves identifying possible analytical choices. This step involves an in-depth understanding of the research question and the model(s) that will be specified. In this case, we assume simply that `x`

should be positively correlated with `y`

. We can use the additional function `setup_specs()`

to check how different analytical decisions create varying factorial designs.

```
setup_specs(y = c("y1"), # We choose only one dependent variale
x = c("x1", "x2"), # We are not sure which independent variable is better
model = c("lm"), # We only estimate one type of model (linear model)
controls = c("c1", "c2")) # We include two control variable
#> # A tibble: 8 x 4
#> x y model controls
#> <chr> <chr> <chr> <chr>
#> 1 x1 y1 lm c1 + c2
#> 2 x2 y1 lm c1 + c2
#> 3 x1 y1 lm c1
#> 4 x2 y1 lm c1
#> 5 x1 y1 lm c2
#> 6 x2 y1 lm c2
#> 7 x1 y1 lm no covariates
#> 8 x2 y1 lm no covariates
```

The resulting data frame creates eight different specifications. `setup_specs()`

can be used to check and understand how different analytical choices create specific combinations. Yet, this step is not mandatory. It is used within the next function that actually runs the specification analysis.

The main function of the package is `run_specs()`

. Similar to `setup_specs`

we need to include our analytical choices as arguments. Additionally however, we also need to provide the data and we can also specify subsets that should be evaluated.

One type of analytical choice that could additionally affect the results refers to the type of model that is estimated. The function runs traditional linear regression models by default (i.e. when `model = "lm"`

is provided as argument). However, customized model functions can be passed to the function. The only requirement is that the customized functions requires a (regression) formula and the data.

```
# Self-made function
lm_gauss <- function(formula, data) {
glm(formula = formula, data = data, family = gaussian(link = "identity"))
}
# Run specification curve analysis
results <- run_specs(df = example_data,
y = c("y1", "y2"),
x = c("x1", "x2"),
model = c("lm", "lm_gauss"),
controls = c("c1", "c2"),
subsets = list(group1 = unique(example_data$group1),
group2 = unique(example_data$group2)))
# Check
results
#> # A tibble: 384 x 12
#> x y model controls estimate std.error statistic p.value
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 x1 y1 lm c1 + c2 4.95 0.525 9.43 3.11e-18
#> 2 x2 y1 lm c1 + c2 6.83 0.321 21.3 1.20e-57
#> 3 x1 y2 lm c1 + c2 -0.227 0.373 -0.607 5.44e- 1
#> 4 x2 y2 lm c1 + c2 0.985 0.324 3.04 2.62e- 3
#> 5 x1 y1 lm_g… c1 + c2 4.95 0.525 9.43 3.11e-18
#> 6 x2 y1 lm_g… c1 + c2 6.83 0.321 21.3 1.20e-57
#> 7 x1 y2 lm_g… c1 + c2 -0.227 0.373 -0.607 5.44e- 1
#> 8 x2 y2 lm_g… c1 + c2 0.985 0.324 3.04 2.62e- 3
#> 9 x1 y1 lm c1 5.53 0.794 6.97 2.95e-11
#> 10 x2 y1 lm c1 8.07 0.557 14.5 6.90e-35
#> # … with 374 more rows, and 4 more variables: conf.low <dbl>,
#> # conf.high <dbl>, obs <int>, subsets <chr>
```

The result data frame includes relevant statistics for each of the estiamted models.

We can plot a simple decision tree to understand how our analytical choices lead to a large number of specifications.

`plot_decisiontree(results, legend = TRUE)`

The package includes a simple function `summarise_specs()`

that allows to get a first summary of the results.

```
# basic summary of the entire specification curve
summarise_specs(results)
#> # A tibble: 1 x 7
#> median mad min max q25 q75 obs
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3.59 4.56 -2.05 9.58 1.03 7.63 123
# summary by specific groups and statistics
summarise_specs(results,
stats = lst(median, min, max),
group = c("x", "y"))
#> # A tibble: 4 x 6
#> # Groups: x [2]
#> x y median min max obs
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 x1 y1 6.52 3.49 9.28 123
#> 2 x1 y2 0.498 -2.05 3.67 123
#> 3 x2 y1 7.80 5.89 9.58 123
#> 4 x2 y2 1.29 -0.258 2.91 123
# summary of another statistic
summarise_specs(results,
var = "p.value",
group = "subsets") %>%
mutate_if(is.numeric, round, 3)
#> # A tibble: 12 x 8
#> subsets median mad min max q25 q75 obs
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 all 0 0 0 0.838 0 0.063 500
#> 2 group1 = 0 0 0 0 0.994 0 0.259 250
#> 3 group1 = 0 & group2 = A 0.001 0.002 0 0.577 0 0.005 72
#> 4 group1 = 0 & group2 = B 0.001 0.001 0 0.93 0 0.123 76
#> 5 group1 = 0 & group2 = C 0.001 0.002 0 0.572 0 0.067 102
#> 6 group1 = 1 0 0 0 0.749 0 0.074 250
#> 7 group1 = 1 & group2 = A 0 0 0 0.04 0 0.002 72
#> 8 group1 = 1 & group2 = B 0 0 0 0.277 0 0.077 86
#> 9 group1 = 1 & group2 = C 0 0 0 0.862 0 0.178 92
#> 10 group2 = A 0 0 0 0.024 0 0 144
#> 11 group2 = B 0 0 0 0.231 0 0.008 162
#> 12 group2 = C 0 0 0 0.487 0 0.035 194
```

The output contains summary statistics such as the median, the median absolute deviation, … as well as the number of observations that were used for each model. Bear in mind that due to subsetting or missing data, sample sizes can vary considerably which, in turn, affects the results (e.g., the p-value).

However, in order to grasp how the different analytical choices affect the outcome of interest (in this case, the estimate refers to the unstandarized regression coefficient *b*), it is reasonable to plot a specification curve. The function `plot_specs()`

to produces the typical visualization of the specification curve and how the analytical choices affected the obtained results.

```
# Plot specification curve analysis
plot_specs(results)
```

Sometimes, it can be useful to check the influence of specific choices on the estimate of interest more precisely. We can use the function `plot_summary()`

to produce respective boxplots.

`plot_summary(results)`

Here, we can see clearly that the dependent variable (`y`

) produces the largest differences in the obtained estimates.

Finally, we can estimate how much variance in the specification curve is related to which analytical decisions. Therefore, we have to estimate a basic multilevel model without predictors and the analytical decisions as random effects (interactions could be included too). We then use the function `icc_specs()`

to calculate a respective table or `plot_variance()`

to visualize the distribution.

```
# Estimate multilevel model
library(lme4)
model <- lmer(estimate ~ 1 + (1|x) + (1|y) + (1|controls) + (1|subsets), data = results)
# Get intra-class correlation
icc_specs(model) %>%
mutate_if(is.numeric, round, 2)
#> grp vcov icc percent
#> 1 subsets 0.79 0.04 3.62
#> 2 controls 0.05 0.00 0.23
#> 3 y 19.74 0.90 89.83
#> 4 x 0.43 0.02 1.95
#> 5 Residual 0.96 0.04 4.38
# Plot decomposition
plot_variance(model)
```