The syntax of spdesign
is made to be intuitive and easy,
while at the same time very flexible. To achieve this, we use a
combination of regular expressions and R’s powerful expression parser
“under the hood” to translate the user-specified utility functions into
objects that can be manipulated by the package.
It is impossible to write checks for all possible edge cases of inputs. Therefore, you are strongly encouraged to follow the recommendations in this manual carefully when defining your utility functions.
In spdesign
, all parameters must have a prior.
This is a deliberate design choice to make it clear that using a 0 prior
is also an assumption in our design. The name of the prior is followed
by a square bracket, which contains the assumed value of the prior. All
priors have to start with b_
. This is to separate them from
attributes and to easily allow for non-linear utility functions.
If no prior is specified for a given parameter, an error is returned. If the parameter is generic, i.e. it enters multiple utility expressions, then you only need to specify the prior once. It is strongly recommended that priors for generic parameters are specified the first time they are used (this is a good tip for attributes as well!).
For example, to specify a parameter b_x1
with a prior of
0
, we would write:
b_x1[0]
We can specify a Bayesian prior by specifying the distribution of the
prior. For example, uniform_p(-1, 1)
means that we allow
the prior to follow a uniform distribution between -1 and 1.
"b_x1[uniform_p(-1, 1)]"
We can use the following distributions:
normal_p(mean, sd)
lognormal_p(mean, sd)
triangular_p(location, spread)
uniform_p(min, max)
All distributions are specified with a mean (location) or standard deviation (spread). In the case of the log-normal distribution we are specifying the mean and standard deviation of the underlying normal distribution.
The syntax for specifying attributes and levels is similar to how we
specify parameters and priors. The attribute name is specified to the
left and levels are specified inside []
. Below, the
attribute x_1
can take on the levels 1, 2, 3, 4 and 5.
"x1[1:5]"
When we specify the levels, we can make use of R’s built in functions for generating sequences and vectors. For example, the following three specifications are identical:
# Specification 1
"x1[1:5]"
# Specification 2
"x1[seq(1, 5, 1)]"
# Specification 3
"x1[c(1, 2, 3, 4, 5)]"
However,
"x1[1, 2, 3, 4, 5]"
is not a valid specification and will throw an error. The error will most likely be of the form:
Error in parse(text = x) : <text>:1:2: unexpected ','
1: 1,
^
While we have not written a new intuitive error message for this misspecification, it will be caught by R’s parser and we feel that this is sufficient. As a compromise, we state the most likely error here to help users catch why their specifications fail.
Important: Make sure that all attribute levels are numeric.
Even in the cases where you use the _dummy
coding syntax it
is always safer to use numeric values for the
levels.
We combine the parameter and attribute as in the example below to
start building our utility expression. We recommend that people use the
naming convention b_x1
being the parameter for the
attribute x1
. It has no practical implication for how the
code works or interprets the utility function, but it makes it much
easier for the user to see which parameter goes with which attribute. It
is also important that the parameter b_x1
precedes the
attribute x1
. There is currently no checks written to
ensure this, but failing to do so may cause unintended errors in certain
instances. For example, if you are using the dummy-coding shorthand
detailed below.
Below is an example:
"b_x1[0] * x1[1:5]"
It is strongly recommended to use identical names for priors and attributes, with the exception of “b_” for priors. This makes reading of the code easier.
The utility functions take the form of named lists. Let’s say that we would like to create a design comprising two alternatives with three attributes each and a status quo. We want our first attribute to take on the five levels 1, 2, 3, 4, and 5; our second attribute to take on the levels 0 and 1; and our third attribute to take on the values betwween 0 and 1 in .25 increments. One way to specify this would be the following:
utility <- list(
alt1 = "b_x1[0.1] * x1[1:5] + b_x2[0.4] * x2[c(0, 1)] + b_x3[-0.2] * x3[seq(0, 1, 0.25)]",
alt2 = "b_x1 * x1 + b_x2 * x2 + b_x3 * x3",
alt3 = "b_sq[0.15] * sq[1]"
)
Where the status quo or opt out alternative has a single constant. It is important that when you add constants the level is 1. You cannot currently specify specific levels of the attributes for the SQ. Since these do not vary, this is equivalent to a constant (which can be calculated beforehand, see more details below) because only differences in utility matter for the model.
To generate a design, we need to pass the utility function along with
other options to the generate_design
function. Currently,
we can only generate designs for the MNL model. To see the full set of
options for the function: ?generate_design
.
design <- generate_design(utility, rows = 20,
model = "mnl", efficiency_criteria = "d-error",
algorithm = "rsc", draws = "scrambled-sobol",
control = list(
max_iter = 21000
))
The list control
contains options that can be passed
along to the function to change default values. The default values and
options are:
default_control <- list(
cores = 1,
max_iter = 10000,
max_relabel = 10000,
max_swap = 10000,
efficiency_threshold = 0.1,
sample_with_replacement = FALSE
)
No default algorithm for generating a design is implemented because
they have different properties with respect to how easy it is to achieve
attribute level balance (see more details below) and how easy it is to
include restrictions/exclusions in the design (see more details below).
Three algorithms are implemented: rsc
,
federov
, and random
.
The rsc
algorithm creates an initial design candidate
based on the provided attributes, levels and number of rows. This design
candidate is attribute level balanced by default (if the number of
attribute levels for all attributes are a multiple of the number of
rows) or near attribute level balanced. It will not deviate from this.
It finds new candidates by relabling or swapping attribute levels. The
default is to cycle between relabling and swapping each 10000
candidates. The cycling part of the algorithm is not included because it
is rarely used in practice. You cannot include restrictions or
exclusions in your design when using the rsc
algorithm. For
details: ?rsc
The federov
and random
algorithms will
systematically or randomly select potential design candidates from the
candidate set. The candidate set can be either supplied or assumed to be
the full factorial. These algorithms completely lets go of attribute
level balance and you do run the risk of some attribute levels not
showing up in the final design. We show syntax below to include level
occurrence restrictions below. You can easily include restrictions in
your design using these to algorithms by excluding attribute level
combinations from the candidate set. That way, they will never be part
of the design.
Note that in general, both restrictions and attribute level balance may lead to less efficient designs and that if your restrictions are too tight and you are trying to force attribute level balance, no designs may be found. Few checks are made to notify the user of this.
Blocking the design is done after the design has been generate and
will add a blocking column to the design. The following will block the
design into 4 trying to minimize the average squared correlation between
the blocking column and each of the design columns. More information:
?block
design <- block(design, 4)
As stated above, the rsc
algorithm wil ensure attribute
level balance, whereas the federov
and random
algorithms will not. However, there is some flexibility in how this is
specified. The following will only work for the federov
and
random
algorithms and will be ignored without warning if
the rsc
algorithm is used.
The implicit restriction in the package is that each attribute level
can occur between 0 and rows
number of times. For example,
if you have a design with 20 rows, then the attribute level can occur
between 0 and 20 times. In practice, anywhere between never and always.
The package does provide the user with the ability to specify a narrower
range.
To do this, we would specify a parenthesis behind the attribute levels where we give the range that the attribute levels can take. Assume that our design has 18 rows. The following specification would force attribute level balance by saying that each level has to occur exactly 6 times.
"b_x1[0] * x1[c(0, 1, 2)](6)"
As stated, it is not always possible to get attribute level balance
and indeed it can be quite hard with the federov
and
random
algorithms. Specifying a range may be better. For
example, we could specify that all attribute levels should occur between
5 and 7 times.
"b_x1[0] * x1[c(0, 1, 2)](5:7)"
In very special cases, it may be desirable to let some attribute levels occur more frequently than others and we could specify:
"b_x1[0] * x1[c(0, 1, 2)](3:5, 5:7, 6:9)"
which would mean that 0 occurs between 3 and 5 times, 1 between 5 and 7, and 2 between 6 and 9 times. Indeed, in the case where we only specify a single range, it is expanded to be equal to the number of attribute levels such that:
"b_x1[0] * x1[c(0, 1, 2)](5:7)"
is equivalent to
"b_x1[0] * x1[c(0, 1, 2)](5:7, 5:7, 5:7)"
A few important things to note:
rows
and the sum of maximum level occurrences must be
larger than rows
. Otherwise, no design can be found. No
checks are implemented for this.If you believe that there are interaction effects between attributes in your model, then these will have to be considered at the design stage. Any design that is not a full factorial assumes that some higher order interaction effects are insignificant. To ensure that you are able to identify the effect of an interaction in your data, this will have to be included in the design. To include an interaction term in the design, we can use the following syntax.
"b_x1[0] * x1[c(0, 1)] + b_x2 * x2[c(0, 1)] + b_x1x2 * I(x1 * x2)"
Two things to note about the specification of the utility function
with interactions. First, the levels of the attributes are specified
when the attribute is first introduced and not in the interaction term
itself, and two, for the parser to correctly interpret an interaction
term, it needs to be wrapped in R’s internal function for handling
interactions I()
.
Often we have attributes in mind that are categorical in nature,
e.g., comfort or environmental quality. In the design, it is often
desirable to let these be dummy-coded. We can easily specify an
attribute to dummy-coded using the _dummy
syntax. Assume
that we have a single categorical attribute that can take on three
levels. We would then specify the utility functions as follows:
V = list(
alt1 = "b_x1_dummy[c(0.2, 0.5)] * x1[c(1, 2, 3)]",
alt2 = "b_x1_dummy * x1"
)
Here we have specified 2 priors and 3 levels. The dummy coding expands the attribute levels by transforming the attribute into a factor and expanding it using a formula and the model.matrix. The first level is chosen as the baseline and dropped. Similarly, the priors are expanded into separate priors for each level. This also works with Bayesian priors.
It is recommended to use 1, 2, 3, 4 for the levels of a dummy-coded attribute. Note that the levels don’t matter, but this naming convention secures that there is a correspondence between the names for the priors and attributes.
A few things to note about using the _dummy
syntax
_dummy
as an extension for your
attribute name._dummy
extension. Placing anything between
_dummmy
and [
will cause the code to
fail.When using the federov
or random
algorithms, we can pass along a list of exclusions to
generate_design
. The exclusions will exclude the specified
combinations from the full factorial prior to generating design
candidates. The exclusions take the form of an unamed list of strings.
You must ensure that the names of the list corresponds to the names of
the expanded attribute levels (see
?expand_attribute_levels
).
For example:
exclusions = list(
"alt1_x1 == 2 & alt1_x2 == 0 & alt1_x3 == 0",
"alt2_x2 == 1 & alt2_x3 == 1"
)
You cannot currently specify fixed SQ levels. Specifying new attributes that can only take on a single level for the SQ leads to a computationally singular Hessian matrix (because there is no variation). To get around this, you can calculate what the size of the SQ constant would be given the defined attribute levels for the SQ and the assumed prior and include this as a single constant for the SQ alternative. This will ensure that the correct utility differences are used in the design.
A fix with additional level restrictions will be included in a future version of the package.
Remember, that when you specify fixed levels for the SQ, you cannot also specify alternative specific constants for the non-SQ alternatives. This is a violation of the J-1 ‘rule’ and will result in a singular Fisher information matrix.
There is no point specifying SQ levels that only take on the value of 0. These have no bearing on the design and may be problematic when inverting the Fisher information matrix.
Related to the above point, in the case where your attribute is dummy-coded the easiest is to define it such that the base level of your dummy-coded attribute corresponds to the level of the SQ. Remember, that in your design, these only take on the values of 1 and 0 with the base level being zero for all non-base-level levels. Defining the base level to be equal to the SQ obviates the need to explicitly specify the SQ beyond a constant.
Creating large designs can be beneficial, however, if you have many attributes and levels, the size of the full factorial can be too large to hold in memory leading to an error of the type: “Error: vector memory exhausted (limit reached)”. There are two possible solutions to fix this:
rsc
algorithm instead, which will
obviate the need to have the full factorial present.You can supply a candidate set to use with the modified federov and
random design algorithms. There are 2 important things with respect to
the candidate set that spdesign
will check for:
V = list(
alt1 = "b_x1_dummy[c(0.2, 0.5)] * x1[c(1, 2, 3)]",
alt2 = "b_x1_dummy * x1"
)
Then the column names of the candidate set would be:
alt1_x1
and alt2_x1
.
The candidate set is passed in as a data.frame
using the
candidate_set
argument to the function
generate_design
.