__Click [HERE](https://julianquandt.com/rmd_files/2020-06-16-power-analysis-by-data-simulation-in-r-part-iii.Rmd) to download the .Rmd file__ _This blog is also available on [R-Bloggers](https://www.r-bloggers.com/)_

# Power analysis by simulation in `R` for really any design - Part III
This is Part III of my tutorial on how to do power-analysis by simulation.
In Part I, we saw how to do a simulation for a simple toy-example with a coin-toss.
In part II we learned how to simulate univariate and multivariate normal-distributions, specify our assumptions about the expected effect and test our simulated data with different t-tests.
In this part, we will focus on how we can write a flexible test for any model by making use of the fact that many statistical standard methods (e.g. t-test, ANOVA) can be rewritten as a linear model.
This way, no matter what situation we find ourselves in, we will always be able to use the same power-simulation with only slight changes.
In part IV we will learn how to apply this technique to complicated designs such as linear mixed-effects models and generalized mixed-effects models.
## Revisiting the t-test: just another linear model
Before I even start with this topic, I want to point out that I took much of the information I present here from Jonas Kristoffer Lindelov's __fantastic__ blog post ["Common statistical tests are linear models (or: how to teach stats)"](https://lindeloev.github.io/tests-as-linear/).
If you have not read it yet, you should definitely do so, either now or after you finished this part of the tutorial.
Seriously, it is a great blog-post that inspired me to write up this part of the tutorial in a similar way: we want to learn one flexible method of data-simulation rather than changing our approach for each different situation.
Ok so what does it mean to say that, for instance, a t-test is just a linear model?
First of all, here is a quick reminder about the linear model with a quick example:
$$
y_i = \beta_0 + \beta_1x_i
$$
The above formula states that each value of $y$ (the little $i$ can be read as "each") can be described as the sum of a value $\beta_0$ and the product $\beta_1 \times x_i$.
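As a tiny sanity check, here is a minimal sketch (the numbers are arbitrary example values, not part of the tutorial's data) that simulates data from this model and lets `lm` recover the coefficients:

```{r lm_sanity_check}
# simulate data from y_i = beta_0 + beta_1 * x_i (arbitrary example values)
set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100, mean = 0, sd = 0.1) # beta_0 = 2, beta_1 = 0.5, plus a little noise
coef(lm(y ~ 1 + x)) # the estimates should be close to 2 and 0.5
```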
Now, how does this relate to the t-test? Again, if you want a more in-depth discussion read the blog-post above.
To demonstrate the point, let's revisit the example from part II of this tutorial.
We had 2 groups and we assumed that one group can be described as `group1 = normal(1,2)` and the other group can be described as `group2 = normal(0,2)`.
Now if we want to test whether the difference in means is significant, we can run an independent-samples t-test.
However, if we rephrase the problem, what we want to know is whether group membership (whether you belong to group1 or group2; the $x$ variable in this case) significantly _predicts_ the person's score ($y$ in this case).
To do so, we can let $\beta_0$ describe the mean of, say, group1 in this case.
If we do this and re-write the model above to predict the mean of group1, it looks like this:
$$
M_{group1} = M_{group1}+\beta_1\times x
$$
What should $\beta_1$ and $x$ be in this case?
Well, as we already have the mean of group1 on the right side in the term that was previously $\beta_0$, we do not want to add _anything_ to it.
Thus, if someone is a member of group1, the observed value for group-membership should be 0 in order to predict that person's mean-score.
This way, the formula becomes
$$
M_{group1} = M_{group1}+\beta_1\times 0
$$
and the $\beta_1$ term drops out entirely. Note that we of course do not know the mean of group1, as we will _simulate_ it.
$\beta_0$ (or $M_{group1}$ here) is a _parameter_, i.e. something that we want the model to estimate because we do not know it.
What if we want to describe the mean of a person in group2?
We can do this by just adding something to the mean of group1, namely the _difference_ between the two groups.
In other words, we can describe the mean of group2 by saying it's the mean of group1 + the difference of their means times $x$:
$$
M_{group2} = M_{group1}+(M_{group2}-M_{group1})\times x
$$
What should $\beta_1$ and $x$ be now?
Just like $\beta_0$, our new mean-difference parameter $\beta_1$ is an unknown that depends on our simulated means and will be estimated by the model.
$x$ on the other hand should be 1, because if someone is from group2 we want the predicted mean of that person to be the mean of group1 + __1 times__ the difference between group1 and group2.
Therefore if someone is from group2 our model becomes:
$$
M_{group2} = M_{group1}+(M_{group2}-M_{group1})\times 1
$$
What we just did is so-called _dummy coding_, i.e. we made a new dummy variable that tells us someone's group membership:
- 0 if the person is in group1
- 1 if the person is in group2
If we use this dummy variable in the linear model we can do exactly the same test that we did with the t-test earlier.
Let me demonstrate with the exact same groups that we simulated in the beginning of part II of this tutorial.
```{r sim_groups}
set.seed(1234)
group1 <- rnorm(30, 1, 2)
group2 <- rnorm(30, 0, 2)
```
Now we can run the independent-samples t-test again, yielding the same results as in part II.
```{r t-test-again}
t.test(group1, group2, paired = FALSE, var.equal = TRUE, conf.level = 0.9)
```
To do our linear model analysis instead, we first have to create a data-frame that contains the 2 groups and tells us about their group membership.
In other words, we will make a data-set with a variable `score`, which contains the values in the vectors group1 and group2, and a variable `group`, which tells us whether a score comes from group1 or group2.
After putting the data together, we create the dummy variable with `ifelse`, a function that you can read as: if the `group` variable has the value "group1" in a certain row, then set `dummy_group` in that row to 0, else set it to 1.
```{r make_groupdat}
names(group1) <- rep("group1", length(group1)) # name the vectors to use the names as the group variable
names(group2) <- rep("group2", length(group2))
lm_groupdat <- data.frame(score = c(group1,group2), group = c(names(group1), names(group2))) # make data-set from scores and names
lm_groupdat$dummy_group <- ifelse(lm_groupdat$group == "group1", 0, 1) # create dummy variable
```
Now we can use this to run our linear model with the `lm` function.
In case you have not used it before: we indicate the _intercept_, i.e. $\beta_0$, with a 1 in the formula to tell the model that we want 1 times $\beta_0$ in there, and we add `dummy_group`, which gets multiplied by the difference between the groups, $\beta_1$.
The code looks like this:
```{r lm_groupdat}
summary(lm(score ~ 1 + dummy_group, data = lm_groupdat)) # use summary function to get p-values
```
First let us see what the estimates mean in this case.
The intercept has an estimate of .4072.
As we said earlier, this, $\beta_0$, should be the group-mean of group1, and indeed we see that `mean(group1)` = `r mean(group1)` gives us the same result.
The second estimate, $\beta_1$ for `dummy_group`, should yield the difference between the means.
We can confirm that `mean(group2)-mean(group1)` = `r mean(group2)-mean(group1)` gives us the same result in this case.
Further we see that the DFs, t-value and p-value are identical to the earlier t-test.
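If you want to convince yourself that this is not a coincidence, here is a small self-contained sketch that rebuilds the data from above and compares the two tests numerically. The t-values match up to their sign, because the dummy variable codes group2 - group1 while `t.test` computes group1 - group2:

```{r lm_vs_ttest}
# rebuild the two groups and the dummy-coded data-set from above
set.seed(1234)
group1 <- rnorm(30, 1, 2)
group2 <- rnorm(30, 0, 2)
lm_groupdat <- data.frame(score = c(group1, group2),
                          dummy_group = rep(c(0, 1), each = 30))
t_res  <- t.test(group1, group2, paired = FALSE, var.equal = TRUE)
lm_res <- summary(lm(score ~ 1 + dummy_group, data = lm_groupdat))
# same t-value (up to sign) and same p-value
all.equal(abs(unname(t_res$statistic)), abs(lm_res$coefficients["dummy_group", "t value"]))
all.equal(t_res$p.value, lm_res$coefficients["dummy_group", "Pr(>|t|)"])
```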
This method would also work with the one-sample t-test approach and the paired-samples t-test approach from part II of the tutorial, but I will leave it as an exercise to the reader to try it out, as I do want to move on to other situations.
Again (cannot mention this enough) make sure you head to Jonas Kristoffer Lindelov's [blog post](https://lindeloev.github.io/tests-as-linear/) if you want to see how to do this or if you want to read more on this topic.
## Simulating a 2x2 between-subject ANOVA (as a linear model)
### How to specify raw effect-sizes with no prior data
In this part, I will give an example of how we could specify prior effect-sizes for our research design given we have not collected any prior data and are only working with assumptions.
Ok time to move on to more sophisticated designs.
Imagine you are interested in students' attitudes towards smoking and how they depend on the medium and the focus of an anti-smoking message.
More precisely, we want to know whether people's attitude is different after seeing a visual anti-smoking message (the pictures on cigarette packages) vs. a text message (the text belonging to such a picture).
Moreover, we are interested in whether the attitude that people report is different after seeing a message that regards the consequences on other people (e.g. smoking can harm your loved ones) as compared to yourself (smoking can cause cancer).
Thus we have the following design
_Disclaimer: This is just an example that I made up, so it does not reflect any real-life effect, and one might easily argue the other way around, as this RQ is not based on any underlying theory; but it allows me to demonstrate the point without further elaborating on any psychological theory._
DV: attitude towards smoking (0-100)
IV1: medium (text vs. visual)
IV2: focus (internal vs. external)
That is, there are 4 groups:
- group_TI will receive text-messages that are internal
- group_TE will receive text-messages that are external
- group_VI will receive visual messages that are internal
- group_VE will receive visual messages that are external
Lets assume that we expect that people's attitude will be more negative after seeing a visual rather than text message if the focus is internal (i.e. the message is about yourself) because it might be difficult to imagine that oneself would get cancer after reading a text but seeing a picture might cause fear regardless.
For the external focus on the other hand, we expect a more negative attitude after reading a text as compared to seeing a picture, as it might have more impact on attitude to imagine a loved one get hurt than seeing a stranger in a picture suffering from the consequences of second-hand smoking.
Furthermore, we expect that the internal focus messages will be related to lower attitudes compared to the external focus messages on average but we expect no main-effect of picture vs. text-messages.
Thus, we expect:
- no main-effect of medium
- main-effect of focus
- crossover-interaction of medium x focus
At this point, in my opinion, it is more difficult in this case to work with standardized effect-sizes (e.g. partial correlations) than to just specify everything on the response scale directly.
A first good step is to visualize some rough means that show the desired behavior that we described in words earlier and see where we are going, starting with the main-effects.
We expect a main effect of focus so the means of the 2 internal groups should be lower (more negative attitude) than the means of the external groups.
For example, we could make the overall mean of the internal focus groups (group_TI and group_VI) 20 and the mean of the external groups (group_TE and group_VE) 50.
Note how this would already reflect the main-effect but also a belief that the smoking-attitudes are on average quite negative as we assume both means to be on the low end of the scale.
However, for now we just want to make a rough guess and we might adjust our expectations in a second.
Ok so now what we have to do is specify the means for the media groups in a way that reflects our interaction hypothesis.
We could for example assume that the mean of group_TI is 30 while the mean of group_VI is 10 and we could assume that the mean of group_TE is 40 and the mean of group_VE is 60.
This way, for the internal groups, visual messages are more negative than texts, while for the external groups texts are more negative than visual messages.
We could plot this to see what it looks like:
```{r plot_exp1}
focus <- rep(c("internal", "external"), each = 2)
media <- rep(c("text", "visual"), times = 2)
mean_TI <- 30
mean_VI <- 10
mean_TE <- 40
mean_VE <- 60
pd <- data.frame(score = c(mean_TI, mean_VI, mean_TE, mean_VE), focus = focus, media = media)
interaction.plot(pd$focus, pd$media, pd$score, ylim = c(0,100))
mean(pd$score[pd$focus == "internal"]) != mean(pd$score[pd$focus == "external"])
mean(pd$score[pd$media == "text"]) == mean(pd$score[pd$media == "visual"])
```
We can see that the above situation satisfies everything we wanted to specify:
- The means of the two focus conditions are not the same: internal values are lower than external values if we pool over the media conditions.
- The means of the two media conditions are the same, so there is no main-effect of medium.
- The lines cross over showing the interaction that we wanted.
However, three problems remain here.
First of all, the simple slope for the visual messages is steeper than the simple slope for the text messages, which is not something we specified in our hypotheses beforehand.
It rather comes as a side-effect of the fact that we want our internal means to be lower than the external means, while we also want the visual messages to be more impactful in the internal condition and the text messages to be more impactful in the external condition.
Do we find this plausible or do we think this would rather be the other way around?
", "This a good point for a short reflection: You might already be overwhelmed how complicated power-analysis by simulation already is with these still relatively simple designs. However, notice that e.g. the follow-up tests for the slopes are things that we _can_ care about but we do not _need_ to care about them. Our interaction-hypothesis is mainly interested in whether the difference in scores between the visual and the text messages when comparing the internal focus group is the same as the difference for the external groups. As this is clearly not the case in this example, we are good to go and test the interaction hypothesis. However, if we clearly are interested in the simple slopes and have hypotheses about them, we _can_ put it in our simulation, and that of course urges us to think some more but it is a _great_ tool to see whether we actually thought about everything, as this might be a point that would have slipped our attention entirely when running a classical power-analysis and pre-registering the overall interaction only. Now, we can think about it if we _want_ to or continue by saying that for now we only want to specify the interaction.

You might never have heard of a Poisson distribution, so to avoid confusion here: we could just use a uniform distribution between 1 and 36, for example, so that smoking 30 cigarettes a day is just as likely as smoking 1 cigarette, given that a person is a smoker. However, I do not think that this is the case; rather, the likelihood of a smoker smoking 30 cigarettes should be smaller than that of smoking only 5 cigarettes. The Poisson distribution mainly carries this assumption forward into the simulation. As you continue simulating stuff in the future, you will probably get to know more of these useful distributions that neatly describe certain assumptions we have about the data.
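If you want to see the difference between these two assumptions, here is a quick sketch; the Poisson rate of 8 cigarettes a day is just an arbitrary value for illustration:

```{r poisson_vs_uniform}
set.seed(1)
n_smokers <- 10000
cigs_uniform <- sample(1:36, n_smokers, replace = TRUE) # every count equally likely
cigs_poisson <- rpois(n_smokers, lambda = 8)            # lambda = 8 is an arbitrary assumption
# under the uniform assumption, smoking 30 cigarettes is about as common as smoking 5;
# under the Poisson assumption, 30 cigarettes is far rarer than 5
c(uniform_30 = mean(cigs_uniform == 30), uniform_5 = mean(cigs_uniform == 5),
  poisson_30 = mean(cigs_poisson == 30), poisson_5 = mean(cigs_poisson == 5))
```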