Sensible choice of parameters to simulate a continuous outcome

Let’s say you want to run a simulation study to explore Method C (C for cool, because that is the best descriptor of your method). Your method assumes a continuous outcome $Y$ and a linear relationship with a covariate vector $X$ through some coefficients $\beta$. Yes, this is the linear regression setup. You open your favorite coding program and start thinking. Ok, so you start by generating a matrix of covariates $X$; these covariates have a correlation matrix $\Sigma$. You think to yourself, I’m doing great, almost done. Once you pick a coefficient vector $\beta$, the mean of $Y$ is perfectly specified as $X\beta$, so $\beta$ is the only thing left to choose. Is it?

Well, not really. You also have to decide on a value for the variance of the measurement error of $Y$; we will call it $\sigma^2$. All is not lost, though. Just pick a number, and that is it. It is simple: choose a vector for $\beta$, select a number for $\sigma^2$, and you are done.
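Here is a minimal sketch of that naive setup in Python with NumPy; the sample size, number of covariates, correlation structure, $\beta$, and $\sigma^2$ below are arbitrary placeholders for illustration, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary placeholder choices for this sketch
n, p = 500, 4                            # observations, covariates
rho = 0.3                                # pairwise correlation between covariates
beta = np.array([1.0, -0.5, 0.25, 0.0])  # "just pick" some coefficients
sigma2 = 2.0                             # "just pick" a measurement-error variance

# Correlation matrix Sigma (exchangeable structure) and covariates X
Sigma = np.full((p, p), rho)
np.fill_diagonal(Sigma, 1.0)
X = rng.multivariate_normal(mean=np.zeros(p), cov=Sigma, size=n)

# Continuous outcome: linear mean plus measurement error
Y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
```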

But wait a minute, that looks pretty arbitrary. How can you be sure you’ve chosen sensible values for these parameters?

Here is an idea. What if you fix the proportion of variance explained by your covariates (called $R^2$) and then derive everything else from it? Here are two reasons why this is a good idea: the first is that $R^2$ is much more interpretable than $\sigma^2$, and the second is that it is very interpretable overall. I know, that is not two reasons but only one, but it is so good that it counts as two. You can choose $R^2$ according to the situations in which you expect your method to work.

Ok, $R^2$ is a proportion, so its value must be between 0 and 1. You think about all the situations that motivated your method. All those happy memories had $R^2$ between 0.2 and 0.3. You pick 0.25, and you are good to go. What next?

Remember the definition: $R^2$ is the proportion of the total variance of $Y$ explained by the covariates in your model. Your model is $Y=X\beta+\text{measurement error}$, and the covariate part is just $X\beta$, which gives us the following equation.

\[R^2=\dfrac{var(X\beta)}{var(X\beta+\text{measurement error})}\]

From here, you can get an expression for $\sigma^2$ as a function of $\beta$, $\Sigma$, and $R^2$ by following the algebra.
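The algebra is short. The measurement error is independent of $X$ and has variance $\sigma^2$, and $var(X\beta)=\beta^T\Sigma\beta$ (treating $\Sigma$ as the covariance matrix of $X$, which it is when the covariates have unit variance), so

\[R^2=\dfrac{\beta^T\Sigma\beta}{\beta^T\Sigma\beta+\sigma^2}\]

Solving for $\sigma^2$, you finally get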

\[\sigma^2=\frac{\beta^T\Sigma\beta(1-R^2)}{R^2}\]

Using this equation, you now have a sensible way to begin your simulation study, since fixing $R^2$ can be much less arbitrary than fixing $\sigma^2$.
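To make the recipe concrete, here is a minimal end-to-end sketch in Python; the sample size, correlation structure, $\beta$, and target $R^2$ are again arbitrary placeholders. It computes $\sigma^2$ from the formula above, simulates the data, and checks that the empirical $R^2$ from an ordinary least-squares fit lands near the target.

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder simulation settings
n, p = 100_000, 4
rho = 0.3
beta = np.array([1.0, -0.5, 0.25, 0.0])
R2_target = 0.25

# Correlation matrix and unit-variance covariates, so Sigma is also
# their covariance matrix
Sigma = np.full((p, p), rho)
np.fill_diagonal(Sigma, 1.0)
X = rng.multivariate_normal(mean=np.zeros(p), cov=Sigma, size=n)

# Measurement-error variance implied by the target R^2
var_xb = beta @ Sigma @ beta
sigma2 = var_xb * (1 - R2_target) / R2_target

# Simulate the outcome
Y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

# Empirical check: R^2 from an ordinary least-squares fit
X1 = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)
resid = Y - X1 @ coef
R2_hat = 1 - resid.var() / Y.var()
print(f"target R^2 = {R2_target:.3f}, empirical R^2 = {R2_hat:.3f}")
```

The large $n$ here is only to make the empirical check tight; for an actual method comparison you would of course use whatever sample sizes your scenarios call for.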

Thanks for reading. Let me know if you have questions or corrections. Especially corrections.