By Robert Mobley

If you’re involved in STEM classes or research, you’ve likely come across the term “sample size.” Understanding sample size is crucial, whether you’re an undergraduate planning an independent research project, a graduate student preparing a lab research plan, or someone just trying to find out whether the research paper you read has valid results. This stats primer will help you grasp one of the most important aspects of conducting and analyzing research.

Photo by CHUTTERSNAP on Unsplash

Unleashing the power of simulation: determining sample size for your experiment

The day has finally come to perform your experiment. You have an interesting new idea to test, and you know what you’re comparing and how you’re going to measure it. The only question left is: how many times are you going to run your experiment?

Choosing the right sample size determines your study’s power, i.e., how likely you are to detect a real effect if one exists. For example, should you believe an antibiotic will help you if it’s been shown to cure 3 people, or 3 million? The answer is: it depends. Generally, larger sample sizes increase a study’s power, but you don’t need to go overboard. By determining the minimum sample size at the start, you avoid unnecessary repetition and make your results easier to interpret.

Think about it like this: you could flip a coin and get “heads” 5 times in a row just by chance. If you stopped there, you might conclude that a coin always lands heads up. But if you continue flipping the coin a dozen more times, you’ll start to approach the true 50/50 probability. Now you’ll need to decide when to stop repeating your coin-flipping experiment. Is 100 flips sufficient? 500? Flipping the coin 1,000 more times will only reinforce what you already know. On top of that, actually flipping a coin this many times in real life takes a decent amount of time – it’s much faster and easier to use an online coin-flipping simulator than to do it manually. Similarly, sample size determination often involves complex formulas and mathematical notation that no one really likes. Instead, we’re going to work around those using the power of simulation. By the end of this post, you’ll not only understand the basics of sample size determination, but you’ll also have an editable R script to help you perform these calculations for your own experiments.

Why use simulation?

Traditional methods of calculating sample size often rely on standard formulas and assumptions that might not fit your specific experiment. This is where simulation shines. By generating random samples and repeatedly testing them, simulation provides a flexible way to estimate the necessary sample size under various conditions. It’s like running your experiment thousands of times on your computer to see how many participants you’d need to detect an effect.

Simulation: making (good) guesses

In this process, we will analyze the potential results of the experiment across various sample sizes. To do this, we need to make educated guesses about important effect sizes and the study’s reliability. Estimates for effect sizes can come from meta-analyses, previous publications, pilot studies, or educated guesses about reasonable outcomes for your treatment groups. While studies often report only test statistics, when simulating it’s usually easier to think of potential outcomes in terms of the mean (and/or standard deviation – a measure of data spread) of whatever you are measuring.
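For instance – as a rough sketch with made-up numbers rather than values from any real study – you could write those guesses down as two normal distributions and draw from them to see what kinds of values they imply:

# illustrative guesses only: control mean of 7, treatment mean of 6, shared SD of 1
control_guess <- rnorm(1000, mean = 7, sd = 1)
treatment_guess <- rnorm(1000, mean = 6, sd = 1)
range(control_guess) # do these values look plausible for your measurement?
range(treatment_guess)

These are the same kinds of assumptions we’ll feed into the simulation below.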
Even without prior data, you should be able to imagine a range of likely outcomes and exclude implausible ones (e.g., negative values for measures of size). One nice thing about simulating is that it’s simple to adjust your assumed outcomes and see how these changes affect your sample size determination.

The reliability of a statistical test is measured by both its significance level and the study’s power. The significance level is a measure of how often the test is expected to return a false positive (i.e., finding a difference when there isn’t one), while power measures how often the test correctly identifies true positives. Standard settings are often a 5% significance level and 80% power, but whether those levels are appropriate requires careful consideration. Would you trust a test that correctly identifies outcomes only 80% of the time, or one that yields false positives 5% of the time? Sample size determination should not just aim for a specific power or significance level; it should also evaluate how changes in sample size affect these metrics and your overall conclusions.

Getting Started with Simulation in R

R is a powerful programming language for statistical computing and graphics, widely used in both academia and industry, so adding it to your toolkit can be highly valuable. In this guide, we’ll use R to simulate data and determine the sample size needed to achieve a desired level of statistical power. To follow along, simply copy and paste the entire provided R code for each step. (Anything preceded by “#” in the code isn’t code itself – it’s either a note explaining the purpose of that particular line or a record of the results you should see in your own R file.)

Simulating data

Let’s first simulate data for an experiment seeking to measure some type of difference between two groups – such as the body temperature of people receiving an antibiotic and of people receiving a placebo. We’ll start with a manageable sample size, say 10 for each of the two groups. We’ll create a table (dataframe) to hold our observations for each subject, including an identifier for the treatment each one receives. We’ll also mix up the order of treatments, just like we would in a real experiment.

set.seed(1)
df <- data.frame(
  x = sample(rep(c("control", "treatment"), each = 10)))

To populate the dataframe, R has many ways to generate random numbers, including draws from a variety of distributions – such as the normal or uniform distributions. For our example, we will assume that we do expect a difference between our two groups and therefore that each group has a different mean. Accordingly, in our simulation we will draw random numbers from two normal distributions that have the same standard deviation but different means. For each entry in our dataframe, we’ll provide a random draw from the appropriate distribution to serve as an observed value. We’ll then compare the groups with a t-test (https://stats.oarc.ucla.edu/stat/data/intro_r/intro_r_interactive_flat.html#independent-samples-t-tests).
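One function worth knowing before the next step: rnorm(n, mean, sd) draws n random values from a normal distribution with the given mean and standard deviation, so calls like the ones below each produce a single simulated observation:

rnorm(1, 6, 1) # one random draw from a normal distribution with mean 6 and SD 1
rnorm(1, 7, 1) # one random draw from a normal distribution with mean 7 and SD 1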
set.seed(1) # controls the random number generator

# provide a value of y for each x, based on the treatment assigned in the dataframe
for(i in 1:length(df$x)){
  ifelse(df$x[i] == "treatment",
         df$y[i] <- rnorm(1,6,1),
         df$y[i] <- rnorm(1,7,1)
  )
}

# show the top of the dataframe
head(df)

This is the output you should see:

          x        y
1   control 6.373546
2   control 7.183643
3   control 6.164371
4   control 8.595281
5 treatment 6.329508
6 treatment 5.179532

# conduct the t-test
t.test(df$y[df$x == "treatment"], df$y[df$x == "control"], paired = F)

This is the output you should see:

        Welch Two Sample t-test

data:  df$y[df$x == "treatment"] and df$y[df$x == "control"]
t = -2.6442, df = 17.219, p-value = 0.01692
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.9903293 -0.2246649
sample estimates:
mean of x mean of y 
 6.136775  7.244272 

Note: make sure you’re using an up-to-date installation of R. If you’re not seeing exactly the same numbers, it may be because you’re using a different version of R than the one used when writing this article.

This random sample gives us a significant result (p < 0.05), meaning the test detected a difference between the two groups. So, does that mean we found our correct sample size on the first try? As you might have guessed, the answer is no. Many other samples from the same distributions will not give you a value of p < 0.05 in your t-test. In other words, to check whether your first guess was right, you’ll want to run this again with the same sample size, but with different random numbers in your dataframe. We can demonstrate this in R by changing the random number generator seed.

set.seed(3) # controls the random number generator

# provide a value of y for each x
for(i in 1:length(df$x)){
  ifelse(df$x[i] == "treatment",
         df$y[i] <- rnorm(1,6,1),
         df$y[i] <- rnorm(1,7,1)
  )
}

# perform the analysis
t.test(df$y[df$x == "treatment"], df$y[df$x == "control"], paired = F)

This is the output you should see:

        Welch Two Sample t-test

data:  df$y[df$x == "treatment"] and df$y[df$x == "control"]
t = -2.0622, df = 17.962, p-value = 0.05395
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.47279811  0.01379503
sample estimates:
mean of x mean of y 
 5.968059  6.697560 

For this sample, the result (p = 0.054) is not statistically significant at the 5% significance level. To check how often samples from these distributions give a significant result at this sample size, we need to test many sets of samples. But we don’t want to change the random number generator seed manually each time – we want R to do it for us.

Simulating many, many times, very quickly

Using a “for loop” (https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Repetitive-execution), we can recreate the dataframe and rerun the analysis as many times as we want. Each time we run it, we will check the p-value. When the p-value is significant (p < 0.05), we’ll save the result as 1; if not, we’ll save it as 0. We’ll save these values in a list outside the loop. The average of this list will be our measure of power (for example, if 5,600 of 10,000 simulated experiments come out significant, the estimated power is 0.56).
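The conversion from p-value to 1 or 0 is just a comparison. For example, with a single t-test result stored in an object called test (the same name used in the loop below), the flag could be computed as:

test <- t.test(df$y[df$x == "treatment"], df$y[df$x == "control"], paired = F)
as.numeric(test$p.value < 0.05) # 1 if significant, 0 if not

The loop below does the equivalent with ifelse().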
# make an empty list to hold the significance of each simulation
pwrlist <- list()

set.seed(10) # controls the random number generator

for(s in 1:10000){ # perform 10000 simulations

  # create the data frame
  df <- data.frame(
    x = sample(rep(c("control", "treatment"), each = 10)))

  # populate the data frame with numbers from the distributions
  for(i in 1:length(df$x)){
    if(df$x[i] == "treatment") df$y[i] <- rnorm(1,6,1)
    if(df$x[i] == "control") df$y[i] <- rnorm(1,7,1)
  }

  # perform the analysis
  test <- t.test(df$y[df$x == "treatment"], df$y[df$x == "control"], paired = F)

  # compare the p-value to 0.05, and store as significant or not
  ifelse(test$p.value < 0.05, pwrlist$sig[s] <- 1, pwrlist$sig[s] <- 0)
}

# compute the average significance to determine power at this sample size
mean(pwrlist$sig)

This is the output you should see:

[1] 0.5647

So, for our example, using a sample size of 10 in each group would give us a power of only about 56% for a mean difference of 1, at a 5% rate of false positives. R also has a built-in function to determine power, power.t.test(), which demonstrates the same result.

power.t.test(n = 10,          # size of each group
             delta = 1,       # expected mean difference b/w groups
             sd = 1,          # expected standard deviation of each group
             sig.level = 0.05) # desired significance level

This is the output you should see:

     Two-sample t test power calculation 

              n = 10
          delta = 1
             sd = 1
      sig.level = 0.05
          power = 0.5619846
    alternative = two.sided

NOTE: n is number in *each* group

While this function is admittedly quicker to use, simulation has some advantages when exploring power, as discussed below: slight modifications to the code allow comparisons between different sample sizes, effect sizes, and even different types of analysis. Next, we will expand our simulation to show how the power changes when using different sample sizes.

Size selection

To compare power levels at different sample sizes, we will make a vector containing the different sample sizes to use, and add another for loop. Be patient running the entire thing: it’s a lot of counting!

set.seed(5)
pwrlist <- list()
samplesizes <- 10:20 # number of subjects in each group to compare

for(n in 1:length(samplesizes)){

  for(s in 1:10000){ # simulations to run for each sample size

    # create the data frame
    df <- data.frame(
      x = sample(rep(c("control", "treatment"), each = samplesizes[n])))

    # populate the data frame with numbers from the random distributions
    for(i in 1:length(df$x)){
      if(df$x[i] == "treatment") df$y[i] <- rnorm(1,6,1)
      if(df$x[i] == "control") df$y[i] <- rnorm(1,7,1)
    }

    # perform the analysis
    test <- t.test(df$y[df$x == "treatment"], df$y[df$x == "control"], paired = F)

    # compare the p-value to 0.05, and store as significant or not
    ifelse(test$p.value < 0.05, pwrlist$sig[s] <- 1, pwrlist$sig[s] <- 0)
  }

  # compute the average significance to determine power at each sample size
  pwrlist$PWR[n] <- mean(pwrlist$sig)
  pwrlist$n[n] <- samplesizes[n]
}

data.frame(
  Power = pwrlist$PWR,
  Samplesize = pwrlist$n)

This is the output you should see:

    Power Samplesize
1  0.5617         10
2  0.6066         11
3  0.6436         12
4  0.6892         13
5  0.7214         14
6  0.7537         15
7  0.7826         16
8  0.8151         17
9  0.8295         18
10 0.8518         19
11 0.8692         20

Then, you can visualize your results:

# plot the results
plot(pwrlist$PWR ~ pwrlist$n,
     xlab = "Sample Size in Each Group",
     ylab = "Power")

The plot of the results shows how power (on the y axis) increases as the sample size in each group (on the x axis) increases.
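To make the conventional 80% threshold easier to spot, you can optionally add a dashed reference line right after the plot() call:

abline(h = 0.80, lty = 2) # dashed reference line at 80% power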
For this example, when using a 5% significance level, the power increases rapidly, reaching the conventional 80% threshold at 17 samples in each group. So, if you chose to go with the original sample size of 10, your experiment wouldn’t have very much power at all – you’d expect to get a significant result only about 56% of the time. Adding just a few more samples to each group, however, makes it much more likely that you’ll detect the difference.

Flexing your power

There are many ways to code a power analysis, but the script above is meant to provide flexibility. Simple modifications let you determine sample sizes for different effect sizes and significance levels. You can also change the type of data you use, drawing from distributions that reflect the kind of data you’ll be analyzing (e.g., binomial (0,1) data), include more treatment groups, or change the type of analysis performed – for example, running an ANOVA or even a generalized linear model instead of a t-test.

Simulation is a versatile and accessible method for determining the sample size needed for your experiments. By leveraging the tools in R, you can tailor your sample size calculations to fit your unique research needs, making your experiments more robust and reliable. Armed with these tools and knowledge, you’re well-equipped to embark on your research journey with confidence and power.

Robert Mobley is a biologist, civil servant and science communicator. He is interested in how other animals perceive their environments and the methods researchers use to understand these abilities.