Every statistical test is designed to answer a different type of question and makes different assumptions about the data being analyzed. If you don’t choose an appropriate test, the results may not be valid. You don’t need to become a mathematical wizard to use inferential statistics, but it’s important to have a good basic understanding of each test’s purpose and assumptions. The best plan is to decide on a statistical test before you carry out your experiment. Knowing the test you plan to use will help you design your experiment so that you collect the appropriate data.
Below are some general factors to consider in choosing a statistical test. Once you have read them, refer to the flowchart to help you choose a specific test for your data. If you already know the type of test you want to use, you can select it from the table at the bottom of the page.
Keep in mind that the notes below, and the detailed instructions on individual tests, are only an introduction to the field of statistics, not a complete treatment. Senior students will want to consult a good biological statistics text (e.g. Zar 1996, Sokal and Rohlf 1995) in addition to this site. Likewise, the table and flowchart direct students only to tests covered in the Science Toolkit; other, more appropriate tests may be available.
Type of Question:
Most biological questions can be broken down into two main categories:
Questions about differences.
Questions about correlations.
Questions about differences can be further broken down into categories on the basis of what we are looking for a difference in (Ambrose and Ambrose 1995):
Differences in central tendency. This is probably the most intuitive type of question. We look at two or more groups and ask if the mean or median is different between them. For example, we could ask if the mean length of needles is different in spruce trees than in fir trees, or if the median distance ground squirrels travel from their burrows is different when predators are present. (See also here for more on mean vs. median.)
Differences in variance. We can ask if the variances are different between two groups just as we ask if their means are different. For example, we might suspect that a new pesticide was causing plants to grow erratically. We could compare plants exposed to the pesticide with those not exposed, and ask if there was more variation in size in the pesticide-treated plants. (Note that despite the name, the Analysis of Variance, or ANOVA, test looks for differences in central tendency.)
Differences in distributions. Data obtained by counting some nominal variable are called frequency or distribution data. We are often interested in seeing how close the real counts come to what we might expect from a hypothesis (Goodness of Fit). For example, we could count how many cars of each colour are found in the university parking lot, and ask if any colour is more common than the others (comparing to the expectation of no differences). Another way of using distribution data is to measure two or more sets of variables simultaneously and see if they are associated (Independence). If we not only counted the number of cars of each colour, but also noted whether the driver was male or female, we could ask if there was an association between the sex of the driver and the colour of the car. Here we would not care how common different car colours are, only whether men tend to have different colour cars from women. The test of independence can also be used to compare different samples: we could count the number of cars of different colours at two universities, and compare these two samples to see if the distribution of frequencies was different between campuses.
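As a minimal sketch of how both kinds of distribution test might be run (assuming Python with scipy as the tool; the car counts below are invented for illustration):

```python
from scipy import stats

# Goodness of fit: counts of cars of each colour in one parking lot.
# Null hypothesis: all colours are equally common.
observed = [45, 30, 15, 10]            # e.g. white, black, red, other
chi2, p = stats.chisquare(observed)    # expected counts default to equal
print(f"goodness of fit: chi2 = {chi2:.2f}, p = {p:.4f}")

# Independence: colour counts cross-tabulated by driver sex.
# Null hypothesis: no association between sex and car colour.
table = [[25, 15, 5, 5],               # male drivers
         [20, 15, 10, 5]]              # female drivers
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"independence: chi2 = {chi2:.2f}, p = {p:.4f}")
```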
A correlation between two factors means that when one changes, the other tends to change along with it. For example, there is a correlation between the age of a tree and its size: older trees tend to be larger. We call this a positive correlation, since as average age increases, average size also increases. But negative correlations are also seen in nature. Here one factor decreases when the other increases, and vice versa. For example, as the density of aphids on your tomato plants increases, the number of tomatoes the plants produce will tend to decline.
Keep in mind that correlations do not necessarily reflect simple cause and effect relationships. Marketing wizards have discovered a positive correlation between beer and diapers. Men who buy more diapers also buy more beer. It seems unlikely that buying diapers directly causes men to buy more beer, or vice-versa. More likely, men in their twenties tend to have young children, and also drink more beer. (Or perhaps their children drive them to drink, who knows!) See here for more on correlation vs. regression.
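A sketch of how the tree example might be tested in code (Python with scipy assumed; the measurements are invented):

```python
from scipy import stats

# Invented data: age (years) and trunk diameter (cm) for ten trees.
age = [5, 8, 12, 15, 20, 25, 30, 38, 45, 60]
diameter = [4, 7, 10, 14, 17, 22, 25, 33, 38, 52]

# Parametric: Pearson correlation.
r, p = stats.pearsonr(age, diameter)
print(f"Pearson r = {r:.2f}, p = {p:.4f}")    # r near +1: positive correlation

# Non-parametric alternative: Spearman rank correlation.
rho, p = stats.spearmanr(age, diameter)
print(f"Spearman rho = {rho:.2f}, p = {p:.4f}")
```

A significant r here still says nothing about cause and effect, as the diapers example above illustrates.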
Type of Hypothesis:
Many effects that we are interested in looking at can go in two directions. If we are comparing mean A and mean B, mean A can be larger than mean B, or mean B can be larger than mean A. (The null hypothesis is the same in either case, H0: the mean of A is the same as the mean of B.) So we can shape our alternative hypothesis specifically (HA: the mean of A is larger than the mean of B) or more generally (HA: the mean of A is different from the mean of B). It is more conservative to use the general hypothesis in a statistical test, since it makes fewer assumptions. Tests of this type are called two-tailed tests. However, if we can be completely confident that we only need to consider an effect in one direction, we can use a one-tailed test. These tests are more powerful, meaning we are more likely to reject the null hypothesis when it is false. When a two-tailed test does not show support for the alternative hypothesis, it is tempting to reanalyze the data using a one-tailed test, but this is not legitimate. The decision to use a one-tailed test must be made because it is justified, not because it is convenient.
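In statistical software this choice usually comes down to a single argument. A minimal sketch using Python's scipy (which calls the argument `alternative`; a reasonably recent scipy is assumed, and the data are invented):

```python
from scipy import stats

# Invented needle lengths (cm) for two groups of trees.
group_a = [5.1, 5.5, 6.0, 5.8, 6.2, 5.9]
group_b = [4.8, 5.0, 5.3, 4.9, 5.2, 5.1]

# Two-tailed test (the conservative choice): HA is "the means differ".
t, p_two = stats.ttest_ind(group_a, group_b, alternative='two-sided')

# One-tailed test: HA is "mean of A is larger than mean of B".
# Only legitimate if this direction was chosen before seeing the data.
t, p_one = stats.ttest_ind(group_a, group_b, alternative='greater')

print(f"two-tailed p = {p_two:.4f}, one-tailed p = {p_one:.4f}")
```

For the same data, the one-tailed p-value here is half the two-tailed one, which is exactly the extra power described above.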
Type of Distribution:
Many statistical tests, such as the t-test, make assumptions about the distribution of the populations we are sampling. The most important assumption of parametric tests is that the variable we are measuring is normally distributed in the population. Statistical tests are available to determine if a sample is normally distributed, but a simpler solution for students is to graph the data on a histogram. If the histogram looks like a bell curve, you’re probably okay to use a parametric test.
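A sketch of that check in Python (matplotlib for the histogram, and scipy's Shapiro-Wilk test as one example of the formal tests mentioned above; the sample is invented):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Invented sample: 50 spruce needle lengths (mm).
rng = np.random.default_rng(1)
lengths = rng.normal(loc=20, scale=2, size=50)

# Eyeball check: does the histogram look like a bell curve?
plt.hist(lengths, bins=10, edgecolor='black')
plt.xlabel('Needle length (mm)')
plt.ylabel('Frequency')
plt.show()

# A formal option: Shapiro-Wilk test (null hypothesis: the data are normal).
stat, p = stats.shapiro(lengths)
print(f"Shapiro-Wilk p = {p:.3f}")  # p > 0.05: no evidence of non-normality
```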
If the data are not normal, several options are available.
The data can be transformed, which means converting each measurement using some mathematical formula. A common transformation is to take the logarithm of each number (log transformation). This often produces a distribution which is closer to normal.
A non-parametric test can be used. Non-parametric tests make fewer assumptions about the distribution of data. Typically they work by considering only the rankings of the data points, rather than their actual values (converting interval/ratio data to ordinal data). By doing this, however, we are throwing away some of our information, and so non-parametric tests are usually less powerful than their parametric counterparts.
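Both options might look like this in code (Python with scipy assumed; the skewed plant sizes are invented):

```python
import numpy as np
from scipy import stats

# Invented right-skewed samples: plant sizes (cm) with and without pesticide.
rng = np.random.default_rng(2)
control = rng.lognormal(mean=1.0, sigma=0.6, size=30)
treated = rng.lognormal(mean=1.3, sigma=0.6, size=30)

# Option 1: log-transform, then run a parametric test on the transformed data.
t, p = stats.ttest_ind(np.log(control), np.log(treated))
print(f"t-test on log-transformed data: p = {p:.4f}")

# Option 2: a non-parametric test on the raw data.
# Kruskal-Wallis works on ranks, so it makes no normality assumption.
h, p = stats.kruskal(control, treated)
print(f"Kruskal-Wallis: p = {p:.4f}")
```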
Number of Comparisons:
Some tests, such as the t-test and sign test, are designed to compare only two means. Since these tests tend to be easier to use than tests designed to handle more than two means, it is tempting to run the simpler test several times rather than running a test designed for the job. For example, if you were comparing means A, B, and C, you might be tempted to run three t-tests (comparing A-B, A-C, and B-C) rather than using an ANOVA. This is not legitimate: it can greatly increase the chance of making a Type I error (Zar 1996).
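A quick sketch of this inflation in plain Python (the arithmetic is the same as in the coin example that follows):

```python
# Chance of at least one Type I error across k independent tests at alpha = 0.05.
alpha = 0.05
for k in (1, 2, 3, 10):
    familywise = 1 - (1 - alpha) ** k
    print(f"{k:2d} tests: P(at least one error) = {familywise:.4f}")
# Prints 0.0500, 0.0975, 0.1426, 0.4013 -- the 2-test and 10-test values
# match the figures worked out in the coin example below.
```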
To see why, consider this simple example. We have 10 coins and we want to see if any of them are biased. We know that if we flip a coin 10 times, there is less than one chance in 20 that we will get at least nine heads or nine tails. Since p < 0.05 (0.05 is just one chance in 20), we can reject the null hypothesis of a fair coin (no difference between heads and tails) with some confidence. If we test one coin and reject the null hypothesis, we have one chance in 20 of making a mistake (Type I error), which we can live with. But now we test a second coin. If we reject the null hypothesis again, we again have a one in 20 chance of making an error. If we have a one in 20 chance of making an error in each test, what is the chance that we have made an error on at least one test? The exact answer may not be obvious, but hopefully what is obvious is that the probability is greater than one in 20. (It’s actually 0.0975.) The more tests we run, the greater the chance that we will make an error on at least one of them. After 10 coins, our probability of error is about 0.4. A 40% chance of a Type I error is clearly not acceptable in science.
Type of Measurement:
Each statistical test is designed to deal with a particular type of measurement. Before choosing a test, make sure you know whether you have nominal, ordinal, or interval/ratio data. Most tests also assume all data points are independent. Exceptions to this are tests which use data collected in pairs (e.g. right and left leg lengths on the same grasshopper, or body temperature before and after taking a fever drug). Data within each pair are not independent, but all data pairs must be independent of one another.
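A sketch of the paired case, using the fever-drug example (Python with scipy assumed; the temperatures are invented):

```python
from scipy import stats

# Body temperature (deg C) for eight patients, before and after the drug.
# Each position is the same patient, so the two lists are paired.
before = [39.1, 38.7, 39.4, 38.9, 39.0, 38.5, 39.2, 38.8]
after  = [37.2, 37.5, 37.1, 37.8, 37.4, 37.6, 37.0, 37.3]

# Paired t-test: analyzes the within-patient differences.
t, p = stats.ttest_rel(before, after)
print(f"paired t = {t:.2f}, p = {p:.4f}")
```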
Test | General Usage
Chi-Square Test | Comparing two or more distributions
F-test | Comparing two variances
t-Test | Comparing two means (parametric)
ANOVA | Comparing more than two means (parametric)
Kruskal-Wallis Test | Comparing two or more means (non-parametric)
Sign Test | Comparing two means (non-parametric, data can be paired)
Paired t-Test | Comparing two means (parametric, data must be paired)
Pearson Correlation | Correlation (parametric)
Spearman Correlation | Correlation (non-parametric)
Linear Regression | Relationship where the independent variable is controlled (parametric)
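Most of the tests in this table have counterparts in Python's scipy.stats (an assumed choice of tooling; the data below are invented). For example, two tests not sketched above:

```python
from scipy import stats

# ANOVA: comparing more than two means (parametric).
group_a = [12, 15, 14, 13, 16]
group_b = [18, 20, 19, 21, 17]
group_c = [25, 23, 26, 24, 27]
f, p = stats.f_oneway(group_a, group_b, group_c)
print(f"ANOVA: F = {f:.2f}, p = {p:.4f}")

# Linear regression: here x (e.g. a dose set by the experimenter) is controlled.
dose = [0, 1, 2, 3, 4, 5]
response = [2.1, 3.8, 6.2, 7.9, 10.1, 12.2]
fit = stats.linregress(dose, response)
print(f"slope = {fit.slope:.2f}, r^2 = {fit.rvalue**2:.3f}, p = {fit.pvalue:.4f}")
```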