1478: P-Values
==Explanation==
This comic plays on how scientific experiments measure and interpret the significance of their data. When performing a comparison (for example, seeing whether listening to various types of music can influence test scores), a properly designed experiment includes an ''experimental group'' (of people who listen to music while taking tests) and a ''control group'' (of people who take tests without listening to music), as well as a ''{{w|null hypothesis}}'' that "music has no effect on test scores". The test scores of each group are gathered, and a series of statistical tests is performed on the data to produce a value known as the {{w|P-value|p-value}}. In a nutshell, this is the probability of obtaining results at least as extreme as those observed if the null hypothesis is true, that is, if any variance in scores between the experimental and control groups is just caused by random chance. (For a more drastic example, an experiment could be to see if wearing glasses affects the outcome of coin flips: there would likely be some amount of difference between the coin results when wearing glasses and not wearing glasses, and the ''p''-value serves to measure exactly how much difference can be called significant and how much can just be attributed to chance.)
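The coin-flip example above can be sketched numerically. The following is a minimal illustration (not from the comic itself) of an exact two-sided binomial test: under the null hypothesis of a fair coin, the p-value sums the probabilities of every outcome at least as unlikely as the one observed. The numbers (60 heads in 100 flips) are hypothetical.

```python
# Hypothetical illustration: exact two-sided binomial test for the
# coin-flip example. Null hypothesis: the coin is fair (prob = 0.5).
from math import comb

def binomial_two_sided_p(heads, flips, prob=0.5):
    """Two-sided binomial p-value: sum the probabilities of all
    outcomes no more likely than the observed one under the null."""
    observed = comb(flips, heads) * prob**heads * (1 - prob)**(flips - heads)
    total = 0.0
    for k in range(flips + 1):
        p_k = comb(flips, k) * prob**k * (1 - prob)**(flips - k)
        if p_k <= observed * (1 + 1e-9):  # tolerance for float comparison
            total += p_k
    return min(total, 1.0)

# Say 60 heads came up in 100 flips while wearing glasses:
p = binomial_two_sided_p(60, 100)
print(f"p = {p:.4f}")  # about 0.057: per the comic, "on the edge of significance"
```

With 60 heads in 100 flips the result just misses the conventional 0.05 threshold, landing exactly in the comic's "on the edge of significance" band.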
If the ''p''-value is low, then the null hypothesis is said to be ''rejected'', and it can fairly be said that, in this case, music does have a significant effect on test scores. Otherwise, if the ''p''-value is too high, the data is said to ''fail to reject'' the null hypothesis, meaning that it is not necessarily counter-evidence, but rather that more results are needed. The standard and generally accepted significance threshold for experiments is ''p''<0.05, which is why all values below that number in the comic are marked "significant" at the least.
This comic reflects the fact that in most real-world scenarios, the person carrying out the test usually has a vested interest in the results, typically because it is their own hypothesis under test. A result which does not show the proper significance can feel like a major blow, and this may lead to desperate attempts to 'encourage' the data to show the desired outcome. For example, the chart labels a ''p''-value of exactly 0.050 as "Oh crap. Redo calculations" because the ''p''-value is very close to being considered significant, but isn't. The desperate researcher might be able to redo the calculations in order to nudge the result under 0.050. This could be achieved validly if an error is found in the calculations or data set, or falsely by erasing certain unwelcome data points or by using creative mathematical adjustments such as rounding to arbitrary place values.
Values between 0.051 and 0.06 are labelled as being "on the edge of significance". This illustrates the regular use of "creative language" to qualify significance in reports, as a flat "not significant" result may look 'bad'. The validity of such use is of course a contested topic, with debates centering on whether ''p''-values slightly larger than the significance level should be noted as nearly significant or flatly classed as not-significant. The logic of having such an absolute cutoff point for significance may be questioned.
Values between 0.07 and 0.099 continue the trend of using qualifying language, calling the results "suggestive". This category also illustrates the 'technique' of resorting to adjusting the significance threshold. Appropriate {{w|Design of experiments|experimental design}} requires that the significance threshold be set prior to the experiment, not allowing changes afterward in order to "get a better experiment report", as this would insert bias into the result. A simple change of the threshold (e.g. from 0.05 to 0.1) can change an experiment's result from "not significant" to "significant". Although the statement "significant at the ''p''<0.10 level" is technically true, it would be highly frowned upon to use in an actual report.
Values higher than 0.1 should be considered not significant at all; however, the comic suggests taking a part of the sample (a ''subgroup'') and analyzing that subgroup without regard to the rest of the sample. For example, in a study trying to prove that people sneeze more often when walking by a particular street lamp, data could include the proportion of people who pass the lamp and sneeze, and the proportion of people who sneeze without passing the lamp (to see if the first group's value is statistically significantly higher). If the results don't reach the desired ''p''<0.05 (or ''p''<0.1), it is mathematically possible to pick an arbitrary subgroup ("OK, not all people sneeze, but look! Women sneeze more than men, so let's analyze only women") with a better ''p''-value. Of course, this is not accepted scientific procedure, as it directly adds a lot of sampling bias to the result. This is an example of the {{w|multiple comparisons problem}}, which is also the topic of comic [[882]].
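The danger of hunting through subgroups can be shown with a small simulation. This is a hypothetical sketch (the subgroup count and trial numbers are arbitrary): even when the null hypothesis is true everywhere, each subgroup test still has a 5% chance of landing under 0.05, so checking many subgroups makes a spurious "significant" result likely somewhere.

```python
# Hypothetical simulation of the multiple comparisons problem:
# under the null hypothesis, a p-value is uniformly distributed on
# [0, 1], so each subgroup test "succeeds" by chance 5% of the time.
import random

random.seed(0)
ALPHA = 0.05       # significance threshold
SUBGROUPS = 20     # arbitrary number of subgroups examined
TRIALS = 2000      # number of simulated studies

spurious = 0
for _ in range(TRIALS):
    # Draw one null p-value per subgroup; count studies where at
    # least one subgroup looks "significant" purely by chance.
    if any(random.random() < ALPHA for _ in range(SUBGROUPS)):
        spurious += 1

print(f"Studies with a spurious 'significant' subgroup: {spurious / TRIALS:.2f}")
# Theory: 1 - (1 - 0.05)**20, roughly 0.64
```

So with twenty subgroups to choose from, a researcher has roughly a two-in-three chance of finding an "interesting subgroup analysis" even in pure noise.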
If the results cannot normally be considered significant, the title text suggests as a last resort inverting p<0.050, making it p>0.050. This leaves the statement mathematically true, but may fool casual readers, as the single-character change may go unnoticed or be dismissed as a typographical error ("no one would claim their results aren't significant, they must mean p<0.050"). Of course, the statement on its face is useless, as it is equivalent to stating that the results are "not significant".
==Transcript==
:[A two-column table where the second column selects various areas of the first column using square brackets.]
:{| class="wikitable alternance"
! P-value
! Interpretation
|-
| 0.001
| rowspan="4"| Highly significant
|-
| 0.01
|-
| 0.02
|-
| 0.03
|-
| 0.04
| rowspan="2"| Significant
|-
| 0.049
|-
| 0.050
| Oh crap. Redo calculations.
|-
| 0.051
| rowspan="2"| On the edge of significance
|-
| 0.06
|-
| 0.07
| rowspan="4"| Highly suggestive, relevant at the p<0.10 level
|-
| 0.08
|-
| 0.09
|-
| 0.099
|-
| ≥0.1
| Hey, look at this interesting subgroup analysis
|}
{{comic discussion}}
[[Category:Charts]]
[[Category:Statistics]]