Editing 1478: P-Values

Jump to: navigation, search

Warning: You are not logged in. Your IP address will be publicly visible if you make any edits. If you log in or create an account, your edits will be attributed to your username, along with other benefits.

The edit can be undone. Please check the comparison below to verify that this is what you want to do, and then save the changes below to finish undoing the edit.
Latest revision Your text
Line 12: Line 12:
 
By the standard significance level, analyses with a ''p''-value less than .05 are said to be 'statistically significant'. Although the difference between .04 and .06 may seem minor, the practical consequences can be major. For example, scientific journals are much more likely to publish statistically significant results. In medical research, billions of dollars of sales may ride on whether a drug shows statistically significant benefits or not. A result which does not show the proper significance can ruin months or years of work, and might inspire desperate attempts to 'encourage' the desired outcome.
 
By the standard significance level, analyses with a ''p''-value less than .05 are said to be 'statistically significant'. Although the difference between .04 and .06 may seem minor, the practical consequences can be major. For example, scientific journals are much more likely to publish statistically significant results. In medical research, billions of dollars of sales may ride on whether a drug shows statistically significant benefits or not. A result which does not show the proper significance can ruin months or years of work, and might inspire desperate attempts to 'encourage' the desired outcome.
  
βˆ’
When performing a comparison (for example, seeing whether listening to various types of music can influence test scores), a properly designed experiment includes an ''experimental group'' (of people who listen to music while taking tests) and a ''control group'' (of people who take tests without listening to music), as well as a ''{{w|null hypothesis}}'' that "music has no effect on test scores". The test scores of each group are gathered, and a series of statistical tests are performed to produce the ''p''-value. In a nutshell, this is the probability that the observed difference (or a greater difference) in scores between the experimental and control group could occur due to random chance, if the experimental stimulus has no effect. For a more drastic example, an experiment could test whether wearing glasses affects the outcome of coin flips - there would likely be some amount of difference between the coin results when wearing glasses and not wearing glasses, and the ''p''-value serves to essentially test whether this difference is small enough to be attributed to random chance, or whether it can be said that wearing glasses actually had a significant difference on the results.
+
When performing a comparison (for example, seeing whether listening to various types of music can influence test scores), a properly designed experiment includes an ''experimental group'' (of people who listen to music while taking tests) and a ''control group'' (of people who take tests without listening to music), as well as a ''{{w|null hypothesis}}'' that "music has no effect on test scores". The test scores of each group are gathered, and a series of statistical tests are performed to produce the ''p''-value. In a nutshell, this is the probability that the observed difference (or a greater difference) in scores between the experimental and control group could occur due to random chance, if the experimental stimuli has no effect. For a more drastic example, an experiment could test whether wearing glasses affects the outcome of coin flips - there would likely be some amount of difference between the coin results when wearing glasses and not wearing glasses, and the ''p''-value serves to essentially test whether this difference is small enough to be attributed to random chance, or whether it can be said that wearing glasses actually had a significant difference on the results.
  
 
If the ''p''-value is low, then the null hypothesis is said to be ''rejected'', and it can be fairly said that, in this case, music does have a significant effect on test scores. Otherwise if the ''p''-value is too high, the data is said to ''fail to reject'' the null hypothesis, meaning that it is not necessarily counter-evidence, but rather more results are needed. The standard and generally accepted ''p''-value for experiments is <0.05, hence why all values below that number in the comic are marked "significant" at the least.
 
If the ''p''-value is low, then the null hypothesis is said to be ''rejected'', and it can be fairly said that, in this case, music does have a significant effect on test scores. Otherwise if the ''p''-value is too high, the data is said to ''fail to reject'' the null hypothesis, meaning that it is not necessarily counter-evidence, but rather more results are needed. The standard and generally accepted ''p''-value for experiments is <0.05, hence why all values below that number in the comic are marked "significant" at the least.
Line 24: Line 24:
 
Values between 0.07 and 0.099 continue the trend of using qualifying language, calling the results "suggestive" or "relevant". This category also illustrates the 'technique' of resorting to adjusting the significance threshold. Appropriate {{w|Design of experiments|experimental design}} requires that the significance threshold be set prior to the experiment, not allowing changes afterward in order to "get a better experiment report", as this would again insert bias into the result. A simple change of the threshold (e.g. from 0.05 to 0.1) can change an experiment's result from "not significant" to "significant". Although the statement "significant at the ''p''<0.10 level" is technically true, it would be highly frowned upon to use in an actual report.
 
Values between 0.07 and 0.099 continue the trend of using qualifying language, calling the results "suggestive" or "relevant". This category also illustrates the 'technique' of resorting to adjusting the significance threshold. Appropriate {{w|Design of experiments|experimental design}} requires that the significance threshold be set prior to the experiment, not allowing changes afterward in order to "get a better experiment report", as this would again insert bias into the result. A simple change of the threshold (e.g. from 0.05 to 0.1) can change an experiment's result from "not significant" to "significant". Although the statement "significant at the ''p''<0.10 level" is technically true, it would be highly frowned upon to use in an actual report.
  
βˆ’
Values higher than 0.1 are usually considered not significant at all, however the comic suggests taking a part of the sample (a ''subgroup'') and analyzing that subgroup without regard to the rest of the sample. Choosing to analyze a subgroup ''in advance for scientifically plausible reasons'' is good practice. For example, a drug to prevent heart attacks is likely to benefit men more than women, since men are more likely to have heart attacks. Choosing to focus on a subgroup after conducting an experiment may also be valid if there is a credible scientific justification - sometimes researchers learn something new from experiments. However, the danger is that it is usually possible to find and pick an arbitrary subgroup that happens to have a better ''p''-value simply due to chance. A researcher reporting results for subgroups that have little scientific basis (the pill only benefits people with black hair, or only people who took it on a Wednesday, etc.) would clearly be "cheating." Even when the subgroup has a plausible scientific justification, skeptics will rightly be suspicious that the researcher might have considered numerous possible subgroups (men, older people, fat people, sedentary people, diabetes suffers, etc.) and only reported the subgroups for which there are statistically significant results. This is an example of the {{w|multiple comparisons problem}}, which is also the topic of [[882: Significant]].
+
Values higher than 0.1 are usually considered not significant at all, however the comic suggests taking a part of the sample (a ''subgroup'') and analyzing that subgroup without regard to the rest of the sample. Choosing to analyze a subgroup ''in advance for scientifically plausible reasons'' is good practice. For example, a drug to prevent heart attacks is likely to benefit men more than women, since men are more likely to have heart attacks. Choosing to focus on a subgroup after conducting an experiment may also be valid if there is a credible scientific justification - sometimes researchers learn something new from experiments. However, the danger is that it is usually possible to find and pick an arbitrary subgroup that happens to have a better ''p''-value simply due to chance. A researcher reporting results for subgroups that have little scientific basis (the pill only benefits people with black hair, or only people who took it on a Wednesday, etc.) would clearly be "cheating." Even when the subgroup has a plausible scientific justification, skeptics will rightly be suspicious that the researcher might have considered numerous possible subgroups (men, older people, fat people, sedentary people, diabetes suffers, etc.) and only reported the subgroups for which there are statistically significant results. This is an example of the {{w|multiple comparisons problem}}, which is also the topic of comic [[882]].
  
 
If the results cannot be normally considered significant, the title text suggests as a last resort to invert p<0.050, making it p>0.050. This leaves the statement mathematically true, but may fool casual readers, as the single-character change may go unnoticed or be dismissed as a typographical error ("no one would claim their results aren't significant, they must mean p<0.050"). Of course, the statement on its face is useless, as it is equivalent to stating that the results are "not significant".
 
If the results cannot be normally considered significant, the title text suggests as a last resort to invert p<0.050, making it p>0.050. This leaves the statement mathematically true, but may fool casual readers, as the single-character change may go unnoticed or be dismissed as a typographical error ("no one would claim their results aren't significant, they must mean p<0.050"). Of course, the statement on its face is useless, as it is equivalent to stating that the results are "not significant".
Line 63: Line 63:
 
[[Category:Charts]]
 
[[Category:Charts]]
 
[[Category:Statistics]]
 
[[Category:Statistics]]
βˆ’
[[Category:Scientific research]]
 

Please note that all contributions to explain xkcd may be edited, altered, or removed by other contributors. If you do not want your writing to be edited mercilessly, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource (see explain xkcd:Copyrights for details). Do not submit copyrighted work without permission!

To protect the wiki against automated edit spam, we kindly ask you to solve the following CAPTCHA:

Cancel | Editing help (opens in new window)