1478: P-Values
  
 
==Explanation==
This comic plays on how scientific experiments interpret the significance of their data. A {{w|p-value}} is a statistical measure whose meaning can be difficult to explain to non-experts, and it is frequently '''wrongly''' understood (even in this wiki) as indicating how likely it is that the results could have happened by accident. [http://amstat.tandfonline.com/doi/full/10.1080/00031305.2016.1154108 Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.]
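
To make that definition concrete, here is a small illustrative sketch in Python (not part of the comic; the numbers are invented): under a null model of a fair coin, the ''p''-value for observing 60 heads in 100 flips is the probability of a result at least that extreme, which can be estimated by simulating the null model directly.

<pre>
# Illustrative sketch only: estimate a one-sided p-value by simulating the
# null model ("the coin is fair") and counting how often the summary
# statistic (the number of heads) is at least as extreme as the observed value.
import random

observed_heads = 60        # hypothetical observation
n_flips = 100
n_simulations = 100_000

at_least_as_extreme = sum(
    sum(random.random() < 0.5 for _ in range(n_flips)) >= observed_heads
    for _ in range(n_simulations)
)
p_value = at_least_as_extreme / n_simulations
print(f"estimated p-value: {p_value:.3f}")   # comes out around 0.03 here
</pre>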
  
By the conventional standard, analyses with a ''p''-value less than 0.05 are said to be 'statistically significant'. Although the difference between 0.04 and 0.06 may seem minor, the practical consequences can be major. For example, scientific journals are much more likely to publish statistically significant results. In medical research, billions of dollars of sales may ride on whether a drug shows statistically significant benefits or not. A result which does not show the proper significance can ruin months or years of work, and might inspire desperate attempts to 'encourage' the desired outcome.
  
When performing a comparison (for example, seeing whether listening to various types of music can influence test scores), a properly designed experiment includes an ''experimental group'' (of people who listen to music while taking tests) and a ''control group'' (of people who take tests without listening to music), as well as a ''{{w|null hypothesis}}'' that "music has no effect on test scores". The test scores of each group are gathered, and a statistical test is performed to produce the ''p''-value. In a nutshell, this is the probability that the observed difference (or a greater difference) in scores between the experimental and control group could occur due to random chance, if the experimental stimulus has no effect. For a more drastic example, an experiment could test whether wearing glasses affects the outcome of coin flips - there would likely be some amount of difference between the coin results when wearing glasses and not wearing glasses, and the ''p''-value serves to test whether this difference is small enough to be attributed to random chance, or whether wearing glasses actually made a significant difference to the results.
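
In code, such a comparison might look like the following sketch (the scores are invented and SciPy is assumed to be available; this is an illustration, not a rigorous analysis):

<pre>
# Illustrative sketch with invented scores: compare an experimental group that
# listened to music with a control group that did not, using a two-sample
# t-test. The null hypothesis is "music has no effect on test scores".
from scipy import stats

music_scores    = [78, 85, 90, 72, 88, 95, 81, 79, 84, 91]   # experimental group
no_music_scores = [75, 80, 86, 70, 83, 89, 77, 74, 82, 85]   # control group

t_statistic, p_value = stats.ttest_ind(music_scores, no_music_scores)
print(f"p = {p_value:.3f}")

# The p-value is the probability of a difference in mean scores at least this
# large arising by chance if the null hypothesis were true.
if p_value < 0.05:
    print("Statistically significant: reject the null hypothesis")
else:
    print("Not statistically significant: fail to reject the null hypothesis")
</pre>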
  
If the ''p''-value is low, then the null hypothesis is said to be ''rejected'', and it can fairly be said that, in this case, music does have a significant effect on test scores. Otherwise, if the ''p''-value is too high, the data are said to ''fail to reject'' the null hypothesis; this is not necessarily counter-evidence, but rather a sign that more results are needed. The standard and generally accepted threshold for experiments is ''p''<0.05, which is why all values below that number in the comic are marked at least "significant".
  
The chart labels a ''p''-value of exactly 0.050 as "Oh crap. Redo calculations" because the ''p''-value is very close to being considered significant, but isn't. The desperate researcher might be able to redo the calculations in order to nudge the result under 0.050. For example, problems can often have a number of slightly different and equally plausible methods of analysis, so by arbitrarily choosing one it can be easy to tweak the ''p''-value. This could also be achieved if an error is found in the calculations or data set, or by erasing certain unwelcome data points. While correcting errors is usually valid, correcting only the errors that lead to unwelcome results is not. Plausible justifications can also be found for deleting certain data points, though again, only doing this to the unwelcome ones is invalid. All of these effectively introduce sampling bias into the reports.
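
The effect of such selective "corrections" can be demonstrated with a short sketch (again with invented data, assuming SciPy): deleting only the data points that work against the hoped-for result pushes the ''p''-value down without adding any real evidence, which is exactly the sampling bias described above.

<pre>
# Illustrative sketch of why deleting only "unwelcome" data points is invalid.
# The measurements are invented; the point is that the p-value drops purely
# because of biased data handling, not because of any new evidence.
from scipy import stats

treatment = [5.1, 4.8, 6.2, 5.9, 3.9, 6.5, 5.4, 4.1, 6.0, 5.7]
control   = [5.0, 4.9, 5.2, 5.6, 5.8, 4.7, 5.3, 5.5, 4.6, 5.1]

print("p-value with all data:", stats.ttest_ind(treatment, control).pvalue)

# Quietly drop the two lowest treatment values -- exactly the points that
# work against the hypothesis -- and the p-value shrinks considerably.
trimmed_treatment = sorted(treatment)[2:]
print("p-value after cherry-picking:", stats.ttest_ind(trimmed_treatment, control).pvalue)
</pre>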
  
The 0.050 entry demanding a "redo calculations" may also be a commentary on the precision expected in the harder sciences: the rest of the chart implicitly covers ranges of values rather than only the exact numbers listed, so getting a ''p''-value of exactly 0.050 raises the possibility of an error in the calculations, and the actual result may turn out to be either higher or lower.
  
Values between 0.051 and 0.06 are labelled as being "on the edge of significance". This illustrates the regular use of "creative language" to qualify significance in reports, as a flat "not significant" result may look 'bad'. The validity of such use is of course a contested topic, with debates centering on whether ''p''-values slightly larger than the significance level should be noted as nearly significant or flatly classed as not-significant. The logic of having such an absolute cutoff point for significance may be questioned.
  
Values between 0.07 and 0.099 continue the trend of using qualifying language, calling the results "suggestive" or "relevant". This category also illustrates the 'technique' of resorting to adjusting the significance threshold. Appropriate {{w|Design of experiments|experimental design}} requires that the significance threshold be set prior to the experiment, not allowing changes afterward in order to "get a better experiment report", as this would again insert bias into the result. A simple change of the threshold (e.g. from 0.05 to 0.1) can change an experiment's result from "not significant" to "significant". Although the statement "significant at the ''p''<0.10 level" is technically true, it would be highly frowned upon to use in an actual report.
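
The effect of moving the threshold can be seen in a tiny sketch (the value 0.06 is just an example): the same result flips from "not significant" to "significant" with no new data at all.

<pre>
# Illustrative sketch: the identical result is judged differently depending on
# where the significance threshold is placed, which is why the threshold must
# be fixed before the experiment rather than adjusted afterwards.
p_value = 0.06   # example result

for threshold in (0.05, 0.10):
    verdict = "significant" if p_value < threshold else "not significant"
    print(f"at the p<{threshold} level, p = {p_value} is {verdict}")
</pre>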
 
Values higher than 0.1 are usually considered not significant at all; however, the comic suggests taking a part of the sample (a ''subgroup'') and analyzing that subgroup without regard to the rest of the sample. Choosing to analyze a subgroup ''in advance for scientifically plausible reasons'' is good practice. For example, a drug to prevent heart attacks is likely to benefit men more than women, since men are more likely to have heart attacks. Choosing to focus on a subgroup after conducting an experiment may also be valid if there is a credible scientific justification - sometimes researchers learn something new from experiments. However, the danger is that it is usually possible to find and pick an arbitrary subgroup that happens to have a better ''p''-value simply due to chance. A researcher reporting results for subgroups that have little scientific basis (the pill only benefits people with black hair, or only people who took it on a Wednesday, etc.) would clearly be "cheating." Even when the subgroup has a plausible scientific justification, skeptics will rightly be suspicious that the researcher might have considered numerous possible subgroups (men, older people, fat people, sedentary people, diabetes sufferers, etc.) and only reported the subgroups for which there are statistically significant results. This is an example of the {{w|multiple comparisons problem}}, which is also the topic of [[882: Significant]].
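
The danger is easy to reproduce in a quick simulation (a sketch with purely artificial data): even when the "treatment" does nothing at all, each arbitrary subgroup still has roughly a 5% chance of looking significant, so scanning 20 of them gives about a 64% chance that at least one does.

<pre>
# Illustrative sketch of the multiple comparisons problem: the "treatment"
# below has no effect whatsoever, yet testing 20 arbitrary subgroups will
# often turn up at least one with p < 0.05 purely by chance
# (about a 64% chance with 20 subgroups).
import random
from scipy import stats

n_people = 1000
n_subgroups = 20

outcomes = [random.gauss(0, 1) for _ in range(n_people)]             # pure noise
treated  = [random.random() < 0.5 for _ in range(n_people)]          # random assignment
subgroup = [random.randrange(n_subgroups) for _ in range(n_people)]  # arbitrary labels

for g in range(n_subgroups):
    treated_outcomes = [o for o, t, s in zip(outcomes, treated, subgroup) if s == g and t]
    control_outcomes = [o for o, t, s in zip(outcomes, treated, subgroup) if s == g and not t]
    p = stats.ttest_ind(treated_outcomes, control_outcomes).pvalue
    if p < 0.05:
        print(f"Hey, look at this interesting subgroup: {g} (p = {p:.3f})")
</pre>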
 
If the results cannot normally be considered significant, the title text suggests, as a last resort, inverting p<0.050 to p>0.050. This leaves the statement mathematically true, but may fool casual readers, as the single-character change may go unnoticed or be dismissed as a typographical error ("no one would claim their results aren't significant, they must mean p<0.050"). Of course, the statement on its face is useless, as it is equivalent to stating that the results are "not significant".
 
==Transcript==
:[A two-column table where the second column selects various areas of the first column using square brackets.]
:{| class="wikitable alternance"
! P-value
! Interpretation
|-
| 0.001
| rowspan="4"| Highly significant
|-
| 0.01
|-
| 0.02
|-
| 0.03
|-
| 0.04
| rowspan="2"| Significant
|-
| 0.049
|-
| 0.050
| Oh crap. Redo calculations.
|-
| 0.051
| rowspan="2"| On the edge of significance
|-
| 0.06
|-
| 0.07
| rowspan="4"| Highly suggestive, relevant at the p<0.10 level
|-
| 0.08
|-
| 0.09
|-
| 0.099
|-
| ≥0.1
| Hey, look at this interesting subgroup analysis
|}
  
 
{{comic discussion}}

[[Category:Charts]]
[[Category:Statistics]]
[[Category:Scientific research]]
 
