Editing 2533: Slope Hypothesis Testing



The edit can be undone. Please check the comparison below to verify that this is what you want to do, and then save the changes below to finish undoing the edit.
Latest revision Your text
Line 8: Line 8:
  
 
==Explanation==
 
 +
{{incomplete|Created by a SCREAMINGLY SIGNIFICANT STAT STUDENT.  Note: there's a name for when the bone in your ear pulls away after exposure to loud noise, could be thematic to reference it.  There's probably also a name for the statistical mistake the comic demonstrates.  Do NOT delete this tag too soon.}}
"Slope hypothesis testing" is a method of testing whether the slope of a trend line fitted to a scatter plot differs significantly from a hypothesized value, typically zero.
  
In this comic, [[Cueball]] and [[Megan]] are performing a study comparing student exam grades to the volume of their screams. Student A has the worst grade and softest scream, but Student B has the ''best'' grade and Student C the ''loudest'' scream. A trendline has been plotted, indicating a positive correlation between grades and volume, but the p-value is high (0.586), indicating that the trend is not statistically significant. The p-value depends both on how well the data fit the trendline and on how many data points have been taken; the more data points there are and the better they fit, the lower the p-value and the more significant the result.
  
Megan complains about the insignificance of their results, so Cueball suggests having each student scream into the microphone a few more times. (The three students are still there as they can be seen behind them. The three students look like schoolkids; one of them is [[Jill]].)
+
Megan complains about the insignificance of their results, so Cueball suggests having each student scream into the microphone a few more times. (The three students are still there, as they can be seen behind them. The three students look like school kids; one of them is [[Science Girl]].)
  
Having the students scream again will not help, though, because it only provides more data on the screaming without providing more data on its relation to exam scores; the comic is a joke about poor statistical practice that likely occurs in the field today. The p-value is incorrectly recalculated based on the increased number of measurements without accounting for the fact that observations are nested within students. Each student keeps exactly the same test score (each new point presumably reuses the same datum as before) and has a scream volume that does not drift far either (each student's screams fall in a fairly consistent range that is far from overlapping with the others'). Megan is pleased by these results, but Cueball belatedly realizes this technique may not be scientifically valid. Cueball is correct (presuming that they are using simple linear regression). A more appropriate technique would account for the non-independence of the data (that is, that multiple data points come from each person). Examples of such techniques are multilevel modeling and Huber-White robust standard errors.
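Neither of the techniques named above is shown here, but a cruder stand-in illustrates the same idea: collapse the repeated screams to one mean per student before fitting, so that the regression again sees only three independent observations. The Python sketch below is an assumption-laden illustration, not the comic's data: the records list and every loudness and grade value in it are invented. With exactly three students the slope's t statistic has one degree of freedom, where Student's t distribution is the Cauchy distribution, so the two-sided p-value has the closed form 1 - (2/pi) * atan(|t|).

```python
import math
from collections import defaultdict

# Invented records: (student, scream loudness in dB, exam grade).
# Each student screams four times; the grade is the same on every repeat.
records = [
    ("A", 86.8, 65.0), ("A", 87.1, 65.0), ("A", 86.9, 65.0), ("A", 87.2, 65.0),
    ("B", 89.9, 95.0), ("B", 90.2, 95.0), ("B", 89.8, 95.0), ("B", 90.1, 95.0),
    ("C", 93.1, 80.0), ("C", 92.8, 80.0), ("C", 93.0, 80.0), ("C", 93.1, 80.0),
]

# Group the screams by student so each student contributes one observation.
screams = defaultdict(list)
grades = {}
for student, loudness, grade in records:
    screams[student].append(loudness)
    grades[student] = grade

xs = [sum(v) / len(v) for v in screams.values()]  # mean loudness per student
ys = [grades[s] for s in screams]                 # one grade per student

# Ordinary least-squares slope on the per-student means.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
beta = sxy / sxx
alpha = my - beta * mx
sse = sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))
se = math.sqrt(sse / (n - 2) / sxx)
t = beta / se

# The closed form below is valid only for 1 degree of freedom (three students).
assert n - 2 == 1
p = 1.0 - (2.0 / math.pi) * math.atan(abs(t))

print(f"{n} students: slope={beta:.2f}, p={p:.3f}")
```

With these invented numbers the averaged fit has the same three-point shape as the original chart, so the p-value stays far from significance (about 0.67) however many times each student screams.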
  
Measuring data multiple times can be a way to increase its accuracy but does not increase the number of data points with regard to another metric, and the horizontally clustered points on the chart make this visually clear. A more effective and scientifically correct way of gathering data would be to test other students and add their figures to the existing data, rather than repeatedly testing the same three students.
+
Measuring data multiple times can be a way to increase its accuracy, but does not increase the number of data points with regard to another metric, and the horizontally clustered points on the chart make this visually clear.
  
 
Common statistical formulae assume the data points are statistically independent, that is, that the test score and volume measurement from one point don't reveal anything about those of the other points. By measuring each individual's scream multiple times, Cueball and Megan violate the independence assumption (a person's scream volume is unlikely to be independent from one scream to the next) and invalidate their significance calculation. This is an example of pseudoreplication. Furthermore, Megan and Cueball fail to obtain new test scores for each student, which would further limit their statistical options.
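The effect of this pseudoreplication can be sketched numerically. The Python snippet below is an illustration only: the loudness and grade values are invented rather than read off the comic's chart, and slope_test and two_sided_p are hypothetical helper names. It fits a simple linear regression with loudness on the x-axis and grade on the y-axis, first to one scream per student and then to ten screams per student with the same grade copied onto every repeat, wrongly treating all thirty points as independent.

```python
import math

def two_sided_p(t, df, steps=20000):
    """Two-sided p-value for a t statistic, by Simpson integration of Student's t density."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)

def slope_test(xs, ys):
    """Least-squares slope and its p-value, treating every point as independent."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = my - beta * mx
    sse = sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))
    df = n - 2
    se = math.sqrt(sse / df / sxx)
    return beta, two_sided_p(beta / se, df)

# Invented data: (typical scream loudness in dB, exam grade) for students A, B, C.
students = [(87.0, 65.0), (90.0, 95.0), (93.0, 80.0)]

# One scream per student: three points, one degree of freedom.
beta_few, p_few = slope_test([s[0] for s in students], [s[1] for s in students])

# Ten screams per student, each within 0.45 dB of the typical loudness,
# with the same grade copied onto every repeat.
offsets = [(k - 4.5) / 10 for k in range(10)]
xs = [loud + off for loud, _grade in students for off in offsets]
ys = [grade for _loud, grade in students for _off in offsets]
beta_many, p_many = slope_test(xs, ys)

print(f"3 points:  slope={beta_few:.2f}, p={p_few:.3f}")
print(f"30 points: slope={beta_many:.2f}, p={p_many:.4f}")
```

With these invented numbers the slope barely changes (about 2.5 in both fits), while the p-value drops from roughly 0.67 to below 0.01, even though no new student has been measured.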
 
 
Another strange aspect of their experiment is that the p-values obtained during a typical linear regression assume there is uncertainty in the y-values, but the x-values are fully known, whereas in this experiment, they are reducing uncertainty in the x-values of their data, while doing nothing to improve knowledge of the y-values.
 
 
Moreover, even if the new data were statistically independent, this still appears to be a classic example of "p-hacking", where new data is added until a statistically significant p-value is obtained.
 
  
 
In current AI, there's a push toward "few-shot learning", where only a few data items are used to form conclusions, rather than the usual millions of them. This comic illustrates the danger of using such approaches without understanding them in depth.
 
  
Additionally, a common theme in some research is the discovery of correlations that do not survive independent reproduction.  This is because randomness with too few samples produces apparent correlations, and Randall has repeatedly made comics about this hopeful error (see [[111]], [[925]] and [[882]] among others).
+
Additionally, a common theme in some research is the discovery of correlations that do not survive independent reproduction.  This is because randomness with too few samples produces apparent correlations, and Randall has repeatedly made comics about this hopeful error.
  
In the title text, Megan and Cueball are trying to yell over each other, asking each other to speak up so they can be heard; they are presumably suffering tinnitus or other hearing problems after listening to so much shouting.
+
In the title text, Megan and Cueball are trying to yell over each other, asking each other to speak up so they can be heard, presumably because the yelling experiment has left them hard of hearing. Or possibly they have trouble speaking audibly because they score poorly on statistics exams.
  
 
==Transcript==
 
:[Three points are labeled "Student A", "Student B" and "Student C" from left to right in a scatter plot with axes labeled "Stats exam grade" (60-100) and "Scream loudness (decibels)" (86-94) with a trend line. Student B has the highest exam grade, followed by Student C and then Student A.]
+
{{incomplete transcript|Do NOT delete this tag too soon.}}
:[A line goes from the trend line to a box containing the following:]
+
:[Three points labeled "Student A", "Student B" and "Student C" in a scatter plot with axes labeled "Stats exam grade" (60-100) and "Scream loudness (decibel)" (86-94) with a trend line]
 +
:[A line goes from the trend line to a text box with the text:]
 
:β=1.94  
 
 
:p=0.586
 
  
:[In a frameless panel, Megan reads a piece of paper while facing Cueball while three students look at them from the background.]
+
:[In a frameless panel, Megan (holding a piece of paper) and Cueball are facing each other with three kids in the background]
 
:Megan: Darn, not significant.
 
:Cueball: We need more data. Have them each try yelling into the mic a few more times.
+
:Cueball: We need more data. Have them each try yelling in to the mic a few more times.
  
 
:[The same scatter plot as in the first panel except with more points for each of the students with slightly different decibel values, and the text in the text box changed to:]
 
Line 53: Line 51:
 
[[Category:Comics featuring Megan]]
 
 
[[Category:Comics featuring Cueball]]
 
[[Category:Comics featuring Jill]] <!-- The other two kids are also, well, kids, and thus not Hairy or Megan -->
+
[[Category:Comics featuring Science Girl]] <!-- The other two kids are also, well, kids, and thus not Hairy or Megan -->
[[Category:Statistics]]
+
[[Category:Science]]
 
[[Category:Charts]]
 
[[Category:Scientific research]]
 
[[Category:Kids]]
 
