Difference between revisions of "Talk:2118: Normal Distribution"

(35 intermediate revisions by 23 users not shown)
Latest revision as of 08:23, 21 August 2020

Is there a statistician in the house? Hawthorn (talk) 15:32, 1 March 2019 (UTC)

   I think they all got annoyed at the graph and left. Margath (talk) 15:46, 1 March 2019 (UTC)

Of course there is! 162.158.214.22 15:44, 1 March 2019 (UTC)

As an example: when measuring the height of people in the same age bracket, you'd expect the number of people at each height to look like this graph. There will be a lot of people around the average height, fewer a foot shorter/taller, some (but very few) exceptionally tall people, and some (but very few) exceptionally short people. The x-value represents the height; the y-value essentially represents the share of the population at that height. When we measure the middle 50% of the population using vertical bars, people at a certain height are either inside OR outside the middle. Randall uses horizontal bars here, which means some people at a certain height will be counted in the middle 50%, but other people with the same height won't be. In fact, some people with the exact average height of the whole population would fall outside the middle. 108.162.241.214 16:01, 1 March 2019 (UTC)
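For comparison, the usual vertical-bar version of "the middle 50%" is easy to check. The snippet below is a minimal sketch, assuming heights have been standardized to a standard normal (mean 0, standard deviation 1); that scaling is an assumption for illustration, not something the comic specifies.

```python
from statistics import NormalDist

# Assumption: heights are standardized, so the population is a standard normal.
population = NormalDist(mu=0.0, sigma=1.0)

# The vertical bars sit at the 25th and 75th percentiles.
lower = population.inv_cdf(0.25)
upper = population.inv_cdf(0.75)
share_inside = population.cdf(upper) - population.cdf(lower)

print(f"middle 50% lies between {lower:.4f} and {upper:.4f} standard deviations")
print(f"share of the population inside: {share_inside:.3f}")
```

With vertical bars, everyone at a given height is either inside or outside the interval, which is exactly the property the horizontal bars lose.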

Feel free to rip me apart for referring to it as the "number of people at each height", since the y-axis is more complicated than a simple count. 108.162.241.214 16:03, 1 March 2019 (UTC)

Just to say, Randall's horizontal slice isn't entirely meaningless. It's a calculation I've had to do, where I have a series of binned samples of a population (say I knew how many fell in -10..10, how many fell in -5..5, how many fell in -2..2) and wanted to combine them with an appropriate weighting to approximate a Gaussian. I was using it for filtering, but it's logically similar. Fluppeteer (talk) 16:19, 1 March 2019 (UTC)

Also, the slice sampler for MCMC is a trick for sampling from a distribution by "turning it on its side". But I don't think the 50% figure would be meaningful in that context. (Though the 52.7% number on this graph would be.) 172.68.54.136 21:16, 1 March 2019 (UTC)
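For readers who haven't met slice sampling, here is a rough sketch of the idea for a standard normal. It assumes the unnormalized density exp(-x^2/2), for which the horizontal slice at any height is an interval that can be written down directly; a general-purpose slice sampler would need a step-out or shrinkage procedure instead. The function name and defaults are made up for illustration.

```python
import math
import random

def slice_sample_standard_normal(n_samples, x0=0.0, seed=0):
    """Slice sampler for the (unnormalized) density exp(-x**2 / 2)."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        # Vertical step: pick a height uniformly under the curve at x,
        # working with log-heights; 1 - rng.random() lies in (0, 1].
        log_u = -0.5 * x * x + math.log(1.0 - rng.random())
        # Horizontal step: the slice {x : exp(-x**2 / 2) > u} is (-h, h),
        # so draw the next x uniformly from that interval.
        half_width = math.sqrt(-2.0 * log_u)
        x = rng.uniform(-half_width, half_width)
        samples.append(x)
    return samples

samples = slice_sample_standard_normal(20_000)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.3f}, variance={variance:.3f}")
```

The sample mean and variance should land near 0 and 1, which is the sense in which alternating vertical and horizontal moves recovers the original distribution.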

Pedant: etymologically, there *is* actually a connection between a normal (to a surface or line) and the normal distribution; the former comes from the Latin for a set square (giving you perpendicular), and it later came to mean "standard". The "tangential distribution" certainly fits the etymology of "odd/unusual" though. Fluppeteer (talk) 16:26, 1 March 2019 (UTC)

This reminds me of the difference between Riemann(-Stieltjes) and Lebesgue integration. 172.68.54.160 20:16, 1 March 2019 (UTC)
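To make the comparison concrete, the sketch below computes the area under a standard normal curve in both ways: by vertical strips (Riemann-style) and by horizontal layers (loosely in the spirit of Lebesgue integration, often called the layer-cake formula). It is only an illustration under that assumption; the function names and step counts are arbitrary.

```python
import math

PEAK = 1.0 / math.sqrt(2.0 * math.pi)  # height of the standard normal pdf at 0

def pdf(x):
    return PEAK * math.exp(-0.5 * x * x)

def area_by_vertical_strips(n=100_000, x_max=8.0):
    """Riemann-style: chop the x-axis, add up strip heights (midpoint rule)."""
    dx = 2.0 * x_max / n
    return sum(pdf(-x_max + (i + 0.5) * dx) for i in range(n)) * dx

def area_by_horizontal_layers(n=100_000):
    """Layer-cake style: chop the y-axis, add up the width of {x : pdf(x) > y}."""
    dy = PEAK / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) / n                             # height as a fraction of the peak
        width = 2.0 * math.sqrt(-2.0 * math.log(t))   # curve is above this height on (-w/2, w/2)
        total += width * dy
    return total

vertical = area_by_vertical_strips()
horizontal = area_by_horizontal_layers()
print(f"vertical strips: {vertical:.6f}, horizontal layers: {horizontal:.6f}")
```

Both sums approach 1, the total area under the curve; the horizontal version is the one that resembles Randall's slicing.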

As the axes are not labeled (see comic 833) we could consider this a multivariate distribution where one parameter is uniform and the other is normal. That was my first thought when I saw this. 172.68.34.88 18:43, 1 March 2019 (UTC)

Is there any meaning to midpoint: 52.7%? Maybe that is the arbitrary center he formed the horizontal bounds around? Maybe it relates to data? Is this a reference to something? It's certainly reminiscent of how normal distributions produce statistically meaningful numbers that have weird decimals in them (like the % represented by being within so many standard deviations). 162.158.78.178 19:45, 1 March 2019 (UTC)

Maybe it's because the meaning of "50% of the chart lies between these lines" specifically becomes roughly useless for discerning error if the lines are not centered around the origin. 162.158.78.178 19:52, 1 March 2019 (UTC)
I might get it!!! The area between the lines is 52.7% of the total area: which means that 50% is technically included in what lies between them. 162.158.78.220 23:07, 1 March 2019 (UTC)

The correct way to do this is to have the topmost horizontal line equal to or above the top of the normal plot. Then the bottom-most line would represent the same values as vertical lines would. 162.158.78.220 23:32, 1 March 2019 (UTC)

Say I want to build a diverse team or a representative council. And it is more important that the selection is representative of several subpopulations (who should not be voted down by the majority) than that it gives an equal fair chance to anybody. I would cut away the absolute outliers and reduce the weight of the most abundant group - this gives just the area between the two lines. Sebastian --172.68.110.70 23:40, 1 March 2019 (UTC)

That's actually... not a horrible idea. Problem is, it's not robust to transformations of the X axis, because of the Jacobian multiplier that comes with such transformations. Which in practice would look like people loudly insisting they have nothing in common with each other ("we wear baseball hats with the brim to the RIGHT while those other completely unrelated people wear them with the brim to the LEFT"). 162.158.63.244 16:26, 2 March 2019 (UTC)
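A small sketch of the Jacobian point, under the assumption that the x-axis is relabelled by y = exp(x) (an arbitrary choice for illustration): two values that sit at the same height on the original curve, and so fall in the same horizontal band, end up at different heights after the relabelling.

```python
import math
from statistics import NormalDist

standard_normal = NormalDist()

def density_of_x(x):
    return standard_normal.pdf(x)

def density_of_y(y):
    """Density of Y = exp(X); the 1 / y factor is the Jacobian term."""
    return standard_normal.pdf(math.log(y)) / y

# Two points at the same height on the original curve...
same_height_ratio = density_of_x(1.0) / density_of_x(-1.0)
# ...sit at different heights once the x-axis is relabelled as y = exp(x).
relabelled_ratio = density_of_y(math.exp(-1.0)) / density_of_y(math.exp(1.0))

print(f"ratio before relabelling: {same_height_ratio:.3f}")
print(f"ratio after relabelling:  {relabelled_ratio:.3f}")
```

Vertical bars do not have this problem: the share of the population between two x-values is unchanged by a monotonic relabelling of the axis.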

Has somebody measured or calculated (by assuming normal distribution) the areas? It seems that the upper area is way smaller than the lower one, though both have the same 'height' in the middle. Is the 52.7% graphically correct? I tried half of the height at 0: .398942 and integrated, then I get 52.6% for the white area and 47.4% for the gray area. On the y-axis it seems that the three visible ticks are .1, .2, .3, then the gray area would be a bit broader than .2 and centered at .1. Sebastian --172.68.110.70 23:40, 1 March 2019 (UTC)
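The 47.4% / 52.6% split can be reproduced with a direct numerical integration. The sketch below assumes one particular reading of the comment: a gray band whose height is half the peak (.398942), centered on half the peak, so it runs from a quarter to three quarters of the peak height. The function name and step count are illustrative choices.

```python
import math

PEAK = 1.0 / math.sqrt(2.0 * math.pi)  # about 0.398942, the curve's height at 0

def pdf(x):
    return PEAK * math.exp(-0.5 * x * x)

def area_between_horizontal_lines(low, high, n=200_000, x_max=8.0):
    """Area under the curve with low < y < high, by the midpoint rule in x."""
    dx = 2.0 * x_max / n
    total = 0.0
    for i in range(n):
        height = pdf(-x_max + (i + 0.5) * dx)
        # Part of this vertical strip that falls inside the band.
        total += max(0.0, min(height, high) - low) * dx
    return total

# Assumed band: from 25% to 75% of the peak height.
gray = area_between_horizontal_lines(0.25 * PEAK, 0.75 * PEAK)
white = 1.0 - gray
print(f"gray: {gray:.3f}, white: {white:.3f}")
```

Under that assumption the gray band holds a little under half of the total area, so the band has to be slightly taller than half the peak to reach exactly 50%.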

Got Nerd Sniped by the number "52.7%", but failed on an analytic solution and settled for a quick and dirty numerical integration instead, which suggested that the exact number might be somewhere between .5268 and .5269, so I think I'm not far from the truth. As I see it, the shaded area is vertically centered around the vertical midpoint, with a relative vertical width chosen such that the shaded area is exactly 50% of the total area under the curve. Just as usual, only with vertical instead of horizontal binning, which of course is the twist that makes this graph puzzling, funny, and completely useless for meaningful interpretation.

The label "52.7%" is not an addition to the Midpoint label but instead gives the width of the vertical bin, as a percentage of the vertical height of the curve. I read the ticks on the vertical axis to indicate just quarters of the curve maximum, which is consistent with my understanding of "Midpoint".

Oh, and you are certainly right in that the marginal distributions at the top and the bottom are asymmetric, as is the Gaussian when viewed sideways.

172.68.110.64 23:56, 1 March 2019 (UTC)
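The numerical result above can be reproduced in a few lines. This is a sketch rather than the commenter's actual calculation: it assumes a standard normal curve, a band centered on half the peak height, and uses the fact that the area below a horizontal line splits into a rectangle between the two crossing points plus the two tails outside them. A bisection then finds the band width that gives exactly 50%.

```python
import math

PEAK = 1.0 / math.sqrt(2.0 * math.pi)

def area_below_line(t):
    """Area under the standard normal curve below the line y = t * PEAK, 0 < t < 1."""
    x = math.sqrt(-2.0 * math.log(t))       # curve crosses the line at -x and +x
    rectangle = 2.0 * x * t * PEAK          # flat-topped part between the crossings
    tails = math.erfc(x / math.sqrt(2.0))   # both tails outside the crossings
    return rectangle + tails

def shaded_area(width):
    """Area of a band centered on half the peak; width is a fraction of the peak."""
    return area_below_line(0.5 + width / 2.0) - area_below_line(0.5 - width / 2.0)

def solve_width(target=0.5, tol=1e-12):
    """Bisection: the shaded area grows monotonically with the band width."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if shaded_area(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

width = solve_width()
print(f"band width: {width:.4%} of the peak height")
```

The width comes out at about 0.527 of the peak height, in line with the comic's label and with the range quoted above.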

Feh. You merely have to integrate something like Sqrt[Log[x]] which I'm too lazy for and use Mathematica instead which gives...<covers eyes>...what was #2117 about again? 162.158.94.2 11:57, 2 March 2019 (UTC)
There's a way to (attempt to) symbolically integrate functions involving things like e^(-x^2) like you have with the normal distribution (Cherry's extension of the Risch algorithm, see his thesis or his 1985 paper), but I have no idea how to apply it here. It's definitely a very complex procedure. As I understand it, even Mathematica has not implemented it in full. - CRGreathouse (talk) 03:59, 3 March 2019 (UTC)
I found this calculation of the number 52.7% on Wolfram Community: https://community.wolfram.com/groups/-/m/t/1623478 I found the area subtraction diagram near the middle most useful for understanding the basic idea of it. Also, a related question on Quora: https://www.quora.com/In-the-xkcd-comic-Normal-Distribution-how-was-the-number-52-7-calculated Lamty101 (talk) 08:21, 21 August 2020 (UTC)

How to annoy a Democratic Liberal Statistician: point out that every identity group that they're trying to make "normal" falls to the far left or the far right of the normal distribution curve. Seebert (talk) 14:50, 2 March 2019 (UTC)

As somebody who happens to be all 3 of those things, I can confirm that your comment annoyed me. But only for bringing politics into a discussion that isn't political, and for misusing "normal" in a way like Randall's alt-text. The actual "edgy" political content of your post I find wrong but not particularly annoying. YMMV. 162.158.63.244 16:26, 2 March 2019 (UTC)
All statistics are ultimately political, in that they are used to politically argue for predetermined conclusions. Statistics aren't very useful at actually discovering anything not previously determined to be true. And it isn't me who has misused the word normal, it's those ~2% of the population identity groups that are now using the courts to claim to be normal, when mathematically, they'll never be normal. Seebert (talk) 15:14, 3 March 2019 (UTC)

"Completely meaningless?"
The explanation currently says, "Randall finds the area between two horizontal lines instead, which is mathematically completely meaningless." This doesn't seem right. Each of the two horizontal lines intersects the curve at points, and those points have meaningful values on the x axis. I'm not sure if they represent anything interesting (or rather, what their significance might be), but the result is that the horizontal lines are not meaningless. I'm a little reluctant to edit it because I'm not sure how much meaning to ascribe (and I also haven't measured or calculated what those points are), but the explanation as written seems improper. Do I have it wrong? JohnHawkinson (talk) 15:02, 2 March 2019 (UTC)

Nothing is ever completely meaningless. I think the change to "completely meaningless" may have been added by an annoyed statistician. I wrote the previous phrasing of it rarely being used for anything meaningful, so it seems impolite for me to edit it back. It's notable that implying there is meaning to the horizontal lines could be misleading to those new to statistics. It's also notable that the area between them represents a calculable portion of the sample sets, and that the points of intersection are just as meaningful as with vertical lines, two uses mentioned in comments above. 162.158.79.245 15:13, 2 March 2019 (UTC)

The horizontal division is vaguely reminiscent of Lebesgue integration. I wonder if that was intentional. Dfeuer (talk) 06:37, 3 March 2019 (UTC)

There is now a statistician in the house. I have added two paragraphs that discuss some of the fine points. This is wrong (which, of course, Randall knows) in so many ways! I tried to keep what I said simple, but it may need some expansion. I also don't think we need the graphic in the explanation because, as I say in the text I added, that is the wrong way to describe a nonsymmetric distribution like the "tangent distribution". Cjgeyer (talk) 22:56, 3 March 2019 (UTC)

Sloppy explanation
What I don't like are phrases like: "To turn that bar chart into a distribution, you'd get an infinite number of people, put them into age bins that are infinitely narrow, [...]". Infinitely narrow is actually zero or 0. No other interpretation exists.

Pictures
Hey @Zom-b, you changed the picture I set and gave the comment "I don't know what that other curve is, but it's not normal. (no) pun intended." The two pictures appear to have exactly the same curve in them. I was wondering what you meant by your comment? This is the first picture I've ever set in a wiki, and I worry I could have made an error. Here are the two pictures: Empirical Rule.PNG Standard deviation diagram.svg. I like the first one, mine, because the lines extend beyond the graph as Randall's do. I like the second one, yours, because it includes percentages over the graph as Randall's has. But the curves both appear normal, in both senses, to me? 162.158.79.113 13:05, 5 March 2019 (UTC)

Regarding "infinitely narrow", I disagree that this is sloppy wording; it is concisely describing something that tends to zero at the limit of infinity, which is useful information. Hawthorn (talk) 10:26, 20 March 2019 (UTC)
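The limiting behaviour behind "infinitely narrow" can be seen numerically. The sketch below assumes a standard normal and a bin centered at 0 (both arbitrary choices): as the bin narrows, the share of the population inside it shrinks toward zero, but that share divided by the bin width settles on the height of the curve.

```python
from statistics import NormalDist

standard_normal = NormalDist()

def bin_probability(center, width):
    """Share of the population falling in a bin of the given width."""
    half = width / 2
    return standard_normal.cdf(center + half) - standard_normal.cdf(center - half)

for width in (1.0, 0.1, 0.01, 0.001):
    p = bin_probability(0.0, width)
    print(f"width={width:<6} share in bin={p:.6f}  share/width={p / width:.6f}")

print(f"height of the curve at 0: {standard_normal.pdf(0.0):.6f}")
```

So a bin of width exactly zero would hold nobody, while the ratio in the limit is what the y-axis of the distribution reports.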