Difference between revisions of "2731: K-Means Clustering"

Explain xkcd: It's 'cause you're dumb.
(Explanation)
(Converted the first 2 paragraphs into the vernacular.)
Line 12: Line 12:
 
{{incomplete|Created by EITHER 8 BILLION OR 3 TYPES OF BOTS - Please change this comment when editing this page. Do NOT delete this tag too soon.}}
 
  
{{w|K-means_clustering|''k''-means clustering}} is a method of partitioning vectors into ''k'' clusters. For example, we might measure a population by two metrics and want to split the resulting scatter graph into distinct populations; the algorithm effectively draws {{w|Voronoi cell}}s around the cluster centers so as to minimize the within-cluster variances.
+
A popular class of wry observations begins with "There are two types of people in the world... those that $do-something, and those that $do-something-else". The most self-referential version is the joke "There are two types of people in the world - those that divide people into two types, and those that don't".
  
Ponytail's determination that there are three clusters is unsurprising if she herself falls into the category of those who use K=3 as a fixed value, which will inevitably result in three data clusters regardless of actual distribution. The qualitative interpretation of the other two categories — that is, what placement in the other two categories means — is unclear as Ponytail's analysis is either using a binary criterion (whether or not one sorts data into three groups) as the basis for sorting people into three categories, or is a black box using unknown criteria and she has only been able to determine that her own group shares the tendency to group things into threes.  
+
{{w|K-means_clustering|''k''-means clustering}} is a method of categorizing data: given a chosen number of clumps, ''k'', it assigns each data point to the clump whose center is nearest. For example, with k=2 it would split a population into two clumps, and a graph of the data would then show the two populations.
 +
 
 +
Ponytail's determination that there are three clusters is unsurprising if she herself falls into the category of those who use K=3 as a fixed value, which will inevitably result in three data clusters regardless of the actual distribution; this is a callback to the aforementioned joke. The qualitative interpretation of the other two categories (that is, what placement in them means) is unclear: either Ponytail's analysis is using a binary criterion (whether or not one sorts data into three groups) as the basis for sorting people into three categories, or it is a black box using unknown criteria and she has only been able to determine that her own group shares the tendency to group things into threes.
  
 
The title text refers to a k-means algorithm with the opposite problem: K is never reduced far enough to merge any two human beings into a common cluster based on shared traits. This is humorous because such a clustering would be useless for the purposes k-means clustering is typically used for, such as building insurance risk pools or choosing targets for advertising campaigns.
 

Revision as of 22:40, 30 January 2023

K-Means Clustering
Title text: According to my especially unsupervised K-means clustering algorithm, there are currently about 8 billion types of people in the world.

Explanation

This explanation may be incomplete or incorrect: Created by EITHER 8 BILLION OR 3 TYPES OF BOTS - Please change this comment when editing this page. Do NOT delete this tag too soon.
If you can address this issue, please edit the page! Thanks.

A popular class of wry observations begins with "There are two types of people in the world... those that $do-something, and those that $do-something-else". The most self-referential version is the joke "There are two types of people in the world - those that divide people into two types, and those that don't".

k-means clustering is a method of categorizing data: given a chosen number of clumps, k, it assigns each data point to the clump whose center is nearest. For example, with k=2 it would split a population into two clumps, and a graph of the data would then show the two populations.

Ponytail's determination that there are three clusters is unsurprising if she herself falls into the category of those who use K=3 as a fixed value, which will inevitably result in three data clusters regardless of the actual distribution; this is a callback to the aforementioned joke. The qualitative interpretation of the other two categories (that is, what placement in them means) is unclear: either Ponytail's analysis is using a binary criterion (whether or not one sorts data into three groups) as the basis for sorting people into three categories, or it is a black box using unknown criteria and she has only been able to determine that her own group shares the tendency to group things into threes.
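
The point about a fixed K can be illustrated with a small sketch. This is not taken from the comic; it is a plain-Python version of the usual Lloyd's iteration, and the function name kmeans, the variable people, the seeds and the uniformly random "population" are all assumptions made up for the illustration. The data has no cluster structure at all, yet asking for k=3 still returns three groups.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's iteration on 2-D points; returns one cluster label per point."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random (an illustrative choice).
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster happens to be empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels

# Hypothetical, structureless data: 300 points drawn uniformly from the unit square.
data_rng = random.Random(42)
people = [(data_rng.random(), data_rng.random()) for _ in range(300)]

labels = kmeans(people, k=3)
print(sorted(set(labels)))  # the distinct "kinds of people" the algorithm reports
```

The algorithm never questions whether three is the right number; it only decides where the three boundaries go, which is why two of Ponytail's types have no obvious interpretation.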

The title text refers to a k-means algorithm with the opposite problem: K is never reduced far enough to merge any two human beings into a common cluster based on shared traits. This is humorous because such a clustering would be useless for the purposes k-means clustering is typically used for, such as building insurance risk pools or choosing targets for advertising campaigns.
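
A rough sketch of why "one cluster per person" is a degenerate answer: the quantity k-means tries to minimize, the within-cluster sum of squares, drops to exactly zero when every point is its own cluster, so the "best" score is also the least informative one. The helper within_cluster_ss and the five made-up people below are assumptions for illustration only.

```python
def within_cluster_ss(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        # Per-coordinate mean of the cluster's members.
        mean = [sum(axis) / len(members) for axis in zip(*members)]
        total += sum(sum((x - m) ** 2 for x, m in zip(p, mean)) for p in members)
    return total

# Five hypothetical people described by (height in cm, age in years).
people = [(150.0, 20.0), (160.0, 35.0), (170.0, 50.0), (180.0, 65.0), (190.0, 80.0)]

one_type = [0] * len(people)           # K = 1: everybody in one cluster
all_unique = list(range(len(people)))  # K = n: one cluster per person

print(within_cluster_ss(people, one_type))    # 3250.0
print(within_cluster_ss(people, all_unique))  # 0.0
```

Scaled up to K of about 8 billion, the fit is perfect and says nothing about anyone.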

Interestingly, by including the entire human population, the algorithm should be immune to bias in creating its input data. However, since every human is unique,[citation needed] the only way to have the clusters converge is to "throw out" some traits of humans as unimportant. This may be objectionable to humans who disagree with that assessment. In contrast, in a supervised algorithm, the training data is tagged with traits that the trainers seek. These traits could be applied in a manner that is socially unacceptable, and lead to AI behavior that reflects the biases of the trainers.

This comic is potentially an adaptation of the joke "There are 10 kinds of people in the world: Those that understand binary and those that don't".

Transcript

This transcript is incomplete. Please help by editing it! Thanks.
Ponytail is presenting on a stage, pointing at a screen with a stick. The writing and any figures on the screen are illegible.
Ponytail: Our analysis shows that there are three kinds of people in the world: Those who use k-means clustering with k=3, and two other types whose qualitative interpretation is unclear.



Discussion

The wikipedia article does not clear anything up 162.158.78.228 13:53, 30 January 2023 (UTC)Bumpf

Yeah. A while back I read a wikipedia article and was determined, for once, to completely understand it. Four years later, I had a PhD in an obscure (and totally useless) element of esoteric math. BTW, it turns out the article was completely wrong! /S 172.70.114.7 12:54, 2 February 2023 (UTC)Bumpf aussi

The "Convergence of k-means" animation is reasonably distinctive for a two-dimensional case, showing at least the motivation for the problem. Could it be attached here? Mia yun Ruse (talk) 14:08, 30 January 2023 (UTC)

Yeah, this is probably the least explanatory Explain xkcd I've read in the past 3 years. Still a lot of heavy math. 162.158.186.95 16:50, 30 January 2023 (UTC)

This feels very similar to the joke "There are 10 types of people: those who know binary and those who don't." Except that the real joke here is that Ponytail doesn't have anything meaningful to justify her version. 172.70.206.150 17:45, 30 January 2023 (UTC)

Current explanation claims that since every human is unique, clusters can only be formed by ignoring some traits. This seems false; a cluster could depend on multiple traits, so there's no obvious limit to the number of traits that could be used when forming clusters. Perhaps they mean that clusters can only be formed by combining non-identical points into the same cluster, but that's literally the entire purpose of clustering and applies to all clustering ever, so it seems like both a trivial observation and a non-sequitur. Am I missing something? 172.70.211.90 19:54, 30 January 2023 (UTC)

Yes, the joke about why there are 8 billion clusters mentioned in the title text. 162.158.78.220 20:47, 30 January 2023 (UTC)
No, I did not miss that. --172.70.211.136 22:53, 30 January 2023 (UTC)
While it's true that clusters can depend on multiple traits, a cluster that depends on ALL human traits at once (or a very large number of them) is useless in practice. A useful cluster depends on a relatively limited number of traits. I think that's where the "ignoring" comes in. 162.158.146.208 22:30, 30 January 2023 (UTC)
Supposing that's true, that would apply to any sample of humans. The "since all humans are unique" part would still be false, and the comment still wouldn't make sense in context as a response to the specific scenario of 8 billion humans. --172.70.211.136 22:53, 30 January 2023 (UTC)
Most people would object to the idea that they are fully defined by their DNA. Yet even taking just DNA, the probability of two humans having the same is practically zero. Even identical twins have differences in DNA due to radiation and toxins! Sure, 99% of DNA is identical between all humans (it is what makes them human), but DNA is over 6 Gigabase pairs. And how many do you think criminalists need in DNA identification to ensure match probabilities of 1 in a quintillion? Just hundreds. Yes, every human is unique. -- Hkmaly (talk) 02:50, 31 January 2023 (UTC)
Obviously humans are unique, and I never suggested otherwise. The thing that's false is the complete statement "it's necessary to ignore some traits BECAUSE all humans are unique". I actually think "it's necessary to ignore some traits" is not well-supported even if you stop there, but even if that part is true, it's definitely not a RESULT of all humans being unique. The current explanation reads like someone is twisting the topic to squeeze in a comment about their hobby horse even though it's not actually relevant. --162.158.90.38 00:37, 1 February 2023 (UTC)
It's just wrong to say you have to ignore some traits. I'm a data scientist and I've actually used k-means clustering at my job... everyone *is* unique so, you do lose information when you bucket them, but it isn't because you're throwing out some traits. You're just defining groups based on those traits. If I've got 20 people of all different heights, grouping them into "tall" and "short" is not throwing out height as a trait. The explanation is simply wrong. 172.70.38.77 13:48, 2 February 2023 (UTC)

Many people object to being defined by some group they belong to. E.g. people object to blanket statements about members of political parties ("I'm a Republican, but I'm pro-choice"), religions, age groups (the adage "If You Are Not a Liberal at 25, You Have No Heart. If You Are Not a Conservative at 35 You Have No Brain"), etc. I think this is the idea that the title text is going for. Barmar (talk) 20:43, 31 January 2023 (UTC)

There are two types of people in the world: those who use the word “who” to refer to people and the word “that” to refer to things, and those who don’t. 172.71.151.77 02:58, 31 January 2023 (UTC)

...and those whom use "whom"..? 172.70.162.57 09:00, 31 January 2023 (UTC)
Sure, there are plenty who misuse “whom” also. “Who / he / she / they VERB” vs “PREPOSITION whom / him / her / them” - who did, who has, who owns, he did, she has, they own - for whom, by whom, about whom, for him, by her, about them. A person who, a thing that. It’s really not that complicated. 172.71.147.21 10:48, 1 February 2023 (UTC)

Methodological note: k-means is a special case of parametric model-based clustering (here spheres with equal variance), which allows one to fit cluster models with different numbers of clusters and choose the 'best' one according to the BIC (Bayesian Information Criterion), see https://cran.r-project.org/package=mclust. A broader non-parametric class of cluster solutions can be fitted with the truecluster meta-algorithm, after which one chooses the solution with the best CIC (Cluster Information Criterion), see https://arxiv.org/abs/cs/0601001 and https://arxiv.org/abs/0705.4302. Joehl (talk) 16:55, 2 February 2023 (UTC)

editorial

This sentence clause appears to contain a typo: ", and indicate on a graph of the data has two distinct populations". It might be clearer as ", and indicate on a graph if the data has two distinct populations". (Fixed)