Difference between revisions of "2731: K-Means Clustering"

(Explanation)
(Explanation: Still more than three groups. And B=!A is the default *except when it's not...*)
Revision as of 11:04, 31 January 2023

K-Means Clustering
Title text: According to my especially unsupervised K-means clustering algorithm, there are currently about 8 billion types of people in the world.

Explanation

This explanation may be incomplete or incorrect: Created by 3 TYPES OF EDITORS - Please change this comment when editing this page. Do NOT delete this tag too soon.
If you can address this issue, please edit the page! Thanks.
Ponytail is giving a talk about her research group's analysis of the different types of people there are in the world.

A popular class of wry observations uses the snowclone "There are two types of people in the world... those that do A, and those that do B". Here B will usually, though not always, be some antithesis of A. The most self-referential version is the joke "There are two types of people in the world - those that divide people into two types, and those that don't". Another well-known joke is: "There are two types of people in the world - those that can interpolate..."

Ponytail uses k-means clustering with k=3. This is a method of categorizing data. To explain how it works, imagine a set of people of various heights and weights who should be split into 3 groups (which gives k the value 3). One way to do this is to plot the data on a scatter chart, pick three points at random as reference points, and sort the people according to which reference point they are closest to, forming 3 initial groups. The average of the data points in each group is then found; these averages are used as new reference points to once again sort all the data into 3 new groups. This process is repeated until the result converges; that is, the data points no longer change groups even after new reference points are computed.
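The steps described above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the heights and weights are made up, the function name kmeans is hypothetical, and the initial reference points are assumed to be k randomly chosen data points (one of several common ways to start).

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Minimal k-means sketch; returns (labels, reference_points)."""
    # Assumption: start from k distinct data points chosen at random.
    centroids = random.Random(seed).sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Sort every point into the group of its nearest reference point.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed groups
        labels = new_labels
        # Move each reference point to the average of its group.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty group keeps its old reference point
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Made-up (height in cm, weight in kg) data, purely for illustration.
people = [(150, 50), (152, 53), (155, 49),
          (170, 70), (172, 68), (175, 73),
          (190, 95), (192, 99), (188, 92)]

labels, centroids = kmeans(people, k=3)
for person, label in zip(people, labels):
    print(person, "-> group", label)
```

Which groups come out depends on the random starting points, which is why practical uses often repeat the procedure from several different starts.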

The k-means algorithm is quite simple, which contributes to its popularity, but it has a major drawback: the analyst has to decide how many groups (or clusters) to split the data into (that is, what to set k equal to). A value of k that doesn't match the underlying structure of the data can yield a partitioning that's hard to explain in terms of properties that distinguish each cluster (in other words, their qualitative interpretation is unclear).

Ponytail's determination that there are three clusters is unsurprising if she herself falls into the category of those who use k=3 as a fixed value, which will inevitably result in three data clusters. However, the joke is that one group's defining trait is "uses k=3", which logically means that everyone outside that group does not use k=3. With two other groups, that description applies equally to both of them, so it is unclear what distinguishes the other two groups from each other.

This could easily have been fixed by naming the groups as those who use k-means clustering with k=3, those who use k<3 and those who use k>3. So splitting the rest into just two groups would seem to be no problem... except for those who do not have a preconceived value of k at all! (Ideally, one would perhaps look for the lowest practical k that gives the least total scatter of points away from their cluster's focus; there are various competing ways of doing this, depending on the details of the analysis.)
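As a rough illustration of how one might look for a "lowest practical k", the sketch below runs a basic k-means for several values of k and prints the total scatter (the sum of squared distances from each point to its cluster's reference point). The data, the names kmeans and total_scatter, and the choice of trying ten random starts per k are all assumptions made for this example; real analyses use a variety of more careful criteria.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Basic k-means sketch; returns (labels, reference_points)."""
    centroids = random.Random(seed).sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed groups
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

def total_scatter(points, labels, centroids):
    """Sum of squared distances from each point to its group's reference point."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

# Made-up (height in cm, weight in kg) data with three loose groups.
people = [(150, 50), (152, 53), (155, 49),
          (170, 70), (172, 68), (175, 73),
          (190, 95), (192, 99), (188, 92)]

scatter_by_k = {}
for k in range(1, 7):
    # Keep the best of several random starts, since a single run
    # can settle on a poor grouping.
    scatter_by_k[k] = min(
        total_scatter(people, *kmeans(people, k, seed=s)) for s in range(10)
    )
    print(k, round(scatter_by_k[k], 1))
```

The scatter can always be pushed lower by adding more clusters, so the usual approach is to look for the k beyond which the improvement becomes small rather than simply taking the minimum.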

In the title text Ponytail, or maybe it is Randall, claims that: "According to my especially unsupervised K-means clustering algorithm, there are currently about 8 billion types of people in the world."

This seems to be Randall saying that every human is unique and cannot be meaningfully clustered together in groups. The human population passed 8 billion on 2022-11-15, about 2.5 months before this comic came out.

The title text applies the k-means algorithm to an absurdly exaggerated variant of this problem. If the number of clusters is equal to the number of data points, each point will be assigned to a separate cluster whose reference point matches it exactly; in other words, each member is the sole member of its own group. With such parameters, it is impossible to comment meaningfully on similarities between any two members. This is humorous because it would make the result useless for the purposes for which clustering algorithms are typically used, such as building insurance risk pools or choosing targets for advertising campaigns. When assessing whether k or k±1 clusters describe the data more usefully, a weighting related to the number of clusters, or to the number of points per cluster, might encourage identical clusters (for exactly coincident member points) to be merged, and likewise clusters of near-identical points, so that the result actually involves some clustering. Randall's unconstrained algorithm seems to have no such metric and stops at the 'perfect' initial assumption where k≡n.
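The k≡n case is easy to reproduce on a small scale. The sketch below uses five made-up people instead of 8 billion and assumes all data points are distinct; it places one reference point on every data point and then checks the resulting groups.

```python
import math

# Made-up (height in cm, weight in kg) data; all points are distinct.
people = [(150, 50), (162, 58), (171, 70), (183, 82), (195, 99)]

# k == n: one reference point sitting exactly on each data point.
centroids = list(people)

# Every person is closest to "their own" reference point...
labels = [
    min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
    for p in people
]
# ...so every group has exactly one member and the total scatter is zero.
cluster_sizes = [labels.count(j) for j in range(len(centroids))]
scatter = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(people, labels))

print(labels, cluster_sizes, scatter)
```

A scatter of zero looks like a perfect fit, but it says nothing about how any two people resemble each other, which is the point of the joke.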

Interestingly, by including the entire human population, the algorithm should be immune to bias in creating its input data. However, if every human is unique as Randall's algorithm claims, the only way to have the clusters converge is to "throw out" some traits of humans as unimportant. This may be objectionable to humans who disagree with that assessment. In contrast, in a supervised algorithm, the training data is tagged with traits that the trainers seek. These traits could be applied in a manner that is socially unacceptable, and lead to AI behavior that reflects the biases of the trainers.

Transcript

[Ponytail is standing on a podium pointing a stick towards a poster hanging behind her. The writings and figures on the poster are illegible. But there seems to be a large scatter plot at the top with a heading above it. Also a couple of tables beneath this. She addresses an unseen audience in front of the podium.]
Ponytail: Our analysis shows that there are three kinds of people in the world:
Ponytail: Those who use k-means clustering with k=3, and two other types whose qualitative interpretation is unclear.



Discussion

The wikipedia article does not clear anything up 162.158.78.228 13:53, 30 January 2023 (UTC)Bumpf

Yeah. A while back I read a wikipedia article and was determined, for once, to completely understand it. Four years later, I had a PhD in an obscure (and totally useless) element of esoteric math. BTW, it turns out the article was completely wrong! /S 172.70.114.7 12:54, 2 February 2023 (UTC)Bumpf aussi

The "Convergence of k-means" animation is reasonably distinctive for a two-dimensional case, showing at least the motivation for the problem. Could it be attached here? Mia yun Ruse (talk) 14:08, 30 January 2023 (UTC)

Yeah, this is probably the least explanatory Explain xkcd I've read in the past 3 years. Still a lot of heavy math. 162.158.186.95 16:50, 30 January 2023 (UTC)

This feels very similar to the joke "There are 10 types of people: those who know binary and those who don't." Except that the real joke here is that Ponytail doesn't have anything meaningful to justify her version. 172.70.206.150 17:45, 30 January 2023 (UTC)

Current explanation claims that since every human is unique, clusters can only be formed by ignoring some traits. This seems false; a cluster could depend on multiple traits, so there's no obvious limit to the number of traits that could be used when forming clusters. Perhaps they mean that clusters can only be formed by combining non-identical points into the same cluster, but that's literally the entire purpose of clustering and applies to all clustering ever, so it seems like both a trivial observation and a non-sequitur. Am I missing something? 172.70.211.90 19:54, 30 January 2023 (UTC)

Yes, the joke about why there are 8 billion clusters mentioned in the title text. 162.158.78.220 20:47, 30 January 2023 (UTC)
No, I did not miss that. --172.70.211.136 22:53, 30 January 2023 (UTC)
While it's true that clusters can depend on multiple traits, a cluster that depends on ALL human traits at once (or a very large number of them) is useless in practice. A useful cluster depends on a relatively limited number of traits. I think that's where the "ignoring" comes in. 162.158.146.208 22:30, 30 January 2023 (UTC)
Supposing that's true, that would apply to any sample of humans. The "since all humans are unique" part would still be false, and the comment still wouldn't make sense in context as a response to the specific scenario of 8 billion humans. --172.70.211.136 22:53, 30 January 2023 (UTC)
Most people would object to the idea that they are fully defined by their DNA. Yet even taking just DNA, the probability of two humans having the same is practically zero. Even identical twins have differences in DNA due to radiation and toxins! Sure, 99% of DNA is identical between all humans (it is what makes them human), but DNA is over 6 gigabase pairs. And how many do you think criminalists need in DNA identification to ensure match probabilities of 1 in a quintillion? Just hundreds. Yes, every human is unique. -- Hkmaly (talk) 02:50, 31 January 2023 (UTC)
Obviously humans are unique, and I never suggested otherwise. The thing that's false is the complete statement "it's necessary to ignore some traits BECAUSE all humans are unique". I actually think "it's necessary to ignore some traits" is not well-supported even if you stop there, but even if that part is true, it's definitely not a RESULT of all humans being unique. The current explanation reads like someone is twisting the topic to squeeze in a comment about their hobby horse even though it's not actually relevant. --162.158.90.38 00:37, 1 February 2023 (UTC)
It's just wrong to say you have to ignore some traits. I'm a data scientist and I've actually used k-means clustering at my job... everyone *is* unique so, you do lose information when you bucket them, but it isn't because you're throwing out some traits. You're just defining groups based on those traits. If I've got 20 people of all different heights, grouping them into "tall" and "short" is not throwing out height as a trait. The explanation is simply wrong. 172.70.38.77 13:48, 2 February 2023 (UTC)

Many people object to being defined by some group they belong to. E.g. people object to blanket statements about members of political parties ("I'm a Republican, but I'm pro-choice"), religions, age groups (the adage "If You Are Not a Liberal at 25, You Have No Heart. If You Are Not a Conservative at 35 You Have No Brain"), etc. I think this is the idea that the title text is going for. Barmar (talk) 20:43, 31 January 2023 (UTC)

There are two types of people in the world: those who use the word “who” to refer to people and the word “that” to refer to things, and those who don’t. 172.71.151.77 02:58, 31 January 2023 (UTC)

...and those whom use "whom"..? 172.70.162.57 09:00, 31 January 2023 (UTC)
Sure, there are plenty who misuse “whom” also. “Who / he / she / they VERB” vs “PREPOSITION whom / him / her / them” - who did, who has, who owns, he did, she has, they own - for whom, by whom, about whom, for him, by her, about them. A person who, a thing that. It’s really not that complicated. 172.71.147.21 10:48, 1 February 2023 (UTC)

Methodological note: k-means is a special case of parametric model-based clustering (here spheres with equal variance), which allows one to calculate cluster models with different numbers of clusters and choose the 'best' one according to the BIC (Bayesian Information Criterion), see https://cran.r-project.org/package=mclust. A broader non-parametric class of cluster solutions can be fitted with the truecluster meta-algorithm, choosing the one with the best CIC (Cluster Information Criterion), see https://arxiv.org/abs/cs/0601001 and https://arxiv.org/abs/0705.4302. Joehl (talk) 16:55, 2 February 2023 (UTC)

editorial

This sentence clause appears to contain a typo: ", and indicate on a graph of the data has two distinct populations". It might be clearer as ", and indicate on a graph if the data has two distinct populations". (Fixed)