2731: K-Means Clustering

Explain xkcd: It's 'cause you're dumb.
Revision as of 16:00, 30 January 2023 by 172.70.114.7 (talk) (})
Jump to: navigation, search
K-Means Clustering
According to my especially unsupervised K-means clustering algorithm, there are currently about 8 billion types of people in the world.
Title text: According to my especially unsupervised K-means clustering algorithm, there are currently about 8 billion types of people in the world.

Explanation

Ambox notice.png This explanation may be incomplete or incorrect: Created by EITHER 8 BILLION OR 3 TYPES OF BOTS - Please change this comment when editing this page. Do NOT delete this tag too soon.
If you can address this issue, please edit the page! Thanks.
k-means clustering is a method of categorising n vectors into k clusters. For example, we might categorise a population by two metrics, and want to best categorise this scatter graph into the distinct populations, algorithmically drawing Voronoi cells to decide the within-cluster variances.

The title text refers to a K-means algorithm applied to the human population, which has, as yet, not converged to assign any two human beings into a common cluster based on shared traits. This is humorous because it is effectively saying that each individual is an entirely unique individual, which would make such a clustering useless as there is literally no commonality between any two people so you can't make any cohort out of any two people for the purposes for which a K-Means Clustering is typically used, such as of making insurance risk pools or targets of advertisement campaigns.

Interestingly, by including the entire human population, the algorithm should be immune to bias in creating its input data. However, since every human is unique, the only way to have the clusters converge is to "throw out" some traits of humans as unimportant. This may be objectionable to humans who disagree with that assessment. In contrast, in a supervised algorithm, the training data is tagged with traits that the trainers seek. These traits could be applied in a manner that is socially unacceptable, and lead to AI behavior that reflects the biases of the trainers.

Transcript

Ambox notice.png This transcript is incomplete. Please help editing it! Thanks.
Ponytail is presenting on a stage, pointing a screen with a stick. The writings and possible figures on the screen are illegible.
Ponytail: Our analysis shows that there are three kinds of people in the world: Those who use k-means clustering with k=3, and two other types whose qualitative interpretation is unclear.


comment.png add a comment! ⋅ comment.png add a topic (use sparingly)! ⋅ Icons-mini-action refresh blue.gif refresh comments!

Discussion

The wikipedia article does not clear anything up 162.158.78.228 13:53, 30 January 2023 (UTC)Bumpf

Yeah. A while back I read a wikipedia article and was determined, for once, to completely understand it. Four years later, I had a PhD in an obscure (and totally useless) element of esoteric math. BTW, it turns out the article was completely wrong! /S 172.70.114.7 12:54, 2 February 2023 (UTC)Bumpf aussi

The "Convergence of k-means" animation is reasonably distinctive for a two-dimensional case, showing at least the motivation for the problem . Could it be attached here? Mia yun Ruse (talk) 14:08, 30 January 2023 (UTC)

Yeah, this is probably the least explanatory Explain xkcd I've read in the past 3 years. Still a lot of heavy math. 162.158.186.95 16:50, 30 January 2023 (UTC)

This feels very similar to the joke "There are 10 types of people: those who know binary and those who don't." Except that the real joke here is that Ponytail doesn't have anything meaningful to justify her version. 172.70.206.150 17:45, 30 January 2023 (UTC)

Current explanation claims that since every human is unique, clusters can only be formed by ignoring some traits. This seems false; a cluster could depend on multiple traits, so there's no obvious limit to the number of traits that could be used when forming clusters. Perhaps they mean that clusters can only be formed by combining non-identical points into the same cluster, but that's literally the entire purpose of clustering and applies to all clustering ever, so it seems like both a trivial observation and a non-sequitur. Am I missing something? 172.70.211.90 19:54, 30 January 2023 (UTC)

Yes, the joke about why there are 8 billion clusters mentioned in the title text. 162.158.78.220 20:47, 30 January 2023 (UTC)
No, I did not miss that. --172.70.211.136 22:53, 30 January 2023 (UTC)
While it's true that clusters can depend on multiple traits, a cluster that depends on ALL human traits at once (or a very large number of them) is useless in practice. A useful cluster depends on a relatively limited number of traits. I think that's where the "ignoring" comes in. 162.158.146.208 22:30, 30 January 2023 (UTC)
Supposing that's true, that would apply to any sample of humans. The "since all humans are unique" part would still be false, and the comment still wouldn't make sense in context as a response to the specific scenario of 8 billion humans. --172.70.211.136 22:53, 30 January 2023 (UTC)
Most people would object to the idea that they are fully defined by their DNA. Yet even taking just DNA, the probability of two humans having same is practically zero. Even identical twins have differences in DNA due to radiation and toxins! Sure, 99% of DNA is identical between all humans (is what makes them human), but DNA is over 6 Gigabase pairs. And how many do you think criminalists needs in DNA identification to ensure match probabilities of 1 in a quintillion? Just hundreds. Yes, every human is unique. -- Hkmaly (talk) 02:50, 31 January 2023 (UTC)
Obviously humans are unique, and I never suggested otherwise. The thing that's false is the complete statement "it's necessary to ignore some traits BECAUSE all humans are unique". I actually think "it's necessary to ignore some traits" is not well-supported even if you stop there, but even if that part is true, it's definitely not a RESULT of all humans being unique. The current explanation reads like someone is twisting the topic to squeeze in a comment about their hobby horse even though it's not actually relevant. --162.158.90.38 00:37, 1 February 2023 (UTC)
It's just wrong to say you have to ignore some traits. I'm a data scientist and I've actually used k-means clustering at my job... everyone *is* unique so, you do lose information when you bucket them, but it isn't because you're throwing out some traits. You're just defining groups based on those traits. If I've got 20 people of all different heights, grouping them into "tall" and "short" is not throwing out height as a trait. The explanation is simply wrong. 172.70.38.77 13:48, 2 February 2023 (UTC)

Many people object to being defined by some group they belong to. E.g. people objecct to blanket statements about members of political parties ("I'm a Republican, but I'm pro-choice"), religions, age groups (the adage "If You Are Not a Liberal at 25, You Have No Heart. If You Are Not a Conservative at 35 You Have No Brain"), etc. I think this is the idea that the title text is going for. Barmar (talk) 20:43, 31 January 2023 (UTC)

There are two types of people in the world: those who use the word “who” to refer to people and the word “that” to refer to things, and those who don’t. 172.71.151.77 02:58, 31 January 2023 (UTC)

...and those whom use "whom"..? 172.70.162.57 09:00, 31 January 2023 (UTC)
Sure, there are plenty who misuse “whom” also. “Who / he / she / they VERB” vs “PREPOSITION whom / him / her / them” - who did, who has, who owns, he did, she has, they own - for whom, by whom, about whom, for him, by her, about them. A person who, a thing that. It’s really not that complicated. 172.71.147.21 10:48, 1 February 2023 (UTC)

Methodological note: k-means is a special case of parametric model-based clustering (here spheres with equal variance) which allows to calculate cluster models with different number of clusters and choose the 'best' one according to the best BIC (Bayesian Information Criterion), see https://cran.r-project.org/package=mclust. A broader non-parametric class of cluster solutions can be fitted with the truecluster meta-algorithm and then choose the one with the best CIC (Cluster Information Criterion), see https://arxiv.org/abs/cs/0601001 and https://arxiv.org/abs/0705.4302. Joehl (talk) 16:55, 2 February 2023 (UTC)

editorial

This sentence clause appears to contain a typo: ", and indicate on a graph of the data has two distinct populations". It might be clearer as ", and indicate on a graph if the data has two distinct populations". (Fixed)