
{{comic
| number    = 2731
| date      = January 30, 2023
| title     = K-Means Clustering
| image     = k_means_clustering_2x.png
| imagesize = 320x385px
| noexpand  = true
| titletext = According to my especially unsupervised K-means clustering algorithm, there are currently about 8 billion types of people in the world.
}}

==Explanation==

[[Ponytail]] is giving a talk about her research group's analysis of the different types of people in the world.

A popular class of wry observations uses the {{wiktionary|snowclone}} "There are two types of people in the world... those who do A, and those who do B". Here B will usually, though not always, be some antithesis of A. The most self-referential version is the joke "There are two types of people in the world - those who divide people into two types, and those who don't". Other well-known versions include: "There are three types of people in the world - those who can count, and those who can't", "There are two types of people in the world - those who can extrapolate... ", and "There are 10 types of people in the world - those who understand binary and those who don't."

Ponytail uses {{w|K-means_clustering|''k''-means clustering}} with k=3. This is a method of categorizing data. To explain how it works, imagine a set of people of various heights and weights who should be split into 3 groups (which gives k the value 3). One way to do this would be to plot the data on a scatter chart, pick three points at random as reference points, and sort the people according to which reference point they are closest to, forming 3 initial groups. The average of the data points in each group is then computed, and these averages are used as new reference points to categorize all the data into 3 new groups. This process is repeated until it converges; that is, until no data point changes groups after the reference points are recomputed.
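
The procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not anything shown in the comic: the function name <code>k_means</code>, the <code>seed</code> and <code>max_iter</code> parameters, and the height/weight figures are all made up for the example, and the initial reference points are assumed to be drawn from the data points themselves.

```python
import random

def k_means(points, k, seed=0, max_iter=100):
    """Cluster (x, y) tuples into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: the random reference points are chosen among the data points.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest reference point (squared distance).
        new_labels = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                            + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed groups
        labels = new_labels
        # Move each reference point to the average of its group's members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                xs, ys = zip(*members)
                centroids[c] = (sum(xs) / len(xs), sum(ys) / len(ys))
    return centroids, labels

# Hypothetical (height in cm, weight in kg) measurements.
people = [(150, 48), (152, 51), (155, 50),
          (168, 65), (170, 68), (172, 66),
          (188, 90), (190, 95), (192, 92)]

centroids, labels = k_means(people, k=3)
for person, label in zip(people, labels):
    print(person, "-> group", label)
```

Which people end up together depends on the random starting points, so a different <code>seed</code> may produce a different grouping.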

The ''k''-means algorithm is quite simple, which contributes to its popularity, but it has a major drawback: the analyst has to determine how many groups (or clusters) to split the data into (that is, what to set k equal to). A value of k that doesn't match the underlying structure of the data can yield a partitioning that's hard to explain in terms of properties that distinguish each cluster (in other words, their qualitative interpretation is unclear).
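
A small sketch of this drawback, under stated assumptions: the one-dimensional data below is invented to contain two obvious groups, the helper <code>k_means_1d</code> is hypothetical, and the starting reference points are fixed by hand (rather than chosen at random) so that the outcome is reproducible.

```python
def k_means_1d(values, centroids, max_iter=100):
    """k-means on 1-D data from explicit starting centroids; returns (centroids, labels)."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # Assign each value to the nearest centroid (ties go to the lower index).
        new_labels = [
            min(range(len(centroids)), key=lambda c: (v - centroids[c]) ** 2)
            for v in values
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels

# Invented data with two obvious groups: one near 2 and one near 102.
values = [1, 2, 3, 101, 102, 103]

# Forcing k=3 by supplying three starting centroids.
centroids, labels = k_means_1d(values, centroids=[1, 3, 102])
print(labels)     # [0, 0, 1, 2, 2, 2]
print(centroids)  # [1.5, 3.0, 102.0]
```

With k forced to 3, the group near 2 is split into {1, 2} and {3}. Nothing in the data justifies that boundary, which is the kind of cluster whose qualitative interpretation is unclear.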

Ponytail's determination that there are three clusters is unsurprising if she herself falls into the category of those who use k=3 as a fixed value, which will inevitably result in three data clusters. The joke is that one group's defining trait is "uses k=3", which logically means that everyone outside that group does not use k=3. Since that description applies equally to both of the remaining groups, it is unclear what distinguishes them from each other.

In the title text Ponytail, or maybe it is [[Randall]], claims that: "According to my especially unsupervised K-means clustering algorithm, there are currently about 8 billion types of people in the world."

This seems to be Randall saying that every human is unique and cannot be meaningfully clustered together in groups. The {{w|Day of Eight Billion|human population passed 8 billion on 2022-11-15}}, two and a half months before this comic came out.

The title text takes the problem of choosing ''k'' to an absurd extreme. If the number of clusters is equal to the number of data points, each point will be assigned to a separate cluster for which it is an exact match; in other words, each member is the sole member of its own group. Such parameters make it impossible to meaningfully comment on similarities between any two members. This is humorous because it would make the result useless for the purposes for which clustering algorithms are typically used, such as making insurance risk pools or targeting advertisement campaigns. When deciding whether ''k'' or ''k±1'' clusters describe the data better, a weighting related to the number of clusters, or to the number of points per cluster, might encourage clusters to be merged when their member points coincide exactly or nearly so. Randall's unconstrained algorithm seems to have no such metric and stops at the 'perfect' initial assumption where ''k≡n''.
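
The ''k=n'' case can be illustrated with a short sketch. The records below are invented, the helper <code>assign</code> is hypothetical, and the example assumes that all data points are distinct and that every point starts out as its own reference point.

```python
def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
        for p in points
    ]

# Invented (height in cm, weight in kg) records, all distinct.
people = [(150, 50), (160, 62), (171, 70), (183, 85), (190, 95)]

# k = n: every data point is its own reference point.
centroids = list(people)
labels = assign(people, centroids)

# Number of members in each cluster.
cluster_sizes = [labels.count(c) for c in range(len(centroids))]

# Total squared distance from each point to its cluster's reference point.
total_squared_distance = sum(
    sum((a - b) ** 2 for a, b in zip(p, centroids[label]))
    for p, label in zip(people, labels)
)

print(cluster_sizes)           # [1, 1, 1, 1, 1]
print(total_squared_distance)  # 0
```

Every cluster has exactly one member and the fit is "perfect" (zero total distance), but the result says nothing about which people resemble each other.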

Interestingly, by including the entire human population, the algorithm should be immune to sampling bias in its input data. However, if every human is unique as Randall's algorithm claims, the only way to have the clusters converge is to "throw out" some traits of humans as unimportant. This may be objectionable to humans who disagree with that assessment. In contrast, in a supervised algorithm, the training data is tagged with traits that the trainers seek. These traits could be applied in a manner that is socially unacceptable, and lead to AI behavior that reflects the biases of the trainers.

==Transcript==
:[Ponytail is standing on a podium pointing a stick towards a poster hanging behind her. The writings and figures on the poster are illegible. But there seems to be a large scatter plot at the top with a heading above it. Also a couple of tables beneath this. She addresses an unseen audience in front of the podium.]
:Ponytail: Our analysis shows that there are three kinds of people in the world: 
:Ponytail: Those who use '''''k'''''-means clustering with k=3, and two other types whose qualitative interpretation is unclear.

{{comic discussion}}

[[Category:Comics featuring Ponytail]]
[[Category:Public speaking]]
[[Category:Statistics]]
[[Category:Scientific research]]