2739: Data Quality

Revision as of 18:29, 17 February 2024 by 162.158.146.89 (talk) (Details)
Data Quality
Title text: [exclamation about how cute your cat is] -> [last 4 digits of your cat's chip ID] -> [your cat's full chip ID] -> [a drawing of your cat] -> [photo of your cat] -> [clone of your cat] -> [your actual cat] -> [my better cat]

Explanation

Digital data can be compressed to make transmission and/or storage more efficient; some compression algorithms discard some information to improve the compression, which is known as lossy compression, since some of the information is lost (this can be acceptable in audio or visual data, since the difference may be hard for humans to perceive).
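The difference between lossless and lossy handling can be sketched in a few lines of Python. This is only an illustration: the `samples` data is made up, `zlib` stands in for any lossless compressor, and the `quantize` helper is a hypothetical toy, far cruder than what JPEG or MP3 actually do.

```python
import zlib

# Hypothetical "signal": 8-bit samples standing in for audio or pixel values.
samples = bytes((i * 37 + (i % 7)) % 256 for i in range(1000))

# Lossless: zlib (DEFLATE) round-trips to exactly the original bytes.
packed = zlib.compress(samples)
restored = zlib.decompress(packed)

# Lossy (toy example): keep only the top 4 bits of every sample.
# The discarded low bits cannot be recovered afterwards.
def quantize(data: bytes) -> bytes:
    return bytes(b & 0xF0 for b in data)

lossy = quantize(samples)

print(restored == samples)  # the lossless round trip is exact
print(lossy == samples)     # the lossy version differs from the original
print(max(abs(a - b) for a, b in zip(samples, lossy)))  # error per sample is at most 15
```

The lossy version is "close enough" in the sense that no sample is off by more than 15, which is the kind of trade-off the comic places to the left of the dotted line.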

This comic shows a chart in the form of a line along which data quality increases from very lossy to most lossless. At the extremes, it runs from having so little information as to be effectively meaningless to including significant extra information (eventually making the original an unnecessary distraction). Some of this extra information mitigates the risk of another sense of 'loss' in data: digital data are transferred in bits, and data loss is the process by which some of these bits are lost or altered during transport. However, the highest quality, "better data", uses a different sense of the term "quality", referring more to the general excellence of the data than to how accurately it represents the original.

The title text uses your cat as an example of this range of losses (or, in the case of the later reaches of the graph, gains) in the information. This is possibly a reference to Norbert Wiener's quote, "The best material model of a cat is another, or preferably the same, cat." The most lossy is an exclamation about how cute your cat is, which is ephemeral and obviously carries very little significance in terms of actually providing specific, transferable information about your cat. The example then progresses to your cat's chip ID: presumably your cat has been microchipped, and either the last four digits (commonly used as an identifier for sensitive information without revealing the full number) or the entire chip ID provides a still-uninformative yet slightly improved way of identifying your cat. A drawing of your cat and a photo of your cat would portray the cat reasonably well, while a clone of your cat and (of course) your actual cat would be the best way of gaining information about your cat. However, as in the actual comic, the final, most lossless (in this case, with the most gain) form of data transfer has nothing to do with your cat, but is simply Randall's better cat. Randall apparently presents this as the pinnacle of cat data.

Details

Item (title text): Explanation
Someone who once saw the data describing it at a party (title text: exclamation about how cute your cat is): This refers to how unreliable and inaccurate it is to receive information verbally and second-hand, as humans are naturally poor at maintaining accuracy when passing on information they have received. This is the basic premise behind the Telephone Game. People instinctively summarize what they hear or read in the way they understand it, often in their own words rather than the literal original.
Bloom filter (title text: last 4 digits of your cat's chip ID): A Bloom filter is a probabilistic data structure that can efficiently say whether an element is probably part of a set, while it can say "element is not in set" with 100% accuracy. If a Bloom filter were used to represent the contents of a book, the book could perhaps be reconstructed by guessing candidates and checking them against the filter, but only in a highly inefficient and potentially inaccurate way. A Bloom filter is like the last four digits of the cat's chip ID in that, while you can know for sure that a cat isn't yours if its last four digits don't match, you can't know for sure that it is yours if they do.
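The "definitely not / probably yes" behaviour of a Bloom filter can be illustrated with a minimal sketch. The class name, the sizes (1024 bits, 3 hash positions), the use of salted SHA-256 to derive positions, and the chip IDs are all assumptions chosen for brevity; practical implementations use faster hashes and a packed bit array.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions per item in an m-bit array."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = [False] * m_bits

    def _positions(self, item):
        # Derive k positions by salting SHA-256 with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely never added"; True only means "probably added".
        return all(self.bits[pos] for pos in self._positions(item))

seen = BloomFilter()
for chip_id in ["985112004567890", "985112009876543"]:  # made-up chip IDs
    seen.add(chip_id)

print(seen.might_contain("985112004567890"))  # an added item is always reported
```

Note that the filter never stores the chip IDs themselves, only a handful of set bits, which is why it sits so far towards the lossy end of the chart.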
Hash table (title text: your cat's full chip ID): A hash table allows you to find data very fast. Randall probably means hashing the contents of entire books. Calculating a hash value for an entire book means that there is (most probably) a unique relationship between the book and a hash value, e.g. "58b8893b172d00e9". This exact version of the book will always yield this exact hash value, though it is practically impossible to reconstruct the book's content from the hash value. A hash is a way of checking that a copy is the same as the original, but it is meaningless on its own and has a (small) possibility of being wrong. An average book contains several million bits, yet a SHA-256 hash (from the SHA-2 family) has only 256 bits, so there are theoretically many 'wrong' versions (mostly, but not necessarily, nonsensical) that would produce the same hash and so look correct.
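As a rough illustration of hashing a whole book, the sketch below uses Python's standard `hashlib` with SHA-256. The `book` text is a stand-in, not a real book, and the choice of SHA-256 is an assumption; the comic does not name a hash function.

```python
import hashlib

# Stand-in "book": a repeated sentence, tens of thousands of characters long.
book = "It was the best of times, it was the worst of times. " * 1000
altered = book.replace("best", "Best", 1)  # change a single character

digest = hashlib.sha256(book.encode("utf-8")).hexdigest()
digest_altered = hashlib.sha256(altered.encode("utf-8")).hexdigest()

# However long the input is, the digest is always 64 hex characters (256 bits).
print(len(book), "characters ->", len(digest) * 4, "bits of hash")
print(digest == digest_altered)  # a one-character edit changes the digest
```

The digest is enough to check that a copy matches, but there is no way to run it backwards and recover the text.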
JPG, GIF, MPEG (title text: a drawing of your cat): Image and video formats that are considered 'lossy'. The JPG (or "JPEG") format and the MPEG group of formats typically use a range of data-compression methods that save space by selectively fudging (thus losing) whatever details of the image (and audio, where appropriate) they can, to make disproportionate gains in compression. They are best used for real-world images (and films), where real-world 'noise' can be replaced by a more compressible version without too much obvious change.

GIF compression is not 'lossy' in the same way, i.e. whatever it is asked to encode can be faithfully decoded, but Randall may consider its limitations (each image can use a palette of only 256 distinct colors, albeit chosen from anywhere across the whole 16,777,216-color "True color" range, plus transparency) to be a form of loss, as conversion from a more sophisticated format (e.g. PNG, below) could lose many of the subtle shades of the original and produce an inferior image. For this reason, GIF is best left to rendering diagrams and other computer-generated imagery with swathes of identical pixels and mostly sharp edges (and to making use of the optional transparent mask), for which JPEG compression will create prominent image artefacts. Alternatively, he may just have included it as a joke/nerd-snipe.
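The loss that happens when converting to a 256-color palette can be sketched as follows. The gradient in `pixels` is made-up data, and the fixed 3-3-2 bit palette in `to_palette_color` is an assumption to keep the example short; real GIF encoders usually choose an adaptive palette for each image.

```python
# Hypothetical image data: 4096 distinct 24-bit RGB colors from a smooth gradient.
pixels = [(r, g, 128) for r in range(0, 256, 4) for g in range(0, 256, 4)]

def to_palette_color(rgb):
    """Snap a color to a fixed 256-entry palette (3 bits red, 3 green, 2 blue)."""
    r, g, b = rgb
    return (r & 0xE0, g & 0xE0, b & 0xC0)

quantized = [to_palette_color(p) for p in pixels]

print(len(set(pixels)), "distinct colors before conversion")
print(len(set(quantized)), "distinct colors after conversion")  # never more than 256
```

Once the image has been reduced to the palette, the encoding itself is lossless; the information was lost in the conversion step, not in the compression.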

PNG, ZIP, TIFF, WAV, raw data (title text: photo of your cat): A series of lossless formats. PNG and TIFF are image formats that are suitable for photos without (necessarily) resorting to reduced accuracy in order to assist compression. WAV is an audio format that also does not arbitrarily sacrifice 'unnecessary' details, unlike the more recently developed MPEG Audio Layer III (MP3), which has become the de facto consumer audio format for many.

ZIP is a general-purpose archive format (typically using the lossless DEFLATE compression algorithm) that can be used to store any other digital files. Anything put within a ZIP file can be decompressed exactly into its original state later on, although any file that is already compressed in some way (such as any of the image formats mentioned in this comic, or other ZIPs) is unlikely to recompress significantly further.

Raw data + parity bits for error detection (title text: clone of your cat): As a decimal analogy for parity bits, take the number 135, whose digits sum to 9. The number could be written as "1359", slightly increasing the amount of data that needs to be sent, but with the advantage that, if the number is altered, the check digit may be able to tell you that an error has occurred (possibly that the check digit itself was the digit that was miswritten). However, a change from "1359" to "1539" could not be detected by this method, since the check digit still matches and the first three digits would be presumed 'correct'.

There are more reliable means to detect errors, such as CRC-32 (still widely used to catch accidental corruption), MD5 (no longer considered secure against deliberate tampering) and the much more modern SHA family. Such values were alluded to in the Hash Table section. Here, however, they are sent alongside the data, slightly increasing the amount of data transmitted or stored (in order to establish its accuracy), rather than sent instead of it, which would vastly decrease the amount of 'necessary' data but leave the virtually impossible task of performing a correct reconstruction.

However it is done, if the check indicates a problem then you can only seek a new copy (of the data, and/or the parity or hash), hoping that the problems encountered can be resolved.
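The digit-sum example and its blind spot can be tried out directly. In this sketch the helper names (`digit_sum_check`, `looks_valid`, `parity_bit`) are hypothetical, the check digit is taken as the last digit of the digit sum, and `zlib.crc32` from the Python standard library stands in for the stronger checks mentioned above.

```python
import zlib

def digit_sum_check(number):
    """Decimal analogue of a parity bit: the last digit of the digit sum."""
    return str(sum(int(d) for d in number) % 10)

def looks_valid(received):
    """Recompute the check digit from the payload and compare it."""
    return digit_sum_check(received[:-1]) == received[-1]

def parity_bit(data):
    """Even parity over all bits: 1 if the number of 1-bits is odd."""
    return sum(bin(b).count("1") for b in data) % 2

sent = "135" + digit_sum_check("135")  # "1359"

print(looks_valid(sent))    # intact message passes
print(looks_valid("1759"))  # one changed digit is detected
print(looks_valid("1539"))  # two swapped digits are NOT detected

# A real parity bit flips whenever a single bit of the data flips.
original = b"cat"
corrupted = bytes([original[0] ^ 0x01]) + original[1:]
print(parity_bit(original) != parity_bit(corrupted))

# CRC-32 distinguishes the swapped digits, at the cost of a 32-bit check value.
print(zlib.crc32(b"135") != zlib.crc32(b"153"))
```

In every case the check can only report that something is wrong; it cannot say what the original value was.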

Raw data + parity bits for error correction (title text: your actual cat): With extra error-checking data, there are ways to immediately restore the original data. One method is to 'overlap' multiple error-detection parities such that any small enough corruption of the data (including of the parity bits themselves) can be corrected by cross-comparison between all the parity bits and the supposed data. One of the first modern methods was Hamming(7,4), invented around 1950, a balanced approach designed to handle the error conditions typically encountered at the time, which has inspired even contemporary electronic methods of maintaining data integrity. Another practical application of error-correction bits is in QR codes, which use Reed–Solomon error correction.
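The idea of overlapping parities can be made concrete with a small sketch of Hamming(7,4). The function names are hypothetical, and the bit layout (parity bits at positions 1, 2 and 4) is the conventional textbook one, assumed here for illustration.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as 7 bits laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(code_bits):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = list(code_bits)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s3  # 1-based position; 0 means no error seen
    if error_pos:
        c[error_pos - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

word = [1, 0, 1, 1]
sent = hamming74_encode(word)

damaged = list(sent)
damaged[4] ^= 1  # flip one bit "in transit"

print(hamming74_decode(damaged) == word)  # the single-bit error is repaired
```

The three parity checks each cover a different, overlapping subset of positions, so the pattern of failed checks spells out the position of the flipped bit. Two or more flipped bits exceed what this code can correct.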
Better data (title text: my better cat): This gives up on the data in question and suggests swapping it for different data entirely. It is no longer about the quality of the transfer of data, but about judging the actual data instead. Philosophically, it could be saying that the data or cat are better in some nebulous way, or simply that it is more accurate to what the data is trying to record and represent; in the title text's case, that Randall's cat more closely represents the essence of "catness."

Transcript

[A line chart is shown with eight unevenly-spaced ticks each one with a label beneath the line. Above the middle of the line there is a dotted vertical line with a word on either side of this divider. Above the chart there is a big caption with an arrow beneath it pointing right.]
Data Quality
Lossy ┊ Lossless
[Labels to the left of the dotted line from left to right:]
Someone who once saw the data describing it at a party
Bloom filter
Hash table
JPEG, GIF, MPEG
[Labels to the right of the dotted line from left to right:]
PNG, ZIP, TIFF, WAV, Raw data
Raw data + parity bits for error detection
Raw data + parity bits for error correction
Better data



Discussion

Hash tables aren't lossy, maybe Randall means hash functions? Barmar (talk) 17:06, 17 February 2023 (UTC)

I was thinking more of (a subset of) a Rainbow table than an associative array... Although such things tend not to preserve/respect item order (in reading, writing and altering in general), which is potentially information-lossy. 172.69.79.185 18:50, 17 February 2023 (UTC)
Hash tables have an ultra-low collision rate, as compared to the transforms used in packetwise error-correction... Since the comic is primarily focused on contrasting media fidelity with direct alteration of the content, ciphers seem a less direct association than content distribution networks? Given the context presented, my immediate association was the use of both piece & whole-pack hash verification, which has a collision rate so low terms like "number of particles in the universe" start entering the conversation. Upon further consideration, I wonder if Randall is referring to plain old CRC32 hash checking? Or the SHA hashes commonly used to verify disc downloads? (If it passes SHA *and* torrent content checking, I'd say you've probably got better chances of 1:1 integrity, than any original medium has of retaining it?)
ProphetZarquon (talk) 22:51, 17 February 2023 (UTC)
Maybe it was to be about cuckoo filters, which are probabilistic data structure alternative to classic Bloom filter, which are based on space-efficient variants of cuckoo hashing? --JakubNarebski (talk) 14:05, 20 February 2023 (UTC)
Hash tables don't have to store the original data at all, technically; they are commonly done as hash table->KEY:DATA or hash table->KEY:Pointer to data (or suchlike), but hash table->present is a valid hashing scheme, which results in a likely verification that you have the right data (but not guaranteed, because of collisions) but no way of reconstructing the data itself. Mneme (talk) 02:25, 21 February 2023 (UTC)
He’s casually referring to the hash conflict situation in common implementations of hash tables: the table of hashes, not the whole structure. You have O(n) lookup speed proportional to the impact of uniqueness lost in the hash lookup. The point is that this is the same way that Bloom filters (which also usually need a source of truth to be useful) are used. The two concepts perform the same function but with different degrees of lossiness, different widenesses of matching. 162.158.62.140 16:40, 24 February 2023 (UTC) EDIT: it also leaves it ambiguous that it could mean a table of hash function outputs as you suggest, where hashes have often been thought of as uniquely identifying data that is not recoverable (this does require a sufficiently constrained situation but is often used), where Bloom filters are thought of as ambiguously referring to multiple items. I can imagine it being more clear to leave out the word table. 172.70.114.78 16:48, 24 February 2023 (UTC)

GIF's aren't lossy either, though often other formats can't be converted to GIF without discarding information. Bemasher (talk) 18:27, 17 February 2023 (UTC)

I think that's the point. 172.68.50.203 20:12, 17 February 2023 (UTC)
GIFs are lossy in the very act of creating them: the actual colors of the real object have to be smashed down into (I think it’s) 256 different colors, resulting in an image that even human perception recognizes as crappy. Even the so-called ‘lossless’ formats such as PNG are lossy in the act of creation, just not as drastically as GIFs. A truly ‘lossless’ format would have to specify the exact intensity of every wavelength of electromagnetic radiation emanating from every atom of the original object. Good luck with that. 172.71.151.99 01:00, 18 February 2023 (UTC)
GIFs can only have 256 colors per *frame*, but can have many frames, so 16,777,216 (256^3) colors total should be possible. SDSpivey (talk) 01:39, 7 March 2023 (UTC)
Temporal dithering? Don't know if that's the term for it, but it's the one I'd use to describe it.
And I remember trying that on a BBC Microcomputer, messing with fast direct video-memory copying and also the interrupts to get the high-res but monochrome MODE 0 (1-bitplane, but with some choice of foreground and background colours that are used that can be changed fairly rapidly, as well as in horizontal bands) to create a disconcerting effect (I wouldn't subject an epileptic to it!) that could still approximate at least a 3-bit colour-mode. Half the colour-res of MODE 2, twice that of MODE 1, but vertical dot-res twice that of the latter and quadruple that of the former. IIRC. 172.70.85.225 02:01, 7 March 2023 (UTC)
It's subjective whether formats (even .gif) can be recognised as 'crappy'. The display format may further tune down everything so that something defined with 65536 colours is more like 256, or it could work well with any given stippling/halftoning/dithering to produce something more like the better original than the file data strictly allows (even from 6bits-per-pixel, or 3) when viewed at sufficient remove. And a .gif of a block-coloured diagram is notably better than a typical .jpg of one, despite the technically superior palette the latter has. (Nobody says that an image has to be from a real-life subject, with all kinds of missing data, such as photons that happen to hit the gap between CCD pixels but might be considered important and might well have been captured with the Mk 1 Eyeball and significantly 'noticed' by the nerves and ultimately the respective processing clusters of the brain behind it... Which has a complete set of 'analogue lossiness' to it, anyway.) 172.71.242.203 16:37, 18 February 2023 (UTC)
I encoded the records you wanted transferred to your department's systems into a standard GIF format file. Would you prefer an MJPEG video? 172.70.114.79 16:51, 24 February 2023 (UTC) EDIT: You're right, though. Maybe Randall has experience with color loss using GIF. In the 90's GIF was a compressed photography format, smaller than BMP. 16:54, 24 February 2023 (UTC)

Someone needs to add a table describing all the formats in the chart. Barmar (talk) 19:29, 17 February 2023 (UTC)

Yep. It needs a description of each point on the graph. I'm on my phone though... and feeling lazy after shoveling snow.
ProphetZarquon (talk) 22:54, 17 February 2023 (UTC)
I'm tempted, but it would require learning how to MAKE a table, and my ideal table would be 5 columns, TOO WIDE!, LOL! Table label, what scale (data quality or item quality), a description (the main thing needed), the cat version from the Title Text, and finally how the cat example applies/parallels the comic version. I could lose the "what scale" as only one isn't data quality, and I guess I could see two tables, Comic and Title/Cat (adding to cat also the Table Label column).NiceGuy1 (talk) 06:38, 19 February 2023 (UTC)
Tables are actually quite easy to do (if you don't intend to do much complex stuff), but also very easy to slightly mess up (temporarily - Preview is your friend, especially if you need to rowspan/colspan at all). For this purpose, nothing fancy. Header row, other rows, nothing particularly special in alignment, sorting, colour (foreground and/or background), etc. It'll be fairly intelligently fitted to the browser window, according to the contents.
However, here (when you might have large amounts of narrative in one column), perhaps just ";"-prefix a mini-header (can include "(in Title text)" or other shorthand details) and then have ":"-prefixed 'definition' prose that rambles on about each item in freehand text. I would suggest that's as complicated as you need it, no real need for tabling at all. (But, without wanting to show you how to use a hammer, then making every problem now look like a nail to you, I think you could handle learning the basic table-markup/learning where to get the more complex stuff. So there you are.) 172.70.91.197 16:54, 19 February 2023 (UTC)
Oh, I've done tables in HTML many times, I'm perfectly comfortable tackling it. It's just that I'd have to take the time to look up the wiki syntax. :) Additional effort, you see. And now someone has already done it. :) NiceGuy1 (talk) 15:41, 21 February 2023 (UTC)

It seems there are two definitions of data quality that Randall is juxtaposing for comic effect: in one, quality data is data that represents the original phenomenon without error or degradation. In the other, he's applying the concept of quality to the phenomenon itself – data is better if it describes a better phenomenon. My cat is better than your cat, therefore data about my cat is better than data about your cat. I'd like to see this concept in the explanation of the page but don't know how to add into the flow of the current text.K95 (talk) 19:33, 17 February 2023 (UTC)

I already put that in earlier. See the second sentence of the second paragraph, I called it "general excellence". Barmar (talk) 21:45, 17 February 2023 (UTC)

"Data are transferred in bits"...Hear, hear. I'm over 60, I still remember of stuff that is called "analog" ;-) -- 172.71.160.37 (talk) 20:07, 17 February 2023 (UTC) (please sign your comments with ~~~~)

Note, however, that we have been transferring data digitally for over four thousand years. That's how long it has been technically possible to make a lossless copy of a written story. -- Hkmaly (talk) 22:19, 17 February 2023 (UTC)
That's only if you're lucky enough to be still reading it in the original Klingon language, etc... 172.69.79.184 22:53, 17 February 2023 (UTC)
"It is a Klingon name!" 😾
Transcription definitely suffers from a Darmok & Jalad type contextual dependency.
ProphetZarquon (talk) 22:59, 17 February 2023 (UTC)

I think that "Better data" is a reference to gainful compression, and that "my better cat" doesn't specifically refer to the author but to the lyrical subject (as in poems). 172.68.50.203 20:12, 17 February 2023 (UTC)

TIFF can contain a JPEG, which makes it technically a lossy format. 172.69.109.33 23:26, 19 February 2023 (UTC)

And an actual JPEG may be lossless. (I still remember JPEG2000 being 'a thing', amongst the other situations mentioned there, but that wasn't even what I was thinking of when I started this reply!) Yet, I think we're talking broad sweeps here. Not strict accuracy. There's Randall's trolling of us with GIF as 'lossy', frexample... 172.69.79.159

The opening sentence of the explanation, about data loss in transit, seems a bit irrelevant to the comic, which is only concerned with lossiness in information due to format. 172.70.91.197 10:40, 20 February 2023 (UTC)

Very relevant to the parity ones. (Leads me to believe it's a scale of "amount of provided data to represent original data". You send less than you really ought to, the more left you go, you send more than you should technically need to as you go to the right. Checksums add a little bit extra, once you get to them, and correcting checksums (hamming bits, etc) are significantly extra overhead. The whole 'better data' is basically "send a similar amount of newer information, or even more, on top of the original".) 162.158.34.71 12:55, 20 February 2023 (UTC)
But that's about adding information to the file (which happens to be mitigation against the potential future loss of data) - not directly about the loss of data itself. 172.71.178.65 11:23, 22 February 2023 (UTC)
Data can be lost (deliberately or otherwise) in the process of transferring data. That's where parity may be useful, and that's when boiling things down into hashes alone probably is not... But you may actually have good reasons/circumstances to do (or not do) either. 172.70.85.224 19:20, 22 February 2023 (UTC)

Since when is CRC-32 obsolete? 162.158.238.6 08:24, 22 February 2023 (UTC)